Hello Elissa Castellanos (NEXTANT LLC),
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that your GPT-5-mini agent in Azure AI Foundry Agents times out in HIGH reasoning mode, while MED gives significantly lower quality and slower performance.
Let me start with your questions first:
- Are there known limitations or timeout constraints when using GPT-5-mini with HIGH reasoning effort in Azure AI Foundry Agents?
  There is official evidence that reasoning models spend more time processing, and that higher reasoning effort increases processing time and hidden reasoning-token usage. - https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/reasoning, https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/runtime-components For tool runs specifically, Microsoft documents a 10-minute run expiration for tool-output submission. That rule covers tool-output submission only, so it should not be presented as the confirmed cause of your timeout. - https://learn.microsoft.com/en-us/azure/foundry/agents/how-to/tools/function-calling
- Is HIGH reasoning expected to have substantially longer latency in agent orchestrations?
  Yes. Higher reasoning effort makes reasoning models spend longer processing the request and generally increases hidden reasoning-token usage. - https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/reasoning
- Are there recommended configurations, architectural patterns, or timeout settings that would allow HIGH reasoning to complete successfully?
  Yes. Use background responses with polling/resumption, optionally combined with streaming with resumption. This is the recommended production pattern for long-running reasoning tasks and for resilience against client timeouts. - https://learn.microsoft.com/en-us/agent-framework/agents/background-responses, https://devblogs.microsoft.com/agent-framework/handling-long-running-operations-with-background-responses/
- Could this behavior be region-related or tied to current capacity/performance limitations?
  Yes, but it must be proven. Quotas and limits are set per region, subscription, model, and deployment type, and Standard/Global/Data Zone deployments can experience latency variability under capacity pressure. Use Azure Monitor latency/token metrics and 429/capacity indicators to validate. - https://learn.microsoft.com/en-us/azure/foundry/openai/quotas-limits, https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/quota
- Is there guidance on when HIGH reasoning should or should not be used in multi-step agent workflows?
  Use HIGH only on the smallest, highest-value reasoning step where quality materially depends on it. For latency-sensitive or production workflows, run that HIGH step asynchronously in the background rather than synchronously. Reasoning models are best for complex problem-solving, document comparison, coding, and workflow-management tasks, but latency and cost must be budgeted. - https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/reasoning, https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/latency
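As a rough illustration of that last point, the sketch below assigns a reasoning effort to each workflow step and marks only the HIGH step for background execution. The step names, the EFFORT_BY_STEP table, and the build_request helper are hypothetical. The reasoning effort and background parameters follow the Responses API, but please verify them against the SDK and API version you are using.

```python
# Hypothetical mapping of workflow steps to reasoning effort.
# Only the step whose quality really depends on deep reasoning gets "high".
EFFORT_BY_STEP = {
    "classify": "low",
    "extract": "medium",
    "final_analysis": "high",
}


def build_request(step, input_text, model="gpt-5-mini"):
    """Build keyword arguments for a Responses API call for one workflow step.

    `model` is assumed to be the name of your Azure deployment.
    Unknown steps fall back to medium effort.
    """
    effort = EFFORT_BY_STEP.get(step, "medium")
    return {
        "model": model,
        "input": input_text,
        "reasoning": {"effort": effort},
        # Run the slow HIGH step in the background; keep the rest synchronous.
        "background": effort == "high",
    }
```

The resulting dictionary could then be passed to the client as client.responses.create(**build_request("final_analysis", text)), so that only the HIGH step needs the polling pattern.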
So, regarding your clarification, the timeout does not appear to come from multi-agent orchestration, external tools, or payload size. Since it also occurs in the Foundry portal with the agent isolated, the failing point is most likely the GPT-5-mini HIGH-reasoning generation step itself. - https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/runtime-components, https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/reasoning
HIGH reasoning can take much longer because reasoning models spend extra processing time and use hidden reasoning tokens before producing the final answer. The production-safe fix is to run this step asynchronously using background responses with polling/resumption, rather than waiting synchronously for completion.
Test the same agent, input, model, and deployment using the Azure OpenAI Responses/Foundry Agent background pattern, for example: response = client.responses.create(..., background=True), then poll with client.responses.retrieve(response.id). If it completes there, use that pattern as the production architecture. - https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/responses, https://devblogs.microsoft.com/agent-framework/handling-long-running-operations-with-background-responses/
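A minimal sketch of the polling half of that test is shown here. It assumes client is an OpenAI Python SDK client already configured for your Azure resource and that the response was created with background=True. The helper name wait_for_background_response, the poll interval, and the maximum wait are hypothetical choices, and the set of terminal statuses is taken from the Responses API documentation, so please confirm it for your API version.

```python
import time

# Statuses after which the response will not change any more.
# Assumption based on the Responses API docs; verify for your API version.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "incomplete"}


def wait_for_background_response(client, response_id, poll_interval=5.0,
                                 max_wait=1800.0, sleep=time.sleep,
                                 clock=time.monotonic):
    """Poll client.responses.retrieve until a terminal status is reached.

    Returns the final response object, whatever its terminal status is.
    Raises TimeoutError if max_wait seconds pass without a terminal status.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + max_wait
    while True:
        response = client.responses.retrieve(response_id)
        if response.status in TERMINAL_STATUSES:
            return response
        if clock() >= deadline:
            raise TimeoutError(
                f"Response {response_id} still '{response.status}' "
                f"after {max_wait} seconds"
            )
        sleep(poll_interval)
```

Because the request keeps running on the service side, the client only holds short retrieve calls open, which is what makes this pattern tolerant of client and gateway timeouts. The caller should still check the returned status, since failed and incomplete are also terminal.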
If it still fails at about 10 minutes, capture the response/request ID, timestamp, model version, deployment type, Azure Monitor latency/token metrics, and error details, then continue with Microsoft Support as a service or deployment-capacity issue. Also review quota/capacity, consider Provisioned Throughput for predictable latency, and monitor Foundry model/version updates. - https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/latency, https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/quota, https://learn.microsoft.com/en-us/azure/foundry/foundry-models/concepts/model-versions
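If the case does go to Microsoft Support, it can help to record those details at the moment of failure. The helper below is only a sketch: build_support_record and its field names are hypothetical, and it reads id, status, model, and error from the response object defensively, because the attributes available may differ by SDK version. Azure Monitor latency/token metrics are not covered here and would need to be collected separately.

```python
import json
from datetime import datetime, timezone


def build_support_record(response=None, error=None, deployment_type="unknown"):
    """Collect basic diagnostics for a support case as a JSON string.

    `response` is the last response object you have (may be None).
    `error` is the client-side exception, if one was raised.
    `deployment_type` must be supplied by you; it is not read from the response.
    """
    service_error = getattr(response, "error", None)
    record = {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "response_id": getattr(response, "id", None),
        "status": getattr(response, "status", None),
        "model": getattr(response, "model", None),
        "deployment_type": deployment_type,
        "service_error": str(service_error) if service_error else None,
        "client_error": repr(error) if error is not None else None,
    }
    return json.dumps(record, indent=2)
```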
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting this response and accepting it as an answer if it is helpful.