Hello Amey Sunu,
Welcome to Microsoft Q&A, and thank you for the detailed information.
An 8-minute response time from your gpt-5.1-chat deployment in Australia East is not typical under normal operating conditions, especially since your other model deployments are responding normally. Based on what you’ve described, this appears to be intermittent and region-specific, which helps narrow down the possible causes.
Regarding network latency: round-trip latency to Australia East is typically very low (single-digit milliseconds from nearby regions).
An 8-minute delay would not be caused by standard network latency alone. Since your request eventually completes successfully after increasing the HttpClient timeout, this suggests the request is being accepted and processed, rather than failing due to connectivity issues. That points more toward backend processing delay rather than a pure networking problem.
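Since the request completes once the client-side timeout is raised, it can help to keep a generous timeout while you investigate. Here is a minimal sketch using the Python openai SDK (the same idea applies to HttpClient.Timeout in .NET); the endpoint and deployment name are placeholders:

```python
import os

from openai import AzureOpenAI

# Raise the client-side timeout so long-running generations can complete
# while you diagnose. Endpoint and deployment name are placeholders.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
    timeout=600.0,  # seconds; roughly the equivalent of HttpClient.Timeout in .NET
    max_retries=0,  # disable SDK retries so one request maps to one timing
)

response = client.chat.completions.create(
    model="gpt-5.1-chat",  # your deployment name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```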
There are several factors that can influence response time:
- Model type and workload characteristics
- Prompt size (input token count)
- max_tokens or total output tokens requested
- Whether streaming is enabled or disabled
- Overall system load or regional capacity pressure
Large prompts or high max_tokens settings can significantly increase generation time. If streaming is disabled and the model must generate a large completion before returning anything, the perceived latency can be much higher.
Since the issue is intermittent and specific to Australia East, this may indicate temporary regional capacity pressure or soft throttling. Unlike hard throttling (which returns HTTP 429), soft throttling can queue requests, resulting in long response times rather than immediate rejection. The fact that other regions or deployments behave normally further suggests this could be localized load behavior.
To further diagnose, I recommend the following.

Enable detailed logging for each call (a logging sketch follows this list):
- Total request duration
- Input/output token usage
- Correlation ID (x-request-id)
- Timestamp in UTC
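As a rough illustration of that logging, here is a sketch using the Python SDK's raw-response accessor, reusing the client from the earlier sketch; the printed field names are just examples:

```python
import time
from datetime import datetime, timezone

# Capture duration, token usage, x-request-id, and a UTC timestamp per call.
started_utc = datetime.now(timezone.utc).isoformat()
t0 = time.monotonic()

raw = client.chat.completions.with_raw_response.create(
    model="gpt-5.1-chat",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=64,
)
completion = raw.parse()  # the usual ChatCompletion object

print({
    "timestamp_utc": started_utc,
    "duration_s": round(time.monotonic() - t0, 2),
    "request_id": raw.headers.get("x-request-id"),
    "prompt_tokens": completion.usage.prompt_tokens,
    "completion_tokens": completion.usage.completion_tokens,
})
```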
Compare (a comparison sketch follows this list):
- The same request payload sent to another region
- The same deployment with a reduced prompt size or a lower max_tokens value
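A minimal way to run that comparison, assuming both resources share the deployment name (endpoints are placeholders, and note that each resource normally has its own API key):

```python
import os
import time

from openai import AzureOpenAI

# Identical payload sent to two regions so the durations are comparable.
payload = dict(
    model="gpt-5.1-chat",
    messages=[{"role": "user", "content": "Same prompt for both regions"}],
    max_tokens=256,
)

for endpoint in ("https://<aue-resource>.openai.azure.com",
                 "https://<other-region-resource>.openai.azure.com"):
    client = AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=os.environ["AZURE_OPENAI_API_KEY"],  # use the matching key per resource
        api_version="2024-06-01",
        timeout=600.0,
    )
    t0 = time.monotonic()
    client.chat.completions.create(**payload)
    print(endpoint, f"{time.monotonic() - t0:.1f}s")
```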
Test with streaming enabled to see whether tokens begin returning quickly but full completion takes longer. If streaming starts quickly, generation time is the main contributor.
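For example (again with the Python SDK, reusing the client from the first sketch), if the first token arrives within a few seconds but the full completion takes minutes, generation time dominates:

```python
import time

# Measure time-to-first-token vs. total time with streaming enabled.
t0 = time.monotonic()
first_token_at = None

stream = client.chat.completions.create(
    model="gpt-5.1-chat",
    messages=[{"role": "user", "content": "Summarise the benefits of streaming."}],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Azure can emit an initial chunk with no choices (content-filter metadata).
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.monotonic() - t0

print(f"first token after {first_token_at}s, total {time.monotonic() - t0:.1f}s")
```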
Monitor Azure metrics for your Azure OpenAI resource (a metrics-query sketch follows this list):
- Server latency
- Throttled requests
- Requests per minute
- Retry counts
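If you prefer to pull those metrics programmatically rather than from the portal, a sketch with the azure-monitor-query package could look like the following; the resource ID and metric names are placeholders you would need to verify in the Metrics blade:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Placeholder resource ID for the Azure OpenAI (Cognitive Services) account.
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<resource-name>"
)

metrics_client = MetricsQueryClient(DefaultAzureCredential())
result = metrics_client.query_resource(
    resource_id,
    metric_names=["<latency-metric>", "<throttled-requests-metric>"],  # verify exact names in the portal
    timespan=timedelta(days=3),
    granularity=timedelta(hours=1),
)
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.average)
```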
It would also be helpful to monitor response times over several days to see if there’s a pattern related to time-of-day spikes or usage peaks. That can help determine whether this is capacity-related behavior during high-demand windows.
If this continues, you may want to consider production mitigation strategies (a combined retry/failover sketch follows this list) such as:
- Deploying a secondary region and implementing failover
- Reducing max_tokens
- Enabling streaming responses
- Implementing retry logic with exponential backoff
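As a sketch of the first and last items combined, assuming a hypothetical secondary resource in another region (endpoint names and the failover policy are illustrative, not prescriptive):

```python
import os
import random
import time

from openai import APIConnectionError, APITimeoutError, AzureOpenAI, RateLimitError

# Hypothetical primary and secondary regional resources -- replace with your own.
clients = [
    AzureOpenAI(azure_endpoint="https://<primary>.openai.azure.com",
                api_key=os.environ["AZURE_OPENAI_API_KEY"],
                api_version="2024-06-01", timeout=120.0, max_retries=0),
    AzureOpenAI(azure_endpoint="https://<secondary>.openai.azure.com",
                api_key=os.environ["AZURE_OPENAI_SECONDARY_KEY"],
                api_version="2024-06-01", timeout=120.0, max_retries=0),
]

def chat_with_retry(messages, deployment="gpt-5.1-chat", max_attempts=4):
    """Retry with exponential backoff; fail over to the secondary region
    for the final attempts if the primary keeps timing out or throttling."""
    for attempt in range(max_attempts):
        client = clients[0] if attempt < max_attempts - 2 else clients[1]
        try:
            return client.chat.completions.create(
                model=deployment, messages=messages, max_tokens=512)
        except (APITimeoutError, APIConnectionError, RateLimitError):
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter, capped at 30 seconds.
            time.sleep(min(2 ** attempt + random.random(), 30))
```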
For more details, please refer to the Azure OpenAI documentation on performance and latency.
I hope this helps, do let me know if you have any further queries.
Thank you!