
Response time from Australia East Azure OpenAI seems unusually long

Amey Sunu 0 Reputation points
2026-02-05T11:07:18.4233333+00:00

Hi there, I'm seeing unusual timeouts from our HttpClient when making requests to the gpt-5.1-chat model deployment in Australia East, and I was wondering if there is any ongoing issue with the response time from this deployment in particular. I lifted the timeout on our HttpClient to see what the actual response time was, and it took about 8 minutes to come back with a response. The other model deployment seems to be okay; the Australia East one keeps stalling in a frequent but intermittent pattern.
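
For reference, this is roughly how the timeout was lifted (a minimal sketch; the endpoint and key are placeholders, not our real values):

```csharp
using System;
using System.Net.Http;

// Minimal sketch of the client setup; endpoint and key are placeholders.
var client = new HttpClient
{
    BaseAddress = new Uri("https://<resource-name>.openai.azure.com/"),
    // HttpClient's default timeout is 100 seconds; lifted here only to
    // observe how long the deployment actually takes to respond.
    Timeout = TimeSpan.FromMinutes(10)
};
client.DefaultRequestHeaders.Add("api-key", "<api-key>");
```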

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


2 answers

  1. SRILAKSHMI C 17,865 Reputation points Microsoft External Staff Moderator
    2026-02-18T11:25:38.7366667+00:00

    Hello Amey Sunu,

    Welcome to Microsoft Q&A, and thank you for the detailed information.

    An 8-minute response time from your gpt-5.1-chat deployment in Australia East is not typical under normal operating conditions, especially since your other model deployments are responding normally. Based on what you’ve described, this appears to be intermittent and region-specific, which helps narrow down the possible causes.

    Regarding network latency: the normal round-trip latency to Australia East is typically very low (single-digit milliseconds from nearby regions).

    An 8-minute delay would not be caused by standard network latency alone. Since your request eventually completes successfully after increasing the HttpClient timeout, this suggests the request is being accepted and processed, rather than failing due to connectivity issues. That points more toward backend processing delay rather than a pure networking problem.

    There are several factors that can influence response time:

    • Model type and workload characteristics
    • Prompt size (input token count)
    • max_tokens or total output tokens requested
    • Whether streaming is enabled or disabled
    • Overall system load or regional capacity pressure

    Large prompts or high max_tokens settings can significantly increase generation time. If streaming is disabled and the model must generate a large completion before returning anything, the perceived latency can be much higher.

    Since the issue is intermittent and specific to Australia East, this may indicate temporary regional capacity pressure or soft throttling. Unlike hard throttling (which returns HTTP 429), soft throttling can queue requests, resulting in long response times rather than immediate rejection. The fact that other regions or deployments behave normally further suggests this could be localized load behavior.

    To further diagnose, I recommend:

    Enable detailed logging for the following (see the sketch after this list):

    • Total request duration
    • Input/output token usage
    • Correlation ID (x-request-id)
    • Timestamp in UTC
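
    A minimal sketch of such a probe, assuming .NET against the raw REST endpoint (the resource name, deployment name, key, and api-version below are placeholders to replace with your own values):

    ```csharp
    using System;
    using System.Diagnostics;
    using System.Net.Http;
    using System.Text;
    using System.Text.Json;
    using System.Threading.Tasks;

    class LatencyProbe
    {
        // Sends one chat request and logs duration, token usage,
        // correlation id, and a UTC timestamp for later comparison.
        static async Task Main()
        {
            var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };
            client.DefaultRequestHeaders.Add("api-key", "<api-key>"); // placeholder

            var body = JsonSerializer.Serialize(new
            {
                messages = new[] { new { role = "user", content = "ping" } },
                max_tokens = 50
            });

            var startedUtc = DateTime.UtcNow;
            var sw = Stopwatch.StartNew();
            // Placeholder resource, deployment, and api-version.
            var response = await client.PostAsync(
                "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-06-01",
                new StringContent(body, Encoding.UTF8, "application/json"));
            sw.Stop();
            response.EnsureSuccessStatusCode();

            using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
            var usage = doc.RootElement.GetProperty("usage");
            response.Headers.TryGetValues("x-request-id", out var ids);

            Console.WriteLine($"utc={startedUtc:O} duration_ms={sw.ElapsedMilliseconds} " +
                $"prompt_tokens={usage.GetProperty("prompt_tokens")} " +
                $"completion_tokens={usage.GetProperty("completion_tokens")} " +
                $"request_id={string.Join(",", ids ?? Array.Empty<string>())}");
        }
    }
    ```

    Pointing the same probe at a second region or a smaller payload gives you the comparisons below.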

    Compare:

    • The same request payload sent to another region
    • The same deployment under reduced prompt size or lower max_tokens

    Test with streaming enabled to see whether tokens begin returning quickly but full completion takes longer. If streaming starts quickly, generation time is the main contributor.
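
    A minimal sketch of that streaming test, under the same placeholder assumptions as the probe above; it records time-to-first-token separately from total completion time:

    ```csharp
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Net.Http;
    using System.Text;
    using System.Threading.Tasks;

    class StreamingProbe
    {
        static async Task Main()
        {
            var client = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };
            client.DefaultRequestHeaders.Add("api-key", "<api-key>"); // placeholder

            // "stream": true makes the service return tokens as server-sent events.
            var body = """{"messages":[{"role":"user","content":"Write a short paragraph about latency."}],"stream":true,"max_tokens":200}""";

            var request = new HttpRequestMessage(HttpMethod.Post,
                "https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-06-01")
            {
                Content = new StringContent(body, Encoding.UTF8, "application/json")
            };

            var sw = Stopwatch.StartNew();
            // ResponseHeadersRead lets us consume the body as it arrives.
            using var response = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);
            using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());

            long? firstTokenMs = null;
            string? line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                if (!line.StartsWith("data: ") || line == "data: [DONE]") continue;
                firstTokenMs ??= sw.ElapsedMilliseconds; // first chunk arrived
            }

            // A small first_token_ms with a large total_ms means generation
            // time, not queueing, is the main contributor.
            Console.WriteLine($"first_token_ms={firstTokenMs} total_ms={sw.ElapsedMilliseconds}");
        }
    }
    ```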

    Monitor Azure metrics for your Azure OpenAI resource:

    • Server latency
    • Throttled requests
    • Requests per minute
    • Retry counts

    It would also be helpful to monitor response times over several days to see if there’s a pattern related to time-of-day spikes or usage peaks. That can help determine whether this is capacity-related behavior during high-demand windows.

    If this continues, you may want to consider production mitigation strategies such as:

    • Deploying a secondary region and implementing failover
    • Reducing max_tokens
    • Enabling streaming responses
    • Implementing retry logic with exponential backoff (a sketch follows this list)
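
    For the retry piece, a minimal hand-rolled sketch (in production a resilience library such as Polly is a common choice; the delays and attempt count here are illustrative). A request message cannot be reused across attempts, hence the factory delegate:

    ```csharp
    using System;
    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;

    static class RetryHelper
    {
        // Retries transient failures (429 / 5xx / client timeouts) with
        // exponential backoff, honouring Retry-After when the service sends it.
        public static async Task<HttpResponseMessage> SendWithRetryAsync(
            HttpClient client, Func<HttpRequestMessage> requestFactory, int maxAttempts = 4)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    var response = await client.SendAsync(requestFactory());
                    if (response.StatusCode != HttpStatusCode.TooManyRequests &&
                        (int)response.StatusCode < 500)
                        return response;                 // success or non-transient error
                    if (attempt == maxAttempts)
                        return response;                 // give up, surface last response
                    var delay = response.Headers.RetryAfter?.Delta
                                ?? TimeSpan.FromSeconds(Math.Pow(2, attempt));
                    response.Dispose();
                    await Task.Delay(delay);
                }
                catch (TaskCanceledException) when (attempt < maxAttempts)
                {
                    // HttpClient timeout; back off and retry.
                    await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
                }
            }
        }
    }
    ```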

    Please refer to this for more details.

    I hope this helps; do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.

  2. Alex Burlachenko 20,585 Reputation points MVP Volunteer Moderator
    2026-02-12T10:32:38.5966667+00:00

    hi,

    8 mins for a chat model response is definitely not something I would consider normal behaviour, even with fairly large prompts and generous max token settings. Before assuming there is a regional outage in Australia East, I would start by checking the Azure status page together with the resource health blade of your Azure OpenAI resource in that region, because sometimes there are capacity constraints or partial service degradations that do not immediately show up as full incidents but can still affect latency in a noticeable way, especially under load.

    I would carefully compare the exact request payload between the working deployment and the slow one, including prompt size, max tokens, temperature, and any system messages. If the total token generation is significantly higher in the Australia East deployment, the model will naturally take longer to respond, particularly if you are not using streaming and are instead waiting for the full completion, which can make the delay look like a timeout when in reality it is just long generation time.

    Also worth checking is whether you are hitting rate limits or being silently throttled, which may not always surface as a clear 429 error in your client logs but can still introduce queueing delay inside the service. You can verify this by reviewing metrics such as request latency, tokens per minute, and throttled requests in the Azure portal under your OpenAI resource.

    If the configuration is identical across regions and only Australia East consistently shows intermittent 8 minute responses while other regions return quickly, then I would strongly suggest opening a support ticket and including correlation ids from several slow requests so Microsoft can trace backend processing time on that specific deployment. Behaviour like this usually indicates heavy token generation, internal queueing, or regional capacity pressure rather than a simple HTTP client timeout issue.
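
    to grab those correlation ids, something small like this after each slow call is usually enough (a rough sketch; x-request-id and apim-request-id are the headers Azure OpenAI responses commonly carry, so check which one yours actually return):

    ```csharp
    using System;
    using System.Linq;
    using System.Net.Http;

    static class Diagnostics
    {
        // Rough sketch: log correlation id headers with a UTC timestamp
        // so several slow requests can be pasted straight into a ticket.
        public static void LogCorrelationIds(HttpResponseMessage response)
        {
            foreach (var name in new[] { "x-request-id", "apim-request-id" })
            {
                if (response.Headers.TryGetValues(name, out var values))
                    Console.WriteLine($"{DateTime.UtcNow:O} {name}={values.First()}");
            }
        }
    }
    ```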

    rgds,

    Alex

