
gpt-5.5 - terrible performance in Azure East US 2 region

GS 430 Reputation points
2026-05-01T16:05:15.5966667+00:00

Hello,

We deployed the gpt-5.5 model to Microsoft Foundry and are getting very slow responses compared to gpt-5.4 with the same setup.
Is anyone getting fast performance out of it, or is this as good as it is going to get?

Azure OpenAI in Foundry Models

Answer accepted by question author

  1. Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator
    2026-05-05T03:10:32.71+00:00

    Hi GS,

    Thanks for sharing your observation. Yes, you are right that the model is available in that region, but what you are experiencing is a different issue. It is not about availability; it is about performance.

    Even when a model is supported in a region, performance can vary depending on a few factors.

    One common reason is regional load. Some regions, like East US 2, are heavily used, so requests may take longer or responses may feel slower or inconsistent. This does not mean the model is not working; the backend capacity in that region may simply be under heavier demand at the time.

    Another thing to check is your deployment configuration. If the deployment is using a lower throughput setting or limited capacity, it can affect response speed and quality. Increasing the capacity or checking the tokens per minute allocation can sometimes improve performance.
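
    For example, here is a minimal sketch, using the azure-mgmt-cognitiveservices Python package, of listing each deployment on the account and printing its SKU and capacity. The subscription, resource group, and account names are placeholders:

    ```python
    # Sketch: list model deployments and their SKU/capacity.
    # Subscription, resource group, and account names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

    subscription_id = "<subscription-id>"   # placeholder
    resource_group = "<resource-group>"     # placeholder
    account_name = "<foundry-account>"      # placeholder

    client = CognitiveServicesManagementClient(DefaultAzureCredential(), subscription_id)

    for dep in client.deployments.list(resource_group, account_name):
        sku = dep.sku
        # For standard deployments, capacity typically reflects the TPM quota
        # in thousands of tokens per minute; for provisioned deployments it is
        # the PTU count.
        print(dep.name, sku.name if sku else "?", sku.capacity if sku else "?")
    ```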

    It is also worth verifying the prompt and usage pattern. Larger prompts, long conversation history, or complex instructions can increase latency and sometimes give the impression of poor performance.
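
    A quick way to see how large the accumulated prompt actually is, as a minimal sketch; the tokenizer choice here is an assumption (tiktoken's o200k_base encoding, used by newer OpenAI models), since the exact encoding for gpt-5.5 is not documented here:

    ```python
    # Sketch: roughly count prompt tokens in a conversation history
    # before sending it. The encoding is an assumption.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for the model

    history = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our Q3 report."},
        # ... the rest of the conversation
    ]

    # Rough count: message content only, ignoring per-message formatting overhead.
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in history)
    print(f"approximate prompt tokens: {prompt_tokens}")
    ```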

    You can also try testing the same model in a different region if possible. If the same request gives better results in another region, then it confirms the issue is related to regional load or infrastructure differences rather than the model itself.
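
    As a rough way to compare, here is a minimal sketch that sends the same request to two regional deployments and times each response. It assumes the openai Python package with the AzureOpenAI client; the endpoints, key, and deployment name are placeholders:

    ```python
    # Sketch: time the same request against two regional deployments.
    import time
    from openai import AzureOpenAI

    ENDPOINTS = {
        "eastus2": "https://<your-eastus2-resource>.openai.azure.com",       # placeholder
        "other":   "https://<your-other-region-resource>.openai.azure.com",  # placeholder
    }
    DEPLOYMENT = "gpt-5.5"  # your deployment name
    PROMPT = [{"role": "user", "content": "Summarize the benefits of caching in two sentences."}]

    for region, endpoint in ENDPOINTS.items():
        client = AzureOpenAI(azure_endpoint=endpoint,
                             api_key="<api-key>",       # placeholder
                             api_version="2024-06-01")  # use the version you deploy with
        start = time.perf_counter()
        resp = client.chat.completions.create(model=DEPLOYMENT, messages=PROMPT)
        elapsed = time.perf_counter() - start
        print(f"{region}: {elapsed:.2f}s, {resp.usage.completion_tokens} completion tokens")
    ```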

    In simple terms, a supported region means you can use the model there, but performance is not always the same across all regions.

    If you can share a bit more about what kind of delay or issue you are seeing, such as slow response or low-quality output, it will help to narrow down the exact cause.

    I hope this helps. Do let me know if you have any further queries.

    If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful".

    Thank you!


1 additional answer

  1. Q&A Assist
    2026-05-01T16:05:29.42+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    gpt-5.5 is available only in specific regions and, like other Foundry models, its latency and perceived “slowness” are primarily driven by workload characteristics and deployment configuration rather than the model version alone.

    Key points to check to improve performance relative to gpt-5.4:

    1. Region and deployment type
      Ensure the gpt-5.5 deployment is in a supported region and, if possible, compare in the same region and deployment type (for example, provisioned throughput vs. shared). gpt-5.5 regional provisioned throughput is currently available only in eastus.
    2. Throughput vs. latency (PTUs)
      If using provisioned throughput, verify that enough PTUs are allocated for the call pattern. Throughput (tokens per minute) and calls per minute scale roughly linearly with PTUs. For example, a workload of ~500k input + 150k output tokens per minute requires around 30 PTUs, and heavier workloads (millions of tokens per minute) require proportionally more. Under-provisioning can increase queueing and latency; a crude worked estimate is included in the sketch after this list.
    3. Prompt and generation sizes
      Latency is mainly affected by:
      • Model type
      • Number of prompt tokens
      • Number of generated tokens
      • Overall load on the deployment
      Each additional generated token adds more latency than each prompt token. If gpt-5.5 prompts or max_tokens are larger than what was used with gpt-5.4, responses will feel slower even with the same infrastructure.
    4. Streaming for perceived responsiveness
      Enable streaming for interactive/chat scenarios. Streaming returns tokens incrementally as soon as they are available, improving perceived latency even if total completion time is similar. This is particularly useful for chatbots and conversational interfaces; the sketch after this list measures time to first token with streaming enabled.
    5. Content filtering configuration
      Azure OpenAI content filters add safety checks but also add latency. For low‑risk scenarios, consider whether a modified content filtering policy is appropriate to reduce overhead. This requires going through the documented process to request changes to default filters.
    6. Separate workloads and batching
      • Avoid mixing very short and very long requests on the same deployment; mixed workloads can cause short calls to wait behind long ones and reduce cache efficiency. Use separate deployments for distinct workload patterns where possible.
      • If sending many small requests, consider batching multiple prompts into a single call and test whether this improves overall throughput and effective latency.
    7. Measure, don’t guess
      Use Azure Monitor metrics on the deployment to understand:
      • Calls per minute (Requests metric split by ModelDeploymentName)
      • Total tokens per minute (Processed Inference tokens)
      • Per-call latency (end-to-end request time for non-streaming, or time to first token and average token generation rate for streaming; the streaming variants are measured client-side in the sketch after this list)
      Comparing these metrics between gpt-5.4 and gpt-5.5 under the same load, region, and PTU allocation will show whether the slowdown is due to model behavior, under‑provisioning, or contention.
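
    To make points 2, 4, and 7 concrete, here is a minimal sketch that streams a single request, measures time to first token and average generation rate, and applies the rough PTU arithmetic quoted in point 2. The endpoint, key, and deployment name are placeholders, and the PTU figures are only the approximate numbers above, not an official sizing formula:

    ```python
    # Sketch: stream one request, measure time to first token and token rate,
    # then apply the rough PTU arithmetic from point 2. Endpoint, key, and
    # deployment name are placeholders.
    import time
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<api-key>",                                        # placeholder
        api_version="2024-06-01",  # use the API version you deploy with
    )

    start = time.perf_counter()
    first_token_at = None
    deltas = 0

    stream = client.chat.completions.create(
        model="gpt-5.5",  # your deployment name
        messages=[{"role": "user", "content": "Explain vector databases in three sentences."}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (role headers, content-filter results) carry no text.
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
            deltas += 1  # rough proxy: one content delta is approximately one token

    total = time.perf_counter() - start
    if first_token_at is not None and total > 0:
        print(f"time to first token: {first_token_at:.2f}s, "
              f"~{deltas / total:.1f} tokens/s over {total:.2f}s total")

    def estimate_ptus(input_tpm: float, output_tpm: float) -> float:
        # Crude linear scaling from the figures in point 2
        # (~500k input + ~150k output tokens/min -> ~30 PTUs).
        return 30 * (input_tpm + output_tpm) / 650_000

    print(f"rough PTU estimate for 1M input + 300k output tokens/min: "
          f"{estimate_ptus(1_000_000, 300_000):.0f}")
    ```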

    If, after matching region, PTUs, prompt/generation sizes, and workload separation, gpt-5.5 is still significantly slower than gpt-5.4 for the same measured tokens and call rate, collect the above metrics and open a support case so the service team can check for region-specific or deployment-specific issues. A minimal way to pull those metrics programmatically is sketched below.
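
    A sketch using the azure-monitor-query package; the resource ID is a placeholder, and the metric and dimension names are taken from point 7 above and should be verified against the metric definitions on your resource:

    ```python
    # Sketch: pull per-deployment token counts for the last hour.
    # Resource ID is a placeholder; metric/dimension names are assumptions
    # based on the answer above.
    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    resource_id = (
        "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
        "Microsoft.CognitiveServices/accounts/<account>"  # placeholder
    )

    client = MetricsQueryClient(DefaultAzureCredential())
    result = client.query_resource(
        resource_id,
        metric_names=["ProcessedInferenceTokens"],  # name as mentioned above
        timespan=timedelta(hours=1),
        granularity=timedelta(minutes=1),
        aggregations=["Total"],
        filter="ModelDeploymentName eq '*'",        # split per deployment
    )
    for metric in result.metrics:
        for series in metric.timeseries:
            deployment = series.metadata_values.get("ModelDeploymentName", "?")
            for point in series.data:
                if point.total:
                    print(point.timestamp, deployment, metric.name, point.total)
    ```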


