
gpt-5.5 - terrible performance in Azure East US 2 region

GS 430 Reputation points
2026-05-01T16:05:15.5966667+00:00

Hello,

We deployed the gpt-5.5 model to Microsoft Foundry and are getting very slow responses compared to gpt-5.4 with the same setup.
Is anyone getting fast performance out of it, or is this as good as it is going to get?

Azure OpenAI in Foundry Models

Answer accepted by question author

  1. Anshika Varshney 10,655 Reputation points Microsoft External Staff Moderator
    2026-05-05T03:10:32.71+00:00

    Hi GS,

    Thanks for sharing your observation. Yes, you are right that the model is available in that region, but what you are experiencing is a different issue. It is not about availability; it is about performance.

    Even when a model is supported in a region, performance can vary depending on a few factors.

    One common reason is regional load. Some regions, like East US 2, are heavily used, so requests may take longer or responses may feel slower or inconsistent. This does not mean the model is not working; the backend capacity in that region may simply be under heavier demand at the time.

    Another thing to check is your deployment configuration. If the deployment is using a lower throughput setting or limited capacity, it can affect response speed and quality. Increasing the capacity or checking the tokens per minute allocation can sometimes improve performance.
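
    For example, here is a minimal sketch, using the azure-mgmt-cognitiveservices Python package, of listing each deployment on the account and printing its SKU and capacity. The subscription, resource group, and account names are placeholders:

    ```python
    # Sketch: list model deployments and their SKU/capacity.
    # Subscription, resource group, and account names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

    subscription_id = "<subscription-id>"   # placeholder
    resource_group = "<resource-group>"     # placeholder
    account_name = "<foundry-account>"      # placeholder

    client = CognitiveServicesManagementClient(DefaultAzureCredential(), subscription_id)

    for dep in client.deployments.list(resource_group, account_name):
        sku = dep.sku
        # For standard deployments, capacity typically reflects the TPM quota
        # in thousands of tokens per minute; for provisioned deployments it is
        # the PTU count.
        print(dep.name, sku.name if sku else "?", sku.capacity if sku else "?")
    ```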

    It is also worth verifying the prompt and usage pattern. Larger prompts, long conversation history, or complex instructions can increase latency and sometimes give the impression of poor performance.
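
    A quick way to see how large the accumulated prompt actually is, as a minimal sketch; the tokenizer choice here is an assumption (tiktoken's o200k_base encoding, used by newer OpenAI models), since the exact encoding for gpt-5.5 is not documented here:

    ```python
    # Sketch: roughly count prompt tokens in a conversation history
    # before sending it. The encoding is an assumption.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for the model

    history = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize our Q3 report."},
        # ... the rest of the conversation
    ]

    # Rough count: message content only, ignoring per-message formatting overhead.
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in history)
    print(f"approximate prompt tokens: {prompt_tokens}")
    ```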

    You can also try testing the same model in a different region if possible. If the same request gives better results in another region, then it confirms the issue is related to regional load or infrastructure differences rather than the model itself.
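
    As a rough way to compare, here is a minimal sketch that sends the same request to two regional deployments and times each response. It assumes the openai Python package with the AzureOpenAI client; the endpoints, key, and deployment name are placeholders:

    ```python
    # Sketch: time the same request against two regional deployments.
    import time
    from openai import AzureOpenAI

    ENDPOINTS = {
        "eastus2": "https://<your-eastus2-resource>.openai.azure.com",       # placeholder
        "other":   "https://<your-other-region-resource>.openai.azure.com",  # placeholder
    }
    DEPLOYMENT = "gpt-5.5"  # your deployment name
    PROMPT = [{"role": "user", "content": "Summarize the benefits of caching in two sentences."}]

    for region, endpoint in ENDPOINTS.items():
        client = AzureOpenAI(azure_endpoint=endpoint,
                             api_key="<api-key>",       # placeholder
                             api_version="2024-06-01")  # use the version you deploy with
        start = time.perf_counter()
        resp = client.chat.completions.create(model=DEPLOYMENT, messages=PROMPT)
        elapsed = time.perf_counter() - start
        print(f"{region}: {elapsed:.2f}s, {resp.usage.completion_tokens} completion tokens")
    ```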

    In simple terms, a supported region means you can use the model there, but performance is not always the same across all regions.

    If you can share a bit more about what kind of delay or issue you are seeing, such as slow response or low-quality output, it will help to narrow down the exact cause.

    I hope this helps. Do let me know if you have any further queries.

    If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful".

    Thank you!


1 additional answer

  1. Q&A Assist
    2026-05-01T16:05:29.42+00:00
    AI-generated content: this answer was created with AI from Q&A Assist.

    gpt-5.5 is available only in specific regions and, like other Foundry models, its latency and perceived “slowness” are primarily driven by workload characteristics and deployment configuration rather than the model version alone.

    Key points to check to improve performance relative to gpt-5.4:

    1. Region and deployment type
      Ensure the gpt-5.5 deployment is in a supported region and, if possible, compare in the same region and deployment type (for example, provisioned throughput vs. shared). gpt-5.5 regional provisioned throughput is currently available only in eastus.
    2. Throughput vs. latency (PTUs)
      If using provisioned throughput, verify that enough PTUs are allocated for the call pattern. Throughput (tokens per minute) and calls per minute scale roughly linearly with PTUs. For example, a workload of ~500k input + 150k output tokens per minute requires around 30 PTUs, and heavier workloads (millions of tokens per minute) require proportionally more. Under-provisioning can increase queueing and latency; a crude worked estimate is included in the sketch after this list.
    3. Prompt and generation sizes
      Latency is mainly affected by:
      • Model type
      • Number of prompt tokens
      • Number of generated tokens
      • Overall load on the deployment
      Each additional generated token adds more latency than each prompt token. If gpt-5.5 prompts or max_tokens are larger than what was used with gpt-5.4, responses will feel slower even with the same infrastructure.
    4. Streaming for perceived responsiveness
      Enable streaming for interactive/chat scenarios. Streaming returns tokens incrementally as soon as they are available, improving perceived latency even if total completion time is similar. This is particularly useful for chatbots and conversational interfaces; the sketch after this list measures time to first token with streaming enabled.
    5. Content filtering configuration
      Azure OpenAI content filters add safety checks but also add latency. For low‑risk scenarios, consider whether a modified content filtering policy is appropriate to reduce overhead. This requires going through the documented process to request changes to default filters.
    6. Separate workloads and batching
      • Avoid mixing very short and very long requests on the same deployment; mixed workloads can cause short calls to wait behind long ones and reduce cache efficiency. Use separate deployments for distinct workload patterns where possible.
      • If sending many small requests, consider batching multiple prompts into a single call and test whether this improves overall throughput and effective latency.
    7. Measure, don’t guess
      Use Azure Monitor metrics on the deployment to understand:
      • Calls per minute (Requests metric split by ModelDeploymentName)
      • Total tokens per minute (Processed Inference tokens)
      • Per-call latency (end-to-end request time for non-streaming, or time to first token and average token generation rate for streaming; the streaming variants are measured client-side in the sketch after this list)
      Comparing these metrics between gpt-5.4 and gpt-5.5 under the same load, region, and PTU allocation will show whether the slowdown is due to model behavior, under‑provisioning, or contention.
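
    To make points 2, 4, and 7 concrete, here is a minimal sketch that streams a single request, measures time to first token and average generation rate, and applies the rough PTU arithmetic quoted in point 2. The endpoint, key, and deployment name are placeholders, and the PTU figures are only the approximate numbers above, not an official sizing formula:

    ```python
    # Sketch: stream one request, measure time to first token and token rate,
    # then apply the rough PTU arithmetic from point 2. Endpoint, key, and
    # deployment name are placeholders.
    import time
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<api-key>",                                        # placeholder
        api_version="2024-06-01",  # use the API version you deploy with
    )

    start = time.perf_counter()
    first_token_at = None
    deltas = 0

    stream = client.chat.completions.create(
        model="gpt-5.5",  # your deployment name
        messages=[{"role": "user", "content": "Explain vector databases in three sentences."}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (role headers, content-filter results) carry no text.
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter() - start
            deltas += 1  # rough proxy: one content delta is approximately one token

    total = time.perf_counter() - start
    if first_token_at is not None and total > 0:
        print(f"time to first token: {first_token_at:.2f}s, "
              f"~{deltas / total:.1f} tokens/s over {total:.2f}s total")

    def estimate_ptus(input_tpm: float, output_tpm: float) -> float:
        # Crude linear scaling from the figures in point 2
        # (~500k input + ~150k output tokens/min -> ~30 PTUs).
        return 30 * (input_tpm + output_tpm) / 650_000

    print(f"rough PTU estimate for 1M input + 300k output tokens/min: "
          f"{estimate_ptus(1_000_000, 300_000):.0f}")
    ```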

    If, after matching region, PTUs, prompt/generation sizes, and workload separation, gpt-5.5 is still significantly slower than gpt-5.4 for the same measured tokens and call rate, collect the above metrics and open a support case so the service team can check for region-specific or deployment-specific issues. A minimal way to pull those metrics programmatically is sketched below.
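
    A sketch using the azure-monitor-query package; the resource ID is a placeholder, and the metric and dimension names are taken from point 7 above and should be verified against the metric definitions on your resource:

    ```python
    # Sketch: pull per-deployment token counts for the last hour.
    # Resource ID is a placeholder; metric/dimension names are assumptions
    # based on the answer above.
    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient

    resource_id = (
        "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
        "Microsoft.CognitiveServices/accounts/<account>"  # placeholder
    )

    client = MetricsQueryClient(DefaultAzureCredential())
    result = client.query_resource(
        resource_id,
        metric_names=["ProcessedInferenceTokens"],  # name as mentioned above
        timespan=timedelta(hours=1),
        granularity=timedelta(minutes=1),
        aggregations=["Total"],
        filter="ModelDeploymentName eq '*'",        # split per deployment
    )
    for metric in result.metrics:
        for series in metric.timeseries:
            deployment = series.metadata_values.get("ModelDeploymentName", "?")
            for point in series.data:
                if point.total:
                    print(point.timestamp, deployment, metric.name, point.total)
    ```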


