Severe Latency Degradation (~4 tokens/sec) Across Azure OpenAI Models in Sweden Central

Question

Benedikt Hielscher 30

Hello everyone,

We are currently experiencing a severe performance degradation across our Azure OpenAI models deployed in the Sweden Central region.

Current Observations

Affected Region: Sweden Central (swedencentral)

Observed Throughput: Throughput has dropped to roughly 4 tokens per second, causing requests that normally take seconds to either drag out or hit client-side timeouts.

Models Impacted: This appears to be affecting all of our deployed models uniformly, rather than being isolated to a single deployment.

What We've Verified

No Workload Changes: Our input prompt sizes, overall traffic volume (TPM/RPM), and application configurations have not changed.

Azure Service Health: The main Azure Status dashboard isn't showing an active incident for OpenAI in this region, but the behavior strongly resembles a regional backend capacity constraint or a platform-level load-balancing issue.

Karnam Venkata Rajeswari 3,575 Reputation points Microsoft External Staff Moderator

2026-06-08T13:39:55.6266667+00:00
Hello @Benedikt Hielscher ,

Welcome to Microsoft Q&A. Thank you for reaching out to us.

Based on the behavior described (consistently elevated latency across multiple models, reduced token throughput, and no corresponding changes in workload patterns), this scenario aligns with a regional performance degradation affecting Azure OpenAI deployments in Sweden Central.

We are seeing similar patterns reported in parallel threads, where:

Multiple models in the same region are impacted simultaneously

Throughput and response-start times are significantly higher than expected

Other regions show comparatively stable performance under the same test conditions

In these cases, the behavior is typically associated with temporary backend capacity pressure or request queuing at the regional level.

This behavior is currently under review and the engineering teams are actively working to stabilize capacity and restore expected latency behavior.

To help reduce user impact, please check whether the following steps help:

Testing in alternate regions if available - If latency is significantly lower elsewhere, traffic routing can help maintain performance continuity

Enable streaming responses - Allows partial output to begin earlier, improving perceived responsiveness

Separate workloads if applicable - Isolate short interactive calls from longer generations to minimize queue contention

Continue monitoring key metrics - Please focus on:

- Time to First Token / Time to Response
- Time to Last Byte
- Tokens per second trends

These help confirm recovery as backend conditions improve.
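As a rough illustration of how these three metrics could be captured on the client side, here is a minimal Python sketch. It is not taken from any Azure SDK: `measure_stream`, `StreamMetrics`, and the `count_tokens` and `clock` parameters are hypothetical names, and it assumes you already have an iterator that yields the text chunks of a streamed response.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamMetrics:
    ttft_s: Optional[float]  # time to first token; None if nothing arrived
    total_s: float           # time until the stream ended (time to last byte)
    tokens: int              # tokens counted across all chunks
    tokens_per_s: float      # generation rate after the first token


def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = lambda chunk: 1,
    clock: Callable[[], float] = time.monotonic,
) -> StreamMetrics:
    """Consume a stream of chunks and time it (hypothetical helper)."""
    start = clock()
    first_at = None
    last_at = start
    first_tokens = 0
    tokens = 0
    for chunk in chunks:
        now = clock()
        n = count_tokens(chunk)
        if n <= 0:
            continue  # ignore chunks that carry no tokens
        if first_at is None:
            first_at, first_tokens = now, n
        last_at = now
        tokens += n
    total_s = clock() - start
    ttft_s = None if first_at is None else first_at - start
    decode_s = 0.0 if first_at is None else last_at - first_at
    # Rate excludes the first chunk, so it approximates 1 / (time between tokens).
    rate = (tokens - first_tokens) / decode_s if decode_s > 0 else 0.0
    return StreamMetrics(ttft_s, total_s, tokens, rate)
```

By default each chunk is counted as one token, which is only an approximation; pass a real tokenizer-based `count_tokens` if you need closer numbers. Logging these values per request over time should make it easier to see when throughput recovers.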

The following references might be helpful:

Monitoring data reference for Azure OpenAI - Microsoft Foundry | Microsoft Learn

Azure OpenAI in Microsoft Foundry Models performance & latency - Microsoft Foundry | Microsoft Learn

Azure Service Health documentation - Azure Service Health | Microsoft Learn

We appreciate your patience while we work on this.

Thank you

Benedikt Hielscher 30 Reputation points

2026-06-08T13:52:16.9166667+00:00

Hello @Karnam Venkata Rajeswari

I just tested it again, and gpt-5-nano averaged about 60 t/s while GPT-5.4 averaged around 15 t/s.

Both models are deployed in Sweden Central.

Karnam Venkata Rajeswari 3,575 Reputation points Microsoft External Staff Moderator

2026-06-09T02:59:32.45+00:00

Hello @Benedikt Hielscher ,

Following up to see if the response was helpful.

Thank you

Karnam Venkata Rajeswari 3,575 Reputation points Microsoft External Staff Moderator

2026-06-10T12:50:19.75+00:00

Hello @Benedikt Hielscher ,

Checking to see if you had any chance to review the response.

Thank you

Martin Günther 11 Reputation points

2026-06-10T18:58:34.62+00:00

We have also been experiencing higher latency since May 29 and very high latency since June 5 with GPT-5.4 and GPT-5.1 in Sweden Central.

The same prompts now take up to 120 seconds to the first response and more than 180 seconds for the full response, which then hits client timeouts.

Benedikt Hielscher 30 Reputation points

2026-06-10T19:01:23.7+00:00

@Karnam Venkata Rajeswari

I saw the answer and I am aware of that. The workflows didn't change, but latency got worse; that's what I was reporting.

Martin Günther 11 Reputation points

2026-06-10T19:14:36.4633333+00:00

I can absolutely confirm @Benedikt Hielscher's observation. It is getting worse. We've deployed GPT-5.4 as "Data Zone Standard".

Answer 1

The symptoms described (sudden, severe slowdown across all Azure OpenAI deployments in a single region, with no app or workload changes) are consistent with a regional service-side issue or capacity constraint rather than a model- or application-specific problem.

From the available information, the following points are relevant:

Latency for Azure OpenAI is primarily driven by:
- Model type
- Number of tokens in the prompt
- Number of tokens generated
- Overall load on the deployment and system

The total time to last token (TTLT) follows:

TTLT = TTFT + (TBT × Tokens Generated)

where TTFT is time to first token and TBT is time between tokens. A uniform drop to ~4 tokens/sec across models, without changes in prompt or output size, strongly suggests an increase in TBT due to backend load rather than anything in the client or prompt design.
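To make the formula concrete, here is a small worked example in Python. The 1-second TTFT and the 500-token completion are illustrative assumptions, not measurements from this thread; only the ~4 tokens/sec figure comes from the report above. The function name `ttlt_seconds` is hypothetical.

```python
def ttlt_seconds(ttft_s: float, tokens_generated: int, tokens_per_s: float) -> float:
    """Total time to last token: TTLT = TTFT + (TBT * tokens generated)."""
    tbt_s = 1.0 / tokens_per_s  # time between tokens is the inverse of throughput
    return ttft_s + tbt_s * tokens_generated


# Assumed for illustration: 1 s TTFT and a 500-token completion.
degraded = ttlt_seconds(ttft_s=1.0, tokens_generated=500, tokens_per_s=4.0)
print(f"TTLT at 4 tokens/sec: {degraded:.0f} s")  # 1 + 0.25 * 500 = 126 s
```

Under these assumptions a single completion takes over two minutes, which is in line with the client-side timeouts reported in the question.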
When latency suddenly degrades without workload changes, and especially when it affects all models in a region, the recommended actions are:
- Check Azure Status and Service Health for the region to confirm whether there is an incident affecting Azure OpenAI.
- If nothing is reported, test the same workloads in another region where the same models are available (for example, another region listed as supporting gpt-4o or gpt-4o-mini in the model availability table) to confirm the issue is region-specific.

If the issue is confirmed to be regional and persists while other regions behave normally, this points to a platform-side problem. In similar service-side timeout or performance issues (for example, with Document Intelligence or regional deployment problems), the guidance is:
- Verify there are no network/firewall/VNet issues on the client side.
- Collect request details (including correlation IDs from logs/SDK diagnostics) for affected calls.
- Escalate via Azure Support so the product group can investigate the regional backend.

To mitigate impact while the regional issue is investigated:
- If possible, temporarily route traffic to another region where the same models are available, based on the model summary and region availability table.
- Reduce max_tokens and overall generated tokens where feasible, as fewer output tokens directly reduce latency.
- Enable streaming for user-facing chat or interactive scenarios so users see partial responses earlier, improving perceived latency even if total TTLT remains high.
- Avoid mixing very long and short workloads on the same deployment; separate deployments per workload can help reduce queuing and batching delays.
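The region-routing mitigation above could look roughly like the following Python sketch. It is deliberately SDK-agnostic: `complete_with_failover` and the `(region, send)` pairs are hypothetical, and it assumes each `send` callable enforces its own client-side timeout and raises the built-in `TimeoutError` when that timeout is hit. A real client library may raise a different exception type, which you would need to translate inside `send`.

```python
from typing import Callable, List, Sequence, Tuple

# (region label, callable that sends a prompt and returns the completion text)
Backend = Tuple[str, Callable[[str], str]]


def complete_with_failover(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; move on when one times out.

    Returns (region that answered, completion text).
    Raises TimeoutError if every backend timed out.
    """
    errors: List[str] = []
    for region, send in backends:
        try:
            return region, send(prompt)
        except TimeoutError as exc:
            # Remember the failure and fall through to the next region.
            errors.append(f"{region}: {exc}")
    raise TimeoutError("all regions timed out: " + "; ".join(errors))
```

Listing the primary region first keeps normal traffic where it was, and only requests that actually time out are retried elsewhere. Note that a retry adds the primary region's timeout to the total wait, so a short timeout on the first backend matters for interactive use.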

Given the described behavior, the next concrete steps are:

- Confirm via Service Health whether there is an active incident in Sweden Central for Azure OpenAI.
- Run the same prompts against a deployment of the same model in another supported region to validate that throughput is normal elsewhere.
- If the problem is isolated to Sweden Central and persists, open a support case with timestamps, deployment names, and correlation IDs so the Azure OpenAI team can investigate regional capacity or load-balancing issues.
- As a temporary workaround, fail over latency-sensitive traffic to a healthy region and apply token/streaming optimizations to reduce user-visible impact.
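For the support case, the details mentioned above (timestamps, deployment names, correlation IDs) could be gathered per request with something like this Python sketch. `build_support_record` is a hypothetical helper, and the header names in `ID_HEADERS` are assumptions: check which request or correlation ID headers your responses actually carry and adjust the tuple accordingly.

```python
from datetime import datetime, timezone
from typing import Dict, Mapping, Sequence

# Assumed header names; confirm against the headers your responses actually return.
ID_HEADERS = ("apim-request-id", "x-request-id")


def build_support_record(
    deployment: str,
    region: str,
    started_at: datetime,
    duration_s: float,
    response_headers: Mapping[str, str],
    id_headers: Sequence[str] = ID_HEADERS,
) -> Dict[str, object]:
    """Collect the fields a support engineer is likely to ask for."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {k.lower(): v for k, v in response_headers.items()}
    ids = {name: lowered[name] for name in id_headers if name in lowered}
    return {
        "deployment": deployment,
        "region": region,
        "started_at_utc": started_at.astimezone(timezone.utc).isoformat(),
        "duration_s": round(duration_s, 3),
        "correlation_ids": ids,
    }
```

Writing one such record per slow request (for example as a JSON line in your logs) gives you a list you can attach to the case without digging through raw traces later.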
