
Severe Latency Degradation (~4 tokens/sec) Across Azure OpenAI Models in Sweden Central

Benedikt Hielscher 30 Reputation points
2026-06-08T13:10:56.4333333+00:00

Hello everyone,

We are currently experiencing a severe performance degradation across our Azure OpenAI models deployed in the Sweden Central region.

Current Observations

Affected Region: Sweden Central (swedencentral)

Observed Throughput: Throughput has dropped to roughly 4 tokens per second, causing requests that normally take seconds to either drag out or hit client-side timeouts.

Models Impacted: This appears to be affecting all of our deployed models uniformly, rather than being isolated to a single deployment.

What We've Verified

No Workload Changes: Our input prompt sizes, overall traffic volume (TPM/RPM), and application configurations have not changed.

Azure Service Health: The main Azure Status dashboard isn't showing an active incident for OpenAI in this region, but the behavior strongly resembles a regional backend capacity constraint or a platform-level load-balancing issue.

Azure OpenAI in Foundry Models

1 answer

  1. AI answer

    2026-06-08T13:11:14.6433333+00:00

    The symptoms described (sudden, severe slowdown across all Azure OpenAI deployments in a single region, with no app or workload changes) are consistent with a regional service-side issue or capacity constraint rather than a model- or application-specific problem.

    From the available information, the following points are relevant:

    1. Latency for Azure OpenAI is primarily driven by:
      • Model type
      • Number of tokens in the prompt
      • Number of tokens generated
      • Overall load on the deployment and system
      The total time to last token (TTLT) follows:
        TTLT = TTFT + (TBT × Tokens Generated)
      where TTFT is time to first token and TBT is time between tokens. A uniform drop to ~4 tokens/sec across models, without changes in prompt or output size, strongly suggests an increase in TBT due to backend load rather than anything in the client or prompt design.
    2. When latency suddenly degrades without workload changes, and especially when it affects all models in a region, the recommended actions are:
      • Check Azure Status and Service Health for the region to confirm whether there is an incident affecting Azure OpenAI.
      • If nothing is reported, test the same workloads in another region where the same models are available (for example, another region listed as supporting gpt-4o or gpt-4o-mini in the model availability table) to confirm the issue is region-specific.
    3. If the issue is confirmed to be regional and persists while other regions behave normally, this is indicative of a platform-side problem. In similar service-side timeout or performance issues (for example, with Document Intelligence or regional deployment problems), the guidance is:
      • Verify there are no network/firewall/VNet issues on the client side.
      • Collect request details (including correlation IDs from logs/SDK diagnostics) for affected calls.
      • Escalate via Azure Support so the product group can investigate the regional backend.
    4. To mitigate impact while the regional issue is investigated:
      • If possible, temporarily route traffic to another region where the same models are available, based on the model summary and region availability table.
      • Reduce max_tokens and overall generated tokens where feasible, as fewer output tokens directly reduce latency.
      • Enable streaming for user-facing chat or interactive scenarios so users see partial responses earlier, improving perceived latency even if total TTLT remains high.
      • Avoid mixing very long and short workloads on the same deployment; separate deployments per workload can help reduce queuing and batching delays.
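
    To make the TTLT relation from point 1 concrete, here is a minimal sketch that plugs in the reported ~4 tokens/sec. The 0.5 s TTFT and the 500-token response length are assumed values chosen for illustration, not measurements from the question, and the function names are hypothetical.

```python
def tbt_from_throughput(tokens_per_second: float) -> float:
    """Convert a generation rate into time between tokens (seconds)."""
    return 1.0 / tokens_per_second


def time_to_last_token(ttft_s: float, tbt_s: float, tokens_generated: int) -> float:
    """TTLT = TTFT + (TBT x Tokens Generated), all times in seconds."""
    return ttft_s + tbt_s * tokens_generated


# Assumed inputs: 0.5 s to first token, 500 generated tokens.
# Reported rate from the question: roughly 4 tokens/sec, i.e. TBT = 0.25 s.
degraded_ttlt = time_to_last_token(0.5, tbt_from_throughput(4.0), 500)
print(degraded_ttlt)  # 125.5
```

    Under these assumptions a 500-token completion takes about two minutes, which is long enough to exceed many client-side timeouts. It also shows why lowering max_tokens helps: the TBT term scales linearly with the number of generated tokens.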

    Given the described behavior, the next concrete steps are:

    1. Confirm via Service Health whether there is an active incident in Sweden Central for Azure OpenAI.
    2. Run the same prompts against a deployment of the same model in another supported region to validate that throughput is normal elsewhere.
    3. If the problem is isolated to Sweden Central and persists, open a support case with timestamps, deployment names, and correlation IDs so the Azure OpenAI team can investigate regional capacity or load-balancing issues.
    4. As a temporary workaround, fail over latency-sensitive traffic to a healthy region and apply token/streaming optimizations to reduce user-visible impact.
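
    For step 2, it may help to record the same timing figures for each region so the support case contains comparable numbers. The sketch below is one possible approach, not an official tool: `measure_stream` is a hypothetical helper that times any iterable of streamed chunks, and it treats each chunk as roughly one token, which is only an approximation. The injectable `clock` parameter is an assumption added so the helper can be checked without a live endpoint.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a streamed response: TTFT, mean TBT, TTLT, and chunks per second."""
    start = clock()
    arrivals = []
    for _ in chunks:
        # Record the arrival time of every streamed chunk.
        arrivals.append(clock())
    if not arrivals:
        raise ValueError("stream produced no chunks")

    count = len(arrivals)
    ttft = arrivals[0] - start
    ttlt = arrivals[-1] - start
    # Mean gap between consecutive chunks; undefined for a single chunk.
    tbt = (arrivals[-1] - arrivals[0]) / (count - 1) if count > 1 else 0.0
    rate = count / ttlt if ttlt > 0 else float("inf")
    return {
        "chunks": count,
        "ttft_s": ttft,
        "tbt_s": tbt,
        "ttlt_s": ttlt,
        "chunks_per_s": rate,
    }


# Offline demonstration with a fake clock: first chunk at 1.0 s,
# then one chunk every 0.25 s (the ~4 tokens/sec rate from the question).
ticks = iter([0.0, 1.0, 1.25, 1.5, 1.75, 2.0])
stats = measure_stream(["a", "b", "c", "d", "e"], clock=lambda: next(ticks))
print(stats["ttft_s"], stats["tbt_s"], stats["ttlt_s"])  # 1.0 0.25 2.0
```

    In practice the `chunks` argument would be the streaming iterator returned by whichever client you use, called once per region with an identical prompt. A large gap in `tbt_s` between Sweden Central and a second region, with similar `ttft_s`, would support the regional-backend explanation.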

    AI-generated content may be incorrect. Read our transparency notes for more information.
