The symptoms described (sudden, severe slowdown across all Azure OpenAI deployments in a single region, with no app or workload changes) are consistent with a regional service-side issue or capacity constraint rather than a model- or application-specific problem.
From the available information, the following points are relevant:
- Latency for Azure OpenAI is primarily driven by:
  - Model type
  - Number of tokens in the prompt
  - Number of tokens generated
  - Overall load on the deployment and system

  TTLT = TTFT + (TBT × tokens generated)

  where TTLT is time to last token, TTFT is time to first token, and TBT is time between tokens. A uniform drop to ~4 tokens/sec across models, without changes in prompt or output size, strongly suggests an increase in TBT due to backend load rather than anything in the client or prompt design.
- When latency suddenly degrades without workload changes, and especially when it affects all models in a region, the recommended actions are:
  - Check Azure Status and Service Health for the region to confirm whether there is an incident affecting Azure OpenAI.
  - If nothing is reported, test the same workloads in another region where the same models are available (for example, another region listed as supporting `gpt-4o` or `gpt-4o-mini` in the model availability table) to confirm the issue is region-specific.
- If the issue is confirmed to be regional and persists while other regions behave normally, this is indicative of a platform-side problem. In similar service-side timeout or performance issues (for example, with Document Intelligence or regional deployment problems), the guidance is:
  - Verify there are no network/firewall/VNet issues on the client side.
  - Collect request details (including correlation IDs from logs/SDK diagnostics) for affected calls.
  - Escalate via Azure Support so the product group can investigate the regional backend.
- To mitigate impact while the regional issue is investigated:
  - If possible, temporarily route traffic to another region where the same models are available, based on the model summary and region availability table.
  - Reduce `max_tokens` and overall generated tokens where feasible, as fewer output tokens directly reduce latency.
  - Enable streaming for user-facing chat or interactive scenarios so users see partial responses earlier, improving perceived latency even if total TTLT remains high.
  - Avoid mixing very long and short workloads on the same deployment; separate deployments per workload can help reduce queuing and batching delays.
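To make the TTLT relationship above concrete, here is a small sketch that plugs in numbers. The function name `estimate_ttlt` and the TTFT of 0.5 s are assumptions chosen for illustration; only the ~4 tokens/sec figure (TBT = 0.25 s) comes from the reported symptoms.

```python
def estimate_ttlt(ttft_s: float, tbt_s: float, tokens_generated: int) -> float:
    """Estimate time to last token: TTLT = TTFT + (TBT x tokens generated).

    All times are in seconds. This is a rough model, not a measurement.
    """
    return ttft_s + tbt_s * tokens_generated


# ~4 tokens/sec corresponds to a time between tokens of 0.25 s.
degraded_tbt_s = 1 / 4

# Hypothetical TTFT of 0.5 s; compare a long response with a capped one.
full = estimate_ttlt(ttft_s=0.5, tbt_s=degraded_tbt_s, tokens_generated=800)
capped = estimate_ttlt(ttft_s=0.5, tbt_s=degraded_tbt_s, tokens_generated=200)

print(f"800 tokens: {full:.1f} s, 200 tokens: {capped:.1f} s")
# prints "800 tokens: 200.5 s, 200 tokens: 50.5 s"
```

Under these assumed numbers, capping output at 200 tokens cuts the estimated TTLT to roughly a quarter, and with streaming the user would see the first token after about TTFT rather than waiting for the full TTLT.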
Given the described behavior, the next concrete steps are:
- Confirm via Service Health whether there is an active incident in Sweden Central for Azure OpenAI.
- Run the same prompts against a deployment of the same model in another supported region to validate that throughput is normal elsewhere.
- If the problem is isolated to Sweden Central and persists, open a support case with timestamps, deployment names, and correlation IDs so the Azure OpenAI team can investigate regional capacity or load-balancing issues.
- As a temporary workaround, fail over latency-sensitive traffic to a healthy region and apply token/streaming optimizations to reduce user-visible impact.
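The cross-region comparison in the steps above could be sketched roughly as follows. This is not tied to any particular SDK: `measure_stream` accepts any iterable of streamed chunks (for example, whatever a streaming completion call yields), and chunk counts are only an approximation of token counts. The helper names, the region key `other-region`, and the threshold are hypothetical.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of chunks and report TTFT, mean TBT, TTLT and rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival of the first chunk
        last = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no chunks")
    ttlt = last - start
    return {
        "ttft_s": first - start,
        # Mean gap between consecutive chunks after the first one.
        "tbt_s": (last - first) / (count - 1) if count > 1 else 0.0,
        "ttlt_s": ttlt,
        "chunks_per_s": count / ttlt if ttlt > 0 else float("inf"),
    }


def pick_region(rates: dict, primary: str, min_rate: float) -> str:
    """Stay on the primary region unless it is below min_rate and
    another measured region meets the threshold."""
    if rates[primary] >= min_rate:
        return primary
    best = max(rates, key=rates.get)
    return best if rates[best] >= min_rate else primary


# Illustrative rates only (chunks/sec); real values come from measure_stream.
rates = {"swedencentral": 4.0, "other-region": 45.0}
print(pick_region(rates, primary="swedencentral", min_rate=20.0))
# prints "other-region"
```

Running the same prompt through `measure_stream` in each region gives comparable TTFT and TBT figures, which are also useful to attach, with timestamps and correlation IDs, to a support case.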