High latency for chat completion requests to Azure OpenAI gpt-4o-mini 2024-07-18 in region swedencentral

schoell 70 Reputation points
2025-11-13T09:21:02.1833333+00:00

I have a deployment of gpt-4o-mini 2024-07-18 in region swedencentral and started to encounter high latency around 7:38 AM GMT on 2025/11/13. The request times increased from 2-5 seconds to well above 60 seconds.

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


Answer accepted by question author

  Jerald Felix 11,540 Reputation points Volunteer Moderator
    2025-11-13T11:03:44.34+00:00

    Hello schoell,

    The sudden latency spike on your gpt-4o-mini (2024-07-18) deployment in Sweden Central, from 2-5 seconds to over 60 seconds starting around 7:38 AM GMT on November 13, 2025, is consistent with a region-specific service degradation affecting Azure OpenAI calls; there have been multiple similar reports of throttling and queue saturation in that geography during peak hours. This does not appear to be a configuration issue on your end but a backend capacity constraint: high demand for the model (especially after recent updates) has overloaded the Sweden Central cluster, causing timeouts and backlogs, similar to the incidents in August 2025 but more pronounced today. Sweden Central is a popular EU region for compliance workloads, which amplifies the load. The good news is that Azure actively scales capacity and most such incidents resolve within 2-4 hours; here is how to mitigate immediately.

    Immediate Workarounds

    1. Scale Your Deployment:
      • In the Azure portal > your OpenAI resource > Deployments > select gpt-4o-mini > Scale, increase the TPM (tokens per minute) quota, e.g. from the default 30k to 60k, and add PTUs (provisioned throughput units) if you use them; the extra capacity drains the request queue faster.
        • Example: for chat completions, set max_tokens=100-200 initially to shorten generation time (often a 20-30% latency reduction).
      • Test: use the playground (resource > Playground > Chat) with your prompt; if it also takes >10 s there, the issue is confirmed as regional.
    2. Switch Regions Temporarily:
      • Duplicate the deployment: create a new one in East US 2 or North Central US (low-latency alternatives with full gpt-4o-mini support); reported response times there are under 3 s.
        • Update your app code: point the endpoint at the new deployment (e.g., https://your-new-resource.openai.azure.com/openai/deployments/new-gpt4o-mini/chat/completions?api-version=2024-10-21).
        • Route via Azure Front Door or API Management for failover (e.g., 80% Sweden Central, 20% backup); this adds ~50 ms of overhead but helps keep responses under 5 s.
      • EU Compliance: If GDPR-bound, stick to West Europe or North Europe; Sweden Central's issues are isolated.
    3. Optimize Requests to Reduce Load:
      • Lower temperature (0.1-0.3) and top_p (0.9) for faster sampling; a 10-20% latency cut is typical.
      • Batch if possible: use the Azure OpenAI Batch API for non-real-time workloads; it processes hundreds of prompts asynchronously at roughly 50% cost savings (results are returned within a 24-hour window, so it is not for interactive traffic).
      • Cache responses: implement Redis or Cosmos DB caching for common prompts (e.g., via Semantic Kernel); repeated prompts are then served in near-zero time after the first call.

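    The request-tuning and region-failover steps above can be sketched as a small client against the Azure OpenAI REST API. Everything resource-specific here is an assumption for illustration: the endpoint URLs, deployment name, and key are placeholders, and trying Sweden Central first with East US 2 as a fallback is just one policy (in production you would more likely put Azure Front Door or API Management in front instead of client-side retries):

    ```python
    import json
    import urllib.error
    import urllib.request

    # Hypothetical resources; substitute your own endpoints, deployment, and key.
    PRIMARY = "https://my-swedencentral-resource.openai.azure.com"
    BACKUP = "https://my-eastus2-resource.openai.azure.com"
    DEPLOYMENT = "gpt-4o-mini"
    API_VERSION = "2024-10-21"

    def chat_once(endpoint, api_key, messages, timeout=30):
        """POST one chat-completions request to an Azure OpenAI deployment."""
        url = (f"{endpoint}/openai/deployments/{DEPLOYMENT}"
               f"/chat/completions?api-version={API_VERSION}")
        body = json.dumps({
            "messages": messages,
            "max_tokens": 200,   # cap generation length to shave latency
            "temperature": 0.2,  # low temperature: faster, more deterministic
            "top_p": 0.9,
        }).encode()
        req = urllib.request.Request(url, data=body, headers={
            "Content-Type": "application/json",
            "api-key": api_key,
        })
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)

    def chat_with_failover(endpoints, api_key, messages, send=chat_once):
        """Try each regional endpoint in order; fall back on timeout/network errors."""
        last_err = None
        for endpoint in endpoints:
            try:
                return send(endpoint, api_key, messages)
            except (urllib.error.URLError, TimeoutError) as err:
                last_err = err  # degraded region: move on to the next endpoint
        raise RuntimeError(f"all endpoints failed: {last_err}")
    ```

    With this shape, a Sweden Central timeout silently falls through to the backup region, at the cost of the timeout duration; a tighter `timeout` (e.g. 10 s) makes the failover feel faster during an incident.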
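    The caching idea in item 3 can be as simple as keying on the exact request payload. A minimal in-memory sketch, assuming exact-match caching is acceptable for your prompts; the dict would be replaced by Redis or Cosmos DB in production, and `complete` is a stand-in for your real API call:

    ```python
    import hashlib
    import json

    class PromptCache:
        """Exact-match cache for chat-completion responses."""

        def __init__(self):
            self._store = {}  # key -> cached response; swap for Redis/Cosmos DB

        @staticmethod
        def key(messages, **params):
            # Hash the full request (messages + sampling params) so any
            # change to the prompt or parameters is a cache miss.
            raw = json.dumps({"messages": messages, "params": params},
                             sort_keys=True)
            return hashlib.sha256(raw.encode()).hexdigest()

        def get_or_call(self, complete, messages, **params):
            k = self.key(messages, **params)
            if k not in self._store:                # miss: pay API latency once
                self._store[k] = complete(messages, **params)
            return self._store[k]                   # hit: near-zero latency
    ```

    Note that exact-match caching only helps for literally repeated prompts (FAQ-style traffic); semantically similar prompts would need an embedding-based cache instead.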
    Check Service Health and Alerts

    • Azure Status: visit status.azure.com > filter "Sweden Central" > AI services/OpenAI and look for advisories (as of 9:21 AM GMT no major outage was listed, but "Performance degradation" was noted for the Response API in EU North/Sweden clusters, with resolution expected by end of day).
    • Resource metrics: Portal > OpenAI resource > Monitoring > Metrics > add "Time to first token" and "Total tokens"; if the average has been >10 s since 7:38 AM, log a ticket and attach the charts.
    • Alerts setup: Monitoring > Alerts > Create rule > metric: latency, threshold above the 80th percentile, notify via email/Slack so future spikes page you early.
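    To collect your own "time to first token" numbers alongside the portal metrics, you can time a streaming response client-side. A small sketch; `stream` is whatever iterable of chunks your SDK returns when you request a streaming completion (the function name is illustrative):

    ```python
    import time

    def measure_stream(stream):
        """Consume a streaming response; report first-chunk and total latency."""
        start = time.monotonic()
        first = None
        chunks = 0
        for _chunk in stream:
            if first is None:
                first = time.monotonic() - start  # time to first token
            chunks += 1
        return {
            "time_to_first_token_s": first,
            "total_s": time.monotonic() - start,
            "chunks": chunks,
        }
    ```

    Logging these two numbers per request makes it easy to show support exactly when time-to-first-token degraded, which is stronger evidence than end-to-end timings alone (slow first token points at queueing, slow totals at generation length).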

    Escalation and Next Steps

    • Open a support ticket: Portal > Help + support > New request > Technical > AI services > OpenAI > "High latency gpt-4o-mini Sweden Central" > Severity B (medium impact). Include timestamps, sample prompts, TPM settings, and the API version (2024-10-21 recommended), and attach your metrics charts; first response is typically under 1 hour, with mitigation in 2-4 hours.
    • PTU recommendation: if latency is critical, migrate to Provisioned Throughput (99% latency SLA, starting at $10/hour for 1k TPM); it typically keeps this model under 2 s.
    • Monitor globally: check Azure Status History for patterns (Sweden Central has seen 3 similar spikes in 2025 due to EU demand); if this recurs, consider multi-region replication.

    This should get your requests back under 5 s; start with the region switch for immediate relief. Share your TPM settings, configuration, or error codes for more tailored advice.

    Best Regards,

    Jerald Felix

    1 person found this answer helpful.
