
gpt-4o-mini — Unexplained latency degradation since May 22, both East US and Sweden Central

Oron Karmona 35 Reputation points
2026-05-25T13:20:01.34+00:00

Description:

We are experiencing a significant and sustained increase in response latency for our gpt-4o-mini deployment starting May 22, 2026. The degradation is observed simultaneously in both East US and Sweden Central regions.

Observed metrics (Azure Monitor — Azure OpenAI resource):

┌───────────────────────────────┬──────────────────┬──────────────────────────┐
│ Metric                        │ Before May 22    │ After May 22             │
├───────────────────────────────┼──────────────────┼──────────────────────────┤
│ Time to first byte            │ ~1 ms (stable)   │ 4–12 ms (noisy, spiking) │
├───────────────────────────────┼──────────────────┼──────────────────────────┤
│ Time to last byte             │ ~198 ms (stable) │ 800–2,376 ms             │
├───────────────────────────────┼──────────────────┼──────────────────────────┤
│ Number of requests            │ Unchanged        │ Unchanged                │
├───────────────────────────────┼──────────────────┼──────────────────────────┤
│ Token volume (input + output) │ Unchanged        │ Unchanged                │
└───────────────────────────────┴──────────────────┴──────────────────────────┘
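
As a rough check on the size of the change, the figures in the table can be turned into slowdown factors. The snippet below is only a sketch: `slowdown_range` is a hypothetical helper name, and the inputs are the approximate values quoted in the table, not raw measurements.

```python
def slowdown_range(baseline_ms, degraded_low_ms, degraded_high_ms):
    """Return (low, high) slowdown factors relative to the baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return degraded_low_ms / baseline_ms, degraded_high_ms / baseline_ms


# Approximate values taken from the table above.
ttfb_factor = slowdown_range(1, 4, 12)       # time to first byte
ttlb_factor = slowdown_range(198, 800, 2376)  # time to last byte

print(f"TTFB slowdown: {ttfb_factor[0]:.1f}x to {ttfb_factor[1]:.1f}x")
print(f"TTLB slowdown: {ttlb_factor[0]:.1f}x to {ttlb_factor[1]:.1f}x")
```

Both metrics land in roughly the same 4x to 12x band, which is consistent with the whole request being slower rather than only the generation phase.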

Key observations:

  • Change point is clearly May 22 — flat baseline before, degraded after
  • Request volume and token counts are identical before and after, ruling out load increase
  • TTFB increased alongside TTLB — this is not an output-length issue; Azure is slower to begin responding
  • Both regions degraded simultaneously — rules out regional infrastructure
  • No 429s or error-rate increase — not a throttling/rate-limit issue
  • LiteLLM proxy latency was validated as a pass-through; the latency is Azure-side
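
The change-point observation could be checked against exported daily averages with a simple mean-shift search. This is a minimal sketch, not the method used to produce the observations above: `find_change_point` is a hypothetical helper, and the sample series is made-up placeholder data chosen only to resemble the before/after ranges in the table.

```python
from statistics import mean


def find_change_point(values):
    """Return the split index that minimises the within-segment squared error."""
    if len(values) < 2:
        raise ValueError("need at least two samples")

    def sse(segment):
        # Sum of squared deviations from the segment mean.
        centre = mean(segment)
        return sum((v - centre) ** 2 for v in segment)

    # Try every split and keep the one where both halves are most "flat".
    return min(range(1, len(values)),
               key=lambda i: sse(values[:i]) + sse(values[i:]))


# Placeholder daily averages of time to last byte in ms (NOT real measurements).
daily_ttlb_ms = [198, 201, 196, 199, 1450, 1620, 1380]

split = find_change_point(daily_ttlb_ms)
ratio = mean(daily_ttlb_ms[split:]) / mean(daily_ttlb_ms[:split])
print(f"change at index {split}, mean slowdown about {ratio:.1f}x")
```

With real data, the index returned would map back to a calendar day in the export, which can then be compared with the May 22 date claimed above.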

What we need to know:

  1. Was any backend infrastructure, routing, or model version change applied to gpt-4o-mini deployments around May 22, 2026?
  2. Is the gpt-4o-mini deployment still serving gpt-4o-mini-2024-07-18 as the underlying model version, or was it silently migrated?
  3. Is there an active service-side issue affecting request routing or backend handling for this model, similar to the GPT-4.1-mini routing incident from February 2026?

Deployment details:

  • Model: gpt-4o-mini

This gpt-4o-mini deployment serves production end-user requests. Time to last byte has increased roughly 4–12x (from ~198 ms to 800–2,376 ms), directly degrading user experience.

Thank you

Azure OpenAI in Foundry Models

2 answers

  1. Matthieu Delanoë 20 Reputation points
    2026-05-28T06:53:19+00:00

    Hi all,

    As of today, latency seems to be back to normal on our deployments.

    However, there is still no answer or confirmation from Microsoft on this.


  2. AI answer

    2026-05-25T13:20:17.79+00:00

    The context does not contain any information about backend incidents, routing changes, or model-version changes for gpt‑4o‑mini on or around May 22, 2026, nor any confirmation of a service-side issue similar to a GPT‑4.1‑mini routing incident.

    The only supported guidance from the context that applies to latency issues is general performance tuning and monitoring:

    1. Use Azure Monitor to validate throughput and latency
      • Track:
        • Calls per minute using the Azure OpenAI Requests metric split by ModelDeploymentName.
        • Total tokens per minute using the Processed Inference tokens metric.
      • This confirms whether the service is processing the same volume and whether any throttling or capacity effects show up in metrics.
    2. Apply the documented latency optimization levers where possible
      • Model choice: If latency is critical and the workload allows, test GPT‑4o mini versus other available models to see if another model in the same region provides better latency characteristics.
      • Lower max_tokens: Even when actual output length is similar, a higher max_tokens setting increases latency.
      • Reduce total tokens: Shorter prompts and responses reduce end‑to‑end latency.
      • Enable streaming: Streaming does not reduce total processing time but improves perceived latency by returning tokens as they are generated.
      • Content filtering configuration: Content filters improve safety but add latency; where appropriate and compliant, evaluate whether modified content filtering policies are suitable.
    3. Next steps for unexplained step‑change latency
      • Because the context explicitly states that model availability and behavior are centrally managed and cannot be influenced or customized per customer or per region, and there is no published incident or change record in the provided material, any investigation of a specific date‑bound regression (like May 22, 2026) must go through Azure support.
      • Open a support ticket with:
        • Exact resource name and regions.
        • Deployment name(s) for gpt‑4o‑mini.
        • Time window showing the change (before/after May 22) and exported Azure Monitor charts for latency and tokens.
        • Example request IDs and timestamps for slow vs. normal calls.
      • The product group can then check for:
        • Backend routing or capacity changes affecting those regions.
        • Model implementation or configuration changes.
        • Any active or historical incidents not reflected in public documentation.
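
To collect per-request timings for slow versus normal calls, time to first chunk and time to last chunk can be measured on the client while streaming. The following is a minimal sketch under stated assumptions: `measure_stream` is a hypothetical helper, it accepts any iterable of chunks, and the injectable `clock` parameter exists only so the function can be exercised without a live endpoint. Client-side numbers include network and proxy time, so they will not match the Azure Monitor metrics exactly.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Consume a stream and return (seconds_to_first_chunk,
    seconds_to_last_chunk, chunk_count).

    seconds_to_first_chunk is None if the stream yields nothing.
    """
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            # Record the delay until the first chunk arrives.
            first = clock() - start
        count += 1
    total = clock() - start
    return first, total, count


# Demonstration with a fake clock and a fake three-chunk stream.
fake_times = iter([0.0, 0.25, 1.0])
result = measure_stream(["a", "b", "c"], clock=lambda: next(fake_times))
print(result)
```

In practice the `chunks` argument would be the streaming response object returned by whatever client is in use (for example, a chat completion requested with `stream=True`), and each result would be logged alongside the request's timestamp and ID so it can be attached to the support ticket.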

    The context does not provide any statement that gpt‑4o‑mini has been silently migrated to a different underlying version, nor any confirmation of a known service‑side issue on the date described. Only Azure support can confirm subscription‑ and time‑specific backend behavior.


