Why is there a sudden decrease in tokens/sec in Azure OpenAI Service, leading to higher latency?

Pratik 0 Reputation points
2025-06-06T05:35:27.2833333+00:00

I am using the Azure OpenAI service with a Global Standard deployment of the gpt-4o-mini model.

I have been using this service in production. For 3-4 hours the latency stayed consistent (400-500 ms), but then there was a sudden spike to around 1200-1400 ms. Looking into other metrics, I realised tokens/sec went from 140 to 40.

How could such a thing happen?
And what can be done to avoid this?

Azure OpenAI Service

1 answer

  1. Alex Burlachenko 9,585 Reputation points
    2025-06-08T12:53:50.4933333+00:00

    Hi Pratik, thanks for bringing this up.

    Latency spikes can be super annoying, especially when you're in production and things just slow down out of nowhere. The Azure OpenAI service is usually pretty solid, but tokens/sec drops do happen. One common culprit is throttling: Azure may scale your throughput down if you hit rate limits, or backend load balancing may shift your traffic onto busier capacity. Check your quota (tokens per minute and requests per minute) and see whether you're near any limits; sometimes it's just the system adjusting resources.
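    If you want to see how close you are to those limits from inside your app, you can peek at the rate-limit headers on the response. A minimal sketch, assuming the openai v1 Python SDK and the x-ratelimit-* headers Azure OpenAI currently returns (verify both against your API version before relying on them):

    ```python
    # Sketch: inspect Azure OpenAI rate-limit headers to see how close you are
    # to your TPM/RPM quota. Header names and with_raw_response are assumptions
    # based on the openai v1 SDK and current Azure docs.
    import os
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # https://<resource>.openai.azure.com
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o-mini",  # your deployment name
        messages=[{"role": "user", "content": "ping"}],
    )

    print("remaining requests:", raw.headers.get("x-ratelimit-remaining-requests"))
    print("remaining tokens:  ", raw.headers.get("x-ratelimit-remaining-tokens"))

    response = raw.parse()  # the usual ChatCompletion object
    print(response.choices[0].message.content)
    ```

    If the remaining-tokens number keeps hovering near zero right before the slowdowns, throttling is the likely explanation.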

    It's also worth checking your request patterns. If you suddenly sent a burst of heavier prompts (way more tokens per call), the model will take longer to churn through them and per-request throughput drops. gpt-4o-mini is fast, but it's not immune to getting bogged down. Network issues can play a part too: Azure's infra is great, but a tiny hiccup in routing or regional load can add delay.
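    To spot heavier prompts before they go out, you can estimate their size client-side. A small sketch, assuming the o200k_base tokenizer used by the GPT-4o family via tiktoken; treat the counts as rough estimates and use the usage numbers returned by the API as the source of truth:

    ```python
    # Sketch: rough prompt-size estimate so "heavy" calls show up in your logs.
    # o200k_base is an assumption for the gpt-4o family; the service's usage
    # field is what you're actually billed on.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")

    def estimated_prompt_tokens(messages: list[dict]) -> int:
        """Rough token count for a chat-completions message list."""
        return sum(len(enc.encode(m["content"])) for m in messages)

    messages = [{"role": "user", "content": "Summarise this 20-page contract ..."}]
    print(estimated_prompt_tokens(messages))
    ```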

    Now, for avoiding this in the future: if you're on a Standard (pay-as-you-go) deployment, check whether provisioned throughput (PTU) could help. It reserves capacity for your deployment, so throughput and latency are much more predictable for steady workloads.

    And this isn't just an Azure thing; any cloud AI service can act up like this. One piece of advice: always log token counts and response times. If you see a pattern, you can tweak your app to smooth out requests or add retries with exponential backoff (see the sketch below). Caching frequent responses can also save you some tokens and speed things up.
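    Here's a minimal sketch of that logging-plus-backoff idea, assuming the openai v1 Python SDK (the RateLimitError/APIError names come from that library; adapt if you use a different client):

    ```python
    # Sketch: log per-call latency and token usage, and retry throttled calls
    # with exponential backoff.
    import logging
    import os
    import time

    from openai import APIError, AzureOpenAI, RateLimitError

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("aoai")

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    def chat_with_backoff(messages, deployment="gpt-4o-mini", max_retries=5):
        delay = 1.0
        for attempt in range(max_retries):
            start = time.perf_counter()
            try:
                resp = client.chat.completions.create(model=deployment, messages=messages)
            except (RateLimitError, APIError) as exc:
                log.warning("attempt %d failed (%s), sleeping %.1fs", attempt + 1, exc, delay)
                time.sleep(delay)
                delay *= 2  # exponential backoff
                continue
            elapsed = time.perf_counter() - start
            usage = resp.usage
            log.info(
                "latency=%.0fms prompt=%d completion=%d tok/s=%.0f",
                elapsed * 1000,
                usage.prompt_tokens,
                usage.completion_tokens,
                usage.completion_tokens / elapsed if elapsed > 0 else 0.0,
            )
            return resp
        raise RuntimeError("gave up after repeated throttling")

    # usage:
    # reply = chat_with_backoff([{"role": "user", "content": "hello"}])
    ```

    Graphing the logged tok/s next to the Azure metrics should make it obvious whether the drop lines up with throttling or with heavier prompts.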

    Hope this helps! If you dig deeper and find something funky, let us know.

    Best regards,

    Alex

    P.S. If my answer helped you, please accept it.
    

    https://ctrlaltdel.blog/

