Hi Pratik, thanks for bringing this up.
Latency spikes can be really frustrating, especially in production when things just slow down out of nowhere. Azure OpenAI Service is usually pretty solid, but tokens/sec drops can happen. One common culprit is throttling: Azure may reduce your throughput if you approach rate limits, or backend load balancing may redistribute capacity. Check your quota and deployment limits; sometimes it's just the system adjusting resources.
It's also worth checking your request patterns. If you suddenly send a burst of heavier prompts (many more tokens per call), the model will take longer to churn through them. gpt-4o-mini is fast, but it's not immune to getting bogged down. Network issues are another possibility: Azure's infrastructure is solid, but a brief hiccup in routing or regional load can cause delays.
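To see whether heavy prompts are the cause, it helps to compute tokens/sec per call from the usage data the API already returns. A hypothetical helper, assuming you record timestamps around each call:

```python
def tokens_per_second(completion_tokens: int, started: float, finished: float) -> float:
    """Throughput of a single call.

    completion_tokens comes from the API response's usage field;
    started/finished are wall-clock timestamps (e.g. from time.monotonic()).
    """
    elapsed = finished - started
    return completion_tokens / elapsed if elapsed > 0 else 0.0
```

Logging this per request makes a sudden drop (or a burst of oversized prompts) show up immediately instead of looking like random slowness.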
As for avoiding this in the future: if you're on a standard deployment, consider whether provisioned throughput (PTU) would help. It gives more predictable performance for steady workloads.
And this isn't just an Azure thing; any cloud AI service can behave like this. One piece of advice: always log token counts and response times. If you see a pattern, you can tweak your app to smooth out requests or add retries with backoff. Caching frequent responses can also save tokens and speed things up.
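The caching idea can be sketched in a few lines. This assumes your calls are deterministic enough (e.g. temperature=0) for a cache to make sense; `call_model` stands in for your actual Azure OpenAI call:

```python
import hashlib

# In-memory cache keyed on a hash of the prompt (illustrative sketch;
# a production version would add TTLs and size limits).
_cache: dict = {}

def cached_call(prompt: str, call_model) -> str:
    """Return a cached response if we've seen this prompt; otherwise call the model."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)   # only pay tokens for unseen prompts
    return _cache[key]
```

Even a simple cache like this can cut both cost and latency noticeably if your traffic has repeated prompts.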
Hope this helps! If you dig deeper and find something unusual, let us know.
Best regards,
Alex
P.S. If this answer helped, please accept it.