Hello Stephen!
Thank you for posting on Microsoft Learn.
You're right that you're well under the defined rate limits for your deployment. However, in Azure OpenAI, rate limiting behavior can be affected by several less-obvious factors beyond just the per-minute numbers you're seeing in metrics.
Even though your deployment says:
- 8,000 tokens/minute
- 48 requests/minute
There may also be:
- Per-second limits
- Concurrency limits (especially for streaming requests)
- Region-specific backend throttling
- Shared tenant limitations if you’re on a public/shared Azure region
You might be hitting a per-second or concurrent request limit, even if your per-minute usage looks low.
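Whichever of these limits you're hitting, the service signals it the same way: an HTTP 429 response, often with a `Retry-After` header. A minimal sketch of how a client could compute its retry delay (the function name and defaults here are my own, not from any SDK): prefer the server's hint when present, otherwise fall back to capped exponential backoff with jitter.

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retrying a throttled (HTTP 429) request.

    retry_after: value of the Retry-After header, if the server sent one.
    attempt: zero-based retry count, used for exponential backoff otherwise.
    """
    if retry_after is not None:
        # Trust the server's hint when it provides one.
        return float(retry_after)
    # Capped exponential backoff: 1s, 2s, 4s, ... up to `cap`.
    delay = min(cap, base * (2 ** attempt))
    # Add jitter (50-100% of the delay) so parallel clients don't retry in lockstep.
    return delay * (0.5 + random.random() / 2)
```

This kind of client-side backoff smooths out bursts that trip a per-second or concurrency limit even when per-minute usage is low.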
If you're using streaming, it might consume resources differently:
- Azure may treat each streaming session as a long-lived connection, reducing how many concurrent ones can be handled at once.
- These can be throttled independently of normal completions.
I think this could explain why every other streaming request fails — you're effectively maxing out your concurrent streaming sessions, not your requests-per-minute quota.
Azure often uses sliding window rate enforcement, meaning a burst of requests at minute boundaries can still cause temporary blocks.
For example:
- 10 requests at 00:00:50
- 10 at 00:01:05
- 10 at 00:01:20

could still count as 30 requests within a single 60-second window, even though no clock minute on its own saw more than 20.
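The difference between fixed and sliding windows is easy to see with a toy counter that keeps only the timestamps from the trailing 60 seconds (this is an illustration of the mechanism, not Azure's actual implementation):

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the trailing `window` seconds, the way a
    sliding 60s limiter would — no reset at minute boundaries."""

    def __init__(self, window=60.0):
        self.window = window
        self.events = deque()  # timestamps, oldest first

    def add(self, t):
        self.events.append(t)

    def count(self, now):
        # Drop everything that has fallen out of the trailing window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

w = SlidingWindowCounter()
# Three bursts of 10 at t=50s, 65s, 80s (close to a minute boundary).
for t in [50] * 10 + [65] * 10 + [80] * 10:
    w.add(t)
seen = w.count(80)  # all 30 events fall inside the (20s, 80s] window
```

A fixed per-minute counter would report at most 20 in any clock minute here, but the sliding window sees all 30 at once — which is exactly the kind of burst that can trigger throttling while your per-minute metrics still look fine.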
Are you possibly:
- Sending requests across multiple deployments?
- Using different API keys?
Limits are enforced per deployment, per key, and per resource. If there's another app or teammate using the same resource, you could be rate-limited due to shared usage.
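To see whether shared usage is eating your quota, you can inspect the rate-limit headers Azure OpenAI returns on responses. A small helper (the header names are what I believe the service sends, but verify them against your API version before relying on this):

```python
def remaining_quota(headers):
    """Extract remaining-quota values from response headers.

    headers: a dict-like mapping of response headers, e.g. from the
    SDK's `with_raw_response` variants or any HTTP client.
    """
    def read(name):
        value = headers.get(name)
        return int(value) if value is not None else None

    return {
        "requests": read("x-ratelimit-remaining-requests"),
        "tokens": read("x-ratelimit-remaining-tokens"),
    }

# Example with hypothetical header values:
sample = {
    "x-ratelimit-remaining-requests": "47",
    "x-ratelimit-remaining-tokens": "7900",
}
quota = remaining_quota(sample)
```

If the remaining counts drop faster than your own request log explains, another app or teammate on the same resource is consuming the shared quota.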