Hello Marek Šimůnek,
Welcome to Microsoft Q&A, and thank you for the detailed explanation.
You’re right to be puzzled: based on your metrics alone, it looks like you should not be hitting any limits. The key point is that the 429s you’re seeing are not caused by your configured RPM/TPM quota, but by backend peak-load throttling that applies to shared Azure OpenAI deployments.
Below is a clearer breakdown.
Why you’re seeing 429 errors despite low traffic
There are two different types of limits in Azure OpenAI:
A. Your configured quota (RPM / TPM)
- Example: 3M tokens per minute
- This is the limit you see in the Azure portal and documentation
- Based on your data (~300 tokens/sec × 60 ≈ 18,000 TPM), you are well below this limit
- You are not exceeding your quota (the sketch below shows how to confirm this from the response headers)
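If you want to verify this at runtime, here is a minimal sketch (Python, openai>=1.x with the AzureOpenAI client) that reads the rate-limit headers off each response. The endpoint, key, and deployment name are placeholders, and the header names reflect what Azure OpenAI returns today, so treat them as subject to change:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

# with_raw_response exposes the HTTP headers alongside the parsed result
raw = client.chat.completions.with_raw_response.create(
    model="<your-deployment>",  # deployment name, not the model family
    messages=[{"role": "user", "content": "ping"}],
)
completion = raw.parse()  # the usual ChatCompletion object

# These headers report your remaining quota for the current window
print("remaining requests:", raw.headers.get("x-ratelimit-remaining-requests"))
print("remaining tokens:  ", raw.headers.get("x-ratelimit-remaining-tokens"))
```

If these headers show plenty of headroom at the moment a 429 arrives, that is further evidence you hit capacity throttling rather than your quota.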
B. Backend peak-load / capacity throttling (this is what hit you)
The error message confirms this:
“The system is currently experiencing high demand… exceeds the maximum usage size allowed during peak load.”
This throttling:
- Is dynamic and regional
- Is not shown in the quota blade
- Is triggered by GPU / cluster saturation, not by your usage
- Is more common on shared (non-provisioned) deployments
- Is sensitive to concurrency and burstiness, not just token volume
So even with low overall traffic, requests can be rejected if:
- The region (e.g., West Europe) is under high demand
- Your requests land on a busy backend shard
- Multiple requests arrive concurrently
- The model (e.g., gpt-4.1-mini) is heavily used at that time
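For triage, a heuristic sketch like the following can inspect the body of a 429 to tell peak-load throttling apart from ordinary quota throttling. The "peak load" substring is taken from the error message you quoted; the wording is not a contract, so treat this as best-effort only:

```python
from openai import RateLimitError

def classify_429(e: RateLimitError) -> str:
    """Best-effort guess at why a 429 occurred, based on the error body."""
    body = e.response.text.lower()
    if "peak load" in body or "high demand" in body:
        return "backend peak-load throttling (capacity)"
    return "configured RPM/TPM quota exceeded"
```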
Why metrics look fine but errors still occur
Azure OpenAI metrics today show:
- Requests per minute
- Tokens per minute / second
They do not expose:
- Backend queue depth
- GPU contention
- Per-cluster saturation
- Peak-load thresholds
So there is currently no metric that warns you ahead of time that peak-load throttling is about to occur.
This is a known limitation.
Is there another “hidden” limit?
Yes, but it’s intentional.
Azure OpenAI applies fair-use capacity protection to shared deployments to:
- Protect the platform during peak demand
- Prevent noisy-neighbor scenarios
- Maintain overall service stability
This can happen:
- Even when you’re far below quota
- Without a public service incident
- More often in high-demand regions like West Europe
Was this an outage or incident?
From the behavior and error message:
- This looks like transient regional capacity pressure, not a service outage
- No public incident is required for this throttling to occur
How to mitigate this from your side
Recommended option: Provisioned Throughput
The error message suggests this for a reason.
Provisioned Throughput provides:
- Reserved GPU capacity
- Predictable latency
- No peak-load throttling
- SLA-backed reliability
This is the only way to fully avoid this class of 429 errors.
If you remain on shared throughput, implement all of the following:
Retry logic with backoff
- Exponential backoff
- Respect Retry-After headers (see the sketch below)
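As a rough illustration, here is one way to implement this with the openai SDK. The endpoint, key, deployment name, and backoff parameters are all illustrative rather than prescriptive:

```python
import random
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

def chat_with_retry(messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="<your-deployment>", messages=messages
            )
        except RateLimitError as e:
            # Prefer the server-suggested wait when the header is present
            retry_after = e.response.headers.get("retry-after")
            if retry_after is not None:
                delay = float(retry_after)
            else:
                # Otherwise exponential backoff with jitter, capped at 30s
                delay = min(2 ** attempt, 30) + random.random()
            time.sleep(delay)
    raise RuntimeError(f"Still throttled after {max_attempts} attempts")
```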
Limit concurrency
- Token volume is low, but parallel requests matter
- Add client-side throttling or queues (see the semaphore sketch below)
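A minimal client-side throttle might look like this; it reuses the chat_with_retry helper from the previous sketch, and the in-flight cap of 2 is just an example value to tune:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2
gate = threading.Semaphore(MAX_IN_FLIGHT)

def throttled_call(messages):
    # Blocks while MAX_IN_FLIGHT requests are already in progress,
    # so bursts of parallel calls never all hit the backend at once
    with gate:
        return chat_with_retry(messages)

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [
        pool.submit(throttled_call, [{"role": "user", "content": q}])
        for q in ["q1", "q2", "q3", "q4"]
    ]
    results = [f.result() for f in futures]
```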
Reduce burstiness
- Batch requests where possible
- Prefer streaming responses (see the sketch below)
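Streaming does not reduce the number of requests, but it returns tokens as they are generated, which smooths client-side latency during busy periods. A short sketch, again with placeholder endpoint, key, and deployment name:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-key>",
    api_version="2024-06-01",
)

stream = client.chat.completions.create(
    model="<your-deployment>",
    messages=[{"role": "user", "content": "Summarize our meeting notes."}],
    stream=True,
)
for chunk in stream:
    # Azure can emit chunks with empty choices (e.g., content-filter results)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```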
Multi-region fallback
- Deploy the same model in another EU region
- Fail over on repeated 429s (a simple fallback sketch follows)
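A simple fallback sketch, assuming you have created the same deployment in a second region (both endpoints, keys, and the region choices below are placeholders):

```python
from openai import AzureOpenAI, RateLimitError

primary = AzureOpenAI(
    azure_endpoint="https://<resource-westeurope>.openai.azure.com",
    api_key="<key-1>",
    api_version="2024-06-01",
)
secondary = AzureOpenAI(
    azure_endpoint="https://<resource-swedencentral>.openai.azure.com",
    api_key="<key-2>",
    api_version="2024-06-01",
)

def chat_with_fallback(messages):
    try:
        return primary.chat.completions.create(
            model="<your-deployment>", messages=messages
        )
    except RateLimitError:
        # Primary region is throttling; try the same deployment elsewhere
        return secondary.chat.completions.create(
            model="<your-deployment>", messages=messages
        )
```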
Monitoring
- Azure Monitor can help you spot request spikes and 429 rates (a query sketch follows)
- But note that it cannot show backend saturation today
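For what it’s worth, you can pull 429 counts programmatically with the azure-monitor-query SDK. The metric name AzureOpenAIRequests and the StatusCode dimension below are assumptions based on the metrics currently exposed for Cognitive Services accounts, so check what your own resource offers:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())
resource_id = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.CognitiveServices/accounts/<aoai-account>"
)

result = client.query_resource(
    resource_id,
    metric_names=["AzureOpenAIRequests"],  # assumed metric name
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=["Count"],
    filter="StatusCode eq '429'",  # assumed dimension name
)
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.count)
```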
To summarize:
- You did not exceed your documented rate limits
- You hit backend peak-load throttling on a shared deployment
- This can happen even with low traffic
- Azure does not currently expose metrics for this limit
- Provisioned Throughput is the only guaranteed fix
Your understanding is correct; the missing piece is that quota does not equal guaranteed capacity in shared mode.
Please refer to these resources:
- Monitoring Azure OpenAI without switching from your existing observability platform
- Azure OpenAI in Microsoft Foundry Models quotas and limits
- Manage Azure OpenAI in Microsoft Foundry Models quota
I hope this helps. Do let me know if you have any further queries.
Thank you!