
Rate limited even though we didn't have high traffic

Šimůnek, Marek 75 Reputation points
2026-01-26T12:03:48.1966667+00:00

Hi,

We got 19k requests with HTTP status code 429 even though the traffic was very minimal compared to our rate limit: around 300 tokens per second => 18k tokens per minute. The rate limit for the specific model (gpt-4.1-mini) was 3M tokens per minute.

The error says:

The system is currently experiencing high demand and cannot process your request. Your request exceeds the maximum usage size allowed during peak load. Please retry after 6 seconds. For improved latency reliability, consider switching to Provisioned Throughput.

  1. How is this possible? Is there another limit for peak load?
  2. Does Azure provide metrics for OpenAI which can tell us we are close to the rate limit?
  3. Was there an incident? How could we mitigate it from our side?

Azure OpenAI metrics:

(screenshot: openai-metrics)

(screenshot: our-logs-with-errors)

Details:

  • Data Zone Standard (EUR)
  • region: westeurope
  • {"from":"2026-01-25 13:40:44","to":"2026-01-25 18:34:29"}
Foundry Tools


1 answer

  1. SRILAKSHMI C 17,780 Reputation points Microsoft External Staff Moderator
    2026-01-26T15:08:19.79+00:00

    Hello Šimůnek, Marek,

    Welcome to Microsoft Q&A, and thank you for the detailed explanation.

    You’re right to be confused: based on your metrics alone, it looks like you should not be hitting any limits. The key point is that the 429s you’re seeing are not caused by your configured RPM/TPM quota, but by backend peak-load throttling that applies to shared Azure OpenAI deployments.

    Below is a clearer breakdown.

    Why you’re seeing 429 errors despite low traffic

    1. There are two different types of limits in Azure OpenAI

    A. Your configured quota (RPM / TPM)

    • Example: 3M tokens per minute
    • This is what you see in the Azure portal and documentation
    • Based on your data (~300 tokens/sec ≈ 18k TPM), you are well below this limit (see the quick check below)
    • You are not exceeding your quota
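
    A quick check of the numbers, using only the values from the question:

    ```python
    # Sanity check: reported traffic versus the configured quota.
    tokens_per_second = 300                   # from the question
    tpm_used = tokens_per_second * 60         # 18,000 tokens per minute
    tpm_quota = 3_000_000                     # gpt-4.1-mini quota from the post
    print(f"quota utilization: {tpm_used / tpm_quota:.2%}")  # -> 0.60%
    ```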

    B. Backend peak-load / capacity throttling (this is what hit you)

    The error message confirms this:

    “The system is currently experiencing high demand… exceeds the maximum usage size allowed during peak load.”

    This throttling:

    • Is dynamic and regional
    • Is not shown in the quota blade
    • Is triggered by GPU / cluster saturation, not your usage
    • Is more common on shared (non-provisioned) deployments
    • Is sensitive to concurrency and burstiness, not just token volume

    So even with low overall traffic, requests can be rejected if:

    • The region (e.g., West Europe) is under high demand
    • Your requests land on a busy backend shard
    • Multiple requests arrive concurrently
    • The model (e.g., gpt-4.1-mini) is heavily used at that time

    Why metrics look fine but errors still occur

    Azure OpenAI metrics today show:

    • Requests per minute
    • Tokens per minute / second

    They do not expose:

    • Backend queue depth
    • GPU contention
    • Per-cluster saturation
    • Peak-load thresholds

    So there is currently no metric that warns you ahead of time that peak-load throttling is about to occur.

    This is a known limitation.

    Is there another “hidden” limit?

    Yes, but it’s intentional.

    Azure OpenAI applies fair-use capacity protection to shared deployments to:

    • Protect the platform during peak demand
    • Prevent noisy-neighbor scenarios
    • Maintain overall service stability

    This can happen:

    • Even when you’re far below quota
    • Without a public service incident
    • More often in high-demand regions like West Europe

    Was this an outage or incident?

    From the behavior and error message:

    • This looks like transient regional capacity pressure, not a service outage
    • No public incident is required for this throttling to occur

    How to mitigate this from your side

    Recommended option: Provisioned Throughput

    The error message suggests this for a reason.

    Provisioned Throughput provides:

    • Reserved GPU capacity
    • Predictable latency
    • No peak-load throttling
    • SLA-backed reliability

    This is the only way to fully avoid this class of 429 errors.

    If you remain on shared throughput, implement all of the following:

    Retry logic with backoff (see the sketch below)

    • Exponential backoff
    • Respect Retry-After headers
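
    A minimal retry sketch in Python. `call_model` and the local `RateLimitError` wrapper are placeholders for however your client surfaces 429s (the openai SDK raises its own rate-limit exception), so adapt the exception handling to your library:

    ```python
    import random
    import time

    class RateLimitError(Exception):
        """Placeholder for your client's 429 exception."""
        def __init__(self, retry_after=None):
            self.retry_after = retry_after  # seconds, from the Retry-After header

    def call_with_backoff(call_model, max_attempts=6):
        for attempt in range(max_attempts):
            try:
                return call_model()
            except RateLimitError as err:
                if attempt == max_attempts - 1:
                    raise
                # Prefer the server's Retry-After hint; otherwise back off
                # exponentially with jitter so retries do not synchronize.
                delay = err.retry_after or min(60, 2 ** attempt + random.random())
                time.sleep(delay)
    ```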

    Limit concurrency (see the sketch below)

    • Token volume is low, but parallel requests matter
    • Add client-side throttling or queues
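
    A sketch of client-side concurrency limiting with asyncio; `send_request` is a stand-in for your async Azure OpenAI call, and MAX_CONCURRENT is an assumption to tune for your workload:

    ```python
    import asyncio

    MAX_CONCURRENT = 4  # assumption: tune to your workload
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited_call(send_request, payload):
        async with semaphore:  # at most MAX_CONCURRENT requests in flight
            return await send_request(payload)

    async def run_all(send_request, payloads):
        return await asyncio.gather(
            *(limited_call(send_request, p) for p in payloads)
        )
    ```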

    Reduce burstiness (see the pacing sketch below)

    • Batch requests where possible
    • Prefer streaming responses
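
    A simple pacing sketch that spaces out request starts so they do not all land on the backend at once; the MIN_INTERVAL value is an assumption, derived from whatever request rate you actually need:

    ```python
    import time

    MIN_INTERVAL = 0.25  # seconds between request starts (~4 req/s); assumption
    _last_start = 0.0

    def paced(call_model):
        """Enforce a minimum gap between request starts."""
        global _last_start
        wait = MIN_INTERVAL - (time.monotonic() - _last_start)
        if wait > 0:
            time.sleep(wait)
        _last_start = time.monotonic()
        return call_model()
    ```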

    Multi-region fallback (see the sketch below)

    • Deploy the same model in another EU region
    • Fail over on repeated 429s
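
    A sketch of a two-region failover that reuses the RateLimitError wrapper from the retry sketch above; the endpoint URLs and the per-region retry budget are illustrative, not real values:

    ```python
    # Endpoints are placeholders: substitute your own deployments.
    REGIONS = [
        "https://your-westeurope-resource.openai.azure.com",    # primary
        "https://your-swedencentral-resource.openai.azure.com", # example fallback
    ]

    def call_with_failover(call_region, max_429s_per_region=3):
        last_error = None
        for endpoint in REGIONS:
            for _ in range(max_429s_per_region):
                try:
                    return call_region(endpoint)
                except RateLimitError as err:
                    last_error = err
            # repeated 429s in this region: move on to the next one
        raise last_error
    ```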

    Monitoring (see the query sketch below)

    • Azure Monitor can help you spot request spikes
    • But note it cannot show backend saturation today
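
    A sketch using the azure-monitor-query package to pull request counts so you can alert on spikes; the resource ID and the metric name below are assumptions, so verify the exact metric names in the Metrics blade of your resource:

    ```python
    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient, MetricAggregationType

    # Placeholder resource ID: fill in your subscription, resource group,
    # and Azure OpenAI account name.
    RESOURCE_ID = (
        "/subscriptions/<sub-id>/resourceGroups/<rg>"
        "/providers/Microsoft.CognitiveServices/accounts/<account>"
    )

    client = MetricsQueryClient(DefaultAzureCredential())
    result = client.query_resource(
        RESOURCE_ID,
        metric_names=["AzureOpenAIRequests"],  # assumed name; verify in portal
        timespan=timedelta(hours=1),
        granularity=timedelta(minutes=1),
        aggregations=[MetricAggregationType.TOTAL],
    )

    for metric in result.metrics:
        for series in metric.timeseries:
            for point in series.data:
                print(point.timestamp, point.total)
    ```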

    In short:

    • You did not exceed your documented rate limits
    • You hit backend peak-load throttling on a shared deployment
    • This can happen even with low traffic
    • Azure does not currently expose metrics for this limit
    • Provisioned Throughput is the only guaranteed fix

    Your understanding is correct; the missing piece is that quota does not equal guaranteed capacity in shared mode.

    Please refer to these resources:

    Monitoring Azure OpenAI without switching from your existing observability platform

    Azure OpenAI in Microsoft Foundry Models quotas and limits

    Manage Azure OpenAI in Microsoft Foundry Models quota

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.
