
Rate limited even though we didn't have high traffic

Šimůnek, Marek 75 Reputation points
2026-01-26T12:03:48.1966667+00:00

Hi,

We got 19k requests with HTTP status code 429 even though the traffic was very minimal compared to our rate limit: around 300 tokens per second => 18k tokens per minute. The rate limit for the specific model (gpt-4.1-mini) was 3M tokens per minute.

The error says:

The system is currently experiencing high demand and cannot process your request. Your request exceeds the maximum usage size allowed during peak load. Please retry after 6 seconds. For improved latency reliability, consider switching to Provisioned Throughput.

  1. How is this possible? Is there another limit for peak load?
  2. Does Azure provide metrics for OpenAI which can tell us we are close to the rate limit?
  3. Was there an incident? How could we mitigate it from our side?

Azure OpenAI metrics:

(screenshot: openai-metrics)

(screenshot: our-logs-with-errors)

Details:

  • Data Zone Standard (EUR)
  • region: westeurope
  • {"from":"2026-01-25 13:40:44","to":"2026-01-25 18:34:29"}
Foundry Tools


1 answer

  1. SRILAKSHMI C 17,780 Reputation points Microsoft External Staff Moderator
    2026-01-26T15:08:19.79+00:00

    Hello Šimůnek, Marek,

    Welcome to Microsoft Q&A, and thank you for the detailed explanation.

    You’re right to be confused: based on your metrics alone, it looks like you should not be hitting any limits. The key point is that the 429s you’re seeing are not caused by your configured RPM/TPM quota, but by backend peak-load throttling that applies to shared Azure OpenAI deployments.

    Below is a clearer breakdown.

    Why you’re seeing 429 errors despite low traffic

    1. There are two different types of limits in Azure OpenAI

    A. Your configured quota (RPM / TPM)

    • Example: 3M tokens per minute
    • This is what you see in the Azure portal and documentation
    • Based on your data (~300 tokens/sec ≈ 18k TPM), you are well below this limit (see the quick check below)
    • You are not exceeding your quota
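
    A quick check of the numbers, using only the values from the question:

    ```python
    # Sanity check: reported traffic versus the configured quota.
    tokens_per_second = 300                   # from the question
    tpm_used = tokens_per_second * 60         # 18,000 tokens per minute
    tpm_quota = 3_000_000                     # gpt-4.1-mini quota from the post
    print(f"quota utilization: {tpm_used / tpm_quota:.2%}")  # -> 0.60%
    ```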

    B. Backend peak-load / capacity throttling (this is what hit you)

    The error message confirms this:

    “The system is currently experiencing high demand… exceeds the maximum usage size allowed during peak load.”

    This throttling:

    • Is dynamic and regional
    • Is not shown in the quota blade
    • Is triggered by GPU / cluster saturation, not your usage
    • Is more common on shared (non-provisioned) deployments
    • Is sensitive to concurrency and burstiness, not just token volume

    So even with low overall traffic, requests can be rejected if:

    • The region (e.g., West Europe) is under high demand
    • Your requests land on a busy backend shard
    • Multiple requests arrive concurrently
    • The model (e.g., gpt-4.1-mini) is heavily used at that time

    Why metrics look fine but errors still occur

    Azure OpenAI metrics today show:

    • Requests per minute
    • Tokens per minute / second

    They do not expose:

    • Backend queue depth
    • GPU contention
    • Per-cluster saturation
    • Peak-load thresholds

    So there is currently no metric that warns you ahead of time that peak-load throttling is about to occur.

    This is a known limitation.

    Is there another “hidden” limit?

    Yes, but it’s intentional.

    Azure OpenAI applies fair-use capacity protection to shared deployments to:

    • Protect the platform during peak demand
    • Prevent noisy-neighbor scenarios
    • Maintain overall service stability

    This can happen:

    • Even when you’re far below quota
    • Without a public service incident
    • More often in high-demand regions like West Europe

    Was this an outage or incident?

    From the behavior and error message:

    • This looks like transient regional capacity pressure, not a service outage
    • No public incident is required for this throttling to occur

    How to mitigate this from your side

    Recommended option: Provisioned Throughput

    The error message suggests this for a reason.

    Provisioned Throughput provides:

    • Reserved GPU capacity
    • Predictable latency
    • No peak-load throttling
    • SLA-backed reliability

    This is the only way to fully avoid this class of 429 errors.

    If you remain on shared throughput, implement all of the following:

    Retry logic with backoff (see the sketch below)

    • Exponential backoff
    • Respect Retry-After headers
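
    A minimal retry sketch in Python. `call_model` and the local `RateLimitError` wrapper are placeholders for however your client surfaces 429s (the openai SDK raises its own rate-limit exception), so adapt the exception handling to your library:

    ```python
    import random
    import time

    class RateLimitError(Exception):
        """Placeholder for your client's 429 exception."""
        def __init__(self, retry_after=None):
            self.retry_after = retry_after  # seconds, from the Retry-After header

    def call_with_backoff(call_model, max_attempts=6):
        for attempt in range(max_attempts):
            try:
                return call_model()
            except RateLimitError as err:
                if attempt == max_attempts - 1:
                    raise
                # Prefer the server's Retry-After hint; otherwise back off
                # exponentially with jitter so retries do not synchronize.
                delay = err.retry_after or min(60, 2 ** attempt + random.random())
                time.sleep(delay)
    ```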

    Limit concurrency (see the sketch below)

    • Token volume is low, but parallel requests matter
    • Add client-side throttling or queues
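
    A sketch of client-side concurrency limiting with asyncio; `send_request` is a stand-in for your async Azure OpenAI call, and MAX_CONCURRENT is an assumption to tune for your workload:

    ```python
    import asyncio

    MAX_CONCURRENT = 4  # assumption: tune to your workload
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def limited_call(send_request, payload):
        async with semaphore:  # at most MAX_CONCURRENT requests in flight
            return await send_request(payload)

    async def run_all(send_request, payloads):
        return await asyncio.gather(
            *(limited_call(send_request, p) for p in payloads)
        )
    ```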

    Reduce burstiness (see the pacing sketch below)

    • Batch requests where possible
    • Prefer streaming responses
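
    A simple pacing sketch that spaces out request starts so they do not all land on the backend at once; the MIN_INTERVAL value is an assumption, derived from whatever request rate you actually need:

    ```python
    import time

    MIN_INTERVAL = 0.25  # seconds between request starts (~4 req/s); assumption
    _last_start = 0.0

    def paced(call_model):
        """Enforce a minimum gap between request starts."""
        global _last_start
        wait = MIN_INTERVAL - (time.monotonic() - _last_start)
        if wait > 0:
            time.sleep(wait)
        _last_start = time.monotonic()
        return call_model()
    ```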

    Multi-region fallback (see the sketch below)

    • Deploy the same model in another EU region
    • Fail over on repeated 429s
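
    A sketch of a two-region failover that reuses the RateLimitError wrapper from the retry sketch above; the endpoint URLs and the per-region retry budget are illustrative, not real values:

    ```python
    # Endpoints are placeholders: substitute your own deployments.
    REGIONS = [
        "https://your-westeurope-resource.openai.azure.com",    # primary
        "https://your-swedencentral-resource.openai.azure.com", # example fallback
    ]

    def call_with_failover(call_region, max_429s_per_region=3):
        last_error = None
        for endpoint in REGIONS:
            for _ in range(max_429s_per_region):
                try:
                    return call_region(endpoint)
                except RateLimitError as err:
                    last_error = err
            # repeated 429s in this region: move on to the next one
        raise last_error
    ```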

    Monitoring (see the query sketch below)

    • Azure Monitor can help you spot request spikes
    • But note it cannot show backend saturation today
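
    A sketch using the azure-monitor-query package to pull request counts so you can alert on spikes; the resource ID and the metric name below are assumptions, so verify the exact metric names in the Metrics blade of your resource:

    ```python
    from datetime import timedelta
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricsQueryClient, MetricAggregationType

    # Placeholder resource ID: fill in your subscription, resource group,
    # and Azure OpenAI account name.
    RESOURCE_ID = (
        "/subscriptions/<sub-id>/resourceGroups/<rg>"
        "/providers/Microsoft.CognitiveServices/accounts/<account>"
    )

    client = MetricsQueryClient(DefaultAzureCredential())
    result = client.query_resource(
        RESOURCE_ID,
        metric_names=["AzureOpenAIRequests"],  # assumed name; verify in portal
        timespan=timedelta(hours=1),
        granularity=timedelta(minutes=1),
        aggregations=[MetricAggregationType.TOTAL],
    )

    for metric in result.metrics:
        for series in metric.timeseries:
            for point in series.data:
                print(point.timestamp, point.total)
    ```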

    In short:

    • You did not exceed your documented rate limits
    • You hit backend peak-load throttling on a shared deployment
    • This can happen even with low traffic
    • Azure does not currently expose metrics for this limit
    • Provisioned Throughput is the only guaranteed fix

    Your understanding is correct; the missing piece is that quota does not equal guaranteed capacity in shared mode.

    Please refer to these resources:

    Monitoring Azure OpenAI without switching from your existing observability platform

    Azure OpenAI in Microsoft Foundry Models quotas and limits

    Manage Azure OpenAI in Microsoft Foundry Models quota

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

    1 person found this answer helpful.
