Why am I getting rate limited?

Stephen Cattaneo 90 Reputation points
2025-03-21T23:56:40.26+00:00

My deployment's limits are

  1. 8k Tokens per minute
  2. 48 Requests per minute

I'm using GPT-4o with the OpenAI SDK (Assistants API). Every other streaming request is getting rate limited:
LastError(code='rate_limit_exceeded', message='Rate limit is exceeded. Try again in 51 seconds.')

According to the metrics:

  1. the max tokens I've hit in a 5-minute window is 292
  2. the max requests I've hit in a 5-minute window is 15 calls

I'm nowhere near the limits of my deployment, so why am I getting rate limited?

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


Answer accepted by question author
  1. Amira Bedhiafi 41,376 Reputation points MVP Volunteer Moderator
    2025-03-22T13:59:48.6666667+00:00

    Hello Stephen!

    Thank you for posting on Microsoft Learn.

    You're absolutely right: you're well under the defined rate limits for your deployment. However, in Azure OpenAI, rate-limiting behavior can be affected by several less obvious factors beyond the per-minute numbers you're seeing in metrics.

    Even though your deployment says:

    • 8,000 tokens/minute
    • 48 requests/minute

    There may also be:

    • Per-second limits
    • Concurrency limits (especially for streaming requests)
    • Region-specific backend throttling
    • Shared tenant limitations if you’re on a public/shared Azure region

    You might be hitting a per-second or concurrent request limit, even if your per-minute usage looks low.

    If you're using streaming, it might consume resources differently:

    • Azure may treat each streaming session as a long-lived connection, reducing how many concurrent ones can be handled at once.
    • These can be throttled independently of normal completions.

    I think this could explain why every other streaming request fails: you're effectively maxing out your concurrent streaming sessions, not requests per minute.

    Azure often uses sliding window rate enforcement, meaning a burst of requests at minute boundaries can still cause temporary blocks.

    For example:

    • 10 requests at 00:00:00
    • 10 at 00:01:00
    • 10 at 00:02:00

    could still count as 30 requests in one minute, depending on how the window slides.
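    The boundary effect is easy to reproduce with a toy trailing-window counter (illustrative only, not Azure's actual implementation):

```python
from collections import deque

class SlidingWindowCounter:
    """Count events in the trailing `window` seconds, the way
    sliding-window rate enforcement would."""

    def __init__(self, window=60):
        self.window = window
        self.events = deque()  # timestamps, oldest first

    def add(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Drop everything older than the trailing window before counting.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)
```

    A fixed per-minute counter would see bursts at 00:00:59 and 00:01:01 as 10 requests in each minute, but a sliding window sees all 20 of them inside one 60-second span.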

    Are you possibly:

    • Sending requests across multiple deployments?
    • Using different API keys?

    Limits are enforced per deployment, per key, and per resource. If there's another app or teammate using the same resource, you could be rate-limited due to shared usage.
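    Whichever limit is actually being hit, retrying with exponential backoff (as the error's "Try again in 51 seconds" hint suggests) usually smooths it out. A minimal sketch; the helper names are hypothetical, and `RateLimitExceeded` stands in for the SDK's real `openai.RateLimitError`:

```python
import random
import time

class RateLimitExceeded(Exception):
    """Stand-in for the SDK's rate-limit error (openai.RateLimitError)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitExceeded:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # Double the wait each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

    In production you would also honor any retry-after hint the service returns rather than relying on the computed delay alone.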

    1 person found this answer helpful.

0 additional answers
