Why am I getting rate limited?

Stephen Cattaneo 20 Reputation points
2025-03-21T23:56:40.26+00:00

My deployment's limits are

  1. 8k Tokens per minute
  2. 48 Requests per minute

I'm using GPT-4o with the OpenAI SDK (Assistants API). Every other streaming request is getting rate limited:
LastError(code='rate_limit_exceeded', message='Rate limit is exceeded. Try again in 51 seconds.')

According to the metrics:
The max tokens I've hit in a 5-minute window is 292.
The max requests I've hit in a 5-minute window is 15 calls.

I'm nowhere near the limits of my deployment, so why am I getting rate limited?

Azure AI services

Accepted answer
    Amira Bedhiafi 31,391 Reputation points
    2025-03-22T13:59:48.6666667+00:00

    Hello Stephen!

    Thank you for posting on Microsoft Learn.

    You're absolutely right: you're well under the defined rate limits for your deployment. However, in Azure OpenAI, rate-limiting behavior can be affected by several less obvious factors beyond just the per-minute numbers you're seeing in the metrics.

    Even though your deployment says:

    • 8,000 tokens/minute
    • 48 requests/minute

    There may also be:

    • Per-second limits
    • Concurrency limits (especially for streaming requests)
    • Region-specific backend throttling
    • Shared tenant limitations if you’re on a public/shared Azure region

    You might be hitting a per-second or concurrent request limit, even if your per-minute usage looks low.
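
    A quick way to confirm this is to retry with exponential backoff instead of failing outright. Here is a minimal sketch, assuming the Python openai package with an AzureOpenAI client (the environment variable names and api_version are placeholders for whatever your setup uses):

        import os
        import time

        from openai import AzureOpenAI, RateLimitError

        client = AzureOpenAI(
            azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder env vars
            api_key=os.environ["AZURE_OPENAI_API_KEY"],
            api_version="2024-05-01-preview",                    # assumed API version
        )

        def create_run_with_backoff(thread_id: str, assistant_id: str, max_attempts: int = 5):
            """Create an Assistants run, backing off whenever the service throttles us."""
            delay = 2.0
            for _ in range(max_attempts):
                try:
                    run = client.beta.threads.runs.create_and_poll(
                        thread_id=thread_id,
                        assistant_id=assistant_id,
                    )
                except RateLimitError:
                    # HTTP 429 before the run even started: wait, then retry.
                    time.sleep(delay)
                    delay *= 2
                    continue
                if run.status == "failed" and run.last_error and run.last_error.code == "rate_limit_exceeded":
                    # The run itself was throttled (the error you're seeing): wait, then retry.
                    time.sleep(delay)
                    delay *= 2
                    continue
                return run
            raise RuntimeError(f"Run still rate limited after {max_attempts} attempts")

    If a short pause and retry succeeds, you are almost certainly hitting a short-window or concurrency limit rather than the per-minute quota.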

    If you're using streaming, it might consume resources differently:

    • Azure may treat each streaming session as a long-lived connection, reducing how many concurrent ones can be handled at once.
    • These can be throttled independently of normal completions.

    I think this could explain why every other streaming request fails: you're effectively maxing out your concurrent streaming sessions, not your requests per minute.
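
    One way to test that is to cap the number of streaming runs you open at once on the client side. A rough sketch, assuming the Python openai SDK's Assistants streaming helper; MAX_CONCURRENT_STREAMS = 2 is an arbitrary starting point to experiment with, not a documented limit:

        import threading

        from openai import AzureOpenAI

        MAX_CONCURRENT_STREAMS = 2  # arbitrary guess; tune against your deployment's behavior
        stream_slots = threading.BoundedSemaphore(MAX_CONCURRENT_STREAMS)

        def stream_run(client: AzureOpenAI, thread_id: str, assistant_id: str) -> str:
            """Run one streaming Assistants request while holding a concurrency slot."""
            chunks = []
            with stream_slots:  # blocks until a slot is free
                with client.beta.threads.runs.stream(
                    thread_id=thread_id,
                    assistant_id=assistant_id,
                ) as stream:
                    for event in stream:
                        if event.event == "thread.message.delta":
                            for block in event.data.delta.content or []:
                                if block.type == "text" and block.text and block.text.value:
                                    chunks.append(block.text.value)
            return "".join(chunks)

    If the failures stop once concurrency is capped, you have your answer: it's the number of simultaneous streams, not the volume of requests.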

    Azure often uses sliding window rate enforcement, meaning a burst of requests at minute boundaries can still cause temporary blocks.

    For example:

    • 30 requests at 00:00:55
    • 30 more at 00:01:05

    A fixed per-minute counter sees 30 in each clock minute (under your 48/minute limit), but a sliding 60-second window sees all 60 together, so the second burst can still be throttled depending on how the window slides.
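
    If that is what's happening, pacing your calls evenly on the client side usually smooths it out. A small sketch using only the standard library; the 48 requests/minute figure simply mirrors your deployment's limit:

        import threading
        import time

        class RequestPacer:
            """Spread requests evenly so no sliding 60-second window sees a burst.

            At 48 requests/minute, one request every 60/48 = 1.25 seconds keeps any
            sliding one-minute window at or below 48 requests.
            """

            def __init__(self, requests_per_minute: int = 48):
                self._min_interval = 60.0 / requests_per_minute
                self._lock = threading.Lock()
                self._next_allowed = 0.0

            def wait(self) -> None:
                """Block until it is safe to send the next request."""
                with self._lock:
                    now = time.monotonic()
                    sleep_for = max(0.0, self._next_allowed - now)
                    self._next_allowed = max(now, self._next_allowed) + self._min_interval
                if sleep_for > 0:
                    time.sleep(sleep_for)

        # Usage: pacer = RequestPacer(requests_per_minute=48), then call pacer.wait()
        # immediately before every API call you make.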

    Are you possibly:

    • Sending requests across multiple deployments?
    • Using different API keys?

    Limits are enforced per deployment, per key, and per resource. If there's another app or teammate using the same resource, you could be rate-limited due to shared usage.
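
    To see what the service itself thinks your remaining quota is, and whether someone else is eating into it, you can inspect the response headers on a cheap call. A sketch using the SDK's with_raw_response helper; the x-ratelimit-* header names are the ones Azure OpenAI typically returns, so treat them as an assumption and dump all headers if they come back empty:

        from openai import AzureOpenAI

        def show_remaining_quota(client: AzureOpenAI, deployment: str) -> None:
            """Send a tiny request and print the rate-limit headers the service returns."""
            raw = client.chat.completions.with_raw_response.create(
                model=deployment,  # your deployment name, e.g. "gpt-4o"
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            headers = raw.headers
            # Header names are an assumption based on what Azure OpenAI commonly sends.
            print("remaining requests:", headers.get("x-ratelimit-remaining-requests"))
            print("remaining tokens:  ", headers.get("x-ratelimit-remaining-tokens"))

    If those numbers drop faster than your own traffic explains, something else is sharing the deployment or the key.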

    1 person found this answer helpful.
