Why am I getting rate limited?

Stephen Cattaneo 90 Reputation points
2025-03-21T23:56:40.26+00:00

My deployment's limits are

  1. 8k Tokens per minute
  2. 48 Requests per minute

I'm using GPT-4o with the OpenAI SDK (Assistants API). Every other streaming request is getting rate limited:
LastError(code='rate_limit_exceeded', message='Rate limit is exceeded. Try again in 51 seconds.')

According to the metrics:

  1. the max tokens I've hit in a 5-minute window is 292
  2. the max requests I've hit in a 5-minute window is 15 calls

I'm nowhere near the limits of my deployment, so why am I getting rate limited?

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


Answer accepted by question author
  1. Amira Bedhiafi 41,376 Reputation points MVP Volunteer Moderator
    2025-03-22T13:59:48.6666667+00:00

    Hello Stephen!

    Thank you for posting on Microsoft Learn.

    You're absolutely right: you're well under the defined rate limits for your deployment. However, in Azure OpenAI, rate-limiting behavior can be affected by several less obvious factors beyond the per-minute numbers you're seeing in metrics.

    Even though your deployment says:

    • 8,000 tokens/minute
    • 48 requests/minute

    There may also be:

    • Per-second limits
    • Concurrency limits (especially for streaming requests)
    • Region-specific backend throttling
    • Shared tenant limitations if you’re on a public/shared Azure region

    You might be hitting a per-second or concurrent request limit, even if your per-minute usage looks low.

    If you're using streaming, it might consume resources differently:

    • Azure may treat each streaming session as a long-lived connection, reducing how many concurrent ones can be handled at once.
    • These can be throttled independently of normal completions.

    I think this could explain why every other streaming request fails: you're effectively maxing out your concurrent streaming sessions, not requests per minute.

    Azure often uses sliding window rate enforcement, meaning a burst of requests at minute boundaries can still cause temporary blocks.

    For example:

    • 10 requests at 00:00:00
    • 10 at 00:01:00
    • 10 at 00:02:00

    could still count as 30 requests in one minute, depending on how the window slides.
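    The boundary effect is easy to reproduce with a toy trailing-window counter (illustrative only, not Azure's actual implementation):

```python
from collections import deque

class SlidingWindowCounter:
    """Count events in the trailing `window` seconds, the way
    sliding-window rate enforcement would."""

    def __init__(self, window=60):
        self.window = window
        self.events = deque()  # timestamps, oldest first

    def add(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Drop everything older than the trailing window before counting.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)
```

    A fixed per-minute counter would see bursts at 00:00:59 and 00:01:01 as 10 requests in each minute, but a sliding window sees all 20 of them inside one 60-second span.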

    Are you possibly:

    • Sending requests across multiple deployments?
    • Using different API keys?

    Limits are enforced per deployment, per key, and per resource. If there's another app or teammate using the same resource, you could be rate-limited due to shared usage.
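    Whichever limit is actually being hit, retrying with exponential backoff (as the error's "Try again in 51 seconds" hint suggests) usually smooths it out. A minimal sketch; the helper names are hypothetical, and `RateLimitExceeded` stands in for the SDK's real `openai.RateLimitError`:

```python
import random
import time

class RateLimitExceeded(Exception):
    """Stand-in for the SDK's rate-limit error (openai.RateLimitError)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitExceeded:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # Double the wait each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

    In production you would also honor any retry-after hint the service returns rather than relying on the computed delay alone.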

    1 person found this answer helpful.

0 additional answers
