Hidden Throttling Issue with o3 Models in AsyncAzureOpenAI Client

Felix Bergström 20 Reputation points
2025-05-06T20:05:27.68+00:00

I'm experiencing what appears to be throttling (not rate limiting) when making parallel calls with the AsyncAzureOpenAI client and o3 models: requests take longer to complete when executed in parallel than when sent individually.

Context:

  • Granted a quota of 10M TPM for o3.
  • Not receiving rate limit errors (429).
  • Observed expected behavior when switching to other models, such as gpt-4o or o3-mini.

Test Results:

  • 1 request with o3: Took roughly 1 minute to complete.
  • 12 requests with o3 (in parallel): First request took roughly 2 minutes to finish, with all requests taking about 4 minutes to complete.

Switching to gpt-4o (without altering any code):

  • 1 request with 4o: Took about 1 minute to complete.
  • 12 requests with 4o: Took approximately 1 minute for all requests to finish.

This indicates that the code is functioning as intended, and logs confirm that all 12 requests are sent simultaneously.
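
For reference, this is roughly how the parallel calls are issued; the endpoint, key, API version, and deployment name below are placeholders for my actual configuration:

```python
import asyncio
import os
import time

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",  # placeholder
)

async def one_call(i: int, deployment: str = "o3") -> float:
    """Send one chat completion and return its wall-clock duration in seconds."""
    start = time.monotonic()
    await client.chat.completions.create(
        model=deployment,  # Azure deployment name
        messages=[{"role": "user", "content": f"Request {i}: process this chunk of data ..."}],
    )
    return time.monotonic() - start

async def main() -> None:
    t0 = time.monotonic()
    durations = await asyncio.gather(*(one_call(i) for i in range(12)))
    print("per-request seconds:", [round(d, 1) for d in durations])
    print(f"total wall clock: {time.monotonic() - t0:.1f}s")

asyncio.run(main())
```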

Is there anything I can do? It seems absurd to be granted a quota that is practically impossible to reach (or even come near). My use case involves processing a lot of data at the same time; with the current o3 setup, the user would be stuck waiting 30-60 minutes behind a loading bar (compared to about 2 minutes with gpt-4o).

Azure OpenAI Service

Accepted answer
  1. Prashanth Veeragoni 5,170 Reputation points Microsoft External Staff Moderator
    2025-05-07T03:59:33.1366667+00:00

    Hi Felix Bergström,

    You're experiencing slowdowns when using the o3 model, but not with gpt-4o or o3-mini.

    This is not classic rate limiting, but likely a form of capacity-based throttling or backend queuing on Azure's side — especially for the o3 model.

    Why this is happening:

    Azure often implements backend-level throttling mechanisms that are:

    • Not exposed to you via 429s or quota dashboards,
    • Enforced due to model availability, deployment capacity, or internal scaling rules (especially for larger models like o3),
    • Specific to region, SKU, or even the model variant (like o3 vs o3-mini).

    Even though you're "granted" a 10M TPM quota, you're not guaranteed to fully use it in parallel. This is common in large-scale Azure deployments, especially with new or "heavy" models like o3.

    1. Use deployment-level parallelism instead of pure async

    If you're using a single deployment for o3, Azure may serialize the queue or apply hidden throttling. Instead:

    • Create multiple deployments of the same o3 model (e.g., o3-a, o3-b, o3-c)
    • Route your parallel requests across these deployments (round-robin, or an async queue)

    This can sometimes bypass the internal queuing issue by distributing load across endpoints; a rough sketch of the round-robin routing is shown below.
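
    The sketch below uses the AsyncAzureOpenAI client. The deployment names (o3-a, o3-b, o3-c), the API version, and the environment variable names are placeholders, and it assumes all deployments live under the same Azure OpenAI resource; if they are spread across resources or regions, create one client per endpoint.

    ```python
    import asyncio
    import itertools
    import os

    from openai import AsyncAzureOpenAI

    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-12-01-preview",  # placeholder: use the version your resource supports
    )

    # Cycle through deployments so consecutive requests land on different ones.
    deployments = itertools.cycle(["o3-a", "o3-b", "o3-c"])

    async def route_request(prompt: str) -> str:
        deployment = next(deployments)
        response = await client.chat.completions.create(
            model=deployment,  # Azure deployment name, not the base model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def main() -> None:
        prompts = [f"Process document {i}" for i in range(12)]
        results = await asyncio.gather(*(route_request(p) for p in prompts))
        print(len(results), "responses received")

    asyncio.run(main())
    ```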

    2. Switch to gpt-4o or o3-mini

    Since your use case works well with gpt-4o, consider making that your default. gpt-4o is:

    • Faster
    • Cheaper
    • More parallelizable

    If o3 is required (e.g., due to fine-tuning or specific features), fall back to the other options described here.

    3. Rate-limit manually and batch

    As a workaround, batch or stagger your calls with small delays (e.g., 100 ms between groups of 3-4 calls), even though this may add a little latency. In some cases this oddly improves throughput due to backend queue optimization; a minimal sketch follows.
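
    A minimal sketch of that batch-and-stagger pattern is below; the batch size, delay, and deployment name are illustrative starting points rather than tuned values:

    ```python
    import asyncio
    import os

    from openai import AsyncAzureOpenAI

    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-12-01-preview",  # placeholder
    )

    BATCH_SIZE = 4     # requests sent concurrently per group
    BATCH_DELAY = 0.1  # seconds to pause between groups

    async def ask(prompt: str, deployment: str = "o3") -> str:
        response = await client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def run_staggered(prompts: list[str]) -> list[str]:
        results: list[str] = []
        for start in range(0, len(prompts), BATCH_SIZE):
            group = prompts[start:start + BATCH_SIZE]
            # Requests inside a group still run concurrently.
            results.extend(await asyncio.gather(*(ask(p) for p in group)))
            await asyncio.sleep(BATCH_DELAY)
        return results

    asyncio.run(run_staggered([f"Process document {i}" for i in range(12)]))
    ```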

    Hope this helps. Do let me know if you have any further queries.

    Thank you!

