Hidden Throttling Issue with o3 Models in AsyncAzureOpenAI Client

Felix Bergström 20 Reputation points
2025-05-06T20:05:27.68+00:00

I'm experiencing what appears to be throttling (not rate limiting) when making parallel calls with the AsyncAzureOpenAI client and o3 models: requests take longer to complete when executed in parallel than when sent individually.

Context:

  • Granted a quota of 10M TPM for o3.
  • Not receiving rate limit errors (429).
  • Observed expected behavior when switching to other models, such as gpt-4o or o3-mini.

Test Results:

  • 1 request with o3: Took roughly 1 minute to complete.
  • 12 requests with o3 (in parallel): First request took roughly 2 minutes to finish, with all requests taking about 4 minutes to complete.

Switching to gpt-4o (without altering any code):

  • 1 request with 4o: Took about 1 minute to complete.
  • 12 requests with 4o: Took approximately 1 minute for all requests to finish.

This indicates that the code is functioning as intended, and logs confirm that all 12 requests are sent simultaneously.
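
For reference, this is roughly how the parallel calls are issued; the endpoint, key, API version, and deployment name below are placeholders for my actual configuration:

```python
import asyncio
import os
import time

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",  # placeholder
)

async def one_call(i: int, deployment: str = "o3") -> float:
    """Send one chat completion and return its wall-clock duration in seconds."""
    start = time.monotonic()
    await client.chat.completions.create(
        model=deployment,  # Azure deployment name
        messages=[{"role": "user", "content": f"Request {i}: process this chunk of data ..."}],
    )
    return time.monotonic() - start

async def main() -> None:
    t0 = time.monotonic()
    durations = await asyncio.gather(*(one_call(i) for i in range(12)))
    print("per-request seconds:", [round(d, 1) for d in durations])
    print(f"total wall clock: {time.monotonic() - t0:.1f}s")

asyncio.run(main())
```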

Is there anything I can do? It seems absurd to be granted a quota that is practically impossible to reach (or even come near). My use case involves processing a lot of data at the same time; with the current o3 setup, the user would be stuck waiting 30-60 minutes behind a loading bar (compared to about 2 minutes with gpt-4o).

Azure OpenAI Service

Accepted answer
  1. Prashanth Veeragoni 5,170 Reputation points Microsoft External Staff Moderator
    2025-05-07T03:59:33.1366667+00:00

    Hi Felix Bergström,

    You're experiencing slowdowns when using the o3 model, but not with gpt-4o or o3-mini.

    This is not classic rate limiting, but likely a form of capacity-based throttling or backend queuing on Azure's side — especially for the o3 model.

    Why this is happening:

    Azure often implements backend-level throttling mechanisms that are:

    • Not exposed to you via 429s or quota dashboards,
    • Enforced due to model availability, deployment capacity, or internal scaling rules (especially for larger models like o3),
    • Specific to region, SKU, or even the model variant (like o3 vs o3-mini).

    Even though you're "granted" a 10M TPM quota, you're not guaranteed to fully use it in parallel. This is common in large-scale Azure deployments, especially with new or "heavy" models like o3.

    1. Use deployment-level parallelism instead of pure async

    If you're using a single deployment for o3, Azure may serialize the queue or apply hidden throttling. Instead:

    • Create multiple deployments of the same o3 model (e.g., o3-a, o3-b, o3-c)
    • Route your parallel requests across these deployments (round-robin, or an async queue)

    This can sometimes bypass the internal queuing issue by distributing load across endpoints; a rough sketch of the round-robin routing is shown below.
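
    The sketch below uses the AsyncAzureOpenAI client. The deployment names (o3-a, o3-b, o3-c), the API version, and the environment variable names are placeholders, and it assumes all deployments live under the same Azure OpenAI resource; if they are spread across resources or regions, create one client per endpoint.

    ```python
    import asyncio
    import itertools
    import os

    from openai import AsyncAzureOpenAI

    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-12-01-preview",  # placeholder: use the version your resource supports
    )

    # Cycle through deployments so consecutive requests land on different ones.
    deployments = itertools.cycle(["o3-a", "o3-b", "o3-c"])

    async def route_request(prompt: str) -> str:
        deployment = next(deployments)
        response = await client.chat.completions.create(
            model=deployment,  # Azure deployment name, not the base model name
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def main() -> None:
        prompts = [f"Process document {i}" for i in range(12)]
        results = await asyncio.gather(*(route_request(p) for p in prompts))
        print(len(results), "responses received")

    asyncio.run(main())
    ```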

    2. Switch to gpt-4o or o3-mini

    Since your use case works well with gpt-4o, consider making that your default. gpt-4o is:

    • Faster
    • Cheaper
    • More parallelizable

    If o3 is required (e.g., due to fine-tuning or specific features), fall back to the other options described here.

    3. Rate-limit manually and batch

    As a workaround, batch or stagger your calls with small delays (e.g., 100 ms between groups of 3-4 calls), even though this may add a little latency. In some cases this oddly improves throughput due to backend queue optimization; a minimal sketch follows.
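
    A minimal sketch of that batch-and-stagger pattern is below; the batch size, delay, and deployment name are illustrative starting points rather than tuned values:

    ```python
    import asyncio
    import os

    from openai import AsyncAzureOpenAI

    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-12-01-preview",  # placeholder
    )

    BATCH_SIZE = 4     # requests sent concurrently per group
    BATCH_DELAY = 0.1  # seconds to pause between groups

    async def ask(prompt: str, deployment: str = "o3") -> str:
        response = await client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def run_staggered(prompts: list[str]) -> list[str]:
        results: list[str] = []
        for start in range(0, len(prompts), BATCH_SIZE):
            group = prompts[start:start + BATCH_SIZE]
            # Requests inside a group still run concurrently.
            results.extend(await asyncio.gather(*(ask(p) for p in group)))
            await asyncio.sleep(BATCH_DELAY)
        return results

    asyncio.run(run_staggered([f"Process document {i}" for i in range(12)]))
    ```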

    Hope this helps. Do let me know if you have any further queries.

    Thank you!

