Hi Felix Bergström,
You're experiencing slowdowns when using the o3 model, but not with gpt-4o or o3-mini.
This is not classic rate limiting, but likely a form of capacity-based throttling or backend queuing on Azure's side — especially for the o3 model.
Why this is happening:
Azure often implements backend-level throttling mechanisms that are:
· Not exposed to you via 429s or quota dashboards,
· Enforced due to model availability, deployment capacity, or internal scaling rules (especially for larger models like o3),
· Specific to region, SKU, or even the model variant (like o3 vs o3-mini).
Even though you're "granted" a 10M TPM quota, you're not guaranteed to fully use it in parallel. This is common in large-scale Azure deployments, especially with new or "heavy" models like o3.
1. Use deployment-level parallelism instead of pure async
If you're using a single deployment for o3, Azure may serialize the queue or apply hidden throttling. Instead:
· Create multiple deployments of the same o3 model (e.g., o3-a, o3-b, o3-c)
· Route your parallel requests across these deployments (round-robin, or async queue)
This can sometimes bypass the internal queuing issue by distributing load across endpoints (see the routing sketch just below).
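A minimal sketch of this round-robin routing, assuming the openai Python SDK (1.x) against Azure OpenAI. The deployment names o3-a/o3-b/o3-c follow the example above; the endpoint, API key environment variables, and API version are placeholders you'd replace with your own values.

```python
# Sketch: round-robin parallel requests across multiple deployments of the same o3 model.
# Assumption: openai>=1.x SDK; env var names and api_version are placeholders.
import asyncio
import itertools
import os

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-12-01-preview",  # assumption: use whichever version your resource supports
)

# Multiple deployments of the same o3 model, as suggested above (names are examples).
DEPLOYMENTS = itertools.cycle(["o3-a", "o3-b", "o3-c"])

async def ask(prompt: str) -> str:
    deployment = next(DEPLOYMENTS)  # pick the next deployment in round-robin order
    resp = await client.chat.completions.create(
        model=deployment,  # for Azure OpenAI, "model" is the deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main():
    prompts = [f"Question {i}" for i in range(9)]
    # Requests run concurrently but are spread across the three deployments.
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers)

asyncio.run(main())
```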
2. Switch to gpt-4o or o3-mini
Since your use case works well with gpt-4o, consider making that your default. gpt-4o is:
· Faster
· Cheaper
· More parallelizable
If o3 is required (e.g., due to fine-tuning or specific features), stick with options 1 and 3.
3. Rate-limit manually and batch
As a workaround, batch or stagger your calls with small delays (e.g., ~100 ms between groups of 3-4 calls), even though this adds a little latency. In some cases this counterintuitively improves overall throughput, because requests arrive at the backend queue more evenly. A small sketch follows below.
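One way to implement the stagger-and-batch workaround, reusing the hypothetical ask() helper from the earlier sketch; the batch size and delay values mirror the numbers mentioned above and are starting points, not tuned recommendations.

```python
# Sketch: fire small groups of requests in parallel, pausing briefly between groups.
# Assumption: ask() is the helper defined in the previous sketch.
import asyncio

async def run_staggered(prompts, batch_size=4, delay=0.1):
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # Run one small group concurrently, then wait ~100 ms before the next group.
        results.extend(await asyncio.gather(*(ask(p) for p in batch)))
        await asyncio.sleep(delay)
    return results

# Example usage:
# answers = asyncio.run(run_staggered([f"Question {i}" for i in range(12)]))
```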
Hope this helps. Do let me know if you have any further queries.
Thank you!