An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
Hello @Vitalii Horbovyi
Thank you for the detailed explanation. You’ve already identified the key constraint correctly: for Standard (Pay-As-You-Go) deployments, throughput is governed by subscription-level quota (Tokens Per Minute and Requests Per Minute) for a given model and region. As a result, deploying multiple instances of the same model within the same subscription does not increase total throughput, since all deployments share the same quota pool.
Why You’re Seeing Throughput Degradation
Your workload is highly parallel and token-intensive:
- Approximately 10 sequential LLM calls per user session
- Roughly 10,000 tokens per call
- Around 100,000 tokens per user session
With just four concurrent users, this translates to:
- Approximately 40 simultaneous requests
- Up to 400,000 tokens being processed concurrently
As your workload approaches your subscription’s TPM/RPM limits, Azure OpenAI begins queueing requests. Even before explicit throttling (429) occurs, this queuing can significantly increase latency—which is exactly the behavior you're observing.
Can Multiple Subscriptions Increase Throughput?
Yes. This is a valid and widely adopted scaling strategy for Standard deployments.
Because quota is allocated per subscription, per region, per model, each additional subscription receives its own independent quota allocation. By deploying separate Azure OpenAI resources (or Azure AI Foundry hubs/projects) in multiple subscriptions, you can effectively scale throughput horizontally.
For example:
- Subscription A → Azure OpenAI resource → GPT-4.1-mini deployment
- Subscription B → Azure OpenAI resource → GPT-4.1-mini deployment
- Subscription C → Azure OpenAI resource → GPT-4.1-mini deployment
A centralized gateway can then distribute traffic across these deployments, allowing aggregate throughput to scale approximately linearly with the number of subscriptions.
This is often the most practical alternative when PTU is not yet financially viable.
Recommended Step: Request a Quota Increase
Before introducing architectural complexity, we strongly recommend submitting a quota increase request for your existing subscription.
Even with Standard deployments, you can request higher TPM/RPM allocations for GPT-4.1-mini in your target region. This is the simplest and most cost-effective way to gain additional capacity.
If approved, it may significantly improve throughput without requiring any changes to your application architecture.
Recommended Scalable Architecture
For sustained high concurrency, we recommend the following architecture:
Multiple Azure subscriptions (or multiple Azure OpenAI resources)
One GPT-4.1-mini deployment per resource
Centralized routing layer using:
- Azure Front Door
- Azure API Management
- Azure Traffic Manager
- or a custom gateway
Your gateway should:
- Route requests to the least-utilized endpoint
- Monitor token consumption and request rates
- Respect
Retry-Afterheaders - Automatically retry or fail over on
429responses - Implement exponential backoff and circuit breakers
- Dynamically adjust routing based on endpoint health and latency
This design provides:
- Increased aggregate throughput
- Better resilience and fault tolerance
- Isolation from individual quota exhaustion
- Improved operational flexibility
Additional Optimization Opportunities
1. Reduce Token Consumption
Token optimization is often the fastest path to better throughput.
- Minimize prompt size
- Trim unnecessary conversation history
- Use summarization between steps
- Set
max_output_tokensappropriately - Avoid over-allocating output tokens
Even a 20–30% reduction in tokens can materially improve throughput and reduce costs.
2. Consolidate Sequential Calls
Ten sequential calls per session compounds latency.
Consider:
- Combining multiple tasks into a single prompt
- Using structured outputs to perform multiple operations in one call
- Parallelizing independent steps where possible
Reducing the total number of model invocations can significantly improve end-to-end response times.
3. Use Smaller Models Strategically
Not every step may require GPT-4.1-mini.
For lighter tasks, consider GPT-4.1-nano
Smaller or specialized models for classification, extraction, or validation
Reserve GPT-4.1-mini for the most reasoning-intensive stages.
4. Enable Streaming
Streaming does not reduce total compute time, but it substantially improves perceived responsiveness by returning tokens as they are generated.
This can greatly enhance the user experience for interactive workloads.
Regional Scaling Option
If your compliance and data residency requirements permit, you can also distribute deployments across multiple Azure regions.
This provides:
- Additional quota pools
- Better geographic resilience
- Reduced risk of regional saturation
However, be sure to consider:
- Data residency requirements
- Regulatory constraints
- Cross-region latency
- Governance and monitoring complexity
About PTU
You are correct that Provisioned Throughput Units (PTU) may not be cost-effective at your current scale.
However, PTU remains the only option that provides:
- Guaranteed throughput
- Predictable latency
- Reserved capacity
- Performance consistency under sustained heavy load
For business-critical or highly predictable workloads, it may be worth revisiting as your usage grows.
Key Considerations for Multi-Subscription Scaling
- Ensure each subscription has adequate quota approved
- Centralize monitoring across all deployments
- Standardize RBAC, networking, and security policies
- Implement consolidated billing and cost tracking
- Maintain consistent deployment configurations
I Hope this helps. Do let me know if you have any further queries.
If this answers your query, please do click Accept Answer and Yes for was this answer helpful.
Thank you!