Hello @Animesh Aditha,
Thank you for sharing your observations. We understand how important consistent latency is for production workloads.
To address your primary concern directly: Microsoft does not intentionally degrade the performance of older models to encourage customers to migrate to newer ones. There is no deliberate throttling or artificial slowdown applied to older models.
What you are seeing is typically the result of a combination of normal platform dynamics, shared-capacity behavior, and the significant efficiency improvements built into newer model generations.
Why GPT-5 Mini May Appear Slower Over Time
If your gpt-5-mini deployment is now taking 90–120 seconds for workloads that previously completed in around 60 seconds, several factors may be contributing:
- Shared infrastructure behavior: Standard (Pay-As-You-Go) deployments run on shared compute capacity. As overall demand for a model increases, queueing and reduced tokens-per-second can occur.
- Regional demand fluctuations: Latency can vary based on aggregate usage in your deployment region.
- Workload concurrency: Higher parallel request volumes can increase waiting time.
- Prompt and output characteristics: Token count, reasoning depth, structured outputs, and tool usage all affect response time.
This is expected behavior for shared-capacity deployments.
Why GPT-5.4 Mini Is Significantly Faster
Newer models such as gpt-5.4-mini are designed with substantial improvements, including:
- More efficient inference architecture
- Higher token throughput
- Lower latency under concurrent workloads
- Better optimization for tool use and reasoning workflows
This is why the same workload may complete in under 20 seconds on gpt-5.4-mini while taking significantly longer on earlier mini models.
These gains reflect normal platform evolution, not degradation of older models.
How to Investigate Current Performance
We recommend reviewing your Azure AI Foundry metrics and logs:
Navigate to Azure Portal → Monitor → Metrics / Logs
Review:
- Request volume
- Throttling events (HTTP 429)
- Time to first token
- Tokens per second
- End-to-end latency
- Error rates
This can help identify whether increased latency correlates with higher demand or quota constraints.
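If you also want to cross-check these metrics from the client side, the sketch below shows one way to derive time to first token, end-to-end latency, and tokens per second from the arrival times of streamed response chunks. It is only an illustration: `summarize_stream`, `LatencyStats`, and the `(arrival_time, token_count)` input format are hypothetical names chosen for this example and are not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class LatencyStats:
    time_to_first_token: float  # seconds from request start to first chunk
    end_to_end_latency: float   # seconds from request start to last chunk
    tokens_per_second: float    # tokens after the first chunk / time since first chunk


def summarize_stream(
    request_start: float, chunks: Iterable[Tuple[float, int]]
) -> LatencyStats:
    """Summarize one streamed response.

    chunks: (arrival_time_seconds, token_count) pairs in arrival order.
    This input shape is an assumption; adapt it to however you log chunks.
    """
    chunk_list = list(chunks)
    if not chunk_list:
        raise ValueError("no chunks received")

    first_time = chunk_list[0][0]
    last_time = chunk_list[-1][0]

    # Tokens in the first chunk are excluded so that the rate reflects
    # generation speed only, not the initial queueing delay.
    tokens_after_first = sum(count for _, count in chunk_list[1:])
    generation_time = last_time - first_time
    rate = tokens_after_first / generation_time if generation_time > 0 else 0.0

    return LatencyStats(
        time_to_first_token=first_time - request_start,
        end_to_end_latency=last_time - request_start,
        tokens_per_second=rate,
    )
```

Logging these values per request over several days makes it easier to tell whether a slowdown comes from longer queueing (rising time to first token) or from slower generation (falling tokens per second).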
Recommendations to Improve and Stabilize Performance
1. Consider Provisioned Throughput Units (PTU)
For workloads requiring predictable latency and consistent throughput, PTU is the recommended option.
Benefits include:
- Reserved dedicated capacity
- Stable token generation rates
- Reduced latency variability
- Better performance under sustained load
Standard PAYG deployments do not provide latency guarantees.
2. Implement Load Balancing Across Deployments or Regions
To reduce the impact of localized capacity constraints:
- Deploy across multiple regions and/or subscriptions
- Use Azure Front Door, Azure Traffic Manager, or Azure API Management to route requests between them
This helps distribute load and improve resiliency.
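As a rough illustration of the idea, the sketch below tries a list of deployments in order and moves to the next one when a request is throttled (HTTP 429) or hits a server error. Note that this is simple client-side failover, not the gateway-level routing that Azure Front Door, Traffic Manager, or API Management provide. The function `call_with_failover` and the `send` callback are hypothetical; `send` stands in for whatever code you use to call a deployment and is assumed to return a `(status_code, body)` pair.

```python
from typing import Callable, Optional, Sequence, Tuple


def call_with_failover(
    endpoints: Sequence[str],
    send: Callable[[str], Tuple[int, str]],
) -> Tuple[str, str]:
    """Try each endpoint in order and return (endpoint, body) on first success.

    `send` is a placeholder for your own request function (an assumption of
    this sketch); it must return (http_status, response_body).
    """
    last_status: Optional[int] = None
    for endpoint in endpoints:
        status, body = send(endpoint)
        if status == 200:
            return endpoint, body
        if status == 429 or status >= 500:
            # Throttled or temporarily unavailable: try the next deployment.
            last_status = status
            continue
        # Other client errors (for example 400 or 401) will not be fixed
        # by switching deployments, so fail immediately.
        raise RuntimeError(f"non-retryable status {status} from {endpoint}")
    raise RuntimeError(f"all endpoints failed; last status {last_status}")
```

In a production setup you would typically combine this with retry delays and let a gateway service handle the routing, but the same decision logic applies: fail over on throttling and server errors, and surface other errors directly.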
3. Verify API Version
We strongly recommend using the latest supported API version, as newer versions often include performance improvements, reliability enhancements, and bug fixes.
4. Review Deployment Update Policy
For your deployment, consider the available update settings:
- Auto-update to default – Automatically receives the latest default model improvements
- Upgrade when expired – Remains on the current version until retirement
This allows you to balance stability with access to performance enhancements.
Will Newer Models Also Slow Down Over Time?
Not inherently.
While latency can fluctuate in shared-capacity environments due to demand, systematic long-term degradation is not expected. In fact:
- Platform optimizations continue over time
- Newer runtime improvements are regularly introduced
- Provisioned deployments offer highly consistent performance
Future model generations will likely continue to improve in latency, throughput, and efficiency.
Summary
- Microsoft does not intentionally slow older models.
- The observed latency increase is most likely due to shared-capacity dynamics and growing demand.
- gpt-5.4-mini is significantly faster because of architectural and runtime improvements.
To improve consistency, consider:
- Migrating to gpt-5.4-mini
- Using PTU for guaranteed performance
- Load balancing across regions or deployments
- Monitoring throughput and throttling metrics
- Using the latest API version
Please refer to the following documentation:
Provisioned Throughput Units for Azure AI Foundry: https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput?tabs=global-ptum
Azure OpenAI Service Model Version Deprecation & Auto-update Policies: https://learn.microsoft.com/azure/ai-services/openai/how-to/working-with-models?tabs=powershell#model-updates
I hope this helps. Do let me know if you have any further queries.
If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".
Thank you!