
Degrading performance of AI Foundry Models Over Time

Animesh Aditha 20 Reputation points
2026-04-24T06:59:09.8066667+00:00

We deployed gpt-5-mini on Azure AI Foundry, and its tokens-per-second throughput has degraded over time.

For instance, the same job on gpt-5-mini now takes 90-120 seconds on average, where it used to take 60 seconds in the worst case, while newer models like gpt-5.4-mini complete the same job in under 20 seconds.

This looks like an effort to push customers toward newer models, but we would like to confirm whether that is the case and whether such degradation in performance will continue to occur over time, even on the newer models.

Azure OpenAI in Foundry Models

Answer accepted by question author

  1. SRILAKSHMI C 18,225 Reputation points Microsoft External Staff Moderator
    2026-04-25T10:53:00.3933333+00:00

    Hello @Animesh Aditha,

    Thank you for sharing your observations. We understand how important consistent latency is for production workloads.

    To address your primary concern directly: Microsoft does not intentionally degrade the performance of older models to encourage customers to migrate to newer ones. There is no deliberate throttling or artificial slowdown applied to older models.

    What you are seeing is typically the result of a combination of normal platform dynamics, shared-capacity behavior, and the significant efficiency improvements built into newer model generations.

    Why GPT-5 Mini May Appear Slower Over Time

    If your gpt-5-mini deployment is now taking 90–120 seconds for workloads that previously completed in around 60 seconds, several factors may be contributing:

    Shared infrastructure behavior: Standard (Pay-As-You-Go) deployments run on shared compute capacity. As overall demand for a model increases, queueing and reduced tokens-per-second can occur.

    Regional demand fluctuations: Latency can vary based on aggregate usage in your deployment region.

    Workload concurrency: Higher parallel request volumes can increase waiting time.

    Prompt and output characteristics: Token count, reasoning depth, structured outputs, and tool usage all affect response time.

    This is expected behavior for shared-capacity deployments.
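
    If workload concurrency is a suspected factor, one way to test it is to cap the number of in-flight requests on the client side and compare latency at different caps. The following is only a minimal sketch: `send` is a hypothetical placeholder for whatever function issues one request to your deployment, and `max_in_flight` is an illustrative parameter, not an Azure setting.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_bounded(jobs: Sequence[str],
                send: Callable[[str], str],
                max_in_flight: int = 4) -> List[str]:
    """Run `send` over `jobs` with at most `max_in_flight` concurrent calls.

    `send` is a hypothetical stand-in for one request to a deployment.
    """
    # The pool size is the concurrency cap: extra jobs wait in the
    # client-side queue instead of piling up on the service.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, jobs))  # results keep the input order
```

    Running the same batch with a cap of, say, 2 and then 8 should show whether per-request latency grows with parallelism.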

    Why GPT-5.4 Mini Is Significantly Faster

    Newer models such as gpt-5.4-mini are designed with substantial improvements, including:

    • More efficient inference architecture
    • Higher token throughput
    • Lower latency under concurrent workloads
    • Better optimization for tool use and reasoning workflows

    This is why the same workload may complete in under 20 seconds on gpt-5.4-mini while taking significantly longer on earlier mini models.

    These gains reflect normal platform evolution, not degradation of older models.

    How to Investigate Current Performance

    We recommend reviewing your Azure AI Foundry metrics and logs:

    Navigate to Azure Portal → Monitor → Metrics / Logs

    Review:

    • Request volume
    • Throttling events (HTTP 429)
    • Time to first token
    • Tokens per second
    • End-to-end latency
    • Error rates

    This can help identify whether increased latency correlates with higher demand or quota constraints.
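
    If you also want client-side numbers to compare against the portal metrics, time to first token, tokens per second, and end-to-end latency can be derived from a streamed response. The sketch below is illustrative only: `chunks` is a hypothetical stand-in for a streaming response in which each item is the number of output tokens in one chunk, and `measure_stream` and `StreamTiming` are names invented for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    time_to_first_token_s: Optional[float]  # None if the stream produced nothing
    end_to_end_s: float
    output_tokens: int
    tokens_per_second: float  # output tokens / time from first token to end


def measure_stream(chunks: Iterable[int],
                   clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a stream of per-chunk token counts and time it.

    `chunks` is a hypothetical placeholder for a streaming response.
    """
    start = clock()
    first = None
    total = 0
    for n in chunks:
        now = clock()
        if first is None and n > 0:
            first = now  # arrival of the first non-empty chunk
        total += n
    end = clock()
    # The rate uses the window from the first token to the end, so queueing
    # delay before the first token does not distort the generation rate.
    window = (end - first) if first is not None else 0.0
    rate = total / window if window > 0 else 0.0
    ttft = (first - start) if first is not None else None
    return StreamTiming(ttft, end - start, total, rate)
```

    A long time to first token with a normal generation rate suggests queueing, while a low generation rate points at throughput.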

    Recommendations to Improve and Stabilize Performance

    1. Consider Provisioned Throughput Units (PTU)

    For workloads requiring predictable latency and consistent throughput, PTU is the recommended option.

    Benefits include:

    • Reserved dedicated capacity
    • Stable token generation rates
    • Reduced latency variability
    • Better performance under sustained load

    Standard PAYG deployments do not provide latency guarantees.

    2. Implement Load Balancing Across Deployments or Regions

    To reduce the impact of localized capacity constraints:

    • Deploy across multiple regions and/or subscriptions
    • Use Azure Front Door, Azure Traffic Manager, or Azure API Management to route traffic between them

    This helps distribute load and improve resiliency.
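
    The same idea can also be approximated in application code. The sketch below is a client-side failover loop, not the gateway services named above: it tries each deployment in turn and backs off with jitter when all of them return HTTP 429. `send`, `Throttled`, and the deployment names are hypothetical placeholders for your own request function and error handling.

```python
import random
import time
from typing import Callable, Sequence


class Throttled(Exception):
    """Hypothetical error raised by `send` when a deployment returns HTTP 429."""


def call_with_failover(deployments: Sequence[str],
                       send: Callable[[str], str],
                       max_rounds: int = 3,
                       base_delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each deployment in order; back off between rounds if all throttle."""
    if not deployments:
        raise ValueError("at least one deployment is required")
    for round_no in range(max_rounds):
        for name in deployments:
            try:
                return send(name)
            except Throttled:
                continue  # this deployment is saturated, try the next one
        # Every deployment throttled: wait with exponential backoff and jitter
        # before the next round (no wait after the final round).
        if round_no < max_rounds - 1:
            sleep(base_delay_s * (2 ** round_no) * random.uniform(0.5, 1.5))
    raise Throttled(f"all deployments throttled after {max_rounds} rounds")
```

    For example, `call_with_failover(["primary", "secondary"], send)` returns the first successful response and only sleeps when both deployments are throttled.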

    3. Verify API Version

    Ensure you are using the latest supported API version, as newer versions often include performance improvements, reliability enhancements, and bug fixes.

    Using the most current API version is strongly recommended.

    4. Review Deployment Update Policy

    For your deployment, consider the available update settings:

    • Auto-update to default – Automatically receives the latest default model improvements
    • Upgrade when expired – Remains on the current version until retirement

    This allows you to balance stability with access to performance enhancements.

    Will Newer Models Also Slow Down Over Time?

    Not inherently.

    While latency can fluctuate in shared-capacity environments due to demand, systematic long-term degradation is not expected. In fact:

    • Platform optimizations continue over time
    • Newer runtime improvements are regularly introduced
    • Provisioned deployments offer highly consistent performance

    Future model generations will likely continue to improve in latency, throughput, and efficiency.

    Microsoft does not intentionally slow older models.

    The observed latency increase is most likely due to shared-capacity dynamics and growing demand.

    gpt-5.4-mini is significantly faster because of architectural and runtime improvements.

    To improve consistency, consider:

    • Migrating to gpt-5.4-mini
    • Using PTU for guaranteed performance
    • Load balancing across regions or deployments
    • Monitoring throughput and throttling metrics
    • Using the latest API version

    Please refer to the following:

    Provisioned Throughput Units for Azure AI Foundry: https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput?tabs=global-ptum

    Azure OpenAI Service Model Version Deprecation & Auto-update Policies: https://learn.microsoft.com/azure/ai-services/openai/how-to/working-with-models?tabs=powershell#model-updates

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please do click Accept Answer and Yes for was this answer helpful.

    Thank you!


1 additional answer

  1. Q&A Assist
    2026-04-24T06:59:22.67+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    The provided information only covers model availability, retirement timelines, and high-level guidance on choosing models and reasoning levels. It does not contain any data or statements about intentional performance degradation of specific models over time, nor about service-level guarantees or policies related to throttling older models.

