Strange latency issue in Azure OpenAI models

Gaurav Wagh 0 Reputation points
2025-03-17T11:07:34.8466667+00:00

I am using the Azure OpenAI GPT-4o and GPT-4o-mini models in my backend, and I am seeing some strange behavior.
When I send the first network request with a small, simple prompt to the model, it takes around 15 seconds to respond. If I send a second request with the same prompt to the same model, it returns the response in about 200 milliseconds.

Can anyone tell me why this is happening? Why does a request that normally takes a few hundred milliseconds take 14-15 seconds to execute on the first network hit?

This 14-15 second delay on the initial request hurts the user experience, so I want to eliminate it.

The same thing happens with both:

  • the Azure OpenAI endpoint
  • the OpenAI chat completions endpoint
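
For reference, here is roughly how I am timing the requests (a minimal sketch using the openai Python SDK; the endpoint, key, and deployment name are placeholders):

```python
import time
from openai import AzureOpenAI

# Placeholder endpoint/key/deployment -- substitute your own values.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

def timed_request(prompt: str) -> float:
    """Send one chat completion and return the elapsed time in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",  # deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

print(f"first request:  {timed_request('Say hello.'):.2f}s")   # ~15 s
print(f"second request: {timed_request('Say hello.'):.2f}s")   # ~0.2 s
```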

2 answers

  1. Azar 29,520 Reputation points MVP Volunteer Moderator
    2025-03-17T17:13:14.6133333+00:00

    Hi there Gaurav Wagh,

    Thanks for using the Q&A platform.

    The first request delay in Azure OpenAI models is likely due to cold start latency. When a model isn't actively in use, it takes time to load resources and initialize before processing the request, leading to a 14-15 second delay. Subsequent requests are faster because the model remains warm.

    To reduce this, try keeping the model active by sending periodic lightweight requests, or use provisioned (reserved) capacity for consistent performance. Also check region availability, network latency, and service quotas.
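
    For example, a minimal keep-alive loop might look like this (a sketch assuming the openai Python SDK; the deployment name and the 5-minute interval are illustrative choices, not official guidance):

    ```python
    import time
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-api-key>",                                   # placeholder
        api_version="2024-06-01",
    )

    def keep_warm(deployment: str = "gpt-4o", interval_seconds: int = 300) -> None:
        """Periodically send a tiny request so the deployment stays responsive."""
        while True:
            try:
                client.chat.completions.create(
                    model=deployment,
                    messages=[{"role": "user", "content": "ping"}],
                    max_tokens=1,  # keep the warm-up call as cheap as possible
                )
            except Exception as exc:  # don't let a transient failure kill the loop
                print(f"warm-up request failed: {exc}")
            time.sleep(interval_seconds)
    ```

    In practice you would run this in a background thread or a scheduled job rather than blocking your main process, and weigh the extra token cost against the latency win.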

    If this helps, kindly accept the answer. Thanks much!


  2. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-03-17T20:21:10.59+00:00

    Hi Gaurav Wagh,

    Here are a few potential causes for this behavior:

    • Regional Load can occur due to increased demand, maintenance, or unexpected operational constraints, causing temporary slowdowns.
    • Configuration Differences between regions, such as variations in hardware, resource allocation, or deployment settings, may result in inconsistent performance.
    • Cold Start Latency occurs when the model has been idle for some time, requiring Azure to allocate resources before processing the first request, leading to delays.

    To address the issue, consider these steps:

    • Monitor Regional Service Health using tools like the Azure Service Health dashboard to identify ongoing issues or incidents in the affected region. Proactive monitoring and routing traffic to alternate regions during peak times can help mitigate latency concerns effectively.
    • Send Periodic Warm-Up Requests by making lightweight API calls every few minutes to keep the model active and reduce cold start delays.
    • Use Provisioned Throughput SKU to allocate dedicated resources, ensuring consistent performance without delays from resource allocation.
    • Optimize API Authentication by caching authentication tokens (or reusing a single credential and client instance) to avoid extra processing time on the initial request; see the sketch after this list.
    • Monitor Resource Utilization using Azure Monitor and Application Insights to detect and mitigate performance fluctuations proactively.
    • If the issue is intermittent, it could be due to a temporary network or server problem. In that case, try again later to see if it has been resolved.
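
    To illustrate the token-caching point, here is a minimal sketch using azure-identity with Microsoft Entra ID authentication. The key idea is to create the credential and client once at startup and reuse them, so cached tokens are reused instead of being fetched on every call (resource and deployment names are placeholders):

    ```python
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # Create the credential once; azure-identity caches tokens in memory
    # and refreshes them only when they are close to expiry.
    credential = DefaultAzureCredential()
    token_provider = get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    )

    # Reuse this single client for all requests instead of building a new
    # one (and re-authenticating) on every call.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        azure_ad_token_provider=token_provider,
        api_version="2024-06-01",
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)
    ```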

    Kindly refer to this documentation: Performance and latency.

    I hope this helps. Do let me know if you have any further queries.

