Strange latency issue in Azure OpenAI models

Gaurav Wagh 0 Reputation points
2025-03-17T11:07:34.8466667+00:00

I am using the Azure OpenAI GPT-4o and GPT-4o-mini models in my backend, and I am seeing some strange behavior.
When I send the first network request with a small, simple prompt to the model, it takes around 15 seconds to respond. If I send a second request with the same prompt to the same model, it returns the response in about 200 milliseconds.

Can anyone tell me why this is happening? Why does a request that normally takes a few hundred milliseconds take 14-15 seconds to execute on the first network hit?

This 14-15 second delay on the initial request hurts the user experience, so I want to eliminate it.

The same thing happens with both:

  • the Azure OpenAI endpoint
  • the OpenAI chat completions endpoint
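
For reference, here is roughly how I am timing the requests (a minimal sketch using the openai Python SDK; the endpoint, key, and deployment name are placeholders):

```python
import time
from openai import AzureOpenAI

# Placeholder endpoint/key/deployment -- substitute your own values.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

def timed_request(prompt: str) -> float:
    """Send one chat completion and return the elapsed time in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",  # deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

print(f"first request:  {timed_request('Say hello.'):.2f}s")   # ~15 s
print(f"second request: {timed_request('Say hello.'):.2f}s")   # ~0.2 s
```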

2 answers

  1. Azar 29,520 Reputation points MVP Volunteer Moderator
    2025-03-17T17:13:14.6133333+00:00

    Hi there Gaurav Wagh,

    Thanks for using the Q&A platform.

    The first request delay in Azure OpenAI models is likely due to cold start latency. When a model isn't actively in use, it takes time to load resources and initialize before processing the request, leading to a 14-15 second delay. Subsequent requests are faster because the model remains warm.

    To reduce this, try keeping the model active by sending periodic lightweight requests, or use provisioned (reserved) capacity for consistent performance. Also check region availability, network latency, and service quotas.
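
    For example, a minimal keep-alive loop might look like this (a sketch assuming the openai Python SDK; the deployment name and the 5-minute interval are illustrative choices, not official guidance):

    ```python
    import time
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        api_key="<your-api-key>",                                   # placeholder
        api_version="2024-06-01",
    )

    def keep_warm(deployment: str = "gpt-4o", interval_seconds: int = 300) -> None:
        """Periodically send a tiny request so the deployment stays responsive."""
        while True:
            try:
                client.chat.completions.create(
                    model=deployment,
                    messages=[{"role": "user", "content": "ping"}],
                    max_tokens=1,  # keep the warm-up call as cheap as possible
                )
            except Exception as exc:  # don't let a transient failure kill the loop
                print(f"warm-up request failed: {exc}")
            time.sleep(interval_seconds)
    ```

    In practice you would run this in a background thread or a scheduled job rather than blocking your main process, and weigh the extra token cost against the latency win.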

    If this helps, kindly accept the answer. Thanks much!


  2. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-03-17T20:21:10.59+00:00

    Hi Gaurav Wagh,

    Here are a few potential causes for this behavior:

    • Regional Load can occur due to increased demand, maintenance, or unexpected operational constraints, causing temporary slowdowns.
    • Configuration Differences between regions, such as variations in hardware, resource allocation, or deployment settings, may result in inconsistent performance.
    • Cold Start Latency occurs when the model has been idle for some time, requiring Azure to allocate resources before processing the first request, leading to delays.

    To address the issue, consider these steps:

    • Monitor Regional Service Health using tools like the Azure Service Health dashboard to identify ongoing issues or incidents in the affected region. Proactive monitoring and routing traffic to alternate regions during peak times can help mitigate latency concerns effectively.
    • Send Periodic Warm-Up Requests by making lightweight API calls every few minutes to keep the model active and reduce cold start delays.
    • Use Provisioned Throughput SKU to allocate dedicated resources, ensuring consistent performance without delays from resource allocation.
    • Optimize API Authentication by caching authentication tokens (or reusing a single credential and client instance) to avoid extra processing time on the initial request; see the sketch after this list.
    • Monitor Resource Utilization using Azure Monitor and Application Insights to detect and mitigate performance fluctuations proactively.
    • If the issue is intermittent, it could be due to a temporary network or server problem. In that case, try again later to see if it has been resolved.
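
    To illustrate the token-caching point, here is a minimal sketch using azure-identity with Microsoft Entra ID authentication. The key idea is to create the credential and client once at startup and reuse them, so cached tokens are reused instead of being fetched on every call (resource and deployment names are placeholders):

    ```python
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # Create the credential once; azure-identity caches tokens in memory
    # and refreshes them only when they are close to expiry.
    credential = DefaultAzureCredential()
    token_provider = get_bearer_token_provider(
        credential, "https://cognitiveservices.azure.com/.default"
    )

    # Reuse this single client for all requests instead of building a new
    # one (and re-authenticating) on every call.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
        azure_ad_token_provider=token_provider,
        api_version="2024-06-01",
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)
    ```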

    Kindly refer to this documentation: Performance and latency.

    I hope this helps. Do let me know if you have any further queries.

