GPT-4o Deployment in West Europe - Severe Latency Issues

Question

GPT-4o Deployment in West Europe - Severe Latency Issues

Guillaume Lameyse 25

Hi,

We rely heavily on Azure’s GPT-4o deployment as a key component of our application. Our model deployment (August 2024 version) is hosted in the West Europe region. Typically, our API calls take:

5 seconds for small requests
45–90 seconds for larger requests

However, today we experienced severe performance degradation:

Small calls took 50 seconds instead of 5 (10x increase).
Large calls never returned a response at all—there were no errors, just no response.

We tested the same model using OpenAI's API, and it performed as expected, so the issue appears specific to our Azure deployment.

Additional Details:

We were well below our tokens per minute (TPM) and requests per minute (RPM) limits.
No recent changes were made to our application’s logic or request patterns.
This is a critical component for us, and we plan to significantly scale our usage in the coming months, so reliability is a major concern.

Questions:

Is there any way to check the real-time status of our deployment?
Are there known issues or regional limitations affecting West Europe?
What steps can we take to ensure more stable performance and avoid similar incidents?

Rafal Pakoca 5 Reputation points

2025-03-19T17:06:30.0533333+00:00

We have the same issue, from time to time requests are running super slow. Like simple "hi" prompt run for 15-20 seconds, were normally the time is <1s
Manas Mohanty 6,265 Reputation points Microsoft External Staff Moderator

2025-03-20T04:16:11.6066667+00:00

Hi Rafal Pakoca

You can switch to light weight version of GPT4o which is GPT4o mini to get lower inference time.

Other option to create multizone deployment and route requests through APIM or load balancer.

Reference-Build an Enterprise OpenAI deployment with APIM

Thank you.

Accepted answer

0 additional answers

Your answer

Rafal Pakoca 5 Reputation points

2025-03-19T17:06:30.0533333+00:00

We have the same issue, from time to time requests are running super slow. Like simple "hi" prompt run for 15-20 seconds, were normally the time is <1s
Manas Mohanty 6,265 Reputation points Microsoft External Staff Moderator

2025-03-20T04:16:11.6066667+00:00

Hi Rafal Pakoca

You can switch to light weight version of GPT4o which is GPT4o mini to get lower inference time.

Other option to create multizone deployment and route requests through APIM or load balancer.

Reference-Build an Enterprise OpenAI deployment with APIM

Thank you.

Answer 1

Sina Salam 22,031 Volunteer Moderator

Hello Guillaume Lameyse,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are having issues with the GPT-4o Deployment in West Europe - Severe Latency Issues.

I'm sorry to hear about the performance issues you're experiencing with your Azure GPT-4o deployment, this is getting common recently. You can enhance the reliability and efficiency of your Azure GPT-4o deployment with some of the steps to diagnose and improve stability:

First, check the Azure Status page https://status.azure.com/en-us/status to monitor real-time service health. While there are no widely reported issues specific to the West Europe region, performance fluctuations may arise due to demand and infrastructure updates. You can stay informed by referring to the Azure OpenAI Service documentation.

To enhance performance, consider monitoring and analyzing usage with Azure Monitor to detect anomalies. Optimizing requests, such as reducing token generation or enabling streaming, can significantly improve response times, as noted in this discussion similar case - https://learn.microsoft.com/en-us/answers/questions/1696807/gpt-4o-slow-to-complete-after-repeated-runs

Also, deploying in multiple regions helps distribute workload and ensures redundancy and keeping your deployment up to date with the latest model versions, as detailed in this link: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models , can also improve performance. If issues persist, reaching out to Azure Support via your Azure Portal will be a better.

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

Guillaume Lameyse 25 Reputation points

2025-02-24T07:29:34.8966667+00:00
Hello @Sina Salam

Thanks for your answer, giving me already some insights on the issue and possible solutions. I still have some questions following your answer.

In the Azure AI Foundry portal, with our resource located in West Europe, we have a single model deployed using the "global standard" deployment type. This configuration is designed to leverage Azure's global infrastructure, dynamically routing traffic to the data center with the best availability for each request. I understand that this setup should eliminate the need to load balance across multiple resources.

Is it expected that, under the "global standard" deployment, if one data center experiences high load, the system will automatically offload requests to other centers to maintain optimal performance?

Would transitioning to a different deployment type, such as "global provisioned," enhance response consistency and reduce latency variability? Our workload involves high consistent volume, and we require low latency variance. The documentation suggests that for such scenarios, "global provisioned" deployments provide reserved model processing capacity for high and predictable throughput.

If we establish a new Azure AI Foundry resource in another region and deploy the model there, would this approach offer better redundancy and performance compared to solely relying on the "global standard" deployment within our current resource? Additionally, would implementing a load balancing strategy across multiple regional deployments be advisable in our case?

We aim to ensure high availability and consistent performance for our application and would appreciate guidance on the optimal deployment strategy to achieve this.
Sina Salam 22,031 Reputation points Volunteer Moderator

2025-02-24T12:29:51.5966667+00:00
Hello Guillaume Lameyse,

Thank you for your feedback and asking for more.

To ensure your application achieves high availability and consistent performance, the below are best strategies:

1. Global Standard Deployment

Azure's "global standard" deployment dynamically routes traffic to the data center with the best availability. This setup helps maintain optimal performance by offloading requests from overloaded centers. However, it may still experience some latency variability due to the dynamic nature of traffic routing.

2. Global Provisioned Deployment

Switching to a "global provisioned" deployment can enhance response consistency and reduce latency variability. This deployment type provides reserved model processing capacity, ensuring more consistent performance and lower latency variance, especially for high and predictable throughput workloads.

3. Deploying in Another Region

Creating a new Azure AI Foundry resource in another region and deploying the model there can improve redundancy and performance. This approach ensures that if one region faces issues, another can handle the load, providing higher availability. Multiple regional deployments can also reduce latency for users in different geographical locations.

4. Load Balancing Strategy

Implementing a load balancing strategy across multiple regional deployments is advisable. This strategy distributes traffic evenly, preventing any single region from becoming a bottleneck. It also enhances fault tolerance, ensuring your application remains available even if one region experiences downtime.

Recommendations:

Move to a "global provisioned" deployment for consistent performance and low latency variance.

Establish Azure AI Foundry resources in multiple regions to enhance redundancy and performance.

Use a load balancing strategy to distribute traffic across regional deployments, ensuring high availability and consistent performance.

By combining these strategies, you can achieve high availability and consistent performance for your application.
Manas Mohanty 6,265 Reputation points Microsoft External Staff Moderator

2025-02-24T12:37:37.53+00:00
Hi Guillaume Lameyse

Here are the answers to your queries.

"Global standard" deployment will route traffic to other data centers with low latency but not in another zone/region.

"Global provisioned" will have reserved capacity compared to global standard but will follow same routing as of global standard.

You can opt for Data zone standard or provisioned for routing to data center with in different zones.

Yes, deploying in other regions aside west US, eastus2, east us, Sweden central might render increased latency but not guaranteed for all regions in other regions in same region. Reference

Reason behind slow latency might be your deployment has become slow now because of continuously sending complex and long queries or longer outputs. Ideally you should opt for multi-shot learning and use simpler sentences, be clearer and more precise in your prompts.

You can limit your output generation size from max_token or asking in prompt itself (For e.g Please keep it inform of small chunks).

Agree with Sina Salam suggestion on following multi-region deployments.

Reference on multi-region deployment

Hope it helps

Thank you.

Share via

GPT-4o Deployment in West Europe - Severe Latency Issues

0 additional answers

Your answer