Request for Assistance: Critical Slowness When Calling Azure LLMs

GRT 0 Reputation points
2025-05-19T16:45:52.6733333+00:00

For the past couple of days, we've been experiencing significant slowness when calling Azure LLMs from our software. This issue affects both the development "nightly" version and the stable version that has been installed on the staging server for the past few weeks. The production version, which uses the "raw" OpenAI model (soon to be replaced by the Azure version), does not exhibit this slowness.

Requests to the Azure LLMs now take considerably longer than before, causing our frontend to time out. The slowness affects all calls, although those that were previously fast are still relatively quicker than the rest. The delay is most pronounced with the o3-mini model, which is inherently slower. The calls typically do not return errors; they simply take a long time to respond.

I have tried updating all libraries and enabling the latest preview API version (2025-04-01-preview) on the development version, but this did not resolve the issue. Restarting the program also had no effect, and there are no signs of abnormal resource usage.
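
For context, here is a minimal sketch of how a single call can be timed against the Azure deployment (this is not our actual code; the endpoint, key, and deployment name are placeholders, and the explicit timeout is only an assumption to keep the SDK from giving up before the frontend does):

```python
# Minimal latency probe (a sketch, not real application code). The endpoint,
# key, and deployment name are placeholders for illustration only.
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",  # the preview version mentioned above
    timeout=120.0,  # assumed value, raised so the SDK outlives the frontend timeout
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="o3-mini",  # Azure *deployment* name (assumed)
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
)
print(f"round-trip: {time.perf_counter() - start:.1f}s, "
      f"total tokens: {response.usage.total_tokens}")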

Your urgent assistance would be greatly appreciated.

Azure AI Bot Service
An Azure service that provides an integrated environment for bot development.

1 answer

  1. Manas Mohanty 6,370 Reputation points Microsoft External Staff Moderator
    2025-06-09T10:15:39.53+00:00

    Hi GRT,

    Here is a summary of the case.

    Issue: intermittent slowness in SQL LLM responses.

    Suggestions shared:

    - Use alternate, simpler LLMs to reduce complexity.
    - Adjust the model deployment configuration (not adopted, as the customer has quota restrictions).
    - Load balance across multiple deployments if latency is higher in one of them (not adopted because of the production environment; see the sketch after this list).
    - Opt for provisioned throughput units (PTUs).
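
    To illustrate the load-balancing suggestion above, a minimal client-side sketch could rotate requests across two deployments of the same model under one resource. The deployment names, endpoint, and key below are placeholders, and this only shows the idea; a gateway such as Azure API Management is the more robust option.

```python
# Client-side round-robin sketch across two Azure OpenAI deployments of the
# same model under one resource. All names and keys below are placeholders.
from itertools import cycle
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",
)

# Hypothetical deployment names; requests alternate between them so one slow
# deployment does not absorb all of the traffic.
deployments = cycle(["o3-mini-a", "o3-mini-b"])

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=next(deployments),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```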

    From the PG (product group) side:

    - The PG fixed the rate limits and load issues.

    Observations from the support side:

    - Inference time is under 30 seconds for 60 k tokens, and under 1 minute for the customer's SQL LLM.

    That said, we request you to opt for slight optimizations in the model deployments and configuration rather than relying entirely on backend efficiency.
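
    As one illustrative example of a configuration-side optimization (a sketch only, assuming the deployed model supports streaming): streaming the response lets the frontend start rendering tokens before the full completion finishes, which can help avoid timeouts even when total inference time stays the same.

```python
# Streaming sketch: tokens are forwarded as they arrive instead of waiting for
# the full completion. Endpoint, key, and deployment name are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",
)

stream = client.chat.completions.create(
    model="o3-mini",  # assumed deployment name
    messages=[{"role": "user", "content": "Summarize the latency findings."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```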

    Thank you.

