Request for Assistance: Critical Slowness When Calling Azure LLMs

GRT 0 Reputation points
2025-05-19T16:45:52.6733333+00:00

For the past couple of days, we've been experiencing significant slowness when calling Azure LLMs from our software. This issue affects both the development "nightly" version and the stable version that has been installed on the staging server for the past few weeks. The production version, which uses the "raw" OpenAI model (soon to be replaced by the Azure version), does not exhibit this slowness.

Requests to the Azure LLMs now take considerably longer than before, causing our frontend to time out. The slowness affects all calls, although those that were previously fast are still relatively quicker than the rest. The delay is most pronounced with the o3-mini model, which is inherently slower. The calls typically do not return errors; they simply take a long time to respond.

I have tried updating all libraries and enabling the latest preview API version (2025-04-01-preview) on the development version, but this did not resolve the issue. Restarting the program also had no effect, and there are no signs of abnormal resource usage.
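
For context, here is a minimal sketch of how a single call can be timed against the Azure deployment (this is not our actual code; the endpoint, key, and deployment name are placeholders, and the explicit timeout is only an assumption to keep the SDK from giving up before the frontend does):

```python
# Minimal latency probe (a sketch, not real application code). The endpoint,
# key, and deployment name are placeholders for illustration only.
import time
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",  # the preview version mentioned above
    timeout=120.0,  # assumed value, raised so the SDK outlives the frontend timeout
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="o3-mini",  # Azure *deployment* name (assumed)
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
)
print(f"round-trip: {time.perf_counter() - start:.1f}s, "
      f"total tokens: {response.usage.total_tokens}")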

Your urgent assistance would be greatly appreciated.

Azure AI Bot Service
An Azure service that provides an integrated environment for bot development.

1 answer

  1. Manas Mohanty 6,370 Reputation points Microsoft External Staff Moderator
    2025-06-09T10:15:39.53+00:00

    Hi GRT,

    Here is a summary of the case.

    Issue: intermittent slowness in SQL LLM responses.

    Suggestions shared:

    - Use alternate, simpler LLMs to reduce complexity.
    - Adjust the model deployment configuration (not adopted, as the customer has quota restrictions).
    - Load balance across multiple deployments if latency is higher in one of them (not adopted because of the production environment; see the sketch after this list).
    - Opt for provisioned throughput units (PTUs).
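
    To illustrate the load-balancing suggestion above, a minimal client-side sketch could rotate requests across two deployments of the same model under one resource. The deployment names, endpoint, and key below are placeholders, and this only shows the idea; a gateway such as Azure API Management is the more robust option.

```python
# Client-side round-robin sketch across two Azure OpenAI deployments of the
# same model under one resource. All names and keys below are placeholders.
from itertools import cycle
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",
)

# Hypothetical deployment names; requests alternate between them so one slow
# deployment does not absorb all of the traffic.
deployments = cycle(["o3-mini-a", "o3-mini-b"])

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=next(deployments),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```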

    From the PG (product group) side:

    - The PG fixed the rate limits and load issues.

    Observations from the support side:

    - Inference time is under 30 seconds for 60 k tokens, and under 1 minute for the customer's SQL LLM.

    That said, we request you to opt for slight optimizations in the model deployments and configuration rather than relying entirely on backend efficiency.
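
    As one illustrative example of a configuration-side optimization (a sketch only, assuming the deployed model supports streaming): streaming the response lets the frontend start rendering tokens before the full completion finishes, which can help avoid timeouts even when total inference time stays the same.

```python
# Streaming sketch: tokens are forwarded as they arrive instead of waiting for
# the full completion. Endpoint, key, and deployment name are placeholders.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2025-04-01-preview",
)

stream = client.chat.completions.create(
    model="o3-mini",  # assumed deployment name
    messages=[{"role": "user", "content": "Summarize the latency findings."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```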

    Thank you.

