Azure OpenAI latency increased to 2 minutes

prasanna kumar 0 Reputation points
2026-02-11T17:50:14.91+00:00

My Azure OpenAI deployment has an increased latency of about 2 minutes for a single request. Why is that, and how can I fix it?

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

4 answers

Sort by: Most helpful
  1. om gaikwad 5 Reputation points
    2026-02-11T19:12:36.0666667+00:00

    High latency (around 2 minutes per request) in Azure OpenAI is usually caused by one of the following:

    Common Reasons:

    • Large prompt size or high max_tokens value

    • Using GPT-4 instead of faster models like GPT-4o or GPT-35-turbo

    • Low TPM (Tokens Per Minute) quota causing throttling

    • Regional capacity issues (requests getting queued)

    • Network distance between your app and the Azure region

    What You Can Do:

    • Reduce max_tokens to the minimum required

    • Enable streaming responses to improve perceived speed

    • Check Azure Metrics for latency and throttled requests

    • Increase your TPM quota or scale your deployment

    • Test with GPT-4o or GPT-35-turbo if performance is critical

    If possible, please share your model name, region, and token usage so the exact bottleneck can be identified.
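    Of the tips above, streaming is usually the quickest win: tokens are shown as they arrive instead of after the full completion is generated. Below is a minimal sketch using the `openai` Python SDK (v1+) against Azure OpenAI; the deployment name, API version, and environment variables are placeholders to replace with your own:

```python
import os
import time
from typing import Iterable, Tuple

def consume_stream(chunks: Iterable[str]) -> Tuple[str, float]:
    """Collect streamed text chunks, printing them as they arrive.

    Returns (full_text, seconds_until_first_chunk) so you can see how much
    streaming improves perceived latency versus total latency.
    """
    start = time.perf_counter()
    ttft = 0.0  # time to first token
    parts = []
    for text in chunks:
        if not parts:
            ttft = time.perf_counter() - start
        parts.append(text)
        print(text, end="", flush=True)
    return "".join(parts), ttft

if __name__ == "__main__" and "AZURE_OPENAI_ENDPOINT" in os.environ:
    # Live call -- requires the `openai` package (>=1.0) and real credentials.
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",  # placeholder; use a version your resource supports
    )
    stream = client.chat.completions.create(
        model="my-gpt-4o-deployment",  # hypothetical deployment name
        messages=[{"role": "user", "content": "Say hello."}],
        max_tokens=64,   # keep this as low as the task allows
        stream=True,     # tokens arrive incrementally
    )
    text, ttft = consume_stream(
        chunk.choices[0].delta.content
        for chunk in stream
        if chunk.choices and chunk.choices[0].delta.content
    )
    print(f"\nFirst token after {ttft:.2f}s")
```

    Even when total generation still takes a while, a short time-to-first-token tells you the deployment is responsive and the delay is generation time, not queuing or throttling.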

    1 person found this answer helpful.
    0 comments No comments

  2. Anshika Varshney 8,280 Reputation points Microsoft External Staff Moderator
    2026-02-12T19:14:18.57+00:00

    Hi prasanna kumar,

    You’re not alone: this kind of ~2‑minute latency has been reported by multiple users on Azure OpenAI recently, and it’s usually not caused by a single factor.

    Based on community experience, the most common contributors are:

    • Model choice: Some models (especially heavier or reasoning-focused ones) have noticeably higher response times compared to faster options like GPT‑4o or GPT‑35‑Turbo.
    • Token limits and prompt size: Large prompts or high max_tokens values can significantly increase generation time.
    • TPM throttling: Even when deployments show higher limits, effective Tokens‑Per‑Minute can be lower, causing requests to queue.
    • Regional capacity constraints: Certain regions intermittently experience higher load, which results in delayed responses rather than outright failures.

    Things that typically help narrow this down:

    • Check Azure Metrics for throttling and latency at the deployment level.
    • Temporarily test with a smaller prompt and lower max_tokens to see if latency improves.
    • Compare behavior by switching to a different model or region, if possible.
    • Enable streaming responses to improve perceived latency, especially for longer outputs.

    I hope this helps. Do let me know if you have any further queries.
    Thank you!
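    To make the "compare a smaller prompt, a different model, or a different region" suggestion concrete, a small timing harness helps isolate the bottleneck. This is only a sketch, not an official tool; `call_gpt4` and `call_gpt4o` in the usage comment stand in for your own request functions against hypothetical deployments:

```python
import time
from statistics import median
from typing import Callable, Dict, List

def time_calls(fn: Callable[[], object], runs: int = 3) -> Dict[str, float]:
    """Invoke `fn` several times and report median and worst wall-clock latency in seconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # e.g. a chat.completions.create call with a fixed prompt
        samples.append(time.perf_counter() - start)
    return {"median_s": median(samples), "worst_s": max(samples)}

# Hypothetical usage -- same prompt, two deployments, compare the numbers:
# print("gpt-4:  ", time_calls(lambda: call_gpt4("Summarize this document...")))
# print("gpt-4o: ", time_calls(lambda: call_gpt4o("Summarize this document...")))
```

    If the median is consistently high but the worst case is close to it, the delay is likely generation time (model choice, max_tokens); if latency is spiky, queuing or TPM throttling is the more likely culprit.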

    0 comments No comments

  3. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.


    Comments have been turned off.

  4. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.


    Comments have been turned off.
