
Regular LLM gpt-5-mini latency

Raquel R 0 Reputation points
2026-04-23T13:25:31.5333333+00:00

Hi, I am trying to understand whether the latency I experience when calling an LLM is normal or whether something is wrong.
For a query like the one below, run from a terminal, it takes 3-4+ seconds to get a reply (without streaming).
Is that normal? I was told to expect less than 1s.

AZURE_OPENAI_DEPLOYMENT=gpt-5-mini

curl -sS "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_API_KEY}" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Say hi."},
      {"role": "user", "content": "Say hi."}
    ]
  }' | jq -r '.choices[0].message.content // .error.message'

Azure OpenAI in Foundry Models

2 answers

  1. SRILAKSHMI C 18,225 Reputation points Microsoft External Staff Moderator
    2026-04-23T15:09:57.4366667+00:00

    Hello @Raquel R

    Thanks for sharing the detailed example.

    Is 3–4 seconds normal for gpt-5-mini?

    Yes, for a non-streaming request in a Pay-As-You-Go (PAYG) deployment, an end-to-end latency of ~3–4 seconds is within the expected range, even for a simple prompt like “Say hi.”

    This does not indicate an issue with your deployment.

    Why latency is higher than expected

    The total response time you’re measuring includes multiple components:

    • Network overhead: round-trip time plus TLS handshake (more noticeable with curl).
    • Multitenant routing (PAYG behavior): Azure OpenAI PAYG deployments are best-effort with no strict latency SLA. Requests may be routed to busy compute nodes or to capacity in nearby regions (if needed).
    • Model processing time: tokenization, inference, and sequential token generation.
    • Content filtering layer: applied before the response is returned.
    • Non-streaming response mode: the API waits until the entire completion is generated before returning anything.
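One way to see how much of the 3-4 seconds is connection setup rather than time spent waiting on the service is curl's built-in `--write-out` timers. The following is a minimal sketch, assuming the same `AZURE_OPENAI_*` environment variables as in the question; `time_request` and `summarize_timing` are hypothetical helper names, not part of any Azure tooling.

```shell
# curl --write-out variables (all in seconds, measured from the start of the transfer).
TIMING_FORMAT='dns=%{time_namelookup} tcp=%{time_connect} tls=%{time_appconnect} first_byte=%{time_starttransfer} total=%{time_total}\n'

# Sends the request from the question, discards the body, and prints one timing line.
time_request() {
  curl -sS -o /dev/null -w "$TIMING_FORMAT" \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d '{"messages": [{"role": "system", "content": "You are a helpful assistant. Say hi."}, {"role": "user", "content": "Say hi."}]}'
}

# Reduces a timing line to three numbers:
#   setup = time until the TLS handshake finished (DNS + TCP + TLS)
#   wait  = time from the end of the handshake to the first response byte
#   total = end-to-end time
summarize_timing() {
  LC_ALL=C awk '{
    for (i = 1; i <= NF; i++) { split($i, kv, "="); t[kv[1]] = kv[2] }
    printf "setup=%.3f wait=%.3f total=%.3f\n", t["tls"], t["first_byte"] - t["tls"], t["total"]
  }'
}
```

Running `time_request | summarize_timing` a few times should show whether `setup` (which disappears when a connection is reused) or `wait` (request upload, queueing, inference, and filtering) dominates.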

    About the “<1 second latency” expectation

    This typically refers to Time to First Token (TTFT) under specific conditions:

    • When streaming is enabled
    • When using low-latency or provisioned deployments
    • Or in highly optimized/internal benchmarks

    For full responses without streaming, sub-second latency is not expected.

    Recommendations to optimize latency

    Depending on your use case, here are the most effective ways to reduce latency or improve user experience:

    1. Enable streaming

    Set:

    "stream": true
    

    Returns tokens incrementally instead of waiting for full completion

    TTFT is typically <1 second, even though total generation time remains similar

    This is the most impactful improvement for user-perceived latency
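As a rough illustration of the streaming variant of the question's request, the sketch below adds `"stream": true` and prints text as it arrives. It assumes the same `AZURE_OPENAI_*` variables and `jq` as in the question, and that each event carries a chunk with `choices[0].delta.content`; `sse_payloads` and `stream_request` are hypothetical helper names.

```shell
# Same messages as in the question, with streaming switched on.
STREAM_PAYLOAD='{"stream": true, "messages": [{"role": "system", "content": "You are a helpful assistant. Say hi."}, {"role": "user", "content": "Say hi."}]}'

# Extracts the JSON payload from each server-sent event line ("data: {...}")
# and drops the final "data: [DONE]" sentinel.
sse_payloads() {
  awk '/^data: / {
    sub(/^data: /, ""); sub(/\r$/, "")
    if ($0 != "[DONE]") { print; fflush() }
  }'
}

# Streams the completion and prints each text fragment as soon as it is received.
# -N disables curl's output buffering; jq --unbuffered flushes after every chunk.
stream_request() {
  curl -sS -N \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d "$STREAM_PAYLOAD" |
    sse_payloads |
    jq --unbuffered -rj '.choices[0].delta.content // empty'
  echo
}
```

Chunks without text (for example, ones with an empty `choices` array) produce no output because of the `// empty` fallback. Depending on the awk implementation, pipe buffering may still group fragments, so this is a way to observe the behavior rather than a precise TTFT measurement.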

    2. Consider Provisioned Throughput Units (PTUs)

    If you require consistent and predictable latency, PAYG may not be sufficient.

    PTUs provide:

    • Dedicated capacity
    • Latency SLA-backed performance

    This avoids variability from multitenant routing and load

    Please refer to https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput

    3. Reduce token usage

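    • Lower the output token limit (max_tokens, or max_completion_tokens on reasoning models such as the gpt-5 family).
    • Keep prompts concise.

    Why this matters:

    • Input tokens increase preprocessing time.
    • Output tokens are generated sequentially, so each additional token directly adds latency.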

    4. Optimize model choice and configuration

    If advanced reasoning is not required, consider lighter/faster models (e.g., gpt-4o-mini, if available in your region).

    For reasoning-enabled models, set:

    "reasoning_effort": "minimal"
    

    This reduces compute time and improves response speed
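To show where this setting sits in the request, here is a sketch of the question's call with `reasoning_effort` added as a top-level field of the body. It assumes the configured API version and deployment accept this parameter; `minimal_effort_request` is a hypothetical helper name.

```shell
# Request body from the question plus reasoning_effort as a top-level field.
MINIMAL_EFFORT_PAYLOAD='{"reasoning_effort": "minimal", "messages": [{"role": "system", "content": "You are a helpful assistant. Say hi."}, {"role": "user", "content": "Say hi."}]}'

# Sends the request and prints either the reply text or the API error message,
# using the same jq filter as the question.
minimal_effort_request() {
  curl -sS \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d "$MINIMAL_EFFORT_PAYLOAD" |
    jq -r '.choices[0].message.content // .error.message'
}
```

If the parameter is not accepted, the error message printed by the `jq` fallback should say so, which makes this a quick compatibility check as well.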

    5. Use SDKs instead of raw curl for benchmarking

    Each curl invocation opens a new connection (DNS lookup, TCP and TLS handshakes), which adds overhead to every request.

    SDKs (Python, JavaScript, etc.) support:

    • Connection pooling
    • Keep-alive reuse

    This gives a more accurate representation of real application latency
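The effect of connection reuse can also be approximated without an SDK: when one curl process is given several URLs, it can keep the connection open between requests. The sketch below assumes the same `AZURE_OPENAI_*` variables as in the question; `benchmark_reused` is a hypothetical helper name.

```shell
# Issues N identical requests from ONE curl process so that, after the first
# request, the existing TCP/TLS connection can be reused (keep-alive).
# The body runs in a subshell "( ... )" so its variables do not leak.
benchmark_reused() (
  n=${1:-5}
  url="${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}"
  set --
  i=0
  while [ "$i" -lt "$n" ]; do
    # Each URL needs its own -o; the headers, body, and -w apply to every URL.
    set -- "$@" -o /dev/null "$url"
    i=$((i + 1))
  done
  # num_connects is the number of new connections opened for that transfer:
  # 1 for a fresh connection, 0 when an existing one was reused.
  curl -sS \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d '{"messages": [{"role": "user", "content": "Say hi."}]}' \
    -w 'new_connections=%{num_connects} total=%{time_total}\n' \
    "$@"
)
```

With `benchmark_reused 5`, the first line would normally report `new_connections=1` and later lines `new_connections=0`. The difference in `total` between them is roughly the handshake cost that a pooled client avoids.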

    6. Review region placement and traffic patterns

    Latency can vary due to:

    • Regional capacity
    • Peak usage times

    Recommendations:

    • Deploy in the closest available Azure region.
    • Monitor P50 / P95 latency metrics in Azure Monitor.
    • For high-scale apps, consider multi-region distribution.
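Azure Monitor reports service-side metrics. As a complementary client-side check, P50 and P95 can be estimated from a batch of timed requests. This is a minimal sketch using the nearest-rank method, assuming the same `AZURE_OPENAI_*` variables as in the question; `sample_latencies` and `percentile` are hypothetical helper names.

```shell
# Prints the end-to-end time (seconds) of N separate requests, one per line.
# The body runs in a subshell "( ... )" so its variables do not leak.
sample_latencies() (
  n=${1:-20}
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -sS -o /dev/null -w '%{time_total}\n' \
      "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
      -H "Content-Type: application/json" \
      -H "api-key: ${AZURE_OPENAI_API_KEY}" \
      -d '{"messages": [{"role": "user", "content": "Say hi."}]}'
    i=$((i + 1))
  done
)

# Nearest-rank percentile of the numbers on stdin; P is an integer from 1 to 100.
percentile() {
  LC_ALL=C sort -n | LC_ALL=C awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      print v[rank]
    }'
}
```

For example, `sample_latencies 20 > latencies.txt` followed by `percentile 50 < latencies.txt` and `percentile 95 < latencies.txt` gives a quick view of typical versus tail latency at different times of day.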

    7. Prompt caching

    Helps if identical prompts (or identical prompt prefixes) are reused frequently

    However, most latency comes from:

    • Model inference
    • Token generation

    So gains here are typically modest

    Please refer to:

    Performance and Latency (throughput vs latency deep dive): https://learn.microsoft.com/azure/ai-foundry/openai/how-to/latency?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider

    Optimize Prompt Caching and Latency: https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching

    I hope this will help you. Please feel free to let me know if you have any other queries.

    Thank you!

    1 person found this answer helpful.

  2. Q&A Assist
    2026-04-23T13:25:43.1033333+00:00
    AI-generated content. This question contains an answer created with AI from Q&A Assist.

    For Azure OpenAI chat/completions, per-call latency depends mainly on:

    1. Model type (here: gpt-5-mini)
    2. Prompt token count
    3. Generated token count
    4. Overall load on the deployment and system

    Latency is defined as the time to get a response back from the model. Each prompt token adds relatively little time compared to each incremental generated token, but total latency still scales with both prompt and output size and with system load.

    The documentation does not state a guaranteed or typical "< 1 second" latency for such calls. It explicitly notes that estimating per-call latency is challenging and that latency can vary with the above four factors and current deployment load. A 3–4+ second end-to-end time for a non-streaming request, even with a small prompt, is therefore within the range that can occur and is not, by itself, evidence of a malfunction.

    To understand whether this is normal for the specific deployment, measure latency and throughput using Azure Monitor:

    1. Measure calls per minute using the Azure OpenAI Requests metric split by ModelDeploymentName.
    2. Measure total tokens per minute (prompt + generated) using the Processed Inference tokens metric.
    3. For non-streaming calls, use End-to-end Request Time from the API gateway as the primary latency metric.

    If latency is consistently higher than desired, compare it against these metrics while adjusting prompt size, max tokens, and call rate, and check for periods of high system load.

