Regular LLM gpt-5-mini latency

Question

Raquel R 0

Hi, I am trying to understand whether the latency I experience when calling an LLM is normal or whether something seems to be wrong.
For a query like the one below, run from a terminal, it takes 3-4+ seconds to get a reply (without streaming).
Is that normal? I was told to expect less than 1s.

AZURE_OPENAI_DEPLOYMENT=gpt-5-mini

curl -sS "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
  -H "Content-Type: application/json" \
  -H "api-key: ${AZURE_OPENAI_API_KEY}" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Say hi."},
      {"role": "user", "content": "Say hi."}
    ]
  }' | jq -r '.choices[0].message.content // .error.message'

Raquel R 0 Reputation points

2026-04-24T07:36:36.51+00:00

Thank you, that is useful
Anshika Varshney 11,060 Reputation points Microsoft External Staff Moderator

2026-04-24T07:44:23.2666667+00:00

Hi Raquel R,

Thank you for sharing the update. I appreciate you taking the time to confirm the resolution!

Since I’ve converted Srilakshmi's comment into an answer, could you please take a moment to mark it as Accepted? This helps others in the community with the same question find the solution more easily.

Thank you!

Answer 1

Hello @Raquel R

Thanks for sharing the detailed example.

Is 3–4 seconds normal for gpt-5-mini?

Yes, for a non-streaming request in a Pay-As-You-Go (PAYG) deployment, an end-to-end latency of ~3–4 seconds is within the expected range, even for a simple prompt like “Say hi.”

This does not indicate an issue with your deployment.

Why latency is higher than expected

The total response time you’re measuring includes multiple components:

Network overhead: round-trip time plus the TLS handshake (more noticeable with curl).

Multitenant routing (PAYG behavior): Azure OpenAI PAYG deployments are best-effort with no strict latency SLA. Requests may be routed to:

Busy compute nodes
Capacity in nearby regions (if needed)

Model processing time: tokenization, inference, and sequential token generation.

Content filtering layer: applied before the response is returned.

Non-streaming response mode: the API waits until the entire completion is generated before returning anything.
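To see which of these components dominates for a given call, curl's own timers can split one request into phases. The following is a minimal sketch rather than an official diagnostic. It reuses the AZURE_OPENAI_* variables from the question; the function names phase_breakdown and measure_once are made up for illustration; and the "server_wait" phase lumps together routing, content filtering, and model processing, because the client cannot tell those apart.

```shell
#!/bin/sh
# Sketch: split one non-streaming request into phases using curl's timers.
# Assumes the AZURE_OPENAI_* variables from the question are exported.

# Convert curl's cumulative timers (seconds since start) into per-phase durations.
# Args: time_connect time_appconnect time_starttransfer time_total
phase_breakdown() {
  awk -v c="$1" -v a="$2" -v s="$3" -v t="$4" 'BEGIN {
    # c: DNS + TCP connect done, a: TLS handshake done,
    # s: first response byte received, t: transfer finished
    printf "dns_tcp=%.3f tls=%.3f server_wait=%.3f download=%.3f total=%.3f\n",
           c, a - c, s - a, t - s, t
  }'
}

# Send the request once, discard the body, and print the breakdown.
measure_once() (
  timers=$(curl -sS -o /dev/null \
    -w '%{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d '{"messages": [{"role": "user", "content": "Say hi."}]}')
  # Intentionally unquoted: the four timers become four arguments.
  phase_breakdown $timers
)
```

Calling measure_once a few times gives a rough picture: if dns_tcp and tls are small and server_wait accounts for most of the total, the time is being spent on the service side rather than in connection setup.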

About the “<1 second latency” expectation

This typically refers to Time to First Token (TTFT) under specific conditions:

When streaming is enabled
When using low-latency or provisioned deployments
Or in highly optimized/internal benchmarks

For full responses without streaming, sub-second latency is not expected.

Recommendations to optimize latency

Depending on your use case, here are the most effective ways to reduce latency or improve user experience:

1. Enable streaming

Set:

"stream": true

Returns tokens incrementally instead of waiting for full completion

TTFT is typically <1 second, even though total generation time remains similar

This is the most impactful improvement for user-perceived latency
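As a rough illustration of the streaming option, the sketch below sends the same request with "stream": true, prints the text as chunks arrive, and reports curl's time to first byte next to the total time. It assumes curl 7.63 or later (for %{stderr}), jq, and the AZURE_OPENAI_* variables from the question; the function names are hypothetical. Time to first byte is only an approximation of TTFT, because the first streamed chunk does not necessarily contain text.

```shell
#!/bin/sh
# Sketch: stream a completion and compare time to first byte with total time.
# Assumes curl >= 7.63, jq, and the AZURE_OPENAI_* variables from the question.

# Strip the server-sent-event framing: keep the JSON after "data: ",
# drop blank separator lines and the final "data: [DONE]" sentinel.
sse_payloads() (
  cr=$(printf '\r')
  while IFS= read -r line; do
    line=${line%"$cr"}                 # tolerate CRLF line endings
    case $line in
      'data: [DONE]') ;;               # end-of-stream marker, nothing to print
      'data: '*) printf '%s\n' "${line#data: }" ;;
    esac
  done
)

# Stream one request; text goes to stdout, timings go to stderr.
stream_once() (
  curl -sS -N \
    -w '%{stderr}first_byte=%{time_starttransfer}s total=%{time_total}s\n' \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d '{"stream": true, "messages": [{"role": "user", "content": "Say hi."}]}' \
  | sse_payloads \
  | jq --unbuffered -rj '.choices[0].delta.content // empty'
  echo
)
```

Running stream_once should show first_byte well below total when generation takes noticeable time; for a reply as short as "Hi", the two values may be close.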

2. Consider Provisioned Throughput Units (PTUs)

If you require consistent and predictable latency, PAYG may not be sufficient.

PTUs provide:

Dedicated capacity
Latency SLA-backed performance

This avoids variability from multitenant routing and load

Please refer to this: https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput

3. Reduce token usage

Lower max_tokens (on reasoning models such as the gpt-5 family, the equivalent parameter is max_completion_tokens)

Keep prompts concise

Why this matters:

Input tokens increase preprocessing time

Output tokens are generated sequentially, so each additional output token adds directly to latency

4. Optimize model choice and configuration

If advanced reasoning is not required, consider lighter/faster models (e.g., gpt-4o-mini, if available in your region).

For reasoning-enabled models:

Use:

"reasoning_effort": "minimal"

This reduces compute time and improves response speed
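For reference, here is a sketch of where that setting goes in the request from the question. It assumes the configured API version accepts reasoning_effort for this deployment; the function names and the cap of 64 output tokens are illustrative choices, not recommendations.

```shell
#!/bin/sh
# Sketch: the question's request with minimal reasoning effort and an output cap.
# Assumes jq and the AZURE_OPENAI_* variables from the question.

# Print the JSON request body.
request_body() (
  cat <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hi."}
  ],
  "reasoning_effort": "minimal",
  "max_completion_tokens": 64
}
EOF
)

# Send the body (read from stdin via -d @-) and print the reply or the error.
send_minimal() (
  request_body | curl -sS \
    "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
    -H "Content-Type: application/json" \
    -H "api-key: ${AZURE_OPENAI_API_KEY}" \
    -d @- \
  | jq -r '.choices[0].message.content // .error.message'
)
```

On reasoning models the output cap also has to cover any reasoning tokens, so a cap that is too low can produce an empty reply; if that happens, raise the value.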

5. Use SDKs instead of raw curl for benchmarking

Each curl invocation opens a new connection (TCP + TLS handshake), which adds overhead

SDKs (Python, JavaScript, etc.) support:

Connection pooling
Keep-alive reuse

This gives a more accurate representation of real application latency

6. Review region placement and traffic patterns

Latency can vary due to:

Regional capacity
Peak usage times

Recommendations:

Deploy in the closest available Azure region

Monitor P50 / P95 latency metrics in Azure Monitor

For high-scale apps, consider multi-region distribution.
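Azure Monitor reports these percentiles on the service side. As a client-side complement, a single timing can be replaced with a small sample and its P50 / P95. The sketch below is one way to do that with curl and awk; the function names and the nearest-rank percentile method are choices made here for illustration, and the AZURE_OPENAI_* variables are the ones from the question.

```shell
#!/bin/sh
# Sketch: collect end-to-end timings for N requests and compute percentiles.
# Assumes the AZURE_OPENAI_* variables from the question are exported.

# Nearest-rank percentile of newline-separated numbers on stdin.
# Usage: percentile 95 < samples.txt   (the percentile must be an integer)
percentile() {
  LC_ALL=C sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Send the request N times and print one total time (seconds) per line.
collect_samples() (
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -sS -o /dev/null -w '%{time_total}\n' \
      "${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT}/chat/completions?api-version=${AZURE_OPENAI_API_VERSION}" \
      -H "Content-Type: application/json" \
      -H "api-key: ${AZURE_OPENAI_API_KEY}" \
      -d '{"messages": [{"role": "user", "content": "Say hi."}]}'
    i=$((i + 1))
  done
)
```

For example, collect_samples 20 > samples.txt followed by percentile 50 < samples.txt and percentile 95 < samples.txt gives a median and a tail figure. Each curl call here still opens a fresh connection, so these numbers include connection setup.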

7. Prompt caching

Helps if identical prompts, or prompts that share a long identical prefix, are reused frequently

However, most latency comes from:

Model inference
Token generation

So gains here are typically modest

Please refer to these:

Performance and Latency (throughput vs latency deep dive): https://learn.microsoft.com/azure/ai-foundry/openai/how-to/latency?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider

Optimize Prompt Caching and Latency: https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching

I hope this will help you. Please feel free to let me know if you have any other queries.

Thank you!

SRILAKSHMI C 18,225 Reputation points Microsoft External Staff Moderator

2026-04-30T12:14:44.3033333+00:00

Hi Raquel R,

Thank you for sharing the update. I appreciate you taking the time to confirm the resolution!

Since I’ve converted my earlier comment into an answer, could you please take a moment to mark it as Accepted? This helps others in the community with the same question find the solution more easily.

Thank you!

Answer 2

For Azure OpenAI chat/completions, per-call latency depends mainly on:

Model type (here: gpt-5-mini)
Prompt token count
Generated token count
Overall load on the deployment and system

Latency is defined as the time to get a response back from the model. Each prompt token adds relatively little time compared to each incremental generated token, but total latency still scales with both prompt and output size and with system load.

The documentation does not state a guaranteed or typical "< 1 second" latency for such calls. It explicitly notes that estimating per-call latency is challenging and that latency can vary with the above four factors and current deployment load. A 3–4+ second end-to-end time for a non-streaming request, even with a small prompt, is therefore within the range that can occur and is not, by itself, evidence of a malfunction.

To understand whether this is normal for the specific deployment, measure latency and throughput using Azure Monitor:

Measure calls per minute using the Azure OpenAI Requests metric split by ModelDeploymentName.
Measure total tokens per minute (prompt + generated) using the Processed Inference tokens metric.
For non-streaming calls, use End-to-end Request Time from the API gateway as the primary latency metric.

If latency is consistently higher than desired, compare it against these metrics while adjusting prompt size, max tokens, and call rate, and check for periods of high system load.

References:

Performance and latency
