An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
Hello @Raquel R
Thanks for sharing the detailed example
Is 3–4 seconds normal for gpt-5-mini?
Yes, for a non-streaming request in a Pay-As-You-Go (PAYG) deployment, an end-to-end latency of ~3–4 seconds is within the expected range, even for a simple prompt like “Say hi.”
This does not indicate an issue with your deployment.
Why latency is higher than expected
The total response time you’re measuring includes multiple components:
Network overhead Round-trip time + TLS handshake (more noticeable with curl)
Multitenant routing (PAYG behavior) Azure OpenAI PAYG deployments are best-effort with no strict latency SLA. Requests may be routed to:
- Busy compute nodes
- Capacity in nearby regions (if needed)
Model processing time Includes tokenization + inference + sequential token generation
Content filtering layer Applied before returning the response
Non-streaming response mode The API waits until the entire completion is generated before returning anything
About the “<1 second latency” expectation
This typically refers to Time to First Token (TTFT) under specific conditions:
- When streaming is enabled
- When using low-latency or provisioned deployments
- Or in highly optimized/internal benchmarks
For full responses without streaming, sub-second latency is not expected.
Recommendations to optimize latency
Depending on your use case, here are the most effective ways to reduce latency or improve user experience:
1. Enable streaming
Set:
"stream": true
Returns tokens incrementally instead of waiting for full completion
TTFT is typically <1 second, even though total generation time remains similar
This is the most impactful improvement for user-perceived latency
2. Consider Provisioned Throughput Units (PTUs)
If you require consistent and predictable latency, PAYG may not be sufficient.
PTUs provide:
- Dedicated capacity
- Latency SLA-backed performance
This avoids variability from multitenant routing and load
Please refer this https://learn.microsoft.com/azure/ai-foundry/openai/concepts/provisioned-throughput
3. Reduce token usage
Lower max_tokens
Keep prompts concise
Why this matters:
Input tokens increase preprocessing time
Output tokens are generated sequentially - directly impacts latency
4. Optimize model choice and configuration
If advanced reasoning is not required Consider lighter/faster models (e.g., gpt-4o-mini, if available in your region)
For reasoning-enabled models:
Use:
"reasoning_effort": "minimal"
This reduces compute time and improves response speed
5. Use SDKs instead of raw curl for benchmarking
curl creates a new connection per request → adds overhead
SDKs (Python, JavaScript, etc.) support:
- Connection pooling
- Keep-alive reuse
This gives a more accurate representation of real application latency
6. Review region placement and traffic patterns
Latency can vary due to:
- Regional capacity
- Peak usage times
Recommendations:
Deploy in the closest available Azure region
Monitor P50 / P95 latency metrics in Azure Monitor
For high-scale apps Consider multi-region distribution
7. Prompt caching
Helps if identical prompts are reused frequently
However, most latency comes from:
- Model inference
- Token generation
So gains here are typically modest
Please refer this
Performance and Latency (throughput vs latency deep dive): https://learn.microsoft.com/azure/ai-foundry/openai/how-to/latency?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider
Optimize Prompt Caching and Latency: https://learn.microsoft.com/azure/ai-services/openai/how-to/prompt-caching
I hope this will help you. Please feel free to let me know if you have any other queries.
Thank you!