Hello Yassine,
Welcome to Microsoft Q&A. Thank you for reaching out.
Your concerns are understandable, especially given your intentional choice to use open‑weight reasoning models and your expectation of timely responses in production scenarios.
Open‑weight reasoning models such as Kimi K2 are designed to perform deep, multi‑step reasoning and to support very large context sizes. Because of this design, individual requests can require extended execution time.
In Azure AI Foundry, request limits such as RPM or TPM primarily control how many requests can be admitted, but they do not directly reduce the execution time of a single request. As a result, increasing limits or enabling streaming may not significantly change end‑to‑end response time for complex reasoning workloads.
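As a minimal illustration of the streaming point above: streaming surfaces tokens as they are generated, so time‑to‑first‑token and perceived responsiveness can improve even though the total generation time of a long reasoning request stays the same. The sketch below uses the OpenAI Python SDK against an OpenAI‑compatible endpoint; the endpoint URL, API key, and deployment name are placeholders for your own Foundry deployment.

```python
# Minimal sketch: stream a chat completion so partial output arrives early,
# even though total generation time for a long reasoning request is unchanged.
# The endpoint URL, API key, and deployment name are placeholders --
# substitute the values from your own Azure AI Foundry deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-foundry-resource>.services.ai.azure.com/openai/v1/",  # placeholder
    api_key="<your-api-key>",  # placeholder
)

stream = client.chat.completions.create(
    model="<your-kimi-k2-deployment-name>",  # placeholder deployment name
    messages=[{"role": "user", "content": "Summarize the attached incident report."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the reply; print it as it arrives.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```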
For open‑weight models hosted on shared inference infrastructure, response times can also vary based on overall system load and backend scheduling. Changing regions does not always lead to consistent latency improvements, as similar capacity characteristics can apply across regions for the same model family.
Current monitoring focuses on token usage and throughput trends, which means internal execution or queueing delays may not be directly visible.
It is also important to note that models such as Kimi K2 process each request independently. As of now, there is no documented prompt reuse or caching behavior for these models, so large prompts and long reasoning paths incur full processing time on every request. This can lead to consistently higher response times for workloads that rely on extensive reasoning.
Please check if the following approaches help:
- For deployment and capacity considerations, evaluate provisioned throughput to improve consistency in request admission during high load. This approach helps with throughput predictability, but it does not reduce per‑request reasoning or execution time.
- For workload shaping and resilience, consider implementing client‑side timeouts with retry logic and exponential backoff to manage temporary latency spikes, and configure primary and secondary region failover to improve availability during regional capacity fluctuations (see the sketch after this list).
- For observability and validation, review token usage and throughput metrics to confirm requests are being accepted without throttling. Use metric trends to distinguish between admission limits and model execution‑time behavior.
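The following sketch shows how client‑side timeouts, exponential backoff with jitter, and a primary/secondary endpoint failover could fit together for a chat‑completions style REST call. The endpoint URLs, API key header, payload shape, and timeout values are assumptions; adjust them to match your actual Foundry deployments and model API.

```python
# Minimal sketch of client-side timeouts, exponential backoff with jitter, and
# primary/secondary endpoint failover for a chat-completions style REST call.
# Endpoint URLs, the API key, and the payload shape are placeholders/assumptions --
# adjust them to match your actual Foundry deployments and model API.
import random
import time

import requests

ENDPOINTS = [
    "https://<primary-region-resource>.services.ai.azure.com/models/chat/completions",    # placeholder
    "https://<secondary-region-resource>.services.ai.azure.com/models/chat/completions",  # placeholder
]
HEADERS = {"api-key": "<your-api-key>", "Content-Type": "application/json"}  # placeholder key

def call_with_retries(payload, max_attempts=4, base_delay=2.0, timeout=(10, 300)):
    """Try the primary endpoint with exponential backoff, then fail over to the secondary."""
    for endpoint in ENDPOINTS:
        for attempt in range(max_attempts):
            try:
                # timeout=(connect, read): keep the read timeout generous for long reasoning runs.
                resp = requests.post(endpoint, headers=HEADERS, json=payload, timeout=timeout)
                if resp.status_code == 200:
                    return resp.json()
                if resp.status_code not in (429, 500, 502, 503, 504):
                    resp.raise_for_status()  # non-retryable error: surface it immediately
            except requests.Timeout:
                pass  # treat a timeout like a retryable failure
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        # All attempts on this endpoint failed; fall through to the next (secondary) region.
    raise RuntimeError("All endpoints exhausted after retries")
```

The read timeout of 300 seconds is only illustrative; size it to the longest reasoning run you are willing to wait for, and keep the jittered backoff so that retries do not synchronize during a regional capacity dip.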
References:
Azure OpenAI in Microsoft Foundry Models performance & latency - Microsoft Foundry | Microsoft Learn
Monitoring data reference for Azure OpenAI - Microsoft Foundry | Microsoft Learn
Foundry Models sold directly by Azure - Microsoft Foundry | Microsoft Learn
Thank you
Please 'Upvote' (Thumbs-up) and 'Accept' as answer if the response was helpful. This will benefit other community members who face the same issue.