AI Foundry performance for some open-weight models is unbelievably slow

Yassine 0 Reputation points
2026-03-31T08:38:53.86+00:00

I use models such as Kimi K2, in different regions, but the performance is really slow. A response can take several minutes, which defeats the purpose. The RPM limit is set to the highest value, and the monitoring does not really work.

Does anyone have an idea how to resolve these performance issues? Changing or using different regions didn't really solve it.

Thank you

Azure OpenAI Service

An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.


1 answer

  1. Karnam Venkata Rajeswari 1,555 Reputation points Microsoft External Staff Moderator
    2026-04-01T10:41:58.8666667+00:00

    Hello Yassine,

    Welcome to Microsoft Q&A. Thank you for reaching out.

    Your concerns are fully understandable, especially given the intentional choice to use open‑weight reasoning models and the expectation of timely responses in production scenarios.

    Open‑weight reasoning models such as Kimi K2 are designed to perform deep, multi‑step reasoning and to support very large context sizes. Because of this design, individual requests can require extended execution time.

    In Azure AI Foundry, request limits such as RPM or TPM primarily control how many requests can be admitted, but they do not directly reduce the execution time of a single request. As a result, increasing limits or enabling streaming may not significantly change end‑to‑end response time for complex reasoning workloads.

    For open‑weight models hosted on shared inference infrastructure, response times can also vary based on overall system load and backend scheduling. Changing regions does not always lead to consistent latency improvements, as similar capacity characteristics can apply across regions for the same model family.

    Current monitoring focuses on token usage and throughput trends, which means internal execution or queueing delays may not be directly visible.

    It is also important to note that models such as Kimi K2 process each request independently. As of now, there is no documented prompt reuse or caching behavior for these models, so large prompts and long reasoning paths incur full processing time on every request. This can lead to consistently higher response times for workloads that rely on extensive reasoning.

    Please check if the following approaches help:

    1. For deployment and capacity considerations, evaluate provisioned throughput to improve consistency in request admission during high load. This approach helps with throughput predictability, but it does not reduce per‑request reasoning or execution time.
    2. For workload shaping and resilience, consider implementing client‑side timeouts with retry logic and exponential backoff to manage temporary latency spikes. Configure primary and secondary region failover to improve availability during regional capacity fluctuations.
    3. For observability and validation, review token usage and throughput metrics to confirm requests are being accepted without throttling. Use metric trends to distinguish between admission limits and model execution‑time behavior.
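    To illustrate point 2, here is a minimal sketch of client‑side timeouts with exponential backoff and jitter. The `call` parameter stands in for whatever function issues your chat‑completions request (for example, a wrapper around the Azure AI Foundry SDK call); the function name and defaults are illustrative, not part of any Azure SDK.

    ```python
    import random
    import time

    def call_with_backoff(call, max_attempts=4, base_delay=1.0, timeout=120.0):
        """Retry a model call with exponential backoff and jitter.

        `call` is any function that issues the request and raises on
        timeout or transient failure (e.g. HTTP 429/503).
        """
        for attempt in range(max_attempts):
            try:
                return call(timeout=timeout)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # exhausted retries: surface the error
                # Exponential backoff: base, 2*base, 4*base, ... plus
                # jitter so many clients do not retry in lockstep.
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
                time.sleep(delay)
    ```

    The per‑attempt `timeout` bounds how long you wait on a single slow request, while the backoff schedule spaces out retries during transient capacity pressure; neither changes the model's own execution time.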

    References:

    Azure OpenAI in Microsoft Foundry Models performance & latency - Microsoft Foundry | Microsoft Learn

    Monitoring data reference for Azure OpenAI - Microsoft Foundry | Microsoft Learn

    Foundry Models sold directly by Azure - Microsoft Foundry | Microsoft Learn

    Thank you

    Please 'Upvote' (Thumbs-up) and 'Accept' the answer if the response was helpful. This will benefit other community members who face the same issue.

