Performance Issue with Azure DeepSeek Model – High Response Time

sivasankar 5 Reputation points
2025-02-07T11:30:47.7766667+00:00

We created and deployed the DeepSeek model on Azure following the guidelines in the linked documentation. While the model is functional, we are experiencing significant latency: a single query takes more than a minute to generate a response, which is impacting our use case.

Deployment Details:

  • Region: East US
  • Model: DeepSeek-R1

Could you please assist us in identifying the cause of this issue and suggest possible optimizations to improve response time and reduce latency?

Looking forward to your support.

Azure AI services

2 answers

  1. Pavankumar Purilla 8,335 Reputation points Microsoft External Staff Moderator
    2025-02-07T18:31:38.8966667+00:00

    Hi sivasankar,
    Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!

    Setting max_tokens to a lower value, such as 800, caps the length of the generated response, which directly reduces processing time while keeping the output useful and relevant.

    Enabling response streaming (stream=True) lets the model send tokens as soon as they are generated, so users start seeing output immediately instead of waiting for the full response. This considerably reduces perceived latency.

    Keeping the input prompt short and focused also helps: the model processes less text, and a clear, concise prompt lets it respond more quickly. The sketch below puts these adjustments together.
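
    Here is a minimal sketch using the azure-ai-inference Python SDK, assuming a serverless DeepSeek-R1 deployment; the endpoint URL, API key, and example prompt are placeholders for your own values:

    ```python
    # Minimal sketch (assumes: pip install azure-ai-inference;
    # the endpoint and key below are placeholders for your deployment).
    from azure.ai.inference import ChatCompletionsClient
    from azure.ai.inference.models import UserMessage
    from azure.core.credentials import AzureKeyCredential

    client = ChatCompletionsClient(
        endpoint="https://<your-deployment>.eastus.models.ai.azure.com",  # placeholder endpoint
        credential=AzureKeyCredential("<your-api-key>"),                  # placeholder key
    )

    # Keep the prompt short, cap the output length, and stream partial results.
    response = client.complete(
        stream=True,     # send tokens as they are generated
        max_tokens=800,  # shorter completions finish sooner
        messages=[
            UserMessage(content="Summarize the key deployment risks in two sentences."),
        ],
    )

    # Print tokens as they arrive instead of waiting for the full response.
    for update in response:
        if update.choices and update.choices[0].delta.content:
            print(update.choices[0].delta.content, end="", flush=True)
    ```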
    I hope this information helps.


  2. sivasankar 5 Reputation points
    2025-03-24T08:17:19.9833333+00:00

    Hi @Pavankumar Purilla ,

    Thank you for your response. I've noticed that when using DeepSeek, the response time is noticeably slower than with other Azure models such as GPT-4, even with the same max_tokens setting and streaming enabled.

    While the model eventually produces proper responses, its speed still lags. Is there a specific reason for this difference in performance? Are there any configuration changes or optimizations that could improve DeepSeek's response time?

