I am facing a latency issue in Azure OpenAI responses

Aqsa Rahman 0 Reputation points
2024-06-25T10:48:34.86+00:00

I'm using the gpt-4-turbo model for my application, and I've recently been experiencing slower response times from the API. Previously, responses took no longer than 5 seconds, but now some are exceeding 10 seconds. I even switched to the gpt-4-32k model, but the issue persists. The number of tokens I generate is small, and earlier responses with the same gpt-4-turbo model were produced in 2-3 seconds.


Azure OpenAI Service

1 answer

  1. santoshkc 8,960 Reputation points Microsoft Vendor
    2024-06-25T13:09:36.69+00:00

    Hi @Aqsa Rahman,

    Thank you for reaching out to Microsoft Q&A forum!

    I understand that you are facing a latency issue in Azure OpenAI. Some latency is expected with GPT-4 models, since gpt-4 is larger and more capable than the gpt-3.5 versions and therefore generates tokens more slowly.

    As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.

    I suggest you go through the Performance and latency documentation.

    Here are some of the best practices to lower latency:

    • Model latency: If model latency is important to you, we recommend trying out our latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that even in cases where the total number of tokens generated is similar, a request with a higher value for the max_tokens parameter will have more latency.
    • Lower total tokens generated: The fewer tokens generated, the faster the overall response will be. Think of generation as a for loop: n tokens means n iterations. Reduce the number of tokens generated and the overall response time will improve accordingly.
    • Streaming: Enabling streaming can be useful for managing user expectations in certain situations, because the user can see the model response as it is being generated rather than having to wait until the last token is ready (a minimal sketch follows this list).
    • Content filtering: Content filtering improves safety, but it also adds latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.
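
    As a rough illustration of the "lower max tokens" and "streaming" points above, here is a minimal sketch using the openai Python package (v1.x) with the AzureOpenAI client. The endpoint, API key, and deployment name are placeholders you would replace with your own values:

    ```python
    from openai import AzureOpenAI

    # Placeholder endpoint, key, and deployment name -- replace with your own.
    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com/",
        api_key="YOUR-API-KEY",
        api_version="2024-02-01",
    )

    # Keep max_tokens as low as your use case allows, and enable streaming
    # so partial output is shown while the rest is still being generated.
    stream = client.chat.completions.create(
        model="gpt-4-turbo-deployment",  # your Azure deployment name
        messages=[{"role": "user", "content": "Summarize this ticket in two sentences."}],
        max_tokens=150,
        stream=True,
    )

    for chunk in stream:
        # The first streamed chunk may carry no choices (only filter metadata),
        # so guard before reading the delta content.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()
    ```

    Streaming does not reduce the total generation time, but it shortens the time to the first visible token, which is usually what users perceive as latency.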

    I hope this helps! Thank you.

