Long latencies when using gpt-4 and gpt-4-turbo models

Marc Zhang 35 Reputation points
2024-03-13T01:01:08.3133333+00:00

[Screenshot: SCR-20240313-hwov]

As seen in the picture. We have been seeing abnormally high latencies when using gpt-4 and gpt-4-turbo models. In some extreme cases, the latency is more than 1 minute. This has become unusable for us at the moment. Anyone experiencing similar issues or know what could be the cause of this?

For context, our input token length is about 6k tokens. We use function calling in our requests. We are not close to our throughput quota. We have multiple deployments in different regions, and face the same issue.

Any input is appreciated. Thanks in advance.

Azure OpenAI Service

Accepted answer
  1. AshokPeddakotla-MSFT 29,481 Reputation points
    2024-03-14T15:58:51.93+00:00

    Marc Zhang, greetings and welcome to the Microsoft Q&A Forum!

    I understand that you are seeing high latency. Higher latency is expected with GPT-4 models, since gpt-4 is a larger, more capable model and generates responses more slowly than the gpt-3.5 series.

    As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.

    As Charlie mentioned, I would suggest checking the documentation on improving performance and latency.

    Also, here are some best practices to lower latency:

    • Model latency: If model latency is important to you, we recommend trying out the latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that, even when the total number of tokens generated is similar, the request with the higher value set for the max_tokens parameter will have higher latency.
    • Lower total tokens generated: The fewer tokens generated, the faster the overall response. Generation is sequential, so producing n tokens takes roughly n iterations; reduce the number of tokens generated and overall response time will improve accordingly.
    • Streaming: Enabling streaming can help manage user expectations by letting the user see the model response as it is being generated rather than waiting until the last token is ready (see the sketch after this list).
    • Content filtering: Content filtering improves safety, but it also adds latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.
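
    Below is a minimal sketch of how the max_tokens and streaming suggestions could be applied with the openai Python package against an Azure OpenAI deployment. The endpoint, API key, API version, and deployment name are placeholders, not values from this thread; substitute your own.

    ```python
    from openai import AzureOpenAI

    # Placeholder endpoint, key, and API version -- substitute your own values.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",
        api_key="<your-api-key>",
        api_version="2024-02-01",
    )

    # Cap the output length and stream tokens as they are generated.
    stream = client.chat.completions.create(
        model="<your-gpt-4-deployment>",  # Azure deployment name, not the base model name
        messages=[{"role": "user", "content": "Summarize the attached meeting notes."}],
        max_tokens=256,  # lower values reduce worst-case latency
        stream=True,     # show partial output instead of waiting for the full response
    )

    # Print partial content as it arrives so users see progress immediately.
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    ```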

    I hope this helps. Please let me know if you have any further queries.

    If the response helped, please click Accept Answer and select Yes for "Was this answer helpful".


1 additional answer

Sort by: Most helpful
  1. Charlie Wei 3,300 Reputation points
    2024-03-13T02:08:47.7766667+00:00

    Hello Marc Zhang,

    Regarding the latency issue, I recommend consulting this Microsoft Learn document and making the suggested adjustments to see if performance improves. Thank you.

    Best regards,
    Charlie


    If you find my response helpful, please consider accepting this answer and voting 'yes' to support the community. Thank you!