GPT-4o Slow to complete after repeated runs.

Harvey Maddocks 0 Reputation points
2024-06-15T16:12:06.5066667+00:00

Hi,

I am running a deployed prompt flow that is using a deployed gpt-4o. It has 300k Tokens per minute quota.

I set it up to run a small batch of tweets to run a sentiment and a classification on the target tweet with the same prompt.

The response from the deployment is randomly very slow to complete. It will do most of the request in about 0.5 seconds. But then for a tweet it might take 2 Minutes to respond, then for another it might hang all together. But all very randomly. Repeated runs yield different times despite being the same corpus of tweets.

There is nothing in the logs of the deployment that would suggest anything wrong.

This effect is also seen when calling the API directly and using the openAI AzureOpenAi python SDK,

Azure OpenAI Service
Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
2,611 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. AshokPeddakotla-MSFT 30,241 Reputation points
    2024-06-18T03:09:24.2466667+00:00

    Harvey Maddocks Greetings & Welcome to Microsoft Q&A forum!

    I understand that you are having issues with GPT4o model slow response times.

    If you are using GPT4 model then latency is expected considering that gpt-4 has more capacity than the gpt-3.5 version. As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.

    This article talks about Azure OpenAI service about improving the latency performance. You can control to improve the performance like Model selection, Generation size and Max tokens, streaming etc.

    Here are some of the best practices to lower latency:

    • Model latency: If model latency is important to you we recommend trying out our latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.
    • Lower total tokens generated: The fewer tokens generated the faster the overall response will be. Remember this is like having a for loop with n tokens = n iterations. Lower the number of tokens generated and overall response time will improve accordingly.
    • Streaming: Enabling streaming can be useful in managing user expectations in certain situations by allowing the user to see the model response as it is being generated rather than having to wait until the last token is ready.
    • Content Filtering improves safety, but it also impacts latency. Evaluate if any of your workloads would benefit from modified content filtering policies.

    Please let me know if you have any further queries.

    0 comments No comments