API for gpt-4-1106-preview extremely slow

Marijn Otte 70 Reputation points
2024-01-15T11:31:21.34+00:00

When we make API calls to the gpt-4-1106-preview model, the average response time is around 60 seconds. When we use the chat GUI in Azure AI Studio with the same model and the same parameters, the response takes 10-20 seconds. What can we do to speed up the API? We have already tried tuning the temperature, max_tokens and top_p parameters and minimizing the content filters, but none of it makes a significant difference.

Example API call:

time curl -X POST -H "Content-Type: application/json" -H "api-key: XXX" -d '{
  "messages": [
    {
      "role": "user",
      "content": "What does a cow eat?"
    }
  ],
  "model": "gpt-4-1106-preview",
  "stream": true,
  "temperature": 0.7,
  "frequency_penalty": 0,
  "presence_penalty": 0
}' "https://XXX-sweden.openai.azure.com/openai/deployments/gpt-4-1106-preview/chat/completions?api-version=2023-09-01-preview"


.....

data: [DONE]


real	1m7,174s
user	0m0,079s
sys	0m0,024s
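Since stream is enabled, the total wall-clock time above includes the full generation. For reference, here is a rough sketch of how time to first token could be measured on the same call; the endpoint, key and deployment are placeholders, not our real values:

import json
import time

import requests

# Placeholder values, not a real endpoint or key.
URL = ("https://XXX-sweden.openai.azure.com/openai/deployments/"
       "gpt-4-1106-preview/chat/completions?api-version=2023-09-01-preview")
HEADERS = {"Content-Type": "application/json", "api-key": "XXX"}
BODY = {
    "messages": [{"role": "user", "content": "What does a cow eat?"}],
    "stream": True,
    "temperature": 0.7,
}

start = time.monotonic()
first_token = None
with requests.post(URL, headers=HEADERS, json=BODY, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # Skip annotation-only chunks that carry no choices.
        if first_token is None and chunk.get("choices"):
            first_token = time.monotonic() - start

total = time.monotonic() - start
print(f"time to first token: {first_token:.2f}s, total: {total:.2f}s")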

15 answers

  1. David Chartrand 5 Reputation points
    2024-04-03T18:55:27.2133333+00:00

    Same issue here. 3.5 is not good enough. 4.0 is way too slow and not usable. Please fix it ASAP or we will need to find another LLM.

    1 person found this answer helpful.

  2. Saurabh Sharma 23,816 Reputation points Microsoft Employee
    2024-01-17T15:29:36.5633333+00:00

    @Marijn Otte Thanks for sharing the details. I will check it in the above regions. However, please note that GPT-4 is much slower than GPT-3.5; if you look at the documentation on latency, gpt-35-turbo is the fastest model. The same documentation also covers the factors you can control to improve performance, such as model selection, generation size and max tokens, and streaming.

    Thanks,
    Saurabh
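    As an illustration of those levers, here is a minimal sketch using the pre-1.0 openai Python SDK; the endpoint, key and deployment name are placeholders, and the specific values are assumptions rather than a verified configuration. Capping max_tokens bounds the generation size, and stream=True starts returning tokens as they are produced:

        import openai

        # Azure OpenAI configuration (placeholder values).
        openai.api_type = "azure"
        openai.api_base = "https://XXX-sweden.openai.azure.com/"
        openai.api_version = "2023-09-01-preview"
        openai.api_key = "XXX"

        response = openai.ChatCompletion.create(
            engine="gpt-4-1106-preview",  # Azure deployment name
            messages=[{"role": "user", "content": "What does a cow eat?"}],
            max_tokens=200,  # a smaller generation size finishes sooner
            stream=True,     # start receiving tokens immediately
        )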


  3. yue mao 0 Reputation points
    2024-05-11T02:32:47.5333333+00:00

    The response time of GPT-4 has increased as well.


  4. Brett H 5 Reputation points
    2024-05-11T09:06:43.6666667+00:00

    I agree with all of the above comments about the slowness of gpt-4-1106-preview. However, utilizing streaming with GPT-4o significantly improves things.

    
        import openai

        # openai<1.0 SDK: with stream=True the call returns a generator of
        # chunks instead of a single completed response.
        stream = openai.ChatCompletion.create(
            engine=XXX,       # Azure deployment name
            max_tokens=XXX,
            temperature=XXX,
            messages=XXX,
            stream=True
        )
    

    By setting stream=True, consuming the response chunk by chunk, and having some code to continuously refresh the screen, the model's response comes back gradually after a few seconds; a minimal consumption loop is sketched below.
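    The loop below is a sketch of how such a stream can be consumed with the same pre-1.0 SDK; the guard on empty choices is an assumption based on Azure sometimes sending annotation-only chunks, so verify it against your API version:

        for chunk in stream:
            # Some Azure chunks (e.g. content-filter annotations) carry no choices.
            if not chunk["choices"]:
                continue
            text = chunk["choices"][0]["delta"].get("content")
            if text:
                print(text, end="", flush=True)  # refresh output as tokens arrive
        print()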


  5. Jarmo Hämäläinen 0 Reputation points
    2024-10-02T13:41:40.11+00:00

    Today I tested GPT-4o mini. My unit test asks 4 questions in one conversation.

    If there is a 10-second wait between server calls, the response times are about 3, 60, 60, 60 seconds.

    If there is a 70-second wait between server calls, the response times are about 3, 3, 3, 3 seconds.

    It seems there is an extra 57 seconds in the response time if the calls come too fast (10-second wait).

    Is this a feature, a bug, or something else?
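    For reference, a minimal sketch of this kind of measurement; the endpoint, key, deployment name, API version and questions are placeholders, not the real test:

        import time

        import requests

        URL = ("https://XXX.openai.azure.com/openai/deployments/"
               "gpt-4o-mini/chat/completions?api-version=2024-02-01")
        HEADERS = {"Content-Type": "application/json", "api-key": "XXX"}
        WAIT_SECONDS = 10  # try 10 vs. 70 to reproduce the difference

        history = []
        for question in ["Q1", "Q2", "Q3", "Q4"]:  # placeholder questions
            history.append({"role": "user", "content": question})
            start = time.monotonic()
            resp = requests.post(URL, headers=HEADERS, json={"messages": history})
            resp.raise_for_status()
            print(f"response time: {time.monotonic() - start:.1f}s")
            history.append(resp.json()["choices"][0]["message"])
            time.sleep(WAIT_SECONDS)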

