API for gpt-4-1106-preview extremely slow
When we make API calls to the gpt-4-1106-preview model, the average response time is around 60 seconds. When we use the chat GUI in Azure AI Studio with the same model and the same parameters, the response takes 10-20 seconds. What can we do to speed up the API? We have already tried tuning the temperature, max_tokens and top_p parameters and minimizing the content filters, but none of them make a significant difference.
Example API call:
time curl -X POST -H "Content-Type: application/json" -H "api-key: XXX" -d '{
"messages": [
{
"role": "user",
"content": "What does a cow eat?"
}
],
"model": "gpt-4-1106-preview",
"stream": true,
"temperature": 0.7,
"frequency_penalty": 0,
"presence_penalty": 0
}' "https://XXX-sweden.openai.azure.com/openai/deployments/gpt-4-1106-preview/chat/completions?api-version=2023-09-01-preview"
.....
data: [DONE]
real 1m7,174s
user 0m0,079s
sys 0m0,024s
15 answers
-
Same issue here. GPT-3.5 is not good enough, and GPT-4 is far too slow to be usable. Please fix it ASAP or we will need to find another LLM.
-
Saurabh Sharma 23,816 Reputation points Microsoft Employee
2024-01-17T15:29:36.5633333+00:00 @Marijn Otte Thanks for sharing the details. I will check it in the above regions. However, please note that GPT-4 is much slower than GPT-3.5; the documentation on latency lists gpt-35-turbo as the fastest model. The same documentation also describes the factors you can control to improve performance, such as model selection, generation size and max tokens, and streaming. Thanks, Saurabh
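For illustration, here is a minimal sketch of two of those factors (model selection and capping generation size), assuming the legacy pre-1.0 openai Python SDK used elsewhere in this thread; the endpoint, key and deployment name are placeholders:

import openai

# Sketch only: legacy openai<1.0 SDK; endpoint, key and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://XXX-sweden.openai.azure.com"
openai.api_version = "2023-09-01-preview"
openai.api_key = "XXX"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # model selection: the latency docs list gpt-35-turbo as the fastest model
    messages=[{"role": "user", "content": "What does a cow eat?"}],
    max_tokens=100,  # generation size: latency grows with the number of tokens generated
)
print(response["choices"][0]["message"]["content"])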
-
yue mao 0 Reputation points
2024-05-11T02:32:47.5333333+00:00 The response time of GPT-4 has increased.
-
Brett H 5 Reputation points
2024-05-11T09:06:43.6666667+00:00 I agree with all of the above comments about the slowness of gpt-4-1106-preview. However, utilising streaming with GPT-4o significantly improves things.
stream = openai.ChatCompletion.create(
    engine=XXX,  # Azure deployment name
    max_tokens=XXX,
    temperature=XXX,
    messages=XXX,
    stream=True,
)
By setting stream=True, consuming the response in chunks, and continuously refreshing the screen, the model's response starts coming back gradually after a few seconds; a sketch of that loop follows below.
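A minimal sketch of that chunk-handling loop, again assuming the legacy pre-1.0 openai Python SDK, with placeholders for the Azure endpoint, key and deployment name:

import openai

# Sketch only: legacy openai<1.0 SDK; endpoint, key and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://XXX-sweden.openai.azure.com"
openai.api_version = "2023-09-01-preview"
openai.api_key = "XXX"

stream = openai.ChatCompletion.create(
    engine="gpt-4o",  # Azure deployment name (placeholder)
    messages=[{"role": "user", "content": "What does a cow eat?"}],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    choices = chunk["choices"]
    if choices:
        content = choices[0]["delta"].get("content")
        if content:
            print(content, end="", flush=True)
print()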
-
Jarmo Hämäläinen 0 Reputation points
2024-10-02T13:41:40.11+00:00 Today I tested GPT-4o mini. My unit test runs a conversation of 4 questions.
With a 10-second wait between server calls, the response times are about 3, 60, 60, 60 seconds.
With a 70-second wait between server calls, the response times are about 3, 3, 3, 3 seconds.
It seems an extra ~57 seconds is added to the response time when the calls come too quickly (the 10-second wait).
Is this a feature, a bug, or something else?
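For anyone trying to reproduce this, here is a rough sketch of such a timing loop using the requests library against the thread's Azure endpoint; the URL, key, API version, deployment name and questions are all placeholders:

import time
import requests

# Sketch only: URL, key, api-version, deployment name and questions are placeholders.
URL = ("https://XXX-sweden.openai.azure.com/openai/deployments/"
       "gpt-4o-mini/chat/completions?api-version=2023-09-01-preview")
HEADERS = {"Content-Type": "application/json", "api-key": "XXX"}
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]  # the 4 unit-test questions (placeholders)
WAIT_SECONDS = 10  # compare 10 vs. 70 to reproduce the effect

for question in QUESTIONS:
    start = time.monotonic()
    resp = requests.post(URL, headers=HEADERS, json={
        "messages": [{"role": "user", "content": question}],
    })
    print(f"HTTP {resp.status_code}, {time.monotonic() - start:.1f}s")
    time.sleep(WAIT_SECONDS)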