API for gpt-4-1106-preview extremely slow
When we make API calls to the gpt-4-1106-preview model, the average response time is around 60 seconds. When we use the chat GUI in Azure AI Studio with the same model and the same parameters, the response takes 10-20 seconds. What can we do to speed up the API? We have already tried tuning the temperature, max_tokens and top_p parameters and minimizing the content filters, but none of them make a significant difference.
Example API call:
time curl -X POST -H "Content-Type: application/json" -H "api-key: XXX" -d '{
"messages": [
{
"role": "user",
"content": "What does a cow eat?"
}
],
"model": "gpt-4-1106-preview",
"stream": true,
"temperature": 0.7,
"frequency_penalty": 0,
"presence_penalty": 0
}' "https://XXX-sweden.openai.azure.com/openai/deployments/gpt-4-1106-preview/chat/completions?api-version=2023-09-01-preview"
.....
data: [DONE]
real 1m7,174s
user 0m0,079s
sys 0m0,024s
15 answers
-
Same issue here. GPT-3.5 is not good enough, and GPT-4 is far too slow to be usable. Please fix it ASAP or we will need to find another LLM.
-
Saurabh Sharma 23,816 Reputation points Microsoft Employee
2024-01-17T15:29:36.5633333+00:00 @Marijn Otte Thanks for sharing the details. I will check it in the above regions. However, please note that GPT-4 is much slower than GPT-3.5; the documentation on latency lists gpt-35-turbo as the fastest model. The same documentation also describes the factors you can control to improve performance, such as model selection, generation size and max tokens, and streaming. Thanks, Saurabh
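For illustration, here is a minimal sketch of two of those factors (model selection and capping generation size), assuming the legacy pre-1.0 openai Python SDK used elsewhere in this thread; the endpoint, key and deployment name are placeholders:

import openai

# Sketch only: legacy openai<1.0 SDK; endpoint, key and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://XXX-sweden.openai.azure.com"
openai.api_version = "2023-09-01-preview"
openai.api_key = "XXX"

response = openai.ChatCompletion.create(
    engine="gpt-35-turbo",  # model selection: the latency docs list gpt-35-turbo as the fastest model
    messages=[{"role": "user", "content": "What does a cow eat?"}],
    max_tokens=100,  # generation size: latency grows with the number of tokens generated
)
print(response["choices"][0]["message"]["content"])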
-
yue mao 0 Reputation points
2024-05-11T02:32:47.5333333+00:00 The response time of GPT-4 has increased.
-
Brett H 5 Reputation points
2024-05-11T09:06:43.6666667+00:00 I agree with all of the above comments about the slowness of gpt-4-1106-preview. However, utilising streaming with GPT-4o significantly improves things.
stream = openai.ChatCompletion.create(
    engine=XXX,  # Azure deployment name
    max_tokens=XXX,
    temperature=XXX,
    messages=XXX,
    stream=True,
)
By setting stream=True, consuming the response in chunks, and continuously refreshing the screen, the model's response starts coming back gradually after a few seconds; a sketch of that loop follows below.
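A minimal sketch of that chunk-handling loop, again assuming the legacy pre-1.0 openai Python SDK, with placeholders for the Azure endpoint, key and deployment name:

import openai

# Sketch only: legacy openai<1.0 SDK; endpoint, key and deployment name are placeholders.
openai.api_type = "azure"
openai.api_base = "https://XXX-sweden.openai.azure.com"
openai.api_version = "2023-09-01-preview"
openai.api_key = "XXX"

stream = openai.ChatCompletion.create(
    engine="gpt-4o",  # Azure deployment name (placeholder)
    messages=[{"role": "user", "content": "What does a cow eat?"}],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for chunk in stream:
    choices = chunk["choices"]
    if choices:
        content = choices[0]["delta"].get("content")
        if content:
            print(content, end="", flush=True)
print()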
-
Jarmo Hämäläinen 0 Reputation points
2024-10-02T13:41:40.11+00:00 Today I tested GPT-4o mini. My unit test runs a conversation of 4 questions.
With a 10-second wait between server calls, the response times are about 3, 60, 60, 60 seconds.
With a 70-second wait between server calls, the response times are about 3, 3, 3, 3 seconds.
It seems an extra ~57 seconds is added to the response time when the calls come too quickly (the 10-second wait).
Is this a feature, a bug, or something else?
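For anyone trying to reproduce this, here is a rough sketch of such a timing loop using the requests library against the thread's Azure endpoint; the URL, key, API version, deployment name and questions are all placeholders:

import time
import requests

# Sketch only: URL, key, api-version, deployment name and questions are placeholders.
URL = ("https://XXX-sweden.openai.azure.com/openai/deployments/"
       "gpt-4o-mini/chat/completions?api-version=2023-09-01-preview")
HEADERS = {"Content-Type": "application/json", "api-key": "XXX"}
QUESTIONS = ["Q1", "Q2", "Q3", "Q4"]  # the 4 unit-test questions (placeholders)
WAIT_SECONDS = 10  # compare 10 vs. 70 to reproduce the effect

for question in QUESTIONS:
    start = time.monotonic()
    resp = requests.post(URL, headers=HEADERS, json={
        "messages": [{"role": "user", "content": question}],
    })
    print(f"HTTP {resp.status_code}, {time.monotonic() - start:.1f}s")
    time.sleep(WAIT_SECONDS)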