Increased latency after switching to GPT-4 version 1106-Preview

Anna-Marie Barborikova 40 Reputation points
2023-12-13T12:26:16.7566667+00:00

We recently switched from GPT-4 version 0613 to version 1106-Preview and have experienced a significant increase in latencies. We have deployed our models in Sweden Central and East US 2, and the latency increase is similar for both regions. We are unsure if this is due to the new version being in preview or if it is related to the model's workload in the given location. Can anyone help provide some insight? Thank you.

Latency in East US 2 for GPT-4 version 0613:

[Screenshot: latency in East US 2 for GPT-4 version 0613]

Latency in East US 2 for GPT-4 version 1106-Preview:

[Screenshot: latency in East US 2 for GPT-4 version 1106-Preview]


1 answer

  1. navba-MSFT 27,540 Reputation points Microsoft Employee Moderator
    2023-12-14T09:40:54.7333333+00:00

    @Anna Dombajova Thanks for sharing the details. Please note that this increase in latency is expected behavior.

    Reason:
    The latency is expected, given that GPT-4 version 1106-Preview has more capacity. As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.

    Action Plan:
    This article discusses how to improve latency performance with the Azure OpenAI service.

    Here are some of the best practices to lower latency:

    • Model latency: If model latency is important to you, we recommend trying out our latest models in the GPT-3.5 Turbo model series.
    • Lower max tokens: OpenAI has found that, even when the total number of tokens generated is similar, the request with the higher value set for the max_tokens parameter has more latency.
    • Lower total tokens generated: The fewer tokens generated, the faster the overall response will be. Think of it as a for loop where n tokens means n iterations: reduce the number of tokens generated and the overall response time improves accordingly.
    • Streaming: Enabling streaming can help manage user expectations by letting the user see the model's response as it is being generated, rather than waiting until the last token is ready (see the sketch after this list).
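
    To illustrate the last two points, here is a minimal sketch that caps max_tokens and enables streaming. It assumes the openai Python SDK v1.x, environment variables AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY, and a deployment named "gpt-4-1106-preview"; replace these with your own values, and treat the specific max_tokens cap and api_version as illustrative only.

    ```python
    import os
    from openai import AzureOpenAI

    # Client for an Azure OpenAI resource (endpoint/key names are assumptions).
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com/
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2023-12-01-preview",
    )

    # Keep max_tokens as low as the use case allows: higher caps add latency even
    # when the number of tokens actually generated is similar.
    stream = client.chat.completions.create(
        model="gpt-4-1106-preview",  # your deployment name (assumption)
        messages=[{"role": "user", "content": "Summarize these latency findings in two sentences."}],
        max_tokens=150,              # hypothetical cap; tune to your workload
        stream=True,                 # return tokens as they are generated
    )

    # Print tokens as they arrive so users see output before the full response is done.
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    ```

    Streaming does not reduce total generation time, but it shortens the time to the first visible token, which is usually what users perceive as latency.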

    Please let me know if you have any follow-up questions. I would be happy to answer them.

