Sudhindra Kulkarni, greetings and welcome to the Microsoft Q&A forum!
> However, the performance is very poor. For a simple question, the model is taking 1 minute 40 seconds to come up with an answer, and the client is not going to like it. Please suggest the best possible solution to this.
If you are using a GPT-4 model, then some latency is expected, since GPT-4 has more capacity than the GPT-3.5 version.
As of now, we do not offer Service Level Agreements (SLAs) for response times from the Azure OpenAI service.
This article discusses how to improve latency performance with the Azure OpenAI service.
Here are some of the best practices to lower latency:
- Model latency: If model latency is important to you, we recommend trying out our latest models in the GPT-3.5 Turbo model series.
- Lower max tokens: OpenAI has found that, even in cases where the total number of tokens generated is similar, the request with the higher value set for the max_tokens parameter will have more latency.
- Lower total tokens generated: The fewer tokens generated, the faster the overall response will be. Remember, this is like having a for loop with `n tokens = n iterations`: lower the number of tokens generated and overall response time will improve accordingly.
- Streaming: Enabling streaming can be useful in managing user expectations in certain situations by allowing the user to see the model response as it is being generated, rather than having to wait until the last token is ready (see the sketch after this list).
- Content filtering: Content filtering improves safety, but it also adds latency. Evaluate whether any of your workloads would benefit from modified content filtering policies.
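To illustrate the last two points, here is a minimal sketch using the openai Python SDK (v1.x) that caps max_tokens and streams the response. The environment variable names, API version, and the deployment name `gpt-35-turbo` are placeholders; substitute your own values.

```python
# Minimal sketch: lower max_tokens + streaming against Azure OpenAI.
# Assumes the openai Python SDK v1.x and placeholder env vars/deployment name.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Stream the response so the user sees tokens as they are generated,
# and cap max_tokens so generation stops early.
stream = client.chat.completions.create(
    model="gpt-35-turbo",  # your deployment name, not the base model name
    messages=[{"role": "user", "content": "Summarize Azure OpenAI in two sentences."}],
    max_tokens=150,        # lower cap -> fewer tokens generated -> lower latency
    stream=True,
)

for chunk in stream:
    # Azure may send chunks with empty choices (e.g. content filter results), so guard first.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Streaming does not reduce total generation time, but it greatly improves perceived latency, because the user starts reading after the first token instead of waiting for the last one.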
> Even if I have to raise a ticket for this, where do I do that? I am not able to see the option to create a ticket in the Azure portal.
Please see how to Create an Azure support request, or how to create a support ticket directly from the Azure portal.
Please let me know if you have any follow-up questions; I would be happy to answer them.
If the response helped, please click Accept Answer and Yes for "Was this answer helpful?".
Doing so helps other community members with a similar issue identify the solution. I highly appreciate your contribution to the community.