Improving streaming performance of OpenAI GPT endpoints in Azure

Mattias Stahre 0 Reputation points
2023-12-22T11:52:01.3833333+00:00

Hello,

I am currently testing OpenAI through Azure, but I am experiencing issues with streaming performance. Instead of receiving each word streamed one by one, I get several chunks at once, which makes the stream appear batchy. Each chunk contains a single word as expected, but the packets from the endpoint arrive in bursts rather than continuously.

To clarify, I have confirmed the same behavior when streaming the response with a curl command. Are there any suggestions for optimizing streaming performance for OpenAI GPT endpoints in Azure? Thank you.
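One thing worth checking on the client side: Azure OpenAI streams responses as server-sent events (SSE), and several `data:` events can be coalesced into a single network read, which makes the stream look batchy even though each event carries only one token. A minimal sketch of an incremental SSE parser (a hypothetical helper, not part of any Azure SDK) that handles coalesced and partial reads:

```python
def parse_sse_events(buffer: str):
    """Split a raw SSE buffer into complete event payloads plus a leftover
    remainder. Events are separated by a blank line; an event whose
    terminating blank line has not arrived yet stays in the remainder."""
    events = []
    while "\n\n" in buffer:
        raw, buffer = buffer.split("\n\n", 1)
        for line in raw.splitlines():
            if line.startswith("data: "):
                events.append(line[len("data: "):])
    return events, buffer

# One TCP read may carry several events at once, plus a partial one:
chunk = "data: Hello\n\ndata: world\n\ndata: !"
events, remainder = parse_sse_events(chunk)
# events == ["Hello", "world"]; "data: !" stays buffered until its
# terminating blank line arrives in a later read.
```

If your client only processes one event per network read, coalesced packets will surface as bursts of words; a loop like the one above drains every complete event from each read.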

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. Saurabh Sharma 23,846 Reputation points Microsoft Employee Moderator
    2023-12-22T23:38:36.2133333+00:00

    Hi @Mattias Stahre

    Welcome to Microsoft Q&A! Thanks for posting the question.

    Can you please try adjusting the max_tokens parameter when making requests to the OpenAI GPT endpoint? The max_tokens parameter controls the maximum number of tokens the GPT model can generate in a single request. By reducing max_tokens, you can increase the frequency of requests and potentially improve streaming performance.
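    As a sketch of the suggestion above, here is a hypothetical request body for the chat completions endpoint with both streaming enabled and a reduced max_tokens cap (the message content and the cap value are placeholders for illustration):

    ```python
    import json

    # Hypothetical Azure OpenAI chat completions payload; the endpoint URL,
    # deployment name, and API key would be supplied separately in the request.
    payload = {
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,      # ask the service to stream tokens as SSE events
        "max_tokens": 256,   # cap generation length for each request
    }

    body = json.dumps(payload)
    ```

    The same two fields can be passed on a curl command line as the JSON request body.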

    Please check the Azure OpenAI Service performance & latency article, as it provides background on how latency works with Azure OpenAI and how to optimize your environment to improve performance.

    Please let me know in case you still see any issues.

    Thanks

    Saurabh

