Welcome to Microsoft Q&A! Thanks for posting the question.
Can you please try adjusting the max_tokens parameter when making requests to the OpenAI GPT endpoint? The max_tokens parameter caps the number of tokens the model can generate in a single request. Because generation time grows with the number of tokens produced, lowering max_tokens shortens each response, which lets you make requests more frequently and can improve streaming performance.
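Here is a minimal sketch of what that looks like, assuming the `openai` Python SDK (v1+) with the `AzureOpenAI` client; the endpoint, API key, API version, and deployment name below are placeholders you would replace with your own values:

```python
from openai import AzureOpenAI

# Placeholder credentials -- substitute your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # Azure deployment name
    messages=[{"role": "user", "content": "Summarize our latency tips."}],
    max_tokens=256,   # cap generated tokens; lower values return faster
    stream=True,      # receive partial tokens as they are generated
)

# Print tokens as they arrive; Azure may emit chunks with empty choices,
# so guard before reading the delta.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

A reasonable starting point is to set max_tokens just above the longest response you actually need, rather than leaving it at a large default.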
Please check the Azure OpenAI Service performance & latency article, as it provides background on how latency works with Azure OpenAI and how to optimize your environment to improve performance.
Please let me know in case you still see any issues.
Thanks,
Saurabh