Azure GPT-4o streaming sends multiple chunks at once in short bursts, rather than a smooth continuous flow.
We're using GPT-4o for a streaming service where a smooth, real-time response is important.
Both Azure and OpenAI send well-broken chunks for GPT-4o streaming completions.
However, there is an issue on Azure.
OpenAI GPT-4o sends chunks smoothly without getting stuck.
Azure GPT-4o sends multiple chunks at once in a short time, then stops sending for 1~2 seconds between bursts. So it's not usable for a real-time service like a chatbot.
Please refer to the attached image.
Even though I disabled the content filter, the problem persisted.
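To show the difference concretely, here is a small sketch of how I measured the burstiness: a helper that records the gap between consecutive chunks of any stream iterator. The real Azure stream from the `openai` SDK (`client.chat.completions.create(..., stream=True)`) can be passed in as `chunk_iter`; below it is demonstrated with a simulated bursty stream so it runs without credentials.

```python
import time

def measure_chunk_gaps(chunk_iter):
    """Record the inter-arrival gap (seconds) for each chunk in a stream.

    Long pauses followed by bursts of near-zero gaps suggest something
    between the client and the model is buffering the SSE stream instead
    of forwarding chunks as they arrive.
    """
    gaps = []
    last = time.monotonic()
    for _ in chunk_iter:
        now = time.monotonic()
        gaps.append(now - last)
        last = now
    return gaps

# Simulated bursty stream standing in for the Azure GPT-4o response:
def bursty_stream():
    for _ in range(3):           # three bursts
        time.sleep(0.2)          # pause before each burst
        for _ in range(5):       # chunks arrive back-to-back
            yield "chunk"

gaps = measure_chunk_gaps(bursty_stream())
pauses = sum(1 for g in gaps if g > 0.1)
print(len(gaps), pauses)  # 15 chunks, 3 long pauses (one per burst)
```

Against OpenAI directly, the gaps are small and roughly uniform; against Azure through APIM (Developer tier), I saw exactly this burst pattern.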
However, this issue was resolved by changing the APIM tier (Developer -> Standard v2). The architecture is 'Client -> API Management -> OpenAI'. The throughput limit (500 requests/sec) was more than enough and usage was under 5%, so I never suspected APIM could be the cause.
I've worked around the problem by changing the APIM tier, but I'd like to understand the reason.
It seems that the chunks returned by OpenAI were accumulated in APIM and sent all at once, but I don't know the exact cause.
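One possibility worth checking (I haven't confirmed this is the root cause in my setup): APIM's `forward-request` policy has a `buffer-response` attribute, and when responses are buffered, chunked SSE responses can be accumulated before being forwarded to the client. A policy sketch that disables buffering on the backend call would look like this:

```xml
<policies>
    <backend>
        <!-- buffer-response="false" forwards chunks to the client as they
             arrive from the OpenAI backend instead of accumulating them -->
        <forward-request timeout="120" buffer-response="false" />
    </backend>
</policies>
```

If the Developer tier was buffering (or simply had less capacity to flush chunks promptly) while Standard v2 does not, that would explain why only the tier change fixed the burstiness.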