@Gokulraj A (LTIMINDTREE LIMITED) Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!
Please note that Azure OpenAI does not have a specific Service Level Agreement (SLA) for latency. The SLA instead covers availability, which is guaranteed at 99.9%: the emphasis is on keeping the service accessible and operational rather than on guaranteeing specific performance metrics.
Suggestion 1:
This article describes how to optimize latency and diagnose latency-related performance issues.
Suggestion 2:
GPT-4 version 0125-preview is an updated version of the GPT-4 Turbo preview previously released as version 1106-preview. Customers may find that gpt-4-0125-preview generates more output than gpt-4-1106-preview; it also addresses bugs found in gpt-4-1106-preview.
So my recommendation would be to use the GPT-4 version 0125-preview model instead. However, regional availability for this model is limited to eastus, northcentralus, and southcentralus. More info here.
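As a minimal sketch, a chat completions request against a deployment of the 0125-preview model is identical to one against 1106-preview; only the deployment you target changes. The endpoint, deployment name, and API version below are placeholders chosen for illustration, not values from this thread:

```python
import json
import os
import urllib.request

# Placeholders -- substitute your own resource values.
endpoint = "https://YOUR-RESOURCE.openai.azure.com"
deployment = "gpt4-0125-preview"   # the name you gave your deployment
api_version = "2024-02-01"         # a GA chat-completions API version

url = (f"{endpoint}/openai/deployments/{deployment}"
       f"/chat/completions?api-version={api_version}")
body = {"messages": [{"role": "user", "content": "Hello"}]}

def send(payload):
    """POST the chat request; requires AZURE_OPENAI_KEY in the environment."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "api-key": os.environ["AZURE_OPENAI_KEY"]},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# send(body)  # uncomment once the deployment and key are in place
```

Because the deployment name is just a path segment, switching model versions requires no code change beyond pointing at the new deployment.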
Suggestion 3:
Try with streaming enabled (stream = true)
Enabling streaming reduces the time to the first response because the completion arrives in chunks as it is generated. In the chat completions API the flag is the user-controlled stream parameter (some SDK wrappers expose it under a name such as isStreaming). Optimize this setting to improve perceived performance: with streaming enabled, users start receiving data sooner, which can improve the user experience even if the overall end-to-end latency remains unchanged. More info here.
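To illustrate, here is a minimal sketch of what streaming looks like at the REST level. The stream flag in the request body and the "data: {...}" server-sent-events chunk shape come from the chat completions API; the helper name and the sample chunk are illustrative assumptions:

```python
import json

# Same request body as a normal chat call, with streaming switched on.
payload = {
    "messages": [{"role": "user", "content": "Explain streaming briefly."}],
    "stream": True,   # tokens arrive as server-sent-event chunks
}

def handle_sse_line(line):
    """Extract the text delta from one 'data: {...}' SSE line, if any."""
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        return None
    chunk = json.loads(line[len(b"data: "):])
    choices = chunk.get("choices") or []
    if choices:
        return choices[0].get("delta", {}).get("content")
    return None

# Example: one chunk shaped the way the service streams it.
sample = b'data: {"choices":[{"delta":{"content":"Hel"}}]}'
print(handle_sse_line(sample))  # -> Hel
```

Printing each delta as it arrives is what makes the response feel fast: the first tokens reach the user while the rest are still being generated.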
Suggestion 4:
Tune max_tokens
A smaller max_tokens value leads to a faster response because it caps the length of the generated completion. max_tokens is another user-controlled parameter in the chat API call. Optimize it carefully to strike the right balance between response speed and answer completeness; adjusting this parameter can significantly impact response time, so fine-tune it for your specific use case and requirements.
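As a small illustration, capping completion length is just a matter of setting max_tokens in the request body; the value 64 below is an arbitrary example, not a recommendation:

```python
# Chat request body with a capped completion length.
# "max_tokens" limits how many tokens the model may generate,
# which bounds generation time; too small a cap truncates answers.
payload = {
    "messages": [{"role": "user", "content": "Give a one-line summary."}],
    "max_tokens": 64,   # illustrative cap -- tune for your use case
    "temperature": 0,
}
```

Since generation time grows with the number of output tokens, start with a cap slightly above the longest answer you actually need, then measure and adjust.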
Suggestion 5:
If none of the above suggestions helps and you still wish to troubleshoot this issue, then, since you are internal to Microsoft, I am sharing the internal logs and the queries you can run against them to identify the root cause.
Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.