Azure OpenAI's API response became increasingly slow. (eastjp)

Question

Dear Azure Support Team, I am currently developing a new service utilizing OpenAI versions 3.5 and 4.0 within the East JP Region of Azure. Initially, when dispatching messages with a volume of approximately 2,000 to 3,000 tokens, the response time was consistently under 5 seconds. However, I've recently observed a significant increase in response times, now ranging between 10 to 15 seconds or more. Additionally, there seems to be a considerable variance in response speeds with each request. I am reaching out to inquire if there are any known issues that might be causing this slowdown in response times. If there are no reported problems, I would appreciate any suggestions or solutions that could help in resolving or mitigating this issue. Thank you for your assistance. Best regards,

Answer

homin wang Greetings & Welcome to Microsoft Q&A forum!

There are no known issues reported at the moment, there are a few things you can do to troubleshoot and mitigate the issue.

You can check if there are any changes in the usage of your Azure resources that might be causing the slowdown. You can use the Azure portal to monitor your resource usage and costs. You can also use the Azure Monitor to collect and analyze metrics and logs for your Azure resources.

This can help you identify any spikes in resource usage or other issues that might be causing the slowdown.

Please check the latency metrics and check which API operation is consuming more time.

You can open the Azure OpenAI resource from your portal and navigate to the metrics section and apply the splitting for the latency metrics and check which API / operationName was time consuming? User's image

I would appreciate any suggestions or solutions that could help in resolving or mitigating this issue.

I would suggest you, check the documentation to Improve performance.

Here are some of the best practices to lower latency:

Model latency: If model latency is important to you we recommend trying out our latest models in the GPT-3.5 Turbo model series.
Lower max tokens: OpenAI has found that even in cases where the total number of tokens generated is similar the request with the higher value set for the max token parameter will have more latency.
Lower total tokens generated: The fewer tokens generated the faster the overall response will be. Remember this is like having a for loop with n tokens = n iterations. Lower the number of tokens generated and overall response time will improve accordingly.
Streaming: Enabling streaming can be useful in managing user expectations in certain situations by allowing the user to see the model response as it is being generated rather than having to wait until the last token is ready.
Content Filtering improves safety, but it also impacts latency. Evaluate if any of your workloads would benefit from modified content filtering policies.

Do let me know if that helps or have any other queries.

If the response helped, please do click Accept Answer and Yes for was this answer helpful.

Doing so would help other community members with similar issue identify the solution. I highly appreciate your contribution to the community.

Share via

Azure OpenAI's API response became increasingly slow. (eastjp)

1 answer