I'm noticing rate limits and throttling issues during high usage.

adan ameen 0 Reputation points
2025-03-25T21:36:36.1866667+00:00

I'm working on integrating Azure OpenAI GPT-4o into my chatbot, but I'm noticing rate limits and throttling issues during high usage.

Even though I’ve checked my quota limits in the Azure portal, I still get 429 errors (Too Many Requests) when multiple users interact with the bot simultaneously. Would increasing my SKU tier help, or is there a way to optimize requests for better performance?

Azure AI services

2 answers

  1. Azar 29,520 Reputation points MVP Volunteer Moderator
    2025-03-25T22:32:38.98+00:00

    Hi there

    Upgrading to a higher SKU tier can help, but first check your quota limits in the Azure portal and request an increase if needed. Try batching requests, reducing unnecessary API calls, and implementing caching for frequently used responses (a small sketch is shown below). Also, use the Azure OpenAI rate-limit headers to monitor usage patterns and adjust accordingly. If traffic is unpredictable, a queueing mechanism can help distribute requests more efficiently.
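
    Here is a minimal caching sketch, assuming the openai Python package (v1.x); the deployment name "gpt-4o", the API version, and the environment variable names are placeholders for illustration, so adjust them to your own setup.

    ```python
    # Minimal sketch: cache completions for identical prompts so repeated
    # questions don't consume extra TPM/RPM quota.
    import os
    from functools import lru_cache

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",  # placeholder, use the version you target
    )

    @lru_cache(maxsize=512)
    def cached_answer(prompt: str) -> str:
        """Return a cached completion for identical prompts."""
        response = client.chat.completions.create(
            model="gpt-4o",  # your deployment name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
        )
        return response.choices[0].message.content

    # Identical questions from different users now hit the cache, not the API.
    print(cached_answer("What are your opening hours?"))
    print(cached_answer("What are your opening hours?"))  # served from cache
    ```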

    If this helps, kindly accept the answer. Thanks.

    1 person found this answer helpful.

  2. VSawhney 800 Reputation points Microsoft External Staff Moderator
    2025-03-26T11:37:40.5766667+00:00

    Hello adan ameen,

    When a deployment is created, the assigned TPM will directly map to the tokens-per-minute rate limit enforced on its inferencing requests. A Requests-Per-Minute (RPM) rate limit will also be enforced whose value is set proportionally to the TPM assignment using the following ratio:

    6 RPM per 1000 TPM.
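
    For example, a deployment assigned 30,000 TPM would be enforced at roughly 30,000 / 1,000 × 6 = 180 RPM.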

    Because TPM can be distributed flexibly within a subscription and region, the Azure OpenAI Service has been able to loosen other restrictions. To reduce 429 errors, consider the following:

    1. Increase the TPM assigned to your model deployment to get a higher RPM limit and a higher threshold before requests start failing.
    2. Deploy to multiple regions to handle regional outages; you can create an outage alert from Azure Status and take remedial steps accordingly.
    3. Reduce the size of your input queries and lower max_tokens.
    4. Implement retry logic with exponential backoff in your code (see the sketch after this list).
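
    Below is a minimal retry-with-exponential-backoff sketch along the lines of the cookbook reference linked below, assuming the openai Python package (v1.x); the deployment name, API version, and backoff parameters are illustrative placeholders.

    ```python
    # Minimal sketch: retry throttled (429) requests with exponential backoff
    # and jitter instead of failing immediately.
    import os
    import random
    import time

    from openai import AzureOpenAI, RateLimitError

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",  # placeholder, use the version you target
    )

    def chat_with_backoff(messages, max_retries=5):
        """Call the deployment, backing off exponentially when throttled."""
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="gpt-4o",  # your deployment name
                    messages=messages,
                    max_tokens=256,
                )
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                # Sleep for an exponentially growing interval plus jitter.
                time.sleep(delay + random.uniform(0, delay))
                delay *= 2

    reply = chat_with_backoff([{"role": "user", "content": "Hello!"}])
    print(reply.choices[0].message.content)
    ```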

    Reference - https://cookbook.openai.com/examples/how_to_handle_rate_limits

    Please don’t forget to click Accept Answer and select Yes for "Was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.

    Thank you!

    1 person found this answer helpful.
