Not fully understanding quota on Azure OpenAI services

Luke Field 20 Reputation points
2024-07-04T02:23:46.2366667+00:00

I am currently trying to build a chatbot over enterprise data through a web app in Azure. Initially I wasn't having any problems, but recently, now that I am in the final steps of the process, I am constantly confronted with this error:

Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-05-01-preview have exceeded token rate limit of your current AIServices S0 pricing tier. Please retry after 86400 seconds. Please contact Azure support service if you would like to further increase the default rate limit.'}}

I have just increased my quota to 30,000 tokens per minute, which the portal says gives me 180 requests per minute. So I am not fully understanding why I am still hitting this limit.

Azure OpenAI Service
Azure AI services

1 answer

Sort by: Most helpful
  1. navba-MSFT 24,910 Reputation points Microsoft Employee
    2024-07-04T09:40:09.69+00:00

@Luke Field Welcome to the Microsoft Q&A Forum, and thank you for posting your query here!

    The error message you’re seeing is related to rate limiting, which is a common practice in APIs to prevent abuse and ensure fair usage. In your case, the error message indicates that you’ve exceeded the token rate limit of your current AIServices S0 pricing tier.


    Azure OpenAI’s quota feature enables you to assign rate limits to your deployments, up to a global limit called your “quota.” Quota is assigned to your subscription on a per-region, per-model basis in units of Tokens-per-Minute (TPM). More info here.

    The rate limit for the ChatCompletions_Create Operation under Azure OpenAI API version 2024-05-01-preview is determined by the number of tokens in your requests, not just the number of requests. Each request can contain a different number of tokens, depending on the length and complexity of the text. If your requests contain a large number of tokens, you could hit your rate limit even if the number of requests is within the limit.
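Since TPM accounting counts both prompt and completion tokens, it can help to sanity-check a request against your per-minute budget before sending it. The sketch below is a rough illustration only: it uses the common ~4-characters-per-token heuristic rather than a real tokenizer, and `estimate_tokens` / `fits_in_budget` are hypothetical helper names. For exact counts, use the `tiktoken` package with the encoding for your deployed model.

```python
# Rough illustration (NOT the official tokenizer): estimate tokens with the
# ~4-characters-per-token heuristic often used for quick English estimates.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real counts vary with language and content."""
    return max(1, round(len(text) / chars_per_token))

def fits_in_budget(messages: list[str], max_completion_tokens: int,
                   tpm_limit: int = 30_000) -> bool:
    """Check whether one request's estimated prompt tokens plus its
    max_tokens setting stays within the per-minute token budget."""
    prompt_tokens = sum(estimate_tokens(m) for m in messages)
    return prompt_tokens + max_completion_tokens <= tpm_limit

# A long prompt plus a large max_tokens can blow through a 30,000 TPM
# budget in a single request, even though it is only 1 of 180 "requests".
long_prompt = ["You are a helpful assistant." * 5000]
print(fits_in_budget(long_prompt, max_completion_tokens=4_000))  # → False
```

This also explains why 30,000 TPM is advertised as 180 RPM: the service assumes an average request size (roughly 1,000 tokens per 6 RPM of quota), so fewer, larger requests exhaust the same budget.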

    Background about the limits:

    Tokens-Per-Minute (TPM) and Requests-Per-Minute (RPM) rate limits for the deployment.

    TPM rate limits are based on the maximum number of tokens that are estimated to be processed by a request at the time the request is received.

    RPM rate limits are based on the number of requests received over time. The rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests may receive a 429 response even though the limit isn't met when measured over the course of a minute.

    More info here.
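Because RPM accounting expects requests to be spread evenly across the minute, the standard client-side mitigation is to retry 429s with exponential backoff and jitter instead of retrying immediately. The sketch below is a minimal, generic illustration; `call_with_backoff`, `RateLimitError`, and `request_fn` are hypothetical names, not part of the Azure OpenAI SDK (the `openai` Python SDK raises its own `RateLimitError` you would catch instead).

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the 429 error raised by the service/SDK."""

def call_with_backoff(request_fn, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry a request with exponential backoff plus jitter on 429s.

    `request_fn` is any zero-argument callable that performs the API call.
    Spreading retries out avoids the bursts that trip the
    evenly-distributed RPM accounting described above.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429 to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))

# Example with a fake request that fails twice, then succeeds:
attempts = {"n": 0}
def fake_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(fake_request, base_delay=0.01))  # → ok
```

In a real deployment you would also honor the `Retry-After` header the service returns with the 429 response, rather than relying on the computed delay alone.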


    Suggestions and best practices:

    To minimize issues related to rate limits, follow the steps outlined here.

    View and request quota:
    For an all-up view of your quota allocations across deployments in a given region, select Management > Quota in Azure AI Studio:

    Usage/Limit: For each quota name, this shows how much quota is consumed by deployments and the total quota approved for the subscription and region. The amount of quota used is also represented in the bar graph.

    You can also leverage the Usage metrics to check your current usage:

    Hope this helps. If you have any follow-up questions, please let me know. I would be happy to help.

