Azure OpenAI Rate-Limiting Error

Question

Khawar Habib 0

I have deployed an Azure OpenAI service with gpt-35-turbo (0301) and set the tokens-per-minute limit to 1K, and it shows a limit of approximately 6 requests per minute.

In my first request, I used only 223 tokens in total. Here is the usage section of the response:

"usage": {
    "completion_tokens": 193,
    "prompt_tokens": 30,
    "total_tokens": 223
}

When I attempted to verify it using Postman on the subsequent request, I encountered the following error. Could someone please explain how it is exceeding the rate limit?

Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-02-15-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 6 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.

Sina Salam 22,031 Reputation points Volunteer Moderator

2024-05-02T13:49:52.04+00:00

Hello Khawar Habib

Welcome to the Microsoft Q&A and thank you for posting your questions here.

Problem

I understand that you are facing rate-limiting errors while making requests to the Azure OpenAI service despite adhering to the token limit per minute. The error message indicates that requests have exceeded the call rate limit for the current pricing tier, prompting you to seek clarification on why the limit is being surpassed.

Scenario

You deployed Azure OpenAI service with the gpt-35-turbo(0301) model and set the token limit per minute to 1K. During initial testing, the first request utilised only 223 tokens. However, subsequent requests resulted in exceeding the rate limit, triggering an error message. You attempted to verify the issue using Postman but encountered the same error.

Solution

This solution is based on the scenario and questions above, with a focus on the problem statement.

The message you received indicates that your requests were too frequent, surpassing the allowed limit for your current pricing level (S0). Even though your first request only used 223 tokens, subsequent ones might have gone over the limit. It suggests waiting 6 seconds before trying again, showing that the limit is being enforced.

To solve this, check your application's setup to ensure it's not sending more requests than it should within the set limit. Keep an eye on how your application is being used to spot any unexpected increases in requests that could cause these errors.

If you keep running into these errors and need more allowance, think about upgrading your pricing level or requesting a higher limit through the Azure portal. Make sure to keep an eye on your application's request patterns and adjust them as needed to stay within the allowed limits.

Finally

The rate limits for Azure OpenAI are typically defined in terms of Tokens-Per-Minute (TPM). Even though you’ve set a limit of 1,000 tokens per minute and your first request only used 223 tokens, the rate limit also considers the number of requests per minute. If the service evaluates the request rate over a short period, such as 1 or 10 seconds, and you send multiple requests within this window, you could exceed the rate limit even if the total token count is below 1,000.

For example, if the service monitors with a 1-second interval and your deployment allows for 600 RPM (requests per minute), you would be throttled if more than 10 requests are received in any given second (see Source 2). This could explain why you’re seeing the rate-limiting error despite not exceeding the token limit.
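The arithmetic behind that example can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the ratio of 6 RPM per 1,000 TPM is taken from the figures in the question, and spreading the per-minute allowance evenly across evaluation windows is an assumption about how the service enforces it.

```python
def requests_per_window(rpm: int, window_seconds: int) -> int:
    """Requests allowed in one evaluation window if RPM is spread evenly."""
    return rpm * window_seconds // 60


def rpm_from_tpm(tpm: int, rpm_per_1000_tpm: int = 6) -> int:
    """RPM implied by a TPM setting (assumed ratio: 6 RPM per 1,000 TPM)."""
    return tpm * rpm_per_1000_tpm // 1000


# The 600 RPM example, evaluated every second.
print(requests_per_window(600, 1))                   # 10

# The deployment in the question: 1K TPM, evaluated every 10 seconds.
print(rpm_from_tpm(1000))                            # 6
print(requests_per_window(rpm_from_tpm(1000), 10))   # 1
```

Under these assumptions, a second request sent within the same 10-second window would be throttled no matter how few tokens the first one used.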

References

Source: Azure OpenAI Service quotas and limits - Azure AI services. Accessed, 5/2/2024.

Source 2: Optimizing Azure OpenAI: A Guide to Limits, Quotas, and Best Practices. Accessed, 5/2/2024.

Also, endeavor to read more from the additional resources provided by the right side of this page.

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

** Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful ** so that others in the community facing similar issues can easily find the solution.

Best Regards,

Sina Salam
AshokPeddakotla-MSFT 35,971 Reputation points Moderator

2024-05-16T03:41:36.4566667+00:00

Khawar Habib Did you get a chance to see above response?

Do let us know if that helps or have any further queries.
AshokPeddakotla-MSFT 35,971 Reputation points Moderator

2024-05-29T12:14:47.32+00:00
Khawar Habib, to give more context: as each request is received, Azure OpenAI computes an estimated max processed-token count that includes the following:

Prompt text and count

The max_tokens parameter setting

The best_of parameter setting

As requests come into the deployment endpoint, the estimated max-processed-token count is added to a running token count of all requests that is reset each minute. If at any time during that minute, the TPM rate limit value is reached, then further requests will receive a 429 response code until the counter resets. For more details, see Understanding rate limits.
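To make the running count concrete, here is a rough simulation. It is not the service's actual algorithm: the exact estimation formula is not given in this thread, so `estimate_max_tokens` (prompt tokens plus `max_tokens` times `best_of`) is an assumption, and `MinuteTokenCounter` is a hypothetical class that only mirrors the behaviour described above.

```python
import time


def estimate_max_tokens(prompt_tokens: int, max_tokens: int, best_of: int = 1) -> int:
    """Assumed estimate: prompt plus the largest completion the request could produce."""
    return prompt_tokens + max_tokens * best_of


class MinuteTokenCounter:
    """Running token count that resets each minute (illustrative only)."""

    def __init__(self, tpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def admit(self, estimated_tokens: int) -> int:
        """Return 200 while the limit has not been reached, else 429."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A new minute has started, so the counter resets.
            self.window_start = now
            self.used = 0
        if self.used >= self.tpm_limit:
            return 429
        self.used += estimated_tokens
        return 200


# Simulated clock so the example runs instantly.
now = [0.0]
counter = MinuteTokenCounter(tpm_limit=1000, clock=lambda: now[0])
estimate = estimate_max_tokens(prompt_tokens=30, max_tokens=1000)

print(counter.admit(estimate))  # 200: the estimate of 1030 is added to the count
print(counter.admit(estimate))  # 429: the limit was already reached
now[0] = 60.0
print(counter.admit(estimate))  # 200: the counter has reset
```

The point of the sketch is that the estimate, not the tokens actually consumed, is what counts against the limit, which is why a large max_tokens value can use up the minute's budget in a single request.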

To minimize issues related to rate limits, it's a good idea to use the following techniques:

Set max_tokens and best_of to the minimum values that serve the needs of your scenario. For example, don’t set a large max-tokens value if you expect your responses to be small.

Use quota management to increase TPM on deployments with high traffic, and to reduce TPM on deployments with limited needs.

Implement retry logic in your application.

Avoid sharp changes in the workload. Increase the workload gradually.

Test different load increase patterns.
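The retry-logic suggestion could look something like the sketch below. It assumes the 429 response carries a Retry-After header with a number of seconds (worth verifying against your own responses) and falls back to exponential backoff otherwise. `call_with_retry`, `retry_delay`, and the `send` callable are hypothetical names, not part of any Azure SDK.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status code, headers, body)


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Prefer the server's Retry-After hint; otherwise back off exponentially."""
    # Note: a plain dict lookup is case-sensitive; HTTP libraries often
    # provide case-insensitive header mappings.
    hint = headers.get("Retry-After")
    if hint is not None:
        try:
            return min(float(hint), cap)
        except ValueError:
            pass  # Not a number of seconds; fall back to backoff.
    return min(base * (2 ** attempt), cap)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it stops returning 429 or the attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, headers, body
        sleep(retry_delay(headers, attempt))
    raise ValueError("max_attempts must be at least 1")


# Simulated responses: one 429 asking for a 6-second wait, then success.
responses = iter([(429, {"Retry-After": "6"}, ""), (200, {}, "ok")])
waits = []
print(call_with_retry(lambda: next(responses), sleep=waits.append))  # (200, {}, 'ok')
print(waits)  # [6.0]
```

Passing `sleep` as a parameter keeps the example testable without real waiting; in an application the default `time.sleep` would be used.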

Hope this helps. Do let me know if you have any further queries.
Khawar Habib 0 Reputation points

2024-05-29T12:52:50.1366667+00:00

Thank you @Sina Salam for your response. We are currently experiencing breaches in the rate limit due to the volume of requests within a 10-second timeframe. Although tokens are available, as previously mentioned, the issue persists. I have configured 1000 tokens, allowing for an estimated 6 requests per minute. Therefore, Azure OpenAI service anticipates only 1 request every 10 seconds. Sending subsequent requests, regardless of token availability, will result in a breach of the rate limit based on the number of requests rather than tokens. Thanks again.
Sina Salam 22,031 Reputation points Volunteer Moderator

2024-05-29T15:15:52.7166667+00:00
Hi @Khawar Habib

Thank you for the concise information.

The rate limit is not only determined by the number of tokens but also by the number of requests per unit of time. Even though you have a large number of tokens available, if the number of requests exceeds the allowed limit within a specific timeframe (in this case, 10 seconds), you will still encounter rate limit breaches.

In your case, if the service anticipates only 1 request every 10 seconds, and you're sending more than this, it will indeed result in a breach of the rate limit.

The best-practice solution is to implement a delay or queue system in your application to control the rate of requests sent to the Azure OpenAI service.

For example, you can use a fixed interval delay:

    import time

    import requests


    def rate_limited_request(url, params, headers, interval=10):
        """Send a GET request, waiting and retrying while the service returns 429."""
        while True:
            response = requests.get(url, params=params, headers=headers)
            if response.status_code == 429:  # Rate limit exceeded
                print(f"Rate limit exceeded. Waiting for {interval} seconds...")
                time.sleep(interval)
            else:
                return response.json()


    # Usage (placeholder URL, parameters, and token)
    url = "https://api.example.com/data"
    params = {"query": "example"}
    headers = {"Authorization": "Bearer YOUR_TOKEN"}

    # Send requests at a fixed interval of 10 seconds
    for _ in range(6):
        response = rate_limited_request(url, params, headers)
        print(response)
        time.sleep(10)  # Wait for 10 seconds before sending the next request
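The fixed-interval loop above reacts only after a 429 arrives. As a complement, the sketch below spaces requests out before they are sent, so that a deployment limited to 6 RPM sees at most one request every 10 seconds. `RequestPacer` is a hypothetical helper, and even spacing is an assumption about what the service expects.

```python
import time


class RequestPacer:
    """Spaces calls so that at most `rpm` requests start per minute."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> float:
        """Block until the next slot opens; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Reserve the following slot one interval after this one.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay


# Demonstration with a simulated clock so nothing actually sleeps.
fake_time = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_time[0] += seconds


pacer = RequestPacer(rpm=6, clock=lambda: fake_time[0], sleep=fake_sleep)
print([pacer.wait() for _ in range(3)])  # [0.0, 10.0, 10.0]
```

In an application you would create `RequestPacer(rpm=6)` with the default clock and call `pacer.wait()` immediately before each request.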

Once that is in place, also configure monitoring and logging, and you should be fine.

Best Regards,

Sina
me_v2 20 Reputation points

2024-07-28T01:18:28.0966667+00:00

The error is vague; it doesn't say if the problem is requests per minute or tokens per minute.
Sina Salam 22,031 Reputation points Volunteer Moderator

2024-07-28T20:30:28.1466667+00:00

Hi, Khawar Habib

If you have configured monitoring and logging as instructed, could you share the error message? If you haven't, please try setting that up and catching the exceptions.
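One way to gather that information is to log the status code together with any rate-limit-related response headers. The sketch below is an illustration only: it assumes the service returns headers whose names start with `x-ratelimit-` or `retry-after`, which you should confirm against your own responses, and the sample header values are made up.

```python
from typing import Dict, Mapping


def rate_limit_details(status_code: int, headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect the status code and any headers that may show which limit was hit."""
    details = {"status": str(status_code)}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith("x-ratelimit-") or lowered.startswith("retry-after"):
            details[lowered] = value
    return details


# Made-up sample headers, for illustration only.
sample_headers = {
    "Content-Type": "application/json",
    "Retry-After": "6",
    "x-ratelimit-remaining-requests": "0",
}
print(rate_limit_details(429, sample_headers))
```

With the requests library this could be called as `rate_limit_details(response.status_code, response.headers)` and the result written to your logs whenever a 429 is returned.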
