Hello Simon,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are experiencing rate limit exceeded errors when using the gpt-4o-mini model deployed in the Sweden Central region of Azure OpenAI Service.
The issue likely stems from Azure OpenAI's resource allocation behavior or a misconfiguration rather than raw token usage. The following steps should help you resolve it:
- Use a tool such as the OpenAI Tokenizer to validate and calculate token usage accurately. Token consumption includes both input tokens (system messages, prompts, user input) and output tokens (the model's response), so misestimating either side can push you over the limit. For instance, even short responses paired with lengthy prompts can breach token thresholds. Tool: https://platform.openai.com/tokenizer | More details: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
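  If you prefer to count tokens in code rather than in the web tokenizer, here is a minimal sketch using the `tiktoken` library (this assumes the gpt-4o family uses the `o200k_base` encoding; the messages are illustrative):

  ```python
  # Rough token estimate with OpenAI's tiktoken library. This is a sketch:
  # it assumes the gpt-4o family uses the o200k_base encoding and ignores
  # the small per-message overhead that the chat format adds.
  import tiktoken

  enc = tiktoken.get_encoding("o200k_base")

  system_msg = "You are a helpful assistant."
  user_msg = "Summarize our Azure OpenAI quota options."

  input_tokens = len(enc.encode(system_msg)) + len(enc.encode(user_msg))
  print(f"Approximate input tokens: {input_tokens}")
  # Budget for output tokens too: TPM limits count input *and* output,
  # so input_tokens + max_tokens should stay within your quota.
  ```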
- Azure OpenAI lets you monitor usage metrics and logs to detect spikes, which helps you track TPM and RPM consumption in near real time. Deployments are managed under the `az cognitiveservices` command group in the Azure CLI, so you can list them with:

  ```bash
  az cognitiveservices account deployment list --resource-group <Your-Resource-Group> --name <Your-Resource-Name>
  ```
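  If you want to pull the same usage numbers from the CLI rather than the portal, here is a hedged sketch with `az monitor metrics list` (the metric name below is an assumption; confirm the exact metric names exposed for your resource in the Metrics blade):

  ```bash
  # Sketch: query token-usage metrics for the resource in 1-minute intervals.
  # The metric name is an assumption — confirm available metric names in the
  # portal's Metrics blade for your Azure OpenAI resource.
  az monitor metrics list \
    --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<Your-Resource-Name>" \
    --metric "ProcessedPromptTokens" \
    --interval PT1M
  ```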
- In the Azure Portal, go to your OpenAI resource > **Metrics** for resource monitoring. Reference: https://learn.microsoft.com/en-us/cli/azure/openai
- To reduce token usage:
  - Shorten system messages and prompts.
  - Specify `max_tokens` in the API call to control output size.
  - Use the OpenAI Tokenizer to validate the reduced input/output token counts.
  For example, here is a minimal sketch using the `AzureOpenAI` client from the current `openai` Python SDK (the endpoint, key, API version, and deployment name are placeholders; adapt them to your resource). See also OpenAI's guidance on avoiding rate limit errors:

  ```python
  import os
  from openai import AzureOpenAI

  # Placeholders: point these at your own Azure OpenAI resource.
  client = AzureOpenAI(
      azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
      api_key=os.environ["AZURE_OPENAI_API_KEY"],
      api_version="2024-06-01",
  )

  response = client.chat.completions.create(
      model="gpt-4o-mini",  # your deployment name
      messages=[{"role": "user", "content": "Short prompt"}],
      max_tokens=50,  # limit the response size
  )
  ```
- You can introduce delays between requests. Azure OpenAI enforces strict per-minute rate limits, so adding intentional delays between API requests helps you stay under the RPM and TPM limits. This code sample (reusing the `client` from above) handles rate limits gracefully by retrying after a short wait:
  ```python
  import time

  from openai import RateLimitError

  def make_request_with_delay(prompt, delay=1):
      try:
          response = client.chat.completions.create(
              model="gpt-4o-mini",  # your deployment name
              messages=[{"role": "user", "content": prompt}],
              max_tokens=100,
          )
          print(response.choices[0].message.content)
      except RateLimitError:
          print("Rate limit exceeded. Retrying...")
          time.sleep(delay)  # wait before retrying
          make_request_with_delay(prompt, delay)

  make_request_with_delay("Hello!")
  ```
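  If a fixed delay still collides with the limit, exponential backoff (doubling the wait on each retry) is the usual refinement. A sketch reusing the `client` and `RateLimitError` from above:

  ```python
  # Sketch: retry with exponential backoff — the wait doubles on each attempt.
  def make_request_with_backoff(prompt, max_retries=5, base_delay=1):
      for attempt in range(max_retries):
          try:
              response = client.chat.completions.create(
                  model="gpt-4o-mini",  # your deployment name
                  messages=[{"role": "user", "content": prompt}],
                  max_tokens=100,
              )
              return response.choices[0].message.content
          except RateLimitError:
              wait = base_delay * (2 ** attempt)
              print(f"Rate limit exceeded; waiting {wait}s before retrying...")
              time.sleep(wait)
      raise RuntimeError("Still rate limited after all retries")
  ```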
- This might not be an issue on your side at all: regional quota may simply be saturated in Sweden Central. To address this, you can deploy the model in another region with better availability, such as West Europe or North Europe (see the CLI sketch below). Check regional availability here: https://learn.microsoft.com/en-us/azure/ai-services/openai/region-support
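  Here is a hedged sketch of creating a deployment on a resource in another region with `az cognitiveservices account deployment create` (the model version, SKU, and capacity below are placeholders; verify what your subscription and region support):

  ```bash
  # Sketch: create a gpt-4o-mini deployment on a resource in another region.
  # Model version, SKU, and capacity are placeholders — verify what your
  # subscription and region support before running this.
  az cognitiveservices account deployment create \
    --resource-group <Your-Resource-Group> \
    --name <Your-Resource-In-New-Region> \
    --deployment-name gpt-4o-mini \
    --model-name gpt-4o-mini \
    --model-version "2024-07-18" \
    --model-format OpenAI \
    --sku-name Standard \
    --sku-capacity 10
  ```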
- Contact Azure Support for a quota increase. If the issue persists despite optimizing token usage and adjusting the regional deployment, reach out to Azure Support to request a quota increase. Raise a support request here: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request
I hope the above helps resolve your issue! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.