Azure OpenAI rate limit exceeds

Simon 0 Reputation points
2024-12-17T14:50:12.83+00:00

I have setup an openai ressource (sweden central) and deployed gpt4o-mini. I am trying to get an aisstant up and running.

I have allocated all available quota to this ressource (30k TPM) but I get a rate limit exceeded error with the second message. Even if I lower the system instructions to a minimum, I can't really use the API, since I keep getting that rate limit exceeded error.

I did search through forums and co to find a solution on my own. max_tokens and best_of do not apply here and as I said, the quota of 30k tpm is allocated to that specific resource. I tried it in the playground (see attached, clearly stating it only took 83 tokens (I know that usage calculation of tokens in request is a bit different from actual usage tokens, but still faaaaar away from any limit)).

Can anyone help? Am I missing something?

Thanks in advance,

Simon

Azure OpenAI Service
Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.
3,505 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Sina Salam 15,011 Reputation points
    2024-12-17T17:31:59.83+00:00

    Hello Simon,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are experiencing Azure OpenAI rate limit exceeds error when using the gpt-4o-mini model deployed in the Sweden Central region on Azure OpenAI Service.

    The issue likely stems from Azure OpenAI’s resource allocation behavior or misconfiguration rather than token usage. However, you can do the following to permanently resolve the issue:

    1. Use Tools like OpenAI Tokenizer to validate and calculate token usage accurately. Token consumption includes input tokens (system messages, prompts, user input) and output tokens (model response). So, misestimating these values may lead to exceeding limits. For instance, even short responses with lengthy prompts can breach token thresholds. Tool: https://platform.openai.com/tokenizer | Reference on more details - https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
    2. Azure OpenAI allows you to monitor usage metrics and logs to detect spikes. This helps track TPM and RPM consumption in real time. Use Azure CLI to list deployments and resource metrics using bash: az openai model list --resource-group <Your-Resource-Group> --name <Your-Resource-Name> - Steps: Go to Azure Portal > Metrics for OpenAI resource monitoring - Reference: https://learn.microsoft.com/en-us/cli/azure/openai
    3. To reduce token usage:
      1. Shorten system messages and prompts.
      2. Specify max_tokens in the API call to control output size.
      3. Use OpenAI Tokenizer to validate reduced input/output tokens. For an examples:
              response = openai.ChatCompletion.create(
                  model="gpt-4o-mini",
                  messages=[{"role": "user", "content": "Short prompt"}],
                  max_tokens=50  # Limit response
              )
        
        Reference**: Avoid Rate Limit Errors
    4. You can introduce Delays Between Requests by enforces strict rate limits per minute. Adding intentional delays between API requests prevents exceeding RPM or TPM limits. This is a code sample:
         import time
         import openai
         def make_request_with_delay(prompt, delay=1):
             try:
                 response = openai.ChatCompletion.create(
                     model="gpt-4o-mini",
                     messages=[{"role": "user", "content": prompt}],
                     max_tokens=100
                 )
                 print(response['choices'][0]['message']['content'])
             except openai.error.RateLimitError:
                 print("Rate limit exceeded. Retrying...")
                 time.sleep(delay)
                 make_request_with_delay(prompt, delay)
         make_request_with_delay("Hello!")
      
      This handles rate limits gracefully by retrying requests after a short wait.
    5. This might not really be an issue, but regional quotas may be saturated in Sweden Central. So, to address this, you can deploy the model in another region with better availability, like West Europe or North Europe et al. Check the regional availability support here - https://learn.microsoft.com/en-us/azure/ai-services/openai/region-support
    6. Contact Azure Support for Quota Increase if the issue persists despite optimizing token usage and regional deployment, reach out to Azure Support to request a quota increase. Raise Support Requests - https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request

    The above are tested solution and I believe it should solve your issue. So, I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.