Hello Simon,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are experiencing rate limit exceeded errors when using the gpt-4o-mini model deployed in the Sweden Central region of Azure OpenAI Service.
The issue likely stems from Azure OpenAI's resource allocation behavior or a misconfiguration rather than raw token usage. The following steps should help you resolve it:
- Use a tool such as the OpenAI Tokenizer to validate and calculate token usage accurately. Token consumption includes both input tokens (system messages, prompts, user input) and output tokens (the model's response), so misestimating either side can push you over the limit. For instance, even short responses paired with lengthy prompts can breach token thresholds. Tool: https://platform.openai.com/tokenizer | More details: https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
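  If you prefer to count tokens in code rather than in the web tokenizer, here is a minimal sketch using the `tiktoken` library (this assumes the gpt-4o family uses the `o200k_base` encoding; the messages are illustrative):

  ```python
  # Rough token estimate with OpenAI's tiktoken library. This is a sketch:
  # it assumes the gpt-4o family uses the o200k_base encoding and ignores
  # the small per-message overhead that the chat format adds.
  import tiktoken

  enc = tiktoken.get_encoding("o200k_base")

  system_msg = "You are a helpful assistant."
  user_msg = "Summarize our Azure OpenAI quota options."

  input_tokens = len(enc.encode(system_msg)) + len(enc.encode(user_msg))
  print(f"Approximate input tokens: {input_tokens}")
  # Budget for output tokens too: TPM limits count input *and* output,
  # so input_tokens + max_tokens should stay within your quota.
  ```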
- Azure OpenAI lets you monitor usage metrics and logs to detect spikes, which helps you track TPM and RPM consumption in near real time. Deployments are managed under the `az cognitiveservices` command group in the Azure CLI, so you can list them with:

  ```bash
  az cognitiveservices account deployment list --resource-group <Your-Resource-Group> --name <Your-Resource-Name>
  ```
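  If you want to pull the same usage numbers from the CLI rather than the portal, here is a hedged sketch with `az monitor metrics list` (the metric name below is an assumption; confirm the exact metric names exposed for your resource in the Metrics blade):

  ```bash
  # Sketch: query token-usage metrics for the resource in 1-minute intervals.
  # The metric name is an assumption — confirm available metric names in the
  # portal's Metrics blade for your Azure OpenAI resource.
  az monitor metrics list \
    --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<Your-Resource-Name>" \
    --metric "ProcessedPromptTokens" \
    --interval PT1M
  ```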
- In the Azure Portal, go to your OpenAI resource > **Metrics** for resource monitoring. Reference: https://learn.microsoft.com/en-us/cli/azure/openai
- To reduce token usage:
  - Shorten system messages and prompts.
  - Specify `max_tokens` in the API call to control output size.
  - Use the OpenAI Tokenizer to validate the reduced input/output token counts.
  For example, here is a minimal sketch using the `AzureOpenAI` client from the current `openai` Python SDK (the endpoint, key, API version, and deployment name are placeholders; adapt them to your resource). See also OpenAI's guidance on avoiding rate limit errors:

  ```python
  import os
  from openai import AzureOpenAI

  # Placeholders: point these at your own Azure OpenAI resource.
  client = AzureOpenAI(
      azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
      api_key=os.environ["AZURE_OPENAI_API_KEY"],
      api_version="2024-06-01",
  )

  response = client.chat.completions.create(
      model="gpt-4o-mini",  # your deployment name
      messages=[{"role": "user", "content": "Short prompt"}],
      max_tokens=50,  # limit the response size
  )
  ```
- You can introduce delays between requests. Azure OpenAI enforces strict per-minute rate limits, so adding intentional delays between API requests helps you stay under the RPM and TPM limits. This code sample (reusing the `client` from above) handles rate limits gracefully by retrying after a short wait:
  ```python
  import time

  from openai import RateLimitError

  def make_request_with_delay(prompt, delay=1):
      try:
          response = client.chat.completions.create(
              model="gpt-4o-mini",  # your deployment name
              messages=[{"role": "user", "content": prompt}],
              max_tokens=100,
          )
          print(response.choices[0].message.content)
      except RateLimitError:
          print("Rate limit exceeded. Retrying...")
          time.sleep(delay)  # wait before retrying
          make_request_with_delay(prompt, delay)

  make_request_with_delay("Hello!")
  ```
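  If a fixed delay still collides with the limit, exponential backoff (doubling the wait on each retry) is the usual refinement. A sketch reusing the `client` and `RateLimitError` from above:

  ```python
  # Sketch: retry with exponential backoff — the wait doubles on each attempt.
  def make_request_with_backoff(prompt, max_retries=5, base_delay=1):
      for attempt in range(max_retries):
          try:
              response = client.chat.completions.create(
                  model="gpt-4o-mini",  # your deployment name
                  messages=[{"role": "user", "content": prompt}],
                  max_tokens=100,
              )
              return response.choices[0].message.content
          except RateLimitError:
              wait = base_delay * (2 ** attempt)
              print(f"Rate limit exceeded; waiting {wait}s before retrying...")
              time.sleep(wait)
      raise RuntimeError("Still rate limited after all retries")
  ```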
- This might not be an issue on your side at all: regional quota may simply be saturated in Sweden Central. To address this, you can deploy the model in another region with better availability, such as West Europe or North Europe (see the CLI sketch below). Check regional availability here: https://learn.microsoft.com/en-us/azure/ai-services/openai/region-support
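  Here is a hedged sketch of creating a deployment on a resource in another region with `az cognitiveservices account deployment create` (the model version, SKU, and capacity below are placeholders; verify what your subscription and region support):

  ```bash
  # Sketch: create a gpt-4o-mini deployment on a resource in another region.
  # Model version, SKU, and capacity are placeholders — verify what your
  # subscription and region support before running this.
  az cognitiveservices account deployment create \
    --resource-group <Your-Resource-Group> \
    --name <Your-Resource-In-New-Region> \
    --deployment-name gpt-4o-mini \
    --model-name gpt-4o-mini \
    --model-version "2024-07-18" \
    --model-format OpenAI \
    --sku-name Standard \
    --sku-capacity 10
  ```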
- Contact Azure Support for a quota increase. If the issue persists despite optimizing token usage and adjusting the regional deployment, reach out to Azure Support to request a quota increase. Raise a support request here: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request
I hope the above helps resolve your issue! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.