Hi Wenjun Che,

You can call the Chat Completions API in Azure OpenAI with the following format:
import { AzureOpenAI } from "openai";

// Reads AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION from the environment.
const client = new AzureOpenAI();

const result = await client.chat.completions.create({
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  model: "gpt-4o-mini", // your Azure deployment name
  max_tokens: 100
});
Since you mentioned that client.chat.completions.create() works fine while client.responses.create() results in a rate-limit error, Azure is likely enforcing separate rate limits for the two APIs. The Responses API may consume tokens differently or be subject to stricter limits in Azure AI Foundry.
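For comparison, here is a hedged sketch of the equivalent request via the Responses API, assuming the same AzureOpenAI client as above. Note the parameter differences: "input" replaces "messages" and "max_output_tokens" replaces "max_tokens". The helper name askViaResponses is just for illustration.

```javascript
// Sketch only: equivalent call through the Responses API.
// "input" replaces "messages"; "max_output_tokens" replaces "max_tokens".
async function askViaResponses(client, question) {
  const response = await client.responses.create({
    model: "gpt-4o-mini",      // your Azure deployment name
    input: question,
    max_output_tokens: 100
  });
  return response.output_text; // SDK convenience accessor for the text output
}
```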
If possible, I recommend using the Chat Completions API, since it works without issues for you. If you must use the Responses API, try reducing the output-token cap (max_output_tokens in that API) and check your Azure AI Foundry quota and token usage to confirm you are not exceeding your rate limits.
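If the 429 errors are transient rather than a hard quota ceiling, a retry with exponential backoff often gets the request through. A minimal sketch, assuming the error object follows the openai SDK's shape (numeric status, optional headers); the helper name createWithRetry is just for illustration:

```javascript
// Sketch: retry client.responses.create() on 429 with exponential backoff.
async function createWithRetry(client, request, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await client.responses.create(request);
    } catch (err) {
      // Only retry rate-limit errors, and only up to maxRetries times.
      if (err.status !== 429 || attempt >= maxRetries) throw err;
      // Honor Retry-After when the service sends it; otherwise back off 1s, 2s, 4s, ...
      const delaySeconds = Number(err.headers?.["retry-after"]) || 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delaySeconds * 1000));
    }
  }
}
```

This does not raise your quota; it only smooths over short bursts, so pair it with checking the actual limits in the Azure AI Foundry portal.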
For more information: https://learn.microsoft.com/en-us/azure/ai-services/openai/supported-languages?tabs=dotnet-secure%2Csecure%2Cpython-secure%2Ccommand&pivots=programming-language-javascript#chat