429 Rate Limit Errors on GPT-4.1

Lollop 0 Reputation points
2025-05-03T22:32:50.3333333+00:00

I am getting 429 Rate Limit errors on an Azure OpenAI gpt-4.1 resource; the details for this resource, as shown in Azure AI Foundry, are:

Rate Limit: 721,000 TPM

Requests: 721 RPM

But it is capped at 30K for some reason.

status_code: 429, model_name: gpt-4.1, body: {'message': 'Request too large for gpt-4.1 in organization org-<snip> on tokens per min (TPM): Limit 30000, Requested 42638. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}

1 answer

  1. SriLakshmi C 4,875 Reputation points Microsoft External Staff Moderator
    2025-05-05T16:56:12.3433333+00:00

    Hello @Lollop,

    I understand that you're encountering a 429 Rate Limit error on your Azure OpenAI GPT-4.1 resource, which appears to be capped at 30,000 tokens per minute (TPM), despite the Azure portal displaying a quota of 721,000 TPM and 721 RPM. This mismatch typically occurs due to backend limitations imposed on specific models like GPT-4.1, which may enforce lower token caps than the Azure resource settings indicate. The error you're seeing suggests a request of 42,638 tokens exceeded the actual enforced limit of 30,000 TPM.

    To give more context: as each request is received, Azure OpenAI computes an estimated max processed-token count that includes the following:

    • Prompt text and count
    • The max_tokens parameter setting
    • The best_of parameter setting

    As requests come into the deployment endpoint, the estimated max processed-token count is added to a running token count of all requests, which is reset each minute. If the TPM rate limit value is reached at any point during that minute, further requests will receive a 429 response code until the counter resets. For more details, see Understanding rate limits.
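
    As a rough illustration, here is a sketch of that estimate for a chat request, using the tiktoken library. The o200k_base encoding and the per-message framing overhead are assumptions on my part, not the service's exact accounting:

    ```python
    import tiktoken

    def estimate_max_tokens(messages, max_tokens, best_of=1):
        """Rough upper bound on the tokens one request counts against TPM."""
        enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for gpt-4.1
        prompt = sum(len(enc.encode(m["content"])) for m in messages)
        prompt += 4 * len(messages)  # assumed per-message framing overhead
        # max_tokens is reserved in full, once per candidate completion.
        return prompt + max_tokens * max(best_of, 1)

    msgs = [{"role": "user", "content": "Summarize the attached report..."}]
    print(estimate_max_tokens(msgs, max_tokens=2048))  # roughly 2,060
    ```

    Note that the full max_tokens value is counted up front, whether or not the completion actually uses it, so lowering max_tokens directly lowers the estimate charged against your TPM budget.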

    Please see Manage Azure OpenAI Service quota for more details.

    Also, reduce your input and output token counts per request, implement retry logic that respects the Retry-After header (see the sketch below), and actively monitor your usage with Azure tools.
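
    For the retry logic, a minimal sketch with the openai v1 Python SDK follows; the endpoint, API version, and deployment name are placeholders you would replace with your own:

    ```python
    import os
    import time

    from openai import AzureOpenAI, RateLimitError

    # Placeholders: substitute your own endpoint, key, and API version.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-10-21",
    )

    def chat_with_retry(messages, max_retries=5):
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="gpt-4.1",  # your deployment name
                    messages=messages,
                    max_tokens=1024,  # keep as small as the task allows
                )
            except RateLimitError as exc:
                # Honor the service's Retry-After hint (seconds) when present;
                # otherwise fall back to exponential backoff.
                retry_after = exc.response.headers.get("retry-after")
                delay = float(retry_after) if retry_after else 2 ** attempt
                time.sleep(delay)
        raise RuntimeError("exhausted retries while rate limited")
    ```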

    I hope this helps, do let me know if you have further queries.

    Thank you!

