Hello @Lollop,
I understand that you're encountering a 429 Rate Limit error on your Azure OpenAI GPT-4.1 resource, which appears to be capped at 30,000 tokens per minute (TPM), despite the Azure portal displaying a quota of 721,000 TPM and 721 RPM. This mismatch typically occurs due to backend limitations imposed on specific models like GPT-4.1, which may enforce lower token caps than the Azure resource settings indicate. The error you're seeing suggests a request of 42,638 tokens exceeded the actual enforced limit of 30,000 TPM.
To give more context: as each request is received, Azure OpenAI computes an estimated max processed-token count that includes the following:
- The prompt text and its token count
- The max_tokens parameter setting
- The best_of parameter setting
As requests come into the deployment endpoint, this estimated max processed-token count is added to a running token count across all requests, which resets each minute. If the TPM rate limit is reached at any point during that minute, further requests receive a 429 response code until the counter resets. For more details, see Understanding rate limits.
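To make the accounting above concrete, here is a minimal sketch of how the estimated max processed-token count could be derived from a request's parameters. Note the exact formula Azure uses internally is not published; the function below is one plausible reading of the documented inputs (prompt tokens, max_tokens, best_of), where each best_of candidate consumes its own completion budget:

```python
def estimated_max_processed_tokens(prompt_tokens: int,
                                   max_tokens: int,
                                   best_of: int = 1) -> int:
    """Rough estimate of the tokens a request counts against the TPM limit.

    Assumption (not the exact Azure algorithm): prompt tokens plus the
    completion budget (max_tokens), multiplied by best_of since each
    candidate completion can consume up to max_tokens on its own.
    """
    return prompt_tokens + max_tokens * best_of


# Example: a 42,638-token estimate like the one in your error could come
# from a large prompt combined with a generous max_tokens setting.
estimate = estimated_max_processed_tokens(prompt_tokens=38_000, max_tokens=4_638)
print(estimate)  # 42638 — above a 30,000 TPM enforced cap, hence the 429
```

Lowering max_tokens to what you actually need (rather than the model maximum) is often the quickest way to shrink this estimate.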
Please see Manage Azure OpenAI Service quota for more details.
To mitigate the issue, reduce your input and output token counts per request (for example, by lowering the max_tokens parameter), implement retry logic that respects the Retry-After header, and actively monitor your usage with Azure monitoring tools.
I hope this helps. Do let me know if you have further queries.
Thank you!