Hello Đặng Hoàn Mỹ, you may want to check whether you have permission to increase the quota on your subscription/service. To answer your question more specifically: TPM rate limits are based on the maximum number of tokens a request is estimated to process, calculated at the time the request is received. This is different from the token count used for billing, which is computed after all processing is completed. Azure OpenAI calculates a max processed-token count per request using:
- Prompt text and count
- The max_tokens setting
- The best_of setting
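As a rough illustration of how these three inputs could combine, here is a minimal sketch. The exact formula the service uses is not spelled out here, so the arithmetic below (prompt tokens plus max_tokens multiplied by best_of) is an assumption, and estimate_request_tokens is a hypothetical helper name, not part of any Azure SDK.

```python
def estimate_request_tokens(prompt_tokens: int, max_tokens: int, best_of: int = 1) -> int:
    """Upper-bound estimate of tokens a request may process.

    Assumption: the estimate is the prompt size plus the largest possible
    completion (max_tokens) for each of the best_of candidates. The real
    service-side calculation may differ.
    """
    if prompt_tokens < 0 or max_tokens < 0 or best_of < 1:
        raise ValueError("prompt_tokens and max_tokens must be >= 0, best_of >= 1")
    return prompt_tokens + max_tokens * best_of


# A 200-token prompt with max_tokens=500 counts as far more than 200 tokens
# against the limit, even if the actual completion turns out to be short.
print(estimate_request_tokens(200, 500))
print(estimate_request_tokens(200, 500, best_of=3))
```

The practical takeaway under this assumption is that setting max_tokens and best_of no higher than you really need reduces how much each request counts toward the limit.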
This estimated count is added to a running token count of all requests, which resets every minute. Once that running count reaches the TPM rate limit within the minute, further requests receive a 429 response code until the count resets. This article is a good reference on the topic: https://techcommunity.microsoft.com/t5/fasttrack-for-azure/optimizing-azure-openai-a-guide-to-limits-quotas-and-best/ba-p/4076268
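To make the running count easier to picture, here is a small client-side sketch of a per-minute token budget. It is only a model of the behavior described above, assuming a simple fixed one-minute window. MinuteTokenBudget and try_consume are hypothetical names, and the service's real accounting may work differently.

```python
import time


class MinuteTokenBudget:
    """Tracks estimated tokens against a TPM limit in a fixed 60-second window.

    Assumption: the window is a simple fixed minute that resets in full.
    """

    def __init__(self, tpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self._clock = clock          # injectable so the window can be tested
        self._window_start = clock()
        self._used = 0

    def try_consume(self, estimated_tokens: int) -> bool:
        """Return True if the request fits in the current minute.

        A False result is the local analogue of the service returning 429.
        """
        now = self._clock()
        if now - self._window_start >= 60:
            # A new minute has started: reset the running count.
            self._window_start = now
            self._used = 0
        if self._used + estimated_tokens > self.tpm_limit:
            return False
        self._used += estimated_tokens
        return True


budget = MinuteTokenBudget(tpm_limit=1000)
print(budget.try_consume(700))  # fits within the limit
print(budget.try_consume(700))  # would exceed 1000 in the same minute
```

A check like this before sending lets a client wait for the next minute instead of triggering a 429, although handling 429 responses with a retry is still needed because the local estimate may not match the service's.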
Please remember to "Accept the answer" and "up-vote" wherever the information provided helps you. This can be beneficial to other community members.