Rajeev Bhat, welcome to the Microsoft Q&A forum!
Could you share the details of the deployed model, the region it is deployed in, and where exactly you are seeing this error?
I understand that you have only tried a few requests so far and are already seeing this issue.
To give more context: as each request is received, Azure OpenAI computes an estimated max processed-token count that includes the following:
- Prompt text and count
- The max_tokens parameter setting
- The best_of parameter setting
As requests come into the deployment endpoint, this estimated max processed-token count is added to a running token count across all requests, which resets each minute. If the TPM rate limit is reached at any point during that minute, further requests will receive a 429 response code until the counter resets. For more details, see Understanding rate limits.
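To make the accounting above concrete, here is a minimal client-side sketch of how you could approximate that per-request estimate yourself. This is not the service's exact internal formula; the use of tiktoken and the cl100k_base encoding are assumptions for illustration, and the arithmetic simply combines the three inputs listed above (prompt tokens, max_tokens, best_of):

```python
import tiktoken  # assumption: tokenizer library used for a rough client-side count

def estimated_max_processed_tokens(prompt: str, max_tokens: int, best_of: int = 1) -> int:
    """Rough approximation of the per-request estimate: prompt tokens
    plus the maximum completion tokens across all best_of candidates."""
    encoding = tiktoken.get_encoding("cl100k_base")  # assumption: encoding matches your model
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + max_tokens * best_of

# Example: even with a short prompt, max_tokens=800 and best_of=2 counts
# roughly 1,600+ tokens against the per-minute TPM budget.
print(estimated_max_processed_tokens("Summarize this ticket...", max_tokens=800, best_of=2))
```

This is why a handful of requests can trigger 429s: the estimate is charged against the per-minute budget up front, regardless of how many tokens the response actually uses.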
To minimize issues related to rate limits, it's a good idea to use the following techniques:
- Set max_tokens and best_of to the minimum values that serve the needs of your scenario. For example, don't set a large max_tokens value if you expect your responses to be small.
- Use quota management to increase TPM on deployments with high traffic, and to reduce TPM on deployments with limited needs.
- Implement retry logic in your application (see the sketch after this list).
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
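As a starting point for the retry advice above, here is a minimal sketch using the openai Python SDK (v1 style) against an Azure deployment. The endpoint, API version, and deployment name are placeholders you would replace with your own values, and the backoff parameters are illustrative, not prescriptive:

```python
import random
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    api_key="<your-api-key>",                               # placeholder
    api_version="2024-02-01",                               # assumption: use the version you target
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
)

def complete_with_retry(messages, max_retries=5):
    """Retries on 429 responses, preferring the service's Retry-After
    hint and falling back to exponential backoff with jitter."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="<your-deployment-name>",  # the deployment name, not the model family
                messages=messages,
                max_tokens=256,  # keep this as small as your scenario allows
            )
        except RateLimitError as e:
            retry_after = e.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    raise RuntimeError("Still rate limited after all retries")
```

Honoring the Retry-After header when the service sends it, rather than retrying immediately, keeps your estimated token count from piling up against the same one-minute window.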
Also, see Optimizing Azure OpenAI: A Guide to Limits, Quotas, and Best Practices for more information.
Hope this helps. Do let me know if you have any further queries.