Hello adan ameen,
When a deployment is created, the assigned TPM directly maps to the tokens-per-minute rate limit enforced on its inferencing requests. A Requests-Per-Minute (RPM) rate limit is also enforced, with its value set proportionally to the TPM assignment using the following ratio:
6 RPM per 1000 TPM.
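As a quick sketch of that ratio, the implied RPM limit for a given TPM assignment can be computed like this (the function name is illustrative, not part of any SDK):

```python
def rpm_limit(tpm: int) -> float:
    """RPM limit implied by a TPM assignment, at 6 RPM per 1,000 TPM."""
    return tpm / 1000 * 6

# A deployment assigned 30,000 TPM is therefore limited to 180 RPM.
```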
Because TPM can be distributed flexibly within a subscription and region, Azure OpenAI Service has been able to loosen other restrictions. To reduce the chance of hitting rate limits, you can:
- Increase the TPM assigned to your model deployment, which raises both the RPM limit and the threshold at which rate-limit failures occur.
- Deploy to multiple regions to handle regional outages; you can create outage alerts from Azure Status and fail over accordingly.
- Reduce your input query size and lower max_tokens.
- Implement retry logic with exponential backoff in your code.
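The last point can be sketched as a small retry helper with exponential backoff and jitter. This is a minimal, SDK-agnostic example; in practice you would catch your client library's 429 error (e.g. `openai.RateLimitError`) instead of the stand-in exception defined here:

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the 429 error your SDK raises on rate-limit hits."""


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Call fn(); on a rate-limit error, wait with exponential backoff plus jitter, then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

You would wrap your inferencing call in `fn` (for example, a lambda around `client.chat.completions.create(...)`); the linked cookbook article below shows the same pattern against the real OpenAI SDK.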
Reference - https://cookbook.openai.com/examples/how_to_handle_rate_limits
Please don't forget to click Accept Answer and select Yes for "Was this answer helpful?" wherever the information provided helps you, as this can benefit other community members.
Thank you!