Hi @Rajnish Soni,
Thank you for the detailed follow-up; your observation is valid, and I can clarify how this works.
Azure OpenAI model deployments are provisioned in terms of Tokens Per Minute (TPM), and the Requests Per Minute (RPM) limit is derived from that value according to the model's characteristics. For the gpt-35-turbo-0125 model, the standard formula Azure applies is:
RPM = (TPM × 6) / 1000
In your scenario, you configured a capacity of 20, which equates to 20,000 TPM. Applying the formula:
(20,000 × 6) / 1000 = 120 RPM
This explains why the Azure Portal correctly shows the RPM as 120 for your deployment.
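If it helps, here is a minimal Python sketch of that same arithmetic; the capacity value and the 6-RPM-per-1,000-TPM ratio for gpt-35-turbo-0125 are taken straight from the formula above, and the function name is just for illustration:

```python
# Derive TPM and RPM from a deployment's configured capacity,
# using the 6 RPM per 1,000 TPM ratio that applies to gpt-35-turbo-0125.
def limits_from_capacity(capacity: int, rpm_per_1k_tpm: int = 6) -> tuple[int, int]:
    tpm = capacity * 1_000                  # each unit of capacity = 1,000 TPM
    rpm = tpm * rpm_per_1k_tpm // 1_000     # RPM = (TPM x 6) / 1000
    return tpm, rpm

tpm, rpm = limits_from_capacity(20)
print(tpm, rpm)  # 20000 120
```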
Now, regarding the x-ratelimit-limit-requests: 20 value you see in the cURL response: this reflects the per-second request limit, not the per-minute limit. Multiplied by 60 seconds, it suggests a potential burst capability of up to 1,200 requests per minute:
20 requests/sec × 60 sec = 1,200 requests/min
However, Azure enforces both token-based and request-based rate limits, and the lower of the two becomes the effective cap. So even though the per-second limit appears to allow up to 1,200 RPM, your deployment's configuration (based on 20,000 TPM) imposes an effective limit of 120 RPM, which is what ultimately throttles your traffic.
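If you want to see both views side by side, a rough Python sketch like the one below reproduces what your cURL call shows by printing the rate-limit headers returned with a chat completion. The endpoint, deployment name, API key, and api-version here are placeholders/assumptions you would substitute with your own values:

```python
import requests

# Placeholder values - replace with your own resource details.
ENDPOINT = "https://<your-resource>.openai.azure.com"
DEPLOYMENT = "gpt-35-turbo-0125"
API_KEY = "<your-api-key>"

url = f"{ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions?api-version=2024-02-01"
resp = requests.post(
    url,
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json={"messages": [{"role": "user", "content": "ping"}], "max_tokens": 1},
)

# Print whatever rate-limit headers come back; x-ratelimit-limit-requests is the
# per-second figure discussed above, distinct from the 120 RPM shown in the portal.
for name, value in resp.headers.items():
    if name.lower().startswith("x-ratelimit"):
        print(f"{name}: {value}")
```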
Therefore, for configuring throttling in Azure API Management (APIM), you should rely on the RPM value shown in the Azure Portal or retrieved programmatically via the Azure Resource Manager (ARM) API using:
GET /accounts/{account}/deployments/{name}?api-version=2024-10-01
In the response, the properties.callRateLimit.count
field will give you the actual RPM value for the deployment.
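For reference, a small Python sketch along these lines can pull that value for you. The subscription, resource group, and account names are placeholders, and I am assuming the standard ARM host and Microsoft.CognitiveServices resource path in front of the relative URL shown above:

```python
import requests
from azure.identity import DefaultAzureCredential

# Placeholder identifiers - substitute your own subscription, resource group,
# Azure OpenAI account, and deployment names.
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ACCOUNT = "<aoai-account-name>"
DEPLOYMENT = "gpt-35-turbo-0125"

# Acquire an ARM token and call the deployments GET endpoint shown above.
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    "https://management.azure.com"
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.CognitiveServices/accounts/{ACCOUNT}"
    f"/deployments/{DEPLOYMENT}?api-version=2024-10-01"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

# properties.callRateLimit.count carries the RPM figure to use for APIM throttling.
props = resp.json()["properties"]
print("RPM limit:", props["callRateLimit"]["count"])
```

That RPM value is the number to feed into your APIM throttling policy, since it is the limit Azure will actually enforce for the deployment.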
I hope this helps. Do let me know if you have further queries.
Thank you!