What does 1000 TPM actually mean in deployments?

Hải Phạm 0 Reputation points
2024-11-23T05:38:16.2533333+00:00

I have a GPT-4o-mini deployment and its configuration is 1000 TPM with a corresponding RPM of 10.
But every time I call an API such as the chat completion API, I have to wait about a minute before I can call it again; otherwise I receive a 429 response status. So what do the 1000 TPM and 10 RPM limits actually do in this deployment? Can anyone tell me?

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. Vinodh247 25,201 Reputation points MVP
    2024-11-23T10:17:48.3766667+00:00

    Hi Hải Phạm,

    Thanks for reaching out to Microsoft Q&A.

    The 1000 TPM and 10 RPM values on your GPT-4o-mini deployment are the rate limits that determine how much traffic the deployment will accept within a given minute.

    Here's a detailed explanation:

    What 1000 TPM and 10 RPM mean:

    1000 TPM (Tokens Per Minute):

    • This is the limit on the number of tokens your deployment can process per minute, counting both the prompt tokens you send and the completion tokens the model is expected to generate (the service estimates the completion size from settings such as max_tokens when the request arrives).
    • 1000 TPM is a very small quota: a single chat completion with a moderate prompt and response can consume a few hundred tokens, a large share of the minute's budget. The sketch below shows how to read a call's actual token usage from the response.
    • If a request would push the deployment past this token budget, it fails with a 429 Too Many Requests response.

    10 RPM (Requests Per Minute):

    • This is the limit on the number of API calls the deployment accepts per minute; it is assigned in proportion to the TPM you configure.
    • It applies to the deployment as a whole, not to an individual client: every application or user calling this deployment shares the same 10 requests per minute.
    • If more requests arrive than the limit allows, whether more than 10 in a minute or too many in a short burst (the limit is enforced over windows of a few seconds), the extra requests receive a 429 response even if the token budget is not exhausted.
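    A minimal sketch of checking that token usage, assuming the openai Python package (v1+), placeholder environment variables for the endpoint and key, and a deployment named gpt-4o-mini:

    ```python
    import os
    from openai import AzureOpenAI  # pip install openai>=1.0

    # Placeholder endpoint/key via environment variables; replace with your own setup.
    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # your *deployment* name
        messages=[{"role": "user", "content": "Explain TPM vs RPM in one sentence."}],
        max_tokens=100,
    )

    # Every chat completion reports how many tokens it consumed.
    usage = response.usage
    print("prompt tokens:    ", usage.prompt_tokens)
    print("completion tokens:", usage.completion_tokens)
    print("total tokens:     ", usage.total_tokens, "(out of a 1000 TPM budget)")
    ```

    Even a short exchange like this typically reports well over a hundred total tokens, so at 1000 TPM a handful of calls exhausts the minute's budget.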

    Why are you encountering delays?

    When you call the Chat Completion API:

    • At 10 RPM, the deployment effectively allows about one request every 6 seconds (60 seconds / 10 requests), and because the limit is evaluated over short windows, back-to-back calls can be throttled immediately.
    • At 1000 TPM, one or two chat completions can use up most of the minute's token budget, which is why you often have to wait close to a full minute before the next call succeeds.

    Practical Example:

    Scenario 1 (request limit):

    • You call the API once and then immediately call it again.
    • If the two calls exceed the allowed request rate for the current window, the second one fails with a 429 status.

    Scenario 2 (token limit):

    • You send a single chat completion with a long prompt and max_tokens set to 800.
    • The estimated token count for that one request approaches the entire 1000 TPM budget, so further requests in the same minute are rejected with a 429 until the window resets; a rough way to estimate this yourself is sketched below.
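    As a rough client-side illustration of scenario 2 (a sketch only; the service does its own estimation, which may differ), you can count prompt tokens with tiktoken's o200k_base encoding, which the GPT-4o family uses, and add the max_tokens you plan to request:

    ```python
    import tiktoken  # pip install tiktoken

    TPM_BUDGET = 1000   # tokens per minute assigned to the deployment
    MAX_TOKENS = 800    # max_tokens you intend to send with the request

    enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-family models

    prompt = "Summarize the following 500-word story about rate limiting ..."
    prompt_tokens = len(enc.encode(prompt))

    # The service reserves roughly prompt tokens + max_tokens when admitting a request,
    # so their sum is a reasonable rough proxy for the cost of one call.
    estimated = prompt_tokens + MAX_TOKENS
    print(f"estimated tokens for this call: {estimated} of {TPM_BUDGET} TPM")
    if estimated >= TPM_BUDGET:
        print("One call like this would consume the entire minute's budget.")
    ```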
      

    Solutions and Workarounds:

    Optimize Request Frequency:

    • Space out your API calls so that you stay under both the 10 RPM request limit and the 1000 TPM token budget; a minimal pacing sketch follows.
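    A minimal pacing sketch, assuming the 10 RPM limit and the same placeholder AzureOpenAI setup as above:

    ```python
    import os
    import time
    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-06-01",
    )

    RPM_LIMIT = 10
    MIN_INTERVAL = 60.0 / RPM_LIMIT  # 6 seconds between requests at 10 RPM

    last_call = float("-inf")
    for prompt in ["First question", "Second question", "Third question"]:
        # Ensure consecutive calls are at least MIN_INTERVAL seconds apart.
        wait = MIN_INTERVAL - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()

        response = client.chat.completions.create(
            model="gpt-4o-mini",  # deployment name
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        print(response.choices[0].message.content)
    ```

    Spacing alone does not protect against the token budget, so keep prompts and max_tokens small as well.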

    Request More Quota:

    • 1000 TPM is one of the smallest allocations you can assign. If your workload needs more, increase the TPM assigned to the deployment (the RPM limit scales with it), up to the quota available for that model and region in your subscription.

    Retry with Backoff:

    • When a request does fail with a 429, the response carries a Retry-After header indicating how long to wait; honoring it (or using exponential backoff) is more effective than retrying immediately, as sketched below.
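    A sketch of that retry pattern, assuming the openai v1 SDK (whose RateLimitError exposes the underlying HTTP response) and a client built as in the earlier sketch:

    ```python
    import time
    import openai

    def chat_with_retry(client, messages, max_retries=5):
        """Call the chat completion API, backing off whenever the deployment returns 429."""
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="gpt-4o-mini",  # deployment name
                    messages=messages,
                    max_tokens=100,
                )
            except openai.RateLimitError as err:
                # Prefer the service's own hint; otherwise fall back to exponential backoff.
                retry_after = err.response.headers.get("retry-after")
                delay = float(retry_after) if retry_after else 2 ** attempt
                print(f"429 received, waiting {delay:.0f}s before retrying...")
                time.sleep(delay)
        raise RuntimeError("Still rate limited after retries")
    ```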
      
    
    Check Rate Limits via API Response Headers:

    • Azure OpenAI responses include rate-limit headers such as x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, which you can monitor to avoid hitting the limits; a 429 response additionally includes a Retry-After header.
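    A sketch of reading those headers with the openai v1 SDK's raw-response helper (reusing the client from the first sketch; header availability can vary by API version, so treat missing values as absent rather than zero):

    ```python
    # `client` is the AzureOpenAI client constructed in the earlier sketch.
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o-mini",  # deployment name
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=10,
    )

    print("remaining requests this window:", raw.headers.get("x-ratelimit-remaining-requests"))
    print("remaining tokens this window:  ", raw.headers.get("x-ratelimit-remaining-tokens"))

    completion = raw.parse()  # the regular ChatCompletion object
    print(completion.choices[0].message.content)
    ```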

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

