Question about max tokens per minute in Azure OpenAI Service

Brendan Lui 10 Reputation points
2023-10-04T08:48:20.1033333+00:00

What does "max tokens per minute" refer to?
Input (prompt) tokens, output (completion) tokens, or the total?

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. Ramr-msft 17,826 Reputation points
    2023-10-09T08:46:22.7666667+00:00

    Thanks for the details. The per-minute quota is reserved against a request's total anticipated token count, prompt and completion together. If max_tokens isn't provided, the maximum possible value (the model's context size) is inferred for the request's quota reservation. If the actual overall token count turns out to be substantially lower, the extra TPM pressure applies only until the request finishes, at which point the superfluous reserved capacity is released and no longer causes problems. An unconstrained request (no max_tokens) will not be rejected on its own, but its higher-than-needed inferred token reservation may contribute to subsequent requests failing (HTTP 429) until the unconstrained request resolves with its lower real token count.
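    To make the reservation arithmetic concrete, here is a minimal sketch of the behavior described above. The constant and function names are hypothetical; the exact accounting is internal to the service, and this assumes the reservation is roughly prompt tokens plus the max_tokens cap, or the full context window when no cap is given.

```python
# Hypothetical illustration of the TPM reservation behavior described above.
# The names and the 8k context size are assumptions, not a service API.
MODEL_CONTEXT_SIZE = 8192  # assume an 8k-context model for illustration

def estimated_reservation(prompt_tokens: int, max_tokens: int | None) -> int:
    """Tokens counted against the TPM quota while the request is in flight."""
    if max_tokens is None:
        # No cap supplied: the service must assume the completion could fill
        # the rest of the context window, so the whole window is reserved.
        return MODEL_CONTEXT_SIZE
    return prompt_tokens + max_tokens

# A 200-token prompt with no cap reserves the entire 8,192-token window...
print(estimated_reservation(200, None))  # 8192
# ...while capping the completion at 500 tokens reserves only 700.
print(estimated_reservation(200, 500))   # 700
```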

    • Setting max_tokens should have no impact on response quality as long as the value doesn't truncate the response (finish_reason == "length"). In situations where the anticipated overall token consumption is substantially lower than the maximum model context size, it's still good practice to set max_tokens to a "safe" lower value, as in the sketch below; this keeps the running TPM count from excessively reserving quota and potentially preventing other requests from being serviced in the interim.
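    For example, here is a minimal sketch of that practice using the openai Python SDK's AzureOpenAI client. The endpoint, API key, API version, and deployment name are placeholders; substitute your own resource values.

```python
from openai import AzureOpenAI

# Placeholder resource values; replace with your own.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="YOUR-DEPLOYMENT-NAME",  # your Azure deployment name
    messages=[{"role": "user", "content": "Explain TPM quota in one paragraph."}],
    # Cap the completion well below the context size so the request reserves
    # only (prompt tokens + 500) against the TPM quota, not the full window.
    max_tokens=500,
)

choice = response.choices[0]
# finish_reason == "length" means the cap truncated the response; that is
# the one case where max_tokens hurts quality, so raise it if this fires.
if choice.finish_reason == "length":
    print("Warning: response truncated; consider a higher max_tokens.")
print(choice.message.content)
```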
