Question about max tokens per minute in Azure OpenAI Service

Brendan Lui 10 Reputation points
2023-10-04T08:48:20.1033333+00:00

What does "max tokens per minute" refer to?
Input (prompt) tokens, output (completion) tokens, or the total?

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. Ramr-msft 17,826 Reputation points
    2023-10-09T08:46:22.7666667+00:00

    Thanks for the details. The per-minute quota is reserved against a request's total anticipated token count, prompt and completion together. If max_tokens isn't provided, the maximum possible value (the model's context size) is inferred for the request's quota reservation. If the actual overall token count turns out to be substantially lower, the extra TPM pressure applies only until the request finishes, at which point the superfluous reserved capacity is released and no longer causes problems. An unconstrained request (no max_tokens) will not be rejected on its own, but its higher-than-needed inferred token reservation may contribute to subsequent requests failing (HTTP 429) until the unconstrained request resolves with its lower real token count.
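    To make the reservation arithmetic concrete, here is a minimal sketch of the behavior described above. The constant and function names are hypothetical; the exact accounting is internal to the service, and this assumes the reservation is roughly prompt tokens plus the max_tokens cap, or the full context window when no cap is given.

```python
# Hypothetical illustration of the TPM reservation behavior described above.
# The names and the 8k context size are assumptions, not a service API.
MODEL_CONTEXT_SIZE = 8192  # assume an 8k-context model for illustration

def estimated_reservation(prompt_tokens: int, max_tokens: int | None) -> int:
    """Tokens counted against the TPM quota while the request is in flight."""
    if max_tokens is None:
        # No cap supplied: the service must assume the completion could fill
        # the rest of the context window, so the whole window is reserved.
        return MODEL_CONTEXT_SIZE
    return prompt_tokens + max_tokens

# A 200-token prompt with no cap reserves the entire 8,192-token window...
print(estimated_reservation(200, None))  # 8192
# ...while capping the completion at 500 tokens reserves only 700.
print(estimated_reservation(200, 500))   # 700
```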

    • Setting max_tokens should have no impact on response quality as long as the value doesn't truncate the response (finish_reason == "length"). In situations where the anticipated overall token consumption is substantially lower than the maximum model context size, it's still good practice to set max_tokens to a "safe" lower value, as in the sketch below; this keeps the running TPM count from excessively reserving quota and potentially preventing other requests from being serviced in the interim.
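    For example, here is a minimal sketch of that practice using the openai Python SDK's AzureOpenAI client. The endpoint, API key, API version, and deployment name are placeholders; substitute your own resource values.

```python
from openai import AzureOpenAI

# Placeholder resource values; replace with your own.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-API-KEY",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="YOUR-DEPLOYMENT-NAME",  # your Azure deployment name
    messages=[{"role": "user", "content": "Explain TPM quota in one paragraph."}],
    # Cap the completion well below the context size so the request reserves
    # only (prompt tokens + 500) against the TPM quota, not the full window.
    max_tokens=500,
)

choice = response.choices[0]
# finish_reason == "length" means the cap truncated the response; that is
# the one case where max_tokens hurts quality, so raise it if this fires.
if choice.finish_reason == "length":
    print("Warning: response truncated; consider a higher max_tokens.")
print(choice.message.content)
```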
