Rate Limited with 1 query?

Rajeev Bhat 0 Reputation points
2024-07-24T17:17:45.78+00:00

Just learning and playing around in the chat playground. I deployed a new service and asked about 5 questions in the last 2-3 hours. Now I tried again - 2 requests, 1 minute apart - and I get rate limited.



2 answers

  1. Dillon Silzer 57,431 Reputation points
    2024-07-25T04:42:17.5666667+00:00

    Hi Rajeev,

    I'd recommend checking your quotas.


    If you are hitting your quotas, I recommend requesting a rate-limit increase using https://aka.ms/oai/quotaincrease.


    If this is helpful, please accept as answer or upvote.

    Best regards,

    Dillon Silzer, Director | Cloudaen.com | Cloudaen Computing Solutions


  2. AshokPeddakotla-MSFT 34,611 Reputation points
    2024-07-25T05:12:05.5033333+00:00

    Rajeev Bhat Welcome to Microsoft Q&A forum!

    Can you also share the deployed model details and the region where the model is deployed? Where are you seeing this error?

    I understand that you have only tried a few requests and are seeing this issue.

    To give more context: as each request is received, Azure OpenAI computes an estimated max processed-token count that includes the following:

    • Prompt text and count
    • The max_tokens parameter setting
    • The best_of parameter setting

    As requests come into the deployment endpoint, the estimated max processed-token count is added to a running token count across all requests, which resets each minute. If the TPM rate limit is reached at any point during that minute, further requests receive a 429 response code until the counter resets. For more details, see Understanding rate limits.
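
    To make that arithmetic concrete, here is a minimal sketch of the per-request estimate described above (an illustrative approximation, not the service's exact algorithm), assuming the tiktoken library for counting prompt tokens:

    ```python
    # Illustrative approximation of the per-request token estimate;
    # the service's exact computation may differ.
    import tiktoken

    def estimated_max_tokens(prompt: str, max_tokens: int, best_of: int = 1) -> int:
        """Approximate the token count charged against the TPM limit."""
        encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT models
        prompt_tokens = len(encoding.encode(prompt))
        # Each of the best_of candidates may generate up to max_tokens tokens,
        # so the worst case is the prompt tokens plus max_tokens per candidate.
        return prompt_tokens + max_tokens * best_of
    ```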

    To minimize issues related to rate limits, it's a good idea to use the following techniques:

    • Set max_tokens and best_of to the minimum values that serve the needs of your scenario. For example, don't set a large max_tokens value if you expect your responses to be small.
    • Use quota management to increase TPM on deployments with high traffic, and to reduce TPM on deployments with limited needs.
    • Implement retry logic in your application (see the sketch after this list).
    • Avoid sharp changes in the workload; increase the workload gradually.
    • Test different load increase patterns.
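
    For the retry point above, here is a minimal sketch of exponential backoff on 429 responses, assuming the openai Python SDK (v1.x); the endpoint, key, and deployment name "my-gpt-deployment" are placeholders, not values from this thread:

    ```python
    import time

    from openai import AzureOpenAI, RateLimitError

    # Placeholder connection details - substitute your own resource values.
    client = AzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
        api_key="YOUR-API-KEY",
        api_version="2024-02-01",
    )

    def chat_with_retry(messages, max_retries=5):
        """Call the deployment, backing off exponentially on 429 responses."""
        delay = 1.0
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(
                    model="my-gpt-deployment",  # your deployment name
                    messages=messages,
                    max_tokens=256,  # keep as small as your scenario allows
                )
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(delay)  # the TPM counter resets each minute
                delay *= 2
    ```

    With something like this in place, a brief burst over the TPM limit becomes a short wait instead of a user-facing error.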

    Also, see Optimizing Azure OpenAI: A Guide to Limits, Quotas, and Best Practices for more information.

    Hope this helps. Do let me know if you have any further queries.

