Assistants API Rate Limit Exceeded

Fin Systems 0 Reputation points
2024-12-09T14:25:55.5233333+00:00

Hello!

We have a Rate Limit Exceeded issue for Assistants using the File Search tool.

Before asking this question, we studied the forums and recommendations and performed all the suggested checks, but it did not help.

Below is a more detailed description of the problem.

  1. We use the gpt-4o (Global Standard) model with a configured limit of 30K TPM. We checked the settings in the Quotas section; everything is correct.
  2. We uploaded about 1,000 of our files to the Vector Store so that the assistant could answer questions about their content. The upload was successful and satisfies all the restrictions.
  3. We uploaded the files and created the Vector Store in late August to early September. At first everything worked correctly and we did not encounter any problems.

In November, the Rate Limit Exceeded problem began to appear, and waiting for a minute to pass does not always help.

Along the way we noticed the following behavior: if we disable File Search and detach the Vector Store, the assistant works without the Rate Limit Exceeded error. It answers complex and extensive questions no matter how often we ask them, and the error never appears.

Accordingly, the problem seems to lie either in the token limit or in the use of the Vector Store. To repeat, this error did not occur in late August and early September.

What needs to be done to solve this problem?

Azure OpenAI Service
Azure AI services

2 answers

  1. Max Lacy 345 Reputation points
    2024-12-09T17:53:33.0333333+00:00

    It sounds like you're experiencing a frustrating issue with the "Rate Limit Exceeded" error for your assistant using the File Search tool. Here's an explanation and some potential solutions:

    Your model has a limit of 30,000 TPM, which translates to 180 Requests Per Minute (RPM) and 500 Tokens Per Second (TPS). Azure OpenAI monitors requests over short intervals, and uneven distribution of requests can trigger rate limiting. In a RAG scenario, each user query involves multiple steps:

    1. User Query: This counts as 1 request.
    2. Search Index Query: This also counts as 1 request.
    3. Language Model Query: Another 1 request.

    So, each RAG process involves a total of 3 requests.
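As a quick sanity check, the quota arithmetic behind the figures above (assuming the 6-RPM-per-1,000-TPM ratio that Azure OpenAI applies to Standard deployments) works out as follows:

```python
TPM = 30_000            # configured quota from the question
RPM = TPM // 1_000 * 6  # Azure's ratio: 6 requests/minute per 1,000 TPM
TPS = TPM // 60         # tokens per second, if usage were spread evenly
print(RPM, TPS)         # 180 500
```

Note that the TPS figure assumes perfectly even distribution; Azure evaluates usage over short intervals, so a burst of retrieval-heavy requests can trip the limiter even when the per-minute total looks safe.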

    With roughly 1,000 documents in the Vector Store, the retrieved chunks injected into each prompt can quickly push token usage past the TPM limit. This can be mitigated by limiting the number of documents retrieved, batching requests, and optimizing queries.
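One concrete way to limit retrieval, if you create the assistant via the API, is the file_search tool's max_num_results option. This is a sketch based on the Assistants v2 API shape; verify the field name against the API version your Azure deployment uses:

```python
# Hedged sketch: cap how many chunks File Search can feed into the prompt.
# Fewer retrieved chunks means fewer prompt tokens counted against TPM.
file_search_tool = {
    "type": "file_search",
    "file_search": {"max_num_results": 5},  # default can be as high as 20
}

# This dict would be passed in the `tools` list when creating the assistant.
```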

    While connected to the Vector Search, you might be hitting either the Rate Limit (RPM) or the Tokens Per Minute (TPM) limit. The RAG process involves multiple steps that can significantly increase the total token count and request rate.

    You could try:

    1. Monitor Token Usage: Use the Azure portal to track token usage and identify spikes.
    2. Optimize Queries: Limit the number of documents retrieved and use filters.
    3. Batch Requests: Spread out requests to avoid bursts.
    4. Implement Retry Logic: Use exponential backoff for retries.
    5. Increase TPM Quota: Request a higher TPM quota if needed.
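The retry suggestion (point 4) can be sketched as a small backoff wrapper. Here RuntimeError stands in for the SDK's rate-limit exception (openai.RateLimitError in the official Python package); swap in the real exception type for production use:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on rate-limit failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for openai.RateLimitError
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # double the delay each attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # jitter spreads retries out so clients don't burst in sync
            time.sleep(delay + random.uniform(0, 0.25 * delay))
```

For example, `with_backoff(lambda: client.chat.completions.create(...))` would retry a throttled call up to five times before giving up.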

    Note:

    I'm not sure whether you're working in the Assistants Playground, so some of these solutions may not be applicable there. If you have any more questions or need further assistance, feel free to ask!


  2. navba-MSFT 27,540 Reputation points Microsoft Employee Moderator
    2024-12-16T02:32:06.6433333+00:00

    @Fin Systems Yes, please increase the TPM quota from Azure AI Foundry as shown below:

    [Screenshot: the quota increase option in Azure AI Foundry]

    This will open a quota request form. Use this form to request an increase due to your forecasted usage for Azure OpenAI Service. Microsoft will use the information you provide to assess your usage volume and patterns, allowing us to allocate the necessary GPU capacity to support your work. We will make every effort to accommodate your request; however, allocation is based on our current capacity and future deployments, and is subject to availability.


    Hope this helps.

