Assistants API Rate Limit Exceeded

Fin Systems 0 Reputation points
2024-12-09T14:25:55.5233333+00:00

Hello!

We have a Rate Limit Exceeded issue for Assistants using the File Search tool.

Before asking this question, we studied the forums and recommendations and performed all the suggested checks, but it did not help.

Below is a more detailed description of the problem.

  1. We use the gpt-4o (Global Standard) model with a configured limit of 30K TPM. We checked the settings in the Quotas section; everything is correct.
  2. We uploaded about 1,000 of our files to the Vector Store so that the assistant could answer questions about their content. The upload was successful and satisfies all the restrictions.
  3. We uploaded the files and created the Vector Store in late August to early September. At first everything worked correctly and we did not encounter any problems.

In November, the Rate Limit Exceeded problem began to appear, and waiting for a minute to pass does not always help.

Along the way we noticed the following behavior: if we disable File Search and detach the Vector Store, the assistant works without the Rate Limit Exceeded error. It answers complex and extensive questions no matter how often we ask them, and the error never appears.

Accordingly, the problem seems to lie either in the token limit or in the use of the Vector Store. To repeat, this error did not occur in late August and early September.

What needs to be done to solve this problem?

Azure OpenAI Service
Azure AI services

2 answers

  1. Max Lacy 345 Reputation points
    2024-12-09T17:53:33.0333333+00:00

    It sounds like you're experiencing a frustrating issue with the "Rate Limit Exceeded" error for your assistant using the File Search tool. Here's an explanation and some potential solutions:

    Your model has a limit of 30,000 TPM, which translates to 180 Requests Per Minute (RPM) and 500 Tokens Per Second (TPS). Azure OpenAI monitors requests over short intervals, and uneven distribution of requests can trigger rate limiting. In a RAG scenario, each user query involves multiple steps:

    1. User Query: This counts as 1 request.
    2. Search Index Query: This also counts as 1 request.
    3. Language Model Query: Another 1 request.

    So, each RAG process involves a total of 3 requests.
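As a quick sanity check, the quota arithmetic behind the figures above (assuming the 6-RPM-per-1,000-TPM ratio that Azure OpenAI applies to Standard deployments) works out as follows:

```python
TPM = 30_000            # configured quota from the question
RPM = TPM // 1_000 * 6  # Azure's ratio: 6 requests/minute per 1,000 TPM
TPS = TPM // 60         # tokens per second, if usage were spread evenly
print(RPM, TPS)         # 180 500
```

Note that the TPS figure assumes perfectly even distribution; Azure evaluates usage over short intervals, so a burst of retrieval-heavy requests can trip the limiter even when the per-minute total looks safe.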

    With roughly 1,000 documents in the Vector Store, the retrieved chunks injected into each prompt can quickly push token usage past the TPM limit. This can be mitigated by limiting the number of documents retrieved, batching requests, and optimizing queries.
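One concrete way to limit retrieval, if you create the assistant via the API, is the file_search tool's max_num_results option. This is a sketch based on the Assistants v2 API shape; verify the field name against the API version your Azure deployment uses:

```python
# Hedged sketch: cap how many chunks File Search can feed into the prompt.
# Fewer retrieved chunks means fewer prompt tokens counted against TPM.
file_search_tool = {
    "type": "file_search",
    "file_search": {"max_num_results": 5},  # default can be as high as 20
}

# This dict would be passed in the `tools` list when creating the assistant.
```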

    While connected to the Vector Search, you might be hitting either the Rate Limit (RPM) or the Tokens Per Minute (TPM) limit. The RAG process involves multiple steps that can significantly increase the total token count and request rate.

    You could try:

    1. Monitor Token Usage: Use the Azure portal to track token usage and identify spikes.
    2. Optimize Queries: Limit the number of documents retrieved and use filters.
    3. Batch Requests: Spread out requests to avoid bursts.
    4. Implement Retry Logic: Use exponential backoff for retries.
    5. Increase TPM Quota: Request a higher TPM quota if needed.
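The retry suggestion (point 4) can be sketched as a small backoff wrapper. Here RuntimeError stands in for the SDK's rate-limit exception (openai.RateLimitError in the official Python package); swap in the real exception type for production use:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on rate-limit failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for openai.RateLimitError
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # double the delay each attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # jitter spreads retries out so clients don't burst in sync
            time.sleep(delay + random.uniform(0, 0.25 * delay))
```

For example, `with_backoff(lambda: client.chat.completions.create(...))` would retry a throttled call up to five times before giving up.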

    Note:

    I'm not sure whether you're working in the Assistants Playground, so some of these solutions may not be applicable there. If you have any more questions or need further assistance, feel free to ask!


  2. navba-MSFT 27,540 Reputation points Microsoft Employee Moderator
    2024-12-16T02:32:06.6433333+00:00

    @Fin Systems Yes, please increase the TPM quota from Azure AI Foundry as shown below:

    [Screenshot: the quota increase option in Azure AI Foundry]

    This will open a quota request form. Use this form to request an increase due to your forecasted usage for Azure OpenAI Service. Microsoft will use the information you provide to assess your usage volume and patterns, allowing us to allocate the necessary GPU capacity to support your work. We will make every effort to accommodate your request; however, allocation is based on our current capacity and future deployments, and is subject to availability.


    Hope this helps.

