Can you help us get a detailed view of our actual usage — requests and tokens per minute — compared to our quota limits, so we can understand if rate limiting is affecting performance as we scale?

Mehdi Boumhicha 0 Reputation points
2025-05-21T13:56:53.73+00:00

We are currently using the Chat Completions API with a Data Zone Standard deployment. Our application handles normal text-to-text conversations. The instance is configured with the full quota:

200,000 tokens per minute

2,000 requests per minute

When the number of users is low, the experience is smooth. However, as soon as we reach around 10 users or more, the experience degrades noticeably — responses become slower or inconsistent.

Our main challenge is that we lack visibility into detailed metrics. We cannot accurately track our real-time usage in terms of requests and tokens per minute, which prevents us from clearly identifying whether the performance degradation is due to hitting the rate limits.

We would like to:

Get a detailed breakdown of our actual consumption compared to the defined limits.

Share our usage context with someone from Azure (how we use the API and the nature of our workload) so they can help us understand:

Whether our current deployment type is appropriate, and how we might scale as our usage grows

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. SriLakshmi C 6,250 Reputation points Microsoft External Staff Moderator
    2025-05-22T10:28:45.06+00:00

    Hello @Mehdi Boumhicha,

    To gain better insight into your Azure OpenAI usage and determine whether rate limiting is affecting performance, it’s important to start by monitoring your real-time usage against your defined quotas.

    Start by enabling Azure Monitor and Application Insights on your Azure OpenAI resource. These tools will help you track essential metrics such as requests per minute (RPM), tokens per request, total tokens per minute (TPM), latency, and error rates. You can visualize this data in Azure Metrics Explorer by navigating to your OpenAI resource in the Azure portal and adding charts for total tokens, total requests, throttled requests, and latency. This makes it easier to correlate spikes in usage with any performance degradation.
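Alongside the portal metrics, you can keep a lightweight client-side view of your own RPM/TPM consumption. The sketch below is an illustrative helper (not an Azure SDK class, and the names are my own): it tracks requests and tokens over a sliding 60-second window so your application can log when it approaches the deployment quota.

```python
import time
from collections import deque

class UsageTracker:
    """Client-side sliding-window tracker for requests and tokens per minute.

    Illustrative only: it mirrors the RPM/TPM numbers you would also see in
    Azure Metrics Explorer, so you can log locally when nearing the quota.
    """

    def __init__(self, rpm_limit=2000, tpm_limit=200_000, window_seconds=60):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens) per completed request

    def record(self, tokens, now=None):
        """Record one request and the total tokens it consumed."""
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))
        self._trim(now)

    def _trim(self, now):
        # Drop events older than the window so counts reflect the last minute.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def snapshot(self, now=None):
        """Return current requests/min and tokens/min with % of quota used."""
        now = time.monotonic() if now is None else now
        self._trim(now)
        rpm = len(self.events)
        tpm = sum(tokens for _, tokens in self.events)
        return {
            "rpm": rpm,
            "tpm": tpm,
            "rpm_pct": 100 * rpm / self.rpm_limit,
            "tpm_pct": 100 * tpm / self.tpm_limit,
        }
```

In practice you would call `record()` with the `usage.total_tokens` value returned in each Chat Completions response.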

    Enabling diagnostic logs and sending them to Log Analytics, Azure Storage, or Event Hubs will give you detailed, queryable data on token usage, API errors (such as HTTP 429s for throttling), and response times. This is especially useful with Kusto Query Language (KQL), which lets you perform time-based analysis and see how your usage patterns evolve throughout the day.
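As a starting point, a query along the following lines can surface throttled requests per minute. This is a sketch kept as a Python string for convenience; the table and column names are typical for `AzureDiagnostics` but can vary by resource and log category, so verify them against your own Log Analytics workspace schema before relying on it.

```python
# Sketch of a KQL query for spotting throttled (HTTP 429) Azure OpenAI calls.
# Assumption: diagnostic logs flow into the AzureDiagnostics table; check your
# workspace schema, since column names differ across resources/categories.
THROTTLING_QUERY = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| summarize
    total_requests = count(),
    throttled = countif(ResultSignature == "429")
    by bin(TimeGenerated, 1m)
| extend throttle_pct = 100.0 * throttled / total_requests
| order by TimeGenerated asc
""".strip()

if __name__ == "__main__":
    # Paste the query into the Log Analytics query editor, or run it
    # programmatically (for example with the azure-monitor-query SDK).
    print(THROTTLING_QUERY)
```

A sustained non-zero `throttle_pct` in the minutes where users report slowness is a strong signal that rate limiting, rather than model latency, is the cause.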

    For more details, please refer to Monitor Azure OpenAI.

    Next, evaluate whether you're hitting the current rate limits of 200,000 tokens per minute or 2,000 requests per minute. If you're observing increased latency or errors as user count grows, particularly around or beyond 10 concurrent users, you may be reaching these limits. Consider reviewing your deployment type as well. The Data Zone Standard tier uses shared infrastructure, which may lead to resource contention. For higher performance and isolation, you might want to explore provisioned deployment options such as Provisioned Throughput (PTU) or Data Zone Provisioned, which offer more predictable performance and reserved capacity.

    To improve scalability, implement optimizations such as batching requests to reduce RPM, minimizing token usage in prompts, caching responses for repeated queries, and distributing traffic across multiple deployments if needed. These steps can help mitigate performance issues even before scaling resources.
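Of these, caching is often the cheapest win. A minimal sketch, assuming deterministic calls (e.g. `temperature=0`, where the same prompt should yield the same answer); `make_cached_completion` is a hypothetical wrapper name, not part of any SDK:

```python
from functools import lru_cache

def make_cached_completion(complete_fn, maxsize=1024):
    """Wrap a completion function (prompt -> text) with an LRU cache.

    Repeated identical prompts are served from memory, consuming no
    requests or tokens. Only safe when the call is deterministic.
    """
    @lru_cache(maxsize=maxsize)
    def cached(prompt):
        return complete_fn(prompt)
    return cached
```

In your application, `complete_fn` would be the function that actually calls the Chat Completions API; only cache misses reach Azure and count against RPM/TPM.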

    Finally, to request a quota increase, gather key information such as your current usage metrics (peak tokens and requests per minute), your expected growth (e.g., forecasting a 5× increase in 3 months), and the business impact of hitting these limits. Be sure to clearly describe your workload, for example, “handling text-to-text chat conversations for 10–50 concurrent users,” and submit this through the Azure OpenAI quota increase option in the Azure portal. You can also submit a request through the Request more quota form.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click “Accept Answer” and “Yes” for “Was this answer helpful”.

    Thank you!

