Hello @Mehdi Boumhicha,
To gain better insight into your Azure OpenAI usage and determine whether rate limiting is affecting performance, it’s important to start by monitoring your real-time usage against your defined quotas.
Start by enabling Azure Monitor and Application Insights on your Azure OpenAI resource. These tools will help you track essential metrics such as requests per minute (RPM), tokens per request, total tokens per minute (TPM), latency, and error rates. You can visualize this data in Azure Metrics Explorer by navigating to your OpenAI resource in the Azure Portal and adding charts for total tokens, total requests, throttled requests, and latency. This makes it easier to correlate spikes in usage with any performance degradation.
Enabling diagnostic logs and sending them to Log Analytics, Azure Storage, or Event Hubs will give you detailed, queryable data on token usage, API errors (such as HTTP 429s for throttling), and response times. This is especially useful if you use Kusto Query Language (KQL) to perform time-based analysis and see how your usage patterns evolve throughout the day.
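As a rough illustration, a KQL query along these lines can surface throttling over time in Log Analytics (table and column names here follow the common AzureDiagnostics schema, but verify them against your own workspace before relying on the results):

```kusto
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| summarize TotalRequests = count(),
            ThrottledRequests = countif(httpStatusCode_d == 429)
    by bin(TimeGenerated, 5m)
| render timechart
```

Binning by 5 minutes makes it easy to line up throttling spikes with the latency degradation your users report.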
For details, please refer to Monitor Azure OpenAI.
Next, evaluate whether you're hitting the current rate limits of 200,000 tokens per minute or 2,000 requests per minute. If you're observing increased latency or errors as the user count grows, particularly around or beyond 10 concurrent users, you may be reaching these limits. Consider reviewing your deployment type as well. The Data Zone – Standard tier uses shared infrastructure, which may lead to resource contention. For higher performance and isolation, you might want to explore moving to Dedicated Capacity or Data Zone – Enterprise, which offer better performance predictability and, in some cases, autoscaling support.
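If you are seeing HTTP 429s, a client-side retry with exponential backoff can smooth over transient throttling while you work on quota. This is a generic sketch; `RateLimitError` is a stand-in for whatever exception your SDK raises on 429, and `request_fn` is where your actual Azure OpenAI call goes:

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the 429 (throttled) error your SDK raises."""


def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry request_fn with exponential backoff when throttled.

    request_fn: any callable that raises RateLimitError on HTTP 429;
    replace it with your real Azure OpenAI client call.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter: ~1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

The jitter keeps many concurrent clients from retrying in lockstep, which would otherwise produce a second synchronized spike against the same quota.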
To improve scalability, implement optimizations such as batching requests to reduce RPM, minimizing token usage in prompts, caching responses for repeated queries, and distributing traffic across multiple deployments if needed. These steps can help mitigate performance issues even before scaling resources.
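The caching idea above can be sketched with a simple in-memory memo keyed on the prompt. `expensive_model_call` is a hypothetical placeholder for your deployment call; production code would also want a TTL and an eviction policy beyond the size bound shown here:

```python
from functools import lru_cache


def expensive_model_call(prompt: str) -> str:
    # Placeholder for your real Azure OpenAI deployment call.
    return f"response to: {prompt}"


@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    # Identical prompts are served from the cache instead of
    # spending tokens and request quota a second time.
    return expensive_model_call(prompt)
```

For repeated queries (FAQ-style chat turns, canned system checks), this directly reduces both RPM and TPM consumed against your quota.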
Finally, to request a quota increase, gather key information such as your current usage metrics (peak tokens and requests per minute), your expected growth (e.g., forecasting a 5× increase in 3 months), and the business impact of hitting these limits. Be sure to clearly describe your workload, for example, "handling text-to-text chat conversations for 10–50 concurrent users," and submit this through the Azure OpenAI quota increase option in the Azure Portal. You can also submit the Request more quota form.
I hope this helps. Do let me know if you have any further queries.
If this answers your query, please click Accept Answer and Yes for "Was this answer helpful."
Thank you!