Can you help us get a detailed view of our actual usage — requests and tokens per minute — compared to our quota limits, so we can understand if rate limiting is affecting performance as we scale?

Mehdi Boumhicha 0 Reputation points
2025-05-21T13:56:53.73+00:00

We are currently using the Chat Completions API with a Data Zone Standard deployment. Our application handles normal text-to-text conversations. The instance is configured with the full quota:

200,000 tokens per minute

2,000 requests per minute

When the number of users is low, the experience is smooth. However, as soon as we reach around 10 users or more, the experience degrades noticeably — responses become slower or inconsistent.

Our main challenge is that we lack visibility into detailed metrics. We cannot accurately track our real-time usage in terms of requests and tokens per minute, which prevents us from clearly identifying whether the performance degradation is due to hitting the rate limits.

We would like to:

Get a detailed breakdown of our actual consumption compared to the defined limits.

Share our usage context with someone from Azure (how we use the API and the nature of our workload) so they can help us understand:

Whether our current deployment type is appropriate, and how we might scale as our usage grows

Azure OpenAI Service
An Azure service that provides access to OpenAI’s GPT-3 models with enterprise capabilities.

1 answer

  1. SriLakshmi C 6,250 Reputation points Microsoft External Staff Moderator
    2025-05-22T10:28:45.06+00:00

    Hello @Mehdi Boumhicha,

    To gain better insight into your Azure OpenAI usage and determine whether rate limiting is affecting performance, it’s important to start by monitoring your real-time usage against your defined quotas.

    Start by enabling Azure Monitor and Application Insights on your Azure OpenAI resource. These tools will help you track essential metrics such as requests per minute (RPM), tokens per request, total tokens per minute (TPM), latency, and error rates. You can visualize this data in Azure Metrics Explorer by navigating to your OpenAI resource in the Azure portal and adding charts for total tokens, total requests, throttled requests, and latency. This makes it easier to correlate spikes in usage with any performance degradation.
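Alongside the portal metrics, you can keep a lightweight client-side view of your own RPM/TPM consumption. The sketch below is an illustrative helper (not an Azure SDK class, and the names are my own): it tracks requests and tokens over a sliding 60-second window so your application can log when it approaches the deployment quota.

```python
import time
from collections import deque

class UsageTracker:
    """Client-side sliding-window tracker for requests and tokens per minute.

    Illustrative only: it mirrors the RPM/TPM numbers you would also see in
    Azure Metrics Explorer, so you can log locally when nearing the quota.
    """

    def __init__(self, rpm_limit=2000, tpm_limit=200_000, window_seconds=60):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens) per completed request

    def record(self, tokens, now=None):
        """Record one request and the total tokens it consumed."""
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))
        self._trim(now)

    def _trim(self, now):
        # Drop events older than the window so counts reflect the last minute.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def snapshot(self, now=None):
        """Return current requests/min and tokens/min with % of quota used."""
        now = time.monotonic() if now is None else now
        self._trim(now)
        rpm = len(self.events)
        tpm = sum(tokens for _, tokens in self.events)
        return {
            "rpm": rpm,
            "tpm": tpm,
            "rpm_pct": 100 * rpm / self.rpm_limit,
            "tpm_pct": 100 * tpm / self.tpm_limit,
        }
```

In practice you would call `record()` with the `usage.total_tokens` value returned in each Chat Completions response.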

    Enabling diagnostic logs and sending them to Log Analytics, Azure Storage, or Event Hubs will give you detailed, queryable data on token usage, API errors (such as HTTP 429s for throttling), and response times. This is especially useful with Kusto Query Language (KQL), which lets you perform time-based analysis and see how your usage patterns evolve throughout the day.
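As a starting point, a query along the following lines can surface throttled requests per minute. This is a sketch kept as a Python string for convenience; the table and column names are typical for `AzureDiagnostics` but can vary by resource and log category, so verify them against your own Log Analytics workspace schema before relying on it.

```python
# Sketch of a KQL query for spotting throttled (HTTP 429) Azure OpenAI calls.
# Assumption: diagnostic logs flow into the AzureDiagnostics table; check your
# workspace schema, since column names differ across resources/categories.
THROTTLING_QUERY = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.COGNITIVESERVICES"
| where Category == "RequestResponse"
| summarize
    total_requests = count(),
    throttled = countif(ResultSignature == "429")
    by bin(TimeGenerated, 1m)
| extend throttle_pct = 100.0 * throttled / total_requests
| order by TimeGenerated asc
""".strip()

if __name__ == "__main__":
    # Paste the query into the Log Analytics query editor, or run it
    # programmatically (for example with the azure-monitor-query SDK).
    print(THROTTLING_QUERY)
```

A sustained non-zero `throttle_pct` in the minutes where users report slowness is a strong signal that rate limiting, rather than model latency, is the cause.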

    For more details, please refer to Monitor Azure OpenAI.

    Next, evaluate whether you're hitting the current rate limits of 200,000 tokens per minute or 2,000 requests per minute. If you're observing increased latency or errors as user count grows, particularly around or beyond 10 concurrent users, you may be reaching these limits. Consider reviewing your deployment type as well. The Data Zone Standard tier uses shared infrastructure, which may lead to resource contention. For higher performance and isolation, you might want to explore provisioned deployment options such as Provisioned Throughput (PTU) or Data Zone Provisioned, which offer more predictable performance and reserved capacity.

    To improve scalability, implement optimizations such as batching requests to reduce RPM, minimizing token usage in prompts, caching responses for repeated queries, and distributing traffic across multiple deployments if needed. These steps can help mitigate performance issues even before scaling resources.
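Of these, caching is often the cheapest win. A minimal sketch, assuming deterministic calls (e.g. `temperature=0`, where the same prompt should yield the same answer); `make_cached_completion` is a hypothetical wrapper name, not part of any SDK:

```python
from functools import lru_cache

def make_cached_completion(complete_fn, maxsize=1024):
    """Wrap a completion function (prompt -> text) with an LRU cache.

    Repeated identical prompts are served from memory, consuming no
    requests or tokens. Only safe when the call is deterministic.
    """
    @lru_cache(maxsize=maxsize)
    def cached(prompt):
        return complete_fn(prompt)
    return cached
```

In your application, `complete_fn` would be the function that actually calls the Chat Completions API; only cache misses reach Azure and count against RPM/TPM.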

    Finally, to request a quota increase, gather key information such as your current usage metrics (peak tokens and requests per minute), your expected growth (e.g., forecasting a 5× increase in 3 months), and the business impact of hitting these limits. Be sure to clearly describe your workload, for example, “handling text-to-text chat conversations for 10–50 concurrent users,” and submit this through the Azure OpenAI quota increase option in the Azure portal. You can also submit a request through the Request more quota form.

    I hope this helps. Do let me know if you have any further queries.


    If this answers your query, please click “Accept Answer” and “Yes” for “Was this answer helpful”.

    Thank you!

