
Clarification on Maximum Monthly Token Limits for PTU Model – Azure OpenAI

Vitor Cavaco 0 Reputation points
2025-10-02T10:37:30.81+00:00

I am currently using the GPT-4o-128K model with 1 provisioned PTU under the Azure OpenAI service.

I understand that each PTU provides up to 5,000 tokens per second, which theoretically equates to approximately 13.14 billion tokens per month.

However, I would like to confirm the following:

  1. Is there any maximum monthly token limit imposed by Microsoft under the PTU model, regardless of the provisioned throughput?
  2. Are there any throttling policies or additional restrictions that could limit total monthly consumption, even if usage remains within the contracted throughput?
  3. Do these limits apply per model instance, per subscription, or per region?
Azure OpenAI in Foundry Models

2 answers

  1. Alex Burlachenko 21,715 Reputation points MVP Volunteer Moderator
    2025-10-02T14:04:51.9833333+00:00

    Hi Vitor Cavaco,

    You've done the math correctly on the theoretical maximum, but the real-world implementation has some important nuances.

    Let's clarify the most important point first. There is no separate, hard monthly token limit imposed by Microsoft on top of the provisioned throughput unit. The PTU model is designed for predictable performance, not for capping monthly volume, so your theoretical calculation of ~13.14 billion tokens is the intended capacity.

    However, you are right to ask about throttling. The 5,000 tokens per second is the key limit, and it is a performance throttle, not a monthly quota. If you try to send more than 5,000 tokens in a single second, the excess requests will be throttled and fail. But if you spread your 13 billion tokens evenly across the month, you should not hit any throttle.
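
    For reference, the ~13.14 billion figure can be reproduced with a few lines of arithmetic. This is only a sketch: the 5,000 tokens/sec rate is the number quoted in this thread (an assumption, not a documented constant), and the function name is made up for illustration.

```python
# Assumption: 5,000 tokens/sec per PTU is the figure quoted in this thread,
# not a value taken from official documentation.
TOKENS_PER_SECOND = 5_000
SECONDS_PER_DAY = 24 * 60 * 60
AVG_DAYS_PER_MONTH = 365 / 12  # average month length in days


def monthly_token_ceiling(tokens_per_second, days=AVG_DAYS_PER_MONTH):
    """Theoretical tokens per month if the rate is sustained every second."""
    return tokens_per_second * SECONDS_PER_DAY * days


ceiling_avg = monthly_token_ceiling(TOKENS_PER_SECOND)
ceiling_30 = monthly_token_ceiling(TOKENS_PER_SECOND, days=30)

print(f"{ceiling_avg / 1e9:.2f} billion tokens (average month)")
print(f"{ceiling_30 / 1e9:.2f} billion tokens (30-day month)")
```

    The 13.14 billion number corresponds to an average month of 365/12 days; a 30-day month gives 12.96 billion. Either way it is a ceiling that assumes perfectly even traffic.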

    These limits are applied per PTU, per model, per region. If you have one PTU for GPT-4o in East US, that's a separate pool of throughput from another PTU you might have for a different model or in a different region.

    So no, there is no hidden monthly token cap. The only limit is the per-second throughput of your provisioned PTU. As long as your traffic stays within 5,000 tokens per second at any given moment, not just on average over the month, you can use the full theoretical monthly capacity.

    Good luck with your high-scale application. It sounds like you are pushing the boundaries in a great way.

    Regards,

    Alex

    And "yes" if you would follow me on Q&A, personally, thanks.
    P.S. If my answer helps you, please accept it.
    

    https://ctrlaltdel.blog/


  2. Manas Mohanty 16,935 Reputation points Microsoft External Staff Moderator
    2025-10-02T14:31:46.9233333+00:00

    Hi Vitor Cavaco

    1. Is there any maximum monthly token limit under the PTU model?

    No, there is no fixed monthly token cap imposed by Microsoft for Provisioned Throughput Units (PTUs). Your usage is governed by the throughput you’ve reserved (measured in tokens per second or tokens per minute), not by a monthly ceiling. For example, 1 PTU for GPT‑4o typically supports ~5,000 tokens/sec, which theoretically equals ~13.14 billion tokens/month if fully utilized.

    However, this is a theoretical maximum; actual throughput depends on your workload shape (input/output ratio, concurrency). [Understand...soft Learn]

    2. Are there throttling policies or restrictions that could limit monthly consumption?

    Yes, but they are utilization-based, not monthly caps:

    • Hard limit at 100% PTU utilization: When your deployment hits its provisioned capacity, the API returns HTTP 429 (Too Many Requests). This is by design to prevent overuse beyond your reserved throughput. Bursts slightly above 100% may be allowed briefly, but sustained overage is blocked. [Azure Open...soft Learn]
    • Dynamic throttling during spikes: If utilization approaches 90–100%, latency can increase. Some customers implement spillover strategies (e.g., failover to Pay-As-You-Go or another PTU deployment) to handle peaks. [github.com]
    • Abuse detection: Rarely, automated throttles may apply if traffic patterns trigger anti-abuse systems, but these are case-specific and can be remediated via support.
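
    The spillover idea mentioned above can be sketched roughly as follows. This is a minimal illustration, not an official pattern: `call_with_spillover`, the `(status, body)` response shape, and the two stand-in callables are all hypothetical, and a real client would wrap actual API calls to the PTU and Pay-As-You-Go deployments.

```python
import time
from typing import Callable, Tuple

# Hypothetical response shape for this sketch: (HTTP status code, body).
Response = Tuple[int, str]


def call_with_spillover(
    primary: Callable[[], Response],
    fallback: Callable[[], Response],
    max_retries: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Try the PTU deployment; on HTTP 429 back off, then spill over."""
    for attempt in range(max_retries + 1):
        status, body = primary()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Exponential backoff before retrying the PTU deployment.
            sleep(base_delay * 2 ** attempt)
    # Still throttled after all retries: route to the fallback deployment.
    return fallback()


# Stand-ins for real API calls (assumptions for the demo only).
def always_throttled() -> Response:
    return 429, "Too Many Requests"


def pay_as_you_go() -> Response:
    return 200, "served by fallback"


result = call_with_spillover(always_throttled, pay_as_you_go, sleep=lambda s: None)
print(result)
```

    Injecting `sleep` keeps the demo instant and makes the backoff schedule easy to check; in production the default `time.sleep` would apply.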

    3. Do these limits apply per model instance, subscription, or region?

    Hope it helps

    Thank you

