Question about Azure OpenAI quota in Canada Central/East


Centaur MD

I’m trying to deploy Azure OpenAI models in Canada Central using Pay-As-You-Go billing. I want to use Standard or Global Standard deployments only, not Provisioned Throughput / PTU, because I want pay-per-token billing and no hourly provisioned capacity charges.

In Azure AI Foundry quota view, I currently see usable quota for transcription models such as:

gpt-4o-transcribe: 0/400K TPM
gpt-4o-transcribe-diarize: 0/400K TPM

However, most normal chat/reasoning models and embedding models show 0/0 TPM in Canada Central, including models like:

gpt-4o-mini
gpt-4.1-mini
gpt-4.1
text-embedding-3-small
gpt-5-mini
gpt-5.4-mini
gpt-5.4

My questions are:

1. Does 0/0 TPM mean these models are unavailable for my subscription in Canada Central, or do I need to request quota manually?
2. Is requesting a small quota increase, for example 25K-50K TPM, charged immediately, or am I only charged when tokens are actually used by deployed Standard/Global Standard models?
3. For a small bootstrapped application, is it better to request quota for only one chat model first, such as gpt-4o-mini or gpt-4.1-mini, plus one transcription model?
4. Are Standard / Global Standard deployments billed only per token usage, while Provisioned / PTU deployments are the ones that can create hourly charges while idle?
5. If a model shows only finetune quota in Canada Central, does that mean it cannot be used for normal chat completions unless separate non-finetune quota is available?

I’m trying to understand the safest low-cost way to start with Azure OpenAI in Canada Central without accidentally creating large idle charges.


Answer 1

Hello @Centaur MD

Thank you for reaching out regarding Azure OpenAI quota behavior in Canada Central/East. I understand you would like to use Standard or Global Standard deployments with pay-per-token billing while avoiding unintended idle infrastructure charges. I’m happy to clarify how quota and billing work in this scenario.

When you see 0/0 TPM for a given model, region, and deployment type in Azure AI Foundry, it means your subscription currently has zero quota allocated for that combination. This does not mean you are being billed, nor does it necessarily mean the model is unavailable in Canada Central altogether. It simply means you cannot deploy that model there until quota is requested and approved for your subscription.

In some situations, quota requests may be denied temporarily if the regional capacity pool is currently full. In that case, you may either need to wait for additional regional capacity or consider using another supported region.

Regarding billing, requesting a quota increase (for example, 25K–50K TPM) does not incur charges upfront. Standard and Global Standard deployments are strictly pay-per-token, meaning billing applies only when inference requests are processed. Simply having quota assigned to your subscription does not generate costs.

For a lean proof of concept or small-scale application, the recommended approach is to start with a minimal deployment footprint. A common recommendation would be:

One chat/completions model such as gpt-4o-mini or gpt-4.1-mini

One transcription model such as gpt-4o-transcribe

Optionally, one embedding model such as text-embedding-3-small if embeddings are required

This allows you to measure actual usage patterns before requesting additional quota or deploying more models.

It is also important to understand the difference between deployment types:

Standard / Global Standard

Pay-as-you-go pricing

Billed only for input/output token usage

No idle hourly infrastructure charges

Recommended for development, testing, and low-to-medium traffic workloads

Provisioned Throughput Units (PTU)

Reserved dedicated capacity

Can incur hourly charges even when idle

Typically intended for enterprise workloads requiring predictable throughput and latency guarantees

Based on your requirement to avoid unexpected idle charges, Standard or Global Standard deployments would be the safest and most cost-effective option.

Additionally, if a model row shows only fine-tuning-related quota (for example, availableFineTuneCapacity) but no deployable quota, this indicates that the region currently supports fine-tuning operations for that model but does not currently provide standard inference/chat deployment capacity for your subscription. Separate deployable quota would still be required for normal chat completions or embedding workloads.

As a best practice, we also recommend configuring Azure Cost Management budgets and alerts to help monitor consumption and avoid unexpected usage.
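As a rough illustration of the alerting idea, the sketch below checks month-to-date spend against a budget and reports which alert thresholds have been crossed. It does not call Azure Cost Management; the function name, the threshold values, and the dollar figures are assumptions chosen only for the example.

```python
def crossed_thresholds(spend_to_date, monthly_budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that current spend has reached or passed.

    thresholds are fractions of the budget (0.8 means 80%); the defaults
    are illustrative, not values taken from Azure.
    """
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    return [t for t in thresholds if spend_to_date >= t * monthly_budget]


# Hypothetical numbers: a 10.00 budget with 8.50 spent so far.
alerts = crossed_thresholds(spend_to_date=8.50, monthly_budget=10.00)
print(alerts)  # the 50% and 80% thresholds have been crossed
```

With token-billed deployments, spend only moves when requests are processed, so a check like this is mainly a guard against unexpected traffic.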

Please refer to the following documentation:

Manage quota + request increases: https://learn.microsoft.com/azure/ai-services/openai/how-to/quota

Troubleshoot regional quota capacity: https://learn.microsoft.com/azure/ai-services/openai/concepts/provisioned-throughput#quota

Model availability & limited access: https://learn.microsoft.com/azure/ai-services/openai/concepts/models

Quotas & limits overview: https://learn.microsoft.com/azure/foundry/openai/quotas-limits

I hope this helps. Do let me know if you have any further queries.

If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

Thank you!

Answer 2

Meaning of 0/0 TPM in Canada Central

Quota is defined per subscription, per region, and per model or deployment type in tokens per minute (TPM) and requests per minute (RPM).

If a model shows 0/0 TPM for a given region in the quota view, that indicates no usable quota is currently allocated for that model/deployment type in that region for the subscription. In practice, this means the model cannot be used there until quota is assigned, which typically requires a quota increase request or a change in offer/eligibility.

By contrast, models like gpt-4o-transcribe showing 0/400K TPM indicate that quota exists (400K TPM pool) and none is currently consumed.
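To make the used/limit reading concrete, here is a minimal sketch that parses a quota cell such as "0/400K TPM" and distinguishes "no quota allocated" from "quota available and unused". The cell format is taken from the values quoted in the question; the function names and the K/M suffix handling are assumptions, not part of any Azure tool.

```python
import re

# Multipliers for the suffixes seen in the quota view (assumed: K and M only).
_UNITS = {"": 1, "K": 1_000, "M": 1_000_000}
_CELL = re.compile(
    r"\s*([\d.]+)([KM]?)\s*/\s*([\d.]+)([KM]?)\s*TPM\s*", re.IGNORECASE
)


def parse_quota(cell: str) -> tuple[int, int]:
    """Parse a cell like '0/400K TPM' into (used_tpm, limit_tpm)."""
    m = _CELL.fullmatch(cell)
    if m is None:
        raise ValueError(f"unrecognized quota cell: {cell!r}")
    used = int(float(m.group(1)) * _UNITS[m.group(2).upper()])
    limit = int(float(m.group(3)) * _UNITS[m.group(4).upper()])
    return used, limit


def describe(cell: str) -> str:
    """Summarize what a quota cell means for deployment."""
    used, limit = parse_quota(cell)
    if limit == 0:
        return "no quota allocated: request quota before deploying"
    return f"{limit - used} TPM available to assign to deployments"


print(describe("0/0 TPM"))
print(describe("0/400K TPM"))
```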

Cost behavior when requesting quota

Quota is a capacity limit, not a billing trigger. The documentation describes quota as the maximum TPM/RPM that can be assigned and used; it does not state that quota allocation itself generates charges.

Charges for Azure OpenAI are based on tokens actually consumed by requests to deployed models. Requesting a quota increase (for example, to 25K–50K TPM) does not by itself incur cost; cost is incurred only when deployments use that quota and process tokens.
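The token-based billing described above can be sketched as a simple calculation. The per-million-token prices below are placeholders, not actual Azure OpenAI rates; check the current pricing page for the model you deploy. Note that the quota size does not appear anywhere in the formula.

```python
def estimate_token_cost(input_tokens, output_tokens,
                        price_in_per_1m, price_out_per_1m):
    """Estimate cost for a token-billed deployment.

    Prices are per one million tokens. Quota (TPM) is deliberately not a
    parameter: it caps throughput but is not itself billed.
    """
    return (input_tokens / 1_000_000 * price_in_per_1m
            + output_tokens / 1_000_000 * price_out_per_1m)


# Hypothetical prices, for illustration only.
PRICE_IN, PRICE_OUT = 0.50, 2.00

# A month with 2M input tokens and 500K output tokens.
busy_month = estimate_token_cost(2_000_000, 500_000, PRICE_IN, PRICE_OUT)

# A month where the deployment exists but processes no requests.
idle_month = estimate_token_cost(0, 0, PRICE_IN, PRICE_OUT)

print(busy_month, idle_month)
```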

Strategy for a small bootstrapped application

Given that:

Quota is per model, per region, and per deployment type, and
Usage tiers and quota limits are defined per model,

a conservative approach for a small application is:

Request quota for one primary chat model (for example, a single gpt-4o or gpt-4o-mini/gpt-4.1-mini standard/global standard deployment), and
Use the existing transcription quota (gpt-4o-transcribe / gpt-4o-transcribe-diarize) if it already shows nonzero TPM.

This minimizes the number of models with quota while still allowing both reasoning and transcription, and keeps the usage pattern easy to monitor against rate limits and usage tiers.
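When deciding how much quota to request for that single chat model, a back-of-the-envelope estimate can help. The sketch below multiplies peak requests per minute by average tokens per request and checks the result against a candidate quota with some headroom. All traffic figures and the 20% headroom are assumptions, and the service's own rate-limit accounting may count tokens differently, so treat the result as a rough guide only.

```python
def required_tpm(peak_requests_per_minute, avg_prompt_tokens, avg_completion_tokens):
    """Rough peak tokens-per-minute demand for one deployment."""
    return peak_requests_per_minute * (avg_prompt_tokens + avg_completion_tokens)


def fits_quota(required, quota_tpm, headroom=0.2):
    """True if the demand plus a safety margin fits within the quota."""
    return required * (1 + headroom) <= quota_tpm


# Hypothetical traffic: 10 requests/min, 1200 prompt + 300 completion tokens each.
demand = required_tpm(10, 1200, 300)
print(demand, fits_quota(demand, 25_000))
```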

Billing model: Standard / Global Standard vs Provisioned / PTU

The quotas and limits documentation distinguishes between standard / data zone standard / global standard deployments and provisioned throughput:

Usage tiers “only apply to standard, data zone standard, and global standard deployment types. Usage tiers don't apply to global batch and provisioned throughput deployments.”

From this:

Standard / Data zone standard / Global standard deployments are governed by token-based quotas and usage tiers and are billed based on tokens consumed.
Provisioned throughput is treated separately and is associated with capacity-style allocation (and thus can incur charges based on provisioned capacity, even when idle), not just per-token usage.

So, for minimizing idle cost, using standard or global standard deployments and avoiding provisioned throughput is aligned with the goal of pay-per-token billing.
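To see why capacity-style billing matters for a low-traffic app, the sketch below computes what an always-on provisioned deployment would cost in a month regardless of traffic, and the token volume at which a token-billed deployment would cost the same. The hourly rate, unit count, and token price are placeholders rather than real Azure prices, and the simple "rate x units x hours" model is an assumption about how provisioned capacity is charged.

```python
HOURS_PER_MONTH = 730  # common approximation for one month


def provisioned_monthly_cost(hourly_rate_per_unit, units, hours=HOURS_PER_MONTH):
    """Cost of provisioned capacity; independent of how many tokens are used."""
    return hourly_rate_per_unit * units * hours


def break_even_tokens(provisioned_cost, blended_price_per_1m):
    """Monthly token volume at which token billing equals the provisioned cost."""
    return provisioned_cost / blended_price_per_1m * 1_000_000


# Hypothetical figures, for illustration only.
capacity_cost = provisioned_monthly_cost(hourly_rate_per_unit=1.00, units=15)
tokens = break_even_tokens(capacity_cost, blended_price_per_1m=2.00)
print(capacity_cost, tokens)
```

Below the break-even volume, which a small application is likely to be, token-billed deployments cost less, and at zero traffic they cost nothing.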

Models showing only finetune quota

Quota is defined per model or deployment type. If a model shows quota only for a finetune deployment type in a region, that indicates quota is available only for that deployment type there.

In that case, the model cannot be used for normal chat/completions in that region unless there is also non-finetune (standard/global standard) quota allocated for that model/deployment type. Finetune quota alone does not imply availability for regular inference deployments.
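The rule can be expressed as a small check over a model's quota entries in one region. The deployment-type labels and the quota figures below are hypothetical stand-ins for whatever the quota view shows; the point is only that finetune quota is ignored when deciding whether regular inference is possible.

```python
# Assumed labels for deployment types that serve regular inference.
INFERENCE_TYPES = {"Standard", "GlobalStandard", "DataZoneStandard"}


def can_deploy_for_inference(quota_limits: dict[str, int]) -> bool:
    """True if any inference deployment type has a nonzero TPM limit.

    quota_limits maps a deployment-type label to its TPM limit for one
    model in one region. Entries outside INFERENCE_TYPES (for example a
    finetune entry) do not count.
    """
    return any(
        limit > 0
        for deployment_type, limit in quota_limits.items()
        if deployment_type in INFERENCE_TYPES
    )


# Hypothetical rows: the first has only finetune quota.
print(can_deploy_for_inference({"FineTune": 250_000, "Standard": 0}))
print(can_deploy_for_inference({"FineTune": 250_000, "GlobalStandard": 50_000}))
```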

Summary for low-cost, low-risk start in Canada Central

0/0 TPM for a model in Canada Central means no usable quota there; a quota increase request is needed before using that model in that region.
Requesting quota (for example 25K–50K TPM) does not by itself incur charges; billing is driven by actual token usage.
For a small app, requesting quota for a single chat model plus using existing transcription quota is a simple, low-risk pattern.
Standard / global standard deployments are token-billed; provisioned throughput is capacity-billed and can incur idle charges.
If only finetune quota is visible for a model in a region, that model cannot be used for normal chat unless separate standard/global standard quota is also available.
