Question about Azure OpenAI quota in Canada Central/East

Centaur MD 0 Reputation points
2026-05-22T23:36:00.52+00:00

I’m trying to deploy Azure OpenAI models in Canada Central using Pay-As-You-Go billing. I want to use Standard or Global Standard deployments only, not Provisioned Throughput / PTU, because I want pay-per-token billing and no hourly provisioned capacity charges.

In Azure AI Foundry quota view, I currently see usable quota for transcription models such as:

  • gpt-4o-transcribe: 0/400K TPM
  • gpt-4o-transcribe-diarize: 0/400K TPM

However, most normal chat/reasoning models and embedding models show 0/0 TPM in Canada Central, including models like:

  • gpt-4o-mini
  • gpt-4.1-mini
  • gpt-4.1
  • text-embedding-3-small
  • gpt-5-mini
  • gpt-5.4-mini
  • gpt-5.4

My questions are:

  1. Does 0/0 TPM mean these models are unavailable for my subscription in Canada Central, or do I need to request quota manually?
  2. Is requesting a small quota increase, for example 25K-50K TPM, charged immediately, or am I only charged when tokens are actually used by deployed Standard/Global Standard models?
  3. For a small bootstrapped application, is it better to request quota for only one chat model first, such as gpt-4o-mini or gpt-4.1-mini, plus one transcription model?
  4. Are Standard / Global Standard deployments billed only per token usage, while Provisioned / PTU deployments are the ones that can create hourly charges while idle?
  5. If a model shows only finetune quota in Canada Central, does that mean it cannot be used for normal chat completions unless separate non-finetune quota is available?

I’m trying to understand the safest low-cost way to start with Azure OpenAI in Canada Central without accidentally creating large idle charges.

Azure OpenAI in Foundry Models

2 answers

  1. SRILAKSHMI C 19,005 Reputation points Microsoft External Staff Moderator
    2026-05-26T14:27:04.6+00:00

    Hello @Centaur MD

    Thank you for reaching out regarding Azure OpenAI quota behavior in Canada Central/East. I understand you would like to use Standard or Global Standard deployments with pay-per-token billing while avoiding unintended idle infrastructure charges. I’m happy to clarify how quota and billing work in this scenario.

    When you see 0/0 TPM for a given model, region, and deployment type in Azure AI Foundry, it means your subscription currently has zero quota allocated for that combination. This does not mean you are being billed, and it does not necessarily mean the model is unavailable in Canada Central altogether. It means you cannot deploy that model there until quota is requested and approved for your subscription.

    In some situations, quota requests may be denied temporarily if the regional capacity pool is currently full. In that case, you may either need to wait for additional regional capacity or consider using another supported region.

    Regarding billing, requesting a quota increase (for example, 25K–50K TPM) does not incur charges upfront. Standard and Global Standard deployments are strictly pay-per-token, meaning billing applies only when inference requests are processed. Simply having quota assigned to your subscription does not generate costs.
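    The pay-per-token arithmetic can be sketched roughly as follows. The prices below are placeholders invented for illustration, not actual Azure rates, and `TokenPrice` and `estimate_cost` are hypothetical helper names. The point is only that zero tokens processed gives a zero estimate, regardless of how much quota is assigned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical per-million-token prices in USD (placeholders, not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Estimate the token-based charge for a given amount of usage."""
    return (
        input_tokens / 1_000_000 * price.input_per_million
        + output_tokens / 1_000_000 * price.output_per_million
    )


# Assumed placeholder prices; check the Azure pricing page for real values.
example_price = TokenPrice(input_per_million=0.50, output_per_million=2.00)

# Quota assigned but no requests sent: nothing to bill.
idle_cost = estimate_cost(example_price, input_tokens=0, output_tokens=0)

# A month with 3M input tokens and 1M output tokens.
monthly_cost = estimate_cost(
    example_price, input_tokens=3_000_000, output_tokens=1_000_000
)

print(f"idle: ${idle_cost:.2f}, active month: ${monthly_cost:.2f}")
```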

    For a lean proof of concept or small-scale application, start with a minimal deployment footprint:

    • One chat/completions model such as gpt-4o-mini or gpt-4.1-mini
    • One transcription model such as gpt-4o-transcribe
    • Optionally, one embedding model such as text-embedding-3-small if embeddings are required

    This allows you to measure actual usage patterns before requesting additional quota or deploying more models.
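    One rough way to decide how much TPM to ask for is to multiply expected peak requests per minute by the average tokens per request, then add headroom. The sketch below assumes that the token count per request covers both prompt and completion, and the 1.5x headroom factor is an arbitrary choice. The function names are hypothetical.

```python
import math


def required_tpm(
    requests_per_minute: float,
    avg_tokens_per_request: int,
    headroom: float = 1.5,
) -> int:
    """Estimate peak tokens per minute, padded by a safety factor (assumed 1.5x)."""
    return math.ceil(requests_per_minute * avg_tokens_per_request * headroom)


def round_up_to_thousands(tpm: int) -> int:
    """Round up to the next 1K, matching the 'K TPM' figures used in this thread."""
    return math.ceil(tpm / 1000) * 1000


# Example: 10 requests per minute at peak, about 1,500 tokens each.
peak_tpm = required_tpm(requests_per_minute=10, avg_tokens_per_request=1500)
quota_to_request = round_up_to_thousands(peak_tpm)

print(f"estimated peak: {peak_tpm} TPM, request about {quota_to_request} TPM")
```

    Under these assumed numbers the estimate falls just below the 25K–50K TPM range mentioned in the question, which suggests a request of that size leaves some margin for a small application.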

    It is also important to understand the difference between deployment types:

    Standard / Global Standard
    • Pay-as-you-go pricing
    • Billed only for input/output token usage
    • No idle hourly infrastructure charges
    • Recommended for development, testing, and low-to-medium traffic workloads

    Provisioned Throughput Units (PTU)
    • Reserved dedicated capacity
    • Can incur hourly charges even when idle
    • Typically intended for enterprise workloads requiring predictable throughput and latency guarantees

    Based on your requirement to avoid unexpected idle charges, Standard or Global Standard deployments would be the safest and most cost-effective option.
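    To make the idle-cost difference concrete, here is a minimal comparison of the two billing shapes. Every rate and unit count below is a made-up placeholder, not an Azure price or a real PTU minimum, and the function names are hypothetical. The sketch only contrasts "cost scales with tokens" against "cost scales with hours reserved".

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def token_billed_monthly(tokens_used: int, price_per_million: float) -> float:
    """Token-billed shape: cost scales with tokens actually processed."""
    return tokens_used / 1_000_000 * price_per_million


def capacity_billed_monthly(
    units: int, hourly_rate_per_unit: float, hours: int = HOURS_PER_MONTH
) -> float:
    """Capacity-billed shape: cost scales with reserved units and hours, used or not."""
    return units * hourly_rate_per_unit * hours


def break_even_tokens(capacity_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which both shapes would cost the same."""
    return capacity_cost / price_per_million * 1_000_000


# Placeholder numbers for illustration only.
PRICE_PER_MILLION = 1.00   # assumed blended token price, USD
UNITS = 15                 # assumed number of provisioned units
HOURLY_RATE = 1.00         # assumed USD per unit per hour

idle_token_cost = token_billed_monthly(0, PRICE_PER_MILLION)
idle_capacity_cost = capacity_billed_monthly(UNITS, HOURLY_RATE)
threshold = break_even_tokens(idle_capacity_cost, PRICE_PER_MILLION)

print(f"idle month, token-billed:    ${idle_token_cost:,.2f}")
print(f"idle month, capacity-billed: ${idle_capacity_cost:,.2f}")
print(f"break-even volume: {threshold:,.0f} tokens per month")
```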

    Additionally, if a model row shows only fine-tuning-related quota (for example, availableFineTuneCapacity) but no deployable quota, this indicates that the region currently supports fine-tuning operations for that model but does not currently provide standard inference/chat deployment capacity for your subscription. Separate deployable quota would still be required for normal chat completions or embedding workloads.

    As a best practice, we also recommend configuring Azure Cost Management budgets and alerts to help monitor consumption and avoid unexpected usage.

    Please refer to the following documentation:

    Manage quota + request increases: https://learn.microsoft.com/azure/ai-services/openai/how-to/quota

    Troubleshoot regional quota capacity: https://learn.microsoft.com/azure/ai-services/openai/concepts/provisioned-throughput#quota

    Model availability & limited access: https://learn.microsoft.com/azure/ai-services/openai/concepts/models

    Quotas & limits overview: https://learn.microsoft.com/azure/foundry/openai/quotas-limits

    I hope this helps. Let me know if you have any further questions.


    If this answers your query, please click "Accept Answer" and select "Yes" under "Was this answer helpful?".

    Thank you!


  2. AI answer

    2026-05-22T23:36:14.4633333+00:00
    1. Meaning of 0/0 TPM in Canada Central

    Quota is defined per subscription, per region, and per model or deployment type, in tokens per minute (TPM) and requests per minute (RPM).

    If a model shows 0/0 TPM for a given region in the quota view, that indicates no usable quota is currently allocated for that model/deployment type in that region for the subscription. In practice, this means the model cannot be used there until quota is assigned, which typically requires a quota increase request or a change in offer/eligibility.

    By contrast, models like gpt-4o-transcribe showing 0/400K TPM indicate that quota exists (400K TPM pool) and none is currently consumed.
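    The used/limit reading of these quota strings can be sketched as a small parser. This assumes the display format is exactly what appears in the question (for example `0/400K TPM`), and `parse_quota` and `describe` are hypothetical helpers, not part of any Azure SDK.

```python
import re

_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}
_QUOTA = re.compile(
    r"^\s*(\d+(?:\.\d+)?)([KM]?)\s*/\s*(\d+(?:\.\d+)?)([KM]?)\s*TPM\s*$",
    re.IGNORECASE,
)


def parse_quota(text: str) -> tuple[int, int]:
    """Parse a 'used/limit TPM' string into (used, limit) token counts."""
    match = _QUOTA.match(text)
    if match is None:
        raise ValueError(f"unrecognized quota string: {text!r}")
    used = int(float(match.group(1)) * _SUFFIX[match.group(2).upper()])
    limit = int(float(match.group(3)) * _SUFFIX[match.group(4).upper()])
    return used, limit


def describe(text: str) -> str:
    """Summarize what a quota string means for deployability."""
    used, limit = parse_quota(text)
    if limit == 0:
        return "no quota allocated"
    return f"{limit - used} TPM available"


for sample in ("0/400K TPM", "0/0 TPM"):
    print(sample, "->", describe(sample))
```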

    2. Cost behavior when requesting quota

    Quota is a capacity limit, not a billing trigger. The documentation describes quota as the maximum TPM/RPM that can be assigned and used; it does not state that quota allocation itself generates charges.

    Charges for Azure OpenAI are based on tokens actually consumed by requests to deployed models. Requesting a quota increase (for example, to 25K–50K TPM) does not by itself incur cost; cost is incurred only when deployments use that quota and process tokens.

    3. Strategy for a small bootstrapped application

    Given that:
    • Quota is per model, per region, per deployment type, and
    • Usage tiers and quota limits are defined per model,

    A conservative approach for a small application is:

    • Request quota for one primary chat model (for example, a single gpt-4o or gpt-4o-mini/gpt-4.1-mini standard/global standard deployment), and
    • Use the existing transcription quota (gpt-4o-transcribe / gpt-4o-transcribe-diarize) if it already shows nonzero TPM.

    This minimizes the number of models with quota while still allowing both reasoning and transcription, and keeps the usage pattern easy to monitor against rate limits and usage tiers.

    4. Billing model: Standard / Global Standard vs Provisioned / PTU

    The quotas and limits documentation distinguishes between standard / data zone standard / global standard deployments and provisioned throughput:
    • Usage tiers “only apply to standard, data zone standard, and global standard deployment types. Usage tiers don't apply to global batch and provisioned throughput deployments.”

    From this:

    • Standard / Data zone standard / Global standard deployments are governed by token-based quotas and usage tiers and are billed based on tokens consumed.
    • Provisioned throughput is treated separately and is associated with capacity-style allocation (and thus can incur charges based on provisioned capacity, even when idle), not just per-token usage.

    So, for minimizing idle cost, using standard or global standard deployments and avoiding provisioned throughput is aligned with the goal of pay-per-token billing.

    5. Models showing only finetune quota

    Quota is defined per model or deployment type. If a model shows quota only for a finetune deployment type in a region, that indicates quota is available only for that deployment type there.

    In that case, the model cannot be used for normal chat/completions in that region unless there is also non-finetune (standard/global standard) quota allocated for that model/deployment type. Finetune quota alone does not imply availability for regular inference deployments.
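    As an illustration of that rule, the sketch below flags models whose only nonzero quota is for fine-tuning. The row layout, the deployment type labels (`Standard`, `GlobalStandard`, `FineTune`), and the model names are all hypothetical stand-ins for whatever the quota view actually lists.

```python
from collections import defaultdict

# Hypothetical labels for deployment types that serve normal inference.
INFERENCE_TYPES = {"Standard", "GlobalStandard"}
FINETUNE_TYPE = "FineTune"


def finetune_only_models(rows: list[dict]) -> list[str]:
    """Return models that have fine-tune quota but no inference quota."""
    limits_by_model: dict[str, dict[str, int]] = defaultdict(dict)
    for row in rows:
        limits_by_model[row["model"]][row["deployment_type"]] = row["limit_tpm"]

    flagged = []
    for model, limits in limits_by_model.items():
        has_inference = any(limits.get(t, 0) > 0 for t in INFERENCE_TYPES)
        has_finetune = limits.get(FINETUNE_TYPE, 0) > 0
        if has_finetune and not has_inference:
            flagged.append(model)
    return sorted(flagged)


# Made-up quota rows for illustration.
sample_rows = [
    {"model": "model-a", "deployment_type": "FineTune", "limit_tpm": 100_000},
    {"model": "model-a", "deployment_type": "Standard", "limit_tpm": 0},
    {"model": "model-b", "deployment_type": "FineTune", "limit_tpm": 100_000},
    {"model": "model-b", "deployment_type": "GlobalStandard", "limit_tpm": 50_000},
    {"model": "model-c", "deployment_type": "Standard", "limit_tpm": 0},
]

print(finetune_only_models(sample_rows))
```

    In this made-up data only `model-a` is flagged: it has fine-tune quota but would still need separate Standard or Global Standard quota before it could serve chat completions.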

    Summary for low-cost, low-risk start in Canada Central

    • 0/0 TPM for a model in Canada Central means no usable quota there; a quota increase request is needed before using that model in that region.
    • Requesting quota (for example 25K–50K TPM) does not by itself incur charges; billing is driven by actual token usage.
    • For a small app, requesting quota for a single chat model plus using existing transcription quota is a simple, low-risk pattern.
    • Standard / global standard deployments are token-billed; provisioned throughput is capacity-billed and can incur idle charges.
    • If only finetune quota is visible for a model in a region, that model cannot be used for normal chat unless separate standard/global standard quota is also available.


    AI-generated content may be incorrect. Read our transparency notes for more information.
