Hello @Centaur MD
Thank you for reaching out regarding Azure OpenAI quota behavior in Canada Central/East. I understand you would like to use Standard or Global Standard deployments with pay-per-token billing while avoiding unintended idle infrastructure charges. I’m happy to clarify how quota and billing work in this scenario.
When you see 0/0 TPM for a given model/region/deployment type in Azure AI Foundry, it means your subscription currently has zero quota allocated for that model in that region. This does not mean you are being billed, nor does it necessarily mean the model is unavailable in Canada Central altogether. It simply means you cannot deploy that model until quota is requested and approved for your subscription.
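As a rough illustration of the 0/0 case, the sketch below scans a quota listing and picks out the rows whose limit is zero. The dictionary shape (`name.value`, `currentValue`, `limit`) is an assumption modelled on a typical Azure usages payload, the helper name `zero_quota_rows` is hypothetical, and the sample rows are made up, so please check the field names against the response you actually receive.

```python
def zero_quota_rows(usages):
    """Return the names of quota rows whose limit is 0 (shown as 0/0 in the portal)."""
    return [
        row["name"]["value"]
        for row in usages.get("value", [])
        if row.get("limit", 0) == 0
    ]


# Made-up sample data; the field names are an assumption, not a confirmed schema.
sample = {
    "value": [
        {"name": {"value": "OpenAI.Standard.gpt-4o-mini"}, "currentValue": 0, "limit": 0},
        {"name": {"value": "OpenAI.GlobalStandard.gpt-4o-mini"}, "currentValue": 10, "limit": 50},
    ]
}

# Rows listed here need a quota request before the model can be deployed.
print(zero_quota_rows(sample))
```

A row appearing in this list only means no quota is allocated; it says nothing about billing.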
In some situations, quota requests may be denied temporarily if the regional capacity pool is currently full. In that case, you may either need to wait for additional regional capacity or consider using another supported region.
Regarding billing, requesting a quota increase (for example, 25K–50K TPM) does not incur charges upfront. Standard and Global Standard deployments are strictly pay-per-token, meaning billing applies only when inference requests are processed. Simply having quota assigned to your subscription does not generate costs.
For a lean proof of concept or small-scale application, we recommend starting with a minimal deployment footprint, for example:
- One chat/completions model such as gpt-4o-mini or gpt-4.1-mini
- One transcription model such as gpt-4o-transcribe
- Optionally, one embedding model such as text-embedding-3-small if embeddings are required
This allows you to measure actual usage patterns before requesting additional quota or deploying more models.
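Once you have some measured traffic, a back-of-the-envelope calculation can help decide how much TPM to request. The sketch below is only an approximation under stated assumptions: the function name `estimate_tpm`, the 20% headroom default, and the example traffic figures are all hypothetical, and the service's own rate-limit accounting may count tokens differently.

```python
import math


def estimate_tpm(requests_per_minute, avg_prompt_tokens, avg_completion_tokens,
                 headroom_percent=20):
    """Estimate tokens per minute (TPM) needed, with headroom for bursts."""
    raw = requests_per_minute * (avg_prompt_tokens + avg_completion_tokens)
    return math.ceil(raw * (100 + headroom_percent) / 100)


# Hypothetical workload: 10 requests/minute, 800 prompt + 400 completion tokens each.
needed = estimate_tpm(10, 800, 400)
print(needed)  # 14400
```

With these illustrative numbers, the estimate sits below a 25K TPM request, so the lower end of the range would already leave some margin.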
It is also important to understand the difference between deployment types:
Standard / Global Standard
- Pay-as-you-go pricing
- Billed only for input/output token usage
- No idle hourly infrastructure charges
- Recommended for development, testing, and low-to-medium traffic workloads

Provisioned Throughput Units (PTU)
- Reserved dedicated capacity
- Can incur hourly charges even when idle
- Typically intended for enterprise workloads requiring predictable throughput and latency guarantees
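To make the idle-cost difference concrete, here is a minimal sketch comparing the two billing shapes. All prices and rates below are placeholders chosen for easy arithmetic, not actual Azure pricing, and the function names are hypothetical; substitute the figures from the pricing page for your model and region.

```python
def pay_per_token_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of a Standard-style deployment: driven only by tokens processed."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def provisioned_cost(hours, hourly_rate):
    """Cost of reserved capacity: driven by hours provisioned, used or not."""
    return hours * hourly_rate


# Placeholder prices (NOT real Azure rates): 0.5 per 1K input, 1.5 per 1K output tokens.
idle_day_standard = pay_per_token_cost(0, 0, 0.5, 1.5)
# Placeholder hourly rate (NOT a real Azure rate) for a reserved deployment left idle for a day.
idle_day_provisioned = provisioned_cost(24, 2.0)

print(idle_day_standard)     # 0.0
print(idle_day_provisioned)  # 48.0
```

With zero traffic, the pay-per-token figure stays at zero while the reserved figure accrues for every hour it exists, which is the behavior you said you want to avoid.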
Based on your requirement to avoid unexpected idle charges, Standard or Global Standard deployments would be the safest and most cost-effective option.
Additionally, if a model row shows only fine-tuning-related quota (for example, availableFineTuneCapacity) but no deployable quota, this indicates that the region currently supports fine-tuning operations for that model but does not currently provide standard inference/chat deployment capacity for your subscription. Separate deployable quota would still be required for normal chat completions or embedding workloads.
As a best practice, we also recommend configuring Azure Cost Management budgets and alerts to help monitor consumption and avoid unexpected usage.
Please refer to the following resources:
- Manage quota + request increases: https://learn.microsoft.com/azure/ai-services/openai/how-to/quota
- Troubleshoot regional quota capacity: https://learn.microsoft.com/azure/ai-services/openai/concepts/provisioned-throughput#quota
- Model availability & limited access: https://learn.microsoft.com/azure/ai-services/openai/concepts/models
- Quotas & limits overview: https://learn.microsoft.com/azure/foundry/openai/quotas-limits
I hope this helps. Please let me know if you have any further questions.
If this answers your query, please click "Accept Answer" and select "Yes" for "Was this answer helpful".
Thank you!