Azure OpenAI in Azure AI Foundry Models quotas and limits

2025-07-11

This article contains a quick reference and a detailed description of the quotas and limits for Azure OpenAI.

Scope of quota:

Quotas and limits are not enforced at the tenant level.
Instead, the highest level of quota restrictions is scoped at the Azure subscription level.

Regional quota allocation:

Tokens per minute (TPM) and requests per minute (RPM) limits are defined per region, per subscription, and per model/deployment type.
For example, if the gpt-4.1 global standard model is listed with a quota of 5 million TPM and 5,000 RPM, then each region where that model/deployment type is available has its own dedicated pool of quota of that amount for each of your Azure subscriptions. So within a single Azure subscription, it is possible to use a larger quantity of total TPM/RPM quota for a given model/deployment type, as long as you have resources/model deployments spread across multiple regions.
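The arithmetic behind that example can be sketched in a few lines. This is only an illustration: the region names are hypothetical, and it assumes the model/deployment type is available in every region listed and that each region carries the same per-region quota.

```python
def total_quota(per_region_tpm: int, per_region_rpm: int, regions: list[str]) -> dict:
    """Sum the per-region quota pools for one subscription and one model/deployment type."""
    # Each distinct region contributes one pool, however many deployments it hosts.
    pool_count = len(set(regions))
    return {"tpm": per_region_tpm * pool_count, "rpm": per_region_rpm * pool_count}


# Hypothetical regions; 5 million TPM and 5,000 RPM per region as in the example above.
combined = total_quota(5_000_000, 5_000, ["eastus", "westeurope", "swedencentral"])
print(combined)  # {'tpm': 15000000, 'rpm': 15000}
```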

Quotas and limits reference

The following sections provide you with a quick guide to the default quotas and limits that apply to Azure OpenAI:

Limit Name	Limit Value
Azure OpenAI resources per region per Azure subscription	30
Default DALL-E 2 quota limits	2 concurrent requests
Default DALL-E 3 quota limits	2 capacity units (6 requests per minute)
Default GPT-image-1 quota limits	2 capacity units (6 requests per minute)
Default Sora quota limits	60 requests per minute
Default speech to text audio API quota limits	3 requests per minute
Maximum prompt tokens per request	Varies per model. For more information, see Azure OpenAI models
Max Standard deployments per resource	32
Max fine-tuned model deployments	5
Total number of training jobs per resource	100
Max simultaneous running training jobs per resource	1
Max training jobs queued	20
Max Files per resource (fine-tuning)	50
Total size of all files per resource (fine-tuning)	1 GB
Max training job time (job will fail if exceeded)	720 hours
Max training job size (tokens in training file) x (# of epochs)	2 Billion
Max size of all files per upload (Azure OpenAI on your data)	16 MB
Max number of inputs in array with `/embeddings`	2048
Max number of `/chat/completions` messages	2048
Max number of `/chat/completions` functions	128
Max number of `/chat/completions` tools	128
Maximum number of Provisioned throughput units per deployment	100,000
Max files per Assistant/thread	10,000 when using the API or Azure AI Foundry portal.
Max file size for Assistants & fine-tuning	512 MB (200 MB via Azure AI Foundry portal)
Max file upload requests per resource	30 requests per second
Max size for all uploaded files for Assistants	200 GB
Assistants token limit	2,000,000 tokens
GPT-4o and GPT-4.1 max images per request (# of images in the messages array/conversation history)	50
GPT-4 `vision-preview` & GPT-4 `turbo-2024-04-09` default max tokens	16. Increase the `max_tokens` parameter value to avoid truncated responses. GPT-4o max tokens defaults to 4096.
Max number of custom headers in API requests¹	10
Message character limit	1048576
Message size for audio files	20 MB

¹ Our current APIs allow up to 10 custom headers, which are passed through the pipeline and returned. Some customers now exceed this header count, resulting in HTTP 431 errors. There's no solution for this error other than to reduce header volume. In future API versions, custom headers will no longer be passed through. We recommend that customers not depend on custom headers in future system architectures.
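As one way to stay under the per-request array limit for `/embeddings` listed in the table, a client can split its inputs into batches before sending them. The following is a minimal sketch; the helper name `chunk_inputs` is hypothetical, and the code only does the splitting, not the API call.

```python
from typing import Iterator, Sequence

MAX_EMBEDDING_INPUTS = 2048  # "Max number of inputs in array with /embeddings" from the table


def chunk_inputs(inputs: Sequence[str], size: int = MAX_EMBEDDING_INPUTS) -> Iterator[list[str]]:
    """Yield consecutive slices of `inputs`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(inputs), size):
        yield list(inputs[start:start + size])


# 5,000 illustrative inputs become three requests' worth of batches.
batches = list(chunk_inputs([f"doc {i}" for i in range(5000)]))
print([len(batch) for batch in batches])  # [2048, 2048, 904]
```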

Note

Quota limits are subject to change.

Batch limits

Limit Name	Limit Value
Max files per resource	500
Max input file size	200 MB
Max requests per file	100,000
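A batch input file can be checked against these limits before upload. The sketch below rests on two assumptions that the table doesn't state: that the file is JSONL with one request per non-empty line, and that "200 MB" means 200 × 1024 × 1024 bytes. The function name `check_batch_file` is hypothetical.

```python
import os

MAX_INPUT_FILE_BYTES = 200 * 1024 * 1024  # 200 MB from the table (binary megabytes assumed)
MAX_REQUESTS_PER_FILE = 100_000           # from the table


def check_batch_file(path: str,
                     max_bytes: int = MAX_INPUT_FILE_BYTES,
                     max_requests: int = MAX_REQUESTS_PER_FILE) -> list[str]:
    """Return a list of limit violations for a batch input file (empty if none)."""
    problems = []

    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, limit is {max_bytes}")

    # Assumption: one request per non-empty line.
    with open(path, "r", encoding="utf-8") as handle:
        request_count = sum(1 for line in handle if line.strip())
    if request_count > max_requests:
        problems.append(f"file has {request_count} requests, limit is {max_requests}")

    return problems
```

Calling `check_batch_file("requests.jsonl")` on a file within both limits returns an empty list.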

Batch quota

The table shows the batch quota limit. Quota values for global batch are represented in terms of enqueued tokens. When you submit a file for batch processing, the number of tokens present in the file is counted. Until the batch job reaches a terminal state, those tokens count against your total enqueued token limit.
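The enqueued-token accounting described above can be modeled as follows. This is a sketch under assumptions: the job records and their `status` and `tokens` fields are hypothetical, and the set of terminal state names is a guess rather than something this article specifies.

```python
# Assumed names for terminal states; confirm against the batch API you use.
TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}


def enqueued_tokens(jobs: list[dict]) -> int:
    """Total tokens held by jobs that have not yet reached a terminal state."""
    return sum(job["tokens"] for job in jobs if job["status"] not in TERMINAL_STATES)


def can_submit(jobs: list[dict], new_file_tokens: int, limit: int) -> bool:
    """True if a new file would fit within the enqueued token limit."""
    return enqueued_tokens(jobs) + new_file_tokens <= limit


jobs = [
    {"status": "in_progress", "tokens": 120_000_000},
    {"status": "completed", "tokens": 90_000_000},   # no longer counts
    {"status": "validating", "tokens": 50_000_000},
]
limit = 200_000_000  # illustrative enqueued token limit
print(enqueued_tokens(jobs))                 # 170000000
print(can_submit(jobs, 40_000_000, limit))   # False
print(can_submit(jobs, 30_000_000, limit))   # True
```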

Global batch

Model	Enterprise & MCA-E	Default	Monthly credit card based subscriptions	MSDN subscriptions	Azure for Students, Free Trials
`gpt-4.1`	5 B	200 M	50 M	90 K	N/A
`gpt-4.1-mini`	15 B	1 B	50 M	90 K	N/A
`gpt-4.1-nano`	15 B	1 B	50 M	90 K	N/A
`gpt-4o`	5 B	200 M	50 M	90 K	N/A
`gpt-4o-mini`	15 B	1 B	50 M	90 K	N/A
`gpt-4-turbo`	300 M	80 M	40 M	90 K	N/A
`gpt-4`	150 M	30 M	5 M	100 K	N/A
`gpt-35-turbo`	10 B	1 B	100 M	2 M	50 K
`o3-mini`	15 B	1 B	50 M	90 K	N/A
`o4-mini`	15 B	1 B	50 M	90 K	N/A

B = billion | M = million | K = thousand

Data zone batch

Model	Enterprise & MCA-E	Default	Monthly credit card based subscriptions	MSDN subscriptions	Azure for Students, Free Trials
`gpt-4.1`	500 M	30 M	30 M	90 K	N/A
`gpt-4.1-mini`	1.5 B	100 M	50 M	90 K	N/A
`gpt-4o`	500 M	30 M	30 M	90 K	N/A
`gpt-4o-mini`	1.5 B	100 M	50 M	90 K	N/A
`o3-mini`	1.5 B	100 M	50 M	90 K	N/A

GPT-4 rate limits

GPT-4.5 preview global standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4.5`	Enterprise & MCA-E	200 K	200
`gpt-4.5`	Default	150 K	150

GPT-4.1 series global standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4.1` (2025-04-14)	Enterprise & MCA-E	5 M	5 K
`gpt-4.1` (2025-04-14)	Default	1 M	1 K
`gpt-4.1-nano` (2025-04-14)	Enterprise & MCA-E	150 M	150 K
`gpt-4.1-nano` (2025-04-14)	Default	5 M	5 K
`gpt-4.1-mini` (2025-04-14)	Enterprise & MCA-E	150 M	150 K
`gpt-4.1-mini` (2025-04-14)	Default	5 M	5 K

GPT-4.1 series data zone standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4.1` (2025-04-14)	Enterprise & MCA-E	2 M	2 K
`gpt-4.1` (2025-04-14)	Default	300 K	300
`gpt-4.1-nano` (2025-04-14)	Enterprise & MCA-E	50 M	50 K
`gpt-4.1-nano` (2025-04-14)	Default	2 M	2 K
`gpt-4.1-mini` (2025-04-14)	Enterprise & MCA-E	50 M	50 K
`gpt-4.1-mini` (2025-04-14)	Default	2 M	2 K

GPT-4 Turbo

gpt-4 (turbo-2024-04-09) has rate limit tiers with higher limits for certain customer types.

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4` (turbo-2024-04-09)	Enterprise & MCA-E	2 M	12 K
`gpt-4` (turbo-2024-04-09)	Default	450 K	2.7 K

model-router rate limits

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`model-router` (2025-05-19)	Enterprise & MCA-E	10 M	10 K
`model-router` (2025-05-19)	Default	1 M	1 K

computer-use-preview global standard rate limits

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`computer-use-preview`	Enterprise & MCA-E	30 M	300 K
`computer-use-preview`	Default	450 K	4.5 K

o-series rate limits

Important

The ratio of Requests Per Minute (RPM) to Tokens Per Minute (TPM) for quota can vary by model. When you deploy a model programmatically or request a quota increase you don't have granular control over TPM and RPM as independent values. Quota is allocated in terms of units of capacity which have corresponding amounts of RPM & TPM:

Model	Capacity	Requests Per Minute (RPM)	Tokens Per Minute (TPM)
Older chat models	1 Unit	6 RPM	1,000 TPM
o1 & o1-preview	1 Unit	1 RPM	6,000 TPM
o3	1 Unit	1 RPM	1,000 TPM
o4-mini	1 Unit	1 RPM	1,000 TPM
o3-mini	1 Unit	1 RPM	10,000 TPM
o1-mini	1 Unit	1 RPM	10,000 TPM
o3-pro	1 Unit	1 RPM	10,000 TPM

This is particularly important for programmatic model deployment as changes in RPM/TPM ratio can result in accidental misallocation of quota.
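To make the unit arithmetic concrete, the ratios in the table above can be encoded and used to convert between capacity units and the resulting RPM/TPM. This is a sketch, not a deployment API: the dictionary keys (including the `older-chat` label) and function names are hypothetical, and the ratios are copied from the table as published.

```python
# (RPM per unit, TPM per unit), copied from the table above.
UNIT_RATIOS = {
    "older-chat": (6, 1_000),
    "o1": (1, 6_000),
    "o1-preview": (1, 6_000),
    "o3": (1, 1_000),
    "o4-mini": (1, 1_000),
    "o3-mini": (1, 10_000),
    "o1-mini": (1, 10_000),
    "o3-pro": (1, 10_000),
}


def units_to_limits(model: str, units: int) -> dict:
    """RPM and TPM that a given number of capacity units corresponds to."""
    rpm_per_unit, tpm_per_unit = UNIT_RATIOS[model]
    return {"rpm": units * rpm_per_unit, "tpm": units * tpm_per_unit}


def units_for_tpm(model: str, target_tpm: int) -> int:
    """Smallest number of units whose TPM is at least `target_tpm`."""
    _, tpm_per_unit = UNIT_RATIOS[model]
    return -(-target_tpm // tpm_per_unit)  # ceiling division


print(units_to_limits("o3-mini", 50))     # {'rpm': 50, 'tpm': 500000}
print(units_to_limits("older-chat", 50))  # {'rpm': 300, 'tpm': 50000}
print(units_for_tpm("o1", 30_000))        # 5
```

The same 50 units yield very different RPM/TPM for different models, which is the misallocation risk the note above describes.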

o-series global standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`codex-mini`	Enterprise & MCA-E	10 M	10 K
`o3-pro`	Enterprise & MCA-E	16 M	1.6 K
`o4-mini`	Enterprise & MCA-E	10 M	10 K
`o3`	Enterprise & MCA-E	10 M	10 K
`o3-mini`	Enterprise & MCA-E	50 M	5 K
`o1` & `o1-preview`	Enterprise & MCA-E	30 M	5 K
`o1-mini`	Enterprise & MCA-E	50 M	5 K
`codex-mini`	Default	1 M	1 K
`o3-pro`	Default	1.6 M	160
`o4-mini`	Default	1 M	1 K
`o3`	Default	1 M	1 K
`o3-mini`	Default	5 M	500
`o1` & `o1-preview`	Default	3 M	500
`o1-mini`	Default	5 M	500

o-series data zone standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`o3-mini`	Enterprise & MCA-E	20 M	2 K
`o3-mini`	Default	2 M	200
`o1`	Enterprise & MCA-E	6 M	1 K
`o1`	Default	600 K	100

o1-preview & o1-mini standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`o1-preview`	Enterprise & MCA-E	600 K	100
`o1-mini`	Enterprise & MCA-E	1 M	100
`o1-preview`	Default	300 K	50
`o1-mini`	Default	500 K	50

gpt-4o rate limits

gpt-4o and gpt-4o-mini have rate limit tiers with higher limits for certain customer types.

gpt-4o global standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4o`	Enterprise & MCA-E	30 M	180 K
`gpt-4o-mini`	Enterprise & MCA-E	50 M	300 K
`gpt-4o`	Default	450 K	2.7 K
`gpt-4o-mini`	Default	2 M	12 K

M = million | K = thousand

gpt-4o data zone standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4o`	Enterprise & MCA-E	10 M	60 K
`gpt-4o-mini`	Enterprise & MCA-E	20 M	120 K
`gpt-4o`	Default	300 K	1.8 K
`gpt-4o-mini`	Default	1 M	6 K

M = million | K = thousand

gpt-4o standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4o`	Enterprise & MCA-E	1 M	6 K
`gpt-4o-mini`	Enterprise & MCA-E	2 M	12 K
`gpt-4o`	Default	150 K	900
`gpt-4o-mini`	Default	450 K	2.7 K

M = million | K = thousand

gpt-4o audio

The rate limits for each gpt-4o audio model deployment are 100 K TPM and 1 K RPM. During the preview, Azure AI Foundry portal and APIs might inaccurately show different rate limits. Even if you try to set a different rate limit, the actual rate limit is 100 K TPM and 1 K RPM.

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-4o-audio-preview`	Default	450 K	1 K
`gpt-4o-realtime-preview`	Default	800 K	1 K
`gpt-4o-mini-audio-preview`	Default	2 M	1 K
`gpt-4o-mini-realtime-preview`	Default	800 K	1 K

M = million | K = thousand

GPT-image-1 rate limits

GPT-image-1 global standard

Model	Tier	Quota Limit in tokens per minute (TPM)	Requests per minute
`gpt-image-1`	Enterprise & MCA-E	N/A	20
`gpt-image-1`	Default	N/A	6

Usage tiers

Global standard deployments use Azure's global infrastructure, dynamically routing customer traffic to the data center with best availability for the customer’s inference requests. Similarly, Data zone standard deployments allow you to use Azure global infrastructure to dynamically route traffic to the data center within the Microsoft defined data zone with the best availability for each request. This enables more consistent latency for customers with low to medium levels of traffic. Customers with high sustained levels of usage might see greater variability in response latency.

The Usage Limit determines the level of usage above which customers might see larger variability in response latency. A customer’s usage is defined per model and is the total tokens consumed across all deployments in all subscriptions in all regions for a given tenant.

Note

Usage tiers only apply to standard, data zone standard, and global standard deployment types. Usage tiers don't apply to global batch and provisioned throughput deployments.

Global standard, data zone standard, & standard

Model	Usage Tiers per month
`gpt-4` + `gpt-4-32k` (all versions)	6 Billion tokens
`gpt-4o`	12 Billion tokens
`gpt-4o-mini`	85 Billion tokens
`o3-mini`	50 Billion tokens
`o1`	4 Billion tokens
`o4-mini`	50 Billion tokens
`o3`	5 Billion tokens
`gpt-4.1`	30 Billion tokens
`gpt-4.1-mini`	150 Billion tokens
`gpt-4.1-nano`	550 Billion tokens
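Because usage is defined per model and summed across all deployments, subscriptions, and regions in a tenant, a check against these tiers amounts to a grouped sum. The sketch below assumes you already have monthly token counts per deployment from your own telemetry; the record fields `model` and `tokens` are hypothetical, and only a subset of the table is encoded.

```python
BILLION = 10**9

# Monthly usage tiers, copied from the table above (subset).
USAGE_TIER_TOKENS = {
    "gpt-4o": 12 * BILLION,
    "gpt-4o-mini": 85 * BILLION,
    "o3-mini": 50 * BILLION,
    "o1": 4 * BILLION,
    "gpt-4.1": 30 * BILLION,
}


def monthly_usage(records: list[dict], model: str) -> int:
    """Tenant-wide tokens for one model, summed over every deployment record."""
    return sum(r["tokens"] for r in records if r["model"] == model)


def exceeds_usage_tier(records: list[dict], model: str) -> bool:
    """True if the tenant-wide total for `model` is above its usage tier."""
    return monthly_usage(records, model) > USAGE_TIER_TOKENS[model]


records = [
    {"model": "gpt-4o", "tokens": 7 * BILLION},   # e.g. one subscription/region
    {"model": "gpt-4o", "tokens": 6 * BILLION},   # e.g. another subscription/region
    {"model": "o1", "tokens": 1 * BILLION},
]
print(exceeds_usage_tier(records, "gpt-4o"))  # True
print(exceeds_usage_tier(records, "o1"))      # False
```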

Other offer types

If your Azure subscription is linked to certain offer types, your max quota values are lower than the values indicated in the above tables.

Tier	Quota Limit in tokens per minute (TPM)
`Azure for Students`	1 K (all models). Exception: o-series, GPT-4.1, and GPT 4.5 Preview: 0
`MSDN`	GPT-4o-mini: 200 K; GPT 3.5 Turbo Series: 200 K; GPT-4 series: 50 K; computer-use-preview: 8 K; gpt-4o-realtime-preview: 1 K; o-series: 0; GPT 4.5 Preview: 0; GPT-4.1: 50 K; GPT-4.1-nano: 200 K
`Standard`	GPT-4o-mini: 200 K; GPT 3.5 Turbo Series: 200 K; GPT-4 series: 50 K; computer-use-preview: 30 K; o-series: 0; GPT 4.5 Preview: 0; GPT-4.1: 50 K; GPT-4.1-nano: 200 K
`Azure_MS-AZR-0111P`, `Azure_MS-AZR-0035P`, `Azure_MS-AZR-0025P`, `Azure_MS-AZR-0052P`	GPT-4o-mini: 200 K; GPT 3.5 Turbo Series: 200 K; GPT-4 series: 50 K
`CSP Integration Sandbox`*	All models: 0
`Lightweight trial`, `Free Trials`, `Azure Pass`	All models: 0

* This only applies to a small number of legacy CSP sandbox subscriptions. Use the query below to determine which quotaId is associated with your subscription.

To determine the offer type that is associated with your subscription, you can check your quotaId. If your quotaId isn't listed in this table, your subscription qualifies for default quota.

REST

API reference

az login
access_token=$(az account get-access-token --query accessToken -o tsv)

curl -X GET "https://management.azure.com/subscriptions/{subscriptionId}?api-version=2020-01-01" \
  -H "Authorization: Bearer $access_token" \
  -H "Content-Type: application/json"

CLI

az rest --method GET --uri "https://management.azure.com/subscriptions/{subscriptionId}?api-version=2020-01-01"

Output

{
  "authorizationSource": "Legacy",
  "displayName": "Pay-As-You-Go",
  "id": "/subscriptions/aaaaaa-bbbbb-cccc-ddddd-eeeeee",
  "state": "Enabled",
  "subscriptionId": "aaaaaa-bbbbb-cccc-ddddd-eeeeee",
  "subscriptionPolicies": {
    "locationPlacementId": "Public_2014-09-01",
    "quotaId": "PayAsYouGo_2014-09-01",
    "spendingLimit": "Off"
  }
}

Quota allocation/Offer type	Subscription quota ID
Enterprise & MCA-E	`EnterpriseAgreement_2014-09-01`
Pay-as-you-go	`PayAsYouGo_2014-09-01`
MSDN	`MSDN_2014-09-01`
CSP Integration Sandbox	`CSPDEVTEST_2018-05-01`
Azure for Students	`AzureForStudents_2018-01-01`
Free Trial	`FreeTrial_2014-09-01`
Azure Pass	`AzurePass_2014-09-01`
Azure_MS-AZR-0111P	`AzureInOpen_2014-09-01`
Azure_MS-AZR-0150P	`LightweightTrial_2016-09-01`
Azure_MS-AZR-0035P Azure_MS-AZR-0025P Azure_MS-AZR-0052P	`MPN_2014-09-01`
Azure_MS-AZR-0023P Azure_MS-AZR-0060P Azure_MS-AZR-0148P Azure_MS-AZR-0148G	`MSDNDevTest_2014-09-01`
Default	Any quota ID not listed in this table
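The lookup from the query output to an offer type can be automated with a small mapping. The sketch below copies the quota IDs from the table above and reads `subscriptionPolicies.quotaId` from a response shaped like the sample output; the function name `offer_type` is hypothetical.

```python
# Quota IDs and offer types copied from the table above.
QUOTA_ID_TO_OFFER = {
    "EnterpriseAgreement_2014-09-01": "Enterprise & MCA-E",
    "PayAsYouGo_2014-09-01": "Pay-as-you-go",
    "MSDN_2014-09-01": "MSDN",
    "CSPDEVTEST_2018-05-01": "CSP Integration Sandbox",
    "AzureForStudents_2018-01-01": "Azure for Students",
    "FreeTrial_2014-09-01": "Free Trial",
    "AzurePass_2014-09-01": "Azure Pass",
    "AzureInOpen_2014-09-01": "Azure_MS-AZR-0111P",
    "LightweightTrial_2016-09-01": "Azure_MS-AZR-0150P",
    "MPN_2014-09-01": "Azure_MS-AZR-0035P / 0025P / 0052P",
    "MSDNDevTest_2014-09-01": "Azure_MS-AZR-0023P / 0060P / 0148P / 0148G",
}


def offer_type(subscription: dict) -> str:
    """Map a subscription GET response to its offer type, falling back to Default."""
    quota_id = subscription.get("subscriptionPolicies", {}).get("quotaId", "")
    return QUOTA_ID_TO_OFFER.get(quota_id, "Default")


sample = {"subscriptionPolicies": {"quotaId": "PayAsYouGo_2014-09-01"}}
print(offer_type(sample))  # Pay-as-you-go
```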

General best practices to remain within rate limits

To minimize issues related to rate limits, it's a good idea to use the following techniques:

Implement retry logic in your application.
Avoid sharp changes in the workload. Increase the workload gradually.
Test different load increase patterns.
Increase the quota assigned to your deployment. Move quota from another deployment, if necessary.
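The first recommendation, retry logic, is commonly implemented as exponential backoff with jitter. The following is a minimal sketch, not the behavior of any particular SDK: `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on an HTTP 429 response, and the delay parameters are illustrative defaults.

```python
import random
import time
from typing import Callable, Optional


class RateLimitError(Exception):
    """Hypothetical stand-in for a client's HTTP 429 exception."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the service, if any


def call_with_retry(func: Callable[[], object],
                    max_attempts: int = 5,
                    base_delay: float = 1.0,
                    max_delay: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep):
    """Call `func`, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after  # prefer the service's hint when present
            else:
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A small random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the helper can be exercised in tests without real waiting; in production code the default `time.sleep` applies.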

How to request quota increases

Quota increase requests can be submitted via the quota increase request form. Due to high demand, quota increase requests are being accepted and are filled in the order they're received. Priority is given to customers who generate traffic that consumes the existing quota allocation, and your request might be denied if this condition isn't met.

For other rate limits, submit a service request.

Regional quota capacity limits

You can view quota availability by region for your subscription in the Azure AI Foundry portal.

Alternatively, to view quota capacity by region for a specific model/version, you can query the capacity API for your subscription. Provide a subscriptionId, model_name, and model_version, and the API returns the available capacity for that model across all regions and deployment types for your subscription.

Note

Currently both the Azure AI Foundry portal and the capacity API return quota/capacity information for models that are retired and no longer available.

API Reference

import json

import requests
from azure.identity import DefaultAzureCredential

subscriptionId = "<your-subscription-id>"  # replace with your subscription ID
model_name = "gpt-4o"          # example value, replace with model name
model_version = "2024-08-06"   # example value, replace with model version

# Acquire an Azure Resource Manager token through the default credential chain.
token_credential = DefaultAzureCredential()
token = token_credential.get_token("https://management.azure.com/.default")
headers = {"Authorization": "Bearer " + token.token}

url = f"https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.CognitiveServices/modelCapacities"
params = {
    "api-version": "2024-06-01-preview",
    "modelFormat": "OpenAI",
    "modelName": model_name,
    "modelVersion": model_version
}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()  # fail on HTTP errors instead of printing an error payload
model_capacity = response.json()

print(json.dumps(model_capacity, indent=2))

Next steps

Explore how to manage quota for your Azure OpenAI deployments. Learn more about the underlying models that power Azure OpenAI.
