Before creating a provisioned deployment, estimate how many provisioned throughput units (PTUs) your workload needs. This article provides the per-model throughput parameters you need and shows how to calculate PTU requirements using sizing formulas or the Foundry capacity calculator.
If you're new to provisioned throughput, start with What is provisioned throughput for Foundry Models?. When you're ready to create your deployment, see Quickstart: Create a provisioned throughput deployment.
Prerequisites
- Familiarity with the concepts in What is provisioned throughput for Foundry Models?.
- An estimate of your workload characteristics: expected peak requests per minute (RPM), average prompt size in tokens, and average response size in tokens.
Estimate PTUs required
Two approaches are available for estimating the number of PTUs required for a workload:
- Use the sizing formulas for full control over the calculation.
- Use the Foundry capacity calculator for a guided estimate.
Both approaches use per-model values from the deployment parameters tables to generate estimates. For the most accurate results, benchmark a deployment against representative traffic rather than relying solely on estimated inputs.
Note
For older models (before GPT-4o), the request/call shape distribution affects capacity consumption: a small number of large calls can consume significantly more capacity than many small calls with the same average token count. For GPT-4o and later models, TPM per PTU is set for input and output tokens separately, so this tiering effect doesn't apply.
Estimate manually
To estimate the PTUs your workload requires, combine the model-specific values from the deployment parameters tables with the following information about your expected traffic:
| Input | Description |
|---|---|
| Model | The model you plan to deploy, for example, gpt-5.2. Determines which Input TPM per PTU and output-to-input ratio values to use from the deployment parameters tables. |
| Deployment type | The provisioned deployment type: Global Provisioned, Data Zone Provisioned, or Regional Provisioned. |
| Peak RPM | The expected peak number of calls per minute sent to the model. |
| Average prompt size | The average number of input tokens per request. |
| Average response size | The average number of output tokens per request. |
| Cache rate | The percentage of input tokens served from the prompt cache. Use 0 if caching isn't used. In the formulas, express it as a fraction (for example, 0.50 for 50%). Cached tokens are fully deducted from the utilization calculation and don't consume PTU capacity. |
Normalized TPM
The manual calculation of PTUs converts your expected token volume into a single number called the normalized TPM. The number of PTUs required is then determined by dividing the normalized TPM by the model's Input TPM per PTU value.
Formulas:
- Input TPM = peak RPM × average prompt size (tokens)
- Output TPM = peak RPM × average response size (tokens)
- Normalized TPM = (Input TPM × (1 − cache rate)) + (output-to-input ratio × Output TPM)
- PTUs required = Normalized TPM ÷ Input TPM per PTU, rounded up to the scale increment for the deployment type and never below its minimum deployment
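The formulas above can be sketched in Python as follows. This is a minimal illustration, not an official SDK or calculator API: the function names `normalized_tpm` and `raw_ptus` and the sample workload are hypothetical, and the cache rate is assumed to be a fraction between 0 and 1. The values 3,400 and 8 are the gpt-5.2 entries from the deployment parameters tables.

```python
def normalized_tpm(peak_rpm, prompt_tokens, response_tokens,
                   output_to_input_ratio, cache_rate=0.0):
    """Convert a call shape into normalized (input-equivalent) tokens per minute."""
    if not 0.0 <= cache_rate <= 1.0:
        raise ValueError("cache_rate must be a fraction between 0 and 1")
    input_tpm = peak_rpm * prompt_tokens
    output_tpm = peak_rpm * response_tokens
    # Cached input tokens are deducted; output tokens are weighted by the ratio.
    return input_tpm * (1 - cache_rate) + output_to_input_ratio * output_tpm


def raw_ptus(peak_rpm, prompt_tokens, response_tokens,
             input_tpm_per_ptu, output_to_input_ratio, cache_rate=0.0):
    """Unrounded PTU estimate: normalized TPM divided by Input TPM per PTU."""
    tpm = normalized_tpm(peak_rpm, prompt_tokens, response_tokens,
                         output_to_input_ratio, cache_rate)
    return tpm / input_tpm_per_ptu


# Hypothetical workload: 600 RPM, 1,000-token prompts, 100-token responses,
# sized with the gpt-5.2 values (Input TPM per PTU = 3,400, ratio = 8).
estimate = raw_ptus(600, 1000, 100, input_tpm_per_ptu=3400,
                    output_to_input_ratio=8)
print(f"Raw estimate: {estimate:.2f} PTUs")
```

The result is unrounded. Round it up to the scale increment for your deployment type before you request capacity.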
Worked example:
Suppose your application sends requests at a peak rate of 1,000 RPM, with an average prompt size of 200 tokens and an average response size of 20 tokens, using the gpt-5.2 model in a Data Zone Provisioned deployment. From the Latest Azure OpenAI models table, gpt-5.2 has an Input TPM per PTU of 3,400 and an output-to-input ratio of 8.
- Input TPM = 1,000 × 200 = 200,000
- Output TPM = 1,000 × 20 = 20,000
- Normalized TPM (no cache) = 200,000 + (8 × 20,000) = 360,000
- PTUs required = 360,000 ÷ 3,400 = 105.88 (110 PTUs rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.)
If 50% of input tokens are served from the prompt cache:
- Effective input TPM = 200,000 × (1 − 0.50) = 100,000
- Normalized TPM = 100,000 + (8 × 20,000) = 260,000
- PTUs required = 260,000 ÷ 3,400 = 76.47 (80 PTUs rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.)
In summary, the PTUs needed for this example call shape with and without caching are as follows:
| Peak calls per minute (RPM) | Prompt size (tokens) | Response size (tokens) | Cache rate | Effective input TPM (after caching) | Output TPM | Normalized TPM | Estimated PTUs | PTUs (rounded up)¹ |
|---|---|---|---|---|---|---|---|---|
| 1,000 | 200 | 20 | 0% | 200,000 | 20,000 | 360,000 | 105.88 | 110 |
| 1,000 | 200 | 20 | 50% | 100,000 | 20,000 | 260,000 | 76.47 | 80 |
¹ Rounded up to the nearest 5 PTUs, matching the Data Zone Provisioned scale increment for gpt-5.2.
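To see how sensitive the estimate is to the cache rate, the worked example can be swept across several rates. The sketch below is illustrative only: `size_for_cache_rate` is a hypothetical helper, and the 25% and 75% rates are extra data points that don't appear in the table.

```python
import math

INPUT_TPM_PER_PTU = 3400   # gpt-5.2, from the deployment parameters tables
OUTPUT_TO_INPUT_RATIO = 8  # gpt-5.2
SCALE_INCREMENT = 5        # Data Zone Provisioned scale increment for gpt-5.2

PEAK_RPM, PROMPT_TOKENS, RESPONSE_TOKENS = 1000, 200, 20


def size_for_cache_rate(cache_rate):
    """Return (normalized TPM, raw PTUs, PTUs rounded up to the increment)."""
    input_tpm = PEAK_RPM * PROMPT_TOKENS * (1 - cache_rate)
    output_tpm = PEAK_RPM * RESPONSE_TOKENS
    normalized = input_tpm + OUTPUT_TO_INPUT_RATIO * output_tpm
    raw = normalized / INPUT_TPM_PER_PTU
    rounded = math.ceil(raw / SCALE_INCREMENT) * SCALE_INCREMENT
    return normalized, raw, rounded


for rate in (0.0, 0.25, 0.50, 0.75):
    normalized, raw, rounded = size_for_cache_rate(rate)
    print(f"cache {rate:>4.0%}: normalized TPM {normalized:>9,.0f}, "
          f"raw {raw:6.2f}, rounded {rounded}")
```

The 0% and 50% rows reproduce the 110 and 80 PTU results in the table. Because output tokens are never cached, the 160,000 normalized TPM contributed by output sets a floor that caching can't reduce.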
Use the capacity calculator
Use the capacity calculator in the Foundry portal to size specific workload shapes. Find the calculator on the Quota page and enter the following parameters based on your workload:
| Input | Description |
|---|---|
| Model | The model you plan to use. |
| Version | The version of the model you plan to use. |
| Peak calls per min | The number of calls per minute expected to be sent to the model. |
| Tokens in prompt call | The number of tokens in the prompt for each call to the model. Calls with larger prompts consume more PTU capacity. The calculator assumes a single prompt value—for workloads with wide variance in prompt size, benchmark a deployment against your actual traffic for a more accurate estimate. |
| Tokens in model response | The number of tokens generated per call, also called generation size. Calls with larger generation sizes consume more PTU capacity. As with prompt tokens, the calculator assumes a single value. |
| Cache rate | Percentage of input tokens served from the prompt cache. |
After you fill in the required details, select Calculate. The output shows:
- The estimated PTU count required for the workload. This value is the raw estimate rounded up to the nearest PTU scale increment for the selected deployment type, or the deployment type's minimum PTU count, whichever is larger.
- The raw (unrounded) estimated PTU count.
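The rounding rule described above can be approximated as follows. This is a reading of the rule as stated in this article, not the calculator's actual implementation: `deployable_ptus` is a hypothetical name, and the 4.2 PTU workload is invented to show the minimum taking effect. The minimum and increment values are the gpt-5.2 entries from the deployment parameters tables.

```python
import math


def deployable_ptus(raw_estimate, minimum, increment):
    """Round a raw PTU estimate up to the scale increment, then apply the minimum."""
    rounded = math.ceil(raw_estimate / increment) * increment
    return max(rounded, minimum)


raw = 105.88  # raw estimate from the worked example (gpt-5.2, no caching)

print(deployable_ptus(raw, minimum=15, increment=5))    # Global or Data Zone Provisioned
print(deployable_ptus(raw, minimum=50, increment=50))   # Regional Provisioned
print(deployable_ptus(4.2, minimum=15, increment=5))    # small workload: minimum applies
```

The same raw estimate maps to different deployable sizes depending on the deployment type, because Regional Provisioned uses a larger increment for gpt-5.2.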
How input and output tokens affect throughput
The throughput (measured as tokens per minute, or TPM) that a deployment gets per PTU depends on the model and the mix of input and output tokens in a given minute. Generating output tokens requires more processing capacity than consuming input tokens.
For GPT-4.1 models and later, the system sets the output-to-input ratio to match the global standard price ratio between input and output tokens, with exceptions for some models. For example:
- For gpt-5, one output token counts as eight input tokens toward your utilization limit, matching the model's global standard price ratio.
- For gpt-4.1, one output token counts as four input tokens.
- Older models use different ratios.
For all deployments, cached tokens are fully deducted from the utilization calculation, meaning repeated prompt tokens that are served from the cache don't consume PTU capacity. See Prompt caching for more information.
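To illustrate how the token mix changes the throughput of a single PTU, the sketch below rearranges the normalized TPM formula. It's an estimate derived from the sizing formula in this article rather than a published service guarantee. The function name `total_tpm_per_ptu` and the `output_share` parameter are hypothetical, and caching is ignored.

```python
def total_tpm_per_ptu(input_tpm_per_ptu, output_to_input_ratio, output_share):
    """Estimated total (input + output) TPM one PTU sustains for a given token mix.

    output_share is the fraction of all tokens that are output tokens (0 to 1).
    Caching is ignored in this sketch.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    # Each token costs 1 unit if it is input and `ratio` units if it is output.
    weight_per_token = (1 - output_share) + output_to_input_ratio * output_share
    return input_tpm_per_ptu / weight_per_token


# gpt-5 values from the deployment parameters tables: 4,750 and a ratio of 8.
for share in (0.0, 0.1, 0.5, 1.0):
    tpm = total_tpm_per_ptu(4750, 8, share)
    print(f"{share:.0%} output tokens -> about {tpm:,.0f} total TPM per PTU")
```

With these values, an input-only workload gets the full 4,750 TPM per PTU, while an output-only workload gets one eighth of that.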
Models with a non-standard output-to-input ratio
Some models use an output-to-input ratio that differs from their global standard price ratio. For example, with Llama-3.3-70B-Instruct, one output token counts as four input tokens toward your utilization limit, which differs from that model's standard price ratio. See pricing for Llama models for the full input and output pricing breakdown.
Deployment parameters and throughput values by model
The tables in this section list the throughput and deployment parameters for each supported model. To understand what the parameters in each row mean, see the Appendix.
Latest Azure OpenAI models
Note
gpt-5.4, gpt-4.1, gpt-4.1-mini, and gpt-4.1-nano don't support long context (requests estimated at more than 128K prompt tokens).
| Topic | gpt-5.5 | gpt-5.4 | gpt-5.4-mini | gpt-5.3-codex | gpt-5.2 | gpt-5.2-codex | gpt-5.1 | gpt-5.1-codex | gpt-5 | gpt-5-mini | gpt-4.1 | gpt-4.1-mini | gpt-4.1-nano | o3 | o4-mini |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Global & data zone provisioned minimum deployment | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| Global & data zone provisioned scale increment | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| Regional provisioned minimum deployment | 50 | 50 | 25 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 50 | 25 | 25 | 50 | 25 |
| Regional provisioned scale increment | 50 | 50 | 25 | 50 | 50 | 50 | 50 | 50 | 50 | 25 | 50 | 25 | 25 | 50 | 25 |
| Input TPM per PTU | 1,200 | 2,400 | 7,900 | 3,400 | 3,400 | 3,400 | 4,750 | 4,750 | 4,750 | 23,750 | 3,000 | 14,900 | 59,400 | 3,000 | 5,400 |
| Output-to-input ratio | 6 | 6 | 6 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 4 | 4 | 4 | 4 | 4 |
| Latency target value¹ | 99% > 100 TPS | 99% > 50 TPS | 99% > 100 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 80 TPS | 99% > 80 TPS | 99% > 90 TPS | 99% > 100 TPS | 99% > 80 TPS | 99% > 90 TPS |
¹ Calculated as the p50 request latency on a per-5-minute basis. TPS = tokens per second.
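As an illustration of how the table values feed the sizing formulas, the sketch below sizes the same workload (1,000 RPM, 200 prompt tokens, 20 response tokens, no caching) against several models for a Global or Data Zone Provisioned deployment. The `MODEL_PARAMS` dictionary and `size_workload` function are hypothetical names. The numbers are copied from the table above, so recheck them against the current table before relying on the output.

```python
import math

# (Input TPM per PTU, output-to-input ratio) copied from the table above.
MODEL_PARAMS = {
    "gpt-5.2": (3400, 8),
    "gpt-5": (4750, 8),
    "gpt-5-mini": (23750, 8),
    "gpt-4.1": (3000, 4),
}

# Global and Data Zone Provisioned values shared by every model in the table.
MINIMUM, INCREMENT = 15, 5


def size_workload(model, peak_rpm, prompt_tokens, response_tokens):
    """Return (raw PTUs, deployable PTUs) for a Global or Data Zone deployment."""
    input_tpm_per_ptu, ratio = MODEL_PARAMS[model]
    normalized = peak_rpm * (prompt_tokens + ratio * response_tokens)
    raw = normalized / input_tpm_per_ptu
    rounded = max(MINIMUM, math.ceil(raw / INCREMENT) * INCREMENT)
    return raw, rounded


for model in MODEL_PARAMS:
    raw, rounded = size_workload(model, peak_rpm=1000, prompt_tokens=200,
                                 response_tokens=20)
    print(f"{model:<12} raw {raw:7.2f} -> {rounded} PTUs")
```

A comparison like this only covers capacity. It says nothing about whether a smaller model meets your quality or latency needs.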
Previous Azure OpenAI models
| Topic | gpt-4o | gpt-4o-mini | o3-mini | o1 |
|---|---|---|---|---|
| Global & data zone provisioned minimum deployment | 15 | 15 | 15 | 15 |
| Global & data zone provisioned scale increment | 5 | 5 | 5 | 5 |
| Regional provisioned minimum deployment | 50 | 25 | 25 | 25 |
| Regional provisioned scale increment | 50 | 25 | 25 | 50 |
| Input TPM per PTU | 2,500 | 37,000 | 2,500 | 230 |
| Output-to-input ratio | 4 | 4 | 4 | 4 |
| Latency target value¹ | 99% > 25 TPS | 99% > 33 TPS | 99% > 66 TPS | 99% > 25 TPS |
¹ Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.
Foundry Models sold by Azure
This section lists other Foundry Models sold by Azure, not including the Azure OpenAI in Foundry Models listed in the previous tables.
| Topic | Llama-3.3-70B-Instruct | DeepSeek-R1 | DeepSeek-V3-0324 |
|---|---|---|---|
| Global & data zone provisioned minimum deployment | 100 | 100 | 100 |
| Global & data zone provisioned scale increment | 100 | 100 | 100 |
| Regional provisioned minimum deployment | NA | NA | NA |
| Regional provisioned scale increment | NA | NA | NA |
| Input TPM per PTU | 8,450 | 4,000 | 4,000 |
| Output-to-input ratio | 4¹ | 4 | 4 |
| Latency target value² | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS |
¹ For Llama-3.3-70B-Instruct, one output token counts as four input tokens toward your utilization limit. This ratio differs from the global standard price ratio between input and output tokens. See Models with a non-standard output-to-input ratio and Llama model pricing.
² Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.
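Because these models have a 100-PTU minimum, it can help to invert the sizing formula and ask how much traffic the minimum deployment absorbs. The sketch below does that under the same assumptions as the sizing formulas. `max_supported_rpm` is a hypothetical helper, the call shape is the one from the worked example, and the result is an estimate rather than a guaranteed rate.

```python
def max_supported_rpm(ptus, input_tpm_per_ptu, output_to_input_ratio,
                      prompt_tokens, response_tokens, cache_rate=0.0):
    """Estimate the peak RPM a deployment of `ptus` PTUs can absorb."""
    capacity = ptus * input_tpm_per_ptu  # normalized TPM available
    # Normalized tokens consumed by one call of this shape.
    per_call = (prompt_tokens * (1 - cache_rate)
                + output_to_input_ratio * response_tokens)
    return capacity / per_call


# DeepSeek-R1 at its 100-PTU minimum: Input TPM per PTU = 4,000, ratio = 4.
rpm = max_supported_rpm(100, 4000, 4, prompt_tokens=200, response_tokens=20)
print(f"About {rpm:,.0f} calls per minute")
```

If your expected peak RPM is well below this figure, the minimum deployment, not the formula, determines how many PTUs you need.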
Fireworks on Microsoft Foundry models (Preview)
The following Fireworks on Microsoft Foundry models currently support provisioned throughput.
| Topic | DeepSeek v3.1 | DeepSeek v3.2 | DeepSeek V4 Flash | DeepSeek V4 Pro | Gemma 4 26B A4B IT | Gemma 4 31B IT | GLM-4.7 | GLM-5 | GLM-5.1 | gpt-oss-120b | Kimi K2 Instruct 0905 | Kimi K2 Thinking | Kimi K2.5 | Kimi K2.6 | Llama 3.1 8B Instruct | Ministral 3 3B Instruct 2512 | Qwen 3.5 9B | Qwen 3.5 35B A3B | Qwen 3.5 112B A10B | Qwen 3.5 397B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Global provisioned minimum deployment | 200 | 300 | 100 | 400 | 200 | 200 | 200 | 300 | 400 | 40 | 200 | 200 | 200 | 200 | 40 | 40 | 40 | 40 | 450 | 200 |
| Global provisioned scale increment | 100 | 150 | 50 | 200 | 100 | 100 | 100 | 150 | 200 | 20 | 100 | 100 | 100 | 100 | 20 | 20 | 20 | 20 | 225 | 100 |
| Input TPM per PTU | 2,100 | 3,000 | 2,800 | 200 | 5,400 | 2,200 | 6,000 | 600 | 900 | 13,500 | 2,500 | 1,400 | 1,060 | 4,000 | 57,800 | 25,400 | 10,700 | 17,800 | 37,253 | 4,032 |
| Latency Target Value1 | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS | 99% > 50 TPS |
1 Calculated as the average request latency on a per-minute basis across the month. TPS = tokens per second.
Appendix
Each row in the tables corresponds to one of the following parameters:
| Parameter | Description |
|---|---|
| Global & data zone provisioned minimum deployment | The smallest number of PTUs you can deploy for Global Provisioned or Data Zone Provisioned deployment types. For example, gpt-5.2 requires a minimum deployment of 15 PTUs. |
| Global & data zone provisioned scale increment | The PTU increment in which you can increase or decrease a Global Provisioned or Data Zone Provisioned deployment. Continuing with the gpt-5.2 example, an increment of 5 means deployments can be sized at 15, 20, 25, and so on. |
| Regional provisioned minimum deployment | The smallest number of PTUs you can deploy for a Regional Provisioned deployment. For example, gpt-5.2 requires a minimum regional provisioned deployment of 50 PTUs. |
| Regional provisioned scale increment | The PTU increment for Regional Provisioned deployments. Continuing with the gpt-5.2 example, an increment of 50 means deployments can be sized at 50, 100, 150, and so on. |
| Input TPM per PTU | The maximum input tokens per minute (TPM) that one PTU supports. Use this value when estimating PTUs. |
| Output-to-input ratio | The weight applied to output tokens when estimating PTU requirements. This value reflects the model's global standard price ratio between output and input tokens, with exceptions for some models. For example, a ratio of 8 means one output token counts as eight input tokens toward the model's TPM limit. See Azure OpenAI pricing, Llama model pricing, and DeepSeek model pricing for current pricing. |
| Latency target value | The expected request latency at the stated PTU utilization level. Expressed as a percentile threshold—for example, "99% > 50 TPS" means 99% of requests are processed at more than 50 tokens per second. |
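The minimum deployment and scale increment parameters together define which deployment sizes are possible. As a small sketch, assuming valid sizes are the minimum plus a whole number of increments (which matches the gpt-5.2 examples in the table above), the sizes can be enumerated as follows. `deployable_sizes` is a hypothetical helper and ignores any quota or availability limits.

```python
from itertools import count, islice


def deployable_sizes(minimum, increment):
    """Return an iterator over PTU counts: the minimum, then one increment at a time."""
    return count(start=minimum, step=increment)


# gpt-5.2 values from the deployment parameters tables.
print(list(islice(deployable_sizes(15, 5), 4)))    # Global and Data Zone Provisioned
print(list(islice(deployable_sizes(50, 50), 3)))   # Regional Provisioned
```

The first call lists 15, 20, 25, and 30 PTUs. The second lists 50, 100, and 150 PTUs.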