Microsoft Foundry Models is the hub for discovering and deploying a wide range of AI models for generative AI applications. To make a model available for inference requests, you deploy it. Foundry offers two deployment options depending on the model type and your infrastructure needs.
Tip
You don't always need to create a deployment. With instant models (preview), you call supported models by name and start running inference immediately — no deployment required.
Deployment options
Foundry provides two deployment options:
- Standard deployment in Foundry resources — For Foundry Models, including Foundry Models sold by Azure (also known as Azure Direct Models, or ADM) and select Models from partners and community. This option is the preferred and most capable deployment path.
- Managed compute deployment (preview) — Available for all Open Source Software (OSS) models, including models from partners and community, and for custom models.
The Foundry portal automatically selects the appropriate deployment option based on the model you choose.
| | Standard deployment in Foundry resources | Managed compute |
|---|---|---|
| Models | ADM models (Azure OpenAI + partner models billed through Azure) and select Models from partners and community | Other models in the model catalog from partners and custom models. For example, models from Hugging Face, NVIDIA NIMs, industry models, and Databricks. |
| Billing | Token usage or provisioned throughput units (PTU) | Hourly per accelerator SKU |
| Data processing | Regional, data zone, or global | Global |
| Content filtering | Built-in and customizable | Via Azure AI Content Safety APIs |
Standard deployment in Foundry resources
Standard deployment in Foundry resources is the preferred deployment option in Foundry. It supports the widest range of capabilities and deployment types.
Which models use standard deployment?
All Foundry Models use standard deployment. This includes Foundry Models sold by Azure and select Models from partners and community. Foundry Models sold by Azure include all Azure OpenAI models and selected models from top providers that are billed through your Azure subscription, covered by Azure service-level agreements, and supported by Microsoft. Select Models from partners and community that use standard deployment include Anthropic models and specific models from partners like Mistral, Cohere, and Meta.
Capabilities
Standard deployment supports:
- Multiple deployment types — Global Standard, Data Zone Standard, Regional Standard, Provisioned, Batch, and more. Each type controls where data is processed and how you pay. For details, see Deployment types for Microsoft Foundry Models.
- Data processing flexibility — Choose regional, data zone (US or EU), or global processing based on your compliance requirements.
- Content filtering — Built-in Azure AI Content Safety filters with customizable configurations.
- Authentication — Keyless authentication with Microsoft Entra ID (recommended), or key-based authentication.
- Private networking — Virtual network integration for secure access.
- Provisioned throughput — Reserve capacity with PTUs for predictable, low-latency performance. For details, see Provisioned throughput.
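As a rough illustration of the two authentication modes listed above, the following sketch builds the HTTP headers for either mode. It assumes the deployment accepts the header conventions of the Azure OpenAI REST API (an api-key header for key-based access, a bearer token for Microsoft Entra ID). The function name and the placeholder values are hypothetical, and in practice the token would come from an identity library rather than a literal string.

```python
from typing import Dict, Optional


def build_auth_headers(
    *, api_key: Optional[str] = None, entra_token: Optional[str] = None
) -> Dict[str, str]:
    """Return request headers for exactly one authentication mode."""
    # Require exactly one credential so the caller's intent is unambiguous.
    if (api_key is None) == (entra_token is None):
        raise ValueError("Provide exactly one of api_key or entra_token")
    if entra_token is not None:
        # Keyless path: a Microsoft Entra ID access token sent as a bearer token.
        return {"Authorization": f"Bearer {entra_token}"}
    # Key-based path: header name assumed from the Azure OpenAI REST convention.
    return {"api-key": api_key}


# Placeholder values, for illustration only.
print(build_auth_headers(entra_token="<access-token>"))
print(build_auth_headers(api_key="<resource-key>"))
```

Keeping both modes behind one helper makes it easier to move from keys to Microsoft Entra ID later without touching call sites.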
Resource requirements
Standard deployment is available in:
- Foundry resources — The primary resource type for new Foundry projects. No AI Hub required.
- Azure OpenAI resources — If you use Azure OpenAI resources, the model catalog shows only Azure OpenAI models for deployment. Upgrade to a Foundry resource for access to the full set of Foundry Models.
To get started with deployment, see Deploy Microsoft Foundry Models in the Foundry portal or Deploy models using Azure CLI and Bicep.
Managed compute deployment (preview)
Note
Managed compute in Foundry is currently in public preview and registration is required to use it. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Managed compute in Foundry (preview) is a managed GPU platform-as-a-service (PaaS) that hosts open-source and custom-weight models on dedicated GPU capacity. You access managed compute deployments through the same Foundry project endpoint as other deployment types, with no virtual machines, clusters, or serving runtimes to own. Foundry sizes the deployment, provisions the accelerators, and keeps the runtime patched.
Important
Managed compute supports open-source, partner, industry, and custom models. Managed compute deployments are served on the unified Foundry project endpoint, using the same authentication, networking, and SDK surface.
Which models use managed compute?
Examples of model collections that require managed compute include:
- Hugging Face
- Some Meta models
- Some Mistral models
- NVIDIA inference microservices (NIMs)
- Industry models (Saifr, Rockwell, Bayer, Cerence, Sight Machine, Page AI, SDAIA)
- Databricks
- Custom models
Microsoft Foundry's catalog includes 10,000+ open-source and partner models, with approximately 50 new models published each month.
Capabilities
Managed compute (preview) supports:
- Unified Foundry endpoint and authentication — Use the same project endpoint, API keys, Microsoft Entra ID, and private networking as pay-per-token and provisioned throughput deployments. Inference routes use <endpoint>/managed-deployments/<deployment-name>/. Chat-completions-compatible runtimes also work on the standard /openai/v1/ route with the OpenAI SDK.
- Model-instance sizing — Deployments are sized in model-centric terms. You don't need to pick virtual machine SKUs, because Foundry chooses GPUs per instance based on model size, architecture, context length, and whether the workload is optimized for latency or throughput.
- Optimized inference runtimes — Microsoft-curated vLLM, SGLang, and NVIDIA NIM containers with continuous batching, speculative decoding, tensor parallelism, and LoRA hot-swap.
- Accelerator families — A100 (80 GB), H100 (80 GB), H200 (141 GB), and MI300X.
- Auto-scaling and scale-to-zero — Auto-scale from live traffic or scale manually. Configure an idle timeout so the deployment scales to zero when no traffic arrives; billing stops immediately once it does.
- Microsoft-managed runtimes — Microsoft owns serving runtimes, base container images, and security patches. Updates are applied to live deployments automatically.
- Observability metrics — Each deployment emits API call count by status code and response-time percentiles. Chat-completion models also emit input and output token counts, time-to-first-token (TTFT) percentiles, and total response-time percentiles, grouped by time.
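The inference route pattern mentioned in the list above, <endpoint>/managed-deployments/<deployment-name>/, can be assembled with a small helper. This is a minimal sketch: the function name, the example endpoint, and the deployment name are hypothetical, and the real endpoint value comes from your Foundry project.

```python
from urllib.parse import quote


def managed_deployment_url(endpoint: str, deployment_name: str) -> str:
    """Build <endpoint>/managed-deployments/<deployment-name>/ as a string."""
    # Drop any trailing slash so the joined URL has no doubled separator.
    base = endpoint.rstrip("/")
    # Percent-encode the name so it stays a single path segment.
    segment = quote(deployment_name, safe="")
    return f"{base}/managed-deployments/{segment}/"


# Hypothetical endpoint and deployment name, for illustration only.
print(managed_deployment_url("https://example.invalid/my-project/", "my-deployment"))
```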
Billing and quota
Managed compute billing is hourly per accelerator SKU, with throughput per GPU as the underlying billing unit. Auto-scale and scale-to-zero align cost with actual traffic so that billing stops immediately when instances scale down.
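To make the effect of scale-to-zero concrete, here is a back-of-the-envelope estimate. It assumes a simplified model in which cost equals the hourly rate per GPU times GPUs per instance times instances times active hours. The rate, GPU count, and hour figures are invented for illustration and are not actual prices.

```python
def estimate_cost(
    hourly_rate_per_gpu: float,
    gpus_per_instance: int,
    instances: int,
    active_hours: float,
) -> float:
    """Simplified cost model: only hours with running instances are billed."""
    return hourly_rate_per_gpu * gpus_per_instance * instances * active_hours


# Hypothetical numbers: 10.0 currency units per GPU-hour, 2 GPUs, 1 instance.
always_on = estimate_cost(10.0, 2, 1, active_hours=730)    # running all month
scale_to_zero = estimate_cost(10.0, 2, 1, active_hours=200)  # idle time not billed

print(always_on)      # 14600.0
print(scale_to_zero)  # 4000.0
```

Under these assumed numbers, a deployment that is idle for most of the month costs well under a third of an always-on deployment.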
Quota is granted per accelerator SKU per region through the Foundry quota process and is separate from Azure VM quota. Azure virtual machines are an infrastructure-as-a-service (IaaS) offering with regional SKUs; managed compute is a PaaS offering that leads with Global and Data Zone processing. Existing Azure VM quota can't be applied to a managed compute deployment.
Managed compute is currently available for global deployment. For rate estimates, see the Azure pricing calculator.
Get started
Deployment option comparison
Use Standard deployment in Foundry resources whenever possible. The following table compares capabilities across the two deployment options:
| Capability | Standard deployment in Foundry resources | Managed compute |
|---|---|---|
| Which models can be deployed? | All Foundry Models, including Foundry Models sold by Azure and select Models from partners and community | Open-source and partner models from the model catalog, NVIDIA NIM, and industry models |
| Deployment resource | Foundry resource | Foundry project |
| Requires AI Hub | No | No |
| Data processing options | Regional, data zone, global | Global |
| Private networking | Yes | Yes |
| Content filtering | Built-in and customizable | Not available in public preview |
| Keyless authentication | Yes (Microsoft Entra ID and key-based) | Yes (Microsoft Entra ID and key-based) |
| Billing | Token usage or provisioned throughput units | Hourly per accelerator SKU |
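The comparison above can also be read as a simple lookup. The sketch below encodes three rows of the table and filters the options by data processing scope and by whether built-in content filtering is required. The dictionary keys and function name are illustrative, and the sketch ignores model availability, which in practice decides which option the portal selects.

```python
# Rows copied from the comparison table; key names are illustrative.
OPTIONS = {
    "Standard deployment in Foundry resources": {
        "data_processing": {"regional", "data zone", "global"},
        "builtin_content_filtering": True,
        "billing": "token usage or provisioned throughput units",
    },
    "Managed compute": {
        "data_processing": {"global"},
        "builtin_content_filtering": False,  # not available in public preview
        "billing": "hourly per accelerator SKU",
    },
}


def matching_options(data_processing, needs_builtin_filtering=False):
    """Return the deployment options whose table rows satisfy the requirements."""
    matches = []
    for name, caps in OPTIONS.items():
        if data_processing not in caps["data_processing"]:
            continue
        if needs_builtin_filtering and not caps["builtin_content_filtering"]:
            continue
        matches.append(name)
    return matches


print(matching_options("global"))
print(matching_options("data zone"))
print(matching_options("global", needs_builtin_filtering=True))
```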
Tip
For detailed pricing information, see Plan and manage costs for Microsoft Foundry.