Deployment overview for Microsoft Foundry Models

Microsoft Foundry Models is the hub for discovering and deploying a wide range of AI models for generative AI applications. To make a model available for inference requests, you deploy it. Foundry offers two deployment options depending on the model type and your infrastructure needs.

Tip

You don't always need to create a deployment. With instant models (preview), you call supported models by name and start running inference immediately — no deployment required.

Deployment options

Foundry provides two deployment options:

Standard deployment in Foundry resources — For Foundry Models, including Foundry Models sold by Azure (also known as Azure Direct Models, or ADM) and select Models from partners and community. This option is the preferred and most capable deployment path.
Managed compute deployment (preview) — Available for all Open Source Software (OSS) models, including models from partners and community, and for custom models.

The Foundry portal automatically selects the appropriate deployment option based on the model you choose.

Attribute	Standard deployment in Foundry resources	Managed compute
Models	Azure Direct Models (Azure OpenAI and partner models billed through Azure) and select Models from partners and community	Other models in the model catalog from partners, plus custom models. Examples include models from Hugging Face, NVIDIA NIMs, industry models, and Databricks.
Billing	Token usage or provisioned throughput units (PTU)	Hourly per accelerator SKU
Data processing	Regional, data zone, or global	Global
Content filtering	Built-in and customizable	Via Azure AI Content Safety APIs

Standard deployment in Foundry resources

Standard deployment in Foundry resources is the preferred deployment option in Foundry. It supports the widest range of capabilities and deployment types.

Which models use standard deployment?

All Foundry Models, including Foundry Models sold by Azure and select Models from partners and community, use standard deployment. Foundry Models sold by Azure include all Azure OpenAI models and selected models from top providers that are billed through your Azure subscription, covered by Azure service-level agreements, and supported by Microsoft. Select Models from partners and community that use standard deployment include Anthropic models and specific models from partners like Mistral, Cohere, and Meta.

Capabilities

Standard deployment supports:

Multiple deployment types — Global Standard, Data Zone Standard, Regional Standard, Provisioned, Batch, and more. Each type controls where data is processed and how you pay. For details, see Deployment types for Microsoft Foundry Models.
Data processing flexibility — Choose regional, data zone (US or EU), or global processing based on your compliance requirements.
Content filtering — Built-in Azure AI Content Safety filters with customizable configurations.
Keyless authentication — Microsoft Entra ID (recommended). Key-based authentication is also supported.
Private networking — Virtual network integration for secure access.
Provisioned throughput — Reserve capacity with PTUs for predictable, low-latency performance. For details, see Provisioned throughput.
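
As a rough illustration of how a data processing requirement narrows the choice of deployment type, the following Python sketch maps each processing scope to the similarly named standard deployment type. The mapping and the helper name `pick_deployment_type` are assumptions inferred from the type names in this article, not a Foundry API. Provisioned and Batch are left out because this article doesn't say where they process data.

```python
# Hypothetical helper: maps a data processing scope to a deployment type name.
# The mapping is inferred from the type names listed above, not from an API.
PROCESSING_TO_TYPE = {
    "global": "Global Standard",
    "data zone": "Data Zone Standard",
    "regional": "Regional Standard",
}


def pick_deployment_type(processing: str) -> str:
    """Return the standard deployment type name for a processing scope."""
    key = processing.strip().lower()
    if key not in PROCESSING_TO_TYPE:
        raise ValueError(f"Unknown processing scope: {processing!r}")
    return PROCESSING_TO_TYPE[key]


print(pick_deployment_type("Data Zone"))  # prints "Data Zone Standard"
```

For authoritative behavior of each type, see Deployment types for Microsoft Foundry Models.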

Resource requirements

Standard deployment is available in:

Foundry resources — The primary resource type for new Foundry projects. No AI Hub required.
Azure OpenAI resources — If you use Azure OpenAI resources, the model catalog shows only Azure OpenAI models for deployment. Upgrade to a Foundry resource for access to the full set of Foundry Models.

To get started with deployment, see Deploy Microsoft Foundry Models in the Foundry portal or Deploy models using Azure CLI and Bicep.

Managed compute deployment (preview)

Note

Managed compute in Foundry is currently in public preview and registration is required to use it. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Managed compute in Foundry (preview) is a managed GPU platform-as-a-service (PaaS) that hosts open-source and custom-weight models on dedicated GPU capacity. You access managed compute deployments through the same Foundry project endpoint as other deployment types, with no virtual machines, clusters, or serving runtimes to own. Foundry sizes the deployment, provisions the accelerators, and keeps the runtime patched.

Important

Managed compute supports open-source, partner, industry, and custom models. Managed compute deployments are served on the unified Foundry project endpoint, using the same authentication, networking, and SDK surface.

Which models use managed compute?

Examples of model collections that require managed compute include:

Hugging Face
Some Meta models
Some Mistral models
NVIDIA inference microservices (NIMs)
Industry models (Saifr, Rockwell, Bayer, Cerence, Sight Machine, Page AI, SDAIA)
Databricks
Custom models

Microsoft Foundry's catalog includes 10,000+ open-source and partner models, with approximately 50 new models published each month.

Capabilities

Managed compute (preview) supports:

Unified Foundry endpoint and authentication — Use the same project endpoint, API keys, Microsoft Entra ID, and private networking as pay-per-token and provisioned throughput deployments. Inference routes use <endpoint>/managed-deployments/<deployment-name>/. Chat-completions-compatible runtimes also work on the standard /openai/v1/ route with the OpenAI SDK.
Model-instance sizing — Deployments are sized in model-centric terms. You don't need to pick virtual machine SKUs, because Foundry chooses GPUs per instance based on model size, architecture, context length, and whether the workload is optimized for latency or throughput.
Optimized inference runtimes — Microsoft-curated vLLM, SGLang, and NVIDIA NIM containers with continuous batching, speculative decoding, tensor parallelism, and LoRA hot-swap.
Accelerator families — A100 (80 GB), H100 (80 GB), H200 (141 GB), and MI300X.
Auto-scaling and scale-to-zero — Auto-scale from live traffic or scale manually. Configure an idle timeout so the deployment scales to zero when no traffic arrives, which stops billing immediately.
Microsoft-managed runtimes — Microsoft owns serving runtimes, base container images, and security patches. Updates are applied to live deployments automatically.
Observability metrics — Each deployment emits API call count by status code and response-time percentiles. Chat-completion models also emit input and output token counts, time-to-first-token (TTFT) percentiles, and total response-time percentiles, grouped by time.
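
The two inference routes described in the first item of this list can be sketched as simple URL builders. This is a minimal Python sketch: the function names, the endpoint value, and the deployment name are placeholders chosen for illustration, and only the route shapes come from this article.

```python
def managed_route(endpoint: str, deployment_name: str) -> str:
    """Build <endpoint>/managed-deployments/<deployment-name>/."""
    return f"{endpoint.rstrip('/')}/managed-deployments/{deployment_name}/"


def openai_compatible_route(endpoint: str) -> str:
    """Build <endpoint>/openai/v1/ for chat-completions-compatible runtimes."""
    return f"{endpoint.rstrip('/')}/openai/v1/"


# Placeholder values (assumptions, not real resources).
endpoint = "https://example-project.example.com"

print(managed_route(endpoint, "my-deployment"))
# prints "https://example-project.example.com/managed-deployments/my-deployment/"
print(openai_compatible_route(endpoint))
# prints "https://example-project.example.com/openai/v1/"
```

With the OpenAI Python SDK, the second URL would typically be passed as `base_url` when you construct the client, along with your credential.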

Billing and quota

Managed compute billing is hourly per accelerator SKU, with throughput per GPU as the underlying billing unit. Auto-scale and scale-to-zero align cost with actual traffic, so billing stops as soon as instances scale down.
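
To see why scale-to-zero matters for cost, the following Python sketch compares an always-on deployment with one that is active only part of each day. It assumes a simple model in which cost is active hours multiplied by the number of GPUs and an hourly rate per GPU. The rate and the function name `estimate_cost` are placeholders, not published prices or a Foundry API.

```python
def estimate_cost(active_hours: float, instances: int,
                  gpus_per_instance: int, hourly_rate_per_gpu: float) -> float:
    """Assumed model: hours x instances x GPUs per instance x rate per GPU-hour."""
    if min(active_hours, instances, gpus_per_instance, hourly_rate_per_gpu) < 0:
        raise ValueError("All inputs must be non-negative")
    return active_hours * instances * gpus_per_instance * hourly_rate_per_gpu


# Placeholder rate per GPU-hour; not a published price.
RATE = 10.0

always_on = estimate_cost(24 * 30, 1, 2, RATE)      # 720 active hours
scale_to_zero = estimate_cost(8 * 30, 1, 2, RATE)   # 240 active hours

print(always_on)      # prints 14400.0
print(scale_to_zero)  # prints 4800.0
```

Under these assumptions, a deployment that is idle 16 hours a day and scales to zero costs one third of the always-on figure. Use the Azure pricing calculator for actual rates.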

Quota is granted per accelerator SKU per region through the Foundry quota process and is separate from Azure VM quota. Azure virtual machines are an infrastructure-as-a-service (IaaS) offering with regional SKUs; managed compute is a PaaS offering that leads with Global and Data Zone processing. Existing Azure VM quota can't be applied to a managed compute deployment.

Managed compute is currently available for global deployment. For rate estimates, see the Azure pricing calculator.

Get started

To get started, see Deploy open-source models with managed compute.

Deployment option comparison

Use Standard deployment in Foundry resources whenever possible. The following table compares capabilities across the two deployment options:

Capability	Standard deployment in Foundry resources	Managed compute
Which models can be deployed?	All Foundry Models, including Foundry Models sold by Azure and select Models from partners and community	Open-source and partner models from the model catalog, NVIDIA NIM, industry models, and custom models
Deployment resource	Foundry resource	Foundry project
Requires AI Hub	No	No
Data processing options	Regional, data zone, global	Global
Private networking	Yes	Yes
Content filtering	Built-in and customizable	Not available in public preview
Keyless authentication	Yes (Microsoft Entra ID and key-based)	Yes (Microsoft Entra ID and key-based)
Billing	Token usage or provisioned throughput units	Hourly per accelerator SKU
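
The guidance to prefer standard deployment can be expressed as a small decision helper. In this Python sketch, the class and function names are hypothetical, the field values are copied from the table above, and the `supports_standard` flag stands in for whatever check you use to learn whether a model offers standard deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentOption:
    name: str
    data_processing: tuple
    content_filtering: str
    billing: str


# Values copied from the comparison table in this article.
STANDARD = DeploymentOption(
    name="Standard deployment in Foundry resources",
    data_processing=("regional", "data zone", "global"),
    content_filtering="Built-in and customizable",
    billing="Token usage or provisioned throughput units",
)
MANAGED = DeploymentOption(
    name="Managed compute",
    data_processing=("global",),
    content_filtering="Not available in public preview",
    billing="Hourly per accelerator SKU",
)


def choose_option(supports_standard: bool) -> DeploymentOption:
    """Prefer standard deployment whenever the model supports it."""
    return STANDARD if supports_standard else MANAGED


print(choose_option(False).billing)  # prints "Hourly per accelerator SKU"
```

In the Foundry portal this selection happens automatically, so a helper like this is only useful for planning or documentation scripts.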

Tip

For detailed pricing information, see Plan and manage costs for Microsoft Foundry.


Last updated on 2026-06-03
