Note
This document refers to the Microsoft Foundry (classic) portal.
🔄 Switch to the Microsoft Foundry (new) documentation if you're using the new portal.
Model router is a trained language model that intelligently routes your prompts in real time to the most suitable large language model (LLM) based on the complexity, reasoning requirements, task type, and other attributes of each prompt. You deploy model router like any other Foundry model. As a single model deployment, it delivers high performance while reducing costs and latency and increasing responsiveness, all while maintaining comparable quality.
Note
You don't need to separately deploy the supported LLMs for use with model router, with the exception of the Claude models. To use model router with Claude models, first deploy them from the model catalog. Model router then invokes those deployments if it selects them for routing.
Tip
The Microsoft Foundry (new) portal offers enhanced configuration options for model router. Switch to the Microsoft Foundry (new) documentation to see the latest features.
How model router works
As a trained language model, model router applies intelligence to analyze your prompts in real time and determine the most suitable underlying LLMs for routing. It doesn't store your prompts. Moreover, it routes only to eligible models based on your access and deployment types, honoring data zone boundaries.
- In the default `Balanced` mode, it considers all underlying models within a small quality range (for example, 1-2% compared with the highest quality model for that prompt) and picks the most cost-effective model.
- When the `Cost` routing mode is selected, it considers a larger quality band (for example, a 5-6% range compared with the highest quality model for that prompt) and chooses the most cost-effective model.
- When the `Quality` routing mode is selected, it picks the highest quality rated model for the prompt, ignoring cost.
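To see routing in action, you can call a model router deployment through the standard chat completions API. The following is a minimal sketch in Python, assuming an Azure OpenAI-compatible endpoint and a deployment named `model-router`; the endpoint, key, and deployment name are placeholders for your own resource values:

```python
import os
from openai import AzureOpenAI  # pip install openai

# Endpoint, key, and deployment name below are placeholders for your resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

response = client.chat.completions.create(
    model="model-router",  # the model router deployment name
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
)

print(response.choices[0].message.content)
# The response's `model` field reports which underlying model served the prompt.
print(response.model)
```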
Why use model router?
Model router optimizes costs and latencies while maintaining comparable quality. Smaller and cheaper models are used when they're sufficient for the task, but larger and more expensive models are available for more complex tasks. Also, reasoning models are available for tasks that require complex reasoning, and non-reasoning models are used otherwise. Model router provides a single deployment and chat experience that combines the best features from all of the underlying chat models.
The latest version, 2025-11-18, adds several capabilities:
- Supports Global Standard and Data Zone Standard deployments.
- Adds support for new models: `grok-4`, `grok-4-fast-reasoning`, `DeepSeek-V3.1`, `gpt-oss-120b`, `Llama-4-Maverick-17B-128E-Instruct-FP8`, `gpt-4o`, `gpt-4o-mini`, `claude-haiku-4-5`, `claude-opus-4-1`, and `claude-sonnet-4-5`.
- Quick deploy or custom deploy with routing mode and model subset options.
  - Routing mode: Optimize the routing logic for your needs. Supported options: `Quality`, `Cost`, `Balanced` (default).
  - Model subset: Select your preferred models to create a model subset for routing.
- Support for agentic scenarios, including tools, so you can now use it in the Foundry Agent service (see the tool-calling sketch after this list).
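Because the 2025-11-18 version supports tools, a model router deployment can be called with the standard `tools` parameter of the chat completions API. The following is a minimal sketch, reusing the placeholder `client` from the earlier example; the weather function is hypothetical and for illustration only:

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="model-router",  # placeholder deployment name
    messages=[{"role": "user", "content": "What's the weather in Stockholm?"}],
    tools=tools,
)

# If the routed model decides to call the tool, the call appears here.
print(response.choices[0].message.tool_calls)
```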
Versioning
Each version of model router is associated with a specific set of underlying models and their versions. This set is fixed; only newer versions of model router can expose new underlying models.
If you select Auto-update at the deployment step (see Manage models), then your model router model automatically updates when new versions become available. When that happens, the set of underlying models also changes, which could affect the overall performance of the model and costs.
Underlying models
With the 2025-11-18 version, model router adds ten new models, including Anthropic Claude, DeepSeek, Llama, and Grok models, bringing the total to 18 models available for routing your prompts.
| Model router version | Underlying models | Underlying model version |
|---|---|---|
| `2025-11-18` | `gpt-4.1`<br>`gpt-4.1-mini`<br>`gpt-4.1-nano`<br>`o4-mini`<br>`gpt-5-nano`<br>`gpt-5-mini`<br>`gpt-5`<br>`gpt-5-chat`<br>`Deepseek-v3.1`<br>`gpt-oss-120b`<br>`llama4-maverick-instruct`<br>`grok-4`<br>`grok-4-fast`<br>`gpt-4o`<br>`gpt-4o-mini`<br>`claude-haiku-4-5`<br>`claude-opus-4-1`<br>`claude-sonnet-4-5` | 2025-04-14<br>2025-04-14<br>2025-04-14<br>2025-04-16<br>2025-08-07<br>2025-08-07<br>2025-08-07<br>2025-08-07<br>N/A<br>N/A<br>N/A<br>N/A<br>N/A<br>2024-11-20<br>2024-07-18<br>2025-10-01<br>2025-08-05<br>2025-09-29 |
| `2025-08-07` | `gpt-4.1`<br>`gpt-4.1-mini`<br>`gpt-4.1-nano`<br>`o4-mini`<br>`gpt-5`<br>`gpt-5-mini`<br>`gpt-5-nano`<br>`gpt-5-chat` | 2025-04-14<br>2025-04-14<br>2025-04-14<br>2025-04-16<br>2025-08-07<br>2025-08-07<br>2025-08-07<br>2025-08-07 |
| `2025-05-19` | `gpt-4.1`<br>`gpt-4.1-mini`<br>`gpt-4.1-nano`<br>`o4-mini` | 2025-04-14<br>2025-04-14<br>2025-04-14<br>2025-04-16 |
Routing mode
With the latest version, if you choose custom deployment, you can select a routing mode to optimize for quality or cost while maintaining a baseline level of performance. Setting a routing mode is optional; if you don't set one, your deployment defaults to the `Balanced` mode.
Available routing modes:
| Mode | Description |
|---|---|
| Balanced (default) | Considers both cost and quality dynamically. Well suited for general-purpose scenarios. |
| Quality | Prioritizes maximum accuracy. Best for complex reasoning or critical outputs. |
| Cost | Prioritizes cost savings. Ideal for high-volume, budget-sensitive workloads. |
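Routing mode is a deployment-level setting, so comparing modes means comparing deployments. The following is a minimal sketch, assuming two hypothetical deployments named `model-router-cost` and `model-router-quality` created with the corresponding modes; it sends the same prompt to each and prints which underlying model handled it:

```python
# `client` is the AzureOpenAI client from the earlier sketch.
prompt = "Prove that the square root of 2 is irrational."

for deployment in ("model-router-cost", "model-router-quality"):  # hypothetical names
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    # The `model` field shows which underlying model served the request.
    print(deployment, "->", response.model)
```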
Model subset
The latest version of model router supports model subsets: for custom deployments, you can specify which underlying models to include in routing decisions. This gives you more control over cost, compliance, and performance characteristics.
When new underlying models become available, they're not included in your selection unless you explicitly add them to your deployment's inclusion list.
Limitations
Resource limitations
| Region | Deployment types supported |
|---|---|
| East US 2 | Global Standard, Data zone Standard |
| Sweden Central | Global Standard, Data zone Standard |
Also see the Models page for region availability and deployment types for model router.
Rate limits
| Model | Deployment type | Default RPM | Default TPM | Enterprise and MCA-E RPM | Enterprise and MCA-E TPM |
|---|---|---|---|---|---|
| `model-router` (2025-11-18) | Data Zone Standard | 150 | 150,000 | 300 | 300,000 |
| `model-router` (2025-11-18) | Global Standard | 250 | 250,000 | 400 | 400,000 |
Also see Quotas and limits for rate limit information.
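When a deployment exceeds its RPM or TPM limit, the service returns HTTP 429, which the OpenAI Python SDK surfaces as `RateLimitError`. The following is a minimal retry sketch with exponential backoff, reusing the placeholder `client` and deployment name from the earlier example:

```python
import time
from openai import RateLimitError

def chat_with_retry(messages, max_retries=5):
    """Retry on 429 rate-limit responses with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model="model-router", messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, ...

reply = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)
```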
To work around the limits on context window and parameters, use the model subset feature to restrict routing to models that support your desired properties.
Note
The context window limit listed on the Models page is the limit of the smallest underlying model. Other underlying models support larger context windows, which means an API call with a larger context succeeds only if the prompt happens to be routed to a model with a large enough window; otherwise, the call fails. To shorten your prompt, you can do one of the following (see the truncation sketch after this list):
- Summarize the prompt before passing it to the model
- Truncate the prompt to its most relevant parts
- Use document embeddings and have the chat model retrieve relevant sections: see Azure AI Search
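For the truncation option, the following is a minimal sketch using the `tiktoken` library; the 8,000-token budget and the `cl100k_base` encoding are illustrative assumptions, not the actual limits or tokenizer of any underlying model:

```python
import tiktoken  # pip install tiktoken

def truncate_to_budget(text: str, max_tokens: int = 8_000) -> str:
    """Keep only the first max_tokens tokens of the prompt."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding for illustration
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

long_prompt = "lorem ipsum " * 10_000  # stand-in for a long document
safe_prompt = truncate_to_budget(long_prompt)
```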
Model router accepts image inputs for vision-enabled chats (all of the underlying models can accept image input), but the routing decision is based on the text input only.
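The following is a minimal sketch of an image-plus-text request using the standard chat completions image content format, reusing the placeholder `client` from earlier; the image URL is a placeholder:

```python
response = client.chat.completions.create(
    model="model-router",  # placeholder deployment name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart."},
                # Placeholder URL; routing still depends only on the text part.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```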
Model router doesn't process audio input.
Billing information
Starting in November 2025, model router usage is charged for input prompts at the rate listed on the pricing page.
You can monitor the costs of your model router deployment in the Azure portal.