The Responses API is the programmatic interface for every model in Microsoft Foundry. Pass model-router in the model field to let Foundry pick the best model automatically, or pass a specific model name for deterministic control. The code is the same; only the model value changes.
Note
You pass your deployment name to the model parameter. In most cases the deployment name matches the model name. For example, a gpt-4.1-mini deployment is called "gpt-4.1-mini".
Prerequisites
- A Foundry project with a model-router deployment. See Deploy model router.
- At least one named model deployment for deterministic calls (for example, gpt-4.1-mini). See Deploy a model.
- Familiarity with the Responses API.
- Python 3.9+
- The Foundry SDK:
pip install "azure-ai-projects>=2.0.0"
Call models through the Responses API
The following sample calls several models through the same responses.create() interface, starting with model-router for automatic selection, then named models for deterministic control.
import os
import time

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Create the Foundry project client
project = AIProjectClient(
    endpoint=os.environ["PROJECT_ENDPOINT"],
    credential=DefaultAzureCredential(),
)

deployments = ["model-router", "gpt-5.2", "grok-4-fast-reasoning", "gpt-5-mini", "Deepseek-V3.2"]
prompt = "Explain retrieval-augmented generation in one sentence."

print(f"{'Deployment':<22} {'Responded':<22} {'Latency':>8} Response")
print("-" * 100)

# Get an OpenAI-compatible client that works with all Foundry models
with project.get_openai_client() as client:
    for name in deployments:
        start = time.time()
        response = client.responses.create(model=name, input=prompt)
        elapsed = time.time() - start
        responded_model = response.model
        print(
            f"{name:<22} {responded_model:<22} {elapsed:>7.2f}s "
            f"{response.output_text[:60]}"
        )
The following table shows a sample output. Actual latency and response text vary per request.
| Deployment | Responded | Latency | Response |
|---|---|---|---|
| model-router | gpt-4.1-nano | 0.59 s | It combines retrieval of relevant documents with generati... |
| gpt-5.2 | same | 0.78 s | Retrieval-augmented generation enhances model output by firs... |
| grok-4-fast-reasoning | same | 0.65 s | It is a technique that grounds language model responses in... |
| gpt-5-mini | same | 0.67 s | It retrieves external knowledge to augment a model's genera... |
| Deepseek-V3.2 | same | 1.14 s | It augments model generation by first retrieving relevant do... |
In the first row, model-router didn't target a specific model, but the Responded column shows that it selected gpt-4.1-nano. For the named models that follow, the two columns match (shown as "same" in the table). The code is identical in every case.
Tip
The response.model field always returns the model that handled the request. Use it for logging, cost attribution, or debugging routing decisions.
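The logging use mentioned in this tip could look like the following minimal sketch. The helper name `routing_record` and the record fields are illustrative and not part of the SDK; the helper only reads `response.model`, and the stand-in object replaces a real response so the snippet runs without a Foundry project. It assumes the returned model name matches the deployment name exactly for named deployments, as in the sample output.

```python
import json
from types import SimpleNamespace


def routing_record(requested: str, response) -> dict:
    """Build a log record comparing the requested deployment with the model that answered."""
    responded = response.model
    return {
        "requested": requested,
        "responded": responded,
        # True when the responding model name differs from the requested deployment
        "routed": responded != requested,
    }


# Stand-in object for illustration; a real call returns the object from responses.create().
fake_response = SimpleNamespace(model="gpt-4.1-nano")
print(json.dumps(routing_record("model-router", fake_response)))
```

A record like this can be passed to whatever logger or cost-attribution pipeline you already use.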
Routing strategies
Every call goes through responses.create(). The model value is the only decision point.
| Use case | model value | Result |
|---|---|---|
| The best model for each request, optimized by cost or quality | "model-router" | Foundry evaluates the prompt and selects the best model from your configured pool |
| A specific model for compliance, reproducibility, or benchmarking | The model name ("gpt-5.2", "Deepseek-V3.2") | Foundry routes to exactly that model |
| To switch between strategies | Change one string | The rest of the code stays identical |
Use model-router as your default. Customize your model router deployment with optional settings. See Model router deployment options.
Switch to a named model only when you need deterministic control.
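One way to keep that switch to a single string is to resolve the model value from configuration, as in this sketch. The environment variable name `MODEL_OVERRIDE` and the helper names `resolve_model` and `ask` are hypothetical, chosen for illustration; the `client` argument is assumed to be the OpenAI-compatible client from `project.get_openai_client()` shown earlier.

```python
import os


def resolve_model(env=os.environ) -> str:
    """Return the deployment to call: a pinned one if configured, else the router."""
    # MODEL_OVERRIDE is a hypothetical variable name used only in this sketch.
    pinned = env.get("MODEL_OVERRIDE", "").strip()
    return pinned or "model-router"


def ask(client, prompt: str, env=os.environ):
    """Call responses.create() with whichever deployment resolve_model() picks."""
    return client.responses.create(model=resolve_model(env), input=prompt)
```

With no override set, every call goes to model-router; setting the variable to a deployment name such as gpt-5-mini pins the call without touching the calling code.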
Built-in enterprise capabilities
Every responses.create() call, whether routed through model-router or targeting a named model, automatically includes:
- Automatic failover: When using model-router, if the selected model encounters a transient issue, model router transparently redirects the request to the next most appropriate model. No disruption to your application, no retry logic required. If you configure a model subset, that subset also serves as your fallback set, so select at least two models to benefit from failover.
- Prompt caching: Model router supports prompt caching. When model router delegates a request to a model that supports prompt caching, cached tokens are used automatically. Combined with model router's right-fit model selection, you get an extra efficiency lift: the optimal model for the task and reduced token costs on repeated prompt prefixes, with no configuration needed.
- Content filtering: Configurable content safety applied to inputs and outputs without extra API parameters.
- Role-based access control: Azure role-based access control governs who can call which deployments. No separate API key management.
- Observability and tracing: Every request is logged with the selected model, latency, and token usage. Integrate with Azure Monitor or your existing observability stack.
- Data residency and compliance: Traffic stays within your Azure region. No data leaves your tenant boundary.
- Rate limiting and quotas: Per-deployment token-per-minute limits protect your workloads from noisy neighbors.
These capabilities apply uniformly across all models in the catalog. You don't opt in or configure them per request—they're part of the platform.
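Prompt caching needs no configuration, but because the savings apply to repeated prompt prefixes, it can help to keep the unchanging part of a prompt at the front. The following sketch only illustrates that prompt structure; `STATIC_PREFIX` and `build_input` are hypothetical names, and whether a given request actually uses cached tokens depends on the model that handles it.

```python
import os.path

# Hypothetical static instructions; in practice this is your long, unchanging prompt text.
STATIC_PREFIX = (
    "You are a support assistant. Answer in one short paragraph "
    "and cite the product manual section when relevant."
)


def build_input(question: str) -> str:
    """Place the unchanging text first so every request starts with the same prefix."""
    return f"{STATIC_PREFIX}\n\nQuestion: {question}"


first = build_input("How do I reset my password?")
second = build_input("How do I export my data?")

# The shared leading text is the part a prefix-based cache could reuse across requests.
shared = os.path.commonprefix([first, second])
print(len(shared) >= len(STATIC_PREFIX))
```

The string returned by `build_input` would be passed as the input argument to responses.create(), exactly as in the earlier sample.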
Related content
- Use model router for Microsoft Foundry — deployment, routing modes, model subsets
- Use the Azure OpenAI Responses API — streaming, tools, stored responses
- Microsoft Foundry SDKs and endpoints — client setup and endpoint patterns