Agentic Retrieval requires you to provide your own language model endpoint. Set up an OpenAI API-compatible endpoint by using one of the following methods.
All search types (hybrid, vector, text, and hybrid multimodal) are available with your language model endpoint.
After you create your endpoint, use it when you deploy the Agentic Retrieval extension. The endpoint URL, model name, and max tokens are required deployment parameters.
Important
Agentic Retrieval in Foundry Local is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
Choose a setup method
Choose a method based on your environment, connectivity, and production requirements.
| Method | Description | Best for |
|---|---|---|
| Foundry Local on Azure Local | Deploy models on your Arc-connected cluster by using the Foundry Local extension. | Production deployments with Azure-managed models. |
| Microsoft Foundry | Deploy cloud-hosted models through the Foundry portal. | Cloud-connected deployments with managed models. |
Foundry Local
Deploy an AI model on your Arc-connected Kubernetes cluster by using the Foundry Local extension. Foundry Local is currently a CLI-based experience.
The following table summarizes the key properties of the Foundry Local extension:
| Property | Value |
|---|---|
| Extension type | Microsoft.Foundry |
| Default namespace | foundry-local-operator |
| Inference port | 5000 (TLS enabled by default via nginx sidecar) |
| API format | OpenAI-compatible (/v1/chat/completions) |
Foundry Local must be installed and operational on your cluster before you install the Agentic Retrieval extension. The model endpoint URL from Foundry Local is a required parameter during Agents and Tools deployment. If Foundry Local is not set up correctly, Agentic Retrieval fails at runtime with connection errors.
For setup instructions, see What is Foundry Local on Azure Local? and Foundry Local on GitHub.
This section shows how to deploy the recommended model (gpt-oss-20b) and configure its endpoint for use with Agents and Tools.
Prerequisites
Before you start, confirm that your cluster, tools, and access settings meet the minimum requirements for Foundry Local.
- Preview deployment access for Foundry Local on Azure Local
- Azure Arc-enabled Kubernetes cluster (Kubernetes 1.29 or later)
- kubectl configured for your cluster
- An app registration for authentication (Microsoft Entra ID)
- GPU nodes with sufficient memory for large language models (for example, 40 GB+ VRAM or multi-GPU setups recommended for gpt-oss-20b)
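The Kubernetes version prerequisite can be checked before you install anything. The following is a minimal sketch, not part of the official tooling: the helper name k8s_version_at_least is hypothetical, and the commented example assumes the kubelet version of the first node is representative of the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a Kubernetes version string
# (for example "v1.29.4") is at least the required major.minor.
# The default of 1.29 comes from the prerequisites list above.
k8s_version_at_least() {
  local version="${1#v}"            # strip a leading "v"
  local required="${2:-1.29}"
  local major="${version%%.*}"
  local rest="${version#*.}"
  local minor="${rest%%.*}"
  minor="${minor%%[!0-9]*}"         # drop any non-numeric suffix
  local req_major="${required%%.*}"
  local req_minor="${required#*.}"
  [ "$major" -gt "$req_major" ] ||
    { [ "$major" -eq "$req_major" ] && [ "$minor" -ge "$req_minor" ]; }
}

# Example (requires cluster access):
# node_version=$(kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}')
# k8s_version_at_least "$node_version" && echo "Kubernetes version OK"
```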
Step 1 - Install required extensions
Install the required Kubernetes extensions so your cluster can host and run Foundry Local model workloads.
Install cert-manager and trust-manager:
az k8s-extension create \
  --cluster-name <your_arc_cluster_name> \
  --name "azure-cert-manager" \
  --resource-group <resource_group> \
  --cluster-type connectedClusters \
  --extension-type Microsoft.CertManagement \
  --scope cluster \
  --release-train stable \
  --config config.enableGatewayAPI=true \
  --config cert-manager.crds.keep=true \
  --config trust-manager.defaultPackage.enabled=false \
  --config trust-manager.secretTargets.enabled=true \
  --config trust-manager.secretTargets.authorizedSecretsAll=true
Install the Foundry inference operator:
az k8s-extension create \
  --resource-group <resource_group> \
  --cluster-name <cluster_name> \
  --name "inference-operator" \
  --extension-type Microsoft.Foundry \
  --scope cluster \
  --release-namespace "foundry-local-operator" \
  --cluster-type connectedClusters \
  --auto-upgrade-minor-version true \
  --release-train stable \
  --config entraAuth.tenantId="<tenant_id>" \
  --config entraAuth.clientId="<client_id>"
Important
Microsoft Entra ID authentication is required for Foundry Local. You must provide a valid entraAuth.tenantId and entraAuth.clientId from an app registration. Agentic Retrieval uses this identity for secure communication with the Foundry Local endpoint.
Verify installation:
kubectl get pods -n foundry-local-operator
kubectl get crd | grep foundry
Expected output: five pods in Running or Completed status, and four Foundry Local custom resource definitions (CRDs) registered:
inferenceservices.foundrylocal.azure.com
modeldeployments.foundrylocal.azure.com
models.foundrylocal.azure.com
storemodels.foundrylocal.azure.com
Warning
Don't proceed to Step 2 until all four CRDs are present. The ModelDeployment CRD is installed via a Helm pre-install hook that can fail silently. If modeldeployments is missing, extract and apply it manually:
helm get hooks inference-operator -n foundry-local-operator > /tmp/hooks.yaml
# Extract the ModelDeployment CRD YAML from the hooks output, save to a file, then:
kubectl apply -f /tmp/modeldeployment-crd.yaml
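Because the hook can fail silently, it may help to check for each of the four CRDs by name rather than scanning the grep output by eye. The following sketch assumes the CRD names listed above; the helper name missing_foundry_crds is hypothetical. It reads CRD names from standard input, so it works with the output of kubectl get crd -o name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads CRD names from stdin (one per line) and prints
# each expected Foundry Local CRD that is absent.
# Returns non-zero when at least one CRD is missing.
missing_foundry_crds() {
  local installed missing=0 crd
  installed="$(cat)"
  for crd in inferenceservices modeldeployments models storemodels; do
    # Match either a bare name or the "<kind>/<name>" form from "-o name".
    if ! printf '%s\n' "$installed" |
        grep -Eq "(^|/)${crd}\.foundrylocal\.azure\.com\$"; then
      echo "missing: ${crd}.foundrylocal.azure.com"
      missing=1
    fi
  done
  return "$missing"
}

# Example (requires cluster access):
# kubectl get crd -o name | missing_foundry_crds && echo "All four CRDs are present"
```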
Step 2 - Deploy the recommended model (gpt-oss-20b)
Deploy the recommended gpt-oss-20b model to create a local inference endpoint for your language model configuration.
Create a ModelDeployment resource and save it as model-deployment.yaml:
apiVersion: foundrylocal.azure.com/v1
kind: ModelDeployment
metadata:
  name: gpt-oss-20b
  namespace: foundry-local-operator
spec:
  workloadType: generative
  compute: gpu
  runtime: vllm
  model:
    catalog:
      name: gpt-oss-20b
      version: "latest"
  replicas: 1
  port: 5000
  resources:
    requests: { cpu: "4", memory: "32Gi" }
    limits: { cpu: "8", memory: "64Gi", gpu: 1 }
  nodeSelector:
    # For AKS-managed GPU nodes, use:
    kubernetes.azure.com/accelerator: nvidia
    agentpool: <your_gpu_node_pool>
    # For Azure Local (HaaS) GPU nodes, use:
    # nvidia.com/gpu.present: "true"
  tolerations:
    - { key: "sku", operator: "Equal", value: "gpu", effect: "NoSchedule" }
    - { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" }
  endpoint:
    enabled: false
  vllm:
    modelCacheStorageGi: 100
    preferences:
      gpu_memory_utilization: 0.92
      max_model_len: 16384
      dtype: "bfloat16"
      max_num_seqs: 128
      max_num_batched_tokens: 4096
      enforce_eager: true
The model is deployed with endpoint.enabled: false, which means it's accessed via internal Kubernetes service DNS rather than an external ingress. Set the nodeSelector based on your cluster type:
- AKS-managed GPU nodes: Use kubernetes.azure.com/accelerator: nvidia and agentpool: <your_gpu_node_pool>.
- Azure Local (HaaS) GPU nodes: Use nvidia.com/gpu.present: "true".
KV cache quantization: If your GPU supports SM 9.0 or later (for example, NVIDIA H100), you can enable FP8 KV cache quantization by adding kv_cache_dtype: "fp8_e4m3" under vllm.preferences. Don't use this setting on L40S, A100, or earlier GPUs.
Important
Runtime selection: The vllm runtime is required for agentic or combined mode because it provides full OpenAI API compatibility including tool calling (tools, tool_choice). The onnx-genai runtime doesn't support tool calling and returns Invalid JSON errors from the agentic pipeline. Use onnx-genai only for knowledge-only mode (no agentic features). The deployment mode is controlled by the layerSelection parameter, which you set during extension installation. For details, see Deployment parameter reference.
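The KV cache guidance above depends on the GPU's compute capability (SM version). As a rough sketch, the check can be scripted as shown here. The helper name supports_fp8_kv_cache is hypothetical, and the commented example assumes an NVIDIA driver whose nvidia-smi supports the compute_cap query field, which older drivers might not.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a CUDA compute capability
# (for example "9.0") meets the SM 9.0 threshold mentioned above.
supports_fp8_kv_cache() {
  awk -v cap="$1" 'BEGIN { exit !(cap + 0 >= 9.0) }'
}

# Example (run on a GPU node; assumes nvidia-smi supports compute_cap):
# cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
# if supports_fp8_kv_cache "$cap"; then
#   echo "FP8 KV cache quantization is an option on this GPU"
# else
#   echo "Leave kv_cache_dtype unset on this GPU"
# fi
```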
Apply the deployment:
kubectl apply -f model-deployment.yaml
Verify it's running:
kubectl get modeldeployment gpt-oss-20b -n foundry-local-operator
Wait until the status is Running. GPU model deployment typically takes 10-30 minutes depending on model size and whether the image is cached.
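Because the deployment can take up to 30 minutes, polling in a loop is often more convenient than rerunning the command by hand. The following is one possible sketch. The helper names wait_until and model_is_running are hypothetical, and the readiness check assumes that the default kubectl output contains the word Running once the deployment is ready.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retries a command until it succeeds or the timeout
# (in seconds) elapses.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Assumption: the default table output shows "Running" when the model is ready.
model_is_running() {
  kubectl get modeldeployment gpt-oss-20b -n foundry-local-operator |
    grep -q Running
}

# Example: poll every 30 seconds for up to 30 minutes.
# wait_until 1800 30 model_is_running
```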
Step 3 - Verify the model endpoint
Test the deployed endpoint to confirm that it accepts chat completion requests and returns a valid response.
Discover the model service endpoint dynamically:
kubectl get svc <model-name> -n foundry-local-operator -o jsonpath='https://{.metadata.name}.{.metadata.namespace}.svc.cluster.local:{.spec.ports[0].port}'
Service names vary by model deployment. Use this command to confirm the correct service name before proceeding.
Port-forward the model service:
kubectl port-forward svc/gpt-oss-20b -n foundry-local-operator 5000:5000
Note the internal Kubernetes service URL for use during deployment:
https://gpt-oss-20b.foundry-local-operator.svc.cluster.local:5000/v1/chat/completions
Use this URL as your language model endpoint when Foundry Local runs on the same cluster as Agentic Retrieval.
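The internal URL follows the standard Kubernetes service DNS pattern with the /v1/chat/completions path appended. If you script your deployment, a small helper such as the following can assemble it. The function name foundry_endpoint_url is hypothetical, and the defaults assume the namespace and port from the properties table earlier in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the in-cluster chat completions URL.
# Arguments: <service_name> [namespace] [port]
foundry_endpoint_url() {
  local service="$1"
  local namespace="${2:-foundry-local-operator}"
  local port="${3:-5000}"
  printf 'https://%s.%s.svc.cluster.local:%s/v1/chat/completions\n' \
    "$service" "$namespace" "$port"
}

# Example: prints the URL for the gpt-oss-20b service.
foundry_endpoint_url gpt-oss-20b
```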
Important
The endpoint must use https://, not http://. Foundry Local enables TLS on port 5000 via an nginx sidecar with a self-signed certificate. Using http:// results in 400 Bad Request: The plain HTTP request was sent to HTTPS port.
Agentic Retrieval trusts the Foundry CA certificate automatically when foundryClientId is set. The Helm chart mounts the foundry-local-operator-ca-bundle ConfigMap and sets REQUESTS_CA_BUNDLE/SSL_CERT_FILE environment variables. Without foundryClientId, HTTPS calls fail with SSL: CERTIFICATE_VERIFY_FAILED.
If you have an external ingress configured, you can also use the external URL:
https://<foundry_ingress_host>/v1/chat/completions
Test the endpoint:
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [
      {"role": "user", "content": "Hello, what can you do?"}
    ],
    "max_tokens": 256
  }'
Port-forwarding connects directly to the model container, bypassing the TLS and authentication sidecars. This is appropriate for local verification only. In production, Agentic Retrieval connects to the model via the internal HTTPS service URL with Entra ID authentication handled automatically through the foundryClientId configuration.
You should receive a JSON response with a choices array.
Step 4 - Configure Agentic Retrieval
When you deploy the Agentic Retrieval extension, pass the following BYOM settings as --configuration-settings parameters to az k8s-extension create:
byom.enabled=true
byom.apiEndpoint=https://gpt-oss-20b.foundry-local-operator.svc.cluster.local:5000/v1/chat/completions
byom.apiModel=gpt-oss-20b
byom.maxTokensInK=16
foundryClientId=<foundry_app_registration_client_id>
The foundryClientId parameter enables Entra ID-based authentication between Agents and Tools and the Foundry Local endpoint. No API key secret is required when using Foundry Local.
After deployment, Agentic Retrieval uses the local gpt-oss-20b deployment for all language model interactions.
For optional operator parameters and namespace configuration, see Deployment parameter reference.
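If you generate the Step 4 settings from a script, it can be useful to reject an http:// endpoint before the extension is deployed, because Foundry Local requires https://. The following sketch prints the five settings as key=value lines. The function name byom_settings is hypothetical, and the sketch doesn't call az itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the BYOM settings as key=value lines after
# checking that the endpoint uses https://.
# Arguments: <endpoint_url> <model_name> <max_tokens_in_k> <foundry_client_id>
byom_settings() {
  local endpoint="$1" model="$2" max_tokens_in_k="$3" client_id="$4"
  case "$endpoint" in
    https://*) ;;
    *)
      echo "endpoint must start with https://: $endpoint" >&2
      return 1
      ;;
  esac
  printf '%s\n' \
    "byom.enabled=true" \
    "byom.apiEndpoint=${endpoint}" \
    "byom.apiModel=${model}" \
    "byom.maxTokensInK=${max_tokens_in_k}" \
    "foundryClientId=${client_id}"
}

# Example with the values from Step 4 (the client ID stays a placeholder):
byom_settings \
  "https://gpt-oss-20b.foundry-local-operator.svc.cluster.local:5000/v1/chat/completions" \
  "gpt-oss-20b" 16 "<foundry_app_registration_client_id>"
```

Each printed line corresponds to one setting that you pass to az k8s-extension create.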
Microsoft Foundry
To use your own model with Agentic Retrieval, deploy a language model and create an endpoint by using Foundry.
Go to Foundry and sign in with your Azure account.
Create a new Foundry resource or go to an existing resource.
On the Foundry resource, select Models + endpoints.
Select Deploy model > Deploy base model.
Choose a chat completion model from the list, such as gpt-4o.
Select Confirm.
Edit the following fields as appropriate for your scenario:
| Field | Description |
|---|---|
| Deployment name | Choose a deployment name. The default is the name of the model you selected. |
| Deployment type | Select a deployment type. The default is Global Standard. |
Select Deploy to selected resource.
Wait for the deployment to complete and the State to show Succeeded.
Get the endpoint and API key by selecting the deployed model. For example, the endpoint looks like the following URL.
https://<Foundry Resource Name>.openai.azure.com/openai/deployments/<Model Deployment Name>/chat/completions?api-version=<API Version>
Validate your endpoint
Before deploying Agentic Retrieval, verify your endpoint works by sending a test request.
Note
The authentication header format depends on your provider. Foundry Local uses Microsoft Entra ID tokens (Authorization: Bearer <token>). Microsoft Foundry and Azure OpenAI use Authorization: Bearer <api-key>.
curl -X POST <your-endpoint-url> \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <your-api-key-if-needed>" \
-d '{
"model": "<your-model-name>",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
}'
You should receive a JSON response with a choices array containing the model's answer. If this works, your endpoint is ready for Agentic Retrieval.
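To automate this check, you can pipe the response into a small test for the choices array. The following sketch uses a plain-text pattern match rather than a JSON parser, so it assumes the opening bracket appears on the same line as the "choices" key. The helper name has_choices is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a chat completions response from stdin and
# succeeds when it contains a "choices" array.
# This is a text match, not a JSON parser.
has_choices() {
  grep -Eq '"choices"[[:space:]]*:[[:space:]]*\['
}

# Example (replace the placeholders with your own values):
# curl -sS -X POST "<your-endpoint-url>" \
#   -H "Content-Type: application/json" \
#   -H "Authorization: Bearer <your-api-key-if-needed>" \
#   -d '{"model": "<your-model-name>", "messages": [{"role": "user", "content": "Hello"}]}' \
#   | has_choices && echo "Endpoint is ready"
```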