Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Agentic Retrieval requires a language model for inference. This article helps you choose the right model for your use case and understand the available deployment options.
Important
Agentic Retrieval in Foundry Local is currently in PREVIEW. See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.
Select a language model
Agentic Retrieval doesn't include any language models. You must provide your own LLM endpoint that exposes an OpenAI-compatible chat completions API. Both the agentic layer (for agent runs) and the knowledge layer (for RAG inference) use this endpoint.
Work with your application development team to choose the right model for your use case.
To choose the right model for your use case, refer to these resources from Microsoft:
- Blog: How to Choose the Right Models for Your Apps | Azure AI
- Video: How to Choose the Right Models for Your Apps | Azure AI - YouTube
- Microsoft Foundry also provides tooling such as model benchmarks to choose the right model.
Available models with Foundry Local
If you use Foundry Local as your model endpoint, the following models are available for deployment.
The recommended model for most use cases is gpt-oss-20b. For step-by-step deployment instructions, see Create your language model endpoint.
CPU-optimized models (no GPU required):
| Model | Parameters | Notes |
|---|---|---|
phi-3-mini-4k-instruct-generic-cpu:2 |
3.8B | Microsoft Phi-3 Mini |
phi-3.5-mini-instruct-generic-cpu:1 |
3.8B | Microsoft Phi-3.5 Mini |
qwen2.5-0.5b-instruct-generic-cpu:3 |
0.5B | Small, fast |
qwen2.5-1.5b-instruct-generic-cpu:3 |
1.5B | Larger Qwen |
llama3.2:1b |
1B | Meta Llama 3.2 |
llama3.2:3b |
3B | Meta Llama 3.2 |
GPU-optimized models (CUDA required):
| Model | Parameters |
|---|---|
gpt-oss-20b |
20B |
qwen2.5-1.5b-instruct-cuda-gpu:3 |
1.5B |
llama3.1:8b |
8B |
Recommended models
| Use case | Recommended model | Runtime |
|---|---|---|
Knowledge-only (layerSelection=knowledge) |
gpt-oss-20b |
vllm or onnx-genai |
Agentic or Combined (layerSelection=agentic or combined) |
gpt-oss-20b |
vllm (required for tool calling) |
| RAG and entity extraction | gpt-oss-20b |
vllm |
Set up your endpoint
After you choose a model, you need an OpenAI-compatible chat completions endpoint. Foundry Local on Azure Local is the recommended option because it runs on the same Arc-connected cluster as the extension. You can also use Microsoft Foundry for cloud-hosted models.
For supported methods and step-by-step setup instructions, see Create your language model endpoint.
Next steps
- Create your language model endpoint — set up your LLM endpoint.
- Deploy the Agentic Retrieval extension — use the endpoint during deployment.
- Configure BYOM endpoint authentication — set up authentication after deployment.