Note
Hosted agents and the Azure Developer CLI evaluation experience are currently in preview.
In this quickstart, you evaluate the hosted agent you deployed in Deploy your first hosted agent. You provide a test dataset, choose evaluators, run an evaluation against the deployed agent, and review the scores. Each step shows three ways to do the same task: the Azure Developer CLI (azd), the Microsoft Foundry portal, and the Python SDK.
Evaluation establishes a quality baseline for your agent and lets you set acceptance thresholds, such as a task adherence passing rate, before you release changes to users.
Prerequisites
Before you begin, you need:
- A deployed, invokable hosted agent from Deploy your first hosted agent. For the Azure Developer CLI path, you also need the azd project directory you created in that quickstart.
- The Foundry User role on the Foundry resource.
- A chat-completion model deployment in the same Foundry project to use as the judge model that scores responses. You can reuse the model deployment your agent already uses, including the one from the previous quickstart, so you don't need a separate deployment.
Important
The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.
Each step offers three paths. Use whichever you prefer:
- Azure Developer CLI: The azd ai agent extension (azure.ai.agents), version 0.1.40-preview or later, which provides the azd ai agent eval commands. This extension is included in the microsoft.foundry extension you installed in the previous quickstart. Verify the installed version with azd ext list, and run azd ext upgrade microsoft.foundry if needed. Sign in with azd auth login.
- Foundry portal: Access to the Foundry portal.
- Python SDK: Python 3.9 or later, and the Azure CLI signed in with az login so that DefaultAzureCredential can authenticate. For installation, see Install the Azure CLI.
Step 1: Confirm your deployed agent
Evaluation runs against a deployed, invokable agent. Confirm your agent is deployed and available before you set up the evaluation.
From your azd project directory, verify the agent is deployed and invokable:
azd ai agent show
Send a test prompt:
azd ai agent invoke "Write a haiku about deploying cloud applications."
You should see a response within a few seconds.
Step 2: Set up built-in evaluators
Start with built-in evaluators to score your agent against a test dataset.
First, create a JSONL file of test queries for your agent. Each line is a JSON object with a query field. Save it inside your agent's source folder, as src/<your-agent-name>/tests/queries.jsonl:
{"query": "Write a haiku about deploying cloud applications."}
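If you prefer to generate the dataset from a list of prompts, the following is a minimal sketch that uses only the Python standard library. It writes one JSON object with a query field per line, as described above, and then checks the file for blank lines, invalid JSON, and missing query values. The function names write_queries and validate_queries are hypothetical helpers for illustration; they are not part of the Azure Developer CLI or the Foundry SDK.

```python
import json
from pathlib import Path


def write_queries(path, queries):
    """Write one {"query": ...} JSON object per line; return the number of lines written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for query in queries:
            f.write(json.dumps({"query": query}, ensure_ascii=False) + "\n")
            count += 1
    return count


def validate_queries(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with Path(path).open(encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((number, "blank line"))
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            query = row.get("query") if isinstance(row, dict) else None
            if not isinstance(query, str) or not query.strip():
                problems.append((number, "missing or empty 'query' string"))
    return problems


if __name__ == "__main__":
    # The folder name is a placeholder; use your own agent source folder.
    target = Path("src") / "<your-agent-name>" / "tests" / "queries.jsonl"
    write_queries(target, ["Write a haiku about deploying cloud applications."])
    print(validate_queries(target))
```

Running the validation before you start an evaluation can catch a malformed line early, before any query is sent to the agent.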
Then create an eval.yaml file in the same agent source folder, as src/<your-agent-name>/eval.yaml. It points to your dataset and lists the built-in evaluators to apply. The dataset.local_uri path is relative to this folder. Replace <your-agent-name> with your hosted agent's name and <your-chat-completion-deployment> with the judge model deployment:
name: agent-eval
agent:
  name: <your-agent-name>
  kind: hosted
dataset:
  local_uri: tests/queries.jsonl
evaluators:
  - builtin.intent_resolution
  - builtin.task_adherence
options:
  eval_model: <your-chat-completion-deployment>
  max_samples: 15
The eval_model value is the judge model that scores responses; you can reuse the deployment your agent already uses.
Step 3: Run the evaluation
Run the suite against your deployed agent. The service sends each test query to the agent, captures the response, and scores it with your selected evaluators.
Note
Target-based evaluation invokes your hosted agent directly. It works with agents that use the responses or invocations protocol with synchronous, non-streaming execution. To evaluate agents that use the A2A or Activity protocol, or other execution patterns such as long-running or streaming, evaluate the traces your agent emits instead. See Trace evaluation.
Run the evaluation from the azd workspace root:
azd ai agent eval run --config eval.yaml
Note
azd ai agent eval run resolves the --config path relative to your agent's source folder under src/ (for example, src/<your-agent-name>/eval.yaml), not the current directory. Keep eval.yaml, and the dataset that its local_uri points to, inside that folder.
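Because both paths are resolved relative to the agent source folder, a quick local check of the layout can save a failed run. The following is a rough sketch under stated assumptions: it only looks for a line of the form local_uri: <path> in eval.yaml with a regular expression rather than parsing YAML, and check_eval_layout is a hypothetical helper, not an azd command.

```python
import re
from pathlib import Path

# Matches a line such as "  local_uri: tests/queries.jsonl" (simple text match, not a YAML parser).
LOCAL_URI = re.compile(r"^[ \t]*local_uri:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def check_eval_layout(agent_dir, config_name="eval.yaml"):
    """Return a list of layout problems for the agent source folder; empty means none found."""
    agent_dir = Path(agent_dir)
    config = agent_dir / config_name
    if not config.is_file():
        return [f"missing config: {config}"]
    match = LOCAL_URI.search(config.read_text(encoding="utf-8"))
    if match is None:
        return [f"no local_uri entry in {config}"]
    # The dataset path is interpreted relative to the agent source folder.
    dataset = agent_dir / match.group(1).strip("'\"")
    if not dataset.is_file():
        return [f"dataset not found: {dataset}"]
    return []


if __name__ == "__main__":
    # The folder name is a placeholder; use your own agent source folder.
    for problem in check_eval_layout(Path("src") / "<your-agent-name>"):
        print(problem)
```

An empty result means eval.yaml exists in the agent folder and the dataset it references is present next to it.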
The command reads eval.yaml, sends each query to your agent, scores the responses, and prints a summary when it finishes:
Eval run started
Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27
Run: evalrun_5f72ef189ad24790a32128e6f230b131
(✓) Done Eval run
Results: 1 total, 1 passed, 0 failed, 0 errored
Per-criteria results:
intent_resolution: 1 passed, 0 failed, 0 errored
task_adherence: 1 passed, 0 failed, 0 errored
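To turn this summary into an acceptance gate, you could compute the passing rate from the Results line and compare it with a threshold. The sketch below assumes the summary format shown in the sample output above, which might change while the feature is in preview. The 0.8 threshold and the function names pass_rate and meets_threshold are illustrative assumptions, not values or APIs defined by the service.

```python
import re

# Pattern for the summary line shown in the sample output, for example:
# "Results: 1 total, 1 passed, 0 failed, 0 errored"
RESULTS = re.compile(
    r"Results:\s*(\d+) total,\s*(\d+) passed,\s*(\d+) failed,\s*(\d+) errored"
)


def pass_rate(summary_text):
    """Return passed/total from the 'Results:' line, or None if it is absent or total is 0."""
    match = RESULTS.search(summary_text)
    if match is None:
        return None
    total, passed, _failed, _errored = map(int, match.groups())
    return passed / total if total else None


def meets_threshold(summary_text, threshold=0.8):
    """Return True only when a pass rate was found and it is at least the threshold."""
    rate = pass_rate(summary_text)
    return rate is not None and rate >= threshold


if __name__ == "__main__":
    sample = "Results: 1 total, 1 passed, 0 failed, 0 errored"
    print(pass_rate(sample), meets_threshold(sample))
```

Treating a missing Results line as a failure is a deliberate choice in this sketch: a run that prints no summary shouldn't pass the gate silently.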
Step 4: Review the results
Evaluations typically complete in a few minutes, depending on the number of queries.
List recent evaluations:
azd ai agent eval list
  Eval ID                                Name        Status of last run  Runs
  -------                                ----        ------------------  ----
* eval_b36748dede424e4ba3f8e6c99ca2cf27  agent-eval  Completed           1
* = active eval in current environment
Show the most recent evaluation and its runs:
azd ai agent eval show
Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27
Name: agent-eval
Agent: <your-agent-name>
Runs: 1
Recent runs:
Run ID                                    Status     Passed  Failed  Created
------                                    ------     ------  ------  -------
evalrun_5f72ef189ad24790a32128e6f230b131  Completed  1/1     0       2026-06-17 14:52 UTC
Use the results to confirm which agent version was evaluated and which evaluator scores were produced. To see per-evaluator details and a link to the report in the Foundry portal, run azd ai agent eval show <eval-id> --eval-run-id <run-id>.
Clean up resources
This quickstart registers a dataset, an evaluation, and run history in your Foundry project. These assets incur little or no ongoing cost.
To remove the hosted agent and the Azure resources you created, follow the cleanup steps in Deploy your first hosted agent.
Troubleshooting
| Issue | Solution |
|---|---|
| azd ai agent eval command not found | Run azd ext list and verify the azd ai agent extension is 0.1.40-preview or later. Upgrade with azd ext upgrade microsoft.foundry. |
| azd ai agent eval run fails to find the agent | Confirm the agent is deployed and invokable with azd ai agent show. Redeploy with azd deploy if needed. |
| ModuleNotFoundError for azure.ai.projects or azure.identity | Install the SDK: pip install "azure-ai-projects>=2.0.0" azure-identity. |
| AuthenticationError, DefaultAzureCredential, or Forbidden failure | Sign in with az login (or azd auth login for the CLI path), and confirm you have the Foundry User role on the project. Dataset uploads also require write access to the project's storage. |
| Agent target not found | Verify the agent name and version with project_client.agents.get("<your-agent-name>") or project_client.agents.list(). |
| Many errored rows or unexpectedly low scores | Open the report URL and check whether rows failed with agent response or evaluator errors. Fix the underlying errors, then rerun the evaluation. |
| Eval model deployment not found | Verify that the judge model deployment (AZURE_AI_MODEL_DEPLOYMENT_NAME for the SDK, or eval_model in eval.yaml) exists in your project under Build > Deployments. |
What you learned
In this quickstart, you:
- Created a test dataset and chose evaluators for your hosted agent.
- Ran an evaluation against the deployed agent.
- Reviewed aggregated and row-level results.
- Completed each task with the Azure Developer CLI, the Foundry portal, and the Python SDK.
Next steps
Continue improving your evaluation workflow:
- Set up continuous and scheduled evaluations to track your agent's quality in production.
Related content
- Evaluate your AI agents
- Run batch evaluations from the SDK
- Generate a synthetic evaluation dataset to create test queries and evaluators automatically.
- Troubleshoot evaluation and observability issues
- Agent evaluators reference
- What are hosted agents?