Quickstart: Evaluate your hosted agent

Note

Hosted agents and the Azure Developer CLI evaluation experience are currently in preview.

In this quickstart, you evaluate the hosted agent you deployed in Deploy your first hosted agent. You provide a test dataset, choose evaluators, run an evaluation against the deployed agent, and review the scores. Each step shows three ways to do the same task: the Azure Developer CLI (azd), the Microsoft Foundry portal, and the Python SDK.

Evaluation establishes a quality baseline for your agent and lets you set acceptance thresholds, such as a task adherence passing rate, before you release changes to users.
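As a rough illustration of an acceptance threshold, the following Python sketch computes a passing rate from pass and fail counts and compares it to a minimum. The function names and the 0.8 default are hypothetical, chosen only for this example; they are not part of azd or the Foundry SDK.

```python
def passing_rate(passed: int, total: int) -> float:
    """Return the fraction of rows that passed; 0.0 when there are no rows."""
    if passed < 0 or total < 0 or passed > total:
        raise ValueError("counts must satisfy 0 <= passed <= total")
    return passed / total if total else 0.0


def meets_threshold(passed: int, total: int, threshold: float = 0.8) -> bool:
    """True when the passing rate is at or above the (hypothetical) threshold."""
    return passing_rate(passed, total) >= threshold


if __name__ == "__main__":
    # Example: 13 of 15 task adherence rows passed.
    print(meets_threshold(13, 15))
```

A check like this could gate a release pipeline: run the evaluation, read the per-evaluator counts, and block the change when the rate falls below the value your team agrees on.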

Prerequisites

Before you begin, you need:

  • A deployed, invokable hosted agent from Deploy your first hosted agent. For the Azure Developer CLI path, you also need the azd project directory you created in that quickstart.

  • The Foundry User role on the Foundry resource.

  • A chat-completion model deployment in the same Foundry project to use as the judge model that scores responses. You can reuse the model deployment your agent already uses, including the one from the previous quickstart, so you don't need a separate deployment.

    Important

    The Foundry RBAC roles were recently renamed. Foundry User, Foundry Owner, Foundry Account Owner, and Foundry Project Manager were previously named Azure AI User, Azure AI Owner, Azure AI Account Owner, and Azure AI Project Manager. You might still see the previous names in some places while the rename rolls out. The role IDs and core permissions are unchanged by the rename.

Each step offers three paths. Use whichever you prefer:

  • Azure Developer CLI: The azd ai agent extension (azure.ai.agents), version 0.1.40-preview or later, which provides the azd ai agent eval commands. This extension is included in the microsoft.foundry extension you installed in the previous quickstart. Verify the installed version with azd ext list, and run azd ext upgrade microsoft.foundry if needed. Sign in with azd auth login.
  • Foundry portal: Access to the Foundry portal.
  • Python SDK: Python 3.9 or later, and the Azure CLI signed in with az login so that DefaultAzureCredential can authenticate. For installation, see Install the Azure CLI.

Step 1: Confirm your deployed agent

Evaluation runs against a deployed, invokable agent. Confirm your agent is deployed and available before you set up the evaluation.

From your azd project directory, verify the agent is deployed and invokable:

azd ai agent show

Send a test prompt:

azd ai agent invoke "Write a haiku about deploying cloud applications."

You should see a response within a few seconds.
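If you want to script this check, a minimal Python sketch might wrap the same two commands with subprocess. It assumes azd is on your PATH, that you run it from the azd project directory, and that azd ai agent invoke writes the agent's reply to standard output; the check_agent function and its run parameter are hypothetical names for this example.

```python
import subprocess
from typing import Callable


def check_agent(
    prompt: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run `azd ai agent show`, then `azd ai agent invoke <prompt>`.

    Returns the text captured from the invoke command. A non-zero exit
    code raises subprocess.CalledProcessError because check=True is set.
    """
    run(["azd", "ai", "agent", "show"], check=True, capture_output=True, text=True)
    result = run(
        ["azd", "ai", "agent", "invoke", prompt],
        check=True,
        capture_output=True,
        text=True,
    )
    reply = result.stdout.strip()
    if not reply:
        # Assumption: an empty stdout means the agent did not answer.
        raise RuntimeError("agent returned an empty response")
    return reply
```

The run parameter exists only so the function can be exercised without a real deployment, for example by passing a stub that returns a canned CompletedProcess.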

Step 2: Set up built-in evaluators

Start with built-in evaluators to score your agent against a test dataset.

First, create a JSONL file of test queries for your agent. Each line is a JSON object with a query field. Save it inside your agent's source folder, as src/<your-agent-name>/tests/queries.jsonl:

{"query": "Write a haiku about deploying cloud applications."}
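For a larger dataset, it can help to generate and check the file with a script. The following Python sketch writes one JSON object per line with a query field and then confirms that every line parses and carries a non-empty query string. The helper names are hypothetical, and the only schema assumed is the single query field described above.

```python
import json
from pathlib import Path


def write_queries(path: Path, queries: list[str]) -> None:
    """Write one JSON object per line, each with a single "query" field."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for q in queries:
            f.write(json.dumps({"query": q}, ensure_ascii=False) + "\n")


def validate_queries(path: Path) -> int:
    """Return the number of valid rows; raise ValueError on the first bad line."""
    count = 0
    with path.open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: not valid JSON") from exc
            query = row.get("query") if isinstance(row, dict) else None
            if not isinstance(query, str) or not query.strip():
                raise ValueError(f"line {lineno}: missing or empty 'query' field")
            count += 1
    return count
```

For example, calling write_queries(Path("src/<your-agent-name>/tests/queries.jsonl"), [...]) with your agent's folder name creates the tests folder if needed, and validate_queries on the same path returns the row count.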

Then create an eval.yaml file in the same agent source folder, as src/<your-agent-name>/eval.yaml. It points to your dataset and lists the built-in evaluators to apply. The dataset.local_uri path is relative to this folder. Replace <your-agent-name> with your hosted agent's name and <your-chat-completion-deployment> with the judge model deployment:

name: agent-eval
agent:
  name: <your-agent-name>
  kind: hosted
dataset:
  local_uri: tests/queries.jsonl
evaluators:
  - builtin.intent_resolution
  - builtin.task_adherence
options:
  eval_model: <your-chat-completion-deployment>
max_samples: 15

The eval_model value is the judge model that scores responses; you can reuse the deployment your agent already uses.
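Because dataset.local_uri is relative to the folder that holds eval.yaml, a quick path check can catch a misplaced dataset before you run anything. The following Python sketch resolves the path the same way and reports whether the dataset stays inside the agent source folder. Both function names are hypothetical, and the sketch only mirrors the rule stated above; it does not call azd or parse the YAML.

```python
from pathlib import Path


def resolve_dataset(config_path: Path, local_uri: str) -> Path:
    """Resolve dataset.local_uri against the folder that contains eval.yaml."""
    return (config_path.parent / local_uri).resolve()


def dataset_is_inside_agent_folder(config_path: Path, local_uri: str) -> bool:
    """True when the resolved dataset path is under the eval.yaml folder."""
    agent_dir = config_path.parent.resolve()
    dataset = resolve_dataset(config_path, local_uri)
    return agent_dir in dataset.parents


if __name__ == "__main__":
    # "my-agent" is a placeholder for your agent's folder name.
    config = Path("src/my-agent/eval.yaml")
    print(resolve_dataset(config, "tests/queries.jsonl"))
    print(dataset_is_inside_agent_folder(config, "tests/queries.jsonl"))
```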

Step 3: Run the evaluation

Run the suite against your deployed agent. The service sends each test query to the agent, captures the response, and scores it with your selected evaluators.

Note

Target-based evaluation invokes your hosted agent directly. It works with agents that use the responses or invocations protocol with synchronous, non-streaming execution. To evaluate agents that use the A2A or Activity protocol, or other execution patterns such as long-running or streaming, evaluate the traces your agent emits instead. See Trace evaluation.

Run the evaluation from the azd workspace root:

azd ai agent eval run --config eval.yaml

Note

azd ai agent eval run resolves the --config path relative to your agent's source folder under src/ (for example, src/<your-agent-name>/eval.yaml), not the current directory. Keep eval.yaml, and the dataset that its local_uri points to, inside that folder.

The command reads eval.yaml, sends each query to your agent, scores the responses, and prints a summary when it finishes:

Eval run started
   Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27
   Run:  evalrun_5f72ef189ad24790a32128e6f230b131
   (✓) Done  Eval run

Results:    1 total, 1 passed, 0 failed, 0 errored

Per-criteria results:
  intent_resolution: 1 passed, 0 failed, 0 errored
  task_adherence: 1 passed, 0 failed, 0 errored
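If you capture this summary in a script, a small parser can turn it into numbers. The following Python sketch assumes the text layout shown in the sample output above; because the azd evaluation experience is in preview, that layout might change, so treat the regular expressions and the parse_summary name as illustrative.

```python
import re

# Patterns modeled on the sample summary; the real format may differ.
SUMMARY_RE = re.compile(
    r"Results:\s+(\d+) total, (\d+) passed, (\d+) failed, (\d+) errored"
)
CRITERION_RE = re.compile(
    r"^[ \t]*([\w.]+): (\d+) passed, (\d+) failed, (\d+) errored[ \t]*$",
    re.MULTILINE,
)


def parse_summary(output: str) -> dict:
    """Extract overall and per-criterion counts from the printed summary."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no 'Results:' line found in output")
    total, passed, failed, errored = (int(g) for g in match.groups())
    criteria = {
        name: {"passed": int(p), "failed": int(f), "errored": int(e)}
        for name, p, f, e in CRITERION_RE.findall(output)
    }
    return {
        "total": total,
        "passed": passed,
        "failed": failed,
        "errored": errored,
        "criteria": criteria,
    }
```

The returned dictionary keeps each evaluator's counts under its name, so a script can look up criteria["task_adherence"]["passed"] directly.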

Step 4: Review the results

Evaluations typically complete in a few minutes, depending on the number of queries.

List recent evaluations:

azd ai agent eval list

Example output:

    Eval ID                                Name        Status of last run  Runs
    -------                                ----        ------------------  ----
*   eval_b36748dede424e4ba3f8e6c99ca2cf27  agent-eval  Completed           1

* = active eval in current environment

Show the most recent evaluation and its runs:

azd ai agent eval show

Example output:

Eval:   eval_b36748dede424e4ba3f8e6c99ca2cf27
Name:   agent-eval
Agent:  <your-agent-name>
Runs:   1

Recent runs:
  Run ID                                    Status     Passed  Failed  Created
  ------                                    ------     ------  ------  -------
  evalrun_5f72ef189ad24790a32128e6f230b131  Completed  1/1     0       2026-06-17 14:52 UTC

Use the results to confirm which agent version was evaluated and which evaluator scores were produced. To see per-evaluator details and a link to the report in the Foundry portal, run azd ai agent eval show <eval-id> --eval-run-id <run-id>.
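To avoid copying IDs by hand, a script can pull them from the run output and assemble that command. The following Python sketch assumes the IDs follow the eval_ and evalrun_ prefix-plus-hex pattern seen in the sample output, which is an observation from this quickstart rather than a documented format; detail_command is a hypothetical helper name.

```python
import re
import shlex

# ID shapes inferred from the sample output; treat them as an assumption.
EVAL_ID_RE = re.compile(r"\beval_[0-9a-f]+\b")
RUN_ID_RE = re.compile(r"\bevalrun_[0-9a-f]+\b")


def detail_command(run_output: str) -> list[str]:
    """Build the `azd ai agent eval show` command for the first IDs found."""
    eval_id = EVAL_ID_RE.search(run_output)
    run_id = RUN_ID_RE.search(run_output)
    if eval_id is None or run_id is None:
        raise ValueError("could not find both an eval ID and a run ID")
    return [
        "azd", "ai", "agent", "eval", "show",
        eval_id.group(),
        "--eval-run-id", run_id.group(),
    ]


if __name__ == "__main__":
    text = (
        "   Eval: eval_b36748dede424e4ba3f8e6c99ca2cf27\n"
        "   Run:  evalrun_5f72ef189ad24790a32128e6f230b131\n"
    )
    # Print a shell-ready command line.
    print(shlex.join(detail_command(text)))
```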

Clean up resources

This quickstart registers a dataset, an evaluation, and run history in your Foundry project. These assets incur little or no ongoing cost.

To remove the hosted agent and the Azure resources you created, follow the cleanup steps in Deploy your first hosted agent.

Troubleshooting

  • Issue: azd ai agent eval command not found
    Solution: Run azd ext list and verify the azd ai agent extension is 0.1.40-preview or later. Upgrade with azd ext upgrade microsoft.foundry.

  • Issue: azd ai agent eval run fails to find the agent
    Solution: Confirm the agent is deployed and invokable with azd ai agent show. Redeploy with azd deploy if needed.

  • Issue: ModuleNotFoundError for azure.ai.projects or azure.identity
    Solution: Install the SDK: pip install "azure-ai-projects>=2.0.0" azure-identity.

  • Issue: AuthenticationError, DefaultAzureCredential, or Forbidden failure
    Solution: Sign in with az login (or azd auth login for the CLI path), and confirm you have the Foundry User role on the project. Dataset uploads also require write access to the project's storage.

  • Issue: Agent target not found
    Solution: Verify the agent name and version with project_client.agents.get("<your-agent-name>") or project_client.agents.list().

  • Issue: Many errored rows or unexpectedly low scores
    Solution: Open the report URL and check whether rows failed with agent response or evaluator errors. Fix the underlying errors, then rerun the evaluation.

  • Issue: Eval model deployment not found
    Solution: Verify that the judge model deployment (AZURE_AI_MODEL_DEPLOYMENT_NAME for the SDK, or eval_model in eval.yaml) exists in your project under Build > Deployments.

What you learned

In this quickstart, you:

  • Created a test dataset and chose evaluators for your hosted agent.
  • Ran an evaluation against the deployed agent.
  • Reviewed aggregated and row-level results.
  • Completed each task with the Azure Developer CLI, the Foundry portal, and the Python SDK.

Next steps

Continue improving your evaluation workflow: