Create an evaluation dataset (preview)

Important

Agent Optimizer is currently in limited preview and only available through a sign-up process. To access the service, complete the intake form. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Agent Optimizer evaluates your agent against a dataset: a collection of tasks, each with its own evaluation criteria. You can generate a dataset automatically from the CLI or create one manually for full control.

Prerequisites

A Foundry project with a deployed hosted agent
The azure.ai.agents CLI extension installed (see Quickstart: Optimize a hosted agent)

Generate a dataset (recommended)

The fastest way to create an evaluation dataset is with azd ai agent eval init. This command generates a dataset and adaptive evaluators tuned to your agent's domain:

azd ai agent eval init

The interactive wizard auto-detects your agent from azure.yaml and prompts for a generation instruction describing what your agent does and what scenarios to test.

Example output:

Detecting agent...
  Found: my-support-agent (hosted)

Generation prompt
  Describe what this agent does and what scenarios to test.
  > This agent handles customer support for electronics. Test returns, troubleshooting, and out-of-scope requests.

Generating dataset and evaluators...
  Dataset generation:    done  (registered: my-support-agent-eval-seed/v1)
  Evaluator generation:  done  (registered: my-support-agent-quality/v1)

Eval suite created
  Config:     eval.yaml
  Dataset:    .azure/.foundry/datasets/my-support-agent-eval-seed.v1.jsonl
  Evaluator:  .azure/.foundry/evaluators/my-support-agent-quality.v1.yaml

Review the generated assets, then run:
  azd ai agent eval run

Non-interactive mode

For scripted workflows, pass the inputs directly:

azd ai agent eval init \
  --gen-instruction "Customer support agent. Test refund handling, troubleshooting, and out-of-scope deflection." \
  --eval-model gpt-4.1-mini \
  --max-samples 50
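
If you drive the CLI from a Python script, the same call can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown above. The helper names `build_eval_init_command` and `run_eval_init` are hypothetical and aren't part of the CLI.

```python
import subprocess


def build_eval_init_command(gen_instruction, eval_model, max_samples):
    """Build the argument list for a non-interactive `azd ai agent eval init` call."""
    return [
        "azd", "ai", "agent", "eval", "init",
        "--gen-instruction", gen_instruction,
        "--eval-model", eval_model,
        "--max-samples", str(max_samples),
    ]


def run_eval_init(gen_instruction, eval_model, max_samples):
    """Run the command; raises CalledProcessError if azd exits with a non-zero code."""
    command = build_eval_init_command(gen_instruction, eval_model, max_samples)
    return subprocess.run(command, check=True)


# Same inputs as the shell example above. Only the argument list is printed here;
# call run_eval_init(...) on a machine where azd and the extension are installed.
command = build_eval_init_command(
    "Customer support agent. Test refund handling, troubleshooting, and out-of-scope deflection.",
    "gpt-4.1-mini",
    50,
)
print(command)
```

Passing the arguments as a list avoids shell quoting problems when the generation instruction contains spaces or punctuation.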

Use your own data with generated evaluators

If you already have a golden dataset but want auto-generated evaluators:

azd ai agent eval init --dataset ./my-golden-dataset.jsonl

Run optimization with the generated config

After eval init completes, azd ai agent optimize auto-detects the generated eval.yaml:

azd ai agent optimize

Or pass it explicitly:

azd ai agent optimize --config eval.yaml

For the full evaluation CLI workflow, see Run agent evaluations with the azd CLI.

Create a custom dataset manually (advanced)

For full control over evaluation tasks and criteria, create a JSONL dataset by hand. This approach is useful when you need to test exact scenarios or want to use production data directly.

By default, azd ai agent optimize uses a built-in dataset with 3 general coding tasks and 25 criteria. For meaningful optimization of your specific agent, create a custom dataset that reflects your agent's real-world use cases.

Dataset format

Datasets use JSONL (JSON Lines) format. Each line is one JSON object that represents a single evaluation task: an individual scenario with a name, a prompt, and evaluation criteria.

{"name": "task_1", "prompt": "Your prompt here", "criteria": [{"name": "criterion_name", "instruction": "What the evaluator checks for"}]}
{"name": "task_2", "prompt": "Another prompt", "criteria": [{"name": "check_1", "instruction": "..."}, {"name": "check_2", "instruction": "..."}]}
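
Lines like these can also be produced from Python dictionaries instead of being typed by hand. The following is a minimal sketch, not part of the CLI. The helper name `to_jsonl` and the sample task are illustrative assumptions.

```python
import json


def to_jsonl(tasks):
    """Serialize a list of task dicts to JSONL text, one JSON object per line."""
    # json.dumps escapes newlines inside strings and never emits trailing commas,
    # so each task occupies exactly one valid line.
    return "".join(json.dumps(task, ensure_ascii=False) + "\n" for task in tasks)


# Hypothetical task, shaped like the examples above.
tasks = [
    {
        "name": "task_1",
        "prompt": "Your prompt here",
        "criteria": [
            {"name": "criterion_name", "instruction": "What the evaluator checks for"},
        ],
    },
]

jsonl_text = to_jsonl(tasks)
print(jsonl_text, end="")

# To save the dataset, write the text to a file of your choice, for example:
# with open("my_eval_dataset.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```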

Field reference

Field	Required	Description
`name`	Yes	Unique task identifier (for example, `"greeting"`, `"math_test"`)
`prompt`	Yes	The message sent to the agent
`criteria`	Yes	Array of evaluation criteria — rules that define what "good" looks like for the task
`criteria[].name`	Yes	Short name for the criterion (for example, `"is_polite"`)
`criteria[].instruction`	Yes	What the evaluator checks. Be specific and testable. The built-in evaluator (`builtin.task_adherence`) scores each criterion independently as a binary value (0 or 1).
`groundTruth`	No	Expected answer (used by some evaluators for reference)
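
The required fields in this table can be checked before a run. The following is a minimal sketch that assumes each task is already parsed into a Python dict. The helper names are hypothetical, and treating an empty `criteria` array as a problem is an assumption, because the table only states that the field is required.

```python
def check_task(task):
    """Return a list of problems found in one task dict (an empty list means OK)."""
    problems = []
    for field in ("name", "prompt"):
        value = task.get(field)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty required field: {field}")
    criteria = task.get("criteria")
    if not isinstance(criteria, list) or not criteria:
        # Assumption: an empty criteria array is reported as a problem.
        problems.append("missing or empty required field: criteria")
    else:
        for i, criterion in enumerate(criteria):
            for field in ("name", "instruction"):
                if not isinstance(criterion, dict) or not criterion.get(field):
                    problems.append(f"criteria[{i}] is missing {field}")
    # groundTruth is optional, so its absence is not reported.
    return problems


def check_unique_names(tasks):
    """Return task names that appear more than once, in first-duplicate order."""
    seen, duplicates = set(), []
    for task in tasks:
        name = task.get("name")
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


task = {
    "name": "greeting",
    "prompt": "Hello!",
    "criteria": [{"name": "is_polite", "instruction": "Response must be polite"}],
}
print(check_task(task))                  # []
print(check_unique_names([task, task]))  # ['greeting']
```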

Example: Customer support agent

{"name": "refund_policy", "prompt": "What is your refund policy?", "criteria": [{"name": "mentions_30_days", "instruction": "Response must mention the 30-day refund window"}, {"name": "polite_tone", "instruction": "Response must be professional and empathetic"}]}
{"name": "order_status", "prompt": "Where is my order #12345?", "criteria": [{"name": "asks_for_details", "instruction": "Agent should ask for email or order details to look up the order"}, {"name": "no_hallucination", "instruction": "Agent must NOT make up a fake order status"}]}
{"name": "out_of_scope", "prompt": "Can you help me fix my car?", "criteria": [{"name": "polite_decline", "instruction": "Agent should politely explain this is outside its scope"}, {"name": "redirect", "instruction": "Agent should suggest contacting an appropriate service"}]}

Example: Coding assistant

{"name": "python_function", "prompt": "Write a Python function to reverse a linked list", "criteria": [{"name": "correct_algorithm", "instruction": "The function must correctly reverse a singly linked list"}, {"name": "handles_empty", "instruction": "The function must handle an empty list without errors"}, {"name": "includes_docstring", "instruction": "The function should include a descriptive docstring"}]}
{"name": "explain_concept", "prompt": "Explain what a closure is in JavaScript", "criteria": [{"name": "accurate_definition", "instruction": "Must correctly define a closure as a function that captures variables from its enclosing scope"}, {"name": "includes_example", "instruction": "Must include at least one working code example"}]}

Use a custom dataset

Reference your dataset in a YAML config file:

# eval.yaml
agent:
  name: my-agent

dataset_file: ./my_eval_dataset.jsonl

evaluators:
  - builtin.task_adherence

options:
  eval_model: gpt-4.1-mini
  optimization_model: gpt-5.1
  max_iterations: 10

Then run:

azd ai agent optimize --config eval.yaml

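Before you run the command, validate the JSONL syntax. The following one-liner prints nothing if every line parses and raises an exception at the first line that doesn't: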

python -c "import json; [json.loads(l) for l in open('my_eval_dataset.jsonl')]"
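
The one-liner stops at the first bad line and doesn't say which line it was. If you want every problem reported with its line number, a slightly longer check helps. The following is a minimal sketch. The function name `find_invalid_lines` is hypothetical, and flagging blank lines and non-object values is an assumption about what the optimizer accepts.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, message) pairs for lines that aren't valid JSON objects."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are reported rather than skipped.
            errors.append((number, "blank line"))
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            errors.append((number, "not a JSON object"))
    return errors


# In-memory sample; line 2 has a trailing comma.
sample = [
    '{"name": "a", "prompt": "p", "criteria": []}',
    '{"name": "b", "prompt": "p",}',
]
for number, message in find_invalid_lines(sample):
    print(f"line {number}: {message}")

# To check a real file, pass the open file object, which iterates line by line:
# with open("my_eval_dataset.jsonl", encoding="utf-8") as f:
#     print(find_invalid_lines(f))
```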

Tips for writing good datasets

Be specific in criteria

Bad:

{"name": "good_answer", "instruction": "The response should be good"}

Good:

{"name": "mentions_30_days", "instruction": "Response must explicitly mention the 30-day refund window"}

Specific criteria give the evaluator a clear, binary signal. Vague criteria lead to inconsistent scoring.

Include edge cases

Test beyond the happy path. Include:

Out-of-scope requests — Inputs your agent should decline or redirect
Ambiguous queries — Tasks where the agent should ask for clarification
Adversarial inputs — Attempts to trick the agent into bad behavior
Multi-step tasks — Complex requests that require structured reasoning

Size guidelines

Dataset size	Trade-off
3–5 tasks	Quick iteration, limited signal
6–10 tasks	Good balance of speed and coverage
11–20 tasks	Comprehensive evaluation, longer runs
More than 20 tasks	Thorough but slow; consider for final validation

Each task can have multiple criteria. A dataset with 5 tasks and 4 criteria per task yields 5 × 4 = 20 evaluation signals.
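
To see how many signals an existing dataset produces, count its criteria. The following is a minimal sketch that assumes the tasks are already loaded as Python dicts. The function name `count_signals` and the placeholder dataset are hypothetical.

```python
def count_signals(tasks):
    """Total number of criteria across all tasks (one evaluation signal each)."""
    return sum(len(task["criteria"]) for task in tasks)


# Placeholder dataset: 5 tasks with 4 criteria each.
tasks = [
    {
        "name": f"task_{i}",
        "prompt": "...",
        "criteria": [{"name": f"check_{j}", "instruction": "..."} for j in range(4)],
    }
    for i in range(5)
]
print(count_signals(tasks))  # 20
```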

Write prompts like real users

Use actual messages from your users if possible. Real prompts capture the vocabulary and context that your agent faces in production.

Criteria are scored independently

Each criterion gets a binary score (0 or 1). The task score is the average of its criteria scores. The overall score is the average of the task scores across all tasks. This means:

A task with 4 criteria where 3 pass scores 0.75
An agent that passes every criterion on 2 of 3 tasks and fails every criterion on the third scores 0.67 overall
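
The averaging described above can be reproduced in a few lines of Python. The following is a minimal sketch of the arithmetic only, not the optimizer's implementation. The function names and the sample results are hypothetical.

```python
def task_score(criterion_results):
    """Average of the binary criterion scores (each 0 or 1) for one task."""
    return sum(criterion_results) / len(criterion_results)


def overall_score(results_by_task):
    """Average of the task scores across all tasks."""
    scores = [task_score(results) for results in results_by_task.values()]
    return sum(scores) / len(scores)


# One task with 4 criteria where 3 pass.
print(task_score([1, 1, 1, 0]))  # 0.75

# Three tasks: two pass every criterion, one fails every criterion.
results = {
    "refund_policy": [1, 1],
    "order_status": [1, 1],
    "out_of_scope": [0, 0],
}
print(round(overall_score(results), 2))  # 0.67
```

Because the overall score averages task scores rather than pooling all criteria, every task carries equal weight regardless of how many criteria it has. For example, `overall_score({"a": [1], "b": [0, 0, 0]})` is 0.5, even though only 1 of 4 criteria passed.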

Ground truth is optional

The groundTruth field provides a reference answer for evaluators that support it. This field isn't required. The builtin.task_adherence evaluator works entirely from criteria instructions.

{"name": "geography_fact", "prompt": "What is the largest city in France by population?", "groundTruth": "Paris", "criteria": [{"name": "correct_answer", "instruction": "Response must state that Paris is the largest city in France by population"}]}

Troubleshooting

Problem	Cause	Fix
`dataset_file not found`	Wrong path in `eval.yaml`	Use a path relative to the config file location
`invalid JSON on line N`	Malformed JSONL	Validate that each line is valid JSON. Check for trailing commas.
Scores are inconsistent between runs	Vague criteria	Make criteria specific and binary-testable
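
To debug a `dataset_file not found` error, it can help to see which file a relative path points to. The following is a minimal sketch that assumes, as the table suggests, that `dataset_file` is resolved against the directory that contains the config file. The helper name `resolve_dataset_path` is hypothetical.

```python
from pathlib import Path


def resolve_dataset_path(config_path, dataset_file):
    """Resolve dataset_file relative to the directory that holds the config file."""
    # Assumption: relative paths are anchored at the config file's directory,
    # not at the current working directory.
    return (Path(config_path).resolve().parent / dataset_file).resolve()


dataset_path = resolve_dataset_path("eval.yaml", "./my_eval_dataset.jsonl")
print(dataset_path)
print("exists" if dataset_path.exists() else "not found")
```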

Last updated on 2026-06-02