Review the agent evaluation checklist

Agent evaluation should be an iterative process that starts in the agent envisioning and design phase and continues through deployment and regression detection. This template provides the essential elements for building evaluation test sets and shows how to implement and iterate on them through a four-stage structure across the agent's lifecycle.

Tip

Download the editable checklist template.

Stage 1: Build foundational evaluation test sets

Goal: Create and run a foundational evaluation test set that assesses the agent's core scenarios.

An evaluation test set is a group of test cases. A test case is an individual prompt-and-response pair used to evaluate an agent's answer to a specific question. It includes a test prompt and an optional expected response (assertion) derived directly from the agent's instruction requirements. A test case should also specify the acceptance criteria and the test method used to evaluate quality.

| Agent scenario¹ | Test prompt (sample question sent to the agent) | Expected response | Acceptance criteria² (define what a successful response looks like: what passes and what doesn't) |
| --- | --- | --- | --- |
| The agent should answer policy questions based on the policy knowledge article. | "How many sick leave days does an employee get?" | "30 days. <citation>" | The response must contain the exact text from the policy knowledge article (text match) and must include a citation. |
| The agent shouldn't answer questions beyond the policy knowledge article; it should direct the user to human HR support. | "How many sick leave days does an employee get?" | "The policy document doesn't specify the sick leave days. Consult HR about your sick leave policy." | A response to this out-of-scope case must route the user to human HR support. |

Tip

¹ Agent scenario: A foundational test set should include test cases that cover the agent's key scenarios or use cases. Use the agent scenario as guidance and focus on what the agent is intended to handle or avoid. This process helps you compile a targeted list of test prompts and should be closely coordinated with the development of agent instructions. To determine the right number of test cases, start with one test prompt for each key scenario. Begin with a small set of test cases, then iterate and refine as you gain insights and improve coverage.

² Acceptance criteria: Clearly define what constitutes success. This definition can be challenging at first, so consider refining your criteria through iteration. Run the test prompt, review the response, and evaluate its quality by asking: Does it answer the main question? Does it use the correct information? Are the tone and style appropriate? Does it respect sharing permissions? Your insights from these questions help you establish acceptance criteria and, if needed, an expected response.
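
To keep the test set versioned and rerunnable, you can capture each test case as structured data. The following Python sketch shows one possible shape for the sick leave example above; the class and field names are illustrative assumptions, not a Copilot Studio schema.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    scenario: str             # what the agent is intended to handle or avoid
    test_prompt: str          # sample question sent to the agent
    expected_response: str    # optional assertion derived from the agent instructions
    acceptance_criteria: str  # what passes and what doesn't
    test_method: str = "human review"  # for example, text match or citation check

sick_leave_case = TestCase(
    scenario="Answer policy content based on the policy knowledge article.",
    test_prompt="How many sick leave days does an employee get?",
    expected_response="30 days. <citation>",
    acceptance_criteria="Exact text match with the policy article, plus a citation.",
    test_method="text match + citation check",
)
```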

Stage 2: Establish a baseline and improve

Goal: Run evaluations and establish baseline metrics to benchmark and improve.

You can perform evaluation manually or use specialized tools. For manual evaluation, send the test prompt to the agent, review the response, use human judgment to determine if it meets the acceptance criteria, and record the result. Microsoft offers tools for agent evaluation, including the Copilot Studio agent evaluation feature.
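
Whichever method you use, record each judgment in a consistent form so the baseline is reproducible. The Python sketch below records human pass-or-fail judgments and computes an overall pass rate; the EvaluationResult structure and the example entries are illustrative assumptions, not part of any Microsoft tooling.

```python
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    test_case_id: str
    agent_response: str
    passed: bool              # human judgment against the acceptance criteria
    notes: str = ""

def pass_rate(results: list[EvaluationResult]) -> float:
    """Overall pass rate for a run, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)

baseline_run = [
    EvaluationResult("policy-001", "30 days. <citation>", passed=True),
    EvaluationResult("policy-002", "30 days.", passed=False,
                     notes="Did not route to human HR support."),
]
print(f"Baseline pass rate: {pass_rate(baseline_run):.0f}%")  # Baseline pass rate: 50%
```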

Establish the baseline

  • Run the foundational test set against the agent.
  • Document pass or fail for each test case.
  • Calculate the overall pass rate: ______%.
  • Record the agent version and baseline date: ___________.

Root cause analysis and iteration

Review the evaluation results to identify false positives and true negatives for further analysis. A false positive is an answer that was marked as passing but should have failed based on human judgment. A true negative is an answer that was correctly identified as a failure. Assess the failed cases from two perspectives:

  • Test case issue: Is the test prompt, expected response, or acceptance criteria causing the failure?
  • Agent design issue: Does the failure indicate unclear agent instructions, or flaws in the knowledge or tool configuration?

Identify the root cause and improve by either refining the test case or improving the agent design.

Tip

Evaluation passing score: Agents can produce varying responses to the same prompt due to their probabilistic nature. This variability might cause answers to pass or fail based on how strict the acceptance criteria are. To ensure reliable evaluation, run each test set multiple times and calculate the average success rate. Aim for a realistic pass rate of 80-90%, based on your business needs.
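
As a concrete illustration of that averaging step, the Python sketch below runs the same test set several times and averages the pass rates. The run_test_set callable is a hypothetical stand-in for however you execute one full pass of the test set; the canned rates only simulate three real runs.

```python
from typing import Callable

def average_pass_rate(run_test_set: Callable[[], float], runs: int = 3) -> float:
    """Run the full test set `runs` times and average the resulting pass rates."""
    rates = [run_test_set() for _ in range(runs)]
    return sum(rates) / len(rates)

# Canned pass rates stand in for three real runs of the test set.
canned = iter([85.0, 90.0, 80.0])
average = average_pass_rate(lambda: next(canned), runs=3)
print(f"Average pass rate over 3 runs: {average:.0f}%  (target: 80-90%)")
```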

Stage 3: Implement systematic expansion

Goal: Build comprehensive evaluation suites across different agent quality categories.

Stages 1 and 2 established the foundational test set for the agent's primary use cases. Next, broaden your evaluation by creating test sets that assess various agent quality categories. The following list suggests categories that address different aspects of quality.

  • Foundational core: The "must pass" set. It gauges essential response quality at deployment and detects regressions during operation.
  • Agent robustness: One of an agent's core advantages over traditional software is its robustness in handling varied use cases. The agent should handle this variance gracefully, and you can evaluate it with dedicated test cases that cover questions such as:
    • How does the agent respond to the same question phrased in different terms?
    • How does the agent handle rich context provided in the prompt?
    • How does the agent handle multiple intents in a single prompt?
    • Can the agent answer user-specific requests correctly?
  • Architecture test: Evaluate the agent's functional performance. Dimensions can include:
    • Tool and action calls
    • Knowledge retrieval and citation behavior
    • Routing logic
    • Handoff integration
  • Edge cases: How the agent handles edge cases with guardrails:
    • Boundary conditions
    • Disallowed and out-of-scope behaviors
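
To make these categories actionable during evaluation runs, tag each result with its quality category and report a pass rate per category. The Python sketch below shows that reporting; the category names and recorded results are illustrative assumptions.

```python
from collections import defaultdict

# (quality_category, passed) pairs recorded from one evaluation run.
results = [
    ("foundational_core", True),
    ("foundational_core", True),
    ("agent_robustness", True),
    ("agent_robustness", False),
    ("architecture_test", True),
    ("edge_cases", False),
]

by_category: dict[str, list[bool]] = defaultdict(list)
for category, passed in results:
    by_category[category].append(passed)

for category, outcomes in sorted(by_category.items()):
    rate = 100.0 * sum(outcomes) / len(outcomes)
    print(f"{category}: {rate:.0f}% pass rate")
```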

Tip

Category purpose reference:

  • Core fails: Something is broken or isn't working. Investigate recent changes.
  • Robustness fails: Agent is too strict. It might be overly focused on specific phrasings.
  • Architecture fails: A specific component or workflow needs debugging.
  • Edge cases fail: Guardrails need improvement. Strengthen boundaries.  

Stage 4: Establish a continuous quality improvement evaluation operation

Goal: Establish continuous evaluation monitoring to maintain agent quality during operation.

Once you deploy an agent to production, it enters a stable phase. To maintain quality and quickly detect regressions or issues from product changes (such as model upgrades or knowledge system updates) or evolving use cases, set up an ongoing evaluation operation. Schedule regular evaluation runs or trigger them based on specific events for quality assurance.

  • Set up a regular evaluation maintenance cadence.
  • Suggested full suite evaluation triggers:
    • Model change
    • Major knowledge setup update
    • New tool or connector integrations
    • Production incident
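
When one of these triggers fires, or a scheduled run completes, a simple way to detect regressions is to compare the run's per-category pass rates against the recorded baseline and flag any category that drops beyond a chosen threshold. The Python sketch below shows that comparison; the baseline numbers and the five-point threshold are illustrative assumptions.

```python
# Per-category pass rates recorded when the baseline was established.
BASELINE = {"foundational_core": 98.0, "agent_robustness": 92.0, "edge_cases": 90.0}

def detect_regressions(current: dict[str, float],
                       baseline: dict[str, float] = BASELINE,
                       threshold: float = 5.0) -> list[str]:
    """Return the categories whose pass rate dropped more than `threshold` points."""
    return [
        category
        for category, baseline_rate in baseline.items()
        if baseline_rate - current.get(category, 0.0) > threshold
    ]

current_run = {"foundational_core": 98.0, "agent_robustness": 84.0, "edge_cases": 91.0}
print(detect_regressions(current_run))  # ['agent_robustness'] -> investigate before release
```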

Tip

Success indicator: You've operationalized successfully when you can address stakeholder concerns with specifics instead of saying, "The agent seems okay."

You say: "Policy compliance is at 98%, but Personalization dropped to 87%—specifically, tenure-based policies aren't being applied. We identified the root cause and are iterating."