Run an evaluation for an agent (preview)

[This article is prerelease documentation and is subject to change.]

After creating a test set with conversations, you can run an evaluation to measure your agent's performance. The evaluation processes each conversation and produces scored results based on the selected test method.

Note

This article reflects the new agent experience in Microsoft Copilot Studio, which is currently available as a production-ready preview. Learn about the two experiences in Classic vs. new agent experience.

  • Production-ready previews are subject to supplemental terms of use.
  • Some capabilities available in the classic experience aren't yet available in the new experience.
  • Agents created in the new experience can't be converted to the classic experience.

Prerequisites

  • An evaluation with a test set, created for your agent in Copilot Studio.

Run an evaluation

  1. Open your agent in Copilot Studio.
  2. Select the Evaluate tab.
  3. Make sure the evaluation you want to run is selected in the Evaluation dropdown.
  4. Add at least one conversation to the test set if you haven't already.
  5. In the Configure test set panel, verify:
    • The evaluation Name is set.
    • The Test method is configured (for example, General quality).
  6. Select Evaluate to start the evaluation.
  7. Wait for the evaluation to finish. It processes each conversation in the test set, which might take several minutes depending on the number of conversations.

Tip

You can run the same evaluation multiple times. Each run is saved separately, so you can compare results across runs to see how changes to your agent affect quality.

Re-run an evaluation

After making changes to your agent's instructions, knowledge, or tools, re-run the evaluation to measure the impact:

  1. On the Evaluate tab, select the evaluation you want to re-run from the Evaluation dropdown.
  2. Under Recent results, select Evaluate test set again, or select the Evaluate test set icon within the test set, to start a new run.
  3. Compare the new run's results with previous runs to see whether the changes improved or degraded performance.
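
If you track scores outside Copilot Studio, the comparison in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not a Copilot Studio feature or API: the conversation names, the numeric 0 to 100 scale, and the dictionary layout are all assumptions, with the values imagined as copied by hand from two runs of the same evaluation.

```python
from statistics import mean

# Hypothetical per-conversation scores, copied by hand from two runs.
# The conversation names, the 0-100 scale, and this layout are assumptions,
# not a format that Copilot Studio is known to produce.
previous_run = {"Reset password": 72, "Check order status": 85, "Cancel subscription": 60}
new_run = {"Reset password": 80, "Check order status": 85, "Cancel subscription": 55}


def compare_runs(previous, new):
    """Return the score change for each conversation present in both runs."""
    shared = sorted(previous.keys() & new.keys())
    return {name: new[name] - previous[name] for name in shared}


def summarize(previous, new):
    """Group conversations by direction of change and report the mean change."""
    deltas = compare_runs(previous, new)
    return {
        "improved": [n for n, d in deltas.items() if d > 0],
        "degraded": [n for n, d in deltas.items() if d < 0],
        "unchanged": [n for n, d in deltas.items() if d == 0],
        "mean_change": mean(deltas.values()) if deltas else 0.0,
    }


summary = summarize(previous_run, new_run)
print(summary)
```

Looking at per-conversation changes, rather than only an average, helps surface cases where an overall improvement hides a regression in a specific conversation.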