Evaluate an agent (preview)

[This article is prerelease documentation and is subject to change.]

The Evaluate tab in the new agent experience provides structured, repeatable testing for your agent. Use evaluations to measure agent quality across test cases, track improvements over time, and validate your agent's behavior before publishing.

Note

This article reflects the new agent experience in Microsoft Copilot Studio, which is currently available as a production-ready preview. Learn about the two experiences in Classic vs. new agent experience.

  • Production-ready previews are subject to supplemental terms of use.
  • Some capabilities available in the classic experience aren't yet available in the new experience.
  • Agents created in the new experience can't be converted to the classic experience.

What is agent evaluation?

Agent evaluation lets you systematically test your agent's responses against quality standards. Instead of manually testing each scenario in the Preview tab, you create evaluations with test conversations, define how responses should be measured, and run evaluations to get quantitative results.

Evaluation helps you answer questions like:

  • Does the agent answer correctly across a range of expected scenarios?
  • Did a configuration change improve or degrade response quality?
  • Are tools being invoked when they should be?

Key concepts

The following concepts are key when running evaluations in Copilot Studio:

Conversations

A conversation is a test case that represents a scenario you want your agent to handle. Each conversation includes user messages and, optionally, expected agent responses. You organize conversations into evaluations. You can create conversations manually, generate them with AI, or upload them from a CSV file.
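If you plan to upload conversations from a CSV file, it can help to generate the file from a script so the test cases stay under version control. The following Python sketch shows one way to do that. The column names user_message and expected_response are assumptions made for illustration only; check the CSV template offered on the Evaluate tab for the actual layout before uploading.

```python
import csv
import io

# Hypothetical column names: the real upload template may use different headers.
FIELDNAMES = ["user_message", "expected_response"]

# Each row is one test conversation. The expected response is optional,
# so it can be left empty.
test_cases = [
    {
        "user_message": "What are your store hours?",
        "expected_response": "We are open 9 AM to 5 PM, Monday through Friday.",
    },
    {
        "user_message": "How do I reset my password?",
        "expected_response": "",
    },
]


def conversations_to_csv(rows):
    """Serialize test conversations to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = conversations_to_csv(test_cases)
print(csv_text)
```

The csv module quotes any field that contains a comma, so responses with punctuation survive the round trip.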

Evaluations

An evaluation is a named test set that combines conversations with a test method. You create evaluations from the Evaluate tab, add conversations to them, and then run them to produce scored results.
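To make the relationship between these concepts concrete, the following Python sketch models an evaluation as a name, a test method, and a list of conversations. It is a mental model only, not a Copilot Studio API: the class names, field names, and the summary helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Conversation:
    """One test case: user messages plus an optional expected response."""
    user_messages: List[str]
    expected_response: Optional[str] = None


@dataclass
class Evaluation:
    """A named test set that pairs conversations with a test method."""
    name: str
    test_method: str
    conversations: List[Conversation] = field(default_factory=list)

    def add_conversation(self, conversation: Conversation) -> None:
        self.conversations.append(conversation)

    def summary(self) -> str:
        # Hypothetical helper for illustration; not part of any product API.
        return (
            f"{self.name}: {len(self.conversations)} conversation(s), "
            f"test method = {self.test_method}"
        )


evaluation = Evaluation(name="Store FAQ baseline", test_method="General quality")
evaluation.add_conversation(Conversation(["What are your store hours?"]))
evaluation.add_conversation(
    Conversation(["How do I return an item?"], expected_response="Within 30 days.")
)
print(evaluation.summary())
```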

Test methods

Test methods define how the agent's responses are scored. Currently, the only available test method is General quality, an AI-based assessment of whether responses meet quality standards such as relevance and completeness.

Note

The General quality test method doesn't compare responses to expected answers.

User profile

Evaluations run under a user profile. You choose which authenticated profile runs the evaluation so that the agent's tools and connections can be exercised during the test.

Evaluation workflow

The typical evaluation workflow is:

  1. Create an evaluation: On the Evaluate tab, start a new evaluation. See Create a test set for an agent.
  2. Add conversations: Add test conversations by writing them manually, generating them with AI, or uploading a CSV file.
  3. Configure: Name the evaluation, select a test method, choose the agent version, and set the user profile.
  4. Run the evaluation: Select Evaluate to start the run, then wait for the results. See Run an evaluation for an agent.
  5. Review results: Analyze scores and identify areas for improvement. See View evaluation results for an agent.
  6. Iterate: Adjust your agent's instructions, knowledge, or tools, then run the evaluation again to measure the impact.
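The iterate step comes down to comparing the scores from two runs of the same evaluation. The following Python sketch shows one way to do that comparison outside the product, assuming you have recorded a numeric score per conversation for each run. The 0-to-1 score scale, the conversation IDs, and the tolerance threshold are assumptions for illustration; they aren't taken from the Copilot Studio results view.

```python
from statistics import mean


def compare_runs(baseline, candidate, tolerance=0.05):
    """Compare per-conversation scores from two evaluation runs.

    baseline and candidate map a conversation ID to a score.
    Only conversations present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {cid: candidate[cid] - baseline[cid] for cid in shared}
    return {
        "mean_delta": mean(deltas.values()) if deltas else 0.0,
        # Changes smaller than the tolerance are treated as noise.
        "regressions": [cid for cid in shared if deltas[cid] < -tolerance],
        "improvements": [cid for cid in shared if deltas[cid] > tolerance],
    }


# Hypothetical scores recorded before and after an instruction change.
baseline = {"store-hours": 0.90, "password-reset": 0.60, "refund-policy": 0.75}
candidate = {"store-hours": 0.70, "password-reset": 0.85, "refund-policy": 0.75}

report = compare_runs(baseline, candidate)
print("Regressed:", report["regressions"])
print("Improved:", report["improvements"])
```

Looking at individual conversations as well as the mean matters here: in this example the average barely moves, yet one scenario clearly regressed while another improved.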