[This article is prerelease documentation and is subject to change.]
As AI agents take on critical roles in business processes, reliable, repeatable testing becomes essential. Agent evaluation lets you generate tests that simulate real-world scenarios for your agent. These tests cover more questions faster than manual, case-by-case testing. You can then measure the accuracy, relevance, and quality of the agent's answers, based on the information the agent can access. By using the results from the test set, you can optimize your agent's behavior and validate that it meets your business and quality requirements.
Important
This article contains Microsoft Copilot Studio preview documentation and is subject to change.
Preview features aren't meant for production use and may have restricted functionality. These features are available before an official release so that you can get early access and provide feedback.
If you're building a production-ready agent, see Microsoft Copilot Studio Overview.
Why use automated testing?
Agent evaluation provides automated, structured, repeatable testing. It helps catch problems early, reduces the risk of bad answers, and maintains quality as the agent evolves. It helps ensure the agent meets your business's accuracy and reliability standards and gives you transparency into how the agent is performing. Agent evaluation has different strengths than testing in the test chat.
Agent evaluation measures correctness and performance, not AI ethics or safety problems. An agent might pass all evaluation tests but still, for example, produce an inappropriate answer to a question. You should still use responsible AI reviews and content safety filters; evaluations don't replace those reviews and filters.
How agent evaluation works
Copilot Studio uses a test case for each agent evaluation. A test case is a single message or question that simulates what a user would ask your agent. A test case can also include the answer you expect your agent to reply with. For example:
The question: What are your business hours?
The expected response: We are open from 9 a.m. to 5 p.m. from Monday to Friday.
By using agent evaluation, you can generate, import, or manually write a group of test cases. This group of test cases is called a test set (a simple data sketch follows the list below). A test set allows you to:
Run multiple test cases that cover a broad range of capabilities at once, instead of asking your agent one question at a time.
Analyze your agent's performance with an easily digestible aggregate score and also zoom in on individual test cases.
Test changes to your agents by using the same test set, so you have an objective standard to measure and compare changes in performance.
Quickly create new test sets or modify existing ones to cover changing agent capabilities or requirements.
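For example, a test set can be pictured as a small collection of questions paired with expected responses. The following Python snippet is only an illustrative sketch; the field names and shape are assumptions and don't reflect Copilot Studio's actual import format.

```python
# Hypothetical representation of a test set: a list of test cases, each
# pairing a question with an optional expected response. This shape is an
# illustrative assumption, not Copilot Studio's actual format.
test_set = [
    {
        "question": "What are your business hours?",
        "expected_response": "We are open from 9 a.m. to 5 p.m. from Monday to Friday.",
    },
    {
        "question": "Do you offer weekend support?",
        "expected_response": "Weekend support is available by email only.",
    },
]
```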
The test set also includes the test methods you want to use. You can measure your agent's performance based on the following methods (a conceptual sketch appears after this list):
Exact match or keyword match: How closely your agent's answer to a question matches your expected response.
Semantic similarity: How closely your agent's answer matches the idea or intent of your expected response.
Quality: How well your agent's answers perform using an LLM-based evaluation.
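To make the comparison methods concrete, the following sketch shows simplified stand-ins for exact match, keyword match, and semantic similarity scoring. These functions are illustrative assumptions only; Copilot Studio's actual scoring, including the LLM-based quality evaluation, is internal and isn't exposed as code.

```python
# Conceptual sketch of the comparison methods; these are simplified stand-ins,
# not Copilot Studio's implementation.

def exact_match(answer: str, expected: str) -> bool:
    """True only if the agent's answer matches the expected response verbatim."""
    return answer.strip().lower() == expected.strip().lower()

def keyword_match(answer: str, keywords: list[str]) -> bool:
    """True if every required keyword appears somewhere in the answer."""
    answer_lower = answer.lower()
    return all(keyword.lower() in answer_lower for keyword in keywords)

def semantic_similarity(answer: str, expected: str) -> float:
    """Score how closely the answer matches the idea of the expected response.

    A production system would use embeddings; simple token overlap is used
    here only to illustrate the concept.
    """
    answer_tokens = set(answer.lower().split())
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 0.0
    return len(answer_tokens & expected_tokens) / len(answer_tokens | expected_tokens)
```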
You can also choose a user profile to act as the user sending the questions. This is useful when the agent is configured to respond differently to different users or to grant them access to different resources.
When you select a test set and run an agent evaluation, Copilot Studio sends the questions in the test cases, records the agent's responses, compares those responses against expected responses or a standard of quality, and assigns a score to each test case. You can also see the details, transcript, and activity map for each test case and which resources your agent used to create the response.
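Conceptually, an evaluation run is a loop over the test cases: send each question, record the response, score it, and aggregate the scores. The sketch below assumes a hypothetical ask_agent callable and a score function such as the ones above; it isn't a Copilot Studio API.

```python
# Conceptual sketch of an evaluation run. `ask_agent` and `score` are
# hypothetical placeholders for how the agent is invoked and how a response
# is scored; they are not Copilot Studio APIs.
def run_evaluation(test_set, ask_agent, score):
    results = []
    for case in test_set:
        # Send the test question and record the agent's response.
        response = ask_agent(case["question"])
        # Compare the response against the expected response and assign a score.
        case_score = score(response, case["expected_response"])
        results.append(
            {"question": case["question"], "response": response, "score": case_score}
        )
    # Aggregate the per-case scores into a single overall score.
    aggregate = sum(r["score"] for r in results) / len(results) if results else 0.0
    return aggregate, results
```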
Test chat versus agent evaluation
Each method of testing gives you different insights into your agent's qualities and behavior:
Test chat:
Receives and responds to one question at a time. It's hard to repeat the same tests multiple times.
Allows you to test a full session containing multiple messages.
Allows you to interact with your agent as a user by using a chat interface.
Agent evaluation:
Can create and run multiple test cases at once. You can repeat tests by using the same test set.
Can only test one question and one response per test case. It doesn't test a full conversational session.
Lets you choose different user profiles to simulate different users without completing the interactions yourself.
When you test an agent, use both the test chat and agent evaluation for a full picture of your agent.