As AI agents take on critical roles in business processes, the need for reliable, repeatable testing becomes essential. Agent evaluation lets you generate tests that simulate real-world scenarios for your agent. These tests cover more questions and conversations faster than manual, case-by-case testing. You can then measure the accuracy, relevance, and quality of your agent's answers, based on the information the agent can access. By using the results from the test set, you can optimize your agent's behavior and validate that it meets your business and quality requirements.
Why use automated testing?
Agent evaluation provides automated, structured testing. It helps catch problems early, reduces the risk of bad answers, and maintains quality as the agent evolves. This process brings a repeatable form of quality assurance to agent testing: it verifies that the agent meets your business's accuracy and reliability standards and gives transparency into how it's performing. It has different strengths than testing with the test chat.
Agent evaluation measures correctness and performance, not AI ethics or safety problems. An agent might pass all evaluation tests but still, for example, produce an inappropriate answer to a question. Customers should still use responsible AI reviews and content safety filters; evaluations don't replace those reviews and filters.
Government Community Cloud limitations
Agent evaluation in Government Community Cloud (GCC) environments has the following limitations:
Makers can't add a user profile to their test sets. However, makers can still run evaluations without a user profile.
Makers can't use the similarity test method for evaluations. All other test methods are available.
How agent evaluation works
Copilot Studio uses a test case for each agent evaluation. A test case is a single interaction that simulates how a user would interact with your agent. The interaction can be a single question or an entire conversation.
A test case can also include the answer you expect your agent to reply with. For example:
The question: What are your business hours?
The expected response: We are open from 9 a.m. to 5 p.m. from Monday to Friday.
By using agent evaluation, you can generate, import, or manually write a group of test cases. This group of test cases is called a test set. A test set allows you to:
Run multiple test cases that cover a broad range of capabilities at once, instead of asking your agent one question at a time.
Analyze your agent's performance with an easily digestible aggregate score and also zoom in on individual test cases.
Test changes to your agents by using the same test set, so you have an objective standard to measure and compare changes in performance.
Quickly create new test sets or modify existing ones to cover changing agent capabilities or requirements.
Each test set can evaluate your agent using multiple test methods at once.
You can also choose a user profile to act as the simulated user. The agent might be configured to respond to different users in different ways, or to grant different users access to different resources.
When you select a test set and run an agent evaluation, Copilot Studio sends the questions in the test cases, records the agent's responses, compares those responses against expected responses or a standard of quality, and assigns a score to each test case. You can also see the details, transcript, and activity map for each test case and which resources your agent used to create the response.
Create a comprehensive evaluation strategy
Before you run evaluations, define what success looks like for your agent and decide which scenarios matter most to your business outcomes. A clear strategy helps you choose the right test methods, prioritize high-impact test cases, and interpret results with the right context.
Use Architecting agent solutions: Evaluation frameworks to map business goals to measurable evaluation dimensions and scoring approaches.
Use Design and operationalize agent evaluation to build a repeatable evaluation process that supports ongoing quality improvements.
Test chat versus agent evaluation
Each method of testing gives you different insights into your agent's qualities and behavior:
Test chat:
Receives and responds to one question at a time. It's hard to repeat the same tests multiple times.
Allows you to test a full session containing multiple messages.
Allows you to interact with your agent as a user by using a chat interface.
Agent evaluation:
Can create and run multiple test cases at once by using a test set. You can repeat tests by testing with the same test set.
Can test one question and one response per test case, or one conversation per test case. However, you have less control over the conversations than you would with the test chat.
Choose different user profiles to simulate different users without needing to complete the interactions yourself.
When you test an agent, use both the test chat and agent evaluation for a full picture of your agent.