Choose evaluation methods

[This article is prerelease documentation and is subject to change.]

When you create test sets, choose from different test methods to evaluate your agent's responses. Each test method has its own strengths and suits different types of evaluations.

| Test method | Measures | Scoring | Configurations |
|---|---|---|---|
| General quality | How good the test case's answer is, based on specific qualities | Scored out of 100% | None |
| Compare meaning | How well the meaning of the test case's answer matches the expected answer | Scored out of 100% | Pass score, expected answer |
| Tool use | Whether the test case used the expected resources | Pass/fail | Expected capabilities |
| Keyword match | Whether the test case's answer used all or any of the expected keywords or phrases | Pass/fail | Expected keywords or phrases |
| Text similarity | How well the text of the test case's answer matches the expected answer | Scored out of 100% | Pass score, expected answer |
| Exact match | Whether the test case's answer matches the expected answer exactly | Pass/fail | Expected answer |
| Custom | Labels answers based on the criteria you describe | Pass/fail | Test description and label descriptions |

Add a test method

  1. When creating or editing a test set, select Add test method.

  2. Select the methods you want to test with, then select OK. You can add multiple methods.

    1. Some methods require a pass score. The pass score sets the threshold that determines whether a result passes or fails. Set the score, then select OK.

    2. Some test methods require more criteria.

  3. Select Save to save your changes to the test set.

Select an existing test method to edit that method's criteria or delete that method.

General quality

General quality helps you decide whether your agent's responses meet your standards. It uses a language model to assess how effectively an agent answers user questions.

General quality is especially helpful when there's no exact answer expected. It offers a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.

It uses these key criteria and applies a consistent prompt to guide scoring:

  • Relevance: To what extent the agent's response addresses the question. For example, does the agent's response stay on the subject and directly answer the question?

  • Groundedness: To what extent the agent's response is based on the provided context. For example, does the agent's response reference or rely on the information given in the context, rather than introducing unrelated or unsupported information?

  • Completeness: To what extent the agent's response provides all necessary information. For example, does the agent's response cover all aspects of the question and provide sufficient detail?

  • Abstention: Whether the agent attempted to answer the question.

To be considered high quality, a response must meet all these key criteria. If one criterion isn't met, the response is flagged for improvement. This scoring method ensures that only responses that are both complete and well-supported receive top marks. In contrast, answers that are incomplete or lack supporting evidence receive lower scores.
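
The product applies these criteria through a language model and an internal prompt, so the judgment itself can't be reproduced here. The following sketch only illustrates the documented all-or-nothing gate; the dataclass and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QualityJudgment:
    """Hypothetical per-criterion verdicts; in the product, these come
    from a language model applying a consistent internal prompt."""
    relevance: bool      # response addresses the question
    groundedness: bool   # response relies on the provided context
    completeness: bool   # response covers all aspects in enough detail
    abstention: bool     # agent actually attempted an answer

def is_high_quality(j: QualityJudgment) -> bool:
    # The documented rule: every criterion must be met; if any one
    # fails, the response is flagged for improvement.
    return all([j.relevance, j.groundedness, j.completeness, j.abstention])

# Grounded and relevant, but incomplete -> flagged for improvement
print(is_high_quality(QualityJudgment(True, True, False, True)))  # False
```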

When adding or editing test methods, select General quality. All test sets start with this method by default.

You don't need to add expected answers to test cases to complete a general quality evaluation.

Compare meaning

Compare meaning evaluates how well the agent's answer reflects the intended meaning of the expected response. Instead of focusing on exact wording, it uses intent similarity: it compares the ideas and meaning behind the words to judge how closely the response aligns with what you expected.

Like general quality, compare meaning is especially helpful when there's no exact answer expected. It offers a flexible and scalable way to evaluate responses based on the retrieved documents and the conversation flow.

You can set a pass score threshold that determines whether an answer passes. The default pass score is 50. The compare meaning test method is useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.
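
A minimal sketch of that decision rule follows. The intent-similarity score itself comes from the product's model, so it's treated as an input here, and whether a score exactly at the threshold passes is an assumption.

```python
from typing import Optional

def compare_meaning_result(score: Optional[float], pass_score: float = 50.0) -> str:
    """Apply the documented decision rule for the compare meaning method.

    `score` is the intent-similarity score (0-100) produced by the
    product's model; it's a stand-in here, not a real API. A test case
    with no expected answer has no score and is reported as Invalid.
    """
    if score is None:            # no expected answer on the test case
        return "Invalid"
    return "Pass" if score >= pass_score else "Fail"

print(compare_meaning_result(72.5))   # Pass (default pass score is 50)
print(compare_meaning_result(None))   # Invalid
```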

  1. When adding or editing test methods, select Compare meaning.

  2. Set the pass score for this method.

  3. Add the expected answers. Any test case without expected answers produces an Invalid result for this test method.

    1. Select a test case.

    2. Add the answer you expect.

    3. Select Apply to save the expected answer.

    4. Repeat for all the test cases you want to test by using this method.

Tool use

Tool use tests whether the agent triggered the expected tools or topics during the test run. If it did, the result is marked as Pass. If it didn't, the result is marked as Fail.
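
A minimal sketch of that rule, assuming a pass requires every expected tool or topic to have fired and that extra, unexpected tools don't cause a failure (the documentation doesn't say):

```python
def tool_use_result(expected: set[str], triggered: set[str]) -> str:
    """Pass if every expected tool or topic fired during the run.

    Assumption: tools the agent triggers beyond the expected ones are
    ignored. No expected tools means the case can't be evaluated.
    """
    if not expected:
        return "Invalid"
    return "Pass" if expected <= triggered else "Fail"

print(tool_use_result({"LookupOrder"}, {"LookupOrder", "Greeting"}))  # Pass
print(tool_use_result({"LookupOrder"}, {"Greeting"}))                 # Fail
```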

  1. When adding or editing test methods, select Tool use.

  2. Add the expected tools or topics. Any test case without expected tools or topics produces an Invalid result for this test method.

    1. Select a test case. To add the same expected tools and topics for all test cases, select the Edit icon in the Tool use column heading.

    2. In the Edit test case pane, select the tools you expect your agent to use for that test case.

    3. Select OK.

    4. Select Apply to save changes.

    5. Repeat for all the test cases you want to test for tool use.

Keyword match

Keyword match checks whether the agent's answer contains some or all of the keywords or phrases you define. If it does, it passes; if it doesn't, it fails. Keyword match is useful when an answer can be phrased in different correct ways, but key terms or ideas still need to appear in the response.

You can select whether a pass requires Any of the keywords or All of them. Choosing Any means that if at least one word or phrase matches, the test case passes. Choosing All means that all expected words or phrases must match for a test case to pass.
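
A minimal sketch of the Any/All rule, assuming case-insensitive substring matching; the product's exact matching behavior (case handling, whole-word rules) isn't documented here:

```python
def keyword_match(answer: str, keywords: list[str], mode: str = "All") -> str:
    """Documented Any/All rule. Assumption: a keyword matches if it
    appears anywhere in the answer, ignoring case."""
    if not keywords:
        return "Invalid"          # no expected keywords on the test case
    text = answer.lower()
    hits = [kw.lower() in text for kw in keywords]
    passed = any(hits) if mode == "Any" else all(hits)
    return "Pass" if passed else "Fail"

answer = "Your refund will arrive within 5 business days."
print(keyword_match(answer, ["refund", "business days"], mode="All"))  # Pass
print(keyword_match(answer, ["refund", "store credit"], mode="All"))   # Fail
print(keyword_match(answer, ["refund", "store credit"], mode="Any"))   # Pass
```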

  1. When adding or editing test methods, select Keyword match.

  2. Select whether a test case needs Any or All keywords to match.

  3. Add the expected keywords. Any test case without expected keywords produces an Invalid result for this test method.

    1. Select a test case.

    2. In the Edit test case pane, add a keyword or phrase you expect that case's answer to have.

    3. Select + Add to add more keywords or phrases. To remove a keyword or phrase, select the Delete icon.

    4. Select Apply to save the expected keywords.

    5. Repeat for all the test cases you want to test for keyword matching.

Text similarity

Text similarity compares the agent's responses to the expected responses you define in your test set. It's useful when an answer can be phrased in different correct ways, but the overall meaning or intent still needs to come through.

It uses a cosine similarity metric to score how closely the agent's answer matches the wording and meaning of the expected response. The score ranges from 0 to 1, where 1 indicates a close match and 0 indicates no match. You can set a pass score threshold that determines whether an answer passes.
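
For intuition, here's cosine similarity computed over simple word-count vectors. The product's text representation isn't documented, so treat this as an illustration of the metric, not of the product:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over word-count vectors: 1 means identical
    vectors, 0 means no shared words. Word counts keep the sketch
    self-contained; richer representations capture meaning better."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

expected = "Returns are accepted within 30 days of purchase."
actual = "You can return items within 30 days of purchase."
print(f"{cosine_similarity(actual, expected):.2f}")  # ~0.59 for this pair
```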

  1. When adding or editing test methods, select Text similarity.

  2. Set the pass score for this method.

  3. Add the expected answers. Any test case without expected answers produces an Invalid result for this test method.

    1. Select a test case.

    2. Add the answer you expect.

    3. Select Apply to save the expected answer.

    4. Repeat for all the test cases you want to test by using this method.

Exact match

Exact match checks whether the agent's answer exactly matches the expected response in the test: character for character, word for word. If it's the same, it passes. If anything differs, it fails. Exact match is useful for short, precise answers such as numbers, codes, or fixed phrases. It doesn't suit answers that people can phrase in multiple correct ways.
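
Exact match reduces to plain string equality, as in this minimal sketch; the Invalid result for a missing expected answer mirrors the steps that follow:

```python
def exact_match_result(answer: str, expected: str | None) -> str:
    """Character-for-character comparison: any difference in case,
    punctuation, or whitespace fails. No expected answer -> Invalid."""
    if expected is None:
        return "Invalid"
    return "Pass" if answer == expected else "Fail"

print(exact_match_result("42", "42"))   # Pass
print(exact_match_result("42.", "42"))  # Fail: the trailing period differs
```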

  1. When adding or editing test methods, select Exact match.

  2. Add the expected answers. Any test case without expected answers produces an Invalid result for this test method.

    1. Select a test case.

    2. Add the answer you expect.

    3. Select Apply to save the expected answer.

    4. Repeat for all the test cases you want to test by using this method.

Custom

Custom lets you test and label agent answers using your own criteria. For example, you can create a compliance test for an HR agent that labels answers as either compliant or noncompliant with your description of HR compliance.

A custom test has two components for you to configure:

Evaluation instructions: Describe the goal you want to accomplish with this test. What do you want the test to find out about your agent's answers?

Good evaluation instructions should:

  • Be goal oriented.

  • Use only the allowed characters.

  • Use bullet points and headings for organization.

For example:

Evaluate the agent's response for HR policy compliance.

What to check: whether the answer
- Protects privacy and avoids revealing or requesting sensitive data.
- Avoids discrimination, bias, or inappropriate judgments.
- Provides safe, neutral, HR-aligned guidance.
- Doesn't give legal advice or make definitive claims.

Labels: Describe the result assigned to each answer by the custom test. Each label also has a pass/fail assignment, which counts toward the test set pass rate for this test method.

Labels have a name and a description. A good description:

  • Is concise.

  • Contains the attributes you're looking for in matching answers.

One strategy is to use two labels: one for answers that fulfill the criteria you're looking for, and one for answers that don't. For example, an HR policy compliance custom test might have Compliant and Non-Compliant as labels.
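
As a sketch, a custom test's two documented components (evaluation instructions plus two or more labels) might be modeled like this; the shape and names are hypothetical illustrations, not the product's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Label:
    name: str         # for example, "Compliant"
    description: str  # the attributes you're looking for in matching answers
    passes: bool      # whether this label counts as a pass

# Hypothetical custom test definition with both documented components.
hr_compliance_test = {
    "name": "HR policy compliance",
    "instructions": "Evaluate the agent's response for HR policy compliance.",
    "labels": [
        Label("Compliant", "Protects privacy and gives HR-aligned guidance.", True),
        Label("Non-Compliant", "Reveals sensitive data or gives legal advice.", False),
    ],
}
```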

  1. When adding or editing test methods, select Custom.

  2. Enter a name for this custom test.

  3. Add evaluation instructions.

  4. Add two or more labels. Each label has a name and a description.

    To add more labels, select Add label.

    Label titles can use only letters, numbers, spaces, hyphens (-), underscores (_), forward slashes (/), ampersands (&), plus signs (+), and periods (.). See the validation sketch after these steps.

  5. Set the Pass or Fail result for each label.

  6. Select OK.
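
One way to check a candidate label title against that character list, assuming ASCII letters only (the product may accept more):

```python
import re

# Allowed characters per step 4: letters, numbers, spaces, hyphens,
# underscores, forward slashes, ampersands, plus signs, and periods.
LABEL_TITLE = re.compile(r"^[A-Za-z0-9 \-_/&+.]+$")

print(bool(LABEL_TITLE.match("Non-Compliant")))   # True
print(bool(LABEL_TITLE.match("Compliant (v2)")))  # False: parentheses aren't allowed
```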