Evaluation sets

Important

This feature is in Public Preview.

To measure the quality of an agentic application, you need to be able to define what a high-quality, accurate response looks like. You do that by providing an evaluation set. This article covers the required schema of the evaluation set, which metrics are calculated based on what data is present in the evaluation set, and some best practices for creating an evaluation set.

Databricks recommends creating a human-labeled evaluation set, which is a set of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents that you expect the response to be based on.

A good evaluation set has the following characteristics:

  • Representative: It should accurately reflect the range of requests the application will encounter in production.
  • Challenging: It should include difficult and diverse cases to effectively test the full range of the application’s capabilities.
  • Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.

To learn how to run an evaluation using the evaluation set, see How to run an evaluation and view the results.

Evaluation set schema

The following table shows the schema required for the DataFrame provided in the mlflow.evaluate() call. The last two columns indicate, for each field, whether it is required, optional, or generated by Agent Evaluation, depending on whether you pass the application as an input argument or provide previously generated outputs.

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| request_id | string | Unique identifier of the request. | Optional | Optional |
| request | string | Input to the application to evaluate, the user's question or query. For example, "What is RAG?" | Required | Required |
| expected_retrieved_context | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Schema for arrays in evaluation set. | Optional | Optional |
| expected_response | string | Ground-truth (correct) answer for the input request. | Optional | Optional |
| response | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, derived from the trace. Either response or trace is required. |
| retrieved_context | array | Retrieval results generated by the retriever in the application being evaluated. If the application includes multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Schema for arrays in evaluation set. | Generated by Agent Evaluation | Optional. If not provided, derived from the provided trace. |
| trace | JSON string of MLflow Trace | MLflow Trace of the application's execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either response or trace is required. |
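
The evaluation set itself can be built as a list of dictionaries, as in the samples later in this article, and wrapped in a pandas DataFrame for the mlflow.evaluate() call. The following is a minimal sketch of the call when the application is passed as the model input argument; the model URI is a placeholder, and the databricks-agent model type is the one described in How to run an evaluation and view the results.

import mlflow
import pandas as pd

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

# Pass the application as the `model` argument so that Agent Evaluation generates
# `response`, `retrieved_context`, and `trace` for each request.
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model="models:/my_agent/1",  # placeholder URI for your logged or registered agent
    model_type="databricks-agent",
)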

Schema for arrays in evaluation set

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| content | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional |
| doc_uri | string | Unique identifier (URI) of the parent document where the chunk came from. | Required | Required |

Metrics available when the application is passed in through the model input argument

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations that take the application as an input argument. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see Use agent metrics & LLM judges to evaluate app performance.

| Calculated metrics | request | request and expected_response | request, expected_response, and expected_retrieved_context |
|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X |
| response/llm_judged/safety/rating | X | X | X |
| response/llm_judged/groundedness/rating | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | X | X | X |
| agent/total_token_count | X | X | X |
| agent/input_token_count | X | X | X |
| agent/output_token_count | X | X | X |
| response/llm_judged/correctness/rating | | X | X |
| retrieval/ground_truth/document_recall | | | X |
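
As a sketch of how these metrics appear in the results, assuming the EvaluationResult object returned by mlflow.evaluate() and the eval_results table name used by Agent Evaluation (exact names can vary by version), each metric becomes a per-request column that is populated only when the data it depends on is present:

# `results` is the return value of the mlflow.evaluate() call sketched earlier.
aggregate_metrics = results.metrics              # aggregate values across the evaluation set
per_request = results.tables["eval_results"]     # pandas DataFrame with one row per request

# Correctness is computed only when expected_response is included in the evaluation set.
if "response/llm_judged/correctness/rating" in per_request.columns:
    print(per_request[["request", "response/llm_judged/correctness/rating"]])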

Sample evaluation set with only request

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with request and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with request, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]

Metrics available when application outputs are provided

The metrics calculated are determined by the data you provide in the evaluation set. The following table shows the dependencies for evaluations where you provide a DataFrame containing the evaluation set and the application outputs. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

| Calculated metrics | request and response | request, response, and retrieved_context | request, response, retrieved_context, and expected_response | request, response, retrieved_context, expected_response, and expected_retrieved_context |
|---|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X | X |
| response/llm_judged/safety/rating | X | X | X | X |
| agent/request_token_count | X | X | X | X |
| agent/response_token_count | X | X | X | X |
| Customer-defined LLM judges | X | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | | X | X | X |
| response/llm_judged/groundedness/rating | | X | X | X |
| response/llm_judged/correctness/rating | | | X | X |
| retrieval/ground_truth/document_recall | | | | X |

Sample evaluation set with only request and response

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
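
To evaluate a set like this one, where the application outputs were generated previously, omit the model argument so Agent Evaluation judges the provided outputs instead of calling the application. A minimal sketch, again assuming the databricks-agent model type:

import mlflow
import pandas as pd

results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),   # rows already contain `response` (and optionally `retrieved_context` or `trace`)
    model_type="databricks-agent",
)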

Sample evaluation set with request, response, and retrieved_context

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Best practices for developing an evaluation set

  • Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and ability to infer answers from indirect evidence.
  • Consider testing adversarial scenarios from malicious users.
  • There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
  • Consider including examples that are very challenging, even for humans to answer.
  • Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations), to ensure that the application does not hallucinate or provide harmful responses.
  • High-quality, consistent human-generated labels are the best way to ensure that the ground truth values that you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:
    • Aggregate responses (labels) from multiple human labelers for the same question (a minimal aggregation sketch follows this list).
    • Ensure that labeling instructions are clear and that the human labelers are consistent.
    • Ensure that the conditions for the human-labeling process are identical to the format of requests submitted to the RAG application.
  • Human labelers are by nature noisy and inconsistent, for example because they interpret the same question differently. This variability is a useful part of the process: human labeling can reveal interpretations of questions that you had not considered, and those interpretations might provide insight into behavior you observe in your application.
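
The aggregation step above can be as simple as a majority vote over per-labeler answers. The following sketch is purely illustrative; the labels DataFrame and its columns are hypothetical and not part of the evaluation set schema.

import pandas as pd

# Hypothetical raw labels: one row per (request, labeler) pair.
labels = pd.DataFrame([
    {"request_id": "r1", "labeler": "a", "expected_response": "There's no significant difference."},
    {"request_id": "r1", "labeler": "b", "expected_response": "There's no significant difference."},
    {"request_id": "r1", "labeler": "c", "expected_response": "reduceByKey is more efficient."},
])

# Keep the most common label for each request. Requests without a clear majority
# are worth revisiting with the labelers before they are added to the evaluation set.
aggregated = (
    labels.groupby("request_id")["expected_response"]
    .agg(lambda responses: responses.mode().iloc[0])
    .reset_index()
)
print(aggregated)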