Evaluation sets

Important

This feature is in Public Preview.

To measure the quality of an agentic application, you need to be able to define what a high-quality, accurate response looks like. You do that by providing an evaluation set. This article covers the required schema of the evaluation set, which metrics are calculated based on what data is present in the evaluation set, and some best practices for creating an evaluation set.

Databricks recommends creating a human-labeled evaluation set, which is a set of representative questions and ground-truth answers. If your application includes a retrieval step, you can optionally provide the supporting documents that you expect the response to be based on.

A good evaluation set has the following characteristics:

  • Representative: It should accurately reflect the range of requests the application will encounter in production.
  • Challenging: It should include difficult and diverse cases to effectively test the full range of the application’s capabilities.
  • Continually updated: It should be updated regularly to reflect how the application is used and the changing patterns of production traffic.

To learn how to run an evaluation using the evaluation set, see How to run an evaluation and view the results.

Evaluation set schema

The following table shows the schema required for the DataFrame provided in the mlflow.evaluate() call. The last two columns indicate, for each field, whether it is required, optional, or generated by Agent Evaluation, depending on whether you pass the application as an input argument or provide previously generated outputs.

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| request_id | string | Unique identifier of the request. | Optional | Optional |
| request | string | Input to the application to evaluate, the user's question or query. For example, "What is RAG?" | Required | Required |
| expected_retrieved_context | array | Array of objects containing the expected retrieved context for the request (if the application includes a retrieval step). See Schema for arrays in evaluation set. | Optional | Optional |
| expected_response | string | Ground-truth (correct) answer for the input request. | Optional | Optional |
| response | string | Response generated by the application being evaluated. | Generated by Agent Evaluation | Optional. If not provided, derived from the trace. Either response or trace is required. |
| retrieved_context | array | Retrieval results generated by the retriever in the application being evaluated. If the application includes multiple retrieval steps, these are the retrieval results from the last step (chronologically in the trace). See Schema for arrays in evaluation set. | Generated by Agent Evaluation | Optional. If not provided, derived from the provided trace. |
| trace | JSON string of MLflow Trace | MLflow Trace of the application's execution on the corresponding request. | Generated by Agent Evaluation | Optional. Either response or trace is required. |
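
The evaluation set itself can be built as a list of dictionaries, as in the samples later in this article, and wrapped in a pandas DataFrame for the mlflow.evaluate() call. The following is a minimal sketch of the call when the application is passed as the model input argument; the model URI is a placeholder, and the databricks-agent model type is the one described in How to run an evaluation and view the results.

import mlflow
import pandas as pd

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

# Pass the application as the `model` argument so that Agent Evaluation generates
# `response`, `retrieved_context`, and `trace` for each request.
results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),
    model="models:/my_agent/1",  # placeholder URI for your logged or registered agent
    model_type="databricks-agent",
)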

Schema for arrays in evaluation set

The schema of the arrays expected_retrieved_context and retrieved_context is shown in the following table:

| Column | Data type | Description | Application passed as input argument | Previously generated outputs provided |
|---|---|---|---|---|
| content | string | Contents of the retrieved context. String in any format, such as HTML, plain text, or Markdown. | Optional | Optional |
| doc_uri | string | Unique identifier (URI) of the parent document where the chunk came from. | Required | Required |

Metrics available when the application is passed in through the model input argument

The metrics calculated are determined by the data you provide in the evaluation set. The table shows the dependencies for evaluations that take the application as an input argument. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

For details about what these metrics measure, see Use agent metrics & LLM judges to evaluate app performance.

| Calculated metrics | request | request and expected_response | request, expected_response, and expected_retrieved_context |
|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X |
| response/llm_judged/safety/rating | X | X | X |
| response/llm_judged/groundedness/rating | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | X | X | X |
| agent/total_token_count | X | X | X |
| agent/input_token_count | X | X | X |
| agent/output_token_count | X | X | X |
| response/llm_judged/correctness/rating | | X | X |
| retrieval/ground_truth/document_recall | | | X |
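
As a sketch of how these metrics appear in the results, assuming the EvaluationResult object returned by mlflow.evaluate() and the eval_results table name used by Agent Evaluation (exact names can vary by version), each metric becomes a per-request column that is populated only when the data it depends on is present:

# `results` is the return value of the mlflow.evaluate() call sketched earlier.
aggregate_metrics = results.metrics              # aggregate values across the evaluation set
per_request = results.tables["eval_results"]     # pandas DataFrame with one row per request

# Correctness is computed only when expected_response is included in the evaluation set.
if "response/llm_judged/correctness/rating" in per_request.columns:
    print(per_request[["request", "response/llm_judged/correctness/rating"]])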

Sample evaluation set with only request

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]

Sample evaluation set with request and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    }
]

Sample evaluation set with request, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_1",
            },
            {
                "doc_uri": "doc_uri_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]

Metrics available when application outputs are provided

The metrics calculated are determined by the data you provide in the evaluation set. The following table shows the dependencies for evaluations where you provide a DataFrame containing the evaluation set and the application outputs. The columns indicate the data included in the evaluation set, and an X indicates that the metric is supported when that data is provided.

| Calculated metrics | request and response | request, response, and retrieved_context | request, response, retrieved_context, and expected_response | request, response, retrieved_context, expected_response, and expected_retrieved_context |
|---|---|---|---|---|
| response/llm_judged/relevance_to_query/rating | X | X | X | X |
| response/llm_judged/safety/rating | X | X | X | X |
| agent/request_token_count | X | X | X | X |
| agent/response_token_count | X | X | X | X |
| Customer-defined LLM judges | X | X | X | X |
| retrieval/llm_judged/chunk_relevance/precision | | X | X | X |
| response/llm_judged/groundedness/rating | | X | X | X |
| response/llm_judged/correctness/rating | | | X | X |
| retrieval/ground_truth/document_recall | | | | X |

Sample evaluation set with only request and response

eval_set = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
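
To evaluate a set like this one, where the application outputs were generated previously, omit the model argument so Agent Evaluation judges the provided outputs instead of calling the application. A minimal sketch, again assuming the databricks-agent model type:

import mlflow
import pandas as pd

results = mlflow.evaluate(
    data=pd.DataFrame(eval_set),   # rows already contain `response` (and optionally `retrieved_context` or `trace`)
    model_type="databricks-agent",
)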

Sample evaluation set with request, response, and retrieved_context

eval_set = [
    {
        "request_id": "request-id", # optional, but useful for tracking
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, and expected_response

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Sample evaluation set with request, response, retrieved_context, expected_response, and expected_retrieved_context

eval_set = [
    {
        "request_id": "request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                "doc_uri": "doc_uri_2_1",
            },
            {
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional, but delivers additional functionality if provided (the Databricks Context Relevance LLM judge runs to check the relevance of the provided content to the request).
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]

Best practices for developing an evaluation set

  • Consider each sample, or group of samples, in the evaluation set as a unit test. That is, each sample should correspond to a specific scenario with an explicit expected outcome. For example, consider testing longer contexts, multi-hop reasoning, and ability to infer answers from indirect evidence.
  • Consider testing adversarial scenarios from malicious users.
  • There is no specific guideline on the number of questions to include in an evaluation set, but clear signals from high-quality data typically perform better than noisy signals from weak data.
  • Consider including examples that are very challenging, even for humans to answer.
  • Whether you are building a general-purpose application or targeting a specific domain, your app will likely encounter a wide variety of questions. The evaluation set should reflect that. For example, if you are creating an application to field specific HR questions, you should still consider testing other domains (for example, operations), to ensure that the application does not hallucinate or provide harmful responses.
  • High-quality, consistent human-generated labels are the best way to ensure that the ground truth values that you provide to the application accurately reflect the desired behavior. Some steps to ensure high-quality human labels are the following:
    • Aggregate responses (labels) from multiple human labelers for the same question (a minimal aggregation sketch follows this list).
    • Ensure that labeling instructions are clear and that the human labelers are consistent.
    • Ensure that the conditions for the human-labeling process are identical to the format of requests submitted to the RAG application.
  • Human labelers are by nature noisy and inconsistent, for example because they interpret the same question differently. This variability is a useful part of the process: human labeling can reveal interpretations of questions that you had not considered, and those interpretations might provide insight into behavior you observe in your application.
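
The aggregation step above can be as simple as a majority vote over per-labeler answers. The following sketch is purely illustrative; the labels DataFrame and its columns are hypothetical and not part of the evaluation set schema.

import pandas as pd

# Hypothetical raw labels: one row per (request, labeler) pair.
labels = pd.DataFrame([
    {"request_id": "r1", "labeler": "a", "expected_response": "There's no significant difference."},
    {"request_id": "r1", "labeler": "b", "expected_response": "There's no significant difference."},
    {"request_id": "r1", "labeler": "c", "expected_response": "reduceByKey is more efficient."},
])

# Keep the most common label for each request. Requests without a clear majority
# are worth revisiting with the labelers before they are added to the evaluation set.
aggregated = (
    labels.groupby("request_id")["expected_response"]
    .agg(lambda responses: responses.mode().iloc[0])
    .reset_index()
)
print(aggregated)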