How to run an evaluation and view the results
Important
This feature is in Public Preview.
This article describes how to run an evaluation and view the results using Mosaic AI Agent Evaluation.
To run an evaluation, you must specify an evaluation set. An evaluation set is a set of typical requests that a user would make to your agentic application. The evaluation set can also include the expected output for each input request. The purpose of the evaluation set is to help you measure and predict the performance of your agentic application by testing it on representative questions.
For more information about evaluation sets, including the required schema, see Evaluation sets.
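For illustration only, the following sketch shows what a small evaluation set might look like as a pandas DataFrame. The `request` and `expected_response` fields follow the schema examples later in this article; the example questions are placeholders.
import pandas as pd

# A minimal, hypothetical evaluation set: each row is a representative user
# request, optionally paired with the expected response.
example_eval_set = pd.DataFrame([
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_response": "There's no significant difference.",
    },
    {
        "request": "How do I create a Delta table?",  # expected_response is optional
    },
])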
To begin evaluation, you use the `mlflow.evaluate()` method from the MLflow API. `mlflow.evaluate()` computes quality, latency, and cost metrics for each input in the evaluation set, and also computes aggregate metrics across all inputs. These metrics are also referred to as the evaluation results. The following code shows an example of calling `mlflow.evaluate()`:
%pip install databricks-agents
dbutils.library.restartPython()
import mlflow
import pandas as pd
eval_df = pd.DataFrame(...)
# Puts the evaluation results in the current Run, alongside the logged model parameters
with mlflow.start_run():
    logged_model_info = mlflow.langchain.log_model(...)
    mlflow.evaluate(data=eval_df, model=logged_model_info.model_uri,
                    model_type="databricks-agent")
In this example, `mlflow.evaluate()` logs its evaluation results in the enclosing MLflow run, along with information logged by other commands (for example, model parameters). If you call `mlflow.evaluate()` outside an MLflow run, it starts a new run and logs the evaluation results in that run. For more information about `mlflow.evaluate()`, including details on the evaluation results that are logged in the run, see the MLflow documentation.
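As a minimal sketch of that second case, calling `mlflow.evaluate()` without an enclosing `mlflow.start_run()` lets it create its own run. The variables below are assumed to be defined as in the previous example.
# No mlflow.start_run() here: mlflow.evaluate() starts a new MLflow run
# and logs the evaluation results to it.
evaluation_results = mlflow.evaluate(
    data=eval_df,
    model=logged_model_info.model_uri,
    model_type="databricks-agent",
)

# Aggregate metrics are also returned in memory.
print(evaluation_results.metrics)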
Requirements
Azure AI-powered AI assistive features must be enabled for your workspace.
How to provide input to an evaluation run
There are two ways to provide input to an evaluation run:
- Provide previously generated outputs to compare to the evaluation set. This option is recommended if you want to evaluate outputs from an application that is already deployed to production, or if you want to compare evaluation results between evaluation configurations.
- Pass the application as an input argument. `mlflow.evaluate()` calls into the application for each input in the evaluation set and computes metrics on the generated output. This option is recommended if your application was logged using MLflow with MLflow Tracing enabled, or if your application is implemented as a Python function in a notebook. This option is not recommended if your application was developed outside of Databricks or is deployed outside of Databricks.
The following code samples show a minimal example for each method. For more detailed examples, see Examples of mlflow.evaluate() calls. For details about the evaluation set schema, see Evaluation set schema.
To provide previously generated outputs, specify only the evaluation set as shown in the following code, but ensure that it includes the generated outputs. For a more detailed example, see Example: Previously generated outputs provided.
evaluation_results = mlflow.evaluate(
    data=eval_set_with_chain_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
    model_type="databricks-agent",
)
To have the `mlflow.evaluate()` call generate the outputs, specify the evaluation set and the application in the function call, as shown in the following code. For a more detailed example, see Example: Agent Evaluation runs application.
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame containing just the evaluation set
    model=model,       # Reference to the MLflow model that represents the application
    model_type="databricks-agent",
)
Evaluation outputs
An evaluation generates two types of outputs:
- Data about each request in the evaluation set, including the following:
  - Inputs sent to the agentic application.
  - The application’s output `response`.
  - All intermediate data generated by the application, such as `retrieved_context`, `trace`, and so on.
  - Ratings and rationales from each Databricks-specified and customer-specified LLM judge. The ratings characterize different quality aspects of the application outputs, including correctness, groundedness, retrieval precision, and so on.
  - Other metrics based on the application’s trace, including latency and token counts for different steps.
- Aggregated metric values across the entire evaluation set, such as average and total token counts, average latencies, and so on.
These two types of outputs are returned from `mlflow.evaluate()` and are also logged in an MLflow run. You can inspect the outputs in the notebook or from the page of the corresponding MLflow run.
Review output in the notebook
The following code shows some examples of how to review the results of an evaluation run from your notebook.
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
###
# Run evaluation
###
evaluation_results = mlflow.evaluate(..., model_type="databricks-agent")
###
# Access aggregated metric values across the entire evaluation set
###
metrics_as_dict = evaluation_results.metrics
metrics_as_pd_df = pd.DataFrame([evaluation_results.metrics])
# Sample usage
print(f"The percentage of generated responses that are grounded: {metrics_as_dict['response/llm_judged/groundedness/percentage']}")
###
# Access data about each question in the evaluation set
###
per_question_results_df = evaluation_results.tables['eval_results']
# Show information about responses that are not grounded
per_question_results_df[per_question_results_df["response/llm_judged/groundedness/rating"] == "no"].display()
The `per_question_results_df` DataFrame includes all of the columns in the input schema and all computed metrics specific to each request. For more details about each reported metric, see Use agent metrics & LLM judges to evaluate app performance.
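For a quick overview of what was computed for each request, a simple sketch is to list the DataFrame's columns and summarize one of the judge ratings. The exact columns available depend on your evaluation set and the `databricks-agents` version.
# List the per-request columns that were computed
print(sorted(per_question_results_df.columns))

# Summarize the groundedness judge ratings across all requests
print(per_question_results_df["response/llm_judged/groundedness/rating"].value_counts())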
Review output using the MLflow UI
Evaluation results are also available in the MLflow UI. To access the MLflow UI, click the Experiment icon in the notebook’s right sidebar and then click the corresponding run, or click the links that appear in the cell results for the notebook cell in which you ran `mlflow.evaluate()`.
Review metrics for a single run
This section describes the metrics available for each evaluation run. To compare metrics across runs, see Compare metrics across runs.
Per-request metrics
Per-request metrics are available in `databricks-agents` version 0.3.0 and above.
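If you are not sure which version of `databricks-agents` is installed in your notebook environment, one way to check is the following sketch, which uses only the Python standard library.
from importlib.metadata import version

# Per-request metrics require databricks-agents version 0.3.0 or above.
print(version("databricks-agents"))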
To see detailed metrics for each request in the evaluation set, click the Evaluation results tab on the MLflow Run page. This page shows a summary table of each evaluation run. For more details, click the Evaluation ID of a run.
The details page for the evaluation run shows the following:
- Model output: The generated response from the agentic app, and its trace if one is included.
- Expected output: The expected response for each request.
- Detailed assessments: The assessments of the LLM judges on this data. Click See details to display the justifications provided by the judges.
Aggregated metrics across the full evaluation set
To see aggregated metric values across the full evaluation set, click the Overview tab (for numerical values) or the Model metrics tab (for charts).
Compare metrics across runs
It’s important to compare evaluation results across runs to see how your agentic application responds to changes. Comparing results can help you understand whether your changes are improving quality, and can help you troubleshoot unexpected changes in behavior.
Compare per-request metrics across runs
To compare data for each individual request across runs, click the Evaluation tab on the Experiment page. A table shows each question in the evaluation set. Use the drop-down menus to select the columns to view.
Compare aggregated metrics across runs
You can access the same aggregated metrics from the Experiment page, which also allows you to compare these metrics across different runs. To access the Experiment page, click the Experiment icon in the notebook’s right sidebar, or click the links that appear in the cell results for the notebook cell in which you ran `mlflow.evaluate()`.
On the Experiment page, switch to the chart view. This allows you to visualize the aggregated metrics for the selected run and compare them to past runs.
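If you prefer to compare aggregated metrics across runs programmatically instead of in the UI, a possible sketch uses `mlflow.search_runs()`. The experiment name below is a placeholder, and the exact names of the logged metric columns may differ from this assumed example.
import mlflow

# Fetch all runs in an experiment as a pandas DataFrame.
# Logged metrics appear as columns prefixed with "metrics.".
runs_df = mlflow.search_runs(experiment_names=["/Users/you@example.com/agent-evaluation"])

# Compare an aggregated metric across runs (assumed column name, for illustration).
metric_col = "metrics.response/llm_judged/groundedness/percentage"
if metric_col in runs_df.columns:
    print(runs_df[["run_id", "start_time", metric_col]].sort_values("start_time"))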
Examples of `mlflow.evaluate()` calls
This section includes code samples of `mlflow.evaluate()` calls, illustrating options for passing the application and the evaluation set to the call.
Example: Agent Evaluation runs application
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
###
# mlflow.evaluate() call
###
evaluation_results = mlflow.evaluate(
    data=eval_set_df,  # pandas DataFrame with just the evaluation set
    model=model,       # Reference to the application
    model_type="databricks-agent",
)
###
# There are 4 options for passing an application in the `model` argument.
####
#### Option 1. Reference to a Unity Catalog registered model
model = "models:/catalog.schema.model_name/1" # 1 is the version number
#### Option 2. Reference to an MLflow logged model in the current MLflow Experiment
model = "runs:/6b69501828264f9s9a64eff825371711/chain"
# `6b69501828264f9s9a64eff825371711` is the run_id, `chain` is the artifact_path that was
# passed when calling mlflow.xxx.log_model(...).
# If you called model_info = mlflow.langchain.log_model() or mlflow.pyfunc.log_model(), you can access this value using `model_info.model_uri`.
#### Option 3. A PyFunc model that is loaded in the notebook
model = mlflow.pyfunc.load_model(...)
#### Option 4. A local function in the notebook
def model_fn(model_input):
    # code that implements the application
    response = 'the answer!'
    return response
model = model_fn
###
# `data` is a pandas DataFrame with your evaluation set.
# These are simple examples. See the input schema for details.
####
# You do not have to start from a dictionary - you can use any existing pandas or
# Spark DataFrame with this schema.
# Minimal evaluation set
bare_minimum_eval_set_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
    }
]
# Complete evaluation set
complete_eval_set_schema = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
    }
]
#### Convert dictionary to a pandas DataFrame
eval_set_df = pd.DataFrame(bare_minimum_eval_set_schema)
#### Use a Spark DataFrame
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_df = spark_df.toPandas()
Example: Previously generated outputs provided
For the required evaluation set schema, see Evaluation sets.
%pip install databricks-agents pandas
dbutils.library.restartPython()
import mlflow
import pandas as pd
###
# mlflow.evaluate() call
###
evaluation_results = mlflow.evaluate(
    data=eval_set_with_app_outputs_df,  # pandas DataFrame with the evaluation set and application outputs
    model_type="databricks-agent",
)
###
# `data` is a pandas DataFrame with your evaluation set and outputs generated by the application.
# These are simple examples. See the input schema for details.
####
# You do not have to start from a dictionary - you can use any existing pandas or
# Spark DataFrame with this schema.
# Bare minimum data
bare_minimum_input_schema = [
    {
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    }
]
complete_input_schema = [
    {
        "request_id": "your-request-id",
        "request": "What is the difference between reduceByKey and groupByKey in Spark?",
        "expected_retrieved_context": [
            {
                # In `expected_retrieved_context`, `content` is optional, and does not provide any additional functionality.
                "content": "Answer segment 1 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "Answer segment 2 related to What is the difference between reduceByKey and groupByKey in Spark?",
                "doc_uri": "doc_uri_2_2",
            },
        ],
        "expected_response": "There's no significant difference.",
        "response": "reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
        "retrieved_context": [
            {
                # In `retrieved_context`, `content` is optional. If provided, the Databricks Context Relevance LLM Judge is executed to check the `content`'s relevance to the `request`.
                "content": "reduceByKey reduces the amount of data shuffled by merging values before shuffling.",
                "doc_uri": "doc_uri_2_1",
            },
            {
                "content": "groupByKey may lead to inefficient data shuffling due to sending all values across the network.",
                "doc_uri": "doc_uri_6_extra",
            },
        ],
    }
]
#### Convert dictionary to a pandas DataFrame
eval_set_with_app_outputs_df = pd.DataFrame(bare_minimum_input_schema)
#### Use a Spark DataFrame
spark_df = spark.table("catalog.schema.table") # or any other way to get a Spark DataFrame
eval_set_with_app_outputs_df = spark_df.toPandas()
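If your application is already deployed, one way to assemble `eval_set_with_app_outputs_df` is to replay the evaluation requests against the deployed application and record each response. The `query_deployed_app` function below is a hypothetical placeholder for however you invoke your application, such as a REST call to its serving endpoint.
import pandas as pd

def query_deployed_app(request: str) -> str:
    # Hypothetical placeholder: replace this with a call to your deployed
    # application (for example, a request to its serving endpoint).
    return "placeholder response"

requests = [
    "What is the difference between reduceByKey and groupByKey in Spark?",
]

# Build rows that match the minimal input schema: `request` plus the generated `response`.
eval_set_with_app_outputs_df = pd.DataFrame(
    [{"request": r, "response": query_deployed_app(r)} for r in requests]
)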
Limitation
For multi-turn conversations, the evaluation output records only the last entry in the conversation.