Important
This feature is in Public Preview.
This article gives an overview of how to work with Mosaic AI Agent Evaluation. Agent Evaluation helps developers evaluate the quality, cost, and latency of agentic AI applications, including RAG applications and chains. Agent Evaluation is designed to both identify quality issues and determine the root cause of those issues. The capabilities of Agent Evaluation are unified across the development, staging, and production phases of the MLOps life cycle, and all evaluation metrics and data are logged to MLflow Runs.
Agent Evaluation integrates advanced, research-backed evaluation techniques into a user-friendly SDK and UI that is integrated with your lakehouse, MLflow, and the other Databricks Data Intelligence Platform components. Developed in collaboration with Mosaic AI research, this proprietary technology offers a comprehensive approach to analyzing and enhancing agent performance.
Agentic AI applications are complex and involve many different components. Evaluating the performance of these applications is not as straightforward as evaluating the performance of traditional ML models. Both the qualitative and quantitative metrics used to evaluate quality are inherently more complex. Agent Evaluation includes proprietary LLM judges and agent metrics to evaluate retrieval and request quality, as well as overall performance metrics like latency and token cost.
The following code shows how to call and test Agent Evaluation on previously generated outputs. It returns a DataFrame with evaluation scores calculated by the LLM judges that are part of Agent Evaluation.
You can copy and paste the following into your existing Databricks notebook:
%pip install mlflow databricks-agents
dbutils.library.restartPython()
import mlflow
import pandas as pd
examples = {
    "request": [
        {
            # Recommended `messages` format
            "messages": [{
                "role": "user",
                "content": "What is Spark?"
            }],
        },
        # SplitChatMessagesRequest format
        {
            "query": "How do I convert a Spark DataFrame to Pandas?",
            "history": [
                {"role": "user", "content": "What is Spark?"},
                {"role": "assistant", "content": "Spark is a data processing engine."},
            ],
        }
        # Note: Using a primitive string is discouraged. The string will be wrapped in the
        # OpenAI messages format before being passed to your agent.
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
    ],
    "retrieved_context": [  # Optional, needed for judging groundedness.
        [{"doc_uri": "doc1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}],
        [{"doc_uri": "doc2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}],
    ],
    "expected_response": [  # Optional, needed for judging correctness.
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
    ],
    "guidelines": [
        "The response must be in English",
        "The response must be clear, coherent, and concise",
    ]
}

result = mlflow.evaluate(
    data=pd.DataFrame(examples),  # Your evaluation set
    # model=logged_model.model_uri,  # If you have an MLflow model. `retrieved_context` and `response` will be obtained from calling the model.
    model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
)

# Review the evaluation results in the MLflow UI (see console output), or access them in place:
display(result.tables['eval_results'])
Alternatively, you can import and run the corresponding example notebook in your Databricks workspace.
The following diagram shows an overview of the inputs accepted by Agent Evaluation and the corresponding outputs it produces.
For details of the expected input for Agent Evaluation, including field names and data types, see the input schema. Some of the fields are the following:

- request: Input to the agent (the user's question or query). For example, "What is RAG?".
- response: Response generated by the agent. For example, "Retrieval augmented generation is ...".
- expected_response: (Optional) A ground truth (correct) response.
- trace: (Optional) The agent's MLflow trace, from which Agent Evaluation extracts intermediate outputs such as the retrieved context or tool calls. Alternatively, you can provide these intermediate outputs directly.
- guidelines: (Optional) A list of guidelines that the model's output is expected to adhere to.

Based on these inputs, Agent Evaluation produces two types of outputs: per-row evaluation results, including LLM judge ratings and rationales, and aggregate metrics logged to the MLflow run.
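To make the schema concrete, the sketch below assembles a single evaluation-set row from the fields described above into a pandas DataFrame. The field values are illustrative only; see the input schema for the authoritative field list and types.

```python
import pandas as pd

# A minimal sketch of one evaluation-set row using the fields described above.
# The values are illustrative placeholders, not real agent outputs.
row = {
    "request": "What is RAG?",                                # input to the agent
    "response": "Retrieval augmented generation is ...",       # agent's answer
    "expected_response": "RAG is a technique that ...",        # optional ground truth
    "guidelines": ["The response must be in English"],         # optional guidelines
}

# One row per evaluation example; pass the DataFrame as `data` to mlflow.evaluate.
eval_set = pd.DataFrame([row])
```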
Agent Evaluation is designed to be consistent between your development (offline) and production (online) environments. This design enables a smooth transition from development to production, allowing you to quickly iterate, evaluate, deploy, and monitor high-quality agentic applications.
The main difference between development and production is that in production, you do not have ground-truth labels, while in development, you may optionally use ground-truth labels. Using ground-truth labels allows Agent Evaluation to compute additional quality metrics.
In development, your requests and expected_responses come from an evaluation set. An evaluation set is a collection of representative inputs that your agent should be able to handle accurately. For more information about evaluation sets, see Evaluation sets.
To get response and trace, Agent Evaluation can call your agent's code to generate these outputs for each row in the evaluation set. Alternatively, you can generate these outputs yourself and pass them to Agent Evaluation. See How to provide input to an evaluation run for more information.
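The two ways of providing input can be sketched as two differently shaped DataFrames. This is a minimal illustration, assuming the column names from the example earlier in this article; the request texts are placeholders.

```python
import pandas as pd

# Option 1: provide only requests. Agent Evaluation calls your agent's code
# to generate `response` (and `trace`) for each row.
requests_only = pd.DataFrame({
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
})

# Option 2: provide previously generated outputs yourself. No agent code is
# called during evaluation; the judges score the supplied responses directly.
precomputed = pd.DataFrame({
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
    "response": [
        "Spark is a data analytics framework.",
        "Use the toPandas() method.",
    ],
})
```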
In production, all inputs to Agent Evaluation come from your production logs.
If you use Mosaic AI Agent Framework to deploy your AI application, Agent Evaluation can be configured to automatically collect these inputs from the Agent-enhanced inference tables and continually update a monitoring dashboard. For more details, see How to monitor the quality of your agent on production traffic.
If you deploy your agent outside of Azure Databricks, you can ETL your logs to the required input schema and similarly configure a monitoring dashboard.
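The ETL step above can be sketched with pandas. This is a hypothetical example: the source log fields (`prompt`, `completion`, `context_docs`, `uri`, `text`) are invented stand-ins for whatever your logging system actually records, and only the target column names (`request`, `response`, `retrieved_context`, `doc_uri`, `content`) follow the input schema used earlier in this article.

```python
import pandas as pd

# Hypothetical raw log records from an external serving system.
# The field names here are illustrative, not a real log format.
raw_logs = [
    {
        "prompt": "What is Spark?",
        "completion": "Spark is a data analytics framework.",
        "context_docs": [{"uri": "doc1.txt", "text": "Spark was open sourced by UC Berkeley's AMPLab."}],
    },
]

# Map each log record onto the evaluation input schema.
rows = [
    {
        "request": log["prompt"],
        "response": log["completion"],
        "retrieved_context": [
            {"doc_uri": d["uri"], "content": d["text"]}
            for d in log["context_docs"]
        ],
    }
    for log in raw_logs
]

eval_input = pd.DataFrame(rows)
```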
To measure the quality of an AI application in development (offline), you need to define an evaluation set, that is, a set of representative questions and optional ground-truth answers. If the application involves a retrieval step, like in RAG workflows, then you can optionally provide supporting documents that you expect the response to be based on.
For details about how to run an evaluation, see How to run an evaluation and view the results. Agent Evaluation supports two options for providing output from the chain: you can run the agent's code as part of the evaluation run, or you can provide previously generated outputs.
For details and explanation of when to use each option, see How to provide input to an evaluation run.
The Databricks review app makes it easy to gather feedback about the quality of an AI application from human reviewers. For details, see Get feedback about the quality of an agentic application.
Mosaic AI Agent Evaluation is a Designated Service that uses Geos to manage data residency when processing customer content. To learn more about the availability of Agent Evaluation in different geographic areas, see Databricks Designated Services.
For pricing information, see Mosaic AI Agent Evaluation pricing.