Important
This feature is in Public Preview.
This article gives an overview of how to work with Mosaic AI Agent Evaluation. Agent Evaluation helps developers evaluate the quality, cost, and latency of agentic AI applications, including RAG applications and chains. Agent Evaluation is designed to both identify quality issues and determine the root cause of those issues. The capabilities of Agent Evaluation are unified across the development, staging, and production phases of the MLOps life cycle, and all evaluation metrics and data are logged to MLflow Runs.
Agent Evaluation integrates advanced, research-backed evaluation techniques into a user-friendly SDK and UI that is integrated with your lakehouse, MLflow, and the other Databricks Data Intelligence Platform components. Developed in collaboration with Mosaic AI research, this proprietary technology offers a comprehensive approach to analyzing and enhancing agent performance.
Agentic AI applications are complex and involve many different components. Evaluating the performance of these applications is not as straightforward as evaluating the performance of traditional ML models. Both the qualitative and quantitative metrics used to evaluate quality are inherently more complex. Agent Evaluation includes proprietary LLM judges and agent metrics to evaluate retrieval and request quality, as well as overall performance metrics like latency and token cost.
The following code shows how to call and test Agent Evaluation on previously generated outputs. It returns a DataFrame with evaluation scores calculated by the LLM judges that are part of Agent Evaluation.
You can copy and paste the following into your existing Databricks notebook:
%pip install mlflow databricks-agents
dbutils.library.restartPython()
import mlflow
import pandas as pd
examples = {
    "request": [
        {
            # Recommended `messages` format
            "messages": [{
                "role": "user",
                "content": "What is Spark?"
            }],
        },
        # SplitChatMessagesRequest format
        {
            "query": "How do I convert a Spark DataFrame to Pandas?",
            "history": [
                {"role": "user", "content": "What is Spark?"},
                {"role": "assistant", "content": "Spark is a data processing engine."},
            ],
        }
        # Note: Using a primitive string is discouraged. The string will be wrapped in the
        # OpenAI messages format before being passed to your agent.
    ],
    "response": [
        "Spark is a data analytics framework.",
        "This is not possible as Spark is not a panda.",
    ],
    "retrieved_context": [  # Optional, needed for judging groundedness.
        [{"doc_uri": "doc1.txt", "content": "In 2013, Spark, a data analytics framework, was open sourced by UC Berkeley's AMPLab."}],
        [{"doc_uri": "doc2.txt", "content": "To convert a Spark DataFrame to Pandas, you can use toPandas()"}],
    ],
    "expected_response": [  # Optional, needed for judging correctness.
        "Spark is a data analytics framework.",
        "To convert a Spark DataFrame to Pandas, you can use the toPandas() method.",
    ],
    "guidelines": [
        "The response must be in English",
        "The response must be clear, coherent, and concise",
    ]
}

result = mlflow.evaluate(
    data=pd.DataFrame(examples),  # Your evaluation set
    # model=logged_model.model_uri,  # If you have an MLflow model. `retrieved_context` and `response` will be obtained from calling the model.
    model_type="databricks-agent",  # Enable Mosaic AI Agent Evaluation
)

# Review the evaluation results in the MLflow UI (see console output), or access them in place:
display(result.tables['eval_results'])
Alternatively, you can import and run the corresponding example notebook in your Databricks workspace.
The following diagram shows an overview of the inputs accepted by Agent Evaluation and the corresponding outputs it produces.
For details of the expected input for Agent Evaluation, including field names and data types, see the input schema. Some of the fields are the following:

- request: Input to the agent (the user's question or query). For example, "What is RAG?".
- response: Response generated by the agent. For example, "Retrieval augmented generation is ...".
- expected_response: (Optional) A ground truth (correct) response.
- trace: (Optional) The agent's MLflow trace, from which Agent Evaluation extracts intermediate outputs such as the retrieved context or tool calls. Alternatively, you can provide these intermediate outputs directly.
- guidelines: (Optional) A list of guidelines that the model's output is expected to adhere to.

Based on these inputs, Agent Evaluation produces two types of outputs: per-row evaluation results, including LLM judge ratings and rationales, and aggregate metrics logged to the MLflow run.
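To make the schema concrete, the sketch below assembles a single evaluation-set row from the fields described above into a pandas DataFrame. The field values are illustrative only; see the input schema for the authoritative field list and types.

```python
import pandas as pd

# A minimal sketch of one evaluation-set row using the fields described above.
# The values are illustrative placeholders, not real agent outputs.
row = {
    "request": "What is RAG?",                                # input to the agent
    "response": "Retrieval augmented generation is ...",       # agent's answer
    "expected_response": "RAG is a technique that ...",        # optional ground truth
    "guidelines": ["The response must be in English"],         # optional guidelines
}

# One row per evaluation example; pass the DataFrame as `data` to mlflow.evaluate.
eval_set = pd.DataFrame([row])
```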
Agent Evaluation is designed to be consistent between your development (offline) and production (online) environments. This design enables a smooth transition from development to production, allowing you to quickly iterate, evaluate, deploy, and monitor high-quality agentic applications.
The main difference between development and production is that in production, you do not have ground-truth labels, while in development, you may optionally use ground-truth labels. Using ground-truth labels allows Agent Evaluation to compute additional quality metrics.
In development, your requests and expected_responses come from an evaluation set. An evaluation set is a collection of representative inputs that your agent should be able to handle accurately. For more information about evaluation sets, see Evaluation sets.
To get response and trace, Agent Evaluation can call your agent's code to generate these outputs for each row in the evaluation set. Alternatively, you can generate these outputs yourself and pass them to Agent Evaluation. See How to provide input to an evaluation run for more information.
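The two ways of providing input can be sketched as two differently shaped DataFrames. This is a minimal illustration, assuming the column names from the example earlier in this article; the request texts are placeholders.

```python
import pandas as pd

# Option 1: provide only requests. Agent Evaluation calls your agent's code
# to generate `response` (and `trace`) for each row.
requests_only = pd.DataFrame({
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
})

# Option 2: provide previously generated outputs yourself. No agent code is
# called during evaluation; the judges score the supplied responses directly.
precomputed = pd.DataFrame({
    "request": [
        "What is Spark?",
        "How do I convert a Spark DataFrame to Pandas?",
    ],
    "response": [
        "Spark is a data analytics framework.",
        "Use the toPandas() method.",
    ],
})
```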
In production, all inputs to Agent Evaluation come from your production logs.
If you use Mosaic AI Agent Framework to deploy your AI application, Agent Evaluation can be configured to automatically collect these inputs from the Agent-enhanced inference tables and continually update a monitoring dashboard. For more details, see How to monitor the quality of your agent on production traffic.
If you deploy your agent outside of Azure Databricks, you can ETL your logs to the required input schema and similarly configure a monitoring dashboard.
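The ETL step above can be sketched with pandas. This is a hypothetical example: the source log fields (`prompt`, `completion`, `context_docs`, `uri`, `text`) are invented stand-ins for whatever your logging system actually records, and only the target column names (`request`, `response`, `retrieved_context`, `doc_uri`, `content`) follow the input schema used earlier in this article.

```python
import pandas as pd

# Hypothetical raw log records from an external serving system.
# The field names here are illustrative, not a real log format.
raw_logs = [
    {
        "prompt": "What is Spark?",
        "completion": "Spark is a data analytics framework.",
        "context_docs": [{"uri": "doc1.txt", "text": "Spark was open sourced by UC Berkeley's AMPLab."}],
    },
]

# Map each log record onto the evaluation input schema.
rows = [
    {
        "request": log["prompt"],
        "response": log["completion"],
        "retrieved_context": [
            {"doc_uri": d["uri"], "content": d["text"]}
            for d in log["context_docs"]
        ],
    }
    for log in raw_logs
]

eval_input = pd.DataFrame(rows)
```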
To measure the quality of an AI application in development (offline), you need to define an evaluation set, that is, a set of representative questions and optional ground-truth answers. If the application involves a retrieval step, like in RAG workflows, then you can optionally provide supporting documents that you expect the response to be based on.
For details about how to run an evaluation, see How to run an evaluation and view the results. Agent Evaluation supports two options for providing output from the chain: you can run the agent's code as part of the evaluation run, or you can provide previously generated outputs.
For details and explanation of when to use each option, see How to provide input to an evaluation run.
The Databricks review app makes it easy to gather feedback about the quality of an AI application from human reviewers. For details, see Get feedback about the quality of an agentic application.
Mosaic AI Agent Evaluation is a Designated Service that uses Geos to manage data residency when processing customer content. To learn more about the availability of Agent Evaluation in different geographic areas, see Databricks Designated Services.
For pricing information, see Mosaic AI Agent Evaluation pricing.