Enable measurement: Supporting infrastructure

This article details the infrastructure needed for measuring quality and how Databricks provides it. Measuring quality is not easy and requires a significant infrastructure investment.

Detailed trace logging

The core of your RAG application’s logic is a series of steps in the chain. To evaluate and debug quality, you need to implement instrumentation that tracks the chain’s inputs and outputs, along with each step of the chain, and its associated inputs and outputs. The instrumentation you put in place should work the same way in development and production.

In Databricks, MLflow Tracing provides this capability. With MLflow Trace Logging, you instrument your code in production, and get the same traces during development and in production. Production traces are logged as part of the Inference Table.

Stakeholder review UI

Most often, as a developer, you are not a domain expert in the content of the application you are developing. In order to collect feedback from human experts who can assess your application’s output quality, you need an interface that allows them to interact with early versions of the application and provide detailed feedback. Further, you need a way to load specific application outputs for the stakeholders to assess their quality.

This interface must track the application’s outputs and associated feedback in a structured manner, storing the full application trace and detailed feedback in a data table.

In Databricks, the Agent Evaluation Review App provides this capability.

Quality, cost, and latency metric framework

You need a way to define the metrics that comprehensively measure the quality of each component of your chain and the end-to-end application. Ideally, the framework would provide a suite of standard metrics out of the box, in addition to supporting customization, so you can add metrics that test specific aspects of quality that are unique to your business.

In Databricks, Agent Evaluation provides an out-of-the-box implementation, using hosted LLM judge models, for the necessary quality, cost, and latency metrics.

Evaluation harness

You need a way to quickly and efficiently get outputs from your chain for every question in your evaluation set, and then evaluate each output on the relevant metrics. This harness must be as efficient as possible, since you will run evaluation after every experiment that you try to improve quality.

In Databricks, Agent Evaluation provides an evaluation harness that is integrated with MLflow.

Evaluation set management

Your evaluation set is a living, breathing set of questions that you will update iteratively over the course of your application’s development and production lifecycle.

In Databricks, you can manage your evaluation set as a Delta Table. When evaluating with MLflow, MLflow will automatically log a snapshot of the version of the evaluation set used.

Experiment tracking framework

During the course of your application development, you will try many different experiments. An experiment tracking framework enables you to log each experiment and track its metrics vs. other experiments.

In Databricks, MLflow provides experiment tracking capabilities.

Chain parameterization framework

Many experiments you try require you to hold the chain’s code constant while iterating on various parameters used by the code. You need a framework that enables you to do this.

In Databricks, MLflow model configuration provides these capabilities.

Online monitoring

Once deployed, you need a way monitor the application’s health and on-going quality, cost, and latency.

In Databricks, Model Serving provides application health monitoring and Lakehouse Monitoring provides on-going outputs to a dashboard and monitors quality, cost, and latency.