Agent observability with MLflow Tracing

This article describes how to add observability to your generative AI applications with MLflow Tracing on Databricks.

What is MLflow Tracing?

MLflow Tracing provides end-to-end observability for GenAI applications, from development to deployment. Tracing is fully integrated with the Databricks GenAI toolset, capturing detailed insights across the entire development and production lifecycle.

Inline tracing captures detailed information for each step in a GenAI app

The following are the key use cases for tracing in GenAI applications:

  • Streamlined debugging: Tracing provides visibility into each step of your GenAI application, making diagnosing and resolving issues easier.

  • Offline evaluation: Tracing generates valuable data for agent evaluation, allowing you to measure and improve the quality of agents over time.

  • Production monitoring: Tracing provides visibility into agent behavior and detailed execution steps, enabling you to monitor and optimize agent performance in production.

  • Audit logs: MLflow Tracing generates comprehensive audit logs of agent actions and decisions. This is vital for ensuring compliance and supporting debugging when unexpected issues arise.

Requirements

MLflow Tracing is available on MLflow versions 2.13.0 and above. Databricks recommends installing the latest version of MLflow to access the latest features and improvements.

Python
%pip install "mlflow>=2.13.0" -qqqU
%restart_python

Automatic tracing

MLflow autologging lets you quickly instrument your agent by adding a single line to your code, mlflow.<library>.autolog().

MLflow supports autologging for most popular agent authoring libraries. For more information about each authoring library, see MLflow autologging documentation:

| Library | Autologging version support | Autologging command |
|---|---|---|
| LangChain | 0.1.0 ~ Latest | mlflow.langchain.autolog() |
| LangGraph | 0.1.1 ~ Latest | mlflow.langgraph.autolog() |
| OpenAI | 1.0.0 ~ Latest | mlflow.openai.autolog() |
| LlamaIndex | 0.10.44 ~ Latest | mlflow.llamaindex.autolog() |
| DSPy | 2.5.17 ~ Latest | mlflow.dspy.autolog() |
| Amazon Bedrock | 1.33.0 ~ Latest (boto3) | mlflow.bedrock.autolog() |
| Anthropic | 0.30.0 ~ Latest | mlflow.anthropic.autolog() |
| AutoGen | 0.2.36 ~ 0.2.40 | mlflow.autogen.autolog() |
| Google Gemini | 1.0.0 ~ Latest | mlflow.gemini.autolog() |
| CrewAI | 0.80.0 ~ Latest | mlflow.crewai.autolog() |
| LiteLLM | 1.52.9 ~ Latest | mlflow.litellm.autolog() |
| Groq | 0.13.0 ~ Latest | mlflow.groq.autolog() |
| Mistral | 1.0.0 ~ Latest | mlflow.mistral.autolog() |
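
For example, the following is a minimal sketch of enabling OpenAI autologging; the model name and prompt are illustrative. Every chat completion call made after autologging is enabled is traced automatically.

Python
import mlflow
from openai import OpenAI

# Enable autologging for the OpenAI SDK. Each chat completion request and
# response after this call is captured as a trace automatically.
mlflow.openai.autolog()

client = OpenAI()
response = client.chat.completions.create(
  model="gpt-4o-mini",  # illustrative model name
  messages=[{"role": "user", "content": "What is MLflow Tracing?"}],
)
print(response.choices[0].message.content)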

Disable autologging

Autologging tracing is enabled by default in Databricks Runtime 15.4 ML and above for the following libraries:

  • LangChain
  • LangGraph
  • OpenAI
  • LlamaIndex

To disable autologging tracing for these libraries, run the following command in a notebook:

Python
mlflow.<library>.autolog(log_traces=False)
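
For example, mlflow.langchain.autolog(log_traces=False) disables trace autologging for LangChain.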

Add traces manually

While autologging provides a convenient way to instrument agents, you may want to instrument your agent more granularly or add additional traces that autologging doesn't capture. In these cases, use MLflow Tracing APIs to manually add traces.

MLflow Tracing APIs are low-code APIs for adding traces without worrying about managing the tree structure of the trace. MLflow determines the appropriate parent-child span relationships automatically using the Python stack.
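
As a minimal sketch of this behavior, decorating two functions with @mlflow.trace (described in the sections that follow), where one calls the other, produces a parent span with a nested child span. The function names below are hypothetical.

Python
import mlflow

@mlflow.trace
def retrieve_context(query: str) -> str:
  return f"Context for: {query}"

@mlflow.trace
def answer_question(query: str) -> str:
  # retrieve_context() runs while the answer_question span is active, so MLflow
  # nests its span under answer_question automatically.
  context = retrieve_context(query)
  return f"Answer based on: {context}"

answer_question("What is MLflow Tracing?")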

Combine autologging and manual tracing

Manual tracing APIs can be used with autologging. MLflow combines the spans created by autologging and manual tracing to create a complete trace of your agent execution. For an example of combining autologging and manual tracing, see Instrumenting a tool calling agent with MLflow Tracing.

Trace functions using the @mlflow.trace decorator

The simplest way to manually instrument your code is to decorate a function with the @mlflow.trace decorator. The MLflow trace decorator creates a "span" with the scope of the decorated function, which represents a unit of execution in a trace and is displayed as a single row in the trace visualization. The span captures the input and output of the function, latency, and any exceptions raised from the function.

For example, the following code creates a span named add that captures the input arguments x and y and the function's output.

Python
import mlflow

@mlflow.trace
def add(x: int, y: int) -> int:
  return x + y

You can also customize the span name, span type, and add custom attributes to the span:

Python
from mlflow.entities import SpanType

@mlflow.trace(
  # By default, the function name is used as the span name. You can override it with the `name` parameter.
  name="my_add_function",
  # Specify the span type using the `span_type` parameter.
  span_type=SpanType.TOOL,
  # Add custom attributes to the span using the `attributes` parameter. By default, MLflow only captures input and output.
  attributes={"key": "value"}
)
def add(x: int, y: int) -> int:
  return x + y

Trace arbitrary code blocks using context manager

To create a span for an arbitrary block of code, not just a function, use mlflow.start_span() as a context manager that wraps the code block. The span starts when the context is entered and ends when the context is exited. Set the span inputs and outputs manually using the setter methods of the span object yielded by the context manager. For more information, see MLflow documentation - context handler.

Python
import mlflow

x, y = 1, 2  # example inputs

with mlflow.start_span(name="my_span") as span:
  span.set_inputs({"x": x, "y": y})
  result = x + y
  span.set_outputs(result)
  span.set_attribute("key", "value")

Lower-level tracing libraries

MLflow also provides low-level APIs for explicitly controlling the trace tree structure. See MLflow documentation - Manual Instrumentation.
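
The following is a rough sketch of the low-level client API, assuming MLflow 2.x; the span names and values are illustrative, and the linked documentation describes the exact signatures.

Python
from mlflow import MlflowClient

client = MlflowClient()

# Start a trace. This creates and returns the root span.
root_span = client.start_trace("my_trace")

# Create a child span explicitly attached to the root span.
child_span = client.start_span(
  "child_span",
  request_id=root_span.request_id,
  parent_id=root_span.span_id,
  inputs={"x": 1},
)

# End the child span, then end the trace by closing the root span.
client.end_span(
  request_id=child_span.request_id,
  span_id=child_span.span_id,
  outputs={"result": 2},
)
client.end_trace(request_id=root_span.request_id, outputs={"result": 2})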

Tracing example: Combine autologging and manual traces

The following example combines OpenAI autologging and manual tracing to fully instrument a tool-calling agent.

Python
import json
from openai import OpenAI
import mlflow
from mlflow.entities import SpanType

client = OpenAI()

# Enable OpenAI autologging to capture LLM API calls
# (Not necessary if you are using Databricks Runtime 15.4 ML and above, where OpenAI autologging is enabled by default)
mlflow.openai.autolog()

# Define the tool function. Decorate it with `@mlflow.trace` to create a span for its execution.
@mlflow.trace(span_type=SpanType.TOOL)
def get_weather(city: str) -> str:
  if city == "Tokyo":
    return "sunny"
  elif city == "Paris":
    return "rainy"
  return "unknown"


tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
      },
    },
  }
]

_tool_functions = {"get_weather": get_weather}

# Define a simple tool-calling agent
@mlflow.trace(span_type=SpanType.AGENT)
def run_tool_agent(question: str):
  messages = [{"role": "user", "content": question}]

  # Invoke the model with the given question and available tools
  response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=tools,
  )
  ai_msg = response.choices[0].message
  messages.append(ai_msg)

  # If the model requests tool calls, invoke the function(s) with the specified arguments
  if tool_calls := ai_msg.tool_calls:
    for tool_call in tool_calls:
      function_name = tool_call.function.name
      if tool_func := _tool_functions.get(function_name):
        args = json.loads(tool_call.function.arguments)
        tool_result = tool_func(**args)
      else:
        raise RuntimeError("An invalid tool is returned from the assistant!")

      messages.append(
        {
          "role": "tool",
          "tool_call_id": tool_call.id,
          "content": tool_result,
        }
      )

    # Send the tool results to the model and get a new response
    response = client.chat.completions.create(
      model="gpt-4o-mini", messages=messages
    )

  return response.choices[0].message.content

# Run the tool calling agent
question = "What's the weather like in Paris today?"
answer = run_tool_agent(question)

Annotate traces with tags

MLflow trace tags are key-value pairs that let you add custom metadata to traces, such as a conversation ID, a user ID, or a Git commit hash. Tags are displayed in the MLflow UI, where you can use them to filter and search traces.

Tags can be set on an ongoing or completed trace using MLflow APIs or the MLflow UI. The following example demonstrates adding a tag to an ongoing trace using the mlflow.update_current_trace() API.

Python
@mlflow.trace
def my_func(x):
    mlflow.update_current_trace(tags={"fruit": "apple"})
    return x + 1
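
To tag a trace that has already completed, you can use the MLflow client. The following is a minimal sketch, assuming you have the trace's request ID (the placeholder value below is hypothetical):

Python
from mlflow import MlflowClient

client = MlflowClient()
# Set a tag on a completed trace, identified by its request ID.
client.set_trace_tag(request_id="<trace_request_id>", key="fruit", value="apple")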

To learn more about tagging traces and how to use them to filter and search traces, see MLflow documentation - Setting Trace Tags.

Review traces

To review traces after running the agent, use one of the following options:

  • In-line visualization: In Databricks notebooks, traces are rendered inline in the cell output.
  • MLflow experiment: In Databricks, go to Experiments > Select an experiment > Traces to view and search through all the traces for an experiment.
  • MLflow run: When the agent runs under an active MLflow Run, traces appear on the Run page of the MLflow UI.
  • Agent Evaluation UI: In Mosaic AI Agent Evaluation, you can review traces for each agent execution by clicking See detailed trace view in the evaluation result.
  • Trace Search API: To programmatically retrieve traces, use the Trace Search API, as shown in the sketch after this list.
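
For example, the following is a minimal sketch of retrieving recent traces programmatically as a pandas DataFrame:

Python
import mlflow

# Retrieve the most recent traces from the current experiment as a pandas DataFrame.
traces = mlflow.search_traces(max_results=10)
display(traces)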

Evaluate agents using traces

Trace data serves as a valuable resource for evaluating your agents. By capturing detailed information about the execution of your models, MLflow Tracing is instrumental in offline evaluation. You can use the trace data to evaluate your agent's performance against a golden dataset, identify issues, and improve your agent's performance.

Python
%pip install -U mlflow databricks-agents
%restart_python
Python
import mlflow

# Get the recent 50 successful traces from the experiment
traces = mlflow.search_traces(
    max_results=50,
    filter_string="status = 'OK'",
)

traces.drop_duplicates("request", inplace=True) # Drop duplicate requests.
traces["trace"] = traces["trace"].apply(lambda x: x.to_json()) # Convert the trace to JSON format.

# Evaluate the agent with the trace data
mlflow.evaluate(data=traces, model_type="databricks-agent")

To learn more about agent evaluation, see Run an evaluation and view the results.

Monitor deployed agents with inference tables

After an agent is deployed to Mosaic AI Model Serving, you can use inference tables to monitor the agent. The inference tables contain detailed logs of requests, responses, agent traces, and agent feedback from the review app. This information lets you debug issues, monitor performance, and create a golden dataset for offline evaluation.

To enable inference tables for agent deployments, see Enable inference tables for AI agents.

Query online traces

Use a notebook to query the inference table and analyze the results.

To visualize traces, run display(<the request logs table>) and select rows to inspect:

Python
# Query the inference table
df = spark.sql("SELECT * FROM <catalog.schema.my-inference-table-name>")
display(df)

Monitor agents with dashboards

You can use online traces to create dashboards that monitor your agents in production. See How to monitor the quality of your agent on production traffic.

Trace overhead latency

Traces are written asynchronously to minimize performance impact. However, tracing still adds latency to endpoint responses, particularly when the trace size for each inference request is large. Databricks recommends testing your endpoint to understand the latency impact of tracing before deploying to production.

The following table provides rough estimates for latency impact by trace size:

| Trace size per request | Impact on response latency |
|---|---|
| ~10 KB | ~1 ms |
| ~1 MB | 50 to 100 ms |
| 10 MB | 150 ms or more |

Troubleshooting

For troubleshooting and common questions, see the MLflow documentation: Tracing How-to Guide and FAQ.