Tracing observability in production

MLflow Tracing provides complete observability for production GenAI applications deployed outside of Databricks by capturing execution details and sending them to your Databricks workspace, where you can view them in the MLflow UI.

MLflow production tracing overview

How production tracing works (see the minimal sketch after this list):

  1. Your application generates traces - every API call creates trace data
  2. Traces are logged to the Databricks MLflow Tracking Server - using your workspace credentials
  3. View them in the MLflow UI - analyze the trace data in your Databricks workspace
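The sketch below illustrates that flow, assuming the environment variables described under Basic tracing configuration are already set; the function and its return value are placeholders for your own application logic.

import mlflow

# Assumes DATABRICKS_HOST, DATABRICKS_TOKEN, MLFLOW_TRACKING_URI=databricks, and
# MLFLOW_EXPERIMENT_NAME are already set (see "Basic tracing configuration" below).

@mlflow.trace  # 1. Every call to this function produces a trace
def answer_question(question: str) -> str:
    # Placeholder for your real GenAI logic (LLM call, retrieval, etc.)
    return f"You asked: {question}"

# 2. Calling the function logs the trace to the Databricks MLflow Tracking Server
answer_question("What is MLflow Tracing?")

# 3. Open the configured experiment in the MLflow UI to view the trace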

This page covers tracing for applications deployed outside of Databricks. If your application is deployed with Databricks Model Serving, see Tracing with Databricks Model Serving.

Prerequisites

Note

Production tracing requires MLflow 3. MLflow 2.x is not supported for production tracing.

Install the required packages. The following table describes your options:

| Topic | mlflow-tracing | mlflow[databricks] |
| --- | --- | --- |
| Recommended use case | Production deployments | Development and experimentation |
| Benefits | Minimal dependencies for lean, fast deployments. Performance optimized for high-volume tracing. Focused on client-side tracing for production monitoring. | Full MLflow experimentation feature set (including the UI, LLM-as-a-judge, development tools, and more). Includes all development tools and utilities. |
## Install mlflow-tracing for production deployment tracing
%pip install --upgrade mlflow-tracing

## Install mlflow for experimentation and development
%pip install --upgrade "mlflow[databricks]>=3.1"

Basic tracing configuration

Configure your application deployment to connect to your Databricks workspace so that Databricks can collect traces.

Set the following environment variables:

# Required: Set the Databricks workspace host and authentication token
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-databricks-token"

# Required: Set MLflow Tracking URI to "databricks" to log to Databricks
export MLFLOW_TRACKING_URI=databricks

# Required: Configure the experiment name for organizing traces (must be a workspace path)
export MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app"
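If you prefer to configure the connection in code (for example in a small script or test harness) instead of relying only on environment variables, the equivalent programmatic setup is sketched below; the experiment path is the same placeholder used above.

import mlflow

# Equivalent to MLFLOW_TRACKING_URI=databricks
mlflow.set_tracking_uri("databricks")

# Equivalent to MLFLOW_EXPERIMENT_NAME; must be a workspace path
mlflow.set_experiment("/Shared/production-genai-app")

# DATABRICKS_HOST and DATABRICKS_TOKEN still need to be available to the process,
# for example as environment variables or through a Databricks configuration profile.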

Deployment examples

After setting the environment variables, pass them to your application. The following examples show how to pass the connection details to different deployment frameworks.

Docker

For Docker deployments, pass the environment variables through the container configuration:

# Dockerfile
FROM python:3.9-slim

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application code
COPY . /app
WORKDIR /app

# Set default environment variables (can be overridden at runtime)
ENV DATABRICKS_HOST=""
ENV DATABRICKS_TOKEN=""
ENV MLFLOW_TRACKING_URI=databricks
ENV MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app"

CMD ["python", "app.py"]

Run the container with the environment variables:

docker run -d \
  -e DATABRICKS_HOST="https://your-workspace.cloud.databricks.com" \
  -e DATABRICKS_TOKEN="your-databricks-token" \
  -e MLFLOW_TRACKING_URI=databricks \
  -e MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app" \
  -e APP_VERSION="1.0.0" \
  your-app:latest

Kubernetes

For Kubernetes deployments, use ConfigMaps and Secrets to pass the environment variables:

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: databricks-config
data:
  DATABRICKS_HOST: 'https://your-workspace.cloud.databricks.com'
  MLFLOW_TRACKING_URI: databricks
  MLFLOW_EXPERIMENT_NAME: '/Shared/production-genai-app'

---
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: databricks-secrets
type: Opaque
stringData:
  DATABRICKS_TOKEN: 'your-databricks-token'

---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: your-app:latest
          envFrom:
            - configMapRef:
                name: databricks-config
            - secretRef:
                name: databricks-secrets
          env:
            - name: APP_VERSION
              value: '1.0.0'
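However the variables are injected, it helps to validate them when the container starts. The sketch below (the helper name is illustrative) exits early instead of silently dropping traces if required configuration is missing.

import os
import sys

REQUIRED_VARS = [
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_EXPERIMENT_NAME",
]

def validate_tracing_config() -> None:
    # Fail fast at startup so misconfiguration is obvious in the container logs
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        sys.exit(f"Missing required tracing configuration: {', '.join(missing)}")

validate_tracing_config()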

Verify trace collection

After deploying your application, verify that traces are being collected correctly:

import mlflow
from mlflow.client import MlflowClient
import os

# Ensure MLflow is configured for Databricks
mlflow.set_tracking_uri("databricks")

# Check connection to MLflow server
client = MlflowClient()
try:
    # List recent experiments to verify connectivity
    experiments = client.search_experiments()
    print(f"Connected to MLflow. Found {len(experiments)} experiments.")

    # Check if traces are being logged
    traces = mlflow.search_traces(
        experiment_names=[os.getenv("MLFLOW_EXPERIMENT_NAME", "/Shared/production-genai-app")],
        max_results=5
    )
    print(f"Found {len(traces)} recent traces.")
except Exception as e:
    print(f"Error connecting to MLflow: {e}")
    print(f"Check your authentication and connectivity")

Add context to traces

Once basic tracing is working, add context for richer debugging and insights. MLflow provides the following standardized tags and attributes to capture important contextual information:

  • Request tracking - links traces to specific API calls for end-to-end debugging
  • User sessions - groups related interactions to understand user journeys
  • Environment data - tracks which deployment, version, or region produced each trace
  • User feedback - collects quality ratings and links them to specific interactions

Track request, session, and user context

Production applications need to track several pieces of context at once: client request IDs for debugging, session IDs for multi-turn conversations, user IDs for personalization and analytics, and environment metadata for operational insights. Here is a complete example that demonstrates how to track all of them in a FastAPI application:

import mlflow
import os
from fastapi import FastAPI, Request, HTTPException
from pydantic import BaseModel

# Initialize FastAPI app
app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@mlflow.trace # Ensure @mlflow.trace is the outermost decorator
@app.post("/chat") # FastAPI decorator should be inner
def handle_chat(request: Request, chat_request: ChatRequest):
    # Retrieve all context from request headers
    client_request_id = request.headers.get("X-Request-ID")
    session_id = request.headers.get("X-Session-ID")
    user_id = request.headers.get("X-User-ID")

    # Update the current trace with all context and environment metadata
    # The @mlflow.trace decorator ensures an active trace is available
    mlflow.update_current_trace(
        client_request_id=client_request_id,
        tags={
            # Session context - groups traces from multi-turn conversations
            "mlflow.trace.session": session_id,
            # User context - associates traces with specific users
            "mlflow.trace.user": user_id,
            # Environment metadata - tracks deployment context
            "environment": "production",
            "app_version": os.getenv("APP_VERSION", "1.0.0"),
            "deployment_id": os.getenv("DEPLOYMENT_ID", "unknown"),
            "region": os.getenv("REGION", "us-east-1")
        }
    )

    # --- Your application logic for processing the chat message ---
    # For example, calling a language model with context
    # response_text = my_llm_call(
    #     message=chat_request.message,
    #     session_id=session_id,
    #     user_id=user_id
    # )
    response_text = f"Processed message: '{chat_request.message}'"
    # --- End of application logic ---

    # Return response
    return {
        "response": response_text
    }

# To run this example (requires uvicorn and fastapi):
# uvicorn your_file_name:app --reload
#
# Example curl request with all context headers:
# curl -X POST "http://127.0.0.1:8000/chat" \
#      -H "Content-Type: application/json" \
#      -H "X-Request-ID: req-abc-123-xyz-789" \
#      -H "X-Session-ID: session-def-456-uvw-012" \
#      -H "X-User-ID: user-jane-doe-12345" \
#      -d '{"message": "What is my account balance?"}'

This combined approach provides several benefits:

  • Client request ID: enables end-to-end debugging by correlating traces with specific client requests across your entire system
  • Session ID (tag: mlflow.trace.session): groups traces from multi-turn conversations so you can analyze the full conversation flow
  • User ID (tag: mlflow.trace.user): associates traces with specific users for personalization, cohort analysis, and user-specific debugging
  • Environment metadata: tracks the deployment environment (environment, version, region) for operational insights and debugging across deployments

For more details on adding context to traces, see the documentation on tracking users and sessions and tracking environments and context.

Collect user feedback

Capturing user feedback on specific interactions is essential for understanding quality and improving your GenAI application. This example builds on the client request ID tracking shown in the previous section and demonstrates how to use that ID to link feedback to a specific trace.

Here is an example of implementing feedback collection in FastAPI:

import mlflow
from mlflow.client import MlflowClient
from fastapi import FastAPI, HTTPException, Query, Request
from pydantic import BaseModel
from typing import Optional
from mlflow.entities import AssessmentSource

# Initialize FastAPI app
app = FastAPI()

class FeedbackRequest(BaseModel):
    is_correct: bool  # True for correct, False for incorrect
    comment: Optional[str] = None

@app.post("/chat_feedback")
def handle_chat_feedback(
    request: Request,
    client_request_id: str = Query(..., description="The client request ID from the original chat request"),
    feedback: FeedbackRequest = ...
):
    """
    Collect user feedback for a specific chat interaction identified by client_request_id.
    """
    # Search for the trace with the matching client_request_id
    client = MlflowClient()
    # Get the experiment by name (using Databricks workspace path)
    experiment = client.get_experiment_by_name("/Shared/production-app")
    traces = client.search_traces(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"attributes.client_request_id = '{client_request_id}'",
        max_results=1
    )

    if not traces:
        # Raise an HTTPException so FastAPI returns an actual 500 status code
        raise HTTPException(
            status_code=500,
            detail=f"Unable to find data for client request ID: {client_request_id}"
        )

    # Log feedback using MLflow's log_feedback API
    mlflow.log_feedback(
        trace_id=traces[0].info.trace_id,
        name="response_is_correct",
        value=feedback.is_correct,
        source=AssessmentSource(
            source_type="HUMAN",
            source_id=request.headers.get("X-User-ID")
        ),
        rationale=feedback.comment
    )

    return {
        "status": "success",
        "message": "Feedback recorded successfully",
        "trace_id": traces[0].info.trace_id,
        "client_request_id": client_request_id,
        "feedback_by": request.headers.get("X-User-ID")
    }

# Example usage:
# After a chat interaction returns a response, the client can submit feedback:
#
# curl -X POST "http://127.0.0.1:8000/chat_feedback?client_request_id=req-abc-123-xyz-789" \
#      -H "Content-Type: application/json" \
#      -H "X-User-ID: user-jane-doe-12345" \
#      -d '{
#        "is_correct": true,
#        "comment": "The response was accurate and helpful"
#      }'

This feedback collection approach lets you:

  • Link feedback to specific interactions: use the client request ID to find the exact trace and attach feedback to it
  • Store structured feedback: the log_feedback API creates proper assessment objects that appear in the MLflow UI
  • Analyze quality patterns: query traces together with their associated feedback to identify which types of interactions receive positive or negative ratings

You can later query these traces in the MLflow UI, or analyze the patterns programmatically to improve your application.

Query traces by context

Use the context information to analyze production behavior:

import mlflow
from mlflow.client import MlflowClient
import pandas as pd

client = MlflowClient()
experiment = client.get_experiment_by_name("/Shared/production-app")

# Query traces by user
user_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.`mlflow.trace.user` = 'user-jane-doe-12345'",
    max_results=100
)

# Query traces by session
session_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.`mlflow.trace.session` = 'session-123'",
    max_results=100
)
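The same filter syntax also works for the environment metadata tags set in the FastAPI example above; for instance, the sketch below narrows results to production traffic from a specific app version (the tag names and values are the ones used earlier on this page).

import mlflow
from mlflow.client import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("/Shared/production-app")

# Query traces by environment metadata tags
production_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.environment = 'production' AND tags.app_version = '1.0.0'",
    max_results=100
)
print(f"Found {len(production_traces)} production traces for app version 1.0.0")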

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.