Tracing observability in production

MLflow Tracing provides complete observability for production GenAI applications deployed outside of Databricks by capturing execution details and sending them to your Databricks workspace, where you can view them in the MLflow UI.

MLflow production tracing overview

How production tracing works (see the minimal sketch after this list):

  1. Your application generates traces - every API call creates trace data
  2. Traces are logged to the Databricks MLflow Tracking Server - using your workspace credentials
  3. View them in the MLflow UI - analyze the trace data in your Databricks workspace
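The sketch below illustrates that flow, assuming the environment variables described under Basic tracing configuration are already set; the function and its return value are placeholders for your own application logic.

import mlflow

# Assumes DATABRICKS_HOST, DATABRICKS_TOKEN, MLFLOW_TRACKING_URI=databricks, and
# MLFLOW_EXPERIMENT_NAME are already set (see "Basic tracing configuration" below).

@mlflow.trace  # 1. Every call to this function produces a trace
def answer_question(question: str) -> str:
    # Placeholder for your real GenAI logic (LLM call, retrieval, etc.)
    return f"You asked: {question}"

# 2. Calling the function logs the trace to the Databricks MLflow Tracking Server
answer_question("What is MLflow Tracing?")

# 3. Open the configured experiment in the MLflow UI to view the trace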

This page covers tracing for applications deployed outside of Databricks. If your application is deployed with Databricks Model Serving, see Tracing with Databricks Model Serving.

Prerequisites

Note

Production tracing requires MLflow 3. MLflow 2.x is not supported for production tracing.

Install the required packages. The following table describes your options:

| Topic | mlflow-tracing | mlflow[databricks] |
| --- | --- | --- |
| Recommended use case | Production deployments | Development and experimentation |
| Benefits | Minimal dependencies for lean, fast deployments. Performance optimized for high-volume tracing. Focused on client-side tracing for production monitoring. | Full MLflow experimentation feature set (including the UI, LLM-as-a-judge, development tools, and more). Includes all development tools and utilities. |
## Install mlflow-tracing for production deployment tracing
%pip install --upgrade mlflow-tracing

## Install mlflow for experimentation and development
%pip install --upgrade "mlflow[databricks]>=3.1"

Basic tracing configuration

Configure your application deployment to connect to your Databricks workspace so that Databricks can collect traces.

Set the following environment variables:

# Required: Set the Databricks workspace host and authentication token
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-databricks-token"

# Required: Set MLflow Tracking URI to "databricks" to log to Databricks
export MLFLOW_TRACKING_URI=databricks

# Required: Configure the experiment name for organizing traces (must be a workspace path)
export MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app"
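If you prefer to configure the connection in code (for example in a small script or test harness) instead of relying only on environment variables, the equivalent programmatic setup is sketched below; the experiment path is the same placeholder used above.

import mlflow

# Equivalent to MLFLOW_TRACKING_URI=databricks
mlflow.set_tracking_uri("databricks")

# Equivalent to MLFLOW_EXPERIMENT_NAME; must be a workspace path
mlflow.set_experiment("/Shared/production-genai-app")

# DATABRICKS_HOST and DATABRICKS_TOKEN still need to be available to the process,
# for example as environment variables or through a Databricks configuration profile.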

Deployment examples

After setting the environment variables, pass them to your application. The following examples show how to pass the connection details to different deployment frameworks.

Docker

For Docker deployments, pass the environment variables through the container configuration:

# Dockerfile
FROM python:3.9-slim

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy application code
COPY . /app
WORKDIR /app

# Set default environment variables (can be overridden at runtime)
ENV DATABRICKS_HOST=""
ENV DATABRICKS_TOKEN=""
ENV MLFLOW_TRACKING_URI=databricks
ENV MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app"

CMD ["python", "app.py"]

Run the container with the environment variables:

docker run -d \
  -e DATABRICKS_HOST="https://your-workspace.cloud.databricks.com" \
  -e DATABRICKS_TOKEN="your-databricks-token" \
  -e MLFLOW_TRACKING_URI=databricks \
  -e MLFLOW_EXPERIMENT_NAME="/Shared/production-genai-app" \
  -e APP_VERSION="1.0.0" \
  your-app:latest

Kubernetes

For Kubernetes deployments, use ConfigMaps and Secrets to pass the environment variables:

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: databricks-config
data:
  DATABRICKS_HOST: 'https://your-workspace.cloud.databricks.com'
  MLFLOW_TRACKING_URI: databricks
  MLFLOW_EXPERIMENT_NAME: '/Shared/production-genai-app'

---
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: databricks-secrets
type: Opaque
stringData:
  DATABRICKS_TOKEN: 'your-databricks-token'

---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: your-app:latest
          envFrom:
            - configMapRef:
                name: databricks-config
            - secretRef:
                name: databricks-secrets
          env:
            - name: APP_VERSION
              value: '1.0.0'
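However the variables are injected, it helps to validate them when the container starts. The sketch below (the helper name is illustrative) exits early instead of silently dropping traces if required configuration is missing.

import os
import sys

REQUIRED_VARS = [
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "MLFLOW_TRACKING_URI",
    "MLFLOW_EXPERIMENT_NAME",
]

def validate_tracing_config() -> None:
    # Fail fast at startup so misconfiguration is obvious in the container logs
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    if missing:
        sys.exit(f"Missing required tracing configuration: {', '.join(missing)}")

validate_tracing_config()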

Verify trace collection

After deploying your application, verify that traces are being collected correctly:

import mlflow
from mlflow.client import MlflowClient
import os

# Ensure MLflow is configured for Databricks
mlflow.set_tracking_uri("databricks")

# Check connection to MLflow server
client = MlflowClient()
try:
    # List recent experiments to verify connectivity
    experiments = client.search_experiments()
    print(f"Connected to MLflow. Found {len(experiments)} experiments.")

    # Check if traces are being logged
    traces = mlflow.search_traces(
        experiment_names=[os.getenv("MLFLOW_EXPERIMENT_NAME", "/Shared/production-genai-app")],
        max_results=5
    )
    print(f"Found {len(traces)} recent traces.")
except Exception as e:
    print(f"Error connecting to MLflow: {e}")
    print(f"Check your authentication and connectivity")

Add context to traces

Once basic tracing is working, add context for richer debugging and insights. MLflow provides the following standardized tags and attributes to capture important contextual information:

  • Request tracking - links traces to specific API calls for end-to-end debugging
  • User sessions - groups related interactions to understand user journeys
  • Environment data - tracks which deployment, version, or region produced each trace
  • User feedback - collects quality ratings and links them to specific interactions

Track request, session, and user context

Production applications need to track several pieces of context at once: client request IDs for debugging, session IDs for multi-turn conversations, user IDs for personalization and analytics, and environment metadata for operational insights. Here is a complete example that demonstrates how to track all of them in a FastAPI application:

import mlflow
import os
from fastapi import FastAPI, Request, HTTPException
from pydantic import BaseModel

# Initialize FastAPI app
app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@mlflow.trace # Ensure @mlflow.trace is the outermost decorator
@app.post("/chat") # FastAPI decorator should be inner
def handle_chat(request: Request, chat_request: ChatRequest):
    # Retrieve all context from request headers
    client_request_id = request.headers.get("X-Request-ID")
    session_id = request.headers.get("X-Session-ID")
    user_id = request.headers.get("X-User-ID")

    # Update the current trace with all context and environment metadata
    # The @mlflow.trace decorator ensures an active trace is available
    mlflow.update_current_trace(
        client_request_id=client_request_id,
        tags={
            # Session context - groups traces from multi-turn conversations
            "mlflow.trace.session": session_id,
            # User context - associates traces with specific users
            "mlflow.trace.user": user_id,
            # Environment metadata - tracks deployment context
            "environment": "production",
            "app_version": os.getenv("APP_VERSION", "1.0.0"),
            "deployment_id": os.getenv("DEPLOYMENT_ID", "unknown"),
            "region": os.getenv("REGION", "us-east-1")
        }
    )

    # --- Your application logic for processing the chat message ---
    # For example, calling a language model with context
    # response_text = my_llm_call(
    #     message=chat_request.message,
    #     session_id=session_id,
    #     user_id=user_id
    # )
    response_text = f"Processed message: '{chat_request.message}'"
    # --- End of application logic ---

    # Return response
    return {
        "response": response_text
    }

# To run this example (requires uvicorn and fastapi):
# uvicorn your_file_name:app --reload
#
# Example curl request with all context headers:
# curl -X POST "http://127.0.0.1:8000/chat" \
#      -H "Content-Type: application/json" \
#      -H "X-Request-ID: req-abc-123-xyz-789" \
#      -H "X-Session-ID: session-def-456-uvw-012" \
#      -H "X-User-ID: user-jane-doe-12345" \
#      -d '{"message": "What is my account balance?"}'

This combined approach provides several benefits:

  • Client request ID: enables end-to-end debugging by correlating traces with specific client requests across your entire system
  • Session ID (tag: mlflow.trace.session): groups traces from multi-turn conversations so you can analyze the full conversation flow
  • User ID (tag: mlflow.trace.user): associates traces with specific users for personalization, cohort analysis, and user-specific debugging
  • Environment metadata: tracks the deployment environment (environment, version, region) for operational insights and debugging across deployments

For more details on adding context to traces, see the documentation on tracking users and sessions and tracking environments and context.

Collect user feedback

Capturing user feedback on specific interactions is essential for understanding quality and improving your GenAI application. This example builds on the client request ID tracking shown in the previous section and demonstrates how to use that ID to link feedback to a specific trace.

Here is an example of implementing feedback collection in FastAPI:

import mlflow
from mlflow.client import MlflowClient
from fastapi import FastAPI, HTTPException, Query, Request
from pydantic import BaseModel
from typing import Optional
from mlflow.entities import AssessmentSource

# Initialize FastAPI app
app = FastAPI()

class FeedbackRequest(BaseModel):
    is_correct: bool  # True for correct, False for incorrect
    comment: Optional[str] = None

@app.post("/chat_feedback")
def handle_chat_feedback(
    request: Request,
    client_request_id: str = Query(..., description="The client request ID from the original chat request"),
    feedback: FeedbackRequest = ...
):
    """
    Collect user feedback for a specific chat interaction identified by client_request_id.
    """
    # Search for the trace with the matching client_request_id
    client = MlflowClient()
    # Get the experiment by name (using Databricks workspace path)
    experiment = client.get_experiment_by_name("/Shared/production-app")
    traces = client.search_traces(
        experiment_ids=[experiment.experiment_id],
        filter_string=f"attributes.client_request_id = '{client_request_id}'",
        max_results=1
    )

    if not traces:
        # Raise an HTTPException so FastAPI returns an actual 500 status code
        raise HTTPException(
            status_code=500,
            detail=f"Unable to find data for client request ID: {client_request_id}"
        )

    # Log feedback using MLflow's log_feedback API
    mlflow.log_feedback(
        trace_id=traces[0].info.trace_id,
        name="response_is_correct",
        value=feedback.is_correct,
        source=AssessmentSource(
            source_type="HUMAN",
            source_id=request.headers.get("X-User-ID")
        ),
        rationale=feedback.comment
    )

    return {
        "status": "success",
        "message": "Feedback recorded successfully",
        "trace_id": traces[0].info.trace_id,
        "client_request_id": client_request_id,
        "feedback_by": request.headers.get("X-User-ID")
    }

# Example usage:
# After a chat interaction returns a response, the client can submit feedback:
#
# curl -X POST "http://127.0.0.1:8000/chat_feedback?client_request_id=req-abc-123-xyz-789" \
#      -H "Content-Type: application/json" \
#      -H "X-User-ID: user-jane-doe-12345" \
#      -d '{
#        "is_correct": true,
#        "comment": "The response was accurate and helpful"
#      }'

This feedback collection approach lets you:

  • Link feedback to specific interactions: use the client request ID to find the exact trace and attach feedback to it
  • Store structured feedback: the log_feedback API creates proper assessment objects that appear in the MLflow UI
  • Analyze quality patterns: query traces together with their associated feedback to identify which types of interactions receive positive or negative ratings

You can later query these traces in the MLflow UI, or analyze the patterns programmatically to improve your application.

Query traces by context

Use the context information to analyze production behavior:

import mlflow
from mlflow.client import MlflowClient
import pandas as pd

client = MlflowClient()
experiment = client.get_experiment_by_name("/Shared/production-app")

# Query traces by user
user_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.`mlflow.trace.user` = 'user-jane-doe-12345'",
    max_results=100
)

# Query traces by session
session_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.`mlflow.trace.session` = 'session-123'",
    max_results=100
)
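The same filter syntax also works for the environment metadata tags set in the FastAPI example above; for instance, the sketch below narrows results to production traffic from a specific app version (the tag names and values are the ones used earlier on this page).

import mlflow
from mlflow.client import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("/Shared/production-app")

# Query traces by environment metadata tags
production_traces = client.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.environment = 'production' AND tags.app_version = '1.0.0'",
    max_results=100
)
print(f"Found {len(production_traces)} production traces for app version 1.0.0")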

Next steps

Continue your journey with these recommended actions and tutorials.

Reference guides

Explore detailed documentation for the concepts and features mentioned in this guide.