Create and use a custom judge
Custom judges are LLM-based scorers that evaluate your GenAI agent against specific quality criteria. This tutorial shows how to create custom judges with make_judge() and use them to evaluate a customer support agent.
You will:
- Create a sample agent to evaluate
- Define three custom judges that assess different criteria
- Build an evaluation dataset with test cases
- Run the evaluation and compare results across different agent configurations
Step 1: Create an agent to evaluate
Build a GenAI agent that answers customer support questions. The agent has a (fake) knob that controls the system prompt, so you can easily compare the judges' outputs between "good" and "bad" conversations.
Initialize an OpenAI client to connect to an LLM hosted by either Databricks or OpenAI.
Databricks-hosted LLMs
Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
Define the customer support agent:
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# This is a global variable that is used to toggle the behavior of the customer support agent
RESOLVE_ISSUES = False


@mlflow.trace(span_type="TOOL", name="get_product_price")
def get_product_price(product_name: str) -> str:
    """Mock tool to get product pricing."""
    return f"${45.99}"


@mlflow.trace(span_type="TOOL", name="check_return_policy")
def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool to check return policy."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):
    # We use this toggle to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue. I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    # Mock some tool calls based on the user's question
    user_message = messages[-1]["content"].lower()
    tool_results = []
    if "cost" in user_message or "price" in user_message:
        price = get_product_price("microwave")
        tool_results.append(f"Price: {price}")
    if "return" in user_message:
        policy = check_return_policy("microwave", 60)
        tool_results.append(f"Return policy: {policy}")

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
        },
        *messages,
    ]
    if tool_results:
        messages_for_llm.append(
            {"role": "system", "content": f"Tool results: {', '.join(tool_results)}"}
        )

    # Call LLM to generate a response
    # This example uses Databricks-hosted Claude Sonnet 4. If you provide your own OpenAI
    # credentials, replace with a valid OpenAI model, e.g. gpt-4o.
    output = client.chat.completions.create(
        model=model_name,
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
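Before moving on, you can optionally sanity-check the agent by calling it directly. The snippet below is a minimal sketch (the sample question is made up for illustration); because autologging is enabled, the call also produces a trace you can inspect in the MLflow UI.
# Optional: try the agent on a sample question to confirm it runs and emits a trace
sample_response = customer_support_agent(
    messages=[{"role": "user", "content": "How much does a microwave cost?"}]
)
print(sample_response["messages"][0]["content"])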
Step 2: Define custom judges
Define three custom judges:
- A judge that evaluates issue resolution using the inputs and outputs.
- A judge that checks for expected behaviors.
- A trace-based judge that validates tool calls by analyzing the execution sequence.
Judges created by make_judge() return mlflow.entities.Feedback objects.
Example judge 1: Evaluate issue resolution
This judge evaluates whether the customer's issue was successfully resolved by analyzing the conversation history (inputs) and the agent's responses (outputs).
from mlflow.genai.judges import make_judge
from typing import Literal
# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
name="issue_resolution",
instructions=(
"Evaluate if the customer's issue was resolved in the conversation.\n\n"
"User's messages: {{ inputs }}\n"
"Agent's responses: {{ outputs }}"
),
feedback_value_type=Literal["fully_resolved", "partially_resolved", "needs_follow_up"],
)
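Judges created this way can also be invoked directly, outside of mlflow.genai.evaluate(), which is a quick way to experiment with the instructions. The snippet below is a sketch with a made-up conversation; it assumes your MLflow version supports calling the judge with keyword arguments matching its template variables, returning an mlflow.entities.Feedback object.
# Optional sketch: call the judge directly on a sample conversation
sample_feedback = issue_resolution_judge(
    inputs={"messages": [{"role": "user", "content": "My order never arrived."}]},
    outputs={
        "messages": [
            {"role": "assistant", "content": "I have reshipped the order and refunded your shipping fee."}
        ]
    },
)
print(sample_feedback.value)      # e.g. "fully_resolved"
print(sample_feedback.rationale)  # the judge's explanation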
Example judge 2: Check expected behaviors
This judge verifies that the agent's response demonstrates specific expected behaviors (such as providing pricing information or explaining the return policy) by comparing the outputs against predefined expectations.
# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
name="expected_behaviors",
instructions=(
"Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.\n\n"
"User's question: {{ inputs }}"
),
feedback_value_type=Literal["meets_expectations", "partially_meets", "does_not_meet"],
)
Example judge 3: Validate tool calls with a trace-based judge
This judge analyzes the execution trace to verify that appropriate tools were called. When you include {{ trace }} in the instructions, the judge becomes trace-based and gains autonomous trace-exploration capabilities.
# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
name="tool_call_correctness",
instructions=(
"Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.\n\n"
"Examine the trace to:\n"
"1. Identify what tools were available and their purposes\n"
"2. Determine which tools were actually called\n"
"3. Assess whether the tool calls were reasonable for addressing the user's question"
),
feedback_value_type=bool,
# To analyze a full trace with a trace-based judge, a model must be specified
model="databricks:/databricks-gpt-5-mini",
)
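You can also try the trace-based judge on an existing trace before running a full evaluation. The snippet below is a hedged sketch: it reuses the trace produced by the sample agent call in step 1 and assumes your MLflow version supports retrieving traces with mlflow.get_trace() and calling a trace-based judge directly with a trace argument.
# Optional sketch: run the trace-based judge on the most recent trace
# (assumes a trace was already logged, e.g. by the sample agent call in step 1)
trace_id = mlflow.get_last_active_trace_id()
if trace_id:
    trace = mlflow.get_trace(trace_id)
    tool_feedback = tool_call_judge(trace=trace)
    print(tool_feedback.value, tool_feedback.rationale)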
Step 3: Create a sample evaluation dataset
Each inputs entry is passed to the agent by mlflow.genai.evaluate(). You can optionally include expectations to enable correctness checks.
eval_dataset = [
{
"inputs": {
"messages": [
{"role": "user", "content": "How much does a microwave cost?"},
],
},
"expectations": {
"should_provide_pricing": True,
"should_offer_alternatives": True,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "Can I return the microwave I bought 2 months ago?",
},
],
},
"expectations": {
"should_mention_return_policy": True,
"should_ask_for_receipt": False,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "Website"},
],
},
"expectations": {
"should_provide_troubleshooting_steps": True,
"should_escalate_if_needed": True,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "JUST FIX IT FOR ME"},
],
},
"expectations": {
"should_remain_calm": True,
"should_provide_solution": True,
},
},
]
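During evaluation, MLflow calls your predict_fn once per row and unpacks that row's inputs dict as keyword arguments. The snippet below is for illustration only; mlflow.genai.evaluate() performs this call for you.
# Conceptually, this is how the first dataset row reaches the agent:
# the "inputs" dict is unpacked into keyword arguments of predict_fn.
first_row = eval_dataset[0]
manual_output = customer_support_agent(**first_row["inputs"])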
Step 4: Evaluate your agent with the judges
You can use multiple judges at the same time to evaluate different aspects of your agent. Run the evaluation to compare the agent's behavior when it tries to resolve issues versus when it does not.
import mlflow
# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False
result_unresolved = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=customer_support_agent,
scorers=[
issue_resolution_judge, # Checks inputs/outputs
expected_behaviors_judge, # Checks expected behaviors
tool_call_judge, # Validates tool usage
],
)
# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True
result_resolved = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=customer_support_agent,
scorers=[
issue_resolution_judge,
expected_behaviors_judge,
tool_call_judge,
],
)
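Each call to mlflow.genai.evaluate() logs an MLflow run with the judge feedback attached to the generated traces, so the easiest way to compare the two configurations is in the MLflow UI. You can also inspect aggregate scores programmatically; the snippet below is a sketch, and the exact shape of the returned result object may vary across MLflow versions.
# Compare aggregate judge scores between the two configurations
# (check the returned EvaluationResult in your environment for the exact fields)
print("Agent instructed NOT to resolve issues:", result_unresolved.metrics)
print("Agent instructed to resolve issues:", result_resolved.metrics)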
The evaluation results show how each judge scored the agent:
- issue_resolution: Rates each conversation as fully_resolved, partially_resolved, or needs_follow_up.
- expected_behaviors: Checks whether the response demonstrates the expected behaviors (meets_expectations, partially_meets, or does_not_meet).
- tool_call_correctness: Validates whether appropriate tools were called (true/false).
Next steps
Apply your custom judges:
- Evaluate and improve GenAI apps - Use custom judges in an end-to-end evaluation workflow.
- Production monitoring for GenAI - Deploy custom judges for continuous quality monitoring in production.
Improve judge accuracy:
- Align judges with human feedback - A base judge is a starting point. As you collect expert feedback on your app's outputs, align the LLM judges with that feedback to further improve their accuracy.