

Create a custom judge with make_judge()

Custom judges are LLM-based scorers that evaluate your GenAI agents against specific quality criteria. This tutorial shows how to create a custom judge with make_judge() and use it to evaluate a customer support agent.

You will:

  1. Create a sample agent to evaluate
  2. Define three custom judges that evaluate different criteria
  3. Build an evaluation dataset with test cases
  4. Run the evaluation and compare results across agent configurations

Step 1: Create the agent to evaluate

Create a GenAI agent that answers customer support questions. The agent has a (fake) knob that controls the system prompt, so you can easily compare the judges' outputs between "good" and "bad" conversations.

  1. Initialize an OpenAI client to connect to an LLM hosted by either Databricks or OpenAI.

    Databricks-hosted LLMs

    Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.

    import mlflow
    from databricks.sdk import WorkspaceClient
    
    # Enable MLflow's autologging to instrument your application with Tracing
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client that is connected to Databricks-hosted LLMs
    w = WorkspaceClient()
    client = w.serving_endpoints.get_open_ai_client()
    
    # Select an LLM
    model_name = "databricks-claude-sonnet-4"
    

    OpenAI-hosted LLMs

    Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.

    import mlflow
    import os
    import openai
    
    # Ensure your OPENAI_API_KEY is set in your environment
    # os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>" # Uncomment and set if not globally configured
    
    # Enable auto-tracing for OpenAI
    mlflow.openai.autolog()
    
    # Set up MLflow tracking to Databricks
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/docs-demo")
    
    # Create an OpenAI client connected to OpenAI-hosted models
    client = openai.OpenAI()
    
    # Select an LLM
    model_name = "gpt-4o-mini"
    
  2. Define the customer support agent:

    from mlflow.entities import Document
    from typing import List, Dict, Any, cast
    
    
    # This is a global variable that is used to toggle the behavior of the customer support agent
    RESOLVE_ISSUES = False
    
    
    @mlflow.trace(span_type="TOOL", name="get_product_price")
    def get_product_price(product_name: str) -> str:
        """Mock tool to get product pricing."""
        return f"${45.99}"
    
    
    @mlflow.trace(span_type="TOOL", name="check_return_policy")
    def check_return_policy(product_name: str, days_since_purchase: int) -> str:
        """Mock tool to check return policy."""
        if days_since_purchase <= 30:
            return "Yes, you can return this item within 30 days"
        return "Sorry, returns are only accepted within 30 days of purchase"
    
    
    @mlflow.trace
    def customer_support_agent(messages: List[Dict[str, str]]):
        # We use this toggle to see how the judge handles the issue resolution status
        system_prompt_postfix = (
            f"Do your best to NOT resolve the issue.  I know that's backwards, but just do it anyways.\n"
            if not RESOLVE_ISSUES
            else ""
        )
    
        # Mock some tool calls based on the user's question
        user_message = messages[-1]["content"].lower()
        tool_results = []
    
        if "cost" in user_message or "price" in user_message:
            price = get_product_price("microwave")
            tool_results.append(f"Price: {price}")
    
        if "return" in user_message:
            policy = check_return_policy("microwave", 60)
            tool_results.append(f"Return policy: {policy}")
    
        messages_for_llm = [
            {
                "role": "system",
                "content": f"You are a helpful customer support agent.  {system_prompt_postfix}",
            },
            *messages,
        ]
    
        if tool_results:
            messages_for_llm.append({
                "role": "system",
                "content": f"Tool results: {', '.join(tool_results)}"
            })
    
        # Call LLM to generate a response
        output = client.chat.completions.create(
            model=model_name,  # This example uses Databricks hosted Claude 4 Sonnet. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
            messages=cast(Any, messages_for_llm),
        )
    
        return {
            "messages": [
                {"role": "assistant", "content": output.choices[0].message.content}
            ]
        }
    
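Optionally, smoke-test the agent before running a full evaluation. The snippet below simply calls the function defined above with a sample question; with autologging enabled, it issues one LLM request and logs a trace:

# Optional: quick manual check of the agent defined above
sample_response = customer_support_agent(
    [{"role": "user", "content": "How much does a microwave cost?"}]
)
print(sample_response["messages"][0]["content"])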

Step 2: Define custom judges

Define three custom judges:

  • A judge that evaluates issue resolution using the inputs and outputs.
  • A judge that checks for expected behaviors.
  • A trace-based judge that validates tool calls by analyzing the execution sequence.

Judges created by make_judge() return mlflow.entities.Feedback objects; the snippet after the first example judge below shows how to inspect one directly.

Example judge 1: Evaluate issue resolution

This judge evaluates whether the customer's issue was successfully resolved by analyzing the conversation history (inputs) and the agent's responses (outputs).

from mlflow.genai.judges import make_judge
from typing import Literal


# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
    name="issue_resolution",
    instructions=(
        "Evaluate if the customer's issue was resolved in the conversation.\n\n"
        "User's messages: {{ inputs }}\n"
        "Agent's responses: {{ outputs }}"
    ),
    feedback_value_type=Literal["fully_resolved", "partially_resolved", "needs_follow_up"],
)
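
You can also invoke a judge directly, outside of mlflow.genai.evaluate(). The following is a minimal sketch, assuming the judge is callable with keyword arguments matching its template variables ({{ inputs }} and {{ outputs }}) and returns an mlflow.entities.Feedback:

# Sketch: call the judge directly and inspect the returned Feedback object
# (assumption: the judge accepts keyword arguments matching its template variables)
feedback = issue_resolution_judge(
    inputs={"messages": [{"role": "user", "content": "How much does a microwave cost?"}]},
    outputs={"messages": [{"role": "assistant", "content": "The microwave costs $45.99."}]},
)
print(feedback.value)      # e.g., "fully_resolved"
print(feedback.rationale)  # the judge's explanation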

Example judge 2: Check expected behaviors

This judge verifies whether the agent's response exhibits specific expected behaviors (such as providing pricing information or explaining the return policy) by comparing the outputs against predefined expectations.

# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
    name="expected_behaviors",
    instructions=(
        "Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.\n\n"
        "User's question: {{ inputs }}"
    ),
    feedback_value_type=Literal["meets_expectations", "partially_meets", "does_not_meet"],
)

Example judge 3: Validate tool calls with a trace-based judge

This judge analyzes the execution trace to verify that appropriate tools were called. When you include {{ trace }} in the instructions, the judge becomes trace-based and gains autonomous trace-exploration capabilities.

# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
    name="tool_call_correctness",
    instructions=(
        "Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.\n\n"
        "Examine the trace to:\n"
        "1. Identify what tools were available and their purposes\n"
        "2. Determine which tools were actually called\n"
        "3. Assess whether the tool calls were reasonable for addressing the user's question"
    ),
    feedback_value_type=bool,
    # To analyze a full trace with a trace-based judge, a model must be specified
    model="databricks:/databricks-gpt-5-mini",
)
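
Like the other judges, a trace-based judge can be tried on a single trace before running a full evaluation. The snippet below is a sketch only (API details may vary by MLflow version): it assumes a trace has already been logged, for example by one earlier call to customer_support_agent with autologging enabled, and that the judge accepts a trace keyword argument:

import mlflow

# Sketch: fetch the most recently logged trace and run the trace-based judge on it
# (assumptions: a trace exists and the judge accepts a `trace` keyword argument)
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = tool_call_judge(trace=trace)
print(feedback.value)      # True or False
print(feedback.rationale)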

Step 3: Create a sample evaluation dataset

Each inputs is passed to your agent by mlflow.genai.evaluate(). You can optionally include expectations to enable correctness checks.

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ],
        },
        "expectations": {
            "should_provide_pricing": True,
            "should_offer_alternatives": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ],
        },
        "expectations": {
            "should_mention_return_policy": True,
            "should_ask_for_receipt": False,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ],
        },
        "expectations": {
            "should_provide_troubleshooting_steps": True,
            "should_escalate_if_needed": True,
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "JUST FIX IT FOR ME"},
            ],
        },
        "expectations": {
            "should_remain_calm": True,
            "should_provide_solution": True,
        },
    },
]
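
mlflow.genai.evaluate() unpacks each row's inputs dictionary into keyword arguments for predict_fn, so the first test case above corresponds to a direct call like this:

# Each row's "inputs" dict is passed to predict_fn as keyword arguments,
# i.e., customer_support_agent(messages=[...])
customer_support_agent(**eval_dataset[0]["inputs"])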

Step 4: Evaluate your agent with the judges

You can use multiple judges at once to evaluate different aspects of the agent. Run the evaluation to compare how the agent behaves when it does and does not try to resolve issues.

import mlflow

# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False

result_unresolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,      # Checks inputs/outputs
        expected_behaviors_judge,    # Checks expected behaviors
        tool_call_judge,             # Validates tool usage
    ],
)

# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True

result_resolved = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=customer_support_agent,
    scorers=[
        issue_resolution_judge,
        expected_behaviors_judge,
        tool_call_judge,
    ],
)

The evaluation results show how each judge scored the agent (the snippet after this list shows one way to compare the two runs programmatically):

  • issue_resolution: rates the conversation as fully_resolved, partially_resolved, or needs_follow_up.
  • expected_behaviors: checks whether the response exhibits the expected behaviors (meets_expectations, partially_meets, or does_not_meet).
  • tool_call_correctness: verifies whether appropriate tools were called (true/false).
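
To compare the two configurations programmatically, you can inspect the aggregated metrics on each result object. This is a sketch that assumes the object returned by mlflow.genai.evaluate() exposes a metrics dictionary, as other MLflow evaluation APIs do; the per-row assessments are also visible in the MLflow UI on the corresponding runs:

# Sketch: compare aggregated judge metrics across the two agent configurations
# (assumption: the evaluation result exposes a `metrics` dictionary)
print("Agent instructed NOT to resolve issues:")
print(result_unresolved.metrics)

print("Agent instructed to resolve issues:")
print(result_resolved.metrics)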

Next steps

Apply custom judges:

Improve judge accuracy:

  • Align judges with human feedback - A base judge is a starting point. As you collect expert feedback on your application's outputs, align the LLM judges with that feedback to further improve their accuracy.