Create and use a custom judge
Custom judges are LLM-based scorers that evaluate your GenAI agent against specific quality criteria. This tutorial shows how to create custom judges with make_judge() and use them to evaluate a customer support agent.
You will:
- Create a sample agent to evaluate
- Define three custom judges that assess different criteria
- Build an evaluation dataset with test cases
- Run the evaluation and compare results across different agent configurations
Step 1: Create an agent to evaluate
Build a GenAI agent that answers customer support questions. The agent has a (fake) knob that controls the system prompt, so you can easily compare the judges' outputs between "good" and "bad" conversations.
Initialize an OpenAI client to connect to an LLM hosted by either Databricks or OpenAI.
Databricks-hosted LLMs
Use MLflow to get an OpenAI client that connects to Databricks-hosted LLMs. Select a model from the available foundation models.
import mlflow
from databricks.sdk import WorkspaceClient

# Enable MLflow's autologging to instrument your application with Tracing
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client that is connected to Databricks-hosted LLMs
w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Select an LLM
model_name = "databricks-claude-sonnet-4"
OpenAI-hosted LLMs
Use the native OpenAI SDK to connect to OpenAI-hosted models. Select a model from the available OpenAI models.
import mlflow
import os
import openai

# Ensure your OPENAI_API_KEY is set in your environment
# os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"  # Uncomment and set if not globally configured

# Enable auto-tracing for OpenAI
mlflow.openai.autolog()

# Set up MLflow tracking to Databricks
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/docs-demo")

# Create an OpenAI client connected to OpenAI SDKs
client = openai.OpenAI()

# Select an LLM
model_name = "gpt-4o-mini"
Define the customer support agent:
from mlflow.entities import Document
from typing import List, Dict, Any, cast

# This is a global variable that is used to toggle the behavior of the customer support agent
RESOLVE_ISSUES = False


@mlflow.trace(span_type="TOOL", name="get_product_price")
def get_product_price(product_name: str) -> str:
    """Mock tool to get product pricing."""
    return f"${45.99}"


@mlflow.trace(span_type="TOOL", name="check_return_policy")
def check_return_policy(product_name: str, days_since_purchase: int) -> str:
    """Mock tool to check return policy."""
    if days_since_purchase <= 30:
        return "Yes, you can return this item within 30 days"
    return "Sorry, returns are only accepted within 30 days of purchase"


@mlflow.trace
def customer_support_agent(messages: List[Dict[str, str]]):
    # We use this toggle to see how the judge handles the issue resolution status
    system_prompt_postfix = (
        "Do your best to NOT resolve the issue. I know that's backwards, but just do it anyways.\n"
        if not RESOLVE_ISSUES
        else ""
    )

    # Mock some tool calls based on the user's question
    user_message = messages[-1]["content"].lower()
    tool_results = []
    if "cost" in user_message or "price" in user_message:
        price = get_product_price("microwave")
        tool_results.append(f"Price: {price}")
    if "return" in user_message:
        policy = check_return_policy("microwave", 60)
        tool_results.append(f"Return policy: {policy}")

    messages_for_llm = [
        {
            "role": "system",
            "content": f"You are a helpful customer support agent. {system_prompt_postfix}",
        },
        *messages,
    ]
    if tool_results:
        messages_for_llm.append(
            {"role": "system", "content": f"Tool results: {', '.join(tool_results)}"}
        )

    # Call LLM to generate a response
    # This example uses Databricks-hosted Claude Sonnet 4. If you provide your own OpenAI
    # credentials, replace with a valid OpenAI model, e.g. gpt-4o.
    output = client.chat.completions.create(
        model=model_name,
        messages=cast(Any, messages_for_llm),
    )

    return {
        "messages": [
            {"role": "assistant", "content": output.choices[0].message.content}
        ]
    }
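Before moving on, you can optionally sanity-check the agent by calling it directly. The snippet below is a minimal sketch (the sample question is made up for illustration); because autologging is enabled, the call also produces a trace you can inspect in the MLflow UI.
# Optional: try the agent on a sample question to confirm it runs and emits a trace
sample_response = customer_support_agent(
    messages=[{"role": "user", "content": "How much does a microwave cost?"}]
)
print(sample_response["messages"][0]["content"])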
Step 2: Define custom judges
Define three custom judges:
- A judge that evaluates issue resolution using the inputs and outputs.
- A judge that checks for expected behaviors.
- A trace-based judge that validates tool calls by analyzing the execution sequence.
Judges created by make_judge() return mlflow.entities.Feedback objects.
Example judge 1: Evaluate issue resolution
This judge evaluates whether the customer's issue was successfully resolved by analyzing the conversation history (inputs) and the agent's responses (outputs).
from mlflow.genai.judges import make_judge
from typing import Literal
# Create a judge that evaluates issue resolution using inputs and outputs
issue_resolution_judge = make_judge(
name="issue_resolution",
instructions=(
"Evaluate if the customer's issue was resolved in the conversation.\n\n"
"User's messages: {{ inputs }}\n"
"Agent's responses: {{ outputs }}"
),
feedback_value_type=Literal["fully_resolved", "partially_resolved", "needs_follow_up"],
)
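Judges created this way can also be invoked directly, outside of mlflow.genai.evaluate(), which is a quick way to experiment with the instructions. The snippet below is a sketch with a made-up conversation; it assumes your MLflow version supports calling the judge with keyword arguments matching its template variables, returning an mlflow.entities.Feedback object.
# Optional sketch: call the judge directly on a sample conversation
sample_feedback = issue_resolution_judge(
    inputs={"messages": [{"role": "user", "content": "My order never arrived."}]},
    outputs={
        "messages": [
            {"role": "assistant", "content": "I have reshipped the order and refunded your shipping fee."}
        ]
    },
)
print(sample_feedback.value)      # e.g. "fully_resolved"
print(sample_feedback.rationale)  # the judge's explanation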
Example judge 2: Check expected behaviors
This judge verifies that the agent's response demonstrates specific expected behaviors (such as providing pricing information or explaining the return policy) by comparing the outputs against predefined expectations.
# Create a judge that checks against expected behaviors
expected_behaviors_judge = make_judge(
name="expected_behaviors",
instructions=(
"Compare the agent's response in {{ outputs }} against the expected behaviors in {{ expectations }}.\n\n"
"User's question: {{ inputs }}"
),
feedback_value_type=Literal["meets_expectations", "partially_meets", "does_not_meet"],
)
Example judge 3: Validate tool calls with a trace-based judge
This judge analyzes the execution trace to verify that appropriate tools were called. When you include {{ trace }} in the instructions, the judge becomes trace-based and gains autonomous trace-exploration capabilities.
# Create a trace-based judge that validates tool calls from the trace
tool_call_judge = make_judge(
name="tool_call_correctness",
instructions=(
"Analyze the execution {{ trace }} to determine if the agent called appropriate tools for the user's request.\n\n"
"Examine the trace to:\n"
"1. Identify what tools were available and their purposes\n"
"2. Determine which tools were actually called\n"
"3. Assess whether the tool calls were reasonable for addressing the user's question"
),
feedback_value_type=bool,
# To analyze a full trace with a trace-based judge, a model must be specified
model="databricks:/databricks-gpt-5-mini",
)
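You can also try the trace-based judge on an existing trace before running a full evaluation. The snippet below is a hedged sketch: it reuses the trace produced by the sample agent call in step 1 and assumes your MLflow version supports retrieving traces with mlflow.get_trace() and calling a trace-based judge directly with a trace argument.
# Optional sketch: run the trace-based judge on the most recent trace
# (assumes a trace was already logged, e.g. by the sample agent call in step 1)
trace_id = mlflow.get_last_active_trace_id()
if trace_id:
    trace = mlflow.get_trace(trace_id)
    tool_feedback = tool_call_judge(trace=trace)
    print(tool_feedback.value, tool_feedback.rationale)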
Step 3: Create a sample evaluation dataset
Each inputs entry is passed to the agent by mlflow.genai.evaluate(). You can optionally include expectations to enable correctness checks.
eval_dataset = [
{
"inputs": {
"messages": [
{"role": "user", "content": "How much does a microwave cost?"},
],
},
"expectations": {
"should_provide_pricing": True,
"should_offer_alternatives": True,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "Can I return the microwave I bought 2 months ago?",
},
],
},
"expectations": {
"should_mention_return_policy": True,
"should_ask_for_receipt": False,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "Website"},
],
},
"expectations": {
"should_provide_troubleshooting_steps": True,
"should_escalate_if_needed": True,
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "JUST FIX IT FOR ME"},
],
},
"expectations": {
"should_remain_calm": True,
"should_provide_solution": True,
},
},
]
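During evaluation, MLflow calls your predict_fn once per row and unpacks that row's inputs dict as keyword arguments. The snippet below is for illustration only; mlflow.genai.evaluate() performs this call for you.
# Conceptually, this is how the first dataset row reaches the agent:
# the "inputs" dict is unpacked into keyword arguments of predict_fn.
first_row = eval_dataset[0]
manual_output = customer_support_agent(**first_row["inputs"])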
Step 4: Evaluate your agent with the judges
You can use multiple judges at the same time to evaluate different aspects of your agent. Run the evaluation to compare the agent's behavior when it tries to resolve issues versus when it does not.
import mlflow
# Evaluate with all three judges when the agent does NOT try to resolve issues
RESOLVE_ISSUES = False
result_unresolved = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=customer_support_agent,
scorers=[
issue_resolution_judge, # Checks inputs/outputs
expected_behaviors_judge, # Checks expected behaviors
tool_call_judge, # Validates tool usage
],
)
# Evaluate when the agent DOES try to resolve issues
RESOLVE_ISSUES = True
result_resolved = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=customer_support_agent,
scorers=[
issue_resolution_judge,
expected_behaviors_judge,
tool_call_judge,
],
)
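Each call to mlflow.genai.evaluate() logs an MLflow run with the judge feedback attached to the generated traces, so the easiest way to compare the two configurations is in the MLflow UI. You can also inspect aggregate scores programmatically; the snippet below is a sketch, and the exact shape of the returned result object may vary across MLflow versions.
# Compare aggregate judge scores between the two configurations
# (check the returned EvaluationResult in your environment for the exact fields)
print("Agent instructed NOT to resolve issues:", result_unresolved.metrics)
print("Agent instructed to resolve issues:", result_resolved.metrics)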
The evaluation results show how each judge scored the agent:
- issue_resolution: Rates each conversation as fully_resolved, partially_resolved, or needs_follow_up.
- expected_behaviors: Checks whether the response demonstrates the expected behaviors (meets_expectations, partially_meets, or does_not_meet).
- tool_call_correctness: Validates whether appropriate tools were called (true/false).
Next steps
Apply your custom judges:
- Evaluate and improve GenAI apps - Use custom judges in an end-to-end evaluation workflow.
- Production monitoring for GenAI - Deploy custom judges for continuous quality monitoring in production.
Improve judge accuracy:
- Align judges with human feedback - A base judge is a starting point. As you collect expert feedback on your app's outputs, align the LLM judges with that feedback to further improve their accuracy.