In MLflow Evaluation for GenAI, custom code-based scorers let you define flexible evaluation metrics for your AI agent or application. This set of examples, together with the accompanying example notebook, demonstrates many patterns for code-based scorers, with different options for inputs, outputs, implementation, and error handling.

The following image shows the outputs of several custom scorers as metrics in the MLflow UI.
Prerequisites

- Update MLflow
- Define your GenAI application
- Generate the traces used in some of the scorer examples

Update MLflow

Update `mlflow[databricks]` to the latest version for the best GenAI experience, and install `openai`, because the sample application below uses the OpenAI client.
%pip install -q --upgrade "mlflow[databricks]>=3.1" openai
dbutils.library.restartPython()
Define your GenAI application

Some of the examples below use the following GenAI application, a general-purpose question-answering assistant. The following code uses the OpenAI client to connect to Databricks-hosted LLMs.
from databricks_openai import DatabricksOpenAI
import mlflow
# Create an OpenAI client that is connected to Databricks-hosted LLMs
client = DatabricksOpenAI()
# Select an LLM
model_name = "databricks-claude-sonnet-4"
mlflow.openai.autolog()
# If running outside of Databricks, set up MLflow tracking to Databricks.
# mlflow.set_tracking_uri("databricks")
# In Databricks notebooks, the experiment defaults to the notebook experiment.
# mlflow.set_experiment("/Shared/docs-demo")
@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
# 1. Prepare messages for the LLM
messages_for_llm = [
{"role": "system", "content": "You are a helpful assistant."},
*messages,
]
# 2. Call LLM to generate a response
response = client.chat.completions.create(
        model=model_name,
messages=messages_for_llm,
)
return response.choices[0].message.content
sample_app([{"role": "user", "content": "What is the capital of France?"}])
Generate traces

The `eval_dataset` below is used by `mlflow.genai.evaluate()` to generate traces, using a placeholder scorer.
from mlflow.genai.scorers import scorer
eval_dataset = [
{
"inputs": {
"messages": [
{"role": "user", "content": "How much does a microwave cost?"},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "Can I return the microwave I bought 2 months ago?",
},
]
},
},
{
"inputs": {
"messages": [
{
"role": "user",
"content": "I'm having trouble with my account. I can't log in.",
},
{
"role": "assistant",
"content": "I'm sorry to hear that you're having trouble with your account. Are you using our website or mobile app?",
},
{"role": "user", "content": "Website"},
]
},
},
]
@scorer
def placeholder_metric() -> int:
# placeholder return value
return 1
eval_results = mlflow.genai.evaluate(
data=eval_dataset,
predict_fn=sample_app,
scorers=[placeholder_metric]
)
generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
generated_traces
The `mlflow.search_traces()` call above returns a Pandas DataFrame of traces, which is reused in several of the examples below.
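Before reusing the DataFrame, you can inspect it directly. A minimal sketch (column names such as `trace` and `assessments` are typical of recent MLflow versions but may vary):

# Peek at the trace DataFrame used by the examples below
print(generated_traces.columns.tolist())
print(f"{len(generated_traces)} traces collected")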
Example 1: Access data from the Trace

Access the complete MLflow Trace object to use its details (spans, inputs, outputs, attributes, timing) for fine-grained metric calculations.

This scorer checks whether the LLM call in the trace completed within an acceptable amount of time.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType
@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
# Search particular span type from the trace
llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]
response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # convert to seconds
max_duration = 5.0
if response_time <= max_duration:
return Feedback(
value="yes",
rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
)
else:
return Feedback(
value="no",
rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
)
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[llm_response_time_good]
)
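Beyond span timing, the `Trace` object exposes other details you can score against. The following is a minimal sketch, assuming the `trace.data.spans` attribute from MLflow 3.x; it simply counts the spans recorded for each request:

from mlflow.entities import Feedback, Trace
from mlflow.genai.scorers import scorer

@scorer
def trace_span_count(trace: Trace) -> Feedback:
    # Count every span recorded in the trace (LLM calls, tools, and so on)
    n_spans = len(trace.data.spans)
    return Feedback(value=n_spans, rationale=f"Trace contains {n_spans} spans.")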
Example 2: Wrap a predefined LLM judge

Create a custom scorer that wraps one of MLflow's built-in LLM judges. Use it to preprocess the trace data for the judge or to postprocess its feedback.

This example demonstrates how to wrap the `is_context_relevant` judge to evaluate whether the assistant's response is relevant to the user's query. Specifically, the `inputs` field for `sample_app` is a dictionary of the form `{"messages": [{"role": ..., "content": ...}, ...]}`. This scorer extracts the content of the last user message to pass to the relevance judge.
import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any
@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
last_user_message_content = None
if "messages" in inputs and isinstance(inputs["messages"], list):
for message in reversed(inputs["messages"]):
if message.get("role") == "user" and "content" in message:
last_user_message_content = message["content"]
break
if not last_user_message_content:
raise Exception("Could not extract the last user message from inputs to evaluate relevance.")
    # Call the `is_context_relevant` judge. It will return a Feedback object.
return is_context_relevant(
request=last_user_message_content,
context={"response": outputs},
)
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_relevance_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[is_message_relevant]
)
Example 3: Use expectations

Expectations are ground-truth values or labels, which are typically important for offline evaluation. When you run `mlflow.genai.evaluate()`, you can specify expectations in the `data` argument in two ways:

- `expectations` column or field: for example, if the `data` argument is a list of dictionaries or a Pandas DataFrame, each row can include an `expectations` key. The value associated with this key is passed directly to your custom scorer.
- `trace` column or field: for example, if the `data` argument is a DataFrame returned by `mlflow.search_traces()`, it includes a `trace` field that carries any `Expectation` data associated with the trace, as shown in the sketch after this list.
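For the trace-based approach, expectations are attached to traces as `Expectation` assessments. A minimal sketch, assuming the `mlflow.log_expectation()` API from MLflow 3.x (the trace ID below is a placeholder):

import mlflow

# Attach a ground-truth label to an existing trace. The expectation then travels
# with the trace returned by mlflow.search_traces().
trace_id = "tr-1234567890abcdef"  # placeholder; use a real trace ID
mlflow.log_expectation(
    trace_id=trace_id,
    name="expected_response",
    value="The capital of France is Paris.",
)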
Note

Production monitoring typically has no expectations, because you are evaluating live traffic without ground truth. If you want to use the same scorer for both offline and online evaluation, design it to handle missing expectations gracefully, as in the sketch below.
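A minimal sketch of that pattern (`graceful_keyword_check` is a hypothetical scorer, assuming that a missing `expectations` value arrives as `None`):

from typing import Any, Optional

from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback

@scorer
def graceful_keyword_check(outputs: str, expectations: Optional[dict[str, Any]] = None) -> Feedback:
    # Live traffic carries no ground truth: skip gracefully instead of failing
    if not expectations or "expected_keywords" not in expectations:
        return Feedback(value="skipped", rationale="No expectations available for this trace.")
    missing = [k for k in expectations["expected_keywords"] if k.lower() not in outputs.lower()]
    return Feedback(
        value="yes" if not missing else "no",
        rationale="All expected keywords present." if not missing else f"Missing keywords: {', '.join(missing)}",
    )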
This example also demonstrates using a custom scorer together with the predefined `Safety` scorer.
import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer, Safety
from typing import Any, List, Optional, Union
expectations_eval_dataset_list = [
{
"inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
"expectations": {
"expected_response": "2+2 equals 4.",
"expected_keywords": ["4", "four", "equals"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
"expectations": {
"expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
"expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
}
},
{
"inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
"expectations": {
"expected_response": "Hello there!",
# No keywords needed for this one, but the field can be omitted or empty
}
}
]
Example 3.1: Exact match against the expected response

This scorer checks whether the assistant's response exactly matches the `expected_response` provided in `expectations`.
@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
# Scorer can return primitive value like bool, int, float, str, etc.
return outputs == expectations["expected_response"]
exact_match_eval_results = mlflow.genai.evaluate(
data=expectations_eval_dataset_list,
predict_fn=sample_app, # sample_app is from the prerequisite section
scorers=[exact_match, Safety()] # You can include any number of scorers
)
Example 3.2: Keyword check against expectations

This scorer checks whether all of the `expected_keywords` from `expectations` appear in the assistant's response.
@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
expected_keywords = expectations.get("expected_keywords")
if expected_keywords is None:
return Feedback(value="yes", rationale="No keywords were expected in the response.")
missing_keywords = []
for keyword in expected_keywords:
if keyword.lower() not in outputs.lower():
missing_keywords.append(keyword)
if not missing_keywords:
return Feedback(value="yes", rationale="All expected keywords are present in the response.")
else:
return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")
keyword_presence_eval_results = mlflow.genai.evaluate(
data=expectations_eval_dataset_list,
predict_fn=sample_app, # sample_app is from the prerequisite section
scorers=[keyword_presence_scorer]
)
Example 4: Return multiple Feedback objects

A single scorer can return a list of `Feedback` objects, letting one scorer assess multiple quality dimensions (for example, PII, sentiment, and conciseness) at the same time.

Each `Feedback` object should have a unique `name`, which becomes the metric name in the results. See Example 8 below for details on metric naming.

This example demonstrates a scorer that returns two distinct feedbacks for each trace:

- `is_not_empty_check`: a boolean indicating whether the response content is non-empty.
- `response_char_length`: a numeric value for the character length of the response.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional
@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
feedbacks = []
# 1. Check if the response is not empty
feedbacks.append(
Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
)
# 2. Calculate response character length
char_length = len(outputs)
feedbacks.append(Feedback(name="response_char_length", value=char_length))
return feedbacks
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
multi_feedback_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[comprehensive_response_checker]
)
The results include two assessment columns: `is_not_empty_check` and `response_char_length`.
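To read the aggregated values programmatically, you can inspect the returned result object. A minimal sketch, assuming the result exposes a `metrics` dictionary (exact key names, such as a `/mean` suffix, may vary by MLflow version):

# Aggregated metric values computed from the per-trace feedbacks
for metric_name, metric_value in multi_feedback_eval_results.metrics.items():
    print(f"{metric_name}: {metric_value}")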
Example 5: Use your own LLM as a judge

Integrate a custom or externally hosted LLM inside a scorer. The scorer handles the API calls, input/output formatting, and generation of a `Feedback` object from the LLM's response, giving you full control over the judging process.

You can also set the `source` field on the `Feedback` object to indicate that the assessment came from an LLM judge.
import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional
# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.
Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.
Original User Query:
```{user_query}```
AI's Response:
```{llm_response_from_app}```
Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""
@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
user_query = inputs["messages"][-1]["content"]
# Call the Judge LLM using the OpenAI SDK client.
judge_llm_response_obj = client.chat.completions.create(
model="databricks-claude-sonnet-4-5", # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
messages=[
{"role": "system", "content": judge_system_prompt},
{"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
],
max_tokens=200, # Max tokens for the judge's rationale
temperature=0.0, # For more deterministic judging
)
judge_llm_output_text = judge_llm_response_obj.choices[0].message.content
# Parse the Judge LLM's JSON output.
judge_eval_json = json.loads(judge_llm_output_text)
parsed_score = int(judge_eval_json["score"])
parsed_rationale = judge_eval_json["rationale"]
return Feedback(
value=parsed_score,
rationale=parsed_rationale,
# Set the source of the assessment to indicate the LLM judge used to generate the feedback
source=AssessmentSource(
source_type=AssessmentSourceType.LLM_JUDGE,
source_id="claude-sonnet-4-5",
)
)
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[answer_quality]
)
By opening the trace in the UI and clicking the "answer_quality" assessment, you can see the judge's metadata, such as the rationale, timestamp, and judge model name. If the judge's assessment is incorrect, you can override the score by clicking the Edit button.

The new assessment supersedes the original judge assessment, and the edit history is preserved for future reference.
Example 6: Class-based scorer definition (offline evaluation only)

If a scorer needs state, a definition based on the `@scorer` decorator may not be sufficient. For more complex scorers, use the `Scorer` base class instead. The `Scorer` class is a Pydantic object, so you can define additional fields and use them in the `__call__` method.

Note

Class-based `Scorer` subclasses support only offline evaluation with `mlflow.genai.evaluate()`. They cannot be registered for production monitoring. To use a custom scorer in production monitoring, use the `@scorer` decorator instead.
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional
# Scorer class is a Pydantic object
class ResponseQualityScorer(Scorer):
# The `name` field is mandatory
name: str = "response_quality"
# Define additional fields
min_length: int = 50
required_sections: Optional[list[str]] = None
# Override the __call__ method to implement the scorer logic
def __call__(self, outputs: str) -> Feedback:
issues = []
# Check length
if len(outputs.split()) < self.min_length:
issues.append(f"Too short (minimum {self.min_length} words)")
# Check required sections
        # Guard against the None default so the scorer also works without required sections
        missing = [s for s in (self.required_sections or []) if s not in outputs]
if missing:
issues.append(f"Missing sections: {', '.join(missing)}")
if issues:
return Feedback(
value=False,
rationale="; ".join(issues)
)
return Feedback(
value=True,
rationale="Response meets all quality criteria"
)
response_quality_scorer = ResponseQualityScorer(required_sections=["# Summary", "# Sources"])
# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
class_based_scorer_results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[response_quality_scorer]
)
Example 7: Error handling in scorers

The following example demonstrates two approaches to handling errors in scorers:

- Handle errors explicitly: you can detect bad inputs or catch exceptions yourself and return a `Feedback` containing an `AssessmentError`.
- Let exceptions propagate (recommended): for most errors, it is best to let MLflow catch the exception. MLflow creates a `Feedback` object with the error details and continues execution.
import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentError
@scorer
def resilient_scorer(outputs, trace=None):
try:
response = outputs.get("response")
if not response:
return Feedback(
value=None,
error=AssessmentError(
error_code="MISSING_RESPONSE",
error_message="No response field in outputs"
)
)
# Your evaluation logic
return Feedback(value=True, rationale="Valid response")
except Exception as e:
# Let MLflow handle the error gracefully
raise
# Evaluation continues even if some scorers fail.
results = mlflow.genai.evaluate(
data=generated_traces,
scorers=[resilient_scorer]
)
Example 8: Naming conventions for scorers

The following example illustrates the naming behavior of code-based scorers, which can be summarized as:

- If a scorer returns one or more `Feedback` objects, the `Feedback.name` field takes precedence when specified.
- For primitive return values or unnamed `Feedback` objects, the function name (with the `@scorer` decorator) or the `Scorer.name` field (in `Scorer` classes) is used.
from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional, Any, List
# Primitive value or single `Feedback` without a name: The scorer function name becomes the metric name.
@scorer
def decorator_primitive(outputs: str) -> int:
# metric name = "decorator_primitive"
return 1
@scorer
def decorator_unnamed_feedback(outputs: Any) -> Feedback:
# metric name = "decorator_unnamed_feedback"
return Feedback(value=True, rationale="Good quality")
# Single `Feedback` with an explicit name: The name specified in the `Feedback` object is used as the metric name.
@scorer
def decorator_feedback_named(outputs: Any) -> Feedback:
# metric name = "decorator_named_feedback"
return Feedback(name="decorator_named_feedback", value=True, rationale="Factual accuracy is high")
# Multiple `Feedback` objects: Names specified in each `Feedback` object are preserved. You must specify a unique name for each `Feedback`.
@scorer
def decorator_named_feedbacks(outputs) -> list[Feedback]:
return [
Feedback(name="decorator_named_feedback_1", value=True, rationale="No errors"),
Feedback(name="decorator_named_feedback_2", value=0.9, rationale="Very clear"),
]
# Class returning primitive value
class ScorerPrimitive(Scorer):
# metric name = "scorer_primitive"
name: str = "scorer_primitive"
def __call__(self, outputs: str) -> int:
return 1
scorer_primitive = ScorerPrimitive()
# Class returning a Feedback object without a name
class ScorerFeedbackUnnamed(Scorer):
# metric name = "scorer_named_feedback"
name: str = "scorer_named_feedback"
def __call__(self, outputs: str) -> Feedback:
return Feedback(value=True, rationale="Good")
scorer_feedback_unnamed = ScorerFeedbackUnnamed()
# Class returning a Feedback object with a name
class ScorerFeedbackNamed(Scorer):
# metric name = "scorer_named_feedback"
name: str = "scorer_feedback_named"
def __call__(self, outputs: str) -> Feedback:
return Feedback(name="scorer_named_feedback", value=True, rationale="Good")
scorer_feedback_named = ScorerFeedbackNamed()
# Class returning multiple Feedback objects with names
class ScorerNamedFeedbacks(Scorer):
    # metric names = ["scorer_named_feedback_1", "scorer_named_feedback_2"]
name: str = "scorer_named_feedbacks" # Not used
def __call__(self, outputs: str) -> List[Feedback]:
return [
Feedback(name="scorer_named_feedback_1", value=True, rationale="Good"),
Feedback(name="scorer_named_feedback_2", value=1, rationale="ok"),
]
scorer_named_feedbacks = ScorerNamedFeedbacks()
mlflow.genai.evaluate(
data=generated_traces,
scorers=[
decorator_primitive,
decorator_unnamed_feedback,
decorator_feedback_named,
decorator_named_feedbacks,
scorer_primitive,
scorer_feedback_unnamed,
scorer_feedback_named,
scorer_named_feedbacks,
],
)
Example 9: Chain evaluation results

If a scorer flags a subset of traces as problematic, you can collect that subset with `mlflow.search_traces()` for further iteration. The following example looks for general `Safety` failures, then analyzes the failing subset with a more tailored scorer (a simplified example that evaluates against a content policy document). Alternatively, you can use the problematic subset of traces to iterate on your AI application and improve how it handles challenging inputs.
from mlflow.genai.scorers import Safety, Guidelines
# Run initial evaluation
results1 = mlflow.genai.evaluate(
data=generated_traces,
scorers=[Safety()]
)
# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)
# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
lambda x: any(a['assessment_name'] == 'Safety' and a['feedback']['value'] == 'no' for a in x)
)]
# Updated app (not actually updated in this toy example)
updated_app = sample_app
# Re-evaluate with different scorers or updated app
if len(safety_failures) > 0:
results2 = mlflow.genai.evaluate(
data=safety_failures,
predict_fn=updated_app,
scorers=[
Guidelines(
name="content_policy",
guidelines="Response must follow our content policy"
)
]
)
Example 10: Conditional logic with guidelines

You can wrap the guidelines judge in a custom code-based scorer to apply different guidelines based on user attributes or other context.
from mlflow.genai.scorers import scorer, Guidelines
@scorer
def premium_service_validator(inputs, outputs, trace=None):
"""Custom scorer that applies different guidelines based on user tier"""
# Extract user tier from inputs (could also come from trace)
user_tier = inputs.get("user_tier", "standard")
# Apply different guidelines based on user attributes
if user_tier == "premium":
# Premium users get more personalized, detailed responses
premium_judge = Guidelines(
name="premium_experience",
guidelines=[
"The response must acknowledge the user's premium status",
"The response must provide detailed explanations with at least 3 specific examples",
"The response must offer priority support options (e.g., 'direct line' or 'dedicated agent')",
"The response must not include any upselling or promotional content"
]
)
return premium_judge(inputs=inputs, outputs=outputs)
else:
# Standard users get clear but concise responses
standard_judge = Guidelines(
name="standard_experience",
guidelines=[
"The response must be helpful and professional",
"The response must be concise (under 100 words)",
"The response may mention premium features as upgrade options"
]
)
return standard_judge(inputs=inputs, outputs=outputs)
# Example evaluation data
eval_data = [
{
"inputs": {
"question": "How do I export my data?",
"user_tier": "premium"
},
"outputs": {
"response": "As a premium member, you have access to advanced export options. You can export in 5 formats: CSV, Excel, JSON, XML, and PDF. Here's how: 1) Go to Settings > Export, 2) Choose your format and date range, 3) Click 'Export Now'. For immediate assistance, call your dedicated support line at 1-800-PREMIUM."
}
},
{
"inputs": {
"question": "How do I export my data?",
"user_tier": "standard"
},
"outputs": {
"response": "You can export your data as CSV from Settings > Export. Premium users can access additional formats like Excel and PDF."
}
}
]
# Run evaluation with the custom scorer
results = mlflow.genai.evaluate(
data=eval_data,
scorers=[premium_service_validator]
)
Example notebook

The following notebook contains all of the code on this page.

Code-based scorers for MLflow evaluation notebook

Next steps

- Custom LLM scorers - Learn about semantic evaluation with LLM-as-a-judge metrics, which can be easier to define than code-based scorers.
- Run scorers in production - Deploy scorers for continuous monitoring.
- Build evaluation datasets - Create test data for your scorers.