Code-based scorer examples

In MLflow Evaluation for GenAI, custom code-based scorers let you define flexible evaluation metrics for your AI agent or application. This set of examples, together with the accompanying example notebook, illustrates many patterns for using code-based scorers with different input, output, implementation, and error-handling options.

The following figure shows the output of several custom scorers as metrics in the MLflow UI.

Custom scorer development

Prerequisites

  1. Update MLflow
  2. Define your GenAI application
  3. Generate the traces used in some of the scorer examples

Update MLflow

Update mlflow[databricks] to the latest version for the best GenAI experience, and install openai, because the example application below uses the OpenAI client.

%pip install -q --upgrade "mlflow[databricks]>=3.1" openai
dbutils.library.restartPython()

Define your GenAI application

Some of the examples below use the following GenAI application, a general-purpose assistant for question answering. The code below uses the OpenAI client to connect to Databricks-hosted LLMs.

from databricks_openai import DatabricksOpenAI
import mlflow

# Create an OpenAI client that is connected to Databricks-hosted LLMs
client = DatabricksOpenAI()

# Select an LLM
model_name = "databricks-claude-sonnet-4"

mlflow.openai.autolog()

# If running outside of Databricks, set up MLflow tracking to Databricks.
# mlflow.set_tracking_uri("databricks")

# In Databricks notebooks, the experiment defaults to the notebook experiment.
# mlflow.set_experiment("/Shared/docs-demo")

@mlflow.trace
def sample_app(messages: list[dict[str, str]]):
    # 1. Prepare messages for the LLM
    messages_for_llm = [
        {"role": "system", "content": "You are a helpful assistant."},
        *messages,
    ]

    # 2. Call LLM to generate a response
    response = client.chat.completions.create(
        model=model_name,
        messages=messages_for_llm,
    )
    return response.choices[0].message.content


sample_app([{"role": "user", "content": "What is the capital of France?"}])

Generate traces

The eval_dataset below is used with mlflow.genai.evaluate() to generate traces using a placeholder scorer.

from mlflow.genai.scorers import scorer

eval_dataset = [
    {
        "inputs": {
            "messages": [
                {"role": "user", "content": "How much does a microwave cost?"},
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "Can I return the microwave I bought 2 months ago?",
                },
            ]
        },
    },
    {
        "inputs": {
            "messages": [
                {
                    "role": "user",
                    "content": "I'm having trouble with my account.  I can't log in.",
                },
                {
                    "role": "assistant",
                    "content": "I'm sorry to hear that you're having trouble with your account.  Are you using our website or mobile app?",
                },
                {"role": "user", "content": "Website"},
            ]
        },
    },
]

@scorer
def placeholder_metric() -> int:
    # placeholder return value
    return 1

eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=sample_app,
    scorers=[placeholder_metric]
)

generated_traces = mlflow.search_traces(run_id=eval_results.run_id)
generated_traces

The mlflow.search_traces() call above returns a Pandas DataFrame of traces, which is used in several of the examples below.
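
To sanity-check what was returned before reusing it in later examples, here is a minimal sketch using standard pandas calls on the DataFrame above:

# Inspect the traces DataFrame returned by mlflow.search_traces()
print(len(generated_traces))              # number of traces collected
print(generated_traces.columns.tolist())  # available columns, including the assessments column used in Example 9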

Example 1: Access data from the trace

Access the full MLflow Trace object to compute fine-grained metrics from its details (spans, inputs, outputs, attributes, timing).

This scorer checks whether the LLM call recorded in the trace completes within an acceptable response time.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Trace, Feedback, SpanType

@scorer
def llm_response_time_good(trace: Trace) -> Feedback:
    # Search particular span type from the trace
    llm_span = trace.search_spans(span_type=SpanType.CHAT_MODEL)[0]

    response_time = (llm_span.end_time_ns - llm_span.start_time_ns) / 1e9 # convert to seconds
    max_duration = 5.0
    if response_time <= max_duration:
        return Feedback(
            value="yes",
            rationale=f"LLM response time {response_time:.2f}s is within the {max_duration}s limit."
        )
    else:
        return Feedback(
            value="no",
            rationale=f"LLM response time {response_time:.2f}s exceeds the {max_duration}s limit."
        )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
span_check_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[llm_response_time_good]
)

Example 2: Wrap a predefined LLM judge

Create a custom scorer that wraps one of MLflow's built-in LLM judges. Use it to preprocess the trace data for the judge or to post-process its feedback.

This example shows how to wrap the is_context_relevant judge to evaluate whether the assistant's response is relevant to the user's query. Specifically, the inputs field for sample_app is a dictionary such as {"messages": [{"role": ..., "content": ...}, ...]}. This scorer extracts the content of the last user message to pass to the relevance judge.

import mlflow
from mlflow.entities import Trace, Feedback
from mlflow.genai.judges import is_context_relevant
from mlflow.genai.scorers import scorer
from typing import Any

@scorer
def is_message_relevant(inputs: dict[str, Any], outputs: str) -> Feedback:
    last_user_message_content = None
    if "messages" in inputs and isinstance(inputs["messages"], list):
        for message in reversed(inputs["messages"]):
            if message.get("role") == "user" and "content" in message:
                last_user_message_content = message["content"]
                break

    if not last_user_message_content:
        raise Exception("Could not extract the last user message from inputs to evaluate relevance.")

    # Call the `is_context_relevant` judge. It returns a Feedback object.
    return is_context_relevant(
        request=last_user_message_content,
        context={"response": outputs},
    )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_relevance_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[is_message_relevant]
)

Example 3: Use expectations

Expectations are ground-truth values or labels, which are typically important for offline evaluation. When running mlflow.genai.evaluate(), you can specify expectations in the data argument in two ways:

  • An expectations column or field: for example, if the data argument is a list of dictionaries or a Pandas DataFrame, each row can include an expectations key. The value associated with this key is passed directly to your custom scorer.
  • A trace column or field: for example, if the data argument is a DataFrame returned by mlflow.search_traces(), it includes a trace column that contains any Expectation data associated with each trace (see the sketch after this list for one way to attach expectations to a trace).
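
As a hedged illustration of the second option, the sketch below uses mlflow.log_expectation() (available in MLflow 3) to attach a ground-truth expectation to an existing trace; the trace ID is hypothetical, and the exact keyword arguments may vary slightly by MLflow version.

import mlflow

# Attach a ground-truth expectation to an existing trace (hypothetical trace ID).
# Traces later returned by mlflow.search_traces() carry this expectation data.
mlflow.log_expectation(
    trace_id="tr-1234567890abcdef",
    name="expected_response",
    value="Paris is the capital of France.",
)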

Note

Production monitoring typically has no expectations, because you are evaluating live traffic without ground truth. If you want to use the same scorer for both offline and online evaluation, design it to handle missing expectations gracefully, as shown in the sketch below.
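
A minimal sketch of that pattern (the scorer name and keyword logic are illustrative, not part of the examples on this page): make expectations optional so the scorer still returns a value when no ground truth is available.

from typing import Any, Optional
from mlflow.genai.scorers import scorer

@scorer
def keyword_coverage(outputs: str, expectations: Optional[dict[str, Any]] = None) -> float:
    # Without ground truth (for example, in production monitoring), return a neutral score.
    keywords = (expectations or {}).get("expected_keywords") or []
    if not keywords:
        return 1.0
    # Fraction of expected keywords that appear in the response.
    hits = sum(1 for kw in keywords if kw.lower() in outputs.lower())
    return hits / len(keywords)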

This example also shows how to use a custom scorer together with the predefined Safety scorer.

import mlflow
from mlflow.entities import Feedback
from mlflow.genai.scorers import scorer, Safety
from typing import Any, List, Optional, Union

expectations_eval_dataset_list = [
    {
        "inputs": {"messages": [{"role": "user", "content": "What is 2+2?"}]},
        "expectations": {
            "expected_response": "2+2 equals 4.",
            "expected_keywords": ["4", "four", "equals"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Describe MLflow in one sentence."}]},
        "expectations": {
            "expected_response": "MLflow is an open-source platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.",
            "expected_keywords": ["mlflow", "open-source", "platform", "machine learning"],
        }
    },
    {
        "inputs": {"messages": [{"role": "user", "content": "Say hello."}]},
        "expectations": {
            "expected_response": "Hello there!",
            # No keywords needed for this one, but the field can be omitted or empty
        }
    }
]

Example 3.1: Exact match against the expected response

This scorer checks whether the assistant's response exactly matches the expected_response provided in expectations.

@scorer
def exact_match(outputs: str, expectations: dict[str, Any]) -> bool:
    # Scorer can return primitive value like bool, int, float, str, etc.
    return outputs == expectations["expected_response"]

exact_match_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[exact_match, Safety()]  # You can include any number of scorers
)

Example 3.2: Keyword check against expectations

This scorer checks whether the assistant's response contains all of the expected_keywords from expectations.

@scorer
def keyword_presence_scorer(outputs: str, expectations: dict[str, Any]) -> Feedback:
    expected_keywords = expectations.get("expected_keywords")
    if expected_keywords is None:
        return Feedback(value="yes", rationale="No keywords were expected in the response.")

    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() not in outputs.lower():
            missing_keywords.append(keyword)

    if not missing_keywords:
        return Feedback(value="yes", rationale="All expected keywords are present in the response.")
    else:
        return Feedback(value="no", rationale=f"Missing keywords: {', '.join(missing_keywords)}.")

keyword_presence_eval_results = mlflow.genai.evaluate(
    data=expectations_eval_dataset_list,
    predict_fn=sample_app, # sample_app is from the prerequisite section
    scorers=[keyword_presence_scorer]
)

Example 4: Return multiple feedback objects

A single scorer can return a list of Feedback objects, which lets one scorer assess multiple quality dimensions (for example, PII, sentiment, and conciseness) at the same time.

Each Feedback object should have a unique name, which becomes the metric name in the results. For details on metric names, see Example 8 below.

This example shows a scorer that returns two distinct pieces of feedback for each trace:

  1. is_not_empty_check: a Boolean indicating whether the response content is non-empty.
  2. response_char_length: a numeric value giving the character length of the response.

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, Trace # Ensure Feedback and Trace are imported
from typing import Any, Optional

@scorer
def comprehensive_response_checker(outputs: str) -> list[Feedback]:
    feedbacks = []
    # 1. Check if the response is not empty
    feedbacks.append(
        Feedback(name="is_not_empty_check", value="yes" if outputs != "" else "no")
    )
    # 2. Calculate response character length
    char_length = len(outputs)
    feedbacks.append(Feedback(name="response_char_length", value=char_length))
    return feedbacks

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
multi_feedback_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[comprehensive_response_checker]
)

The results include two assessment columns: is_not_empty_check and response_char_length.

Multiple feedback results
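
To read these assessments back programmatically instead of through the UI, a minimal sketch reusing the same search pattern as the prerequisite section (the column layout follows the traces DataFrame used in Example 9):

# Retrieve the traces produced by the evaluation run above.
multi_feedback_traces = mlflow.search_traces(run_id=multi_feedback_eval_results.run_id)

# Each row's assessments include both is_not_empty_check and response_char_length.
multi_feedback_traces["assessments"].iloc[0]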

Example 5: Use your own LLM as a judge

Integrate a custom or externally hosted LLM inside a scorer. The scorer handles the API calls, input/output formatting, and generation of Feedback from the LLM's response, giving you full control over the judging process.

You can also set the source field on the Feedback object to indicate that the assessment comes from an LLM judge.

import mlflow
import json
from mlflow.genai.scorers import scorer
from mlflow.entities import AssessmentSource, AssessmentSourceType, Feedback
from typing import Any, Optional

# Define the prompts for the Judge LLM.
judge_system_prompt = """
You are an impartial AI assistant responsible for evaluating the quality of a response generated by another AI model.
Your evaluation should be based on the original user query and the AI's response.
Provide a quality score as an integer from 1 to 5 (1=Poor, 2=Fair, 3=Good, 4=Very Good, 5=Excellent).
Also, provide a brief rationale for your score.

Your output MUST be a single valid JSON object with two keys: "score" (an integer) and "rationale" (a string).
Example:
{"score": 4, "rationale": "The response was mostly accurate and helpful, addressing the user's query directly."}
"""
judge_user_prompt = """
Please evaluate the AI's Response below based on the Original User Query.

Original User Query:
```{user_query}```

AI's Response:
```{llm_response_from_app}```

Provide your evaluation strictly as a JSON object with "score" and "rationale" keys.
"""

@scorer
def answer_quality(inputs: dict[str, Any], outputs: str) -> Feedback:
    user_query = inputs["messages"][-1]["content"]

    # Call the Judge LLM using the OpenAI SDK client.
    judge_llm_response_obj = client.chat.completions.create(
        model="databricks-claude-sonnet-4-5",  # This example uses Databricks hosted Claude. If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o-mini, etc.
        messages=[
            {"role": "system", "content": judge_system_prompt},
            {"role": "user", "content": judge_user_prompt.format(user_query=user_query, llm_response_from_app=outputs)},
        ],
        max_tokens=200,  # Max tokens for the judge's rationale
        temperature=0.0, # For more deterministic judging
    )
    judge_llm_output_text = judge_llm_response_obj.choices[0].message.content

    # Parse the Judge LLM's JSON output.
    judge_eval_json = json.loads(judge_llm_output_text)
    parsed_score = int(judge_eval_json["score"])
    parsed_rationale = judge_eval_json["rationale"]

    return Feedback(
        value=parsed_score,
        rationale=parsed_rationale,
        # Set the source of the assessment to indicate the LLM judge used to generate the feedback
        source=AssessmentSource(
            source_type=AssessmentSourceType.LLM_JUDGE,
            source_id="claude-sonnet-4-5",
        )
    )

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
custom_llm_judge_eval_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[answer_quality]
)

By opening the trace in the UI and clicking the "answer_quality" assessment, you can see the judge's metadata, such as its rationale, timestamp, and the judge model name. If the judge's assessment is incorrect, you can override the score by clicking the Edit button.

The new assessment replaces the original judge assessment, and the edit history is retained for future reference.

Edit LLM judge assessment

Example 6: Class-based scorer definition (offline evaluation only)

If a scorer requires state, a @scorer decorator-based definition may not be sufficient. For more complex scorers, use the Scorer base class instead. The Scorer class is a Pydantic object, so you can define additional fields and use them in the __call__ method.

Note

Class-based Scorer subclasses are supported only for offline evaluation with mlflow.genai.evaluate(). They cannot be registered for production monitoring. To use custom scorers in production monitoring, use the @scorer decorator.

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional

# Scorer class is a Pydantic object
class ResponseQualityScorer(Scorer):

    # The `name` field is mandatory
    name: str = "response_quality"

    # Define additional fields
    min_length: int = 50
    required_sections: Optional[list[str]] = None

    # Override the __call__ method to implement the scorer logic
    def __call__(self, outputs: str) -> Feedback:
        issues = []

        # Check length
        if len(outputs.split()) < self.min_length:
            issues.append(f"Too short (minimum {self.min_length} words)")

        # Check required sections
        missing = [s for s in (self.required_sections or []) if s not in outputs]
        if missing:
            issues.append(f"Missing sections: {', '.join(missing)}")

        if issues:
            return Feedback(
                value=False,
                rationale="; ".join(issues)
            )

        return Feedback(
            value=True,
            rationale="Response meets all quality criteria"
        )


response_quality_scorer = ResponseQualityScorer(required_sections=["# Summary", "# Sources"])

# Evaluate the scorer using the pre-generated traces from the prerequisite code block.
class_based_scorer_results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[response_quality_scorer]
)

Example 7: Error handling in scorers

The following example demonstrates two approaches to handling errors in scorers:

  • Handle errors explicitly: you can explicitly detect bad inputs, or catch other exceptions, and return a Feedback with an AssessmentError.
  • Let exceptions propagate (recommended): for most errors, it is best to let MLflow catch the exception. MLflow creates a Feedback object containing the error details and continues execution (see the sketch after the code block below).

import mlflow
from mlflow.genai.scorers import scorer
from mlflow.entities import Feedback, AssessmentError

@scorer
def resilient_scorer(outputs, trace=None):
    try:
        response = outputs.get("response")
        if not response:
            return Feedback(
                value=None,
                error=AssessmentError(
                    error_code="MISSING_RESPONSE",
                    error_message="No response field in outputs"
                )
            )
        # Your evaluation logic
        return Feedback(value=True, rationale="Valid response")
    except Exception as e:
        # Let MLflow handle the error gracefully
        raise

# Evaluation continues even if some scorers fail.
results = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[resilient_scorer]
)
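
For the second approach (letting exceptions propagate), a minimal sketch with a hypothetical scorer: no try/except is needed, because MLflow records the raised exception as an error on the corresponding Feedback and continues evaluating the remaining rows and scorers.

from mlflow.genai.scorers import scorer

@scorer
def strict_scorer(outputs: str) -> bool:
    # No try/except: if this raises (for example, on an unexpected output type),
    # MLflow captures the exception as an error Feedback and keeps evaluating.
    if not isinstance(outputs, str):
        raise TypeError(f"Expected a string response, got {type(outputs).__name__}")
    return len(outputs.strip()) > 0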

Example 8: Naming conventions in scorers

The following example illustrates the naming behavior of code-based scorers. The behavior can be summarized as follows:

  1. If a scorer returns one or more Feedback objects, the Feedback.name field takes precedence when it is specified.
  2. For primitive return values or unnamed Feedback, the function name (for the @scorer decorator) or the Scorer.name field (for Scorer classes) is used.

from mlflow.genai.scorers import Scorer
from mlflow.entities import Feedback
from typing import Optional, Any, List

# Primitive value or single `Feedback` without a name: The scorer function name becomes the metric name.
@scorer
def decorator_primitive(outputs: str) -> int:
    # metric name = "decorator_primitive"
    return 1

@scorer
def decorator_unnamed_feedback(outputs: Any) -> Feedback:
    # metric name = "decorator_unnamed_feedback"
    return Feedback(value=True, rationale="Good quality")

# Single `Feedback` with an explicit name: The name specified in the `Feedback` object is used as the metric name.
@scorer
def decorator_feedback_named(outputs: Any) -> Feedback:
    # metric name = "decorator_named_feedback"
    return Feedback(name="decorator_named_feedback", value=True, rationale="Factual accuracy is high")

# Multiple `Feedback` objects: Names specified in each `Feedback` object are preserved. You must specify a unique name for each `Feedback`.
@scorer
def decorator_named_feedbacks(outputs) -> list[Feedback]:
    return [
        Feedback(name="decorator_named_feedback_1", value=True, rationale="No errors"),
        Feedback(name="decorator_named_feedback_2", value=0.9, rationale="Very clear"),
    ]

# Class returning primitive value
class ScorerPrimitive(Scorer):
    # metric name = "scorer_primitive"
    name: str = "scorer_primitive"
    def __call__(self, outputs: str) -> int:
        return 1

scorer_primitive = ScorerPrimitive()

# Class returning a Feedback object without a name
class ScorerFeedbackUnnamed(Scorer):
    # metric name = "scorer_feedback_unnamed"
    name: str = "scorer_feedback_unnamed"
    def __call__(self, outputs: str) -> Feedback:
        return Feedback(value=True, rationale="Good")

scorer_feedback_unnamed = ScorerFeedbackUnnamed()

# Class returning a Feedback object with a name
class ScorerFeedbackNamed(Scorer):
    # metric name = "scorer_named_feedback"
    name: str = "scorer_feedback_named"
    def __call__(self, outputs: str) -> Feedback:
        return Feedback(name="scorer_named_feedback", value=True, rationale="Good")

scorer_feedback_named = ScorerFeedbackNamed()

# Class returning multiple Feedback objects with names
class ScorerNamedFeedbacks(Scorer):
    # metric names = ["scorer_named_feedback_1", "scorer_named_feedback_2"]
    name: str = "scorer_named_feedbacks"  # Not used
    def __call__(self, outputs: str) -> List[Feedback]:
        return [
          Feedback(name="scorer_named_feedback_1", value=True, rationale="Good"),
          Feedback(name="scorer_named_feedback_2", value=1, rationale="ok"),
        ]

scorer_named_feedbacks = ScorerNamedFeedbacks()

mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[
      decorator_primitive,
      decorator_unnamed_feedback,
      decorator_feedback_named,
      decorator_named_feedbacks,
      scorer_primitive,
      scorer_feedback_unnamed,
      scorer_feedback_named,
      scorer_named_feedbacks,
    ],
)

Example 9: Chain evaluation results

If scorers flag a subset of traces as problematic, you can collect that subset with mlflow.search_traces() for further iteration. The following example looks for general safety failures, then analyzes the failing subset with a more tailored scorer (a simplified example that evaluates against a content-policy document). Alternatively, you can use the problematic subset of traces to iterate on your AI application and improve how it handles challenging inputs.

from mlflow.genai.scorers import Safety, Guidelines

# Run initial evaluation
results1 = mlflow.genai.evaluate(
    data=generated_traces,
    scorers=[Safety()]
)

# Use results to create refined dataset
traces = mlflow.search_traces(run_id=results1.run_id)

# Filter to problematic traces
safety_failures = traces[traces['assessments'].apply(
    lambda x: any(a['assessment_name'] == 'Safety' and a['feedback']['value'] == 'no' for a in x)
)]

# Updated app (not actually updated in this toy example)
updated_app = sample_app

# Re-evaluate with different scorers or updated app
if len(safety_failures) > 0:
    results2 = mlflow.genai.evaluate(
        data=safety_failures,
        predict_fn=updated_app,
        scorers=[
            Guidelines(
                name="content_policy",
                guidelines="Response must follow our content policy"
            )
        ]
    )

Example 10: Conditional logic with guidelines

You can wrap the guidelines judge in a custom code-based scorer to apply different guidelines based on user attributes or other context.

from mlflow.genai.scorers import scorer, Guidelines

@scorer
def premium_service_validator(inputs, outputs, trace=None):
    """Custom scorer that applies different guidelines based on user tier"""

    # Extract user tier from inputs (could also come from trace)
    user_tier = inputs.get("user_tier", "standard")

    # Apply different guidelines based on user attributes
    if user_tier == "premium":
        # Premium users get more personalized, detailed responses
        premium_judge = Guidelines(
            name="premium_experience",
            guidelines=[
                "The response must acknowledge the user's premium status",
                "The response must provide detailed explanations with at least 3 specific examples",
                "The response must offer priority support options (e.g., 'direct line' or 'dedicated agent')",
                "The response must not include any upselling or promotional content"
            ]
        )
        return premium_judge(inputs=inputs, outputs=outputs)
    else:
        # Standard users get clear but concise responses
        standard_judge = Guidelines(
            name="standard_experience",
            guidelines=[
                "The response must be helpful and professional",
                "The response must be concise (under 100 words)",
                "The response may mention premium features as upgrade options"
            ]
        )
        return standard_judge(inputs=inputs, outputs=outputs)

# Example evaluation data
eval_data = [
    {
        "inputs": {
            "question": "How do I export my data?",
            "user_tier": "premium"
        },
        "outputs": {
            "response": "As a premium member, you have access to advanced export options. You can export in 5 formats: CSV, Excel, JSON, XML, and PDF. Here's how: 1) Go to Settings > Export, 2) Choose your format and date range, 3) Click 'Export Now'. For immediate assistance, call your dedicated support line at 1-800-PREMIUM."
        }
    },
    {
        "inputs": {
            "question": "How do I export my data?",
            "user_tier": "standard"
        },
        "outputs": {
            "response": "You can export your data as CSV from Settings > Export. Premium users can access additional formats like Excel and PDF."
        }
    }
]

# Run evaluation with the custom scorer
results = mlflow.genai.evaluate(
    data=eval_data,
    scorers=[premium_service_validator]
)

Example notebook

The following notebook contains all of the code on this page.

Code-based scorers for MLflow evaluation notebook

Get notebook

Next steps