Context sufficiency judge & scorer

2025-06-11

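The predefined judges.is_context_sufficient() judge evaluates whether the context, either retrieved by a RAG system or generated by a tool call, contains enough information to adequately answer the user's request. It bases this assessment on ground-truth labels provided as expected_facts or expected_response.

This judge is available through the predefined RetrievalSufficiency scorer, which is intended for evaluating RAG systems where you need to confirm that the retrieval process provides all necessary information.

API signature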

from mlflow.genai.judges import is_context_sufficient

def is_context_sufficient(
    *,
    request: str,                    # User's question or query
    context: Any,                    # Context to evaluate for sufficiency; can be any Python primitive or a JSON-serializable dict
    expected_facts: Optional[list[str]],       # List of expected facts (provide either expected_response or expected_facts)
    expected_response: Optional[str] = None,  #  Ground truth response (provide either expected_response or expected_facts)
    name: Optional[str] = None       # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

Install MLflow and the required packages:
```
pip install --upgrade "mlflow[databricks]>=3.1.0"
```
Follow the environment setup quickstart to create an MLflow experiment.

Direct SDK usage

from mlflow.genai.judges import is_context_sufficient

# Example 1: Context contains sufficient information
feedback = is_context_sufficient(
    request="What is the capital of France?",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ],
    expected_facts=["Paris is the capital of France."]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of sufficiency

# Example 2: Context lacks necessary information
feedback = is_context_sufficient(
    request="What are MLflow's components?",
    context=[
        {"content": "MLflow is an open-source platform."}
    ],
    expected_facts=[
        "MLflow has four main components",
        "Components include Tracking",
        "Components include Projects"
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Explanation of what's missing

Using the prebuilt scorer

The is_context_sufficient judge is available through the prebuilt RetrievalSufficiency scorer.

Requirements:

Trace requirements:
- The MLflow trace must contain at least one span with span_type set to RETRIEVER.
- inputs and outputs must be on the trace's root span.
Ground-truth labels: Required - the expectations dictionary must contain either expected_facts or expected_response.
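
The ground-truth requirement can be checked before an evaluation run. The following is a minimal sketch in plain Python, not part of MLflow: find_rows_missing_ground_truth is a hypothetical helper, and it assumes each row is a dictionary with an optional expectations key, as in the evaluation datasets on this page.

```python
from typing import Any


def find_rows_missing_ground_truth(rows: list[dict[str, Any]]) -> list[int]:
    """Return indices of rows with neither expected_facts nor expected_response.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    missing = []
    for index, row in enumerate(rows):
        expectations = row.get("expectations") or {}
        has_facts = bool(expectations.get("expected_facts"))
        has_response = bool(expectations.get("expected_response"))
        if not (has_facts or has_response):
            missing.append(index)
    return missing


sample_rows = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {"expected_facts": ["Paris is the capital of France."]},
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {"expected_response": "Tracking, Projects, Models, and Registry."},
    },
    # This row carries no ground truth, so it does not meet the requirement above.
    {"inputs": {"query": "What is a feature store?"}},
]

print(find_rows_missing_ground_truth(sample_rows))  # [2]
```

Running a check like this first makes a missing label show up as an explicit index rather than as a scoring problem later in the run.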

import os
import mlflow
from openai import OpenAI
from mlflow.genai.scorers import RetrievalSufficiency
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval - some queries return insufficient context
    if "capital of france" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="Paris is the capital of France.",
                metadata={"source": "geography.txt"}
            ),
            Document(
                id="doc_2",
                page_content="France is a country in Western Europe.",
                metadata={"source": "countries.txt"}
            )
        ]
    elif "mlflow components" in query.lower():
        # Incomplete retrieval - missing some components
        return [
            Document(
                id="doc_3",
                page_content="MLflow has multiple components including Tracking and Projects.",
                metadata={"source": "mlflow_intro.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_4",
                page_content="General information about data science.",
                metadata={"source": "ds_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]

    response = client.chat.completions.create(
        # This example uses Databricks hosted Claude.  If you provide your own OpenAI credentials, replace with a valid OpenAI model e.g., gpt-4o, etc.
        model="databricks-claude-3-7-sonnet",
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset with ground truth
eval_dataset = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "expectations": {
            "expected_facts": ["Paris is the capital of France."]
        }
    },
    {
        "inputs": {"query": "What are all the MLflow components?"},
        "expectations": {
            "expected_facts": [
                "MLflow has four main components",
                "Components include Tracking",
                "Components include Projects",
                "Components include Models",
                "Components include Registry"
            ]
        }
    }
]

# Run evaluation with RetrievalSufficiency scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[RetrievalSufficiency()]
)

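Understanding the results

The RetrievalSufficiency scorer evaluates each retriever span separately. For each span, it:

- Returns "yes" if the retrieved documents contain all the information needed to produce the expected facts.
- Returns "no" if the retrieved documents are missing critical information, along with a rationale that explains what is missing.

This helps you identify when your retrieval system fails to fetch all necessary information, which is a common cause of incomplete or incorrect responses in RAG applications.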
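
Because the scorer produces one verdict per retriever span, a trace with several retrievers can yield a mix of "yes" and "no" values. The sketch below tallies such verdicts in plain Python. The SpanVerdict class, the span names, and the sample rationales are hypothetical stand-ins for the value and rationale fields described on this page; they are not MLflow objects, and extracting the verdicts from your evaluation results is left out.

```python
from dataclasses import dataclass


@dataclass
class SpanVerdict:
    """Hypothetical stand-in for one judge verdict on one retriever span."""
    span_name: str
    value: str       # "yes" or "no"
    rationale: str


def summarize(verdicts: list[SpanVerdict]) -> dict:
    """Compute the share of sufficient spans and collect rationales for the rest."""
    insufficient = [v for v in verdicts if v.value != "yes"]
    total = len(verdicts)
    rate = (total - len(insufficient)) / total if total else 0.0
    return {
        "sufficiency_rate": rate,
        "gaps": {v.span_name: v.rationale for v in insufficient},
    }


verdicts = [
    SpanVerdict("retrieve_docs#1", "yes", "The context states that Paris is the capital."),
    SpanVerdict("retrieve_docs#2", "no", "Models and Registry are not mentioned."),
]
summary = summarize(verdicts)
print(summary["sufficiency_rate"])  # 0.5
print(summary["gaps"])
```

Keeping the rationales next to the rate makes it easier to see which facts the retriever tends to miss, not only how often it falls short.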

Using in a custom scorer

When you evaluate an application whose data structure differs from what the predefined scorer requires, wrap the judge in a custom scorer:

import mlflow
from mlflow.genai.judges import is_context_sufficient
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What are the benefits of MLflow?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow simplifies ML lifecycle management."},
                {"content": "MLflow provides experiment tracking and model versioning."},
                {"content": "MLflow enables easy model deployment."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow simplifies ML lifecycle management",
                "MLflow provides experiment tracking",
                "MLflow enables model deployment"
            ]
        }
    },
    {
        "inputs": {"query": "How does MLflow handle model versioning?"},
        "outputs": {
            "retrieved_context": [
                {"content": "MLflow is an open-source platform."}
            ]
        },
        "expectations": {
            "expected_facts": [
                "MLflow Model Registry handles versioning",
                "Models can have multiple versions",
                "Versions can be promoted through stages"
            ]
        }
    }
]

@scorer
def context_sufficiency_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any], expectations: Dict[Any, Any]):
    return is_context_sufficient(
        request=inputs["query"],
        context=outputs["retrieved_context"],
        expected_facts=expectations["expected_facts"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[context_sufficiency_scorer]
)
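
The context argument accepts any Python primitive or JSON-serializable dict, so an application that stores retrieved text in a different shape may only need a small adapter inside the custom scorer. The sketch below shows one possible adapter. to_context_items is a hypothetical helper, and the three input shapes it handles (plain strings, dicts with a content key, and objects with a page_content attribute) are assumptions chosen for illustration.

```python
from types import SimpleNamespace
from typing import Any


def to_context_items(retrieved: list[Any]) -> list[dict[str, str]]:
    """Convert mixed retrieval results into a list of {"content": ...} dicts.

    Hypothetical helper for illustration; it is not an MLflow API.
    """
    items = []
    for entry in retrieved:
        if isinstance(entry, str):
            text = entry
        elif isinstance(entry, dict) and "content" in entry:
            text = entry["content"]
        elif hasattr(entry, "page_content"):
            text = entry.page_content
        else:
            raise TypeError(f"Unsupported context entry: {type(entry).__name__}")
        items.append({"content": str(text)})
    return items


mixed = [
    "MLflow is an open-source platform.",
    {"content": "MLflow provides experiment tracking."},
    # SimpleNamespace stands in for any object exposing page_content.
    SimpleNamespace(page_content="MLflow enables easy model deployment."),
]
context = to_context_items(mixed)
print(context)
```

Inside a custom scorer, the result could then be passed to the judge as context=to_context_items(outputs["retrieved_context"]).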

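Interpreting results

The judge returns a Feedback object with:

- value: "yes" if the context is sufficient, "no" if it is insufficient
- rationale: An explanation of which expected facts the context covers and which it is missing

Next steps

Evaluate context relevance - Confirm that retrieved documents are relevant before checking sufficiency
Evaluate groundedness - Verify that responses use only the provided context
Build evaluation datasets - Create ground-truth datasets with the expected facts needed for testing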