Groundedness judge & scorer

The predefined judges.is_grounded() judge assesses whether your application's response is factually supported by the provided context (retrieved by a RAG system or generated by tool calls), helping detect hallucinations or statements that are not backed by that context.

This judge is available through the predefined RetrievalGroundedness scorer, which is designed for evaluating RAG applications that need to ensure responses are grounded in retrieved information.

API signature

from mlflow.genai.judges import is_grounded

def is_grounded(
    *,
    request: str,               # User's original query
    response: str,              # Application's response
    context: Any,               # Context to evaluate the response against; can be any Python primitive or a JSON-serializable dict
    name: Optional[str] = None  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""

Prerequisites for running the examples

  1. Install MLflow and required packages

    pip install --upgrade "mlflow[databricks]>=3.1.0"
    
  2. Follow the Set up your environment quickstart to create an MLflow experiment (a minimal setup sketch follows this list).
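
For illustration, a minimal sketch of that setup, assuming a Databricks workspace; the experiment path below is a placeholder you would replace with your own:

import mlflow

# Point MLflow at your Databricks workspace and select (or create) an experiment.
# The experiment path is a placeholder for this sketch.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/groundedness-demo")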

Direct SDK usage

from mlflow.genai.judges import is_grounded

# Example 1: Response is grounded in context
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."}
    ]
)
print(feedback.value)  # "yes"
print(feedback.rationale)  # Explanation of groundedness

# Example 2: Response contains hallucination
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris, which has a population of 10 million people",
    context=[
        {"content": "Paris is the capital of France."}
    ]
)
print(feedback.value)  # "no"
print(feedback.rationale)  # Identifies unsupported claim about population
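
Per the signature above, context does not have to be a list of dicts; any Python primitive or JSON-serializable value is accepted, and name controls how the feedback is labeled in the MLflow UI. A minimal sketch with illustrative strings and a hypothetical custom name:

# Example 3: Context passed as a plain string, with a custom feedback name
feedback = is_grounded(
    request="When was MLflow open-sourced?",
    response="MLflow was open-sourced in 2018.",
    context="MLflow was open-sourced by Databricks in June 2018.",
    name="groundedness_check"  # hypothetical display name
)
print(feedback.value)      # "yes"
print(feedback.rationale)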

Using the prebuilt scorer

The is_grounded judge is available through the RetrievalGroundedness prebuilt scorer.

Requirements

  • Trace requirements
    • The MLflow Trace must contain at least one span with span_type set to RETRIEVER
    • inputs and outputs must be present on the Trace's root span

import os
import mlflow
from openai import OpenAI
from mlflow.genai.scorers import RetrievalGroundedness
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints"
)

# Define a retriever function with proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval based on query
    if "mlflow" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                metadata={"source": "mlflow_docs.txt"}
            ),
            Document(
                id="doc_2",
                page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                metadata={"source": "mlflow_features.txt"}
            )
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Machine learning involves training models on data.",
                metadata={"source": "ml_basics.txt"}
            )
        ]

# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve relevant documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate response using LLM
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query}
    ]

    response = client.chat.completions.create(
        # This example uses a Databricks-hosted Claude model. If you provide your own OpenAI credentials, replace it with a valid OpenAI model, e.g., gpt-4o.
        model="databricks-claude-3-7-sonnet",
        messages=messages
    )

    return {"response": response.choices[0].message.content}

# Create evaluation dataset
eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"}
    },
    {
        "inputs": {"query": "What are the main features of MLflow?"}
    }
]

# Run evaluation with RetrievalGroundedness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[RetrievalGroundedness()]
)

Using in a custom scorer

When evaluating applications whose data structures differ from the predefined scorer's requirements, wrap the judge in a custom scorer:

from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"},
        "outputs": {
            "response": "MLflow is used for managing the ML lifecycle, including experiment tracking and model deployment.",
            "retrieved_context": [
                {"content": "MLflow is a platform for managing the ML lifecycle."},
                {"content": "MLflow includes capabilities for experiment tracking, model packaging, and deployment."}
            ]
        }
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "outputs": {
            "response": "MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
            "retrieved_context": [
                {"content": "MLflow was created by Databricks."},
                {"content": "MLflow was open-sourced in 2018."}
            ]
        }
    }
]

@scorer
def groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"]
    )

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[groundedness_scorer]
)
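
If you prefer a simple pass rate rather than yes/no feedback values, a custom scorer can also return a primitive. A hypothetical variant of the scorer above that reduces the judge's verdict to a boolean:

@scorer
def groundedness_pass(inputs: Dict[Any, Any], outputs: Dict[Any, Any]) -> bool:
    # Hypothetical variant: convert the judge's "yes"/"no" verdict into a boolean
    # so results aggregate as a simple pass rate.
    feedback = is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"]
    )
    return feedback.value == "yes"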

Interpreting results

The judge returns a Feedback object with the following fields (see the sketch after this list):

  • value: "yes" if the response is grounded in the context, "no" if it contains hallucinations
  • rationale: a detailed explanation identifying:
    • Which statements are supported by the context
    • Which statements lack support (hallucinations)
    • Specific quotes from the context that support or contradict the claims
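
For example, a small sketch of acting on these fields programmatically, reusing the hallucination example from earlier:

feedback = is_grounded(
    request="Who created MLflow?",
    response="MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
    context=[
        {"content": "MLflow was created by Databricks."},
        {"content": "MLflow was open-sourced in 2018."}
    ]
)
if feedback.value == "no":
    # The rationale points at the unsupported claim (here, the contributor count)
    print("Hallucination detected:", feedback.rationale)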

Next steps