Predefined judge: judges.is_grounded()
The judge assesses whether your application's response is factually supported by the supplied context (retrieved by a RAG system or generated by tool calls), helping detect hallucinations and claims that are not backed by that context.
This judge is available through the predefined RetrievalGroundedness scorer for evaluating RAG applications that need to ensure their responses are grounded in retrieved information.
API signature
from mlflow.genai.judges import is_grounded

def is_grounded(
    *,
    request: str,                # User's original query
    response: str,               # Application's response
    context: Any,                # Context to evaluate the response against; any Python primitive or a JSON-serializable dict
    name: Optional[str] = None,  # Optional custom name for display in the MLflow UIs
) -> mlflow.entities.Feedback:
    """Returns Feedback with 'yes' or 'no' value and a rationale"""
Prerequisites for running the examples
Install MLflow and the required packages:
pip install --upgrade "mlflow[databricks]>=3.1.0"
Follow the Set up your environment quickstart to create an MLflow experiment.
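A minimal setup sketch, assuming you are authenticating to a Databricks workspace from outside a notebook; the host, token, and experiment path below are illustrative placeholders:

import os
import mlflow

# Illustrative credentials; replace with your own workspace details
os.environ["DATABRICKS_HOST"] = "https://<your-workspace>.cloud.databricks.com"
os.environ["DATABRICKS_TOKEN"] = "<your-personal-access-token>"

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/groundedness-demo")  # hypothetical experiment path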
Direct SDK usage
from mlflow.genai.judges import is_grounded

# Example 1: Response is grounded in the context
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris",
    context=[
        {"content": "Paris is the capital of France."},
        {"content": "Paris is known for its Eiffel Tower."},
    ],
)
print(feedback.value)      # "yes"
print(feedback.rationale)  # Explanation of groundedness

# Example 2: Response contains a hallucination
feedback = is_grounded(
    request="What is the capital of France?",
    response="Paris, which has a population of 10 million people",
    context=[
        {"content": "Paris is the capital of France."},
    ],
)
print(feedback.value)      # "no"
print(feedback.rationale)  # Identifies the unsupported claim about population
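Per the signature above, context is not limited to a list of dicts; any Python primitive or JSON-serializable value is accepted. A small sketch passing the retrieved context as a single string:

# Illustrative example: context passed as a plain string
feedback = is_grounded(
    request="When was MLflow open-sourced?",
    response="MLflow was open-sourced in 2018.",
    context="MLflow was created by Databricks and open-sourced in June 2018.",
)
print(feedback.value)  # expected: "yes"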
Using the prebuilt scorer
The is_grounded judge is available through the RetrievalGroundedness prebuilt scorer.
Requirements:
- Trace requirements:
  - The MLflow Trace must contain at least one span with span_type set to RETRIEVER
  - inputs and outputs must be on the Trace's root span
import os
import mlflow
from openai import OpenAI
from mlflow.genai.scorers import RetrievalGroundedness
from mlflow.entities import Document
from typing import List

mlflow.openai.autolog()

# Connect to a Databricks LLM via OpenAI using the same credentials as MLflow
# Alternatively, you can use your own OpenAI credentials here
mlflow_creds = mlflow.utils.databricks_utils.get_databricks_host_creds()
client = OpenAI(
    api_key=mlflow_creds.token,
    base_url=f"{mlflow_creds.host}/serving-endpoints",
)
# Define a retriever function with the proper span type
@mlflow.trace(span_type="RETRIEVER")
def retrieve_docs(query: str) -> List[Document]:
    # Simulated retrieval based on the query
    if "mlflow" in query.lower():
        return [
            Document(
                id="doc_1",
                page_content="MLflow is an open-source platform for managing the ML lifecycle.",
                metadata={"source": "mlflow_docs.txt"},
            ),
            Document(
                id="doc_2",
                page_content="MLflow provides tools for experiment tracking, model packaging, and deployment.",
                metadata={"source": "mlflow_features.txt"},
            ),
        ]
    else:
        return [
            Document(
                id="doc_3",
                page_content="Machine learning involves training models on data.",
                metadata={"source": "ml_basics.txt"},
            ),
        ]
# Define your RAG app
@mlflow.trace
def rag_app(query: str):
    # Retrieve relevant documents
    docs = retrieve_docs(query)
    context = "\n".join([doc.page_content for doc in docs])

    # Generate a response using the LLM
    messages = [
        {"role": "system", "content": f"Answer based on this context: {context}"},
        {"role": "user", "content": query},
    ]
    response = client.chat.completions.create(
        # This example uses a Databricks-hosted Claude model. If you provide your
        # own OpenAI credentials, replace it with a valid OpenAI model, e.g., gpt-4o.
        model="databricks-claude-3-7-sonnet",
        messages=messages,
    )
    return {"response": response.choices[0].message.content}
# Create an evaluation dataset
eval_dataset = [
    {"inputs": {"query": "What is MLflow used for?"}},
    {"inputs": {"query": "What are the main features of MLflow?"}},
]

# Run the evaluation with the RetrievalGroundedness scorer
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=rag_app,
    scorers=[RetrievalGroundedness()],
)
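If the scorer cannot find retrieval context, the trace likely does not satisfy the requirements above. A quick sanity check, sketched with MLflow's tracing helpers (get_last_active_trace_id and Trace.search_spans; their usage here is an assumption for recent MLflow 3.x releases):

from mlflow.entities import SpanType

# Produce one trace, then fetch and inspect it
rag_app("What is MLflow used for?")
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())

# RetrievalGroundedness needs at least one RETRIEVER span
retriever_spans = trace.search_spans(span_type=SpanType.RETRIEVER)
print(f"Found {len(retriever_spans)} RETRIEVER span(s)")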
Using in a custom scorer
When evaluating an application whose data structure differs from the predefined scorer's requirements, wrap the judge in a custom scorer:
import mlflow
from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer
from typing import Dict, Any

eval_dataset = [
    {
        "inputs": {"query": "What is MLflow used for?"},
        "outputs": {
            "response": "MLflow is used for managing the ML lifecycle, including experiment tracking and model deployment.",
            "retrieved_context": [
                {"content": "MLflow is a platform for managing the ML lifecycle."},
                {"content": "MLflow includes capabilities for experiment tracking, model packaging, and deployment."},
            ],
        },
    },
    {
        "inputs": {"query": "Who created MLflow?"},
        "outputs": {
            "response": "MLflow was created by Databricks in 2018 and has over 10,000 contributors.",
            "retrieved_context": [
                {"content": "MLflow was created by Databricks."},
                {"content": "MLflow was open-sourced in 2018."},
            ],
        },
    },
]

@scorer
def groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"],
    )

# Run the evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    scorers=[groundedness_scorer],
)
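A hypothetical variant of the same scorer, using the optional name parameter from the API signature above to control how the feedback is labeled:

@scorer
def named_groundedness_scorer(inputs: Dict[Any, Any], outputs: Dict[Any, Any]):
    return is_grounded(
        request=inputs["query"],
        response=outputs["response"],
        context=outputs["retrieved_context"],
        name="groundedness",  # custom display name in the MLflow UIs
    )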
Interpreting results
The judge returns a Feedback object containing:
- value: "yes" if the response is grounded in the context, "no" if it contains hallucinations
- rationale: a detailed explanation identifying:
  - which statements are supported by the context
  - which statements lack support (hallucinations)
  - specific quotes from the context that support or contradict the claims
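Because value is a plain "yes"/"no" string and rationale is free text, the result is easy to act on programmatically. A small sketch using only the documented fields:

feedback = is_grounded(
    request="Who created MLflow?",
    response="MLflow was created by Databricks and has over 10,000 contributors.",
    context=[{"content": "MLflow was created by Databricks."}],
)

if feedback.value == "no":
    # The rationale pinpoints which claims lacked support in the context
    print(f"Hallucination detected: {feedback.rationale}")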
Next steps
- Evaluate context sufficiency - check whether your retriever provides enough information
- Evaluate context relevance - ensure that the retrieved documents are relevant to the query
- Run comprehensive RAG evaluation - combine multiple judges for a complete RAG assessment