Retrieval-Augmented Generation (RAG) evaluators

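A Retrieval-Augmented Generation (RAG) system tries to generate the answer that is most relevant to a user's query and consistent with its grounding documents. The user's query triggers a search over the corpus of grounding documents, and the results give the AI model the grounding context it needs to generate a response.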

Evaluator	Best practice	Use when	Purpose	Output
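Document Retrieval	Process evaluation	Retrieval quality is the bottleneck for your RAG, and you have query relevance labels (ground truth) that provide precise search quality metrics for debugging and parameter optimization	Measures search quality metrics (Fidelity, NDCG, XDCG, Max Relevance, Holes) by comparing retrieved documents against ground truth labels	Composite: Fidelity, NDCG, XDCG, Max Relevance, Holes (with Pass/Fail)
Retrieval	Process evaluation	You want to assess the textual quality of the retrieved context, but you don't have ground truth	Measures how relevant the retrieved context chunks are to addressing the query, using an LLM judge	Binary: Pass/Fail based on threshold (1-5 scale)
Groundedness	System evaluation	You want a well-rounded groundedness definition that works with agent inputs, and you bring your own GPT model as the LLM judge	Measures how well the generated response aligns with the given context without fabricating content (precision aspect)	Binary: Pass/Fail based on threshold (1-5 scale)
Groundedness Pro (preview)	System evaluation	You want a strict groundedness definition powered by Azure AI Content Safety, using our service model	Detects whether the response is strictly consistent with the context, using the Azure AI Content Safety service	Binary: True/False
Relevance	System evaluation	You want to assess how well the RAG response addresses the query, but you don't have ground truth	Measures the accuracy, completeness, and direct relevance of the response to the query	Binary: Pass/Fail based on threshold (1-5 scale)
Response Completeness (preview)	System evaluation	You want to ensure the RAG response doesn't miss critical information from your ground truth (recall aspect)	Measures how completely the response covers the expected information compared to ground truth	Binary: Pass/Fail based on threshold (1-5 scale)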

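Think of groundedness and response completeness as follows:

Groundedness focuses on the precision aspect of the response. A grounded response contains no content outside the grounding context.
Response completeness focuses on the recall aspect of the response. A complete response doesn't miss critical information compared to the expected response or ground truth.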
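
One way to picture the two aspects is with sets of claims. The sketch below is a toy illustration only: the actual evaluators rely on an LLM judge and a 1-5 scale rather than set arithmetic, and the claim strings and function names are hypothetical.

```python
def precision_like(response_claims: set[str], context_claims: set[str]) -> float:
    """Share of response claims supported by the context (groundedness aspect)."""
    if not response_claims:
        return 1.0
    return len(response_claims & context_claims) / len(response_claims)


def recall_like(response_claims: set[str], ground_truth_claims: set[str]) -> float:
    """Share of expected claims that the response covers (completeness aspect)."""
    if not ground_truth_claims:
        return 1.0
    return len(response_claims & ground_truth_claims) / len(ground_truth_claims)


context = {"returns within 30 days", "receipt required", "full refund"}
ground_truth = {"returns within 30 days", "receipt required", "full refund"}
response = {"returns within 30 days", "receipt required"}

groundedness_aspect = precision_like(response, context)    # nothing fabricated
completeness_aspect = recall_like(response, ground_truth)  # "full refund" is missing
```

In this toy case the response adds nothing beyond the context, yet it omits one expected claim, so it would do well on the precision side and less well on the recall side.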

System evaluation

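System evaluation examines the quality of the final response in your RAG workflow. These evaluators check that the AI-generated content is accurate, relevant, and complete, based on the provided context and the user query:

Groundedness - Is the response grounded in the provided context, without fabrication?
Groundedness Pro - Is the response strictly consistent with the context (Azure AI Content Safety)?
Relevance - Does the response accurately address the user's question?
Response Completeness (preview) - Does the response cover all critical information?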

Process evaluation

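Process evaluation assesses the quality of the document retrieval step in a RAG system. The retrieval step is crucial for providing relevant context to the language model:

Retrieval - How relevant are the retrieved context chunks to the query?
Document Retrieval - How well does retrieval match the ground truth labels (requires qrels)?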

For more examples, see all quality evaluator samples.

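Using RAG evaluators

RAG evaluators assess how well an AI system retrieves context and uses it to generate grounded responses. Each evaluator requires specific data mappings and parameters: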

Evaluator	Required inputs	Required parameters
Groundedness	`response` (`context` recommended); `query` optional for enhanced scoring; or `query`, `response` for agent response mode	`deployment_name`
Groundedness Pro (preview)	`query`, `response`, `context`	(none)
Relevance	`query`, `response`	`deployment_name`
Response Completeness (preview)	`ground_truth`, `response`	`deployment_name`
Retrieval	`query`, `context`	`deployment_name`
Document Retrieval	`retrieval_ground_truth`, `retrieved_documents`	(none)
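
Before you submit a run, it can help to check that each dataset row carries the fields an evaluator needs. The following is a minimal sketch that transcribes the table above into a lookup. The dictionary keys and the `missing_fields` helper are hypothetical names chosen for illustration and are not part of any SDK.

```python
# Required data-mapping inputs per evaluator, transcribed from the table above.
# The keys are illustrative labels, not official evaluator identifiers.
REQUIRED_INPUTS = {
    "groundedness": {"response"},  # context recommended, query optional
    "groundedness_pro": {"query", "response", "context"},
    "relevance": {"query", "response"},
    "response_completeness": {"ground_truth", "response"},
    "retrieval": {"query", "context"},
    "document_retrieval": {"retrieval_ground_truth", "retrieved_documents"},
}


def missing_fields(evaluator: str, row: dict) -> set[str]:
    """Return the required fields that are absent from one dataset row."""
    return REQUIRED_INPUTS[evaluator] - set(row)


row = {"query": "What are the store hours?", "response": "Weekdays 9am to 6pm."}
relevance_gaps = missing_fields("relevance", row)  # row has both fields
retrieval_gaps = missing_fields("retrieval", row)  # row has no context
```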

Example input

Your test dataset should contain the fields referenced in your data mappings:

{"query": "What are the store hours?", "context": "Our store is open Monday-Friday 9am-6pm and Saturday 10am-4pm.", "response": "The store is open weekdays from 9am to 6pm and Saturdays from 10am to 4pm."}
{"query": "What is the return policy?", "context": "Items can be returned within 30 days with original receipt for full refund.", "response": "You can return items within 30 days if you have your receipt."}

Context format

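The context field is a plain string containing the retrieved context provided to the model. For multi-chunk retrieval, concatenate the chunks into a single string, using \n\n as the separator between chunks: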

{"query": "What is the return policy?", "context": "Items can be returned within 30 days with receipt.\n\nGift items are eligible for store credit only.", "response": "You can return items within 30 days with your receipt."}
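
If your retriever returns a list of chunks, a row like the one above could be built as follows. This is a small sketch using only the standard library; the `chunks` list is a stand-in for whatever your retrieval step returns.

```python
import json

# Stand-in for the chunks returned by your retrieval step.
chunks = [
    "Items can be returned within 30 days with receipt.",
    "Gift items are eligible for store credit only.",
]

row = {
    "query": "What is the return policy?",
    "context": "\n\n".join(chunks),  # one plain string, chunks separated by a blank line
    "response": "You can return items within 30 days with your receipt.",
}

# json.dumps escapes the newlines, so the row still fits on one JSONL line.
jsonl_line = json.dumps(row)
```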

Note

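For agent evaluation with {{sample.output_items}}, the context field is optional if the response contains tool call messages. The evaluator can extract context from the tool call results.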

Configuration example

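Data mapping syntax:

{{item.field_name}} references a field in your test dataset (for example, {{item.query}}).
{{sample.output_items}} references the agent responses generated or retrieved during evaluation. Use it when evaluating with an agent target or an agent response data source. For agent evaluation, context is optional if the response contains tool calls, because the evaluator can extract context from the tool call results.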
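
To make the {{item.field_name}} syntax concrete, here is a toy resolver that substitutes dataset values into a mapping. It is a sketch for building intuition only and is not how the evaluation service resolves mappings internally; `resolve_item_refs` is a hypothetical helper.

```python
import re

# Matches {{item.<field_name>}} and captures the field name.
ITEM_REF = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def resolve_item_refs(data_mapping: dict, item: dict) -> dict:
    """Replace each {{item.field}} reference with the value from one dataset row."""
    resolved = {}
    for target, template in data_mapping.items():
        match = ITEM_REF.fullmatch(template)
        # Leave anything else (for example {{sample.output_items}}) untouched.
        resolved[target] = item[match.group(1)] if match else template
    return resolved


item = {"query": "What are the store hours?", "response": "Weekdays 9am to 6pm."}
mapping = {"query": "{{item.query}}", "response": "{{sample.output_items}}"}
resolved = resolve_item_refs(mapping, item)
```

Here `query` is filled from the dataset row, while the {{sample.output_items}} reference is left as-is because its value only exists once the agent has produced a response.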

Tip

For the best groundedness results, provide all three fields: query, response, and context. The query field is optional, but when present it improves scoring accuracy.

testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "groundedness",
        "evaluator_name": "builtin.groundedness",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {
            "context": "{{item.context}}",
            "response": "{{item.response}}",
        },
    },
    {
        "type": "azure_ai_evaluator",
        "name": "relevance",
        "evaluator_name": "builtin.relevance",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "response": "{{item.response}}"},
    },
    {
        "type": "azure_ai_evaluator",
        "name": "retrieval",
        "evaluator_name": "builtin.retrieval",
        "initialization_parameters": {"deployment_name": model_deployment},
        "data_mapping": {"query": "{{item.query}}", "context": "{{item.context}}"},
    },
]
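
The groundedness entry above maps only context and response. Following the tip, a variant that also maps the optional query field might look like the sketch below. It reuses the same structure as the configuration above; the value assigned to `model_deployment` is a hypothetical placeholder for your own deployment name.

```python
model_deployment = "my-judge-deployment"  # hypothetical placeholder

groundedness_with_query = {
    "type": "azure_ai_evaluator",
    "name": "groundedness",
    "evaluator_name": "builtin.groundedness",
    "initialization_parameters": {"deployment_name": model_deployment},
    "data_mapping": {
        "query": "{{item.query}}",  # optional; the tip says it improves scoring accuracy
        "context": "{{item.context}}",
        "response": "{{item.response}}",
    },
}
```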

For details on running evaluations and configuring data sources, see Run evaluations in the SDK.

Example output

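The LLM-judged evaluators (Groundedness, Relevance, Retrieval, and Response Completeness) return a score on a 1-5 scale, where 1 is very poor and 5 is excellent. The default pass threshold is 3. A score at or above the threshold counts as a pass. Key output fields: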

{
    "type": "azure_ai_evaluator",
    "name": "Groundedness",
    "metric": "groundedness",
    "score": 4,
    "label": "pass",
    "reason": "The response is well-grounded in the provided context without fabricating content.",
    "threshold": 3,
    "passed": true
}
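
The relationship between score, threshold, and label in the output above can be sketched as a small function. This mirrors the rule stated in the text (pass when the score is at or above the threshold) and is not the service's own code; `label_for_score` is a hypothetical name.

```python
def label_for_score(score: int, threshold: int = 3) -> str:
    """Return "pass" when a 1-5 score meets or exceeds the threshold, else "fail"."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    return "pass" if score >= threshold else "fail"


label = label_for_score(4)  # the example output above has score 4 and threshold 3
```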

Groundedness Pro uses the Azure AI Content Safety service and returns a boolean result instead of a numeric score:

{
    "type": "azure_ai_evaluator",
    "name": "Groundedness Pro",
    "metric": "groundedness_pro",
    "label": "pass",
    "reason": "The response is strictly consistent with the provided context.",
    "passed": true
}

Document retrieval

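Because retrieval sits upstream in RAG, its quality matters. If retrieval quality is poor and the response requires corpus-specific knowledge, your language model is less likely to give you a satisfactory answer. The most precise way to measure it is to evaluate retrieval quality with the document_retrieval evaluator and use the results to optimize your RAG search parameters.

The document retrieval evaluator measures how well the RAG system retrieves the correct documents from the document store. As a composite evaluator for RAG scenarios that have ground truth, it computes a list of search quality metrics that are useful for debugging your RAG pipeline: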

Metric	Category	Description
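Fidelity	Search Fidelity	How well the top n retrieved chunks reflect the content for a given query: the number of good documents returned out of the total number of known good documents in the dataset
NDCG	Search NDCG	How close the ranking is to an ideal order in which all relevant items appear at the top of the list
XDCG	Search XDCG	How good the results in the top-k documents are, regardless of the scores of other indexed documents
Max Relevance N	Search Max Relevance	The maximum relevance in the top-k chunks
Holes	Search Label Sanity	The number of documents that are missing query relevance judgments (ground truth)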
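
For intuition about what a ranking metric such as NDCG rewards, the sketch below implements a common textbook form (linear gain with a log2 position discount). This is an assumption made for illustration: the document does not state the exact formula the document_retrieval evaluator uses, so its scores might differ from this sketch.

```python
import math


def dcg(labels: list[float]) -> float:
    """Discounted cumulative gain: linear gain, log2 discount by position."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels))


def ndcg_at_k(ranked_labels: list[float], all_labels: list[float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the ideal top-k ordering."""
    ideal = dcg(sorted(all_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal > 0 else 0.0


# Ground-truth labels of the retrieved documents, in ranked order.
# Unlabeled documents are counted as 0 in this sketch.
ranked = [2, 0, 4]
# Every ground-truth label known for the query.
known = [4, 2]
score = ndcg_at_k(ranked, known, k=3)
```

Because the most relevant document (label 4) is ranked third instead of first, the score falls below 1. A ranking that placed the labels in the order 4, 2 would reach 1.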

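In a scenario known as a parameter sweep, you can use these metrics to calibrate your search parameters for the best RAG results. Generate retrieval results for each combination of search parameters you want to test, such as the search algorithm (vector, semantic), top_k, and chunk size. Then use document_retrieval to identify the search parameters that yield the highest retrieval quality.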
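
Once each configuration has been evaluated, picking the winner is a simple comparison. The sketch below assumes you have already collected one metric value per configuration. The numbers are made-up placeholders for illustration, not measured results, and the choice of NDCG as the deciding metric is also an assumption.

```python
# Placeholder scores keyed by (search algorithm, top_k).
# These values are invented for illustration only.
sweep_results = {
    ("vector", 5): 0.61,
    ("vector", 10): 0.66,
    ("semantic", 5): 0.72,
    ("semantic", 10): 0.70,
}

# Pick the configuration with the highest score.
best_config, best_score = max(sweep_results.items(), key=lambda kv: kv[1])
```

In practice you might weigh several metrics together (for example, rejecting configurations with a high holes ratio before comparing rankings) rather than maximizing a single number.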

Document retrieval example

testing_criteria = [
    {
        "type": "azure_ai_evaluator",
        "name": "document_retrieval",
        "evaluator_name": "builtin.document_retrieval",
        "initialization_parameters": {
            "ground_truth_label_min": 1,  # SDK default: 0
            "ground_truth_label_max": 5,  # SDK default: 4
        },
        "data_mapping": {
            "retrieval_ground_truth": "{{item.retrieval_ground_truth}}",
            "retrieved_documents": "{{item.retrieved_documents}}",
        },
    },
]

The retrieval_ground_truth contains a human-labeled relevance score for each document:

retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
]

The retrieved_documents contains the scores from your search system:

retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "6", "relevance_score": 35.8},
]
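
In the two samples above, document "6" was retrieved but has no ground-truth label. The sketch below counts such unlabeled documents, following the description of the Holes metric. It is an assumed interpretation for illustration, and the evaluator's exact computation might differ.

```python
retrieval_ground_truth = [
    {"document_id": "1", "query_relevance_label": 4},
    {"document_id": "2", "query_relevance_label": 2},
]
retrieved_documents = [
    {"document_id": "2", "relevance_score": 45.1},
    {"document_id": "6", "relevance_score": 35.8},
]

labeled_ids = {doc["document_id"] for doc in retrieval_ground_truth}

# Treat a retrieved document with no ground-truth label as a "hole".
holes = [
    doc["document_id"]
    for doc in retrieved_documents
    if doc["document_id"] not in labeled_ids
]
holes_ratio = len(holes) / len(retrieved_documents)
```

A high ratio suggests the labels cover too little of what the search system returns, which makes the other metrics less trustworthy.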

Document retrieval output

The document_retrieval evaluator returns multiple retrieval quality metrics:

[
    {
        "type": "azure_ai_evaluator",
        "name": "Document Retrieval",
        "metric": "ndcg@3",
        "score": 0.646,
        "label": "pass",
        "passed": true
    },
    {
        "type": "azure_ai_evaluator",
        "name": "Document Retrieval",
        "metric": "fidelity",
        "score": 0.019,
        "label": "fail",
        "passed": false
    },
    # more metrics...
]

The evaluator also returns the xdcg@3, top1_relevance, top3_max_relevance, holes, and holes_ratio metrics.
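
Because the output is a list with one entry per metric, it is often convenient to flatten it into a lookup. The sketch below uses only the fields shown in the sample output and assumes you have already parsed the results into Python dictionaries.

```python
# The two entries from the sample output, parsed into Python dictionaries.
results = [
    {"name": "Document Retrieval", "metric": "ndcg@3", "score": 0.646, "passed": True},
    {"name": "Document Retrieval", "metric": "fidelity", "score": 0.019, "passed": False},
]

# Map each metric name to its score, and list the metrics that did not pass.
scores = {entry["metric"]: entry["score"] for entry in results}
failed = [entry["metric"] for entry in results if not entry["passed"]]
```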

Last updated on 2026-04-30