Evaluate your data agent (preview)

2025-05-06

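Evaluation with the Fabric SDK lets you programmatically test how well your Data Agent responds to natural language questions. Using a simple Python interface, you can define ground truth examples, run evaluations, and analyze the results, all within a notebook environment. This helps you validate accuracy, debug errors, and improve your agent with confidence before you deploy it to production.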

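Prerequisites

A paid F2 or higher Fabric capacity resource.
The Fabric data agent tenant setting is enabled.
The Copilot tenant switch is enabled.
Cross-geo processing for AI is enabled.
Cross-geo storing for AI is enabled.
At least one of the following data sources: a warehouse, a lakehouse, one or more Power BI semantic models, or a KQL database with data.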

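Install the data agent SDK

To start evaluating your Fabric data agent programmatically, you need to install the Fabric Data Agent Python SDK. The SDK provides the tools and methods required to interact with your data agent, run evaluations, and log results. Run the following command in your notebook to install the latest version: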

%pip install -U fabric-data-agent-sdk

This step ensures that you have the most up-to-date features and fixes available in the SDK.
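To confirm that the install took effect, you can check which version is present in the notebook session. The following is a minimal sketch that relies only on the Python standard library; the helper name installed_version is hypothetical and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The package name matches the one used in the %pip command.
sdk_version = installed_version("fabric-data-agent-sdk")
if sdk_version is None:
    print("fabric-data-agent-sdk is not installed in this environment.")
else:
    print(f"fabric-data-agent-sdk version: {sdk_version}")
```

If the package was just upgraded, the notebook session may need a restart before the new version is picked up.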

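Load your ground truth dataset

To evaluate your Fabric data agent, you need a set of sample questions along with their expected answers. These questions are used to verify how accurately the agent responds to real-world queries.

You can define these questions directly in code by using a pandas DataFrame: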

import pandas as pd

# Define a sample evaluation set with user questions and their expected answers.
# You can modify the question/answer pairs to match your scenario.
df = pd.DataFrame(
    columns=["question", "expected_answer"],
    data=[
        ["Show total sales for Canadian Dollar for January 2013", "46,117.30"],
        ["What is the product with the highest total sales for Canadian Dollar in 2013", "Mountain-200 Black, 42"],
        ["Total sales outside of the US", "19,968,887.95"],
        ["Which product category had the highest total sales for Canadian Dollar in 2013", "Bikes (Total Sales: 938,654.76)"]
    ]
)

Alternatively, if you have an existing evaluation dataset, you can load it from a CSV file that contains the two columns question and expected_answer.

# Load questions and expected answers from a CSV file
input_file_path = "/lakehouse/default/Files/Data/Input/curated_2.csv"
df = pd.read_csv(input_file_path)

This dataset serves as the input for running automated evaluations against your data agent to assess accuracy and coverage.
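Before running an evaluation, it can help to check that the dataset is well formed. The sketch below is one possible pre-check, not something the SDK requires: it assumes the two column names used above (question and expected_answer), and the helper name find_dataset_problems is hypothetical. It uses only the standard library so that it can run on raw CSV text.

```python
import csv
import io

REQUIRED_COLUMNS = ("question", "expected_answer")


def find_dataset_problems(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen_questions = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        question = (row["question"] or "").strip()
        expected = (row["expected_answer"] or "").strip()
        if not question:
            problems.append(f"row {row_number}: empty question")
        if not expected:
            problems.append(f"row {row_number}: empty expected_answer")
        if question and question.lower() in seen_questions:
            problems.append(f"row {row_number}: duplicate question")
        seen_questions.add(question.lower())
    return problems


# Small sample with one duplicate question and one missing expected answer.
sample_csv = '''question,expected_answer
Total sales outside of the US,"19,968,887.95"
Total sales outside of the US,"19,968,887.95"
Show total sales for Canadian Dollar for January 2013,
'''
print(find_dataset_problems(sample_csv))
```

An empty list means none of these particular problems were found; it does not guarantee that the expected answers themselves are correct.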

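Evaluate and inspect your data agent

The next step is to run the evaluation by using the evaluate_data_agent function. This function compares the agent's responses with the expected results and stores the evaluation metrics.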

from fabric.dataagent.evaluation import evaluate_data_agent

# Name of your Data Agent
data_agent_name = "AgentEvaluation"

# (Optional) Name of the workspace if the Data Agent is in a different workspace
workspace_name = None

# (Optional) Name of the output table to store evaluation results (default: "evaluation_output")
# Two tables will be created:
# - "<table_name>": contains summary results (e.g., accuracy)
# - "<table_name>_steps": contains detailed reasoning and step-by-step execution
table_name = "demo_evaluation_output"

# Specify the Data Agent stage: "production" (default) or "sandbox"
data_agent_stage = "production"

# Run the evaluation and get the evaluation ID
evaluation_id = evaluate_data_agent(
    df,
    data_agent_name,
    workspace_name=workspace_name,
    table_name=table_name,
    data_agent_stage=data_agent_stage
)

print(f"Unique ID for the current evaluation run: {evaluation_id}")

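Get the evaluation summary

After running the evaluation, you can retrieve a high-level summary of the results by using the get_evaluation_summary function. This function gives insight into how well your data agent performed overall, including metrics such as the number of responses that matched the expected answers.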

from fabric.dataagent.evaluation import get_evaluation_summary

# Retrieve a summary of the evaluation results
# Use a separate name so the ground truth DataFrame `df` is not overwritten
summary_df = get_evaluation_summary(table_name)

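By default, this function looks for a table named evaluation_output. If you specified a custom table name during evaluation (for example, "demo_evaluation_output"), pass that name as the table_name argument.

The returned DataFrame contains aggregated metrics, such as the number of correct, incorrect, or unclear responses. This result helps you quickly assess the agent's accuracy and identify areas for improvement.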

get_evaluation_summary

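Returns a DataFrame containing high-level summary metrics for a completed evaluation run, such as the number of correct, incorrect, and unclear responses.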

get_evaluation_summary(table_name='evaluation_output', verbose=False)

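Input parameters:

table_name (str, optional) - The name of the table that contains the evaluation summary results. Defaults to 'evaluation_output'.
verbose (bool, optional) - If set to True, prints a summary of the evaluation metrics to the console. Defaults to False.

Returns:

DataFrame - A pandas DataFrame that contains summary statistics for the evaluation, such as:
- Total number of evaluated questions
- Counts of true, false, and unclear results
- Accuracy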
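To make the relationship between these summary fields concrete, here is a small worked sketch. It is not the SDK's implementation: the text above does not say how accuracy is computed, so the formula used here (true results divided by total questions, with unclear counted against the agent) is an assumption, and the helper name summarize_results is hypothetical.

```python
from collections import Counter


def summarize_results(results):
    """Count 'true', 'false', and 'unclear' outcomes and derive an accuracy figure.

    Assumption: accuracy = true / total, so 'unclear' counts against the agent.
    """
    counts = Counter(results)
    total = len(results)
    true_count = counts.get("true", 0)
    accuracy = true_count / total if total else 0.0
    return {
        "total": total,
        "true": true_count,
        "false": counts.get("false", 0),
        "unclear": counts.get("unclear", 0),
        "accuracy": accuracy,
    }


# Four illustrative outcomes: two correct, one incorrect, one unclear.
summary = summarize_results(["true", "true", "false", "unclear"])
print(summary)  # accuracy is 2 / 4 = 0.5 under the assumption above
```

If unclear results should be excluded from the denominator instead, the accuracy for the same input would be 2 / 3; check the values returned by get_evaluation_summary to see which convention applies.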

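Inspect detailed evaluation results

To dig deeper into how your Data Agent responded to each individual question, use the get_evaluation_details function. This function returns a detailed breakdown of the evaluation run, including the actual agent responses, whether they matched the expected answers, and a link to the evaluation thread (visible only to the user who ran the evaluation).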

from fabric.dataagent.evaluation import get_evaluation_details

# Table name used during evaluation
table_name = "demo_evaluation_output"

# Whether to return all evaluation rows (True) or only failures (False)
get_all_rows = False

# Whether to print a summary of the results
verbose = True

# Retrieve evaluation details for a specific run
eval_details = get_evaluation_details(
    evaluation_id,
    table_name,
    get_all_rows=get_all_rows,
    verbose=verbose
)

get_evaluation_details

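Returns a DataFrame containing detailed results for a specific evaluation run, including the questions, expected answers, agent responses, evaluation status, and diagnostic metadata.

Input parameters:

evaluation_id (str) - Required. The unique identifier of the evaluation run to retrieve details for.
table_name (str, optional) - The name of the table that contains the evaluation results. Defaults to evaluation_output.
get_all_rows (bool, optional) - Whether to return all rows from the evaluation (True) or only the rows where the agent's response was incorrect or unclear (False). Defaults to False.
verbose (bool, optional) - If set to True, prints a summary of the evaluation metrics to the console. Defaults to False.

Returns:

DataFrame - A pandas DataFrame that contains row-level evaluation results, including: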
- question
- expected_answer
- actual_answer
- evaluation_result (true, false, unclear)
- thread_url (accessible only by the user who ran the evaluation)
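Once you have row-level results, a common next step is to group the non-passing questions by outcome so that incorrect and unclear answers can be reviewed separately. The sketch below works on plain dictionaries shaped like the columns listed above. The rows are made-up placeholders, not real evaluation output, and the helper name group_non_passing is hypothetical.

```python
from collections import defaultdict

# Placeholder rows that mimic the documented columns; values are illustrative only.
detail_rows = [
    {"question": "Q1", "expected_answer": "10", "actual_answer": "10", "evaluation_result": "true"},
    {"question": "Q2", "expected_answer": "20", "actual_answer": "25", "evaluation_result": "false"},
    {"question": "Q3", "expected_answer": "30", "actual_answer": "about 30?", "evaluation_result": "unclear"},
]


def group_non_passing(rows):
    """Group questions whose evaluation_result is not 'true' by that result."""
    grouped = defaultdict(list)
    for row in rows:
        # Normalize so that both the string "true" and the boolean True are handled.
        result = str(row["evaluation_result"]).lower()
        if result != "true":
            grouped[result].append(row["question"])
    return dict(grouped)


print(group_non_passing(detail_rows))
```

If the returned DataFrame uses these column names, rows in this shape can be obtained with the pandas call to_dict("records") on that DataFrame.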

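Customize your evaluation prompt

By default, the Fabric SDK uses a built-in prompt to evaluate whether the data agent's actual answer matches the expected answer. However, you can supply your own prompt through the critic_prompt parameter for more nuanced or domain-specific evaluations.

Your custom prompt should include the placeholders {query}, {expected_answer}, and {actual_answer}. These placeholders are substituted dynamically for each question during evaluation.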

from fabric.dataagent.evaluation import evaluate_data_agent

# Define a custom prompt for evaluating agent responses
critic_prompt = """
    Given the following query, expected answer, and actual answer, please determine if the actual answer is equivalent to expected answer. If they are equivalent, respond with 'yes'.

    Query: {query}

    Expected Answer:
    {expected_answer}

    Actual Answer:
    {actual_answer}

    Is the actual answer equivalent to the expected answer?
"""

# Name of the Data Agent
data_agent_name = "AgentEvaluation"

# Run evaluation using the custom critic prompt
evaluation_id = evaluate_data_agent(df, data_agent_name, critic_prompt=critic_prompt)

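This feature is especially useful when:

You want to apply more lenient or stricter criteria for what counts as a match.
Your expected and actual answers may differ in format but still be semantically equivalent.
You need to capture domain-specific nuances in how answers should be judged.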
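Because a custom prompt needs all three placeholders, a quick check before running an evaluation can catch a typo early. The following is a minimal sketch using the standard library's string.Formatter. It assumes the placeholders behave like str.format fields, which is not confirmed here, and the helper name missing_placeholders is hypothetical.

```python
from string import Formatter

REQUIRED_PLACEHOLDERS = {"query", "expected_answer", "actual_answer"}


def missing_placeholders(prompt):
    """Return the required placeholder names that do not appear in the prompt."""
    found = {field for _, field, _, _ in Formatter().parse(prompt) if field}
    return REQUIRED_PLACEHOLDERS - found


prompt = (
    "Query: {query}\n"
    "Expected Answer: {expected_answer}\n"
    "Actual Answer: {actual_answer}\n"
    "Is the actual answer equivalent to the expected answer?"
)
print(missing_placeholders(prompt))  # an empty set means nothing is missing

# Preview of one filled-in prompt, assuming str.format-style substitution.
print(prompt.format(
    query="Total sales outside of the US",
    expected_answer="19,968,887.95",
    actual_answer="(agent response goes here)",
))
```

Note that under this assumption, any literal braces in a prompt would need to be doubled ({{ and }}) to avoid being read as placeholders.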
