Evaluation

Agent Framework 內建評估框架,讓您能衡量代理人的品質、安全性與正確性。 你可以在開發過程中執行快速的本地檢查,使用 Azure AI Foundry 的雲端評估器進行生產級評估,或將兩者合併於單一評估執行中。

評估架構圍繞幾個關鍵原則設計:

  • 提供者無關性 — 核心評估類型與協調功能可與任何評估提供者合作。
  • 零摩擦 ——從「我有經紀人」到「我有評估結果」,程式碼極少。
  • 漸進式揭露 — 簡單情境幾乎不需要程式碼。 進階劇本則建立在相同的原基基礎上。

核心概念

評估架構建立於三種類型之上:

類型 Purpose
評估項目 一個單一項目要評估——涵蓋整個對話,並透過分割策略推導出查詢/回應。
評估員 一個能評分項目的提供者——本地檢查、Azure AI Foundry,或任何自訂實作。
評估結果 評估執行的綜合結果——通過/失敗計數、每個項目的細節,以及選擇性的入口連結。

在 .NET 中,評估框架是基於 Microsoft.Extensions.AI.Evaluation 建立的。 評估器實作IAgentEvaluator介面,並透過在AIAgentRun上的擴充方法提供編排。

核心類型位於 Microsoft.Agents.AI 命名空間中:

using Microsoft.Agents.AI;

在 Python 中,評估框架是核心 agent_framework 套件的一部分。 評估器實作 Evaluator 協定,並透過 evaluate_agent()evaluate_workflow() 函式提供編排。

from agent_framework import (
    evaluate_agent,
    evaluate_workflow,
    EvalItem,
    EvalResults,
    LocalEvaluator,
)

地方評估員

LocalEvaluator 可在本地執行檢查,無需 API 呼叫——非常適合內迴路開發、CI 煙霧測試及快速迭代。 它接受任意數量的檢查功能,並將每個功能套用到每個項目上。

內建檢查

Agent Framework 內建常見情境檢查功能:

using Microsoft.Agents.AI;

var local = new LocalEvaluator(
    EvalChecks.KeywordCheck("weather", "temperature"),  // Response must contain these keywords
    EvalChecks.ToolCalledCheck("get_weather")            // Agent must have called this tool
);

自訂函數評估器

FunctionEvaluator.Create() 來包裝任何函式作為評估器檢查。 根據你需要的資料,有多種過載選項可供選擇:

using Microsoft.Agents.AI;

var local = new LocalEvaluator(
    // Simple: check only the response text
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split(' ').Length < 500),

    // With expected output: compare against ground truth
    FunctionEvaluator.Create("mentions_city",
        (string response, string? expectedOutput) =>
            expectedOutput != null && response.Contains(expectedOutput, StringComparison.OrdinalIgnoreCase)),

    // Full context: access the complete EvalItem
    FunctionEvaluator.Create("used_search",
        (EvalItem item) => item.Conversation.Any(m =>
            m.Text?.Contains("search", StringComparison.OrdinalIgnoreCase) == true))
);

內建檢查

Agent Framework 內建常見情境檢查功能:

檢查 其功能是什麼
keyword_check(*keywords) 回應必須包含所有指定關鍵字
tool_called_check(*tool_names) 代理人必須呼叫指定的工具
tool_calls_present 所有 expected_tool_calls 名字都會出現在聊天中(不排序,包含額外內容也可以)
tool_call_args_match 預期的工具呼叫名稱與參數匹配(子集匹配於 args 上)
from agent_framework import (
    LocalEvaluator,
    keyword_check,
    tool_called_check,
    tool_calls_present,
    tool_call_args_match,
)

local = LocalEvaluator(
    keyword_check("weather", "temperature"),  # Response must contain these keywords
    tool_called_check("get_weather"),          # Agent must have called this tool
    tool_calls_present,                        # All expected tool call names were made
    tool_call_args_match,                      # Expected tool calls match on name + args
)

自訂函數評估器

@evaluator 裝飾工具將任何函式包裝成評估器檢查。 它從 收到的資料由函數的 EvalItem 決定。

from agent_framework import evaluator, LocalEvaluator

@evaluator
def is_concise(response: str) -> bool:
    """Check response is under 500 words."""
    return len(response.split()) < 500

@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    """Check response contains the expected city name."""
    return expected_output.lower() in response.lower()

@evaluator
def used_tools(conversation: list, tools: list) -> float:
    """Score based on tool usage. Returns 0.0–1.0 (>= 0.5 passes)."""
    tool_calls = [c for m in conversation for c in (m.contents or []) if c.type == "function_call"]
    return min(len(tool_calls) / max(len(tools), 1), 1.0)

local = LocalEvaluator(is_concise, mentions_city, used_tools)

支援的參數名稱:query、、responseexpected_outputexpected_tool_callsconversationtoolscontext

回傳類型: bool、( float ≥ 0.5 = 通過)、 dict 帶有 scorepassed 鍵,或 CheckResult。 非同步函式則是自動處理的。

Azure AI Foundry 評估器

FoundryEvals 連接到 Azure AI Foundry 的評估服務,用於基於雲端的 LLM 充當裁判的評估。 結果可在 Foundry 入口網站中透過儀表板與比較檢視檢視。

using Microsoft.Agents.AI.AzureAI;

var foundry = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);
from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

預設情況下,執行 FoundryEvals相關性、 一致性任務依從 性評估器。 當項目包含工具定義時,會自動增加 工具呼叫的準確度

可用評估者

FoundryEvals 提供所有內建評估器名稱的常數:

類別 評審員
代理行為 intent_resolutiontask_adherencetask_completiontask_navigation_efficiency
工具使用 tool_call_accuracy、、 tool_selectiontool_input_accuracytool_output_utilizationtool_call_success
Quality coherencefluencyrelevancegroundednessresponse_completenesssimilarity
Safety violencesexualself_harmhate_unfairness

備註

FoundryEvals 需要一個Azure AI Foundry專案並部署 AI 模型。 參數 model_deployment 指定使用哪個模型作為 LLM 評審。

評估一位代理人

最簡單的評估情境是對測試查詢執行代理並對回應進行評分。 提供多樣的查詢,以進行具有統計上意義的評估。

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.Foundry;

var foundry = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[]
    {
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    },
    foundry);

results.AssertAllPassed();  // Throws if any item failed

EvaluateAsyncAIAgent 上的一種擴展方法。 它會每個查詢執行一次代理,將每次互動轉換成 EvalItem,然後將批次傳給評估者。

from agent_framework import evaluate_agent
from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=my_agent,
    queries=[
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    ],
    evaluators=evals,
)

for r in results:
    print(f"{r.provider}: {r.passed}/{r.total}")
    r.assert_passed()  # Raises AssertionError if any item failed

evaluate_agent 每次查詢執行一次代理,將每次互動轉換為 EvalItem,並將批次傳遞給評估器。 它會為每位評估提供者回傳一個EvalResults

以重複次數衡量一致性

重複執行每個查詢多次,用來偵測非確定性行為。

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    foundry,
    numRepetitions: 3);  // Each query runs 3 times independently
// Results contain 3 items (1 query × 3 repetitions)
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    evaluators=evals,
    num_repetitions=3,  # Each query runs 3 times independently
)
# Results contain 3 items (1 query × 3 repetitions)

根據預期輸出進行評估

提供根據事實的預期答案以評估正確性。 預期輸出會依位置與查詢配對:

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's 2+2?", "Capital of France?" },
    foundry,
    expectedOutput: new[] { "4", "Paris" });

您也可以指定預期的工具呼叫:

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in NYC?" },
    new LocalEvaluator(EvalChecks.ToolCalledCheck("get_weather")),
    expectedToolCalls: new[]
    {
        new[] { new ExpectedToolCall("get_weather") },
    });
from agent_framework import evaluate_agent, ExpectedToolCall

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's 2+2?", "Capital of France?"],
    expected_output=["4", "Paris"],
    evaluators=evals,
)

您也可以指定預期的工具呼叫:

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in NYC?"],
    expected_tool_calls=[ExpectedToolCall("get_weather", {"location": "NYC"})],
    evaluators=local,
)

評估既有回應

當你已經有來自日誌或先前執行的代理程式回應時,直接評估它們,不需要重新執行代理程式。

var response = await agent.RunAsync(new[] { new ChatMessage(ChatRole.User, "What's the weather?") });

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { response },
    new[] { "What's the weather?" },
    foundry);
from agent_framework import Message, evaluate_agent

response = await agent.run([Message("user", ["What's the weather?"])])

results = await evaluate_agent(
    agent=agent,
    responses=response,
    queries="What's the weather?",
    evaluators=evals,
)

對話劃分策略

多回合對話必須分成詢問與回應兩半以供評估。 你如何拆分決定你所評估的內容。

策略 行為 最適合用於
最後一回合 (預設) 在最後一個用戶訊息時分開。 直到它之前,一切都是查詢上下文;之後的一切都是回應。 特定點的響應品質
完整 第一個使用者訊息是查詢;剩下的全部就是回應。 任務完成與整體軌跡
每回合 每次使用者→助理的互動都會獨立評分,並納入累積的上下文評估。 細粒度分析
// Full conversation as context
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "Plan a 3-day trip to Paris" },
    foundry,
    splitter: ConversationSplitters.Full);

// Per-turn: each exchange scored independently
var items = EvalItem.PerTurnItems(conversation);
var perTurnResults = await evaluator.EvaluateAsync(items);

你也可以實作自訂分配器,透過實作 IConversationSplitter

public class SplitBeforeToolCall : IConversationSplitter
{
    public (IReadOnlyList<ChatMessage> QueryMessages, IReadOnlyList<ChatMessage> ResponseMessages) Split(
        IReadOnlyList<ChatMessage> conversation)
    {
        // Custom split logic
        for (int i = 0; i < conversation.Count; i++)
        {
            if (conversation[i].Text?.Contains("tool_call") == true)
                return (conversation.Take(i).ToList(), conversation.Skip(i).ToList());
        }
        return ConversationSplitters.LastTurn.Split(conversation);
    }
}
from agent_framework import evaluate_agent, ConversationSplit

# Full conversation as context
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.FULL,
)

# Per-turn: each exchange scored independently
from agent_framework import EvalItem

items = EvalItem.per_turn_items(conversation)
# Pass items directly to an evaluator
per_turn_results = await evaluator.evaluate(items)

你也可以提供自訂的拆分器——任何可呼叫的函數,能接收對話並回傳 (query_messages, response_messages)

def split_before_memory(conversation):
    """Split just before a memory-retrieval tool call."""
    for i, msg in enumerate(conversation):
        for c in msg.contents or []:
            if c.type == "function_call" and c.name == "retrieve_memory":
                return conversation[:i], conversation[i:]
    # Fallback to default
    return EvalItem._split_last_turn_static(conversation)

results = await evaluate_agent(
    agent=agent,
    queries=queries,
    evaluators=evals,
    conversation_split=split_before_memory,
)

評估工作流程

評估多代理人的工作流程並進行每位代理人的細分評估。 該框架會擷取每個子代理人的互動,並逐一評估,連同整個工作流程的整體輸出。

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.AzureAI;

Run run = await workflowRunner.RunAsync(workflow, "Plan a trip to Paris");

AgentEvaluationResults results = await run.EvaluateAsync(
    new FoundryEvals(chatConfiguration, FoundryEvals.Relevance));

Console.WriteLine($"Overall: {results.Passed}/{results.Total}");

// Per-agent breakdown
if (results.SubResults != null)
{
    foreach (var (name, sub) in results.SubResults)
    {
        Console.WriteLine($"  {name}: {sub.Passed}/{sub.Total}");
    }
}

results.AssertAllPassed();
from agent_framework import evaluate_workflow
from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(project_client=project_client, model_deployment="gpt-4o")
result = await workflow.run("Plan a trip to Paris")

eval_results = await evaluate_workflow(
    workflow=workflow,
    workflow_result=result,
    evaluators=evals,
)

for r in eval_results:
    print(f"{r.provider}: {r.passed}/{r.total}")
    for name, sub in r.sub_results.items():
        print(f"  {name}: {sub.passed}/{sub.total}")

你也可以直接傳遞 queries,框架會為你執行工作流程。

eval_results = await evaluate_workflow(
    workflow=workflow,
    queries=["Plan a trip to Paris", "Book a flight to London"],
    evaluators=evals,
)

混合多個評估員

將本地檢查與雲端評估器合併在單一評估中。 每個評估者都產生自己的 EvalResults

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.AzureAI;

IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    evaluators: new IAgentEvaluator[]
    {
        new LocalEvaluator(
            EvalChecks.KeywordCheck("weather"),
            FunctionEvaluator.Create("is_helpful", (string r) => r.Split(' ').Length > 10)),
        new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence),
    });

// results[0] = local evaluator results
// results[1] = Foundry evaluator results
foreach (var r in results)
{
    Console.WriteLine($"{r.Provider}: {r.Passed}/{r.Total}");
}
from agent_framework import evaluate_agent, evaluator, LocalEvaluator, keyword_check
from agent_framework_azure_ai import FoundryEvals

@evaluator
def is_helpful(response: str) -> bool:
    return len(response.split()) > 10

foundry = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=agent,
    queries=["What's the weather in Seattle?"],
    evaluators=[
        LocalEvaluator(is_helpful, keyword_check("weather")),
        foundry,
    ],
)

# results[0] = local evaluator results
# results[1] = Foundry evaluator results
for r in results:
    print(f"{r.provider}: {r.passed}/{r.total}")

MEAI 評估員

.NET評估框架直接與Microsoft.Extensions.AI.Evaluation評估器進行整合。 MEAI 的品質與安全評估員無需使用轉接頭:

using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Safety;

// Quality evaluators
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather?" },
    new CompositeEvaluator(
        new RelevanceEvaluator(),
        new CoherenceEvaluator(),
        new GroundednessEvaluator()),
    chatConfiguration: new ChatConfiguration(evalClient));

// Safety evaluators
AgentEvaluationResults safetyResults = await agent.EvaluateAsync(
    new[] { "What's the weather?" },
    new ContentHarmEvaluator(),
    chatConfiguration: new ChatConfiguration(evalClient));

小提示

使用 MEAI 評估器時,請提供 chatConfiguration 參數,並設定一個符合評估模型要求的聊天用戶端。 這個用戶端被作為評審的LLM用來評分回應。

下一步