Evaluation

Agent Framework includes a built-in evaluation framework that lets you measure agent quality, safety, and correctness. You can run fast local checks during development, use Azure AI Foundry's cloud-based evaluators for production-grade assessment, or combine both in a single evaluation run.

The evaluation framework is designed around a few key principles:

  • Provider-agnostic — Core evaluation types and orchestration functions work with any evaluation provider.
  • Zero friction — Go from "I have an agent" to "I have eval results" with minimal code.
  • Progressive disclosure — Simple scenarios require near-zero code. Advanced scenarios build on the same primitives.

Core concepts

The evaluation framework is built on three types:

  • EvalItem — A single item to evaluate. Wraps the full conversation and derives query/response via a split strategy.
  • Evaluator — A provider that scores items: local checks, Azure AI Foundry, or any custom implementation.
  • EvalResults — Aggregated results from an evaluation run: pass/fail counts, per-item detail, and optional portal links.

In .NET, the evaluation framework builds on Microsoft.Extensions.AI.Evaluation. Evaluators implement the IAgentEvaluator interface, and orchestration is provided through extension methods on AIAgent and Run.

The core types live in the Microsoft.Agents.AI namespace:

using Microsoft.Agents.AI;

In Python, the evaluation framework is part of the core agent_framework package. Evaluators implement the Evaluator protocol, and orchestration is provided through evaluate_agent() and evaluate_workflow() functions.

from agent_framework import (
    evaluate_agent,
    evaluate_workflow,
    EvalItem,
    EvalResults,
    LocalEvaluator,
)

Local evaluators

LocalEvaluator runs checks locally without API calls — ideal for inner-loop development, CI smoke tests, and fast iteration. It accepts any number of check functions and applies each one to every item.

Built-in checks

Agent Framework ships with built-in checks for common scenarios:

using Microsoft.Agents.AI;

var local = new LocalEvaluator(
    EvalChecks.KeywordCheck("weather", "temperature"),  // Response must contain these keywords
    EvalChecks.ToolCalledCheck("get_weather")            // Agent must have called this tool
);

Custom function evaluators

Use FunctionEvaluator.Create() to wrap any function as an evaluator check. Multiple overloads are available depending on what data you need:

using Microsoft.Agents.AI;

var local = new LocalEvaluator(
    // Simple: check only the response text
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split(' ').Length < 500),

    // With expected output: compare against ground truth
    FunctionEvaluator.Create("mentions_city",
        (string response, string? expectedOutput) =>
            expectedOutput != null && response.Contains(expectedOutput, StringComparison.OrdinalIgnoreCase)),

    // Full context: access the complete EvalItem
    FunctionEvaluator.Create("used_search",
        (EvalItem item) => item.Conversation.Any(m =>
            m.Text?.Contains("search", StringComparison.OrdinalIgnoreCase) == true))
);

Built-in checks

Agent Framework ships with built-in checks for common scenarios:

  • keyword_check(*keywords) — Response must contain all specified keywords.
  • tool_called_check(*tool_names) — Agent must have called the specified tools.
  • tool_calls_present — All expected_tool_calls names appear in the conversation (unordered, extras OK).
  • tool_call_args_match — Expected tool calls match on name and arguments (subset match on args), as shown in the sketch after the following example.

from agent_framework import (
    LocalEvaluator,
    keyword_check,
    tool_called_check,
    tool_calls_present,
    tool_call_args_match,
)

local = LocalEvaluator(
    keyword_check("weather", "temperature"),  # Response must contain these keywords
    tool_called_check("get_weather"),          # Agent must have called this tool
    tool_calls_present,                        # All expected tool call names were made
    tool_call_args_match,                      # Expected tool calls match on name + args
)
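
The args-matching checks pair with expected tool calls that you supply at evaluation time. A minimal sketch (the get_weather arguments are illustrative, and my_agent and local come from the surrounding examples):

from agent_framework import ExpectedToolCall, evaluate_agent

# tool_call_args_match passes when an actual call matches each expected
# call by name and the expected arguments are a subset of the actual ones.
expected = [ExpectedToolCall("get_weather", {"location": "Seattle"})]

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    expected_tool_calls=expected,
    evaluators=local,
)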

Custom function evaluators

Use the @evaluator decorator to wrap any function as an evaluator check. The function's parameter names determine what data it receives from the EvalItem:

from agent_framework import evaluator, LocalEvaluator

@evaluator
def is_concise(response: str) -> bool:
    """Check response is under 500 words."""
    return len(response.split()) < 500

@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    """Check response contains the expected city name."""
    return expected_output.lower() in response.lower()

@evaluator
def used_tools(conversation: list, tools: list) -> float:
    """Score based on tool usage. Returns 0.0–1.0 (>= 0.5 passes)."""
    tool_calls = [c for m in conversation for c in (m.contents or []) if c.type == "function_call"]
    return min(len(tool_calls) / max(len(tools), 1), 1.0)

local = LocalEvaluator(is_concise, mentions_city, used_tools)

Supported parameter names: query, response, expected_output, expected_tool_calls, conversation, tools, context.

Return types: bool, float (≥ 0.5 = pass), dict with score or passed key, or CheckResult. Async functions are handled automatically.
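
For example, a check can return a dict or be declared async; a short sketch of both shapes:

from agent_framework import evaluator

@evaluator
def scored_length(response: str) -> dict:
    """Dict results carry a score or passed key; score >= 0.5 counts as a pass."""
    return {"score": min(len(response.split()) / 100, 1.0)}

@evaluator
async def not_empty(response: str) -> bool:
    """Async check functions are awaited automatically."""
    return bool(response.strip())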

Azure AI Foundry evaluators

FoundryEvals connects to Azure AI Foundry's evaluation service for cloud-based LLM-as-judge evaluation. Results are viewable in the Foundry portal with dashboards and comparison views.

using Microsoft.Agents.AI.AzureAI;

var foundry = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

By default, FoundryEvals runs relevance, coherence, and task adherence evaluators. When items contain tool definitions, it automatically adds tool call accuracy.
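
To rely on the defaults, construct FoundryEvals without an evaluators argument:

from agent_framework_azure_ai import FoundryEvals

# Runs relevance, coherence, and task adherence by default;
# tool call accuracy is added when items carry tool definitions.
evals = FoundryEvals(project_client=project_client, model_deployment="gpt-4o")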

Available evaluators

FoundryEvals provides constants for all built-in evaluator names:

  • Agent behavior — intent_resolution, task_adherence, task_completion, task_navigation_efficiency
  • Tool usage — tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization, tool_call_success
  • Quality — coherence, fluency, relevance, groundedness, response_completeness, similarity
  • Safety — violence, sexual, self_harm, hate_unfairness

Note

FoundryEvals requires an Azure AI Foundry project with an AI model deployment. The model_deployment parameter specifies which model to use as the LLM judge.
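
Evaluators from different categories can be combined in a single run. A sketch mixing quality and safety, assuming the remaining constants follow the same uppercase naming as RELEVANCE and COHERENCE:

from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[
        FoundryEvals.GROUNDEDNESS,  # quality: response grounded in context
        FoundryEvals.VIOLENCE,      # safety: violent-content detection
    ],
)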

Evaluate an agent

The simplest evaluation scenario runs an agent against test queries and scores the responses. Provide multiple diverse queries for statistically meaningful evaluation.

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.AzureAI;

var foundry = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[]
    {
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    },
    foundry);

results.AssertAllPassed();  // Throws if any item failed

EvaluateAsync is an extension method on AIAgent. It runs the agent once per query, converts each interaction to an EvalItem, and passes the batch to the evaluator.

from agent_framework import evaluate_agent
from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=my_agent,
    queries=[
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    ],
    evaluators=evals,
)

for r in results:
    print(f"{r.provider}: {r.passed}/{r.total}")
    r.assert_passed()  # Raises AssertionError if any item failed

evaluate_agent runs the agent once per query, converts each interaction to an EvalItem, and passes the batch to the evaluator. It returns one EvalResults per evaluator provider.

Measure consistency with repetitions

Run each query multiple times to detect non-deterministic behavior:

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    foundry,
    numRepetitions: 3);  // Each query runs 3 times independently
// Results contain 3 items (1 query × 3 repetitions)

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    evaluators=evals,
    num_repetitions=3,  # Each query runs 3 times independently
)
# Results contain 3 items (1 query × 3 repetitions)
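
Because each repetition becomes its own item, the aggregate pass rate doubles as a rough consistency signal; a sketch using the counts shown earlier:

for r in results:
    # 3/3 means the agent passed on every repetition; anything lower
    # indicates non-deterministic behavior worth investigating.
    print(f"{r.provider}: pass rate {r.passed}/{r.total}")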

Evaluate with expected outputs

Provide ground-truth expected answers to evaluate correctness. Expected outputs are paired positionally with queries:

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's 2+2?", "Capital of France?" },
    foundry,
    expectedOutput: new[] { "4", "Paris" });

You can also specify expected tool calls:

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in NYC?" },
    new LocalEvaluator(EvalChecks.ToolCalledCheck("get_weather")),
    expectedToolCalls: new[]
    {
        new[] { new ExpectedToolCall("get_weather") },
    });

from agent_framework import evaluate_agent, ExpectedToolCall

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's 2+2?", "Capital of France?"],
    expected_output=["4", "Paris"],
    evaluators=evals,
)

You can also specify expected tool calls:

results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in NYC?"],
    expected_tool_calls=[ExpectedToolCall("get_weather", {"location": "NYC"})],
    evaluators=local,
)

Evaluate pre-existing responses

When you already have agent responses from logs or previous runs, evaluate them directly without re-running the agent:

var response = await agent.RunAsync(new[] { new ChatMessage(ChatRole.User, "What's the weather?") });

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { response },
    new[] { "What's the weather?" },
    foundry);

from agent_framework import Message, evaluate_agent

response = await agent.run([Message("user", ["What's the weather?"])])

results = await evaluate_agent(
    agent=agent,
    responses=response,
    queries="What's the weather?",
    evaluators=evals,
)
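
Batches of logged responses can be scored the same way; a sketch assuming responses and queries also accept parallel lists, paired positionally like the other overloads:

responses = [logged_response_1, logged_response_2]  # e.g., loaded from logs

results = await evaluate_agent(
    agent=agent,
    responses=responses,
    queries=["What's the weather?", "Plan a trip to Paris"],
    evaluators=evals,
)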

Conversation split strategies

Multi-turn conversations must be split into query and response halves for evaluation. How you split determines what you're evaluating.

  • Last turn (default) — Split at the last user message: everything up to it is query context; everything after is the response. Best for response quality at a specific point.
  • Full — First user message is the query; the entire remainder is the response. Best for task completion and overall trajectory.
  • Per-turn — Each user→assistant exchange is scored independently with cumulative context. Best for fine-grained analysis.

// Full conversation as context
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "Plan a 3-day trip to Paris" },
    foundry,
    splitter: ConversationSplitters.Full);

// Per-turn: each exchange scored independently
var items = EvalItem.PerTurnItems(conversation);
var perTurnResults = await evaluator.EvaluateAsync(items);

You can also implement a custom splitter by implementing IConversationSplitter:

public class SplitBeforeToolCall : IConversationSplitter
{
    public (IReadOnlyList<ChatMessage> QueryMessages, IReadOnlyList<ChatMessage> ResponseMessages) Split(
        IReadOnlyList<ChatMessage> conversation)
    {
        // Custom split logic
        for (int i = 0; i < conversation.Count; i++)
        {
            if (conversation[i].Text?.Contains("tool_call") == true)
                return (conversation.Take(i).ToList(), conversation.Skip(i).ToList());
        }
        return ConversationSplitters.LastTurn.Split(conversation);
    }
}

from agent_framework import evaluate_agent, ConversationSplit

# Full conversation as context
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.FULL,
)

# Per-turn: each exchange scored independently
from agent_framework import EvalItem

items = EvalItem.per_turn_items(conversation)
# Pass items directly to an evaluator
per_turn_results = await evaluator.evaluate(items)

You can also provide a custom splitter — any callable that takes a conversation and returns (query_messages, response_messages):

def split_before_memory(conversation):
    """Split just before a memory-retrieval tool call."""
    for i, msg in enumerate(conversation):
        for c in msg.contents or []:
            if c.type == "function_call" and c.name == "retrieve_memory":
                return conversation[:i], conversation[i:]
    # Fallback to default
    return EvalItem._split_last_turn_static(conversation)

results = await evaluate_agent(
    agent=agent,
    queries=queries,
    evaluators=evals,
    conversation_split=split_before_memory,
)

Evaluate workflows

Evaluate multi-agent workflows with per-agent breakdown. The framework extracts each sub-agent's interactions and evaluates them individually, along with the workflow's overall output.

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.AzureAI;

Run run = await workflowRunner.RunAsync(workflow, "Plan a trip to Paris");

AgentEvaluationResults results = await run.EvaluateAsync(
    new FoundryEvals(chatConfiguration, FoundryEvals.Relevance));

Console.WriteLine($"Overall: {results.Passed}/{results.Total}");

// Per-agent breakdown
if (results.SubResults != null)
{
    foreach (var (name, sub) in results.SubResults)
    {
        Console.WriteLine($"  {name}: {sub.Passed}/{sub.Total}");
    }
}

results.AssertAllPassed();

from agent_framework import evaluate_workflow
from agent_framework_azure_ai import FoundryEvals

evals = FoundryEvals(project_client=project_client, model_deployment="gpt-4o")
result = await workflow.run("Plan a trip to Paris")

eval_results = await evaluate_workflow(
    workflow=workflow,
    workflow_result=result,
    evaluators=evals,
)

for r in eval_results:
    print(f"{r.provider}: {r.passed}/{r.total}")
    for name, sub in r.sub_results.items():
        print(f"  {name}: {sub.passed}/{sub.total}")

You can also pass queries directly and the framework will run the workflow for you:

eval_results = await evaluate_workflow(
    workflow=workflow,
    queries=["Plan a trip to Paris", "Book a flight to London"],
    evaluators=evals,
)

Mix multiple evaluators

Run local checks and cloud-based evaluators together in a single evaluation. Each evaluator produces its own EvalResults.

using Microsoft.Agents.AI;
using Microsoft.Agents.AI.AzureAI;

IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    evaluators: new IAgentEvaluator[]
    {
        new LocalEvaluator(
            EvalChecks.KeywordCheck("weather"),
            FunctionEvaluator.Create("is_helpful", (string r) => r.Split(' ').Length > 10)),
        new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence),
    });

// results[0] = local evaluator results
// results[1] = Foundry evaluator results
foreach (var r in results)
{
    Console.WriteLine($"{r.Provider}: {r.Passed}/{r.Total}");
}

from agent_framework import evaluate_agent, evaluator, LocalEvaluator, keyword_check
from agent_framework_azure_ai import FoundryEvals

@evaluator
def is_helpful(response: str) -> bool:
    return len(response.split()) > 10

foundry = FoundryEvals(
    project_client=project_client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=agent,
    queries=["What's the weather in Seattle?"],
    evaluators=[
        LocalEvaluator(is_helpful, keyword_check("weather")),
        foundry,
    ],
)

# results[0] = local evaluator results
# results[1] = Foundry evaluator results
for r in results:
    print(f"{r.provider}: {r.passed}/{r.total}")

MEAI evaluators

The .NET evaluation framework integrates directly with Microsoft.Extensions.AI.Evaluation evaluators. Quality and safety evaluators from MEAI work without any adapter:

using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Safety;

// Quality evaluators
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather?" },
    new CompositeEvaluator(
        new RelevanceEvaluator(),
        new CoherenceEvaluator(),
        new GroundednessEvaluator()),
    chatConfiguration: new ChatConfiguration(evalClient));

// Safety evaluators
AgentEvaluationResults safetyResults = await agent.EvaluateAsync(
    new[] { "What's the weather?" },
    new ContentHarmEvaluator(),
    chatConfiguration: new ChatConfiguration(evalClient));

Tip

When using MEAI evaluators, provide a chatConfiguration parameter with a chat client configured for the evaluation model. This client is used by the LLM-as-judge evaluators to score responses.

Next steps