The Microsoft.Extensions.AI.Evaluation libraries

The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluating the quality and safety of responses generated by AI models in .NET intelligent apps. Various quality metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Safety metrics measure aspects like hate and unfairness, violence, and sexual content. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results.

The evaluation libraries, which build on the Microsoft.Extensions.AI abstractions, are composed of the following NuGet packages:

📦 Microsoft.Extensions.AI.Evaluation – Defines the core abstractions and types for supporting evaluation.
📦 Microsoft.Extensions.AI.Evaluation.NLP – Contains evaluators that evaluate the similarity of an LLM's response text to one or more reference responses using natural language processing (NLP) metrics. These evaluators aren't LLM or AI-based; they use traditional NLP techniques such as text tokenization and n-gram analysis to evaluate text similarity.
📦 Microsoft.Extensions.AI.Evaluation.Quality – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
📦 Microsoft.Extensions.AI.Evaluation.Safety – Contains evaluators, such as the ProtectedMaterialEvaluator and ContentHarmEvaluator, that use the Microsoft Foundry Evaluation service to perform evaluations.
📦 Microsoft.Extensions.AI.Evaluation.Reporting – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data.
📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure – Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
📦 Microsoft.Extensions.AI.Evaluation.Console – A command-line tool for generating reports and managing evaluation data.

Test integration

The libraries integrate smoothly with existing .NET apps, letting you use existing testing infrastructure and familiar syntax to evaluate intelligent apps. You can use any test framework (for example, MSTest, xUnit, or NUnit) and any testing workflow (for example, Test Explorer, dotnet test, or a CI/CD pipeline). The libraries also support online evaluation of your application by publishing evaluation scores to telemetry and monitoring dashboards.

Comprehensive evaluation metrics

The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in quality, NLP, and safety evaluators and the metrics they measure.

To add your own evaluations, implement the IEvaluator interface.

Quality evaluators

Quality evaluators measure response quality. They use an LLM to perform the evaluation.

Evaluator type	Metric	Description
RelevanceEvaluator	`Relevance`	Evaluates how relevant a response is to a query
CompletenessEvaluator	`Completeness`	Evaluates how comprehensive and accurate a response is
RetrievalEvaluator	`Retrieval`	Evaluates performance in retrieving information for additional context
FluencyEvaluator	`Fluency`	Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability
CoherenceEvaluator	`Coherence`	Evaluates the logical and orderly presentation of ideas
EquivalenceEvaluator	`Equivalence`	Evaluates the similarity between the generated text and its ground truth with respect to a query
GroundednessEvaluator	`Groundedness`	Evaluates how well a generated response aligns with the given context
RelevanceTruthAndCompletenessEvaluator†	`Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)`	Evaluates how relevant, truthful, and complete a response is
IntentResolutionEvaluator	`Intent Resolution`	Evaluates an AI system's effectiveness at identifying and resolving user intent (agent-focused)
TaskAdherenceEvaluator	`Task Adherence`	Evaluates an AI system's effectiveness at adhering to the task assigned to it (agent-focused)
ToolCallAccuracyEvaluator	`Tool Call Accuracy`	Evaluates an AI system's effectiveness at using the tools supplied to it (agent-focused)

† This evaluator is marked experimental.

NLP evaluators

NLP evaluators evaluate the quality of an LLM response by comparing it to a reference response using natural language processing (NLP) techniques. These evaluators aren't LLM or AI-based; instead, they use older NLP techniques to perform text comparisons.

Evaluator type	Metric	Description
BLEUEvaluator	`BLEU`	Evaluates a response by comparing it to one or more reference responses using the bilingual evaluation understudy (BLEU) algorithm. This algorithm is commonly used to evaluate the quality of machine-translation or text-generation tasks.
GLEUEvaluator	`GLEU`	Measures the similarity between the generated response and one or more reference responses using the Google BLEU (GLEU) algorithm, a variant of the BLEU algorithm that's optimized for sentence-level evaluation.
F1Evaluator	`F1`	Evaluates a response by comparing it to a reference response using the F1 scoring algorithm (the harmonic mean of precision and recall, computed over the words shared between the generated response and the reference response).
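
The word-overlap and n-gram ideas behind these metrics can be illustrated with a small, language-neutral sketch. The Python below is not the library's implementation: the function names (`tokenize`, `f1_score`, `ngram_precision`) are hypothetical, and the lowercase whitespace tokenizer is an assumption made for brevity. It shows a token-level F1 score and the clipped n-gram precision that BLEU-style metrics build on.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, whitespace-separated tokens. A real evaluator
    # may tokenize differently (punctuation handling, normalization, etc.).
    return text.lower().split()


def f1_score(response: str, reference: str) -> float:
    """Harmonic mean of precision and recall over shared tokens."""
    response_tokens = tokenize(response)
    reference_tokens = tokenize(reference)
    # Multiset intersection: a word counts as shared only as many times
    # as it appears in both texts.
    shared = sum((Counter(response_tokens) & Counter(reference_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)  # shared / words in response
    recall = shared / len(reference_tokens)    # shared / words in reference
    return 2 * precision * recall / (precision + recall)


def ngram_precision(response: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision, a building block of BLEU-style metrics."""

    def ngrams(tokens: list[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    response_ngrams = ngrams(tokenize(response))
    reference_ngrams = ngrams(tokenize(reference))
    total = sum(response_ngrams.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count to its count in the reference.
    matched = sum((response_ngrams & reference_ngrams).values())
    return matched / total
```

Both functions return a value between 0 and 1, where higher means the response is closer to the reference. Full BLEU and GLEU add further steps (multiple n-gram orders, multiple references, length penalties) that this sketch leaves out.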

Safety evaluators

Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.

Evaluator type	Metric	Description
GroundednessProEvaluator	`Groundedness Pro`	Uses a fine-tuned model hosted behind the Foundry Evaluation service to evaluate how well a generated response aligns with the given context
ProtectedMaterialEvaluator	`Protected Material`	Evaluates a response for the presence of protected material
UngroundedAttributesEvaluator	`Ungrounded Attributes`	Evaluates a response for the presence of content that indicates ungrounded inference of human attributes
HateAndUnfairnessEvaluator†	`Hate And Unfairness`	Evaluates a response for the presence of content that's hateful or unfair
SelfHarmEvaluator†	`Self Harm`	Evaluates a response for the presence of content that indicates self-harm
ViolenceEvaluator†	`Violence`	Evaluates a response for the presence of violent content
SexualEvaluator†	`Sexual`	Evaluates a response for the presence of sexual content
CodeVulnerabilityEvaluator	`Code Vulnerability`	Evaluates a response for the presence of vulnerable code
IndirectAttackEvaluator	`Indirect Attack`	Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering

† In addition, the ContentHarmEvaluator provides single-shot evaluation for the four metrics supported by HateAndUnfairnessEvaluator, SelfHarmEvaluator, ViolenceEvaluator, and SexualEvaluator.

Cached responses

The library persists responses from the AI model in a cache. In subsequent runs, if the request parameters (such as the prompt and model) are unchanged, responses are served from the cache instead of calling the model again, which makes runs faster and cheaper.
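
The caching idea can be sketched in a few lines. The Python below is a conceptual illustration only, not the library's implementation or storage format: `ResponseCache`, `make_key`, and `get_or_add` are hypothetical names, the store is an in-memory dictionary rather than a persistent one, and the key is assumed to depend only on the model name and the prompt.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """In-memory stand-in for a persistent response cache (hypothetical)."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(model: str, prompt: str) -> str:
        # Serialize the request parameters deterministically, then hash them,
        # so any change to the model or the prompt produces a different key.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_add(
        self, model: str, prompt: str, call_model: Callable[[str, str], str]
    ) -> str:
        key = self.make_key(model, prompt)
        if key in self._store:
            # Unchanged request: serve the stored response, skip the model call.
            self.hits += 1
            return self._store[key]
        # New or changed request: call the model once and remember the result.
        self.misses += 1
        response = call_model(model, prompt)
        self._store[key] = response
        return response
```

With this shape, repeating a request with the same model and prompt returns the stored response without invoking `call_model`, while changing either parameter results in a fresh call.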

Reporting

The library supports storing evaluation results and generating reports. The following image shows an example report in an Azure DevOps pipeline:

The dotnet aieval tool, which ships as part of the Microsoft.Extensions.AI.Evaluation.Console package, includes functionality for generating reports and managing the stored evaluation data and cached responses. For more information, see Generate a report.

Configuration

The libraries are flexible, and you can pick the components you need. For example, you can disable response caching or tailor reporting to work best in your environment. You can also customize and configure your evaluations, for example, by adding customized metrics and reporting options.

Samples

For a more comprehensive tour of the functionality and APIs in the Microsoft.Extensions.AI.Evaluation libraries, see the API usage examples (dotnet/ai-samples repo). These examples are a collection of unit tests. Each unit test showcases a specific concept or API and builds on the concepts and APIs showcased in previous unit tests.

Last updated on 2026-04-09