The Microsoft.Extensions.AI.Evaluation libraries simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps. Various metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results.
The evaluation libraries, which are built on top of the Microsoft.Extensions.AI abstractions, are composed of the following NuGet packages:
- 📦 Microsoft.Extensions.AI.Evaluation – Defines the core abstractions and types for supporting evaluation.
- 📦 Microsoft.Extensions.AI.Evaluation.Quality – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
- 📦 Microsoft.Extensions.AI.Evaluation.Safety – Contains evaluators, such as the ProtectedMaterialEvaluator and ContentHarmEvaluator, that use the Azure AI Foundry Evaluation service to perform evaluations.
- 📦 Microsoft.Extensions.AI.Evaluation.Reporting – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data.
- 📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure – Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an Azure Storage container.
- 📦 Microsoft.Extensions.AI.Evaluation.Console – A command-line tool for generating reports and managing evaluation data.
Test integration
The libraries are designed to integrate smoothly with existing .NET apps, letting you reuse your existing testing infrastructure and familiar syntax to evaluate intelligent apps. You can use any test framework (for example, MSTest, xUnit, or NUnit) and testing workflow (for example, Test Explorer, dotnet test, or a CI/CD pipeline). The libraries also support online evaluation of your application by publishing evaluation scores to telemetry and monitoring dashboards.
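For example, a quality evaluation can run as an ordinary unit test. The following is a minimal sketch using MSTest and a hypothetical GetChatClient() helper that you'd replace with your own IChatClient setup; the evaluator API shown (EvaluateAsync, ChatConfiguration, NumericMetric) reflects recent preview versions of the libraries and may differ slightly in the version you use.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class QualityTests
{
    [TestMethod]
    public async Task ResponseIsCoherent()
    {
        IChatClient chatClient = GetChatClient();

        var messages = new List<ChatMessage>
        {
            new(ChatRole.User, "What's the distance between the Earth and the Moon?")
        };
        ChatResponse response = await chatClient.GetResponseAsync(messages);

        // Evaluate the response with one of the built-in quality evaluators.
        IEvaluator evaluator = new CoherenceEvaluator();
        EvaluationResult result = await evaluator.EvaluateAsync(
            messages, response, new ChatConfiguration(chatClient));

        // Quality metrics are numeric scores; assert a minimum threshold.
        NumericMetric coherence =
            result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
        Assert.IsTrue(coherence.Value >= 4);
    }

    // Hypothetical placeholder: supply an IChatClient connected to your LLM endpoint.
    private static IChatClient GetChatClient() =>
        throw new NotImplementedException("Create and return your IChatClient here.");
}
```

Because the evaluation runs inside a regular test method, it shows up in Test Explorer and can run as part of dotnet test in a CI/CD pipeline like any other test.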
Comprehensive evaluation metrics
The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in quality and safety evaluators and the metrics they measure.
You can also add your own custom evaluators by implementing the IEvaluator interface.
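As a sketch of what that might look like, the following custom evaluator reports the word count of a response as a numeric metric without calling an LLM. The IEvaluator member signatures and the EvaluationResult and NumericMetric constructors shown here are assumptions based on recent preview versions and may not match your version exactly.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// A minimal custom evaluator that reports the response's word count as a metric.
public class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count whitespace-separated words in the response text.
        int wordCount = modelResponse.Text
            .Split([' ', '\n', '\r', '\t'], StringSplitOptions.RemoveEmptyEntries)
            .Length;

        var metric = new NumericMetric(WordCountMetricName, value: wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```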
Quality evaluators
Quality evaluators measure response quality. They use an LLM to perform the evaluation.
Evaluator type | Metric | Description |
---|---|---|
RelevanceEvaluator | Relevance | Evaluates how relevant a response is to a query |
CompletenessEvaluator | Completeness | Evaluates how comprehensive and accurate a response is |
RetrievalEvaluator | Retrieval | Evaluates performance in retrieving information for additional context |
FluencyEvaluator | Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability |
CoherenceEvaluator | Coherence | Evaluates the logical and orderly presentation of ideas |
EquivalenceEvaluator | Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query |
GroundednessEvaluator | Groundedness | Evaluates how well a generated response aligns with the given context |
RelevanceTruthAndCompletenessEvaluator† | Relevance (RTC), Truth (RTC), and Completeness (RTC) | Evaluates how relevant, truthful, and complete a response is |
†This evaluator is marked experimental.
Safety evaluators
Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
Evaluator type | Metric | Description |
---|---|---|
GroundednessProEvaluator | Groundedness Pro | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context |
ProtectedMaterialEvaluator | Protected Material | Evaluates a response for the presence of protected material |
UngroundedAttributesEvaluator | Ungrounded Attributes | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes |
HateAndUnfairnessEvaluator† | Hate And Unfairness | Evaluates a response for the presence of content that's hateful or unfair |
SelfHarmEvaluator† | Self Harm | Evaluates a response for the presence of content that indicates self harm |
ViolenceEvaluator† | Violence | Evaluates a response for the presence of violent content |
SexualEvaluator† | Sexual | Evaluates a response for the presence of sexual content |
CodeVulnerabilityEvaluator | Code Vulnerability | Evaluates a response for the presence of vulnerable code |
IndirectAttackEvaluator | Indirect Attack | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering |
†In addition, the ContentHarmEvaluator provides single-shot evaluation for the four metrics supported by HateAndUnfairnessEvaluator, SelfHarmEvaluator, ViolenceEvaluator, and SexualEvaluator.
Cached responses
The reporting library supports response caching, which means that responses from the AI model are persisted in a cache. In subsequent runs, if the request parameters (prompt and model) are unchanged, responses are served from the cache, enabling faster execution and lower cost.
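For example, in the disk-based reporting setup, response caching is typically turned on when you create the reporting configuration. The following sketch assumes a hypothetical GetChatClient() helper, and the parameter names (storageRootPath, enableResponseCaching) and ScenarioRun members shown are based on recent preview versions and may differ in the version you use.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

IChatClient chatClient = GetChatClient();

// Create a reporting configuration that stores evaluation results (and cached
// LLM responses) on disk, with response caching enabled.
ReportingConfiguration reportingConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",
    evaluators: [new CoherenceEvaluator(), new FluencyEvaluator()],
    chatConfiguration: new ChatConfiguration(chatClient),
    enableResponseCaching: true);

// Each evaluated scenario is tracked as a scenario run.
await using ScenarioRun scenarioRun =
    await reportingConfiguration.CreateScenarioRunAsync("Distance to the Moon");

var messages = new List<ChatMessage>
{
    new(ChatRole.User, "What's the distance between the Earth and the Moon?")
};

// Use the scenario run's chat client so the response participates in caching.
ChatResponse response =
    await scenarioRun.ChatConfiguration!.ChatClient.GetResponseAsync(messages);

// Run all configured evaluators and store the results for reporting.
EvaluationResult result = await scenarioRun.EvaluateAsync(messages, response);

// Hypothetical placeholder: supply an IChatClient connected to your LLM endpoint.
static IChatClient GetChatClient() =>
    throw new NotImplementedException("Create and return your IChatClient here.");
```

Because the scenario run's chat client wraps your client with caching, rerunning the same scenario with an unchanged prompt and model serves the response from the cache instead of calling the model again.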
Reporting
The library contains support for storing evaluation results and generating reports. The following image shows an example report in an Azure DevOps pipeline:
The dotnet aieval tool, which ships as part of the Microsoft.Extensions.AI.Evaluation.Console package, includes functionality for generating reports and managing the stored evaluation data and cached responses. For more information, see Generate a report.
Configuration
The libraries are designed to be flexible: you can pick just the components you need. For example, you can disable response caching or tailor reporting to work best in your environment. You can also customize your evaluations, for example, by adding custom metrics and reporting options.
Samples
For a more comprehensive tour of the functionality and APIs available in the Microsoft.Extensions.AI.Evaluation libraries, see the API usage examples (dotnet/ai-samples repo). These examples are structured as a collection of unit tests. Each unit test showcases a specific concept or API and builds on the concepts and APIs showcased in previous unit tests.