The Microsoft.Extensions.AI.Evaluation libraries (Preview)

The Microsoft.Extensions.AI.Evaluation libraries (currently in preview) simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps. Various metrics measure aspects like relevance, truthfulness, coherence, and completeness of the responses. Evaluations are crucial in testing, because they help ensure that the AI model performs as expected and provides reliable and accurate results.

The evaluation libraries, which are built on top of the Microsoft.Extensions.AI abstractions, are composed of several NuGet packages, including:

- Microsoft.Extensions.AI.Evaluation – Defines the core abstractions and types for supporting evaluation.
- Microsoft.Extensions.AI.Evaluation.Quality – Contains evaluators that assess the quality of AI responses, such as relevance and coherence.
- Microsoft.Extensions.AI.Evaluation.Reporting – Contains support for caching AI responses, storing evaluation results, and generating reports from that data.
- Microsoft.Extensions.AI.Evaluation.Console – A command-line tool for generating reports and managing evaluation data.

Test integration

The libraries are designed to integrate smoothly with existing .NET apps, allowing you to leverage existing testing infrastructures and familiar syntax to evaluate intelligent apps. You can use any test framework (for example, MSTest, xUnit, or NUnit) and testing workflow (for example, Test Explorer, dotnet test, or a CI/CD pipeline). The library also provides easy ways to do online evaluations of your application by publishing evaluation scores to telemetry and monitoring dashboards.
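For example, a quality check can run as an ordinary unit test. The following MSTest sketch assumes you already have an IChatClient configured for your model provider; the TestSetup.CreateChatClient helper, the prompt, and the score threshold are placeholders, and the preview APIs may change between releases:

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class QualityTests
{
    // Assumed helper: returns an IChatClient configured for your model provider.
    private static readonly IChatClient s_chatClient = TestSetup.CreateChatClient();

    [TestMethod]
    public async Task ResponseIsCoherent()
    {
        var chatConfiguration = new ChatConfiguration(s_chatClient);

        string prompt = "Explain what a binary search tree is.";
        ChatResponse response = await s_chatClient.GetResponseAsync(prompt);

        // Evaluate the response with one of the built-in quality evaluators.
        IEvaluator coherenceEvaluator = new CoherenceEvaluator();
        EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
            new ChatMessage(ChatRole.User, prompt),
            response,
            chatConfiguration);

        // Coherence is reported as a numeric metric; fail the test if it's too low.
        NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
        Assert.IsTrue(coherence.Value >= 4, "The response was not coherent enough.");
    }
}
```

Because the evaluation runs inside a normal test method, it shows up in Test Explorer, dotnet test output, and CI/CD results like any other test.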

Comprehensive evaluation metrics

The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.

| Metric | Description | Evaluator type |
|--------|-------------|----------------|
| Relevance, truth, and completeness | How effectively a response addresses a query | RelevanceTruthAndCompletenessEvaluator |
| Fluency | Grammatical accuracy, vocabulary range, sentence complexity, and overall readability | FluencyEvaluator |
| Coherence | The logical and orderly presentation of ideas | CoherenceEvaluator |
| Equivalence | The similarity between the generated text and its ground truth with respect to a query | EquivalenceEvaluator |
| Groundedness | How well a generated response aligns with the given context | GroundednessEvaluator |

You can also add your own custom evaluations by implementing the IEvaluator interface or by extending base classes such as ChatConversationEvaluator and SingleNumericMetricEvaluator.
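As a rough sketch, a custom evaluator that doesn't call an LLM at all might compute a simple numeric metric directly from the response text. The metric name and measurement logic below are made up for illustration, and the IEvaluator member signatures reflect the current preview and may change:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative evaluator that reports the response's word count as a metric.
public class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames { get; } =
        new[] { WordCountMetricName };

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // No LLM call is needed; the metric is computed from the text alone.
        int wordCount = modelResponse.Text
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Length;

        var metric = new NumericMetric(WordCountMetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```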

Cached responses

The library provides response caching functionality, which means responses from the AI model are persisted in a cache. In subsequent runs, if the request parameters (prompt and model) are unchanged, responses are served from the cache, enabling faster execution and lower cost.
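Caching is typically turned on through the reporting configuration. The following sketch assumes a disk-based setup from the Microsoft.Extensions.AI.Evaluation.Reporting package; the storage path, scenario name, and TestSetup.CreateChatClient helper are placeholders, and the preview API surface may differ slightly between versions:

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Reporting;
using Microsoft.Extensions.AI.Evaluation.Reporting.Storage;

// Assumed helper: returns an IChatClient configured for your model provider.
IChatClient chatClient = TestSetup.CreateChatClient();

// Store evaluation results and cached model responses on disk.
ReportingConfiguration reportingConfiguration = DiskBasedReportingConfiguration.Create(
    storageRootPath: "./eval-results",                        // placeholder path
    evaluators: new IEvaluator[] { new CoherenceEvaluator() },
    chatConfiguration: new ChatConfiguration(chatClient),
    enableResponseCaching: true);

await using ScenarioRun scenarioRun =
    await reportingConfiguration.CreateScenarioRunAsync("MyScenario");

// Requests made through the scenario run's chat client are cached; if the
// prompt and model are unchanged, later runs are served from the cache.
string prompt = "Explain what a binary search tree is.";
ChatResponse response =
    await scenarioRun.ChatConfiguration!.ChatClient.GetResponseAsync(prompt);

EvaluationResult result = await scenarioRun.EvaluateAsync(
    new ChatMessage(ChatRole.User, prompt),
    response);
```

If you don't want caching for a particular configuration, the same factory exposes an enableResponseCaching flag that you can set to false.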

Reporting

The library contains support for storing evaluation results and generating reports. The following image shows an example report in an Azure DevOps pipeline:

Screenshot of an AI evaluation report in an Azure DevOps pipeline.

The dotnet aieval tool, which ships as part of the Microsoft.Extensions.AI.Evaluation.Console package, also includes functionality for generating reports and managing the stored evaluation data and cached responses.

Configuration

The libraries are designed to be flexible, so you can pick just the components you need. For example, you can disable response caching or tailor reporting to work best in your environment. You can also customize and configure your evaluations, for example by adding custom metrics and reporting options.

Samples

For a more comprehensive tour of the functionality and APIs available in the Microsoft.Extensions.AI.Evaluation libraries, see the API usage examples (dotnet/ai-samples repo). These examples are structured as a collection of unit tests. Each unit test showcases a specific concept or API and builds on the concepts and APIs showcased in previous unit tests.

See also