Tutorial: Evaluate an LLM's prompt completions
In this tutorial, you evaluate the coherence, relevance, and groundedness of an LLM's prompt completions using Azure OpenAI and the Semantic Kernel SDK for .NET.
In this tutorial, you learn how to:
- Clone and build the evaluation application
- Configure the models
- Generate evaluation test data
- Perform an evaluation of your LLM
- Review the results of an evaluation
If you don't have an Azure subscription, create a free account before you begin.
Get the source for the evaluation application and ensure it can be built.
Clone the repository dotnet/ai-samples.
From a terminal or command prompt, navigate to the `ai-samples/src/llm-eval` directory.

Build the evaluation application:

```dotnetcli
dotnet build .
```
Set the model to be tested and the models to perform evaluations and generate test data.
It's best to use a GPT-4 model for performing evaluation. You can use an Azure OpenAI resource, an OpenAI instance, or any LLM supported by the Semantic Kernel SDK. This article uses a GPT-4 model deployed to an Azure OpenAI resource for evaluations.
The `KernelFactory` class (`src/LLMEval.Test/KernelFactory.cs`) creates the kernels used for evaluations, for generating test data, and for the LLM being tested.

The evaluation application tests the model that the `KernelFactory.CreateKernelTest` method returns.
The Semantic Kernel SDK can integrate any model that supports the OpenAI Chat Completion API.
Update the `KernelFactory.CreateKernelTest` method to return a `Kernel` object that uses the model to be tested. For example, the following code creates a `Kernel` object that uses a Phi-3 model deployed and hosted locally using Ollama:
```csharp
public static Kernel CreateKernelTest()
{
    IKernelBuilder builder = Kernel.CreateBuilder();

    // Ollama exposes an OpenAI-compatible chat completion endpoint on localhost.
    builder.AddOpenAIChatCompletion(
        modelId: "phi3",
        endpoint: new Uri("http://localhost:11434"),
        apiKey: "api"
    );

    return builder.Build();
}
```
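If the model under test is itself an Azure OpenAI deployment, `CreateKernelTest` can follow the same pattern as the evaluation kernel configured later in this tutorial. The following is a minimal sketch that assumes the Azure OpenAI user secrets set in the next step, plus a hypothetical `AZURE_OPENAI_MODEL_TEST` secret naming the deployment under test (that secret name isn't part of the sample):

```csharp
public static Kernel CreateKernelTest()
{
    // Read the Azure OpenAI settings from .NET user secrets.
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();

    IKernelBuilder builder = Kernel.CreateBuilder();

    // AZURE_OPENAI_MODEL_TEST is a hypothetical secret naming the deployment to test.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL_TEST"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}
```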
Use .NET user secrets to store the Azure OpenAI deployment info. From the `ai-samples/src/llm-eval/LLMEval.Test` directory, run the following commands:

```dotnetcli
dotnet user-secrets init
dotnet user-secrets set "AZURE_OPENAI_MODEL" "<deployment-name>"
dotnet user-secrets set "AZURE_OPENAI_ENDPOINT" "<deployment-endpoint>"
dotnet user-secrets set "AZURE_OPENAI_KEY" "<deployment-key>"
```
The evaluation application is configured to use these secrets to connect to an Azure OpenAI model to perform evaluations. You can update this configuration in the `KernelFactory.CreateKernelEval` method:
```csharp
public static Kernel CreateKernelEval()
{
    // Read the Azure OpenAI settings from .NET user secrets.
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();

    IKernelBuilder builder = Kernel.CreateBuilder();

    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}
```
The evaluation application is configured to use the secrets set in the previous step to connect to an Azure OpenAI model to generate test data. You can update this configuration in the `KernelFactory.CreateKernelGenerateData` method:
```csharp
public static Kernel CreateKernelGenerateData()
{
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();

    IKernelBuilder builder = Kernel.CreateBuilder();

    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}
```
The evaluation application compares an LLM's output to "ground truth" answers, which are ideal question-answer pairs. It's recommended to have at least 200 question-answer pairs for an evaluation.
You can use the evaluation application to generate an initial set of question-answer pairs. Then manually curate them by rewriting or removing any subpar answers.
Tips for generating test data:
- Generate more question-answer pairs than you need, then manually prune them based on quality and overlap. Remove low quality answers, and remove questions that are too similar to other questions.
- Be aware of the knowledge distribution so you effectively sample questions across the relevant knowledge space.
- Once your application is live, continually sample real user questions (in accordance with your privacy policy) to make sure you're representing the kinds of questions that users are asking.
From the `ai-samples/src/llm-eval/LLMEval.Test` directory, run the following command:

```dotnetcli
dotnet run .
```
Select Generate QAs associated to a topic, and export to json, then press Enter.
Enter the number of question-answer pairs to be generated and their topic.
A preview of the generated question-answer pairs in JSON format is shown; enter the path of the file to save the JSON to.
Review the output JSON, and update or remove any incorrect or subpar answers.
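The exact JSON schema for question-answer pairs is defined by the sample's `LLMEval.Core` types, so treat the following as a rough illustration only. It sketches a quick curation pass over the exported file using hypothetical property names (`Question`, `Answer`) and a placeholder file name:

```csharp
using System.Text.Json;

// Load the exported file; replace the path with wherever you saved the JSON.
string json = File.ReadAllText("generated-qa.json");
List<QaPair> pairs = JsonSerializer.Deserialize<List<QaPair>>(
    json,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true }) ?? [];

// Flag entries with missing or very short answers for manual review.
foreach (QaPair pair in pairs)
{
    if (string.IsNullOrWhiteSpace(pair.Answer) || pair.Answer.Length < 20)
    {
        Console.WriteLine($"Review this entry: {pair.Question}");
    }
}

// Hypothetical shape of one entry; the sample's actual schema may differ.
record QaPair(string Question, string Answer);
```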
Once you've curated the question-answer pairs, the evaluation application can use them to evaluate the outputs of the test model.
Copy the JSON file containing the question-answer pairs to `ai-samples/src/llm-eval/LLMEval.Test/assets/qa-02.json`.

From the `ai-samples/src/llm-eval/LLMEval.Test` directory, run the following command:

```dotnetcli
dotnet run .
```
Select List of QAs from a file, then press Enter.
The evaluation results are printed in a table format.
The evaluation results generated in the previous step include a coherence, relevance, and groundedness metric. These metrics are similar to the built-in metrics provided by Azure AI Studio.
- Coherence: Measures how well the language model can produce outputs that flow smoothly, read naturally, and resemble human-like language.
  - Based on `ai-samples/src/LLMEval.Core/_prompts/coherence/skprompt.txt`
- Relevance: Assesses the ability of answers to capture the key points of the context.
  - Based on `ai-samples/src/LLMEval.Core/_prompts/relevance/skprompt.txt`
- Groundedness: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
  - Based on `ai-samples/src/LLMEval.Core/_prompts/groundedness/skprompt.txt`
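Each metric is produced by prompting the evaluation model with the corresponding prompt template along with the question and answer being scored. The following is a minimal sketch of that pattern using the Semantic Kernel SDK; it isn't the sample's exact code, and the template variable names (`question`, `answer`) and relative prompt path are assumptions:

```csharp
using Microsoft.SemanticKernel;

Kernel kernel = KernelFactory.CreateKernelEval();

// Load one of the metric prompt templates, for example the coherence prompt.
string promptTemplate = File.ReadAllText("_prompts/coherence/skprompt.txt");
KernelFunction scoreFunction = kernel.CreateFunctionFromPrompt(promptTemplate);

// Pass the answer to be scored; the variable names here are illustrative.
FunctionResult result = await kernel.InvokeAsync(scoreFunction, new KernelArguments
{
    ["question"] = "What is Semantic Kernel?",
    ["answer"] = "Semantic Kernel is an SDK for integrating LLMs into .NET apps."
});

// The evaluation prompt returns a numeric score as text.
Console.WriteLine($"Coherence score: {result}");
```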
If you no longer need them, delete the Azure OpenAI resource and GPT-4 model deployment.
- In the Azure portal, navigate to the Azure OpenAI resource.
- Select the Azure OpenAI resource, then select Delete.