Tutorial: Evaluate an LLM's prompt completions

In this tutorial, you evaluate the coherence, relevance, and groundedness of an LLM's prompt completions using Azure OpenAI and the Semantic Kernel SDK for .NET.

Screenshot: Main UI of the evaluation application.

In this tutorial, you learn how to:

  • Clone and build the evaluation application
  • Configure the models
  • Generate evaluation test data
  • Perform an evaluation of your LLM
  • Review the results of an evaluation

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

  • An Azure subscription and an Azure OpenAI resource with a GPT-4 model deployment
  • The .NET SDK
  • Git, to clone the dotnet/ai-samples repository

1 - Clone the evaluation application

Get the source for the evaluation application and ensure it can be built.

  1. Clone the repository dotnet/ai-samples.

  2. From a terminal or command prompt, navigate to the ai-samples/src/llm-eval directory.

  3. Build the evaluation application:

    dotnet build .
    

2 - Configure the models

Set the model to be tested and the models to perform evaluations and generate test data.

It's best to use a GPT-4 model to perform the evaluations. You can use an Azure OpenAI resource, an OpenAI instance, or any LLM supported by the Semantic Kernel SDK. This article uses a GPT-4 model deployed to an Azure OpenAI resource for the evaluations.

The KernelFactory class (src/LLMEval.Test/KernelFactory.cs) creates the kernels for the model being tested, the model that performs evaluations, and the model that generates test data.
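
In outline, the factory exposes one static method per role. The placeholder bodies below just build empty kernels; the configured versions appear in the subsections that follow.

public static class KernelFactory
{
    // Kernel for the LLM whose prompt completions are being evaluated.
    public static Kernel CreateKernelTest() => Kernel.CreateBuilder().Build();

    // Kernel for the GPT-4 model that scores coherence, relevance, and groundedness.
    public static Kernel CreateKernelEval() => Kernel.CreateBuilder().Build();

    // Kernel for the model that generates question-answer test data.
    public static Kernel CreateKernelGenerateData() => Kernel.CreateBuilder().Build();
}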

Configure the model to test

The evaluation application tests the model that the KernelFactory.CreateKernelTest method returns.

The Semantic Kernel SDK can integrate any model that supports the OpenAI Chat Completion API.

Update the KernelFactory.CreateKernelTest method to return a Kernel object that uses the model to be tested. For example, the following code creates a Kernel object that uses a Phi-3 model deployed and hosted locally using Ollama:

public static Kernel CreateKernelTest()
{
    IKernelBuilder builder = Kernel.CreateBuilder();

    // Ollama serves an OpenAI-compatible chat completion API on port 11434.
    // Ollama doesn't validate the API key, so any placeholder value works.
    builder.AddOpenAIChatCompletion(
        modelId: "phi3",
        endpoint: new Uri("http://localhost:11434"),
        apiKey: "api"
    );

    return builder.Build();
}
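
If the model you want to test is instead deployed to Azure OpenAI, CreateKernelTest can follow the same user-secrets pattern shown in the next section. The following is a minimal sketch; the secret name AZURE_OPENAI_TEST_MODEL is hypothetical and should hold the deployment name of the model under test.

public static Kernel CreateKernelTest()
{
    // Read the deployment info from .NET user secrets (see the next section).
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // "AZURE_OPENAI_TEST_MODEL" is a hypothetical secret name for the deployment
    // of the model under test; set it with `dotnet user-secrets set`.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_TEST_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}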

Configure the model to perform evaluations

Use .NET user secrets to store the Azure OpenAI deployment info. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following commands:

dotnet user-secrets init
dotnet user-secrets set "AZURE_OPENAI_MODEL" "<deployment-name>"
dotnet user-secrets set "AZURE_OPENAI_ENDPOINT" "<deployment-endpoint>"
dotnet user-secrets set "AZURE_OPENAI_KEY" "<deployment-key>"
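
To confirm the values were saved, list the configured secrets:

dotnet user-secrets list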

The evaluation application is configured to use these secrets to connect to an Azure OpenAI model to perform evaluations. You can update this configuration in the KernelFactory.CreateKernelEval method:

public static Kernel CreateKernelEval()
{
    // Read the Azure OpenAI settings from the .NET user secrets store.
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // AZURE_OPENAI_MODEL holds the deployment name of the GPT-4 evaluation model.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}

Configure the model to generate test data

The evaluation application is configured to use the secrets set in the previous step to connect to an Azure OpenAI model to generate test data. You can update this configuration in the KernelFactory.CreateKernelGenerateData method:

public static Kernel CreateKernelGenerateData()
{
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}
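
Test data generation doesn't have to share a deployment with the evaluator. If you'd rather generate data with a different model, one option is to read a separate secret and fall back to the evaluation deployment when it isn't set. This is a sketch only; the secret name AZURE_OPENAI_GENERATION_MODEL is hypothetical and isn't part of the sample.

public static Kernel CreateKernelGenerateData()
{
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // "AZURE_OPENAI_GENERATION_MODEL" is a hypothetical secret; if it isn't set,
    // fall back to the deployment used for evaluations.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_GENERATION_MODEL"] ?? config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}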

3 - Generate test data

The evaluation application compares an LLM's output to "ground truth" answers, which are ideal question-answer pairs. It's recommended to have at least 200 question-answer pairs for an evaluation.

You can use the evaluation application to generate an initial set of question-answer pairs. Then manually curate them by rewriting or removing any subpar answers.

Tips for generating test data:

  • Generate more question-answer pairs than you need, then manually prune them based on quality and overlap. Remove low-quality answers, and remove questions that are too similar to other questions.
  • Be aware of the knowledge distribution so that you effectively sample questions across the relevant knowledge space.
  • Once your application is live, continually sample real user questions (in accordance with your privacy policy) to make sure you're representing the kinds of questions that users are asking.

  1. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following command:

    dotnet run .
    
  2. Select Generate QAs associated to a topic, and export to json, then press Enter.

    Screenshot: Scenario selection step of the evaluation application.

  3. Enter the number of question-answer pairs to be generated and their topic.

    Screenshot: Number and topic inputs for question-answer generation in the evaluation application.

  4. The application shows a preview of the generated question-answer pairs in JSON format. Enter the path of the file where the JSON should be saved.

    Screenshot: Output file input for question-answer generation in the evaluation application.

  5. Review the output JSON, and update or remove any incorrect or subpar answers. If you want to spot-check the file programmatically, see the sketch after these steps.
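
The following minimal sketch loads a curated file and prints each pair so that empty or obviously broken answers stand out. The QaPair property names and the file path are assumptions; match them to the JSON the application actually generated.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// "qa-pairs.json" is a placeholder; use the path you entered when exporting.
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var pairs = JsonSerializer.Deserialize<List<QaPair>>(
    File.ReadAllText("qa-pairs.json"), options) ?? new List<QaPair>();

foreach (QaPair pair in pairs)
{
    Console.WriteLine($"Q: {pair.Question}");
    Console.WriteLine($"A: {pair.Answer}");
    Console.WriteLine();
}

// Hypothetical shape of one question-answer pair; adjust the property names to
// match the generated JSON.
record QaPair(string Question, string Answer);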

4 - Perform an evaluation

Once you've curated the question-answer pairs, the evaluation application can use them to evaluate the outputs of the test model.

  1. Copy the JSON file containing the question-answer pairs to ai-samples/src/llm-eval/LLMEval.Test/assets/qa-02.json.

  2. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following command:

    dotnet run .
    
  3. Select List of QAs from a file, then press Enter.

    Screenshot: Step list of the evaluation application with 'List of QAs from a file' selected.

  4. The evaluation results are printed in a table format.

    Screenshot: Table showing the output of the evaluation application.

5 - Review the evaluation results

The evaluation results generated in the previous step include coherence, relevance, and groundedness metrics. These metrics are similar to the built-in metrics provided by Azure AI Studio.

  • Coherence: Measures how well the language model can produce outputs that flow smoothly, read naturally, and resemble human-like language.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/coherence/skprompt.txt
  • Relevance: Assesses the ability of answers to capture the key points of the context.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/relevance/skprompt.txt
  • Groundedness: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/groundedness/skprompt.txt
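
Each score is produced by prompting the evaluation model with the corresponding skprompt.txt template. If you want to experiment with one of these prompts directly, a minimal sketch using Semantic Kernel could look like the following. It assumes it runs inside the LLMEval.Test project, and the "question" and "answer" argument names are assumptions, so check the template for the exact placeholders it expects.

// Intended to run inside the LLMEval.Test project, where KernelFactory is defined.
Kernel kernel = KernelFactory.CreateKernelEval();

// Path is relative to the LLMEval.Test directory.
string template = File.ReadAllText("../LLMEval.Core/_prompts/coherence/skprompt.txt");

// "question" and "answer" are assumed template variable names; check skprompt.txt
// for the exact placeholders the sample uses.
FunctionResult result = await kernel.InvokePromptAsync(template, new KernelArguments
{
    ["question"] = "What is Semantic Kernel?",
    ["answer"] = "Semantic Kernel is an SDK for integrating LLMs into .NET applications."
});

Console.WriteLine(result);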

Clean up resources

If you no longer need them, delete the Azure OpenAI resource and GPT-4 model deployment.

  1. In the Azure portal, navigate to your Azure OpenAI resource.
  2. Select the Azure OpenAI resource, then select Delete.
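
If you prefer the command line, you can delete the resource with the Azure CLI instead. Azure OpenAI resources are Azure AI services (Cognitive Services) accounts, so the following command removes the resource and its model deployments; replace the placeholders with your resource and resource group names:

az cognitiveservices account delete --name <resource-name> --resource-group <resource-group-name>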