Tutorial: Evaluate an LLM's prompt completions

In this tutorial, you evaluate the coherence, relevance, and groundedness of an LLM's prompt completions using Azure OpenAI and the Semantic Kernel SDK for .NET.

Screenshot: Main UI of the evaluation application.

In this tutorial, you learn how to:

  • Clone and build the evaluation application
  • Configure the models
  • Generate evaluation test data
  • Perform an evaluation of your LLM
  • Review the results of an evaluation

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

  • An Azure subscription and an Azure OpenAI resource with a GPT-4 model deployment
  • The .NET SDK
  • Git, to clone the dotnet/ai-samples repository

1 - Clone the evaluation application

Get the source for the evaluation application and ensure it can be built.

  1. Clone the repository dotnet/ai-samples.

  2. From a terminal or command prompt, navigate to the ai-samples/src/llm-eval directory.

  3. Build the evaluation application:

    dotnet build .
    

2 - Configure the models

Set the model to be tested and the models to perform evaluations and generate test data.

It's best to use a GPT-4 model to perform the evaluations. You can use an Azure OpenAI resource, an OpenAI instance, or any LLM supported by the Semantic Kernel SDK. This article uses a GPT-4 model deployed to an Azure OpenAI resource for the evaluations.

The KernelFactory class (src/LLMEval.Test/KernelFactory.cs) creates the kernels for the model being tested, the model that performs evaluations, and the model that generates test data.
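
In outline, the factory exposes one static method per role. The placeholder bodies below just build empty kernels; the configured versions appear in the subsections that follow.

public static class KernelFactory
{
    // Kernel for the LLM whose prompt completions are being evaluated.
    public static Kernel CreateKernelTest() => Kernel.CreateBuilder().Build();

    // Kernel for the GPT-4 model that scores coherence, relevance, and groundedness.
    public static Kernel CreateKernelEval() => Kernel.CreateBuilder().Build();

    // Kernel for the model that generates question-answer test data.
    public static Kernel CreateKernelGenerateData() => Kernel.CreateBuilder().Build();
}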

Configure the model to test

The evaluation application tests the model that the KernelFactory.CreateKernelTest method returns.

The Semantic Kernel SDK can integrate any model that supports the OpenAI Chat Completion API.

Update the KernelFactory.CreateKernelTest method to return a Kernel object that uses the model to be tested. For example, the following code creates a Kernel object that uses a Phi-3 model deployed and hosted locally using Ollama:

public static Kernel CreateKernelTest()
{
    IKernelBuilder builder = Kernel.CreateBuilder();

    // Ollama serves an OpenAI-compatible chat completion API on port 11434.
    // Ollama doesn't validate the API key, so any placeholder value works.
    builder.AddOpenAIChatCompletion(
        modelId: "phi3",
        endpoint: new Uri("http://localhost:11434"),
        apiKey: "api"
    );

    return builder.Build();
}
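
If the model you want to test is instead deployed to Azure OpenAI, CreateKernelTest can follow the same user-secrets pattern shown in the next section. The following is a minimal sketch; the secret name AZURE_OPENAI_TEST_MODEL is hypothetical and should hold the deployment name of the model under test.

public static Kernel CreateKernelTest()
{
    // Read the deployment info from .NET user secrets (see the next section).
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // "AZURE_OPENAI_TEST_MODEL" is a hypothetical secret name for the deployment
    // of the model under test; set it with `dotnet user-secrets set`.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_TEST_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}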

Configure the model to perform evaluations

Use .NET user secrets to store the Azure OpenAI deployment info. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following commands:

dotnet user-secrets init
dotnet user-secrets set "AZURE_OPENAI_MODEL" "<deployment-name>"
dotnet user-secrets set "AZURE_OPENAI_ENDPOINT" "<deployment-endpoint>"
dotnet user-secrets set "AZURE_OPENAI_KEY" "<deployment-key>"
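
To confirm the values were saved, list the configured secrets:

dotnet user-secrets list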

The evaluation application is configured to use these secrets to connect to an Azure OpenAI model to perform evaluations. You can update this configuration in the KernelFactory.CreateKernelEval method:

public static Kernel CreateKernelEval()
{
    // Read the Azure OpenAI settings from the .NET user secrets store.
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // AZURE_OPENAI_MODEL holds the deployment name of the GPT-4 evaluation model.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}

Configure the model to generate test data

The evaluation application is configured to use the secrets set in the previous step to connect to an Azure OpenAI model to generate test data. You can update this configuration in the KernelFactory.CreateKernelGenerateData method:

public static Kernel CreateKernelGenerateData()
{
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}
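
Test data generation doesn't have to share a deployment with the evaluator. If you'd rather generate data with a different model, one option is to read a separate secret and fall back to the evaluation deployment when it isn't set. This is a sketch only; the secret name AZURE_OPENAI_GENERATION_MODEL is hypothetical and isn't part of the sample.

public static Kernel CreateKernelGenerateData()
{
    IConfigurationRoot config = new ConfigurationBuilder().AddUserSecrets<Program>().Build();
    IKernelBuilder builder = Kernel.CreateBuilder();

    // "AZURE_OPENAI_GENERATION_MODEL" is a hypothetical secret; if it isn't set,
    // fall back to the deployment used for evaluations.
    builder.AddAzureOpenAIChatCompletion(
        config["AZURE_OPENAI_GENERATION_MODEL"] ?? config["AZURE_OPENAI_MODEL"],
        config["AZURE_OPENAI_ENDPOINT"],
        config["AZURE_OPENAI_KEY"]
    );

    return builder.Build();
}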

3 - Generate test data

The evaluation application compares an LLM's output to "ground truth" answers, which are ideal question-answer pairs. It's recommended to have at least 200 question-answer pairs for an evaluation.

You can use the evaluation application to generate an initial set of question-answer pairs. Then manually curate them by rewriting or removing any subpar answers.

Tips for generating test data:

  • Generate more question-answer pairs than you need, then manually prune them based on quality and overlap. Remove low-quality answers, and remove questions that are too similar to other questions.
  • Be aware of the knowledge distribution so that you effectively sample questions across the relevant knowledge space.
  • Once your application is live, continually sample real user questions (in accordance with your privacy policy) to make sure you're representing the kinds of questions that users are asking.

  1. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following command:

    dotnet run .
    
  2. Select Generate QAs associated to a topic, and export to json, then press Enter.

    Screenshot: Scenario selection step of the evaluation application.

  3. Enter the number of question-answer pairs to be generated and their topic.

    Screenshot: Number and topic inputs for question-answer generation in the evaluation application.

  4. The application shows a preview of the generated question-answer pairs in JSON format. Enter the path of the file where the JSON should be saved.

    Screenshot: Output file input for question-answer generation in the evaluation application.

  5. Review the output JSON, and update or remove any incorrect or subpar answers. If you want to spot-check the file programmatically, see the sketch after these steps.
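
The following minimal sketch loads a curated file and prints each pair so that empty or obviously broken answers stand out. The QaPair property names and the file path are assumptions; match them to the JSON the application actually generated.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// "qa-pairs.json" is a placeholder; use the path you entered when exporting.
var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
var pairs = JsonSerializer.Deserialize<List<QaPair>>(
    File.ReadAllText("qa-pairs.json"), options) ?? new List<QaPair>();

foreach (QaPair pair in pairs)
{
    Console.WriteLine($"Q: {pair.Question}");
    Console.WriteLine($"A: {pair.Answer}");
    Console.WriteLine();
}

// Hypothetical shape of one question-answer pair; adjust the property names to
// match the generated JSON.
record QaPair(string Question, string Answer);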

4 - Perform an evaluation

Once you've curated the question-answer pairs, the evaluation application can use them to evaluate the outputs of the test model.

  1. Copy the JSON file containing the question-answer pairs to ai-samples/src/llm-eval/LLMEval.Test/assets/qa-02.json.

  2. From the ai-samples/src/llm-eval/LLMEval.Test directory, run the following command:

    dotnet run .
    
  3. Select List of QAs from a file, then press Enter.

    Screenshot: Step list of the evaluation application with 'List of QAs from a file' selected.

  4. The evaluation results are printed in a table format.

    Screenshot: Table showing the output of the evaluation application.

5 - Review the evaluation results

The evaluation results generated in the previous step include coherence, relevance, and groundedness metrics. These metrics are similar to the built-in metrics provided by Azure AI Studio.

  • Coherence: Measures how well the language model can produce outputs that flow smoothly, read naturally, and resemble human-like language.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/coherence/skprompt.txt
  • Relevance: Assesses the ability of answers to capture the key points of the context.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/relevance/skprompt.txt
  • Groundedness: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
    • Based on ai-samples/src/llm-eval/LLMEval.Core/_prompts/groundedness/skprompt.txt
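
Each score is produced by prompting the evaluation model with the corresponding skprompt.txt template. If you want to experiment with one of these prompts directly, a minimal sketch using Semantic Kernel could look like the following. It assumes it runs inside the LLMEval.Test project, and the "question" and "answer" argument names are assumptions, so check the template for the exact placeholders it expects.

// Intended to run inside the LLMEval.Test project, where KernelFactory is defined.
Kernel kernel = KernelFactory.CreateKernelEval();

// Path is relative to the LLMEval.Test directory.
string template = File.ReadAllText("../LLMEval.Core/_prompts/coherence/skprompt.txt");

// "question" and "answer" are assumed template variable names; check skprompt.txt
// for the exact placeholders the sample uses.
FunctionResult result = await kernel.InvokePromptAsync(template, new KernelArguments
{
    ["question"] = "What is Semantic Kernel?",
    ["answer"] = "Semantic Kernel is an SDK for integrating LLMs into .NET applications."
});

Console.WriteLine(result);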

Clean up resources

If you no longer need them, delete the Azure OpenAI resource and GPT-4 model deployment.

  1. In the Azure portal, navigate to your Azure OpenAI resource.
  2. Select the Azure OpenAI resource, then select Delete.
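
If you prefer the command line, you can delete the resource with the Azure CLI instead. Azure OpenAI resources are Azure AI services (Cognitive Services) accounts, so the following command removes the resource and its model deployments; replace the placeholders with your resource and resource group names:

az cognitiveservices account delete --name <resource-name> --resource-group <resource-group-name>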