Evaluation flows and metrics

Evaluation flows are a special type of prompt flow that calculates metrics to assess how well the outputs of a run meet specific criteria and goals. You can create or customize evaluation flows and metrics tailored to your tasks and objectives, and use them to evaluate other prompt flows. This article explains evaluation flows, how to develop and customize them, and how to use them in prompt flow batch runs to evaluate flow performance.

Understand evaluation flows

A prompt flow is a sequence of nodes that process input and generate output. Evaluation flows also consume required inputs and produce corresponding outputs that are usually scores or metrics. Evaluation flows differ from standard flows in their authoring experience and usage.

An evaluation flow usually runs after the run it tests, receiving that run's outputs and using them to calculate scores and metrics. Evaluation flows log metrics by using the prompt flow SDK log_metric() function.

The outputs of the evaluation flow are results that measure the performance of the flow being tested. Evaluation flows can have an aggregation node that calculates the overall performance of the flow being tested over the test dataset.

The next sections describe how to define inputs and outputs in evaluation flows.

Inputs

Evaluation flows calculate metrics or scores for batch runs by taking in the outputs of the run they're testing. For example, if the flow being tested is a QnA flow that generates an answer based on a question, you might name an evaluation input as answer. If the flow being tested is a classification flow that classifies a text into a category, you might name an evaluation input as category.

You might need other inputs as ground truth. For example, if you want to calculate the accuracy of a classification flow, you need to provide the category column of the dataset as ground truth. If you want to calculate the accuracy of a QnA flow, you need to provide the answer column of the dataset as the ground truth. You might need some other inputs to calculate metrics, such as question and context in QnA or retrieval augmented generation (RAG) scenarios.

You define the inputs of the evaluation flow the same way you define the inputs of a standard flow. By default, evaluation uses the same dataset as the run being tested. However, if the corresponding labels or target ground truth values are in a different dataset, you can easily switch to that dataset.
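For example, in a QnA scenario, the inputs of the evaluation flow might look like the parameters of the following Python node, where answer would be mapped to an output of the run being tested and question and ground_truth would be mapped to dataset columns. This is only an illustrative sketch; the node name, input names, and the containment check are assumptions, not a built-in evaluation method.

from promptflow import tool

@tool
def score_answer(question: str, answer: str, ground_truth: str) -> int:
    """Per-line scoring node whose parameters are the evaluation flow inputs."""
    # question is available for richer checks, such as LLM-based grading.
    # Illustrative score: 1 if the ground truth text appears in the generated answer, else 0.
    return 1 if ground_truth.strip().lower() in answer.strip().lower() else 0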

Input descriptions

To describe the inputs needed for calculating metrics, you can add descriptions. The descriptions appear when you map the input sources in batch run submissions.

To add descriptions for each input, select Show description in the input section when developing your evaluation method, and then enter the descriptions.

Screenshot of Classification Accuracy Evaluation with hide description highlighted.

To hide the descriptions from the input form, select Hide description.

Screenshot of evaluation input mapping with the answers description highlighted.

Outputs and metrics

The outputs of an evaluation are results that show the performance of the flow being tested. The output usually contains metrics such as scores, and can also include text for reasoning and suggestions.

Output scores

A prompt flow processes one row of data at a time and generates an output record. Evaluation flows can likewise calculate a score for each row of data, so you can check how a flow performs on each individual data point.

You can record the scores for each data instance as evaluation flow outputs by specifying them in the output section of the evaluation flow. The authoring experience is the same as defining a standard flow output.

Screenshot of the outputs section showing a name and value.

You can view the individual scores in the Outputs tab when you select View outputs, the same as you check the outputs of a standard flow batch run. You can append these instance-level scores to the output of the flow you tested.

Aggregation and metrics logging

The evaluation flow also provides an overall assessment for the run. To distinguish the overall results from individual output scores, these overall run performance values are called metrics.

To calculate an overall assessment value based on individual scores, select the Aggregation checkbox on a Python node in an evaluation flow to turn it into a reduce node. The node then takes in the inputs as a list and processes them as a batch.

Screenshot of the Python node heading with the Aggregation checkbox selected.

With aggregation, the node receives the scores of all flow outputs as a list and computes an overall result from them. For example, to calculate the accuracy of a classification flow, you can grade each output as correct or incorrect and then average those grades across all outputs. You can then log the average accuracy as a metric by using promptflow.log_metric(). Metrics must be numerical, such as float or int. Logging string-type metrics isn't supported.

The following code snippet is an example of calculating overall accuracy from the grades of all data points. The overall accuracy is logged as a metric by using promptflow.log_metric().

from typing import List
from promptflow import tool, log_metric

@tool
def calculate_accuracy(grades: List[str]): # Receive a list of grades from a previous node
    # calculate accuracy
    accuracy = round((grades.count("Correct") / len(grades)), 2)
    log_metric("accuracy", accuracy)

    return accuracy

Because log_metric() is called inside the Python node, you don't need to specify the metric anywhere else, and you can view the logged metrics later. After you use this evaluation method in a batch run, select the Metrics tab when you view outputs to see the metric that shows overall performance.

Screenshot of the metrics tab that shows the metrics logged by log metrics.

Develop an evaluation flow

To develop your own evaluation flow, select Create on the Azure Machine Learning studio Prompt flow page. On the Create a new flow page, you can either:

  • Select Create on the Evaluation flow card under Create by type. This selection provides a template for developing a new evaluation method.

  • Select Evaluation flow in the Explore gallery, and select from one of the available built-in flows. Select View details to get a summary of each flow, and select Clone to open and customize the flow. The flow creation wizard helps you modify the flow for your own scenario.

Screenshot of different ways to create a new evaluation flow.

Calculate scores for each data point

Evaluation flows calculate scores and metrics for flows that run on datasets. The first step in evaluation flows is calculating scores for each individual data output.

For example, in the built-in Classification Accuracy Evaluation flow, the grade that measures the accuracy of each flow-generated output to its corresponding ground truth is calculated in the grade Python node.

If you use the evaluation flow template, you calculate this score in the line_process Python node. You can also replace the line_process Python node with a large language model (LLM) node to have an LLM calculate the score, or use multiple nodes to perform the calculation.

Screenshot of line process node in the template.

Specify the outputs of this node as the outputs of the evaluation flow, which indicates that they're the scores calculated for each data sample. You can also output reasoning for more context. Defining these outputs is the same experience as defining outputs in a standard flow.
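For example, a minimal line_process node for a classification scenario might compare each flow-generated category with its ground truth, as in the following sketch. The input names and the exact-match rule are assumptions for illustration.

from promptflow import tool

@tool
def line_process(groundtruth: str, prediction: str) -> str:
    """Grade one data row by comparing the flow output with its ground truth."""
    # A case-insensitive exact match counts as Correct; anything else is Incorrect.
    return "Correct" if groundtruth.strip().lower() == prediction.strip().lower() else "Incorrect"

The returned grade is the instance-level score for that row. The aggregation node described next turns the list of grades into a numerical metric.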

Calculate and log metrics

The next step in evaluation is to calculate overall metrics to assess the run. You calculate metrics in a Python node that has the Aggregation option selected. This node takes in the scores from the previous calculation node and organizes them into a list, then calculates overall values.

If you use the evaluation template, this score is calculated in the aggregate node. The following code snippet shows the template for the aggregation node.


from typing import List
from promptflow import tool

@tool
def aggregate(processed_results: List[str]):
    """
    This tool aggregates the processed results of all lines and logs metrics.
    :param processed_results: List of the outputs of the line_process node.
    """
    # Add your aggregation logic here
    aggregated_results = {}

    # Log metric
    # from promptflow import log_metric
    # log_metric(key="<my-metric-name>", value=aggregated_results["<my-metric-name>"])

    return aggregated_results

You can use your own aggregation logic, such as calculating score mean, median, or standard deviation.

Log the metrics by using the promptflow.log_metric() function. You can log multiple metrics in a single evaluation flow. Metrics must be numerical (float/int).
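For example, a filled-in aggregate node might compute the mean, median, and standard deviation of numerical per-line scores and log each value as a metric. The metric names and statistics in this sketch are illustrative.

from statistics import mean, median, stdev
from typing import List

from promptflow import log_metric, tool

@tool
def aggregate(processed_results: List[float]):
    """Aggregate per-line numerical scores into overall metrics and log them."""
    aggregated_results = {
        "mean_score": round(mean(processed_results), 2),
        "median_score": round(median(processed_results), 2),
        # stdev needs at least two data points.
        "stdev_score": round(stdev(processed_results), 2) if len(processed_results) > 1 else 0.0,
    }

    # Metrics must be numerical (float or int); each call logs one metric.
    for name, value in aggregated_results.items():
        log_metric(key=name, value=value)

    return aggregated_results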

Use evaluation flows

After you create your own evaluation flow and metrics, you can use the flow to assess performance of a standard flow. For example, you can evaluate a QnA flow to test how it performs on a large dataset.

  1. In Azure Machine Learning studio, open the flow that you want to evaluate, and select Evaluate in the top menu bar.

    Screenshot of evaluation button.

  2. In the Batch run & Evaluate wizard, complete the Basic settings and Batch run settings to load the dataset for testing and configure the input mapping. For more information, see Submit batch run and evaluate a flow.

  3. In the Select evaluation step, you can select one or more of your customized evaluations or built-in evaluations to run. Customized evaluation lists all the evaluation flows that you created, cloned, or customized. Evaluation flows created by others working on the same project don't appear in this section.

    Screenshot of selecting customized evaluation.

  4. On the Configure evaluation screen, specify the sources of any input data needed for the evaluation method. For example, the ground truth column might come from a dataset. If your evaluation method doesn't require data from a dataset, you don't need to select a dataset, or reference any dataset columns in the input mapping section.

    In the Evaluation input mapping section, you can indicate the sources of required inputs for the evaluation. If the data source is from your run output, set the source as ${run.outputs.[OutputName]}. If the data is from your test dataset, set the source as ${data.[ColumnName]}. Any descriptions you set for the data inputs also appear here. For a programmatic sketch of this mapping format, see the example after these steps. For more information, see Submit batch run and evaluate a flow.

    Screenshot of evaluation input mapping.

    Important

    If your evaluation flow has an LLM node or requires a connection to consume credentials or other keys, you must enter the connection data in the Connection section of this screen to be able to use the evaluation flow.

  5. Select Review + submit and then select Submit to run the evaluation flow.

  6. After the evaluation flow completes, you can see the instance-level scores by selecting View batch runs > View latest batch run outputs at the top of the flow you evaluated. Select your evaluation run from the Append related results dropdown list to see the grade for each data row.

    Screenshot of the output tab with evaluation result appended and highlighted.
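If you prefer to submit the evaluation from code, the following sketch shows the same mapping format by using the open-source prompt flow SDK. The flow path, dataset path, run name, and column names are assumptions for illustration; the ${run.outputs.[OutputName]} and ${data.[ColumnName]} references behave the same way as in the studio wizard, and a similar client is available for Azure Machine Learning workspaces.

from promptflow import PFClient

pf = PFClient()

# Evaluate an existing batch run (flow path, data path, run name, and columns are illustrative).
eval_run = pf.run(
    flow="./classification_accuracy_eval",       # path to the evaluation flow
    data="./data/test_set.jsonl",                 # test dataset that contains the ground truth column
    run="my_classification_batch_run",            # the batch run being evaluated
    column_mapping={
        "groundtruth": "${data.category}",        # input sourced from a dataset column
        "prediction": "${run.outputs.category}",  # input sourced from the tested run's output
    },
)

print(pf.get_metrics(eval_run))  # view the metrics logged by the aggregation node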