Evaluate your Semantic Kernel with prompt flow

Evaluating your plugins and planners thoroughly is essential to getting good results from AI orchestration. This article shows how to evaluate your Semantic Kernel plugins and planners with prompt flow, and explains how prompt flow and Semantic Kernel integrate.

The integration of Semantic Kernel with prompt flow is a significant milestone.

  • It allows you to harness the powerful AI orchestration capabilities of Semantic Kernel to enhance the efficiency and effectiveness of your prompt flow.
  • More importantly, it enables you to utilize prompt flow's powerful evaluation and experiment management to assess the quality of your Semantic Kernel plugins and planners comprehensively.

What is Semantic Kernel?

Semantic Kernel is an open-source SDK that lets you easily combine AI services with conventional programming languages like C# and Python. By doing so, you can create AI apps that combine the best of both worlds. It provides plugins and planners, which are powerful tools that use AI capabilities to optimize operations, improving the efficiency and accuracy of planning.

Using prompt flow for plugin and planner evaluation

As you build plugins and add them to planners, it’s important to make sure they work as intended. This becomes crucial as more plugins are added, increasing the potential for errors.

Previously, testing plugins and planners was a manual, time-consuming process. Now you can automate it with prompt flow.

Our updated documentation provides step-by-step guidance on:

  1. Creating a flow with Semantic Kernel.
  2. Executing batch tests.
  3. Conducting evaluations to measure the accuracy of your planners and plugins.

Create a flow with Semantic Kernel

As with the LangChain integration in prompt flow, Semantic Kernel supports Python, so it can run inside a Python node in prompt flow.

Screenshot of prompt flow with Semantic kernel.

Prerequisites: Setup runtime and connection


Before developing the flow, install the Semantic Kernel package in the runtime environment that executes the flow.

To learn more, see Customize environment for runtime.


Semantic Kernel typically consumes OpenAI or Azure OpenAI by reading the keys you have specified in environment variables or stored in a .env file.

In prompt flow, you store the keys in a connection instead. You can move these keys from environment variables into key-value pairs in a custom connection in prompt flow.

Screenshot of custom connection.

You can then utilize this custom connection to invoke your OpenAI or Azure OpenAI model within the flow.
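
As a rough illustration of that conversion, the sketch below parses .env-style text and splits it into the secret and non-secret key-value pairs you would enter in a custom connection. The helper names, the setting names, and the rule that any name containing KEY is a secret are all assumptions made for this example, not something prompt flow requires.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip().strip('"')
    return pairs


def split_for_connection(pairs: dict) -> dict:
    """Group pairs for a custom connection.

    Assumption: any name containing KEY holds a secret. Adjust to your settings.
    """
    secrets = {k: v for k, v in pairs.items() if "KEY" in k.upper()}
    configs = {k: v for k, v in pairs.items() if k not in secrets}
    return {"secrets": secrets, "configs": configs}


# Placeholder values only. The setting names are examples, not required names.
SAMPLE_ENV = """
# Azure OpenAI settings
AZURE_OPENAI_API_KEY="<your-key>"
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_DEPLOYMENT_NAME="<your-deployment>"
"""

connection_values = split_for_connection(parse_env(SAMPLE_ENV))
for group, values in connection_values.items():
    print(group, sorted(values))
```

The printed names are the keys to create in the custom connection. Enter the values through the connection form rather than keeping real keys in source files.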

Create and develop a flow

Once the setup is complete, you can conveniently convert your existing Semantic Kernel planner to a prompt flow by following the steps below:

  1. Create a standard flow.
  2. Select a runtime with Semantic Kernel installed.
  3. Select the + Python icon to create a new Python node.
  4. Name the node after your planner (for example, math_planner).
  5. Select the + button on the Files tab to upload any other reference files (for example, plugins).
  6. Update the code in the __.py file with your planner's code.
  7. Define the input and output of the planner node.
  8. Set the flow input and output.
  9. Click Run for a single test.

For example, we can create a flow with a Semantic Kernel planner that solves math problems. Follow this documentation for the steps needed to create a simple prompt flow with Semantic Kernel at its core.

Screenshot of creating a flow with semantic kernel planner.

Set up the connection in Python code.

Screenshot of setting custom connection in python node.

Select the connection object in the node input, and set the model name of OpenAI or deployment name of Azure OpenAI.

Screenshot of setting model and key in node input.
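
A minimal sketch of how such a Python node might be laid out is shown below. It only illustrates the plumbing between node inputs and your planner code: the connection key names, `planner_settings`, and `run_planner` are hypothetical, and `run_planner` is deliberately left as a placeholder for your own Semantic Kernel code. The fallback branch exists only so the sketch can run on a machine without prompt flow installed.

```python
from types import SimpleNamespace

try:
    # Available inside a runtime that has prompt flow installed.
    from promptflow import tool
    from promptflow.connections import CustomConnection
except ImportError:
    # Local stand-ins so the sketch can run without prompt flow.
    CustomConnection = SimpleNamespace

    def tool(func):
        return func


def planner_settings(connection, deployment_name: str) -> dict:
    """Collect what the planner needs from the node inputs.

    The key names are assumptions; use the names stored in your connection.
    """
    return {
        "api_key": connection.AZURE_OPENAI_API_KEY,
        "endpoint": connection.AZURE_OPENAI_ENDPOINT,
        "deployment_name": deployment_name,
    }


def run_planner(problem: str, api_key: str, endpoint: str, deployment_name: str) -> str:
    """Placeholder for your Semantic Kernel code: build the kernel, register
    the AI service and plugins, run the planner, and return its answer."""
    raise NotImplementedError("plug in your Semantic Kernel planner here")


@tool
def math_planner(problem: str, deployment_name: str, connection: CustomConnection) -> str:
    # Node inputs arrive as function parameters; the return value is the node output.
    return run_planner(problem, **planner_settings(connection, deployment_name))


# Local check of the input plumbing with a fake connection object.
fake_connection = SimpleNamespace(
    AZURE_OPENAI_API_KEY="<your-key>",
    AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/",
)
settings = planner_settings(fake_connection, "<your-deployment>")
print(sorted(settings))
```

Keeping the planner call in its own function means the node body stays small, and the same planner code can be exercised outside prompt flow.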

Batch testing your plugins and planners

Instead of manually testing different scenarios one by one, you can now automatically run large batches of tests using prompt flow and benchmark data.

Screenshot of batch runs with prompt flow for Semantic kernel.

Once the flow has passed the single test run in the previous step, you can effortlessly create a batch test in prompt flow by adhering to the following steps:

  1. Create benchmark data in a jsonl file that contains a list of JSON objects, each with the input and the correct ground truth.
  2. Click Batch run to create a batch test.
  3. Complete the batch run settings, especially the data part.
  4. Submit run without evaluation (for this specific batch test, the Evaluation step can be skipped).
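
The benchmark file from step 1 could be produced with a few lines of Python, as in the sketch below. The field names `question` and `ground_truth` and the sample problems are assumptions for illustration; use whatever names match the flow inputs you map in the batch run settings.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical benchmark rows: one input and its expected answer per object.
BENCHMARK = [
    {"question": "What is 2 plus 3?", "ground_truth": "5"},
    {"question": "A pack holds 12 pencils. How many pencils are in 4 packs?", "ground_truth": "48"},
    {"question": "What is 10 divided by 4?", "ground_truth": "2.5"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line, which is the jsonl layout."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read the file back, which doubles as a quick validity check."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Written to the temp directory here; choose any location you can upload from.
BENCHMARK_PATH = Path(tempfile.gettempdir()) / "benchmark.jsonl"
write_jsonl(BENCHMARK, BENCHMARK_PATH)
print(f"Wrote {len(read_jsonl(BENCHMARK_PATH))} rows to {BENCHMARK_PATH}")
```

Storing ground truth as strings keeps the file uniform. An evaluator can convert the values to numbers when it compares them.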

In our Running batches with prompt flow article, we demonstrate how you can use this functionality to run batch tests on a planner that uses a math plugin. By defining a set of word problems, we can quickly test any changes we make to our plugins or planners so we can catch regressions early and often.

Screenshot of data of batch runs with prompt flow for Semantic kernel.

In your workspace, go to the Run list in prompt flow, select the Details button, and then select the Output tab to view the batch run result.

Screenshot of the run list.

Screenshot of the run detail.

Screenshot of the run output.

Evaluating the accuracy

Once a batch run is completed, you then need an easy way to determine the adequacy of the test results. This information can then be used to develop accuracy scores, which can be incrementally improved.

Screenshot of evaluating batch run with prompt flow.

Evaluation flows in prompt flow enable this functionality. Using the sample evaluation flows offered by prompt flow, you can assess various metrics such as classification accuracy, perceived intelligence, groundedness, and more.

Screenshot showing evaluation flow samples.

There's also the flexibility to develop your own custom evaluators if needed.

Screenshot of my custom evaluation flow.
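
To make the idea of a custom evaluator concrete, here is a minimal sketch of the logic a math accuracy evaluator might contain: a line-level check that compares the number in the planner's answer with the ground truth, and an aggregation step that turns the grades into an accuracy score. The function names, the grading labels, and the "last number in the text" rule are assumptions for this example, not the sample evaluation flow shipped with prompt flow.

```python
import re
from typing import List, Optional


def last_number(text: str) -> Optional[float]:
    """Return the last number mentioned in the text, or None if there is none.

    Simplification: thousands separators such as 2,500 are not handled.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def grade(prediction: str, ground_truth: str, tolerance: float = 1e-6) -> str:
    """Line-level check: compare the predicted and expected numbers."""
    predicted, expected = last_number(prediction), last_number(ground_truth)
    if predicted is None or expected is None:
        return "Incorrect"
    return "Correct" if abs(predicted - expected) <= tolerance else "Incorrect"


def accuracy(grades: List[str]) -> float:
    """Aggregation step: share of lines graded Correct."""
    if not grades:
        return 0.0
    return sum(g == "Correct" for g in grades) / len(grades)


grades = [
    grade("The answer is 5.", "5"),
    grade("About 47 pencils", "48"),
    grade("2.50", "2.5"),
]
print(grades, accuracy(grades))
```

In an evaluation flow, the line-level function and the aggregation function would typically live in separate Python nodes, with the aggregation node reporting the final metric.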

In prompt flow, you can quickly create an evaluation run based on a completed batch run by following the steps below:

  1. Prepare the evaluation flow and complete a batch run.
  2. Select the Run tab on the home page to go to the run list.
  3. Open the previously completed batch run.
  4. Select Evaluate at the top of the page to create an evaluation run.
  5. Complete the evaluation settings, especially the evaluation flow and the input mapping.
  6. Submit run and wait for the result.

Screenshot showing add new evaluation.

Screenshot showing evaluation settings.

Follow this documentation for Semantic Kernel to learn how to use the math accuracy evaluation flow to test how well our planner solves word problems.

After running the evaluator, you'll get a summary of your metrics. Initial runs may yield less-than-ideal results, which you can use as motivation for immediate improvement.

To check the metrics, go back to the batch run detail page, select the Details button, and then select the Output tab. Select the evaluation run name in the dropdown list to view the evaluation result.

Screenshot showing evaluation result.

You can check the aggregated metric in the Metrics tab.

Screenshot showing evaluation metrics.

Experiments for quality improvement

If you find that your plugins and planners aren’t performing as well as they should, there are steps you can take to make them better. In this documentation, we provide an in-depth guide on practical strategies to bolster the effectiveness of your plugins and planners. We recommend the following high-level considerations:

  1. Use a more advanced model like GPT-4 instead of GPT-3.5-turbo.
  2. Improve the description of your plugins so they’re easier for the planner to use.
  3. Inject additional help to the planner when sending the user’s ask.

By doing a combination of these three things, we demonstrate how you can take a failing planner and turn it into a winning one! At the end of the walkthrough, you should have a planner that can correctly answer all of the benchmark data.
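
The third consideration, injecting additional help, can be as simple as wrapping the user's ask in a little guidance before it reaches the planner. The sketch below shows one hypothetical way to do that. The function name, the prompt wording, and the hints are assumptions, and how much guidance actually helps depends on your model and plugins.

```python
from typing import List


def augment_ask(ask: str, hints: List[str]) -> str:
    """Prefix the user's ask with short guidance for the planner."""
    if not hints:
        return ask
    guidance = "\n".join(f"- {hint}" for hint in hints)
    return f"Guidance for the planner:\n{guidance}\n\nUser ask: {ask}"


# Hypothetical hints for a planner that has a math plugin available.
HINTS = [
    "Use the available math functions instead of computing in your head.",
    "Return only the final number.",
]

print(augment_ask("What is 10 divided by 4?", HINTS))
```

Because the hints are plain data, each variation can be submitted as its own batch run and compared against the others.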

Throughout the process of enhancing your plugins and planners in prompt flow, you can utilize the runs to monitor your experimental progress. Each iteration allows you to submit a batch run with an evaluation run at the same time.

Screenshot of batch run with evaluation.

This enables you to conveniently compare the results of various runs, assisting you in identifying which modifications are beneficial and which are not.

To compare, select the runs you wish to analyze, then select the Visualize outputs button at the top of the page.

Screenshot of compare runs.

This presents a detailed table with a line-by-line comparison of the results from the selected runs.

Screenshot of compare runs details.
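
If you export run outputs and want a similar view outside the UI, the comparison is easy to sketch in Python. The example below is only an illustration of the line-by-line idea: the run names and outputs are made up, and `compare_runs` is a hypothetical helper, not part of prompt flow.

```python
from typing import Dict, List


def compare_runs(inputs: List[str], runs: Dict[str, List[str]]) -> List[dict]:
    """Build one row per benchmark line with each run's output side by side.

    Run names must not be "line", "input", or "differs", which are reserved
    for the row's own columns.
    """
    for name, outputs in runs.items():
        if len(outputs) != len(inputs):
            raise ValueError(
                f"run {name!r} has {len(outputs)} outputs, expected {len(inputs)}"
            )
    rows = []
    for index, text in enumerate(inputs):
        row = {"line": index, "input": text}
        for name, outputs in runs.items():
            row[name] = outputs[index]
        # Flag lines where the selected runs disagree with each other.
        row["differs"] = len({outputs[index] for outputs in runs.values()}) > 1
        rows.append(row)
    return rows


# Made-up outputs from two hypothetical runs over the same inputs.
inputs = ["What is 2 plus 3?", "What is 10 divided by 4?"]
runs = {"baseline": ["5", "2"], "better_descriptions": ["5", "2.5"]}

for row in compare_runs(inputs, runs):
    print(row)
```

Filtering on the `differs` column points straight at the lines that a change affected.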

Next steps


Follow along with our documentation to get started! And keep an eye out for more integrations.

If you're interested in learning more about how you can use prompt flow to test and evaluate Semantic Kernel, we recommend following along with the articles we created. At each step, we provide sample code and explanations so you can use prompt flow successfully with Semantic Kernel.

When your planner is fully prepared, it can be deployed as an online endpoint in Azure Machine Learning. This allows it to be easily integrated into your application for consumption. Learn more about how to deploy a flow as a managed online endpoint for real-time inference.