Batch testing for prompts (preview)

[This topic is prerelease documentation and is subject to change.]

Prompts enable you to create custom generative AI tools for business automation and agents. Ensuring the accuracy, reliability, and efficiency of these tools is critical. Batch testing of prompts is designed to enable you to validate and improve prompts used in AI tools across the platform.

Core features of batch testing

Batch testing provides a systematic approach for validating prompts on diverse datasets. You can:

  • Upload or generate test datasets for comprehensive evaluation.
  • Define evaluation criteria for judging the test results.
  • Execute batch tests to assess prompt behavior across the test dataset.
  • Compare outcomes over time to ensure continuous improvement.
  • Review and adjust automatic evaluations to ensure alignment with your specific needs.

An accuracy score is calculated from the test results, giving you empirical data you can use to build trust in your AI tools.
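
The platform calculates this score for you, but conceptually it reflects the share of test cases whose outputs meet your passing criteria. The following is a minimal sketch of that idea in Python; the accuracy_score function and the result fields are illustrative assumptions, not part of the product:

    def accuracy_score(results):
        # results: list of evaluation outcomes, for example {"outcome": "pass"} or {"outcome": "fail"}
        if not results:
            return 0.0
        passed = sum(1 for r in results if r["outcome"] == "pass")
        return 100.0 * passed / len(results)

    # Example: 8 passing test cases out of 10 gives an accuracy score of 80.0
    print(accuracy_score([{"outcome": "pass"}] * 8 + [{"outcome": "fail"}] * 2))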

How to use batch testing

Use the following steps to set up and run batch tests for your prompts.

Define the test cases

  1. Sign in to Copilot Studio, Power Apps, or Power Automate.

  2. Access the list of prompts:

    • In Copilot Studio, select Tools, and then filter on prompts.
    • In Power Apps and Power Automate, select AI hub.
  3. Next to the prompt name, select the ellipsis (...).

  4. Select Test hub (Preview).

    Here's an example of the Tools screen in Copilot Studio:

    Screenshot of the menu with the 'Test hub - Preview' option.

    In Copilot Studio, the test hub looks like the following screenshot:

    Screenshot of the Test hub screen.

  5. Add your test cases using one of the available options:

    • Upload: Allows you to upload test cases from a CSV file. To check the format of the file you need to upload, select Download test data schema. (A hypothetical sample layout is sketched after this list.)
    • AI-generate: Allows you to generate test cases using AI based on your prompt.
    • Use activity data: Allows you to pull in recent prompt activity to help you get started.
    • Manually Add: Allows you to create test cases manually.

    Any of these options helps you create a list of test cases that you can run:

    Screenshot of the uploaded test cases.
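
    If you chose Upload, each row of the CSV maps to one test case, pairing values for the prompt's inputs with an optional expected output to evaluate against. The exact columns come from the Download test data schema file for your specific prompt; the layout below is only a hypothetical illustration for a prompt with a single text input:

        Input,ExpectedOutput
        "Summarize this customer email: I still haven't received my refund...","A short summary that mentions the refund request"
        "Summarize this customer email: My order arrived two weeks late...","A short summary that mentions the delivery delay"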

Set evaluation criteria

  1. After you create the test cases, select Configure criteria in the configuration section on the right:

    Screenshot of configure evaluation criteria.

  2. Define the Passing score, which is the minimum score required for a response to pass.

  3. Choose one of the following prebuilt criteria:

    • Response quality: Tests responses for clarity, helpfulness, and tone
    • Response matches: Tests responses for specific words and meanings
    • JSON correctness: Tests that responses follow your data schema

    Screenshot of evaluation criteria.

    These criteria and the passing score determine how test case outputs are assessed during the evaluation process.
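
    To make the roles of the criteria and the passing score concrete, here's a minimal Python sketch of how a single test case output might be judged against the JSON correctness criterion. The check_json_correctness helper, the required_keys parameter, and the 0-100 scoring convention are illustrative assumptions; the actual evaluation is performed by the test hub:

        import json

        def check_json_correctness(response_text, required_keys):
            # Score 100 if the response parses as JSON and contains every
            # required key, otherwise 0. Real criteria can be more nuanced.
            try:
                data = json.loads(response_text)
            except json.JSONDecodeError:
                return 0
            return 100 if all(key in data for key in required_keys) else 0

        def is_pass(score, passing_score):
            # A test case passes when its score meets the minimum you configured.
            return score >= passing_score

        response = '{"sentiment": "positive", "confidence": 0.92}'
        score = check_json_correctness(response, required_keys=["sentiment", "confidence"])
        print(is_pass(score, passing_score=70))  # True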

Run batch tests

  1. On the test cases screen, select Run all to run the evaluation on all test cases, or select the test cases you want to run and then select Run selected.

    Screenshot of tests to run.

    The test hub evaluates results against the defined criteria, providing insights into the prompt's performance.

  2. When the test case evaluation is done, the results screen appears:

    Screenshot of tests results.

  3. To access previous evaluation runs, select the prompt name at the top of the screen in Copilot Studio, or select Run history in Power Apps or Power Automate.

    Screenshot of run history.

  4. To view details, select the evaluation run.

Run history allows you to monitor and analyze test results over time. You can:

  • Track accuracy score progression across multiple test runs.
  • Compare outcomes from different runs to identify trends or regressions.
  • Access details about why a test result was classified as a pass or fail, giving you more context for diagnosis.

Iterate on the test case evaluation and monitor any significant changes between evaluation runs.
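
As a rough illustration of monitoring change between runs, the following Python sketch compares the accuracy scores of two evaluation runs and flags a regression larger than a chosen threshold. The function and the five-point threshold are assumptions for illustration; the test hub surfaces this history for you:

    def detect_regression(previous_accuracy, current_accuracy, threshold=5.0):
        # Flag a drop in accuracy larger than the threshold (in percentage points).
        return (previous_accuracy - current_accuracy) > threshold

    # Example: a drop from 90.0 to 82.0 exceeds a 5-point threshold.
    print(detect_regression(previous_accuracy=90.0, current_accuracy=82.0))  # True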