[This topic is prerelease documentation and is subject to change.]
Prompts enable you to create custom generative AI tools for business automation and agents. Ensuring the accuracy, reliability, and efficiency of these tools is critical. Batch testing of prompts lets you validate and improve the prompts used in AI tools across the platform.
Important
- This is a production-ready preview feature.
- Production-ready previews are subject to supplemental terms of use.
- Prompts run on GPT models powered by Azure OpenAI Service.
- This capability might not be available in your region yet. Learn more in the Prompts section in Feature availability by region or US Government environment.
- This capability might be subject to usage limits or capacity throttling.
Core features of batch testing
Batch testing provides a systematic approach for validating prompts on diverse datasets. You can:
- Upload or generate test datasets for comprehensive evaluation.
- Define evaluation criteria for judging the test results.
- Execute batch tests to assess prompt behavior across the test dataset.
- Compare outcomes over time to ensure continuous improvement.
- Review and adjust automatic evaluations to ensure alignment with your specific needs.
An accuracy score is calculated based on test results, giving you empirical data to trust your AI tools.
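As a rough mental model only, you can think of the accuracy score as the share of test cases whose evaluation meets the passing score you define later in the test hub. The sketch below illustrates that assumption, not the product's actual formula; the result records and the 0-100 scale are invented for this example.

```python
# Illustrative only: a simple pass-rate calculation, assuming the accuracy
# score behaves like "passed test cases / total test cases". The field names
# and the 0-100 scale are assumptions made for this sketch.

def accuracy_score(results: list[dict], passing_score: int) -> float:
    """Return the percentage of test cases whose score meets the threshold."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r["score"] >= passing_score)
    return 100 * passed / len(results)

results = [
    {"case": "Refund request", "score": 92},
    {"case": "Order status", "score": 78},
    {"case": "Unrelated question", "score": 55},
]
print(round(accuracy_score(results, passing_score=70), 1))  # 66.7 -> 2 of 3 cases pass
```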
How to use batch testing
Use the following steps to set up and run batch tests for your prompts.
Define the test cases
Sign in to Copilot Studio, Power Apps, or Power Automate.
Access the list of prompts:
- In Copilot Studio, select Tools, and then filter on prompts.
- In Power Apps and Power Automate, select AI hub.
Next to the prompt name, select the ellipsis (...).
Select Test hub (Preview).
Add your test cases using one of the available options:
- Upload: Allows you to upload test cases from a CSV file. To check the format of the file you need to upload, select Download test data schema. An example is sketched after this list.
- AI-generate: Allows you to generate test cases using AI based on your prompt.
- Use activity data: Allows you to pull in recent prompt activity to help you get started.
- Manually add: Allows you to create test cases manually.
Each of these options helps you create a list of test cases that you can run.
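If you use the Upload option, the authoritative column layout is the one you get from Download test data schema, and it varies per prompt. The sketch below only illustrates the general shape of such a file; the column names are made up for this example (one column per prompt input plus an expected response), and the file is assembled with Python's csv module.

```python
# Hypothetical example of assembling a test data CSV for upload.
# The real column layout comes from "Download test data schema" for your
# prompt; the column names below are assumptions for illustration only.
import csv

rows = [
    {"CustomerEmail": "I was charged twice for my order.",
     "ExpectedResponse": "Apologize and offer to refund the duplicate charge."},
    {"CustomerEmail": "Where is my package?",
     "ExpectedResponse": "Provide the tracking link and the latest shipping status."},
]

with open("prompt-test-cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["CustomerEmail", "ExpectedResponse"])
    writer.writeheader()
    writer.writerows(rows)
```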
Set evaluation criteria
After you create the test cases, select Configure criteria in the configuration section on the right.
Define the Passing score, which is the minimum score required for a response to pass.
Choose one of the following prebuilt criteria:
- Response quality: Tests responses for clarity, helpfulness, and tone
- Response matches: Tests responses for specific words and meanings
- JSON correctness: Tests that responses follow your data schema
These criteria and the passing score determine how test case outputs are assessed during the evaluation process.
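For the JSON correctness criterion, the check is conceptually about whether a response parses as JSON and conforms to the data schema you expect the prompt to produce. The following is a minimal, hypothetical sketch of that idea; the schema, the response, and the use of the third-party jsonschema package are assumptions for illustration, not how the product performs the evaluation.

```python
# Hypothetical illustration of what "JSON correctness" checks conceptually:
# the response must parse as JSON and match the expected data schema.
# The schema and response below are invented for this example.
import json
from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "summary": {"type": "string"},
    },
    "required": ["sentiment", "summary"],
}

response_text = '{"sentiment": "negative", "summary": "Customer was double charged."}'
validate(instance=json.loads(response_text), schema=schema)  # raises if invalid
print("Response follows the schema.")
```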
Run batch tests
On the test cases screen, select Run all to evaluate all the test cases, or select specific test cases and then select Run selected.
The test hub evaluates results against the defined criteria, providing insights into the prompt's performance.
Once the evaluation of the test cases is done, the results screen appears.
To access previous evaluation runs, select the prompt name at the top of the screen in Copilot Studio, or select Run history in Power Apps or Power Automate.
To view details, select the evaluation run.
Run history allows you to monitor and analyze test results over time. You can:
- Track accuracy score progression across multiple test runs.
- Compare outcomes from different runs to identify trends or regressions.
- Access details about why a test result was classified as pass or fail, which helps with diagnosis.
Iterate on the test case evaluation and monitor any significant changes between evaluation runs.
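The run history view surfaces this comparison directly, but the underlying idea is simple: diff the per-test-case outcomes of two runs and look for cases that flipped from pass to fail. The sketch below illustrates that idea only; the run data and its layout are invented for this example.

```python
# Illustrative only: flag test cases that regressed (pass -> fail) between
# two evaluation runs. The run data below is invented; the test hub's run
# history view surfaces this kind of comparison without any code.

baseline = {"Refund request": True, "Order status": True, "Unrelated question": False}
latest   = {"Refund request": True, "Order status": False, "Unrelated question": False}

regressions = [case for case, passed in baseline.items()
               if passed and not latest.get(case, False)]

print("Regressed test cases:", regressions)  # ['Order status']
```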