Evaluation datasets are JSON files containing prompts and expected responses. This article defines the dataset schema, documents where the tool looks for datasets, and shows how to design effective tests - including advanced scenarios like multi-turn conversations, per-item evaluator configuration, and categorized test suites.
Schema overview
Evaluation datasets are JSON files. The tool supports two equivalent shapes: a versioned object (recommended) and a legacy array.
Versioned schema (recommended)
The simplest valid dataset requires only schemaVersion and an items array with prompt and expected_response fields.
{
"schemaVersion": "1.0.0",
"items": [
{
"prompt": "string",
"expected_response": "string"
}
]
}
Schema version 1.2.0 adds support for default and per-item evaluator configuration, evaluator mode control, named items, and multi-turn conversations. For details, see Configure evaluators and Multi-turn evaluation patterns.
Schema fields
| Field | Type | Required | Description |
|---|---|---|---|
| `schemaVersion` | string | Recommended | Semantic version (for example, `"1.0.0"` or `"1.2.0"`). Backward compatibility is guaranteed within a major version. Use `"1.2.0"` to enable evaluator configuration, evaluator modes, and native multi-turn support. |
| `items` | array | Yes | Array of test items. Each item is either a single-turn prompt/response pair or a named multi-turn conversation. |
| `description` | string | Optional | Free-text description of the dataset (for example, "Regression tests for Q1 2026 release"). |
| `default_evaluators` | object | Optional | Evaluators applied to every item in the dataset unless overridden. Each key is an evaluator name (for example, `"Relevance"`, `"Coherence"`); the value is an options object (use `{}` for defaults). Requires `schemaVersion` `"1.2.0"` or later. |
| `items[].prompt` | string | Conditional | The prompt or instruction sent to the agent. Required for single-turn items. Don't use with `turns`. |
| `items[].expected_response` | string | Conditional | The reference response used for scoring. Required for single-turn items. Don't use with `turns`. |
| `items[].name` | string | Optional | Display name for the test item (for example, "Expense policy flow"). Especially useful for identifying multi-turn items in reports. |
| `items[].turns` | array | Conditional | Ordered array of turn objects for a multi-turn conversation within a single item. Each turn contains `prompt`, `expected_response`, and optionally `evaluators` and `evaluators_mode`. Don't use with top-level `prompt`/`expected_response`. Requires `schemaVersion` `"1.2.0"` or later. |
| `items[].evaluators` | object | Optional | Per-item evaluator overrides. Each key is an evaluator name; the value is an options object (for example, `{ "citation_format": "mixed" }`). Behavior depends on `evaluators_mode`. Requires `schemaVersion` `"1.2.0"` or later. |
| `items[].evaluators_mode` | string | Optional | Controls how `items[].evaluators` combines with `default_evaluators`. Use `"extend"` (default) to merge per-item evaluators with defaults, or `"replace"` to use only the per-item evaluators and ignore defaults. Requires `schemaVersion` `"1.2.0"` or later. |
| `items[].testId` | string | Optional | Stable identifier for cross-version comparison (for example, `"REG-001"`). |
| `items[].category` | string | Optional | Category tag (for example, `"knowledge-base"`, `"tool-usage"`). |
| `items[].notes` | string | Optional | Freeform notes, such as a linked bug ID. |
Configure evaluators
Schema version 1.2.0 lets you control which evaluators run and how they're configured, at both the dataset level and the individual item level.
Default evaluators
Use default_evaluators at the top level to specify evaluators that apply to every item in the dataset. Each key is an evaluator name and the value is an options object. Use an empty object ({}) to apply the evaluator with its default settings.
{
"schemaVersion": "1.2.0",
"default_evaluators": {
"Relevance": {},
"Coherence": {}
},
"items": [
{
"prompt": "What is Microsoft Graph?",
"expected_response": "A unified API endpoint for Microsoft services."
}
]
}
In this example, every item is scored for Relevance and Coherence by using default settings.
Per-item evaluator overrides
Use the `evaluators` field on an individual item (or turn) to add or override evaluators for that specific test. Use `evaluators_mode` to control how per-item evaluators combine with `default_evaluators`:
- `"extend"` (default): Merges per-item evaluators with the defaults. The item is scored by both the default evaluators and any additional evaluators specified on the item.
- `"replace"`: Ignores the defaults entirely. Only the evaluators specified on the item are used.
{
"schemaVersion": "1.2.0",
"default_evaluators": {
"Relevance": {},
"Coherence": {}
},
"items": [
{
"prompt": "What is Microsoft Graph?",
"expected_response": "A unified API endpoint for Microsoft services.",
"evaluators": {
"Citations": { "citation_format": "mixed" }
},
"evaluators_mode": "extend"
}
]
}
In this example, the item is scored for Relevance (default), Coherence (default), and Citations with citation_format set to "mixed" (per-item override).
Complete schema example
The following example shows every schema feature in a single dataset: top-level defaults, a single-turn item with evaluator overrides, and a named multi-turn item with per-turn evaluator configuration.
{
"schemaVersion": "1.2.0",
"default_evaluators": {
"Relevance": {},
"Coherence": {}
},
"items": [
{
"prompt": "What is Microsoft Graph?",
"expected_response": "A unified API endpoint for Microsoft services.",
"evaluators": {
"Citations": { "citation_format": "mixed" }
},
"evaluators_mode": "extend"
},
{
"name": "Expense policy flow",
"turns": [
{
"prompt": "I spent $250 on dinner. Is that okay?",
"expected_response": "The per-diem meal allowance is $200."
},
{
"prompt": "What should I do about the overage?",
"expected_response": "Request manager approval.",
"evaluators": {
"ExactMatch": { "case_sensitive": false }
},
"evaluators_mode": "replace"
}
]
}
]
}
Key details in this example:
- The first item is a single-turn test. It inherits `Relevance` and `Coherence` from `default_evaluators` and adds `Citations` through the `"extend"` mode.
- The second item is a named multi-turn conversation (`"Expense policy flow"`) with two turns. The first turn inherits the default evaluators. The second turn uses `"replace"` mode, so only `ExactMatch` runs; the defaults are ignored for that turn.
Legacy array schema
The tool also accepts a bare array for backward compatibility:
[
{
"prompt": "Your test prompt here",
"expected_response": "Expected agent response"
}
]
The CLI automatically upgrades legacy documents (missing schemaVersion) to the versioned format and writes a timestamped backup.
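For example, a legacy file containing the preceding array is rewritten to a versioned document similar to the following, with the original content preserved in the timestamped backup:
{
  "schemaVersion": "1.0.0",
  "items": [
    {
      "prompt": "Your test prompt here",
      "expected_response": "Expected agent response"
    }
  ]
}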
File naming and location
The evaluation tool automatically discovers dataset files in your project.
Auto-discovery order
When you run `runevals`, the tool searches for datasets in this order:
1. Current directory: `prompts.json`, `evals.json`, `tests.json`
2. `./evals/` subdirectory: `prompts.json`, `evals.json`, `tests.json`
Recommended project structure
my-agent/
├── .env.local # Agent configuration
├── .env.local.user # Secrets (not committed)
├── evals/
│ ├── evals.json # Main test suite
│ ├── regression-tests.json # Regression scenarios
│ └── edge-cases.json # Edge case testing
└── .evals/
└── results/ # Generated reports
Starter file creation
If the tool doesn't find a dataset file, it prompts you to create a starter file:
⚠️ No prompts file found in current directory or ./evals/
Create a starter evals file with sample prompts? (Y/n):
Answering Y creates ./evals/evals.json with sample prompts.
Design effective test prompts
Organize your tests into categories that reflect the agent's behavior you want to verify.
Knowledge verification
Test whether your agent correctly accesses and uses its knowledge base.
{
"prompt": "What are the key features of our enterprise plan?",
"expected_response": "The enterprise plan includes advanced security, unlimited storage, 24/7 support, and custom integrations."
}
Instruction following
Verify the agent follows specific instructions.
{
"prompt": "List the top 3 sales leads from last quarter in bullet points.",
"expected_response": "• Contoso Ltd - $500K potential\n• Fabrikam Inc - $350K potential\n• Adventure Works - $280K potential"
}
Tool usage
Test whether the agent correctly uses available tools and plugins.
{
"prompt": "What meetings do I have tomorrow?",
"expected_response": "Based on your calendar, you have 3 meetings tomorrow: Team standup at 9 AM, Client presentation at 2 PM, and Project review at 4 PM."
}
Edge cases
Test boundary conditions and unusual inputs.
{
"prompt": "Show me sales data from the year 1850.",
"expected_response": "I don't have sales data from 1850 as our company was founded in 1998. Would you like to see data from our earliest available records?"
}
Safety and appropriateness
Ensure the agent handles inappropriate requests properly.
{
"prompt": "Can you write my performance review for me?",
"expected_response": "I can't write your performance review for you, but I can help you gather your accomplishments, suggest a structure, or provide examples of effective self-assessments."
}
Best practices for test design
Write clear prompts
The following is an example of a clear prompt.
{
"prompt": "What is the return policy for electronics purchased online?",
"expected_response": "Electronics purchased online can be returned within 30 days of delivery in original condition with receipt. Some items like opened software have different policies."
}
Avoid ambiguous prompts like the following example.
{
"prompt": "Tell me about returns"
}
Include realistic scenarios
Base tests on actual user questions.
{
"prompt": "I need to schedule a meeting with the sales team next week. What times are they all available?",
"expected_response": "I can help you find meeting times. The sales team is available Tuesday at 2 PM, Wednesday at 10 AM, or Thursday at 3 PM next week."
}
Cover error handling
Test how the agent handles errors gracefully.
{
"prompt": "Show me sales data for customer XYZ-123",
"expected_response": "I couldn't find a customer with ID XYZ-123. Would you like me to search by company name instead?"
}
Advanced evaluation scenarios
Multi-turn evaluation patterns
Schema version 1.2.0 supports multi-turn conversations. Use the turns array within an item to define an ordered sequence of prompts and expected responses that form a single conversation flow. Each turn can optionally include its own evaluator configuration.
{
"schemaVersion": "1.2.0",
"default_evaluators": {
"Relevance": {},
"Coherence": {}
},
"items": [
{
"name": "Expense policy flow",
"turns": [
{
"prompt": "I spent $250 on dinner. Is that okay?",
"expected_response": "The per-diem meal allowance is $200."
},
{
"prompt": "What should I do about the overage?",
"expected_response": "Request manager approval.",
"evaluators": {
"ExactMatch": { "case_sensitive": false }
},
"evaluators_mode": "replace"
}
]
}
]
}
Key details:
- Each item with a `turns` array is evaluated as a single conversation. Turns are sent in sequence, with each turn building on the conversation context of the previous ones.
- Use the `name` field to give multi-turn items a readable label in reports.
- You can apply `evaluators` and `evaluators_mode` on individual turns. In the preceding example, the second turn uses `"replace"` mode, so only `ExactMatch` runs for that turn.
Sequential items pattern (schema version 1.0.0)
If you're using schema version 1.0.0, you can approximate multi-turn conversations by designing sequential items where later prompts reference context established by earlier ones. Use consistent testId prefixes and category tags to group and filter related items in results.
{
"schemaVersion": "1.0.0",
"description": "Multi-turn: SharePoint discovery",
"items": [
{
"prompt": "What SharePoint sites does our team have?",
"expected_response": "Your team has 3 SharePoint sites: Project Central, Team Resources, and Client Portal.",
"testId": "MT-001",
"category": "multi-turn"
},
{
"prompt": "Who has access to the Project Central site?",
"expected_response": "Project Central has 15 members: 8 from Engineering, 5 from Product, and 2 from Design.",
"testId": "MT-002",
"category": "multi-turn"
}
]
}
Note
With sequential items, each item is evaluated independently. The agent doesn't carry conversation context between items. For true multi-turn evaluation with shared context, use the turns array with schema version 1.2.0.
Per-prompt categorization and scoring
Use the optional category field to group items so you can analyze scores by dimension (knowledge, tools, safety, edge cases, regression).
{
"schemaVersion": "1.0.0",
"description": "Q1 2026 release test suite",
"items": [
{
"prompt": "What is our company mission?",
"expected_response": "Our mission is to empower every person and organization...",
"testId": "KB-001",
"category": "knowledge-base"
},
{
"prompt": "What meetings do I have today?",
"expected_response": "You have 2 meetings today...",
"testId": "TOOL-001",
"category": "tool-usage"
}
]
}
Dataset organization strategies
For large projects, organize tests by category across multiple files.
evals/
├── knowledge-base.json # Knowledge verification
├── tool-usage.json # Plugin and action tests
├── conversation-flow.json # Dialog and multi-turn tests
├── edge-cases.json # Boundary conditions
└── regression.json # Previously fixed issues
Run specific dataset files.
runevals --prompts-file ./evals/knowledge-base.json
runevals --prompts-file ./evals/tool-usage.json
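To run an entire suite, you can loop over the dataset files. The following sketch assumes that `runevals` accepts one `--prompts-file` value per invocation, as in the commands above.
# Run every dataset file in ./evals/, one invocation per file
for file in ./evals/*.json; do
  runevals --prompts-file "$file"
done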
Regression tests
When you fix problems, add tests to prevent regression. Use testId and notes to link back to bug tracking.
{
"prompt": "Issue that was previously broken",
"expected_response": "Correct behavior after fix",
"testId": "BUG-456",
"notes": "Regression test for bug #456"
}
Starter templates
Basic agent test template
{
"schemaVersion": "1.0.0",
"description": "Basic agent evaluation tests",
"items": [
{
"prompt": "What can you help me with?",
"expected_response": "I can help you with [specific capabilities]."
},
{
"prompt": "Who are you?",
"expected_response": "I'm [agent name], specialized in [domain]."
}
]
}
Knowledge base test template
{
"schemaVersion": "1.0.0",
"description": "Knowledge base accuracy tests",
"items": [
{
"prompt": "What is [key concept from your knowledge]?",
"expected_response": "[Accurate definition from knowledge base]"
},
{
"prompt": "How do I [perform key task]?",
"expected_response": "[Step-by-step guidance from knowledge]"
}
]
}
Tool usage test template
{
"schemaVersion": "1.0.0",
"description": "Plugin and tool integration tests",
"items": [
{
"prompt": "What's on my calendar today?",
"expected_response": "[Calendar data retrieved via Graph API]"
},
{
"prompt": "Find documents about [topic]",
"expected_response": "[Search results from SharePoint/OneDrive]"
}
]
}
Interactive and inline testing
Use interactive mode for exploratory testing without a dataset file.
runevals --interactive
For quick, single-prompt tests, pass prompts inline.
runevals --prompts "What is Microsoft Graph?" \
--expected "Microsoft Graph is the API gateway to Microsoft 365 data and intelligence."
You can also pass multiple prompts and their expected responses.
runevals --prompts "What is Teams?" "What is SharePoint?" \
--expected "Teams is a collaboration platform" "SharePoint is a content management system"
Understand evaluation metrics
Each test is automatically scored on multiple dimensions.
Relevance (1-5)
Relevance measures how well the response addresses the prompt:
- 5: Perfectly addresses the question
- 3: Partially addresses the question
- 1: Doesn't address the question
Coherence (1-5)
Coherence measures how logical and well-structured the response is:
- 5: Clear, logical, well-organized
- 3: Somewhat organized but could be clearer
- 1: Incoherent or confusing
Groundedness (1-5)
Groundedness measures how well the response is supported by sources and citations:
- 5: Fully grounded with appropriate citations
- 3: Partially grounded with some citations
- 1: No grounding or citations
Similarity (1-5)
Similarity measures how closely the response matches the expected output:
- 5: Response is semantically equivalent to the expected output
- 3: Response partially matches the expected output
- 1: Response doesn't match the expected output
Citations (>= 0)
Citations is a count-based evaluator that reports the number of valid citations in the response. A score of 0 means no citations are present. Configure a minimum threshold to set a pass/fail bar.
ExactMatch
ExactMatch is a string match evaluator with a Boolean result. The response passes if it contains the expected string exactly. Supports a case_sensitive option (default: false).
PartialMatch (0.0-1.0)
PartialMatch is a string match evaluator that returns a continuous similarity score between 0.0 and 1.0. Use the threshold option to set the minimum score required to pass (default: 0.5).
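String match evaluators are configured through the same options objects as any other evaluator. The following sketch applies PartialMatch with a stricter threshold to every item and switches one item to ExactMatch instead; it uses only the case_sensitive and threshold options documented above.
{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "PartialMatch": { "threshold": 0.8 }
  },
  "items": [
    {
      "prompt": "What is the return window for online electronics orders?",
      "expected_response": "30 days from delivery.",
      "evaluators": {
        "ExactMatch": { "case_sensitive": false }
      },
      "evaluators_mode": "replace"
    }
  ]
}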
Continuous improvement
Review failed tests
When tests score poorly:
- Review the actual response vs. expected response.
- Determine if the expected response needs updating.
- Check if the agent needs more training data or instructions.
- Verify tool configurations are correct.
Track scores over time
Save test results to compare across versions.
runevals --output ./evals/results/v1.2.0-results.json