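Dataset schema and test design

Evaluation datasets are JSON files that contain prompts and expected responses. This article defines the dataset schema, documents how the tool finds datasets, and shows how to design effective tests, including advanced scenarios such as multi-turn conversations, per-item evaluator configuration, and categorized test suites.

Schema overview

A dataset file can take one of two equivalent shapes: a versioned object (recommended) or a legacy array.

Versioned schema (recommended)

The simplest valid dataset needs only schemaVersion and an items array whose entries contain the prompt and expected_response fields.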

{
  "schemaVersion": "1.0.0",
  "items": [
    {
      "prompt": "string",
      "expected_response": "string"
    }
  ]
}
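As a minimal sketch of the shape above, the following Python builds a versioned dataset from prompt/response pairs and prints it as JSON. The helper name build_dataset is hypothetical and is not part of the evaluation tool.

```python
import json


def build_dataset(pairs, schema_version="1.0.0"):
    """Wrap (prompt, expected_response) pairs in the versioned object shape.

    build_dataset is a hypothetical helper for illustration only.
    """
    return {
        "schemaVersion": schema_version,
        "items": [
            {"prompt": prompt, "expected_response": expected}
            for prompt, expected in pairs
        ],
    }


dataset = build_dataset([
    ("What is Microsoft Graph?", "A unified API endpoint for Microsoft services."),
])
print(json.dumps(dataset, indent=2))
```

Save the printed JSON under one of the discoverable file names, such as ./evals/evals.json, to use it as a dataset.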

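Schema version 1.2.0 adds support for default and per-item evaluator configuration, evaluator mode control, named items, and multi-turn conversations. For details, see Configure evaluators and Multi-turn evaluation mode.

Schema fields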

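Field	Type	Required	Description
`schemaVersion`	string	Recommended	Semantic version (for example, `"1.0.0"` or `"1.2.0"`). Backward compatibility is guaranteed within a major version. Use `"1.2.0"` to enable evaluator configuration, evaluator modes, and native multi-turn support.
`items`	array	Yes	Array of test items. Each item is either a single-turn prompt/response pair or a named multi-turn conversation.
`description`	string	Optional	Free-text description of the dataset (for example, `"Regression tests for Q1 2026 release"`).
`default_evaluators`	object	Optional	Evaluators applied to every item in the dataset unless overridden. Each key is an evaluator name (for example, `"Relevance"`, `"Coherence"`); each value is an options object (use `{}` for defaults). Requires `schemaVersion` `"1.2.0"` or later.
`items[].prompt`	string	Conditional	The prompt or instruction sent to the agent. Required for single-turn items. Don't combine with `turns`.
`items[].expected_response`	string	Conditional	The reference response used for scoring. Required for single-turn items. Don't combine with `turns`.
`items[].name`	string	Optional	Display name for the test item (for example, `"Expense policy flow"`). Especially useful for identifying multi-turn items in reports.
`items[].turns`	array	Conditional	Ordered array of turn objects for a multi-turn conversation within a single item. Each turn contains `prompt` and `expected_response`, and optionally `evaluators` and `evaluators_mode`. Don't combine with a top-level `prompt`/`expected_response`. Requires `schemaVersion` `"1.2.0"` or later.
`items[].evaluators`	object	Optional	Per-item evaluator overrides. Each key is an evaluator name; each value is an options object (for example, `{ "citation_format": "mixed" }`). Behavior depends on `evaluators_mode`. Requires `schemaVersion` `"1.2.0"` or later.
`items[].evaluators_mode`	string	Optional	Controls how `items[].evaluators` merges with `default_evaluators`. Use `"extend"` (default) to merge per-item evaluators with the defaults, or `"replace"` to use only the per-item evaluators and ignore the defaults. Requires `schemaVersion` `"1.2.0"` or later.
`items[].testId`	string	Optional	Stable identifier for comparing results across versions (for example, `"REG-001"`).
`items[].category`	string	Optional	Category tag (for example, `"knowledge-base"`, `"tool-usage"`).
`items[].notes`	string	Optional	Free-form notes, such as a linked bug ID.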
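The conditional rules in the table can be checked before a run. The following is a rough sketch, not the tool's own validation: check_dataset and parse_version are hypothetical helpers, and treating a missing schemaVersion as "1.0.0" is an assumption made only for this example.

```python
def parse_version(text):
    """Turn "1.2.0" into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Item-level fields that the table marks as requiring schemaVersion 1.2.0 or later.
V1_2_ITEM_FIELDS = {"turns", "evaluators", "evaluators_mode"}


def check_dataset(dataset):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    # Assumption for this sketch: a missing schemaVersion is treated as 1.0.0.
    version = parse_version(dataset.get("schemaVersion", "1.0.0"))
    if "default_evaluators" in dataset and version < (1, 2, 0):
        problems.append("default_evaluators needs schemaVersion 1.2.0 or later")
    items = dataset.get("items")
    if not isinstance(items, list):
        return problems + ["items must be an array"]
    for index, item in enumerate(items):
        label = f"items[{index}]"
        if not isinstance(item, dict):
            problems.append(f"{label}: must be an object")
            continue
        has_turns = "turns" in item
        has_pair = "prompt" in item or "expected_response" in item
        if has_turns and has_pair:
            problems.append(f"{label}: use turns or prompt/expected_response, not both")
        elif not has_turns and not ("prompt" in item and "expected_response" in item):
            problems.append(f"{label}: single-turn items need prompt and expected_response")
        used = V1_2_ITEM_FIELDS.intersection(item)
        if used and version < (1, 2, 0):
            problems.append(f"{label}: {sorted(used)} need schemaVersion 1.2.0 or later")
        if item.get("evaluators_mode", "extend") not in ("extend", "replace"):
            problems.append(f"{label}: evaluators_mode must be 'extend' or 'replace'")
    return problems
```

Calling check_dataset on a parsed dataset returns an empty list when none of these rules are violated. It doesn't validate the contents of each turn or the evaluator options.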

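Configure evaluators

Schema version 1.2.0 lets you control which evaluators run, and how they're configured, at both the dataset level and the individual item level.

Default evaluators

Use default_evaluators at the top level to specify the evaluators applied to every item in the dataset. Each key is an evaluator name, and each value is an options object. Use an empty object ({}) to apply an evaluator with its default settings.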

{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "prompt": "What is Microsoft Graph?",
      "expected_response": "A unified API endpoint for Microsoft services."
    }
  ]
}

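In this example, every item is scored for relevance and coherence using the default settings.

Per-item evaluator overrides

Use the evaluators field on an individual item (or turn) to add or override evaluators for that specific test. Use evaluators_mode to control how per-item evaluators combine with default_evaluators:

"extend" (default): Merges the per-item evaluators with the defaults. The item is scored by the default evaluators plus any additional evaluators specified on the item.
"replace": Ignores the defaults entirely. Only the evaluators specified on the item are used.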

{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "prompt": "What is Microsoft Graph?",
      "expected_response": "A unified API endpoint for Microsoft services.",
      "evaluators": {
        "Citations": { "citation_format": "mixed" }
      },
      "evaluators_mode": "extend"
    }
  ]
}

In this example, the item is scored against Relevance (default), Coherence (default), and Citations (per-item override, with citation_format set to "mixed").
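The two modes can be pictured as a dictionary merge. This is a minimal sketch of the described behavior, not the tool's implementation; resolve_evaluators is a hypothetical name, and letting per-item options win when the same evaluator appears in both places is an assumption.

```python
def resolve_evaluators(default_evaluators, item):
    """Return the evaluators that would apply to one item or turn.

    Hypothetical helper: models "extend" and "replace" as described in the text.
    """
    overrides = item.get("evaluators", {})
    mode = item.get("evaluators_mode", "extend")  # "extend" is the documented default
    if mode == "replace":
        # Defaults are ignored; only the per-item evaluators remain.
        return dict(overrides)
    if mode == "extend":
        # Assumption: on a name clash, the per-item options take precedence.
        return {**default_evaluators, **overrides}
    raise ValueError(f"unknown evaluators_mode: {mode!r}")


defaults = {"Relevance": {}, "Coherence": {}}
item = {
    "evaluators": {"Citations": {"citation_format": "mixed"}},
    "evaluators_mode": "extend",
}
print(list(resolve_evaluators(defaults, item)))
# prints ['Relevance', 'Coherence', 'Citations']
```

Switching the same item to "replace" would leave only Citations in the result.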

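Complete schema example

The following example combines the schema version 1.2.0 features in a single dataset: top-level defaults, a single-turn item with an evaluator override, and a named multi-turn item with per-turn evaluator configuration.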

{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "prompt": "What is Microsoft Graph?",
      "expected_response": "A unified API endpoint for Microsoft services.",
      "evaluators": {
        "Citations": { "citation_format": "mixed" }
      },
      "evaluators_mode": "extend"
    },
    {
      "name": "Expense policy flow",
      "turns": [
        {
          "prompt": "I spent $250 on dinner. Is that okay?",
          "expected_response": "The per-diem meal allowance is $200."
        },
        {
          "prompt": "What should I do about the overage?",
          "expected_response": "Request manager approval.",
          "evaluators": {
            "ExactMatch": { "case_sensitive": false }
          },
          "evaluators_mode": "replace"
        }
      ]
    }
  ]
}

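Key details in this example:

The first item is a single-turn test. It inherits Relevance and Coherence from default_evaluators and adds Citations through "extend" mode.
The second item is a named multi-turn conversation ("Expense policy flow") with two turns. The first turn inherits the default evaluators. The second turn uses "replace" mode, so only ExactMatch runs and the defaults are ignored for that turn.

Legacy array schema

The tool also accepts a bare array for backward compatibility: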

[
  {
    "prompt": "Your test prompt here",
    "expected_response": "Expected agent response"
  }
]

The CLI automatically upgrades legacy documents (those missing schemaVersion) to the versioned format and writes a timestamped backup.
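To make the upgrade step concrete, here is a rough Python sketch of the same idea: copy the original file to a timestamped backup, then wrap the bare array in a versioned object. The function name, the backup file name pattern, and the choice of "1.0.0" as the target version are all assumptions; the CLI's actual behavior may differ.

```python
import json
import shutil
from datetime import datetime
from pathlib import Path


def upgrade_legacy_file(path, schema_version="1.0.0"):
    """Wrap a bare-array dataset in the versioned shape, keeping a backup.

    Hypothetical sketch: the backup naming scheme here is invented.
    Returns the backup path, or None if the file is not a bare array.
    """
    path = Path(path)
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        return None  # Not a bare array, so this sketch leaves it alone.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.stem}.{stamp}.bak{path.suffix}")
    shutil.copy2(path, backup)  # Preserve the original before rewriting.
    upgraded = {"schemaVersion": schema_version, "items": data}
    path.write_text(json.dumps(upgraded, indent=2), encoding="utf-8")
    return backup
```

Running the function a second time on the same file returns None, because the file is already a versioned object.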

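File naming and location

The evaluation tool automatically discovers dataset files in your project.

Automatic discovery order

When you run runevals, the tool searches for datasets in the following order:

Current directory: prompts.json, evals.json, tests.json
./evals/ subdirectory: prompts.json, evals.json, tests.json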
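The search order above can be sketched in a few lines of Python. find_dataset is a hypothetical helper, and stopping at the first file that exists is an assumption about how the order is applied.

```python
from pathlib import Path

# File names in the documented search order.
CANDIDATE_NAMES = ("prompts.json", "evals.json", "tests.json")


def find_dataset(project_dir="."):
    """Return the first dataset file found, or None.

    Checks the project directory first, then its evals/ subdirectory.
    Assumption: the first existing candidate wins.
    """
    root = Path(project_dir)
    for folder in (root, root / "evals"):
        for name in CANDIDATE_NAMES:
            candidate = folder / name
            if candidate.is_file():
                return candidate
    return None
```

Under this reading, a tests.json in the current directory would be picked before an evals.json inside ./evals/.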

Recommended project structure

my-agent/
├── .env.local                     # Agent configuration
├── .env.local.user                # Secrets (not committed)
├── evals/
│   ├── evals.json                 # Main test suite
│   ├── regression-tests.json      # Regression scenarios
│   └── edge-cases.json            # Edge case testing
└── .evals/
    └── results/                   # Generated reports

Starter file creation

If the tool can't find a dataset file, it prompts you to create a starter file:

⚠️  No prompts file found in current directory or ./evals/

Create a starter evals file with sample prompts? (Y/n):

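Answer Y to create ./evals/evals.json with sample prompts.

Design effective test prompts

Organize tests into categories that reflect the agent behaviors you want to validate.

Knowledge verification

Test whether the agent correctly accesses and uses its knowledge base.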

{
  "prompt": "What are the key features of our enterprise plan?",
  "expected_response": "The enterprise plan includes advanced security, unlimited storage, 24/7 support, and custom integrations."
}

Instruction following

Verify that the agent follows specific instructions.

{
  "prompt": "List the top 3 sales leads from last quarter in bullet points.",
  "expected_response": "• Contoso Ltd - $500K potential\n• Fabrikam Inc - $350K potential\n• Adventure Works - $280K potential"
}

Tool usage

Test whether the agent correctly uses the available tools and plugins.

{
  "prompt": "What meetings do I have tomorrow?",
  "expected_response": "Based on your calendar, you have 3 meetings tomorrow: Team standup at 9 AM, Client presentation at 2 PM, and Project review at 4 PM."
}

Edge cases

Test boundary conditions and unusual inputs.

{
  "prompt": "Show me sales data from the year 1850.",
  "expected_response": "I don't have sales data from 1850 as our company was founded in 1998. Would you like to see data from our earliest available records?"
}

Safety and appropriateness

Ensure that the agent handles inappropriate requests correctly.

{
  "prompt": "Can you write my performance review for me?",
  "expected_response": "I can't write your performance review for you, but I can help you gather your accomplishments, suggest a structure, or provide examples of effective self-assessments."
}

Best practices for test design

Write clear prompts

The following is an example of a clear prompt.

{
  "prompt": "What is the return policy for electronics purchased online?",
  "expected_response": "Electronics purchased online can be returned within 30 days of delivery in original condition with receipt. Some items like opened software have different policies."
}

Avoid ambiguous prompts, such as the following example.

{
  "prompt": "Tell me about returns"
}

Include realistic scenarios

Base tests on real user questions.

{
  "prompt": "I need to schedule a meeting with the sales team next week. What times are they all available?",
  "expected_response": "I can help you find meeting times. The sales team is available Tuesday at 2 PM, Wednesday at 10 AM, or Thursday at 3 PM next week."
}

Cover error handling

Test whether the agent handles errors gracefully.

{
  "prompt": "Show me sales data for customer XYZ-123",
  "expected_response": "I couldn't find a customer with ID XYZ-123. Would you like me to search by company name instead?"
}

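Advanced evaluation scenarios

Multi-turn evaluation mode

Schema version 1.2.0 supports multi-turn conversations. Use the turns array in an item to define an ordered sequence of prompts and expected responses that make up a single conversation flow. Each turn can optionally include its own evaluator configuration.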

{
  "schemaVersion": "1.2.0",
  "default_evaluators": {
    "Relevance": {},
    "Coherence": {}
  },
  "items": [
    {
      "name": "Expense policy flow",
      "turns": [
        {
          "prompt": "I spent $250 on dinner. Is that okay?",
          "expected_response": "The per-diem meal allowance is $200."
        },
        {
          "prompt": "What should I do about the overage?",
          "expected_response": "Request manager approval.",
          "evaluators": {
            "ExactMatch": { "case_sensitive": false }
          },
          "evaluators_mode": "replace"
        }
      ]
    }
  ]
}

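Key details:

Each item that has a turns array is evaluated as a single conversation. Turns are sent in order, and each turn builds on the conversation context of the previous turns.
Use the name field to give multi-turn items a readable label in reports.
You can apply evaluators and evaluators_mode to individual turns. In the preceding example, the second turn uses "replace" mode, so only ExactMatch runs for that turn.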
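One way to picture the turn-by-turn flow is a loop that sends each prompt together with the history accumulated so far. This is only an illustrative sketch: run_conversation, the send_to_agent callable, and the echo_agent stand-in are hypothetical, and how the tool actually passes context to the agent isn't described here.

```python
def run_conversation(item, send_to_agent):
    """Send each turn in order and collect expected and actual responses.

    send_to_agent(history, prompt) -> str is a hypothetical callable that
    receives the earlier (prompt, response) pairs as shared context.
    """
    history = []
    results = []
    for number, turn in enumerate(item["turns"], start=1):
        actual = send_to_agent(list(history), turn["prompt"])
        results.append({
            "turn": number,
            "prompt": turn["prompt"],
            "expected": turn["expected_response"],
            "actual": actual,
        })
        # Later turns see this exchange as part of the conversation.
        history.append((turn["prompt"], actual))
    return results


def echo_agent(history, prompt):
    """Stand-in agent for the sketch: reports which turn it is answering."""
    return f"(turn {len(history) + 1}) {prompt}"


conversation = {
    "name": "Expense policy flow",
    "turns": [
        {"prompt": "I spent $250 on dinner. Is that okay?",
         "expected_response": "The per-diem meal allowance is $200."},
        {"prompt": "What should I do about the overage?",
         "expected_response": "Request manager approval."},
    ],
}
for row in run_conversation(conversation, echo_agent):
    print(row["turn"], row["actual"])
```

The second call receives one earlier exchange in its history, which is what lets a prompt such as "What should I do about the overage?" make sense.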

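Sequential items pattern (schema version 1.0.0)

If you use schema version 1.0.0, you can approximate a multi-turn conversation by designing sequential items whose later prompts reference context established by earlier items. Use a consistent testId prefix and category tag to group and filter related items in the results.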

{
  "schemaVersion": "1.0.0",
  "description": "Multi-turn: SharePoint discovery",
  "items": [
    {
      "prompt": "What SharePoint sites does our team have?",
      "expected_response": "Your team has 3 SharePoint sites: Project Central, Team Resources, and Client Portal.",
      "testId": "MT-001",
      "category": "multi-turn"
    },
    {
      "prompt": "Who has access to the Project Central site?",
      "expected_response": "Project Central has 15 members: 8 from Engineering, 5 from Product, and 2 from Design.",
      "testId": "MT-002",
      "category": "multi-turn"
    }
  ]
}

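Note

With sequential items, each item is evaluated independently. The agent doesn't carry conversation context between items. For true multi-turn evaluation with shared context, use the turns array with schema version 1.2.0.

Per-prompt categorization and scoring

Use the optional category field to group items so that you can analyze scores by dimension (knowledge, tools, safety, edge cases, regression).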

{
  "schemaVersion": "1.0.0",
  "description": "Q1 2026 release test suite",
  "items": [
    {
      "prompt": "What is our company mission?",
      "expected_response": "Our mission is to empower every person and organization...",
      "testId": "KB-001",
      "category": "knowledge-base"
    },
    {
      "prompt": "What meetings do I have today?",
      "expected_response": "You have 2 meetings today...",
      "testId": "TOOL-001",
      "category": "tool-usage"
    }
  ]
}
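Once items carry a category, scores can be averaged per category. The sketch below assumes a hypothetical results shape (a list of rows with category and scores keys) and uses made-up illustrative numbers; the tool's real report format may look different.

```python
from collections import defaultdict
from statistics import mean


def average_by_category(results, metric):
    """Average one metric per category.

    The results shape is a hypothetical one chosen for this sketch:
    [{"category": str, "scores": {metric_name: number}}, ...]
    """
    grouped = defaultdict(list)
    for row in results:
        grouped[row.get("category", "uncategorized")].append(row["scores"][metric])
    return {category: mean(values) for category, values in grouped.items()}


# Illustrative values only, not real evaluation output.
sample_results = [
    {"testId": "KB-001", "category": "knowledge-base", "scores": {"Coherence": 4}},
    {"testId": "KB-002", "category": "knowledge-base", "scores": {"Coherence": 5}},
    {"testId": "TOOL-001", "category": "tool-usage", "scores": {"Coherence": 3}},
]
print(average_by_category(sample_results, "Coherence"))
```

Rows without a category fall into an "uncategorized" bucket rather than being dropped.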

Dataset organization strategies

For large projects, organize tests by category across multiple files.

evals/
├── knowledge-base.json       # Knowledge verification
├── tool-usage.json           # Plugin and action tests
├── conversation-flow.json    # Dialog and multi-turn tests
├── edge-cases.json           # Boundary conditions
└── regression.json           # Previously fixed issues

Run a specific dataset file.

runevals --prompts-file ./evals/knowledge-base.json
runevals --prompts-file ./evals/tool-usage.json
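To run every dataset in a folder rather than one file at a time, a small wrapper script can build one command per file. This sketch uses only the --prompts-file flag shown above and assumes runevals is on the PATH; build_commands and run_all are hypothetical helpers, and the meaning of the tool's exit codes isn't specified here.

```python
import subprocess
from pathlib import Path


def build_commands(folder="./evals"):
    """Return one runevals argument list per *.json file, sorted by name."""
    return [
        ["runevals", "--prompts-file", str(path)]
        for path in sorted(Path(folder).glob("*.json"))
    ]


def run_all(folder="./evals"):
    """Run each dataset in turn and map its file path to the exit code.

    Assumes the runevals executable is available on the PATH.
    """
    exit_codes = {}
    for command in build_commands(folder):
        completed = subprocess.run(command, check=False)
        exit_codes[command[-1]] = completed.returncode
    return exit_codes
```

Call run_all() from the project root to execute the datasets one after another.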

Regression testing

When you fix an issue, add a test to prevent regression. Use testId and notes to link back to your bug tracker.

{
  "prompt": "Issue that was previously broken",
  "expected_response": "Correct behavior after fix",
  "testId": "BUG-456",
  "notes": "Regression test for bug #456"
}

Starter templates

Basic agent test template

{
  "schemaVersion": "1.0.0",
  "description": "Basic agent evaluation tests",
  "items": [
    {
      "prompt": "What can you help me with?",
      "expected_response": "I can help you with [specific capabilities]."
    },
    {
      "prompt": "Who are you?",
      "expected_response": "I'm [agent name], specialized in [domain]."
    }
  ]
}

Knowledge base test template

{
  "schemaVersion": "1.0.0",
  "description": "Knowledge base accuracy tests",
  "items": [
    {
      "prompt": "What is [key concept from your knowledge]?",
      "expected_response": "[Accurate definition from knowledge base]"
    },
    {
      "prompt": "How do I [perform key task]?",
      "expected_response": "[Step-by-step guidance from knowledge]"
    }
  ]
}

Tool usage test template

{
  "schemaVersion": "1.0.0",
  "description": "Plugin and tool integration tests",
  "items": [
    {
      "prompt": "What's on my calendar today?",
      "expected_response": "[Calendar data retrieved via Graph API]"
    },
    {
      "prompt": "Find documents about [topic]",
      "expected_response": "[Search results from SharePoint/OneDrive]"
    }
  ]
}

Interactive and inline testing

Use interactive mode for exploratory testing without a dataset file.

runevals --interactive

For a quick single-prompt test, pass the prompt inline.

runevals --prompts "What is Microsoft Graph?" \
  --expected "Microsoft Graph is the API gateway to Microsoft 365 data and intelligence."

Pass multiple prompts and expected responses the same way.

runevals --prompts "What is Teams?" "What is SharePoint?" \
  --expected "Teams is a collaboration platform" "SharePoint is a content management system"

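Understand evaluation metrics

Each test is automatically scored on multiple dimensions.

Coherence (1-5)

Coherence measures how logical and well structured the response is:

5: Clear, logical, and well organized
3: Somewhat organized, but could be clearer
1: Incoherent or confusing

Groundedness (1-5)

Groundedness measures how well sources and citations support the response:

5: Fully grounded with appropriate citations
3: Partially grounded with some citations
1: No grounding or citations

Similarity (1-5)

Similarity measures how closely the response matches the expected output:

5: The response is semantically equivalent to the expected output
3: The response partially matches the expected output
1: The response doesn't match the expected output

Citations (>= 0)

Citations is a count-based evaluator that counts the number of valid citations in a response. A score of 0 means that no citations are present. Configure a minimum threshold to set the pass/fail bar.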

ExactMatch

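ExactMatch is a string-matching evaluator with a Boolean result. The response passes if it contains the expected string exactly. It supports the case_sensitive option (default: false).

PartialMatch (0.0-1.0)

PartialMatch is a string-matching evaluator that returns a continuous similarity score between 0.0 and 1.0. Use the threshold option to set the minimum score required to pass (default: 0.5).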
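The two string-matching evaluators can be approximated in Python to get a feel for how the options behave. This is a sketch under stated assumptions, not the tool's scoring code: exact_match models "contains the expected string", and partial_match uses difflib.SequenceMatcher from the standard library as a stand-in, because the similarity measure the tool uses isn't described here.

```python
from difflib import SequenceMatcher


def exact_match(response, expected, case_sensitive=False):
    """Pass if the response contains the expected string."""
    if not case_sensitive:
        response, expected = response.casefold(), expected.casefold()
    return expected in response


def partial_match(response, expected, threshold=0.5):
    """Return (score, passed) with a score between 0.0 and 1.0.

    SequenceMatcher.ratio() is a swapped-in similarity measure for this
    sketch; the real evaluator may compute its score differently.
    """
    score = SequenceMatcher(None, response.casefold(), expected.casefold()).ratio()
    return score, score >= threshold


print(exact_match("Please request manager approval.", "Request manager approval."))
# prints True
print(exact_match("Please request manager approval.", "Request manager approval.",
                  case_sensitive=True))
# prints False
```

With case_sensitive left at its default of false, differences in capitalization don't cause a failure, which suits short factual answers such as the one in the expense policy example.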

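Continuous improvement

Review failed tests

When a test scores poorly:

Compare the actual response with the expected response.
Determine whether the expected response needs to be updated.
Check whether the agent needs more training data or instructions.
Verify that the tools are configured correctly.

Track scores over time

Save test results so that you can compare them across versions.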

runevals --output ./evals/results/v1.2.0-results.json

Last updated on 2026-05-02