Derive quality signals

Quality signals provide the vocabulary for diagnosing what works and what doesn't in your agent's responses. Instead of starting with a generic checklist, derive quality signals from patterns you observe during evaluation. This approach ensures your signals reflect what actually matters for your specific agent.

Why quality signals matter

With quality signals, you can diagnose failures faster ("failed on Personalization" is more actionable than "the answer was wrong"), track improvement by signal over time, and communicate clearly with stakeholders. When someone says "the agent isn't good enough," you can respond with specifics: "Policy accuracy is at 95%, but Personalization dropped to 75% after the last update."
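
Tracking by signal is mechanical once each graded test case records which signals it passed and failed. Here is a minimal Python sketch of computing per-signal pass rates; the result records and signal names are illustrative, not a prescribed schema:

```python
from collections import defaultdict

# Hypothetical graded results: each case lists the signals it passed and failed.
results = [
    {"case": "ESS-001", "passed": ["Policy accuracy", "Source attribution"], "failed": []},
    {"case": "ESS-003", "passed": ["Policy accuracy"], "failed": ["Personalization"]},
    {"case": "ESS-004", "passed": [], "failed": ["Personalization"]},
]

# Tally per-signal outcomes, then report a pass rate for each signal.
tallies = defaultdict(lambda: {"pass": 0, "fail": 0})
for result in results:
    for signal in result["passed"]:
        tallies[signal]["pass"] += 1
    for signal in result["failed"]:
        tallies[signal]["fail"] += 1

for signal, tally in sorted(tallies.items()):
    total = tally["pass"] + tally["fail"]
    print(f"{signal}: {tally['pass'] / total:.0%} pass rate ({total} cases)")
```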

Why not start with a generic quality checklist?

A list like "Accuracy, Completeness, Relevance, Tone, Safety" sounds reasonable, but it's too abstract to be actionable. What does "accuracy" mean for a legal research agent versus a creative writing assistant? The quality signals that matter—and how you measure them—depend entirely on what your agent does and whom it serves.

Instead of choosing quality signals upfront, let your evaluation results tell you what matters. When you run test cases against your agent (Stage 2 of the evaluation framework), patterns emerge from the successes and failures. Those patterns become your quality signals.

How quality signals emerge

As you iterate through baseline testing, you notice recurring themes in your results. Some test cases fail because the agent gives outdated information. Others fail because the agent ignores the user's context. Still others succeed specifically because the agent cites its sources or provides clear next steps. Each of these patterns points to a quality signal worth naming and tracking.

Employee Self-Service Agent: From patterns to signals

Here's how the Employee Self-Service Agent team derived quality signals from baseline results:

| Observation | Quality signal |
|---|---|
| ESS-001, ESS-002 passed: Correct policy info | Policy accuracy: Is the information correct? |
| ESS-001 passed: Cited the handbook | Source attribution: Does it cite the source? |
| ESS-003, ESS-004 failed: Ignored user context | Personalization: Does it use the employee's context? |
| ESS-005, ESS-006 passed; ESS-009 initially failed | Escalation appropriateness: Does it know when to route? |
| ESS-007 passed; ESS-008 failed | Privacy protection: Does it protect sensitive data? |
| ESS-001 passed: Told user how to check balance | Action enablement: Does it give next steps? |
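
One lightweight way to perform this derivation is to tag each baseline result with a short theme and promote recurring themes to candidate signals. A minimal sketch, using hypothetical case notes modeled on the table above:

```python
from collections import Counter

# Hypothetical baseline notes: (case ID, outcome, annotator's one-line theme).
baseline_notes = [
    ("ESS-001", "pass", "correct policy info"),
    ("ESS-002", "pass", "correct policy info"),
    ("ESS-003", "fail", "ignored user context"),
    ("ESS-004", "fail", "ignored user context"),
    ("ESS-008", "fail", "shared sensitive data"),
]

# Themes that recur across cases are candidates for named quality signals.
theme_counts = Counter(theme for _, _, theme in baseline_notes)
candidates = [theme for theme, count in theme_counts.items() if count > 1]
print(candidates)  # ['correct policy info', 'ignored user context']
```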

Quality signals with concrete examples

Once you name your quality signals, make them concrete by defining what passing and failing look like for each one.

| Quality signal | Pass looks like | Fail looks like |
|---|---|---|
| Policy accuracy | "15 days PTO" (correct) | "10 days PTO" (outdated) |
| Source attribution | "Per the Employee Handbook..." | No source mentioned |
| Personalization | UK holidays for a UK employee | US holidays for a UK employee |
| Escalation appropriateness | Routes Family and Medical Leave Act (FMLA) questions to HR | Tries to explain FMLA rules itself |
| Privacy protection | "I can't share salary info" | Shares salary details or hesitates |
| Action enablement | "Check your balance in Workday" | Answers but gives no next step |
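
Where a pass/fail definition is crisp enough, it can even be made executable. Below is a minimal sketch with illustrative heuristic checks and a made-up response string; nuanced signals such as personalization or tone usually still need a human or LLM judge:

```python
# Hypothetical heuristic checks, one per signal; each takes the agent's
# response text and returns True on pass. All strings are illustrative.
def policy_accuracy(response: str) -> bool:
    # Assumes the current handbook grants 15 days of PTO (illustrative fact).
    return "15 days" in response

def source_attribution(response: str) -> bool:
    return "Employee Handbook" in response

def action_enablement(response: str) -> bool:
    return "Workday" in response

# Grade one response against the signals that apply to its test case.
response = "Per the Employee Handbook, you have 15 days of PTO. Check your balance in Workday."
for name, check in [("Policy accuracy", policy_accuracy),
                    ("Source attribution", source_attribution),
                    ("Action enablement", action_enablement)]:
    print(f"{name}: {'pass' if check(response) else 'fail'}")
```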

These signals are specific to the Employee Self-Service Agent. A coding assistant would have entirely different signals, such as code correctness, security best practices, and explanation clarity. A customer support agent might track resolution rate and sentiment. Your signals should reflect your agent's unique purpose.

Next step

Learn how to build a repeatable, data-driven evaluation loop that improves your agent with every iteration.