Adventure Works' individual evaluation shows the returns agent scores 9.2/10 on coherence and the order management agent scores 9.1/10. But when customers try to return one item while upgrading another in the same order, 35% report that the combined resolution is confusing or contradictory. Which metric is the individual evaluation failing to capture?
Each agent's latency: high latency in either agent would explain the poor customer experience.
Journey coherence score—whether the multiple agent responses across the interaction form a logically consistent and noncontradictory experience for the customer.
Task completion rate for each individual step—the system should measure whether each agent completed its specific subtask independently.
You run an LLM judge on 50 human-labeled examples. Human evaluators scored task completion 8/10 on average for the labeled set; your judge scores the same examples 6.1/10 on average with a Pearson correlation of 0.41. What does this indicate, and what should you do?
A correlation of 0.41 is acceptable for LLM evaluation. Proceed with the judge as configured, accepting the systematic 2-point underestimation.
The judge isn't well-aligned with human judgment. Revise the scoring rubric by reviewing cases where the judge and humans diverged most, clarifying the criteria for mid-range scores, and recalibrating against the labeled set.
The human reviewers were inconsistent. Run the labeling again with a different group of humans to get more reliable ground truth for calibration.
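The calibration check in this question can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the function names `pearson` and `alignment_report` and the sample scores are assumptions made for the example. It reports the correlation between judge and human scores, the judge's average offset, and the examples where the two disagree most, which are the cases to review first when revising a rubric.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


def alignment_report(human, judge, top_k=3):
    """Summarize how closely judge scores track human scores.

    Returns the correlation, the mean offset (judge minus human), and the
    indices of the top_k examples with the largest absolute disagreement.
    """
    by_gap = sorted(
        range(len(human)),
        key=lambda i: abs(judge[i] - human[i]),
        reverse=True,
    )
    return {
        "pearson": pearson(human, judge),
        "mean_bias": mean(judge) - mean(human),
        "largest_gaps": by_gap[:top_k],
    }


# Hypothetical scores for illustration only; not data from the module.
human_scores = [8, 9, 7, 5]
judge_scores = [6, 7, 5, 9]
print(alignment_report(human_scores, judge_scores))
```

A low correlation and a large mean offset point to different problems. The offset alone could be corrected by shifting scores, but a weak correlation means the judge ranks examples differently from humans, which is why reviewing the largest gaps comes first.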
After deploying a new version of Adventure Works' product search agent, the regression test suite shows overall quality scores are at or above baseline. But the test suite was last updated six months ago. The product catalog has since added 3,000 new SKUs including a new product category. What risk does this situation present?
No significant risk—the regression suite passed, confirming the agent performs correctly on all relevant scenarios.
The test suite doesn't cover the new product category, so passing regression tests doesn't validate agent quality on these scenarios. The synthetic dataset needs quarterly review to add coverage for catalog changes.
The test suite is likely fine since product search uses general retrieval patterns that generalize to new product categories without specific test cases.
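The coverage gap described in this question can be detected mechanically. The sketch below is a simplified illustration: it assumes both the catalog and the test cases carry a `category` field, and the function name `uncovered_categories` and the sample data are hypothetical. It lists catalog categories that no regression test exercises.

```python
def uncovered_categories(catalog, test_cases):
    """Return catalog categories that no test case covers, sorted by name.

    Both arguments are lists of dicts with a "category" key (an assumed
    schema for this example).
    """
    catalog_categories = {item["category"] for item in catalog}
    tested_categories = {case["category"] for case in test_cases}
    return sorted(catalog_categories - tested_categories)


# Hypothetical catalog and test suite for illustration only.
catalog = [
    {"sku": "A1", "category": "bikes"},
    {"sku": "B2", "category": "helmets"},
    {"sku": "C3", "category": "e-scooters"},  # category added after the suite was written
]
test_cases = [
    {"query": "lightweight road bike", "category": "bikes"},
    {"query": "kids helmet", "category": "helmets"},
]

print(uncovered_categories(catalog, test_cases))
```

Running a check like this whenever the catalog changes shows when a passing regression suite no longer reflects what the agent is asked to search.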