Design root cause analysis for agent failures

Completed

Application Insights collects distributed trace data across every span in a multi-agent pipeline, giving engineers the observability foundation needed to perform systematic root cause analysis. KQL queries against this trace data reconstruct the precise sequence of agent calls, model responses, and tool interactions that connect an underlying change to a customer-visible failure.

Root cause analysis in multi-agent systems requires structured, systematic investigation rather than intuition-driven guessing. When Adventure Works' checkout completion rate dropped 15%, the immediate symptom—customers abandoning at payment step—pointed to the payment agent. But investigating the payment agent in isolation revealed nothing wrong. The actual root cause was three services upstream. The pricing agent had been updated with a new model deployment that occasionally returned prices formatted with currency symbols ("$99.99" instead of "99.99"). The payment processing API rejected these non-numeric values, causing silent failures that only surfaced when customers attempted checkout. Finding this causal chain required systematic hypothesis testing across the entire agent ecosystem rather than focusing on the obvious failure point.

The multi-agent RCA challenge and symptom vs. root cause

Traditional root cause analysis follows exception stack traces backward from the error to the triggering code. In a monolithic application, if function C throws an exception, you examine function C, then function B that called C, then function A that called B. The call stack provides a clear path to root cause. Multi-agent systems break this model in four key ways:

  • The agent that reports an error often isn't the agent that caused the problem.
  • Errors propagate across service boundaries with lossy context.
  • Cascading failures create multiple symptoms from a single root cause.
  • The "call stack" spans distributed traces across multiple services with inconsistent logging.

Adventure Works' debugging framework enforces the symptom vs. root cause distinction. The symptom is what monitoring detected: checkout completion rate dropped 15%. The observable failure is what customers experienced: payment processing errors at checkout. The proximate cause is the immediate technical failure: payment API rejecting price values. The root cause is the underlying change that triggered the failure chain: pricing agent model deployment with new formatting behavior. Effective RCA identifies the root cause, not just the symptom, because addressing symptoms (adding error handling to payment agent) doesn't prevent recurrence if the root cause (pricing agent output format) remains unfixed.

Structured hypothesis framework for systematic investigation

Adventure Works trains their on-call engineers to approach each incident with a structured hypothesis framework that tests likely failure modes systematically before investigating exotic theories. The framework prioritizes hypotheses by frequency—most production incidents fall into a few common categories, so testing those first accelerates time-to-resolution.

Hypothesis 1: Model deployment change tests whether a model version update caused the failure. Query deployment logs for model deployment changes in the 24 hours before the incident. If the pricing agent was updated from gpt-4o-2024-11-20 to gpt-4o-2025-01-21, that's a high-probability root cause. To test, run the failing request against both model versions in replay mode and compare outputs. If the new version produces different formatting, the hypothesis is confirmed. To fix, either revert the deployment or update the downstream consumer (payment agent) to handle both formats.

Hypothesis 2: Prompt regression examines whether a system prompt update changed agent behavior in unintended ways. Query version control for prompt file changes in the past week. If the pricing agent's prompt was updated to "always include currency context for clarity" without specifying format, the agent might interpret that as permission to add dollar signs. To test, load the previous prompt version into replay and check if outputs match the expected format. To fix, revert the prompt or refine it with explicit formatting constraints ("return prices as numeric values only, without currency symbols").

Hypothesis 3: Tool failure investigates whether downstream APIs or data sources are returning incorrect data or timing out. Check API status dashboards for the inventory service, product catalog API, and payment gateway. Query Application Insights for increased latency or error rates on tool call spans. If the product catalog API latency jumped from 200 ms to 3,000 ms, that increase could cause timeouts in the pricing agent's product lookup calls, leading to incomplete price calculations. To fix, address API performance issues or adjust timeout thresholds.

Hypothesis 4: Orchestration logic regression tests whether agent routing or delegation logic is sending requests to the wrong agents. If request type classification was recently updated, it might route certain checkout requests to a new agent that doesn't handle them correctly. Query for recent changes to routing rules or delegation logic. To test, manually route a failing request to the expected agent and validate success. To fix, correct the routing logic or add fallback handling.

Hypothesis 5: Configuration change examines whether environment variables, feature flags, connection strings, or service configuration changed. Azure App Configuration version history shows all recent configuration changes. If a feature flag enabled the new payment processing flow but the feature is incompletely implemented, that explains failures. To test, toggle the flag off in staging and validate that failures stop. To fix, either revert the flag or complete the feature implementation.

Hypothesis Evidence to check Test method Typical fix
Model deployment Deployment logs, model version changes Replay with both versions Revert or update consumers
Prompt regression Git history, prompt version changes Replay with old prompt Revert or refine prompt
Tool failure API dashboards, tool call latency/errors Check tool health + recent changes Fix API or adjust timeouts
Orchestration logic Routing rule changes, delegation updates Manual routing to expected agent Correct routing logic
Configuration change App config history, feature flags Toggle config in staging Revert or complete feature

The framework is applied sequentially: test hypothesis 1, if eliminated move to hypothesis 2, continue until a hypothesis is confirmed. Don't test all hypotheses in parallel—that wastes effort investigating low-probability causes while the system is degrading.

Timeline reconstruction with distributed traces

Once a hypothesis is confirmed, understanding the timeline of how the failure propagated through the system helps prevent similar cascading failures in the future. Adventure Works uses OpenTelemetry traces to reconstruct the causal timeline: which agent call preceded the failure, how long did each step take, when did latency or error rates start increasing, and what was the sequence of events from root cause to customer-visible symptom?

Timeline reconstruction queries Application Insights for all checkout transactions in the incident window (3:00 AM - 4:00 AM), groups them by success/failure status, and compares trace characteristics between successful and failed requests. The differential analysis reveals: failed requests spent 3,200 ms in the pricing agent vs. 450 ms in successful requests (indicating pricing was the problem area), failed requests show tool call errors in the pricing agent's product lookup (tool timeout), successful requests before 3:15 AM worked fine, failures started at 3:17 AM and peaked at 3:30 AM.

The timeline correlates with deployment logs: pricing agent model deployment to gpt-4o-2025-01-21 completed at 3:15 AM. The new model interacts poorly with the pricing agent's system prompt, producing verbose explanations that exceed the prompt's expected token budget, causing context truncation that corrupts price formatting. The causal chain: model deployment (3:15 AM) → verbose pricing explanations → context truncation → malformed price values → payment API rejections → checkout failures (3:17-4:00 AM, 15% of traffic).

// Application Insights KQL query to reconstruct incident timeline
let incident_start = datetime(2026-04-07 03:00:00);
let incident_end = datetime(2026-04-07 04:00:00);

// Get all checkout traces during incident window
let checkout_traces = requests
| where timestamp between (incident_start .. incident_end)
| where name == "POST /checkout/complete"
| extend trace_id = tostring(customDimensions.trace_id)
| extend success = success == true
| project timestamp, trace_id, success, duration;

// Get pricing agent spans for failed vs. successful checkouts
let pricing_spans = dependencies
| where timestamp between (incident_start .. incident_end)
| where name == "pricing_agent.calculate_total"
| extend trace_id = tostring(customDimensions.trace_id)
| project timestamp, trace_id, duration, success, resultCode;

// Join to compare pricing behavior in failed vs. successful traces
checkout_traces
| join kind=inner pricing_spans on trace_id
| summarize 
    avg_pricing_duration = avg(duration1),
    pricing_error_rate = countif(success1 == false) * 100.0 / count(),
    sample_count = count()
    by success
| project 
    checkout_outcome = iff(success, "successful", "failed"),
    avg_pricing_duration_ms = round(avg_pricing_duration, 0),
    pricing_error_rate_pct = round(pricing_error_rate, 2),
    sample_count

// Output:
// checkout_outcome  | avg_pricing_duration_ms | pricing_error_rate_pct | sample_count
// successful        | 450                      | 1.2                    | 8234
// failed            | 3200                     | 34.5                   | 1456

The KQL analysis confirms the hypothesis: failed checkouts have 7x longer pricing duration and 30x higher pricing error rate compared to successful checkouts. This differential is the smoking gun.

Canary analysis and failure correlation

Understanding whether a failure is isolated or systemic determines response urgency and scope. Adventure Works uses canary analysis to determine failure spread: is the failure isolated to one customer? One request type? One data center? Or is it systemic across all traffic?

Canary queries segment failed requests by: customer attributes (tier, region, account age), request characteristics (product category, order value, payment method), infrastructure dimensions (agent pod ID, model deployment endpoint, data center). If failures are concentrated in one segment, the root cause likely affects only that segment. If failures are distributed uniformly, the root cause is systemic.

Correlation analysis tests whether failure rate correlates with any deployment event, configuration change, or external dependency status. Start by querying how many deployments occurred in the 6 hours before the incident, whether any feature flags changed, and whether any external APIs experienced degradation. Time-correlate all change events with the incident start time. If the pricing agent deployment at 3:15 AM exactly precedes failure spike at 3:17 AM, correlation is strong.

Documenting findings in blameless postmortems

Root cause investigation produces a structured incident report that captures timeline, hypothesis tested, findings, and remediation. Adventure Works follows the blameless postmortem format: the report focuses on system failures and missing guardrails rather than individual errors. The document includes:

  • Incident summary: What happened, when, how many customers affected, business impact
  • Timeline: Sequence of events from trigger to detection to resolution
  • Root cause: The underlying issue (pricing agent model deployment with verbose output)
  • Contributing factors: Why the issue wasn't caught earlier (lack of output format validation in pricing agent, no canary deployment process for model updates)
  • Immediate remediation: What was done to restore service (reverted model deployment)
  • Preventive actions: What will prevent recurrence (add output format validation, implement canary deployments for model updates, expand regression tests with price format checks)

The postmortem is shared with engineering, leadership, and customer support. Contributing factors become backlog items with owners and due dates. The monthly reliability review tracks completion of preventive actions from all postmortems to ensure promised improvements actually ship.

Systematic hypothesis testing, timeline reconstruction, and canary analysis give engineers a repeatable path from symptom to root cause. The blameless postmortem closes the loop, converting each incident into documented learnings that prevent recurrence. Once this diagnostic discipline is in place, recurring and predictable failures become candidates for automated detection and remediation.

Key takeaways

  • Symptom vs. root cause distinctions are critical in multi-agent systems where the agent reporting an error is often not the agent that caused it, and cascading failures create misleading symptoms.
  • Structured hypothesis framework tests the five most common failure modes (model deployment change, prompt regression, tool failure, orchestration logic regression, configuration change) in priority order for efficient diagnosis.
  • Timeline reconstruction uses distributed traces and KQL queries to correlate failed vs. successful requests, revealing differential behavior patterns that pinpoint the problem area.
  • Canary analysis segments failures by customer attributes, request characteristics, and infrastructure dimensions to determine whether an issue is isolated or systemic.
  • Blameless postmortems document findings with a focus on system failures and missing guardrails, producing assigned action items that prevent recurrence.