During a production incident replay, the replay runner detects that the agent's production output was 'approve refund for order #1234' but the replay output is 'request additional information from customer.' The replay uses the exact same customer input. What does this indicate?
The replay captured incorrect inputs: the production and replay inputs must be different, causing the output divergence.
The agent's behavior is nondeterministic, possibly because the model deployment used a different version or temperature setting in production than in the replay environment.
The replay produced the correct response, and the production output was a bug. The replay can replace production to fix the issue.
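One way to investigate a divergence like the one in this question is to compare the run configuration recorded for production against the configuration the replay used. The following is a minimal sketch, not a real replay runner API: the `RunRecord` type, its fields, and the `diagnose_divergence` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one agent run (production or replay)."""
    input_text: str
    output_text: str
    model_version: str
    temperature: float


def diagnose_divergence(production: RunRecord, replay: RunRecord) -> list[str]:
    """Return the names of the fields that could explain an output divergence."""
    # If the inputs differ, the replay is not a faithful reproduction.
    if production.input_text != replay.input_text:
        return ["input_text"]
    # Identical outputs mean there is nothing to explain.
    if production.output_text == replay.output_text:
        return []
    # Same input, different output: look for configuration drift.
    findings = []
    if production.model_version != replay.model_version:
        findings.append("model_version")
    if production.temperature != replay.temperature:
        findings.append("temperature")
    # With matching configuration, sampling randomness is the remaining suspect.
    return findings or ["sampling"]
```

A divergence with identical inputs does not by itself show which output is correct; it only narrows the search to the model configuration or sampling behavior.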
At 2:15 AM, Adventure Works' order management agent begins returning errors for 12% of checkout requests. The distributed traces show the errors start consistently 47 ms into the order validation tool call. A deployment change was made to the validation service at 2:00 AM. Which root cause analysis (RCA) hypothesis should be tested first?
Model degradation hypothesis: the language model may have received a bad update that affects order interpretation.
Tool failure hypothesis: a deployment to the order validation service at 2:00 AM likely introduced a bug that causes the 12% error rate, correlating with the timing and the 47 ms error location in the trace.
Orchestration logic hypothesis: the orchestrator may have been misconfigured to route requests to the wrong agent after the 2:00 AM change.
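The reasoning behind prioritizing hypotheses in this scenario can be sketched as a ranking: prefer recent changes to the component where the trace shows the failure, then prefer changes closest in time to the error onset. The `Change` type, the `rank_hypotheses` function, the component names, the one-hour window, and the calendar date below are all illustrative assumptions, not part of any real tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a deployment or configuration change."""
    component: str
    deployed_at: datetime


def rank_hypotheses(
    changes: list[Change],
    error_onset: datetime,
    failing_component: str,
    window: timedelta = timedelta(hours=1),  # assumed lookback window
) -> list[Change]:
    """Order recent changes by how well they match the failure evidence."""
    # Keep only changes that landed shortly before the errors began.
    recent = [
        c for c in changes
        if timedelta(0) <= error_onset - c.deployed_at <= window
    ]
    # Changes to the failing component sort first, then the most recent ones.
    return sorted(
        recent,
        key=lambda c: (c.component != failing_component, error_onset - c.deployed_at),
    )


# Illustrative data: the date and the second change are made up.
onset = datetime(2024, 1, 1, 2, 15)
changes = [
    Change("orchestrator-config", datetime(2024, 1, 1, 1, 30)),
    Change("order-validation-service", datetime(2024, 1, 1, 2, 0)),
]
ranked = rank_hypotheses(changes, onset, "order-validation-service")
```

With this data, the first entry in `ranked` is the validation service deployment, because it touches the component where the trace shows the error and it is the change closest to the onset.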
Adventure Works' post-mortem for a 3-hour P2 incident reveals that three engineers were paged simultaneously, each investigated the same symptom independently without communicating, and the incident resolution was delayed by 90 minutes due to duplicated effort and contradictory remediation attempts. What process improvement does this indicate?
Reduce the on-call rotation to a single engineer to prevent the communication overhead of multi-person response.
Implement an incident command structure with a designated Incident Commander who coordinates all investigation threads, makes remediation decisions, and prevents duplicated work.
Implement a 30-minute wait period before paging multiple engineers to allow the first responder to assess whether more help is needed.