Establish incident response and post-mortem processes
Azure Monitor alert notifications serve as the starting point for structured incident response, providing the severity signal that determines escalation paths, on-call routing, and stakeholder communication for multi-agent failures. These alerts integrate with tools such as PagerDuty and Slack to initiate coordinated response processes that replace reactive firefighting with defined roles, communication protocols, and post-incident learning.
Incident response processes turn chaotic emergency reactions into coordinated, effective crisis management. When Adventure Works' checkout completion rate drops 15% at 3:47 AM, the response requires more than technical debugging—it needs incident command structure with clear roles, communication protocols that keep stakeholders informed, and post-incident learning processes that prevent recurrence. Without defined processes, incidents devolve into uncoordinated firefighting where multiple engineers duplicate effort, stakeholders don't get status updates, and the team moves on without capturing lessons learned.
Incident severity classification for multi-agent systems
Incident severity determines response urgency, required expertise, and communication expectations. Adventure Works defines four severity levels tailored to their multi-agent customer service platform's business impact.
P1 (Critical) incidents indicate customer checkout is completely blocked with error rate > 5%, authentication system is failing and customers can't log in, or data loss is detected (customer data corrupted or missing). P1 incidents have maximum business impact—revenue is stopping, customers are completely blocked, and potential for lasting reputational damage exists. P1 response:
- Page all on-call engineers immediately.
- Engage incident commander within 5 minutes.
- Provide executive updates every 30 minutes.
- Invoke vendor emergency support if needed.
- Authorize cost overruns to restore service quickly.
P2 (Major) incidents include checkout functionality is degraded but not fully blocked (completion rate down 10-50%), agents producing incorrect outputs > 10% of the time, or critical agent is offline but fallback agent is handling traffic. P2 incidents degrade service quality but don't completely block customers. P2 response:
- Notify on-call engineer via Slack.
- Establish diagnosis within 30 minutes.
- Provide stakeholder updates every 2 hours.
- Schedule remediation within 4 hours.
P3 (Minor) incidents involve single agent underperforming (latency increased 3x or quality degraded), noncritical feature is broken but affecting < 5% of traffic, or recommendations quality declined noticeably but checkout still works. P3 incidents create poor experience for some customers but don't threaten core functionality. P3 response:
- Create an incident ticket.
- Assign to the relevant team.
- Fix within the next business day.
- Minimal stakeholder communication required.
P4 (Warning) incidents are leading indicators that haven't impacted customers yet: cache hit rate dropped significantly, model confidence scores trending downward over days, disk space growing faster than expected, or external API latency increasing but still within SLA. P4 incidents require monitoring and preventive action but not emergency response. P4 response:
- Create a backlog item.
- Investigate during business hours.
- Implement preventive measures before the issue becomes P3 or higher.
| Severity | Impact | Response time | Update frequency | Escalation |
|---|---|---|---|---|
| P1 | Customers blocked completely | < 5 min | Every 30 min | Immediate page |
| P2 | Degraded service quality | < 30 min | Every 2 hours | Slack notification |
| P3 | Isolated agent issues | Next business day | Daily | Ticket creation |
| P4 | Warning indicators | Business hours | Weekly | Backlog item |
Severity classification happens during initial alert triage. The on-call engineer assesses business impact based on: how many customers are affected, are they completely blocked or just degraded, is revenue impacted, and is the issue escalating or stable? If any uncertainty about severity exists, classify higher—it's safer to downgrade from P1 to P2 than to underestimate a P1 as P3 and delay response.
On-call rotation and cross-functional expertise
Multi-agent systems require diverse expertise during incident response. A customer service platform incident can span multiple domains—model behavior, infrastructure failures, networking issues, or business logic bugs—each requiring a different specialty. Adventure Works addresses this with a multi-tier on-call rotation that combines specialists.
Primary on-call (L1) includes one AI engineer who understands model behavior and prompts, one platform engineer who knows infrastructure and deployment processes, and one customer service product manager who assesses business impact and communicates with stakeholders. The L1 rotation follows 7-day shifts with handoff meetings where outgoing on-call briefs incoming on-call on any developing issues.
Secondary on-call (L2) includes senior engineers from each specialty area plus the engineering manager. L2 provides escalation when L1 needs architectural decisions, vendor engagement, or authorization for high-impact changes (rollback production deployment, disable major feature). L2 engineers aren't paged immediately; they're notified when L1 escalates or when P1 incidents extend past 1 hour.
Service owner (engineering director + product lead) serves as ultimate escalation for major incidents requiring executive communication, emergency vendor support contracts, or significant cost authorization. Service owners are engaged for P1 incidents > 2 hours or P2 incidents affecting major customers.
On-call responsibilities are defined explicitly:
- L1 AI engineer diagnoses model and prompt issues, runs evaluation comparisons, and performs prompt rollbacks.
- L1 platform engineer investigates infrastructure metrics, executes runbooks, and manages deployments.
- L1 product manager assesses customer impact, drafts status updates, and coordinates communication with the customer success team.
Clear role definition prevents duplication and gaps where everyone assumes someone else is handling a task.
Incident command structure and communication protocols
Major incidents require coordination across multiple responders. Adventure Works uses the Incident Command System (ICS) structure from emergency management: one Incident Commander coordinates the overall response, one Technical Lead drives the diagnosis and remediation work, and one Communications Lead handles stakeholder updates.
For P1 and P2 incidents, the Incident Commander (IC) role is assigned to the most senior on-call engineer or escalates to L2 if needed. The IC makes go/no-go decisions on proposed fixes, coordinates across teams, manages escalation to vendors or executives, and maintains the incident timeline. The IC doesn't personally debug the issue—they coordinate the response while the Technical Lead digs into root cause.
The Technical Lead is the engineer actively investigating and fixing the issue. For model-related incidents, this is the AI engineer; for infrastructure issues, the platform engineer. The Technical Lead proposes fixes to the IC, implements approved changes, and reports status every 15-30 minutes. Multiple engineers can support the Technical Lead, but one person is designated as the lead to prevent confusion about who's driving.
The Communications Lead is typically the product manager on L1 or a customer success manager for customer-facing communication. This role drafts status page updates, notifies affected customers, provides executives with business impact summaries, and coordinates with customer support about expected resolution time. Separating communication from investigation ensures customers and stakeholders get timely updates even when engineers are deep in debugging.
## Incident Commander Quick Reference Card
### Your Role
- Coordinate overall response—don't debug personally
- Make deployment/rollback decisions
- Manage escalations to L2, vendors, executives
- Maintain incident timeline in Slack thread
- Declare resolution when service is restored
### First 5 Minutes
1. Acknowledge page in PagerDuty
2. Verify incident severity (upgrade/downgrade as needed)
3. Start incident Slack thread: #incident-2026-04-07-checkout
4. Assign Technical Lead (AI engineer for model issues, platform eng for infra)
5. Assign Communications Lead (PM or customer success)
6. Request initial assessment from Technical Lead
### Every 15-30 Minutes
- Request status update from Technical Lead
- Ask: What's the current hypothesis? What are you testing? What do you need?
- Update incident timeline with key findings
- Coordinate with Communications Lead on stakeholder updates
### Decision Framework
- Rollback deployments: Approve immediately for P1, validate impact for P2
- Feature disablement: Approve if restores service, document customer impact
- Vendor escalation: Engage immediately for P1 if vendor issue confirmed
- Cost overruns: Authorize up to $10K for P1 without exec approval
### Closing the Incident
- Service metrics returned to baseline for 30+ minutes
- Root cause identified (at least preliminary)
- Communications Lead notified customers of resolution
- Schedule postmortem meeting within 48 hours
- Hand off to Technical Lead to complete writeup
Blameless postmortem process and template
Within 48 hours of incident resolution, Adventure Works conducts a blameless postmortem meeting with all responders plus relevant stakeholders. The goal: understand what happened, why detection/response worked or failed, and what changes prevent recurrence. "Blameless" means the discussion focuses on system failures and missing safeguards, not on individual mistakes. Finger-pointing prevents honest discussion and discourages future transparency.
The postmortem document follows a standard template:
Incident summary: One-paragraph description of what happened, when, how many customers affected, business impact (revenue loss, customer complaints, service level agreement (SLA) breach). For example, "On April 7, 2026 at 3:17 AM UTC, checkout completion rate dropped from 94% to 79% due to payment processing failures. ~1,450 customers experienced checkout errors over 43 minutes before resolution. Estimated revenue impact: $87,000 in abandoned carts."
Timeline: Chronological sequence of events from trigger to detection to resolution. Include timestamps for key events: when the root cause was introduced (deployment, config change), when customers were first affected, when monitoring alerted on-call, when investigation began, when the hypothesis was confirmed, when remediation was deployed, and when service was restored. The timeline uses distributed trace data and deployment logs for accuracy.
Root cause: The underlying issue that triggered the incident. For example, "Pricing agent model deployment (gpt-4o-2025-01-21) at 3:15 AM UTC produced verbose pricing explanations that exceeded prompt token budget, causing context truncation that corrupted price formatting. Payment API rejected non-numeric price values, causing 503 errors."
Contributing factors: Why the issue wasn't caught earlier or why it spread. For example, "No output format validation in pricing agent's response parser. No canary deployment process for model updates (full deployment immediately to 100% of traffic). Regression test suite didn't include edge cases with long product descriptions that trigger verbose explanations."
What went well: Positive aspects of the response. For example, "Monitoring detected the issue within 2-minutes thanks to anomaly detection on checkout completion rate. On-call engineer diagnosed root cause in 18 minutes using replay debugging. Rollback execution completed in under 5 minutes."
What didn't go well: Challenges or failures in response. For example, "Initial hypothesis (payment gateway failure) delayed diagnosis by 10 minutes. No clear escalation path for model deployment rollbacks—Incident Commander wasn't sure who had authority. Customer success wasn't notified until 25 minutes into incident."
Action items: Specific, assigned improvements with owners and due dates. For example:
- Add output format validation to pricing agent (Owner: @jane-ai-eng, Due: 2026-04-14)
- Implement canary deployments for model updates (Owner: @bob-platform-eng, Due: 2026-04-21)
- Expand regression suite with product description length edge cases (Owner: @jane-ai-eng, Due: 2026-04-14)
- Create model rollback runbook with authority clarification (Owner: @alice-ic, Due: 2026-04-10)
The postmortem document is stored in a searchable repository (Confluence, GitHub wiki) and shared company-wide. Historical postmortems are invaluable references when similar incidents occur—"This reminds me of the April checkout incident, let me review that postmortem."
Reliability improvement tracking and metrics
Each postmortem produces reliability improvements: new monitoring coverage, automated remediation runbooks, architectural changes, or test cases. Adventure Works tracks these action items in Jira with a "reliability-improvement" tag and reviews completion monthly. The monthly reliability review agenda covers four areas:
- Reviewing all action items from recent postmortems and assessing completion status (done, in progress, or blocked).
- Identifying patterns across multiple incidents—for example, three incidents in six months related to model formatting may indicate a need for systematic output validation across all agents.
- Celebrating improvements that prevented incidents—for example, new cache monitoring that detected cache degradation before it affected customers.
Reliability metrics track improvement over time: MTBF (mean time between failures) measures how often incidents occur, MTTD (mean time to detect) measures monitoring effectiveness, MTTR (mean time to resolve) measures response efficiency, and recurrence rate measures whether similar incidents keep happening. Adventure Works targets: MTBF increasing quarter-over-quarter (fewer incidents), MTTD < 5 minutes for P1 incidents, MTTR < 1 hour for P1, and zero recurrence of previously postmortemed root causes.
Tip
When multiple incidents share the same contributing factor, treat it as a systemic gap rather than a per-incident fix. Assign one action item that addresses the pattern—for example, implementing canary deployments for all model updates—rather than creating separate remediation items for each incident. This prevents the same contributing factor from reappearing in future postmortems.
The difference between teams that improve reliability over time and teams that keep fighting the same fires often comes down to postmortem discipline. Severity classification and incident command structure contain the immediate damage. Blameless postmortems and reliability tracking are what ensure each incident makes the next one less likely.
Key takeaways
- Severity classification (P1-P4) determines response urgency, required expertise, and communication cadence, with ambiguous cases classified at the higher severity to avoid delayed response.
- Cross-functional on-call rotations pair AI engineers, platform engineers, and product managers in multi-tier (L1/L2/service owner) structures with explicit per-role responsibilities.
- Incident command structure assigns distinct Incident Commander, Technical Lead, and Communications Lead roles that separate coordination, diagnosis, and stakeholder communication.
- Blameless postmortems follow a standard template (summary, timeline, root cause, contributing factors, what went well, what didn't, action items) that focuses on system failures rather than individual mistakes.
- Reliability improvement tracking monitors MTBF, MTTD, MTTR, and recurrence rate to ensure postmortem action items ship and system reliability measurably improves over time.