The 5-Phase Incident Response Lifecycle
Phase 1: Detection
Your response to an incident begins when you detect it, not when the incident actually starts. The gap between the two is your MTTD (Mean Time to Detect), and every minute of MTTD is a minute of user impact with no response.
Detection sources (in order of preference):
- Automated monitoring alerts (best). Your monitoring tool detects the issue and pages on-call. Zero human reaction time required. This is why investing in good monitoring pays off directly as lower MTTD.
- Health check failures. Synthetic checks or healthcheck endpoints fail, triggering alerts before users notice.
- Increased error rate alerts. Your error rate crosses a threshold and pages on-call, often before users actively report issues (a minimal alerting sketch follows this list).
- Social media / user reports (worst). Users are complaining on Twitter before your monitoring has caught it. MTTD is already high, and MTTR will likely be high too.
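A minimal sketch of the health-check and error-rate sources above, assuming a hypothetical page_oncall() hook into whatever paging tool you use; the URL and threshold are placeholders, and in practice this logic usually lives in your monitoring platform rather than hand-rolled code:

```python
import requests  # third-party HTTP client (pip install requests)

ERROR_RATE_THRESHOLD = 0.05  # assumed: page when more than 5% of requests fail

def page_oncall(message: str) -> None:
    """Hypothetical hook into your paging tool (PagerDuty, Better Stack, etc.)."""
    print(f"PAGE: {message}")

def synthetic_health_check(url: str) -> None:
    """Hit a healthcheck endpoint and page if it fails, before users notice."""
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            page_oncall(f"{url} returned HTTP {resp.status_code}")
    except requests.RequestException as exc:
        page_oncall(f"{url} unreachable: {exc}")

def check_error_rate(errors: int, total: int) -> None:
    """Page when the error rate crosses the threshold."""
    if total and errors / total > ERROR_RATE_THRESHOLD:
        page_oncall(f"error rate {errors / total:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
```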
Phase 2: Triage
Triage is the rapid assessment of what's wrong, how bad it is, and who needs to be involved. The goal: move from "alert fired" to "we understand the situation" in under 10 minutes for P1 incidents.
| Severity | Definition | Response Target |
|---|---|---|
| P0 (Critical) | Complete outage, all users impacted | Acknowledge in 5 min, war room in 10 min |
| P1 (High) | Major degradation, significant user impact | Acknowledge in 15 min, response in 30 min |
| P2 (Medium) | Partial degradation, limited user impact | Response within 2 hours |
| P3 (Low) | Minor issues, minimal impact | Address in the next business day or sprint |
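One way to make the table above operational is to encode it in your tooling so severity is assigned consistently under pressure. A rough sketch, with thresholds that are illustrative rather than prescriptive:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    full_outage: bool          # is the service completely down?
    users_impacted_pct: float  # rough percentage of users affected (0-100)

def classify_severity(a: Assessment) -> str:
    """Map a quick impact assessment onto the severity table above."""
    if a.full_outage:
        return "P0"  # complete outage, all users impacted
    if a.users_impacted_pct >= 25:  # threshold assumed, tune per service
        return "P1"  # major degradation, significant user impact
    if a.users_impacted_pct > 0:
        return "P2"  # partial degradation, limited user impact
    return "P3"      # minor issue, minimal impact
```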
Triage checklist (run in under 5 minutes; a sketch for capturing the answers follows the list):
- What service/component is affected?
- How many users are impacted? (percentage, count, specific segments?)
- Did anything change recently? (deploys, config changes, traffic spikes?)
- What is the error pattern? (all errors? specific endpoints? specific regions?)
- Is the situation stable, getting worse, or improving?
- Who needs to be in the room? (on-call + domain experts)
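One way to keep triage honest is to capture the answers to this checklist as a structured record the moment triage ends, so the incident channel and the eventual postmortem start from the same facts. A hypothetical sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TriageRecord:
    """Answers to the triage checklist, captured as structured data."""
    affected_component: str        # what service/component is affected?
    users_impacted: str            # e.g. "~8% of API traffic, EU region only"
    recent_changes: list[str]      # deploys, config changes, traffic spikes
    error_pattern: str             # all errors? specific endpoints or regions?
    trend: str                     # "stable", "worsening", or "improving"
    responders: list[str] = field(default_factory=list)  # on-call + domain experts
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```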
Phase 3: Mitigation
Mitigation means stopping or reducing user impact — even if you don't know the root cause yet. Common mitigation actions:
- Rollback the most recent deploy. The most common cause of incidents is a recent code change. If there was a deploy in the last 24 hours, rolling it back is the fastest path to service restoration.
- Failover to backup region/zone. If a specific AZ or region is degraded, route traffic to healthy ones. This should be automated via load balancer health checks.
- Enable feature flags to disable affected features. Kill switches let you disable specific functionality without a full rollback.
- Scale up capacity. If the issue is resource exhaustion (CPU, memory, connections), scaling horizontally may restore service while you diagnose.
- Apply rate limiting. If a single client or endpoint is generating abusive load, rate limiting protects the rest of your service (a token-bucket sketch follows this list).
- Database failover. If a primary database is unhealthy, promote a replica to primary via your database cluster tooling.
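For the rate-limiting action, a minimal token-bucket sketch; in production this usually lives in your load balancer or API gateway, and the rate and burst values here are placeholders:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# During an incident, cap an abusive client at 10 requests/second with bursts of 20.
abusive_client_limiter = TokenBucket(rate=10, capacity=20)
```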
The Incident Commander Role
For P0/P1 incidents, designate an incident commander (IC) explicitly. The IC is not the person doing the most debugging — they're the person running the process. Without an IC, incidents devolve into chaos: multiple people attempting fixes that conflict, communication breaking down, key stakeholders left in the dark.
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, owns communication, makes escalation decisions |
| Technical Lead | Directs debugging, proposes and executes mitigation steps |
| Communications Lead | Drafts status page updates, internal Slack updates, stakeholder comms |
| Scribe | Documents timeline in real time — what was tried, what was found, what was done |
Communication During Incidents
Good communication is half of incident response. While the technical team works the problem, the communications lead keeps everyone informed:
- Internal updates every 15-30 minutes. Engineering leadership and customer success need to know what's happening. Even "still investigating, no update on ETA" beats silence.
- Status page updates within 5 minutes of detection. Customers checking your status page during an incident need information before they escalate to support. "We are investigating increased error rates" is better than a green status page during a red incident.
- Dedicated incident channel. Create a #incident-YYYY-MM-DD Slack channel for each incident (a creation sketch follows this list). This keeps the debugging conversation out of main channels and creates a searchable incident record.
- Resolution announcement. Post clearly when the incident is resolved, with a brief summary and ETA for postmortem.
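Creating the dedicated channel is easy to automate. A sketch using the official slack_sdk Python client; the token, naming convention, and opening message are assumptions:

```python
from datetime import date
from slack_sdk import WebClient  # official Slack SDK (pip install slack_sdk)

def open_incident_channel(token: str, summary: str) -> str:
    """Create a #incident-YYYY-MM-DD channel and post an opening summary."""
    client = WebClient(token=token)
    name = f"incident-{date.today().isoformat()}"  # e.g. incident-2025-03-14
    channel = client.conversations_create(name=name)["channel"]
    client.chat_postMessage(channel=channel["id"], text=f"Incident opened: {summary}")
    return channel["id"]
```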
Phase 4: Resolution
Resolution is when you've fixed the root cause — not just mitigated the symptoms. The criteria for "resolved" should be explicit:
- Error rates have returned to pre-incident baseline
- All affected services are responding with healthy latency
- Monitoring shows no remaining anomalies
- You have a clear explanation of what caused the incident and what fixed it
- The fix is verified stable (not just momentarily recovered)
Don't rush to declare resolution. A service that recovers briefly then degrades again extends MTTR and erodes trust further.
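A small sketch of the first and last criteria: compare the current error rate to the pre-incident baseline and require it to hold across a sustained window before anyone declares resolution. The metric source and tolerance are placeholders:

```python
def looks_resolved(recent_error_rates: list[float],
                   baseline_error_rate: float,
                   tolerance: float = 0.10) -> bool:
    """True only if every recent sample is within 10% of the pre-incident baseline.

    `recent_error_rates` should cover a sustained window (e.g. the last 30 minutes),
    so a brief dip back to normal does not count as resolution.
    """
    limit = baseline_error_rate * (1 + tolerance)
    return bool(recent_error_rates) and all(r <= limit for r in recent_error_rates)

# Baseline 0.2% errors; the last three samples are 0.21%, 0.20%, 0.19%.
print(looks_resolved([0.0021, 0.0020, 0.0019], baseline_error_rate=0.002))
```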
Phase 5: Postmortem
The postmortem is what prevents the incident from recurring. A good postmortem is blameless, structured, and generates specific action items with owners and due dates.
Postmortem template (minimum viable):
- Incident summary: What happened in 2-3 sentences (for readers who weren't involved)
- Timeline: When did the incident start? When was it detected? Key events during response. When was it resolved?
- Root cause analysis: The underlying technical cause. Use the "5 Whys" technique to get to the real cause, not just the proximate trigger.
- Contributing factors: What made this incident worse than it had to be? (Missing monitoring, unclear runbook, slow escalation?)
- Action items: Specific tasks with owners and due dates. Each should prevent recurrence or improve detection/response speed (a tracking sketch follows this list).
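Action items are the part of the template most likely to get lost after the meeting. One lightweight option is to record them as structured data with an owner and a due date so overdue items are easy to surface; the names and dates below are illustrative, mirroring the 5 Whys walkthrough that follows:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """A postmortem action item: one specific task, one owner, one due date."""
    description: str
    owner: str
    due: date
    done: bool = False

action_items = [
    ActionItem("Add linter rule for unclosed DB connections", owner="alice", due=date(2025, 6, 1)),
    ActionItem("Alert on connection pool utilization above 70%", owner="bob", due=date(2025, 6, 8)),
]

overdue = [a for a in action_items if not a.done and a.due < date.today()]
```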
The 5 Whys Example
Problem: Database connection pool exhausted, causing 502 errors.
Why? → Application opened connections faster than closing them.
Why? → A new query was missing a finally block to close the connection.
Why? → Code review didn't catch it — we don't have a linter rule for connection hygiene.
Why? → We never established coding standards for database connection management.
Action item: Add a linter rule detecting unclosed database connections. Also: add connection pool utilization alerting at 70%.
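The proximate trigger in this example is easy to show in code. A sketch of the leaky pattern and the fix, using generic stand-ins rather than any specific database driver:

```python
class Connection:
    """Stand-in for a real driver connection."""
    def execute(self, sql: str):
        return []          # pretend result set
    def close(self) -> None:
        pass               # in a real driver, returns the connection to the pool

class ConnectionPool:
    """Stand-in for whatever pool your driver or ORM provides."""
    def acquire(self) -> Connection:
        return Connection()

def leaky_query(pool: ConnectionPool, sql: str):
    conn = pool.acquire()
    return conn.execute(sql)   # conn is never closed; under load the pool is exhausted

def safe_query(pool: ConnectionPool, sql: str):
    conn = pool.acquire()
    try:
        return conn.execute(sql)
    finally:
        conn.close()           # always returned, even when execute() raises
```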
Key Incident Response Metrics
| Metric | Definition | How to Improve |
|---|---|---|
| MTTD | Mean Time to Detect | Better monitoring coverage, lower alert thresholds |
| MTTA | Mean Time to Acknowledge | Faster on-call rotations, clearer escalation paths |
| MTTM | Mean Time to Mitigate | Pre-built runbooks, feature flags, rollback automation |
| MTTR | Mean Time to Resolve | Faster root cause analysis, better diagnostics tooling |
| Incident frequency | Incidents per week/month | Postmortem action items, reliability improvements |
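All of these derive from the same handful of incident timestamps, so they are cheap to compute from your incident records. A sketch; the field names and the exact start/end points for each metric vary by team:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    started: datetime       # when user impact actually began
    detected: datetime      # when the alert fired or the incident was noticed
    acknowledged: datetime  # when on-call acknowledged the page
    mitigated: datetime     # when user impact stopped or was reduced
    resolved: datetime      # when the root cause was fixed and verified

def mean_minutes(incidents: list[Incident], start: str, end: str) -> float:
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (getattr(i, end) - getattr(i, start)).total_seconds() / 60
        for i in incidents
    )

# mean_minutes(incidents, "started", "detected")       -> MTTD
# mean_minutes(incidents, "detected", "acknowledged")  -> MTTA
# mean_minutes(incidents, "started", "mitigated")      -> MTTM
# mean_minutes(incidents, "started", "resolved")       -> MTTR
```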
Incident Response Tools
| Tool Category | Purpose | Options |
|---|---|---|
| Monitoring / Alerting | Detect incidents before users do | Better Stack, Datadog, Prometheus + Alertmanager |
| On-call management | Route alerts to the right person | PagerDuty, OpsGenie, Better Stack |
| Incident coordination | War room, timeline, communication | Incident.io, Statuspage, Slack |
| Status pages | External communication during outages | Better Stack, Atlassian Statuspage, PagerDuty |
| Postmortem tooling | Document and track action items | Incident.io, Notion, Confluence, Google Docs |