The 5-Phase Incident Response Lifecycle
Phase 1: Detection
Your response to an incident begins when you detect it, not when the incident actually starts. The gap between the two is your MTTD (Mean Time to Detect), and every minute of MTTD is a minute of user impact with no response.
Detection sources (in order of preference):
- Automated monitoring alerts (best). Your monitoring tool detects the issue and pages on-call. Zero human reaction time required. This is why investing in good monitoring pays off directly as lower MTTD.
- Health check failures. Synthetic checks or healthcheck endpoints fail, triggering alerts before users notice.
- Increased error rate alerts. Your error rate crosses a threshold and pages on-call, often before users actively report issues (a minimal alerting sketch follows this list).
- Social media / user reports (worst). Users are complaining on Twitter before your monitoring has caught it. MTTD is already high, and MTTR will likely be high too.
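A minimal sketch of the health-check and error-rate sources above, assuming a hypothetical page_oncall() hook into whatever paging tool you use; the URL and threshold are placeholders, and in practice this logic usually lives in your monitoring platform rather than hand-rolled code:

```python
import requests  # third-party HTTP client (pip install requests)

ERROR_RATE_THRESHOLD = 0.05  # assumed: page when more than 5% of requests fail

def page_oncall(message: str) -> None:
    """Hypothetical hook into your paging tool (PagerDuty, Better Stack, etc.)."""
    print(f"PAGE: {message}")

def synthetic_health_check(url: str) -> None:
    """Hit a healthcheck endpoint and page if it fails, before users notice."""
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            page_oncall(f"{url} returned HTTP {resp.status_code}")
    except requests.RequestException as exc:
        page_oncall(f"{url} unreachable: {exc}")

def check_error_rate(errors: int, total: int) -> None:
    """Page when the error rate crosses the threshold."""
    if total and errors / total > ERROR_RATE_THRESHOLD:
        page_oncall(f"error rate {errors / total:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
```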
Phase 2: Triage
Triage is the rapid assessment of what's wrong, how bad it is, and who needs to be involved. The goal: move from "alert fired" to "we understand the situation" in under 10 minutes for P1 incidents.
| Severity | Definition | Response Target |
|---|---|---|
| P0 (Critical) | Complete outage, all users impacted | Acknowledge in 5 min, war room in 10 min |
| P1 (High) | Major degradation, significant user impact | Acknowledge in 15 min, response in 30 min |
| P2 (Medium) | Partial degradation, limited user impact | Response within 2 hours |
| P3 (Low) | Minor issues, minimal impact | Address in the next business day or sprint |
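One way to make the table above operational is to encode it in your tooling so severity is assigned consistently under pressure. A rough sketch, with thresholds that are illustrative rather than prescriptive:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    full_outage: bool          # is the service completely down?
    users_impacted_pct: float  # rough percentage of users affected (0-100)

def classify_severity(a: Assessment) -> str:
    """Map a quick impact assessment onto the severity table above."""
    if a.full_outage:
        return "P0"  # complete outage, all users impacted
    if a.users_impacted_pct >= 25:  # threshold assumed, tune per service
        return "P1"  # major degradation, significant user impact
    if a.users_impacted_pct > 0:
        return "P2"  # partial degradation, limited user impact
    return "P3"      # minor issue, minimal impact
```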
Triage checklist (run in under 5 minutes; a sketch for capturing the answers follows the list):
- What service/component is affected?
- How many users are impacted? (percentage, count, specific segments?)
- Did anything change recently? (deploys, config changes, traffic spikes?)
- What is the error pattern? (all errors? specific endpoints? specific regions?)
- Is the situation stable, getting worse, or improving?
- Who needs to be in the room? (on-call + domain experts)
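One way to keep triage honest is to capture the answers to this checklist as a structured record the moment triage ends, so the incident channel and the eventual postmortem start from the same facts. A hypothetical sketch:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TriageRecord:
    """Answers to the triage checklist, captured as structured data."""
    affected_component: str        # what service/component is affected?
    users_impacted: str            # e.g. "~8% of API traffic, EU region only"
    recent_changes: list[str]      # deploys, config changes, traffic spikes
    error_pattern: str             # all errors? specific endpoints or regions?
    trend: str                     # "stable", "worsening", or "improving"
    responders: list[str] = field(default_factory=list)  # on-call + domain experts
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```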
Phase 3: Mitigation
Mitigation means stopping or reducing user impact — even if you don't know the root cause yet. Common mitigation actions:
- Rollback the most recent deploy. The most common cause of incidents is a recent code change. If there was a deploy in the last 24 hours, rolling it back is the fastest path to service restoration.
- Failover to backup region/zone. If a specific AZ or region is degraded, route traffic to healthy ones. This should be automated via load balancer health checks.
- Enable feature flags to disable affected features. Kill switches let you disable specific functionality without a full rollback.
- Scale up capacity. If the issue is resource exhaustion (CPU, memory, connections), scaling horizontally may restore service while you diagnose.
- Apply rate limiting. If a single client or endpoint is generating abusive load, rate limiting protects the rest of your service (a token-bucket sketch follows this list).
- Database failover. If a primary database is unhealthy, promote a replica to primary via your database cluster tooling.
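For the rate-limiting action, a minimal token-bucket sketch; in production this usually lives in your load balancer or API gateway, and the rate and burst values here are placeholders:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# During an incident, cap an abusive client at 10 requests/second with bursts of 20.
abusive_client_limiter = TokenBucket(rate=10, capacity=20)
```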
The Incident Commander Role
For P0/P1 incidents, designate an incident commander (IC) explicitly. The IC is not the person doing the most debugging — they're the person running the process. Without an IC, incidents devolve into chaos: multiple people attempting fixes that conflict, communication breaking down, key stakeholders left in the dark.
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, owns communication, makes escalation decisions |
| Technical Lead | Directs debugging, proposes and executes mitigation steps |
| Communications Lead | Drafts status page updates, internal Slack updates, stakeholder comms |
| Scribe | Documents timeline in real time — what was tried, what was found, what was done |
Communication During Incidents
Good communication is half of incident response. While the technical team works the problem, the communications lead keeps everyone informed:
- Internal updates every 15-30 minutes. Engineering leadership and customer success need to know what's happening. Even "still investigating, no update on ETA" beats silence.
- Status page updates within 5 minutes of detection. Customers checking your status page during an incident need information before they escalate to support. "We are investigating increased error rates" is better than a green status page during a red incident.
- Dedicated incident channel. Create a #incident-YYYY-MM-DD Slack channel for each incident (a creation sketch follows this list). This keeps the debugging conversation out of main channels and creates a searchable incident record.
- Resolution announcement. Post clearly when the incident is resolved, with a brief summary and ETA for postmortem.
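Creating the dedicated channel is easy to automate. A sketch using the official slack_sdk Python client; the token, naming convention, and opening message are assumptions:

```python
from datetime import date
from slack_sdk import WebClient  # official Slack SDK (pip install slack_sdk)

def open_incident_channel(token: str, summary: str) -> str:
    """Create a #incident-YYYY-MM-DD channel and post an opening summary."""
    client = WebClient(token=token)
    name = f"incident-{date.today().isoformat()}"  # e.g. incident-2025-03-14
    channel = client.conversations_create(name=name)["channel"]
    client.chat_postMessage(channel=channel["id"], text=f"Incident opened: {summary}")
    return channel["id"]
```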
Phase 4: Resolution
Resolution is when you've fixed the root cause — not just mitigated the symptoms. The criteria for "resolved" should be explicit:
- Error rates have returned to pre-incident baseline
- All affected services are responding with healthy latency
- Monitoring shows no remaining anomalies
- You have a clear explanation of what caused the incident and what fixed it
- The fix is verified stable (not just momentarily recovered)
Don't rush to declare resolution. A service that recovers briefly then degrades again extends MTTR and erodes trust further.
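A small sketch of the first and last criteria: compare the current error rate to the pre-incident baseline and require it to hold across a sustained window before anyone declares resolution. The metric source and tolerance are placeholders:

```python
def looks_resolved(recent_error_rates: list[float],
                   baseline_error_rate: float,
                   tolerance: float = 0.10) -> bool:
    """True only if every recent sample is within 10% of the pre-incident baseline.

    `recent_error_rates` should cover a sustained window (e.g. the last 30 minutes),
    so a brief dip back to normal does not count as resolution.
    """
    limit = baseline_error_rate * (1 + tolerance)
    return bool(recent_error_rates) and all(r <= limit for r in recent_error_rates)

# Baseline 0.2% errors; the last three samples are 0.21%, 0.20%, 0.19%.
print(looks_resolved([0.0021, 0.0020, 0.0019], baseline_error_rate=0.002))
```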
Phase 5: Postmortem
The postmortem is what prevents the incident from recurring. A good postmortem is blameless, structured, and generates specific action items with owners and due dates.
Postmortem template (minimum viable):
- Incident summary: What happened in 2-3 sentences (for readers who weren't involved)
- Timeline: When did the incident start? When was it detected? Key events during response. When was it resolved?
- Root cause analysis: The underlying technical cause. Use the "5 Whys" technique to get to the real cause, not just the proximate trigger.
- Contributing factors: What made this incident worse than it had to be? (Missing monitoring, unclear runbook, slow escalation?)
- Action items: Specific tasks with owners and due dates. Each should prevent recurrence or improve detection/response speed (a tracking sketch follows this list).
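Action items are the part of the template most likely to get lost after the meeting. One lightweight option is to record them as structured data with an owner and a due date so overdue items are easy to surface; the names and dates below are illustrative, mirroring the 5 Whys walkthrough that follows:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """A postmortem action item: one specific task, one owner, one due date."""
    description: str
    owner: str
    due: date
    done: bool = False

action_items = [
    ActionItem("Add linter rule for unclosed DB connections", owner="alice", due=date(2025, 6, 1)),
    ActionItem("Alert on connection pool utilization above 70%", owner="bob", due=date(2025, 6, 8)),
]

overdue = [a for a in action_items if not a.done and a.due < date.today()]
```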
The 5 Whys Example
Problem: Database connection pool exhausted, causing 502 errors.
Why? → Application opened connections faster than closing them.
Why? → A new query was missing a finally block to close the connection.
Why? → Code review didn't catch it — we don't have a linter rule for connection hygiene.
Why? → We never established coding standards for database connection management.
Action item: Add a linter rule detecting unclosed database connections. Also: add connection pool utilization alerting at 70%.
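The proximate trigger in this example is easy to show in code. A sketch of the leaky pattern and the fix, using generic stand-ins rather than any specific database driver:

```python
class Connection:
    """Stand-in for a real driver connection."""
    def execute(self, sql: str):
        return []          # pretend result set
    def close(self) -> None:
        pass               # in a real driver, returns the connection to the pool

class ConnectionPool:
    """Stand-in for whatever pool your driver or ORM provides."""
    def acquire(self) -> Connection:
        return Connection()

def leaky_query(pool: ConnectionPool, sql: str):
    conn = pool.acquire()
    return conn.execute(sql)   # conn is never closed; under load the pool is exhausted

def safe_query(pool: ConnectionPool, sql: str):
    conn = pool.acquire()
    try:
        return conn.execute(sql)
    finally:
        conn.close()           # always returned, even when execute() raises
```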
Key Incident Response Metrics
| Metric | Definition | How to Improve |
|---|---|---|
| MTTD | Mean Time to Detect | Better monitoring coverage, lower alert thresholds |
| MTTA | Mean Time to Acknowledge | Faster on-call rotations, clearer escalation paths |
| MTTM | Mean Time to Mitigate | Pre-built runbooks, feature flags, rollback automation |
| MTTR | Mean Time to Resolve | Faster root cause analysis, better diagnostics tooling |
| Incident frequency | Incidents per week/month | Postmortem action items, reliability improvements |
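All of these derive from the same handful of incident timestamps, so they are cheap to compute from your incident records. A sketch; the field names and the exact start/end points for each metric vary by team:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    started: datetime       # when user impact actually began
    detected: datetime      # when the alert fired or the incident was noticed
    acknowledged: datetime  # when on-call acknowledged the page
    mitigated: datetime     # when user impact stopped or was reduced
    resolved: datetime      # when the root cause was fixed and verified

def mean_minutes(incidents: list[Incident], start: str, end: str) -> float:
    """Average gap in minutes between two timestamps across incidents."""
    return mean(
        (getattr(i, end) - getattr(i, start)).total_seconds() / 60
        for i in incidents
    )

# mean_minutes(incidents, "started", "detected")       -> MTTD
# mean_minutes(incidents, "detected", "acknowledged")  -> MTTA
# mean_minutes(incidents, "started", "mitigated")      -> MTTM
# mean_minutes(incidents, "started", "resolved")       -> MTTR
```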
Incident Response Tools
| Tool Category | Purpose | Options |
|---|---|---|
| Monitoring / Alerting | Detect incidents before users do | Better Stack, Datadog, Prometheus + Alertmanager |
| On-call management | Route alerts to the right person | PagerDuty, OpsGenie, Better Stack |
| Incident coordination | War room, timeline, communication | Incident.io, Statuspage, Slack |
| Status pages | External communication during outages | Better Stack, Atlassian Statuspage, PagerDuty |
| Postmortem tooling | Document and track action items | Incident.io, Notion, Confluence, Google Docs |