How to Write an Incident Postmortem: Templates, Examples & Best Practices

by API Status Check

TLDR: A postmortem is a structured review written after a production incident to capture what happened, why, and how to prevent recurrence. The best postmortems are blameless (focus on systems, not individuals), thorough (include a detailed timeline), and actionable (every finding gets an owner and deadline). Teams that consistently write postmortems reduce repeat incidents by 40-60%. This guide gives you a complete process, two ready-to-use templates, and lessons from how Google, Cloudflare, and GitLab handle theirs.

How to Write an Incident Postmortem

A server went down at 3 AM. Your team scrambled for two hours. The CEO emailed asking what happened. Now the dust has settled, and someone says: "We should write a postmortem."

If your reaction is a groan — you're doing postmortems wrong. A well-run postmortem isn't bureaucratic paperwork. It's the single highest-leverage activity your engineering team can do after an incident. It transforms painful failures into permanent improvements.

This guide covers everything you need to write effective postmortems: when to write them, what to include, how to run the meeting, and how to build a culture where engineers actually want to participate.

What Is an Incident Postmortem?

An incident postmortem (also called a post-incident review, incident retrospective, or after-action review) is a structured analysis of a production incident conducted after the incident is resolved. Its purpose is threefold:

  1. Understand what happened — Build a precise timeline of events, decisions, and actions
  2. Identify root causes — Dig beyond the proximate trigger to find systemic weaknesses
  3. Prevent recurrence — Define concrete action items with owners and deadlines

The term "postmortem" comes from medicine — literally "after death." In engineering, it's the examination of what killed your system's availability, and what you need to change so it doesn't happen again.

Postmortems vs. Incident Reports

These terms are sometimes used interchangeably, but they serve different purposes:

  • Incident report: A factual record of what happened — typically written during or immediately after the incident for stakeholders. It answers "what" and "when."
  • Postmortem: A deeper analysis conducted 24-72 hours later. It answers "why" and "what do we change." It includes root cause analysis, contributing factors, and action items.

Think of the incident report as the police report filed at the scene. The postmortem is the investigation that follows.

When Should You Write a Postmortem?

Not every incident needs a full postmortem. Writing postmortems for trivial issues leads to fatigue and dilutes the signal. Most teams trigger a postmortem when:

  • Severity threshold met: Any P1 (customer-facing outage) or P2 (significant degradation) incident
  • Duration threshold: Incident lasted longer than 30-60 minutes
  • Customer impact: External users were affected or notified
  • Data loss or security implications: Any incident involving data integrity or security
  • Near-misses with high potential: The failure didn't cause major impact but easily could have
  • Repeat incidents: The same failure mode has occurred before — this one gets a mandatory postmortem

A common anti-pattern is requiring postmortems only for "big" incidents. Some of the most valuable postmortems come from near-misses — incidents where a lucky coincidence prevented a major outage. These reveal systemic risks before they cause real damage.

The 24-72 Hour Window

Write the postmortem within 24-72 hours of resolution. Too soon and you lack perspective. Too late and memories fade, logs get rotated, and the urgency to fix underlying issues evaporates.

Google's SRE book recommends starting the postmortem document within 24 hours and completing it within 5 business days. Most mature teams aim for a completed draft within 48 hours and a review meeting within one week.

The Blameless Postmortem: Why Culture Matters More Than Format

The single most important concept in postmortem practice is blamelessness. Coined and popularized by John Allspaw (former CTO of Etsy), a blameless postmortem assumes that:

  1. Engineers acted with the best information they had at the time
  2. Asking "who caused this?" is the wrong question — the right question is "what about our systems allowed this to happen?"
  3. Punishment for honest mistakes creates a culture where people hide failures instead of learning from them

This isn't about removing accountability. It's about redirecting accountability from individuals to systems. Instead of "Dave pushed a bad config change," you write "The deployment pipeline allowed a config change to reach production without validation."

Why Blamelessness Produces Better Outcomes

When engineers fear punishment, they:

  • Underreport incidents and near-misses
  • Hide contributing factors they were involved in
  • Provide incomplete timelines
  • Resist being involved in postmortem discussions

When engineers feel safe, they:

  • Proactively report near-misses before they become outages
  • Volunteer detailed accounts of their decision-making
  • Identify systemic issues they've noticed but never reported
  • Suggest improvements they'd otherwise keep quiet about

The data backs this up. Research from the DevOps Research and Assessment (DORA) program consistently finds that high-performing engineering organizations practice blameless postmortems. Teams that blame individuals have 2.5x more repeat incidents than teams that focus on systemic fixes.

Blameless Language Guide

| ❌ Blame-oriented | ✅ Systems-oriented |
|--------------------|----------------------|
| "Dave pushed a bad config" | "A config change reached production without validation" |
| "The on-call engineer was too slow" | "The alerting system didn't provide enough context for fast triage" |
| "QA missed this bug" | "Our test coverage didn't include this failure mode" |
| "Someone forgot to update the runbook" | "The runbook update wasn't part of the deployment checklist" |
| "The junior engineer broke production" | "Our deployment guardrails didn't prevent an unsafe change" |

Anatomy of an Effective Postmortem

Every good postmortem has seven sections. Some teams add more, but these are non-negotiable:

1. Summary

Two to four sentences that anyone in the company can understand. Include: what broke, how long it was broken, how many users were affected, and how it was fixed.

Example:

On March 15, 2026, the payment processing API experienced a complete outage from 14:32 to 16:07 UTC (95 minutes). Approximately 12,400 users were unable to complete purchases during this window. The root cause was a database migration that locked the transactions table. Service was restored by rolling back the migration and applying it with a non-locking strategy.

2. Impact

Quantify the damage. Be specific about:

  • Duration: Total time from first impact to full recovery
  • User impact: Number of users affected, error rates, failed transactions
  • Revenue impact: Lost revenue, SLA credits issued, refunds processed
  • Detection time: How long before the team knew about the issue (TTD — Time to Detect)
  • Resolution time: How long from detection to full resolution (TTR — Time to Resolve)

Impact quantification isn't just for accountability — it helps prioritize follow-up work. An incident that affected 50 users for 5 minutes doesn't need the same remediation investment as one that affected 50,000 users for 2 hours.
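Duration, TTD, and TTR all fall out of three timestamps, so they are worth computing rather than estimating. A minimal Python sketch, using illustrative times taken from the payment-API example in the Summary section:

```python
from datetime import datetime, timezone

def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes between two timestamps."""
    return int((end - start).total_seconds() // 60)

# Illustrative timestamps (UTC) matching the payment-API example
first_impact = datetime(2026, 3, 15, 14, 32, tzinfo=timezone.utc)
detected     = datetime(2026, 3, 15, 14, 36, tzinfo=timezone.utc)  # alert fired
resolved     = datetime(2026, 3, 15, 16, 7, tzinfo=timezone.utc)   # back to baseline

duration = minutes_between(first_impact, resolved)  # total impact window
ttd = minutes_between(first_impact, detected)       # Time to Detect
ttr = minutes_between(detected, resolved)           # Time to Resolve

print(f"Duration: {duration} min, TTD: {ttd} min, TTR: {ttr} min")
# Duration: 95 min, TTD: 4 min, TTR: 91 min
```

Pulling these numbers from monitoring data rather than memory also catches a common error: measuring duration from when the alert fired instead of when impact actually began.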

3. Timeline

The most labor-intensive but most valuable section. A minute-by-minute (or as close as possible) account of what happened. Include:

  • When the first signal appeared (even if nobody noticed at the time)
  • When alerts fired
  • When humans became aware
  • Every significant action taken during response
  • When the issue was mitigated vs. fully resolved
  • Key decisions and why they were made

Pro tip: Gather the timeline from multiple sources — monitoring dashboards, Slack messages, incident channel logs, deployment history, and git blame. No single person has the complete picture.

Example timeline entry:

14:32 UTC — Database migration job started (triggered by CI/CD pipeline after merge to main)
14:34 UTC — Transaction table locked; API responses begin timing out
14:36 UTC — Alerting fires: "Payment API error rate > 5%" (PagerDuty → on-call SRE)
14:38 UTC — On-call SRE acknowledges alert, opens incident channel in Slack
14:42 UTC — SRE identifies elevated database CPU and lock waits; suspects migration
14:51 UTC — Decision: attempt to kill the migration process rather than wait for completion
14:55 UTC — Migration killed, but table remains locked (zombie lock)
15:10 UTC — Database team engaged; manual lock release attempted
15:22 UTC — Lock cleared; API begins recovering
16:07 UTC — Error rates return to baseline; incident declared resolved
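Assembling a timeline like this from multiple sources is mostly a merge-and-sort over exported events. A minimal sketch in Python; the event tuples and source names are hypothetical stand-ins for whatever your alerting, CI, and chat tools export:

```python
from datetime import datetime

# Hypothetical exports from each system: (ISO timestamp, source, description)
alerts  = [("2026-03-15T14:36", "pagerduty", "Payment API error rate > 5%")]
deploys = [("2026-03-15T14:32", "ci", "Database migration job started")]
chat    = [("2026-03-15T14:38", "slack", "On-call SRE opens incident channel")]

def build_timeline(*sources):
    """Merge events from all sources into one chronological timeline."""
    merged = sorted((e for src in sources for e in src), key=lambda e: e[0])
    return [f"{datetime.fromisoformat(ts):%H:%M} UTC — [{origin}] {desc}"
            for ts, origin, desc in merged]

for line in build_timeline(alerts, deploys, chat):
    print(line)
```

Tagging each entry with its source system makes review easier: gaps between what monitoring saw and what humans discussed are often where the interesting findings hide.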

4. Root Cause Analysis

This is where you dig deep. The proximate cause is rarely the root cause. Use techniques like:

Five Whys:

  1. Why did the API go down? → The database migration locked the transactions table
  2. Why did a table lock cause an outage? → The API has no timeout on database connections and no circuit breaker
  3. Why was the migration applied during business hours? → The CI/CD pipeline runs migrations automatically on merge to main
  4. Why doesn't the pipeline check for table locks? → Migration safety checks were never implemented
  5. Why weren't they implemented? → The team didn't have a migration safety policy

Root cause: No migration safety policy or pre-flight checks in the CI/CD pipeline, combined with no circuit breaker in the API's database connection layer.

Note how the root cause is systemic, not individual. It's not "Dave merged during business hours" — it's "the system allowed unsafe migrations to run automatically during business hours."
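A pre-flight migration safety check in CI, as this analysis calls for, could be sketched roughly as follows. The locking-statement patterns and the 09:00-18:00 UTC business window are illustrative assumptions, not a complete lock analysis; a real check would consult the database's own lock metadata:

```python
import re
from datetime import datetime, timezone

# Statements that commonly take long-held table locks (illustrative heuristic
# only — real tooling would inspect the migration plan, not match keywords)
LOCKING_PATTERNS = [
    r"\bALTER\s+TABLE\b",
    r"\bCREATE\s+INDEX\b(?!.*\bCONCURRENTLY\b)",
    r"\bLOCK\s+TABLE\b",
]

def is_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed business window: 09:00-18:00 UTC, Monday-Friday."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def check_migration(sql: str, now: datetime) -> list[str]:
    """Return reasons to block this migration; empty list means safe to run."""
    problems = []
    if is_business_hours(now):
        for pattern in LOCKING_PATTERNS:
            if re.search(pattern, sql, re.IGNORECASE):
                problems.append(
                    f"potentially locking statement matched {pattern!r} during business hours")
    return problems

sql = "ALTER TABLE transactions ADD COLUMN refunded_at timestamptz;"
now = datetime(2026, 3, 16, 14, 30, tzinfo=timezone.utc)  # a weekday afternoon
print(check_migration(sql, now))  # non-empty list: migration is blocked
```

The same check returns an empty list outside the business window, so the pipeline can still run locking migrations during a maintenance window without a manual override.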

5. Contributing Factors

Root causes don't act alone. Contributing factors are conditions that made the incident worse or more likely:

  • The runbook for database incidents was last updated 18 months ago
  • The on-call SRE had never handled a database lock incident before
  • Monitoring showed elevated error rates but didn't identify the specific database issue
  • The incident happened during a team offsite, reducing available responders

Contributing factors often reveal more actionable improvements than the root cause itself.

6. What Went Well

This section is often skipped and shouldn't be. Documenting what worked reinforces good practices:

  • Alerting fired within 2 minutes of impact (below 5-minute SLA)
  • Incident commander was appointed within 4 minutes
  • Customer communication was sent within 15 minutes
  • Team collaboration in the incident channel was focused and effective

7. Action Items

The most important section. Every action item must have:

  • What: A specific, actionable task (not "improve monitoring")
  • Who: A named owner (not "the team")
  • When: A deadline (not "when we get to it")
  • Priority: P0 (before next on-call rotation), P1 (this sprint), P2 (this quarter)

Example:

| Action Item | Owner | Priority | Deadline |
|-------------|-------|----------|----------|
| Add migration safety check to CI/CD pipeline (reject table-locking migrations during business hours) | Sarah K. | P0 | March 22 |
| Implement circuit breaker on API→DB connections with 5s timeout | Miguel R. | P1 | March 29 |
| Update database incident runbook with lock diagnosis steps | On-call team | P1 | March 29 |
| Add database lock duration to monitoring dashboard | Platform team | P2 | April 15 |
| Schedule quarterly migration safety review | Sarah K. | P2 | April 30 |

The cardinal sin of postmortems is action items that never get done. Track them in your issue tracker (Jira, Linear, GitHub Issues) alongside feature work. Review completion rates monthly. If your action item completion rate is below 70%, your postmortem process is broken.
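Computing that completion rate from a tracker export takes a few lines, which removes any excuse for skipping the monthly review. A minimal sketch over hypothetical exported tickets (the ticket IDs and fields are illustrative, not a real tracker schema):

```python
# Hypothetical export from the issue tracker: (ticket, closed, completed_on_time)
action_items = [
    ("OPS-101", True,  True),   # migration safety check
    ("OPS-102", True,  False),  # circuit breaker (done, but late)
    ("OPS-103", True,  True),   # runbook update
    ("OPS-104", False, False),  # dashboard work (still open)
    ("OPS-105", False, False),  # quarterly review (still open)
]

def completion_rate(items, on_time_only=False):
    """Fraction of action items closed (optionally only those closed on time)."""
    if not items:
        return 0.0
    done = sum(1 for _, closed, on_time in items
               if closed and (on_time or not on_time_only))
    return done / len(items)

print(f"{completion_rate(action_items):.0%} done")                        # 60% done
print(f"{completion_rate(action_items, on_time_only=True):.0%} on time")  # 40% on time
```

Tracking "done on time" separately from "done eventually" matters: a P0 item closed three months late did not actually protect the next on-call rotation.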

Postmortem Template #1: Standard Engineering Template

Copy this template for most production incidents:

# Postmortem: [Incident Title]

**Date:** [Incident date]
**Authors:** [Names of postmortem authors]
**Status:** Draft | In Review | Complete
**Severity:** P1 / P2 / P3

## Summary
[2-4 sentences. What happened, how long, who was affected, how it was fixed.]

## Impact
- **Duration:** [Start time] to [End time] ([X] minutes)
- **Users affected:** [Number and description]
- **Revenue impact:** [If applicable]
- **Time to detect (TTD):** [X] minutes
- **Time to resolve (TTR):** [X] minutes
- **SLA impact:** [Any SLA breaches or credits]

## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [Event description] |
| HH:MM | [Event description] |

## Root Cause
[Detailed root cause analysis. Use Five Whys or equivalent technique.]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive observation 1]
- [Positive observation 2]

## What Went Poorly
- [Issue 1]
- [Issue 2]

## Action Items
| Action | Owner | Priority | Deadline | Ticket |
|--------|-------|----------|----------|--------|
| [Action] | [Name] | P0/P1/P2 | [Date] | [Link] |

## Lessons Learned
[Key takeaways for the broader engineering organization.]

Postmortem Template #2: Abbreviated Template for P3 / Minor Incidents

Not every postmortem needs a 10-page document. For smaller incidents, use this lighter format:

# Quick Postmortem: [Incident Title]

**Date:** [Date] | **Duration:** [X] min | **Severity:** P3
**Authors:** [Name]

## What happened?
[3-5 sentences covering the incident, root cause, and fix.]

## What broke?
- Root cause: [One sentence]
- Contributing: [One sentence]

## Action items
- [ ] [Action] — @[Owner] — [Deadline]
- [ ] [Action] — @[Owner] — [Deadline]

This abbreviated format takes 15-20 minutes to complete and ensures even minor incidents get documented. The key is that it still requires at least one action item.

Running the Postmortem Meeting

A postmortem meeting (sometimes called a postmortem review) brings stakeholders together to discuss the written postmortem. Here's how to run one effectively:

Before the Meeting

  1. Draft the postmortem document — The incident commander or a designated author writes the first draft within 24-48 hours
  2. Gather data — Pull monitoring dashboards, Slack transcripts, deployment logs, and git history
  3. Share the draft — Send the document to all participants at least 24 hours before the meeting
  4. Set expectations — Remind participants this is a blameless review focused on systemic improvements

During the Meeting (45-60 minutes)

| Segment | Duration | Purpose |
|---------|----------|---------|
| Timeline walkthrough | 15 min | Ensure the timeline is accurate and complete |
| Root cause discussion | 15 min | Validate the root cause analysis; surface additional factors |
| Action item review | 10 min | Assign owners and deadlines; prioritize items |
| Open discussion | 5-10 min | Any insights, patterns, or concerns not yet captured |

Facilitation tips:

  • Have a dedicated facilitator who wasn't directly involved in the incident
  • If discussion veers toward blame, redirect: "Let's focus on what the system allowed to happen"
  • Time-box the meeting strictly — if you need more time, schedule a follow-up
  • Record decisions and any changes to the document in real time

After the Meeting

  1. Finalize the document — Incorporate feedback within 24 hours
  2. Create tickets — Every action item becomes a tracked ticket
  3. Distribute — Share the final postmortem with the engineering organization
  4. Follow up — Review action item completion at the next team retrospective

Real-World Postmortem Examples

Learning from other companies' postmortems is one of the best ways to improve your own. Here are notable examples:

Google: The Shakespeare Sonnet Outage

Google's SRE book includes a famous example postmortem where Shakespeare Search went down for 66 minutes due to a configuration push during a period of unusually high traffic. Key lessons:

  • The timeline is exhaustive (minute-by-minute)
  • Root cause includes both the config push AND the lack of canary validation
  • Action items are specific and assigned
  • "What went well" acknowledges the alerting system worked correctly

Cloudflare: July 2019 Global Outage

Cloudflare published a detailed postmortem after a regex rule in their WAF caused a global outage affecting millions of sites for 27 minutes. Notable aspects:

  • Technical depth is exceptional — they explain exactly why the regex caused CPU exhaustion
  • They describe their internal process failures honestly
  • Action items include both technical fixes (regex testing) and process changes (staged rollouts)

GitLab: Database Deletion Incident

In 2017, GitLab lost 6 hours of production data when a tired engineer deleted data from the wrong database. Their postmortem was remarkable for its radical transparency:

  • Published live on YouTube as they attempted recovery
  • The postmortem document was publicly editable
  • They listed every backup system and why each one had failed
  • It led to major infrastructure investment

Each of these examples demonstrates different strengths. Google's is methodical. Cloudflare's is technically deep. GitLab's is radically transparent. The best approach for your team depends on your culture and audience.

Common Postmortem Anti-Patterns

1. The Blame Game

"This happened because [person] did [thing]." This guarantees people will hide information in future postmortems. Redirect to systems: what allowed the mistake to reach production?

2. The Root Cause Monoculture

"Root cause: human error." Human error is never a root cause — it's a symptom. Dig deeper. Why was the human able to make that error? What guardrails were missing?

3. The Action Item Graveyard

Writing action items that never get implemented. If your completion rate is below 70%, you're training your team that postmortems are performative theater. Track items in your issue tracker and review monthly.

4. The Novel

A 15-page postmortem for a 10-minute P3 incident. Match the depth of analysis to the severity of the incident. Use the abbreviated template for minor issues.

5. The Retrospective Without Data

"We think the incident lasted about an hour." Precise data matters. If your monitoring can't tell you exactly when impact started and ended, that's an action item in itself.

6. The Spectator Sport

Only the incident responders attend the postmortem. Invite adjacent teams, product managers, and anyone who can benefit from understanding the failure mode. Some of the best action items come from people with fresh perspectives.

Building a Postmortem Culture

Writing individual postmortems is the easy part. Building a culture where postmortems are valued — where engineers proactively report near-misses and eagerly participate in reviews — takes deliberate effort.

Start with Leadership

Engineering leaders must model blameless behavior. If a VP responds to an outage with "who did this?", all your blameless culture efforts are undermined. Leaders should:

  • Thank engineers who report near-misses
  • Attend postmortem meetings regularly
  • Reference postmortem lessons in planning discussions
  • Never use postmortem findings in performance reviews as evidence of fault

Make Postmortems Visible

Publish postmortems to an internal wiki or knowledge base that the entire engineering organization can search. Some companies go further:

  • Google maintains an internal postmortem repository searchable by service, failure mode, and action item type
  • Etsy holds "Game Day" exercises based on past postmortems
  • Netflix publishes selected postmortems externally to build industry trust

Celebrate the Process

Some teams give awards for the best postmortem of the quarter — not the worst incident, but the most thorough analysis and most impactful action items. This reframes postmortems from punishment to recognition.

Track Metrics

Measure your postmortem practice like any other engineering process:

  • Completion rate: What percentage of qualifying incidents get a postmortem within SLA?
  • Time to postmortem: Average days from incident to completed postmortem
  • Action item completion: What percentage of action items are done on time?
  • Repeat incident rate: How often does the same root cause recur?
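The repeat incident rate in particular becomes easy to compute once each postmortem tags a normalized root-cause category. A sketch over hypothetical incident records (the categories and IDs are invented for illustration):

```python
from collections import Counter

# Hypothetical incident log: (incident_id, root_cause_category)
incidents = [
    ("INC-201", "unsafe-migration"),
    ("INC-202", "expired-tls-cert"),
    ("INC-203", "unsafe-migration"),    # repeat
    ("INC-204", "config-no-validation"),
    ("INC-205", "unsafe-migration"),    # repeat again
]

def repeat_incident_rate(incidents):
    """Fraction of incidents whose root-cause category was seen before."""
    seen, repeats = set(), 0
    for _, cause in incidents:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(incidents) if incidents else 0.0

print(f"Repeat rate: {repeat_incident_rate(incidents):.0%}")  # 2 of 5 incidents
counts = Counter(cause for _, cause in incidents)
print("Most common cause:", counts.most_common(1))
```

The ranking of most common causes is arguably the more useful output: a rising repeat rate tells you the process is failing, but the top category tells you which past action items to reopen first.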

Postmortems and External Status Communication

A postmortem is an internal document, but its findings often inform external communication. When your incident affects customers, they deserve to know:

  1. What happened (at a level appropriate for your audience)
  2. What you're doing to prevent it from happening again
  3. Whether their data was affected

Many companies publish sanitized versions of their postmortems as incident reports on their status pages. This builds trust — customers can see you take reliability seriously.

Monitoring tools that help with the full incident lifecycle — from detection through postmortem:

  • Better Stack combines uptime monitoring, on-call alerting, and incident management in one platform. Their incident timelines automatically capture the data you need for postmortem timelines.
  • 1Password ensures your team can securely access production credentials during incidents without sharing passwords in Slack — a common contributing factor in slow incident response.

For monitoring external dependencies that trigger incidents, an API status aggregator like API Status Check gives you real-time visibility into third-party service outages so you can include accurate external timelines in your postmortems.

Key Takeaways

  1. Write postmortems within 24-72 hours while memories are fresh and urgency is high
  2. Practice blamelessness — focus on systems, not individuals. "What allowed this to happen?" beats "who caused this?"
  3. Use a consistent template — standardization makes postmortems easier to write, review, and search
  4. Every action item gets an owner, deadline, and ticket — untracked action items are abandoned action items
  5. Quantify impact — precise numbers drive appropriate prioritization of follow-up work
  6. Run a focused review meeting — 45-60 minutes, facilitated, time-boxed
  7. Track completion rates — if action items don't get done, the postmortem was wasted effort
  8. Learn from others — study public postmortems from Google, Cloudflare, and GitLab
  9. Build culture deliberately — leadership modeling, visibility, and celebration make postmortems stick

The organizations that handle incidents best aren't the ones that never fail — they're the ones that learn the fastest. A strong postmortem practice is the engine of that learning. Start with the templates in this guide, adapt them to your team, and commit to the process. Your future selves will thank you.

Frequently Asked Questions

What is a blameless postmortem?

A blameless postmortem is a post-incident review that focuses on systemic failures rather than individual mistakes. It assumes engineers acted with the best information available and seeks to improve systems, processes, and tools rather than punish people. Research from DORA shows that blameless cultures have 2.5x fewer repeat incidents.

How long should a postmortem take to write?

A standard postmortem for a P1/P2 incident takes 2-4 hours to draft, including data gathering. The abbreviated template for P3 incidents takes 15-20 minutes. Start within 24 hours of incident resolution and complete within 5 business days.

Who should attend the postmortem meeting?

At minimum: the incident commander, primary responders, the on-call engineer, and the team lead. Also invite adjacent team members who might have relevant context, product managers affected by the outage, and any engineers who can benefit from understanding the failure mode.

What's the difference between a postmortem and a retrospective?

A postmortem is a focused analysis of a specific production incident. A retrospective is a broader team reflection on a time period (usually a sprint or quarter). Postmortems are triggered by incidents; retrospectives happen on a regular schedule. Some teams review outstanding postmortem action items during retrospectives.

How do you handle repeat incidents in postmortems?

If the same failure mode recurs, the postmortem should explicitly reference the previous postmortem and investigate why the original action items didn't prevent recurrence. Common reasons: action items weren't completed, the fix was insufficient, or the scope of the original analysis was too narrow.

Should postmortems be shared publicly?

Internal sharing within the engineering organization is essential. External sharing (publishing on your engineering blog) is optional but builds trust with customers and the broader engineering community. Companies like Cloudflare, GitLab, and Atlassian regularly publish postmortems. If sharing externally, sanitize sensitive details and focus on lessons learned.

What tools are best for managing postmortems?

Dedicated incident management platforms like Incident.io, Rootly, and FireHydrant include built-in postmortem workflows. For simpler setups, a shared document template in Google Docs or Notion works fine — the key is consistency, not tooling. Track action items in your existing issue tracker (Jira, Linear, GitHub Issues) alongside feature work.

How do you measure postmortem effectiveness?

Track four metrics: postmortem completion rate (percentage of qualifying incidents that get a postmortem), time to postmortem (days from incident to completed document), action item completion rate (percentage done on time), and repeat incident rate (how often the same root cause recurs). Aim for >90% completion, <5 days average, >70% action item completion, and declining repeat rate.
