
Incident Postmortem

Learn from production incidents.


Purpose: Learn from incidents systematically so they don't happen again

How to run this meeting

Blameless culture is non-negotiable. The postmortem is not a trial — it's an investigation. People make mistakes in the context of systems that allow those mistakes to have outsized impact; the goal is to fix the system, not punish the person. If attendees fear blame, they'll withhold the details you need most. Establish this norm explicitly at the start of the meeting, especially if leadership is present.

Write the timeline collaboratively before the meeting, not during it. The on-call engineer and anyone else involved should contribute to a shared draft in the 24–48 hours after the incident while memory is fresh. The meeting itself should review, correct, and deepen the timeline — not reconstruct it from scratch. For root cause analysis, use the "5 Whys" method: ask why something happened, then ask why that underlying thing happened, and repeat until you reach a systemic or structural cause rather than "human error."

Hold the postmortem within 5 business days of the incident. Waiting longer lets context fade and signals that learning isn't a priority. Keep the meeting to 60–90 minutes. If the incident was complex, split the review into two sessions: one for the timeline, one for action items. Every action item must have a single named owner and a concrete deadline — "the team" is not an owner.

Before the meeting

  • Draft the incident timeline in a shared doc before the meeting (invite all responders to contribute)
  • Calculate customer impact: number of users affected, duration, error rates, revenue impact if applicable
  • Pull monitoring graphs, alert logs, and deployment history for the relevant window
  • Identify all parties who should attend: responders, on-call engineers, any affected stakeholders
  • Send the draft timeline to attendees 24 hours before so the meeting focuses on analysis, not reconstruction

Meeting Details

  • Date:
  • Facilitator:
  • Attendees:
  • Duration: 60–90 minutes
  • Incident ID / Severity:
  • Incident Date:

Incident Summary

A 3–5 sentence plain-language description of what happened, suitable for sharing with non-technical stakeholders.

On January 14th, 2025 at 14:32 UTC, the Meridian platform experienced a complete outage of the checkout service lasting 47 minutes. The root cause was a misconfigured connection pool limit deployed in v2.31.0 that exhausted database connections under normal load. All checkout attempts failed with 503 errors during the outage window. The service was restored at 15:19 UTC after rolling back the deployment.


Impact

Quantify the blast radius. Include customer impact, internal impact, and any SLA implications.

  • Duration: 47 minutes (14:32–15:19 UTC)
  • Users affected: ~8,400 active sessions during the window; all checkout attempts failed
  • Revenue impact: Estimated $23,000 in lost transactions based on average order value
  • SLA: Breached the 99.9% monthly uptime SLA (third breach this quarter)
  • Support tickets opened: 142 during the incident window
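A revenue figure like the one above is usually a back-of-the-envelope estimate. A minimal sketch of that arithmetic — the checkout rate and average order value here are illustrative assumptions, not the actual Meridian data:

```python
# Rough revenue-impact estimate for a full checkout outage.
# All inputs are illustrative assumptions, not real Meridian figures.

def estimate_lost_revenue(outage_minutes: float,
                          checkouts_per_minute: float,
                          avg_order_value: float) -> float:
    """Lost revenue = failed checkouts x average order value."""
    failed_checkouts = outage_minutes * checkouts_per_minute
    return failed_checkouts * avg_order_value

# e.g. ~6.5 checkouts/min at a $75 average order over 47 minutes
# lands near the $23,000 estimate quoted in the Impact section.
print(estimate_lost_revenue(47, 6.5, 75.0))
```

Document the assumptions (baseline checkout rate, average order value) alongside the number so the estimate can be challenged and refined.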

Timeline

A chronological record of what happened. Include detection, escalation, diagnosis, and resolution steps. Note what was tried and what didn't work.

Time (UTC) | Event
14:32 | v2.31.0 deployed to production via automated pipeline
14:38 | Checkout error rate crosses 5% threshold; alert fires but is routed to the wrong PagerDuty channel
14:51 | Customer support reports a spike in "can't complete purchase" tickets; escalates to on-call
14:54 | @jordan acknowledges the incident, begins investigation
15:02 | Database connection exhaustion identified in logs
15:07 | Team attempts a config-only fix; it fails (requires restart)
15:11 | Decision made to roll back to v2.30.1
15:19 | Rollback complete; checkout error rate returns to baseline
15:25 | All-clear posted in #incidents
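A timeline like this also yields the standard response metrics (time to detect, acknowledge, resolve) worth recording in the postmortem. A small sketch using the timestamps from the table above:

```python
from datetime import datetime

# Key timestamps from the incident timeline (UTC, same day).
fmt = "%H:%M"
deployed = datetime.strptime("14:32", fmt)  # faulty deploy lands
alerted  = datetime.strptime("14:38", fmt)  # alert fires (wrong channel)
acked    = datetime.strptime("14:54", fmt)  # on-call acknowledges
resolved = datetime.strptime("15:19", fmt)  # rollback complete

def minutes(delta) -> int:
    return int(delta.total_seconds() // 60)

print("Time to detect:", minutes(alerted - deployed), "min")    # 6 min
print("Time to acknowledge:", minutes(acked - alerted), "min")  # 16 min
print("Time to resolve:", minutes(resolved - deployed), "min")  # 47 min
```

Here the 16-minute acknowledgment gap is the misrouted PagerDuty alert — tracking these numbers across incidents makes that kind of process failure visible in aggregate.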

Root Causes

Use 5 Whys to get to systemic causes. Avoid stopping at "human error" — that's never a root cause.

Why did the outage occur? The database connection pool was exhausted under normal load.

Why was the pool exhausted? The max_connections value in v2.31.0 was set to 10 instead of 100 — a typo in the config file.

Why wasn't the typo caught? The config change was not covered by automated validation, and the PR reviewer focused on the code change, not the config values.

Why is there no config validation? Config files are treated as static artifacts and are not included in our integration test suite.

Why wasn't the misconfiguration caught in staging? Staging runs at 5% of production load and never exercised the connection limit.

Systemic root cause: Connection pool configuration is not validated automatically, and staging load does not reflect production conditions.
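A systemic cause like this points to a concrete guardrail: validate config values in CI before deploy. A minimal sketch — the key names and sane-value bounds are hypothetical, not Meridian's actual config schema:

```python
# Minimal pre-deploy config validation sketch.
# Key names and bounds are illustrative assumptions.

BOUNDS = {
    "max_connections": (50, 500),     # a pool of 10 fails this check
    "pool_timeout_seconds": (1, 60),
}

def validate_config(config: dict) -> list[str]:
    """Return a list of violations; an empty list means the config passes."""
    errors = []
    for key, (lo, hi) in BOUNDS.items():
        value = config.get(key)
        if value is None:
            errors.append(f"{key}: missing")
        elif not (lo <= value <= hi):
            errors.append(f"{key}: {value} outside [{lo}, {hi}]")
    return errors

# The v2.31.0 typo (10 instead of 100) would be caught before deploy:
print(validate_config({"max_connections": 10, "pool_timeout_seconds": 30}))
```

Wired into CI as a required check, this turns "the reviewer didn't notice the config value" from a single point of failure into a defense-in-depth layer.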


What Worked

Honest credit for what helped contain or resolve the incident.

  • Support team's escalation path to on-call was fast and effective once the right channel was reached
  • Rollback procedure was well-documented and executed cleanly in under 10 minutes
  • Post-incident communication to customers was drafted and sent within 30 minutes of resolution

What Didn't Work

Gaps in tooling, process, or communication that made the incident worse.

  • PagerDuty alert routing was misconfigured — alert fired but no one saw it for 13 minutes
  • Staging environment does not replicate production load, so it cannot catch this class of bug
  • The config change was not reviewed as carefully as the code change in the same PR

Action Items

Owner | Action | Due Date | Status
@priya | Add automated validation for connection pool config values in CI | 2025-01-28 | Open
@jordan | Audit and fix PagerDuty routing rules for all checkout-related alerts | 2025-01-21 | Open
@carlos | Document and enforce a config review checklist for infrastructure PRs | 2025-02-04 | Open
@mia | Investigate load-testing options for staging to better simulate production traffic | 2025-02-11 | Open

Follow-up

The postmortem document is published to the engineering wiki within 48 hours and linked from the incident ticket. A summary is shared in the #engineering channel. Action items are tracked in the project management tool and reviewed at the next engineering all-hands. The on-call team reviews the document during the next on-call handoff. High-severity incidents (P0/P1) are summarized for leadership in a separate brief.
