Incident Postmortem
Purpose: Learn from incidents systematically so they don't happen again
How to run this meeting
Blameless culture is non-negotiable. The postmortem is not a trial — it's an investigation. People make mistakes in the context of systems that allow those mistakes to have outsized impact; the goal is to fix the system, not punish the person. If attendees fear blame, they'll withhold the details you need most. Establish this norm explicitly at the start of the meeting, especially if leadership is present.
Write the timeline collaboratively before the meeting, not during it. The on-call engineer and anyone else involved should contribute to a shared draft in the 24–48 hours after the incident while memory is fresh. The meeting itself should review, correct, and deepen the timeline — not reconstruct it from scratch. For root cause analysis, use the "5 Whys" method: ask why something happened, then ask why that underlying thing happened, and repeat until you reach a systemic or structural cause rather than "human error."
Hold the postmortem within 5 business days of the incident. Waiting longer lets context fade and signals that learning isn't a priority. Keep the meeting to 60–90 minutes. If the incident was complex, split the timeline review from the action item discussion into two separate sessions. Every action item must have a single named owner and a concrete deadline — "the team" is not an owner.
Before the meeting
- Draft the incident timeline in a shared doc before the meeting (invite all responders to contribute)
- Calculate customer impact: number of users affected, duration, error rates, revenue impact if applicable
- Pull monitoring graphs, alert logs, and deployment history for the relevant window
- Identify all parties who should attend: responders, on-call engineers, any affected stakeholders
- Send the draft timeline to attendees 24 hours before so the meeting focuses on analysis, not reconstruction
Meeting Details
- Date:
- Facilitator:
- Attendees:
- Duration: 60–90 minutes
- Incident ID / Severity:
- Incident Date:
Incident Summary
A 3–5 sentence plain-language description of what happened, suitable for sharing with non-technical stakeholders.
On January 14th, 2025 at 14:32 UTC, the Meridian platform experienced a complete outage of the checkout service lasting 47 minutes. The root cause was a misconfigured connection pool limit deployed in v2.31.0 that exhausted database connections under normal load. All checkout attempts failed with 503 errors during the outage window. The service was restored at 15:19 UTC after rolling back the deployment.
Impact
Quantify the blast radius. Include customer impact, internal impact, and any SLA implications.
- Duration: 47 minutes (14:32–15:19 UTC)
- Users affected: ~8,400 active sessions during the window; all checkout attempts failed
- Revenue impact: Estimated $23,000 in lost transactions based on average order value
- SLA: Breached the 99.9% monthly uptime SLA (third breach this quarter)
- Support tickets opened: 142 during the incident window
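When quantifying revenue impact, it helps to show the arithmetic so stakeholders can audit the estimate. The sketch below reproduces a figure of roughly this size from the session count above; the conversion rate and average order value are illustrative assumptions, not numbers from this incident.

```python
# Back-of-envelope check on a lost-revenue estimate.
# The conversion rate and average order value are ASSUMED for
# illustration -- substitute your own analytics numbers.
sessions_affected = 8_400        # active sessions during the outage window
assumed_conversion_rate = 0.05   # fraction of sessions that would have checked out
assumed_avg_order_value = 55.00  # USD, hypothetical

lost_orders = sessions_affected * assumed_conversion_rate
lost_revenue = lost_orders * assumed_avg_order_value
print(f"~{lost_orders:.0f} lost orders, ~${lost_revenue:,.0f} lost revenue")
```

Showing the inputs this way also makes it easy to give a range (for example, pessimistic and optimistic conversion rates) instead of a single point estimate.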
Timeline
A chronological record of what happened. Include detection, escalation, diagnosis, and resolution steps. Note what was tried and what didn't work.
| Time (UTC) | Event |
|---|---|
| 14:32 | v2.31.0 deployed to production via automated pipeline |
| 14:38 | Checkout error rate crosses 5% threshold; alert fires but is routed to wrong PagerDuty channel |
| 14:51 | Customer support reports spike in "can't complete purchase" tickets; escalates to on-call |
| 14:54 | @jordan acknowledges incident, begins investigation |
| 15:02 | Database connection exhaustion identified in logs |
| 15:07 | Team attempts config-only fix; fails (requires restart) |
| 15:11 | Decision made to roll back to v2.30.1 |
| 15:19 | Rollback complete, checkout error rate returns to baseline |
| 15:25 | All-clear posted in #incidents |
Root Causes
Use 5 Whys to get to systemic causes. Avoid stopping at "human error" — that's never a root cause.
Why did the outage occur? The database connection pool was exhausted under normal load.
Why was the pool exhausted? The max_connections value in v2.31.0 was set to 10 instead of 100, a typo in the config file.
Why wasn't the typo caught? The config change was not covered by automated validation, and the PR reviewer focused on the code change, not the config values.
Why is there no config validation? Config files are treated as static artifacts and are not included in our integration test suite.
Why wasn't the misconfiguration caught in staging? Staging runs at 5% of production load and never exercised the connection limit.
Systemic root cause: Connection pool configuration is not validated automatically, and staging load does not reflect production conditions.
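The first action item below (automated validation for connection pool config) can be surprisingly small. This is a minimal sketch of a CI guard; the key names, bounds, and JSON config format are hypothetical, so adapt them to your actual config schema and pipeline.

```python
# Minimal sketch of a CI guard for connection-pool settings.
# Key names, bounds, and the JSON format are HYPOTHETICAL --
# adapt them to your real config schema.
import json
import sys

BOUNDS = {
    "max_connections": (50, 500),    # sane range for this service (assumed)
    "pool_timeout_seconds": (1, 60),
}

def validate_pool_config(config: dict) -> list[str]:
    """Return a list of human-readable violations (empty means the config passes)."""
    errors = []
    for key, (lo, hi) in BOUNDS.items():
        value = config.get(key)
        if value is None:
            errors.append(f"{key}: missing")
        elif not (lo <= value <= hi):
            errors.append(f"{key}: {value} outside sane range [{lo}, {hi}]")
    return errors

if __name__ == "__main__":
    # Usage: python validate_config.py path/to/pool_config.json
    with open(sys.argv[1]) as f:
        problems = validate_pool_config(json.load(f))
    for p in problems:
        print("CONFIG ERROR:", p)
    sys.exit(1 if problems else 0)
```

A check like this would have rejected the max_connections=10 typo at review time, regardless of how carefully a human read the diff.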
What Worked
Honest credit for what helped contain or resolve the incident.
- Support team's escalation path to on-call was fast and effective once the right channel was reached
- Rollback procedure was well-documented and executed cleanly in under 10 minutes
- Post-incident communication to customers was drafted and sent within 30 minutes of resolution
What Didn't Work
Gaps in tooling, process, or communication that made the incident worse.
- PagerDuty alert routing was misconfigured — alert fired but no one saw it for 13 minutes
- Staging environment does not replicate production load, so it cannot catch this class of bug
- The config change was not reviewed as carefully as the code change in the same PR
Action Items
| Owner | Action | Due Date | Status |
|---|---|---|---|
| @priya | Add automated validation for connection pool config values in CI | 2025-01-28 | Open |
| @jordan | Audit and fix PagerDuty routing rules for all checkout-related alerts | 2025-01-21 | Open |
| @carlos | Document and enforce config review checklist for infrastructure PRs | 2025-02-04 | Open |
| @mia | Investigate load testing options for staging to better simulate production traffic | 2025-02-11 | Open |
Follow-up
The postmortem document is published to the engineering wiki within 48 hours and linked from the incident ticket. A summary is shared in the #engineering channel. Action items are tracked in the project management tool and reviewed at the next engineering all-hands. The on-call team reviews the document during the next on-call handoff. High-severity incidents (P0/P1) are summarized for leadership in a separate brief.