Incident / Outage Sync
Purpose: Coordinate response to an active incident with clear ownership and rapid information sharing
How to run this meeting
The moment an incident is declared, designate an incident commander (IC). The IC does not fix the problem — their job is to run the response. They own the call, assign actions, communicate status externally, and keep the conversation from devolving into chaos. Engineers who are debugging should be focused on debugging, not on managing the process.
Communicate at regular intervals regardless of whether you have new information. A status update that says "still investigating, no change in impact, next update in 30 minutes" is more valuable than silence. Customers, support teams, and stakeholders need to know someone is in control. Set an explicit timer for your next update and stick to it.
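The fixed-cadence update above is easy to mechanize. A minimal sketch, assuming nothing about your incident tooling (the function name and message shape are hypothetical; posting it to Slack or a status page is left to the caller):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def format_status_update(status: str, impact: str, interval_min: int,
                         now: Optional[datetime] = None) -> str:
    """Build a periodic status line with an explicit next-update time baked in."""
    now = now or datetime.now(timezone.utc)
    nxt = now + timedelta(minutes=interval_min)
    return (f"[{now:%H:%M} UTC] Status: {status}. Impact: {impact}. "
            f"Next update at {nxt:%H:%M} UTC.")
```

For example, `format_status_update("still investigating", "no change", 30)` produces exactly the kind of "no new information, but we're on it" message the text recommends, with the next-update commitment included so the IC can be held to it.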
Keep the diagnosis and the blame entirely separate — during the incident and immediately after. "Why did this happen?" is a valid question for the postmortem. During the incident, the only question that matters is "what do we do next?" If the conversation starts to drift toward fault-finding, the IC should redirect immediately. A no-blame culture is not just a value — it's operationally necessary because people will hide information if they're afraid of consequences.
Before the meeting
- Declare the incident in the incident management system and assign a severity (P0–P2)
- Page the on-call engineer and notify the IC
- Open a dedicated incident Slack channel (#incident-YYYY-MM-DD-short-description)
- Post the initial impact assessment within 5 minutes of declaration
- Brief the IC on what is known before the first call begins
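The channel and incident ID formats used in this checklist and the section below (`#incident-YYYY-MM-DD-short-description`, `INC-YYYY-MMDD-NNN`) can be generated consistently. A sketch, with the slug rule (lowercase, runs of non-alphanumerics collapsed to hyphens) as an assumption:

```python
import re
from datetime import date

def incident_channel(short_desc: str, day: date) -> str:
    # Slug rule is an assumption: lowercase, non-alphanumerics collapsed to hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", short_desc.lower()).strip("-")
    return f"#incident-{day:%Y-%m-%d}-{slug}"

def incident_id(day: date, seq: int) -> str:
    # Matches the INC-2026-0312-001 pattern: year, month+day, zero-padded sequence
    return f"INC-{day:%Y}-{day:%m%d}-{seq:03d}"
```

Generating these from one helper keeps the Slack channel, incident record, and timeline entries referring to the same identifier.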
Meeting Details
- Date:
- Facilitator (Incident Commander):
- Attendees:
- Duration: Open — sync every 15–30 minutes until resolved
Incident ID
A unique identifier, severity level, and the current declared impact. Update this in real time.
- Incident ID: INC-2026-0312-001
- Severity: P0
- Declared at: 2026-03-12 14:23 UTC
- Current status: Active — investigating
- Customer impact: Checkout flow returning 503 for approximately 40% of users. Estimated 2,400 affected transactions/hour.
- Affected services: payments-api, order-service
- Status page: Updated at 14:30 UTC — "We are investigating elevated error rates in our checkout flow."
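The affected-transactions figure is just volume times error fraction. A quick sanity check (the 6,000/hour checkout volume is an assumed baseline, chosen to be consistent with the numbers above):

```python
checkout_attempts_per_hour = 6_000  # assumed baseline checkout volume
error_fraction = 0.40               # ~40% of checkout requests returning 503

affected_per_hour = int(checkout_attempts_per_hour * error_fraction)
print(affected_per_hour)  # 2400, matching the declared impact estimate
```

Stating the arithmetic in the impact assessment makes it easy to revise the estimate as error rates change during the incident.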
Timeline of Events
Ordered list of what happened, when. Be precise. Add entries in real time — this becomes the source of truth for the postmortem.
| Time (UTC) | Event |
|---|---|
| 14:18 | Automated alert fires: payments-api p99 latency exceeds 8s |
| 14:21 | On-call @raj acknowledges alert |
| 14:23 | Incident declared P0 by @raj; @sonia assigned as IC |
| 14:25 | @raj confirms: 503s on /v1/checkout endpoint, no 5xx on other routes |
| 14:30 | Status page updated; #incident-0312 Slack channel opened |
| 14:35 | @dev team identifies DB connection pool exhaustion in payments-api logs |
| 14:42 | @kim attempts connection pool increase via config change — no improvement |
Current System State
Describe the system as it is right now. Metrics, error rates, which services are affected, what is healthy.
- payments-api: returning 503 on ~40% of checkout requests; connection pool at 100% utilization
- order-service: elevated error rate as downstream consequence; not the root cause
- auth-service, catalog-service: healthy, no anomalies
- DB primary: CPU at 78% (elevated), no replication lag
- Last successful deploy: 2026-03-12 11:05 UTC (payments-api v2.14.1)
Hypotheses
Ranked list of current theories about root cause. Strike through each one as it is ruled out.
- Connection pool exhaustion caused by slow queries — under investigation by @raj
- ~~Memory leak in v2.14.1~~ — ruled out; heap stable per metrics
- External dependency (Stripe) rate limiting — @kim checking Stripe dashboard
- ~~Increased traffic volume overwhelming current pool size~~ — ruled out; request volume is normal per load balancer metrics
Actions Taken
What has been tried, by whom, and what the result was.
| Time (UTC) | Action | Owner | Result |
|---|---|---|---|
| 14:42 | Increased DB connection pool size to 200 | @kim | No improvement — pool still exhausted |
| 14:50 | Pulled slow query log from DB | @raj | Found 3 queries averaging 12s — details in thread |
| 14:55 | Identified missing index on orders.customer_id | @raj | Likely root cause — preparing fix |
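The missing-index diagnosis can be reproduced in miniature. This SQLite sketch is a stand-in for the production database (schema reduced to the relevant column); it shows the query plan flipping from a full table scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows carry the access-path description in their last column
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)  # full table scan: e.g. "SCAN orders"

conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
after = plan(query)   # indexed lookup: e.g. "SEARCH orders USING INDEX idx_orders_customer_id ..."
```

Exact plan wording varies by SQLite version, and on a production database you would use a non-blocking build (e.g. Postgres's `CREATE INDEX CONCURRENTLY`) rather than a plain `CREATE INDEX` during an active incident.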
Owners
Clear role assignments for the duration of the incident.
| Role | Owner |
|---|---|
| Incident Commander | @sonia |
| Technical Lead | @raj |
| Customer Communications | @support-lead |
| Status Page Updates | @sonia |
| Postmortem Owner | TBD at resolution |
Next Update Time
When is the next scheduled status update, internal and external?
- Next internal sync: 15:15 UTC
- Next status page update: 15:00 UTC (regardless of resolution status)
- Exec update: @sonia to ping @cto-slack at 15:00 UTC if not resolved
Action Items
| Owner | Action | Due Date | Status |
|---|---|---|---|
| @raj | Add index on orders.customer_id and validate in staging | ASAP | In Progress |
| @sonia | Update status page at 15:00 UTC | 15:00 UTC | Open |
| @sonia | Ping @cto-slack if not resolved by 15:00 | 15:00 UTC | Open |
| TBD | Schedule postmortem within 48h of resolution | Post-resolution | Open |
Follow-up
When the incident is resolved, post an all-clear to the incident Slack channel and update the status page with resolution details. The IC should send a brief incident summary to the broader engineering team within 2 hours. A blameless postmortem should be scheduled within 48 hours while the details are fresh — do not skip this step for P0/P1 incidents. The postmortem owner should be assigned before the incident call closes.