Incident / Outage Sync
Purpose: Coordinate response to an active incident with clear ownership and rapid information sharing
How to run this meeting
The moment an incident is declared, designate an incident commander (IC). The IC does not fix the problem — their job is to run the response. They own the call, assign actions, communicate status externally, and keep the conversation from devolving into chaos. Engineers who are debugging should be focused on debugging, not on managing the process.
Communicate at regular intervals regardless of whether you have new information. A status update that says "still investigating, no change in impact, next update in 30 minutes" is more valuable than silence. Customers, support teams, and stakeholders need to know someone is in control. Set an explicit timer for your next update and stick to it.
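The fixed-cadence update above is easy to mechanize. A minimal sketch, assuming nothing about your incident tooling (the function name and message shape are hypothetical; posting it to Slack or a status page is left to the caller):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def format_status_update(status: str, impact: str, interval_min: int,
                         now: Optional[datetime] = None) -> str:
    """Build a periodic status line with an explicit next-update time baked in."""
    now = now or datetime.now(timezone.utc)
    nxt = now + timedelta(minutes=interval_min)
    return (f"[{now:%H:%M} UTC] Status: {status}. Impact: {impact}. "
            f"Next update at {nxt:%H:%M} UTC.")
```

For example, `format_status_update("still investigating", "no change", 30)` produces exactly the kind of "no new information, but we're on it" message the text recommends, with the next-update commitment included so the IC can be held to it.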
Keep the diagnosis and the blame entirely separate — during the incident and immediately after. "Why did this happen?" is a valid question for the postmortem. During the incident, the only question that matters is "what do we do next?" If the conversation starts to drift toward fault-finding, the IC should redirect immediately. A no-blame culture is not just a value — it's operationally necessary because people will hide information if they're afraid of consequences.
Before the meeting
- Declare the incident in the incident management system and assign a severity (P0–P2)
- Page the on-call engineer and notify the IC
- Open a dedicated incident Slack channel (#incident-YYYY-MM-DD-short-description)
- Post the initial impact assessment within 5 minutes of declaration
- Brief the IC on what is known before the first call begins
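The channel and incident ID formats used in this checklist and the section below (`#incident-YYYY-MM-DD-short-description`, `INC-YYYY-MMDD-NNN`) can be generated consistently. A sketch, with the slug rule (lowercase, runs of non-alphanumerics collapsed to hyphens) as an assumption:

```python
import re
from datetime import date

def incident_channel(short_desc: str, day: date) -> str:
    # Slug rule is an assumption: lowercase, non-alphanumerics collapsed to hyphens
    slug = re.sub(r"[^a-z0-9]+", "-", short_desc.lower()).strip("-")
    return f"#incident-{day:%Y-%m-%d}-{slug}"

def incident_id(day: date, seq: int) -> str:
    # Matches the INC-2026-0312-001 pattern: year, month+day, zero-padded sequence
    return f"INC-{day:%Y}-{day:%m%d}-{seq:03d}"
```

Generating these from one helper keeps the Slack channel, incident record, and timeline entries referring to the same identifier.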
Meeting Details
- Date:
- Facilitator (Incident Commander):
- Attendees:
- Duration: Open — sync every 15–30 minutes until resolved
Incident ID
A unique identifier, severity level, and the current declared impact. Update this in real time.
- Incident ID: INC-2026-0312-001
- Severity: P0
- Declared at: 2026-03-12 14:23 UTC
- Current status: Active — investigating
- Customer impact: Checkout flow returning 503 for approximately 40% of users. Estimated 2,400 affected transactions/hour.
- Affected services: payments-api, order-service
- Status page: Updated at 14:30 UTC — "We are investigating elevated error rates in our checkout flow."
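The affected-transactions figure is just volume times error fraction. A quick sanity check (the 6,000/hour checkout volume is an assumed baseline, chosen to be consistent with the numbers above):

```python
checkout_attempts_per_hour = 6_000  # assumed baseline checkout volume
error_fraction = 0.40               # ~40% of checkout requests returning 503

affected_per_hour = int(checkout_attempts_per_hour * error_fraction)
print(affected_per_hour)  # 2400, matching the declared impact estimate
```

Stating the arithmetic in the impact assessment makes it easy to revise the estimate as error rates change during the incident.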
Timeline of Events
Ordered list of what happened, when. Be precise. Add entries in real time — this becomes the source of truth for the postmortem.
| Time (UTC) | Event |
|---|---|
| 14:18 | Automated alert fires: payments-api p99 latency exceeds 8s |
| 14:21 | On-call @raj acknowledges alert |
| 14:23 | Incident declared P0 by @raj; @sonia assigned as IC |
| 14:25 | @raj confirms: 503s on /v1/checkout endpoint, no 5xx on other routes |
| 14:30 | Status page updated; #incident-0312 Slack channel opened |
| 14:35 | @dev team identifies DB connection pool exhaustion in payments-api logs |
| 14:42 | @kim attempts connection pool increase via config change — no improvement |
Current System State
Describe the system as it is right now. Metrics, error rates, which services are affected, what is healthy.
- payments-api: returning 503 on ~40% of checkout requests; connection pool at 100% utilization
- order-service: elevated error rate as downstream consequence; not the root cause
- auth-service, catalog-service: healthy, no anomalies
- DB primary: CPU at 78% (elevated), no replication lag
- Last successful deploy: 2026-03-12 11:05 UTC (payments-api v2.14.1)
Hypotheses
Ranked list of current theories about root cause. Strike through each one as it is ruled out.
- Connection pool exhaustion caused by slow queries — under investigation by @raj
- ~~Memory leak in v2.14.1~~ — ruled out; heap stable per metrics
- External dependency (Stripe) rate limiting — @kim checking Stripe dashboard
- ~~Increased traffic volume overwhelming current pool size~~ — ruled out; request volume is normal per load balancer metrics
Actions Taken
What has been tried, by whom, and what the result was.
| Time (UTC) | Action | Owner | Result |
|---|---|---|---|
| 14:42 | Increased DB connection pool size to 200 | @kim | No improvement — pool still exhausted |
| 14:50 | Pulled slow query log from DB | @raj | Found 3 queries averaging 12s — details in thread |
| 14:55 | Identified missing index on orders.customer_id | @raj | Likely root cause — preparing fix |
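The missing-index diagnosis can be reproduced in miniature. This SQLite sketch is a stand-in for the production database (schema reduced to the relevant column); it shows the query plan flipping from a full table scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows carry the access-path description in their last column
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer_id = 42"
before = plan(query)  # full table scan: e.g. "SCAN orders"

conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
after = plan(query)   # indexed lookup: e.g. "SEARCH orders USING INDEX idx_orders_customer_id ..."
```

Exact plan wording varies by SQLite version, and on a production database you would use a non-blocking build (e.g. Postgres's `CREATE INDEX CONCURRENTLY`) rather than a plain `CREATE INDEX` during an active incident.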
Owners
Clear role assignments for the duration of the incident.
| Role | Owner |
|---|---|
| Incident Commander | @sonia |
| Technical Lead | @raj |
| Customer Communications | @support-lead |
| Status Page Updates | @sonia |
| Postmortem Owner | TBD at resolution |
Next Update Time
When is the next scheduled status update, internal and external?
- Next internal sync: 15:15 UTC
- Next status page update: 15:00 UTC (regardless of resolution status)
- Exec update: @sonia to ping @cto-slack at 15:00 UTC if not resolved
Action Items
| Owner | Action | Due Date | Status |
|---|---|---|---|
| @raj | Add index on orders.customer_id and validate in staging | ASAP | In Progress |
| @sonia | Update status page at 15:00 UTC | 15:00 UTC | Open |
| @sonia | Ping @cto-slack if not resolved by 15:00 | 15:00 UTC | Open |
| TBD | Schedule postmortem within 48h of resolution | Post-resolution | Open |
Follow-up
When the incident is resolved, post an all-clear to the incident Slack channel and update the status page with resolution details. The IC should send a brief incident summary to the broader engineering team within 2 hours. A blameless postmortem should be scheduled within 48 hours while the details are fresh — do not skip this step for P0/P1 incidents. The postmortem owner should be assigned before the incident call closes.