
Incident / Outage Sync

Coordinate during an active production issue.

Purpose: Coordinate response to an active incident with clear ownership and rapid information sharing

How to run this meeting

The moment an incident is declared, designate an incident commander (IC). The IC does not fix the problem — their job is to run the response. They own the call, assign actions, communicate status externally, and keep the conversation from devolving into chaos. Engineers who are debugging should be focused on debugging, not on managing the process.

Communicate at regular intervals regardless of whether you have new information. A status update that says "still investigating, no change in impact, next update in 30 minutes" is more valuable than silence. Customers, support teams, and stakeholders need to know someone is in control. Set an explicit timer for your next update and stick to it.
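The cadence above can be made mechanical rather than left to memory. A minimal sketch (the function name and 30-minute default are illustrative, not part of any incident tooling):

```python
from datetime import datetime, timedelta, timezone

def next_update_due(last_update: datetime, interval_minutes: int = 30) -> datetime:
    """Return when the next status update is owed, given the last one sent.

    The IC commits to this time publicly ("next update in 30 minutes")
    and posts at that time even if nothing has changed.
    """
    return last_update + timedelta(minutes=interval_minutes)

# Example: status page last updated at 14:30 UTC, so the next update
# is owed at 15:00 UTC regardless of resolution status.
last = datetime(2026, 3, 12, 14, 30, tzinfo=timezone.utc)
print(next_update_due(last).strftime("%H:%M UTC"))
```

In practice this would be wired to a bot or reminder in the incident channel; the point is that the deadline is computed and announced, not remembered.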

Keep the diagnosis and the blame entirely separate — during the incident and immediately after. "Why did this happen?" is a valid question for the postmortem. During the incident, the only question that matters is "what do we do next?" If the conversation starts to drift toward fault-finding, the IC should redirect immediately. A no-blame culture is not just a value — it's operationally necessary because people will hide information if they're afraid of consequences.

Before the meeting

  • Declare the incident in the incident management system and assign a severity (P0–P2)
  • Page the on-call engineer and notify the IC
  • Open a dedicated incident Slack channel (#incident-YYYY-MM-DD-short-description)
  • Post the initial impact assessment within 5 minutes of declaration
  • Brief the IC on what is known before the first call begins
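The channel-naming convention in the checklist is easy to get wrong by hand mid-incident. A small sketch that derives the name from the declaration time and a short description (the function name is illustrative; only the #incident-YYYY-MM-DD-short-description format comes from the checklist):

```python
from datetime import datetime, timezone

def incident_channel_name(short_description: str, declared_at: datetime) -> str:
    """Build a Slack channel name following the
    #incident-YYYY-MM-DD-short-description convention."""
    slug = short_description.lower().strip().replace(" ", "-")
    return f"#incident-{declared_at.strftime('%Y-%m-%d')}-{slug}"

declared = datetime(2026, 3, 12, 14, 23, tzinfo=timezone.utc)
print(incident_channel_name("checkout 503s", declared))
```

Generating the name (and the initial impact post) from a template keeps the first five minutes of the response consistent across incidents.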

Meeting Details

  • Date:
  • Facilitator (Incident Commander):
  • Attendees:
  • Duration: Open-ended — sync every 15–30 minutes until resolved

Incident ID

A unique identifier, severity level, and the current declared impact. Update this in real time.

  • Incident ID: INC-2026-0312-001
  • Severity: P0
  • Declared at: 2026-03-12 14:23 UTC
  • Current status: Active — investigating
  • Customer impact: Checkout flow returning 503 for approximately 40% of users. Estimated 2,400 affected transactions/hour.
  • Affected services: payments-api, order-service
  • Status page: Updated at 14:30 UTC — "We are investigating elevated error rates in our checkout flow."

Timeline of Events

Ordered list of what happened, when. Be precise. Add entries in real time — this becomes the source of truth for the postmortem.

Time (UTC) | Event
14:18 | Automated alert fires: payments-api p99 latency exceeds 8s
14:21 | On-call @raj acknowledges alert
14:23 | Incident declared P0 by @raj; @sonia assigned as IC
14:25 | @raj confirms: 503s on /v1/checkout endpoint, no 5xx on other routes
14:30 | Status page updated; #incident-0312 Slack channel opened
14:35 | @dev team identifies DB connection pool exhaustion in payments-api logs
14:42 | @kim attempts connection pool increase via config change — no improvement

Current System State

Describe the system as it is right now. Metrics, error rates, which services are affected, what is healthy.

  • payments-api: returning 503 on ~40% of checkout requests; connection pool at 100% utilization
  • order-service: elevated error rate as downstream consequence; not the root cause
  • auth-service, catalog-service: healthy, no anomalies
  • DB primary: CPU at 78% (elevated), no replication lag
  • Last successful deploy: 2026-03-12 11:05 UTC (payments-api v2.14.1)

Hypotheses

Ranked list of current theories about root cause. Strike through each one as it is ruled out.

  1. Connection pool exhaustion caused by slow queries — under investigation by @raj
  2. Memory leak in v2.14.1 — ruled out; heap stable per metrics
  3. External dependency (Stripe) rate limiting — @kim checking Stripe dashboard
  4. Increased traffic volume overwhelming current pool size — ruled out; request volume is normal per load balancer metrics

Actions Taken

What has been tried, by whom, and what the result was.

Time (UTC) | Action | Owner | Result
14:42 | Increased DB connection pool size to 200 | @kim | No improvement — pool still exhausted
14:50 | Pulled slow query log from DB | @raj | Found 3 queries averaging 12s — details in thread
14:55 | Identified missing index on orders.customer_id | @raj | Likely root cause — preparing fix

Owners

Clear role assignments for the duration of the incident.

Role | Owner
Incident Commander | @sonia
Technical Lead | @raj
Customer Communications | @support-lead
Status Page Updates | @sonia
Postmortem Owner | TBD at resolution

Next Update Time

When is the next scheduled status update, internal and external?

  • Next internal sync: 15:15 UTC
  • Next status page update: 15:00 UTC (regardless of resolution status)
  • Exec update: @sonia to ping @cto-slack at 15:00 UTC if not resolved

Action Items

Owner | Action | Due Date | Status
@raj | Add index on orders.customer_id and validate in staging | ASAP | In Progress
@sonia | Update status page at 15:00 UTC | 15:00 UTC | Open
@sonia | Ping @cto-slack if not resolved by 15:00 | 15:00 UTC | Open
TBD | Schedule postmortem within 48h of resolution | Post-resolution | Open

Follow-up

When the incident is resolved, post an all-clear to the incident Slack channel and update the status page with resolution details. The IC should send a brief incident summary to the broader engineering team within 2 hours. A blameless postmortem should be scheduled within 48 hours while the details are fresh — do not skip this step for P0/P1 incidents. The postmortem owner should be assigned before the incident call closes.
