SRE Reliability Review | Stoa Meeting Templates

Purpose: Assess SLO health, review incident trends, and prioritize reliability work to protect product velocity

How to run this meeting

Start with error budgets, not uptime percentages. The question is never "were we up?" — it's "how much budget did we burn, and can the product team still ship at the pace they need?" Open with a red/yellow/green summary of each SLO so the room immediately knows where to focus attention. If a service is burning its error budget faster than planned, that becomes the first agenda item, not an afterthought.

When reviewing incidents, resist the urge to just count them. Five P2s with a shared root cause tell a very different story from five unrelated P2s. Group incidents by theme (a pattern of database connection pool exhaustion, repeated CDN misconfigurations, or cascading failures from a single dependency) and treat the pattern as the unit of analysis. This is how you find systemic problems before they escalate to P1s. Aim to spend no more than 20 minutes on lookback and at least 25 minutes on forward-looking risk and initiatives.

Reliability work competes directly with feature work, so close every meeting with a clear connection to product velocity. An exhausted error budget is a shipping freeze waiting to happen. Make that trade-off explicit so engineering leadership can make informed prioritization decisions rather than discovering the constraint mid-sprint.

Before the meeting

Pull current error budget burn rates for all tracked SLOs (not just uptime dashboards)
Compile incident log for the review period, grouped by severity and root cause theme
Flag any services approaching a scaling cliff or with known reliability debt
Confirm that on-call engineers from each relevant area have been invited
Pre-circulate the dashboard link so attendees aren't reading charts cold

Meeting Details

Date:
Facilitator:
Attendees:
Duration: 60 minutes (bi-weekly or monthly)

SLO Status

Summarize current performance against each SLO and its error budget. Note burn rate trend — are you burning faster or slower than last period?

Service	SLO Target	Current	Error Budget Remaining	Trend
Checkout API	99.9% availability	99.91%	78%	Stable
Search latency (p99)	< 400ms	387ms	61%	Improving
Notification delivery	99.5% success	99.1%	Exhausted	Degrading

The Notification delivery error budget is exhausted for the month. At the current burn rate, next month's budget would be exhausted as well if nothing changes. A feature freeze on the notifications service is recommended until the budget recovers or the SLO target is renegotiated with product.
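The "Error Budget Remaining" column and the burn rate can be derived from raw good/total event counts. The following is a minimal sketch, not a prescribed method: the function names, the count-based budget definition, and the 25% yellow threshold are all assumptions made for illustration, so adjust them to however your SLOs are actually measured.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def burn_rate(slo_target: float, good: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the window;
    anything above 1.0 exhausts it early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    observed_error_rate = (total - good) / total
    return observed_error_rate / (1.0 - slo_target)


def rag_status(remaining: float, yellow_below: float = 0.25) -> str:
    """Map remaining budget to red/yellow/green (threshold is an assumption)."""
    if remaining <= 0.0:
        return "red"
    if remaining < yellow_below:
        return "yellow"
    return "green"


# Illustrative counts matching a 99.5% target measured at 99.1% success.
remaining = error_budget_remaining(0.995, good=99_100, total=100_000)
print(round(remaining, 2), round(burn_rate(0.995, 99_100, 100_000), 2), rag_status(remaining))
```

With those illustrative counts the budget is overspent (remaining is negative and the burn rate is well above 1), which is the situation the note above describes for Notification delivery.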

Incident Trends

Analyze incidents from the review period by theme, not just count. What patterns are emerging?

Period: Feb 1 – Feb 28 | Total incidents: 11 (3 P1, 4 P2, 4 P3)

Themes identified:

Database connection pool exhaustion (4 incidents): Checkout, Orders, and User services all hit pool limits during peak traffic windows. Root cause differs slightly per service but the trigger is the same — traffic spikes above 3x baseline. This is a scaling cliff, not random failure.
Third-party payment gateway timeouts (2 P1s): Stripe had two degradation events; our retry logic amplified latency rather than isolating it. Circuit breaker configuration needs review.
Unrelated one-offs (5 incidents): No pattern identified. Normal noise.
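Grouping by theme rather than counting can be done with a few lines over an exported incident log. This is a sketch under stated assumptions: the `Incident` record, its field names, and the sample rows are hypothetical and do not reproduce the February log above; the theme label is assumed to be assigned during post-mortem.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    id: str
    severity: str  # e.g. "P1", "P2", "P3"
    service: str
    theme: str     # root-cause theme assigned during post-mortem


def summarize_by_theme(incidents):
    """Return one summary dict per theme, largest themes first."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc.theme].append(inc)

    summary = [
        {
            "theme": theme,
            "count": len(items),
            "services": sorted({i.service for i in items}),
            "severities": dict(Counter(i.severity for i in items)),
        }
        for theme, items in groups.items()
    ]
    # Sort by descending count, then theme name for a stable order.
    summary.sort(key=lambda s: (-s["count"], s["theme"]))
    return summary


# Hypothetical records for illustration only.
log = [
    Incident("INC-1", "P2", "Checkout", "db-pool-exhaustion"),
    Incident("INC-2", "P2", "Orders", "db-pool-exhaustion"),
    Incident("INC-3", "P1", "Payments", "gateway-timeout"),
]
for row in summarize_by_theme(log):
    print(row["theme"], row["count"], row["services"])
```

A theme that spans several services, as in the first hypothetical group, is the signal to treat the pattern as the unit of analysis instead of each incident separately.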

MTTD: 8 min avg (target: < 10 min), on track
MTTR: 47 min avg (target: < 30 min), needs attention; post-mortems to review runbook coverage
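The averages above can be computed from incident timestamps. The sketch below assumes each incident is recorded as a (started, detected, resolved) triple and that MTTR is measured from incident start to resolution; teams that measure from detection should change the subtraction accordingly. The function names and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(deltas) -> float:
    """Average a sequence of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_mttr(incidents):
    """Return (MTTD, MTTR) in minutes.

    incidents: iterable of (started, detected, resolved) datetimes.
    MTTR here is start-to-resolution, which is an assumption.
    """
    rows = list(incidents)
    if not rows:
        raise ValueError("need at least one incident")
    mttd = mean_minutes(detected - started for started, detected, _ in rows)
    mttr = mean_minutes(resolved - started for started, _, resolved in rows)
    return mttd, mttr


# Hypothetical timestamps for illustration only.
t0 = datetime(2026, 2, 3, 12, 0)
sample = [
    (t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=40)),
    (t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=54)),
]
mttd, mttr = mttd_mttr(sample)
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min")
```

Reporting the distribution (for example the slowest incident) alongside the mean can help, since a single long outage can dominate an average over a small number of incidents.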

Risk Areas

Identify services, dependencies, or architectural patterns that pose near-term reliability risk. Prioritize by customer impact, not technical severity.

High — Customer-facing risk:

Connection pool exhaustion: The next sustained traffic spike (expected around the March product launch) is likely to trigger cascading failures across Checkout and Orders. Customer impact: abandoned purchases, revenue loss.
Notification delivery SLO breach: Already customer-visible. Users are reporting missed password reset emails in support tickets.

Medium — Latency creep:

Search p99 is within SLO but has been trending toward the limit for 6 weeks. No headroom if traffic grows as projected.

Low — Internal tooling:

Internal metrics pipeline has had two silent failures this quarter. No customer impact yet, but we're flying blind when it goes down.

Reliability Initiatives

Track in-flight and proposed reliability investments. Connect each to its expected error budget impact and product velocity benefit.

Initiative	Owner	Status	Expected Impact
Connection pool auto-scaling	@priya	In progress (PR open)	Eliminates cliff; unblocks March launch
Circuit breaker for payment gateway	@tomás	Planned — next sprint	Reduces P1 blast radius from Stripe outages
Notification retry queue redesign	@wei	Scoping	Recovers notification SLO within 2 months
Runbook coverage audit	@sre-team	Not started	Targets MTTR reduction to < 30 min

Action Items

Owner	Action	Due Date	Status
@priya	Merge connection pool auto-scaling PR and validate in staging	2026-03-17	Open
@tomás	Draft circuit breaker config proposal for payment gateway	2026-03-20	Open
@wei	Schedule scoping session for notification retry queue redesign	2026-03-15	Open
@sre-team	Audit runbook coverage for all P1/P2 incident categories	2026-03-31	Open
@facilitator	Share error budget status with product leadership before sprint planning	2026-03-14	Open

Follow-up

Distribute notes and the SLO dashboard link to all attendees and engineering leadership within 24 hours. If any SLO is exhausted or at risk, the facilitator should notify the product manager for that area directly — this affects sprint planning. Reliability initiatives should be tracked in the same backlog as feature work, not a separate "tech debt" list, so prioritization trade-offs are visible.
