SRE Reliability Review
Purpose: Assess SLO health, review incident trends, and prioritize reliability work to protect product velocity
How to run this meeting
Start with error budgets, not uptime percentages. The question is never "were we up?" — it's "how much budget did we burn, and can the product team still ship at the pace they need?" Open with a red/yellow/green summary of each SLO so the room immediately knows where to focus attention. If a service is burning its error budget faster than planned, that becomes the first agenda item, not an afterthought.
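As a rough illustration of the red/yellow/green summary, the sketch below computes remaining error budget from event counts and maps it to a color. The `SloWindow` type, the event counts, and the 20% / 50% cutoffs are all assumptions made for the example, not values from any particular monitoring tool; set the thresholds to match your own policy.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    name: str
    target: float       # SLO target as a fraction, e.g. 0.999 for 99.9%
    total_events: int   # events expected over the whole SLO window (assumed known)
    bad_events: int     # events that violated the SLO so far in the window


def budget_remaining(slo: SloWindow) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_bad = (1.0 - slo.target) * slo.total_events
    if allowed_bad <= 0:
        raise ValueError("a 100% target leaves no error budget to measure against")
    return 1.0 - slo.bad_events / allowed_bad


def rag_status(remaining: float, red_below: float = 0.2, yellow_below: float = 0.5) -> str:
    """Map remaining budget to red/yellow/green using hypothetical cutoffs."""
    if remaining < red_below:
        return "red"
    if remaining < yellow_below:
        return "yellow"
    return "green"


# Hypothetical counts, chosen only to show a healthy and an overspent budget.
checkout = SloWindow("Checkout API", target=0.999, total_events=1_000_000, bad_events=220)
notifications = SloWindow("Notification delivery", target=0.995, total_events=200_000, bad_events=1_800)

for slo in (checkout, notifications):
    remaining = budget_remaining(slo)
    print(f"{slo.name}: {remaining:.0%} budget remaining -> {rag_status(remaining)}")
```

A negative result means the budget is overspent, which is usually worth showing as-is rather than clamping to zero, since the size of the overspend matters for the freeze conversation.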
When reviewing incidents, resist the urge to just count them. Five P2s with a shared root cause tell a very different story than five unrelated P2s. Group incidents by theme — a pattern of database connection pool exhaustion, repeated CDN misconfigurations, or cascading failures from a single dependency — and treat the pattern as the unit of analysis. This is how you find systemic problems before they escalate to SEV-1s. Aim to spend no more than 20 minutes on lookback and at least 25 minutes on forward-looking risk and initiatives.
Reliability work competes directly with feature work, so close every meeting with a clear connection to product velocity. An exhausted error budget is a shipping freeze waiting to happen. Make that trade-off explicit so engineering leadership can make informed prioritization decisions rather than discovering the constraint mid-sprint.
Before the meeting
- Pull current error budget burn rates for all tracked SLOs (not just uptime dashboards)
- Compile incident log for the review period, grouped by severity and root cause theme
- Flag any services approaching a scaling cliff or with known reliability debt
- Confirm that on-call engineers from each relevant area have been invited
- Pre-circulate the dashboard link so attendees aren't reading charts cold
Meeting Details
- Date:
- Facilitator:
- Attendees:
- Duration: 60 minutes (bi-weekly or monthly)
SLO Status
Summarize current performance against each SLO and its error budget. Note burn rate trend — are you burning faster or slower than last period?
| Service | SLO Target | Current | Error Budget Remaining | Trend |
|---|---|---|---|---|
| Checkout API | 99.9% availability | 99.91% | 78% | Stable |
| Search latency (p99) | < 400ms | 387ms | 61% | Improving |
| Notification delivery | 99.5% success | 99.1% | Exhausted | Degrading |
The notification delivery error budget is exhausted for the month. At the current burn rate, next month's budget would also be exhausted if the problem is left unaddressed. Recommend a feature freeze on the notifications service until the budget recovers or the SLO target is renegotiated with product.
Incident Trends
Analyze incidents from the review period by theme, not just count. What patterns are emerging?
Period: Feb 1 – Feb 28 | Total incidents: 11 (3 P1, 4 P2, 4 P3)
Themes identified:
- Database connection pool exhaustion (4 incidents): Checkout, Orders, and User services all hit pool limits during peak traffic windows. Root cause differs slightly per service but the trigger is the same — traffic spikes above 3x baseline. This is a scaling cliff, not random failure.
- Third-party payment gateway timeouts (2 P1s): Stripe had two degradation events; our retry logic amplified latency rather than isolating it. Circuit breaker configuration needs review.
- Unrelated one-offs (5 incidents): No pattern identified. Normal noise.
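Grouping by theme can be done with a few lines of scripting once each incident carries a theme tag. The sketch below assumes incidents are exported as dictionaries with `severity`, `service`, and `theme` fields; those field names and the sample records are hypothetical and do not reflect any specific incident tracker's schema.

```python
from collections import defaultdict

SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3}  # lower rank means more severe


def summarize_themes(incidents):
    """Return one summary per theme, largest theme first."""
    by_theme = defaultdict(list)
    for incident in incidents:
        by_theme[incident["theme"]].append(incident)

    summaries = []
    for theme, items in by_theme.items():
        summaries.append({
            "theme": theme,
            "count": len(items),
            # Distinct services hit by this theme, to show how far it spreads.
            "services": sorted({item["service"] for item in items}),
            "worst_severity": min(
                (item["severity"] for item in items),
                key=lambda sev: SEVERITY_RANK[sev],
            ),
        })
    return sorted(summaries, key=lambda s: s["count"], reverse=True)


# Hypothetical records for illustration only.
incidents = [
    {"id": 1, "severity": "P2", "service": "checkout", "theme": "db-pool-exhaustion"},
    {"id": 2, "severity": "P2", "service": "orders", "theme": "db-pool-exhaustion"},
    {"id": 3, "severity": "P3", "service": "users", "theme": "db-pool-exhaustion"},
    {"id": 4, "severity": "P1", "service": "payments", "theme": "gateway-timeout"},
    {"id": 5, "severity": "P3", "service": "search", "theme": "one-off"},
]

summary = summarize_themes(incidents)
for row in summary:
    print(f'{row["theme"]}: {row["count"]} incidents, worst {row["worst_severity"]}, '
          f'services: {", ".join(row["services"])}')
```

Listing the affected services next to each count makes it easier to spot a theme that spans several teams, which is often the signal that the problem is systemic.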
MTTD: 8 min avg (target: < 10 min); on track
MTTR: 47 min avg (target: < 30 min); needs attention; post-mortems to review runbook coverage
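If the incident log records timestamps, the two averages can be computed directly rather than estimated. This sketch assumes each incident has `started`, `detected`, and `resolved` timestamps and measures MTTR from incident start to resolution; some teams measure it from detection instead, so treat that choice, the field names, and the sample timestamps as assumptions.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_mttr(incidents):
    """Return (mean time to detect, mean time to resolve) in minutes."""
    detect = [minutes_between(i["started"], i["detected"]) for i in incidents]
    # Assumption: resolution time is measured from incident start, not detection.
    resolve = [minutes_between(i["started"], i["resolved"]) for i in incidents]
    return mean(detect), mean(resolve)


# Hypothetical timestamps for illustration only.
sample = [
    {
        "started": datetime(2026, 2, 3, 10, 0),
        "detected": datetime(2026, 2, 3, 10, 6),
        "resolved": datetime(2026, 2, 3, 10, 40),
    },
    {
        "started": datetime(2026, 2, 17, 14, 0),
        "detected": datetime(2026, 2, 17, 14, 10),
        "resolved": datetime(2026, 2, 17, 14, 54),
    },
]

mttd, mttr = mttd_mttr(sample)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")
```

Averages hide outliers, so it can be worth reporting the slowest incident alongside the mean when MTTR is over target.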
Risk Areas
Identify services, dependencies, or architectural patterns that pose near-term reliability risk. Prioritize by customer impact, not technical severity.
High — Customer-facing risk:
- Connection pool exhaustion: Next sustained traffic spike (expected around March product launch) will trigger cascading failures across Checkout and Orders. Customer impact: abandoned purchases, revenue loss.
- Notification delivery SLO breach: Already customer-visible. Users reporting missed password reset emails in support tickets.
Medium — Latency creep:
- Search p99 is within SLO but has been trending toward the limit for 6 weeks. No headroom if traffic grows as projected.
Low — Internal tooling:
- Internal metrics pipeline has had two silent failures this quarter. No customer impact yet, but we're flying blind when it goes down.
Reliability Initiatives
Track in-flight and proposed reliability investments. Connect each to its expected error budget impact and product velocity benefit.
| Initiative | Owner | Status | Expected Impact |
|---|---|---|---|
| Connection pool auto-scaling | @priya | In progress (PR open) | Eliminates cliff; unblocks March launch |
| Circuit breaker for payment gateway | @tomás | Planned — next sprint | Reduces P1 blast radius from Stripe outages |
| Notification retry queue redesign | @wei | Scoping | Recovers notification SLO within 2 months |
| Runbook coverage audit | @sre-team | Not started | Targets MTTR reduction to < 30 min |
Action Items
| Owner | Action | Due Date | Status |
|---|---|---|---|
| @priya | Merge connection pool auto-scaling PR and validate in staging | 2026-03-17 | Open |
| @tomás | Draft circuit breaker config proposal for payment gateway | 2026-03-20 | Open |
| @wei | Schedule scoping session for notification retry queue redesign | 2026-03-15 | Open |
| @sre-team | Audit runbook coverage for all P1/P2 incident categories | 2026-03-31 | Open |
| @facilitator | Share error budget status with product leadership before sprint planning | 2026-03-14 | Open |
Follow-up
Distribute notes and the SLO dashboard link to all attendees and engineering leadership within 24 hours. If any error budget is exhausted or any SLO is at risk, the facilitator should notify the product manager for that area directly, because this affects sprint planning. Reliability initiatives should be tracked in the same backlog as feature work, not a separate "tech debt" list, so prioritization trade-offs are visible.