SRE Reliability Review

Review system reliability.

Purpose: Assess SLO health, review incident trends, and prioritize reliability work to protect product velocity

How to run this meeting

Start with error budgets, not uptime percentages. The question is never "were we up?" — it's "how much budget did we burn, and can the product team still ship at the pace they need?" Open with a red/yellow/green summary of each SLO so the room immediately knows where to focus attention. If a service is burning its error budget faster than planned, that becomes the first agenda item, not an afterthought.
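One way to produce the red/yellow/green summary is to compare how much budget has been spent with how much of the SLO window has elapsed. The sketch below is illustrative only: the function name, the service names, and the rule that "yellow" means spending ahead of an even pace are all assumptions, not a prescribed standard.

```python
def budget_status(budget_remaining: float, window_elapsed: float) -> str:
    """Classify an SLO as red, yellow, or green.

    budget_remaining: fraction of the error budget still unspent (0.0 to 1.0).
    window_elapsed: fraction of the SLO window that has passed (0.0 to 1.0).
    """
    if budget_remaining <= 0.0:
        return "red"  # budget exhausted
    budget_spent = 1.0 - budget_remaining
    if budget_spent > window_elapsed:
        return "yellow"  # spending faster than an even, planned pace
    return "green"


# Hypothetical readings taken halfway through the SLO window.
readings = {"service-a": 0.78, "service-b": 0.40, "service-c": 0.0}
summary = {name: budget_status(left, 0.5) for name, left in readings.items()}
print(summary)
```

Teams that plan an uneven spend (for example, reserving budget for a launch week) would replace the even-pace comparison with their own planned curve.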

When reviewing incidents, resist the urge to just count them. Five P2s with a shared root cause tell a very different story than five unrelated P2s. Group incidents by theme (a pattern of database connection pool exhaustion, repeated CDN misconfigurations, or cascading failures from a single dependency) and treat the pattern as the unit of analysis. This is how you find systemic problems before they escalate to P1s. Aim to spend no more than 20 minutes on lookback and at least 25 minutes on forward-looking risk and initiatives.

Reliability work competes directly with feature work, so close every meeting with a clear connection to product velocity. An exhausted error budget is a shipping freeze waiting to happen. Make that trade-off explicit so engineering leadership can make informed prioritization decisions rather than discovering the constraint mid-sprint.

Before the meeting

  • Pull current error budget burn rates for all tracked SLOs (not just uptime dashboards)
  • Compile incident log for the review period, grouped by severity and root cause theme
  • Flag any services approaching a scaling cliff or with known reliability debt
  • Confirm that on-call engineers from each relevant area have been invited
  • Pre-circulate the dashboard link so attendees aren't reading charts cold

Meeting Details

  • Date:
  • Facilitator:
  • Attendees:
  • Duration: 60 minutes (held every two weeks or monthly)

SLO Status

Summarize current performance against each SLO and its error budget. Note burn rate trend — are you burning faster or slower than last period?

| Service | SLO Target | Current | Error Budget Remaining | Trend |
|---|---|---|---|---|
| Checkout API | 99.9% availability | 99.91% | 78% | Stable |
| Search latency (p99) | < 400ms | 387ms | 61% | Improving |
| Notification delivery | 99.5% success | 99.1% | Exhausted | Degrading |

Notification delivery SLO is exhausted for the month. Current burn rate would also breach next month's budget if unaddressed. Feature freeze on the notifications service recommended until budget recovers or SLO target is renegotiated with product.
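The claim that the current pace would also breach next month's budget can be checked with a burn-rate calculation: the observed failure rate divided by the failure rate the SLO allows. The sketch below uses the notification figures from the table (99.5% target, 99.1% observed). The 30-day window and the assumption that the success rate holds steady are mine, and the function names are illustrative.

```python
def burn_rate(slo_target_pct: float, observed_pct: float) -> float:
    """Ratio of the observed failure rate to the failure rate the SLO allows."""
    allowed_failure = 100.0 - slo_target_pct
    observed_failure = 100.0 - observed_pct
    return observed_failure / allowed_failure


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts at a constant burn rate (assumed 30-day window)."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate


rate = burn_rate(slo_target_pct=99.5, observed_pct=99.1)
print(f"burn rate: {rate:.2f}x")
print(f"a fresh 30-day budget lasts about {days_until_exhausted(rate):.1f} days")
```

A burn rate above 1.0 means the budget runs out before the window ends. At roughly 1.8x, a fresh monthly budget would be gone a little past the halfway point, which is consistent with the recommendation to freeze or renegotiate.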


Incident Trends

Analyze incidents from the review period by theme, not just count. What patterns are emerging?

Period: Feb 1 – Feb 28 | Total incidents: 11 (3 P1, 4 P2, 4 P3)

Themes identified:

  • Database connection pool exhaustion (4 incidents): Checkout, Orders, and User services all hit pool limits during peak traffic windows. Root cause differs slightly per service but the trigger is the same — traffic spikes above 3x baseline. This is a scaling cliff, not random failure.
  • Third-party payment gateway timeouts (2 P1s): Stripe had two degradation events; our retry logic amplified latency rather than isolating it. Circuit breaker configuration needs review.
  • Unrelated one-offs (5 incidents): No pattern identified. Normal noise.
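The theme breakdown above can be reproduced from a tagged incident log. This is a minimal sketch: the incident counts mirror the February summary, but the tag strings and the two-incident cutoff for calling something a pattern are assumptions for illustration.

```python
from collections import Counter

# One entry per incident, tagged with its root-cause theme. Counts mirror the
# February summary; the tag strings themselves are illustrative.
incident_themes = (
    ["db-connection-pool"] * 4
    + ["payment-gateway-timeout"] * 2
    + ["one-off"] * 5
)

PATTERN_THRESHOLD = 2  # assumed cutoff: two or more incidents sharing a theme

counts = Counter(incident_themes)
patterns = {
    theme: n
    for theme, n in counts.most_common()
    if theme != "one-off" and n >= PATTERN_THRESHOLD
}

print(f"total incidents: {sum(counts.values())}")
for theme, n in patterns.items():
    print(f"pattern: {theme} ({n} incidents)")
```

Checking that the themed counts sum to the period total (11 here) is a cheap guard against incidents that were never tagged.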

  • MTTD: 8 min avg (target: < 10 min). On track.
  • MTTR: 47 min avg (target: < 30 min). Needs attention; post-mortems to review runbook coverage.
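For reference, averages like these can be derived from incident timestamps. The records below are hypothetical, and the sketch assumes MTTR is measured from incident start to resolution; some teams measure from detection instead, so state the definition alongside the number.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2026, 2, 3, 14, 0), datetime(2026, 2, 3, 14, 6), datetime(2026, 2, 3, 14, 50)),
    (datetime(2026, 2, 11, 9, 30), datetime(2026, 2, 11, 9, 40), datetime(2026, 2, 11, 10, 14)),
]


def minutes(delta: timedelta) -> float:
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


# Mean time to detect: start of impact until the team knows about it.
mttd = mean(minutes(detected - started) for started, detected, _ in incidents)
# Mean time to resolve: start of impact until resolution (assumed definition).
mttr = mean(minutes(resolved - started) for started, _, resolved in incidents)

print(f"MTTD: {mttd:.0f} min (target < 10): {'on track' if mttd < 10 else 'needs attention'}")
print(f"MTTR: {mttr:.0f} min (target < 30): {'on track' if mttr < 30 else 'needs attention'}")
```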


Risk Areas

Identify services, dependencies, or architectural patterns that pose near-term reliability risk. Prioritize by customer impact, not technical severity.

High — Customer-facing risk:

  • Connection pool exhaustion: Next sustained traffic spike (expected around March product launch) will trigger cascading failures across Checkout and Orders. Customer impact: abandoned purchases, revenue loss.
  • Notification delivery SLO breach: Already customer-visible. Users reporting missed password reset emails in support tickets.

Medium — Latency creep:

  • Search p99 is within SLO but has been trending toward the limit for 6 weeks. No headroom if traffic grows as projected.

Low — Internal tooling:

  • Internal metrics pipeline has had two silent failures this quarter. No customer impact yet, but we're flying blind when it goes down.

Reliability Initiatives

Track in-flight and proposed reliability investments. Connect each to its expected error budget impact and product velocity benefit.

| Initiative | Owner | Status | Expected Impact |
|---|---|---|---|
| Connection pool auto-scaling | @priya | In progress (PR open) | Eliminates cliff; unblocks March launch |
| Circuit breaker for payment gateway | @tomás | Planned (next sprint) | Reduces P1 blast radius from Stripe outages |
| Notification retry queue redesign | @wei | Scoping | Recovers notification SLO within 2 months |
| Runbook coverage audit | @sre-team | Not started | Targets MTTR reduction to < 30 min |
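For attendees unfamiliar with the circuit breaker initiative, the idea is to stop calling a degraded dependency after repeated failures rather than retrying into it. The class below is a minimal, generic sketch, not the team's implementation or any library's API; the class name, thresholds, and the stand-in gateway function are all hypothetical.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated failures instead of retrying into a degraded dependency."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_gateway():
    """Hypothetical stand-in for a payment gateway call that is timing out."""
    raise TimeoutError("gateway timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=60.0)
outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky_gateway)
    except TimeoutError:
        outcomes.append("timeout")
    except RuntimeError:
        outcomes.append("rejected")
print(outcomes)
```

After two timeouts the breaker opens, so the remaining calls are rejected immediately instead of adding latency. The threshold and cool-down are the values a configuration review would need to settle.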

Action Items

| Owner | Action | Due Date | Status |
|---|---|---|---|
| @priya | Merge connection pool auto-scaling PR and validate in staging | 2026-03-17 | Open |
| @tomás | Draft circuit breaker config proposal for payment gateway | 2026-03-20 | Open |
| @wei | Schedule scoping session for notification retry queue redesign | 2026-03-15 | Open |
| @sre-team | Audit runbook coverage for all P1/P2 incident categories | 2026-03-31 | Open |
| @facilitator | Share error budget status with product leadership before sprint planning | 2026-03-14 | Open |

Follow-up

Distribute notes and the SLO dashboard link to all attendees and engineering leadership within 24 hours. If any SLO is exhausted or at risk, the facilitator should notify the product manager for that area directly — this affects sprint planning. Reliability initiatives should be tracked in the same backlog as feature work, not a separate "tech debt" list, so prioritization trade-offs are visible.
