
Capacity Planning

Ensure infrastructure scales ahead of demand.


Purpose: Project infrastructure needs 3–6 months ahead to prevent scaling failures and control costs.

How to run this meeting

Plan for peak load, not average load. Average traffic is comfortable; peak traffic is where systems fail and where customers leave. Before the meeting begins, every participant should know the highest traffic event in the past quarter — whether that was a product launch, a marketing campaign, or an organic spike — and use that as the floor, not the ceiling, for projections. Teams that plan for average load are perpetually surprised by peaks they could have predicted.

Distinguish between scaling cliffs and gradual growth. A gradual 20% traffic increase gives you time to react. A database that works fine at 70% capacity but falls over at 71% due to lock contention is a cliff — and cliffs don't send warning signals until you've already gone over the edge. Spend the first half of the meeting identifying any cliffs in your current architecture before discussing growth projections. One unaddressed cliff is more dangerous than three months of gradual growth you haven't planned for yet.

Cost projections belong in this meeting, not as a separate finance exercise. Every scaling decision is also a budget decision. When you're choosing between vertical scaling and a queue-based architecture redesign, the 12-month cost differential should be on the table alongside the engineering effort. Bring your current infrastructure bill and model out at least two scenarios — conservative growth and aggressive growth — so leadership can make an informed decision rather than a surprised one when the invoice arrives.
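The two-scenario cost model described above can be sketched in a few lines. This is a minimal illustration, not a billing tool: the $12,000/mo baseline and the growth rates are hypothetical placeholders — substitute your actual infrastructure bill and observed month-over-month growth.

```python
def project_monthly_cost(baseline, mom_growth, months):
    """Compound a monthly infrastructure bill forward by a MoM growth rate."""
    return [round(baseline * (1 + mom_growth) ** m) for m in range(1, months + 1)]

baseline = 12_000  # hypothetical current monthly bill (USD)

conservative = project_monthly_cost(baseline, 0.08, 12)  # 8% MoM
aggressive = project_monthly_cost(baseline, 0.15, 12)    # 15% MoM

# 12-month differential between the scenarios -- the number leadership
# needs on the table alongside the engineering effort estimate
differential = aggressive[-1] - conservative[-1]
```

Presenting both curves side by side turns "the bill went up" into a decision that was made on purpose.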

Before the meeting

  • Export current usage metrics for all major services: traffic (RPS), storage growth rate, compute utilization (avg and peak), database connections and query volume
  • Identify the peak load event from the past 90 days and document the traffic multiplier vs. average
  • Pull infrastructure cost data for the past 3 months and calculate month-over-month growth rate
  • Check vendor contract limits, reserved instance commitments, and any hard quotas
  • Prepare a 6-month traffic forecast using at least two growth scenarios (conservative and optimistic)
  • Invite: SRE lead, backend engineering leads, EM or engineering director, finance/FinOps if available

Meeting Details

  • Date:
  • Facilitator:
  • Attendees:
  • Duration: 75 minutes (quarterly)

Usage Metrics

Summarize current system utilization across key dimensions. Include both average and peak values — never report one without the other.

Snapshot: February 2026

| Resource | Avg Daily | Peak (last 90 days) | Peak Event | Capacity Limit | Headroom |
| --- | --- | --- | --- | --- | --- |
| API requests (RPS) | 4,200 | 9,800 | Jan 18 launch | ~15,000 | 35% |
| Primary DB connections | 340 | 780 | Jan 18 launch | 1,000 | 22% |
| Storage (Postgres) | 2.1 TB | | | 4 TB | 47% |
| Object storage (S3) | 18 TB | | | Unlimited | |
| App server CPU (p95) | 41% | 88% | Jan 18 launch | 100% | 12% |
| Redis memory | 14 GB | 19 GB | Jan 18 launch | 25 GB | 24% |

Key observation: App server CPU and DB connections both have less than 25% headroom at peak, and Redis memory sits just under the threshold at 24%. A traffic event 15% larger than January's launch would likely trigger degradation in CPU and DB connections.
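The headroom check is simple enough to automate so it never depends on someone eyeballing the table. A sketch using the snapshot numbers above (the 25% threshold is this template's convention, not a universal constant):

```python
# Peak usage vs. capacity limit, taken from the February snapshot table
resources = {
    "API requests (RPS)": (9_800, 15_000),
    "Primary DB connections": (780, 1_000),
    "Storage (Postgres, TB)": (2.1, 4.0),
    "App server CPU (%)": (88, 100),
    "Redis memory (GB)": (19, 25),
}

def headroom(peak, limit):
    """Fraction of capacity still unused at peak load."""
    return 1 - peak / limit

# Flag anything with less than 25% headroom at peak
at_risk = {name: round(headroom(peak, limit) * 100)
           for name, (peak, limit) in resources.items()
           if headroom(peak, limit) < 0.25}
```

Running this against the snapshot flags DB connections, app server CPU, and Redis memory — the same resources called out above.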


Forecast

Project usage 3 and 6 months out under conservative and aggressive growth scenarios. Tie forecasts to known business events (launches, campaigns, seasonal patterns).

Assumptions:

  • Conservative: 8% MoM traffic growth, no major launches
  • Aggressive: 15% MoM traffic growth, two product launches (April, June) at 2.5x baseline each

| Metric | Current Peak | 3-Month Conservative | 3-Month Aggressive | 6-Month Conservative | 6-Month Aggressive |
| --- | --- | --- | --- | --- | --- |
| API RPS (peak) | 9,800 | 12,300 | 16,100 | 15,500 | 24,400 |
| DB connections (peak) | 780 | 980 | 1,280 | 1,235 | 1,940 |
| App server CPU (p95) | 88% | ~110% | ~145% | ~140% | ~215% |
| Storage (Postgres) | 2.1 TB | 2.8 TB | 3.1 TB | 3.7 TB | 4.5 TB |

Critical finding: Under conservative growth, app server CPU exceeds capacity within 3 months. Under aggressive growth, database connections breach the limit before the April launch. Both scenarios require action before Q2.
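The conservative columns above are plain month-over-month compounding, which is easy to reproduce and sanity-check. A sketch (the aggressive scenario additionally layers launch multipliers on top, which this minimal version omits):

```python
def project_peak(current_peak, mom_growth, months):
    """Compound current peak load forward by a month-over-month growth rate."""
    return current_peak * (1 + mom_growth) ** months

# Conservative scenario: 8% MoM, no launches (matches the table within rounding)
rps_3mo = project_peak(9_800, 0.08, 3)   # ~12,345 RPS
rps_6mo = project_peak(9_800, 0.08, 6)   # ~15,551 RPS
cpu_6mo = project_peak(88, 0.08, 6)      # ~140% -- demand exceeds capacity
```

A projected CPU value over 100% is not a measurement; it is a statement that demand will exceed today's capacity by that factor unless capacity is added first.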


Bottlenecks

Identify the specific constraints that will be hit first. A bottleneck is not just a resource that's running high — it's a resource with a non-linear failure mode at or near its limit.

Cliff #1 — App server CPU (URGENT): At 88% peak CPU today, there is no headroom for the April launch. CPU saturation causes request queuing, which causes latency spikes, which causes client timeouts and retries — a self-reinforcing failure spiral. This is not a gradual degradation; it's a cliff. Horizontal scaling is the most straightforward fix, but we should also profile the three highest-CPU endpoints before adding capacity blindly.

Cliff #2 — Database connection pool: Postgres max_connections is set to 1,000. At 780 peak connections today, a 30% traffic spike breaches this limit. Connection pool exhaustion causes immediate query failures with no graceful fallback. PgBouncer connection pooling is the standard mitigation and should be evaluated immediately.
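For reference, a minimal PgBouncer configuration sketch in transaction-pooling mode — the host, database name, and pool sizes below are hypothetical placeholders to be tuned against actual workload, not recommended values:

```ini
; Hypothetical pgbouncer.ini sketch -- names and sizes are placeholders
[databases]
appdb = host=db-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction pooling multiplexes many client connections onto few server ones
pool_mode = transaction
; app-facing connections PgBouncer will accept
max_client_conn = 5000
; actual Postgres connections opened per database/user pair
default_pool_size = 100
```

With pooling in place, application connection counts decouple from Postgres `max_connections`, converting the cliff into an ordinary tuning parameter. Note that transaction pooling breaks session-level features (prepared statements, advisory locks), so the staging evaluation should exercise those paths.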

Gradual concern — Postgres storage: At current growth, we'll hit the 4 TB instance limit in approximately 7 months under aggressive scenarios. This is gradual and manageable with a planned migration to a larger instance class or table partitioning strategy. Not urgent, but should be scheduled this quarter.
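Time-to-limit for a gradually growing resource falls out of the same compound-growth model. A sketch — the 10% MoM storage growth rate here is an assumed placeholder; plug in the rate observed in your metrics export:

```python
import math

def months_to_limit(current_tb, limit_tb, mom_growth):
    """Months until storage crosses the instance limit at compound MoM growth."""
    return math.log(limit_tb / current_tb) / math.log(1 + mom_growth)

# ~6.8 months from 2.1 TB to the 4 TB limit at an assumed 10% MoM growth
runway = months_to_limit(2.1, 4.0, 0.10)
```

Recomputing this monthly against actuals tells you whether "not urgent" is still true.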


Scaling Plan

Specific actions to address each bottleneck and keep pace with forecast growth. Include engineering effort estimate and cost impact.

| Action | Addresses | Engineering Effort | Monthly Cost Impact | Target Completion |
| --- | --- | --- | --- | --- |
| Add 4 app server instances (auto-scaling group update) | CPU cliff | 2 days | +$1,800/mo | 2026-03-25 |
| Deploy PgBouncer connection pooling | DB connection cliff | 3 days | +$200/mo | 2026-03-28 |
| Profile top-3 CPU-intensive API endpoints | CPU efficiency | 1 week | Potential savings | 2026-04-10 |
| Migrate Postgres to 8 TB instance | Storage headroom | 4 hours (maintenance window) | +$600/mo | 2026-05-15 |
| Evaluate read replica for reporting queries | DB load distribution | 1 week | +$900/mo | 2026-04-30 |

  • Total projected infrastructure cost increase (6 months): +$3,500/mo (+23%)
  • Cost if we don't act (incident response, emergency scaling, SLA credits): unquantified, but historically 4–6x the proactive cost
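Keeping the cost rollup as data rather than a hand-typed total makes the number trivially auditable when line items change. A sketch using the per-action deltas from the scaling plan above (labels are mine):

```python
# Monthly cost deltas from the scaling-plan table (USD/month)
cost_deltas = {
    "app servers (+4 instances)": 1_800,
    "PgBouncer deployment": 200,
    "Postgres 8 TB instance": 600,
    "reporting read replica": 900,
}

# Matches the +$3,500/mo figure quoted to leadership
total_increase = sum(cost_deltas.values())
```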


Action Items

| Owner | Action | Due Date | Status |
| --- | --- | --- | --- |
| @sre-lead | Submit PR to update auto-scaling group config for app servers | 2026-03-20 | Open |
| @backend-lead | Evaluate PgBouncer deployment in staging | 2026-03-22 | Open |
| @priya | Profile top-3 CPU-heavy endpoints and share findings | 2026-04-10 | Open |
| @sre-lead | Schedule Postgres instance migration for maintenance window | 2026-05-01 | Open |
| @em | Get budget approval for +$3,500/mo infrastructure increase | 2026-03-17 | Open |
| @facilitator | Share forecast and scaling plan with product team before April launch planning | 2026-03-16 | Open |

Follow-up

Distribute the forecast document and scaling plan to engineering leadership and the product team. Any scaling actions that require budget approval should be escalated within 48 hours — infrastructure lead times can be longer than sprint cycles. Schedule a mid-quarter checkpoint (6 weeks out) to compare actual growth against forecast and adjust the plan if the aggressive scenario is tracking as more likely. Capacity planning decisions should be logged so the team can improve forecast accuracy over time.
