Capacity Planning
Purpose: Project infrastructure needs 3–6 months ahead to prevent scaling failures and control costs
How to run this meeting
Plan for peak load, not average load. Average traffic is comfortable; peak traffic is where systems fail and where customers leave. Before the meeting begins, every participant should know the highest traffic event of the past quarter — whether that was a product launch, a marketing campaign, or an organic spike — and use that as the floor, not the ceiling, for projections. Teams that plan for the average are perpetually surprised by peaks they could have predicted.
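The peak-to-average multiplier is the simplest way to turn this principle into a number. A minimal sketch, using the API figures from the Usage Metrics snapshot below as illustrative inputs (the projected average is a hypothetical placeholder):

```python
def peak_multiplier(avg: float, peak: float) -> float:
    """Ratio of observed peak load to average load."""
    return peak / avg

# From the snapshot below: 4,200 avg RPS, 9,800 peak RPS (Jan 18 launch).
mult = peak_multiplier(avg=4_200, peak=9_800)
print(f"Peak multiplier: {mult:.2f}x")

# Planning floor: apply the observed multiplier to projected average traffic.
projected_avg_rps = 5_000  # hypothetical next-quarter average
print(f"Plan for at least {projected_avg_rps * mult:,.0f} RPS at peak")
```

If the multiplier itself is trending up quarter over quarter, that trend belongs in the forecast too.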
Distinguish between scaling cliffs and gradual growth. A gradual 20% traffic increase gives you time to react. A database that works fine at 70% capacity but falls over at 71% due to lock contention is a cliff — and cliffs don't send warning signals until you've already gone over the edge. Spend the first half of the meeting identifying any cliffs in your current architecture before discussing growth projections. One unaddressed cliff is more dangerous than three months of gradual growth you haven't planned for yet.
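Why cliffs give no warning can be illustrated with a toy M/M/1 queueing model, where mean response time is 1/(μ − λ): latency grows slowly at moderate utilization, then explodes as arrival rate approaches service capacity. This is an illustrative sketch, not a model of any specific system in this document:

```python
def mean_latency_ms(service_rate: float, arrival_rate: float) -> float:
    """M/M/1 mean response time in ms; rates in requests per second."""
    if arrival_rate >= service_rate:
        return float("inf")  # saturated: the queue grows without bound
    return 1000.0 / (service_rate - arrival_rate)

mu = 100.0  # hypothetical capacity: 100 req/s
for util in (0.50, 0.70, 0.90, 0.95, 0.99):
    print(f"{util:.0%} utilization -> {mean_latency_ms(mu, mu * util):.0f} ms")
```

The jump from 20 ms at 50% to 1,000 ms at 99% is the cliff: the last few points of utilization cost more latency than all the points before them combined.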
Cost projections belong in this meeting, not as a separate finance exercise. Every scaling decision is also a budget decision. When you're choosing between vertical scaling and a queue-based architecture redesign, the 12-month cost differential should be on the table alongside the engineering effort. Bring your current infrastructure bill and model out at least two scenarios — conservative growth and aggressive growth — so leadership can make an informed decision rather than a surprised one when the invoice arrives.
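A two-scenario cost projection can be as simple as compounding the current bill at each scenario's growth rate. A hedged sketch — the starting bill is a hypothetical placeholder, and the rates are the planning assumptions used later in this document, not measurements:

```python
def project_bill(monthly_cost: float, mom_growth: float, months: int) -> float:
    """Compound a monthly bill at a flat month-over-month growth rate."""
    return monthly_cost * (1 + mom_growth) ** months

current_bill = 15_000  # hypothetical current monthly infrastructure spend ($)
for name, rate in (("conservative", 0.08), ("aggressive", 0.15)):
    print(f"{name}: ${project_bill(current_bill, rate, 12):,.0f}/mo after 12 months")
```

The spread between the two 12-month numbers is the decision leadership actually needs to see: it frames whether an architecture redesign pays for itself.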
Before the meeting
- Export current usage metrics for all major services: traffic (RPS), storage growth rate, compute utilization (avg and peak), database connections and query volume
- Identify the peak load event from the past 90 days and document the traffic multiplier vs. average
- Pull infrastructure cost data for the past 3 months and calculate month-over-month growth rate
- Check vendor contract limits, reserved instance commitments, and any hard quotas
- Prepare a 6-month traffic forecast using at least two growth scenarios (conservative and optimistic)
- Invite: SRE lead, backend engineering leads, EM or engineering director, finance/FinOps if available
Meeting Details
- Date:
- Facilitator:
- Attendees:
- Duration: 75 minutes (quarterly)
Usage Metrics
Summarize current system utilization across key dimensions. Include both average and peak values — never report one without the other.
Snapshot: February 2026
| Resource | Avg Daily | Peak (last 90 days) | Peak Event | Capacity Limit | Headroom |
|---|---|---|---|---|---|
| API requests (RPS) | 4,200 | 9,800 | Jan 18 launch | ~15,000 | 35% |
| Primary DB connections | 340 | 780 | Jan 18 launch | 1,000 | 22% |
| Storage (Postgres) | 2.1 TB | — | — | 4 TB | 47% |
| Object storage (S3) | 18 TB | — | — | Unlimited | — |
| App server CPU (p95) | 41% | 88% | Jan 18 launch | 100% | 12% |
| Redis memory | 14 GB | 19 GB | Jan 18 launch | 25 GB | 24% |
Key observation: App server CPU and DB connections both have less than 25% headroom at peak. A traffic event 15% larger than January's launch would likely trigger degradation in both.
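The headroom column is just 1 − peak/limit, and the threshold check is worth automating so the flag can't be missed. A sketch over the table's own numbers (note Redis at 24% also sits just under a 25% threshold):

```python
# (peak, capacity limit) pairs from the February 2026 snapshot.
snapshot = {
    "api_rps":        (9_800, 15_000),
    "db_connections": (780, 1_000),
    "app_cpu_p95":    (88, 100),
    "redis_gb":       (19, 25),
}

def headroom(peak: float, limit: float) -> float:
    """Fraction of capacity still unused at observed peak."""
    return 1 - peak / limit

for name, (peak, limit) in snapshot.items():
    h = headroom(peak, limit)
    flag = "  <-- under 25% headroom at peak" if h < 0.25 else ""
    print(f"{name}: {h:.0%}{flag}")
```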
Forecast
Project usage 3 and 6 months out under conservative and aggressive growth scenarios. Tie forecasts to known business events (launches, campaigns, seasonal patterns).
Assumptions:
- Conservative: 8% MoM traffic growth, no major launches
- Aggressive: 15% MoM traffic growth, two product launches (April, June) at 2.5x baseline each
| Metric | Current Peak | 3-Month Conservative | 3-Month Aggressive | 6-Month Conservative | 6-Month Aggressive |
|---|---|---|---|---|---|
| API RPS (peak) | 9,800 | 12,300 | 16,100 | 15,500 | 24,400 |
| DB connections (peak) | 780 | 980 | 1,280 | 1,235 | 1,940 |
| App server CPU (p95) | 88% | ~110% | ~145% | ~140% | ~215% |
| Storage (Postgres) | 2.1 TB | 2.8 TB | 3.1 TB | 3.7 TB | 4.5 TB |
Critical finding: Under conservative growth, app server CPU exceeds capacity within 3 months. Under aggressive growth, database connections breach the limit before the April launch. Both scenarios require action before Q2.
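The conservative column of the forecast table is plain compounding of current peaks at the assumed month-over-month rate. A sketch that reproduces it to within rounding (the aggressive column additionally layers launch multipliers on top, which this sketch does not model):

```python
def grow(peak: float, mom_rate: float, months: int) -> float:
    """Compound a peak metric at a flat month-over-month growth rate."""
    return peak * (1 + mom_rate) ** months

current_peaks = {"api_rps": 9_800, "db_connections": 780, "app_cpu_p95": 88}
for name, peak in current_peaks.items():
    c3, c6 = grow(peak, 0.08, 3), grow(peak, 0.08, 6)
    print(f"{name}: 3-month conservative ~{c3:,.0f}, 6-month ~{c6:,.0f}")
```

Keeping the arithmetic in a script rather than a spreadsheet makes the mid-quarter checkpoint trivial: swap in the observed growth rate and rerun.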
Bottlenecks
Identify the specific constraints that will be hit first. A bottleneck is not just a resource that's running high — it's a resource with a non-linear failure mode at or near its limit.
Cliff #1 — App server CPU (URGENT): At 88% peak CPU today, there is no headroom for the April launch. CPU saturation causes request queuing, which causes latency spikes, which causes client timeouts and retries — a self-reinforcing failure spiral. This is not a gradual degradation; it's a cliff. Horizontal scaling is the most straightforward fix, but we should also profile the three highest-CPU endpoints before adding capacity blindly.
Cliff #2 — Database connection pool: Postgres max_connections is set to 1,000. At 780 peak connections today, a 30% traffic spike breaches this limit. Connection pool exhaustion causes immediate query failures with no graceful fallback. PgBouncer connection pooling is the standard mitigation and should be evaluated immediately.
Gradual concern — Postgres storage: At current growth, we'll hit the 4 TB instance limit in approximately 7 months under aggressive scenarios. This is gradual and manageable with a planned migration to a larger instance class or table partitioning strategy. Not urgent, but should be scheduled this quarter.
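Time-to-limit for a steadily growing resource follows directly from the compounding formula: n = log(limit/current) / log(1 + monthly growth). A sketch, assuming roughly 10% monthly storage growth (an inferred figure that lands near the document's ~7-month estimate):

```python
import math

def months_to_limit(current: float, limit: float, monthly_growth: float) -> float:
    """Months until a resource growing at a flat monthly rate hits its limit."""
    return math.log(limit / current) / math.log(1 + monthly_growth)

# Postgres storage: 2.1 TB today toward the 4 TB instance limit.
print(f"~{months_to_limit(2.1, 4.0, 0.10):.1f} months at 10% monthly growth")
```

Rerunning this with the actual observed growth rate each month turns a "gradual concern" into a dated deadline for the migration.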
Scaling Plan
Specific actions to address each bottleneck and keep pace with forecast growth. Include engineering effort estimate and cost impact.
| Action | Addresses | Engineering Effort | Monthly Cost Impact | Target Completion |
|---|---|---|---|---|
| Add 4 app server instances (auto-scaling group update) | CPU cliff | 2 days | +$1,800/mo | 2026-03-25 |
| Deploy PgBouncer connection pooling | DB connection cliff | 3 days | +$200/mo | 2026-03-28 |
| Profile top-3 CPU-intensive API endpoints | CPU efficiency | 1 week | Potential savings | 2026-04-10 |
| Postgres storage: migrate to 8 TB instance | Storage headroom | 4 hours (maintenance window) | +$600/mo | 2026-05-15 |
| Evaluate read replica for reporting queries | DB load distribution | 1 week | +$900/mo | 2026-04-30 |
Total projected infrastructure cost increase (6 months): +$3,500/mo (+23%)
Cost if we don't act (incident response, emergency scaling, SLA credits): unquantified but historically 4–6x the proactive cost
Action Items
| Owner | Action | Due Date | Status |
|---|---|---|---|
| @sre-lead | Submit PR to update auto-scaling group config for app servers | 2026-03-20 | Open |
| @backend-lead | Evaluate PgBouncer deployment in staging | 2026-03-22 | Open |
| @priya | Profile top-3 CPU-heavy endpoints and share findings | 2026-04-10 | Open |
| @sre-lead | Schedule Postgres instance migration for maintenance window | 2026-05-01 | Open |
| @em | Get budget approval for +$3,500/mo infrastructure increase | 2026-03-17 | Open |
| @facilitator | Share forecast and scaling plan with product team before April launch planning | 2026-03-16 | Open |
Follow-up
Distribute the forecast document and scaling plan to engineering leadership and the product team. Any scaling actions that require budget approval should be escalated within 48 hours — infrastructure lead times can be longer than sprint cycles. Schedule a mid-quarter checkpoint (6 weeks out) to compare actual growth against forecast and adjust the plan if the aggressive scenario is tracking as more likely. Capacity planning decisions should be logged so the team can improve forecast accuracy over time.