
Capacity Planning

Ensure infrastructure scales ahead of demand.


Purpose: Project infrastructure needs 3–6 months ahead to prevent scaling failures and control costs.

How to run this meeting

Plan for peak load, not average load. Average traffic is comfortable; peak traffic is where systems fail and where customers leave. Before the meeting begins, every participant should know the highest traffic event in the past quarter — whether that was a product launch, a marketing campaign, or an organic spike — and use that as the floor, not the ceiling, for projections. Teams that plan for average load are perpetually surprised by peaks they could have predicted.

Distinguish between scaling cliffs and gradual growth. A gradual 20% traffic increase gives you time to react. A database that works fine at 70% capacity but falls over at 71% due to lock contention is a cliff — and cliffs don't send warning signals until you've already gone over the edge. Spend the first half of the meeting identifying any cliffs in your current architecture before discussing growth projections. One unaddressed cliff is more dangerous than three months of gradual growth you haven't planned for yet.

Cost projections belong in this meeting, not as a separate finance exercise. Every scaling decision is also a budget decision. When you're choosing between vertical scaling and a queue-based architecture redesign, the 12-month cost differential should be on the table alongside the engineering effort. Bring your current infrastructure bill and model out at least two scenarios — conservative growth and aggressive growth — so leadership can make an informed decision rather than a surprised one when the invoice arrives.
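The two-scenario cost model described above can be sketched in a few lines. This is a minimal illustration, not a billing tool: the $12,000/mo baseline and the growth rates are hypothetical placeholders — substitute your actual infrastructure bill and observed month-over-month growth.

```python
def project_monthly_cost(baseline, mom_growth, months):
    """Compound a monthly infrastructure bill forward by a MoM growth rate."""
    return [round(baseline * (1 + mom_growth) ** m) for m in range(1, months + 1)]

baseline = 12_000  # hypothetical current monthly bill (USD)

conservative = project_monthly_cost(baseline, 0.08, 12)  # 8% MoM
aggressive = project_monthly_cost(baseline, 0.15, 12)    # 15% MoM

# 12-month differential between the scenarios -- the number leadership
# needs on the table alongside the engineering effort estimate
differential = aggressive[-1] - conservative[-1]
```

Presenting both curves side by side turns "the bill went up" into a decision that was made on purpose.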

Before the meeting

  • Export current usage metrics for all major services: traffic (RPS), storage growth rate, compute utilization (avg and peak), database connections and query volume
  • Identify the peak load event from the past 90 days and document the traffic multiplier vs. average
  • Pull infrastructure cost data for the past 3 months and calculate month-over-month growth rate
  • Check vendor contract limits, reserved instance commitments, and any hard quotas
  • Prepare a 6-month traffic forecast using at least two growth scenarios (conservative and optimistic)
  • Invite: SRE lead, backend engineering leads, EM or engineering director, finance/FinOps if available

Meeting Details

  • Date:
  • Facilitator:
  • Attendees:
  • Duration: 75 minutes (quarterly)

Usage Metrics

Summarize current system utilization across key dimensions. Include both average and peak values — never report one without the other.

Snapshot: February 2026

| Resource | Avg Daily | Peak (last 90 days) | Peak Event | Capacity Limit | Headroom |
| --- | --- | --- | --- | --- | --- |
| API requests (RPS) | 4,200 | 9,800 | Jan 18 launch | ~15,000 | 35% |
| Primary DB connections | 340 | 780 | Jan 18 launch | 1,000 | 22% |
| Storage (Postgres) | 2.1 TB | | | 4 TB | 47% |
| Object storage (S3) | 18 TB | | | Unlimited | |
| App server CPU (p95) | 41% | 88% | Jan 18 launch | 100% | 12% |
| Redis memory | 14 GB | 19 GB | Jan 18 launch | 25 GB | 24% |

Key observation: App server CPU and DB connections both have less than 25% headroom at peak, and Redis memory sits just under the threshold at 24%. A traffic event 15% larger than January's launch would likely trigger degradation in CPU and DB connections.
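The headroom check is simple enough to automate so it never depends on someone eyeballing the table. A sketch using the snapshot numbers above (the 25% threshold is this template's convention, not a universal constant):

```python
# Peak usage vs. capacity limit, taken from the February snapshot table
resources = {
    "API requests (RPS)": (9_800, 15_000),
    "Primary DB connections": (780, 1_000),
    "Storage (Postgres, TB)": (2.1, 4.0),
    "App server CPU (%)": (88, 100),
    "Redis memory (GB)": (19, 25),
}

def headroom(peak, limit):
    """Fraction of capacity still unused at peak load."""
    return 1 - peak / limit

# Flag anything with less than 25% headroom at peak
at_risk = {name: round(headroom(peak, limit) * 100)
           for name, (peak, limit) in resources.items()
           if headroom(peak, limit) < 0.25}
```

Running this against the snapshot flags DB connections, app server CPU, and Redis memory — the same resources called out above.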


Forecast

Project usage 3 and 6 months out under conservative and aggressive growth scenarios. Tie forecasts to known business events (launches, campaigns, seasonal patterns).

Assumptions:

  • Conservative: 8% MoM traffic growth, no major launches
  • Aggressive: 15% MoM traffic growth, two product launches (April, June) at 2.5x baseline each

| Metric | Current Peak | 3-Month Conservative | 3-Month Aggressive | 6-Month Conservative | 6-Month Aggressive |
| --- | --- | --- | --- | --- | --- |
| API RPS (peak) | 9,800 | 12,300 | 16,100 | 15,500 | 24,400 |
| DB connections (peak) | 780 | 980 | 1,280 | 1,235 | 1,940 |
| App server CPU (p95) | 88% | ~110% | ~145% | ~140% | ~215% |
| Storage (Postgres) | 2.1 TB | 2.8 TB | 3.1 TB | 3.7 TB | 4.5 TB |

Critical finding: Under conservative growth, app server CPU exceeds capacity within 3 months. Under aggressive growth, database connections breach the limit before the April launch. Both scenarios require action before Q2.
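The conservative columns above are plain month-over-month compounding, which is easy to reproduce and sanity-check. A sketch (the aggressive scenario additionally layers launch multipliers on top, which this minimal version omits):

```python
def project_peak(current_peak, mom_growth, months):
    """Compound current peak load forward by a month-over-month growth rate."""
    return current_peak * (1 + mom_growth) ** months

# Conservative scenario: 8% MoM, no launches (matches the table within rounding)
rps_3mo = project_peak(9_800, 0.08, 3)   # ~12,345 RPS
rps_6mo = project_peak(9_800, 0.08, 6)   # ~15,551 RPS
cpu_6mo = project_peak(88, 0.08, 6)      # ~140% -- demand exceeds capacity
```

A projected CPU value over 100% is not a measurement; it is a statement that demand will exceed today's capacity by that factor unless capacity is added first.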


Bottlenecks

Identify the specific constraints that will be hit first. A bottleneck is not just a resource that's running high — it's a resource with a non-linear failure mode at or near its limit.

Cliff #1 — App server CPU (URGENT): At 88% peak CPU today, there is no headroom for the April launch. CPU saturation causes request queuing, which causes latency spikes, which causes client timeouts and retries — a self-reinforcing failure spiral. This is not a gradual degradation; it's a cliff. Horizontal scaling is the most straightforward fix, but we should also profile the three highest-CPU endpoints before adding capacity blindly.

Cliff #2 — Database connection pool: Postgres max_connections is set to 1,000. At 780 peak connections today, a 30% traffic spike breaches this limit. Connection pool exhaustion causes immediate query failures with no graceful fallback. PgBouncer connection pooling is the standard mitigation and should be evaluated immediately.
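For reference, a minimal PgBouncer configuration sketch in transaction-pooling mode — the host, database name, and pool sizes below are hypothetical placeholders to be tuned against actual workload, not recommended values:

```ini
; Hypothetical pgbouncer.ini sketch -- names and sizes are placeholders
[databases]
appdb = host=db-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction pooling multiplexes many client connections onto few server ones
pool_mode = transaction
; app-facing connections PgBouncer will accept
max_client_conn = 5000
; actual Postgres connections opened per database/user pair
default_pool_size = 100
```

With pooling in place, application connection counts decouple from Postgres `max_connections`, converting the cliff into an ordinary tuning parameter. Note that transaction pooling breaks session-level features (prepared statements, advisory locks), so the staging evaluation should exercise those paths.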

Gradual concern — Postgres storage: At current growth, we'll hit the 4 TB instance limit in approximately 7 months under aggressive scenarios. This is gradual and manageable with a planned migration to a larger instance class or table partitioning strategy. Not urgent, but should be scheduled this quarter.
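Time-to-limit for a gradually growing resource falls out of the same compound-growth model. A sketch — the 10% MoM storage growth rate here is an assumed placeholder; plug in the rate observed in your metrics export:

```python
import math

def months_to_limit(current_tb, limit_tb, mom_growth):
    """Months until storage crosses the instance limit at compound MoM growth."""
    return math.log(limit_tb / current_tb) / math.log(1 + mom_growth)

# ~6.8 months from 2.1 TB to the 4 TB limit at an assumed 10% MoM growth
runway = months_to_limit(2.1, 4.0, 0.10)
```

Recomputing this monthly against actuals tells you whether "not urgent" is still true.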


Scaling Plan

Specific actions to address each bottleneck and keep pace with forecast growth. Include engineering effort estimate and cost impact.

| Action | Addresses | Engineering Effort | Monthly Cost Impact | Target Completion |
| --- | --- | --- | --- | --- |
| Add 4 app server instances (auto-scaling group update) | CPU cliff | 2 days | +$1,800/mo | 2026-03-25 |
| Deploy PgBouncer connection pooling | DB connection cliff | 3 days | +$200/mo | 2026-03-28 |
| Profile top-3 CPU-intensive API endpoints | CPU efficiency | 1 week | Potential savings | 2026-04-10 |
| Migrate Postgres to 8 TB instance | Storage headroom | 4 hours (maintenance window) | +$600/mo | 2026-05-15 |
| Evaluate read replica for reporting queries | DB load distribution | 1 week | +$900/mo | 2026-04-30 |

  • Total projected infrastructure cost increase (6 months): +$3,500/mo (+23%)
  • Cost if we don't act (incident response, emergency scaling, SLA credits): unquantified, but historically 4–6x the proactive cost
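Keeping the cost rollup as data rather than a hand-typed total makes the number trivially auditable when line items change. A sketch using the per-action deltas from the scaling plan above (labels are mine):

```python
# Monthly cost deltas from the scaling-plan table (USD/month)
cost_deltas = {
    "app servers (+4 instances)": 1_800,
    "PgBouncer deployment": 200,
    "Postgres 8 TB instance": 600,
    "reporting read replica": 900,
}

# Matches the +$3,500/mo figure quoted to leadership
total_increase = sum(cost_deltas.values())
```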


Action Items

| Owner | Action | Due Date | Status |
| --- | --- | --- | --- |
| @sre-lead | Submit PR to update auto-scaling group config for app servers | 2026-03-20 | Open |
| @backend-lead | Evaluate PgBouncer deployment in staging | 2026-03-22 | Open |
| @priya | Profile top-3 CPU-heavy endpoints and share findings | 2026-04-10 | Open |
| @sre-lead | Schedule Postgres instance migration for maintenance window | 2026-05-01 | Open |
| @em | Get budget approval for +$3,500/mo infrastructure increase | 2026-03-17 | Open |
| @facilitator | Share forecast and scaling plan with product team before April launch planning | 2026-03-16 | Open |

Follow-up

Distribute the forecast document and scaling plan to engineering leadership and the product team. Any scaling actions that require budget approval should be escalated within 48 hours — infrastructure lead times can be longer than sprint cycles. Schedule a mid-quarter checkpoint (6 weeks out) to compare actual growth against forecast and adjust the plan if the aggressive scenario is tracking as more likely. Capacity planning decisions should be logged so the team can improve forecast accuracy over time.
