GPU capacity planning¶

Scope: sizing a GPU fleet by building a demand model from workloads, setting target utilization and headroom, fitting the power/cooling envelope, and folding in procurement lead time. Includes a runnable capacity calculation and a planning checklist.

What it is¶

Capacity planning turns a forecast of workload demand into a procured GPU count, a power/cooling budget, and an order placed early enough to land before the demand does. It is the bridge between the unit economics of cloud and neoclouds and the physical build-out, and it produces three numbers that must agree:

GPUs needed: derived from the demand model, divided by a realistic target utilization (never 100%).
Power/cooling envelope: the IT load plus facility overhead the GPUs will draw, which often binds before budget does.
Order date: set by procurement lead time so capacity is racked and accepted before the demand curve crosses it.

The unit of demand is the GPU-hour, but the load-bearing input is useful GPU-hours: a job at 25% MFU consumes roughly twice the wall-clock of the same job at 50% MFU (cloud and neoclouds). Plan against measured utilization (SM-active / MFU), not nameplate.
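The MFU effect can be made concrete with a small calculation. The sketch below is illustrative only: `gpu_hours_needed` is a hypothetical helper, the job size and peak throughput are placeholder values rather than measurements, and it assumes wall-clock scales inversely with MFU for a fixed amount of work.

```python
def gpu_hours_needed(total_flops: float, peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-hours to finish a fixed amount of work at a sustained MFU."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    useful_flops_per_gpu_second = peak_flops_per_gpu * mfu
    return total_flops / useful_flops_per_gpu_second / 3600.0


# Placeholder inputs, chosen only to show the ratio (not measured values).
TOTAL_FLOPS = 1.0e21          # hypothetical job size
PEAK_FLOPS_PER_GPU = 1.0e15   # hypothetical nameplate throughput, FLOP/s

at_25 = gpu_hours_needed(TOTAL_FLOPS, PEAK_FLOPS_PER_GPU, mfu=0.25)
at_50 = gpu_hours_needed(TOTAL_FLOPS, PEAK_FLOPS_PER_GPU, mfu=0.50)
print(f"25% MFU: {at_25:,.0f} GPU-hours")
print(f"50% MFU: {at_50:,.0f} GPU-hours")
print(f"ratio:   {at_25 / at_50:.1f}x")
```

Whatever the absolute inputs, halving MFU doubles the GPU-hours, which is why the demand model should be built from measured rather than nameplate throughput.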

Why it matters¶

Mis-sized fleets fail in two directions, both expensive:

Under-provisioned: queues back up, training campaigns slip, inference breaches its SLOs, and the team scrambles to rent burst capacity on-demand at the dearest rate (cloud and neoclouds).
Over-provisioned: idle reserved GPUs burn money at the full rate. An idle reserved H100 still bills; idle owned hardware still depreciates (generations turn over roughly yearly, e.g. Blackwell to Rubin) while drawing power.

The binding constraints are usually power availability and allocation lead time for constrained parts, not the budget line (cloud and neoclouds). A plan that confirms budget but not power/cooling or lead time is not a plan. Capacity planning makes those three constraints explicit and forces them to reconcile before money is committed.

When it is needed (and when not)¶

Needed when:

Standing up a new owned cluster, or expanding an existing one beyond a single node-pool scale-up.
Committing to reserved / committed cloud capacity or Capacity Blocks for a bounded campaign; a commitment is a sizing decision (cloud and neoclouds).
Demand is shifting between workload classes (e.g. training-heavy to inference-heavy), changing the per-GPU profile and the utilization target.

Not the right tool when:

You need burst or experimental capacity for hours to days. Use on-demand or spot with checkpoint discipline (DiLoCo, Ray for GPU Clusters); demand that short-lived cannot justify a lead-time-bound purchase.
You are commissioning already-purchased nodes into a live cluster; that is execution, see Add GPU Capacity, not planning.
You are selecting vendors or negotiating contracts; see Vendor Sourcing and Procurement Logistics. Capacity planning produces the quantity and date that feed that process.

How: implement, integrate, maintain¶

The flow: aggregate workloads into a demand model, divide by target utilization to get raw GPU count, add headroom, derive the power/cooling envelope, then back-date the order by lead time.

flowchart LR
  W["Workload inventory: training campaigns, inference QPS, batch/eval"] --> D["Demand model: peak + steady GPU-hours/day"]
  D --> U["Divide by target utilization (e.g. 0.65)"]
  U --> H["Add headroom + failure spare"]
  H --> N["GPU count -> node count"]
  N --> P["Power/cooling envelope: GPU TDP x GPU count x PUE"]
  N --> L["Back-date order by lead time"]
  P -->|"facility fits?"| OK["Commit"]
  P -.->|"power-bound"| RESIZE["Resize or phase"]
  L --> OK

Implement: a runnable capacity calculation¶

The script below takes a workload list and target utilization and returns the GPU count, node count, power envelope, and the order-by date. Numbers in the example workloads are illustrative; the per-GPU TDP (700 W for H100 SXM) is from the NVIDIA H100 datasheet. Replace the workloads, target utilization, and lead time with your own measured values.

#!/usr/bin/env python3
"""GPU capacity calculation. Inputs are illustrative; replace with measured values."""
from __future__ import annotations

import math
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Workload:
    name: str
    gpus_peak: int          # GPUs needed at peak concurrency
    duty_cycle: float       # fraction of the planning window the workload is active (0..1)


@dataclass(frozen=True)
class FleetPlan:
    gpu_count: int
    node_count: int
    it_load_kw: float       # GPU TDP only
    facility_load_kw: float # IT load x PUE
    order_by: date


def plan_fleet(
    workloads: list[Workload],
    *,
    target_utilization: float,   # measured SM-active/MFU you sustain, NOT nameplate (e.g. 0.65)
    headroom: float,             # spare for bursts + node failure (e.g. 0.20 = 20%)
    gpus_per_node: int = 8,      # SXM baseboard
    gpu_tdp_w: int = 700,        # H100 SXM TDP, NVIDIA datasheet
    pue: float = 1.3,            # facility overhead multiplier; measure yours
    need_by: date,
    lead_time_days: int,         # order -> racked-and-accepted, from your vendor
) -> FleetPlan:
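    # Validate with exceptions rather than asserts, which are stripped under `python -O`.
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be a fraction in (0, 1]")
    if headroom < 0.0:
        raise ValueError("headroom must be non-negative")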

    # Demand: sum of peak GPUs weighted by how much of the window each runs.
    demand_gpus = sum(w.gpus_peak * w.duty_cycle for w in workloads)

    # Size to sustained demand divided by what you actually keep busy, then add headroom.
    raw = demand_gpus / target_utilization
    with_headroom = raw * (1.0 + headroom)

    gpu_count = math.ceil(with_headroom)
    node_count = math.ceil(gpu_count / gpus_per_node)
    gpu_count = node_count * gpus_per_node  # round up to whole nodes

    it_load_kw = gpu_count * gpu_tdp_w / 1000.0
    facility_load_kw = it_load_kw * pue
    order_by = need_by - timedelta(days=lead_time_days)

    return FleetPlan(gpu_count, node_count, it_load_kw, facility_load_kw, order_by)


if __name__ == "__main__":
    workloads = [
        Workload("pretrain-7b", gpus_peak=64, duty_cycle=0.80),   # near-continuous campaign
        Workload("sft-eval", gpus_peak=16, duty_cycle=0.30),      # bursty
        Workload("inference-prod", gpus_peak=24, duty_cycle=0.95),  # steady serving floor
    ]
    pue = 1.3  # facility overhead multiplier; measure yours
    p = plan_fleet(
        workloads,
        target_utilization=0.65,
        headroom=0.20,
        pue=pue,
        need_by=date(2026, 12, 1),
        lead_time_days=90,
    )
    print(f"GPUs:            {p.gpu_count}")
    print(f"Nodes (8-GPU):   {p.node_count}")
    print(f"IT load:         {p.it_load_kw:.1f} kW (GPU TDP only)")
    # Print the PUE actually passed in, not a hard-coded literal.
    print(f"Facility load:   {p.facility_load_kw:.1f} kW (PUE {pue})")
    print(f"Order by:        {p.order_by.isoformat()}")

Running this with the illustrative inputs sizes the demand at 64*0.80 + 16*0.30 + 24*0.95 = 78.8 GPU-equivalents, divides by 0.65 utilization (~121.2), adds 20% headroom (~145.5), and rounds up to 19 eight-GPU nodes (152 GPUs). That yields ~106 kW IT load, ~138 kW facility load at PUE 1.3, and an order-by date of 2026-09-02 for a 2026-12-01 need-by with a 90-day lead time. The IT-load-vs-facility split is what you take to facilities for the power/cooling check (datacentre readiness via vendor-sourcing-procurement). Summed GPU TDP is only a floor on draw; whole-node and rack power (CPU, NICs, fans, NVSwitch) add to it, so confirm the real per-node figure against the system datasheet before committing the envelope.
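Because GPU TDP alone understates node draw, the facility check is better run on a per-node figure. A minimal sketch, assuming a hypothetical `node_power_kw` read from the system datasheet or a metered node and a hypothetical `facility_cap_kw` from your facilities team; both values below are placeholders, and `facility_fit` is an illustrative helper rather than part of the script above.

```python
import math


def facility_fit(node_count: int, node_power_kw: float, pue: float,
                 facility_cap_kw: float) -> tuple[float, int]:
    """Return (facility load in kW, max whole nodes the cap can host)."""
    if node_power_kw <= 0 or pue < 1.0:
        raise ValueError("node_power_kw must be > 0 and pue >= 1.0")
    load_kw = node_count * node_power_kw * pue
    max_nodes = math.floor(facility_cap_kw / (node_power_kw * pue))
    return load_kw, max_nodes


# 19 nodes comes from the worked example; the other inputs are placeholders.
PLANNED_NODES = 19
load_kw, max_nodes = facility_fit(
    node_count=PLANNED_NODES, node_power_kw=10.0, pue=1.3, facility_cap_kw=200.0
)
print(f"Facility load at node level: {load_kw:.1f} kW")
print(f"Nodes the cap can host:      {max_nodes}")
if max_nodes < PLANNED_NODES:
    print(f"Power-bound: phase the order, {max_nodes} nodes in the first phase")
```

With these placeholder inputs the node-level load comes to 247 kW, well above the GPU-only estimate, and the cap hosts 15 of the 19 nodes, so the order would be phased rather than committed in full.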

Integrate: feed the plan from real telemetry¶

Do not guess target_utilization or duty_cycle; read them from the monitoring stack (Telemetry, Monitoring and Alerting, Observability and Monitoring). DCGM exposes SM activity; derive sustained utilization over a representative window:

# Fleet mean SM-active fraction over 28d -> use as target_utilization input.
# DCGM_FI_PROF_SM_ACTIVE is the fraction of time SMs were busy (0..1).
avg_over_time(
  avg(DCGM_FI_PROF_SM_ACTIVE)[28d:1h]
)

# Per-workload peak GPU concurrency (k8s) -> gpus_peak input per namespace.
max_over_time(
  sum by (namespace) (
    kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
  )[28d:1h]
)

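Pin the queries to your exporter's metric names (DCGM exporter and kube-state-metrics; names shift between versions). Feed the results back into the script: the first query supplies target_utilization and the second supplies per-workload gpus_peak. Neither query returns duty_cycle directly; derive it from the same per-namespace request series as the fraction of the window in which the workload is active.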
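One way to turn exported query results into script inputs, sketched under assumptions: `samples` is a hypothetical list of evenly spaced (for example, hourly) GPU-request counts for one namespace, and a workload counts as active in any sample where it requests at least `active_threshold` GPUs. The helper name and the threshold are illustrative choices, not part of DCGM or kube-state-metrics.

```python
def workload_inputs(samples: list[int], active_threshold: int = 1) -> tuple[int, float]:
    """Return (gpus_peak, duty_cycle) from evenly spaced GPU-request samples."""
    if not samples:
        raise ValueError("need at least one sample")
    gpus_peak = max(samples)
    active = sum(1 for s in samples if s >= active_threshold)
    duty_cycle = active / len(samples)
    return gpus_peak, duty_cycle


# Hypothetical hourly samples for a bursty namespace (not real telemetry).
samples = [0, 0, 16, 16, 8, 0, 0, 0, 12, 0]
gpus_peak, duty_cycle = workload_inputs(samples)
print(f"gpus_peak={gpus_peak} duty_cycle={duty_cycle:.2f}")
```

This definition treats every active sample as running at peak, so `gpus_peak * duty_cycle` errs on the high side of the true mean demand; that bias is usually acceptable for sizing, but it is a design choice worth stating in the plan.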

Maintain: a planning checklist¶

Re-run the plan each quarter and whenever the workload mix shifts. Verify:

Demand model current: peak GPU concurrency and duty cycle taken from the last 28d of telemetry, not from intent. Track $/run and $/token alongside (cloud and neoclouds).
Utilization measured, not assumed: target_utilization is sustained SM-active/MFU (Observability and Monitoring); idle reserved capacity is often the largest hidden cost.
Headroom covers failure spare: at least one node's worth (N+1) on top of burst headroom, so a node loss does not breach serving SLOs (SLO/SLI Catalog and Error-Budget Alerts, Inference SLO Breach).
Power/cooling envelope confirmed: facility signed off on the IT-plus-PUE load before the order, not after delivery (Vendor Sourcing and Procurement Logistics).
Lead time honoured: order-by date back-dated from need-by; constrained parts (optics, switches, GPUs) drive the critical path.
Build-vs-rent re-checked at the margin: for bursty or short-horizon demand, prefer on-demand/spot or neocloud over owned capacity (Cloud, Neoclouds and Cost/Capacity); reserve owned capacity for the steady floor.
Phasing for power-bound sites: if the facility cannot host the full envelope, phase the order to match power build-out rather than over-buying GPUs that cannot be powered.
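The failure-spare and lead-time items lend themselves to a mechanical check. A minimal sketch, assuming hypothetical per-component lead times (the figures below are placeholders; use your vendor quotes) and assuming peak serving demand must still fit after losing one node. Both helper names are illustrative.

```python
from datetime import date, timedelta


def critical_path_order_by(need_by: date, lead_times_days: dict[str, int]) -> tuple[str, date]:
    """Return the slowest component and the date its order must be placed."""
    part = max(lead_times_days, key=lead_times_days.get)
    return part, need_by - timedelta(days=lead_times_days[part])


def survives_node_loss(node_count: int, gpus_per_node: int, serving_gpus_peak: int) -> bool:
    """True if peak serving demand still fits with one node down (N+1)."""
    return (node_count - 1) * gpus_per_node >= serving_gpus_peak


# Placeholder lead times in days; not vendor data.
lead_times = {"gpus": 90, "switches": 120, "optics": 60}
part, order_by = critical_path_order_by(date(2026, 12, 1), lead_times)
print(f"Critical path: {part}, order by {order_by.isoformat()}")
print("N+1 holds:", survives_node_loss(19, 8, serving_gpus_peak=24))
```

The order-by date is driven by the slowest component, not by the GPUs alone, so a plan that back-dates only from GPU lead time can still land late.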

Commissioning the procured nodes into the running cluster is a separate, executable procedure; see Add GPU Capacity. Acceptance and burn-in gate every node before it counts toward allocatable capacity (GPU Health Gating, Smoke Tests: GPU Platform).

References¶

NVIDIA H100 Tensor Core GPU datasheet (SXM 700 W TDP): https://resources.nvidia.com/en-us-gpu-resources/h100-datasheet-24306
NVIDIA DGX H100 user guide (system power, 8x H100, 6x 3.3 kW PSUs): https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html
AWS EC2 Capacity Blocks for ML (reserve a block for a fixed window): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html
NVIDIA DCGM exporter / profiling metrics (DCGM_FI_PROF_SM_ACTIVE): https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
kube-state-metrics (resource requests metrics): https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md
FinOps Foundation (cloud cost discipline, unit economics): https://www.finops.org/
Uptime Institute on PUE (facility overhead): https://uptimeinstitute.com/resources/research-and-reports