Cloud, neoclouds & cost/capacity
Scope: an overview and decision index for GPU capacity beyond the owned hall, covering hyperscaler instances, the neoclouds, decentralized/permissionless GPU networks, and the economics that decide build vs rent. This page frames the choices and routes you to the focused pages that implement each; it is where deployment meets procurement strategy and unit economics.
```mermaid
flowchart LR
    DEMAND["Workload demand"] --> MODEL["Capacity model"]
    MODEL --> OWN["Owned cluster"]
    MODEL --> RENT["Cloud or neocloud"]
    MODEL --> SPOT["Spot or preemptible"]
    OWN --> COST["Cost per useful unit"]
    RENT --> COST
    SPOT --> CHECKPOINT["Checkpoint discipline"]
```
Overview
Most GPU capacity is consumed, not owned. Knowing the supply side and the unit economics is what lets a platform or MLOps engineer answer the questions leadership actually asks: rent or build, reserved or spot, and why a job costs what it costs. The through-line is that utilisation is the lever: an idle reserved GPU burns money at the full rate, and a job running at half the MFU needs roughly twice the GPU-hours for the same work, so it costs roughly twice as much.
Focused pages
This page is the index; the implementable detail lives in focused children:
- GPU provider landscape: use this when you need to pick which provider (hyperscaler vs neocloud vs decentralized) and compare their SKUs, networking, and access.
- GPU consumption models: use this when deciding how to buy: on-demand vs reserved/committed vs spot/preemptible vs Capacity Blocks, and bare-metal vs managed.
- Build vs rent GPU cost model: use this when you need the TCO/break-even math: capex vs opex, sustained utilisation, time horizon, and depreciation.
- GPU capacity planning: use this when sizing the fleet to demand, with power availability and allocation lead time as the binding constraints.
Core knowledge
The supply side
There are three tiers: hyperscalers (AWS, GCP, Azure, OCI; managed Slurm/k8s on top); neoclouds (CoreWeave, Lambda, Crusoe, Nebius and peers; typically faster access and lower prices, usually bare-metal on InfiniBand); and decentralized/permissionless DePIN-GPU networks that aggregate heterogeneous, globally distributed accelerators. The slow, uneven links in that last tier are why low-communication training methods such as DiLoCo matter there (see distributed training). For SKUs, networking, and the full provider comparison see GPU provider landscape.
Consumption models
There are four ways to buy: on-demand (flexible, dearest); reserved/committed (cheap if kept busy); spot/preemptible (cheapest, but can be reclaimed at short notice, so only safe with checkpointing; see storage and data, and reliability and RAS); and Capacity Blocks (a block of GPUs reserved in advance for a fixed window). Bare-metal vs managed is an orthogonal axis. Full decision guidance is in GPU consumption models.
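The spot-vs-reserved trade-off above comes down to cost per *useful* GPU-hour. The sketch below is a hedged first-order model, not a pricing tool: it assumes preemptions arrive at random with a mean time between preemptions `mtbp_hours`, uses Young's classic approximation for the checkpoint interval (sqrt(2 × checkpoint cost × MTBP)), and treats reserved capacity as billed whether busy or not. All function names and every price or timing in the usage block are hypothetical.

```python
import math


def young_interval(ckpt_hours: float, mtbp_hours: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * ckpt_hours * mtbp_hours)


def spot_waste_fraction(interval_hours: float, ckpt_hours: float,
                        restart_hours: float, mtbp_hours: float) -> float:
    """Approximate fraction of paid spot time that produces no useful work.

    Assumes random preemptions with mean spacing mtbp_hours, and an
    interval_hours much smaller than mtbp_hours (first-order approximation).
    """
    checkpoint_overhead = ckpt_hours / interval_hours
    # On average half an interval of work is lost per preemption, plus restart time.
    rework_and_restart = (interval_hours / 2.0 + restart_hours) / mtbp_hours
    return checkpoint_overhead + rework_and_restart


def spot_cost_per_useful_hour(price: float, interval_hours: float, ckpt_hours: float,
                              restart_hours: float, mtbp_hours: float) -> float:
    waste = spot_waste_fraction(interval_hours, ckpt_hours, restart_hours, mtbp_hours)
    if waste >= 1.0:
        raise ValueError("preemptions too frequent: no useful work survives")
    return price / (1.0 - waste)


def reserved_cost_per_useful_hour(price: float, utilisation: float) -> float:
    # Reserved capacity is billed whether it is busy or idle.
    return price / utilisation


if __name__ == "__main__":
    # Hypothetical inputs: 10 h between preemptions, 3 min checkpoint, 6 min restart.
    t = young_interval(ckpt_hours=0.05, mtbp_hours=10.0)
    print(f"checkpoint every {t:.2f} h")
    print(f"spot:     ${spot_cost_per_useful_hour(1.00, t, 0.05, 0.10, 10.0):.2f}/useful GPU-hour")
    print(f"reserved: ${reserved_cost_per_useful_hour(1.60, 0.60):.2f}/useful GPU-hour at 60% busy")
```

The point of the model is the shape, not the numbers: spot only stays cheap while checkpoint overhead and expected rework are small, and reserved only stays cheap while it is kept busy. The balance shifts as preemptions become more frequent or checkpoints slower.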
The economics
- The unit is the GPU-hour, but the number that matters is real utilisation, measured as SM-active/MFU (see observability), not the misleading "GPU-util" that nvidia-smi reports, which only records whether any kernel was running during the sample period.
- FinOps signals: $/GPU-hour, true utilisation, $/training-run, and for inference $/token (or $/1M tokens). NVIDIA's "lower cost per token" pitch for Rubin, the successor to Blackwell, targets exactly this business metric (see the Blackwell platform).
- MFU is cost: for a fixed amount of model work, the GPU-hours required scale as 1/MFU, so a run at 25% MFU costs roughly twice the same run at 50% MFU. Optimisation (see performance tuning) is therefore direct cost reduction, and that is the cleanest way to frame its value.
- Build vs rent: capex (own the hall; see BOM validation and datacentre readiness) vs opex (rent). Break-even turns on sustained utilisation and time horizon; second-hand/distressed capacity is a third path, and depreciation and export controls both erode residual value. The full TCO/break-even model lives in build vs rent GPU cost model.
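Two of the bullets above reduce to short arithmetic. The sketch below is a deliberately simplified illustration, not the TCO model from the build vs rent page: run cost is GPU-hours at a given MFU times a price, and break-even utilisation compares one owned GPU (capex plus opex over the horizon, minus residual value, paid whether busy or idle) against renting the same hours on demand. The per-GPU capex, opex, residual value, peak FLOP/s and prices are all hypothetical.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 8760


def gpu_hours(total_flops: float, peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-hours needed to deliver a fixed amount of model FLOPs at a given MFU."""
    return total_flops / (peak_flops_per_gpu * mfu) / SECONDS_PER_HOUR


def run_cost(total_flops: float, peak_flops_per_gpu: float, mfu: float,
             price_per_gpu_hour: float) -> float:
    return gpu_hours(total_flops, peak_flops_per_gpu, mfu) * price_per_gpu_hour


def break_even_utilisation(capex: float, annual_opex: float, residual_value: float,
                           rent_price_per_hour: float, horizon_years: float) -> float:
    """Utilisation above which owning one GPU beats renting the same hours.

    Owning costs capex plus opex over the horizon, minus what the GPU resells
    for, whether or not it is busy. Renting is assumed to be billed only for
    the hours actually used (on-demand).
    """
    own_total = capex + annual_opex * horizon_years - residual_value
    rent_if_always_busy = rent_price_per_hour * HOURS_PER_YEAR * horizon_years
    return own_total / rent_if_always_busy


if __name__ == "__main__":
    # Hypothetical: 3.6e21 FLOPs of training, 1e15 FLOP/s peak per GPU, $2/GPU-hour.
    for mfu in (0.50, 0.25):
        print(f"MFU {mfu:.0%}: ${run_cost(3.6e21, 1e15, mfu, 2.0):,.0f}")
    # Hypothetical all-in per-GPU figures; a longer horizon lowers the bar.
    print(f"3-year break-even: {break_even_utilisation(30_000, 3_000, 5_000, 2.5, 3):.0%}")
    print(f"5-year break-even: {break_even_utilisation(30_000, 3_000, 0, 2.5, 5):.0%}")
```

Halving MFU doubles the run cost exactly in this model, and owning only wins once sustained utilisation clears the break-even fraction, which falls as the horizon lengthens and rises as residual value is written down.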
Capacity planning
- The binding constraints are usually power availability (see datacentre readiness) and allocation lead time for constrained parts, not budget. Match procured capacity to sustained demand, not peak hopes. See GPU capacity planning for fleet sizing.
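"Sustained demand, not peak hopes" can be made concrete with a small sizing rule. The sketch below is hypothetical throughout: it treats the median weekly demand as "sustained", sizes a committed fleet so that this demand fills a target fraction of its hours, and reports the GPU-hours that would have to overflow to on-demand or spot. It ignores the power and lead-time constraints above, which still have to be checked separately.

```python
import math
from statistics import median

HOURS_PER_WEEK = 168


def reserved_fleet_size(weekly_gpu_hours: list[float], target_utilisation: float = 0.85) -> int:
    """GPUs to commit so sustained (here: median) weekly demand fills the target share of hours."""
    sustained = median(weekly_gpu_hours)
    return math.ceil(sustained / (HOURS_PER_WEEK * target_utilisation))


def overflow_gpu_hours(weekly_gpu_hours: list[float], fleet_size: int) -> list[float]:
    """GPU-hours per week beyond the committed fleet, to be bought on-demand or spot."""
    capacity = fleet_size * HOURS_PER_WEEK
    return [max(0, demand - capacity) for demand in weekly_gpu_hours]


if __name__ == "__main__":
    # Hypothetical weekly demand in GPU-hours, with one spike week.
    demand = [10_000, 12_000, 11_000, 30_000, 11_500]
    fleet = reserved_fleet_size(demand)
    print(fleet, overflow_gpu_hours(demand, fleet))
```

Sizing to the spike week instead would leave most of that larger fleet idle in the other four weeks, which is exactly the idle-reserved-capacity failure mode listed below.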
Don't-miss checklist
- Match the consumption model to the workload: reserved for steady, spot for fault-tolerant/checkpointed, Capacity Blocks for bounded campaigns.
- Measure utilisation honestly (SM-active/MFU; see observability); idle reserved capacity is the biggest hidden cost.
- Track $/token (inference) and $/run (training); tie optimisation (see performance tuning) explicitly to cost.
- Confirm power and lead time before committing to capacity (see datacentre readiness).
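Two checklist items, honest MFU and $/token, can both be derived from measured throughput. The sketch below assumes a dense transformer and uses the common approximation of about 6 × N FLOPs per training token (forward plus backward, attention FLOPs ignored); the function names, model size, peak FLOP/s and prices are hypothetical. For $/token, pass throughput averaged over every hour you pay for, not a peak benchmark figure, or idle time disappears from the metric.

```python
def training_mfu(tokens_per_sec: float, n_params: float, n_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """MFU from measured training throughput, using ~6 * N FLOPs per token
    (dense transformer, forward + backward, attention FLOPs ignored)."""
    achieved_flops = tokens_per_sec * 6 * n_params
    return achieved_flops / (n_gpus * peak_flops_per_gpu)


def usd_per_million_tokens(price_per_gpu_hour: float, n_gpus: int,
                           tokens_per_sec: float) -> float:
    """Serving cost per 1M tokens for a replica at a given *average* throughput."""
    cost_per_hour = price_per_gpu_hour * n_gpus
    tokens_per_hour = tokens_per_sec * 3600
    return cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical: 7B-parameter model on 8 GPUs with 1e15 FLOP/s peak each.
    print(f"MFU: {training_mfu(80_000, 7e9, 8, 1e15):.0%}")
    # Hypothetical: 8-GPU inference replica at $2/GPU-hour averaging 20k tokens/s.
    print(f"${usd_per_million_tokens(2.0, 8, 20_000):.3f} per 1M tokens")
```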
Failure modes
- Reserved capacity sitting idle or running at low MFU: paying the full rate for a fraction of the useful work.
- Spot instances used without checkpointing: repeated lost work on preemption.
- Cost modelled on "GPU-util" instead of real SM-active/MFU, hiding the waste.
- Capacity bought with no confirmed power/cooling to host it (see datacentre readiness).
Open questions & validation
- Current neocloud pricing and availability, and by how much neoclouds undercut hyperscalers.
- Hyperscaler Blackwell instance specifics (SKUs, networking, Capacity Block mechanics).
- A clean build-vs-rent break-even model with utilisation and horizon as the variables.
References
- AWS Capacity Blocks for ML: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html
- GCP AI Hypercomputer / accelerator-optimized machines: https://cloud.google.com/compute/docs/accelerator-optimized-machines
- Azure GPU-accelerated VM sizes: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/overview#gpu-accelerated
- FinOps Foundation (cloud cost discipline): https://www.finops.org/
Related: BOM · Physical · Platform · Training · GPU Platform Split-Plane Architecture · Observability · Optimization · Glossary