GPU consumption models

Scope: On-demand vs reserved/committed vs spot/preemptible vs colocation/owned for GPUs. The price/risk tradeoffs, which workload shape each fits (bursty vs steady vs fault-tolerant), how to mix them in one fleet, and a worked monthly-cost comparison.

Time-sensitive, verify before executing

Every $/GPU-hour, discount percentage, and notice-window figure below is illustrative or a reference template. GPU pricing, spot availability, and interruption rates move weekly with supply, allocation, and region. Confirm against the live provider pricing page and a real quote before you plan against any number here.

What it is

A consumption model is the commercial contract under which you obtain a GPU-hour. It is orthogonal to where the silicon physically lives (cloud, neoclouds & cost) and to whether it is bare-metal or managed. There are four durable models. The list below is not a strict price ranking: on-demand is the dearest per hour, spot is the cheapest per hour, and reserved and owned capacity are cheap per useful hour only when kept busy.

  • On-demand: pay the list $/GPU-hour, billed per second or per hour, with no commitment; start and stop at will. Dearest per hour; zero commitment risk; subject to stock-out (the instance may not be available when you ask).
  • Reserved / committed: commit to a term (e.g. 1 or 3 years, or a fixed block window) in exchange for a discount or a capacity guarantee. Examples: AWS On-Demand Capacity Reservations and Capacity Blocks for ML, GCP Committed Use Discounts (CUDs) plus reservations, Azure Reserved VM Instances, and neocloud term contracts. Cheap only if kept busy: an idle reservation burns the full committed rate.
  • Spot / preemptible: run on spare capacity at a deep discount (cloud providers advertise up to ~90% off on-demand), with the provider retaining the right to reclaim the instance on short notice. Cheapest per hour; only safe for fault-tolerant, checkpointed work.
  • Colocation / owned: capex. Buy the hardware and rent rack, power, and cooling in a colo (or own the hall). Lowest marginal $/GPU-hour at high sustained utilisation over a multi-year horizon; carries depreciation, obsolescence, and lead-time risk.

These are mix-and-match. A real fleet usually layers a reserved/owned base under an on-demand burst layer with a spot opportunistic layer for checkpointed batch work.

Why it matters

The unit is the GPU-hour, but the number that matters is cost per useful unit ($/token for inference, $/training-run for training), and that is dominated by utilisation, not list price (cloud, neoclouds & cost, SLO/SLI catalog). Two facts drive the whole decision:

  • An idle reserved/owned GPU costs the full rate for zero work. A 60%-utilised 3-year reservation can be more expensive per useful hour than on-demand. Commitment only pays when sustained utilisation is high.
  • Spot is cheap because it can vanish. Without checkpoint/restart discipline (runbook: checkpoint recovery, DiLoCo), a preemption mid-run loses all unsaved progress, and the "saving" turns negative once you count re-run cost.

The choice of consumption model is often the largest controllable cost lever in a GPU platform, and each mismatch fails in its own way: reserving bursty work strands capital; running steady serving on on-demand can overpay by roughly 2–4x; running un-checkpointed training on spot loses work repeatedly.
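The utilisation argument above can be sketched as a one-line calculation. This is a minimal illustration, not a pricing tool: the $3.00 and $1.80 rates are illustrative placeholders, and `cost_per_useful_hour` is a hypothetical helper name.

```python
def cost_per_useful_hour(rate_per_gpu_hour: float, utilisation: float) -> float:
    """Billed rate divided by the fraction of billed hours doing useful work."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return rate_per_gpu_hour / utilisation


ON_DEMAND_RATE = 3.00  # illustrative placeholder, $/GPU-hour
RESERVED_RATE = 1.80   # illustrative placeholder, $/GPU-hour

# On-demand is assumed to be billed only while the job runs, so its useful-hour
# cost stays at the list rate. A reservation is billed for every hour of the term.
for utilisation in (0.9, 0.6, 0.4):
    reserved = cost_per_useful_hour(RESERVED_RATE, utilisation)
    verdict = "cheaper" if reserved < ON_DEMAND_RATE else "not cheaper"
    print(f"util={utilisation:.0%} reserved=${reserved:.2f}/useful-h ({verdict} than on-demand)")
```

At these placeholder rates the reservation only wins above roughly 60% utilisation; below that, the idle hours push its cost per useful hour past the on-demand list rate.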

When it is needed (and when not)

flowchart LR
  START["Workload to place"] --> Q1{"Fault-tolerant<br/>and checkpointed?"}
  Q1 -->|"Yes"| Q2{"Flexible on<br/>start time?"}
  Q1 -->|"No"| Q3{"Steady 24x7<br/>baseline?"}
  Q2 -->|"Yes"| SPOT["Spot / preemptible"]
  Q2 -->|"No"| Q3
  Q3 -->|"Yes, > ~1 yr horizon"| Q4{"Sustained<br/>util > break-even?"}
  Q3 -->|"No, bursty / short"| OD["On-demand"]
  Q4 -->|"Yes, multi-year"| OWN["Colocation / owned"]
  Q4 -->|"Yes, 1-3 yr"| RES["Reserved / committed"]
  Q4 -->|"No"| OD
  OD --> CB{"Bounded campaign,<br/>need guaranteed block?"}
  CB -->|"Yes"| BLOCK["Capacity Block / training plan"]

  • On-demand fits: bursty or spiky demand, dev/experimentation, short-lived jobs, latency-sensitive inference with no checkpoint tolerance, and anything you cannot forecast. Not for steady 24x7 baseload, where you overpay.
  • Reserved/committed fits: steady baseload you will keep busy for the term (production inference floor, a standing training cluster). Not for bursty or uncertain demand; idle commitment is the classic FinOps failure mode.
  • Capacity Blocks / training plans fit: bounded campaigns (a single large training run) needing a guaranteed contiguous block of GPUs for days-to-weeks, where on-demand would stock out. Not for open-ended baseload.
  • Spot/preemptible fits: fault-tolerant, checkpointed, restartable work, such as batch inference, data prep, hyperparameter sweeps, and checkpoint-disciplined / low-communication training (DiLoCo, distributed training recipes). Not for un-checkpointed long runs or hard-SLO online serving.
  • Colocation/owned fits: high, sustained, multi-year demand where capex amortises below rental and you control power/cooling. Not when demand is uncertain, the horizon is short, or generations will turn over before break-even (depreciation risk; vendor sourcing & procurement).
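The flowchart and fit list above can be read as a small decision function. The sketch below is one possible encoding, not a definitive policy: the `Workload` fields, the `choose_model` name, and the 3-year cut-off between "reserved" and "owned" are assumptions, and it reuses the committed-rate ratio as the break-even test for owned hardware, which a real capex model would replace.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    fault_tolerant: bool            # checkpointed and restartable
    flexible_start: bool            # can wait for spot capacity
    steady_baseline: bool           # runs 24x7
    horizon_years: float            # how long the demand is expected to last
    expected_utilisation: float     # sustained fraction of hours doing useful work
    needs_guaranteed_block: bool = False


def choose_model(w: Workload, committed_rate: float, on_demand_rate: float,
                 own_horizon_years: float = 3.0) -> str:
    # Q1 + Q2: fault-tolerant work with a flexible start goes to spot.
    if w.fault_tolerant and w.flexible_start:
        return "spot"
    # Q3 + Q4: steady baseline over a long horizon, above break-even utilisation.
    if w.steady_baseline and w.horizon_years >= 1:
        break_even = committed_rate / on_demand_rate
        if w.expected_utilisation > break_even:
            # Assumed cut-off: beyond own_horizon_years, owning is considered.
            return "owned" if w.horizon_years > own_horizon_years else "reserved"
    # Everything else is on-demand, unless a guaranteed block is required.
    return "capacity-block" if w.needs_guaranteed_block else "on-demand"


print(choose_model(Workload(False, False, True, 2, 0.9), 1.80, 3.00))
```

The function only captures the branching; the fit list still applies, for example hard-SLO serving stays off spot regardless of price.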

Interruption mechanics differ by provider and are the load-bearing constraint for spot:

| Provider | Spot/preemptible reclaim signal | Notes |
| --- | --- | --- |
| AWS EC2 Spot | 2-minute interruption notice; earlier rebalance recommendation signal | poll instance metadata; use price-capacity-optimized allocation |
| GCP Spot VMs | best-effort up to 30s (set preemption-notice to 120s for longer drain); no max runtime | preemptible (legacy) VMs also stop after 24h |
| Azure Spot VMs | minimum 30s eviction notice | evicted on capacity need or when price exceeds your max; Deallocate or Delete policy |
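For the AWS row, polling instance metadata can be sketched as follows. This is a minimal example that assumes IMDSv2 is reachable from the workload; the endpoint paths are the EC2 instance metadata ones, while the helper names and timeouts are illustrative choices.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def imds_token(ttl_seconds: int = 60) -> str:
    """Fetch a short-lived IMDSv2 session token."""
    req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def fetch_instance_action():
    """Return the interruption notice body, or None if no reclaim is scheduled."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()})
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is pending
            return None
        raise


def seconds_until_reclaim(body: str, now: datetime) -> float:
    """Parse a notice such as {"action": "terminate", "time": "...Z"}."""
    action = json.loads(body)
    when = datetime.strptime(action["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()
```

A typical use is a loop that calls `fetch_instance_action()` every few seconds and, on the first non-None result, triggers a checkpoint sized to fit within `seconds_until_reclaim(...)`. On Kubernetes, a node termination handler usually does this polling for you and drains the node instead.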

How: implement, integrate, maintain

Implement: mix the models in the scheduler. On Kubernetes, separate node pools by consumption model with labels/taints, then steer each workload with nodeSelector plus tolerations. Pair with a quota layer (Kueue, Volcano, manifest: Volcano job) so spot-tolerant batch can be preempted in-cluster while reserved serving keeps its floor. Reference manifest (template, adjust node labels to your provider):

# Spot-tolerant, checkpointed batch job. Tolerates the cloud spot taint and
# requests the spot node pool. Pair with a checkpoint sidecar / app-level save.
apiVersion: batch/v1
kind: Job
metadata:
  name: hpo-sweep-spot
spec:
  backoffLimit: 12            # expect preemptions; retry the job
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        consumption-model: spot          # your own label on the spot pool
        cloud.google.com/gke-spot: "true" # GKE label; other providers use their own
      tolerations:
        - key: cloud.google.com/gke-spot  # must match the taint on your spot pool; AKS uses kubernetes.azure.com/scalesetpriority=spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      terminationGracePeriodSeconds: 110  # must fit the provider's notice window (~120s on AWS; far less on a 30s notice)
      containers:
        - name: trainer
          image: ghcr.io/example/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          # App MUST checkpoint to durable storage and resume from latest.
          command: ["python", "train.py", "--resume-from", "/ckpt/latest"]
          lifecycle:
            preStop:                       # flush a checkpoint on the eviction signal
              exec:
                command: ["python", "save_checkpoint.py", "--out", "/ckpt/latest"]
          volumeMounts:
            - name: ckpt
              mountPath: /ckpt
      volumes:
        - name: ckpt
          persistentVolumeClaim:
            claimName: ckpt-durable        # durable, NOT node-local
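The manifest requires the application to checkpoint to durable storage and resume from the latest save. A minimal sketch of that app-side loop follows, assuming the container receives SIGTERM on eviction; the JSON state, the function names, and the `save_every` interval are hypothetical stand-ins for a real trainer's checkpoint format.

```python
import json
import os
import signal
import tempfile

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: ask the training loop to save and exit."""
    global stop_requested
    stop_requested = True


def install_sigterm_handler():
    # Call once from the main thread at start-up.
    signal.signal(signal.SIGTERM, request_stop)


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then rename over the target,
    so a preemption mid-write never leaves a truncated checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, save_every: int = 100) -> int:
    state = load_checkpoint(path)              # resume from the latest save
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1                     # stand-in for one training step
        if state["step"] % save_every == 0:
            save_checkpoint_atomic(state, path)
    save_checkpoint_atomic(state, path)        # final save, also on SIGTERM
    return state["step"]
```

Saving from inside the training process matters because a separate preStop command cannot see the trainer's in-memory state; the save must also finish inside `terminationGracePeriodSeconds`.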

Implement: reserve a guaranteed block (AWS Capacity Blocks for ML). For a bounded campaign whose GPUs on-demand may not be able to supply, reserve a block of 1–14 days (or a multiple of 7 days, up to 182 days) starting up to 8 weeks out. You pay up front:

# Find an available Capacity Block offering, then purchase it. Template: confirm
# instance type, count, and the offering id against your account/region.
# The date range is a search window; keep it comfortably wider than the block duration.
aws ec2 describe-capacity-block-offerings \
  --instance-type p5.48xlarge \
  --instance-count 2 \
  --start-date-range 2026-07-01T00:00:00Z \
  --end-date-range  2026-07-15T00:00:00Z \
  --capacity-duration-hours 168

aws ec2 purchase-capacity-block \
  --capacity-block-offering-id "cbo-EXAMPLE0123456789" \
  --instance-platform "Linux/UNIX"
# Launch into the reservation by targeting its capacity-reservation-id.
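The duration rule quoted above (1–14 days, or a multiple of 7 days up to 182) can be checked before calling the CLI. This is a small sketch built only from those figures; `capacity_block_hours` is a hypothetical helper, and the limits should be confirmed against the current AWS documentation.

```python
def capacity_block_hours(days: int) -> int:
    """Return the value for --capacity-duration-hours, or raise if the
    duration does not match the rule quoted in the text."""
    short_block = 1 <= days <= 14
    weekly_block = 14 < days <= 182 and days % 7 == 0
    if not (short_block or weekly_block):
        raise ValueError(
            f"{days} days is not 1-14 days or a multiple of 7 up to 182 days")
    return days * 24


print(capacity_block_hours(7))   # the 7-day block used in the command above
```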

Integrate: worked monthly-cost comparison. One steady inference replica needing 1 H100-class GPU 24x7 for 730 h/month. Rates are illustrative placeholders. Substitute live quotes.

| Model | Effective $/GPU-h (illustrative) | Assumed utilisation | Monthly $ (illustrative) | Risk |
| --- | --- | --- | --- | --- |
| On-demand | $3.00 | 100% (billed for every running hour, busy or idle) | $2,190 | none; may stock out |
| Reserved (1-yr) | $1.80 (≈40% off) | needs ≥ ~60% to beat on-demand | $1,314 | idle = full burn |
| Spot | $0.90 (≈70% off) | effective after re-runs | ≈ $657 + re-run overhead | preemption |
| Colocation/owned | hardware capex ÷ amortisation + colo opex | high sustained over 3 yr | capex-amortised; lowest at high util | depreciation, lead time |

Break-even rule of thumb for committing vs on-demand: commit when expected sustained utilisation > (committed rate ÷ on-demand rate). At the placeholders above, a 1-yr reservation at $1.80 vs on-demand $3.00 breaks even near 60% utilisation. Below that, on-demand is cheaper per useful hour. Spot's headline saving is only real net of preemption re-run cost; budget the re-run overhead explicitly.
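The arithmetic behind the comparison and the break-even rule can be reproduced in a few lines. The rates are the same illustrative placeholders; `rerun_overhead` (the extra fraction of GPU-hours spent redoing work lost to preemption) is an assumed modelling parameter, not a measured figure.

```python
HOURS_PER_MONTH = 730


def monthly_cost(rate: float, hours: float = HOURS_PER_MONTH) -> float:
    return rate * hours


def break_even_utilisation(committed_rate: float, on_demand_rate: float) -> float:
    """Commit only when expected sustained utilisation exceeds this ratio."""
    return committed_rate / on_demand_rate


def spot_monthly_cost(rate: float, rerun_overhead: float,
                      hours: float = HOURS_PER_MONTH) -> float:
    """Spot bill including the extra hours spent repeating lost work."""
    return rate * hours * (1 + rerun_overhead)


on_demand, reserved, spot = 3.00, 1.80, 0.90  # illustrative $/GPU-hour

print("on-demand :", round(monthly_cost(on_demand), 2))
print("reserved  :", round(monthly_cost(reserved), 2))
print("spot, no re-runs  :", round(spot_monthly_cost(spot, 0.0), 2))
print("spot, 25% re-runs :", round(spot_monthly_cost(spot, 0.25), 2))
print("break-even util   :", round(break_even_utilisation(reserved, on_demand), 2))
```

Under these placeholders, spot stays below the reserved bill until re-run overhead approaches 100% of the useful hours, which is why the overhead should be measured, not assumed.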

Maintain: monitor for the failure modes. Track effective $/useful-hour, not list price. A PromQL signal for idle-but-paid reserved GPUs (full burn at low utilisation; see telemetry & monitoring, observability):

# Reserved GPUs averaging < 30% SM-active over 6h = paying full rate for little work.
# Label your reserved nodes consumption_model="reserved" in DCGM exporter relabeling.
avg_over_time(
  DCGM_FI_PROF_SM_ACTIVE{consumption_model="reserved"}[6h]
) < 0.30
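
# Spot preemption rate: if churn is high, the discount may be eaten by lost work.
# kube_pod_deleted_total and the node_pool label are placeholders, not stock
# kube-state-metrics series; substitute the counter your interruption handler exports.
sum(rate(kube_pod_deleted_total{node_pool="spot"}[1h]))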

Run these against utilisation SLOs (SLO/SLI catalog, runbook: MFU regression). Recheck reservation coverage each renewal cycle: right-size commitments to sustained demand, never to peak hopes.
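Right-sizing a commitment to sustained demand instead of peak can be sketched as a brute-force search over commitment sizes. This is an illustration under simplifying assumptions: demand is a list of whole GPUs needed per hour, anything above the commitment is served on-demand with no stock-out, and the function name and sample demand curve are hypothetical.

```python
def best_commitment(hourly_demand, committed_rate, on_demand_rate):
    """Return (gpus_to_commit, total_cost) minimising the bill over the period."""
    if not hourly_demand:
        raise ValueError("hourly_demand must not be empty")
    best_n, best_cost = 0, float("inf")
    for n in range(max(hourly_demand) + 1):
        # Committed GPUs are billed every hour, used or not.
        committed = n * len(hourly_demand) * committed_rate
        # Demand above the commitment bursts to on-demand.
        burst = sum(max(0, d - n) for d in hourly_demand) * on_demand_rate
        cost = committed + burst
        if cost < best_cost:
            best_n, best_cost = n, cost
    return best_n, best_cost


# One day: 4 GPUs for 18 hours, peaking at 10 GPUs for 6 hours.
demand = [4] * 18 + [10] * 6
gpus, cost = best_commitment(demand, committed_rate=1.80, on_demand_rate=3.00)
print(gpus, round(cost, 2))
```

With these placeholder rates the search commits to the 4-GPU floor, not the 10-GPU peak: each extra committed GPU would be busy only 25% of the time, well under the ~60% break-even.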

References

  • AWS EC2 Spot interruption notices (2-minute notice): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
  • AWS EC2 rebalance recommendations: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/rebalance-recommendations.html
  • AWS Capacity Blocks for ML (durations, 8-week horizon, up-front billing): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html
  • AWS Capacity Blocks pricing: https://aws.amazon.com/ec2/capacityblocks/pricing/
  • GCP Spot VMs (best-effort 30s notice, no max runtime, up to ~91% off): https://docs.cloud.google.com/compute/docs/instances/spot
  • GCP Committed Use Discounts (not applicable to Spot): https://docs.cloud.google.com/compute/docs/instances/committed-use-discounts-overview
  • Azure Spot Virtual Machines (30s eviction notice, capacity/price triggers, eviction policies): https://learn.microsoft.com/en-us/azure/virtual-machines/spot-vms
  • FinOps Foundation (cloud cost discipline): https://www.finops.org/

Related: Cloud & cost · Vendor sourcing · Workload bring-up · Kueue quota · DiLoCo · SLO/SLI · Telemetry · Glossary