
GPU provider landscape

Scope: the four categories you can rent GPUs from (hyperscalers, GPU neoclouds, decentralized/marketplace, and second-hand/distressed capacity), what each is good at, and a concrete checklist (availability, network/fabric, egress, contract terms) to evaluate one before you commit.

Time-sensitive: verify before committing

Provider names, programs, SKUs, lead times, fabric specs, egress rates, and prices in GPU rental move constantly with supply, allocation, and geopolitics (export controls). Treat every vendor name and number below as a reference template, not a guarantee. Confirm against the provider's live quote and current docs before you sign. All price and percentage figures are illustrative; verify them with the provider.

What it is

Providers are the supply side of build-vs-rent (cloud, neoclouds & cost). Most GPU capacity is consumed, not owned, so the first decision is which kind of supplier fits the workload. Four durable categories, ordered roughly from highest price and lowest risk to lowest price and highest risk:

Hyperscalers: AWS, GCP, Azure, OCI. GPU SKUs sit inside a full cloud (IAM, managed k8s/Slurm, object store, VPC peering). Highest price, deepest ecosystem, broadest region coverage, and reservation mechanics like AWS Capacity Blocks for ML. The default inter-node interconnect on most GPU instances is high-speed Ethernet/RDMA (AWS EFA, GCP GPUDirect-TCPX/RDMA); top SKUs offer NVLink within a node and dedicated fabrics on rack-scale systems.
GPU neoclouds: GPU-specialist providers (illustratively CoreWeave, Lambda, Crusoe, Nebius, Together AI, Nscale, Fluidstack, RunPod, Vast.ai). Bare-metal or thin k8s/Slurm on top, frequently InfiniBand fabrics, near-zero egress, and prices commonly 40–85% below hyperscalers for the same chip (illustrative). Many are NVIDIA Cloud Partners (NCP) reachable through the DGX Cloud Lepton marketplace, which brokers capacity across providers and centralizes billing.
Decentralized / marketplace (DePIN): networks that aggregate heterogeneous, globally distributed GPUs from many independent suppliers (illustratively io.net, Akash, Vast.ai's marketplace, Render). Cheapest headline price, but with heterogeneous hardware, variable interconnect, weak or absent SLAs, and globally dispersed nodes. That is why low-communication training (DiLoCo, Distributed Training Recipes) matters here, rather than the single-DC high-bandwidth regime.
Second-hand / distressed capacity: reserved blocks resold by buyers who over-committed, or hardware from wound-down operators, brokered by SIs/NCPs. Can be very cheap for a bounded campaign but carries provenance, remaining-warranty, depreciation/obsolescence (generations turn over ~yearly), and export-control residual-value risk.

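The through-line from Cloud, Neoclouds and Cost/Capacity still holds: utilisation is the lever. At the same hourly rate, a GPU running at 25% MFU costs twice as much per unit of useful work as one at 50% MFU, so a cheaper GPU-hour only wins if the discount outweighs any MFU it loses. Provider choice is therefore only half the unit-economics story.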
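The utilisation arithmetic can be sketched in a few lines. This is a minimal illustration that assumes cost per useful GPU-hour is simply the hourly rate divided by MFU; the function name and both rates are hypothetical placeholders, not provider figures.

```python
def cost_per_useful_gpu_hour(usd_per_gpu_hour: float, mfu: float) -> float:
    """Hourly rate divided by MFU (a fraction in (0, 1])."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be a fraction in (0, 1]")
    return usd_per_gpu_hour / mfu


if __name__ == "__main__":
    # Illustrative rates only: a cheaper GPU-hour at low MFU vs a dearer one at high MFU.
    cheap_low_mfu = cost_per_useful_gpu_hour(2.00, 0.25)
    dear_high_mfu = cost_per_useful_gpu_hour(3.00, 0.50)
    print(f"cheap @ 25% MFU: ${cheap_low_mfu:.2f} per useful GPU-hour")
    print(f"dear  @ 50% MFU: ${dear_high_mfu:.2f} per useful GPU-hour")
```

With these placeholder inputs the nominally cheaper provider is the more expensive one per unit of useful work, which is the point of comparing quotes on achieved MFU rather than list price.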

Why it matters

Price spread is large and category-driven. The same H100 can be ~3x cheaper on a neocloud or DePIN network than on a hyperscaler (illustrative). Picking the wrong category for the workload leaves money or reliability on the table.
Egress is a hidden bill. Hyperscalers commonly charge ~$0.05–$0.12/GB outbound; many neoclouds charge zero (illustrative, verify). Training checkpoints and dataset movement make this a real line item; model it before, not after.
Fabric decides multi-node viability. A job that needs all-reduce at scale is gated by the interconnect (HPC Networking Fabric, NCCL Collectives and Algorithm Selection). InfiniBand vs commodity Ethernet vs internet-scale DePIN links is the difference between linear scaling and a stalled run.
Contract terms decide whether you can actually run. On-demand vs reserved vs spot vs Capacity Block, preemption notice, minimum commit, and exit terms determine cost and whether checkpointing is mandatory (FSDP checkpoint discipline).
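To make the egress line item concrete, the sketch below estimates a monthly egress bill from checkpoint and dataset movement. It assumes a flat per-GB rate with no free tier or volume discounts, which real price sheets often have; the function name and all input figures are hypothetical.

```python
def monthly_egress_usd(checkpoint_gb: float, checkpoints_moved: int,
                       dataset_gb_moved: float, usd_per_gb: float) -> float:
    """Flat-rate estimate: (checkpoint traffic + dataset traffic) * $/GB."""
    total_gb = checkpoint_gb * checkpoints_moved + dataset_gb_moved
    return total_gb * usd_per_gb


if __name__ == "__main__":
    # Illustrative: 500 GB checkpoints copied out 20 times, plus 2 TB of dataset movement.
    for label, rate in (("metered egress", 0.09), ("zero egress", 0.00)):
        bill = monthly_egress_usd(500, 20, 2000, rate)
        print(f"{label:15s} ${bill:,.2f}/month")
```

Running the same movement profile against each candidate's quoted rate shows whether egress is noise or a meaningful share of landed cost for your workload.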

When it is needed (and when not)

Use this page when you are choosing a rental provider or comparing quotes. Skip it when you have already decided to build (go to BOM Validation/Vendor Sourcing and Procurement Logistics) or when you already have allocated capacity and just need to bring a workload up (Workload and Bring-Up Recipes, Serving Open-Weight Models).

Category fit by workload:

flowchart LR
  W["Workload"] --> Q1{"Multi-node tight<br/>all-reduce?"}
  Q1 -->|"yes"| Q2{"Enterprise SLA /<br/>compliance / region?"}
  Q1 -->|"no, fault-tolerant<br/>or single-node"| Q3{"Cost the<br/>dominant axis?"}
  Q2 -->|"yes"| HS["Hyperscaler<br/>(reserved / Capacity Block)"]
  Q2 -->|"no, want price + IB"| NC["GPU neocloud<br/>(InfiniBand, low egress)"]
  Q3 -->|"yes, low-comms ok"| DP["Decentralized / marketplace<br/>(DiLoCo-style, checkpoint hard)"]
  Q3 -->|"bounded campaign,<br/>cheap blocks"| SH["Second-hand / distressed<br/>(verify provenance)"]
  Q3 -->|"no"| NC

Hyperscaler when you need region/compliance coverage, integration with an existing cloud estate, a managed control plane, or reserved/Capacity-Block guarantees for a tight all-reduce job.
Neocloud when you want most of the hyperscaler's multi-node capability (InfiniBand) at a large discount with low egress, and can tolerate a thinner managed surface.
Decentralized/marketplace when the job is fault-tolerant, low-communication, or single-node (inference, embarrassingly-parallel, DiLoCo-style training) and cost dominates SLA.
Second-hand/distressed for a bounded, time-boxed campaign where a cheap block beats a clean contract; never for steady production you must depend on.
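The category-fit logic above can also be written as a small function, which is handy for tagging workloads in a planning sheet. This is a sketch of the flowchart, not a policy: the function and parameter names are hypothetical, and checking the bounded-campaign case before the cost question is an assumption, since the flowchart treats them as sibling branches.

```python
def pick_category(tight_all_reduce: bool, needs_enterprise_sla: bool,
                  cost_dominant: bool, bounded_campaign: bool = False) -> str:
    """Map workload traits to a provider category, following the flowchart."""
    if tight_all_reduce:
        # Multi-node, communication-heavy: the choice is SLA/compliance vs price + InfiniBand.
        return "hyperscaler" if needs_enterprise_sla else "neocloud"
    if bounded_campaign:
        # Time-boxed job where a cheap resold block is acceptable (verify provenance).
        return "second_hand"
    # Fault-tolerant or single-node: go decentralized only if cost dominates.
    return "decentralized" if cost_dominant else "neocloud"


if __name__ == "__main__":
    print(pick_category(True, True, False))    # tight all-reduce with compliance needs
    print(pick_category(True, False, True))    # tight all-reduce, price-sensitive
    print(pick_category(False, False, True))   # low-comms, cost-driven
    print(pick_category(False, False, True, bounded_campaign=True))
```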

How: implement, integrate, maintain

Implement: the evaluation checklist. Score every candidate quote on four axes before price. The runnable scorer below encodes them; fill it from each provider's quote and docs.

Availability: can they deliver the SKU, count, and topology in your window? Allocation, not list price, is usually the binding constraint for current GPUs. Confirm reservation type (on-demand / reserved / spot / Capacity Block), preemption notice, and minimum commit.
Network / fabric: intra-node (NVLink/NVSwitch) and inter-node fabric (InfiniBand XDR/NDR vs Ethernet/RoCE vs internet), rail topology, and whether GPUDirect RDMA/Storage is available (HPC Networking Fabric, GPUDirect Storage). Tight multi-node training needs a non-blocking IB/RoCE fabric; inference and low-comms training do not.
Egress: $/GB outbound and inter-region/inter-AZ. Multiply by your real checkpoint + dataset movement. Near-zero egress materially changes landed cost.
Contract terms: minimum commit, term length, exit/early-termination, SLA/credits, support tier, data-locality/sovereignty, and export-control posture for your SKU and region.

#!/usr/bin/env python3
"""Score GPU rental quotes across availability, fabric, egress, contract.

Runnable: `python score_providers.py`. All inputs are illustrative placeholders —
replace with figures from each provider's live quote and docs. No network calls.
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    name: str
    category: str                 # hyperscaler | neocloud | decentralized | second_hand
    usd_per_gpu_hour: float        # on-demand or reserved effective rate
    egress_usd_per_gb: float
    fabric: str                    # infiniband_xdr | infiniband_ndr | roce | ethernet | internet
    can_deliver_in_window: bool    # allocation confirmed for SKU+count+date
    reservation: str               # on_demand | reserved | spot | capacity_block
    min_commit_months: int
    sla_pct: float | None          # contractual uptime, None if absent

# Higher score = better fit for tight multi-node training. Tune weights per workload.
FABRIC_RANK = {"infiniband_xdr": 5, "infiniband_ndr": 4, "roce": 3,
               "ethernet": 2, "internet": 1}


def monthly_landed_cost(q: Quote, gpu_hours: float, egress_tb: float) -> float:
    """Compute = rate*hours; egress in TB -> GB. Excludes storage/support."""
    return q.usd_per_gpu_hour * gpu_hours + q.egress_usd_per_gb * egress_tb * 1024


def fit_score(q: Quote, need_tight_fabric: bool) -> float:
    if not q.can_deliver_in_window:
        return 0.0  # cannot deliver => unusable, fail-fast exit
    fabric = FABRIC_RANK.get(q.fabric, 0)
    fabric_term = fabric if need_tight_fabric else min(fabric, 2)
    sla_term = (q.sla_pct or 0.0) / 100.0
    commit_penalty = q.min_commit_months * 0.05
    return fabric_term + 2.0 * sla_term - commit_penalty


def evaluate(quotes: list[Quote], gpu_hours: float, egress_tb: float,
             need_tight_fabric: bool) -> list[tuple[str, float, float]]:
    rows = []
    for q in quotes:
        cost = monthly_landed_cost(q, gpu_hours, egress_tb)
        rows.append((q.name, round(cost, 2), round(fit_score(q, need_tight_fabric), 2)))
    return sorted(rows, key=lambda r: (-r[2], r[1]))  # best fit first, then cheapest


if __name__ == "__main__":
    # Illustrative — replace every number with a real quote.
    quotes = [
        Quote("hyperscaler-A", "hyperscaler", 6.50, 0.09, "ethernet",
              True, "capacity_block", 1, 99.9),
        Quote("neocloud-B", "neocloud", 2.40, 0.00, "infiniband_xdr",
              True, "reserved", 6, 99.0),
        Quote("depin-C", "decentralized", 1.20, 0.00, "internet",
              True, "on_demand", 0, None),
    ]
    hours = 8 * 24 * 30        # 8 GPUs, full month
    egress_tb = 5.0            # checkpoints + dataset pulls, illustrative
    for name, cost, score in evaluate(quotes, hours, egress_tb, need_tight_fabric=True):
        print(f"{name:16s} landed=${cost:>10,.2f}/mo  fit={score:>5.2f}")

Integrate. Once a provider is picked, the rest of this KB takes over: stand up the control plane on the rented nodes, starting with Cluster Orchestration: Kubernetes, k3s, Ray, Slurm (Kubernetes for GPU Clusters/k3s/Slurm for GPU Clusters/Ray for GPU Clusters), the GPU platform via Kubernetes and Helm: GPU Platform, batch scheduling via Volcano Gang Scheduler/Volcano Job, and serving via Inference Serving and Optimization/Serving Open-Weight Models. Validate the fabric on arrival before trusting it (Fabric Bring-Up, Validation and Benchmarking, Smoke Tests: GPU Platform, GPU Health Gating).
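One way to sanity-check an all-reduce benchmark on arrival is to convert the measured time into bus bandwidth and compare it with the figure the provider quoted. The sketch below assumes the nccl-tests convention for all-reduce (bus bandwidth = algorithm bandwidth × 2(n−1)/n, in GB/s with 1 GB = 1e9 bytes). The function names, the 80% acceptance threshold, and the sample numbers are hypothetical and should be replaced with your own acceptance criteria.

```python
def allreduce_bus_bandwidth_gb_per_s(message_bytes: float, seconds: float,
                                     ranks: int) -> float:
    """Bus bandwidth for all-reduce: (bytes / time) * 2 * (n - 1) / n, in GB/s."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    algorithm_bw = message_bytes / seconds / 1e9
    return algorithm_bw * 2 * (ranks - 1) / ranks


def fabric_passes(measured_gb_per_s: float, quoted_gb_per_s: float,
                  tolerance: float = 0.8) -> bool:
    """Hypothetical acceptance rule: measured must reach `tolerance` of the quoted figure."""
    return measured_gb_per_s >= tolerance * quoted_gb_per_s


if __name__ == "__main__":
    # Illustrative: 8 GB message, 0.1 s per all-reduce, 16 ranks.
    measured = allreduce_bus_bandwidth_gb_per_s(8e9, 0.1, 16)
    print(f"bus bandwidth: {measured:.1f} GB/s, passes: {fabric_passes(measured, 180.0)}")
```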

Maintain. Track landed cost and real utilisation, not headline rate: $/GPU-hour divided by true MFU (or SM-active), $/run (training), and $/token (inference), per SLO/SLI Catalog and Error-Budget Alerts and Observability and Monitoring/Telemetry, Monitoring and Alerting. A reserved or committed block sitting idle is the biggest hidden cost; alert on it. Below: a Prometheus rule that flags reserved capacity burning money at low utilisation, and a query to surface egress-heavy workloads.

# prometheus-rules-rented-gpu.yaml — alert on idle reserved capacity + low MFU.
# Assumes DCGM exporter (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
# and a static label provider/reservation on the GPU nodes. Verify metric names
# against your exporter version (manifest-dcgm-exporter).
groups:
  - name: rented-gpu-economics
    rules:
      - alert: ReservedGpuIdle
        # Reserved/committed GPUs with tensor-core activity near zero for 30m.
        expr: |
          avg by (provider, node) (
            DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{reservation=~"reserved|capacity_block"}
          ) < 0.05
        for: 30m
        labels: {severity: warning}
        annotations:
          summary: "Reserved GPUs idle on {{ $labels.provider }}/{{ $labels.node }}"
          description: "Paying the full reserved rate for <5% tensor activity."

      - alert: LowMfuSustained
        # Tensor-pipe activity is a proxy for MFU, not MFU itself. Sustained low
        # values => paying ~2x or more per useful unit of work.
        expr: |
          avg by (provider, node) (
            DCGM_FI_PROF_PIPE_TENSOR_ACTIVE
          ) < 0.25
        for: 2h
        labels: {severity: info}
        annotations:
          summary: "Low MFU on {{ $labels.provider }}/{{ $labels.node }}"
          description: "Cost per useful unit roughly doubles vs 50% MFU; investigate."

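# PromQL: top 10 namespaces by outbound bytes over the last 30 days. Multiply the
# GiB result by your provider's $/GB to estimate egress cost.
# Assumes cAdvisor's container_network_transmit_bytes_total with a namespace label.
# It counts all pod transmit traffic, in-cluster included, so treat the result as
# an upper bound on billable egress.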
topk(10,
  sum by (namespace) (
    increase(container_network_transmit_bytes_total[30d])
  ) / 1024 / 1024 / 1024            # -> GiB egress per namespace
)

Validate the rule file with promtool check rules prometheus-rules-rented-gpu.yaml before loading it, and run the scorer with python score_providers.py. Both are reference templates; substitute your provider's real quote, metric names, and egress rate.
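The $/run and $/token figures mentioned under Maintain reduce to simple arithmetic once you have a rate and a measured throughput. The sketch below is a minimal illustration; the function names, the rate, and the throughput are hypothetical, and it ignores storage, support, and idle time.

```python
def usd_per_run(usd_per_gpu_hour: float, gpus: int, wall_hours: float,
                egress_usd: float = 0.0) -> float:
    """Training: compute cost for the whole run plus any egress charge."""
    return usd_per_gpu_hour * gpus * wall_hours + egress_usd


def usd_per_million_tokens(usd_per_gpu_hour: float, gpus: int,
                           tokens_per_second: float) -> float:
    """Inference: cost per second of the deployment divided by measured tokens/s."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    usd_per_second = usd_per_gpu_hour * gpus / 3600
    return usd_per_second / tokens_per_second * 1e6


if __name__ == "__main__":
    # Illustrative: 8 GPUs at $2.40/GPU-hour.
    print(f"100 h run: ${usd_per_run(2.40, 8, 100, egress_usd=50):,.2f}")
    print(f"serving:   ${usd_per_million_tokens(2.40, 8, 4000):.2f} per 1M tokens")
```

Feeding these functions measured throughput rather than vendor benchmarks keeps the comparison tied to what the workload actually achieves on that provider.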

References

AWS EC2 Capacity Blocks for ML: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html
AWS EFA (Elastic Fabric Adapter): https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html
GCP accelerator-optimized machines (A3/A4, GPUDirect-TCPX/RDMA): https://cloud.google.com/compute/docs/accelerator-optimized-machines
Azure GPU-accelerated VM sizes: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/overview#gpu-accelerated
NVIDIA DGX Cloud Lepton (NCP marketplace): https://www.nvidia.com/en-us/data-center/dgx-cloud-lepton/
NVIDIA DGX Cloud Lepton announcement (CoreWeave, Crusoe, Lambda, Nebius, Nscale, et al.): https://nvidianews.nvidia.com/news/nvidia-announces-dgx-cloud-lepton-to-connect-developers-to-nvidias-global-compute-ecosystem
NVIDIA Cloud Partners (NCP): https://www.nvidia.com/en-us/data-center/gpu-cloud-computing/partners/
NVIDIA Quantum InfiniBand (fabric for multi-node training): https://www.nvidia.com/en-us/networking/products/infiniband/
Akash Network docs (decentralized compute marketplace): https://akash.network/docs/
FinOps Foundation (cloud cost discipline): https://www.finops.org/
DCGM Exporter metrics (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE): https://github.com/NVIDIA/dcgm-exporter