Skip to content
Markdown

GPU cluster storage and data platform

Scope: feeding the GPUs. Parallel filesystems, object storage, local scratch, checkpoint strategy, GPUDirect Storage, and the data-loading path that keeps GPUs busy rather than starved. The storage fabric is one of the four SuperPOD fabrics (networking fabric); this page is what runs over it.

flowchart LR
  DATA["Dataset shards"] --> CACHE["Local NVMe cache"]
  CACHE --> LOADER["Dataloader or DALI"]
  LOADER --> GPU["GPU memory"]
  GPU --> CKPT["Sharded checkpoints"]
  CKPT --> STORE["Parallel filesystem or object store"]

Overview

An idle GPU waiting on I/O is the most expensive idle in the building. Storage is therefore a GPU performance problem, not a back-office one. Two load patterns dominate: training (huge sequential reads of dataset shards plus bursty checkpoint writes) and inference (model load, KV-cache spill, RAG retrieval). The skill is sizing and shaping storage so the data path never gates the accelerators, and proving it with a real dataloader rather than a spec sheet.

Core knowledge

The tiers

  • Local NVMe scratch: per-node, fastest, ephemeral. Dataset cache, intermediate state, checkpoint staging, spill.
  • Parallel / shared filesystem: the working set and checkpoints, mounted cluster-wide. Open/HPC: Lustre, IBM Storage Scale (GPFS), BeeGFS. AI-era commercial flash: WEKA, VAST Data, DDN (EXAScaler/Lustre, Infinia). These deliver the GB/s-per-node throughput and the high-IOPS metadata rate that AI needs. The B300 SuperPOD reference architecture sizes storage per scalable unit: a single SU aggregate read of 80 GB/s (Standard) up to 250 GB/s (Enhanced), with 4 GB/s per GPU for high-resolution image pipelines (BOM validation).
  • Object storage (S3-compatible): the durable lake and cold tier for datasets and archived checkpoints. Streamed via mountpoint-s3, fsspec, or tiered automatically by WEKA/VAST. Treat its latency and egress as nothing like local.

GPUDirect Storage (GDS)

  • DMA path from NVMe or the storage NIC directly into GPU memory, bypassing the CPU bounce buffer (the cuFile API, nvidia-fs module). Cuts latency and CPU overhead for data-heavy loads; the storage analogue of GPUDirect RDMA (GPU performance and health). Requires a supported filesystem and correct setup. Confirm it is actually engaged, not silently bypassed.

Checkpointing (the dominant write pattern at scale)

  • Asynchronous: overlap the checkpoint write with continued compute instead of blocking every rank.
  • Sharded / distributed: each rank writes its own shard via torch.distributed.checkpoint (DCP), so write bandwidth scales with ranks instead of funnelling through rank 0.
  • Tiered: checkpoint to local NVMe first, then drain to shared/object storage asynchronously.
  • Cadence: checkpoint often enough that lost work on a failure is cheaper than the restart, sized against fleet MTBF (reliability and RAS). At very large scale, in-memory or peer checkpointing (to a neighbour's RAM) reduces the storage hit.

Data loading (the silent GPU-starver)

  • The bottleneck is usually the CPU/IO feed, not the GPU. Levers: enough DataLoader workers, prefetch, pinned memory for fast H2D copies, and GPU-accelerated decode/augment via NVIDIA DALI.
  • Shard the dataset: pack millions of small files into tar/parquet/Arrow shards (WebDataset, Mosaic StreamingDataset). Millions of small files trigger a metadata storm that melts the parallel FS for every tenant.
  • Cache hot shards node-local; read-through caches (Alluxio, JuiceFS) front object stores for repeated epochs.

Don't-miss checklist

  • Size per-node storage bandwidth to exceed what a training step consumes; verify with fio and a real dataloader, not vendor specs.
  • Shard datasets; never train off millions of small files on a shared filesystem.
  • Async + sharded checkpoints; tie cadence to fleet MTBF (reliability and RAS).
  • Keep the storage fabric separate from the compute fabric (often NDR vs XDR, networking fabric).
  • Confirm GDS is actually engaged where the design assumes it.
  • Use pinned memory and DALI before blaming the GPU for low utilisation (performance tuning).

Failure modes

  • GPUs below 50% utilisation because the dataloader/IO cannot keep up: the most common and most invisible tax.
  • Metadata storms from small-file datasets crippling the parallel FS cluster-wide.
  • Synchronous checkpoints stalling all ranks every N steps; or a checkpoint write storm saturating the FS.
  • Resume that loses dataloader position or RNG state (distributed training).
  • Object-store latency assumed equal to local, surfacing as inference cold starts and RAG stalls.

Open questions & validation

  • Benchmark one parallel FS (Lustre or WEKA/VAST) with fio and a real dataloader: mount, stripe/layout, metadata behaviour under load.
  • GPUDirect Storage setup and a measured before/after.
  • DCP sharded async checkpointing at scale, with a verified deterministic resume (distributed training).

References

  • GPUDirect Storage: https://docs.nvidia.com/gpudirect-storage/index.html
  • PyTorch Distributed Checkpoint (DCP): https://pytorch.org/docs/stable/distributed.checkpoint.html
  • NVIDIA DALI (data loading): https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
  • WebDataset: https://github.com/webdataset/webdataset
  • Lustre: https://www.lustre.org/ · WEKA: https://www.weka.io/ · VAST: https://www.vastdata.com/

Related: Fabric · Performance · Training · Reliability · Optimization · Glossary