Markdown

GPU cluster storage and data platform¶

Scope: feeding the GPUs. Parallel filesystems, object storage, local scratch, checkpoint strategy, GPUDirect Storage, and the data-loading path that keeps GPUs busy rather than starved. The storage fabric is one of the four SuperPOD fabrics (networking fabric); this page is what runs over it.

flowchart LR
  DATA["Dataset shards"] --> CACHE["Local NVMe cache"]
  CACHE --> LOADER["Dataloader or DALI"]
  LOADER --> GPU["GPU memory"]
  GPU --> CKPT["Sharded checkpoints"]
  CKPT --> STORE["Parallel filesystem or object store"]

Overview¶

An idle GPU waiting on I/O is the most expensive idle in the building. Storage is therefore a GPU performance problem, not a back-office one. Two load patterns dominate: training (huge sequential reads of dataset shards plus bursty checkpoint writes) and inference (model load, KV-cache spill, RAG retrieval). The skill is sizing and shaping storage so the data path never gates the accelerators, and proving it with a real dataloader rather than a spec sheet.

Core knowledge¶

The tiers¶

Local NVMe scratch: per-node, fastest, ephemeral. Dataset cache, intermediate state, checkpoint staging, spill.
Parallel / shared filesystem: the working set and checkpoints, mounted cluster-wide. Open/HPC: Lustre, IBM Storage Scale (GPFS), BeeGFS. AI-era commercial flash: WEKA, VAST Data, DDN (EXAScaler/Lustre, Infinia). These deliver the GB/s-per-node throughput and the high-IOPS metadata rate that AI needs. The B300 SuperPOD reference architecture sizes storage per scalable unit: a single SU aggregate read of 80 GB/s (Standard) up to 250 GB/s (Enhanced), with 4 GB/s per GPU for high-resolution image pipelines (BOM validation).
Object storage (S3-compatible): the durable lake and cold tier for datasets and archived checkpoints. Streamed via mountpoint-s3, fsspec, or tiered automatically by WEKA/VAST. Treat its latency and egress as nothing like local.

GPUDirect Storage (GDS)¶

DMA path from NVMe or the storage NIC directly into GPU memory, bypassing the CPU bounce buffer (the cuFile API, nvidia-fs module). Cuts latency and CPU overhead for data-heavy loads; the storage analogue of GPUDirect RDMA (GPU performance and health). Requires a supported filesystem and correct setup. Confirm it is actually engaged, not silently bypassed.

Checkpointing (the dominant write pattern at scale)¶

Asynchronous: overlap the checkpoint write with continued compute instead of blocking every rank.
Sharded / distributed: each rank writes its own shard via torch.distributed.checkpoint (DCP), so write bandwidth scales with ranks instead of funnelling through rank 0.
Tiered: checkpoint to local NVMe first, then drain to shared/object storage asynchronously.
Cadence: checkpoint often enough that lost work on a failure is cheaper than the restart, sized against fleet MTBF (reliability and RAS). At very large scale, in-memory or peer checkpointing (to a neighbour's RAM) reduces the storage hit.

Data loading (the silent GPU-starver)¶

The bottleneck is usually the CPU/IO feed, not the GPU. Levers: enough DataLoader workers, prefetch, pinned memory for fast H2D copies, and GPU-accelerated decode/augment via NVIDIA DALI.
Shard the dataset: pack millions of small files into tar/parquet/Arrow shards (WebDataset, Mosaic StreamingDataset). Millions of small files trigger a metadata storm that melts the parallel FS for every tenant.
Cache hot shards node-local; read-through caches (Alluxio, JuiceFS) front object stores for repeated epochs.

Don't-miss checklist¶

Size per-node storage bandwidth to exceed what a training step consumes; verify with fio and a real dataloader, not vendor specs.
Shard datasets; never train off millions of small files on a shared filesystem.
Async + sharded checkpoints; tie cadence to fleet MTBF (reliability and RAS).
Keep the storage fabric separate from the compute fabric (often NDR vs XDR, networking fabric).
Confirm GDS is actually engaged where the design assumes it.
Use pinned memory and DALI before blaming the GPU for low utilisation (performance tuning).

Failure modes¶

GPUs below 50% utilisation because the dataloader/IO cannot keep up: the most common and most invisible tax.
Metadata storms from small-file datasets crippling the parallel FS cluster-wide.
Synchronous checkpoints stalling all ranks every N steps; or a checkpoint write storm saturating the FS.
Resume that loses dataloader position or RNG state (distributed training).
Object-store latency assumed equal to local, surfacing as inference cold starts and RAG stalls.

Open questions & validation¶

Benchmark one parallel FS (Lustre or WEKA/VAST) with fio and a real dataloader: mount, stripe/layout, metadata behaviour under load.
GPUDirect Storage setup and a measured before/after.
DCP sharded async checkpointing at scale, with a verified deterministic resume (distributed training).

References¶

GPUDirect Storage: https://docs.nvidia.com/gpudirect-storage/index.html
PyTorch Distributed Checkpoint (DCP): https://pytorch.org/docs/stable/distributed.checkpoint.html
NVIDIA DALI (data loading): https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
WebDataset: https://github.com/webdataset/webdataset
Lustre: https://www.lustre.org/ · WEKA: https://www.weka.io/ · VAST: https://www.vastdata.com/

Related: Fabric · Performance · Training · Reliability · Optimization · Glossary