GPU cluster storage and data platform¶
Scope: feeding the GPUs. Parallel filesystems, object storage, local scratch, checkpoint strategy, GPUDirect Storage, and the data-loading path that keeps GPUs busy rather than starved. The storage fabric is one of the four SuperPOD fabrics (networking fabric); this page is what runs over it.
flowchart LR
DATA["Dataset shards"] --> CACHE["Local NVMe cache"]
CACHE --> LOADER["Dataloader or DALI"]
LOADER --> GPU["GPU memory"]
GPU --> CKPT["Sharded checkpoints"]
CKPT --> STORE["Parallel filesystem or object store"]
Overview¶
An idle GPU waiting on I/O is the most expensive idle in the building. Storage is therefore a GPU performance problem, not a back-office one. Two load patterns dominate: training (huge sequential reads of dataset shards plus bursty checkpoint writes) and inference (model load, KV-cache spill, RAG retrieval). The skill is sizing and shaping storage so the data path never gates the accelerators, and proving it with a real dataloader rather than a spec sheet.
Core knowledge¶
The tiers¶
- Local NVMe scratch: per-node, fastest, ephemeral. Dataset cache, intermediate state, checkpoint staging, spill.
- Parallel / shared filesystem: the working set and checkpoints, mounted cluster-wide. Open/HPC: Lustre, IBM Storage Scale (GPFS), BeeGFS. AI-era commercial flash: WEKA, VAST Data, DDN (EXAScaler/Lustre, Infinia). These deliver the GB/s-per-node throughput and the high-IOPS metadata rate that AI needs. The B300 SuperPOD reference architecture sizes storage per scalable unit: a single SU aggregate read of 80 GB/s (Standard) up to 250 GB/s (Enhanced), with 4 GB/s per GPU for high-resolution image pipelines (BOM validation).
- Object storage (S3-compatible): the durable lake and cold tier for datasets and archived checkpoints. Streamed via
mountpoint-s3,fsspec, or tiered automatically by WEKA/VAST. Treat its latency and egress as nothing like local.
GPUDirect Storage (GDS)¶
- DMA path from NVMe or the storage NIC directly into GPU memory, bypassing the CPU bounce buffer (the
cuFileAPI,nvidia-fsmodule). Cuts latency and CPU overhead for data-heavy loads; the storage analogue of GPUDirect RDMA (GPU performance and health). Requires a supported filesystem and correct setup. Confirm it is actually engaged, not silently bypassed.
Checkpointing (the dominant write pattern at scale)¶
- Asynchronous: overlap the checkpoint write with continued compute instead of blocking every rank.
- Sharded / distributed: each rank writes its own shard via
torch.distributed.checkpoint(DCP), so write bandwidth scales with ranks instead of funnelling through rank 0. - Tiered: checkpoint to local NVMe first, then drain to shared/object storage asynchronously.
- Cadence: checkpoint often enough that lost work on a failure is cheaper than the restart, sized against fleet MTBF (reliability and RAS). At very large scale, in-memory or peer checkpointing (to a neighbour's RAM) reduces the storage hit.
Data loading (the silent GPU-starver)¶
- The bottleneck is usually the CPU/IO feed, not the GPU. Levers: enough
DataLoaderworkers, prefetch, pinned memory for fast H2D copies, and GPU-accelerated decode/augment via NVIDIA DALI. - Shard the dataset: pack millions of small files into tar/parquet/Arrow shards (WebDataset, Mosaic StreamingDataset). Millions of small files trigger a metadata storm that melts the parallel FS for every tenant.
- Cache hot shards node-local; read-through caches (Alluxio, JuiceFS) front object stores for repeated epochs.
Don't-miss checklist¶
- Size per-node storage bandwidth to exceed what a training step consumes; verify with
fioand a real dataloader, not vendor specs. - Shard datasets; never train off millions of small files on a shared filesystem.
- Async + sharded checkpoints; tie cadence to fleet MTBF (reliability and RAS).
- Keep the storage fabric separate from the compute fabric (often NDR vs XDR, networking fabric).
- Confirm GDS is actually engaged where the design assumes it.
- Use pinned memory and DALI before blaming the GPU for low utilisation (performance tuning).
Failure modes¶
- GPUs below 50% utilisation because the dataloader/IO cannot keep up: the most common and most invisible tax.
- Metadata storms from small-file datasets crippling the parallel FS cluster-wide.
- Synchronous checkpoints stalling all ranks every N steps; or a checkpoint write storm saturating the FS.
- Resume that loses dataloader position or RNG state (distributed training).
- Object-store latency assumed equal to local, surfacing as inference cold starts and RAG stalls.
Open questions & validation¶
- Benchmark one parallel FS (Lustre or WEKA/VAST) with
fioand a real dataloader: mount, stripe/layout, metadata behaviour under load. - GPUDirect Storage setup and a measured before/after.
- DCP sharded async checkpointing at scale, with a verified deterministic resume (distributed training).
References¶
- GPUDirect Storage: https://docs.nvidia.com/gpudirect-storage/index.html
- PyTorch Distributed Checkpoint (DCP): https://pytorch.org/docs/stable/distributed.checkpoint.html
- NVIDIA DALI (data loading): https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
- WebDataset: https://github.com/webdataset/webdataset
- Lustre: https://www.lustre.org/ · WEKA: https://www.weka.io/ · VAST: https://www.vastdata.com/
Related: Fabric · Performance · Training · Reliability · Optimization · Glossary