Markdown

GPUDirect storage (GDS)¶

Scope: the direct DMA path from NVMe/NVMe-oF/RDMA-NAS into GPU HBM that bypasses the CPU bounce buffer, via the cuFile API and the nvidia-fs kernel module; enabling, verifying engagement, compatible filesystems, and benchmarking with gdsio.

flowchart LR
  subgraph WITHOUT["Without GDS (staged DMA)"]
    S1["NVMe / RDMA NIC"] --> H1["Host bounce buffer (CPU RAM)"]
    H1 --> G1["GPU HBM"]
  end
  subgraph WITH["With GDS (direct path)"]
    S2["NVMe / RDMA NIC"] --> G2["GPU HBM"]
    CPU["CPU"] -. "orchestrates DMA only" .-> S2
  end

What it is¶

GDS lets the GPU initiate a DMA that moves data directly between a storage device (local NVMe, NVMe-oF) or an RDMA NIC and GPU HBM, without the extra copy through a host bounce buffer in CPU memory. On the conventional path the data lands in CPU RAM first, then a CUDA copy stages it into GPU memory; GDS removes that intermediate hop.¹ The CPU still configures and orchestrates the transfer. GDS removes the host memory copy, not the CPU's control role.²

Two pieces make it work:

cuFile: the user-space library (libcufile) exposing the I/O API. It registers GPU device buffers, integrates with POSIX file descriptors, and handles buffer alignment.³⁴
nvidia-fs: the kernel driver that orchestrates the DMA directly between the storage device or RDMA NIC and GPU memory.⁵

Host pinned memory is not used in the storage-to-GPU data path.⁵ GDS is the storage analogue of GPUDirect RDMA: GDS accelerates storage-to-GPU DMA, GPUDirect RDMA accelerates network-to-GPU DMA. Neither eliminates CPU orchestration; both remove the host bounce buffer.⁶

Why it matters¶

An idle GPU waiting on I/O is the most expensive idle in the building. The conventional staged copy burns CPU cycles and host memory bandwidth that scale with read throughput. The book's worked example: feeding 1,000 batches/s of 1 MB each is ~1,000 MB/s; doing that copy on the CPU easily consumes a few cores. With GDS the GPU pulls it directly from disk and frees those cores for preprocessing. At higher rates (or thousands of GPUs) the effect compounds.⁷

Reported uplift (validate on your own fabric, see caveat below): a VAST Data benchmark measured a ~20% read-throughput boost on an A100 and 30%+ on an H100, the larger H100 gain attributed to its higher NIC bandwidth and greater CPU burden on the staged path.⁸ The general rule: if the CPU was comfortably handling the copies, GDS mainly lowers CPU usage rather than raising throughput; if the CPU is saturated with memcpy, GDS helps a lot.⁹

Uplift varies by I/O size, queue depth, NIC generation, and filesystem implementation. Treat any quoted percentage as a starting hypothesis, not a guarantee, and measure on your workload.¹⁰

When it is needed (and when not)¶

Helps:

Large sequential reads of dataset shards (Arrow, TFRecord, Parquet, WebDataset tar). Training is overwhelmingly read-heavy and this is where most GDS gains are measured.¹¹
CPU-bound ingest pipelines where host memcpy is the bottleneck.⁹
Checkpoint writes over RDMA, but only when the filesystem supports RDMA writes for GDS.¹²

Does not help (or is unavailable):

CPU was never the bottleneck: expect lower CPU usage, not more throughput.⁹
FUSE / user-space filesystems cannot deliver a GDS path: GDS requires kernel-level filesystem integration with O_DIRECT semantics. Only GDS-enabled kernel clients or specifically integrated parallel filesystems provide direct transfers into GPU memory. DeepSeek's 3FS, for example, uses RDMA transfers rather than a GDS path for its FUSE client.¹³ See DeepSeek 3FS filesystem.
Misaligned, non-O_DIRECT access: modern releases tolerate non-O_DIRECT descriptors, but misalignment may incur extra copies or reduced performance, silently defeating the point.¹⁴

Note GDS is orthogonal to the cuda-checkpoint path: CUDA process checkpoints copy device memory into host memory first (no direct GPU-to-storage DMA), so GDS tuning does not apply there.¹⁵ See runbook: checkpoint recovery.

How: implement, integrate, maintain¶

Compatible storage stacks¶

GDS support depends on the filesystem and an RDMA-capable stack. As of the book's writing, supported stacks include local NVMe and NVMe-oF on XFS/EXT4 with O_DIRECT, NFS over RDMA, and select parallel filesystems integrating with nvidia-fs: BeeGFS, WekaFS, VAST, IBM Storage Scale, among others.¹⁶¹³ See storage & data platform.

Verify the stack before trusting it¶

GDS engages silently or falls back to compatibility mode silently. Always confirm. The gdscheck tool reports platform support; -p prints per-transport support plus IOMMU and PCIe ACS state.¹⁷

# Platform / driver / per-filesystem support
/usr/local/cuda/gds/tools/gdscheck -p

# Kernel driver must be loaded for a true GDS path
lsmod | grep nvidia_fs

For best GDS performance the troubleshooting guide requires IOMMU disabled on x86_64 and recommends PCIe ACS disabled; gdscheck -p surfaces both.¹⁷ Disabling ACS is covered in service: ACS disable.

Use the API correctly¶

Minimal synchronous read with cuFile. Signatures per the GDS cuFile API Reference.⁴

#include <fcntl.h>
#include <cufile.h>
#include <cuda_runtime.h>

// 1. Open the GDS driver once per process.
cuFileDriverOpen();

// 2. Open the file with O_DIRECT to enable direct DMA and bypass the page cache.
int fd = open("/mnt/data/shard.bin", O_RDONLY | O_DIRECT);

CUfileDescr_t descr = {0};
descr.handle.fd = fd;
descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

CUfileHandle_t fh;
cuFileHandleRegister(&fh, &descr);

// 3. Allocate and register the GPU destination buffer.
void *dev_ptr;
cudaMalloc(&dev_ptr, size);
cuFileBufRegister(dev_ptr, size, 0);

// 4. Direct DMA: storage -> GPU HBM, no host bounce buffer.
ssize_t n = cuFileRead(fh, dev_ptr, size, /*file_offset*/ 0, /*devPtr_offset*/ 0);

// 5. Teardown.
cuFileBufDeregister(dev_ptr);
cuFileHandleDeregister(fh);
cuFileDriverClose();

Use O_DIRECT when possible to enable direct DMA and bypass the OS page cache; modern releases can also operate on non-O_DIRECT descriptors, but misalignment may add copies.¹⁴ For overlap and pipelining on CUDA streams, use the async variants cuFileReadAsync / cuFileWriteAsync.¹⁸⁴ Most practitioners reach GDS through a storage vendor's plugin (WekaIO, DDN, VAST, Cloudian) or a framework, not raw cuFile.¹⁹

Benchmark with gdsio¶

gdsio ships at /usr/local/cuda/gds/tools/gdsio and compares disk-to-GPU throughput across transfer paths. The transfer selector -x chooses the path: -x 2 measures CPU-mediated transfers (host memory, async copies), -x 0 measures the GDS path. Keep -x consistent and every other flag identical when comparing.²⁰¹⁷

# Before: storage -> CPU memory only (CPU path, -x 2), read mode (-I 0)
/usr/local/cuda/gds/tools/gdsio \
    -f /mnt/data/large_file \
    -d 0 -w 4 -s 10G -i 1M -I 0 -x 2
# Total Throughput: 8.0 GB/s   Average Latency: 1.25 ms

# After: storage -> GPU memory via GDS (-x 0), same config
/usr/local/cuda/gds/tools/gdsio \
    -f /mnt/data/large_file \
    -d 0 -w 4 -s 10G -i 1M -I 0 -x 0
# Total Throughput: 9.6 GB/s   Average Latency: 1.00 ms

Flags: -f file path, -d GPU device index, -w worker threads, -s dataset size, -i I/O request size, -I mode (0 = read), -x transfer selector.²⁰¹⁷ The book's run shows GDS raising read throughput 8.0 -> 9.6 GB/s (+20%) and cutting latency 1.25 -> 1.00 ms (-20%) on this configuration.²¹ Run gdsio -h for the full flag set on your installed version.¹⁷

The numbers above are reference values transcribed from the book's example and NVIDIA docs. They are not hardware-tested in this KB. Reproduce them on your own NVMe, NIC, and filesystem before quoting them.

Maintain / observe¶

Trace the live path rather than assuming it. Nsight Systems with --trace=gds captures cuFile API activity on the timeline; enable cuFile static tracepoints via /etc/cufile.json to see cuFile events. Kernel-mode counters for the NVMe peer-to-peer DMA path are not exposed in Nsight Systems and may be unavailable for some GDS stacks.²² Pair with host I/O tools (iostat, iotop, nvme-cli) and DCGM to distinguish GPU starvation from device saturation.²³ See Nsight profiling workflow and fabric bring-up & benchmarking.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 5: "GPU-Based Storage I/O Optimizations" — sections "Using NVIDIA GDS", "Measuring GDS with gdsio", "DeepSeek's Fire-Flyer File System".
NVIDIA, GDS cuFile API Reference Guide — https://docs.nvidia.com/gpudirect-storage/api-reference-guide/index.html
NVIDIA, GPUDirect Storage Overview Guide — https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html
NVIDIA, GPUDirect Storage Installation and Troubleshooting Guide (gdscheck, gdsio, cufile.json, IOMMU/ACS) — https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html

Fregly, Ch. 5, "Using NVIDIA GDS": "Normally, when a GPU wants to read data from an NVMe SSD, the data first goes from SSD to CPU memory. Then a CUDA call copies the data from CPU memory to GPU memory." With GDS the GPU initiates a DMA against the SSD or NIC to move data into its own HBM. ↩
Fregly, Ch. 5: "GDS creates a direct DMA path that bypasses host memory bounce buffers... (Note: the CPU still configures and orchestrates the I/O.)" ↩
Fregly, Ch. 5: "You can use CUDA's cuFile library to read files through GDS. cuFile supports features like automatic buffer alignment and integration with common filesystems." ↩
NVIDIA GDS cuFile API Reference Guide — cuFileDriverOpen, cuFileHandleRegister, cuFileBufRegister, cuFileRead, cuFileReadAsync, and teardown signatures. ↩↩↩
Fregly, Ch. 5: "Host pinned memory is not used in the storage-to-GPU data path. cuFile registers GPU device buffers, and the nvidia-fs kernel driver orchestrates DMA directly between the storage device or RDMA NIC and GPU memory." ↩↩
Fregly, Ch. 5: "GDS complements GPUDirect RDMA since GDS accelerates storage-to-GPU DMA, while GPUDirect RDMA accelerates network-to-GPU DMA. Neither eliminates CPU orchestration. Both remove the host memory bounce buffer." ↩
Fregly, Ch. 5: 1 MB batches at 1,000 batches/s ~ 1,000 MB/s; the CPU copy "would easily consume a few cores. With GDS, the GPU would pull that 1,000 MB/s directly from disk and free up the CPU." ↩
Fregly, Ch. 5: VAST measured "a 20% read-throughput boost on an NVIDIA Ampere A100 GPU and a 30%+ increase on a Hopper H100 GPU due to its higher NIC bandwidth and greater CPU burden." ↩
Fregly, Ch. 5: "If your CPU was easily handling the data transfers, GDS might not change throughput much. However, it will lower CPU usage... if the CPU is saturated with many memcpy operations, then GDS will help a lot." ↩↩↩
Fregly, Ch. 5: "Validate on your workload and fabric, as uplifts vary by IO size, queue depth, NIC generation, filesystem implementation, etc." ↩
Fregly, Ch. 5: "Since training workloads are overwhelmingly read-heavy, most GDS performance gains are evaluated when reading data from storage." ↩
Fregly, Ch. 5: "For RDMA-accelerated writes, the filesystem must support RDMA writes for GDS." WekaFS ships GDS-aware plugins for both read and write over RDMA. ↩
Fregly, Ch. 5, "DeepSeek's Fire-Flyer File System": "If a file system is implemented using FUSE in user space, it will not be able to deliver a GDS path because GDS requires kernel-level filesystem integration with O_DIRECT semantics... use a GDS-enabled kernel filesystem client such as NVMe, NVMe-oF, BeeGFS, WekaFS, IBM Storage Scale, or VAST." ↩↩
Fregly, Ch. 5: "Use O_DIRECT when possible to enable direct DMA and bypass the OS page cache. With modern GDS releases, cuFile can also operate on non-O_DIRECT file descriptors, but misalignment may incur extra copies or reduced performance." ↩↩
Fregly, Ch. 5, "Checkpointing GPU State with cuda-checkpoint": "Unlike data ingestion with GDS, the checkpoint path does not DMA directly from GPU memory to storage. Instead, the device memory image is first brought into host memory by the driver during suspend." ↩
Fregly, Ch. 5: "supported stacks include local NVMe and NVMe-oF on XFS/EXT4 with O_DIRECT, NFS over RDMA, and select parallel filesystems such as BeeGFS, WekaFS, VAST, IBM Storage Scale, and others that integrate with nvidia-fs." ↩
NVIDIA GPUDirect Storage Installation and Troubleshooting Guide: gdscheck -p reports per-transport support, IOMMU, and PCIe ACS state; IOMMU must be disabled on x86_64 and PCIe ACS should be disabled for best performance; gdsio lives at /usr/local/cuda-x.y/gds/tools/gdsio (gdsio -h for flags); /etc/cufile.json configures libcufile. ↩↩↩↩↩
Fregly, Ch. 5: "You can also use cuFile's asynchronous APIs, such as cuFileReadAsync and cuFileWriteAsync to integrate storage I/O on CUDA streams... for overlap and pipelining." ↩
Fregly, Ch. 5: "Many storage vendors like WekaIO, DDN, VAST, Cloudian, etc., have released GDS-aware solutions or plugins so their systems can deliver data using RDMA directly into GPU memory." ↩
Fregly, Ch. 5, "Measuring GDS with gdsio": tool installed under /usr/local/cuda/gds/tools; "For gdsio, -x 2 measures CPU-mediated transfers, and -x 0 measures the GDS path"; commands use -d 0 -w 4 -s 10G -i 1M -I 0. ↩↩
Fregly, Ch. 5, Table 5-1: Storage->CPU (without GDS) 8.0 GB/s / 1.25 ms; Storage->GPU (with GDS) 9.6 GB/s (+20%) / 1.00 ms (-20%). ↩
Fregly, Ch. 5, "Monitoring Storage I/O": "Use the Nsight Systems option --trace=gds. This will capture cuFile API activity... enable GDS cuFile static tracepoints using /etc/cufile.json... Kernel-mode counters for NVMe peer-to-peer DMA paths are not exposed in Nsight Systems and may not be available for all GDS stacks." ↩
Fregly, Ch. 5: monitoring tools include "Linux iostat, iotop, nvme-cli, perf, and eBPF" plus DCGM for GPU I/O statistics. ↩