
GPU decompression and the nvCOMP library

Scope: trading cheap GPU cycles for scarce storage/PCIe bandwidth by storing data compressed on disk and decompressing it on the GPU. Covers the Blackwell hardware Decompression Engine (DE), the nvCOMP library (LZ4, Snappy, Deflate, zstd, GDeflate), pairing decompression with GDS, and deciding when compressed-on-disk plus GPU-decompress beats raw I/O.

flowchart LR
  subgraph RAW["Raw on disk"]
    R1["Uncompressed shard"] --> R2["NVMe / NIC"] --> R3["GPU HBM"]
  end
  subgraph COMP["Compressed on disk"]
    C1["Compressed shard (smaller)"] --> C2["NVMe / NIC"] --> C3["GPU HBM"]
    C3 --> C4["Decompression Engine or SM kernels"]
    C4 --> C5["Uncompressed bytes in HBM"]
  end

What it is

GPU decompression stores datasets compressed on the filesystem or object store and unpacks them on the GPU on the fly. It saves I/O bandwidth at the cost of some extra compute cycles, a good trade when I/O is the bottleneck and the GPUs (or CPUs) would otherwise be idle.¹ The book frames this as "another way to offload arithmetic operations from the CPU to the GPU," potentially overlapping data decoding with gradient computation during training.²

Two distinct mechanisms deliver it:

The Blackwell Decompression Engine (DE): a fixed-function hardware block on Blackwell GPUs (B200, B300, GB200, GB300) that accelerates Snappy, LZ4, and Deflate/GZip decompression.¹⁰ The book describes it as an "on-die Decompression Engine supporting formats such as LZ4, Snappy, and Deflate."³ It is integrated with the copy engine, so compressed data arriving over PCIe or NVLink-C2C can be decompressed in transit rather than requiring a host-to-device copy followed by a separate software decompression pass.¹⁰ NVIDIA quotes peak DE throughput up to 600 GB/s.¹¹
nvCOMP: NVIDIA's GPU library for fast lossless compression and decompression. It exposes LZ4, Snappy, Deflate, zstd, and the GPU-friendly GDeflate format (plus Cascaded, Bitcomp, and ANS). On Blackwell it transparently dispatches the DE-eligible formats to the hardware engine and falls back to its SM-based kernels elsewhere.¹²,¹¹

GDeflate closely matches the DEFLATE bitstream but is restructured for high-throughput GPU decode, and nvCOMP exposes it through its own manager. Note that the hardware DE accelerates Snappy, LZ4, and Deflate, not GDeflate.¹⁴,¹¹

Why it matters

A Blackwell GPU reads HBM at up to ~8 TB/s but pulls from storage across a far narrower pipe: PCIe Gen5 x16 is ~64 GB/s per direction, Gen6 x16 ~128 GB/s, NVLink-C2C up to ~900 GB/s bidirectional.⁹ Storage and the host link, not the SMs, are usually what starves the GPU. If a 2:1 compression ratio halves the bytes crossing that pipe, effective ingest throughput roughly doubles for the cost of decompression cycles the GPU has to spare.

Doing the decompression on the GPU rather than the CPU keeps the host CPU cores (Grace cores on GB200/GB300) free for tokenization, augmentation, and collation, and lets decode overlap with compute kernels.² On Blackwell the DE moves the decode off the SMs entirely and onto the fixed-function block, so it does not even compete with compute kernels for SM occupancy.¹⁰ On Grace-Blackwell systems, the high-bandwidth NVLink-C2C link (up to 900 GB/s bidirectional) keeps any CPU-assisted staging from becoming the bottleneck.⁴

The trade only pays off if decompression does not replace I/O as the bottleneck. Verify your compression ratio and confirm the decode step is not the new long pole; otherwise the extra compression compute is not worth it.⁵
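The break-even condition can be sketched as a small model. This is a back-of-the-envelope sketch under stated assumptions, not a benchmark: it assumes uncompressed bytes arrive at the slower of the link rate multiplied by the compression ratio and the decoder's output rate, and it ignores overlap and launch overheads. The function names and the decode rates in the usage note are hypothetical.

```cpp
#include <algorithm>

// Assumed model (not a measurement): uncompressed bytes reach HBM at the
// slower of
//   (a) the link rate multiplied by the compression ratio, and
//   (b) the decompressor's output rate.
// All rates are in GB/s; compression_ratio is uncompressed / compressed size.
constexpr double effective_ingest_gbps(double link_gbps,
                                       double compression_ratio,
                                       double decode_out_gbps) {
    return std::min(link_gbps * compression_ratio, decode_out_gbps);
}

// True when decode, rather than I/O, limits the pipeline in this model.
constexpr bool decode_is_bottleneck(double link_gbps,
                                    double compression_ratio,
                                    double decode_out_gbps) {
    return decode_out_gbps < link_gbps * compression_ratio;
}
```

With the PCIe Gen5 figure of 64 GB/s and a 2:1 ratio, `effective_ingest_gbps(64.0, 2.0, 600.0)` evaluates to 128 GB/s. A hypothetical decoder limited to 100 GB/s caps the result at 100 GB/s, and `decode_is_bottleneck` then returns true, which is the case the paragraph above warns about.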

When it is needed (and when not)

Use compressed-on-disk + GPU-decompress when:

The workload is I/O bound and GPUs/CPUs have idle cycles, the canonical case for the trade.¹
The data is already stored compressed: JPEG image batches, Arrow/Parquet columnar text, or batches already encoded with Snappy/LZ4/Deflate.⁶
You are on Blackwell and can store batches in a DE-supported format (Snappy/LZ4/Deflate) to get in-pipeline hardware decode that frees SMs for compute kernels.³,¹⁰

Reconsider when:

The pipeline is compute bound: you are spending GPU cycles you actually need, and adding decode work makes it worse.
Decompression becomes the new bottleneck: measure decode throughput against your target ingest rate before committing.⁵
The chunking is a poor fit for the hardware: on B200 the DE has a maximum chunk length of 4 MiB, and chunks larger than that fall back to the slower SM path.¹¹

For image decode specifically, nvJPEG decodes JPEGs directly on the GPU and is the right tool there; nvCOMP/DE target generic byte-stream lossless codecs.⁷ This page pairs with data loading pipeline tuning and the GPU-side preprocessing in NVIDIA DALI pipeline.

How: implement, integrate, maintain

Format choice

Favor the DE-accelerated formats for I/O-bound workloads on Blackwell: LZ4 (fast, byte-level, no entropy stage), Snappy (similar, common for tabular data), and Deflate (Huffman + LZ77, for compatibility with existing gzip/Deflate datasets).¹³,³ zstd (Huffman + LZ77 + ANS) gives higher ratios but is an SM-path codec in nvCOMP, not a DE-accelerated one. GDeflate keeps DEFLATE's ratio while being restructured for fast GPU decode; use it when you want DEFLATE-class ratios with nvCOMP's SM kernels.¹⁴

Engaging the hardware Decompression Engine

On Blackwell, nvCOMP uses the DE automatically when the output buffers are allocated from DE-capable memory; otherwise it falls back to SM-based decode with no code change.¹¹ The allocation must carry the hardware-decompress usage flag:

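// Driver API (requires <cuda.h>): tag the allocation as DE-eligible
CUmemAllocationProp prop = {};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id   = device_id;
prop.allocFlags.usage = CU_MEM_CREATE_USAGE_HW_DECOMPRESS;   // DE flag

// cuMemCreate needs a size that is a multiple of the allocation granularity
size_t gran = 0;
cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t padded = (size + gran - 1) / gran * gran;

CUmemGenericAllocationHandle handle;
CUresult rc = cuMemCreate(&handle, padded, &prop, /*flags=*/0);
// Check rc == CUDA_SUCCESS, then reserve, map, and grant access
// (cuMemAddressReserve, cuMemMap, cuMemSetAccess) before using the memory.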

For the stream-ordered pool allocator, set the equivalent pool usage flag (cudaMemPoolCreateUsageHwDecompress) and allocate with cudaMallocFromPoolAsync.¹¹ A plain cudaMalloc output buffer routes nvCOMP to the SM fallback.¹¹ See CUDA stream-ordered allocator.

Query the DE's maximum per-chunk length so you can size batches to stay on the hardware path:

int max_len = 0;
cuDeviceGetAttribute(&max_len,
    CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH, device_id);
// On B200 this is 4 MiB; chunks above it fall back to SM decode.
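Sizing chunks against the queried limit is simple integer arithmetic, sketched below. The helper names are hypothetical and are not part of CUDA or nvCOMP; `max_len` stands in for the value returned by the attribute query. The sketch assumes the limit applies to the per-chunk length you choose when the data is compressed.

```cpp
#include <cstddef>

// Number of chunks needed so that no chunk exceeds max_len bytes.
// Returns 0 when max_len is 0 (assumed here to mean no usable limit).
constexpr std::size_t chunk_count(std::size_t total_bytes, std::size_t max_len) {
    return max_len == 0 ? 0 : (total_bytes + max_len - 1) / max_len;
}

// Length in bytes of chunk i (0-based). Every chunk is max_len bytes except
// possibly the last one; indices past the end yield 0.
constexpr std::size_t chunk_length(std::size_t total_bytes,
                                   std::size_t max_len,
                                   std::size_t i) {
    const std::size_t offset = i * max_len;
    if (offset >= total_bytes) return 0;
    const std::size_t rest = total_bytes - offset;
    return rest < max_len ? rest : max_len;
}
```

For a 10 MiB buffer and a 4 MiB limit, `chunk_count` returns 3 and the chunk lengths are 4 MiB, 4 MiB, and 2 MiB, so every chunk stays at or below the limit.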

At the low-level batched C API, the backend is selectable per call via the backend field of nvcompBatched<Format>DecompressOpts_t, passed to nvcompBatched<Format>DecompressAsync:¹⁰

NVCOMP_DECOMPRESS_BACKEND_DEFAULT: auto-detect the DE, fall back to CUDA SM kernels if unavailable.
NVCOMP_DECOMPRESS_BACKEND_HARDWARE: DE only; errors if the DE is unavailable.

The sort_before_hw_decompress option helps when a batch mixes widely varying chunk sizes.¹⁰

High-level manager API

For most pipelines the high-level C++ manager API is simpler than the batched calls. Each format has a manager deriving from nvcompManagerBase, configured with CompressionConfig / DecompressionConfig, and a create_manager() factory reconstructs the right manager from a compressed buffer's header:¹⁵

#include <nvcomp/lz4.hpp>
#include <nvcomp/nvcompManagerFactory.hpp>   // declares create_manager()
using namespace nvcomp;

// Compress: build an LZ4 manager bound to a CUDA stream (4 MiB chunks)
LZ4Manager mgr{/*uncomp_chunk_size=*/1 << 22, nvcompBatchedLZ4DefaultOpts, stream};
CompressionConfig cfg = mgr.configure_compression(input_bytes);
// d_comp must hold at least cfg.max_compressed_buffer_size bytes
mgr.compress(d_uncomp, d_comp, cfg);

// Decompress: reconstruct the manager from the compressed buffer's header
auto dmgr = create_manager(d_comp, stream);
DecompressionConfig dcfg = dmgr->configure_decompression(d_comp);
// d_out must hold at least dcfg.decomp_data_size bytes
dmgr->decompress(d_out, d_comp, dcfg);   // uses DE when d_out is DE-eligible

Per-format managers include LZ4Manager, SnappyManager, DeflateManager, ZstdManager, GdeflateManager, plus CascadedManager, BitcompManager, and ANSManager; each takes its nvcompBatched<Format>{Compress,Decompress}Opts_t option struct.¹⁵

The exact nvcomp:: header layout, manager constructor signatures, and opts struct fields are version-dependent, so treat the snippets above as illustrations of the documented API shape. Check them against the nvCOMP release and headers you build against before relying on them; they have not been hardware-tested here.¹⁵

Pairing with GDS

Compression and GDS compose: store shards compressed, pull them straight into HBM with cuFileRead (no host bounce buffer), then decompress them on the GPU into a separate output buffer, DE-accelerated on Blackwell. This stacks the I/O-bandwidth win of compression on top of GDS's removal of the host copy.¹ See GPUDirect Storage (GDS) for enabling and verifying the direct path, and storage & data platform for the upstream layout. Keep O_DIRECT alignment intact on the GDS read so the direct path is not silently defeated before decompression even runs.
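The alignment point can be checked before a read is issued. The sketch below is a minimal illustration, assuming a 4 KiB alignment requirement for the file offset and read size; that value is an assumption, so substitute whatever your filesystem and GDS configuration actually require. The helper names are hypothetical and are not part of the cuFile API.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed alignment for O_DIRECT-style reads. Adjust to your environment.
constexpr std::size_t kAssumedAlign = 4096;

// True when value is a multiple of align.
constexpr bool is_aligned(std::uint64_t value, std::size_t align = kAssumedAlign) {
    return align != 0 && value % align == 0;
}

// Round a shard size up to the next aligned boundary, for example when
// padding compressed shards at write time so later reads stay aligned.
constexpr std::uint64_t align_up(std::uint64_t value, std::size_t align = kAssumedAlign) {
    return (value + align - 1) / align * align;
}

// A read is treated as aligned here only if both its file offset and its
// size are aligned.
constexpr bool read_is_aligned(std::uint64_t file_offset, std::uint64_t size,
                               std::size_t align = kAssumedAlign) {
    return is_aligned(file_offset, align) && is_aligned(size, align);
}
```

Because compressed shards rarely end on an aligned boundary, padding each shard to `align_up(compressed_size)` at write time and recording the true compressed size in an index is one way to keep every read aligned.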

Maintain and verify

Verify the ratio at every stage. Many pipelines already compress; confirm the compression is applied end to end and the ratio justifies the decode cost.⁸
Measure decode vs. I/O. Profile with Nsight Systems to confirm decompression overlaps compute and has not become the new bottleneck.⁵
Confirm the hardware path engaged. On Blackwell, a missing usage flag or an oversized chunk silently routes to the SM fallback; check allocation flags and keep chunks within the queried CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH.¹¹
Pin formats to the hardware. For DE acceleration choose Snappy/LZ4/Deflate; reserve zstd for cases where its higher ratio outweighs running on SM kernels, and GDeflate for cases where you want DEFLATE-class ratios from the SM path.¹⁰,¹⁴
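The checks in this list can be collected into one predicate for a pipeline's self-test. This is a sketch of the conditions described on this page, not an nvCOMP or CUDA API: the type and function names are hypothetical, and a queried maximum length of 0 is assumed to mean that no DE is available.

```cpp
#include <cstddef>

enum class Codec { LZ4, Snappy, Deflate, Zstd, GDeflate };

struct DecodeRequest {
    Codec codec;
    bool output_has_hw_decompress_flag;  // output buffer carries the DE usage flag
    std::size_t largest_chunk_len;       // largest chunk in the batch, in bytes
};

// Only Snappy, LZ4, and Deflate are DE-accelerated according to this page.
constexpr bool codec_is_de_eligible(Codec c) {
    return c == Codec::LZ4 || c == Codec::Snappy || c == Codec::Deflate;
}

// max_len is the value queried from
// CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH.
constexpr bool expect_hardware_path(const DecodeRequest& r, std::size_t max_len) {
    return max_len > 0
        && codec_is_de_eligible(r.codec)
        && r.output_has_hw_decompress_flag
        && r.largest_chunk_len <= max_len;
}
```

A predicate like this only states what the configuration should produce. Confirming that the hardware path actually engaged still requires profiling.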

This sits within the broader Blackwell platform ingest story and the performance optimization loop of measure-tune-remeasure.

References

Book: Chris Fregly, AI Systems Performance Engineering (O'Reilly). Chapter 5, "GPU-Based Storage I/O Optimizations" (Tuning, Replicating, and Compressing Data); Chapter 2, "AI System Hardware Overview" (Blackwell GPU, NVLink-C2C, PCIe bandwidth).

Official documentation:

nvCOMP Decompression Engine FAQ — https://docs.nvidia.com/cuda/nvcomp/decompression_engine_faq.html
"Speeding Up Data Decompression with nvCOMP and the NVIDIA Blackwell Decompression Engine" (NVIDIA Technical Blog) — https://developer.nvidia.com/blog/speeding-up-data-decompression-with-nvcomp-and-the-nvidia-blackwell-decompression-engine/
nvCOMP product page (supported formats) — https://developer.nvidia.com/nvcomp
nvCOMP C++ API — https://docs.nvidia.com/cuda/nvcomp/cpp_api.html
nvCOMP GDeflate — https://docs.nvidia.com/cuda/nvcomp/gdeflate.html

1. Fregly, Ch. 5: "store data compressed on the filesystem or object store — and decompress them on the fly ... This can save I/O bandwidth at the cost of some extra CPU or GPU cycles. However, if I/O is the bottleneck and the CPUs and GPUs are idle, this is a reasonable trade-off."
2. Fregly, Ch. 5: "This is another way to offload arithmetic operations from the CPU to the GPU — and possibly overlap data decoding with gradient computations during training."
3. Fregly, Ch. 5: "Modern GPUs add an on-die Decompression Engine supporting formats such as LZ4, Snappy, and Deflate ... If you store compressed batches on disk, Blackwell GPUs can decompress them in-pipeline using the Decompression Engine. This frees SMs to run higher-value tasks such as compute kernels."
4. Fregly, Ch. 5: "because of the high-bandwidth CPU-to-GPU NVLink-C2C interconnect (up to 900 GB/s bidirectional bandwidth), you can prevent CPU-assisted stages from becoming bottlenecks."
5. Fregly, Ch. 5: "The key is still to make sure that the decompression time does not replace I/O as the bottleneck, otherwise it likely isn't worth the extra compression computations."
6. Fregly, Ch. 5: "Examples include images (JPEGs) and compressed text (Arrow and Parquet)."
7. Fregly, Ch. 5: "Libraries like nvJPEG can decode images on GPU."
8. Fregly, Ch. 5: "you'll just want to verify the compression ratio and make sure you're using it at every step in the pipeline. The key is to find a balance such that the decompression step doesn't become the bottleneck."
9. Fregly, Ch. 2: "PCIe Gen5 x16 (Blackwell B200) is about 64 GB/s per direction, and PCIe Gen6 x16 (Blackwell Ultra B300) is about 128 GB/s per direction"; B200 HBM3e aggregate bandwidth "up to roughly 8 TB/s per GPU."
10. NVIDIA, nvCOMP Decompression Engine FAQ: the DE is a fixed-function hardware block accelerating Snappy, LZ4, and Deflate (including GZip) on B200/B300/GB200/GB300; integrated with the copy engine; backend selectable via backend in nvcompBatched<Format>DecompressOpts_t (NVCOMP_DECOMPRESS_BACKEND_DEFAULT, NVCOMP_DECOMPRESS_BACKEND_HARDWARE); sort_before_hw_decompress field. https://docs.nvidia.com/cuda/nvcomp/decompression_engine_faq.html
11. NVIDIA Technical Blog, "Speeding Up Data Decompression with nvCOMP and the NVIDIA Blackwell Decompression Engine": DE up to 600 GB/s; engaged via CU_MEM_CREATE_USAGE_HW_DECOMPRESS (cuMemCreate) or cudaMemPoolCreateUsageHwDecompress (cudaMallocFromPoolAsync); plain cudaMalloc routes to SM fallback; B200 max chunk 4 MiB, queried via CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH. https://developer.nvidia.com/blog/speeding-up-data-decompression-with-nvcomp-and-the-nvidia-blackwell-decompression-engine/
12. NVIDIA, nvCOMP product page: GPU lossless compression/decompression supporting Snappy, zstd, Deflate, LZ4, plus GPU-optimized GDeflate, Bitcomp, gANS, Cascaded in one library. https://developer.nvidia.com/nvcomp
13. NVIDIA, nvCOMP: LZ4 (byte-level, no entropy), Snappy (byte-level, tabular), Deflate (Huffman + LZ77), zstd (Huffman + LZ77 + ANS). https://developer.nvidia.com/nvcomp
14. NVIDIA, nvCOMP GDeflate: closely matches the DEFLATE format for efficient GPU decompression while maintaining the same compression ratio (with minor end-effect caveats). https://docs.nvidia.com/cuda/nvcomp/gdeflate.html
15. NVIDIA, nvCOMP C++ API: high-level managers (LZ4Manager, SnappyManager, DeflateManager, ZstdManager, GdeflateManager, CascadedManager, BitcompManager, ANSManager) deriving from nvcompManagerBase; CompressionConfig/DecompressionConfig; create_manager() factory; per-format nvcompBatched<Format>{Compress,Decompress}Opts_t. https://docs.nvidia.com/cuda/nvcomp/cpp_api.html