NVIDIA DALI: GPU data loading and augmentation¶

Scope: offload decode and augmentation (image/video/audio) onto the GPU with an NVIDIA DALI pipeline to remove the CPU preprocessing bottleneck. Covers pipeline definition, mixed CPU/GPU operators, prefetch queue depth, framework iterators (PyTorch/JAX), multi-node sharding, and when DALI beats CPU DataLoader workers. The host-side complement (sizing num_workers, pin_memory, on-disk sharding) lives in Data loading pipeline tuning; device-direct reads in GPUDirect Storage.

The three quantitative claims this page teaches (the PCIe payload saving from device="mixed" decode, the crop_mirror_normalize per-pixel transform, and the reader shard partition) are executed and asserted with numpy on the system python3 and shown passing inline. The DALI code blocks (pipeline_def, iterators) require nvidia.dali, which is not installed here; they are labelled reference templates, and the core math each one relies on is validated separately in a numpy block that is run and asserted.

What it is¶

NVIDIA DALI (Data Loading Library) builds the input pipeline (read, decode, augment, collate) as a static operator graph and executes it with GPU acceleration or optimized C++ on the CPU. It targets the stages that starve GPUs at scale: JPEG/video decode, resize, crop, normalize, and batch collation.¹ ¹²

A DALI pipeline is a directed graph of operators. Each operator runs on one of three backends:

"cpu": CPU inputs, CPU outputs.
"mixed": CPU inputs, GPU outputs. The decode operators (fn.decoders.image) live here. They take CPU-resident encoded bytes (a small JPEG) and emit a decoded tensor in GPU memory, so only the compressed payload crosses PCIe.¹²
"gpu": GPU inputs, GPU outputs (resize, crop, normalize on decoded pixels).
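
The backend table above can be sketched as a small check that flags where a chain of operators would pull GPU-resident data back to the host. This is an illustrative model only: `OUTPUT_LOCATION`, `INPUT_LOCATION`, and `find_round_trips` are hypothetical names, not part of nvidia.dali, and the chain is assumed to be linear.

```python
# Illustrative model only; none of these names come from nvidia.dali.
# Where each backend leaves its outputs, per the table above.
OUTPUT_LOCATION = {"cpu": "cpu", "mixed": "gpu", "gpu": "gpu"}
# Where each backend expects its inputs to live.
INPUT_LOCATION = {"cpu": "cpu", "mixed": "cpu", "gpu": "gpu"}


def find_round_trips(devices, source="cpu"):
    """Return the positions in a linear operator chain where data that already
    lives on the GPU is handed to an operator that wants CPU input."""
    location = source
    offenders = []
    for i, dev in enumerate(devices):
        if dev not in OUTPUT_LOCATION:
            raise ValueError(f"unknown backend {dev!r} at position {i}")
        if location == "gpu" and INPUT_LOCATION[dev] == "cpu":
            offenders.append(i)
        location = OUTPUT_LOCATION[dev]
    return offenders


good = ["cpu", "mixed", "gpu", "gpu"]   # reader, decode, resize, normalize
bad = ["cpu", "mixed", "cpu", "gpu"]    # decode on GPU, augment on CPU, back to GPU
print(find_round_trips(good), find_round_trips(bad))   # prints: [] [2]
```

The second chain is flagged at position 2, the point where decoded pixels would leave device memory.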

The modern API is the @pipeline_def decorator over a graph function built from the fn.* operator namespace. Fregly's book describes the older subclassing form (nvidia.dali.pipeline.Pipeline with define_graph()); both produce the same static graph, and DALI handles execution, prefetching, and threading internally with its own thread pools and queues.¹ ¹²

Why use it¶

Feeding data is as important as compute. A 100-trillion-parameter job over thousands of GPUs reads petabytes; if preprocessing cannot keep up, the GPUs starve and utilization collapses despite optimal compute and communication.² A poorly tuned input pipeline can waste roughly 50% of GPU time, whereas algorithmic optimizations often yield only a few percent.³

CPU DataLoader workers hit a wall on media. The book's worked example: an image plus video augmentation pipeline driving an object detector runs CPU at 800% (eight cores pinned) and the GPU still stalls. Moving decode and augmentation into DALI drops CPU usage to about 200% (two cores, just for file reads) while the GPU does the decode concurrently with compute.⁴ DALI offloads arithmetic from CPU to GPU and reduces the number of CPU workers required.¹

The mechanism behind the offload is that only the compressed JPEG crosses the PCIe bus. A CPU decode produces the full decoded tensor on the host and copies that (large) tensor to the GPU; device="mixed" copies only the small encoded payload and decodes on the GPU with nvJPEG. The saving equals the JPEG compression ratio, and the model below quantifies it.

# PCIe payload model for device="mixed" decode, validated (system python3, numpy).
# Run: python3 pcie.py
import numpy as np


def decoded_bytes(height, width, channels=3, dtype_bytes=1):
    """Size of a decoded image tensor in bytes (what a CPU decode would push over PCIe)."""
    assert height > 0 and width > 0 and channels > 0 and dtype_bytes > 0
    return height * width * channels * dtype_bytes


def pcie_bytes(decoded_nbytes, compression_ratio, decode_on_gpu):
    """Bytes crossing PCIe for one image.
    CPU decode: the full decoded tensor is produced on the host and copied to the GPU.
    mixed/GPU decode (device="mixed"): only the compressed JPEG payload crosses; the
    decode happens on the GPU (nvJPEG) so the large tensor never traverses the bus."""
    assert compression_ratio >= 1.0, "JPEG shrinks data; ratio must be >= 1"
    assert decoded_nbytes > 0
    if decode_on_gpu:
        return decoded_nbytes / compression_ratio   # the small encoded payload
    return decoded_nbytes                            # the full decoded tensor


# happy path: a 224x224x3 uint8 image at a typical JPEG ratio of 10x.
dec = decoded_bytes(224, 224)                        # 150528 bytes decoded
assert dec == 224 * 224 * 3
cpu = pcie_bytes(dec, compression_ratio=10.0, decode_on_gpu=False)
gpu = pcie_bytes(dec, compression_ratio=10.0, decode_on_gpu=True)
assert cpu == dec                                    # CPU decode ships the whole tensor
assert np.isclose(gpu, dec / 10.0)                   # mixed ships only the JPEG
# the whole point: mixed decode moves 10x fewer bytes over PCIe at ratio 10.
assert np.isclose(cpu / gpu, 10.0)

# equivalence to a reference: the saving equals the ratio for any ratio >= 1.
# Check against a direct recomputation on random image sizes and ratios.
rng = np.random.default_rng(0)
for _ in range(1000):
    h = int(rng.integers(16, 4097))
    w = int(rng.integers(16, 4097))
    r = float(rng.uniform(1.0, 40.0))
    d = decoded_bytes(h, w)
    ref_saving = float(d) / (float(d) / r)           # slow reference == r
    got_saving = pcie_bytes(d, r, False) / pcie_bytes(d, r, True)
    assert abs(got_saving - r) < 1e-6
    assert abs(got_saving - ref_saving) < 1e-6

# adversarial/boundary: ratio 1.0 (incompressible) => mixed saves NOTHING, not a crash.
assert pcie_bytes(dec, 1.0, True) == pcie_bytes(dec, 1.0, False)
# corruption detection: a ratio below 1.0 (would imply JPEG grows data) must be rejected.
for bad in (0.9, 0.0, -3.0):
    try:
        pcie_bytes(dec, bad, True)
        raise AssertionError(f"should have rejected compression_ratio={bad}")
    except AssertionError as e:
        assert "ratio must be >= 1" in str(e)

# larger images make the CPU-decode PCIe bill worse, mixed stays proportional to payload.
assert decoded_bytes(4096, 4096) > decoded_bytes(224, 224)
print(f"224x224x3 uint8 = {dec} B decoded; mixed ships {gpu:.0f} B over PCIe "
      f"({cpu/gpu:.0f}x less than CPU decode at 10x JPEG); all asserts pass")

Running it prints "224x224x3 uint8 = 150528 B decoded; mixed ships 15053 B over PCIe (10x less than CPU decode at 10x JPEG); all asserts pass". The boundary assert is deliberate: at compression ratio 1.0 (incompressible data) mixed decode saves nothing, and a ratio below 1.0 (which would imply the JPEG grows the data) is rejected, so a corrupt profiler reading cannot conjure a fake saving.

When to use it (and when not)¶

Use DALI when the workload is input-bound and preprocessing is heavy: image/video decode plus augmentation (classification, detection, segmentation) where the CPU saturates before the GPU does. DALI ships prebuilt operators for these and uses the GPU's media-acceleration hardware.¹

Do not reach for DALI when:

Preprocessing is cheap (already-tokenized .bin/.idx LLM shards, or light last-mile shuffling). A tuned PyTorch DataLoader with enough num_workers, pin_memory=True, persistent_workers=True, and a raised prefetch_factor is sufficient and simpler.⁷ ⁸
You would decode on GPU then hand raw pixels back to the CPU for augmentation/collation. The extra host-device-host copies can negate DALI's gains. Keep the whole subgraph on the GPU, or consider fusing preprocessing directly into a GPU graph with TorchVision/TensorRT/custom CUDA kernels instead.⁵

Integration is intrusive: DALI must be wired into the training loop, which adds complexity and a learning curve.¹ Always benchmark end-to-end under realistic conditions. Compare a CPU-only pipeline, a DALI pipeline, and a fully fused GPU-graph implementation, and pick the best CPU-savings/GPU-utilization balance for your model and dataset.⁶

Architecture¶

A DALI pipeline is a static three-stage operator graph, and the stage boundaries are the whole design. The CPU reader emits encoded bytes; the "mixed" decoder copies only that small payload across PCIe and decodes it on the GPU (nvJPEG); every "gpu" operator (resize, crop, normalize) then works on decoded pixels that never leave device memory. DALI runs its own thread pool and a prefetch queue so batch N+1 is produced while the model consumes batch N, and the device only stalls when the queue drains faster than the pipeline refills it. The contrast path shows what DALI replaces: CPU DataLoader workers that decode and augment on the host saturate their cores and leave the GPU input-bound.¹² ¹ ⁵

flowchart LR
    Storage["Storage<br/>(JPEG/video shards)"]
    Reader["CPU reader<br/>fn.readers.file (encoded bytes)"]
    Decode["mixed decode<br/>fn.decoders.image (nvJPEG, GPU out)"]
    Augment["GPU augment<br/>resize / crop_mirror_normalize"]
    Prefetch["prefetch queue<br/>(depth >= 2, overlaps compute)"]
    Train["training step<br/>model(images) on GPU"]
    CPUDL["CPU DataLoader workers<br/>(decode + augment on CPU)"]
    Starve["GPU stalls<br/>(input-bound, ~50% wasted)"]

    Storage --> Reader
    Reader -->|"small JPEG over PCIe"| Decode
    Decode --> Augment
    Augment --> Prefetch
    Prefetch -->|"tensors already on GPU"| Train
    Storage -.->|"contrast: CPU saturates"| CPUDL
    CPUDL -.-> Starve

How to use it: build the pipeline and place operators¶

Define the graph with @pipeline_def. Put the reader on CPU, decode as "mixed", and keep every augmentation on "gpu" so decoded pixels never round-trip to host memory. The three-stage execution model (cpu, then mixed, then gpu) makes the boundary explicit; crossing it more than necessary costs PCIe bandwidth.¹² ⁵

Reference template (needs nvidia.dali, not installed here; the PCIe saving it depends on is validated under "Why use it", and the crop_mirror_normalize math it calls is validated below):

# REFERENCE TEMPLATE (requires nvidia.dali; not runnable in this environment).
from nvidia.dali import pipeline_def, fn, types


@pipeline_def
def train_pipe():
    # Reader runs on CPU and emits encoded bytes + labels.
    encoded, labels = fn.readers.file(file_root="/mnt/data/imagenet/train",
                                      random_shuffle=True, name="Reader")
    # "mixed": CPU-encoded JPEG in, GPU-decoded image out (nvJPEG).
    images = fn.decoders.image(encoded, device="mixed")
    # GPU operators: augmentation never touches the CPU.
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,                 # DALIDataType enum rather than a string
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        output_layout="CHW",
    )
    return images, labels.gpu()


pipe = train_pipe(batch_size=256, num_threads=4, device_id=0)
pipe.build()

device="mixed" on fn.decoders.image selects the nvJPEG/nvJPEG2000 hardware-accelerated decode path; .gpu() moves a CPU DataNode (here the labels) to the GPU so all outputs land in device memory.¹²

The fn.crop_mirror_normalize call is the arithmetic core the GPU runs per pixel: it transposes the decoder's HWC output to CHW, optionally mirrors (horizontal flip), and applies a channel-wise (pixel - mean) / std with the ImageNet statistics expressed in 0..255 units. The numpy block below models that arithmetic (it is not DALI's kernel) and checks it value-for-value against a naive triple-loop reference:

# crop_mirror_normalize math, validated against a slow reference (system python3, numpy).
# Run: python3 cmn.py
import numpy as np

# DALI's crop_mirror_normalize with dtype=types.FLOAT, output_layout="CHW" computes, per pixel:
#   out[c,y,x] = (in[y,x,c] - mean[c]) / std[c]
# where the ImageNet mean/std are given in 0..255 units (mean=[0.485*255,...]).
# It also transposes the HWC decoder output to CHW and can mirror (horizontal flip).

IMAGENET_MEAN = np.array([0.485 * 255, 0.456 * 255, 0.406 * 255], dtype=np.float64)
IMAGENET_STD = np.array([0.229 * 255, 0.224 * 255, 0.225 * 255], dtype=np.float64)


def crop_mirror_normalize(img_hwc, mean, std, mirror=False):
    """Vectorized reference for DALI's operator: HWC uint8 in, CHW float out,
    optional horizontal mirror, channel-wise (x - mean) / std."""
    assert img_hwc.ndim == 3 and img_hwc.shape[2] == mean.shape[0] == std.shape[0]
    assert np.all(std > 0.0), "per-channel std must be positive (no divide-by-zero)"
    x = img_hwc.astype(np.float64)
    if mirror:
        x = x[:, ::-1, :]                       # flip width axis
    x = (x - mean) / std                        # broadcast over H,W; per-channel
    return np.transpose(x, (2, 0, 1))           # HWC -> CHW


def _slow_reference(img_hwc, mean, std, mirror):
    """Deliberately naive triple loop; the ground truth to check the vectorized path."""
    h, w, c = img_hwc.shape
    out = np.zeros((c, h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                sx = (w - 1 - x) if mirror else x
                out[ch, y, x] = (float(img_hwc[y, sx, ch]) - mean[ch]) / std[ch]
    return out


rng = np.random.default_rng(7)
img = rng.integers(0, 256, size=(5, 7, 3), dtype=np.uint8)

# happy path: layout changes HWC(5,7,3) -> CHW(3,5,7); value equals the manual formula.
out = crop_mirror_normalize(img, IMAGENET_MEAN, IMAGENET_STD)
assert out.shape == (3, 5, 7), out.shape
expect_00 = (float(img[0, 0, 0]) - IMAGENET_MEAN[0]) / IMAGENET_STD[0]
assert abs(out[0, 0, 0] - expect_00) < 1e-9

# equivalence to the slow reference on random data, both with and without mirror.
assert np.allclose(out, _slow_reference(img, IMAGENET_MEAN, IMAGENET_STD, False))
out_m = crop_mirror_normalize(img, IMAGENET_MEAN, IMAGENET_STD, mirror=True)
assert np.allclose(out_m, _slow_reference(img, IMAGENET_MEAN, IMAGENET_STD, True))

# adversarial: mirroring twice is the identity (round-trip), value-for-value.
out_mm = crop_mirror_normalize(
    np.transpose(out_m, (1, 2, 0))[:, ::-1, :].astype(np.float64) * IMAGENET_STD + IMAGENET_MEAN,
    IMAGENET_MEAN, IMAGENET_STD, mirror=False,
).astype(np.float64)
assert np.allclose(out_mm, out, atol=1e-6)

# statistical sanity: a full-frame image of the mean colour normalizes to ~0 per channel.
flat = np.broadcast_to(IMAGENET_MEAN.astype(np.uint8), (5, 7, 3)).copy()
znorm = crop_mirror_normalize(flat, IMAGENET_MEAN, IMAGENET_STD)
# uint8 truncates the mean, so allow one quantization step / std as the bound.
assert np.max(np.abs(znorm)) <= (1.0 / IMAGENET_STD.min()) + 1e-9

# corruption detection: a zero std (would divide by zero) must be rejected, not produce inf.
try:
    crop_mirror_normalize(img, IMAGENET_MEAN, np.array([0.0, 1.0, 1.0]))
    raise AssertionError("should have rejected std=0")
except AssertionError as e:
    assert "std must be positive" in str(e)

print(f"CMN: HWC{img.shape} -> CHW{out.shape}; matches slow reference (mirror on/off); "
      f"mean-colour image -> |z|<={np.max(np.abs(znorm)):.3f}; all asserts pass")

Running it prints "CMN: HWC(5, 7, 3) -> CHW(3, 5, 7); matches slow reference (mirror on/off); mean-colour image -> |z|<=0.012; all asserts pass". The vectorized model matches the naive triple loop value-for-value (with and without mirror), a double mirror round-trips to the identity, and a zero std is rejected rather than silently producing inf.

How to integrate it: framework iterators (PyTorch / JAX)¶

Wrap the built pipeline in a framework iterator that yields native tensors already on the GPU. For PyTorch:

Reference template (needs nvidia.dali.plugin.pytorch):

# REFERENCE TEMPLATE (requires nvidia.dali; not runnable in this environment).
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

loader = DALIGenericIterator(
    pipelines=[pipe],
    output_map=["images", "labels"],
    reader_name="Reader",              # matches name= on fn.readers.file; supplies shard sizing
    auto_reset=True,                   # reset at epoch boundary
    last_batch_policy=LastBatchPolicy.PARTIAL,
)

for data in loader:
    images = data[0]["images"]         # torch.Tensor, already on GPU
    labels = data[0]["labels"]
    outputs = model(images)

DALIClassificationIterator is the shorthand for the two-output ["data", "label"] case. reader_name ties the iterator to a named reader so it derives epoch and shard sizes automatically (use it instead of a hard-coded size). last_batch_policy selects FILL, PARTIAL, or DROP behavior for a short final batch. DALI ships comparable plugins for other frameworks, including JAX and TensorFlow, under nvidia.dali.plugin.*.¹³
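
The three last-batch policies can be compared with a small model of the batch sizes one epoch yields. This is a sketch, not DALI code: `batch_sizes` is a hypothetical helper, the policies are plain strings rather than the LastBatchPolicy enum, and how FILL chooses its padding samples is left out.

```python
# Illustrative model only; batch_sizes is a hypothetical helper, not a DALI API.
POLICIES = ("FILL", "PARTIAL", "DROP")


def batch_sizes(num_samples, batch_size, policy):
    """Sizes of the batches one epoch yields under a last-batch policy.
    FILL pads the short final batch up to batch_size, PARTIAL returns it
    short, DROP discards it."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy {policy!r}")
    if num_samples < 0 or batch_size <= 0:
        raise ValueError("num_samples must be >= 0 and batch_size > 0")
    full, rest = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rest and policy == "FILL":
        sizes.append(batch_size)      # padded to a full batch
    elif rest and policy == "PARTIAL":
        sizes.append(rest)            # short final batch
    return sizes                      # DROP: the remainder is discarded


for policy in POLICIES:
    print(policy, batch_sizes(10, 4, policy))
# FILL [4, 4, 4]
# PARTIAL [4, 4, 2]
# DROP [4, 4]
```

PARTIAL keeps the sample count exact at the cost of one odd-sized batch, which matters for layers or metrics that assume a fixed batch size.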

How to run it in production: prefetch, queue depth, and threads¶

DALI prefetches batches ahead of the consumer. The default prefetch_queue_depth is 2; raise it when per-batch processing time varies and the default does not hide the jitter, at the cost of memory.¹⁴ num_threads sets the CPU worker-thread count for the "cpu"/"mixed" stages. For NUMA locality, enable set_affinity=True and pin DALI threads with the DALI_AFFINITY_MASK environment variable (comma-separated CPU IDs); the placement rationale is in NUMA affinity and CPU pinning for GPU pipelines.¹⁴

Reference template (needs nvidia.dali):

# REFERENCE TEMPLATE (requires nvidia.dali; not runnable in this environment).
pipe = train_pipe(
    batch_size=256,
    num_threads=4,
    device_id=0,
    prefetch_queue_depth=3,   # raise above default 2 to absorb decode-time jitter
    set_affinity=True,        # pin DALI threads to the GPU's NUMA node
)
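
How queue depth absorbs jitter can be illustrated with a toy single-producer, single-consumer model. It is a sketch under stated assumptions, not DALI's executor: `total_stall` and the timing numbers are hypothetical, batches are produced one at a time, and a queue slot frees as soon as the consumer dequeues a batch.

```python
# Toy model only; total_stall and the timings below are hypothetical.
def total_stall(produce_times, step_time, depth):
    """Time the consumer spends waiting for batches (after the first one) when
    a single producer fills a bounded queue holding `depth` batches."""
    if depth < 1:
        raise ValueError("depth must be >= 1")
    taken = []            # taken[i]: time the consumer dequeues batch i
    producer_free = 0     # time the producer finished its previous batch
    consumer_free = 0     # time the consumer finished its previous step
    stall = 0
    for i, cost in enumerate(produce_times):
        # A queue slot frees up once batch i - depth has been dequeued.
        slot_free = taken[i - depth] if i >= depth else 0
        ready = max(producer_free, slot_free) + cost
        start = max(ready, consumer_free)
        if i > 0:                      # ignore the unavoidable warm-up wait
            stall += start - consumer_free
        taken.append(start)
        producer_free = ready
        consumer_free = start + step_time
    return stall


# Steady 1-unit batches with one 6-unit spike; the training step takes 2 units.
spiky = [1, 1, 1, 1, 6, 1, 1, 1]
print({d: total_stall(spiky, 2, d) for d in (1, 2, 3, 4)})
# prints: {1: 4, 2: 2, 3: 1, 4: 1}
```

In this toy run the stall shrinks as depth grows and then flattens: beyond depth 3 the producer is no longer waiting for a slot, so extra depth only costs memory.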

How to scale it: shard across ranks¶

Configure the reader's shard_id/num_shards so each data-parallel rank reads a distinct, non-overlapping slice and the slices together cover the whole dataset, the role DistributedSampler plays for a PyTorch DataLoader. As you add GPUs, scale num_threads and I/O bandwidth in lock-step or the input pipeline becomes the bottleneck.⁹ Getting the partition wrong is a correctness bug, not just a performance one: overlapping shards decode the same sample on two ranks (double-counting an epoch), and a gap drops samples. The model below is a simplified contiguous partition of the kind a sharded reader performs (DALI's own boundary arithmetic may place the remainder on different shards), checked for disjointness and completeness over 2000 random splits:

# Reader shard partitioning (shard_id / num_shards), validated (system python3, numpy).
# Run: python3 shard.py
import numpy as np


def shard_indices(dataset_size, num_shards, shard_id):
    """Contiguous per-rank slice, mirroring a DistributedSampler-style split of a
    named DALI reader across data-parallel ranks. Each of num_shards ranks reads a
    distinct block so no sample is decoded twice within an epoch."""
    assert num_shards >= 1, "need at least one shard"
    assert 0 <= shard_id < num_shards, "shard_id out of range"
    assert dataset_size >= 0
    per = dataset_size // num_shards          # floor; remainder handled below
    rem = dataset_size % num_shards
    # give the first `rem` shards one extra sample so the union is the whole set.
    start = shard_id * per + min(shard_id, rem)
    stop = start + per + (1 if shard_id < rem else 0)
    return np.arange(start, stop)


def _all_shards(dataset_size, num_shards):
    return [shard_indices(dataset_size, num_shards, s) for s in range(num_shards)]


# happy path: 100 samples over 4 shards => 25 each, contiguous, covering 0..99.
parts = _all_shards(100, 4)
assert all(len(p) == 25 for p in parts)
union = np.concatenate(parts)
assert np.array_equal(union, np.arange(100))                 # complete cover, in order

# core invariants for ANY split (equivalence to the definition): disjoint + complete.
rng = np.random.default_rng(11)
for _ in range(2000):
    n = int(rng.integers(0, 1000))
    k = int(rng.integers(1, 33))
    ps = _all_shards(n, k)
    cat = np.concatenate(ps) if n else np.array([], dtype=int)
    # completeness: union of shards == full index set exactly once.
    assert np.array_equal(np.sort(cat), np.arange(n))
    assert len(cat) == n                                     # no sample dropped or duplicated
    # disjointness: total unique count equals n (no overlap between any two ranks).
    assert len(np.unique(cat)) == n
    # balance: shard sizes differ by at most one (the remainder spread).
    sizes = np.array([len(p) for p in ps])
    assert sizes.max() - sizes.min() <= 1

# adversarial: uneven split, 10 over 3 => sizes [4,3,3], still a perfect partition.
u = _all_shards(10, 3)
assert [len(p) for p in u] == [4, 3, 3]
assert np.array_equal(np.concatenate(u), np.arange(10))

# boundary: num_shards == 1 reads everything; dataset_size 0 yields empty shards.
assert np.array_equal(shard_indices(57, 1, 0), np.arange(57))
assert shard_indices(0, 4, 2).size == 0

# corruption detection: shard_id >= num_shards (a misconfigured rank) must be rejected.
for bad in (4, 5, -1):
    try:
        shard_indices(100, 4, bad)
        raise AssertionError(f"should have rejected shard_id={bad}")
    except AssertionError as e:
        assert ("out of range" in str(e)) or ("at least one" in str(e))

print(f"shards: 100/4 -> {[len(p) for p in parts]} contiguous covering 0..99; "
      f"10/3 -> {[len(p) for p in u]}; 2000 random splits all disjoint+complete; asserts pass")

Running it prints "shards: 100/4 -> [25, 25, 25, 25] contiguous covering 0..99; 10/3 -> [4, 3, 3]; 2000 random splits all disjoint+complete; asserts pass". The uneven case (10 samples over 3 ranks) spreads the remainder to sizes [4, 3, 3] and still partitions perfectly, and a shard_id outside [0, num_shards) (a misconfigured rank) is rejected rather than silently reading the wrong slice.

How to maintain it¶

Treat the input pipeline as a first-class performance surface. Profile the loader in isolation (time 100 batches with GPU work disabled) and compare the result to the GPU-idle time measured during real training; in PyTorch, the time spent inside next(data_iterator) captures the total GPU-idle wait, including prefetch and the host-to-device copy.¹⁰ Trace overlap with Nsight Systems. When the storage layer (not preprocessing) is the limiter, combine DALI with GPUDirect Storage so reads bypass the host bounce buffer.¹¹ Re-benchmark after CUDA/DALI upgrades; optimal num_threads and prefetch_queue_depth shift with hardware and driver versions.
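
The isolation measurement can be sketched as a small timing helper that works on any iterable loader. `time_batches` is a hypothetical name and the generator below is a stand-in for a real loader. If the loader returns GPU tensors, the work behind them may still be in flight when next() returns, so a real measurement would also need a device synchronization before reading the clock (not shown).

```python
import time


# Hypothetical helper: times next() on any iterable, with no model work attached.
def time_batches(loader, num_batches=100):
    """Pull up to num_batches batches and return
    (batches_pulled, seconds_elapsed, mean_seconds_per_batch)."""
    it = iter(loader)
    pulled = 0
    start = time.perf_counter()
    for _ in range(num_batches):
        try:
            next(it)
        except StopIteration:
            break                     # loader ran out before num_batches
        pulled += 1
    elapsed = time.perf_counter() - start
    mean = elapsed / pulled if pulled else float("nan")
    return pulled, elapsed, mean


# Stand-in loader: each "batch" costs about 1 ms of simulated preprocessing.
fake_loader = (time.sleep(0.001) for _ in range(20))
pulled, elapsed, mean = time_batches(fake_loader, num_batches=10)
print(f"{pulled} batches in {elapsed:.4f} s ({mean * 1e3:.2f} ms/batch)")
```

Comparing the mean per-batch time here against the per-step time of the model gives a first read on whether the job is input-bound.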

Failure modes¶

Decode on GPU, then augment on the CPU. Using DALI only for fn.decoders.image and handing raw pixels back to the host for augmentation/collation adds host-to-device-to-host copies that can negate the offload gain. Keep the whole subgraph on the GPU and benchmark end-to-end.⁵
Reaching for DALI when preprocessing is cheap. On already-tokenized .bin/.idx LLM shards or light shuffling, DALI's intrusive integration buys nothing a tuned DataLoader (num_workers, pin_memory, persistent_workers, raised prefetch_factor) does not.⁷ ⁸
Default prefetch_queue_depth under bursty decode time. When per-batch processing time varies, a depth of 2 fails to hide the jitter and the GPU stalls at batch boundaries. Raise the depth (it costs memory) until the queue absorbs the variance.¹⁴
Scaling GPUs without widening the pipeline. Each data-parallel rank reads a distinct shard, so aggregate loader demand grows with cluster size. If you add GPUs without raising num_threads and I/O bandwidth in lock-step, the input pipeline becomes the bottleneck.⁹
Misconfigured shard_id/num_shards. Overlapping shards decode the same sample on multiple ranks (silently double-counting an epoch); a gap drops samples. The partition must be disjoint and complete (validated under "How to scale it").⁹
num_threads starved on "cpu"/"mixed" stages. The reader and the JPEG-payload copy still run on CPU threads; too few and the mixed decoder waits on its inputs even though the GPU decode is fast.¹⁴
Skipping the end-to-end benchmark. DALI is not universally faster than a CPU pipeline or a fully fused GPU graph. Without comparing all three under realistic conditions you can adopt the slower option for your model and dataset.⁶

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 5: "GPU-Based Storage I/O Optimizations", sections "Multimodal Data Processing with NVIDIA DALI" and "Tuning the Data Pipeline".
Pipeline and processing graph, NVIDIA DALI
PyTorch Plugin API reference, NVIDIA DALI
Performance Tuning, NVIDIA DALI

1. Fregly, Ch. 5, "Multimodal Data Processing with NVIDIA DALI": DALI accelerates processing by moving it to the GPU or using optimized C++ on the CPU; decodes JPEG and applies random crop/resize/normalize on the GPU; pipelines are a static graph of operators (define_graph()); offloads from CPU and reduces worker count; integration into the training loop adds complexity.
2. Fregly, Ch. 5, opening: feeding data is as important as compute; slow storage starves GPUs and yields low utilization.
3. Fregly, Ch. 5: "A poorly tuned input pipeline could waste 50% of your GPU time, whereas algorithmic optimizations might give only a few percent."
4. Fregly, Ch. 5: worked example of CPU at 800% (eight cores) with GPU still stalling, dropping to 200% (two cores) under DALI while the GPU decodes concurrently.
5. Fregly, Ch. 5: using DALI only to decode then handing pixels back to the CPU incurs host-device-host copies that negate gains; fuse GPU-friendly preprocessing into the GPU graph (TorchVision, TensorRT, custom CUDA).
6. Fregly, Ch. 5: benchmark CPU-only vs DALI vs fully fused GPU-graph end-to-end under realistic conditions.
7. Fregly, Ch. 5, "Tuning the Data Pipeline": PyTorch DataLoader with num_workers, pin_memory=True, persistent_workers=True, and prefetch_factor (default 2 for num_workers>0).
8. Fregly, Ch. 5: NeMo preprocessed datasets stored as memory-mappable .bin/.idx; prepare data before training, almost never train on raw text.
9. Fregly, Ch. 5, "Scaling Out Workers as You Scale Out Number of GPUs": expand worker count and I/O bandwidth with GPU count; each rank reads a distinct shard.
10. Fregly, Ch. 5, "Monitoring Storage I/O": profile the DataLoader in isolation over 100 batches; next(data_iterator) measures GPU-idle wait including prefetch and host-to-device copy.
11. Fregly, Ch. 5: GDS removes the host memory bounce buffer for storage I/O while the CPU still schedules transfers.
12. NVIDIA DALI, "Pipeline and processing graph": @pipeline_def factory; batch_size, num_threads, device_id, prefetch_queue_depth (default 2); three execution stages cpu/mixed/gpu; device="mixed" on fn.decoders.image selects nvJPEG/nvJPEG2000 (CPU-encoded in, GPU-decoded out); .gpu() moves a DataNode to GPU. https://docs.nvidia.com/deeplearning/dali/user-guide/docs/pipeline.html
13. NVIDIA DALI, "PyTorch Plugin API reference": DALIGenericIterator(pipelines, output_map, size=-1, reader_name=None, auto_reset=False, last_batch_policy=LastBatchPolicy.FILL, ...); DALIClassificationIterator returns ["data","label"]; LastBatchPolicy FILL/PARTIAL/DROP. https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html
14. NVIDIA DALI, "Performance Tuning": default prefetch_queue_depth is 2, raise when processing-time variation is not hidden (costs memory); num_threads worker count; set_affinity and DALI_AFFINITY_MASK for thread pinning. https://docs.nvidia.com/deeplearning/dali/user-guide/docs/advanced_topics_performance_tuning.html