Markdown

CUDA graphs: capture, replay, and launch overhead¶

Scope: amortizing per-kernel CPU launch overhead by capturing a fixed pipeline of kernels, copies, and events once and replaying it as a single submission: stream capture, instantiate/replay, when launch overhead dominates (decode, small batches), pointer-stability rules, dynamic graph update, and how PyTorch and vLLM use graphs in production inference.

What it is¶

A CUDA Graph is a recorded DAG of GPU operations (kernel launches, async memory copies, event records, host callbacks) with their dependencies fixed at capture time. Once captured and instantiated into an executable, the whole DAG is launched with a single host call (cudaGraphLaunch) instead of one host call per operation. The CUDA driver knows the full dependency graph up front, so it replays the prebuilt sequence directly on the GPU with minimal per-iteration CPU work.

The normal way to build one is stream capture. You put a stream into capture mode with cudaStreamBeginCapture(), enqueue the usual stream work (kernel<<<...,stream>>>, cudaMemcpyAsync, cudaEventRecord, cudaLaunchHostFunc), then close with cudaStreamEndCapture() to obtain a cudaGraph_t. Work launched into a capturing stream is not executed; it is appended to the internal graph being built. You then call cudaGraphInstantiate() to produce a launchable cudaGraphExec_t, and cudaGraphLaunch(exec, stream) to replay it. (NVIDIA CUDA Programming Guide: CUDA Graphs)

Graphs do not make any single kernel or copy faster. They remove the CPU from the per-iteration critical path and let the driver schedule a known batch of work back-to-back, closing the small idle gaps between consecutive kernel launches. (Fregly, Ch. 12)

Why use it¶

For a fixed pipeline replayed many times, the cost being eliminated is host-side launch overhead and host-device handshakes, not compute. The book's illustrative three-kernel pipeline run for 100 iterations collapses 300 separate kernel launches into 100 graph replays (one per iteration), drops 300 cudaDeviceSynchronize calls to 0, removes the roughly 3 microsecond inter-kernel idle gaps, and cuts end-to-end iteration latency from about 1.00 ms to about 0.75 ms, roughly 25% faster. These numbers are illustrative, used to explain the mechanism; actual gains depend on architecture and kernel mix. (Fregly, Ch. 12, Table 12-1)

The win scales with how launch-bound the workload is. When kernels are large (each runs for hundreds of microseconds), launch overhead is noise and graphs barely help. When you have many small kernels (short ops where CPU dispatch and driver overhead rival or exceed the GPU work), launch overhead dominates and graphs pay off the most. A graph replay skips all layers of argument setup and kernel dispatch (Python, C++, and CUDA driver overheads) and submits the whole graph with one cudaGraphLaunch. (PyTorch: Accelerating PyTorch with CUDA Graphs)

This maps directly onto LLM decode: per-token, single-row (or small-batch) forward passes through many tiny kernels (per-layer attention and MLP), where the model is host-launch-bound rather than compute-bound. Prefill on long sequences with large batches is the opposite regime: compute-bound, where graphs add little. Use the roofline / arithmetic intensity view and a Nsight profiling pass to confirm you are launch-bound before reaching for graphs.

The mechanism is a simple latency model you can reproduce. The block below computes per-iteration latency the eager way (one host launch plus an idle gap per kernel) and the graph way (one launch for the whole DAG, gaps closed), then asserts the graph is never slower, that the launch count collapses N:1, and that the win is large when launch-bound and near zero when compute-bound. Validated with numpy 2.4.6 under python3:

# Launch-overhead amortization model: eager vs graph-replayed pipeline latency.
# Core claim (Fregly Ch. 12, Table 12-1): a graph collapses N per-kernel host
# launches per iteration into ONE launch and removes inter-kernel idle gaps, so
# per-iteration latency drops without any kernel getting faster. We model both
# regimes and assert: the graph is never slower for a fixed pipeline, the launch
# count collapses N:1, the launch-bound regime yields a meaningful speedup, and
# the compute-bound regime yields almost none.
import numpy as np


def iter_latency_us(
    n_kernels: int,
    kernel_us: float,
    host_launch_us: float,
    inter_kernel_gap_us: float,
    graph_launch_us: float,
) -> tuple[float, float]:
    """Return (eager_us, graph_us) per-iteration latency for one pipeline pass.

    Eager path: each kernel costs a host launch plus its GPU time, and between
    consecutive kernels the host leaves an idle gap on the GPU timeline (the CPU
    is on the critical path). Graph path: ONE host launch replays the whole DAG
    and the driver runs the kernels back-to-back, closing the gaps. GPU compute
    time (n_kernels * kernel_us) is identical in both paths.
    """
    assert n_kernels >= 1 and kernel_us >= 0.0
    compute = n_kernels * kernel_us
    eager = n_kernels * host_launch_us + compute + (n_kernels - 1) * inter_kernel_gap_us
    graph = graph_launch_us + compute
    return eager, graph


# --- Launch-bound regime: many small kernels, host dispatch dominates ----------
N, ITERS = 3, 100
eager, graph = iter_latency_us(
    n_kernels=N,
    kernel_us=1.0,          # tiny GPU work per kernel
    host_launch_us=1.5,     # per-launch host dispatch (Python + C++ + driver)
    inter_kernel_gap_us=3.0,  # ~3 us idle gap between kernels (book figure)
    graph_launch_us=0.5,    # single cheap graph launch for the whole DAG
)
eager_ms = eager * ITERS / 1000.0
graph_ms = graph * ITERS / 1000.0
speedup = 1.0 - graph_ms / eager_ms
print(f"launch-bound: eager={eager_ms:.3f} ms  graph={graph_ms:.3f} ms  "
      f"faster={speedup*100:.1f}%")

# Graph is never slower than eager for a fixed pipeline.
assert graph <= eager, (graph, eager)
# The GPU compute floor is untouched: graphs speed up nothing on-device.
assert abs((graph - 0.5) - (eager - (N * 1.5 + (N - 1) * 3.0))) < 1e-9
# In the launch-bound regime the win is large (well above 20%).
assert speedup > 0.20, speedup

# --- Launch count collapses N:1 (300 launches -> 100 replays) ------------------
eager_launches = ITERS * N
graph_launches = ITERS * 1
assert eager_launches == 300 and graph_launches == 100
assert graph_launches == eager_launches // N

# --- Adversarial edge: compute-bound regime, graph barely helps ---------------
# Large kernels (hundreds of us). Launch overhead is now noise; win -> ~0.
eb, gb = iter_latency_us(
    n_kernels=N, kernel_us=400.0, host_launch_us=1.5,
    inter_kernel_gap_us=3.0, graph_launch_us=0.5,
)
big_speedup = 1.0 - gb / eb
print(f"compute-bound: eager={eb:.1f} us  graph={gb:.1f} us  "
      f"faster={big_speedup*100:.2f}%")
assert big_speedup < 0.01, big_speedup  # under 1% when compute dominates
assert gb <= eb

# --- Adversarial boundary: a single kernel has no inter-kernel gaps to close --
e1, g1 = iter_latency_us(
    n_kernels=1, kernel_us=1.0, host_launch_us=1.5,
    inter_kernel_gap_us=3.0, graph_launch_us=0.5,
)
# With one kernel the gap term is zero, so the only possible win is the cheaper
# single launch (1.5 - 0.5), never gap removal.
assert abs((e1 - g1) - (1.5 - 0.5)) < 1e-9, (e1, g1)

# --- Monotonicity: more small kernels => strictly larger absolute saving -------
prev = -1.0
for n in (1, 3, 8, 32):
    e, g = iter_latency_us(n, 1.0, 1.5, 3.0, 0.5)
    saved = e - g  # microseconds removed per iteration by graphing
    assert saved > prev, (n, saved, prev)
    prev = saved

print("block1 OK: amortization model reproduces the mechanism and edge cases")

The model is a teaching abstraction, not a hardware measurement: the per-op microsecond costs are inputs, not benchmarks. It shows the shape of the win (large when launch-bound, near zero when compute-bound), which is exactly the reasoning you apply before reaching for graphs. Always confirm the real delta with a profiler on your GPU.

When to use it (and when not)¶

Reach for CUDA Graphs when:

The pipeline is static and repetitive: same kernels, same shapes, same dependency structure across iterations.
Profiling shows visible gaps between kernels on the GPU timeline and the CPU is the bottleneck (many small kernels, low per-kernel duration).
You are doing autoregressive decode or small-batch inference where per-token latency is dominated by launch overhead.

Do not use graphs (or expect little) when:

Kernels are large and already saturate the GPU, so there is no launch overhead to amortize.
Shapes or control flow change every iteration in ways the graph cannot encode. A captured graph is invalid if the workload size changes; you must recapture or use cudaGraphExecUpdate (below). Inference engines work around this by bucketing/padding inputs to a fixed set of captured shapes.
The region contains capture-illegal operations: cudaMalloc/allocation inside capture, host-device sync primitives, print(), RNG host calls, or nested captures. The graph must record a pure, deterministic sequence of GPU work. (Fregly, Ch. 12)
Pointer stability cannot be guaranteed. Every tensor used in capture must already be allocated at a fixed address with a fixed shape; replay must reuse the same addresses. Resizing or reallocating between iterations breaks the graph.

Memory rule of thumb: allocate everything before capture, keep persistent input/output buffers, and between iterations copy new data into the static buffers rather than reallocating. This is exactly why frameworks back graph capture with static memory pools. (Fregly, Ch. 12)

Architecture¶

A graph has two lifetimes. Capture-time turns a stream of enqueued work into a cudaGraph_t (a topology of nodes and edges); instantiate-time compiles that topology into an immutable cudaGraphExec_t bound to fixed device pointers. Run-time then replays the executable many times, optionally patched in place by cudaGraphExecUpdate when only launch parameters change. The static memory pool sits underneath, holding the input/output buffers whose addresses the executable pins.

flowchart TB
  subgraph BUILD["Build once (per shape)"]
    W["Warm-up pass (lazy cuBLAS or cuDNN init, allocations)"] --> CAP["Stream capture: begin, enqueue kernels or copies or events, end"]
    CAP --> G["cudaGraph_t (topology: nodes + dependency edges)"]
    G --> INST["cudaGraphInstantiate"]
    INST --> EXEC["cudaGraphExec_t (executable, pinned pointers)"]
  end
  subgraph RUN["Run many times"]
    LOOP["Replay loop: copy inputs into static buffers, cudaGraphLaunch"]
  end
  subgraph MEM["Static memory pool (graph_pool_handle)"]
    BUF["Persistent input / output buffers at fixed addresses"]
  end
  EXEC --> LOOP
  BUF -. fixed addresses .- EXEC
  LOOP -->|"dims or pointers changed"| UPD["cudaGraphExecUpdate (patch in place)"]
  UPD --> LOOP
  LOOP -->|"structure changed"| CAP

The pinned-pointer edge is the load-bearing invariant: replay re-runs the recorded ops over whatever bytes currently occupy the captured addresses, so correctness depends entirely on the caller copying fresh inputs into those exact buffers before each launch. The next block makes that invariant executable.

How to use it¶

Capture and replay in CUDA C++¶

Reference template (needs a CUDA toolchain and GPU; not compiled here). The pattern is: create a non-blocking stream, capture A then B then C, instantiate once, and replay in a loop with a single host call per iteration.

#include <cuda_runtime.h>

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

cudaGraph_t graph;
cudaGraphExec_t instance;

// Capture A -> B -> C on the stream
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernelA<<<grid, block, 0, stream>>>(d_X);
kernelB<<<grid, block, 0, stream>>>(d_Y);
kernelC<<<grid, block, 0, stream>>>(d_Z);
cudaStreamEndCapture(stream, &graph);

// Instantiate once
cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);

// Replay the whole A -> B -> C sequence with one host call per iteration
for (int iter = 0; iter < 100; ++iter) {
    cudaGraphLaunch(instance, stream);   // no per-kernel sync; graph encodes deps
}
cudaStreamSynchronize(stream);           // sync once after the loop

cudaGraphExecDestroy(instance);
cudaGraphDestroy(graph);

You pay one launch per iteration instead of three, and the GPU runs A, B, C back-to-back in the recorded order. (NVIDIA CUDA Programming Guide: CUDA Graphs; Fregly, Ch. 12)

Capture and replay in PyTorch¶

Reference template (needs torch + CUDA; not run here). PyTorch exposes graphs via the raw torch.cuda.CUDAGraph class and the torch.cuda.graph context manager. The non-negotiables are a warm-up pass (so lazy cuBLAS/cuDNN init and allocations happen before capture) and static buffers with out= writes so capture records no allocations. (PyTorch CUDA semantics: CUDA Graphs)

import torch

X = torch.randn(1 << 20, device="cuda")

# Persistent buffers for pointer stability
static_x = torch.empty_like(X)
static_y = torch.empty_like(X)
static_z = torch.empty_like(X)
static_w = torch.empty_like(X)

# Warm up once: initialize kernels, allocations, cuBLAS/cuDNN contexts
_ = (X * 1.1 + 2.0).sqrt()
torch.cuda.synchronize()

# Seed the static input before capture
static_x.copy_(X)

g = torch.cuda.CUDAGraph()
stream = torch.cuda.Stream()
torch.cuda.synchronize()
with torch.cuda.graph(g, stream=stream):
    # Use out= to avoid allocations inside capture
    torch.mul(static_x, 1.1, out=static_y)
    torch.add(static_y, 2.0, out=static_z)
    torch.sqrt(static_z, out=static_w)

# Replay; if inputs change, copy into static_x first, then replay
for _ in range(100):
    # static_x.copy_(new_X)
    g.replay()

Skipping the warm-up may cause capture to fail or stall when a lazy init fires during capture. Replay requires the same memory addresses for the captured tensors; update inputs by copying into static_x, never by reallocating. (Fregly, Ch. 12)

The core correctness rule that the PyTorch template depends on (copy new data into the static buffer, then replay) is exactly what this numpy model asserts. It reproduces a captured op bound to fixed buffers, then proves that copy-then-replay matches an eager reference, that replaying without the copy silently reprocesses stale data, and that reallocating the input to a new address is rejected. Validated with numpy 2.4.6 under python3:

# Pointer-stability / static-buffer semantics of CUDA graph replay, modelled in
# numpy. A captured graph binds to FIXED device addresses. Replay re-runs the
# recorded ops over whatever bytes currently live at those addresses. Therefore:
#   (1) to feed new inputs you must copy INTO the static buffer, then replay;
#   (2) replaying WITHOUT the copy reprocesses stale data (a real, silent bug);
#   (3) reallocating an input to a NEW address invalidates the graph, which the
#       runtime detects by comparing the bound pointer identity.
# We reproduce all three and assert them, including the adversarial stale case.
import numpy as np


class CapturedGraph:
    """A one-op graph f(y) = (y * 1.1 + 2.0) ** 3 bound to a fixed input buffer.

    The cube-of-affine map is injective and finite for all real inputs, so a
    stale replay is always distinguishable from the intended-input result.
    The graph captures the *identity* (Python id) and shape of the static input
    and output arrays, exactly as a CUDA graph pins device pointers and shapes.
    replay() reads the current contents of the bound input in place.
    """

    def __init__(self, static_in: np.ndarray, static_out: np.ndarray) -> None:
        self._in_id = id(static_in)
        self._out_id = id(static_out)
        self._shape = static_in.shape
        self._in = static_in
        self._out = static_out

    def replay(self) -> None:
        # Replay must see the same address and shape it was captured with.
        assert id(self._in) == self._in_id, "input buffer pointer changed"
        assert id(self._out) == self._out_id, "output buffer pointer changed"
        assert self._in.shape == self._shape, "input shape changed"
        np.power(self._in * 1.1 + 2.0, 3, out=self._out)  # in-place, no allocation

    def rebind_input(self, new_in: np.ndarray) -> None:
        """Reallocating (new address) must be rejected: graph is now invalid."""
        if id(new_in) != self._in_id:
            raise ValueError("reallocated input invalidates the captured graph")


def reference(x: np.ndarray) -> np.ndarray:
    """Slow, allocating eager reference for equivalence checks."""
    return (x * 1.1 + 2.0) ** 3


rng = np.random.default_rng(0)

# Allocate persistent buffers ONCE, before capture.
static_in = np.empty(1024, dtype=np.float64)
static_out = np.empty(1024, dtype=np.float64)

x0 = rng.standard_normal(1024)
static_in[:] = x0            # seed the static input before capture
g = CapturedGraph(static_in, static_out)

# --- Correct usage: copy new input into the static buffer, then replay --------
g.replay()
assert np.allclose(static_out, reference(x0)), "replay != eager on seed input"

x1 = rng.standard_normal(1024)
static_in[:] = x1            # copy-in, NOT reallocate
g.replay()
assert np.allclose(static_out, reference(x1)), "replay != eager after copy-in"

# --- Adversarial: replay WITHOUT copying new data reprocesses STALE input -----
# This is the silent correctness bug the static-buffer rule exists to prevent.
x2 = rng.standard_normal(1024)          # new data the caller *meant* to use...
# ...but forgets `static_in[:] = x2` before replay:
g.replay()
assert np.allclose(static_out, reference(x1)), "output should still reflect x1"
assert not np.allclose(static_out, reference(x2)), (
    "stale replay must NOT equal the intended-but-not-copied input")

# Now do it correctly and the staleness disappears.
static_in[:] = x2
g.replay()
assert np.allclose(static_out, reference(x2)), "copy-in fixes the stale replay"

# --- Adversarial: reallocating the input to a new address is rejected ---------
realloc = np.array(static_in)   # same values, DIFFERENT object/address
try:
    g.rebind_input(realloc)
    raise AssertionError("reallocation should have been rejected")
except ValueError as exc:
    assert "invalidates" in str(exc)

# In-place mutation of the SAME object keeps the graph valid (pointer unchanged).
static_in[:] = 0.0
g.replay()
assert np.allclose(static_out, reference(np.zeros(1024))), "zeros case"

print("block2 OK: static-buffer replay, stale-input detection, realloc rejection")

How to integrate with it¶

Higher-level wrappers reduce the boilerplate above:

torch.cuda.make_graphed_callables(callables, sample_args) wraps nn.Modules/functions, handling warm-up, static buffers, capture, and replay (including the backward pass) automatically. (torch.cuda.make_graphed_callables)
torch.compile(model, mode="reduce-overhead") may wrap eligible, CUDA-only, static-address regions in CUDA Graphs to cut launch overhead. It does not guarantee graphing of all paths and can raise memory use from pooled buffers; always profile to confirm the win. (PyTorch CUDA semantics)
torch.cuda.graph_pool_handle() returns a token so multiple graphs can share one pointer-stable memory pool, keeping tensor addresses fixed across captures/replays. (Fregly, Ch. 12; PyTorch CUDA semantics)

See torch.compile for how the compiler decides which regions are graph-safe, and PyTorch CUDA memory allocator for how the caching allocator supplies the pointer-stable pool that capture requires.

How to run it in production¶

Inference engines: one graph per shape¶

vLLM and TensorRT-LLM capture a model's execution into a set of graphs, one per (batch size, sequence-length range) bucket, at startup/model-load. At runtime an incoming request is padded/bucketed to a supported shape and the matching precaptured graph is replayed. This is how decode latency is held low at scale: fixed shapes satisfy the pointer-and-shape stability rules, and per-token launch overhead is amortized to a single replay. (Fregly, Ch. 12). In vLLM this is the CUDA-graph path of the model runner; disabling it (eager mode) is a common A/B to measure the launch-overhead contribution. See inference serving and serving OSS models.

The correctness invariant of that padding scheme is testable in isolation: pick the smallest captured bucket that fits, pad the batch up to it, replay the fixed-shape graph, and keep only the valid prefix rows. The block below asserts the padded-graph result on the valid rows equals a slow eager run at the exact size for every batch, that garbage in the padded tail never leaks into the answer, and that an over-capacity request is rejected. Validated with numpy 2.4.6 under python3:

# "One graph per shape" bucketing/padding, the mechanism vLLM and TensorRT-LLM
# use to satisfy the fixed-shape rule during decode. A captured graph only runs
# one shape, so a runtime request of dynamic size B is padded UP to the nearest
# captured bucket, the matching graph is replayed, and only the first B output
# rows are kept. We model the bucket selection and the padded replay, then assert
# the padded-graph result on the valid rows EQUALS a slow eager run on the exact
# size, and cover the adversarial edges: exact-boundary hit, over-capacity reject,
# and padded (garbage) tail rows never leaking into the answer.
import numpy as np

BUCKETS = (1, 2, 4, 8, 16)  # captured (max) batch sizes, ascending


def pick_bucket(batch: int, buckets=BUCKETS) -> int:
    """Smallest captured bucket >= batch. Raise if it exceeds the largest."""
    assert batch >= 1, "batch must be positive"
    for b in buckets:
        if b >= batch:
            return b
    raise ValueError(f"batch {batch} exceeds largest captured bucket {buckets[-1]}")


def eager_step(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Exact-size reference: a linear layer + ReLU over the real B rows."""
    return np.maximum(x @ w, 0.0)


def graphed_step(x: np.ndarray, w: np.ndarray, bucket: int) -> np.ndarray:
    """Pad x up to `bucket` rows with garbage, run the fixed-shape op, then slice
    back to the real B rows (exactly what a replayed decode graph returns)."""
    b, d = x.shape
    assert bucket >= b
    padded = np.full((bucket, d), np.nan, dtype=x.dtype)  # poison the pad rows
    padded[:b] = x
    # A CUDA graph cannot skip NaN rows; it computes them, we just never read them.
    out_full = np.maximum(np.nan_to_num(padded, nan=0.0) @ w, 0.0)
    return out_full[:b]  # keep only the valid prefix


rng = np.random.default_rng(7)
d_model = 32
W = rng.standard_normal((d_model, d_model))

# --- Bucket selection is correct across the range ------------------------------
assert pick_bucket(1) == 1
assert pick_bucket(3) == 4          # padded up
assert pick_bucket(4) == 4          # exact boundary hits its own bucket
assert pick_bucket(5) == 8
assert pick_bucket(16) == 16

# --- Padded replay equals exact-size eager on the valid rows, every batch ------
for real_b in range(1, 17):
    x = rng.standard_normal((real_b, d_model))
    bucket = pick_bucket(real_b)
    got = graphed_step(x, W, bucket)
    want = eager_step(x, W)
    assert got.shape == (real_b, d_model)
    assert np.allclose(got, want), f"mismatch at batch {real_b} (bucket {bucket})"

# --- Adversarial: garbage in the padded tail must never change the answer -----
x = rng.standard_normal((3, d_model))
bucket = pick_bucket(3)             # -> 4, so row index 3 is pure pad
want = eager_step(x, W)
got_a = graphed_step(x, W, bucket)
# Re-run with DIFFERENT poison in the pad region; the kept rows must be identical.
padded_b = np.full((bucket, d_model), 1e9, dtype=x.dtype)
padded_b[:3] = x
got_b = np.maximum(padded_b @ W, 0.0)[:3]
assert np.allclose(got_a, want) and np.allclose(got_b, want)
assert np.allclose(got_a, got_b), "pad contents leaked into valid rows"

# --- Adversarial: a request larger than any captured bucket is rejected --------
try:
    pick_bucket(17)
    raise AssertionError("over-capacity batch should have raised")
except ValueError as exc:
    assert "exceeds largest captured bucket" in str(exc)

# --- Adversarial boundary: batch 0 is invalid and must be rejected ------------
raised = False
try:
    pick_bucket(0)
except AssertionError:
    raised = True
assert raised, "batch 0 must be rejected"

print("block3 OK: bucket selection, padded replay == eager, pad isolation, limits")

The padded rows really are computed by the GPU (a graph cannot conditionally skip work); the engine simply never reads them, so pad contents are irrelevant to correctness but do cost flops. That is the price of holding shapes fixed, and it is why bucket granularity is a latency-versus-waste tradeoff.

How to maintain it¶

Dynamic graph update without recapture¶

When only launch parameters change (grid/block dims, kernel arguments, pointers) but the graph structure is identical, avoid a full recapture. Use cudaGraphExecUpdate to apply the changes to the existing executable, or the lower-level cudaGraphExecKernelNodeSetParams for a single node. The runtime validates the update and lets you replay immediately; an incompatible change (adding/removing nodes, structural changes) returns an error and forces a recapture. (Fregly, Ch. 12; NVIDIA: Employing CUDA Graphs in a Dynamic Environment)

Typical semi-static workflow:

Capture a template graph at the maximum expected size (e.g. batch 128).
For a smaller request (e.g. batch 64), call cudaGraphExecUpdate to adjust launch dims and swap pointers to a smaller buffer.
Replay. The update costs a few microseconds, preserving the sub-100 microsecond replay overhead.

Capture, replay, and update flow¶

flowchart LR
    A["Warm-up pass (init kernels, cuBLAS or cuDNN)"] --> B["cudaStreamBeginCapture"]
    B --> C["Enqueue kernels, async copies, events"]
    C --> D["cudaStreamEndCapture to cudaGraph_t"]
    D --> E["cudaGraphInstantiate to cudaGraphExec_t"]
    E --> F["Replay loop: cudaGraphLaunch each iteration"]
    F --> G{"Params changed?"}
    G -->|"dims or pointers only"| H["cudaGraphExecUpdate then replay"]
    G -->|"structure changed"| B
    H --> F

Verify it actually helped¶

Profile before and after. In a Nsight profiling workflow, Nsight Systems shows the per-iteration kernel-launch gaps collapsing into continuous back-to-back execution on the GPU timeline, and the host thread going idle between replays. Mark replay regions with NVTX ranges to correlate. Confirm the latency delta on your model and GPU. Never assume the illustrative 25% transfers. (Fregly, Ch. 12; PyTorch: Accelerating PyTorch with CUDA Graphs)

How to scale it¶

Two levers push graphs further once the single-stream case works.

Bucket coverage vs memory. Each captured shape is a separate cudaGraphExec_t with its own pinned buffers, so covering more (batch, sequence-length) buckets trades HBM and instantiate time for fewer recaptures at runtime. Share one pointer-stable pool across graphs with torch.cuda.graph_pool_handle() (or the engine's equivalent) so buffers overlap instead of multiplying. The cudaGraphExecUpdate template-graph workflow above lets one captured shape serve a range of smaller sizes, cutting the number of graphs you must hold.

Device-initiated launch (advanced). To remove the CPU from the launch decision entirely, instantiate with cudaGraphInstantiateFlagDeviceLaunch, then cudaGraphUpload() the executable on a host stream before any device-side launch (a device launch without upload errors out). A running kernel can then call cudaGraphLaunch(graphExec, stream) with a reserved stream constant selecting the mode: cudaStreamGraphFireAndForget (child runs immediately, parent does not wait), cudaStreamGraphTailLaunch (deferred until the launching graph completes, used to build GPU-resident scheduler loops), or cudaStreamGraphFireAndForgetAsSibling. Documented limits: up to 120 total fire-and-forget graphs per execution, up to 255 pending tail launches, and only one pending self-tail-launch. Device-launch latency stays roughly flat as the graph grows; the book reports about 2x lower launch latency versus host-side launch for the same graph. (NVIDIA CUDA Programming Guide: Device Graph Launch; Fregly, Ch. 12)

For the broader on-device-scheduling picture (megakernels, persistent kernels, dynamic parallelism), see persistent kernels and megakernels and dynamic parallelism and device launch.

Failure modes¶

Stale inputs from a missed copy. Replay reads whatever is in the static buffers now. If you forget to copy fresh data into static_x before replay(), the graph silently reprocesses the previous iteration's input and produces a plausible but wrong result (reproduced in the block2 stale-input case). Always copy-in, then replay.
Reallocation or resize breaks pointer stability. Any tensor that is reallocated, resized, or moved between iterations no longer lives at the captured address, so replay uses a dangling or wrong pointer. Allocate once before capture and reuse; never rebuild buffers inside the loop.
Capture-illegal operations abort or corrupt the capture. cudaMalloc/allocation inside capture, host-device sync primitives, print(), host RNG calls, and nested captures are not recordable. They cause capture to fail or record a non-deterministic sequence. Keep the captured region a pure sequence of GPU work. (Fregly, Ch. 12)
Skipped warm-up stalls or fails capture. If lazy cuBLAS/cuDNN init or a first allocation fires during capture (because you skipped the warm-up pass), capture can stall or error. Warm up once with a throwaway pass, synchronize, then capture. (PyTorch CUDA semantics)
Shape drift invalidates the graph. A workload whose size changes every iteration cannot be encoded by one graph. Without bucketing/padding to a fixed set of shapes (the block3 mechanism), you fall back to constant recapture, which erases the win. Bucket, or use cudaGraphExecUpdate for parameter-only changes.
Structural change forces a full recapture. cudaGraphExecUpdate only patches parameters; adding or removing nodes returns an error. Detect the error and recapture rather than replaying a mismatched executable. (Fregly, Ch. 12; NVIDIA: Employing CUDA Graphs in a Dynamic Environment)
Device launch without upload errors out. A device-initiated cudaGraphLaunch fails unless the executable was cudaGraphUpload()ed on a host stream first, and it is bounded by the fire-and-forget (120), pending-tail (255), and single self-tail-launch limits. Exceeding them is a runtime error, not a silent slowdown.
No profiled win. Graphs applied to a compute-bound region (large kernels) add capture and memory cost for a near-zero latency gain (the block1 compute-bound case). Confirm you are launch-bound in Nsight before and after; the illustrative 25% does not transfer. (PyTorch: Accelerating PyTorch with CUDA Graphs)

Reference templates (the CUDA C++ and PyTorch blocks) are not hardware-tested here; their flags and numbers are grounded in the cited book chapter and official NVIDIA/PyTorch docs. The three numpy blocks are executed and asserted under python3 (numpy 2.4.6). Benchmark on your target before relying on any latency figure.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 12: "Dynamic Scheduling, CUDA Graphs, and Device-Initiated Kernel Orchestration", pp. 489-500 (CUDA Graphs, capture/replay, Table 12-1, dynamic graph update, device-initiated launch, conditional nodes).
NVIDIA, CUDA C++ Programming Guide, CUDA Graphs and Device Graph Launch: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html
NVIDIA Technical Blog, Getting Started with CUDA Graphs: https://developer.nvidia.com/blog/cuda-graphs/
NVIDIA Technical Blog, Employing CUDA Graphs in a Dynamic Environment: https://developer.nvidia.com/blog/employing-cuda-graphs-in-a-dynamic-environment/
PyTorch, CUDA semantics: CUDA Graphs: https://docs.pytorch.org/docs/stable/notes/cuda.html#cuda-graphs
PyTorch, torch.cuda.make_graphed_callables: https://docs.pytorch.org/docs/stable/generated/torch.cuda.make_graphed_callables.html
PyTorch Blog, Accelerating PyTorch with CUDA Graphs: https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/