Skip to content
Markdown

Dynamic parallelism and device-initiated launch

Scope: launching work from the device, via CUDA Dynamic Parallelism (a kernel launches kernels) and device graph launch (a kernel launches a preinstantiated CUDA graph), to keep orchestration on the GPU and cut CPU launch latency for data-dependent work such as MoE routing and variable-sized tasks.

What it is

Normally every kernel and every graph is launched by the host: the CPU decides what runs next, configures the launch, and submits it across the PCIe/NVLink boundary. For data-dependent control flow (where the next launch's shape or even existence depends on values the GPU just computed), this forces a device-to-host round trip per decision. The GPU stalls waiting for the CPU to read a result and dispatch the follow-up.

Two device-side mechanisms move that decision onto the GPU:

CUDA Dynamic Parallelism (CDP) lets a running kernel launch child kernels using the same triple-chevron syntax as host code: kernel<<<Dg, Db, Ns, S>>>(args), where Dg/Db are grid/block dims, Ns is dynamic shared memory, and S is a stream. Each thread can make an independent launch decision; device-side launches are asynchronous with respect to the launching thread, exactly like host launches. (NVIDIA CUDA Programming Guide, Dynamic Parallelism)

Device graph launch lets a running kernel launch an entire preinstantiated CUDA graph (a fixed DAG of kernel/memcpy/memset/child-graph nodes) based on runtime data. The graph is built and instantiated once on the host, uploaded to the device, then relaunched from device code with near-flat latency regardless of graph width. (NVIDIA Technical Blog, Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch)

Both keep the orchestration loop GPU-resident. The CPU is removed from the per-decision critical path. (Fregly, Ch. 12)

Why it matters

The cost being eliminated is launch latency on the device-to-host-to-device path, not compute. When a kernel must inspect its own output to decide the next launch, host orchestration pays a full round trip (submit, GPU completes, CPU observes, CPU dispatches) for every decision. On launch-bound, fine-grained, data-dependent workloads this dominates wall-clock time.

Device graph launch is the stronger of the two for repeated structured work: its launch latency is lower than host launch and stays roughly flat as the graph grows wider, where host launch cost scales with the number of nodes submitted. (NVIDIA Technical Blog, Device Graph Launch) The book reports roughly 2x lower launch latency versus host-side launch for the same graph; treat that as illustrative, not a target for your hardware. (Fregly, Ch. 12)

Concrete motivating cases:

  • MoE routing: a router kernel computes per-token expert assignments, then the GPU itself must dispatch the right expert kernels with token-count-dependent shapes. Keeping the dispatch on-device avoids a host round trip per routing step.
  • Variable / recursive work: tree and graph traversal, adaptive mesh refinement, sparse expansion, where the amount of follow-on work is unknown until the parent runs.
  • GPU-resident scheduler loops: a graph that relaunches itself (tail launch) to drive an iterative pipeline without returning to the CPU between iterations. (Fregly, Ch. 12)

When it is needed (and when not)

Reach for device-initiated launch when all of the following hold:

  • The next launch's parameters or existence is data-dependent on values the GPU just produced.
  • Profiling shows the host is on the critical path, with visible device-to-host-to-device gaps per decision (confirm with a Nsight profiling pass).
  • The decision granularity is fine: many small dependent launches, so the round-trip cost is paid often.

Prefer device graph launch over raw CDP when the dependent work is a fixed-structure DAG replayed with varying data: you get the graph's batched scheduling plus device launch, and the node set is validated once at instantiate.

Do not use these when:

  • The pipeline is static and host-known. Plain CUDA graphs with host replay are simpler and sufficient.
  • Kernels are large and already saturate the GPU. Launch latency is noise; there is nothing to amortize. Check the roofline / arithmetic-intensity regime first.
  • The work can be expressed as a single grid with persistent kernels or grid-stride loops. A persistent kernel that pulls work from a queue often beats per-item child launches and avoids device-launch overhead entirely. (Fregly, Ch. 12)
  • You need the parent to block on and read child results mid-kernel. As of CUDA 12.0 that pattern is gone (see below); restructure as a tail launch instead.

CDP carries real per-launch overhead (each device launch allocates from a device-side launch pool and consumes a pending-launch slot). Over-decomposing into millions of tiny child grids regresses performance. Launch coarse-grained children, or use a persistent kernel. (Fregly, Ch. 12)

How: implement, integrate, maintain

Build flags (CDP and device graph launch)

Device-side launch requires relocatable device code and the device runtime library:

nvcc -arch=sm_90 -rdc=true device_launch.cu -o device_launch -lcudadevrt
  • -rdc=true: relocatable device code (required to call the device runtime).
  • -lcudadevrt: link the CUDA device runtime.

(NVIDIA CUDA Programming Guide, Dynamic Parallelism)

CDP2: no mid-kernel device-side synchronize

The legacy cudaDeviceSynchronize() inside device code (CDP1), which let a parent block until its children finished, was deprecated in CUDA 11.6 and removed in CUDA 12.0 (CDP2). On compute capability 9.0+ only CDP2 exists; CDP1 is not available. For compute capability < 9.0 you can opt back into the old behavior at compile time with -DCUDA_FORCE_CDP1_IF_SUPPORTED, but do not build new code on it. (Host-side cudaDeviceSynchronize() is unaffected and still supported.) (NVIDIA CUDA Programming Guide, Dynamic Parallelism; NVIDIA Developer Forums, cudaDeviceSynchronize from device code is deprecated)

The CDP2 replacement pattern: instead of blocking on a child and consuming its result inline, launch the consumer as a tail-launch child into cudaStreamTailLaunch. Tail launches execute only after the launching grid and its fire-and-forget children complete, so the consumer sees the producer's writes without an in-kernel sync. (NVIDIA CUDA Programming Guide, Dynamic Parallelism)

Device-code named streams for CDP:

  • cudaStreamFireAndForget: child runs independently; parent does not wait.
  • cudaStreamTailLaunch: child runs after the parent grid (and its fire-and-forget children) complete.

CDP example: data-dependent child launch (CDP2)

#include <cuda_runtime.h>

__global__ void process_expert(const float* tokens, int count, int expert_id);

__global__ void reduce_results(const float* partials, int n_experts);

// Router decides, per block, how much follow-on work to launch.
__global__ void route(const float* tokens, const int* expert_counts,
                      const int* expert_offsets, int n_experts,
                      float* partials) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        for (int e = 0; e < n_experts; ++e) {
            int count = expert_counts[e];           // computed earlier on-device
            if (count == 0) continue;               // skip empty experts
            int blocks = (count + 255) / 256;
            // Fire-and-forget: independent expert kernels, sized at runtime.
            process_expert<<<blocks, 256, 0, cudaStreamFireAndForget>>>(
                tokens + expert_offsets[e], count, e);
        }
        // Tail launch: runs only after all fire-and-forget children finish,
        // so it observes their writes without an in-kernel synchronize (CDP2).
        reduce_results<<<1, 256, 0, cudaStreamTailLaunch>>>(partials, n_experts);
    }
}

Each process_expert child is sized from a count the GPU computed, with no host round trip. The reduce_results tail launch replaces the removed cudaDeviceSynchronize() + inline reduce. (NVIDIA CUDA Programming Guide, Dynamic Parallelism)

Memory consistency: parent and child share global memory with weak ordering. A child's global writes are guaranteed visible to the parent only at a tail-launch boundary; local and shared memory are private and never shared across the parent/child boundary. Pass data through global memory only. (NVIDIA CUDA Programming Guide, Dynamic Parallelism)

Device graph launch: instantiate, upload, relaunch from device

A graph launched from the device must be:

  1. Instantiated with the device-launch flag: cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch).
  2. Uploaded to the device before any device-side launch, either explicitly via cudaGraphUpload(exec, stream) or implicitly on a first host launch. A device launch with no prior upload errors out.
  3. Composed only of kernel, memcpy, memset, and child-graph nodes (the device-launchable node set).

Device graphs cannot be launched into ordinary streams; they use distinct named stream constants that select the mode. (NVIDIA CUDA Programming Guide, CUDA Graphs; NVIDIA Technical Blog, Device Graph Launch)

Host setup:

cudaGraphExec_t exec;
// Instantiate for device launch.
cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch);
// Upload device resources before the kernel that will relaunch it runs.
cudaGraphUpload(exec, stream);

scheduler<<<1, 1, 0, stream>>>(exec, /* state */ d_state);
cudaStreamSynchronize(stream);

Device-side launch from a kernel:

__global__ void scheduler(cudaGraphExec_t work, int* state) {
    // Fire-and-forget: dispatch immediately, independent of this grid.
    cudaGraphLaunch(work, cudaStreamGraphFireAndForget);

    if (should_continue(state)) {
        // Self-relaunch via tail launch builds a GPU-resident loop:
        // this graph runs again after the current execution completes.
        cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
    }
}

Device-side named stream constants and helper:

  • cudaStreamGraphFireAndForget: graph runs immediately, independent of the launching graph and of other fire-and-forget launches.
  • cudaStreamGraphTailLaunch: graph runs after the launching graph (and its fire-and-forget launches) complete.
  • cudaStreamGraphFireAndForgetAsSibling: fire-and-forget enqueued as a sibling of the launching graph rather than as its child.
  • cudaGetCurrentGraphExec(): returns the currently executing graph's handle, used for the self-relaunch (tail-launch-to-self) loop.

(NVIDIA Technical Blog, Device Graph Launch; NVIDIA CUDA Runtime API, Graph Management)

Pending-launch limits (do not exceed)

Device launch enforces hard pending-launch limits per execution. The sibling CUDA graphs page records the documented caps: up to 120 total fire-and-forget graphs per execution, up to 255 pending tail launches, and only one pending self-relaunch at a time. Build scheduler loops around the single-self-relaunch rule rather than queueing many. (NVIDIA Technical Blog, Device Graph Launch; Fregly, Ch. 12)

Control flow

flowchart TD
    A["Host: build graph, cudaGraphInstantiate (DeviceLaunch flag)"] --> B["Host: cudaGraphUpload(exec, stream)"]
    B --> C["Host: launch scheduler kernel"]
    C --> D{"Device: inspect runtime state"}
    D -->|"dispatch work now"| E["cudaGraphLaunch FireAndForget"]
    D -->|"continue loop"| F["cudaGraphLaunch TailLaunch on cudaGetCurrentGraphExec"]
    D -->|"done"| G["Return to host"]
    E --> D
    F --> D

Maintain and verify

  • Decide CDP vs device graph by structure: variable, irregular, possibly recursive child work fits CDP; a fixed-structure DAG replayed with runtime data fits device graph launch.
  • Watch device-launch overhead. Each child/graph launch consumes a pending-launch slot and pool memory; over-decomposition regresses throughput. Compare against a persistent kernel pulling from a work queue before committing to fine-grained device launches. (Fregly, Ch. 12)
  • Profile with a Nsight workflow: confirm the host-side per-decision gaps actually collapse and that device-launch overhead does not eat the savings. Do not assume the illustrative ~2x; measure on your target GPU.

Reference templates only. APIs, flags, and limits are grounded in the cited book chapter and official NVIDIA docs. Not hardware-tested here. Benchmark on your target before relying on any figure.

References

  • Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 12: "Dynamic Scheduling, CUDA Graphs, and Device-Initiated Kernel Orchestration" (CUDA Dynamic Parallelism, device-initiated and self-relaunching graph launch, pending-launch limits, ~2x device-launch latency figure).
  • NVIDIA, CUDA C++ Programming Guide — Dynamic Parallelism: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/dynamic-parallelism.html
  • NVIDIA, CUDA C++ Programming Guide — CUDA Graphs (Device Graph Launch): https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html
  • NVIDIA Technical Blog, Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch: https://developer.nvidia.com/blog/enabling-dynamic-control-flow-in-cuda-graphs-with-device-graph-launch/
  • NVIDIA, CUDA Runtime API — Graph Management: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html
  • NVIDIA Developer Forums, cudaDeviceSynchronize from device code is deprecated: https://forums.developer.nvidia.com/t/cudadevicesynchronize-from-device-code-is-deprecated/215900

Related: CUDA Graphs: Capture, Replay, and Launch Overhead · Persistent Kernels and Megakernels · CUDA Streams and Concurrency · CUDA Stream-Ordered Memory Allocator · Profiling GPUs: Nsight Systems and Nsight Compute · FlashAttention and Multi-Head Latent Attention · Roofline Model and Arithmetic Intensity · Glossary