Markdown

Persistent kernels and megakernels¶

Scope: launch one grid sized to the GPU and loop over a work queue to amortize per-launch overhead and keep SMs resident (persistent kernels); fuse a whole model or a full decode step into a single launch (megakernels); and the trade-offs against many small kernels and against CUDA Graphs.

What it is¶

A persistent kernel inverts the usual CUDA launch model. The default model is one wave per work item: you size the grid to the problem (gridDim = ceil(n / blockDim)), the driver schedules blocks onto SMs in waves, blocks run to completion and retire, and you launch again for the next batch of work. A persistent kernel instead launches once with a grid sized to the hardware (roughly as many blocks as the GPU can hold co-resident), and each block then loops, pulling work from a queue or striding over the data until the work is drained. The blocks stay resident on the SMs for the whole computation instead of being created and destroyed per item.

The simplest persistent pattern is the grid-stride loop: each thread starts at its global index and advances by the total number of threads in the grid, so a fixed grid covers an arbitrarily large input. (NVIDIA CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops)

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // stride = total threads in the grid; one fixed grid covers any n
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

NVIDIA lists the relevant properties directly: reusing threads across multiple elements "amortizes thread creation and destruction cost," the grid size becomes a tuning knob decoupled from problem size, addressing within a warp stays unit-stride so you "get maximum memory coalescing," and you can run the same kernel with one block of one thread to validate against a serial reference. (NVIDIA Grid-Stride Loops)

A grid-stride loop alone does not let blocks coordinate. To make a persistent grid act as one cooperating unit (a global barrier between phases, a shared work queue with cross-block visibility), you need grid-wide synchronization, which requires a cooperative launch: cudaLaunchCooperativeKernel, with the device-side handle cooperative_groups::this_grid() and a grid.sync() barrier. Grid sync is only legal when every block is simultaneously resident, so the launch is restricted to a grid that provably fits on the GPU. (NVIDIA CUDA Programming Guide: Cooperative Groups)

A megakernel takes persistence to its conclusion: fuse an entire model forward pass (or a full autoregressive decode step) into one kernel launch. Instead of the host dispatching dozens or hundreds of small kernels per token (per-layer attention, GEMMs, norms, activations, and the collectives between them), the whole graph runs inside a single resident grid that schedules its own sub-tasks across SMs and keeps intermediates in on-chip memory or global memory it never hands back to the host between ops. Production megakernels are compiler-generated rather than hand-written: the Mirage Persistent Kernel (MPK) "automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs all necessary computation and communication within a single kernel launch." (Mirage Persistent Kernel)

Why it matters¶

The cost being attacked is per-launch overhead and the host critical path, not arithmetic. Every kernel launch carries fixed CPU-side cost (argument marshalling, driver dispatch, queue submission) plus a scheduling gap on the GPU between consecutive launches. When kernels are large, that overhead is noise. When kernels are small and numerous (the defining shape of LLM decode, where each per-token forward pass is a long chain of tiny, memory-bound ops over a single row or small batch), the host cannot issue work fast enough to keep the SMs busy, and the GPU idles in the gaps. This is the launch-bound regime; confirm it with the roofline / arithmetic-intensity view and an Nsight profiling pass before reaching for any of these techniques. (Fregly, AI Systems Performance Engineering)

Persistent kernels remove the per-item launch entirely: one launch, then the resident grid drains all work. Megakernels go further and remove the inter-op launches and host round-trips inside a single forward pass, and they additionally let intermediate results stay on-chip instead of spilling to global memory between every operator. The reported payoff for the fully-fused approach is large: MPK "reduces LLM inference latency by 1.2x to 6.7x" versus kernel-per-operator serving, attributing the win to eliminated launch overhead, reduced intermediate memory traffic, and overlap of otherwise-serialized work. (Mirage Persistent Kernel) Treat that range as a published result for their workloads and hardware, not a guarantee for yours.

Persistence also enables optimizations that per-wave launches cannot express: a global reduction or a multi-phase algorithm where the whole grid must see a consistent intermediate before the next phase, done with one launch and grid.sync() barriers instead of launch-sync-launch. (NVIDIA Cooperative Groups)

When it is needed (and when not)¶

Reach for a persistent kernel when:

The workload is launch-bound: many small, short kernels where host dispatch rivals or exceeds GPU work (decode, small-batch inference, fine-grained graph algorithms). Verify with a profiler first.
You need cross-block coordination within one launch (a global barrier, a producer/consumer work queue, or a multi-phase algorithm) that would otherwise force a launch-sync-launch sequence.
You want a single grid to absorb a variable or oversized input (grid-stride) and to make grid size an independent tuning knob.

Reach for a megakernel when:

You are squeezing the last latency out of low-batch decode and have already exhausted kernel fusion, CUDA Graphs, and tensor-core mixed precision, and a profiler still shows launch/scheduling gaps and global-memory round-trips between operators.
You can adopt a compiler/runtime (e.g. Mirage) that generates the fused kernel; hand-writing and maintaining a model-wide megakernel by hand is a large, brittle undertaking.

Do not use these (or expect little) when:

Kernels are large and already saturate the GPU: there is no launch overhead to amortize, and forcing persistence only adds complexity. This is the prefill regime (long sequences, large batches): compute-bound, where many-small-kernel overhead is negligible.
The work is irregular and one-shot: a single large kernel with a well-sized grid is simpler and just as fast.
CUDA Graphs already solve it. Graphs eliminate per-launch CPU overhead for a static, repeated pipeline with zero kernel rewriting (see the contrast below). Try graphs first; a persistent kernel or megakernel is the heavier hammer.
You cannot bound occupancy and resources. A cooperative/persistent grid must fit on the GPU; if your block needs so many registers or so much shared memory that only a few blocks fit per SM, a persistent grid may underutilize the device or fail to launch.

Persistent kernels vs CUDA Graphs vs many small kernels¶

Approach	Launches	What it removes	Rewrite cost	Shape flexibility
Many small kernels (default)	one per op, per item	nothing	none	full
CUDA Graphs	one `cudaGraphLaunch` per replay	per-launch CPU dispatch for a fixed, captured pipeline	low (capture/replay, static buffers)	low: one graph per captured shape
Persistent kernel	one cooperative launch	per-item launch + enables in-grid coordination	medium (work-queue / grid-stride loop, occupancy sizing)	high (grid-stride absorbs size)
Megakernel	one launch for the whole pass	inter-op launches + host round-trips + inter-op global-memory traffic	high (usually compiler-generated)	low/medium (compiled per model/shape)

CUDA Graphs and persistent kernels are complementary, not competing. Graphs keep your existing kernels and remove only the host cost of replaying a fixed DAG; a persistent kernel changes the kernel itself to stay resident and pull work, removing the launches outright and allowing cross-block coordination a graph cannot express. Megakernels are the most aggressive and the most expensive to build and maintain. (Fregly, AI Systems Performance Engineering)

How: implement, integrate, maintain¶

Size the grid to the GPU¶

A persistent grid must not exceed what the GPU can hold co-resident; for a cooperative launch this is a hard requirement. Compute the safe size from occupancy: ask how many blocks of your kernel fit per SM, then multiply by the SM count. The total number of cooperative blocks cannot exceed cudaOccupancyMaxActiveBlocksPerMultiprocessor times multiProcessorCount. (NVIDIA CUDA Programming Guide: Cooperative Groups)

int dev = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, dev);

int numThreads = 256;
int numBlocksPerSm = 0;
// Occupancy: max resident blocks of this kernel per SM at this block size / smem
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocksPerSm, my_persistent_kernel, numThreads, /*dynamicSmem=*/0);

// Persistent grid = exactly fills the device; never larger for a cooperative launch
dim3 dimGrid(prop.multiProcessorCount * numBlocksPerSm, 1, 1);
dim3 dimBlock(numThreads, 1, 1);

This multiProcessorCount * numBlocksPerSm figure is the canonical persistent/cooperative grid size. (NVIDIA CUDA Programming Guide: Cooperative Groups; NVIDIA, CUDA Runtime API, cudaLaunchCooperativeKernel)

Launch cooperatively and synchronize the whole grid¶

Grid-wide grid.sync() is only valid under cudaLaunchCooperativeKernel, and only on a device whose cudaDevAttrCooperativeLaunch attribute is non-zero (compute capability 6.0 and higher, on Linux without MPS, Linux + MPS on compute capability 7.0+, or Windows). (NVIDIA CUDA Programming Guide: Cooperative Groups)

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void persistent_phased(float *data, int n, int phases) {
    cg::grid_group grid = cg::this_grid();   // all threads in the grid

    for (int p = 0; p < phases; ++p) {
        // Phase work: persistent grid-stride over the data
        for (int i = grid.thread_rank(); i < n; i += grid.num_threads()) {
            data[i] = phase_step(data[i], p);
        }
        grid.sync();   // global barrier: all blocks see phase p before phase p+1
    }
}

// Host: gate on device support, size to the GPU, then cooperative-launch
int coop = 0;
cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
if (!coop) { /* fall back to launch-sync-launch */ }

void *args[] = { &d_data, &n, &phases };
cudaLaunchCooperativeKernel(
    (void *)persistent_phased, dimGrid, dimBlock, args, /*sharedMem=*/0, stream);

If the requested grid exceeds the co-resident limit, the cooperative launch fails rather than silently serializing, which is why occupancy-based sizing above is mandatory, not optional. (NVIDIA, CUDA Runtime API, cudaLaunchCooperativeKernel)

A work-queue persistent kernel¶

When work items are heterogeneous, replace the grid-stride loop over a flat array with a shared work queue: each block atomically claims the next item index from a global counter and processes it, looping until the queue is drained. This load-balances irregular work across resident blocks without relaunching.

__global__ void persistent_worker(const Task *tasks, int num_tasks,
                                  unsigned int *next, Result *out) {
    while (true) {
        // One thread per block claims the next task; broadcast to the block
        __shared__ unsigned int idx;
        if (threadIdx.x == 0) idx = atomicAdd(next, 1u);
        __syncthreads();
        if (idx >= (unsigned)num_tasks) return;   // queue drained: block exits

        process_task(tasks[idx], &out[idx]);       // block-wide work on one task
    }
}

Reset *next to 0 before launch. Blocks self-terminate when the counter passes num_tasks, so the single launch covers all work regardless of how the items distribute. (Fregly, AI Systems Performance Engineering)

Megakernels: prefer a compiler¶

A whole-model megakernel coordinates many operator types across SMs inside one launch, often with an in-kernel scheduler and an SM-level task graph. This is impractical to write and maintain by hand for a real model; the production path is a compiler/runtime that lowers a tensor program into the fused kernel and generates the per-task CUDA. Mirage (MPK) is the reference open implementation: it compiles an LLM into a single persistent megakernel, with its in-kernel runtime distributing tasks across SMs (its worker/scheduler count "must match the number of physical SMs"). (Mirage Persistent Kernel) Adopt such a toolchain rather than rolling your own, and validate numerics against the unfused reference.

Persistent execution flow¶

flowchart TD
    A["Profile: confirm launch-bound (small kernels, host gaps)"] --> B{"Static repeated pipeline?"}
    B -->|"yes, no cross-block coord needed"| C["Use CUDA Graphs first (cheaper)"]
    B -->|"need in-grid coordination or per-item launches dominate"| D["Size grid: occupancy x multiProcessorCount"]
    D --> E["Check cudaDevAttrCooperativeLaunch"]
    E --> F["cudaLaunchCooperativeKernel: one resident grid"]
    F --> G["Loop: grid-stride or atomic work-queue"]
    G --> H{"Phase boundary?"}
    H -->|"yes"| I["grid.sync() global barrier"]
    I --> G
    H -->|"queue drained"| J["Blocks self-terminate"]
    D --> K["Whole-model fusion needed? compiler-generated megakernel"]

Maintain and verify¶

Verify the win on the GPU timeline. In an Nsight profiling workflow, Nsight Systems should show the per-item launch gaps collapsing into one continuous resident kernel; bracket regions with NVTX. Never assume a published speedup (the MPK 1.2x–6.7x range) transfers to your model and hardware. Measure it. (Mirage Persistent Kernel)
Watch occupancy regressions. A persistent kernel that grows its register or shared-memory footprint drops numBlocksPerSm, shrinking the resident grid and starving the device. Re-run cudaOccupancyMaxActiveBlocksPerMultiprocessor after any change and see CUDA occupancy tuning.
Mind grid-sync portability. grid.sync() requires cooperative-launch support and the grid to fit; keep a launch-sync-launch fallback for devices or shapes where the cooperative launch cannot be satisfied. (NVIDIA Cooperative Groups)
Avoid deadlock. Every block reaching a grid.sync() must reach it; divergent early-exit before a shared barrier hangs the grid. Drain work then barrier, or barrier then exit; never split across blocks.

Reference templates only; APIs, flags, and numbers are grounded in the cited official NVIDIA docs and the Mirage source, with the book cited for conceptual framing. Not hardware-tested here. Benchmark on your target before relying on any figure. The Fregly chapter is cited at chapter granularity; page-level anchors were not independently verifiable for this page.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly) — chapters on dynamic scheduling, persistent kernels, and device-initiated/megakernel orchestration (launch-overhead framing, decode launch-bound regime, persistent work queues). Cited at chapter granularity.
NVIDIA, CUDA C++ Programming Guide — Cooperative Groups (this_grid(), grid.sync(), cudaLaunchCooperativeKernel, cudaDevAttrCooperativeLaunch, co-residency and grid-size constraint): https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cooperative-groups.html
NVIDIA, CUDA Runtime API — cudaLaunchCooperativeKernel, cudaOccupancyMaxActiveBlocksPerMultiprocessor: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html
NVIDIA Technical Blog, CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
NVIDIA Technical Blog, CUDA Pro Tip: Occupancy API Simplifies Launch Configuration: https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
Mirage Persistent Kernel (MPK) — compiling LLMs into a single megakernel; 1.2x–6.7x latency reduction; in-kernel SM-level runtime: https://github.com/mirage-project/mirage