Markdown

Instruction-level parallelism and warp stall analysis¶

Scope: instruction-level parallelism (ILP) on GPUs, using independent instructions within a thread to hide latency at low occupancy. Covers loop unrolling and multiple accumulators, the Nsight Compute warp-stall taxonomy (long scoreboard, barrier, not-selected, and friends), and the three bottleneck regimes (latency-, bandwidth-, compute-bound) the stall breakdown distinguishes.

What it is¶

A GPU hides instruction and memory latency two ways, and they are independent. Thread-level parallelism (TLP) hides latency across warps: when one warp stalls, the scheduler issues from another resident warp; this is the occupancy story. Instruction-level parallelism (ILP) hides latency within a single thread: if a thread issues several mutually independent instructions back-to-back, the warp scheduler can keep issuing past an outstanding long-latency operation instead of stalling on it.

The two are substitutes for latency hiding, not just additive. Volkov's classic result is that a kernel can reach near-peak throughput at low occupancy by raising ILP, for example by giving each thread several independent accumulators so multiple FMAs and loads are in flight per thread, rather than relying on many warps to fill the issue slots.¹ This matters because occupancy is bounded by registers and shared memory: a register-heavy kernel that cannot get many warps resident can still saturate the pipelines if each warp carries enough independent work.

The mechanism is the scoreboard. Each thread's outstanding loads and dependent results are tracked; an instruction that consumes a not-yet-ready result cannot issue until the scoreboard clears. Independent instructions placed between a load and its first use give the scheduler something to issue while the load is in flight. Unrolling and multiple accumulators are the two standard ways to manufacture that independence.

Whether a kernel is starved for ILP, for warps, for bandwidth, or for compute is read directly from the Warp State Statistics section of Nsight Compute, which samples why warps could not issue and attributes the cycles to named stall reasons (below).

Why it matters¶

The scheduler issues at most one instruction per warp per cycle per sub-partition. A cycle in which no warp is eligible is a wasted issue slot: pipeline throughput you paid for and did not use. Nsight Compute's Issue Slot Utilization (eligible/issued warps per scheduler) and Warp Cycles Per Issued Instruction quantify exactly this loss. Low ILP shows up as long runs of Stall Long Scoreboard cycles with no other eligible warp to cover them: the SM sits idle behind in-flight memory.

Two practical consequences:

At low occupancy, ILP is the only lever left. If registers cap you to ~25–50% occupancy, you cannot hide DRAM latency with more warps. Independent loads and accumulators per thread are what keep the pipeline fed. Refusing to add warps and instead adding ILP is often the correct trade: fewer warps, more registers each, more independent work per warp.
The stall taxonomy tells you which optimization is even relevant. Long Scoreboard-dominated → memory-latency-bound, fix with ILP or coalescing. MIO/LG Throttle-dominated → bandwidth-bound, fix with tiling/fusion. Not Selected-dominated with high issue utilization → you are compute-bound and already fast. Optimizing the wrong regime wastes effort; the warp-state breakdown prevents that. See roofline for the bound-classification framing.

When it is needed (and when not)¶

Reach for ILP when the profiler says the kernel is latency-bound: Stall Long Scoreboard (or Short Scoreboard) dominates the warp-state breakdown, achieved occupancy is moderate-to-low, eligible-warps-per-scheduler is below the scheduler's active-warp limit (the SM keeps running out of ready warps), and achieved memory bandwidth is well under peak. That signature means latency, not bytes/sec and not FLOPs, is the binding constraint, and more independent in-flight work hides it.

Specifically, raise ILP when:

Occupancy is structurally capped (high register pressure, large shared-memory tiles) so you cannot add warps cheaply; give each thread more independent work instead.
Long Scoreboard is high but DRAM throughput is low: the memory system is idle behind serialized dependent loads.

Do not add ILP when:

The kernel is bandwidth-bound: MIO Throttle / LG Throttle high, DRAM at ~80%+ of peak HBM. More in-flight loads cannot move more bytes/second; raise arithmetic intensity instead (tiling, fusion, coalescing, lower precision via tensor cores).
The kernel is compute-bound: pipes near peak, high issue utilization, Not Selected dominant. The warps are already issuing every cycle; extra ILP only raises register pressure and can lower occupancy for no gain.
Barrier dominates: the bottleneck is __syncthreads() imbalance, not latency. Fix the imbalance or reduce synchronization; ILP is irrelevant.

ILP costs registers (each accumulator and each in-flight load is live state), so it competes with occupancy for the register file. The right balance is empirical: sweep unroll factors and accumulator counts, time each, and re-profile. See CUDA Occupancy Tuning for the occupancy side of the same trade.

How: implement, integrate, maintain¶

1. The warp-stall taxonomy (Nsight Compute)¶

ncu periodically samples each warp's reason for not issuing and reports the distribution in Warp State Statistics. The load-bearing reasons, with their canonical meanings:

Stall reason	Meaning	Typical fix
Long Scoreboard	Waiting on a data dependency for a global / local / texture / surface memory load. The stall is charged to the consumer instruction but caused by the in-flight load.²	More ILP (independent loads before first use), coalescing, cache the value in shared memory.
Short Scoreboard	Waiting on a shorter-latency dependency — typically a shared-memory (`LDS`) access or an MUFU/SFU result (e.g. `exp`).³	Reduce bank conflicts, hide with independent work, restructure math.
Barrier	Warp arrived at a `__syncthreads()` (or named barrier) and is waiting for other warps in the block.⁴	Balance per-warp work, reduce sync points, split blocks.
Not Selected	Warp was eligible but the scheduler picked another warp this cycle.⁵	None needed — this means you have surplus warps. It is the signature of a healthy compute-bound kernel and an opportunity to lower occupancy.
Wait	Waiting on a fixed-latency execution dependency (a previous arithmetic instruction's result).	More ILP — independent arithmetic between producer and consumer.
MIO Throttle	Backed up at the Memory-Input/Output instruction queue (shared mem, special-function, dynamic-branch units saturated).	Reduce pressure on the throttled pipe; this is throughput, not latency.
LG Throttle	Local/Global load-store queue saturated — the LSU is the bottleneck.	Fewer / wider / coalesced accesses; this signals bandwidth/throughput limits, not ILP starvation.
No Instruction	Front-end has no instruction (I-cache miss, or just-issued a branch).	Usually code-size / control-flow; rarely the top reason.

Interpretation rule: Long Scoreboard high and Not Selected low and bandwidth low ⇒ latency-bound, ILP/occupancy will help. Not Selected high ⇒ issue-bound / compute-bound, you are already fast. Throttle reasons high ⇒ a specific pipe is saturated (bandwidth-bound), raise arithmetic intensity. Cross-check with Compute (SM) Throughput % vs Memory Throughput % in the Speed Of Light section before concluding.

The dominant stall reason routes you to one regime, and each regime has a different fix:

flowchart TD
    A["Top warp-stall reason (ncu Warp State Statistics)"] --> B{"Which reason dominates?"}
    B -->|"Long / Short Scoreboard, low DRAM %"| C["Latency-bound: add ILP (accumulators, unroll) or coalesce"]
    B -->|"MIO / LG Throttle, DRAM near peak"| D["Bandwidth-bound: raise arithmetic intensity (tile, fuse, lower precision)"]
    B -->|"Not Selected, high issue utilization"| E["Compute-bound: already fast; consider lowering occupancy"]
    B -->|"Barrier"| F["Sync imbalance: balance work, cut sync points"]

Profile the dominant kernel (named first by Nsight Systems), compiling with -lineinfo so stalls attribute to source lines:

ncu --set full --section WarpStateStats --section SchedulerStats \
    --section SpeedOfLight -o ilp_report ./my_app

2. Manufacture ILP: multiple independent accumulators¶

A single accumulator serializes the FMA chain: each acc += ... waits on the previous one (a Wait / Short Scoreboard stall). Splitting into several independent accumulators lets the scheduler issue them back-to-back; they are summed once at the end. This is the textbook ILP transform.¹

// Latency-bound: one dependency chain, each FMA waits on the prior result.
__global__ void dot_serial(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += stride) {
        acc += a[i] * b[i];            // each iteration depends on the last
    }
    atomicAdd(out, acc);
}

// Higher ILP: four independent chains -> four FMAs and four loads in flight.
__global__ void dot_ilp4(const float* a, const float* b, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {   // 4 independent lanes of work
        acc0 += a[i]              * b[i];
        acc1 += a[i + stride]     * b[i + stride];
        acc2 += a[i + 2 * stride] * b[i + 2 * stride];
        acc3 += a[i + 3 * stride] * b[i + 3 * stride];
    }
    for (; i < n; i += stride) {                     // remainder
        acc0 += a[i] * b[i];
    }
    atomicAdd(out, (acc0 + acc1) + (acc2 + acc3));
}

Note: floating-point addition is not associative, so reassociating the sum into independent partials changes rounding. This is the same reordering the compiler already performs under -ffast-math / --use_fast_math for reductions; accept it where reproducibility tolerates it, or keep the count modest.

3. Manufacture ILP: loop unrolling with `#pragma unroll`¶

Unrolling exposes independent iterations to the scheduler and amortizes loop overhead. The CUDA front-end supports a precise pragma placed immediately before the loop:⁶

#pragma unroll (no argument): fully unroll a loop whose trip count is a compile-time constant; if the trip count is not constant, the loop is not unrolled.
#pragma unroll N (positive integer): unroll N times.
#pragma unroll 1: explicitly prevent the compiler from unrolling this loop.

__global__ void stencil(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float acc = 0.0f;
    #pragma unroll          // K is constant -> fully unrolled, loads independent
    for (int k = 0; k < K; ++k) {
        acc += in[idx + k];
    }
    out[idx] = acc;
}

Unrolling raises register pressure (more live values), so it trades against occupancy: profile achieved occupancy and runtime after changing the factor. Over-unrolling can spill registers to local memory (off-chip, hundreds of cycles), a net loss. Sweep the factor; do not assume more is better.

4. Frameworks¶

PyTorch users rarely write this by hand: cuBLAS/cuDNN GEMM and convolution kernels are already ILP-tuned (register-blocked with multiple accumulators per thread), and torch.compile / Triton-generated kernels expose unroll and accumulator-count knobs to their autotuner. The ILP lens still pays off when reading a torch.compile or Triton profile: a custom kernel showing Long Scoreboard-dominated stalls at low occupancy is the classic candidate for more accumulators / a larger register tile. See Frameworks and Performance Optimization and Tuning.

5. Maintain¶

Re-profile after every change: fixing latency usually exposes the next bottleneck (bandwidth or compute), and the optimal unroll/accumulator count shifts across architectures (register-file size and pipe latencies differ from Ampere to Hopper to Blackwell). Pin the verified counts behind a comment citing the profile, and re-sweep when retargeting. Cross-check with GPU Diagnostics and Validation for fleet signals and goodput for why wasted issue slots erode useful throughput.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 8 "Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism" — ILP vs occupancy as substitute latency-hiding mechanisms, multiple accumulators / loop unrolling, the warp-stall breakdown, and the latency-/bandwidth-/compute-bound regimes. The book presents these as a profile-driven workflow and directs readers to NVIDIA tooling docs for exact metric definitions, which are used below as the authoritative source. (Page numbers omitted: this page was authored without a paginated copy of the book to hand; all hardware-specific claims are grounded in the NVIDIA docs cited below, and no figure here is hardware-tested by the author.)
NVIDIA, Nsight Compute — Kernel Profiling Guide: Warp State Statistics, Scheduler Statistics, and stall-reason definitions (Long/Short Scoreboard, Barrier, Not Selected, Wait, MIO/LG Throttle, No Instruction): https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html
NVIDIA Developer Forums (NVIDIA staff, citing the profiler docs), "Long scoreboard stall meanings" — Long Scoreboard = waiting on a global/local/texture/surface load data dependency: https://forums.developer.nvidia.com/t/long-scoreboard-stall-meanings/230738
NVIDIA, CUDA C++ Programming Guide — #pragma unroll semantics (C++ language extensions): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
Vasily Volkov, "Better Performance at Lower Occupancy" / "Unrolling Parallel Loops" (NVIDIA GTC/SC tutorial) — the foundational result that ILP hides latency at low occupancy via independent instructions and multiple accumulators: https://www.nvidia.com/docs/IO/116711/sc11-unrolling-parallel-loops.pdf
NVIDIA, CUDA C++ Best Practices Guide — instruction-level optimization and latency hiding: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Volkov, "Unrolling Parallel Loops" (NVIDIA SC11): a kernel can approach peak throughput at low occupancy by increasing independent work per thread (multiple accumulators / unrolling), since ILP and TLP are substitute mechanisms for hiding latency. ↩↩
Nsight Compute Profiling Guide / NVIDIA Developer Forums: "Long Scoreboard indicates waiting on a data dependency for local, global, texture, or surface load." The stall is charged to the consuming instruction but caused by the producer load. ↩
Nsight Compute Profiling Guide: Short Scoreboard — waiting on a shorter-latency dependency such as a shared-memory access or an MUFU/SFU (special-function) result. ↩
Nsight Compute Profiling Guide: Barrier — the warp is waiting at a synchronization barrier (__syncthreads() / named barrier) for sibling warps in the block. ↩
Nsight Compute Profiling Guide: Not Selected — the warp was eligible to issue but another warp was selected this cycle; a high share signals surplus eligible warps (room to lower occupancy without losing performance). ↩
NVIDIA, CUDA C++ Programming Guide — #pragma unroll: bare pragma fully unrolls a constant-trip-count loop (and does nothing for a non-constant trip count); #pragma unroll N unrolls N times; #pragma unroll 1 prevents unrolling. ↩