Markdown

CUDA occupancy tuning¶

Scope: occupancy as active warps versus the SM hardware maximum, the three resource limiters (registers/thread, shared memory/block, block size), theoretical versus achieved occupancy, and how to drive it with the Occupancy API and __launch_bounds__, including when more occupancy stops paying off.

flowchart TB
    Regs["Registers / thread<br/>(cap registers to fit more)"] --> LimR["Block Limit Registers"]
    Smem["Shared memory / block<br/>(shrink tiles to fit more)"] --> LimS["Block Limit Shared Mem"]
    Block["Block size: threads / block<br/>(multiple of 32)"] --> LimW["Block Limit Warps"]
    LimR --> Bind["Binding limiter = min of the three"]
    LimS --> Bind
    LimW --> Bind
    Bind --> Theo["Theoretical occupancy %"]
    Theo -->|"minus tail effects, load imbalance, scheduling"| Ach["Achieved occupancy % (per SM)"]
    Ach --> Hide["Hides latency until ~60-70%, then returns diminish"]

What it is¶

Occupancy is the ratio of warps resident on a streaming multiprocessor (SM) to the hardware maximum warps that SM can hold. It is the GPU's primary latency-hiding mechanism: when one warp stalls on a long-latency event (a global-memory load, a dependent ALU result, a barrier), the warp scheduler switches to another resident warp to keep the issue pipelines busy. More resident warps means more candidates to cover those stalls.

On modern NVIDIA datacenter SMs the per-SM ceiling is 64 resident warps = 2,048 threads. This limit has held across Ampere, Hopper, and Blackwell (Blackwell B200 is compute capability 10.0). Generation-over-generation gains come from more SMs, larger caches, and multi-die packaging, not a higher per-SM warp count. Not every device matches: the RTX PRO 6000 and the GB10 superchip in DGX Spark expose a higher compute capability (12.x) but cap at 48 warps (1,536 threads) per SM. Confirm the exact limit for your target with Nsight Compute's Occupancy analysis rather than assuming.

Two numbers matter, and they are different:

Theoretical occupancy: the maximum resident warps the launch could achieve given the kernel's register and shared-memory demands and its block size. A static, compile-time/launch-config property.
Achieved occupancy: the average fraction of warp slots actually filled during execution, measured per SM. It is theoretical occupancy minus real-world losses: tail effects, load imbalance between warps/blocks, and scheduling overhead. Nsight Compute reports both; the gap tells you whether you have a configuration problem (theoretical low) or an execution problem (achieved well below theoretical).

What caps theoretical occupancy is whichever of these three SM resources runs out first:

Registers per thread. Each Blackwell SM has a 64K 32-bit register file (256 KB), and the hardware exposes at most 255 registers per thread. registers/thread × threads/SM must fit the register file. A register-heavy kernel forces the SM to schedule fewer warps.
Shared memory per block. Blackwell provides 228 KB (227 KB usable) of unified shared memory / L1 per SM; a single block can request up to 227 KB of dynamic shared memory. Large per-block tiles mean fewer blocks fit, so fewer warps reside.
Block size (threads per block). Block dimension must be a multiple of the 32-thread warp; max 1,024 threads (32 warps) per block; max 32 resident blocks per SM. A 1,024-thread block that is otherwise unconstrained lets only 2 blocks reside (2 × 1,024 = 2,048); 256-thread blocks let up to 8 blocks reside, which can fill all 64 warp slots with finer granularity.

Nsight Compute's Occupancy section names the binding constraint directly ("Limited by max registers per thread", "Limited by shared memory per block", or "Limited by thread count"), alongside Block Limit Registers, Block Limit Shared Mem, Block Limit Warps, and Theoretical Occupancy %.

Why it matters¶

A kernel with too few resident warps cannot hide latency: when a warp issues a global load to fetch A[idx], all 32 lanes stall for hundreds of cycles, and if there is no other ready warp the SM idles. This is the latency-bound regime: low achieved bandwidth well below peak, with a high fraction of "Stall: Not Selected" or "No Eligible" cycles in Nsight Compute's Warp State Statistics. Raising occupancy is the direct fix: more warps means the scheduler almost always has a ready candidate.

The relationship is strongly non-linear and saturates. For a memory-bound kernel, going from ~10% to ~50% achieved occupancy can be a large win because you finally have enough warps to cover DRAM latency. Going from 50% to 100% typically yields little; other ceilings (memory-bandwidth saturation, cache misses, execution-dependency stalls) start to dominate. Past a moderate occupancy (roughly 60–70%) returns diminish, and effort is better spent on memory coalescing, higher instruction-level parallelism (ILP), and on-chip data reuse.

Occupancy is a means (hiding latency), not the goal. Pushing toward 100% can actively hurt if it is bought by capping registers so hard that the compiler spills to local memory (backed by global HBM, hundreds to >1,000 cycles), or by shrinking tiles until memory coalescing degrades. The win from extra warps must exceed the loss from leaner per-thread resources, a trade-off, not a monotonic gain.

When it is needed (and when not)¶

Tune occupancy when the profiler says occupancy is the limiter, not before. The diagnostic chain is: profile with Nsight Compute, read the warp-stall breakdown, then act.

Raise occupancy when:

The kernel is latency-bound: low achieved bandwidth, dominant "Not Selected" / scoreboard-wait cycles, and eligible warps per cycle is below the scheduler's active-warps limit (the SM keeps running out of ready warps). More resident warps give the scheduler something to switch to.
The kernel is underutilizing the GPU: both FLOPS and bandwidth low, timeline gaps, not enough blocks to cover all SMs. Here the fix is launching more work, not resource tuning.
Nsight Compute flags "Limited by max registers per thread" or "Limited by shared memory per block" and the kernel is otherwise starved for warps.

Do not chase higher occupancy when:

The kernel is memory-bandwidth bound (achieved bandwidth at ~80%+ of peak HBM): adding warps cannot move more bytes/second. Raise arithmetic intensity instead: tiling (shared-memory tiling), kernel fusion, better coalescing, or lower precision.
The kernel is compute-bound (tensor cores or ALUs near peak): each warp already does heavy independent work. Compute-bound kernels frequently hit peak at ~50% occupancy because per-warp ILP covers latency without needing many warps; this is normal, not a defect.
Eligible warps already meet or exceed the scheduler limit but the kernel is still slow: the bottleneck is elsewhere (bandwidth, exec-dependency stalls), and more warps will not help.

When eligible-warps-per-cycle already meets the scheduler's active-warps-per-scheduler limit, prefer raising per-warp ILP (independent loads, multiple accumulators, loop unrolling) over adding threads. See roofline / arithmetic intensity for the bound-classification framing and goodput for why predicated-off / wasted issue slots erode useful throughput.

How: implement, integrate, maintain¶

1. Start from a sane block size¶

Use a multiple of 32 to avoid partially filled warps (a 33-thread block consumes two warp slots but uses 1/32 of the second). 256 threads (8 warps) is the standard starting point, small enough to let several blocks co-reside, large enough to hide latency. For Blackwell, 256–512 threads/block is a reasonable range to sweep. Round the grid up so every element is covered:

__global__ void scale_kernel(float* input, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {                 // bounds check: n need not be a multiple of blockDim
        input[idx] *= 2.0f;
    }
}

const int threadsPerBlock = 256;                                   // multiple of 32
const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
scale_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_input, n);

2. Let the Occupancy API pick the block size¶

cudaOccupancyMaxPotentialBlockSize heuristically returns the block size that maximizes theoretical occupancy given the kernel's actual register and shared-memory usage, plus a minGridSize (smallest grid that fills the device). This avoids guessing.

int minGridSize = 0;
int bestBlockSize = 0;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize, &bestBlockSize,
    scale_kernel,
    /* dynamicSMemBytes = */ 0,
    /* blockSizeLimit   = */ 0);   // 0 = no upper bound on block size

int gridSize = (n + bestBlockSize - 1) / bestBlockSize;
scale_kernel<<<gridSize, bestBlockSize>>>(d_input, n);

To check how many blocks of a chosen size actually co-reside per SM (and compute occupancy = activeBlocks × blockSize / 32 / maxWarpsPerSM):

int activeBlocks = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &activeBlocks, scale_kernel,
    /* blockSize        = */ bestBlockSize,
    /* dynamicSMemBytes = */ 0);

The Occupancy API targets theoretical occupancy. Always validate the suggested block size with real timing: a configuration slightly below max theoretical occupancy can win in practice by avoiding register spills or preserving memory coalescing.

3. Pin the trade-off with `__launch_bounds__`¶

When you have profiled and know the kernel wants more warps (latency-bound) and tolerates fewer registers per thread, the __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) qualifier tells the compiler the launch shape so it can cap register usage to make minBlocksPerMultiprocessor blocks of maxThreadsPerBlock threads co-reside:

// Promise <= 256 threads/block; request >= 8 blocks resident per SM.
// 8 blocks x 256 threads = 2,048 threads = the per-SM thread cap on B200.
__global__ __launch_bounds__(256, 8)
void latency_bound_kernel(const float* a, const float* b, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = a[idx] + b[idx];
    }
}

Semantics (per the CUDA Programming Guide): the compiler derives a register ceiling from the bounds so the requested blocks fit, capping per-thread registers and restricting unrolling/inlining to avoid spills. This trades a little per-thread performance for more warps in flight, a win for latency-bound kernels, but no benefit for a compute-bound kernel that is not warp-starved. A launch exceeding maxThreadsPerBlock fails. Optimal bounds differ across major architectures, so re-tune when you retarget.

4. File-level register cap (coarser alternative)¶

-maxrregcount=N caps registers per thread for the whole translation unit, raising occupancy by letting more warps reside. It is coarser than __launch_bounds__ (which is per-kernel) and risks spilling if set too aggressively. Prefer __launch_bounds__ for per-kernel control.

nvcc -maxrregcount=32 -arch=sm_100 -lineinfo kernel.cu -o kernel

Find the smallest register limit that maximizes occupancy without excessive spilling; cap too hard and excess variables spill to local memory (off-chip DRAM latency), which is worse than the occupancy you bought. -lineinfo adds source-line mapping so Nsight Compute can attribute stalls to source.

5. Measure: theoretical and achieved¶

Profile with Nsight Compute and read the Occupancy section. Compile with -lineinfo for source attribution.

ncu --set full --section Occupancy \
    --metrics sm__warps_active.avg.pct_of_peak_sustained_active \
    -o occ_report ./my_app

Key fields: Theoretical Occupancy %, Achieved Occupancy %, Achieved Active Warps Per SM, and the Block Limit * rows that name the binding resource. A large theoretical-to-achieved gap points at tail effects or load imbalance (uneven work per warp/block, blocks draining at different times), not at the resource mix; fix the imbalance, do not just add warps. Re-profile after every change: confirm achieved occupancy rose and runtime improved, since fixing occupancy often just exposes the next bottleneck (memory or compute bound). See diagnostics tools and observability for fleet-level signals.

6. PyTorch / framework users¶

You rarely set occupancy directly. PyTorch's built-in matmul, convolution, reduction, and elementwise kernels already pick launch configurations internally (effectively automated occupancy tuning). If a profile shows few SMs active, the cause is usually tiny tensor ops that do not launch enough threads: batch/combine work, or let torch.compile fuse many small ops into larger kernels (more work per launch → higher occupancy). For custom CUDA extensions, apply the same Occupancy-API / __launch_bounds__ workflow above. See Frameworks and Performance Optimization and Tuning.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 6 "GPU Architecture, CUDA Programming, and Maximizing Occupancy" (SM/warp/block limits, register file, shared memory, block-size guidance); Ch. 8 "Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism" (achieved vs theoretical occupancy, occupancy limiters, __launch_bounds__, Occupancy API, diminishing returns). Per-SM numbers in the book are stated for Blackwell B200 (cc 10.0); the book notes its illustrative tables and directs readers to NVIDIA docs for exact per-device limits.
NVIDIA, CUDA C++ Programming Guide — Execution Configuration & __launch_bounds__ (C++ language extensions): https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/cpp-language-extensions.html
NVIDIA, CUDA Runtime API — Occupancy group (cudaOccupancyMaxPotentialBlockSize, cudaOccupancyMaxActiveBlocksPerMultiprocessor): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OCCUPANCY.html
NVIDIA, CUDA Pro Tip: Occupancy API Simplifies Launch Configuration: https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/
NVIDIA, CUDA C++ Best Practices Guide — nvcc compiler switches (-maxrregcount): https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/nvcc-compiler-switches.html
NVIDIA, Nsight Compute — Occupancy section and Occupancy Calculator (theoretical vs achieved, block limits, limiters): https://docs.nvidia.com/nsight-compute/