Markdown

GPU execution model: SMs, warps, and SIMT¶

Scope: How an NVIDIA GPU actually executes a kernel: streaming multiprocessors, the 32-thread warp, SIMT lockstep, the thread/block/grid hierarchy, warp schedulers, and why warp divergence costs throughput.

flowchart LR
  subgraph SW["Software (what you write)"]
    direction TB
    Thread["Thread (one global index)"]
    Block["Thread block / CTA (up to 1,024 threads)"]
    Grid["Grid (all blocks at launch)"]
    Thread --> Block --> Grid
  end
  subgraph HW["Hardware (what runs it)"]
    direction TB
    Lane["CUDA core / lane (one thread's ALU)"]
    SM["SM (4 warp schedulers, register file, L1/shared)"]
    GPU["Full GPU (many SMs)"]
    Lane --> SM --> GPU
  end
  Warp["Warp = 32 threads, issued in lockstep (SIMT)"]
  Thread -->|"grouped into"| Warp
  Warp -->|"scheduled onto"| SM
  Block -->|"runtime distributes, no guaranteed order"| SM

What it is¶

A GPU is a throughput processor: it hides latency by keeping thousands of threads in flight instead of optimizing single-thread latency like a CPU. Work is mapped onto hardware through three layers.

Streaming Multiprocessor (SM): the execution engine, roughly analogous to a CPU core but built for parallelism. A GPU has many SMs. Each SM holds a register file, an L1/shared-memory block, warp schedulers, and the INT32/FP32/Tensor Core/SFU/LD-ST pipelines.
Warp: the fundamental scheduling unit, exactly 32 threads. The hardware issues one instruction per warp; all 32 lanes run that same instruction under SIMT (Single Instruction, Multiple Threads), each on its own data and registers. Warp size is 32 on every NVIDIA architecture to date.³²
Thread / block / grid: you write per-thread kernel code. Threads group into a thread block (a.k.a. CTA, cooperative thread array) of up to 1,024 threads; blocks form a grid at launch. The runtime distributes blocks across SMs.²¹

Each thread computes a unique global index from built-in variables and operates on its slice of the data:

__global__ void scale(float* input, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (idx < N) {                                    // bounds check: N rarely a multiple of blockDim
        input[idx] *= 2.0f;
    }
}

The if (idx < N) guard matters: a kernel runs whole warps, so the final warp can address elements past N. Without the guard an out-of-range lane triggers cudaErrorIllegalAddress. GPU faults surface lazily: the launch is asynchronous and the global fault flag is only checked at the next synchronizing call, so poll with cudaGetLastError() and cudaDeviceSynchronize() after launch.¹

Warp schedulers¶

A modern SM is partitioned into four scheduling sub-partitions, each with its own warp scheduler and dispatch unit. Think of it as four "mini-SMs" sharing the SM's on-chip resources.¹ Per cycle each scheduler picks one ready warp and can dual-issue two independent instructions from that same warp (e.g. one arithmetic + one memory). Dual-issue does not cross warps. The Special Function Unit (SFU), which handles sin, cos, rcp, rsqrt, runs on its own pipeline outside the dual-issue math/memory pair, so transcendentals do not stall the main pipelines.¹

Because warps stall on long-latency events (HBM loads, cache fills), the scheduler hides that latency by switching to another ready warp. Keeping enough warps resident so that a ready one always exists is occupancy (see CUDA Occupancy Tuning).

Per-pipeline issue counts (e.g. exact LD/ST pipe pairings) are architecture-specific and not contractually guaranteed. Use Nsight Compute counters to determine whether a kernel is issue-bound on compute or memory rather than assuming a number. See Profiling GPUs: Nsight Systems and Nsight Compute.¹

Why it matters¶

SIMT is the whole performance contract. Two consequences dominate kernel design:

Block sizing must be a multiple of 32. A warp occupies a scheduler slot whether it runs 32 active lanes or 1. A 256-thread block is exactly 8 full warps; a 33-thread block consumes two warp slots and uses 1/32 of the second warp's lanes, which wastes parallelism. Standard starting point is 256 threads/block, tuned 128–512 against register and shared-memory pressure.¹³
Threads in a warp should agree on control flow and access memory in coalesced patterns. Divergent branches serialize (below); scattered addresses waste DRAM bandwidth (see Memory Coalescing and Vectorized Access).

The hierarchy is also the portability contract: blocks execute independently in no guaranteed order, so the same kernel scales unchanged from a small GPU to one with far more SMs. Never assume inter-block ordering.¹

Warp divergence¶

When threads in one warp take different sides of a branch, the warp serializes: it runs the if lanes with the else lanes masked off, then runs the else lanes with the if lanes masked. Execution time multiplies by the number of distinct paths taken within that warp.¹

Two qualifications:

Divergence is intra-warp only. Different warps taking different branches pay no penalty; they are independent scheduling units.¹
Independent Thread Scheduling (ITS), Volta onward. Since Volta each thread has its own program counter and call stack, so divergent paths can interleave and threads need not reconverge at the immediate post-dominator. ITS enables finer intra-warp cooperation but does not remove the throughput cost of divergence: masked lanes still do no useful work.³⁴ Official docs prevail here over any "warps always fully serialize and reconverge immediately" framing, which describes only pre-Volta hardware.

// Divergent: lanes split inside the warp, both paths run serially.
if (threadIdx.x % 2 == 0) heavy_a(); else heavy_b();

// Uniform: branch keyed on a warp-aligned quantity -> whole warps stay coherent.
if ((threadIdx.x / 32) % 2 == 0) heavy_a(); else heavy_b();

When it is needed (and when not)¶

Needed whenever you write or tune CUDA/Triton kernels, read an Nsight report, reason about occupancy, or explain why a kernel is far from its roofline ceiling. Divergence, partial warps, and low occupancy are all SIMT-model symptoms.
Not needed at the application layer if you only call PyTorch/cuBLAS/cuDNN/vLLM. Those libraries already ship warp-efficient kernels. You drop to this level when you author custom kernels or chase a profiler-confirmed stall, not before. Premature warp-level micro-tuning of code that library kernels already cover is wasted effort.

How: implement, integrate, maintain¶

Launch a kernel with warp-aligned dimensions¶

Pick threadsPerBlock as a multiple of 32; size the grid to cover all elements with a round-up so no element is missed:

const int threadsPerBlock = 256;                                  // 8 full warps
const int blocksPerGrid    = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
scale<<<blocksPerGrid, threadsPerBlock>>>(d_input, N);
cudaDeviceSynchronize();
if (cudaError_t e = cudaGetLastError(); e != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(e));

For 2D/3D data use dim3 for both arguments and add one bounds check per axis:

dim3 block(16, 16);                                               // 256 threads = 8 warps
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
my2DKernel<<<grid, block>>>(d_image, width, height);

Query the hardware limits at runtime¶

Do not hardcode per-SM limits; read them so the same binary tunes to the device:

cudaDeviceProp p;
cudaGetDeviceProperties(&p, 0);
printf("SMs                : %d\n", p.multiProcessorCount);
printf("warp size          : %d\n", p.warpSize);              // 32
printf("max threads/block  : %d\n", p.maxThreadsPerBlock);    // 1024
printf("max threads/SM     : %d\n", p.maxThreadsPerMultiProcessor);  // 2048 on cc 10.0
printf("regs/SM            : %d\n", p.regsPerMultiprocessor); // 65536
printf("shared/SM (bytes)  : %zu\n", p.sharedMemPerMultiprocessor);

Hardware limits (Blackwell B200, compute capability 10.0)¶

Confirmed against the CUDA Programming Guide compute-capability table.² Other generations differ. Always query the device.

Resource	Limit
Warp size	32 threads
Max threads per block	1,024 (`blockDim.x * blockDim.y * blockDim.z <= 1024`)
Max warps per block	32 (1,024 / 32)
Max resident warps per SM	64
Max resident threads per SM	2,048 (64 warps x 32)
Max resident blocks per SM	32
32-bit registers per SM	65,536 (256 KB)
Max registers per thread	255
Max shared memory per SM	228 KB (227 KB usable per block)
Unified L1/shared data cache per SM	256 KB
Max resident grids (concurrent kernels)	128
Max blocks per grid dim	X: 2,147,483,647 · Y/Z: 65,535

The 64-warps / 2,048-threads-per-SM ceiling has held across recent generations, so occupancy reasoning carries over. With 1,024-thread blocks only 2 blocks fit per SM; with 256-thread blocks up to 8 blocks (still 2,048 threads) fit, which can raise occupancy and hide latency, at the cost of more scheduling overhead.¹²

Maintain: detect divergence and partial warps¶

Profile, do not guess. In Nsight Compute the relevant signals are warp execution efficiency / smsp__thread_inst_executed_per_inst_executed (fraction of active lanes per issued instruction; below 1.0 means divergence or masking) and achieved occupancy. Mitigations, in order: restructure branches so the predicate is warp-uniform (key on threadIdx.x / 32 or block-level conditions), sort/bucket data so neighbouring lanes follow the same path, replace short branches with predication, and keep block dimensions multiples of 32. See CUDA Occupancy Tuning and Profiling GPUs: Nsight Systems and Nsight Compute.

Templates and numbers above are reference material grounded in the cited sources; they have not been hardware-validated in this KB. Benchmark on your own target before relying on any figure.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Chapter 6, "GPU Architecture, CUDA Programming, and Maximizing Occupancy" — SIMT model, SM/warp/block/grid hierarchy, four warp schedulers and dual-issue, warp divergence, Blackwell B200 limit tables.
NVIDIA CUDA Programming Guide — Compute Capabilities (cc 10.0 limit table: warp size, threads/block, warps/SM, threads/SM, blocks/SM, registers, shared memory): https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/compute-capabilities.html ²
NVIDIA CUDA Programming Guide — SIMT Architecture and warp execution / Independent Thread Scheduling: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture ³⁴
NVIDIA Blackwell Tuning Guide (architecture-specific issue and occupancy guidance): https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html

Fregly, AI Systems Performance Engineering, Ch. 6. ↩↩↩↩↩↩↩↩↩↩
NVIDIA CUDA Programming Guide, Compute Capabilities — cc 10.0 row. ↩↩↩↩↩
NVIDIA CUDA Programming Guide, SIMT Architecture (warp size 32, lockstep execution). ↩↩↩↩
NVIDIA CUDA Programming Guide, SIMT Architecture — Independent Thread Scheduling (Volta+); divergence still costs throughput. ↩↩