Markdown

The GPU memory hierarchy¶

Scope: the on-chip-to-off-chip memory tiers of an NVIDIA GPU SM -- registers, shared memory/L1, the read-only/constant caches, L2, and HBM global memory -- their capacities, latencies, and bandwidths, and why keeping data reuse close to the SM is the single biggest lever on kernel throughput.

flowchart TB
    REG["Registers (per thread, on SM)<br/>64K x 32-bit/SM, 255/thread<br/>single-cycle, tens of TB/s per SM"]
    SMEM["Shared memory / L1 (per SM)<br/>228 KB shared in 256 KB SRAM<br/>~20-30 cycles, TB/s per SM"]
    CC["Constant / read-only cache (per SM)<br/>~8 KB fronting 64 KB constant<br/>~1 cycle warp-uniform broadcast"]
    L2["L2 cache (GPU-wide)<br/>126 MB (B200)<br/>~200 cycles, multi-TB/s aggregate"]
    HBM["Global memory + local spills (HBM3e)<br/>up to 180 GB (B200)<br/>100s-1000s cycles, ~8 TB/s total"]
    HOST["Beyond device: host / NVLink<br/>larger still, slower link bandwidth"]

    REG --> SMEM
    SMEM --> CC
    CC --> L2
    L2 --> HBM
    HBM --> HOST

    REG -. "descend: capacity grows; bandwidth & latency worsen" .-> HOST

What it is¶

A GPU exposes a strict latency/capacity tradeoff across tiers. Closer to the streaming multiprocessor (SM) means smaller, faster, and more parallel; farther away means larger, slower, and bandwidth-shared. A kernel's job is to stage data into the fast tiers, reuse it heavily, and minimize trips to the slow off-chip tier.

The tiers, from fastest to slowest, on Blackwell (B200, compute capability sm_100):

Tier	Scope	Capacity	Approx. latency	Approx. bandwidth
Registers	Per thread (on SM)	64K 32-bit registers per SM; max 255 per thread	single-cycle	tens of TB/s per SM
Shared memory / L1	Per SM	228 KB shared (227 KB usable per block) within a 256 KB combined L1/tex/shared SRAM	~20-30 cycles	TB/s per SM (bank-conflict-free)
Constant cache	Per SM	~8 KB cache fronting 64 KB `__constant__` space	~1 cycle on warp-uniform broadcast	TB/s-scale (broadcast)
L2 cache	GPU-wide	126 MB total (B200)	~200 cycles	multi-TB/s aggregate
Local memory	Per thread (spills)	backed by HBM	100s-1000s cycles	~8 TB/s (HBM3e)
Global memory (HBM3e)	Device-wide	up to 180 GB (B200); up to ~288 GB (B300)	100s-1000s cycles	~8 TB/s total

Numbers per Fregly, AI Systems Performance Engineering, Ch. 6 (Table 6-5), cross-checked against the NVIDIA Blackwell Tuning Guide. The book flags its tables as "illustrative" for latency/bandwidth -- treat cycle counts as order-of-magnitude and confirm capacities for your exact part with cudaGetDeviceProperties / Nsight Compute.

A few tier-specific notes:

Registers are private per thread, single-cycle, and essentially free. Overflow ("register spilling") lands in local memory, which despite the name is off-chip DRAM at full HBM latency.
Shared memory and L1 share one 256 KB SRAM per SM on Blackwell; you choose the split. Up to 228 KB can be carved out as user-managed shared memory; CUDA reserves 1 KB per block, leaving 227 KB usable per thread block.
Constant cache broadcasts one address to all 32 lanes of a warp in a single cycle when the access is warp-uniform. Divergent reads serialize.
Read-only data cache: arbitrary global data marked read-only (via const ... __restrict__ pointers or __ldg()) can be routed through the SM's read-only/texture path, separate from the regular L1 write path.
L2 is the GPU-wide glue between all SMs and HBM; one block can warm a line that other blocks reuse without re-touching DRAM.
TMEM (256 KB per-SM, Blackwell 5th-gen Tensor Cores) is a dedicated Tensor Core accumulator buffer, not a normal pointer-addressable tier -- it is out of scope here and covered under Tensor Cores and Mixed Precision.

Why it matters¶

The latency gap between registers and HBM is roughly three orders of magnitude in cycles. A kernel that reads each operand once from HBM and discards it is bound by ~8 TB/s and hundreds-of-cycles latency. The same kernel that stages a tile into shared memory and reuses it N times effectively multiplies its delivered bandwidth by N.

This is why "data reuse close to the SM is the key to performance" is the governing rule:

Compute hardware now outpaces memory. Blackwell HBM3e delivers ~8 TB/s, but FLOPS and model sizes grow faster, so more kernels are memory-bound than compute-bound (see Roofline Model and Arithmetic Intensity).
The canonical memory-bound case is LLM decode: hundreds of billions of parameters streamed from HBM per step saturate bandwidth regardless of compute headroom.
High occupancy hides latency by swapping warps, but it does not raise the bandwidth ceiling -- at 100% occupancy a memory-bound kernel is still memory-bound. Locality is the fix, not more warps. See CUDA Occupancy Tuning.

When it is needed (and when not)¶

Reason about the hierarchy explicitly when:

You are writing or tuning custom CUDA / Triton kernels (tiling, fusion, attention).
A kernel is memory-bound on the roofline and you need to lift delivered bandwidth via reuse.
Nsight Compute shows low L1/L2 hit rates, register spills to local memory, or shared-memory bank conflicts.
A repeatedly read, small, warp-uniform table (RoPE tables, ALiBi slopes, LayerNorm gamma/beta, quant scales) is hitting global memory instead of the constant cache.

You usually do not need to hand-manage tiers when:

You stay on vectorized PyTorch tensor ops or established library kernels (cuBLAS, cuDNN, FlashAttention) -- they already tile and stage. Avoid Python-level loops over GPU elements, which serialize work.
The workload is compute-bound with high arithmetic intensity and already near the FLOPS roofline; locality changes will not move the ceiling.

How: implement, integrate, maintain¶

Mark read-only global data so it uses the read-only path¶

Tag pointers const ... __restrict__ so the compiler may route loads through the read-only data cache, and broadcast warp-uniform data from constant memory. __ldg() forces a load through the read-only cache explicitly.

// addParallel: one thread per element. const __restrict__ lets the
// compiler use the read-only data cache for A and B.
__global__ void addParallel(const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C,
                            int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];   // or: __ldg(&A[idx]) + __ldg(&B[idx])
    }
}

// Small, read-only, warp-uniform table -> constant cache (~8 KB fronting
// a 64 KB __constant__ space). Single-cycle broadcast when every lane
// reads the same address.
__constant__ float alibi_slopes[256];

Stage and reuse in shared memory¶

The win is reuse: load a tile from HBM (via L2) into shared memory once, then have every thread in the block read it many times at SRAM speed. Synchronize with __syncthreads() between the fill and the use, and keep barrier count minimal. Bank-conflict-free access patterns reach TB/s per SM. Full tiling pattern: Shared Memory, Bank Conflicts, and Tiling.

Opt in to large dynamic shared memory¶

Above 48 KB of dynamic shared memory you must opt in per kernel before launch, and you can bias the L1/shared carveout:

// Request up to 227 KB dynamic shared memory (Blackwell usable max per block).
size_t smem = 227 * 1024;
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributeMaxDynamicSharedMemorySize,
                     smem);

// Optionally bias the unified L1/shared split toward shared memory.
cudaFuncSetAttribute(myKernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     cudaSharedmemCarveoutMaxShared);

myKernel<<<blocks, threads, smem>>>(/* ... */);

The requested dynamic size plus the kernel's static sharedSizeBytes cannot exceed the device's cudaDevAttrMaxSharedMemoryPerBlockOptin. Always check the return value -- the opt-in fails at runtime on parts that lack the capacity.

Feed L2 and HBM with coalesced, aligned transactions¶

Structure global loads so a warp's 32 lanes touch one contiguous, 128-byte-aligned segment that maps cleanly to a cache line. This avoids split transactions and maximizes both L2 and HBM bandwidth. See Memory Coalescing and Vectorized Access.

Query the real capacities for your part¶

Do not hardcode; read them back at runtime.

cudaDeviceProp p;
cudaGetDeviceProperties(&p, 0);
// p.regsPerMultiprocessor          // 64K on Blackwell
// p.sharedMemPerMultiprocessor     // total SRAM available to shared
// p.sharedMemPerBlockOptin         // opt-in dynamic shared max (~227 KB)
// p.l2CacheSize                    // 126 MB on B200
// p.totalGlobalMem                 // HBM capacity

Maintain: measure where data lives¶

Confirm the hierarchy is actually being used. Nsight Compute reports per-tier hit rates, register spills, and bank conflicts; the Speed Of Light section shows whether you are compute- or memory-bound.

# Memory chart, cache hit rates, and bank conflicts for a kernel.
ncu --section MemoryWorkloadAnalysis \
    --section SpeedOfLight \
    --metrics l1tex__t_sector_hit_rate.pct,lts__t_sector_hit_rate.pct \
    --target-processes all \
    -o mem_report \
    ./my_app

# Check for register spills to local memory (spills mean DRAM latency).
nvcc -Xptxas -v -arch=sm_100 my_kernel.cu -o my_app
# ptxas prints: "N bytes stack frame, N bytes spill stores, N bytes spill loads"

If ncu shows low L2 hit rate, improve coalescing/tiling. If ptxas -v shows nonzero spill stores/loads, cut per-thread register pressure (smaller tiles, __launch_bounds__, fewer live temporaries). Full profiling workflow: Profiling GPUs: Nsight Systems and Nsight Compute.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 6, "GPU Architecture, CUDA Programming, and Maximizing Occupancy" -- memory hierarchy table (Table 6-5), per-SM register/shared/constant capacities, L2 and HBM figures, read-only/constant cache behavior. Note: the book states its latency/bandwidth table values are illustrative.
NVIDIA Blackwell Tuning Guide -- per-SM specs for compute capability 10.0 (64K registers/SM, 255 registers/thread, 228 KB shared/SM, 227 KB max shared/block, 256 KB combined L1+tex+shared, 126 MB L2 on GB200). https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html
NVIDIA CUDA C++ Programming Guide -- constant memory and constant cache, read-only data cache and __ldg(), cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize and cudaFuncAttributePreferredSharedMemoryCarveout, 48 KB opt-in threshold. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
NVIDIA CUDA Runtime API -- cudaFuncSetAttribute, cudaDevAttrMaxSharedMemoryPerBlockOptin, cudaGetDeviceProperties. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html

Reference templates only; figures are quoted from the book and NVIDIA documentation and have not been hardware-tested here. Where the book and official docs differ, the official NVIDIA figure is used.