Markdown

Memory coalescing and vectorized access¶

Scope: how a warp's 32 lanes should address global (HBM) memory -- contiguous and aligned so the hardware coalesces lane requests into the fewest cache-line transactions -- why strided or misaligned patterns waste bandwidth, and how vectorized loads (float4/int4, and 32-byte Blackwell vectors) cut load-instruction count while preserving alignment.

flowchart TB
    W["Warp: 32 lanes, 4-byte words"]
    W --> C["Coalesced<br/>lane i reads element i (stride 1)"]
    W --> S["Strided / scattered<br/>in[idx * stride], gather"]
    C --> CT["One contiguous 128-byte segment"]
    CT --> CS["Four 32-byte sectors (~4.0 sectors/request)"]
    CS --> CE["Global Load Efficiency ~100%, DRAM ~peak"]
    S --> ST["Lanes land in separate sectors"]
    ST --> SS["Up to 8-32 sectors/request, mostly unused bytes"]
    SS --> SE["Low efficiency, DRAM well below peak, warps stall"]

What it is¶

Global-memory access is fastest when the 32 threads of a warp touch a contiguous, aligned range of addresses that the memory controller can combine ("coalesce") into a small number of wide transactions. The hardware services a warp's loads in 32-byte sectors; on modern GPUs the L2/DRAM access granularity is the 32-byte sector, and four contiguous sectors form a 128-byte cache line. A perfectly coalesced 32-lane load of 4-byte words (128 bytes total, base 128-byte-aligned) maps to exactly four 32-byte sectors -- the ideal.

Two distinct, complementary properties govern this:

Coalescing (inter-thread): which addresses the lanes of one warp hit. Stride-1 indexing so lane i reads element i keeps the warp on one contiguous segment.
Vectorization (intra-thread): how wide each lane's individual load is. Each thread fetches an aligned vector (e.g. a 16-byte float4) in one instruction instead of four scalar 4-byte loads.

Per Fregly (Ch. 7), uncoalesced patterns balloon the per-request sector count. A strided or scattered warp can break one logical request into up to 8 sectors (or approach 32 in the worst case, since L2 activity is reported in 32-byte sectors); Nsight Compute surfaces this as average sectors per request above 4.0, lower Global Memory Load Efficiency, and DRAM throughput well below peak. Coalesced access drives sectors-per-request back toward 4.0 and efficiency toward 100%.

A note on the "128-byte line" framing: the book describes 128-byte cache lines composed of four 32-byte sectors. The NVIDIA CUDA C++ Best Practices Guide states that on compute capability 6.0 and higher the access unit is the 32-byte transaction regardless of L1 caching, and a warp of adjacent 4-byte words coalesces into four 32-byte transactions. The two descriptions agree on the observable ideal -- four 32-byte sectors per warp -- so prefer "four 32-byte sectors" as the architecture-neutral statement.

Why it matters¶

Blackwell HBM3e delivers up to ~8 TB/s per device (up to ~16 TB/s aggregate across the two GPUs of a GB200/GB300 superchip). Uncoalesced access leaves most of that idle: the controller fetches whole sectors but returns mostly unused bytes, so effective bandwidth collapses and warps stall on memory instead of feeding the ALUs.

The book's before/after copy kernels make the gap concrete (numbers flagged illustrative -- see References):

Coalescing fix (drop a stride of 2 to stride 1): DRAM throughput 25% -> 90% of peak (~3.6x), Global Memory Load Efficiency 23% -> 99%, average sectors per request 8.0 -> 4.0, kernel time 4.8 ms -> 1.3 ms (~3.7x).
Vectorization fix (scalar float -> float4): Global Memory Load Efficiency 28% -> 97%, average sectors per request 31.8 -> 4.0, DRAM throughput 25% -> 90%, kernel time 4.2 ms -> 1.2 ms (~3.5x).

Coalescing reduces scattered sectors; vectorization reduces instruction count. Even a fully coalesced scalar warp still issues 32 individual 4-byte loads that the hardware stitches back together; one float4 load per lane replaces four scalar loads, cutting load instructions ~4x and reducing latency. This is the difference between a memory-bound kernel and one that approaches the bandwidth roofline -- see Roofline Model and Arithmetic Intensity.

When it is needed (and when not)¶

Reason about coalescing and vectorization explicitly when:

You write or tune custom CUDA / Triton kernels and Nsight Compute's Memory Workload Analysis shows average sectors per request above 4.0, low Global Memory Load Efficiency, or sub-peak DRAM throughput.
Your data layout is array-of-structures (AoS) and threads stride across struct fields; a structure-of-arrays (SoA) layout lets lane i touch element i contiguously.
A gather/scatter or strided index (in[idx * stride], index_select with a strided index) forces scattered loads.
You can guarantee 16-byte (Hopper) or 32-byte (Blackwell + CUDA 13) alignment and want to cut load-instruction count.

You usually do not hand-tune this when:

You stay on vectorized PyTorch tensor ops or established library kernels (cuBLAS, cuDNN, CUTLASS, FlashAttention) -- they already lay out data for coalesced, vectorized access. A contiguous tensor.clone() lowers to an optimized device-to-device copy.
The kernel is already compute-bound at high arithmetic intensity; better coalescing will not move the FLOPS ceiling.
The kernel is register- or parallelism-limited -- NVIDIA notes vectorized loads raise register pressure and can reduce occupancy, so scalar loads may be better. See CUDA Occupancy Tuning.

How: implement, integrate, maintain¶

Fix coalescing: stride-1 indexing¶

The uncoalesced pattern strides each lane across the input; the fix is contiguous indexing (stride = 1) so the warp lands on one 128-byte-aligned segment.

// Uncoalesced: stride > 1 scatters lanes across sectors.
__global__ void uncoalescedCopy(const float* __restrict__ in,
                                float* __restrict__ out,
                                int N, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        out[idx] = in[idx * stride];   // lanes are stride*4 bytes apart
    }
}

// Coalesced: stride-1, lane i reads element i. The warp's 32 loads
// combine into the minimum number of 128-byte transactions.
__global__ void coalescedCopy(const float* __restrict__ in,
                              float* __restrict__ out,
                              int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = in[idx];
    }
}

In PyTorch, a strided torch.index_select reproduces the uncoalesced gather (lanes stride * 4 bytes apart); the coalesced equivalent is just a contiguous copy.

import torch

# Uncoalesced: strided gather issues scattered loads.
def uncoalesced_copy(input_tensor: torch.Tensor, stride: int) -> torch.Tensor:
    flat = input_tensor.contiguous().view(-1)
    assert flat.numel() % stride == 0, "stride must divide tensor length"
    idx = torch.arange(0, flat.numel(), stride,
                       device=flat.device, dtype=torch.long)
    return torch.index_select(flat, 0, idx)

# Coalesced: clone() on a contiguous tensor lowers to a vectorized
# device-to-device copy under the hood.
n, stride = 1 << 20, 2
inp = torch.arange(n * stride, device="cuda", dtype=torch.float32)
out_bad = uncoalesced_copy(inp, stride)
out_good = inp.clone()

To let TorchInductor pick coalesced, vectorized schedules when shapes are stable, compile with autotuning:

compiled = torch.compile(my_fn, mode="max-autotune")

Fix instruction count: vectorized loads (`float4`)¶

CUDA's built-in float4 (from <cuda_runtime.h>) packs four floats into a compiler-guaranteed 16-byte-aligned struct. A warp of float4 loads moves 32 x 16 = 512 bytes, which the controller splits into exactly four aligned 128-byte transactions. Launch N/4 threads and step the pointer in float4 units.

#include <cuda_runtime.h>

// 16-byte (128-bit) vector copy: one float4 per thread.
// On sm_90 (Hopper) NVCC emits ld.global.v4.f32 / st.global.v4.f32.
static_assert(alignof(float4) == 16, "float4 alignment must be 16 bytes");

__global__ void copyVector16B(const float4* __restrict__ in,
                              float4* __restrict__ out,
                              int N4) {            // number of float4 elements
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N4) {
        out[idx] = in[idx];                        // 16B load + 16B store
    }
}

cudaMalloc returns device pointers aligned to at least 256 bytes, which satisfies the 16-byte float4 requirement at the allocation boundary. Alignment is a hard precondition: a misaligned cast forfeits vectorization. Adding an element offset can break alignment unless the offset is a multiple of the vector width.

// Valid only if ptr is already a multiple of 16 bytes (4 floats).
// cudaMalloc base addresses satisfy this; arbitrary offsets may not.
auto ptr4 = reinterpret_cast<const float4*>(ptr);

Handle a tail when N is not divisible by 4 (a short scalar cleanup over the last N % 4 elements), or assert divisibility in host code.

Blackwell: 32-byte (256-bit) vectors¶

Prior to Blackwell + CUDA 13, global vector loads were capped at 16 bytes (128 bits) per thread. Blackwell adds 32-byte (256-bit) load/store instructions for 32-byte-aligned types. CUDA has no built-in 8-float vector type, so define one with alignas(32); the compiler then emits a single ld.global.v8.f32 per thread on sm_100 instead of two 16-byte loads.

#include <cuda_runtime.h>

// Blackwell-only: 32-byte per-thread vector copy. Requires 32B alignment;
// NVCC emits ld.global.v8.f32 / st.global.v8.f32 on sm_100.
struct alignas(32) float8 { float v[8]; };
static_assert(alignof(float8) == 32, "float8 alignment must be 32 bytes");

__global__ void copyVector32B(const float8* __restrict__ in,
                              float8* __restrict__ out, int N8) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N8) {
        out[idx] = in[idx];                        // 32B load + 32B store
    }
}

If the compiler cannot prove 32-byte alignment it splits the access into two 16-byte instructions, doubling instruction count -- so enforce 32-byte alignment for 256-bit loads on Blackwell. Note the coalescer still services requests in 128-byte chunks (four 32-byte sectors): a 32-lane warp at 32 B/thread moves 1024 B (8 x 128-byte lines); at 16 B/thread it moves 512 B (4 x 128-byte lines). Both are fully efficient when aligned.

Maintain: confirm with Nsight Compute¶

The Memory Workload Analysis section reports average sectors per request and Global Memory Load Efficiency directly; a value near 4.0 with efficiency near 100% confirms coalesced, well-utilized transactions.

ncu --section MemoryWorkloadAnalysis \
    --section SpeedOfLight \
    --metrics l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio \
    --target-processes all \
    -o coalesce_report \
    ./my_app

If sectors per request sits above 4.0, fix indexing (stride-1 / SoA) first, then vectorize. Confirm NVCC actually emitted vector instructions by inspecting the SASS:

# Look for LDG.E.128 (16B) or wider vector loads; LDG.E without a width
# suffix means scalar 32-bit loads (no vectorization).
cuobjdump -sass ./my_app | grep -E 'LDG|STG'

Full profiling loop: Profiling GPUs: Nsight Systems and Nsight Compute. Vectorization composes with shared-memory staging -- tile loads should themselves be coalesced: Shared Memory, Bank Conflicts, and Tiling.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 7, "Profiling and Tuning GPU Memory Access Patterns" -- coalesced vs uncoalesced access, AoS vs SoA, the strided/float4/float8 before-after kernels, the Nsight Compute metric tables (Tables 7-1 to 7-3), and Blackwell/CUDA 13 32-byte vector loads. The book flags all metric-table numbers as illustrative, not measured.
NVIDIA CUDA C++ Best Practices Guide -- coalesced access, 32-byte transaction granularity on compute capability 6.0+, and the four-32-byte-transaction ideal for adjacent 4-byte words. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
NVIDIA Technical Blog, "How to Access Global Memory Efficiently in CUDA C/C++ Kernels" -- coalesced vs strided/offset access and their transaction cost. https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
NVIDIA Technical Blog, "CUDA Pro Tip: Increase Performance with Vectorized Memory Access" -- int2/int4/float4 loads (LDG.E.64/LDG.E.128), the 2x/4x instruction-count reduction, alignment requirement, and the register-pressure/parallelism tradeoff. https://developer.nvidia.com/blog/cuda-pro-tip-increase-performance-with-vectorized-memory-access/
NVIDIA CUDA C++ Programming Guide -- built-in vector types (float4) and their alignment, __restrict__, device-pointer alignment from cudaMalloc. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Reference templates only; figures are quoted from the book and NVIDIA documentation and have not been hardware-tested here. Where the book and official docs differ (the book's "128-byte line of four sectors" vs the guide's 32-byte transaction unit on CC 6.0+), the NVIDIA description is preferred and both are noted.