Markdown

Shared memory, bank conflicts, and tiling¶

Scope: using on-chip shared memory as a software-managed cache, covering tiling for data reuse (tiled GEMM), the 32-bank conflict model, and padding/swizzling to keep warp accesses conflict-free.

What it is¶

Shared memory is a per-SM, software-managed SRAM scratchpad that a thread block stages data into, synchronizes on with __syncthreads(), and reuses across all of its threads before going back to off-chip HBM. It is the explicit, programmer-controlled half of the same on-chip SRAM that backs L1 (see GPU Memory Hierarchy).

Two distinct techniques live here:

Tiling: partition a problem so each block pulls a small submatrix (a tile) of its inputs into shared memory exactly once, then reuses those cached values across many multiply-accumulate operations. This converts redundant global loads into one load plus N on-chip reads, raising arithmetic intensity and pushing a kernel from memory-bound toward compute-bound (see Roofline Model and Arithmetic Intensity).
Bank-conflict avoidance: shared memory is split into 32 banks of 4-byte width. When threads in a warp hit the same bank at different addresses, the hardware serializes those accesses. Padding and swizzling reshape the address-to-bank mapping so a warp's 32 lanes land in 32 distinct banks.

The bank model is fixed hardware. On all modern NVIDIA GPUs including Blackwell, shared memory has 32 banks with a 4-byte bank width; successive 32-bit words map to successive banks, so bank = (address / 4) % 32 (NVIDIA CUDA C++ Best Practices Guide, "Shared Memory"). Because the warp size (32) equals the bank count (32), the mapping repeats every 128 bytes (32 lanes x 4 bytes), and any access pattern whose per-lane stride is a multiple of 128 bytes collapses the whole warp onto one bank, a 32-way conflict.

One exception: if all 32 lanes read the exact same address, the hardware broadcasts the value in a single cycle and there is no conflict (NVIDIA CUDA C++ Programming Guide, "Shared Memory"). Any other case where two or more lanes touch different addresses in the same bank serializes.

Why use it¶

A naive N x N matrix multiply loads each element of A from HBM N times, once per row of B it multiplies against, giving N-1 redundant loads per element. On a GPU that can sustain tens of TFLOPS, those redundant loads waste the bandwidth that should be feeding math units. The kernel is memory-bound: SMs stall waiting on global memory instead of driving the ALUs.

Tiling eliminates the redundancy. With a 32x32 tile, each element of A and B is fetched from HBM once per tile instead of once per thread (32x fewer fetches for those elements). In the book's worked example, tiling a 1024x1024 FP32 GEMM moves global-memory load sectors from ~9,800 down to ~1,200, lifts achieved occupancy from 42% to 89%, and raises sustained FP32 throughput from 15 to 170 GFLOPS (roughly 11x) by raising arithmetic intensity from 1.5 to 8 FLOPS/byte (Fregly, AI Systems Performance Engineering, Ch. 7, Table 7-4). DRAM throughput as a percentage of peak actually falls (90% to 25%); that is desirable, because the kernel now does far more work per byte moved.

The book labels all of these metric tables "illustrative ... to explain the concepts." Treat the ratios as directionally correct and confirm absolute numbers for your part with Nsight Compute. The 170 GFLOPS result is far below Blackwell's FP32 peak (~80 TFLOPS) precisely because the matrix is small and the example deliberately uses FP32 CUDA cores, not Tensor Cores (see Tensor Cores and Mixed Precision).

The reuse mechanism behind those numbers is a first-order model (not the book's measured Table 7-4 figures, which include cache effects) and is worth making concrete and checkable. Each value staged into a tile is read once from HBM instead of once per thread, so global traffic drops by the tile factor and arithmetic intensity rises by the same factor:

import numpy as np

def global_loads_naive(N):
    # First-order model: each output element reads a full row of A and a
    # full column of B from global memory -> 2*N loads per element.
    return 2 * N * N * N

def global_loads_tiled(N, tile):
    # Each A/B element is pulled from global memory once per output tile it
    # feeds = N/tile times instead of N times.
    return 2 * N * N * (N // tile)

N, tile = 1024, 32
naive = global_loads_naive(N)
tiled = global_loads_tiled(N, tile)

# Tiling cuts global traffic by exactly the tile factor.
assert naive // tiled == tile
assert tiled * tile == naive

# Arithmetic intensity (FLOP/byte) rises by the same factor: reuse is what
# moves a kernel from memory-bound toward compute-bound.
flops = 2 * N ** 3
ai_naive = flops / (naive * 4)          # 4 bytes per float32 word
ai_tiled = flops / (tiled * 4)
assert np.isclose(ai_tiled / ai_naive, tile)

print(f"reuse model: global loads cut {tile}x, AI x{ai_tiled/ai_naive:.0f} "
      f"({ai_naive:.2f} -> {ai_tiled:.2f} FLOP/byte, first-order)")

Bank conflicts then matter because they silently negate the speedup. A naive 32x32 transpose that reads down a tile column forces a 32-way conflict, serializing reads 32x. In the book's transpose example, padding the tile drops shared-memory load bank conflicts from 4.8 million to 0, raises shared-memory utilization from 52% to 100%, cuts warp memory-stall fraction from ~38% to ~0.5%, and improves kernel time ~3x (4 ms to 1.3 ms) (Table 7-5). If a shared-memory kernel is not accelerating as expected, bank conflicts are a likely culprit.

When to use it (and when not)¶

Reach for explicit tiling and bank-conflict tuning when:

You are writing a custom CUDA or Triton kernel (GEMM, transpose, attention, stencil, reduction) with strong data reuse and Nsight Compute shows the kernel memory-bound with low arithmetic intensity.
The Nsight Compute Memory Workload / Shared Memory section reports nonzero bank conflicts or shared-memory throughput well below peak.
A library kernel does not cover your shape, dtype, or fused pattern and you must hand-roll the staging.

You usually do not hand-manage shared memory when:

You stay on torch.matmul / torch.mm or established libraries. cuBLAS, cuDNN, and CUTLASS already implement multi-level (block/warp/thread) tiling, coalesced tile loads, and XOR swizzling internally, so the book's PyTorch GEMM gets the same wins automatically. Do not write Python loops over GPU elements; they serialize and are pathologically slow.
TorchInductor (via Triton) can generate tiled kernels for your shapes when alignment and contiguity permit.
The kernel is already compute-bound near the FLOPS roofline; more reuse will not move that ceiling.

Architecture¶

The tiled GEMM dataflow: a block loads one tile of A and one of B from HBM cooperatively, publishes them to shared memory behind a barrier, has every warp reuse those on-chip values, then advances along K.

flowchart TB
    Start(["Block computes a 32x32 output tile"]) --> Load["Cooperative coalesced load:<br/>one tile of A and B from HBM<br/>(once per tile, not per thread)"]
    Load --> Sync1["__syncthreads(): tile fully resident"]
    Sync1 --> Compute["Multiply-accumulate over the tile<br/>(32 on-chip reads per loaded value)"]
    Compute --> Sync2["__syncthreads(): reads done before overwrite"]
    Sync2 --> More{"More tiles along K?"}
    More -- "yes, advance K" --> Load
    More -- "no" --> Write["Write accumulated sum to C in HBM"]
    Compute -. "reuse from shared memory<br/>cuts global-memory traffic" .-> Compute

The structural pieces the diagram rests on:

Off-chip HBM holds the full A, B, and C. It is the scarce resource: the whole point of tiling is to touch it as few times as possible.
Per-SM shared memory is the staging SRAM. It is partitioned into 32 banks of 4-byte width, so an ideal warp access spreads its 32 lanes across all 32 banks and completes in one cycle.
The warp (32 lanes) is the unit that both loads a tile (coalesced, one contiguous 128-byte segment per warp) and later reads it back for the multiply-accumulate. Because 32 lanes meet 32 banks, the bank-mapping math (below) decides whether each shared-memory read is one cycle or up to 32.
The two __syncthreads() barriers bracket every tile: the first makes the loaded tile visible to all threads, the second guarantees all reads finish before the next iteration overwrites the tile. They turn a per-thread load pattern into a per-block, load-once/reuse-many pattern.

How to use it: tiled GEMM and conflict-free access¶

Tiled GEMM with shared memory¶

Each block loads one 32x32 tile of A and one of B into shared memory, syncs, computes a 32x32 block of partial products, accumulates, and advances along the K dimension. 32x32 is a common starting tile: it matches the 32-lane warp, fits comfortably in shared memory, and maps a full warp to one row.

The kernel below is a CUDA C++ reference template (it requires nvcc and an NVIDIA GPU and is not executed in this page). Its core math, tile-by-tile accumulation along K with a guarded boundary, is validated in numpy directly beneath it.

#include <cuda_runtime.h>

#define TILE_SIZE 32

__global__ void tiledMatMul(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE_SIZE][TILE_SIZE];
    __shared__ float sB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N; t += TILE_SIZE) {
        // Cooperative, coalesced load of one tile of A and one of B.
        if (row < N && (t + threadIdx.x) < N)
            sA[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
        else
            sA[threadIdx.y][threadIdx.x] = 0.0f;

        if ((t + threadIdx.y) < N && col < N)
            sB[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        else
            sB[threadIdx.y][threadIdx.x] = 0.0f;

        __syncthreads();  // tile fully resident before use

        for (int k = 0; k < TILE_SIZE; ++k)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];

        __syncthreads();  // all reads done before next tile overwrites
    }

    if (row < N && col < N)
        C[row * N + col] = sum;
}

Launch with a 32x32 block matching the tile:

dim3 block(TILE_SIZE, TILE_SIZE);
dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (N + TILE_SIZE - 1) / TILE_SIZE);
tiledMatMul<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();

The two __syncthreads() barriers are mandatory: the first guarantees the tile is loaded before any thread reads it, the second prevents the next iteration from overwriting the tile while threads still read it. Note both tile loads are arranged to be coalesced, so each warp pulls a contiguous 128-byte segment from global memory (see Memory Coalescing and Vectorized Access).

Validate the tiling math itself (numpy, runnable). This mirrors the kernel's tile-by-tile accumulation and, critically, exercises the boundary path (N not a multiple of the tile) that the if (row < N && ...) guards protect:

import numpy as np

def tiled_matmul(A, B, tile=32):
    # Mirror the CUDA kernel: stage tile-by-tile along K, zero-pad the
    # boundary exactly as the `if (row < N && ...)` guards do, accumulate.
    N = A.shape[0]
    assert A.shape == (N, N) and B.shape == (N, N)
    C = np.zeros((N, N), dtype=A.dtype)
    for kt in range(0, N, tile):
        ke = min(kt + tile, N)            # boundary: final tile may be short
        C += A[:, kt:ke] @ B[kt:ke, :]    # load tiles, multiply-accumulate
    return C

def slow_matmul(A, B):
    # Independent triple-loop reference (different summation than BLAS).
    N = A.shape[0]
    C = np.zeros((N, N), dtype=np.float64)
    for i in range(N):
        for j in range(N):
            acc = 0.0
            for k in range(N):
                acc += float(A[i, k]) * float(B[k, j])
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)

# Equivalence to an independent slow reference (small, float64 exactness).
As = rng.standard_normal((8, 8))
Bs = rng.standard_normal((8, 8))
assert np.allclose(tiled_matmul(As, Bs, 4), slow_matmul(As, Bs), atol=1e-9)

# Happy path: N a multiple of the tile, float32 like the kernel.
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)
assert np.allclose(tiled_matmul(A, B, 32), A @ B, atol=1e-3)

# Adversarial edge case: N NOT a multiple of the tile. This is the
# boundary path a naive kernel gets wrong; the guarded load must still
# produce the exact product.
A2 = rng.standard_normal((70, 70)).astype(np.float32)
B2 = rng.standard_normal((70, 70)).astype(np.float32)
assert np.allclose(tiled_matmul(A2, B2, 32), A2 @ B2, atol=1e-3)

# Tile size must never change the result (equivalence across tilings).
for t in (1, 8, 16, 32, 64):
    assert np.allclose(tiled_matmul(A, B, t), A @ B, atol=1e-3)

print("tiled GEMM: slow-reference + boundary(70) + tile-invariance OK")

The bank conflict, concretely¶

A row-major 32x32 tile of floats has rows exactly 128 bytes apart (32 cols x 4 bytes). Reading down a column holds the column index constant and varies the row across the warp, so the 32 addresses differ by multiples of 128 bytes, and (k * 128) / 4 % 32 == 0 for every lane. All 32 lanes hit bank 0: a 32-way conflict that serializes the reads. The same trap appears with any explicit stride of 32 floats:

__shared__ float arr[32 * warpCount];
float x = arr[threadIdx.x * 32];  // stride 32 floats = 128 bytes -> all in bank 0

The canonical case is a naive transpose: the write tile[ty][tx] is coalesced, but the transposed read tile[tx][ty] walks a column and triggers the 32-way conflict.

Validate the bank model (numpy, runnable). This computes the exact bank = word_index % 32 mapping, detects the 32-way conflict in the naive layout, proves the +1 pad fixes it, sweeps widths adversarially, and checks the same-address broadcast exception:

import numpy as np

BANKS = 32  # 32 banks, 4-byte width: bank = (word_index) % 32

def column_read_banks(rows, width_floats, col=0):
    # A warp reads down one column: lane r touches element [r, col] whose
    # linear word index is r*width_floats + col.
    r = np.arange(rows)
    return (r * width_floats + col) % BANKS

def conflict_ways(bank_ids):
    # Max lanes on any single bank = serialization factor.
    # 1 == conflict-free; 32 == full 32-way conflict.
    _, counts = np.unique(bank_ids, return_counts=True)
    return int(counts.max())

# Naive 32x32 float tile: rows are 32 words apart, so a column walk puts
# every lane on the SAME bank -> 32-way conflict (detect the defect).
naive = column_read_banks(32, width_floats=32)
assert conflict_ways(naive) == 32
assert len(np.unique(naive)) == 1

# Fix: pad to 33 words/row. Successive rows step one bank further, so a
# column walk hits all 32 distinct banks -> conflict-free.
padded = column_read_banks(32, width_floats=33)
assert conflict_ways(padded) == 1
assert len(np.unique(padded)) == 32

# Adversarial sweep: ANY row width that is a multiple of 32 reintroduces
# the 32-way conflict; any ODD width stays conflict-free (gcd with 32 = 1).
for w in (32, 64, 96, 128):
    assert conflict_ways(column_read_banks(32, w)) == 32
for w in (33, 31, 17, 65):
    assert conflict_ways(column_read_banks(32, w)) == 1

# Broadcast exception: all lanes at the SAME address (not just same bank)
# is served in one cycle, so it is not a conflict.
same_addr = np.zeros(32, dtype=int)
assert len(np.unique(same_addr)) == 1   # one distinct address => broadcast

print("bank model: naive=32-way conflict, +1 pad -> conflict-free, broadcast OK")

Fix 1: pad the inner dimension¶

Add one padding column so each row is 33 floats wide. Successive rows now start one bank further along, so a column walk spreads across all 32 banks instead of colliding on one.

#include <cuda_runtime.h>

#define TILE_DIM 32
#define PAD 1  // one padding column shifts the bank mapping per row

__global__ void transposePadded(const float* idata, float* odata, int width) {
    __shared__ float tile[TILE_DIM][TILE_DIM + PAD];  // 32 x 33

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];  // coalesced write
    __syncthreads();
    odata[x * width + y] = tile[threadIdx.x][threadIdx.y];  // now conflict-free
}

Padding costs ~3% extra shared memory for a 32-wide tile (1 KB for a 32x32 float tile) and is the simplest fix. The exact same [32][33] trick removes the column-access conflict in the tiled-GEMM kernel above. The width_floats=33 case in the numpy block above is precisely this fix, and it asserts the result is conflict-free.

Fix 2: swizzle¶

Swizzling is a compile-time index transform that scrambles the linear shared-memory index so sequential threads map to different banks, typically XOR-ing the row into the column (new_col = col ^ row) or a modulo offset. It achieves conflict-free access with zero memory overhead, at the cost of more complex indexing. CUTLASS and other high-performance libraries use XOR swizzling in their tile iterators rather than padding (NVIDIA CUTLASS docs). PyTorch exposes no high-level shared-memory padding/swizzle API; you implement it in a custom kernel (for example via torch.utils.cpp_extension) or rely on the libraries to do it under the hood.

Validate the swizzle (numpy, runnable). This proves the XOR transform is conflict-free, that it is a bijection (so no address is dropped), and that it holds for every logical column, not just column 0:

import numpy as np

BANKS = 32

def xor_swizzle_banks(rows, width_floats, col=0):
    # CUTLASS-style XOR swizzle: stored column is (col ^ row), so a column
    # walk touches bank (row*width + (col ^ row)) % BANKS.
    r = np.arange(rows)
    return (r * width_floats + (col ^ r)) % BANKS

def is_permutation(x):
    return sorted(int(v) for v in x) == list(range(len(x)))

# Unswizzled 32-wide tile: column walk = 32-way conflict (baseline defect).
r = np.arange(32)
plain = (r * 32 + 0) % BANKS
assert len(np.unique(plain)) == 1

# XOR swizzle on the same 32-wide tile: all 32 banks, conflict-free, with
# ZERO padding overhead.
sw = xor_swizzle_banks(32, 32, col=0)
assert len(np.unique(sw)) == 32
assert is_permutation(sw)               # bijection: no address is lost

# Adversarial: the swizzle must stay conflict-free for EVERY logical
# column, not just column 0 (a fix that only works for one column is a bug).
for c in range(32):
    assert len(np.unique(xor_swizzle_banks(32, 32, col=c))) == 32

print("XOR swizzle: conflict-free for all 32 columns, zero padding, bijective")

How to integrate with it¶

Shared-memory tiling plugs into a PyTorch stack at three levels, from most to least automatic:

Library call. torch.matmul / torch.mm dispatch to cuBLAS/CUTLASS, which stage tiles into shared memory with multi-level tiling, coalesced loads, and XOR swizzling. You get the reuse and conflict-free access for free.
Compiler. TorchInductor (via Triton) can generate tiled kernels for your shapes when alignment and contiguity permit, so a torch.compiled region often reaches the same pattern without a hand-written kernel.
Custom kernel. When no library or compiled path covers your shape, dtype, or fused pattern, drop to a custom CUDA or Triton kernel (for example via torch.utils.cpp_extension) and stage shared memory yourself, applying the padding or swizzle fixes above.

The library call is a reference template here (it requires a CUDA GPU and PyTorch and is not executed in this page); its core math is the tiled GEMM validated in numpy under "How to use it":

import torch

# Reference template: requires a CUDA GPU and PyTorch; not executed here.
A = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
B = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
C = torch.matmul(A, B)  # cuBLAS/CUTLASS: tiled + coalesced + conflict-free

How to run it in production¶

Prefer the library path. cuBLAS, cuDNN, and CUTLASS already do the same staging plus warp/thread-level tiling and swizzling, tuned over years across GPU generations, and you are unlikely to beat them quickly by hand. Hand-roll shared memory only when a library genuinely does not cover your shape, dtype, or fused pattern, and when you do, verify the result on real hardware (below) rather than trusting the source read. Never write Python loops over GPU elements: they serialize and are pathologically slow. Because the library kernels stage shared memory internally, the same 48 KB static limit and dynamic opt-in (see "How to scale it") apply to any custom kernel you add alongside them.

How to maintain it¶

A hand-rolled shared-memory kernel is only correct once you have profiled it, and it must be re-profiled after every change and on every new GPU generation. These kernels are reference templates and have not been hardware-tested here. Confirm behavior with Nsight Compute (see Profiling GPUs: Nsight Systems and Nsight Compute):

# Shared-memory + occupancy sections for the tiled/transpose kernels
ncu --set full \
    --section MemoryWorkloadAnalysis \
    --section Occupancy \
    ./matmul_tiled

Inspect the Shared Memory section: bank conflicts should read 0 and shared-memory throughput near 100% once padding or swizzling is applied. Confirm achieved occupancy and that global-memory load sectors drop after tiling. Treat any regression in these three signals (nonzero conflicts, throughput below peak, sector count creeping back up) as the maintenance trigger to re-tune.

How to scale it¶

Sizing tiles against on-chip budgets¶

Bigger tiles square the reuse factor but quadruple shared-memory use and add registers, lowering occupancy. Blackwell (sm_100) provides up to 228 KB of allocatable shared memory per SM (NVIDIA Blackwell Tuning Guide). Note that static __shared__ allocations remain capped at 48 KB; to exceed that you must allocate dynamic shared memory and opt in with cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes) (NVIDIA Blackwell Tuning Guide).

A 64x64 FP32 tile needs 16 KB per input tile (64 x 64 x 4 bytes), which fits, but occupancy is then bounded by the smallest of the shared-memory, register, warp, and thread limits. Each Blackwell SM has 65,536 32-bit registers (max 255 per thread) and supports at most 2,048 resident threads (64 warps). For a 64x64 tile computed by a 1024-thread block at ~22 registers/thread, the book shows only two blocks fit per SM (22,528 of 65,536 registers), saturating the 2,048-thread ceiling. Verify your own config with cudaOccupancyMaxPotentialBlockSize or the Nsight Compute occupancy report (see CUDA Occupancy Tuning); CUTLASS profilers automate the tile-size sweep. Scaling further along K is handled by the tile loop itself: the block advances one tile at a time, so problem size grows without growing the shared-memory footprint.

Failure modes¶

Silent bank conflicts. A column-strided read (stride a multiple of 32 floats) serializes a warp up to 32x and produces correct results, so it never errors; it only shows up as low shared-memory throughput and high warp memory-stall in Nsight. Fix with padding ([32][33]) or XOR swizzle, as validated in the numpy blocks above.
Missing or misplaced __syncthreads(). Dropping the first barrier lets threads read a tile before it is loaded; dropping the second lets the next iteration overwrite the tile while threads still read it. Both are data races that yield wrong, nondeterministic results, not crashes.
Boundary bug on non-multiple sizes. When N is not a multiple of the tile, unguarded loads read out of bounds or leave stale values in the tile. The guarded load with zero-fill is what keeps the product exact; the numpy validation's 70x70 case exercises exactly this path.
Exceeding the 48 KB static limit. A static __shared__ allocation above 48 KB fails to launch. Reaching the Blackwell 228 KB ceiling requires dynamic shared memory plus the cudaFuncSetAttribute opt-in.
Occupancy collapse from oversized tiles. Larger tiles raise reuse but consume more shared memory and registers, so too few blocks stay resident and the SM is underutilized despite fewer global loads. Size against the smallest of the shared-memory, register, warp, and thread limits.
Over-padding. Padding is cheap (~3% for a 32-wide tile) but padding every dimension or padding wide tiles wastes shared memory and can itself cost occupancy; pad only the dimension that is column-walked, or swizzle for zero overhead.
Broadcast misread as a conflict. All 32 lanes reading the same address is a single-cycle broadcast, not a conflict; only same-bank different-address accesses serialize. Reading a conflict count without checking whether lanes share an address misdiagnoses the kernel.

References¶

Chris Fregly, AI Systems Performance Engineering, O'Reilly, 2025. Ch. 7, "Profiling and Tuning GPU Memory Access Patterns": "Tiling and Data Reuse Using Shared Memory" (pp. 255-263) and "Avoid Shared-Memory Bank Conflicts" (pp. 264-270). Source of the tiled-GEMM and padded-transpose code and the Table 7-4 / Table 7-5 metrics (book-flagged as illustrative).
NVIDIA CUDA C++ Programming Guide, Shared Memory: bank model, broadcast on same-address access.
NVIDIA CUDA C++ Best Practices Guide, Shared Memory: 32 banks / 4-byte width, bank = (address/4) % 32, padding to avoid conflicts.
NVIDIA Blackwell Tuning Guide: 228 KB shared memory per SM (sm_100), 48 KB static limit, opt-in for dynamic shared memory above it.
NVIDIA CUTLASS: XOR-swizzled tile iterators for conflict-free shared-memory GEMM.