Markdown

Thread block clusters and distributed shared memory¶

Scope: the Hopper/Blackwell thread block cluster (an optional grouping level above the block that co-schedules a set of blocks on SMs within one GPC) and distributed shared memory (DSMEM), which lets those blocks read, write, and atomically update each other's shared memory directly over the SM-to-SM network instead of round-tripping through global memory. Covers the cooperative-groups cluster API, cluster-wide synchronization, how to launch clusters (compile-time __cluster_dims__ vs runtime cudaLaunchKernelEx), and when a cluster beats plain per-block shared memory.

flowchart TB
    Grid["Grid (compute capability 9.0+)"]
    subgraph GPC["Thread block cluster (co-scheduled on one GPC, up to 8 blocks)"]
        direction LR
        B0["Block 0 on SM (shared memory tile)"]
        B1["Block 1 on SM (shared memory tile)"]
        B0 <-->|"DSMEM load / store / atomic over SM-to-SM network (~7x faster than global)"| B1
    end
    Grid --> GPC
    API["cooperative_groups::this_cluster(): map_shared_rank, cluster.sync()"] --> GPC
    GPC -.->|"avoids HBM round-trip"| HBM["Global memory (HBM)"]

What it is¶

The classic CUDA hierarchy is thread -> warp -> thread block -> grid. Compute capability 9.0 (Hopper, sm_90) and later add an optional level between block and grid: the thread block cluster. A cluster is a group of thread blocks, laid out in 1, 2, or 3 dimensions like blocks and grids, that the hardware guarantees to co-schedule concurrently on SMs within a single GPC (Graphics Processing Cluster).¹⁵ That concurrency guarantee is the whole point: because every block in the cluster is resident at the same time on neighbouring SMs, the blocks can communicate and synchronize with one another mid-kernel, something plain blocks in a grid cannot do, since the grid scheduler makes no co-residency promise.

The capability the cluster unlocks is distributed shared memory (DSMEM). Each block still owns its own shared-memory allocation, but within a cluster the shared-memory address spaces of all member blocks are mapped into a single logical region: "threads in different blocks but within the same cluster" can access "the shared memory of all blocks in the cluster."¹ At the hardware level, "with clusters, it is possible for all the threads to directly access other SM's shared memory with load, store, and atomic operations," carried by "a dedicated SM-to-SM network" that "ensures fast, low latency access to remote DSMEM."⁵ A thread takes a pointer into a peer block's shared memory and dereferences it like any other pointer; the SM-to-SM fabric services the access without ever touching global memory or L2.

Cooperative Groups exposes the cluster as a first-class group. cooperative_groups::this_cluster() returns a cluster_group handle (it assumes a 1x1x1 cluster if the grid was launched without clusters, and requires compute capability 9.0+).⁴ The group provides the cluster's geometry and rank, a cluster-wide barrier, and the mapping from a local shared-memory pointer to the equivalent pointer in a peer block:

cluster.block_rank() returns this block's linear rank within the cluster.
cluster.num_blocks() / cluster.dim_blocks() give cluster size and shape in blocks.
cluster.map_shared_rank(local_smem_ptr, target_rank) translates a pointer into this block's shared memory into a pointer into the shared memory of the block at target_rank, valid for direct load/store/atomic.¹⁴
cluster.sync() is a barrier across all threads of all blocks in the cluster. It also offers the split-phase form cluster.barrier_arrive() / cluster.barrier_wait(token), which lets a block signal arrival, do unrelated local work to hide barrier latency, then wait.⁴

Underneath, map_shared_rank lowers to the PTX mapa.shared::cluster instruction, which rebases a shared-memory address onto a target block rank; remote accesses then issue against the .shared::cluster state space.⁶ The cluster runs entirely on hardware: the co-scheduling, the SM-to-SM network, and the cluster barrier are not emulated in software.

Why it matters¶

Before clusters, the only way for two thread blocks to share intermediate data was global memory: block A writes a tile to HBM, a grid-level barrier or a separate kernel launch enforces ordering, then block B reads it back. That is two HBM round-trips and, often, an extra kernel launch, exactly the traffic the roofline tells you to avoid on a bandwidth-bound kernel. DSMEM collapses the exchange onto the on-die SM-to-SM network. NVIDIA reports that "compared to using global memory, DSMEM accelerates data exchange between thread blocks by about 7x."⁵ You keep the data on-chip, avoid the HBM bandwidth and latency, and avoid the launch overhead of splitting the work into producer and consumer kernels.

The second win is working-set size. A single block's shared memory is capped per SM (227 KB usable on Blackwell, 228 KB on Hopper, see GPU Memory Hierarchy). A problem whose reuse tile is larger than one block's shared budget previously had to spill to L2/HBM. A cluster of up to 8 blocks pools that on-chip capacity: a tile that needs more than one SM's shared memory can live in DSMEM across the cluster, addressed with ordinary pointers. This is the structural reason cluster-based GEMM and attention kernels (CUTLASS, FlashAttention/MLA) can stage larger operand tiles than a single block could hold.

The third win is scheduling locality. The concurrency guarantee means a producer-consumer pipeline across blocks is correct by construction: no polling on a global flag, no risk that the consumer block was never co-resident. Combined with the split-phase barrier_arrive/barrier_wait, this lets a cluster overlap inter-block synchronization with useful compute, which is the foundation for warp-specialized, software-pipelined kernels (see Warp Specialization and Intra-Kernel Pipelining).

When it is needed (and when not)¶

Use a cluster when blocks must cooperate on shared data within a single kernel and the cooperation is the bottleneck:

The reuse tile exceeds one block's shared-memory budget but fits across a handful of co-resident blocks, so pool it in DSMEM rather than spilling to HBM.
A producer-consumer or all-to-all exchange between blocks currently round-trips through global memory; DSMEM removes the HBM traffic (~7x faster exchange⁵).
You are building a multi-block GEMM/attention pipeline where operands or partial sums are passed block-to-block (CUTLASS, FlashAttention-class kernels), often paired with TMA multicast into DSMEM.

Do not reach for clusters when:

A plain block already holds its whole working set in its own shared memory. A single-block shared-memory tiling kernel that fits is simpler and has no cluster barrier or SM-to-SM cost, so add a cluster only when one block's shared memory is genuinely too small.
The kernel is compute-bound near peak tensor-core throughput and does no inter-block data exchange, so a cluster adds synchronization without removing a bottleneck.
You target pre-Hopper hardware. Clusters require compute capability 9.0+.¹⁴ Code paths for sm_80 and earlier must fall back to global-memory exchange.
The cooperation is grid-wide, not local. A cluster co-schedules blocks only within one GPC and is capped at a small number of blocks (8 portably²); whole-grid cooperation still needs a cooperative grid launch or multiple kernels.

A practical caveat: requesting a cluster constrains the scheduler (all member blocks must be co-resident in one GPC), so a cluster launch can reduce achievable occupancy versus independent blocks. Validate against the no-cluster baseline with Nsight Compute; only ship the cluster if it wins on wall-clock, not on theory.

How: implement, integrate, maintain¶

1. Choose how to opt in: compile-time vs runtime cluster dimensions¶

Compile-time: fix the cluster shape with the __cluster_dims__(X,Y,Z) kernel attribute. The shape is baked into the kernel and cannot change per launch:⁶

// Cluster of 2 blocks in X, fixed at compile time.
__global__ void __cluster_dims__(2, 1, 1)
cluster_kernel(const float* input, float* output) {
    // ... kernel body uses cooperative_groups::this_cluster() ...
}

// Launch with ordinary triple-chevron syntax; cluster shape comes from the attribute.
dim3 threadsPerBlock(256, 1, 1);
dim3 numBlocks(/* grid in blocks, must be a multiple of the cluster size */);
cluster_kernel<<<numBlocks, threadsPerBlock>>>(d_input, d_output);

Runtime: set the cluster dimensions per launch via cudaLaunchKernelEx, leaving the kernel free of __cluster_dims__. This is the canonical pattern from the CUDA Programming Guide:³

// Runtime cluster dimensions via the extensible launch API.
cudaLaunchConfig_t config = {0};
config.gridDim  = numBlocks;        // grid in blocks
config.blockDim = threadsPerBlock;  // threads per block

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeClusterDimension;
attribute[0].val.clusterDim.x = 2;  // cluster size, in blocks, per dimension
attribute[0].val.clusterDim.y = 1;
attribute[0].val.clusterDim.z = 1;

config.attrs    = attribute;
config.numAttrs = 1;

cudaLaunchKernelEx(&config, cluster_kernel, d_input, d_output);

The product clusterDim.x * clusterDim.y * clusterDim.z is the number of blocks per cluster. 8 is the maximum portable cluster size guaranteed across architectures that support clusters; larger ("non-portable") sizes may exist on a specific device but are not guaranteed and must be queried before use.² The grid dimensions must be an integer multiple of the cluster dimensions in each axis.

2. Exchange data through DSMEM inside the kernel¶

Inside the kernel, obtain the cluster, find a peer block's rank, map a local shared pointer onto that peer, then read or write it directly, with a cluster.sync() fence around the exchange so all peers have published their shared memory before anyone reads.¹⁴

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1)
dsmem_exchange(const float* input, float* output) {
    // Per-block shared tile, published to peers via DSMEM.
    __shared__ float tile[256];

    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank   = cluster.block_rank();   // this block's rank in the cluster
    const unsigned int tid     = threadIdx.x;

    // Each block fills its own shared tile from global input.
    tile[tid] = input[blockIdx.x * blockDim.x + tid];

    // Cluster-wide barrier: every block's shared memory is now visible to peers.
    cluster.sync();

    // Read the neighbouring block's shared memory directly over the SM-to-SM network.
    const unsigned int peer = rank ^ 1u;                       // partner block in a size-2 cluster
    float* peer_tile = cluster.map_shared_rank(tile, peer);    // pointer into peer's shared memory
    float  neighbour = peer_tile[tid];                          // remote DSMEM load, no HBM round-trip

    // Barrier again before any block is allowed to overwrite its tile.
    cluster.sync();

    output[blockIdx.x * blockDim.x + tid] = tile[tid] + neighbour;
}

map_shared_rank returns a pointer valid for load, store, and atomic operations against the peer block's shared memory; the access is serviced by the dedicated SM-to-SM network rather than global memory.⁵ To overlap the barrier with independent work, use the split-phase form instead of sync():⁴

auto token = cluster.barrier_arrive();   // signal arrival
local_processing();                       // hide barrier latency with unrelated work
cluster.barrier_wait(std::move(token));   // wait for all peers

3. Query portable cluster occupancy before committing to a size¶

Because a cluster forces co-residency within one GPC, the number of clusters that fit concurrently is its own occupancy question. Query it with the cluster occupancy API rather than assuming, and confirm that a chosen cluster size can actually launch on the target device (especially for non-portable sizes above 8). NVIDIA exposes occupancy queries for clusters analogous to the per-block Occupancy API; use them to pick a cluster size that co-resides without starving the GPC.¹

# Build for Hopper (sm_90) or Blackwell (sm_100); clusters need cc 9.0+.
nvcc -arch=sm_90 -lineinfo cluster_kernel.cu -o cluster_kernel

4. Profile against the no-cluster baseline¶

Profile both the cluster kernel and a plain single-block (or global-memory-exchange) version with Nsight Compute. The cluster wins only if the DSMEM exchange removes enough HBM traffic and launch overhead to beat the lower occupancy and barrier cost it introduces. Read achieved occupancy, memory throughput, and the cluster barrier stalls; ship the cluster only when wall-clock improves. Re-tune cluster size when retargeting across Hopper and Blackwell: the portable ceiling is 8, but the best size and the achievable cluster occupancy differ by device.

5. Framework users¶

In PyTorch you do not write cluster launches by hand. Cluster-based kernels reach you through libraries: CUTLASS GEMMs, FlashAttention/MLA attention kernels, and torch.compile/Triton-generated code on Hopper/Blackwell. Author clusters directly only in custom CUDA extensions or Triton/CUTLASS kernels where a multi-block cooperative tile is the bottleneck. See Frameworks and Performance Optimization and Tuning.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly) — Hopper/Blackwell thread block clusters, distributed shared memory, and the cooperative-groups cluster API (the cluster as a grouping level that co-schedules blocks on SMs within a GPC, DSMEM for direct block-to-block shared-memory access, cluster-wide synchronization, and when clusters beat plain per-block shared memory). The cluster size, API names, and numeric claims in this page were cross-checked against the official NVIDIA/CUDA documentation below and follow the docs where they are more specific.
NVIDIA, CUDA C++ Programming Guide — Thread Block Clusters (compute capability 9.0+, blocks co-scheduled in a single GPC, distributed shared memory, cooperative-groups cluster access): https://docs.nvidia.com/cuda/cuda-programming-guide/01-introduction/programming-model.html
NVIDIA, CUDA C++ Programming Guide — Thread Block Clusters launch (__cluster_dims__, cudaLaunchKernelEx with cudaLaunchAttributeClusterDimension and clusterDim, max portable cluster size of 8, grid-multiple-of-cluster rule): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters
NVIDIA, CUDA C++ Programming Guide — Cooperative Groups (this_cluster(), cluster_group, cluster.sync(), cluster.map_shared_rank(), block_rank(), barrier_arrive()/barrier_wait(), cc 9.0+ requirement): https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cooperative-groups.html
NVIDIA, NVIDIA Hopper Architecture In-Depth (Technical Blog) — clusters concurrently scheduled across SMs within a GPC, dedicated SM-to-SM network, DSMEM direct load/store/atomic to other SMs' shared memory, ~7x faster block-to-block exchange vs global memory: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
NVIDIA, CUDA Runtime API — cudaLaunchKernelEx, cudaLaunchConfig_t, cudaLaunchAttribute (execution group): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html

CUDA C++ Programming Guide, Programming Model — Thread Block Clusters: "GPUs with compute capability 9.0 and higher have an optional level of grouping called clusters"; "All thread blocks in a cluster are executed in a single GPC"; "Threads in clusters can access the shared memory of all blocks in the cluster, which is referred to as distributed shared memory." https://docs.nvidia.com/cuda/cuda-programming-guide/01-introduction/programming-model.html ↩↩↩↩↩↩
CUDA C++ Programming Guide — Thread Block Clusters: "up to 8 blocks per cluster supported portably across architectures"; larger non-portable sizes must be queried before use. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters ↩↩
CUDA C++ Programming Guide — runtime cluster launch via cudaLaunchKernelEx: attribute[0].id = cudaLaunchAttributeClusterDimension; attribute[0].val.clusterDim.x = 2;. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters ↩
CUDA C++ Programming Guide, Cooperative Groups — this_cluster() "Returns the handle to a group of threads in the current cluster" (requires cc 9.0+, assumes 1x1x1 cluster on a non-cluster grid); cluster.barrier_arrive() / cluster.barrier_wait(std::move(token)) split-phase barrier. https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cooperative-groups.html ↩↩↩↩↩↩
NVIDIA Hopper Architecture In-Depth: "A cluster is a group of thread blocks that are guaranteed to be concurrently scheduled onto a group of SMs"; "With clusters, it is possible for all the threads to directly access other SM's shared memory with load, store, and atomic operations"; "The dedicated SM-to-SM network for clusters ensures fast, low latency access to remote DSMEM"; "Compared to using global memory, DSMEM accelerates data exchange between thread blocks by about 7x." https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ ↩↩↩↩↩
__cluster_dims__(X,Y,Z) kernel attribute and the underlying mapa.shared::cluster PTX instruction used to rebase a shared-memory address onto a target block rank. https://cudacourseh100.github.io/pages/lesson-2.html ↩↩