Skip to content
Markdown

CUTLASS: templated GEMM and kernel building blocks

Scope: CUTLASS as the open, templated C++ library for high-performance GEMM/conv on Tensor Cores, its tiling/pipelining abstractions (CuTe, CollectiveBuilder), epilogue fusion (EVT), and when to reach for CUTLASS versus cuBLAS, a hand-written MMA kernel, or Triton.

flowchart TB
    G["Global GEMM (M x N x K)"] -->|"CuTe partition"| TB["Threadblock tile<br/>(e.g. 128x128, sized to 256 KB TMEM)"]
    TB -->|"partition across warps"| WT["Warp tile"]
    WT -->|"issue Tensor Core MMA"| MMA["Instruction / MMA tile (Tensor Core)"]
    subgraph ML["Software-pipelined mainloop (double-buffered)"]
        HBM["Global DRAM (A, B)"] -->|"cp.async / TMA (bypass registers)"| SMEM["Shared memory (staged tiles)"]
        SMEM -->|"load operands"| MMA
        MMA -->|"FP32 accumulate"| ACC["Accumulators (TMEM / registers)"]
    end
    ACC -->|"fused epilogue: alpha*acc + beta*C, bias/activation"| EPI["Collective epilogue"]
    EPI -->|"single coalesced write-out"| D["Output D"]

What it is

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source, header-only C++ template library that decomposes GEMM and convolution into reusable, composable components: a collective mainloop (the tiled, pipelined MMA loop) and a collective epilogue (the post-accumulation write-out). You instantiate a template for your operand types, layouts, tile shape, and target architecture, and CUTLASS emits a kernel that runs near the Tensor Core throughput a hand-tuned kernel would reach.

The book frames it as the easy on-ramp to Tensor Core performance: "With CUTLASS, you write a single templated call, and it will automatically apply many advanced optimizations" (shared-memory tiling, asynchronous memory transfers cp.async/TMA, and double buffering staged through TMEM, the 256 KB per-SM on-chip accumulator buffer on Blackwell-class hardware), so "your Tensor Cores run at near-peak throughput without any manual kernel tuning" (Fregly, Ch. 9).

What CUTLASS automates per the book:

  • Tile selection to balance register pressure, shared-memory capacity, and Tensor Core utilization, chosen empirically per kernel, e.g. 128x128 or 256x128 tiles sized to fit the 256 KB per-SM TMEM budget.
  • Asynchronous staging via cp.async or TMA (cp.async.bulk.tensor) from global DRAM into shared memory, bypassing the register file.
  • Double buffering: while Tensor Cores process the current tile, TMA prefetches the next into shared memory.
  • Accumulation in a higher precision than the operands (e.g. FP32 accumulate for FP16/FP8 inputs) for numerical fidelity, then coalesced write-out from TMEM.
  • Warp specialization, thread-block clusters, and TMA multicast with distributed shared memory (DSMEM) when beneficial, tiling across multiple SMs for larger effective tiles.

Architecturally, CUTLASS 3.x is built on CuTe (CUDA Tensors), a layout algebra that represents tensors as (shape, stride) pairs and composes thread/data partitioning hierarchically. CuTe is what lets the same source express a tiling for SM80, SM90 (Hopper), and SM100 (Blackwell) by swapping an architecture tag rather than rewriting the kernel (CUTLASS 3.x design blog).

Why it matters

A GEMM on Tensor Cores is only as fast as its weakest stage: if tiles are mis-sized, the pipeline stalls; if data movement is not overlapped with compute, the Tensor Cores idle waiting on HBM. Getting all of this right by hand is, per the book, "weeks of low-level tuning." CUTLASS collapses that into "a few lines of template code" while landing within a few percent of the hand-written result. The book's illustrative comparison (Table 9-1, FP16 inputs / FP32 accumulation):

Metric Hand-tuned MMA CUTLASS GEMM
Tensor Core utilization 98% 98%
Registers per thread ~52 ~60 (slightly higher)
Shared memory per CTA ~2 KB ~4 KB
Development effort High Low (template config)

The book is explicit that these numbers are illustrative, not benchmarked: "The numeric values in all metrics tables are illustrative to explain the concepts. For actual benchmark results on different GPU architectures, see the GitHub repository." The narrative claim (CUTLASS "matches or exceeds hand-tuned MMA performance within about 2%," with the slightly higher register and shared-memory use staying "well within the hardware limits" and not impacting occupancy) is the load-bearing point, not the exact percentages. These figures are not independently hardware-tested in this KB.

The second reason CUTLASS matters is epilogue fusion. cuBLAS gives you the GEMM; anything after (bias-add, activation, scaling, residual add, dtype cast) is a separate kernel that re-reads the result from global memory. CUTLASS fuses those into the GEMM epilogue, applying them to accumulator values still on-chip before the single write-out. This eliminates extra kernel launches and round-trips to HBM (CUTLASS: Fast Linear Algebra in CUDA C++). This is the same arithmetic-intensity win as Kernel Fusion, applied to the matmul tail.

When it is needed (and when not)

Reach for cuBLAS / cuBLASLt first for standard GEMMs. It is the default in PyTorch and most frameworks, ships heavily tuned kernels with internal heuristic selection (cublasGemmEx, cublasLtMatmul), and for "large regular-shaped input matrices" delivers state-of-the-art performance with zero kernel authoring (cuBLAS 12.0 on Hopper). The book's own guidance: the easiest way to get these optimizations is to use the library; CUTLASS is the next step when the library does not fit.

Reach for CUTLASS when:

  • You need epilogue fusion cuBLAS does not expose: fuse activation, bias, scaling, or a custom elementwise tail into the GEMM to kill a separate kernel pass.
  • The shape is non-standard (tall-and-skinny, grouped/batched with irregular sizes, small or odd M/N/K) where a tunable tile configuration beats cuBLAS's heuristic; CUTLASS can be instantiated and profiled across tile/cluster/stage configurations to find the best one.
  • You want binary-size control: compile exactly the GEMM variants you need rather than linking the full cuBLAS binary.
  • You are building a custom kernel (e.g. a fused-attention or MoE block) and want CuTe's tiling/pipelining primitives rather than starting from raw PTX.

Reach for a hand-written MMA kernel or inline PTX/SASS (Inline PTX and SASS-Level Tuning) only when CUTLASS leaves measurable performance on the table on an already-profiled hot kernel. The book frames inline assembly as a 5-10% squeeze on already-optimized kernels, rarely worth it given CUTLASS lands within ~2%.

Reach for Triton when you want fused custom kernels in Python with autotuning and are willing to trade some peak performance and fine-grained control for far faster iteration. Triton and CUTLASS occupy the same "fuse it yourself" niche from opposite ends: Triton optimizes for author productivity, CUTLASS for control and ceiling. The book lists cuBLASLt, cuDNN, CUTLASS, and OpenAI's Triton together as the kernels that "perform cp.async instructions or TMA transfers into shared memory" feeding the Transformer Engine.

Do not reach for CUTLASS to get a one-off standard matmul faster than cuBLAS; for regular shapes you will usually tie or lose, at much higher engineering cost.

How: implement, integrate, maintain

The book's device API (CUTLASS 2.x style)

For a half-precision GEMM C = alpha*A*B + beta*C with FP32 accumulation, the book instantiates the 2.x device GEMM template:

#include <cutlass/numeric_types.h>
#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,     // A (FP16)
    cutlass::half_t, cutlass::layout::ColumnMajor,  // B (FP16)
    cutlass::half_t, cutlass::layout::RowMajor,     // C / output (FP16)
    float,                                          // accumulator (FP32)
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm100>;                          // e.g. Blackwell B200

Gemm gemm_op;
cutlass::Status status = gemm_op({
    {M, N, K},        // GEMM problem shape
    {A_d, lda},       // A: pointer + leading dimension
    {B_d, ldb},       // B: pointer + leading dimension
    {C_d, ldc},       // source C: pointer + leading dimension
    {C_d, ldc},       // destination D: pointer + leading dimension
    {alpha, beta}     // epilogue scalars (LinearCombination)
});
if (status != cutlass::Status::kSuccess) { /* handle */ }

The book's listing passes the alpha/beta and pointer arguments as a flat positional list; the actual device::Gemm::Arguments struct groups them as TensorRefs plus an epilogue-params struct as shown above. The class names, template parameter order, and cutlass::Status::kSuccess return contract are taken verbatim from the CUTLASS headers (CUTLASS GEMM API); the book's flattened call form is paraphrased pseudocode for the same operation. Where the book and the headers differ on the argument grouping, prefer the headers.

The modern API (CUTLASS 3.x, CollectiveBuilder + CuTe)

On Hopper (SM90) and Blackwell (SM100), the recommended path assembles a kernel from two collective builders (one for the mainloop, one for the epilogue), then wraps the kernel in a device adapter. Names and template-argument order below are verbatim from CUTLASS example 49 and the 3.x API docs:

#include <cutlass/gemm/collective/collective_builder.hpp>
#include <cutlass/epilogue/collective/collective_builder.hpp>
#include <cutlass/gemm/kernel/gemm_universal.hpp>
#include <cutlass/gemm/device/gemm_universal_adapter.h>

using namespace cute;

// Epilogue: linear combination alpha*acc + beta*C, fused write-out.
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    Shape<_128,_128,_64>, Shape<_1,_1,_1>,        // tile shape, cluster shape
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementCompute,           // accumulate / compute types
    ElementC, LayoutC, AlignmentC,                // source C
    ElementD, LayoutD, AlignmentD,                // destination D
    cutlass::epilogue::collective::EpilogueScheduleAuto
>::CollectiveOp;

// Mainloop: tiled, pipelined Tensor Core MMA.
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, AlignmentA,
    ElementB, LayoutB, AlignmentB,
    ElementAccumulator,
    Shape<_128,_128,_64>, Shape<_2,_1,_1>,        // tile shape, cluster shape
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
>::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int,int,int,int>,                       // ProblemShape: M, N, K, L
    CollectiveMainloop,
    CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

GemmUniversal is "a stateless universal device GEMM kernel type that treats GEMM as a composition of a collective mainloop and a collective epilogue," and GemmUniversalAdapter is "a stateful, reusable GEMM handle" that manages the kernel's Params lifetime (CUTLASS 3.x GEMM API). The Auto schedule/stage tags let the builder pick the warp-specialized schedule and pipeline depth for the target arch; pin them explicitly only when profiling shows a better choice.

Epilogue fusion with EVT

For fusions beyond alpha*acc + beta*C, CUTLASS 3.x on SM90 uses Epilogue Visitor Trees (EVT), "a collection of visitors organized in a tree that collectively operate as a single visitor," composed from load, compute, and store nodes so you express a custom epilogue as a compute graph instead of writing a new collective epilogue (Colfax: Epilogue Visitor Trees). Representative node types: Sm90AccFetch (read accumulators), Sm90SrcFetch (load C), Sm90Compute (elementwise op), Sm90ColBroadcast/Sm90RowBroadcast (broadcast a bias vector), Sm90ScalarBroadcast. A bias-plus-ReLU tail, for example, is a Sm90Compute<ReLU> node over a Sm90Compute<plus> of the accumulator and a broadcast bias. EVT support is limited to Hopper and warp-specialized kernels in CUTLASS 3.x; on older architectures use the prebuilt LinearCombination* epilogues.

Tile sizing

Per the book, larger tiles raise per-tile throughput but reduce concurrent tiles per SM, hurting small GEMMs; very small tiles trade arithmetic intensity for parallelism. The TMEM budget bounds the tile: a 256x512 FP16 tile (2 bytes/elem) or a 256x256 FP32 tile (4 bytes/elem) maxes the 256 KB per-SM budget. Start from the Auto schedules; profile candidate tile/cluster shapes with the CUTLASS profiler when tuning a specific shape. See Shared Memory, Bank Conflicts, and Tiling and CUDA Occupancy Tuning.

Integrate and maintain

  • Header-only: CUTLASS is included, not linked; templates instantiate at compile time, so build times and binary size scale with how many variants you instantiate. Compile only the variants you need.
  • Verify the precision and pipeline actually engaged: profile with Nsight Compute and confirm the Speed-of-Light roofline moved from memory-bound to compute-bound and Tensor Core utilization rose. See Profiling GPUs: Nsight Systems and Nsight Compute and GPU Diagnostics and Validation.
  • Re-pin the architecture tag per generation: arch::Sm80 (Ampere), Sm90 (Hopper), Sm100 (Blackwell) select different mainloop/epilogue schedules and TMA/TMEM features. Pull a fresh CUTLASS when moving to a new architecture; NVIDIA updates the builders and EVT for each generation's FP8/FP4/TMEM capabilities.
  • Prefer the 3.x CollectiveBuilder path on SM90+: the 2.x device::Gemm template still works but does not expose Hopper/Blackwell warp-specialized schedules, clusters, or EVT.

References

Related: Tensor Core Programming · Tensor Cores and Mixed Precision · OpenAI Triton: Authoring GPU Kernels in Python · Kernel Fusion · Shared Memory, Bank Conflicts, and Tiling · Warp Specialization and Intra-Kernel Pipelining · Thread Block Clusters and Distributed Shared Memory · CUDA Stream-Ordered Memory Allocator · Inline PTX and SASS-Level Tuning · Profiling GPUs: Nsight Systems and Nsight Compute · Roofline Model and Arithmetic Intensity · Glossary