Markdown

Tensor core programming¶

Scope: how to program NVIDIA Tensor Cores directly from CUDA (the warp-level WMMA fragment API, Hopper's asynchronous warp-group WGMMA in inline PTX, and Blackwell's TCGen05), and how manual CUDA-core tiling, vectorized loads, and explicit shared memory become hardware tiling, TMA, and implicit staging.

Reference templates on real APIs; pin versions and validate before production use.

flowchart LR
    CC["CUDA-core GEMM<br/>(manual tiling, vectorized loads, explicit shared memory)"] -->|"manual tiling becomes hardware tiling"| WMMA["WMMA fragments<br/>(warp-level, 16x16x16, mma.h)"]
    WMMA -->|"warp group, asynchronous MMA"| WGMMA["WGMMA<br/>(Hopper sm_90, inline PTX)"]
    WGMMA -->|"FP4, leaner encoding"| TC05["TCGen05<br/>(Blackwell sm_100)"]
    WMMA -.->|"still stages tiles by hand"| SMEM["Shared memory (explicit)"]
    WGMMA -->|"vectorized loads become TMA"| TMA["Tensor Memory Accelerator (TMA)"]
    WGMMA -.->|"shared memory becomes implicit"| FUSE["Shared-to-register load fused into MMA"]
    LIB["cuBLAS / CUTLASS / Triton"] -.->|"library on-ramp"| WGMMA

What it is¶

A Tensor Core is a fixed-function matrix-multiply-accumulate unit inside each SM that computes D = A * B + C on small matrix tiles. This page is about issuing those instructions from CUDA. The precision formats, accumulation rules, and PyTorch/CUTLASS framing live in the companion page Tensor Cores and Mixed Precision and are not repeated here.

The defining trait of the programming model: you stop managing individual scalar values and start issuing whole-tile matrix operations. The hardware distributes a tile across a warp (WMMA), a warp group (WGMMA), or the SM (TCGen05), holds the operands in the right registers, and performs thousands of fused multiply-adds per instruction. Inputs multiply at low precision (FP16/BF16/FP8/FP4) while the accumulator stays at FP32, the mechanism that keeps training numerically stable.

The instruction set has evolved through four generations, each raising parallelism and tightening the compute-memory coupling:

Generation	Arch / target	Granularity	API surface	Tile (FP16)
MMA	Volta `sm_70`, 2017	Warp (32 threads)	PTX, explicit registers	small, superseded by WMMA
WMMA	Volta `sm_70`+, CUDA 9.0	Warp (32 threads)	C++ fragment API (`mma.h`)	16x16x16
WGMMA	Hopper `sm_90`, 2022	Warp group (128 threads)	Inline PTX + descriptors	up to 64x256x16
TCGen05	Blackwell `sm_100`, 2024	SM-level	PTX (`tcgen05.*`), via CUTLASS	adds FP4

WMMA works on any GPU from Volta onward (including consumer RTX cards), so most of the code below runs on hardware you likely already have. WGMMA requires Hopper; TCGen05 requires Blackwell.

When to program tensor cores directly (vs cuBLAS/CUTLASS/Triton)¶

Most workloads should never hand-write a Tensor Core kernel. The book's decision framework, from least to most effort:

Start with cuBLAS. It reaches 713 TFLOPS on a 4096x4096 FP16 GEMM (H100, book-reported) with zero custom code. Only go lower when you have a fusion opportunity (such as Flash Attention) or a memory layout the library does not fit.
Use CUTLASS for maintainable custom kernels across GPU generations. It wraps WMMA, WGMMA, and TCGen05 behind C++ templates and lands within roughly 5-10% of peak while staying portable.
Reach for Triton when you want fused custom kernels in Python with autotuning and accept some peak-performance loss for faster iteration.
Write WMMA by hand when learning the model or targeting Ampere; the fragment API gives respectable performance (71 TFLOPS on H100) without any PTX.
Write WGMMA + TMA by hand only for maximum performance on Hopper when you are comfortable with inline PTX; the book's progression reaches 618 TFLOPS (87% of cuBLAS) at a steep complexity cost.

Independent of the path, Tensor Cores pay off only when the matrices are large (roughly 64 or more in every dimension), the profile shows the kernel is compute-bound, the data is in the layout the instruction expects (WGMMA wants column-major), and most of the runtime falls inside matrix multiplies. Stay on CUDA cores for small tiles, irregular memory access, reductions, or element-wise work.

WMMA fragment API¶

WMMA (Warp Matrix Multiply-Accumulate) exposes Tensor Cores through three functions (load_matrix_sync, mma_sync, store_matrix_sync) that operate on fragments: opaque types that hold one thread's portion of a 16x16 tile. The 256 elements of the tile are scattered non-contiguously across the warp's 32 threads; the exact mapping is hardware-specific and opaque to you. You declare a fragment, call load, and trust the hardware to hand each thread its slice.

A fragment's type declares its role (matrix_a, matrix_b, or accumulator) plus the tile dimensions, element type, and memory layout for the operands. For FP16 the tile is always 16x16x16, so a single mma_sync issues 4,096 fused multiply-adds. Inputs are half; the accumulator is float, which is what prevents rounding error from compounding across the K loop.

WMMA automates the matrix multiply but not the memory movement. You still stage A and B tiles from global into shared memory by hand. The typical tiling hierarchy: a thread block claims a large output tile (e.g. 128x128), each warp owns a sub-region (e.g. 32x64), and each warp steps through it as a grid of 16x16 WMMA operations. Eight warps per 128x128 block is the sweet spot: enough work per warp to amortize the fragment-load overhead while staying within register and shared-memory limits.

A minimal, correct kernel where one warp computes one 16x16 output tile:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda::wmma;

// One warp computes one 16x16 output tile of C = A * B.
// A: M x K row-major, B: K x N row-major, C: M x N row-major.
// Preconditions: M, N, K are multiples of 16.
__global__ void wmma_gemm_16x16x16(const half *A, const half *B, float *C,
                                   int M, int N, int K) {
    // Warp coordinates: one warp -> one 16x16 output tile.
    const int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    const int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, half, row_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;
    fill_fragment(c_frag, 0.0f);

    // Accumulate the tile across K, 16 elements per step.
    for (int k = 0; k < K; k += 16) {
        const int aRow = warpM * 16;
        const int bCol = warpN * 16;
        if (aRow < M && bCol < N) {
            load_matrix_sync(a_frag, A + aRow * K + k, K);  // lda = K
            load_matrix_sync(b_frag, B + k * N + bCol, N);  // ldb = N
            mma_sync(c_frag, a_frag, b_frag, c_frag);       // D = A*B + C
        }
    }

    const int cRow = warpM * 16;
    const int cCol = warpN * 16;
    if (cRow < M && cCol < N)
        store_matrix_sync(C + cRow * N + cCol, c_frag, N, mem_row_major);  // ldc = N
}

Launch with whole warps along x so each warp maps to its own tile, then compile for a Volta-or-newer target:

dim3 block(128, 4);                            // 128/32 = 4 warps in x, 4 in y
dim3 grid((M / 16 + 3) / 4, (N / 16 + 3) / 4); // one warp per 16x16 output tile
wmma_gemm_16x16x16<<<grid, block>>>(A, B, C, M, N, K);

# FP16 WMMA needs compute capability 7.0+ (Volta). Match your target arch.
nvcc -arch=sm_70 wmma_gemm.cu -o wmma_gemm   # sm_80 (Ampere) for BF16/TF32 fragments,
                                             # sm_90 (Hopper), sm_100 (Blackwell)

This warp-level path reaches 71 TFLOPS on an H100 for a 4096x4096 FP16 GEMM (book-reported, not benchmarked here), 1.7x over a hand-tuned CUDA-core kernel and 142x over a naive one. The real-world WMMA kernel adds the shared-memory staging and the warp/block tiling loop around this core.

WGMMA asynchronous warp-group MMA¶

WGMMA (Warp Group Matrix Multiply-Accumulate) is a structural break from WMMA. It operates on a warp group of 128 threads (4 warps), matching Hopper's four warp schedulers so all four fire the same instruction on the same cycle. Three things change at once:

Granularity. A single wgmma works on a 64x64x16 tile (262,144 FMAs per instruction), and the 128 threads must reach the instruction at the same program counter (the sync requirement).
Asynchrony. mma_async issues and returns immediately; the Tensor Core runs in the background while threads prefetch the next tile. You place warpgroup_arrive(), warpgroup_commit_batch(), and warpgroup_wait<N>() fences manually to control when results are ready.
Memory. Operands are read straight from shared memory via 64-bit descriptors, and tiles arrive through the Tensor Memory Accelerator (TMA) rather than thread-issued loads.

There is no C++ fragment API. WGMMA is issued from inline PTX, so the instruction itself is the artifact:

// Hopper WGMMA, issued from inline PTX (no C++ fragment API). One asynchronous
// 64x64x16 MMA: FP32 accumulate, FP16 inputs, operands read from shared memory
// through two 64-bit descriptors. d[0..31] hold the 64x64 tile across 128 threads.
// Illustrative instruction format; a full kernel wraps this in TMA loads,
// transaction barriers, and warpgroup_arrive / commit_batch / wait fences.
asm volatile(
    "wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 "
    "{%0, %1, %2, %3, /* 28 more accumulators */ %31}, "
    "%32, %33, "                  // A and B shared-memory descriptors
    "%34, %35, %36, %37, %38;"    // scaleD, scaleA, scaleB, transA, transB
    : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3]) /* ... */, "+f"(d[31])
    : "l"(desc_a), "l"(desc_b),
      "n"(1), "n"(1), "n"(1), "n"(0), "n"(0));

Every dot-separated token is meaningful: wgmma (the instruction family), mma_async (issue-and-return), sync (warp-group lockstep), aligned (shared-memory addresses naturally aligned to the tile), m64n64k16 (the 64x64x16 tile), f32 (accumulator precision), f16.f16 (A and B input precision). The constraint letters bind C++ to registers: +f is a read-write 32-bit float, l a 64-bit operand (the descriptors), n a compile-time immediate (the scale/transpose flags).

A shared-memory descriptor is a 64-bit value, built once per operand, that encodes the tile's start address, its row-to-row stride, the stride between tiles for iterating across K, and a validity bit. The Tensor Core reads operands directly from shared memory using this descriptor. This removes the register-to-register copy WMMA needed and gives the hardware control over data-movement timing. The exact bit layout is hardware-defined; production code builds it through CUTLASS or follows the PTX ISA reference rather than hand-packing it.

TMA is a dedicated per-SM DMA engine that understands multi-dimensional layouts. You build a CUtensorMap descriptor (cuTensorMapEncodeTiled) that encodes the tensor's base address, strides, and tile shape; a single producer thread issues cp.async.bulk.tensor and the hardware walks the index space and stages the tile, so no thread ever computes a source address. Correctness comes from transaction-counting barriers: the producer arrives with a byte count (barrier_arrive_tx), consumers arrive without one, and the barrier releases only when all threads have arrived and the TMA transfer has finished. This binds the software and hardware timelines together, something __syncthreads(), which waits only on threads, cannot do.

The book's progressive optimization on H100 (4096x4096 FP16, book-reported, not benchmarked here): basic 64x64x64 tiles reach 318 TFLOPS; 128x128 tiles, 433; TMA asynchronous loads, 504; and maximum 128x256 tiles with a three-stage circular buffer and one producer plus two consumer warp groups, 618 TFLOPS, 87% of cuBLAS.

TCGen05 on Blackwell¶

TCGen05 is the fifth-generation Tensor Core instruction set, introduced with Blackwell (B200, sm_100). It extends WGMMA's asynchronous model rather than replacing it, and the book defers hand-written TCGen05 to its CUTLASS chapter. In practice you reach it through CUTLASS 4.2+, not raw PTX. The headline changes:

Native FP4. Four-bit floating point reaches roughly 18 PFLOPS on a single B200 with 2:4 sparsity, about 9-10 PFLOPS dense. (The book quotes ~18,000 TFLOPS; that is the sparse peak. NVIDIA's published dense figure is ~half it, matching the ~10 PFLOPS dense in tensor cores and mixed precision.) Either way it is roughly double H100's FP8. FP4 is primarily an inference tool and demands careful per-layer calibration.
Leaner encoding. Reduced instruction overhead and better register allocation improve even FP16/FP8 paths that already existed on Hopper.
Enhanced TMA integration. Blackwell's second-generation TMA automates more of the barrier coordination that was manual under WGMMA.
Backward compatibility. Hopper WGMMA and Volta WMMA kernels run unmodified on Blackwell; they just do not benefit from TCGen05-specific gains.

At the PTX level the instructions appear as tcgen05.mma with tcgen05.ld / tcgen05.st, staging operands through the per-SM Tensor Memory (TMEM) accumulator described in the companion page. Tooling requires CUDA Toolkit 12.8 as the minimum for sm_100. See NVIDIA Blackwell Datacenter Platform and CUTLASS GEMM.

From CUDA cores to tensor cores (pattern mapping)¶

The hand-tuned CUDA-core optimizations have direct hardware analogs. Tensor Core programming encodes in silicon the patterns you previously wrote by hand.

Manual tiling becomes hardware tiling. A CUDA-core kernel keeps a per-thread accumulator array and runs a nested FMA loop:

// CUDA cores: per-thread accumulators, one FMA per iteration.
half acc[TM * TN];
for (int m = 0; m < TM; ++m)
    for (int n = 0; n < TN; ++n)
        acc[m * TN + n] = __hadd(acc[m * TN + n], __hmul(regM[m], regN[n]));

WMMA collapses that whole loop into one call that does 4,096 FMAs with hardware-managed register distribution; WGMMA scales it to 262,144 FMAs per instruction:

// Tensor cores: one instruction, hardware-distributed registers.
fragment<accumulator, 16, 16, 16, float> c_frag;
mma_sync(c_frag, a_frag, b_frag, c_frag);

Vectorized loads become TMA. The careful float4 casts, alignment checks, and manual loop unrolling that wrung bandwidth out of global loads collapse into a single descriptor-driven TMA call. The descriptor already encodes shape, strides, alignment, and swizzling, so one cp.async.bulk.tensor computes the addresses, issues the widest transactions, applies the bank-conflict-avoiding swizzle, overlaps with compute, and trips the barrier on arrival.

Shared memory becomes implicit. A CUDA-core kernel stages global to shared, calls __syncthreads(), then loads shared to registers before computing, three explicit stages. WMMA still exposes the shared-to-fragment load. WGMMA fuses the shared-to-register read into the MMA instruction itself; the intermediate load disappears from your code, cutting instruction count, register pressure, and synchronization. Shared memory is still used (it remains the fastest on-chip path), just no longer micromanaged. See Shared Memory, Bank Conflicts, and Tiling.

What you still optimize manually¶

Hardware automation does not remove the architecture decisions. You remain responsible for:

Launch geometry and tile shapes. Grid/block dimensions and tile sizes drive occupancy; the hardware handles a tile, but you partition the full matrix and iterate K.
Layout conversion. WGMMA expects column-major. Converting a 4096x4096 matrix from row-major can cost ~30ms, more than the GEMM itself. Preconvert, store column-major natively, or fall back to WMMA for row-major data.
Pipeline design. Number of producer threads, circular-buffer stage count, and which barriers coordinate producers and consumers.
Epilogue math. Bias, activation, scaling, and residuals are separate from the MMA (or fused via CUTLASS), not done by the Tensor Core hardware.
Synchronization. wgmma.mma_async is asynchronous, so you place warpgroup_arrive(), warpgroup_commit_batch(), and warpgroup_wait<N>() yourself. Wrong placement causes races or deadlocks.

Failure modes¶

Tiles too small. Matrices below ~64x64 do not amortize launch and fragment-load overhead; Tensor Cores sit idle. Batch the work or stay on CUDA cores.
Wrong layout for WGMMA. Feeding row-major data forces a conversion that can dwarf the GEMM. Preconvert, store column-major, or use WMMA.
Accumulating in operand precision. Accumulate in FP32; accumulating in FP16 lets rounding error drift and destabilizes training. The accumulator fragment type must be float.
Missing or misplaced warp-group fences. Omitting warpgroup_arrive / commit_batch / wait around mma_async reads half-computed results or deadlocks.
Using __syncthreads() for TMA. It waits only on threads, not on the background DMA, so consumers can read a half-delivered tile. Use transaction-counting barriers (barrier_arrive_tx) with byte counts.
Warp-group divergence. WGMMA's 128 threads must hit the instruction at the same program counter; divergence breaks the sync requirement.
Too many warps per block. Splitting a 128x128 block across 32 warps starves each warp (only ~2 WMMA ops per K-step), so fragment-load overhead dominates; about 8 warps is the sweet spot.
Wrong -arch or SASS-only builds. WGMMA needs sm_90, TCGen05 needs sm_100 with CUDA 12.8+; a SASS-only fatbin will not forward-run on a newer GPU (see inline PTX and SASS).
Misaligned shared memory. The aligned qualifier requires naturally aligned descriptors and buffers; declare staging arrays alignas(128).

References¶

CUDA for Deep Learning (Manning, MEAP), Chapter 7 "Tensor Cores" — WMMA fragments and tiling hierarchy, WGMMA inline PTX and shared-memory descriptors, TMA producer-consumer pipelines, the 71-to-618 TFLOPS H100 progression, TCGen05 overview, and the CUDA-core-to-Tensor-Core pattern mapping. Performance figures in this page are book-reported, not benchmarked here.
NVIDIA, CUDA C++ Programming Guide — the Warp Matrix Functions (WMMA) section: fragment, load_matrix_sync, mma_sync, store_matrix_sync, supported tile shapes and types.
NVIDIA, Parallel Thread Execution (PTX) ISA — wgmma.mma_async, the cp.async.bulk.tensor (TMA) family, shared-memory matrix descriptors, and tcgen05.*.
NVIDIA, CUTLASS — open templated C++/CuTe library wrapping WMMA, WGMMA, and TCGen05 for production GEMM across sm_80/sm_90/sm_100.
NVIDIA, Blackwell Architecture — fifth-generation Tensor Cores, FP4, and second-generation TMA context for TCGen05.