Skip to content
Markdown

PyTorch custom CUDA extensions

Scope: integrating a hand-written CUDA kernel into a PyTorch model. The two binding paths (torch.utils.cpp_extension, ahead-of-time via CUDAExtension or just-in-time via load()/load_inline(), and pybind11 for standalone kernels), wrapping the kernel as a torch.autograd.Function with forward/backward so it composes with autograd, and when to reach for a custom extension over OpenAI Triton or a cuBLAS/CUTLASS library call. The practical complement to writing the kernel itself (CUDA occupancy tuning).

Reference templates on real APIs; pin versions and validate before production use.

flowchart TB
    CU["Hand-written kernel (.cu)<br/>__global__ + host launcher"]
    BIND["Binding choice"]
    PB["pybind11: PYBIND11_MODULE<br/>NumPy / raw pointers, no autograd"]
    EXT["torch.utils.cpp_extension<br/>torch::Tensor, dispatcher"]
    JIT["JIT: load() / load_inline()<br/>compile on first import, cached"]
    AOT["AOT: setup.py CUDAExtension<br/>BuildExtension, prebuilt wheel"]
    AF["torch.autograd.Function<br/>forward / backward"]
    MOD["Use as an nn.Module op<br/>composes with autograd + torch.compile"]

    CU --> BIND
    BIND -->|"standalone / NumPy"| PB
    BIND -->|"PyTorch tensors"| EXT
    EXT --> JIT
    EXT --> AOT
    JIT --> AF
    AOT --> AF
    AF --> MOD
    PB -.->|"no autograd; you manage device memory"| MOD

What it is

There are two established ways to call a hand-written CUDA kernel from Python, and the appendix this page is built on covers both as "essential tools for integrating custom CUDA kernels into Python workflows" (CUDA for Deep Learning, Manning MEAP, Appendix A.9):

  • pybind11. A lightweight, header-only library for creating Python bindings of C++ code. It is "perfect for standalone CUDA kernels that don't need PyTorch integration": you hand it NumPy arrays (or raw pointers) and you own the device-memory management end to end (A.9.1).
  • torch.utils.cpp_extension / CUDAExtension. PyTorch's build helpers, "designed for kernels that integrate with PyTorch's autograd system and tensor operations" (A.9.2). Your C++ entry point receives torch::Tensor arguments that are already device-resident, and the resulting op can be wrapped so gradients flow through it.

Both compile your .cu into a Python-importable module. The split that decides everything downstream: with pybind11 your kernel is an island (NumPy in, NumPy out, no autograd, manual cudaMalloc/cudaMemcpy); with cpp_extension it is a first-class PyTorch op that can become a differentiable layer.

When to use a custom extension (vs Triton vs cuBLAS)

Reach for a library first. For a standard GEMM or convolution, cuBLAS / cuDNN already ship near-peak, heavily tuned kernels and are the PyTorch default. A hand-written kernel will usually tie or lose at far higher cost. For a fused, memory-bound op chain authored in Python with autotuning, prefer OpenAI Triton; for templated GEMM with epilogue fusion in C++, prefer CUTLASS.

Reach for a hand-written CUDA extension when:

  • You already have a .cu kernel (legacy code, a paper's reference implementation, a vendor sample) and need it callable from PyTorch now.
  • You need an instruction sequence or memory pattern that Triton does not expose for your target architecture, and you want to keep authoring in CUDA C++ rather than a CUTLASS template.
  • You are binding a standalone numerical kernel to NumPy with no PyTorch dependency at all. That is the pybind11 path.

The kernel itself is out of scope here; see CUDA occupancy tuning, kernel fusion, and Tensor Cores and mixed precision. This page is the wrapping-and-shipping half.

Path 1: torch.utils.cpp_extension (JIT load and AOT CUDAExtension)

This is the path for kernels that must behave like real PyTorch ops. One .cu holds the kernel and the C++ entry point; torch/extension.h pulls in tensors, TORCH_CHECK, the AT_DISPATCH_* type-dispatch macros, and the binding macros. The kernel below is the templated tiled-naive matmul from the appendix (A.9.2), with the device-memory checks and a launch-error check made explicit.

// matmul.cu
#include <torch/extension.h>
#include <c10/cuda/CUDAException.h>   // C10_CUDA_KERNEL_LAUNCH_CHECK

template <typename scalar_t>
__global__ void matmul_kernel(const scalar_t* A, const scalar_t* B, scalar_t* C,
                              int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        scalar_t sum = 0;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];   // row-major, contiguous
        C[row * N + col] = sum;
    }
}

torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
    TORCH_CHECK(A.is_cuda() && B.is_cuda(), "inputs must be CUDA tensors");
    TORCH_CHECK(A.dim() == 2 && B.dim() == 2, "inputs must be 2D");
    TORCH_CHECK(A.size(1) == B.size(0), "inner dimensions must match");
    A = A.contiguous();                 // the naive kernel assumes row-major contiguity
    B = B.contiguous();
    const int M = A.size(0), K = A.size(1), N = B.size(1);
    auto C = torch::zeros({M, N}, A.options());

    const dim3 threads(16, 16);
    const dim3 blocks((N + threads.x - 1) / threads.x,
                      (M + threads.y - 1) / threads.y);
    AT_DISPATCH_FLOATING_TYPES(A.scalar_type(), "matmul_cuda", ([&] {
        matmul_kernel<scalar_t><<<blocks, threads>>>(
            A.data_ptr<scalar_t>(), B.data_ptr<scalar_t>(), C.data_ptr<scalar_t>(),
            M, N, K);
    }));
    C10_CUDA_KERNEL_LAUNCH_CHECK();     // kernel launch errors are asynchronous; check now
    return C;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("matmul", &matmul_cuda, "Naive matrix multiply (CUDA)");
}

TORCH_EXTENSION_NAME is filled in at build time with the module name you request, so the same source serves both build modes below. AT_DISPATCH_FLOATING_TYPES instantiates the kernel for float and double (so gradcheck in double works); add AT_DISPATCH_FLOATING_TYPES_AND_HALF if you need FP16.

JIT: load() (compile on first import). No build step, no setup.py; the first call invokes nvcc, then the result is cached under TORCH_EXTENSIONS_DIR (default ~/.cache/torch_extensions). Best for iteration.

import torch
from torch.utils.cpp_extension import load

ext = load(                         # detects .cu, compiles with nvcc, returns the module
    name="naive_matmul",            # becomes TORCH_EXTENSION_NAME
    sources=["matmul.cu"],
    extra_cuda_cflags=["-O3"],
    verbose=True,
)
# ext.matmul(A, B) is now callable. For a few-line kernel, load_inline(cuda_sources=...)
# takes the source as a string and skips the separate .cu file.

AOT: CUDAExtension + setup.py (prebuilt, shippable). Compile once into a wheel; import is instant and the target host needs no compiler. This is the appendix's setup.py (A.9.2):

# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="naive_matmul",
    ext_modules=[
        CUDAExtension(
            name="naive_matmul",
            sources=["matmul.cu"],
            extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
python setup.py build_ext --inplace     # or: pip install .
python -c "import torch, naive_matmul; print(naive_matmul.matmul)"

For AOT, set TORCH_CUDA_ARCH_LIST (e.g. "8.0;9.0") so nvcc emits SASS/PTX for the GPUs you ship to; an unlisted architecture means the kernel is not built for that card.

Modern dispatcher registration (alternative to PYBIND11_MODULE). A raw PYBIND11_MODULE op is opaque to torch.compile (it forces a graph break). To register a real dispatcher op that torch.compile and the custom-ops system can reason about, swap the binding for TORCH_LIBRARY:

TORCH_LIBRARY(naive, m) {
    m.def("matmul(Tensor a, Tensor b) -> Tensor");
}
TORCH_LIBRARY_IMPL(naive, CUDA, m) {
    m.impl("matmul", &matmul_cuda);
}
// Python: torch.ops.naive.matmul(a, b)

Path 2: pybind11 (standalone, NumPy)

When the kernel has no business knowing about PyTorch, pybind11 is lighter: minimal dependencies, faster compiles, NumPy in and out. The catch the appendix's narrative implies but the snippet elides: a NumPy array is host memory, so the host wrapper must stage it through device memory before the kernel can touch it. The appendix's own example checklist calls out "Proper memory management" and a "Complete workflow from host to device and back," so the cudaMalloc/cudaMemcpy below is the workflow the kernel requires, not added scope.

// vector_add.cu  -- kernel + host launcher (NumPy buffers are host memory)
#include <cuda_runtime.h>

__global__ void vector_add_kernel(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

extern "C" void vector_add_cuda(const float* a, const float* b, float* c, int n) {
    float *da, *db, *dc;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    int threads = 256, blocks = (n + threads - 1) / threads;
    vector_add_kernel<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(da); cudaFree(db); cudaFree(dc);
}
// binding.cpp  -- pybind11 binding over NumPy arrays
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <stdexcept>

namespace py = pybind11;
extern "C" void vector_add_cuda(const float* a, const float* b, float* c, int n);

void vector_add(py::array_t<float> a, py::array_t<float> b, py::array_t<float> c) {
    if (a.ndim() != 1 || b.ndim() != 1 || c.ndim() != 1)
        throw std::runtime_error("arrays must be 1D");
    if (a.size() != b.size() || a.size() != c.size())
        throw std::runtime_error("arrays must have the same size");
    vector_add_cuda(a.data(), b.data(), c.mutable_data(),
                    static_cast<int>(a.size()));
}

PYBIND11_MODULE(vector_add_ext, m) {
    m.doc() = "CUDA vector addition extension";
    m.def("vector_add", &vector_add, "Add two vectors using CUDA");
}

Pybind11Extension compiles .cpp with the host compiler, not nvcc, so the cleanest build precompiles the .cu with nvcc and links it:

nvcc -O3 -Xcompiler -fPIC -c vector_add.cu -o vector_add.o
c++  -O3 -shared -fPIC $(python -m pybind11 --includes) \
     binding.cpp vector_add.o -o vector_add_ext$(python3-config --extension-suffix) \
     -L/usr/local/cuda/lib64 -lcudart
python -c "import numpy as np, vector_add_ext; \
a=np.ones(8,np.float32); b=np.ones(8,np.float32); c=np.zeros(8,np.float32); \
vector_add_ext.vector_add(a,b,c); print(c)"

There is no autograd here and no PyTorch tensor. This is the path for "scientific computing kernels, image-processing pipelines, custom numerical libraries, and standalone GPU utilities" (A.9.3).

Wrapping as autograd.Function

A cpp_extension op only computes a forward result. To make it a differentiable layer that composes with the rest of the network, wrap it in a torch.autograd.Function: forward runs the kernel and stashes what backward needs; backward returns one gradient per forward input. For C = A @ B the gradients are exact and standard: dL/dA = grad @ Bᵀ, dL/dB = Aᵀ @ grad.

import torch
from torch.utils.cpp_extension import load

ext = load(name="naive_matmul", sources=["matmul.cu"],
           extra_cuda_cflags=["-O3"], verbose=True)

class NaiveMatmul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
        return ext.matmul(a, b)                 # custom CUDA forward

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        grad_a = grad_out @ b.t()               # dL/dA = grad @ B^T
        grad_b = a.t() @ grad_out               # dL/dB = A^T @ grad
        return grad_a, grad_b                    # one grad per forward input

def naive_matmul(a, b):
    return NaiveMatmul.apply(a, b)

class NaiveLinear(torch.nn.Module):             # now usable like any layer
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(in_f, out_f, device="cuda"))
    def forward(self, x):
        return naive_matmul(x, self.weight)

Validate the gradient against finite differences before trusting it. gradcheck runs in double precision, which the kernel supports via AT_DISPATCH_FLOATING_TYPES:

A = torch.randn(64, 32, device="cuda", dtype=torch.double, requires_grad=True)
B = torch.randn(32, 16, device="cuda", dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(naive_matmul, (A, B))   # numeric vs analytic Jacobian

# forward sanity check against the reference op
Af = A.detach().float(); Bf = B.detach().float()
assert torch.allclose(naive_matmul(Af, Bf), Af @ Bf, rtol=1e-3, atol=1e-3)

The backward here routes through torch.matmul (cuBLAS), which is correct and minimal. If you reuse the naive kernel for the backward matmuls instead, call .contiguous() on the transposed operands first; a transposed view is not row-major-contiguous, and the kernel's A[row*K+k] indexing assumes it is. For vmap/torch.func compatibility, split the classic two-argument forward into a forward plus a setup_context staticmethod (Extending PyTorch); the classic form above is fully supported for ordinary autograd.

pybind11 vs CUDAExtension (when to use each)

The two approaches "solve overlapping but distinct problems"; pick on whether you need standalone bindings or first-class PyTorch interop (A.9.3). The appendix also notes both have "similar performance characteristics": the choice is about integration, not speed.

Decision factor Use pybind11 Use cpp_extension / CUDAExtension
PyTorch interop None — standalone kernel Kernels work directly on torch::Tensor
Autograd Not available First-class via torch.autograd.Function
Data type NumPy arrays / raw pointers PyTorch tensors (already device-resident)
Device memory You manage cudaMalloc/cudaMemcpy PyTorch caching allocator / tensor .options()
Dependencies Minimal, faster compiles Pulls in the PyTorch C++ stack
torch.compile Opaque (graph break) TORCH_LIBRARY op composes with Inductor
Best for Scientific kernels, image pipelines, standalone GPU utilities Custom NN layers, training ops, inference engines, research prototypes

In short: standalone numerical kernel and/or NumPy workflow, take pybind11; the kernel has to be a differentiable PyTorch op, take cpp_extension.

Failure modes

  • Host pointers into a device kernel. Passing a NumPy buffer (or any host pointer) straight to kernel<<<>>> is an illegal memory access on a discrete GPU. Stage through cudaMalloc/cudaMemcpy (Path 2) or use mapped/managed memory. PyTorch CUDA tensors are already device-resident, so the cpp_extension path avoids this.
  • Non-contiguous / transposed tensors. A row-major kernel indexing A[row*K+k] silently produces wrong numbers for a transposed or strided view. Call .contiguous() (or assert with TORCH_CHECK(A.is_contiguous())) before launch.
  • ABI / version mismatch. A compiled extension is bound to the PyTorch build and CUDA toolkit it was compiled against (C++ ABI, torch headers, libcudart). Importing it under a different PyTorch is undefined; rebuild on every PyTorch or CUDA upgrade, and keep nvcc's host compiler compatible with the one that built PyTorch.
  • Missing architecture in an AOT build. Without the GPU's SM in TORCH_CUDA_ARCH_LIST, the kernel is not compiled for that card and fails to launch (or silently falls back to PTX JIT). List every deployment architecture.
  • Unchecked launch errors. Kernel launches are asynchronous; an error surfaces on a later, unrelated CUDA call. Use C10_CUDA_KERNEL_LAUNCH_CHECK() (or cudaGetLastError() after the launch) so failures point at the right kernel.
  • torch.compile graph breaks. A PYBIND11_MODULE binding is opaque to Inductor and breaks the graph. Register the op with TORCH_LIBRARY (a custom op) so torch.compile treats it as a composable node; see torch.compile and OpenAI Triton for the same custom-op pattern.
  • Wrong gradient count or unsaved tensors. backward must return exactly one gradient per forward input (use None for non-differentiable inputs), and any tensor it reads must be kept via ctx.save_for_backward. Always confirm with torch.autograd.gradcheck in double precision.

Verify the win, do not assume it: profile with Nsight Systems / Nsight Compute and validate correctness with the diagnostics workflow.

References

  • CUDA for Deep Learning (Manning MEAP), Appendix A.9 "Python bindings for CUDA kernels" — pybind11 vs PyTorch CUDAExtension, the vector-add and matmul examples (matmul_kernel, TORCH_CHECK, AT_DISPATCH_FLOATING_TYPES, PYBIND11_MODULE(TORCH_EXTENSION_NAME), CUDAExtension/BuildExtension setup.py), and the "when to use each" guidance.
  • PyTorch, Custom C++ and CUDA Extensions tutorial — end-to-end cpp_extension op: .cu kernel, C++ binding, JIT load/load_inline vs AOT setup.py, and the torch.autograd.Function wrapper.
  • PyTorch, torch.utils.cpp_extension APICUDAExtension, BuildExtension, load, load_inline, and the build/include flags used above.
  • PyTorch, Extending PyTorchtorch.autograd.Function contract (forward/backward, save_for_backward, setup_context) and gradcheck.
  • PyTorch, Custom ops landing page — registering a kernel as a dispatcher op (TORCH_LIBRARY) so it composes with torch.compile.
  • pybind11, documentation — the header-only binding library, PYBIND11_MODULE, and the NumPy array_t buffer interface.

Related: OpenAI Triton · CUTLASS GEMM · CUDA occupancy tuning · torch.compile · Glossary