PyTorch custom CUDA extensions¶
Scope: integrating a hand-written CUDA kernel into a PyTorch model. The two binding paths (torch.utils.cpp_extension, ahead-of-time via CUDAExtension or just-in-time via load()/load_inline(), and pybind11 for standalone kernels), wrapping the kernel as a torch.autograd.Function with forward/backward so it composes with autograd, and when to reach for a custom extension over OpenAI Triton or a cuBLAS/CUTLASS library call. The practical complement to writing the kernel itself (CUDA occupancy tuning).
Reference templates on real APIs; pin versions and validate before production use.
flowchart TB
CU["Hand-written kernel (.cu)<br/>__global__ + host launcher"]
BIND["Binding choice"]
PB["pybind11: PYBIND11_MODULE<br/>NumPy / raw pointers, no autograd"]
EXT["torch.utils.cpp_extension<br/>torch::Tensor, dispatcher"]
JIT["JIT: load() / load_inline()<br/>compile on first import, cached"]
AOT["AOT: setup.py CUDAExtension<br/>BuildExtension, prebuilt wheel"]
AF["torch.autograd.Function<br/>forward / backward"]
MOD["Use as an nn.Module op<br/>composes with autograd + torch.compile"]
CU --> BIND
BIND -->|"standalone / NumPy"| PB
BIND -->|"PyTorch tensors"| EXT
EXT --> JIT
EXT --> AOT
JIT --> AF
AOT --> AF
AF --> MOD
PB -.->|"no autograd; you manage device memory"| MOD
What it is¶
There are two established ways to call a hand-written CUDA kernel from Python, and the appendix this page is built on covers both as "essential tools for integrating custom CUDA kernels into Python workflows" (CUDA for Deep Learning, Manning MEAP, Appendix A.9):
pybind11. A lightweight, header-only library for creating Python bindings of C++ code. It is "perfect for standalone CUDA kernels that don't need PyTorch integration": you hand it NumPy arrays (or raw pointers) and you own the device-memory management end to end (A.9.1).torch.utils.cpp_extension/CUDAExtension. PyTorch's build helpers, "designed for kernels that integrate with PyTorch's autograd system and tensor operations" (A.9.2). Your C++ entry point receivestorch::Tensorarguments that are already device-resident, and the resulting op can be wrapped so gradients flow through it.
Both compile your .cu into a Python-importable module. The split that decides everything downstream: with pybind11 your kernel is an island (NumPy in, NumPy out, no autograd, manual cudaMalloc/cudaMemcpy); with cpp_extension it is a first-class PyTorch op that can become a differentiable layer.
When to use a custom extension (vs Triton vs cuBLAS)¶
Reach for a library first. For a standard GEMM or convolution, cuBLAS / cuDNN already ship near-peak, heavily tuned kernels and are the PyTorch default. A hand-written kernel will usually tie or lose at far higher cost. For a fused, memory-bound op chain authored in Python with autotuning, prefer OpenAI Triton; for templated GEMM with epilogue fusion in C++, prefer CUTLASS.
Reach for a hand-written CUDA extension when:
- You already have a
.cukernel (legacy code, a paper's reference implementation, a vendor sample) and need it callable from PyTorch now. - You need an instruction sequence or memory pattern that Triton does not expose for your target architecture, and you want to keep authoring in CUDA C++ rather than a CUTLASS template.
- You are binding a standalone numerical kernel to NumPy with no PyTorch dependency at all. That is the
pybind11path.
The kernel itself is out of scope here; see CUDA occupancy tuning, kernel fusion, and Tensor Cores and mixed precision. This page is the wrapping-and-shipping half.
Path 1: torch.utils.cpp_extension (JIT load and AOT CUDAExtension)¶
This is the path for kernels that must behave like real PyTorch ops. One .cu holds the kernel and the C++ entry point; torch/extension.h pulls in tensors, TORCH_CHECK, the AT_DISPATCH_* type-dispatch macros, and the binding macros. The kernel below is the templated tiled-naive matmul from the appendix (A.9.2), with the device-memory checks and a launch-error check made explicit.
// matmul.cu
#include <torch/extension.h>
#include <c10/cuda/CUDAException.h> // C10_CUDA_KERNEL_LAUNCH_CHECK
template <typename scalar_t>
__global__ void matmul_kernel(const scalar_t* A, const scalar_t* B, scalar_t* C,
int M, int N, int K) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < M && col < N) {
scalar_t sum = 0;
for (int k = 0; k < K; ++k)
sum += A[row * K + k] * B[k * N + col]; // row-major, contiguous
C[row * N + col] = sum;
}
}
torch::Tensor matmul_cuda(torch::Tensor A, torch::Tensor B) {
TORCH_CHECK(A.is_cuda() && B.is_cuda(), "inputs must be CUDA tensors");
TORCH_CHECK(A.dim() == 2 && B.dim() == 2, "inputs must be 2D");
TORCH_CHECK(A.size(1) == B.size(0), "inner dimensions must match");
A = A.contiguous(); // the naive kernel assumes row-major contiguity
B = B.contiguous();
const int M = A.size(0), K = A.size(1), N = B.size(1);
auto C = torch::zeros({M, N}, A.options());
const dim3 threads(16, 16);
const dim3 blocks((N + threads.x - 1) / threads.x,
(M + threads.y - 1) / threads.y);
AT_DISPATCH_FLOATING_TYPES(A.scalar_type(), "matmul_cuda", ([&] {
matmul_kernel<scalar_t><<<blocks, threads>>>(
A.data_ptr<scalar_t>(), B.data_ptr<scalar_t>(), C.data_ptr<scalar_t>(),
M, N, K);
}));
C10_CUDA_KERNEL_LAUNCH_CHECK(); // kernel launch errors are asynchronous; check now
return C;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("matmul", &matmul_cuda, "Naive matrix multiply (CUDA)");
}
TORCH_EXTENSION_NAME is filled in at build time with the module name you request, so the same source serves both build modes below. AT_DISPATCH_FLOATING_TYPES instantiates the kernel for float and double (so gradcheck in double works); add AT_DISPATCH_FLOATING_TYPES_AND_HALF if you need FP16.
JIT: load() (compile on first import). No build step, no setup.py; the first call invokes nvcc, then the result is cached under TORCH_EXTENSIONS_DIR (default ~/.cache/torch_extensions). Best for iteration.
import torch
from torch.utils.cpp_extension import load
ext = load( # detects .cu, compiles with nvcc, returns the module
name="naive_matmul", # becomes TORCH_EXTENSION_NAME
sources=["matmul.cu"],
extra_cuda_cflags=["-O3"],
verbose=True,
)
# ext.matmul(A, B) is now callable. For a few-line kernel, load_inline(cuda_sources=...)
# takes the source as a string and skips the separate .cu file.
AOT: CUDAExtension + setup.py (prebuilt, shippable). Compile once into a wheel; import is instant and the target host needs no compiler. This is the appendix's setup.py (A.9.2):
# setup.py
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
setup(
name="naive_matmul",
ext_modules=[
CUDAExtension(
name="naive_matmul",
sources=["matmul.cu"],
extra_compile_args={"cxx": ["-O3"], "nvcc": ["-O3"]},
)
],
cmdclass={"build_ext": BuildExtension},
)
python setup.py build_ext --inplace # or: pip install .
python -c "import torch, naive_matmul; print(naive_matmul.matmul)"
For AOT, set TORCH_CUDA_ARCH_LIST (e.g. "8.0;9.0") so nvcc emits SASS/PTX for the GPUs you ship to; an unlisted architecture means the kernel is not built for that card.
Modern dispatcher registration (alternative to PYBIND11_MODULE). A raw PYBIND11_MODULE op is opaque to torch.compile (it forces a graph break). To register a real dispatcher op that torch.compile and the custom-ops system can reason about, swap the binding for TORCH_LIBRARY:
TORCH_LIBRARY(naive, m) {
m.def("matmul(Tensor a, Tensor b) -> Tensor");
}
TORCH_LIBRARY_IMPL(naive, CUDA, m) {
m.impl("matmul", &matmul_cuda);
}
// Python: torch.ops.naive.matmul(a, b)
Path 2: pybind11 (standalone, NumPy)¶
When the kernel has no business knowing about PyTorch, pybind11 is lighter: minimal dependencies, faster compiles, NumPy in and out. The catch the appendix's narrative implies but the snippet elides: a NumPy array is host memory, so the host wrapper must stage it through device memory before the kernel can touch it. The appendix's own example checklist calls out "Proper memory management" and a "Complete workflow from host to device and back," so the cudaMalloc/cudaMemcpy below is the workflow the kernel requires, not added scope.
// vector_add.cu -- kernel + host launcher (NumPy buffers are host memory)
#include <cuda_runtime.h>
__global__ void vector_add_kernel(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) c[idx] = a[idx] + b[idx];
}
extern "C" void vector_add_cuda(const float* a, const float* b, float* c, int n) {
float *da, *db, *dc;
size_t bytes = static_cast<size_t>(n) * sizeof(float);
cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice); // host -> device
cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
int threads = 256, blocks = (n + threads - 1) / threads;
vector_add_kernel<<<blocks, threads>>>(da, db, dc, n);
cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost); // device -> host
cudaFree(da); cudaFree(db); cudaFree(dc);
}
// binding.cpp -- pybind11 binding over NumPy arrays
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <stdexcept>
namespace py = pybind11;
extern "C" void vector_add_cuda(const float* a, const float* b, float* c, int n);
void vector_add(py::array_t<float> a, py::array_t<float> b, py::array_t<float> c) {
if (a.ndim() != 1 || b.ndim() != 1 || c.ndim() != 1)
throw std::runtime_error("arrays must be 1D");
if (a.size() != b.size() || a.size() != c.size())
throw std::runtime_error("arrays must have the same size");
vector_add_cuda(a.data(), b.data(), c.mutable_data(),
static_cast<int>(a.size()));
}
PYBIND11_MODULE(vector_add_ext, m) {
m.doc() = "CUDA vector addition extension";
m.def("vector_add", &vector_add, "Add two vectors using CUDA");
}
Pybind11Extension compiles .cpp with the host compiler, not nvcc, so the cleanest build precompiles the .cu with nvcc and links it:
nvcc -O3 -Xcompiler -fPIC -c vector_add.cu -o vector_add.o
c++ -O3 -shared -fPIC $(python -m pybind11 --includes) \
binding.cpp vector_add.o -o vector_add_ext$(python3-config --extension-suffix) \
-L/usr/local/cuda/lib64 -lcudart
python -c "import numpy as np, vector_add_ext; \
a=np.ones(8,np.float32); b=np.ones(8,np.float32); c=np.zeros(8,np.float32); \
vector_add_ext.vector_add(a,b,c); print(c)"
There is no autograd here and no PyTorch tensor. This is the path for "scientific computing kernels, image-processing pipelines, custom numerical libraries, and standalone GPU utilities" (A.9.3).
Wrapping as autograd.Function¶
A cpp_extension op only computes a forward result. To make it a differentiable layer that composes with the rest of the network, wrap it in a torch.autograd.Function: forward runs the kernel and stashes what backward needs; backward returns one gradient per forward input. For C = A @ B the gradients are exact and standard: dL/dA = grad @ Bᵀ, dL/dB = Aᵀ @ grad.
import torch
from torch.utils.cpp_extension import load
ext = load(name="naive_matmul", sources=["matmul.cu"],
extra_cuda_cflags=["-O3"], verbose=True)
class NaiveMatmul(torch.autograd.Function):
@staticmethod
def forward(ctx, a, b):
ctx.save_for_backward(a, b)
return ext.matmul(a, b) # custom CUDA forward
@staticmethod
def backward(ctx, grad_out):
a, b = ctx.saved_tensors
grad_a = grad_out @ b.t() # dL/dA = grad @ B^T
grad_b = a.t() @ grad_out # dL/dB = A^T @ grad
return grad_a, grad_b # one grad per forward input
def naive_matmul(a, b):
return NaiveMatmul.apply(a, b)
class NaiveLinear(torch.nn.Module): # now usable like any layer
def __init__(self, in_f, out_f):
super().__init__()
self.weight = torch.nn.Parameter(torch.randn(in_f, out_f, device="cuda"))
def forward(self, x):
return naive_matmul(x, self.weight)
Validate the gradient against finite differences before trusting it. gradcheck runs in double precision, which the kernel supports via AT_DISPATCH_FLOATING_TYPES:
A = torch.randn(64, 32, device="cuda", dtype=torch.double, requires_grad=True)
B = torch.randn(32, 16, device="cuda", dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(naive_matmul, (A, B)) # numeric vs analytic Jacobian
# forward sanity check against the reference op
Af = A.detach().float(); Bf = B.detach().float()
assert torch.allclose(naive_matmul(Af, Bf), Af @ Bf, rtol=1e-3, atol=1e-3)
The backward here routes through torch.matmul (cuBLAS), which is correct and minimal. If you reuse the naive kernel for the backward matmuls instead, call .contiguous() on the transposed operands first; a transposed view is not row-major-contiguous, and the kernel's A[row*K+k] indexing assumes it is. For vmap/torch.func compatibility, split the classic two-argument forward into a forward plus a setup_context staticmethod (Extending PyTorch); the classic form above is fully supported for ordinary autograd.
pybind11 vs CUDAExtension (when to use each)¶
The two approaches "solve overlapping but distinct problems"; pick on whether you need standalone bindings or first-class PyTorch interop (A.9.3). The appendix also notes both have "similar performance characteristics": the choice is about integration, not speed.
| Decision factor | Use pybind11 |
Use cpp_extension / CUDAExtension |
|---|---|---|
| PyTorch interop | None — standalone kernel | Kernels work directly on torch::Tensor |
| Autograd | Not available | First-class via torch.autograd.Function |
| Data type | NumPy arrays / raw pointers | PyTorch tensors (already device-resident) |
| Device memory | You manage cudaMalloc/cudaMemcpy |
PyTorch caching allocator / tensor .options() |
| Dependencies | Minimal, faster compiles | Pulls in the PyTorch C++ stack |
torch.compile |
Opaque (graph break) | TORCH_LIBRARY op composes with Inductor |
| Best for | Scientific kernels, image pipelines, standalone GPU utilities | Custom NN layers, training ops, inference engines, research prototypes |
In short: standalone numerical kernel and/or NumPy workflow, take pybind11; the kernel has to be a differentiable PyTorch op, take cpp_extension.
Failure modes¶
- Host pointers into a device kernel. Passing a NumPy buffer (or any host pointer) straight to
kernel<<<>>>is an illegal memory access on a discrete GPU. Stage throughcudaMalloc/cudaMemcpy(Path 2) or use mapped/managed memory. PyTorch CUDA tensors are already device-resident, so thecpp_extensionpath avoids this. - Non-contiguous / transposed tensors. A row-major kernel indexing
A[row*K+k]silently produces wrong numbers for a transposed or strided view. Call.contiguous()(or assert withTORCH_CHECK(A.is_contiguous())) before launch. - ABI / version mismatch. A compiled extension is bound to the PyTorch build and CUDA toolkit it was compiled against (C++ ABI,
torchheaders,libcudart). Importing it under a different PyTorch is undefined; rebuild on every PyTorch or CUDA upgrade, and keepnvcc's host compiler compatible with the one that built PyTorch. - Missing architecture in an AOT build. Without the GPU's SM in
TORCH_CUDA_ARCH_LIST, the kernel is not compiled for that card and fails to launch (or silently falls back to PTX JIT). List every deployment architecture. - Unchecked launch errors. Kernel launches are asynchronous; an error surfaces on a later, unrelated CUDA call. Use
C10_CUDA_KERNEL_LAUNCH_CHECK()(orcudaGetLastError()after the launch) so failures point at the right kernel. torch.compilegraph breaks. APYBIND11_MODULEbinding is opaque to Inductor and breaks the graph. Register the op withTORCH_LIBRARY(a custom op) sotorch.compiletreats it as a composable node; see torch.compile and OpenAI Triton for the same custom-op pattern.- Wrong gradient count or unsaved tensors.
backwardmust return exactly one gradient perforwardinput (useNonefor non-differentiable inputs), and any tensor it reads must be kept viactx.save_for_backward. Always confirm withtorch.autograd.gradcheckin double precision.
Verify the win, do not assume it: profile with Nsight Systems / Nsight Compute and validate correctness with the diagnostics workflow.
References¶
- CUDA for Deep Learning (Manning MEAP), Appendix A.9 "Python bindings for CUDA kernels" —
pybind11vs PyTorchCUDAExtension, the vector-add and matmul examples (matmul_kernel,TORCH_CHECK,AT_DISPATCH_FLOATING_TYPES,PYBIND11_MODULE(TORCH_EXTENSION_NAME),CUDAExtension/BuildExtensionsetup.py), and the "when to use each" guidance. - PyTorch, Custom C++ and CUDA Extensions tutorial — end-to-end
cpp_extensionop:.cukernel, C++ binding, JITload/load_inlinevs AOTsetup.py, and thetorch.autograd.Functionwrapper. - PyTorch, torch.utils.cpp_extension API —
CUDAExtension,BuildExtension,load,load_inline, and the build/include flags used above. - PyTorch, Extending PyTorch —
torch.autograd.Functioncontract (forward/backward,save_for_backward,setup_context) andgradcheck. - PyTorch, Custom ops landing page — registering a kernel as a dispatcher op (
TORCH_LIBRARY) so it composes withtorch.compile. - pybind11, documentation — the header-only binding library,
PYBIND11_MODULE, and the NumPyarray_tbuffer interface.
Related: OpenAI Triton · CUTLASS GEMM · CUDA occupancy tuning · torch.compile · Glossary