cuTile Rust: safe tile-based GPU kernels¶
Scope: cuTile Rust (arXiv 2606.15991), the tile-based system for writing GPU kernels in safe, idiomatic Rust: how Rust's aliasing-XOR-mutability rule maps onto partitioned tensor outputs, how the generated launch boundary preserves ownership while GPU work is in flight, the lazy DeviceOp execution model (sync, async, CUDA graph), and what the safety costs at runtime (measured: nothing, within noise). It sits beside the other kernel-authoring surfaces in this KB: Triton (Python tile DSL, no static safety), CUTLASS (templated C++), and PyTorch CUDA extensions (binding hand-written kernels); the execution machinery it wraps is covered in CUDA graphs and streams and concurrency.
The Rust snippets are reference templates quoted from the paper, unexecuted here (no Rust GPU toolchain in this environment); as of 2026-07 cuTile Rust's standalone public repository is NVlabs/cutile-rs (the paper's own artifact, v0.2.0, and a dependency of Grout), so treat availability as evolving and verify against current releases. Benchmark numbers are the paper's (B200, RTX 5090), not reproduced. The Python example is executed and asserted.
flowchart TB
SRC["kernel.rs<br/>#[cutile::module] + #[cutile::entry()]"] --> MACRO["proc macro emits: desugared Rust,<br/>typed host launcher, embedded kernel AST"]
MACRO --> RUSTC["rustc verifies ownership and types<br/>(host binary)"]
RUSTC --> JIT["first launch: runtime JIT<br/>kernel AST to Tile IR to cubin"]
HOST["Host: partition mutable output<br/>Partition<Tensor> (disjoint sub-tensors)"] --> LAUNCH{"Launch boundary:<br/>prepare() / recover()"}
LAUNCH -->|"&mut Tensor: exclusive, 1:1 per tile program"| TP["Tile programs<br/>(single-threaded semantics, token-ordered)"]
LAUNCH -->|"&Tensor: shared, broadcast"| TP
JIT --> TP
TP --> MODES["Execution: .sync() | .await | .graph()"]
What it is¶
cuTile Rust is a tile-based programming model that carries Rust's safety guarantees across both host and device code. Existing Rust-on-GPU paths (the LLVM PTX backend, Rust-CUDA, rust-gpu, CubeCL) show that Rust compiles to efficient GPU machine code, but they treat the kernel itself as unsafe code: SPMD threads, raw coordinates, and programmer-managed disjointness sit outside the borrow checker. cuTile Rust closes that gap by raising the device program to the tile level. Tiles are immutable array-like values of fixed size; a kernel is a grid of tile programs, each modeled as a single logical thread over tiles; loads from tensors produce tiles, tile operations produce new tiles, and stores target mutable sub-tensors.1
The ownership move is the core idea. A mutable output is partitioned on the host into disjoint sub-tensors before launch (.partition([128])); each tile program receives an exclusive &mut Tensor over exactly one sub-tensor, and the tile-program-to-sub-tensor mapping is injective by construction. Immutable inputs pass as shared &Tensor views broadcast to all programs. Rust's aliasing-XOR-mutability rule therefore holds across the launch: one mutable reference per output region, any number of shared readers. For kernels where one program must own several output sub-tensors (a GEMM reusing K-dimension operands), MappedPartitionMut plus compiler-branded PartitionIndex values and bounded dimension iterators let the front end prove disjointness statically, without dynamic bounds checks in hot loops.1
It shares the Tile IR backend with cuTile Python and CUDA Tile C++; what Rust adds is that shape and ownership facts become ordinary type and borrow facts checked before launch.2
Why use it¶
- Safety measured at zero runtime cost. On a B200, the safe mapped-partition GEMM (f16, M=N=K=8192) reaches 2.07 PFlop/s: 92% of the device's dense f16 peak and 96.4% of cuBLAS, with an unsafe raw-pointer variant of the same schedule matching within 0.3%. Memory-bound element-wise add at N=2^28 hits 7.02 TB/s for both safe and unsafe Rust (cuTile Python 7.01 TB/s) against a 7.68 TB/s theoretical peak.3
- A real class of races becomes inexpressible. The paper demonstrates a cuTile Python head-permutation kernel with swapped store indices: a data race that corrupts 17 to 35% of elements non-deterministically across runs. In cuTile Rust the store target is the partition view itself, not an index the programmer picks, so the bug cannot be written; Appendix A of the paper proves data-race freedom of the safe API against Tile IR's memory model.4
- The host side is typed too. Kernel launches are lazy
DeviceOpvalues that compose like iterators (then,zip!,shared) and only run when driven; the borrow checker verifies the whole launch-execute-return lifecycle, and host access to a tensor is blocked while GPU work that borrows it is in flight. - End-to-end proof. Grout, an open-source Qwen3 inference engine built on cuTile Rust with Hugging Face, reaches 171 tokens/s batch-1 decode for Qwen3-4B on an RTX 5090 and 82 tokens/s for Qwen3-32B on a B200, competitive with vLLM and SGLang and consistent with an HBM roofline check.5
When to use it (and when not)¶
- Use it where the kernel is tile-shaped: element-wise ops, GEMM, reductions, normalization, attention-style fused tensor ops. These map naturally onto whole-tile loads, stores, matmuls, and reductions, which is exactly the surface the safety checking covers.
- Use it when the host application is already Rust (inference engines, agents, services built on candle or Burn) and kernel authoring is the unsafe island left in an otherwise safe codebase.
- Do not use it when the kernel needs SIMT-level control: explicit warp primitives or manual shared-memory protocols are given up in exchange for single-threaded tile semantics (Tile IR performs implicit warp specialization, which mitigates but does not eliminate the loss). A clean safe SIMT model is explicitly future work.1
- Mind the maturity. The tensor API is young: even Grout drops to
unchecked_accessesor raw pointers for its attention and fused-norm kernels, and falls back to cuBLAS for model GEMMs at some sizes. Expect explicitunsafeislands for frontier kernels. - Weigh against Triton when the team is Python-first: same tile-level abstraction and comparable backend performance ambitions, without the ownership guarantees but without the Rust toolchain either.
Architecture¶
Safety is enforced at three places. On the host, tensors and partitions are ordinary owned values and borrows; Partition<Tensor<T>> moves a tensor into a launch, Partition<&mut Tensor> borrows it exclusively, and safe code can only construct mutable partitions through APIs that start from ownership or an exclusive borrow. Because CUDA launch grids are three-dimensional, mutable (partitioned) tensors are capped at rank 3; immutable tensors are broadcast, not partitioned, so their rank is unbound.
At the launch boundary, a proc macro generates a typed host launcher per #[cutile::entry()] function. The KernelInput/KernelOutput traits define a two-phase protocol: prepare() relinquishes host access and marshals pointers plus shape/stride/partition scalars; recover() returns the value in the same host type after the stream completes (same-type-in-same-type-out: pass an Arc<Tensor<T>>, get an Arc<Tensor<T>> back). A generated device entry reconstructs the matching Tile IR views, so the &mut/& distinction survives the raw-pointer ABI.
Inside the kernel, Tile IR orders memory operations with tokens rather than program order. The compiler threads a token chain through every operation on a &mut Tensor, establishing happens-before (a store after a load on the same mutable tensor observes it), while &Tensor reads carry no tokens and reorder freely for throughput. Rust's reference distinction becomes exactly the ordering the backend preserves, and nothing more.1
Execution composes lazily: a DeviceOp owns or borrows operands, exposes its output type, and runs in one of three modes: .sync() (block until complete), .await (yield the task, for async Rust hosts), or .graph() (capture as a CUDA graph and replay). CudaGraph::scope gives closure-based capture where s.record(op) adds a node without executing it, which is the borrow-safety fact that permits buffer reuse between recorded kernels; a GraphNode marker trait keeps allocating operations out of captures so replays never depend on unstable addresses (see CUDA graphs). Escape hatches are explicit and local: unchecked_accesses inside an unsafe fn disables bounds checks, and raw *mut T parameters give direct Tile IR access for patterns the tensor API cannot yet express.1
How to use it¶
The paper's complete element-wise add (reference template, unexecuted; API as of the paper, verify against current releases):
// Reference template (cuTile Rust, arXiv 2606.15991, Listing 1). No unsafe anywhere.
use cutile::prelude::*;
#[cutile::module]
mod kernel {
use cutile::core::*;
#[cutile::entry()]
fn add<const B: i32>(
z: &mut Tensor<f32, {[B]}>, // exclusive write (one sub-tensor per program)
x: &Tensor<f32, {[-1]}>, // shared read (broadcast)
y: &Tensor<f32, {[-1]}>,
) {
let tx = load_tile_like(x, z);
let ty = load_tile_like(y, z);
z.store(tx + ty);
}
}
fn main() -> Result<()> {
let x = api::ones::<f32>([1024]);
let y = api::ones::<f32>([1024]);
let z = api::zeros::<f32>([1024]).partition([128]); // 8 disjoint sub-tensors
let (_z, _x, _y) = kernel::add(z, x, y).sync()?; // ownership returned after sync
Ok(())
}
Choosing an execution mode is a measured cost decision. In the paper's pipeline sweep on an RTX 5090 (a small element-wise kernel repeated N times), per-kernel .sync_on(stream) converges to about 7.3 us/op, a chained .then() pipeline with one final .sync() and the async path both converge to about 3.4 us/op, and CUDA graph replay approaches the GPU dispatch limit at about 0.8 us/op; async carries a fixed per-pipeline callback cost (about 21 us at N=1), so it pays when the host has concurrent work to overlap, such as tool calling in agentic loops.3
How to develop with it¶
The discipline the compiler enforces is checkable without any GPU: mutable output claims must exactly cover the output, be pairwise disjoint, and be validated before anything executes. This model of it is executed and asserted, including non-divisible shapes, a rejected aliased launch plan, and a tiled matmul that matches the numpy reference exactly:
# cutile_ownership.py - validated: the ownership discipline cuTile Rust enforces at
# compile time, modeled at runtime. A partitioner splits a mutable output into
# disjoint tiles (exact cover), a borrow-checker-style validator rejects any launch
# plan whose mutable tile claims overlap, and a tiled matmul where each simulated
# tile program writes only its owned tile matches the numpy reference exactly.
# This validates the model's invariants, it does not run cuTile. numpy only.
import numpy as np
Claim = tuple[int, int, int, int] # (row0, row1, col0, col1), half-open
def partition(shape: tuple[int, int], tile: tuple[int, int]) -> list[Claim]:
"""Split a 2D output into tiles; edge tiles shrink (non-divisible shapes)."""
rows, cols = shape
th, tw = tile
assert th > 0 and tw > 0
return [(r, min(r + th, rows), c, min(c + tw, cols))
for r in range(0, rows, th) for c in range(0, cols, tw)]
def exact_cover(shape: tuple[int, int], claims: list[Claim]) -> bool:
"""Every output element owned by exactly one claim (injective and total)."""
counts = np.zeros(shape, dtype=np.int64)
for r0, r1, c0, c1 in claims:
counts[r0:r1, c0:c1] += 1
return bool(np.all(counts == 1))
def validate_launch(shape: tuple[int, int], claims: list[Claim]) -> None:
"""Borrow-checker analogue: mutable claims must be pairwise disjoint and
in-bounds, or the launch plan is rejected before anything executes."""
counts = np.zeros(shape, dtype=np.int64)
for r0, r1, c0, c1 in claims:
assert 0 <= r0 < r1 <= shape[0] and 0 <= c0 < c1 <= shape[1], "out of bounds"
counts[r0:r1, c0:c1] += 1
if int(counts.max(initial=0)) > 1:
raise ValueError("aliased mutable claims: launch plan rejected")
def tiled_matmul(x: np.ndarray, y: np.ndarray, tile: tuple[int, int]) -> np.ndarray:
"""Each tile program computes and stores only its owned output tile."""
z = np.zeros((x.shape[0], y.shape[1]), dtype=x.dtype)
claims = partition(z.shape, tile)
validate_launch(z.shape, claims) # checked before launch, as in cuTile Rust
for r0, r1, c0, c1 in claims: # one simulated tile program per claim
z[r0:r1, c0:c1] = x[r0:r1, :] @ y[:, c0:c1]
return z
# 1) Exact cover holds for divisible and non-divisible shapes (edge tiles shrink).
for shape, tile in (((128, 128), (32, 32)), ((100, 33), (32, 32)), ((7, 5), (3, 2))):
claims = partition(shape, tile)
assert exact_cover(shape, claims), (shape, tile)
# 2) Disjoint plans pass validation; an overlapping plan is rejected pre-launch.
validate_launch((8, 8), [(0, 4, 0, 8), (4, 8, 0, 8)])
try:
validate_launch((8, 8), [(0, 5, 0, 8), (4, 8, 0, 8)]) # rows 4 claimed twice
raise AssertionError("aliased mutable claims were not rejected")
except ValueError:
pass
# 3) Tile programs writing only their owned tiles reproduce the reference exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal((96, 64)).astype(np.float64)
y = rng.standard_normal((64, 80)).astype(np.float64)
z = tiled_matmul(x, y, tile=(32, 32))
assert np.array_equal(z, x @ y), np.abs(z - x @ y).max()
z_edge = tiled_matmul(x[:70, :], y[:, :50], tile=(32, 32)) # non-divisible output
assert np.array_equal(z_edge, x[:70, :] @ y[:, :50])
# 4) Adversarial: a hand-built overlapping write plan is caught before execution,
# so the racy double-write can never happen.
bad_plan = partition((96, 80), (32, 32))
bad_plan.append((0, 32, 0, 32)) # duplicate claim on tile (0, 0)
try:
validate_launch((96, 80), bad_plan)
raise AssertionError("duplicate tile claim was not rejected")
except ValueError:
pass
print("tiles for (100, 33) @ 32x32:", len(partition((100, 33), (32, 32))))
print("tiled matmul == numpy reference: max |diff| = 0.0 (exact)")
print("all ownership and tiling assertions passed")
Output: tiles for (100, 33) @ 32x32: 8, tiled matmul == numpy reference: max |diff| = 0.0 (exact), all ownership and tiling assertions passed. The difference in the real system is when this check runs: cuTile Rust discharges it at compile time through partition types and branded indices, so the rejected plans above would not build.
Beyond the one-sub-tensor case, develop multi-output schedules through MappedPartitionMut and iter_indices() (the paper's GEMM walks a bounded sequence of output sub-tensors per program to reuse K-dimension operands); keep unsafe opt-outs small and wrapped, the pattern Grout uses for attention and fused norms.1
How to maintain it¶
- Pin the whole stack. Kernel AST, Tile IR, and cubin generations are coupled: the proc macro embeds an AST that the runtime JIT compiles at first launch, so a toolchain bump can change codegen without touching source. Re-run kernel benchmarks after any cuTile, Tile IR, or driver upgrade.
- Budget the first launch. JIT compilation cost grows with tile size (largest GEMM tiles are the slowest to compile); warm kernels at startup or ship a cache rather than paying it on the first user request.
- Audit the unsafe islands. The escape hatches are the maintenance surface: track every
unchecked_accessesand raw-pointer kernel, and re-test whether a grown safe API can absorb them on each release (the paper's zero-cost GEMM result is the argument that they eventually can). - Watch availability. As of 2026-07 cuTile Rust ships as NVlabs/cutile-rs (v0.2.0, the paper's own artifact), alongside cuTile Python and CUDA Tile C++ as the shipping Tile IR frontends, with Grout as one open-source consumer of it; verify the cutile-rs release channel before planning a build on it.
Running it in production¶
Grout is the production-shaped evidence: a lean, model-specialized batch-1 Qwen3 engine, not a general serving stack. Its decode forward pass is recorded once as DeviceOp graph nodes inside CudaGraph::scope and replayed per token; prefill runs through a cached step graph with buffers from a reusable tensor pool; dense QKV and Gate+Up projections dispatch to cuBLAS; the fused path covers QK-norm, RoPE, KV-cache writes, GQA decode attention, and split-K merges. Measured single-request decode: 154.7 tokens/s at 8192 generated tokens on RTX 5090/Qwen3-4B (74.7% of the HBM roofline estimate) and 80.1 tokens/s on B200/Qwen3-32B (66.7% of roofline), versus 77.5 for vLLM and 76.5 for SGLang on the same B200 sweep.5 The roofline sanity check is the same first-order decode model this KB uses (weights plus KV bytes per token over peak bandwidth; see LLM inference efficiency).
Operationally, the async mode is what changes the deployment shape: one host thread keeps GPU work in flight while servicing I/O and control, which matters as inference interleaves tool calling with generation. For fleet serving, treat an engine like Grout as a specialized point solution and benchmark against the general engines per the usual inference serving discipline.
Failure modes¶
- Escape hatches reintroducing races.
unchecked_accessesand raw pointers hand invariants back to the programmer; an overlapping write behind an opt-out is exactly the permute-heads bug the safe API prevents. Keep opt-outs isolated behind small safe wrappers and review them as unsafe code. - Tile shape versus occupancy mismatch. Tile sizes drive both correctness bounds and performance; the paper tunes tile shapes per problem size for its GEMM sweep. A shape tuned for one matrix size can underfill or spill on another; retune rather than reuse (see occupancy tuning).
- Toolchain immaturity. A young tensor API means missing patterns, and version drift across proc macro, JIT, and Tile IR; pin versions and keep a known-good kernel benchmark suite as the upgrade gate.
- Host-device contract violations at FFI edges. The safety proof covers the typed launch path; passing pointers in or out through foreign interfaces (interop with C++ engines, external allocators) bypasses
prepare()/recover()and silently voids the same-type-in-same-type-out guarantee. - Debugging through an IR stack. A miscompiled or slow kernel now involves proc-macro output, kernel AST, Tile IR, and SASS; capture the intermediate artifacts early, and profile with the standard Nsight workflow rather than guessing at which layer a regression lives.
References¶
- Elibol, Roesch, Gelado, Buehler, Garland, Fearless Concurrency on the GPU (arXiv 2606.15991): https://arxiv.org/abs/2606.15991
- NVlabs/cutile-rs (cuTile Rust, the paper's standalone artifact, v0.2.0): https://github.com/NVlabs/cutile-rs
- Grout, the cuTile-Rust-based Qwen3 inference engine (open source): https://github.com/huggingface/grout
- NVIDIA cuda-tile (Tile IR / CUDA Tile C++ path): https://github.com/NVIDIA/cuda-tile
- NVIDIA cuTile Python: https://github.com/NVIDIA/cutile-python
- NVIDIA Tile IR documentation: https://docs.nvidia.com/cuda/tile-ir/
- Rust-CUDA (Rust GPU device code, unsafe kernels): https://github.com/Rust-GPU/Rust-CUDA
- rust-gpu (Embark Studios lineage): https://github.com/Rust-GPU/rust-gpu
- CubeCL (SIMT-style Rust kernels across backends): https://github.com/tracel-ai/cubecl
- Triton (Python tile DSL): https://github.com/triton-lang/triton
- candle (Rust host-side tensor framework): https://github.com/huggingface/candle
Related: Triton · CUTLASS · Tensor Core programming · PyTorch CUDA extensions · CUDA graphs · CUDA streams and concurrency · Warp specialization and pipelining · Kernel fusion
-
cuTile Rust (arXiv 2606.15991): tiles as immutable fixed-size values; kernels as grids of single-threaded tile programs; mutable outputs partitioned into disjoint sub-tensors with an injective program-to-sub-tensor mapping (mutable rank capped at 3 by the CUDA grid); MappedPartitionMut with branded PartitionIndex and bounded Dim iterators for multi-output programs without hot-loop dynamic checks; Tile IR token chains order mutable operations while shared reads reorder freely; KernelInput/KernelOutput prepare()/recover() with same-type-in-same-type-out; DeviceOp lazy composition (then, zip!, shared) with sync/async/graph modes and GraphNode-restricted capture; escape hatches via unchecked_accesses and raw pointers; proc macro emits desugared Rust, typed launcher, and an embedded AST JIT-compiled to Tile IR and cubin at first launch. ↩↩↩↩↩↩
-
The paper positions cuTile Rust alongside cuTile Python and CUDA Tile C++ on the shared Tile IR backend, and against Triton, Pallas, ThunderKittens (tile DSLs without static safety), Descend and Mojo (new safe languages), and Rust-CUDA/rust-gpu/CubeCL (Rust device code without the safe launch and tensor-access model). ↩
-
Evaluation (single B200 GPU of a DGX B200; RTX 5090 workstation; SM clocks locked for microbenchmarks): GEMM f16 at M=N=K=8192 reaches 2.07 PFlop/s safe Rust (92% of dense f16 peak, 96.4% of cuBLASLt-swept cuBLAS), unsafe Rust within 0.3%, cuTile Python 2.04 PFlop/s; element-wise add at N=2^28: 7.02 TB/s safe and unsafe Rust, 7.01 TB/s Python, 7.68 TB/s peak; execution-mode sweep (RTX 5090, f16, length-2048 tile, N=1..1000): per-kernel sync ~7.3 us/op, chained sync and async ~3.4 us/op, graph replay ~0.8 us/op, async fixed cost ~21 us at N=1. ↩↩
-
Paper section 3.5: a cuTile Python permute-heads kernel with swapped store indices races (17 to 35% of elements differ across runs); the cuTile Rust partition view has no store index to swap, and Appendix A proves data-race freedom of the safe API against the Tile IR memory model. ↩
-
Grout (open source, built with Hugging Face): batch-1, f16, prefix caching disabled; abstract headline 171 tok/s (RTX 5090, Qwen3-4B) and 82 tok/s (B200, Qwen3-32B); generation sweep at 8192 tokens: 154.7 tok/s (74.7% of the HBM roofline R = beta/(W + mean KV bytes per token); roofline falls 221.6 to 207.1 tok/s as KV grows 0.040 to 0.607 GB/token) and 80.1 tok/s (66.7% of roofline) vs vLLM 77.5 and SGLang 76.5; safe kernels for element-wise, embedding, argmax; unchecked or raw-pointer kernels for attention and fused norms; model GEMMs fall back to cuBLAS. ↩↩