Markdown

Inline PTX and SASS-level tuning¶

Scope: dropping to inline PTX (asm volatile) for instructions the compiler will not emit, reading the real SASS with cuobjdump / nvdisasm to verify what actually ran, and the PTX-vs-SASS forward-compatibility model, a last-resort tool after the algorithmic and occupancy wins are spent.

What it is¶

There are three program representations between your CUDA C++ and the silicon, and they are not interchangeable:

CUDA C++ is what you write.
PTX (Parallel Thread Execution) is a stable, virtual ISA. Forward-compatible: a binary that ships PTX can be JIT-compiled to run on a newer architecture it was never built for.¹
SASS (Streaming ASSembler) is the actual machine code for one specific architecture (sm_90 Hopper, sm_100 Blackwell). Not forward-compatible: a SASS-only binary will not run on a newer architecture.¹

nvcc lowers C++ → PTX → SASS. Two distinct interventions live at these levels:

Inline PTX lets you splice a PTX instruction directly into a kernel via an asm() statement, for the handful of instructions the C++ compiler will not emit on its own (a specific cache-eviction hint, a clock read, a hardware op with no intrinsic). The general form is asm("template" : outputs : inputs : clobbers);, where operands are referenced as %0, %1, … and bound to registers by constraint letters: "r" = .u32, "l" = .u64, "h" = .u16, "f" = .f32, "d" = .f64, "n" = immediate integer.² The compiler assumes an asm() has no side effects beyond writing its output operands and may move or delete it; asm volatile blocks that, and a trailing : : : "memory" clobber tells the compiler the statement touches memory so surrounding loads/stores are not reordered across it.²
SASS inspection is read-only. You never write SASS by hand in this workflow; you disassemble the compiled binary to confirm the instructions, register count, and predication the compiler actually produced, because C++ and even PTX are several optimization passes removed from what executes. The tools are cuobjdump and nvdisasm from the CUDA Toolkit.³

The unifying point: you reach for PTX to make the hardware do something, and for SASS to verify it did. The book frames the SASS layer as the ground truth a profiler correlates against: Nsight Compute maps each source line to PTX and SASS only when you compile with -lineinfo, and its source view shows the per-instruction mix and stall attribution at that level.⁴

Why it matters¶

The compiler is good, and on Blackwell it gets better every toolkit release. The book's own guidance is explicit: write kernels with the highest-level, most-recent APIs and let libraries pick up new hardware features "for free" without code changes.⁵ Inline PTX is the opposite of that. It pins you to a specific instruction and can defeat future compiler improvements. So the why is narrow and concrete:

Some instructions have no C++ surface. Blackwell's fifth-generation Tensor Core MMA is exposed in PTX as tcgen05.mma with associated tcgen05.ld / tcgen05.st, and asynchronous bulk copies use the cp.async / cp.async.bulk.tensor family.⁶⁷ You normally hit these through CUTLASS or cuda::memcpy_async, but a hand-written kernel that needs one of them with no library wrapper has to emit the PTX directly.
The C++ you wrote is not the SASS that ran. A const __restrict__ annotation only permits the read-only cache path; whether the compiler took it shows up as LDG in the SASS, not in your source.⁸ A ternary you wrote to dodge warp divergence is only a win if the compiler turned it into a predicated SEL/MOV rather than a branch; in PTX you would see the @p predicate guard instead of a split into separate warp paths.⁹ You cannot confirm either from C++; you confirm it in the disassembly.
Register pressure is a SASS fact. Blackwell caps a thread at 255 registers, and registers are the dominant occupancy limiter.¹⁰ nvdisasm / Nsight tell you the real per-thread count after the optimizer's spills and rematerialization, the number that actually decides how many warps fit.

Done well, this closes the loop: profile → hypothesize → change → disassemble to confirm the change landed → re-profile. Done as a first move, it is wasted effort against a problem that tiling, fusion, or occupancy would have solved with portable code.

When it is needed (and when not)¶

Reach for it when:

A profiler-confirmed bottleneck persists after the algorithmic wins (coalescing, shared-memory tiling, kernel fusion, occupancy tuning) are exhausted and the remaining gap is a specific instruction the compiler will not emit.
You must verify the compiler's output: did the ternary predicate, did __restrict__ route through the read-only path, how many registers did the kernel really use, did the intended LDGSTS/cp.async appear. This verification use of SASS is cheap and always justified.
You are writing a kernel against a hardware feature with no intrinsic yet (a new tcgen05.* path, a cache-control qualifier), and a CUTLASS or Triton wrapper does not cover your case.

Skip it when:

You have not profiled. Inline PTX is never the first optimization. The book's regime ladder is underutilized → latency-bound → memory-bound → compute-bound; PTX-level work only makes sense deep in the compute-bound regime, after occupancy and memory access are handled.¹¹
A high-level path exists. torch.compile, CUTLASS, cuBLAS/cuDNN, and cuda::memcpy_async already emit the advanced PTX (TMA, tcgen05, multicast) and track new architectures.⁵ Hand-rolled PTX that duplicates them is portability debt for no gain.
You would ship SASS-only. Emitting architecture-specific SASS or omitting PTX breaks forward compatibility. The binary will not run on the next GPU generation.¹

A sibling effort here refused to invent an unmeasured speedup figure; this page does likewise. No instruction-count or runtime numbers are claimed for hand-written PTX because none were measured on hardware by this author. The disassembly procedures below are exact, the magnitudes are workload-specific and must be measured on your device.

How: implement, integrate, maintain¶

1. Inline PTX, minimally¶

Use inline PTX only for the instruction that has no intrinsic. Read a hardware counter (%clock) with volatile so the read is not hoisted, deleted, or CSE-merged with another:²

__device__ __forceinline__ unsigned clock_lo() {
    unsigned c;
    // volatile: do not move/delete; %% escapes the special register prefix
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(c));
    return c;
}

Operand binding by constraint letter, output (=) before input, %n zero-indexed in text order:²

// d = a * b + c, fused, on .f32 registers
__device__ __forceinline__ float fma_ptx(float a, float b, float c) {
    float d;
    asm("fma.rn.f32 %0, %1, %2, %3;"
        : "=f"(d)                 // %0 write-only output, .f32 reg
        : "f"(a), "f"(b), "f"(c)); // %1..%3 inputs
    return d;
}

If the PTX reads or writes memory the compiler cannot see, add the "memory" clobber so surrounding accesses are not reordered across it:²

asm volatile ("…" : "=r"(x) : "r"(addr) : "memory");

Keep these wrappers tiny, __forceinline__, and isolated; the rest of the kernel stays in C++ so the optimizer keeps working everywhere except the one instruction you pinned.

2. Compile so the binary stays portable and inspectable¶

# Ship SASS for today's arch AND PTX for forward compatibility,
# and emit line tables so SASS correlates back to source.
nvcc -lineinfo \
     -gencode arch=compute_100,code=sm_100 \
     -gencode arch=compute_100,code=compute_100 \
     kernel.cu -o app

The second -gencode embeds PTX (code=compute_100) alongside the sm_100 SASS, producing a fatbin that JIT-compiles forward onto a future architecture; a SASS-only build would not.¹ -lineinfo adds source-line tables without disabling optimization, which is what lets Nsight Compute and nvdisasm map instructions back to your source.⁴³ You can prove the PTX path works by forcing JIT at load time with CUDA_FORCE_PTX_JIT=1; if the binary lacks PTX the launch fails, which flags a non-portable build.¹

3. Read the SASS to verify what ran¶

cuobjdump works on either a cubin or a host executable (it finds embedded cubins); nvdisasm takes only a standalone cubin but gives richer output.³

# Disassemble one kernel's SASS straight from the app binary.
cuobjdump -sass -fun fma_ptx app

# Also dump the PTX the compiler kept, to see the @p predicate / fma.rn.
cuobjdump -ptx app

# For the richer view, extract the cubin, then drive nvdisasm.
cuobjdump -xelf all app                 # writes app.<arch>.cubin
nvdisasm -c app.sm_100.cubin            # code sections only

What to check in the output:

Predication vs branching. A divergence-avoiding ternary should appear as a predicated SEL/MOV (PTX @p), not a branch that splits the warp.⁹ If you still see a branch, the compiler did not predicate it.
Read-only path. const __restrict__ data routed through the read-only cache shows up as LDG; a plain global LD means the hint was not taken.⁸
Register count and spills. nvdisasm -plr prints register life ranges; the per-thread register count (≤255 on Blackwell) and any local-memory spill traffic are the real occupancy inputs, not your variable count.³¹⁰

For control flow, nvdisasm -cfg app.sm_100.cubin emits a graphviz control-flow graph, and nvdisasm -g annotates the disassembly with source lines from a -lineinfo/-G build.³

flowchart TD
    A["CUDA C++ kernel"] -->|"nvcc front-end"| B["PTX (virtual ISA, forward-compatible)"]
    B -->|"ptxas / JIT"| C["SASS (per-arch machine code)"]
    A -.->|"asm volatile (inline PTX)"| B
    C -->|"cuobjdump -sass / nvdisasm"| D["Verify: predication, LDG, registers, spills"]
    D -->|"re-profile with Nsight Compute"| A

4. Maintain it as a liability¶

Inline PTX is code the compiler can no longer improve and that may not be optimal on the next architecture; treat each asm block as technical debt. Pin it behind a wrapper, comment why no intrinsic exists, and re-disassemble after every toolkit or architecture bump to confirm the surrounding C++ still lowers as intended. Guard correctness with Compute Sanitizer: a missing "memory" clobber or a wrong constraint is exactly the kind of silent corruption it catches. When a future toolkit gains an intrinsic or a library covers your case, delete the PTX. The portable path wins on every axis except the one instruction you were forced to pin.

Accuracy note: the PTX-vs-SASS forward-compatibility model, the @p predicate / SEL/MOV lowering, const __restrict__ → read-only path, the tcgen05.* and cp.async instruction names, the 255-register Blackwell limit, -lineinfo source correlation, and the profile-driven optimization ladder are grounded in Fregly's book (chapters 6–8). The inline-PTX asm() syntax (constraint letters, volatile, "memory" clobber) and the cuobjdump / nvdisasm flag spellings are not covered in the cited book chapters; they are grounded in the official NVIDIA Inline PTX Assembly and CUDA Binary Utilities references and noted as such. Where book phrasing and NVIDIA docs would diverge, the docs are authoritative. No hardware-measured speedup numbers are asserted for hand-written PTX, as none were validated on a device by this author.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly) — Ch. 6 (GPU architecture, PTX/SASS forward-compatibility model, memory hierarchy, 255-register limit), Ch. 7 (const __restrict__ read-only path, cp.async/TMA), Ch. 8 (warp divergence, @p predication, tcgen05 MMA, -lineinfo source correlation, optimization regime ladder).
NVIDIA, Inline PTX Assembly in CUDA — asm() syntax, constraint letters, volatile, "memory" clobber: https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
NVIDIA, CUDA Binary Utilities (cuobjdump, nvdisasm options): https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html
NVIDIA, Parallel Thread Execution ISA — PTX instruction reference (fma, cp.async, tcgen05.*): https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
NVIDIA, CUDA C++ Programming Guide — compilation, fatbinaries, PTX JIT and forward compatibility (-gencode, CUDA_FORCE_PTX_JIT): https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Fregly, AI Systems Performance Engineering, Ch. 6, "CUDA GPU Backward and Forward Compatibility Model": kernels with PTX forward-run on newer architectures; SASS-only binaries (e.g. sm_90, sm_100) do not; family targets (sm_100f) restrict portability; ship a fatbin with PTX; verify with CUDA_FORCE_PTX_JIT=1. Cross-checked against NVIDIA, CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/ ↩↩↩↩↩
NVIDIA, Inline PTX Assembly in CUDA — asm("template" : outputs : inputs : clobbers); constraints "h"=.u16, "r"=.u32, "l"=.u64, "f"=.f32, "d"=.f64, "n"=immediate; %n zero-indexed operands, %% escape; compiler assumes no side effects beyond outputs unless volatile; "memory" clobber prevents reordering of surrounding memory accesses. https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html ↩↩↩↩↩
NVIDIA, CUDA Binary Utilities — cuobjdump accepts cubins and host binaries (-sass/--dump-sass, -ptx/--dump-ptx, -elf/--dump-elf, -xelf/--extract-elf, -fun/--function); nvdisasm accepts only cubins but adds -c/--print-code, -cfg (graphviz CFG), -plr/--print-life-ranges, -g/--print-line-info. https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html ↩↩↩↩↩
Fregly, Ch. 8 — "Nsight Compute correlates to CUDA C or C++ source and PTX or SASS. You can compile device code with -lineinfo to enable source line mapping"; source view shows instruction mix and throughput/stall breakdown. ↩↩
Fregly, Ch. 7 — write kernels with the highest-level, most-recent APIs (e.g. cuda::memcpy_async); they transparently leverage new hardware features like TMA "for free" without code changes. ↩↩
Fregly, Ch. 8 — "Blackwell Tensor Cores expose fifth-generation MMA instructions in PTX as tcgen05.mma and associated loads/stores (e.g., tcgen05.ld and tcgen05.st)." ↩
Fregly, Ch. 7/8 — asynchronous copy instructions cp.async and the TMA cp.async.bulk.tensor family, driven by cuda::memcpy_async + cuda::pipeline. ↩
Fregly, Ch. 7 — marking data const __restrict__ permits the read-only data path (__ldg/LDG); the optimization is visible in the instruction stream and Nsight metrics, not in C++ source. ↩↩
Fregly, Ch. 8 — a divergence-reducing ternary is "likely to translate into a predicated move instruction (SEL/MOV based on condition) rather than an actual branch … In PTX/assembly, you would see the PTX @p predicate syntax to guard the write without splitting into separate warp paths." ↩↩
Fregly, Ch. 6/8 — Blackwell exposes at most 255 registers per thread; register usage is the dominant occupancy limiter ("Limited by max registers per thread"). ↩↩
Fregly, Ch. 8 — diagnose and optimize through the regimes underutilized → latency bound → memory bound → compute bound; instruction-level/PTX work belongs in the compute-bound regime after memory and occupancy are addressed. ↩