Profiling GPUs: Nsight Systems and Nsight Compute

Scope: the profile-driven workflow that finds the bottleneck with Nsight Systems (timeline / system view), drills into the offending kernel with Nsight Compute (per-kernel counters), correlates code with NVTX ranges, and iterates against the PyTorch profiler.

What it is

A two-tier NVIDIA profiling stack, used iteratively rather than in isolation:

  • Nsight Systems (nsys) is a system-wide, low-overhead timeline profiler. It traces CPU threads, CUDA API calls, kernel launches, memory copies, NCCL/cuBLAS/cuDNN activity, NVTX ranges, and OS runtime calls on one synchronized timeline. It answers where the time goes: is the GPU starved (CPU-bound, data-loader stalls, host/device copies), or compute-bound (kernels back-to-back, GPU busy)?
  • Nsight Compute (ncu) is a per-kernel profiler. Once Nsight Systems names the dominant kernel, ncu replays that kernel and collects hardware counters: Global Memory Load Efficiency, average sectors per request, SM Active %, achieved occupancy, shared-memory bank conflicts, warp stall reasons, DRAM throughput. It answers why a single kernel is slow.
  • NVTX (NVIDIA Tools Extension) ranges are user annotations (forward, backward, optimizer) that show up as named bands on the Nsight Systems timeline and can be used to select kernels in Nsight Compute, mapping raw kernels back to model code.
  • PyTorch profiler (torch.profiler) is the framework-level view: it attributes time to Python/operator names and emits Chrome traces, complementing nsys for the "which op" question without leaving Python.

The book frames this as a non-optional loop: "To identify bottlenecks, we must iteratively use NVIDIA Nsight Systems and NVIDIA Nsight Compute together with the PyTorch profiler. Combined, these tools help pinpoint bottlenecks and track performance over time at different levels of the stack" (Fregly, Ch. 1).

Why use it

You cannot optimize what you have not measured. The book's mandate is a "profile-driven mindset": "Use profilers to identify the true bottlenecks (whether it's compute utilization, memory bandwidth, memory latency, cache misses, or communication/network delays). Then apply targeted optimizations for that bottleneck" (Fregly, Ch. 1). Its explicit warning against "vibe" optimizations ("We did X and things felt faster") is the reason this page exists. Headline numbers mislead in the same way: GPU utilization can read near 100% while goodput (useful work per unit time) sits far lower, because cycles are lost to stalled communication, idle compute, or data waits.

The division of labor is deliberate. The book: "Nsight Systems presents a timeline view of overall GPU activity, while Nsight Compute reports per-kernel metrics such as SM Active %. Use Nsight Compute when you need quantitative per-kernel analysis." Starting with ncu on the wrong kernel wastes time; starting with nsys tells you which kernel (or which CPU gap) is worth the expensive per-kernel replay.

Every memory optimization in this KB, from coalescing to shared-memory tiling to occupancy tuning, is verified through these counters. The book closes its memory chapter: "Those tools remain your north star whenever you tune a new kernel."

When to use it (and when not)

Use this workflow when:

  • A training or inference run misses its throughput / latency target and the cause is unknown (start with nsys).
  • A specific kernel is known-hot and you need counter-level diagnosis (coalescing, bank conflicts, occupancy, stall reasons) using ncu.
  • You are validating an optimization end-to-end: profile before, change one thing, profile after, compare counters. See the roofline model to decide whether a kernel is memory- or compute-bound before you tune.
  • You suspect a gap between GPU work: data loading, host/device copies, Python overhead, gradient-sync stalls (nsys + NVTX, cross-checked with the PyTorch profiler).
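The roofline check mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not the book's method: the function name `roofline_bound` and the device peaks (20 TFLOP/s, 1 TB/s) are hypothetical placeholders, and the byte counts assume ideal reuse with no cache effects. Substitute your GPU's datasheet numbers before drawing conclusions.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Classify one kernel against a simple two-roof roofline.

    flops, bytes_moved : work and DRAM traffic of one launch
    peak_flops         : device peak compute in FLOP/s (assumed value)
    peak_bw            : device peak DRAM bandwidth in bytes/s (assumed value)
    Returns (intensity, ridge, attainable_flops, bound).
    """
    intensity = flops / bytes_moved            # FLOP per byte moved
    ridge = peak_flops / peak_bw               # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bw)
    bound = "memory" if intensity < ridge else "compute"
    return intensity, ridge, attainable, bound

# Hypothetical device: 20 TFLOP/s and 1 TB/s, so the ridge sits at 20 FLOP/byte.
PEAK_FLOPS, PEAK_BW = 20e12, 1e12

# float32 elementwise add: 1 FLOP per element, 12 bytes (two loads, one store).
n = 1 << 20
add = roofline_bound(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# float32 square matmul with ideal reuse: 2*N^3 FLOPs, 3 matrices of N*N*4 bytes.
N = 4096
matmul = roofline_bound(2 * N**3, 3 * N * N * 4, PEAK_FLOPS, PEAK_BW)

for label, (ai, ridge, attainable, bound) in [("add", add), ("matmul", matmul)]:
    print(f"{label}: intensity={ai:.3f} FLOP/B, ridge={ridge:.0f}, {bound}-bound")
```

A kernel that lands on the memory side is a candidate for the coalescing and tiling counters; one on the compute side is not helped by bandwidth tuning.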

Do not reach for it when:

  • You only need coarse, always-on fleet telemetry (utilization, power, ECC, temperature). That is the territory of observability, monitoring, and diagnostics tools, not a profiler.
  • The bottleneck is already proven to be the network fabric across nodes. Profile the collectives themselves, but treat cluster-wide communication analysis as its own discipline.
  • You would run ncu on a full training step in production. It serializes and replays kernels, inflating wall time by orders of magnitude. Scope it to a few launches (see --launch-skip / --launch-count below) on a representative run, never the whole job.

ncu requires the GPU performance counters to be accessible to the user; on locked-down nodes this needs the NVIDIA driver's profiling permission (NVreg_RestrictProfilingToAdminUsers=0) or running with appropriate privileges.

Architecture

The stack is a feedback loop, not a linear pipeline. Nsight Systems localizes the problem (host gap vs hot kernel), NVTX maps kernels to model phases, Nsight Compute quantifies the per-kernel cause, the PyTorch profiler cross-checks the operator, and the fix is re-profiled against the same counters until the target is met.

flowchart TB
    A["Run misses throughput / latency target"] --> B["Nsight Systems (system timeline)"]
    B --> C{"GPU-starved or kernel-bound?"}
    C -->|"gaps between kernels"| D["Fix host side first<br/>(data loader, H2D copies, Python)"]
    C -->|"one kernel dominates"| E["NVTX ranges map kernel to model phase"]
    E --> F["Nsight Compute (per-kernel counters)<br/>roofline, occupancy, sectors/request, DRAM %"]
    F --> G["Cross-check op with PyTorch profiler"]
    G --> H["Apply one targeted fix<br/>(coalescing / tiling / occupancy / fusion)"]
    D --> H
    H --> I["Re-profile and compare same counters"]
    I --> C

The two tools observe the same execution at different granularities. Nsight Systems traces the whole run once, at low overhead, so it is safe on real runs. Nsight Compute replays a single kernel many times to read hardware counters, so it is expensive and must be scoped. NVTX ranges are the shared coordinate system: the named ranges (forward, backward, optimizer) appear as bands on the Nsight Systems timeline, select kernels in Nsight Compute through --nvtx-include, and can be mirrored by record_function labels in the PyTorch profiler, so a finding in one tool maps into the others.

The core coalescing model the counters expose

The load-bearing quantitative model on this page is what Nsight Compute's average sectors per request measures, and why 4.0 is ideal. A 128-byte L2 cache line is four 32-byte sectors; a fully coalesced, aligned warp load of 32 four-byte elements maps to exactly 4 sectors, so Global Memory Load Efficiency is 100%. Scattered or misaligned access touches more sectors (up to 32), fetching bytes the warp never uses. The block below is a numpy-only, executable model of that counter, validated against the happy path and three adversarial cases (worst-case scatter, misalignment across a line boundary, and a stride-monotonicity invariant). It was run with python3 and all asserts pass.

import numpy as np

def sectors_per_request(byte_addresses, sector_bytes=32):
    """Distinct 32-byte L2 sectors touched by one warp's 32 loads.
    Nsight Compute 'average sectors per request' counts unique sectors a
    warp's memory request maps to. A 128-byte line is four 32-byte sectors;
    a coalesced, aligned 32x4B warp load touches exactly 4."""
    a = np.asarray(byte_addresses, dtype=np.int64)
    assert a.size == 32, "a warp issues 32 lanes"
    return int(np.unique(a // sector_bytes).size)

def load_efficiency(byte_addresses, elem_bytes=4, sector_bytes=32):
    """Bytes requested / bytes actually fetched (distinct sectors * 32 B)."""
    sectors = sectors_per_request(byte_addresses, sector_bytes)
    return (32 * elem_bytes) / (sectors * sector_bytes)

# Happy path: perfectly coalesced, aligned float32 warp -> ideal 4.0.
coalesced = [i * 4 for i in range(32)]                  # 128 contiguous bytes
assert sectors_per_request(coalesced) == 4, "aligned contiguous warp -> 4 sectors"
assert abs(load_efficiency(coalesced) - 1.0) < 1e-9, "coalesced load is 100% efficient"

# Adversarial 1: worst-case scatter (stride >= 32 B) -> 32 distinct sectors.
scattered = [i * 128 for i in range(32)]                # each lane in its own line
assert sectors_per_request(scattered) == 32, "fully scattered warp touches 32 sectors"
assert abs(load_efficiency(scattered) - 0.125) < 1e-9, "scatter efficiency == 12.5%"

# Adversarial 2: a 4-byte offset makes the same 128 B access straddle a fifth sector.
misaligned = [4 + i * 4 for i in range(32)]             # bytes [4,132): spans 5 sectors
assert sectors_per_request(misaligned) == 5, "misaligned 128 B access spans 5 sectors"
assert sectors_per_request(misaligned) > sectors_per_request(coalesced), \
    "misalignment strictly worsens sectors/request"

# Adversarial 3: bounds + monotonicity invariant across strides.
prev = 0
for stride in [4, 8, 16, 32, 64, 128]:
    sec = sectors_per_request([i * stride for i in range(32)])
    assert 1 <= sec <= 32, f"sectors/request out of [1,32] for stride {stride}"
    if stride >= 32:
        assert sec == 32, f"stride {stride} >= 32 B must give 32 sectors"
    assert sec >= prev, "sectors/request is monotone non-decreasing in stride"
    prev = sec

print("coalesced=4 (100%), scattered=32 (12.5%), misaligned=5; invariants hold")

How to use it: find the bottleneck with Nsight Systems

Capture a timeline over a few steady-state iterations. Trace CUDA, NVTX, OS runtime, and the comms/math libraries you care about:

nsys profile \
  --trace=cuda,nvtx,osrt,cudnn,cublas \
  --sample=cpu \
  --output=run_baseline \
  python train.py --steps 50

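--trace selects the APIs to record; cuda,nvtx,osrt is the common base, with cudnn,cublas (and nccl for multi-GPU) added when relevant. --sample=cpu adds CPU call-stack sampling to expose host-side stalls; recent Nsight Systems releases document the values as process-tree, system-wide, and none, so check nsys profile --help for the spelling your version accepts. This writes run_baseline.nsys-rep, openable in the Nsight Systems GUI.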

For a fast, headless summary without opening the GUI, ask nsys for stats:

nsys profile --trace=cuda,nvtx --stats=true \
  --output=run_baseline python train.py --steps 50

--stats=true prints CUDA API, kernel, and memory-operation summary tables (top kernels by total time, copy volumes) directly to the terminal, the fastest way to read off the dominant kernel name and confirm whether time is in kernels or in gaps between them.

To profile only steady state and skip warm-up/compilation, gate capture with the CUDA Profiler API and have the script start/stop it:

nsys profile --trace=cuda,nvtx \
  --capture-range=cudaProfilerApi --capture-range-end=stop \
  --output=run_steady python train.py

The gating below is a reference template (needs PyTorch/CUDA, not installed here). It starts capture only after warm-up so torch.compile/cuDNN autotuning is excluded from the trace:

# Reference template (requires torch + CUDA; not executed here).
import torch

# warm up, let cudnn/torch.compile autotune settle, then:
for step, batch in enumerate(loader):
    if step == 10:
        torch.cuda.cudart().cudaProfilerStart()
    train_step(batch)
    if step == 15:
        torch.cuda.cudart().cudaProfilerStop()
        break

Read the timeline: if GPU rows show gaps between kernels, you are GPU-starved (CPU / data / copy bound), so fix the host side first. If kernels are wall-to-wall and one dominates, you have a kernel to hand to ncu.
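The reading rule above can be expressed as a small triage function over kernel spans. This is a sketch under stated assumptions: the input is a hand-built list of (name, start, end) tuples rather than a parsed .nsys-rep, and the function name `triage_timeline` and both thresholds (20% idle, 50% dominant share) are hypothetical, not Nsight Systems defaults.

```python
from collections import defaultdict

def triage_timeline(kernels, window_start, window_end,
                    idle_threshold=0.20, dominant_threshold=0.50):
    """Classify one GPU timeline window from (name, start, end) kernel spans.

    Thresholds are illustrative assumptions, not tool defaults.
    Returns (verdict, idle_fraction, top_kernel_name, top_kernel_share).
    """
    window = window_end - window_start
    # Clip every span to the window and drop spans that fall outside it.
    clipped = [(n, max(s, window_start), min(e, window_end)) for n, s, e in kernels]
    clipped = sorted((k for k in clipped if k[2] > k[1]), key=lambda k: k[1])
    if not clipped:
        return "gpu-starved", 1.0, None, 0.0

    # Merge overlapping spans so concurrent kernels are not double counted.
    busy = 0.0
    cur_start, cur_end = clipped[0][1], clipped[0][2]
    for _, s, e in clipped[1:]:
        if s > cur_end:                  # a gap: close the current busy run
            busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:                            # overlap or back-to-back: extend it
            cur_end = max(cur_end, e)
    busy += cur_end - cur_start
    idle_fraction = 1.0 - busy / window

    # Share of total kernel time taken by the single largest kernel name.
    per_kernel = defaultdict(float)
    for name, s, e in clipped:
        per_kernel[name] += e - s
    top_name, top_time = max(per_kernel.items(), key=lambda kv: kv[1])
    top_share = top_time / sum(per_kernel.values())

    if idle_fraction > idle_threshold:
        verdict = "gpu-starved"          # fix the host side first
    elif top_share >= dominant_threshold:
        verdict = "kernel-bound"         # hand top_name to ncu
    else:
        verdict = "no-single-hotspot"
    return verdict, idle_fraction, top_name, top_share

# Two hand-made windows (arbitrary time units): one with gaps, one wall-to-wall.
starved = [("gemm", 0, 2), ("relu", 5, 6), ("gemm", 10, 12)]
dense = [("gemm", 0, 7), ("relu", 7, 8), ("gemm", 8, 12)]
print(triage_timeline(starved, 0, 12))
print(triage_timeline(dense, 0, 12))
```

Checking the idle fraction before the dominant kernel mirrors the order in the text: a starved GPU makes per-kernel tuning premature even when one kernel leads the ranking.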

How to integrate: annotate with NVTX for correlation

Raw kernel names (ampere_sgemm_...) are hard to map to model code. Wrap logical phases in NVTX ranges so the timeline shows named bands. The block below is a reference template (needs PyTorch, not installed here):

# Reference template (requires torch + CUDA; not executed here).
import torch

for batch in loader:
    torch.cuda.nvtx.range_push("forward")
    out = model(batch)
    loss = criterion(out, batch.target)
    torch.cuda.nvtx.range_pop()      # forward

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()      # backward

    torch.cuda.nvtx.range_push("optimizer")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()      # optimizer

range_push / range_pop form a nested stack: every push needs a matching pop. To auto-annotate every autograd op (and correlate backward ops with their forward origin) use the built-in context manager instead of hand-rolling ranges:

# Reference template (requires torch; not executed here).
with torch.autograd.profiler.emit_nvtx():
    train_step(batch)

The correctness property those ranges rely on, that pushes and pops form a balanced nested stack, is what the pure-Python model below validates. An unmatched pop corrupts the nesting on the timeline; a dropped pop leaves its range open and leaks it. It was run with python3 and all asserts pass:

def nvtx_depth_trace(events):
    """events: list of ('push', name) / ('pop', None). Returns per-event
    stack depth and the final depth. Raises on underflow (a pop with an
    empty stack), mirroring an unmatched NVTX range_pop."""
    depth, trace = 0, []
    for kind, _ in events:
        if kind == "push":
            depth += 1
        elif kind == "pop":
            if depth == 0:
                raise ValueError("range_pop with no matching range_push")
            depth -= 1
        else:
            raise ValueError(f"bad event {kind!r}")
        trace.append(depth)
    return trace, depth

# Happy path: forward/backward/optimizer, each balanced -> returns to depth 0.
good = [("push", "forward"), ("pop", None),
        ("push", "backward"), ("pop", None),
        ("push", "optimizer"), ("pop", None)]
trace, final = nvtx_depth_trace(good)
assert final == 0, "balanced NVTX ranges must return to depth 0"
assert max(trace) == 1, "sibling ranges nest at most one deep"

# Happy path (nested): a step range enclosing forward reaches depth 2.
trace, final = nvtx_depth_trace(
    [("push", "step"), ("push", "forward"), ("pop", None), ("pop", None)])
assert final == 0 and max(trace) == 2, "nested ranges reach depth 2 then unwind"

# Adversarial 1: a dropped pop leaks a range -> residual depth 1.
leaked = [("push", "forward"),
          ("push", "backward"), ("pop", None),
          ("push", "optimizer"), ("pop", None)]   # 'forward' never popped
_, final = nvtx_depth_trace(leaked)
assert final == 1, "a missing range_pop leaves a leaked, still-open range"

# Adversarial 2: an extra pop underflows -> must be detected and raised.
raised = False
try:
    nvtx_depth_trace([("push", "a"), ("pop", None), ("pop", None)])
except ValueError:
    raised = True
assert raised, "an unmatched range_pop must be rejected"

print("nvtx: balanced=0 residual, leaked-push=1 residual, extra-pop raises")
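One way to make the balance property hold by construction is to wrap each push/pop pair in a context manager, so the pop runs even when the body raises. The sketch below is an illustration, not PyTorch's implementation: `nvtx_range` is a hypothetical helper, and the push/pop callables are injected so it runs without torch. With PyTorch installed you would pass torch.cuda.nvtx.range_push and torch.cuda.nvtx.range_pop, or use the torch.cuda.nvtx.range context manager that PyTorch ships.

```python
from contextlib import contextmanager

@contextmanager
def nvtx_range(name, push, pop):
    """Open a range and guarantee the matching pop.

    push(name) and pop() are injected stand-ins (an assumption made so the
    sketch runs without torch or a GPU).
    """
    push(name)
    try:
        yield
    finally:
        pop()      # runs on normal exit and when the body raises

# Hypothetical recorder that logs events instead of calling NVTX.
events = []
def record_push(name):
    events.append(("push", name))
def record_pop():
    events.append(("pop", None))

with nvtx_range("step", record_push, record_pop):
    with nvtx_range("forward", record_push, record_pop):
        pass
    try:
        with nvtx_range("backward", record_push, record_pop):
            raise RuntimeError("simulated failure inside backward")
    except RuntimeError:
        pass       # the range was still closed by the finally clause

# Replay the log: depth never goes negative and ends at zero.
depth = 0
for kind, _ in events:
    depth += 1 if kind == "push" else -1
    assert depth >= 0, "pop without a matching push"
assert depth == 0, "a range was left open"
print(events)
```

The exception path is the case hand-written push/pop pairs most often miss: an early return or raised error skips the pop and leaks the range.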

These NVTX ranges appear as named rows in nsys and can be used to select kernels in ncu (--nvtx-include), so a slow kernel can be traced back to, say, backward of a specific layer.

How to drill in: per-kernel counters with Nsight Compute

Once nsys names the hot kernel, profile just that kernel with ncu. Do not collect the full metric set across every launch; scope it:

ncu --set full \
  --kernel-name regex:"tiledMatMul|sgemm" \
  --launch-skip 20 --launch-count 3 \
  --nvtx --nvtx-include "forward/" \
  --export kernel_report \
  python train.py

  • --set full collects all sections (memory workload, occupancy, scheduler, warp-state). It is thorough but slow; use a lighter --set or --section (e.g. --section MemoryWorkloadAnalysis) for a quick pass.
  • --kernel-name regex:... restricts profiling to matching kernels.
  • --launch-skip / --launch-count skip warm-up launches and profile only a few, essential because ncu replays each kernel multiple times to gather counters.
  • --nvtx --nvtx-include narrows to kernels inside a named NVTX range.
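To see how the scoping flags combine, the sketch below models which launches would be selected. It rests on two loudly labeled assumptions: that skip and count apply to launches that already passed the name filter, and that the regex is searched anywhere in the kernel name. The function `profiled_launches` and the kernel names are hypothetical; check the Nsight Compute CLI documentation for the exact matching rules of your version.

```python
import re

def profiled_launches(launch_names, name_regex, launch_skip, launch_count):
    """Indices into launch_names that this model would profile.

    Assumptions (not verified against ncu): the filter is applied first,
    then launch_skip matching launches are skipped, then launch_count are
    profiled; the regex may match anywhere in the name.
    """
    pattern = re.compile(name_regex)
    matching = [i for i, name in enumerate(launch_names) if pattern.search(name)]
    return matching[launch_skip:launch_skip + launch_count]

# Hypothetical run: 30 steps, four kernel launches per step.
step = ["elementwise_add", "ampere_sgemm_128x64", "tiledMatMul", "reduce_sum"]
launches = step * 30

selected = profiled_launches(launches, r"tiledMatMul|sgemm",
                             launch_skip=20, launch_count=3)
print(selected, [launches[i] for i in selected])
print(f"profiled {len(selected)} of {len(launches)} launches")
```

Under these assumptions only 3 of 120 launches are replayed, which is the point of scoping: the replay cost scales with the number of profiled launches, not with the length of the run.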

Read the counters the book calls out. For a memory-bound kernel, the Memory Workload Analysis section shows the symptoms of uncoalesced access: low Global Memory Load Efficiency, high DRAM sector read counts, and average sectors per request above 4.0 (Fregly, Ch. 7). The ideal is 4.0: a 128-byte cache line is four 32-byte sectors, so a fully coalesced, aligned warp load maps to 4.0 sectors/request; values climbing toward 32 mean scattered, wasted fetches. (The sectors_per_request model earlier on this page computes exactly this, and confirms 4 for a coalesced warp and 32 for a fully scattered one.)

Book vs. official docs: the book describes the 128-byte cache line "composed of four 32-byte sectors." NVIDIA's CUDA C++ Programming Guide documents that on compute capability 6.0+ the hardware coalesces warp accesses into 32-byte memory transactions; the 128-byte line is four such sectors. Both descriptions are consistent: the 32-byte sector is the transaction unit, and Nsight Compute reports activity in 32-byte sectors at L2. Where wording differs, prefer NVIDIA's sector/transaction definition.

Other load-bearing counters from the book's tables (Ch. 7), all illustrative values, not hardware-tested here:

| Symptom | Counter to read | Bad | Good |
| --- | --- | --- | --- |
| Uncoalesced / strided loads | Average sectors per request | up to ~32 | ~4.0 |
| Wasted bandwidth | Global Memory Load Efficiency | ~23-28% | ~97-99% |
| Memory-bound stalls | SM Active % | ~62% | ~99% |
| Shared-memory serialization | Shared-memory bank conflicts | millions | 0 |
| Bandwidth headroom | DRAM throughput (% of peak) | ~25% | ~90% |

For scripting and dashboards, the book names raw metric IDs: sm__sass_data_bytes_mem_* and gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed (Fregly, Ch. 7), which ncu --metrics <id> can collect directly.

How to cross-check: the PyTorch profiler

nsys/ncu work at the CUDA/SASS level; the PyTorch profiler attributes time to operator and Python names, which is faster for "which op" triage and produces a Chrome trace. Use its schedule to skip warm-up and record only a few active steps. The block below is a reference template (needs PyTorch, not installed here):

# Reference template (requires torch; not executed here).
import torch
from torch.profiler import profile, ProfilerActivity, schedule, record_function

sched = schedule(wait=5, warmup=2, active=3, repeat=1)

def on_ready(p):
    print(p.key_averages().table(sort_by="cuda_time_total", row_limit=15))
    p.export_chrome_trace("torch_trace.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=sched,
    on_trace_ready=on_ready,
    record_shapes=True,
    with_stack=True,
) as prof:
    for batch in loader:
        with record_function("train_step"):
            train_step(batch)
        prof.step()

schedule(wait, warmup, active) discards wait+warmup steps so the trace reflects steady state; record_function("name") labels arbitrary code ranges (analogous to NVTX); key_averages().table(sort_by="cuda_time_total") ranks operators by GPU time; export_chrome_trace writes a JSON viewable in chrome://tracing or Perfetto. Use this to confirm the operator behind an nsys kernel before paying for an ncu replay.

The behavior this template depends on, that only the active steps of each cycle are recorded and wait+warmup steps are discarded, is validated by the numpy-only model below. It is property-checked over 200 random schedules: the recorded count must equal active * repeat, and no wait or warm-up step may be recorded. It was run with python3 and all asserts pass:

import numpy as np

def is_recording(step, wait, warmup, active, repeat):
    """True iff torch.profiler would RECORD (be in the 'active' phase) at
    this 0-based step, matching torch's cyclic wait->warmup->active schedule."""
    cycle = wait + warmup + active
    if cycle == 0:
        return False
    if repeat > 0 and step // cycle >= repeat:
        return False
    pos = step % cycle
    return wait + warmup <= pos < wait + warmup + active

# The page's own schedule(wait=5, warmup=2, active=3, repeat=1): steps 7,8,9.
recorded = [s for s in range(50) if is_recording(s, 5, 2, 3, 1)]
assert recorded == [7, 8, 9], f"expected steps 7,8,9 recorded, got {recorded}"
assert len(recorded) == 3 * 1, "total recorded == active * repeat"

# Property check over random schedules: the recorded count must
# equal active*repeat, and no wait/warm-up step may ever be recorded.
rng = np.random.default_rng(1)
for _ in range(200):
    w, wu = int(rng.integers(0, 5)), int(rng.integers(0, 4))
    ac, rp = int(rng.integers(1, 5)), int(rng.integers(1, 4))
    horizon = (w + wu + ac) * rp + 3
    rec = [s for s in range(horizon) if is_recording(s, w, wu, ac, rp)]
    assert len(rec) == ac * rp, f"schedule({w},{wu},{ac},{rp}) recorded {len(rec)}"
    cyc = w + wu + ac
    for s in rec:
        assert (s % cyc) >= w + wu, "no wait/warm-up step may be recorded"

print(f"schedule(5,2,3,1) records {recorded}; 200 random schedules satisfy the invariants")

How to maintain and scale it: the iterative loop

  1. nsys (system view) tells you whether the run is GPU-starved or kernel-bound, and names the dominant kernel or host-side gap.
  2. NVTX ranges map that kernel to a model phase.
  3. ncu (per-kernel) diagnoses the counter (coalescing, occupancy, bank conflicts, stall reason); cross-check the op with the PyTorch profiler.
  4. Apply one targeted change (coalescing, tiling, occupancy, kernel fusion).
  5. Re-profile and compare the same counters. Keep the .nsys-rep / ncu reports as the before/after evidence.

Change one thing per iteration: if you tune coalescing and occupancy at once, the counters cannot attribute the delta. To scale the loop across a codebase, the book stresses automating it: "set up automated performance tests to catch regressions, reductions in performance, early in the development cycle" (Fregly, Ch. 1). Wire nsys --stats or ncu --metrics <id> into CI, capture the same counter IDs each run, and fail the build on counter regressions (for example, average sectors per request rising back above ~4.0, or DRAM throughput dropping). Because ncu is expensive, scale it by scoping CI to a handful of representative kernels with --kernel-name and --launch-count, not the whole training step.
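A CI gate of the kind described above could look like the following sketch. It assumes the counters have already been extracted into plain dictionaries, which is a separate step not shown here. The function `counter_regressions`, the key avg_sectors_per_request, the tolerances, and the sample values are all hypothetical; only the DRAM throughput metric ID is taken from the text above.

```python
def counter_regressions(baseline, current, rules):
    """Return readable failures for counters that regressed.

    rules maps a counter name to (direction, tolerance):
      "lower_is_better"  fails if current > baseline * (1 + tolerance)
      "higher_is_better" fails if current < baseline * (1 - tolerance)
    A counter missing from `current` counts as a failure.
    """
    failures = []
    for name, (direction, tol) in rules.items():
        base = baseline[name]
        cur = current.get(name)
        if cur is None:
            failures.append(f"{name}: missing from current run")
            continue
        if direction == "lower_is_better":
            regressed = cur > base * (1 + tol)
        elif direction == "higher_is_better":
            regressed = cur < base * (1 - tol)
        else:
            raise ValueError(f"unknown direction {direction!r}")
        if regressed:
            failures.append(f"{name}: {base} -> {cur} ({direction}, tol {tol:.0%})")
    return failures

DRAM = "gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed"
SECTORS = "avg_sectors_per_request"          # hypothetical key name

rules = {SECTORS: ("lower_is_better", 0.05), DRAM: ("higher_is_better", 0.05)}
baseline = {SECTORS: 4.0, DRAM: 90.0}        # illustrative values, not measurements
ok_run = {SECTORS: 4.1, DRAM: 88.0}
bad_run = {SECTORS: 7.5, DRAM: 60.0}

print(counter_regressions(baseline, ok_run, rules))
for failure in counter_regressions(baseline, bad_run, rules):
    print("FAIL:", failure)
```

In a pipeline, a non-empty list would map to a non-zero exit code. A tolerance band rather than exact equality keeps run-to-run noise from failing the build.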

Failure modes

  • ncu on a full production step. It serializes and replays every kernel to read counters, inflating wall time by orders of magnitude and perturbing timing. Always scope with --kernel-name, --launch-skip, and --launch-count; never profile the whole job with ncu.
  • Profiling counters locked down. On hardened nodes ncu cannot read performance counters and fails with a permissions error. This needs the driver flag NVreg_RestrictProfilingToAdminUsers=0 or elevated privileges; nsys timeline capture generally still works without it.
  • Unbalanced NVTX ranges. A range_push without a matching range_pop (or vice versa) corrupts the nesting so bands overlap or leak on the timeline, and kernels map to the wrong phase. Keep pushes and pops balanced (the stack model in "How to integrate" detects both a leaked push and an extra pop); prefer emit_nvtx() to hand-rolling.
  • Profiling warm-up instead of steady state. torch.compile/cuDNN autotuning and allocator warm-up dominate the first iterations, so an unscoped capture measures compilation, not the kernel. Gate with --capture-range=cudaProfilerApi (nsys) or schedule(wait, warmup, active) (PyTorch profiler) so only steady-state steps are recorded.
  • Trusting headline GPU utilization. Utilization can read near 100% while goodput is far lower because cycles go to stalled comms, idle compute, or data waits. Read SM Active %, DRAM throughput, and stall reasons, not just the utilization percentage.
  • Starting with ncu on the wrong kernel. Per-kernel replay is expensive; picking the kernel before nsys has shown it dominates wastes that budget on a kernel that is not the bottleneck. Always localize with nsys (or the PyTorch profiler) first.
  • Interpreting sectors/request without alignment context. A value above 4.0 signals wasted fetches, but the cause may be scatter or misalignment (the model above shows a 4-byte offset alone pushes a contiguous 128-byte access from 4 to 5 sectors). Check the access pattern and base alignment before assuming the fix is coalescing rather than padding/alignment.
  • Missing comms/library traces on multi-GPU. Omitting nccl (or cudnn/cublas) from --trace hides collective and library time, so a communication-bound run looks idle between kernels. Add the relevant --trace APIs when the workload uses them.

References

  • Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 1, "Introduction and AI System Overview" (Benchmarking and Profiling; profile-driven mindset; goodput); Ch. 7, "Profiling and Tuning GPU Memory Access Patterns" (Nsight Compute counters: Global Memory Load Efficiency, average sectors per request, SM Active %, bank conflicts; raw metric IDs sm__sass_data_bytes_mem_*, gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed).
  • NVIDIA Nsight Systems User Guide (nsys profile, --trace, --sample, --stats, --capture-range=cudaProfilerApi): https://docs.nvidia.com/nsight-systems/UserGuide/index.html
  • NVIDIA Nsight Compute CLI documentation (ncu, --set, --section, --kernel-name, --launch-skip, --launch-count, --metrics, --nvtx): https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
  • NVIDIA Nsight Compute documentation (Memory Workload Analysis; sectors/transactions): https://docs.nvidia.com/nsight-compute/NsightCompute/index.html
  • NVIDIA CUDA C++ Programming Guide (global memory coalescing into 32-byte transactions, CC 6.0+): https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
  • PyTorch torch.profiler documentation (profile, ProfilerActivity, schedule, record_function, key_averages, export_chrome_trace): https://docs.pytorch.org/docs/stable/profiler.html
  • PyTorch NVTX (torch.cuda.nvtx.range_push / range_pop, torch.autograd.profiler.emit_nvtx): https://docs.pytorch.org/docs/stable/generated/torch.cuda.nvtx.range_push.html

The nsys/ncu commands and PyTorch orchestration snippets are reference templates: all commands, flags, and API names are quoted from the book and the official NVIDIA/PyTorch documentation above and have not been hardware-tested here. Counter values in the table are the book's illustrative figures, not measurements. The two numpy blocks (sectors/request, profiler schedule selection) and the pure-Python NVTX stack-balance block are runnable and were executed with python3; their asserts pass. Where the book and official docs differ (the book's "128-byte line of four 32-byte sectors" vs. NVIDIA's documented 32-byte transaction unit on CC 6.0+), the NVIDIA description is preferred and both are noted.

Related: Memory Coalescing and Vectorized Access · Shared Memory, Bank Conflicts, and Tiling · CUDA Occupancy Tuning · Roofline Model and Arithmetic Intensity · Goodput: Measuring Useful AI Throughput · Kernel Fusion · GPU Memory Hierarchy · GPU Diagnostics and Validation · Glossary