Markdown

CUDA compute sanitizer: correctness debugging¶

Scope: compute-sanitizer and its four tools (memcheck, racecheck, initcheck, synccheck) for catching out-of-bounds accesses, shared-memory data races, uninitialized global reads, and invalid synchronization in custom kernels before they silently corrupt results.

flowchart TB
    APP["unmodified app launch<br/>(no source change, no relink)"] --> CS["compute-sanitizer<br/>--tool"]
    CS --> MC["memcheck (default)"]
    CS --> RC["racecheck"]
    CS --> IC["initcheck"]
    CS --> SC["synccheck"]
    MC --> MCB["out-of-bounds / misaligned accesses,<br/>HW exceptions, leaks (--leak-check full)"]
    RC --> RCB["shared-memory data races<br/>(WAW / WAR / RAW hazards)"]
    IC --> ICB["uninitialized global-memory reads"]
    SC --> SCB["invalid synchronization<br/>(barrier in divergent control flow)"]

What it is¶

Compute Sanitizer is NVIDIA's runtime correctness checker for GPU code. It runs your application unmodified (no source change, no relink) and instruments device execution to trap memory and synchronization defects the hardware would otherwise let pass. It ships with the CUDA Toolkit and replaces the legacy cuda-memcheck (removed in CUDA 12). One binary, four tools selected with --tool:

memcheck (default) detects out-of-bounds and misaligned global/shared/local accesses, plus hardware exceptions raised by the GPU; with --leak-check full it also reports device allocations never freed.¹²
racecheck detects shared-memory data-access hazards (Write-after-Write, Write-after-Read, Read-after-Write) that constitute data races.³
initcheck detects reads of global memory that was never written (uninitialized reads).⁴
synccheck detects invalid usage of synchronization primitives, e.g. a __syncthreads() or cooperative-group barrier that not all threads in the convergence set reach.⁵

Invocation is a prefix on the launch command:

compute-sanitizer --tool memcheck ./my_app arg1 arg2

Reports are byte-accurate and name the offending thread/block. Compile with -lineinfo so each report maps to a source file and line; this adds line tables without disabling optimization, unlike -G (full device debug, which serializes execution and is far slower).⁶

Why it matters¶

GPU memory bugs are usually silent. An out-of-bounds store on the CPU faults; on the GPU it may land in another allocation, corrupt a neighbouring tensor, and produce a numerically plausible but wrong result with no crash. A shared-memory race between warps that forgot a __syncthreads() gives different output run-to-run, often only under load, and disappears under a debugger. An uninitialized read pulls whatever stale bytes occupy that HBM line, frequently zero on a fresh allocation, so it "works" in tests and fails in production once the pool is reused.

These defects do not show up as kernel launch failures. cudaGetLastError() returns success, the loss curve looks fine for a while, then a training run diverges or an inference batch returns garbage for one request. Compute Sanitizer makes the failure deterministic and located: it reports the exact access size, the address, how far past the nearest allocation it landed, and the source line. That turns a multi-day "which kernel corrupts memory" hunt into a single annotated stack.

The cost is runtime slowdown (instrumentation overhead is substantial; expect order-of-magnitude on memory-heavy kernels), so it is a debugging and CI gate, not a production setting.

When it is needed (and when not)¶

Reach for it when:

A custom kernel produces wrong or nondeterministic output and you have ruled out algorithm error: race or out-of-bounds is the prime suspect.
You wrote or modified a hand-rolled CUDA kernel, a Triton kernel lowered to PTX, a CUTLASS epilogue, or any shared-memory tiling code that coordinates warps with barriers, exactly where races and OOB live.
Adding a kernel to CI: run it under memcheck and racecheck with --error-exitcode so a defect fails the build.
A result mismatches a CPU reference only at certain block sizes or grid shapes, a classic index-arithmetic OOB.

Skip it when:

The problem is performance, not correctness. Compute Sanitizer finds no hotspots and reports no timing; use Nsight Systems / Nsight Compute for that. The two are complementary: correctness first, then profile.
You are running only well-tested library kernels (cuBLAS, cuDNN, framework-internal PyTorch ops). They are already validated; sanitizing them wastes hours of slowdown for noise. Sanitize your code.
You need it always-on in production. The overhead forbids it; gate it in CI and pre-merge instead.

racecheck only inspects shared-memory hazards; it does not detect global-memory races between blocks. initcheck covers global-memory uninitialized reads. Choose the tool that matches the suspected fault class; running all four blindly multiplies the slowdown.

How: implement, integrate, maintain¶

1. Build for attribution¶

nvcc -lineinfo -arch=sm_100 kernel.cu -o my_app

-lineinfo is the right default for sanitizing: source-line mapping with optimization intact.⁶ Reserve -G for cases where you also need a stepping debugger (cuda-gdb); it changes scheduling and can mask or move races.

2. memcheck: out-of-bounds and leaks¶

Given a kernel that indexes one element past its array:

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;            // BUG: no `if (i < n)` guard -> OOB when grid overshoots n
}

compute-sanitizer --tool memcheck --leak-check full --error-exitcode 1 ./my_app

memcheck reports an invalid __global__ write, the access size, the faulting thread/block, and that the address is "out of bounds and is N bytes after the nearest allocation".⁶ --leak-check full adds device allocations that were never cudaFreed (default is --leak-check no).² --error-exitcode 1 makes the process exit non-zero on any detected error so CI fails (the flag defaults to 0, i.e. errors do not change the exit code).⁷

The fix is the standard bounds guard, identical to the pattern used in occupancy tuning:

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

3. racecheck: shared-memory data races¶

A reduction that reads a neighbour's shared-memory slot without a barrier between the write and the read:

__global__ void reduce(const float* in, float* out, int n) {
    extern __shared__ float s[];
    int t = threadIdx.x;
    s[t] = (blockIdx.x * blockDim.x + t < n) ? in[blockIdx.x * blockDim.x + t] : 0.0f;
    // BUG: missing __syncthreads() here -> race on s[t + stride]
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = s[0];
}

compute-sanitizer --tool racecheck --racecheck-report all ./my_app

--racecheck-report takes hazard, analysis (default), or all.³ A hazard report is byte-accurate: it names the two conflicting accesses (e.g. a RAW between the initial store to s[t] and another thread's load of s[t + stride]) and classifies them WAW / WAR / RAW. An analysis report aggregates hazards across the launch into the likely root cause. Use all while debugging, analysis once you only want the summary. Racecheck reports hazards but does not halt execution.³ Optionally tune --racecheck-detect-level {info,warn,error} (default warn).³ The fix is a __syncthreads() after the load, before any thread reads a slot another thread wrote.

4. initcheck: uninitialized global reads¶

compute-sanitizer --tool initcheck ./my_app

Flags kernel reads of global memory that no prior write initialized: typically a cudaMalloc buffer consumed before it is fully populated, or a padding region read by an unguarded tile load. Add --track-unused-memory to additionally report allocated global memory never accessed at all (dead allocations), useful for trimming a kernel's footprint.⁴

5. synccheck: invalid synchronization¶

compute-sanitizer --tool synccheck ./my_app

Reports invalid use of synchronization primitives: a barrier (__syncthreads(), __syncwarp(mask), or a cooperative-groups barrier_sync) that not every thread in the required convergence set reaches: the canonical "barrier inside divergent control flow" bug, which is undefined behaviour and a frequent source of hangs or corruption.⁵ This is distinct from racecheck: synccheck checks barrier validity; racecheck checks data hazards that a missing-but-valid barrier would have prevented.

6. Narrow the run and wire into CI¶

Sanitizer overhead scales with checked launches, so scope it:

compute-sanitizer --tool memcheck \
  --kernel-name reduce \
  --launch-count 4 --launch-skip 1 \
  --target-processes application-only \
  --error-exitcode 1 --print-limit 50 \
  --save report.out --log-file sanitizer.log \
  ./my_app

--kernel-name reduce checks only matching kernels (there is also --kernel-name-exclude).⁸
--launch-count / --launch-skip bound how many launches are instrumented and skip warm-up launches (both default 0 = no limit / no skip).⁸
--target-processes {application-only,all}: all (default) follows child processes; application-only checks just the launched binary.⁹
--destroy-on-device-error {context,kernel}: on a device error, tear down the whole context (default) or just the failing kernel and continue.¹⁰
--print-limit N caps reported errors; --save writes machine-readable results; --log-file redirects the textual report.¹¹

In CI, run memcheck (correctness) and racecheck (races) as separate jobs with --error-exitcode 1; a non-zero exit fails the pipeline on any defect. Keep the sanitized job on a small, deterministic input; the slowdown makes full-scale data impractical. Once clean, hand off to the profiling workflow for performance and to diagnostics tools for fleet-level health.

Accuracy note: this page is grounded in the official NVIDIA Compute Sanitizer documentation and the NVIDIA developer blog cited below; flag spellings and defaults (notably --error-exitcode as a single word, default 0, and --racecheck-report default analysis) are quoted from the command-line reference. The intended source-book chapter (Fregly, Ch. on debugging/correctness) was not available to this author for line-level grounding; where the book and NVIDIA docs would disagree, the docs are authoritative and were followed.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly) — GPU correctness debugging / Compute Sanitizer treatment. (Book chapter not directly accessible to this author; claims below are cross-checked against and grounded in the official NVIDIA documentation.)
NVIDIA, Compute Sanitizer Documentation — overview and tool descriptions: https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html
NVIDIA, Compute Sanitizer — command-line options (--tool, --leak-check, --racecheck-report, --racecheck-detect-level, --track-unused-memory, --kernel-name, --launch-count, --launch-skip, --error-exitcode, --target-processes, --destroy-on-device-error, --print-limit, --save, --log-file): https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html
NVIDIA Developer Blog, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer (-lineinfo, memcheck OOB report format, example invocations): https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/
NVIDIA, CUDA C++ Programming Guide — synchronization primitives (__syncthreads, __syncwarp, cooperative groups): https://docs.nvidia.com/cuda/cuda-c-programming-guide/

NVIDIA, Compute Sanitizer Documentation — tool list and --tool selection (default memcheck). https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩
NVIDIA, Compute Sanitizer — memcheck detects out-of-bounds and misaligned accesses and GPU hardware exceptions; --leak-check {full,no} (default no). https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩↩
NVIDIA, Compute Sanitizer — racecheck reports shared-memory data-access hazards (WAW/WAR/RAW); --racecheck-report {hazard,analysis,all} (default analysis), --racecheck-detect-level {info,warn,error} (default warn); hazards are reported but execution is not affected. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩↩↩↩
NVIDIA, Compute Sanitizer — initcheck reports uninitialized global-memory reads; --track-unused-memory reports allocated-but-unused global memory. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩↩
NVIDIA, Compute Sanitizer — synccheck reports invalid usage of synchronization primitives. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩↩
NVIDIA Developer Blog, Efficient CUDA Debugging with NVIDIA Compute Sanitizer — -lineinfo for source-line attribution without disabling optimization; memcheck OOB report wording ("out of bounds and is N bytes after the nearest allocation"); compute-sanitizer --tool memcheck ./app example. https://developer.nvidia.com/blog/debugging-cuda-more-efficiently-with-nvidia-compute-sanitizer/ ↩↩↩
NVIDIA, Compute Sanitizer — --error-exitcode {number} (default 0); --check-exit-code {yes,no}. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩
NVIDIA, Compute Sanitizer — --kernel-name / --kernel-name-exclude; --launch-count (default 0); --launch-skip (default 0). https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩↩
NVIDIA, Compute Sanitizer — --target-processes {application-only,all} (default all). https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩
NVIDIA, Compute Sanitizer — --destroy-on-device-error {context,kernel} (default context). https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩
NVIDIA, Compute Sanitizer — --print-limit (default 100), --save {filename}, --log-file {filename}. https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html ↩