PyTorch performance regression testing in CI
Scope: standing up automated performance regression tests for PyTorch training/inference: capturing step-time, throughput, MFU, and peak-memory baselines; deterministic benchmarking (warmup, torch.cuda.synchronize, fixed clocks); failing CI when a commit regresses a metric past threshold; and pairing the gate with the profiler triad to localize the regression.
What it is
A performance regression test is a benchmark plus a comparison: run a fixed workload under controlled conditions, measure a small set of metrics, compare them to a stored baseline, and fail the build if any metric crosses a tolerance band. It is the CI-resident form of the book's core discipline: "set up automated performance tests to catch regressions, reductions in performance, early in the development cycle" (Fregly, Ch. 1). It exists to replace anecdotal "vibe" optimization with a hypothesis-measure-rerun loop: "develop hypotheses, measure the results with reproducible benchmarks, adjust to improve the results, rerun the benchmarks" (Fregly, Ch. 1).
The metrics worth gating are the ones that move money at scale:
- Step time. Milliseconds per training iteration (the book benchmarks eager at ~248 ms/iter and a compiled MoE at ~173 ms/iter, both with torch.cuda.Event timing; Fregly, Ch. 13).
- Throughput. Tokens/sec or samples/sec, the numerator of goodput.
- MFU (Model FLOPS Utilization). Achieved FLOPS as a fraction of the GPU's theoretical peak; the durable hardware-normalized metric that stays comparable across batch sizes and GPU generations.
- Peak memory. torch.cuda.max_memory_allocated(); a regression here is an OOM waiting to happen at scale.
The gate is only the alarm. Root-causing the regression is done with the profiler triad, PyTorch profiler (Kineto), Nsight Systems, and Nsight Compute, already described in Profiling GPUs: Nsight Systems and Nsight Compute.
flowchart LR
COMMIT["Commit / PR"] --> BENCH["Deterministic benchmark<br/>warmup + fixed clocks"]
BENCH --> METRICS["step_time, throughput,<br/>MFU, peak_mem"]
METRICS --> CMP{"Within tolerance<br/>vs baseline?"}
CMP -->|"yes"| PASS["CI green"]
CMP -->|"no"| FAIL["CI red"]
FAIL --> TRIAD["Profiler triad localizes:<br/>torch.profiler -> nsys -> ncu"]
Why it matters
At ultrascale, "even a small-percentage efficiency gain can save millions of dollars", and the inverse is equally true: "small inefficiencies such as redundant computations and slow data pipelines can silently increase costs as the system scales" (Fregly, Ch. 1). A 3% step-time regression that ships unnoticed is a permanent 3% tax on every future training run on the cluster. A regression test is the cheapest possible place to catch it: before merge, on one node, against a known-good number.
The failure mode this guards against is silent drift. A dependency bump (a new cuDNN, a new PyTorch nightly, a kernel that lost a fusion), a refactor that inserts a host-device sync inside the hot loop, or a config change that breaks torch.compile into graph breaks, none of these fail a correctness test. They only show up as a slower wall clock, which nobody notices until the monthly bill or a missed deadline. The book's own optimization workflow depends on trustworthy before/after numbers: the roofline comparison that shifts a kernel from "50% of peak FLOPS, 70% memory BW, memory bound" to "85% FLOPS, 40% memory BW, compute bound" (Fregly, Ch. 13, Table 13-4) is only meaningful if the measurement is reproducible. CI enforces that reproducibility on every commit.
When it is needed (and when not)
Stand up a perf gate when:
- A workload runs repeatedly at scale (a training recipe, an inference server hot path) where a regression compounds across many runs or many requests.
- You depend on a moving toolchain (PyTorch nightly, new CUDA/cuDNN/Triton, vendor kernel libraries) and need an early-warning trip-wire on every bump.
- You ship performance work itself (kernel fusions, torch.compile mode changes, allocator tuning, attention API swaps) and need to prove the win held and did not regress a neighbor.
Do not bother when:
- The code is not on a hot path or runs once. The benchmark's own variance will exceed any signal.
- You cannot pin the environment. On shared, unpinned, clock-floating, thermally-throttling hardware, run-to-run noise swamps a few-percent regression and the gate flaps. The book is explicit that floating clocks and throttling cause non-determinism; lock clocks for deterministic comparison (NVIDIA Nsight Compute, nvidia-smi).
- The metric is latency under CUDA Graphs but you want to see launch overhead: CUDA Graphs mask launch cost, so a graphed benchmark can hide a real per-launch regression. The book notes this directly and recommends max-autotune-no-cudagraphs when "using CUDA Graphs can mask the overhead of launching kernels, which might not be desired in some benchmarks" (Fregly, Ch. 13). See CUDA Graphs: Capture, Replay, and Launch Overhead.
A flaky perf gate is worse than none: it trains the team to ignore red. Tune tolerances to sit above measured noise (see below) before turning the gate to blocking.
How: implement, integrate, maintain
1. Benchmark deterministically
GPU work is asynchronous: a Python timer around a kernel launch measures only launch time, not execution. You must either synchronize before reading the clock or time with CUDA events. The official guidance is unambiguous: "time measurements without synchronizations are not accurate"; create events with enable_timing=True, record around the region, then torch.cuda.synchronize() before calling elapsed_time (PyTorch CUDA semantics, torch.cuda.Event).
Three determinism controls, all from the book's benchmark setup (Fregly, Ch. 13):
- Warmup before measuring: "run a few warm-up iterations before profiling ... compiling JIT kernels, filling caches" so the timed region reflects steady state.
- torch.cuda.synchronize() at the boundaries.
- Fixed math/algorithm selection: torch.backends.cudnn.benchmark = False and torch.backends.cudnn.deterministic = True to stop cuDNN from re-picking algorithms run to run.
# bench_step.py: deterministic per-iteration step-time + peak-memory measurement
import torch

# Determinism controls (Fregly, Ch. 13): no cuDNN autotune, deterministic algos.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

WARMUP, ITERS = 10, 50


def measure(step_fn) -> dict[str, float]:
    """step_fn() runs one full train iteration (fwd+bwd+opt) on a fixed batch."""
    torch.cuda.reset_peak_memory_stats()

    # Warmup: JIT/compile, allocator caching, clock ramp. Not measured.
    for _ in range(WARMUP):
        step_fn()
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(ITERS):
        step_fn()
    end.record()
    torch.cuda.synchronize()  # required before elapsed_time()

    step_ms = start.elapsed_time(end) / ITERS  # ms per iteration
    peak_mib = torch.cuda.max_memory_allocated() / (1024**2)
    return {"step_ms": step_ms, "peak_mib": peak_mib}
torch.cuda.max_memory_allocated and reset_peak_memory_stats are the documented way to capture peak device memory (torch.cuda memory management). For pure kernel/op microbenchmarks, prefer torch.utils.benchmark.Timer.blocked_autorange(): Timer handles warmup, thread-pool sizing, and CUDA synchronization for you, and blocked_autorange grows the measurement block until timer overhead is below 0.1% of runtime (torch.utils.benchmark).
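For an op-level check, a minimal sketch of blocked_autorange might look like the following. The matmul workload, the tensor sizes, and the min_run_time value are illustrative assumptions rather than anything taken from the book; the example falls back to CPU when no GPU is present.

```python
import torch
from torch.utils.benchmark import Timer

# Hypothetical microbenchmark workload: a square matmul. Sizes are arbitrary.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 256, device=device)
y = torch.randn(256, 256, device=device)

timer = Timer(
    stmt="x @ y",              # statement under test
    globals={"x": x, "y": y},  # names visible to stmt
    label="matmul",
)

# Timer takes care of warmup and CUDA synchronization;
# blocked_autorange chooses the block size and repeats until min_run_time.
m = timer.blocked_autorange(min_run_time=0.2)

median_us = m.median * 1e6  # Measurement times are reported in seconds
print(f"matmul median: {median_us:.1f} us over {len(m.times)} blocks")
```

The median is usually a steadier value to store in a baseline than the mean, since a single slow block does not move it much.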
2. Lock the clocks
Floating GPU clocks and thermal throttling are the dominant source of run-to-run variance and the reason perf gates flap. Lock the GPU graphics clock before measuring and release it after:
# Lock to a fixed graphics clock for deterministic comparison; release afterward.
sudo nvidia-smi -lgc <min>,<max> # e.g. sudo nvidia-smi -lgc 1410,1410
# ... run benchmark ...
sudo nvidia-smi -rgc # restore default behavior
-lgc (--lock-gpu-clocks) pins clocks; -rgc (--reset-gpu-clocks) restores them (nvidia-smi). NVIDIA's profiling guidance is to "lock the GPU clocks during the profiling session" for consistent values (Nsight Compute Profiling Guide). Lock for deterministic CI comparison; let clocks float only when you specifically want best-case peak performance, not regression detection.
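In a CI script the lock and release are easy to get out of step when the benchmark raises. One possible way to keep them paired, sketched here under the assumption that the runner may call sudo nvidia-smi, is a small context manager. The name locked_gpu_clocks and the injectable run parameter are hypothetical helpers introduced only for this illustration; only the -lgc and -rgc flags come from the text above.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def locked_gpu_clocks(mhz: int, run=subprocess.run):
    """Pin the graphics clock to `mhz` inside the block, then reset it.

    `run` is injectable so the wrapper can be exercised without a GPU.
    """
    run(["sudo", "nvidia-smi", "-lgc", f"{mhz},{mhz}"], check=True)
    try:
        yield
    finally:
        # Always release the lock, even if the benchmark raised.
        run(["sudo", "nvidia-smi", "-rgc"], check=True)


# Dry run: a fake runner records the commands instead of executing them.
issued = []


def fake_run(cmd, check=True):
    issued.append(cmd)


with locked_gpu_clocks(1410, run=fake_run):
    pass  # the benchmark would run here

print(issued)
```

On a real runner, drop the run argument so the default subprocess.run executes the commands.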
3. Capture and store the baseline
Emit metrics as JSON, keyed by a hardware + toolchain fingerprint so baselines never compare across unlike machines. A baseline measured on an H100 is meaningless against a B200 run, and the book stresses system-level, like-for-like comparison.
# write_baseline.py
import json
import subprocess

import torch


def fingerprint() -> dict[str, str]:
    return {
        "gpu": torch.cuda.get_device_name(0),
        "torch": torch.__version__,
        "cuda": torch.version.cuda or "cpu",
        "driver": subprocess.check_output(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"]).decode().strip(),
    }


# metrics = measure(step_fn) | {"tokens_per_s": ...}
# json.dump({"fingerprint": fingerprint(), "metrics": metrics},
#           open("baseline.json", "w"), indent=2)
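Compute MFU as achieved FLOPS over the device's theoretical peak FLOPS for the run's dtype. Because it is normalized by peak, MFU stays comparable across batch sizes, model sizes, and GPU generations, which makes it the metric that ages best across hardware; it still moves with clock speed, so measure it with clocks locked. The per-iteration FLOPS estimate can come from the PyTorch profiler's experimental FLOPS counter (run with with_flops=True and read the FLOPs column from prof.key_averages().table(); the documentation describes the estimate as covering matrix multiplication and 2D convolution) or from a closed-form analytic count for the model (torch.profiler).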
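As a sketch of the closed-form route, the function below uses the common approximation of about 6 FLOPs per parameter per token for one forward and backward pass of a dense transformer. That approximation, the function name mfu, and every number in the example are assumptions for illustration; attention FLOPs and activation recomputation are ignored, and the peak value is a placeholder to be replaced with the datasheet figure for your GPU and dtype.

```python
def mfu(n_params: float, tokens_per_step: float,
        step_s: float, peak_flops: float) -> float:
    """Model FLOPS Utilization for one training step.

    Assumes ~6 FLOPs per parameter per token (forward + backward) for a
    dense transformer; attention FLOPs and recomputation are ignored.
    """
    achieved_flops_per_s = 6.0 * n_params * tokens_per_step / step_s
    return achieved_flops_per_s / peak_flops


# Hypothetical numbers, for illustration only.
example = mfu(
    n_params=1e9,            # 1B-parameter model
    tokens_per_step=32_768,  # e.g. 16 sequences x 2048 tokens
    step_s=0.5,              # measured step time in seconds
    peak_flops=1e15,         # placeholder peak, not a real device spec
)
print(f"MFU = {example:.1%}")
```

When feeding this from the step-time benchmark, convert first: step_s = step_ms / 1000.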
4. Gate in CI
Compare current metrics to baseline; fail past per-metric tolerance. Direction matters: step-time and peak-memory are upper-bounded (lower is better), throughput and MFU are lower-bounded.
# gate.py: exit nonzero on regression so CI fails the build.
import json
import sys

# Regression direction and tolerance per metric.
#   "up":   larger is a regression (step_ms, peak_mib)
#   "down": smaller is a regression (tokens_per_s, mfu)
RULES = {
    "step_ms":      ("up",   0.05),  # >5% slower fails
    "peak_mib":     ("up",   0.03),  # >3% more memory fails
    "tokens_per_s": ("down", 0.05),  # >5% lower throughput fails
    "mfu":          ("down", 0.05),  # >5% lower MFU fails
}


def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def main(baseline_path: str, current_path: str) -> int:
    base, cur = load(baseline_path), load(current_path)
    # Explicit check rather than assert, which is stripped under `python -O`.
    if base["fingerprint"] != cur["fingerprint"]:
        print("hardware/toolchain mismatch", file=sys.stderr)
        return 2

    failures = []
    for name, (direction, tol) in RULES.items():
        b, c = base["metrics"][name], cur["metrics"][name]
        delta = (c - b) / b  # relative change vs baseline
        regressed = (direction == "up" and delta > tol) or (
            direction == "down" and -delta > tol
        )
        flag = "REGRESSION" if regressed else "ok"
        print(f"{name:14} base={b:.3f} cur={c:.3f} delta={delta:+.1%} [{flag}]")
        if regressed:
            failures.append(name)
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
Set tolerances above measured noise: run the benchmark N times on identical code, compute each metric's coefficient of variation (standard deviation divided by mean), and set that metric's tolerance to a few multiples of it. A gate tighter than your noise floor flaps; a gate looser than the regressions you care about is decorative.
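The noise-floor calculation can be sketched in a few lines. The helper name tolerance_from_noise, the multiplier k=3, and the sample step times are assumptions chosen for illustration; pick k to match how many false alarms the team will tolerate.

```python
import statistics


def tolerance_from_noise(samples: list[float], k: float = 3.0) -> float:
    """Relative tolerance = k * coefficient of variation (stdev / mean)."""
    mean = statistics.fmean(samples)
    cv = statistics.stdev(samples) / mean  # sample standard deviation
    return k * cv


# Hypothetical step times (ms) from five runs of identical code.
runs_ms = [100.0, 101.0, 99.0, 100.5, 99.5]
tol = tolerance_from_noise(runs_ms)
print(f"noise-derived tolerance: {tol:.2%}")
```

For these samples the coefficient of variation is about 0.79%, so k=3 gives a tolerance of roughly 2.4%; a 5% step-time rule would sit comfortably above that noise floor.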
5. Add an NVTX-summary regression check (optional, CI-friendly)
Beyond wall-clock metrics, you can regress per-phase GPU time straight from Nsight Systems in headless CI. Annotate phases with NVTX (torch.cuda.nvtx.range_push("forward") / range_pop(), or torch.profiler.record_function), profile, then emit the NVTX GPU Projection Summary. The book recommends exactly this for CI: "You can use one of these commands in your continuous build and integration pipelines to monitor and detect any performance regressions" (Fregly, Ch. 13).
nsys profile --output=profile --stats=true -t cuda,nvtx python bench_step.py
# NVTX GPU projection summary: per-phase (forward/backward/optimizer) GPU time
nsys stats --report=nvtx_gpu_proj_sum profile.nsys-rep
nvtx_gpu_proj_sum (NVTX GPU Projection Summary) is a built-in Nsight Systems report that projects asynchronous GPU kernel time onto host-defined NVTX ranges via CUPTI (Nsight Systems User Guide). Parse its per-range GPU-time column and feed those numbers through the same gate.py to catch a regression isolated to, say, backward while total step time barely moves.
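A parsing step might look like the sketch below, which assumes the report has been exported as CSV (for example with nsys stats --format csv). The column names "Range" and "Total Proj Time (ns)", the leading ":" on range names, and the sample rows are all assumptions; check them against the header your Nsight Systems version actually emits and pass the real names in.

```python
import csv
import io


def phase_times_ms(csv_text: str,
                   name_col: str = "Range",
                   time_col: str = "Total Proj Time (ns)") -> dict[str, float]:
    """Map NVTX range name -> projected GPU time in ms from a CSV report."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out: dict[str, float] = {}
    for row in reader:
        name = row[name_col].lstrip(":")  # drop an empty-domain prefix if present
        out[name] = out.get(name, 0.0) + float(row[time_col]) / 1e6
    return out


# Hypothetical report contents, for illustration only.
sample = """Range,Total Proj Time (ns)
:forward,40000000
:backward,85000000
:optimizer,5000000
"""
metrics = phase_times_ms(sample)
print(metrics)
```

Each key can then get its own "up" rule in the gate, so a slowdown confined to one phase fails the build even when the total barely moves.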
6. Localize a failure with the profiler triad
When the gate goes red, the book's localization order applies (Fregly, Ch. 13; Profiling GPUs: Nsight Systems and Nsight Compute):
- PyTorch profiler (Kineto). prof.key_averages().table(sort_by="self_cuda_time_total") to see which op moved (a matmul, a dispatch/combine, a new graph break).
- Nsight Systems (nsys). Timeline to see where the time went: a new host-device sync, a data-loader stall, lost compute/communication overlap.
- Nsight Compute (ncu). Per-kernel roofline/occupancy on the named hot kernel to see why it slowed (slipped from compute-bound back to memory-bound).
Fix one variable, rerun the deterministic benchmark, confirm the metric recovered. The book: "implement and test these optimizations one by one to verify that each actually improves performance" (Fregly, Ch. 13).
Maintain
- Re-baseline deliberately, never silently. A toolchain bump or an intended optimization changes the number; commit the new baseline.json in the same PR with the reason, so the baseline has provenance.
- One baseline per fingerprint. Keep separate baselines per GPU/PyTorch/CUDA/driver tuple; the gate already refuses to compare across them.
- Watch for noise creep. If the gate starts flapping, re-measure the noise floor: a degraded node, a cooling change, or a removed clock lock is usually the cause, not the code.
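One way to keep a baseline per fingerprint is to derive the file name from the fingerprint itself, so a new GPU or driver lands in its own file. The sketch below is an assumption about layout, not the book's method: the baselines directory, the baseline_path helper, the 12-character digest, and the fingerprint values are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def baseline_path(fingerprint: dict[str, str], root: str = "baselines") -> Path:
    """Stable file path derived from the fingerprint contents."""
    canonical = json.dumps(fingerprint, sort_keys=True)  # key order must not matter
    digest = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return Path(root) / f"baseline-{digest}.json"


# Hypothetical fingerprints that differ only in driver version.
fp_a = {"gpu": "GPU-A", "torch": "2.x", "cuda": "12.x", "driver": "550.x"}
fp_b = dict(fp_a, driver="555.x")

print(baseline_path(fp_a))
print(baseline_path(fp_b))
```

A missing file for the current fingerprint then signals that a deliberate re-baseline is needed, rather than a comparison against the wrong machine.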
References
- Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 1 (automated perf tests, profile-driven mindset, reproducible benchmarks, cost of small inefficiencies at scale); Ch. 13 (Profiling, Tuning, and Scaling PyTorch: warmup + torch.cuda.Event + torch.cuda.synchronize benchmarking, cudnn.benchmark/deterministic, eager vs compiled step times, NVTX projection summary in CI, profiler triad, CUDA Graphs masking launch overhead).
- PyTorch: CUDA semantics (asynchronous execution, timing)
- PyTorch: torch.cuda.Event
- PyTorch: CUDA memory management (max_memory_allocated, reset_peak_memory_stats)
- PyTorch: torch.utils.benchmark (Timer.blocked_autorange)
- PyTorch: torch.profiler (with_flops, key_averages().table)
- NVIDIA: nvidia-smi (--lock-gpu-clocks / -lgc, --reset-gpu-clocks / -rgc)
- NVIDIA: Nsight Compute Profiling Guide (lock clocks; Speed-of-Light)
- NVIDIA: Nsight Systems User Guide (nsys stats --report=nvtx_gpu_proj_sum)
Related: torch.compile · PyTorch CUDA Caching Allocator Tuning · PyTorch/XLA and the XLA Compiler · Activation Checkpointing and Memory Offloading · PyTorch Attention APIs: SDPA and FlexAttention · Profiling GPUs: Nsight Systems and Nsight Compute · CUDA Graphs: Capture, Replay, and Launch Overhead · Tensor Cores and Mixed Precision · Goodput: Measuring Useful AI Throughput · Performance Optimization and Tuning · Inference Serving and Optimization · Glossary