Markdown

Performance optimization & tuning¶

Scope: making the cluster fast. The optimisation method and the specific levers across the whole stack, from NCCL environment variables to BIOS/PCIe to fused kernels. The "optimise" verb, pulling together threads from GPU performance and health, storage and data, distributed training, and observability.

flowchart LR
  MEASURE["Measure"] --> LOCATE["Locate dominant bottleneck"]
  LOCATE --> FIX["Fix one layer"]
  FIX --> REMEASURE["Re-measure"]
  REMEASURE --> MEASURE

Overview¶

Optimisation is a discipline, not a bag of tricks: measure, find the dominant bottleneck, fix that one, repeat. The target is MFU (distributed training) for training and goodput (inference serving) for inference; never tune blind. Half the wins are not clever kernels but cluster hygiene: a PCIe link that trained down a generation, ACS left enabled, or NCCL quietly on TCP. Know the hierarchy of where time goes and check the cheap, high-impact things first.

GPU performance engineering deep dives¶

This page is the pillar for GPU cluster performance. Work top-down: profile, classify the bottleneck, then open the focused page for that layer:

Method and profiling. roofline and arithmetic intensity, the Nsight profiling workflow, goodput for AI systems.
GPU architecture and memory. the SIMT execution model, the GPU memory hierarchy, memory coalescing, shared-memory tiling.
CUDA kernel optimization. CUDA occupancy tuning, kernel fusion, Tensor Cores and mixed precision, CUDA streams and concurrency, CUDA graphs.
PyTorch performance. torch.compile graph capture and fusion, the PyTorch CUDA caching allocator, activation checkpointing and offload.
Attention kernels. FlashAttention and MLA.
Host and system tuning. Linux OS tuning for GPU nodes, NUMA and CPU pinning, GPU power and thermal tuning.
Distributed communication. NCCL collectives and algorithms, RDMA and RoCE tuning, SHARP in-network reduction, compute and communication overlap.
Storage and data I/O. GPUDirect Storage, data-loading pipeline tuning.
Inference optimization. continuous batching internals, KV-cache management, speculative decoding.

Core knowledge¶

Method and the roofline¶

Loop: measure → locate the dominant bottleneck → fix → re-measure. Use Nsight Systems (observability) to see where the step's time actually goes.
Roofline: every kernel is compute-bound or memory-bound. Establish which before tuning, since optimising compute on a memory-bound kernel wastes effort.

The bottleneck hierarchy¶

Dataloader / IO bound (storage and data): GPUs idle waiting on data. Fix: more workers, prefetch, pinned memory, DALI, dataset sharding.
Communication bound: collectives dominate. Fix: NCCL tuning (below), topology-correct parallelism (distributed training), comms/compute overlap, SHARP/NVLS (networking fabric).
Compute bound but low MFU: kernel efficiency. Fix: fused kernels (FlashAttention), FP8, larger batch, torch.compile.
CPU / host bound: Python and launch overhead on many small kernels. Fix: CUDA Graphs, torch.compile, bigger batches.

NCCL tuning (the high-leverage levers)¶

Algorithm/protocol: NCCL_ALGO (Ring / Tree / NVLS / CollNet), NCCL_PROTO (LL / LL128 / Simple), channels (NCCL_MIN_NCHANNELS / MAX_NCHANNELS).
Transport/fabric: NCCL_IB_HCA (which HCAs), NCCL_SOCKET_IFNAME, NCCL_IB_GID_INDEX, NCCL_P2P_LEVEL, NCCL_NET_GDR_LEVEL (force GPUDirect RDMA), NCCL_NVLS_ENABLE (NVLink SHARP / in-network reduction).
Verify, do not assume: NCCL_DEBUG=INFO shows the chosen transport. Confirm [GDRDMA] is engaged and that it is not falling back to TCP sockets (Kubernetes for GPUs). Validate with nccl-tests bus bandwidth against the topology expectation (GPU performance and health).

Kernel and framework level¶

FlashAttention (fused, memory-efficient attention), fused optimizers, torch.compile (graph capture + fusion), CUDA Graphs (kill per-kernel launch overhead), Transformer Engine FP8 (distributed training).
Comms/compute overlap: DDP gradient bucketing, FSDP prefetch, interleaved pipeline schedules, all hiding communication under computation.
Precision: BF16 default → FP8 (Blackwell NVFP4 for inference), always with numerics validated.

Precision & interconnect by generation¶

The available precision levers and the comms levers both depend on the GPU generation/tier.

Precision floor moves with the architecture. TF32 is Ampere and later (Tensor Float 32 on the A100 3rd-gen Tensor Cores). FP8 needs Hopper or Ada: the H100 ships "a Transformer Engine with FP8 precision", and Ada (RTX 6000 Ada, L40S) also has FP8 Tensor Cores. FP4 / NVFP4 is Blackwell only; the GB200 "second-generation Transformer Engine ... enables FP4 AI". Choose the lowest precision the target hardware supports and validate numerics: FP8 for Hopper/Ada training and inference, NVFP4 for Blackwell inference; on Ampere stay at BF16/TF32.
No-NVLink GPUs change the comms levers. On GeForce (RTX 4090/5090) and the RTX PRO/workstation and Ada cards there is no NVLink, so the inter-GPU path is PCIe peer-to-peer: NVLS / NVLink-SHARP (NCCL_NVLS_ENABLE) and NVSwitch in-network reduction do not apply, and NCCL runs over PCIe P2P (or host staging) rather than NVLink. The high-impact levers become PCIe link health and ACS (below) plus NCCL_P2P_LEVEL, not NVLS. GPUDirect RDMA (NCCL_NET_GDR_LEVEL) is unavailable on GeForce, since it is a Tesla/Quadro (datacenter/workstation-pro) capability, so multi-node GeForce collectives stage through host memory (networking fabric).

System / BIOS / host tuning (the quiet wins)¶

PCIe link health: confirm the link trains at full Gen and width (nvidia-smi -q | grep -i pcie, lspci -vv). A riser or cable that negotiates down to x8 or an older Gen silently halves bandwidth.
ACS (PCIe Access Control Services): when enabled, it routes peer-to-peer through the root complex and breaks/limits P2P and GPUDirect RDMA. Disable ACS on the GPU/NIC paths, a classic high-impact gotcha.
NUMA affinity: bind each GPU to its local CPU and NIC (Kubernetes for GPUs); enable above-4G decoding / large BAR, set the CPU governor to performance, configure huge pages and IRQ affinity.

Power vs performance¶

Lock clocks for reproducible benchmarks; for production, the perf/W curve is non-linear, so a modestly lower power cap (datacentre readiness) can improve efficiency and thermal headroom for a small throughput cost.

Don't-miss checklist¶

Identify the dominant bottleneck and measure MFU before tuning anything.
Verify the PCIe link trains at full Gen/width and that ACS is off on P2P/GDR paths.
Confirm NCCL is on IB + GDR with the right HCA/IfName, not TCP.
Overlap comms with compute; apply FlashAttention / torch.compile / CUDA Graphs.
NUMA-bind GPU ↔ NIC ↔ CPU.

Failure modes¶

ACS left enabled: P2P/GDR silently disabled, inter-GPU bandwidth halved.
PCIe link negotiated down (bad riser/cable): silent, persistent bandwidth loss.
NCCL on TCP fallback (wrong ifname/HCA): collectives ~10x slow.
Tuning a non-bottleneck, e.g. optimising kernels while dataloader-bound.
FP8 enabled without numeric validation: silent quality regression (distributed training).

Open questions & validation¶

The full NCCL env-var matrix for a single-DC fat-tree, including NVLS and GDR-level forcing.
Run an ACS / NUMA / PCIe-link health checklist on a real node (script in Ansible bring-up).
An Nsight-driven MFU improvement on a real training run, start to finish.

References¶

NCCL environment variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
GPUDirect RDMA (ACS, peer-to-peer): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
torch.compile: https://docs.pytorch.org/docs/stable/torch.compiler.html
FlashAttention: https://github.com/Dao-AILab/flash-attention
NVIDIA A100 (TF32, 3rd-gen Tensor Cores): https://www.nvidia.com/en-us/data-center/a100/
NVIDIA H100 (Transformer Engine FP8): https://www.nvidia.com/en-us/data-center/h100/
NVIDIA GB200 NVL72 (2nd-gen Transformer Engine, FP4/NVFP4): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
NVIDIA Transformer Engine (FP8/NVFP4 recipes): https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

Related: Fabric · Performance · Storage · Training · Inference · Observability · Glossary