Performance optimization & tuning¶
Scope: making the cluster fast. This page covers the optimisation method and the specific levers across the whole stack, from NCCL environment variables to BIOS/PCIe settings to fused kernels. It is the "optimise" verb, pulling together threads from GPU performance and health, storage and data, distributed training, and observability.
flowchart LR
MEASURE["Measure"] --> LOCATE["Locate dominant bottleneck"]
LOCATE --> FIX["Fix one layer"]
FIX --> REMEASURE["Re-measure"]
REMEASURE --> MEASURE
Overview¶
Optimisation is a discipline, not a bag of tricks: measure, find the dominant bottleneck, fix that one, repeat. The target is MFU (distributed training) for training and goodput (inference serving) for inference; never tune blind. Half the wins are not clever kernels but cluster hygiene: a PCIe link that trained down a generation, ACS left enabled, or NCCL quietly on TCP. Know the hierarchy of where time goes and check the cheap, high-impact things first.
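As a rough sketch of how MFU can be computed for a dense transformer training run: the helper below is hypothetical, and it assumes the common estimate of about 6 FLOPs per parameter per token (forward plus backward, ignoring attention FLOPs). The peak figure must be the dense peak for the precision actually in use.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Return MFU as a fraction, assuming ~6 FLOPs per parameter per token."""
    if min(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu) <= 0:
        raise ValueError("all inputs must be positive")
    achieved = 6.0 * n_params * tokens_per_second  # model FLOP/s actually delivered
    available = n_gpus * peak_flops_per_gpu        # hardware FLOP/s on paper
    return achieved / available


if __name__ == "__main__":
    # Hypothetical numbers: 7B parameters, 8 GPUs, 1e15 FLOP/s peak per GPU.
    mfu = model_flops_utilization(7e9, 80_000, 8, 1e15)
    print(f"MFU = {mfu:.1%}")
```

Recording this number before and after each change keeps the loop honest: a change that does not move MFU did not fix the dominant bottleneck.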
GPU performance engineering deep dives¶
This page is the pillar for GPU cluster performance. Work top-down: profile, classify the bottleneck, then open the focused page for that layer:
- Method and profiling: roofline and arithmetic intensity, the Nsight profiling workflow, goodput for AI systems.
- GPU architecture and memory: the SIMT execution model, the GPU memory hierarchy, memory coalescing, shared-memory tiling.
- CUDA kernel optimization: CUDA occupancy tuning, kernel fusion, Tensor Cores and mixed precision, CUDA streams and concurrency, CUDA graphs.
- PyTorch performance: torch.compile graph capture and fusion, the PyTorch CUDA caching allocator, activation checkpointing and offload.
- Attention kernels: FlashAttention and MLA.
- Host and system tuning: Linux OS tuning for GPU nodes, NUMA and CPU pinning, GPU power and thermal tuning.
- Distributed communication: NCCL collectives and algorithms, RDMA and RoCE tuning, SHARP in-network reduction, compute and communication overlap.
- Storage and data I/O: GPUDirect Storage, data-loading pipeline tuning.
- Inference optimization: continuous batching internals, KV-cache management, speculative decoding.
Core knowledge¶
Method and the roofline¶
- Loop: measure → locate the dominant bottleneck → fix → re-measure. Use Nsight Systems (observability) to see where the step's time actually goes.
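- Roofline: every kernel is compute-bound or memory-bound, depending on whether its arithmetic intensity (FLOPs per byte moved) sits above or below the hardware ridge point (peak FLOP/s divided by peak memory bandwidth). Establish which before tuning, since optimising compute on a memory-bound kernel wastes effort.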
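The roofline test can be sketched in a few lines. The function name and the device figures in the usage example are hypothetical; substitute measured FLOP and byte counts from a profiler and the peaks for the actual GPU and precision.

```python
from typing import NamedTuple


class RooflinePoint(NamedTuple):
    intensity: float   # FLOPs per byte moved
    ridge: float       # intensity at which the two roofs meet
    attainable: float  # upper bound on FLOP/s for this kernel
    bound: str         # "compute" or "memory"


def roofline(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel against a simple two-roof roofline model."""
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bandwidth
    attainable = min(peak_flops, intensity * peak_bandwidth)
    bound = "compute" if intensity >= ridge else "memory"
    return RooflinePoint(intensity, ridge, attainable, bound)


if __name__ == "__main__":
    # FP32 vector add: 1 FLOP per element, 12 bytes moved (two reads, one write).
    # Hypothetical device: 100e12 FLOP/s peak, 2e12 B/s memory bandwidth.
    print(roofline(flops=1.0, bytes_moved=12.0, peak_flops=100e12, peak_bandwidth=2e12))
```

A kernel well to the left of the ridge only gets faster by moving fewer bytes (fusion, tiling, lower precision), not by issuing math faster.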
The bottleneck hierarchy¶
- Dataloader / IO bound (storage and data): GPUs idle waiting on data. Fix: more workers, prefetch, pinned memory, DALI, dataset sharding.
- Communication bound: collectives dominate. Fix: NCCL tuning (below), topology-correct parallelism (distributed training), comms/compute overlap, SHARP/NVLS (networking fabric).
- Compute bound but low MFU: kernel efficiency. Fix: fused kernels (FlashAttention), FP8, larger batch, torch.compile.
- CPU / host bound: Python and launch overhead on many small kernels. Fix: CUDA Graphs, torch.compile, bigger batches.
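One way to apply the hierarchy is to split a profiled step into the four categories above and act on the largest. This is a minimal sketch: the category names and the helper are hypothetical, and the per-category seconds are assumed to come from a profiler trace such as Nsight Systems.

```python
# Fix hints copied from the hierarchy above, keyed by hypothetical category names.
FIXES = {
    "data_wait": "more workers, prefetch, pinned memory, DALI, dataset sharding",
    "exposed_comm": "NCCL tuning, topology-correct parallelism, overlap, SHARP/NVLS",
    "compute": "fused kernels, FP8, larger batch, torch.compile",
    "host_overhead": "CUDA Graphs, torch.compile, bigger batches",
}


def dominant_bottleneck(seconds_by_category):
    """Return (category, share_of_step, fix_hint) for the largest time sink."""
    total = sum(seconds_by_category.values())
    if total <= 0:
        raise ValueError("step breakdown must contain positive time")
    name = max(seconds_by_category, key=seconds_by_category.get)
    return name, seconds_by_category[name] / total, FIXES.get(name, "")


if __name__ == "__main__":
    # Hypothetical one-step breakdown in seconds.
    step = {"data_wait": 0.40, "exposed_comm": 0.15, "compute": 0.35, "host_overhead": 0.10}
    print(dominant_bottleneck(step))
```

"Compute" being the largest share is the healthy case; it only becomes the target once the other three are small.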
NCCL tuning (the high-leverage levers)¶
- Algorithm/protocol: NCCL_ALGO (Ring / Tree / NVLS / CollNet), NCCL_PROTO (LL / LL128 / Simple), channels (NCCL_MIN_NCHANNELS / NCCL_MAX_NCHANNELS).
- Transport/fabric: NCCL_IB_HCA (which HCAs), NCCL_SOCKET_IFNAME, NCCL_IB_GID_INDEX, NCCL_P2P_LEVEL, NCCL_NET_GDR_LEVEL (maximum NIC-to-GPU distance at which GPUDirect RDMA is used), NCCL_NVLS_ENABLE (NVLink SHARP / in-network reduction).
- Verify, do not assume: NCCL_DEBUG=INFO shows the chosen transport. Confirm [GDRDMA] is engaged and that it is not falling back to TCP sockets (Kubernetes for GPUs). Validate with nccl-tests bus bandwidth against the topology expectation (GPU performance and health).
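The "verify, do not assume" step can be partly automated by scanning a captured NCCL_DEBUG=INFO log. The sketch below assumes channel lines of the form "via NET/IB/0/GDRDMA" or "via NET/Socket/0"; the exact wording varies between NCCL versions, so treat the pattern and the helper names as assumptions to adapt.

```python
import re
from collections import Counter

# Assumed shape of the per-channel transport suffix in NCCL INFO logs.
_CHANNEL = re.compile(r"via NET/([A-Za-z]+)/\d+(/GDRDMA)?")


def summarize_transports(log_text):
    """Count inter-node channels by network plugin, noting GPUDirect RDMA."""
    counts = Counter()
    for match in _CHANNEL.finditer(log_text):
        plugin, gdr = match.group(1), match.group(2)
        counts[plugin + ("+GDRDMA" if gdr else "")] += 1
    return counts


def verdict(counts):
    """Turn the channel counts into a one-line health statement."""
    if not counts:
        return "no network channels found"
    if counts.get("Socket"):
        return "TCP socket fallback"
    if all(key.endswith("+GDRDMA") for key in counts):
        return "GPUDirect RDMA engaged"
    return "network transport without GPUDirect RDMA"


if __name__ == "__main__":
    # Illustrative log lines, not captured from a real run.
    sample = (
        "NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA\n"
        "NCCL INFO Channel 01/0 : 8[0] -> 0[0] [receive] via NET/IB/1/GDRDMA\n"
    )
    print(verdict(summarize_transports(sample)))
```

A clean log is necessary but not sufficient: the nccl-tests bus bandwidth still has to match what the topology should deliver.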
Kernel and framework level¶
- FlashAttention (fused, memory-efficient attention), fused optimizers, torch.compile (graph capture + fusion), CUDA Graphs (kill per-kernel launch overhead), Transformer Engine FP8 (distributed training).
- Comms/compute overlap: DDP gradient bucketing, FSDP prefetch, interleaved pipeline schedules, all hiding communication under computation.
- Precision: BF16 default → FP8 (Blackwell NVFP4 for inference), always with numerics validated.
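To make the gradient-bucketing idea concrete, here is a simplified model of it, not PyTorch's implementation. Gradients become ready roughly in reverse layer order during the backward pass, so they are grouped last layer first into size-capped buckets, and each full bucket can be all-reduced while earlier layers are still computing. The 25 MiB default mirrors DDP's bucket_cap_mb default; the function name is hypothetical.

```python
def bucket_gradients(grad_sizes_bytes, bucket_cap_bytes=25 * 1024 * 1024):
    """Group parameter indices into buckets, walking from the last layer to the first."""
    buckets, current, current_bytes = [], [], 0
    for index in reversed(range(len(grad_sizes_bytes))):
        size = grad_sizes_bytes[index]
        # Close the bucket once adding this gradient would exceed the cap.
        if current and current_bytes + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    mib = 1024 * 1024
    # Four hypothetical layers with 10 MiB of gradients each.
    print(bucket_gradients([10 * mib] * 4))
```

The cap is a trade-off: small buckets start communicating sooner but pay more per-collective latency, large buckets use bandwidth better but leave more communication exposed at the end of the backward pass.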
Precision & interconnect by generation¶
The available precision levers and the comms levers both depend on the GPU generation/tier.
- Precision floor moves with the architecture. TF32 is Ampere and later (Tensor Float 32 on the A100 3rd-gen Tensor Cores). FP8 needs Hopper, Ada, or later: the H100 ships "a Transformer Engine with FP8 precision", and Ada (RTX 6000 Ada, L40S) also has FP8 Tensor Cores. FP4 / NVFP4 is Blackwell only; the GB200 "second-generation Transformer Engine ... enables FP4 AI". Choose the lowest precision the target hardware supports and validate numerics: FP8 for Hopper/Ada training and inference, NVFP4 for Blackwell inference; on Ampere stay at BF16/TF32.
- No-NVLink GPUs change the comms levers. On GeForce (RTX 4090/5090) and the RTX PRO/workstation and Ada cards there is no NVLink, so the inter-GPU path is PCIe peer-to-peer: NVLS / NVLink-SHARP (NCCL_NVLS_ENABLE) and NVSwitch in-network reduction do not apply, and NCCL runs over PCIe P2P (or host staging) rather than NVLink. The high-impact levers become PCIe link health and ACS (below) plus NCCL_P2P_LEVEL, not NVLS. GPUDirect RDMA (NCCL_NET_GDR_LEVEL) is unavailable on GeForce, since it is a Tesla/Quadro (datacenter/workstation-pro) capability, so multi-node GeForce collectives stage through host memory (networking fabric).
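The two rules above can be written down as a small lookup, which is handy in launch scripts that target mixed fleets. The table and function names are hypothetical. The entries follow the text above, except that FP8 for Blackwell training is an assumption carried over from the BF16 to FP8 progression rather than something stated here.

```python
# Lowest precision per architecture and workload, following the text above.
LOWEST_PRECISION = {
    "ampere": {"training": "BF16", "inference": "BF16"},
    "ada": {"training": "FP8", "inference": "FP8"},
    "hopper": {"training": "FP8", "inference": "FP8"},
    # Assumption: FP8 for Blackwell training; NVFP4 is named only for inference.
    "blackwell": {"training": "FP8", "inference": "NVFP4"},
}


def lowest_precision(architecture, workload):
    """Return the lowest precision to try; numerics still need validating."""
    try:
        return LOWEST_PRECISION[architecture.lower()][workload.lower()]
    except KeyError:
        raise ValueError(f"unknown combination: {architecture!r}, {workload!r}")


def comms_levers(has_nvlink):
    """Return the high-impact communication levers for the interconnect tier."""
    if has_nvlink:
        return ["NCCL_NVLS_ENABLE", "NCCL_P2P_LEVEL", "NCCL_NET_GDR_LEVEL"]
    return ["PCIe link health", "ACS", "NCCL_P2P_LEVEL"]


if __name__ == "__main__":
    print(lowest_precision("Blackwell", "inference"), comms_levers(has_nvlink=False))
```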
System / BIOS / host tuning (the quiet wins)¶
- PCIe link health: confirm the link trains at full Gen and width (nvidia-smi -q | grep -i pcie, lspci -vv). A riser or cable that negotiates down to x8 or an older Gen silently halves bandwidth.
- ACS (PCIe Access Control Services): when enabled, it routes peer-to-peer through the root complex and breaks/limits P2P and GPUDirect RDMA. Disable ACS on the GPU/NIC paths, a classic high-impact gotcha.
- NUMA affinity: bind each GPU to its local CPU and NIC (Kubernetes for GPUs); enable above-4G decoding / large BAR, set the CPU governor to performance, configure huge pages and IRQ affinity.
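The PCIe and ACS checks lend themselves to a small script. The sketch below parses the CSV output of nvidia-smi's pcie.link.* query fields and the ACSCtl line from lspci -vvv; the helper names are hypothetical. Run the link check under load, because idle GPUs may downshift the link generation to save power, and note that lspci usually needs root to show ACS capabilities.

```python
import subprocess

LINK_QUERY = ("pcie.link.gen.current,pcie.link.gen.max,"
              "pcie.link.width.current,pcie.link.width.max")


def parse_link_report(csv_text):
    """Return (gpu_index, message) for every link below its maximum Gen or width."""
    problems = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        try:
            gen_cur, gen_max, width_cur, width_max = (
                int(field.strip()) for field in line.split(","))
        except ValueError:
            problems.append((index, f"unparseable row: {line!r}"))
            continue
        if gen_cur < gen_max:
            problems.append((index, f"Gen{gen_cur} (max Gen{gen_max})"))
        if width_cur < width_max:
            problems.append((index, f"x{width_cur} (max x{width_max})"))
    return problems


def acs_enabled_flags(lspci_text):
    """Return the ACS control flags marked '+' (enabled) in lspci -vvv output."""
    enabled = []
    for line in lspci_text.splitlines():
        if "ACSCtl:" in line:
            flags = line.split("ACSCtl:", 1)[1].split()
            enabled.extend(flag[:-1] for flag in flags if flag.endswith("+"))
    return enabled


def query_links():
    """Run nvidia-smi on this node and report degraded links."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={LINK_QUERY}", "--format=csv,noheader"],
        check=True, capture_output=True, text=True)
    return parse_link_report(result.stdout)
```

Any non-empty result from either parser is worth investigating before touching NCCL settings, since no software tuning recovers bandwidth lost at the link.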
Power vs performance¶
- Lock clocks for reproducible benchmarks; for production, the perf/W curve is non-linear, so a modestly lower power cap (datacentre readiness) can improve efficiency and thermal headroom for a small throughput cost.
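Choosing a power cap from a sweep can be sketched as follows: measure throughput and power draw at several caps, keep the points within an accepted throughput loss of the best one, and take the most efficient of those. The function name, the 5% default budget, and the sample data are all hypothetical.

```python
def pick_power_point(measurements, max_throughput_loss=0.05):
    """Pick the most efficient (watts, throughput) point within the loss budget.

    measurements: iterable of (measured_watts, throughput) pairs from a sweep.
    """
    points = list(measurements)
    if not points:
        raise ValueError("need at least one measurement")
    best_throughput = max(throughput for _, throughput in points)
    floor = best_throughput * (1.0 - max_throughput_loss)
    eligible = [p for p in points if p[1] >= floor]
    # Efficiency is throughput per watt; the fastest point is always eligible.
    return max(eligible, key=lambda p: p[1] / p[0])


if __name__ == "__main__":
    # Hypothetical sweep: (watts, samples per second).
    sweep = [(700, 1000), (600, 970), (500, 900), (400, 760)]
    print(pick_power_point(sweep))
```

With this sample data the 600 W point wins: it gives up 3% of throughput for roughly 14% less power, while the lower caps fall outside the loss budget.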
Don't-miss checklist¶
- Identify the dominant bottleneck and measure MFU before tuning anything.
- Verify the PCIe link trains at full Gen/width and that ACS is off on P2P/GDR paths.
- Confirm NCCL is on IB + GDR with the right HCA/IfName, not TCP.
- Overlap comms with compute; apply FlashAttention / torch.compile / CUDA Graphs.
- NUMA-bind GPU ↔ NIC ↔ CPU.
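The NUMA item in the checklist can be checked from sysfs on Linux, where each PCI device exposes a numa_node file. This is a minimal sketch with hypothetical helper names; the PCI addresses in the usage example are placeholders to replace with the real GPU and NIC addresses from lspci.

```python
from pathlib import Path

SYSFS_PCI = "/sys/bus/pci/devices"


def numa_node(pci_address, sysfs_root=SYSFS_PCI):
    """Read the NUMA node the kernel reports for a PCI device (-1 if unknown)."""
    return int((Path(sysfs_root) / pci_address / "numa_node").read_text().strip())


def same_numa_node(gpu_address, nic_address, sysfs_root=SYSFS_PCI):
    """Return True when the GPU and NIC share a known NUMA node."""
    gpu = numa_node(gpu_address, sysfs_root)
    nic = numa_node(nic_address, sysfs_root)
    # A value of -1 means the platform gave no NUMA information, so do not trust a match.
    return gpu == nic and gpu >= 0


if __name__ == "__main__":
    # Placeholder addresses; substitute the ones on the node under test.
    try:
        print(same_numa_node("0000:17:00.0", "0000:18:00.0"))
    except FileNotFoundError as err:
        print(f"device not found: {err}")
```

A mismatch means traffic between that GPU and NIC crosses the inter-socket link, so either pair the GPU with a different NIC or pin the process to the node both devices share.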
Failure modes¶
- ACS left enabled: P2P/GDR traffic is forced through the root complex, so it is silently degraded or disabled and inter-GPU bandwidth drops.
- PCIe link negotiated down (bad riser/cable): silent, persistent bandwidth loss.
- NCCL on TCP fallback (wrong ifname/HCA): collectives run roughly 10x slower.
- Tuning a non-bottleneck, e.g. optimising kernels while dataloader-bound.
- FP8 enabled without numeric validation: silent quality regression (distributed training).
Open questions & validation¶
- The full NCCL env-var matrix for a single-DC fat-tree, including NVLS and GDR-level forcing.
- Run an ACS / NUMA / PCIe-link health checklist on a real node (script in Ansible bring-up).
- An Nsight-driven MFU improvement on a real training run, start to finish.
References¶
- NCCL environment variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
- GPUDirect RDMA (ACS, peer-to-peer): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
- torch.compile: https://docs.pytorch.org/docs/stable/torch.compiler.html
- FlashAttention: https://github.com/Dao-AILab/flash-attention
- NVIDIA A100 (TF32, 3rd-gen Tensor Cores): https://www.nvidia.com/en-us/data-center/a100/
- NVIDIA H100 (Transformer Engine FP8): https://www.nvidia.com/en-us/data-center/h100/
- NVIDIA GB200 NVL72 (2nd-gen Transformer Engine, FP4/NVFP4): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- NVIDIA Transformer Engine (FP8/NVFP4 recipes): https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Related: Fabric · Performance · Storage · Training · Inference · Observability · Glossary