Skip to content
Markdown

Performance optimization & tuning

Scope: making the cluster fast. The optimisation method and the specific levers across the whole stack, from NCCL environment variables to BIOS/PCIe to fused kernels. The "optimise" verb, pulling together threads from GPU performance and health, storage and data, distributed training, and observability.

flowchart LR
  MEASURE["Measure"] --> LOCATE["Locate dominant bottleneck"]
  LOCATE --> FIX["Fix one layer"]
  FIX --> REMEASURE["Re-measure"]
  REMEASURE --> MEASURE

Overview

Optimisation is a discipline, not a bag of tricks: measure, find the dominant bottleneck, fix that one, repeat. The target is MFU (distributed training) for training and goodput (inference serving) for inference; never tune blind. Half the wins are not clever kernels but cluster hygiene: a PCIe link that trained down a generation, ACS left enabled, or NCCL quietly on TCP. Know the hierarchy of where time goes and check the cheap, high-impact things first.

GPU performance engineering deep dives

This page is the pillar for GPU cluster performance. Work top-down: profile, classify the bottleneck, then open the focused page for that layer:

Core knowledge

Method and the roofline

  • Loop: measure → locate the dominant bottleneck → fix → re-measure. Use Nsight Systems (observability) to see where the step's time actually goes.
  • Roofline: every kernel is compute-bound or memory-bound. Establish which before tuning, since optimising compute on a memory-bound kernel wastes effort.

The bottleneck hierarchy

  • Dataloader / IO bound (storage and data): GPUs idle waiting on data. Fix: more workers, prefetch, pinned memory, DALI, dataset sharding.
  • Communication bound: collectives dominate. Fix: NCCL tuning (below), topology-correct parallelism (distributed training), comms/compute overlap, SHARP/NVLS (networking fabric).
  • Compute bound but low MFU: kernel efficiency. Fix: fused kernels (FlashAttention), FP8, larger batch, torch.compile.
  • CPU / host bound: Python and launch overhead on many small kernels. Fix: CUDA Graphs, torch.compile, bigger batches.

NCCL tuning (the high-leverage levers)

  • Algorithm/protocol: NCCL_ALGO (Ring / Tree / NVLS / CollNet), NCCL_PROTO (LL / LL128 / Simple), channels (NCCL_MIN_NCHANNELS / MAX_NCHANNELS).
  • Transport/fabric: NCCL_IB_HCA (which HCAs), NCCL_SOCKET_IFNAME, NCCL_IB_GID_INDEX, NCCL_P2P_LEVEL, NCCL_NET_GDR_LEVEL (force GPUDirect RDMA), NCCL_NVLS_ENABLE (NVLink SHARP / in-network reduction).
  • Verify, do not assume: NCCL_DEBUG=INFO shows the chosen transport. Confirm [GDRDMA] is engaged and that it is not falling back to TCP sockets (Kubernetes for GPUs). Validate with nccl-tests bus bandwidth against the topology expectation (GPU performance and health).

Kernel and framework level

  • FlashAttention (fused, memory-efficient attention), fused optimizers, torch.compile (graph capture + fusion), CUDA Graphs (kill per-kernel launch overhead), Transformer Engine FP8 (distributed training).
  • Comms/compute overlap: DDP gradient bucketing, FSDP prefetch, interleaved pipeline schedules, all hiding communication under computation.
  • Precision: BF16 default → FP8 (Blackwell NVFP4 for inference), always with numerics validated.

Precision & interconnect by generation

The available precision levers and the comms levers both depend on the GPU generation/tier.

  • Precision floor moves with the architecture. TF32 is Ampere and later (Tensor Float 32 on the A100 3rd-gen Tensor Cores). FP8 needs Hopper or Ada: the H100 ships "a Transformer Engine with FP8 precision", and Ada (RTX 6000 Ada, L40S) also has FP8 Tensor Cores. FP4 / NVFP4 is Blackwell only; the GB200 "second-generation Transformer Engine ... enables FP4 AI". Choose the lowest precision the target hardware supports and validate numerics: FP8 for Hopper/Ada training and inference, NVFP4 for Blackwell inference; on Ampere stay at BF16/TF32.
  • No-NVLink GPUs change the comms levers. On GeForce (RTX 4090/5090) and the RTX PRO/workstation and Ada cards there is no NVLink, so the inter-GPU path is PCIe peer-to-peer: NVLS / NVLink-SHARP (NCCL_NVLS_ENABLE) and NVSwitch in-network reduction do not apply, and NCCL runs over PCIe P2P (or host staging) rather than NVLink. The high-impact levers become PCIe link health and ACS (below) plus NCCL_P2P_LEVEL, not NVLS. GPUDirect RDMA (NCCL_NET_GDR_LEVEL) is unavailable on GeForce, since it is a Tesla/Quadro (datacenter/workstation-pro) capability, so multi-node GeForce collectives stage through host memory (networking fabric).

System / BIOS / host tuning (the quiet wins)

  • PCIe link health: confirm the link trains at full Gen and width (nvidia-smi -q | grep -i pcie, lspci -vv). A riser or cable that negotiates down to x8 or an older Gen silently halves bandwidth.
  • ACS (PCIe Access Control Services): when enabled, it routes peer-to-peer through the root complex and breaks/limits P2P and GPUDirect RDMA. Disable ACS on the GPU/NIC paths, a classic high-impact gotcha.
  • NUMA affinity: bind each GPU to its local CPU and NIC (Kubernetes for GPUs); enable above-4G decoding / large BAR, set the CPU governor to performance, configure huge pages and IRQ affinity.

Power vs performance

  • Lock clocks for reproducible benchmarks; for production, the perf/W curve is non-linear, so a modestly lower power cap (datacentre readiness) can improve efficiency and thermal headroom for a small throughput cost.

Don't-miss checklist

  • Identify the dominant bottleneck and measure MFU before tuning anything.
  • Verify the PCIe link trains at full Gen/width and that ACS is off on P2P/GDR paths.
  • Confirm NCCL is on IB + GDR with the right HCA/IfName, not TCP.
  • Overlap comms with compute; apply FlashAttention / torch.compile / CUDA Graphs.
  • NUMA-bind GPU ↔ NIC ↔ CPU.

Failure modes

  • ACS left enabled: P2P/GDR silently disabled, inter-GPU bandwidth halved.
  • PCIe link negotiated down (bad riser/cable): silent, persistent bandwidth loss.
  • NCCL on TCP fallback (wrong ifname/HCA): collectives ~10x slow.
  • Tuning a non-bottleneck, e.g. optimising kernels while dataloader-bound.
  • FP8 enabled without numeric validation: silent quality regression (distributed training).

Open questions & validation

  • The full NCCL env-var matrix for a single-DC fat-tree, including NVLS and GDR-level forcing.
  • Run an ACS / NUMA / PCIe-link health checklist on a real node (script in Ansible bring-up).
  • An Nsight-driven MFU improvement on a real training run, start to finish.

References

  • NCCL environment variables: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
  • NCCL troubleshooting: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
  • GPUDirect RDMA (ACS, peer-to-peer): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
  • torch.compile: https://docs.pytorch.org/docs/stable/torch.compiler.html
  • FlashAttention: https://github.com/Dao-AILab/flash-attention
  • NVIDIA A100 (TF32, 3rd-gen Tensor Cores): https://www.nvidia.com/en-us/data-center/a100/
  • NVIDIA H100 (Transformer Engine FP8): https://www.nvidia.com/en-us/data-center/h100/
  • NVIDIA GB200 NVL72 (2nd-gen Transformer Engine, FP4/NVFP4): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
  • NVIDIA Transformer Engine (FP8/NVFP4 recipes): https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

Related: Fabric · Performance · Storage · Training · Inference · Observability · Glossary