GPU performance & health¶
Scope: tuning the collective-communication and GPU layer, and monitoring GPU health, where interconnect saturation and telemetry decide whether the hardware is actually used.
flowchart LR
TOPO["Topology"] --> NCCL["NCCL collectives"]
NCCL --> BUSBW["Bus bandwidth validation"]
BUSBW --> TUNE["Performance tuning"]
DCGM["DCGM health"] --> GATE["Scheduler health gate"]
GATE --> RAS["RAS workflow"]
Overview¶
Once the cluster runs, performance lives or dies on how well collectives saturate the interconnect and how reliably GPU health is observed and acted on. This is where existing distributed-training experience is a genuine asset.
Core knowledge¶
NCCL¶
- The collective-communication library underneath distributed training: all_reduce, all_gather, reduce_scatter, broadcast.
- Topology-aware: builds rings and trees across NVLink (intra-node) and InfiniBand/RoCE (inter-node). It detects and exploits the hierarchy.
- Tuning levers: algorithm and protocol selection, channel count, chunk sizes, and topology hints. Validate with nccl-tests, reading achieved bus bandwidth against expected for the topology, not just raw numbers.
- SHARP (in-network reduction on NVIDIA IB switches) offloads part of the all_reduce into the fabric, raising effective bandwidth. See networking fabric.
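The "bus bandwidth, not raw numbers" point can be made concrete with a small sketch. It assumes the correction factors documented for nccl-tests (algorithm bandwidth scaled by a per-collective factor that depends on rank count); the function name `bus_bandwidth_gbps` is illustrative and not part of any NCCL tooling.

```python
def bus_bandwidth_gbps(collective: str, size_bytes: float,
                       seconds: float, n_ranks: int) -> float:
    """Convert a timed collective into nccl-tests style bus bandwidth (GB/s).

    size_bytes is the message size as nccl-tests reports it for that
    collective; seconds is the measured time for one operation.
    """
    if n_ranks < 2:
        raise ValueError("bus bandwidth is only meaningful for 2+ ranks")
    if seconds <= 0:
        raise ValueError("seconds must be positive")

    # Algorithm bandwidth: bytes divided by time, ignoring rank count.
    alg_bw = size_bytes / seconds / 1e9

    # Per-collective correction so the result is comparable to link speed.
    ring = (n_ranks - 1) / n_ranks
    factors = {
        "all_reduce": 2 * ring,   # reduce_scatter phase + all_gather phase
        "all_gather": ring,
        "reduce_scatter": ring,
        "broadcast": 1.0,
    }
    if collective not in factors:
        raise ValueError(f"unknown collective: {collective}")
    return alg_bw * factors[collective]
```

For example, an 8-rank all_reduce that moves 8 GB in one second has an algorithm bandwidth of 8 GB/s but a bus bandwidth of 14 GB/s; the latter is the figure to compare against what the topology should deliver.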
Interconnect tiers (know the hierarchy)¶
- NVLink / NVLink-C2C: NVLink connects GPUs within a node and, in NVL72, across the rack; NVLink-C2C links each Grace CPU to its GPUs. Fifth-gen NVLink: 1.8 TB/s of bidirectional bandwidth per GPU.
- GPUDirect RDMA: lets the NIC read/write GPU memory directly, bypassing the host, for low-latency inter-node transfers.
- InfiniBand / RoCE: inter-node fabric.
Distributed training parallelism (context)¶
- Data parallel (DDP), fully-sharded (FSDP), tensor/pipeline parallel, and low-communication approaches like DiLoCo. The parallelism strategy drives the collective pattern, which drives the fabric requirement.
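How the strategy drives the collective pattern can be sketched as a rough per-rank traffic estimate. This is a back-of-envelope model, not a measurement: it assumes ring-style collectives, gradients the same size as parameters, full sharding for FSDP, and no compression or overlap. The function name and the two strategy labels are illustrative.

```python
def per_rank_bytes_per_step(strategy: str, param_bytes: float,
                            n_ranks: int) -> float:
    """Rough bytes each rank sends per training step (assumptions above)."""
    if n_ranks < 2:
        raise ValueError("need at least 2 ranks for any collective traffic")

    # Fraction of the buffer each rank sends in one ring pass.
    ring = (n_ranks - 1) / n_ranks

    if strategy == "ddp":
        # One gradient all_reduce = reduce_scatter + all_gather.
        return 2 * ring * param_bytes
    if strategy == "fsdp":
        # all_gather of parameters in forward, again in backward,
        # then one reduce_scatter of gradients.
        return 3 * ring * param_bytes
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions FSDP moves about 1.5 times the per-rank bytes of DDP in exchange for sharded memory, which is one reason the fabric requirement shifts with the strategy.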
Health and telemetry¶
- DCGM (Data Center GPU Manager): GPU health, diagnostics, utilisation, ECC errors, thermal and power telemetry; integrates with Prometheus exporters.
- nvidia-smi for quick inspection; DCGM for fleet-scale monitoring and health gating.
- Tie health into scheduling so failing GPUs are drained (see provisioning and scheduling) and into commissioning acceptance (see commissioning).
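A health gate of the kind described above might look like the following sketch. The metric names follow the DCGM exporter's `DCGM_FI_DEV_*` naming style but should be checked against the deployed exporter; the thresholds, the `GpuSample` type, and the `drain_reasons` function are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GpuSample:
    gpu_id: str
    metrics: Dict[str, float]  # metric name -> latest scraped value


# Illustrative limits only; real thresholds are a site policy decision.
HYPOTHETICAL_LIMITS: Dict[str, float] = {
    "DCGM_FI_DEV_GPU_TEMP": 90.0,          # degrees C
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 0.0,  # any double-bit ECC error fails
    "DCGM_FI_DEV_XID_ERRORS": 0.0,         # any non-zero XID code fails
}


def drain_reasons(sample: GpuSample,
                  limits: Dict[str, float] = HYPOTHETICAL_LIMITS) -> List[str]:
    """Return the reasons this GPU should be drained; empty means healthy."""
    reasons = []
    for name, limit in limits.items():
        value = sample.metrics.get(name)
        if value is None:
            # Missing telemetry is treated as unhealthy, not as a pass.
            reasons.append(f"{name}: missing")
        elif value > limit:
            reasons.append(f"{name}: {value} > {limit}")
    return reasons
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a GPU whose telemetry has gone quiet should not be scheduled as if it were known to be healthy.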
Don't-miss checklist¶
- Read NCCL results as bus bandwidth vs topology expectation, not absolute throughput.
- Confirm GPUDirect RDMA is actually engaged, not silently falling back to host-staged copies.
- Confirm SHARP is active where the fabric supports it.
- Wire DCGM telemetry into the same observability stack used for sign-off and run-time.
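One way to check the GPUDirect RDMA item is to scan the output of a run with `NCCL_DEBUG=INFO` for the network transport each channel selected. The sketch below assumes the log lines contain a `via NET/...` path with a `GDRDMA` suffix when GPUDirect RDMA is in use; the exact log format varies between NCCL versions, so the pattern should be confirmed against real output first. The function name is illustrative.

```python
import re
from typing import Dict

# Assumed pattern: "... via NET/IB/0/GDRDMA" (with GPUDirect RDMA)
# versus "... via NET/IB/0" or "... via NET/Socket/0" (without).
NET_PATH = re.compile(r"via NET/(\S+)")


def summarize_net_paths(log_text: str) -> Dict[str, int]:
    """Count network channels with and without the GDRDMA marker."""
    counts = {"gdrdma": 0, "no_gdrdma": 0}
    for line in log_text.splitlines():
        match = NET_PATH.search(line)
        if match is None:
            continue  # intra-node (P2P/SHM) or unrelated log line
        if "GDRDMA" in match.group(1):
            counts["gdrdma"] += 1
        else:
            counts["no_gdrdma"] += 1
    return counts
```

A non-zero `no_gdrdma` count on a fabric that should support GPUDirect RDMA is a prompt to investigate, not proof of a fault; some channels may legitimately use a different path.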
Failure modes¶
- Collectives bottlenecked by a single mis-tuned tier (often inter-node) while intra-node looks fine.
- GPUDirect RDMA silently disabled, so inter-node transfers are staged through host memory, halving effective inter-node bandwidth.
- Health signals collected but not gating the scheduler, so jobs land on degraded GPUs.
Open questions & validation¶
- Validate DCGM field metrics against the deployed exporter; names and availability vary by DCGM version (see observability).
- Document a reference NCCL tuning procedure for a single-DC fat-tree, distinct from the geo-distributed-over-WAN case (see performance tuning).
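The DCGM validation item can be approached by diffing the metric names a dashboard or health gate depends on against what the exporter actually serves. The sketch below parses Prometheus text exposition output; how that text is fetched (endpoint, port) is deployment-specific and left out, and both function names are illustrative.

```python
import re
from typing import Iterable, List, Set


def exported_metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from Prometheus text exposition output."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or whitespace.
        names.add(re.split(r"[{\s]", line, maxsplit=1)[0])
    return names


def missing_metrics(required: Iterable[str],
                    exposition_text: str) -> List[str]:
    """Return required metric names the exporter does not expose."""
    return sorted(set(required) - exported_metric_names(exposition_text))
```

Running this check at commissioning and again after any DCGM or exporter upgrade catches renamed or dropped fields before they silently blank a dashboard or disable a gate.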
References¶
- GB300 NVL72 architecture, NVLink and NCCL all-reduce testing: https://verda.com/blog/gb300-nvl72-architecture
- Blackwell Ultra interconnect (NVLink 5, GPUDirect context): https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
Related: Fabric · Commissioning · Provisioning · Glossary