GPU performance & health

Scope: tuning the collective-communication and GPU layer, and monitoring GPU health, where interconnect saturation and telemetry decide whether the hardware is actually used.

flowchart LR
  TOPO["Topology"] --> NCCL["NCCL collectives"]
  NCCL --> BUSBW["Bus bandwidth validation"]
  BUSBW --> TUNE["Performance tuning"]
  DCGM["DCGM health"] --> GATE["Scheduler health gate"]
  GATE --> RAS["RAS workflow"]

Overview

Once the cluster runs, performance lives or dies on how well collectives saturate the interconnect and how reliably GPU health is observed and acted on. This is where existing distributed-training experience is a genuine asset.

Core knowledge

NCCL

  • The collective-communication library underneath distributed training: all_reduce, all_gather, reduce_scatter, broadcast.
  • Topology-aware: builds rings and trees across NVLink (intra-node) and InfiniBand/RoCE (inter-node). It detects and exploits the hierarchy.
  • Tuning levers: algorithm and protocol selection, channel count, chunk sizes, and topology hints. Validate with nccl-tests, comparing the achieved bus bandwidth against the figure expected for the topology rather than reading raw throughput in isolation.
  • SHARP (in-network reduction on NVIDIA IB switches) offloads part of the all_reduce into the fabric, raising effective bandwidth. See networking fabric.
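To make "bus bandwidth against expected" concrete, here is a minimal sketch of the conversion nccl-tests applies when it reports busbw. The correction factors follow the nccl-tests performance notes. The function names and the `expected_gbps` input are illustrative assumptions, and the expected figure has to come from your own topology analysis.

```python
# Sketch: convert a measured collective into bus bandwidth (busbw).
# Factors follow the nccl-tests performance notes; n is the number of ranks.
# Function names here are illustrative, not part of NCCL or nccl-tests.

def busbw_factor(collective: str, n_ranks: int) -> float:
    """Return the algbw -> busbw correction factor for a collective."""
    if n_ranks < 2:
        raise ValueError("need at least 2 ranks")
    n = n_ranks
    factors = {
        "all_reduce": 2 * (n - 1) / n,
        "all_gather": (n - 1) / n,
        "reduce_scatter": (n - 1) / n,
        "broadcast": 1.0,
    }
    if collective not in factors:
        raise ValueError(f"unknown collective: {collective}")
    return factors[collective]


def bus_bandwidth_gbps(collective: str, n_ranks: int,
                       size_bytes: float, time_s: float) -> float:
    """busbw in GB/s, from the message size and time that nccl-tests reports."""
    algbw = size_bytes / time_s / 1e9  # algorithm bandwidth, GB/s
    return algbw * busbw_factor(collective, n_ranks)


def efficiency(achieved_gbps: float, expected_gbps: float) -> float:
    """Fraction of the topology's expected bus bandwidth actually achieved."""
    return achieved_gbps / expected_gbps
```

Because busbw normalises out the rank count, the same number can be compared across runs of different sizes, which is what makes it the right figure to hold against the topology expectation.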

Interconnect tiers (know the hierarchy)

  • NVLink / NVLink-C2C: NVLink carries GPU-to-GPU traffic within a node and, in NVL72, across the rack; NVLink-C2C links each Grace CPU to its GPUs. Fifth-gen NVLink: 1.8 TB/s bidirectional per GPU.
  • GPUDirect RDMA: lets the NIC read and write GPU memory directly, without staging through host memory, for low-latency inter-node transfers.
  • InfiniBand / RoCE: inter-node fabric.

Distributed training parallelism (context)

  • Data parallel (DDP), fully-sharded (FSDP), tensor/pipeline parallel, and low-communication approaches like DiLoCo. The parallelism strategy drives the collective pattern, which drives the fabric requirement.
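As a rough illustration of how the strategy sets the fabric requirement, the sketch below estimates per-GPU communication volume per training step for DDP and FSDP. It assumes ring-style collectives, full sharding for FSDP, gradients the same size as parameters, and no compression or bucketing effects. The function names are hypothetical, and real frameworks overlap and batch this traffic differently.

```python
# Sketch: per-GPU bytes moved per step, under the assumptions stated above.

def per_gpu_bytes_per_step(strategy: str, param_bytes: float, n_ranks: int) -> float:
    """Estimate bytes each GPU sends per step for one model replica group."""
    if n_ranks < 2:
        raise ValueError("need at least 2 ranks")
    ring = (n_ranks - 1) / n_ranks  # fraction of the buffer each rank sends per phase
    if strategy == "ddp":
        # One all_reduce of gradients = reduce_scatter phase + all_gather phase.
        return 2 * ring * param_bytes
    if strategy == "fsdp":
        # all_gather of parameters in forward, again in backward,
        # then one reduce_scatter of gradients.
        return 3 * ring * param_bytes
    raise ValueError(f"unsupported strategy: {strategy}")


def required_gbps(bytes_per_step: float, comm_budget_s: float) -> float:
    """Bandwidth needed to move the traffic within a per-step time budget."""
    return bytes_per_step / comm_budget_s / 1e9


if __name__ == "__main__":
    params = 14e9  # e.g. 7e9 parameters stored as 2-byte values (illustrative)
    for strategy in ("ddp", "fsdp"):
        volume = per_gpu_bytes_per_step(strategy, params, n_ranks=8)
        print(strategy, f"{volume / 1e9:.2f} GB per step,",
              f"{required_gbps(volume, 0.5):.1f} GB/s for a 0.5 s budget")
```

Under these assumptions FSDP moves about 1.5 times the bytes of DDP per step, which is the kind of difference that changes what the inter-node tier has to sustain.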

Health and telemetry

  • DCGM (Data Center GPU Manager): GPU health, diagnostics, utilisation, ECC errors, thermal and power telemetry; integrates with Prometheus exporters.
  • nvidia-smi for quick inspection; DCGM for fleet-scale monitoring and health gating.
  • Tie health into scheduling so failing GPUs are drained (see provisioning and scheduling) and into commissioning acceptance (see commissioning).
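A health gate can be sketched as a small filter over exporter output. The example below parses Prometheus text-format samples and returns the GPUs that should be drained. The metric names are DCGM field identifiers as commonly exposed by dcgm-exporter, but names and availability vary by version, so treat them as assumptions to verify. The thresholds, the `gpu` label, and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines like: METRIC_NAME{label="value",...} 42
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical limits: a GPU is flagged when a value is strictly above its limit.
# Verify the metric names against the deployed exporter before relying on them.
LIMITS = {
    "DCGM_FI_DEV_XID_ERRORS": 0,         # any non-zero XID code
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 0,  # any double-bit ECC error
    "DCGM_FI_DEV_GPU_TEMP": 85,          # degrees C, illustrative threshold
}


def gpus_to_drain(metrics_text: str) -> dict:
    """Map GPU index -> list of reasons it breached a limit."""
    reasons = defaultdict(list)
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, labels, value = match.groups()
        if name not in LIMITS:
            continue
        gpu = dict(LABEL_RE.findall(labels)).get("gpu")
        if gpu is None:
            continue
        if float(value) > LIMITS[name]:
            reasons[gpu].append(f"{name}={value}")
    return dict(reasons)
```

The returned mapping is what a scheduler hook would consume to cordon or drain a node; how that hook is wired depends on the scheduler in use.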

Don't-miss checklist

  • Read NCCL results as bus bandwidth vs topology expectation, not absolute throughput.
  • Confirm GPUDirect RDMA is actually engaged, not silently falling back to host-staged copies.
  • Confirm SHARP is active where the fabric supports it.
  • Wire DCGM telemetry into the same observability stack used for sign-off and run-time.
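One way to check that GPUDirect RDMA is engaged is to run a job with `NCCL_DEBUG=INFO` and inspect the transport named on each network channel line. The sketch below assumes those lines contain `via NET/<plugin>/<dev>` with a `GDRDMA` component when GPUDirect RDMA is in use. The exact log format differs between NCCL versions, so confirm the pattern against a log from the deployed version. The function name is hypothetical.

```python
import re

# Assumed shape of a network channel line in NCCL INFO output, for example:
#   ... NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
NET_RE = re.compile(r'via (NET/\S+)')


def summarize_net_transports(log_text: str) -> dict:
    """Count network channels with and without GPUDirect RDMA."""
    counts = {"gdrdma": 0, "no_gdrdma": 0}
    for line in log_text.splitlines():
        match = NET_RE.search(line)
        if not match:
            continue  # intra-node (P2P/SHM) or unrelated line
        parts = match.group(1).split("/")
        if "GDRDMA" in parts:
            counts["gdrdma"] += 1
        else:
            counts["no_gdrdma"] += 1
    return counts
```

A non-zero `no_gdrdma` count on a fabric that should support GPUDirect RDMA is the silent fallback this checklist item warns about.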

Failure modes

  • Collectives bottlenecked by a single mis-tuned tier (often inter-node) while intra-node looks fine.
  • GPUDirect RDMA silently disabled: transfers fall back to host-staged copies, roughly halving effective inter-node bandwidth.
  • Health signals collected but not gating the scheduler, so jobs land on degraded GPUs.

Open questions & validation

  • Validate DCGM field metrics against the deployed exporter; names and availability vary by DCGM version (see observability).
  • Document a reference NCCL tuning procedure for a single-DC fat-tree, distinct from the geo-distributed-over-WAN case (see performance tuning).

References

  • GB300 NVL72 architecture, NVLink and NCCL all-reduce testing: https://verda.com/blog/gb300-nvl72-architecture
  • Blackwell Ultra interconnect (NVLink 5, GPUDirect context): https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/

Related: Fabric · Commissioning · Provisioning · Glossary