
GPU performance & health

Scope: tuning the collective-communication and GPU layer, and monitoring GPU health. Interconnect saturation and telemetry together decide whether the hardware is actually used.

flowchart LR
  TOPO["Topology"] --> NCCL["NCCL collectives"]
  NCCL --> BUSBW["Bus bandwidth validation"]
  BUSBW --> TUNE["Performance tuning"]
  DCGM["DCGM health"] --> GATE["Scheduler health gate"]
  GATE --> RAS["RAS workflow"]

Overview

Once the cluster runs, performance lives or dies on how well collectives saturate the interconnect and how reliably GPU health is observed and acted on. This is where existing distributed-training experience is a genuine asset.

Core knowledge

NCCL

The collective-communication library underneath distributed training: all_reduce, all_gather, reduce_scatter, broadcast.
Topology-aware: builds rings and trees across NVLink (intra-node) and InfiniBand/RoCE (inter-node). It detects and exploits the hierarchy.
Tuning levers: algorithm and protocol selection, channel count, chunk sizes, and topology hints. Validate with nccl-tests, comparing the achieved bus bandwidth (busbw) with the figure expected for the topology rather than reading raw throughput in isolation.
SHARP (in-network reduction on NVIDIA IB switches) offloads part of the all_reduce into the fabric, raising effective bandwidth. See networking fabric.
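The comparison of achieved against expected bus bandwidth can be sketched as follows. The correction factors are the ones documented in the nccl-tests performance notes (busbw = algbw × factor). The function names are illustrative, and the expected figure must come from your own topology analysis; nothing here is a measured value.

```python
# Minimal sketch: convert a timed collective into nccl-tests style bus bandwidth.
# Function names are illustrative assumptions, not part of NCCL or nccl-tests.

def busbw_factor(collective: str, n_ranks: int) -> float:
    """Correction factor that turns algorithm bandwidth into bus bandwidth."""
    if n_ranks < 2:
        raise ValueError("a collective needs at least two ranks")
    if collective == "all_reduce":
        return 2 * (n_ranks - 1) / n_ranks
    if collective in ("all_gather", "reduce_scatter"):
        return (n_ranks - 1) / n_ranks
    if collective == "broadcast":
        return 1.0
    raise ValueError(f"unsupported collective: {collective}")


def bus_bandwidth_gbps(collective: str, n_ranks: int,
                       message_bytes: float, elapsed_s: float) -> float:
    """Bus bandwidth in GB/s (1e9 bytes per second)."""
    algbw = message_bytes / elapsed_s / 1e9  # algorithm bandwidth
    return algbw * busbw_factor(collective, n_ranks)


def efficiency(measured_busbw: float, expected_busbw: float) -> float:
    """Fraction of the topology's expected bus bandwidth actually achieved."""
    return measured_busbw / expected_busbw
```

For example, an 8-rank all_reduce of 8e9 bytes that takes 0.125 s gives an algorithm bandwidth of 64 GB/s and a bus bandwidth of 112 GB/s. Because the factor removes the dependence on rank count, the result can be compared directly with the link speed the topology should deliver.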

Interconnect tiers (know the hierarchy)

NVLink / NVLink-C2C: NVLink carries GPU-to-GPU traffic within a node and, in NVL72, across the rack; NVLink-C2C is the Grace-to-GPU link. Fifth-gen NVLink: 1.8 TB/s bidirectional per GPU.
GPUDirect RDMA: lets the NIC read/write GPU memory directly, bypassing staging copies in host memory, for low-latency inter-node transfers.
InfiniBand / RoCE: inter-node fabric.

Distributed training parallelism (context)

Data parallel (DDP), fully-sharded (FSDP), tensor/pipeline parallel, and low-communication approaches like DiLoCo. The parallelism strategy drives the collective pattern, which drives the fabric requirement.
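To make the link between strategy and fabric load concrete, here is a rough sketch of per-GPU traffic per training step for DDP and fully-sharded FSDP. It assumes ring-style collectives, gradients the same size as the parameters, and FSDP resharding after the forward pass (so parameters are gathered twice). The function names and the model size in the usage note are hypothetical.

```python
# Rough per-step communication estimate; a sketch under the stated assumptions,
# not a model of any particular framework's implementation.

def ring_bytes_per_gpu(collective: str, payload_bytes: float, n_ranks: int) -> float:
    """Bytes each rank sends for one ring-style collective over the payload."""
    one_pass = (n_ranks - 1) / n_ranks * payload_bytes
    # all_reduce is a reduce_scatter followed by an all_gather: two passes.
    return 2 * one_pass if collective == "all_reduce" else one_pass


def ddp_bytes_per_step(param_bytes: float, n_ranks: int) -> float:
    """DDP: one all_reduce over the gradients."""
    return ring_bytes_per_gpu("all_reduce", param_bytes, n_ranks)


def fsdp_bytes_per_step(param_bytes: float, n_ranks: int) -> float:
    """FSDP (full shard): all_gather in forward, all_gather in backward,
    reduce_scatter of the gradients."""
    return (2 * ring_bytes_per_gpu("all_gather", param_bytes, n_ranks)
            + ring_bytes_per_gpu("reduce_scatter", param_bytes, n_ranks))
```

With a hypothetical 16e9 bytes of parameters on 8 ranks, DDP sends 28e9 bytes per GPU per step and FSDP sends 42e9, a ratio of 1.5. Under these assumptions, sharding trades memory for roughly half again as much traffic, spread over three smaller collectives rather than one large all_reduce.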

Health and telemetry

DCGM (Data Center GPU Manager): GPU health, diagnostics, utilisation, ECC errors, thermal and power telemetry; metrics reach Prometheus through dcgm-exporter.
nvidia-smi for quick inspection; DCGM for fleet-scale monitoring and health gating.
Tie health into scheduling so failing GPUs are drained (see provisioning and scheduling) and into commissioning acceptance (see commissioning).
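A health gate of the kind described above might look like the following sketch. The metric names follow common dcgm-exporter naming but should be checked against the deployed exporter, the thresholds are hypothetical, and `GpuSample`, `drain_reasons`, and `gpus_to_drain` are illustrative names rather than any scheduler's API.

```python
from dataclasses import dataclass

# Assumed metric names and hypothetical limits: a value above the limit fails the gate.
DRAIN_RULES = {
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 0,  # any double-bit ECC error
    "DCGM_FI_DEV_XID_ERRORS": 0,         # any non-zero XID code
    "DCGM_FI_DEV_GPU_TEMP": 90,          # degrees C, hypothetical limit
}


@dataclass(frozen=True)
class GpuSample:
    gpu_id: str
    metrics: dict  # metric name -> latest scraped value


def drain_reasons(sample: GpuSample) -> list:
    """Return the reasons this GPU should be drained (empty list means healthy)."""
    reasons = []
    for name, limit in DRAIN_RULES.items():
        if name not in sample.metrics:
            # Fail closed: a missing metric may mean a renamed or disabled field.
            reasons.append(f"{name} missing")
        elif sample.metrics[name] > limit:
            reasons.append(f"{name}={sample.metrics[name]} exceeds {limit}")
    return reasons


def gpus_to_drain(samples: list) -> dict:
    """Map gpu_id -> reasons for every GPU that fails the gate."""
    failing = {}
    for sample in samples:
        reasons = drain_reasons(sample)
        if reasons:
            failing[sample.gpu_id] = reasons
    return failing
```

Treating a missing metric as a failure is a deliberate choice here: if the exporter renames or drops a field, the gate reports it instead of silently passing every GPU.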

Don't-miss checklist

Read NCCL results as bus bandwidth vs topology expectation, not absolute throughput.
Confirm GPUDirect RDMA is actually engaged, not silently falling back to host-staged copies.
Confirm SHARP is active where the fabric supports it.
Wire DCGM telemetry into the same observability stack used for sign-off and run-time.
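One way to check the GPUDirect RDMA item is to scan NCCL's debug output from a run with `NCCL_DEBUG=INFO`. The sketch below assumes that network channel lines contain `via NET/<transport>` and carry a `GDRDMA` tag when GPUDirect RDMA is in use; the exact log format varies by NCCL version, so confirm it against real output from your cluster. The sample lines in the usage note are illustrative, not captured logs.

```python
import re

# Assumed pattern for network channel lines in NCCL INFO-level output.
NET_CHANNEL = re.compile(r"via NET/(\S+)")


def gdr_summary(log_text: str) -> tuple:
    """Return (network_channels, channels_tagged_gdrdma) found in the log text."""
    total = 0
    with_gdr = 0
    for line in log_text.splitlines():
        match = NET_CHANNEL.search(line)
        if match is None:
            continue  # not a network channel line (e.g. an intra-node path)
        total += 1
        if "GDRDMA" in match.group(1):
            with_gdr += 1
    return total, with_gdr


def gdr_fully_engaged(log_text: str) -> bool:
    """True only if at least one network channel exists and all use GDRDMA."""
    total, with_gdr = gdr_summary(log_text)
    return total > 0 and total == with_gdr
```

Given illustrative lines such as `... [send] via NET/IB/0/GDRDMA` and `... [send] via NET/IB/0`, `gdr_summary` would count two network channels with one using GPUDirect RDMA, and `gdr_fully_engaged` would return `False`, flagging a partial fallback to host-staged copies.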

Failure modes

Collectives bottlenecked by a single mis-tuned tier (often inter-node) while intra-node looks fine.
GPUDirect RDMA silently disabled: traffic falls back to host-staged copies, which can roughly halve effective inter-node bandwidth.
Health signals collected but not gating the scheduler, so jobs land on degraded GPUs.

Open questions & validation

Validate DCGM field metrics against the deployed exporter; names and availability vary by DCGM version (see observability).
Document a reference NCCL tuning procedure for a single-DC fat-tree, distinct from the geo-distributed-over-WAN case (see performance tuning).
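As a starting point for such a procedure, the sketch below enumerates a small sweep over NCCL environment variables and picks the configuration with the highest measured bus bandwidth. `NCCL_ALGO`, `NCCL_PROTO`, `NCCL_MIN_NCHANNELS`, and `NCCL_MAX_NCHANNELS` are real NCCL settings, but the candidate values and helper names are assumptions. Running nccl-tests for each configuration is left out, and forced settings are best treated as experiments for understanding NCCL's defaults rather than as production configuration.

```python
from itertools import product

# Candidate values are assumptions for illustration; adjust to the fabric under test.
ALGOS = ("Ring", "Tree")
PROTOS = ("LL", "LL128", "Simple")
CHANNEL_COUNTS = (None, 8, 16)  # None leaves the channel count at NCCL's default


def sweep_configs() -> list:
    """Build one environment-variable dict per combination to test."""
    configs = []
    for algo, proto, channels in product(ALGOS, PROTOS, CHANNEL_COUNTS):
        env = {"NCCL_ALGO": algo, "NCCL_PROTO": proto}
        if channels is not None:
            # Pin the channel count by setting min and max to the same value.
            env["NCCL_MIN_NCHANNELS"] = str(channels)
            env["NCCL_MAX_NCHANNELS"] = str(channels)
        configs.append(env)
    return configs


def best_config(results: list) -> tuple:
    """results: list of (env, measured_busbw_gbps); return the best pair."""
    return max(results, key=lambda pair: pair[1])
```

Each dict from `sweep_configs()` can be merged into the environment of an nccl-tests launch, with the measured bus bandwidth recorded alongside it. `best_config` then selects the winner, which should still be compared with an untuned baseline run before anything is adopted.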

References

GB300 NVL72 architecture, NVLink and NCCL all-reduce testing: https://verda.com/blog/gb300-nvl72-architecture
Blackwell Ultra interconnect (NVLink 5, GPUDirect context): https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/

Related: Fabric · Commissioning · Provisioning · Glossary