Recipe: fabric validation with nccl-tests

Scope: a standalone, executable recipe to validate the cluster fabric with nccl-tests as a Kubeflow MPIJob, covering the manifest, how to apply it, the bus-bandwidth pass criterion, confirming GPUDirect RDMA from the logs, and the common failure modes.

Reference template. Nothing here was executed on hardware. Pin the container image, substitute real namespaces, RDMA resource keys (rdma/ib vs rdma/roce), and HCA filters before applying. Run on one node pair before a fleet roll.

What it is

nccl-tests is NVIDIA's microbenchmark suite for NCCL collectives. all_reduce_perf runs an all-reduce across every GPU in the job, sweeping message sizes, and reports two bandwidths per size: algorithm bandwidth (algbw = size / time) and bus bandwidth (busbw), the hardware-utilization figure normalized so it is comparable to the link peak independent of rank count.¹ For ring all-reduce over n ranks, busbw = algbw * 2(n-1)/n.¹ You read busbw at the large-message tail (8 GiB) against the topology expectation and confirm the transport went over GPUDirect RDMA, not TCP.
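
As a worked instance of the ring all-reduce formula above, the sketch below converts an algbw reading into busbw for a given rank count. The helper name busbw_from_algbw is illustrative and not part of nccl-tests (which already prints busbw itself), and the 100 GB/s input in the usage note is a made-up figure used only to show the arithmetic.

```shell
# busbw_from_algbw ALGBW_GBPS NRANKS
# Applies the all-reduce correction factor 2(n-1)/n from the text.
# Hypothetical helper for checking the arithmetic by hand.
busbw_from_algbw() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (n < 2) { print "need at least 2 ranks" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", a * 2 * (n - 1) / n
  }'
}
```

For the 16-rank job in this recipe the factor is 2 * 15 / 16 = 1.875, so `busbw_from_algbw 100 16` prints 187.50.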

This is the host/cluster fabric proof, step 6 of the end-to-end bring-up in "workload bring-up recipes". It is the same probe packaged for one purpose: the broader acceptance suite is covered in "smoke tests: GPU platform", and the host-level interconnect procedure in "fabric bring-up and benchmarking".

Why it matters

A degraded link, a NIC that fell back to TCP, or a missing GPUDirect path does not crash a job. It silently cuts collective bandwidth and shows up later as low MFU or a slow training step, where it is easily misattributed to the model or to a GPU. Validating the fabric first turns "the run is slow" into a one-variable question. Bus bandwidth near line rate plus a GDRDMA transport in the logs is the single signal that the comms layer is healthy before any training benchmark is trusted.

flowchart LR
  APPLY["kubectl apply MPIJob"] --> SCHED["MPI Operator builds hostfile,<br/>launches mpirun"]
  SCHED --> RUN["all_reduce_perf -b 8 -e 8G"]
  RUN --> LOGS["kubectl logs launcher"]
  LOGS --> GDR{"transport == NET/IB/.../GDRDMA?"}
  GDR -->|"no"| TCP["FAIL: TCP fallback /<br/>GDR disabled — fix RDMA"]
  GDR -->|"yes"| BW{"tail busbw near line rate?"}
  BW -->|"no"| DEG["FAIL: degraded link /<br/>wrong algo — investigate"]
  BW -->|"yes"| PASS["PASS: fabric validated"]

When it is needed (and when not)

Needed: after fabric bring-up and node provisioning, before the first distributed training job (distributed training recipes, FSDP, DiLoCo); on commissioning of new nodes; and as a regression check when an MFU regression or NCCL hang points at the fabric.

Not needed: single-node single-GPU work (no collective fabric to prove); intra-node-only validation, where nvbandwidth/NVLink checks in fabric bring-up are the right tool; and ongoing health, which belongs to GPU health gating and telemetry and monitoring, not a one-shot benchmark.

How: implement, integrate, maintain

Prerequisites: the Kubeflow MPI Operator (kubeflow.org/v2beta1) is installed, and the RDMA resource is exposed by the Network Operator (confirm the key with kubectl describe node | grep rdma). The image below ships NCCL. The recipe assumes the nccl-tests binaries are prebuilt at /workspace/nccl-tests/build, as in recent NGC PyTorch images; verify this in your pinned tag, and if they are absent, build them in an init step (make MPI=1).

Save as nccl-allreduce.yaml. This runs 2 workers x 8 GPUs = 16 ranks. slotsPerWorker must equal GPUs per worker, and -np must equal total ranks.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-allreduce
  namespace: ml
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to your tested tag
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "16"                     # total ranks = workers * slotsPerWorker
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO          # prints the per-channel transport
                - -x
                - NCCL_IB_HCA=mlx5         # restrict to IB HCAs; substitute real prefix
                - -x
                - NCCL_NET_GDR_LEVEL=SYS   # allow GDR across NUMA; see NCCL env docs
                - /workspace/nccl-tests/build/all_reduce_perf
                - -b
                - "8"                      # min message: 8 bytes
                - -e
                - 8G                       # max message: 8 GiB
                - -f
                - "2"                      # geometric step x2
                - -g
                - "1"                      # 1 GPU per MPI rank
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:25.05-py3
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1   # substitute your RDMA resource key
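
The rank arithmetic stated for this manifest (-np must equal workers times slotsPerWorker) can be checked before applying. The sketch below is a minimal pre-flight check; the function name check_ranks is hypothetical, and the three values are passed by hand rather than parsed from the YAML.

```shell
# check_ranks NP WORKERS SLOTS_PER_WORKER
# Succeeds only when -np equals workers * slotsPerWorker.
# Hypothetical helper; values are supplied by the caller, not read from YAML.
check_ranks() {
  local np="$1" workers="$2" slots="$3"
  local expected=$(( workers * slots ))
  if [ "$np" -ne "$expected" ]; then
    echo "mismatch: -np=$np but workers*slotsPerWorker=$expected" >&2
    return 1
  fi
  echo "ok: $expected ranks"
}
```

For the manifest above, `check_ranks 16 2 8` prints "ok: 16 ranks"; changing slotsPerWorker without updating -np makes it return non-zero.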

Apply and watch:

kubectl apply -f nccl-allreduce.yaml
# mpirun only makes progress once all workers are Running
kubectl get pods -n ml -l training.kubeflow.org/job-name=nccl-allreduce -w
kubectl logs -n ml -f job/nccl-allreduce-launcher
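
For unattended runs, the interactive watch above can be replaced by a blocking wait that saves the launcher log for later parsing. This is a sketch under assumptions: the function name capture_launcher_log and the 30-minute timeout are illustrative, and it relies on the MPIJob reporting a Succeeded condition in its status, which you should confirm against your operator version.

```shell
# capture_launcher_log NAMESPACE JOB OUTFILE
# Blocks until the MPIJob reports the Succeeded condition, then writes the
# launcher log to OUTFILE. If the job fails instead, kubectl wait only
# returns at the timeout, so keep the timeout realistic.
# Assumption: the launcher is a Job named <JOB>-launcher, as in the text.
capture_launcher_log() {
  local ns="$1" job="$2" out="$3"
  kubectl wait -n "$ns" --for=condition=Succeeded "mpijob/$job" --timeout=30m || return 1
  kubectl logs -n "$ns" "job/${job}-launcher" > "$out"
}
```

A call such as `capture_launcher_log ml nccl-allreduce nccl-allreduce.log` leaves a file that the pass criteria can be checked against offline.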

Read the result. The output table has one row per message size, with columns size, count, type, redop, root, then time, algbw, busbw, #wrong twice (out-of-place, then in-place). The excerpt shows only the out-of-place group; root is -1 for all-reduce:

#       size         count      type   redop    root     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)
  8589934592    2147483648     float     sum      -1      ...   XXX.X   YYY.Y      0
# Avg bus bandwidth : YYY.Y
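
Reading the tail row by eye is error-prone once NCCL_DEBUG=INFO lines are interleaved with the table. The sketch below pulls the out-of-place busbw of the largest message size from a saved log. It assumes the column order listed above (size, count, type, redop, root, then two groups of time, algbw, busbw, #wrong), which gives 13 whitespace-separated fields per result row; the function name tail_busbw is hypothetical, and other nccl-tests versions may print different columns.

```shell
# tail_busbw: reads nccl-tests output on stdin and prints the out-of-place
# busbw (field 8) of the row with the largest message size.
# Result rows are recognised as lines whose first field is all digits and
# that carry at least 13 fields; NCCL INFO lines and '#' comments are skipped.
tail_busbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 {
         if ($1 + 0 >= max) { max = $1 + 0; bw = $8 }
       }
       END { if (max > 0) print bw; else exit 1 }'
}
```

Typical use is `tail_busbw < nccl-allreduce.log`; the function exits non-zero when no result row is found, which usually means the benchmark never ran.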

Pass criteria (all three):

Bus bandwidth. At the 8 GiB tail, busbw approaches the link's line-rate ceiling for the topology (achieved busbw always sits below the ceiling because of protocol overhead). Set the threshold from the per-link spec, e.g. on a 400 Gb/s (~50 GB/s) IB plane expect tail busbw in the tens of GB/s; record your measured tail as the baseline for MFU regression. Do not hardcode a number you have not measured.
GPUDirect RDMA engaged. NCCL_DEBUG=INFO shows channels over IB with GDR, e.g. Channel 00 : 1[...] -> 0[...] [send] via NET/IB/0/GDRDMA,² and you must not see NET/IB : GPU Direct RDMA Disabled for HCA ... or a NET/Socket/TCP transport.²
Correctness. #wrong is 0 on every row.
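
The GPUDirect RDMA and correctness criteria can be checked mechanically against a saved launcher log. The sketch below is one way to do it, using only the log strings quoted above; the function names gdr_engaged and all_correct are hypothetical, the exact NCCL log wording can differ between versions, and the field positions assume the 13-column result rows described earlier.

```shell
# gdr_engaged LOGFILE
# True if at least one channel goes over NET/IB with GDRDMA and no line
# reports GDR disabled or a socket transport.
gdr_engaged() {
  local log="$1"
  grep -q 'via NET/IB/.*GDRDMA' "$log" || return 1
  if grep -Eq 'GPU Direct RDMA Disabled|via NET/Socket' "$log"; then
    return 1
  fi
}

# all_correct LOGFILE
# True if at least one result row exists and every row has #wrong == 0 in
# both the out-of-place (field 9) and in-place (field 13) groups.
all_correct() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 {
         rows++
         if ($9 != "0" || $13 != "0") bad++
       }
       END { exit (rows > 0 && bad == 0) ? 0 : 1 }' "$1"
}
```

Chaining them as `gdr_engaged run.log && all_correct run.log` gives a single exit status for two of the three criteria; the bandwidth threshold still has to come from your own measured baseline.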

Integrate this as the fabric gate in workload bring-up and the platform smoke suite (smoke tests); the same RDMA resource request and NCCL_* env carry into the Volcano training job (Volcano Gang Scheduler) and distributed training recipes. For algorithm/protocol tuning of the collective itself, see NCCL collectives and algorithm selection.

Maintain: re-run on every new node batch and after fabric or driver changes; keep the measured tail busbw as the recorded baseline and alert on regression via telemetry / observability. Pin the image tag so the NCCL version is stable across runs.
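
The baseline comparison described above reduces to one inequality. The sketch below flags a regression when the measured tail busbw falls more than a chosen percentage below the recorded baseline. The function name busbw_regressed and the 5 percent tolerance in the usage note are placeholders, not recommended values; pick the tolerance from the run-to-run variance you actually observe.

```shell
# busbw_regressed MEASURED_GBPS BASELINE_GBPS TOLERANCE_PCT
# Exit status 0 means "regressed": measured < baseline * (1 - tolerance/100).
# Exit status 1 means the measurement is within tolerance.
busbw_regressed() {
  awk -v m="$1" -v b="$2" -v t="$3" \
    'BEGIN { exit (m < b * (1 - t / 100)) ? 0 : 1 }'
}
```

For example, `busbw_regressed "$measured" "$baseline" 5 && echo "fabric regression"` can feed an alert, with both variables taken from your own recorded runs.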

Common failure modes

TCP fallback / GDR disabled. Logs show NET/Socket or GPU Direct RDMA Disabled; busbw collapses. Cause: missing rdma/... resource request, wrong NCCL_IB_HCA filter, or no GPUDirect path. Confirm the RDMA resource is requested and the HCA prefix is real.
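Launcher never starts. mpirun cannot launch ranks until every worker is Running, and depending on the operator's launcher creation policy the launcher pod itself may be held back until then; a worker stuck Pending (insufficient nvidia.com/gpu or RDMA resources) therefore blocks the whole job. Check kubectl describe pod on the pending worker.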
-np / slotsPerWorker mismatch. If -np does not equal workers * slotsPerWorker, ranks are over- or under-subscribed and mpirun errors out or hangs.
Asymmetric / degraded link. GDR engaged but tail busbw well below the rest of the fleet points to one bad cable/port. Cross-check with fabric bring-up (ibdiagnet) and gate the node via GPU health gating.
Binaries missing. all_reduce_perf: No such file means the image lacks prebuilt nccl-tests; build with make MPI=1 in an init container.
Collective stalls, never returns. A rank wedges with no XID; see runbook: NCCL hang.
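
Two of the failure modes above leave a recognisable string in the launcher log, so a first-pass triage can be scripted. The sketch below maps those strings to a short label. The function name triage_log and the labels are hypothetical, only the signatures quoted in this section are matched, and a "no-known-signature" result does not mean the run is healthy.

```shell
# triage_log LOGFILE
# Prints a coarse label for the first known failure signature found.
# Only strings quoted in this recipe are matched; everything else falls
# through to "no-known-signature".
triage_log() {
  local log="$1"
  if grep -q 'all_reduce_perf: No such file' "$log"; then
    echo "binaries-missing"
  elif grep -Eq 'GPU Direct RDMA Disabled|via NET/Socket' "$log"; then
    echo "tcp-fallback-or-gdr-disabled"
  else
    echo "no-known-signature"
  fi
}
```

Pending workers, rank mismatches, degraded links, and hangs still need the manual checks listed above, since they do not produce a single stable log line.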

References

nccl-tests: https://github.com/NVIDIA/nccl-tests
nccl-tests performance (algbw/busbw): https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
NCCL environment variables (NCCL_NET_GDR_LEVEL, NCCL_IB_HCA, NCCL_DEBUG): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Kubeflow MPI Operator (v2beta1 MPIJob): https://github.com/kubeflow/mpi-operator
MPIJob via Kueue (Launcher/Worker, slotsPerWorker): https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/mpijobs/
NVIDIA GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/

¹ busbw vs algbw and the 2(n-1)/n all-reduce correction factor: NCCL-tests PERFORMANCE.md, https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md
² GDRDMA transport vs "GPU Direct RDMA Disabled" log lines under NCCL_DEBUG=INFO: NCCL issues #1510/#1939 and NVIDIA env docs, https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html