Runbook: NCCL hang / collective stall

Scope: diagnose and clear a NCCL hang / collective stall, where step time goes to infinity with no XID and every rank in the job blocks.

Run this when a multi-node job wedges with no XID: step time goes to infinity, GPUs report high utilization but do no useful work, and a collective never returns. Severity: job-down; every rank in the job is blocked on one stalled rank or one bad fabric link. The job is stuck, not crashed, so there is no stack trace to read.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

A hang is distinct from a crash: every rank waits in the same all_reduce/all_gather because one rank never arrived or a transport silently fell back / died. Fabric background is in networking fabric; NCCL env and GDR in distributed training and performance tuning; a hardware fault on the offending node escalates to the GPU-fault runbook.

Trigger

Step time → infinity: training/throughput flatlines, the job makes no progress, but processes are alive (no exit, no XID).
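A collective times out (PyTorch's NCCL watchdog logs "Watchdog caught collective operation timeout" when async error handling is on: NCCL_ASYNC_ERROR_HANDLING, renamed TORCH_NCCL_ASYNC_ERROR_HANDLING in recent PyTorch releases), or all ranks report waiting on the same op.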
GPUs show high utilization but zero useful work: spinning in a collective, not computing.
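
The first trigger can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the training loop records a wall-clock timestamp at the end of every step; the function name `is_stalled`, the 10x factor, and the five-step baseline window are illustrative choices, not part of NCCL or PyTorch.

```python
import statistics

def is_stalled(step_end_times, now, factor=10.0, min_steps=5):
    """Return True when the time since the last completed step exceeds
    `factor` times the median of the most recent step durations.

    step_end_times: ascending wall-clock timestamps in seconds, one per step.
    now: current wall-clock time in seconds.
    """
    if len(step_end_times) < min_steps + 1:
        return False  # not enough history to define a baseline
    recent = step_end_times[-(min_steps + 1):]
    durations = [later - earlier for earlier, later in zip(recent, recent[1:])]
    baseline = statistics.median(durations)
    return (now - step_end_times[-1]) > factor * baseline
```

A stall flagged this way still needs the pre-checks: the detector only says that progress stopped, not whether the cause is a hang or a hardware fault.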

Pre-checks

Confirm it is a hang, not a hardware fault. Scan for XID first; a fatal XID means this is really a GPU fault → the GPU-fault runbook:
```
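# NODES is the job's host list (defined under Procedure); dmesg may need sudo.
for n in $NODES; do echo "== $n"; ssh "$n" 'dmesg -T | grep -i "NVRM: Xid" || echo "no Xid logged"'; done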
```
Confirm Fabric Manager is active on every participating node (it applies to NVSwitch-based systems). A single failed nvidia-fabricmanager service stalls the whole NVLink ring and presents exactly as a hang (the GPU software stack):
```
for n in $NODES; do printf '%s: ' "$n"; ssh "$n" 'systemctl is-active nvidia-fabricmanager'; done
```
Confirm InfiniBand/RoCE ports are Active on the rails the job uses (ibstat | grep "State:") (networking fabric).
Note the last good step / checkpoint so recovery is bounded (the checkpoint-recovery runbook).
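
When the pre-checks run across many nodes, it helps to reduce the raw command output to a short list of findings. The sketch below assumes the text of `dmesg -T` and `ibstat` has already been collected per node; the helper names `find_xids` and `inactive_ib_ports` are hypothetical, and the regular expression targets the usual `NVRM: Xid (PCI:...): <code>` line shape, which should be confirmed against the driver version in use.

```python
import re

# Typical kernel log shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")

def find_xids(dmesg_text):
    """Return (pci_address, xid_code) pairs found in dmesg output."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(dmesg_text)]

def inactive_ib_ports(ibstat_text):
    """Return (ca_name, port_number, state) for every port whose State is not Active."""
    bad, ca, port = [], None, None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            ca = line.split("'")[1]          # CA 'mlx5_0' -> mlx5_0
        elif re.fullmatch(r"Port \d+:", line):
            port = int(line[5:-1])           # Port 1: -> 1
        elif line.startswith("State:"):      # skips "Physical state:"
            state = line.split(":", 1)[1].strip()
            if state != "Active":
                bad.append((ca, port, state))
    return bad
```

Any XID returned by `find_xids` diverts to the GPU-fault runbook; any entry from `inactive_ib_ports` on a rail the job uses points at the fabric before NCCL itself.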

Flow

flowchart TB
    A["Job wedged (no XID)"] --> B["NCCL_DEBUG=INFO: read transport"]
    B -->|"TCP fallback, IB expected"| C["Fix NCCL_IB_HCA / IFNAME, ACS off"]
    B -->|"GDRDMA, transport ok"| D["Fabric health: ibdiagnet + FM"]
    D -->|"bad link / port"| E["Drain offending node"]
    D -->|"fabric clean"| F["Find straggler / dead rank"]
    C --> G["Restart from last checkpoint"]
    E --> H["GPU fault path"]
    F --> G
    G --> I["Verify: nccl-tests busbw"]
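
The branching in the flowchart can also be written as a small decision function. This is a sketch under stated assumptions: it reads the `via NET/...` channel setup lines that NCCL prints with NCCL_DEBUG=INFO, whose exact format varies between NCCL versions and network plugins, and the function names and returned action labels are hypothetical.

```python
import re

# Matches channel setup lines such as "... [send] via NET/IB/0/GDRDMA".
VIA_RE = re.compile(r"via NET/(IB|Socket)/\d+(/GDRDMA)?")

def classify_transport(nccl_log):
    """Return the set of inter-node transports named in channel setup lines."""
    kinds = set()
    for match in VIA_RE.finditer(nccl_log):
        if match.group(1) == "Socket":
            kinds.add("socket")
        elif match.group(2):
            kinds.add("ib_gdrdma")
        else:
            kinds.add("ib")
    return kinds

def next_action(transports, fabric_clean=None, ib_expected=True):
    """Walk the flowchart: transport first, then fabric health, then straggler."""
    if not transports:
        return "collect_debug_logs"        # no channel lines seen yet
    if ib_expected and "socket" in transports:
        return "fix_transport_config"      # TCP fallback where IB was expected
    if fabric_clean is None:
        return "check_fabric_health"       # ibdiagnet + Fabric Manager
    if not fabric_clean:
        return "drain_offending_node"
    return "find_straggler"
```

Plain IB without GDRDMA is treated here as "transport ok" and passed on to the fabric check; whether that is acceptable depends on what the cluster is expected to deliver.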

Procedure

NODES="gpu-12 gpu-13 gpu-14 gpu-15"   # ranks in the wedged job

1. Read the transport NCCL actually chose. Re-launch (or inspect logs from) the job with debug on and check whether the IB path came up as [GDRDMA] or silently fell back to TCP sockets. A TCP fallback on an IB cluster is the most common stall cause (performance tuning):
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET <launch cmd> 2>&1 | grep -E "NET/IB|GDRDMA|Socket"
```
Lines showing NET/IB/.../GDRDMA are good. NET/Socket where IB was expected means a misconfigured transport: go to step 4.
2. Check fabric health for missing links, bad ports, or routing inconsistencies on the rails the job uses (networking fabric):
```
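# --pc resets the fabric port counters (confirm against your ibdiagnet version); save any history you need first.
ibdiagnet --pc -r --get_phy_info     # writes report to /var/tmp/ibdiagnet2/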
```
Inspect the report for link-down, symbol-error, or routing/credit-loop warnings on the affected switch ports.
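3. Find the straggler or dead rank that never enters the collective. A monitored barrier names the rank that never arrives; it has to run inside the job (for example on a debug relaunch) and is supported only on the Gloo backend, so an NCCL job needs a side Gloo group. Cross-reference the result with per-node GPU activity:
```
import datetime
import torch.distributed as dist
# monitored_barrier is Gloo-only; on an NCCL job create a side Gloo group first.
gloo_pg = dist.new_group(backend="gloo")
dist.monitored_barrier(group=gloo_pg, timeout=datetime.timedelta(seconds=60),
                       wait_all_ranks=True)  # rank 0 raises, naming the late rank(s)
```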
Map the late rank to its node; if that node shows an XID or off-bus GPU, divert to the GPU-fault runbook.
4. Verify the transport config when step 1 showed a fallback. ACS must be off for P2P/GDR, and the HCA / socket interface selectors must match the cluster's RDMA NICs (performance tuning, networking fabric):
```
# Run per host (n is one entry of $NODES); lspci needs root to show ACSCtl.
ssh "$n" 'sudo lspci -vvv | grep -i "ACSCtl"'   # SrcValid+ means ACS is enabled on that bridge
# In the job env, pin the real RDMA HCAs and exclude non-fabric host interfaces:
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_SOCKET_IFNAME=^docker0,lo
export NCCL_NET_GDR_LEVEL=SYS
```
Disable ACS on the affected hosts (BIOS or setpci) before retrying. With ACS on, GPU-to-NIC peer traffic is redirected through the CPU root complex, which can cut GDRDMA performance sharply or hang it.
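
Mapping a late rank to its node, as step 3 asks, is simple arithmetic once the rank layout is known. The sketch below assumes two things that are not stated in the runbook: each rank logs the last step at which it entered the collective (a hypothetical per-rank heartbeat), and ranks fill nodes in order with a fixed number of GPUs per node. Both function names are illustrative.

```python
def find_stragglers(last_step_by_rank):
    """Return ranks whose last recorded step is behind the most advanced rank."""
    if not last_step_by_rank:
        return []
    front = max(last_step_by_rank.values())
    return sorted(rank for rank, step in last_step_by_rank.items() if step < front)

def rank_to_node(rank, nodes, gpus_per_node=8):
    """Map a global rank to its host, assuming ranks fill nodes in list order."""
    return nodes[rank // gpus_per_node]
```

With the four nodes from this procedure and eight GPUs each, a late rank 19 maps to the third host in the list. If the launcher places ranks round-robin instead of by slot, the mapping function needs to change accordingly.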

Verification

A 2-node nccl-tests all_reduce_perf run across the previously stalled nodes brings busbw back to the expected line rate, with no TCP fallback in the log (workload recipes):
```
# Needs nccl-tests built with MPI=1; 16 ranks, one GPU each, across both nodes.
mpirun -np 16 -H gpu-12:8,gpu-13:8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```
The restarted job's step time returns to baseline and progresses past the prior wedge point (observability).
NCCL_DEBUG=INFO now shows NET/IB/.../GDRDMA (or the intended RoCE/IB transport), not socket fallback.
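
To make "back to line rate" a pass/fail check, the busbw figure can be compared against an expected value. The sketch below uses the all_reduce bus bandwidth definition from the nccl-tests performance notes (algbw scaled by 2(n-1)/n); the expected bandwidth and the 90% tolerance are assumptions to be replaced with the cluster's own baseline, and the function names are hypothetical.

```python
def allreduce_busbw_gbps(size_bytes, time_s, n_ranks):
    """Bus bandwidth in GB/s for all_reduce: algbw * 2 * (n - 1) / n."""
    algbw = size_bytes / time_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks

def busbw_recovered(measured_gbps, expected_gbps, tolerance=0.9):
    """True when the measured busbw is within `tolerance` of the expected value."""
    return measured_gbps >= tolerance * expected_gbps
```

A run that has fallen back to TCP sockets typically lands far below the tolerance band, so the check separates the two cases even with a rough expected value.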

Rollback

A hang is not a config change to revert; recovery is:

Drain the offending node if it is the root cause (bad port, dead rank, FM failure) and route it into the GPU-fault path (the GPU-fault runbook). Resume the job at reduced world-size or after replacement.
Restart from the last checkpoint. Kill the wedged job cleanly and resume from the last good step; never let a hung job hold the allocation (the checkpoint-recovery runbook).
If the transport fix was env-only (NCCL_IB_HCA / NCCL_SOCKET_IFNAME), bake it into the launch template so the next run starts correct; an ACS change lives in BIOS or a boot-time setpci script, not in the job env (SRE and MLOps practices).
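
One way to bake the transport settings into a launch template is to build the job environment in code instead of relying on per-shell exports. This is a minimal sketch, assuming the launcher is started from Python; `nccl_launch_env` is a hypothetical helper, and the HCA and interface names are the placeholders used in the procedure, not values to copy blindly.

```python
import os

def nccl_launch_env(hcas, excluded_ifaces, gdr_level="SYS", base=None):
    """Return a copy of the environment with the NCCL transport settings applied."""
    env = dict(os.environ if base is None else base)
    env["NCCL_IB_HCA"] = ",".join(hcas)
    # A leading ^ tells NCCL to exclude the listed interfaces.
    env["NCCL_SOCKET_IFNAME"] = "^" + ",".join(excluded_ifaces)
    env["NCCL_NET_GDR_LEVEL"] = gdr_level
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the launcher, which keeps the settings versioned alongside the launch script.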

the GPU-fault runbook: GPU fault / RMA (when the straggler is a dead/faulted GPU).
the MFU-regression runbook: MFU regression (a partial stall shows up as MFU loss, not a full hang).
the checkpoint-recovery runbook: Checkpoint recovery (resume the killed job).
the driver-upgrade runbook: Driver upgrade (post-upgrade fabric/transport regressions).
operational runbooks: Operational runbooks index.

References

NCCL networking troubleshooting (transports, GDR, fallback): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting/networking_troubleshooting.html
NCCL environment variables (NCCL_IB_HCA, NCCL_SOCKET_IFNAME, NCCL_NET_GDR_LEVEL, NCCL_DEBUG): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
ibdiagnet (InfiniBand fabric diagnostic): https://docs.nvidia.com/networking/display/ibdiagnetusermanualv221
NVIDIA Fabric Manager user guide (NVLink domain, service health): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
torch.distributed monitored_barrier (straggler identification): https://docs.pytorch.org/docs/stable/distributed.html
nccl-tests (all_reduce_perf, busbw): https://github.com/NVIDIA/nccl-tests