
Runbook: fabric manager failure

Scope: nvidia-fabricmanager is inactive or aborting on an NVSwitch system (HGX/DGX 8-GPU baseboard, GB200/GB300 NVL72), so GPUs do not form their NVLink domain and collectives degrade to PCIe. Diagnose, restore the version-matched Fabric Manager, and prove NVLink before readmitting the node.

The commands below are reference templates built on real APIs. Nothing here was executed on hardware; pin versions to your driver branch and validate on one node before fleet use.

This is the failure-mode counterpart to the Fabric Manager reference (what FM is, how it is versioned) and the fabric bring-up / benchmarking procedure (how to prove NVLink and collectives). The most common root cause is a driver upgrade that left FM (or libnvidia-nscq) on the old version; that ordering belongs to the driver-upgrade runbook.

Trigger

Open this runbook when any of these is observed on an NVSwitch node:

systemctl is-active nvidia-fabricmanager returns anything but active: the service is inactive, failed, or started then exited.
New CUDA jobs on the node fail at init with cudaErrorSystemNotReady. Per NVIDIA, CUDA initialisation fails with this error if an application launches before FM has initialised the system, or if FM fails to initialise it.¹
nvidia-smi nvlink --status shows links inactive, or nvidia-smi topo --matrix shows GPU pairs connected over PHB/SYS (PCIe host bridge / system) instead of NV#.⁵
Collectives that assumed NVLink slow down sharply (NCCL falls back to PCIe), and nccl-tests busbw collapses to a fraction of the NVLink figure (the NCCL-hang runbook covers the stall variant).

Quiet partial case: a single GPU fails to register. NVIDIA: "If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available for non-peer-to-peer use cases."¹ The GPU still runs; it just has no NVLink peers. Treat reduced peer count as a fault even when nvidia-smi lists every GPU.
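One way to catch that quiet case is to count NVLink peers per GPU in the topology matrix instead of counting GPUs. The sketch below is illustrative and untested on hardware: `nvlink_peer_gaps` is a hypothetical helper, and it assumes GPU rows in `nvidia-smi topo --matrix` begin with `GPU<n>` and that every GPU pair on a healthy NVSwitch baseboard shows an `NV#` cell.

```shell
# Hypothetical helper: read `nvidia-smi topo --matrix` output on stdin and
# report GPUs with fewer NV# peers than (number of GPU rows - 1).
# Assumption: GPU rows start with "GPU<n>"; NVLink cells look like "NV18".
nvlink_peer_gaps() {
  awk '
    $1 ~ /^GPU[0-9]+$/ {
      n = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^NV[0-9]+$/) n++
      rows++; name[rows] = $1; peers[rows] = n
    }
    END {
      if (rows == 0) { print "no GPU rows found"; exit 2 }
      bad = 0
      for (r = 1; r <= rows; r++)
        if (peers[r] < rows - 1) {
          printf "%s: %d of %d NVLink peers\n", name[r], peers[r], rows - 1
          bad = 1
        }
      exit bad
    }'
}

# Usage on the node (exit status 0 means every GPU sees all of its peers):
#   nvidia-smi topo --matrix | nvlink_peer_gaps
```

Counting peers per row avoids parsing the header line, so NIC and affinity columns are ignored as long as they never contain an `NV#` value.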

Pre-checks

Confirm the node actually has NVSwitches before touching FM. FM is only for NVSwitch hardware (HGX/DGX baseboards, NVL72); PCIe-attached datacenter cards never run it.¹ On a non-NVSwitch box this runbook does not apply.

Service state and recent journal. FM aborting on its compatibility check is the signature event here:
```
systemctl is-active nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager --no-pager
sudo journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 100
```
A daemon that starts then exits, or a journal line about an incompatible driver stack, points at a version mismatch (pre-check 3, the version match). During initialization the FM service checks the currently loaded kernel driver stack version for compatibility and, if that version is not compatible, aborts the process.¹
FM log. The default log carries the per-switch / per-GPU initialization detail the journal summarises:¹
```
sudo tail -n 100 /var/log/fabricmanager.log
```
Healthy startup logs fabric initialization completing with all expected GPUs and NVSwitches registered; a failure names the GPU or switch that failed to register.
Version match: driver vs Fabric Manager vs NSCQ. This is the decisive check. The nvidia-fabricmanager package and the libnvidia-nscq library must both match the installed driver version.²⁵
```
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
# Debian / Ubuntu:
dpkg -l 'nvidia-fabricmanager*' 'libnvidia-nscq*' 2>/dev/null
# RHEL:
# rpm -qa 'nvidia-fabric-manager*' 'libnvidia-nscq*'
```
The driver version and the FM / NSCQ package versions must agree. If they do not, that mismatch is the bug; NVLink and NCCL are not. (NSCQ is the stable driver API DCGM uses to monitor NVSwitch devices; it is versioned against the driver the same way FM is.²)
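The comparison can be scripted so that it yields a pass/fail result instead of relying on an eyeball check. This is a minimal sketch for Debian / Ubuntu only. The helper names are hypothetical, and it assumes package versions have the form `[epoch:]<driver_version>-<revision>`, which should be verified against your repository.

```shell
# Hypothetical helpers; assumes package versions look like
# "[epoch:]<driver_version>-<revision>". Verify against your repository.
pkg_upstream_version() {
  v="${1#*:}"                  # drop an epoch prefix such as "1:" if present
  printf '%s\n' "${v%%-*}"     # drop the packaging revision after the first "-"
}

versions_match() {   # usage: versions_match <driver_version> <package_version>
  [ "$(pkg_upstream_version "$2")" = "$1" ]
}

# Debian / Ubuntu only: compare every installed FM / NSCQ package to the driver.
check_fm_stack() {
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-fabricmanager*' 'libnvidia-nscq*' 2>/dev/null |
  (
    rc=0; seen=0
    while read -r pkg ver; do
      [ -n "$ver" ] || continue        # skip names with no installed version
      seen=1
      if versions_match "$driver" "$ver"; then
        echo "OK       $pkg $ver"
      else
        echo "MISMATCH $pkg $ver (driver $driver)"; rc=1
      fi
    done
    [ "$seen" -eq 1 ] || { echo "no FM / NSCQ packages installed"; rc=1; }
    exit "$rc"
  )
}

# Usage on the node: check_fm_stack || echo "reinstall the FM stack for this driver"
```

Treating "no packages found" as a failure keeps a node with FM never installed from passing the check silently.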
Rule out a masked-dead fabric. If fabricmanager.cfg has FM_STAY_RESIDENT_ON_FAILURES=1, the daemon can show active while the fabric is dead: setting it to 1 keeps FM running despite NVSwitch/GPU config failures, but the system remains uninitialized and CUDA launches fail.¹ A green systemctl status is not proof; confirm with the NVLink/CUDA checks in Verification.
```
grep -E '^(FM_STAY_RESIDENT_ON_FAILURES|FABRIC_MODE)' /usr/share/nvidia/nvswitch/fabricmanager.cfg
```
For bare-metal training/inference clusters expect FABRIC_MODE=0 (bare-metal / full passthrough).¹
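The same check can gate later steps, so that nothing downstream trusts `systemctl` on a node where the flag is set. A small sketch with hypothetical helper names, assuming plain `KEY=VALUE` lines as matched by the grep above:

```shell
# Hypothetical helpers; assumes plain KEY=VALUE lines, as in the grep above.
fm_cfg_value() {        # usage: fm_cfg_value <key> <config_file>
  awk -F= -v key="$1" '
    $1 == key { v = $2; gsub(/[ \t\r]/, "", v); val = v }
    END { print val }' "$2"
}

fm_masks_failures() {   # succeeds when FM_STAY_RESIDENT_ON_FAILURES is set to 1
  [ "$(fm_cfg_value FM_STAY_RESIDENT_ON_FAILURES "$1")" = "1" ]
}

# Usage: warn before trusting "active" from systemctl.
#   cfg=/usr/share/nvidia/nvswitch/fabricmanager.cfg
#   if fm_masks_failures "$cfg"; then echo "systemctl status is not proof on this node"; fi
```

Commented-out lines do not match because the key is compared exactly against the text before the first `=`.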
Confirm this is FM, not IB. On NVL72, intra-node NVLink and the InfiniBand/RoCE scale-out fabric are separate planes; a clean IB fabric with dead NVLink is exactly the FM case. (Plane separation and the IB checks live in fabric bring-up.)

Procedure

Cordon and drain before mutating the node; never reinstall FM or restart the service under live jobs. NODE is the Kubernetes node name (Slurm equivalent in step 1).

```
NODE=gpu-07.dc1.internal
DRIVER_BRANCH=<branch>     # the branch the running driver is on, e.g. the pinned LTS branch
```

Cordon and drain so the scheduler stops placing work and running pods evict cleanly:

```
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Slurm: scontrol update nodename="$NODE" state=drain reason="fabric manager failure"
```
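A drain evicts scheduled work, but processes started outside the scheduler can still hold the GPUs. The sketch below checks for leftover compute processes before the service is touched. `assert_gpus_idle` is a hypothetical helper, and it assumes the query prints one line per process and nothing when the GPUs are idle.

```shell
# Hypothetical helper: succeeds only when stdin lists no GPU compute processes.
# Assumption: the query prints one "pid, name" line per process and nothing when idle.
assert_gpus_idle() {
  awk 'NF { print "still on GPU: " $0; busy = 1 } END { exit busy + 0 }'
}

# Usage after the drain returns:
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | assert_gpus_idle
```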

If the only fault is a stopped service (versions already matched in pre-check 3), enable for boot and restart, then re-read state:
```
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager --no-pager
```
The package drops the unit but does not enable or start it, so an un-enabled service silently stays down across reboots.¹ If FM now reaches active (running) and the journal shows fabric init completing, skip to Verification.
If versions are mismatched, reinstall the FM stack matched to the driver branch. Pick the form for the platform; <branch> must equal the driver branch from pre-check 3.¹

Ubuntu / Debian, pre-4th-gen NVSwitch (A100, H100/H200, single combined package):

```
sudo apt-get install -V cuda-drivers-fabricmanager-<branch> libnvidia-nscq-<branch>
```

Ubuntu / Debian, 4th-gen NVSwitch (B200/B300/B100, open driver plus the NVLink5 stack):

```
sudo apt-get install -V nvidia-open-<branch>
sudo apt-get install -V nvlink5-<branch>
```

RHEL 8/9, pre-4th-gen, via the driver module's fm profile:
```
sudo dnf module install nvidia-driver:<branch>/fm
```
Bring libnvidia-nscq to the same version in the same step, or DCGM's NVSwitch monitoring stays broken even once FM is healthy.²
Restart and confirm init. FM re-runs the driver compatibility check at startup, so a now-matched version should pass:¹
```
sudo systemctl enable nvidia-fabricmanager
sudo systemctl restart nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager --no-pager
sudo journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 60
```
If FM still aborts on the compatibility check, the driver and FM are still out of step; recheck dpkg -l against driver_version, and do not force FM_STAY_RESIDENT_ON_FAILURES=1 to paper over it.
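Because a mismatched FM can start and then exit, a single status read right after the restart can catch the daemon before it aborts. A minimal sketch that requires the service to stay active across a short window; `holds_for` is a hypothetical helper, and the 30-second window and 5-second interval are assumptions to tune per site.

```shell
# Hypothetical helper: run a command repeatedly and succeed only if it
# succeeds on every poll for <duration> seconds.
holds_for() {            # usage: holds_for <duration_s> <interval_s> <command...>
  duration="$1"; interval="$2"; shift 2
  elapsed=0
  while :; do
    "$@" || return 1                         # any failed poll fails the check
    [ "$elapsed" -ge "$duration" ] && return 0
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage after the restart (window and interval are assumed values):
#   holds_for 30 5 systemctl is-active --quiet nvidia-fabricmanager \
#     || echo "FM did not stay active; recheck versions"
```

A pass here only shows the daemon stayed up. It is still subject to the FM_STAY_RESIDENT_ON_FAILURES caveat, so the NVLink and collective proofs in Verification remain required.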
NVL72 only: confirm IMEX after FM is up. On multi-node NVLink, FM programs each node's switches but the nvidia-imex service orchestrates the NVLink memory domain across nodes; a clean FM with IMEX down leaves intra-node NVLink working and no cross-node NVLink. Start it after FM and match it to the same driver branch (see the Fabric Manager reference, IMEX section).
```
sudo systemctl status nvidia-imex --no-pager
sudo systemctl restart nvidia-imex     # if not active
nvidia-imex-ctl -c /etc/nvidia-imex/config.cfg
```

Verification

Do not uncordon on systemctl alone (pre-check 4: it can lie). Require a real NVLink/collective proof.

NVLink links up, paths over NVSwitch. Every NVLink should report active; the topology matrix should show NV# between GPUs, not a PCIe fallback:⁵
```
nvidia-smi nvlink --status
nvidia-smi topo --matrix
```
nvidia-smi nvlink --status reports each link's state and, when active, its rated bandwidth (not live throughput). A link still inactive, or a GPU showing fewer links than its NVLink count, means the fabric is not fully formed.⁵
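On a node with many links, the inactive ones are easier to catch with a filter than by reading the full listing. The sketch below is an assumption-laden helper, not NVIDIA tooling: `inactive_links` is a hypothetical name, and it assumes each GPU section starts with a `GPU <n>:` line and that an inactive link carries the word `inactive` on its own line.

```shell
# Hypothetical helper: read `nvidia-smi nvlink --status` output on stdin and
# print every link line marked inactive, prefixed with its GPU.
# Assumption: sections start with "GPU <n>:" and inactive links say "inactive".
inactive_links() {
  awk '
    /^GPU [0-9]+:/ { gpu = $1 " " $2 }
    tolower($0) ~ /inactive/ {
      sub(/^[ \t]+/, "")            # trim the indentation of the link line
      print gpu, $0
      bad = 1
    }
    END { exit bad + 0 }'
}

# Usage (exit status 0 means no inactive link was reported):
#   nvidia-smi nvlink --status | inactive_links
```

This does not detect a GPU that lists fewer links than its NVLink count, so the link count per GPU still needs a check against the expected figure for the platform.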
DCGM diagnostic exercises the link. dcgmi diag includes a PCIe + NVLink plugin that uses NVLink to communicate between GPUs when possible, otherwise PCIe.³ Run a long diagnostic: level 3 is the < 10-min hardware diag on a 4-GPU system, and the plugin is not part of the < 2.5 s level 1.³
```
dcgmi diag -r 3
```
Treat any Fail in the PCIe/NVLink rows as a non-pass; this is the proof the fabric is not just "service active" but actually carrying GPU-to-GPU traffic.
Collective busbw at the NVLink figure. Build and run all_reduce_perf from NVIDIA/nccl-tests and read busbw, not algbw; busbw applies the AllReduce correction so the number reflects hardware utilisation and is comparable to the interconnect peak:⁴⁵
```
# single node, N GPUs in one process:
./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
# multi-node (built with MPI=1), one rank per GPU:
mpirun -np <ranks> -N <gpus_per_node> ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```
On an 8-GPU NVLink node, large-message busbw should approach the NVLink bus bandwidth; a figure stuck near a single-GPU or PCIe ceiling means the fabric is still degraded (re-open Procedure). Set NCCL_DEBUG=INFO to confirm the transport: a NET/Socket line where you expect NVLink/IB is a fallback, not a pass.⁵
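For reference, nccl-tests derives AllReduce busbw from algbw with the factor 2(n-1)/n for n ranks, which is why busbw rather than algbw is the figure to compare against the interconnect. The sketch below reproduces that arithmetic and compares a measured value with a floor. Both helper names are hypothetical, and the floor is a site-specific baseline you record on a known-good node, not a number from this runbook.

```shell
# nccl-tests defines AllReduce bus bandwidth as busbw = algbw * 2 * (n - 1) / n
# for n ranks. Helper names and the floor value are hypothetical.
allreduce_busbw() {      # usage: allreduce_busbw <algbw_GBps> <ranks>
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

meets_floor() {          # usage: meets_floor <measured_GBps> <floor_GBps>
  awk -v m="$1" -v f="$2" 'BEGIN { exit !(m + 0 >= f + 0) }'
}

# Example arithmetic: an algbw of 100 GB/s across 8 ranks.
#   allreduce_busbw 100 8        # prints 175.00
# Gate on a recorded baseline (FLOOR is an assumed, site-specific value):
#   meets_floor "$MEASURED_BUSBW" "$FLOOR" || echo "busbw below baseline; keep node cordoned"
```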

Uncordon only after links are up and a collective hits the expected busbw:

```
kubectl uncordon "$NODE"
# Slurm: scontrol update nodename="$NODE" state=resume
```

Rollback

This runbook restores a degraded plane; the safe state is the node out of service, not a half-fixed fabric admitting jobs.

If FM will not come healthy (still aborting after a version-matched reinstall, or links stay down), leave the node cordoned/drained and escalate. Do not uncordon a node whose NVLink fabric is not formed:
```
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Slurm: scontrol update nodename="$NODE" state=drain reason="fabric down - escalated"
```
If FM broke during a driver upgrade, the clean rollback is the single-variable revert in the driver-upgrade runbook: re-pin the previous driver branch (which restores the matching FM/NSCQ), reboot, and re-validate. Do not hand-patch only FM to chase a driver that itself rolled back.
If dcgmi diag -r 3 fails the GPU/NVLink rows even with FM healthy and versions matched, suspect hardware (a seated-but-degraded board or baseboard fault), not configuration, and divert to the GPU-fault / RMA path (reliability and RAS).
Never set FM_STAY_RESIDENT_ON_FAILURES=1 as a "fix": it keeps the daemon up but leaves the system uninitialized and CUDA launches failing.¹

Fabric Manager: what FM is, version lockstep with the driver, IMEX for multi-node NVLink.
fabric bring-up / benchmarking: full NVLink + collective validation procedure (the proofs used here).
driver-upgrade runbook: correct upgrade ordering so FM never falls behind the driver (root-cause prevention).
NCCL-hang runbook: collective stall variant when the fabric is up but a collective hangs.
reliability and RAS: escalation when a node fails diag and the fault is hardware.
operational runbooks: runbook index.

References

NVIDIA Fabric Manager User Guide — service function, driver-compatibility abort at init, cudaErrorSystemNotReady, package/systemctl commands, FM_STAY_RESIDENT_ON_FAILURES / FABRIC_MODE, config/log paths, GPU-fails-to-register behaviour: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NSCQ packaging for Debian (libnvidia-nscq-<branch>, "stable driver API used by DCGM for monitoring NVSwitch devices"): https://github.com/NVIDIA/apt-packaging-libnvidia-nscq
NVIDIA DCGM Diagnostics — run levels and the PCIe + NVLink plugin: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
NVIDIA/nccl-tests — all_reduce_perf, -b/-e/-f/-g flags, mpirun multi-node launch, busbw vs algbw: https://github.com/NVIDIA/nccl-tests
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

¹ NVIDIA Fabric Manager User Guide: NVSwitch-only scope; driver-compatibility check that aborts on an incompatible loaded driver stack; cudaErrorSystemNotReady when the fabric is uninitialised; install package names (cuda-drivers-fabricmanager-<branch>, nvidia-open-<branch>, nvlink5-<branch>, nvidia-driver:<branch>/fm); systemd unit not enabled/started by the package; FM_STAY_RESIDENT_ON_FAILURES / FABRIC_MODE semantics; config /usr/share/nvidia/nvswitch/fabricmanager.cfg; log /var/log/fabricmanager.log; "If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available for non-peer-to-peer use cases." https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
² NVIDIA apt-packaging-libnvidia-nscq: Debian package libnvidia-nscq-<branch> (first dot-delimited driver-version field); "NVSwitch Configuration and Query (NSCQ) library provides a stable driver API used by DCGM for monitoring NVSwitch devices"; versioned to match the driver. https://github.com/NVIDIA/apt-packaging-libnvidia-nscq
³ NVIDIA DCGM Diagnostics: dcgmi diag -r <level>; run levels 1 (< 2.5 s), 2 (< 2.5 min), 3 (< 10 min), 4 (< 45 min) on a 4-GPU system; PCIe + NVLink plugin that "will use NvLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe." https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
⁴ NVIDIA/nccl-tests: all_reduce_perf; -b minbytes, -e maxbytes, -f stepfactor, -g ngpus per thread; multi-node via mpirun -np <ranks> -N <gpus_per_node> ... -g 1 (built with MPI=1); output reports algbw and busbw. https://github.com/NVIDIA/nccl-tests
⁵ This KB, fabric bring-up / benchmarking page: nvidia-smi nvlink --status reports per-link state and rated bandwidth; nvidia-smi topo --matrix shows NV# vs PHB/SYS; busbw is the comparable figure for AllReduce; NCCL_DEBUG=INFO reveals transport fallback; FM/NSCQ must match the driver. fabric-bringup-benchmarking.md