Runbook: fabric manager failure
Scope: nvidia-fabricmanager is inactive or aborting on an NVSwitch system (HGX/DGX 8-GPU baseboard, GB200/GB300 NVL72), so GPUs do not form their NVLink domain and collectives degrade to PCIe. Diagnose, restore the version-matched Fabric Manager, and prove NVLink before readmitting the node.
The commands here are reference templates built on real APIs. Nothing here was executed on hardware; pin versions to your driver branch and validate on one node before fleet use.
This is the failure-mode counterpart to the Fabric Manager reference (what FM is, how it is versioned) and the fabric bring-up / benchmarking procedure (how to prove NVLink and collectives). The most common root cause is a driver upgrade that left FM (or libnvidia-nscq) on the old version; that ordering belongs to the driver-upgrade runbook.
Trigger
Open this runbook when any of these is observed on an NVSwitch node:
- systemctl is-active nvidia-fabricmanager returns anything but active: the service is inactive, failed, or started then exited.
- New CUDA jobs on the node fail at init with cudaErrorSystemNotReady. NVIDIA: if an application launches before FM has initialised the system, or FM fails to initialise it, CUDA initialisation fails with cudaErrorSystemNotReady.[1]
- nvidia-smi nvlink --status shows links inactive, or nvidia-smi topo --matrix shows GPU pairs connected over PHB/SYS (PCIe host bridge / system) instead of NV#.[5]
- Collectives that assumed NVLink slow down sharply (NCCL falls back to PCIe) and nccl-tests busbw collapses to a fraction of the NVLink figure (NCCL-hang runbook covers the stall variant).
Quiet partial case: a single GPU fails to register. NVIDIA: "If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available for non-peer-to-peer use cases."[1] The GPU still runs; it just has no NVLink peers. Treat reduced peer count as a fault even when nvidia-smi lists every GPU.
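The topology trigger and the quiet partial case can both be caught by counting GPU-to-GPU cells in the topology matrix that are not NV#. The sketch below is a minimal helper, not a validated tool: the function name non_nvlink_cells is hypothetical, and it assumes the usual matrix layout (a header row of GPU labels, then one row per GPU with X on the diagonal) and that any terminal colour codes can simply be stripped.

```shell
#!/usr/bin/env bash
# Count GPU-to-GPU cells in `nvidia-smi topo --matrix` output (stdin) that are
# neither the diagonal "X" nor an NVLink entry (NV1, NV18, ...).
# Assumption: header row lists GPU0..GPUn-1 first; each GPU row repeats that order.
non_nvlink_cells() {
  awk '
    { gsub(/\033\[[0-9;]*m/, "") }          # drop ANSI colour codes if present
    n == 0 {                                # still looking for the header row
      for (i = 1; i <= NF; i++) if ($i ~ /^GPU[0-9]+$/) n++
      next
    }
    $1 ~ /^GPU[0-9]+$/ {                    # data row: fields 2..n+1 are GPU columns
      for (i = 2; i <= n + 1; i++)
        if ($i != "X" && $i !~ /^NV[0-9]+$/) bad++
    }
    END { print bad + 0 }                   # each bad pair is counted twice (symmetric matrix)
  '
}
```

On a node, nvidia-smi topo --matrix | non_nvlink_cells printing anything other than 0 would mean at least one GPU pair is on PHB/SYS or another non-NVLink path, which is a reason to open this runbook even if every GPU is listed.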
Pre-checks
Confirm the node actually has NVSwitches before touching FM. FM is only for NVSwitch hardware (HGX/DGX baseboards, NVL72); PCIe-attached datacenter cards never run it.[1] On a non-NVSwitch box this runbook does not apply.
1. Service state and recent journal. FM aborting on its compatibility check is the signature event here:

       systemctl is-active nvidia-fabricmanager
       sudo systemctl status nvidia-fabricmanager --no-pager
       sudo journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 100

   A daemon that starts then exits, or a journal line about an incompatible driver stack, points at a version mismatch (Procedure step 3). During initialization the FM service checks the currently loaded kernel driver stack version for compatibility and, if that version is not compatible, aborts the process.[1]

2. FM log. The default log (/var/log/fabricmanager.log) carries the per-switch / per-GPU initialization detail the journal summarises.[1] A healthy startup logs fabric initialization completing with all expected GPUs and NVSwitches registered; a failure names the GPU or switch that failed to register.

3. Version match: driver vs Fabric Manager vs NSCQ. This is the decisive check. The nvidia-fabricmanager package and the libnvidia-nscq library must both match the installed driver version.[2][5]

       nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
       # Debian / Ubuntu:
       dpkg -l 'nvidia-fabricmanager*' 'libnvidia-nscq*' 2>/dev/null
       # RHEL:
       # rpm -qa 'nvidia-fabric-manager*' 'libnvidia-nscq*'

   If the driver version and the FM / NSCQ package versions disagree, that mismatch is the bug, not NVLink and not NCCL. (NSCQ is the stable driver API DCGM uses to monitor NVSwitch devices; it is versioned against the driver the same way FM is.[2])

4. Rule out a masked-dead fabric. If fabricmanager.cfg has FM_STAY_RESIDENT_ON_FAILURES=1, the daemon can show active while the fabric is dead: setting it to 1 keeps FM running despite NVSwitch/GPU config failures, but the system remains uninitialized and CUDA launches fail.[1] A green systemctl status is not proof; confirm with the NVLink/CUDA checks in Verification. For bare-metal training/inference clusters expect FABRIC_MODE=0 (bare-metal / full passthrough).[1]

5. Confirm this is FM, not IB. On NVL72, intra-node NVLink and the InfiniBand/RoCE scale-out fabric are separate planes; a clean IB fabric with dead NVLink is exactly the FM case. (Plane separation and the IB checks live in fabric bring-up.)
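Pre-checks 3 and 4 reduce to string comparisons and one config lookup, so they can be scripted. The helpers below are a sketch under stated assumptions: the function names are hypothetical, package versions are assumed to carry a packaging revision after a hyphen (for example an upstream version followed by -1) and no epoch, and fabricmanager.cfg is assumed to hold plain KEY=value lines.

```shell
#!/usr/bin/env bash
# Branch = first dot-delimited field of a driver version string.
branch_of() { printf '%s\n' "${1%%.*}"; }

# Strip a packaging revision ("<upstream>-<rev>") so a package version can be
# compared with the driver version. Assumption: no epoch prefix is present.
upstream_of() { printf '%s\n' "${1%%-*}"; }

# Succeed only when driver, FM and NSCQ versions are the same non-empty string.
versions_match() { [ -n "$1" ] && [ "$1" = "$2" ] && [ "$1" = "$3" ]; }

# Print the value of KEY from a KEY=value file; commented lines do not match,
# and the last assignment wins. Usage: cfg_value KEY FILE
cfg_value() {
  awk -F= -v k="$1" '$1 == k { v = $2 } END { print v }' "$2"
}
```

One way to use them: read the driver version with the nvidia-smi query from pre-check 3, read each package version with dpkg-query -W -f='${Version}\n', pass the latter through upstream_of, and call versions_match on the three values. For pre-check 4, cfg_value FM_STAY_RESIDENT_ON_FAILURES /usr/share/nvidia/nvswitch/fabricmanager.cfg printing 1 means a green service status cannot be trusted on that node.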
Procedure
Cordon and drain before mutating the node; never reinstall FM or restart the service under live jobs. NODE is the Kubernetes node name (Slurm equivalent in step 1).
    NODE=gpu-07.dc1.internal
    DRIVER_BRANCH=<branch>   # the branch the running driver is on, e.g. the pinned LTS branch
1. Cordon and drain so the scheduler stops placing work and running pods evict cleanly (kubectl cordon and kubectl drain on Kubernetes; on Slurm, drain the node).

2. If the only fault is a stopped service (versions already matched in pre-check 3), enable for boot and restart, then re-read state:

       sudo systemctl enable nvidia-fabricmanager
       sudo systemctl restart nvidia-fabricmanager
       sudo systemctl status nvidia-fabricmanager --no-pager

   The package drops the unit but does not enable or start it, so an un-enabled service silently stays down across reboots.[1] If FM now reaches active (running) and the journal shows fabric init completing, skip to Verification.

3. If versions are mismatched, reinstall the FM stack matched to the driver branch. Pick the form for the platform; <branch> must equal the driver branch from pre-check 3.[1]
   - Ubuntu / Debian, pre-4th-gen NVSwitch (A100, H100/H200, single combined package): cuda-drivers-fabricmanager-<branch>
   - Ubuntu / Debian, 4th-gen NVSwitch (B200/B300/B100, open driver plus the NVLink5 stack): nvidia-open-<branch> and nvlink5-<branch>
   - RHEL 8/9, pre-4th-gen, via the driver module's fm profile: nvidia-driver:<branch>/fm

   Bring libnvidia-nscq to the same version in the same step, or DCGM's NVSwitch monitoring stays broken even once FM is healthy.[2]

4. Restart and confirm init. FM re-runs the driver compatibility check at startup, so a now-matched version should pass:[1]

       sudo systemctl enable nvidia-fabricmanager
       sudo systemctl restart nvidia-fabricmanager
       sudo systemctl status nvidia-fabricmanager --no-pager
       sudo journalctl -u nvidia-fabricmanager -b --no-pager | tail -n 60

   If FM still aborts on the compatibility check, the driver and FM are still out of step; recheck dpkg -l against driver_version, and do not force FM_STAY_RESIDENT_ON_FAILURES=1 to paper over it.

5. NVL72 only: confirm IMEX after FM is up. On multi-node NVLink, FM programs each node's switches but the nvidia-imex service orchestrates the NVLink memory domain across nodes; a clean FM with IMEX down leaves intra-node NVLink working but no cross-node NVLink. Start it after FM and match it to the same driver branch (Fabric Manager, IMEX).
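Steps 1 and 3 can be sketched as dry-run printers that emit the commands for review instead of running them. The package and module names are the ones listed in the Fabric Manager User Guide entry in the references; everything else is an assumption to adapt: the function names and form labels are hypothetical, the kubectl drain flags and timeout are common choices rather than values from this runbook, and the Slurm node name is assumed to equal the name passed in.

```shell
#!/usr/bin/env bash
# Print (do not run) the drain commands for a node. Usage: drain_cmds NODE
drain_cmds() {
  printf 'kubectl cordon %s\n' "$1"
  printf 'kubectl drain %s --ignore-daemonsets --delete-emptydir-data --timeout=600s\n' "$1"
  # Slurm equivalent; the node name may differ from the Kubernetes one.
  printf 'scontrol update NodeName=%s State=DRAIN Reason="fabric manager failure"\n' "$1"
}

# Print (do not run) the install command for one platform form from step 3.
# Usage: fm_install_cmd FORM BRANCH   (FORM labels are hypothetical)
fm_install_cmd() {
  case "$1" in
    deb-legacy)  # pre-4th-gen NVSwitch on Ubuntu / Debian, plus matching NSCQ
      echo "sudo apt-get install -y cuda-drivers-fabricmanager-$2 libnvidia-nscq-$2" ;;
    deb-nvlink5) # 4th-gen NVSwitch on Ubuntu / Debian
      echo "sudo apt-get install -y nvidia-open-$2 nvlink5-$2" ;;
    rhel-legacy) # RHEL 8/9 driver module, fm profile
      echo "sudo dnf module install -y nvidia-driver:$2/fm" ;;
    *)
      echo "unknown form: $1" >&2
      return 1 ;;
  esac
}
```

For example, drain_cmds "$NODE" followed by fm_install_cmd deb-legacy "$DRIVER_BRANCH" prints the sequence to paste into a change ticket. Printing rather than executing keeps the version pin visible to a reviewer before anything mutates the node.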
Verification
Do not uncordon on systemctl alone (pre-check 4: it can lie). Require a real NVLink/collective proof.
1. NVLink links up, paths over NVSwitch. Run nvidia-smi nvlink --status and nvidia-smi topo --matrix: every NVLink should report active, and the topology matrix should show NV# between GPUs, not a PCIe fallback.[5] nvidia-smi nvlink --status reports each link's state and, when active, its rated bandwidth (not live throughput). A link still inactive, or a GPU showing fewer links than its NVLink count, means the fabric is not fully formed.[5]

2. DCGM diagnostic exercises the link. dcgmi diag includes a PCIe + NVLink plugin that uses NVLink to communicate between GPUs when possible, otherwise PCIe.[3] Run a long diagnostic, dcgmi diag -r 3 (level 3 is the < 10-min hardware diag on a 4-GPU system; the plugin does not run in the < 2.5 s level 1).[3] Treat any Fail in the PCIe/NVLink rows as a non-pass; this is the proof the fabric is not just "service active" but actually carrying GPU-to-GPU traffic.

3. Collective busbw at the NVLink figure. Build and run all_reduce_perf from NVIDIA/nccl-tests and read busbw, not algbw; busbw applies the AllReduce correction so the number reflects hardware utilisation and is comparable to the interconnect peak:[4][5]

       # single node, N GPUs in one process:
       ./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
       # multi-node (built with MPI=1), one rank per GPU:
       mpirun -np <ranks> -N <gpus_per_node> ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

   On an 8-GPU NVLink node, large-message busbw should approach the NVLink bus bandwidth; a figure stuck near a single-GPU or PCIe ceiling means the fabric is still degraded (re-open Procedure). Set NCCL_DEBUG=INFO to confirm the transport: a NET/Socket line where you expect NVLink/IB is a fallback, not a pass.[5]

4. Uncordon (kubectl uncordon, or resume the node in Slurm) only after links are up and a collective hits the expected busbw.
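The readmission rule (links up, diagnostic clean, busbw at the expected figure) can be expressed as a small gate. This is a sketch, not a tested tool: the function names are hypothetical, the 80 percent default threshold is a placeholder to replace with your site's baseline, inactive links are assumed to appear as lines containing the word inactive, and a failing diagnostic is assumed to print Fail in its result rows.

```shell
#!/usr/bin/env bash
# Count lines reporting an inactive link in `nvidia-smi nvlink --status` output (stdin).
inactive_links() { grep -ci 'inactive' || true; }

# Succeed when measured busbw is at least PCT percent of the expected figure.
# Usage: busbw_ok MEASURED EXPECTED [PCT]   (PCT defaults to a placeholder of 80)
busbw_ok() {
  awk -v m="$1" -v e="$2" -v p="${3:-80}" 'BEGIN { exit !(m >= e * p / 100) }'
}

# Gate in the order of the Verification steps. Usage: readmit_gate MEASURED EXPECTED
readmit_gate() {
  status=$(nvidia-smi nvlink --status) || return 1
  down=$(printf '%s\n' "$status" | inactive_links)
  [ "$down" -eq 0 ] || { echo "inactive links: $down" >&2; return 1; }
  diag=$(dcgmi diag -r 3)
  printf '%s\n' "$diag"
  if printf '%s\n' "$diag" | grep -q 'Fail'; then return 1; fi
  busbw_ok "$1" "$2" || { echo "busbw below threshold" >&2; return 1; }
  echo "gate passed"
}
```

The measured value is the large-message busbw read from the all_reduce_perf run; the expected value is whatever a known-good node of the same type produces. Only a gate passed result would justify running kubectl uncordon on the node.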
Rollback
This runbook restores a degraded plane; the safe state is the node out of service, not a half-fixed fabric admitting jobs.
- If FM will not come healthy (still aborting after a version-matched reinstall, or links stay down), leave the node cordoned/drained and escalate. Do not uncordon a node whose NVLink fabric is not formed.
- If FM broke during a driver upgrade, the clean rollback is the single-variable revert in the driver-upgrade runbook: re-pin the previous driver branch (which restores the matching FM/NSCQ), reboot, and re-validate. Do not hand-patch only FM to chase a driver that itself rolled back.
- If dcgmi diag -r 3 fails the GPU/NVLink rows even with FM healthy and versions matched, suspect hardware (a seated-but-degraded board or baseboard fault), not configuration, and divert to the GPU-fault / RMA path (reliability and RAS).
- Never set FM_STAY_RESIDENT_ON_FAILURES=1 as a "fix": it keeps the daemon up but leaves the system uninitialized and CUDA launches failing.[1]
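Keeping a node out of service is easier to audit when the reason travels with it. The sketch below prints, rather than runs, a cordon plus an annotation; the function name hold_cmds and the annotation key ops.example.com/hold-reason are hypothetical placeholders, so substitute whatever convention your cluster already uses.

```shell
#!/usr/bin/env bash
# Print (do not run) commands that hold a node out of service with a recorded reason.
# Usage: hold_cmds NODE REASON   (REASON should be a single token without spaces)
hold_cmds() {
  printf 'kubectl cordon %s\n' "$1"
  # Hypothetical annotation key; --overwrite lets a later reason replace an earlier one.
  printf 'kubectl annotate node %s ops.example.com/hold-reason=%s --overwrite\n' "$1" "$2"
}
```

For example, hold_cmds "$NODE" fm-unhealthy emits the two commands for the first bullet above, leaving the node cordoned while the escalation proceeds.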
Related runbooks
- Fabric Manager: what FM is, version lockstep with the driver, IMEX for multi-node NVLink.
- fabric bring-up / benchmarking: full NVLink + collective validation procedure (the proofs used here).
- driver-upgrade runbook: correct upgrade ordering so FM never falls behind the driver (root-cause prevention).
- NCCL-hang runbook: collective stall variant when the fabric is up but a collective hangs.
- reliability and RAS: escalation when a node fails diag and the fault is hardware.
- operational runbooks: runbook index.
References
- NVIDIA Fabric Manager User Guide (service function, driver-compatibility abort at init, cudaErrorSystemNotReady, package/systemctl commands, FM_STAY_RESIDENT_ON_FAILURES / FABRIC_MODE, config/log paths, GPU-fails-to-register behaviour): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NSCQ packaging for Debian (libnvidia-nscq-<branch>, "stable driver API used by DCGM for monitoring NVSwitch devices"): https://github.com/NVIDIA/apt-packaging-libnvidia-nscq
- NVIDIA DCGM Diagnostics (run levels and the PCIe + NVLink plugin): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- NVIDIA/nccl-tests (all_reduce_perf, -b/-e/-f/-g flags, mpirun multi-node launch, busbw vs algbw): https://github.com/NVIDIA/nccl-tests
- kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
[1] NVIDIA Fabric Manager User Guide: NVSwitch-only scope; driver-compatibility check that aborts on an incompatible loaded driver stack; cudaErrorSystemNotReady when the fabric is uninitialised; install package names (cuda-drivers-fabricmanager-<branch>, nvidia-open-<branch>, nvlink5-<branch>, nvidia-driver:<branch>/fm); systemd unit not enabled/started by the package; FM_STAY_RESIDENT_ON_FAILURES / FABRIC_MODE semantics; config /usr/share/nvidia/nvswitch/fabricmanager.cfg; log /var/log/fabricmanager.log; "If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available for non-peer-to-peer use cases." https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

[2] NVIDIA apt-packaging-libnvidia-nscq: Debian package libnvidia-nscq-<branch> (first dot-delimited driver-version field); "NVSwitch Configuration and Query (NSCQ) library provides a stable driver API used by DCGM for monitoring NVSwitch devices"; versioned to match the driver. https://github.com/NVIDIA/apt-packaging-libnvidia-nscq

[3] NVIDIA DCGM Diagnostics: dcgmi diag -r <level>; run levels 1 (< 2.5 s), 2 (< 2.5 min), 3 (< 10 min), 4 (< 45 min) on a 4-GPU system; PCIe + NVLink plugin that "will use NvLink to communicate between GPUs when possible. Otherwise, communication between GPUs will occur over PCIe." https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html

[4] NVIDIA/nccl-tests: all_reduce_perf; -b minbytes, -e maxbytes, -f stepfactor, -g ngpus per thread; multi-node via mpirun -np <ranks> -N <gpus_per_node> ... -g 1 (built with MPI=1); output reports algbw and busbw. https://github.com/NVIDIA/nccl-tests

[5] This KB, fabric bring-up / benchmarking page: nvidia-smi nvlink --status reports per-link state and rated bandwidth; nvidia-smi topo --matrix shows NV# vs PHB/SYS; busbw is the comparable figure for AllReduce; NCCL_DEBUG=INFO reveals transport fallback; FM/NSCQ must match the driver. fabric-bringup-benchmarking.md