Runbook: NVLink visibility / P2P failure
Scope: diagnose GPUs that cannot see each other over NVLink: nvidia-smi nvlink --status shows links inactive, CUDA P2P access is disabled, and collectives silently fall back to PCIe. Broader than a Fabric Manager outage: this covers MIG-on (P2P deliberately dropped), topology mis-wiring, and NVLink link-layer errors (CRC / replay / recovery), of which a dead nvidia-fabricmanager is only one branch.
Run this when GPUs on a node refuse to talk over NVLink: `nvidia-smi nvlink --status` reports links `inactive` or fewer than the card's link count, `nvidia-smi topo --matrix` shows GPU pairs over `PHB`/`SYS` instead of `NV#`, P2P probes report `CANNOT access peer`, and NCCL busbw collapses to a PCIe ceiling. Severity: node-degraded. The GPUs run, but every NVLink-assuming collective pays a PCIe tax.

Reference templates on real APIs. Nothing here was executed on hardware; pin versions to your driver / CUDA branch and validate on one node before fleet use.
P2P-disabled is the symptom; the cause splits four ways and each has a different owner. A dead Fabric Manager on an NVSwitch box belongs in the Fabric Manager failure runbook. MIG mode silently strips NVLink P2P on Ampere; that is expected, not a fault. Link-layer errors (CRC / replay / recovery climbing) point at a cable/board fault that escalates to GPU fault / RMA. A topology that comes up with the wrong `NV#` map is a wiring/seating problem. Conceptual background on the interconnect is in NVSwitch & NVLink and Fabric Manager; ACS-blocked P2P is in ACS disable.
Trigger
- `nvidia-smi nvlink --status` reports one or more links `inactive`, or a GPU shows fewer active links than its NVLink count (a partial fabric, not a clean down).[1]
- `nvidia-smi topo --matrix` shows GPU↔GPU over `PHB`/`SYS`/`NODE` (PCIe host bridge or system path) where you expect `NV#` (NVLink, # = link count).[1]
- `nvidia-smi topo -p2p rw` shows P2P read/write not supported between GPU pairs that should be NVLink peers.[1]
- A CUDA P2P probe (`p2pBandwidthLatencyTest` / `simpleP2P`) reports `Peer access ... CANNOT` or P2P-enabled bandwidth equal to P2P-disabled bandwidth.[4]
- NCCL falls back to PCIe: `NCCL_DEBUG=INFO` shows no NVLink path and `nccl-tests` busbw sits at a PCIe ceiling (NCCL-hang runbook covers the full-stall variant).
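The topology symptom can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the usual `nvidia-smi topo --matrix` layout (a header row naming `GPU0 GPU1 ...` columns, then one row per device). The function name `pcie_gpu_pairs` is illustrative and not part of any NVIDIA tool; confirm the layout on your driver branch before relying on it.

```shell
# pcie_gpu_pairs: read `nvidia-smi topo --matrix` output on stdin and print
# "<row-gpu> <column-gpu> <path>" for every GPU<->GPU cell that is neither the
# diagonal (X) nor NVLink (NV#). Each pair appears once per direction.
# Assumption: GPU columns come first in the header; layout can vary by driver.
pcie_gpu_pairs() {
  awk '
    { gsub(/\033\[[0-9;]*m/, "") }          # strip ANSI styling codes, if present
    ngpu == 0 {                              # still looking for the header row
      for (i = 1; i <= NF; i++)
        if ($i ~ /^GPU[0-9]+$/) col[++ngpu] = $i
      next
    }
    $1 ~ /^GPU[0-9]+$/ {                     # one matrix row per GPU
      for (i = 1; i <= ngpu; i++) {
        cell = $(i + 1)
        if (cell != "X" && cell !~ /^NV[0-9]+$/) print $1, col[i], cell
      }
    }
  '
}
```

Typical use would be `nvidia-smi topo --matrix | pcie_gpu_pairs`; empty output means every GPU pair reported an `NV#` path.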
Pre-checks
Establish which of the four branches you are in before mutating anything. NODE is the Kubernetes node name.
1. Is this an NVSwitch box, and is Fabric Manager healthy? On HGX/DGX baseboards and NVL72, a dead FM presents exactly as inactive NVLink; that is the Fabric Manager failure runbook, not this one. Check the `nvidia-fabricmanager` service state before anything else: `active` (or no FM because the box is PCIe-attached, not NVSwitch) → stay here. Anything else → divert to the FM runbook.
2. Is MIG mode on? Enabling MIG drops NVLink P2P on Ampere: "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability."[2] On A100/HGX A100 the NVLinks are trained off; on H100 and later they stay active but P2P across instances is still gone. Inactive NVLink under MIG is expected, not a fault. If MIG is `Enabled` and you did not expect it, that is the bug; go to stale MIG state / MIG partitioning, not the hardware path.
3. Read link state and the topology map as ground truth. A GPU with some links active and some inactive is a link/board fault (the error-counter branch in Procedure); a GPU with all links inactive on an NVSwitch box points back to FM (pre-check 1).
4. Read NVLink error counters. Climbing CRC / replay / recovery / link-down counts mean a degrading physical link, not a config problem; that path ends at GPU fault / RMA.[1]
5. Rule out ACS. PCIe Access Control Services on the GPU/switch path silently blocks P2P translation; with ACS on, P2P is refused even when NVLink trains fine (ACS disable).
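The pre-checks above can be sketched as small shell helpers. This is a minimal sketch under stated assumptions, not a validated tool: the function names are hypothetical, the parsers assume the common `nvidia-smi` and `lspci` text layouts (`GPU N:` headers with `Link N:` rows, and `ACSCtl:` lines carrying `SrcValid+` when ACS is on), and each parser reads stdin so it can be tried against a saved log first.

```shell
# Pre-check 1: print the Fabric Manager unit state ("active", "inactive", ...).
# "inactive" is also what systemd reports when the unit is not installed,
# which is expected on a PCIe-attached (non-NVSwitch) box.
fm_state() {
  systemctl is-active nvidia-fabricmanager 2>/dev/null || true
}

# Pre-check 2: stdin is
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
# Prints the index of every GPU whose MIG mode is Enabled.
mig_enabled_gpus() {
  awk -F', *' '$2 == "Enabled" { print $1 }'
}

# Pre-check 3: stdin is `nvidia-smi nvlink --status`.
# Prints "<gpu> <active-links> <inactive-links>" per GPU.
nvlink_link_counts() {
  awk '
    /^GPU [0-9]+:/ {
      if (gpu != "") print gpu, up, down
      gpu = $2; sub(/:$/, "", gpu); up = 0; down = 0
      next
    }
    /Link [0-9]+:/ { if (tolower($0) ~ /inactive/) down++; else up++ }
    END { if (gpu != "") print gpu, up, down }
  '
}

# Pre-check 4: compare two saved `nvidia-smi nvlink -e` snapshots taken some
# minutes apart. Succeeds (exit 0) when the text differs, i.e. counters moved.
nvlink_counters_changed() {
  ! cmp -s "$1" "$2"
}

# Pre-check 5: stdin is `sudo lspci -vvv`. Prints ACSCtl lines with source
# validation on, the usual sign that ACS is enabled on that port.
acs_enabled_lines() {
  grep -i 'ACSCtl' | grep 'SrcValid+' || true
}
```

For example, `nvidia-smi nvlink --status | nvlink_link_counts` gives one line per GPU; a row with both columns non-zero is the partial-link case from pre-check 3, and any output from `sudo lspci -vvv | acs_enabled_lines` means ACS still needs ruling out.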
Flow
flowchart TB
A["NVLink inactive / P2P disabled"] --> B{"NVSwitch box and FM active?"}
B -->|"FM down"| C["Fabric Manager failure runbook"]
B -->|"FM active or PCIe-attached"| D{"MIG mode enabled?"}
D -->|"Yes (NVLink P2P dropped by design)"| E["Stale MIG state runbook / disable MIG if unintended"]
D -->|"No"| F{"nvlink -e: CRC/replay/recovery climbing?"}
F -->|"Counters clean, P2P still blocked"| G{"ACS on the GPU/NIC path?"}
F -->|"Counters climbing / partial links"| H["Drain, GPU reset, re-read counters"]
G -->|"ACS on"| I["Disable ACS, re-probe P2P"]
G -->|"ACS off"| J["Suspect topology / seating: reset and retrain"]
H -->|"Still failing after reset"| K["GPU fault / RMA"]
H -->|"Clean after reset"| L["Verify: topo NV#, p2p test, busbw"]
I --> L
J --> L
E --> L
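The flowchart reads top to bottom as a chain of early exits, which a short shell function can mirror. This is an illustrative sketch only: `nvlink_branch` and its argument vocabulary are hypothetical names chosen here, and the inputs are whatever verdicts you recorded during the pre-checks.

```shell
# nvlink_branch FM MIG COUNTERS ACS
#   FM:       active | down | none   (none = PCIe-attached, no Fabric Manager)
#   MIG:      enabled | disabled
#   COUNTERS: clean | climbing       (climbing also covers a partial link set)
#   ACS:      on | off
# Prints the next action, testing the branches in flowchart order.
nvlink_branch() {
  local fm=$1 mig=$2 counters=$3 acs=$4
  if [ "$fm" = "down" ]; then
    echo "fabric-manager-failure-runbook"
  elif [ "$mig" = "enabled" ]; then
    echo "stale-mig-state-runbook"
  elif [ "$counters" = "climbing" ]; then
    echo "drain-reset-reread-counters"
  elif [ "$acs" = "on" ]; then
    echo "disable-acs-reprobe-p2p"
  else
    echo "reset-and-retrain-suspect-seating"
  fi
}
```

Calling `nvlink_branch active disabled clean on` prints `disable-acs-reprobe-p2p`. The ordering matters: Fabric Manager and MIG are ruled out before any hardware suspicion, so a software cause is never escalated as an RMA.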
Procedure
Cordon and drain before any GPU reset or MIG/driver mutation: a reset is refused while a client holds the device, and you must not flip a live node's fabric under jobs. NODE is the Kubernetes node name (Slurm equivalent inline).
1. Cordon and drain so the scheduler stops placing work and running pods evict cleanly.
2. Clear residual GPU clients: a reset or MIG mutation is refused with `In use by another client` while a CUDA app or stray `nvidia-smi` is attached.[3]
3. Branch on the pre-check verdict.
   - MIG was on and unintended → exit MIG mode (this is a GPU reset on Ampere), then re-read NVLink. Mode is not InfoROM-persistent on Hopper+, so re-assert intent explicitly (stale MIG state).[3]
   - ACS was on → disable it on the affected root ports so P2P translation is allowed, then re-probe (ACS disable). Prefer the platform's documented path (BIOS / kernel) over a transient `setpci`.
   - Links inactive on an NVSwitch box → this is Fabric Manager; restore the version-matched FM stack and prove NVLink there. Do not continue here: Fabric Manager failure runbook.
   - Error counters climbing / partial link set / wrong `NV#` map → suspect a cable, connector seating, or board fault. Reset the GPU once to retrain links, then re-read counters:[1]

     ```shell
     sudo nvidia-smi --gpu-reset -i 0   # retrain links (node must be drained)
     nvidia-smi nvlink -e               # counters should be clean after reset
     nvidia-smi nvlink --status
     ```

     If links stay inactive, counters climb again, or the topology map is still wrong after a reset, stop treating it as software; escalate to GPU fault / RMA. Persistent NVLink errors are a hardware RMA signal, not a config retry.
4. Confirm the driver/firmware floor if links will not train at all (not merely P2P-blocked). A GSP firmware / driver mismatch can leave the interconnect half-initialised (GSP firmware mismatch, kernel modules); a node that no longer enumerates GPUs is kernel GPU missing.
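Steps 1 to 3 can be wrapped so they are rehearsed before they are run. The sketch below is an assumption-laden template, not the site's actual tooling: the helper names (`run`, `drain_node`, `gpu_clients`, `exit_mig`) and the `DRY_RUN` switch are hypothetical, and the `kubectl drain` flags are common defaults that must be matched to your eviction policy.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: stop scheduling, then evict pods.
# Slurm equivalent: scontrol update NodeName=<node> State=DRAIN Reason="nvlink"
drain_node() {
  local node=$1
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}

# Step 2: list compute processes still holding a GPU; they must exit before
# a reset or MIG change is accepted.
gpu_clients() {
  run nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
}

# Step 3, MIG branch: leave MIG mode on one GPU, then re-read its NVLink state.
exit_mig() {
  local gpu=$1
  run sudo nvidia-smi -i "$gpu" -mig 0
  run nvidia-smi nvlink --status -i "$gpu"
}
```

Running `DRY_RUN=1 drain_node "$NODE"` prints the two `kubectl` commands without touching the cluster, which is a cheap way to review them before the real drain.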
Verification
Do not uncordon on nvidia-smi nvlink --status alone: an active link state is necessary but not sufficient. Require a real P2P + collective proof.
1. Links active and paths over NVLink. Every expected link reports active, and the topology matrix shows `NV#` between peers, not a PCIe fallback.[1]
2. CUDA actually does P2P. Build and run `p2pBandwidthLatencyTest` from `NVIDIA/cuda-samples`; the P2P=Enabled bandwidth matrix must be materially higher than P2P=Disabled and the latency lower; equal numbers mean P2P never engaged.[4] `simpleP2P` is the lighter pass/fail check (`Peer access ... ENABLED`, verification `PASSED`).[4]
3. Collective busbw at the NVLink figure. Run `all_reduce_perf` from `NVIDIA/nccl-tests` and read busbw (not algbw), with `NCCL_DEBUG=INFO` confirming an NVLink transport rather than a `NET/Socket` / PCIe fallback.[5] On an NVLink node, large-message busbw should approach the NVLink bus bandwidth; a figure stuck at the PCIe ceiling means the fabric is still degraded; re-open Procedure.
4. Uncordon only after 1–3 pass.
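The busbw gate in step 3 is easy to automate so that uncordon depends on a number rather than a glance at the log. A minimal sketch, assuming the `# Avg bus bandwidth` summary line that `all_reduce_perf` prints at the end of a run; the helper names and the idea of a per-platform floor are assumptions made here, and the floor value itself has to come from your own known-good baseline.

```shell
# avg_busbw: read all_reduce_perf output on stdin and print the average bus
# bandwidth (GB/s) taken from its "# Avg bus bandwidth" summary line.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# busbw_at_least MIN: succeed only if the figure found on stdin is >= MIN.
# MIN is a site-specific floor measured on a healthy node; none is implied here.
# Fails when no summary line is present (for example, the run aborted).
busbw_at_least() {
  local min=$1 got
  got=$(avg_busbw)
  [ -n "$got" ] && awk -v g="$got" -v m="$min" 'BEGIN { exit !(g + 0 >= m + 0) }'
}
```

A run might look like `NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 8 | tee run.log | busbw_at_least "$FLOOR"`, where the message sizes, GPU count, and `FLOOR` are placeholders; keep `run.log` so the transport lines can be inspected alongside the verdict.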
Rollback
This runbook restores a degraded plane; the safe state is the node out of service, not a half-fixed fabric admitting jobs.
- If a change made it worse, revert the single variable: re-enable MIG to the recorded prior layout (stale MIG state), or restore the prior ACS setting. Never stack changes: one mutation, one re-verify.
- If links stay inactive or counters keep climbing after a reset, leave the node cordoned/drained and escalate to hardware; do not uncordon a node whose NVLink fabric is not fully formed.
- If the root cause was Fabric Manager or a driver upgrade, the clean fix lives in those runbooks (Fabric Manager failure, driver upgrade): re-pin the matched driver/FM/NSCQ branch rather than hand-patching NVLink state.
- As a stopgap only, a job can be forced onto PCIe (`NCCL_P2P_DISABLE=1`, or `NCCL_P2P_LEVEL` scoped below NVLink) to keep work moving at reduced bandwidth while the node is repaired; this masks the fault, it does not fix it.[5]
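If the stopgap is used, scoping the variable to a single job keeps it from leaking into the shell or into unrelated work. A minimal sketch; `run_on_pcie` is a hypothetical wrapper name, and it relies only on `env` setting `NCCL_P2P_DISABLE` for the one command it launches.

```shell
# run_on_pcie CMD...: run one command with NCCL direct GPU-to-GPU P2P disabled.
# `env` applies the variable to that command only. Stopgap: this hides the
# NVLink fault at reduced bandwidth, it does not repair it.
run_on_pcie() {
  env NCCL_P2P_DISABLE=1 "$@"
}
```

For instance, `run_on_pcie ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8` measures what the degraded path delivers, which also gives a PCIe-ceiling reference figure for later verification.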
Related runbooks
- Fabric Manager failure: inactive NVLink because `nvidia-fabricmanager` is down on an NVSwitch box (pre-check 1 branch).
- stale MIG state: MIG geometry drift; MIG-on is why P2P is "missing" in pre-check 2.
- GPU fault / RMA: escalation when NVLink errors persist after a reset (hardware path).
- GSP firmware mismatch: links will not train because of a driver/firmware floor.
- kernel GPU missing: the node no longer enumerates GPUs at all (upstream of NVLink).
- driver upgrade: post-upgrade NVLink/FM regressions and the correct ordering.
- NCCL hang: the collective-stall variant when the fabric is up but a collective wedges.
- operational runbooks: runbook index.
References
- nvidia-smi (man page): `nvlink -s/--status`, `-e/--errorcounters` (CRC flit/data, replay, recovery, link-down), `-c/--capabilities`; `topo -m/--matrix` (NV#/PHB/SYS legend) and `topo -p2p` (P2P read/write/NVLink/atomics/PCIe status). https://docs.nvidia.com/deploy/nvidia-smi/
- NVIDIA Fabric Manager User Guide: MIG mode disables GPU NVLinks and removes NVLink P2P; A100/HGX A100 train NVLinks off vs H100+ keep them active; P2P restoration requires FM running. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA MIG User Guide: `-mig 0/1` (GPU reset on Ampere, not InfoROM-persistent on Hopper+), `In use by another client` reset refusal. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
- NVIDIA/cuda-samples: `p2pBandwidthLatencyTest` (P2P enabled vs disabled bandwidth/latency matrix) and `simpleP2P` (peer-access pass/fail), under `Samples/5_Domain_Specific/`. https://github.com/NVIDIA/cuda-samples
- NCCL environment variables: `NCCL_DEBUG`, `NCCL_P2P_DISABLE`, `NCCL_P2P_LEVEL` (NVL = use P2P over NVLink). https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- NVIDIA/nccl-tests: `all_reduce_perf`, `-b/-e/-f/-g` flags, busbw vs algbw. https://github.com/NVIDIA/nccl-tests
- kubectl drain: safely drain a node. https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Related: NVSwitch & NVLink · Fabric Manager · Fabric Manager Failure · MIG Partitioning · Stale MIG State · ACS Disable · GPU Fault / RMA · NCCL Hang · Operational Runbooks · Glossary
1. nvidia-smi man page: `nvidia-smi nvlink -s/--status` reports per-link active/inactive and rated bandwidth when active; `-e/--errorcounters` reports CRC (flit/data), replay, recovery and link-down counters per link; `nvidia-smi topo -m/--matrix` legend (`NV#` = NVLink with link count, `PHB`/`SYS`/`NODE`/`PXB`/`PIX` = PCIe paths); `nvidia-smi topo -p2p` shows P2P read (r)/write (w)/NVLink (n)/atomics (a)/PCIe (p) status between GPUs. https://docs.nvidia.com/deploy/nvidia-smi/
2. NVIDIA Fabric Manager User Guide: "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability"; on DGX/HGX A100 the GPU and NVSwitch-side NVLinks are trained off and retrained when MIG is disabled; on DGX/HGX H100 and later NVLinks stay active during MIG mode, but restoring P2P after disabling MIG requires the FM service running. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
3. NVIDIA MIG User Guide: `nvidia-smi -i <id> -mig 0/1` toggles MIG mode (a GPU reset on Ampere; not InfoROM-persistent on Hopper+, so re-assert on boot); a reset or MIG mutation is refused with `In use by another client` while a CUDA app or `nvidia-smi` holds the device. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
4. NVIDIA/cuda-samples: `p2pBandwidthLatencyTest` demonstrates CUDA P2P transfers between GPU pairs and prints unidirectional/bidirectional bandwidth and latency matrices for P2P=Disabled vs P2P=Enabled (equal numbers mean P2P did not engage); `simpleP2P` validates peer access and reports a pass/fail. Both live under `Samples/5_Domain_Specific/` and build with make/cmake. https://github.com/NVIDIA/cuda-samples
5. NVIDIA NCCL: `NCCL_DEBUG=INFO` reveals the chosen transport (a `NET/Socket`/PCIe path where NVLink is expected is a fallback); `NCCL_P2P_DISABLE=1` disables direct GPU-to-GPU P2P; `NCCL_P2P_LEVEL=NVL` uses P2P only when GPUs are NVLink-connected; `nccl-tests` `all_reduce_perf` reports busbw (hardware-comparable) vs algbw. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html