Runbook: NVLink visibility / P2P failure

Scope: diagnose GPUs that cannot see each other over NVLink: nvidia-smi nvlink --status shows links inactive, CUDA P2P access is disabled, and collectives silently fall back to PCIe. Broader than a Fabric Manager outage: this covers MIG-on (P2P deliberately dropped), topology mis-wiring, and NVLink link-layer errors (CRC / replay / recovery), of which a dead nvidia-fabricmanager is only one branch.

Run this when GPUs on a node refuse to talk over NVLink: nvidia-smi nvlink --status reports links inactive or fewer than the card's link count, nvidia-smi topo --matrix shows GPU pairs over PHB/SYS instead of NV#, P2P probes report CANNOT access peer, and NCCL busbw collapses to a PCIe ceiling. Severity: node-degraded. The GPUs run, but every NVLink-assuming collective pays a PCIe tax.

The commands below are reference templates built on real APIs. Nothing here was executed on hardware; pin versions to your driver / CUDA branch and validate on one node before fleet use.

P2P-disabled is the symptom; the cause splits four ways, and each has a different owner. A dead Fabric Manager on an NVSwitch box belongs to the Fabric Manager failure runbook. MIG mode silently strips NVLink P2P on Ampere; that is expected, not a fault. Link-layer errors (CRC / replay / recovery counters climbing) point at a cable/board fault that escalates to GPU fault / RMA. A topology that comes up with the wrong NV# map is a wiring/seating problem. Conceptual background on the interconnect is in NVSwitch & NVLink and Fabric Manager; ACS-blocked P2P is covered in ACS disable.

Trigger

nvidia-smi nvlink --status reports one or more links inactive, or a GPU shows fewer active links than its NVLink count (a partial fabric, not a clean down).¹
nvidia-smi topo --matrix shows GPU↔GPU over PHB / SYS / NODE (PCIe host bridge or system path) where you expect NV# (NVLink, # = link count).¹
nvidia-smi topo -p2p rw shows P2P read/write not supported between GPU pairs that should be NVLink peers.¹
A CUDA P2P probe (p2pBandwidthLatencyTest / simpleP2P) reports Peer access ... CANNOT or P2P-enabled bandwidth equal to P2P-disabled bandwidth.⁴
NCCL falls back to PCIe: NCCL_DEBUG=INFO shows no NVLink path and nccl-tests busbw sits at a PCIe ceiling (NCCL-hang runbook covers the full-stall variant).

Pre-checks

Establish which of the four branches you are in before mutating anything. NODE is the Kubernetes node name.

NODE=gpu-07.dc1.internal

1. Is this an NVSwitch box, and is Fabric Manager healthy? On HGX/DGX baseboards and NVL72, a dead FM presents exactly as inactive NVLink; that case belongs to the Fabric Manager failure runbook, not this one. Check before anything else:
```
systemctl is-active nvidia-fabricmanager    # NVSwitch systems only
```
active (or no FM because the box is PCIe-attached, not NVSwitch) → stay here. Anything else → divert to the FM runbook.
2. Is MIG mode on? Enabling MIG drops NVLink P2P on Ampere: "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability."² On A100/HGX A100 the NVLinks are trained off; on H100 and later they stay active but P2P across instances is still gone. Inactive NVLink under MIG is expected, not a fault:
```
nvidia-smi --query-gpu=index,mig.mode.current --format=csv
```
If MIG is Enabled and you did not expect it, that is the bug; go to stale MIG state / MIG partitioning, not the hardware path.

3. Read link state and the topology map as ground truth:

nvidia-smi nvlink --status                  # per-link active/inactive (+ rated BW when active)
nvidia-smi topo --matrix                    # NV# = NVLink, PHB/SYS/NODE = PCIe fallback

A GPU with some links active and some inactive is a link/board fault (handled by the error-counter branch in Procedure); a GPU with all links inactive on an NVSwitch box points back to FM (pre-check 1).

4. Read NVLink error counters. Climbing CRC / replay / recovery / link-down counts mean a degrading physical link, not a config problem; that path ends at GPU fault / RMA:¹
```
nvidia-smi nvlink -e                         # CRC (flit/data), replay, recovery, link-down per link
```
5. Rule out ACS. PCIe Access Control Services on the GPU/switch path silently blocks P2P translation; with ACS on, P2P is refused even when NVLink trains fine (ACS disable):
```
sudo lspci -vvv | grep -i 'ACSCtl'           # SrcValid+ on the path blocks P2P
```
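
The read-only pre-checks above can be scripted so the verdict is recorded rather than eyeballed. The following is a minimal sketch, not a validated tool: the function names are hypothetical, each helper parses command output on stdin, and the parsing assumes the layouts noted in the comments (in particular an `<inactive>` marker in `nvidia-smi nvlink --status` and unindented device lines in `lspci -vvv`). Check those assumptions against your driver and pciutils versions before relying on it.

```shell
#!/bin/sh
# Hypothetical pre-check helpers. Each one parses text on stdin and mutates nothing.

# MIG check: print the index of every GPU whose current MIG mode is "Enabled".
# Input: nvidia-smi --query-gpu=index,mig.mode.current --format=csv
mig_enabled_gpus() {
    awk -F', *' '$2 == "Enabled" { print $1 }'
}

# Link-state check: count active vs inactive links per GPU.
# Input: nvidia-smi nvlink --status
# Assumes "GPU <n>: ..." header lines and "Link <n>: <inactive>" for dead links.
nvlink_link_summary() {
    awk '
        /^GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu); order[++n] = gpu; next }
        /Link [0-9]+:/ { if ($0 ~ /inactive/) inactive[gpu]++; else active[gpu]++ }
        END {
            for (i = 1; i <= n; i++) {
                g = order[i]
                printf "GPU %s active=%d inactive=%d\n", g, active[g], inactive[g]
            }
        }
    '
}

# ACS check: print the PCI address of every device reporting ACSCtl SrcValid+.
# Input: sudo lspci -vvv
acs_srcvalid_devices() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }                  # unindented line opens a device record
        /ACSCtl:/ && /SrcValid[+]/ { print dev }
    '
}
```

On a node, each helper would be fed from the matching pre-check command, for example `nvidia-smi nvlink --status | nvlink_link_summary`. A GPU that reports both a nonzero `active` and a nonzero `inactive` count corresponds to the partial-link case described above.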

Flow

flowchart TB
    A["NVLink inactive / P2P disabled"] --> B{"NVSwitch box and FM active?"}
    B -->|"FM down"| C["Fabric Manager failure runbook"]
    B -->|"FM active or PCIe-attached"| D{"MIG mode enabled?"}
    D -->|"Yes (NVLink P2P dropped by design)"| E["Stale MIG state runbook / disable MIG if unintended"]
    D -->|"No"| F{"nvlink -e: CRC/replay/recovery climbing?"}
    F -->|"Counters clean, P2P still blocked"| G{"ACS on the GPU/NIC path?"}
    F -->|"Counters climbing / partial links"| H["Drain, GPU reset, re-read counters"]
    G -->|"ACS on"| I["Disable ACS, re-probe P2P"]
    G -->|"ACS off"| J["Suspect topology / seating: reset and retrain"]
    H -->|"Still failing after reset"| K["GPU fault / RMA"]
    H -->|"Clean after reset"| L["Verify: topo NV#, p2p test, busbw"]
    I --> L
    J --> L
    E --> L
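
The decision order in the flowchart can be written down as a small function, which is useful when the triage is driven from automation. This is a sketch under stated assumptions: the function name, the input words, and the branch labels it prints are all hypothetical, and the inputs are the verdicts an operator has already read off the pre-checks.

```shell
#!/bin/sh
# Hypothetical encoding of the flowchart. Arguments are pre-check verdicts:
#   $1 fm:       "active" | "down" | "none"   (none = no NVSwitch / no FM on this box)
#   $2 mig:      "on" | "off"
#   $3 counters: "clean" | "climbing"         (climbing also covers partial link sets)
#   $4 acs:      "on" | "off"
nvlink_triage() {
    if [ "$1" = "down" ];     then echo "fabric-manager-runbook";  return; fi
    if [ "$2" = "on" ];       then echo "stale-mig-runbook";       return; fi
    if [ "$3" = "climbing" ]; then echo "drain-reset-recheck";     return; fi
    if [ "$4" = "on" ];       then echo "disable-acs";             return; fi
    echo "suspect-topology-seating"
}
```

The checks run in the same order as the diagram, so a Fabric Manager outage always wins over a MIG or ACS finding on the same node.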

Procedure

Cordon and drain before any GPU reset or MIG/driver mutation: a reset is refused while a client holds the device, and you must not flip a live node's fabric under jobs. NODE is the Kubernetes node name (Slurm equivalent inline).

NODE=gpu-07.dc1.internal

Cordon and drain so the scheduler stops placing work and running pods evict cleanly:

kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Slurm: scontrol update nodename="$NODE" state=drain reason="nvlink p2p failure"

Clear residual GPU clients: a reset or MIG mutation is refused with "In use by another client" while a CUDA app or stray nvidia-smi is attached:³

nvidia-smi                                    # the "Processes" table must be empty
sudo fuser -k /dev/nvidia*                     # last resort, then re-check nvidia-smi

Branch on the pre-check verdict.
MIG was on and unintended → exit MIG mode (this is a GPU reset on Ampere), then re-read NVLink. Mode is not InfoROM-persistent on Hopper+, so re-assert intent explicitly (stale MIG state):³
```
sudo nvidia-smi -i 0 -mig 0
nvidia-smi nvlink --status                   # links should return to active (Ampere)
```
ACS was on → disable it on the affected root ports so P2P translation is allowed, then re-probe (ACS disable). Prefer the platform's documented path (BIOS / kernel) over a transient setpci:
```
sudo lspci -vvv | grep -i 'ACSCtl'           # confirm SrcValid- after the change
```
Links inactive on an NVSwitch box → this is Fabric Manager; restore the version-matched FM stack and prove NVLink there. Do not continue here: Fabric Manager failure runbook.
Error counters climbing / partial link set / wrong NV# map → suspect a cable, connector seating, or board fault. Reset the GPU once to retrain links, then re-read counters:¹
```
sudo nvidia-smi --gpu-reset -i 0             # retrain links (node must be drained)
nvidia-smi nvlink -e                          # counters should be clean after reset
nvidia-smi nvlink --status
```
If links stay inactive, counters climb again, or the topology map is still wrong after a reset, stop treating it as software; escalate to GPU fault / RMA. Persistent NVLink errors are a hardware RMA signal, not a config retry.
Confirm the driver/firmware floor if links will not train at all (not merely P2P-blocked). A GSP firmware / driver mismatch can leave the interconnect half-initialised (GSP firmware mismatch, kernel modules); a node that no longer enumerates GPUs is kernel GPU missing.
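
The reset branch above hinges on whether the error counters move again after the reset. A minimal sketch of that comparison follows. It assumes nothing about the counter layout: it only checks whether two raw snapshots of `nvidia-smi nvlink -e` differ. The function name, the file paths, and the five-minute interval are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only when two saved snapshots of
# `nvidia-smi nvlink -e` differ, i.e. at least one line of counters changed.
nvlink_counters_changed() {
    _rc=0
    cmp -s "$1" "$2" || _rc=$?
    [ "$_rc" -eq 1 ]        # cmp: 0 = identical, 1 = different, >1 = error
}

# Usage after the reset, with P2P traffic running between the two samples:
#   nvidia-smi nvlink -e > /tmp/nvlink-e.t0
#   sleep 300
#   nvidia-smi nvlink -e > /tmp/nvlink-e.t1
#   if nvlink_counters_changed /tmp/nvlink-e.t0 /tmp/nvlink-e.t1; then
#       diff /tmp/nvlink-e.t0 /tmp/nvlink-e.t1    # show which lines moved
#   fi
```

A raw comparison cannot tell a climbing counter from any other change in the output, so treat a difference as a prompt to read the diff, not as a verdict.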

Verification

Do not uncordon on nvidia-smi nvlink --status alone: an active link state is necessary but not sufficient. Require a real P2P + collective proof.

1. Links active and paths over NVLink. Every expected link reports active, and the topology matrix shows NV# between peers, not a PCIe fallback:¹

nvidia-smi nvlink --status
nvidia-smi topo --matrix                       # GPU pairs must read NV#, not PHB/SYS
nvidia-smi topo -p2p rw                         # P2P read+write supported across peers

2. CUDA actually does P2P. Build and run p2pBandwidthLatencyTest from NVIDIA/cuda-samples; the P2P=Enabled bandwidth matrix must be materially higher than P2P=Disabled and the latency lower; equal numbers mean P2P never engaged:⁴
```
# from a built cuda-samples tree:
./p2pBandwidthLatencyTest
```
simpleP2P is the lighter pass/fail check (Peer access ... ENABLED, verification PASSED).⁴
3. Collective busbw at the NVLink figure. all_reduce_perf from NVIDIA/nccl-tests, read busbw (not algbw), with NCCL_DEBUG=INFO confirming an NVLink transport rather than a NET/Socket / PCIe fallback:⁵
```
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
```
On an NVLink node, large-message busbw should approach the NVLink bus bandwidth. A figure stuck at the PCIe ceiling means the fabric is still degraded: re-open Procedure.
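
The topology and busbw checks above can be turned into pass/fail gates. The sketch below is illustrative only: both function names are hypothetical, the topology parser assumes a header row of `GPU<n>` column names followed by one row per GPU, and the busbw parser assumes the nccl-tests column layout in which out-of-place busbw is the eighth field of each data row. Confirm both against real output on your driver and nccl-tests versions. The bandwidth floor is a site-specific number you supply; none is implied here.

```shell
#!/bin/sh
# Hypothetical verification helpers; both parse saved command output on stdin.

# Print every GPU pair whose `nvidia-smi topo --matrix` cell is not NV#.
non_nvlink_pairs() {
    awk '
        { gsub(/\033\[[0-9;]*m/, "") }                # drop ANSI styling if present
        !hdr && $1 ~ /^GPU[0-9]+$/ {                   # header row: GPU column names
            for (i = 1; i <= NF && $i ~ /^GPU[0-9]+$/; i++) col[i] = $i
            ngpu = i - 1; hdr = 1; next
        }
        hdr && $1 ~ /^GPU[0-9]+$/ {                    # one matrix row per GPU
            r++
            for (j = r + 1; j <= ngpu; j++)            # upper triangle: each pair once
                if ($(j + 1) !~ /^NV[0-9]+$/) print $1, col[j], $(j + 1)
        }
    '
}

# Succeed only when the largest out-of-place busbw reaches the floor in $1 (GB/s).
busbw_meets_floor() {
    awk -v floor="$1" '
        $1 ~ /^[0-9]+$/ && NF >= 12 { if ($8 + 0 > max) max = $8 + 0 }
        END { printf "peak busbw %.2f GB/s\n", max + 0; exit !(max + 0 >= floor + 0) }
    '
}
```

Empty output from `nvidia-smi topo --matrix | non_nvlink_pairs` means every GPU pair reads NV#. The busbw gate reads the `all_reduce_perf` output the same way, with the floor passed as its only argument.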

Uncordon only after 1–3 pass:

kubectl uncordon "$NODE"
# Slurm: scontrol update nodename="$NODE" state=resume

Rollback

This runbook restores a degraded plane; the safe state is the node out of service, not a half-fixed fabric admitting jobs.

If a change made it worse, revert the single variable: re-enable MIG to the recorded prior layout (stale MIG state), or restore the prior ACS setting. Never stack changes: one mutation, one re-verify.
If links stay inactive or counters keep climbing after a reset, leave the node cordoned/drained and escalate to hardware; do not uncordon a node whose NVLink fabric is not fully formed:
```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Slurm: scontrol update nodename="$NODE" state=drain reason="nvlink down - escalated"
```
If the root cause was Fabric Manager or a driver upgrade, the clean fix lives in those runbooks (Fabric Manager failure, driver upgrade): re-pin the matched driver/FM/NSCQ branch rather than hand-patching NVLink state.
As a stopgap only, a job can be forced onto PCIe (NCCL_P2P_DISABLE=1, or NCCL_P2P_LEVEL scoped below NVLink) to keep work moving at reduced bandwidth while the node is repaired. This masks the fault; it does not fix it.⁵
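
If the PCIe stopgap is used, scoping the variable to a single command keeps it out of the node's default environment. A minimal sketch, assuming the job is launched from a shell; the wrapper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with NCCL direct P2P disabled,
# without exporting the variable into the calling shell.
run_without_nccl_p2p() {
    env NCCL_P2P_DISABLE=1 "$@"
}

# Usage (placeholder command line, substitute your own job):
#   run_without_nccl_p2p ./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
```

Because the variable is set only for the wrapped process, later jobs on the node pick up NVLink again once the fabric is repaired.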

Fabric Manager failure: inactive NVLink because nvidia-fabricmanager is down on an NVSwitch box (pre-check 1 branch).
stale MIG state: MIG geometry drift; MIG-on is why P2P is "missing" in pre-check 2.
GPU fault / RMA: escalation when NVLink errors persist after a reset (hardware path).
GSP firmware mismatch: links will not train because of a driver/firmware floor.
kernel GPU missing: the node no longer enumerates GPUs at all (upstream of NVLink).
driver upgrade: post-upgrade NVLink/FM regressions and the correct ordering.
NCCL hang: the collective-stall variant when the fabric is up but a collective wedges.
operational runbooks: runbook index.

References

nvidia-smi (man page) — nvlink -s/--status, -e/--errorcounters (CRC flit/data, replay, recovery, link-down), -c/--capabilities; topo -m/--matrix (NV#/PHB/SYS legend) and topo -p2p (P2P read/write/NVLink/atomics/PCIe status): https://docs.nvidia.com/deploy/nvidia-smi/
NVIDIA Fabric Manager User Guide — MIG mode disables GPU NVLinks and removes NVLink P2P; A100/HGX A100 train NVLinks off vs H100+ keep them active; P2P restoration requires FM running: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA MIG User Guide — -mig 0/1 (GPU reset on Ampere, not InfoROM-persistent on Hopper+), In use by another client reset refusal: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
NVIDIA/cuda-samples — p2pBandwidthLatencyTest (P2P enabled vs disabled bandwidth/latency matrix) and simpleP2P (peer-access pass/fail), under Samples/5_Domain_Specific/: https://github.com/NVIDIA/cuda-samples
NCCL environment variables — NCCL_DEBUG, NCCL_P2P_DISABLE, NCCL_P2P_LEVEL (NVL = use P2P over NVLink): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
NVIDIA/nccl-tests — all_reduce_perf, -b/-e/-f/-g flags, busbw vs algbw: https://github.com/NVIDIA/nccl-tests
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

¹ nvidia-smi man page: nvidia-smi nvlink -s/--status reports per-link active/inactive and rated bandwidth when active; -e/--errorcounters reports CRC (flit/data), replay, recovery and link-down counters per link; nvidia-smi topo -m/--matrix legend (NV# = NVLink with link count, PHB/SYS/NODE/PXB/PIX = PCIe paths); nvidia-smi topo -p2p shows P2P read (r)/write (w)/NVLink (n)/atomics (a)/PCIe (p) status between GPUs. https://docs.nvidia.com/deploy/nvidia-smi/
² NVIDIA Fabric Manager User Guide: "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability"; on DGX/HGX A100 the GPU and NVSwitch-side NVLinks are trained off and retrained when MIG is disabled; on DGX/HGX H100 and later NVLinks stay active during MIG mode, but restoring P2P after disabling MIG requires the FM service running. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
³ NVIDIA MIG User Guide: nvidia-smi -i <id> -mig 0/1 toggles MIG mode (a GPU reset on Ampere; not InfoROM-persistent on Hopper+, so re-assert on boot); a reset or MIG mutation is refused with "In use by another client" while a CUDA app or nvidia-smi holds the device. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
⁴ NVIDIA/cuda-samples: p2pBandwidthLatencyTest demonstrates CUDA P2P transfers between GPU pairs and prints unidirectional/bidirectional bandwidth and latency matrices for P2P=Disabled vs P2P=Enabled (equal numbers mean P2P did not engage); simpleP2P validates peer access and reports a pass/fail. Both live under Samples/5_Domain_Specific/ and build with make/cmake. https://github.com/NVIDIA/cuda-samples
⁵ NVIDIA NCCL: NCCL_DEBUG=INFO reveals the chosen transport (a NET/Socket/PCIe path where NVLink is expected is a fallback); NCCL_P2P_DISABLE=1 disables direct GPU-to-GPU P2P; NCCL_P2P_LEVEL=NVL uses P2P only when GPUs are NVLink-connected; nccl-tests all_reduce_perf reports busbw (hardware-comparable) vs algbw. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html