Runbook: NVLink visibility / P2P failure

Scope: diagnose GPUs that cannot see each other over NVLink: nvidia-smi nvlink --status shows links inactive, CUDA P2P access is disabled, and collectives silently fall back to PCIe. Broader than a Fabric Manager outage: this covers MIG-on (P2P deliberately dropped), topology mis-wiring, and NVLink link-layer errors (CRC / replay / recovery), of which a dead nvidia-fabricmanager is only one branch.

Run this when GPUs on a node refuse to talk over NVLink: nvidia-smi nvlink --status reports links inactive or fewer than the card's link count, nvidia-smi topo --matrix shows GPU pairs over PHB/SYS instead of NV#, P2P probes report CANNOT access peer, and NCCL busbw collapses to a PCIe ceiling. Severity: node-degraded. The GPUs run, but every NVLink-assuming collective pays a PCIe tax.

Reference templates on real APIs. Nothing here was executed on hardware; pin versions to your driver / CUDA branch and validate on one node before fleet use.

P2P-disabled is the symptom; the cause splits four ways and each has a different owner. A dead Fabric Manager on an NVSwitch box belongs in the Fabric Manager failure runbook. MIG mode removes NVLink P2P by design (and on Ampere also trains the links off); that is expected, not a fault. Link-layer errors (CRC / replay / recovery counters climbing) point at a cable or board fault that escalates to GPU fault / RMA. A topology that comes up with the wrong NV# map is a wiring or seating problem. Conceptual background on the interconnect is in NVSwitch & NVLink and Fabric Manager; ACS-blocked P2P is in ACS disable.

Trigger

  • nvidia-smi nvlink --status reports one or more links inactive, or a GPU shows fewer active links than its NVLink count (a partial fabric, not a clean down).[1]
  • nvidia-smi topo --matrix shows GPU↔GPU over PHB / SYS / NODE (PCIe host bridge or system path) where you expect NV# (NVLink, # = link count).[1]
  • nvidia-smi topo -p2p r (or -p2p w) shows P2P read (or write) not supported between GPU pairs that should be NVLink peers.[1]
  • A CUDA P2P probe (p2pBandwidthLatencyTest / simpleP2P) reports Peer access ... CANNOT, or P2P-enabled bandwidth equal to P2P-disabled bandwidth.[4]
  • NCCL falls back to PCIe: NCCL_DEBUG=INFO shows no NVLink path and nccl-tests busbw sits at a PCIe ceiling (NCCL-hang runbook covers the full-stall variant).
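
The first trigger (inactive or missing links) can be turned into a per-GPU count rather than read by eye. The sketch below is a hypothetical helper, `nvlink_summary`, that reads `nvidia-smi nvlink --status` output on stdin. It assumes the commonly seen layout of a `GPU <n>: ...` header followed by `Link <n>: ...` lines, with the word `inactive` marking a down link; that layout is an assumption and should be checked against your driver branch before relying on it.

```shell
# Summarise `nvidia-smi nvlink --status` output read from stdin.
# ASSUMED format (verify on your driver): a "GPU <n>: ..." header line,
# then "Link <n>: <rate> GB/s" or "Link <n>: <inactive>" lines.
nvlink_summary() {
    awk '
        /^GPU [0-9]+:/ {
            gpu = $2
            sub(/:$/, "", gpu)      # "0:" -> "0"
            order[++n] = gpu        # remember first-seen order
            next
        }
        /Link [0-9]+:/ {
            if ($0 ~ /inactive/) inactive[gpu]++
            else                 active[gpu]++
        }
        END {
            for (i = 1; i <= n; i++) {
                g = order[i]
                printf "gpu=%s active=%d inactive=%d\n", g, active[g], inactive[g]
            }
        }
    '
}
```

Usage would be `nvidia-smi nvlink --status | nvlink_summary`. A GPU reporting both active and inactive links is the partial-fabric case; compare the active count against the link count you expect for that card.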

Pre-checks

Establish which of the four branches you are in before mutating anything. NODE is the Kubernetes node name.

NODE=gpu-07.dc1.internal
  1. Is this an NVSwitch box, and is Fabric Manager healthy? On HGX/DGX baseboards and NVL72, dead FM presents exactly as inactive NVLink; that is the Fabric Manager failure runbook, not this one. Check before anything else:

    systemctl is-active nvidia-fabricmanager    # NVSwitch systems only
    
    active (or no FM because the box is PCIe-attached, not NVSwitch) → stay here. Anything else → divert to the FM runbook.

  2. Is MIG mode on? Enabling MIG removes NVLink P2P: "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability." [2] On A100/HGX A100 the NVLinks are trained off; on H100 and later they stay active but P2P across instances is still gone. Inactive NVLink (Ampere) or missing P2P under MIG is expected, not a fault:

    nvidia-smi --query-gpu=index,mig.mode.current --format=csv
    
    If MIG is Enabled and you did not expect it, that is the bug; go to stale MIG state / MIG partitioning, not the hardware path.

  3. Read link state and the topology map as ground truth:

    nvidia-smi nvlink --status                  # per-link active/inactive (+ rated BW when active)
    nvidia-smi topo --matrix                    # NV# = NVLink, PHB/SYS/NODE = PCIe fallback
    
    A GPU with some links active and some inactive is a link/board fault (Procedure step 7); a GPU with all links inactive on an NVSwitch box points back to FM (pre-check 1).

  4. Read NVLink error counters. Climbing CRC / replay / recovery / link-down counts mean a degrading physical link, not a config problem; that path ends at GPU fault / RMA [1]:

    nvidia-smi nvlink -e                         # CRC (flit/data), replay, recovery, link-down per link
    

  5. Rule out ACS. PCIe Access Control Services on the GPU/switch path silently blocks P2P translation; with ACS on, P2P is refused even when NVLink trains fine (ACS disable):

    sudo lspci -vvv | grep -i 'ACSCtl'           # SrcValid+ on the path blocks P2P
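
Pre-checks 4 and 5 both benefit from a little scripting: "climbing" only means something across two samples, and a bare grep for ACSCtl does not say which device the line belongs to. The sketch below is illustrative only. The function names are hypothetical, the snapshot paths and the sleep interval in the usage comments are placeholders, and the lspci parsing assumes the usual layout of an unindented device header with indented capability lines beneath it.

```shell
# nvlink_counters_changed BEFORE AFTER
# Compare two saved `nvidia-smi nvlink -e` snapshots.
# Returns 0 and prints the differing lines when they differ,
# 1 when they are identical, 2 when a snapshot is unreadable.
# Note: this flags ANY change, not only increases.
nvlink_counters_changed() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    if cmp -s "$1" "$2"; then
        return 1
    fi
    diff "$1" "$2" || true
    return 0
}

# acs_enabled_devices: read `lspci -vvv` output on stdin and print the
# address of each device whose ACSCtl line shows SrcValid+.
acs_enabled_devices() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }                        # device header line
        /ACSCtl:/ && index($0, "SrcValid+") { print dev }  # ACS source validation on
    '
}

# Example (placeholders, not run here):
#   nvidia-smi nvlink -e > /tmp/nvlink-e.before
#   sleep 300
#   nvidia-smi nvlink -e > /tmp/nvlink-e.after
#   nvlink_counters_changed /tmp/nvlink-e.before /tmp/nvlink-e.after
#   sudo lspci -vvv | acs_enabled_devices
```

Diffing raw snapshots avoids depending on the exact counter names, which differ between driver branches. The printed device addresses still have to be matched by hand against the bridges that sit on the GPU path.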
    

Flow

flowchart TB
    A["NVLink inactive / P2P disabled"] --> B{"NVSwitch box and FM active?"}
    B -->|"FM down"| C["Fabric Manager failure runbook"]
    B -->|"FM active or PCIe-attached"| D{"MIG mode enabled?"}
    D -->|"Yes (NVLink P2P dropped by design)"| E["Stale MIG state runbook / disable MIG if unintended"]
    D -->|"No"| F{"nvlink -e: CRC/replay/recovery climbing?"}
    F -->|"Counters clean, P2P still blocked"| G{"ACS on the GPU/NIC path?"}
    F -->|"Counters climbing / partial links"| H["Drain, GPU reset, re-read counters"]
    G -->|"ACS on"| I["Disable ACS, re-probe P2P"]
    G -->|"ACS off"| J["Suspect topology / seating: reset and retrain"]
    H -->|"Still failing after reset"| K["GPU fault / RMA"]
    H -->|"Clean after reset"| L["Verify: topo NV#, p2p test, busbw"]
    I --> L
    J --> L
    E --> L
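
The same decision order can be written as a small shell function, which is handy when the pre-check verdicts are collected by a script. This is a sketch of the flowchart above and nothing more: the function name, the argument vocabulary, and the returned labels are all hypothetical, and it only encodes the order in which the branches are tested.

```shell
# nvlink_branch FM MIG COUNTERS ACS
#   FM:       active | down | none   (none = PCIe-attached, no NVSwitch)
#   MIG:      on | off
#   COUNTERS: clean | climbing       (climbing also covers partial link sets)
#   ACS:      on | off
# Prints a label for the branch to follow (labels are illustrative).
nvlink_branch() {
    fm=$1; mig=$2; counters=$3; acs=$4
    if [ "$fm" = "down" ]; then
        echo "fabric-manager-runbook"
    elif [ "$mig" = "on" ]; then
        echo "stale-mig-runbook"
    elif [ "$counters" = "climbing" ]; then
        echo "drain-reset-recheck"
    elif [ "$acs" = "on" ]; then
        echo "disable-acs"
    else
        echo "suspect-topology-seating"
    fi
}
```

For example, `nvlink_branch active off clean on` prints `disable-acs`. Fabric Manager and MIG are tested first because both explain inactive links without any hardware fault.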

Procedure

Cordon and drain before any GPU reset or MIG/driver mutation: a reset is refused while a client holds the device, and you must not flip a live node's fabric under jobs. NODE is the Kubernetes node name (Slurm equivalent inline).

NODE=gpu-07.dc1.internal
  1. Cordon and drain so the scheduler stops placing work and running pods evict cleanly:

    kubectl cordon "$NODE"
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
    # Slurm: scontrol update nodename="$NODE" state=drain reason="nvlink p2p failure"
    

  2. Clear residual GPU clients: a reset or MIG mutation is refused with In use by another client while a CUDA app or stray nvidia-smi is attached [3]:

    nvidia-smi                                    # the "Processes" table must be empty
    sudo fuser -k /dev/nvidia*                     # last resort, then re-check nvidia-smi
    

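  3. Branch on the pre-check verdict: run whichever of steps 4 to 7 matches the cause you identified, and step 8 only if links will not train at all.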

  4. MIG was on and unintended → exit MIG mode (this is a GPU reset on Ampere), then re-read NVLink. Mode is not InfoROM-persistent on Hopper+, so re-assert intent explicitly (stale MIG state) [3]:

    sudo nvidia-smi -i 0 -mig 0
    nvidia-smi nvlink --status                   # links should return to active (Ampere)
    

  5. ACS was on → disable it on the affected root ports so P2P translation is allowed, then re-probe (ACS disable). Prefer the platform's documented path (BIOS / kernel) over a transient setpci:

    sudo lspci -vvv | grep -i 'ACSCtl'           # confirm SrcValid- after the change
    

  6. Links inactive on an NVSwitch box → this is Fabric Manager; restore the version-matched FM stack and prove NVLink there. Do not continue here: Fabric Manager failure runbook.

  7. Error counters climbing / partial link set / wrong NV# map → suspect a cable, connector seating, or board fault. Reset the GPU once to retrain links, then re-read counters [1]:

    sudo nvidia-smi --gpu-reset -i 0             # retrain links (node must be drained)
    nvidia-smi nvlink -e                          # counters should be clean after reset
    nvidia-smi nvlink --status
    
    If links stay inactive, counters climb again, or the topology map is still wrong after a reset, stop treating it as software; escalate to GPU fault / RMA. Persistent NVLink errors are a hardware RMA signal, not a config retry.

  8. Confirm the driver/firmware floor if links will not train at all (not merely P2P-blocked). A GSP firmware / driver mismatch can leave the interconnect half-initialised (GSP firmware mismatch, kernel modules); a node that no longer enumerates GPUs is kernel GPU missing.
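
Because a reset or MIG change is refused while a client holds the device, it can help to gate the mutating command on an explicit idle check instead of reading the Processes table by eye. The sketch below assumes the `--query-compute-apps` query of nvidia-smi and a hypothetical helper name, `gpu_idle`. That query lists compute processes only, so an empty result is a hint rather than proof that nothing holds /dev/nvidia*.

```shell
# gpu_idle: read the output of
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
# on stdin. Succeeds (exit 0) only when no compute client is listed;
# otherwise prints the clients and returns 1.
gpu_idle() {
    clients=$(grep '[^[:space:]]' || true)   # keep non-blank lines only
    if [ -n "$clients" ]; then
        echo "GPU still has clients:"
        echo "$clients"
        return 1
    fi
    echo "no compute clients listed"
}

# Example (not run here; GPU index 0 as in the steps above):
#   if nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader | gpu_idle; then
#       sudo nvidia-smi --gpu-reset -i 0
#   fi
```

If the check fails on a node that is already drained, the listed PIDs are the residual clients to clear in step 2 before retrying.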

Verification

Do not uncordon on nvidia-smi nvlink --status alone: an active link state is necessary but not sufficient. Require a real P2P + collective proof.

  1. Links active and paths over NVLink. Every expected link reports active, and the topology matrix shows NV# between peers, not a PCIe fallback [1]:

    nvidia-smi nvlink --status
    nvidia-smi topo --matrix                       # GPU pairs must read NV#, not PHB/SYS
    nvidia-smi topo -p2p r                          # P2P read supported across peers
    nvidia-smi topo -p2p w                          # P2P write supported across peers
    

  2. CUDA actually does P2P. Build and run p2pBandwidthLatencyTest from NVIDIA/cuda-samples; the P2P=Enabled bandwidth matrix must be materially higher than P2P=Disabled and the latency lower; equal numbers mean P2P never engaged [4]:

    # from a built cuda-samples tree:
    ./p2pBandwidthLatencyTest
    
    simpleP2P is the lighter pass/fail check (Peer access ... ENABLED, verification PASSED).[4]

  3. Collective busbw at the NVLink figure. Run all_reduce_perf from NVIDIA/nccl-tests and read busbw (not algbw), with NCCL_DEBUG=INFO confirming an NVLink transport rather than a NET/Socket / PCIe fallback [5]:

    NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N>
    
    On an NVLink node, large-message busbw should approach the NVLink bus bandwidth; a figure stuck at the PCIe ceiling means the fabric is still degraded; re-open Procedure.

  4. Uncordon only after 1–3 pass:

    kubectl uncordon "$NODE"
    # Slurm: scontrol update nodename="$NODE" state=resume
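
The busbw gate in check 3 can be made mechanical so that the uncordon decision does not rest on eyeballing a table. The sketch below is a hypothetical helper, `busbw_at_least`, that reads all_reduce_perf output on stdin and compares the out-of-place busbw of the largest message size against a floor you supply. It assumes the usual nccl-tests row layout, in which the first column is the size in bytes and the eighth is the out-of-place busbw; the floor is your own figure for the platform and is not something this runbook specifies.

```shell
# busbw_at_least MIN_GBPS
# Read nccl-tests all_reduce_perf output on stdin.
# ASSUMED row layout: size count type redop root time algbw busbw ...
# (column 1 = size in bytes, column 8 = out-of-place busbw in GB/s).
# Exit 0 if busbw at the largest size >= MIN_GBPS, 1 if below,
# 2 if no data row was found.
busbw_at_least() {
    awk -v minbw="$1" '
        $1 ~ /^[0-9]+$/ && NF >= 8 {
            if ($1 + 0 >= size) { size = $1 + 0; bw = $8; found = 1 }
        }
        END {
            if (!found) { print "no data rows found"; exit 2 }
            printf "size=%d busbw=%s GB/s (minimum %s)\n", size, bw, minbw
            exit ((bw + 0 >= minbw + 0) ? 0 : 1)
        }
    '
}

# Example (MIN_GBPS is a placeholder you set per platform):
#   NCCL_DEBUG=WARN ./build/all_reduce_perf -b 8 -e 8G -f 2 -g <N> | busbw_at_least "$MIN_GBPS"
```

The largest-size row is used rather than the averaged summary line because small messages are latency-bound and pull an average well below what the links can carry.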
    

Rollback

This runbook restores a degraded plane; the safe state is the node out of service, not a half-fixed fabric admitting jobs.

  • If a change made it worse, revert the single variable: re-enable MIG to the recorded prior layout (stale MIG state), or restore the prior ACS setting. Never stack changes: one mutation, one re-verify.
  • If links stay inactive or counters keep climbing after a reset, leave the node cordoned/drained and escalate to hardware; do not uncordon a node whose NVLink fabric is not fully formed:
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
    # Slurm: scontrol update nodename="$NODE" state=drain reason="nvlink down - escalated"
    
  • If the root cause was Fabric Manager or a driver upgrade, the clean fix lives in those runbooks (Fabric Manager failure, driver upgrade): re-pin the matched driver/FM/NSCQ branch rather than hand-patching NVLink state.
  • As a stopgap only, a job can be forced onto PCIe (NCCL_P2P_DISABLE=1, or NCCL_P2P_LEVEL scoped below NVLink) to keep work moving at reduced bandwidth while the node is repaired; this masks the fault, it does not fix it.[5]
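
Reverting "to the recorded prior layout" presumes the state was recorded before anything was changed. A minimal sketch of that recording step follows. The helper name `snapshot_cmd` and the snapshot directory are hypothetical; the commands in the usage comments are the ones already used in this runbook, plus `nvidia-smi mig -lgi` to list GPU instances.

```shell
# snapshot_cmd DIR NAME CMD [ARGS...]
# Run CMD and save its combined stdout/stderr to DIR/NAME.txt.
# Returns CMD's exit status (or 1 if DIR cannot be created).
snapshot_cmd() {
    dir=$1; name=$2
    shift 2
    mkdir -p "$dir" || return 1
    "$@" > "$dir/$name.txt" 2>&1
}

# Example, run once BEFORE the first mutation (path is a placeholder):
#   snap=/var/tmp/nvlink-$(date +%Y%m%dT%H%M%S)
#   snapshot_cmd "$snap" mig-mode nvidia-smi --query-gpu=index,mig.mode.current --format=csv
#   snapshot_cmd "$snap" mig-gi   nvidia-smi mig -lgi
#   snapshot_cmd "$snap" acs      sudo lspci -vvv
#   snapshot_cmd "$snap" nvlink   nvidia-smi nvlink --status
```

Keeping one file per command makes the "one mutation, one re-verify" rule easy to audit: re-run the same command after the change and diff it against the saved file.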

References

  • nvidia-smi (man page) — nvlink -s/--status, -e/--errorcounters (CRC flit/data, replay, recovery, link-down), -c/--capabilities; topo -m/--matrix (NV#/PHB/SYS legend) and topo -p2p (P2P read/write/NVLink/atomics/PCIe status): https://docs.nvidia.com/deploy/nvidia-smi/
  • NVIDIA Fabric Manager User Guide — MIG mode disables GPU NVLinks and removes NVLink P2P; A100/HGX A100 train NVLinks off vs H100+ keep them active; P2P restoration requires FM running: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
  • NVIDIA MIG User Guide — -mig 0/1 (GPU reset on Ampere, not InfoROM-persistent on Hopper+), In use by another client reset refusal: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
  • NVIDIA/cuda-samples — p2pBandwidthLatencyTest (P2P enabled vs disabled bandwidth/latency matrix), under Samples/5_Domain_Specific/, and simpleP2P (peer-access pass/fail), under Samples/0_Introduction/: https://github.com/NVIDIA/cuda-samples
  • NCCL environment variables — NCCL_DEBUG, NCCL_P2P_DISABLE, NCCL_P2P_LEVEL (NVL = use P2P over NVLink): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
  • NVIDIA/nccl-tests — all_reduce_perf, -b/-e/-f/-g flags, busbw vs algbw: https://github.com/NVIDIA/nccl-tests
  • kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

Related: NVSwitch & NVLink · Fabric Manager · Fabric Manager Failure · MIG Partitioning · Stale MIG State · ACS Disable · GPU Fault / RMA · NCCL Hang · Operational Runbooks · Glossary


  1. nvidia-smi man page — nvidia-smi nvlink -s/--status reports per-link active/inactive and rated bandwidth when active; -e/--errorcounters reports CRC (flit/data), replay, recovery and link-down counters per link; nvidia-smi topo -m/--matrix legend (NV# = NVLink with link count, PHB/SYS/NODE/PXB/PIX = PCIe paths); nvidia-smi topo -p2p shows P2P read (r)/write (w)/NVLink (n)/atomics (a)/PCIe (p) status between GPUs. https://docs.nvidia.com/deploy/nvidia-smi/ 

  2. NVIDIA Fabric Manager User Guide — "when you enable the MIG mode, the GPU NVLinks will be disabled (or not used) and the GPU will lose its NVLink peer-to-peer (P2P) capability"; on DGX/HGX A100 the GPU and NVSwitch-side NVLinks are trained off and retrained when MIG is disabled; on DGX/HGX H100 and later NVLinks stay active during MIG mode, but restoring P2P after disabling MIG requires the FM service running. https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html 

  3. NVIDIA MIG User Guide — nvidia-smi -i <id> -mig 0/1 toggles MIG mode (a GPU reset on Ampere; not InfoROM-persistent on Hopper+, so re-assert on boot); a reset or MIG mutation is refused with In use by another client while a CUDA app or nvidia-smi holds the device. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html 

  4. NVIDIA/cuda-samples — p2pBandwidthLatencyTest demonstrates CUDA P2P transfers between GPU pairs and prints unidirectional/bidirectional bandwidth and latency matrices for P2P=Disabled vs P2P=Enabled (equal numbers mean P2P did not engage); simpleP2P validates peer access and reports a pass/fail. p2pBandwidthLatencyTest lives under Samples/5_Domain_Specific/ and simpleP2P under Samples/0_Introduction/; both build with make/cmake. https://github.com/NVIDIA/cuda-samples

  5. NVIDIA NCCL — NCCL_DEBUG=INFO reveals the chosen transport (a NET/Socket/PCIe path where NVLink is expected is a fallback); NCCL_P2P_DISABLE=1 disables direct GPU-to-GPU P2P; NCCL_P2P_LEVEL=NVL uses P2P only when GPUs are NVLink-connected; nccl-tests all_reduce_perf reports busbw (hardware-comparable) vs algbw. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html