Runbook: PCIe / P2P bandwidth regression
Scope: investigate a PCIe link trained down (lower gen/width) or P2P blocked by ACS (H2D/D2H/P2P bandwidth far below expected) and restore full bandwidth.
Run this when a node's host-to-device, device-to-host, or peer-to-peer copy bandwidth is well under spec: either a PCIe link negotiated below its LnkCap (lower gen or fewer lanes), or peer transactions are bouncing off the Root Complex because ACS redirect is on.
Severity: node-degraded, not down. Jobs run but data movement is throttled, so step time, prefetch, and GPUDirect RDMA all suffer silently. There is no XID; the GPU is healthy, the path to it is not.
Reference templates on real APIs; pin versions and validate before production use. Nothing here was hardware-tested.
This is a path fault, not a device fault. Two distinct root causes share the same symptom (low copy bandwidth) and the procedure separates them: (1) the PCIe link trained down to a lower generation or narrower width (LnkSta < LnkCap), or (2) the link is full-rate but ACS P2P Request Redirect is forcing GPU-to-GPU / GPU-to-NIC traffic upstream through the Root Complex instead of straight across the switch. ACS background and the boot-time fix are in the ACS-disable service; PCIe/P2P fundamentals are in the GPU software stack and NVSwitch/NVLink; when the slow path is an NVLink fabric (not PCIe) issue, divert to the fabric-manager runbook.
Trigger
- H2D/D2H bandwidth far below spec: nvbandwidth or p2pBandwidthLatencyTest reports single-digit or low-double-digit GB/s where the link should deliver ~26 GB/s (Gen4 x16) or ~50 GB/s (Gen5 x16).
- P2P bandwidth collapses to host-staging numbers: peer copies land near PCIe-through-CPU rates, or nvidia-smi topo -p2p r shows P2P not supported across a pair that should have it.
- lspci reports a downgraded link: current LnkSta speed/width is below the device LnkCap (e.g. LnkCap x16 8GT/s but LnkSta x8 2.5GT/s).
- Regression appeared after a reseat, BIOS/firmware update, reboot, or thermal event, all of which can re-enable ACS or retrain a link low.
Pre-checks
- Confirm it is a PCIe/P2P fault, not a GPU fault. Scan the kernel log for XIDs first: a fatal XID routes to the GPU-fault runbook, and this runbook applies only when none is present.
- Confirm the GPUs enumerate at all. Missing GPUs are a different runbook (kernel/GPU-missing).
- Establish the expected number. Record the platform's spec link (Gen4 x16 ≈ 32 GB/s raw / ~26 GB/s effective; Gen5 x16 ≈ 64 GB/s raw) and the topology's expected P2P path (benchmarking). Without a target, "low" is meaningless.
- Note whether the path is PCIe or NVLink. Run nvidia-smi topo -m: an NV# pair that reads slow is a fabric problem → NVSwitch/NVLink, fabric-manager runbook. A PIX/PXB/PHB/SYS pair that reads slow is the PCIe path this runbook covers.
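The first two pre-checks can be scripted as plain text filters. This is a minimal sketch, not a tested tool: xid_lines and gpu_count are hypothetical helper names, NODE is an assumed hostname variable, and the sketch assumes XIDs appear in the kernel log as "NVRM: Xid" lines and that nvidia-smi -L prints one "GPU N:" line per device.

```shell
# Pre-check helpers (hypothetical names). Both read text on stdin,
# so they can be exercised without a GPU present.

# Print kernel-log lines that carry an NVIDIA Xid; prints nothing on a clean log.
xid_lines() {
  grep -E 'NVRM: Xid' || true
}

# Count the GPUs the driver enumerates, given `nvidia-smi -L` output.
gpu_count() {
  grep -c '^GPU [0-9]' || true
}

# Usage on a live node (assumes NODE is set; not executed here):
#   ssh "$NODE" 'sudo dmesg' | xid_lines        # any output: check the GPU-fault runbook first
#   ssh "$NODE" 'nvidia-smi -L' | gpu_count     # compare with the expected GPU count
```

Whether a given XID counts as fatal is decided by the GPU-fault runbook; the filter only surfaces the lines.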
Flow
flowchart TB
A["Low H2D/D2H/P2P bandwidth"] --> B{"XID present?"}
B -->|"yes"| Z["GPU fault path (RMA runbook)"]
B -->|"no"| C["Cordon + drain node"]
C --> D["nvidia-smi topo -m / -p2p r"]
D -->|"slow pair on NV# link"| Y["NVLink fabric path (fabric-manager runbook)"]
D -->|"slow pair on PCIe path"| E{"LnkSta < LnkCap?"}
E -->|"yes (trained down)"| F["Reseat / fix slot BIOS gen+width"]
F -->|"still low after reseat+BIOS"| Z
E -->|"no (full gen+width)"| G{"ACS redirect bit set?"}
G -->|"yes"| H["Run disable-acs.service"]
G -->|"no"| I["Re-measure; check NUMA/SM-vs-CE path"]
F --> V["Verify: re-measure bandwidth"]
H --> V
I --> V
V -->|"at spec"| U["Uncordon"]
V -->|"still low"| Z
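The flow above can also be read as a small decision function. The sketch below is illustrative only: next_step and its four arguments are hypothetical names, not part of any tool, and the strings it prints simply restate the flowchart's boxes.

```shell
# next_step <xid: yes|no> <path: nvlink|pcie> <link: downgraded|full> <acs: on|off>
# Prints the branch of the flowchart that applies (hypothetical helper).
next_step() {
  if [ "$1" = "yes" ]; then
    echo "GPU-fault runbook"
  elif [ "$2" = "nvlink" ]; then
    echo "fabric-manager runbook"
  elif [ "$3" = "downgraded" ]; then
    echo "reseat / fix slot BIOS, then verify"
  elif [ "$4" = "on" ]; then
    echo "run disable-acs.service, then verify"
  else
    echo "re-measure; check NUMA and SM-vs-CE path"
  fi
}

# Example: no XID, PCIe path, full-rate link, ACS redirect still on.
next_step no pcie full on
```

The order of the tests mirrors the flowchart: an XID or an NVLink path exits this runbook before any link or ACS check is considered.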
Procedure
Cordon and drain before touching the link. A setpci write, a reseat, or a retrain can momentarily disrupt the path; do not do it under live work.
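A cordon-and-drain wrapper for either scheduler might look like the sketch below. It is dry-run by default and prints the commands instead of running them. The names run and cordon_and_drain, the node name, and the Slurm reason string are assumptions for illustration; the kubectl and scontrol invocations are the standard ones, but check drain flags against your cluster's policy before setting DRY_RUN=0.

```shell
# Print the command when DRY_RUN is 1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# cordon_and_drain <k8s|slurm> <node>   (hypothetical helper)
cordon_and_drain() {
  local sched="$1" node="$2"
  case "$sched" in
    k8s)
      run kubectl cordon "$node"
      run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
      ;;
    slurm)
      # Slurm requires a Reason when draining a node.
      run scontrol update "NodeName=$node" State=DRAIN "Reason=pcie-p2p-bandwidth-regression"
      ;;
    *)
      echo "unknown scheduler: $sched" >&2
      return 2
      ;;
  esac
}

# Example (dry run): cordon_and_drain k8s gpu-node-01
```

Keeping the dry run as the default makes it harder to drain the wrong node by accident; review the printed commands, then rerun with DRY_RUN=0.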
1. Cordon and drain the node so no new work lands on it and running work clears, using whichever scheduler owns the node (Kubernetes or Slurm).
2. Read the topology and the P2P matrix. Identify which GPU pair is slow and whether the driver reports P2P as available on it:

ssh "$NODE" 'nvidia-smi topo -m'      # legend: NV#=NVLink, PIX/PXB=PCIe bridges, PHB=PCIe host bridge, SYS=across NUMA nodes via the CPU interconnect
ssh "$NODE" 'nvidia-smi topo -p2p r'  # P2P read capability matrix; w/n/a/p for write/nvlink/atomics/pcie

If a pair that should be P2P-capable shows it unsupported, suspect ACS (step 5). If the pair traverses NV#, this is the wrong runbook; go to the fabric-manager runbook.
3. Check the PCIe link generation and width per GPU. Compare capability against status: a downgrade is LnkSta below LnkCap:

# Per-GPU PCIe BDF, then full link capability vs current status.
# nvidia-smi prints an 8-digit PCI domain; sed trims it to the 4 digits lspci -s accepts.
for bdf in $(ssh "$NODE" "nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader" \
  | sed 's/^0000//'); do
  ssh "$NODE" "sudo lspci -vvv -s $bdf | grep -E 'LnkCap:|LnkSta:'"
done

LnkCap: is the maximum the device advertises (e.g. Speed 16GT/s, Width x16). LnkSta: is the live negotiated state. A LnkSta showing Speed 2.5GT/s or Width x8 against a x16 16GT/s LnkCap, or a (downgraded) annotation, is a trained-down link → step 4. If LnkSta == LnkCap (full gen and width), the link is healthy and the loss is P2P routing → step 5. Read the speed while the GPU is under load: many NVIDIA GPUs lower link speed (not width) at idle to save power, so an idle 2.5GT/s reading alone is not proof of a fault.

Cross-check the live nvidia-smi view, which reports current vs max generation and width directly:

# PCIe generation and link width, current vs max, one row per GPU:
ssh "$NODE" 'nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv'
4. Trained-down link path. A link that negotiated low has a physical or firmware cause. Work through these in escalating order of intervention:
   - Confirm the link is not being held down by power or thermal throttling: a hot or power-capped board can drop PCIe ASPM/gen state. Check nvidia-smi -q -d PERFORMANCE for active clock throttle reasons (labelled Clocks Event Reasons on newer drivers); a thermal event routes to the thermal-emergency runbook.
   - Reseat the card / riser / cable during the drain window. This is the single most common fix for a width drop (e.g. x16 → x8 from a partially seated connector). It is a physical action on a drained node, not a command.
   - Check the slot BIOS config: a slot pinned to Gen3 (or auto-negotiated low) in firmware caps LnkCap itself. Align the BIOS PCIe link-speed setting to the platform spec; this is a vendor BIOS change, validated on one node first.
   - GPU-side fault: if the card is reseated, the BIOS setting is correct, and the link still trains low, treat the board as suspect and route to the GPU-fault runbook.
5. ACS-redirect path (link is full-rate but P2P is slow/unsupported). ACS P2P Request Redirect on a bridge forces peer traffic upstream, defeating GPUDirect P2P/RDMA. Read the redirect bits on the GPU/NIC bridges:

# Any line printed here is a bridge with an ACS control bit ON.
# Match ACSCtl only: ACSCap lists supported bits and shows + even when ACS is off.
ssh "$NODE" "sudo lspci -vvv 2>/dev/null \
  | grep -E 'ACSCtl:.*(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd)\+'"

If any bridge shows a + flag, ACS is the cause. Clear it by running the boot-time ACS-disable one-shot, disable-acs.service; do not hand-roll setpci outside the managed service (the ACS-disable service). If the unit is absent, ACS was never managed on this node; install the service via bring-up rather than a one-off write so it survives the next reboot (the ACS-disable service). Re-run the grep from this step; it must return empty.
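The split between steps 4 and 5 comes down to two text checks on lspci -vvv output, sketched below as stdin filters so they can be tried on captured output. The function names are hypothetical, and the parsing assumes the usual pciutils line shapes ("LnkCap: ... Speed 16GT/s, Width x16", "ACSCtl: SrcValid- ..."), which can differ between pciutils versions.

```shell
# Extract "speed width" (e.g. "16 16") from the line labelled $1 (LnkCap or LnkSta).
link_fields() {
  sed -nE "/$1:/s/.*Speed ([0-9.]+)GT\/s.*Width x([0-9]+).*/\1 \2/p"
}

# classify_link: lspci -vvv text for ONE device on stdin.
# Prints "downgraded", "full", or "unknown" when either line is missing.
classify_link() {
  local text cap sta
  text="$(cat)"
  cap="$(printf '%s\n' "$text" | link_fields LnkCap)"
  sta="$(printf '%s\n' "$text" | link_fields LnkSta)"
  if [ -z "$cap" ] || [ -z "$sta" ]; then
    echo "unknown"
  elif [ "$cap" = "$sta" ]; then
    echo "full"
  else
    echo "downgraded"
  fi
}

# Print ACSCtl lines that still have a control bit enabled; empty output means ACS is off.
acs_enabled_lines() {
  grep -E 'ACSCtl:.*(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd)\+' || true
}

# Usage on a live node (assumes NODE and a GPU BDF in $bdf; not executed here):
#   ssh "$NODE" "sudo lspci -vvv -s $bdf" | classify_link   # downgraded: step 4
#   ssh "$NODE" 'sudo lspci -vvv' | acs_enabled_lines       # non-empty: step 5
#   ssh "$NODE" 'sudo systemctl start disable-acs.service'  # then re-run the ACS check
```

A "downgraded" result at idle should be re-checked under load before acting on it, since link speed can drop at idle for power saving.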
Verification
The proof is a re-measured bandwidth that meets the platform target, on the previously-slow path. Build and run NVIDIA's nvbandwidth (the maintained replacement for the removed bandwidthTest) or p2pBandwidthLatencyTest from cuda-samples:
# Host<->device copy bandwidth (copy-engine), per GPU:
ssh "$NODE" './nvbandwidth -t host_to_device_memcpy_ce'
ssh "$NODE" './nvbandwidth -t device_to_host_memcpy_ce'
# Peer-to-peer device<->device, the path ACS was throttling:
ssh "$NODE" './nvbandwidth -t device_to_device_memcpy_read_ce'
ssh "$NODE" './nvbandwidth -l' # list all testcases by name/index
# Alternative from cuda-samples — prints a P2P=Enabled vs P2P=Disabled bandwidth matrix:
ssh "$NODE" './p2pBandwidthLatencyTest'
Pass criteria:
- H2D/D2H recovers to the link's effective rate (~26 GB/s on Gen4 x16; ~50 GB/s on Gen5 x16), no longer single-digit.
- P2P (P2P=Enabled) bandwidth in p2pBandwidthLatencyTest substantially exceeds the P2P=Disabled column for the previously-slow pair, proof that peer traffic is taking the direct path, not host-staging.
- lspci LnkSta now equals LnkCap (full gen and width) and no bridge prints a + redirect flag (re-run the step 5 grep; it comes back empty).
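The raw figures behind these targets follow from transfer rate × lanes × line-encoding efficiency ÷ 8 bits per byte. The sketch below assumes PCIe Gen1 to Gen5 encodings (8b/10b up to 5 GT/s, 128b/130b above) and uses hypothetical helper names. The effective targets (~26 and ~50 GB/s) are this runbook's own numbers, not outputs of the formula, because protocol overhead takes a further share.

```shell
# pcie_raw_gbs <GT/s> <lanes>: theoretical one-direction GB/s after line encoding.
pcie_raw_gbs() {
  awk -v gts="$1" -v lanes="$2" 'BEGIN {
    # Gen1/2 (2.5 and 5 GT/s) use 8b/10b; Gen3 to Gen5 use 128b/130b.
    enc = (gts + 0 <= 5) ? 8 / 10 : 128 / 130
    printf "%.1f\n", gts * lanes * enc / 8
  }'
}

# meets_target <measured GB/s> <target GB/s>: exit status 0 when measured >= target.
meets_target() {
  awk -v m="$1" -v t="$2" 'BEGIN { exit !(m + 0 >= t + 0) }'
}

pcie_raw_gbs 16 16    # Gen4 x16
pcie_raw_gbs 32 16    # Gen5 x16
pcie_raw_gbs 2.5 8    # a link trained down to Gen1 x8
meets_target 8.9 26 || echo "below target: keep the node drained"
```

The trained-down example shows why the symptom is so visible: a Gen1 x8 link tops out at a small fraction of the Gen4 x16 figure before any protocol overhead.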
Rollback
A bandwidth regression is a fault to fix, not a change to revert; "rollback" here means safely backing out an intervention that did not help or made things worse:
- If a BIOS slot change degraded the node, restore the prior BIOS PCIe setting and re-measure before proceeding.
- If the link still trains low after reseat + BIOS, stop iterating in place. Route the board to the GPU-fault runbook and leave the node drained rather than ship degraded P2P.
- Bake the fix in. If ACS was the cause, ensure disable-acs.service is enabled at boot so the next reboot does not regress (the ACS-disable service); a manual systemctl start does not survive a power cycle.
- Uncordon only after verification passes. Never return a node to the pool on the unproven assumption it is fixed.
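The last two items can be sketched as a guarded return-to-service step. This is a dry-run sketch under assumptions: run, bake_acs_fix, and return_node are hypothetical names, the "verified" argument stands in for whatever record of the verification result you keep, and the unit name disable-acs.service is taken from this runbook.

```shell
# Print the command when DRY_RUN is 1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Enable the ACS-disable unit for future boots and start it now.
bake_acs_fix() {
  run sudo systemctl enable --now disable-acs.service
}

# return_node <k8s|slurm> <node> <verified: yes|no>
# Refuses to uncordon unless verification is explicitly marked as passed.
return_node() {
  local sched="$1" node="$2" verified="$3"
  if [ "$verified" != "yes" ]; then
    echo "refusing to uncordon $node: verification has not passed" >&2
    return 1
  fi
  case "$sched" in
    k8s)   run kubectl uncordon "$node" ;;
    slurm) run scontrol update "NodeName=$node" State=RESUME ;;
    *)     echo "unknown scheduler: $sched" >&2; return 2 ;;
  esac
}

# Example (dry run): bake_acs_fix && return_node slurm gpu-node-01 yes
```

Making the verification result an explicit argument keeps the uncordon from happening by habit at the end of a maintenance window.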
Related runbooks
- the ACS-disable service: the boot-time fix and full ACS rationale (the canonical P2P-redirect remediation).
- the fabric-manager runbook: when the slow path is NVLink/NVSwitch, not PCIe.
- the NCCL-hang runbook: ACS-off is also a precondition for GDR; a full collective stall (not just slow bandwidth) starts there.
- the GPU-fault runbook: a link that will not train after reseat + BIOS is a board fault.
- the kernel/GPU-missing runbook: when the GPU does not enumerate at all.
- the thermal-emergency runbook: a thermal event can drop the PCIe link state.
- operational runbooks: operational runbooks index.
References
- lspci(8): -vvv verbose, -s device selection, capability decode (LnkCap/LnkSta): https://man7.org/linux/man-pages/man8/lspci.8.html
- NVIDIA Enterprise Support, Understanding PCIe Configuration for Maximum Performance (LnkCap vs LnkSta, downgraded links): https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance
- nvidia-smi documentation: topo --matrix legend (X/SYS/NODE/PHB/PXB/PIX/NV#) and topo -p2p capability flags (r/w/n/a/p): https://docs.nvidia.com/deploy/nvidia-smi/index.html
- NVIDIA nvbandwidth: H2D/D2H/D2D memcpy testcases, -l/-t: https://github.com/NVIDIA/nvbandwidth
- NVIDIA cuda-samples p2pBandwidthLatencyTest (P2P Enabled vs Disabled bandwidth matrix): https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
- PCI-SIG ACS Engineering Change Notice: ACS Control register bits; P2P Request Redirect forces peer traffic upstream: https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf
- Linux kernel command-line parameters: pci=disable_acs_redir= (force ACS redirect off for P2P): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
- Kubernetes, Safely Drain a Node (kubectl cordon/drain): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- Slurm scontrol: node state DRAIN/RESUME: https://slurm.schedmd.com/scontrol.html