Runbook: PCIe / P2P bandwidth regression

Scope: diagnose H2D/D2H/P2P bandwidth far below expected, caused by either a PCIe link trained down (lower gen or width) or P2P blocked by ACS, and restore full bandwidth.

Run this when a node's host-to-device, device-to-host, or peer-to-peer copy bandwidth is well under spec, either because a PCIe link negotiated below its LnkCap (lower gen or fewer lanes) or because peer transactions are being redirected up through the Root Complex with ACS redirect on. Severity: node-degraded, not down. Jobs run but data movement is throttled, so step time, prefetch, and GPUDirect RDMA all suffer silently. There is no XID; the GPU is healthy, the path to it is not.

The commands below are reference templates built on real tools and APIs; pin versions and validate before production use. Nothing here was hardware-tested.

This is a path fault, not a device fault. Two distinct root causes share the same symptom (low copy bandwidth) and the procedure separates them: (1) the PCIe link trained down to a lower generation or narrower width (LnkSta < LnkCap), or (2) the link is full-rate but ACS P2P Request Redirect is forcing GPU-to-GPU / GPU-to-NIC traffic upstream through the Root Complex instead of straight across the switch. ACS background and the boot-time fix are in the ACS-disable service; PCIe/P2P fundamentals are in the GPU software stack and NVSwitch/NVLink; when the slow path is an NVLink fabric (not PCIe) issue, divert to the fabric-manager runbook.

Trigger

H2D/D2H bandwidth far below spec: nvbandwidth or p2pBandwidthLatencyTest reports single-digit or low-double-digit GB/s where the link should deliver ~26 GB/s (Gen4 x16) or ~50 GB/s (Gen5 x16).
P2P bandwidth collapses to host-staging numbers: peer copies land near PCIe-through-CPU rates, or nvidia-smi topo -p2p r shows P2P not supported across a pair that should have it.
lspci reports a downgraded link: current LnkSta speed/width is below the device LnkCap (e.g. LnkCap x16 8GT/s but LnkSta x8 2.5GT/s).
Regression appeared after a reseat, BIOS/firmware update, reboot, or thermal event, all of which can re-enable ACS or retrain a link low.

Pre-checks

Confirm it is a PCIe/P2P fault, not a GPU fault. No fatal XID; scan first, since a fatal XID routes to the GPU-fault runbook:
```
ssh "$NODE" 'dmesg -T | grep -i "NVRM: Xid"'
```
Confirm the GPUs enumerate at all. Missing GPUs are a different runbook (kernel/GPU-missing):
```
ssh "$NODE" 'nvidia-smi -L'
```
Establish the expected number. Record the platform's spec link (Gen4 x16 ≈ 32 GB/s raw / ~26 GB/s effective; Gen5 x16 ≈ 64 GB/s raw) and the topology's expected P2P path (benchmarking). Without a target, "low" is meaningless.
Note whether the path is PCIe or NVLink. nvidia-smi topo -m: an NV# pair that reads slow is a fabric problem → NVSwitch/NVLink, fabric-manager runbook. A PIX/PXB/PHB/SYS pair that reads slow is the PCIe path this runbook covers.
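
The raw link figures above follow from the per-lane signalling rate, the lane count, and the line encoding (8b/10b at 2.5 and 5 GT/s, 128b/130b at 8 to 32 GT/s). A minimal sketch of that arithmetic, assuming the helper name `pcie_raw_gbps` (hypothetical, not part of any tool) and `awk` on the host. It gives only the raw per-direction ceiling; the effective target still has to come from the platform's measured baseline:

```bash
# Hypothetical helper: raw per-direction PCIe bandwidth in GB/s.
#   $1 = per-lane rate in GT/s as lspci prints it (2.5, 5, 8, 16, 32)
#   $2 = lane count (1, 4, 8, 16)
# Not valid for 64 GT/s (Gen6), which uses a different encoding scheme.
pcie_raw_gbps() {
  awk -v gt="$1" -v w="$2" 'BEGIN {
    enc = (gt <= 5) ? 8 / 10 : 128 / 130   # payload bits per transferred bit
    printf "%.1f\n", gt * w * enc / 8      # bits -> bytes
  }'
}
```

`pcie_raw_gbps 16 16` prints 31.5 (Gen4 x16) and `pcie_raw_gbps 32 16` prints 63.0 (Gen5 x16), slightly under the nominal 32 and 64 GB/s because of encoding overhead. A link trained down to 2.5 GT/s x8 yields `pcie_raw_gbps 2.5 8` = 2.0, which is why a downgrade shows up as single-digit bandwidth.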

Flow

flowchart TB
    A["Low H2D/D2H/P2P bandwidth"] --> B{"XID present?"}
    B -->|"yes"| Z["GPU fault path (RMA runbook)"]
    B -->|"no"| C["Cordon + drain node"]
    C --> D["nvidia-smi topo -m / -p2p r"]
    D -->|"slow pair on NV# link"| Y["NVLink fabric path (fabric-manager runbook)"]
    D -->|"slow pair on PCIe path"| E{"LnkSta < LnkCap?"}
    E -->|"yes (trained down)"| F["Reseat / fix slot BIOS gen+width"]
    F -->|"still low after reseat+BIOS"| Z
    E -->|"no (full gen+width)"| G{"ACS redirect bit set?"}
    G -->|"yes"| H["Run disable-acs.service"]
    G -->|"no"| I["Re-measure; check NUMA/SM-vs-CE path"]
    F --> V["Verify: re-measure bandwidth"]
    H --> V
    I --> V
    V -->|"at spec"| U["Uncordon"]
    V -->|"still low"| Z

Procedure

NODE=gpu-07.dc1.internal      # the degraded node

Cordon and drain before touching the link. A setpci write, a reseat, or a retrain can momentarily disrupt the path; do not do it under live work.

Cordon and drain the node so no new work lands on it and running work clears (Kubernetes or Slurm, whichever schedules the node):

# Kubernetes
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force
# Slurm
scontrol update NodeName="$NODE" State=DRAIN Reason="pcie-p2p-bw-regression"

Read the topology and the P2P matrix. Identify which GPU pair is slow and whether the OS believes P2P is even available on it:
```
ssh "$NODE" 'nvidia-smi topo -m'           # legend: NV#=NVLink, PIX/PXB/PHB=PCIe switch path, SYS=via CPU/UPI
ssh "$NODE" 'nvidia-smi topo -p2p r'       # P2P read capability matrix; w/n/a/p for write/nvlink/atomics/pcie
```
If a pair that should be P2P-capable shows it unsupported, suspect ACS (step 5, the ACS-redirect path). If the pair traverses NV#, this is the wrong runbook; go to the fabric-manager runbook.
Check the PCIe link generation and width per GPU. Compare capability against status: a downgrade is LnkSta below LnkCap:
```
# Per-GPU PCIe BDF, then full link capability vs current status:
for bdf in $(ssh "$NODE" "nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader" \
               | sed 's/^0000//'); do
  ssh "$NODE" "sudo lspci -vvv -s $bdf | grep -E 'LnkCap:|LnkSta:'"
done
```
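LnkCap: is the maximum the port advertises (e.g. Speed 16GT/s, Width x16). LnkSta: is the live negotiated state. A LnkSta showing Speed 2.5GT/s or Width x8 against a x16 16GT/s LnkCap, or a (downgraded) annotation, is a trained-down link → step 4 (trained-down link path). Read the speed under load before concluding: GPUs commonly drop link speed at idle to save power, so a low idle speed at full width is not by itself a fault. If LnkSta == LnkCap (full gen and width), the link is healthy and the loss is P2P routing → step 5 (ACS-redirect path).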

Cross-check the live nvidia-smi view, which surfaces the current vs max generation directly:

ssh "$NODE" 'nvidia-smi -q | grep -A9 "GPU Link Info"'   # PCIe Generation Current vs Max, Link Width Current vs Max (-A9 reaches Link Width on drivers that also print Device/Host rows)
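
The LnkCap/LnkSta comparison can be scripted so that a sweep across nodes does not depend on reading the lines by eye. A minimal sketch, assuming the helper name `link_verdict` (hypothetical) and lspci's usual `Speed NGT/s, Width xN` layout. It reads the LnkCap:/LnkSta: lines of one device on stdin and does not distinguish an idle power-saving downshift from a real fault:

```bash
# Hypothetical helper: reads the LnkCap:/LnkSta: lines of ONE device on stdin
# and prints "full", "unknown", or a "trained-down" summary.
link_verdict() {
  awk '
    # Return the number that follows key ("Speed " or "Width x") in line.
    function num_after(line, key,    s) {
      s = line
      sub(".*" key, "", s)       # drop everything up to and including key
      sub(/[^0-9.].*/, "", s)    # drop everything after the number
      return s + 0
    }
    /LnkCap:/ { cap_s = num_after($0, "Speed "); cap_w = num_after($0, "Width x") }
    /LnkSta:/ { sta_s = num_after($0, "Speed "); sta_w = num_after($0, "Width x") }
    END {
      if (cap_s == 0 || cap_w == 0)
        print "unknown"          # no LnkCap line was seen
      else if (sta_s < cap_s || sta_w < cap_w)
        printf "trained-down: speed %s/%s GT/s, width x%s/x%s\n", sta_s, cap_s, sta_w, cap_w
      else
        print "full"
    }
  '
}
```

Pipe the per-device grep into it, for example `ssh "$NODE" "sudo lspci -vvv -s $bdf | grep -E 'LnkCap:|LnkSta:'" | link_verdict`. Matching on `LnkCap:` and `LnkSta:` with the colon keeps the `LnkCap2:`/`LnkSta2:` lines out of the comparison.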

Trained-down link path. A link that negotiated low is physical or firmware, in escalating order of intervention:
Rule out power/thermal throttling of the link: a hot or power-capped board can drop PCIe ASPM/gen state. Check nvidia-smi -q -d PERFORMANCE for active clock throttle reasons; a thermal event routes to the thermal-emergency runbook.
Reseat the card / riser / cable during the drain window. This is the single most common fix for a width drop (e.g. x16 → x8 from a partially seated connector); it is a physical action on a drained node, not a command.
Check the slot BIOS config: a slot pinned to Gen3 (or auto-negotiated low) in firmware caps LnkCap itself. Align the BIOS PCIe link-speed setting to the platform spec; this is a vendor BIOS change, validated on one node first.
GPU-side fault: if reseat and BIOS are correct and the link still trains low, treat the board as suspect and route to the GPU-fault runbook.
ACS-redirect path (link is full-rate but P2P is slow/unsupported). ACS P2P Request Redirect on a bridge forces peer traffic upstream, defeating GPUDirect P2P/RDMA. Read the redirect bits on the GPU/NIC bridges:
```
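# ACSCap lists what a bridge supports and is normally all +; only ACSCtl shows what is enabled.
# Any ACSCtl line still printing here has a redirect bit ON:
ssh "$NODE" "sudo lspci -vvv 2>/dev/null \
  | grep -E 'ACSCtl:.*(ReqRedir|CmpltRedir|UpstreamFwd|SrcValid)\+'"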
```
If any bridge shows a + redirect flag, ACS is the cause. Clear it by running the boot-time ACS-disable one-shot; do not hand-roll setpci outside the managed service (the ACS-disable service):
```
ssh "$NODE" 'sudo systemctl start disable-acs.service && systemctl is-active disable-acs.service'
```
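If the unit is absent, ACS was never managed on this node; install the service via bring-up rather than a one-off write so it survives the next reboot (the ACS-disable service). A oneshot unit reports active after it finishes only if it sets RemainAfterExit=yes, so treat the grep as the authoritative check: re-read it from this step; it must return empty.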
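
A bare grep shows the matching line but not which bridge it belongs to. A minimal sketch that prints the BDF of every device whose ACSCtl line still has one of those bits set, assuming the helper name `acs_on_devices` (hypothetical) and lspci's layout, where each device starts with its BDF at column 0 and capability lines are indented. It deliberately ignores ACSCap, which only lists what the bridge supports:

```bash
# Hypothetical helper: reads `lspci -vvv` output on stdin and prints the BDF
# of each device whose ACS *control* register has a redirect-related bit set.
acs_on_devices() {
  awk '
    /^[0-9a-fA-F]/ { dev = $1 }   # unindented header line: first field is the BDF
    /ACSCtl:/ && /(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd)[+]/ { print dev }
  '
}
```

Usage: `ssh "$NODE" 'sudo lspci -vvv 2>/dev/null' | acs_on_devices`. Empty output means no bridge has these bits enabled; any BDF printed is a bridge to inspect with `lspci -vvv -s <bdf>`.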

Verification

The proof is a re-measured bandwidth that meets the platform target, on the previously-slow path. Build and run NVIDIA's nvbandwidth (the maintained replacement for the removed bandwidthTest) or p2pBandwidthLatencyTest from cuda-samples:

# Host<->device copy bandwidth (copy-engine), per GPU:
ssh "$NODE" './nvbandwidth -t host_to_device_memcpy_ce'
ssh "$NODE" './nvbandwidth -t device_to_host_memcpy_ce'
# Peer-to-peer device<->device, the path ACS was throttling:
ssh "$NODE" './nvbandwidth -t device_to_device_memcpy_read_ce'
ssh "$NODE" './nvbandwidth -l'      # list all testcases by name/index

# Alternative from cuda-samples — prints a P2P=Enabled vs P2P=Disabled bandwidth matrix:
ssh "$NODE" './p2pBandwidthLatencyTest'

Pass criteria:

H2D/D2H recovers to the link's effective rate (~26 GB/s on Gen4 x16; ~50 GB/s on Gen5 x16), no longer single-digit.
P2P (P2P=Enabled) bandwidth in p2pBandwidthLatencyTest substantially exceeds the P2P=Disabled column for the previously-slow pair, proof peer traffic is taking the direct path, not host-staging.
lspci LnkSta now equals LnkCap (full gen and width) and no bridge prints a + redirect flag (re-run the step 5 grep; it comes back empty).
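
The H2D/D2H criterion reduces to a threshold check on the measured figure. A minimal sketch, assuming the helper name `bw_verdict` (hypothetical) and a 90% tolerance that is purely an illustrative default, not a value from this runbook; set it from the platform's own baseline:

```bash
# Hypothetical helper: compare a measured bandwidth against the platform target.
#   $1 = measured GB/s, $2 = target GB/s,
#   $3 = minimum acceptable percent of target (ASSUMED default: 90)
# Prints PASS or FAIL; the exit status is 0 on PASS and 1 on FAIL.
bw_verdict() {
  awk -v m="$1" -v t="$2" -v p="${3:-90}" 'BEGIN {
    if (m >= t * p / 100) { print "PASS"; exit 0 }
    print "FAIL"
    exit 1
  }'
}
```

`bw_verdict 25.1 26` prints PASS, and `bw_verdict 8.3 26` prints FAIL and returns non-zero, so the result can gate the uncordon step in a script.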

Rollback

A bandwidth regression is a fault to fix, not a change to revert; "rollback" here means safely backing out an intervention that did not help or made things worse:

If a BIOS slot change degraded the node, restore the prior BIOS PCIe setting and re-measure before proceeding.
If the link still trains low after reseat + BIOS, stop iterating in place. Route the board to the GPU-fault runbook and leave the node drained rather than ship degraded P2P.
Bake the fix in. If ACS was the cause, ensure disable-acs.service is enabled at boot so the next reboot does not regress (the ACS-disable service); a manual systemctl start does not survive a power cycle.
Uncordon only after verification passes. Never return a node to the pool on the unproven assumption it is fixed:
```
kubectl uncordon "$NODE"
# or: scontrol update NodeName="$NODE" State=RESUME
```

the ACS-disable service: the boot-time fix and full ACS rationale (the canonical P2P-redirect remediation).
the fabric-manager runbook: when the slow path is NVLink/NVSwitch, not PCIe.
the NCCL-hang runbook: ACS-off is also a precondition for GDR; a full collective stall (not just slow bandwidth) starts there.
the GPU-fault runbook: a link that will not train after reseat + BIOS is a board fault.
the kernel/GPU-missing runbook: when the GPU does not enumerate at all.
the thermal-emergency runbook: a thermal event can drop the PCIe link state.
operational runbooks: operational runbooks index.

References

lspci(8) — -vvv verbose, -s device selection, capability decode (LnkCap/LnkSta): https://man7.org/linux/man-pages/man8/lspci.8.html
NVIDIA Enterprise Support — Understanding PCIe Configuration for Maximum Performance (LnkCap vs LnkSta, downgraded links): https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance
nvidia-smi documentation — topo --matrix legend (X/SYS/NODE/PHB/PXB/PIX/NV#) and topo -p2p capability flags (r/w/n/a/p): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA nvbandwidth — H2D/D2H/D2D memcpy testcases, -l/-t: https://github.com/NVIDIA/nvbandwidth
NVIDIA cuda-samples p2pBandwidthLatencyTest (P2P Enabled vs Disabled bandwidth matrix): https://github.com/NVIDIA/cuda-samples/tree/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
PCI-SIG ACS Engineering Change Notice — ACS Control register bits; P2P Request Redirect forces peer traffic upstream: https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf
Linux kernel command-line parameters — pci=disable_acs_redir= (force ACS redirect off for P2P): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
Kubernetes — Safely Drain a Node (kubectl cordon/drain): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
Slurm scontrol — node state DRAIN/RESUME: https://slurm.schedmd.com/scontrol.html