PCIe ACS-disable service

Scope: a systemd one-shot that clears the PCIe ACS Control redirect bits on every bridge at boot (setpci loop) so GPU/NIC peer-to-peer (GPUDirect P2P/RDMA) is not forced upstream through the Root Complex. Covers why ACS breaks P2P, why the service must run on every boot, the isolation trade-off, and how to verify the result with the ACSCtl line of lspci -vvv.

This is the runtime ACS handling that role: base_tuning explicitly defers. base_tuning does the boot-time GRUB/IOMMU groundwork but does not touch ACS. Mainline kernels do provide pci=disable_acs_redir=, but it acts only on devices named explicitly on the command line,² whereas this service walks every bridge at runtime. It is one unit in the site.yml from the bring-up hub, and it expands the disable-acs.sh sketch on that hub page into a full idempotent role. It does not contradict the hub; it is the hub's service, fleshed out. The performance rationale (ACS as a quiet, high-impact bandwidth killer) is in performance optimization; the fabric context is networking fabric.

Reference template, drawn from the PCI-SIG ACS specification, the Linux kernel PCI parameter docs, NVIDIA GPUDirect/AI-Enterprise docs, and pciutils/Ansible module docs. Nothing here was executed on hardware, so verify the register write against the specific PCIe bridges on your platform and validate on one node before a fleet roll.

```mermaid
flowchart LR
  BOOT["Boot / reboot"] --> ENUM["Enumerate PCI bridges (lspci -D)"]
  ENUM --> CAP["Has ACS capability?"]
  CAP -->|"no"| SKIP["Skip device"]
  CAP -->|"yes"| CLEAR["Clear ACS Control redirect bits (setpci ECAP_ACS+0x6.w)"]
  CLEAR --> VERIFY["lspci -vvv: redirect bits are '-'"]
  VERIFY --> P2P["GPU <-> NIC P2P / GPUDirect not redirected upstream"]
```

What it does

PCIe Access Control Services (ACS) is a switch/Root-Port feature that controls whether peer-to-peer transactions are allowed to flow directly between downstream ports or are redirected upstream toward the Root Complex (so an IOMMU above can enforce isolation). The Linux PCI maintainers state it plainly: "In order to support P2P traffic on a segment of the PCI hierarchy, we must be able to disable the ACS redirect bits for select PCI bridges."² When the ACS P2P Request Redirect bit is set on a Root Port, the PCI-SIG specification requires that "peer-to-peer Requests ... be redirected" upstream rather than routed directly to the peer.¹ That redirection is exactly what defeats GPUDirect P2P and GPUDirect RDMA: a GPU-to-GPU or GPU-to-NIC DMA that should hop across a PCIe switch instead bounces off the Root Complex, collapsing or disabling the direct path.

This service does one thing, idempotently, on every boot: it walks every PCIe bridge, and for any bridge that exposes an ACS capability it writes the ACS Control register so the redirect-relevant bits are off. The ACS Control bits, by bit position, are:¹

bit 0: ACS Source Validation (SrcValid)
bit 1: ACS Translation Blocking (TransBlk)
bit 2: ACS P2P Request Redirect (ReqRedir)
bit 3: ACS P2P Completion Redirect (CmpltRedir)
bit 4: ACS Upstream Forwarding (UpstreamFwd)
bit 5: ACS P2P Egress Control (EgressCtrl)
bit 6: ACS Direct Translated P2P (DirectTrans)

Clearing SrcValid/ReqRedir/CmpltRedir/UpstreamFwd (and DirectTrans) on the bridge stops upstream redirection so peer transactions take the direct path; TransBlk and EgressCtrl are left as found under the default mask. This is the runtime counterpart of the kernel's pci=disable_acs_redir= parameter, documented as: "Each device specified will have the PCI ACS redirect capabilities forced off which will allow P2P traffic between devices through bridges without forcing it upstream."²
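To make the bit positions concrete, the sketch below decodes a 16-bit ACS Control word into the flag string that lspci prints on its ACSCtl line. `decode_acsctl` is a hypothetical helper for illustration only and is not part of the role; it assumes the seven bit positions listed above and ignores the upper bits of the word.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of the role): turn an ACS Control word,
# given as hex digits without a 0x prefix, into lspci-style flags.
decode_acsctl() {
  local word=$(( 0x$1 ))
  local out="" bit=0 name
  # Names are in bit order 0..6, matching the list above.
  for name in SrcValid TransBlk ReqRedir CmpltRedir UpstreamFwd EgressCtrl DirectTrans; do
    if [ $(( (word >> bit) & 1 )) -eq 1 ]; then
      out="$out$name+ "
    else
      out="$out$name- "
    fi
    bit=$(( bit + 1 ))
  done
  printf '%s\n' "${out% }"   # drop the trailing space
}

decode_acsctl 005d   # bits 0,2,3,4,6 set
decode_acsctl 0000   # everything off
```

The first call prints `SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+`, which is the set of bits the default mask targets; the second prints every flag with `-`.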

Why a boot-time one-shot and not a one-off command: ACS Control is not retained across resets or reboots. NVIDIA's own VM-configuration guidance notes these settings "are not retained across resets / reboots," so they must be re-applied after every power-on.³ A systemd one-shot with RemainAfterExit=yes, ordered before the workload/fabric services and enabled at boot, re-applies the register every time the node comes up. Re-running converges (a bridge already cleared stays cleared), so the unit and the Ansible role are both idempotent.

Security trade-off (read this before fleet-wide rollout). ACS is an isolation control. Disabling the redirect bits removes the guarantee that peer transactions are mediated by the IOMMU above the bridge; the kernel documentation warns that forcing ACS redirect off "removes isolation between devices and will make the IOMMU groups less granular,"² which means devices that were in separate IOMMU groups may collapse into one. On a single-tenant GPU node dedicated to one trusted workload this is the accepted, standard trade-off for P2P bandwidth. On a node that passes GPUs/NICs through to untrusted VMs or multiple tenants, weakening ACS weakens VFIO isolation between those devices, so do not blanket-disable there. For the broader tenancy model this feeds into, see security and multi-tenancy (note: that page covers the tenant-isolation tiers, not ACS/VFIO specifics).

A VM/passthrough topology that needs both isolation and P2P performance is a different configuration: it keeps ACS and instead enables PCIe ATS on the NIC plus DirectTrans so address-translated peer traffic is allowed directly.¹ That is out of scope for this bare-metal service; verify against the NVIDIA AI Enterprise VM guide.³

A separate, orthogonal requirement for GPUDirect RDMA: the IOMMU must not perform non-identity translation. NVIDIA: GPUDirect RDMA "relies upon all physical addresses being the same from the different PCI devices' point of view. This makes it incompatible with IOMMUs performing any form of translation other than 1:1, hence they must be disabled or configured for pass-through translation."⁴ That passthrough (iommu=pt) is set by role: base_tuning on the GRUB cmdline, not here.
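A quick read-only way to see whether the running kernel was booted with the passthrough flag is to inspect /proc/cmdline. This is a minimal sketch: `has_iommu_pt` is a hypothetical helper, and a missing flag does not by itself prove a problem, since the IOMMU may be disabled outright instead.

```shell
#!/usr/bin/env bash
# has_iommu_pt CMDLINE -> returns 0 if the kernel command line carries iommu=pt
# as a whole word (so iommu=ptx or xiommu=pt do not match).
has_iommu_pt() {
  case " $1 " in
    *" iommu=pt "*) return 0 ;;
    *) return 1 ;;
  esac
}

if [ -r /proc/cmdline ]; then
  if has_iommu_pt "$(cat /proc/cmdline)"; then
    echo "iommu=pt present on the running kernel command line"
  else
    echo "iommu=pt not found; check whether the IOMMU is disabled or base_tuning has not been applied"
  fi
fi
```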

Variables

Role defaults live in roles/acs_disable/defaults/main.yml. This service has no dependency on the per-tier inventory keys (gpu_tier, nvidia_nvswitch); ACS-on-bridge behaviour is a property of the PCIe topology, not the GPU class, so the role runs on any GPU host that uses PCIe peer-to-peer. The one decision is the bit mask.

| Variable | Scope | Default | Purpose |
| --- | --- | --- | --- |
| `acs_disable_enabled` | role | `true` | Master switch. Set `false` on multi-tenant/VFIO-passthrough nodes where ACS isolation must be preserved (see security trade-off). |
| `acs_script_path` | role | `/usr/local/sbin/disable-acs.sh` | Absolute path of the installed boot script. |
| `acs_unit_name` | role | `disable-acs.service` | systemd unit filename. |
| `acs_redir_mask` | role | `0x5d` | Bits cleared from the ACS Control word: `SrcValid`(0) \| `ReqRedir`(2) \| `CmpltRedir`(3) \| `UpstreamFwd`(4) \| `DirectTrans`(6). The script writes `current & ~mask`. These are the same bit positions {0,2,3,4,6} that NVIDIA references via `ECAP_ACS+0x6.w`, but the operation differs: NVIDIA's appendix assigns `0x5d` as a literal value (`ECAP_ACS+0x6.w=0x5D`, which sets those bits), whereas this role treats `0x5d` as a clear-mask. The goal here is to turn the redirect-relevant bits off, so do not copy NVIDIA's literal assignment into this mask field. Use `0xffff` to clear the entire Control word. |
| `acs_before_units` | role | `["kubelet.service", "slurmd.service"]` | Units this one-shot must run before, so ACS is cleared prior to any workload that assumes P2P. Trim to the schedulers actually present on the node. |

```yaml
# roles/acs_disable/defaults/main.yml
acs_disable_enabled: true
acs_script_path: /usr/local/sbin/disable-acs.sh
acs_unit_name: disable-acs.service
acs_redir_mask: "0x5d"          # SrcValid|ReqRedir|CmpltRedir|UpstreamFwd|DirectTrans
acs_before_units:
  - kubelet.service
  - slurmd.service
```
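Because `acs_redir_mask` is a clear-mask rather than a value to write, it can help to work the `current & ~mask` arithmetic by hand. The sketch below does that for a few masks; `clear_bits` is a hypothetical helper and the starting word `007f` (all seven bits set) is made up for illustration.

```shell
#!/usr/bin/env bash
# clear_bits CUR_HEX MASK -> the Control word the script would write,
# i.e. CUR & ~MASK, printed as four lowercase hex digits.
clear_bits() {
  printf '%04x\n' $(( 0x$1 & ~$2 & 0xffff ))
}

clear_bits 007f 0x5d     # role default: TransBlk and EgressCtrl survive -> 0022
clear_bits 007f 0x1d     # hub sketch mask: DirectTrans survives too     -> 0062
clear_bits 007f 0xffff   # whole word cleared                            -> 0000
clear_bits 0022 0x5d     # already converged: value is unchanged         -> 0022
```

By contrast, writing `0x5d` as a literal value would leave the word at `005d`, with exactly those five bits set.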

Tasks

The role installs two files (the boot script and the unit that runs it), then enables the unit and triggers it once so the running boot is fixed without waiting for a reboot. The setpci loop, the lspci -D bridge enumeration, and the ECAP_ACS+0x6.w register target generalise the hub's disable-acs.sh;⁷ the mask is parameterised, and the role's default mask (0x5d) differs from the hub sketch's hard-coded ~0x1d: 0x5d additionally clears DirectTrans (bit 6), so the cleared-bit set is wider than the hub's, not identical. ECAP_ACS is a named extended capability setpci resolves to the ACS structure base; +0x6.w selects the 2-byte Control register; the .w suffix is the documented 16-bit width specifier.⁵

```yaml
# roles/acs_disable/tasks/main.yml
- name: Install the ACS-disable boot script
  ansible.builtin.copy:
    dest: "{{ acs_script_path }}"
    mode: "0755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      # Clear PCIe ACS Control redirect bits on every bridge so GPU<->NIC
      # peer-to-peer (GPUDirect P2P/RDMA) is not redirected upstream.
      # ACS Control is not retained across reboot; this runs at every boot.
      set -euo pipefail
      MASK={{ acs_redir_mask }}
      for bdf in $(lspci -D | awk '/PCI bridge/ {print $1}'); do
        # ECAP_ACS+0x6.w = ACS Control register (16-bit). Skip bridges with no ACS cap.
        cur=$(setpci -s "$bdf" ECAP_ACS+0x6.w 2>/dev/null) || continue
        new=$(printf '%04x' $(( 0x$cur & ~MASK )))
        [ "$new" = "$cur" ] && continue
        setpci -s "$bdf" ECAP_ACS+0x6.w="$new"
        echo "ACSCTL_CHANGED $bdf $cur->$new"
      done
  notify: reload systemd

- name: Install the ACS-disable systemd unit (one-shot, before workloads)
  ansible.builtin.copy:
    dest: "/etc/systemd/system/{{ acs_unit_name }}"
    mode: "0644"
    owner: root
    group: root
    content: |
      [Unit]
      Description=Clear PCIe ACS redirect bits for GPUDirect P2P/RDMA
      DefaultDependencies=no
      After=sysinit.target
      Before=basic.target {{ acs_before_units | join(' ') }}
      ConditionPathExists={{ acs_script_path }}
      [Service]
      Type=oneshot
      ExecStart={{ acs_script_path }}
      RemainAfterExit=yes
      [Install]
      WantedBy=multi-user.target
  notify: reload systemd

- name: Enable the ACS-disable unit at boot
  ansible.builtin.systemd_service:
    name: "{{ acs_unit_name }}"
    enabled: "{{ acs_disable_enabled }}"
    daemon_reload: true

- name: Apply ACS-disable now on the running boot
  ansible.builtin.command: "{{ acs_script_path }}"
  register: acs_apply
  changed_when: "'ACSCTL_CHANGED' in acs_apply.stdout"   # see note: script is silent on no-op
  when: acs_disable_enabled | bool
```

```yaml
# roles/acs_disable/handlers/main.yml
- name: reload systemd
  ansible.builtin.systemd_service:
    daemon_reload: true
```

Idempotency notes. ansible.builtin.copy is convergent: a script or unit whose content already matches reports ok, not changed, so reload systemd fires only when a file actually changed. The setpci write is itself idempotent: the loop reads the current Control word, computes current & ~mask, and skips the write when the value is unchanged ([ "$new" = "$cur" ]), so a bridge already cleared is left alone and a re-run mutates nothing. The "apply now" task exists because enabling a unit does not run it; running the script directly fixes the live boot without a reboot. The script emits ACSCTL_CHANGED only when a register write happened, so the changed_when reflects real mutation and a second converged run reports ok. The unit uses DefaultDependencies=no + explicit After=sysinit.target / Before=basic.target so it lands early in boot, before the GPU schedulers in acs_before_units; Before= is an ordering directive only, it does not pull those units in.⁶
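The convergence claim can be exercised without hardware. The sketch below is a simulation only: a temporary directory stands in for PCI config space, with one file per made-up bridge address holding its Control word, and the loop body mirrors the role's read, mask, compare, write sequence with file I/O in place of setpci. The addresses and starting values are invented for illustration.

```shell
#!/usr/bin/env bash
# Simulation only: files under $REGS play the role of ACS Control registers.
MASK=0x5d
REGS=$(mktemp -d)
printf '001d\n' > "$REGS/0000:3a:00.0"   # redirect bits set
printf '0000\n' > "$REGS/0000:5e:00.0"   # already clear
printf '005f\n' > "$REGS/0000:85:00.0"   # redirect bits plus TransBlk

# Same sequence as the boot script: read, compute cur & ~mask, skip if equal.
converge() {
  local reg cur new
  for reg in "$REGS"/*; do
    cur=$(cat "$reg")
    new=$(printf '%04x' $(( 0x$cur & ~$MASK )))
    [ "$new" = "$cur" ] && continue
    printf '%s\n' "$new" > "$reg"
    echo "ACSCTL_CHANGED ${reg##*/} $cur->$new"
  done
}

first=$(converge)    # two registers need a write
second=$(converge)   # nothing left to do, so no output
echo "first run:  $(printf '%s' "$first" | grep -c ACSCTL_CHANGED) change(s)"
echo "second run: $(printf '%s' "$second" | grep -c ACSCTL_CHANGED) change(s)"
rm -rf "$REGS"
```

The first pass reports two changes and the second reports none, which is the behaviour the `changed_when` expression relies on.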

This role depends only on ansible.builtin; setpci/lspci come from the pciutils package, which the role assumes is present (it ships in the base image on Debian/Ubuntu and RHEL). If your image is minimal, install it in role: base_tuning or add an ansible.builtin.apt/ansible.builtin.dnf task with name: pciutils, state: present.
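Since the role only assumes pciutils is present, one option is a pre-flight check ahead of the bridge loop so a missing tool fails loudly. This matters because bash does not apply `set -e` to a command substitution used as a `for` word list, so an absent lspci would otherwise leave the loop with nothing to iterate. A minimal sketch; `require_tools` is a hypothetical helper name and is not part of the role as written.

```shell
#!/usr/bin/env bash
# require_tools NAME... -> returns 0 only if every named command is on PATH;
# prints one line to stderr for each command that is missing.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool (install pciutils)" >&2
      missing=1
    fi
  done
  return "$missing"
}

# In the boot script this would sit before the bridge loop, for example:
#   require_tools lspci setpci || exit 1
```

With that guard in place the one-shot would exit non-zero and the unit would show as failed, instead of completing without touching any bridge.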

Apply & verify

Run the whole node bring-up (the hub site.yml applies acs_disable after base_tuning), or target just this role with a tag:

```bash
# whole bring-up
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# this role only, if tagged `acs` in site.yml
ansible-playbook -i inventory/hosts.ini site.yml --tags acs --limit gpu-07.dc1.internal
```

Validation is read-only. The decisive signal is the ACSCtl line in verbose lspci: after the service runs, every PCIe bridge that has an ACS capability must show the redirect bits (SrcValid, ReqRedir, CmpltRedir, UpstreamFwd, DirectTrans) off. NVIDIA's appendix shows a fully cleared reference state, but this role's default mask intentionally leaves TransBlk and EgressCtrl unchanged unless you set acs_redir_mask: 0xffff.³

```text
ACSCtl: SrcValid- TransBlk[+|-] ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl[+|-] DirectTrans-
```

```bash
NODE=gpu-07.dc1.internal

# 1. The unit is enabled and completed this boot (oneshot stays "active (exited)").
ssh "$NODE" "systemctl is-enabled disable-acs.service"          # expect: enabled
ssh "$NODE" "systemctl is-active  disable-acs.service"          # expect: active

# 2. No ACSCtl line still has a redirect bit set. Flags with '+' are still ON.
#    Filter to ACSCtl first: on the ACSCap line '+' only means "supported",
#    so an unfiltered grep matches every ACS-capable device.
#    A clean node prints nothing; any line printed is a device that escaped
#    (the script only touches functions that lspci labels "PCI bridge").
ssh "$NODE" "sudo lspci -vvv 2>/dev/null | grep 'ACSCtl:' \
  | grep -E 'ReqRedir\+|CmpltRedir\+|UpstreamFwd\+|SrcValid\+|DirectTrans\+'"  # expect: (no output)

# 3. Inspect one GPU/NIC bridge directly to read the ACSCtl line.
ssh "$NODE" "sudo lspci -vvv -s <bridge-BDF> | grep -i ACSCtl"
#   expect: ACSCtl: SrcValid- ... ReqRedir- CmpltRedir- UpstreamFwd- ... DirectTrans-
```

Expected signal, together: the unit is enabled and active; the grep for any + redirect flag across all bridges returns empty; and the ACSCtl line on the GPU-to-NIC bridge shows SrcValid-/ReqRedir-/CmpltRedir-/UpstreamFwd-/DirectTrans-.

As an Ansible-side gate, consider adding the same "no + redirect bit on any bridge" check to role: validate_health. Its current asserts cover GPU enumeration, persistence, Fabric Manager, IB ports, and dcgmi diag, but not ACS; the assertion is a recommended addition, not yet present, that would make a node where ACS silently re-enabled fail the play instead of shipping degraded P2P.

The end-to-end performance proof (that P2P bandwidth is actually restored) is a p2pBandwidthLatencyTest / nccl-tests run against the topology expectation, covered in fabric bring-up and benchmarking. This page shows how to confirm the bits are cleared, not the bandwidth number, and the procedure has not been tested on hardware.
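The read-only check can also be written as a filter that reports which device an offending ACSCtl line belongs to, which a plain grep loses. The sketch below is one way to do that with awk; `acs_offenders` is a hypothetical helper, and the sample text only imitates the shape of `lspci -vvv` output rather than being captured from hardware.

```shell
#!/usr/bin/env bash
# acs_offenders: read `lspci -vvv` text on stdin and print the address of each
# device whose ACSCtl line still has a redirect-relevant bit set. ACSCap lines
# are ignored on purpose: there '+' means "supported", not "enabled".
acs_offenders() {
  awk '
    /^[0-9a-fA-F]/ { dev = $1 }   # unindented header line: remember the address
    /ACSCtl:/ && /(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd|DirectTrans)[+]/ { print dev }
  '
}

# Illustrative sample (shape only, not captured from hardware).
sample='0000:3a:00.0 PCI bridge: Example Switch Port
	Capabilities: [148 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
0000:5e:00.0 PCI bridge: Example Switch Port
	Capabilities: [148 v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-'

printf '%s\n' "$sample" | acs_offenders   # prints 0000:5e:00.0
```

On a node the input would come from `sudo lspci -vvv | acs_offenders`; empty output corresponds to the clean state described above.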

Failure modes

ACS re-enabled after a reboot, firmware update, or BIOS reset is the most common silent regression. ACS Control does not persist,³ and a node that booted without the one-shot (unit disabled, masked, or ordered after the workload) runs with redirect bits on: P2P/GPUDirect RDMA is throttled or off and inter-GPU bandwidth is silently halved. The hub flags this exact case ("ACS re-enabled by a firmware/BIOS reset, silently halving P2P bandwidth").⁷ Verify with check (2) above and confirm systemctl is-enabled disable-acs.service. Runbook: fabric-manager failure covers the adjacent "collectives fell back to PCIe/host-staging" triage; ACS is the first bridge-level thing to rule out there.
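pciutils not installed on a minimal image: setpci/lspci are absent, but the unit does not necessarily go failed. The lspci call sits in the for loop's command substitution, where bash does not apply set -e, so the shell logs lspci: command not found, the loop runs zero times, the script exits 0, and the unit reports active (exited) without touching a bridge. The command not found line still lands in the journal (journalctl -u disable-acs.service), but the unit state alone will not flag it. Fix: install pciutils (add the apt/dnf task noted above) and re-run.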
Wrong mask for the platform's bridges: clearing the full Control word (acs_redir_mask: 0xffff) on a switch that legitimately needs EgressCtrl, or a bridge whose ACS layout differs, can have no effect or an unintended one. The hub's open question stands: "Verify the ACS-disable mask against the specific PCIe bridges on the platform before fleet-wide use."⁷ Confirm the resulting ACSCtl line per bridge before rollout.
Unit ordered after the workload: if Before= omits the scheduler actually in use (e.g. slurmd on a Slurm node but only kubelet.service is listed), a job can start before ACS is cleared and come up on the redirected path. Set acs_before_units to the schedulers present on the node.
Disabled ACS on a multi-tenant / VFIO-passthrough node: here the failure is a security regression, not a perf one. Weakening ACS collapses IOMMU groups and removes isolation between passed-through devices.² Set acs_disable_enabled: false on those hosts. Related tenancy model: security and multi-tenancy (isolation tiers; it does not itself cover ACS/VFIO).
Non-idempotent drift: replacing the guarded setpci loop with an unconditional write that ignores the read-back re-writes every bridge on every run, so the register is touched even when already cleared. That defeats the convergence the role is built on (and makes any future change-detection unreliable). Keep the [ "$new" = "$cur" ] && continue guard so an already-cleared bridge is left untouched.
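For the "unit ordered after the workload" case above, the ordering systemd actually loaded can be compared against the schedulers you expect. A minimal sketch: `missing_before` is a hypothetical helper, and the unit lists in the example call are illustrative values rather than output from a real node.

```shell
#!/usr/bin/env bash
# missing_before WANTED ACTUAL -> print each unit in WANTED that is absent
# from ACTUAL. Both arguments are space-separated unit lists.
missing_before() {
  local unit
  for unit in $1; do
    case " $2 " in
      *" $unit "*) ;;          # present in the loaded Before= list
      *) echo "$unit" ;;       # expected but not ordered after this unit
    esac
  done
}

# On a node, ACTUAL would come from systemd itself, for example:
#   actual=$(systemctl show -p Before --value disable-acs.service)
missing_before "kubelet.service slurmd.service" "basic.target kubelet.service"   # prints slurmd.service
```

Any unit printed is a scheduler that can start without waiting for the one-shot, so it belongs in `acs_before_units`.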

References

PCI-SIG ACS Engineering Change Notice (ACS Control register bits; "peer-to-peer Requests ... be redirected" upstream): https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf
Linux kernel command-line parameters — pci=disable_acs_redir= ("ACS redirect capabilities forced off ... removes isolation between devices and will make the IOMMU groups less granular"): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
NVIDIA AI Enterprise — Optimizing VM Configuration, Appendix (lspci ACSCtl: SrcValid- ..., setpci ... ECAP_ACS+0x6.w, "not retained across resets / reboots"): https://docs.nvidia.com/ai-enterprise/planning-resource/optimizing-vm-configuration-ai-inference/latest/appendix.html
NVIDIA GPUDirect RDMA — Supported Systems (IOMMU must be disabled or 1:1 pass-through): https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
setpci(8) — named capabilities (ECAP_ACS), +offset, and .w width specifier: https://man7.org/linux/man-pages/man8/setpci.8.html
lspci(8) — verbose capability decode (ACSCap/ACSCtl): https://man7.org/linux/man-pages/man8/lspci.8.html
systemd.unit(5) — Before=/After= ordering vs requirement: https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html
ansible.builtin.copy: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/copy_module.html
ansible.builtin.systemd_service: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html
ansible.builtin.command: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/command_module.html

1. PCI-SIG, "Access Control Services" Engineering Change Notice: defines the ACS Control register bits (Source Validation, Translation Blocking, P2P Request Redirect, P2P Completion Redirect, Upstream Forwarding, P2P Egress Control, Direct Translated P2P) and specifies that when P2P Request Redirect is enabled in a Root Port, "peer-to-peer Requests" are redirected upstream. https://pdos.csail.mit.edu/~sbw/links/ECN_access_control_061011.pdf
2. Linux kernel, Documentation/admin-guide/kernel-parameters.txt, pci=disable_acs_redir=<pci_dev>: "Each device specified will have the PCI ACS redirect capabilities forced off which will allow P2P traffic between devices through bridges without forcing it upstream. Note: this removes isolation between devices and will make the IOMMU groups less granular." Introducing commit: "In order to support P2P traffic on a segment of the PCI hierarchy, we must be able to disable the ACS redirect bits for select PCI bridges." https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
3. NVIDIA AI Enterprise, "Optimizing VM Configuration for Performant AI Inference: Appendix": shows reading ACS via lspci -vvv (ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-) and writing it via setpci ... ECAP_ACS+0x6.w, and notes the settings "are not retained across resets / reboots." https://docs.nvidia.com/ai-enterprise/planning-resource/optimizing-vm-configuration-ai-inference/latest/appendix.html
4. NVIDIA, "GPUDirect RDMA" (Supported Systems): GPUDirect RDMA "relies upon all physical addresses being the same from the different PCI devices' point of view. This makes it incompatible with IOMMUs performing any form of translation other than 1:1, hence they must be disabled or configured for pass-through translation for GPUDirect RDMA to work." https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
5. setpci(8) man page: capabilities are addressable by name (ECAP_ACS) or ECAP<id>, with +offset and a .b/.w/.l width specifier selecting 1/2/4-byte access. https://man7.org/linux/man-pages/man8/setpci.8.html
6. systemd.unit(5): Before=/After= configure ordering only and are independent of the requirement directives (Wants=/Requires=). https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html
7. This KB, Ansible: Node & Fabric Bring-Up: the hub disable-acs.sh script (the lspci -D bridge loop and ECAP_ACS+0x6.w write this role generalises) and its ACS failure-mode/open-question notes.