Ansible role: rdma_fabric

Scope: install the DOCA-OFED host stack, load nvidia_peermem for GPUDirect RDMA, and write /etc/nccl.conf defaults (NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL) keyed per NIC model. This is the fabric-layer role in the node bring-up chain. It is staged by rdma_stage: the ofed stage runs before nvidia_stack, and the peermem stage runs after the driver reboot and before validate.

Reference template, drawn from the upstream NVIDIA DOCA-OFED, GPUDirect RDMA, and NCCL docs (References). Not hardware-tested here. Pin the DOCA-OFED version, set rdma_nic_model to your card, and validate on one node before a fleet roll. For a maintained Kubernetes path, use the NVIDIA Network Operator together with the GPU Operator RDMA flow; for host Ansible, compare with the nvidia.nvidia_driver collection.

flowchart LR
  OFED["rdma_stage=ofed: DOCA-OFED host stack"] --> DRIVER["nvidia_stack builds driver"]
  DRIVER --> PEERMEM["rdma_stage=peermem: load nvidia_peermem"]
  PEERMEM --> NCCL["Write /etc/nccl.conf per NIC"]
  NCCL --> VERIFY["lsmod + ibstat checks"]

What it does

rdma_fabric makes a node ready for GPU-to-GPU traffic over the InfiniBand/RoCE fabric. Three concerns, in order:

DOCA-OFED installs NVIDIA's host RDMA stack (the doca-ofed profile), which supersedes the standalone MLNX_OFED. This provides the mlx5 kernel drivers, IB verbs userspace, and the RDMA peer-memory support nvidia_peermem links against. It must be present before the peer-memory module can compile/load against the RDMA APIs.
nvidia_peermem loads the GPUDirect RDMA peer-memory client now shipped inside the NVIDIA GPU driver (replacing the out-of-tree nv_peer_mem, which NVIDIA deprecates from the R470 driver branch and newer). This is what lets a Mellanox/ConnectX HCA read and write GPU device memory directly over PCIe, skipping the host-staged bounce buffer. The role both loads it now and persists it across reboots.
/etc/nccl.conf holds host-wide NCCL fabric defaults so every job inherits a sane IB transport without per-launch env soup. Values are selected from a per-NIC-model map (rdma_nic_model), because the right NCCL_IB_GID_INDEX differs between native InfiniBand and RoCE.

Ordering gotcha: if the GPU driver was installed before DOCA-OFED, nvidia_peermem can fail to load because the driver build did not see the RDMA peer-memory APIs. In that case rebuild or reinstall the NVIDIA driver after OFED. The site playbook avoids the fault by running this role twice: rdma_stage=ofed before nvidia_stack, flushing the OFED reboot, then rdma_stage=peermem after the driver reboot.
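
The two-pass ordering described above could be wired into the site playbook roughly as follows. This is a minimal sketch, not the actual site.yml. The play layout, the use of include_role with flush_handlers, and the assumption that nvidia_stack and validate are roles that can be included this way are all illustrative.

```yaml
# site.yml (sketch; play layout is an assumption, role names come from this page)
- name: Bring up GPU nodes with a staged RDMA fabric
  hosts: gpu_nodes
  become: true
  tasks:
    - name: Stage 1, install DOCA-OFED before the GPU driver
      ansible.builtin.include_role:
        name: rdma_fabric
      vars:
        rdma_stage: ofed

    - name: Run the OFED reboot now instead of at end of play
      ansible.builtin.meta: flush_handlers

    - name: Install the NVIDIA driver so it builds against the OFED stack
      ansible.builtin.include_role:
        name: nvidia_stack

    - name: Run the driver reboot before loading nvidia_peermem
      ansible.builtin.meta: flush_handlers

    - name: Stage 2, load nvidia_peermem and write /etc/nccl.conf
      ansible.builtin.include_role:
        name: rdma_fabric
      vars:
        rdma_stage: peermem

    - name: Validate the node (assumed to be a role)
      ansible.builtin.include_role:
        name: validate
```

Each flush_handlers forces any notified reboot node handler to run at that point, so the next stage starts on a freshly booted kernel rather than waiting for the end of the play.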

This role does not touch the subnet manager, switch config, or ACS; those are upstream (fabric/SM) or handled by base_tuning / service-acs-disable.

Variables

Set inventory defaults in [gpu_nodes:vars] (see the hub inventory); override rdma_nic_model per host or host-group. Role defaults live in roles/rdma_fabric/defaults/main.yml.

| Variable | Default | Meaning |
| --- | --- | --- |
| `doca_ofed_package` | `doca-ofed` | DOCA-OFED profile meta-package. `doca-all` for the full SDK superset; `doca-ofed` is the RDMA/networking subset. Pin a version (`doca-ofed=<ver>`) once chosen. |
| `rdma_nic_model` | `cx7_ib` | Selects the NCCL value map below. One of `cx7_ib`, `cx7_roce`, `cx6_ib`, `cx6_roce`. ConnectX-7 vs -6, InfiniBand vs RoCE. |
| `nccl_conf_path` | `/etc/nccl.conf` | Host-wide NCCL config file. NCCL reads this in addition to environment variables. |
| `nccl_ib_hca` | `mlx5` | `NCCL_IB_HCA`. Prefix filter over IB verbs devices; `mlx5` matches all ConnectX `mlx5_*` ports. Use `=mlx5_0:1,mlx5_1:1` for exact device:port pinning. |
| `nccl_ib_gid_index` | per model (see map) | `NCCL_IB_GID_INDEX`. RoCE GID table index from `show_gids`. NCCL default is `-1` (auto); native IB ignores it, RoCE typically needs `3` (RoCEv2/IPv4). |
| `nccl_net_gdr_level` | `SYS` | `NCCL_NET_GDR_LEVEL`. Max NIC-to-GPU distance at which GPUDirect RDMA is used. `SYS` = always enable; other keywords `LOC`/`PIX`/`PXB`/`PHB`. |
| `rdma_reboot_on_ofed` | `true` | Whether a fresh DOCA-OFED install notifies the `reboot node` handler (needed so `mlx5` and peer-memory load cleanly). |
| `rdma_stage` | `all` | `ofed`, `peermem`, or `all`. The site playbook uses staged mode (`ofed` before the driver, `peermem` after). `all` is for a single-role canary on a node whose driver/OFED ordering is already correct. |
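
As an illustration of a per-host override, a host_vars file next to the inventory might look like the sketch below. The file path is hypothetical, the host name is borrowed from the Apply & verify example, and the device names and the PXB choice are placeholders to be replaced with values that match the real node.

```yaml
# inventory/host_vars/gpu-01.dc1.internal.yml  (hypothetical path)
# Overrides the role defaults for one RoCE host.
rdma_nic_model: cx7_roce              # picks the cx7_roce entry from the per-model map
nccl_ib_hca: "=mlx5_0:1,mlx5_1:1"     # exact device:port pinning; device names are placeholders
nccl_net_gdr_level: PXB               # example of a tighter GPUDirect distance than the SYS default
```

Leaving nccl_ib_gid_index unset keeps the per-model value, so only set it here when show_gids on that host disagrees with the map.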

Per-model NCCL map (roles/rdma_fabric/vars/main.yml), keyed by rdma_nic_model:

# roles/rdma_fabric/vars/main.yml
nccl_nic_profiles:
  cx7_ib:    { gid_index: -1, hca: "mlx5" }   # ConnectX-7, native InfiniBand: GID index unused
  cx7_roce:  { gid_index: 3,  hca: "mlx5" }   # ConnectX-7, RoCEv2: GID 3 = RoCEv2/IPv4 (confirm via show_gids)
  cx6_ib:    { gid_index: -1, hca: "mlx5" }   # ConnectX-6, native InfiniBand
  cx6_roce:  { gid_index: 3,  hca: "mlx5" }   # ConnectX-6, RoCEv2

The RoCE gid_index: 3 is the common RoCEv2/IPv4 default but is fabric-specific; verify against show_gids on a real node before fleet roll (see Failure modes).
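
One way to make that check routine is a read-only task pair that prints the GID table on RoCE hosts for manual review. This is a sketch under two assumptions: that show_gids is on the PATH after the DOCA-OFED install, and that every RoCE model key contains the string roce, as the four keys above do. It does not parse the table or assert on it.

```yaml
# Read-only: dump the GID table so an operator can confirm the RoCEv2/IPv4 index.
- name: Collect GID table on RoCE hosts
  ansible.builtin.command: show_gids
  register: gid_table
  changed_when: false                  # inspection only, never reports a change
  when: "'roce' in rdma_nic_model"

- name: Print GID table for review
  ansible.builtin.debug:
    var: gid_table.stdout_lines
  when: gid_table is not skipped
```

Compare the index of the RoCEv2 row carrying the host's IPv4 address against the gid_index in the map before rolling to the fleet.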

Tasks

Real, idempotent tasks/main.yml. Uses only ansible.builtin.* modules plus community.general.modprobe (the persistent parameter needs community.general 7.0.0 or later). DOCA-OFED install assumes the NVIDIA DOCA apt repo is already configured by the base image or a repo-setup role; do not rely on nvidia_stack for this, because the OFED stage runs before the driver stage.

# roles/rdma_fabric/tasks/main.yml
- name: Assert rdma_stage is valid
  ansible.builtin.assert:
    that:
      - rdma_stage | default('all') in ['ofed', 'peermem', 'all']
    fail_msg: "rdma_stage must be one of ofed, peermem, all"
    quiet: true

- name: Resolve NCCL profile for this NIC model
  ansible.builtin.set_fact:
    nccl_profile: "{{ nccl_nic_profiles[rdma_nic_model] }}"
  # fails loudly if rdma_nic_model is not a key in the map

- name: Install DOCA-OFED host stack
  ansible.builtin.apt:
    name: "{{ doca_ofed_package }}"
    state: present
    update_cache: true
  register: ofed_install
  notify: reboot node
  when:
    - rdma_stage | default('all') in ['ofed', 'all']
    - rdma_reboot_on_ofed | bool

- name: Install DOCA-OFED host stack (no reboot handler)
  ansible.builtin.apt:
    name: "{{ doca_ofed_package }}"
    state: present
    update_cache: true
  when:
    - rdma_stage | default('all') in ['ofed', 'all']
    - not (rdma_reboot_on_ofed | bool)

- name: Load and persist nvidia_peermem (GPUDirect RDMA)
  community.general.modprobe:
    name: nvidia_peermem
    state: present
    persistent: present          # writes /etc/modules-load.d/ entry; loads on next boot
  register: peermem
  # EINVAL here => GPU driver was built before OFED; reinstall driver (see What it does)
  when: rdma_stage | default('all') in ['peermem', 'all']

- name: Write NCCL fabric defaults
  ansible.builtin.template:
    src: nccl.conf.j2
    dest: "{{ nccl_conf_path }}"
    owner: root
    group: root
    mode: "0644"
  # template is declarative => idempotent; rewrites only on content change
  when: rdma_stage | default('all') in ['peermem', 'all']

Companion template, emitting the same keys the hub writes, parameterised per NIC model:

{# roles/rdma_fabric/templates/nccl.conf.j2 #}
# Managed by Ansible role rdma_fabric. Override per-job via NCCL_* env vars.
# NIC model: {{ rdma_nic_model }}
NCCL_IB_HCA={{ nccl_ib_hca | default(nccl_profile.hca) }}
NCCL_IB_GID_INDEX={{ nccl_ib_gid_index | default(nccl_profile.gid_index) }}
NCCL_NET_GDR_LEVEL={{ nccl_net_gdr_level }}

# roles/rdma_fabric/handlers/main.yml
- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: 1200
  # shared handler name with base_tuning/nvidia_stack; Ansible de-dupes one reboot per flush

Idempotency notes:

- community.general.modprobe with persistent: present loads the module now and writes the /etc/modules-load.d/ persist entry (per the module docs); treat re-runs as converging on that loaded-and-persisted state. The module docs do not specify the exact changed-reporting rule, so do not rely on a precise changed count.
- ansible.builtin.template rewrites /etc/nccl.conf only when rendered content differs, so re-runs are no-ops.
- The apt task is idempotent on state: present; pin the version to stop silent OFED upgrades on update_cache.
- rdma_stage is the ordering guard. Use ofed before the driver and peermem after the driver reboot. Avoid all in a fresh build unless the driver package is known to rebuild after OFED.
- No raw command/shell tasks mutate the host, so there is nothing to guard with creates/changed_when here. The verify tasks (below) are the only command/shell tasks, and both are read-only (changed_when: false).

Apply & verify

Run via the site playbook, or target the role alone:

# whole chain
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal

# this role only (assumes site.yml maps rdma_fabric to a tag)
ansible-playbook -i inventory/hosts.ini site.yml --tags rdma_fabric --limit gpu-01.dc1.internal

Validation tasks (drop in roles/rdma_fabric/tasks/main.yml tail, or rely on validate):

- name: nvidia_peermem is loaded
  ansible.builtin.command: lsmod
  register: lsmod_out
  changed_when: false
  failed_when: "'nvidia_peermem' not in lsmod_out.stdout"

- name: At least one IB/RoCE port is Active (LinkUp is not checked here)
  ansible.builtin.shell: >
    set -o pipefail;
    ibstat | grep -c 'State: Active'
  args: { executable: /bin/bash }
  register: ib_active
  changed_when: false
  failed_when: ib_active.stdout | int < 1
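
The two tasks above cover the module and the link but not the third concern, the NCCL config file. A possible companion check is sketched below. It assumes it runs after the template task in the peermem or all stage, and the registered name nccl_conf_raw is illustrative.

```yaml
# Read-only: confirm the rendered file carries the keys this role manages.
- name: Read the rendered NCCL config
  ansible.builtin.slurp:
    src: "{{ nccl_conf_path }}"
  register: nccl_conf_raw

- name: NCCL config carries the three managed keys
  ansible.builtin.assert:
    that:
      - "'NCCL_IB_HCA=' in nccl_conf_text"
      - "'NCCL_IB_GID_INDEX=' in nccl_conf_text"
      - "('NCCL_NET_GDR_LEVEL=' ~ nccl_net_gdr_level) in nccl_conf_text"
    quiet: true
  vars:
    # slurp returns base64; decode it once for the assertions above
    nccl_conf_text: "{{ nccl_conf_raw.content | b64decode }}"
```

This only proves the file was rendered as intended. Per-job NCCL_* environment variables can still override it at launch.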

Manual checks and expected signal:

# 1. peer-memory module present (the driver-bundled module is listed with an underscore)
lsmod | grep nvidia_peermem
#   nvidia_peermem         16384  0

# 2. fabric ports up — healthy port shows both lines
ibstat
#   Port 1:
#     State: Active
#     Physical state: LinkUp
#     Rate: 400            # ConnectX-7 NDR; varies by NIC/cable
#     Link layer: InfiniBand   # or "Ethernet" for RoCE

Expected signal: lsmod | grep nvidia_peermem returns a non-empty line, and every fabric port in ibstat reads State: Active / Physical state: LinkUp. If those hold, drive a real data-path test: see fabric bring-up benchmarking for ib_write_bw and NCCL all-reduce bandwidth (the only proof GPUDirect RDMA is actually on the wire, not just loaded).

Failure modes

| Symptom | Likely cause | Runbook |
| --- | --- | --- |
| `modprobe nvidia-peermem` returns `EINVAL`; `lsmod` shows no `nvidia_peermem` | GPU driver compiled before DOCA-OFED, so peer-memory APIs absent. | kernel/GPU missing: reinstall/rebuild driver after OFED. |
| Module loaded but NCCL falls back to host-staged copies (low bandwidth) | `NCCL_NET_GDR_LEVEL` too restrictive, or ACS re-enabled breaking P2P. | NCCL hang/slow; re-run `service-acs-disable`. |
| `ibstat` shows `State: Down` or `Physical state: Polling` | No subnet manager, bad cable, or port not enabled on the switch. | fabric-manager failure. |
| RoCE path: handshake works but throughput collapses | Wrong `NCCL_IB_GID_INDEX` for the RoCEv2 GID; confirm with `show_gids`. | NCCL hang/slow. |
| `nv_peer_mem` and `nvidia_peermem` both present | Legacy out-of-tree module conflicts; only one loads. | Remove `nv_peer_mem` package; see kernel/GPU missing. |
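
The last row can be caught before it bites with a small pre-flight check. The sketch below is an assumption about how such a guard might look, not part of the role as documented. It only detects a loaded legacy module and does not remove any package.

```yaml
# Read-only guard: fail early if the legacy out-of-tree module is loaded.
- name: List loaded kernel modules
  ansible.builtin.command: lsmod
  register: loaded_modules
  changed_when: false

- name: Legacy nv_peer_mem must not be loaded
  ansible.builtin.assert:
    that:
      # match anchors at line start, so nvidia_peermem lines are not counted
      - "loaded_modules.stdout_lines | select('match', 'nv_peer_mem ') | list | length == 0"
    fail_msg: "Legacy nv_peer_mem is loaded; remove its package before loading nvidia_peermem"
```

Placed ahead of the nvidia_peermem load task, this turns a confusing module conflict into an explicit failure message.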

References

GPUDirect RDMA peer-memory client (nvidia-peermem; lsmod shows nvidia_peermem): https://docs.nvidia.com/cuda/gpudirect-rdma/ — states nv_peer_mem is "deprecated when running GPU drivers from the R470 branch and newer".
nvidia-peermem README ("now included with the NVIDIA Linux GPU driver"; the GitHub nv_peer_mem project "should be considered deprecated"): https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/nvidia-peermem.html
GPU Operator RDMA verification (example lsmod | grep nvidia output with nvidia_peermem): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html
NVIDIA DOCA-OFED host installation and upgrade (doca-ofed / doca-all apt profiles): https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
NCCL environment variables (NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
community.general.modprobe module (state, persistent parameters): https://docs.ansible.com/ansible/latest/collections/community/general/modprobe_module.html
ansible.builtin.template module: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html
ansible.builtin.apt module: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
InfiniBand ibstat port states (State: Active, Physical state: LinkUp): https://docs.oracle.com/cd/E19914-01/820-6705-10/appendix2.html