Ansible role: rdma_fabric
Scope: install the DOCA-OFED host stack, load `nvidia_peermem` for GPUDirect RDMA, and write `/etc/nccl.conf` defaults (`NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_NET_GDR_LEVEL`) keyed per NIC model. This is the fabric-layer role in the node bring-up chain. It is staged via `rdma_stage`: the `ofed` stage runs before nvidia_stack, and the `peermem` stage runs after the driver reboot and before validate.
Reference template, drawn from the upstream NVIDIA DOCA-OFED, GPUDirect RDMA, and NCCL docs (References). Not hardware-tested here. Pin the DOCA-OFED version, set `rdma_nic_model` to your card, and validate on one node before a fleet roll. For a maintained Kubernetes path, use the NVIDIA Network Operator together with the GPU Operator RDMA flow; for host Ansible, compare with the `nvidia.nvidia_driver` collection.
```mermaid
flowchart LR
    OFED["rdma_stage=ofed: DOCA-OFED host stack"] --> DRIVER["nvidia_stack builds driver"]
    DRIVER --> PEERMEM["rdma_stage=peermem: load nvidia_peermem"]
    PEERMEM --> NCCL["Write /etc/nccl.conf per NIC"]
    NCCL --> VERIFY["lsmod + ibstat checks"]
```
What it does
rdma_fabric makes a node ready for GPU-to-GPU traffic over the InfiniBand/RoCE fabric. Three concerns, in order:
- DOCA-OFED installs NVIDIA's host RDMA stack (the `doca-ofed` profile), which supersedes the standalone MLNX_OFED. This provides the `mlx5` kernel drivers, IB verbs userspace, and the RDMA peer-memory support `nvidia_peermem` links against. It must be present before the peer-memory module can compile/load against the RDMA APIs.
- `nvidia_peermem` loads the GPUDirect RDMA peer-memory client now shipped inside the NVIDIA GPU driver (replacing the out-of-tree `nv_peer_mem`, which NVIDIA deprecates from the R470 driver branch and newer). This is what lets a Mellanox/ConnectX HCA read and write GPU device memory directly over PCIe, skipping the host-staged bounce buffer. The role both loads it now and persists it across reboots.
- `/etc/nccl.conf` holds host-wide NCCL fabric defaults so every job inherits a sane IB transport without per-launch env soup. Values are selected from a per-NIC-model map (`rdma_nic_model`), because the right `NCCL_IB_GID_INDEX` differs between native InfiniBand and RoCE.
Ordering gotcha: if the GPU driver was installed before DOCA-OFED, nvidia_peermem can fail to load because the driver build did not see the RDMA peer-memory APIs. In that case rebuild or reinstall the NVIDIA driver after OFED. The site playbook avoids the fault by running this role twice: rdma_stage=ofed before nvidia_stack, flushing the OFED reboot, then rdma_stage=peermem after the driver reboot.
This role does not touch the subnet manager, switch config, or ACS; those are upstream (fabric/SM) or handled by base_tuning / service-acs-disable.
Variables
Set inventory defaults in [gpu_nodes:vars] (see the hub inventory); override rdma_nic_model per host or host-group. Role defaults live in roles/rdma_fabric/defaults/main.yml.
| Variable | Default | Meaning |
|---|---|---|
| `doca_ofed_package` | `doca-ofed` | DOCA-OFED profile meta-package. `doca-all` is the full SDK superset; `doca-ofed` is the RDMA/networking subset. Pin a version (`doca-ofed=<ver>`) once chosen. |
| `rdma_nic_model` | `cx7_ib` | Selects the NCCL value map below. One of `cx7_ib`, `cx7_roce`, `cx6_ib`, `cx6_roce` (ConnectX-7 vs ConnectX-6, InfiniBand vs RoCE). |
| `nccl_conf_path` | `/etc/nccl.conf` | Host-wide NCCL config file. NCCL reads this in addition to environment variables. |
| `nccl_ib_hca` | `mlx5` | `NCCL_IB_HCA`. Prefix filter over IB verbs devices; `mlx5` matches all ConnectX `mlx5_*` ports. Use `=mlx5_0:1,mlx5_1:1` for exact device:port pinning. |
| `nccl_ib_gid_index` | per model (see map) | `NCCL_IB_GID_INDEX`. RoCE GID table index from `show_gids`. NCCL default is `-1` (auto); native IB ignores it, RoCE typically needs `3` (RoCEv2/IPv4). |
| `nccl_net_gdr_level` | `SYS` | `NCCL_NET_GDR_LEVEL`. Max NIC-to-GPU distance at which GPUDirect RDMA is used. `SYS` = always enable; other keywords are `LOC`/`PIX`/`PXB`/`PHB`. |
| `rdma_reboot_on_ofed` | `true` | Whether a fresh DOCA-OFED install notifies the `reboot node` handler (needed so `mlx5` and peer-memory load cleanly). |
| `rdma_stage` | `all` | `ofed`, `peermem`, or `all`. The site playbook uses staged mode (`ofed` before the driver, `peermem` after). `all` is for a single-role canary on a node whose driver/OFED ordering is already correct. |
Per-model NCCL map (roles/rdma_fabric/vars/main.yml), keyed by rdma_nic_model:
```yaml
# roles/rdma_fabric/vars/main.yml
nccl_nic_profiles:
  cx7_ib:   { gid_index: -1, hca: "mlx5" }  # ConnectX-7, native InfiniBand: GID index unused
  cx7_roce: { gid_index: 3,  hca: "mlx5" }  # ConnectX-7, RoCEv2: GID 3 = RoCEv2/IPv4 (confirm via show_gids)
  cx6_ib:   { gid_index: -1, hca: "mlx5" }  # ConnectX-6, native InfiniBand
  cx6_roce: { gid_index: 3,  hca: "mlx5" }  # ConnectX-6, RoCEv2
```
The RoCE gid_index: 3 is the common RoCEv2/IPv4 default but is fabric-specific; verify against show_gids on a real node before fleet roll (see Failure modes).
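To make the `show_gids` check concrete, here is a minimal shell sketch that filters `show_gids` output down to the GIDs that are RoCEv2 and carry an IPv4 address, which is the entry the `gid_index: 3` default is meant to select. The function name `roce_v2_ipv4_gids` is hypothetical, and the column layout it assumes (device, port, index, GID, IPv4, version, netdev) is an assumption to confirm against the output on your own node.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the role.
# Reads show_gids output on stdin and prints "device port index" for every
# GID row whose 5th field is a dotted-quad IPv4 address and whose 6th field
# is "v2". ASSUMPTION: rows are whitespace-separated in the order
# DEV PORT INDEX GID IPv4 VER NETDEV; rows without an IPv4 address have one
# field fewer and are therefore skipped.
roce_v2_ipv4_gids() {
  awk '$5 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $6 == "v2" { print $1, $2, $3 }'
}
```

On a node this would be run as `show_gids | roce_v2_ipv4_gids`. If the index printed for the fabric port is not 3, override `nccl_ib_gid_index` for that host or group rather than trusting the profile default.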
Tasks
Real, idempotent tasks/main.yml. Uses only stock modules (ansible.builtin.*, community.general.modprobe). DOCA-OFED install assumes the NVIDIA DOCA apt repo is already configured by the base image or a repo-setup role; do not rely on nvidia_stack for this, because the OFED stage runs before the driver stage.
```yaml
# roles/rdma_fabric/tasks/main.yml
- name: Assert rdma_stage is valid
  ansible.builtin.assert:
    that:
      - rdma_stage | default('all') in ['ofed', 'peermem', 'all']
    fail_msg: "rdma_stage must be one of ofed, peermem, all"
    quiet: true

- name: Resolve NCCL profile for this NIC model
  ansible.builtin.set_fact:
    nccl_profile: "{{ nccl_nic_profiles[rdma_nic_model] }}"
  # fails loudly if rdma_nic_model is not a key in the map

- name: Install DOCA-OFED host stack
  ansible.builtin.apt:
    name: "{{ doca_ofed_package }}"
    state: present
    update_cache: true
  register: ofed_install
  notify: reboot node
  when:
    - rdma_stage | default('all') in ['ofed', 'all']
    - rdma_reboot_on_ofed | bool

- name: Install DOCA-OFED host stack (no reboot handler)
  ansible.builtin.apt:
    name: "{{ doca_ofed_package }}"
    state: present
    update_cache: true
  when:
    - rdma_stage | default('all') in ['ofed', 'all']
    - not (rdma_reboot_on_ofed | bool)

- name: Load and persist nvidia_peermem (GPUDirect RDMA)
  community.general.modprobe:
    name: nvidia_peermem
    state: present
    persistent: present  # writes /etc/modules-load.d/ entry; loads on next boot
  register: peermem
  # EINVAL here => GPU driver was built before OFED; reinstall driver (see What it does)
  when: rdma_stage | default('all') in ['peermem', 'all']

- name: Write NCCL fabric defaults
  ansible.builtin.template:
    src: nccl.conf.j2
    dest: "{{ nccl_conf_path }}"
    owner: root
    group: root
    mode: "0644"
  # template is declarative => idempotent; rewrites only on content change
  when: rdma_stage | default('all') in ['peermem', 'all']
```
Companion template, emitting the same keys the hub writes, parameterised per NIC model:
```jinja
{# roles/rdma_fabric/templates/nccl.conf.j2 #}
# Managed by Ansible role rdma_fabric. Override per-job via NCCL_* env vars.
# NIC model: {{ rdma_nic_model }}
NCCL_IB_HCA={{ nccl_ib_hca | default(nccl_profile.hca) }}
NCCL_IB_GID_INDEX={{ nccl_ib_gid_index | default(nccl_profile.gid_index) }}
NCCL_NET_GDR_LEVEL={{ nccl_net_gdr_level }}
```
```yaml
# roles/rdma_fabric/handlers/main.yml
- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: 1200
  # shared handler name with base_tuning/nvidia_stack; Ansible de-dupes one reboot per flush
```
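As a quick spot check on what the `nccl.conf.j2` template rendered on a node, the following sketch reports any of the three keys the template emits that are missing from the file. The helper name `nccl_conf_missing_keys` is hypothetical and not part of the role; it only looks for `KEY=` at the start of a line and does not validate the values.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the role.
# Prints each expected NCCL key that has no "KEY=..." line in the given file
# (defaults to /etc/nccl.conf). Prints nothing when all three are present.
nccl_conf_missing_keys() {
  local conf="${1:-/etc/nccl.conf}"
  local key
  for key in NCCL_IB_HCA NCCL_IB_GID_INDEX NCCL_NET_GDR_LEVEL; do
    # Anchor at line start so commented-out entries do not count.
    grep -q "^${key}=" "$conf" || echo "$key"
  done
}
```

Run it as `nccl_conf_missing_keys` (or pass the value of `nccl_conf_path` if you changed it); empty output means all three keys were written.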
Idempotency notes:
- community.general.modprobe with persistent: present loads the module now and writes the /etc/modules-load.d/ persist entry (per the module docs); treat re-runs as converging on that loaded-and-persisted state. The module docs do not specify the exact changed-reporting rule, so do not rely on a precise changed count.
- ansible.builtin.template rewrites /etc/nccl.conf only when rendered content differs, so re-runs are no-ops.
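- The apt task is idempotent on state: present, which installs the package only if it is absent and does not upgrade an installed one. Without a pin, freshly built nodes get whatever DOCA-OFED version the repo serves that day, so pin the version to keep the fleet on one release.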
- rdma_stage is the ordering guard. Use ofed before the driver and peermem after the driver reboot. Avoid all in a fresh build unless the driver package is known to rebuild after OFED.
- No mutating raw command/shell tasks, so there is nothing to guard with creates/changed_when here. The verification tasks (below) are the only command/shell tasks, and both are read-only (changed_when: false).
Apply & verify
Run via the site playbook, or target the role alone:
```bash
# whole chain
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-01.dc1.internal

# this role only (assumes site.yml maps rdma_fabric to a tag)
ansible-playbook -i inventory/hosts.ini site.yml --tags rdma_fabric --limit gpu-01.dc1.internal
```
Validation tasks (drop in roles/rdma_fabric/tasks/main.yml tail, or rely on validate):
```yaml
- name: nvidia_peermem is loaded
  ansible.builtin.command: lsmod
  register: lsmod_out
  changed_when: false
  failed_when: "'nvidia_peermem' not in lsmod_out.stdout"

- name: At least one IB/RoCE port is Active + LinkUp
  # -A1 keeps the "Physical state" line that ibstat prints right after "State",
  # so the count covers ports that are both Active and LinkUp.
  ansible.builtin.shell: >
    set -o pipefail;
    ibstat | grep -A1 'State: Active' | grep -c 'Physical state: LinkUp'
  args: { executable: /bin/bash }
  register: ib_active
  changed_when: false
  failed_when: ib_active.stdout | int < 1
```
Manual checks and expected signal:
```bash
# 1. peer-memory module present (in-tree module reports with an underscore)
lsmod | grep nvidia_peermem
# nvidia_peermem         16384  0

# 2. fabric ports up: a healthy port shows both lines
ibstat
#   Port 1:
#     State: Active
#     Physical state: LinkUp
#     Rate: 400                  # ConnectX-7 NDR; varies by NIC/cable
#     Link layer: InfiniBand     # or "Ethernet" for RoCE
```
Expected signal: lsmod | grep nvidia_peermem returns a non-empty line, and every fabric port in ibstat reads State: Active / Physical state: LinkUp. If those hold, drive a real data-path test: see fabric bring-up benchmarking for ib_write_bw and NCCL all-reduce bandwidth (the only proof GPUDirect RDMA is actually on the wire, not just loaded).
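The "every fabric port" expectation is stricter than the "at least one port" validation task, so a small filter can help when checking by hand. The sketch below reads `ibstat` output and lists only the ports that are not both `State: Active` and `Physical state: LinkUp`. The function name `ib_unhealthy_ports` is hypothetical, and the parsing assumes the `CA '<name>'` / `Port N:` / `State:` / `Physical state:` layout shown in the manual check above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the role.
# Reads ibstat output on stdin and prints "ca port state physical_state" for
# every port that is NOT Active + LinkUp. Prints nothing when all ports are
# healthy. ASSUMPTION: "Physical state:" follows "State:" within each port.
ib_unhealthy_ports() {
  awk '
    /^CA /                   { ca = $2; gsub("\047", "", ca) }   # strip quotes around CA name
    /^[ \t]*Port [0-9]+:/    { port = $2; sub(/:$/, "", port) }
    /^[ \t]*State:/          { state = $2 }
    /^[ \t]*Physical state:/ {
      phys = $3
      if (state != "Active" || phys != "LinkUp")
        print ca, port, state, phys
    }
  '
}
```

Run it as `ibstat | ib_unhealthy_ports`. Empty output matches the expected signal; any line names a port to chase through the Failure modes table before moving on to the data-path test.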
Failure modes
| Symptom | Likely cause | Runbook |
|---|---|---|
| `modprobe nvidia-peermem` returns EINVAL; `lsmod` shows no `nvidia_peermem` | GPU driver compiled before DOCA-OFED, so peer-memory APIs absent. | kernel/GPU missing: reinstall/rebuild driver after OFED. |
| Module loaded but NCCL falls back to host-staged copies (low bandwidth) | `NCCL_NET_GDR_LEVEL` too restrictive, or ACS re-enabled breaking P2P. | NCCL hang/slow; re-run service-acs-disable. |
| `ibstat` shows `State: Down` or `Physical state: Polling` | No subnet manager, bad cable, or port not enabled on the switch. | fabric-manager failure. |
| RoCE path: handshake works but throughput collapses | Wrong `NCCL_IB_GID_INDEX` for the RoCEv2 GID; confirm with `show_gids`. | NCCL hang/slow. |
| `nv_peer_mem` and `nvidia_peermem` both present | Legacy out-of-tree module conflicts; only one loads. | Remove `nv_peer_mem` package; see kernel/GPU missing. |
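For the first and last rows, it helps to know which peer-memory module actually won the load. A minimal sketch, assuming standard `lsmod` output with the module name in the first column; the function name `peermem_state` and its four labels are hypothetical, not something the role or the driver emits.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the role.
# Reads lsmod output on stdin and prints one label:
#   in-tree  only nvidia_peermem is loaded (the expected state)
#   legacy   only the out-of-tree nv_peer_mem is loaded
#   both     both names appear in lsmod
#   none     neither module is loaded
peermem_state() {
  awk '
    $1 == "nvidia_peermem" { in_tree = 1 }
    $1 == "nv_peer_mem"    { legacy = 1 }
    END {
      if (in_tree && legacy) print "both"
      else if (in_tree)      print "in-tree"
      else if (legacy)       print "legacy"
      else                   print "none"
    }
  '
}
```

Run it as `lsmod | peermem_state`. Anything other than `in-tree` points at the corresponding row above: `none` at the driver/OFED ordering problem, `legacy` or `both` at a leftover `nv_peer_mem` package.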
References
- GPUDirect RDMA peer-memory client (`nvidia-peermem`; `lsmod` shows `nvidia_peermem`): https://docs.nvidia.com/cuda/gpudirect-rdma/ (states `nv_peer_mem` is "deprecated when running GPU drivers from the R470 branch and newer").
- `nvidia-peermem` README ("now included with the NVIDIA Linux GPU driver"; the GitHub `nv_peer_mem` project "should be considered deprecated"): https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/nvidia-peermem.html
- GPU Operator RDMA verification (example `lsmod | grep nvidia` output with `nvidia_peermem`): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-rdma.html
- NVIDIA DOCA-OFED host installation and upgrade (`doca-ofed`/`doca-all` apt profiles): https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
- NCCL environment variables (`NCCL_IB_HCA`, `NCCL_IB_GID_INDEX`, `NCCL_NET_GDR_LEVEL`): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
- `community.general.modprobe` module (`state`, `persistent` parameters): https://docs.ansible.com/ansible/latest/collections/community/general/modprobe_module.html
- `ansible.builtin.template` module: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html
- `ansible.builtin.apt` module: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
- InfiniBand `ibstat` port states (`State: Active`, `Physical state: LinkUp`): https://docs.oracle.com/cd/E19914-01/820-6705-10/appendix2.html
Related: Node & Fabric Bring-Up · role: nvidia_stack · role: validate_health · Fabric Benchmarking · Glossary