Ansible role: base_tuning

Scope: host prep before any NVIDIA package lands. Blacklist nouveau, set the GRUB kernel cmdline (IOMMU mode per platform, pci=realloc for large GPU BARs), pin the CPU governor to performance, reserve hugepages, apply RDMA/network sysctls, with handlers that run update-grub, rebuild the initramfs, and reboot only when something actually changed.

This is the first role in the site.yml from the bring-up hub; it lays the kernel-space and boot-time groundwork that role: nvidia_stack and role: rdma_fabric depend on. The kernel-module mechanics it touches (nouveau, DKMS, initramfs) are explained in kernel modules. Reference template, not hardware-tested. Pin every value to your platform and validate on one canary node before a fleet roll.

flowchart LR
  IMAGE["Fresh OS image"] --> NOUVEAU["Blacklist nouveau"]
  NOUVEAU --> GRUB["GRUB cmdline: IOMMU, pci=realloc"]
  GRUB --> GOV["CPU governor: performance"]
  GOV --> HP["Hugepages + sysctl"]
  HP --> HANDLERS["Handlers: update-grub, initramfs, reboot"]
  HANDLERS --> READY["Reboot once, ready for nvidia_stack"]

What it does

base_tuning converges a freshly-imaged node onto a uniform kernel and boot-time baseline so the driver stack installs cleanly and GPUDirect/RDMA paths perform. It does five things, all idempotent:

  1. Blacklist nouveau. The in-tree nouveau driver binds NVIDIA GPUs at boot and blocks the nvidia modules from loading; the blacklist must live in the initramfs so it applies at the first boot stage. See kernel modules and runbook: kernel/GPU missing.
  2. Set the GRUB kernel cmdline. Enable the IOMMU for the host CPU vendor (intel_iommu=on or amd_iommu=on) and put it in passthrough mode (iommu=pt). The upstream x86 IOMMU page describes passthrough as a 1:1 IOMMU mapping, and the AMD acceptance page frames it as letting the adapter bypass DMA translation to memory for performance. Add pci=realloc so the kernel can reassign larger BARs than the BIOS allocated, which GPUs with large VRAM windows require.
  3. Pin the CPU governor to performance via a one-shot systemd unit, so cores run at top frequency instead of scaling down under bursty collective traffic.
  4. Reserve hugepages (vm.nr_hugepages) for large-model and RDMA buffers.
  5. Apply network/RDMA sysctls that the fabric layer expects.

Each mutation that needs a boot to take effect notifies a handler. The role reboots at most once, and only if GRUB, the initramfs, or the blacklist changed; a no-op re-run reboots nothing.

base_tuning deliberately does not disable PCIe ACS. Mainline kernels cannot override ACS from a boot parameter, so it is handled at runtime, on every boot, by service: acs_disable.

Variables

Role defaults live in roles/base_tuning/defaults/main.yml; inventory-level keys (gpu_tier, driver_branch) are set on the [gpu_nodes:vars] group in the hub inventory. base_iommu_platform is the one knob this role adds.

| Variable | Scope | Default | Purpose |
| --- | --- | --- | --- |
| base_iommu_platform | role | intel | Host CPU vendor: intel -> intel_iommu=on, amd -> amd_iommu=on. Selects the IOMMU enable flag. |
| base_iommu_mode | role | pt | IOMMU mode appended as iommu=<mode>. pt (passthrough) for bare-metal GPU hosts; set empty to omit. |
| base_pci_realloc | role | true | Append pci=realloc so the kernel reassigns larger GPU BARs than the BIOS sized. Platform-dependent: verify against your platform's guidance (some need pci=realloc=off). |
| base_extra_cmdline | role | "" | Extra space-separated kernel args appended verbatim (e.g. processor.max_cstate=0). |
| base_governor | role | performance | CPU scaling governor written by the one-shot unit. |
| base_cpupower_package | role | linux-cpupower | Package that provides /usr/bin/cpupower on Debian. On Ubuntu, cpupower usually comes from linux-tools-common plus the linux-tools package for the running kernel; use kernel-tools on RHEL-family images. |
| base_nr_hugepages | role | 2048 | Persistent 2 MiB hugepages reserved via vm.nr_hugepages (2048 x 2 MiB = 4 GiB). |
| base_sysctl | role | see below | Dict of sysctl key/value pairs applied persistently (RDMA/network tuning). |
| base_reboot_timeout | role | 900 | Seconds the reboot handler waits for the node to come back. |
| gpu_tier | inventory | datacenter | Carried from the hub; reserved for future tier-conditional tuning. Not branched on here. |

# roles/base_tuning/defaults/main.yml
base_iommu_platform: intel      # intel | amd
base_iommu_mode: pt             # pt | "" (empty to omit iommu=)
base_pci_realloc: true
base_extra_cmdline: ""
base_governor: performance
base_cpupower_package: linux-cpupower
base_nr_hugepages: 2048
base_reboot_timeout: 900
base_sysctl:
  vm.swappiness: "10"
  net.core.somaxconn: "4096"
  net.ipv4.tcp_mtu_probing: "1"   # tolerate jumbo-frame path-MTU on RoCE fabrics
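
The base_nr_hugepages default of 2048 is a page count, not a size, so it only equals 4 GiB when the host's hugepage size is 2 MiB. A minimal sketch of the conversion, assuming you size the reservation in whole GiB; the helper names hugepages_for_gib and default_hugepage_kib are hypothetical and not part of the role:

```shell
#!/bin/sh
# Hypothetical helper: convert a target reservation in GiB into a page count
# for a given hugepage size in KiB (2048 KiB = 2 MiB on most x86_64 hosts).
hugepages_for_gib() {
    target_gib=$1
    page_kib=$2
    # 1 GiB = 1048576 KiB; the division is exact for 2 MiB and 1 GiB pages.
    echo $(( target_gib * 1048576 / page_kib ))
}

# Hypothetical helper: read the host's default hugepage size from
# /proc/meminfo (the Hugepagesize value is reported in kB).
default_hugepage_kib() {
    awk '/^Hugepagesize:/ {print $2}' /proc/meminfo
}

hugepages_for_gib 4 2048   # prints 2048, matching the role default
```

On a node, `hugepages_for_gib 4 "$(default_hugepage_kib)"` gives the value to put in base_nr_hugepages for a 4 GiB reservation at that host's page size.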

The full cmdline string is assembled once from these defaults:

# roles/base_tuning/vars/main.yml
_base_iommu_flag: "{{ 'intel_iommu=on' if base_iommu_platform == 'intel' else 'amd_iommu=on' }}"
_base_cmdline: >-
  {{ _base_iommu_flag }}
  {{ ('iommu=' ~ base_iommu_mode) if base_iommu_mode else '' }}
  {{ 'pci=realloc' if base_pci_realloc else '' }}
  {{ base_extra_cmdline }}
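
To sanity-check what string a given set of defaults should render to before touching a node, the same branching can be mirrored outside Ansible. This is only a sketch under the assumption that the Jinja above behaves as it reads; build_cmdline is a hypothetical helper, not something the role ships:

```shell
#!/bin/sh
# Hypothetical mirror of the Jinja logic in vars/main.yml. Arguments:
#   $1 platform  intel | amd
#   $2 mode      pt | "" (empty omits iommu=)
#   $3 realloc   true | false
#   $4 extra     extra args, appended verbatim
build_cmdline() {
    if [ "$1" = "intel" ]; then
        out="intel_iommu=on"
    else
        out="amd_iommu=on"
    fi
    if [ -n "$2" ]; then
        out="$out iommu=$2"
    fi
    if [ "$3" = "true" ]; then
        out="$out pci=realloc"
    fi
    if [ -n "$4" ]; then
        out="$out $4"
    fi
    echo "$out"
}

build_cmdline intel pt true ""
# prints: intel_iommu=on iommu=pt pci=realloc
build_cmdline amd "" false "processor.max_cstate=0"
# prints: amd_iommu=on processor.max_cstate=0
```

Because each optional piece is appended only when set, the result never carries the doubled spaces that the role's regex_replace and trim filters exist to clean up.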

Tasks

# roles/base_tuning/tasks/main.yml
- name: Blacklist nouveau (must be in the initramfs to win the boot race)
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    mode: "0644"
    content: |
      blacklist nouveau
      options nouveau modeset=0
  notify: rebuild initramfs

- name: Set GRUB kernel cmdline (IOMMU, pci=realloc, extras)
  ansible.builtin.lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="{{ _base_cmdline | regex_replace("\\s+", " ") | trim }}"'
    backup: true
  notify:
    - update grub
    - reboot node

- name: Install cpupower for governor management
  ansible.builtin.apt:
    name: "{{ base_cpupower_package }}"
    state: present
    update_cache: true

- name: Install one-shot unit that pins the CPU governor
  ansible.builtin.copy:
    dest: /etc/systemd/system/cpu-performance.service
    mode: "0644"
    content: |
      [Unit]
      Description=Pin CPU scaling governor to {{ base_governor }}
      After=multi-user.target
      [Service]
      Type=oneshot
      ExecStart=/usr/bin/cpupower frequency-set -g {{ base_governor }}
      RemainAfterExit=yes
      [Install]
      WantedBy=multi-user.target
  notify: reload systemd

- name: Enable and start the governor unit
  ansible.builtin.systemd_service:
    name: cpu-performance.service
    enabled: true
    state: started
    daemon_reload: true

- name: Reserve persistent hugepages
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "{{ base_nr_hugepages }}"
    sysctl_file: /etc/sysctl.d/90-gpu-hugepages.conf
    sysctl_set: true
    reload: true

- name: Apply RDMA / network sysctls
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/90-gpu-tuning.conf
    sysctl_set: true
    reload: true
  loop: "{{ base_sysctl | dict2items }}"

# roles/base_tuning/handlers/main.yml
- name: update grub
  ansible.builtin.command: update-grub        # Debian/Ubuntu; RHEL: grub2-mkconfig -o /boot/grub2/grub.cfg
  changed_when: true
  notify: reboot node

- name: rebuild initramfs
  ansible.builtin.command: update-initramfs -u   # Debian/Ubuntu; RHEL: dracut --force
  changed_when: true
  notify: reboot node

- name: reload systemd
  ansible.builtin.systemd_service:
    daemon_reload: true

- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: "{{ base_reboot_timeout }}"

Idempotency notes. copy and lineinfile are convergent: a matching file or line reports ok, not changed, so the notify does not fire and the node does not reboot on a clean re-run. ansible.posix.sysctl writes the value to sysctl_file and, with sysctl_set: true, applies it live, reporting changed only when the on-disk value differs. The command handlers are inherently non-convergent, so they are reachable only via notify from a task that itself changed; changed_when: true is correct there because update-grub and update-initramfs rewrite their output on every run. Multiple notifications of reboot node within a play collapse into a single reboot when handlers are flushed at the end of the play.
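
One way to test the clean re-run claim on a canary node is to run the play a second time and inspect the PLAY RECAP. The sketch below assumes the default ansible-playbook recap format (key=value counters per host); recap_is_clean is a hypothetical helper, and any task that reports changed on a no-op run will make it fail:

```shell
#!/bin/sh
# Hypothetical helper: read ansible-playbook output on stdin and succeed only
# if the PLAY RECAP lists at least one host and every host line reports
# changed=0, failed=0 and unreachable=0.
recap_is_clean() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && /changed=/ {
            hosts++
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if ((kv[1] == "changed" || kv[1] == "failed" || kv[1] == "unreachable") && kv[2] + 0 != 0)
                    dirty++
            }
        }
        END { if (hosts > 0 && dirty == 0) exit 0; exit 1 }
    '
}

# Usage on the second run (not executed here):
#   ansible-playbook -i inventory/hosts.ini site.yml --tags base \
#       --limit gpu-07.dc1.internal | recap_is_clean && echo "idempotent"
```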

base_tuning depends on the ansible.posix collection (the sysctl module). Declare it in requirements.yml; the entry below is unpinned, so add a version: constraint for the release you have validated:

# requirements.yml
collections:
  - name: ansible.posix

Apply & verify

Run the whole node bring-up (the hub site.yml applies base_tuning first), or target just this role with a tag:

# whole bring-up
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal

# this role only, if tagged `base` in site.yml
ansible-playbook -i inventory/hosts.ini site.yml --tags base --limit gpu-07.dc1.internal

After the role reboots the node, confirm the three signals it is responsible for. Validation is read-only:

NODE=gpu-07.dc1.internal

# 1. Kernel cmdline carries the IOMMU + pci=realloc args set by GRUB.
ssh "$NODE" "cat /proc/cmdline"
# expect (Intel host): ... intel_iommu=on iommu=pt pci=realloc ...
ssh "$NODE" "grep -Eo 'intel_iommu=on|amd_iommu=on|iommu=pt|pci=realloc' /proc/cmdline"

# 2. Every CPU is on the performance governor.
ssh "$NODE" "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u"
# expect a single line: performance
ssh "$NODE" "cpupower frequency-info -p"   # cross-check: 'The governor "performance" ...'

# 3. nouveau is gone, hugepages are reserved.
ssh "$NODE" "lsmod | grep -c nouveau"             # expect 0
ssh "$NODE" "grep HugePages_Total /proc/meminfo"  # expect HugePages_Total: 2048

Expected signal, all three together: /proc/cmdline shows the IOMMU flag for the platform plus pci=realloc; scaling_governor collapses to the single value performance; nouveau count is 0 and HugePages_Total equals base_nr_hugepages. As an Ansible-side assertion you can fold the same checks into role: validate_health so a regressed node fails the play instead of silently drifting.
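
The manual checks above can also be wrapped so they return an exit code instead of relying on reading the output. This is a rough sketch assuming the role defaults (iommu=pt, pci=realloc enabled); check_cmdline and check_governor are hypothetical helpers, not part of validate_health:

```shell
#!/bin/sh
# Hypothetical helper: given a kernel cmdline string and the platform
# (intel | amd), succeed only if every expected argument is present as a
# whole word. Reports the first missing argument on stderr.
check_cmdline() {
    cmdline=$1
    platform=$2
    for want in "${platform}_iommu=on" "iommu=pt" "pci=realloc"; do
        case " $cmdline " in
            *" $want "*) ;;
            *) echo "missing: $want" >&2; return 1 ;;
        esac
    done
}

# Hypothetical helper: succeed only if stdin (one governor per line) holds
# exactly one distinct value and it equals the expected governor.
check_governor() {
    [ "$(sort -u)" = "$1" ]
}

# Usage against a node (not executed here):
#   check_cmdline "$(ssh "$NODE" cat /proc/cmdline)" intel
#   ssh "$NODE" 'cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor' \
#       | check_governor performance
```

Matching on whole words matters here: a cmdline carrying pci=realloc=off would otherwise pass a plain substring test for pci=realloc.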

Failure modes

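  • Two GRUB_CMDLINE_LINUX= lines in /etc/default/grub (a prior hand-edit plus the managed line): the file is sourced as shell, so only the last assignment takes effect, and lineinfile replaces only the last line that matches the anchored regexp. A later assignment the regexp does not match (an indented line, or a drop-in under /etc/default/grub.d/ on Debian/Ubuntu) silently overrides the role's args. Remove stray duplicates by hand once.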
  • Blacklist written but initramfs not rebuilt -> nouveau still wins the next boot and nvidia cannot bind. Caused by the rebuild initramfs handler not firing (file already matched) on a node whose initramfs predates the file. Force it: update-initramfs -u (Debian) / dracut --force (RHEL), then reboot. Runbook: kernel/GPU missing.
  • pci=realloc wrong for the platform. On some hosts reallocation fails and the kernel does not restore BIOS-assigned BARs, so GPUs misbehave after reboot; other Intel/AMD GPU systems instead need pci=realloc=off. Set base_pci_realloc: false (or base_extra_cmdline: "pci=realloc=off") and verify against your platform's acceptance guide. Runbook: kernel/GPU missing.
  • Governor not performance after reboot: the node has no cpufreq policy (governor files absent under virtualization/some BIOS power profiles), or cpupower is not installed. Confirm ls /sys/devices/system/cpu/cpu0/cpufreq exists and install the linux-cpupower (Debian) / kernel-tools (RHEL) package.
  • IOMMU flag set but inactive: dmesg | grep -e DMAR -e IOMMU shows nothing because VT-d/AMD-Vi is disabled in firmware. This is a BIOS/BMC fix, not a kernel-cmdline fix; see OOB/BMC management.
  • Non-idempotent drift: replacing the command handlers with raw shell tasks in the main task list (not behind notify) reboots on every run. Keep boot-affecting commands as handlers.
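
For the duplicate GRUB_CMDLINE_LINUX= failure mode above, a quick pre-flight count can flag affected nodes before the role runs. A minimal sketch; count_grub_cmdline is a hypothetical helper, and it deliberately tolerates leading whitespace so it also catches lines the role's anchored regexp would miss:

```shell
#!/bin/sh
# Hypothetical helper: count assignments to GRUB_CMDLINE_LINUX in a grub
# defaults file. GRUB_CMDLINE_LINUX_DEFAULT= is not counted because the
# pattern requires "=" directly after GRUB_CMDLINE_LINUX.
count_grub_cmdline() {
    grep -cE '^[[:space:]]*GRUB_CMDLINE_LINUX=' "$1"
}

# Usage on a node (not executed here):
#   [ "$(count_grub_cmdline /etc/default/grub)" -le 1 ] || echo "duplicate cmdline lines"
```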

References

  • The kernel's command-line parameters (intel_iommu, amd_iommu, iommu=pt, pci=realloc): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
  • x86 IOMMU support (iommu=pt passthrough semantics): https://docs.kernel.org/arch/x86/iommu.html
  • CPU performance scaling — governors and scaling_governor sysfs path: https://docs.kernel.org/admin-guide/pm/cpufreq.html
  • HugeTLB pages — vm.nr_hugepages, HugePages_Total: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
  • AMD Instinct acceptance, kernel parameters (iommu=pt, intel_iommu=on, and the pci=realloc=off caveat). This page recommends disabling reallocation on Instinct systems, "enabling Linux to clearly detect all GPUs", which is the opposite of base_pci_realloc: true; see the pci=realloc failure mode above: https://instinct.docs.amd.com/projects/system-acceptance/en/latest/common/kernel-parameters.html
  • ansible.builtin.lineinfile: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html
  • ansible.builtin.copy: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/copy_module.html
  • ansible.builtin.systemd_service: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html
  • ansible.builtin.reboot: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
  • ansible.posix.sysctl: https://docs.ansible.com/projects/ansible/latest/collections/ansible/posix/sysctl_module.html
  • NVIDIA driver installation guide — kernel modules (nouveau blacklist, initramfs): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
  • cpupower (Ubuntu Server — set governor, enable at boot): https://documentation.ubuntu.com/server/explanation/performance/perf-tune-cpupower/

Related: Bring-up hub · nvidia_stack · rdma_fabric · acs_disable · validate_health · Inventory model · Kernel modules · Kernel/GPU missing runbook · Glossary