Ansible role: base_tuning
Scope: host prep before any NVIDIA package lands. Blacklist nouveau, set the GRUB kernel cmdline (IOMMU mode per platform, pci=realloc for large GPU BARs), pin the CPU governor to performance, reserve hugepages, apply RDMA/network sysctls, with handlers that run update-grub, rebuild the initramfs, and reboot only when something actually changed.
This is the first role in the site.yml from the bring-up hub; it lays the kernel-space and boot-time groundwork that role: nvidia_stack and role: rdma_fabric depend on. The kernel-module mechanics it touches (nouveau, DKMS, initramfs) are explained in kernel modules. Reference template, not hardware-tested. Pin every value to your platform and validate on one canary node before a fleet roll.
flowchart LR
IMAGE["Fresh OS image"] --> NOUVEAU["Blacklist nouveau"]
NOUVEAU --> GRUB["GRUB cmdline: IOMMU, pci=realloc"]
GRUB --> GOV["CPU governor: performance"]
GOV --> HP["Hugepages + sysctl"]
HP --> HANDLERS["Handlers: update-grub, initramfs, reboot"]
HANDLERS --> READY["Reboot once, ready for nvidia_stack"]
What it does
base_tuning converges a freshly-imaged node onto a uniform kernel and boot-time baseline so the driver stack installs cleanly and GPUDirect/RDMA paths perform. It does five things, all idempotent:
- Blacklist nouveau. The in-tree nouveau driver binds NVIDIA GPUs at boot and blocks the nvidia modules from loading; the blacklist must live in the initramfs so it applies at the first boot stage. See kernel modules and runbook: kernel/GPU missing.
- Set the GRUB kernel cmdline. Enable the IOMMU for the host CPU vendor (intel_iommu=on or amd_iommu=on), put it in passthrough mode (iommu=pt), which the upstream x86 IOMMU page describes as a 1:1 IOMMU mapping and the AMD acceptance page frames as letting the adapter bypass DMA translation to memory for performance, and add pci=realloc so the kernel can reassign larger BARs than the BIOS allocated, required by GPUs with large VRAM windows.
- Pin the CPU governor to performance via a one-shot systemd unit, so cores run at top frequency instead of scaling down under bursty collective traffic.
- Reserve hugepages (vm.nr_hugepages) for large-model and RDMA buffers.
- Apply network/RDMA sysctls that the fabric layer expects.
Each mutation that needs a boot to take effect notifies a handler. The role reboots at most once, and only if GRUB, the initramfs, or the blacklist changed; a no-op re-run reboots nothing.
base_tuning deliberately does not disable PCIe ACS. Mainline kernels cannot override ACS from a boot parameter, so it is handled at runtime, on every boot, by service: acs_disable.
Variables
Role defaults live in roles/base_tuning/defaults/main.yml; inventory-level keys (gpu_tier, driver_branch) are set on the [gpu_nodes:vars] group in the hub inventory. base_iommu_platform is the one knob this role adds.
| Variable | Scope | Default | Purpose |
|---|---|---|---|
| base_iommu_platform | role | intel | Host CPU vendor: intel -> intel_iommu=on, amd -> amd_iommu=on. Selects the IOMMU enable flag. |
| base_iommu_mode | role | pt | IOMMU mode appended as iommu={{ }}. pt (passthrough) for bare-metal GPU hosts; set empty to omit. |
| base_pci_realloc | role | true | Append pci=realloc so the kernel reassigns larger GPU BARs than the BIOS sized. Platform-dependent: verify against your platform's guidance (some need pci=realloc=off). |
| base_extra_cmdline | role | "" | Extra space-separated kernel args appended verbatim (e.g. processor.max_cstate=0). |
| base_governor | role | performance | CPU scaling governor written by the one-shot unit. |
| base_cpupower_package | role | linux-cpupower | Package that provides /usr/bin/cpupower on Debian/Ubuntu; use kernel-tools on RHEL-family images. |
| base_nr_hugepages | role | 2048 | Persistent 2 MiB hugepages reserved via vm.nr_hugepages (2048 x 2 MiB = 4 GiB). |
| base_sysctl | role | see below | Dict of sysctl key/value pairs applied persistently (RDMA/network tuning). |
| base_reboot_timeout | role | 900 | Seconds the reboot handler waits for the node to come back. |
| gpu_tier | inventory | datacenter | Carried from the hub; reserved for future tier-conditional tuning. Not branched on here. |
# roles/base_tuning/defaults/main.yml
base_iommu_platform: intel # intel | amd
base_iommu_mode: pt # pt | "" (empty to omit iommu=)
base_pci_realloc: true
base_extra_cmdline: ""
base_governor: performance
base_cpupower_package: linux-cpupower
base_nr_hugepages: 2048
base_reboot_timeout: 900
base_sysctl:
  vm.swappiness: "10"
  net.core.somaxconn: "4096"
  net.ipv4.tcp_mtu_probing: "1"   # tolerate jumbo-frame path-MTU on RoCE fabrics
The full cmdline string is assembled once from these defaults:
# roles/base_tuning/vars/main.yml
_base_iommu_flag: "{{ 'intel_iommu=on' if base_iommu_platform == 'intel' else 'amd_iommu=on' }}"
_base_cmdline: >-
  {{ _base_iommu_flag }}
  {{ ('iommu=' ~ base_iommu_mode) if base_iommu_mode else '' }}
  {{ 'pci=realloc' if base_pci_realloc else '' }}
  {{ base_extra_cmdline }}
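To illustrate how the assembly above resolves, here is a minimal sketch of an override for an AMD host. The file path and group name are hypothetical (they do not come from the hub inventory); the variable names are the role defaults listed above.

```yaml
# inventory/group_vars/gpu_nodes_amd.yml  (hypothetical path and group name)
base_iommu_platform: amd               # selects amd_iommu=on
base_iommu_mode: pt                    # appended as iommu=pt
base_pci_realloc: false                # do not append pci=realloc
base_extra_cmdline: "pci=realloc=off"  # only if your platform guide asks for it

# With these values, and after the whitespace collapse applied in the GRUB
# task, the assembled cmdline is:
#   amd_iommu=on iommu=pt pci=realloc=off
```

Group variables take precedence over role defaults, so nothing inside the role needs to change for a per-platform override like this.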
Tasks
# roles/base_tuning/tasks/main.yml
- name: Blacklist nouveau (must be in the initramfs to win the boot race)
  ansible.builtin.copy:
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    mode: "0644"
    content: |
      blacklist nouveau
      options nouveau modeset=0
  notify: rebuild initramfs

- name: Set GRUB kernel cmdline (IOMMU, pci=realloc, extras)
  ansible.builtin.lineinfile:
    path: /etc/default/grub
    regexp: '^GRUB_CMDLINE_LINUX='
    line: 'GRUB_CMDLINE_LINUX="{{ _base_cmdline | regex_replace("\\s+", " ") | trim }}"'
    backup: true
  notify:
    - update grub
    - reboot node

- name: Install cpupower for governor management
  ansible.builtin.apt:
    name: "{{ base_cpupower_package }}"
    state: present
    update_cache: true

- name: Install one-shot unit that pins the CPU governor
  ansible.builtin.copy:
    dest: /etc/systemd/system/cpu-performance.service
    mode: "0644"
    content: |
      [Unit]
      Description=Pin CPU scaling governor to {{ base_governor }}
      After=multi-user.target

      [Service]
      Type=oneshot
      ExecStart=/usr/bin/cpupower frequency-set -g {{ base_governor }}
      RemainAfterExit=yes

      [Install]
      WantedBy=multi-user.target
  notify: reload systemd

- name: Enable and start the governor unit
  ansible.builtin.systemd_service:
    name: cpu-performance.service
    enabled: true
    state: started
    daemon_reload: true

- name: Reserve persistent hugepages
  ansible.posix.sysctl:
    name: vm.nr_hugepages
    value: "{{ base_nr_hugepages }}"
    sysctl_file: /etc/sysctl.d/90-gpu-hugepages.conf
    sysctl_set: true
    reload: true

- name: Apply RDMA / network sysctls
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/90-gpu-tuning.conf
    sysctl_set: true
    reload: true
  loop: "{{ base_sysctl | dict2items }}"
# roles/base_tuning/handlers/main.yml
- name: update grub
  ansible.builtin.command: update-grub   # Debian/Ubuntu; RHEL: grub2-mkconfig -o /boot/grub2/grub.cfg
  changed_when: true
  notify: reboot node

- name: rebuild initramfs
  ansible.builtin.command: update-initramfs -u   # Debian/Ubuntu; RHEL: dracut --force
  changed_when: true
  notify: reboot node

- name: reload systemd
  ansible.builtin.systemd_service:
    daemon_reload: true

- name: reboot node
  ansible.builtin.reboot:
    reboot_timeout: "{{ base_reboot_timeout }}"
Idempotency notes. copy and lineinfile are convergent: a matching file or line reports ok, not changed, so the notify does not fire and the node does not reboot on a clean re-run. ansible.posix.sysctl writes the value to sysctl_file and, with sysctl_set: true, applies it live, reporting changed only when the on-disk value differs. The command handlers are inherently non-convergent, so they are reachable only via notify from a task that itself changed; changed_when: true is correct there because running update-grub or update-initramfs always rewrites its output. Multiple notifications of reboot node collapse into a single reboot when handlers run at the end of the play.
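Because notified handlers normally run only after every task in the play has finished, a later role in the same play would start before the reboot. If the reboot has to complete before the next role begins, one option is to flush handlers as the role's last task. This is a sketch of an optional addition, not part of the task list shown above:

```yaml
# roles/base_tuning/tasks/main.yml (optional last task; an assumption, not in the role as shown)
- name: Run pending handlers now so the reboot precedes the next role
  ansible.builtin.meta: flush_handlers
```

On a clean re-run nothing has been notified, so the flush is a no-op and the node still does not reboot.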
base_tuning depends on the ansible.posix collection (the sysctl module). Pin it in requirements.yml.
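A minimal sketch of such a requirements file follows. The version constraint is an assumption for illustration; pin to the release you actually validated on your canary node.

```yaml
# requirements.yml
collections:
  - name: ansible.posix
    version: ">=1.5.0"   # assumed constraint, not taken from the role; replace with your tested release
```

Install it on the control node with `ansible-galaxy collection install -r requirements.yml` before the first run.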
Apply & verify
Run the whole node bring-up (the hub site.yml applies base_tuning first), or target just this role with a tag:
# whole bring-up
ansible-playbook -i inventory/hosts.ini site.yml --limit gpu-07.dc1.internal
# this role only, if tagged `base` in site.yml
ansible-playbook -i inventory/hosts.ini site.yml --tags base --limit gpu-07.dc1.internal
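The `--tags base` form only selects this role if the role entry in site.yml carries that tag. A minimal sketch of such an entry, assuming a single play against the gpu_nodes group; the play name and layout are hypothetical, and only the role name, group name, and tag come from this page:

```yaml
# site.yml (excerpt, hypothetical layout)
- name: GPU node bring-up
  hosts: gpu_nodes
  become: true
  roles:
    - role: base_tuning
      tags: [base]   # lets --tags base select only this role
    # later roles (nvidia_stack, rdma_fabric, ...) follow here
```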
After the role reboots the node, confirm the signals it is responsible for. Validation is read-only:
NODE=gpu-07.dc1.internal
# 1. Kernel cmdline carries the IOMMU + pci=realloc args set by GRUB.
ssh "$NODE" "cat /proc/cmdline"
# expect (Intel host): ... intel_iommu=on iommu=pt pci=realloc ...
ssh "$NODE" "grep -Eo 'intel_iommu=on|amd_iommu=on|iommu=pt|pci=realloc' /proc/cmdline"
# 2. Every CPU is on the performance governor.
ssh "$NODE" "cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u"
# expect a single line: performance
ssh "$NODE" "cpupower frequency-info -p" # cross-check: 'The governor "performance" ...'
# 3. nouveau is gone, hugepages are reserved.
ssh "$NODE" "lsmod | grep -c nouveau" # expect 0
ssh "$NODE" "grep HugePages_Total /proc/meminfo" # expect HugePages_Total: 2048
Expected signal, all three together: /proc/cmdline shows the IOMMU flag for the platform plus pci=realloc; scaling_governor collapses to the single value performance; nouveau count is 0. As an Ansible-side assertion you can fold the same checks into role: validate_health so a regressed node fails the play instead of silently drifting.
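The same checks can be sketched as Ansible tasks, for example inside validate_health. The registered variable names are hypothetical, the iommu=pt check assumes base_iommu_mode is left at pt, and the default() fallbacks cover the case where the role defaults are not in scope:

```yaml
# Read-only checks; every command task is marked changed_when: false.
- name: Read the booted kernel cmdline
  ansible.builtin.command: cat /proc/cmdline
  register: _cmdline
  changed_when: false

- name: Read the distinct scaling governors
  ansible.builtin.shell: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u
  register: _governors
  changed_when: false

- name: Read the hugepage pool size
  ansible.builtin.command: cat /proc/sys/vm/nr_hugepages
  register: _hugepages
  changed_when: false

- name: Check whether nouveau is loaded
  ansible.builtin.stat:
    path: /sys/module/nouveau
  register: _nouveau

- name: Assert the base_tuning signals
  ansible.builtin.assert:
    that:
      - "'iommu=pt' in _cmdline.stdout.split()"
      - "_governors.stdout_lines == [base_governor | default('performance')]"
      - "_hugepages.stdout | int == base_nr_hugepages | default(2048) | int"
      - "not _nouveau.stat.exists"
    fail_msg: "base_tuning baseline drifted on {{ inventory_hostname }}"
```

A node without cpufreq policy files yields an empty governor list, so the assertion fails there rather than passing silently.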
Failure modes
- Two GRUB_CMDLINE_LINUX= lines in /etc/default/grub (a prior hand-edit plus the lineinfile): update-grub honors only the last, so args are silently dropped. The anchored regexp rewrites the canonical line; remove stray duplicates by hand once.
- Blacklist written but initramfs not rebuilt -> nouveau still wins the next boot and nvidia cannot bind. Caused by the rebuild initramfs handler not firing (file already matched) on a node whose initramfs predates the file. Force it: update-initramfs -u (Debian) / dracut --force (RHEL), then reboot. Runbook: kernel/GPU missing.
- pci=realloc wrong for the platform. On some hosts reallocation fails and the kernel does not restore BIOS-assigned BARs, so GPUs misbehave after reboot; other Intel/AMD GPU systems instead need pci=realloc=off. Set base_pci_realloc: false, add base_extra_cmdline: "pci=realloc=off" if the platform needs reallocation explicitly disabled, and verify against your platform's acceptance guide. Runbook: kernel/GPU missing.
- Governor not performance after reboot: the node has no cpufreq policy (governor files absent under virtualization/some BIOS power profiles), or cpupower is not installed. Confirm ls /sys/devices/system/cpu/cpu0/cpufreq exists and install the linux-cpupower (Debian) / kernel-tools (RHEL) package.
- IOMMU flag set but inactive: dmesg | grep -e DMAR -e IOMMU shows nothing because VT-d/AMD-Vi is disabled in firmware. This is a BIOS/BMC fix, not a kernel-cmdline fix; see OOB/BMC management.
- Non-idempotent drift: replacing the command handlers with raw shell tasks in the main task list (not behind notify) reboots on every run. Keep boot-affecting commands as handlers.
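The duplicate-line failure mode can be caught before the GRUB task runs. A minimal sketch of such a pre-check; the task and the registered variable name are hypothetical and not part of the role as shown:

```yaml
- name: Count GRUB_CMDLINE_LINUX= assignments
  ansible.builtin.command: grep -c '^GRUB_CMDLINE_LINUX=' /etc/default/grub
  register: _grub_lines
  changed_when: false
  failed_when: _grub_lines.rc > 1   # grep exits 1 on zero matches; 2 or more means a real error

- name: Fail early on duplicate cmdline assignments
  ansible.builtin.assert:
    that:
      - "_grub_lines.stdout | int <= 1"
    fail_msg: "/etc/default/grub has {{ _grub_lines.stdout }} GRUB_CMDLINE_LINUX= lines; remove the extras by hand"
```

Failing here keeps the role from rewriting one line while another assignment in the same file still decides what the kernel boots with.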
References
- The kernel's command-line parameters (intel_iommu, amd_iommu, iommu=pt, pci=realloc): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
- x86 IOMMU support (iommu=pt passthrough semantics): https://docs.kernel.org/arch/x86/iommu.html
- CPU performance scaling (governors and the scaling_governor sysfs path): https://docs.kernel.org/admin-guide/pm/cpufreq.html
- HugeTLB pages (vm.nr_hugepages, HugePages_Total): https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
- AMD Instinct acceptance, kernel parameters (iommu=pt, intel_iommu=on; and the pci=realloc=off caveat: this page recommends disabling reallocation on Instinct systems "enabling Linux to clearly detect all GPUs", the opposite of base_pci_realloc: true; see the failure mode above): https://instinct.docs.amd.com/projects/system-acceptance/en/latest/common/kernel-parameters.html
- ansible.builtin.lineinfile: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/lineinfile_module.html
- ansible.builtin.copy: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/copy_module.html
- ansible.builtin.systemd_service: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html
- ansible.builtin.reboot: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/reboot_module.html
- ansible.posix.sysctl: https://docs.ansible.com/projects/ansible/latest/collections/ansible/posix/sysctl_module.html
- NVIDIA driver installation guide, kernel modules (nouveau blacklist, initramfs): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
- cpupower (Ubuntu Server: set governor, enable at boot): https://documentation.ubuntu.com/server/explanation/performance/perf-tune-cpupower/
Related: Bring-up hub · nvidia_stack · rdma_fabric · acs_disable · validate_health · Inventory model · Kernel modules · Kernel/GPU missing runbook · Glossary