
Image & config management

Scope: keeping a GPU fleet bit-for-bit consistent. That covers golden OS images with pinned driver/CUDA/firmware baselines, the immutable-image vs config-management (Ansible) split, how drift creeps in, and how to audit driver/CUDA versions across nodes to catch it. The node-by-node install mechanics live in Driver Install and Lifecycle; this page is the fleet view.

All command and config snippets below are reference templates, not hardware-tested. Pin the exact versions, package names, and branches for your target GPU against the cited NVIDIA/vendor docs before running anything in production.

What it is

A GPU node's behaviour is determined by a stack of versioned pieces that must agree across the whole fleet: the base OS and kernel, the NVIDIA driver branch (and with it libcuda.so and the GSP firmware blob, CUDA Driver), the kernel modules (NVIDIA Kernel Modules), the CUDA toolkit baked into images or supplied by containers (CUDA Toolkit and Runtime), the container/runtime layer, and on NVSwitch systems a branch-matched Fabric Manager (Fabric Manager). "Image & config management" is the discipline that makes every node carry the same versions of all of these, and keeps them there.

There are two ways to put that state on a node, and they are not mutually exclusive:

Golden image (immutable model). Build one authorised OS image (kernel, pinned driver, container runtime, agents, host tuning), test it, version it, and provision every node from it (Bare-Metal Provisioning and PXE, Provisioning Tooling). Updates produce a new image version; nodes are re-imaged, not edited in place. A golden image is "an authorized template ... intentionally minimal, secure, and reproducible," and the immutable model "eliminates configuration drift entirely because every server is born from a known-good image and never modified after deployment."¹
Config management (convergent model). Provision a thinner base image, then run Ansible (push-based, agentless over SSH, Ansible: Node and Fabric Bring-Up) to install the driver stack and converge host state to a declared spec. The same playbooks can run once at image-build time (config management used as an image builder) or continuously against live nodes.¹

Most real GPU fleets are a hybrid: a golden base image for the OS/kernel/runtime, plus Ansible (or the GPU Operator on Kubernetes) for the NVIDIA stack and per-tier specifics. The boundary you choose decides where drift can enter and how you remediate it.

flowchart LR
  SPEC["Pinned baseline: kernel, driver branch, CUDA, FM, runtime"] --> BUILD["Build and version golden image"]
  SPEC --> ANSIBLE["Ansible roles: declared host state"]
  BUILD --> PROVISION["Provision or re-image node"]
  ANSIBLE --> PROVISION
  PROVISION --> AUDIT["Audit: driver and CUDA versions across fleet"]
  AUDIT --> DRIFT{"Matches baseline?"}
  DRIFT -->|"yes"| POOL["Admit to scheduler"]
  DRIFT -->|"no"| REMEDIATE["Re-image or re-converge"]
  REMEDIATE --> AUDIT

Why it's needed (and when)

Uniform node state is the root of cluster reliability: image drift is the source of a large share of intermittent, hard-to-reproduce GPU faults. A collective job is only as healthy as its least consistent node; one box on the wrong driver branch can strand the NVLink domain (Fabric Manager is lockstep with the driver, Fabric Manager), force a NCCL fallback to PCIe, or fail a CUDA-version check mid-run. Drift also defeats reasoning about incidents: if no two nodes are identical, "it works on gpu-03 but not gpu-07" has no clean answer.

You are doing image/config management whenever you:

Stand up or expand a fleet: bake the baseline into the image so every node starts identical (Provisioning Tooling, Bare-Metal Provisioning and PXE).
Adopt a new driver/CUDA baseline fleet-wide: cut a new image version (or bump the Ansible driver_branch) and roll it node-by-node behind cordon/drain (Rolling Driver / CUDA Upgrade).
Suspect drift: a node behaving differently, or a periodic audit flags a version mismatch (Image Drift Across Fleet).
Recover a node: re-image to a known-good version rather than hand-patching it back to health.

The branch-selection policy itself (LTS vs Production, CUDA forward-compat) is out of scope here; see Driver Versions and Branches. This page assumes a baseline is chosen and concerns itself with applying it uniformly and proving it stayed applied.

How it's set up & managed

Pin a single baseline, fleet-wide

The whole approach rests on one rule: one branch per fleet. Driver / libcuda.so / Fabric Manager / GSP firmware must be the same version on every node; drift across them is the root cause behind a large share of fabric and collective faults (Reliability, RAS and Failure Modes). Express the baseline as data, in version control, so the image build and the config-management run read the same pins:

# baseline.yml -- the single source of truth, committed to git
os_image: "ubuntu-24.04"
kernel: "6.8.0-XX-generic"        # pin the kernel the modules were built against
driver_branch: "580"              # NVIDIA datacenter driver branch (verify against the live support matrix)
cuda_branch: "13-0"               # CUDA toolkit repo suffix
fabricmanager: "580"              # MUST match driver_branch on NVSwitch systems
container_toolkit: "nvidia-container-toolkit"

Branch values above are placeholders; confirm the current LTS/Production branch and its CUDA pairing against NVIDIA's live support matrix before adopting (Driver Versions and Branches).²
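
Because the image build and the config-management run both read these pins, it can help to lint the baseline before either consumes it. The following is a minimal Python sketch under stated assumptions, not part of any tooling this page describes: `check_baseline` and `REQUIRED_KEYS` are hypothetical names, the dict mirrors the keys of baseline.yml above (load the real file with whichever YAML parser you already use), and treating an `XX` segment in the kernel pin as an unfilled placeholder is an assumption of this example.

```python
# Hypothetical lint for the pinned baseline; keys mirror baseline.yml above.
REQUIRED_KEYS = ("os_image", "kernel", "driver_branch", "cuda_branch",
                 "fabricmanager", "container_toolkit")


def check_baseline(baseline: dict) -> list[str]:
    """Return human-readable problems; an empty list means the pins are coherent."""
    problems = []
    for key in REQUIRED_KEYS:
        if not str(baseline.get(key, "")).strip():
            problems.append(f"missing pin: {key}")
    # Fabric Manager is lockstep with the driver on NVSwitch systems.
    if baseline.get("fabricmanager") != baseline.get("driver_branch"):
        problems.append("fabricmanager does not match driver_branch")
    # Assumption: an 'XX' segment means the template placeholder was never filled in.
    if "XX" in str(baseline.get("kernel", "")):
        problems.append("kernel pin still contains the XX placeholder")
    return problems


if __name__ == "__main__":
    example = {
        "os_image": "ubuntu-24.04",
        "kernel": "6.8.0-XX-generic",
        "driver_branch": "580",
        "cuda_branch": "13-0",
        "fabricmanager": "580",
        "container_toolkit": "nvidia-container-toolkit",
    }
    for problem in check_baseline(example):
        print(problem)
```

Running a check like this in CI on every change to the baseline file would catch a driver bump that forgot to move Fabric Manager before any node is touched.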

Golden image: bake and version

Build the image once from the pinned baseline (driver via the NVIDIA CUDA apt repo, exactly as Driver Install and Lifecycle documents), validate it, then stamp it with a version so a node can report which image it booted. A simple, auditable convention is a manifest file written into the image at build time:

# reference template, not hardware-tested -- run inside the image build
cat >/etc/cluster-image.json <<JSON
{
  "image_version": "gpu-node-2026.06-r580",
  "built": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "kernel": "$(uname -r)",
  "driver_branch": "580",
  "cuda_branch": "13-0"
}
JSON

In the immutable model this manifest is the contract: a node either matches the current image_version or it is re-imaged. Updates are a new image version and a re-provision, never an in-place edit; that is what keeps drift out ("updates create new versions"; rollback is "to a known-good image").¹ Image building, PXE/iPXE delivery, and the provisioner that writes the image are covered in Provisioning Tooling and Bare-Metal Provisioning and PXE.

Config management: declare and converge (Ansible)

Where you converge live nodes (or build the image), Ansible installs the stack and pins the versions so the package manager cannot silently move a node off the baseline. Two mechanisms, used together: install an exact version, then hold it.

Install the branch metapackage, or pin an exact package version by appending =<version> to the name in ansible.builtin.apt:

# roles/nvidia_stack/tasks/pin.yml -- reference template, not hardware-tested
- name: Install the pinned driver metapackage (exact version)
  ansible.builtin.apt:
    name: "nvidia-open-{{ driver_branch }}"   # branch metapackage; or pin "name=version" exactly
    state: present
    update_cache: true
  notify: reboot node

To find the exact version string for a strict name=version pin, run apt-cache policy <package> on a target host and use what apt reports. Then prevent unattended-upgrade from moving the package with ansible.builtin.dpkg_selections, whose documented selection choices are install, hold, deinstall, purge:³

# Hold the stack on the baseline so unattended-upgrade cannot bump it
- name: Hold driver, CUDA driver, and Fabric Manager on the baseline
  ansible.builtin.dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - "nvidia-open-{{ driver_branch }}"
    - "cuda-drivers"
    - "nvidia-fabricmanager-{{ fabricmanager }}"

dpkg_selections changes selection state only: it never installs or removes; pair it with the apt task above for the actual install.³ This is the fleet-scale counterpart to the per-node apt-mark hold shown in Driver Install and Lifecycle: silent unattended bumps are a real source of FM/driver mismatch on Ubuntu, and the hold is what stops them.
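
A hold that was never applied, or was cleared by hand, is invisible until the next unattended bump, so it is worth verifying. The sketch below is a hedged Python example: it assumes you have captured the text printed by `dpkg --get-selections` on a node (two whitespace-separated columns, package name then selection state), and `missing_holds` is a hypothetical helper name, not an Ansible or dpkg API.

```python
def missing_holds(selections_text: str, required: list[str]) -> list[str]:
    """Return the required packages that are not in the 'hold' selection state."""
    state = {}
    for line in selections_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, selection = parts
        state[name.split(":")[0]] = selection  # drop any ':arch' qualifier
    return [pkg for pkg in required if state.get(pkg) != "hold"]


if __name__ == "__main__":
    captured = "nvidia-open-580\t\t\thold\ncuda-drivers\t\t\tinstall\n"
    wanted = ["nvidia-open-580", "cuda-drivers", "nvidia-fabricmanager-580"]
    # Prints the packages that still need a hold on this node.
    print(missing_holds(captured, wanted))
```

A package that is absent from the selections output is reported as missing a hold, which also surfaces a node where the stack was never installed.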

Make convergence assert the baseline rather than just install it, so a re-converge fails loudly on a drifted node instead of quietly fixing it (which would hide the drift):

# roles/validate/tasks/baseline.yml -- fail the play on baseline mismatch
- name: Read installed driver version
  ansible.builtin.command:
    cmd: nvidia-smi --query-gpu=driver_version --format=csv,noheader
  register: drv
  changed_when: false

- name: Assert driver matches the pinned branch
  ansible.builtin.assert:
    that: "drv.stdout.startswith(driver_branch)"
    fail_msg: "Driver {{ drv.stdout }} does not match baseline branch {{ driver_branch }} -- drift"
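
The startswith test above is a prefix match, so it accepts any version string that merely begins with the branch digits. If you post-process audit results outside Ansible, a stricter comparison looks only at the leading dot-separated component. This is a small Python sketch of that idea; `driver_on_branch` is a hypothetical helper, and it assumes the version text is what nvidia-smi printed (one line per GPU, identical on a healthy node).

```python
def driver_on_branch(driver_version: str, branch: str) -> bool:
    """True when the leading component of the first reported version equals the branch."""
    lines = driver_version.strip().splitlines()
    if not lines:
        return False  # no output at all: treat as off-baseline
    return lines[0].split(".")[0].strip() == branch


if __name__ == "__main__":
    # The version strings here are illustrative, not real release numbers.
    print(driver_on_branch("580.1\n580.1\n", "580"))
    print(driver_on_branch("5800.1", "580"))
```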

Runtime GPU settings: not in the image, enforce with DCGM

The image and Ansible own the software versions; several GPU runtime settings (compute mode, power limit, clocks) are volatile (they reset when a GPU goes idle or is reset), so baking them into an image does not keep them set. DCGM exists to hold these: "Once a configuration has been defined DCGM will enforce that configuration, for example across driver restarts, GPU resets or at job start."⁴ Define the target once per node (e.g. via a boot-time unit), so config state is declared rather than ad hoc:

# reference template, not hardware-tested -- requires root (DCGM config management is not allowed as non-root)
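dcgmi group -c gpu_group            # create an empty group (returns a group id, e.g. 1)
dcgmi group -g 1 -a 0,1             # add GPUs to the group (ids 0,1 are examples; list yours with dcgmi discovery -l)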
dcgmi config -g 1 --set -c 2        # set a target: compute mode EXCLUSIVE_PROCESS (-c 2) on group 1
dcgmi config -g 1 --get             # list the target vs current configuration

Compute-mode value -c 2 and the --set/--get forms above are the documented dcgmi config examples;⁴ confirm power-limit and clock flags against dcgmi config --help on your DCGM build before scripting them. Persistence mode is the one runtime setting better owned at the OS layer: enable the nvidia-persistenced daemon in the image/Ansible (Persistence Mode, Driver Install and Lifecycle) rather than via DCGM.

Where drift creeps in

Even with the above, drift enters through known gaps; close each one:

Unattended-upgrade / a stray apt upgrade moving the driver or kernel off the baseline. Hold the packages (above).
A kernel bump the image was never validated on: even when DKMS rebuilds the modules, the node now runs a kernel outside the baseline, and if the rebuild fails the GPU disappears. Pin the kernel, re-run convergence after any kernel change (Driver Install and Lifecycle, Kernel Upgrade, GPU Missing).
Non-idempotent shell tasks in config management that mutate instead of converge: the playbook causes drift. Make tasks idempotent and assert end state.
Manual hotfixes on one node during an incident that never make it back into the image or playbook. Re-image to fold the fix in; never leave a hand-patched node in the pool.
Double-managed driver: a host driver and the GPU Operator's driver containers both present. Pick one (NVIDIA Container Toolkit and CDI).
Firmware/BIOS settings reset by a service event (e.g. ACS re-enabled, halving P2P bandwidth) that the OS image does not own. Assert these at boot (Ansible: Node and Fabric Bring-Up).

Validated usage & tests

Reference templates, not hardware-tested. The descriptions are the expected shape of output; exact version strings depend on your baseline and are not asserted here.

Audit driver and CUDA-supported version, per node

nvidia-smi --query-gpu with --format=csv,noheader is the scriptable read. driver_version is the installed NVIDIA driver version; emit it clean for machine parsing:⁵

# reference template, not hardware-tested
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
nvidia-smi --help-query-gpu     # enumerate every queryable field

Expect one line per GPU: the model name and a non-ERR! driver version. The full nvidia-smi banner also prints a separate CUDA Version (the maximum toolkit the driver supports, not what is installed); treat it as a ceiling, never as evidence a toolkit is present (CUDA Driver). Field-by-field flag reference: nvidia-smi Reference.
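
To turn that per-node expectation into a check, the CSV rows can be reduced to a single driver version, failing on anything unreadable. The following is a minimal Python sketch, assuming the input is the captured text of the name,driver_version query above with comma-separated fields; `node_driver_version` is a hypothetical helper name and the error handling is one possible policy, not a documented nvidia-smi contract.

```python
def node_driver_version(csv_text: str) -> str:
    """Return the single driver version reported by every GPU on a node.

    Raises ValueError when there are no GPU rows, a row carries an ERR!
    marker or lacks a field, or the rows disagree with each other.
    """
    versions = set()
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Fields are comma-separated; the driver version is the last one.
        name, _, version = line.rpartition(",")
        version = version.strip()
        if not name.strip() or not version or "ERR!" in line:
            raise ValueError(f"unreadable GPU row: {line!r}")
        versions.add(version)
    if len(versions) != 1:
        raise ValueError(f"expected one driver version, got {sorted(versions)}")
    return versions.pop()


if __name__ == "__main__":
    # GPU names and version strings below are illustrative only.
    print(node_driver_version("GPU-A, 580.1\nGPU-B, 580.1\n"))
```

Raising instead of returning a default keeps a dead driver from being mistaken for a clean node further down the audit.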

Audit across the whole fleet

The point of the audit is fleet uniformity, not a single node. Fan the per-node query out with your existing remote-exec path: an Ansible ad-hoc command is one option (the same SSH inventory used for bring-up, Ansible: Node and Fabric Bring-Up):

# reference template, not hardware-tested -- one driver_version line per host
ansible gpu_nodes -i inventory/hosts.ini -m ansible.builtin.command \
  -a "nvidia-smi --query-gpu=driver_version --format=csv,noheader"

A uniform fleet returns the same driver version from every host. Any host reporting a different version, ERR!, or no GPU is drift (or a dead driver); take it to Image Drift Across Fleet. To make the expectation explicit and gate on it, collapse the fleet's answers to unique values; a single distinct line is the pass condition:

# reference template, not hardware-tested
ansible gpu_nodes -i inventory/hosts.ini -m ansible.builtin.command \
  -a "nvidia-smi --query-gpu=driver_version --format=csv,noheader" \
  | awk '/CHANGED|SUCCESS|=>/{h=$1} /^[0-9]/{print h": "$0}' | sort -u -k2

(Parse defensively: Ansible's human output is not a stable format; for a real gate, prefer -o/JSON callback or a structured fact and compare to baseline.yml.) Compare every node against the pinned driver_branch; treat any deviation as a node to re-image or re-converge before it takes work.
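
Once the per-host answers are in a structured form, the gate itself is small. This Python sketch is one way to express it, under assumptions: the mapping of host to driver version is something you have already collected (for example from a JSON callback), an empty string stands for a host that returned no usable version, and `audit_fleet` is a hypothetical name. It passes only when every host reports the same version and that version is on the pinned branch.

```python
from collections import defaultdict


def audit_fleet(version_by_host: dict[str, str], branch: str):
    """Return (passed, groups); groups maps each reported version to its hosts."""
    collected = defaultdict(list)
    for host, version in version_by_host.items():
        collected[version or "<no driver>"].append(host)
    groups = {version: sorted(hosts) for version, hosts in collected.items()}
    # Pass condition: exactly one distinct version, and it is on the baseline branch.
    only = next(iter(groups)) if len(groups) == 1 else None
    passed = only is not None and only.split(".")[0] == branch
    return passed, groups


if __name__ == "__main__":
    # Host names and versions are illustrative.
    passed, groups = audit_fleet(
        {"gpu-01": "580.1", "gpu-02": "575.9", "gpu-03": ""}, "580")
    print(passed)
    for version, hosts in sorted(groups.items()):
        print(version, hosts)
```

Returning the full grouping rather than a bare boolean means the failing run already names the hosts to re-image or re-converge.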

Confirm the image version a node booted

If you wrote an image manifest at build time (above), the cheapest drift check is reading it back and diffing against the current baseline:

# reference template, not hardware-tested
ansible gpu_nodes -i inventory/hosts.ini -m ansible.builtin.command \
  -a "cat /etc/cluster-image.json"

Expect every node to report the current image_version. A node on an older image_version is, by definition, drifted: re-image it; do not patch it forward in place (Image Drift Across Fleet).
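
The read-back can be diffed mechanically. The sketch below is a hedged Python example that assumes you have collected each host's manifest text into a mapping (how you collect it is up to your remote-exec path); `stale_image_hosts` is a hypothetical helper, and counting a missing or unparsable manifest as stale is a policy choice of this example.

```python
import json


def stale_image_hosts(manifest_by_host: dict[str, str], current_version: str) -> list[str]:
    """Return hosts whose manifest is missing, unparsable, or not on current_version."""
    stale = []
    for host, raw in sorted(manifest_by_host.items()):
        try:
            manifest = json.loads(raw)
        except json.JSONDecodeError:
            stale.append(host)  # empty or corrupt manifest
            continue
        if not isinstance(manifest, dict) or manifest.get("image_version") != current_version:
            stale.append(host)
    return stale


if __name__ == "__main__":
    # The older image_version below is a made-up example value.
    manifests = {
        "gpu-01": '{"image_version": "gpu-node-2026.06-r580"}',
        "gpu-02": '{"image_version": "gpu-node-2026.03-r575"}',
        "gpu-03": "",
    }
    print(stale_image_hosts(manifests, "gpu-node-2026.06-r580"))
```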

Deeper than versions

Version uniformity is necessary, not sufficient: a node on the right driver can still be unhealthy. Gate admission on an actual GPU diagnostic (dcgmi diag) and the fabric checks, not on nvidia-smi alone (GPU Diagnostics and Validation, GPU Health Gating). The image audit answers "is this node the right build"; the health gate answers "is this node fit for work".

Failure modes

Brief; each links the matching runbook.

A node drifted off the baseline: different driver/CUDA/kernel version than the fleet, from an unattended bump, a stray upgrade, or a manual hotfix. Audit catches it; remediate by re-image or re-converge, do not hand-patch. → Image Drift Across Fleet.
NVLink domain will not form after a driver bump: Fabric Manager left version-mismatched to the driver because the pin moved one but not the other. Keep FM lockstep with driver_branch. → Fabric Manager Failure.
No GPU after a kernel change: node booted a kernel the image was not validated on; modules stale or unsigned. → Kernel Upgrade, GPU Missing.
GSP firmware / driver mismatch from a partial in-place update instead of a clean re-image. → GSP Firmware / Driver Mismatch.
A node passes the version audit but fails work: versions matched, but the hardware is unhealthy. Run the diagnostic gate. → GPU Health Gating, GPU Diagnostics and Validation.

The end-to-end rolling procedure for moving the whole fleet to a new baseline (cordon/drain, re-image or Ansible reinstall, branch-matched Fabric Manager, reboot, validate, rollback) is Rolling Driver / CUDA Upgrade. Do not big-bang a fleet; roll one batch at a time and validate green before advancing.
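
The batch-and-validate discipline can be sketched as a small control loop. This is an illustrative Python outline only, not the procedure from Rolling Driver / CUDA Upgrade: `roll_in_batches` is a hypothetical name, and `upgrade` and `validate` are placeholder callables standing in for your own cordon/drain, re-image or re-converge, and health-gate steps.

```python
from typing import Callable, Iterable


def roll_in_batches(hosts: Iterable[str], batch_size: int,
                    upgrade: Callable[[str], object],
                    validate: Callable[[str], bool]):
    """Upgrade hosts batch by batch and stop at the first batch that is not green.

    Returns (validated, failed): hosts confirmed green, and the hosts that
    failed validation in the batch where the rollout halted (empty if none).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending = list(hosts)
    validated = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            upgrade(host)  # placeholder: drain, re-image or re-converge, reboot
        failed = [host for host in batch if not validate(host)]
        if failed:
            return validated, failed  # halt: later batches are never touched
        validated.extend(batch)
    return validated, []


if __name__ == "__main__":
    touched = []
    done, failed = roll_in_batches(["a", "b", "c", "d", "e"], 2,
                                   touched.append, lambda host: host != "c")
    print(done, failed, touched)
```

In this sketch a failed batch leaves its healthy members upgraded but unconfirmed; whether to roll those back is a decision for the runbook, not something the loop decides.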

References

NVIDIA Driver Installation Guide for Linux (CUDA apt repo, headers, pinning, metapackages — the per-node install this page builds on): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/
NVIDIA Supported Drivers and CUDA Toolkit Versions (LTS/Production branch ↔ CUDA pairing; verify the live baseline): https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
NVIDIA DCGM Feature Overview (dcgmi group -c, dcgmi config -g <id> --set -c <mode> / --get; config enforced across driver restarts, GPU resets, or at job start): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html
NVIDIA DCGM Getting Started (config management requires root: "Some DCGM functionality, such as configuration management, are not allowed to be run as non-root."): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html
nvidia-smi manual (--query-gpu, --format=csv,noheader,nounits, -L/--list-gpus, -q): https://docs.nvidia.com/deploy/nvidia-smi/index.html
Useful nvidia-smi Queries, NVIDIA (--query-gpu=...,driver_version --format=csv; --help-query-gpu lists all fields; driver_version = installed driver version): https://nvidia.custhelp.com/app/answers/detail/a_id/3751/
Ansible ansible.builtin.apt (install an exact version via name=version): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/apt_module.html
Ansible ansible.builtin.dpkg_selections (selection: install|hold|deinstall|purge; changes selection only, does not install/remove): https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/dpkg_selections_module.html
"What is a Golden Image" (authorised, reproducible template; immutable model eliminates drift; updates create new versions; config tools at build time): https://www.wiz.io/academy/container-security/what-is-a-golden-image

¹ A golden image is "an authorized template ... intentionally minimal, secure, and reproducible"; the immutable model "eliminates configuration drift entirely because every server is born from a known-good image and never modified after deployment," updates "create new versions," and config-management tools (Ansible/Chef/Puppet) can run "once during image creation instead of continually mutating live servers." https://www.wiz.io/academy/container-security/what-is-a-golden-image
² NVIDIA Supported Drivers and CUDA Toolkit Versions: datacenter driver branches are designated LTS or Production and pair with specific CUDA toolkit versions; pick one branch and verify it against the live matrix. https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html
³ Ansible ansible.builtin.dpkg_selections: parameters name (required) and selection with choices install, hold, deinstall, purge; example "Prevent python from being upgraded" sets selection: hold; "This module will not cause any packages to be installed/removed/purged, use the ansible.builtin.apt module for that." https://docs.ansible.com/projects/ansible/latest/collections/ansible/builtin/dpkg_selections_module.html
⁴ NVIDIA DCGM Feature Overview: create a group with dcgmi group -c <name>; set a target with dcgmi config -g <id> --set -c <mode> (e.g. -c 2); read with dcgmi config -g <id> --get; "Once a configuration has been defined DCGM will enforce that configuration, for example across driver restarts, GPU resets or at job start." https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html (the non-root restriction is documented on the DCGM Getting Started page: "Some DCGM functionality, such as configuration management, are not allowed to be run as non-root." https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/getting-started.html)
⁵ Useful nvidia-smi Queries (NVIDIA Enterprise Support): nvidia-smi --query-gpu=...,driver_version --format=csv; driver_version is the version of the installed NVIDIA driver; nvidia-smi --help-query-gpu enumerates every queryable field. https://nvidia.custhelp.com/app/answers/detail/a_id/3751/