Runbook: rolling driver / CUDA upgrade
Scope: the longform procedure for moving a GPU fleet to a new NVIDIA driver branch (with Fabric Manager and CUDA moved in step), one node at a time behind cordon/drain. Severity: planned change, fleet-wide scope, executed in batches inside a maintenance window. Never big-bang a whole fleet.
The commands on this page are reference templates built on real APIs; pin versions and validate them before production use.
This is the longform procedure for RB-1 in operational runbooks. The driver/Fabric-Manager/CUDA stack itself is described in the GPU software stack; the Ansible roles that do the reinstall are in Ansible bring-up.
Trigger
- A new pinned datacenter driver branch is adopted (the GPU software stack), e.g. moving the fleet to the next LTS or Production branch.
- A security advisory / CVE against the installed driver or container toolkit.
- A framework requirement: a new CUDA toolkit / runtime needed by a training or serving stack (distributed training, inference serving).
On NVSwitch systems (HGX 8-GPU baseboards, NVL72), the Fabric Manager package is matched to the driver branch and reinstalled in the same step; a driver-only bump that leaves an incompatible Fabric Manager will fail to form the NVLink domain (reliability and RAS). Verify the minimum driver for the target GPU on the FM compatibility page in References.
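The branch match described above can be checked mechanically before and after the reinstall. The following is a minimal sketch, assuming the branch is the leading numeric field of the version string; branch_of and same_branch are hypothetical helper names, and the version numbers in the comments are illustrative only. If the compatibility page calls for an exact version match rather than a branch match, tighten the comparison to plain string equality.

```shell
#!/bin/sh
# Sketch: compare the driver branch with the Fabric Manager branch.
# Assumption: the branch is the text before the first dot (550 in 550.54.15).

# branch_of VERSION -> prints the leading field of a dotted version string
branch_of() {
  printf '%s\n' "${1%%.*}"
}

# same_branch DRIVER_VERSION FM_VERSION -> exit 0 only when both are
# non-empty and share the same branch
same_branch() {
  [ -n "$1" ] && [ -n "$2" ] && [ "$(branch_of "$1")" = "$(branch_of "$2")" ]
}

# Example wiring on a node (not executed here; the exact output format of
# the version command is an assumption, hence the generic number extraction):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm=$(nv-fabricmanager --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n1)
#   same_branch "$drv" "$fm" || echo "Fabric Manager / driver mismatch" >&2
```

Running this on the canary first gives an early signal of a mismatch before the NVLink domain fails to form on a production node.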
Pre-checks
- Change ticket raised and approved; rollback variable identified (the previous driver_branch).
- New branch validated on a canary node end-to-end first: dcgmi diag -r 3 clean, nccl-tests busbw at line rate, one real smoke job (commissioning, workload recipes). Do not roll the fleet on an unvalidated branch.
- Maintenance window agreed; batch size chosen so capacity stays above the planned healthy quorum (do not drain below the threshold that breaks running jobs or SLOs, the SLO/SLI catalog).
- Inventory pinned in git so the change set is a single variable (SRE and MLOps practices).
- Decide host-installed driver vs the GPU Operator's driver containers; do not run both (the Kubernetes platform).
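The batch-size pre-check is simple arithmetic: the nodes that may be drained at once are whatever remains after subtracting the healthy quorum and any nodes already out of service. A minimal sketch, where max_batch is a hypothetical helper and the numbers in the usage note are illustrative, not fleet values:

```shell
#!/bin/sh
# max_batch TOTAL QUORUM UNAVAILABLE
# Prints how many nodes can be drained at once while keeping at least
# QUORUM healthy nodes; prints 0 when there is no headroom.
max_batch() (
  total=$1
  quorum=$2
  unavailable=$3
  headroom=$(( total - unavailable - quorum ))
  if [ "$headroom" -lt 0 ]; then
    headroom=0
  fi
  echo "$headroom"
)
```

For example, max_batch 64 56 2 prints 6: with 64 nodes, a quorum of 56 and 2 nodes already down, at most 6 nodes can be in the cordon/drain/reinstall cycle at the same time. A result of 0 means the window should not start until capacity recovers.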
Flow
stateDiagram-v2
[*] --> Cordon
Cordon --> Drain: node unschedulable
Drain --> Reinstall: workloads evicted
Reinstall --> Reboot: ansible driver_branch set
Reboot --> Validate: node back up
Validate --> Uncordon: diag and fabric pass
Validate --> Rollback: diag or fabric fail
Rollback --> Reboot: pin previous branch
Uncordon --> [*]: node admits work
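For automation that drives a node through this flow, the diagram can be encoded as a small transition function. This is a sketch only: next_state is a hypothetical helper, the pass/fail argument stands for the combined diag and fabric result, and Done stands for the diagram's final state.

```shell
#!/bin/sh
# next_state STATE [RESULT]
# Prints the state that follows STATE in the flow above. RESULT (pass|fail)
# is only consulted when STATE is Validate.
next_state() {
  case "$1" in
    Cordon)    echo Drain ;;
    Drain)     echo Reinstall ;;
    Reinstall) echo Reboot ;;
    Reboot)    echo Validate ;;
    Validate)
      case "$2" in
        pass) echo Uncordon ;;
        fail) echo Rollback ;;
        *)    echo "Validate needs pass|fail" >&2; return 2 ;;
      esac ;;
    Rollback)  echo Reboot ;;   # previous branch pinned, then reboot again
    Uncordon)  echo Done ;;     # node admits work
    *)         echo "unknown state: $1" >&2; return 2 ;;
  esac
}
```

Note that Rollback feeds back into Reboot and then Validate, so a rolled-back node must pass the same checks before it is uncordoned.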
Procedure
Run per node, in batches that preserve the healthy quorum. NODE is the Kubernetes node name (Slurm-equivalent: scontrol update nodename=<n> state=drain reason="driver upgrade").
- Cordon the node so the scheduler stops placing work:
kubectl cordon "$NODE"
- Drain running pods (keep DaemonSets such as the GPU Operator; clear emptyDir):
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
- Reinstall the stack via Ansible, pinned to the new branch. The nvidia_stack role reinstalls the driver, the branch-matched Fabric Manager, the container toolkit and DCGM, and rebuilds DKMS (Ansible bring-up):
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
-e "driver_branch=<new-branch>"
- Reboot to load the new kernel module cleanly:
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
- Verify the GPU + service stack is healthy on the node:
ssh "$NODE" nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
ssh "$NODE" systemctl is-active nvidia-fabricmanager nvidia-persistenced
ssh "$NODE" 'dcgmi diag -r 3' # Long HW diag: PCIe/NVLink, mem bandwidth, NCCL, stress
ssh "$NODE" 'ibstat | grep -c "State: Active"'
nvidia-fabricmanager must report active on NVSwitch systems; dcgmi diag -r 3 must not contain Fail; ibstat must show the expected number of Active ports.
- Uncordon and watch the node re-register its GPUs:
kubectl uncordon "$NODE"
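The per-node sequence can be sketched as a plan generator that prints the commands for review instead of running them, plus a gate for the "must not contain Fail" rule. The kubectl and ansible invocations are taken from the Rollback section of this runbook with the branch made a parameter; plan_node, diag_clean and the INVENTORY variable are hypothetical names introduced for illustration. The plan lists only the dcgmi check for brevity; the nvidia-smi, systemctl and ibstat checks from the verify step still apply.

```shell
#!/bin/sh
# plan_node NODE BRANCH
# Prints the per-node upgrade commands, one per line, without executing them.
# INVENTORY defaults to the inventory path used elsewhere in this runbook.
plan_node() (
  node=$1
  branch=$2
  inv=${INVENTORY:-inventory/hosts.ini}
  if [ -z "$node" ] || [ -z "$branch" ]; then
    echo "usage: plan_node NODE BRANCH" >&2
    exit 2
  fi
  cat <<EOF
kubectl cordon "$node"
kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
ansible-playbook -i $inv site.yml --limit "$node" -e "driver_branch=$branch"
ansible "$node" -i $inv -b -m reboot
ssh "$node" 'dcgmi diag -r 3'
kubectl uncordon "$node"
EOF
)

# diag_clean
# Reads dcgmi diag output on stdin. Succeeds only if there is some output
# and no line contains "Fail" (empty output is treated as a failure).
diag_clean() {
  awk '{ n++ } /Fail/ { bad = 1 } END { if (n == 0 || bad) exit 1 }'
}
```

A typical use is to print the plan into the change ticket for review, and to gate the uncordon on the diag result, for example ssh "$NODE" 'dcgmi diag -r 3' | diag_clean && kubectl uncordon "$NODE". Treating empty output as a failure guards against an ssh error being read as a clean diag.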
Verification
- The GPU Operator's nvidia-operator-validator pod goes green on the node (driver / toolkit / CUDA / device-plugin validations); the node carries nvidia.com/gpu.deploy.operator-validator and re-advertises nvidia.com/gpu (the Kubernetes platform). Watch it:
- A smoke job lands and runs on the upgraded node: a short GPU job, and a 2-node nccl-tests all_reduce_perf confirming busbw at line rate (workload recipes).
- Telemetry resumes: DCGM exporter and the node's metrics reappear (telemetry and monitoring).
- Repeat per batch; do not advance until the current batch is green.
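The busbw check in the smoke job can be made a pass/fail gate rather than an eyeball comparison. A minimal sketch, assuming the nccl-tests summary line has the form "# Avg bus bandwidth : <value>" and that the team has agreed a minimum figure for "line rate" on its fabric; busbw_ok is a hypothetical helper and the threshold is supplied by the caller.

```shell
#!/bin/sh
# busbw_ok MIN_GBPS
# Reads all_reduce_perf output on stdin. Exit 0 when the reported average
# bus bandwidth is at least MIN_GBPS, 1 when it is lower, and 2 when no
# summary line was found (for example, the job crashed before finishing).
busbw_ok() {
  awk -v min="$1" '
    /Avg bus bandwidth/ { bw = $NF; seen = 1 }
    END {
      if (!seen) exit 2
      if (bw + 0 >= min + 0) exit 0
      exit 1
    }'
}
```

Usage would be to pipe the smoke job's log into busbw_ok with the agreed threshold and refuse to advance the batch on a non-zero exit. Distinguishing exit code 2 from 1 separates "the test never completed" from "the fabric is slow", which point to different runbooks.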
Rollback
Single-variable, GitOps-style: re-run the same play pinned to the previous branch, then reboot and re-validate.
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
-e "driver_branch=<previous-branch>"
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
ssh "$NODE" 'dcgmi diag -r 3' && kubectl uncordon "$NODE"
If a node fails diag on both the new and previous branch, treat it as hardware and divert to the GPU-fault / RMA path (the GPU-fault runbook). If the NVLink domain will not form after reinstall, suspect a Fabric-Manager/driver mismatch (reliability and RAS).
Related runbooks
- the capacity-add runbook: Add GPU capacity (same Ansible + cordon/drain primitives).
- the GPU-fault runbook: GPU fault, drain, reset, RMA (escalation when a node fails diag).
- the NCCL-hang runbook: NCCL hang / collective stall (post-upgrade fabric checks).
- operational runbooks: Operational runbooks index (RB-1).
References
- NVIDIA driver install guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
- NVIDIA Fabric Manager user guide (driver compatibility, NVSwitch systems): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- NVIDIA GPU Operator (operator-validator, node labels): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
- kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- nccl-tests: https://github.com/NVIDIA/nccl-tests