Runbook: rolling driver / CUDA upgrade

Scope: rolling a GPU fleet to a new NVIDIA driver branch (with Fabric Manager and CUDA moved in step), one node at a time behind cordon/drain. Severity: planned change with fleet-wide scope, executed in batches inside a maintenance window. Never upgrade the whole fleet in a single big-bang step.

The commands below are reference templates built on real APIs; pin versions and validate them before production use.

This is the longform procedure for RB-1 in operational runbooks. The driver/Fabric-Manager/CUDA stack itself is described in the GPU software stack; the Ansible roles that do the reinstall are in Ansible bring-up.

Trigger

A new pinned datacenter driver branch is adopted (the GPU software stack), e.g. moving the fleet to the next LTS or Production branch.
A security advisory / CVE against the installed driver or container toolkit.
A framework requirement: a new CUDA toolkit / runtime needed by a training or serving stack (distributed training, inference serving).

On NVSwitch systems (HGX 8-GPU baseboards, NVL72), the Fabric Manager package is matched to the driver branch and reinstalled in the same step; a driver-only bump that leaves an incompatible Fabric Manager will fail to form the NVLink domain (reliability and RAS). Verify the minimum driver for the target GPU on the FM compatibility page in References.
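
The branch match between the driver and Fabric Manager can be checked mechanically before a node is returned to service. The sketch below is one way to do it, assuming the branch is the leading numeric component of the version string. The helper names branch_of and same_branch are hypothetical, and how the installed Fabric Manager version is read (for example from the package manager) is left as an assumption.

```shell
# Sketch only: compare the branch (leading numeric component) of two version strings.
# branch_of and same_branch are hypothetical helper names, not part of any NVIDIA tool.

# Print the part of a version string before the first dot, e.g. "550.54.15" -> "550".
branch_of() {
  echo "${1%%.*}"
}

# Return 0 when both version strings belong to the same branch.
same_branch() {
  [ "$(branch_of "$1")" = "$(branch_of "$2")" ]
}

# Usage sketch; FM_VERSION is a hypothetical variable you fill from your package manager:
# DRV_VERSION=$(ssh "$NODE" nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
# same_branch "$DRV_VERSION" "$FM_VERSION" || echo "Fabric Manager / driver branch mismatch on $NODE"
```

A branch-level match is the weakest useful check; confirm against the Fabric Manager user guide in References whether your release needs the full version strings to be identical.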

Pre-checks

Change ticket raised and approved; rollback variable identified (the previous driver_branch).
New branch validated on a canary node end-to-end first: dcgmi diag -r 3 clean, nccl-tests busbw at line rate, one real smoke job (commissioning, workload recipes). Do not roll the fleet on an unvalidated branch.
Maintenance window agreed; batch size chosen so capacity stays above the planned healthy quorum (do not drain below the threshold that breaks running jobs or SLOs; see the SLO/SLI catalog).
Inventory pinned in git so the change set is a single variable (SRE and MLOps practices).
Decide host-installed driver vs the GPU Operator's driver containers; do not run both (the Kubernetes platform).
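
The quorum pre-check above reduces to simple arithmetic: a batch can be no larger than the current healthy node count minus the minimum healthy count. A minimal sketch, assuming both numbers come from your own inventory and SLO catalog; max_batch_size and its optional cap argument are hypothetical.

```shell
# Sketch only: largest batch that keeps healthy capacity at or above the quorum.
# Arguments: <healthy_now> <min_healthy> [cap]  (all hypothetical inputs).
max_batch_size() {
  local healthy_now="$1" min_healthy="$2" cap="${3:-}"
  local headroom=$((healthy_now - min_healthy))

  # Already at or below quorum: nothing may be drained.
  if [ "$headroom" -lt 0 ]; then
    headroom=0
  fi

  # Optional operator-chosen ceiling on batch size.
  if [ -n "$cap" ] && [ "$headroom" -gt "$cap" ]; then
    headroom="$cap"
  fi

  echo "$headroom"
}

# Usage sketch: 32 healthy nodes, quorum of 28, at most 2 nodes per batch.
# max_batch_size 32 28 2
```

A result of 0 means the window should not start until more nodes are healthy.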

Flow

```mermaid
stateDiagram-v2
    [*] --> Cordon
    Cordon --> Drain: node unschedulable
    Drain --> Reinstall: workloads evicted
    Reinstall --> Reboot: ansible driver_branch set
    Reboot --> Validate: node back up
    Validate --> Uncordon: diag and fabric pass
    Validate --> Rollback: diag or fabric fail
    Rollback --> Reboot: pin previous branch
    Uncordon --> [*]: node admits work
```

Procedure

Run per node, in batches that preserve the healthy quorum. NODE is the Kubernetes node name (Slurm-equivalent: scontrol update nodename=<n> state=drain reason="driver upgrade").

```
NODE=gpu-07.dc1.internal
DRIVER_BRANCH=<new-branch>      # e.g. the newly-adopted LTS branch
```

Cordon the node so the scheduler stops placing work:
```
kubectl cordon "$NODE"
```

Drain running pods (keep DaemonSets such as the GPU Operator; clear emptyDir):

```
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
```

Reinstall the stack via Ansible, pinned to the new branch. The nvidia_stack role reinstalls the driver, the branch-matched Fabric Manager, the container toolkit and DCGM, and rebuilds DKMS (Ansible bring-up):
```
ansible-playbook -i inventory/hosts.ini site.yml \
  --limit "$NODE" \
  -e "driver_branch=$DRIVER_BRANCH"
```

Reboot to load the new kernel module cleanly:

```
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
```

Verify the GPU + service stack is healthy on the node:

```
ssh "$NODE" nvidia-smi --query-gpu=name,driver_version,pstate --format=csv,noheader
ssh "$NODE" systemctl is-active nvidia-fabricmanager nvidia-persistenced
ssh "$NODE" 'dcgmi diag -r 3'          # Long HW diag: PCIe/NVLink, mem bandwidth, NCCL, stress
ssh "$NODE" 'ibstat | grep -c "State: Active"'
```

nvidia-fabricmanager must report active on NVSwitch systems; dcgmi diag -r 3 must not contain Fail; ibstat must show the expected number of Active ports.
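
The pass criteria above can be turned into a small gate so that a human does not have to read the output on every node. This is a sketch under stated assumptions: diag_is_clean and ports_ok are hypothetical helper names, the check is a plain text match on "Fail" as described above, and empty output is treated as a failure so that a broken ssh session cannot pass silently.

```shell
# Sketch only: gate on the health checks described above.
# diag_is_clean and ports_ok are hypothetical helpers.

# Read diag output on stdin; succeed only if it is non-empty and has no "Fail".
diag_is_clean() {
  local out
  out=$(cat)
  [ -n "$out" ] || return 1          # no output at all counts as a failure
  if printf '%s\n' "$out" | grep -q 'Fail'; then
    return 1
  fi
  return 0
}

# Succeed when the counted Active ports equal the expected number.
# Arguments: <expected> <actual>
ports_ok() {
  [ "$2" -eq "$1" ]
}

# Usage sketch; EXPECTED_IB_PORTS is a hypothetical per-node-type value:
# ssh "$NODE" 'dcgmi diag -r 3' | diag_is_clean || echo "diag failed on $NODE"
# ports_ok "$EXPECTED_IB_PORTS" "$(ssh "$NODE" 'ibstat | grep -c "State: Active"')" \
#   || echo "unexpected Active port count on $NODE"
```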

Uncordon and watch the node re-register its GPUs:

```
kubectl uncordon "$NODE"
kubectl describe node "$NODE" | grep -E 'nvidia.com/gpu|gpu.deploy'
```
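
Watching for re-registration can also be scripted as a bounded poll on the node's allocatable nvidia.com/gpu count. The sketch below is one possible shape; wait_for_gpus and its default retry count and delay are hypothetical and should be tuned to how long the device plugin takes to come back in your cluster.

```shell
# Sketch only: poll until the node reports a non-zero allocatable nvidia.com/gpu count.
# Arguments: <node> [tries] [delay_seconds]  (defaults are hypothetical).
wait_for_gpus() {
  local node="$1" tries="${2:-30}" delay="${3:-10}" i=0 count
  while [ "$i" -lt "$tries" ]; do
    # The dots in the resource name are escaped for kubectl's JSONPath.
    count=$(kubectl get node "$node" \
      -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null)
    case "$count" in
      ''|*[!0-9]*) ;;                 # empty or non-numeric: not ready yet
      *)
        if [ "$count" -gt 0 ]; then
          echo "$node advertises $count GPU(s)"
          return 0
        fi
        ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node did not advertise GPUs after $tries checks" >&2
  return 1
}

# Usage sketch:
# kubectl uncordon "$NODE" && wait_for_gpus "$NODE"
```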

Verification

The GPU Operator's nvidia-operator-validator pod goes green on the node (driver / toolkit / CUDA / device-plugin validations); the node carries nvidia.com/gpu.deploy.operator-validator and re-advertises nvidia.com/gpu (the Kubernetes platform). Watch it:
```
kubectl -n gpu-operator get pods -l app=nvidia-operator-validator -o wide | grep "$NODE"
```
A smoke job lands and runs on the upgraded node: a short GPU job, and a 2-node nccl-tests all_reduce_perf confirming busbw at line rate (workload recipes).
Telemetry resumes: DCGM exporter and the node's metrics reappear (telemetry and monitoring).
Repeat per batch; do not advance until the current batch is green.
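
The rule of not advancing past a batch that is not green can be enforced by a loop that halts on the first failing node. A minimal sketch, assuming upgrade_node and validate_node are wrappers you write around the Procedure and Verification steps; both names, and run_batch itself, are hypothetical.

```shell
# Sketch only: process one batch of nodes in order and halt on the first failure.
# upgrade_node and validate_node are hypothetical functions you define yourself
# (cordon/drain/reinstall/reboot, then the health and verification checks).
run_batch() {
  local node
  for node in "$@"; do
    if ! upgrade_node "$node" || ! validate_node "$node"; then
      echo "HALT: $node failed; do not start the next batch" >&2
      return 1
    fi
    echo "OK: $node"
  done
}

# Usage sketch (node names are placeholders):
# run_batch gpu-07.dc1.internal gpu-08.dc1.internal || exit 1
```

Halting leaves the failed node cordoned and drained, which is the state the Rollback section expects to start from.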

Rollback

Single-variable, GitOps-style: re-run the same play pinned to the previous branch, then reboot and re-validate.

```
kubectl cordon "$NODE" && kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
ansible-playbook -i inventory/hosts.ini site.yml --limit "$NODE" \
  -e "driver_branch=<previous-branch>"
ansible "$NODE" -i inventory/hosts.ini -b -m reboot
ssh "$NODE" 'dcgmi diag -r 3' && kubectl uncordon "$NODE"
```

If a node fails diag on both the new and previous branch, treat it as hardware and divert to the GPU-fault / RMA path (the GPU-fault runbook). If the NVLink domain will not form after reinstall, suspect a Fabric-Manager/driver mismatch (reliability and RAS).
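
The triage rule in the paragraph above is a small decision table, sketched here so it can be embedded in automation. The function name next_action, the pass/fail input strings and the action labels it prints are all hypothetical; only the decisions themselves come from this runbook.

```shell
# Sketch only: map diag results to the next step.
# Arguments: <result_on_new_branch> [result_on_previous_branch]
# Each result is the string "pass" or "fail"; omit the second if rollback was not tried.
next_action() {
  local new_branch="$1" prev_branch="${2:-untested}"
  case "$new_branch:$prev_branch" in
    pass:*)        echo "uncordon" ;;                 # new branch is healthy
    fail:untested) echo "rollback" ;;                 # try the previous branch
    fail:pass)     echo "run-on-previous-branch" ;;   # problem follows the new branch
    fail:fail)     echo "gpu-fault-rma" ;;            # fails on both: treat as hardware
    *)             echo "unknown"; return 2 ;;
  esac
}

# Usage sketch:
# next_action fail fail
```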

Related runbooks

The capacity-add runbook: Add GPU capacity (same Ansible + cordon/drain primitives).
The GPU-fault runbook: GPU fault, drain, reset, RMA (escalation when a node fails diag).
The NCCL-hang runbook: NCCL hang / collective stall (post-upgrade fabric checks).
Operational runbooks: Operational runbooks index (RB-1).

References

NVIDIA driver install guide: https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html
NVIDIA Fabric Manager user guide (driver compatibility, NVSwitch systems): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
DCGM diagnostics (run levels): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
NVIDIA GPU Operator (operator-validator, node labels): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
kubectl drain (safely drain a node): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
nccl-tests: https://github.com/NVIDIA/nccl-tests