Runbook: persistence mode / clock bounce

Scope: a node (or fleet) showing slow job starts, bouncing clocks and non-reproducible benchmarks because the NVIDIA driver is de-initializing idle GPUs. Diagnose, enable nvidia-persistenced fleet-wide via config management, and confirm Persistence-M: On with stable clocks. Conceptual background is in Persistence Mode.

All commands, unit files and the Ansible play below are reference templates, not hardware-tested. Pin them to the driver version on your nodes and validate on a canary before fleet rollout.

This is the operational counterpart to Persistence Mode; it does not re-explain the mechanism, only the fix. Severity: usually a performance/SLO degradation, not an outage; the change is additive and safe (no cordon strictly required to enable it, but cordon/drain any node you are actively poking).

Trigger

Slow job starts. The first CUDA call of every job pays a per-GPU init cost on the order of 1-3 seconds; on an 8-GPU node this serializes into a multi-second cold start on launch. NVIDIA: "Applications that trigger GPU initialization may incur a short (order of 1-3 second) startup cost per GPU due to ECC scrubbing behavior." ¹
Clocks bouncing / inconsistent benchmarks. Back-to-back runs are not reproducible and applied clock/power limits silently reset, because "If the driver deinitializes a GPU some non-persistent state associated with that GPU will be lost and revert back to defaults the next time the GPU is initialized." ¹ This happens whenever the GPU goes idle: "When all GPU clients terminate the driver will then deinitialize the GPU." ¹
nvidia-smi shows Persistence-M: Off on a headless / server node, the direct fingerprint. Persistence is off by default on a fresh install; on a job-running node that is the bug.

Adjacent symptoms that this runbook does not fix: thermal/power throttling (see Reliability, RAS and Failure Modes), or clocks pinned low by an applications.graphics clock that was set and never cleared, which shows as throttle reason applications_clocks_setting, not a persistence problem.

Pre-checks

Run on the affected node before changing anything. Record the current state so you can confirm the delta after.

# 1. Is persistence actually off? (per-GPU query; same state as the Persistence-M header column)
nvidia-smi --query-gpu=index,name,persistence_mode,pstate --format=csv
# Affected node: persistence_mode reads "Disabled" for the rows in question

# 2. Performance / clock detail: current clocks, max clocks, and WHY clocks are where they are
nvidia-smi -q -d PERFORMANCE
nvidia-smi -q -d CLOCK
# Look at "Clocks Event Reasons" (older drivers: "Clocks Throttle Reasons").
# GPU Idle / Idle -> Active churn between runs is the persistence fingerprint;
# SW Thermal Slowdown / HW Slowdown / SW Power Cap are a DIFFERENT problem (RAS/thermal).

# 3. Is the daemon installed, and is it running?
which nvidia-persistenced            # expect /usr/bin/nvidia-persistenced (shipped with the driver, v319+)
systemctl status nvidia-persistenced
# "Active: inactive (dead)" or "could not be found" -> the daemon is the gap, not the driver
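
Recording the "before" state can be scripted so that the post-change comparison is a plain diff. The following is a minimal sketch, not part of the tested procedure: the function name `snapshot_gpu_state`, the `NVIDIA_SMI` override variable, and the output directory are all hypothetical; only the nvidia-smi query itself comes from pre-check 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the current per-GPU state to a timestamped CSV
# and print the file path. NVIDIA_SMI is an assumed override (not an NVIDIA
# convention) so the binary can be swapped out, e.g. for a dry run.
snapshot_gpu_state() {
  local outdir="$1"
  local stamp out
  stamp="$(date +%Y%m%dT%H%M%S)"
  out="${outdir}/gpu-state-${stamp}.csv"
  mkdir -p "$outdir" || return 1
  # Same fields as pre-check 1; the CSV header row is kept for later diffing.
  "${NVIDIA_SMI:-nvidia-smi}" \
    --query-gpu=index,name,persistence_mode,pstate --format=csv > "$out" || return 1
  echo "$out"
}
```

Usage might look like `before="$(snapshot_gpu_state /var/tmp/persistence-runbook)"` ahead of the change, then `diff "$before" "$(snapshot_gpu_state /var/tmp/persistence-runbook)"` afterwards.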

Decision points:

Daemon binary missing (which returns nothing): this is a driver-install problem, not a persistence one. Do not continue here; go to Driver Install and Lifecycle / NVIDIA Kernel Modules and Kernel upgrade, GPU missing.
Daemon present but throttle reasons show thermal/power: clocks are low for a thermal/power reason, not de-init. Persistence will not help; divert to Reliability, RAS and Failure Modes / thermal handling.
Daemon present, inactive, persistence Off, throttle reason is idle/none: this runbook applies. Proceed.
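
The three decision points above can be sketched as a small triage function. This is an illustrative sketch under stated assumptions: `active_reasons` and `triage_persistence` are hypothetical names, and the parser assumes the `nvidia-smi -q -d PERFORMANCE` layout in which each reason appears as `Name : Active` or `Name : Not Active`, which may differ between driver versions.

```shell
#!/usr/bin/env bash
# Hypothetical parser: from `nvidia-smi -q -d PERFORMANCE` text on stdin,
# print the names of reasons marked exactly ": Active" (assumed layout).
active_reasons() {
  awk -F: '/: Active[ \t]*$/ { gsub(/^[ \t]+|[ \t]+$/, "", $1); print $1 }'
}

# Hypothetical triage mirroring the decision points.
#   $1 daemon binary path ("" when `which` returned nothing)
#   $2 output of `systemctl is-active nvidia-persistenced`
#   $3 persistence_mode values, one per line
#   $4 active reason names, one per line (e.g. from active_reasons)
triage_persistence() {
  local daemon_path="$1" svc_state="$2" modes="$3" reasons="$4"
  if [ -z "$daemon_path" ]; then
    echo "driver-install"      # binary missing: not a persistence problem
    return 0
  fi
  if printf '%s\n' "$reasons" | grep -Eqi 'thermal|power|hw slowdown'; then
    echo "thermal-power"       # divert to the RAS/thermal path
    return 0
  fi
  if [ "$svc_state" != "active" ] && printf '%s\n' "$modes" | grep -qx 'Disabled'; then
    echo "apply-runbook"       # daemon inactive and persistence off
    return 0
  fi
  echo "no-action"
}
```

For example, `triage_persistence "$(which nvidia-persistenced)" "$(systemctl is-active nvidia-persistenced)" "$(nvidia-smi --query-gpu=persistence_mode --format=csv,noheader)" "$(nvidia-smi -q -d PERFORMANCE | active_reasons)"` prints one of the four labels. The labels are placeholders for the runbook paths, not output of any NVIDIA tool.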

Do not "fix" this with nvidia-smi -pm 1. The legacy kernel-mode flag is "near end-of-life and will be eventually deprecated in favor of the Persistence Daemon." ² Once the daemon is running, nvidia-smi routes persistence through it (RPC), so -pm is at best redundant and at worst masks a daemon that is not actually enabled at boot. Manage the daemon; see Persistence Mode.

Procedure

The fix is a state change to a systemd service, so drive it from config management (idempotent, fleet-wide) rather than ad-hoc SSH. Enabling persistence is additive and does not evict work, but if you are also clearing stale clocks or want a clean baseline, cordon/drain the node first so nothing is mid-benchmark while you poke it.

NODE is the Kubernetes node name (the Slurm equivalent is shown as a comment in the cordon/drain step).

NODE=gpu-07.dc1.internal

Cordon and drain the node you are actively working (optional for enable-only; required if you will bounce clocks or restart the daemon under load):

kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=15m
# Slurm: scontrol update nodename=gpu-07 state=drain reason="persistence enable"

Enable + start the daemon at boot, fleet-wide, via Ansible. The daemon ships disabled: NVIDIA cannot reliably auto-install a startup unit "on the wide range of systems the NVIDIA GPU driver supports," ³ so enabling it is on you. The packaged unit name is nvidia-persistenced.service. The upstream daemon starts "with persistence mode enabled for all devices," ⁴ so enabling the service is sufficient, with no per-GPU command. Reference template (not hardware-tested):

# roles/nvidia_persistence/tasks/main.yml
- name: Ensure nvidia-persistenced is enabled and running
  ansible.builtin.systemd_service:   # alias: ansible.builtin.systemd
    name: nvidia-persistenced.service
    enabled: true                    # start on boot
    state: started                   # idempotent: starts only if not running
  tags: [persistence]                # selected by --tags persistence in the rollout commands

Run it across the fleet (limit to a canary first, then widen):

ansible-playbook -i inventory/hosts.ini site.yml --tags persistence --limit gpu_canary
# validate canary (see Verification), then:
ansible-playbook -i inventory/hosts.ini site.yml --tags persistence --limit gpu_nodes
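
The canary-then-fleet ordering can be enforced in a wrapper so the wider run never starts after a failed canary. This is a sketch only: `staged_rollout` and the `ANSIBLE_PLAYBOOK` override variable are hypothetical names, and the verification step is whatever command you pass in (for example a script implementing the Verification checks).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: widen to gpu_nodes only if the canary play and a
# caller-supplied verification command both succeed.
#   $1 name of a command or shell function that returns 0 when the canary is healthy
# ANSIBLE_PLAYBOOK is an assumed override for the ansible-playbook binary.
staged_rollout() {
  local verify_cmd="$1"
  local runner="${ANSIBLE_PLAYBOOK:-ansible-playbook}"
  if ! "$runner" -i inventory/hosts.ini site.yml --tags persistence --limit gpu_canary; then
    echo "canary play failed; not widening" >&2
    return 1
  fi
  if ! "$verify_cmd"; then
    echo "canary verification failed; not widening" >&2
    return 1
  fi
  "$runner" -i inventory/hosts.ini site.yml --tags persistence --limit gpu_nodes
}
```

The inventory path, playbook name, and group names are the ones used in the commands above; adjust them to your repository layout.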

Equivalent single-host action, if you must do it by hand on one node:

ssh "$NODE" 'sudo systemctl enable --now nvidia-persistenced.service'

Do not enable the legacy flag. Confirm nothing in your bring-up still calls nvidia-smi -pm 1 as the control surface; grep your Ansible roles and any node-init scripts for -pm and remove it; the daemon is the source of truth. ² (A stray -pm 1 will appear to work while the daemon is down, then become a no-op once the daemon is up, hiding a daemon that never got enabled.)
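
The search for stray legacy calls can be sketched as a one-line text search. `find_legacy_pm` is a hypothetical helper name; it only matches invocations written on a single line (`-pm` or the long form `--persistence-mode`), so a command split across YAML list items or assembled from variables would need a manual look.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list file:line matches for nvidia-smi invoked with
# -pm or --persistence-mode under the given paths.
# Returns 0 when at least one match is found, 1 when none are.
find_legacy_pm() {
  grep -rnE 'nvidia-smi.*(-pm|--persistence-mode)([[:space:]=]|$)' -- "$@"
}
```

For example, `find_legacy_pm roles/ scripts/` (directory names are placeholders) prints each offending line with its file and line number.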

Uncordon any node you drained:

kubectl uncordon "$NODE"
# Slurm: scontrol update nodename=gpu-07 state=resume

flowchart LR
  A["Persistence-M Off, clocks bounce"] --> B{"Daemon binary present?"}
  B -->|"No"| C["Driver-install problem: install-lifecycle / kernel-modules"]
  B -->|"Yes"| D{"Throttle reason thermal or power?"}
  D -->|"Yes"| E["RAS/thermal path, not persistence"]
  D -->|"No, idle or none"| F["Ansible: systemd enable --now nvidia-persistenced"]
  F --> G["Verify Persistence-M On, stable clocks"]

Verification

Two things must be true on every node: the service is active, and every GPU reports persistence on with stable clocks. Assert both across the fleet, not just on the node you fixed.

# 1. Service is up now and will come up on boot
systemctl is-active nvidia-persistenced.service     # expect: active
systemctl is-enabled nvidia-persistenced.service    # expect: enabled
pgrep -af nvidia-persistenced                         # -f matches the full command line (the name exceeds pgrep's 15-character process-name limit); expect: /usr/bin/nvidia-persistenced --user ... line

# 2. Per-GPU persistence is On (query field + detailed -q field agree)
nvidia-smi --query-gpu=index,persistence_mode --format=csv
# Expect: every row reads "Enabled"
nvidia-smi -q | grep -i "Persistence Mode"
# Expect: "Persistence Mode : Enabled" per GPU
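
To turn the per-GPU check into a pass/fail assertion that can be looped over hosts, something like the following sketch may help. `assert_persistence_enabled` is a hypothetical name; it expects the `index, persistence_mode` rows produced with `--format=csv,noheader` on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical check: read "index, persistence_mode" rows on stdin, print any
# GPU that is not "Enabled", and return non-zero if one is found or if no
# rows were read at all (for example, nvidia-smi failed upstream).
assert_persistence_enabled() {
  awk -F', *' '
    NF >= 2 {
      rows++
      if ($2 != "Enabled") { print "GPU " $1 ": " $2; bad++ }
    }
    END { exit (rows == 0 || bad > 0) ? 1 : 0 }
  '
}
```

On one node: `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | assert_persistence_enabled`. Across a fleet, the same pipeline can be run per host over ssh or through your config-management tool, with the non-zero exit marking the node as failed.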

Confirm clocks no longer bounce and that the reason for the current clock is benign (idle/none), not de-init churn:

nvidia-smi --query-gpu=index,clocks.current.sm,clocks.current.graphics,clocks.current.memory,clocks_event_reasons.active --format=csv
# Older drivers expose the same as clocks_throttle_reasons.active.
# Sample the idle box a few times: clocks should sit at a steady idle floor and
# step cleanly to boost under load, NOT swing because the GPU keeps de-initializing.
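
Sampling "a few times" can be summarized per GPU so a swinging clock stands out. A minimal sketch, assuming input rows of `index, clock` as produced with `--format=csv,noheader,nounits`; `clock_spread` is a hypothetical name, and what size of spread counts as "bouncing" is left to your judgment since it depends on the GPU model.

```shell
#!/usr/bin/env bash
# Hypothetical summarizer: read repeated "index, clock_mhz" samples on stdin
# and print the min, max, and spread observed for each GPU index.
clock_spread() {
  awk -F', *' '
    NF >= 2 {
      i = $1; c = $2 + 0
      if (!(i in min) || c < min[i]) min[i] = c
      if (!(i in max) || c > max[i]) max[i] = c
    }
    END {
      for (i in min)
        printf "gpu=%s min=%d max=%d spread=%d\n", i, min[i], max[i], max[i] - min[i]
    }
  ' | sort
}
```

For example: `for n in 1 2 3 4 5; do nvidia-smi --query-gpu=index,clocks.current.sm --format=csv,noheader,nounits; sleep 2; done | clock_spread`. Run it on an idle node; samples taken while a job ramps up will legitimately show a large spread.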

Prove the benefit with a real diagnostic, not a printed number. The clock/perf-relevant DCGM level is Long (-r 3), "System HW Diagnostics" (PCIe/NVLink, memory bandwidth, NCCL, stress); it exercises the GPU long enough that a de-initializing GPU would show up as instability: ⁵

dcgmi diag -r 3
# Expect: no "Fail" results. ("-r 1" Quick is a <2.5 s sanity check;
#  "-r 2" Medium, "-r 3" Long ~<35 min on 8-GPU, "-r 4" Extra Long.)

For end-to-end proof that cold-start latency is gone, time the first lightweight CUDA call on a freshly idle GPU with the daemon running; expect a near-immediate first invocation (versus the 1-3 s/GPU init tax when persistence is off). Treat this only as a same-node before/after comparison; the result depends on the hardware and is not a fleet-wide target number. Fabric-level benchmarks (nccl-tests busbw) belong to Fabric Bring-Up, Validation and Benchmarking, which assumes persistence is already pinned on.
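
The before/after timing can be taken with a small wall-clock wrapper. This is a sketch: `elapsed_ms` is a hypothetical helper, it relies on GNU `date` supporting `%N`, and `./cuda_probe` in the usage note stands in for whatever lightweight CUDA program you already have; no such binary ships with the driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the wall-clock milliseconds a command takes.
# The command's output is discarded and its exit status is ignored.
# Requires GNU date (nanosecond %N format).
elapsed_ms() {
  local start end
  start="$(date +%s%N)"
  "$@" > /dev/null 2>&1
  end="$(date +%s%N)"
  echo $(( (end - start) / 1000000 ))
}
```

Run `elapsed_ms ./cuda_probe` on the idle node before enabling the daemon and again afterwards, and compare the two numbers on that node only.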

Rollback

Not applicable: this change is safe and additive. Enabling persistence keeps an idle GPU initialized; it does not alter results, evict work, or change clocks you did not set. There is no production reason to turn it back off on a job-running node.

The one real caveat is reset, not rollback: the daemon holds an open handle, so a bare nvidia-smi --gpu-reset (or any ECC toggle that needs a reset) fails with the GPU "in use." That is expected. When you genuinely need a clean reset, stop the daemon, reset, restart:

sudo systemctl stop nvidia-persistenced.service
sudo nvidia-smi -i <id> --gpu-reset
sudo systemctl start nvidia-persistenced.service
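
If the stop/reset/start sequence is scripted, it is worth making sure the daemon is restarted even when the reset itself fails. A minimal sketch, with `reset_gpu_with_daemon_stopped` as a hypothetical function name wrapping the same three commands:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the stop / reset / start sequence.
#   $1 GPU index to reset
# Returns the reset's exit status, or 1 if the daemon could not be stopped
# or restarted. The daemon is restarted whether or not the reset succeeded.
reset_gpu_with_daemon_stopped() {
  local gpu_id="$1"
  local rc=0
  sudo systemctl stop nvidia-persistenced.service || return 1
  sudo nvidia-smi -i "$gpu_id" --gpu-reset || rc=$?
  sudo systemctl start nvidia-persistenced.service || return 1
  return "$rc"
}
```

Call it as `reset_gpu_with_daemon_stopped 0` on a drained node; a non-zero return with the daemon back up means the reset itself was refused, typically because another client still holds the GPU.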

See ECC Toggle Recovery for the ECC-change flow that depends on this, and Stale MIG State for MIG reconfiguration, which has the same reset-blocked-by-daemon interaction.

Rolling Driver / CUDA Upgrade: rolling driver/CUDA/Fabric-Manager upgrade; its post-upgrade check already asserts nvidia-persistenced active.
Kernel upgrade, GPU missing: when the daemon binary or GPU is absent (driver-install fault, not persistence).
ECC Toggle Recovery: ECC enable/disable requiring a GPU reset; stop the daemon first.
Stale MIG State: MIG reconfiguration blocked by an open handle / stale state.
Fabric Manager Failure: NVSwitch fabric not forming; orthogonal service, same systemctl is-active discipline.
Operational Runbooks: runbook index.

References

NVIDIA Driver Persistence — Overview (de-init on zero clients, 1-3 s/GPU ECC-scrub init, state revert to defaults): https://docs.nvidia.com/deploy/driver-persistence/overview.html
NVIDIA Driver Persistence — Persistence Daemon (/usr/bin/nvidia-persistenced, not auto-installed on startup, --user sample, mimics a client): https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
NVIDIA Driver Persistence — Persistence Mode (Legacy) (nvidia-smi -pm, near end-of-life): https://docs.nvidia.com/deploy/driver-persistence/persistence-mode-legacy.html
nvidia-persistenced(1) man page (defaults: persistence enabled for all devices): https://manpages.ubuntu.com/manpages/noble/man1/nvidia-persistenced.1.html
nvidia-smi manual (-q -d display groups incl. PERFORMANCE, CLOCK; --query-gpu fields): https://docs.nvidia.com/deploy/nvidia-smi/index.html
NVIDIA DCGM diagnostics — run levels -r 1/2/3/4 (Quick/Medium/Long/Extra Long): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
Ansible ansible.builtin.systemd_service module (enabled, state; systemd is an alias): https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_service_module.html

Related: Persistence Mode · Glossary

¹ NVIDIA Driver Persistence: Overview. "When all GPU clients terminate the driver will then deinitialize the GPU"; "Applications that trigger GPU initialization may incur a short (order of 1-3 second) startup cost per GPU due to ECC scrubbing behavior"; "If the driver deinitializes a GPU some non-persistent state associated with that GPU will be lost and revert back to defaults the next time the GPU is initialized." https://docs.nvidia.com/deploy/driver-persistence/overview.html
² NVIDIA Driver Persistence: Persistence Mode (Legacy). "This solution is near end-of-life and will be eventually deprecated in favor of the Persistence Daemon." https://docs.nvidia.com/deploy/driver-persistence/persistence-mode-legacy.html
³ NVIDIA Driver Persistence: Persistence Daemon. Installed in /usr/bin; "there is no single standard for installing an application to start on system initialization on Linux, so we cannot reliably do so on the wide range of systems the NVIDIA GPU driver supports"; sample nvidia-persistenced --user foo; "The daemon simply mimics an external client of the GPU but does not actually use the GPU for any work." https://docs.nvidia.com/deploy/driver-persistence/persistence-daemon.html
⁴ nvidia-persistenced(1): "By default, nvidia-persistenced starts with persistence mode enabled for all devices." https://manpages.ubuntu.com/manpages/noble/man1/nvidia-persistenced.1.html
⁵ NVIDIA DCGM diagnostics: run level -r 3 is the Long "System HW Diagnostics" run (PCIe/NVLink, memory bandwidth, NCCL, stress). https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html