Operational runbooks (index)

Scope: this page indexes the operational runbooks. Each recurring use case has its own page covering trigger, pre-checks, procedure, verification, and rollback. The troubleshooting runbook handles symptom→cause triage; the runbooks listed here are the ordered procedures. Technology-specific cookbooks live on each tech page (see the orchestration overview and the cluster, training, and RL-library pages).

Paradigm: every runbook has a trigger, pre-checks, a numbered procedure, an explicit verification step, and a rollback. Drive changes node-by-node behind cordon/drain; never mutate a whole fleet in one step. Keep the runbooks in Git and link them from alerts (telemetry and monitoring, SRE and MLOps practices).
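
The node-by-node rule can be sketched as a planner that emits commands instead of running them. This is a minimal illustration, assuming a Kubernetes cluster: `batches`, `drain_commands`, and `plan` are hypothetical helper names, and the `apply-change` and `verify` entries are placeholders for whatever the individual runbook prescribes. `kubectl cordon`, `kubectl drain`, and `kubectl uncordon` are standard commands, but the drain flags that suit a given fleet may differ.

```python
from typing import Iterator, List, Sequence


def batches(nodes: Sequence[str], batch_size: int = 1) -> Iterator[List[str]]:
    """Yield nodes in small ordered batches; batch_size=1 is node-by-node."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield list(nodes[start:start + batch_size])


def drain_commands(node: str) -> List[List[str]]:
    """Commands that take one node out of service before it is changed."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


def plan(nodes: Sequence[str], batch_size: int = 1) -> List[List[str]]:
    """Build the ordered step list; a failed verify should stop the rollout."""
    steps: List[List[str]] = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            steps.extend(drain_commands(node))
        for node in batch:
            steps.append(["apply-change", node])  # placeholder, not a real command
        for node in batch:
            steps.append(["verify", node])  # placeholder, not a real command
        for node in batch:
            steps.append(["kubectl", "uncordon", node])
    return steps


if __name__ == "__main__":
    # Hypothetical node names, printed as a dry-run plan.
    for step in plan(["gpu-01", "gpu-02", "gpu-03"], batch_size=1):
        print(" ".join(step))
```

Emitting a plan first keeps the rollout reviewable: the same list can be inspected, committed next to the runbook, or handed to an executor that halts at the first failed verification.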

Lifecycle: every runbook follows the same shape

flowchart LR
  T["Trigger (alert / event)"] --> P["Pre-checks"]
  P --> PR["Procedure (cordon/drain, act)"]
  PR --> V{"Verify"}
  V -->|"pass"| DONE["Return to service"]
  V -->|"fail"| RB["Rollback"]
  RB --> PR
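
The lifecycle in the flowchart can be sketched as a small driver. This is an illustration under stated assumptions, not a prescribed implementation: the `Runbook` dataclass and `run` function are hypothetical, the `max_attempts` bound on the rollback-then-retry loop is an added safeguard the flowchart does not show, and aborting when pre-checks fail is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Runbook:
    """The steps every runbook shares; the trigger is what invokes run()."""
    name: str
    pre_checks: Callable[[], bool]
    procedure: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]


def run(runbook: Runbook, max_attempts: int = 2) -> str:
    """Walk pre-checks, procedure, verify, and rollback in flowchart order."""
    if not runbook.pre_checks():
        # Assumption: a failed pre-check stops before anything is mutated.
        return "aborted: pre-checks failed"
    for _ in range(max_attempts):
        runbook.procedure()
        if runbook.verify():
            return "returned to service"
        # Verify failed: roll back, then re-enter the procedure (RB --> PR).
        runbook.rollback()
    return "escalate: verify still failing after rollback"


if __name__ == "__main__":
    demo = Runbook(
        name="demo",
        pre_checks=lambda: True,
        procedure=lambda: print("procedure"),
        verify=lambda: True,
        rollback=lambda: print("rollback"),
    )
    print(run(demo))
```

Bounding the retry loop means a change that keeps failing verification ends in escalation to a human instead of cycling between rollback and procedure indefinitely.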

Runbook catalogue

All 23 runbooks, grouped by the subsystem that triggers them.

Driver, CUDA and kernel

| Runbook | Trigger | Page |
| --- | --- | --- |
| Rolling driver / CUDA upgrade | new LTS branch, CVE | rolling driver and CUDA upgrade |
| Driver / kernel module load failure | nvidia module won't load, NVML mismatch | driver and module load failure |
| Kernel upgrade: GPU missing | new kernel, GPUs vanish | kernel upgrade GPU-missing |
| GSP firmware / driver mismatch | GSP RPC errors, init fails | GSP firmware and driver mismatch |
| Persistence mode / clock bounce | clocks idle-down, first-use latency | persistence mode and clock bounce |

GPU hardware and health

| Runbook | Trigger | Page |
| --- | --- | --- |
| GPU fault: drain, reset, RMA | XID / ECC alert | GPU fault drain, reset and RMA |
| ECC toggle recovery | uncorrectable ECC, row-remap pending | ECC toggle recovery |
| Thermal / cooling emergency | thermal throttle, CDU alarm | thermal and cooling emergency |
| Stale MIG state | leftover MIG instances block scheduling | stale MIG state |

Fabric and interconnect

| Runbook | Trigger | Page |
| --- | --- | --- |
| NCCL hang / collective stall | step time → ∞, no XID | NCCL hang and collective stall |
| NVLink visibility / P2P failure | peers not visible over NVLink | NVLink visibility and P2P failure |
| PCIe / P2P bandwidth regression | H2D/D2H/P2P bandwidth down | PCIe and P2P bandwidth regression |
| Fabric Manager failure | nv-fabricmanager down, NVSwitch unconfigured | Fabric Manager failure |

Scheduling and capacity

| Runbook | Trigger | Page |
| --- | --- | --- |
| Scheduler: GPU job pending | pods/jobs stuck Pending/PD | scheduler GPU job pending |
| Add GPU capacity | new nodes / scale-up | add GPU capacity |
| Topology-unaware scheduling | collectives slow, jobs split across spine | topology-unaware scheduling |

Training workloads

| Runbook | Trigger | Page |
| --- | --- | --- |
| Training OOM | CUDA out of memory on a rank | training out-of-memory |
| Training MFU regression | MFU below baseline | training MFU regression |
| Checkpoint recovery / resume | job crash / preemption | checkpoint recovery and resume |

Inference workloads

| Runbook | Trigger | Page |
| --- | --- | --- |
| Inference SLO breach | TTFT/TPOT burn-rate | inference SLO breach |
| Inference KV-cache OOM | preemptions, KV-cache pressure | inference KV-cache OOM |

Provisioning and fleet

| Runbook | Trigger | Page |
| --- | --- | --- |
| OOB / BMC unreachable | no lights-out path to a node | OOB and BMC unreachable |
| Image drift across fleet | non-reproducible per-node failures | image drift across the fleet |

Where the technology cookbooks live

Each cluster, training-algorithm, and RL-library page carries its own Cookbook (common use cases) section with worked code.

Don't-miss checklist

  • Always cordon/drain before mutating a node; batch, never big-bang.
  • Classify XID before any RMA (reliability and RAS).
  • Every change has a one-step rollback (GitOps revert or Ansible pin) (SRE and MLOps practices).
  • Verify with a real proof (dcgmi diag, nccl-tests, loss continuity), not "it came back".
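
As one illustration of "a real proof", loss continuity after a checkpoint resume can be checked numerically instead of by eye. The sketch below is a hedged example: `loss_is_continuous` is a hypothetical helper, and the 5% relative tolerance is an assumed default that each job would need to tune against its own loss noise.

```python
from statistics import mean
from typing import Sequence


def loss_is_continuous(
    before: Sequence[float],
    after: Sequence[float],
    rel_tol: float = 0.05,
) -> bool:
    """Compare mean loss just before the interruption with mean loss after resume.

    Returns True when the resumed mean is within rel_tol of the pre-interruption
    mean. rel_tol=0.05 is an assumed default, not a recommended value.
    """
    if not before or not after:
        raise ValueError("need loss samples from both sides of the resume")
    baseline = mean(before)
    resumed = mean(after)
    return abs(resumed - baseline) <= rel_tol * abs(baseline)


if __name__ == "__main__":
    # Hypothetical loss windows taken from the steps around a resume.
    print(loss_is_continuous([2.10, 2.08, 2.05], [2.07, 2.04, 2.03]))
    print(loss_is_continuous([2.10, 2.08, 2.05], [3.40, 3.30, 3.20]))
```

A check like this gives the verification step a pass/fail result that can be logged with the change, which "it came back" cannot provide.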

References

  • DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
  • kubectl drain: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
  • NVIDIA GPU debug guidelines: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html

Related: Reliability · Troubleshooting · Telemetry · Agentic AIOps · Practices · SLO/SLI · Glossary