Operational runbooks (index)
Scope: this page is the index of operational runbooks. Each recurring use case has its own page covering trigger, pre-checks, procedure, verification, and rollback. The troubleshooting runbook handles symptom→cause triage; the pages listed here are the ordered procedures that follow it. Technology-specific cookbooks live on each technology page (see the orchestration overview, then the cluster, training, and RL-library pages).
Paradigm: every runbook has a trigger, pre-checks, a numbered procedure, an explicit verify, and a rollback. Drive changes node-by-node behind cordon/drain; never mutate a whole fleet in one step. Keep these in git and link them from alerts (telemetry and monitoring, SRE and MLOps practices).
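The node-by-node rule can be sketched as a plan generator. This is a minimal illustration, not the procedure from any specific runbook page: `rolling_plan`, `change_cmd`, and `verify_cmd` are hypothetical names, the drain flags are one common choice, and the code only builds command strings so a reviewer can read the plan before anything runs.

```python
from typing import Iterator, List


def batches(nodes: List[str], batch_size: int = 1) -> Iterator[List[str]]:
    """Yield nodes in small batches so the fleet is never changed in one step."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


def node_commands(node: str, change_cmd: str, verify_cmd: str) -> List[str]:
    """Ordered commands for one node: cordon, drain, change, verify, uncordon."""
    return [
        # kubectl drain also cordons; the explicit cordon is kept for readability.
        f"kubectl cordon {node}",
        f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data",
        change_cmd.format(node=node),   # hypothetical per-node change command
        verify_cmd.format(node=node),   # hypothetical per-node verification command
        f"kubectl uncordon {node}",     # only reached if the verify step passed
    ]


def rolling_plan(nodes: List[str], change_cmd: str, verify_cmd: str,
                 batch_size: int = 1) -> List[List[str]]:
    """Return one command list per batch; the caller runs batches one at a time."""
    plan = []
    for batch in batches(nodes, batch_size):
        plan.append([cmd for node in batch
                     for cmd in node_commands(node, change_cmd, verify_cmd)])
    return plan
```

Keeping `batch_size` small bounds the blast radius: a failed verify stops the rollout with at most one batch out of service.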
Lifecycle: every runbook follows the same shape
flowchart LR
T["Trigger (alert / event)"] --> P["Pre-checks"]
P --> PR["Procedure (cordon/drain, act)"]
PR --> V{"Verify"}
V -->|"pass"| DONE["Return to service"]
V -->|"fail"| RB["Rollback"]
RB --> PR
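The same lifecycle can be read as a small control loop. The sketch below is an assumption-laden illustration: the `Runbook` dataclass, the `execute` function, and the `max_attempts` bound are hypothetical (the diagram itself does not limit how often rollback leads back to the procedure), and the callables stand in for whatever a real runbook page specifies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Runbook:
    name: str
    pre_checks: List[Callable[[], bool]]  # every check must pass before acting
    procedure: Callable[[], None]         # cordon/drain, then act
    verify: Callable[[], bool]            # explicit proof that the fix worked
    rollback: Callable[[], None]          # undo the procedure


def execute(runbook: Runbook, max_attempts: int = 2) -> str:
    """Run one runbook through the lifecycle and return the final state."""
    if not all(check() for check in runbook.pre_checks):
        return "aborted: pre-checks failed"
    for _ in range(max_attempts):
        runbook.procedure()
        if runbook.verify():
            return "returned to service"
        runbook.rollback()  # verify failed: undo before retrying the procedure
    # Assumption: after max_attempts the loop stops and a human takes over.
    return "escalate: verify still failing after rollback"
```

Bounding the retry loop is a design choice for this sketch; it keeps a persistently failing verify from cycling a node through procedure and rollback indefinitely.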
Runbook catalogue
All 23 runbooks, grouped by the subsystem that triggers them.
Driver, CUDA and kernel
| Runbook | Trigger | Page |
|---|---|---|
| Rolling driver / CUDA upgrade | new LTS branch, CVE | rolling driver and CUDA upgrade |
| Driver / kernel module load failure | nvidia module won't load, NVML mismatch | driver and module load failure |
| Kernel upgrade — GPU missing | new kernel, GPUs vanish | kernel upgrade GPU-missing |
| GSP firmware / driver mismatch | GSP RPC errors, init fails | GSP firmware and driver mismatch |
| Persistence mode / clock bounce | clocks idle-down, first-use latency | persistence mode and clock bounce |
GPU hardware and health
| Runbook | Trigger | Page |
|---|---|---|
| GPU fault — drain, reset, RMA | XID / ECC alert | GPU fault drain, reset and RMA |
| ECC toggle recovery | uncorrectable ECC, row-remap pending | ECC toggle recovery |
| Thermal / cooling emergency | thermal throttle, CDU alarm | thermal and cooling emergency |
| Stale MIG state | leftover MIG instances block scheduling | stale MIG state |
Fabric and interconnect
| Runbook | Trigger | Page |
|---|---|---|
| NCCL hang / collective stall | step time → ∞, no XID | NCCL hang and collective stall |
| NVLink visibility / P2P failure | peers not visible over NVLink | NVLink visibility and P2P failure |
| PCIe / P2P bandwidth regression | H2D/D2H/P2P bandwidth down | PCIe and P2P bandwidth regression |
| Fabric Manager failure | nv-fabricmanager down, NVSwitch unconfigured | Fabric Manager failure |
Scheduling and capacity
| Runbook | Trigger | Page |
|---|---|---|
| Scheduler: GPU job pending | pods/jobs stuck Pending/PD | scheduler GPU job pending |
| Add GPU capacity | new nodes / scale-up | add GPU capacity |
| Topology-unaware scheduling | collectives slow, jobs split across spine | topology-unaware scheduling |
Training workloads
| Runbook | Trigger | Page |
|---|---|---|
| Training OOM | CUDA out of memory on a rank | training out-of-memory |
| Training MFU regression | MFU below baseline | training MFU regression |
| Checkpoint recovery / resume | job crash / preemption | checkpoint recovery and resume |
Inference workloads
| Runbook | Trigger | Page |
|---|---|---|
| Inference SLO breach | TTFT/TPOT burn-rate | inference SLO breach |
| Inference KV-cache OOM | preemptions, KV-cache pressure | inference KV-cache OOM |
Provisioning and fleet
| Runbook | Trigger | Page |
|---|---|---|
| OOB / BMC unreachable | no lights-out path to a node | OOB and BMC unreachable |
| Image drift across fleet | non-reproducible per-node failures | image drift across the fleet |
Where the technology cookbooks live
Each cluster, training-algorithm, and RL-library page carries its own Cookbook (common use cases) section with worked code:
- Cluster: Kubernetes · k3s · Ray · Slurm.
- Training: FSDP · DDP · DeepSpeed/ZeRO · TP (tensor parallelism) · PP (pipeline parallelism) · DiLoCo.
- Post-training: GRPO · DPO · SFT/LoRA.
- RL libraries: verl · slime · SkyRL · OpenRLHF · NeMo-RL · TRL.
Don't-miss checklist
- Always cordon/drain before mutating a node; batch, never big-bang.
- Classify XID before any RMA (reliability and RAS).
- Every change has a one-step rollback (GitOps revert or Ansible pin) (SRE and MLOps practices).
- Verify with a real proof (dcgmi diag, nccl-tests, loss continuity), not "it came back".
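As one illustration of a "real proof", loss continuity after a checkpoint resume can be checked numerically rather than by eye. This is a minimal sketch under stated assumptions: `loss_is_continuous` is a hypothetical helper, the 5% tolerance is an arbitrary placeholder to tune per job, losses are assumed positive, and only an upward jump is treated as a failure.

```python
from statistics import mean
from typing import Sequence


def loss_is_continuous(before: Sequence[float], after: Sequence[float],
                       rel_tol: float = 0.05) -> bool:
    """Return True if the post-resume loss has not jumped above the pre-crash level.

    before: last few logged losses before the interruption
    after:  first few logged losses after the resume
    rel_tol: allowed relative increase (assumed value, tune per job)
    """
    if not before or not after:
        raise ValueError("need at least one loss value on each side")
    baseline = mean(before)
    resumed = mean(after)
    # One-sided check: a lower loss after resume is fine, a jump upward is not.
    return resumed <= baseline * (1 + rel_tol)
```

Averaging a short window on each side smooths per-step noise; a single-step comparison would flag ordinary batch-to-batch variation.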
References
- DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
- kubectl drain: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
- NVIDIA GPU debug guidelines: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html
Related: Reliability · Troubleshooting · Telemetry · Agentic AIOps · Practices · SLO/SLI · Glossary