
Operational runbooks (index)

Scope: the index of operational runbooks. Each recurring use case has its own page with trigger, pre-checks, procedure, verification, and rollback. Where the troubleshooting runbook is symptom→cause triage, these are the ordered procedures. Technology-specific cookbooks live on each technology page (see the orchestration overview, then the cluster, training, and RL-library pages).

Paradigm: every runbook has a trigger, pre-checks, a numbered procedure, an explicit verify step, and a rollback. Drive changes node by node behind cordon/drain; never mutate a whole fleet in one step. Keep the runbooks in git and link them from alerts (see telemetry and monitoring, and SRE and MLOps practices).
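Because the runbooks live in git, the required shape can be checked mechanically before merge. The following is a minimal sketch of such a check, not an existing tool: it assumes each runbook page is Markdown with `#`-style headings named after the five parts, and the function names `missing_sections` and `has_numbered_procedure` are hypothetical.

```python
import re

# Required parts come from the paradigm above. The heading prefixes they are
# matched against are an ASSUMPTION about how runbook pages are laid out.
REQUIRED_SECTIONS = {
    "trigger": "trigger",
    "pre-checks": "pre-check",
    "procedure": "procedure",
    "verification": "verif",  # matches both "Verify" and "Verification"
    "rollback": "rollback",
}

# One Markdown ATX heading per line, e.g. "## Pre-checks".
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required parts that have no matching heading, in order."""
    headings = [h.lower() for h in HEADING.findall(markdown_text)]
    return [
        name
        for name, prefix in REQUIRED_SECTIONS.items()
        if not any(h.startswith(prefix) for h in headings)
    ]


def has_numbered_procedure(markdown_text: str) -> bool:
    """True if the text contains at least one '1.'-style numbered step."""
    return re.search(r"^[ \t]*\d+\.[ \t]+\S", markdown_text, re.MULTILINE) is not None
```

A CI job could call both functions on every changed runbook page and fail the merge when `missing_sections` returns a non-empty list.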

Lifecycle: every runbook follows the same shape

flowchart LR
  T["Trigger (alert / event)"] --> P["Pre-checks"]
  P --> PR["Procedure (cordon/drain, act)"]
  PR --> V{"Verify"}
  V -->|"pass"| DONE["Return to service"]
  V -->|"fail"| RB["Rollback"]
  RB --> PR
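
The flowchart can also be read as a small control loop. The sketch below is illustrative only: the `Runbook` dataclass and `execute` function are hypothetical names, and two behaviours are assumptions that the diagram does not state, namely that failed pre-checks stop the run and that the rollback-and-retry loop is capped by `max_attempts` before escalating to a human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Runbook:
    """The four callable stages that follow the trigger in the flowchart."""
    name: str
    pre_checks: Callable[[], bool]
    procedure: Callable[[], None]
    verify: Callable[[], bool]
    rollback: Callable[[], None]


def execute(runbook: Runbook, max_attempts: int = 2) -> str:
    """Walk the lifecycle once the trigger has fired.

    Returns "returned-to-service", "blocked-by-pre-checks", or "escalate".
    """
    # ASSUMPTION: a failed pre-check aborts before anything is mutated.
    if not runbook.pre_checks():
        return "blocked-by-pre-checks"

    # ASSUMPTION: the Rollback -> Procedure loop is bounded, not infinite.
    for _ in range(max_attempts):
        runbook.procedure()
        if runbook.verify():
            return "returned-to-service"
        runbook.rollback()  # verify failed: undo, then loop back to Procedure

    return "escalate"
```

Passing the stages as plain callables keeps the driver independent of whether a given runbook is implemented with kubectl, Slurm, or Ansible.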

Runbook catalogue

All 23 runbooks, grouped by the subsystem that triggers them.

Driver, CUDA and kernel

| Runbook | Trigger | Page |
| --- | --- | --- |
| Rolling driver / CUDA upgrade | new LTS branch, CVE | rolling driver and CUDA upgrade |
| Driver / kernel module load failure | `nvidia` module won't load, NVML mismatch | driver and module load failure |
| Kernel upgrade: GPU missing | new kernel, GPUs vanish | kernel upgrade GPU-missing |
| GSP firmware / driver mismatch | GSP RPC errors, init fails | GSP firmware and driver mismatch |
| Persistence mode / clock bounce | clocks idle-down, first-use latency | persistence mode and clock bounce |

GPU hardware and health

| Runbook | Trigger | Page |
| --- | --- | --- |
| GPU fault: drain, reset, RMA | XID / ECC alert | GPU fault drain, reset and RMA |
| ECC toggle recovery | uncorrectable ECC, row-remap pending | ECC toggle recovery |
| Thermal / cooling emergency | thermal throttle, CDU alarm | thermal and cooling emergency |
| Stale MIG state | leftover MIG instances block scheduling | stale MIG state |

Fabric and interconnect

| Runbook | Trigger | Page |
| --- | --- | --- |
| NCCL hang / collective stall | step time → ∞, no XID | NCCL hang and collective stall |
| NVLink visibility / P2P failure | peers not visible over NVLink | NVLink visibility and P2P failure |
| PCIe / P2P bandwidth regression | H2D/D2H/P2P bandwidth down | PCIe and P2P bandwidth regression |
| Fabric Manager failure | `nv-fabricmanager` down, NVSwitch unconfigured | Fabric Manager failure |

Scheduling and capacity

| Runbook | Trigger | Page |
| --- | --- | --- |
| Scheduler: GPU job pending | pods/jobs stuck Pending/PD | scheduler GPU job pending |
| Add GPU capacity | new nodes / scale-up | add GPU capacity |
| Topology-unaware scheduling | collectives slow, jobs split across spine | topology-unaware scheduling |

Training workloads

| Runbook | Trigger | Page |
| --- | --- | --- |
| Training OOM | `CUDA out of memory` on a rank | training out-of-memory |
| Training MFU regression | MFU below baseline | training MFU regression |
| Checkpoint recovery / resume | job crash / preemption | checkpoint recovery and resume |

Inference workloads

| Runbook | Trigger | Page |
| --- | --- | --- |
| Inference SLO breach | TTFT/TPOT burn-rate | inference SLO breach |
| Inference KV-cache OOM | preemptions, KV-cache pressure | inference KV-cache OOM |

Provisioning and fleet

| Runbook | Trigger | Page |
| --- | --- | --- |
| OOB / BMC unreachable | no lights-out path to a node | OOB and BMC unreachable |
| Image drift across fleet | non-reproducible per-node failures | image drift across the fleet |

Where the technology cookbooks live

Each cluster, training-algorithm, and RL-library page carries its own Cookbook (common use cases) section with worked code:

Cluster: Kubernetes · k3s · Ray · Slurm.
Training: FSDP · DDP · DeepSpeed/ZeRO · TP (tensor parallelism) · PP (pipeline parallelism) · DiLoCo.
Post-training: GRPO · DPO · SFT/LoRA.
RL libraries: verl · slime · SkyRL · OpenRLHF · NeMo-RL · TRL.

Don't-miss checklist

Always cordon/drain before mutating a node; batch, never big-bang.
Classify XID before any RMA (reliability and RAS).
Every change has a one-step rollback, such as a GitOps revert or an Ansible pin (SRE and MLOps practices).
Verify with a real proof (`dcgmi diag`, `nccl-tests`, loss continuity), not "it came back".
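
The first and last checklist items can be combined into a batched rollout loop. This is a hedged sketch that assumes a Kubernetes cluster: `rolling_change`, `batches`, and `drain_commands` are hypothetical helpers, and `change`, `verify`, and `run` are caller-supplied callables (for example, `verify` might wrap `dcgmi diag` and `run` might wrap `subprocess.run`). The code only builds kubectl argument lists and hands them to `run`; it executes nothing itself.

```python
from typing import Callable, Iterable, Iterator


def batches(nodes: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield nodes in fixed-size batches; the last batch may be smaller."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    batch: list[str] = []
    for node in nodes:
        batch.append(node)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def drain_commands(node: str) -> list[list[str]]:
    """kubectl argv lists that take one node out of service."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
    ]


def rolling_change(
    nodes: Iterable[str],
    size: int,
    change: Callable[[str], None],
    verify: Callable[[str], bool],
    run: Callable[[list[str]], object],
) -> tuple[list[str], list[str]]:
    """Apply `change` batch by batch and halt at the first failed verify.

    Returns (nodes returned to service, nodes that failed verification).
    """
    done: list[str] = []
    for batch in batches(nodes, size):
        for node in batch:
            for argv in drain_commands(node):
                run(argv)
            change(node)
        failed = [node for node in batch if not verify(node)]
        if failed:
            # The whole failing batch stays cordoned so it can be rolled back;
            # later batches are never touched.
            return done, failed
        for node in batch:
            run(["kubectl", "uncordon", node])
            done.append(node)
    return done, []
```

Halting at the first failed batch bounds the blast radius to `size` nodes, which is the point of batching instead of a fleet-wide change.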

References

DCGM diagnostics: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
kubectl drain: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
NVIDIA GPU debug guidelines: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html

Related: Reliability · Troubleshooting · Telemetry · Agentic AIOps · Practices · SLO/SLI · Glossary