
GPU health gating

Scope: keep jobs off bad GPUs. This page covers the control layer that runs a health check, turns a verdict into a scheduler state change, and stops work landing on a degraded node: Slurm HealthCheckProgram + NHC, cordon/drain, and the Kubernetes equivalents (GPU Operator validator gate, GPU Feature Discovery labels, Node Problem Detector conditions). The tests themselves (dcgmi diag, dcgmi health) live on GPU Diagnostics and Validation; the fault taxonomy (XID/ECC, what to RMA) lives on Reliability, RAS and Failure Modes. This page is the wiring between them.

Every command and config below is a reference template, not hardware-tested. Parameter names, defaults, and exit-code conventions vary by Slurm release, NHC version, GPU Operator version, and DCGM build; confirm each against the cited source for your installed versions before scripting a gate against them.

What it is

Health gating is a closed loop with three moving parts, run on every node continuously:

A probe. A script that decides "is this node fit to take work?" It composes the diagnostics from GPU Diagnostics and Validation: a cheap dcgmi diag -r 1 and a passive dcgmi health -c for the pre-job gate, escalating to -r 3 for post-incident adjudication.
A verdict→state mapping. A non-zero probe must change the scheduler's view of the node, not just log. On Slurm the node goes DRAIN; on Kubernetes it gets cordoned or a node condition flips.
A re-admission path. A node returns to the pool only after a clean probe, never automatically.

flowchart LR
  PROBE["Probe (dcgmi diag / health, NHC checks)"] --> V{"healthy?"}
  V -->|"yes"| SCHED["Schedulable: jobs land"]
  V -->|"no"| OFF["DRAIN / cordon: jobs blocked"]
  OFF --> REM["Remediate (reset, RMA triage)"]
  REM --> RECHECK["Re-probe"]
  RECHECK --> V

The split that matters: passive watches (dcgmi health -c) are non-invasive and run while jobs are live; active diagnostics (dcgmi diag -r 1..4) need exclusive access to the GPUs and so only run when the node is idle or already drained.¹ A gate that runs an active diagnostic on a busy node produces false failures; see GPU Diagnostics and Validation for that constraint.

Two scheduler worlds implement the same loop differently:

Slurm. HealthCheckProgram runs your probe on a timer on every node; the probe itself calls scontrol update ... State=drain on failure. NHC (LBNL Node Health Check) is the standard pre-built probe.²⁵
Kubernetes. No single built-in equivalent. Health is assembled from the GPU Operator validator (gates whether GPU pods schedule at all), GPU Feature Discovery (labels for placement), Node Problem Detector (turns log/script signals into node conditions), and kubectl cordon/drain for remediation.⁶⁷

This is the "health gates and drain" stage of Provisioning and Scheduling. For the topology/placement side of scheduling see Slurm Topology-Aware Placement; the scheduler deep-dives are Slurm for GPU Clusters and Kubernetes for GPU Clusters; this page does not re-document either.

Why it's needed (and when)

A GPU that enumerates in nvidia-smi is not a GPU fit for a multi-hour collective. At scale, hardware failure is steady-state, not exceptional (Reliability, RAS and Failure Modes): ECC can be degrading, an HBM row-remap can be pending, an NVLink lane can have dropped, GSP firmware can mismatch the driver. Without gating, the scheduler keeps placing jobs on a node that drifted unhealthy since last use, and a single bad node sinks every multi-node job that touches it. The gate exists to make "known-bad" and "schedulable" mutually exclusive, automatically, at fleet scale. You do not hand-nurse nodes when there are thousands.

Three moments, three depths (the depths come from GPU Diagnostics and Validation; gating decides when and what happens next):

Periodic background (always on). A timer-driven probe on every node (passive dcgmi health -c plus a cheap dcgmi diag -r 1 on idle nodes) catches slow drift (ECC ramp, link degradation) between jobs. This is the steady-state gate.
Pre-job (fast, on node entry). Before a job is dispatched to a node, a quick check confirms the node is still fit. It must be cheap enough not to delay dispatch materially, so it leans on the passive watch and the -r 1 quick level, never the long levels.
Post-incident (targeted, node already drained). After an XID/SXID, an NCCL hang, or a thermal trip, the node is drained and a dcgmi diag -r 3 adjudicates return-to-pool vs RMA (see Reliability, RAS and Failure Modes; for placement fallout, Topology-Unaware Scheduling Starvation).

Rule of thumb: passive watches run always; the cheap diagnostic gates entry; the expensive diagnostic runs only on an already-drained node, never in the dispatch hot path.
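
The rule of thumb can be sketched as a small lookup from a node's situation to the probes that are safe to run there. This is an illustration only: the function name probe_plan and the situation labels busy, idle, and drained are hypothetical, not Slurm or Kubernetes states, and the probe commands are simply the ones named above.

```shell
#!/usr/bin/env bash
# Sketch only: map a node's scheduling situation to the probes that are safe to run.
# probe_plan and the labels busy/idle/drained are hypothetical names for this example.

probe_plan() {
  case "$1" in
    busy)     # jobs are live: passive watch only
      echo "dcgmi health -c" ;;
    idle)     # no jobs: passive watch plus the quick diagnostic
      echo "dcgmi health -c"
      echo "dcgmi diag -r 1" ;;
    drained)  # already out of the pool: the long diagnostic is allowed
      echo "dcgmi diag -r 3" ;;
    *)
      echo "unknown situation: $1" >&2
      return 2 ;;
  esac
}
```

Calling probe_plan idle prints one probe command per line; a wrapper would run each in turn and treat any non-zero exit as a failed verdict.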

How it's set up & managed

Slurm: `HealthCheckProgram`

Slurm runs a single script, as root, on every compute node on a timer. Per the slurm.conf manual, HealthCheckProgram is the "fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the NOT_RESPONDING state", and "this program will be killed if it does not terminate normally within 60 seconds." Slurm takes no action on the result by itself: "any action to be taken must be explicitly performed by the program (e.g. execute scontrol update NodeName=foo State=drain Reason=tmp_file_system_full to drain a node)."²

Three parameters in slurm.conf (verify against the manual for your release²):

# slurm.conf
HealthCheckProgram=/usr/sbin/nhc      # the probe; runs as root on every node
HealthCheckInterval=300               # seconds between runs; default 0 = DISABLED
HealthCheckNodeState=IDLE,CYCLE       # which node states run it, and stagger across the interval

HealthCheckInterval defaults to zero, which disables execution; it must be set or the gate never runs.²
HealthCheckNodeState defaults to ANY; IDLE restricts the active probe to idle nodes (so an exclusive dcgmi diag never contends with a running job), and CYCLE spreads runs across the interval instead of firing on all nodes at once.²
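
Because Slurm takes no action on the script's result, a hand-rolled probe has to issue the drain itself. The following is a minimal sketch of that pattern, not a reference implementation: PROBE_CMD, PROBE_TIMEOUT, SCONTROL, and health_gate are hypothetical names, the 50-second bound is an assumed margin under the 60-second kill, and it relies on timeout(1) from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch of a hand-rolled HealthCheckProgram. Not hardware-tested.
# PROBE_CMD, PROBE_TIMEOUT and SCONTROL are hypothetical knobs for this example.

PROBE_CMD=${PROBE_CMD:-"dcgmi diag -r 1"}   # quick level only, per the gating rule
PROBE_TIMEOUT=${PROBE_TIMEOUT:-50}          # assumed margin under Slurm's 60 s kill
SCONTROL=${SCONTROL:-scontrol}              # override for a dry run

health_gate() {
  local node=$1 rc=0
  # timeout(1) returns 124 when the probe overruns.
  # PROBE_CMD is intentionally unquoted so it splits into command and arguments.
  timeout "$PROBE_TIMEOUT" $PROBE_CMD >/dev/null 2>&1 || rc=$?
  if [ "$rc" -ne 0 ]; then
    # Slurm will not act on the exit code; the probe must drain the node itself.
    "$SCONTROL" update NodeName="$node" State=DRAIN \
      Reason="healthcheck rc=$rc: $PROBE_CMD"
  fi
  return "$rc"
}

# As a HealthCheckProgram the script would end with something like the line
# below, assuming the Slurm NodeName matches the short hostname:
#   health_gate "$(hostname -s)"
```

Pointing SCONTROL at echo gives a dry run that prints the drain command instead of executing it.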

Apply config changes with scontrol reconfigure. The drain itself, and re-admission, are explicit scontrol operations that the probe (or an operator) performs:

scontrol reconfigure                                       # reload slurm.conf after edits

# manual drain (what the probe runs on failure; Reason is mandatory, quote multi-word)
scontrol update NodeName=gpu042 State=DRAIN Reason="dcgmi diag -r1 fail: PCIe"

# inspect why nodes are drained/down (first chars of Reason per node)
sinfo -R

# re-admit after a clean re-probe (DRAINED has no running jobs; RESUME returns it)
scontrol update NodeName=gpu042 State=RESUME

State vocabulary that gates behaviour: DRAINING = no new jobs but jobs still running; DRAINED = no new jobs and none running (drain complete); DOWN = offline. DRAIN as a sinfo filter covers both DRAINING and DRAINED.³⁴
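
The re-admission half of the loop can be sketched the same way: resume a node only when a fresh probe comes back clean. This is an assumption-laden illustration, meant to run on the drained node itself; REPROBE_CMD, SCONTROL, and readmit_if_clean are hypothetical names, and the choice of -r 3 follows the post-incident depth described earlier.

```shell
#!/usr/bin/env bash
# Sketch of a re-admission step, meant to run on the drained node. Not hardware-tested.
# REPROBE_CMD and SCONTROL are hypothetical knobs for this example.

REPROBE_CMD=${REPROBE_CMD:-"dcgmi diag -r 3"}  # long level is acceptable: node is drained
SCONTROL=${SCONTROL:-scontrol}                 # override for a dry run

readmit_if_clean() {
  local node=$1
  # REPROBE_CMD is intentionally unquoted so it splits into command and arguments.
  if $REPROBE_CMD >/dev/null 2>&1; then
    "$SCONTROL" update NodeName="$node" State=RESUME
  else
    echo "$node: re-probe failed, leaving node drained" >&2
    return 1
  fi
}
```

The node stays drained on any non-zero probe exit, which keeps re-admission from ever happening automatically.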

Slurm: NHC as the probe

Hand-rolling the probe is rarely worth it. NHC (LBNL Node Health Check, mej/nhc, current release 1.4.3) is the standard HealthCheckProgram payload: a framework of check_* functions that, on first failure, marks the node offline and stops.⁵ Its purpose: "Nodes which are determined to be 'unhealthy' can be marked as down or offline so as to prevent jobs from being scheduled or run on them."⁵

Install lays the binary at /usr/sbin/nhc, config at /etc/nhc/nhc.conf, helpers under /usr/libexec/nhc/. nhc-genconf scans the host and writes a starting /etc/nhc/nhc.conf.auto you then curate.⁵ Reference template, not hardware-tested:

# build from source (tarball)
./configure --prefix=/usr --sysconfdir=/etc --libexecdir=/usr/libexec
make test
make install

nhc-genconf                 # writes /etc/nhc/nhc.conf.auto from a host scan; review before use
nhc -d                      # run once in debug/foreground to see every check fire

Wire it into Slurm exactly as above (HealthCheckProgram=/usr/sbin/nhc). Under Slurm, NHC marks a node offline using scontrol, driven by the variable SLURM_SC_OFFLINE_ARGS, default "update State=DRAIN", i.e. NHC runs scontrol update NodeName=<node> State=DRAIN Reason=<failed check> for you.⁵

Config lines are target || check_command [args] (the target is a hostmask or *). A GPU-node /etc/nhc/nhc.conf composes filesystem/service/hardware checks with GPU checks:

# /etc/nhc/nhc.conf  (reference template, not hardware-tested)
* || check_fs_mount_rw -f /                       # rootfs mounted read-write
* || check_ps_service -u root -S sshd             # sshd up (start if missing)

# GPU gate: run the DCGM quick diagnostic and require exit 0.
# check_cmd_status -t <timeout_s> -r <required_exit_code> <command...>
* || check_cmd_status -t 60 -r 0 dcgmi diag -r 1

NHC ships a built-in NVIDIA check, check_nv_healthmon, but it wraps the deprecated nvidia-healthmon from the old Tesla Deployment Kit; do not build a new gate on it.⁵ The portable, current approach is the generic check_cmd_status wrapping dcgmi diag (above): -r 0 requires a clean exit, -t bounds it. Keep the gating level at -r 1; reserve -r 3 for the post-incident, already-drained path so NHC never tries to run an exclusive-access diagnostic on a node carrying a job (GPU Diagnostics and Validation).¹

For belt-and-braces coverage outside the scheduler's timer, run NHC from cron via nhc-wrapper, which records results and suppresses duplicate notifications (it "will ignore results that are identical to those previously obtained"):⁵

# cron: notify root on new/cleared errors, or if an error persists 12h
/usr/sbin/nhc-wrapper -M root -X 12h

Kubernetes: validator gate, labels, conditions

Kubernetes core does not understand GPUs; health is assembled from operators (Kubernetes for GPU Clusters). Four pieces:

1. GPU Operator validator (the admission gate). The nvidia-operator-validator runs validation init containers and writes sentinel files under /run/nvidia/validations/, in order host-driver-ready, toolkit-ready, cuda-ready, plugin-ready.⁸ The device-plugin (and downstream GPU pods) block on these via an init container that loops until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done.⁸ Net effect: if the driver/toolkit/CUDA stack on a node is unhealthy, the validator never writes the sentinel, and GPU pods stay Init:; the gate is implicit but real. The validator can be scoped with validator.cuda.env WITH_WORKLOAD=false to skip the CUDA workload step.⁸ (Stale sentinels from a crashed validator can wedge pods; see Image Drift Across Fleet / GPU Operator troubleshooting.)

2. GPU Feature Discovery (placement labels). GFD labels nodes from the detected hardware (nvidia.com/gpu.present=true, nvidia.com/gpu.product for the GPU model, nvidia.com/gpu.count) so workloads place on the right GPUs via nodeSelector/affinity.⁹ Operands themselves are gated by feature.node.kubernetes.io/pci-10de.present=true (NVIDIA PCI vendor 0x10de); set nvidia.com/gpu.deploy.operands=false on a node to stop the operator deploying its DaemonSets there.¹⁰

3. Node Problem Detector (faults → node conditions). NPD is a DaemonSet that watches logs and runs scripts, then reports problems as node conditions (a permanent problem) or events (temporary). Its custom-plugin-monitor runs arbitrary scripts whose exit code sets a condition;⁶ that is the hook for surfacing a GPU XID watcher or a dcgmi health read as, say, a GPUUnhealthy=True condition that your own controller (or kubectl) then acts on. NPD does not cordon by itself; it makes the signal visible to the API.

4. cordon / drain (remediation). kubectl cordon marks a node unschedulable (no new pods); kubectl drain evicts running pods respecting PodDisruptionBudgets; kubectl uncordon re-admits.⁷ On a GPU node, drain needs --ignore-daemonsets (the GPU Operator runs DaemonSets) and usually --delete-emptydir-data:⁷

kubectl cordon gpu-node-7                                        # stop new pods landing
kubectl drain gpu-node-7 --ignore-daemonsets --delete-emptydir-data   # evict workloads
# ... reset / RMA-triage per reliability-ras ...
kubectl uncordon gpu-node-7                                      # re-admit after a clean re-probe

Pre-built GPU health checks for NPD exist (e.g. custom plugins keyed on GPU count / NVLink / XID / ECC) and managed K8s offerings ship them, but the underlying mechanism is always: probe → node condition → cordon/drain by a controller or operator.⁶
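
The probe → node condition → cordon chain can be sketched in two small functions. Both are illustrations under stated assumptions, not NPD's or the GPU Operator's own code: gpu_xid_check, cordon_if_unhealthy, and KUBECTL are hypothetical names; the exit-code contract (0 healthy, 1 problem) should be confirmed against the NPD version in use; and GPUUnhealthy is the example condition name from the text, which only exists if your NPD configuration defines it.

```shell
#!/usr/bin/env bash
# Sketch of a GPU check in the style of an NPD custom plugin, plus a reaction
# step. Not hardware-tested.
# Assumption: exit 0 means healthy and exit 1 means problem; confirm the
# exit-code contract against the NPD version you run.

# Probe side: reads kernel log text on stdin and reports a problem if any
# NVIDIA Xid line is present.
gpu_xid_check() {
  if grep -q 'NVRM: Xid'; then
    echo "GPU Xid event found in kernel log"
    return 1
  fi
  echo "no GPU Xid events"
  return 0
}

KUBECTL=${KUBECTL:-kubectl}   # override for a dry run

# Controller side: cordon a node whose (hypothetical) GPUUnhealthy condition is True.
cordon_if_unhealthy() {
  local node=$1 status
  status=$("$KUBECTL" get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="GPUUnhealthy")].status}')
  if [ "$status" = "True" ]; then
    "$KUBECTL" cordon "$node"
  fi
}
```

On a node the probe would be fed with something like dmesg | gpu_xid_check. A production check would also bound the time window, since an old Xid line stays in the log long after the fault is handled.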

Validated usage & tests

Reference template, not hardware-tested. The outputs below are the expected shape of a healthy or gated node, never numbers measured on hardware.

Slurm: confirm the gate is wired and firing

scontrol show config | grep -i healthcheck
# Expect HealthCheckProgram set to your probe, HealthCheckInterval non-zero,
# HealthCheckNodeState as configured. A zero interval means the gate is DISABLED.

Force a failure on one node (drop a tripwire the probe checks, or temporarily point the probe at check_cmd_status -t 1 -r 0 false), wait one interval, then:

sinfo -R
# Expect the node listed under a Reason string (first chars of what the probe set),
# e.g.  dcgmi diag -r1 f   gpu042   — proving probe->DRAIN actually fired.

sinfo -N -l -n gpu042
# Expect state drained/draining. DRAINED = no jobs running; DRAINING = jobs still finishing.

A node that should be unhealthy but shows idle/mixed means the probe is not draining. Check that the probe itself runs the scontrol drain on failure (Slurm takes no action on the script's result²), that it finishes within 60 s, and that the interval is non-zero.
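
The interval check in particular is easy to script. The sketch below reads scontrol show config output on stdin and succeeds only when HealthCheckInterval is present and non-zero. The function name healthcheck_enabled is hypothetical, and the parsing assumes lines of the form "HealthCheckInterval = 300 sec"; confirm that layout against your Slurm release before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if HealthCheckInterval is set and non-zero.
# Assumes `scontrol show config` prints lines like "HealthCheckInterval = 300 sec".

healthcheck_enabled() {
  awk -F'=' '
    # Match the key case-insensitively, allowing padding before the "=".
    tolower($1) ~ /^healthcheckinterval[ \t]*$/ {
      split($2, v, " ")        # v[1] is the number, v[2] the unit
      found = 1
      interval = v[1] + 0
    }
    # Exit status 0 when the key was found with a positive value, 1 otherwise.
    END { exit !(found && interval > 0) }
  '
}
```

A typical use would be scontrol show config | healthcheck_enabled || echo "health gate is disabled", run from a fleet audit job.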

Slurm: NHC standalone

nhc -d
# Expect each check to print and the run to end cleanly on a healthy node.
# On failure NHC prints the failing check and (under Slurm) runs scontrol update ... State=DRAIN;
# re-run resolves nothing until the underlying fault clears.

Kubernetes: confirm the validator gate and labels

kubectl get pods -n gpu-operator | grep validator
# Expect nvidia-operator-validator Running/Completed on healthy GPU nodes.
# Stuck in Init: on a node => the GPU stack failed validation; GPU pods there will not start.

kubectl get nodes -L nvidia.com/gpu.product,nvidia.com/gpu.count,nvidia.com/gpu.present
# Expect GFD to have populated product/count/present on each GPU node.

kubectl describe node gpu-node-7 | sed -n '/Conditions:/,/Addresses:/p'
# Expect the standard conditions Ready/MemoryPressure/DiskPressure/PIDPressure.
# With NPD + a GPU custom-plugin-monitor, expect an additional GPU condition you defined;
# True on that condition is your cordon/drain trigger.

kubectl get nodes
# A gated node shows SchedulingDisabled after `kubectl cordon`; new GPU pods stay Pending.

Do not read a green validator or a populated GFD label as proof the silicon is fit; both confirm the stack and inventory, not that a stress diagnostic passed. Acceptance still requires an active dcgmi diag (GPU Diagnostics and Validation).¹

Failure modes

HealthCheckInterval left at 0. The gate is silently disabled; nodes never get probed. Verify it is non-zero in scontrol show config.²
Probe logs but never drains. Slurm acts only if the script itself runs scontrol update ... State=drain; a probe that just prints leaves bad nodes schedulable.² Use NHC, or call scontrol from your script.
Active diagnostic in the gate on a busy node. dcgmi diag -r 2/3/4 needs exclusive GPU access¹; running it via the pre-job/periodic gate on an allocated node yields false failures. Keep the gate at -r 1 and restrict with HealthCheckNodeState=IDLE. When triaging a falsely gated node, rule out the OOB/host layer first (OOB / BMC Unreachable).
Gating on check_nv_healthmon. It wraps the deprecated nvidia-healthmon⁵; a new gate must use dcgmi diag/dcgmi health via check_cmd_status instead.
Node drained but never re-admitted. Slurm DRAIN/K8s cordon are sticky by design; a node stuck out of the pool after the fault cleared is lost capacity. Close the loop with scontrol update ... State=RESUME / kubectl uncordon only after a clean re-probe.³⁷
K8s drain fails on a GPU node. Without --ignore-daemonsets it refuses (GPU Operator runs DaemonSets); without --delete-emptydir-data it stalls on emptyDir pods.⁷
Stale GPU Operator validation sentinels. A crashed validator can leave /run/nvidia/validations/* files (or their absence) wedging GPU pods in Init:.⁸ Drift/remediation: Image Drift Across Fleet.
Trusting a passive watch or a label as fitness. dcgmi health -c, a green validator, and a GFD label are non-invasive; they do not exercise the GPU. A node that drifted into a marginal-board fault can pass all three; acceptance needs an active diagnostic on an idle node (GPU Diagnostics and Validation, Reliability, RAS and Failure Modes).¹

References

Slurm slurm.conf (HealthCheckProgram run-as-root + NOT_RESPONDING exclusion + 60 s kill + explicit scontrol update ... State=drain; HealthCheckInterval default 0; HealthCheckNodeState ANY/IDLE/ALLOC/CYCLE): https://slurm.schedmd.com/slurm.conf.html
Slurm scontrol (update NodeName= State=DRAIN/RESUME Reason=): https://slurm.schedmd.com/scontrol.html
Slurm sinfo (-R reasons; DRAINING/DRAINED/DOWN states; DRAIN filter): https://slurm.schedmd.com/sinfo.html
LBNL Node Health Check (mej/nhc; v1.4.3; /usr/sbin/nhc, /etc/nhc/nhc.conf, nhc-genconf, nhc-wrapper; SLURM_SC_OFFLINE_ARGS="update State=DRAIN"; check_cmd_status -t -r, check_fs_mount_rw, check_ps_service, deprecated check_nv_healthmon): https://github.com/mej/nhc
LBNL NHC documentation: https://lbnl-node-health-check.readthedocs.io/en/latest/README.html
NVIDIA DCGM Diagnostics (run levels; active levels require exclusive GPU access; dcgmi diag -r): https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
NVIDIA GPU Operator getting-started (operand labels nvidia.com/gpu.deploy.operands=false, feature.node.kubernetes.io/pci-10de.present=true): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
NVIDIA GPU Feature Discovery / k8s-device-plugin (labels nvidia.com/gpu.product, nvidia.com/gpu.count, nvidia.com/gpu.present): https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
NVIDIA GPU Operator validator sentinels (/run/nvidia/validations/{host-driver-ready,toolkit-ready,cuda-ready,plugin-ready}; device-plugin init wait loop; WITH_WORKLOAD): https://github.com/NVIDIA/gpu-operator/issues/508 · troubleshooting: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
Kubernetes Node Problem Detector (node conditions; custom-plugin-monitor): https://github.com/kubernetes/node-problem-detector
Kubernetes safely drain a node (kubectl cordon/drain --ignore-daemonsets --delete-emptydir-data/uncordon): https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/

1. NVIDIA DCGM Diagnostics: the active diagnostic runs real GPU workloads (large matrix multiplications that stress the GPU on power and throughput) and analyzes the results, so it needs the target GPUs to itself and yields false failures when other processes are using them; run levels -r 1 (Quick) through -r 4 (Extended). Authoritative for run-level membership and timings on your build via dcgmi diag --help. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html
2. Slurm slurm.conf manual, HealthCheckProgram: "Fully qualified pathname of a script to execute as user root periodically on all compute nodes that are not in the NOT_RESPONDING state"; "This program will be killed if it does not terminate normally within 60 seconds"; "Any action to be taken must be explicitly performed by the program (e.g. execute scontrol update NodeName=foo State=drain Reason=tmp_file_system_full to drain a node)". HealthCheckInterval default 0 disables execution. HealthCheckNodeState default ANY; supports IDLE, ALLOC, CYCLE. https://slurm.schedmd.com/slurm.conf.html
3. Slurm scontrol: update NodeName=<nodes> State=DRAIN Reason="..." (Reason mandatory for DOWN/DRAIN/FAIL; quote multi-word); State=RESUME re-admits a drained node. https://slurm.schedmd.com/scontrol.html
4. Slurm sinfo: -R lists the Reason field for down/drained/draining/failing nodes; DRAINING = no new jobs, jobs still running; DRAINED = no new jobs, none running; DOWN = offline; DRAIN filter covers DRAINING+DRAINED. https://slurm.schedmd.com/sinfo.html
5. LBNL Node Health Check (mej/nhc), current release 1.4.3: periodic per-node health check that marks unhealthy nodes offline "so as to prevent jobs from being scheduled or run on them"; /usr/sbin/nhc, config /etc/nhc/nhc.conf (target || check_command syntax), nhc-genconf writes /etc/nhc/nhc.conf.auto, helpers /usr/libexec/nhc/; Slurm offline action via SLURM_SC_OFFLINE_ARGS default "update State=DRAIN" run through scontrol; checks include check_cmd_status -t <secs> -r <retval> <cmd>, check_fs_mount_rw, check_ps_service, and the deprecated check_nv_healthmon (wraps nvidia-healthmon); nhc-wrapper -M root -X 12h for cron with duplicate-suppression. https://github.com/mej/nhc · https://lbnl-node-health-check.readthedocs.io/en/latest/README.html
6. Kubernetes Node Problem Detector: DaemonSet that reports node problems as node conditions (permanent) or events (temporary); custom-plugin-monitor invokes user scripts whose exit status sets a condition; NPD surfaces the signal but does not cordon by itself. https://github.com/kubernetes/node-problem-detector
7. Kubernetes: kubectl cordon <node> marks a node unschedulable; kubectl drain <node> safely evicts pods (respecting PodDisruptionBudgets), needs --ignore-daemonsets where DaemonSets run and --delete-emptydir-data for emptyDir pods; kubectl uncordon <node> re-admits. https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/
8. NVIDIA GPU Operator: nvidia-operator-validator writes sentinel files /run/nvidia/validations/{host-driver-ready,toolkit-ready,cuda-ready,plugin-ready}; the device-plugin init container blocks on until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia container stack to be setup; sleep 5; done, so a failed validation keeps GPU pods in Init:; CUDA workload step disabled via validator.cuda.env WITH_WORKLOAD=false. https://github.com/NVIDIA/gpu-operator/issues/508 · https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/troubleshooting.html
9. NVIDIA GPU Feature Discovery (in k8s-device-plugin): generates node labels from detected GPUs, including nvidia.com/gpu.present=true, nvidia.com/gpu.product (GPU model), and nvidia.com/gpu.count (physical GPU count), used for placement. https://github.com/NVIDIA/k8s-device-plugin/blob/main/docs/gpu-feature-discovery/README.md
10. NVIDIA GPU Operator getting-started: GPU worker nodes identified by feature.node.kubernetes.io/pci-10de.present=true (NVIDIA PCI vendor 0x10de); label a node nvidia.com/gpu.deploy.operands=false to stop the operator deploying operands there (and nvidia.com/gpu.deploy.driver=false for the driver specifically). https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html