SLOs for cluster and fabric health
Scope: operational SLOs for the cluster substrate (allocatable-GPU health, fabric link health and bandwidth, node readiness, and thermal headroom) expressed as PromQL/DCGM SLIs, and how those SLIs gate scheduling.
Reference templates, not hardware-tested. PromQL assumes dcgm-exporter, kube-state-metrics, and the kube-scheduler /metrics endpoint. Metric names below are the dcgm-exporter default counter names (etc/default-counters.csv), which differ from the raw DCGM field-id API names; confirm against your exporter's counters file and DCGM build, since several fields (per-link NVLink, ECC) are commented out by default. SLO targets are reference values; set your own with stakeholders.
This page is the substrate counterpart to the user-facing SLO/SLI catalog: that page measures the service (inference availability, TTFT, job success); this one measures whether the cluster is fit to schedule onto at all. The diagnostics that produce these signals live on GPU Health Gating and Fabric Bring-Up, Validation and Benchmarking; the alert-rule mechanics (multi-window burn rate, SLO-as-code) are defined once on SLO/SLI Catalog and Error-Budget Alerts and not repeated here.
What it is
A cluster/fabric SLO is a service-level objective on the substrate, not on the workload. The SLI is a ratio of "good capacity" to "total capacity" measured continuously from infra telemetry, and the SLO is its target over a window. Four substrate SLIs:
| SLI | Definition (good / total) | Reference SLO | Source signal |
|---|---|---|---|
| GPU allocatable ratio | schedulable GPUs / total GPUs | >= 98% / 28d | kube-state-metrics / Slurm |
| Node readiness | Ready=true nodes / total GPU nodes | >= 99% / 28d | kube-state-metrics |
| Fabric link health | links without errors / total links | >= 99.9% / 28d | dcgm-exporter (NVLink), ibdiagnet |
| Thermal headroom | GPUs at least 5 C below slowdown threshold / total | >= 99.9% (instantaneous, paged) | dcgm-exporter |
The distinction from SLO/SLI Catalog and Error-Budget Alerts matters: those are user-outcome SLIs and you alert on burn rate. Substrate SLIs are capacity gates: a breach does not (only) page; it withdraws nodes from scheduling via the health-gating loop (GPU Health Gating). The SLO quantifies how much capacity you tolerate losing before it becomes an availability problem for the workloads on top.
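To make "how much capacity you tolerate losing" concrete, the arithmetic behind a ratio SLO can be sketched as below. This is a minimal illustration, not part of the catalog: the function names and the 1,024-GPU fleet are hypothetical, and it assumes the budget is accounted in GPU-hours over the 28-day window.

```python
def capacity_error_budget(total_gpus: int, slo_target: float, window_days: int = 28) -> dict:
    """Return the GPU-hours a good/total ratio SLO allows you to lose over its window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_hours = window_days * 24
    total_gpu_hours = total_gpus * window_hours
    return {
        "total_gpu_hours": total_gpu_hours,
        # Capacity that may be unschedulable without breaching the SLO.
        "budget_gpu_hours": total_gpu_hours * (1.0 - slo_target),
        # The same budget expressed as an average number of GPUs out of service.
        "avg_gpus_unavailable": total_gpus * (1.0 - slo_target),
    }


def budget_consumed(lost_gpu_hours: float, budget_gpu_hours: float) -> float:
    """Fraction of the error budget already spent (can exceed 1.0)."""
    return lost_gpu_hours / budget_gpu_hours


if __name__ == "__main__":
    # Hypothetical fleet: 1,024 GPUs against the 98% / 28d reference target.
    budget = capacity_error_budget(total_gpus=1024, slo_target=0.98)
    print(budget)
    # One 8-GPU node drained for 3 days costs 8 * 72 = 576 GPU-hours.
    print(budget_consumed(8 * 72, budget["budget_gpu_hours"]))
```

Expressing the budget in GPU-hours makes a single long drain and many short ones comparable in a capacity review.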
Why it matters
A GPU that enumerates in nvidia-smi is not necessarily a GPU fit for a multi-hour collective (GPU Health Gating). At fleet scale, hardware failure is steady-state: ECC error counts ramp, HBM row remaps go pending, NVLink lanes drop, nodes wedge in NotReady, and inlet temperatures climb until GPUs hit the slowdown threshold and silently throttle. Each failure either removes capacity or, worse, leaves a degraded node schedulable, so one bad node sinks every multi-node job that lands on it.
Substrate SLOs make that capacity loss measurable and budgeted instead of discovered by a failed training run:
- They turn "the cluster feels flaky" into a number with a target and an error budget you can defend in a capacity review (GPU Capacity Planning, Cloud, Neoclouds and Cost/Capacity).
- They give the scheduler gate (GPU Health Gating) an objective threshold: below it, cordon/drain; the SLI is the post-hoc record of how often that fired.
- Thermal-headroom and fabric SLIs catch silent degradation: a throttling GPU or an NVLink negotiated down to 1X still passes a liveness check but tanks MFU (Training MFU Regression).
When it is needed (and when not)
Needed when:
- You run multi-node collectives: one degraded link or throttling GPU stalls the whole job; substrate health is the dominant availability term (Distributed Training Recipes, FSDP).
- You operate a shared, multi-tenant cluster where the scheduler must trust that "allocatable" means "fit" (Kubernetes for GPU Clusters, Slurm for GPU Clusters, Volcano Gang Scheduler).
- You owe a capacity or uptime commitment upstream (a neocloud SLA, an internal platform SLO) and need substrate evidence (Cloud, Neoclouds and Cost/Capacity, Vendor Sourcing and Procurement Logistics).
Not needed (or lighter-weight) when:
- Single-node, single-GPU work: fabric SLIs are moot; a node-readiness liveness check suffices.
- A managed neocloud already gates and replaces unhealthy nodes under its own SLA, so you consume their health signal rather than re-deriving it (Cloud, Neoclouds and Cost/Capacity). Still track allocatable ratio to hold them to it.
- Pre-production bring-up: use the one-shot acceptance procedures (Fabric Bring-Up, Validation and Benchmarking, Smoke Tests: GPU Platform) instead of continuous SLOs; SLOs come online once the cluster carries real load.
Do not put an SLO on a raw infra gauge with no capacity meaning (e.g. average GPU temperature); that produces green dashboards and unhappy jobs (SLO/SLI Catalog and Error-Budget Alerts failure modes). The SLI must be a good/total capacity ratio.
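The difference between a raw gauge and a capacity ratio can be seen in a small sketch. The temperatures and the 90 C slowdown value are made up for illustration; the 5 C margin matches the thermal-headroom SLI defined on this page.

```python
from statistics import mean

SLOWDOWN_C = 90.0  # hypothetical per-GPU slowdown trip point
MARGIN_C = 5.0     # headroom margin used by the thermal SLI


def thermal_headroom_ratio(temps_c, slowdown_c=SLOWDOWN_C, margin_c=MARGIN_C):
    """good / total: GPUs at least margin_c below the slowdown threshold."""
    if not temps_c:
        raise ValueError("no GPUs reported")
    good = sum(1 for t in temps_c if t < slowdown_c - margin_c)
    return good / len(temps_c)


# Seven cool GPUs and one that is already past the trip point.
temps = [62.0] * 7 + [91.0]

if __name__ == "__main__":
    print("average temperature:", mean(temps))               # looks comfortable
    print("headroom ratio:", thermal_headroom_ratio(temps))  # one GPU in eight is not fit
```

Here the average sits around 66 C and would stay green on a dashboard, while the ratio shows that one eighth of the node's GPUs is throttling.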
How: implement, integrate, maintain
The flow: dcgm-exporter and kube-state-metrics emit substrate signals; recording rules compute the four SLIs; thresholds both page (via the burn-rate pattern on SLO/SLI Catalog and Error-Budget Alerts) and feed the gating loop that cordons/drains (GPU Health Gating), withdrawing the node from allocatable and closing the loop.
flowchart LR
DCGM["dcgm-exporter: temp, NVLink, XID"] --> PROM["Prometheus recording rules"]
KSM["kube-state-metrics: node Ready, allocatable"] --> PROM
PROM --> SLI["Substrate SLIs: allocatable, readiness, fabric, thermal"]
SLI --> ALERT["Burn-rate alert -> page"]
SLI --> GATE["Threshold breach -> gate"]
GATE --> DRAIN["Cordon / drain node"]
DRAIN --> ALLOC["Node leaves allocatable"]
ALLOC --> KSM
Implement: the four SLIs as PromQL
dcgm-exporter labels every metric with Hostname / gpu / UUID; kube-state-metrics labels with node. Join on node/Hostname where you need per-node rollups.
# 1. GPU allocatable ratio (Kubernetes): schedulable GPUs / total GPUs.
# kube_node_status_allocatable still reports GPUs on cordoned nodes, so the join
# multiplies by 0 for cordoned (unschedulable) nodes to exclude them.
sum(
  kube_node_status_allocatable{resource="nvidia_com_gpu"}
    * on(node) group_left()
  (kube_node_spec_unschedulable == bool 0)
)
/ sum(kube_node_status_capacity{resource="nvidia_com_gpu"})
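# 2. Node readiness: GPU nodes with Ready=true / total GPU nodes.
# kube_node_status_condition emits one series per status value (true/false/unknown),
# so both sides select status="true". GPU nodes = nodes reporting GPU capacity.
sum(kube_node_status_condition{condition="Ready",status="true"}
  and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))
/ count(kube_node_status_condition{condition="Ready",status="true"}
  and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))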
# 3. Fabric link health (NVLink error-free fraction), counted per GPU.
# No new NVLink errors in the window => link healthy. Aggregate counter; per-link
# fields (DCGM_FI_DEV_NVLINK_*_L0..) are commented out in the exporter default.
# "or vector(0)" keeps the SLI at 1 instead of returning no data when nothing is erroring.
(
  count(count by (Hostname,gpu) (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL))
  - (
      count(count by (Hostname,gpu) (
        increase(DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL[5m]) > 0
      ))
      or vector(0)
    )
)
/ count(count by (Hostname,gpu) (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL))
# 4. Thermal headroom: GPUs at least 5 C below the per-GPU slowdown threshold.
# DCGM_FI_DEV_GPU_TEMP_SLOWDOWN is the HW slowdown trip point; compare live temp to it.
# "or vector(0)" reports 0 instead of no data when every GPU is inside the margin.
(
  count(
    DCGM_FI_DEV_GPU_TEMP
      < on(Hostname,gpu) (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN - 5)
  )
  or vector(0)
)
/ count(DCGM_FI_DEV_GPU_TEMP)
Field-name caveats, verify against your exporter's counters file:
- DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL and DCGM_FI_DEV_GPU_TEMP_SLOWDOWN are not in the stock default-counters.csv; add them to the exporter's counters configmap to export them (DCGM Exporter). If unavailable, substitute a fixed thermal threshold (e.g. DCGM_FI_DEV_GPU_TEMP > 87, as used on Telemetry, Monitoring and Alerting) and gate fabric health on XID instead: DCGM_FI_DEV_XID_ERRORS > 0. The stock counters file exports this field as a gauge holding the last XID code, so check the value (or changes() over a window) rather than increase().
- On Slurm there is no kube-state-metrics; derive allocatable ratio from sinfo/drain state via a textfile-collector exporter scraping sinfo -h -o '%D %t' (idle+mix+alloc vs drain/down).
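The Slurm path above could be sketched as a small script that turns sinfo output into a textfile-collector metric. This is a sketch under assumptions: the metric name slurm_node_allocatable_ratio is hypothetical, the ratio counts nodes rather than GPUs (reasonable only if nodes carry the same GPU count), and every state other than idle, mix and alloc is treated as unschedulable.

```python
import re

# From the text: idle + mix + alloc count as schedulable; everything else does not.
GOOD_STATES = {"idle", "mix", "alloc"}


def parse_sinfo(output: str) -> dict:
    """Sum node counts per compact state from `sinfo -h -o '%D %t'` output."""
    counts = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        # Drop trailing flag characters such as '*' (node not responding).
        state = re.sub(r"[^a-z]+$", "", parts[1].lower())
        counts[state] = counts.get(state, 0) + int(parts[0])
    return counts


def allocatable_ratio(counts: dict) -> float:
    """Schedulable nodes / total nodes."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sinfo reported no nodes")
    good = sum(n for state, n in counts.items() if state in GOOD_STATES)
    return good / total


def to_textfile(ratio: float) -> str:
    """Render the ratio in Prometheus text exposition format."""
    return (
        "# HELP slurm_node_allocatable_ratio Schedulable nodes / total nodes.\n"
        "# TYPE slurm_node_allocatable_ratio gauge\n"
        f"slurm_node_allocatable_ratio {ratio:.4f}\n"
    )


if __name__ == "__main__":
    # Hand-written sample; in production this would be the stdout of sinfo.
    sample = "12 idle\n3 mix\n1 drain\n20 alloc\n2 down*\n"
    print(to_textfile(allocatable_ratio(parse_sinfo(sample))), end="")
```

A cron job or systemd timer would write this output to a .prom file in the node_exporter textfile directory; write to a temporary file and rename it so the collector never reads a partial file.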
Integrate: recording rule + scheduling gate
Compute the SLIs as recording rules so alerts and dashboards share one definition (multi-window burn-rate alerting is on SLO/SLI Catalog and Error-Budget Alerts; reuse it verbatim against these :ratio series). The substrate-specific addition is the gate alert that drives cordon/drain.
# PrometheusRule: substrate SLIs + thermal gate. Reference template, not hardware-tested.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-cluster-fabric
  namespace: monitoring
  labels: { release: kube-prom }
spec:
  groups:
    - name: slo.substrate
      rules:
        - record: slo:gpu_allocatable:ratio
          expr: |
            sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}
              * on(node) group_left() (kube_node_spec_unschedulable == bool 0))
            / sum(kube_node_status_capacity{resource="nvidia_com_gpu"})
        # status="true" on both sides: the metric has one series per status value.
        # GPU nodes are selected as nodes reporting GPU capacity.
        - record: slo:node_ready:ratio
          expr: |
            sum(kube_node_status_condition{condition="Ready",status="true"}
              and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))
            / count(kube_node_status_condition{condition="Ready",status="true"}
              and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))
        # Capacity SLO breach: allocatable below target -> page (link runbook).
        - alert: GpuAllocatableBelowSLO
          expr: slo:gpu_allocatable:ratio < 0.98
          for: 15m
          labels: { severity: warning }
          annotations:
            summary: "Allocatable GPU ratio {{ $value | humanizePercentage }} < 98% SLO"
            runbook: "gpu-health-gating: find drained/cordoned nodes, triage RMA vs reset"
        # Thermal gate: a GPU within 2 C of its slowdown trip -> page; gate node hot.
        - alert: GpuThermalHeadroomCritical
          expr: |
            DCGM_FI_DEV_GPU_TEMP
              >= on(Hostname,gpu) (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN - 2)
          for: 2m
          labels: { severity: critical }
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} within 2C of slowdown"
            runbook: "gpu-health-gating: cordon/drain the node; check cooling and power"
The gate alert is the seam to GPU Health Gating: route GpuThermalHeadroomCritical to Alertmanager, and have the receiver (or a Node Problem Detector custom-plugin-monitor reading the same DCGM signal) set a node condition / cordon the node so it leaves allocatable. That cordon then shows up in SLI #1, closing the loop. On Slurm the equivalent is the HealthCheckProgram probe running scontrol update ... State=DRAIN (GPU Health Gating).
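One way the receiver side of that seam could look is sketched below: a function that reads an Alertmanager webhook payload and decides which nodes to cordon. The payload fields used (alerts, status, labels) follow the Alertmanager webhook format; the sample payload and node names are invented, the sketch assumes the Hostname label equals the Kubernetes node name, and it only builds the kubectl command lines rather than running them.

```python
import json

# Alert names that should withdraw a node from scheduling (from the rule above).
GATE_ALERTS = {"GpuThermalHeadroomCritical"}

# Invented example of an Alertmanager webhook body.
SAMPLE_PAYLOAD = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-07", "gpu": "3"}},
    {"status": "firing",
     "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-07", "gpu": "5"}},
    {"status": "resolved",
     "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-02", "gpu": "0"}},
    {"status": "firing",
     "labels": {"alertname": "GpuAllocatableBelowSLO"}}
  ]
}
""")


def nodes_to_cordon(payload: dict) -> list:
    """Return sorted, de-duplicated hostnames from firing gate alerts."""
    nodes = set()
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # a resolved alert must not trigger a cordon
        labels = alert.get("labels", {})
        if labels.get("alertname") not in GATE_ALERTS:
            continue  # fleet-level alerts page but do not gate a single node
        host = labels.get("Hostname")
        if host:
            nodes.add(host)
    return sorted(nodes)


def cordon_commands(nodes: list) -> list:
    """Build (not run) the kubectl argv per node, for review or a dry run."""
    return [["kubectl", "cordon", node] for node in nodes]


if __name__ == "__main__":
    for argv in cordon_commands(nodes_to_cordon(SAMPLE_PAYLOAD)):
        print(" ".join(argv))
```

Several GPUs alerting on the same host collapse to one cordon, and uncordoning is deliberately left to a human or to the health-gating loop after triage.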
Maintain
- Generate from SLO-as-code. Declare these four SLIs in an OpenSLO/Sloth spec in git so rules regenerate from one source (SLO/SLI Catalog and Error-Budget Alerts, SLO-as-code section). Review targets with stakeholders.
- Validate the gate end-to-end. Force a synthetic breach (cordon one node; confirm slo:gpu_allocatable:ratio drops) and a synthetic thermal trip (a temporary recording rule that emits a value above the trip) and confirm the page fires and the receiver cordons. Never claim the gate works without observing the cordon (Smoke Tests: GPU Platform).
- Reconcile capacity vs cost. Sustained allocatable < SLO is lost paid capacity; feed it into the cost model (Cloud, Neoclouds and Cost/Capacity, Vendor Sourcing and Procurement Logistics) and, on neoclouds, into node-replacement SLA claims.
- Pin metric names on exporter upgrades. dcgm-exporter renames/relocates counters between releases; re-confirm every DCGM_FI_* against the installed default-counters.csv after an upgrade (DCGM Exporter, Telemetry, Monitoring and Alerting).
- Keep fabric error SLIs counter-aware. NVLink error and PCIe-replay fields are monotonic counters; always wrap them in increase()/rate() over a window, never compare raw totals: a non-zero lifetime counter is not an active fault. DCGM_FI_DEV_XID_ERRORS is the exception: the stock counters file exports it as a gauge holding the last XID code.
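Why windowed increase matters more than the raw total can be shown with a simplified stand-in for PromQL increase(). This sketch sums deltas across a window and treats any drop as a counter reset; unlike the real function it does not extrapolate to the window boundaries, and the sample values are invented.

```python
def counter_increase(samples):
    """Sum of deltas over a window, treating a drop as a counter reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # After a reset (driver reload, reboot) the counter restarts from
            # zero, so the whole current value is new.
            total += cur
    return total


def link_is_faulting(samples) -> bool:
    """Active fault = new errors inside the window, not a non-zero lifetime total."""
    return counter_increase(samples) > 0


if __name__ == "__main__":
    stale = [412, 412, 412, 412]   # old errors, nothing new: healthy
    active = [412, 412, 415, 419]  # errors still arriving: faulting
    reset = [412, 0, 2]            # counter reset, then two new errors
    for name, window in (("stale", stale), ("active", active), ("reset", reset)):
        print(name, counter_increase(window), link_is_faulting(window))
```

A rule that compared the raw total against zero would flag the stale link forever and could miss the errors that follow a reset.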
References
- NVIDIA dcgm-exporter default counters (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_PCIE_REPLAY_COUNTER; per-link NVLink and ECC commented out by default): https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv
- NVIDIA DCGM field identifiers (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN, DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_*, DCGM_FI_DEV_MEMORY_TEMP, ECC SBE/DBE, DCGM_FI_DEV_CLOCKS_EVENT_REASONS): https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
- kube-state-metrics node metrics (kube_node_status_condition, kube_node_status_allocatable, kube_node_status_capacity, kube_node_spec_unschedulable): https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md
- Kubernetes metrics for object states (kube-state-metrics overview): https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
- Prometheus Operator PrometheusRule (recording + alerting rules): https://prometheus-operator.dev/docs/developer/alerting/
- Google SRE Workbook, Alerting on SLOs (burn-rate pattern, reused from the catalog): https://sre.google/workbook/alerting-on-slos/
- Slurm sinfo (node state output for a Slurm allocatable SLI): https://slurm.schedmd.com/sinfo.html
Related: SLO/SLI catalog · GPU health gating · Fabric bring-up & benchmarking · Telemetry & monitoring · Observability · DCGM exporter manifest · Cluster orchestration · Neoclouds & cost · Glossary