SLOs for cluster and fabric health

Scope: operational SLOs for the cluster substrate (allocatable-GPU health, fabric link health and bandwidth, node readiness, and thermal headroom) expressed as PromQL/DCGM SLIs, and how those SLIs gate scheduling.

Reference templates, not hardware-tested. PromQL assumes dcgm-exporter, kube-state-metrics, and the kube-scheduler /metrics endpoint. Metric names below are the dcgm-exporter default counter names (etc/default-counters.csv), which differ from the raw DCGM field-id API names; confirm against your exporter's counters file and DCGM build, since several fields (per-link NVLink, ECC) are commented out by default. SLO targets are reference values; set your own with stakeholders.

This page is the substrate counterpart to the user-facing SLO/SLI catalog: that page measures the service (inference availability, TTFT, job success); this one measures whether the cluster is fit to schedule onto at all. The diagnostics that produce these signals live on GPU Health Gating and Fabric Bring-Up, Validation and Benchmarking; the alert-rule mechanics (multi-window burn rate, SLO-as-code) are defined once on SLO/SLI Catalog and Error-Budget Alerts and not repeated here.

What it is

A cluster/fabric SLO is a service-level objective on the substrate, not on the workload. The SLI is a ratio of "good capacity" to "total capacity" measured continuously from infra telemetry, and the SLO is its target over a window. Four substrate SLIs:

| SLI | Definition (good / total) | Reference SLO | Source signal |
|---|---|---|---|
| GPU allocatable ratio | schedulable GPUs / total GPUs | >= 98% / 28d | kube-state-metrics / Slurm |
| Node readiness | Ready=true nodes / total GPU nodes | >= 99% / 28d | kube-state-metrics |
| Fabric link health | links without errors / total links | >= 99.9% / 28d | dcgm-exporter (NVLink), ibdiagnet |
| Thermal headroom | GPUs below slowdown threshold / total | >= 99.9% (instantaneous, paged) | dcgm-exporter |

The distinction from SLO/SLI Catalog and Error-Budget Alerts matters: those are user-outcome SLIs, and you alert on their burn rate. Substrate SLIs are capacity gates: a breach does not only page; it also withdraws nodes from scheduling via the health-gating loop (GPU Health Gating). The SLO quantifies how much capacity you tolerate losing before it becomes an availability problem for the workloads on top.
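
To make "how much capacity you tolerate losing" concrete, the budget arithmetic can be sketched as follows. This is an illustration only: the function name `capacity_error_budget` and the 1,024-GPU fleet of 8-GPU nodes are assumptions, while the 98% / 28d target is the reference value from the table.

```python
def capacity_error_budget(total_gpus: int, slo_target: float, window_days: int = 28) -> float:
    """Return the GPU-hours of unschedulable capacity the SLO tolerates per window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_hours = window_days * 24
    return (1.0 - slo_target) * total_gpus * window_hours


# Hypothetical fleet: 1,024 GPUs against the 98% / 28d reference target.
budget = capacity_error_budget(total_gpus=1024, slo_target=0.98)

# Express the same budget as 8-GPU nodes drained for the entire window.
nodes_out = budget / (8 * 28 * 24)

print(f"{budget:.0f} GPU-hours, about {nodes_out:.2f} eight-GPU nodes continuously drained")
```

Expressing the budget in node-equivalents is a design choice for capacity reviews: it maps directly onto how many nodes can sit cordoned before the SLO is at risk.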

Why it matters

A GPU that enumerates in nvidia-smi is not necessarily a GPU fit for a multi-hour collective (GPU Health Gating). At fleet scale, hardware failure is steady-state: ECC error counts ramp, HBM row-remaps go pending, NVLink lanes drop, nodes wedge NotReady, and inlet temperatures climb until GPUs hit the slowdown threshold and silently throttle. Each of these removes capacity or, worse, leaves a degraded node schedulable, so one bad node sinks every multi-node job that lands on it.

Substrate SLOs make that capacity loss measurable and budgeted instead of discovered by a failed training run:

  • They turn "the cluster feels flaky" into a number with a target and an error budget you can defend in a capacity review (GPU Capacity Planning, Cloud, Neoclouds and Cost/Capacity).
  • They give the scheduler gate (GPU Health Gating) an objective threshold: below it, cordon/drain; the SLI is the post-hoc record of how often that fired.
  • Thermal-headroom and fabric SLIs catch silent degradation: a throttling GPU or an NVLink negotiated down to 1X still passes a liveness check but tanks MFU (Training MFU Regression).

When it is needed (and when not)

Needed when:

Not needed (or lighter-weight) when:

  • Single-node, single-GPU work: fabric SLIs are moot; a node-readiness liveness check suffices.
  • A managed neocloud already gates and replaces unhealthy nodes under its own SLA, so you consume their health signal rather than re-deriving it (Cloud, Neoclouds and Cost/Capacity). Still track allocatable ratio to hold them to it.
  • Pre-production bring-up: use the one-shot acceptance procedures (Fabric Bring-Up, Validation and Benchmarking, Smoke Tests: GPU Platform) instead of continuous SLOs; SLOs come online once the cluster carries real load.

Do not put an SLO on a raw infra gauge with no capacity meaning (e.g. average GPU temperature); that produces green dashboards and unhappy jobs (SLO/SLI Catalog and Error-Budget Alerts failure modes). The SLI must be a good/total capacity ratio.
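
A small numeric sketch shows why the average gauge misleads. The temperatures are a made-up snapshot of one 8-GPU node, and the 87 C limit is the fixed fallback threshold mentioned on this page; neither is a measured value.

```python
from statistics import mean

# Hypothetical snapshot: GPU temperatures (C) on one 8-GPU node.
temps_c = [62, 64, 63, 61, 65, 63, 91, 92]
FIXED_LIMIT_C = 87  # fixed fallback threshold, an assumption for this sketch

# Raw gauge: the fleet average hides the two hot GPUs.
average = mean(temps_c)

# Capacity ratio: good GPUs / total GPUs.
good = sum(1 for t in temps_c if t < FIXED_LIMIT_C)
headroom_ratio = good / len(temps_c)

print(f"average temp: {average:.1f} C")
print(f"thermal headroom SLI: {headroom_ratio:.3f}")
```

The average sits well under the limit while a quarter of the node's GPUs are over it, which is exactly the "green dashboard, unhappy job" case.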

How: implement, integrate, maintain

The flow: dcgm-exporter and kube-state-metrics emit substrate signals; recording rules compute the four SLIs; thresholds both page (via the burn-rate pattern on SLO/SLI Catalog and Error-Budget Alerts) and feed the gating loop that cordons/drains (GPU Health Gating), withdrawing the node from allocatable and closing the loop.

flowchart LR
  DCGM["dcgm-exporter: temp, NVLink, XID"] --> PROM["Prometheus recording rules"]
  KSM["kube-state-metrics: node Ready, allocatable"] --> PROM
  PROM --> SLI["Substrate SLIs: allocatable, readiness, fabric, thermal"]
  SLI --> ALERT["Burn-rate alert -> page"]
  SLI --> GATE["Threshold breach -> gate"]
  GATE --> DRAIN["Cordon / drain node"]
  DRAIN --> ALLOC["Node leaves allocatable"]
  ALLOC --> KSM

Implement: the four SLIs as PromQL

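dcgm-exporter labels every metric with Hostname / gpu / UUID; kube-state-metrics labels with node. The label names differ, so a per-node rollup that combines both sources must first rename one side (for example with label_replace) and then join on the shared label. Confirm that Hostname actually carries the Kubernetes node name; that depends on how the exporter is deployed.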

# 1. GPU allocatable ratio (Kubernetes): schedulable GPUs / total GPUs.
#    A cordon does not change a node's reported allocatable, so multiply by the
#    unschedulable flag to zero out GPUs on cordoned (unschedulable) nodes.
sum(
  kube_node_status_allocatable{resource="nvidia_com_gpu"}
  * on(node) group_left()
  (kube_node_spec_unschedulable == bool 0)
)
/ sum(kube_node_status_capacity{resource="nvidia_com_gpu"})

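# 2. Node readiness: GPU nodes with Ready=true / total GPU nodes.
#    kube-state-metrics emits one series per status (true/false/unknown) for each node,
#    so select status="true" on both sides: its value is 1 when Ready, else 0.
#    GPU nodes are selected here by non-zero GPU capacity; a kube_node_labels join on a
#    label your GFD/labeller sets works too if that label is allowlisted.
sum(
  kube_node_status_condition{condition="Ready",status="true"}
  and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0)
)
/ count(
  kube_node_status_condition{condition="Ready",status="true"}
  and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0)
)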

# 3. Fabric link health (NVLink error-free fraction), measured per GPU.
#    No new NVLink errors in the window => that GPU's links are healthy. Aggregate counter;
#    per-link fields (DCGM_FI_DEV_NVLINK_*_L0..) are commented out in the exporter default.
#    "or vector(0)" keeps the SLI defined when no GPU has errors (count() of nothing is empty).
( count(count by (Hostname,gpu) (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL))
  - ( count(count by (Hostname,gpu) (
        increase(DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL[5m]) > 0
      )) or vector(0) ) )
/ count(count by (Hostname,gpu) (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL))

# 4. Thermal headroom: GPUs at least 5 C below the per-GPU slowdown threshold.
#    DCGM_FI_DEV_GPU_TEMP_SLOWDOWN is the HW slowdown trip point; compare live temp to it.
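#    "or vector(0)" makes the SLI read 0 (not "no data") if every GPU is over the margin.
(
  count(
    DCGM_FI_DEV_GPU_TEMP
    < on(Hostname,gpu) (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN - 5)
  ) or vector(0)
)
/ count(DCGM_FI_DEV_GPU_TEMP)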

Field-name caveats, verify against your exporter's counters file:

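  • DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL and DCGM_FI_DEV_GPU_TEMP_SLOWDOWN are not in the stock default-counters.csv; add them to the exporter's counters configmap to export them (DCGM Exporter). Check the slowdown field's exact name for your DCGM build first: DCGM's field identifiers list it as DCGM_FI_DEV_SLOWDOWN_TEMP, so the name used in these queries may need adjusting. If the fields are unavailable, substitute a fixed thermal threshold (e.g. DCGM_FI_DEV_GPU_TEMP > 87, as used on Telemetry, Monitoring and Alerting) and gate fabric health on XID instead. DCGM_FI_DEV_XID_ERRORS is exported as a gauge holding the last XID code, not a counter, so detect new events with changes(DCGM_FI_DEV_XID_ERRORS[5m]) > 0 rather than increase().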
  • On Slurm there is no kube-state-metrics; derive allocatable ratio from sinfo/drain state via a textfile-collector exporter scraping sinfo -h -o '%D %t' (idle+mix+alloc vs drain/down).
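
The Slurm path can be sketched as a small parser that a cron job or systemd timer could run, writing its result into a node_exporter textfile-collector directory. Everything named here is an assumption for illustration: the metric name `slurm_node_allocatable_ratio`, the function names, and the sample output. The sketch counts nodes rather than GPUs, so it only equals the GPU ratio if every node carries the same number of GPUs, and it deliberately treats flagged states such as `idle*` (not responding) as not schedulable.

```python
# Compact sinfo states (%t) treated as schedulable, per the idle+mix+alloc rule.
# Exact match only: a flagged state such as "idle*" or "idle~" is counted as not good.
GOOD_STATES = {"idle", "mix", "alloc"}


def allocatable_ratio(sinfo_output: str) -> float:
    """Return schedulable nodes / total nodes from `sinfo -h -o '%D %t'` output."""
    good = total = 0
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        count = int(parts[0])
        total += count
        if parts[1].lower() in GOOD_STATES:
            good += count
    if total == 0:
        raise ValueError("no nodes parsed from sinfo output")
    return good / total


def to_textfile(ratio: float) -> str:
    """Render the ratio in Prometheus text exposition format."""
    name = "slurm_node_allocatable_ratio"  # hypothetical metric name
    return (
        f"# HELP {name} Schedulable Slurm nodes / total nodes.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {ratio:.6f}\n"
    )


# Made-up sample of what the sinfo command might print.
sample = "40 idle\n12 mix\n44 alloc\n3 drain\n1 down*\n"
print(to_textfile(allocatable_ratio(sample)), end="")
```

Writing the file atomically (write to a temporary name, then rename) is advisable so the collector never scrapes a half-written file.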

Integrate: recording rule + scheduling gate

Compute the SLIs as recording rules so alerts and dashboards share one definition (multi-window burn-rate alerting is on SLO/SLI Catalog and Error-Budget Alerts; reuse it verbatim against these :ratio series). The substrate-specific addition is the gate alert that drives cordon/drain.

# PrometheusRule: substrate SLIs + thermal gate. Reference template, not hardware-tested.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-cluster-fabric
  namespace: monitoring
  labels: { release: kube-prom }
spec:
  groups:
    - name: slo.substrate
      rules:
        - record: slo:gpu_allocatable:ratio
          expr: |
            sum(kube_node_status_allocatable{resource="nvidia_com_gpu"}
                * on(node) group_left() (kube_node_spec_unschedulable == bool 0))
            / sum(kube_node_status_capacity{resource="nvidia_com_gpu"})
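        # status="true" gives one series per node (1 if Ready, else 0); matching all
        # statuses would count each node three times. Restricted to GPU nodes by capacity.
        - record: slo:node_ready:ratio
          expr: |
            sum(kube_node_status_condition{condition="Ready",status="true"}
                and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))
            / count(kube_node_status_condition{condition="Ready",status="true"}
                and on(node) (kube_node_status_capacity{resource="nvidia_com_gpu"} > 0))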
        # Capacity SLO breach: allocatable below target -> page (link runbook).
        - alert: GpuAllocatableBelowSLO
          expr: slo:gpu_allocatable:ratio < 0.98
          for: 15m
          labels: { severity: warning }
          annotations:
            summary: "Allocatable GPU ratio {{ $value | humanizePercentage }} < 98% SLO"
            runbook: "gpu-health-gating: find drained/cordoned nodes, triage RMA vs reset"
        # Thermal gate: a GPU within 2 C of its slowdown trip -> page and gate the hot node.
        - alert: GpuThermalHeadroomCritical
          expr: |
            DCGM_FI_DEV_GPU_TEMP
            >= on(Hostname,gpu) (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN - 2)
          for: 2m
          labels: { severity: critical }
          annotations:
            summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} within 2C of slowdown"
            runbook: "gpu-health-gating: cordon/drain the node; check cooling and power"

The gate alert is the seam to GPU Health Gating: route GpuThermalHeadroomCritical to Alertmanager, and have the receiver (or a Node Problem Detector custom-plugin-monitor reading the same DCGM signal) set a node condition / cordon the node so it leaves allocatable. That cordon then shows up in SLI #1, closing the loop. On Slurm the equivalent is the HealthCheckProgram probe running scontrol update ... State=DRAIN (GPU Health Gating).
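
The receiver side of that seam can be sketched as a function that turns an Alertmanager webhook body into the list of nodes to cordon. This is a minimal sketch under stated assumptions: it reads only the `alerts[].status` and `alerts[].labels` fields of the webhook JSON, it assumes the `Hostname` label matches the Kubernetes node name, and the function name and example hosts are hypothetical. The cordon itself is left to whatever mechanism your gating loop already uses.

```python
import json


def nodes_to_cordon(payload: str, alert_name: str = "GpuThermalHeadroomCritical") -> list[str]:
    """Return sorted, de-duplicated Hostname labels of firing alerts in a webhook body."""
    body = json.loads(payload)
    nodes = set()
    for alert in body.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts must not trigger a cordon
        labels = alert.get("labels", {})
        if labels.get("alertname") != alert_name:
            continue  # ignore unrelated alerts routed to the same receiver
        host = labels.get("Hostname")
        if host:
            nodes.add(host)  # several hot GPUs on one node collapse to one cordon
    return sorted(nodes)


# Hypothetical payload, trimmed to the fields this sketch reads.
example = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-07", "gpu": "3"}},
        {"status": "firing",
         "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-07", "gpu": "5"}},
        {"status": "resolved",
         "labels": {"alertname": "GpuThermalHeadroomCritical", "Hostname": "gpu-node-02", "gpu": "0"}},
    ],
})
print(nodes_to_cordon(example))
```

De-duplicating per node matters because the alert fires per GPU while the gate acts per node.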

Maintain

  • Generate from SLO-as-code. Declare these four SLIs in an OpenSLO/Sloth spec in git so rules regenerate from one source (SLO/SLI Catalog and Error-Budget Alerts, SLO-as-code section). Review targets with stakeholders.
  • Validate the gate end-to-end. Force a synthetic breach (cordon one node; confirm slo:gpu_allocatable:ratio drops) and a synthetic thermal trip (a temporary recording rule that emits a value above the trip) and confirm the page fires and the receiver cordons. Never claim the gate works without observing the cordon (Smoke Tests: GPU Platform).
  • Reconcile capacity vs cost. Sustained allocatable < SLO is lost paid capacity; feed it into the cost model (Cloud, Neoclouds and Cost/Capacity, Vendor Sourcing and Procurement Logistics) and, on neoclouds, into node-replacement SLA claims.
  • Pin metric names on exporter upgrades. dcgm-exporter renames/relocates counters between releases; re-confirm every DCGM_FI_* against the installed default-counters.csv after an upgrade (DCGM Exporter, Telemetry, Monitoring and Alerting).
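  • Keep fabric error SLIs counter-aware. NVLink error and PCIe-replay fields are monotonic counters; always wrap them in increase()/rate() over a window and never compare raw totals: a non-zero lifetime counter is not an active fault. DCGM_FI_DEV_XID_ERRORS is the exception: it is a gauge holding the last XID code, so detect new events with changes() instead.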
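
The counter-awareness point can be illustrated with a simplified increase calculation. This is a sketch of the idea only, not Prometheus's implementation: `counter_increase` is a hypothetical helper, it does no extrapolation to window boundaries as PromQL's increase() does, and the sample series are invented.

```python
def counter_increase(samples: list[float]) -> float:
    """Sum of increases across samples, treating any drop as a counter reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is all increase.
        total += curr - prev if curr >= prev else curr
    return total


# Invented NVLink error-counter samples over one evaluation window.
lifetime_nonzero_but_quiet = [1500.0, 1500.0, 1500.0, 1500.0]
actively_erroring = [1500.0, 1502.0, 1507.0, 1507.0]
reset_mid_window = [1500.0, 1503.0, 2.0, 4.0]  # e.g. counter restarted mid-window

for series in (lifetime_nonzero_but_quiet, actively_erroring, reset_mid_window):
    print(f"last raw value {series[-1]:>7.1f} -> increase {counter_increase(series):.1f}")
```

The first series has the largest raw total of a healthy link and zero increase; the third has a small raw total yet is actively erroring. Comparing raw totals gets both wrong.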

References

  • NVIDIA dcgm-exporter default counters (DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, DCGM_FI_DEV_XID_ERRORS, DCGM_FI_DEV_PCIE_REPLAY_COUNTER; per-link NVLink and ECC commented out by default): https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/default-counters.csv
  • NVIDIA DCGM field identifiers (DCGM_FI_DEV_GPU_TEMP_SLOWDOWN, DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_*, DCGM_FI_DEV_MEMORY_TEMP, ECC SBE/DBE, DCGM_FI_DEV_CLOCKS_EVENT_REASONS): https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-api/dcgm-api-field-ids.html
  • kube-state-metrics node metrics (kube_node_status_condition, kube_node_status_allocatable, kube_node_status_capacity, kube_node_spec_unschedulable): https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/cluster/node-metrics.md
  • Kubernetes metrics for object states (kube-state-metrics overview): https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
  • Prometheus Operator PrometheusRule (recording + alerting rules): https://prometheus-operator.dev/docs/developer/alerting/
  • Google SRE Workbook — Alerting on SLOs (burn-rate pattern, reused from the catalog): https://sre.google/workbook/alerting-on-slos/
  • Slurm sinfo (node state output for a Slurm allocatable SLI): https://slurm.schedmd.com/sinfo.html

Related: SLO/SLI catalog · GPU health gating · Fabric bring-up & benchmarking · Telemetry & monitoring · Observability · DCGM exporter manifest · Cluster orchestration · Neoclouds & cost · Glossary