SRE, platform & MLOps practices

Scope: the operating-excellence layer over the whole stack, covering SLOs and error budgets, GitOps and policy-as-code, IaC, and the MLOps lifecycle. These practices keep pages 01-23 reliable, self-service, and reproducible rather than a pile of one-off manifests.

Reference patterns and templates. Adopt incrementally; maturity matters more than tooling choice.

Overview

Three disciplines overlap on a GPU platform. SRE keeps it reliable and on-budget. Platform engineering makes it self-service and consistent through code. MLOps makes the workloads on top reproducible and continuously delivered. The unifying principle is the same one in the index Conventions: everything is code in git, applied by automation, observed against explicit objectives.

Loop: GitOps reconciliation

flowchart LR
  ENG["Engineer"] -->|"pull request"| GIT["Git repo"]
  GIT -->|"reconcile"| CD["Argo CD / Flux"]
  CD -->|"apply"| K8S["Cluster"]
  K8S -.->|"drift detected"| CD
  POL["Kyverno / OPA"] -.->|"admission control"| K8S
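
The reconcile and drift arrows in the loop can be illustrated with a toy model. This is a minimal sketch, not how Argo CD or Flux are implemented: the function name `plan_sync` and the dictionaries standing in for git and cluster state are hypothetical, and real controllers compare full Kubernetes objects rather than flat key/value pairs.

```python
from __future__ import annotations


def plan_sync(desired: dict[str, str], live: dict[str, str], prune: bool = True) -> dict[str, list[str]]:
    """Compute the actions that would converge `live` onto `desired`.

    Keys are resource names and values are an opaque spec (for example a
    content hash). Both shapes are assumptions made for illustration.
    """
    # Resources declared in git but absent from the cluster.
    create = sorted(name for name in desired if name not in live)
    # Resources present in both places whose spec has drifted from git.
    update = sorted(name for name in desired if name in live and live[name] != desired[name])
    # Resources in the cluster that git no longer declares; removed only when pruning.
    delete = sorted(name for name in live if name not in desired) if prune else []
    return {"create": create, "update": update, "delete": delete}


# Hypothetical state: one resource drifted, one is new, one was created by hand.
git_state = {"gpu-operator": "v2", "kueue": "v1", "dcgm-exporter": "v1"}
cluster_state = {"gpu-operator": "v1", "kueue": "v1", "debug-pod": "v1"}
plan = plan_sync(git_state, cluster_state)
```

Running the same function again after the plan is applied yields empty lists, which is the property that makes the loop safe to repeat on a timer.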

SRE: reliability as an objective

  • SLIs → SLOs → error budgets. Define service-level indicators per workload class and hold a budget against each:
      • Training platform: job success rate, GPU availability, scheduler queue-wait p95, MFU floor (distributed training).
      • Inference service: availability, TTFT/TPOT against SLO, goodput (inference serving).
  • When the budget is spent, freeze risky change and spend effort on reliability; the budget arbitrates velocity vs stability.
  • Golden signals, GPU-flavoured: latency (TTFT/TPOT), traffic (tokens/s, jobs/h), errors (XID/ECC, OOM, NCCL timeouts), saturation (SM-active, KV-cache, queue depth) (observability).
  • Operational practice: runbooks (the troubleshooting runbook) linked from alerts (telemetry and monitoring); blameless postmortems; automate toil, where node drain/reset/RMA (reliability and RAS) should be a controller, not a human; capacity management against power and lead time (cloud and cost).
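
The error-budget arithmetic behind the first bullets can be sketched as follows. This is an illustration under stated assumptions: the class name `ErrorBudget`, the event-based SLO (good events over total events in one window), and the example numbers are all hypothetical, and real targets have to come from the workloads themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one event-based SLO over one evaluation window."""

    slo_target: float   # e.g. 0.99 means 99% of events must be good
    total_events: int   # e.g. training jobs submitted in the window
    bad_events: int     # e.g. jobs that failed for platform reasons

    @property
    def allowed_bad(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        # Rounding removes float noise such as 20.000000000000018.
        return round((1.0 - self.slo_target) * self.total_events, 9)

    @property
    def consumed_fraction(self) -> float:
        # 1.0 means the budget is exactly spent; above 1.0 it is overspent.
        if self.allowed_bad == 0:
            return float("inf") if self.bad_events else 0.0
        return self.bad_events / self.allowed_bad

    def freeze_risky_change(self) -> bool:
        # Policy from the text: once the budget is spent, stop risky change.
        return self.consumed_fraction >= 1.0


# Hypothetical window: 2000 training jobs, 12 platform-caused failures, 99% target.
budget = ErrorBudget(slo_target=0.99, total_events=2000, bad_events=12)
```

With these numbers the budget allows 20 failed jobs, so 12 failures consume 60% of it and risky change can continue.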

Platform engineering: self-service through code

  • IaC: Terraform/OpenTofu for cloud and cluster infra; Ansible for the metal (Ansible bring-up). No click-ops; state in git, peer-reviewed.
  • GitOps: a repo of manifests for the Kubernetes platform and workload recipes, reconciled by Argo CD or Flux, so the cluster converges to git and drift self-heals:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: gpu-platform, namespace: argocd }
spec:
  project: default
  source: { repoURL: https://git.internal/infra/gpu-platform, path: clusters/dc1, targetRevision: main }
  destination: { server: https://kubernetes.default.svc }
  syncPolicy: { automated: { prune: true, selfHeal: true } }
  • Policy-as-code: Kyverno or OPA/Gatekeeper enforces guardrails so tenants cannot foot-gun the fleet: require GPU limits, block :latest image tags, mandate MIG for shared namespaces, enforce signed images (security and multi-tenancy):
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: require-gpu-limits }
spec:
  rules:
    - name: gpu-request-must-have-limit
      match: { any: [{ resources: { kinds: [Pod] } }] }
      validate:
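        failureAction: Enforce   # spec-level validationFailureAction is deprecated since Kyverno 1.13
        message: "Pods requesting nvidia.com/gpu must set a matching limit."
        # Caution: as written, this pattern is checked against every container of every
        # matched Pod, so CPU-only Pods are rejected too. Scope the rule (namespace
        # selector or a conditional anchor on the request) before enforcing it fleet-wide.
        pattern: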
          spec:
            containers:
              - resources:
                  limits: { "nvidia.com/gpu": "?*" }
  • Golden paths: templated self-service (Backstage software templates, Helm/Kustomize bases) so a user launches a compliant training or serving job without re-deriving the platform.

MLOps: reproducible, continuously-delivered models

  • Versioning & lineage: data (DVC, LakeFS), code (git), models (a registry: MLflow, W&B). Every model traces to data + code + config + checkpoint (storage and data, distributed training).
  • Experiment tracking: W&B / MLflow for metrics, configs, and artifacts, distinct from infra monitoring (observability).
  • Pipeline orchestration: Kubeflow Pipelines, Argo Workflows, Flyte, or Metaflow for data → train → eval → register → deploy, running on the gang scheduler (the Kubernetes platform).
  • CI/CD for models: an eval gate before promotion, where quality/regression/safety evals must pass to advance a model to staging/prod serving (inference serving); progressive rollout (canary/shadow) for new model versions.
  • Reproducibility: pinned container envs (NGC images), deterministic seeds where feasible, deterministic resume (distributed training); the training loop logs the MFU it achieved.
  • Drift & quality: monitor input/output distributions and quality in production; close the loop back to retraining.
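
The eval gate in the CI/CD bullet can be sketched as a pure function that a pipeline step calls before promotion. This is a minimal sketch under assumptions: the function name `eval_gate`, the metric names, and the thresholds are hypothetical, every metric is treated as higher-is-better, and it is not tied to the API of MLflow, W&B, or any orchestrator named above.

```python
from __future__ import annotations

from typing import Mapping


def eval_gate(
    candidate: Mapping[str, float],
    baseline: Mapping[str, float],
    min_scores: Mapping[str, float],
    max_regression: float = 0.01,
) -> list[str]:
    """Return the reasons promotion is blocked; an empty list means promote.

    Assumes every metric is higher-is-better. `baseline` holds the scores of
    the model currently serving; metrics missing from it skip the regression check.
    """
    failures: list[str] = []
    for metric, floor in min_scores.items():
        score = candidate.get(metric)
        if score is None:
            # A gate that silently skips missing evals is not a gate.
            failures.append(f"{metric}: missing from candidate results")
            continue
        if score < floor:
            failures.append(f"{metric}: {score:.3f} is below the floor {floor:.3f}")
        base = baseline.get(metric)
        if base is not None and base - score > max_regression:
            failures.append(f"{metric}: regressed from {base:.3f} to {score:.3f}")
    return failures


# Hypothetical scores for a candidate and the model currently in production.
floors = {"quality": 0.75, "safety": 0.98}
production = {"quality": 0.80, "safety": 0.995}
blocked_by = eval_gate({"quality": 0.82, "safety": 0.99}, production, floors)
```

Returning reasons rather than a bare boolean lets the pipeline attach them to the registry entry, so a blocked promotion is explainable after the fact.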

Cross-cutting

  • FinOps loop: attribute $/GPU-hour and $/token to teams; review utilisation (SM-active, not GPU-util) weekly; right-size reservations vs spot (cloud and cost, performance tuning).
  • Supply chain & security: signed images, SBOMs, provenance; secrets in a vault/KMS; firmware attestation (security and multi-tenancy).
  • DR / continuity: checkpoints are the training DR story (storage and data); back up cluster state (etcd), and treat the GitOps repo as the cluster's source of truth.
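
The attribution step of the FinOps loop can be sketched as below. The assumptions are loud ones: the names `TeamUsage` and `attribute_costs` are hypothetical, spend is split by a single blended $/GPU-hour rate, and the `low_activity_cost` figure treats the (1 − SM-active) share of a team's spend as unused, which is a crude proxy for review conversations rather than an accounting rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TeamUsage:
    team: str
    gpu_hours: float        # allocated GPU-hours in the billing window
    tokens_served: int      # 0 for teams that only train
    mean_sm_active: float   # mean SM-active over allocated time, 0.0 to 1.0


def attribute_costs(usages: list[TeamUsage], total_cost: float) -> dict[str, dict[str, float | None]]:
    """Split `total_cost` across teams in proportion to allocated GPU-hours."""
    total_hours = sum(u.gpu_hours for u in usages)
    if total_hours <= 0:
        raise ValueError("no GPU-hours to attribute cost against")
    rate = total_cost / total_hours  # blended $/GPU-hour
    report: dict[str, dict[str, float | None]] = {}
    for u in usages:
        cost = u.gpu_hours * rate
        report[u.team] = {
            "cost": cost,
            "cost_per_gpu_hour": rate,
            # $/token is only meaningful for teams that serve tokens.
            "cost_per_token": cost / u.tokens_served if u.tokens_served else None,
            # Proxy for spend on allocated-but-inactive SMs (assumption, see lead-in).
            "low_activity_cost": cost * (1.0 - u.mean_sm_active),
        }
    return report


# Hypothetical billing window: 400 GPU-hours costing 800 in total.
usage = [
    TeamUsage("serving", gpu_hours=100, tokens_served=1_000_000, mean_sm_active=0.5),
    TeamUsage("research", gpu_hours=300, tokens_served=0, mean_sm_active=0.25),
]
costs = attribute_costs(usage, total_cost=800.0)
```

Reserved and spot capacity usually carry different rates, so a fuller version would attribute each pool separately instead of blending them.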

Maturity checklist

  • SLOs + error budgets defined per workload class; alerts map to runbooks.
  • All infra in git; nothing applied by hand in prod (Argo CD/Flux reconciling).
  • Policy-as-code guardrails enforced (GPU limits, image provenance, MIG for shared).
  • Model registry + eval gate before any production promotion.
  • Toil automated: node drain/reset/RMA and scaling are controllers, not tickets.
  • FinOps review cadence on real utilisation; capacity tracked against power/lead time.

Don't-miss checklist

  • Make the error budget arbitrate change velocity, not vibes.
  • Reconcile from git; treat manual kubectl apply/helm install in prod as an incident smell.
  • Gate model promotion on evals, never on "it trained".
  • Measure utilisation as SM-active/MFU for both reliability and cost (observability, cloud and cost).

Failure modes

  • Dashboards without SLOs: lots of graphs, no agreement on "healthy".
  • Click-ops drift: prod diverges from git, changes are unauditable and irreproducible.
  • No policy guardrails: a tenant ships a :latest, no-limit, full-GPU pod and starves the fleet.
  • Models promoted without eval gates; regressions reach production.
  • "GPU-util" used for capacity/cost decisions, hiding 70% waste (observability).

Open questions & validation

  • Define the concrete SLO targets and budgets for the actual workloads before wiring alerts.
  • Validate the GitOps + policy stack on a non-prod cluster; rehearse drift self-heal and a policy denial.
  • Stand up the model registry + eval gate and run one model through promotion end-to-end.

References

  • Google SRE books (SLOs, error budgets, toil): https://sre.google/books/
  • Argo CD: https://argo-cd.readthedocs.io/ · Flux: https://fluxcd.io/
  • Kyverno: https://kyverno.io/ · OPA Gatekeeper: https://open-policy-agent.github.io/gatekeeper/
  • MLflow: https://mlflow.org/docs/latest/ · Kubeflow Pipelines: https://www.kubeflow.org/docs/components/pipelines/
  • FinOps Foundation: https://www.finops.org/

Related: LLM eval harness · Experiment tracking & registry · Observability · Reliability · Security · Cloud & Cost · Runbook · Ansible · K8s Platform · Workloads · Glossary