Platform engineering for GPU clusters: the internal developer platform paradigm

Scope: the internal developer platform (IDP) paradigm as an architecture, applied to GPU clusters: intent-based workload abstractions compiled into Kubernetes resources under platform policy, plane separation (experience, control, data, observability), golden paths, and reconciliation as the execution model. This is the paradigm-level companion to the split-plane platform architecture (which covers the physical control/data split) and the platform pages on quota, tenancy, and RBAC. Product-agnostic: Backstage, Crossplane, KubeVela, and OpenChoreo appear only as instances of the paradigm at different altitudes.

What it is

An internal developer platform treats the cluster as a product with two user groups whose interests conflict. Workload authors (ML engineers, researchers) want to say "train this model on 4 H100s from this image" and get a running, observable job. Platform engineers want every job to land inside quota, inside the right tenancy boundary, on the right queue, with cost attribution they can bill against. The paradigm resolves the conflict with three load-bearing ideas:

  1. Intent-based abstractions. Developers declare what (a component of some platform-defined type: a training job, an inference service, a notebook), never how (Deployments, tolerations, device-plugin resource names, NCCL environment). The set of available types and the knobs each type exposes are a typed contract owned by the platform team.
  2. Compilation under policy. A control plane renders intent into real Kubernetes resources, and the rendering path is where policy lives: quota checks, tenancy placement, security defaults, cost labels. Fields the platform owns cannot be set from below; out-of-policy intent is rejected at admission, not discovered in review.
  3. Reconciliation. Rendered state is declared state; controllers converge actual state toward it and reflect runtime reality back up in the developer's vocabulary (job phase, queue position, GPU hours burned), not as raw pod events.

On Kubernetes the paradigm is realised with CRDs plus controllers, which is why the ecosystem implementations compose rather than compete: Backstage is an experience plane (portal and catalog), Crossplane composes control-plane APIs, KubeVela implements the Open Application Model's component/trait split, and OpenChoreo packages all planes (portal, CI, GitOps, observability) into one CNCF Sandbox platform. None of them is GPU-aware out of the box; the GPU value comes from the component types you define on top of the paradigm.

Why use it

  • The cluster is the most expensive shared system most orgs run. Without a platform layer, GPU access degenerates into one of two modes: ticket-ops (a platform team hand-crafts every namespace and quota bump while GPUs sit idle behind a queue of requests) or admin-for-everyone (researchers hold near-cluster-admin, quota is a spreadsheet, and one self-granted toleration lands an experiment on the production inference pool). The paradigm replaces both with governed self-service.
  • Golden paths encapsulate scarce expertise. The decisions that make a training job fast and safe (gang admission, topology placement, NCCL environment, MIG or time-sliced sharing) are made once, by the people who understand them, and every rendered job inherits them. Teams stop rediscovering the same ten footguns.
  • Governance by construction. Quota, tenancy, and cost attribution enforced in the rendering and admission path hold under deadline pressure; the same rules enforced by code review do not. A forged cost label or an over-quota request is a rejected API call, not an incident.

When to use it (and when not)

Use the paradigm when several teams share GPU clusters; when workload shapes recur (SFT runs, eval sweeps, serving deployments, notebooks); when you need chargeback or showback; and when the population of workload authors churns faster than you can teach Kubernetes.

Skip it when one team runs dedicated boxes: Slurm plus a wiki is a better fit than a bespoke abstraction layer with one consumer. Skip it while workload shapes are still unstable; an abstraction frozen before the shapes settle is worse than raw YAML, because it adds a translation layer and an escape hatch. And do not build it without a product owner: a platform nobody owns rots into exactly the glue it was meant to replace.

Architecture

The paradigm separates concerns into planes. The split-plane page covers running control and data planes on different infrastructure; this is the logical view that holds even when everything shares one cluster.

flowchart TB
  subgraph XP["Experience plane"]
    DEV["ML engineer<br/>portal / CLI / git repo"]
    PE["Platform engineer<br/>component types, traits, quotas"]
  end
  subgraph CP["Control plane (intent to resources)"]
    CRD["Intent CRDs<br/>TrainingJob, InferenceService, Notebook"]
    ADM["Admission<br/>quota, policy, platform-owned fields"]
    REN["Renderer + reconciler<br/>component-type registry"]
  end
  subgraph DP["GPU data plane (one or many clusters)"]
    KQ["Kueue / Volcano<br/>gang admission, borrowing"]
    OP["Training / serving operators"]
    GPU["GPU nodes<br/>device plugin or DRA, MIG"]
  end
  subgraph OBS["Observability plane"]
    MET["DCGM + Prometheus"]
    COST["Cost attribution by rendered project label"]
  end
  DEV -->|"declares intent"| CRD
  PE -->|"defines policy"| ADM
  CRD --> ADM --> REN -->|"rendered, suspended Jobs"| KQ
  KQ --> OP --> GPU
  GPU --> MET --> COST
  MET -->|"status in developer terms"| DEV
  • The experience plane is wherever intent enters: a Backstage-style portal, a CLI, or plain YAML in a GitOps repo promoted from dev to staging to prod. It holds no truth; it talks to the control plane.
  • The control plane stores intent (CRDs), enforces policy at admission, renders resources, and reconciles. It is small, stateful, and must be highly available as an API; it must not sit on the data path (see failure modes).
  • The data plane is the GPU estate: queue-managed clusters running the device plugin or DRA, training operators, and serving stacks. Big, elastic, churny.
  • The observability plane closes the loop: DCGM and Prometheus feed status and cost back keyed by the labels the renderer stamped, so attribution survives anything a workload author does.

How it works (validated core mechanism)

The kernel of the paradigm is small enough to state precisely: render(intent) -> resources under policy, where admission rejects out-of-policy intent, platform-owned fields cannot be set from below, rendering is deterministic (reconciliation is meaningless without that), and a reconciler diffs desired against actual state into create/patch/delete actions. The model below is pure Python and was executed as shown, with every assert passing; the production idiom (CRDs plus controller-runtime, or KubeVela CUE templates) keeps exactly this shape.

from dataclasses import dataclass, field

# Platform side. The platform team owns these; developer intent never sets them.
QUOTAS = {"nlp": 8, "vision": 4}  # GPUs per project per environment
TRAITS = {
    "spot-tolerant": {"key": "node.gpu/spot", "operator": "Exists", "effect": "NoSchedule"},
}
PLATFORM_OWNED = {"namespace", "tolerations", "priorityClassName", "hostNetwork", "labels"}


@dataclass(frozen=True)
class Intent:
    project: str
    env: str
    component_type: str
    name: str
    image: str
    gpus: int
    traits: tuple = ()
    overrides: dict = field(default_factory=dict)  # attempts to reach below the abstraction


def render_training_job(intent: Intent) -> dict:
    tolerations = [TRAITS[t] for t in intent.traits]  # unknown trait -> KeyError, by design
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": intent.name,
            "namespace": f"{intent.project}-{intent.env}",
            "labels": {
                "kueue.x-k8s.io/queue-name": f"{intent.project}-queue",
                "cost/project": intent.project,
            },
        },
        "spec": {
            "suspend": True,  # created suspended; the queue, not the author, admits it
            "template": {
                "spec": {
                    "tolerations": tolerations,
                    "containers": [{
                        "name": "trainer",
                        "image": intent.image,
                        "resources": {"limits": {"nvidia.com/gpu": str(intent.gpus)}},
                    }],
                    "restartPolicy": "Never",
                }
            },
        },
    }


COMPONENT_TYPES = {"training-job": render_training_job}


def render(intent: Intent) -> dict:
    if intent.component_type not in COMPONENT_TYPES:
        raise ValueError(f"unknown component type: {intent.component_type}")
    if intent.gpus > QUOTAS[intent.project]:
        raise ValueError(f"over quota: {intent.gpus} > {QUOTAS[intent.project]}")
    illegal = PLATFORM_OWNED & intent.overrides.keys()
    if illegal:
        raise ValueError(f"platform-owned fields: {sorted(illegal)}")
    return COMPONENT_TYPES[intent.component_type](intent)


def reconcile(desired: dict, actual: dict) -> list:
    actions = [("create", k) for k in desired if k not in actual]
    actions += [("patch", k) for k in desired if k in actual and actual[k] != desired[k]]
    actions += [("delete", k) for k in actual if k not in desired]
    return sorted(actions)


def must_reject(intent: Intent) -> None:
    try:
        render(intent)
    except (ValueError, KeyError):
        return
    raise AssertionError(f"accepted out-of-policy intent: {intent.name}")


ok = Intent("nlp", "prod", "training-job", "sft-run", "ghcr.io/acme/sft:v3", gpus=4,
            traits=("spot-tolerant",))
job = render(ok)

# happy path: intent lands in the project-per-env cell, on the project's queue,
# suspended, with the GPU request and the trait-granted toleration
assert job["metadata"]["namespace"] == "nlp-prod"
assert job["metadata"]["labels"]["kueue.x-k8s.io/queue-name"] == "nlp-queue"
assert job["spec"]["suspend"] is True
tpl = job["spec"]["template"]["spec"]
assert tpl["containers"][0]["resources"]["limits"]["nvidia.com/gpu"] == "4"
assert tpl["tolerations"] == [TRAITS["spot-tolerant"]]

# determinism: same intent, same render; reconciliation is meaningless without it
assert render(ok) == render(ok)

# adversarial: over quota (16 > 8), unknown component type, unknown trait
must_reject(Intent("nlp", "prod", "training-job", "big", "img", gpus=16))
must_reject(Intent("nlp", "prod", "warehouse", "x", "img", gpus=1))
must_reject(Intent("nlp", "prod", "training-job", "x", "img", 1, traits=("root-on-host",)))

# adversarial: reaching below the abstraction; a self-granted toleration and a
# forged cost label must both be rejected, or governance and billing are fiction
must_reject(Intent("nlp", "prod", "training-job", "x", "img", 1,
                   overrides={"tolerations": [{"operator": "Exists"}]}))
must_reject(Intent("nlp", "prod", "training-job", "x", "img", 1,
                   overrides={"labels": {"cost/project": "vision"}}))

# reconcile: one missing, one drifted (suspend flipped by hand), one orphan
drifted = render(ok)
drifted["spec"]["suspend"] = False
desired = {"nlp-prod/sft-run": render(ok), "nlp-prod/eval": render(
    Intent("nlp", "prod", "training-job", "eval", "ghcr.io/acme/eval:v1", gpus=1))}
actual = {"nlp-prod/sft-run": drifted, "nlp-prod/stale": {"kind": "Job"}}
assert reconcile(desired, actual) == [
    ("create", "nlp-prod/eval"), ("delete", "nlp-prod/stale"), ("patch", "nlp-prod/sft-run")]
assert reconcile(desired, desired) == []  # converged: no actions

print("all assertions passed")

What the adversarial asserts pin down, because each one is a real platform property:

  • Over-quota, unknown type, unknown trait are rejected at admission. The developer finds out at kubectl apply time, not three days into a queue.
  • Platform-owned fields cannot be set from below. A self-granted toleration is a tenancy escape; a forged cost/project label silently bills another team. Both are refused, which is the entire difference between "policy" and "convention".
  • Rendering is deterministic and jobs are born suspend: true. The queue admits work when quota allows (Kueue's model); the author never self-admits. Determinism is what lets the reconciler treat any diff as drift.
  • Reconcile converges and detects drift and orphans. A hand-edited job gets patched back; a resource with no backing intent gets deleted. On a converged system the action list is empty.

How to use it

The workload author's whole interface is one intent document per component, kept in git and promoted across environments by GitOps. The shape (illustrative, not a standard):

apiVersion: platform.example.com/v1
kind: Component
metadata:
  name: sft-run
  project: nlp
spec:
  type: training-job          # from the platform's component-type registry
  image: ghcr.io/acme/sft:v3
  gpus: 4
  traits:
    - spot-tolerant           # only traits the platform grants to this project
  env: prod

Everything else (namespace, queue, tolerations, priority, network policy, cost labels) is the renderer's business. Status flows back onto the same object: phase, queue position, GPU hours consumed. When authors need something the type does not expose, that is a feature request against the component type, not a reason to hand out raw namespace access; if you do allow a raw-YAML escape hatch, audit it (see maintenance).
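
The upward half of that loop (status in the author's vocabulary) can be sketched as below. This is a minimal illustration, not a controller: `developer_status` and the phase names are hypothetical, the input is a plain dict shaped like a batch/v1 Job (`spec.suspend` plus the `active`, `succeeded`, and `failed` status counters), and the queue position is assumed to be supplied by whatever queue system is in use. A production controller would more likely read the Job's conditions than its counters.

```python
from typing import Optional


def developer_status(job: dict, queue_position: Optional[int] = None) -> dict:
    """Translate raw Job fields into the phase a workload author sees."""
    spec = job.get("spec", {})
    status = job.get("status", {})
    if status.get("succeeded", 0) > 0:
        phase = "Succeeded"
    elif status.get("failed", 0) > 0:
        phase = "Failed"      # simplification: ignores retries and backoffLimit
    elif spec.get("suspend", False):
        phase = "Queued"      # still suspended: the queue has not admitted it
    elif status.get("active", 0) > 0:
        phase = "Running"
    else:
        phase = "Admitted"    # unsuspended, but no pods active yet
    result = {"phase": phase}
    if phase == "Queued" and queue_position is not None:
        result["queuePosition"] = queue_position
    return result


print(developer_status({"spec": {"suspend": True}}, queue_position=3))
print(developer_status({"spec": {"suspend": False}, "status": {"active": 4}}))
```

The point of the mapping is that the author never has to interpret pod events: a suspended Job reads as "Queued" with a position, not as an object that silently has no pods.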

How to develop with it

Platform engineers extend the system by writing component types and traits, not by editing tenant YAML:

  • training-job must render to gang-aware primitives: a Kueue-suspended Job or a Volcano PodGroup with minMember equal to world size. Rendering a multi-worker training run to Deployment-style independent replicas is the paradigm's classic GPU mistake (see failure modes).
  • inference-service renders to a Deployment plus autoscaling on GPU or request metrics, rollout strategy included, so serving teams never hand-write an HPA.
  • notebook renders with an idle-reaper sidecar or TTL; interactive GPUs are where quota quietly dies.
  • Traits are the sanctioned extension points: spot tolerance, a MIG profile, a topology constraint. A trait is platform-defined and project-granted; the code above shows why an ungranted trait must be a hard error.
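
The "platform-defined and project-granted" rule for traits can be sketched as a two-step check. Everything here is an assumption made for illustration: the trait names, the `TRAIT_DEFINITIONS` and `TRAIT_GRANTS` tables, and the `check_traits` helper are hypothetical, and a real platform would keep grants in its own API objects rather than in module-level dicts.

```python
# Hypothetical registry: traits the platform team has defined at all.
TRAIT_DEFINITIONS = {"spot-tolerant", "mig-profile", "same-rack"}

# Hypothetical grant table: which defined traits each project may request.
TRAIT_GRANTS = {
    "nlp": {"spot-tolerant", "same-rack"},
    "vision": {"spot-tolerant"},
}


def check_traits(project: str, requested: tuple) -> None:
    """Raise unless every requested trait is both defined and granted."""
    unknown = set(requested) - TRAIT_DEFINITIONS
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    ungranted = set(requested) - TRAIT_GRANTS.get(project, set())
    if ungranted:
        raise PermissionError(
            f"traits not granted to {project}: {sorted(ungranted)}")


check_traits("nlp", ("spot-tolerant", "same-rack"))  # defined and granted

try:
    check_traits("vision", ("same-rack",))            # defined, not granted
except PermissionError as err:
    print("rejected:", err)
```

Keeping "unknown" and "ungranted" as separate errors lets the platform tell an author whether to fix a typo or to request a grant.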

Test the renderer the way you test a compiler: golden-file tests (known intent, byte-stable expected manifests) plus adversarial cases like the ones above. Every policy you claim to enforce should have a must_reject test.
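
A golden-file test might look like the sketch below. The names `canonical`, `render_stub`, and `check_golden` are hypothetical, the stub stands in for a real component-type renderer, and the golden text is held in memory here only to keep the example self-contained; in a real suite it would be read from a file checked into git.

```python
import json


def canonical(manifest: dict) -> str:
    """Serialize with sorted keys so equal manifests give identical text."""
    return json.dumps(manifest, sort_keys=True, indent=2) + "\n"


def render_stub(name: str, gpus: int) -> dict:
    # Hypothetical stand-in for a real component-type renderer.
    return {
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": True,
            "template": {"spec": {"containers": [{
                "name": "trainer",
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }]}},
        },
    }


def check_golden(rendered: dict, golden_text: str) -> None:
    """Fail loudly when output changes for unchanged intent."""
    if canonical(rendered) != golden_text:
        raise AssertionError("rendered manifest differs from golden file")


golden = canonical(render_stub("sft-run", 4))    # stands in for the stored file
check_golden(render_stub("sft-run", 4), golden)  # unchanged intent: passes

try:
    check_golden(render_stub("sft-run", 8), golden)  # output changed
except AssertionError as err:
    print("caught:", err)
```

Canonical serialization matters because the comparison is textual: without sorted keys, a harmless change in dict construction order would look like drift.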

How to run it in production

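  • Quota: one Kueue ClusterQueue per project, grouped into a cohort so unused quota can be borrowed between them. The queue label the renderer stamps names a LocalQueue in the project namespace, which points at the project's ClusterQueue; that label is the binding between the abstraction and the enforcement.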
  • Tenancy: namespace per project per environment, RBAC that grants authors access to intent objects only, NetworkPolicy default-deny between projects, and hardware-level isolation (for example MIG partitions or separate node pools) for the boundaries the namespace model cannot enforce.
  • Cost: attribution keys are rendered, never author-supplied; meter GPU hours from DCGM joined on those labels.
  • Platform SLOs: measure the platform as a product. Track time from intent applied to workload admitted, reconcile latency, portal/API availability, and the fraction of pending time explained to the author (queue position surfaced honestly). The training-platform SLO page covers the workload-side objectives.
  • Blast radius: the control plane must be off the data path. Test it: stop the control plane and confirm running training jobs keep training and serving keeps serving; only new intent stalls. The split-plane architecture makes this boundary physical.
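
The cost bullet above reduces to a join on rendered labels followed by a sum. The sketch below assumes a simplified sample shape, a `(labels, gpus_in_use)` pair taken at a fixed scrape interval; that shape and the `gpu_hours_by_project` helper are hypothetical and are not the DCGM exporter's actual output format.

```python
from collections import defaultdict


def gpu_hours_by_project(samples: list, interval_s: float) -> dict:
    """Sum GPU hours per project, keyed by the renderer-stamped cost label.

    Each sample is (labels, gpus_in_use) observed for one scrape interval.
    """
    totals = defaultdict(float)
    for labels, gpus_in_use in samples:
        project = labels["cost/project"]  # rendered label, never author-set
        totals[project] += gpus_in_use * interval_s / 3600.0
    return dict(totals)


# Made-up samples at a 30-minute interval, purely for illustration.
samples = [
    ({"cost/project": "nlp"}, 4),
    ({"cost/project": "nlp"}, 4),
    ({"cost/project": "vision"}, 1),
]
print(gpu_hours_by_project(samples, interval_s=1800))
```

A sample without the `cost/project` label raises `KeyError` here on purpose: an unattributed GPU is a platform bug to surface, not usage to drop silently.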

How to maintain it

  • Version component types like APIs. Schema changes get deprecation windows and migration tooling; a renderer change that alters output for unchanged intent is a breaking change even if every field is "internal".
  • Track golden-path coverage. Fraction of GPU hours running through component types versus escape hatches is the platform's product-health metric. Rising escape-hatch usage is the roadmap, written by your users.
  • Upgrade the control plane first, then roll the data planes; rendered resources must stay valid across one version of skew.
  • Re-render on policy change. Quota and trait changes must reconcile into existing workloads predictably (usually: applied at next admission, never by killing running jobs silently).
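
The golden-path coverage metric from the list above is a single ratio. A minimal sketch, assuming each workload has already been classified as golden-path or escape-hatch and has a GPU-hour total (the `golden_path_coverage` name and the input shape are hypothetical):

```python
def golden_path_coverage(usage: list) -> float:
    """Fraction of GPU hours that ran through platform component types.

    usage is a list of (ran_via_component_type, gpu_hours) pairs.
    """
    total = sum(hours for _, hours in usage)
    if total <= 0:
        raise ValueError("no GPU hours recorded in this window")
    golden = sum(hours for via_type, hours in usage if via_type)
    return golden / total


# Made-up numbers: 75 GPU hours via component types, 25 via an escape hatch.
window = [(True, 60.0), (True, 15.0), (False, 25.0)]
print(golden_path_coverage(window))
```

Weighting by GPU hours rather than by job count keeps one large escape-hatch training run from hiding behind many small golden-path notebooks.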

Failure modes

  • Deployment semantics applied to training. The inherited app-IDP assumption that replicas are independent deadlocks GPU queues: 7 of 8 workers scheduled, the 8th pending, 7 GPUs burning while the job waits forever. Training types must render gang admission (see the pending-job runbook).
  • Leaky abstraction. Topology, NCCL tuning, and interconnect details bleed through the moment a job is slow, and authors need the platform to surface them (placement, fabric counters) rather than pretend they do not exist. An abstraction that hides diagnosis loses its users.
  • Admission-time quota versus scheduling-time reality. The renderer says yes, the queue says later, the author sees nothing. If you cannot show queue position and the reason (quota, fragmentation, capacity), the platform is lying by omission.
  • Control plane on the data path. If workloads stop when the control plane is down, you built a single point of failure, not a platform. Running jobs must survive; only new intent may stall.
  • Golden path rot. Component types that lag the ML stack (a new parallelism scheme, a new serving engine) push teams to fork templates, and the platform quietly becomes legacy. Coverage metrics catch this early.
  • Forgeable attribution. Any label or annotation an author can set is not a billing key. Attribution must come from the renderer, as validated above.
  • Platform without a product owner. The ticket queue returns wearing a portal. Staff it like a product or do not build it.
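
The first failure mode is easy to see in a toy model. The sketch below is not a scheduler: it assumes one GPU per worker and a single free-GPU count, and both function names are hypothetical. It only contrasts what each admission style does when 8 workers are requested and 7 GPUs are free.

```python
def schedule_independent(workers_needed: int, free_gpus: int) -> dict:
    """Deployment-style: place whatever fits, one worker at a time."""
    placed = min(workers_needed, free_gpus)
    pending = workers_needed - placed
    # Placed workers hold GPUs but cannot make progress until all are up.
    idle = placed if pending > 0 else 0
    return {"placed": placed, "pending": pending, "gpus_held_idle": idle}


def schedule_gang(workers_needed: int, free_gpus: int) -> dict:
    """Gang admission: place all workers or none."""
    placed = workers_needed if free_gpus >= workers_needed else 0
    return {"placed": placed, "pending": workers_needed - placed,
            "gpus_held_idle": 0}


print(schedule_independent(8, 7))  # 7 placed, 1 pending, 7 GPUs held idle
print(schedule_gang(8, 7))         # nothing placed, nothing held
print(schedule_gang(8, 8))         # all 8 placed together
```

Under gang admission the 7 free GPUs stay available to other jobs that fit, which is why the training component type has to render to a gang-aware primitive rather than independent replicas.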

References

  • CNCF Platforms whitepaper (the paradigm's definition): https://tag-app-delivery.cncf.io/whitepapers/platforms/
  • CNCF Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/
  • Open Application Model (component/trait split): https://oam.dev/
  • KubeVela (OAM on Kubernetes): https://kubevela.io/docs/
  • Crossplane (control-plane composition): https://docs.crossplane.io/
  • Backstage (experience plane): https://backstage.io/docs/overview/what-is-backstage/
  • OpenChoreo, an example of a complete Kubernetes IDP with plane separation: https://github.com/openchoreo/openchoreo
  • Kueue (quota and gang admission): https://kueue.sigs.k8s.io/docs/
  • Kubernetes custom resources (the CRD substrate): https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Related: GPU Platform Split-Plane Architecture · Kubernetes for GPUs · Kubernetes · Operator for GPU Allocation · Kueue Quota · Kueue ClusterQueue Manifest · Gang-Scheduled Training Recipe · Security & Multi-tenancy · RBAC for GPU Operators · Distributed Training as a Platform Service · SRE, Platform & MLOps Practices · Training Platform SLOs · Glossary