Kubernetes operator for GPU allocation & orchestration

Scope: the custom Kubernetes operator pattern that turns declarative GPU-workload intents ("lease 8×H100 for this tenant", "run this served deployment", "schedule this training job") into reconciled, self-healing pods on GPU nodes. Covers the CRD set, the reconcile loop, leader-election HA, Server-Side-Apply multi-writer status, finalizers for releasing GPUs, and the operational traps (LIST memory, level- vs edge-triggering). This is the allocation control plane that sits above the NVIDIA GPU Operator and the scheduler, not a replacement for either.

What it is

An operator is a Kubernetes controller plus one or more Custom Resource Definitions (CRDs) that together encode operational knowledge as code: you declare what you want (a custom resource), and a control loop continuously drives the cluster toward it. ([Kubernetes operator pattern]) For a GPU platform, the CRDs model the domain objects raw Kubernetes lacks:

GpuLease / GpuRental: allocate N GPUs of a given model to a tenant for a window, with quota and lifecycle.
TrainingDeployment / UserDeployment: a long-running or elastic workload bound to leased GPUs (see distributed training as a platform service).
BatchJob: a run-to-completion GPU job.
NodeProfile: discovered per-node capability (GPU model, count, fabric) the controller schedules against.
Queue: admission/ordering when demand exceeds capacity.

Each CRD kind gets its own controller with a reconcile loop: observe the CR's desired spec, observe actual cluster state (pods, nodes, leases), compute the difference, take the minimal action to converge, and write observed state back to .status. Reconciliation is level-triggered and idempotent: it acts on current state, not on a stream of events, so a missed event, a restart, or a duplicate trigger all converge to the same result. Controllers share one runtime/manager (controller-runtime in Go, kube-rs Controller in Rust). ([controller-runtime; kube-rs])

```mermaid
flowchart LR
  USER["Tenant applies CR:<br/>GpuLease / TrainingDeployment"] --> API["kube-apiserver (etcd)"]
  API --> WATCH["Operator: watch + reconcile loop"]
  WATCH --> DIFF{"desired spec<br/>vs actual state"}
  DIFF -->|"converge"| ACT["Create/patch Pods,<br/>bind GPU nodes, set quota"]
  ACT --> NODES["GPU nodes"]
  NODES -->|"observed"| STATUS["Patch .status via SSA<br/>(field-manager owned)"]
  STATUS --> API
  LE["Leader election (Lease)"] -.->|"only the leader reconcile"| WATCH
```
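
The observe, diff, converge step can be sketched in plain Rust. This is a minimal model, not the operator's code: pod names stand in for real objects, and `Step`, `plan`, and `apply` are hypothetical names introduced here. The point it illustrates is that the plan depends only on two snapshots, so running it twice, or after a missed event, yields the same end state.

```rust
use std::collections::BTreeSet;

/// One corrective action the controller would take (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    CreatePod(String),
    DeletePod(String),
}

/// Compute the minimal steps that move `actual` to `desired`.
/// Depends only on the two snapshots, never on which event woke the loop.
pub fn plan(desired: &BTreeSet<String>, actual: &BTreeSet<String>) -> Vec<Step> {
    let create = desired.difference(actual).cloned().map(Step::CreatePod);
    let delete = actual.difference(desired).cloned().map(Step::DeletePod);
    create.chain(delete).collect()
}

/// Apply the steps to a snapshot; stands in for real API calls.
pub fn apply(actual: &mut BTreeSet<String>, steps: &[Step]) {
    for step in steps {
        match step {
            Step::CreatePod(name) => {
                actual.insert(name.clone());
            }
            Step::DeletePod(name) => {
                actual.remove(name);
            }
        }
    }
}
```

Once the steps are applied, calling `plan` again returns an empty list, which is the idempotency property the loop relies on.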

Why it matters

Raw kubectl and a bare Job do not express "a tenant's 24-hour lease on 8×H100 with a quota, an audit trail, and automatic release on expiry." Encoding that as a CRD + controller gives a GPU platform four things it otherwise hand-rolls in scripts:

Declarative, self-healing GPU-as-a-service. A node dies, a pod is evicted, the operator restarts, and the loop re-converges to the declared lease without human action. This is the level-triggered guarantee.
Lease/lifecycle semantics raw Kubernetes lacks: allocate, extend, expire, release, with quota enforced at admission.
An auditable control plane. Every allocation is a versioned API object with status and events, not a side effect of an imperative script.
A clean seam. The operator encodes policy (who gets which GPUs, when); it delegates driver installation to the NVIDIA GPU Operator and low-level scheduling/quota to DRA, Kueue, or Volcano. It is the layer that turns those primitives into a product.

When to build one (and when not)

Build a custom operator when:

You are running a multi-tenant GPU platform / internal GPU cloud / neocloud where users request capacity through an API and you must enforce leases, quotas, and lifecycle.
The workload lifecycle is richer than run-to-completion (elastic resize, lease extension, warm pools, tenant namespaces) so a Job plus a cron is not enough.
You need a single source of truth for "who holds which GPUs" that survives restarts and is reconciled, not scripted.

Do not build one when:

A Job + Kueue (quota/queueing) + Volcano (gang scheduling) already covers your needs. Most batch-training clusters do not need a bespoke operator; see the orchestration decision guide. Adopting CRDs you do not need is pure operational tax.
You only need driver/plugin lifecycle, which is exactly what the NVIDIA GPU Operator already does; do not reinvent it.
The cluster is single-tenant and static: declare pods directly; the reconcile machinery buys nothing.

How to build and operate it

Reference shapes below, unexecuted. Confirm CRD schema conventions, RBAC, and SSA/leader-election APIs against current Kubernetes, controller-runtime, and kube-rs docs for your versions before relying on them.

1. Model the domain as CRDs

Define a CRD per domain object with an OpenAPI v3 schema (validation at admission), a spec (desired) and status (observed), and useful additionalPrinterColumns. Keep one concern per CRD.

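```yaml
# Reference shape of a GPU-lease CRD (illustrative).
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata: { name: gpuleases.platform.example.com }
spec:
  group: platform.example.com
  scope: Namespaced
  # Short name kept distinct from "lease", which kubectl resolves to the built-in coordination.k8s.io Lease.
  names: { kind: GpuLease, plural: gpuleases, shortNames: [gpul] }
  versions:
    - name: v1
      served: true
      storage: true
      subresources: { status: {} }   # lets the controller patch .status separately from .spec
      additionalPrinterColumns:
        - { name: Model, type: string, jsonPath: .spec.gpuModel }
        - { name: Count, type: integer, jsonPath: .spec.count }
        - { name: Phase, type: string, jsonPath: .status.phase }
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [gpuModel, count, tenant]
              properties:
                gpuModel:  { type: string }     # e.g. "H100-SXM-80GB"
                count:     { type: integer, minimum: 1 }
                tenant:    { type: string }
                expiresAt: { type: string, format: date-time }
            status:
              type: object
              properties:
                phase: { type: string, enum: [Pending, Bound, Expired, Released] }
                boundNodes: { type: array, items: { type: string } }
```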
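
On the controller side, the spec is usually mirrored as a typed struct. The sketch below uses only the standard library and is an assumption about shape, not a prescribed layout; a real kube-rs operator would typically derive the type with the `kube` and `serde` crates instead. `GpuLeaseSpec` and `validate` are hypothetical names. The check repeats the schema's constraints defensively, treating a required field as non-empty, which is stricter than the OpenAPI `required` keyword.

```rust
/// Plain-Rust mirror of `GpuLease.spec` (std only, illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct GpuLeaseSpec {
    pub gpu_model: String,
    pub count: i64,
    pub tenant: String,
    /// RFC 3339 timestamp; optional in the schema.
    pub expires_at: Option<String>,
}

impl GpuLeaseSpec {
    /// Re-check the schema constraints inside the controller.
    /// Assumption: "required" is modelled as "non-empty" here.
    pub fn validate(&self) -> Result<(), String> {
        if self.gpu_model.is_empty() {
            return Err("spec.gpuModel is required".to_string());
        }
        if self.tenant.is_empty() {
            return Err("spec.tenant is required".to_string());
        }
        if self.count < 1 {
            return Err(format!("spec.count must be >= 1, got {}", self.count));
        }
        Ok(())
    }
}
```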

2. Write a level-triggered reconcile loop

The loop must be safe to call at any time with only the current state. Return an outcome that either re-queues (with backoff) or settles.

```rust
// kube-rs reconcile sketch (illustrative). Idempotent: acts on observed state only.
// GpuLease, Ctx, Error, Phase, and the helper functions are the operator's own types (not shown).
use std::{sync::Arc, time::Duration};

use kube::runtime::controller::Action;

async fn reconcile(lease: Arc<GpuLease>, ctx: Arc<Ctx>) -> Result<Action, Error> {
    if expired(&lease) {
        release_gpus(&lease, &ctx).await?;                 // idempotent: no-op if already released
        patch_status(&lease, Phase::Expired, &ctx).await?; // SSA, field-manager owned
        return Ok(Action::await_change());                 // settle until the object changes
    }
    let nodes = bind_or_reuse_nodes(&lease, &ctx).await?;  // place if unplaced; reuse if already bound
    ensure_pods(&lease, &nodes, &ctx).await?;              // create missing; leave existing
    patch_status(&lease, Phase::Bound, &ctx).await?;
    Ok(Action::requeue(Duration::from_secs(30)))           // periodic re-check
}
```
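
One refinement worth considering: with a fixed 30-second re-check, an expiry can be noticed up to 30 seconds late. A possible variation, sketched with the standard library only, requeues at whichever comes first, the periodic re-check or the expiry instant. `Timing`, `timing`, and `RECHECK` are hypothetical names; the 30-second interval is taken from the sketch above.

```rust
use std::time::{Duration, SystemTime};

/// Periodic re-check interval (same value as the reconcile sketch).
pub const RECHECK: Duration = Duration::from_secs(30);

/// What the reconcile should do about timing.
#[derive(Debug, PartialEq, Eq)]
pub enum Timing {
    /// The lease has already expired: release now.
    ExpiredNow,
    /// Come back after this delay.
    RequeueAfter(Duration),
}

/// Pick the next wake-up: never later than the expiry, never later than RECHECK.
pub fn timing(now: SystemTime, expires_at: Option<SystemTime>) -> Timing {
    match expires_at {
        // No expiry set: fall back to the periodic re-check.
        None => Timing::RequeueAfter(RECHECK),
        Some(deadline) => match deadline.duration_since(now) {
            Ok(left) if !left.is_zero() => Timing::RequeueAfter(left.min(RECHECK)),
            // Deadline is now or in the past.
            _ => Timing::ExpiredNow,
        },
    }
}
```

In a kube-rs controller the `RequeueAfter` value would feed `Action::requeue`, and `ExpiredNow` would take the release branch.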

3. Run HA with leader election

Run 2+ replicas for availability, but let only the leader reconcile: concurrent reconcilers race on the same objects and can issue duplicate or conflicting actions. Use a coordination.k8s.io Lease for election; a standby replica takes over once the leader stops renewing and the lease expires (typically tens of seconds). Export a per-pod is_leader gauge (scraped directly, not via a Service that collapses replicas) so you can alert when sum(is_leader) < 1, meaning reconciliation has silently halted. ([Kubernetes leader election])
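
The acquire, renew, and take-over rule can be modelled in a few lines. This is a toy, in-memory model under stated assumptions: `LeaseRecord`, `Role`, and `tick` are hypothetical names, the 15-second duration in the usage is an example value, and the real mechanism also relies on the API server's optimistic concurrency when two replicas try to write the Lease at once, which is not modelled here.

```rust
use std::time::{Duration, Instant};

/// Stand-in for the Lease fields that matter for election.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LeaseRecord {
    pub holder: Option<String>,
    pub renewed_at: Instant,
    pub lease_duration: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    Leader,
    Standby,
}

/// Decide this replica's role at `now`.
/// Acquiring or renewing the lease updates the record in place.
pub fn tick(lease: &mut LeaseRecord, me: &str, now: Instant) -> Role {
    let expired = now.duration_since(lease.renewed_at) >= lease.lease_duration;
    let mine = lease.holder.as_deref() == Some(me);
    if mine || expired || lease.holder.is_none() {
        lease.holder = Some(me.to_string());
        lease.renewed_at = now;
        Role::Leader
    } else {
        // Someone else holds a live lease: do not reconcile.
        Role::Standby
    }
}
```

A replica would call `tick` on a timer and run its controllers only while the result is `Role::Leader`; the same value is a natural source for the is_leader gauge.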

4. Use Server-Side Apply with field managers for multi-writer status

When more than one actor writes the same object's .status (e.g. an allocation controller sets .status.phase while a separate health probe sets .status.reachability), naive whole-object updates either fail on version conflicts or overwrite each other's fields. Server-Side Apply with a distinct field manager per writer gives each writer ownership of only the fields it applies, so writes merge instead of overwriting. This is the discipline that makes a multi-controller operator safe. ([Kubernetes Server-Side Apply])
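
The ownership idea can be illustrated with a toy model. This is not the API server's algorithm: it tracks a single owner per flat field name, does not model shared ownership, forced applies, or the removal of fields a manager stops applying. `Status`, `apply`, and the manager names in the usage are hypothetical.

```rust
use std::collections::BTreeMap;

/// Toy model of one object's status: field name -> (owning manager, value).
#[derive(Debug, Default)]
pub struct Status {
    fields: BTreeMap<String, (String, String)>,
}

impl Status {
    /// Apply `patch` on behalf of `manager`.
    /// A field owned by another manager with a different value is a conflict;
    /// on any conflict nothing is written and the conflicting names are returned.
    pub fn apply(&mut self, manager: &str, patch: &[(&str, &str)]) -> Result<(), Vec<String>> {
        let mut conflicts = Vec::new();
        for &(path, value) in patch {
            if let Some((owner, current)) = self.fields.get(path) {
                if owner.as_str() != manager && current.as_str() != value {
                    conflicts.push(path.to_string());
                }
            }
        }
        if !conflicts.is_empty() {
            return Err(conflicts);
        }
        for &(path, value) in patch {
            // Simplification: the last applier is recorded as the sole owner.
            self.fields
                .insert(path.to_string(), (manager.to_string(), value.to_string()));
        }
        Ok(())
    }

    /// Read a field's current value.
    pub fn get(&self, path: &str) -> Option<&str> {
        self.fields.get(path).map(|(_, value)| value.as_str())
    }
}
```

Two managers writing disjoint fields merge cleanly, while a manager that tries to change a field it does not own gets a conflict instead of silently overwriting it.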

5. Release GPUs with finalizers

A lease deleted while pods still hold GPUs leaks capacity. Add a finalizer so deletion blocks until the controller tears down pods and frees the allocation, then removes the finalizer. Without it, kubectl delete returns instantly and the GPUs stay pinned.
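
The finalizer flow reduces to a small decision table, sketched here with the standard library only. The finalizer string, `FinalizerStep`, and `next_step` are hypothetical names; `deleting` stands for "metadata.deletionTimestamp is set".

```rust
/// Hypothetical finalizer name for this operator.
pub const FINALIZER: &str = "platform.example.com/gpu-release";

#[derive(Debug, PartialEq, Eq)]
pub enum FinalizerStep {
    /// Live object without our finalizer: add it before allocating anything.
    AddFinalizer,
    /// Live object with our finalizer: normal reconcile.
    Reconcile,
    /// Deletion requested and pods still hold GPUs: tear them down.
    TearDown,
    /// Teardown complete: remove the finalizer so deletion can finish.
    RemoveFinalizer,
    /// Deletion requested and our finalizer is already gone: nothing to do.
    Nothing,
}

/// Decide the next step from observed state only.
pub fn next_step(deleting: bool, finalizers: &[&str], pods_holding_gpus: usize) -> FinalizerStep {
    let ours = finalizers.contains(&FINALIZER);
    match (deleting, ours) {
        (false, false) => FinalizerStep::AddFinalizer,
        (false, true) => FinalizerStep::Reconcile,
        (true, true) if pods_holding_gpus > 0 => FinalizerStep::TearDown,
        (true, true) => FinalizerStep::RemoveFinalizer,
        (true, false) => FinalizerStep::Nothing,
    }
}
```

Because the decision reads only current state, a controller restart mid-teardown resumes at `TearDown` rather than leaking the allocation.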

6. Bound memory and watch cost

A controller that LISTs thousands of pods/nodes can balloon RSS. Cache with shared informers, page large LISTs, and scope watches with label selectors so the operator does not cache the whole cluster. For Rust operators, a non-default allocator such as jemalloc can also curb RSS creep during large LIST/watch cycles; measure before and after switching.
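
Paging keeps peak memory proportional to the page size rather than the cluster size. A minimal sketch of the loop, assuming a hypothetical `fetch(limit, token)` callback that stands in for a paged LIST call returning a continue token; `Page` and `for_each_page` are names introduced here, not a client library API.

```rust
/// One page of a LIST response: items plus an opaque continue token.
pub struct Page<T> {
    pub items: Vec<T>,
    pub next: Option<String>,
}

/// Walk every page, handing each to `visit`, holding at most one page at a time.
/// Returns the number of pages fetched.
pub fn for_each_page<T>(
    limit: usize,
    mut fetch: impl FnMut(usize, Option<&str>) -> Page<T>,
    mut visit: impl FnMut(&[T]),
) -> usize {
    let mut token: Option<String> = None;
    let mut pages = 0;
    loop {
        let page = fetch(limit, token.as_deref());
        visit(&page.items);
        pages += 1;
        match page.next {
            // More data: carry the token into the next request.
            Some(next) => token = Some(next),
            // No token: this was the last page.
            None => return pages,
        }
    }
}
```

The `visit` callback should fold each page into whatever compact summary the controller needs (counts per node, free GPUs) instead of accumulating the raw objects.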

7. Wire RBAC and observability

Grant least-privilege RBAC for exactly the CRD kinds and core resources the controller touches. Emit reconcile duration, error rate, and queue depth per controller, plus the leader gauge, to your monitoring stack (see observability & monitoring); record per-tenant deployment availability as an SLI (see the SLO/SLI catalog).

Failure modes

No leader. Lease lost or all replicas down, so reconciliation halts silently and CRs drift from reality. Alert on sum(is_leader) < 1.
Edge-triggered assumptions. Logic that depends on seeing every event breaks on a restart or a missed watch event. Reconcile from observed state only; make every action idempotent.
Status clobbering. Multiple writers without SSA field managers overwrite each other's .status. Use SSA with distinct field managers.
Leaked allocations. With no finalizer, deleting a lease leaves GPUs pinned. Always finalize teardown.
LIST memory blowup. An unbounded informer cache can OOM the operator on large clusters. Scope watches, page LISTs, and consider a different allocator.
Reinventing the scheduler. Encoding gang scheduling or quota in your operator instead of delegating to Volcano/Kueue. Keep the operator to policy; delegate placement.

Open questions & validation

The boundary between your operator and DRA/Kueue/Volcano: what each owns, and where lease policy ends and scheduling begins.
Leader-failover time vs. your reconcile-latency SLO; measure the gap during which no reconciliation happens.
Idempotency under chaos: kill the operator mid-reconcile and confirm it re-converges with no duplicate pods or leaked leases.
SSA field-ownership conflicts when controllers and humans both edit objects.
Operator memory under a worst-case LIST (largest tenant, most nodes).

References

Kubernetes — Operator pattern: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
Kubernetes — Custom Resources / CRDs: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
Kubernetes — Server-Side Apply (field management): https://kubernetes.io/docs/reference/using-api/server-side-apply/
Kubernetes — Leader election (coordination.k8s.io Lease): https://kubernetes.io/docs/concepts/architecture/leases/
controller-runtime (Go operator framework): https://github.com/kubernetes-sigs/controller-runtime
kube-rs (Rust Kubernetes client + Controller runtime): https://kube.rs/
NVIDIA GPU Operator (driver/plugin lifecycle — the layer below): https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html