GPU platform architecture: cloud control plane + distributed data plane

Scope: the reference architecture for a GPU-on-demand platform (a neocloud, an internal GPU cloud, or a multi-provider GPU service) that splits into a cloud-hosted control plane (API, auth, billing, capacity aggregation) and a distributed GPU data plane (bare-metal or rented nodes running the workloads), joined by VPC peering and a WireGuard overlay. Covers why the split exists, the multi-provider aggregation layer, cross-plane connectivity and secrets, and the failure modes of running compute you do not own. This is the system-level page that ties together the allocation operator, overlay networking, and remote GPU verification.

What it is

A GPU-on-demand platform has two halves with opposite operational profiles, and the architecture's central decision is to run them on different infrastructure:

Control plane (cloud-hosted, e.g. managed containers/ECS/Fargate): the API gateway, authentication, rate limiting, billing/metering, the multi-provider capacity aggregator, and platform state. It is small, stateful, internet-facing, and must be highly reliable, which is exactly the profile a managed cloud with managed databases and a key manager handles well. A stable egress (NAT gateway) gives it a fixed IP that providers and upstreams can whitelist.
Data plane (distributed, K3s on bare-metal/rented GPU nodes): the GPU-allocation operator, the training/inference workloads, the WireGuard mesh, and node-local state. It is large, elastic, churny, and runs on hardware you may not own or trust; it lives wherever GPU capacity is cheapest.

The two planes are connected by VPC peering (for control traffic, e.g. the API fetching a cluster token or kubeconfig over SSH) and a WireGuard overlay (reaching remote GPU nodes behind NAT), with platform secrets held in a cloud key manager and synced into the data plane.

flowchart TB
  USER["User / tenant"] -->|"HTTPS"| GW["API gateway (cloud)<br/>OAuth2/JWT, rate limit, multi-tenant"]
  subgraph CP["Control plane (cloud, managed)"]
    GW --> AGG["Multi-provider aggregator"]
    GW --> BILL["Billing / metering (gRPC mesh)"]
    GW --> KMS["Key manager (secrets)"]
    GW --> DB["Managed DB (platform state)"]
  end
  subgraph DP["Data plane (distributed GPU nodes)"]
    OP["GPU-allocation operator (K3s)"] --> N1["GPU node @ provider A"]
    OP --> N2["GPU node @ provider B"]
  end
  AGG -->|"provision VM (provider APIs)"| DP
  GW -->|"VPC peering: token/kubeconfig (SSH)"| OP
  N1 <-->|"WireGuard overlay"| N2
  KMS -->|"sync kubeconfig, keys"| OP

Why it matters

The split exists because the two halves fail differently and scale differently. Co-locating them (running billing and auth on the same churny GPU nodes that get preempted and re-imaged) couples a reliable, money-handling control plane to the least reliable part of the system. Separating them buys:

Blast-radius isolation. A data-plane outage (a provider drops, a region goes dark) must not take down the API or billing, and vice versa. The planes degrade independently.
Right tool per half. Auth, metering, and state want a managed cloud (managed DB, KMS, autoscaling containers); GPU compute wants to be wherever it is cheapest, including bare metal and neoclouds. Each half runs where it is strongest.
Multi-provider reach. A single normalized aggregation layer lets the platform consume capacity from many GPU providers without the control plane caring which one served a given lease. This is the foundation of build-vs-rent flexibility at the platform level.
A trust boundary. The data plane runs on hardware the platform may not own, so it must verify the GPUs and treat the nodes as untrusted. Those concerns stay cleanly in the data plane while the control plane holds the secrets.

When to use it (and when not)

Use the split-plane architecture when:

You are building a GPU-on-demand platform / neocloud / internal GPU cloud / marketplace that serves capacity to users through an API.
Capacity spans multiple providers or bare-metal sites, so a normalized aggregation + provisioning layer is needed.
The control-plane concerns (auth, billing, API) demand higher reliability than the GPU compute substrate can guarantee.

Do not use it when:

You run a single owned cluster for your own workloads. Control and data plane co-locate; a plain Kubernetes/Slurm cluster with Kueue/Volcano is simpler and there is no provider-abstraction or cross-plane-trust problem to solve.
There is no multi-tenancy and no external API, so the cloud control plane is overhead with no consumer.

How it is built and operated

Reference patterns, unexecuted. Confirm managed-service, Terraform, VPC-peering, and KMS specifics against your cloud provider's current docs; pin versions.

1. Cloud control plane

Run the API gateway, auth, billing, and aggregator as managed containers behind load balancers, with platform state in a managed database and secrets in a key manager. Internal services talk over a gRPC service-discovery mesh (private DNS), with a separate metrics listener per service for observability. Provision it as code (Terraform) with a locked state backend. Give it a stable egress IP (NAT gateway) so providers and upstreams can whitelist it.

2. Multi-provider GPU aggregator

The aggregator is the layer that makes many GPU providers look like one. The pattern:

Normalized provider abstraction: one interface, per-provider adapters (each provider's REST/OAuth differs); the rest of the platform speaks the normalized model.
Per-provider rate limiting: every provider API has limits; a token bucket per provider keeps calls to each provider under its own limit, so a burst of requests against one provider cannot starve calls to another.
Capability/AZ validation: validate region/instance/GPU-model availability before committing (ideally at compile/config time) so a request cannot target a combination that cannot exist.
Lifecycle + orphan reconciliation: track every provisioned VM against a backing lease, and run an audit loop that enumerates provider inventories directly to catch VMs with no backing lease row. This is the canonical billing leak: you are paying a provider for a node no tenant is renting. Reap with a grace window and a dry-run mode.
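The adapter and audit-loop items above can be sketched in Python as follows. This is a minimal illustration, not the platform's implementation: `ProviderVM`, `ProviderAdapter`, `find_orphans`, `reap`, and `FakeProvider` are hypothetical names, the 30-minute grace window is an arbitrary placeholder, and a real adapter would wrap a provider's REST API rather than an in-memory dict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Protocol


@dataclass(frozen=True)
class ProviderVM:
    """Normalized view of a VM, whatever the provider's native schema is."""
    provider: str
    vm_id: str
    created_at: datetime


class ProviderAdapter(Protocol):
    """One adapter per provider; the rest of the platform sees only this."""
    name: str

    def list_vms(self) -> list[ProviderVM]: ...
    def terminate(self, vm_id: str) -> None: ...


def find_orphans(adapters, leased_vm_ids, now, grace=timedelta(minutes=30)):
    """Return VMs present in provider inventory but absent from the lease table.

    VMs younger than `grace` are skipped: they may still be mid-provisioning,
    with the lease row not yet committed.
    """
    orphans = []
    for adapter in adapters:
        # Enumerate the provider's own inventory, not the platform database.
        for vm in adapter.list_vms():
            if (adapter.name, vm.vm_id) in leased_vm_ids:
                continue
            if now - vm.created_at < grace:
                continue
            orphans.append(vm)
    return orphans


def reap(adapters, orphans, dry_run=True):
    """Terminate orphans; in dry-run mode only report what would be reaped."""
    by_name = {a.name: a for a in adapters}
    actions = []
    for vm in orphans:
        if not dry_run:
            by_name[vm.provider].terminate(vm.vm_id)
        verb = "would-terminate" if dry_run else "terminated"
        actions.append((verb, vm.provider, vm.vm_id))
    return actions


class FakeProvider:
    """In-memory stand-in for a real provider adapter (illustration only)."""

    def __init__(self, name, vms):
        self.name = name
        self._vms = {vm.vm_id: vm for vm in vms}

    def list_vms(self):
        return list(self._vms.values())

    def terminate(self, vm_id):
        del self._vms[vm_id]
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: a reconciliation bug then produces a report rather than terminated tenant nodes.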

3. CRD-driven autoscaling and provisioning

Scale the data plane with a CRD-driven autoscaler: a scaling policy and per-node-pool state, a warm pool that pre-provisions GPU capacity to hide cold-start latency, and a provisioning pipeline that brings a raw provider VM into the cluster (SSH key → K3s install → WireGuard sync → health-gate → verify the GPU) before marking it schedulable.
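The ordering constraint in that pipeline (nothing becomes schedulable until every step, including GPU verification, has passed) and the warm-pool target can be sketched as below. The step bodies are left to the caller because the real ones depend on SSH, the K3s installer, WireGuard tooling, and the GPU verifier; `Node`, `provision`, and `warm_pool_deficit` are hypothetical names, and a CRD-driven autoscaler would hold this state in custom resources rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    name: str
    schedulable: bool = False
    completed: list[str] = field(default_factory=list)


# Each step takes the node and returns True on success.
Step = Callable[[Node], bool]


def provision(node: Node, steps: list[tuple[str, Step]]) -> bool:
    """Run steps in order; mark the node schedulable only if all succeed."""
    for name, step in steps:
        if not step(node):
            # Stop at the first failure; the node stays unschedulable.
            return False
        node.completed.append(name)
    node.schedulable = True
    return True


def warm_pool_deficit(target_warm: int, idle_ready: int, provisioning: int) -> int:
    """Nodes still to pre-provision to reach the warm-pool target.

    Nodes already mid-provisioning count toward the target so the
    autoscaler does not over-order while earlier requests are in flight.
    """
    return max(0, target_warm - idle_ready - provisioning)
```

For example, with a target of 4 warm nodes, 1 idle and ready, and 2 already provisioning, `warm_pool_deficit(4, 1, 2)` asks for one more node.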

4. Cross-plane connectivity and secrets

VPC peering for control traffic from the cloud plane to the cluster (e.g. fetching a token/kubeconfig over SSH from a jump host).
WireGuard overlay for reaching GPU nodes behind NAT, with Flannel-over-WireGuard carrying the pod network across providers.
Secrets held in the cloud key manager and synced into the data plane only as needed; never scatter long-lived secrets across untrusted GPU nodes. Cross-plane credentials (the kubeconfig, SSH keys) must be tightly scoped and rotatable.
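"Tightly scoped and rotatable" can be made concrete with a small model: a credential carries an explicit scope set and an expiry, every use is checked against both, and rotation is triggered before expiry. This is only a sketch of the policy, with hypothetical names (`CrossPlaneCredential`, `issue`, `allowed`, `needs_rotation`), made-up scope strings, and placeholder TTL values; in practice the enforcement lives in the key manager and the cluster's own access controls.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CrossPlaneCredential:
    subject: str
    scopes: frozenset[str]  # scope names here are illustrative only
    expires_at: datetime


def issue(subject, scopes, now, ttl=timedelta(hours=1)):
    """Mint a short-lived credential limited to the given scopes."""
    return CrossPlaneCredential(subject, frozenset(scopes), now + ttl)


def allowed(cred, scope, now):
    """A call is allowed only if the credential is unexpired and in scope."""
    return now < cred.expires_at and scope in cred.scopes


def needs_rotation(cred, now, margin=timedelta(minutes=10)):
    """Rotate ahead of expiry so the control plane never holds a dead key."""
    return cred.expires_at - now <= margin
```

The useful property is that a leaked credential is bounded twice: by what it can do (scopes) and by how long it can do it (TTL).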

5. The request flow, end to end

A lease request threads both planes: user → API gateway (authz, rate limit) → billing check → aggregator/operator provisions or binds a node → (VPC peering / WireGuard) → workload runs on the GPU node → telemetry and metering flow back. Each hop is a place to enforce policy, and each is independently observable.
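A toy model of that flow, showing each hop as a separate policy check that either passes the request on or rejects it with the hop's name attached. The checks are deliberately simplistic stand-ins (a tenant set, a balance map, a free-GPU count), and `LeaseRequest`, `Rejected`, and `handle_lease` are hypothetical names; the real hops are network calls between services.

```python
from dataclasses import dataclass, field


class Rejected(Exception):
    """Raised by the first hop whose policy check fails."""

    def __init__(self, hop, reason):
        super().__init__(f"{hop}: {reason}")
        self.hop = hop
        self.reason = reason


@dataclass
class LeaseRequest:
    tenant: str
    gpus: int
    trace: list[str] = field(default_factory=list)


def handle_lease(req, tenants, balances, free_gpus):
    """Thread a request through gateway, billing, and aggregator in order."""
    hops = [
        ("gateway", lambda: req.tenant in tenants, "unknown tenant"),
        ("billing", lambda: balances.get(req.tenant, 0) > 0, "no credit"),
        ("aggregator", lambda: free_gpus >= req.gpus, "no capacity"),
    ]
    for hop, check, reason in hops:
        if not check():
            raise Rejected(hop, reason)
        # Recording each hop passed is what makes the flow observable.
        req.trace.append(hop)
    return req.trace
```

Because a rejection names its hop and the trace records every hop passed, a failed lease can be attributed to one plane without inspecting the other.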

Failure modes

Plane coupling. Billing/auth co-located with churny GPU compute → a data-plane outage takes down the money path. Keep the control plane on managed cloud, isolated.
Orphaned provider VMs. A VM with no backing lease bills forever. Run the reconciliation audit loop against provider inventories, not just your own database.
Provider rate-limit exhaustion. Unbounded provider API calls trip limits and stall provisioning fleet-wide. Rate-limit per provider.
No stable egress. A changing control-plane IP breaks provider/upstream whitelists. Pin egress (NAT gateway).
Secrets sprawl. Long-lived secrets on untrusted GPU nodes are a breach waiting to happen. Hold them in a cloud KMS; sync narrowly; scope cross-plane credentials tightly (see security & multi-tenancy).
Trusting unverified data-plane hardware. A rented "GPU" may be weaker than advertised, shared, or spoofed. Verify every node before it joins.
Cross-plane auth too broad. A kubeconfig or SSH key with more scope than the control plane needs widens the blast radius. Least privilege, rotatable.
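For the rate-limit exhaustion case above, a per-provider token bucket is one common mitigation. The sketch below keeps an independent bucket per provider so that draining one does not affect another; `TokenBucket` and `ProviderLimiter` are hypothetical names, the rates and capacities are placeholders to be replaced with each provider's documented limits, and the version shown is not thread-safe.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; never blocks."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False


class ProviderLimiter:
    """One bucket per provider, so limits are isolated from each other."""

    def __init__(self, limits, clock=time.monotonic):
        # limits maps provider name -> (rate per second, burst capacity)
        self._buckets = {
            name: TokenBucket(rate, cap, clock)
            for name, (rate, cap) in limits.items()
        }

    def try_call(self, provider):
        """Return True if a call to this provider may proceed now."""
        return self._buckets[provider].try_acquire()
```

The injectable `clock` exists so the limiter can be tested deterministically; callers that get `False` should queue or back off rather than retry in a tight loop.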

Open questions & validation

The exact control-plane/data-plane boundary for your platform: what must be in managed cloud vs. what can live in the cluster, and the independent-degradation test (kill one plane, confirm the other survives).
Aggregator coverage: does the orphan-reconciliation loop actually catch every billing leak across all providers, tested with a deliberately orphaned VM?
Warm-pool sizing: cold-start latency hidden vs. idle GPU cost, measured rather than guessed (see capacity planning).
Cross-plane credential scope and rotation; what an attacker with the data-plane kubeconfig can and cannot reach.
Provider-abstraction completeness: can a new provider be onboarded by writing one adapter, with no change to the control plane?

References

AWS ECS / Fargate (managed container control plane): https://docs.aws.amazon.com/ecs/
Terraform (infrastructure as code): https://developer.hashicorp.com/terraform/docs
AWS VPC peering (cross-VPC control traffic): https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html
AWS Secrets Manager / KMS (cross-plane secrets): https://docs.aws.amazon.com/secretsmanager/
K3s (lightweight Kubernetes for the data plane): https://docs.k3s.io/
gRPC (internal service-to-service mesh): https://grpc.io/docs/