Manifest: Kueue ClusterQueue
Scope: the ResourceFlavor + ClusterQueue + LocalQueue triad that fences nvidia.com/gpu into team quota, plus a Job labelled to a LocalQueue and the kubectl get workloads checks that prove admission and quota accounting. The CRD detail behind line 5 of Kubernetes & Helm: GPU Platform; pairs with Kueue. Install the controller there.
Reference templates derived from the upstream Kueue v1beta2 CRDs. Pin the chart/image in Kueue and apply these manifests via GitOps (SRE and MLOps practices). They have not been tested on hardware here.
nominalQuota numbers are placeholders; set them to your fleet's real GPU count.
flowchart LR
JOB["Job + label<br/>kueue.x-k8s.io/queue-name"] --> WL["Workload<br/>(unit of admission)"]
WL --> LQ["LocalQueue<br/>(namespace-scoped)"]
LQ --> CQ["ClusterQueue<br/>(quota + namespaceSelector)"]
CQ --> RF["ResourceFlavor<br/>(nodeLabels -> GPU nodes)"]
CQ -. "QuotaReserved -> Admitted" .-> WL
What it is
Kueue is a job-level quota and queueing controller: it suspends Jobs, then admits them only when their ResourceFlavor quota is free. Three objects are involved, two cluster-scoped and one namespaced:
- ResourceFlavor names a class of nodes (here, GPU nodes) via nodeLabels/nodeTaints. Quota is counted per flavor.[1]
- ClusterQueue is the quota pool: which resources it covers (coveredResources), how much per flavor (nominalQuota), and which namespaces may draw on it (namespaceSelector).[2]
- LocalQueue is the namespaced handle teams submit to; it points at one ClusterQueue via spec.clusterQueue.[4]
A Job carries kueue.x-k8s.io/queue-name: <localqueue>; Kueue creates a Workload for it (the unit of admission) and walks it through QuotaReserved -> Admitted -> Finished.[5][6] Kueue is not a scheduler; it gates admission. Gang placement is still Volcano/kube-scheduler (Volcano Job).
Prerequisites
- Kueue controller installed and Ready; see Kueue. The CRDs clusterqueues, localqueues, resourceflavors, and workloads must exist in API group kueue.x-k8s.io at version v1beta2.
- GPU nodes already advertising nvidia.com/gpu (GPU Operator device plugin up; see GPU Operator ClusterPolicy).
- A node label to bind the flavor to GPU nodes. The GPU Operator's GFD applies nvidia.com/gpu.present=true; pick a label that is true on every GPU node (Containers and Kubernetes for GPUs).
- A namespace per team (here team-a) so namespaceSelector can scope the pool.
kubectl get crd | grep kueue.x-k8s.io # expect clusterqueues/localqueues/resourceflavors/workloads
kubectl -n kueue-system get deploy kueue-controller-manager # expect READY 1/1
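The two checks above can be extended into a small preflight script. This is a minimal sketch, not part of Kueue: the function names check_kueue_crds and list_gpu_nodes are hypothetical, and the node label is assumed to be the GFD label nvidia.com/gpu.present=true named in the prerequisites.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; the names are illustrative, not Kueue commands.

# Check each Kueue CRD by its full name (<plural>.<group>) instead of grepping.
check_kueue_crds() {
  local missing=0 crd
  for crd in clusterqueues localqueues resourceflavors workloads; do
    if kubectl get crd "${crd}.kueue.x-k8s.io" >/dev/null 2>&1; then
      echo "ok       ${crd}.kueue.x-k8s.io"
    else
      echo "MISSING  ${crd}.kueue.x-k8s.io"
      missing=1
    fi
  done
  return "$missing"
}

# List nodes carrying the assumed flavor label with their allocatable GPU count.
# Swap the selector if your flavor binds to a different label.
list_gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

Run check_kueue_crds && list_gpu_nodes before applying the manifests; a non-zero exit from the first function means at least one CRD is absent.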
The manifest
One flavor pinned to GPU nodes, one ClusterQueue scoped to labelled namespaces, one LocalQueue in team-a. Apply in this order (flavor and ClusterQueue are cluster-scoped; the LocalQueue is namespaced).
# 1. ResourceFlavor: this quota counts only GPU nodes.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-nodes
spec:
  nodeLabels:
    nvidia.com/gpu.present: "true"   # set by GPU Operator GFD; use a label true on every GPU node
---
# 2. ClusterQueue: the GPU quota pool, open only to namespaces labelled kueue.x-k8s.io/queue=research.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: research
spec:
  namespaceSelector:
    matchLabels:
      kueue.x-k8s.io/queue: research   # only matching namespaces may draw on this pool
  queueingStrategy: BestEffortFIFO     # default; head-of-line blocks only in StrictFIFO
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-nodes
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 64               # PLACEHOLDER: total GPUs this queue may admit at once
---
# 3. LocalQueue: the handle team-a submits to.
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: team-a-gpu
  namespace: team-a
spec:
  clusterQueue: research
Label the namespace so the namespaceSelector matches (skip if you used namespaceSelector: {} for all namespaces):
kubectl create namespace team-a --dry-run=client -o yaml | kubectl apply -f -
kubectl label namespace team-a kueue.x-k8s.io/queue=research --overwrite
kubectl apply -f kueue-quota.yaml
A Job that requests 2 GPUs and is routed to the LocalQueue by label. Do not pre-suspend it; Kueue suspends and resumes via its webhook.[6]
apiVersion: batch/v1
kind: Job
metadata:
  generateName: gpu-burn-
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-gpu   # routes the Workload to LocalQueue team-a-gpu
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda:13.0.0-base-ubuntu24.04   # pin to your fleet's CUDA base
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 2   # counted against nominalQuota: 64
Configuration
| Object.field | Type | Meaning / values |
|---|---|---|
| ResourceFlavor.spec.nodeLabels | map | Node labels the flavor binds to; injected into the pod as node selectors at admission so it lands on those nodes.[1] |
| ResourceFlavor.spec.nodeTaints | list | Taints the flavor's quota requires the pod to tolerate; only tolerating workloads consume it.[1] |
| ResourceFlavor.spec.tolerations | list | Tolerations Kueue adds to admitted pods so they schedule onto tainted GPU nodes.[1] |
| ClusterQueue.spec.namespaceSelector | LabelSelector | Which namespaces may draw on the pool. {} = all namespaces; matchLabels to scope.[2] |
| ClusterQueue.spec.resourceGroups[].coveredResources | list | Resource names this group governs, e.g. ["nvidia.com/gpu"].[2] |
| …resourceGroups[].flavors[].name | string | Must reference an existing ResourceFlavor (here gpu-nodes).[2] |
| …flavors[].resources[].nominalQuota | quantity | Admittable amount of that resource in that flavor. GPUs are integers.[2] |
| …resources[].borrowingLimit | quantity | Max this CQ may borrow from its cohort above nominal; omit for no limit, set 0 to disallow borrowing.[2] |
| …resources[].lendingLimit | quantity | Max this CQ lends to the cohort; omit to lend all idle quota.[2] |
| ClusterQueue.spec.cohortName | string | Cohort this CQ shares/borrows with (renamed from cohort in v1beta2).[3] |
| ClusterQueue.spec.queueingStrategy | enum | BestEffortFIFO (default) or StrictFIFO (head-of-line blocks).[2] |
| ClusterQueue.spec.preemption.reclaimWithinCohort | enum | Never, LowerPriority, or Any: reclaim borrowed quota from cohort peers.[2] |
| ClusterQueue.spec.preemption.withinClusterQueue | enum | Never, LowerPriority, or LowerOrNewerEqualPriority.[2] |
| ClusterQueue.spec.stopPolicy | enum | None, Hold, or HoldAndDrain: pause admission or drain the queue.[2] |
| LocalQueue.spec.clusterQueue | string | The ClusterQueue this LocalQueue feeds.[4] |
| Job label kueue.x-k8s.io/queue-name | string | Routes the Job's Workload to a LocalQueue in the same namespace.[7][6] |
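The cohort fields in the table are not exercised by the manifest earlier on this page. The sketch below shows how they might fit together, under stated assumptions: render_borrowing_cq is a hypothetical helper, the cohort name ml-shared is invented for illustration, and the limits are placeholders. It only prints YAML; nothing is applied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a ClusterQueue that joins a cohort and caps borrowing.
# Arguments: <name> <cohort> <nominalQuota> <borrowingLimit>
render_borrowing_cq() {
  local name="$1" cohort="$2" nominal="$3" borrow="$4"
  cat <<EOF
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: ${name}
spec:
  cohortName: ${cohort}            # assumption: cohort name is illustrative
  namespaceSelector: {}            # all namespaces; scope with matchLabels in production
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-nodes
      resources:
      - name: nvidia.com/gpu
        nominalQuota: ${nominal}   # PLACEHOLDER
        borrowingLimit: ${borrow}  # PLACEHOLDER: GPUs borrowable above nominal
EOF
}
```

One way to validate the rendered object without creating it is render_borrowing_cq research ml-shared 64 16 | kubectl apply --dry-run=server -f -, which asks the API server to check it against the installed CRD schema.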
Apply & verify
kubectl apply -f kueue-quota.yaml # flavor + ClusterQueue + LocalQueue
kubectl get clusterqueue research -o wide
kubectl get localqueue team-a-gpu -n team-a
The ClusterQueue is usable only once it reports Active:
kubectl get clusterqueue research -o jsonpath='{range .status.conditions[?(@.type=="Active")]}{.status}{" "}{.reason}{"\n"}{end}'
# expected: True Ready
Active=False here almost always means the referenced ResourceFlavor does not exist. Check with kubectl get resourceflavor gpu-nodes.
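When a ClusterQueue references several flavors, the same check can be looped. This is a minimal sketch under the assumption that flavor names contain no whitespace; check_flavors is a hypothetical helper name, not a Kueue command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every flavor a ClusterQueue references that does not exist.
# Argument: <clusterqueue-name>
check_flavors() {
  local cq="$1" flavor rc=0
  # The JSONPath expression prints all referenced flavor names, space-separated.
  for flavor in $(kubectl get clusterqueue "$cq" \
      -o jsonpath='{.spec.resourceGroups[*].flavors[*].name}'); do
    if kubectl get resourceflavor "$flavor" >/dev/null 2>&1; then
      echo "found    $flavor"
    else
      echo "missing  $flavor"
      rc=1
    fi
  done
  return "$rc"
}
```

check_flavors research exits non-zero if any referenced ResourceFlavor is absent, which makes it usable as a gate in a GitOps pipeline.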
Submit the Job and watch the Workload move through admission:
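A minimal sketch of that step, assuming the Job manifest above was saved as gpu-job.yaml (a hypothetical filename); submit_job is an illustrative wrapper, not a Kueue command. The manifest uses generateName, so it has to be submitted with kubectl create: kubectl apply rejects generateName.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: submit the Job, then watch its Workload in team-a.
# Argument: <path to the Job manifest>
submit_job() {
  local manifest="$1"
  # generateName only works with create; each call yields a new gpu-burn-<suffix> Job.
  kubectl create -f "$manifest" &&
    # --watch streams updates as the Workload's columns change; stop it with Ctrl-C.
    kubectl -n team-a get workloads.kueue.x-k8s.io --watch
}
```

Call it as submit_job gpu-job.yaml.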
Expected once quota is free (ADMITTED=True):
RESERVED IN shows the ClusterQueue holding the quota; ADMITTED=True means the Job was resumed and its pods created.[6] Confirm the full condition chain and quota accounting:
WL=$(kubectl -n team-a get workloads.kueue.x-k8s.io -o name | head -n1)
kubectl -n team-a get "$WL" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
# expected: QuotaReserved=True Admitted=True (Finished=True after the Job completes)
kubectl get clusterqueue research -o jsonpath='{.status.flavorsUsage}' | jq .
# expected: gpu-nodes / nvidia.com/gpu total == 2 (matches the Job's limit)
kubectl get clusterqueue research -o jsonpath='{.status.admittedWorkloads}{"\n"}' # expected: 1
Quota exhaustion is the correct negative signal: with nominalQuota: 64, 32 two-GPU Jobs fill the pool, so the 33rd keeps an empty ADMITTED column and kubectl -n team-a describe workload <wl> shows couldn't assign flavors … insufficient quota for nvidia.com/gpu.[6] The Job's pods do not exist until admission; that is Kueue working, not a stuck Job.
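The arithmetic behind that threshold is worth redoing whenever nominalQuota or the per-Job GPU count changes. A small sketch, assuming every Job requests the same number of GPUs and no borrowing is configured; jobs_that_fit is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many identical Jobs fit under a nominalQuota.
# Arguments: <nominalQuota> <GPUs per Job>
jobs_that_fit() {
  local quota="$1" gpus_per_job="$2"
  echo $(( quota / gpus_per_job ))   # integer division; leftover GPUs admit nothing
}

fit=$(jobs_that_fit 64 2)
echo "admitted concurrently: ${fit}"             # prints: admitted concurrently: 32
echo "first Job left pending: $(( fit + 1 ))"    # prints: first Job left pending: 33
```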
Failure modes
- ClusterQueue Active=False. flavors[].name points at a ResourceFlavor that does not exist (typo, or applied out of order). Create the flavor first; the CQ reconciles to Active on its own.[3]
- Workload stuck, no QuotaReserved. The Job's namespace does not match namespaceSelector. Label it (kueue.x-k8s.io/queue=research) or widen the selector. An empty Workload list usually means the kueue.x-k8s.io/queue-name label is missing or misspelled, so Kueue never adopted the Job.[6]
- QuotaReserved=True but pods never schedule. The ResourceFlavor.nodeLabels select nodes that lack free GPUs, or pods don't tolerate the flavor's nodeTaints. Add tolerations to the flavor or fix the label. Verify GPUs are actually free with kubectl describe node (GPU Diagnostics and Validation).
- Quota never frees after Jobs finish. Workloads are not transitioning to Finished; check the controller logs in kueue-system. Until Finished, their GPUs stay counted in flavorsUsage.[5]
- Pre-suspended Job hangs. Do not set spec.suspend: true by hand and expect Kueue to also manage it; let the webhook own suspension. Conversely, a Job with no queue label runs ungated and bypasses quota entirely.[6]
- GPUs requested under requests only. Extended resources such as nvidia.com/gpu must be set under limits (a requests-only value is rejected); quota is counted from the effective request. Always set limits (Containers and Kubernetes for GPUs).
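The first two failure modes can usually be told apart with three read-only queries. A minimal triage sketch; triage_queue_routing is a hypothetical helper name, and it only reads cluster state.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper for a Job that never gets a Workload or QuotaReserved.
# Argument: <namespace>
triage_queue_routing() {
  local ns="$1"
  # 1. Does the namespace carry the label the ClusterQueue's namespaceSelector expects?
  kubectl get namespace "$ns" --show-labels
  # 2. Which Jobs carry the queue-name label? -L adds the label value as a column;
  #    an empty value means Kueue never adopted that Job.
  kubectl -n "$ns" get jobs -L kueue.x-k8s.io/queue-name
  # 3. Which LocalQueues exist for that label to point at?
  kubectl -n "$ns" get localqueues.kueue.x-k8s.io
}
```

triage_queue_routing team-a prints the three views in order; compare the label value from step 2 against the LocalQueue names from step 3.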
References
- Kueue v1beta2 API reference (ResourceFlavor, ClusterQueue, LocalQueue, Workload specs; cohortName): https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
- ClusterQueue concept (resourceGroups, namespaceSelector, queueingStrategy, preemption, stopPolicy): https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
- ResourceFlavor concept (nodeLabels, nodeTaints, tolerations): https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/
- LocalQueue concept (spec.clusterQueue, status): https://kueue.sigs.k8s.io/docs/concepts/local_queue/
- Workload concept (unit of admission, conditions): https://kueue.sigs.k8s.io/docs/concepts/workload/
- Run a Kubernetes Job (queue-name label, suspend behaviour, kubectl get workloads output): https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
- Labels and annotations (kueue.x-k8s.io/queue-name): https://kueue.sigs.k8s.io/docs/reference/labels-and-annotations/
- Administer cluster quotas: https://kueue.sigs.k8s.io/docs/tasks/manage/administer_cluster_quotas/
Related: Helm: GPU Platform · Kueue · Volcano Job · Kubernetes for GPUs · Security & multi-tenancy · Glossary
1. ResourceFlavor spec.nodeLabels/nodeTaints/tolerations: https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/
2. ClusterQueue resourceGroups, coveredResources, flavors[].resources[].nominalQuota/borrowingLimit/lendingLimit, namespaceSelector, queueingStrategy, preemption, stopPolicy: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
3. ClusterQueueSpec cohortName (type CohortReference) in v1beta2: https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
4. LocalQueue spec.clusterQueue and status counters: https://kueue.sigs.k8s.io/docs/concepts/local_queue/
5. Workload is the unit of admission; conditions QuotaReserved/Admitted/Finished: https://kueue.sigs.k8s.io/docs/concepts/workload/
6. kueue.x-k8s.io/queue-name label, webhook-managed suspension, kubectl get workloads columns, insufficient-quota describe output: https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
7. kueue.x-k8s.io/queue-name label reference: https://kueue.sigs.k8s.io/docs/reference/labels-and-annotations/