Helm: Kueue (quota & fair-share)¶
Scope: install Kueue (release manifest or Helm OCI chart), model fair-share GPU quota with ResourceFlavor + ClusterQueue + LocalQueue, share spare capacity across a cohort with controlled preemption, and let jobs suspend until quota admits them. Pairs with Kueue ClusterQueue.
Reference templates pinned to Kueue v0.18.1 against the kueue.x-k8s.io/v1beta2 API. Not hardware-tested. Pin the chart/manifest version and apply via GitOps (SRE and MLOps practices) rather than by hand in production. Field names below are taken from the v1beta2 reference; older clusters on v1beta1 use different cohort/preemption shapes, so verify against your installed CRD version.
What it is¶
Kueue is a Kubernetes-native job queueing controller. It does not place pods itself; it gates admission: a Job submitted to a Kueue-managed queue is created suspend: true, Kueue checks whether its ClusterQueue has free quota (optionally borrowing from a cohort), and only then flips suspend: false so the real scheduler (default kube-scheduler, or Volcano for gang placement) binds it. Workloads above quota wait in line instead of overcommitting the cluster.
Three objects express the policy, plus one per submitted job:
- ResourceFlavor is a named class of nodes (e.g. a GPU SKU), selected by node labels; quota is counted per flavor.
- ClusterQueue is the cluster-scoped quota pool: which resources, which flavors, nominalQuota per flavor, borrow/lend limits, preemption policy, and cohort membership.
- LocalQueue is a namespaced pointer to a ClusterQueue; jobs reference the LocalQueue by name via the kueue.x-k8s.io/queue-name label.
- Workload is created by Kueue for each managed Job; it carries the QuotaReserved/Admitted status conditions.
flowchart LR
JOB["Job (suspend=true)<br/>label queue-name"] --> LQ["LocalQueue<br/>(namespace)"]
LQ --> CQ["ClusterQueue<br/>nominalQuota per flavor"]
CQ -->|"borrow / lend / preempt"| COH["Cohort<br/>(shared pool)"]
CQ --> RF["ResourceFlavor<br/>(node labels)"]
CQ -->|"quota fits -> Admitted"| ADMIT["Job suspend=false<br/>scheduler binds"]
Prerequisites¶
- Kubernetes 1.29 or newer (Kueue v0.18 baseline). [1]
- A working GPU stack so nvidia.com/gpu is allocatable on nodes: the GPU Operator installed and a time-slicing or MIG sharing model chosen. Kueue counts whatever resource name the device plugin advertises.
- kubectl with server-side apply; cluster-admin to install CRDs and the controller.
- GPU nodes labelled so a ResourceFlavor can select them (e.g. nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3, applied by the GPU Operator's GFD, or a label you set).
- RBAC for whoever creates ClusterQueue/ResourceFlavor (cluster-scoped) vs LocalQueue/Jobs (namespaced); see RBAC for GPU Platform Operators.
Install¶
Two supported paths. Pick one; do not mix. Both install CRDs, the controller, and webhooks into kueue-system.
Release manifest (pinned version): [1]
VERSION=v0.18.1
kubectl apply --server-side -f \
https://github.com/kubernetes-sigs/kueue/releases/download/${VERSION}/manifests.yaml
Helm via the OCI chart registry: [1]
helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
--version=0.18.1 \
--namespace kueue-system --create-namespace \
--wait --timeout 300s
--server-side matters for the manifest path: the bundled CRDs are large and can exceed the client-side apply annotation size limit. The Helm chart's controller config (feature gates, manageJobsWithoutQueueName, integrations) lives under the chart's values.yaml controllerManager / managerConfig keys. Consult helm show values oci://registry.k8s.io/kueue/charts/kueue --version=0.18.1 for the exact tree of your pinned chart rather than guessing keys.
By default Kueue only manages Jobs that carry the kueue.x-k8s.io/queue-name label; everything else schedules normally. (manageJobsWithoutQueueName: true flips that to opt-out; leave it off unless the whole cluster is Kueue-governed.)
The manifest¶
Minimal, apply-correct quota for one GPU flavor. Order matters: the ResourceFlavor and ClusterQueue are cluster-scoped; the LocalQueue is namespaced and must reference an existing ClusterQueue.
# 1) A flavor = a class of GPU nodes, selected by node labels.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
# 2) Cluster-scoped quota pool. Members of cohort "research" can borrow/lend.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohortName: research    # omit to make the queue standalone (no borrowing)
  namespaceSelector: {}   # which namespaces' LocalQueues may target this CQ
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-h100
          resources:
            - name: "cpu"
              nominalQuota: "100"
            - name: "memory"
              nominalQuota: 800Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 32      # team-a's guaranteed GPUs
              borrowingLimit: 16    # may borrow up to 16 more from the cohort
              lendingLimit: 8       # may lend up to 8 idle GPUs to the cohort
  preemption:
    reclaimWithinCohort: Any        # reclaim GPUs this CQ lent, from any priority
    borrowWithinCohort:
      policy: LowerPriority         # when borrowing, may preempt lower-prio peers
      maxPriorityThreshold: 100
    withinClusterQueue: LowerPriority   # within team-a, higher prio preempts lower
  flavorFungibility:
    whenCanBorrow: TryNextFlavor    # try another flavor before borrowing
    whenCanPreempt: TryNextFlavor   # try another flavor before preempting
---
# 3) Namespaced entry point. Jobs in ns "team-a" target this by name.
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: team-a
spec:
  clusterQueue: team-a
A second ClusterQueue sharing cohortName: research (its own template in Kueue ClusterQueue) forms the fair-share pool: each team is guaranteed its nominalQuota, idle GPUs (up to lendingLimit) flow to whoever has pending work (up to their borrowingLimit), and reclaimWithinCohort pulls lent capacity back when the owner needs it.
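The second cohort member described above could look like the sketch below. It is an illustration, not the template from the Kueue ClusterQueue page: the names team-b and gpu-queue (in namespace team-b) and every quota number are assumptions, and the namespaceSelector relies on the kubernetes.io/metadata.name label that Kubernetes sets on each namespace.

```yaml
# Hypothetical second member of cohort "research"; names and numbers are illustrative.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: team-b
spec:
  cohortName: research              # same cohort as team-a => shared borrow/lend pool
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: team-b   # only LocalQueues in namespace team-b
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-h100            # reuses the ResourceFlavor from the manifest above
          resources:
            - name: "cpu"
              nominalQuota: "50"
            - name: "memory"
              nominalQuota: 400Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 16      # team-b's guaranteed GPUs (assumed)
              borrowingLimit: 8     # may borrow up to 8 more from the cohort
              lendingLimit: 8       # may lend up to 8 idle GPUs to the cohort
  preemption:
    reclaimWithinCohort: LowerPriority   # more conservative than team-a's "Any"
    withinClusterQueue: LowerPriority
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: team-b
spec:
  clusterQueue: team-b
```

With only these two members, borrowing is capped by what the other side lends as well as by the borrower's own limit: team-b tops out at 16 + 8 = 24 GPUs, and team-a at 32 + 8 = 40 GPUs even though its borrowingLimit is 16, because team-b exposes only 8.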
A GPU Job opts in with the queue-name label and is created suspended: [5]
apiVersion: batch/v1
kind: Job
metadata:
  name: train-smoke
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue   # selects the LocalQueue
spec:
  parallelism: 2
  completions: 2
  suspend: true   # Kueue's webhook sets this automatically; shown here for clarity. It flips to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.05-py3
          command: ["nvidia-smi"]
          resources:
            requests: { cpu: "8", memory: 64Gi, nvidia.com/gpu: 1 }
            limits: { nvidia.com/gpu: 1 }
Configuration¶
| Object / field | Type | Meaning |
|---|---|---|
| ResourceFlavor.spec.nodeLabels | map | Node labels a flavor selects; quota is counted per flavor. [2] |
| ResourceFlavor.spec.nodeTaints / tolerations | list | Taints the flavor's nodes are expected to carry (a Workload must tolerate them to be assigned the flavor) / extra tolerations Kueue adds to admitted pods. [2] |
| ClusterQueue.spec.cohortName | string | Cohort membership. Matching names share a borrow/lend pool; omit = standalone, no borrowing. [4] |
| ClusterQueue.spec.namespaceSelector | labelSelector | Which namespaces' LocalQueues may use this CQ. {} = all. [3] |
| ...resourceGroups[].coveredResources | list | Resource names this group governs (e.g. nvidia.com/gpu). [3] |
| ...flavors[].resources[].nominalQuota | Quantity | Guaranteed amount of that resource in that flavor. [3] |
| ...resources[].borrowingLimit | Quantity | Max this CQ may borrow from the cohort above its nominal quota. [3] |
| ...resources[].lendingLimit | Quantity | Max idle quota this CQ exposes to the cohort. [3] |
| spec.preemption.reclaimWithinCohort | enum | Reclaim lent quota from the cohort: Never, LowerPriority, or Any. [2] |
| spec.preemption.borrowWithinCohort.policy | enum | Preempt within cohort while borrowing: Never or LowerPriority. [2] |
| spec.preemption.withinClusterQueue | enum | Preempt inside this CQ: Never, LowerPriority, or LowerOrNewerEqualPriority. [2] |
| spec.flavorFungibility.whenCanBorrow / whenCanPreempt | enum | Try the next flavor before borrowing/preempting: MayStopSearch or TryNextFlavor. [2] |
| LocalQueue.spec.clusterQueue | string | The ClusterQueue this namespaced queue points at. [2] |
| Job label kueue.x-k8s.io/queue-name | string | LocalQueue the Job is submitted to; absence means Kueue ignores the Job (default config). [5] |
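As an illustration of the ResourceFlavor tolerations field, the sketch below assumes the H100 nodes have been tainted by the operator with a hypothetical key, example.com/gpu-dedicated, so that only Kueue-admitted pods land on them. The flavor name and the taint key are assumptions; confirm the field shape against the v1beta2 reference for your installed version.

```yaml
# Assumes nodes were tainted out of band, e.g. example.com/gpu-dedicated=true:NoSchedule
# (hypothetical key). Kueue adds the toleration to pods it admits under this flavor.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-h100-dedicated          # hypothetical flavor name
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  tolerations:
    - key: example.com/gpu-dedicated
      operator: Equal
      value: "true"
      effect: NoSchedule
```

The design intent is that workloads bypassing Kueue do not carry the toleration and therefore cannot schedule onto the dedicated GPU nodes, while admitted workloads receive it without the Job author having to add it.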
Apply & verify¶
Install, then confirm the controller is up:
kubectl wait deploy/kueue-controller-manager -n kueue-system \
--for=condition=available --timeout=5m
kubectl get pods -n kueue-system
Expected signal: the controller Deployment reaches Available, and kueue-controller-manager-… is Running / READY 1/1. [1]
Apply the quota objects and the smoke Job, then watch admission:
kubectl apply -f kueue-quota.yaml # ResourceFlavor + ClusterQueue + LocalQueue
kubectl apply -f train-smoke.yaml # the suspended Job above
# LocalQueue should show admitted/pending counts once a Job lands:
kubectl get localqueue gpu-queue -n team-a -o wide
# The Workload Kueue created for the Job, and its conditions:
kubectl get workloads.kueue.x-k8s.io -n team-a
kubectl describe workload -n team-a -l kueue.x-k8s.io/queue-name=gpu-queue
Expected signal (quota available): the Workload gains QuotaReserved then Admitted=True, an Admitted event is emitted by the kueue manager, and the Job's .spec.suspend flips to false: [5][6]
kubectl get job train-smoke -n team-a -o jsonpath='{.spec.suspend}' # -> false once admitted
kubectl get pods -n team-a # trainer pods schedule
Expected signal (over quota): submit a Job requesting more nvidia.com/gpu than nominalQuota + borrowingLimit and it stays suspend: true; its Workload has no Admitted condition and kubectl describe reports insufficient quota, which shows the gate holds rather than overcommitting the cluster. [6]
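A Job that exercises the over-quota path against the team-a ClusterQueue above could look like this sketch. The name train-overquota and the small cpu/memory requests are assumptions, chosen so that nvidia.com/gpu is the only resource exceeding quota (49 cpu and 49Gi stay under the 100 cpu and 800Gi nominal quotas).

```yaml
# Hypothetical over-quota Job: 49 pods x 1 GPU = 49 GPUs,
# which exceeds nominalQuota (32) + borrowingLimit (16) = 48.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-overquota
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
spec:
  parallelism: 49
  completions: 49
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.05-py3
          command: ["nvidia-smi"]
          resources:
            requests: { cpu: "1", memory: 1Gi, nvidia.com/gpu: 1 }
            limits: { nvidia.com/gpu: 1 }
```

Delete the Job after checking the Workload status, so it does not sit at the head of the queue while other work is pending.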
Failure modes¶
- Job runs immediately, ignoring quota. The kueue.x-k8s.io/queue-name label is missing (or on the wrong object), so Kueue never manages it. Default config only governs labelled Jobs. [5]
- Job stuck suspend: true forever. ClusterQueue quota (plus any cohort borrow) is below the request, or the LocalQueue points at a non-existent/Inactive ClusterQueue. kubectl describe workload names the reason. [6]
- nominalQuota exceeds allocatable. Kueue admits against quota numbers, not live capacity. If nominalQuota for nvidia.com/gpu is larger than what nodes actually expose, Workloads admit then sit Pending at the scheduler. The same symptom appears when the ResourceFlavor node labels match no nodes: the quota is admitted but unschedulable. Keep quota ≤ real allocatable. [6]
- Unexpected cross-team preemption. reclaimWithinCohort: Any or borrowWithinCohort lets a borrowing/reclaiming queue evict peers. Start with Never/LowerPriority and a priority scheme before enabling Any. [2]
- Time-slicing inflates GPU count. Kueue counts the advertised nvidia.com/gpu; with time-slicing replicas, quota maps to slices, not isolated GPUs. Size quota against the sharing model, not raw card count.
- CRD version mismatch. Applying v1beta2 manifests against a cluster running an older Kueue (only v1beta1 served) is rejected by the API server; cohortName and the borrowWithinCohort shape differ across versions. Match manifest API version to the installed controller.
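The "priority scheme" mentioned for cross-team preemption can be sketched with Kueue's WorkloadPriorityClass. This is a hedged example: the class names and values are hypothetical, and it assumes WorkloadPriorityClass is served at the same v1beta2 API version as the queues, which should be checked against the installed CRDs. The values are chosen around the maxPriorityThreshold: 100 used in the team-a ClusterQueue.

```yaml
# Hypothetical priority classes; values are illustrative.
apiVersion: kueue.x-k8s.io/v1beta2    # assumption: same API version as the queue objects
kind: WorkloadPriorityClass
metadata:
  name: batch-low
value: 50                             # at or below maxPriorityThreshold (100)
description: "Preemptible batch work"
---
apiVersion: kueue.x-k8s.io/v1beta2
kind: WorkloadPriorityClass
metadata:
  name: prod-high
value: 1000                           # above the threshold
description: "Work that borrowing queues should not evict"
---
# A Job selects a class with the kueue.x-k8s.io/priority-class label.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-low-prio                # hypothetical
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue
    kueue.x-k8s.io/priority-class: batch-low
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.05-py3
          command: ["nvidia-smi"]
          resources:
            requests: { cpu: "8", memory: 64Gi, nvidia.com/gpu: 1 }
            limits: { nvidia.com/gpu: 1 }
```

Under these assumptions, a queue that preempts while borrowing can only target workloads such as batch-low, while prod-high workloads stay above the threshold and are protected from borrowers.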
References¶
- Kueue installation (versions, manifest + Helm OCI, kubectl wait): https://kueue.sigs.k8s.io/docs/installation/
- Kueue v1beta2 API reference (ResourceFlavor, ClusterQueue, LocalQueue, Workload, preemption enums): https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
- ClusterQueue concept (resourceGroups, quotas, flavorFungibility): https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
- Cohort concept (cohortName, borrow/lend, fair sharing): https://kueue.sigs.k8s.io/docs/concepts/cohort/
- Run a Kubernetes Job with Kueue (queue-name label, suspend, Admitted): https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
- Troubleshooting Jobs (suspend/admission diagnostics): https://kueue.sigs.k8s.io/docs/tasks/troubleshooting/troubleshooting_jobs/
- Releases: https://github.com/kubernetes-sigs/kueue/releases
Related: GPU platform hub · GPU Operator · Volcano scheduler · ClusterQueue manifest · Kubernetes for GPUs · Security & multi-tenancy · Glossary
[1] Kueue installation: https://kueue.sigs.k8s.io/docs/installation/ (Kubernetes 1.29+, VERSION=v0.18.1, kubectl apply --server-side -f .../manifests.yaml, Helm oci://registry.k8s.io/kueue/charts/kueue --version=0.18.1, namespace kueue-system, kubectl wait deploy/kueue-controller-manager).
[2] Kueue v1beta2 API reference: https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
[3] ClusterQueue concept: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
[4] Cohort concept: https://kueue.sigs.k8s.io/docs/concepts/cohort/
[5] Run a Kubernetes Job: https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
[6] Troubleshooting Jobs: https://kueue.sigs.k8s.io/docs/tasks/troubleshooting/troubleshooting_jobs/