Helm: Kueue (quota & fair-share)¶

Scope: install Kueue (release manifest or Helm OCI chart), model fair-share GPU quota with ResourceFlavor + ClusterQueue + LocalQueue, share spare capacity across a cohort with controlled preemption, and let jobs suspend until quota admits them. Pairs with Kueue ClusterQueue.

These are reference templates pinned to Kueue v0.18.1 and the kueue.x-k8s.io/v1beta2 API; they have not been tested on hardware. Pin the chart/manifest version and apply via GitOps (see SRE and MLOps practices) rather than by hand in production. Field names below are taken from the v1beta2 reference. Older clusters on v1beta1 use different cohort/preemption shapes, so verify against your installed CRD version.

What it is¶

Kueue is a Kubernetes-native job queueing controller. It does not place pods itself; it gates admission. A Job submitted to a Kueue-managed queue is created with suspend: true, Kueue checks whether its ClusterQueue has free quota (optionally borrowing from a cohort), and only then sets suspend: false so the real scheduler (default kube-scheduler, or Volcano for gang placement) binds it. Workloads above quota wait in line instead of overcommitting the cluster.

Three objects express the policy, plus one per submitted job:

ResourceFlavor is a named class of nodes (e.g. a GPU SKU), selected by node labels; quota is counted per flavor.
ClusterQueue is the cluster-scoped quota pool: which resources, which flavors, nominalQuota per flavor, borrow/lend limits, preemption policy, and cohort membership.
LocalQueue is a namespaced pointer to a ClusterQueue; jobs reference the LocalQueue by name via the kueue.x-k8s.io/queue-name label.
Workload is created by Kueue for each managed Job; it carries the QuotaReserved / Admitted status conditions.

flowchart LR
  JOB["Job (suspend=true)<br/>label queue-name"] --> LQ["LocalQueue<br/>(namespace)"]
  LQ --> CQ["ClusterQueue<br/>nominalQuota per flavor"]
  CQ -->|"borrow / lend / preempt"| COH["Cohort<br/>(shared pool)"]
  CQ --> RF["ResourceFlavor<br/>(node labels)"]
  CQ -->|"quota fits -> Admitted"| ADMIT["Job suspend=false<br/>scheduler binds"]

Prerequisites¶

Kubernetes 1.29 or newer (Kueue v0.18 baseline). ¹
A working GPU stack so nvidia.com/gpu is allocatable on nodes: the GPU Operator installed and a time-slicing or MIG sharing model chosen. Kueue counts whatever resource name the device plugin advertises.
kubectl with server-side apply; cluster-admin to install CRDs and the controller.
GPU nodes labelled so a ResourceFlavor can select them (e.g. nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3, applied by the GPU Operator's GFD, or a label you set).
RBAC for whoever creates ClusterQueue/ResourceFlavor (cluster-scoped) vs LocalQueue/Jobs (namespaced); see RBAC for GPU Platform Operators.

Install¶

Two supported paths. Pick one; do not mix. Both install CRDs, the controller, and webhooks into kueue-system.

Release manifest (pinned version): ¹

VERSION=v0.18.1
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/${VERSION}/manifests.yaml

Helm via the OCI chart registry: ¹

helm install kueue oci://registry.k8s.io/kueue/charts/kueue \
  --version=0.18.1 \
  --namespace kueue-system --create-namespace \
  --wait --timeout 300s

--server-side matters for the manifest path: the bundled CRDs are large and can exceed the client-side apply annotation size limit. The Helm chart's controller config (feature gates, manageJobsWithoutQueueName, integrations) lives under the chart's values.yaml controllerManager / managerConfig keys. Consult helm show values oci://registry.k8s.io/kueue/charts/kueue --version=0.18.1 for the exact tree of your pinned chart rather than guessing keys.

By default Kueue only manages Jobs that carry the kueue.x-k8s.io/queue-name label; everything else schedules normally. (manageJobsWithoutQueueName: true flips that to opt-out; leave it off unless the whole cluster is Kueue-governed.)

The manifest¶

A minimal quota setup for one GPU flavor that applies cleanly. Order matters: the ResourceFlavor and ClusterQueue are cluster-scoped; the LocalQueue is namespaced, so its namespace (team-a here) must already exist, and it only becomes usable once the ClusterQueue it references exists.

# 1) A flavor = a class of GPU nodes, selected by node labels.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ResourceFlavor
metadata:
  name: gpu-h100
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
---
# 2) Cluster-scoped quota pool. Members of cohort "research" can borrow/lend.
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohortName: research            # omit to make the queue standalone (no borrowing)
  namespaceSelector: {}           # which namespaces' LocalQueues may target this CQ
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: gpu-h100
          resources:
            - name: "cpu"
              nominalQuota: "100"
            - name: "memory"
              nominalQuota: 800Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 32          # team-a's guaranteed GPUs
              borrowingLimit: 16        # may borrow up to 16 more from the cohort
              lendingLimit: 8           # may lend up to 8 idle GPUs to the cohort
  preemption:
    reclaimWithinCohort: Any            # reclaim GPUs this CQ lent, from any priority
    borrowWithinCohort:
      policy: LowerPriority             # when borrowing, may preempt lower-prio peers
      maxPriorityThreshold: 100
    withinClusterQueue: LowerPriority   # within team-a, higher prio preempts lower
  flavorFungibility:
    whenCanBorrow: TryNextFlavor        # try another flavor before borrowing
    whenCanPreempt: TryNextFlavor       # try another flavor before preempting
---
# 3) Namespaced entry point. Jobs in ns "team-a" target this by name.
apiVersion: kueue.x-k8s.io/v1beta2
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: team-a
spec:
  clusterQueue: team-a

A second ClusterQueue sharing cohortName: research (its own template in Kueue ClusterQueue) forms the fair-share pool: each team is guaranteed its nominalQuota, idle GPUs (up to lendingLimit) flow to whoever has pending work (up to their borrowingLimit), and reclaimWithinCohort pulls lent capacity back when the owner needs it.
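The borrow/lend limits above reduce to simple arithmetic for a single resource in a single flavor. The following is a minimal sketch of that arithmetic using team-a's example values; the function names `cohort_ceiling` and `lendable` are hypothetical helpers, not Kueue commands, and the model ignores preemption and other cohort members' limits.

```shell
#!/bin/sh
# Simplified cohort arithmetic for one resource in one flavor.
# Function names are illustrative only; Kueue computes this internally.

# Highest usage a ClusterQueue can reach: nominalQuota + borrowingLimit.
# usage: cohort_ceiling <nominalQuota> <borrowingLimit>
cohort_ceiling() {
  echo $(( $1 + $2 ))
}

# Idle quota a ClusterQueue exposes to the cohort:
# min(nominalQuota - used, lendingLimit), never below zero.
# usage: lendable <nominalQuota> <used> <lendingLimit>
lendable() {
  nominal=$1; used=$2; lending_limit=$3
  idle=$(( nominal - used ))
  if [ "$idle" -lt 0 ]; then idle=0; fi
  if [ "$idle" -gt "$lending_limit" ]; then idle=$lending_limit; fi
  echo "$idle"
}

# team-a from the manifest: nominalQuota 32, borrowingLimit 16, lendingLimit 8.
echo "team-a ceiling: $(cohort_ceiling 32 16) GPUs"
echo "team-a lendable with 20 in use: $(lendable 32 20 8) GPUs"
echo "team-a lendable with 28 in use: $(lendable 32 28 8) GPUs"
```

With these values team-a can never hold more than 48 GPUs, and it offers at most 8 idle GPUs to its peers even when more than 8 are unused.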

A GPU Job opts in with the queue-name label and is created suspended: ⁵

apiVersion: batch/v1
kind: Job
metadata:
  name: train-smoke
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue   # selects the LocalQueue
spec:
  parallelism: 2
  completions: 2
  suspend: true                            # Kueue's webhook sets this automatically; shown here for clarity. It flips to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:25.05-py3
          command: ["nvidia-smi"]
          resources:
            requests:    { cpu: "8", memory: 64Gi, nvidia.com/gpu: 1 }
            limits:      { nvidia.com/gpu: 1 }
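To judge whether this Job fits the quota, it helps to total what it asks for. The sketch below assumes Kueue reserves quota for all `parallelism` pods of the Job at once, so the total per resource is the per-pod request multiplied by `parallelism`; `workload_total` is a hypothetical helper name, and memory is handled as a plain Gi integer for simplicity.

```shell
#!/bin/sh
# Total quota a Job asks for, assuming all `parallelism` pods are
# admitted together. Helper name is illustrative only.
# usage: workload_total <parallelism> <per_pod_amount>
workload_total() {
  echo $(( $1 * $2 ))
}

# train-smoke: parallelism 2; per pod 1 GPU, 8 CPU, 64Gi memory.
echo "nvidia.com/gpu: $(workload_total 2 1)"
echo "cpu: $(workload_total 2 8)"
echo "memory: $(workload_total 2 64)Gi"
```

Against team-a's nominalQuota (32 GPUs, 100 CPU, 800Gi) the smoke Job's 2 GPUs, 16 CPU, and 128Gi fit without borrowing.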

Configuration¶

| Object / field | Type | Meaning |
| --- | --- | --- |
| `ResourceFlavor.spec.nodeLabels` | map | Node labels a flavor selects; quota is counted per flavor. ² |
| `ResourceFlavor.spec.nodeTaints` / `tolerations` | list | Taints the flavor's nodes carry (a Workload must tolerate them to be assigned this flavor) / tolerations Kueue adds to admitted pods. ² |
| `ClusterQueue.spec.cohortName` | string | Cohort membership. Matching names share a borrow/lend pool; omit = standalone, no borrowing. ⁴ |
| `ClusterQueue.spec.namespaceSelector` | labelSelector | Which namespaces' LocalQueues may use this CQ. `{}` = all. ³ |
| `...resourceGroups[].coveredResources` | list | Resource names this group governs (e.g. `nvidia.com/gpu`). ³ |
| `...flavors[].resources[].nominalQuota` | Quantity | Guaranteed amount of that resource in that flavor. ³ |
| `...resources[].borrowingLimit` | Quantity | Max this CQ may borrow from the cohort above its nominal quota. ³ |
| `...resources[].lendingLimit` | Quantity | Max idle quota this CQ exposes to the cohort. ³ |
| `spec.preemption.reclaimWithinCohort` | enum | Reclaim lent quota from the cohort: `Never` \| `LowerPriority` \| `Any`. ² |
| `spec.preemption.borrowWithinCohort.policy` | enum | Preempt within cohort while borrowing: `Never` \| `LowerPriority`. ² |
| `spec.preemption.withinClusterQueue` | enum | Preempt inside this CQ: `Never` \| `LowerPriority` \| `LowerOrNewerEqualPriority`. ² |
| `spec.flavorFungibility.whenCanBorrow` / `whenCanPreempt` | enum | Try the next flavor before borrowing/preempting: `MayStopSearch` \| `TryNextFlavor`. ² |
| `LocalQueue.spec.clusterQueue` | string | The `ClusterQueue` this namespaced queue points at. ² |
| Job label `kueue.x-k8s.io/queue-name` | string | LocalQueue the Job is submitted to; absence means Kueue ignores the Job (default config). ⁵ |

Apply & verify¶

Install, then confirm the controller is up:

kubectl wait deploy/kueue-controller-manager -n kueue-system \
  --for=condition=available --timeout=5m
kubectl get pods -n kueue-system

Expected signal: the controller Deployment reaches Available, and kueue-controller-manager-… is Running / READY 1/1. ¹

Apply the quota objects and the smoke Job, then watch admission:

kubectl apply -f kueue-quota.yaml      # ResourceFlavor + ClusterQueue + LocalQueue
kubectl apply -f train-smoke.yaml      # the suspended Job above

# LocalQueue should show admitted/pending counts once a Job lands:
kubectl get localqueue gpu-queue -n team-a -o wide

# The Workload Kueue created for the Job, and its conditions:
kubectl get workloads.kueue.x-k8s.io -n team-a
kubectl describe workloads.kueue.x-k8s.io -n team-a   # every Workload in the namespace; no label selector needed

Expected signal (quota available): the Workload gains QuotaReserved then Admitted=True, an Admitted event is emitted by the kueue manager, and the Job's .spec.suspend flips to false: ⁵⁶

kubectl get job train-smoke -n team-a -o jsonpath='{.spec.suspend}'   # -> false once admitted
kubectl get pods -n team-a                                            # trainer pods schedule

Expected signal (over quota): submit a Job requesting more nvidia.com/gpu than nominalQuota + borrowingLimit (more than 32 + 16 = 48 with the values above) and it stays suspend: true. Its Workload never gains an Admitted=True condition and kubectl describe reports insufficient quota, which shows the gate holds rather than overcommitting. ⁶

Failure modes¶

Job runs immediately, ignoring quota. The kueue.x-k8s.io/queue-name label is missing (or on the wrong object), so Kueue never manages it. Default config only governs labelled Jobs. ⁵
Job stuck suspend: true forever. ClusterQueue quota (plus any cohort borrow) is below the request, the LocalQueue points at a non-existent or inactive ClusterQueue, or no flavor can be assigned because the Job's own nodeSelector/affinity conflicts with the ResourceFlavor's nodeLabels. kubectl describe workload names the reason. ⁶
nominalQuota exceeds allocatable. Kueue admits against quota numbers, not live capacity. If nominalQuota for nvidia.com/gpu is larger than what nodes actually expose, or the ResourceFlavor's nodeLabels match no nodes, Workloads are admitted and their pods then sit Pending at the scheduler. Keep quota ≤ real allocatable and check that the flavor's labels select real nodes. ⁶
Unexpected cross-team preemption. reclaimWithinCohort: Any or borrowWithinCohort lets a borrowing/reclaiming queue evict peers. Start with Never/LowerPriority and a priority scheme before enabling Any. ²
Time-slicing inflates GPU count. Kueue counts the advertised nvidia.com/gpu; with time-slicing replicas, quota maps to slices, not isolated GPUs. Size quota against the sharing model, not raw card count.
CRD version mismatch. Applying v1beta2 manifests to a cluster running an older Kueue that serves only v1beta1 is rejected by the API server, because the v1beta2 version is not served there; cohortName and the borrowWithinCohort shape also differ across versions. Match the manifest API version to the installed controller.
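Two of the failure modes above (quota larger than allocatable, and time-slicing inflating the GPU count) come down to arithmetic worth checking before applying quota. The sketch below is illustrative: the helper names `advertised_gpus` and `quota_check` are hypothetical, and the node with 8 physical cards and 4 time-slicing replicas is an assumed example, not a value from this page.

```shell
#!/bin/sh
# Sanity arithmetic for GPU quota sizing. Helper names are illustrative only.

# With time-slicing, a node advertises cards * replicas nvidia.com/gpu.
# usage: advertised_gpus <physical_cards> <replicas>
advertised_gpus() {
  echo $(( $1 * $2 ))
}

# Compare the nominalQuota summed over every ClusterQueue using a flavor
# with what the flavor's nodes actually expose.
# usage: quota_check <total_nominal_quota> <allocatable>
quota_check() {
  if [ "$1" -le "$2" ]; then
    echo "ok"
  else
    echo "over by $(( $1 - $2 ))"
  fi
}

# Assumed example: one node, 8 physical cards, 4 time-slicing replicas.
echo "advertised nvidia.com/gpu: $(advertised_gpus 8 4)"
echo "quota 32 vs allocatable 32: $(quota_check 32 32)"
echo "quota 32 vs allocatable 8: $(quota_check 32 8)"
```

In the assumed example a nominalQuota of 32 matches the advertised count but represents 32 slices of 8 cards, so a team "guaranteed" 32 GPUs is sharing 8 physical devices.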

References¶

Kueue installation (versions, manifest + Helm OCI, kubectl wait): https://kueue.sigs.k8s.io/docs/installation/
Kueue v1beta2 API reference (ResourceFlavor, ClusterQueue, LocalQueue, Workload, preemption enums): https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
ClusterQueue concept (resourceGroups, quotas, flavorFungibility): https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
Cohort concept (cohortName, borrow/lend, fair sharing): https://kueue.sigs.k8s.io/docs/concepts/cohort/
Run a Kubernetes Job with Kueue (queue-name label, suspend, Admitted): https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
Troubleshooting Jobs (suspend/admission diagnostics): https://kueue.sigs.k8s.io/docs/tasks/troubleshooting/troubleshooting_jobs/
Releases: https://github.com/kubernetes-sigs/kueue/releases

¹ Kueue installation: https://kueue.sigs.k8s.io/docs/installation/ (Kubernetes 1.29+, VERSION=v0.18.1, kubectl apply --server-side -f .../manifests.yaml, Helm oci://registry.k8s.io/kueue/charts/kueue --version=0.18.1, namespace kueue-system, kubectl wait deploy/kueue-controller-manager).
² Kueue v1beta2 API reference: https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta2/
³ ClusterQueue concept: https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
⁴ Cohort concept: https://kueue.sigs.k8s.io/docs/concepts/cohort/
⁵ Run a Kubernetes Job: https://kueue.sigs.k8s.io/docs/tasks/run/jobs/
⁶ Troubleshooting Jobs: https://kueue.sigs.k8s.io/docs/tasks/troubleshooting/troubleshooting_jobs/