Helm: Volcano gang scheduler¶
Scope: install Volcano via Helm; configure the scheduler/controller/admission, stand up queues, and make gang (minMember) scheduling place distributed jobs all-or-nothing so partial gangs never deadlock GPUs, while the default scheduler keeps owning everything else. Pairs with Volcano Job.
These are reference templates derived from the upstream Volcano Helm chart and CRDs. Pin chart and image versions, and in production apply via GitOps rather than running helm install by hand. Nothing here has been tested on hardware.
flowchart TB
JOB["VolcanoJob / annotated pods"] --> PG["PodGroup (minMember=N)"]
PG --> ENQ["enqueue: queue admits PodGroup"]
ENQ --> GANG["gang plugin: place all N or none"]
GANG -->|"all N fit"| RUN["Running"]
GANG -->|"< N fit"| PEND["Pending (no pod bound)"]
DEFSCHED["default kube-scheduler"] -.-> OTHER["everything without schedulerName: volcano"]
What it is¶
Volcano is a CNCF batch system that ships its own scheduler, a controller manager, and an admission webhook. The piece that matters for GPU clusters is the gang scheduler: a job is represented by a PodGroup with minMember: N, and the gang plugin only binds pods once at least N of them can be placed at the same time. If fewer than N fit, zero are bound; the workload sits Pending instead of grabbing some GPUs and blocking on a rendezvous that can never complete. That is the failure the default scheduler causes on multi-pod training/inference jobs (Kubernetes for GPUs, hub).
Volcano runs alongside the default scheduler, not as a replacement. Only pods with spec.schedulerName: volcano (set directly, by a VolcanoJob, or by the admission webhook) are scheduled by Volcano; everything else stays with kube-scheduler. [1][4]
Prerequisites¶
- Kubernetes 1.12+ with CRD support (current Volcano tracks recent K8s; pin to a release whose support matrix covers your cluster). [1]
- A working GPU platform underneath: GPU Operator advertising nvidia.com/gpu, optionally MIG/time-slicing (hub). Volcano schedules the resource; it does not create it.
- helm 3.x and cluster-admin (Volcano installs cluster-scoped CRDs, RBAC, and a mutating webhook; see RBAC for GPU Platform Operators).
- Decide the namespace up front: the chart defaults to volcano-system.
Install¶
Pin the chart version. The commands below use 1.12.2 as a concrete reference; confirm the current release for your Kubernetes version on the chart repo. [2]
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano \
-n volcano-system --create-namespace \
--version 1.12.2 \
--set custom.scheduler_replicas=1 \
--set custom.controller_replicas=1 \
--set custom.admission_replicas=1
This creates three Deployments (volcano-scheduler, volcano-controllers, and volcano-admission), the CRDs (PodGroup, Queue, Job, …), and a default Queue with weight: 1 that receives any PodGroup not assigned elsewhere. [1][5]
Override the scheduler config (actions + plugins)¶
Scheduler behaviour is defined by volcano-scheduler.conf, stored in a ConfigMap, which lists the actions and the tiered plugins. Override it through custom.scheduler_config_override. The default is null (the chart ships its built-in config); set it explicitly when you need to tune gang/preemption/binpack. Field names below are exact. [3]
# values-volcano.yaml -> helm upgrade ... -f values-volcano.yaml
custom:
  scheduler_replicas: 2        # HA: leader-elected
  controller_replicas: 1
  admission_enable: true
  scheduler_config_override: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang             # <- all-or-nothing for PodGroups
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf              # dominant-resource fairness between jobs
        enablePreemptable: false
      - name: predicates       # node feasibility (taints, affinity, resources)
      - name: proportion       # enforce Queue weight/capability
      - name: nodeorder
      - name: binpack          # pack GPUs to reduce fragmentation
helm upgrade volcano volcano-sh/volcano -n volcano-system \
--version 1.12.2 --reuse-values -f values-volcano.yaml
The gang plugin must stay enabled: it is what enforces minMember. The enqueue action admits a PodGroup to scheduling only when its minResources can plausibly be met, which prevents head-of-line jobs from squatting. The order of plugins within a tier is significant; keep predicates before proportion/binpack. [3]
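Because a missing gang entry degrades quietly to pod-at-a-time binding, it can help to check the rendered config mechanically. The following is a minimal sketch, not part of the chart: has_gang_plugin is a hypothetical helper that reads a volcano-scheduler.conf on stdin and succeeds only if a gang plugin entry is listed. The ConfigMap name and data key in the commented usage line are assumptions for a release named volcano; confirm them with kubectl get cm -n volcano-system before relying on them.

```shell
# has_gang_plugin: hypothetical helper; reads a volcano-scheduler.conf on stdin
# and exits 0 only if a "- name: gang" plugin entry is present.
has_gang_plugin() {
  grep -Eq '^[[:space:]]*-[[:space:]]*name:[[:space:]]*gang[[:space:]]*(#.*)?$'
}

# Sample config mirroring the first tier of the override above (no cluster needed).
sample_conf='actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang   # all-or-nothing
  - name: conformance'

if printf '%s\n' "$sample_conf" | has_gang_plugin; then
  echo "gang plugin present"
else
  echo "gang plugin MISSING"
fi

# Against a live cluster (ASSUMED ConfigMap name and data key; verify first):
# kubectl get cm volcano-scheduler-configmap -n volcano-system \
#   -o jsonpath='{.data.volcano-scheduler\.conf}' | has_gang_plugin
```

The check is purely textual, so it confirms that the plugin is listed, not that the scheduler has reloaded the config.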
Stand up a tenant queue (optional)¶
The default queue works for a single tenant. For fair-share across teams, create explicit queues; proportion divides cluster resources by weight, and capability is a hard cap (security and multi-tenancy). For full multi-team quota with borrowing/preemption, layer Kueue on top.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 4              # relative share vs other queues
  reclaimable: true      # resources used beyond this queue's share can be reclaimed by other queues
  capability:            # hard upper bound for this queue
    nvidia.com/gpu: "64"
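As a rough worked example of how weight translates into share, assume a hypothetical cluster with 80 allocatable GPUs and two saturated queues: default (weight 1) and research (weight 4). The sketch below computes only the simple weight-proportional split. The proportion plugin's real calculation also accounts for what each queue actually requests and for capability, so treat the numbers as intuition rather than the scheduler's exact output.

```shell
# Simplified weight-proportional split (ASSUMPTION: every queue is saturated
# and no capability cap binds). Cluster size and weights are hypothetical.
total_gpus=80
default_weight=1
research_weight=4

weight_sum=$((default_weight + research_weight))
default_share=$((total_gpus * default_weight / weight_sum))
research_share=$((total_gpus * research_weight / weight_sum))

echo "default:  ${default_share} GPUs"
echo "research: ${research_share} GPUs"
```

With these numbers the research share works out to 64 GPUs, which coincides with the capability ceiling in the Queue above; raising the weight further would not let the queue exceed that cap.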
Configuration¶
Key Helm values and CRD fields. Helm keys are exact paths into the chart's values.yaml. [3][4][5]
| Key / field | Where | Meaning |
|---|---|---|
| custom.scheduler_replicas | helm | Scheduler Deployment replicas; >1 uses leader election (HA). |
| custom.controller_replicas | helm | volcano-controllers replicas (PodGroup/Job lifecycle). |
| custom.admission_enable | helm | Enable the mutating/validating webhook (patches schedulerName, validates jobs). |
| custom.scheduler_config_override | helm | Full volcano-scheduler.conf (actions + tiers/plugins). |
| custom.scheduler_log_level | helm | Scheduler verbosity (klog -v). |
| basic.scheduler_image_tag_version | helm | Pin the scheduler image tag (overrides basic.image_tag_version). |
| actions | conf | Ordered scheduling actions, e.g. enqueue, allocate, backfill. |
| tiers[].plugins[].name | conf | Plugin id: gang, priority, drf, proportion, predicates, binpack, … |
| gang plugin | conf | Enforces PodGroup.spec.minMember: all-or-nothing. |
| spec.minMember | PodGroup | Min pods that must be placeable together before any bind. |
| spec.minResources | PodGroup | Aggregate resources gated by enqueue before admission. |
| spec.queue | PodGroup | Target Queue (must be Open); defaults to default. |
| spec.priorityClassName | PodGroup | Scheduling priority for the group. |
| spec.weight | Queue | Relative share for the proportion plugin. |
| spec.capability | Queue | Hard per-queue resource ceiling. |
| spec.reclaimable | Queue | Whether other queues may reclaim resources this queue uses beyond its share (default true). |
| status.state | Queue | Read-only: Open accepts PodGroups, Closed rejects new ones. Not a spec field; set it with vcctl queue operate -a open/close -n <queue>. |
| spec.schedulerName: volcano | Pod/Job | Routes the pod to Volcano instead of the default scheduler. |
| scheduling.k8s.io/group-name | pod annotation | Binds a plain pod to a named PodGroup. |
Apply & verify¶
1. Volcano is up.
kubectl get deploy -n volcano-system
# EXPECT: volcano-scheduler, volcano-controllers, volcano-admission all READY n/n
kubectl get pods -n volcano-system
# EXPECT: all pods Running; scheduler/controller/admission 1/1
kubectl get queue
# EXPECT: a 'default' queue, STATE Open
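Step 1 can also be scripted so that a pipeline blocks until the control plane is ready. This is a minimal sketch: volcano_ready is a hypothetical helper, the 120s timeout is an arbitrary choice, and the Deployment names are the three listed in the install section.

```shell
# volcano_ready: hypothetical helper; waits for the three Volcano Deployments
# to finish rolling out. The namespace argument defaults to volcano-system.
volcano_ready() {
  ns="${1:-volcano-system}"
  for d in volcano-scheduler volcano-controllers volcano-admission; do
    # rollout status exits non-zero if the Deployment is not ready in time
    kubectl -n "$ns" rollout status "deployment/$d" --timeout=120s || return 1
  done
  echo "volcano control plane ready in $ns"
}

# Usage (requires cluster access):
# volcano_ready               # chart default namespace
# volcano_ready my-namespace  # custom namespace
```

The function returns non-zero on the first Deployment that fails to roll out, so it can gate later steps such as applying the smoke-test job.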
2. A gang schedules all-or-nothing. This VolcanoJob requests 2 GPU workers with minAvailable: 2. On a cluster with <2 free GPUs, nothing should bind; with >=2, both bind together. Adjust GPU count to your hardware.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-smoke
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 2          # gang: run only when both workers can start
  queue: default
  tasks:
  - name: worker
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        schedulerName: volcano
        containers:
        - name: smi
          image: nvidia/cuda:13.0.0-base-ubuntu24.04
          command: ["bash", "-c", "nvidia-smi -L && sleep 30"]
          resources:
            limits:
              nvidia.com/gpu: 1
kubectl apply -f gang-smoke.yaml
# Volcano auto-creates a PodGroup named after the job:
kubectl get podgroup
# EXPECT (enough GPUs): PHASE Running, MINMEMBER 2
# EXPECT (too few GPUs): PHASE Pending or Inqueue, and 0 worker pods Running
kubectl get pods -l volcano.sh/job-name=gang-smoke
# EXPECT: both worker pods Running together, or both Pending — never exactly one Running
The all-or-nothing signal is the key assertion: under GPU pressure you must never see one worker Running while the other is Pending. If you do, the gang plugin is not active; re-check scheduler_config_override.
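The assertion can be automated. The sketch below checks binding (whether spec.nodeName is set) rather than phase, since a pod that is already bound but still pulling its image also reports Pending and would otherwise look like a partial gang. gang_binding is a hypothetical helper that reads "pod-name node-name" lines on stdin; the sample input is made up.

```shell
# gang_binding: hypothetical helper; reads "<pod> <node>" per line on stdin
# (node empty when the pod is unbound) and fails on a partially bound gang.
gang_binding() {
  total=0
  bound=0
  while read -r pod node || [ -n "$pod" ]; do
    [ -z "$pod" ] && continue
    total=$((total + 1))
    if [ -n "$node" ]; then
      bound=$((bound + 1))
    fi
  done
  if [ "$bound" -eq 0 ] || [ "$bound" -eq "$total" ]; then
    echo "ok: $bound/$total bound"
  else
    echo "partial gang: $bound/$total bound"
    return 1
  fi
}

# Made-up sample (no cluster needed): one worker bound, one not.
printf 'gang-smoke-worker-0 node-a\ngang-smoke-worker-1 \n' | gang_binding \
  || echo "gang plugin is likely not active"

# Against a live cluster:
# kubectl get pods -l volcano.sh/job-name=gang-smoke \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' \
#   | gang_binding
```

A non-zero exit is the signal to re-check the scheduler config; a zero exit with 0 pods bound simply means the gang is still waiting for capacity.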
3. Plain pods (no VolcanoJob). For a Deployment/Pod, create the PodGroup yourself, set schedulerName: volcano, and annotate the pod template so Volcano associates the pods with that PodGroup. [4][6]
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: infer-gang
  namespace: default
spec:
  minMember: 4
  queue: default
---
apiVersion: v1
kind: Pod
metadata:
  name: infer-0
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: infer-gang   # binds this pod to the PodGroup
spec:
  schedulerName: volcano
  containers:
  - name: app
    image: nvcr.io/nvidia/pytorch:25.05-py3
    command: ["sleep", "120"]
    resources:
      limits:
        nvidia.com/gpu: 1
For a Deployment of N replicas, set minMember: N on the PodGroup and put the same annotation + schedulerName in the pod template; all N bind together or none do.
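One way to keep minMember and the replica count from drifting apart is to render both from a single value. The sketch below is illustrative only: render_gang_deployment is a hypothetical helper, the app label and the sleep infinity command are placeholders, and the image and GPU limit are copied from the pod above.

```shell
# render_gang_deployment: hypothetical helper; prints a PodGroup plus a
# Deployment whose replica count equals minMember. Args: <name> <replicas>.
render_gang_deployment() {
  name="$1"
  n="$2"
  cat <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ${name}
  namespace: default
spec:
  minMember: ${n}
  queue: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
  namespace: default
spec:
  replicas: ${n}
  selector:
    matchLabels:
      app: ${name}            # placeholder label
  template:
    metadata:
      labels:
        app: ${name}
      annotations:
        scheduling.k8s.io/group-name: ${name}
    spec:
      schedulerName: volcano
      containers:
      - name: app
        image: nvcr.io/nvidia/pytorch:25.05-py3
        command: ["sleep", "infinity"]   # placeholder workload
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
}

render_gang_deployment infer-gang 4

# To apply (requires cluster access):
# render_gang_deployment infer-gang 4 | kubectl apply -f -
```

Scaling the Deployment later without updating the PodGroup reintroduces the mismatch, so re-render both objects together.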
Failure modes¶
- gang plugin disabled / wrong config. Pods bind one at a time; a 4-pod job grabs 3 GPUs and deadlocks on rendezvous. Symptom: PodGroup Unknown, partial workers Running. Fix: ensure gang is in the tiers and the override actually applied (kubectl get cm -n volcano-system).
- schedulerName missing. The default scheduler grabs the pod and ignores the PodGroup entirely, so gang semantics never apply. Every pod (and the VolcanoJob/task template) must carry schedulerName: volcano.
- minMember/minAvailable mismatch with replicas. Set above the real pod count and the job never starts; set to 1 on a true gang and you reintroduce partial placement. Match it to the workers that must rendezvous.
- Queue Closed or over capability. PodGroup stays Pending/Inqueue and never admits. Check kubectl get queue STATE and the capability ceiling vs the job's request.
- minResources unsatisfiable. enqueue keeps the PodGroup out of scheduling indefinitely (correct, but easy to misread as a hang). Compare minResources to allocatable cluster GPUs.
- Admission webhook unavailable. If volcano-admission is down and you rely on it to inject schedulerName, pods silently fall back to the default scheduler. Treat the webhook as part of the critical path or set schedulerName explicitly.
- Two schedulers, one node, racing. Volcano and the default scheduler can both consider the same nodes; this is expected (they own disjoint pods by schedulerName), but mixing a single workload across both schedulers is unsupported. Keep a gang entirely on Volcano.
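The minMember/minAvailable mismatch is easy to catch before a manifest is applied. A minimal sketch, assuming you already know the gang threshold and the number of pods the workload creates; gang_size_check is a hypothetical helper and its messages are illustrative.

```shell
# gang_size_check: hypothetical helper; compares a gang threshold
# (minMember / minAvailable) with the number of pods the workload creates.
gang_size_check() {
  min="$1"
  pods="$2"
  if [ "$min" -gt "$pods" ]; then
    echo "never starts: threshold $min exceeds $pods pods"
    return 1
  elif [ "$min" -lt "$pods" ]; then
    echo "threshold below pod count: binding may proceed with $min of $pods pods"
  else
    echo "ok: all $pods pods bind together"
  fi
}

gang_size_check 2 2          # matches the gang-smoke job
gang_size_check 1 4          # partial placement is possible
gang_size_check 5 4 || true  # can never be satisfied
```

A threshold below the pod count is not always a mistake, so only the unsatisfiable case returns non-zero.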
References¶
- Installation (Helm): https://volcano.sh/en/docs/installation/
- Volcano Helm chart values.yaml: https://github.com/volcano-sh/volcano/blob/master/installer/helm/chart/volcano/values.yaml
- Helm charts repo / releases: https://github.com/volcano-sh/helm-charts/releases
- PodGroup CRD: https://volcano.sh/en/docs/v1-11-0/podgroup/
- Queue CRD: https://volcano.sh/en/docs/queue/
- VolcanoJob CRD: https://volcano.sh/en/docs/vcjob/
- Scheduler configuration (actions/plugins): https://volcano.sh/en/docs/scheduler_introduction/
- Kubeflow Spark + Volcano (group-name annotation, schedulerName): https://www.kubeflow.org/docs/components/spark-operator/user-guide/volcano-integration/
Related: Helm/GPU platform hub · Volcano Job · Kueue · Kubernetes for GPUs · Security & multi-tenancy · Glossary
- Installation | Volcano: https://volcano.sh/en/docs/installation/
- volcano-sh/helm-charts releases: https://github.com/volcano-sh/helm-charts/releases
- volcano-sh/volcano values.yaml: https://github.com/volcano-sh/volcano/blob/master/installer/helm/chart/volcano/values.yaml
- PodGroup | Volcano: https://volcano.sh/en/docs/v1-11-0/podgroup/
- Volcano integration | Kubeflow Spark Operator: https://www.kubeflow.org/docs/components/spark-operator/user-guide/volcano-integration/