Helm: Volcano gang scheduler

Scope: install Volcano via Helm, configure the scheduler, controller, and admission webhook, stand up queues, and use gang (minMember) scheduling to place distributed jobs all-or-nothing so that partial gangs never deadlock GPUs. The default scheduler keeps owning everything else. Pairs with Volcano Job.

These are reference templates based on the upstream Volcano Helm chart and CRDs. Pin chart and image versions, and in production apply via GitOps rather than running helm install by hand. Nothing here has been tested on hardware.

flowchart TB
  JOB["VolcanoJob / annotated pods"] --> PG["PodGroup (minMember=N)"]
  PG --> ENQ["enqueue: queue admits PodGroup"]
  ENQ --> GANG["gang plugin: place all N or none"]
  GANG -->|"all N fit"| RUN["Running"]
  GANG -->|"< N fit"| PEND["Pending (no pod bound)"]
  DEFSCHED["default kube-scheduler"] -.-> OTHER["everything without schedulerName: volcano"]

What it is

Volcano is a CNCF batch system that ships its own scheduler, a controller manager, and an admission webhook. The piece that matters for GPU clusters is the gang scheduler: a job is represented by a PodGroup with minMember: N, and the gang plugin binds pods only once at least N of them can be placed at the same time. If fewer than N fit, zero are bound; the workload sits Pending instead of grabbing some GPUs and blocking on a rendezvous that can never complete. The default scheduler causes exactly that failure on multi-pod training/inference jobs, because it binds pods one at a time with no notion of a group (Kubernetes for GPUs, hub).

Volcano runs alongside the default scheduler, not as a replacement. Only pods with spec.schedulerName: volcano (set directly, by a VolcanoJob, or by the admission webhook) are scheduled by Volcano; everything else stays with kube-scheduler.¹,⁴

Prerequisites

Kubernetes 1.12+ with CRD support (current Volcano tracks recent K8s; pin to a release whose support matrix covers your cluster).¹
A working GPU platform underneath: GPU Operator advertising nvidia.com/gpu, optionally MIG/time-slicing (hub). Volcano schedules the resource; it does not create it.
helm 3.x and cluster-admin (Volcano installs cluster-scoped CRDs, RBAC, and a mutating webhook; see RBAC for GPU Platform Operators).
Decide the namespace up front: the chart defaults to volcano-system.

Install

Pin the chart version. 1.12.2 is used below as a concrete, stable reference template; confirm the current release for your K8s version on the chart repo.²

helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update

helm install volcano volcano-sh/volcano \
  -n volcano-system --create-namespace \
  --version 1.12.2 \
  --set custom.scheduler_replicas=1 \
  --set custom.controller_replicas=1 \
  --set custom.admission_replicas=1

This deploys three Deployments: volcano-scheduler, volcano-controllers, and volcano-admission, plus the CRDs (PodGroup, Queue, Job, …) and a default Queue with weight: 1 that receives any PodGroup not assigned elsewhere.¹,⁵
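For the GitOps route recommended above, the same install can be declared rather than typed. The following is a minimal sketch that assumes Argo CD is the GitOps controller and runs in the argocd namespace; neither is stated by this guide, and the Application name and sync options are illustrative choices. The chart repo, chart version, and values mirror the helm install command.

```yaml
# Sketch only: assumes Argo CD (argoproj.io/v1alpha1) manages the cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: volcano                 # hypothetical Application name
  namespace: argocd             # assumption: Argo CD's own namespace
spec:
  project: default
  source:
    repoURL: https://volcano-sh.github.io/helm-charts
    chart: volcano
    targetRevision: 1.12.2      # same pinned chart version as above
    helm:
      releaseName: volcano
      values: |
        custom:
          scheduler_replicas: 1
          controller_replicas: 1
          admission_replicas: 1
  destination:
    server: https://kubernetes.default.svc
    namespace: volcano-system
  syncPolicy:
    syncOptions:
      - CreateNamespace=true    # equivalent of --create-namespace
      - ServerSideApply=true    # avoids the last-applied annotation size limit on large CRDs
```

Sync is left manual here so that a chart bump is an explicit, reviewed action. ServerSideApply is included because client-side apply stores the whole manifest in an annotation, which large CRDs can exceed; drop it if your Argo CD version does not support the option.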

Override the scheduler config (actions + plugins)

Scheduler behaviour is defined by volcano-scheduler.conf, stored in a ConfigMap, which lists the actions and the tiered plugins. Override it through custom.scheduler_config_override. The default is null (the chart ships its built-in config); set it explicitly when you need to tune gang, preemption, or binpack behaviour. Field names below are exact.³

# values-volcano.yaml  -> helm upgrade ... -f values-volcano.yaml
custom:
  scheduler_replicas: 2          # HA: leader-elected
  controller_replicas: 1
  admission_enable: true
  scheduler_config_override: |
    actions: "enqueue, allocate, backfill"
    tiers:
      - plugins:
          - name: priority
          - name: gang             # <- all-or-nothing for PodGroups
            enablePreemptable: false
          - name: conformance
      - plugins:
          - name: overcommit
          - name: drf              # dominant-resource fairness across jobs
            enablePreemptable: false
          - name: predicates       # node feasibility (taints, affinity, resources)
          - name: proportion       # enforce Queue weight/capability
          - name: nodeorder
          - name: binpack          # pack GPUs to reduce fragmentation

helm upgrade volcano volcano-sh/volcano -n volcano-system \
  --version 1.12.2 --reuse-values -f values-volcano.yaml

gang must stay enabled; it is the plugin that enforces minMember. The enqueue action admits a PodGroup to scheduling only when its minResources can plausibly be met, which keeps jobs at the head of the line from squatting on capacity they cannot fully use. Order of plugins within a tier is significant; keep predicates before proportion/binpack.³
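To make the enqueue gate concrete, here is a minimal sketch of a PodGroup that declares both the gang size and the aggregate resources that enqueue checks. The name train-gang and the 4 x 1 GPU sizing are assumptions for illustration; a reasonable choice for minResources is the sum of what the minMember pods request.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: train-gang            # hypothetical name
  namespace: default
spec:
  minMember: 4                # gang size enforced by the gang plugin
  minResources:               # aggregate checked by the enqueue action
    nvidia.com/gpu: "4"       # assumption: 4 pods x 1 GPU each
  queue: default
```

If minResources is larger than the cluster can ever supply, the PodGroup never leaves Pending, so keep it in step with the pod requests whenever those change.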

Stand up a tenant queue (optional)

The default queue works for a single tenant. For fair-share across teams, create explicit queues; proportion divides cluster resources by weight, and capability is a hard cap (security and multi-tenancy). For full multi-team quota with borrowing/preemption, layer Kueue on top.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: research
spec:
  weight: 4                 # relative share vs other queues
  reclaimable: true         # others may reclaim what this queue holds beyond its share
  capability:               # hard upper bound for this queue
    nvidia.com/gpu: "64"
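Weights only mean something relative to other queues. As a sketch, the manifest below adds a second, hypothetical production queue next to research and routes a PodGroup to research through spec.queue. The queue name, weight, GPU cap, and the PodGroup name are assumptions chosen for illustration.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: production            # hypothetical second tenant
spec:
  weight: 8                   # 8 vs research's 4: about twice the share under contention
  capability:
    nvidia.com/gpu: "96"      # assumed hard cap for this queue
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: research-gang         # hypothetical name
  namespace: default
spec:
  minMember: 2
  queue: research             # target the tenant queue instead of default
```

The default queue (weight 1) still exists and takes part in the split when it has demand, so the effective ratio between research and production holds only among the queues that actually have pending work.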

Configuration

Key Helm values and CRD fields. Helm keys are exact paths into the chart's values.yaml.³,⁴,⁵

| Key / field | Where | Meaning |
| --- | --- | --- |
| `custom.scheduler_replicas` | helm | Scheduler Deployment replicas; >1 uses leader election (HA). |
| `custom.controller_replicas` | helm | volcano-controllers replicas (PodGroup/Job lifecycle). |
| `custom.admission_enable` | helm | Enable the mutating/validating webhook (patches `schedulerName`, validates jobs). |
| `custom.scheduler_config_override` | helm | Full `volcano-scheduler.conf` (actions + tiers/plugins). |
| `custom.scheduler_log_level` | helm | Scheduler verbosity (klog `-v`). |
| `basic.scheduler_image_tag_version` | helm | Pin the scheduler image tag (overrides `basic.image_tag_version`). |
| `actions` | conf | Ordered scheduling actions, e.g. `enqueue, allocate, backfill`. |
| `tiers[].plugins[].name` | conf | Plugin id: `gang`, `priority`, `drf`, `proportion`, `predicates`, `binpack`, … |
| `gang` plugin | conf | Enforces `PodGroup.spec.minMember` (all-or-nothing). |
| `spec.minMember` | PodGroup | Minimum number of pods that must be placeable together before any bind. |
| `spec.minResources` | PodGroup | Aggregate resources gated by `enqueue` before admission. |
| `spec.queue` | PodGroup | Target Queue (must be `Open`); defaults to `default`. |
| `spec.priorityClassName` | PodGroup | Scheduling priority for the group. |
| `spec.weight` | Queue | Relative share for the `proportion` plugin. |
| `spec.capability` | Queue | Hard per-queue resource ceiling. |
| `spec.reclaimable` | Queue | Whether other queues may reclaim resources this queue holds beyond its share (default true). |
| `status.state` | Queue | Read-only: `Open` accepts PodGroups, `Closed` rejects new ones. Not a spec field; set it with `vcctl queue operate -a open/close -n <queue>`. |
| `spec.schedulerName: volcano` | Pod/Job | Routes the pod to Volcano instead of the default scheduler. |
| `scheduling.k8s.io/group-name` | pod annotation | Binds a plain pod to a named PodGroup. |

Apply & verify

1. Volcano is up.

kubectl get deploy -n volcano-system
# EXPECT: volcano-scheduler, volcano-controllers, volcano-admission all READY n/n
kubectl get pods -n volcano-system
# EXPECT: all pods Running; scheduler/controller/admission 1/1
kubectl get queue
# EXPECT: a 'default' queue, STATE Open

2. A gang schedules all-or-nothing. This VolcanoJob requests 2 GPU workers with minAvailable: 2. On a cluster with <2 free GPUs, nothing should bind; with >=2, both bind together. Adjust GPU count to your hardware.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-smoke
  namespace: default
spec:
  schedulerName: volcano
  minAvailable: 2          # gang: run only when both workers can start
  queue: default
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          schedulerName: volcano
          containers:
            - name: smi
              image: nvidia/cuda:13.0.0-base-ubuntu24.04
              command: ["bash", "-c", "nvidia-smi -L && sleep 30"]
              resources:
                limits:
                  nvidia.com/gpu: 1

kubectl apply -f gang-smoke.yaml

# Volcano auto-creates a PodGroup for the job (its name derives from the job name; recent releases append the job UID):
kubectl get podgroup
# EXPECT (enough GPUs): PHASE Running, MINMEMBER 2
# EXPECT (too few GPUs): PHASE Pending or Inqueue, and 0 worker pods Running

kubectl get pods -l volcano.sh/job-name=gang-smoke
# EXPECT: both worker pods Running together, or both Pending — never exactly one Running

The all-or-nothing signal is the key assertion: under GPU pressure you must never see one worker Running while the other is Pending. If you do, the gang plugin is not active; re-check scheduler_config_override.

3. Plain pods (no VolcanoJob). For a Deployment/Pod, create the PodGroup yourself, set schedulerName: volcano, and annotate the pod template so Volcano associates it.⁴,⁶

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: infer-gang
  namespace: default
spec:
  minMember: 4
  queue: default
---
apiVersion: v1
kind: Pod
metadata:
  name: infer-0
  namespace: default
  annotations:
    scheduling.k8s.io/group-name: infer-gang   # binds this pod to the PodGroup
spec:
  schedulerName: volcano
  containers:
    - name: app
      image: nvcr.io/nvidia/pytorch:25.05-py3
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu: 1

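For a Deployment of N replicas, set minMember: N on the PodGroup and put the same annotation and schedulerName in the pod template; all N bind together or none do. The Pod above is only one member of its gang: with minMember: 4 it stays Pending until at least four pods carrying the annotation can be placed together.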
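The Deployment variant can be sketched as follows. It reuses the infer-gang PodGroup (minMember: 4) defined above; the Deployment name, the app: infer label, and the sleep command are hypothetical placeholders rather than anything Volcano prescribes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: infer                 # hypothetical name
  namespace: default
spec:
  replicas: 4                 # keep equal to the PodGroup's minMember
  selector:
    matchLabels:
      app: infer
  template:
    metadata:
      labels:
        app: infer
      annotations:
        scheduling.k8s.io/group-name: infer-gang   # join the existing PodGroup
    spec:
      schedulerName: volcano  # without this the default scheduler takes the pods
      containers:
        - name: app
          image: nvcr.io/nvidia/pytorch:25.05-py3
          command: ["sleep", "infinity"]           # placeholder workload
          resources:
            limits:
              nvidia.com/gpu: 1
```

Scaling replicas below minMember leaves the gang unable to form, so change the two values together.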

Failure modes

gang plugin disabled / wrong config. Pods bind one at a time; a 4-pod job grabs 3 GPUs and deadlocks on rendezvous. Symptom: PodGroup Unknown, partial workers Running. Fix: ensure gang is in the tiers and the override actually applied (kubectl get cm -n volcano-system).
schedulerName missing. The default scheduler grabs the pod and ignores the PodGroup entirely, so gang semantics never apply. Every pod (and the VolcanoJob/task template) must carry schedulerName: volcano.
minMember / minAvailable mismatch with replicas. If the value is set above the real pod count, the job never starts; if it is set to 1 on a true gang, partial placement returns. Match it to the number of workers that must rendezvous.
Queue Closed or over capability. PodGroup stays Pending/Inqueue and never admits. Check kubectl get queue STATE and the capability ceiling vs the job's request.
minResources unsatisfiable. enqueue keeps the PodGroup out of scheduling indefinitely (correct, but easy to misread as a hang). Compare minResources to allocatable cluster GPUs.
Admission webhook unavailable. If volcano-admission is down and you rely on it to inject schedulerName, pods silently fall back to the default scheduler. Treat the webhook as part of the critical path or set schedulerName explicitly.
Two schedulers, one node, racing. Volcano and the default scheduler can both consider the same nodes; this is expected (they own disjoint pods by schedulerName), but mixing a single workload across both schedulers is unsupported. Keep a gang entirely on Volcano.
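For the minAvailable mismatch case, a multi-task job is where the arithmetic most often goes wrong. The sketch below assumes a hypothetical job with one CPU-only coordinator and three GPU workers that must all rendezvous, so minAvailable is the sum of the task replicas (1 + 3 = 4). The job and task names and the commands are illustrative only.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-multitask        # hypothetical name
  namespace: default
spec:
  schedulerName: volcano
  queue: default
  minAvailable: 4             # 1 coordinator + 3 workers; not 3, and not 5
  tasks:
    - name: coordinator
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          schedulerName: volcano
          containers:
            - name: main
              image: nvidia/cuda:13.0.0-base-ubuntu24.04
              command: ["bash", "-c", "sleep 30"]    # placeholder, no GPU requested
    - name: worker
      replicas: 3
      template:
        spec:
          restartPolicy: Never
          schedulerName: volcano
          containers:
            - name: main
              image: nvidia/cuda:13.0.0-base-ubuntu24.04
              command: ["bash", "-c", "nvidia-smi -L && sleep 30"]
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Setting minAvailable to 3 here would let the workers start without the coordinator, and 5 would keep the job Pending forever, which are the two halves of the mismatch described in the list.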

References

Installation (Helm): https://volcano.sh/en/docs/installation/
Volcano Helm chart values.yaml: https://github.com/volcano-sh/volcano/blob/master/installer/helm/chart/volcano/values.yaml
Helm charts repo / releases: https://github.com/volcano-sh/helm-charts/releases
PodGroup CRD: https://volcano.sh/en/docs/v1-11-0/podgroup/
Queue CRD: https://volcano.sh/en/docs/queue/
VolcanoJob CRD: https://volcano.sh/en/docs/vcjob/
Scheduler configuration (actions/plugins): https://volcano.sh/en/docs/scheduler_introduction/
Kubeflow Spark + Volcano (group-name annotation, schedulerName): https://www.kubeflow.org/docs/components/spark-operator/user-guide/volcano-integration/

¹ Installation | Volcano: https://volcano.sh/en/docs/installation/
² volcano-sh/helm-charts releases: https://github.com/volcano-sh/helm-charts/releases
³ volcano-sh/volcano values.yaml: https://github.com/volcano-sh/volcano/blob/master/installer/helm/chart/volcano/values.yaml
⁴ PodGroup | Volcano: https://volcano.sh/en/docs/v1-11-0/podgroup/
⁵ Queue | Volcano: https://volcano.sh/en/docs/queue/
⁶ Volcano integration | Kubeflow Spark Operator: https://www.kubeflow.org/docs/components/spark-operator/user-guide/volcano-integration/