Manifest: Volcano job¶
Scope: a Volcano Job (batch.volcano.sh/v1alpha1) that gang-schedules a multi-pod GPU training run (minAvailable, multiple tasks, schedulerName: volcano) so every worker lands at once or none do. Reference templates from upstream Volcano docs; not hardware-tested. Pairs with Volcano Gang Scheduler, which installs the controller and scheduler this CRD depends on.
Apply via GitOps and pin every image. The Volcano scheduler must be installed first (Volcano Gang Scheduler); without it these pods sit Pending forever.
flowchart TB
VCJOB["VolcanoJob (minAvailable=N)"] --> PG["PodGroup (auto-created)"]
PG --> GANG{"all N schedulable?"}
GANG -->|"yes"| BIND["bind all pods together"]
GANG -->|"no"| WAIT["hold; nothing binds"]
BIND --> RUN["torchrun / mpirun across pods"]
What it is¶
A Volcano Job is a CRD (kind: Job, group batch.volcano.sh, short name vcjob) that wraps one or more tasks, each a pod template with its own replicas. The Volcano controller compiles the Job into a PodGroup and the Volcano scheduler treats minAvailable as a gang gate: it binds pods only when at least minAvailable of them can be placed simultaneously. This is the all-or-nothing semantics a distributed torchrun/MPI run needs: a half-placed job is a deadlocked job holding idle GPUs.
Two ergonomics matter for GPU training:
- plugins (map[string][]string). The svc plugin creates a headless Service + per-pod DNS so workers find rank 0; the pytorch and mpi plugins layer on top of svc and inject the framework env (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK for PyTorch; MPI_HOST for MPI), so you write less YAML.[2][3][4]
- policies are lifecycle rules (e.g. event: TaskCompleted -> action: CompleteJob) so the Job terminates cleanly when rank 0 exits instead of leaving workers running.[1]
GPUs are requested the normal way (nvidia.com/gpu limits) inside each task's pod template; Volcano only governs when the group binds, not what a pod consumes.
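For orientation, the PodGroup that the controller derives from a Job with minAvailable: 3 looks roughly like the sketch below. You never author this object yourself. The apiVersion and the minMember/queue field names follow the scheduling.volcano.sh/v1beta1 PodGroup API; the object name is a placeholder, and fields the controller also sets (owner references, minResources) are omitted, so treat the exact shape as an assumption that varies by Volcano version.

```yaml
# Illustrative only: Volcano creates this object for you; do not apply it by hand.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: torchrun-gang-<suffix>   # placeholder; the controller picks the real name
  namespace: ml
spec:
  minMember: 3                   # mirrors the Job's spec.minAvailable (the gang gate)
  queue: default                 # mirrors the Job's spec.queue
status:
  phase: Running                 # stays Pending until the whole gang can be placed
  running: 3                     # pods currently running in the group
```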
Prerequisites¶
- Volcano installed (controllers + scheduler + CRDs). See Volcano Gang Scheduler. Confirm CRDs exist: kubectl get crd jobs.batch.volcano.sh podgroups.scheduling.volcano.sh
- GPU Operator advertising nvidia.com/gpu on nodes (GPU Operator ClusterPolicy).
- A Volcano Queue for the job (the default queue is created by the Helm install; custom quota lives in Kueue ClusterQueue if you front Volcano with Kueue).
- For multi-node NCCL over RDMA, an RDMA interface in-pod (NicClusterPolicy); otherwise NCCL falls back to TCP.
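If the default queue is not enough, a dedicated Queue can be declared along these lines. This is a minimal sketch, not taken from the upstream templates: the queue name ml-training and the 8-GPU cap are hypothetical, and weight, capability and reclaimable are the scheduling.volcano.sh/v1beta1 Queue fields as commonly documented, so check them against your installed Volcano version.

```yaml
# Queue is cluster-scoped, so there is no metadata.namespace.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training          # hypothetical name
spec:
  weight: 1                  # relative share when several queues compete
  reclaimable: true          # resources used beyond the queue's share may be reclaimed
  capability:                # upper bound on what jobs in this queue may consume
    nvidia.com/gpu: 8        # assumed value; size it to your cluster
```

A Job would then select it with spec.queue: ml-training instead of default.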
The manifest¶
Variant A: PyTorch torchrun (gang-scheduled, 1 master + 2 workers, 1 GPU each)¶
The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and force-enables svc; the default port is 23456.[2] minAvailable: 3 gates the gang on all three pods.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: torchrun-gang
  namespace: ml
spec:
  minAvailable: 3               # gang gate: all 3 pods or none
  schedulerName: volcano        # MUST be volcano for gang semantics
  queue: default
  maxRetry: 3
  plugins:
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - name: master              # RANK 0; torchrun rendezvous endpoint
      replicas: 1
      policies:
        - event: TaskCompleted  # when rank 0 finishes, end the whole Job
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: master
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a tested tag
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1
    - name: worker              # RANK 1..N
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:25.05-py3
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1
$(RANK), $(MASTER_ADDR), $(MASTER_PORT) are the env vars the pytorch plugin injects; Kubernetes $(VAR) expansion in args substitutes them at pod start.[2] WORLD_SIZE is injected too. Replace train.py and the image with yours; the plugin sets up rendezvous, not your code.
Variant B: MPI (mpirun from a master over SSH to workers)¶
The mpi plugin force-enables svc + ssh (password-free login) and exposes MPI_HOST (worker DNS names) for mpirun --host.[3] The master starts sshd, then runs mpiexec; workers run sshd -D.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-gang
  namespace: ml
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: default
  plugins:
    mpi: ["--master=mpimaster", "--worker=mpiworker", "--port=22"]
  tasks:
    - name: mpimaster
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.3   # replace with your CUDA+MPI image
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world;
    - name: mpiworker
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.3
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
The volcanosh/example-mpi:0.0.3 image is a CPU mpi_hello_world proof from upstream;[3] for GPU work swap in your CUDA-aware MPI image, add nvidia.com/gpu limits to both tasks, and point -np at your total rank count.
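As a rough sketch of that swap, the worker task might look like the fragment below. The image reference is a placeholder rather than a real published image, and one GPU per pod is an assumed value; apply the same resources stanza to the mpimaster task and adjust -np in its mpiexec line to match your rank count.

```yaml
# Fragment of spec.tasks[] in the mpi-gang Job; not a complete manifest.
- name: mpiworker
  replicas: 2
  template:
    spec:
      schedulerName: volcano
      restartPolicy: OnFailure
      containers:
        - name: mpiworker
          image: registry.example.com/ml/cuda-mpi:1.0.0   # hypothetical; pin your own CUDA-aware MPI image
          workingDir: /home
          command:
            - /bin/sh
            - -c
            - |
              mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
          resources:
            limits:
              nvidia.com/gpu: 1                           # assumed: one GPU per worker pod
```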
Configuration¶
| Field | Path | Meaning | Notes |
|---|---|---|---|
| apiVersion | top | batch.volcano.sh/v1alpha1 | the only Job API version; alpha-stage by name, not GA [1] |
| kind | top | Job | short name vcjob |
| schedulerName | spec.schedulerName | scheduler for the Job | must be volcano for gang semantics; also set it per-task template [1] |
| minAvailable | spec.minAvailable | min pods that must place together | the gang gate; defaults to sum of task replicas [1] |
| tasks[].name | spec.tasks | task/role identifier | referenced by pytorch/mpi plugin --master/--worker [2][3] |
| tasks[].replicas | spec.tasks | pods for this task | total pods = sum across tasks |
| tasks[].minAvailable | spec.tasks | per-task min (optional, *int32) | for per-role gang gating; omit to use Job-level minAvailable [5] |
| tasks[].template | spec.tasks | v1.PodTemplateSpec | standard pod spec; put nvidia.com/gpu limits here |
| tasks[].policies | spec.tasks | lifecycle rules | e.g. event: TaskCompleted -> action: CompleteJob [1] |
| plugins | spec.plugins | map[string][]string | svc, env, ssh, pytorch, mpi [1][2][3] |
| policies | spec.policies | Job-level lifecycle | e.g. event: PodEvicted -> action: RestartJob [1] |
| queue | spec.queue | Volcano queue | defaults to default [1] |
| maxRetry | spec.maxRetry | Job retry budget | int32 |
| priorityClassName | spec.priorityClassName | preemption priority | standard K8s PriorityClass [5] |
| ttlSecondsAfterFinished | spec.ttlSecondsAfterFinished | GC delay after finish | *int32; auto-clean completed jobs [5] |
| minSuccess | spec.minSuccess | pods that must succeed for Job success | *int32, optional [5] |
PyTorch plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 23456); optional --wait-master-enabled, --wait-master-timeout, --wait-master-image.[2]
MPI plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 22).[3]
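A sketch of how the optional fields from the table might combine on the PyTorch variant. The field names come from the batch v1alpha1 API cited in this section; the values, and the PriorityClass name training-high, are hypothetical. Task templates are omitted for brevity, so this is a fragment to merge into the manifest rather than something to apply as-is.

```yaml
# Fragment of a batch.volcano.sh/v1alpha1 Job spec; values are illustrative.
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: default
  priorityClassName: training-high   # hypothetical; the PriorityClass must already exist
  ttlSecondsAfterFinished: 3600      # assumed value: clean up one hour after the Job finishes
  tasks:
    - name: master
      replicas: 1
      minAvailable: 1                # per-task gate: the master must be placeable
    - name: worker
      replicas: 2
      minAvailable: 2                # per-task gate: both workers must be placeable
```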
Apply & verify¶
kubectl apply -f torchrun-gang.yaml
# 1. The Job exists under the Volcano API group:
kubectl get jobs.batch.volcano.sh -n ml # or: kubectl get vcjob -n ml
# 2. The auto-created PodGroup carries the gang gate:
kubectl get podgroup -n ml -o wide # short name: pg
# 3. All pods of the job (label set by Volcano):
kubectl get pod -n ml -l volcano.sh/job-name=torchrun-gang -o wide
Expected signals:
- kubectl get vcjob: STATUS moves Pending -> Running -> Completed. A healthy gang shows Running with MIN matching your minAvailable.[6]
- kubectl get pg shows phase: Running and running: 3 once the gang binds; the count must reach minAvailable.[6]
- All three pods reach Running together (same scheduling cycle). If you see some Running and others stuck Pending, the gang did not engage; check schedulerName.
- Rank-0 logs show the rendezvous: kubectl logs -n ml -l volcano.sh/job-name=torchrun-gang,volcano.sh/task-spec=master. With NCCL_DEBUG=INFO, look for NCCL INFO ... comm ... nranks 3; over RDMA, [GDRDMA] (NicClusterPolicy).
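NCCL only prints its NCCL INFO lines when NCCL_DEBUG is set in the container environment, which the manifests in this page do not do. A minimal fragment, assuming you add it under each container in both task templates (the placement is the assumption here; NCCL_DEBUG itself is a standard NCCL variable):

```yaml
# Fragment of spec.tasks[].template.spec.containers[]; add to master and worker alike.
env:
  - name: NCCL_DEBUG
    value: "INFO"     # verbose NCCL initialisation logs; remove once the run is verified
```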
The decisive check vs. the default scheduler: under Volcano, pods bind as a set. If capacity is short, none bind (no idle, half-allocated GPUs). That is the property you are buying.
Failure modes¶
- Pods stuck Pending, no PodGroup status. Volcano scheduler not running, or schedulerName is not volcano. Set it at both spec.schedulerName and spec.tasks[].template.spec.schedulerName; confirm the scheduler pod in volcano-system (Volcano Gang Scheduler).
- All pods Pending indefinitely; the PodGroup exists but never reaches Running. minAvailable is not satisfiable by cluster GPU capacity; the gang correctly refuses partial placement. Reduce replicas/minAvailable or add nodes. (Default-scheduler behaviour, partial placement and deadlock, is exactly what this prevents.)
- MASTER_ADDR/RANK empty in workers. pytorch plugin not listed in spec.plugins, or the task names don't match the plugin's --master/--worker args. The plugin keys off task name.[2]
- MPI workers unreachable / SSH refused. mpi (or ssh) plugin missing; the master's ${MPI_HOST} is unset without it. Ensure mpi: [...] is present so svc + ssh are force-enabled.[3]
- Job never completes after training ends. No TaskCompleted -> CompleteJob policy on the master, so workers linger. Add the task policy shown above.[1]
- GPUs requested but pods land CPU-only / unschedulable. nvidia.com/gpu limit missing from the task template, or GPU Operator not advertising the resource (GPU Operator ClusterPolicy).
- error validating "...": no matches for kind "Job" in version "batch.volcano.sh/v1alpha1". CRDs absent; install Volcano first.
References¶
- VolcanoJob (vcjob) CRD reference: https://volcano.sh/en/docs/vcjob/
- PyTorch plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
- MPI plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_mpi_plugin.md
- SVC plugin user guide: https://volcano.sh/en/docs/user-guide/how_to_use_svc_plugin/
- batch v1alpha1 API (JobSpec / TaskSpec): https://pkg.go.dev/volcano.sh/apis/pkg/apis/batch/v1alpha1
- Volcano quickstart / verify (get vcjob, get pg): https://volcano.sh/en/docs/v1-11-0/tutorials/
Related: Volcano scheduler (Helm) · GPU Operator ClusterPolicy · Kueue ClusterQueue · NIC ClusterPolicy · K8s GPU platform hub · Glossary
[1] VolcanoJob CRD reference: kind: Job, apiVersion: batch.volcano.sh/v1alpha1, minAvailable, schedulerName, tasks, plugins (ssh/env/svc), policies, queue, maxRetry: https://volcano.sh/en/docs/vcjob/
[2] PyTorch plugin: pytorch: ["--master=master","--worker=worker","--port=23456",...], injects MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, force-enables svc, default port 23456: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
[3] MPI plugin: mpi: ["--master=mpimaster","--worker=mpiworker","--port=22"], ${MPI_HOST} for mpiexec --host, force-enables svc + ssh, volcanosh/example-mpi:0.0.3: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_mpi_plugin.md
[4] SVC plugin: headless Service + per-pod DNS for inter-task discovery; prerequisite for pytorch/mpi plugins: https://volcano.sh/en/docs/user-guide/how_to_use_svc_plugin/
[5] batch v1alpha1 Go API: TaskSpec.MinAvailable *int32, JobSpec.PriorityClassName, JobSpec.TtlSecondsAfterFinished *int32, JobSpec.MinSuccess *int32: https://pkg.go.dev/volcano.sh/apis/pkg/apis/batch/v1alpha1
[6] Volcano tutorial: verify with kubectl get vcjob <name>, kubectl get pod -l volcano.sh/job-name=<name>, PodGroup phase: Running / running: N: https://volcano.sh/en/docs/v1-11-0/tutorials/