Manifest: Volcano job

Scope: a Volcano Job (batch.volcano.sh/v1alpha1) that gang-schedules a multi-pod GPU training run (minAvailable, multiple tasks, schedulerName: volcano) so every worker lands at once or none do. Reference templates from upstream Volcano docs; not hardware-tested. Pairs with Volcano Gang Scheduler, which installs the controller and scheduler this CRD depends on.

Apply via GitOps and pin every image. The Volcano scheduler must be installed first (Volcano Gang Scheduler); without it these pods sit Pending forever.

flowchart TB
  VCJOB["VolcanoJob (minAvailable=N)"] --> PG["PodGroup (auto-created)"]
  PG --> GANG{"all N schedulable?"}
  GANG -->|"yes"| BIND["bind all pods together"]
  GANG -->|"no"| WAIT["hold; nothing binds"]
  BIND --> RUN["torchrun / mpirun across pods"]

What it is

A Volcano Job is a CRD (kind: Job, group batch.volcano.sh, short name vcjob) that wraps one or more tasks, each a pod template with its own replicas. The Volcano controller creates a PodGroup and the pods for the Job, and the Volcano scheduler treats minAvailable as a gang gate: it binds pods only when at least minAvailable of them can be placed simultaneously. These are the all-or-nothing semantics a distributed torchrun/MPI run needs, because a half-placed job is a deadlocked job holding idle GPUs.
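The gate can be pictured with a small sketch. This is a toy model of the rule described above, not Volcano's scheduler code: `gang_decision` and `default_decision` are hypothetical names, and real scheduling also weighs queues, priorities, and node fit.

```python
def gang_decision(min_available: int, placeable_pods: int) -> int:
    """Return how many pods bind in one scheduling cycle (toy model).

    Gang rule: bind only when at least `min_available` pods can be
    placed at the same time; otherwise bind nothing.
    """
    if placeable_pods >= min_available:
        return placeable_pods
    return 0


def default_decision(placeable_pods: int) -> int:
    """Pod-by-pod scheduling, for contrast: bind whatever fits."""
    return placeable_pods


# Three single-GPU pods requested, but only two GPUs are free.
print(gang_decision(min_available=3, placeable_pods=2))  # 0: nothing binds
print(default_decision(placeable_pods=2))                # 2: half-placed job
```

In the second case two GPUs sit allocated to ranks that can never finish rendezvous, which is the deadlock the gate exists to avoid.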

Two ergonomics matter for GPU training:

plugins (map[string][]string). The svc plugin creates a headless Service + per-pod DNS so workers find rank 0; the pytorch and mpi plugins layer on top of svc and inject the framework env (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK for PyTorch; MPI_HOST for MPI), so you write less YAML.²³⁴
policies are lifecycle rules (e.g. event: TaskCompleted -> action: CompleteJob) so the Job terminates cleanly when rank 0 exits instead of leaving workers running.¹

GPUs are requested the normal way (nvidia.com/gpu limits) inside each task's pod template; Volcano only governs when the group binds, not what a pod consumes.

Prerequisites

Volcano installed (controllers + scheduler + CRDs). See Volcano Gang Scheduler. Confirm CRDs exist: kubectl get crd jobs.batch.volcano.sh podgroups.scheduling.volcano.sh.
GPU Operator advertising nvidia.com/gpu on nodes (GPU Operator ClusterPolicy).
A Volcano Queue for the job (the default queue is created by the Helm install; custom quota lives in Kueue ClusterQueue if you front Volcano with Kueue).
For multi-node NCCL over RDMA, an RDMA interface in-pod (NicClusterPolicy); otherwise NCCL falls back to TCP.

The manifest

Variant A: PyTorch `torchrun` (gang-scheduled, 1 master + 2 workers, 1 GPU each)

The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and force-enables svc; the default port is 23456.² minAvailable: 3 gates the gang on all three pods.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: torchrun-gang
  namespace: ml
spec:
  minAvailable: 3                 # gang gate: all 3 pods or none
  schedulerName: volcano          # MUST be volcano for gang semantics
  queue: default
  maxRetry: 3
  plugins:
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - name: master                # RANK 0; torchrun rendezvous endpoint
      replicas: 1
      policies:
        - event: TaskCompleted    # when rank 0 finishes, end the whole Job
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: master
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a tested tag
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1
    - name: worker                # RANK 1..N
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:25.05-py3
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1

$(RANK), $(MASTER_ADDR), $(MASTER_PORT) are the env vars the pytorch plugin injects; Kubernetes $(VAR) expansion in args substitutes them at pod start.² WORLD_SIZE is injected too. Replace train.py/image with yours; the plugin sets up rendezvous, not your code.
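To make the substitution concrete, here is a minimal Python sketch of the documented Kubernetes rule for `$(VAR)` references in `command`/`args`: a reference resolves against the container's environment, an unresolved reference is left unchanged, and `$$` escapes to a single `$`. The function name `expand` and the sample `MASTER_ADDR` value are illustrative assumptions; the real expansion happens in the kubelet, not in your code.

```python
import re

# Matches an escaped "$$" or a "$(NAME)" reference.
_REF = re.compile(r"\$\$|\$\(([A-Za-z_][A-Za-z0-9_.-]*)\)")


def expand(arg: str, env: dict) -> str:
    """Expand $(VAR) references the way container args are expanded (sketch)."""
    def repl(match):
        if match.group(0) == "$$":
            return "$"                      # "$$(VAR)" becomes the literal "$(VAR)"
        name = match.group(1)
        return env.get(name, match.group(0))  # unresolved: keep the reference as-is
    return _REF.sub(repl, arg)


# Hypothetical values standing in for what the pytorch plugin injects.
env = {
    "RANK": "1",
    "MASTER_ADDR": "torchrun-gang-master-0.torchrun-gang",  # assumed DNS name
    "MASTER_PORT": "23456",
}
args = ["--node_rank=$(RANK)", "--master_addr=$(MASTER_ADDR)", "--master_port=$(MASTER_PORT)"]
print([expand(a, env) for a in args])

# Without the plugin, nothing is injected and the literal reference survives.
print(expand("--node_rank=$(RANK)", {}))  # --node_rank=$(RANK)
```

The last line is worth remembering when debugging: a literal `$(RANK)` reaching torchrun usually means the variable was never injected into the container's environment.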

Variant B: MPI (`mpirun` from a master over SSH to workers)

The mpi plugin force-enables svc + ssh (password-free login) and exposes MPI_HOST (worker DNS names) for mpirun --host.³ The master starts sshd then runs mpiexec; workers run sshd -D.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-gang
  namespace: ml
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: default
  plugins:
    mpi: ["--master=mpimaster", "--worker=mpiworker", "--port=22"]
  tasks:
    - name: mpimaster
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.3   # replace with your CUDA+MPI image
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world;
    - name: mpiworker
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.3
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;

The volcanosh/example-mpi:0.0.3 image is a CPU mpi_hello_world proof from upstream;³ for GPU work swap in your CUDA-aware MPI image, add nvidia.com/gpu limits to both tasks, and point -np at your total rank count.
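When scaling the MPI variant, the two numbers to keep in step are the worker host list and `-np`. The sketch below works through that arithmetic. It assumes worker pods are reachable as `<job>-<task>-<index>.<job>` (pod name plus a headless-Service subdomain named after the Job); that naming is an assumption, so verify it against `${MPI_HOST}` in a running master pod. `worker_hosts` and `total_ranks` are hypothetical helpers.

```python
def worker_hosts(job: str, task: str, replicas: int) -> list:
    """Build the assumed per-pod DNS names for one task's replicas."""
    return [f"{job}-{task}-{i}.{job}" for i in range(replicas)]


def total_ranks(worker_replicas: int, ranks_per_worker: int) -> int:
    """Total MPI ranks to pass as -np: one per GPU is a common choice."""
    return worker_replicas * ranks_per_worker


hosts = worker_hosts("mpi-gang", "mpiworker", 2)
print(",".join(hosts))
print(total_ranks(worker_replicas=2, ranks_per_worker=1))  # 2, matching -np 2 above
print(total_ranks(worker_replicas=2, ranks_per_worker=4))  # 8 for 4 GPUs per worker
```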

Configuration

Field	Path	Meaning	Notes
`apiVersion`	top	`batch.volcano.sh/v1alpha1`	the only Job API version Volcano serves; alpha-stage, not GA¹
`kind`	top	`Job`	short name `vcjob`
`schedulerName`	`spec.schedulerName`	scheduler for the Job	must be `volcano` for gang semantics; also set it per-task template¹
`minAvailable`	`spec.minAvailable`	min pods that must place together	the gang gate; defaults to sum of task `replicas`¹
`tasks[].name`	`spec.tasks`	task/role identifier	referenced by `pytorch`/`mpi` plugin `--master`/`--worker`²³
`tasks[].replicas`	`spec.tasks`	pods for this task	total pods = sum across tasks
`tasks[].minAvailable`	`spec.tasks`	per-task min (optional, `*int32`)	for per-role gang gating; omit to use Job-level `minAvailable`⁵
`tasks[].template`	`spec.tasks`	`v1.PodTemplateSpec`	standard pod spec; put `nvidia.com/gpu` limits here
`tasks[].policies`	`spec.tasks`	lifecycle rules	e.g. `event: TaskCompleted -> action: CompleteJob`¹
`plugins`	`spec.plugins`	`map[string][]string`	`svc`, `env`, `ssh`, `pytorch`, `mpi`¹²³
`policies`	`spec.policies`	Job-level lifecycle	e.g. `event: PodEvicted -> action: RestartJob`¹
`queue`	`spec.queue`	Volcano queue	defaults to `default`¹
`maxRetry`	`spec.maxRetry`	Job retry budget	`int32`
`priorityClassName`	`spec.priorityClassName`	preemption priority	standard K8s PriorityClass⁵
`ttlSecondsAfterFinished`	`spec.ttlSecondsAfterFinished`	GC delay after finish	`*int32`; auto-clean completed jobs⁵
`minSuccess`	`spec.minSuccess`	pods that must succeed for Job success	`*int32`, optional⁵

PyTorch plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 23456); optional --wait-master-enabled, --wait-master-timeout, --wait-master-image.² MPI plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 22).³

Apply & verify

kubectl apply -f torchrun-gang.yaml

# 1. The Job exists under the Volcano API group:
kubectl get jobs.batch.volcano.sh -n ml          # or: kubectl get vcjob -n ml

# 2. The auto-created PodGroup carries the gang gate:
kubectl get podgroup -n ml -o wide               # short name: pg

# 3. All pods of the job (label set by Volcano):
kubectl get pod -n ml -l volcano.sh/job-name=torchrun-gang -o wide

Expected signals:

kubectl get vcjob STATUS moves Pending -> Running -> Completed. A healthy gang shows Running with MIN matching your minAvailable.⁶
kubectl get pg shows phase: Running and running: 3 once the gang binds; the count must reach minAvailable.⁶
All three pods reach Running together (same scheduling cycle). If you see some Running and others stuck Pending, the gang did not engage; check schedulerName.
Rank-0 logs show the rendezvous: kubectl logs -n ml -l volcano.sh/job-name=torchrun-gang,volcano.sh/task-spec=master. With NCCL_DEBUG=INFO, look for NCCL INFO ... comm ... nranks 3; over RDMA, [GDRDMA] (NicClusterPolicy).

The decisive check vs. the default scheduler: under Volcano, pods bind as a set. If capacity is short, none bind (no idle, half-allocated GPUs). That is the property you are buying.
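That check can be scripted against the JSON from `kubectl get pod ... -o json`. The sketch below is one hedged way to do it: it treats a pod as bound when `spec.nodeName` is set and classifies the job as waiting, bound, or partial. `gang_state` is a hypothetical helper, and the inline sample stands in for real kubectl output.

```python
def gang_state(pod_list: dict, min_available: int) -> str:
    """Classify a job's pods from a parsed `kubectl get pod -o json` PodList."""
    pods = pod_list.get("items", [])
    # A pod is bound to a node once the scheduler has set spec.nodeName.
    bound = sum(1 for pod in pods if pod.get("spec", {}).get("nodeName"))
    if bound == 0:
        return "waiting"   # nothing bound: the gate is holding the whole gang
    if bound >= min_available:
        return "bound"     # the gang placed together
    return "partial"       # some pods bound below the gate: gang did not engage


# Illustrative PodList: two of three pods bound, which a gang should never produce.
sample = {
    "items": [
        {"metadata": {"name": "torchrun-gang-master-0"}, "spec": {"nodeName": "gpu-node-1"}},
        {"metadata": {"name": "torchrun-gang-worker-0"}, "spec": {"nodeName": "gpu-node-2"}},
        {"metadata": {"name": "torchrun-gang-worker-1"}, "spec": {}},
    ]
}
print(gang_state(sample, min_available=3))  # partial
```

To use it on a live cluster, the dict could come from `json.load(sys.stdin)` with the kubectl command from step 3 piped in.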

Failure modes

Pods stuck Pending, no PodGroup status. Volcano scheduler not running, or schedulerName is not volcano. Set it at both spec.schedulerName and spec.tasks[].template.spec.schedulerName; confirm the scheduler pod in volcano-system (Volcano Gang Scheduler).
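All pods Pending indefinitely, PodGroup never reaches Running. minAvailable is not satisfiable by cluster GPU capacity; the gang correctly refuses partial placement. Reduce replicas/minAvailable or add nodes. (Partial placement and deadlock, the default-scheduler behaviour, is exactly what this prevents.) If instead some pods are Running while others stay Pending, either minAvailable is below the total replica count or the gang did not engage; check schedulerName.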
MASTER_ADDR/RANK empty in workers. pytorch plugin not listed in spec.plugins, or the task names don't match the plugin's --master/--worker args. The plugin keys off task name.²
MPI workers unreachable / SSH refused. mpi (or ssh) plugin missing; the master's ${MPI_HOST} is unset without it. Ensure mpi: [...] is present so svc+ssh are force-enabled.³
Job never completes after training ends. No TaskCompleted -> CompleteJob policy on the master, so workers linger. Add the task policy shown above.¹
GPUs requested but pods land CPU-only / unschedulable. nvidia.com/gpu limit missing from the task template, or GPU Operator not advertising the resource (GPU Operator ClusterPolicy).
error validating "...": no matches for kind "Job" in version "batch.volcano.sh/v1alpha1". CRDs absent; install Volcano first.

References

1. VolcanoJob (vcjob) CRD reference. Covers kind: Job, apiVersion: batch.volcano.sh/v1alpha1, minAvailable, schedulerName, tasks, plugins (ssh/env/svc), policies, queue, maxRetry. https://volcano.sh/en/docs/vcjob/
2. PyTorch plugin user guide. Covers pytorch: ["--master=master","--worker=worker","--port=23456",...], the injected MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, force-enabled svc, default port 23456. https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
3. MPI plugin user guide. Covers mpi: ["--master=mpimaster","--worker=mpiworker","--port=22"], ${MPI_HOST} for mpiexec --host, force-enabled svc+ssh, the volcanosh/example-mpi:0.0.3 image. https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_mpi_plugin.md
4. SVC plugin user guide. Covers the headless Service + per-pod DNS for inter-task discovery; prerequisite for the pytorch/mpi plugins. https://volcano.sh/en/docs/user-guide/how_to_use_svc_plugin/
5. batch v1alpha1 Go API (JobSpec / TaskSpec). Covers TaskSpec.MinAvailable *int32, JobSpec.PriorityClassName, JobSpec.TtlSecondsAfterFinished *int32, JobSpec.MinSuccess *int32. https://pkg.go.dev/volcano.sh/apis/pkg/apis/batch/v1alpha1
6. Volcano tutorial (quickstart / verify). Covers kubectl get vcjob <name>, kubectl get pod -l volcano.sh/job-name=<name>, PodGroup phase: Running / running: N. https://volcano.sh/en/docs/v1-11-0/tutorials/