Manifest: Volcano job

Scope: a Volcano Job (batch.volcano.sh/v1alpha1) that gang-schedules a multi-pod GPU training run (minAvailable, multiple tasks, schedulerName: volcano) so every worker lands at once or none do. Reference templates from upstream Volcano docs; not hardware-tested. Pairs with Volcano Gang Scheduler, which installs the controller and scheduler this CRD depends on.

Apply via GitOps and pin every image. The Volcano scheduler must be installed first (Volcano Gang Scheduler); without it these pods sit Pending forever.

flowchart TB
  VCJOB["VolcanoJob (minAvailable=N)"] --> PG["PodGroup (auto-created)"]
  PG --> GANG{"all N schedulable?"}
  GANG -->|"yes"| BIND["bind all pods together"]
  GANG -->|"no"| WAIT["hold; nothing binds"]
  BIND --> RUN["torchrun / mpirun across pods"]

What it is

A Volcano Job is a CRD (kind: Job, group batch.volcano.sh, short name vcjob) that wraps one or more tasks, each a pod template with its own replicas. The Volcano controller creates a PodGroup for the Job, and the Volcano scheduler treats minAvailable as a gang gate: it binds pods only when at least minAvailable of them can be placed simultaneously. With minAvailable equal to the total pod count, this is the all-or-nothing semantics a distributed torchrun/MPI run needs: a half-placed job is a deadlocked job holding idle GPUs.
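For orientation, the PodGroup the controller creates for a three-pod Job might look roughly like the sketch below. The object name, the minResources value, and the phase are illustrative assumptions rather than captured output; the field to watch is minMember, which mirrors the Job's minAvailable.

```yaml
# Sketch of an auto-created PodGroup; you do not write this object yourself.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: example-job-podgroup   # hypothetical; the controller generates the real name
  namespace: ml
spec:
  minMember: 3                 # mirrors the Job's spec.minAvailable
  queue: default
  minResources:                # assumed: aggregate request of the gang
    nvidia.com/gpu: "3"
status:
  phase: Inqueue               # assumed snapshot before the gang binds
```

If minMember cannot be satisfied, the scheduler leaves every pod of the group unbound rather than placing a subset.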

Two ergonomics matter for GPU training:

  • plugins (map[string][]string). The svc plugin creates a headless Service + per-pod DNS so workers find rank 0; the pytorch and mpi plugins layer on top of svc and inject the framework env (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK for PyTorch; MPI_HOST for MPI), so you write less YAML.[2][3][4]
  • policies are lifecycle rules (e.g. event: TaskCompleted -> action: CompleteJob) so the Job terminates cleanly when rank 0 exits instead of leaving workers running.[1]
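As a rough illustration of how the two blocks sit in a Job spec, the fragment below enables the base plugins explicitly and attaches one Job-level and one task-level policy. It is a sketch, not a complete manifest: the task name is only an example, and the empty argument lists assume each plugin's defaults are acceptable.

```yaml
# Fragment of a Job spec; pod templates omitted for brevity.
spec:
  plugins:
    svc: []        # headless Service + per-pod DNS
    env: []        # per-pod task index env var
    ssh: []        # password-free SSH between pods
  policies:
    - event: PodEvicted        # Job-level rule
      action: RestartJob
  tasks:
    - name: master             # example task name
      replicas: 1
      policies:
        - event: TaskCompleted # task-level rule: rank 0 exit ends the Job
          action: CompleteJob
```

With the pytorch or mpi plugin listed instead, svc does not need its own entry, since those plugins force-enable it.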

GPUs are requested the normal way (nvidia.com/gpu limits) inside each task's pod template; Volcano only governs when the group binds, not what a pod consumes.

Prerequisites

  • Volcano installed (controllers + scheduler + CRDs). See Volcano Gang Scheduler. Confirm CRDs exist: kubectl get crd jobs.batch.volcano.sh podgroups.scheduling.volcano.sh.
  • GPU Operator advertising nvidia.com/gpu on nodes (GPU Operator ClusterPolicy).
  • A Volcano Queue for the job (Volcano creates the default queue automatically once installed; custom quota lives in Kueue ClusterQueue if you front Volcano with Kueue).
  • For multi-node NCCL over RDMA, an RDMA interface in-pod (NicClusterPolicy); otherwise NCCL falls back to TCP.
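If the default queue is not what you want, a dedicated Queue can be declared alongside the Job. The sketch below is a minimal example under assumptions: the queue name training and the GPU cap of 8 are hypothetical values, not recommendations.

```yaml
# Minimal Queue sketch; Queue is cluster-scoped, so no namespace.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training            # hypothetical queue name
spec:
  weight: 1                 # relative share against other queues
  reclaimable: true         # allow other queues to reclaim idle share
  capability:
    nvidia.com/gpu: 8       # hypothetical hard cap for this queue
```

A Job opts in by setting spec.queue: training instead of default.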

The manifest

Variant A: PyTorch torchrun (gang-scheduled, 1 master + 2 workers, 1 GPU each)

The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK and force-enables svc; the default port is 23456.[2] minAvailable: 3 gates the gang on all three pods.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: torchrun-gang
  namespace: ml
spec:
  minAvailable: 3                 # gang gate: all 3 pods or none
  schedulerName: volcano          # MUST be volcano for gang semantics
  queue: default
  maxRetry: 3
  plugins:
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - name: master                # RANK 0; torchrun rendezvous endpoint
      replicas: 1
      policies:
        - event: TaskCompleted    # when rank 0 finishes, end the whole Job
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: master
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a tested tag
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1
    - name: worker                # RANK 1..N
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:25.05-py3
              command: ["torchrun"]
              args:
                - "--nnodes=3"
                - "--nproc_per_node=1"
                - "--node_rank=$(RANK)"
                - "--master_addr=$(MASTER_ADDR)"
                - "--master_port=$(MASTER_PORT)"
                - "/workspace/train.py"
              resources:
                limits:
                  nvidia.com/gpu: 1

$(RANK), $(MASTER_ADDR), and $(MASTER_PORT) reference the env vars the pytorch plugin injects; Kubernetes $(VAR) expansion in args substitutes them at container start.[2] WORLD_SIZE is injected too. Keep --nnodes equal to the total replica count (1 master + 2 workers here). Replace train.py and the image with your own; the plugin sets up rendezvous, not your code.

Variant B: MPI (mpirun from a master over SSH to workers)

The mpi plugin force-enables svc + ssh (password-free login) and exposes MPI_HOST (worker DNS names) for mpiexec --host.[3] The master starts sshd and then runs mpiexec; workers run sshd -D in the foreground.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-gang
  namespace: ml
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: default
  plugins:
    mpi: ["--master=mpimaster", "--worker=mpiworker", "--port=22"]
  tasks:
    - name: mpimaster
      replicas: 1
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpimaster
              image: volcanosh/example-mpi:0.0.3   # replace with your CUDA+MPI image
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world;
    - name: mpiworker
      replicas: 2
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - name: mpiworker
              image: volcanosh/example-mpi:0.0.3
              workingDir: /home
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;

The volcanosh/example-mpi:0.0.3 image is a CPU mpi_hello_world proof from upstream.[3] For GPU work, swap in your CUDA-aware MPI image, add nvidia.com/gpu limits to the tasks that run ranks (the workers; the master too only if it appears in the host list), and point -np at your total rank count.
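The GPU changes to Variant B could look roughly like the fragment below. It is a sketch under assumptions: the image reference and the train_mpi binary are hypothetical placeholders, one rank per worker is assumed, and the GPU limit is shown on the worker only (repeat it on the master if ranks also run there).

```yaml
# Fragment: only the fields that change for a GPU run are shown.
spec:
  tasks:
    - name: mpimaster
      replicas: 1
      template:
        spec:
          containers:
            - name: mpimaster
              image: registry.example.com/ml/cuda-mpi:REPLACE_ME   # hypothetical image
              command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  # -np = total ranks: 2 workers x 1 GPU each (assumed layout)
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 ./train_mpi
    - name: mpiworker
      replicas: 2
      template:
        spec:
          containers:
            - name: mpiworker
              image: registry.example.com/ml/cuda-mpi:REPLACE_ME   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Running more than one rank per worker would also need per-host slot counts on the mpiexec command line, which this sketch leaves out.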

Configuration

| Field | Path | Meaning | Notes |
|---|---|---|---|
| apiVersion | top level | batch.volcano.sh/v1alpha1 | the only Job API version Volcano publishes (alpha-stage, not GA) [1] |
| kind | top level | Job | short name vcjob |
| schedulerName | spec.schedulerName | scheduler for the Job | must be volcano for gang semantics; also set it in each task's pod template [1] |
| minAvailable | spec.minAvailable | min pods that must place together | the gang gate; defaults to sum of task replicas [1] |
| tasks[].name | spec.tasks[].name | task/role identifier | referenced by pytorch/mpi plugin --master/--worker [2][3] |
| tasks[].replicas | spec.tasks[].replicas | pods for this task | total pods = sum across tasks |
| tasks[].minAvailable | spec.tasks[].minAvailable | per-task min (optional, *int32) | for per-role gang gating; omit to use Job-level minAvailable [5] |
| tasks[].template | spec.tasks[].template | v1.PodTemplateSpec | standard pod spec; put nvidia.com/gpu limits here |
| tasks[].policies | spec.tasks[].policies | lifecycle rules | e.g. event: TaskCompleted -> action: CompleteJob [1] |
| plugins | spec.plugins | map[string][]string | svc, env, ssh, pytorch, mpi [1][2][3] |
| policies | spec.policies | Job-level lifecycle | e.g. event: PodEvicted -> action: RestartJob [1] |
| queue | spec.queue | Volcano queue | defaults to default [1] |
| maxRetry | spec.maxRetry | Job retry budget | int32 |
| priorityClassName | spec.priorityClassName | preemption priority | standard K8s PriorityClass [5] |
| ttlSecondsAfterFinished | spec.ttlSecondsAfterFinished | GC delay after finish | *int32; auto-clean completed jobs [5] |
| minSuccess | spec.minSuccess | pods that must succeed for Job success | *int32, optional [5] |
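The optional fields from the table might be combined as in the fragment below. Treat it as a sketch: the PriorityClass name training-high is hypothetical and must already exist in the cluster, the one-hour TTL is an arbitrary example, and the per-task minimums are chosen so they add up to the Job-level gate.

```yaml
# Fragment of a Job spec; pod templates and plugins omitted.
spec:
  minAvailable: 3                  # Job-level gang gate
  priorityClassName: training-high # hypothetical PriorityClass
  ttlSecondsAfterFinished: 3600    # example: delete the Job 1h after it finishes
  tasks:
    - name: master
      replicas: 1
      minAvailable: 1              # per-role gate for the master
    - name: worker
      replicas: 2
      minAvailable: 2              # per-role gate for the workers
```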

PyTorch plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 23456); optional --wait-master-enabled, --wait-master-timeout, --wait-master-image.[2] MPI plugin args: --master=<taskName>, --worker=<taskName>, --port=<int> (default 22).[3]

Apply & verify

kubectl apply -f torchrun-gang.yaml

# 1. The Job exists under the Volcano API group:
kubectl get jobs.batch.volcano.sh -n ml          # or: kubectl get vcjob -n ml

# 2. The auto-created PodGroup carries the gang gate:
kubectl get podgroup -n ml -o wide               # short name: pg

# 3. All pods of the job (label set by Volcano):
kubectl get pod -n ml -l volcano.sh/job-name=torchrun-gang -o wide

Expected signals:

  • In kubectl get vcjob, STATUS moves Pending -> Running -> Completed. A healthy gang shows Running with the MIN column matching your minAvailable.[6]
  • kubectl get pg -o yaml shows phase: Running and running: 3 under status once the gang binds; the running count must reach minAvailable.[6]
  • All three pods are bound in the same scheduling cycle and reach Running shortly after one another. If some are Running while others stay Pending even though minAvailable equals the pod count, the gang did not engage; check schedulerName.
  • Rank-0 logs show the rendezvous: kubectl logs -n ml -l volcano.sh/job-name=torchrun-gang,volcano.sh/task-spec=master. With NCCL_DEBUG=INFO, look for NCCL INFO ... comm ... nranks 3; over RDMA, [GDRDMA] (NicClusterPolicy).
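To get the NCCL lines mentioned above into the logs, NCCL_DEBUG has to be set in the training container. A minimal way to do that is sketched below as a fragment to merge into the Variant A manifest; it assumes the container names used there, and the same env entry would be repeated on the worker task.

```yaml
# Fragment: merge into the existing master task of the torchrun-gang Job.
spec:
  tasks:
    - name: master
      template:
        spec:
          containers:
            - name: master
              env:
                - name: NCCL_DEBUG   # read by NCCL at communicator init
                  value: "INFO"
```

Drop the variable again after validation, since INFO-level NCCL output is verbose.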

The decisive check vs. the default scheduler: under Volcano, pods bind as a set. If capacity is short, none bind (no idle, half-allocated GPUs). That is the property you are buying.

Failure modes

  • Pods stuck Pending, no PodGroup status. Volcano scheduler not running, or schedulerName is not volcano. Set it at both spec.schedulerName and spec.tasks[].template.spec.schedulerName; confirm the scheduler pod in volcano-system (Volcano Gang Scheduler).
  • All pods Pending indefinitely, PodGroup stuck in Pending or Inqueue. minAvailable is not satisfiable by free cluster GPU capacity; the gang correctly refuses partial placement, and kubectl describe pg typically reports an Unschedulable condition. Reduce replicas/minAvailable or add nodes. (Partial placement followed by deadlock is the default-scheduler behaviour this prevents. If you do see some pods Running and others Pending, either minAvailable is below the total replica count or the pods were not scheduled by Volcano.)
  • MASTER_ADDR/RANK empty in workers. pytorch plugin not listed in spec.plugins, or the task names don't match the plugin's --master/--worker args. The plugin keys off task name.[2]
  • MPI workers unreachable / SSH refused. mpi (or ssh) plugin missing; the master's ${MPI_HOST} is unset without it. Ensure mpi: [...] is present so svc+ssh are force-enabled.[3]
  • Job never completes after training ends. No TaskCompleted -> CompleteJob policy on the master, so workers linger. Add the task policy shown above.[1]
  • GPUs requested but pods land CPU-only / unschedulable. nvidia.com/gpu limit missing from the task template, or GPU Operator not advertising the resource (GPU Operator ClusterPolicy).
  • error validating "...": no matches for kind "Job" in version "batch.volcano.sh/v1alpha1". CRDs absent; install Volcano first.

References

  • VolcanoJob (vcjob) CRD reference: https://volcano.sh/en/docs/vcjob/
  • PyTorch plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
  • MPI plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_mpi_plugin.md
  • SVC plugin user guide: https://volcano.sh/en/docs/user-guide/how_to_use_svc_plugin/
  • batch v1alpha1 API (JobSpec / TaskSpec): https://pkg.go.dev/volcano.sh/apis/pkg/apis/batch/v1alpha1
  • Volcano quickstart / verify (get vcjob, get pg): https://volcano.sh/en/docs/v1-11-0/tutorials/

Related: Volcano scheduler (Helm) · GPU Operator ClusterPolicy · Kueue ClusterQueue · NIC ClusterPolicy · K8s GPU platform hub · Glossary


  1. VolcanoJob CRD reference — kind: Job, apiVersion: batch.volcano.sh/v1alpha1, minAvailable, schedulerName, tasks, plugins (ssh/env/svc), policies, queue, maxRetry: https://volcano.sh/en/docs/vcjob/ 

  2. PyTorch plugin — pytorch: ["--master=master","--worker=worker","--port=23456",...], injects MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, force-enables svc, default port 23456: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md 

  3. MPI plugin — mpi: ["--master=mpimaster","--worker=mpiworker","--port=22"], ${MPI_HOST} for mpiexec --host, force-enables svc+ssh, volcanosh/example-mpi:0.0.3: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_mpi_plugin.md 

  4. SVC plugin — headless Service + per-pod DNS for inter-task discovery; prerequisite for pytorch/mpi plugins: https://volcano.sh/en/docs/user-guide/how_to_use_svc_plugin/ 

  5. batch v1alpha1 Go API — TaskSpec.MinAvailable *int32, JobSpec.PriorityClassName, JobSpec.TtlSecondsAfterFinished *int32, JobSpec.MinSuccess *int32: https://pkg.go.dev/volcano.sh/apis/pkg/apis/batch/v1alpha1 

  6. Volcano tutorial — verify with kubectl get vcjob <name>, kubectl get pod -l volcano.sh/job-name=<name>, PodGroup phase: Running / running: N: https://volcano.sh/en/docs/v1-11-0/tutorials/