Recipe: gang-scheduled distributed training¶
Scope: a standalone recipe to launch a gang-scheduled distributed-training smoke job (Volcano Job + torchrun): the manifest, apply/verify, an MFU sanity check, and the two failure modes that bite first (partial-gang deadlock and an NCCL stall). The applied, copy-paste counterpart to the training step in workload bring-up recipes; the field-by-field CRD reference lives in Volcano Job.
Reference templates from upstream Volcano / PyTorch docs. Nothing here was executed on hardware. Pin every image to a tested digest, substitute real namespace / queue / RDMA resource names, and run on one node-pair before a fleet roll.
flowchart LR
APPLY["kubectl apply vcjob"] --> PG["PodGroup (minAvailable=N)"]
PG --> GANG{"all N GPUs<br/>placeable now?"}
GANG -->|"no"| HOLD["hold: nothing binds<br/>(no idle GPUs)"]
GANG -->|"yes"| BIND["bind all pods together"]
BIND --> RDZV["torchrun c10d rendezvous"]
RDZV --> NCCL{"NCCL ring forms?"}
NCCL -->|"timeout"| FAIL["NCCL stall: check IB / GDR"]
NCCL -->|"yes"| STEP["steps run; read MFU"]
What it is¶
A one-shot distributed-training smoke job: prove that the scheduler gang-places every rank, that torchrun rendezvous succeeds, that NCCL forms a collective over the fabric, and that the run hits a sane MFU, before anyone trusts a multi-day training run on the cluster. It is step 8 of the end-to-end bring-up and reuses the gang primitive from Volcano Job.
The job is a Volcano Job (batch.volcano.sh/v1alpha1). Volcano compiles it into a PodGroup and treats minAvailable as a gang gate: pods bind only when all minAvailable of them can be placed in the same scheduling cycle: all-or-nothing. That is exactly the semantics a torchrun collective needs; a half-placed job is a deadlocked job pinning idle GPUs. The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and force-enables the svc plugin (headless Service + per-pod DNS) so workers find rank 0.1
Why it matters¶
The default Kubernetes scheduler places pods one at a time. A 16-GPU job whose 13th pod cannot place leaves 12 pods Running, holding 12 GPUs idle, blocked on a barrier that will never complete, a silent deadlock that burns GPU-hours (workload bring-up recipes). Gang scheduling converts that into an honest Pending: none bind, no GPU is wasted, and the failure is legible. Running this smoke job first also isolates layers: if the fabric is bad, you see it here as an NCCL stall, not three weeks later as a misread "model" problem (fabric bring-up and benchmarking).
When it is needed (and when not)¶
- Needed: first workload on a new cluster or node-pool; after a fabric, driver, or Volcano change; as the gate before a real FSDP / DiLoCo run (distributed training recipes).
- Needed: any multi-pod job where partial placement deadlocks (DDP, FSDP, TP/PP, MPI collectives).
- Not needed: a single-pod, single-node job (
nproc_per_nodecovers the GPUs on one host; no gang gate required). Use a plainJob/PyTorchJob. - Not the first test: validate the fabric with
nccl-testsand the GPUs withdcgmi diagbefore this (Smoke Tests: GPU Platform, GPU Health Gating). A failed NCCL collective here is ambiguous until the fabric is known-good.
How: implement, integrate, maintain¶
1. The manifest¶
Two nodes, 8 GPUs each (16 ranks). minAvailable: 2 gates both pods together; pytorch plugin wires rendezvous; an RDMA resource keeps NCCL off TCP (NicClusterPolicy). Save as gang-smoke.yaml.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: gang-smoke
namespace: ml
spec:
minAvailable: 2 # gang gate: both pods place or neither does
schedulerName: volcano # MUST be volcano for gang semantics
queue: default
maxRetry: 3
plugins:
pytorch: ["--master=master", "--worker=worker", "--port=23456"]
policies:
- event: PodEvicted
action: RestartJob # any rank evicted -> restart the whole gang
tasks:
- name: master # RANK 0; rendezvous endpoint
replicas: 1
policies:
- event: TaskCompleted # rank 0 exits -> end the Job, don't orphan workers
action: CompleteJob
template:
spec:
schedulerName: volcano
restartPolicy: OnFailure
containers:
- &trainer
name: trainer
image: nvcr.io/nvidia/pytorch:25.05-py3 # pin to a tested digest
command: ["torchrun"]
args:
- "--nnodes=2"
- "--nproc_per_node=8"
- "--node_rank=$(RANK)" # injected by pytorch plugin
- "--rdzv_backend=c10d"
- "--rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)"
- "--max-restarts=3"
- "/workspace/smoke_train.py"
env:
- { name: NCCL_DEBUG, value: "INFO" }
- { name: NCCL_IB_HCA, value: "mlx5" } # match your HCA
- { name: NCCL_NET_GDR_LEVEL, value: "SYS" }
resources:
limits:
nvidia.com/gpu: 8
rdma/rdma_shared_device_a: 1 # confirm the real key
- name: worker # RANK 1
replicas: 1
template:
spec:
schedulerName: volcano
restartPolicy: OnFailure
containers:
- <<: *trainer
A minimal, self-contained training body: pure NCCL all-reduce loop, no dataset, prints throughput. Bake it into the image at /workspace/smoke_train.py:
# smoke_train.py — minimal multi-node NCCL smoke; reports tokens/s/GPU
import os, time, torch, torch.distributed as dist
dist.init_process_group("nccl") # reads RANK/WORLD_SIZE/MASTER_*
rank, world = dist.get_rank(), dist.get_world_size()
local = int(os.environ["LOCAL_RANK"]) # set by torchrun
torch.cuda.set_device(local)
HID, STEPS = 8192, 50
w = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)
x = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)
torch.cuda.synchronize(); dist.barrier(); t0 = time.time()
for _ in range(STEPS):
y = x @ w # compute
dist.all_reduce(y) # collective over the fabric
torch.cuda.synchronize(); dist.barrier(); dt = time.time() - t0
# 2 matmuls-worth of FLOPs per step (2*N^3), gathered for an MFU estimate
flops = 2 * STEPS * (2 * HID ** 3)
if rank == 0:
print(f"world={world} steps={STEPS} wall={dt:.2f}s "
f"per_gpu_TFLOPs={flops / dt / world / 1e12:.1f}")
dist.destroy_process_group()
2. Apply and verify¶
kubectl apply -f gang-smoke.yaml
# Job exists under the Volcano API group, moving Pending -> Running -> Completed:
kubectl get vcjob -n ml gang-smoke
# PodGroup carries the gang gate; "running" must reach minAvailable:
kubectl get podgroup -n ml -o wide # short name: pg
# All ranks of the job (label set by Volcano) bind in the SAME cycle:
kubectl get pod -n ml -l volcano.sh/job-name=gang-smoke -o wide
# Rendezvous + collective in rank-0 logs:
kubectl logs -n ml -l volcano.sh/job-name=gang-smoke,volcano.sh/task-spec=master -f
Pass signals:
vcjobSTATUS reachesRunningwithMIN=minAvailable, thenCompleted.2pgshowsphase: Runningandrunning: 2once the gang binds.2- Both pods are
Runningtogether. SomeRunning, somePending= the gang did not engage; checkschedulerNameon Job and task templates. - Rank-0 log shows
NCCL INFO ... comm ... nranks 16, and over RDMA[GDRDMA]/NET/IB(notNET/Socket).3
3. MFU sanity check¶
The script prints per_gpu_TFLOPs. MFU is achieved FLOP/s divided by the GPU's dense peak for that dtype:4
Use the vendor's published dense peak for your GPU and dtype as the denominator (e.g. H100 SXM BF16 dense ~989 TFLOP/s per NVIDIA's datasheet; confirm the exact figure for your SKU and clocks).5 This loop is comms-heavy and not training, so its MFU is a wiring/throughput sanity check, not a model baseline. Establish the real baseline with the actual model + parallelism + hardware and track it in the SLO/SLI catalog; regressions are worked in Training MFU Regression. Read SM-active / tensor-active from DCGM, not nvidia-smi GPU-util, for the true utilization signal (Training MFU Regression).
4. Maintain¶
- Pin the image to a digest and roll it through GitOps; re-run this smoke job after any Volcano, driver, or fabric change as a regression gate (Smoke Tests: GPU Platform).
- Keep the
TaskCompleted -> CompleteJobpolicy so finished jobs don't orphan workers, andttlSecondsAfterFinishedto GC them (Volcano Job). - Gate the cluster on this passing before promoting it to real training queues (GPU Health Gating).
Failure modes¶
- Partial-gang deadlock (default scheduler): without
schedulerName: volcano(on both Job and task templates), pods place one at a time: someRunning, somePending, ranks blocked at the NCCL barrier, GPUs idle forever. Fix: setschedulerName: volcanoeverywhere; Volcano then holds the whole gangPendinginstead of half-placing it. - Gang correctly refuses: all pods
Pending,pgnotRunning:minAvailableexceeds free GPU capacity. This is correct behaviour, not a bug. Add nodes or reducereplicas/minAvailable. - NCCL stall / rendezvous timeout: pods
Runningbut rank 0 hangs afternranks, or logs showNET/Socketinstead ofNET/IB. Usually the RDMA resource was not requested (NCCL fell back to TCP) or the wrongNCCL_IB_HCAfilter. Confirm the RDMA resource key and HCA name; validate the fabric standalone first (fabric bring-up and benchmarking, NCCL Hang / Collective Stall). MASTER_ADDR/RANKempty:pytorchplugin absent fromspec.plugins, or task names don't match--master/--worker. The plugin keys off taskname.1no matches for kind "Job" in version "batch.volcano.sh/v1alpha1": Volcano CRDs not installed; install the scheduler first (Volcano Gang Scheduler).- Low / nonsensical MFU: expected for this comms-bound loop; do not treat it as a model baseline. Profile a real step per Training MFU Regression.
References¶
- VolcanoJob (vcjob) CRD reference: https://volcano.sh/en/docs/vcjob/
- Volcano PyTorch plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
- Volcano quickstart / verify (
get vcjob,get pg): https://volcano.sh/en/docs/v1-11-0/tutorials/ - torchrun / TorchElastic (
--nnodes,--nproc_per_node,--node_rank,--rdzv_backend,--rdzv_endpoint,--max-restarts): https://docs.pytorch.org/docs/stable/elastic/run.html - PyTorch distributed (
init_process_group, NCCL backend): https://docs.pytorch.org/docs/stable/distributed.html - NCCL environment variables (
NCCL_DEBUG,NCCL_IB_HCA,NCCL_NET_GDR_LEVEL): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html - MFU definition (Megatron-LM, arxiv 1909.08053): https://arxiv.org/abs/1909.08053
- NVIDIA H100 datasheet (dense BF16 peak — confirm for your SKU): https://resources.nvidia.com/en-us-tensor-core
Related: Workload Bring-Up · Volcano Job · Volcano Scheduler (Helm) · Distributed Training Recipes · FSDP · MFU Regression · Smoke Tests · GPU Health Gating · SLO/SLI Catalog · Glossary
-
Volcano PyTorch plugin injects
MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, args--master/--worker/--port(default23456), force-enablessvc: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md ↩↩ -
Verify with
kubectl get vcjob <name>,kubectl get pod -l volcano.sh/job-name=<name>, PodGroupphase: Running/running: N: https://volcano.sh/en/docs/v1-11-0/tutorials/ ↩↩ -
NCCL_DEBUG=INFOprintsnranks, transport (NET/IBvsNET/Socket), and[GDRDMA]when GPUDirect RDMA is engaged: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html ↩ -
MFU = achieved model FLOP/s ÷ hardware peak FLOP/s for the dtype (Megatron-LM, arxiv 1909.08053): https://arxiv.org/abs/1909.08053 ↩
-
NVIDIA H100 published dense BF16 peak (~989 TFLOP/s SXM, sparsity off) — verify the exact figure for your SKU and clocks: https://resources.nvidia.com/en-us-tensor-core ↩