Recipe: gang-scheduled distributed training

Scope: a standalone recipe to launch a gang-scheduled distributed-training smoke job (Volcano Job + torchrun): the manifest, apply/verify, an MFU sanity check, and the two failure modes that bite first (partial-gang deadlock and an NCCL stall). The applied, copy-paste counterpart to the training step in workload bring-up recipes; the field-by-field CRD reference lives in Volcano Job.

Reference templates from upstream Volcano / PyTorch docs. Nothing here was executed on hardware. Pin every image to a tested digest, substitute real namespace / queue / RDMA resource names, and run on one node-pair before a fleet roll.

flowchart LR
  APPLY["kubectl apply vcjob"] --> PG["PodGroup (minAvailable=N)"]
  PG --> GANG{"all N pods<br/>placeable now?"}
  GANG -->|"no"| HOLD["hold: nothing binds<br/>(no GPUs held idle)"]
  GANG -->|"yes"| BIND["bind all pods together"]
  BIND --> RDZV["torchrun c10d rendezvous"]
  RDZV --> NCCL{"NCCL ring forms?"}
  NCCL -->|"timeout"| FAIL["NCCL stall: check IB / GDR"]
  NCCL -->|"yes"| STEP["steps run; read MFU"]

What it is

A one-shot distributed-training smoke job: prove that the scheduler gang-places every rank, that torchrun rendezvous succeeds, that NCCL forms a collective over the fabric, and that the run hits a sane MFU, before anyone trusts a multi-day training run on the cluster. It is step 8 of the end-to-end bring-up and reuses the gang primitive from Volcano Job.

The job is a Volcano Job (batch.volcano.sh/v1alpha1). Volcano compiles it into a PodGroup and treats minAvailable as a gang gate: pods bind only when all minAvailable of them can be placed in the same scheduling cycle, all or nothing. That is exactly the semantics a torchrun collective needs; a half-placed job is a deadlocked job pinning idle GPUs. The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK, and force-enables the svc plugin (headless Service + per-pod DNS) so workers can find rank 0. [1]

Why it matters

The default Kubernetes scheduler places pods one at a time. A 16-GPU job (one GPU per pod) whose 13th pod cannot place leaves 12 pods Running, holding 12 GPUs idle and blocked on a barrier that will never complete. That is a silent deadlock that burns GPU-hours (workload bring-up recipes). Gang scheduling converts it into an honest Pending: no pod binds, no GPU is wasted, and the failure is legible. Running this smoke job first also isolates layers: if the fabric is bad, you see it here as an NCCL stall, not three weeks later as a misread "model" problem (fabric bring-up and benchmarking).
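The contrast in the 16-GPU example can be sketched as a toy model. This is not Volcano's scheduling algorithm: `place_one_at_a_time` and `place_gang` are hypothetical names, every pod is assumed to request exactly one GPU, and a single "free GPUs" number stands in for everything a real scheduler weighs.

```python
# Toy model of the two placement policies; NOT Volcano's actual algorithm.
# Assumption: every pod requests exactly one GPU.

def place_one_at_a_time(pods: int, free_gpus: int) -> dict:
    """Per-pod placement: bind each pod as soon as it fits."""
    running = min(pods, free_gpus)
    # If any pod is left Pending, the Running ones wait on it and idle.
    held_idle = running if running < pods else 0
    return {"running": running, "pending": pods - running,
            "gpus_held_idle": held_idle}

def place_gang(pods: int, free_gpus: int, min_available: int) -> dict:
    """Gang placement: bind nothing unless min_available pods fit at once."""
    if free_gpus < min_available:
        return {"running": 0, "pending": pods, "gpus_held_idle": 0}
    running = min(pods, free_gpus)
    return {"running": running, "pending": pods - running,
            "gpus_held_idle": 0}

if __name__ == "__main__":
    # 16 single-GPU pods, 12 GPUs free: the 13th pod cannot place.
    print(place_one_at_a_time(16, 12))
    print(place_gang(16, 12, min_available=16))
```

With 12 free GPUs the per-pod policy reports 12 Running pods holding 12 idle GPUs, while the gang policy reports all 16 Pending and nothing held.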

When it is needed (and when not)

  • Needed: first workload on a new cluster or node-pool; after a fabric, driver, or Volcano change; as the gate before a real FSDP / DiLoCo run (distributed training recipes).
  • Needed: any multi-pod job where partial placement deadlocks (DDP, FSDP, TP/PP, MPI collectives).
  • Not needed: a single-pod, single-node job (nproc_per_node covers the GPUs on one host; no gang gate required). Use a plain Job/PyTorchJob.
  • Not the first test: validate the fabric with nccl-tests and the GPUs with dcgmi diag before this (Smoke Tests: GPU Platform, GPU Health Gating). A failed NCCL collective here is ambiguous until the fabric is known-good.

How: implement, integrate, maintain

1. The manifest

Two nodes, 8 GPUs each (16 ranks). minAvailable: 2 gates both pods together; pytorch plugin wires rendezvous; an RDMA resource keeps NCCL off TCP (NicClusterPolicy). Save as gang-smoke.yaml.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-smoke
  namespace: ml
spec:
  minAvailable: 2                 # gang gate: both pods place or neither does
  schedulerName: volcano          # MUST be volcano for gang semantics
  queue: default
  maxRetry: 3
  plugins:
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob          # any rank evicted -> restart the whole gang
  tasks:
    - name: master                # RANK 0; rendezvous endpoint
      replicas: 1
      policies:
        - event: TaskCompleted    # rank 0 exits -> end the Job, don't orphan workers
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - &trainer
              name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a tested digest
              command: ["torchrun"]
              args:
                - "--nnodes=2"
                - "--nproc_per_node=8"
                - "--node_rank=$(RANK)"             # injected by pytorch plugin
                - "--rdzv_backend=c10d"
                - "--rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)"
                - "--max-restarts=3"
                - "/workspace/smoke_train.py"
              env:
                - { name: NCCL_DEBUG, value: "INFO" }
                - { name: NCCL_IB_HCA, value: "mlx5" }      # match your HCA
                - { name: NCCL_NET_GDR_LEVEL, value: "SYS" }
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1              # confirm the real key
    - name: worker                # RANK 1
      replicas: 1
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - <<: *trainer

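The numbers in the manifest have to agree with each other: the gang size with the pod count, `--nnodes` with the pod count, and `--nproc_per_node` with the GPU limit. A minimal sketch of that cross-check follows. `check_gang_spec` is a hypothetical helper, its inputs are copied from the manifest by hand rather than parsed, and the "minAvailable equals total pods" rule is an assumption that fits this fixed-size recipe only.

```python
# Hypothetical pre-apply sanity check; values are typed in from the manifest.
# Assumption: a fixed-size gang, so minAvailable should equal the pod count.

def check_gang_spec(min_available, task_replicas, nnodes,
                    nproc_per_node, gpus_per_pod):
    """Return (problems, expected_world_size) for a fixed-size gang job."""
    problems = []
    total_pods = sum(task_replicas.values())
    if min_available != total_pods:
        problems.append(
            f"minAvailable={min_available} but tasks define {total_pods} pods")
    if nnodes != total_pods:
        problems.append(
            f"--nnodes={nnodes} but tasks define {total_pods} pods")
    if nproc_per_node != gpus_per_pod:
        problems.append(
            f"--nproc_per_node={nproc_per_node} but each pod requests "
            f"{gpus_per_pod} GPUs")
    return problems, nnodes * nproc_per_node

if __name__ == "__main__":
    problems, world = check_gang_spec(
        min_available=2,
        task_replicas={"master": 1, "worker": 1},
        nnodes=2, nproc_per_node=8, gpus_per_pod=8)
    print("expected world size:", world)
    print("problems:", problems or "none")
```

For the manifest above this yields an expected world size of 16 and no problems; that 16 is the value to look for after `nranks` in the NCCL log.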
A minimal, self-contained training body: pure NCCL all-reduce loop, no dataset, prints throughput. Bake it into the image at /workspace/smoke_train.py:

# smoke_train.py: minimal multi-node NCCL smoke; reports achieved TFLOP/s per GPU
import os, time, torch, torch.distributed as dist

local = int(os.environ["LOCAL_RANK"])                # set by torchrun
torch.cuda.set_device(local)                         # bind the device before NCCL init
dist.init_process_group("nccl")                      # reads RANK/WORLD_SIZE/MASTER_*
rank, world = dist.get_rank(), dist.get_world_size()

HID, STEPS, WARMUP = 8192, 50, 5
w = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)
x = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)

def step():
    y = x @ w                                        # compute
    dist.all_reduce(y)                               # collective over the fabric

for _ in range(WARMUP):                              # untimed: absorbs lazy CUDA/NCCL setup
    step()
torch.cuda.synchronize(); dist.barrier(); t0 = time.perf_counter()
for _ in range(STEPS):
    step()
torch.cuda.synchronize(); dist.barrier(); dt = time.perf_counter() - t0

# Every rank runs one HID x HID matmul per step: 2*HID^3 FLOPs per step, per GPU.
# This is already a per-GPU count, so it is not divided by world size.
flops_per_gpu = STEPS * 2 * HID ** 3
if rank == 0:
    print(f"world={world} steps={STEPS} wall={dt:.2f}s "
          f"per_gpu_TFLOPs={flops_per_gpu / dt / 1e12:.1f}")
dist.destroy_process_group()
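To see why this loop is dominated by communication, it helps to put numbers on one step. The sketch below is plain arithmetic and needs no GPU. `matmul_flops` and `allreduce_payload_bytes` are hypothetical helper names, and the 2-bytes-per-element figure assumes bfloat16 as in the script.

```python
# Back-of-envelope accounting for one step of the smoke loop (no GPU needed).

def matmul_flops(n: int) -> int:
    """FLOPs for one dense (n x n) @ (n x n) matmul: n^3 multiply-adds = 2*n^3."""
    return 2 * n ** 3

def allreduce_payload_bytes(n: int, bytes_per_element: int = 2) -> int:
    """Size of the n x n tensor handed to all_reduce (bfloat16 = 2 bytes)."""
    return n * n * bytes_per_element

if __name__ == "__main__":
    HID, STEPS = 8192, 50
    per_step = matmul_flops(HID)
    payload = allreduce_payload_bytes(HID)
    print(f"FLOPs per step per rank: {per_step:.3e}")
    print(f"FLOPs per run per rank:  {per_step * STEPS:.3e}")
    print(f"all_reduce payload:      {payload / 2**20:.0f} MiB per step")
```

Each rank does about 1.1 TFLOP of matmul per step and then all-reduces a 128 MiB tensor, so the collective can easily take longer than the compute it follows.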

2. Apply and verify

kubectl apply -f gang-smoke.yaml

# Job exists under the Volcano API group, moving Pending -> Running -> Completed:
kubectl get vcjob -n ml gang-smoke

# PodGroup carries the gang gate; "running" must reach minAvailable:
kubectl get podgroup -n ml -o wide                 # short name: pg

# All ranks of the job (label set by Volcano) bind in the SAME cycle:
kubectl get pod -n ml -l volcano.sh/job-name=gang-smoke -o wide

# Rendezvous + collective in rank-0 logs:
kubectl logs -n ml -l volcano.sh/job-name=gang-smoke,volcano.sh/task-spec=master -f

Pass signals:

  • vcjob STATUS reaches Running with MIN = minAvailable, then Completed. [2]
  • pg shows phase: Running and running: 2 once the gang binds. [2]
  • Both pods are Running together. Some Running, some Pending = the gang did not engage; check schedulerName on Job and task templates.
  • Rank-0 log shows NCCL INFO ... comm ... nranks 16, and the transport is NET/IB (not NET/Socket), with [GDRDMA] when GPUDirect RDMA is engaged. [3]
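The last signal can be checked mechanically by scanning a saved rank-0 log for the tokens named above. This is a heuristic sketch, not an NCCL log parser: `summarize_nccl_log` and `log_passes` are hypothetical names, it matches only the substrings `nranks`, `NET/IB`, `NET/Socket`, and `GDRDMA`, and the sample lines are written for this example rather than captured from a run. Exact NCCL log formatting varies by version.

```python
import re

def summarize_nccl_log(text: str) -> dict:
    """Pull the pass/fail tokens out of NCCL_DEBUG=INFO output (heuristic)."""
    nranks = sorted({int(n) for n in re.findall(r"\bnranks (\d+)", text)})
    return {
        "nranks": nranks,
        "uses_ib": "NET/IB" in text,
        "uses_socket": "NET/Socket" in text,
        "gdrdma": "GDRDMA" in text,
    }

def log_passes(summary: dict, expected_nranks: int) -> bool:
    """Pass = the expected communicator size, on IB, with no socket fallback."""
    return (summary["nranks"] == [expected_nranks]
            and summary["uses_ib"]
            and not summary["uses_socket"])

if __name__ == "__main__":
    # Illustrative lines only; NOT real output from any cluster.
    sample = (
        "NCCL INFO comm 0x0 rank 0 nranks 16 cudaDev 0\n"
        "NCCL INFO Channel 00 : 0 -> 8 via NET/IB/0/GDRDMA\n"
    )
    summary = summarize_nccl_log(sample)
    print(summary, "PASS" if log_passes(summary, 16) else "FAIL")
```

Feed it the output of `kubectl logs` saved to a file. A FAIL with `uses_socket` set to True points at the TCP fallback described under Failure modes.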

3. MFU sanity check

The script prints per_gpu_TFLOPs. MFU is achieved FLOP/s divided by the GPU's dense peak for that dtype: [4]

MFU = per_gpu_TFLOPs / peak_TFLOPs_for_dtype

Use the vendor's published dense peak for your GPU and dtype as the denominator (e.g. H100 SXM BF16 dense ~989 TFLOP/s per NVIDIA's datasheet; confirm the exact figure for your SKU and clocks). [5] This loop is comms-heavy and not training, so its MFU is a wiring/throughput sanity check, not a model baseline. Establish the real baseline with the actual model + parallelism + hardware and track it in the SLO/SLI catalog; regressions are worked in Training MFU Regression. Read SM-active / tensor-active from DCGM, not nvidia-smi GPU-util, for the true utilization signal (Training MFU Regression).
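The division itself is trivial, but writing it down keeps the units and the denominator honest. A minimal sketch: `mfu` is a hypothetical helper, the 989.0 constant is the approximate H100 SXM BF16 dense figure quoted above and must be replaced with the datasheet value for your SKU, and the 494.5 input is a made-up reading chosen to make the arithmetic obvious.

```python
# Approximate figure quoted in the text; confirm against your SKU's datasheet.
H100_SXM_BF16_DENSE_PEAK_TFLOPS = 989.0

def mfu(per_gpu_tflops: float, peak_tflops: float) -> float:
    """MFU = achieved TFLOP/s per GPU / dense peak TFLOP/s for the same dtype."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    if per_gpu_tflops < 0:
        raise ValueError("per_gpu_tflops must be non-negative")
    return per_gpu_tflops / peak_tflops

if __name__ == "__main__":
    reading = 494.5  # made-up per_gpu_TFLOPs value, for illustration only
    print(f"MFU = {mfu(reading, H100_SXM_BF16_DENSE_PEAK_TFLOPS):.1%}")
```

A result above 1.0 means the numerator or the denominator is wrong (sparse peak used as dense, wrong dtype, or a miscounted FLOP total), not that the cluster is fast.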

4. Maintain

  • Pin the image to a digest and roll it through GitOps; re-run this smoke job after any Volcano, driver, or fabric change as a regression gate (Smoke Tests: GPU Platform).
  • Keep the TaskCompleted -> CompleteJob policy so finished jobs don't orphan workers, and set ttlSecondsAfterFinished (not set in the manifest above) to garbage-collect them (Volcano Job).
  • Gate the cluster on this passing before promoting it to real training queues (GPU Health Gating).
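The "pin to a digest" rule is easy to enforce in CI with a string check. The sketch below assumes the common `name@sha256:<64 hex chars>` reference form. `is_digest_pinned` is a hypothetical helper, and the digest in the example is a placeholder, not a real image digest.

```python
HEX = set("0123456789abcdef")

def is_digest_pinned(image: str) -> bool:
    """True if the reference has the form <name>@sha256:<64 lowercase hex chars>."""
    name, sep, digest = image.partition("@sha256:")
    return bool(name) and bool(sep) and len(digest) == 64 and set(digest) <= HEX

if __name__ == "__main__":
    tagged = "nvcr.io/nvidia/pytorch:25.05-py3"
    pinned = "nvcr.io/nvidia/pytorch@sha256:" + "0" * 64  # placeholder digest
    for ref in (tagged, pinned):
        print(ref, "->", "pinned" if is_digest_pinned(ref) else "NOT pinned")
```

The image line in the manifest above is tag-based, so this check reports it as not pinned until the tag is swapped for a tested digest.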

Failure modes

  • Partial-gang deadlock (default scheduler): without schedulerName: volcano (on both Job and task templates), pods place one at a time. Some end up Running and some Pending; the Running ranks block at the NCCL barrier and their GPUs sit idle indefinitely. Fix: set schedulerName: volcano everywhere; Volcano then holds the whole gang Pending instead of half-placing it.
  • Gang correctly refuses: all pods Pending and pg not Running, because the gang's combined request (minAvailable pods × GPUs per pod) exceeds free GPU capacity. This is correct behaviour, not a bug. Add nodes or reduce replicas / minAvailable.
  • NCCL stall / rendezvous timeout: pods Running but rank 0 hangs after the nranks line, or logs show NET/Socket instead of NET/IB. Usually the RDMA resource was not requested (NCCL fell back to TCP) or the NCCL_IB_HCA filter does not match the installed HCAs. Confirm the RDMA resource key and HCA name; validate the fabric standalone first (fabric bring-up and benchmarking, NCCL Hang / Collective Stall).
  • MASTER_ADDR / RANK empty: pytorch plugin absent from spec.plugins, or task names don't match --master / --worker. The plugin keys off task name. [1]
  • no matches for kind "Job" in version "batch.volcano.sh/v1alpha1": Volcano CRDs not installed; install the scheduler first (Volcano Gang Scheduler).
  • Low / nonsensical MFU: expected for this comms-bound loop; do not treat it as a model baseline. Profile a real step per Training MFU Regression.
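The first three failure modes can be told apart from two observations: the pod phases and whether the log mentions NET/Socket. The sketch below turns that into a first-pass triage. `triage` is a hypothetical helper (Python 3.9+ for the type hints), it covers only the patterns listed above, and a real investigation still needs the PodGroup events and the full NCCL log.

```python
def triage(pod_phases: list[str], log_text: str = "") -> str:
    """Map pod phases plus a log excerpt to one of the failure modes above."""
    if not pod_phases:
        return "no pods: check that the vcjob was created (CRDs installed?)"
    running = pod_phases.count("Running")
    pending = pod_phases.count("Pending")
    if running and pending:
        return "partial placement: gang did not engage; check schedulerName"
    if pending == len(pod_phases):
        return "gang held: request exceeds free capacity; add nodes or shrink"
    if running == len(pod_phases) and "NET/Socket" in log_text:
        return "NCCL on TCP: check the RDMA resource key and NCCL_IB_HCA"
    return "no known failure pattern matched"

if __name__ == "__main__":
    print(triage(["Running", "Pending"]))
    print(triage(["Pending", "Pending"]))
    print(triage(["Running", "Running"], "... via NET/Socket/0"))
```

The phases come from the `kubectl get pod -l volcano.sh/job-name=gang-smoke` command in step 2.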

References

  • VolcanoJob (vcjob) CRD reference: https://volcano.sh/en/docs/vcjob/
  • Volcano PyTorch plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
  • Volcano quickstart / verify (get vcjob, get pg): https://volcano.sh/en/docs/v1-11-0/tutorials/
  • torchrun / TorchElastic (--nnodes, --nproc_per_node, --node_rank, --rdzv_backend, --rdzv_endpoint, --max-restarts): https://docs.pytorch.org/docs/stable/elastic/run.html
  • PyTorch distributed (init_process_group, NCCL backend): https://docs.pytorch.org/docs/stable/distributed.html
  • NCCL environment variables (NCCL_DEBUG, NCCL_IB_HCA, NCCL_NET_GDR_LEVEL): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
  • MFU definition (PaLM, arXiv 2204.02311, which introduced the term model FLOPs utilization): https://arxiv.org/abs/2204.02311
  • NVIDIA H100 datasheet (dense BF16 peak — confirm for your SKU): https://resources.nvidia.com/en-us-tensor-core

Related: Workload Bring-Up · Volcano Job · Volcano Scheduler (Helm) · Distributed Training Recipes · FSDP · MFU Regression · Smoke Tests · GPU Health Gating · SLO/SLI Catalog · Glossary


  1. Volcano PyTorch plugin injects MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, args --master/--worker/--port (default 23456), force-enables svc: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md 

  2. Verify with kubectl get vcjob <name>, kubectl get pod -l volcano.sh/job-name=<name>, PodGroup phase: Running / running: N: https://volcano.sh/en/docs/v1-11-0/tutorials/ 

  3. NCCL_DEBUG=INFO prints nranks, transport (NET/IB vs NET/Socket), and [GDRDMA] when GPUDirect RDMA is engaged: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html 

  4. MFU = achieved model FLOP/s ÷ hardware peak FLOP/s for the dtype (PaLM, arXiv 2204.02311): https://arxiv.org/abs/2204.02311

  5. NVIDIA H100 published dense BF16 peak (~989 TFLOP/s SXM, sparsity off) — verify the exact figure for your SKU and clocks: https://resources.nvidia.com/en-us-tensor-core