Markdown

Recipe: gang-scheduled distributed training¶

Scope: a standalone recipe to launch a gang-scheduled distributed-training smoke job (Volcano Job + torchrun): the manifest, apply/verify, an MFU sanity check, and the two failure modes that bite first (partial-gang deadlock and an NCCL stall). The applied, copy-paste counterpart to the training step in workload bring-up recipes; the field-by-field CRD reference lives in Volcano Job.

Reference templates from upstream Volcano / PyTorch docs. Nothing here was executed on hardware. Pin every image to a tested digest, substitute real namespace / queue / RDMA resource names, and run on one node-pair before a fleet roll.

flowchart LR
  APPLY["kubectl apply vcjob"] --> PG["PodGroup (minAvailable=N)"]
  PG --> GANG{"all N GPUs<br/>placeable now?"}
  GANG -->|"no"| HOLD["hold: nothing binds<br/>(no idle GPUs)"]
  GANG -->|"yes"| BIND["bind all pods together"]
  BIND --> RDZV["torchrun c10d rendezvous"]
  RDZV --> NCCL{"NCCL ring forms?"}
  NCCL -->|"timeout"| FAIL["NCCL stall: check IB / GDR"]
  NCCL -->|"yes"| STEP["steps run; read MFU"]

What it is¶

A one-shot distributed-training smoke job: prove that the scheduler gang-places every rank, that torchrun rendezvous succeeds, that NCCL forms a collective over the fabric, and that the run hits a sane MFU, before anyone trusts a multi-day training run on the cluster. It is step 8 of the end-to-end bring-up and reuses the gang primitive from Volcano Job.

The job is a Volcano Job (batch.volcano.sh/v1alpha1). Volcano compiles it into a PodGroup and treats minAvailable as a gang gate: pods bind only when all minAvailable of them can be placed in the same scheduling cycle: all-or-nothing. That is exactly the semantics a torchrun collective needs; a half-placed job is a deadlocked job pinning idle GPUs. The pytorch plugin injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK and force-enables the svc plugin (headless Service + per-pod DNS) so workers find rank 0.¹

Why it matters¶

The default Kubernetes scheduler places pods one at a time. A 16-GPU job whose 13th pod cannot place leaves 12 pods Running, holding 12 GPUs idle, blocked on a barrier that will never complete, a silent deadlock that burns GPU-hours (workload bring-up recipes). Gang scheduling converts that into an honest Pending: none bind, no GPU is wasted, and the failure is legible. Running this smoke job first also isolates layers: if the fabric is bad, you see it here as an NCCL stall, not three weeks later as a misread "model" problem (fabric bring-up and benchmarking).

When it is needed (and when not)¶

Needed: first workload on a new cluster or node-pool; after a fabric, driver, or Volcano change; as the gate before a real FSDP / DiLoCo run (distributed training recipes).
Needed: any multi-pod job where partial placement deadlocks (DDP, FSDP, TP/PP, MPI collectives).
Not needed: a single-pod, single-node job (nproc_per_node covers the GPUs on one host; no gang gate required). Use a plain Job/PyTorchJob.
Not the first test: validate the fabric with nccl-tests and the GPUs with dcgmi diag before this (Smoke Tests: GPU Platform, GPU Health Gating). A failed NCCL collective here is ambiguous until the fabric is known-good.

How: implement, integrate, maintain¶

1. The manifest¶

Two nodes, 8 GPUs each (16 ranks). minAvailable: 2 gates both pods together; pytorch plugin wires rendezvous; an RDMA resource keeps NCCL off TCP (NicClusterPolicy). Save as gang-smoke.yaml.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-smoke
  namespace: ml
spec:
  minAvailable: 2                 # gang gate: both pods place or neither does
  schedulerName: volcano          # MUST be volcano for gang semantics
  queue: default
  maxRetry: 3
  plugins:
    pytorch: ["--master=master", "--worker=worker", "--port=23456"]
  policies:
    - event: PodEvicted
      action: RestartJob          # any rank evicted -> restart the whole gang
  tasks:
    - name: master                # RANK 0; rendezvous endpoint
      replicas: 1
      policies:
        - event: TaskCompleted    # rank 0 exits -> end the Job, don't orphan workers
          action: CompleteJob
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - &trainer
              name: trainer
              image: nvcr.io/nvidia/pytorch:25.05-py3   # pin to a tested digest
              command: ["torchrun"]
              args:
                - "--nnodes=2"
                - "--nproc_per_node=8"
                - "--node_rank=$(RANK)"             # injected by pytorch plugin
                - "--rdzv_backend=c10d"
                - "--rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)"
                - "--max-restarts=3"
                - "/workspace/smoke_train.py"
              env:
                - { name: NCCL_DEBUG, value: "INFO" }
                - { name: NCCL_IB_HCA, value: "mlx5" }      # match your HCA
                - { name: NCCL_NET_GDR_LEVEL, value: "SYS" }
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1              # confirm the real key
    - name: worker                # RANK 1
      replicas: 1
      template:
        spec:
          schedulerName: volcano
          restartPolicy: OnFailure
          containers:
            - <<: *trainer

A minimal, self-contained training body: pure NCCL all-reduce loop, no dataset, prints throughput. Bake it into the image at /workspace/smoke_train.py:

# smoke_train.py — minimal multi-node NCCL smoke; reports tokens/s/GPU
import os, time, torch, torch.distributed as dist

dist.init_process_group("nccl")                      # reads RANK/WORLD_SIZE/MASTER_*
rank, world = dist.get_rank(), dist.get_world_size()
local = int(os.environ["LOCAL_RANK"])                # set by torchrun
torch.cuda.set_device(local)

HID, STEPS = 8192, 50
w = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)
x = torch.randn(HID, HID, device="cuda", dtype=torch.bfloat16)

torch.cuda.synchronize(); dist.barrier(); t0 = time.time()
for _ in range(STEPS):
    y = x @ w                                        # compute
    dist.all_reduce(y)                               # collective over the fabric
torch.cuda.synchronize(); dist.barrier(); dt = time.time() - t0

# 2 matmuls-worth of FLOPs per step (2*N^3), gathered for an MFU estimate
flops = 2 * STEPS * (2 * HID ** 3)
if rank == 0:
    print(f"world={world} steps={STEPS} wall={dt:.2f}s "
          f"per_gpu_TFLOPs={flops / dt / world / 1e12:.1f}")
dist.destroy_process_group()

2. Apply and verify¶

kubectl apply -f gang-smoke.yaml

# Job exists under the Volcano API group, moving Pending -> Running -> Completed:
kubectl get vcjob -n ml gang-smoke

# PodGroup carries the gang gate; "running" must reach minAvailable:
kubectl get podgroup -n ml -o wide                 # short name: pg

# All ranks of the job (label set by Volcano) bind in the SAME cycle:
kubectl get pod -n ml -l volcano.sh/job-name=gang-smoke -o wide

# Rendezvous + collective in rank-0 logs:
kubectl logs -n ml -l volcano.sh/job-name=gang-smoke,volcano.sh/task-spec=master -f

Pass signals:

vcjob STATUS reaches Running with MIN = minAvailable, then Completed.²
pg shows phase: Running and running: 2 once the gang binds.²
Both pods are Running together. Some Running, some Pending = the gang did not engage; check schedulerName on Job and task templates.
Rank-0 log shows NCCL INFO ... comm ... nranks 16, and over RDMA [GDRDMA] / NET/IB (not NET/Socket).³

3. MFU sanity check¶

The script prints per_gpu_TFLOPs. MFU is achieved FLOP/s divided by the GPU's dense peak for that dtype:⁴

MFU = per_gpu_TFLOPs / peak_TFLOPs_for_dtype

Use the vendor's published dense peak for your GPU and dtype as the denominator (e.g. H100 SXM BF16 dense ~989 TFLOP/s per NVIDIA's datasheet; confirm the exact figure for your SKU and clocks).⁵ This loop is comms-heavy and not training, so its MFU is a wiring/throughput sanity check, not a model baseline. Establish the real baseline with the actual model + parallelism + hardware and track it in the SLO/SLI catalog; regressions are worked in Training MFU Regression. Read SM-active / tensor-active from DCGM, not nvidia-smi GPU-util, for the true utilization signal (Training MFU Regression).

4. Maintain¶

Pin the image to a digest and roll it through GitOps; re-run this smoke job after any Volcano, driver, or fabric change as a regression gate (Smoke Tests: GPU Platform).
Keep the TaskCompleted -> CompleteJob policy so finished jobs don't orphan workers, and ttlSecondsAfterFinished to GC them (Volcano Job).
Gate the cluster on this passing before promoting it to real training queues (GPU Health Gating).

Failure modes¶

Partial-gang deadlock (default scheduler): without schedulerName: volcano (on both Job and task templates), pods place one at a time: some Running, some Pending, ranks blocked at the NCCL barrier, GPUs idle forever. Fix: set schedulerName: volcano everywhere; Volcano then holds the whole gang Pending instead of half-placing it.
Gang correctly refuses: all pods Pending, pg not Running: minAvailable exceeds free GPU capacity. This is correct behaviour, not a bug. Add nodes or reduce replicas / minAvailable.
NCCL stall / rendezvous timeout: pods Running but rank 0 hangs after nranks, or logs show NET/Socket instead of NET/IB. Usually the RDMA resource was not requested (NCCL fell back to TCP) or the wrong NCCL_IB_HCA filter. Confirm the RDMA resource key and HCA name; validate the fabric standalone first (fabric bring-up and benchmarking, NCCL Hang / Collective Stall).
MASTER_ADDR / RANK empty: pytorch plugin absent from spec.plugins, or task names don't match --master / --worker. The plugin keys off task name.¹
no matches for kind "Job" in version "batch.volcano.sh/v1alpha1": Volcano CRDs not installed; install the scheduler first (Volcano Gang Scheduler).
Low / nonsensical MFU: expected for this comms-bound loop; do not treat it as a model baseline. Profile a real step per Training MFU Regression.

References¶

VolcanoJob (vcjob) CRD reference: https://volcano.sh/en/docs/vcjob/
Volcano PyTorch plugin user guide: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md
Volcano quickstart / verify (get vcjob, get pg): https://volcano.sh/en/docs/v1-11-0/tutorials/
torchrun / TorchElastic (--nnodes, --nproc_per_node, --node_rank, --rdzv_backend, --rdzv_endpoint, --max-restarts): https://docs.pytorch.org/docs/stable/elastic/run.html
PyTorch distributed (init_process_group, NCCL backend): https://docs.pytorch.org/docs/stable/distributed.html
NCCL environment variables (NCCL_DEBUG, NCCL_IB_HCA, NCCL_NET_GDR_LEVEL): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
MFU definition (Megatron-LM, arxiv 1909.08053): https://arxiv.org/abs/1909.08053
NVIDIA H100 datasheet (dense BF16 peak — confirm for your SKU): https://resources.nvidia.com/en-us-tensor-core

Volcano PyTorch plugin injects MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK, args --master/--worker/--port (default 23456), force-enables svc: https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md ↩↩
Verify with kubectl get vcjob <name>, kubectl get pod -l volcano.sh/job-name=<name>, PodGroup phase: Running / running: N: https://volcano.sh/en/docs/v1-11-0/tutorials/ ↩↩
NCCL_DEBUG=INFO prints nranks, transport (NET/IB vs NET/Socket), and [GDRDMA] when GPUDirect RDMA is engaged: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html ↩
MFU = achieved model FLOP/s ÷ hardware peak FLOP/s for the dtype (Megatron-LM, arxiv 1909.08053): https://arxiv.org/abs/1909.08053 ↩
NVIDIA H100 published dense BF16 peak (~989 TFLOP/s SXM, sparsity off) — verify the exact figure for your SKU and clocks: https://resources.nvidia.com/en-us-tensor-core ↩