Distributed training as a platform service¶
Scope: running elastic, multi-worker distributed training as a managed service on a Kubernetes GPU cluster, the platform layer that schedules workers, brings up a rendezvous, surfaces fabric topology to each rank, and survives workers joining and leaving. This is the operational substrate around the training run, not the parallelism algorithm: FSDP/DDP/DiLoCo (see distributed training) execute inside the workers this page schedules. Built on PyTorch Elastic (torchelastic) + an etcd rendezvous; complements a GPU-allocation operator that drives resize.
What it is¶
A managed distributed-training service turns "run this N-GPU job" into a reconciled set of worker pods that form a process group, tolerate membership changes, and report health, without a human running torchrun on each node. Five moving parts:
- Worker pods: each runs
torchrun(the PyTorch Elastic launcher), one per node, with the model's parallelism strategy inside. - A rendezvous backend: how workers discover each other and agree on ranks/world size.
etcd-v2supports elastic membership (workers join/leave, the group re-forms); the defaultc10d/TCPStore backend uses rank-0-as-master semantics and does not deliver elasticity on its own. ([PyTorch Elastic]) - Rendezvous as a pod: a dedicated rendezvous endpoint (an
etcdpod, or a TCPStore pod) per job, fronted by a headless Service, so membership coordination is decoupled from any single worker. - An init container that injects topology: resolves each node's labels (GPU model, provider, region, fabric) via the downward API and writes them to a shared volume the worker entrypoint sources, so NCCL and the training code know where they are running.
- A control loop, typically the platform operator, that sets
min/maxworld size, coordinates resize, and restarts failed ranks.
flowchart TB
OP["Operator: TrainingDeployment (min/max world size)"] --> RDZV["Rendezvous pod (etcd-v2)<br/>headless Service"]
OP --> W1["Worker pod 0"]
OP --> W2["Worker pod 1"]
OP --> WN["Worker pod N"]
subgraph WORKER["Each worker pod"]
INIT["init container:<br/>resolve node labels → /etc/env"] --> RUN["torchrun --rdzv-backend=etcd-v2<br/>--nnodes=min:max --max-restarts=K"]
end
W1 -.->|"register rank, watch membership"| RDZV
W2 -.-> RDZV
WN -.-> RDZV
RDZV -->|"world re-forms on join/leave"| RUN
Why it matters¶
The reason to run training as a service rather than torchrun by hand is churn and scale. On spot, preemptible, rented, or geo-distributed capacity, workers will come and go: a node is reclaimed, a tunnel flaps, a GPU faults. Four properties follow:
- Elasticity. With an elastic rendezvous, a lost worker does not kill the job; the group re-forms at a smaller world size and continues, then grows again when capacity returns. This is the difference between a spot reclaim costing one step versus the whole run.
- Fault tolerance.
torchrunrestarts a failed rank from the last rendezvous; combined with checkpointing (checkpoint recovery) the run survives hardware faults that are routine at scale (goodput / ETTR). - Topology awareness. NCCL performance depends on knowing the fabric. Injecting node topology into every worker lets the collective pick the right transport (NCCL collectives) instead of guessing, decisive on a heterogeneous, overlay-stitched fleet.
- Managed lifecycle. Declarative resize, gang admission, and per-job rendezvous come from the platform, not from operators SSHing into nodes.
When to use it (and when not)¶
Use the managed-service pattern when:
- You run multi-node training on Kubernetes, especially on elastic/spot/preemptible/heterogeneous/geo capacity where membership changes mid-run.
- You operate many training jobs as a platform for multiple tenants and need declarative, repeatable bring-up.
- The run must survive worker loss rather than restart from scratch.
Do not use it when:
- The job is single-node / single-GPU:
torchrun --standalone(or nothing) is simpler; the rendezvous machinery is overhead. - Slurm already runs your training. On an HPC fabric, Slurm gang-schedules and launches ranks directly (Slurm, Slurm vs Kubernetes, recipe: gang-scheduled training); layering a K8s elastic service on top is redundant.
- The fleet is static and reliable (one DC, no preemption): fixed-world-size launch converges most cleanly, and elasticity buys nothing. [see DiLoCo decision]
How to build and operate it¶
Reference shapes, unexecuted. Pin and verify torchrun/rendezvous flags, the rendezvous backend, NCCL env vars, and the base-image stack against current PyTorch/NCCL/etcd docs for your versions; these APIs shift across releases.
1. Build a base image that contains the rendezvous client¶
A worker cannot use a rendezvous backend whose client library is not in its image. A working distributed-training image bundles a matched stack:
# Reference shape — pin every version against the NGC/PyTorch matrix you target.
FROM nvcr.io/nvidia/pytorch:24.10-py3 # PyTorch + CUDA + a matched NCCL
RUN pip install --no-cache-dir python-etcd # REQUIRED for the etcd-v2 rendezvous backend
COPY entrypoint.sh /usr/local/bin/entrypoint.sh
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
A mismatched or missing rendezvous client is the most common "workers never form a group" cause.
2. Run an elastic rendezvous as its own pod, off the GPU nodes¶
Deploy the rendezvous (an etcd pod for etcd-v2) behind a headless Service, and schedule it onto a CPU node, not a GPU node: a rendezvous pod on a GPU node wastes an accelerator and contends for the NIC. Do not mistake a single dedicated TCPStore pod for elasticity: the c10d backend's rank-0-master semantics mean such a pod is an inert middleman unless you use etcd-v2. ([PyTorch Elastic])
3. Launch workers with elastic torchrun¶
# One per worker pod. min:max enables elastic membership; max-restarts bounds rank recovery.
torchrun \
--rdzv-backend=etcd-v2 \
--rdzv-endpoint="${RDZV_HOST}:2379" \
--rdzv-id="${JOB_ID}" \
--nnodes="${MIN_NODES}:${MAX_NODES}" \
--nproc-per-node="${GPUS_PER_NODE}" \
--max-restarts="${MAX_RESTARTS}" \
train.py
4. Inject node topology with an init container¶
Surface where each worker runs so NCCL and the training code can use it. A tiny init container resolves labels and writes an env file on a shared emptyDir; the entrypoint sources it before torchrun.
# init container (alpine + kubectl). NODE_NAME from the downward API. Degrade gracefully on missing labels.
NODE_NAME="${NODE_NAME:?}"
for k in gpu-model provider region; do
v=$(kubectl get node "$NODE_NAME" -o "jsonpath={.metadata.labels.platform\.example\.com/$k}")
echo "NODE_$(echo "$k" | tr 'a-z-' 'A-Z_')=${v}" >> /etc/inject/env
done
# worker entrypoint sources it, then binds NCCL to the right interface (e.g. the overlay device on a geo fleet).
. /etc/inject/env
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"
exec torchrun ...
For a WireGuard-stitched fleet, gate startup on mesh readiness (poll the rendezvous/peer keys) so torchrun does not begin before the network is up.
5. Gang-schedule and spread¶
A partially-scheduled training job wastes the GPUs it did get while waiting for the rest. Use Volcano or Kueue for gang/all-or-nothing admission, and topology-aware scheduling so ranks land close on the fabric. See recipe: gang-scheduled training.
6. Monitor the platform-specific signals¶
Beyond loss curves, alert on the operational health of the service (observability & monitoring):
- Rendezvous pod on a GPU node: wasted accelerator; alert and reschedule.
- World size below
min: the elastic group cannot reach the floor; training is stalled, not progressing. - Rank-restart storm: a flaky node or process bug restarting ranks faster than progress; a node to cordon (health gating).
- Rendezvous flapping: restart loops from resize storms or repeated rank loss.
Failure modes¶
- Inert rendezvous. A lone TCPStore pod treated as elastic; the
c10drank-0-master design means it does nothing. Useetcd-v2for elasticity. - Rendezvous on a GPU node. Burns an accelerator and contends the NIC. Pin it to CPU nodes.
- Missing rendezvous client in the image. Workers never form a group. Bundle
python-etcd(etcd-v2) in the base image. - Topology-blind workers. With no label injection, NCCL guesses the transport, giving slow or failing collectives, especially on overlay/heterogeneous fabrics.
- Below-min stall. Elastic floor unmet; the job sits idle holding GPUs. Alert and either scale up or lower
min. - Rank-restart / world-size thrash. A flaky node churns membership; the group re-forms repeatedly and makes no progress. Cordon the node; bound
max-restarts. - No gang scheduling. Half the workers run and idle while the rest pend. Gang-admit.
Open questions & validation¶
- Measured re-form time after a worker loss at your scale, and step-loss per spot reclaim: is elasticity actually cheaper than fixed-size + checkpoint-restart for your churn rate?
- Rendezvous backend choice (
etcd-v2vs alternatives) under real membership churn; etcd sizing for many concurrent jobs. min/max/max-restartstuned to your preemption rate so the group neither stalls nor thrashes.- Label-injection coverage: every node carries the topology labels NCCL needs (missing labels degrade silently).
- Interaction with checkpointing cadence (checkpoint recovery) so a re-form resumes, not restarts.
References¶
- PyTorch Elastic —
torchrun, rendezvous backends (etcd-v2,c10d), elastic launch: https://pytorch.org/docs/stable/elastic/run.html - PyTorch Elastic — rendezvous design and membership: https://pytorch.org/docs/stable/elastic/rendezvous.html
- etcd (rendezvous backing store): https://etcd.io/docs/
- NVIDIA NGC PyTorch container (base-image stack): https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
- NCCL environment variables (
NCCL_SOCKET_IFNAME, topology): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html - Kubernetes Downward API (node/pod metadata into containers): https://kubernetes.io/docs/concepts/workloads/pods/downward-api/
Related: Distributed Training Platform · Kubernetes Operator for GPU Allocation · Recipe: Gang-Scheduled Training · Overlay & Mesh Networking · NCCL Collectives & Algorithms · Topology-Aware K8s Scheduling · Helm: Volcano Scheduler · Checkpoint Recovery · Glossary