
Distributed training as a platform service

Scope: running elastic, multi-worker distributed training as a managed service on a Kubernetes GPU cluster. This is the platform layer that schedules workers, brings up a rendezvous, surfaces fabric topology to each rank, and survives workers joining and leaving. It is the operational substrate around the training run, not the parallelism algorithm: FSDP/DDP/DiLoCo (see distributed training) execute inside the workers this page schedules. It is built on PyTorch Elastic (torchelastic) with an etcd rendezvous, and complements a GPU-allocation operator that drives resize.

What it is

A managed distributed-training service turns "run this N-GPU job" into a reconciled set of worker pods that form a process group, tolerate membership changes, and report health, without a human running torchrun on each node. Five moving parts:

Worker pods: each runs torchrun (the PyTorch Elastic launcher), one per node, with the model's parallelism strategy inside.
A rendezvous backend: how workers discover each other and agree on ranks/world size. etcd-v2 supports elastic membership (workers join/leave, the group re-forms); the c10d backend keeps its TCPStore on a single host (rank-0-as-master semantics), so coordination is tied to that host and it does not deliver elasticity on its own. ([PyTorch Elastic])
Rendezvous as a pod: a dedicated rendezvous endpoint (an etcd pod, or a TCPStore pod) per job, fronted by a headless Service, so membership coordination is decoupled from any single worker.
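An init container that injects topology: takes the node name from the downward API, reads that node's labels (GPU model, provider, region, fabric) from the Kubernetes API, and writes them to a shared volume the worker entrypoint sources, so NCCL and the training code know where they are running. The downward API exposes pod fields and the node name, not node labels, so the init container's service account needs read access to Node objects.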
A control loop, typically the platform operator, that sets min/max world size, coordinates resize, and restarts failed ranks.

flowchart TB
  OP["Operator: TrainingDeployment (min/max world size)"] --> RDZV["Rendezvous pod (etcd-v2)<br/>headless Service"]
  OP --> W1["Worker pod 0"]
  OP --> W2["Worker pod 1"]
  OP --> WN["Worker pod N"]
  subgraph WORKER["Each worker pod"]
    INIT["init container:<br/>resolve node labels → /etc/inject/env"] --> RUN["torchrun --rdzv-backend=etcd-v2<br/>--nnodes=min:max --max-restarts=K"]
  end
  W1 -.->|"register rank, watch membership"| RDZV
  W2 -.-> RDZV
  WN -.-> RDZV
  RDZV -->|"world re-forms on join/leave"| RUN

Why it matters

The reason to run training as a service rather than torchrun by hand is churn and scale. On spot, preemptible, rented, or geo-distributed capacity, workers will come and go: a node is reclaimed, a tunnel flaps, a GPU faults. Four properties follow:

Elasticity. With an elastic rendezvous, a lost worker does not kill the job; the group re-forms at a smaller world size and continues, then grows again when capacity returns. This is the difference between a spot reclaim costing a re-form plus the steps since the last checkpoint versus the whole run.
Fault tolerance. When a rank fails, torchrun restarts the worker processes and re-runs the rendezvous, up to --max-restarts times. Restarted workers start the training script from the top, so it is checkpointing (checkpoint recovery) that lets the run survive the hardware faults that are routine at scale (goodput / ETTR).
Topology awareness. NCCL performance depends on knowing the fabric. Injecting node topology into every worker lets the collective pick the right transport (NCCL collectives) instead of guessing, which matters most on a heterogeneous, overlay-stitched fleet.
Managed lifecycle. Declarative resize, gang admission, and per-job rendezvous come from the platform, not from operators SSHing into nodes.
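
Because a re-form restarts worker processes, the training script has to be resumable for fault tolerance to pay off. A minimal sketch of that pattern, using only the standard library and a step counter in place of real model and optimizer state. The function names, the JSON checkpoint format, and the `fail_at` hook are illustrative assumptions, not part of PyTorch Elastic:

```python
import json
import os
import tempfile


def load_step(path):
    """Return the last checkpointed step, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def save_step(path, step):
    """Write the checkpoint to a temp file, then rename it into place.

    os.replace is atomic on the same filesystem, so a worker killed
    mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def train(path, total_steps, every, fail_at=None):
    """Resume from the last checkpoint and run to total_steps.

    Returns the list of steps executed in this invocation. fail_at is a
    hypothetical hook that simulates losing the worker at that step.
    """
    executed = []
    step = load_step(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated worker loss")
        step += 1                      # stand-in for one optimizer step
        executed.append(step)
        if step % every == 0:
            save_step(path, step)
    return executed
```

In a real job the checkpoint would hold model, optimizer, and data-loader state on shared storage; the point of the sketch is only that every start of the script goes through `load_step` first, so a restart repeats at most `every` steps.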

When to use it (and when not)

Use the managed-service pattern when:

You run multi-node training on Kubernetes, especially on elastic/spot/preemptible/heterogeneous/geo capacity where membership changes mid-run.
You operate many training jobs as a platform for multiple tenants and need declarative, repeatable bring-up.
The run must survive worker loss rather than restart from scratch.

Do not use it when:

The job is single-node / single-GPU: torchrun --standalone (or nothing) is simpler; the rendezvous machinery is overhead.
Slurm already runs your training. On an HPC fabric, Slurm gang-schedules and launches ranks directly (Slurm, Slurm vs Kubernetes, recipe: gang-scheduled training); layering a K8s elastic service on top is redundant.
The fleet is static and reliable (one DC, no preemption): fixed-world-size launch converges most cleanly, and elasticity buys nothing. [see DiLoCo decision]

How to build and operate it

Reference shapes, unexecuted. Pin and verify torchrun/rendezvous flags, the rendezvous backend, NCCL env vars, and the base-image stack against current PyTorch/NCCL/etcd docs for your versions; these APIs shift across releases.

1. Build a base image that contains the rendezvous client

A worker cannot use a rendezvous backend whose client library is not in its image. A working distributed-training image bundles a matched stack:

# Reference shape: pin every version against the NGC/PyTorch matrix you target.
# Base image: PyTorch + CUDA + a matched NCCL.
FROM nvcr.io/nvidia/pytorch:24.10-py3
# python-etcd is REQUIRED for the etcd-v2 rendezvous backend.
RUN pip install --no-cache-dir python-etcd
COPY entrypoint.sh /usr/local/bin/entrypoint.sh
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]

A mismatched or missing rendezvous client is the most common "workers never form a group" cause.
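
One way to catch a missing client before any GPU is scheduled is an import check run as a build-time or CI smoke test. A minimal sketch, assuming the check runs with the image's own Python interpreter; `missing_modules` is a hypothetical helper, and `etcd` is the module name that the python-etcd package installs:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import.

    find_spec only locates the module, it does not execute it, so the
    check is cheap and has no side effects.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Inside the image, `missing_modules(["torch", "etcd"])` should return an empty list; failing the build on a non-empty result turns a silent "workers never form a group" into an explicit error.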

2. Run an elastic rendezvous as its own pod, off the GPU nodes

Deploy the rendezvous (an etcd pod for etcd-v2) behind a headless Service, and schedule it onto a CPU node, not a GPU node: a rendezvous pod on a GPU node wastes an accelerator and contends for the NIC. Do not mistake a single dedicated TCPStore pod for elasticity: the c10d backend's rank-0-master semantics mean such a pod is an inert middleman unless you use etcd-v2. ([PyTorch Elastic])

3. Launch workers with elastic `torchrun`

# One per worker pod. min:max enables elastic membership; max-restarts bounds rank recovery.
torchrun \
  --rdzv-backend=etcd-v2 \
  --rdzv-endpoint="${RDZV_HOST}:2379" \
  --rdzv-id="${JOB_ID}" \
  --nnodes="${MIN_NODES}:${MAX_NODES}" \
  --nproc-per-node="${GPUS_PER_NODE}" \
  --max-restarts="${MAX_RESTARTS}" \
  train.py
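
Inside `train.py`, each process learns its place in the group from environment variables that torchrun exports. A minimal sketch of reading them; `LaunchEnv` and `read_launch_env` are illustrative names, and the variable names (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `TORCHELASTIC_RESTART_COUNT`) should be verified against the torchrun documentation for your PyTorch version:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchEnv:
    rank: int           # global rank of this process
    local_rank: int     # rank within this node, typically the GPU index
    world_size: int     # total processes in the current group
    restart_count: int  # how many times the worker group has restarted


def read_launch_env(environ=None):
    """Parse the launcher-provided variables into typed fields.

    Raises KeyError if a mandatory variable is missing, which usually
    means the script was started without torchrun.
    """
    env = os.environ if environ is None else environ
    return LaunchEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
        restart_count=int(env.get("TORCHELASTIC_RESTART_COUNT", "0")),
    )
```

Because the world size can differ after every re-form, the script should read these values at startup rather than hard-coding them, and anything derived from them (for example gradient accumulation for a fixed global batch) should be recomputed on each start.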

4. Inject node topology with an init container

Surface where each worker runs so NCCL and the training code can use it. A tiny init container resolves labels and writes an env file on a shared emptyDir; the entrypoint sources it before torchrun.

# init container (alpine + kubectl). NODE_NAME from the downward API. Degrade gracefully on missing labels.
NODE_NAME="${NODE_NAME:?}"
for k in gpu-model provider region; do
  v=$(kubectl get node "$NODE_NAME" -o "jsonpath={.metadata.labels.platform\.example\.com/$k}")
  echo "NODE_$(echo "$k" | tr 'a-z-' 'A-Z_')=${v}" >> /etc/inject/env
done

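# worker entrypoint sources it, then binds NCCL to the right interface (e.g. the overlay device on a geo fleet).
# set -a exports the sourced NODE_* variables so torchrun and the training processes inherit them.
set -a
. /etc/inject/env
set +a
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"
exec torchrun ...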
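
Since a missing label yields an empty value rather than an error, the training code can make the gap visible at startup. A minimal sketch, assuming the `NODE_*` variables written by the init container are exported into the training process's environment; `missing_topology` and `REQUIRED_TOPOLOGY_VARS` are hypothetical names:

```python
import os

# Derived from the label keys the init container resolves
# (gpu-model, provider, region).
REQUIRED_TOPOLOGY_VARS = ("NODE_GPU_MODEL", "NODE_PROVIDER", "NODE_REGION")


def missing_topology(environ=None, required=REQUIRED_TOPOLOGY_VARS):
    """Return the topology variables that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]
```

Whether a non-empty result should abort the rank or only log a warning is a policy choice; logging it once per rank is usually enough to find unlabeled nodes.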

For a WireGuard-stitched fleet, gate startup on mesh readiness (poll the rendezvous/peer keys) so torchrun does not begin before the network is up.
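
One simple form of such a gate is to block until the rendezvous endpoint accepts a TCP connection. A minimal sketch under that assumption; it checks reachability only, not mesh-wide peer health, and `wait_for_tcp` is a hypothetical helper:

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s=60.0, interval_s=1.0):
    """Poll until host:port accepts a TCP connection or timeout_s elapses.

    Returns True once a connection succeeds, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect is closed immediately by the context manager.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

An entrypoint could call this with the rendezvous host and port before handing off to torchrun, and exit non-zero on `False` so Kubernetes restarts the pod instead of leaving torchrun to time out inside the rendezvous.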

5. Gang-schedule and spread

A partially-scheduled training job wastes the GPUs it did get while waiting for the rest. Use Volcano or Kueue for gang/all-or-nothing admission, and topology-aware scheduling so ranks land close on the fabric. See recipe: gang-scheduled training.

6. Monitor the platform-specific signals

Beyond loss curves, alert on the operational health of the service (observability & monitoring):

Rendezvous pod on a GPU node: wasted accelerator; alert and reschedule.
World size below min: the elastic group cannot reach the floor, so training is stalled while the job still holds GPUs.
Rank-restart storm: a flaky node or process bug restarting ranks faster than progress; a node to cordon (health gating).
Rendezvous flapping: restart loops from resize storms or repeated rank loss.
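
Two of these signals reduce to small checks that a controller or exporter could evaluate. A minimal sketch; the class and function names, the thresholds, and the idea of feeding restart timestamps in from pod events are all assumptions for illustration:

```python
from collections import deque


class RestartStormDetector:
    """Flags a storm when more than max_restarts occur within window_s seconds."""

    def __init__(self, max_restarts, window_s):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._events = deque()

    def record(self, now_s):
        """Record one restart at time now_s; return True if the job is storming."""
        self._events.append(now_s)
        # Drop restarts that have aged out of the sliding window.
        while self._events and self._events[0] <= now_s - self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_restarts


def below_min(live_nodes, min_nodes):
    """True when the group cannot reach its elastic floor."""
    return live_nodes < min_nodes
```

For example, `RestartStormDetector(max_restarts=2, window_s=60)` stays quiet for two restarts in a minute and fires on the third; a node that triggers it repeatedly is a candidate for cordoning.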

Failure modes

Inert rendezvous. A lone TCPStore pod treated as elastic; the c10d rank-0-master design means it does nothing. Use etcd-v2 for elasticity.
Rendezvous on a GPU node. Burns an accelerator and contends for the NIC. Pin it to CPU nodes.
Missing rendezvous client in the image. Workers never form a group. Bundle python-etcd (etcd-v2) in the base image.
Topology-blind workers. With no label injection, NCCL guesses the transport, giving slow or failing collectives, especially on overlay/heterogeneous fabrics.
Below-min stall. Elastic floor unmet; the job sits idle holding GPUs. Alert and either scale up or lower min.
Rank-restart / world-size thrash. A flaky node churns membership; the group re-forms repeatedly and makes no progress. Cordon the node; bound max-restarts.
No gang scheduling. Half the workers run and idle while the rest pend. Gang-admit.

Open questions & validation

Measured re-form time after a worker loss at your scale, and step-loss per spot reclaim: is elasticity actually cheaper than fixed-size + checkpoint-restart for your churn rate?
Rendezvous backend choice (etcd-v2 vs alternatives) under real membership churn; etcd sizing for many concurrent jobs.
min/max/max-restarts tuned to your preemption rate so the group neither stalls nor thrashes.
Label-injection coverage: every node carries the topology labels NCCL needs (missing labels degrade silently).
Interaction with checkpointing cadence (checkpoint recovery) so a re-form resumes, not restarts.
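
The first question can be framed as a back-of-envelope comparison before any measurement. The sketch below is one possible accounting, not a validated model. It assumes a reclaim lands uniformly within a checkpoint interval (so half an interval of work is redone on average), that a fixed-size job idles until a replacement node arrives, and that an elastic job re-forms twice per reclaim (once on leave, once on rejoin) and runs one node short in between. All names and parameters are hypothetical:

```python
def lost_seconds_fixed(ckpt_interval_s, replacement_wait_s, restart_s):
    """Expected time lost per reclaim for fixed-size launch plus checkpoint-restart."""
    rework = ckpt_interval_s / 2          # average redo since the last checkpoint
    return rework + replacement_wait_s + restart_s


def lost_seconds_elastic(ckpt_interval_s, replacement_wait_s, reform_s,
                         n_nodes, reforms=2):
    """Expected time lost per reclaim for an elastic group of n_nodes."""
    rework = ckpt_interval_s / 2          # each re-form resumes from a checkpoint
    degraded = replacement_wait_s / n_nodes  # throughput lost while one node short
    return reforms * (rework + reform_s) + degraded
```

With a 600 s checkpoint interval, a 900 s wait for replacement capacity, a 120 s restart, a 60 s re-form, and 8 nodes, this accounting gives 1320 s for fixed-size and 832.5 s for elastic; replacing the inputs with measured values from your own fleet is the actual validation step.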

References

PyTorch Elastic — torchrun, rendezvous backends (etcd-v2, c10d), elastic launch: https://pytorch.org/docs/stable/elastic/run.html
PyTorch Elastic — rendezvous design and membership: https://pytorch.org/docs/stable/elastic/rendezvous.html
etcd (rendezvous backing store): https://etcd.io/docs/
NVIDIA NGC PyTorch container (base-image stack): https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
NCCL environment variables (NCCL_SOCKET_IFNAME, topology): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Kubernetes Downward API (node/pod metadata into containers): https://kubernetes.io/docs/concepts/workloads/pods/downward-api/