Distributed training recipes (FSDP · DiLoCo)¶
Scope: an index / decision page for the two scale-out paradigms, FSDP for high-bandwidth single-DC sharding and DiLoCo for low-communication / geo-distributed training. It frames the choice (which paradigm for which network) and routes to the focused pages that carry the runnable configs, launch commands, and NCCL/topology tuning. The decision counterpart to distributed training, reused by post-training (fine-tuning and post-training).
Reference templates on real PyTorch/OpenDiLoCo APIs. Validate against the installed version; lay parallelism onto the actual topology (networking fabric).
Focused pages¶
- FSDP single-DC recipe: use this when you are wiring up FSDP2/HSDP on one high-bandwidth DC and need the wrapping code,
torchrunlaunch, and NCCL/topology tuning. - DiLoCo geo-distributed recipe: use this when workers sit across DCs or slow/heterogeneous links and you need the inner/outer loop, OpenDiLoCo/PRIME setup, and H tuning.
- FSDP concept: use this to understand why sharding communicates every step and what each ZeRO stage shards.
- DiLoCo concept: use this to understand the pseudo-gradient / outer-optimiser idea before reaching for the recipe.
Paradigm: communication is the axis¶
Both methods answer "how do many GPUs train one model" but assume opposite networks:
- FSDP / ZeRO: shard params/grads/optimizer across ranks and communicate every step (all-gather params, reduce-scatter grads). Demands a fast fabric (NVLink + IB/RoCE with GDR). The single-DC default.
- DiLoCo: each replica trains locally for H steps, then synchronises infrequently. Cuts communication ~500×, tolerating slow/heterogeneous links. The approach for multi-DC and decentralised GPU (cloud and cost).
Choose by the network between workers, not by preference.
Loop: DiLoCo inner/outer optimisation
flowchart TB
INNER["Inner loop: H local AdamW steps (no cross-replica comms)"] --> PG["Pseudo-gradient = parameter delta"]
PG -->|"all-reduce every H steps"| OUTER["Outer optimiser: Nesterov SGD"]
OUTER --> INNER
FSDP2 (single-DC sharding)¶
PyTorch-native fully_shard (FSDP2): wrap per transformer block so params are gathered just-in-time and freed after use, launch with torchrun, and shard within a node while replicating across (HSDP) when scaling out. The throughput levers are overlap (prefetch / limit_all_gathers), HSDP topology layout (all-gather intra-node on NVLink, reduce inter-node over IB), NCCL/GDR tuning, and ACS-off.
The wrapping code, full torchrun launch, and the complete NCCL/topology tuning list live in the FSDP single-DC recipe. For the conceptual model of why per-step sharding communicates so much, see FSDP concept.
DiLoCo (low-communication / geo-distributed)¶
Two-level optimisation: an inner optimiser (e.g. AdamW) runs H local steps per replica; an outer optimiser (e.g. Nesterov SGD) applies the accumulated pseudo-gradient (the parameter delta) every H steps, the only cross-replica comms. OpenDiLoCo (Prime Intellect) implements this over a Hivemind DHT so geographically separate, fault-tolerant workers can join/leave; its scaled reproduction trained a 1.1B model across continents at 90-95% compute utilisation, and the same lineage later trained INTELLECT-1 (10B) globally distributed. OpenDiLoCo is superseded by Prime Intellect's PRIME stack (prime/prime-rl, FSDP2-based). DiLoCo is the natural pair for the decentralised-GPU economics in cloud and cost.
The inner/outer training skeleton, OpenDiLoCo/PRIME setup, and H tuning live in the DiLoCo geo-distributed recipe. For the conceptual derivation of the pseudo-gradient, see DiLoCo concept.
Decision matrix¶
| Network between workers | Use | Why |
|---|---|---|
| NVLink + IB/RoCE, one DC | FSDP / HSDP | per-step comms are cheap; max MFU |
| Multiple DCs, good but not IB | HSDP across DCs or DiLoCo | balance comms vs sync staleness |
| Slow/heterogeneous/internet | DiLoCo / PRIME | ~500× less comms tolerates the link |
Don't-miss checklist¶
- Lay sharding onto topology: HSDP keeps all-gather on NVLink, reduce over IB (networking fabric).
- Overlap comms with compute (prefetch); confirm GDR and ACS-off.
- Activation checkpointing + BF16 to fit; measure MFU (distributed training, observability).
- Use DiLoCo only when the inter-worker network actually warrants it; tune H against convergence.
- Sharded async checkpoints; verify deterministic resume (storage and data).
Failure modes¶
- Plain FSDP (full shard) stretched across nodes over IB → all-gather bound, low MFU; should have been HSDP.
- DiLoCo H too large → replicas drift, convergence degrades; too small → loses the comms saving.
- No comms/compute overlap → GPUs idle on every all-gather.
- ACS on / PCIe trained down → silent FSDP throughput loss.
Open questions & validation¶
- Validate the FSDP2 wrapping and HSDP config against the installed PyTorch version.
- Sweep DiLoCo H for the model/network and confirm convergence vs a single-DC baseline.
- Confirm OpenDiLoCo/PRIME API and fault-tolerance behaviour on the target setup.
References¶
- PyTorch FSDP: https://docs.pytorch.org/docs/stable/fsdp.html · FSDP2 notes: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
- DiLoCo paper (DeepMind): https://arxiv.org/abs/2311.08105
- OpenDiLoCo (Prime Intellect): https://github.com/PrimeIntellect-ai/OpenDiloco · blog: https://www.primeintellect.ai/blog/opendiloco
- INTELLECT-1 (10B globally distributed): https://www.primeintellect.ai/blog/intellect-1
- NCCL env (FSDP tuning): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Related: Fabric · Storage · Training · Optimization · Cloud & Cost · Fine-tuning · Glossary