Skip to content
Markdown

Distributed training recipes (FSDP · DiLoCo)

Scope: an index / decision page for the two scale-out paradigms, FSDP for high-bandwidth single-DC sharding and DiLoCo for low-communication / geo-distributed training. It frames the choice (which paradigm for which network) and routes to the focused pages that carry the runnable configs, launch commands, and NCCL/topology tuning. The decision counterpart to distributed training, reused by post-training (fine-tuning and post-training).

Reference templates on real PyTorch/OpenDiLoCo APIs. Validate against the installed version; lay parallelism onto the actual topology (networking fabric).

Focused pages

  • FSDP single-DC recipe: use this when you are wiring up FSDP2/HSDP on one high-bandwidth DC and need the wrapping code, torchrun launch, and NCCL/topology tuning.
  • DiLoCo geo-distributed recipe: use this when workers sit across DCs or slow/heterogeneous links and you need the inner/outer loop, OpenDiLoCo/PRIME setup, and H tuning.
  • FSDP concept: use this to understand why sharding communicates every step and what each ZeRO stage shards.
  • DiLoCo concept: use this to understand the pseudo-gradient / outer-optimiser idea before reaching for the recipe.

Paradigm: communication is the axis

Both methods answer "how do many GPUs train one model" but assume opposite networks:

  • FSDP / ZeRO: shard params/grads/optimizer across ranks and communicate every step (all-gather params, reduce-scatter grads). Demands a fast fabric (NVLink + IB/RoCE with GDR). The single-DC default.
  • DiLoCo: each replica trains locally for H steps, then synchronises infrequently. Cuts communication ~500×, tolerating slow/heterogeneous links. The approach for multi-DC and decentralised GPU (cloud and cost).

Choose by the network between workers, not by preference.

Loop: DiLoCo inner/outer optimisation

flowchart TB
  INNER["Inner loop: H local AdamW steps (no cross-replica comms)"] --> PG["Pseudo-gradient = parameter delta"]
  PG -->|"all-reduce every H steps"| OUTER["Outer optimiser: Nesterov SGD"]
  OUTER --> INNER

FSDP2 (single-DC sharding)

PyTorch-native fully_shard (FSDP2): wrap per transformer block so params are gathered just-in-time and freed after use, launch with torchrun, and shard within a node while replicating across (HSDP) when scaling out. The throughput levers are overlap (prefetch / limit_all_gathers), HSDP topology layout (all-gather intra-node on NVLink, reduce inter-node over IB), NCCL/GDR tuning, and ACS-off.

The wrapping code, full torchrun launch, and the complete NCCL/topology tuning list live in the FSDP single-DC recipe. For the conceptual model of why per-step sharding communicates so much, see FSDP concept.

DiLoCo (low-communication / geo-distributed)

Two-level optimisation: an inner optimiser (e.g. AdamW) runs H local steps per replica; an outer optimiser (e.g. Nesterov SGD) applies the accumulated pseudo-gradient (the parameter delta) every H steps, the only cross-replica comms. OpenDiLoCo (Prime Intellect) implements this over a Hivemind DHT so geographically separate, fault-tolerant workers can join/leave; its scaled reproduction trained a 1.1B model across continents at 90-95% compute utilisation, and the same lineage later trained INTELLECT-1 (10B) globally distributed. OpenDiLoCo is superseded by Prime Intellect's PRIME stack (prime/prime-rl, FSDP2-based). DiLoCo is the natural pair for the decentralised-GPU economics in cloud and cost.

The inner/outer training skeleton, OpenDiLoCo/PRIME setup, and H tuning live in the DiLoCo geo-distributed recipe. For the conceptual derivation of the pseudo-gradient, see DiLoCo concept.

Decision matrix

Network between workers Use Why
NVLink + IB/RoCE, one DC FSDP / HSDP per-step comms are cheap; max MFU
Multiple DCs, good but not IB HSDP across DCs or DiLoCo balance comms vs sync staleness
Slow/heterogeneous/internet DiLoCo / PRIME ~500× less comms tolerates the link

Don't-miss checklist

  • Lay sharding onto topology: HSDP keeps all-gather on NVLink, reduce over IB (networking fabric).
  • Overlap comms with compute (prefetch); confirm GDR and ACS-off.
  • Activation checkpointing + BF16 to fit; measure MFU (distributed training, observability).
  • Use DiLoCo only when the inter-worker network actually warrants it; tune H against convergence.
  • Sharded async checkpoints; verify deterministic resume (storage and data).

Failure modes

  • Plain FSDP (full shard) stretched across nodes over IB → all-gather bound, low MFU; should have been HSDP.
  • DiLoCo H too large → replicas drift, convergence degrades; too small → loses the comms saving.
  • No comms/compute overlap → GPUs idle on every all-gather.
  • ACS on / PCIe trained down → silent FSDP throughput loss.

Open questions & validation

  • Validate the FSDP2 wrapping and HSDP config against the installed PyTorch version.
  • Sweep DiLoCo H for the model/network and confirm convergence vs a single-DC baseline.
  • Confirm OpenDiLoCo/PRIME API and fault-tolerance behaviour on the target setup.

References

  • PyTorch FSDP: https://docs.pytorch.org/docs/stable/fsdp.html · FSDP2 notes: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
  • DiLoCo paper (DeepMind): https://arxiv.org/abs/2311.08105
  • OpenDiLoCo (Prime Intellect): https://github.com/PrimeIntellect-ai/OpenDiloco · blog: https://www.primeintellect.ai/blog/opendiloco
  • INTELLECT-1 (10B globally distributed): https://www.primeintellect.ai/blog/intellect-1
  • NCCL env (FSDP tuning): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Related: Fabric · Storage · Training · Optimization · Cloud & Cost · Fine-tuning · Glossary