
Distributed training recipes (FSDP · DiLoCo)

Scope: an index and decision page for the two scale-out paradigms: FSDP for high-bandwidth single-DC sharding, and DiLoCo for low-communication or geo-distributed training. It frames the choice (which paradigm for which network) and routes to the focused pages that carry the runnable configs, launch commands, and NCCL/topology tuning. It is the decision counterpart to distributed training and is reused by post-training (fine-tuning and post-training).

The recipes are reference templates built on real PyTorch/OpenDiLoCo APIs. Validate them against the installed version, and lay parallelism onto the actual topology (networking fabric).

Focused pages

FSDP single-DC recipe: use this when you are wiring up FSDP2/HSDP on one high-bandwidth DC and need the wrapping code, torchrun launch, and NCCL/topology tuning.
DiLoCo geo-distributed recipe: use this when workers sit across DCs or slow/heterogeneous links and you need the inner/outer loop, OpenDiLoCo/PRIME setup, and H tuning.
FSDP concept: use this to understand why sharding communicates every step and what each ZeRO stage shards.
DiLoCo concept: use this to understand the pseudo-gradient / outer-optimiser idea before reaching for the recipe.

Paradigm: communication is the axis

Both methods answer "how do many GPUs train one model" but assume opposite networks:

FSDP / ZeRO: shard params, grads, and optimizer state across ranks and communicate every step (all-gather params, reduce-scatter grads). Demands a fast fabric (NVLink plus IB/RoCE with GPUDirect RDMA, GDR). The single-DC default.
DiLoCo: each replica trains locally for H steps, then synchronises once. Communication drops roughly in proportion to the sync interval (~500× in the DiLoCo paper), which tolerates slow or heterogeneous links. The approach for multi-DC and decentralised GPU training (cloud and cost).

Choose by the network between workers, not by preference.
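To make the scale of the gap concrete, the following back-of-envelope sketch compares per-rank traffic when workers synchronise every step against synchronising every H steps. It is a simplification under stated assumptions: each synchronisation is modelled as a single ring all-reduce over all parameters (FSDP's actual collectives differ, but they also run every step), and the function names and the model size, value width, and worker count are illustrative, not taken from any recipe.

```python
def ring_allreduce_bytes(param_count: int, bytes_per_param: int, world_size: int) -> float:
    """Bytes each rank sends for one ring all-reduce over all parameters."""
    payload = param_count * bytes_per_param
    return 2 * (world_size - 1) / world_size * payload


def amortised_bytes_per_step(param_count: int, bytes_per_param: int,
                             world_size: int, sync_every: int) -> float:
    """Per-rank traffic averaged over steps when syncing every `sync_every` steps."""
    return ring_allreduce_bytes(param_count, bytes_per_param, world_size) / sync_every


# Illustrative numbers only: 1.1B parameters, 4-byte values, 8 workers.
every_step = amortised_bytes_per_step(1_100_000_000, 4, 8, sync_every=1)
every_500 = amortised_bytes_per_step(1_100_000_000, 4, 8, sync_every=500)
reduction = every_step / every_500

print(f"sync every step : {every_step / 1e9:.2f} GB per rank per step")
print(f"sync every 500  : {every_500 / 1e9:.4f} GB per rank per step")
print(f"reduction       : {reduction:.0f}x")
```

Under this model the saving is simply the sync interval, which is why the tolerable link speed and the choice of H are two views of the same decision.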

Loop: DiLoCo inner/outer optimisation

flowchart TB
  INNER["Inner loop: H local AdamW steps (no cross-replica comms)"] --> PG["Pseudo-gradient = parameter delta"]
  PG -->|"all-reduce every H steps"| OUTER["Outer optimiser: Nesterov SGD"]
  OUTER --> INNER
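The loop in the diagram can be sketched as a dependency-free toy. This is not the OpenDiLoCo or PRIME implementation: each replica minimises a scalar quadratic toward its own target, plain SGD stands in for the AdamW inner optimiser, and the function name, learning rates, momentum, and targets are illustrative assumptions chosen to keep the example runnable without PyTorch.

```python
def diloco_toy(targets, rounds=50, h=10, inner_lr=0.1, outer_lr=0.7, momentum=0.9):
    """Toy DiLoCo: one scalar parameter, one replica per target value."""
    global_x = 0.0
    velocity = 0.0  # outer-optimiser momentum buffer
    for _ in range(rounds):
        local_results = []
        for target in targets:
            x = global_x  # every replica starts the round from the global parameter
            for _ in range(h):
                # Inner loop: H local steps with no cross-replica communication.
                grad = x - target  # gradient of 0.5 * (x - target) ** 2
                x -= inner_lr * grad  # plain SGD here; the real recipe uses AdamW
            local_results.append(x)

        # Pseudo-gradient: global parameter minus local result, averaged over
        # replicas. In a real run this average is the all-reduce every H steps.
        pseudo_grad = sum(global_x - x for x in local_results) / len(local_results)

        # Outer optimiser: SGD with Nesterov momentum applied to the pseudo-gradient.
        velocity = momentum * velocity + pseudo_grad
        global_x -= outer_lr * (pseudo_grad + momentum * velocity)
    return global_x


result = diloco_toy(targets=[1.0, 2.0, 3.0, 6.0])
print(f"global parameter after training: {result:.6f}")
```

In this toy the global parameter settles at the mean of the replica targets, which is the minimiser of the averaged loss; real models offer no such guarantee, hence the need to tune H against convergence.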

FSDP2 (single-DC sharding)

PyTorch-native fully_shard (FSDP2): wrap each transformer block so params are gathered just-in-time and freed after use, launch with torchrun, and when scaling out shard within a node while replicating across nodes (HSDP). The throughput levers are comms/compute overlap (prefetch; limit_all_gathers is the FSDP1 wrapper's knob, so check what the installed FSDP2 exposes), HSDP topology layout (all-gather and reduce-scatter intra-node on NVLink, gradient all-reduce inter-node over IB), NCCL/GDR tuning, and ACS-off.

The wrapping code, full torchrun launch, and the complete NCCL/topology tuning list live in the FSDP single-DC recipe. For the conceptual model of why per-step sharding communicates so much, see FSDP concept.
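As an illustration of what "shard within a node, replicate across nodes" means for rank placement, here is a small pure-Python sketch. It assumes ranks are numbered contiguously per node (the usual torchrun ordering) and that every node has the same GPU count; `hsdp_layout` is a hypothetical helper, not a PyTorch API. The resulting shape is the kind of (replicate, shard) pair one would hand to a 2D device mesh.

```python
def hsdp_layout(world_size: int, gpus_per_node: int):
    """Return the 2D mesh shape plus the shard and replicate rank groups."""
    if world_size % gpus_per_node != 0:
        raise ValueError("world_size must be a multiple of gpus_per_node")
    num_nodes = world_size // gpus_per_node

    # One shard group per node: these ranks split the parameters among
    # themselves, so all-gather and reduce-scatter stay on NVLink.
    shard_groups = [
        list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        for node in range(num_nodes)
    ]

    # One replicate group per local GPU index: these ranks hold the same shard
    # on different nodes and all-reduce gradients over the inter-node fabric.
    replicate_groups = [
        list(range(local, world_size, gpus_per_node))
        for local in range(gpus_per_node)
    ]

    mesh_shape = (num_nodes, gpus_per_node)  # (replicate, shard)
    return mesh_shape, shard_groups, replicate_groups


shape, shard_groups, replicate_groups = hsdp_layout(world_size=16, gpus_per_node=8)
print("mesh shape (replicate, shard):", shape)
print("first shard group:", shard_groups[0])
print("first replicate group:", replicate_groups[0])
```

If the scheduler assigns ranks in a different order, the groups no longer line up with node boundaries and all-gather traffic leaks onto the inter-node fabric, so it is worth printing the layout on the real cluster.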

DiLoCo (low-communication / geo-distributed)

Two-level optimisation: an inner optimiser (e.g. AdamW) runs H local steps per replica; every H steps an outer optimiser (e.g. Nesterov SGD) applies the pseudo-gradient (the parameter delta accumulated over those steps), which is the only cross-replica communication. OpenDiLoCo (Prime Intellect) implements this over a Hivemind DHT so that geographically separate workers can join and leave with fault tolerance. Its scaled reproduction trained a 1.1B model across continents at 90-95% compute utilisation, and the same lineage later trained INTELLECT-1 (10B) in a globally distributed run. OpenDiLoCo has since been superseded by Prime Intellect's PRIME stack (prime/prime-rl, FSDP2-based). DiLoCo is the natural pair for the decentralised-GPU economics in cloud and cost.

The inner/outer training skeleton, OpenDiLoCo/PRIME setup, and H tuning live in the DiLoCo geo-distributed recipe. For the conceptual derivation of the pseudo-gradient, see DiLoCo concept.

Decision matrix

| Network between workers | Use | Why |
| --- | --- | --- |
| NVLink + IB/RoCE, one DC | FSDP / HSDP | per-step comms are cheap; max MFU |
| Multiple DCs, good but not IB | HSDP across DCs or DiLoCo | balance comms vs sync staleness |
| Slow/heterogeneous/internet | DiLoCo / PRIME | ~500× less comms tolerates the link |
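One rough way to place a given link in the matrix is to ask how many local steps are needed before a synchronisation fits inside a communication budget. The sketch below is a pessimistic estimate that assumes no overlap between communication and compute; `min_sync_interval`, the 25% budget, and the bandwidth figures are hypothetical values chosen for illustration.

```python
import math


def min_sync_interval(sync_bytes: float, link_bytes_per_s: float,
                      step_time_s: float, max_comm_fraction: float) -> int:
    """Smallest H whose amortised sync time fits within the comms budget.

    Budget: sync_time / H <= max_comm_fraction * step_time_s.
    """
    sync_time_s = sync_bytes / link_bytes_per_s
    budget_per_step_s = max_comm_fraction * step_time_s
    return max(1, math.ceil(sync_time_s / budget_per_step_s))


# Illustrative: 8 GB exchanged per sync, 1 s of compute per step, 25% budget.
fast_fabric_h = min_sync_interval(8e9, 400e9, step_time_s=1.0, max_comm_fraction=0.25)
slow_link_h = min_sync_interval(8e9, 1e9, step_time_s=1.0, max_comm_fraction=0.25)

print("fast fabric: sync every", fast_fabric_h, "step(s)")
print("slow link  : sync every", slow_link_h, "step(s)")
```

A result of 1 suggests per-step sharding is affordable on that link; a large value points toward DiLoCo, subject to the convergence check on H.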

Don't-miss checklist

Lay sharding onto topology: HSDP keeps all-gather on NVLink, reduce over IB (networking fabric).
Overlap comms with compute (prefetch); confirm GDR and ACS-off.
Activation checkpointing + BF16 to fit; measure MFU (distributed training, observability).
Use DiLoCo only when the inter-worker network actually warrants it; tune H against convergence.
Sharded async checkpoints; verify deterministic resume (storage and data).
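For the "measure MFU" item, a minimal estimate can be computed from observed throughput. This sketch assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which ignores attention FLOPs; the function name and the example figures are illustrative, and the peak FLOPs value should come from the accelerator's datasheet for the precision in use.

```python
def model_flops_utilisation(param_count: int, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Approximate MFU: model FLOPs achieved per second over aggregate peak FLOPs."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward).
    achieved_flops = 6 * param_count * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


# Illustrative numbers only: 1B parameters, 100k tokens/s over 8 GPUs,
# each with a hypothetical peak of 1e14 FLOP/s.
mfu = model_flops_utilisation(1_000_000_000, 100_000, 8, 1e14)
print(f"MFU = {mfu:.2%}")
```

Recomputation from activation checkpointing is deliberately not counted: it keeps the hardware busy without advancing training, so MFU drops when checkpointing is enabled even if the GPUs look fully utilised.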

Failure modes

Plain FSDP (full shard) stretched across nodes over IB → all-gather bound, low MFU; should have been HSDP.
DiLoCo H too large → replicas drift, convergence degrades; too small → loses the comms saving.
No comms/compute overlap → GPUs idle on every all-gather.
ACS on, or a PCIe link trained down to reduced width or speed → silent FSDP throughput loss.
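The cost of missing overlap can be seen in an idealised step-time model: without overlap, communication adds to compute; with perfect overlap, the step takes as long as the slower of the two. The function names and timings below are illustrative assumptions, and real overlap is never perfect.

```python
def step_time(compute_s: float, comm_s: float, overlap: bool) -> float:
    """Idealised step time with either no overlap or perfect overlap."""
    return max(compute_s, comm_s) if overlap else compute_s + comm_s


def idle_fraction(compute_s: float, comm_s: float, overlap: bool) -> float:
    """Share of the step during which the GPU waits on communication."""
    total = step_time(compute_s, comm_s, overlap)
    return (total - compute_s) / total


# Illustrative: 0.8 s of compute and 0.2 s of collectives per step.
no_overlap = idle_fraction(0.8, 0.2, overlap=False)
with_overlap = idle_fraction(0.8, 0.2, overlap=True)

print(f"idle without overlap: {no_overlap:.0%}")
print(f"idle with overlap   : {with_overlap:.0%}")
```

The same model shows the all-gather-bound case: once communication time exceeds compute time, even perfect overlap leaves the GPU idle, which is the symptom of full-shard FSDP stretched across nodes.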

Open questions & validation

Validate the FSDP2 wrapping and HSDP config against the installed PyTorch version.
Sweep DiLoCo H for the model/network and confirm convergence vs a single-DC baseline.
Confirm OpenDiLoCo/PRIME API and fault-tolerance behaviour on the target setup.

References

PyTorch FSDP: https://docs.pytorch.org/docs/stable/fsdp.html · FSDP2 notes: https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
DiLoCo paper (DeepMind): https://arxiv.org/abs/2311.08105
OpenDiLoCo (Prime Intellect): https://github.com/PrimeIntellect-ai/OpenDiloco · blog: https://www.primeintellect.ai/blog/opendiloco
INTELLECT-1 (10B globally distributed): https://www.primeintellect.ai/blog/intellect-1
NCCL env (FSDP tuning): https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html

Related: Fabric · Storage · Training · Optimization · Cloud & Cost · Fine-tuning · Glossary