
Recipe: DiLoCo geo-distributed training¶

Scope: a standalone recipe for running low-communication DiLoCo training across datacenters or poorly connected workers. It covers the inner/outer optimizer config, the sync cadence (H), the launcher, and the apply/verify loop, plus the decision of when DiLoCo beats standard FSDP/DDP. This is the geo-distributed counterpart to Recipe: FSDP single-DC; the concept page is DiLoCo, and the survey is distributed-training recipes.

The code blocks are reference templates written against the real PyTorch torch.distributed and Prime Intellect APIs. Pin versions and validate against your installed build: framework flags (Prime Intellect prime, Hivemind) vary by version, so verify them on the repo. Nothing here has been hardware-tested.

What it is¶

DiLoCo (Distributed Low-Communication, DeepMind arXiv 2311.08105) is a two-level optimizer. Each worker runs H inner steps of ordinary AdamW on its own data shard with zero cross-worker communication. Every H steps each worker computes its pseudo-gradient: the parameters at the last sync minus its current parameters. The pseudo-gradients are averaged across workers with an all-reduce, and an outer optimizer (Nesterov-momentum SGD) applies the average to the global parameters, which become the starting point for the next round. Repeat.
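
In symbols, one outer round can be written as below. The notation is chosen here for illustration ($k$ workers, $\theta$ for parameters, $\mathcal{D}_i$ for worker $i$'s shard, $\Delta$ for the averaged pseudo-gradient) and uses the snapshot-minus-current sign convention, so the outer optimizer can treat $\Delta$ as an ordinary gradient.

```latex
\begin{aligned}
\theta_i^{(t)} &= \mathrm{AdamW}^{H}\!\left(\theta^{(t-1)};\ \mathcal{D}_i\right)
  && \text{(inner: $H$ local steps on shard $\mathcal{D}_i$)} \\
\Delta^{(t)} &= \frac{1}{k}\sum_{i=1}^{k}\left(\theta^{(t-1)} - \theta_i^{(t)}\right)
  && \text{(averaged pseudo-gradient)} \\
\theta^{(t)} &= \mathrm{OuterOpt}\!\left(\theta^{(t-1)},\ \Delta^{(t)}\right)
  && \text{(outer: Nesterov-momentum SGD step)}
\end{aligned}
```

As a sanity check on the sign: with a plain SGD outer step and outer learning rate 1, $\theta^{(t)} = \theta^{(t-1)} - \Delta^{(t)}$ is simply the average of the workers' parameters.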

Because only one parameter-sized tensor crosses the inter-worker link every H steps rather than every step, DiLoCo makes H times fewer exchanges than per-step data parallelism (roughly 500x at H = 500), trading communication frequency for a small convergence gap. A "worker" can itself be a multi-GPU FSDP shard group: FSDP inside, DiLoCo across. This recipe is the geo-distributed half of distributed-training recipes; for the single-DC fast-fabric path use Recipe: FSDP single-DC.
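
The communication saving is plain arithmetic on sync counts, which a few lines make concrete. This is a minimal sketch that counts payload bytes only: it assumes an fp32 payload and ignores all-reduce algorithm overheads and latency, and the function name and example sizes are illustrative, not from the paper.

```python
def exchanged_bytes(n_params: int, total_steps: int, inner_steps: int,
                    bytes_per_value: int = 4) -> dict:
    """Payload bytes one worker contributes over a run: per-step DP vs DiLoCo."""
    if inner_steps < 1:
        raise ValueError("inner_steps (H) must be >= 1")
    payload = n_params * bytes_per_value               # one parameter-sized tensor
    per_step_dp = payload * total_steps                # gradients exchanged every step
    diloco = payload * (total_steps // inner_steps)    # pseudo-gradients every H steps
    reduction = per_step_dp / diloco if diloco else float("inf")
    return {"per_step_dp": per_step_dp, "diloco": diloco, "reduction": reduction}


# Illustrative sizes: 1B parameters, 10k steps, H = 500.
stats = exchanged_bytes(n_params=1_000_000_000, total_steps=10_000, inner_steps=500)
print(stats["reduction"])  # 500.0
```

The reduction equals H whenever the step count is a multiple of H, which is why the headline figure moves with the chosen sync cadence.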

Why it matters¶

Trains across DCs / WAN. Sync cost amortizes over H steps, so workers can sit in different regions on ordinary internet links rather than a shared InfiniBand fabric. That is the economic case for decentralized and neocloud GPU pools.
Tolerates heterogeneous, churny pools. Workers are loosely coupled by design: a synchronous per-step all-reduce is impractical over spot/volunteer hardware, but an infrequent pseudo-gradient exchange is workable. OpenDiLoCo reported 90-95% compute utilization training across continents (arXiv 2407.07852).
Recovers utilization when the network gates throughput. When the link, not the GPUs, is the bottleneck, cutting comms ~500x is the difference between idle GPUs and a compute-bound run.
Composes with FSDP. The inner loop is an ordinary training step, so each worker can be an FSDP shard group on its own fast fabric. DiLoCo only changes the between-worker sync.

When it is needed (and when not)¶

Use DiLoCo when workers are separated by slow or heterogeneous links (cross-DC, WAN, spot/decentralized pools) where FSDP's per-step all-gather/reduce-scatter would stall on the fabric.
Use FSDP / HSDP (Recipe: FSDP single-DC) when every worker shares a high-bandwidth IB/RoCE fabric in one cluster. Per-step sync gives the cleanest convergence and DiLoCo's outer loop buys nothing.
Use DDP (distributed training) when the model fits comfortably on one GPU and all ranks are co-located: one all-reduce per step is simpler and faster in that regime.
The decision is per-link, not global. Fast fabric inside an island -> FSDP; slow link between islands -> DiLoCo on top. Choose by the network between workers, not by preference.

flowchart LR
  Q1{"Workers on one<br/>fast IB/NVLink fabric?"} -->|"yes"| FSDP["FSDP / HSDP<br/>(recipe-fsdp-single-dc)"]
  Q1 -->|"no, cross-DC / WAN / spot"| Q2{"Pool churns<br/>(join/leave, stragglers)?"}
  Q2 -->|"no, fixed worker set"| DLC["DiLoCo over torch.distributed<br/>(gloo all-reduce)"]
  Q2 -->|"yes, decentralized"| PRIME["Prime Intellect prime<br/>(DHT, fault tolerant)"]
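
The decision chart above can also be written as a small function, which makes the per-link rule easy to apply to each hop of a topology. This is a sketch only; the function name and the returned labels are hypothetical and do not correspond to any framework option.

```python
def choose_sync_strategy(shared_fast_fabric: bool, pool_churns: bool = False) -> str:
    """Pick the sync method for ONE link, following the decision chart."""
    if shared_fast_fabric:
        return "fsdp"                    # per-step sync on an IB/NVLink fabric
    if pool_churns:
        return "diloco-fault-tolerant"   # elastic pool: use a fault-tolerant framework
    return "diloco-gloo"                 # fixed worker set over WAN


# The decision is per link: fast fabric inside each island, slow link between islands.
plan = {
    "inside each DC": choose_sync_strategy(shared_fast_fabric=True),
    "between DCs": choose_sync_strategy(shared_fast_fabric=False),
}
print(plan)  # {'inside each DC': 'fsdp', 'between DCs': 'diloco-gloo'}
```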

How: implement, integrate, maintain¶

1. Implement: the inner/outer loop¶

The whole method is a thin wrapper around a normal training step: snapshot the global params, run H inner steps, all-reduce the parameter delta, step the outer optimizer. Use a WAN-friendly backend (gloo) for the cross-worker process group; the inner step is unchanged from single-replica training.

# diloco_train.py  -- algorithm reference; validate torch.distributed signatures on your build.
from __future__ import annotations
import os, torch, torch.distributed as dist

def diloco_train(model: torch.nn.Module, loader, steps_total: int,
                 H: int = 500, outer_lr: float = 0.7) -> None:
    dist.init_process_group("gloo")                       # WAN/TCP backend, not NCCL-over-IB
    world = dist.get_world_size()
    for p in model.parameters():                          # every worker must start from identical weights
        dist.broadcast(p.data, src=0)
    inner = torch.optim.AdamW(model.parameters(), lr=4e-4)
    outer = torch.optim.SGD(model.parameters(),           # outer momentum stabilizes rare syncs
                            lr=outer_lr, momentum=0.9, nesterov=True)
    it = iter(loader)
    for _outer_step in range(steps_total // H):
        snap = [p.detach().clone() for p in model.parameters()]   # global params at last sync
        for _ in range(H):                                # INNER: no cross-worker comms
            try:
                batch = next(it)
            except StopIteration:                         # loader exhausted: start another pass
                it = iter(loader); batch = next(it)
            loss = model(**batch).loss                    # assumes an HF-style model that returns .loss
            inner.zero_grad(); loss.backward(); inner.step()
        for p, g0 in zip(model.parameters(), snap):
            delta = g0 - p.data                           # pseudo-gradient = snapshot - current params
            dist.all_reduce(delta, op=dist.ReduceOp.SUM)  # the ONLY cross-worker comm, every H steps
            p.grad = delta / world                        # gloo has no ReduceOp.AVG: sum, then divide
            p.data.copy_(g0)                              # reset to snapshot before outer step
        outer.step()                                      # outer applies averaged pseudo-gradient
    dist.destroy_process_group()
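
To see the snapshot / inner steps / averaged delta / outer step cycle without a cluster, the loop can be simulated in pure Python on a toy problem. This is a sketch under loud assumptions: each simulated worker minimizes a one-dimensional quadratic, plain gradient descent stands in for AdamW, and a Python average stands in for the all-reduce. Only the outer update follows the Nesterov form used by torch.optim.SGD.

```python
def diloco_toy(targets, rounds=30, H=20, inner_lr=0.1, outer_lr=0.7, momentum=0.9):
    """Simulate DiLoCo on k workers, worker i minimizing (theta - targets[i])**2 / 2."""
    theta = 0.0                                  # global parameter, identical on every worker
    buf = 0.0                                    # outer momentum buffer
    for _ in range(rounds):
        deltas = []
        for c in targets:                        # one simulated worker per target
            local = theta                        # start from the global snapshot
            for _ in range(H):                   # inner steps: plain gradient descent (not AdamW)
                local -= inner_lr * (local - c)
            deltas.append(theta - local)         # pseudo-gradient = snapshot - current
        g = sum(deltas) / len(deltas)            # stands in for the all-reduce average
        buf = momentum * buf + g                 # Nesterov momentum, PyTorch SGD formulation
        theta -= outer_lr * (g + momentum * buf) # outer step from the snapshot
    return theta


# Workers pull toward 1, 2 and 6; the consensus optimum of the summed loss is their mean, 3.
final_theta = diloco_toy([1.0, 2.0, 6.0])
```

Each worker alone would converge to its own target; the outer step on the averaged delta is what pulls the global parameter to the shared optimum.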

H, outer_lr, and the worker topology are the primary sweep axes. Expose them via env so they are easy to vary across a campaign:

H        = int(os.environ.get("DILOCO_H", 500))          # 50 / 125 / 500 reported in OpenDiLoCo
outer_lr = float(os.environ.get("OUTER_LR", 0.7))        # Nesterov SGD outer LR, published default
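
One way to pick where the H sweep starts is a link budget: estimate how long one sync takes on the WAN link and choose H so that sync time stays below a target fraction of wall-clock time. The sketch below assumes the full payload crosses the link once per sync, with no overlap with compute and no latency term; the function names, the 5% target, and the example numbers are illustrative assumptions.

```python
import math


def sync_seconds(n_params: int, link_gbps: float, bytes_per_value: int = 4) -> float:
    """Time to push one parameter-sized payload through the link once."""
    return n_params * bytes_per_value * 8 / (link_gbps * 1e9)


def comm_fraction(n_params: int, H: int, step_seconds: float, link_gbps: float,
                  bytes_per_value: int = 4) -> float:
    """Share of wall-clock time spent syncing when one sync follows every H steps."""
    sync = sync_seconds(n_params, link_gbps, bytes_per_value)
    return sync / (H * step_seconds + sync)


def min_inner_steps(n_params: int, step_seconds: float, link_gbps: float,
                    max_comm_fraction: float = 0.05, bytes_per_value: int = 4) -> int:
    """Smallest H that keeps the comm fraction at or below the target."""
    sync = sync_seconds(n_params, link_gbps, bytes_per_value)
    f = max_comm_fraction
    # sync / (H * step + sync) <= f   <=>   H >= sync * (1 - f) / (f * step)
    return max(1, math.ceil(sync * (1 - f) / (f * step_seconds)))


# Illustrative: 1B fp32 parameters, 1 s per inner step, 1 Gbit/s link (32 s per sync).
h_start = min_inner_steps(n_params=1_000_000_000, step_seconds=1.0, link_gbps=1.0)
print(h_start, comm_fraction(1_000_000_000, h_start, 1.0, 1.0))
```

The result is a floor set by the network, not a recommendation: convergence still decides the upper end of the sweep.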

2. Integrate: launch one worker per DC¶

Each worker is a separate torch.distributed rank reached over WAN. Point every rank at a shared MASTER_ADDR (a routable host the others can reach), set WORLD_SIZE to the number of DCs, and assign RANK per site. Run the same command at each location.

# Run at DC-A (rank 0, also the rendezvous host), DC-B (rank 1), DC-C (rank 2)...
export MASTER_ADDR=dc-a.example.net        # routable from every worker over WAN
export MASTER_PORT=29500
export WORLD_SIZE=3                         # number of geo-distributed workers
export RANK=0                               # 0 at DC-A, 1 at DC-B, 2 at DC-C
export DILOCO_H=500                         # local steps between syncs; tune to link bandwidth
export GLOO_SOCKET_IFNAME=eth0             # pin the WAN-facing NIC for the gloo backend
python diloco_train.py
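
Because every site sets these variables by hand, a misnumbered RANK or a missing MASTER_ADDR typically shows up only as a rendezvous hang. A small preflight check, sketched below, can catch that before launch. The helper name is hypothetical and it validates only the four rendezvous variables used above; it does not test reachability of the master host.

```python
def check_rendezvous_env(env) -> list:
    """Return a list of problems with the rendezvous variables (empty list = OK)."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [key for key in required if not env.get(key)]
    if missing:
        return [f"{key} is not set" for key in missing]
    try:
        port, world, rank = (int(env[k]) for k in ("MASTER_PORT", "WORLD_SIZE", "RANK"))
    except ValueError:
        return ["MASTER_PORT, WORLD_SIZE and RANK must be integers"]
    problems = []
    if not 0 < port < 65536:
        problems.append(f"MASTER_PORT={port} is not a valid TCP port")
    if world < 1:
        problems.append(f"WORLD_SIZE={world} must be at least 1")
    elif not 0 <= rank < world:
        problems.append(f"RANK={rank} must be in [0, {world - 1}]")
    return problems


# A third site accidentally numbered 3 instead of 2:
problems = check_rendezvous_env({"MASTER_ADDR": "dc-a.example.net", "MASTER_PORT": "29500",
                                 "WORLD_SIZE": "3", "RANK": "3"})
print(problems)  # ['RANK=3 must be in [0, 2]']
```

In a launcher script the same function can be called with os.environ before init_process_group.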

When each worker is itself a multi-GPU FSDP shard group, the inner loop runs under torchrun on that DC's fast fabric (Recipe: FSDP single-DC) while a separate gloo group carries the cross-DC pseudo-gradient. For a churny, join/leave decentralized pool, use the Prime Intellect prime framework, which provides fault-tolerant DiLoCo with elastic membership rather than a fixed WORLD_SIZE. Its predecessor OpenDiLoCo carried the exchange over a Hivemind DHT; confirm on the repo which transport and flags your release uses:

# Prime Intellect prime (DHT-backed, fault-tolerant DiLoCo) -- flags vary by version; verify on the repo.
# Bootstrap a DHT peer, note the printed multiaddr, then start each worker pointing at it:
#   git clone https://github.com/PrimeIntellect-ai/prime && cd prime
#   uv sync
#   uv run torchrun ... --diloco.inner_steps 500 --dht.initial_peers <MADDR>
# Exact CLI/config keys differ across releases -- read the current README before running.

3. Verify: convergence vs a single-DC baseline, and comms saved¶

DiLoCo's correctness gate is convergence parity: the loss curve must track an FSDP/DDP single-DC baseline at the same token budget. Sweep H and watch for drift: too-large H lets workers diverge and the outer step fights it. Log loss per outer step and the bytes moved per sync so the comms saving is measurable, not assumed.

import torch

def diloco_sync_cost(model: torch.nn.Module, H: int) -> dict[str, float]:
    # One parameter-sized all-reduce every H steps. Bytes-per-step is the comms metric to track.
    params = sum(p.numel() for p in model.parameters())
    bytes_per_sync = params * 4                            # fp32 pseudo-gradient (illustrative dtype)
    return {"params": float(params),
            "bytes_per_sync": float(bytes_per_sync),
            "bytes_per_step": bytes_per_sync / H}          # per-step DP moves a full payload (bytes_per_sync) every step

Watch the loss-vs-baseline gap and the per-step comms in telemetry / monitoring / alerting against the SLO/SLI catalog; a convergence regression follows the same triage as the MFU regression runbook. A PromQL guard on outer-loop progress (emit diloco_loss and diloco_loss_baseline from your exporter):

# fire when DiLoCo loss runs >5% above the single-DC baseline for 1h (drift; suspect H too large)
(
  avg_over_time(diloco_loss[1h])
  /
  avg_over_time(diloco_loss_baseline[1h])
) > 1.05

4. Maintain: checkpoint, resume, and tune H¶

The output is an ordinary checkpoint identical in shape to any other run. Save the global parameters and both optimizer states (inner AdamW and outer SGD) at outer-step boundaries so a resume re-enters the loop cleanly. There is no DiLoCo-specific serving path; the checkpoint is served by the normal stacks (serving open-weight models, inference serving).

import torch
import torch.distributed as dist

def save_outer_ckpt(model, inner, outer, outer_step: int, path: str) -> None:
    if dist.get_rank() != 0:                               # one writer; global params are identical
        return
    torch.save({"model": model.state_dict(),
                "inner": inner.state_dict(),               # rank 0's AdamW state only; inner state is per-worker
                "outer": outer.state_dict(),               # outer momentum buffer matters on resume
                "outer_step": outer_step}, path)

Maintenance is mostly tuning H to the link: raise H when the network gates throughput (fewer, larger syncs), lower it when convergence drifts. Larger H also widens the divergence window, so re-validate against the baseline after any change. For a decentralized pool, treat straggler/churn handling as a framework concern. Use the fault-tolerant prime stack rather than a fixed-WORLD_SIZE gloo group. Streaming DiLoCo (DeepMind arXiv 2501.18512) further cuts peak bandwidth by syncing parameter subsets in sequence and overlapping sync with compute (~400x less bandwidth reported). Adopt it when even the per-H burst saturates the WAN link.
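
The "subsets in sequence" idea can be illustrated with a staggered schedule: split the parameters into fragments and give each fragment its own offset within the H-step window, so each sync moves only one fragment. The sketch below is an assumption-laden illustration, not the paper's exact schedule: the evenly spaced offsets and the function name are made up here, and overlap with compute is left out.

```python
def fragments_due(step: int, H: int, num_fragments: int) -> list:
    """Indices of parameter fragments whose sync falls on this (1-indexed) step."""
    if not 1 <= num_fragments <= H:
        raise ValueError("need 1 <= num_fragments <= H")
    stride = H // num_fragments                  # spacing between consecutive fragment syncs
    return [p for p in range(num_fragments) if (step + p * stride) % H == 0]


# H = 100 with 4 fragments: one fragment syncs every 25 steps instead of all four at step 100.
schedule = {step: fragments_due(step, 100, 4) for step in (25, 50, 75, 100)}
print(schedule)  # {25: [3], 50: [2], 75: [1], 100: [0]}
```

Every fragment is still synchronized once per H steps, so total traffic is unchanged; what drops is the size of each burst, to roughly 1/num_fragments of the full payload.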

Failure modes¶

H too large -> workers drift apart, the outer update fights divergence, quality drops. Tune H to the link, not just to minimize comms.
Forgetting the snapshot/reset -> the pseudo-gradient is computed or applied against the wrong base; you must reset params to the pre-inner snapshot before outer.step().
Picking NCCL-over-IB for the cross-worker group -> there is no fast fabric between workers; use gloo/TCP (or a DHT overlay). A step-synchronous design stalls here, which is exactly the case DiLoCo exists to avoid.
Stragglers / churn on a fixed WORLD_SIZE -> a single slow or departing worker blocks the all-reduce. Decentralized pools need the DHT-backed, fault-tolerant prime stack.
Skipping the baseline -> without a single-DC convergence reference you cannot tell a healthy run from silent drift; always validate H against a baseline.

References¶

DiLoCo (Douillard et al., DeepMind, 2023): https://arxiv.org/abs/2311.08105
Streaming DiLoCo with overlapping communication (DeepMind, 2025): https://arxiv.org/abs/2501.18512
OpenDiLoCo (Jaghouar et al., Prime Intellect, 2024): https://arxiv.org/abs/2407.07852
OpenDiLoCo repo (superseded by prime): https://github.com/PrimeIntellect-ai/OpenDiloco
Prime Intellect prime (DHT-backed distributed training): https://github.com/PrimeIntellect-ai/prime
INTELLECT-1 (first globally-distributed 10B training): https://www.primeintellect.ai/blog/intellect-1
Hivemind (decentralized training DHT): https://github.com/learning-at-home/hivemind
PyTorch torch.distributed (backends, all_reduce): https://docs.pytorch.org/docs/stable/distributed.html