Markdown

DiLoCo (distributed low-communication)¶

Scope: a two-level optimisation method that lets workers train mostly independently and synchronise rarely. It is the low-bandwidth alternative to per-step data parallelism for multi-DC and heterogeneous-link training. Contrast with FSDP; part of the distributed-training survey in distributed training; the runnable geo-distributed recipe is Recipe: DiLoCo geo-distributed.

Reference templates on real torch.distributed and Hivemind/DHT APIs: pin versions and validate before production use. The two numpy blocks (outer-step math and communication volume) are executed and asserted; the torch and shell blocks are unexecuted reference templates.

What it is¶

DiLoCo (DeepMind, arXiv 2311.08105) is a local-SGD training method with two nested optimisers:

An inner optimiser (AdamW) runs H local steps on each worker's own data, with no cross-worker communication.
Every H steps, each worker computes its pseudo-gradient (the delta between the parameters at the last sync and its current parameters), and these are all-reduced across workers.
An outer optimiser (SGD with Nesterov momentum) applies the averaged pseudo-gradient to the global parameters, which are broadcast back as the new starting point. Repeat.

Because workers only exchange one parameter-sized tensor every H steps (not every step), DiLoCo communicates roughly 500x less than standard data-parallel/FSDP, trading communication frequency for a small quality gap. It targets settings where the inter-worker link, not compute, is the bottleneck.

Why use it¶

Multi-datacentre / WAN training. Sync cost is amortised over H steps, so workers can sit in different DCs or regions on ordinary internet links rather than a shared IB fabric (cloud and cost).
Heterogeneous, decentralised GPU pools. Loosely-coupled, fault-tolerant by design, which suits commodity / spot / volunteer hardware where a synchronous all-reduce every step is impossible.
Bandwidth-bound clusters. When the network, not the GPUs, gates throughput, cutting comms ~500x recovers utilisation (OpenDiLoCo reported 90-95% compute utilisation training across continents).

When to use it (and when not)¶

Choose DiLoCo over FSDP/HSDP (FSDP) when workers are separated by slow or heterogeneous links (cross-DC, WAN, spot pools); FSDP's per-step all-gather/reduce-scatter assumes a fast intra-cluster fabric and will stall over such links.
Choose FSDP when all workers share a high-bandwidth IB/RoCE fabric in one cluster, where per-step sync gives the cleanest convergence and DiLoCo's overhead buys nothing.
DiLoCo and FSDP compose: run FSDP inside each DiLoCo worker (a worker can itself be a multi-GPU shard) and DiLoCo between islands. The decision is per-link: fast fabric picks FSDP, slow link picks DiLoCo on top.

Architecture¶

flowchart TB
  subgraph W0["Worker 0 (DC A)"]
    I0["Inner AdamW: H local steps"]
    P0["pseudo-grad = theta_global - theta_0"]
    I0 --> P0
  end
  subgraph W1["Worker 1 (DC B)"]
    I1["Inner AdamW: H local steps"]
    P1["pseudo-grad = theta_global - theta_1"]
    I1 --> P1
  end
  P0 -->|"all-reduce every H steps"| AR["Average pseudo-gradients"]
  P1 -->|"all-reduce every H steps"| AR
  AR --> OUT["Outer Nesterov SGD -> new theta_global"]
  OUT -.->|"broadcast, repeat"| I0
  OUT -.->|"broadcast, repeat"| I1

How to use it¶

The method is a thin loop wrapped around an ordinary training step: keep a snapshot of the global parameters, run H inner steps, all-reduce the delta, step the outer optimiser. (OpenDiLoCo provides a reference implementation; the skeleton below is the algorithm, not its full code.)

# diloco_skeleton.py: algorithm sketch (torch reference template); see OpenDiLoCo for a
# complete impl. Validate torch.distributed signatures against your installed build.
import torch, torch.distributed as dist

dist.init_process_group("gloo")                      # WAN-friendly backend
inner = torch.optim.AdamW(model.parameters(), lr=4e-4)
outer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
H = 500                                               # local steps between syncs

while not done:
    global_params = [p.detach().clone() for p in model.parameters()]   # snapshot
    for _ in range(H):                                # inner: no communication
        loss = model(next(loader)).loss
        inner.zero_grad(); loss.backward(); inner.step()

    for p, g0 in zip(model.parameters(), global_params):
        p.grad = (g0 - p.data)                        # pseudo-gradient (delta)
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG) # the ONLY cross-worker comm
        p.data.copy_(g0)                              # reset to snapshot...
    outer.step()                                      # ...then outer applies avg delta

The defining part is the outer step, and its math is exact enough to validate with numpy alone. The block below reproduces the pseudo-gradient, the averaging all-reduce, and the outer SGD update, then asserts the identities that make DiLoCo correct: with outer_lr=1 and no momentum it reduces exactly to FedAvg parameter averaging (equivalence to a slow reference); it is a convex blend at any outer_lr; outer_lr=0 is a no-op; Nesterov momentum provably accelerates repeated syncs; and the published sign moves toward worker consensus while the flipped sign (the classic reset/sign bug) provably diverges.

# diloco_outer_math.py: numpy validation of the DiLoCo outer step (companion to the
# torch.distributed reference). Asserts the pseudo-gradient/average/outer-SGD identity,
# the FedAvg equivalence, the outer_lr=0 boundary, Nesterov acceleration, and that the
# published sign converges toward worker consensus while the flipped sign diverges.
import numpy as np

def pseudo_grads(theta_global, thetas_local):
    # DiLoCo pseudo-gradient per worker: snapshot minus post-inner params (paper's sign).
    return [theta_global - t for t in thetas_local]

def outer_step(theta, grad, lr, momentum=0.0, nesterov=False, buf=None):
    # PyTorch-style SGD (dampening=0); buf carries outer momentum across sync rounds.
    buf = grad.copy() if buf is None else momentum * buf + grad
    d = grad + momentum * buf if nesterov else buf
    return theta - lr * d, buf

def diloco_round(theta_global, thetas_local, lr, momentum=0.0, nesterov=False, buf=None):
    avg = np.mean(pseudo_grads(theta_global, thetas_local), axis=0)   # all-reduce AVG
    return outer_step(theta_global, avg, lr, momentum, nesterov, buf)

rng = np.random.default_rng(0)
D, K = 256, 4                                          # param dim, worker count
theta_g = rng.standard_normal(D)                       # global params at last sync
locals_ = [theta_g + rng.standard_normal(D) for _ in range(K)]   # each drifted by H inner steps
consensus = np.mean(locals_, axis=0)

# 1. FedAvg equivalence (slow reference): outer_lr=1, no momentum == plain param averaging.
new, _ = diloco_round(theta_g, locals_, lr=1.0)
assert np.allclose(new, consensus), "outer_lr=1 must reduce to the FedAvg mean of worker params"

# 2. Convex-combination identity for vanilla outer SGD at a general lr.
lr = 0.7
new, _ = diloco_round(theta_g, locals_, lr=lr)
assert np.allclose(new, (1 - lr) * theta_g + lr * consensus), "outer step is a convex blend"

# 3. Boundary: outer_lr=0 -> base unchanged (a stalled outer opt discards worker progress).
new0, _ = diloco_round(theta_g, locals_, lr=0.0)
assert np.array_equal(new0, theta_g), "lr=0 must not move the global params"

# 4. Nesterov acceleration (adversarial): same pseudo-grad each round -> momentum path must
#    travel strictly farther than plain SGD after >=2 rounds (why the outer opt has momentum).
g = np.mean(pseudo_grads(theta_g, locals_), axis=0)    # fixed averaged pseudo-gradient
p_plain = theta_g.copy(); p_nest = theta_g.copy(); buf = None
for _ in range(3):
    p_plain, _   = outer_step(p_plain, g, lr=0.3, momentum=0.0)
    p_nest,  buf = outer_step(p_nest,  g, lr=0.3, momentum=0.9, nesterov=True, buf=buf)
assert np.linalg.norm(p_nest - theta_g) > np.linalg.norm(p_plain - theta_g), "momentum must accelerate"

# 5. Sign check (adversarial): the published sign (global - local) moves toward consensus;
#    the flipped sign (local - global) provably moves away -> catches the reset/sign bug.
good, _ = diloco_round(theta_g, locals_, lr=0.5)
bad_avg = np.mean([t - theta_g for t in locals_], axis=0)        # WRONG sign
bad, _ = outer_step(theta_g, bad_avg, lr=0.5)
assert np.linalg.norm(good - consensus) < np.linalg.norm(theta_g - consensus), "correct sign converges"
assert np.linalg.norm(bad  - consensus) > np.linalg.norm(theta_g - consensus), "flipped sign diverges"
print("outer-step OK: fedavg_equiv=True convex=True lr0_noop=True nesterov_faster=True sign_converges=True")

Running it prints outer-step OK: fedavg_equiv=True convex=True lr0_noop=True nesterov_faster=True sign_converges=True.

How to integrate with it¶

DiLoCo changes only the between-worker sync, so the inner loop is an ordinary training step and integration is about wiring the process group and composing with intra-worker parallelism.

Backend. Use a WAN/TCP backend (gloo) or a DHT overlay for the cross-worker group, not NCCL-over-IB; there is no fast fabric between workers to justify NCCL.
FSDP inside, DiLoCo across. Each "worker" can be a single GPU or a multi-GPU FSDP shard group running under torchrun on its own fast fabric, while a separate gloo group carries the cross-DC pseudo-gradient. Develop the inner step exactly as a normal single-replica loop, then wrap the outer sync.
Sweep axes. H, the outer LR, and the worker topology are the primary knobs. Expose H and the outer LR through the environment so a campaign can vary them without code edits; the outer momentum (0.9, Nesterov) is what stabilises infrequent updates and is the published default.

# reference template (torch): expose H and the outer LR as the primary sweep axes.
import os, torch
H = int(os.environ.get("DILOCO_H", 500))                 # 50 / 125 / 500 reported in OpenDiLoCo
outer = torch.optim.SGD(model.parameters(),
                        lr=float(os.environ.get("OUTER_LR", 0.7)),
                        momentum=0.9, nesterov=True)

Launch one worker per DC as a torch.distributed rank reached over WAN: point every rank at a routable MASTER_ADDR, set WORLD_SIZE to the number of workers, assign RANK per site, and pin the WAN-facing NIC for gloo.

# reference template: run at DC-A (rank 0, rendezvous host), DC-B (rank 1), DC-C (rank 2)...
export MASTER_ADDR=dc-a.example.net        # routable from every worker over WAN
export MASTER_PORT=29500
export WORLD_SIZE=3                         # number of geo-distributed workers
export RANK=0                              # 0 at DC-A, 1 at DC-B, 2 at DC-C
export DILOCO_H=500                         # local steps between syncs; tune to link bandwidth
export GLOO_SOCKET_IFNAME=eth0             # pin the WAN-facing NIC for the gloo backend
python diloco_train.py

How to run it in production¶

The output is an ordinary checkpoint identical in shape to any other run, but the loop has extra state and a correctness gate that a per-step trainer does not.

Checkpoint both optimiser states at outer-step boundaries. Save the global parameters, the inner AdamW state, and the outer SGD state (including its momentum buffer) so a resume re-enters the loop cleanly. Saving on the wrong boundary, or dropping the outer momentum buffer, corrupts the resume (checkpoint-recovery runbook).

# reference template (torch): one writer; global params are identical across ranks.
import torch, torch.distributed as dist
def save_outer_ckpt(model, inner, outer, outer_step, path):
    if dist.get_rank() != 0:
        return
    torch.save({"model": model.state_dict(),
                "inner": inner.state_dict(),
                "outer": outer.state_dict(),         # outer momentum buffer matters on resume
                "outer_step": outer_step}, path)

Gate on convergence parity. DiLoCo's correctness check is that the loss curve tracks an FSDP/DDP single-DC baseline at the same token budget. Emit diloco_loss and diloco_loss_baseline and alert on drift; a regression follows the same triage as the MFU regression runbook. Wire it into telemetry / monitoring / alerting.

# fire when DiLoCo loss runs >5% above the single-DC baseline for 1h (drift; suspect H too large)
(
  avg_over_time(diloco_loss[1h])
  /
  avg_over_time(diloco_loss_baseline[1h])
) > 1.05

Fault tolerance. A fixed WORLD_SIZE gloo group blocks the all-reduce on any slow or departing worker. For a churny, join/leave decentralised pool use the DHT-backed, fault-tolerant prime stack rather than a fixed world size.

How to maintain it¶

Maintenance is mostly tuning H to the link as conditions change and re-validating after every change:

Raise H when the network gates throughput (fewer, larger syncs); lower H when the loss drifts from the baseline. Larger H widens the divergence window, so re-run the baseline comparison after any H change rather than assuming the old setting still holds.
Adopt Streaming DiLoCo (DeepMind, arXiv 2501.18512) when even the per-H burst saturates the WAN link: it syncs parameter subsets in sequence and overlaps sync with compute, reporting roughly 400x less peak bandwidth.
Track the framework. OpenDiLoCo is no longer maintained; migrate to its successor PRIME (PrimeIntellect-ai/prime) for the improved fault tolerance and bandwidth use, and re-verify flag names against the current README (they vary by version).

How to scale it¶

DiLoCo scales across instances and datacentres, not just across a rack. The open implementations carry the inter-worker exchange over a peer-to-peer DHT so workers can join/leave:

OpenDiLoCo (Prime Intellect, github PrimeIntellect-ai/OpenDiloco, arXiv 2407.07852) implements DiLoCo on a Hivemind DHT, demonstrated training across two continents and three countries at ~90-95% utilisation. Note: OpenDiLoCo is no longer maintained; its successor is PRIME (github PrimeIntellect-ai/prime), with improved fault tolerance and bandwidth use.
INTELLECT-1 (Prime Intellect) scaled this to a 10B-parameter model trained over globally distributed workers, the reference point for low-communication training at scale.

The reason this scales over a WAN at all is the communication volume. DiLoCo moves one parameter-sized all-reduce every H steps against per-step data parallel's one every step, so the saving ratio is exactly H and is independent of model size. The block below asserts that identity, the ~500x headline at H=500, and the H=1 boundary where DiLoCo degenerates back to per-step DP with no saving.

# diloco_comms.py: numpy check of the headline "~500x less communication" and how it
# scales. DiLoCo moves one parameter-sized all-reduce every H steps; per-step data
# parallel moves one every step. The saving ratio is exactly H, independent of model size.
import numpy as np

def diloco_bytes_per_step(params, H, dtype_bytes=4):
    return params * dtype_bytes / H          # one pseudo-grad all-reduce amortized over H

def dp_bytes_per_step(params, dtype_bytes=4):
    return params * dtype_bytes              # per-step DP all-reduces the gradient every step

for params in (125_000_000, 10_000_000_000):           # 125M and INTELLECT-1's 10B scale
    for H in (50, 125, 500):
        ratio = dp_bytes_per_step(params) / diloco_bytes_per_step(params, H)
        assert np.isclose(ratio, H), "comms saving must equal H, independent of param count"

# the page's headline: H=500 -> ~500x less communication than per-step data parallel.
assert np.isclose(dp_bytes_per_step(1_000_000_000) / diloco_bytes_per_step(1_000_000_000, 500), 500.0)
# boundary: H=1 (sync every step) recovers exactly the per-step DP volume (no saving).
assert np.isclose(diloco_bytes_per_step(1_000_000_000, 1), dp_bytes_per_step(1_000_000_000))
print("comms OK: ratio==H for all sizes; H=500 -> 500x; H=1 -> parity with per-step DP")

Running it prints comms OK: ratio==H for all sizes; H=500 -> 500x; H=1 -> parity with per-step DP. Bootstrap the DHT overlay and point each worker at it:

# OpenDiLoCo (reference; pin a commit and verify on the repo as of mid-2026)
hivemind-dht --host_maddrs /ip4/0.0.0.0/tcp/0           # bootstrap the DHT peer
# then launch each worker pointing at the DHT, e.g. (8 workers, H=500):
# torchrun ... train_fsdp.py --per-device ... --hv.local_steps 500 \
#   --hv.initial_peers <DHT_MADDR>   # flags vary by version; check the README

Inference¶

Not applicable. DiLoCo is purely a training-time optimisation method; it produces a single set of global parameters identical in shape to any other checkpoint, which is then served by the normal stacks (inference serving, serving open-weight models). There is no DiLoCo-specific inference path.

Fine-tuning¶

DiLoCo applies directly to distributed fine-tuning over WAN / heterogeneous pools: the inner AdamW loop runs any fine-tuning objective (continued pretraining, SFT) while the outer sync keeps geographically separated workers in agreement. For parameter-efficient and RL post-training methods (which usually run inside one cluster and lean on FSDP) see fine-tuning and post-training, SFT and LoRA, and the RL libraries in RL libraries.

Optimised hardware¶

Built for non-IB networks. DiLoCo's whole premise is tolerating slow/heterogeneous links; it does not need NVLink or InfiniBand between workers, only enough bandwidth to move one parameter-sized all-reduce every H steps. That is the explicit contrast with FSDP, whose per-step all-gather/reduce-scatter demands a fast intra-cluster fabric.
Inside a worker, ordinary fast-fabric rules still apply: if a worker is an FSDP shard group, it wants NVLink intra-node and IB intra-island with GDR (NCCL_NET_GDR_LEVEL=SYS, [GDRDMA] in NCCL_DEBUG=INFO) (performance tuning).
The inter-worker sync runs over TCP/WAN; backend is typically gloo or a DHT overlay, not NCCL-over-IB.

Cookbook (common use cases)¶

1. The DiLoCo outer loop (drop H inner steps between syncs)

# reference template (torch): the outer loop validated numerically in "How to use it".
for round in range(num_rounds):
    snap = [p.detach().clone() for p in model.parameters()]
    for _ in range(H): train_one_step(model, inner, loader)   # no comms
    for p, g0 in zip(model.parameters(), snap):
        p.grad = g0 - p.data
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
        p.data.copy_(g0)
    outer.step()                                              # one sync / round

2. Bootstrap an OpenDiLoCo / Hivemind run (multi-DC)

hivemind-dht --host_maddrs /ip4/0.0.0.0/tcp/0   # note the printed multiaddr
# start N workers in different DCs, each with --hv.initial_peers <that multiaddr>
# and --hv.local_steps 500 ; exact flags per the repo README (verify, mid-2026)

3. Multi-DC layout: FSDP inside, DiLoCo across

DC-A: 8xGPU FSDP shard group  ─┐
DC-B: 8xGPU FSDP shard group  ─┼─ DiLoCo all-reduce of pseudo-grad every H steps
DC-C: 8xGPU FSDP shard group  ─┘   (fast fabric intra-DC, WAN inter-DC)

Failure modes¶

H too large moves workers apart until the outer update fights divergence and quality drops. Tune H to the link, not just to minimise comms.
Forgetting the snapshot/reset computes or applies the pseudo-gradient against the wrong base; you must reset params to the pre-inner snapshot before the outer step. The numpy sign check above is exactly this bug: the flipped sign provably diverges.
Using NCCL-over-IB assumptions stalls, because there is no fast fabric between workers; a step-synchronous design (FSDP) here is exactly the case DiLoCo exists to avoid.
Stragglers / churn in a decentralised pool block a fixed-WORLD_SIZE all-reduce; use the fault-tolerant successor (PRIME) rather than the unmaintained OpenDiLoCo for production.
Treating DiLoCo as a serving feature mistakes a training-only method for an inference path; the output is just a normal checkpoint.

References¶

DiLoCo (Douillard et al., DeepMind, 2023): https://arxiv.org/abs/2311.08105
Streaming DiLoCo with overlapping communication (DeepMind, 2025): https://arxiv.org/abs/2501.18512
OpenDiLoCo (Jaghouar et al., Prime Intellect, 2024): https://arxiv.org/abs/2407.07852
OpenDiLoCo repo (note: superseded by PRIME): https://github.com/PrimeIntellect-ai/OpenDiloco · PRIME: https://github.com/PrimeIntellect-ai/prime
INTELLECT-1 (first globally-distributed 10B training): https://www.primeintellect.ai/blog/intellect-1
Hivemind (decentralised training DHT): https://github.com/learning-at-home/hivemind
PyTorch torch.distributed (backends, all_reduce): https://docs.pytorch.org/docs/stable/distributed.html