Markdown

DeepSpeed and ZeRO¶

Scope: the ZeRO family of sharded-data-parallel optimisations (stages 1/2/3 + CPU/NVMe offload) and the deepspeed launcher, as one of the memory-scaling strategies surveyed in distributed training.

Reference templates use real APIs (deepspeed, torch, NCCL): pin versions and validate before production use. The deepspeed and torch snippets below need CUDA GPUs and are not executed here; each core algorithm they teach (the ZeRO memory model, the sharded optimizer step, and the reduce-scatter plus all-gather collective) is mirrored by a runnable, asserted numpy block (run with python3) that validates the underlying math.

What it is¶

DeepSpeed is a training library whose core contribution is ZeRO (Zero Redundancy Optimizer): a set of techniques that eliminate the per-GPU memory redundancy of plain data parallelism. In standard DDP every rank holds a full copy of the optimizer state, gradients, and parameters; ZeRO partitions that state across the data-parallel group so each rank stores only 1/N of it, gathering shards on demand during forward/backward.

The partitioning is staged:

ZeRO-1: partition optimizer state (Adam moments, fp32 master weights). Largest memory term, no extra communication versus DDP.
ZeRO-2: additionally partition the gradients.
ZeRO-3: additionally partition the parameters themselves; each layer's weights are all-gathered just-in-time for its forward/backward, then released. This is the equivalent of FSDP full sharding (FSDP).

Offload moves partitioned state off the GPU: ZeRO-Offload to host CPU+DRAM, and ZeRO-Infinity extends ZeRO-3 to spill optimizer/parameter state to NVMe, enabling models far larger than aggregate GPU memory at the cost of bandwidth.

Why use it¶

Memory: ZeRO-3 makes the per-GPU memory roughly total_state / N, so adding GPUs both adds compute and reduces per-GPU footprint, the lever that fits 70B+ models for full fine-tuning (fine-tuning and post-training).
Mature and integrated: shipped since 2020, wired into HuggingFace transformers/accelerate, Megatron-DeepSpeed, and most RL stacks (verl, SkyRL, OpenRLHF, NeMo-RL all support a DeepSpeed backend, see RL libraries).
vs FSDP (FSDP): same sharding idea. ZeRO is a config-file-driven library with first-class CPU/NVMe offload and a built-in launcher; FSDP2 is native PyTorch with a tighter DeviceMesh/DTensor integration. For new pure-PyTorch code FSDP2 is increasingly preferred; DeepSpeed wins where offload, its config ergonomics, or an existing integration matter.

When to use it (and when not)¶

Use ZeRO-1/2 when the model fits but optimizer/gradient memory is the squeeze; near-free speed, minimal config.
Use ZeRO-3 when parameters themselves do not fit per GPU; accept extra all-gather traffic.
Use offload (CPU/NVMe / ZeRO-Infinity) only when you have run out of GPU memory and accept a large throughput hit; it trades PCIe/NVMe bandwidth for capacity.
Prefer FSDP2 (FSDP) for greenfield PyTorch-native training that wants DeviceMesh composition with TP/PP.
Prefer tensor (tensor parallelism) or pipeline (pipeline parallelism) parallelism when a single layer is too large for one GPU; ZeRO shards across layers, not within them. In practice ZeRO is composed with TP/PP for the largest models.

Architecture¶

N data-parallel ranks each own one shard of the optimizer, gradient, and (under ZeRO-3) parameter state; shards are all-gathered on demand for compute and reduce-scattered after backward, and can spill to CPU or NVMe under ZeRO-Infinity.

flowchart TB
  subgraph DP["Data-parallel group (N ranks)"]
    G0["GPU 0: shard 0 (opt/grad/param)"]
    G1["GPU 1: shard 1"]
    G2["GPU 2: shard 2"]
    G3["GPU 3: shard 3"]
  end
  subgraph OFF["Offload tiers (ZeRO-Infinity)"]
    CPU["Host CPU + DRAM"]
    NVME["Local NVMe"]
  end
  G0 ---|"all-gather params (fwd/bwd)"| G1
  G1 --- G2
  G2 --- G3
  G3 -.->|"reduce-scatter grads"| G0
  G0 -.->|"offload optimizer state"| CPU
  CPU -.->|"spill (ZeRO-Infinity)"| NVME

How to use it¶

DeepSpeed is driven by a JSON config plus the deepspeed launcher. Wrap any model and optimizer with deepspeed.initialize; the returned engine owns the forward, the gradient reduce-scatter, and the sharded optimizer step:

# Reference template: deepspeed + torch + CUDA GPUs (not executed here).
# train.py
import deepspeed, torch

net = build_model()                       # build ONCE; pass the same instance to both args
model, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    config="ds_config.json",
)
for batch in loader:
    loss = model(batch)        # engine handles fwd
    model.backward(loss)       # engine handles grad reduce-scatter
    model.step()               # engine handles optimizer + sharding

# single node, all visible GPUs
deepspeed train.py --deepspeed --deepspeed_config ds_config.json

// ds_config.json: minimal ZeRO-2
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2, "overlap_comm": true, "contiguous_gradients": true }
}

Validated (numpy, runnable). Which stage you set in that config is a memory calculation. For mixed-precision Adam the per-parameter cost is 2 bytes (fp16 params) + 2 bytes (fp16 grads) + 12 bytes (fp32 optimizer state) = 16 bytes; ZeRO-1/2/3 divide progressively more of that by the data-parallel degree N. The block reproduces Table 1 of the ZeRO paper (a 7.5B model on 64 GPUs: 120 / 31.4 / 16.6 / 1.9 GB), proves each stage strictly shrinks per-GPU memory, that ZeRO-3 reaches total/N while plain DDP stays flat in N, that optimizer offload removes the 12-byte term from the GPU, and catches a common mis-accounting.

import numpy as np

# Mixed-precision Adam byte accounting (ZeRO paper, Rajbhandari et al. 2019).
# Per parameter: fp16 params 2B, fp16 grads 2B, fp32 optimizer states
# (fp32 master weights 4B + Adam momentum 4B + variance 4B) = 12B. K = 12.
P_BYTES, G_BYTES, OPT_BYTES = 2, 2, 12          # per parameter, resident on GPU
FULL = P_BYTES + G_BYTES + OPT_BYTES            # 16 bytes/param (baseline DDP)

def per_gpu_bytes(psi, n, stage, offload_opt=False):
    # psi = parameter count, n = data-parallel degree, stage in {0,1,2,3}.
    opt = 0 if offload_opt else OPT_BYTES       # ZeRO-Offload moves opt state off GPU
    if stage == 0:                              # plain DDP: nothing sharded
        return (P_BYTES + G_BYTES + opt) * psi
    if stage == 1:                              # partition optimizer states
        return (P_BYTES + G_BYTES + opt / n) * psi
    if stage == 2:                              # + partition gradients
        return (P_BYTES + (G_BYTES + opt) / n) * psi
    if stage == 3:                              # + partition parameters
        return (P_BYTES + G_BYTES + opt) / n * psi
    raise ValueError(stage)

GB = 1e9
psi, n = 7.5e9, 64                              # 7.5B params, 64-way data parallel

# 1) Reproduce Table 1 of the ZeRO paper (7.5B model, N_d = 64).
assert round(per_gpu_bytes(psi, n, 0) / GB, 1) == 120.0
assert round(per_gpu_bytes(psi, n, 1) / GB, 1) == 31.4     # P_os
assert round(per_gpu_bytes(psi, n, 2) / GB, 1) == 16.6     # P_os+g
assert round(per_gpu_bytes(psi, n, 3) / GB, 1) == 1.9      # P_os+g+p

# 2) Each further stage strictly reduces per-GPU memory (for n > 1).
mem = [per_gpu_bytes(psi, n, s) for s in range(4)]
assert mem[0] > mem[1] > mem[2] > mem[3]

# 3) ZeRO-3 drives per-GPU state to total/N exactly (the "adding GPUs shrinks the
#    footprint" lever); plain DDP is flat in N.
assert np.isclose(per_gpu_bytes(psi, n, 3), FULL * psi / n)
assert per_gpu_bytes(psi, 1, 0) == per_gpu_bytes(psi, 64, 0)      # DDP: no N benefit

# 4) Offload (ZeRO-Infinity) removes the optimizer term from GPU memory.
assert np.isclose(per_gpu_bytes(psi, n, 3, offload_opt=True),
                  (P_BYTES + G_BYTES) * psi / n)

# 5) Adversarial: mislabelling ZeRO-2 as "grads NOT sharded" (the stage-1 formula)
#    over-counts memory and fails to match the paper's 16.6 GB.
wrong_z2 = (P_BYTES + G_BYTES + OPT_BYTES / n) * psi              # stage-1 math by mistake
assert round(wrong_z2 / GB, 1) != 16.6
assert wrong_z2 > per_gpu_bytes(psi, n, 2)

print("block1 OK: memory model matches paper 120/31.4/16.6/1.9 GB; monotone in stage; "
      "stage3=total/N; DDP flat in N; offload drops opt term; mis-accounting caught")

Validated (numpy, runnable). The engine's model.step() under ZeRO-1 partitions the Adam state (momentum, variance, fp32 master weights) across ranks so each rank updates only its slice, then all-gathers the result. Because Adam is element-wise, that sharded step must reproduce the unsharded single-GPU update exactly. The block proves that equivalence and catches two real failures: a rank running a wrong (bias-correction-free) Adam, and a gather that drops a shard (leaving stale params, the ZeRO-3 save bug in miniature).

import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # standard Adam with bias correction; element-wise, hence shard-separable.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    return w - lr * mhat / (np.sqrt(vhat) + eps), m, v

rng = np.random.default_rng(0)
P, N, t = 12, 4, 7                      # 12 params, 4 ranks, optimizer step 7
w = rng.standard_normal(P)
g = rng.standard_normal(P)
m = rng.standard_normal(P) * 0.1        # first-moment estimate
v = rng.random(P) * 0.1 + 1e-3          # second-moment estimate (positive)

# Reference: one unsharded Adam step over all P params (single-GPU ground truth).
w_ref, _, _ = adam_step(w, g, m, v, t)

# ZeRO-1: partition optimizer state into N disjoint shards; each rank updates only
# its slice from its own (m, v, fp32 master weights), then params are all-gathered.
shards = np.array_split(np.arange(P), N)
assert P % N == 0 and sum(len(s) for s in shards) == P
w_zero = w.copy()
for s in shards:
    w_zero[s], _, _ = adam_step(w[s], g[s], m[s], v[s], t)       # rank-local update

# Sharding the optimizer is transparent: identical result to the unsharded step.
assert np.allclose(w_zero, w_ref, atol=1e-12), np.abs(w_zero - w_ref).max()

# Adversarial 1: a rank running a wrong (bias-correction-free) Adam diverges the gather.
def adam_no_bias(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return w - lr * m / (np.sqrt(v) + eps), m, v                 # missing 1-b**t term
w_bad = w.copy()
w_bad[shards[0]], _, _ = adam_no_bias(w[shards[0]], g[shards[0]], m[shards[0]], v[shards[0]], t)
for s in shards[1:]:
    w_bad[s], _, _ = adam_step(w[s], g[s], m[s], v[s], t)
assert not np.allclose(w_bad, w_ref, atol=1e-9)                  # one wrong rank corrupts it

# Adversarial 2: a gather that drops a shard leaves those params stale (== original),
# the ZeRO-3 "weights not consolidated on save" bug in miniature.
w_drop = w.copy()
for s in shards[1:]:                                             # shard 0 never written back
    w_drop[s], _, _ = adam_step(w[s], g[s], m[s], v[s], t)
assert np.array_equal(w_drop[shards[0]], w[shards[0]])           # stale slice
assert not np.allclose(w_drop, w_ref, atol=1e-9)                 # missing shard => wrong

print("block2 OK: sharded Adam == unsharded Adam; wrong-Adam rank and dropped-shard "
      "gather both caught")

How to integrate with it¶

Most projects never call DeepSpeed directly; they let HuggingFace accelerate/transformers consume the same JSON:

accelerate launch --config_file acc.yaml train_hf.py   # acc.yaml -> deepspeed_config_file: ds_config.json
# or transformers Trainer: TrainingArguments(deepspeed="ds_config.json", bf16=True)

Key dev knobs in the config: gradient_clipping, activation_checkpointing (recompute activations to trade compute for memory), zero_optimization.stage3_gather_16bit_weights_on_model_save (so save_pretrained emits consolidated weights under ZeRO-3), and wall_clock_breakdown for per-phase timing. Inspect throughput with the engine's built-in flops/timer output before tuning.

How to run it in production¶

Production DeepSpeed is a launch-orchestration and fabric problem: the training code is unchanged, and what matters is how ranks are started, how the all-gather/reduce-scatter reaches the wire, and how the sharded state is checkpointed and consolidated.

Launch orchestration¶

The deepspeed launcher reads a hostfile (machines + slot counts) and starts ranks over passwordless SSH:

# hostfile  (one line per node)
node-0 slots=8
node-1 slots=8

deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 8 \
  train.py --deepspeed --deepspeed_config ds_config_zero3.json

DeepSpeed auto-propagates NCCL_* and PYTHON* env vars; add others via a .deepspeed_env file (newline-separated VAR=VAL). Pin the rendezvous explicitly with --master_addr/--master_port when the default interface is ambiguous (see the cookbook below).

Hardware and fabric¶

Communication: ZeRO-3 adds parameter all-gathers on top of DDP's gradient reduce-scatter, so it is bandwidth-bound. Intra-node NVLink/NVSwitch and inter-node InfiniBand/RoCE with GPUDirect RDMA are what keep the gather cost hidden behind compute (networking fabric, performance tuning).
NCCL: set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS, consider NCCL_NVLS_ENABLE for NVLink SHARP all-reduce; confirm GDR with NCCL_DEBUG=INFO showing [GDRDMA]. Keep PCIe ACS off for P2P/GDR.
Offload tradeoffs: CPU offload is gated by PCIe (and host DRAM bandwidth); NVMe offload (ZeRO-Infinity) by drive throughput and the async-I/O (aio) settings. Both can drop GPU utilisation sharply, so measure MFU and prefer adding GPUs over offload when budget allows (cloud and cost).
Blackwell: ZeRO benefits from bf16/fp8 mixed precision to shrink the state it shards; pair with the platform's transformer-engine kernels (the Blackwell platform).

Checkpointing and consolidation¶

Because ZeRO shards the optimizer state (and, at stages 2/3, gradients and parameters), a checkpoint is naturally sharded per rank:

Resumable training state: model.save_checkpoint(dir) / model.load_checkpoint(dir) writes and restores the sharded optimizer state efficiently for resume at the same topology; this is the path to prefer for long runs.
Consolidated 16-bit weights for inference or save_pretrained: set zero_optimization.stage3_gather_16bit_weights_on_model_save: true, or call model.save_16bit_model(...), so a single gathered model is emitted under ZeRO-3 instead of shards.
Offline fp32 reconstruction: DeepSpeed writes a zero_to_fp32.py helper into the checkpoint directory that rebuilds a single full fp32 state_dict from the shards after the fact, without needing the original GPUs.

How to maintain it¶

A ZeRO job is only as healthy as its slowest rank and its fabric; maintenance keeps every rank identical, the gather cost hidden, and the config under control.

Pin versions across every node. All ranks must run the same DeepSpeed, PyTorch, CUDA, and NCCL build; a mismatched build across hosts causes silent hangs or divergent numerics. Upgrade the whole job as one coordinated redeploy, never a rolling per-node bump.
Re-verify the fabric after every change. Re-check NCCL_DEBUG=INFO for [GDRDMA] and PCIe ACS state after any driver, firmware, or topology change, since a regression silently drops the all-gather/reduce-scatter onto TCP and step time collapses (networking fabric).
Watch throughput and MFU. Enable wall_clock_breakdown: true for per-phase timing and track MFU; if offload has quietly dropped GPU utilisation, prefer adding GPUs over offload when budget allows (performance tuning).
Treat the JSON config as code. Keep it in version control: changing stage, bucket sizes, or offload flips memory and throughput characteristics, so config drift is a real regression source.
Let the engine own the step. Configure the LR schedule in the JSON and let the engine own step(); an external scheduler stepping in parallel double-steps or applies the wrong LR.

How to scale it¶

ZeRO is fundamentally a multi-instance / multi-node technique: the data-parallel group spans all ranks across all nodes (the launcher and hostfile that start those ranks are covered under production above). For very large models compose ZeRO with tensor (tensor parallelism) and pipeline (pipeline parallelism) parallelism (the Megatron-DeepSpeed "3D" layout); keep TP within a node and let ZeRO-DP span nodes (distributed training, distributed-training recipes).

Scaling is comm-bound because ZeRO-2/3 replace DDP's single gradient all-reduce with a reduce-scatter of gradients plus an all-gather of parameters. That substitution is exact, which is why sharding is transparent to the result and why a dropped shard is a correctness bug, not just a slowdown.

Validated (numpy, runnable). The block proves that reduce-scatter followed by all-gather equals a full all-reduce over the whole gradient, that a sharded parameter tensor all-gathers back to the original, and it catches two real failures: forgetting the 1/N averaging in reduce-scatter (which inflates the gradient N-fold) and a dropped shard on the wire (the NCCL-on-TCP / GDR-fallback symptom).

import numpy as np

rng = np.random.default_rng(3)
N, D = 4, 20                             # 4 ranks, gradient/param vector of length 20
grads = rng.standard_normal((N, D))      # each rank's local gradient (own microbatch)

# DDP reference: all-reduce = average the full gradient across all ranks.
allreduce = grads.mean(axis=0)

# ZeRO-2/3 replace that single all-reduce with reduce-scatter then all-gather.
# reduce-scatter: rank r receives the averaged shard r (sum over ranks / N).
shards = np.array_split(np.arange(D), N)
reduced = {r: grads[:, s].sum(axis=0) / N for r, s in enumerate(shards)}
# all-gather: every rank concatenates all averaged shards back to full length.
gathered = np.concatenate([reduced[r] for r in range(N)])

assert np.allclose(gathered, allreduce, atol=1e-12)              # RS + AG == all-reduce

# Parameter all-gather: rank r stores only shard r; gather reconstructs the full tensor.
params = rng.standard_normal(D)
sharded = [params[s] for s in shards]
assert np.allclose(np.concatenate(sharded), params, atol=1e-12)

# Adversarial 1: forgetting the /N in reduce-scatter (SUM not average) inflates N-fold.
bad_sum = np.concatenate([grads[:, s].sum(axis=0) for s in shards])
assert not np.allclose(bad_sum, allreduce, atol=1e-6)
assert np.allclose(bad_sum, N * allreduce, atol=1e-12)          # exact corruption signature

# Adversarial 2: a dropped shard (comm fails on one rank's portion) corrupts the gather.
broken = gathered.copy()
broken[shards[1]] = 0.0                                          # shard 1 never arrived
assert not np.allclose(broken, allreduce, atol=1e-6)

print("block3 OK: reduce-scatter + all-gather == all-reduce; param gather reconstructs "
      "the full tensor; missing /N and dropped shard both caught")

Inference¶

ZeRO is a training optimiser, but two related inference paths exist: DeepSpeed-Inference (kernel-fused tensor-parallel serving) and ZeRO-Inference (offload weights to CPU/NVMe to run a model larger than GPU memory for throughput-oriented, latency-tolerant batch inference). For production LLM serving these are largely superseded by vLLM/SGLang/Dynamo (see inference serving and serving open-weight models); reach for ZeRO-Inference only for offline scoring of an oversized model on scarce GPUs.

Fine-tuning¶

ZeRO is the standard backend for full fine-tuning of large models: ZeRO-2 or ZeRO-3 keeps a 7B-70B full fine-tune within a node or a small multi-node group, and CPU/NVMe offload pushes the ceiling further on constrained hardware. Parameter-efficient methods (LoRA/QLoRA, see SFT and LoRA) reduce the optimizer-state pressure ZeRO targets, so they often need only ZeRO-1/2. Post-training recipes (SFT, DPO, GRPO) frequently run the policy under a DeepSpeed engine (fine-tuning and post-training, GRPO, DPO).

Cookbook (common use cases)¶

1. ZeRO-3 full-shard config (params + grads + optimizer partitioned)

// ds_config_zero3.json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 16,
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 5e8,
    "stage3_prefetch_bucket_size": 5e8,
    "stage3_param_persistence_threshold": 1e6,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "activation_checkpointing": { "partition_activations": true, "contiguous_memory_optimization": true }
}

2. Add CPU then NVMe offload (ZeRO-Infinity)

// patch the zero_optimization block above
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": { "device": "nvme", "nvme_path": "/local_nvme", "pin_memory": true, "buffer_count": 4 },
  "offload_param":     { "device": "cpu",  "pin_memory": true }
},
"aio": { "block_size": 1048576, "queue_depth": 8, "thread_count": 1, "single_submit": false, "overlap_events": true }

3. Multi-node launch with explicit master + extra env

printf 'NCCL_IB_HCA=mlx5\nNCCL_NET_GDR_LEVEL=SYS\n' > .deepspeed_env
deepspeed --hostfile hostfile --num_nodes 4 --num_gpus 8 \
  --master_addr node-0 --master_port 29500 \
  train.py --deepspeed --deepspeed_config ds_config_zero3.json

Failure modes¶

Several of these are reproduced in miniature by the validated numpy blocks above: the dropped-shard gather (sharded-Adam block), a mis-accounted stage memory (memory-model block), and a dropped reduce-scatter shard (collective block).

ZeRO-3 weights not consolidated on save: save_pretrained writes shards; set stage3_gather_16bit_weights_on_model_save: true or call model.save_16bit_model(...).
Offload enabled "to be safe": 2-5x throughput collapse from PCIe/NVMe bandwidth. Only offload when truly out of GPU memory (performance tuning).
Hostfile / SSH misconfig: ranks never start or hang in init; verify passwordless SSH and that slots matches real GPU count per node.
NCCL on TCP (missing NCCL_IB_HCA, ACS on): all-gather dominates step time; confirm [GDRDMA] in NCCL_DEBUG=INFO (networking fabric).
Mixing ZeRO-3 with a custom forward that touches released params: shape/None errors; parameters exist only inside their gathered scope.
DeepSpeed optimizer vs external scheduler mismatch: double-stepping or wrong LR; let the engine own step() and configure the scheduler in the JSON.
bf16 vs fp16 loss scaling: fp16 needs dynamic loss scaling; prefer bf16 on Ampere+/Hopper/Blackwell to avoid overflow tuning.

References¶

DeepSpeed site & ZeRO tutorial: https://www.deepspeed.ai/tutorials/zero/ · ZeRO-Offload: https://www.deepspeed.ai/tutorials/zero-offload/
DeepSpeed config-json reference: https://www.deepspeed.ai/docs/config-json/ · Getting started (launcher/hostfile): https://www.deepspeed.ai/getting-started/
ZeRO API (zero3): https://deepspeed.readthedocs.io/en/latest/zero3.html · GitHub: https://github.com/deepspeedai/DeepSpeed
ZeRO paper (Rajbhandari et al., 2019): https://arxiv.org/abs/1910.02054 · ZeRO-Infinity: https://arxiv.org/abs/2104.07857
HuggingFace transformers DeepSpeed integration: https://huggingface.co/docs/transformers/en/deepspeed