Runbook: training out-of-memory

Scope: triage and fix a distributed training job that hits CUDA out of memory on the GPU. Separate a true OOM (the working set genuinely exceeds VRAM) from caching-allocator fragmentation (free memory exists but no contiguous block does), and apply the right lever: batch/sequence length, activation checkpointing, sharding, or allocator config.

Run this when a training rank dies with torch.cuda.OutOfMemoryError: CUDA out of memory. The error line itself names the answer: read the Tried to allocate ... / reserved ... / allocated ... / free ... fields before changing anything. Severity: job-down, but recoverable from the last checkpoint. Do not blindly cut batch size; that hides a fragmentation bug and burns throughput.

The snippets here are reference templates built on real APIs; pin versions and validate before production use. The commands have not been tested on hardware.

A true OOM and a fragmentation OOM look identical at the prompt but need opposite fixes. The PyTorch caching allocator retains freed blocks, so memory_reserved() is normally larger than memory_allocated(); when the gap is large and an allocation still fails, the pool is fragmented, not exhausted (the allocator page, GPU memory hierarchy). Cutting batch size on a fragmentation OOM wastes capacity; tuning the allocator on a true OOM does nothing. Sharding background is in FSDP and DeepSpeed ZeRO; recovery is the checkpoint-recovery runbook.
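
The distinction above can be sketched as a small decision function. This is a heuristic under stated assumptions, not a PyTorch API: `classify_oom` is a hypothetical helper, and the rule (the reserved-but-unallocated gap is at least as large as the failed request) is one reading of "gap large relative to what was requested". Feed it the values from memory_allocated(), memory_reserved(), and the Tried to allocate field.

```python
def classify_oom(allocated: int, reserved: int, requested: int) -> str:
    """Label an OOM from allocator counters (all values in bytes).

    Hypothetical helper. The rule is a heuristic: if the pool already holds
    at least `requested` bytes that no live tensor uses, yet the allocation
    failed, no single cached block was large enough -> fragmentation.
    Otherwise the working set itself does not fit -> true OOM.
    """
    if requested <= 0 or not (0 <= allocated <= reserved):
        raise ValueError("expected 0 <= allocated <= reserved and requested > 0")
    gap = reserved - allocated  # cached by the pool, not backing live tensors
    return "fragmentation" if gap >= requested else "true_oom"


GiB = 1 << 30
print(classify_oom(allocated=70 * GiB, reserved=78 * GiB, requested=2 * GiB))
print(classify_oom(allocated=77 * GiB, reserved=78 * GiB, requested=2 * GiB))
```

Treat the label as a starting hypothesis; confirm it against memory_summary() before picking a lever.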

Trigger

  • A rank raises torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate <X> ... GPU has a total capacity of <Y> ... <Z> is reserved by PyTorch but unallocated.
  • One rank dies and the collective wedges the rest, so the surviving ranks present as a hang (the NCCL-hang runbook); the OOM rank is the root cause.
  • OOM appears mid-run, at a later step than the first few. That points at a growing working set (longer sequences, leaked references, optimizer state materializing) or progressive fragmentation, not a static sizing error.
  • OOM appears on step 0 / warmup. That points at a static sizing error: batch, sequence length, model size vs sharding degree.
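
For the step-0 case, the "model size vs sharding degree" check can be estimated before launch. The sketch below assumes mixed-precision Adam at 16 bytes per parameter (2 for half-precision parameters, 2 for half-precision gradients, 12 for fp32 master weights plus two Adam moments), which is the accounting commonly used for ZeRO. `model_state_bytes_per_gpu` is a hypothetical helper; it ignores activations, communication buffers, and allocator overhead, so read the result as a lower bound.

```python
def model_state_bytes_per_gpu(num_params: int, shard_degree: int, stage: int) -> float:
    """Approximate per-GPU bytes for parameters, gradients, and optimizer state.

    Assumption: mixed-precision Adam, 2 + 2 + 12 bytes per parameter.
    stage 0 = no sharding, 1 = optimizer state sharded,
    2 = + gradients sharded, 3 = + parameters sharded (FSDP full-shard-like).
    """
    if shard_degree < 1 or stage not in (0, 1, 2, 3):
        raise ValueError("shard_degree must be >= 1 and stage in 0..3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:
        optim /= shard_degree
    if stage >= 2:
        grads /= shard_degree
    if stage >= 3:
        params /= shard_degree
    return params + grads + optim


# Illustrative sizing: 7.5B parameters sharded across 64 ranks.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB of model state per GPU")
```

If the estimate alone is close to card capacity, no batch-size change will help; the sharding degree or stage has to go up.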

Pre-checks

  • Read the error fields first. Tried to allocate vs free vs reserved but unallocated is the entire diagnosis. If reserved but unallocated is large relative to what was Tried to allocate, suspect fragmentation, not exhaustion (the allocator page).
  • Rule out a hardware cause so you do not chase a software OOM that is really a faulted GPU. Scan for XID and ECC events on the OOM node; a row-remap/ECC-degraded GPU loses usable VRAM (the ECC-toggle runbook, the GPU-fault runbook):
    ssh "$OOM_NODE" 'dmesg -T | grep -i "NVRM: Xid"'
    ssh "$OOM_NODE" 'nvidia-smi -q -d ROW_REMAPPER,ECC | grep -iE "remap|pending|uncorrect"'
    
  • Confirm no co-tenant is eating the card. A leaked prior process or a MIG/MPS co-tenant shrinks the budget the job assumed it had (MIG partitioning):
    ssh "$OOM_NODE" 'nvidia-smi --query-compute-apps=pid,used_memory --format=csv'
    
  • Note the last good checkpoint / step so the retry is bounded (the checkpoint-recovery runbook).
  • Identify the OOM rank and its node from the launcher logs. That is the one to instrument; the others are collateral.
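
Reading the error fields can be scripted when many rank logs need scanning. The sketch below assumes the message wording used by recent PyTorch releases; that wording has changed between versions, so the patterns are assumptions to validate against your pinned version. `parse_oom_fields` is a hypothetical helper and the sample string is illustrative, not a captured log.

```python
import re

_UNITS = {"KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"

# Assumed phrasings; adjust if your PyTorch version words the error differently.
_PATTERNS = {
    "requested": rf"Tried to allocate {_SIZE}",
    "total": rf"total capacity of {_SIZE}",
    "free": rf"of which {_SIZE} is free",
    "allocated": rf"{_SIZE} is allocated by PyTorch",
    "reserved_unallocated": rf"{_SIZE} is reserved by PyTorch but unallocated",
}


def parse_oom_fields(message: str) -> dict:
    """Return the byte values found in an OOM message; absent fields are omitted."""
    fields = {}
    for name, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            fields[name] = int(float(match.group(1)) * _UNITS[match.group(2)])
    return fields


# Illustrative message in the assumed format.
sample = (
    "CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity "
    "of 80.00 GiB of which 1.00 GiB is free. Of the allocated memory 70.00 GiB is "
    "allocated by PyTorch, and 6.50 GiB is reserved by PyTorch but unallocated."
)
fields = parse_oom_fields(sample)
suspect = fields["reserved_unallocated"] >= fields["requested"]
print(fields)
print("suspect fragmentation" if suspect else "suspect true OOM")
```

A missing key means the pattern did not match, which is itself a signal that the message format differs from the one assumed here.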

Flow

flowchart TB
    A["Rank dies: CUDA out of memory"] --> B["Read error fields:<br/>tried / reserved / free"]
    B --> C["scancel job (free the GPUs)"]
    C --> D["XID / ECC / row-remap<br/>on OOM node?"]
    D -->|"yes"| E["GPU fault / ECC path"]
    D -->|"no"| F["memory_summary:<br/>reserved - allocated gap"]
    F -->|"gap small,<br/>reserved near capacity"| G["TRUE OOM"]
    F -->|"gap large,<br/>alloc still failed"| H["FRAGMENTATION"]
    G --> I["1. micro-batch + grad accum<br/>2. shorter / bucketed seq<br/>3. activation checkpointing<br/>4. raise sharding degree"]
    H --> J["PYTORCH_ALLOC_CONF=<br/>expandable_segments:True<br/>(then max_split_size_mb)"]
    I --> K["Relaunch from last checkpoint"]
    J --> K
    E --> L["Drain node, replace,<br/>resume at reduced world-size"]

Procedure

JOB_ID=123456          # Slurm job id of the OOM run
OOM_NODE=gpu-21        # node hosting the OOM rank

Cordon and stop before mutating. Do not edit a live run's launch config; take the allocation out of rotation, change the template, relaunch.

  1. Stop the wedged job cleanly. One OOM rank leaves the rest blocked in a collective and holding the allocation, so kill it to stop it occupying GPUs (the NCCL-hang runbook):
    scancel "$JOB_ID"
    
  2. Classify: true OOM vs fragmentation. Instrument the OOM rank to dump the allocator state at the failure point. Wrap the training step so the dump fires in the OOM handler, then read the summary (the allocator page):
    import torch
    print(torch.cuda.memory_summary())                  # full segment/block breakdown
    alloc = torch.cuda.memory_allocated()               # bytes live tensors hold
    resv  = torch.cuda.memory_reserved()                # bytes the pool backs
    print(f"allocated={alloc>>20}MiB reserved={resv>>20}MiB gap={(resv-alloc)>>20}MiB")
    
     • Gap small, reserved near card capacity, allocated near reserved -> true OOM: the working set does not fit. Go to step 3.
     • Gap large (reserved holds memory that allocated is not using) and the allocation still failed -> fragmentation. Go to step 4.
  3. True OOM: shrink the working set, in this order (least to most invasive). Relaunch after each and re-measure peak with torch.cuda.max_memory_reserved():
       1. Reduce per-GPU batch (micro-batch) and add gradient accumulation to hold the global batch constant. This is the cheapest lever and changes math the least.
       2. Reduce or bucket sequence length if inputs are long; activation memory scales with sequence length.
       3. Enable activation checkpointing to trade compute for memory, recomputing activations in backward instead of storing them (performance optimization):
            from torch.utils.checkpoint import checkpoint
            out = checkpoint(block, x, use_reentrant=False)   # pass use_reentrant explicitly

       4. Increase the sharding degree so optimizer state, gradients, and parameters are partitioned across more ranks, via FSDP or ZeRO-2/3 (FSDP, DeepSpeed ZeRO). This is the structural fix when a single GPU cannot hold one shard.
  4. Fragmentation: fix the allocator, do not cut batch. Set the allocator config at process start (before CUDA init) so the pool stops splitting into unusable blocks. expandable_segments:True is the first lever; it lets segments grow rather than stranding fixed blocks (the allocator page):
    # Primary name in recent PyTorch; PYTORCH_CUDA_ALLOC_CONF is the back-compat alias.
    # Older releases may read only PYTORCH_CUDA_ALLOC_CONF; check against the pinned version.
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    # If a few large allocations dominate, cap splitting instead/as well:
    #   export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:256
    # For latency-sensitive reclaim of cached blocks under pressure:
    #   export PYTORCH_ALLOC_CONF=garbage_collection_threshold:0.8,expandable_segments:True

    Do not sprinkle torch.cuda.empty_cache() in the training loop as a fix; it returns cached blocks to the driver and forces re-cudaMalloc, slowing the step without addressing fragmentation.
  5. Relaunch from the last checkpoint with the corrected template. Never resume into the same OOM; confirm the lever from step 3/4 is in the launch env, not just the interactive shell (the checkpoint-recovery runbook).
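
The micro-batch lever only "changes math the least" if the global batch is actually held constant. A minimal sketch of that arithmetic, assuming pure data parallelism where global batch = micro-batch x accumulation steps x data-parallel ranks; `accumulation_steps` is a hypothetical helper, not part of any launcher.

```python
def accumulation_steps(global_batch: int, micro_batch: int, data_parallel_ranks: int) -> int:
    """Gradient-accumulation steps that keep the global batch unchanged.

    Assumes global_batch = micro_batch * steps * data_parallel_ranks.
    Raises if the new micro-batch cannot reproduce the global batch exactly.
    """
    if min(global_batch, micro_batch, data_parallel_ranks) < 1:
        raise ValueError("all arguments must be positive")
    samples_per_pass = micro_batch * data_parallel_ranks
    if global_batch % samples_per_pass != 0:
        raise ValueError(
            f"global batch {global_batch} is not a multiple of "
            f"micro_batch * ranks = {samples_per_pass}"
        )
    return global_batch // samples_per_pass


# Halving the micro-batch doubles the accumulation steps.
print(accumulation_steps(global_batch=512, micro_batch=16, data_parallel_ranks=8))
print(accumulation_steps(global_batch=512, micro_batch=8, data_parallel_ranks=8))
```

With a mean-reduced loss, the loop typically divides each micro-batch loss by the step count so the accumulated gradient matches the full-batch gradient; check how your trainer handles this before changing the template.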

Verification

  • The relaunched job runs past the step where it previously OOMed and continues to a checkpoint. That is the single proof that the lever worked, not just that step 0 survived.
  • Peak headroom exists. On the previously-failing rank, peak reserved sits below card capacity with margin:
    import torch
    print(f"peak_reserved={torch.cuda.max_memory_reserved()>>20}MiB")
    print(f"free_total={[x>>20 for x in torch.cuda.mem_get_info()]}MiB")  # (free, total)
    
  • For a fragmentation fix, torch.cuda.memory_summary() after warmup shows the reserved - allocated gap no longer growing across steps (allocator reached steady state).
  • No co-tenant or leaked process on the node: nvidia-smi --query-compute-apps=pid,used_memory --format=csv lists only this job's ranks.
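
The headroom and steady-state checks above can be turned into pass/fail gates over logged counters. This is a sketch under assumptions: `has_headroom` and `gap_is_steady` are hypothetical helpers, and the 10% margin and 256 MiB growth tolerance are placeholder thresholds to tune per job. Samples are (allocated, reserved) byte pairs logged after successive steps.

```python
def has_headroom(peak_reserved: int, total: int, margin: float = 0.10) -> bool:
    """True if peak reserved memory leaves at least `margin` of the card free."""
    return peak_reserved <= total * (1.0 - margin)


def gap_is_steady(samples, warmup_steps: int = 1, tolerance_bytes: int = 256 << 20) -> bool:
    """True if the reserved - allocated gap stops growing after warmup.

    samples: sequence of (allocated_bytes, reserved_bytes), one per logged step.
    The gap may fluctuate; it fails only if it rises more than
    `tolerance_bytes` above the first post-warmup gap.
    """
    post_warmup = samples[warmup_steps:]
    if len(post_warmup) < 2:
        raise ValueError("need at least two post-warmup samples")
    gaps = [reserved - allocated for allocated, reserved in post_warmup]
    return max(gaps) - gaps[0] <= tolerance_bytes


MiB, GiB = 1 << 20, 1 << 30
steady = [(1000 * MiB, 1500 * MiB), (40000 * MiB, 42000 * MiB),
          (40100 * MiB, 42050 * MiB), (39900 * MiB, 42000 * MiB)]
growing = [(1000 * MiB, 1500 * MiB), (40000 * MiB, 42000 * MiB),
           (40000 * MiB, 43000 * MiB), (40000 * MiB, 45000 * MiB)]
print(has_headroom(70 * GiB, 80 * GiB), gap_is_steady(steady), gap_is_steady(growing))
```

Feed these from max_memory_reserved(), mem_get_info(), and per-step memory_allocated() / memory_reserved() readings on the previously failing rank.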

Rollback

An OOM is a job failure, not a config rollout. Recovery is forward, but the levers are reversible:

  • Revert allocator env if expandable_segments:True regresses throughput or interacts badly with a captured CUDA-graph pool: drop back to the default allocator config and instead reduce the working set (step 3).
  • Revert sharding-degree change if raising it added communication that breaches the step-time/MFU target (the MFU-regression runbook); prefer activation checkpointing at the original degree.
  • Restart from the last good checkpoint in all cases; never resume into the OOM (the checkpoint-recovery runbook).
  • If the OOM traced to a faulted/ECC-degraded GPU, drain the node and resume at reduced world-size or after replacement (the GPU-fault runbook, the ECC-toggle runbook).
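
Applying or reverting the allocator lever is easiest to audit when the launch environment is built in one place. A minimal sketch, assuming the launcher accepts an environment mapping; `relaunch_env` is a hypothetical helper, and the variable name defaults to PYTORCH_ALLOC_CONF as used in this runbook (pass the alias instead if your pinned PyTorch only reads that one).

```python
from typing import Mapping, Optional


def relaunch_env(
    base_env: Mapping[str, str],
    allocator_conf: Optional[Mapping[str, object]],
    var_name: str = "PYTORCH_ALLOC_CONF",
) -> dict:
    """Return a copy of base_env with the allocator config set or cleared.

    Both spellings of the variable are removed first so a stale value cannot
    leak in from the interactive shell. Passing None (or an empty mapping)
    reverts to the default allocator config.
    """
    env = dict(base_env)
    for name in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF"):
        env.pop(name, None)
    if allocator_conf:
        env[var_name] = ",".join(f"{key}:{value}" for key, value in allocator_conf.items())
    return env


shell_env = {"PATH": "/usr/bin", "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128"}
tuned = relaunch_env(shell_env, {"expandable_segments": True, "max_split_size_mb": 256})
reverted = relaunch_env(tuned, None)
print(tuned["PYTORCH_ALLOC_CONF"])
print("PYTORCH_ALLOC_CONF" in reverted)
```

The returned mapping can be handed to whatever starts the ranks, so the rollback is a one-argument change rather than an edit to a live shell.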

References

  • PyTorch CUDA semantics — caching allocator, PYTORCH_ALLOC_CONF / PYTORCH_CUDA_ALLOC_CONF, expandable_segments, max_split_size_mb, garbage_collection_threshold: https://docs.pytorch.org/docs/stable/notes/cuda.html
  • PyTorch CUDA memory APIs — memory_summary, memory_allocated, memory_reserved, max_memory_reserved, mem_get_info, empty_cache: https://docs.pytorch.org/docs/stable/cuda.html
  • PyTorch activation checkpointing — torch.utils.checkpoint.checkpoint, use_reentrant: https://docs.pytorch.org/docs/stable/checkpoint.html
  • PyTorch FSDP (fully sharded data parallel): https://docs.pytorch.org/docs/stable/fsdp.html
  • DeepSpeed ZeRO (sharded optimizer/gradient/parameter state): https://www.deepspeed.ai/tutorials/zero/
  • Slurm scancel (signal/cancel a job): https://slurm.schedmd.com/scancel.html
  • NVIDIA nvidia-smi (compute-apps, ECC, row-remapper queries): https://docs.nvidia.com/deploy/nvidia-smi/index.html

Related: Allocator Tuning · GPU Memory Hierarchy · FSDP · DeepSpeed ZeRO · NCCL Hang · Checkpoint Recovery · Operational Runbooks · Glossary