
PyTorch CUDA caching allocator tuning

Scope: PyTorch's native CUDA caching allocator and how to tune it with PYTORCH_ALLOC_CONF (expandable_segments, max_split_size_mb, garbage_collection_threshold, roundup_power2_divisions, backend), diagnose fragmentation and OOM with torch.cuda.memory_summary/memory_stats/mem_get_info and the memory snapshot, and why most "CUDA out of memory" failures are fragmentation, not true exhaustion. For the underlying CUDA APIs see CUDA Stream-Ordered Memory Allocator; for graph-captured pools see CUDA Graphs: Capture, Replay, and Launch Overhead.

What it is

PyTorch does not call cudaMalloc/cudaFree per tensor. It interposes a caching allocator: "PyTorch uses a caching memory allocator for CUDA memory. By default, it adapts to allocation patterns by splitting and recycling GPU memory blocks on demand" (Fregly, Ch. 13). It works in two levels: segments are contiguous regions obtained from the driver (cudaMalloc, or a virtual mapping), and blocks are sub-regions within a segment that track individual tensor allocations. When a request arrives the allocator finds a suitable free block; if the block is bigger than needed it splits it: the front serves the allocation, the back becomes a new free block. On free, the allocator tries to merge a block with its immediate neighbours, but only if the neighbour is also free, and "blocks in different segments can never merge" (PyTorch DevLog, When does fragmentation occur in the CUDA caching allocator?).

Because the allocator retains freed blocks rather than returning them to the driver, torch.cuda.memory_reserved() (the pool's total backing memory) is normally larger than torch.cuda.memory_allocated() (bytes the live tensors actually hold). The gap is cached, reusable memory, not a leak.

You tune the allocator at process start through the environment variable PYTORCH_ALLOC_CONF. In the current PyTorch documentation this is the primary name; "PYTORCH_CUDA_ALLOC_CONF ... is its alias and is provided only for backward compatibility" (PyTorch, CUDA semantics). On older releases that predate the new name, use PYTORCH_CUDA_ALLOC_CONF. The format is comma-separated option:value pairs:

# Set BEFORE the process starts (it is read once at allocator init).
export PYTORCH_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
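
The option:value format can be sketched as a small parser. This is a minimal illustration only: parse_alloc_conf is a hypothetical helper, not a PyTorch API, and it checks syntax, not whether an option name is one PyTorch accepts. The one subtlety it shows is that commas inside a bracketed list value (as in roundup_power2_divisions) do not separate options.

```python
from __future__ import annotations


def parse_alloc_conf(conf: str) -> dict[str, str]:
    """Split 'opt:val,opt:val' into a dict, keeping [...] list values intact.

    Hypothetical helper for illustration; PyTorch parses this string itself.
    """
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(conf):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:  # only top-level commas separate options
            parts.append(conf[start:i])
            start = i + 1
    parts.append(conf[start:])

    options: dict[str, str] = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition(":")  # split on the FIRST colon only
        if not sep or not value:
            raise ValueError(f"expected option:value, got {part!r}")
        options[key.strip()] = value.strip()
    return options


print(parse_alloc_conf("expandable_segments:True,garbage_collection_threshold:0.8"))
```

Splitting on the first colon only is what keeps a value such as [256:1,512:2] in one piece.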

Why use it

The caching allocator exists because per-tensor cudaMalloc/cudaFree would be ruinous: these are slow driver calls that "involve OS-level calls", and cudaFree in particular implicitly synchronizes the device, stalling in-flight GPU work (see CUDA Stream-Ordered Memory Allocator). Caching and recycling blocks removes that round-trip from the hot path. But recycling has a cost: fragmentation.

Fragmentation "happens when GPU memory gets split into many noncontiguous free chunks over time. This makes it difficult to allocate a large tensor even if enough total memory remains free" (Fregly, Ch. 13). The DevLog states the failure mode precisely: fragmentation is when "even though there is 'technically' enough free space to store a requested allocation, the CUDA caching allocator is unable to actually serve the request" (PyTorch DevLog). The allocator cannot combine a 16 MiB free block in one segment with another 16 MiB free block in a different segment to serve a 32 MiB request, even though 32 MiB is free in aggregate.

This is why most "CUDA out of memory" aborts are not true exhaustion. The error usually means the allocator could not find a single contiguous free block of the requested size, and the driver could not supply a fresh segment either because the remaining memory is stranded as fragments inside segments the allocator already holds. It does not mean the device has no free bytes. PyTorch's own guidance is that max_split_size_mb "should be used as a last resort for a workload that is aborting due to 'out of memory'" (PyTorch, CUDA semantics), which is consistent with treating such aborts as a fragmentation problem.

The workloads that fragment worst are those with variable-sized allocations. "In MoE models, this is especially problematic because the number of tokens routed to each expert can change with every batch. As such, each expert's output activation tensor may be a different size on every iteration. Variable-sized memory allocations leave behind uneven, fragmented memory blocks" that "accumulate across training or inference runs" (Fregly, Ch. 13). LLM serving is the other classic case: every megabyte not pinned by weights or graph buffers is wanted for KV cache, so fragmentation directly costs you batch capacity (PyTorch DevLog).

When to use it (and when not)

The default native allocator is well tuned; reach for these knobs only with evidence. Order of intervention:

1. First, remove the variance at its source. If a buffer changes size every iteration, "allocate a fixed-size expert output buffer upfront and size it to the maximum possible number of tokens any expert may process in your batch. Then you can reuse this buffer on every iteration" (Fregly, Ch. 13). A constant-shaped buffer never fragments. This beats any allocator flag and is the recommended first move for MoE and other dynamic-shape models (worked in How to scale it).
2. Tune expandable_segments when fragmentation is driven by allocation ordering: small allocations interleaved with large ones, growing KV cache, changing batch sizes. It lets a segment grow in place instead of spawning unmergeable new segments.
3. Tune max_split_size_mb / garbage_collection_threshold as last-resort OOM mitigations when you cannot eliminate the variable shapes and the job is aborting.
4. Switch backend:cudaMallocAsync only deliberately. The native backend is the default and is stream-aware. The CUDA async backend "can help avoid synchronizations on memory free events — and can improve performance in multithreaded contexts like multiworker data loading" (Fregly, Ch. 13), but max_split_size_mb, roundup_power2_divisions and garbage_collection_threshold only apply to backend:native (PyTorch, CUDA semantics). Measure before adopting.
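
The effect of max_split_size_mb in the list above can be sketched with a toy model. This is one simplified reading of the documented rule ("prevents the native allocator from splitting blocks larger than this size"), not PyTorch's implementation: the function serve and its best-fit policy are assumptions made for illustration, and the real allocator has further rules (separate small and large pools, slack tolerances) that are not modelled here. Sizes are in MiB.

```python
from __future__ import annotations


def serve(free_blocks: list[int], request: int, max_split: int | None = None) -> bool:
    """Serve `request` from the smallest free block that fits (toy best fit).

    Assumed rule: a free block larger than `max_split` is protected. It is
    never split, so it is only handed out whole, and only to a request that is
    itself larger than `max_split`. Returns False when nothing can serve the
    request (the real allocator would then ask the driver for a new segment).
    """
    candidates = sorted(enumerate(free_blocks), key=lambda pair: pair[1])
    for index, size in candidates:
        if size < request:
            continue
        protected = max_split is not None and size > max_split
        if protected and request <= max_split:
            continue  # do not carve a small piece out of a protected block
        free_blocks.pop(index)
        if not protected and size > request:
            free_blocks.append(size - request)  # split: tail stays free
        return True
    return False


# Without a cap, four 64 MiB requests chip the 512 MiB block down to 256 MiB,
# so a later 512 MiB request no longer fits.
uncapped = [512]
assert all(serve(uncapped, 64) for _ in range(4))
assert uncapped == [256] and serve(uncapped, 512) is False

# With a 256 MiB cap, the small requests are refused from the big block and
# the 512 MiB request is still served whole.
capped = [512]
assert not any(serve(capped, 64, max_split=256) for _ in range(4))
assert serve(capped, 512, max_split=256) is True
```

The trade-off the model makes visible: the cap preserves a large contiguous block, but the small requests it turns away must be served from somewhere else, which is why the option is a last resort and not a default.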

Do not reach for torch.cuda.empty_cache() as a routine fix. "Frequent use of torch.cuda.empty_cache() can disrupt allocator efficiency and incur longer-term performance costs. Treat this call as a one-time safety mechanism ... and not a regular maintenance tool" (Fregly, Ch. 13). It returns cached blocks to the driver, so the next allocations pay the cudaMalloc cost again.

Caveat: expandable_segments is still marked experimental in the PyTorch docs (default False), and it interacts with VMM-based allocators such as NCCL's ncclMemAlloc; validate on your stack. With graph capture, expandable segments do not lift CUDA Graphs' rule against new allocations inside the capture region. See CUDA Graphs: Capture, Replay, and Launch Overhead.

Architecture

The allocator is a two-level structure. A segment is one contiguous reservation from the driver; blocks are the sub-regions a segment is carved into as tensors are allocated and freed. Every allocation is served from a single block inside a single segment, and this is the whole story of fragmentation: a block can only ever merge with a free neighbour in the same segment, so free space stranded across two segments can never combine to serve one request. Both behaviours below belong to backend:native. By default (expandable_segments:False) a miss opens a brand new segment via cudaMalloc; with expandable_segments:True the segment is a virtual reservation that grows in place, keeping one contiguous address range in which adjacent free blocks can always merge.

flowchart TD
    REQ["Tensor alloc request (size S)"] --> FIND{"Free block >= S<br/>in a segment?"}
    FIND -- "yes, exact-ish" --> USE["Serve block"]
    FIND -- "yes, larger" --> SPLIT["Split: front serves S,<br/>back becomes free block"]
    FIND -- "no" --> GROW{"expandable_segments?"}
    GROW -- "yes" --> MAP["Map more physical pages<br/>into the existing segment"]
    GROW -- "no" --> CUMALLOC["cudaMalloc a new segment<br/>(cannot merge with old ones)"]
    MAP --> USE
    CUMALLOC --> USE
    SPLIT --> USE

The split and merge rule is small enough to model exactly. This standard-library model asserts the split, the exact-fit boundary, and the failure case the diagram warns about: 16 MiB plus 16 MiB free across two segments cannot serve 32 MiB, while the same bytes in one expandable segment can. The final loop checks the allocator against a slow reference over 5000 random layouts.

"""Executable model of the PyTorch CUDA caching allocator split/merge rule.

Validates the core invariant: the allocator serves a request from a single
contiguous free block, so blocks in different segments never combine, and only
an expandable (single-segment) layout lets neighbours merge. Stdlib only.
"""
from __future__ import annotations

import random


def alloc(segments, size):
    """First-fit within one segment. Split the block when it is larger."""
    for blocks in segments:
        for i, blk in enumerate(blocks):
            if blk["free"] and blk["size"] >= size:
                remainder = blk["size"] - size
                blk["size"], blk["free"] = size, False
                if remainder:
                    blocks.insert(i + 1, {"size": remainder, "free": True})
                return True
    return False  # no single free block fits; native backend would cudaMalloc a new segment


def free(blocks, idx):
    """Free a block, then coalesce with free neighbours in the SAME segment."""
    blocks[idx]["free"] = True
    if idx + 1 < len(blocks) and blocks[idx + 1]["free"]:
        blocks[idx]["size"] += blocks.pop(idx + 1)["size"]
    if idx > 0 and blocks[idx - 1]["free"]:
        blocks[idx - 1]["size"] += blocks.pop(idx)["size"]


def total_free(segments):
    return sum(b["size"] for blocks in segments for b in blocks if b["free"])


def largest_free_block(segments):
    sizes = [b["size"] for blocks in segments for b in blocks if b["free"]]
    return max(sizes, default=0)


# 1. Happy path: split a 16 MiB free block to serve 10, leaving a 6 MiB tail.
seg = [[{"size": 16, "free": True}]]
assert alloc(seg, 10) is True
assert total_free(seg) == 6 and largest_free_block(seg) == 6

# 2. Boundary: an exact-fit request consumes the block with no remainder.
assert alloc(seg, 6) is True
assert total_free(seg) == 0

# 3. Failure case (fragmentation): 16 + 16 free across two segments cannot serve
#    32, even though 32 is free in aggregate. Blocks never merge across segments.
two = [[{"size": 16, "free": True}], [{"size": 16, "free": True}]]
assert total_free(two) == 32
assert largest_free_block(two) == 16
assert alloc(two, 32) is False

# 4. expandable_segments equivalence: the same 32 MiB in ONE segment coalesces on
#    free and DOES serve the 32 MiB request the two-segment layout could not.
one = [[{"size": 16, "free": False}, {"size": 16, "free": False}]]
free(one[0], 0)
free(one[0], 1)  # neighbours merge into a single 32 MiB free block
assert largest_free_block(one) == 32
assert alloc(one, 32) is True

# 5. Equivalence to a slow reference over random layouts: first-fit succeeds for a
#    request iff the largest single free block can hold it.
rng = random.Random(0)
saw_success = saw_failure = False
for _ in range(5000):
    segs = [
        [{"size": rng.randint(1, 8), "free": rng.random() < 0.6} for _ in range(rng.randint(1, 4))]
        for _ in range(rng.randint(1, 3))
    ]
    size = rng.randint(1, 8)
    ref = size <= largest_free_block(segs)  # slow reference: max free block
    got = alloc(segs, size)  # first-fit allocator
    assert got == ref
    saw_success |= got
    saw_failure |= not got
assert saw_success and saw_failure  # both branches exercised, not vacuous

print("architecture allocator model: all asserts passed")

Code convention for this page: numpy and standard-library blocks like the one above are self-contained and were executed locally with their assertions passing. The torch and CUDA blocks below are reference templates adapted from the cited book and PyTorch documentation; they require a CUDA GPU and are not run here. Each torch block is paired with a runnable model of the core math it illustrates.

How to use it: set the allocator config before the process starts

PYTORCH_ALLOC_CONF is read once when the allocator initializes. Set it in the environment, not after import torch.

# Fragmentation-leaning workload (dynamic shapes, KV cache, changing batch size):
export PYTORCH_ALLOC_CONF="expandable_segments:True"

# Last-resort OOM mitigation on the native backend:
export PYTORCH_ALLOC_CONF="max_split_size_mb:512,garbage_collection_threshold:0.8"

What each documented option does (PyTorch, CUDA semantics):

expandable_segments:True (experimental, default False). Backs each segment with a CUDA virtual-memory reservation and maps physical pages on demand, so "because everything is in one contiguous virtual address range, blocks within the segment can always merge with their neighbors" (PyTorch DevLog). This is the most direct lever against fragmentation from changing batch sizes. Limit: it does not help when long-lived allocations are interleaved with short-lived ones in the same pool: "free blocks on either side of a live block can't merge across it" (PyTorch DevLog).
max_split_size_mb:<N> "Prevents the native allocator from splitting blocks larger than this size (in MB)" (PyTorch, CUDA semantics). Default is unlimited (no cap). Keeping large free blocks intact instead of slicing them into tiny pieces preserves contiguous space for large layers. The book frames the same effect: it "instructs the allocator to keep large free blocks intact ... rather than continually splitting them into tiny pieces" (Fregly, Ch. 13). Native backend only; last-resort OOM knob.
garbage_collection_threshold:<f> "Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming GPU memory blocks if the GPU memory capacity usage exceeds the threshold." Settable values are greater than 0.0 and less than 1.0; the documented default of 1.0 sits outside that range, which in effect leaves proactive reclamation off until you set a value. Native backend only (PyTorch, CUDA semantics). It reclaims cached-but-idle blocks proactively, before the allocator is pushed into an expensive release-everything retry or a hard OOM, trading a little throughput.
roundup_power2_divisions:[N:M,...] Rounds allocation sizes up within each power-of-two range so similar requests hit the same bucket and reuse the same blocks. The book's example roundup_power2_divisions:[256:1,512:2,1024:4,>:8] divides the 512-1024 MB range into 2 buckets so "a 600 MB request rounds up to 768 MB"; the book adds "Check allocator logs to confirm actual bucketing in your environment" (Fregly, Ch. 13). Native backend only.
backend:native|cudaMallocAsync "The default is native." cudaMallocAsync requires CUDA 11.4+ (PyTorch, CUDA semantics) and routes allocation through the CUDA stream-ordered allocator (see CUDA Stream-Ordered Memory Allocator). The book's combined example is valid but mixes a native-only option with the async backend; in practice choose one backend and tune accordingly:

# Book example (Fregly, Ch. 13). Note: max_split_size_mb and roundup_power2_divisions
# apply to backend:native only; they are ignored under cudaMallocAsync.
export PYTORCH_ALLOC_CONF="max_split_size_mb:256,roundup_power2_divisions:[256:1,512:2,1024:4,>:8],backend:cudaMallocAsync"
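
The rounding arithmetic behind the book's roundup_power2_divisions example can be sketched as follows. This models only the arithmetic the example implies (equal steps between two neighbouring powers of two); round_up_within_power2 is a hypothetical name, sizes are in MB as in the book's example, and the real allocator's edge cases (very small sizes, how a division count of 1 is treated) are not reproduced here.

```python
from __future__ import annotations


def round_up_within_power2(size_mb: int, divisions: int) -> int:
    """Round size_mb up to the next of `divisions` equal steps between the
    enclosing powers of two. Exact powers of two are returned unchanged."""
    if size_mb < 1 or divisions < 1:
        raise ValueError("size_mb and divisions must be at least 1")
    floor = 1 << (size_mb.bit_length() - 1)  # largest power of two <= size_mb
    if size_mb == floor:
        return size_mb
    step = max(floor // divisions, 1)  # bucket width inside [floor, 2*floor)
    steps_needed = -(-(size_mb - floor) // step)  # ceiling division
    return min(floor + steps_needed * step, 2 * floor)


# The book's case: the 512-1024 MB range split into 2 buckets (512, 768, 1024).
assert round_up_within_power2(600, 2) == 768
# With 4 buckets the same request would land on a tighter boundary.
assert round_up_within_power2(600, 4) == 640
print(round_up_within_power2(600, 2), round_up_within_power2(600, 4))
```

More divisions mean less padding per allocation but more distinct block sizes; fewer divisions waste more bytes per tensor but make near-equal requests reuse the same cached block.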

How to integrate it: graph capture and the allocator backend

torch.cuda.empty_cache() returns all cached, unallocated blocks to the driver. The one legitimate routine use is immediately before CUDA Graph capture: "you can invoke torch.cuda.empty_cache() immediately before entering the capture block. This will clear unused cached memory and give the allocator the best chance to lay out your prereserved buffers without interruption" (Fregly, Ch. 13). Outside that, calling it per iteration "can disrupt allocator efficiency and incur longer-term performance costs" (Fregly, Ch. 13), because every cleared block must be re-acquired from the driver on the next allocation.

Reference template (requires a CUDA GPU and torch; not executed here):

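import torch

g = torch.cuda.CUDAGraph()
capture_stream = torch.cuda.Stream()

# Legitimate: release cached, unused memory once, right before graph capture.
torch.cuda.empty_cache()
with torch.cuda.graph(g, stream=capture_stream):
    ...  # the work to capture goes here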

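The core of that call is that it releases only free memory and never touches live blocks. The model below simplifies by treating every block as its own segment, so reserved bytes fall to exactly the live bytes. On a real device without expandable_segments, a free block that shares a segment with a live block stays reserved, so reserved falls toward the live total rather than onto it. The simplified invariant is checkable without a GPU:

"""empty_cache() returns cached free blocks to the driver.

Simplified model: each block is treated as its own segment, so after
empty_cache the reserved bytes fall to exactly the live (allocated) bytes.
Only FREE blocks are returned; live blocks are never touched. It is a
no-op when the pool holds no free blocks. Stdlib only.
"""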
from __future__ import annotations


def reserved(blocks):
    return sum(b["size"] for b in blocks)


def allocated(blocks):
    return sum(b["size"] for b in blocks if not b["free"])


def empty_cache(blocks):
    """Drop every FREE block; keep live blocks untouched (returns a new list)."""
    return [b for b in blocks if not b["free"]]


pool = [
    {"size": 100, "free": False},
    {"size": 40, "free": True},
    {"size": 60, "free": False},
    {"size": 25, "free": True},
]
before_res, live = reserved(pool), allocated(pool)
after = empty_cache(pool)

# reserved falls to exactly the live bytes; no live block was dropped.
assert allocated(after) == live == 160
assert reserved(after) == live  # 160: the 65 cached bytes were returned
assert reserved(after) < before_res  # 160 < 225
assert all(not b["free"] for b in after)

# No-op edge: with nothing free, empty_cache changes nothing.
full = [{"size": 100, "free": False}, {"size": 60, "free": False}]
assert empty_cache(full) == full and reserved(empty_cache(full)) == 160

# Adversarial: a live block must never be released.
live_ids = {id(b) for b in pool if not b["free"]}
assert {id(b) for b in after} == live_ids

print("empty_cache model: all asserts passed")
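
The model above treats every free block as independently releasable. A slightly closer sketch, assuming the native allocator without expandable_segments, releases memory at segment granularity: a segment goes back to the driver only when none of its blocks is live. The function name and the sizes below are illustrative, not taken from PyTorch.

```python
def reserved(segments):
    """Bytes held from the driver: every block in every retained segment."""
    return sum(b["size"] for seg in segments for b in seg)


def allocated(segments):
    """Bytes held by live blocks only."""
    return sum(b["size"] for seg in segments for b in seg if not b["free"])


def empty_cache_by_segment(segments):
    """Keep any segment that still holds a live block; drop wholly free ones."""
    return [seg for seg in segments if any(not b["free"] for b in seg)]


segments = [
    [{"size": 100, "free": False}, {"size": 40, "free": True}],  # mixed: kept whole
    [{"size": 60, "free": True}],                                # wholly free: released
    [{"size": 25, "free": True}, {"size": 60, "free": False}],   # mixed: kept whole
]
after = empty_cache_by_segment(segments)

assert reserved(segments) == 285 and allocated(segments) == 160
assert allocated(after) == 160  # live blocks untouched
assert reserved(after) == 225   # only the 60-unit free segment was returned
assert reserved(after) - allocated(after) == 65  # free space stranded beside live blocks
print("segment-granularity empty_cache model: all asserts passed")
```

The 65 units that remain reserved are exactly the free blocks pinned in place by a live neighbour, which is the same stranded space that shows up as fragmentation elsewhere on this page.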

Two integration constraints follow the allocator's structure. First, switching to backend:cudaMallocAsync moves allocation onto the CUDA stream-ordered allocator, so the native-only tuning options (max_split_size_mb, roundup_power2_divisions, garbage_collection_threshold) stop applying; keep one backend per process. Second, expandable_segments uses the CUDA virtual memory manager, so it interacts with other VMM clients such as NCCL's ncclMemAlloc, and it does not lift the CUDA Graphs rule against new allocations inside a capture region. Pre-allocate before capture and validate the combination on your stack. See CUDA Graphs: Capture, Replay, and Launch Overhead.
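
The first constraint lends itself to a pre-launch check. The sketch below is a hypothetical lint, not a PyTorch facility: it takes already-parsed options as a plain dict and reports which native-only options would stop applying under backend:cudaMallocAsync. The option names are the ones listed on this page.

```python
from __future__ import annotations

# Options this page describes as applying to backend:native only.
NATIVE_ONLY = (
    "max_split_size_mb",
    "roundup_power2_divisions",
    "garbage_collection_threshold",
)


def native_only_conflicts(options: dict[str, str]) -> list[str]:
    """Return native-only options that are set alongside backend:cudaMallocAsync."""
    if options.get("backend", "native") != "cudaMallocAsync":
        return []  # native is the default backend, so nothing conflicts
    return [name for name in NATIVE_ONLY if name in options]


mixed = {
    "max_split_size_mb": "256",
    "roundup_power2_divisions": "[256:1,512:2,1024:4,>:8]",
    "backend": "cudaMallocAsync",
}
print(native_only_conflicts(mixed))
```

Running a check like this in the launcher or CI turns a silently ignored option into a visible warning before the job starts.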

How to run it in production

Set PYTORCH_ALLOC_CONF where the process is launched (the container ENV, the systemd unit, or the scheduler job spec), not from inside the script, because the allocator reads it once at init and ignores later changes. Pick conservative defaults and change them only on evidence: the native backend with no flags is the right starting point, expandable_segments:True for dynamic-shape and KV-cache-heavy services, and garbage_collection_threshold (for example 0.8) as a proactive guard that reclaims cached-but-idle blocks before a long-running service hits a hard OOM.
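
When the launcher is itself a Python program, setting the variable at the launch site can be sketched as below. This is an illustration under assumptions: launch_with_alloc_conf is a hypothetical helper, and the child command here only echoes the variable to show it is present in the child's environment from its first instruction; a real deployment would start the training or serving entry point instead.

```python
import os
import subprocess
import sys


def launch_with_alloc_conf(argv, alloc_conf):
    """Run `argv` as a child process with PYTORCH_ALLOC_CONF set in its environment."""
    env = dict(os.environ)  # copy, so the parent's environment is left unchanged
    env["PYTORCH_ALLOC_CONF"] = alloc_conf
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in child: prints the value it sees. Replace with your real entry point.
child = [sys.executable, "-c", "import os; print(os.environ['PYTORCH_ALLOC_CONF'])"]
result = launch_with_alloc_conf(child, "expandable_segments:True")
print(result.stdout.strip())
```

Because the value is fixed in the child's environment before any Python code runs there, the read-once-at-init behaviour cannot be missed by import order inside the script.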

Leave the memory-history recorder on for triage. Per the PyTorch documentation, Python trace collection costs about 2 us per trace, cheap enough to leave on in production, so that when an OOM does occur you can dump a snapshot and open it offline instead of reproducing the failure (PyTorch, Understanding CUDA Memory Usage). Watch two counters continuously, both available from torch.cuda.memory_stats(): the reserved - allocated gap and num_alloc_retries. Keep empty_cache() out of the steady-state loop; it belongs only at one-time boundaries such as pre-graph-capture.

How to maintain it: diagnose fragmentation and read snapshots

Two cheap counters catch fragmentation before it aborts a run. Monitor torch.cuda.memory_stats() "over long runs to make sure your memory footprint stays stable and doesn't explode in size" (Fregly, Ch. 13), and watch torch.cuda.mem_get_info(): "if free memory drops while the number of allocated tensors stays constant, fragmentation is increasing" (Fregly, Ch. 13).

Reference template (requires a CUDA GPU and torch; not executed here):

import torch

free, total = torch.cuda.mem_get_info()          # device free vs total (bytes)
allocated = torch.cuda.memory_allocated()         # bytes held by live tensors
reserved  = torch.cuda.memory_reserved()          # bytes the allocator holds from the driver

# reserved - allocated is cached, reusable memory. A large and GROWING gap,
# or reserved climbing while allocated is flat, signals fragmentation.
print(f"free={free>>20} MiB  total={total>>20} MiB")
print(f"allocated={allocated>>20} MiB  reserved={reserved>>20} MiB")

stats = torch.cuda.memory_stats()
# Key fragmentation indicators (see the stats reference for the full set):
print("active_split_bytes:", stats.get("active_split_bytes.all.current"))
print("inactive_split_bytes:", stats.get("inactive_split_bytes.all.current"))
print("num_alloc_retries:", stats.get("num_alloc_retries"))  # nonzero => allocator
                                                              # had to retry after a sync

A rising num_alloc_retries is a strong warning sign: each increment means a cudaMalloc call failed, the allocator flushed its cache, and retried. That is memory pressure, and when allocated bytes sit well below device capacity it points to fragmentation. For a human-readable rollup, torch.cuda.memory_summary() prints allocated/reserved/active/inactive split sizes and per-size-class counts in one table, useful when triaging an OOM in logs.
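
The counters above reduce to a few ratios that are easy to log per step. The sketch below works on a plain dict shaped like the output of torch.cuda.memory_stats(), so it runs without a GPU; fragmentation_report is a hypothetical helper, the sample values are invented, and no alert thresholds are implied by the sources.

```python
from __future__ import annotations


def fragmentation_report(stats: dict) -> dict:
    """Summarise cached and split memory as fractions of reserved bytes."""
    reserved = stats["reserved_bytes.all.current"]
    allocated = stats["allocated_bytes.all.current"]
    inactive_split = stats.get("inactive_split_bytes.all.current", 0)
    cached = reserved - allocated  # held by the allocator, not by live tensors
    return {
        "cached_bytes": cached,
        "cached_fraction": cached / reserved if reserved else 0.0,
        "inactive_split_fraction": inactive_split / reserved if reserved else 0.0,
        "num_alloc_retries": stats.get("num_alloc_retries", 0),
    }


GIB = 1 << 30
sample_stats = {  # invented numbers, same key names as torch.cuda.memory_stats()
    "reserved_bytes.all.current": 8 * GIB,
    "allocated_bytes.all.current": 6 * GIB,
    "inactive_split_bytes.all.current": 1 * GIB,
    "num_alloc_retries": 3,
}
report = fragmentation_report(sample_stats)
assert report["cached_fraction"] == 0.25
assert report["inactive_split_fraction"] == 0.125
print(report)
```

On a GPU host the same function can be fed torch.cuda.memory_stats() directly; tracking the fractions over time matters more than any single reading.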

The detection rule above (reserved climbing while allocated is flat, equivalently device-free dropping while live tensors are constant) is exact and testable. This model asserts the healthy trace stays quiet, the fragmenting trace fires on every step, the reserved >= allocated invariant holds, and a corrupted sample that violates it is rejected:

"""Fragmentation detector over the counters torch.cuda exposes.

reserved >= allocated always, the gap is cached reusable memory, and
fragmentation is rising when reserved climbs while allocated stays flat
(equivalently, device-free drops while live tensors are constant). numpy only.
"""
from __future__ import annotations

import numpy as np


def fragmentation_rising(allocated, reserved, atol=0):
    """Boolean mask over steps where reserved grew but allocated did not."""
    allocated = np.asarray(allocated, dtype=np.int64)
    reserved = np.asarray(reserved, dtype=np.int64)
    assert allocated.shape == reserved.shape
    assert np.all(reserved >= allocated), "invariant violated: reserved < allocated"
    d_alloc = np.diff(allocated)
    d_res = np.diff(reserved)
    return (d_res > 0) & (np.abs(d_alloc) <= atol)


# Healthy trace: reserved grows only when allocated grows (real growth, not frag).
alloc_ok = [100, 200, 300, 300, 300]
res_ok = [128, 256, 384, 384, 384]
assert not fragmentation_rising(alloc_ok, res_ok).any()

# Fragmenting trace: allocated flat at 300 while reserved keeps climbing.
alloc_bad = [300, 300, 300, 300]
res_bad = [320, 384, 448, 512]
mask = fragmentation_rising(alloc_bad, res_bad)
assert mask.all() and int(mask.sum()) == 3

# The gap is reusable cache, never negative (invariant across the whole trace).
gap = np.asarray(res_bad) - np.asarray(alloc_bad)
assert np.all(gap >= 0) and gap.tolist() == [20, 84, 148, 212]

# Adversarial: a corrupted sample with reserved < allocated must be rejected.
try:
    fragmentation_rising([100, 200], [128, 150])  # 150 < 200
    raise SystemExit("detector accepted an impossible reserved<allocated sample")
except AssertionError:
    pass

print("fragmentation detector: all asserts passed")

For the full allocation history (which op allocated what, and where the gaps are), record a snapshot and open it in the offline visualizer. The documented workflow is "enable memory history, run the code to be observed, and then save a file with a pickled snapshot" (PyTorch, Understanding CUDA Memory Usage):

Reference template (requires a CUDA GPU and torch; not executed here):

import torch

# 1. Start recording allocation stack traces. Python trace collection is ~2us
#    per trace, cheap enough to leave on for production triage.
torch.cuda.memory._record_memory_history()

run_your_training_or_inference_step()

# 2. Dump a pickled snapshot of the allocator state + history.
torch.cuda.memory._dump_snapshot("my_snapshot.pickle")

# 3. Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)

Open my_snapshot.pickle at pytorch.org/memory_viz. "The visualizer is a javascript application that runs locally on your computer. It does not upload any snapshot data" (PyTorch, Understanding CUDA Memory Usage). Its Active Memory Timeline shows live tensors over time with stack traces; the Allocator State History shows individual allocator events. Fragmentation appears as free gaps wedged between live blocks that the allocator cannot coalesce. The book also notes "NVIDIA's Nsight System's CUDA Memory Inspector can help visualize how memory fragmentation happens over time" (Fregly, Ch. 13). See Profiling GPUs: Nsight Systems and Nsight Compute.

Reading a snapshot comes down to one measurement: the largest request the allocator can still serve is the longest contiguous free run in a single segment, not the sum of free bytes. This model reconstructs that from a captured live/free layout, checks it against a brute-force reference plus 2000 random fuzz cases, and shows the fragmentation the visualizer would draw (10 free cells, but nothing larger than 3 is allocatable):

"""Read a captured allocator snapshot: largest allocatable = longest free run.

A snapshot shows each segment as a layout of live/free memory. The largest
request the allocator can serve from it is the longest CONTIGUOUS free run in a
single segment, not the sum of free bytes across the snapshot. This is why the
visualizer shows free gaps wedged between live blocks. numpy only.
"""
from __future__ import annotations

import numpy as np


def longest_free_run(segment):
    """Longest run of free cells (0=free, 1=live)."""
    best = run = 0
    for cell in np.asarray(segment, dtype=np.int8):
        run = run + 1 if cell == 0 else 0
        best = max(best, run)
    return best


def largest_allocatable(snapshot):
    return max((longest_free_run(s) for s in snapshot), default=0)


def brute_force_longest_run(segment):
    """Slow reference: check every [i, j] window explicitly."""
    seg = list(segment)
    best = 0
    for i in range(len(seg)):
        for j in range(i, len(seg)):
            if all(c == 0 for c in seg[i : j + 1]):
                best = max(best, j - i + 1)
    return best


# Snapshot: two segments, 0=free 1=live. Free bytes are scattered by live blocks.
snapshot = [
    [0, 0, 1, 0, 0, 0],  # segment 0: free runs of 2 and 3
    [1, 1, 0, 0, 1, 0, 0, 0],  # segment 1: free runs of 2 and 3
]
total_free = int(np.sum(np.concatenate(snapshot) == 0))  # 10 free cells
biggest = largest_allocatable(snapshot)  # longest single run

assert biggest == 3
assert total_free == 10
# Fragmentation is visible: 10 free cells, but no request > 3 can be served.
assert total_free > biggest

# Equivalence to the slow reference on every segment, plus random fuzzing.
rng = np.random.default_rng(0)
for seg in snapshot:
    assert longest_free_run(seg) == brute_force_longest_run(seg)
for _ in range(2000):
    seg = rng.integers(0, 2, size=int(rng.integers(1, 12))).tolist()
    assert longest_free_run(seg) == brute_force_longest_run(seg)

# Boundary cases: all free, all live.
assert longest_free_run([0, 0, 0, 0]) == 4
assert longest_free_run([1, 1, 1]) == 0

print("snapshot largest-allocatable: all asserts passed")
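
The same measurement can be taken from a snapshot-shaped dictionary. The field names used below ("segments", "blocks", "size", "state", and the "inactive" state) follow the structure documented for torch.cuda.memory._snapshot; that API is private, so treat the names as an assumption to verify against your PyTorch version. The sample data is hand-built and mirrors the toy layout above, with sizes in MiB-sized units.

```python
from __future__ import annotations


def free_space_summary(snapshot: dict) -> dict:
    """Total inactive bytes versus the largest single inactive block."""
    inactive = [
        blk["size"]
        for seg in snapshot.get("segments", [])
        for blk in seg.get("blocks", [])
        if blk.get("state") == "inactive"
    ]
    return {"total_inactive": sum(inactive), "largest_inactive": max(inactive, default=0)}


MIB = 1 << 20
# Hand-built stand-in. For a real file, load it first, for example:
#   import pickle
#   with open("my_snapshot.pickle", "rb") as f:
#       snapshot = pickle.load(f)
sample = {
    "segments": [
        {"blocks": [
            {"size": 2 * MIB, "state": "inactive"},
            {"size": 1 * MIB, "state": "active_allocated"},
            {"size": 3 * MIB, "state": "inactive"},
        ]},
        {"blocks": [
            {"size": 2 * MIB, "state": "active_allocated"},
            {"size": 2 * MIB, "state": "inactive"},
            {"size": 1 * MIB, "state": "active_allocated"},
            {"size": 3 * MIB, "state": "inactive"},
        ]},
    ]
}
summary = free_space_summary(sample)
assert summary == {"total_inactive": 10 * MIB, "largest_inactive": 3 * MIB}
print(summary)
```

A wide gap between total_inactive and largest_inactive is the numeric form of the free gaps the visualizer draws between live blocks.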

After changing a flag, confirm it helped rather than assuming it did. Re-run with the snapshot recorder on, compare reserved - allocated and num_alloc_retries before and after, and verify the OOM no longer fires at your target batch size. Because the source-level fix (fixed-size buffers) and the allocator flags address the same fragmentation, prefer the source fix where the shapes allow it and keep the flags as a safety net.

How to scale it: remove the variance at its source

The first scaling move is the source-level fix from When to use it: size one buffer to the maximum shape any iteration needs and reuse it every step, so the shape never changes and the allocator never fragments. For a growing KV cache or a changing batch size that you cannot pin to a fixed shape, expandable_segments:True is the scaling lever, since it lets one virtual segment grow in place rather than spawning unmergeable new segments as the workload scales up.

The payoff of the fixed-size buffer is concrete: variable-sized allocations strand memory in per-size blocks that never merge, so their peak reserved footprint climbs with the number of distinct shapes, while the fixed buffer holds a constant reservation. This model runs both strategies over a grow-then-shrink shape sequence and asserts the fixed buffer is optimal (a lower bound every scheme obeys) while the variable path over-reserves:

"""Why a fixed-size reused buffer beats variable-size allocations at scale.

Size one buffer to the maximum and reuse it every iteration. This models both
strategies and shows the variable-size path strands memory in unmergeable
per-size blocks, so its peak reserved footprint exceeds the fixed buffer's.
Stdlib only.
"""
from __future__ import annotations


def fixed_peak(sizes):
    """One buffer sized to the max, reused every iteration."""
    return max(sizes)  # constant reservation, zero fragmentation


def variable_peak(sizes):
    """Per-iteration size; freed blocks are cached but never merge across sizes.

    Reuse a cached free block only if it is large enough (first-fit >= size);
    otherwise open a new block (a new segment). Returns (peak_reserved, n_blocks).
    """
    blocks = []  # [size, free] cached blocks, none ever merge
    for s in sizes:
        hit = next((b for b in blocks if b[1] and b[0] >= s), None)
        if hit is None:
            blocks.append([s, False])  # new segment, cannot reuse a smaller one
        else:
            hit[1] = False  # reuse a big-enough cached block
        for b in blocks:  # free everything back to the cache
            b[1] = True
    return sum(b[0] for b in blocks), len(blocks)


sizes = [1, 2, 3, 4, 5, 4, 3, 2, 1]  # grow then shrink (MoE-like variance)

fp = fixed_peak(sizes)
vp, nblocks = variable_peak(sizes)

assert fp == 5  # hold only the largest shape
assert vp == 15 and nblocks == 5  # 1+2+3+4+5 stranded in 5 unmergeable blocks
assert vp > fp  # variable path over-reserves through fragmentation

# The fixed buffer is optimal: you must hold at least max(sizes) live at the peak,
# so fixed_peak is a lower bound every correct scheme obeys.
assert fp == max(sizes)
assert vp >= fp

# Boundary: a constant-shape workload never fragments, both schemes agree.
const = [4, 4, 4, 4]
cvp, cblocks = variable_peak(const)
assert fixed_peak(const) == 4 and cvp == 4 and cblocks == 1

print("fixed vs variable buffer model: all asserts passed")
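
In application code the fixed-size pattern amounts to allocating once and handing out views. The sketch below uses numpy so it runs without a GPU; the names MAX_TOKENS, HIDDEN and expert_output are hypothetical, and the fill is a stand-in for real compute. The analogous PyTorch pattern would be a preallocated tensor sliced per step, which is an assumption here and is not shown.

```python
import numpy as np

MAX_TOKENS, HIDDEN = 8, 4  # illustrative sizes: worst-case tokens per expert, feature width
buffer = np.empty((MAX_TOKENS, HIDDEN), dtype=np.float32)  # allocated exactly once


def expert_output(n_tokens: int) -> np.ndarray:
    """Return a view of the first n_tokens rows; no new allocation per call."""
    if not 0 <= n_tokens <= MAX_TOKENS:
        raise ValueError(f"n_tokens must be in [0, {MAX_TOKENS}], got {n_tokens}")
    view = buffer[:n_tokens]
    view[...] = n_tokens  # stand-in for the real computation, written in place
    return view


for n in [3, 8, 1, 5]:  # token count varies per iteration, the storage does not
    out = expert_output(n)
    assert out.shape == (n, HIDDEN)
    assert np.shares_memory(out, buffer)  # a view into the one buffer, not a copy

print("fixed-size buffer reuse: all asserts passed")
```

The cost is that the buffer always occupies its worst-case size, so the approach pays off when the maximum is known and not far above the typical shape.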

Failure modes

| Symptom | Root cause | Fix |
| --- | --- | --- |
| "CUDA out of memory" while `mem_get_info` still reports free bytes | Fragmentation: no single contiguous free block of the requested size, and blocks cannot merge across segments | Fixed-size buffers first, then `expandable_segments:True`, then `max_split_size_mb` as a last resort |
| `PYTORCH_ALLOC_CONF` has no effect | Set after the allocator initialized (for example, exported from inside the script after `import torch`) | Set it in the environment before the process starts; it is read once at init |
| `max_split_size_mb`, `roundup_power2_divisions`, or `garbage_collection_threshold` silently ignored | Running under `backend:cudaMallocAsync`, where these native-only options do not apply | Use `backend:native`, or drop the options and tune the async backend on its own terms |
| `expandable_segments` barely reduces fragmentation | Long-lived allocations interleaved with short-lived ones: "free blocks on either side of a live block can't merge across it" | Separate the lifetimes, or pin the recurring shape to a fixed-size buffer |
| Throughput regresses after adding `empty_cache()` per iteration | Every cleared block must be re-acquired from the driver on the next allocation | Reserve `empty_cache()` for one-time boundaries such as pre-graph-capture |
| `expandable_segments` conflicts with NCCL or breaks under graph capture | Interaction with VMM allocators like `ncclMemAlloc`; expandable segments do not lift the capture-region no-allocation rule | It is experimental (default `False`); pre-allocate before capture and validate on your stack |
| `roundup_power2_divisions` bucketing is not what you expected | Actual bucketing depends on the environment | "Check allocator logs to confirm actual bucketing in your environment" (Fregly, Ch. 13) |

References

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 13 "Profiling, Tuning, and Scaling PyTorch" — the caching allocator, fragmentation from variable-sized (MoE) allocations, fixed-size buffer strategy, PYTORCH_ALLOC_CONF options (max_split_size_mb, roundup_power2_divisions, backend:cudaMallocAsync), torch.cuda.memory_stats()/mem_get_info() for fragmentation tracking, and empty_cache() as a one-time pre-graph-capture step.
PyTorch, CUDA semantics — Memory management: PYTORCH_ALLOC_CONF (and the PYTORCH_CUDA_ALLOC_CONF backward-compatible alias), backend (native default, cudaMallocAsync needs CUDA 11.4+), max_split_size_mb, garbage_collection_threshold, roundup_power2_divisions, expandable_segments. https://docs.pytorch.org/docs/stable/notes/cuda.html
PyTorch, Understanding CUDA Memory Usage — torch.cuda.memory._record_memory_history, _dump_snapshot, and the offline visualizer at pytorch.org/memory_viz. https://docs.pytorch.org/docs/stable/torch_cuda_memory.html
PyTorch DevLog, When does fragmentation occur in the CUDA caching allocator? — segments vs blocks, split/merge rules, "blocks in different segments can never merge," and how expandable_segments lets blocks merge within one virtual segment. https://docs.pytorch.org/devlogs/eager/2026-06-01-cuda-caching-allocator/