Markdown

CUDA stream-ordered memory allocator¶

Scope: cudaMallocAsync/cudaFreeAsync and the per-device memory pool behind them: stream-ordered allocation that reuses freed blocks without a device-wide sync, and the release-threshold and cudaMemPoolTrimTo knobs that trade footprint against fragmentation. For the surrounding stream/event mechanics see CUDA Streams and Concurrency; for pools captured into a graph see CUDA Graphs: Capture, Replay, and Launch Overhead.

What it is¶

cudaMallocAsync(void** ptr, size_t size, cudaStream_t stream) and cudaFreeAsync(void* ptr, cudaStream_t stream) are the stream-ordered allocation APIs introduced in CUDA 11.2 (CUDA C++ Programming Guide, Stream Ordered Memory Allocator; NVIDIA Developer Blog, Part 1). Unlike cudaMalloc/cudaFree, the allocation and free are operations placed in a stream and ordered relative to the other work in that stream, not synchronous, device-wide calls.

The allocation request "records the allocation request in the same CUDA stream that will use it ... It will not block the other streams" (Fregly, Ch. 11). A cudaFreeAsync(ptr, stream) does not return the memory immediately; the free is ordered after the prior work in stream, and only once that work completes does the block become reusable. Because the free waits for only that stream, "there is no expensive global cudaDeviceSynchronize and no implicit synchronization with other streams" (Fregly, Ch. 6).

Both APIs draw from a memory pool (cudaMemPool_t). Every device has a default pool returned by cudaDeviceGetDefaultMemPool(cudaMemPool_t* memPool, int device) (CUDA Runtime API); plain cudaMallocAsync allocates from the calling device's current pool, and cudaMallocFromPoolAsync(void** ptr, size_t size, cudaMemPool_t pool, cudaStream_t stream) targets an explicit pool. The pool "recycles freed memory buffers and avoids repeated OS calls to allocate new memory" (Fregly, Ch. 6); a freed block is handed back to the next same-stream allocation instead of being unmapped and re-mapped through the driver.

flowchart LR
    subgraph stream1 ["Stream 1 (issue order)"]
        A1["cudaMallocAsync(d, sz)"] --> K1["kernel(d)"] --> F1["cudaFreeAsync(d)"]
    end
    F1 -. "block recycled" .-> POOL["Per-device memory pool"]
    POOL -. "satisfies next alloc" .-> A1
    POOL --> OS["OS / driver: cuMemMap on miss, release when > threshold"]

sequenceDiagram
    participant H as "Host thread"
    participant S0 as "Stream 0 (attention, batch N)"
    participant S2 as "Stream 2 (alloc, batch N+2)"
    H->>S2: "cudaMallocAsync(scratch, sz, S2)"
    Note over S0: "keeps running, never stalled"
    H->>S0: "attentionKernel<<<...,S0>>>"
    Note over S2: "alloc enqueued in S2 only"

Why it matters¶

Legacy cudaMalloc "is a blocking, device-wide operation that synchronizes the device before returning. This can stall work in other streams since every allocation forces the entire GPU to stall until the memory is reserved. This pauses all streams, limits parallelism, and destroys your workload's performance" (Fregly, Ch. 11). It also "involve[s] OS-level calls like mmap/ioctl," incurring "kernel-space context switches and driver overhead" (Fregly, Ch. 6). In a pipeline where one stream runs attention on batch N while another prepares batch N+1, a single blocking cudaMalloc on the second stream stalls every SM until the allocator finishes, wiping out the overlap you built.

Three concrete wins over cudaMalloc/cudaFree:

No global serialization. Each allocate/free synchronizes only within its own stream, so memory management "never serializes streams that are feeding those kernels" (Fregly, Ch. 11). The other streams "continue launching kernels, copying data, or doing whatever they were doing," even while stream 1's allocation is in flight.
Lower per-call overhead at high churn. Pool reuse eliminates the driver/OS round-trip on the hot path. "When your code issues thousands (or millions) of allocate/free cycles," the pool reduces "fragmentation and smoothing out latency spikes" (Fregly, Ch. 6).
Right-sized scratch without over-provisioning. Variable-length LLM batches need per-batch scratch (attention KV, activations). "Without stream-ordered allocation, you'd either have to allocate all the memory upfront (increasing memory footprint) or incur heavy synchronization penalties" (Fregly, Ch. 11). With cudaMallocAsync you reserve "exactly enough space ... and not a single byte more."

The mechanism that makes reuse cheap is the release threshold. By default it is 0, meaning "all unused memory in the pool is released back to the OS during every synchronization operation" (NVIDIA Developer Blog, Part 1). In an iterative loop that defeats the point: each iteration re-maps memory through the driver. Raising the threshold lets the pool retain freed backing memory across syncs, so the next iteration's allocations are satisfied from the pool: "fewer OS calls and better performance by avoiding repetitive memory allocations and de-allocations" (Fregly, Ch. 11).

When it is needed (and when not)¶

Use the stream-ordered allocator when:

Your pipeline allocates scratch per mini-batch, "common in LLM training and inference" (Fregly, Ch. 11). Variable sequence lengths mean batch N+1 (1,024 tokens) needs a larger buffer than batch N (512 tokens), so reusing one fixed allocation is insufficient.
You run a growing KV cache in autoregressive decoding, where each token appends KV pairs and you periodically extend the scratch region in the same stream that runs the attention kernel (Fregly, Ch. 11).
You issue frequent, fine-grained allocations in a long-running loop, where the per-call cudaMalloc sync and OS overhead dominate.

Prefer plain cudaMalloc/cudaFree when:

The buffer is a one-time, long-lived allocation done at startup. "For simple, one-time buffers, a blocking cudaMalloc and cudaFree may suffice" (Fregly, Ch. 6). A pool buys nothing when you allocate once and hold it for the program's lifetime.
You are memory-capacity constrained and cannot afford the pool retaining freed bytes, though here you tune the threshold or call cudaMemPoolTrimTo rather than abandoning the allocator.

Framework caveat: in PyTorch the native caching allocator is already stream-aware and is the default. Switching it to the CUDA backend (PYTORCH_ALLOC_CONF=backend:cudaMallocAsync) is a deliberate choice, not a default win; measure before adopting it. See Frameworks.

How: implement, integrate, maintain¶

1. Raise the release threshold before the hot loop¶

Set cudaMemPoolAttrReleaseThreshold so the pool keeps freed memory resident instead of unmapping it on every sync. The attribute is "the amount of reserved memory in bytes to hold onto before trying to release memory back to the OS" (CUDA Runtime API). UINT64_MAX keeps everything resident for the pool's lifetime.

#include <cuda_runtime.h>
#include <cstdint>

cudaMemPool_t pool;
int device = -1;
cudaGetDevice(&device);
cudaDeviceGetDefaultMemPool(&pool, device);

// Keep freed backing memory in the pool across syncs (default threshold is 0,
// which releases all unused pool memory back to the OS on every sync).
uint64_t threshold = UINT64_MAX;
cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

The book writes cudaDeviceGetDefaultMemPool(&pool, device) with device as an int, which matches the runtime API signature cudaDeviceGetDefaultMemPool(cudaMemPool_t*, int). cudaMemPoolSetAttribute's value argument is a pointer to a uint64_t for threshold/footprint attributes.

2. Allocate, use, and free per stream in the loop¶

Each stream carries its own allocate → H2D → kernel → D2H → free sequence. The frees are stream-ordered, so block N becomes reusable once stream work through that free completes, without a device sync.

const int NUM_STREAMS = 2, BATCHES = 8;
cudaStream_t s[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);

for (int b = 0; b < BATCHES; ++b) {
    cudaStream_t st = s[b % NUM_STREAMS];
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMallocAsync(&dA, bytes, st);      // stream-ordered: no global stall
    cudaMallocAsync(&dB, bytes, st);
    cudaMallocAsync(&dC, bytes, st);
    cudaMemcpyAsync(dA, hA + b * elems, bytes, cudaMemcpyHostToDevice, st);
    cudaMemcpyAsync(dB, hB + b * elems, bytes, cudaMemcpyHostToDevice, st);
    computeKernel<<<grid, block, 0, st>>>(dA, dB, dC);
    cudaMemcpyAsync(hC + b * elems, dC, bytes, cudaMemcpyDeviceToHost, st);
    cudaFreeAsync(dA, st);                // deferred until st's prior work completes
    cudaFreeAsync(dB, st);
    cudaFreeAsync(dC, st);
}
for (int i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamSynchronize(s[i]);
    cudaStreamDestroy(s[i]);
}

Use non-blocking streams (cudaStreamNonBlocking) so the allocator's stream-ordered guarantee is not undone by an implicit stream-0 barrier. See CUDA Streams and Concurrency. Allocate into the same stream that will consume the buffer; cross-stream reuse needs an explicit ordering dependency (next section).

3. Understand cross-stream reuse and its safety¶

A block freed on stream A can be reused by an allocation on stream B only if the runtime can prove the free has completed. Two pool policies govern this (CUDA C++ Programming Guide):

cudaMemPoolReuseFollowEventDependencies: "Allow cudaMallocAsync to use memory asynchronously freed in another stream as long as a stream ordering dependency exists" (an event/stream-wait edge from the freeing stream to the allocating stream).
cudaMemPoolReuseAllowOpportunistic: "Allow reuse of already completed frees when there is no dependency," i.e. the runtime observes the free has actually finished on the device.

These are why correctness does not depend on you tracking liveness manually: the allocator only hands a freed block to another stream once ordering or completion is established. Do not access a buffer after its cudaFreeAsync; the free is ordered, and the memory may be reissued to another allocation.

Interop note: allocations are not locked to their allocating API. "Memory allocated with cudaMallocAsync can be freed with cudaFree()" and vice versa (CUDA C++ Programming Guide), but freeing an async allocation with the synchronous cudaFree reintroduces a device sync, so keep to cudaFreeAsync on hot paths.

4. Reclaim footprint deliberately¶

A high release threshold trades memory for speed; reclaim explicitly at phase boundaries with cudaMemPoolTrimTo(cudaMemPool_t pool, size_t minBytesToKeep), which "releases memory back to the OS until the pool contains fewer than minBytesToKeep reserved bytes" (CUDA Runtime API). The minBytesToKeep argument lets you "hold onto a specified amount of memory, for example the amount it expects to need in a subsequent phase" (CUDA C++ Programming Guide).

// Between training and a lower-memory eval phase: keep ~512 MiB, return the rest.
cudaMemPoolTrimTo(pool, /*minBytesToKeep=*/512ull * 1024 * 1024);

cudaMallocAsync "does not stall other streams," and cudaMemPoolTrimTo lets you "proactively return memory" to "balance total GPU memory footprint against fragmentation" (Fregly, Ch. 6). Trim is most useful when another GPU API (a separate process, or a graphics API such as Vulkan/DirectX) needs the freed memory, since the driver will not otherwise hand pool memory to a foreign client.

5. Observe and validate¶

Read pool counters to confirm the threshold is actually retaining memory and to watch for runaway reserved bytes. The high-watermark attributes track peaks since the last reset.

uint64_t reserved = 0, used = 0;
cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
// reserved >= used; (reserved - used) is held-but-idle pool memory.

cudaMemPoolAttrReservedMemCurrent is the backing memory currently mapped; cudaMemPoolAttrUsedMemCurrent is what the application currently holds; cudaMemPoolAttrReservedMemHigh/cudaMemPoolAttrUsedMemHigh are the high-water marks (CUDA Runtime API). For overlap and allocation behaviour on the timeline, profile with Nsight Systems; allocations should not appear as device-wide stalls. See Profiling GPUs: Nsight Systems and Nsight Compute. Multi-GPU note: a pool's allocations are accessible from a peer device only after cudaMemPoolSetAccess, and only if cudaDeviceCanAccessPeer reports the devices are peer-capable.

The code in this page is adapted from the cited book and NVIDIA documentation. It has not been compiled or hardware-tested here.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 11 "Inter-Kernel Pipelining, Synchronization, and CUDA Stream-Ordered Memory Allocations" — stream-ordered allocation vs blocking cudaMalloc, per-batch LLM scratch, KV-cache growth, cudaMemPoolSetAttribute/cudaMemPoolAttrReleaseThreshold. Ch. 6 "GPU Architecture, CUDA Programming, and Maximizing Occupancy" — memory pools, fragmentation, cudaMemPoolTrimTo, cudaMallocAsync/cudaFreeAsync basics.
NVIDIA, CUDA Runtime API — Stream Ordered Memory Allocator (cudaMallocAsync, cudaFreeAsync, cudaMallocFromPoolAsync, cudaDeviceGetDefaultMemPool, cudaMemPoolSetAttribute/GetAttribute, cudaMemPoolTrimTo, cudaMemPoolAttr enum): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html
NVIDIA, CUDA C++ Programming Guide — Stream Ordered Memory Allocator (release threshold semantics, reuse policies, cudaMemPoolSetAccess/cudaDeviceCanAccessPeer, cudaFree/cudaFreeAsync interop): https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/stream-ordered-memory-allocation.html
NVIDIA Developer Blog — Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1 (default threshold 0 releases on every sync; UINT64_MAX to persist; CUDA 11.2 introduction; trimming): https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/