Markdown

CUDA streams and concurrency¶

Scope: CUDA streams as the unit of inter-kernel concurrency: the legacy default stream (stream 0) versus the per-thread default stream (PTDS) versus explicit non-blocking streams; overlapping copy with compute via cudaMemcpyAsync on pinned host memory and the copy engines; and cudaEvent_t for fine-grained cross-stream synchronization. Where work overlaps within a single kernel instead, see Shared Memory, Bank Conflicts, and Tiling and Kernel Fusion; for replaying a captured stream as a graph see CUDA Graphs: Capture, Replay, and Launch Overhead.

flowchart LR
    subgraph Serial["Default stream (stream 0): serial, GPU idles during copies"]
        direction LR
        S1["H2D copy"] --> S2["compute kernel"] --> S3["D2H copy"]
    end
    subgraph Overlap["Non-blocking streams + pinned memory: copy overlaps compute"]
        direction LR
        A1["stream 1: H2D copy (chunk N+1)"]
        A2["stream 0: compute kernel (chunk N)"]
        A3["stream 2: D2H copy (chunk N-1)"]
    end
    Serial -->|"explicit streams<br/>cudaMemcpyAsync"| Overlap
    Overlap --> Win["Higher throughput:<br/>transfer latency hidden behind compute"]

What it is¶

A CUDA stream is an ordered queue of operations: kernel launches, memory copies, and (with the stream-ordered allocator) allocations. They execute in issue order within the stream but run concurrently across streams when hardware resources permit. A stream is the GPU's mechanism for keeping its distinct engines busy in parallel: the streaming multiprocessors (SMs) run compute while the dedicated direct-memory-access (DMA) copy engines move data host-to-device (H2D) and device-to-host (D2H). Because the SM compute pipeline runs independently of the copy engines, a kernel in one stream can fully overlap with a transfer in another.

Three stream identities behave differently, and conflating them is the most common concurrency bug:

Legacy default stream (stream 0, cudaStreamLegacy). Operations issued without an explicit stream land here. It serializes its own commands and is a device-wide barrier: any work in stream 0 waits for all prior work in every other stream, and any work in a non-default stream waits for prior work in stream 0. One stray copy into stream 0 stalls the whole GPU.
Per-thread default stream (PTDS, cudaStreamPerThread). With PTDS enabled, each host thread gets its own private default stream that serializes only its own commands and does not synchronize with other threads' default streams. It removes the host-wide barrier for multithreaded apps. Note: a PTDS stream still synchronizes with the legacy default stream if you mix them in one process.
Explicit non-blocking streams. Streams created with cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) are independent queues that only synchronize when you insert an explicit dependency (an event wait). The cudaStreamNonBlocking flag means the stream "should perform no implicit synchronization with stream 0" (CUDA Runtime API, Stream Management).

Events (cudaEvent_t) are lightweight markers a stream records at a point in its queue; another stream (or the host) waits on that marker. They give fine-grained ordering between streams without the device-wide stall of cudaDeviceSynchronize() or the full-queue drain of cudaStreamSynchronize().

Why it matters¶

Without streams, every kernel and copy serializes on the default stream and the GPU idles whenever it transfers data. The fix is a software pipeline: while stream 0 computes batch N, stream 1 issues cudaMemcpyAsync() to copy batch N+1 H2D, and on a device with two copy engines stream 2 writes batch N−1's results D2H, a three-way overlap that hides transfer latency behind compute and vice versa. On modern GPUs you can run up to 128 concurrently executing kernels per device (the resident-grid limit); beyond that, launches queue until a slot frees.

Overlap is not free and not guaranteed. Two requirements are load-bearing:

Pinned (page-locked) host memory. cudaMemcpyAsync() is truly asynchronous only from pinned memory. If you pass a pageable host pointer, the runtime performs a hidden host-side staging copy into pinned memory that blocks the calling host thread and the enqueuing stream until staging completes, defeating overlap for that transfer. The DMA engines can only DMA directly out of page-locked buffers.
No accidental stream-0 traffic. Because stream 0 is a global barrier under the legacy model, a single library call or driver API that defaults to stream 0 serializes everything. This is why frameworks scrupulously avoid it.

This is exactly how deep-learning frameworks reach high utilization. PyTorch runs cuDNN/cuBLAS work and NCCL collectives on dedicated non-default streams (NCCL on a high-priority stream) so gradient all-reduce overlaps with backward compute; events record when gradients are ready and the communication stream waits on that event. See Distributed Training Platform and Performance Optimization and Tuning.

When it is needed (and when not)¶

Reach for explicit streams + events when:

Your workload is batched and bottlenecked on host↔device transfer: streaming successive mini-batches through H2D / compute / D2H overlap is the canonical win for training and inference pipelines.
You have independent kernels that fit together on the SMs (combined registers, shared memory, and block counts within per-SM limits) and would otherwise run back-to-back.
You need producer→consumer ordering across streams (a compute stream feeding a communication stream): use an event, not a host-side sync.
You run multiple host threads each issuing GPU work: enable PTDS to drop the host-wide barrier.

Do not add streams when:

Compute or memory bandwidth is already saturated. If a kernel maxes SM throughput or a copy saturates HBM bandwidth, overlapping a second operation just splits the same resource: you see two operations each at ~50%, not a speedup. Profile GPU utilization to find the saturation point. See Roofline Model and Arithmetic Intensity.
Tiles are too small. Tiny chunks underutilize the copy engines and SMs and pay launch overhead per chunk; oversized chunks or too many simultaneous launches exhaust the kernel-slot or per-SM resource limits and stall. Tune chunk size against your device's copy-engine count and occupancy.
The bottleneck is on-device tile reuse, not host↔device traffic: that calls for Shared Memory, Bank Conflicts, and Tiling / thread-block clusters, not streams.

Note: the marginal SM-utilization gain from layering warp-specialization or thread-block clusters on top of a well-tuned two-stream pipeline rarely justifies the engineering cost once you already saturate HBM bandwidth. Most real LLM workloads do fine with double-buffered kernels plus two or three streams.

How: implement, integrate, maintain¶

1. Create explicit non-blocking streams (never use stream 0 for hot paths)¶

cudaStreamNonBlocking is required so the stream does not implicitly synchronize with the legacy default stream; without it you reintroduce the hidden barrier.

cudaStream_t streamA, streamB;
cudaStreamCreateWithFlags(&streamA, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&streamB, cudaStreamNonBlocking);

myKernel<<<grid, block, 0, streamA>>>(/* ... */);                       // streamA
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, streamB);      // streamB
// streamA and streamB overlap freely; neither touches stream 0.

Always pass an explicit cudaStream_t to <<<...>>>, cudaMemcpyAsync(), and cudaMallocAsync(). Many libraries (cuBLAS, Thrust) accept a stream argument; set it. Keep stream 0 for one-time setup/cleanup only.

2. Pin host memory and overlap copy with compute¶

Allocate host buffers with cudaMallocHost() (page-locked) so cudaMemcpyAsync() truly overlaps. On a device with two copy engines, three streams give H2D / compute / D2H concurrency.

float *hA, *hB, *hC;
cudaMallocHost(&hA, bytes);          // page-locked => DMA-able, async copies overlap
cudaMallocHost(&hB, bytes);
cudaMallocHost(&hC, bytes);

// stream1: H2D, compute, D2H -- all enqueued in order, async to the host
cudaMemcpyAsync(d_data1, hA, bytes, cudaMemcpyHostToDevice, stream1);
computeKernel<<<grid, block, 0, stream1>>>(d_data1, d_result1);
cudaMemcpyAsync(hC, d_result1, bytes, cudaMemcpyDeviceToHost, stream1);

// stream2: independent chunk runs concurrently with stream1
cudaMemcpyAsync(d_data2, hB, bytes, cudaMemcpyHostToDevice, stream2);
computeKernel<<<grid, block, 0, stream2>>>(d_data2, d_result2);
cudaMemcpyAsync(hC + n, d_result2, bytes, cudaMemcpyDeviceToHost, stream2);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);

Query how many copy engines the device has to plan the depth of overlap:

cudaDeviceProp prop{};
cudaGetDeviceProperties(&prop, /*device=*/0);
// asyncEngineCount == 2 => H2D and D2H can run concurrently with compute.
int copy_engines = prop.asyncEngineCount;

3. A round-robin batch pipeline with the stream-ordered allocator¶

Pair streams with cudaMallocAsync() / cudaFreeAsync() so per-batch scratch allocation enqueues into the same stream and never forces a device-wide sync (legacy cudaMalloc/cudaFree synchronize the whole device and stall every stream). The allocator draws from a per-device pool; raise its release threshold to keep memory resident and cut OS round-trips.

cudaMemPool_t pool;
int device = -1;
cudaGetDevice(&device);
cudaDeviceGetDefaultMemPool(&pool, device);
uint64_t threshold = UINT64_MAX;                       // keep freed memory in the pool
cudaMemPoolSetAttribute(pool, cudaMemPoolAttrReleaseThreshold, &threshold);

const int NUM_STREAMS = 2, BATCHES = 8;
cudaStream_t s[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreateWithFlags(&s[i], cudaStreamNonBlocking);

for (int b = 0; b < BATCHES; ++b) {
    cudaStream_t st = s[b % NUM_STREAMS];
    float *dA, *dB, *dC;
    cudaMallocAsync(&dA, bytes, st);                   // stream-ordered: no global stall
    cudaMallocAsync(&dB, bytes, st);
    cudaMallocAsync(&dC, bytes, st);
    cudaMemcpyAsync(dA, hA + b * elems, bytes, cudaMemcpyHostToDevice, st);
    cudaMemcpyAsync(dB, hB + b * elems, bytes, cudaMemcpyHostToDevice, st);
    computeKernel<<<grid, block, 0, st>>>(dA, dB, dC);
    cudaMemcpyAsync(hC + b * elems, dC, bytes, cudaMemcpyDeviceToHost, st);
    cudaFreeAsync(dA, st);
    cudaFreeAsync(dB, st);
    cudaFreeAsync(dC, st);
}
for (int i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamSynchronize(s[i]);
    cudaStreamDestroy(s[i]);
}

Each stream carries its own allocate → H2D → kernel → D2H → free sequence; while stream 0 computes batch N, stream 1's H2D for batch N+1 runs on the copy engine. This matters for LLM pipelines that allocate per-batch scratch for attention KV or activations whose size varies with sequence length; cudaMallocAsync reserves exactly the needed bytes per batch without dragging the other streams to a stop.

4. Cross-stream ordering with events (not `cudaStreamSynchronize`)¶

For a producer→consumer hand-off, record an event in the producer stream and have the consumer stream wait on it. Only the consumer stream stalls; the host thread and all other streams keep running. Use cudaEventDisableTiming for sync-only events to lower overhead.

cudaEvent_t doneEvent;
cudaEventCreateWithFlags(&doneEvent, cudaEventDisableTiming);

producerKernel<<<grid, block, 0, producerStream>>>(/* ... */);
cudaEventRecord(doneEvent, producerStream);            // mark "data ready"

// consumerStream waits for the event, then proceeds; flags = 0
cudaStreamWaitEvent(consumerStream, doneEvent, 0);
consumerKernel<<<grid, block, 0, consumerStream>>>(/* ... */);

Signatures, per the CUDA Runtime API: cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags = 0) and cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0). Reserve cudaStreamSynchronize() for well-defined global points (e.g. end of a training epoch) and avoid cudaDeviceSynchronize() / cudaStreamSynchronize(0) on hot paths; both block far more than you intend.

To run a callback on the host when GPU work completes (e.g. recycle a host-side pool), use cudaLaunchHostFunc(). Do not call any CUDA device API from inside that callback: it runs on a runtime-managed host thread and calling back into the device can deadlock. Limit it to CPU-side work.

5. Enable PTDS for multithreaded hosts¶

If multiple CPU threads each issue GPU work, enable the per-thread default stream so their implicit default streams stop synchronizing with each other and with stream 0. Set it before any CUDA header is included:

# compile-time
nvcc --default-stream per-thread app.cu -o app

/* or define before including any CUDA header */
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1

PTDS removes the host-wide barrier; explicit non-blocking streams remove the per-operation barrier. Use both: nothing you enqueue can then accidentally collide on stream 0.

6. Frameworks (PyTorch)¶

You rarely write raw streams in PyTorch; it schedules cuDNN/cuBLAS and NCCL on its own non-default streams already. Two knobs matter for overlap:

Pinned-memory transfers. DataLoader(..., pin_memory=True) page-locks host batches; pair it with tensor.to(device, non_blocking=True) so the H2D copy overlaps with compute. Per the PyTorch CUDA semantics docs: "once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument."
Stream-ordered allocator. Select CUDA's cudaMallocAsync backend with PYTORCH_ALLOC_CONF=backend:cudaMallocAsync (the older PYTORCH_CUDA_ALLOC_CONF is a backward-compatibility alias; cudaMallocAsync requires CUDA 11.4+). PyTorch's default native caching allocator is already stream-aware and avoids device-wide sync except when it must call cudaMalloc for more memory.

PYTORCH_ALLOC_CONF=backend:cudaMallocAsync python train.py

7. Validate the overlap¶

Streams are easy to get wrong silently: a pageable buffer or a stray stream-0 op serializes everything with no error. Confirm overlap on the timeline with Nsight Systems: H2D, kernel, and D2H rows should visually overlap across streams, and stream 0 should be idle on the hot path.

nsys profile -o streams_report --stats=true ./app

See Profiling GPUs: Nsight Systems and Nsight Compute and Observability and Monitoring. Re-check after each change: confirm the timeline overlaps and wall-clock improved, since fixing one stall often just exposes the next bottleneck (HBM bandwidth or occupancy, see CUDA Occupancy Tuning).

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 11 "Inter-Kernel Pipelining, Synchronization, and CUDA Stream-Ordered Memory Allocations" — overlapping kernels with streams, copy/compute overlap, pinned memory, the stream-ordered allocator (cudaMallocAsync/cudaFreeAsync, cudaMemPoolAttrReleaseThreshold), legacy vs per-thread default streams, events for cross-stream sync, and cudaLaunchHostFunc callbacks. Code in this page is adapted from that chapter; it has not been hardware-tested here.
NVIDIA, CUDA Runtime API — Stream Management (cudaStreamCreateWithFlags, cudaStreamNonBlocking, cudaStreamWaitEvent, cudaStreamSynchronize): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
NVIDIA, CUDA Runtime API — Event Management (cudaEventCreateWithFlags, cudaEventDisableTiming, cudaEventRecord): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html
NVIDIA, CUDA Programming Guide — Asynchronous Concurrent Execution (streams, default-stream semantics, PTDS, overlap requirements, pinned memory): https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html
NVIDIA, CUDA Programming Guide — Stream-Ordered Memory Allocator (cudaMallocAsync, cudaFreeAsync, memory pools): https://docs.nvidia.com/cuda/cuda-programming-guide/02-basics/asynchronous-execution.html
PyTorch, CUDA semantics — pinned memory + non_blocking=True, DataLoader(pin_memory=True), and the PYTORCH_ALLOC_CONF=backend:cudaMallocAsync allocator backend: https://docs.pytorch.org/docs/stable/notes/cuda.html