Markdown

NVSHMEM: GPU-initiated communication¶

Scope: NVSHMEM's PGAS one-sided model (GPU threads issuing put/get directly from kernel code with the CPU off the critical path) for fine-grained compute/comm overlap inside a kernel, MoE all-to-all (DeepEP), and when GPU-initiated communication beats host-launched NCCL collectives.

Reference templates, drawn from Chris Fregly, AI Systems Performance Engineering (O'Reilly), the NVIDIA NVSHMEM API/environment documentation, and the DeepSeek deepseek-ai/DeepEP repository. Nothing here was executed on hardware. API names, env-var defaults, and transports change between NVSHMEM releases; pin against the exact NVSHMEM version in your container and validate one PE pair before trusting a fleet. Message-rate figures are vendor/paper measurements on specific NICs, not your ceiling.

What it is¶

NVSHMEM is NVIDIA's GPU-accelerated implementation of OpenSHMEM, a Partitioned Global Address Space (PGAS) model.¹⁷ Each GPU is a processing element (PE) (a process that is part of a parallel NVSHMEM application), and all PEs allocate from a symmetric heap: nvshmem_malloc returns a buffer at the same symmetric address on every PE.¹¹⁸ With PGAS, a GPU thread can read or write into another GPU's memory directly from device code, bypassing the CPU.¹ There is a global address space, but GPU caches are not globally coherent across GPUs (only the CPU–GPU path over NVLink-C2C is cache coherent), so software (NVSHMEM or NCCL) supplies the synchronization and ordering needed for correctness.²

The core operations are one-sided: the initiating PE drives the whole transfer, the target PE is passive and is not interrupted. The device-callable primitives the book demonstrates:¹¹⁸

nvshmem_float_p(dest, value, pe) / nvshmem_int_p(dest, value, pe): single-element put. NVIDIA: "a very low latency put capability for single elements of most basic types."¹⁸
nvshmem_TYPENAME_g(source, pe): single-element get (return value); nvshmem_TYPENAME_get(dest, source, nelems, pe) is the blocking bulk get.¹⁸
nvshmem_quiet(): completion/ordering, a prior put is considered complete only after a subsequent nvshmem_quiet.¹⁸
nvshmem_int_wait_until(ptr, NVSHMEM_CMP_EQ, val): point-to-point wait on a local symmetric variable.¹
nvshmem_barrier_all(): device- or host-side global barrier across all PEs.¹

The canonical send-and-signal idiom: the sender nvshmem_*_ps the payload into the receiver's symmetric buffer, calls nvshmem_quiet() to force completion, then nvshmem_*_ps a flag; the receiver spins on nvshmem_int_wait_until and reads the payload once the flag flips. The entire exchange runs on-device with no CPU intervention and no extra copies.¹

// Sender kernel (GPU/PE 0): put payload, order, then signal.
__global__ void sender_kernel(float *local_data, float *remote_data,
                              int *remote_flag, int dest_pe) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float value = local_data[idx];
    nvshmem_float_p(remote_data + 1, value, dest_pe);  // one-sided put into dest_pe's memory
    nvshmem_quiet();                                   // ensure the put completed before signalling
    nvshmem_int_p(remote_flag + 0, 1, dest_pe);        // set the completion flag on dest_pe
}

// Receiver kernel (GPU/PE 1): wait on the flag, then consume.
__global__ void receiver_kernel(float *recv_buffer, float *remote_data,
                                int *remote_flag) {
    nvshmem_int_wait_until(remote_flag + 0, NVSHMEM_CMP_EQ, 1);  // spin until signalled
    recv_buffer[0] = remote_data[1] * 2.0f;                     // payload now valid
}

Because nvshmem_*_p is per-element, when the destination is reached over InfiniBand it generates one RMA message per element: fine over NVLink, but a throughput trap for bulk transfers, where bulk put/get or nvshmemx block variants belong instead.¹⁸

Ordering note: fence/quiet/barrier issued on the GPU order only operations issued from the GPU; CPU-issued ordering covers only CPU-issued operations. Do not assume a host nvshmem_quiet flushes device-issued puts.¹⁸

Why it matters¶

Host-launched collectives put the CPU on the critical path of every exchange: kernel launch, host synchronize, then the next phase. NVSHMEM moves communication into the kernel, so a GPU emits put/get between compute instructions without ever returning to the host. Three consequences:

Fine-grained overlap inside a single kernel. A persistent kernel can compute, push a tile to a peer, and keep computing: overlap at warp granularity, not at the coarse stream/launch granularity DDP's bucketed all-reduce achieves.³ The book's worked example: a two-stage transformer (attention then MLP) where GPU 0 computes attention, NVSHMEM-puts activations to GPU 1 and signals, and GPU 1's persistent kernel picks up the MLP while GPU 0 already advances to the next batch. After a few iterations both GPUs run in tandem at near-100% utilization with no host stalls.³
Inter-node without the CPU, via IBGDA. Over InfiniBand, the IBGDA (InfiniBand GPUDirect Async) transport implements both the control plane and data plane of IB communication in the GPU, so the GPU drives the NIC directly from a CUDA kernel with no CPU reverse-proxy.²⁰ The message-rate difference is the whole point: NVIDIA reports the proxy-based IBRC transport caps near ~1.7 M put-ops/s regardless of CTAs/QPs, while IBGDA scales with CTAs toward the ~215 M-ops/s hardware limit of a ConnectX-6 NIC with eight CTAs.²⁰ This is the same GPU-initiated path NCCL exposes as IBGDA / "direct NIC," letting the GPU drive full-bandwidth RDMA without CPU intervention.¹²
Irregular, data-dependent communication. NVSHMEM shines when the communication pattern is not a fixed collective: graph algorithms, dynamic load balancing, discrete-event simulation, and device-side work-stealing via remote atomics (nvshmem_int_atomic_inc on a global counter lets each PE claim the next task index with no host coordination).³⁴ Static CUDA Graphs and bulk collectives cannot express this; an NVSHMEM kernel adapts on the fly.⁶

sequenceDiagram
    participant G0 as "GPU 0 / PE 0 (kernel)"
    participant Mem1 as "PE 1 symmetric heap"
    participant G1 as "GPU 1 / PE 1 (kernel)"
    G0->>Mem1: "nvshmem_float_p (payload)"
    G0->>G0: "nvshmem_quiet (force completion)"
    G0->>Mem1: "nvshmem_int_p (flag = 1)"
    G1->>Mem1: "nvshmem_int_wait_until (CMP_EQ 1)"
    Mem1-->>G1: "payload valid, consume"
    Note over G0,G1: "no CPU in the loop, no kernel relaunch"

When it is needed (and when not)¶

Use GPU-initiated NVSHMEM when:

MoE dispatch/combine all-to-all. Expert parallelism routes each token to a data-dependent subset of experts; the dispatch and combine all-to-all are small, dynamic, and latency-bound: exactly IBGDA's strength. DeepSeek's DeepEP is the canonical case: high-throughput and low-latency all-to-all GPU kernels for MoE dispatch and combine, with FP8 support.²² (DeepSeek-V3 routes each token through 1 shared + 8 of 256 experts, ~9 active experts per token.¹³)
Event-driven, fine-grained coordination (dynamic task queues, point-to-point signalling, persistent producer/consumer kernels) where keeping the host out of the loop is the win.³
Lowest-jitter model-parallel steps. Launch one cooperative kernel spanning all GPUs with nvshmemx_collective_launch() and use device-side nvshmem_barrier_all(); all kernels using NVSHMEM device-level sync or collectives must be launched this way so they run concurrently on every PE.⁵

Prefer host-launched NCCL instead when:

The workload is bulk symmetric collectives: all-reduce of gradients, all-gather, reduce-scatter, broadcast. NCCL is topology-aware, rings/trees saturate every NVLink/NVSwitch path, and PyTorch DDP already overlaps bucketed all-reduce on a background stream.⁸⁹ The book's rule: prefer NCCL/NVSHMEM for bulk collectives, NVSHMEM when fine-grained device-initiated control dominates.¹⁰
For one-to-one inference transfers (KV-cache movement) NVIDIA steers toward NIXL rather than either NCCL send/recv or hand-rolled NVSHMEM.¹¹ See disaggregated inference.

Do not reach for NVSHMEM to "speed up DDP": a well-tuned DDP all-reduce is already NCCL on a background stream (comms/compute overlap). NVSHMEM is GPU-level shared-memory programming: you own the races. Avoid over-synchronizing, because overusing nvshmem_barrier_all() stalls every GPU on the slowest peer; use fine-grained signals (nvshmem_signal_wait_until, nvshmemx_signal_op) when only a subset must coordinate.³⁷

How: implement, integrate, maintain¶

Install / link. NVSHMEM ships in the CUDA toolkit ecosystem / NGC containers. Device-side NVSHMEM requires relocatable device code: compile with nvcc -rdc=true and link -lnvshmem_device -lnvshmem_host (exact flags per release, check the install guide).²¹ IBGDA on InfiniBand needs Mellanox HCAs, MLNX_OFED 5.0+, and nvidia.ko >= 510.40.3 loaded with PeerMappingOverride=1.²⁰ The same GPUDirect RDMA prerequisite as NCCL applies: load the peer-memory kernel module and confirm it (lsmod | grep nvidia_peermem).¹⁵

Run-time configuration (official NVSHMEM environment variables; set explicitly, do not rely on defaults):¹⁹

# Enable GPU-initiated InfiniBand transport (default: false).
export NVSHMEM_IB_ENABLE_IBGDA=true

# Remote transport selection; default "ibrc" (proxy-based, CPU reverse-proxy).
# Options: ibrc, ucx, libfabric, ibdevx, gpunetio, none.
export NVSHMEM_REMOTE_TRANSPORT=ibdevx          # device-side IB path used with IBGDA

# Symmetric heap per PE; default 1073741824 (1 GiB). Suffixes k/m/g/t.
export NVSHMEM_SYMMETRIC_SIZE=4g

NVSHMEM_IB_ENABLE_IBGDA=true is what turns on the GPU driving the NIC directly; with the default ibrc/proxy path you are back to the ~1.7 M-ops/s ceiling.²⁰¹⁹ Size NVSHMEM_SYMMETRIC_SIZE to your largest symmetric allocation: every nvshmem_malloc draws from it.¹⁹

Integrate with CUDA Graphs. Kernels that call NVSHMEM can be captured inside a CUDA Graph like any other kernel; the NVSHMEM device operations happen inside the captured kernel and are not separate graph nodes. But a static graph fixes the communication script: NVSHMEM's value is letting kernels adapt at runtime, so capture it only when the pattern is actually static.⁶ See CUDA graphs and CUDA streams concurrency.

MoE / DeepEP integration. DeepEP implements device-initiated sparse all-to-all (dispatch/combine). Two cautions on version drift, and official docs win over the book on disagreement:

The book (2025) and NVIDIA/DeepEP V1 describe DeepEP built on NVSHMEM + IBGDA.¹⁴²² As of this writing the upstream deepseek-ai/DeepEP README states V2 switched from the NVSHMEM backend to a lighter NCCL Gin backend, with NVSHMEM retained only for legacy V1 methods.²² Check which backend your DeepEP build links before assuming an NVSHMEM dependency.
DeepEP's legacy kernels use an out-of-doc PTX load (ld.global.nc.L1::no_allocate.L2::256b) to stream all-to-all buffers through L2 without evicting L1; it is not in NVIDIA's PTX ISA and is gated by the build flag DISABLE_AGGRESSIVE_PTX_INSTRS=1 for when it breaks on a new architecture.¹⁴²² Treat it as fragile across GPU generations.

Validate / maintain. Confirm the data path is actually GPU-initiated, not silently on the proxy: with IBGDA active the GPU drives the NIC, so CPU stays near-idle during transfers; a CPU spike during communication means you fell back to a host-staged path (same failure signature as NCCL dropping to TCP).¹⁶ Manage synchronization deliberately: prefer point-to-point signals over global barriers, allocate all symmetric buffers before any capture, and re-pin every env var on NVSHMEM upgrades since defaults and available transports shift between releases.⁷¹⁹

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly), "Fine-Grained GPU-to-GPU Memory Sharing with NVSHMEM": each GPU is a PE in a PGAS; nvshmem_malloc symmetric buffers; send-and-signal pattern with nvshmem_float_p, nvshmem_quiet, nvshmem_int_p, nvshmem_int_wait_until(NVSHMEM_CMP_EQ), nvshmem_barrier_all; one-sided ops run entirely on-device over NVLink/PCIe with no CPU intervention. ↩↩↩↩↩↩
Fregly, AI Systems Performance Engineering, "Multi-GPU Programming": PGAS via NVSHMEM (NVIDIA's GPU-accelerated OpenSHMEM); GPU caches not globally coherent across GPUs, only the CPU–GPU NVLink-C2C path is cache coherent; NCCL and NVSHMEM provide the synchronization/ordering; remote atomics and one-sided ops over RDMA are provided by NVSHMEM, GPUDirect RDMA supplies the data path not the atomic APIs. ↩
Fregly, AI Systems Performance Engineering: NVSHMEM eliminates host staging and kernel-launch overhead, turning multistep comms into a single hardware transaction; two-stage transformer (attention + MLP) handoff example reaching near-100% utilization; shines for irregular/data-dependent workloads; avoid over-synchronizing with nvshmem_barrier_all(). ↩↩↩↩↩
Fregly, AI Systems Performance Engineering: device-side work-stealing kernel using nvshmem_int_atomic_inc(queue_head) so each PE claims the next task index with no host coordination. ↩
Fregly, AI Systems Performance Engineering: nvshmemx_collective_launch() starts a cooperative kernel spanning all GPUs; all kernels using NVSHMEM device-level synchronization or collectives must be launched this way to run concurrently on every PE; device-side nvshmem_barrier_all() for lockstep. ↩
Fregly, AI Systems Performance Engineering: NVSHMEM kernels can be captured inside CUDA Graphs; device ops occur inside the kernel, not as separate graph nodes; NVSHMEM's strength is adapting at runtime versus a fixed graph communication script. ↩↩
Fregly, AI Systems Performance Engineering: fine-grained primitives nvshmem_wait_until, nvshmem_signal_fetch, nvshmem_signal_wait_until, nvshmemx_signal_op for point-to-point synchronization when only a subset of devices must coordinate; over-use of global barriers stalls all GPUs on the slowest peer. ↩↩
Fregly, AI Systems Performance Engineering, "Capturing Multi-GPU Collectives with NCCL and CUDA Graphs": NCCL is the go-to for bulk collectives (broadcast, reduction, all-to-all); NCCL arranges tensors into rings/trees saturating NVLink/NVSwitch; DDP overlaps bucketed all-reduce on a separate stream. ↩
Fregly, AI Systems Performance Engineering: topology-aware NCCL keeps traffic on the fastest interconnect; DDP uses asynchronous NCCL all-reduces on a background CUDA stream to overlap with backward computation. ↩
Fregly, AI Systems Performance Engineering: prefer NCCL/NVSHMEM for bulk collective transfers; NVSHMEM where fine-grained device-initiated control of the communication dominates. ↩
Fregly, AI Systems Performance Engineering: large-scale inference prefers NIXL for one-to-one transfers (KV-cache movement) due to lower overhead/latency; NCCL send/recv remains but is less optimized for minimal latency. ↩
Fregly, AI Systems Performance Engineering: NCCL supports GPU-initiated networking with InfiniBand GPUDirect Async (IBGDA) and the direct-NIC path, letting the GPU drive full-bandwidth RDMA without CPU intervention. ↩
Fregly, AI Systems Performance Engineering: DeepSeek-V3 uses 1 shared expert plus 8 router-selected experts out of 256 per token (~9 active experts, ~37B active parameters). ↩
Fregly, AI Systems Performance Engineering, "DeepSeek's Use of Inline PTX": DeepEP is DeepSeek's expert-parallel communication library; uses out-of-doc PTX ld.global.nc.l1::no_allocate.l2::256b to stream all-to-all buffers into L2 without L1 eviction; not in NVIDIA's PTX ISA; gated by build flag DISABLE_AGGRESSIVE_PTX_INSTRS=1; warned to be unstable across GPU generations. ↩↩
Fregly, AI Systems Performance Engineering: verify true GPUDirect RDMA with lsmod | grep nvidia_peermem and check dmesg; the nvidia-peermem driver registers GPU memory with the NIC. ↩
Fregly, AI Systems Performance Engineering: a red flag for a fallback off the GPU-direct path is GPU utilization dropping and CPU utilization spiking during communication, indicating the CPU is copying data. ↩
NVSHMEM is NVIDIA's GPU-accelerated implementation of OpenSHMEM (a PGAS programming model); device-side APIs are callable by CUDA kernel threads to access symmetric memory via one-sided put/get/atomic. NVIDIA NVSHMEM Introduction. https://docs.nvidia.com/nvshmem/api/introduction.html ↩
NVIDIA NVSHMEM Remote Memory Access API: void nvshmem_TYPENAME_p(TYPE *dest, TYPE value, int pe) (and __device__ variant) — "very low latency put capability for single elements"; nvshmem_TYPENAME_g single-element get and nvshmem_TYPENAME_get bulk get return after delivery to dest; a put is complete only after a subsequent nvshmem_quiet; single-element _p over InfiniBand generates one RMA message per element; GPU-issued fence/quiet/barrier order only GPU-issued operations. https://docs.nvidia.com/nvshmem/api/gen/api/rma.html , https://docs.nvidia.com/nvshmem/api/using.html ↩↩↩↩↩↩↩
NVIDIA NVSHMEM Environment Variables: NVSHMEM_IB_ENABLE_IBGDA (bool, default false) — "Set to enable GPU-initiated communication transport"; NVSHMEM_REMOTE_TRANSPORT (string, default "ibrc"; values ibrc, ucx, libfabric, ibdevx, gpunetio, none); NVSHMEM_SYMMETRIC_SIZE (size, default 1073741824 = 1 GiB, suffixes k/m/g/t) symmetric heap per PE. https://docs.nvidia.com/nvshmem/api/gen/env.html ↩↩↩↩
NVIDIA Technical Blog, "Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async": IBGDA implements both control and data plane of IB in the GPU, removing the CPU reverse-proxy; IBRC put rate caps ~1.7 MOPS regardless of CTAs/QPs while IBGDA scales with CTAs toward the ~215 MOPS ConnectX-6 limit at eight CTAs; requires Mellanox HCAs, MLNX_OFED 5.0+, nvidia.ko >= 510.40.3 with PeerMappingOverride=1; introduced NVSHMEM 2.6.0. https://developer.nvidia.com/blog/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async/ ↩↩↩↩
NVIDIA NVSHMEM "Using NVSHMEM": device code requires relocatable device code (nvcc -rdc=true) and linking the NVSHMEM device + host libraries; exact flags per release. https://docs.nvidia.com/nvshmem/api/using.html ↩
DeepSeek deepseek-ai/DeepEP — efficient expert-parallel communication library: high-throughput and low-latency all-to-all GPU kernels for MoE dispatch and combine with FP8 support. Upstream README states V2 switched from the NVSHMEM backend to the lighter NCCL Gin backend, with NVSHMEM retained for legacy V1; build flag DISABLE_AGGRESSIVE_PTX_INSTRS disables aggressive load/store instructions in legacy methods. Prefer this source over the book where the backend disagrees. https://github.com/deepseek-ai/DeepEP ↩↩↩↩