Markdown

CUDA unified memory and NVLink-C2C page migration¶

Scope: a single pointer addressable from CPU and GPU (cudaMallocManaged), on-demand page migration and the page-fault stalls it causes, defusing those stalls with cudaMemPrefetchAsync + cudaMemAdvise, GPU memory oversubscription, and how Grace-Hopper/Grace-Blackwell NVLink-C2C cache-coherent memory changes the model, including its NUMA non-uniform-access caveat. For where this memory physically lives see GPU Memory Hierarchy; for the stream the prefetch rides on see CUDA Streams and Concurrency.

What it is¶

Unified Memory gives the application a single pointer that is valid on both the host and every GPU. You allocate with cudaMallocManaged(void** devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal) and the same address dereferences correctly from a CPU thread and from a kernel (CUDA Runtime API, Memory Management; CUDA C++ Programming Guide, Unified Memory). There is no explicit cudaMemcpy between host and device; "you focus on algorithms instead of explicit memory management" (Fregly, "AI Systems Performance Engineering," Ch. 6).

On discrete GPUs (compute capability 6.0+, Pascal and later) the data is not copied eagerly. The page lives in one processor's physical memory at a time, and the hardware moves it on demand: when a processor touches a page that is resident elsewhere, it takes a page fault, the runtime unmaps the page from the current owner, copies the backing page, and remaps it at the faulting processor (CUDA C++ Programming Guide, Unified Memory). The programming guide is explicit that this transfer "can be an expensive operation, and the amount of work is proportional to the page size."

flowchart LR
    subgraph cpu ["CPU (system memory)"]
        P0["managed page (resident)"]
    end
    subgraph gpu ["GPU (HBM)"]
        K["kernel touches page"]
    end
    K -- "1: page fault" --> PME["Page Migration Engine"]
    PME -- "2: unmap from CPU" --> P0
    PME -- "3: copy page over the link" --> GPU2["page now resident in HBM"]
    GPU2 -- "4: kernel resumes" --> K

Two complementary APIs let you control placement instead of paying for faults:

cudaMemPrefetchAsync is an "asynchronous stream-ordered API [that] may migrate data to reside closer to the specified processor" (CUDA C++ Programming Guide). You move the page before the kernel faults on it.
cudaMemAdvise sets persistent hints (cudaMemoryAdvise enum) about how a range is used, so the migration policy stops fighting the access pattern.

Why it matters¶

On-demand migration is correct but, untuned, slow. The first time a kernel sweeps a freshly-allocated managed array, every page it touches faults, serially, on the critical path. The fault is not just a copy: the runtime must ensure "the currently owning processor cannot access this page anymore" before handing it over (Fregly, Ch. 6; CUDA C++ Programming Guide). A kernel that should be HBM-bandwidth-bound instead stalls warp after warp waiting for the Page Migration Engine to service faults. The GPU looks busy and delivers little goodput.

Prefetch collapses that storm into one bulk transfer. Issuing cudaMemPrefetchAsync into the same stream ahead of the kernel migrates the whole range as a single stream-ordered operation, so the kernel finds its data already resident and runs at memory-bandwidth instead of fault-latency. cudaMemAdvise removes the steady-state churn: cudaMemAdviseSetReadMostly on a read-shared table lets the system replicate it read-only to multiple processors instead of bouncing one copy back and forth; cudaMemAdviseSetPreferredLocation pins a range's home so it is faulted in once and stays; cudaMemAdviseSetAccessedBy establishes a mapping so a processor can access the range without ever triggering a migration (CUDA C++ Programming Guide).

The second reason the model matters is oversubscription. "Unified memory enables applications to oversubscribe the memory of any individual processor": you can allocate "arrays larger than the memory capacity" of a single GPU and let pages stream in and out, enabling "out-of-core processing" (CUDA C++ Programming Guide). The cost is that an oversubscribed working set thrashes: pages evicted to make room must be re-faulted later. Oversubscription buys capacity, not speed.

When it is needed (and when not)¶

Use Unified Memory when:

The access pattern is sparse or data-dependent and you cannot statically schedule cudaMemcpy: graph traversal, pointer-chasing structures, or kernels whose touched footprint is not known ahead of time.
You need to oversubscribe a single GPU's memory to run a problem larger than HBM, accepting page traffic as the price of fitting.
You are on Grace-Hopper / Grace-Blackwell (see below), where the coherent NVLink-C2C link makes managed and system-allocated memory cheap enough to use as the default rather than the fallback.

Prefer explicit cudaMalloc + cudaMemcpyAsync (or the stream-ordered allocator) when:

The data flow is regular and statically known, such as a dense GEMM or a fixed-shape training step. Explicit double-buffered copies overlapped on a stream beat fault-driven migration and never surprise you with a stall.
You are latency-critical on a discrete PCIe GPU and cannot tolerate a fault storm; here you would have to prefetch everything anyway, at which point an explicit copy is simpler and just as fast.

Managed memory without prefetch/advise on a discrete GPU is the worst of both worlds: it is convenient to write and slow to run. If you adopt it, budget for the tuning in the next section.

How: implement, integrate, maintain¶

1. Allocate managed, then prefetch before the kernel¶

The pattern is allocate once, advise the steady-state policy, prefetch ahead of each consumer.

#include <cuda_runtime.h>

size_t n = 1u << 26;                 // 64 Mi elements
float *a = nullptr;
cudaMallocManaged(&a, n * sizeof(float));   // single pointer, host + device

int device = -1;
cudaGetDevice(&device);

// Initialize on the host -> pages are resident in system memory here.
for (size_t i = 0; i < n; ++i) a[i] = 1.0f;

// Move the whole range to the GPU as ONE stream-ordered transfer,
// so the kernel does not fault page-by-page on first touch.
cudaMemPrefetchAsync(a, n * sizeof(float), device, /*stream=*/0);

saxpy<<<grid, block, 0, /*stream=*/0>>>(a, /*...*/);

// Pull results back before the host reads them, again as one transfer.
cudaMemPrefetchAsync(a, n * sizeof(float), cudaCpuDeviceId, /*stream=*/0);
cudaStreamSynchronize(0);

cudaCpuDeviceId is the sentinel device id meaning "the host". Pass it as the destination to migrate a range back to system memory (CUDA C++ Programming Guide). The prefetch is stream-ordered: it begins only after prior work in that stream completes, and the kernel you launch afterward in the same stream sees the data resident.

API-version caveat (verify against your toolkit). The signatures above are the classic form: cudaMemPrefetchAsync(const void* devPtr, size_t count, int dstDevice, cudaStream_t stream) and cudaMemAdvise(const void* devPtr, size_t count, cudaMemoryAdvise advice, int device). In CUDA 13.0 the default cudaMemPrefetchAsync/cudaMemAdvise symbols became the _v2 forms, which take a struct cudaMemLocation location (plus a flags argument on prefetch) instead of an int device id (CUDA Runtime API, Memory Management; NVIDIA Developer Forums, "no suitable constructor exists to convert from int to cudaMemLocation"). On CUDA 13+, build the location explicitly:

// CUDA 13+ (_v2) form: target a specific GPU.
cudaMemLocation loc{};
loc.type = cudaMemLocationTypeDevice;
loc.id   = device;
cudaMemPrefetchAsync(a, n * sizeof(float), loc, /*flags=*/0, /*stream=*/0);

// Target a host NUMA node instead of the generic CPU.
cudaMemLocation host{};
host.type = cudaMemLocationTypeHostNuma;
host.id   = 0;                         // host NUMA node id
cudaMemPrefetchAsync(a, n * sizeof(float), host, /*flags=*/0, /*stream=*/0);

The cudaMemLocationTypeHostNuma location type is exactly the hook the NUMA caveat below needs: on a coherent CPU-GPU system you can prefetch to (or set the preferred location to) a specific CPU NUMA node, not just "the CPU."

2. Set policy with cudaMemAdvise¶

Advice is persistent and complements prefetch. Apply it once after allocation.

// A lookup table read by many threads, rarely written: replicate read-only.
cudaMemAdvise(table, bytes, cudaMemAdviseSetReadMostly, device);

// This range's home is the GPU; fault it in once and keep it there.
cudaMemAdvise(weights, bytes, cudaMemAdviseSetPreferredLocation, device);

// The host will also touch it; map it so host access does not migrate the page.
cudaMemAdvise(weights, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

The enum semantics, verbatim: cudaMemAdviseSetReadMostly "implies that the data is mostly going to be read from and only occasionally written to"; cudaMemAdviseSetPreferredLocation "sets the preferred location for the data to be the specified device's physical memory"; cudaMemAdviseSetAccessedBy "tells the system that the data will be frequently accessed by location.id" (CUDA C++ Programming Guide). Each has an Unset* counterpart to clear the hint. Advice does not move data; it shapes how the migration engine reacts to faults. Combine SetPreferredLocation (where it lives) with SetAccessedBy (who can reach it without migrating) to eliminate steady-state churn.

3. Oversubscribe deliberately, and measure the thrash¶

cudaMallocManaged will hand back an allocation larger than GPU memory; the pages simply are not all resident at once. This is the only way to run a working set bigger than HBM without manual tiling, but a hot working set that exceeds capacity re-faults evicted pages continuously. Treat oversubscription as a capacity tool, profile the page-fault rate, and if the resident set thrashes, fall back to explicit tiling. Allocations larger than memory enable "out-of-core processing" (CUDA C++ Programming Guide); they do not make it fast.

4. Grace-Hopper / Grace-Blackwell: coherent NVLink-C2C changes the rules¶

On the Grace-Hopper Superchip (and the Grace-Blackwell GB200 successor) the Grace CPU and the GPU are joined by NVLink-C2C, a cache-coherent chip-to-chip link delivering "up to 900 GB/s total bandwidth ... 7x higher bandwidth than x16 PCIe Gen5 lanes" (NVIDIA Grace Hopper Superchip Architecture In-Depth). Coherence is the headline: "NVLink-C2C hardware-coherency enables the Grace CPU to cache GPU memory at cache-line granularity and for the GPU and CPU to access each other's memory without page-migrations." With Address Translation Services (ATS), "the CPU and GPU [share] a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory". So plain malloc'd (system-allocated) memory is GPU-addressable, not just cudaMallocManaged memory. On these systems a remote access can be served as a 64/128-byte cache-line fetch over the coherent link instead of forcing a page migration, which is why managed/system memory becomes a sane default here rather than a discrete-GPU fallback.

flowchart LR
    subgraph chip ["Grace-Hopper Superchip"]
        GRACE["Grace CPU<br/>NUMA node 0<br/>LPDDR5X"]
        HOPPER["Hopper GPU<br/>NUMA node 1<br/>HBM3"]
    end
    GRACE <-- "NVLink-C2C: coherent, ~900 GB/s, cache-line granular" --> HOPPER
    GRACE -. "local: ~500 GB/s class" .-> GRACE
    HOPPER -. "local HBM: multi-TB/s" .-> HOPPER

The NUMA non-uniform caveat. Coherent does not mean uniform. "From an OS perspective, the Grace CPU and Hopper GPU are just two separate NUMA nodes" (NVIDIA Grace Hopper Superchip Architecture In-Depth). Local and remote bandwidth differ by an order of magnitude, and the C2C link is itself asymmetric. Independent measurements on GH200 report GPU-local HBM3 around 3.4 TB/s and CPU-local LPDDR5X around 486 GB/s, while across NVLink-C2C the link delivers roughly "375 GB/s for host-to-device (H2D) transfers and 297 GB/s for device-to-host (D2H) transfers" against a 450 GB/s theoretical (Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper). A kernel that streams from CPU-resident LPDDR5X over C2C therefore runs at a fraction of HBM bandwidth even though no page ever migrates. The lesson: on coherent systems the question shifts from "did it migrate?" to "is it local?". Use cudaMemAdviseSetPreferredLocation and cudaMemPrefetchAsync (now able to target a specific host NUMA node via cudaMemLocationTypeHostNuma) to keep a kernel's hot footprint in the fast node, and use NUMA-aware host placement so CPU-side data lands on the Grace node the GPU is paired with. See NUMA Affinity and CPU Pinning for GPU Pipelines.

Note also that placement on Grace-Hopper is first-touch: the page is backed on the NUMA node of whichever processor writes it first, and system-allocated memory uses access-counter-based automatic migration (a hardware interrupt once a per-page access counter crosses a threshold) rather than the immediate page-fault migration of classic managed memory (Harnessing Integrated CPU-GPU System Memory for HPC). Initialize data on the processor that will consume it, or prefetch it there explicitly.

5. Validate¶

Profile with Nsight Systems and look specifically at the Unified Memory track: a healthy prefetched run shows a few bulk H2D/D2H migrations, not a dense field of small page faults overlapping the kernel. Persistent page-fault activity inside a kernel's runtime is the signature of missing or mistimed prefetch. See Profiling GPUs: Nsight Systems and Nsight Compute. On Grace-Hopper, also check NUMA locality of the hot range (which node backs it) rather than just fault counts. A coherent system can be slow with zero migrations if the working set sits on the remote node.

The code in this page is adapted from the cited book and NVIDIA documentation. It has not been compiled or hardware-tested here.

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly). Ch. 6 "GPU Architecture, CUDA Programming, and Maximizing Occupancy" — Unified Memory model, on-demand page migration, page-fault cost proportional to page size, cudaMallocManaged, prefetch and advise to avoid fault stalls.
NVIDIA, CUDA C++ Programming Guide — Unified Memory (on-demand migration and page faults, cudaMemPrefetchAsync, cudaMemAdvise and the cudaMemoryAdvise enum, cudaCpuDeviceId, oversubscription / out-of-core, ATS and single per-process page table): https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/unified-memory.html
NVIDIA, CUDA Runtime API — Memory Management (cudaMallocManaged with cudaMemAttachGlobal, current _v2 cudaMemPrefetchAsync/cudaMemAdvise signatures taking cudaMemLocation, cudaMemLocationTypeDevice/cudaMemLocationTypeHostNuma): https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html
NVIDIA Developer Blog — NVIDIA Grace Hopper Superchip Architecture In-Depth (NVLink-C2C 900 GB/s coherent link, access without page migration, ATS single per-process page table, Grace/Hopper as two NUMA nodes, LPDDR5X/HBM3 capacities and bandwidth): https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/
NVIDIA Developer Forums — CUDA 13 cudaMemPrefetchAsync int-to-cudaMemLocation API change (_v2 default): https://forums.developer.nvidia.com/t/cudamemprefetchasync-compilation-error-with-cuda-13-1-on-rtx-5070-ti-no-suitable-constructor-exists-to-convert-from-int-to-cudamemlocation/357462
T. Fukushima et al., Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper (arXiv:2407.07850) — measured GH200 NUMA bandwidth asymmetry (HBM3 ~3.4 TB/s, LPDDR5X ~486 GB/s, C2C 375/297 GB/s H2D/D2H), first-touch placement, access-counter-based migration: https://arxiv.org/html/2407.07850v1