The NVIDIA Grace CPU
Scope: the NVIDIA Grace CPU (a 72-core Arm Neoverse V2 server processor with high-bandwidth on-package LPDDR5X) and its role next to the GPU inside a Grace Blackwell (GB200/GB300) superchip: the cache-coherent NVLink-C2C CPU↔GPU link, the host work it offloads (data loading, speculative-decoding draft models, orchestration, kernel launch), the unified Grace-Blackwell memory model, and the aarch64 software implications for an ops team. Specs shift; confirm on the cited NVIDIA pages.
Figures verified against the NVIDIA Grace CPU product page, the Grace CPU Superchip architecture blog, and the GB200 NVLink tuning guide as of June 2026, and cross-checked against Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 2. Re-check the datasheet before relying on any single number. The "How" steps below are reference templates, unexecuted and not hardware-tested.
What it is
Grace is NVIDIA's data-center CPU, custom-designed for memory bandwidth and power efficiency rather than peak single-thread clock. A single Grace CPU has 72 Arm Neoverse V2 cores (up to ~3.44 GHz) and 114 MB of L3 cache, fed by a wide server-class LPDDR5X subsystem delivering up to ~500 GB/s of memory bandwidth (NVIDIA quotes ~536 GB/s MemSet / ~507 GB/s Triad) at roughly 16 W for the memory subsystem, which is far more bandwidth per watt than registered DDR5 DIMMs. In a GB200 superchip the Grace CPU carries up to ~480 GB of on-package LPDDR5X with ECC.
It does not sit on the far side of a PCIe bus. In a Grace Blackwell GB200 superchip, one Grace CPU is joined to two Blackwell B200 GPUs by NVLink-C2C (chip-to-chip), a 900 GB/s bidirectional, cache-coherent link. That is roughly 7x PCIe Gen5 x16 (~64 GB/s per direction, ~128 GB/s bidirectional) and about 3.5x PCIe Gen6 x16 on B300 (~128 GB/s per direction). Cache coherency means the CPU and GPUs see the same memory values without explicit copies or address-translation overhead: the GPU can read/write Grace's LPDDR5X and Grace can read/write GPU HBM3e directly, in one unified address space NVIDIA calls Extended GPU Memory (EGM). The book (Fregly, Ch. 2) frames Grace's job as "never becoming the bottleneck when shoveling data to the GPUs."
flowchart TB
subgraph GB200["Grace Blackwell GB200 Superchip (unified coherent address space)"]
GRACE["Grace CPU: 72x Arm Neoverse V2, 114 MB L3"]
GRACE --- LP["~480 GB LPDDR5X (ECC), ~500 GB/s"]
GRACE -- "NVLink-C2C: 900 GB/s, cache-coherent" --> B1["Blackwell B200 GPU #1"]
GRACE -- "NVLink-C2C: 900 GB/s, cache-coherent" --> B2["Blackwell B200 GPU #2"]
B1 --- H1["192 (180 usable) GB HBM3e, ~8 TB/s"]
B2 --- H2["192 (180 usable) GB HBM3e, ~8 TB/s"]
end
B1 -- "NVLink 5 (18 ports, 1.8 TB/s/GPU)" --> SW["NVSwitch fabric (rack)"]
B2 -- "NVLink 5" --> SW
A full GB200/GB300 NVL72 rack contains 36 Grace CPUs and 72 Blackwell GPUs (18 compute trays, two superchips each). Grace is the host CPU of that rack: not an afterthought, but the data-feeding and orchestration tier of a unified CPU-GPU machine.
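The rack counts above follow from the per-superchip figures. A minimal Python sketch of that arithmetic, assuming the "up to" capacities quoted in this section apply to every superchip (real configurations may ship with less memory):

```python
# Rack layout and per-part figures as quoted in this section.
TRAYS_PER_RACK = 18
SUPERCHIPS_PER_TRAY = 2
GRACE_PER_SUPERCHIP = 1
GPUS_PER_SUPERCHIP = 2
CORES_PER_GRACE = 72
LPDDR5X_GB_PER_GRACE = 480   # "up to" figure, assumed for every Grace
HBM_USABLE_GB_PER_GPU = 180  # usable, not raw, HBM3e per GPU

superchips = TRAYS_PER_RACK * SUPERCHIPS_PER_TRAY
grace_cpus = superchips * GRACE_PER_SUPERCHIP
gpus = superchips * GPUS_PER_SUPERCHIP
arm_cores = grace_cpus * CORES_PER_GRACE
lpddr5x_gb = grace_cpus * LPDDR5X_GB_PER_GRACE
hbm_usable_gb = gpus * HBM_USABLE_GB_PER_GPU

print(f"{grace_cpus} Grace CPUs, {gpus} GPUs, {arm_cores} Arm cores")
print(f"{lpddr5x_gb} GB LPDDR5X, {hbm_usable_gb} GB usable HBM3e")
```

Under these assumptions the host tier of one rack is 2,592 Arm cores in front of 72 GPUs, which is why the pinning and placement guidance later in this page matters at rack scale.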
Why it matters
Memory ceiling becomes soft. A GB200 superchip exposes roughly 840 GB of coherent unified memory counting usable capacity (≈480 GB Grace LPDDR5X + 2×180 GB usable HBM3e), which the book rounds to ~900 GB. A model or embedding table that overflows a GPU's HBM can reside in Grace's LPDDR5X and the GPU still works on it over NVLink-C2C, with no NVMe spill and no cross-network fetch. A ~500 GB model that will not fit one GPU's HBM can live entirely inside one superchip module. The classic "CUDA out of memory" wall moves out substantially.
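As a rough planning aid, the capacity tiers in the paragraph above can be turned into a small classifier. This is a sketch under stated assumptions: the function name `placement_tier` and the `reserve_fraction` headroom (for activations, KV cache, and runtime overhead) are hypothetical, and the capacities are the approximate figures quoted in this section, not measured values.

```python
# Approximate capacities quoted in this section (GB).
HBM_USABLE_GB = 180   # usable HBM3e per Blackwell GPU
GPUS = 2              # GPUs per GB200 superchip
LPDDR5X_GB = 480      # "up to" Grace LPDDR5X per superchip


def placement_tier(footprint_gb: float, reserve_fraction: float = 0.1) -> str:
    """Classify where a footprint of this size could live in one superchip.

    reserve_fraction is a hypothetical safety margin held back from every
    tier; tune it for your own runtime overheads.
    """
    usable = 1.0 - reserve_fraction
    one_gpu = HBM_USABLE_GB * usable
    all_hbm = GPUS * HBM_USABLE_GB * usable
    superchip = (GPUS * HBM_USABLE_GB + LPDDR5X_GB) * usable
    if footprint_gb <= one_gpu:
        return "single-gpu-hbm"
    if footprint_gb <= all_hbm:
        return "both-gpus-hbm"
    if footprint_gb <= superchip:
        return "hbm-plus-grace-lpddr5x"
    return "exceeds-superchip"


for size_gb in (100, 300, 500, 800):
    print(size_gb, placement_tier(size_gb))
```

With the default 10% reserve, the ~500 GB model from the text lands in the tier that spans HBM and Grace memory, while an 800 GB footprint no longer fits one superchip.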
But placement still matters. Reaching LPDDR5X over NVLink-C2C is ~10x lower bandwidth and higher latency than a GPU touching its own HBM (~8 TB/s). Coherence buys correctness, not free performance: a good runtime keeps hot data in HBM and uses Grace memory for overflow or latency-tolerant data, with asynchronous prefetch and staged pipelines. The win is that overflow stays on-package instead of going to SSD or the network.
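To make the bandwidth gap concrete, the sketch below compares idealized sweep times over each path using the peak figures quoted in this section. The one-way NVLink-C2C value is an assumption (half of the 900 GB/s bidirectional figure), and real transfers will be slower than these peak-rate bounds because of latency, access pattern, and contention.

```python
# Peak bandwidths in GB/s, taken from the figures quoted in this section.
BANDWIDTH_GB_S = {
    "hbm3e_local": 8000.0,           # GPU touching its own HBM3e
    "nvlink_c2c_bidir": 900.0,       # quoted bidirectional C2C figure
    "nvlink_c2c_one_way": 450.0,     # assumption: half of bidirectional
    "grace_lpddr5x": 500.0,          # Grace memory subsystem
    "pcie_gen5_x16_one_way": 64.0,   # for comparison with a PCIe host
}


def sweep_seconds(size_gb: float, path: str) -> float:
    """Idealized time to stream size_gb once over the named path."""
    return size_gb / BANDWIDTH_GB_S[path]


def slowdown_vs_hbm(path: str) -> float:
    """How many times slower the path is than local HBM at peak rate."""
    return BANDWIDTH_GB_S["hbm3e_local"] / BANDWIDTH_GB_S[path]


for name in BANDWIDTH_GB_S:
    ms = sweep_seconds(100.0, name) * 1000.0
    print(f"{name:24s} {ms:8.1f} ms per 100 GB  ({slowdown_vs_hbm(name):.1f}x HBM)")
```

On these numbers the coherent path to Grace memory is somewhere between about 9x and 18x slower than local HBM depending on which figure you take as the limit, which is consistent with the ~10x rule of thumb and still far ahead of a one-way PCIe Gen5 x16 link.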
Cheap kernel launch and task hand-off. Because the CPU and GPU are on the same module behind a low-latency coherent link, launching a GPU kernel or handing a task between CPU and GPU does not traverse a slow PCIe bus. The hand-off is closer to a local function call than a remote one. This makes the CPU practical for the control-heavy and random-access work GPUs are bad at, with results immediately visible to the GPU.
When it is needed (and when not)
Lean on Grace when:
- The model + working set exceeds GPU HBM but fits the combined superchip memory. Use Grace LPDDR5X as a coherent extension instead of partitioning across GPUs or spilling to storage.
- The input pipeline is the bottleneck (streaming from storage, tokenization, decode/augmentation) and you want it adjacent to the GPU at ~500 GB/s rather than across PCIe. See data-loading pipeline tuning.
- You run speculative decoding and want a small draft model or control logic on the CPU, with accept/reject and the verified KV state coherently shared with the GPU target model. See speculative decoding.
- The workload is control-heavy / random-access (graph traversal, sampling logic, orchestration, dynamic batching) that runs poorly on SIMT hardware.
Do not lean on Grace when:
- The kernel is memory-bandwidth-bound and HBM-resident. Pulling its working set across NVLink-C2C at ~1/10th HBM bandwidth will dominate runtime. Keep it in HBM; see roofline / arithmetic intensity.
- You assumed an x86 host. Grace is aarch64; see the aarch64 implications below.
- You expected GPU-to-GPU coherence. Only the CPU↔GPU NVLink-C2C path is cache-coherent. Across the NVL72, GPUs share a global address space but their caches are not globally coherent; multi-GPU correctness comes from software (NCCL, NVSHMEM) over NVSwitch/NVLink, not Grace coherence.
How: implement, integrate, maintain
The defining operational facts are that Grace is aarch64 and that the CPU-GPU memory is coherent and unified. Both change how you build, place data, and pin work versus an x86 + discrete-GPU node.
aarch64 software surface
Grace is a standards-based Arm SBSA (Server Base System Architecture) design, so unlike a consumer-class Arm part it behaves like any other server CPU: existing aarch64 binaries, compilers, tools, and all major Linux distributions run unmodified, and the full NVIDIA stack (CUDA, HPC SDK, NGC containers, NIM) ships Arm-native. The friction is the same shape as on DGX Spark: you must use arm64 artifacts, not x86.
# Confirm you are on Grace (aarch64) and see the coherent CPU-GPU topology
uname -m # aarch64
lscpu | grep -E "Model name|CPU\(s\)" # Neoverse-V2, 72 cores per Grace
nvidia-smi # Blackwell GPUs visible on the superchip
nvidia-smi topo -m # C2C / NVLink coherent links between CPU and GPUs
numactl -H # NUMA layout: Grace LPDDR5X + GPU memory nodes
- Use arm64-SBSA CUDA, not x86. On the CUDA downloads page select linux -> arm64-sbsa -> <distro>. NVIDIA supports cross-development for arm64-sbsa from an x86 host on Ubuntu 22.04/24.04, RHEL 8/9, SLES 15, and KylinOS 10; the cross host compiler is aarch64-linux-gnu-g++.
- Pull arm64 container tags only. x86-only images and wheels will not run. NGC and nvcr.io/nvidia/cuda publish multi-arch (arm64-included) tags; verify each third-party Python dependency ships an aarch64 wheel before relying on stock pip install. This is the single most common porting surprise versus an x86 GPU node.
# arm64 manifest pulls; verify the current patch tag on NGC
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.3-base-ubuntu24.04 nvidia-smi
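The wheel check in the list above can be partly automated by reading the platform tag out of a wheel filename. A minimal sketch, assuming standard wheel filenames (name, version, optional build tag, Python tag, ABI tag, platform tag); the helper names and the example filenames are illustrative, and the check ignores glibc and Python-version compatibility:

```python
def wheel_platform_tags(filename: str) -> list[str]:
    """Return the platform tags encoded in a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    # Several platform tags may be joined with dots in the last field.
    return parts[-1].split(".")


def runs_on_aarch64(filename: str) -> bool:
    """True if the wheel is pure-Python ("any") or built for Linux aarch64."""
    return any(
        tag == "any" or tag.endswith("_aarch64")
        for tag in wheel_platform_tags(filename)
    )


# Illustrative filenames only; check the real files on your package index.
candidates = [
    "example_pkg-1.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
    "example_pkg-1.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "pure_pkg-2.1.0-py3-none-any.whl",
]
for name in candidates:
    print(runs_on_aarch64(name), name)
```

Running a check like this over a resolved dependency list before building the image surfaces x86-only packages early, while a fallback to a source build is still cheap to arrange.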
Place data and pin work for the coherent unified memory
Coherence makes the program correct; you still make it fast by managing placement.
- Keep hot data in HBM; overflow to LPDDR5X. Treat Grace memory as a large, ~10x-slower coherent tier: fine for overflow, embedding tables, or latency-tolerant state, not for bandwidth-bound inner loops. Prefetch asynchronously and stage pipelines so the GPU is not blocked on a C2C fetch. (CUDA managed/unified-memory APIs and prefetch hints apply; see CUDA unified memory.)
- NUMA-pin the host pipeline. Bind data-loader and orchestration threads to the Grace cores nearest the GPU they feed, and allocate their buffers on the matching memory node, so the feed path uses NVLink-C2C cleanly. See NUMA & CPU pinning.
- Offload the right work. Put tokenization, decode/augmentation, sampling, dynamic batching, and a speculative-decoding draft model on Grace; the coherent link makes their outputs immediately GPU-visible without an explicit copy. Keep dense matrix math on the GPU.
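The NUMA-pinning item in the list above can be sketched with the Python standard library on Linux. The helper names are hypothetical, and which node number sits next to which GPU is an assumption you must read from `numactl -H` on your own system. This pins CPU scheduling only; buffer placement still follows the kernel's memory policy (first-touch by default) unless you bind memory separately, for example by launching under `numactl --membind`.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus: set[int] = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # empty list, e.g. a memory-only NUMA node
        if "-" in chunk:
            low, high = chunk.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def cpus_of_node(node: int) -> set[int]:
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as handle:
        return parse_cpulist(handle.read())


def pin_to_node(node: int) -> set[int]:
    """Restrict the calling process to the CPUs of one NUMA node."""
    cpus = cpus_of_node(node)
    if not cpus:
        raise ValueError(f"NUMA node {node} has no CPUs (memory-only node?)")
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return os.sched_getaffinity(0)
```

Calling `pin_to_node(n)` at the start of each data-loader worker, before it allocates its buffers, keeps both the threads and their first-touched pages on the Grace die that feeds the target GPU.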
Maintain
- Topology / coherence checks. nvidia-smi topo -m and numactl -H should show the C2C link and the expected CPU + GPU memory nodes. The Fabric Manager and rack-level NVLink fabric (NVSwitch/NVLink) are separate from the on-superchip C2C coherence. Do not conflate a healthy NVL72 fabric with a healthy CPU↔GPU link.
- Architecture, not just driver, must match. Build/JIT for the deployed GPU's compute capability and ship arm64 host binaries; an x86 host binary or a wrong-sm device binary fails or falls back to slow PTX JIT.
- Mind the GB200 vs Grace-Superchip 900 GB/s ambiguity. In a GB200 the 900 GB/s NVLink-C2C is the CPU↔GPU link. NVIDIA also sells a CPU-only Grace CPU Superchip (two Grace dies, 144 cores, up to ~960 GB LPDDR5X, up to ~1 TB/s aggregate) where C2C joins the two CPUs. Confirm which part you have before quoting a bandwidth or core count.
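The architecture-match item in the list above can be expressed as a small preflight check. This is a deliberately simplified sketch: the function names are hypothetical, only an exact compute-capability match is treated as native, and it ignores same-major cubin compatibility and architecture-specific targets. The compute capability `(10, 0)` in the example is an illustrative value; confirm the real one with a device query on your hardware.

```python
import platform


def host_is_aarch64(machine: str = "") -> bool:
    """True if the host (or the given machine string) is 64-bit Arm."""
    machine = machine or platform.machine()
    return machine.lower() in ("aarch64", "arm64")


def device_code_path(
    device_cc: tuple[int, int],
    cubin_ccs: set[tuple[int, int]],
    ptx_ccs: set[tuple[int, int]],
) -> str:
    """Predict how a binary's embedded GPU code would load on a device.

    cubin_ccs: compute capabilities with precompiled machine code.
    ptx_ccs:   compute capabilities with embedded PTX.
    """
    if device_cc in cubin_ccs:
        return "native"
    # PTX can be JIT-compiled for a device of equal or newer capability.
    if any(ptx <= device_cc for ptx in ptx_ccs):
        return "ptx-jit"
    return "no-compatible-image"


deployed = (10, 0)  # illustrative value, not read from hardware
print(host_is_aarch64())
print(device_code_path(deployed, {(10, 0)}, set()))
print(device_code_path(deployed, {(9, 0)}, {(9, 0)}))
print(device_code_path(deployed, {(9, 0)}, set()))
```

Wiring a check like this into CI for release artifacts turns a slow first-launch JIT or a load failure on the cluster into a build-time warning.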
References
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), Chapter 2 — "AI System Hardware Overview": Grace CPU as Arm Neoverse V2, ~500 GB/s LPDDR5X, >100 MB L3; NVLink-C2C ~900 GB/s cache-coherent CPU↔GPU; ~480 GB Grace LPDDR5X and ~900 GB unified per GB200 superchip; overflow into CPU memory; cheap kernel launch over C2C; CPU↔GPU coherent but GPU caches not globally coherent across the NVL72.
- NVIDIA Grace CPU and Arm Architecture (Neoverse V2, LPDDR5X up to ~500 GB/s, SBSA, aarch64 ecosystem, NVLink-C2C): https://www.nvidia.com/en-us/data-center/grace-cpu/
- NVIDIA Grace CPU Superchip Architecture In Depth (72 Neoverse V2 cores up to 3.44 GHz, 114 MB L3, ~536 GB/s MemSet / ~507 GB/s Triad, ~16 W memory): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/
- NVIDIA Grace CPU Integrates with the Arm Software Ecosystem (SBSA, unmodified aarch64 binaries/distros, Arm-native CUDA / HPC SDK / NGC, arm64-sbsa cross-development targets): https://developer.nvidia.com/blog/nvidia-grace-cpu-integrates-with-the-arm-software-ecosystem/
- NVIDIA NVLink-C2C (900 GB/s bidirectional, cache-coherent, coherent unified CPU-GPU memory vs PCIe): https://www.nvidia.com/en-us/data-center/nvlink-c2c/
- NVIDIA GB200 NVL72 — trillion-parameter training & real-time inference (GB200 = 1 Grace + 2 Blackwell over 900 GB/s NVLink-C2C; NVL72 = 36 Grace + 72 Blackwell): https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
- NVIDIA GB200 NVL Multi-Node Tuning Guide — Grace Blackwell Superchip overview (coherent access to unified memory, no explicit transfer/address-translation overhead): https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html
- NVIDIA Data Center CPUs documentation (Grace platform, performance tuning): https://docs.nvidia.com/dccpu/index.html
- CUDA for Arm64 (NGC) — Arm-native CUDA containers for Grace / aarch64: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-arm64
Related: Grace-Blackwell platform · GPU generations · DGX Spark · CUDA unified memory · NVSwitch / NVLink · MoE sparsity scaling · Roofline / arithmetic intensity · Goodput in AI systems · Data-loading pipeline tuning · Speculative decoding · NUMA & CPU pinning · Glossary