
NVIDIA DGX Spark (GB10 desktop AI)

Scope: NVIDIA's GB10 Grace Blackwell desktop AI computer (formerly "Project DIGITS"), a single-board Arm + Blackwell system with 128 GB of unified memory, two ConnectX-7 ports for clustering a pair of units, and DGX OS. Operationally, what matters is how it differs from a rack GPU: an aarch64 software surface, desktop power and thermals, and a unified-memory model that trades bandwidth for capacity. Specs shift; confirm on the cited NVIDIA pages.

Figures verified against the NVIDIA DGX Spark product page and docs.nvidia.com DGX Spark hardware guide as of June 2026. Re-check the datasheet before relying on any single number.

What it is

DGX Spark is a compact desktop computer (150 x 150 x 50.5 mm, 1.2 kg) built on the GB10 Grace Blackwell Superchip (NVIDIA with MediaTek). It pairs a 20-core Arm CPU (10x Cortex-X925 + 10x Cortex-A725) with a Blackwell GPU (6,144 CUDA cores, 5th-gen Tensor Cores) on one package, joined internally by NVLink-C2C. It delivers up to 1 PFLOP of FP4 with sparsity (also quoted as up to 1,000 TOPS), runs DGX OS, and lists at ~$3,999 for the Founders Edition. It is positioned as a desktop prototyping device whose work transfers to DGX Cloud and datacenter DGX without code changes: same CUDA, same Blackwell architecture, same container images (for aarch64).

It is not a rack part: no SXM, no NVSwitch, no Fabric Manager, no MIG. It is a single coherent SoC on a desk.

flowchart TB
  subgraph GB10["GB10 Grace Blackwell Superchip (~140 W TDP)"]
    CPU["20-core Arm: 10x Cortex-X925 + 10x Cortex-A725"] -- "NVLink-C2C (coherent)" --> GPU["Blackwell GPU: 6144 CUDA, 5th-gen Tensor"]
    CPU --- MEM["128 GB LPDDR5x unified, 273 GB/s (ATS)"]
    GPU --- MEM
  end
  GB10 --> CX["2x QSFP ConnectX-7 (200 Gb/s)"]
  CX -- "cluster two units" --> SPARK2["2nd DGX Spark: 256 GB combined, up to 405B params"]
  GB10 --> IO["1x RJ-45 10 GbE + Wi-Fi 7"]

Lineup & specifications

All figures from the NVIDIA DGX Spark product page and the docs.nvidia.com DGX Spark hardware guide (cited in References).

| Attribute | Value | Note |
| --- | --- | --- |
| Superchip | GB10 Grace Blackwell (NVIDIA + MediaTek) | TSMC 3nm class; formerly Project DIGITS |
| CPU | 20-core Arm: 10x Cortex-X925 + 10x Cortex-A725 | aarch64 |
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor / 4th-gen RT | compute capability 12.1 (sm_121) |
| AI compute | Up to 1 PFLOP FP4 (sparse); up to 1,000 TOPS | NVFP4 / FP4 |
| Memory | 128 GB LPDDR5x, coherent unified system memory | 256-bit, 4266 MHz, 273 GB/s; not HBM, not discrete VRAM |
| Internal interconnect | NVLink-C2C (CPU↔GPU coherent) | on-package |
| Cluster networking | 2x QSFP ConnectX-7, 200 Gb/s | connect two units |
| Other I/O | 1x RJ-45 10 GbE, Wi-Fi 7 | |
| OS | NVIDIA DGX OS | CUDA / cuDNN / TensorRT, NGC, AI Enterprise |
| Power | 240 W external PSU; GB10 SoC TDP 140 W | desktop, air-cooled |
| Price | ~$3,999 (Founders Edition) | verify current pricing |

NVIDIA states a single unit runs models up to ~200 billion parameters in its unified memory, and two units linked over ConnectX scale to models up to ~405 billion parameters. The 256 GB combined figure is simply 2x128 GB across the pair (derived; NVIDIA quotes the 405B-parameter ceiling rather than the combined byte count).
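
A back-of-envelope check of these ceilings may help when sizing a model. The sketch below assumes (this is an assumption, not something stated in the paragraph above) that the quoted ceilings refer to weights stored at 4 bits per parameter, and it ignores KV cache, activations, and OS overhead. With N parameters at b bits each, the weight footprint is:

```latex
\begin{aligned}
M_{\text{weights}}(N, b) &= N \cdot \tfrac{b}{8}\ \text{bytes} \\
M_{\text{weights}}(200\times10^{9},\ 4) &= 100\ \text{GB} \;<\; 128\ \text{GB} \quad \text{(one unit)} \\
M_{\text{weights}}(405\times10^{9},\ 4) &= 202.5\ \text{GB} \;<\; 256\ \text{GB} \quad \text{(two units)} \\
M_{\text{weights}}(200\times10^{9},\ 16) &= 400\ \text{GB} \;>\; 128\ \text{GB}
\end{aligned}
```

The last line shows why precision matters: the same 200B-parameter model at 16 bits per weight would not fit a single unit, so the headline parameter counts should be read as low-precision figures with limited headroom for everything else.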

Operational differences

DGX Spark behaves like a Blackwell datacenter GPU in software (CUDA, NCCL, TensorRT) but differs sharply in the details that matter for operations and integration.

Unified memory, not VRAM (the central design point). The 128 GB LPDDR5x is one coherent pool shared by CPU and GPU via Address Translation Services (ATS) over NVLink-C2C, with no separate device VRAM to copy into. Implication for big models: a 70B–200B model that would not fit a discrete GPU's HBM can fit here at a suitably low weight precision, because the GPU addresses the whole 128 GB directly with no host-to-device staging. Implication for bandwidth: 273 GB/s is roughly 12x below H100 (~3.35 TB/s) and roughly 29x below B200 (~8 TB/s), so memory-bandwidth-bound work (large-batch training, high-throughput serving) runs far slower than on an HBM GPU. The trade is desktop capacity over throughput: a prototyping and single-stream inference profile, not a training-throughput one.
aarch64 software surface. GB10's CPU is Arm. Containers, wheels, and drivers must be arm64/aarch64; x86-only images and binaries will not run. Use the Arm builds of CUDA, the arm64 tags of NGC/nvidia/cuda images, and verify any third-party dependency publishes aarch64 artifacts. This is the most common porting friction versus an x86 DGX node.
No MIG, no vGPU, no NVSwitch, no Fabric Manager. It is a single GPU on a desktop SoC; partitioning, multi-tenant vGPU, and the NVSwitch/Fabric Manager machinery of DGX systems do not apply. Error correction is the on-die LPDDR behavior, not datacenter HBM ECC.
Compute capability sm_121. NVIDIA lists "GB10 (DGX Spark)" at compute capability 12.1. Build/JIT for sm_121 (or a PTX target that JITs forward); a binary built only for sm_120 (RTX 50 / desktop Blackwell) or sm_90 (Hopper) may need recompilation or PTX fallback.
DGX OS on a desktop. Same DGX OS family as rack DGX (CUDA/cuDNN/TensorRT preloaded, NGC, AI Enterprise) but on desktop hardware. Management is the on-box DGX Dashboard and NVIDIA Sync, not Base Command / Mission Control. See the GPU software stack for the shared CUDA/NCCL surface.
Desktop power and thermals. A 240 W wall supply and active air cooling: no rack busbar, no liquid loop, no facility planning. It plugs into a standard outlet; the datacenter readiness constraints do not apply.
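
The bandwidth figure and its practical consequence can be worked out from numbers already on this page. This is a rough sketch under two stated assumptions: that the quoted 4266 MHz is a double-data-rate memory clock (two transfers per cycle), and, for the decode bound, that a dense model reads every weight once per generated token at 4 bits per weight. Real throughput also depends on KV cache traffic, kernel efficiency, and batching, so treat the last line as an upper bound for single-stream decode, not a prediction.

```latex
\begin{aligned}
B_{\text{peak}} &= \frac{256\ \text{bit}}{8\ \text{bit/B}} \times \left(2 \times 4266 \times 10^{6}\ \text{s}^{-1}\right) \approx 273\ \text{GB/s} \\
\frac{B_{\text{H100}}}{B_{\text{peak}}} &\approx \frac{3350}{273} \approx 12.3, \qquad
\frac{B_{\text{B200}}}{B_{\text{peak}}} \approx \frac{8000}{273} \approx 29.3 \\
\text{tokens/s} &\lesssim \frac{B_{\text{peak}}}{M_{\text{weights}}} = \frac{273\ \text{GB/s}}{35\ \text{GB}} \approx 7.8 \quad (70\text{B parameters at 4 bits})
\end{aligned}
```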

Install & setup

DGX Spark ships imaged with DGX OS (a customized Ubuntu 24.04 server-plus-desktop image, NVIDIA Base OS kernel, NVIDIA Container Toolkit preinstalled). It launched on the R580 driver with CUDA 13.0 and the OpenRM (GSP-RM) open kernel module, which is mandatory on this Blackwell-class part. There is no proprietary-module option. The steps below are reference templates (unexecuted, not hardware-tested; pin versions). The defining checks are that you are on aarch64, the unified memory is visible, and your CUDA/build target matches sm_121.

# Confirm Arm + GPU + unified memory + compute capability
uname -m                         # aarch64
nvidia-smi                       # 1x Blackwell GPU; memory reported from the unified pool
nvidia-smi --query-gpu=compute_cap --format=csv   # 12.1 (sm_121)

# Persistence (prefer the daemon; legacy nvidia-smi -pm flag is slated for deprecation)
sudo systemctl enable --now nvidia-persistenced
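
The three defining checks can be wrapped into a small preflight script. This is a minimal sketch, not an NVIDIA tool: the function names are illustrative, the 100 GiB memory threshold is an arbitrary assumption chosen to sit below the 128 GB pool, and like the commands above it has not been run on hardware. The comparisons are kept as separate pure functions so they can be exercised on any machine.

```shell
#!/bin/sh
# Preflight sketch for a DGX Spark. Helper names and the memory threshold
# are illustrative assumptions, not part of any NVIDIA tooling.

# Pure comparisons, testable without the hardware.
is_aarch64()   { [ "$1" = "aarch64" ]; }
is_sm121()     { [ "$1" = "12.1" ]; }
mem_at_least() { [ "$1" -ge "$2" ]; }   # both values in KiB

# Runs the real probes; call this on the unit itself.
spark_preflight() {
    arch=$(uname -m)
    is_aarch64 "$arch" || { echo "FAIL: uname -m is '$arch', expected aarch64"; return 1; }

    # csv,noheader prints only the value, for example 12.1
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    is_sm121 "$cap" || { echo "FAIL: compute_cap is '$cap', expected 12.1"; return 1; }

    # Unified pool as seen by Linux; 104857600 KiB = 100 GiB (assumed threshold)
    mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    mem_at_least "$mem_kib" 104857600 || { echo "FAIL: MemTotal is ${mem_kib} KiB"; return 1; }

    echo "OK: aarch64, compute capability $cap, MemTotal ${mem_kib} KiB"
}
```

Source the file and run `spark_preflight` on the unit; a non-zero return marks the first check that failed.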

aarch64 driver and CUDA specifics

CUDA toolkit is arm64-SBSA, not x86. From the CUDA downloads page select linux -> arm64-sbsa -> ... -> Ubuntu -> 24.04. Cross-compiling on an x86 host uses aarch64-linux-gnu-g++ as both the C++ compiler and the CUDA host compiler. If installing the driver from a runfile, NVIDIA documents --silent -m=kernel-open (open modules) for this platform.
Build for sm_121. The porting guide compiles with CMake -DCMAKE_CUDA_ARCHITECTURES="121-real" (equivalently nvcc -gencode arch=compute_121,code=sm_121). Both forms emit sm_121 machine code only, with no embedded PTX for forward JIT.
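
If a binary should also carry PTX for JIT on later architectures, a second `-gencode` clause with `code=compute_121` embeds it alongside the sm_121 machine code. The helper below is a small sketch that only assembles the flag string; `gencode_flags` is an illustrative name, not an NVIDIA utility.

```shell
#!/bin/sh
# Prints nvcc -gencode flags that embed both machine code (sm_XY) and PTX
# (compute_XY) for one compute capability, given without the dot (121).
gencode_flags() {
    cc="$1"
    printf '%s %s\n' \
        "-gencode arch=compute_${cc},code=sm_${cc}" \
        "-gencode arch=compute_${cc},code=compute_${cc}"
}

gencode_flags 121
```

A typical use would be `nvcc $(gencode_flags 121) -o app app.cu`, where `app.cu` is a placeholder source file; `cuobjdump --list-elf app` and `cuobjdump --list-ptx app` then show which machine-code and PTX images ended up in the binary. In CMake, an architecture entry without the `-real` suffix (`"121"`) generates both forms.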

Container and wheel architecture caveats (the #1 porting friction)

Pull arm64 image tags only: x86-only images will not run. The NVIDIA Container Toolkit is preinstalled; NGC needs the arm64 NGC CLI.
CUDA-major mismatch is the sharp edge. DGX Spark ships CUDA 13 (libcudart.so.13), but the great majority of pip ML wheels are still built against CUDA 12 (libcudart.so.12), and sm_121 is only supported from CUDA 13.0. Expect to use CUDA-13 / arm64 builds (or rebuild from source) rather than stock pip install; verify each third-party dependency publishes aarch64 + CUDA-13 artifacts.
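
A quick architecture screen for wheels and images can catch the obvious mismatches before anything is installed. The sketch below relies on the standard wheel filename layout, in which the platform tag is the last dash-separated field, and on the architecture recorded in local Docker image metadata. The function names and the example filenames are hypothetical, and the check says nothing about which CUDA major version a wheel was built against.

```shell
#!/bin/sh
# Architecture screening sketch. Function names are illustrative.

# Platform tag of a wheel: the last dash-separated field before ".whl".
wheel_platform() {
    f="${1##*/}"      # drop any directory prefix
    f="${f%.whl}"     # drop the extension
    printf '%s\n' "${f##*-}"
}

# Succeeds for pure-Python wheels (any) and for aarch64 platform tags.
wheel_runs_on_aarch64() {
    case "$(wheel_platform "$1")" in
        any|*aarch64*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Architecture of an already-pulled image (arm64, amd64, ...); needs Docker.
image_arch() {
    docker image inspect --format '{{.Architecture}}' "$1"
}

wheel_platform "example_pkg-1.0-cp312-cp312-manylinux_2_28_aarch64.whl"
```

On the unit, `image_arch <image>` should print `arm64` for anything intended to run natively.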

# Use arm64 container images only (x86 images will not run); prefer a CUDA 13 base on this box.
# NGC cuda:13.0.x tags are multi-arch (arm64 included); verify the current patch tag on NGC.
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.3-base-ubuntu24.04 nvidia-smi   # arm64 manifest

# Cluster two units over ConnectX-7 ("Spark Stacking"): cable the 2x QSFP ports,
# then validate the RDMA link before running NCCL across the pair.
ibstat                           # ConnectX-7 ports up at expected speed

Management is the on-box DGX Dashboard and NVIDIA Sync (whose Cluster Assistant guides two-unit ConnectX-7 setup), not Base Command / Mission Control. Two-unit clustering ("Spark Stacking") is documented in the DGX Spark docs and is expanded under Networking below.

When to use it

Local prototyping of large models that exceed a single discrete GPU's VRAM but tolerate lower bandwidth: fitting a 70B–200B model in unified memory for development, then deploying the same CUDA/container stack to DGX Cloud or datacenter DGX without code changes (container images are per-architecture, so x86 targets need their own image builds).
A two-unit desktop pod for experiments needing up to ~405B-parameter models without renting rack capacity.
Not for training throughput, high-concurrency serving, or multi-tenant partitioning; its 273 GB/s bandwidth and lack of MIG/vGPU rule those out. For those, use HBM GPUs in DGX systems or the Blackwell platform.

Networking

The cluster path is the 2x QSFP ConnectX-7 ports (200 Gb/s), used primarily to link two DGX Spark units into a single 256 GB working set for models up to ~405B parameters. This is RDMA over ConnectX, the same NIC family as datacenter DGX (networking fabric), but as a small direct-attach pod rather than a rail-aligned fabric. General connectivity is a 10 GbE RJ-45 port and Wi-Fi 7. There is no NVSwitch and no multi-rack scaling; past a small ConnectX-7 pod the workflow is to push to DGX Cloud or a datacenter cluster (provisioning & scheduling).

Cable and validate the 2-unit link ("Spark Stacking")

NVIDIA documents Spark Stacking under the DGX Spark docs (and NVIDIA Sync's Cluster Assistant). Physically it is the two rear QSFP / CX-7 ports, joined unit-to-unit by a single 200G QSFP56 passive DAC cable (pull-tab toward the top; seat fully). The CX-7 onboard SmartNIC runs at 200 GbE and the inter-unit link is point-to-point RoCE (RDMA over Converged Ethernet), so NCCL collectives ride it directly. Each QSFP port surfaces under predictable interface names (use the primary enpX... naming); pin NCCL to the 200G interface so traffic does not fall back to the 1/10 GbE management path.

# Confirm the ConnectX-7 RDMA link is up at the expected rate, on both units
ibstat                           # CX-7 port State: Active, expected rate (200 Gb/s class)
rdma link show                   # RDMA links present and up
ip -br link show                 # find the 200G interface name (e.g. enp1s0f0np0)
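
To pin NCCL to the right port, it helps to pick out the interface that is both up and running at the 200G rate. A minimal sketch, assuming the kernel exposes link speed in Mb/s under `/sys/class/net/<iface>/speed` (so 200000 corresponds to 200 Gb/s); the function names are illustrative and the script is untested on hardware.

```shell
#!/bin/sh
# Interface selection sketch. Function names are illustrative.

# Reads `ip -br link show` output on stdin; prints interfaces whose state is UP.
up_ifaces() {
    awk '$2 == "UP" { print $1 }'
}

# Link speed in Mb/s as reported by the kernel.
iface_speed_mbps() {
    cat "/sys/class/net/$1/speed"
}

# Prints UP interfaces reporting at least the given speed (default 200000 Mb/s).
fast_ifaces() {
    min="${1:-200000}"
    ip -br link show | up_ifaces | while read -r ifname; do
        speed=$(iface_speed_mbps "$ifname" 2>/dev/null) || continue
        if [ "$speed" -ge "$min" ]; then
            echo "$ifname"
        fi
    done
}
```

Running `fast_ifaces` on each unit should list the CX-7 interface to use for `NCCL_SOCKET_IFNAME`; an empty result suggests the link has not come up at the expected rate.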

The full collective-benchmark recipe lives in the shared keystone, Fabric bring-up, validation & benchmarking; follow it for the RoCE link checks and a 2-node nccl-tests all_reduce_perf run across the pair, then read bus bandwidth against the 200 Gb/s link. In outline (full flags in the keystone): build nccl-tests, force traffic onto the CX-7 HCA/interface (NCCL_IB_HCA=..., NCCL_SOCKET_IFNAME=enp..., and NCCL_DEBUG=INFO to confirm the RDMA transport engaged), and launch all_reduce_perf -b 8 -e 8G -f 2 -g 1 with one rank per unit. Validate the link and the collective before trusting NCCL across the two boxes (networking fabric).

The headline pairing is two units for a ~405B-parameter working set. NVIDIA Sync's Cluster Assistant also documents larger direct-cabled and switched ConnectX-7 topologies; verify the supported node count for your unit against the Cluster Assistant guide. There is no NVSwitch at any scale here; for a rail-aligned fabric, migrate to a datacenter cluster.
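
Reading bus bandwidth against the link comes down to one division: a 200 Gb/s link carries at most 25 GB/s before protocol overhead, and for an all-reduce across two ranks the nccl-tests bus bandwidth equals the algorithm bandwidth (the 2(n-1)/n factor is 1 when n = 2). The sketch below computes that percentage and assembles the pinning variables named above; both function names are illustrative, and the 20 GB/s input is a made-up measurement for demonstration only.

```shell
#!/bin/sh
# Link-utilization sketch. Function names and sample values are illustrative.

# Percentage of line rate reached by a measured bus bandwidth.
#   $1 = busbw in GB/s (the nccl-tests "busbw" column)
#   $2 = link rate in Gb/s (200 for this link)
pct_of_link() {
    awk -v bw="$1" -v gbps="$2" 'BEGIN { printf "%.1f\n", bw / (gbps / 8) * 100 }'
}

# Environment assignments that pin NCCL to the CX-7 path.
#   $1 = interface name, $2 = HCA name (both site-specific)
nccl_pin_env() {
    printf 'NCCL_SOCKET_IFNAME=%s NCCL_IB_HCA=%s NCCL_DEBUG=INFO\n' "$1" "$2"
}

pct_of_link 20 200   # hypothetical 20 GB/s measurement on a 200 Gb/s link
```

One possible use is `export $(nccl_pin_env enp1s0f0np0 mlx5_0)` before launching, where `mlx5_0` is a placeholder for whatever HCA name `ibstat` reports; how those variables reach the remote rank depends on the launcher, so follow the keystone for the actual run.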

Gotchas & failure modes

Missing arm64 builds. The most common failure is reaching for an x86-only container or wheel; on aarch64 it will not run (a foreign-architecture binary typically fails with an exec format error). Always pull arm64 tags and confirm aarch64 builds exist for dependencies.
Bandwidth surprise. Unified LPDDR5x at 273 GB/s provides the capacity but roughly 12x (H100) to 29x (B200) less bandwidth than HBM; bandwidth-bound kernels and large-batch training underperform an HBM GPU by a wide margin. Size expectations to prototyping/single-stream inference.
Treating it as a rack GPU. No MIG, no vGPU, no NVSwitch, no Fabric Manager, no liquid cooling, no datacenter HBM ECC. Patterns from DGX systems that assume those do not transfer.
sm_121 build target. Binaries built only for sm_120/sm_90 may fail to load or fall back to PTX JIT, which adds compilation time at first load; build for sm_121 or ship forward-compatible PTX.
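Small-pod ceiling. The headline configuration is two units over ConnectX-7. Larger Spark-only topologies appear in the Cluster Assistant guide, but verify the supported node count before planning around them. Migrate to DGX Cloud/datacenter for anything that needs a rail-aligned fabric.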
NVLink-C2C is internal only. The coherent C2C link joins the CPU and GPU on-package; it is not an external NVLink and cannot be used to bond two Spark units (that path is ConnectX-7).

References

NVIDIA DGX Spark product page (GB10, 20-core Arm, 128 GB unified @ 273 GB/s, ~1 PFLOP FP4, ConnectX-7, DGX OS, 240 W, 200B single / 405B two-unit): https://www.nvidia.com/en-us/products/workstations/dgx-spark/
NVIDIA DGX Spark hardware guide (6,144 CUDA cores, 256-bit/4266 MHz/273 GB/s, 2x QSFP ConnectX-7, 10 GbE, Wi-Fi 7, 240 W PSU / 140 W TDP): https://docs.nvidia.com/dgx/dgx-spark/hardware.html
NVIDIA DGX Spark documentation home (Hardware, Software, Getting Started, Clustering / Spark Stacking, Networking): https://docs.nvidia.com/dgx/dgx-spark/index.html
NVIDIA CUDA GPUs compute capability list (GB10 / DGX Spark = compute capability 12.1 / sm_121): https://developer.nvidia.com/cuda-gpus
NVIDIA DGX platform overview (DGX Spark with the GB10 Superchip): https://www.nvidia.com/en-us/data-center/dgx-platform/
NVIDIA DGX OS 7 user guide (Ubuntu 24.04, CUDA, drivers): https://docs.nvidia.com/dgx/dgx-os-7-user-guide/index.html
NVIDIA DGX Spark porting guide — software requirements (R580 driver, CUDA 13.0, OpenRM kernel module, Ubuntu 24.04 base): https://docs.nvidia.com/dgx/dgx-spark-porting-guide/porting/software-requirements.html
NVIDIA DGX Spark porting guide — compilation (sm_121, CMAKE_CUDA_ARCHITECTURES="121-real", arm64-sbsa, aarch64-linux-gnu-g++): https://docs.nvidia.com/dgx/dgx-spark-porting-guide/porting/compilation.html
NVIDIA DGX Spark: Spark Stacking (two-unit clustering, QSFP/CX-7 cabling): https://docs.nvidia.com/dgx/dgx-spark/spark-clustering.html
NVIDIA Sync user guide — Cluster Assistant for ConnectX-7 multi-node clusters: https://docs.nvidia.com/sync/latest/cluster-assistant.html
NVIDIA Container Runtime for Docker on DGX Spark (arm64 NGC CLI, preinstalled toolkit): https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html
NVIDIA Driver Persistence (legacy persistence mode deprecation; nvidia-persistenced daemon): https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA "Transitions Fully Towards Open-Source GPU Kernel Modules" (Blackwell requires open modules; proprietary unsupported): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
NVIDIA/nccl-tests (all_reduce_perf / all_gather_perf): https://github.com/NVIDIA/nccl-tests
NVIDIA CUDA container supported tags (13.0.x base/runtime/devel for ubuntu24.04, multi-arch): https://gitlab.com/nvidia/container-images/cuda/-/blob/master/doc/supported-tags.md