NVIDIA DGX Spark (GB10 desktop AI)

Scope: NVIDIA's GB10 Grace Blackwell desktop AI computer (formerly "Project DIGITS"), a single-board Arm + Blackwell system with 128 GB of unified memory, two ConnectX-7 ports for clustering a pair of units, and DGX OS. The operational story is its difference from a rack GPU: an aarch64 software surface, desktop power and thermals, and a unified-memory model that trades bandwidth for capacity. Specs shift; confirm on the cited NVIDIA pages.

Figures verified against the NVIDIA DGX Spark product page and docs.nvidia.com DGX Spark hardware guide as of June 2026. Re-check the datasheet before relying on any single number.

What it is

DGX Spark is a compact desktop computer (150 x 150 x 50.5 mm, 1.2 kg) built on the GB10 Grace Blackwell Superchip (NVIDIA with MediaTek). It pairs a 20-core Arm CPU (10x Cortex-X925 + 10x Cortex-A725) with a Blackwell GPU (6,144 CUDA cores, 5th-gen Tensor Cores) on one package, joined internally by NVLink-C2C. It delivers up to 1 PFLOP of FP4 (with sparsity) / up to 1,000 TOPS, runs DGX OS, and lists at ~$3,999 for the Founders Edition. It is positioned as a desktop prototyping device whose work transfers to DGX cloud and datacenter DGX without code changes: same CUDA, same Blackwell architecture, same container images (for aarch64).

It is not a rack part: no SXM, no NVSwitch, no Fabric Manager, no MIG. It is a single coherent SoC on a desk.

flowchart TB
  subgraph GB10["GB10 Grace Blackwell Superchip (~140 W TDP)"]
    CPU["20-core Arm: 10x Cortex-X925 + 10x Cortex-A725"] -- "NVLink-C2C (coherent)" --> GPU["Blackwell GPU: 6144 CUDA, 5th-gen Tensor"]
    CPU --- MEM["128 GB LPDDR5x unified, 273 GB/s (ATS)"]
    GPU --- MEM
  end
  GB10 --> CX["2x QSFP ConnectX-7 (200 Gb/s)"]
  CX -- "cluster two units" --> SPARK2["2nd DGX Spark: 256 GB combined, up to 405B params"]
  GB10 --> IO["1x RJ-45 10 GbE + Wi-Fi 7"]

Lineup & specifications

All figures from the NVIDIA DGX Spark product page and the docs.nvidia.com DGX Spark hardware guide (cited in References).

| Attribute | Value | Note |
| --- | --- | --- |
| Superchip | GB10 Grace Blackwell (NVIDIA + MediaTek) | TSMC 3nm class; formerly Project DIGITS |
| CPU | 20-core Arm: 10x Cortex-X925 + 10x Cortex-A725 | aarch64 |
| GPU | Blackwell, 6,144 CUDA cores, 5th-gen Tensor / 4th-gen RT | compute capability 12.1 (sm_121) |
| AI compute | Up to 1 PFLOP FP4 (sparse); up to 1,000 TOPS | NVFP4 / FP4 |
| Memory | 128 GB LPDDR5x, coherent unified system memory | 256-bit, 4266 MHz, 273 GB/s; not HBM, not discrete VRAM |
| Internal interconnect | NVLink-C2C (CPU↔GPU coherent) | on-package |
| Cluster networking | 2x QSFP ConnectX-7, 200 Gb/s | connect two units |
| Other I/O | 1x RJ-45 10 GbE, Wi-Fi 7 | |
| OS | NVIDIA DGX OS | CUDA / cuDNN / TensorRT, NGC, AI Enterprise |
| Power | 240 W external PSU; GB10 SOC TDP 140 W | desktop, air-cooled |
| Price | ~$3,999 (Founders Edition) | verify current pricing |
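
The 273 GB/s memory figure can be cross-checked from the bus width and clock listed in the table. The sketch below assumes that the 4266 MHz value is the memory clock and that LPDDR5x moves two transfers per clock (about 8533 MT/s); that reading is an assumption, not something the table states, and the function name is illustrative.

```shell
# Peak DRAM bandwidth in GB/s from bus width (bits) and data rate (MT/s).
# Assumption: 4266 MHz clock x 2 transfers per clock = ~8533 MT/s.
peak_bw_gbs() {
  local bus_bits=$1 mtps=$2
  # bytes per transfer = bus_bits / 8; MB/s -> GB/s by dividing by 1000
  awk -v b="$bus_bits" -v r="$mtps" 'BEGIN { printf "%.0f\n", (b / 8) * r / 1000 }'
}

peak_bw_gbs 256 8533   # 32 bytes per transfer x 8533 MT/s -> prints 273
```

If the result did not land on the quoted 273 GB/s, the clock or bus-width reading would be the first thing to recheck against the hardware guide.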

NVIDIA states a single unit runs models up to ~200 billion parameters in its unified memory, and two units linked over ConnectX scale to models up to ~405 billion parameters. The 256 GB combined figure is simply 2x128 GB across the pair (derived; NVIDIA quotes the 405B-parameter ceiling rather than the combined byte count).
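
A quick way to sanity-check these ceilings is to compute the weight footprint alone. The sketch below assumes the 200B and 405B figures presume 4-bit (FP4) weights, which this section does not state explicitly, and it counts weights only: KV cache, activations, and the OS all need room on top. The function name is illustrative.

```shell
# Weight footprint in decimal GB: parameters (in billions) x bits per weight / 8.
# Counts weights only; KV cache, activations and the OS are extra.
weights_gb() {
  local params_billion=$1 bits=$2
  awk -v p="$params_billion" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 200 4    # 100.0 GB: fits under 128 GB with headroom
weights_gb 405 4    # 202.5 GB: fits under 2 x 128 GB = 256 GB
weights_gb 70 16    # 140.0 GB: a 70B model at 16-bit does NOT fit one unit
```

The last line is the practical caveat: the parameter ceilings depend on quantization, so a model well under 200B parameters can still overflow a single unit at higher precision.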

Operational differences

DGX Spark behaves like a Blackwell datacenter GPU in software (CUDA, NCCL, TensorRT) but differs sharply in the details that matter for operations and integration.

  • Unified memory, not VRAM (the central design point). The 128 GB LPDDR5x is one coherent pool shared by CPU and GPU via Address Translation Services (ATS) over NVLink-C2C, with no separate device VRAM to copy into. Implication for big models: a 70B–200B model that would not fit a discrete GPU's HBM fits here (quantized to fit, e.g. 4-bit weights at the upper end), because the GPU addresses the whole 128 GB directly with no host-to-device staging. Implication for bandwidth: 273 GB/s is roughly 12x below H100 HBM (~3.35 TB/s) and roughly 29x below B200 (~8 TB/s), so memory-bandwidth-bound work (large-batch training, high-throughput serving) runs far slower than on an HBM GPU. The trade is capacity-at-desktop over throughput: a prototyping and single-stream inference profile, not a training-throughput one.
  • aarch64 software surface. GB10's CPU is Arm. Containers, wheels, and drivers must be arm64/aarch64; x86-only images and binaries will not run. Use the Arm builds of CUDA, the arm64 tags of NGC/nvidia/cuda images, and verify any third-party dependency publishes aarch64 artifacts. This is the most common porting friction versus an x86 DGX node.
  • No MIG, no vGPU, no NVSwitch, no Fabric Manager. It is a single GPU on a desktop SoC; partitioning, multi-tenant vGPU, and the NVSwitch/Fabric Manager machinery of DGX systems do not apply. ECC is the on-die LPDDR behavior, not datacenter HBM ECC.
  • Compute capability sm_121. NVIDIA lists "GB10 (DGX Spark)" at compute capability 12.1. Build/JIT for sm_121 (or a PTX target that JITs forward); a binary built only for sm_120 (RTX 50 / desktop Blackwell) or sm_90 (Hopper) may need recompilation or PTX fallback.
  • DGX OS on a desktop. Same DGX OS family as rack DGX (CUDA/cuDNN/TensorRT preloaded, NGC, AI Enterprise) but on desktop hardware. Management is the on-box DGX Dashboard and NVIDIA Sync, not Base Command / Mission Control. See the GPU software stack for the shared CUDA/NCCL surface.
  • Desktop power and thermals. A 240 W wall supply and active air cooling: no rack busbar, no liquid loop, no facility planning. It plugs into a standard outlet; the datacenter readiness constraints do not apply.
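
To make the bandwidth trade concrete, a common back-of-envelope bound for single-stream decoding divides memory bandwidth by the bytes of weights read per token. The sketch below assumes a dense model at batch size 1 in which every weight is read once per generated token, and it ignores KV-cache traffic and compute; real throughput is lower. The function name and the 35 GB example (70B parameters at 4-bit) are illustrative, not measurements.

```shell
# Upper bound on single-stream decode speed (tokens/s) for a dense model:
# memory bandwidth (GB/s) / weight bytes read per token (GB).
# Ignores KV-cache reads, compute time and software overhead.
decode_tok_s_bound() {
  local bw_gbs=$1 weights_gb=$2
  awk -v bw="$bw_gbs" -v w="$weights_gb" 'BEGIN { printf "%.1f\n", bw / w }'
}

decode_tok_s_bound 273 35     # DGX Spark, 70B at 4-bit -> 7.8
decode_tok_s_bound 3350 35    # H100-class HBM, same weights -> 95.7
```

The ratio between the two results is just the bandwidth ratio, which is why this device suits prototyping and single-stream inference rather than throughput serving.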

Install & setup

DGX Spark ships imaged with DGX OS (a customized Ubuntu 24.04 server-plus-desktop image, NVIDIA Base OS kernel, NVIDIA Container Toolkit preinstalled). It launched on the R580 driver with CUDA 13.0 and the OpenRM (GSP-RM) open kernel module, which is mandatory on this Blackwell-class part. There is no proprietary-module option. The steps below are reference templates (unexecuted, not hardware-tested; pin versions). The defining checks are that you are on aarch64, the unified memory is visible, and your CUDA/build target matches sm_121.

# Confirm Arm + GPU + unified memory + compute capability
uname -m                         # aarch64
nvidia-smi                       # 1x Blackwell GPU; memory reported from the unified pool
nvidia-smi --query-gpu=compute_cap --format=csv   # 12.1 (sm_121)

# Persistence (prefer the daemon; legacy nvidia-smi -pm flag is slated for deprecation)
sudo systemctl enable --now nvidia-persistenced
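
The two defining checks above (architecture and compute capability) can be wrapped in a small preflight function so a provisioning script fails early. This is a sketch, not an NVIDIA-provided tool: check_platform is a hypothetical helper name, and the live call is guarded so the snippet also loads on machines without nvidia-smi.

```shell
# Hypothetical preflight helper: takes the machine arch and the GPU compute
# capability as strings and reports whether they match DGX Spark expectations.
check_platform() {
  local arch=$1 cap=$2
  [ "$arch" = "aarch64" ] || { echo "FAIL: arch is $arch, expected aarch64"; return 1; }
  [ "$cap" = "12.1" ] || { echo "FAIL: compute capability is $cap, expected 12.1"; return 1; }
  echo "OK: aarch64 + sm_121"
}

# On the device, feed it live values (skipped when nvidia-smi is not installed).
if command -v nvidia-smi >/dev/null 2>&1; then
  check_platform "$(uname -m)" \
    "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n1)" || true
fi
```

Passing the values in as arguments keeps the check testable off-device, for example in CI on an x86 runner.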

aarch64 driver and CUDA specifics

  • CUDA toolkit is arm64-SBSA, not x86. From the CUDA downloads page select linux -> arm64-sbsa -> ... -> Ubuntu -> 24.04; cross-compilation on an x86 host uses aarch64-linux-gnu-g++ as both host and CUDA host compiler. If installing the driver from a runfile, NVIDIA documents --silent -m=kernel-open (open modules) for this platform.
  • Build for sm_121. NVIDIA lists GB10 (DGX Spark) at compute capability 12.1; the porting guide compiles with CMake -DCMAKE_CUDA_ARCHITECTURES="121-real" (equivalently nvcc -gencode arch=compute_121,code=sm_121). A binary built only for sm_120 (RTX 50 / desktop Blackwell) or sm_90 (Hopper) may need recompilation or PTX-JIT fallback.
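
As a sketch of the build-target advice, the helper below emits nvcc flags that embed both native sm_121 machine code and compute_121 PTX. Adding the PTX entry goes beyond the porting guide's "121-real" example (which emits machine code only); it is included here as an assumption about what "a PTX target that JITs forward" looks like in practice. gencode_flags and kernel.cu are illustrative names.

```shell
# Emit nvcc -gencode flags for one compute capability (e.g. 121):
# the first flag embeds native machine code (SASS), the second embeds PTX
# so that a newer GPU can JIT-compile the kernel.
gencode_flags() {
  local cc=$1
  printf '%s %s\n' \
    "-gencode arch=compute_${cc},code=sm_${cc}" \
    "-gencode arch=compute_${cc},code=compute_${cc}"
}

gencode_flags 121

# Example use on the device (kernel.cu is a placeholder file name):
#   nvcc $(gencode_flags 121) -o kernel kernel.cu
#   cuobjdump --list-elf kernel    # expect an sm_121 entry
#   cuobjdump --list-ptx kernel    # expect a compute_121 entry
```

In CMake, an unsuffixed architecture value such as CMAKE_CUDA_ARCHITECTURES="121" requests both the real and virtual targets, which is the rough equivalent of the two flags above.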

Container and wheel architecture caveats (the #1 porting friction)

  • Pull arm64 image tags only: x86-only images will not run. The NVIDIA Container Toolkit is preinstalled; NGC needs the arm64 NGC CLI.
  • CUDA-major mismatch is the sharp edge. DGX Spark ships CUDA 13 (libcudart.so.13), but the great majority of pip ML wheels are still built against CUDA 12 (libcudart.so.12), and sm_121 is only supported from CUDA 13.0. Expect to use CUDA-13 / arm64 builds (or rebuild from source) rather than stock pip install; verify each third-party dependency publishes aarch64 + CUDA-13 artifacts.
# Use arm64 container images only (x86 images will not run); prefer a CUDA 13 base on this box.
# NGC cuda:13.0.x tags are multi-arch (arm64 included); verify the current patch tag on NGC.
docker run --rm --gpus all nvcr.io/nvidia/cuda:13.0.3-base-ubuntu24.04 nvidia-smi   # arm64 manifest

# Cluster two units over ConnectX-7 ("Spark Stacking"): cable the 2x QSFP ports,
# then validate the RDMA link before running NCCL across the pair.
ibstat                           # ConnectX-7 ports up at expected speed
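
Before pulling an image onto the box, it can help to confirm that the tag actually publishes an arm64 variant. The sketch below filters the JSON that docker manifest inspect prints; manifest_has_arm64 is a hypothetical helper, and whether a given registry needs a login before the manifest can be read is not verified here.

```shell
# Reads `docker manifest inspect` JSON on stdin and succeeds (exit 0)
# if any platform entry declares the arm64 architecture.
manifest_has_arm64() {
  grep -Eq '"architecture"[[:space:]]*:[[:space:]]*"arm64"'
}

# Example use on a machine with Docker installed:
#   docker manifest inspect nvcr.io/nvidia/cuda:13.0.3-base-ubuntu24.04 \
#     | manifest_has_arm64 && echo "arm64 variant published"
```

A single-architecture image returns a plain manifest without platform entries, so a failed check means "not confirmed" rather than "definitely x86-only"; inspect the image config in that case.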

Management is the on-box DGX Dashboard and NVIDIA Sync (whose Cluster Assistant guides two-unit ConnectX-7 setup), not Base Command / Mission Control. Two-unit clustering ("Spark Stacking") is documented in the DGX Spark docs and is expanded under Networking below.

When to use it

  • Local prototyping of large models that exceed a single discrete GPU's VRAM but tolerate lower bandwidth: fitting a 70B–200B model in unified memory for development, then deploying the identical CUDA/container stack to DGX cloud or datacenter DGX unchanged.
  • A two-unit desktop pod for experiments needing up to ~405B-parameter models without renting rack capacity.
  • Not for training throughput, high-concurrency serving, or multi-tenant partitioning; its 273 GB/s bandwidth and lack of MIG/vGPU rule those out. For those, use HBM GPUs in DGX systems or the Blackwell platform.

Networking

The cluster path is 2x QSFP ConnectX-7 ports (200 Gb/s), used primarily to link two DGX Spark units into a 256 GB combined working set (2 x 128 GB across two nodes, not one coherent address space) for models up to ~405B parameters; this is GPUDirect RDMA over ConnectX, the same NIC family as datacenter DGX (networking fabric), but it is a small direct-attach pod, not a rail-aligned fabric. General connectivity is a 10 GbE RJ-45 and Wi-Fi 7. There is no NVSwitch and no multi-rack scaling; past a small ConnectX-7 pod the workflow is to push to DGX cloud or a datacenter cluster (provisioning & scheduling).

NVIDIA documents Spark Stacking under the DGX Spark docs (and NVIDIA Sync's Cluster Assistant). Physically it is the two rear QSFP / CX-7 ports, joined unit-to-unit by a single 200G QSFP56 passive DAC cable (pull-tab toward the top; seat fully). The CX-7 onboard SmartNIC runs at 200 GbE and the inter-unit link is point-to-point RoCE (RDMA over Converged Ethernet), so NCCL collectives ride it directly. Each QSFP port surfaces under predictable interface names (use the primary enpX... naming); pin NCCL to the 200G interface so traffic does not fall back to the 10 GbE or Wi-Fi management path.

# Confirm the ConnectX-7 RDMA link is up at the expected rate, on both units
ibstat                           # CX-7 port State: Active, expected rate (200 Gb/s class)
rdma link show                   # RDMA links present and up
ip -br link show                 # find the 200G interface name (e.g. enp1s0f0np0)
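
To pick out the 200G interface without reading the listing by eye, the negotiated speed that Linux exposes under /sys/class/net/<iface>/speed (in Mb/s) can be filtered. This is a sketch: list_fast_ifaces is a hypothetical helper, the 200000 Mb/s threshold is an assumption matching the 200 Gb/s link, and interfaces that are down or report -1 are skipped.

```shell
# Print the names of network interfaces whose negotiated speed is at least
# min_mbps. Arguments: sysfs net directory (default /sys/class/net) and
# threshold in Mb/s (default 200000, i.e. 200 Gb/s).
list_fast_ifaces() {
  local root=${1:-/sys/class/net} min_mbps=${2:-200000} dev speed
  for dev in "$root"/*; do
    [ -r "$dev/speed" ] || continue
    speed=$(cat "$dev/speed" 2>/dev/null) || continue   # read fails on some down links
    case $speed in ''|*[!0-9]*) continue ;; esac         # skip empty or -1
    [ "$speed" -ge "$min_mbps" ] && basename "$dev"
  done
  return 0
}

# Example use on the device:
#   export NCCL_SOCKET_IFNAME=$(list_fast_ifaces | head -n1)
```

Taking the directory as a parameter is only for testability; on the device the defaults are enough.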

Do not re-document the full collective-benchmark recipe here; follow the shared keystone Fabric bring-up, validation & benchmarking for the RoCE link checks and a 2-node nccl-tests all_reduce_perf run across the pair, then read bus bandwidth against the 200 Gb/s link. Sketch only (full flags in the keystone): build nccl-tests, force traffic onto the CX-7 HCA/interface (NCCL_IB_HCA=..., NCCL_SOCKET_IFNAME=enp..., NCCL_DEBUG=INFO to confirm the RDMA transport engaged), and launch all_reduce_perf -b 8 -e 8G -f 2 -g 1 with one rank per unit. Validate the link and the collective before trusting NCCL across the two boxes (networking fabric). The headline pairing is two units → ~405B-parameter working set; NVIDIA Sync's Cluster Assistant also documents larger direct-cabled and switched ConnectX-7 topologies (verify the supported node count for your unit against the Cluster Assistant guide). There is no NVSwitch at any scale here; for a rail-aligned fabric, migrate to a datacenter cluster.
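
When reading the all_reduce_perf output against the link, two conversions help: 200 Gb/s is 25 GB/s of raw line rate, and nccl-tests derives all-reduce bus bandwidth as algbw x 2(n-1)/n, which equals algbw for two ranks. The sketch below encodes both; the function names are illustrative and the 20 GB/s input is a made-up value for demonstration, not a measured DGX Spark result.

```shell
# All-reduce bus bandwidth as defined by nccl-tests: algbw * 2*(n-1)/n.
allreduce_busbw() {
  local algbw_gbs=$1 ranks=$2
  awk -v a="$algbw_gbs" -v n="$ranks" 'BEGIN { printf "%.1f\n", a * 2 * (n - 1) / n }'
}

# Bus bandwidth (GB/s) as a percentage of link line rate (Gb/s).
link_utilization_pct() {
  local busbw_gbs=$1 link_gbps=$2
  awk -v b="$busbw_gbs" -v l="$link_gbps" 'BEGIN { printf "%.0f\n", 100 * b / (l / 8) }'
}

allreduce_busbw 20 2          # two ranks: busbw equals algbw -> 20.0
link_utilization_pct 20 200   # 20 GB/s against 25 GB/s line rate -> 80
```

A result far below line rate at large message sizes usually points at traffic riding the wrong interface or falling back from RDMA, which is what the NCCL_DEBUG=INFO check in the sketch above is meant to catch.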

Gotchas & failure modes

  • x86-only images and wheels. The most common failure is reaching for an x86-only container or wheel; on aarch64 it will not run (an amd64 image typically fails at start with an "exec format error"). Always pull arm64 tags and confirm aarch64 builds exist for dependencies.
  • Bandwidth surprise. Unified LPDDR5x at 273 GB/s is the right capacity but roughly 12x to 29x less bandwidth than datacenter HBM (H100 ~3.35 TB/s, B200 ~8 TB/s); bandwidth-bound kernels and large-batch training underperform an HBM GPU by a wide margin. Size expectations to prototyping/single-stream inference.
  • Treating it as a rack GPU. No MIG, no vGPU, no NVSwitch, no Fabric Manager, no liquid cooling, no datacenter HBM ECC. Patterns from DGX systems that assume those do not transfer.
  • sm_121 build target. Binaries built only for sm_120/sm_90 may fail to load or fall back to slow PTX JIT; build for sm_121 or ship forward-compatible PTX.
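  • Small-pod ceiling. The headline configuration is two units over ConnectX-7. NVIDIA Sync's Cluster Assistant also documents larger direct-cabled and switched ConnectX-7 topologies, but verify the supported node count before planning beyond a pair, and do not plan a Spark-only fabric at rack scale. Migrate to DGX cloud/datacenter for anything larger.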
  • NVLink-C2C is internal only. The coherent C2C link joins the CPU and GPU on-package; it is not an external NVLink and cannot be used to bond two Spark units (that path is ConnectX-7).
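
The first gotcha can be partly automated for Python dependencies by inspecting wheel filenames, whose final tag names the target platform. The sketch below is a filename-only heuristic with a hypothetical helper name and made-up package names. It does not detect the CUDA major version a wheel was built against, since wheel tags do not encode that.

```shell
# Succeeds if a wheel filename targets aarch64 or is platform-independent
# ("any"). Checks the filename only; it does not open the wheel.
wheel_is_aarch64() {
  case $1 in
    *_aarch64.whl|*-any.whl) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical filenames, for illustration only:
for w in examplepkg-1.0-cp312-cp312-manylinux_2_28_aarch64.whl \
         examplepkg-1.0-cp312-cp312-manylinux_2_28_x86_64.whl \
         purepkg-2.0-py3-none-any.whl; do
  if wheel_is_aarch64 "$w"; then echo "ok   $w"; else echo "skip $w"; fi
done
```

A wheel that passes this check can still fail at import time if it links against libcudart.so.12, so treat it as a first filter ahead of the CUDA-13 verification described earlier.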

References

  • NVIDIA DGX Spark product page (GB10, 20-core Arm, 128 GB unified @ 273 GB/s, ~1 PFLOP FP4, ConnectX-7, DGX OS, 240 W, 200B single / 405B two-unit): https://www.nvidia.com/en-us/products/workstations/dgx-spark/
  • NVIDIA DGX Spark hardware guide (6,144 CUDA cores, 256-bit/4266 MHz/273 GB/s, 2x QSFP ConnectX-7, 10 GbE, Wi-Fi 7, 240 W PSU / 140 W TDP): https://docs.nvidia.com/dgx/dgx-spark/hardware.html
  • NVIDIA DGX Spark documentation home (Hardware, Software, Getting Started, Clustering / Spark Stacking, Networking): https://docs.nvidia.com/dgx/dgx-spark/index.html
  • NVIDIA CUDA GPUs compute capability list (GB10 / DGX Spark = compute capability 12.1 / sm_121): https://developer.nvidia.com/cuda-gpus
  • NVIDIA DGX platform overview (DGX Spark with the GB10 Superchip): https://www.nvidia.com/en-us/data-center/dgx-platform/
  • NVIDIA DGX OS 7 user guide (Ubuntu 24.04, CUDA, drivers): https://docs.nvidia.com/dgx/dgx-os-7-user-guide/index.html
  • NVIDIA DGX Spark porting guide — software requirements (R580 driver, CUDA 13.0, OpenRM kernel module, Ubuntu 24.04 base): https://docs.nvidia.com/dgx/dgx-spark-porting-guide/porting/software-requirements.html
  • NVIDIA DGX Spark porting guide — compilation (sm_121, CMAKE_CUDA_ARCHITECTURES="121-real", arm64-sbsa, aarch64-linux-gnu-g++): https://docs.nvidia.com/dgx/dgx-spark-porting-guide/porting/compilation.html
  • NVIDIA DGX Spark docs, Spark Stacking (two-unit clustering, QSFP/CX-7 cabling): https://docs.nvidia.com/dgx/dgx-spark/spark-clustering.html
  • NVIDIA Sync user guide — Cluster Assistant for ConnectX-7 multi-node clusters: https://docs.nvidia.com/sync/latest/cluster-assistant.html
  • NVIDIA Container Runtime for Docker on DGX Spark (arm64 NGC CLI, preinstalled toolkit): https://docs.nvidia.com/dgx/dgx-spark/nvidia-container-runtime-for-docker.html
  • NVIDIA Driver Persistence (legacy persistence mode deprecation; nvidia-persistenced daemon): https://docs.nvidia.com/deploy/driver-persistence/index.html
  • NVIDIA "Transitions Fully Towards Open-Source GPU Kernel Modules" (Blackwell requires open modules; proprietary unsupported): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
  • NVIDIA/nccl-tests (all_reduce_perf / all_gather_perf): https://github.com/NVIDIA/nccl-tests
  • NVIDIA CUDA container supported tags (13.0.x base/runtime/devel for ubuntu24.04, multi-arch): https://gitlab.com/nvidia/container-images/cuda/-/blob/master/doc/supported-tags.md

Related: DGX systems · Blackwell platform · GPU generations · Networking fabric · Provisioning & scheduling · GPU software stack · Glossary