
GB300)

Scope: the NVIDIA Blackwell datacenter line that much of this knowledge base is built on. It covers the GPUs (the early B100, the B200, and the B300), the nodes (DGX/HGX B200 and B300), and the rack-scale systems (GB200 NVL72 and GB300 NVL72). B200 is the broader-compatibility part; B300 (Blackwell Ultra) is the memory- and compute-maximised part. Both share 5th-gen Tensor Cores, 5th-gen NVLink, and Confidential Computing with TEE-I/O. Keep this page current as datasheets shift.

Figures verified against vendor and reference sources as of June 2026. Re-check against the NVIDIA datasheet before relying on any single number; vendor-quoted rack power and FLOPS vary by configuration and by dense-vs-sparse and peak-vs-sustained framing.

Overview

Blackwell Ultra shifts the centre of gravity from training-only FLOPS to reasoning and test-time inference: memory-dense, low-precision (NVFP4) tensor work with fast attention. The "unit of compute" is the B300 GPU; the "unit of deployment" is the GB300 NVL72 rack. Announced at GTC 2025, shipping from H2 2025.

Architecture: GB300 NVL72 rack

flowchart TB
  subgraph Rack["GB300 NVL72 — 120-140 kW, liquid-cooled"]
    T["18x compute tray: 4x B300 + 2x Grace each"] --> NVS["NVLink 5 / NVSwitch: single 72-GPU domain"]
  end
  NVS --> NIC["72x ConnectX-8 800G"]
  NIC --> FAB["Quantum-X800 IB / Spectrum-X Ethernet"]

Blackwell datacenter lineup

The line spans three GPUs and two rack systems built on one architecture. All carry 5th-gen Tensor Cores with the 2nd-gen Transformer Engine (FP4 / NVFP4), 5th-gen NVLink at 1.8 TB/s per GPU, HBM3e, and Confidential Computing. Blackwell is the first GPU to extend the trusted execution environment across NVLink via TEE-I/O.

B100: the early Blackwell datacenter part at 700 W, the same board power envelope as H100 SXM, so it drops into Hopper-class power and cooling without facility changes. Use it where the goal is reusing existing H100 infrastructure rather than maximising per-GPU memory or FLOPS.
B200: the mainstream training/inference GPU. 192 GB HBM3e at ~8 TB/s (~180 GB usable per GPU; ECC, firmware, and manufacturing reserve the remainder, per Fregly, AI Systems Performance Engineering, O'Reilly, ch. 2), ~9 PFLOPS dense FP4, 1000 W, dual-die. The broader-compatibility part: it fits much existing datacenter power and cooling, and is the GPU inside DGX/HGX B200 and GB200 NVL72.
B300 (Blackwell Ultra): the memory- and compute-maximised part. 288 GB HBM3e (12-Hi stacks, 50% more than B200), ~15 PFLOPS dense FP4, 1400 W, liquid-cooling mandatory. Detailed below.

B200 (1000 W) and B300 (1400 W) differ chiefly in memory density, FP4 throughput, and thermals; both deploy as 8-GPU DGX/HGX nodes and as 72-GPU NVL72 racks. The numbers below come from the NVIDIA Blackwell architecture page, the DGX B200 page, and the GB200 / GB300 NVL72 product pages; re-check the datasheet before relying on any single figure, since vendor FLOPS vary by dense-vs-sparse framing.

B200 vs B300 (per GPU)

| Spec | B200 | B300 (Blackwell Ultra) |
| --- | --- | --- |
| Architecture | Blackwell, dual-die | Blackwell Ultra, dual-die |
| HBM3e memory | 192 GB | 288 GB (12-Hi) |
| Memory bandwidth | ~8 TB/s | ~8 TB/s (verify on datasheet; sources disagree) |
| Dense FP4 | ~9 PFLOPS | ~15 PFLOPS |
| Board power | 1000 W | 1400 W |
| Tensor / NVLink | 5th-gen Tensor, NVLink5 1.8 TB/s | 5th-gen Tensor, NVLink5 1.8 TB/s |
| Confidential Computing | yes (TEE-I/O) | yes (TEE-I/O) |
| Node | DGX / HGX B200 (8 GPU) | DGX / HGX B300 (8 GPU) |
| Positioning | broader-compatibility, fits existing DC | memory/compute-max, liquid required |

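DGX B200 ships 8 B200 GPUs for 1,440 GB total HBM3e at 64 TB/s, which works out to 180 GB and 8 TB/s per GPU: the usable capacity noted above, not the 192 GB physical figure. It quotes 144 / 72 PFLOPS sparse/dense FP4, or ~9 PFLOPS dense per GPU. This is the primary-source anchor for the B200 per-GPU figures above (DGX B200 product page).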
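The per-GPU figures can be recovered from the node totals by plain division. The sketch below assumes a POSIX shell; `per_gpu` is a hypothetical helper name, and the inputs are the DGX B200 totals quoted above rather than anything measured.

```shell
# per_gpu TOTAL GPUS: integer share of a node-level total per GPU.
# Hypothetical helper for illustration only.
per_gpu() {
  echo $(( $1 / $2 ))
}

# DGX B200 node totals as quoted on this page: 1,440 GB HBM3e, 64 TB/s, 72 PFLOPS dense FP4.
echo "HBM3e per GPU (GB):          $(per_gpu 1440 8)"
echo "Bandwidth per GPU (TB/s):    $(per_gpu 64 8)"
echo "Dense FP4 per GPU (PFLOPS):  $(per_gpu 72 8)"
```

The capacity result is 180 GB, which lines up with the ~180 GB usable figure rather than the 192 GB physical HBM3e per GPU.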

GB200 NVL72 vs GB300 NVL72 (rack)

| Spec | GB200 NVL72 | GB300 NVL72 |
| --- | --- | --- |
| GPUs | 72x B200 | 72x B300 (Blackwell Ultra) |
| Grace CPUs | 36 (Arm Neoverse V2) | 36 (Arm Neoverse V2) |
| Superchip | 1 Grace + 2 B200 (x36) | 1 Grace + 2 B300 (x36) |
| GPU HBM3e | 13.4 TB | ~20 TB (20.7 TB) |
| NVLink (5th-gen) | 1.8 TB/s per GPU, 130 TB/s aggregate | 1.8 TB/s per GPU, 130 TB/s aggregate |
| FP4 (peak, 2:1 sparse) | 1.44 EF | 1.44 EF (2:1 sparse); ~1.1 EF dense |
| FP8 (peak, 2:1 sparse) | ~720 PFLOPS | ~720 PFLOPS |
| Grace memory | 17 TB LPDDR5X | ~18 TB LPDDR5X (verify) |
| Off-rack NIC | ConnectX-7/8 | ConnectX-8 (per-GPU direct RDMA) |
| Cooling | liquid | liquid, 120-140 kW/rack |

The GB200 Grace Blackwell Superchip pairs 1 Grace CPU with 2 B200 GPUs (not 1+1) over NVLink-C2C, so GB200 NVL72 is 36 such superchips: 72 B200 + 36 Grace. The rack delivers ~1.44 EF FP4 and ~720 PFLOPS FP8, both peak figures on a 2:1 structured-sparsity Tensor Core basis, plus 13.4 TB of GPU HBM3e (GB200 NVL72 product page; Fregly, AI Systems Performance Engineering, O'Reilly, ch. 2: "about 1.44 exaFLOPS for FP4 with 2 to 1 structured sparsity and about 720 petaFLOPS for FP8 with 2 to 1 structured sparsity"). GB200 and GB300 share the NVL72 rack architecture, so they stay largely compatible at rack level and the migration cost for existing Blackwell deployers is low.

Core knowledge

B300 GPU (Blackwell Ultra)

Dual-die (dual-reticle) package, ~208 billion transistors, 160 SMs, ~20,480 CUDA cores, 640 fifth-gen Tensor Cores.
Memory: 288 GB HBM3e via 12-high stacks (50% more than B200's 192 GB), 8 TB/s bandwidth.
Compute: ~15 PFLOPS dense FP4 (about 67% more than B200's ~9 PFLOPS). Precisions: FP8, FP6, NVFP4. FP64 is deprioritised, so traditional double-precision HPC does not benefit equally.
TDP: 1,400 W (B200 was 1,000 W; H100 was 700 W). Liquid cooling mandatory.
Deploys mostly as an 8-GPU node (DGX/HGX B300), which is the basic unit schedulers design around.
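
The generational uplifts quoted in this list can be reproduced from the per-GPU figures. This is a minimal sketch, assuming a POSIX shell with awk available; `uplift_pct` is a hypothetical helper name, and the inputs are the approximate datasheet values from this page, not measurements.

```shell
# uplift_pct OLD NEW: percentage increase from OLD to NEW, rounded to a whole percent.
# Hypothetical helper for illustration; inputs are the approximate figures quoted on this page.
uplift_pct() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new - old) / old * 100 }'
}

echo "HBM3e capacity, 192 GB -> 288 GB: +$(uplift_pct 192 288)%"
echo "Dense FP4, ~9 -> ~15 PFLOPS:      +$(uplift_pct 9 15)%"
echo "Board power, 1000 W -> 1400 W:    +$(uplift_pct 1000 1400)%"
```

On these inputs memory grows 50%, dense FP4 about 67%, and board power 40%.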

GB300 NVL72 rack

72 B300 GPUs plus 36 Grace CPUs (Arm Neoverse V2, 72 cores each), arranged as 18 compute trays, each tray 4 GPUs + 2 Grace. Grace-Blackwell coupling via NVLink-C2C, coherent unified memory.
A Grace-Blackwell Ultra superchip pairs 1 Grace CPU with 2 Blackwell Ultra GPUs.
NVLink: fifth-gen, 1.8 TB/s per GPU (900 GB/s unidirectional), 130 TB/s aggregate intra-rack, NVLink 5 switching forming a single 72-GPU NVLink domain.
Memory: 20.7 TB HBM3e across the rack; ~37 TB total fast memory including Grace.
Compute: ~1.1 exaFLOPS dense FP4 (vendor peak/sparse figures cited up to ~1.44 exaFLOPS).
Power: roughly 120 to 140 kW per rack, fully liquid-cooled. Weight ~1.36 t; the cabinet is a 48U rack (18 compute trays, 9 NVLink switch trays, power shelves) occupying a standard 42U floor footprint.
Networking off-rack: ConnectX-8 SuperNICs, 800 Gb/s per GPU, to Quantum-X800 InfiniBand or Spectrum-X Ethernet (see networking fabric).
Managed by NVIDIA Mission Control.
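
The rack aggregates in this list follow from multiplying the per-GPU values by 72. The sketch below assumes a POSIX shell with awk; `scale_total` is a hypothetical helper, the per-GPU inputs are the approximate values quoted on this page, and decimal units (1 TB = 1000 GB) are assumed.

```shell
# scale_total COUNT PER_UNIT DIVISOR FORMAT
# Multiplies a per-GPU value by the GPU count, divides to change units, prints with FORMAT.
# Hypothetical helper; inputs are approximate page values, not measurements.
scale_total() {
  awk -v n="$1" -v v="$2" -v d="$3" -v f="$4" 'BEGIN { fmt = f "\n"; printf fmt, n * v / d }'
}

GPUS=72
echo "HBM3e:     $(scale_total "$GPUS" 288 1000 %.1f) TB   (72 x 288 GB)"
echo "NVLink:    $(scale_total "$GPUS" 1.8 1 %.1f) TB/s (72 x 1.8 TB/s)"
echo "Dense FP4: $(scale_total "$GPUS" 15 1000 %.2f) EF   (72 x ~15 PFLOPS)"
```

The results (20.7 TB, 129.6 TB/s, 1.08 EF) match the rounded rack figures of 20.7 TB, 130 TB/s, and ~1.1 EF dense.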

Scaling up

DGX B300 node: 8 GPUs, 8x ConnectX-8 SuperNICs (800 Gb/s), ~2.3 TB HBM3e. CPU is Intel Xeon (not Grace). Grace appears in the GB300 NVL72 rack, not in the DGX B300 node.
Scalable Unit (SU): 72 nodes (576 GPUs), rail-aligned (see networking fabric).
Full DGX SuperPOD (B300-XDR reference): 8 SUs = 576 nodes = 4,608 GPUs. Aggregate HBM3e and FLOPS scale linearly (~288 GB and ~15 PFLOPS dense FP4 per GPU); confirm exact totals against the current NVIDIA reference architecture.
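
The SuperPOD counts above are straight multiplication, and the same arithmetic gives upper-bound aggregates. This sketch assumes a POSIX shell, a fully populated pod, and the per-GPU figures quoted on this page (288 GB physical HBM3e, ~15 PFLOPS dense FP4); the function names are hypothetical.

```shell
# Integer arithmetic only; all inputs are the reference counts quoted on this page.
SUS=8            # Scalable Units per full SuperPOD
NODES_PER_SU=72  # DGX B300 nodes per SU
GPUS_PER_NODE=8

# Hypothetical helpers for illustration.
superpod_nodes() { echo $(( $1 * $2 )); }
superpod_gpus()  { echo $(( $1 * $2 * $3 )); }

nodes=$(superpod_nodes "$SUS" "$NODES_PER_SU")
gpus=$(superpod_gpus "$SUS" "$NODES_PER_SU" "$GPUS_PER_NODE")
echo "Nodes: $nodes"
echo "GPUs:  $gpus"
echo "HBM3e (GB, physical): $(( gpus * 288 ))"
echo "Dense FP4 (PFLOPS):   $(( gpus * 15 ))"
```

That is roughly 1.33 PB of HBM3e and about 69 EF of dense FP4 as nameplate ceilings; usable memory and sustained throughput will be lower.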

Context worth holding

Export controls apply to B300/GB300; relevant to sovereign-AI and cross-border procurement scenarios.
Next gen: Rubin (Vera Rubin, shipping as the VR200 NVL72 rack) reaching partners and cloud providers from H2 2026. Useful for "where is this heading" questions; verify the exact SKU and timeline against NVIDIA before relying on it.
GB200 to GB300 is largely compatible at rack level (same NVL72 architecture), lowering migration cost for existing Blackwell deployers. GB300 fixes a GB200 pain point: ConnectX-8's integrated PCIe switch gives each GPU a direct RDMA path, where the GB200/CX7 generation had to traverse the Grace NoC and NVLink-C2C.

Install & setup

Reference template, not hardware-tested. Pin every version against the current NVIDIA release notes before a fleet roll; validate on one node first. Two materially different shapes: an HGX/DGX B200/B300 8-GPU node (NVSwitch baseboard, Fabric Manager local to the node) versus a GB200/GB300 NVL72 rack (multi-node NVLink domain, delivered and managed as a system via NVIDIA Mission Control / Base Command Manager). The checklist below is the node case; for NVL72 racks most of these steps are owned by Mission Control rather than run by hand (see end of section).

Open kernel modules are mandatory on Blackwell. NVIDIA: "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules. The proprietary drivers are unsupported on these platforms." Open modules are the default install since the R560 branch; the current datacenter LTS branch is R580, which adds Coherent Driver-based Memory Management (CDMM) for GB200. The proprietary .run/cuda-drivers path is a non-starter here: installing it leaves the GPUs uninitialised.

Ordered bring-up (Ubuntu/apt, CUDA network repo; mirror with the dnf repo on RHEL):

# 1. CUDA network repo + open datacenter driver (R580 LTS). Open modules are REQUIRED on Blackwell.
#    The repo path below ASSUMES Ubuntu 24.04 on x86_64 and keyring 1.1-1; adjust the distro/arch
#    segment for your host and confirm the current keyring version before use.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
#    nvidia-open pulls the open kernel modules; verify the exact branch package for your distro.
sudo apt-get install -y nvidia-open                     # open GPU kernel modules (R560+ default)
# or pin the branch explicitly: sudo apt-get install -y nvidia-driver-580-open

# 2. CUDA toolkit (only if you build/run on the host rather than in containers)
sudo apt-get install -y cuda-toolkit-13-0

# 3. Fabric Manager: NVSwitch baseboard on HGX/DGX B200/B300 (single-node NVLink domain).
#    The Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems";
#    B200/B300 require a 570.xx or newer driver. Version-match it to the installed driver branch
#    and confirm the package name for that branch in the Fabric Manager guide.
sudo apt-get install -y nvidia-fabricmanager-580
sudo systemctl enable --now nvidia-fabricmanager
sudo systemctl is-active nvidia-fabricmanager           # must be 'active' before NCCL forms the domain

# 4. Persistence (avoids re-init latency / power-state churn between jobs)
sudo systemctl enable --now nvidia-persistenced

# 5. DOCA-OFED for the ConnectX-8 SuperNIC / BlueField-3 RDMA stack (supersedes MLNX_OFED)
sudo apt-get install -y doca-ofed
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf   # GPUDirect RDMA peer-memory, loaded at boot
sudo modprobe nvidia_peermem                            # load it now, without waiting for a reboot

# 6. Verify
nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv
nvidia-smi -q | grep -i -A2 'Fabric'                    # NVSwitch fabric state
ibstat                                                  # ConnectX-8 ports ACTIVE at the expected rate
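
The 570.xx minimum noted in step 3 can be turned into a scripted gate. The sketch below assumes a POSIX shell; `driver_branch` and `meets_min_branch` are hypothetical helper names, and the sample version string is only an example taken from the R580 release-notes URL in References. On a real node the string would come from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```shell
# driver_branch VERSION: "580.126.09" -> "580". Hypothetical helper.
driver_branch() {
  echo "${1%%.*}"
}

# meets_min_branch VERSION MIN: succeeds when the driver branch is >= MIN. Hypothetical helper.
meets_min_branch() {
  [ "$(driver_branch "$1")" -ge "$2" ]
}

# Sample string for illustration only; replace with the value reported by nvidia-smi.
sample="580.126.09"
if meets_min_branch "$sample" 570; then
  echo "driver $sample: branch $(driver_branch "$sample") meets the 570 minimum"
else
  echo "driver $sample: branch too old for B200/B300 Fabric Manager" >&2
fi
```

Comparing only the branch number keeps the check simple; it does not confirm that Fabric Manager and the driver are on the same exact build, which still needs a separate version match.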

For driver/Fabric-Manager/CUDA package selection by tier and the Ansible-automated form of this, see the GPU software stack and Ansible node & fabric bring-up; the runnable fabric validation lives in the keystone Fabric bring-up, validation & benchmarking.

GB200/GB300 NVL72 racks are delivered and operated differently. The rack-scale NVLink domain spans node OS boundaries, so two pieces replace the single-node model. The IMEX service ("NVIDIA Import/Export Service for Internode Memory Sharing") maps GPU memory over NVLink across the OS/node boundary; per NVIDIA, "NVLink multi-node jobs will fail if the IMEX service is not properly initialized". NVIDIA Mission Control, which "leverages NVIDIA Base Command Manager (BCM) for foundational cluster-management tasks such as provisioning compute nodes, configuring software images, assigning roles", handles rack power-on, high-speed fabric management, firmware, and autonomous recovery. On a Mission-Control-managed rack you configure these through BCM, not by hand-installing each package. Verify the exact IMEX/Mission Control steps against the GB200/GB300 NVL72 admin guide for the deployed Mission Control version.

Networking

ConnectX-8 SuperNICs carry off-rack traffic; NVLink 5 + NVSwitch carries intra-rack. Both numbers below are NVIDIA-published; re-confirm on the datasheet before quoting.

Off-rack: ConnectX-8 SuperNIC at ~800 Gb/s. One OSFP port delivers 800 Gb/s InfiniBand XDR or 2x400 GbE, on PCIe Gen6, with an integrated PCIe switch that gives each GPU a direct RDMA path (the GB200/CX7 generation had to traverse the Grace NoC and NVLink-C2C). GB300 NVL72 and DGX/HGX B300 wire one ConnectX-8 per GPU.
Fabric choice: Quantum-X800 InfiniBand (144 ports of 800 Gb/s per switch, SHARP v4 in-network reduction, adaptive routing) for latency-bound training, or Spectrum-X Ethernet (Spectrum-4 SN5600, 64x 800 GbE / 51.2 Tb/s, with BlueField-3 SuperNICs and RoCEv2) for multi-tenant inference and cloud. The rule of thumb is unchanged from networking fabric: InfiniBand where GPU-to-GPU latency dominates, Ethernet/RoCE for cloud.
Intra-rack: NVLink 5 / 5th-gen NVSwitch. 1.8 TB/s per GPU (900 GB/s each direction), forming a single 72-GPU NVLink domain across the NVL72's NVSwitch trays, 130 TB/s aggregate. Blackwell extends Confidential Computing across this domain via TEE-I/O.
Cross-rack NVLink memory: NVLink does not extend beyond one NVL72, so GPU-to-GPU traffic between racks rides the InfiniBand/Ethernet fabric. Inside the rack, the 72-GPU NVLink domain spans multiple node OS instances, and the IMEX service is what maps GPU memory across those OS/node boundaries (see Install & setup). On a single HGX/DGX node the NVSwitch domain is local to one OS and Fabric Manager covers it; IMEX is the multi-node piece.

Blackwell-specific bring-up notes (do not duplicate the procedure; it lives in the keystone):

The runnable bring-up/validation/benchmark commands, namely ibstat/ibdiagnet link checks, subnet-manager convergence, and nccl-tests (all_reduce/all_gather) at realistic message sizes, are centralised in Fabric bring-up, validation & benchmarking. Use that as the single source for ops.
XDR tuning: confirm every ConnectX-8 link negotiates the full 800 Gb/s XDR rate; a mixed CX7/CX8 estate can quietly drop to NDR (400 Gb/s). OSFP modules are not interchangeable between NDR and XDR despite the shared cage (see BOM validation).
NVLink fabric: assert Fabric Manager is active on each HGX/DGX node and, on Mission-Control-managed NVL72 racks, confirm the NVLink partition is formed before launching multi-node NCCL.
Cross-node NVLink: start IMEX before NVLink multi-node jobs; they fail otherwise.
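
The XDR note above can be scripted as a count of ports that negotiated below the expected rate. This is a sketch under assumptions: it assumes a POSIX shell with awk, that `ibstat` reports each port's speed on a `Rate:` line in Gb/s, and that a full-rate XDR port shows 800. `count_slow_ports` and `sample_ibstat` are hypothetical names, and the sample text is illustrative, not captured from hardware.

```shell
# count_slow_ports WANT: reads ibstat-style text on stdin and prints how many
# "Rate:" lines report less than WANT Gb/s. Hypothetical helper.
count_slow_ports() {
  awk -v want="$1" '$1 == "Rate:" && ($2 + 0) < want { slow++ } END { print slow + 0 }'
}

# Illustrative sample only: one port at 800, one that fell back to 400.
sample_ibstat() {
  cat <<'EOF'
CA 'mlx5_0'
    Port 1:
        State: Active
        Rate: 800
CA 'mlx5_1'
    Port 1:
        State: Active
        Rate: 400
EOF
}

echo "Ports below 800 Gb/s: $(sample_ibstat | count_slow_ports 800)"
```

On a node the same function would be fed live output, for example `ibstat | count_slow_ports 800`, with any non-zero count treated as a failed pre-flight check. The authoritative link checks remain the ones in Fabric bring-up, validation & benchmarking.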

Don't-miss checklist

Distinguish B300 (GPU / DGX / HGX node) from GB300 NVL72 (rack-scale system) precisely in conversation.
Be ready to state why GB300 forces liquid cooling and facility-level planning by default, where DGX/HGX B300 fits more familiar patterns.
Know the per-GPU and per-rack power and the transient behaviour (see datacentre readiness).
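
The per-GPU and per-rack power figures on this page can be related with simple arithmetic. The sketch below assumes a POSIX shell with awk and uses nameplate board power only; `gpu_power_kw` and `headroom_kw` are hypothetical helpers, and nothing here models transient or sustained draw.

```shell
# gpu_power_kw COUNT WATTS: total GPU board power in kW. Hypothetical helper.
gpu_power_kw() {
  awk -v n="$1" -v w="$2" 'BEGIN { printf "%.1f\n", n * w / 1000 }'
}

# headroom_kw RACK_KW GPU_KW: rack power left for everything that is not a GPU. Hypothetical helper.
headroom_kw() {
  awk -v rack="$1" -v gpu="$2" 'BEGIN { printf "%.1f\n", rack - gpu }'
}

node_kw=$(gpu_power_kw 8 1400)    # DGX/HGX B300 node, GPUs only
rack_kw=$(gpu_power_kw 72 1400)   # GB300 NVL72 rack, GPUs only
echo "8x B300 board power:  ${node_kw} kW"
echo "72x B300 board power: ${rack_kw} kW"
echo "Non-GPU share at 120 kW rack: $(headroom_kw 120 "$rack_kw") kW"
echo "Non-GPU share at 140 kW rack: $(headroom_kw 140 "$rack_kw") kW"
```

On these nameplate numbers the GPUs alone account for about 100.8 kW of the quoted 120 to 140 kW, leaving roughly 19 to 39 kW for Grace CPUs, NVLink switch trays, NICs, and power conversion.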

Open questions & validation

Re-pull the NVIDIA Blackwell Ultra datasheet for exact, current rack power and FLOPS before relying on a figure; vendor numbers shift with configuration and dense-vs-sparse framing.
Confirm whether the target deployment is GB300 NVL72 racks or HGX/DGX B300 nodes; the facility and fabric implications differ.

References

NVIDIA Blackwell architecture (5th-gen Tensor, 2nd-gen Transformer Engine, NVFP4, NVLink5, TEE-I/O): https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
NVIDIA GB200 NVL72 product page (1 Grace + 2 B200 superchip, 72 GPU / 36 Grace, 13.4 TB HBM3e, 1.44 EF FP4 / 720 PFLOPS FP8 at 2:1 sparsity): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
Chris Fregly, AI Systems Performance Engineering (O'Reilly), ch. 2 — GB200 NVL72 ~1.44 EF FP4 / ~720 PFLOPS FP8 at 2:1 structured sparsity; B200 192 GB HBM3e with ~180 GB usable
NVIDIA GB300 NVL72 product page (72 Blackwell Ultra + 36 Grace, 20 TB HBM3e): https://www.nvidia.com/en-us/data-center/gb300-nvl72/
NVIDIA DGX B200 product page (8x B200, 1,440 GB total HBM3e at 64 TB/s, 144/72 PFLOPS FP4): https://www.nvidia.com/en-us/data-center/dgx-b200/
NVIDIA Blackwell Ultra technical blog: https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
GB300 architecture deep dive (per-GPU specs, NCCL): https://verda.com/blog/gb300-nvl72-architecture
B300 specs and pricing overview: https://www.spheron.network/blog/nvidia-b300-blackwell-ultra-guide/
GB300 deployment field notes: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters
NVIDIA "fully towards open-source GPU kernel modules" (Blackwell/Grace Hopper require open modules; R560 default): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
NVIDIA driver installation guide — kernel modules (nvidia-open, Turing+): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
NVIDIA Data Center GPU Driver R580 release notes (current LTS; CDMM for GB200): https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-126-09/index.html
NVIDIA Fabric Manager user guide (NVSwitch single-node HGX/DGX scope; B200/B300 need 570.xx+): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA IMEX service for NVLink networks (Import/Export Service for Internode Memory Sharing): https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html
NVIDIA Mission Control with GB200/GB300 NVL72 — software stack (leverages Base Command Manager): https://docs.nvidia.com/mission-control/docs/systems-administration-guide/2.0.0/software-stack.html
NVIDIA Mission Control GB200/GB300 NVL72 high-speed fabric management: https://docs.nvidia.com/mission-control/docs/systems-administration-guide/2.0.0/high-speed-fabric-management.html
NVIDIA Quantum-X800 InfiniBand platform (144 ports x 800 Gb/s, ConnectX-8, SHARP v4): https://www.nvidia.com/en-us/networking/products/infiniband/quantum-x800/
NVIDIA DOCA host installation and upgrade (DOCA-OFED): https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
NVIDIA ConnectX-8 SuperNIC PCIe Gen6 800G detail (ServeTheHome): https://www.servethehome.com/nvidia-connectx-8-supernic-pcie-gen6-800g-nic-detailed/