NVIDIA Blackwell datacenter platform (B200 / B300, GB200 / GB300)
Scope: the NVIDIA Blackwell datacenter line that much of this knowledge base is built on. It covers the GPUs (B200, B300, and the early B100), the nodes (DGX/HGX B200 and B300), and the rack-scale systems (GB200 NVL72 and GB300 NVL72). B200 is the broader-compatibility part; B300 (Blackwell Ultra) is the memory- and compute-maximised part. Both share 5th-gen Tensor Cores, 5th-gen NVLink, and Confidential Computing with TEE-I/O. Keep this page current as datasheets shift.
Figures verified against vendor and reference sources as of June 2026. Re-check against the NVIDIA datasheet before relying on any single number; vendor-quoted rack power and FLOPS vary by configuration and by dense-vs-sparse and peak-vs-sustained framing.
Overview
Blackwell Ultra shifts the centre of gravity from training-only FLOPS to reasoning and test-time inference: memory-dense, low-precision (NVFP4) tensor work with fast attention. The "unit of compute" is the B300 GPU; the "unit of deployment" is the GB300 NVL72 rack. Announced at GTC 2025, shipping from H2 2025.
Architecture: GB300 NVL72 rack
flowchart TB
subgraph Rack["GB300 NVL72 — 120-140 kW, liquid-cooled"]
T["18x compute tray: 4x B300 + 2x Grace each"] --> NVS["NVLink 5 / NVSwitch: single 72-GPU domain"]
end
NVS --> NIC["72x ConnectX-8 800G"]
NIC --> FAB["Quantum-X800 IB / Spectrum-X Ethernet"]
Blackwell datacenter lineup
The line spans three GPUs and two rack systems built on one architecture. All carry 5th-gen Tensor Cores with the 2nd-gen Transformer Engine (FP4 / NVFP4), 5th-gen NVLink at 1.8 TB/s per GPU, HBM3e, and Confidential Computing. Blackwell is the first GPU to extend the trusted execution environment across NVLink via TEE-I/O.
- B100: the early Blackwell datacenter part at 700 W, the same board power envelope as H100 SXM, so it drops into Hopper-class power and cooling without facility changes. Use it where the goal is reusing existing H100 infrastructure rather than maximising per-GPU memory or FLOPS.
- B200: the mainstream training/inference GPU. 192 GB HBM3e at ~8 TB/s (~180 GB usable per GPU; ECC, firmware, and manufacturing reserve the remainder, per Fregly, AI Systems Performance Engineering, O'Reilly, ch. 2), ~9 PFLOPS dense FP4, 1000 W, dual-die. The broader-compatibility part: it fits much existing datacenter power and cooling, and is the GPU inside DGX/HGX B200 and GB200 NVL72.
- B300 (Blackwell Ultra): the memory- and compute-maximised part. 288 GB HBM3e (12-Hi stacks, 50% more than B200), ~15 PFLOPS dense FP4, 1400 W, liquid-cooling mandatory. Detailed below.
B200 (1000 W) and B300 (1400 W) differ chiefly in memory density, FP4 throughput, and thermals; both deploy as 8-GPU DGX/HGX nodes and as 72-GPU NVL72 racks. The numbers below come from the NVIDIA Blackwell architecture page, the DGX B200 page, and the GB200 / GB300 NVL72 product pages; re-check the datasheet before relying on any single figure, since vendor FLOPS vary by dense-vs-sparse framing.
B200 vs B300 (per GPU)
| Spec | B200 | B300 (Blackwell Ultra) |
|---|---|---|
| Architecture | Blackwell, dual-die | Blackwell Ultra, dual-die |
| HBM3e memory | 192 GB | 288 GB (12-Hi) |
| Memory bandwidth | ~8 TB/s | ~8 TB/s (verify on datasheet; sources disagree) |
| Dense FP4 | ~9 PFLOPS | ~15 PFLOPS |
| Board power | 1000 W | 1400 W |
| Tensor / NVLink | 5th-gen Tensor, NVLink5 1.8 TB/s | 5th-gen Tensor, NVLink5 1.8 TB/s |
| Confidential Computing | yes (TEE-I/O) | yes (TEE-I/O) |
| Node | DGX / HGX B200 (8 GPU) | DGX / HGX B300 (8 GPU) |
| Positioning | broader-compatibility, fits existing DC | memory/compute-max, liquid required |
DGX B200 ships 8 B200 GPUs with 1,440 GB total HBM3e at 64 TB/s (= ~180 GB usable of the 192 GB physical, and ~8 TB/s, per GPU) and 144 / 72 PFLOPS sparse/dense FP4 (= 18 / 9 PFLOPS per GPU). This is the primary-source anchor for the B200 per-GPU figures above (DGX B200 product page).
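As a quick arithmetic check, the DGX B200 node totals quoted above can be divided across the eight GPUs. This is a minimal sketch in POSIX shell; `per_gpu` is a hypothetical helper name, and the only inputs are the node figures already cited. Note that 1,440 GB / 8 gives 180 GB, which lines up with the ~180 GB usable figure rather than the 192 GB physical HBM3e per GPU.

```shell
# per_gpu TOTAL GPUS: split a node-level total evenly (integer division).
# Hypothetical helper for illustration; inputs are the DGX B200 page figures.
per_gpu() {
  echo $(( $1 / $2 ))
}

GPUS=8
echo "HBM3e per GPU:      $(per_gpu 1440 "$GPUS") GB"     # usable, not the 192 GB physical
echo "Bandwidth per GPU:  $(per_gpu 64 "$GPUS") TB/s"
echo "Sparse FP4 per GPU: $(per_gpu 144 "$GPUS") PFLOPS"
echo "Dense FP4 per GPU:  $(per_gpu 72 "$GPUS") PFLOPS"
```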
GB200 NVL72 vs GB300 NVL72 (rack)
| Spec | GB200 NVL72 | GB300 NVL72 |
|---|---|---|
| GPUs | 72x B200 | 72x B300 (Blackwell Ultra) |
| Grace CPUs | 36 (Arm Neoverse V2) | 36 (Arm Neoverse V2) |
| Superchip | 1 Grace + 2 B200 (x36) | 1 Grace + 2 B300 (x36) |
| GPU HBM3e | 13.4 TB | ~20 TB (20.7 TB) |
| NVLink (5th-gen) | 1.8 TB/s per GPU, 130 TB/s aggregate | 1.8 TB/s per GPU, 130 TB/s aggregate |
| FP4 (peak, 2:1 sparse) | 1.44 EF | 1.44 EF (2:1 sparse); ~1.1 EF dense |
| FP8 (peak, 2:1 sparse) | ~720 PFLOPS | ~720 PFLOPS |
| Grace memory | 17 TB LPDDR5X | ~18 TB LPDDR5X (verify) |
| Off-rack NIC | ConnectX-7/8 | ConnectX-8 (per-GPU direct RDMA) |
| Cooling | liquid | liquid, 120-140 kW/rack |
The GB200 Grace Blackwell Superchip pairs 1 Grace CPU with 2 B200 GPUs (not 1+1) over NVLink-C2C; GB200 NVL72 is 36 such superchips = 72 B200 + 36 Grace, delivering ~1.44 EF FP4 and ~720 PFLOPS FP8 (both peak figures assume 2:1 structured sparsity, NVFP4/FP8 Tensor Core sparse basis) and 13.4 TB GPU HBM3e (GB200 NVL72 product page; Fregly, AI Systems Performance Engineering, O'Reilly, ch. 2: "about 1.44 exaFLOPS for FP4 with 2 to 1 structured sparsity and about 720 petaFLOPS for FP8 with 2 to 1 structured sparsity"). GB200-to-GB300 stays largely compatible at rack level (same NVL72 architecture), so existing Blackwell deployers migrate at low cost.
Core knowledge
B300 GPU (Blackwell Ultra)
- Dual-die (dual-reticle) package, ~208 billion transistors, 160 SMs, ~20,480 CUDA cores, 640 fifth-gen Tensor Cores.
- Memory: 288 GB HBM3e via 12-high stacks (50% more than B200's 192 GB), ~8 TB/s bandwidth (verify on the datasheet; sources disagree).
- Compute: ~15 PFLOPS dense FP4 (about 67% more than B200's ~9 PFLOPS). Precisions: FP8, FP6, NVFP4. Deprioritises FP64, so traditional double-precision HPC does not benefit equally.
- TDP: 1,400 W (B200 was 1,000 W; H100 was 700 W). Liquid cooling mandatory.
- Deploys mostly as an 8-GPU node (DGX/HGX B300), which is the basic unit schedulers design around.
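The generation-over-generation ratios in this list follow directly from the per-GPU figures. A minimal sketch, assuming only the nominal numbers quoted on this page; `pct_more` is a hypothetical helper, and the per-SM counts are simply the quoted totals divided by 160 SMs. The FP4 uplift rounds to 67%.

```shell
# pct_more NEW OLD: percentage by which NEW exceeds OLD, rounded to a whole percent.
# Hypothetical helper; awk is used because POSIX shell arithmetic is integer-only.
pct_more() {
  awk -v n="$1" -v o="$2" 'BEGIN { printf "%.0f\n", (n - o) / o * 100 }'
}

echo "HBM3e capacity: +$(pct_more 288 192)%"    # 288 GB vs 192 GB
echo "Dense FP4:      +$(pct_more 15 9)%"       # ~15 vs ~9 PFLOPS
echo "Board power:    +$(pct_more 1400 1000)%"  # 1,400 W vs 1,000 W
echo "CUDA cores per SM:   $(( 20480 / 160 ))"
echo "Tensor Cores per SM: $(( 640 / 160 ))"
```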
GB300 NVL72 rack
- 72 B300 GPUs plus 36 Grace CPUs (Arm Neoverse V2, 72 cores each), arranged as 18 compute trays, each tray 4 GPUs + 2 Grace. Grace-Blackwell coupling via NVLink-C2C, coherent unified memory.
- A Grace-Blackwell Ultra superchip pairs 1 Grace CPU with 2 Blackwell Ultra GPUs.
- NVLink: fifth-gen, 1.8 TB/s per GPU (900 GB/s unidirectional), 130 TB/s aggregate intra-rack, NVLink 5 switching forming a single 72-GPU NVLink domain.
- Memory: 20.7 TB HBM3e across the rack; ~37 TB total fast memory including Grace.
- Compute: ~1.1 exaFLOPS dense FP4 (vendor peak/sparse figures cited up to ~1.44 exaFLOPS).
- Power: roughly 120 to 140 kW per rack, fully liquid-cooled. Weight ~1.36 t; the cabinet is a 48U rack (18 compute trays, 9 NVLink switch trays, power shelves) with the floor footprint of a standard 42U rack.
- Networking off-rack: ConnectX-8 SuperNICs, 800 Gb/s per GPU, to Quantum-X800 InfiniBand or Spectrum-X Ethernet (see networking fabric).
- Managed by NVIDIA Mission Control.
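The rack-level figures above are, to a first approximation, the per-GPU numbers multiplied across 18 trays. The sketch below reproduces that arithmetic from the nominal values on this page (288 GB, 1.8 TB/s, ~15 PFLOPS dense FP4 per GPU); `rack_total` is a hypothetical helper, and the results are nominal products rather than vendor-published totals.

```shell
TRAYS=18; GPUS_PER_TRAY=4; GRACE_PER_TRAY=2
GPUS=$(( $TRAYS * $GPUS_PER_TRAY ))
GRACE=$(( $TRAYS * $GRACE_PER_TRAY ))

# rack_total COUNT PER_UNIT: multiply a per-GPU figure across the rack, one decimal place.
rack_total() {
  awk -v c="$1" -v p="$2" 'BEGIN { printf "%.1f\n", c * p }'
}

echo "GPUs: $GPUS, Grace CPUs: $GRACE"
echo "HBM3e:            $(rack_total "$GPUS" 0.288) TB"    # 288 GB per GPU
echo "NVLink aggregate: $(rack_total "$GPUS" 1.8) TB/s"    # quoted as 130 TB/s
echo "Dense FP4:        $(rack_total "$GPUS" 15) PFLOPS"   # quoted as ~1.1 EF
```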
Scaling up
- DGX B300 node: 8 GPUs, 8x ConnectX-8 SuperNICs (800 Gb/s), ~2.3 TB HBM3e. CPU is Intel Xeon (not Grace). Grace appears in the GB300 NVL72 rack, not in the DGX B300 node.
- Scalable Unit (SU): 72 nodes (576 GPUs), rail-aligned (see networking fabric).
- Full DGX SuperPOD (B300-XDR reference): 8 SUs = 576 nodes = 4,608 GPUs. Aggregate HBM3e and FLOPS scale linearly (~288 GB and ~15 PFLOPS dense FP4 per GPU); confirm exact totals against the current NVIDIA reference architecture.
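The node, SU, and SuperPOD counts above multiply out as follows. This is only a sketch of the linear scaling, using the nominal 288 GB and ~15 PFLOPS dense FP4 per GPU from this page; the variable names are illustrative, and the totals are not a substitute for the NVIDIA reference architecture.

```shell
GPUS_PER_NODE=8; NODES_PER_SU=72; SUS=8
NODES=$(( $NODES_PER_SU * $SUS ))
GPUS=$(( $NODES * $GPUS_PER_NODE ))
HBM_GB=$(( $GPUS * 288 ))        # nominal 288 GB HBM3e per GPU
FP4_PFLOPS=$(( $GPUS * 15 ))     # nominal ~15 PFLOPS dense FP4 per GPU

echo "GPUs per SU: $(( $NODES_PER_SU * $GPUS_PER_NODE ))"
echo "Nodes: $NODES, GPUs: $GPUS"
echo "Nominal HBM3e:     $HBM_GB GB (~$(( $HBM_GB / 1000 )) TB)"
echo "Nominal dense FP4: $FP4_PFLOPS PFLOPS (~$(( $FP4_PFLOPS / 1000 )) EFLOPS)"
```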
Context worth holding
- Export controls apply to B300/GB300; relevant to sovereign-AI and cross-border procurement scenarios.
- Next gen: Rubin (Vera Rubin, shipping as the VR200 NVL72 rack) reaching partners and cloud providers from H2 2026. Useful for "where is this heading" questions; verify the exact SKU and timeline against NVIDIA before relying on it.
- GB200 to GB300 is largely compatible at rack level (same NVL72 architecture), lowering migration cost for existing Blackwell deployers. GB300 fixes a GB200 pain point: ConnectX-8's integrated PCIe switch gives each GPU a direct RDMA path, where the GB200/CX7 generation had to traverse the Grace NoC and NVLink-C2C.
Install & setup
Reference template, not hardware-tested. Pin every version against the current NVIDIA release notes before a fleet roll; validate on one node first. Two materially different shapes: an HGX/DGX B200/B300 8-GPU node (NVSwitch baseboard, Fabric Manager local to the node) versus a GB200/GB300 NVL72 rack (multi-node NVLink domain, delivered and managed as a system via NVIDIA Mission Control / Base Command Manager). The checklist below is the node case; for NVL72 racks most of these steps are owned by Mission Control rather than run by hand (see end of section).
Open kernel modules are mandatory on Blackwell. NVIDIA: "For cutting-edge platforms such as NVIDIA Grace Hopper or NVIDIA Blackwell, you must use the open-source GPU kernel modules. The proprietary drivers are unsupported on these platforms." Open modules are the default install since the R560 branch; the current datacenter LTS branch is R580, which adds Coherent Driver-based Memory Management (CDMM) for GB200. The proprietary .run/cuda-drivers path is a non-starter here: installing it leaves the GPUs uninitialised.
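One way to confirm which driver flavour a node actually loaded is to read the kernel module's license field. The sketch below rests on an assumption to verify on your own build: that the open modules report `Dual MIT/GPL` and the proprietary module reports `NVIDIA`. `classify_license` is a hypothetical helper, and the host query is guarded so the snippet does nothing on machines without the module.

```shell
# classify_license STRING: map a kernel-module license string to a driver flavour.
# Assumption: open modules report "Dual MIT/GPL"; the proprietary module reports "NVIDIA".
classify_license() {
  case "$1" in
    *MIT*|*GPL*) echo open ;;
    NVIDIA)      echo proprietary ;;
    *)           echo unknown ;;
  esac
}

# Query the host only when modinfo exists and the nvidia module is installed.
if command -v modinfo >/dev/null 2>&1 && lic=$(modinfo -F license nvidia 2>/dev/null); then
  echo "nvidia kernel module flavour: $(classify_license "$lic")"
fi
```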
Ordered bring-up (Ubuntu/apt, CUDA network repo; mirror with the dnf repo on RHEL):
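# 1. Open datacenter driver (R580 LTS) from the CUDA network repo. Open modules are REQUIRED on Blackwell.
# Assumes the CUDA network repo is already configured on this host; the commands below do not add it.
# nvidia-open pulls the open kernel modules; verify the exact branch package for your distro.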
sudo apt-get update
sudo apt-get install -y nvidia-open # open GPU kernel modules (R560+ default)
# or pin the branch explicitly: sudo apt-get install -y nvidia-driver-580-open
# 2. CUDA toolkit (only if you build/run on the host rather than in containers)
sudo apt-get install -y cuda-toolkit-13-0
# 3. Fabric Manager — NVSwitch baseboard on HGX/DGX B200/B300 (single-node NVLink domain).
# The Fabric Manager guide scopes it to "NVSwitch-based single-node HGX and DGX systems";
# B200/B300 require a 570.xx or newer driver. Version-match it to the installed driver branch.
sudo apt-get install -y nvidia-fabricmanager-580
sudo systemctl enable --now nvidia-fabricmanager
sudo systemctl is-active nvidia-fabricmanager # must be 'active' before NCCL forms the domain
# 4. Persistence (avoids re-init latency / power-state churn between jobs)
sudo systemctl enable --now nvidia-persistenced
# 5. DOCA-OFED for the ConnectX-8 SuperNIC / BlueField-3 RDMA stack (supersedes MLNX_OFED)
sudo apt-get install -y doca-ofed
echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf # GPUDirect RDMA peer-memory
# 6. Verify
nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv
nvidia-smi -q | grep -i -A2 'Fabric' # NVSwitch fabric state
ibstat # ConnectX-8 ports ACTIVE at the expected rate
For driver/Fabric-Manager/CUDA package selection by tier and the Ansible-automated form of this, see the GPU software stack and Ansible node & fabric bring-up; the runnable fabric validation lives in the keystone Fabric bring-up, validation & benchmarking.
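The step 6 query can be turned into a pass/fail gate. The sketch below is a hypothetical `check_gpus` helper that reads `nvidia-smi --format=csv,noheader` rows on stdin and checks GPU count, driver branch, and persistence mode. The expected count of 8 and the minimum branch of 570 come from the node description and the Fabric Manager note above; it is untested on hardware.

```shell
# check_gpus EXPECTED_COUNT MIN_DRIVER_MAJOR
# Reads "name, driver_version, persistence_mode" rows on stdin (csv,noheader).
# Returns non-zero if the count, driver branch, or persistence mode is off.
check_gpus() {
  awk -F', *' -v want="$1" -v min="$2" '
    {
      n++
      split($2, v, ".")
      if (v[1] + 0 < min + 0) { printf "GPU %d: driver %s is older than R%d\n", n, $2, min; bad = 1 }
      if ($3 != "Enabled")    { printf "GPU %d: persistence mode is %s\n", n, $3; bad = 1 }
    }
    END {
      if (n + 0 != want + 0) { printf "expected %d GPUs, found %d\n", want, n; bad = 1 }
      exit bad + 0
    }'
}

# Usage on an 8-GPU node (left commented so the sketch is inert without a GPU):
# nvidia-smi --query-gpu=name,driver_version,persistence_mode --format=csv,noheader \
#   | check_gpus 8 570 && echo "GPU inventory OK"
```

Parsing stdin rather than calling `nvidia-smi` inside the function keeps the check usable against captured output from a remote node.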
GB200/GB300 NVL72 racks are delivered and operated differently. The rack-scale NVLink domain spans node OS boundaries, so two pieces replace the single-node model:
- The IMEX service ("NVIDIA Import/Export Service for Internode Memory Sharing") maps GPU memory over NVLink across the OS/node boundary; "NVLink multi-node jobs will fail if the IMEX service is not properly initialized".
- NVIDIA Mission Control, which "leverages NVIDIA Base Command Manager (BCM) for foundational cluster-management tasks such as provisioning compute nodes, configuring software images, assigning roles", handles rack power-on, high-speed fabric management, firmware, and autonomous recovery.
On a Mission-Control-managed rack you configure these through BCM, not by hand-installing each package. Verify the exact IMEX/Mission Control steps against the GB200/GB300 NVL72 admin guide for the deployed Mission Control version.
Networking
ConnectX-8 SuperNICs carry off-rack traffic; NVLink 5 + NVSwitch carries intra-rack traffic. The two headline rates below (800 Gb/s per NIC, 1.8 TB/s per GPU over NVLink) are NVIDIA-published; re-confirm on the datasheet before quoting.
- Off-rack: ConnectX-8 SuperNIC at ~800 Gb/s. One OSFP port delivers 800 Gb/s InfiniBand XDR or 2x400 GbE, on PCIe Gen6, with an integrated PCIe switch that gives each GPU a direct RDMA path (the GB200/CX7 generation had to traverse the Grace NoC and NVLink-C2C). GB300 NVL72 and DGX/HGX B300 wire one ConnectX-8 per GPU.
- Fabric choice. ConnectX-8 uplinks into either Quantum-X800 InfiniBand (144 ports of 800 Gb/s per switch, SHARP v4 in-network reduction, adaptive routing) for latency-bound training, or Spectrum-X Ethernet (Spectrum-4 SN5600, 64x 800 GbE / 51.2 Tb/s, with BlueField-3 SuperNICs and RoCEv2) for multi-tenant inference and cloud. The rule of thumb is unchanged from networking fabric: IB where GPU-to-GPU latency dominates, Ethernet/RoCE for cloud.
- Intra-rack: NVLink 5 / 5th-gen NVSwitch. 1.8 TB/s per GPU (900 GB/s each direction), forming a single 72-GPU NVLink domain across the NVL72's NVSwitch trays, 130 TB/s aggregate. Blackwell extends Confidential Computing across this domain via TEE-I/O.
- Cross-rack NVLink memory. NVLink does not extend beyond one NVL72; GPU-to-GPU traffic between racks rides the InfiniBand/Ethernet fabric. Inside the rack, the 72-GPU NVSwitch domain spans multiple compute-tray OS instances, and the IMEX service is what maps GPU memory across those OS/node boundaries (see Install & setup). On a single 8-GPU HGX/DGX node the NVSwitch domain is local to one OS and IMEX is not involved.
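To put the two tiers on the same scale, the sketch below converts the link rates quoted in this list into like-for-like units. It compares per-direction figures (800 Gb/s per NIC against 900 GB/s per direction of NVLink), and `to_tbps` is a hypothetical helper. These are nominal line rates, not measured throughput.

```shell
# to_tbps PORTS GBPS_PER_PORT: aggregate port bandwidth in Tb/s, one decimal place.
to_tbps() {
  awk -v p="$1" -v g="$2" 'BEGIN { printf "%.1f\n", p * g / 1000 }'
}

NIC_GBPS=800                      # ConnectX-8: 800 Gb/s per GPU, per direction
NIC_GB_S=$(( $NIC_GBPS / 8 ))     # 8 bits per byte
NVLINK_DIR_GB_S=900               # NVLink 5: 900 GB/s each direction (1.8 TB/s total)

echo "Off-rack per GPU, per direction: $NIC_GB_S GB/s"
echo "NVLink per GPU, per direction:   $NVLINK_DIR_GB_S GB/s"
echo "NVLink : NIC ratio:              $(( $NVLINK_DIR_GB_S / $NIC_GB_S )):1"
echo "NVL72 off-rack (72 NICs):        $(to_tbps 72 "$NIC_GBPS") Tb/s"
echo "Spectrum-4 SN5600 (64 ports):    $(to_tbps 64 800) Tb/s"
echo "Quantum-X800 (144 ports):        $(to_tbps 144 800) Tb/s"
```

The roughly 9:1 gap per GPU is one reason to keep bandwidth-heavy collectives inside the NVLink domain where the job layout allows it.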
Blackwell-specific bring-up notes (do not duplicate the procedure; it lives in the keystone):
- The runnable bring-up/validation/benchmark commands, namely ibstat / ibdiagnet link checks, subnet-manager convergence, and nccl-tests (all_reduce / all_gather) at realistic message sizes, are centralised in Fabric bring-up, validation & benchmarking. Use that as the single source for ops.
- XDR tuning: confirm every ConnectX-8 link negotiates the full 800 Gb/s XDR rate; a mixed CX7/CX8 estate can quietly drop to NDR (400 Gb/s). OSFP modules are not interchangeable between NDR and XDR despite the shared cage (see BOM validation).
- NVLink fabric: on HGX/DGX nodes, assert Fabric Manager is active; on Mission-Control-managed NVL72 racks, confirm the NVLink partition is formed before launching multi-node NCCL.
- Cross-node NVLink: start IMEX before NVLink multi-node jobs; they fail otherwise.
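The notes above reduce to two preflight gates: required services are active, and no link has fallen back below XDR. The sketch below shows one possible shape for those gates and does not replace the keystone procedure. `require_state` and `slow_ports` are hypothetical helpers, the `nvidia-imex` unit name is an assumption to check against the IMEX guide, and the parser assumes `ibstat` prints a numeric `Rate:` line per port.

```shell
# require_state NAME STATE: succeed only when STATE is exactly "active".
require_state() {
  if [ "$2" = "active" ]; then
    echo "ok: $1 is active"
  else
    echo "FAIL: $1 is '${2:-unknown}'" >&2
    return 1
  fi
}

# slow_ports EXPECTED_GBPS: read ibstat output on stdin, print every port whose
# negotiated Rate is below EXPECTED_GBPS, and return non-zero if any is found.
slow_ports() {
  awk -v want="$1" '
    /^CA /  { ca = $2 }
    /Rate:/ { if ($2 + 0 < want + 0) { printf "%s rate %s < %s\n", ca, $2, want; bad = 1 } }
    END     { exit bad + 0 }'
}

# Example wiring before a multi-node NCCL launch (commented: needs real hardware).
# require_state nvidia-fabricmanager "$(systemctl is-active nvidia-fabricmanager)" || exit 1
# require_state nvidia-imex "$(systemctl is-active nvidia-imex)" || exit 1   # NVL72 racks only
# ibstat | slow_ports 800 || exit 1
```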
Don't-miss checklist
- Distinguish B300 (GPU / DGX / HGX node) from GB300 NVL72 (rack-scale system) precisely in conversation.
- Be ready to state why GB300 forces liquid cooling and facility-level planning by default, where DGX/HGX B300 fits more familiar patterns.
- Know the per-GPU and per-rack power and the transient behaviour (see datacentre readiness).
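For the power item, it helps to see how much of the rack envelope the GPUs alone account for. The sketch below multiplies the 1,400 W B300 board power by the node and rack GPU counts and compares the result with the quoted 120 to 140 kW range. `gpu_kw` is a hypothetical helper, and the figures ignore Grace CPUs, NICs, switch trays, and transients.

```shell
GPU_W=1400   # B300 board power quoted on this page

# gpu_kw COUNT: GPU-only board power in kW for COUNT B300 GPUs.
gpu_kw() {
  awk -v c="$1" -v w="$GPU_W" 'BEGIN { printf "%.1f\n", c * w / 1000 }'
}

echo "DGX/HGX B300 node, GPUs only: $(gpu_kw 8) kW"
echo "GB300 NVL72 rack, GPUs only:  $(gpu_kw 72) kW"

# Share of the quoted rack envelope taken by GPU board power alone.
for rack_kw in 120 140; do
  awk -v g="$(gpu_kw 72)" -v r="$rack_kw" \
    'BEGIN { printf "GPU share of a %d kW rack: %.0f%%\n", r, g / r * 100 }'
done
```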
Open questions & validation
- Re-pull the NVIDIA Blackwell Ultra datasheet for exact, current rack power and FLOPS before relying on a figure; vendor numbers shift with configuration and dense-vs-sparse framing.
- Confirm whether the target deployment is GB300 NVL72 racks or HGX/DGX B300 nodes; the facility and fabric implications differ.
References
- NVIDIA Blackwell architecture (5th-gen Tensor, 2nd-gen Transformer Engine, NVFP4, NVLink5, TEE-I/O): https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
- NVIDIA GB200 NVL72 product page (1 Grace + 2 B200 superchip, 72 GPU / 36 Grace, 13.4 TB HBM3e, 1.44 EF FP4 / 720 PFLOPS FP8 at 2:1 sparsity): https://www.nvidia.com/en-us/data-center/gb200-nvl72/
- Chris Fregly, AI Systems Performance Engineering (O'Reilly), ch. 2 — GB200 NVL72 ~1.44 EF FP4 / ~720 PFLOPS FP8 at 2:1 structured sparsity; B200 192 GB HBM3e with ~180 GB usable
- NVIDIA GB300 NVL72 product page (72 Blackwell Ultra + 36 Grace, 20 TB HBM3e): https://www.nvidia.com/en-us/data-center/gb300-nvl72/
- NVIDIA DGX B200 product page (8x B200, 1,440 GB total HBM3e at 64 TB/s, 144/72 PFLOPS FP4): https://www.nvidia.com/en-us/data-center/dgx-b200/
- NVIDIA Blackwell Ultra technical blog: https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
- GB300 architecture deep dive (per-GPU specs, NCCL): https://verda.com/blog/gb300-nvl72-architecture
- B300 specs and pricing overview: https://www.spheron.network/blog/nvidia-b300-blackwell-ultra-guide/
- GB300 deployment field notes: https://introl.com/blog/why-nvidia-gb300-nvl72-blackwell-ultra-matters
- NVIDIA "fully towards open-source GPU kernel modules" (Blackwell/Grace Hopper require open modules; R560 default): https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/
- NVIDIA driver installation guide, kernel modules (nvidia-open, Turing+): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/kernel-modules.html
- NVIDIA Data Center GPU Driver R580 release notes (current LTS; CDMM for GB200): https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-126-09/index.html
- NVIDIA Fabric Manager user guide (NVSwitch single-node HGX/DGX scope; B200/B300 need 570.xx+): https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
- NVIDIA IMEX service for NVLink networks (Import/Export Service for Internode Memory Sharing): https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html
- NVIDIA Mission Control with GB200/GB300 NVL72 — software stack (leverages Base Command Manager): https://docs.nvidia.com/mission-control/docs/systems-administration-guide/2.0.0/software-stack.html
- NVIDIA Mission Control GB200/GB300 NVL72 high-speed fabric management: https://docs.nvidia.com/mission-control/docs/systems-administration-guide/2.0.0/high-speed-fabric-management.html
- NVIDIA Quantum-X800 InfiniBand platform (144 ports x 800 Gb/s, ConnectX-8, SHARP v4): https://www.nvidia.com/en-us/networking/products/infiniband/quantum-x800/
- NVIDIA DOCA host installation and upgrade (DOCA-OFED): https://docs.nvidia.com/doca/sdk/doca-host+installation+and+upgrade/index.html
- NVIDIA ConnectX-8 SuperNIC PCIe Gen6 800G detail (ServeTheHome): https://www.servethehome.com/nvidia-connectx-8-supernic-pcie-gen6-800g-nic-detailed/
Related: GPU generations · Fabric · Fabric bring-up & benchmarking · Software Stack · Ansible Bring-Up · Physical · Performance · Glossary