Markdown

GH200)¶

Scope: the Hopper datacenter generation (GH100 die, TSMC 4N, 2022-2024): the H100 and H200 GPUs, the GH200 Grace Hopper superchip, their HGX/DGX systems, and the operational, driver, and networking differences versus the older Ampere platform and the newer Blackwell platform. Re-check fast-moving specs on the linked NVIDIA datasheets before relying on a single number.

Figures verified against NVIDIA primary sources as of June 2026. SXM and PCIe variants differ materially in power and interconnect; confirm the exact board SKU against its datasheet.

What it is¶

Hopper is the architecture that introduced the Transformer Engine and FP8 tensor math to NVIDIA datacenter GPUs, alongside fourth-generation NVLink at 900 GB/s and on-GPU Confidential Computing. All Hopper datacenter parts share the GH100 die. The line spans three things that are easy to conflate: the H100 GPU (the workhorse), the H200 (same compute, much more and faster memory), and the GH200 Grace Hopper superchip (a Grace Arm CPU coherently fused to a Hopper GPU). H800 and H20 are reduced-interconnect China-export variants.

flowchart TB
  GH100["GH100 die (TSMC 4N) — FP8 Transformer Engine, NVLink 4"]
  GH100 --> H100["H100: 80 GB HBM3"]
  GH100 --> H200["H200: 141 GB HBM3e, 4.8 TB/s"]
  GH100 --> GH200["GH200: Grace CPU + Hopper GPU, NVLink-C2C"]
  GH100 --> EXPORT["H800 / H20: export-cut NVLink"]
  H100 --> SXM["SXM5 700 W: full NVLink, NVSwitch domain"]
  H100 --> PCIE["PCIe 350 W: NVLink Bridge only, no NVSwitch"]
  SXM --> HGX["HGX / DGX H100/H200: 8 GPU + 4 NVSwitch"]

Lineup & specifications¶

Part	Memory	Bandwidth	NVLink	Form factor / power	MIG	Notes
H100 SXM5	80 GB HBM3	~3.35 TB/s	4th-gen, 900 GB/s	SXM5, 700 W	7 instances	Full NVLink, joins NVSwitch domain
H100 PCIe	80 GB HBM3	~2 TB/s	NVLink Bridge (2-way) only	PCIe Gen5, 350 W	7 instances	No NVSwitch; lower clocks/TDP
H100 NVL	2x 94 GB HBM3	~3.9 TB/s/GPU	Bridged pair	PCIe, ~400 W/GPU	7 instances	LLM-inference paired SKU
H200 SXM	141 GB HBM3e	4.8 TB/s	4th-gen, 900 GB/s	SXM, ~700 W	7 instances	Same GH100 compute as H100
GH200	96/144 GB HBM3 (GPU) + up to 480 GB LPDDR5X (Grace)	HBM ~4 TB/s	NVLink-C2C 900 GB/s CPU↔GPU	superchip module	7 instances	Coherent unified CPU+GPU memory
H800 / H20	as base	as base	reduced interconnect	SXM/PCIe	yes	China-export compliance SKUs

Compute: fourth-generation Tensor Cores with the Transformer Engine, adding FP8 (E4M3/E5M2) on top of FP16/BF16/TF32/INT8 and accelerated FP64; DPX instructions for dynamic-programming workloads; PCIe Gen5. Specs cited from the H100, H200, and Grace Hopper product pages and the Hopper architecture whitepaper (see References).

Operational differences¶

The differences that change how a cluster is built and run:

Driver branch. Datacenter (Tesla/nvidia-open or proprietary) driver, never the GeForce driver. Hopper is served by the datacenter production/LTS branches; pair the driver with a matching Fabric Manager and CUDA toolkit. See GPU software stack and driver upgrade runbook.
NVLink / NVSwitch. Fourth-gen NVLink at 900 GB/s per GPU. SXM5 H100/H200 expose full NVLink and join an NVSwitch domain (8-GPU HGX baseboard with 4 NVSwitches); PCIe H100 has only an optional 2-way NVLink Bridge and no NVSwitch domain. Multi-GPU beyond a bridged pair falls back to PCIe/host (NCCL routes over PCIe). This is the single largest topology difference between SXM and PCIe Hopper.
Fabric Manager is required on NVSwitch systems (HGX/DGX H100/H200) to configure and heal the NVLink fabric; it is not used on PCIe-only nodes.
MIG. Second-generation Multi-Instance GPU, up to 7 instances, with improved per-instance isolation, monitoring, and (Hopper-new) MIG-level confidential compute. See security & multi-tenancy and Kubernetes for GPUs.
Confidential Computing, new in Hopper. Hopper is the first NVIDIA GPU with a hardware TEE: an encrypted, attested CVM path protecting code and data in use. Ampere has none. Blackwell extends the TEE across NVLink (TEE-I/O); Hopper's TEE is per-GPU over the PCIe/CPU boundary. Verify the deployment mode against the NVIDIA Confidential Computing docs.
GPUDirect RDMA and vGPU are supported (datacenter parts), enabling direct NIC-to-GPU DMA for the fabric described in networking fabric.
ECC on HBM is on by default.
Transformer Engine / FP8 is the headline software-visible capability: it dynamically casts layers to FP8 with per-tensor scaling, roughly doubling effective tensor throughput versus BF16 for transformers. FP8 needs Hopper or newer (and Ada); it is absent on Ampere.

Install & setup¶

Reference template only (unexecuted, not hardware-tested); pin exact versions against the driver install guide, the driver and Fabric Manager matrices, and the DOCA-OFED guide for your CUDA target. Hopper is served by the datacenter production branches. As of June 2026 R570 (validated on HGX H100/H200) and the newer R580 are the live production branches; never the GeForce driver. Pick one branch and hold the driver, Fabric Manager, and libnvidia-nscq to the same version; version skew is the most frequent Hopper bring-up failure. Ordered checklist for an HGX/DGX H100/H200 node:

# 0. Confirm the board and kernel before touching drivers
nvidia-smi -L 2>/dev/null || echo "no driver yet"   # H100/H200 SXM expected on HGX/DGX
uname -r                                             # kernel must be in the driver/OFED matrix

# 1. Add NVIDIA's CUDA network repo (cuda-keyring). Replace $distro, e.g. ubuntu2204.
wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# 2. Datacenter driver — pin the branch. Open kernel modules (recommended, Turing+):
sudo apt-get install -y nvidia-open-570            # or nvidia-open-580; proprietary: cuda-drivers-570
#    DKMS note: open modules ship via nvidia-dkms-<branch>-open and rebuild on kernel bumps;
#    a kernel update can silently fail to rebuild the module — re-run `sudo dkms autoinstall`
#    and reboot, then re-check `nvidia-smi` before scheduling work.

# 3. CUDA toolkit (match to the driver's CUDA support; toolkit can trail the driver)
sudo apt-get install -y cuda-toolkit-12-9          # or `cuda-toolkit` for the repo default

# 4. Fabric Manager + NSCQ — REQUIRED on NVSwitch (HGX/DGX H100/H200), version == driver
sudo apt-get install -y nvidia-fabricmanager-570 libnvidia-nscq-570   # branch must match step 2

# 5. ConnectX-7 (NDR) host stack: DOCA-OFED is the 1-to-1 MLNX_OFED replacement.
#    Download the doca-host repo .deb matching your distro from NVIDIA, then:
sudo dpkg -i doca-host_<ver>-<distro>_amd64.deb    # verify <ver> against DOCA release notes
sudo apt-get update
sudo apt-get install -y doca-ofed                  # full OFED stack (drivers + libs + tools)
sudo /etc/init.d/openibd restart                   # reload mlx5; on off-matrix kernels reinstall with --add-kernel-support

# 6. Persistence daemon — recommended on dedicated compute nodes (avoids cold-init latency)
sudo systemctl enable --now nvidia-persistenced

# 7. Bring up + verify the NVLink fabric (NVSwitch nodes only)
sudo systemctl enable --now nvidia-fabricmanager   # must be active BEFORE NCCL multi-GPU
systemctl status nvidia-fabricmanager              # FM/driver mismatch -> service fails, NVLink down
nvidia-smi nvlink --status                         # expect NVLink 4 links up on SXM
ibstat                                             # ConnectX-7 ports: expect Active, NDR

# 8. MIG (optional): enable, then create 7x instances on H100/H200
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C   # profile IDs are SKU-specific

On PCIe-only Hopper nodes (H100 PCIe, H100 NVL pair) there is no NVSwitch. Skip steps 4 and 7's Fabric Manager: do not install or enable nvidia-fabricmanager. Multi-GPU beyond a bridged pair routes over PCIe/host. See Ansible bring-up for fleet automation and GPU software stack for the full version matrix.

When to use it¶

H100: the broad training/inference baseline; mature software, wide availability, FP8 for transformer training and serving.
H200: memory-bound inference and long-context/large-model serving where the jump to 141 GB HBM3e and 4.8 TB/s matters more than raw FLOPS (compute equals H100).
GH200: workloads that benefit from coherent CPU+GPU memory and high CPU-GPU bandwidth (large embeddings, graph, data-prep-adjacent, KV-cache offload).
Prefer SXM for any multi-GPU collective-heavy job (NVLink/NVSwitch); PCIe suits single-GPU, bridged-pair, or power/airflow-constrained nodes.
Choose Hopper over Blackwell when air-cooled, lower-power (700 W vs 1000-1400 W) infrastructure is the constraint, or when FP4/NVFP4 and TEE-I/O are not needed.

Networking¶

Hopper is the NDR InfiniBand era. Generation-specific fabric:

NIC: ConnectX-7, single/dual-port 400 Gb/s NDR InfiniBand (also 400GbE), PCIe Gen5, with In-Network Computing and GPUDirect RDMA giving the NIC a direct DMA path to GPU HBM.
Switch: Quantum-2 (QM9700/QM9750), 64x 400 Gb/s NDR ports (or 128x 200 Gb/s NDR200) on 32 OSFP cages, 51.2 Tb/s aggregate. Rail-aligned, one ConnectX-7 port per GPU rail. Spectrum-X (Spectrum-4 + BlueField-3) is the Ethernet alternative.
Per-port rate: 400 Gb/s line rate per ConnectX-7 NDR port; SHARP in-network reduction offloads collectives.
NVLink: fourth-gen NVLink 4 at 900 GB/s/GPU. On SXM5 H100/H200 the NVLink/NVSwitch domain (8-GPU HGX baseboard, 4 NVSwitches) carries intra-node all-reduce; the InfiniBand rail carries inter-node. PCIe H100 has only an optional 2-way NVLink Bridge and no NVSwitch domain.
Fabric Manager applicability: required on NVSwitch systems (HGX/DGX H100/H200); H200 reuses the HGX H100 8-GPU baseboard. PCIe H100 / H100 NVL: no Fabric Manager.

For the actual bring-up, validation, and benchmark commands (NVLink/NVSwitch checks, ibstat/ibping, nvbandwidth, nccl-tests all-reduce bus-bandwidth), use the shared keystone runbook; do not duplicate it here: Fabric bring-up, validation & benchmarking.

Hopper-specific pointers:

Verify FM before collectives. On HGX/DGX nodes nvidia-fabricmanager must be active or NCCL silently degrades to PCIe and all-reduce bus-bandwidth collapses; check nvidia-smi nvlink --status shows NVLink 4 links up.
NDR cabling/transceivers. Quantum-2 uses OSFP at 400 Gb/s; a single twin-port OSFP can split to 2x NDR200. Confirm transceiver/DAC and OSFP vs QSFP112 before claiming a port is at NDR rate; ibstat width/speed is the source of truth.
Rail alignment. Pin one ConnectX-7 port per GPU rail and set NCCL_IB_HCA accordingly; mis-rail-aligned topologies cost inter-node all-reduce bandwidth even at full NDR.
PCIe Hopper NCCL. On PCIe H100 nodes test NCCL PCIe P2P fallback explicitly (no NVSwitch); do not size collectives at 900 GB/s.

This NDR/ConnectX-7 pairing is the main networking-era difference from Ampere (HDR/ConnectX-6) and Blackwell (XDR/ConnectX-8). See networking fabric.

Gotchas & failure modes¶

PCIe vs SXM confusion. PCIe H100 is 350 W with no NVSwitch domain; expecting 900 GB/s all-reduce on PCIe nodes is a common, costly mistake. Confirm the form factor before sizing collectives.
Fabric Manager not running on an NVSwitch node leaves NVLink unconfigured; NCCL silently degrades to PCIe and throughput collapses. Always verify nvidia-fabricmanager is active first (see NCCL hang runbook).
Driver/Fabric-Manager/CUDA version skew is the most frequent Hopper bring-up failure; all three must come from a compatible set.
Export SKUs (H800/H20) have reduced NVLink/interconnect; do not assume full-H100 collective bandwidth on these parts.
FP8 accuracy. The Transformer Engine helps, but FP8 still needs scaling/calibration care; validate model quality, do not assume drop-in parity with BF16.
H200 ≠ more compute. It is the same GH100 compute as H100; gains are memory-capacity and memory-bandwidth bound only.

References¶

NVIDIA H100 Tensor Core GPU (product page): https://www.nvidia.com/en-us/data-center/h100/
NVIDIA H200 Tensor Core GPU (product page): https://www.nvidia.com/en-us/data-center/h200/
NVIDIA GH200 Grace Hopper Superchip (product page): https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/
NVIDIA Hopper architecture (technology overview): https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
NVIDIA H100 / Hopper architecture whitepaper (GTC22): https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
NVIDIA Grace Hopper Superchip datasheet: https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip
NVIDIA H200 HPC/AI datasheet: https://resources.nvidia.com/en-us-data-center-overview/hpc-datasheet-sc23-h200
NVIDIA Confidential Computing (Hopper TEE): https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/
NVIDIA Fabric Manager user guide: https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
NVIDIA MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
NVIDIA datacenter driver documentation: https://docs.nvidia.com/datacenter/tesla/drivers/index.html
NVIDIA datacenter driver installation guide (Ubuntu): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/ubuntu.html
NVIDIA CUDA installation guide for Linux: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
NVIDIA R570 datacenter driver release notes (HGX H100/H200 validation): https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html
NVIDIA R580 datacenter driver release notes: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html
NVIDIA Fabric Manager apt packaging (package names, version-match requirement): https://github.com/NVIDIA/apt-packaging-fabric-manager
NVIDIA driver persistence / nvidia-persistenced: https://docs.nvidia.com/deploy/driver-persistence/index.html
NVIDIA MLNX_OFED to DOCA-OFED transition guide: https://docs.nvidia.com/doca/sdk/mlnx_ofed-to-doca-ofed-transition-guide/index.html
NVIDIA ConnectX-7 NDR 400G InfiniBand adapter datasheet: https://www.nvidia.com/content/dam/en-zz/Solutions/networking/infiniband-adapters/infiniband-connectx7-data-sheet.pdf
NVIDIA Quantum-2 InfiniBand platform: https://www.nvidia.com/en-us/networking/quantum2/