The NVIDIA GPU roadmap
Scope: The datacenter GPU/platform cadence, Hopper to Blackwell to Blackwell Ultra (B300/GB300) to Vera Rubin to Feynman, covering what each generation changes (precision, NVLink, memory, rack scale) and how to plan procurement and software against a moving target.
What it is
NVIDIA ships a new datacenter GPU architecture on a roughly annual cadence. Each generation pairs the GPU with a co-designed ARM CPU in a "superchip" module, then combines that module with a new NVLink switch generation, DPU, and NIC into a rack-scale platform. The progression as of mid-2026:
| Generation | GPU | Superchip | Rack | Status |
|---|---|---|---|---|
| Hopper | H100 / H200 | Grace Hopper GH200 | (8-GPU HGX / GH NVL32) | Shipping (prior gen) |
| Blackwell | B200 | Grace Blackwell GB200 | GB200 NVL72 | Shipping |
| Blackwell Ultra | B300 | Grace Blackwell Ultra GB300 | GB300 NVL72 | Shipping |
| Rubin | Rubin (R200-class) | Vera Rubin VR200 | Vera Rubin NVL72 | Ramping into production, H2 2026 (announced) |
| Rubin Ultra | Rubin Ultra (quad-die) | Vera Rubin Ultra | NVL576 ("Kyber") | Expected H2 2027 (announced) |
| Feynman | Feynman | Rosa Feynman | Not disclosed | Expected 2028 (announced) |
The book frames this as a deliberate strategy: NVIDIA "seems to be doubling something every generation, every year if possible" (one year memory, the next the number of dies, the next interconnect bandwidth), so the compound effect over a few years is large.1 The platform-level building block is the NVL72 rack: 72 GPUs + 36 Grace CPUs wired into a single NVLink domain that behaves like one large GPU.2
Note: several forward-looking specs in the source book (Rubin HBM bandwidth, Vera memory, NVLink 6 figures) were written before Rubin's full specs were public. Where the book and NVIDIA's now-published specs diverge, this page uses the official NVIDIA numbers and flags the difference.
Why it matters
Procurement and software lifecycles are 2-4 years; the hardware cadence is ~1 year. That mismatch forces decisions:
- Memory ceiling moves every year. A model that needs partitioning today may fit on one module next generation. Per-GPU HBM went 80 GB (H100) -> 180 GB usable (B200) -> 288 GB (B300, Rubin).345
- Precision floor drops. Hopper introduced FP8 in the Transformer Engine; Blackwell added NVFP4 (4-bit), roughly doubling FP8 throughput where accuracy tolerates it.6 Code written to a fixed precision leaves performance on the table after an upgrade.
- Interconnect and rack scale grow. NVLink per-GPU aggregate doubled from ~900 GB/s (Hopper-era links) to 1.8 TB/s (NVLink 5, Blackwell), and Rubin moves to NVLink 6.78 Collective-heavy training that is communication-bound on one generation can become compute-bound on the next.
- Power and cooling step up hard. NVL72 (GB200/GB300) draws ~130 kW per rack and is liquid-cooled only.9 The book reports that Vera Rubin and Rubin Ultra racks target roughly 5x the performance of GB200/GB300 NVL72 at ~5x the power, nearly 600 kW per rack.1011 Facility planning, not chip availability, often gates deployment.
If you procure or write software against a single static spec, you either over-fit to hardware you will replace or fail to exploit hardware you just bought.
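The memory-ceiling point can be checked with simple arithmetic. The sketch below uses the per-GPU HBM capacities quoted above and asks whether a model's weights fit on a single GPU. The `headroom` factor and the helper names are assumptions made for illustration, and the 4-bit entry ignores the block scale factors that NVFP4 stores alongside the values.

```python
# Per-GPU HBM capacity in GB, as quoted in the list above.
HBM_GB = {"H100": 80, "B200": 180, "B300": 288, "Rubin": 288}

# Bytes per parameter for weight storage. The 4-bit figure ignores
# the per-block scale factors that NVFP4 also stores.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}


def weights_gb(params_billion: float, fmt: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[fmt]


def fits(params_billion: float, fmt: str, gpu: str, headroom: float = 0.8) -> bool:
    """True if the weights fit within `headroom` of one GPU's HBM.

    `headroom` is a hypothetical planning margin that reserves the rest
    of memory for KV cache, activations, and runtime overhead.
    """
    return weights_gb(params_billion, fmt) <= HBM_GB[gpu] * headroom


if __name__ == "__main__":
    for gpu in HBM_GB:
        print(gpu, fits(70, "bf16", gpu))
```

Under these assumptions, a 70B-parameter model in BF16 (140 GB of weights) needs partitioning on an H100 but fits on a single B200 or B300.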
When it is needed (and when not)
Plan explicitly against the roadmap when:
- You are sizing a multi-year cluster buildout or signing a capacity contract, where the generation boundary changes performance-per-dollar and performance-per-watt materially.
- Your models are memory-bound or communication-bound today; a generation jump may remove the bottleneck and change your parallelism strategy.
- You operate your own facility and must provision power/cooling 12-24 months ahead (130 kW today, ~600 kW for Rubin-class racks).11
You can largely ignore generation timing when:
- You rent slices (MIG partitions, cloud "rack-as-a-service"); the provider absorbs the cadence, and you consume an abstracted API.12
- Your workloads are small or latency-insensitive and already satisfied by current hardware; chasing the latest node yields little ROI.
- Your software is written to NVIDIA's abstraction libraries (CUDA, CUTLASS, Triton, Transformer Engine, NCCL) rather than to hardware-specific assumptions, so most of the generational gain arrives through library updates "with minimal code changes."13
Do not buy ahead of a generation boundary for marginal workloads: the book's own ROI framing is performance-per-dollar over a 1-2 year payback, contingent on keeping the hardware busy 24/7.14
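The dependence of payback on utilization can be sketched as follows. This is a simplified model, not the book's calculation: the function name, the linear value-per-month assumption, and every dollar figure in the example are hypothetical placeholders to be replaced with your own numbers.

```python
def payback_months(
    capex_usd: float,
    monthly_value_at_full_util_usd: float,
    utilization: float,
    monthly_opex_usd: float = 0.0,
) -> float:
    """Months until cumulative net value covers the purchase price.

    Assumes value scales linearly with utilization (a simplification).
    Returns infinity when operating cost exceeds the value produced.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    net_per_month = monthly_value_at_full_util_usd * utilization - monthly_opex_usd
    if net_per_month <= 0:
        return float("inf")
    return capex_usd / net_per_month


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    for util in (0.9, 0.4, 0.1):
        print(util, payback_months(3_000_000, 250_000, util, 30_000))
```

With these placeholder inputs, payback is about 15 months at 90% utilization and about 43 months at 40%, and the hardware never pays back at 10%. That is the practical meaning of "contingent on keeping the hardware busy."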
How: implement, integrate, maintain
Read the cadence, not the press release
NVIDIA confirmed an annual architecture cadence at GTC 2025 and has reaffirmed it through CES 2026 / GTC 2026: Blackwell -> Rubin (H2 2026) -> Rubin Ultra (H2 2027) -> Feynman (2028).18 Treat every future generation's specs as announced/expected and date-sensitive, and re-verify against nvidia.com/data-center before committing budget.
What each generation actually changes
Blackwell Ultra (B300 / GB300): a drop-in upgrade to NVL72, same NVLink 5 fabric as GB200. Per the book: 288 GB of HBM (1.6x the 180 GB usable on B200), ~1.5x AI compute, and larger on-die attention/NVFP4 accelerators, yielding ~45-50% higher inference throughput.3 NVIDIA's published GB300 NVL72 figures: 72 Blackwell Ultra GPUs + 36 Grace CPUs, 288 GB HBM3e per GPU at 8 TB/s, ~1.1 EFLOPS FP4 per rack.416 Engineering note: B300 keeps 8 TB/s HBM bandwidth despite the larger 288 GB capacity (12-Hi HBM3e stacks), so capacity grew but per-GPU bandwidth did not.16
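One way to read the engineering note is through the minimum time to read all of HBM once, which is capacity divided by bandwidth. The worked figures below assume B200 runs at the same 8 TB/s the note implies, and they use peak bandwidth, which real kernels only approach:

$$
t_{\text{sweep}} = \frac{C_{\text{HBM}}}{BW_{\text{HBM}}}, \qquad
t_{\text{B200}} = \frac{180\ \text{GB}}{8\ \text{TB/s}} = 22.5\ \text{ms}, \qquad
t_{\text{B300}} = \frac{288\ \text{GB}}{8\ \text{TB/s}} = 36\ \text{ms}
$$

A workload that fills the larger memory and streams all of it on every step therefore takes longer per sweep on B300, even though it holds more on one GPU.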
Vera Rubin (VR200, expected H2 2026): the first new architecture since Blackwell. Vera is the Grace successor (custom ARM Olympus cores); Rubin is the Blackwell successor; the superchip pairs one Vera CPU with two Rubin GPUs.15
The book (pre-launch) estimated Rubin HBM at ~13-14 TB/s, Vera LPDDR6 at ~1 TB/s, and NVLink 6 "doubling" link bandwidth.15 NVIDIA's published specs supersede these:
- Rubin GPU (per package, NVIDIA-published):
    - 336 billion transistors (two reticle-sized dies)
    - HBM4: up to 288 GB at up to 22 TB/s
    - NVLink 6: 3.6 TB/s bidirectional per GPU
    - NVFP4: 50 PFLOPS inference / 35 PFLOPS training
- Vera CPU (per package):
    - 88 custom Olympus ARM cores
    - LPDDR5X: up to 1.5 TB at up to 1.2 TB/s
    - NVLink-C2C: 1.8 TB/s
- Vera Rubin NVL72 (rack):
    - 72 Rubin GPUs + 36 Vera CPUs
    - 20.7 TB HBM4, ~1,580 TB/s aggregate
    - NVLink 6 switch fabric: 260 TB/s
    - 3,600 PFLOPS NVFP4 inference / 2,520 PFLOPS training
Sources: NVIDIA Rubin developer blog and Vera Rubin NVL72 product page.58 Note the divergence from the book: official HBM4 bandwidth (22 TB/s) is materially higher than the book's ~13-14 TB/s estimate, and the platform integrates six co-designed chips: Rubin GPU, Vera CPU, NVLink 6 switch, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet switch.5 The memory generation jumps HBM3e -> HBM4; the NVLink generation jumps 5 -> 6.
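The rack-level figures above are the per-GPU figures multiplied by 72, which gives a quick consistency check when specs are revised. A minimal sketch of that arithmetic, where the `GpuSpec` class and `rack_totals` helper are illustrative names rather than anything NVIDIA publishes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    hbm_gb: float               # HBM capacity per GPU, GB
    hbm_tbps: float             # HBM bandwidth per GPU, TB/s
    nvlink_tbps: float          # NVLink bandwidth per GPU, TB/s
    nvfp4_infer_pflops: float   # NVFP4 inference, PFLOPS
    nvfp4_train_pflops: float   # NVFP4 training, PFLOPS


# Per-GPU Rubin figures as quoted above.
RUBIN = GpuSpec(288, 22, 3.6, 50, 35)


def rack_totals(gpu: GpuSpec, n_gpus: int = 72) -> dict:
    """Scale per-GPU figures linearly to a rack of n_gpus."""
    return {
        "hbm_tb": gpu.hbm_gb * n_gpus / 1000,
        "hbm_tbps": gpu.hbm_tbps * n_gpus,
        "nvlink_tbps": gpu.nvlink_tbps * n_gpus,
        "infer_pflops": gpu.nvfp4_infer_pflops * n_gpus,
        "train_pflops": gpu.nvfp4_train_pflops * n_gpus,
    }


if __name__ == "__main__":
    for name, value in rack_totals(RUBIN).items():
        print(f"{name}: {value:,.1f}")
```

The results (about 20.7 TB of HBM4, 1,584 TB/s, 259 TB/s of NVLink, 3,600 and 2,520 PFLOPS) match the published rack figures after rounding.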
Rubin Ultra (R300, expected H2 2027) moves to a quad-die GPU module. The book describes four GPU dies and 16 HBM stacks totaling ~1 TB HBM per module, with NVL144 (144 dies) and NVL576 configurations.17 NVIDIA's NVL576 ("Kyber" rack) is announced to pack up to 144 quad-chiplet GPUs = 576 compute dies, targeting ~14x GB300 NVL72 inference/training performance, H2 2027.1819 Treat the 1 TB/module and per-rack exaFLOPS figures as expected/speculative.
Feynman (expected 2028): a post-Rubin generation, paired with the Rosa CPU. The book speculates a finer 2 nm node, HBM5, more in-module DDR, and possibly doubling dies again (four to eight).20 NVIDIA has placed Feynman on the public roadmap for 2028 with "advanced 3D stacking" and custom HBM, but detailed specs are not yet published, so treat all Feynman numbers as roadmap-only.1819
Plan procurement against a moving target
- Buy for performance-per-watt at the generation you can power and cool. The book's ROI case: 100 H100s -> 50 Blackwell GPUs for the same work at lower aggregate power; payback often 1-2 years if utilization is high.14 Re-run this each generation with current prices.
- Provision power/cooling for the generation after the one you buy. At 130 kW, an NVL72 rack is already beyond what air cooling can handle; Rubin-class racks target ~600 kW.911 Facility lead time exceeds chip lead time.
- Keep workloads intra-rack. Across every generation the design intent is to maximize intra-NVLink-domain traffic and minimize inter-rack InfiniBand/Ethernet hops.21 Software that respects the rack boundary survives generation changes; software that assumes a flat fabric does not.
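The power-provisioning point can be made concrete by asking how many racks a fixed facility envelope supports at each generation. The sketch below is a rough planning aid. The 5 MW facility and the PUE of 1.2 are hypothetical values, and the function name is illustrative.

```python
import math


def racks_supported(facility_kw: float, rack_kw: float, pue: float = 1.2) -> int:
    """Whole racks that fit in a facility power envelope.

    `pue` (power usage effectiveness) scales IT load up to total facility
    load; the default of 1.2 is an assumed value, not a measured one.
    """
    if rack_kw <= 0 or pue < 1.0:
        raise ValueError("rack_kw must be positive and pue must be >= 1.0")
    return math.floor(facility_kw / (rack_kw * pue))


if __name__ == "__main__":
    facility_kw = 5_000  # hypothetical 5 MW facility
    print("NVL72 at ~130 kW:", racks_supported(facility_kw, 130))
    print("Rubin-class at ~600 kW:", racks_supported(facility_kw, 600))
```

Under these assumptions, the same 5 MW that hosts 32 NVL72 racks hosts 6 racks at ~600 kW, which is why the facility plan has to be sized for the next generation rather than the current one.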
Integrate and maintain across generations
- Target the abstraction libraries (CUDA, CUTLASS, OpenAI Triton, Transformer Engine, NCCL/NVSHMEM), not hardware constants. The book's repeated claim: most generational gains arrive through library/framework updates with minimal code changes, including new precision support.13
- Parameterize precision. Enable mixed precision via the Transformer Engine rather than hard-coding FP16/FP8; new generations add formats (NVFP4 in Blackwell; second-gen FP4 speculated for Rubin) the TE can adopt where accuracy holds.615
- Re-profile after every upgrade. A kernel that is memory-bound on Hopper may be compute-bound on Blackwell/Rubin because bandwidth, cache, and FLOPS all moved. Use the NVL72 telemetry path (DCGM: utilization, NVLink throughput, power) to confirm where the new bottleneck sits.22
- Re-verify forward specs before each buy. Vendor roadmap numbers for un-shipped parts change between GTC events. Confirm against NVIDIA's official data-center pages at decision time.
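For the "parameterize precision" item, one option is to pick the numeric format from the device's CUDA compute capability rather than hard-coding it. The sketch below is plain Python and does not call the Transformer Engine. `pick_precision` is a hypothetical helper, and the thresholds (8.0 for BF16, 8.9 for FP8, 10.0 for FP4) are assumptions to verify against NVIDIA's compute-capability tables before use.

```python
from typing import Tuple


def pick_precision(
    capability: Tuple[int, int],
    allow_fp8: bool = True,
    allow_fp4: bool = False,
) -> str:
    """Choose the lowest-precision format the device and the model allow.

    `capability` is a (major, minor) compute capability, for example the
    tuple returned by torch.cuda.get_device_capability(). The allow_*
    flags record whether accuracy validation has passed for that format.
    """
    if allow_fp4 and capability >= (10, 0):   # assumed: Blackwell-class and later
        return "nvfp4"
    if allow_fp8 and capability >= (8, 9):    # assumed: FP8-capable tensor cores
        return "fp8"
    if capability >= (8, 0):                  # assumed: BF16-capable tensor cores
        return "bf16"
    return "fp16"


if __name__ == "__main__":
    print(pick_precision((9, 0)))                    # Hopper-class device
    print(pick_precision((10, 0), allow_fp4=True))   # Blackwell-class device
```

Keeping the accuracy decision (`allow_fp4`) separate from the hardware check means an upgrade changes the result only after the new format has been validated for the model.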
```mermaid
flowchart LR
    H["Hopper H100<br/>FP8, NVLink 4<br/>80 GB HBM3"] --> B["Blackwell B200<br/>NVFP4, NVLink 5<br/>180 GB usable"]
    B --> BU["Blackwell Ultra B300<br/>288 GB HBM3e<br/>same NVLink 5"]
    BU --> R["Vera Rubin (H2 2026)<br/>HBM4 22 TB/s, NVLink 6<br/>announced"]
    R --> RU["Rubin Ultra (H2 2027)<br/>quad-die, NVL576<br/>announced"]
    RU --> F["Feynman (2028)<br/>HBM5?, more dies?<br/>roadmap only"]
```
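The re-profiling advice can be grounded with a roofline check: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the device's ridge point (peak FLOPS divided by memory bandwidth). The sketch below derives ridge points from peak figures quoted on this page. The per-GPU B300 number is the rack figure divided by 72, the two vendors' peak figures may not be stated on the same basis, and the kernel is hypothetical.

```python
def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """FLOPs per byte at which a device shifts from memory- to compute-bound."""
    return peak_flops / mem_bw


def classify(flops: float, bytes_moved: float, peak_flops: float, mem_bw: float) -> str:
    """Classify a kernel against a device's roofline using peak figures."""
    intensity = flops / bytes_moved
    if intensity >= ridge_point(peak_flops, mem_bw):
        return "compute-bound"
    return "memory-bound"


# Peak FP4 FLOPS and HBM bandwidth (bytes/s) from figures quoted on this page.
B300 = {"peak_flops": 1.1e18 / 72, "mem_bw": 8e12}   # rack figure / 72 GPUs
RUBIN = {"peak_flops": 50e15, "mem_bw": 22e12}

# Hypothetical kernel: 2e12 FLOPs over 1e9 bytes, so 2000 FLOP/byte.
KERNEL = {"flops": 2.0e12, "bytes_moved": 1.0e9}

if __name__ == "__main__":
    for name, dev in (("B300", B300), ("Rubin", RUBIN)):
        print(name, round(ridge_point(**dev)), classify(**KERNEL, **dev))
```

With these inputs the same kernel sits on different sides of the ridge on the two devices, so a bottleneck diagnosis from one generation should not be carried over to the next without measured telemetry.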
References
- Chris Fregly, AI Systems Performance Engineering, O'Reilly Media. Chapter 2, "AI System Hardware Overview" — sections "A Glimpse into the Future: NVIDIA's Roadmap," "Compute Density and Power Requirements," "ROI of Upgrading Your Hardware," "Key Takeaways."
- NVIDIA Vera Rubin NVL72 product page
- Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer — NVIDIA Technical Blog
- NVIDIA Vera Rubin Platform overview
- NVIDIA GB300 NVL72 product page
- NVIDIA Data Center
- Tom's Hardware: NVIDIA enterprise roadmap — Rubin, Rubin Ultra, Feynman
- Spheron: NVIDIA B300 (Blackwell Ultra) specs
- Introl: NVIDIA Vera Rubin 600 kW racks
Related: NVIDIA Blackwell Platform · GPU Generations · Grace CPU · DGX Spark · CUDA Unified Memory · NVSwitch & NVLink · MoE Sparsity Scaling · Scaling to 100T Parameters · Mechanical Sympathy & Codesign · Datacentre Physical · GPU Power & Thermal Tuning · AI-Driven Performance Optimization · Goodput for AI Systems · Glossary
- Fregly, Ch. 2, "Feynman GPU (2028) and Doubling Something Every Year": "NVIDIA seems to be doubling something every generation, every year if possible."
- Fregly, Ch. 2, "Ultrascale Networking Treating Many GPUs as One": NVL72 = 72 Blackwell GPUs + 36 Grace CPUs in one NVLink domain.
- Fregly, Ch. 2, "Blackwell Ultra and Grace Blackwell Ultra": B300 has ~288 GB vs B200's 180 GB, 1.5x compute, 45-50% higher inference throughput; same NVLink 5.
- NVIDIA GB300 NVL72 product page: 72 Blackwell Ultra GPUs + 36 Grace CPUs, ~1.1 EFLOPS FP4 per rack.
- NVIDIA Technical Blog, "Inside the NVIDIA Rubin Platform": Rubin 336B transistors, HBM4 up to 288 GB at up to 22 TB/s, NVLink 6 3.6 TB/s/GPU, 50/35 PFLOPS NVFP4; Vera 88 Olympus cores; six co-designed chips.
- Fregly, Ch. 2, "NVIDIA GPU Tensor Cores and Transformer Engine": Hopper TE introduced FP8 (2x FP16); Blackwell adds NVFP4 (4-bit), potentially 2x FP8.
- Fregly, Ch. 2, "NVLink and NVSwitch": NVLink 5 = 1.8 TB/s aggregate bidirectional per GPU, double the prior Hopper-era per-GPU NVLink bandwidth.
- NVIDIA Vera Rubin NVL72 product page: 72 Rubin GPUs + 36 Vera CPUs, 20.7 TB HBM4, ~1,580 TB/s, NVLink 6 switch 260 TB/s, 3,600/2,520 PFLOPS NVFP4.
- Fregly, Ch. 2, "Compute Density and Power Requirements" / "Liquid Cooling Versus Air Cooling": NVL72 ~130 kW, liquid-cooled only.
- Fregly, Ch. 2, "Rubin Ultra and Vera Rubin Ultra (2027)": Vera Rubin / Vera Rubin Ultra racks deliver ~5x GB200/GB300 NVL72 performance at ~5x power, nearly 600 kW per rack.
- Fregly, Ch. 2, "Feynman GPU (2028) and Doubling Something Every Year": NVIDIA envisions offering a "rack as a service" so companies rent a slice rather than building it.
- Fregly, Ch. 2, "Key Takeaways" / "Modern software stack support": gains arrive via CUDA, CUTLASS, Triton, Transformer Engine with minimal code changes; native FP8/FP4 support.
- Fregly, Ch. 2, "ROI of Upgrading Your Hardware": 100 H100 -> 50 Blackwell case study; payback often 1-2 years at high utilization.
- Fregly, Ch. 2, "Vera Rubin Superchip (2026)": Vera (ARM, 3 nm) + two Rubin GPUs per module; book pre-launch estimates HBM ~13-14 TB/s, LPDDR6 ~1 TB/s, NVLink 6 doubling link bandwidth.
- Spheron, "NVIDIA B300 (Blackwell Ultra) Guide": 288 GB HBM3e (12-Hi stacks) at 8 TB/s, 1,400 W TDP.
- Fregly, Ch. 2, "Rubin Ultra and Vera Rubin Ultra (2027)": quad-die module, 16 HBM stacks ~1 TB HBM, NVL144 / NVL576 configurations.
- NVIDIA annual cadence confirmed GTC 2025, reaffirmed CES/GTC 2026: Blackwell -> Rubin (H2 2026) -> Rubin Ultra (H2 2027) -> Feynman (2028). See Tom's Hardware roadmap coverage.
- Tom's Hardware, "NVIDIA enterprise roadmap: Rubin, Rubin Ultra, Feynman and silicon photonics": NVL576 "Kyber" up to 144 quad-chiplet GPUs (576 dies), ~14x GB300 NVL72, H2 2027; Feynman 2028 with Rosa CPU, 3D stacking, custom HBM.
- Fregly, Ch. 2, "Feynman GPU (2028) and Doubling Something Every Year": speculative 2 nm node, HBM5, more in-module DDR, possibly doubling dies four to eight.
- Fregly, Ch. 2, "Ultrascale Networking Treating Many GPUs as One": keep communication intra-rack over NVLink/NVSwitch; use InfiniBand/Ethernet inter-rack only when necessary.
- Fregly, Ch. 2, "Performance Monitoring and Utilization in Practice": DCGM tracks GPU utilization, memory, temperature, NVLink throughput, power.