Scaling toward 100-trillion-parameter models¶
Scope: The systems thesis for the next order of magnitude, covering the memory wall, sparse MoE holding per-token FLOPs flat while capacity grows, rack-scale NVLink domains plus disaggregation, and the goodput/reliability problem at 100k+ GPU scale.
flowchart TB
Model["100T-param model<br/>(~182 TB weights at 16-bit)"]
Model --> Split{"Split what one GPU<br/>cannot hold"}
Split --> DP["Data parallel + FSDP<br/>(shard params/grads/optimizer state)"]
Split --> TP["Tensor parallel<br/>(intra-node NVLink)"]
Split --> PP["Pipeline parallel<br/>(hand-tuned, DualPipe overlap)"]
Split --> EP["MoE expert parallel<br/>(~37B of 680B active/token)"]
DP --> Interconnect["Interconnect: all-reduce / all-to-all<br/>(NCCL, Spectrum-X)"]
TP --> Interconnect
PP --> Interconnect
EP --> Interconnect
Interconnect --> Memory["Memory wall: the binding constraint<br/>(HBM3e, not FLOPs)"]
Interconnect --> Compute["Compute: FLOPs per token<br/>(MoE keeps it ~flat)"]
Memory --> Goodput["Goodput, not utilization<br/>(useful work at 100k+ GPUs)"]
Compute --> Goodput
What it is¶
A "100-trillion-parameter model" is an aspirational milestone, often compared against the roughly 100 trillion synaptic connections in the human neocortex, where one synapse is treated as one parameter.1 It is not a product; it is a forcing function. Reaching it exposes every wall in the stack at once (compute, memory, interconnect, power, and reliability), and none can be solved by buying more of the same hardware.
Two order-of-magnitude numbers frame the problem, both from the book:
-
Compute. A dense 100T model trained on ~29 trillion tokens needs on the order of 1.2 x 10^29 FLOPs. Token count scales this linearly; sparsity (MoE) reduces the effective compute.1 For reference, the book puts GPT-3 at ~3 x 10^23 FLOPs and GPT-4 at ~2 x 10^25 FLOPs, with next-gen "AI factories" engineered for ~10^27–10^28 FLOPs per training run.2 The book is blunt that brute force does not close this gap: an exascale system would need "several millennia of GPU time" for one dense 100T run.1
-
Memory. At 16-bit (2-byte) precision, 100T weights = ~182 TB just to load the model (100e12 params x 16 bits / 8 bits-per-byte ≈ 182 TB).1 That excludes activations, optimizer states, and inference inputs.
The capstone framing: 100T is where hardware, kernels, comms, and inference stop being separable concerns. It is a codesign problem.1
Why it matters¶
The memory wall is the binding constraint, not FLOPs. A single NVIDIA Blackwell B200 carries 192 GB HBM3e (~180 GB usable). 182 TB is ~1,000x that. Loading the weights alone takes close to 1,000 B200 GPUs (192 GB each) or ~700 Blackwell Ultra B300 GPUs (~288 GB each), and that is load-only, before activations and optimizer state.1 At 8 GPUs per node that is ~125 B200 nodes or ~86 B300 nodes.1 Official NVIDIA Blackwell GPU memory figures match the book's per-GPU numbers.3
Optimizer state multiplies the wall. Adam-family optimizers keep momentum and variance estimates, roughly tripling weight memory. At 100T weights that is an extra ~200 trillion values of optimizer state.2 This is why memory-efficient optimizers (Adafactor, Shampoo) and aggressive activation checkpointing (recompute instead of store) stop being optional at this scale.2
Sparsity changes the slope. A sparse Mixture-of-Experts (MoE) routes each token through only a subset of experts, so FLOPs per token stay roughly constant even as total parameters grow.1 DeepSeek-V3 is the worked example: ~680B total parameters, but only ~37B active per token (1 shared expert + 8 of 256 routed = 9 active experts per token).1 Google's 2021 Switch Transformer (1.6T-param MoE) reached dense-model accuracy at a fraction of the compute and trained ~7x faster than a comparable dense approach.1 By the time 100T arrives, the book expects most such models to be sparsely activated.2 Open-weight progress already crossed the trillion-parameter line: Moonshot AI's Kimi K2 is a 1T-param MoE with ~32B active per token.2
Goodput, not utilization, is the scorecard. At 100k+ GPUs, headline device utilization is misleading: a GPU 100% busy on redundant transfers is not productive. Goodput measures useful work (tokens/requests completed) per unit time, discounting stalls, comms, and failed-job restarts.1 Meta's "Revisiting Reliability" work (effective-training-time-ratio) showed that even when a cluster appeared ~100% utilized, 70–75% of compute could be lost to preemptions, network hotspots, fragmentation, and unrecoverable faults.1 At 100T scale a single job can span an entire datacenter, and frontier labs (xAI, OpenAI, Microsoft) are already building 1,000,000+-GPU clusters, so the reliability/goodput problem dominates economics.2
When it is needed (and when not)¶
This is a capstone mental model, not a daily checklist. Apply the 100T lens when:
- You are sizing a frontier training run and need to reason about where the wall is (memory vs comms vs goodput), not just peak FLOPs.
- You are choosing dense vs sparse: if capacity must grow but per-token cost must stay bounded, MoE is the lever.1
- You operate at thousands-to-millions of GPUs, where failure recovery and comms overlap decide cost.
Do not reach for it when:
- The model fits comfortably in one node's HBM: multi-dimensional parallelism and disaggregation add complexity you do not need.
- Latency-critical small-model serving is the goal; MoE routing and weight streaming add machinery that a dense model on one GPU avoids.
- You are tempted to "throw hardware at it." The book's repeated thesis: hardware alone does not solve 100T; software and algorithm codesign are equally or more important.2
A caution from the book on benchmarking at this scale: MLPerf states per-GPU results are not a primary cross-platform metric. Use system-level, end-to-end throughput.1
How: implement, integrate, maintain¶
The book's answer is full-stack codesign across four layers. None is sufficient alone.2
1. Rack-scale NVLink domains (collapse the memory wall locally). The GB200 NVL72 unifies 72 Blackwell GPUs into one NVLink domain so frameworks (PyTorch for training, vLLM for inference) treat the rack as one fabric for tensor / pipeline / expert parallelism.1 Verified figures:
| Metric | Value | Source |
|---|---|---|
| GPUs per rack | 72 (36 Grace CPU + 72 Blackwell GPU) | book1 / NVIDIA3 |
| NVLink 5 per-GPU | 1.8 TB/s bidirectional | book1 / NVIDIA4 |
| Aggregate NVLink (rack) | ~130 TB/s GPU-to-GPU | book1 / NVIDIA3 |
| FP4 (2:1 structured sparsity) | ~1.44 ExaFLOPS / rack | book1 / NVIDIA3 |
| FP8 (2:1 structured sparsity) | ~720 PFLOPS / rack | book1 / NVIDIA3 |
| HBM3e (72 GPUs) | ~13.5 TB (192 GB x 72) | book1 / NVIDIA3 |
| Unified memory (+ Grace) | ~30 TB | book1 / NVIDIA4 |
| Rack power | ~120–132 kW (config/cooling dependent) | book1 / NVIDIA3 |
Treat the domain as non-uniform memory. CUDA Unified Memory can migrate or remotely access pages over NVLink, but remote access has distinct latency/bandwidth: the rack is not a single uniform memory device.1 For managed allocations spanning Grace and Blackwell, prefer explicit prefetch (cudaMemPrefetchAsync) and hinting (cudaMemAdvise) to avoid page-fault stalls.1
# Pre-stage pages on the consuming GPU before the kernel needs them;
# advise the driver on access pattern to suppress fault-driven migration.
cudaMemAdvise(ptr, bytes, cudaMemAdviseSetPreferredLocation, gpu_id)
cudaMemPrefetchAsync(ptr, bytes, gpu_id, stream)
2. Multi-dimensional parallelism (split what one GPU cannot hold). The model is partitioned with data parallelism (DP), fully-sharded data parallelism (FSDP), tensor parallelism (TP), pipeline parallelism (PP), context/sequence parallelism (CP), and MoE expert parallelism, "3D, 4D, 5D" combinations chosen to keep every GPU busy.2 The art is placement: keep the right experts on the right devices at the right time to minimize sparse-data-exchange comms.2 These strategies remain largely hand-tuned; the book notes that fully-automatic pipeline parallelism is still an active research area and current frameworks require explicit pipeline-parallel implementations.2
3. Communication/computation overlap (recover goodput). The dominant pattern across the stack: overlap GPU compute with inter-GPU comms so neither idles. DeepSeek's DualPipe is the canonical example: a bidirectional pipeline algorithm that overlaps forward/backward compute with communication to shrink pipeline bubbles, paired with custom CUDA kernels that bypass default NCCL collectives to mask the export-limited H800's ~400 GB/s NVLink (vs H100's ~900 GB/s).1 For collectives (all-reduce, all-to-all, all-gather) the stack is NCCL; adaptive Ethernet fabrics like NVIDIA Spectrum-X (RoCE, adaptive routing, high bisection bandwidth) cut congestion at 10,000+-GPU scale.2
4. Disaggregation and weight streaming (scale inference past HBM). Inference splits into disaggregated prefill and decode, with KV-cache moved between stages over UCX with GPUDirect RDMA, NCCL point-to-point, or framework transports.1 NVIDIA Dynamo coordinates request scheduling, KV-cache placement, and data movement across GPUs/nodes (and integrates NIXL for low-latency KV transfer).1 Weight streaming / activation offloading pull a model's layers from host memory to GPU only when needed (e.g., during decode), letting parts of a 100T model live on cheaper storage tiers.2
Maintain (the reliability discipline). Drive goodput toward 100% by measuring it directly with NVIDIA Nsight Systems/Compute and the PyTorch profiler, then attacking the largest stall.1 At 100k+ GPUs assume faults are routine: make checkpoints fast and snapshot-able, design for failure recovery and rerouting (Dynamo reroutes around node failures), and watch for ECC/HBM channel errors as crash precursors.2 A 20% cluster-efficiency gain translates to millions of dollars at scale: the economic case for this work.1
Forward-looking (announced / expected; date-sensitive, verify before relying). The book projects the memory wall easing with HBM4 on the Rubin generation, citing ~1.6 TB/s per stack and 48–64 GB per module, enabling boards with 512 GB–1 TB of HBM.2 As of mid-2026, NVIDIA's official Vera Rubin (VR200) figures are more aggressive than the book's per-stack estimate: 288 GB HBM4 per GPU and ~22 TB/s aggregate per-GPU bandwidth across 8 stacks; the rack unit, originally previewed as "NVL144," is now the VR200 NVL72 (72 GPU packages, 36 Vera CPUs) with NVLink 6.5 The published cadence is Rubin (2026), Rubin Ultra (2027), Feynman (2028).5 The book's own roadmap mentions Vera Rubin VR200 (2026) and Feynman (2028) continuing the per-generation doubling pattern.1 Prefer current NVIDIA specs over the book's pre-publication estimates where they disagree. No hardware-tested claims are made here; all figures are vendor-published or book-stated.
References¶
- Chris Fregly, AI Systems Performance Engineering (O'Reilly). Chapter 1, "Introduction and AI System Overview" (memory wall, 182 TB / ~1,000 B200 estimate, DeepSeek-V3 MoE, NVL72 specs, goodput, MLPerf caveat). 1
- Chris Fregly, AI Systems Performance Engineering (O'Reilly). Chapter 20, "AI-Assisted Performance Optimizations and Scaling Toward Multimillion GPU Clusters" (GPT-3/4 FLOPs, optimizer-state memory, HBM4/Rubin outlook, disaggregation, multi-dimensional parallelism, Kimi K2). 2
- NVIDIA, "GB200 NVL72" product page — 72-GPU NVLink domain, 1.4 EFLOPS FP4, 720 PFLOPS FP8, 130 TB/s NVLink, 13.4 TB GPU memory, up to 120 kW. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ 3
- NVIDIA Technical Blog, "GB200 NVL72 Delivers Trillion-Parameter LLM Training and Real-Time Inference" — 1.8 TB/s per-GPU NVLink 5, 30 TB unified memory, 576-GPU domain. https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/ 4
- NVIDIA, "Vera Rubin NVL72" platform page and GTC 2026 coverage — VR200 288 GB HBM4, ~22 TB/s per-GPU bandwidth, NVL72 naming, NVLink 6, Rubin/Rubin Ultra/Feynman cadence (announced, date-sensitive). https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/ 5
Related: NVIDIA Blackwell Datacenter Platform · NVIDIA GPU Roadmap · Mixture-of-Experts: Sparse Scaling · NVSwitch and NVLink · CUDA Unified Memory and NVLink-C2C Page Migration · Goodput: Measuring Useful AI Throughput · Mechanical Sympathy and Hardware-Software Codesign · NVIDIA Grace CPU · Glossary
-
Fregly, AI Systems Performance Engineering, Ch. 1. ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩
-
Fregly, AI Systems Performance Engineering, Ch. 20. ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩
-
NVIDIA Technical Blog, GB200 NVL72 trillion-parameter post. ↩↩↩
-
NVIDIA Vera Rubin NVL72 page / GTC 2026 (announced, expected; verify before relying). ↩↩↩