
Glossary

Scope: concise definitions of the terms used across the knowledge base. This is a reference glossary, not a single WHAT/WHY/WHEN/HOW topic.

Each term is defined once here and linked from everywhere else.

flowchart LR
  TERMS["Glossary terms"] --> HARDWARE["Hardware"]
  TERMS --> STACK["Software stack"]
  TERMS --> OPS["Operations"]
  TERMS --> MODELS["Models and post-training"]

Networking

InfiniBand (IB): low-latency RDMA interconnect for HPC/AI. Generations: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s per port per direction.
NDR: Next Data Rate, 400 Gb/s IB (Quantum-2). 100G-PAM4 per lane.
XDR: eXtreme Data Rate, 800 Gb/s IB (Quantum-X800). 200G-PAM4 per lane.
Quantum-X800: NVIDIA's XDR IB platform (Quantum-3 ASIC). Switches: Q3400-RA (4U, 144x 800G), Q3200-RA (2U), Q3450-LD (co-packaged optics).
Spectrum-X / Spectrum-4: NVIDIA's Ethernet platform for AI; RoCEv2 for RDMA.
RoCE: RDMA over Converged Ethernet; the Ethernet path to RDMA.
ConnectX-8 (CX8): 800G SuperNIC, IB XDR or 2x400GbE, PCIe Gen6. ConnectX-7 (CX7) is the 400G NDR predecessor.
SuperNIC: NVIDIA's GPU-attached NIC for AI fabrics.
OSFP: octal small form-factor pluggable; the 800G transceiver/cage form factor. IHS (integrated heat sink, finned-top; switch-side, twin-port) and RHS (riding heat sink, flat-top; NIC-side, single-port) modules are not interchangeable.
SHARP: Scalable Hierarchical Aggregation and Reduction Protocol; in-network reduction for collectives. v4 on Quantum-X800. NVLS: the NVLink SHARP variant (in-network reduction over NVLink).
AR: Adaptive Routing. SHIELD: self-healing fabric.
Subnet Manager (SM): core IB controller (init, routing, partitioning). OpenSM is the open implementation.
UFM: Unified Fabric Manager; NVIDIA's IB management platform (provisioning, monitoring, diagnostics). SM is a component of it. Variants: Telemetry, Enterprise, Cyber-AI.
PKey: partition key; IB network partitioning/isolation.
Fat-tree / leaf-spine: multi-tier Clos topology; non-blocking when built 1:1 (no oversubscription). Super-spine: third tier for very large fabrics.
Skyway: IB-to-Ethernet gateway.
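
The fat-tree and super-spine entries above imply a sizing rule that is easy to check by hand. The sketch below assumes identical switches of a given radix, one port per endpoint, and a 1:1 (non-blocking) build; `fat_tree_endpoints` is a hypothetical helper name, and the radix of 144 comes from the Q3400-RA entry.

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat-tree of identical switches.

    Every tier except the top splits its ports half down, half up;
    the top tier points all ports down. That gives 2 * (radix / 2) ** tiers.
    """
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even number")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    # Leaf-spine (2 tiers) vs. leaf-spine-super-spine (3 tiers) at radix 144.
    for tiers in (2, 3):
        print(tiers, fat_tree_endpoints(144, tiers))
```

Real fabrics usually land below these ceilings because of rail alignment, oversubscription choices, and ports reserved for storage or management.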

Platform

Ampere: NVIDIA GPU architecture (2020): A100/A30 (GA100, TSMC 7nm) and A40/A10 (GA102, Samsung 8nm); 3rd-gen Tensor Cores (TF32, no FP8); on A100, 3rd-gen NVLink 600 GB/s and MIG up to 7.
Hopper: NVIDIA GPU architecture (2022, GH100, TSMC 4N): H100/H200/GH200; 4th-gen Tensor + Transformer Engine FP8, NVLink4 900 GB/s, Confidential Computing.
Blackwell: NVIDIA GPU architecture (2024-2025): B200/B300/GB200/GB300 datacenter, RTX 50 consumer, RTX PRO 6000 / GB10; 5th-gen Tensor + FP4/NVFP4, NVLink5 1.8 TB/s, TEE-I/O.
B300: Blackwell Ultra GPU. 288 GB HBM3e, ~15 PFLOPS dense FP4, 1,400 W TDP.
GB300 NVL72: rack-scale system: 72 B300 + 36 Grace, liquid-cooled, up to ~142 kW, single 72-GPU NVLink domain.
GH200 / Grace Hopper: superchip: 1 Grace CPU + 1 Hopper GPU joined by NVLink-C2C 900 GB/s coherent memory.
GB10 / DGX Spark: Grace Blackwell desktop superchip (20-core Arm + Blackwell GPU); 128 GB LPDDR5x unified memory @ ~273 GB/s; ~1 PFLOP FP4 (with sparsity); runs DGX OS; cluster two units over ConnectX-7.
L40S: Ada Lovelace datacenter card: 48 GB GDDR6 ECC, ~350 W, vGPU, no NVLink, no MIG; inference/graphics workhorse.
RTX PRO 6000 Blackwell: workstation/server GPU: 96 GB GDDR7 with ECC, 600 W, MIG up to 4, vGPU; no NVLink. Editions: Workstation, Server, Max-Q.
SXM vs PCIe: GPU form factors: SXM is the NVLink/NVSwitch socketed module (datacenter, busbar power); PCIe is the standard add-in card (PCIe-only multi-GPU, 8-pin/16-pin power).
Unified memory (LPDDR5x): coherent CPU+GPU memory pool over NVLink-C2C/ATS (Grace Hopper, GB10/DGX Spark); not discrete HBM/GDDR VRAM.
12VHPWR / 12V-2x6: 16-pin PCIe power connectors: 12VHPWR on Ada consumer (RTX 40), 12V-2x6 (a revision) on Blackwell consumer (RTX 50).
Grace: NVIDIA Arm CPU (Neoverse V2). NVLink-C2C: coherent CPU-GPU link.
NVLink: intra-node/intra-rack GPU interconnect. 5th gen: 1.8 TB/s per GPU. NVSwitch: the switch ASIC forming the NVLink domain.
HBM3e: high-bandwidth memory; 288 GB per B300 via 12-high stacks.
NVFP4 / FP4 / FP8: low-precision tensor formats; Blackwell Ultra is tuned for these.
DGX / HGX B300: 8-GPU node form factors. SuperPOD: reference design of multiple SUs.
SU (Scalable Unit): modular building block, 72 nodes, rail-aligned.
Mission Control: NVIDIA's cluster management and RAS software.
Rubin: next GPU generation after Blackwell (Vera Rubin platform); announced at CES January 2026, paired with the Vera CPU and HBM4. As of mid-2026 ramping into full production, shipments expected 2H 2026.
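
The per-GPU figures in this section roll up into rack-level totals by simple multiplication. A minimal sketch using only numbers from the B300, NVLink, and GB300 NVL72 entries above; `rack_totals` and the constant names are hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
GPUS_PER_RACK = 72          # GB300 NVL72 entry
HBM_GB_PER_GPU = 288        # B300 / HBM3e entries
NVLINK_TBPS_PER_GPU = 1.8   # NVLink 5th-gen entry


def rack_totals(gpus: int = GPUS_PER_RACK,
                hbm_gb: float = HBM_GB_PER_GPU,
                nvlink_tbps: float = NVLINK_TBPS_PER_GPU) -> dict:
    """Aggregate HBM capacity and NVLink bandwidth for one NVLink domain."""
    return {
        "hbm_tb": gpus * hbm_gb / 1000.0,        # total HBM in TB
        "nvlink_tbps": gpus * nvlink_tbps,       # summed per-GPU NVLink TB/s
    }


if __name__ == "__main__":
    print(rack_totals())
```

These are nominal ceilings; usable memory and achieved bandwidth are lower in practice.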

Software stack & node

GSP: GPU System Processor; on-GPU RISC-V core running offloaded driver logic. Firmware ships with the driver and must match it.
Open kernel modules: NVIDIA's open-source GPU kernel modules (nvidia-open), required/default on Blackwell.
Fabric Manager: nv-fabricmanager; programs the NVSwitch fabric so GPUs form one NVLink domain. Lockstep-versioned with the driver.
IMEX: Internode Memory Exchange; coordinates the multi-node NVLink memory domain on NVL72.
MIG: Multi-Instance GPU; hardware partitioning into isolated instances with fault isolation.
MPS: Multi-Process Service; concurrent process sharing of one GPU, no fault isolation.
Persistence mode: keeps driver state resident (nvidia-persistenced) to avoid re-init latency/clock-down.
NVIDIA Container Toolkit: injects driver/devices into containers. CDI (Container Device Interface): the runtime-neutral standard it increasingly uses.
DKMS: Dynamic Kernel Module Support; rebuilds the driver on kernel upgrade.
LTS driver branch: a long-term-support datacenter driver branch (e.g. R535, R580); pin the fleet to one. R570 is a Production branch (shorter support), not LTS. Verify current designations on NVIDIA's support matrix. CUDA forward compatibility: run applications built with a newer CUDA toolkit on an older datacenter driver via the cuda-compat package.
GeForce vs RTX-Enterprise vs datacenter driver: three NVIDIA driver families: GeForce (RTX 50/40 consumer; licence §2.8 bars datacenter deployment), RTX Enterprise Production Branch (RTX PRO/workstation; ISV-certified, long life-cycle), datacenter/Tesla (A100/H100/B-series; Production/LTS branches). DGX systems run DGX OS (customized Ubuntu, stack preinstalled).

Kubernetes & platform

GPU Operator: Helm-managed automation of the whole node GPU stack (driver, toolkit, plugin, DCGM, NFD/GFD, MIG manager, DRA driver).
Device plugin: legacy mechanism advertising nvidia.com/gpu as a countable integer resource.
DRA: Dynamic Resource Allocation; stable in Kubernetes 1.34. ResourceClaims/DeviceClasses with attributes and partitionable devices; the successor to the device plugin.
NFD / GFD: Node / GPU Feature Discovery; label nodes by hardware/GPU features.
Time-slicing: oversubscribe a GPU with no isolation. vGPU: hypervisor-mediated GPU virtualization for VMs.
KAI Scheduler: NVIDIA's open-source (Apache-2.0, ex-Run:ai) K8s scheduler: gang scheduling, fair-share queues, bin-packing, topology-aware. Volcano / Kueue: CNCF batch scheduler / Kubernetes SIG job-queueing controller.
Gang scheduling: all-or-nothing placement of a distributed job's pods.
Grove: placement automation for rack-scale NVLink systems (used by Dynamo/KAI).
Network Operator: brings host RDMA, Multus, SR-IOV, and GPUDirect into Kubernetes pods.
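
Gang scheduling, as defined above, is easiest to see as a placement function that either returns a slot for every pod or returns nothing. This is a toy first-fit sketch, not how KAI, Volcano, or Kueue are implemented; `gang_place` and the node names are hypothetical.

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], pods: int,
               gpus_per_pod: int) -> Optional[List[str]]:
    """Return one node name per pod, or None if the whole job cannot fit."""
    remaining = dict(free_gpus)  # work on a copy; commit only on full success
    placement: List[str] = []
    for _ in range(pods):
        # First-fit over nodes in name order, for deterministic output.
        node = next((n for n in sorted(remaining)
                     if remaining[n] >= gpus_per_pod), None)
        if node is None:
            return None  # all-or-nothing: no partial placement is kept
        remaining[node] -= gpus_per_pod
        placement.append(node)
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 8, "node-b": 4}
    print(gang_place(cluster, pods=3, gpus_per_pod=4))  # fits
    print(gang_place(cluster, pods=4, gpus_per_pod=4))  # does not fit
```

A scheduler without this property could start three of four pods and leave them idle holding GPUs while the fourth waits.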

Storage & data

Parallel filesystem: cluster-wide high-throughput FS: Lustre, IBM Storage Scale (GPFS), BeeGFS; AI-flash: WEKA, VAST, DDN.
GPUDirect Storage (GDS): DMA from storage directly into GPU memory (cuFile/nvidia-fs), bypassing the CPU bounce buffer.
DCP: torch.distributed.checkpoint; sharded, asynchronous distributed checkpointing.
DALI: NVIDIA Data Loading Library; GPU-accelerated decode/augment.
WebDataset / Mosaic StreamingDataset: shard datasets (tar/parquet) to avoid small-file metadata storms.

Training

torchrun: PyTorch elastic launcher (c10d/etcd rendezvous).
DDP: Distributed Data Parallel; replicate model, all-reduce gradients.
FSDP / ZeRO: shard params/grads/optimizer state across ranks (PyTorch FSDP2 / DeepSpeed ZeRO stages 1-3).
TP / PP / EP / SP: Tensor / Pipeline / Expert / Sequence parallelism; they compose and must match topology.
Megatron-Core: reference TP/PP/EP engine. NeMo: end-to-end framework on Megatron. DeepSpeed, TorchTitan: large-scale training stacks.
Transformer Engine: NVIDIA library providing FP8 training layers.
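MFU / HFU: Model / Hardware FLOPs Utilization; achieved vs peak FLOPs. MFU counts only the FLOPs the model mathematically requires; HFU also counts recomputation (e.g. activation checkpointing), so HFU is never lower than MFU. The key training-efficiency metric (~35-50% MFU is healthy).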
Activation checkpointing: recompute activations to save memory. Gradient accumulation: simulate a larger global batch.
DiLoCo: low-communication distributed training for WAN/geo-distributed GPUs.
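
The MFU / HFU entry above reduces to one division. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for a dense transformer's forward plus backward pass (attention FLOPs ignored), and about 8 when activations are fully recomputed; `flops_utilization` and the example numbers are hypothetical.

```python
def flops_utilization(params: float, tokens_per_second: float,
                      num_gpus: int, peak_flops_per_gpu: float,
                      flops_per_param_token: float = 6.0) -> float:
    """Achieved FLOPs divided by peak FLOPs across the job.

    flops_per_param_token = 6.0 approximates MFU for a dense model;
    8.0 approximates HFU under full activation recomputation.
    """
    achieved = flops_per_param_token * params * tokens_per_second
    peak = num_gpus * peak_flops_per_gpu
    return achieved / peak


if __name__ == "__main__":
    # Illustrative numbers only: 1B params, 500k tokens/s, 8 GPUs at 1 PFLOP/s.
    mfu = flops_utilization(1e9, 5e5, 8, 1e15)
    hfu = flops_utilization(1e9, 5e5, 8, 1e15, flops_per_param_token=8.0)
    print(f"MFU={mfu:.3f} HFU={hfu:.3f}")
```

For MoE models the parameter count should be the active parameters per token, not the total.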

Inference & serving

Prefill / decode: the compute-bound prompt phase / the memory-bandwidth-bound token-generation phase.
TTFT / TPOT (ITL): Time To First Token / Time Per Output Token (inter-token latency). Goodput: throughput within SLO.
KV cache: cached attention keys/values; the dominant serving memory cost.
PagedAttention: vLLM's paged KV-cache management. RadixAttention: SGLang's prefix-caching via a radix tree.
Continuous (in-flight) batching: add/retire requests from the running batch each step.
vLLM / SGLang / TensorRT-LLM: the leading inference engines (flexible default / prefix-heavy / compiled-fastest).
Dynamo: NVIDIA's datacenter-scale distributed inference framework (disaggregated serving, KV-aware routing). NIM: packaged inference microservice.
Triton Inference Server: multi-framework serving with dynamic batching. KServe: K8s-native model serving.
Speculative decoding: draft model proposes tokens, target verifies. Disaggregated serving: prefill and decode on separate GPU pools.
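
Two entries in this section, KV cache and TTFT / TPOT, come down to short formulas. The sketch below assumes standard multi-head or grouped-query attention (it does not model compressed schemes such as MLA) and per-token timestamps collected by the client; both function names are hypothetical.

```python
from typing import Sequence, Tuple


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: one K and one V tensor per layer, per cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def ttft_and_tpot(request_start: float,
                  token_times: Sequence[float]) -> Tuple[float, float]:
    """Time To First Token and mean Time Per Output Token, in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT averages the gaps between consecutive output tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Illustrative shape: 32 layers, 8 KV heads, head_dim 128, 4096 tokens, FP16.
    print(kv_cache_bytes(32, 8, 128, 4096) / 2**20, "MiB")
    print(ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8]))
```

Multiplying the per-sequence cache size by the number of concurrent sequences shows why the KV cache, not the weights, often caps batch size.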

Observability, RAS & optimization

DCGM: Data Center GPU Manager; health, diagnostics, telemetry. dcgm-exporter: Prometheus exporter for DCGM.
SM-active / tensor-active: fraction of time SMs / Tensor Cores did work; the real "busy" signal, unlike misleading GPU-Util.
Nsight Systems (nsys) / Nsight Compute (ncu): system-timeline / single-kernel profilers.
XID: GPU error code in dmesg (app-caused 13/31/43; hardware 48 DBE, 79 fallen-off-bus, 94/95 contained/uncontained ECC). SXID: the NVSwitch equivalent.
ECC SBE / DBE: single-bit (correctable) / double-bit (uncorrectable) error. Row remapping: HBM remaps a bad row to a spare; remapping failure means RMA.
RAS: Reliability, Availability, Serviceability. MTBF: mean time between failures; the checkpoint interval must be much shorter than the cluster-level MTBF so that a failure loses little work. Straggler: a slow rank gating every collective.
Roofline: compute-bound vs memory-bound kernel model. ACS: PCIe Access Control Services; when enabled it redirects peer-to-peer traffic through the root complex, which degrades or breaks P2P/GDR. NUMA: bind GPU↔NIC↔CPU locality.
FlashAttention: fused, memory-efficient attention. CUDA Graphs / torch.compile: eliminate launch overhead / fuse kernels.
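
The roofline entry above can be written as a one-line bound: attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). A minimal sketch with hypothetical function names and illustrative hardware numbers:

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound in FLOP/s for a kernel with the given FLOPs per byte."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


def is_memory_bound(peak_flops: float, mem_bandwidth: float,
                    arithmetic_intensity: float) -> bool:
    return arithmetic_intensity < ridge_point(peak_flops, mem_bandwidth)


if __name__ == "__main__":
    peak, bw = 1000e12, 2e12  # illustrative: 1000 TFLOP/s, 2 TB/s
    for ai in (10, 100, 500, 1000):
        print(ai, attainable_flops(peak, bw, ai), is_memory_bound(peak, bw, ai))
```

This is the same split as the prefill / decode entry: decode sits left of the ridge point, prefill usually to the right.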

Compute / ops

NCCL: NVIDIA Collective Communications Library (all_reduce, all_gather, etc.). Tuned via NCCL_* env vars (algo, proto, channels, IB HCA, GDR level).
GPUDirect RDMA: NIC reads/writes GPU memory directly, bypassing host.
DDP / FSDP / DiLoCo: distributed training strategies (see Training).
BMC: baseboard management controller. IPMI (legacy) / Redfish (modern): OOB management protocols.
PXE: network boot for bare-metal provisioning.
Slurm: HPC workload manager/scheduler.
CDU: coolant distribution unit. Rear-door heat exchanger (RDHx): rack-door liquid cooling.
OFED: OpenFabrics Enterprise Distribution; IB/RDMA drivers and userspace.
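
The NCCL entry above mentions tuning through `NCCL_*` environment variables. A hedged sketch of setting them from Python before the process group is initialised: the variable names are real NCCL settings, but every value here is a placeholder (in particular the HCA name `mlx5_0`), and `apply_nccl_settings` is a hypothetical helper.

```python
import os
from typing import Dict, MutableMapping

# Placeholder starting values; tune per fabric and verify against NCCL docs.
NCCL_SETTINGS: Dict[str, str] = {
    "NCCL_DEBUG": "INFO",          # log chosen algorithm/transport at init
    "NCCL_ALGO": "Ring",           # collective algorithm
    "NCCL_PROTO": "Simple",        # wire protocol
    "NCCL_MIN_NCHANNELS": "4",     # lower bound on channels
    "NCCL_IB_HCA": "mlx5_0",       # which IB HCA(s) to use (placeholder name)
    "NCCL_NET_GDR_LEVEL": "PHB",   # topology distance allowed for GPUDirect RDMA
}


def apply_nccl_settings(settings: Dict[str, str],
                        env: MutableMapping[str, str] = os.environ) -> Dict[str, str]:
    """Set each variable unless the operator already exported it."""
    for key, value in settings.items():
        env.setdefault(key, value)
    return {key: env[key] for key in settings}


if __name__ == "__main__":
    # Call before the first NCCL communicator is created.
    print(apply_nccl_settings(NCCL_SETTINGS))
```

Using `setdefault` keeps values exported by the job launcher authoritative, which matters when a scheduler injects per-node HCA lists.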

Security & cost

TEE-I/O: Trusted Execution Environment extended to GPU I/O; Blackwell is the first GPU to support it.
Confidential computing: encrypted GPU/HBM/NVLink with attestation; protects data/weights in use at near-zero overhead.
SPDM: Security Protocol and Data Model; device/firmware attestation.
Neocloud: GPU-specialist cloud (CoreWeave, Lambda, Crusoe, Nebius, …). DePIN GPU: decentralized/permissionless GPU aggregation across many independent providers.
Capacity Blocks: reserve a block of GPUs for a fixed window. Spot / preemptible: cheap, reclaimable capacity (checkpoint to use safely).
FinOps: cloud cost discipline; key GPU signals are $/GPU-hour, real utilisation, $/token, $/run.
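
The FinOps signals above are linked by simple arithmetic: $/token follows from $/GPU-hour, GPU count, throughput, and real utilisation. A minimal sketch; `cost_per_million_tokens` and the prices are hypothetical.

```python
def cost_per_million_tokens(gpu_hour_price: float, num_gpus: int,
                            tokens_per_second: float,
                            utilisation: float = 1.0) -> float:
    """Dollar cost per one million tokens for a serving deployment.

    utilisation is the fraction of wall-clock time the deployment actually
    delivers tokens_per_second (idle time is still billed).
    """
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600.0 * utilisation
    hourly_cost = gpu_hour_price * num_gpus
    return hourly_cost / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Illustrative: 8 GPUs at $3.00/GPU-hour, 10k tokens/s, 60% utilised.
    print(round(cost_per_million_tokens(3.0, 8, 10_000, utilisation=0.6), 4))
```

Halving utilisation doubles $/token at the same $/GPU-hour, which is why real utilisation sits next to price in the list of signals.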

Models & post-training

MoE / active params: Mixture-of-Experts; only a subset of experts (the active params) runs per token, so memory scales with total params but compute scales with active params (see serving open-weight models).
MLA: Multi-head Latent Attention; compresses KV-cache state ~an order of magnitude (DeepSeek, Kimi K2).
DeepSeek-V3 / R1: 671B/37B MoE, MLA + FP8; R1 is the GRPO-RL reasoning variant. Kimi K2: Moonshot ~1T/32B MoE, MLA; K2-Instruct is 128K/block-FP8. GLM (Z.ai/zai-org): MoE, MIT-licensed; GLM-4.7-FP8 is the current cookbook target. Qwen3 / Llama: dense + MoE open-weight families.
SFT: supervised fine-tuning on curated demonstrations.
LoRA / QLoRA: low-rank adapter fine-tuning / with 4-bit base quantisation; parameter-efficient.
DPO / SimPO: Direct Preference Optimization (offline (chosen,rejected) alignment, no rollouts) / reference-free variant.
GRPO: Group Relative Policy Optimization; critic-free RL using group-relative advantages (DeepSeekMath, scaled in R1). DAPO: a GRPO-stability successor.
RLVR: Reinforcement Learning with Verifiable Rewards; reward comes from a deterministic checker (answer match, unit tests, format) rather than a learned reward model. Named by Tülu 3, scaled by DeepSeek-R1; algorithm-agnostic (PPO/GRPO/RLOO) (RLVR).
On-policy distillation (OPD): the student trains on its own sampled rollouts, graded per-token by a teacher (reverse KL); dense reward, cheaper than RL, bounded by the teacher. OPSD: the self-teacher variant (same-model/earlier-checkpoint/privileged-context teacher) (on-policy distillation).
RLSD: Reinforcement Learning with Self-Distillation; RLVR sets each update's direction (verifier reward) while a token-level weight w_t=(P_T/P_S)^sign(A) from a privileged-context self-teacher sets the magnitude: dense per-token credit on RLVR's sparse reward (RLSD).
Model merging: combine multiple finetunes of one base into a single model with no training, via task vectors; interference-resolving methods (TIES, DARE) and interpolation (SLERP, model soups) via mergekit (model merging).
Synthetic data: LLM-generated training data (Self-Instruct, Evol-Instruct, Magpie, teacher distillation, RLAIF); powerful but risks benchmark contamination and model collapse if unfiltered (synthetic data, data curation).
Multi-LoRA serving: serve many LoRA adapters over one shared base with heterogeneous batching across adapters (S-LoRA, Punica); vLLM enable_lora (multi-LoRA serving).
Eval harness / eval gate: reproducible benchmark runner (lm-evaluation-harness, lighteval) and the promotion check a checkpoint must pass before serving; decontaminate to keep scores honest (eval harness).
Experiment tracking: per-run params/metrics/artifacts plus a versioned model registry and data→code→config→checkpoint lineage (MLflow, W&B) (experiment tracking & registry).
TRL / verl / OpenRLHF: post-training frameworks (single-node / large-scale Ray RL / async RL).
Disaggregated serving: prefill and decode on separate GPU pools (disaggregated inference). NIXL: NVIDIA Inference Xfer Library; moves KV cache GPU-to-GPU over RDMA/NVMe at wire speed.
HSDP: Hybrid Sharded Data Parallel; shard intra-node on NVLink, replicate inter-node over IB.
DiLoCo: Distributed Low-Communication training (local SGD, infrequent sync, ~500x less comms); OpenDiLoCo/PRIME are the Prime Intellect implementations.
Burn rate: speed of error-budget consumption; multi-window burn-rate alerts page on fast burns (the SLO/SLI catalog).
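
Of the entries in this section, GRPO's group-relative advantage is the easiest to show in a few lines: each sampled completion is scored against the mean and spread of its own group, so no critic is needed. A minimal sketch; the function name is hypothetical, and the use of population standard deviation plus a small epsilon is an assumption (implementations differ on both).

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Verifiable 0/1 rewards (as in the RLVR entry) for four completions.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every completion is correct carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

The second case shows why groups with uniform rewards contribute nothing to the update.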

Orchestration & RL systems

Ray: Python-native distributed runtime (tasks + actors); libraries: Ray Train, Ray Tune, Ray Serve, Ray Data, RLlib. The common controller for LLM-RL systems (orchestration overview).
KubeRay: operator running Ray on Kubernetes via RayCluster / RayJob / RayService CRDs.
k3s: lightweight single-binary Kubernetes (CNCF) for edge, small clusters, CI, and dev.
Generator / Trainer: the rollout half (inference engine + environment/reward) and the policy-update half (PPO/GRPO on FSDP/Megatron) of an RL-for-LLM system (RL libraries).
Colocated vs disaggregated RL: rollout and trainer share GPUs (efficient) vs separate pools (async, straggler-tolerant).
verl / slime / SkyRL: RL post-training libraries: colocated-default high-perf (ByteDance) / Megatron+SGLang decoupled, powers GLM (THUDM) / flexible colocated-or-disaggregated (UC Berkeley). See also OpenRLHF, NeMo-RL, ROLL, AReaL, TRL.

Practices & platform

SLI / SLO / error budget: service-level indicator / objective / the allowance of unreliability that arbitrates change velocity vs stability.
Golden signals: latency, traffic, errors, saturation; GPU-flavoured as TTFT/TPOT, tokens-s, XID/ECC, SM-active.
GitOps: cluster state declared in git and reconciled by Argo CD / Flux; drift self-heals.
IaC: infrastructure as code: Terraform/OpenTofu for cloud/cluster, Ansible for the metal (Ansible bring-up).
Policy-as-code: admission guardrails via Kyverno / OPA Gatekeeper (enforce GPU limits, image provenance, MIG).
Golden path: templated self-service (Backstage, Helm/Kustomize bases) for compliant jobs without re-deriving the platform.
MPIJob / PyTorchJob: Kubeflow CRDs for multi-pod MPI / PyTorch distributed jobs.
Model registry / eval gate: versioned model store (MLflow, W&B) and the quality/regression/safety check a model must pass before promotion to serving.
Pipeline orchestration: Kubeflow Pipelines / Argo Workflows / Flyte / Metaflow running data→train→eval→register→deploy.
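
The SLO, error-budget, and burn-rate terms used in this glossary fit together as follows: the error budget is 1 minus the SLO, and burn rate is the observed error rate divided by that budget. A minimal sketch with hypothetical function names; the 14.4 threshold is the commonly quoted value for spending 2% of a 30-day budget in one hour and is an assumption here, not a recommendation from this knowledge base.

```python
def error_budget(slo: float) -> float:
    """Allowed fraction of bad events, e.g. SLO 0.999 gives a budget near 0.001."""
    return 1.0 - slo


def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window alert: page only if both windows exceed the threshold."""
    return short_window_rate > threshold and long_window_rate > threshold


if __name__ == "__main__":
    fast = burn_rate(bad_events=20, total_events=1000, slo=0.999)   # short window
    slow = burn_rate(bad_events=180, total_events=12000, slo=0.999) # long window
    print(fast, slow, should_page(fast, slow))
```

Requiring both windows to exceed the threshold filters out brief spikes while still paging quickly on sustained fast burns.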

Facility

One-line diagram: single-line electrical schematic (feed to rack).
UPS: uninterruptible power supply. Redundancy schemes: N (no spare), N+1 (one spare module), 2N (two independent systems).
THD: total harmonic distortion; IEEE 519 sets limits.
TDP: thermal design power.
PDU: power distribution unit.
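
The N, N+1, and 2N labels in the UPS entry above translate directly into module counts. A minimal sizing sketch, assuming identical modules and ignoring derating, power factor, and battery runtime; `ups_modules` and the example load are hypothetical.

```python
import math


def ups_modules(load_kw: float, module_kw: float, scheme: str = "N+1") -> int:
    """Number of UPS modules needed for a load under a redundancy scheme."""
    if load_kw <= 0 or module_kw <= 0:
        raise ValueError("load_kw and module_kw must be positive")
    n = math.ceil(load_kw / module_kw)  # N: just enough capacity, no spare
    counts = {"N": n, "N+1": n + 1, "2N": 2 * n}
    if scheme not in counts:
        raise ValueError(f"unknown scheme: {scheme}")
    return counts[scheme]


if __name__ == "__main__":
    # Illustrative: 1000 kW of IT load on 250 kW modules.
    for scheme in ("N", "N+1", "2N"):
        print(scheme, ups_modules(1000, 250, scheme))
```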

References

NVIDIA DCGM documentation: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
NVIDIA MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
NVIDIA GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
Kubernetes Dynamic Resource Allocation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
PyTorch Distributed: https://pytorch.org/docs/stable/distributed.html
vLLM documentation: https://docs.vllm.ai/en/latest/

Related: Index