Glossary¶
Scope: concise definitions of the terms used across the knowledge base. This is a reference glossary, not a single WHAT/WHY/WHEN/HOW topic.
Each term is defined once here and linked from everywhere else.
flowchart LR
TERMS["Glossary terms"] --> HARDWARE["Hardware"]
TERMS --> STACK["Software stack"]
TERMS --> OPS["Operations"]
TERMS --> MODELS["Models and post-training"]
Networking¶
- InfiniBand (IB): low-latency RDMA interconnect for HPC/AI. Generations: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s per port per direction.
- NDR: Next Data Rate, 400 Gb/s IB (Quantum-2). 100G-PAM4 per lane.
- XDR: eXtreme Data Rate, 800 Gb/s IB (Quantum-X800). 200G-PAM4 per lane.
- Quantum-X800: NVIDIA's XDR IB platform (Quantum-3 ASIC). Switches: Q3400-RA (4U, 144x 800G), Q3200-RA (2U), Q3450-LD (co-packaged optics).
- Spectrum-X / Spectrum-4: NVIDIA's Ethernet platform for AI; RoCEv2 for RDMA.
- RoCE: RDMA over Converged Ethernet; the Ethernet path to RDMA.
- ConnectX-8 (CX8): 800G SuperNIC, IB XDR or 2x400GbE, PCIe Gen6. ConnectX-7 (CX7) is the 400G NDR predecessor.
- SuperNIC: NVIDIA's GPU-attached NIC for AI fabrics.
- OSFP: octal small form-factor pluggable; the 800G transceiver/cage form factor. IHS (switch-side, twin-port) and RHS (NIC-side, single-port) are not interchangeable.
- SHARP: Scalable Hierarchical Aggregation and Reduction Protocol; in-network reduction for collectives. v4 on Quantum-X800. NVLS: the NVLink SHARP variant (in-network reduction over NVLink).
- AR: Adaptive Routing. SHIELD: self-healing fabric.
- Subnet Manager (SM): core IB controller (init, routing, partitioning). OpenSM is the open implementation.
- UFM: Unified Fabric Manager; NVIDIA's IB management platform (provisioning, monitoring, diagnostics). SM is a component of it. Variants: Telemetry, Enterprise, Cyber-AI.
- PKey: partition key; IB network partitioning/isolation.
- Fat-tree / leaf-spine: multi-tier Clos topology; non-blocking when uplink and downlink bandwidth match at every tier. Super-spine: third tier for very large fabrics.
- Skyway: IB-to-Ethernet gateway.
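The fat-tree and super-spine entries above imply a simple sizing rule. The sketch below assumes identical switches and a fully non-blocking build, where every tier except the top splits its ports half down and half up. The function name is illustrative; the radix of 144 is taken from the Q3400-RA entry.

```python
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat-tree built from one switch model.

    Each non-top tier uses radix/2 ports down and radix/2 ports up,
    which gives 2 * (radix / 2) ** tiers endpoint ports in total.
    """
    if radix < 2 or radix % 2 != 0 or tiers < 1:
        raise ValueError("radix must be even and >= 2, tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


# Radix 144 matches the 144x 800G port count quoted for the Q3400-RA.
two_tier = fat_tree_endpoints(144, 2)    # leaf + spine
three_tier = fat_tree_endpoints(144, 3)  # leaf + spine + super-spine
print(two_tier, three_tier)
```

Real designs often accept some oversubscription or reserve ports for storage and management, so treat these figures as upper bounds.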
Platform¶
- Ampere: NVIDIA GPU architecture (2020, GA100, TSMC 7nm): A100/A30/A40/A10; 3rd-gen Tensor (TF32, no FP8), 3rd-gen NVLink 600 GB/s, MIG up to 7.
- Hopper: NVIDIA GPU architecture (2022, GH100, TSMC 4N): H100/H200/GH200; 4th-gen Tensor + Transformer Engine FP8, NVLink4 900 GB/s, Confidential Computing.
- Blackwell: NVIDIA GPU architecture (2024-2025): B200/B300/GB200/GB300 datacenter, RTX 50 consumer, RTX PRO 6000 / GB10; 5th-gen Tensor + FP4/NVFP4, NVLink5 1.8 TB/s, TEE-I/O.
- B300: Blackwell Ultra GPU. 288 GB HBM3e, ~15 PFLOPS dense FP4, 1,400 W TDP.
- GB300 NVL72: rack-scale system: 72 B300 + 36 Grace, liquid-cooled, up to ~142 kW, single 72-GPU NVLink domain.
- GH200 / Grace Hopper: superchip: 1 Grace CPU + 1 Hopper GPU joined by NVLink-C2C 900 GB/s coherent memory.
- GB10 / DGX Spark: Grace Blackwell desktop superchip (20-core Arm + Blackwell GPU); 128 GB LPDDR5x unified memory @ ~273 GB/s; ~1 PFLOP FP4; runs DGX OS; cluster two units over ConnectX-7.
- L40S: Ada Lovelace datacenter card: 48 GB GDDR6 ECC, ~350 W, vGPU, no NVLink, no MIG; inference/graphics workhorse.
- RTX PRO 6000 Blackwell: workstation/server GPU: 96 GB GDDR7 with ECC, 600 W, MIG up to 4, vGPU; no NVLink. Editions: Workstation, Server, Max-Q.
- SXM vs PCIe: GPU form factors: SXM is the NVLink/NVSwitch socketed module (datacenter, busbar power); PCIe is the standard add-in card (PCIe-only multi-GPU, 8-pin/16-pin power).
- Unified memory (LPDDR5x): coherent CPU+GPU memory pool over NVLink-C2C/ATS (Grace Hopper, GB10/DGX Spark); not discrete HBM/GDDR VRAM.
- 12VHPWR / 12V-2x6: 16-pin PCIe power connectors: 12VHPWR on Ada consumer (RTX 40), 12V-2x6 (a revision) on Blackwell consumer (RTX 50).
- Grace: NVIDIA Arm CPU (Neoverse V2). NVLink-C2C: coherent CPU-GPU link.
- NVLink: intra-node/intra-rack GPU interconnect. 5th gen: 1.8 TB/s per GPU. NVSwitch: the switch ASIC forming the NVLink domain.
- HBM3e: high-bandwidth memory; 288 GB per B300 via 12-high stacks.
- NVFP4 / FP4 / FP8: low-precision tensor formats; Blackwell Ultra is tuned for these.
- DGX / HGX B300: 8-GPU node form factors. SuperPOD: reference design of multiple SUs.
- SU (Scalable Unit): modular building block, 72 nodes, rail-aligned.
- Mission Control: NVIDIA's cluster management and RAS software.
- Rubin: next GPU generation after Blackwell (Vera Rubin platform); announced at CES January 2026, paired with the Vera CPU and HBM4. As of mid-2026 ramping into full production, shipments expected 2H 2026.
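As a quick arithmetic check on the per-GPU figures above, the following sketch multiplies them out for one GB300 NVL72 rack. It only uses numbers quoted in this section (72 GPUs, 288 GB HBM3e, ~15 PFLOPS dense FP4, 1.8 TB/s NVLink per GPU); the function name and the use of decimal units are assumptions.

```python
def nvl72_totals(gpus: int = 72,
                 hbm_gb: float = 288,
                 fp4_pflops: float = 15.0,
                 nvlink_tb_s: float = 1.8) -> dict:
    """Rack-level aggregates from the per-GPU glossary figures (decimal units)."""
    return {
        "hbm_tb": gpus * hbm_gb / 1000,                  # GB -> TB
        "fp4_dense_eflops": gpus * fp4_pflops / 1000,    # PFLOPS -> EFLOPS
        "nvlink_aggregate_tb_s": gpus * nvlink_tb_s,     # sum of per-GPU NVLink
    }


totals = nvl72_totals()
for name, value in totals.items():
    print(f"{name}: {value:.2f}")
```

The FP4 figure is approximate in the source, so the aggregate inherits that uncertainty.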
Software stack & node¶
- GSP: GPU System Processor; on-GPU RISC-V core running offloaded driver logic. Firmware ships with the driver and must match it.
- Open kernel modules: NVIDIA's open-source GPU kernel modules (`nvidia-open`), required/default on Blackwell.
- Fabric Manager: `nv-fabricmanager`; programs the NVSwitch fabric so GPUs form one NVLink domain. Lockstep-versioned with the driver.
- IMEX: Internode Memory Exchange; coordinates the multi-node NVLink memory domain on NVL72.
- MIG: Multi-Instance GPU; hardware partitioning into isolated instances with fault isolation.
- MPS: Multi-Process Service; concurrent process sharing of one GPU, no fault isolation.
- Persistence mode: keeps driver state resident (`nvidia-persistenced`) to avoid re-init latency/clock-down.
- NVIDIA Container Toolkit: injects driver/devices into containers. CDI (Container Device Interface): the runtime-neutral standard it increasingly uses.
- DKMS: Dynamic Kernel Module Support; rebuilds the driver on kernel upgrade.
- LTS driver branch: a long-term-support datacenter driver branch (e.g. R535, R580); pin the fleet to one. R570 is a Production branch (shorter support), not LTS. Verify current designations on NVIDIA's support matrix. CUDA forward compatibility: run a newer CUDA on an older datacenter driver.
- GeForce vs RTX-Enterprise vs datacenter driver: three NVIDIA driver families: GeForce (RTX 50/40 consumer; licence §2.8 bars datacenter deployment), RTX Enterprise Production Branch (RTX PRO/workstation; ISV-certified, long life-cycle), datacenter/Tesla (A100/H100/B-series; Production/LTS branches). DGX systems run DGX OS (customized Ubuntu, stack preinstalled).
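The advice to pin the fleet to one driver branch can be checked mechanically. The sketch below assumes you have already collected a driver version string per node (for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`) and that the branch is the major number of that string, as in R535 or R580. Node names and version strings are hypothetical.

```python
from collections import Counter


def driver_branch(version: str) -> str:
    """Map a driver version such as '580.65.06' to its branch name 'R580'."""
    major = version.strip().split(".")[0]
    if not major.isdigit():
        raise ValueError(f"unexpected driver version: {version!r}")
    return f"R{major}"


def branch_counts(versions: dict[str, str]) -> Counter:
    """Count nodes per driver branch."""
    return Counter(driver_branch(v) for v in versions.values())


def off_branch_nodes(versions: dict[str, str], pinned: str) -> list[str]:
    """Return the nodes whose driver is not on the pinned branch."""
    return sorted(n for n, v in versions.items() if driver_branch(v) != pinned)


# Hypothetical inventory: node name -> reported driver version.
fleet = {"node-a": "580.65.06", "node-b": "580.82.07", "node-c": "570.133.20"}
print(branch_counts(fleet))
print(off_branch_nodes(fleet, pinned="R580"))
```

The same check extends naturally to Fabric Manager and GSP firmware, which the entries above describe as versioned in lockstep with the driver.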
Kubernetes & platform¶
- GPU Operator: Helm-managed automation of the whole node GPU stack (driver, toolkit, plugin, DCGM, NFD/GFD, MIG manager, DRA driver).
- Device plugin: legacy mechanism advertising `nvidia.com/gpu` as a countable integer resource.
- DRA: Dynamic Resource Allocation; stable in Kubernetes 1.34. ResourceClaims/DeviceClasses with attributes and partitionable devices; the successor to the device plugin.
- NFD / GFD: Node / GPU Feature Discovery; label nodes by hardware/GPU features.
- Time-slicing: oversubscribe a GPU with no isolation. vGPU: hypervisor-mediated GPU virtualization for VMs.
- KAI Scheduler: NVIDIA's open-source (Apache-2.0, ex-Run:ai) K8s scheduler: gang scheduling, fair-share queues, bin-packing, topology-aware placement. Volcano / Kueue: CNCF batch scheduler / Kubernetes SIG job-queueing controller.
- Gang scheduling: all-or-nothing placement of a distributed job's pods.
- Grove: placement automation for rack-scale NVLink systems (used by Dynamo/KAI).
- Network Operator: brings host RDMA, Multus, SR-IOV, and GPUDirect into Kubernetes pods.
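To make the device-plugin model concrete, the sketch below builds a minimal Pod manifest that requests GPUs through the `nvidia.com/gpu` extended resource as a whole integer. It is written as a plain Python dictionary so it can be dumped to JSON and piped to `kubectl apply -f -`; the pod name and image are hypothetical placeholders.

```python
import json


def gpu_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest requesting whole GPUs via the device plugin."""
    if gpus < 1:
        raise ValueError("the device plugin only hands out whole GPUs (>= 1)")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "main",
                "image": image,
                # Extended resources are set under limits and must be integers.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


# Hypothetical name and image, for illustration only.
pod = gpu_pod("gpu-smoke-test", "registry.example.com/cuda-sample:latest", 1)
print(json.dumps(pod, indent=2))
```

DRA replaces this integer count with ResourceClaims that can select devices by attribute, which this sketch does not attempt to show.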
Storage & data¶
- Parallel filesystem: cluster-wide high-throughput FS: Lustre, IBM Storage Scale (GPFS), BeeGFS; AI-flash: WEKA, VAST, DDN.
- GPUDirect Storage (GDS): DMA from storage directly into GPU memory (`cuFile`/`nvidia-fs`), bypassing the CPU bounce buffer.
- DCP: `torch.distributed.checkpoint`; sharded, asynchronous distributed checkpointing.
- DALI: NVIDIA Data Loading Library; GPU-accelerated decode/augment.
- WebDataset / Mosaic StreamingDataset: shard datasets (tar/parquet) to avoid small-file metadata storms.
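The sharding idea behind the last entry can be illustrated with the standard library alone. The sketch below packs many small samples into a few tar shards and reads one back sequentially; it is not the WebDataset or StreamingDataset API, and the `key.bin` naming and shard size are assumptions chosen for the example.

```python
import io
import tarfile


def write_shards(samples: list[tuple[str, bytes]], samples_per_shard: int) -> list[bytes]:
    """Pack (key, payload) samples into in-memory tar shards."""
    shards = []
    for start in range(0, len(samples), samples_per_shard):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for key, payload in samples[start:start + samples_per_shard]:
                info = tarfile.TarInfo(name=f"{key}.bin")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shards.append(buf.getvalue())
    return shards


def read_shard(shard: bytes) -> list[tuple[str, bytes]]:
    """Read every member of one shard in order, as a data loader would."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        return [(m.name, tar.extractfile(m).read()) for m in tar.getmembers()]


# Ten tiny samples become three shards instead of ten files.
samples = [(f"{i:06d}", bytes([i])) for i in range(10)]
shards = write_shards(samples, samples_per_shard=4)
print(len(shards), read_shard(shards[0])[:2])
```

On a parallel filesystem the benefit comes from replacing millions of metadata operations with a handful of large sequential reads.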
Training¶
- torchrun: PyTorch elastic launcher (c10d/etcd rendezvous).
- DDP: Distributed Data Parallel; replicate model, all-reduce gradients.
- FSDP / ZeRO: shard params/grads/optimizer state across ranks (PyTorch FSDP2 / DeepSpeed ZeRO stages 1-3).
- TP / PP / EP / SP: Tensor / Pipeline / Expert / Sequence parallelism; they compose and must match topology.
- Megatron-Core: reference TP/PP/EP engine. NeMo: end-to-end framework on Megatron. DeepSpeed, TorchTitan: large-scale training stacks.
- Transformer Engine: NVIDIA library providing FP8 training layers.
- MFU / HFU: Model / Hardware FLOPs Utilization; achieved vs peak FLOPs (HFU counts recompute). The key training-efficiency metric (~35-50% healthy).
- Activation checkpointing: recompute activations to save memory. Gradient accumulation: simulate a larger global batch.
- DiLoCo: low-communication distributed training for WAN/geo-distributed GPUs.
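MFU and gradient accumulation both reduce to short formulas. The sketch below assumes the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass (recompute is excluded, which is what separates MFU from HFU). All inputs in the example are hypothetical round numbers, not measurements.

```python
def mfu(params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization under the 6 * params * tokens approximation."""
    achieved = 6.0 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)


def global_batch_size(micro_batch: int, accum_steps: int, dp_ranks: int) -> int:
    """Sequences per optimizer step with gradient accumulation."""
    return micro_batch * accum_steps * dp_ranks


# Hypothetical run: 70B dense model, 400k tokens/s on 512 GPUs,
# with an assumed peak of 1e15 FLOP/s per GPU.
utilization = mfu(70e9, 400_000, 512, 1e15)
print(f"MFU: {utilization:.1%}")
print(global_batch_size(micro_batch=2, accum_steps=8, dp_ranks=64))
```

For MoE models the parameter count in the formula should be the active parameters per token, not the total.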
Inference & serving¶
- Prefill / decode: the compute-bound prompt phase / the memory-bandwidth-bound token-generation phase.
- TTFT / TPOT (ITL): Time To First Token / Time Per Output Token (inter-token latency). Goodput: throughput within SLO.
- KV cache: cached attention keys/values; the dominant serving memory cost.
- PagedAttention: vLLM's paged KV-cache management. RadixAttention: SGLang's prefix-caching via a radix tree.
- Continuous (in-flight) batching: add/retire requests from the running batch each step.
- vLLM / SGLang / TensorRT-LLM: the leading inference engines (flexible default / prefix-heavy / compiled-fastest).
- Dynamo: NVIDIA's datacenter-scale distributed inference framework (disaggregated serving, KV-aware routing). NIM: packaged inference microservice.
- Triton Inference Server: multi-framework serving with dynamic batching. KServe: K8s-native model serving.
- Speculative decoding: draft model proposes tokens, target verifies. Disaggregated serving: prefill and decode on separate GPU pools.
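Because the KV cache is described above as the dominant serving memory cost, a sizing sketch is useful. It assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per KV head per layer; compressed schemes such as MLA need a different formula. The model shape in the example is hypothetical.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = key plus value)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Total KV cache in GiB for a batch of equally long sequences."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * batch / 2**30


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
per_token = kv_cache_bytes_per_token(32, 8, 128)
total = kv_cache_gib(32, 8, 128, seq_len=8192, batch=16)
print(per_token, total)
```

Paged allocation changes how this memory is laid out and reclaimed, not how much a full sequence needs.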
Observability, RAS & optimization¶
- DCGM: Data Center GPU Manager; health, diagnostics, telemetry. dcgm-exporter: Prometheus exporter for DCGM.
- SM-active / tensor-active: fraction of time SMs / Tensor Cores did work; the real "busy" signal, unlike misleading GPU-Util.
- Nsight Systems (nsys) / Nsight Compute (ncu): system-timeline / single-kernel profilers.
- XID: GPU error code in `dmesg` (app-caused 13/31/43; hardware 48 DBE, 79 fallen-off-bus, 94/95 contained/uncontained ECC). SXID: the NVSwitch equivalent.
- ECC SBE / DBE: single-bit (correctable) / double-bit (uncorrectable) error. Row remapping: HBM remaps a bad row to a spare; remapping failure means RMA.
- RAS: Reliability, Availability, Serviceability. MTBF: mean time between failures; checkpoint interval must be shorter. Straggler: a slow rank gating every collective.
- Roofline: compute-bound vs memory-bound kernel model. ACS: PCIe Access Control Services; breaks P2P/GDR if enabled. NUMA: bind GPU↔NIC↔CPU locality.
- FlashAttention: fused, memory-efficient attention. CUDA Graphs / torch.compile: eliminate launch overhead / fuse kernels.
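The XID grouping in this section lends itself to a small log triage helper. The sketch below assumes kernel log lines of the form `NVRM: Xid (PCI:...): <code>, ...` and uses only the codes listed above; any other code is reported as unclassified rather than guessed. The sample log lines are illustrative, not captured output.

```python
import re

XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<xid>\d+)")

APP_XIDS = {13, 31, 43}       # usually caused by the application
HW_XIDS = {48, 79, 94, 95}    # DBE, fallen off bus, contained/uncontained ECC


def classify_xid(line: str):
    """Return (device, xid, kind) for an XID log line, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    if xid in APP_XIDS:
        kind = "application"
    elif xid in HW_XIDS:
        kind = "hardware"
    else:
        kind = "unclassified"
    return match.group("dev"), xid, kind


# Illustrative lines in the assumed format.
log = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[1300.1] NVRM: Xid (PCI:0000:5e:00): 31, pid=999, MMU fault",
    "[1301.0] usb 1-1: new high-speed USB device",
]
results = [classify_xid(line) for line in log]
print(results)
```

A hardware classification would typically feed a drain-and-diagnose workflow, while application XIDs point back at the job.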
Compute / ops¶
- NCCL: NVIDIA Collective Communications Library (all_reduce, all_gather, etc.). Tuned via `NCCL_*` env vars (algo, proto, channels, IB HCA, GDR level).
- GPUDirect RDMA: NIC reads/writes GPU memory directly, bypassing host.
- DDP / FSDP / DiLoCo: distributed training strategies (see Training).
- BMC: baseboard management controller. IPMI (legacy) / Redfish (modern): OOB management protocols.
- PXE: network boot for bare-metal provisioning.
- Slurm: HPC workload manager/scheduler.
- CDU: coolant distribution unit. Rear-door heat exchanger (RDHx): rack-door liquid cooling.
- OFED: OpenFabrics Enterprise Distribution; IB/RDMA drivers and userspace.
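The NCCL tuning knobs named above are ordinary environment variables, so they can be assembled before launching a job. The sketch below builds such an environment using `NCCL_DEBUG`, `NCCL_IB_HCA`, `NCCL_NET_GDR_LEVEL`, `NCCL_ALGO`, `NCCL_PROTO`, and `NCCL_MIN_NCHANNELS`. The HCA names and chosen values are hypothetical, and NCCL's defaults are often the better starting point.

```python
import os


def nccl_env(hcas, gdr_level="PXB", algo=None, proto=None,
             min_channels=None, base=None) -> dict:
    """Return a copy of the environment with NCCL tuning variables set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"               # log the chosen transport and topology
    env["NCCL_IB_HCA"] = ",".join(hcas)      # which IB devices NCCL may use
    env["NCCL_NET_GDR_LEVEL"] = gdr_level    # max GPU-NIC distance for GPUDirect RDMA
    if algo is not None:
        env["NCCL_ALGO"] = algo
    if proto is not None:
        env["NCCL_PROTO"] = proto
    if min_channels is not None:
        env["NCCL_MIN_NCHANNELS"] = str(min_channels)
    return env


# Hypothetical HCA names; list the real ones with `ibstat` or `ibv_devices`.
env = nccl_env(["mlx5_0", "mlx5_1"], algo="Ring", min_channels=8, base={})
print(env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when starting the launcher.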
Security & cost¶
- TEE-I/O: Trusted Execution Environment extended to GPU I/O; Blackwell is the first GPU to support it.
- Confidential computing: encrypted GPU/HBM/NVLink with attestation; protects data/weights in use at near-zero overhead.
- SPDM: Security Protocol and Data Model; device/firmware attestation.
- Neocloud: GPU-specialist cloud (CoreWeave, Lambda, Crusoe, Nebius, …). DePIN GPU: decentralized/permissionless GPU aggregation across many independent providers.
- Capacity Blocks: reserve a block of GPUs for a fixed window. Spot / preemptible: cheap, reclaimable capacity (checkpoint to use safely).
- FinOps: cloud cost discipline; key GPU signals are $/GPU-hour, real utilisation, $/token, $/run.
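The FinOps signals listed above are linked by simple arithmetic. The sketch below derives $/token from $/GPU-hour and sustained throughput, and shows how real utilisation inflates the effective hourly price. Prices and throughput in the example are hypothetical.

```python
def usd_per_million_tokens(usd_per_gpu_hour: float, num_gpus: int,
                           tokens_per_sec: float) -> float:
    """Serving cost per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_gpu_hour * num_gpus / tokens_per_hour * 1_000_000


def effective_usd_per_gpu_hour(usd_per_gpu_hour: float, utilisation: float) -> float:
    """Price per useful GPU-hour when only a fraction of paid time does work."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return usd_per_gpu_hour / utilisation


# Hypothetical deployment: 8 GPUs at $3.00/GPU-hour serving 20k tokens/s.
cost = usd_per_million_tokens(3.00, 8, 20_000)
print(f"${cost:.3f} per million tokens")
print(f"${effective_usd_per_gpu_hour(3.00, 0.5):.2f} per useful GPU-hour")
```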
Models & post-training¶
- MoE / active params: Mixture-of-Experts; only a subset of experts (the active params) run per token, so memory scales with total params but compute with active (serving open-weight models).
- MLA: Multi-head Latent Attention; compresses KV-cache state ~an order of magnitude (DeepSeek, Kimi K2).
- DeepSeek-V3 / R1: 671B/37B MoE, MLA + FP8; R1 is the GRPO-RL reasoning variant. Kimi K2: Moonshot ~1T/32B MoE, MLA; K2-Instruct is 128K/block-FP8. GLM (Z.ai/zai-org): MoE, MIT-licensed; GLM-4.7-FP8 is the current cookbook target. Qwen3 / Llama: dense + MoE open-weight families.
- SFT: supervised fine-tuning on curated demonstrations.
- LoRA / QLoRA: low-rank adapter fine-tuning / with 4-bit base quantisation; parameter-efficient.
- DPO / SimPO: Direct Preference Optimization (offline `(chosen, rejected)` alignment, no rollouts) / reference-free variant.
- GRPO: Group Relative Policy Optimization; critic-free RL using group-relative advantages (DeepSeekMath, scaled in R1). DAPO: a GRPO-stability successor.
- RLVR: Reinforcement Learning with Verifiable Rewards; reward comes from a deterministic checker (answer match, unit tests, format) rather than a learned reward model. Named by Tülu 3, scaled by DeepSeek-R1; algorithm-agnostic (PPO/GRPO/RLOO) (RLVR).
- On-policy distillation (OPD): the student trains on its own sampled rollouts, graded per-token by a teacher (reverse KL); dense reward, cheaper than RL, bounded by the teacher. OPSD: the self-teacher variant (same-model/earlier-checkpoint/privileged-context teacher) (on-policy distillation).
- RLSD: Reinforcement Learning with Self-Distillation; RLVR sets each update's direction (verifier reward) while a token-level weight `w_t = (P_T/P_S)^sign(A)` from a privileged-context self-teacher sets the magnitude: dense per-token credit on RLVR's sparse reward (RLSD).
- Model merging: combine multiple finetunes of one base into a single model with no training, via task vectors; interference-resolving methods (TIES, DARE) and interpolation (SLERP, model soups) via mergekit (model merging).
- Synthetic data: LLM-generated training data (Self-Instruct, Evol-Instruct, Magpie, teacher distillation, RLAIF); powerful but risks benchmark contamination and model collapse if unfiltered (synthetic data, data curation).
- Multi-LoRA serving: serve many LoRA adapters over one shared base with heterogeneous batching across adapters (S-LoRA, Punica); vLLM `enable_lora` (multi-LoRA serving).
- Eval harness / eval gate: reproducible benchmark runner (lm-evaluation-harness, lighteval) and the promotion check a checkpoint must pass before serving; decontaminate to keep scores honest (eval harness).
- Experiment tracking: per-run params/metrics/artifacts plus a versioned model registry and data→code→config→checkpoint lineage (MLflow, W&B) (experiment tracking & registry).
- TRL / verl / OpenRLHF: post-training frameworks (single-node / large-scale Ray RL / async RL).
- Disaggregated serving: prefill and decode on separate GPU pools (disaggregated inference). NIXL: NVIDIA Inference Xfer Library; moves KV cache GPU-to-GPU over RDMA/NVMe at wire speed.
- HSDP: Hybrid Sharded Data Parallel; shard intra-node on NVLink, replicate inter-node over IB.
- DiLoCo: Distributed Low-Communication training (local SGD, infrequent sync, ~500x less comms); OpenDiLoCo/PRIME are the Prime Intellect implementations.
- Burn rate: speed of error-budget consumption; multi-window burn-rate alerts page on fast burns (the SLO/SLI catalog).
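The "group-relative advantages" in the GRPO entry can be shown in a few lines. The sketch below normalises each rollout's reward by the mean and standard deviation of its own group, which is what removes the need for a learned critic. Using the population standard deviation and the small `eps` term are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored 0/1 by a verifier (RLVR-style reward).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# A group with identical rewards carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Every token of a rollout then shares that rollout's advantage, which is the sparse credit assignment the RLSD entry aims to densify.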
Orchestration & RL systems¶
- Ray: Python-native distributed runtime (tasks + actors); libraries: Ray Train, Ray Tune, Ray Serve, Ray Data, RLlib. The common controller for LLM-RL systems (orchestration overview).
- KubeRay: operator running Ray on Kubernetes via `RayCluster`/`RayJob`/`RayService` CRDs.
- k3s: lightweight single-binary Kubernetes (CNCF) for edge, small clusters, CI, and dev.
- Generator / Trainer: the rollout half (inference engine + environment/reward) and the policy-update half (PPO/GRPO on FSDP/Megatron) of an RL-for-LLM system (RL libraries).
- Colocated vs disaggregated RL: rollout and trainer share GPUs (efficient) vs separate pools (async, straggler-tolerant).
- verl / slime / SkyRL: RL post-training libraries: colocated-default high-perf (ByteDance) / Megatron+SGLang decoupled, powers GLM (THUDM) / flexible colocated-or-disaggregated (UC Berkeley). See also OpenRLHF, NeMo-RL, ROLL, AReaL, TRL.
Practices & platform¶
- SLI / SLO / error budget: service-level indicator / objective / the allowance of unreliability that arbitrates change velocity vs stability.
- Golden signals: latency, traffic, errors, saturation; GPU-flavoured as TTFT/TPOT, tokens-s, XID/ECC, SM-active.
- GitOps: cluster state declared in git and reconciled by Argo CD / Flux; drift self-heals.
- IaC: infrastructure as code: Terraform/OpenTofu for cloud/cluster, Ansible for the metal (Ansible bring-up).
- Policy-as-code: admission guardrails via Kyverno / OPA Gatekeeper (enforce GPU limits, image provenance, MIG).
- Golden path: templated self-service (Backstage, Helm/Kustomize bases) for compliant jobs without re-deriving the platform.
- MPIJob / PyTorchJob: Kubeflow CRDs for multi-pod MPI / PyTorch distributed jobs.
- Model registry / eval gate: versioned model store (MLflow, W&B) and the quality/regression/safety check a model must pass before promotion to serving.
- Pipeline orchestration: Kubeflow Pipelines / Argo Workflows / Flyte / Metaflow running data->train->eval->register->deploy.
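The SLO and error-budget entries, together with the burn-rate term defined earlier in this glossary, combine into a short alerting rule. The sketch below assumes the widely used convention of paging only when both a long and a short window burn faster than a threshold. The 14.4 default corresponds to spending 2% of a 30-day budget in one hour and is an assumption, not a value from this knowledge base.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("slo must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window rule: both windows must exceed the burn-rate threshold."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


# Hypothetical service with a 99.9% SLO and 2% of requests failing.
rate = burn_rate(0.02, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(should_page(0.02, 0.02, slo=0.999))    # sustained fast burn
print(should_page(0.02, 0.0005, slo=0.999))  # already recovering
```

Requiring the short window as well stops the alert from firing long after an incident has ended.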
Facility¶
- One-line diagram: single-line electrical schematic (feed to rack).
- UPS: uninterruptible power supply. Redundancy levels: N, N+1, 2N.
- THD: total harmonic distortion; IEEE 519 sets limits.
- TDP: thermal design power.
- PDU: power distribution unit.
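The N, N+1, and 2N labels in the UPS entry translate directly into module counts. The sketch below assumes identical UPS modules and a known critical load; the 500 kW module size is hypothetical, and the load is eight racks at the ~142 kW figure quoted for GB300 NVL72.

```python
import math


def ups_modules(load_kw: float, module_kw: float, scheme: str) -> int:
    """Number of UPS modules needed for a load under a redundancy scheme."""
    n = math.ceil(load_kw / module_kw)   # N: just enough capacity, no spare
    counts = {"N": n, "N+1": n + 1, "2N": 2 * n}
    if scheme not in counts:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return counts[scheme]


# Eight racks at ~142 kW each, with hypothetical 500 kW UPS modules.
load = 8 * 142
for scheme in ("N", "N+1", "2N"):
    print(scheme, ups_modules(load, 500, scheme))
```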
Related: Index