Glossary

Scope: concise definitions of the terms used across the knowledge base. Reference glossary, not a single WHAT/WHY/WHEN/HOW topic.

Each term is defined once here and linked from everywhere else in the knowledge base.

flowchart LR
  TERMS["Glossary terms"] --> HARDWARE["Hardware"]
  TERMS --> STACK["Software stack"]
  TERMS --> OPS["Operations"]
  TERMS --> MODELS["Models and post-training"]

Networking

  • InfiniBand (IB): low-latency RDMA interconnect for HPC/AI. Generations: EDR 100, HDR 200, NDR 400, XDR 800 Gb/s per port per direction.
  • NDR: Next Data Rate, 400 Gb/s IB (Quantum-2). 100G-PAM4 per lane.
  • XDR: eXtreme Data Rate, 800 Gb/s IB (Quantum-X800). 200G-PAM4 per lane.
  • Quantum-X800: NVIDIA's XDR IB platform (Quantum-3 ASIC). Switches: Q3400-RA (4U, 144x 800G), Q3200-RA (2U), Q3450-LD (co-packaged optics).
  • Spectrum-X / Spectrum-4: NVIDIA's Ethernet platform for AI; RoCEv2 for RDMA.
  • RoCE: RDMA over Converged Ethernet; the Ethernet path to RDMA.
  • ConnectX-8 (CX8): 800G SuperNIC, IB XDR or 2x400GbE, PCIe Gen6. ConnectX-7 (CX7) is the 400G NDR predecessor.
  • SuperNIC: NVIDIA's GPU-attached NIC for AI fabrics.
  • OSFP: octal small form-factor pluggable; the 800G transceiver/cage form factor. IHS (integrated heat sink, finned-top; switch-side, twin-port) and RHS (riding heat sink, flat-top; NIC-side, single-port) are not interchangeable.
  • SHARP: Scalable Hierarchical Aggregation and Reduction Protocol; in-network reduction for collectives. v4 on Quantum-X800. NVLS: the NVLink SHARP variant (in-network reduction over NVLink).
  • AR: Adaptive Routing. SHIELD: self-healing fabric.
  • Subnet Manager (SM): core IB controller (init, routing, partitioning). OpenSM is the open implementation.
  • UFM: Unified Fabric Manager; NVIDIA's IB management platform (provisioning, monitoring, diagnostics). SM is a component of it. Variants: Telemetry, Enterprise, Cyber-AI.
  • PKey: partition key; IB network partitioning/isolation.
  • Fat-tree / leaf-spine: multi-tier Clos topology; non-blocking when built 1:1 (no oversubscription). Super-spine: third tier for very large fabrics.
  • Skyway: IB-to-Ethernet gateway.
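
As a rough illustration of the fat-tree and per-lane entries above, the sketch below sizes a non-blocking fat-tree from the switch radix and derives a port rate from its lane rate. It assumes identical switches at every tier, 1:1 subscription, and four lanes per port; the function names are hypothetical, and real fabrics lose ports to management, storage, and rail layout.

```python
def port_gbps(lane_gbps: int, lanes: int = 4) -> int:
    """Port rate from the per-lane rate (assumes 4 lanes per port)."""
    return lane_gbps * lanes


def fat_tree_max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints for a non-blocking (1:1) fat-tree.

    Every tier except the top splits its ports half down, half up;
    the top tier points all ports down. That yields
    2 * (radix / 2) ** tiers endpoints.
    """
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return 2 * (radix // 2) ** tiers


if __name__ == "__main__":
    print(port_gbps(100))                  # NDR: 100G-PAM4 lanes
    print(port_gbps(200))                  # XDR: 200G-PAM4 lanes
    print(fat_tree_max_endpoints(144, 2))  # two-tier, 144-port switches
    print(fat_tree_max_endpoints(144, 3))  # with a super-spine tier
```

The jump between the two-tier and three-tier results is why a super-spine tier only appears in very large fabrics.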

Platform

  • Ampere: NVIDIA GPU architecture (2020, GA100, TSMC 7nm): A100/A30/A40/A10; 3rd-gen Tensor (TF32, no FP8), 3rd-gen NVLink 600 GB/s, MIG up to 7.
  • Hopper: NVIDIA GPU architecture (2022, GH100, TSMC 4N): H100/H200/GH200; 4th-gen Tensor + Transformer Engine FP8, NVLink4 900 GB/s, Confidential Computing.
  • Blackwell: NVIDIA GPU architecture (2024-2025): B200/B300/GB200/GB300 datacenter, RTX 50 consumer, RTX PRO 6000 / GB10; 5th-gen Tensor + FP4/NVFP4, NVLink5 1.8 TB/s, TEE-I/O.
  • B300: Blackwell Ultra GPU. 288 GB HBM3e, ~15 PFLOPS dense FP4, 1,400 W TDP.
  • GB300 NVL72: rack-scale system: 72 B300 + 36 Grace, liquid-cooled, up to ~142 kW, single 72-GPU NVLink domain.
  • GH200 / Grace Hopper: superchip: 1 Grace CPU + 1 Hopper GPU joined by NVLink-C2C (900 GB/s), giving a coherent CPU+GPU memory space.
  • GB10 / DGX Spark: Grace Blackwell desktop superchip (20-core Arm + Blackwell GPU); 128 GB LPDDR5x unified memory @ ~273 GB/s; ~1 PFLOP FP4; runs DGX OS; cluster two units over ConnectX-7.
  • L40S: Ada Lovelace datacenter card: 48 GB GDDR6 ECC, ~350 W, vGPU, no NVLink, no MIG; inference/graphics workhorse.
  • RTX PRO 6000 Blackwell: workstation/server GPU: 96 GB GDDR7 with ECC, 600 W, MIG up to 4, vGPU; no NVLink. Editions: Workstation, Server, Max-Q.
  • SXM vs PCIe: GPU form factors: SXM is the NVLink/NVSwitch socketed module (datacenter, busbar power); PCIe is the standard add-in card (PCIe-only multi-GPU, 8-pin/16-pin power).
  • Unified memory (LPDDR5x): coherent CPU+GPU memory pool over NVLink-C2C/ATS (Grace Hopper, GB10/DGX Spark); not discrete HBM/GDDR VRAM.
  • 12VHPWR / 12V-2x6: 16-pin PCIe power connectors: 12VHPWR on Ada consumer (RTX 40), 12V-2x6 (a revision) on Blackwell consumer (RTX 50).
  • Grace: NVIDIA Arm CPU (Neoverse V2). NVLink-C2C: coherent CPU-GPU link.
  • NVLink: intra-node/intra-rack GPU interconnect. 5th gen: 1.8 TB/s per GPU. NVSwitch: the switch ASIC forming the NVLink domain.
  • HBM3e: high-bandwidth memory; 288 GB per B300 via 12-high stacks.
  • NVFP4 / FP4 / FP8: low-precision tensor formats; Blackwell Ultra is tuned for these.
  • DGX / HGX B300: 8-GPU node form factors. SuperPOD: reference design of multiple SUs.
  • SU (Scalable Unit): modular building block, 72 nodes, rail-aligned.
  • Mission Control: NVIDIA's cluster management and RAS software.
  • Rubin: next GPU generation after Blackwell (Vera Rubin platform); announced at CES January 2026, paired with the Vera CPU and HBM4. As of mid-2026 ramping into full production, shipments expected 2H 2026.

Software stack & node

  • GSP: GPU System Processor; on-GPU RISC-V core running offloaded driver logic. Firmware ships with the driver and must match it.
  • Open kernel modules: NVIDIA's open-source GPU kernel modules (nvidia-open), required/default on Blackwell.
  • Fabric Manager: nv-fabricmanager; programs the NVSwitch fabric so GPUs form one NVLink domain. Lockstep-versioned with the driver.
  • IMEX: Internode Memory Exchange; coordinates the multi-node NVLink memory domain on NVL72.
  • MIG: Multi-Instance GPU; hardware partitioning into isolated instances with fault isolation.
  • MPS: Multi-Process Service; concurrent process sharing of one GPU, no fault isolation.
  • Persistence mode: keeps driver state resident (nvidia-persistenced) to avoid re-init latency/clock-down.
  • NVIDIA Container Toolkit: injects driver/devices into containers. CDI (Container Device Interface): the runtime-neutral standard it increasingly uses.
  • DKMS: Dynamic Kernel Module Support; rebuilds the driver on kernel upgrade.
  • LTS driver branch: a long-term-support datacenter driver branch (e.g. R535, R580); pin the fleet to one. R570 is a Production branch (shorter support), not LTS. Verify current designations on NVIDIA's support matrix. CUDA forward compatibility: run a newer CUDA on an older datacenter driver.
  • GeForce vs RTX-Enterprise vs datacenter driver: three NVIDIA driver families: GeForce (RTX 50/40 consumer; licence §2.8 bars datacenter deployment), RTX Enterprise Production Branch (RTX PRO/workstation; ISV-certified, long life-cycle), datacenter/Tesla (A100/H100/B-series; Production/LTS branches). DGX systems run DGX OS (customized Ubuntu, stack preinstalled).

Kubernetes & platform

  • GPU Operator: Helm-managed automation of the whole node GPU stack (driver, toolkit, plugin, DCGM, NFD/GFD, MIG manager, DRA driver).
  • Device plugin: legacy mechanism advertising nvidia.com/gpu as a countable integer resource.
  • DRA: Dynamic Resource Allocation; stable in Kubernetes 1.34. ResourceClaims/DeviceClasses with attributes and partitionable devices; the successor to the device plugin.
  • NFD / GFD: Node / GPU Feature Discovery; label nodes by hardware/GPU features.
  • Time-slicing: oversubscribe a GPU with no isolation. vGPU: hypervisor-mediated GPU virtualization for VMs.
  • KAI Scheduler: NVIDIA's open-source (Apache-2.0, ex-Run:ai) K8s scheduler: gang scheduling, fair-share queues, bin-packing, topology-aware. Volcano / Kueue: CNCF batch scheduler / job-queueing.
  • Gang scheduling: all-or-nothing placement of a distributed job's pods.
  • Grove: placement automation for rack-scale NVLink systems (used by Dynamo/KAI).
  • Network Operator: brings host RDMA, Multus, SR-IOV, and GPUDirect into Kubernetes pods.
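
To make the gang-scheduling entry above concrete, here is a minimal sketch of all-or-nothing placement. It is not how KAI Scheduler, Volcano, or Kueue implement it; `gang_place` is a hypothetical helper that uses a simple first-fit-decreasing heuristic, so it can reject a job that a smarter packer would fit.

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], pods: List[int]) -> Optional[Dict[int, str]]:
    """Place every pod of a job, or none of them.

    free_gpus: node name -> free GPU count.
    pods: GPUs requested by each pod (index = pod id).
    Returns pod id -> node name, or None if the whole gang does not fit.
    """
    remaining = dict(free_gpus)      # work on a copy; caller state is untouched
    placement: Dict[int, str] = {}
    # Largest requests first (first-fit decreasing, a heuristic).
    for pod_id in sorted(range(len(pods)), key=lambda i: -pods[i]):
        need = pods[pod_id]
        node = next((n for n in sorted(remaining) if remaining[n] >= need), None)
        if node is None:
            return None              # one pod fails -> the whole job waits
        remaining[node] -= need
        placement[pod_id] = node
    return placement


if __name__ == "__main__":
    nodes = {"node-a": 8, "node-b": 4}
    print(gang_place(nodes, [8, 4]))   # fits: both pods are placed
    print(gang_place(nodes, [8, 8]))   # does not fit: nothing is placed
```

Returning `None` instead of a partial placement is the point: a distributed job holding half its GPUs makes no progress and blocks other jobs.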

Storage & data

  • Parallel filesystem: cluster-wide high-throughput FS: Lustre, IBM Storage Scale (GPFS), BeeGFS; AI-flash: WEKA, VAST, DDN.
  • GPUDirect Storage (GDS): DMA from storage directly into GPU memory (cuFile/nvidia-fs), bypassing the CPU bounce buffer.
  • DCP: torch.distributed.checkpoint; sharded, asynchronous distributed checkpointing.
  • DALI: NVIDIA Data Loading Library; GPU-accelerated decode/augment.
  • WebDataset / Mosaic StreamingDataset: shard datasets (tar/parquet) to avoid small-file metadata storms.
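
The sharding idea in the last entry can be sketched with the standard library alone: many small samples are packed into a few large tar files so the filesystem sees a handful of sequential reads instead of millions of metadata operations. This is an illustration, not the WebDataset or StreamingDataset writer; `write_shards`, the `.bin` suffix, and the shard naming are assumptions.

```python
import io
import tarfile
from pathlib import Path
from typing import Iterable, List, Tuple


def write_shards(samples: Iterable[Tuple[str, bytes]],
                 out_dir: str,
                 samples_per_shard: int = 1000) -> List[Path]:
    """Pack (key, payload) samples into numbered tar shards."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    tar = None
    try:
        for i, (key, payload) in enumerate(samples):
            if i % samples_per_shard == 0:
                if tar is not None:
                    tar.close()          # finish the previous shard
                path = out / f"shard-{i // samples_per_shard:06d}.tar"
                tar = tarfile.open(path, "w")
                paths.append(path)
            info = tarfile.TarInfo(name=f"{key}.bin")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    finally:
        if tar is not None:
            tar.close()
    return paths


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        data = ((f"sample-{i:05d}", b"x" * 16) for i in range(25))
        for p in write_shards(data, tmp, samples_per_shard=10):
            print(p.name)
```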

Training

  • torchrun: PyTorch elastic launcher (c10d/etcd rendezvous).
  • DDP: Distributed Data Parallel; replicate model, all-reduce gradients.
  • FSDP / ZeRO: shard params/grads/optimizer state across ranks (PyTorch FSDP2 / DeepSpeed ZeRO stages 1-3).
  • TP / PP / EP / SP: Tensor / Pipeline / Expert / Sequence parallelism; they compose and must match topology.
  • Megatron-Core: reference TP/PP/EP engine. NeMo: end-to-end framework on Megatron. DeepSpeed, TorchTitan: large-scale training stacks.
  • Transformer Engine: NVIDIA library providing FP8 training layers.
  • MFU / HFU: Model / Hardware FLOPs Utilization; achieved vs peak FLOPs (HFU also counts recompute FLOPs). The key training-efficiency metric; a healthy run typically lands around 35-50%.
  • Activation checkpointing: recompute activations to save memory. Gradient accumulation: simulate a larger global batch.
  • DiLoCo: low-communication distributed training for WAN/geo-distributed GPUs.
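
A back-of-envelope version of the MFU / HFU entry above, assuming the common approximation of about 6 FLOPs per parameter per token for a dense transformer (2 forward, 4 backward) and ignoring attention FLOPs. With full activation recompute the hardware executes one extra forward pass, so HFU uses 8 instead of 6. The function names and the example numbers are placeholders, not measurements or datasheet values.

```python
def mfu(params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: useful model FLOPs over peak hardware FLOPs."""
    achieved = 6.0 * params * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)


def hfu(params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float,
        full_recompute: bool = True) -> float:
    """Hardware FLOPs Utilization: also counts recomputed forward FLOPs."""
    flops_per_param_token = 8.0 if full_recompute else 6.0
    achieved = flops_per_param_token * params * tokens_per_second
    return achieved / (n_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Placeholder inputs: a 7B dense model, 80k tokens/s over 8 GPUs,
    # and an assumed 1e15 FLOP/s peak per GPU at the training precision.
    args = dict(params=7e9, tokens_per_second=80_000,
                n_gpus=8, peak_flops_per_gpu=1e15)
    print(f"MFU = {mfu(**args):.2f}")
    print(f"HFU = {hfu(**args):.2f}")
```

HFU is never lower than MFU for the same run; a large gap between the two signals that recompute is eating a lot of the hardware's time.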

Inference & serving

  • Prefill / decode: the compute-bound prompt phase / the memory-bandwidth-bound token-generation phase.
  • TTFT / TPOT (ITL): Time To First Token / Time Per Output Token (inter-token latency). Goodput: throughput within SLO.
  • KV cache: cached attention keys/values; the dominant serving memory cost.
  • PagedAttention: vLLM's paged KV-cache management. RadixAttention: SGLang's prefix-caching via a radix tree.
  • Continuous (in-flight) batching: add/retire requests from the running batch each step.
  • vLLM / SGLang / TensorRT-LLM: the leading inference engines (flexible default / prefix-heavy / compiled-fastest).
  • Dynamo: NVIDIA's datacenter-scale distributed inference framework (disaggregated serving, KV-aware routing). NIM: packaged inference microservice.
  • Triton Inference Server: multi-framework serving with dynamic batching. KServe: K8s-native model serving.
  • Speculative decoding: draft model proposes tokens, target verifies. Disaggregated serving: prefill and decode on separate GPU pools.
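
Two of the entries above reduce to simple arithmetic. The sketch below estimates KV-cache size for standard multi-head or grouped-query attention (it does not apply to compressed schemes such as MLA) and derives TTFT and TPOT from token arrival timestamps. The function names and the example model shape are hypothetical.

```python
from typing import List, Optional, Tuple


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: keys and values (x2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def ttft_tpot(request_start: float,
              token_times: List[float]) -> Tuple[float, Optional[float]]:
    """TTFT = first token arrival minus request start.

    TPOT = mean gap between consecutive output tokens after the first
    (None when fewer than two tokens were produced).
    """
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Assumed shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
    per_token = kv_cache_bytes(32, 8, 128, n_tokens=1)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{kv_cache_bytes(32, 8, 128, n_tokens=32_768) / 2**30:.1f} GiB at 32k context")
    print(ttft_tpot(0.0, [0.50, 0.55, 0.60, 0.65]))
```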

Observability, RAS & optimization

  • DCGM: Data Center GPU Manager; health, diagnostics, telemetry. dcgm-exporter: Prometheus exporter for DCGM.
  • SM-active / tensor-active: fraction of time SMs / Tensor Cores did work; the real "busy" signal, unlike misleading GPU-Util.
  • Nsight Systems (nsys) / Nsight Compute (ncu): system-timeline / single-kernel profilers.
  • XID: GPU error code in dmesg (app-caused 13/31/43; hardware 48 DBE, 79 fallen-off-bus, 94/95 contained/uncontained ECC). SXID: the NVSwitch equivalent.
  • ECC SBE / DBE: single-bit (correctable) / double-bit (uncorrectable) error. Row remapping: HBM remaps a bad row to a spare; remapping failure means RMA.
  • RAS: Reliability, Availability, Serviceability. MTBF: mean time between failures; checkpoint interval must be shorter. Straggler: a slow rank gating every collective.
  • Roofline: compute-bound vs memory-bound kernel model. ACS: PCIe Access Control Services; breaks P2P/GDR if enabled. NUMA: bind GPU↔NIC↔CPU locality.
  • FlashAttention: fused, memory-efficient attention. CUDA Graphs / torch.compile: eliminate launch overhead / fuse kernels.
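
The XID grouping above lends itself to a small triage helper. The sketch below covers only the codes listed in this glossary and assumes the kernel log line carries the usual `NVRM: Xid (<device>): <code>` prefix; the sample log line and the function name are illustrative, and any code outside the table is reported as unknown rather than guessed.

```python
import re
from typing import Optional, Tuple

# Codes taken from the XID entry above; not an exhaustive table.
APP_XIDS = {13, 31, 43}
HW_XIDS = {
    48: "double-bit ECC error (DBE)",
    79: "GPU has fallen off the bus",
    94: "contained ECC error",
    95: "uncontained ECC error",
}

XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def classify_xid(line: str) -> Optional[Tuple[str, int, str]]:
    """Return (device, code, kind) for an Xid log line, else None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    if code in APP_XIDS:
        kind = "application"
    elif code in HW_XIDS:
        kind = "hardware"
    else:
        kind = "unknown"
    return match.group("dev"), code, kind


if __name__ == "__main__":
    sample = "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus."
    print(classify_xid(sample))
```

A hardware result is a reason to drain the node and run diagnostics; an application result usually points back at the job.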

Compute / ops

  • NCCL: NVIDIA Collective Communications Library (all_reduce, all_gather, etc.). Tuned via NCCL_* env vars (algo, proto, channels, IB HCA, GDR level).
  • GPUDirect RDMA: NIC reads/writes GPU memory directly, bypassing host.
  • DDP / FSDP / DiLoCo: distributed training strategies (see Training).
  • BMC: baseboard management controller. IPMI (legacy) / Redfish (modern): OOB management protocols.
  • PXE: network boot for bare-metal provisioning.
  • Slurm: HPC workload manager/scheduler.
  • CDU: coolant distribution unit. Rear-door heat exchanger (RDHx): rack-door liquid cooling.
  • OFED: OpenFabrics Enterprise Distribution; IB/RDMA drivers and userspace.
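
The NCCL entry above mentions tuning through `NCCL_*` environment variables. A minimal sketch of collecting them in one place follows; the variable names are real NCCL settings, but the helper, its defaults, and the `mlx5_*` device names are assumptions for illustration. NCCL normally selects algorithm and protocol on its own, so pinning them is mostly useful for debugging or A/B tests.

```python
from typing import Dict, List, Optional


def nccl_env(ib_hcas: List[str],
             algo: Optional[str] = None,
             proto: Optional[str] = None,
             min_channels: Optional[int] = None,
             gdr_level: Optional[str] = None,
             debug: str = "WARN") -> Dict[str, str]:
    """Build a dict of NCCL environment variables; unset options are omitted."""
    env = {
        "NCCL_IB_HCA": ",".join(ib_hcas),   # which IB devices NCCL may use
        "NCCL_DEBUG": debug,                # e.g. WARN or INFO
    }
    if algo is not None:
        env["NCCL_ALGO"] = algo             # e.g. Ring or Tree
    if proto is not None:
        env["NCCL_PROTO"] = proto           # e.g. Simple, LL, LL128
    if min_channels is not None:
        env["NCCL_MIN_NCHANNELS"] = str(min_channels)
    if gdr_level is not None:
        env["NCCL_NET_GDR_LEVEL"] = gdr_level  # e.g. PIX, PXB, PHB, SYS
    return env


if __name__ == "__main__":
    # Hypothetical two-rail node.
    for key, value in sorted(nccl_env(["mlx5_0", "mlx5_1"], algo="Ring").items()):
        print(f"export {key}={value}")
```

In a Python launcher the same dict can be applied with `os.environ.update(...)` before the process group is initialized, since NCCL reads these variables at communicator setup.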

Security & cost

  • TEE-I/O: Trusted Execution Environment extended to GPU I/O; Blackwell is the first GPU to support it.
  • Confidential computing: encrypted GPU/HBM/NVLink with attestation; protects data/weights in use at near-zero overhead.
  • SPDM: Security Protocol and Data Model; device/firmware attestation.
  • Neocloud: GPU-specialist cloud (CoreWeave, Lambda, Crusoe, Nebius, …). DePIN GPU: decentralized/permissionless GPU aggregation across many independent providers.
  • Capacity Blocks: reserve a block of GPUs for a fixed window. Spot / preemptible: cheap, reclaimable capacity (checkpoint to use safely).
  • FinOps: cloud cost discipline; key GPU signals are $/GPU-hour, real utilisation, $/token, $/run.

Models & post-training

  • MoE / active params: Mixture-of-Experts; only a subset of experts (the active params) run per token, so memory scales with total params but compute with active (serving open-weight models).
  • MLA: Multi-head Latent Attention; compresses KV-cache state ~an order of magnitude (DeepSeek, Kimi K2).
  • DeepSeek-V3 / R1: 671B/37B MoE, MLA + FP8; R1 is the GRPO-RL reasoning variant. Kimi K2: Moonshot ~1T/32B MoE, MLA; K2-Instruct is 128K/block-FP8. GLM (Z.ai/zai-org): MoE, MIT-licensed; GLM-4.7-FP8 is the current cookbook target. Qwen3 / Llama: dense + MoE open-weight families.
  • SFT: supervised fine-tuning on curated demonstrations.
  • LoRA / QLoRA: low-rank adapter fine-tuning / with 4-bit base quantisation; parameter-efficient.
  • DPO / SimPO: Direct Preference Optimization (offline (chosen,rejected) alignment, no rollouts) / reference-free variant.
  • GRPO: Group Relative Policy Optimization; critic-free RL using group-relative advantages (DeepSeekMath, scaled in R1). DAPO: a GRPO-stability successor.
  • RLVR: Reinforcement Learning with Verifiable Rewards; reward comes from a deterministic checker (answer match, unit tests, format) rather than a learned reward model. Named by Tülu 3, scaled by DeepSeek-R1; algorithm-agnostic (PPO/GRPO/RLOO) (RLVR).
  • On-policy distillation (OPD): the student trains on its own sampled rollouts, graded per-token by a teacher (reverse KL); dense reward, cheaper than RL, bounded by the teacher. OPSD: the self-teacher variant (same-model/earlier-checkpoint/privileged-context teacher) (on-policy distillation).
  • RLSD: Reinforcement Learning with Self-Distillation; RLVR sets each update's direction (verifier reward) while a token-level weight w_t=(P_T/P_S)^sign(A) from a privileged-context self-teacher sets the magnitude: dense per-token credit on RLVR's sparse reward (RLSD).
  • Model merging: combine multiple finetunes of one base into a single model with no training, via task vectors; interference-resolving methods (TIES, DARE) and interpolation (SLERP, model soups) via mergekit (model merging).
  • Synthetic data: LLM-generated training data (Self-Instruct, Evol-Instruct, Magpie, teacher distillation, RLAIF); powerful but risks benchmark contamination and model collapse if unfiltered (synthetic data, data curation).
  • Multi-LoRA serving: serve many LoRA adapters over one shared base with heterogeneous batching across adapters (S-LoRA, Punica); vLLM enable_lora (multi-LoRA serving).
  • Eval harness / eval gate: reproducible benchmark runner (lm-evaluation-harness, lighteval) and the promotion check a checkpoint must pass before serving; decontaminate to keep scores honest (eval harness).
  • Experiment tracking: per-run params/metrics/artifacts plus a versioned model registry and data→code→config→checkpoint lineage (MLflow, W&B) (experiment tracking & registry).
  • TRL / verl / OpenRLHF: post-training frameworks (single-node / large-scale Ray RL / async RL).
  • Disaggregated serving: prefill and decode on separate GPU pools (disaggregated inference). NIXL: NVIDIA Inference Xfer Library; moves KV cache GPU-to-GPU over RDMA/NVMe at wire speed.
  • HSDP: Hybrid Sharded Data Parallel; shard intra-node on NVLink, replicate inter-node over IB.
  • DiLoCo: Distributed Low-Communication training (local SGD, infrequent sync, ~500x less comms); OpenDiLoCo/PRIME are the Prime Intellect implementations.
  • Burn rate: speed of error-budget consumption; multi-window burn-rate alerts page on fast burns (the SLO/SLI catalog).
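
The "group-relative advantages" in the GRPO entry above can be sketched in a few lines: several completions are sampled for one prompt, each gets a scalar reward, and each completion's advantage is its reward normalized against the group, which removes the need for a learned critic. This is a simplified illustration, not any framework's implementation; using the population standard deviation and the `eps` value are assumptions, and libraries differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for one prompt, scored by a verifier (1 = pass, 0 = fail).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every rollout agrees carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```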

Orchestration & RL systems

  • Ray: Python-native distributed runtime (tasks + actors); libraries: Ray Train, Ray Tune, Ray Serve, Ray Data, RLlib. The common controller for LLM-RL systems (orchestration overview).
  • KubeRay: operator running Ray on Kubernetes via RayCluster / RayJob / RayService CRDs.
  • k3s: lightweight single-binary Kubernetes (CNCF) for edge, small clusters, CI, and dev.
  • Generator / Trainer: the rollout half (inference engine + environment/reward) and the policy-update half (PPO/GRPO on FSDP/Megatron) of an RL-for-LLM system (RL libraries).
  • Colocated vs disaggregated RL: rollout and trainer share GPUs (efficient) vs separate pools (async, straggler-tolerant).
  • verl / slime / SkyRL: RL post-training libraries: colocated-default high-perf (ByteDance) / Megatron+SGLang decoupled, powers GLM (THUDM) / flexible colocated-or-disaggregated (UC Berkeley). See also OpenRLHF, NeMo-RL, ROLL, AReaL, TRL.

Practices & platform

  • SLI / SLO / error budget: service-level indicator / objective / the allowance of unreliability that arbitrates change velocity vs stability.
  • Golden signals: latency, traffic, errors, saturation; GPU-flavoured as TTFT/TPOT, tokens/s, XID/ECC, SM-active.
  • GitOps: cluster state declared in git and reconciled by Argo CD / Flux; drift self-heals.
  • IaC: infrastructure as code: Terraform/OpenTofu for cloud/cluster, Ansible for the metal (Ansible bring-up).
  • Policy-as-code: admission guardrails via Kyverno / OPA Gatekeeper (enforce GPU limits, image provenance, MIG).
  • Golden path: templated self-service (Backstage, Helm/Kustomize bases) for compliant jobs without re-deriving the platform.
  • MPIJob / PyTorchJob: Kubeflow CRDs for multi-pod MPI / PyTorch distributed jobs.
  • Model registry / eval gate: versioned model store (MLflow, W&B) and the quality/regression/safety check a model must pass before promotion to serving.
  • Pipeline orchestration: Kubeflow Pipelines / Argo Workflows / Flyte / Metaflow running data→train→eval→register→deploy.
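
The error-budget and burn-rate terms in this glossary can be tied together with a short calculation. Burn rate is the observed error ratio divided by the budget (1 minus the SLO target); a multi-window alert pages only when both a long and a short window burn fast, so a spike that has already ended does not page. The function names are hypothetical, and the 14.4 threshold is an assumption borrowed from the commonly used 1-hour/5-minute pairing for a 30-day SLO.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    slo = 0.999  # 99.9% of requests meet the latency/error objective
    print(round(burn_rate(0.01, slo), 2))   # 1% errors against a 0.1% budget
    print(should_page(0.02, 0.02, slo))     # still burning fast: page
    print(should_page(0.02, 0.0005, slo))   # spike already over: no page
```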

Facility

  • One-line diagram: single-line electrical schematic (feed to rack).
  • UPS: uninterruptible power supply. Redundancy levels: N, N+1, 2N.
  • THD: total harmonic distortion; IEEE 519 sets limits.
  • TDP: thermal design power.
  • PDU: power distribution unit.

References

  • NVIDIA DCGM documentation: https://docs.nvidia.com/datacenter/dcgm/latest/index.html
  • NVIDIA MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
  • NVIDIA GPUDirect RDMA: https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
  • Kubernetes Dynamic Resource Allocation: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
  • PyTorch Distributed: https://pytorch.org/docs/stable/distributed.html
  • vLLM documentation: https://docs.vllm.ai/en/latest/

Related: Index