AI infrastructure knowledge base

Scope: the role-based reading-path map and coverage overview for the knowledge base. Reference index page.

A working wiki for the full lifecycle of GPU/HPC infrastructure (deploy, manage, run, optimise) across the roles it touches: sysadmin, GPU server engineer, platform engineer, and MLOps. It covers the NVIDIA range (Ampere, Hopper, and Blackwell datacenter GPUs; RTX consumer and workstation cards; and DGX systems, including DGX Spark) and how these differ in operations, installation, and networking. The current focus is the Blackwell Ultra generation (B300 / GB300 NVL72). Current to mid-2026.

Published at ai-infrastructure.net by setloop.io, the company behind this knowledge base.

Pages come in four kinds. Concept pages explain a topic and its traps. Recipe pages carry reference Ansible playbooks, Helm/Kubernetes manifests, and monitoring stacks. Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page with a fixed shape. Reference pages are the glossary and indexes. Mermaid architecture diagrams are embedded where a visual clarifies structure or flow.

How to use this

Concept pages follow the same shape:

Overview: what the topic is and why it matters operationally.
Core knowledge: the facts, structured.
Don't-miss checklist: the things that bite if skipped.
Failure modes: where it goes wrong in the field.
Open questions & validation: what to confirm against current vendor docs or in a test environment before relying on it.
References: primary sources, to be re-checked rather than trusted from a cached copy.
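
The shape above lends itself to a mechanical check. The following is a minimal sketch, assuming concept pages are Markdown files whose section headings use exactly the six names listed; the helper `missing_sections` and the heading-matching rule are hypothetical and are not part of this knowledge base's tooling.

```python
import re

# Section names taken from the concept-page shape listed above.
REQUIRED_SECTIONS = [
    "Overview",
    "Core knowledge",
    "Don't-miss checklist",
    "Failure modes",
    "Open questions & validation",
    "References",
]

# Assumption: sections are written as Markdown ATX headings ("## Overview").
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(page_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    found = {heading.lower() for heading in HEADING_RE.findall(page_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


if __name__ == "__main__":
    # Illustrative page body with three of the six sections present.
    sample = "# Example page\n\n## Overview\n\n## Core knowledge\n\n## References\n"
    for name in missing_sections(sample):
        print(f"missing section: {name}")
```

A check like this could run in CI next to the site build, so a page that drops a section is caught at review time rather than by a reader.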

Recipe and runbook pages are example-first: reference manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.

Conventions

Manifests, playbooks, and configs are reference templates drawn from upstream documentation. Adapt versions, names, and resource sizes to the target environment and validate before production use. Pin image and chart versions; do not run :latest in production.
Commands assume a Linux control node with kubectl, helm, and ansible configured against the target cluster.
GitOps is assumed: manifests live in a repo and are applied by Argo CD or Flux, not pasted by hand into a production cluster (see SRE & MLOps practices).
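
The pinning convention above can be checked before a manifest reaches the repo. The following is a minimal sketch, assuming image references are plain strings in the usual `registry/name:tag` or digest form; `is_pinned` is a hypothetical helper and the image names are illustrative placeholders.

```python
def is_pinned(image: str) -> bool:
    """True if the reference carries a digest or an explicit non-latest tag."""
    if "@sha256:" in image:
        return True  # a digest reference identifies one exact image
    # Look for the tag only in the last path segment, so a registry port
    # such as "host:5000/name" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: container runtimes default to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    for ref in (
        "registry.example.com:5000/team/trainer:1.4.2",
        "registry.example.com:5000/team/trainer",
        "example/server:latest",
    ):
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```

This only inspects the reference string. It does not confirm that the tag exists or that a mutable tag has not been re-pushed; digest pinning is the stricter option where that matters.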

Mental model

Four verbs over a layered stack. Deploy: specify and validate hardware, design the fabric, confirm facility readiness, bring up and sign off. Manage: administer the node software stack, provision and schedule, secure and isolate tenants. Run: train and serve workloads on top. Optimise: keep the fabric saturated, the GPUs fed, and cost-per-unit-work down, measured rather than guessed.

flowchart TB
  HW["Platform hardware: B300 / GB300 NVL72"] --> BC["Build & commission: BOM, facility, fabric, acceptance"]
  BC --> NS["Node & software admin: driver stack, provisioning, scheduling"]
  NS --> CP["Cluster platform: Kubernetes, storage, security"]
  CP --> WL["Workloads: distributed training, inference serving"]
  WL --> OPT["Operate & optimise: observability, RAS, tuning, SLOs"]
  OPT -.->|"continuous feedback"| NS

The stack, bottom-up, and where each page sits:

GPU hardware, the full range and how their ops differ: GPU generations, Blackwell datacenter, Hopper, Ampere, RTX & workstation, DGX systems, DGX Spark.
Build & commission: BOM validation, datacentre readiness, networking fabric, commissioning.
Node & software admin: GPU software stack, provisioning & scheduling.
Cluster platform: Kubernetes for GPUs, storage & data, security & multi-tenancy.
Workloads: distributed training, inference serving, open-weight serving cookbooks.
Operate & optimise: performance & health, observability, reliability & RAS, performance tuning.
Strategy & reference: cloud & cost, troubleshooting runbook, Glossary.
Recipes & manifests: recipes & manifests index, Ansible bring-up, Kubernetes platform, telemetry, workload recipes, SRE & MLOps practices, operational runbooks.
Models & methods: serving open-weight models, disaggregated inference, fine-tuning & post-training, FSDP/DiLoCo recipes, SLO/SLI catalog.
Orchestration & RL systems: orchestration overview, RL libraries.
Per-technology deep-dives: clusters (Kubernetes, k3s, Ray, Slurm); training algorithms (FSDP, DDP, DeepSpeed/ZeRO, tensor parallelism, pipeline parallelism, DiLoCo); post-training (GRPO, DPO, SFT & LoRA); RL libraries (verl, slime, SkyRL, OpenRLHF, NeMo-RL, TRL).

Coverage map

This map shows which topics are currently covered; it does not attest that the sources behind them have been verified. Page freshness is supplied by the git revision date plugin in CI; source claims still need page-level revalidation when vendor figures, APIs, or model releases change.
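
Revision dates can also be used to queue pages for revalidation. The following is a minimal sketch, assuming a mapping from page path to last revision date is already available (for example, derived from git history); the helper `stale_pages`, the 180-day threshold, and the page paths are all hypothetical. Age is only a proxy: a recently edited page can still carry an outdated vendor figure.

```python
from datetime import date, timedelta


def stale_pages(
    last_revised: dict[str, date],
    today: date,
    max_age_days: int = 180,  # assumed threshold, not a project setting
) -> list[str]:
    """Return page paths whose last revision is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, revised in last_revised.items() if revised < cutoff)


if __name__ == "__main__":
    # Hypothetical page paths and dates, for illustration only.
    revisions = {
        "hardware/example-a.md": date(2025, 11, 1),
        "workloads/example-b.md": date(2026, 5, 1),
    }
    for path in stale_pages(revisions, today=date(2026, 6, 1)):
        print(f"revalidate: {path}")
```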

Area	Pages
GPU hardware	generations, Blackwell, Hopper, Ampere, RTX/workstation, DGX/HGX, DGX Spark
Build & commission	BOM validation, vendor sourcing & procurement, datacentre readiness, networking fabric, fabric bring-up & benchmarking, commissioning
Node & platform	software stack, provisioning, Kubernetes for GPUs, storage, security
Workloads	distributed training, inference serving, serving open-weight models, DeepSeek-R1 cookbook, Kimi K2 cookbook, GLM-5.2 cookbook, GLM-4.7 cookbook, Qwen3 cookbook, Llama 4 cookbook, disaggregated inference
Operate & optimise	GPU performance, observability, RAS, performance tuning, SLO/SLI catalog, inference serving SLOs, training platform SLOs, cluster and fabric SLOs, burn-rate alerting rules
Performance engineering	roofline and arithmetic intensity, Nsight profiling workflow, CUDA occupancy tuning, memory coalescing, kernel fusion, NCCL collectives and algorithms, torch.compile, FlashAttention and MLA
Recipes & manifests	recipes index, Ansible bring-up, Kubernetes platform, telemetry, workload recipes, SRE/MLOps
Orchestration	overview, Kubernetes, k3s, Ray, Slurm
Training algorithms	FSDP, DDP, ZeRO, tensor parallelism, pipeline parallelism, DiLoCo
Post-training & RL	fine-tuning, SFT/LoRA, DPO, GRPO, RL library overview, verl, slime, SkyRL, OpenRLHF, NeMo-RL, TRL
Runbooks & reference	all 23 operational runbooks, including driver/module load failure, NVLink visibility failure, PCIe/P2P bandwidth regression, scheduler GPU job pending, training OOM, inference KV-cache OOM, NCCL hang, GPU fault and RMA; plus cloud cost, Glossary

By role

Sysadmin / GPU server engineer: GPU software stack, Ansible bring-up, provisioning & scheduling, observability, reliability & RAS, troubleshooting runbook.
Platform engineer: Kubernetes for GPUs, Kubernetes platform manifests, orchestration overview, storage & data, security & multi-tenancy, networking fabric, SRE & MLOps practices.
MLOps: distributed training, inference serving, serving open-weight models, fine-tuning & post-training, RL libraries, FSDP/DiLoCo recipes, disaggregated inference, performance tuning.
SRE: reliability & RAS, observability, telemetry & alerting, operational runbooks, troubleshooting runbook, SLO/SLI catalog, SRE & MLOps practices.
Deployment / build: BOM validation, datacentre readiness, networking fabric, commissioning, Blackwell platform.

References

MkDocs: https://www.mkdocs.org/
Material for MkDocs: https://squidfunk.github.io/mkdocs-material/
NVIDIA DGX SuperPOD reference architecture: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
Kubernetes documentation: https://kubernetes.io/docs/home/
PyTorch distributed overview: https://pytorch.org/docs/stable/distributed.html
vLLM documentation: https://docs.vllm.ai/en/latest/

Related: Home · Glossary · Operational runbooks