AI infrastructure knowledge base

Scope: the role-based reading-path map and coverage overview for the knowledge base. Reference index page.

A working wiki for the full lifecycle of GPU/HPC infrastructure (deploy, manage, run, optimise) across the roles it touches: sysadmin, GPU server engineer, platform engineer, and MLOps. It covers the NVIDIA range (Ampere, Hopper, and Blackwell datacentre GPUs; RTX consumer and workstation cards; and DGX systems, including DGX Spark) and the operational, installation, and networking differences between them. The current focus is the Blackwell Ultra (B300 / GB300 NVL72) generation. Current to mid-2026.

Published at ai-infrastructure.net by setloop.io, the company behind this knowledge base.

Pages come in four kinds. Concept pages explain a topic and its traps. Recipe pages carry reference Ansible playbooks, Helm/Kubernetes manifests, and monitoring stacks. Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page with a fixed shape. Reference pages are the glossary and indexes. Mermaid architecture diagrams are embedded where a visual clarifies structure or flow.

How to use this

Concept pages follow the same shape:

  • Overview: what the topic is and why it matters operationally.
  • Core knowledge: the facts, structured.
  • Don't-miss checklist: the things that bite if skipped.
  • Failure modes: where it goes wrong in the field.
  • Open questions & validation: what to confirm against current vendor docs or in a test environment before relying on it.
  • References: primary sources, to be re-checked rather than trusted from a cached copy.
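The fixed shape above lends itself to an automated check. The following is a minimal sketch, not the knowledge base's own tooling: the section names come from the list above, while the function names and the assumption that each section heading sits on its own line (optionally prefixed with Markdown `#` marks) are hypothetical.

```python
# Section names are taken from the concept-page shape listed above.
REQUIRED_SECTIONS = [
    "Overview",
    "Core knowledge",
    "Don't-miss checklist",
    "Failure modes",
    "Open questions & validation",
    "References",
]


def _normalised_lines(page_text: str) -> list[str]:
    """Lower-case every line after stripping whitespace and leading '#' marks."""
    # Assumption: a heading occupies a whole line, with or without '#'.
    return [line.strip().lstrip("#").strip().lower()
            for line in page_text.splitlines()]


def missing_sections(page_text: str) -> list[str]:
    """Return the required sections that never appear as a heading line."""
    present = set(_normalised_lines(page_text))
    return [name for name in REQUIRED_SECTIONS if name.lower() not in present]


def sections_in_order(page_text: str) -> bool:
    """True if the required sections that are present follow the listed order."""
    position = {name.lower(): i for i, name in enumerate(REQUIRED_SECTIONS)}
    seen = [position[line] for line in _normalised_lines(page_text)
            if line in position]
    return seen == sorted(seen)


if __name__ == "__main__":
    # Hypothetical page body used only to exercise the helpers.
    sample = "# Overview\nSome text.\n## Core knowledge\n## References\n"
    print("missing:", missing_sections(sample))
    print("in order:", sections_in_order(sample))
```

A check like this could run in CI next to the freshness plugin, but whether the real pages use exactly these heading strings should be confirmed against the repository first.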

Recipe and runbook pages are example-first: reference manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.

Conventions

  • Manifests, playbooks, and configs are reference templates drawn from upstream documentation. Adapt versions, names, and resource sizes to the target environment and validate before production use. Pin image and chart versions; do not run :latest in production.
  • Commands assume a Linux control node with kubectl, helm, and ansible configured against the target cluster.
  • GitOps is assumed: manifests live in a repo and are applied by Argo CD / Flux, not pasted by hand into a production cluster (see SRE and MLOps practices).
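The version-pinning convention can be enforced mechanically before a manifest reaches the repo. The sketch below is illustrative only: `is_pinned` and `unpinned_images` are hypothetical helpers, the image names are made up, and the parsing covers only the common `name:tag` and `name@digest` forms. It relies on the fact that an image reference without a tag resolves to `latest`.

```python
def is_pinned(image: str) -> bool:
    """True if the reference has a digest or a tag other than 'latest'."""
    # A digest (name@sha256:...) pins the exact image content.
    if "@" in image:
        return True
    # Look for a tag only after the last '/', so a registry port such as
    # registry.example.com:5000/app is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag given: the reference resolves to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_images(images: list[str]) -> list[str]:
    """Return the image references that would violate the pinning rule."""
    return [image for image in images if not is_pinned(image)]


if __name__ == "__main__":
    # Hypothetical image references, not taken from any real manifest.
    candidates = [
        "registry.example.com/team/trainer:1.4.2",
        "registry.example.com:5000/team/trainer",
        "server:latest",
        "server@sha256:0123abcd",
    ]
    for image in unpinned_images(candidates):
        print("unpinned:", image)
```

In a GitOps flow a check like this would typically run as a pre-merge step over the images extracted from rendered manifests; extracting them is left out here because it depends on the chart and templating in use.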

Mental model

Four verbs over a layered stack. Deploy: specify and validate hardware, design the fabric, confirm facility readiness, bring up and sign off. Manage: administer the node software stack, provision and schedule, secure and isolate tenants. Run: train and serve workloads on top. Optimise: keep the fabric saturated, the GPUs fed, and cost-per-unit-work down, measured rather than guessed.

flowchart TB
  HW["Platform hardware: B300 / GB300 NVL72"] --> BC["Build & commission: BOM, facility, fabric, acceptance"]
  BC --> NS["Node & software admin: driver stack, provisioning, scheduling"]
  NS --> CP["Cluster platform: Kubernetes, storage, security"]
  CP --> WL["Workloads: distributed training, inference serving"]
  WL --> OPT["Operate & optimise: observability, RAS, tuning, SLOs"]
  OPT -.->|"continuous feedback"| NS

The coverage map that follows lists the stack bottom-up and shows where each page sits.

Coverage map

This map states current topic coverage, not a source-verification attestation. Page freshness is supplied by the git revision date plugin in CI; source claims still need page-level revalidation when vendor figures, APIs, or model releases change.

| Area | Pages |
| --- | --- |
| GPU hardware | generations, Blackwell, Hopper, Ampere, RTX/workstation, DGX/HGX, DGX Spark |
| Build & commission | BOM validation, vendor sourcing & procurement, datacentre readiness, networking fabric, fabric bring-up & benchmarking, commissioning |
| Node & platform | software stack, provisioning, Kubernetes for GPUs, storage, security |
| Workloads | distributed training, inference serving, serving open-weight models, DeepSeek-R1 cookbook, Kimi K2 cookbook, GLM-5.2 cookbook, GLM-4.7 cookbook, Qwen3 cookbook, Llama 4 cookbook, disaggregated inference |
| Operate & optimise | GPU performance, observability, RAS, performance tuning, SLO/SLI catalog, inference serving SLOs, training platform SLOs, cluster and fabric SLOs, burn-rate alerting rules |
| Performance engineering | roofline and arithmetic intensity, Nsight profiling workflow, CUDA occupancy tuning, memory coalescing, kernel fusion, NCCL collectives and algorithms, torch.compile, FlashAttention and MLA |
| Recipes & manifests | recipes index, Ansible bring-up, Kubernetes platform, telemetry, workload recipes, SRE/MLOps |
| Orchestration | overview, Kubernetes, k3s, Ray, Slurm |
| Training algorithms | FSDP, DDP, ZeRO, tensor parallelism, pipeline parallelism, DiLoCo |
| Post-training & RL | fine-tuning, SFT/LoRA, DPO, GRPO, RL library overview, verl, slime, SkyRL, OpenRLHF, NeMo-RL, TRL |
| Runbooks & reference | all 23 operational runbooks, including driver/module load failure, NVLink visibility failure, PCIe/P2P bandwidth regression, scheduler GPU job pending, training OOM, inference KV-cache OOM, NCCL hang, GPU fault and RMA; plus cloud cost, Glossary |

By role

Suggested reading paths

References

  • MkDocs: https://www.mkdocs.org/
  • Material for MkDocs: https://squidfunk.github.io/mkdocs-material/
  • NVIDIA DGX SuperPOD reference architecture: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
  • NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
  • Kubernetes documentation: https://kubernetes.io/docs/home/
  • PyTorch distributed overview: https://pytorch.org/docs/stable/distributed.html
  • vLLM documentation: https://docs.vllm.ai/en/latest/

Related: Home · Glossary · Operational runbooks