AI infrastructure knowledge base
Scope: the role-based reading-path map and coverage overview for the knowledge base. Reference index page.
A working wiki for the full lifecycle of GPU/HPC infrastructure (deploy, manage, run, optimise) across the roles it touches: sysadmin, GPU server engineer, platform engineer, and MLOps. It covers the full NVIDIA range (Ampere, Hopper, and Blackwell datacenter GPUs; RTX consumer and workstation cards; and DGX systems, including DGX Spark) and how they differ in operation, installation, and networking. The current focus is the Blackwell Ultra generation (B300 / GB300 NVL72). Current to mid-2026.
Published at ai-infrastructure.net by setloop.io, the company behind this knowledge base.
Pages come in four kinds. Concept pages explain a topic and its traps. Recipe pages carry reference Ansible playbooks, Helm/Kubernetes manifests, and monitoring stacks. Per-technology pages give each cluster technology, training algorithm, RL library, and runbook its own page with a fixed shape. Reference pages are the glossary and indexes. Mermaid architecture diagrams are embedded where a visual clarifies structure or flow.
How to use this
Concept pages follow the same shape:
- Overview: what the topic is and why it matters operationally.
- Core knowledge: the facts, structured.
- Don't-miss checklist: the things that bite if skipped.
- Failure modes: where it goes wrong in the field.
- Open questions & validation: what to confirm against current vendor docs or in a test environment before relying on it.
- References: primary sources; re-check them rather than trusting a cached copy.
Recipe and runbook pages are example-first: reference manifests, playbooks, and step-by-step procedures with the commands to apply and verify them.
Conventions
- Manifests, playbooks, and configs are reference templates drawn from upstream documentation. Adapt versions, names, and resource sizes to the target environment and validate before production use. Pin image and chart versions; do not run :latest in production.
- Commands assume a Linux control node with kubectl, helm, and ansible configured against the target cluster.
- GitOps is assumed: manifests live in a repo and are applied by Argo CD / Flux, not pasted by hand into a prod cluster (SRE and MLOps practices).
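The pinning convention above can be checked mechanically before a manifest reaches the GitOps repo. The following is a minimal sketch, not part of the knowledge base's own tooling: the function names (`is_pinned`, `unpinned_images`) and the `registry.example.com` image references are hypothetical, and the scan is a simple line-based match on `image:` keys rather than a full YAML parse.

```python
import re

# Matches "image: <ref>" lines in a manifest (line-based scan, not a YAML parser).
IMAGE_LINE = re.compile(r"^\s*-?\s*image:\s*[\"']?([^\s\"'#]+)")


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference has a digest or an explicit non-latest tag."""
    if "@sha256:" in image_ref:
        return True
    # Look for the tag only in the last path segment, so a registry port
    # such as "registry.example.com:5000" is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: the container runtime falls back to :latest
    tag = last_segment.rsplit(":", 1)[1]
    return tag != "" and tag != "latest"


def unpinned_images(manifest_text: str) -> list[str]:
    """Collect every image reference in the text that is not pinned."""
    found = []
    for line in manifest_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match and not is_pinned(match.group(1)):
            found.append(match.group(1))
    return found


# Hypothetical manifest fragment used only for illustration.
sample = """\
containers:
  - name: server
    image: registry.example.com/team/server:latest
  - name: trainer
    image: registry.example.com:5000/team/trainer
  - name: exporter
    image: registry.example.com/team/exporter:1.4.2
"""

print(unpinned_images(sample))
```

A check like this would typically run in CI against the manifests repo, failing the pipeline when the returned list is non-empty. It does not cover Helm chart versions or image references split across `repository` and `tag` keys; those need a separate check.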
Mental model
Four verbs over a layered stack. Deploy: specify and validate hardware, design the fabric, confirm facility readiness, bring up and sign off. Manage: administer the node software stack, provision and schedule, secure and isolate tenants. Run: train and serve workloads on top. Optimise: keep the fabric saturated, the GPUs fed, and cost-per-unit-work down, measured rather than guessed.
flowchart TB
HW["Platform hardware: B300 / GB300 NVL72"] --> BC["Build & commission: BOM, facility, fabric, acceptance"]
BC --> NS["Node & software admin: driver stack, provisioning, scheduling"]
NS --> CP["Cluster platform: Kubernetes, storage, security"]
CP --> WL["Workloads: distributed training, inference serving"]
WL --> OPT["Operate & optimise: observability, RAS, tuning, SLOs"]
OPT -.->|"continuous feedback"| NS
The stack, bottom-up, and where each page sits:
- GPU hardware, the full range and how their ops differ: GPU generations, Blackwell datacenter, Hopper, Ampere, RTX & workstation, DGX systems, DGX Spark.
- Build & commission: BOM validation, datacentre readiness, networking fabric, commissioning.
- Node & software admin: GPU software stack, provisioning & scheduling.
- Cluster platform: Kubernetes for GPUs, storage & data, security & multi-tenancy.
- Workloads: distributed training, inference serving, open-weight serving cookbooks.
- Operate & optimise: performance & health, observability, reliability & RAS, performance tuning.
- Strategy & reference: cloud & cost, troubleshooting runbook, Glossary.
- Recipes & manifests: recipes & manifests index, Ansible bring-up, Kubernetes platform, telemetry, workload recipes, SRE & MLOps practices, operational runbooks.
- Models & methods: serving open-weight models, disaggregated inference, fine-tuning & post-training, FSDP/DiLoCo recipes, SLO/SLI catalog.
- Orchestration & RL systems: orchestration overview, RL libraries.
- Per-technology deep-dives: clusters (Kubernetes, k3s, Ray, Slurm); training algorithms (FSDP, DDP, DeepSpeed/ZeRO, tensor parallelism, pipeline parallelism, DiLoCo); post-training (GRPO, DPO, SFT & LoRA); RL libraries (verl, slime, SkyRL, OpenRLHF, NeMo-RL, TRL).
Coverage map
This map states current topic coverage, not a source-verification attestation. Page freshness is supplied by the git revision date plugin in CI; source claims still need page-level revalidation when vendor figures, APIs, or model releases change.
By role
- Sysadmin / GPU server engineer: GPU software stack, Ansible bring-up, provisioning & scheduling, observability, reliability & RAS, troubleshooting runbook.
- Platform engineer: Kubernetes for GPUs, Kubernetes platform manifests, orchestration overview, storage & data, security & multi-tenancy, networking fabric, SRE & MLOps practices.
- MLOps: distributed training, inference serving, serving open-weight models, fine-tuning & post-training, RL libraries, FSDP/DiLoCo recipes, disaggregated inference, performance tuning.
- SRE: reliability & RAS, observability, telemetry & alerting, operational runbooks, troubleshooting runbook, SLO/SLI catalog, SRE & MLOps practices.
- Deployment / build: BOM validation, datacentre readiness, networking fabric, commissioning, Blackwell platform.
Suggested reading paths
- Choose a GPU or understand a generation: GPU generations (the comparison matrix) → the family page (Hopper, Ampere, Blackwell, RTX & workstation, or DGX Spark) → GPU software stack for the driver / MIG / Fabric Manager differences.
- Stand up a GPU cluster from bare metal: datacentre readiness → networking fabric → Ansible bring-up → Kubernetes platform → telemetry → commissioning → first validation job.
- Debug a slow training job: observability (read SM-active/MFU) → performance tuning (bottleneck hierarchy) → storage & data (dataloader) → distributed training (parallelism layout) → troubleshooting runbook.
- Diagnose a GPU fault: troubleshooting runbook (triage) → reliability & RAS (XID/ECC classification) → GPU software stack (driver/Fabric Manager).
- Serve a model: inference serving (engines & batching) → workload recipes (vLLM/KServe manifests) → telemetry (SLO monitoring) → performance tuning.
- Serve the latest open-weight model (DeepSeek/Kimi/GLM/Qwen/Llama): serving open-weight models (pick model + precision/parallelism) → the model cookbook (DeepSeek-R1, Kimi K2, GLM-5.2, GLM-4.7, Qwen3, Llama 4) → disaggregated inference (scale out) → SLO/SLI catalog (SLOs + burn-rate alerts).
- Fine-tune or post-train a model: fine-tuning & post-training (SFT → DPO → GRPO) → RL libraries (pick verl/slime/SkyRL) → FSDP/DiLoCo recipes (mechanics) → storage & data (checkpoints) → SRE & MLOps practices (eval gate + registry).
- Choose an orchestrator or RL stack: orchestration overview (k8s/k3s/Ray/Slurm decision) → Kubernetes platform (KubeRay) → RL libraries (colocated vs disaggregated).
- Run an operational procedure: operational runbooks (pick the runbook) → troubleshooting runbook (triage if mid-incident) → SLO/SLI catalog (confirm against SLOs).
- Run a multi-tenant platform: security & multi-tenancy (isolation) → Kubernetes for GPUs (sharing & scheduling) → Kubernetes platform (quotas/gang scheduling) → SRE & MLOps practices (golden paths).
References
- MkDocs: https://www.mkdocs.org/
- Material for MkDocs: https://squidfunk.github.io/mkdocs-material/
- NVIDIA DGX SuperPOD reference architecture: https://docs.nvidia.com/dgx-superpod/reference-architecture/scalable-infrastructure-b300-xdr/latest/dgx-superpod-components.html
- NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
- Kubernetes documentation: https://kubernetes.io/docs/home/
- PyTorch distributed overview: https://pytorch.org/docs/stable/distributed.html
- vLLM documentation: https://docs.vllm.ai/en/latest/
Related: Home · Glossary · Operational runbooks