Recipes & manifests (index)

Scope: the index of runnable recipes (Ansible playbooks, Kubernetes/Helm manifests, telemetry stacks, and workload bring-up cookbooks). Where operational runbooks are incident procedures, these are the build-and-operate recipes: example-first manifests and playbooks with the commands to apply and verify them.

Paradigm: every recipe is a reference template. Pin versions, adapt names and resource sizes, and validate before production. Each recipe gives the manifest or playbook, the command to apply it, and an explicit verification step.
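The apply-then-verify shape described above can be sketched as a small data structure plus a runner. This is a minimal illustration, not the layout the recipe pages themselves use: the `Recipe` class, the `run_recipe` helper, and the file name `dcgm-exporter.yaml` are hypothetical, and the example assumes `kubectl` is on the PATH and pointed at the intended cluster.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Recipe:
    """Hypothetical container for the parts every recipe is said to provide."""
    name: str
    apply_cmd: Sequence[str]   # command that applies the manifest or playbook
    verify_cmd: Sequence[str]  # explicit verification step


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def run_recipe(recipe: Recipe,
               run: Optional[Callable[[Sequence[str]], int]] = None) -> bool:
    """Apply the recipe, then verify it; True only if both steps exit 0."""
    runner = run if run is not None else _run
    if runner(recipe.apply_cmd) != 0:
        return False  # do not verify something that failed to apply
    return runner(recipe.verify_cmd) == 0


# Illustrative values only: the file and DaemonSet names are assumptions.
example = Recipe(
    name="dcgm-exporter",
    apply_cmd=["kubectl", "apply", "-f", "dcgm-exporter.yaml"],
    verify_cmd=["kubectl", "rollout", "status",
                "daemonset/dcgm-exporter", "--timeout=120s"],
)
```

Calling `run_recipe(example)` would execute both commands in order. The rollout check used here only shows that the pods became ready; a recipe's own verification step would normally be a stronger proof, substituted in `verify_cmd`.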

flowchart LR
  BM["Bare metal"] --> ANS["Ansible: node & fabric"]
  ANS --> K8S["Kubernetes & Helm GPU platform"]
  K8S --> TEL["Telemetry & alerting"]
  TEL --> WL["Workload bring-up"]
  WL --> SRE["SRE / Platform / MLOps practices"]

Find a recipe by task

Standing up nodes & fabric from bare metal? → Ansible: node & fabric bring-up
Making Kubernetes GPU-aware? → Kubernetes & Helm GPU platform
Need observability? → Telemetry, monitoring & alerting
Launching training or inference workloads? → Workload & bring-up recipes, then the open-weight model cookbooks
Operating it like an SRE/MLOps team? → SRE, Platform & MLOps practices
Validating the network fabric? → Fabric bring-up, validation & benchmarking

Recipe catalogue

Area	Recipe	Page
Provisioning	Ansible roles for node prep, driver/CUDA stack, RDMA fabric, MIG, health validation	Ansible: node & fabric bring-up
Cluster platform	GPU Operator, Network Operator, sharing models, DRA, gang scheduling & quota	Kubernetes & Helm GPU platform
Observability	DCGM exporter, Prometheus, Grafana, burn-rate alerts	Telemetry, monitoring & alerting
Workloads	nccl-tests fabric proof, gang-scheduled training, vLLM serving, model-specific inference cookbooks, end-to-end bring-up	Workload & bring-up recipes, open-weight serving
Fabric	DOCA/OFED install, OpenSM, perftest, nccl-tests, GPUDirect verification	Fabric bring-up, validation & benchmarking
Practices	SLOs, GitOps, policy-as-code, model lifecycle, FinOps	SRE, Platform & MLOps practices

Full recipe and manifest inventory

The hubs above group the work; below is every individual recipe, manifest, chart, role, and playbook page so each is directly reachable.

Ansible roles, playbooks and services

Helm charts for the GPU platform

NVIDIA GPU Operator Helm install · NVIDIA Network Operator Helm install · DRA driver Helm install · Volcano gang-scheduler Helm install · Kueue quota Helm install

Kubernetes manifests

GPU Operator ClusterPolicy · NicClusterPolicy for RDMA · GPU time-slicing config · MIG mode config · DRA ResourceClaim · DCGM exporter · Volcano gang job · Kueue ClusterQueue

Workload recipes

Fabric validation with nccl-tests · gang-scheduled training · vLLM inference deployment · FSDP single-datacenter training · DiLoCo geo-distributed training · memory-efficient GRPO post-training · end-to-end workload bring-up · GPU platform smoke tests

Where the technology cookbooks live

Each cluster, training-algorithm, inference, and RL-library page also carries its own Cookbook (common use cases) section with worked code:

Cluster: Kubernetes · k3s · Ray · Slurm.
Training: FSDP · DDP · DeepSpeed/ZeRO · tensor parallelism · pipeline parallelism · DiLoCo.
Inference: Small models on consumer GPUs · DeepSeek-R1 · DeepSeek-V3.2-Exp · Kimi K2 · MiniMax-M2 · GLM-5.2 · GLM-4.7-FP8 · Qwen3-235B · Llama 4 Maverick.
Post-training: GRPO · GRPO variants & tricks · RL scaling laws · DPO · SFT/LoRA.
Evaluation: LLM benchmarks (anatomy & metrics) · evaluation harness & eval gate · evaluation integrity · evaluating agents.
RL libraries: verl · slime · SkyRL · OpenRLHF · NeMo-RL · TRL.

Don't-miss checklist

Treat every manifest as a reference template: pin chart/image versions, adapt names and sizes, validate before production.
Apply changes behind GitOps so each has a one-step revert (SRE and MLOps practices).
Verify with a real proof (dcgmi diag, nccl-tests, a smoke request), not "it applied cleanly".
Keep recipes and runbooks paired: build with a recipe, recover with a runbook.
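The first checklist item, pinning image versions, is easy to check mechanically before a manifest is applied. The sketch below is one possible pre-apply check, not a tool the recipes ship: `is_pinned`, `unpinned_images`, and the set of tags treated as floating are all assumptions. It scans manifest text line by line for `image:` keys rather than parsing YAML, so it is a heuristic only.

```python
from typing import List

# Tags treated as "floating" here are an assumption; extend to local policy.
FLOATING_TAGS = frozenset({"latest", "stable", "main", "master"})


def is_pinned(image: str) -> bool:
    """True if the image reference has a digest or a non-floating tag."""
    if "@sha256:" in image:
        return True
    # Look for the tag only in the last path segment, so a registry port
    # such as "registry.example.com:5000/app" is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag at all resolves to "latest" by default
    tag = last_segment.rsplit(":", 1)[1]
    return bool(tag) and tag not in FLOATING_TAGS


def unpinned_images(manifest_text: str) -> List[str]:
    """Return image references in `image:` lines that are not pinned."""
    found = []
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            stripped = stripped[2:].lstrip()  # list item such as "- image: x"
        if not stripped.startswith("image:"):
            continue
        value = stripped[len("image:"):]
        value = value.split(" #", 1)[0].strip().strip("\"'")  # drop comments
        if value and not is_pinned(value):
            found.append(value)
    return found
```

A non-empty result from `unpinned_images(open(path).read())` would be a reason to stop before applying. Chart versions need a separate check, since they are set on the Helm command line or in a values file rather than in `image:` fields.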

References

NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
NVIDIA Network Operator: https://docs.nvidia.com/networking/display/cokan10/network+operator
Helm: https://helm.sh/docs/
Ansible: https://docs.ansible.com/