Recipes & manifests (index)
Scope: the index of runnable recipes (Ansible playbooks, Kubernetes/Helm manifests, telemetry stacks, and workload bring-up cookbooks). Where operational runbooks are incident procedures, these are the build-and-operate recipes: example-first manifests and playbooks with the commands to apply and verify them.
Paradigm: every recipe is a reference template: pin versions, adapt names and resource sizes, validate before production. Each gives the manifest or playbook, the command to apply it, and an explicit verification step.
flowchart LR
BM["Bare metal"] --> ANS["Ansible: node & fabric"]
ANS --> K8S["Kubernetes & Helm GPU platform"]
K8S --> TEL["Telemetry & alerting"]
TEL --> WL["Workload bring-up"]
WL --> SRE["SRE / Platform / MLOps practices"]
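The recipe shape described above (something to apply, the command that applies it, and an explicit verification step) can be sketched as a small runner. This is a minimal illustration, not how the recipe pages are actually driven: the `Recipe` record, the `run_recipe` helper, and the placeholder commands are all hypothetical names introduced here.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe record: what to apply and how to prove it worked."""
    name: str
    apply_cmd: list[str]
    verify_cmd: list[str]


def run_recipe(recipe: Recipe) -> bool:
    """Run the apply step, then the verification step.

    Returns True only if both commands exit with status 0, so a clean
    apply followed by a failed verification still counts as a failure.
    """
    applied = subprocess.run(recipe.apply_cmd, check=False)
    if applied.returncode != 0:
        return False
    verified = subprocess.run(recipe.verify_cmd, check=False)
    return verified.returncode == 0


# Placeholder commands so the sketch runs anywhere (assumption: in a real
# recipe these would be the page's own apply and verification commands).
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]

# An apply that succeeds but a verification that fails is reported as failed.
demo = Recipe(name="demo", apply_cmd=ok, verify_cmd=bad)
```

Keeping the verification command in the same record as the apply command makes it hard to ship a recipe that only proves "it applied cleanly".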
Find a recipe by task
- Standing up nodes & fabric from bare metal? → Ansible: node & fabric bring-up
- Making Kubernetes GPU-aware? → Kubernetes & Helm GPU platform
- Need observability? → Telemetry, monitoring & alerting
- Launching training or inference workloads? → Workload & bring-up recipes, then the open-weight model cookbooks
- Operating it like an SRE/MLOps team? → SRE, Platform & MLOps practices
- Validating the network fabric? → Fabric bring-up, validation & benchmarking
Recipe catalogue
| Area | Recipe | Page |
|---|---|---|
| Provisioning | Ansible roles for node prep, driver/CUDA stack, RDMA fabric, MIG, health validation | Ansible: node & fabric bring-up |
| Cluster platform | GPU Operator, Network Operator, sharing models, DRA, gang scheduling & quota | Kubernetes & Helm GPU platform |
| Observability | DCGM exporter, Prometheus, Grafana, burn-rate alerts | Telemetry, monitoring & alerting |
| Workloads | nccl-tests fabric proof, gang-scheduled training, vLLM serving, model-specific inference cookbooks, end-to-end bring-up | Workload & bring-up recipes, open-weight serving |
| Fabric | DOCA/OFED install, OpenSM, perftest, nccl-tests, GPUDirect verification | Fabric bring-up, validation & benchmarking |
| Practices | SLOs, GitOps, policy-as-code, model lifecycle, FinOps | SRE, Platform & MLOps practices |
Full recipe and manifest inventory
The hubs above group the work; below is every individual recipe, manifest, chart, role, and playbook page so each is directly reachable.
Ansible roles, playbooks and services
- base node and OS tuning role · NVIDIA driver and CUDA stack role · RDMA and InfiniBand fabric role · MIG configuration role · health validation role
- site bring-up playbook · PCIe ACS-disable systemd service
Helm charts for the GPU platform
- NVIDIA GPU Operator Helm install · NVIDIA Network Operator Helm install · DRA driver Helm install · Volcano gang-scheduler Helm install · Kueue quota Helm install
Kubernetes manifests
- GPU Operator ClusterPolicy · NicClusterPolicy for RDMA · GPU time-slicing config · MIG mode config · DRA ResourceClaim · DCGM exporter · Volcano gang job · Kueue ClusterQueue
Workload recipes
- fabric validation with nccl-tests · gang-scheduled training · vLLM inference deployment · FSDP single-datacenter training · DiLoCo geo-distributed training · memory-efficient GRPO post-training · end-to-end workload bring-up · GPU platform smoke tests
Where the technology cookbooks live
Each cluster, training-algorithm, inference, and RL-library page also carries its own Cookbook (common use cases) section with worked code:
- Cluster: Kubernetes · k3s · Ray · Slurm.
- Training: FSDP · DDP · DeepSpeed/ZeRO · tensor parallelism · pipeline parallelism · DiLoCo.
- Inference: Small models on consumer GPUs · DeepSeek-R1 · DeepSeek-V3.2-Exp · Kimi K2 · MiniMax-M2 · GLM-5.2 · GLM-4.7-FP8 · Qwen3-235B · Llama 4 Maverick.
- Post-training: GRPO · GRPO variants & tricks · RL scaling laws · DPO · SFT/LoRA.
- Evaluation: LLM benchmarks (anatomy & metrics) · evaluation harness & eval gate · evaluation integrity · evaluating agents.
- RL libraries: verl · slime · SkyRL · OpenRLHF · NeMo-RL · TRL.
Don't-miss checklist
- Treat every manifest as a reference template: pin chart/image versions, adapt names and sizes, validate before production.
- Apply changes through GitOps so each change has a one-step revert (see SRE, Platform & MLOps practices).
- Verify with a real proof (`dcgmi diag`, `nccl-tests`, a smoke request), not "it applied cleanly".
- Keep recipes and runbooks paired: build with a recipe, recover with a runbook.
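The "pin chart/image versions" item in the checklist above can be partly automated before a manifest reaches production. The sketch below is one possible pre-merge check, assuming plain `image:` lines in YAML text. It covers container images only, not Helm chart versions, and the `unpinned_images` helper and the sample image names are hypothetical.

```python
import re

# Matches lines such as "image: repo/name:tag" or "- image: repo/name".
IMAGE_LINE = re.compile(
    r"^[ \t]*(?:-[ \t]*)?image:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE
)


def unpinned_images(manifest_text: str) -> list[str]:
    """Return image references that have no tag or use ':latest'.

    References pinned by digest ('@sha256:...') are treated as pinned.
    """
    flagged = []
    for ref in IMAGE_LINE.findall(manifest_text):
        if "@sha256:" in ref:
            continue  # pinned by digest
        last_segment = ref.rsplit("/", 1)[-1]
        # A ':' in the last path segment is a tag; an earlier one is a registry port.
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            flagged.append(ref)
    return flagged


# Hypothetical manifest fragment used only to exercise the check.
sample = """
containers:
  - name: a
    image: registry.example.com:5000/team/app
  - name: b
    image: "example.com/team/worker:latest"
  - name: c
    image: example.com/team/server:1.2.3
  - name: d
    image: example.com/team/tool@sha256:abc
"""

for ref in unpinned_images(sample):
    print(f"unpinned image: {ref}")
```

A text-level check like this is deliberately crude: it does not render Helm templates or parse YAML, so it complements the real proof steps in the checklist and does not replace them.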
References
- NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
- NVIDIA Network Operator: https://docs.nvidia.com/networking/display/cokan10/network+operator
- Helm: https://helm.sh/docs/
- Ansible: https://docs.ansible.com/
Related: Operational runbooks · Ansible bring-up · Kubernetes & Helm · Telemetry · Workload recipes · Practices · Glossary