Recipes & manifests (index)

Scope: this page indexes the runnable recipes (Ansible playbooks, Kubernetes/Helm manifests, telemetry stacks, and workload bring-up cookbooks). Operational runbooks are incident procedures; the recipes here cover build and operation, as example-first manifests and playbooks with the commands to apply and verify them.

Paradigm: every recipe is a reference template. Pin versions, adapt names and resource sizes, and validate before production. Each recipe gives the manifest or playbook, the command to apply it, and an explicit verification step.
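
The three-part shape described above (artifact, apply command, verification step) can be sketched as a small record plus a runner. This is a minimal illustration, not how the recipe pages are implemented: the `Recipe` class, `run_recipe`, the file name `example-serving.yaml`, and the health URL are all hypothetical names introduced here.

```python
from dataclasses import dataclass
import subprocess


@dataclass(frozen=True)
class Recipe:
    """One recipe: the artifact, how to apply it, how to prove it works."""
    name: str
    artifact: str                 # path to the manifest or playbook
    apply_cmd: tuple[str, ...]    # command that applies the artifact
    verify_cmd: tuple[str, ...]   # explicit verification step


def run_recipe(recipe: Recipe, runner=subprocess.run) -> bool:
    """Apply the recipe, then verify it. True only if both steps succeed."""
    applied = runner(list(recipe.apply_cmd), check=False)
    if applied.returncode != 0:
        return False  # apply failed, so there is nothing to verify
    verified = runner(list(recipe.verify_cmd), check=False)
    # A clean apply is not enough: success is decided by the verify step.
    return verified.returncode == 0


# Hypothetical example: file name and URL are placeholders, not real endpoints.
EXAMPLE = Recipe(
    name="example-serving",
    artifact="example-serving.yaml",
    apply_cmd=("kubectl", "apply", "-f", "example-serving.yaml"),
    verify_cmd=("curl", "--fail", "--silent", "http://example.internal/health"),
)
```

Calling `run_recipe(EXAMPLE)` would shell out to `kubectl` and `curl`. The `runner` parameter exists so the same logic can be exercised with a stub before it touches a cluster.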

flowchart LR
  BM["Bare metal"] --> ANS["Ansible: node & fabric"]
  ANS --> K8S["Kubernetes & Helm GPU platform"]
  K8S --> TEL["Telemetry & alerting"]
  TEL --> WL["Workload bring-up"]
  WL --> SRE["SRE / Platform / MLOps practices"]

Find a recipe by task

Recipe catalogue

| Area | Recipe | Page |
| --- | --- | --- |
| Provisioning | Ansible roles for node prep, driver/CUDA stack, RDMA fabric, MIG, health validation | Ansible: node & fabric bring-up |
| Cluster platform | GPU Operator, Network Operator, sharing models, DRA, gang scheduling & quota | Kubernetes & Helm GPU platform |
| Observability | DCGM exporter, Prometheus, Grafana, burn-rate alerts | Telemetry, monitoring & alerting |
| Workloads | nccl-tests fabric proof, gang-scheduled training, vLLM serving, model-specific inference cookbooks, end-to-end bring-up | Workload & bring-up recipes, open-weight serving |
| Fabric | DOCA/OFED install, OpenSM, perftest, nccl-tests, GPUDirect verification | Fabric bring-up, validation & benchmarking |
| Practices | SLOs, GitOps, policy-as-code, model lifecycle, FinOps | SRE, Platform & MLOps practices |

Full recipe and manifest inventory

The hubs above group the work; below is every individual recipe, manifest, chart, role, and playbook page so each is directly reachable.

Ansible roles, playbooks and services

Helm charts for the GPU platform

Kubernetes manifests

Workload recipes

Where the technology cookbooks live

Each cluster, training-algorithm, inference, and RL-library page also carries its own Cookbook (common use cases) section with worked code.

Don't-miss checklist

  • Treat every manifest as a reference template: pin chart/image versions, adapt names and sizes, validate before production.
  • Apply changes behind GitOps so each has a one-step revert (SRE and MLOps practices).
  • Verify with a real proof (dcgmi diag, nccl-tests, a smoke request), not "it applied cleanly".
  • Keep recipes and runbooks paired: build with a recipe, recover with a runbook.
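
The "pin chart/image versions" item lends itself to an automated pre-merge check. The sketch below is one possible approach, not part of any recipe on this site: it scans manifest text for `image:` lines and flags references with no tag or a `latest` tag. The function name `unpinned_images` and the sample registry names are hypothetical, and the line-based regex is a simplification that assumes plain `image:` keys rather than full YAML parsing. Chart versions are not covered.

```python
import re

# Matches lines such as "image: repo/name:tag" or "- image: 'repo/name'".
IMAGE_LINE = re.compile(
    r"^[ \t]*(?:-[ \t]*)?image:[ \t]*[\"']?([^\s\"'#]+)", re.MULTILINE
)


def unpinned_images(manifest_text: str) -> list[str]:
    """Return image references that carry no tag, or the 'latest' tag."""
    flagged = []
    for ref in IMAGE_LINE.findall(manifest_text):
        if "@sha256:" in ref:
            continue  # pinned by digest
        # Look for the tag only in the last path segment, so that a
        # registry port (registry.example:5000/app) is not read as a tag.
        last_segment = ref.rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            flagged.append(ref)
    return flagged


SAMPLE = """
containers:
  - name: a
    image: registry.example/app:1.2.3
  - name: b
    image: registry.example/app:latest
  - name: c
    image: registry.example:5000/app
  - name: d
    image: "registry.example/app@sha256:abc"
"""

print(unpinned_images(SAMPLE))
# prints ['registry.example/app:latest', 'registry.example:5000/app']
```

A check like this only catches missing pins. It says nothing about whether the pinned version works, so it complements the verification item in the checklist rather than replacing it.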

References

  • NVIDIA GPU Operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
  • NVIDIA Network Operator: https://docs.nvidia.com/networking/display/cokan10/network+operator
  • Helm: https://helm.sh/docs/
  • Ansible: https://docs.ansible.com/

Related: Operational runbooks · Ansible bring-up · Kubernetes & Helm · Telemetry · Workload recipes · Practices · Glossary