Workload & bring-up recipes¶
Scope: index and decision overview for runnable GPU-cluster workloads (fabric validation, distributed training, inference serving) and the order they run in. The detailed, copy-pasteable manifests now live in the focused child pages below; this page frames which recipe to run when and ties them into the end-to-end bring-up. It is the applied counterpart to distributed training and inference serving.
Reference templates. Pin images, set model/secret/storage names, and validate on a small scale first.
flowchart LR
FABRIC["Fabric validation"] --> TRAIN["Distributed training job"]
TRAIN --> SERVE["Inference deployment"]
SERVE --> ACCEPT["Acceptance evidence"]
Overview¶
Three workloads prove a cluster works, in order: a collective benchmark proves the fabric (networking fabric), a distributed training job proves scheduling + storage + comms (distributed training), and an inference deployment proves the serving path (inference serving). Each is the acceptance test for the layer beneath it.
Focused pages¶
This page is the index; the full manifests and step-by-step procedures live in these children:
- Recipe: fabric validation with nccl-tests: use this when you need the MPIJob to run all_reduce_perf and read bus bandwidth against the topology.
- Recipe: gang-scheduled training: use this when you need the Volcano gang-scheduled torchrun job with FSDP and a shared-checkpoint PVC.
- Recipe: vLLM inference deployment: use this when you need the OpenAI-compatible vLLM Deployment + Service + HPA and its smoke test.
- Playbook: end-to-end workload bring-up: use this when you are driving a fresh cluster from facility-ready to first workload in order, with proof at each step.
1. Fabric validation: nccl-tests (MPIJob)¶
Run all_reduce_perf across nodes and read bus bandwidth against the topology expectation (GPU performance and health) via the MPI Operator (kubeflow.org/v2beta1). Pass criterion: busbw approaches line rate for the message size; NCCL_DEBUG=INFO shows [GDRDMA], not TCP (performance tuning). Full MPIJob manifest and tuning: Recipe: fabric validation with nccl-tests.
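To make the pass criterion concrete, the sketch below reproduces the bus-bandwidth arithmetic that nccl-tests documents for all-reduce: algorithm bandwidth is bytes over time, and bus bandwidth scales it by 2(n-1)/n. The function names and the comparison against a single NIC's line rate are illustrative assumptions, not part of nccl-tests itself; the right line-rate reference depends on the topology.

```python
def allreduce_busbw_gbps(message_bytes: int, elapsed_s: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s, 1e9 bytes) for one all-reduce.

    Follows the nccl-tests convention: algbw = size / time,
    busbw = algbw * 2 * (n - 1) / n.
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    algbw = message_bytes / elapsed_s / 1e9
    return algbw * 2 * (n_ranks - 1) / n_ranks


def fraction_of_line_rate(busbw_gbps: float, link_gbit_per_s: float) -> float:
    """Ratio of measured busbw to the link's line rate.

    Assumption: the bottleneck is one link of `link_gbit_per_s` gigabits
    per second; divide by 8 to compare in gigabytes per second.
    """
    return busbw_gbps / (link_gbit_per_s / 8.0)
```

Small messages are latency-bound, so a low ratio there is expected; judge the ratio at the large message sizes in the sweep.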
2. Distributed training: gang-scheduled torchrun (Volcano Job)¶
Use gang scheduling so that all workers start together (Kubernetes for GPUs), and a shared RWX PVC for sharded checkpoints (storage and data). Watch MFU (model FLOPs utilization) and step time in Grafana (telemetry and monitoring); if MFU < 35%, profile (performance tuning). Full Volcano Job manifest: Recipe: gang-scheduled training.
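For a rough check on the MFU threshold, the sketch below estimates MFU from step time using the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass. This is an assumption: it ignores attention FLOPs and activation recomputation, and the peak FLOPs figure must come from the vendor datasheet for the precision in use. All function and parameter names are hypothetical.

```python
def estimate_mfu(
    n_params: float,
    tokens_per_step: float,
    step_time_s: float,
    n_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Approximate model FLOPs utilization as a fraction in [0, 1].

    Uses the 6 * N * T approximation for training FLOPs per step
    (N parameters, T tokens); attention FLOPs are not counted.
    """
    if step_time_s <= 0 or n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("step time, GPU count and peak FLOPs must be positive")
    achieved_flops_per_s = 6.0 * n_params * tokens_per_step / step_time_s
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)


def needs_profiling(mfu: float, threshold: float = 0.35) -> bool:
    """True when MFU falls below the page's 35% trigger for profiling."""
    return mfu < threshold
```

Because the approximation undercounts attention work, treat the result as a trend indicator to compare runs on the same model, not as an absolute figure.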
3. Inference: vLLM (Deployment + Service + HPA)¶
OpenAI-compatible endpoint, tensor-parallel across GPUs, fronted by an HPA on vllm:num_requests_waiting. Smoke-test with curl .../v1/completions and track TTFT/TPOT against the SLO (inference serving). For multi-node/disaggregated serving, front this with Dynamo; for managed serving, wrap as a KServe InferenceService. Full Deployment + HPA manifest and smoke test: Recipe: vLLM inference deployment.
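Tracking TTFT (time to first token) and TPOT (time per output token) against an SLO comes down to simple arithmetic on token arrival times. The sketch below assumes you already have those timestamps from a streaming client; the class and function names are hypothetical, and TPOT is taken here as the mean gap between output tokens after the first.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencyStats:
    ttft_s: float  # time from request send to first token
    tpot_s: float  # mean time per output token after the first


def latency_from_timestamps(
    request_sent_s: float, token_times_s: Sequence[float]
) -> LatencyStats:
    """Derive TTFT and TPOT from a request time and token arrival times."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    n = len(token_times_s)
    # With a single token there is no inter-token gap to average.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return LatencyStats(ttft_s=ttft, tpot_s=tpot)


def meets_slo(stats: LatencyStats, ttft_slo_s: float, tpot_slo_s: float) -> bool:
    """True when both latencies are within their SLO targets."""
    return stats.ttft_s <= ttft_slo_s and stats.tpot_s <= tpot_slo_s
```

In practice these are aggregated as percentiles over many requests; a single request only confirms the endpoint responds.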
4. End-to-end bring-up runbook (ordered)¶
Summary below; the full ordered procedure with per-step commands lives in the Playbook: end-to-end workload bring-up.
| # | Step | How | Proof |
|---|---|---|---|
| 1 | Facility ready | datacentre readiness | power/cooling/weight signed off |
| 2 | Fabric up | networking fabric | ibdiagnet clean, SM converged |
| 3 | Nodes provisioned | Ansible (Ansible bring-up) | dcgmi diag -r 3 pass, ibstat Active |
| 4 | K8s GPU stack | Helm (the Kubernetes platform) | cuda-smoke pod prints GPU table |
| 5 | Telemetry | telemetry and monitoring | DCGM metrics in Grafana, alerts firing on test |
| 6 | Fabric proof | nccl-tests above | busbw near line rate, GDR engaged |
| 7 | Acceptance | commissioning | recorded thresholds met |
| 8 | First workload | training/inference above | MFU / TTFT within target |
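The ordering in the table can be enforced mechanically. The sketch below is a minimal gate runner, assuming each step's proof can be wrapped in a callable that returns True or False; the names are hypothetical and the checks themselves (running dcgmi, ibdiagnet, and so on) are left to the caller.

```python
from typing import Callable, List, Optional, Tuple

# A gate is a step name plus a zero-argument check that returns True on pass.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> Tuple[List[str], Optional[str]]:
    """Run gates in order and stop at the first failure.

    Returns the names of the gates that passed and the name of the
    first failing gate (None if every gate passed). Later gates are
    never executed once one fails, so a step cannot be "proven" on
    top of an unproven layer.
    """
    passed: List[str] = []
    for name, check in gates:
        if not check():
            return passed, name
        passed.append(name)
    return passed, None
```

A caller would build the list in table order, for example `[("Fabric up", check_fabric), ("Nodes provisioned", check_nodes)]`, where `check_fabric` and `check_nodes` are placeholders for site-specific checks.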
Don't-miss checklist¶
- Validate the fabric with nccl-tests before trusting a training benchmark (commissioning).
- Gang-schedule distributed jobs; checkpoint to a parallel-FS RWX volume (storage and data).
- Request the RDMA resource in pod specs, or NCCL drops to TCP (performance tuning).
- Put an SLO and an autoscaler on inference; size KV cache to memory (inference serving).
- Drive the bring-up in order; each step is the prior step's acceptance test.
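For the autoscaler item in the checklist, it helps to know how the Kubernetes HPA turns a metric into a replica count. The sketch below mirrors the documented formula, desired = ceil(current * metric / target), with the default 10% tolerance band. The min/max defaults and the idea of feeding it an average of vllm:num_requests_waiting per pod are illustrative assumptions; stabilization windows and scaling policies are not modelled.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Replica count the HPA formula would request for one metric."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Within tolerance: the HPA leaves the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

Working through a few queue depths with this function is a quick way to sanity-check a target value before putting it in the HPA spec.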
Failure modes¶
- Training launched before fabric validation; a degraded link misread as a model/GPU problem (reliability and RAS).
- Distributed job partially placed (no gang scheduler): the scheduled workers hold GPUs idle while waiting for peers that never start, and the job deadlocks.
- vLLM OOM from --max-model-len / batch too large for KV cache.
- Checkpoints to a single-writer volume, throttling all ranks (storage and data).
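The KV-cache failure mode above can be anticipated with back-of-envelope arithmetic. The sketch below uses the standard per-token size for a transformer KV cache (keys plus values, per layer, per KV head) and checks a worst case in which every concurrent sequence reaches the maximum length. The function names are hypothetical, the memory budget is whatever remains after weights and activations, and the result does not reproduce vLLM's own block allocator or account for tensor-parallel sharding.

```python
def kv_cache_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 covers keys and values; dtype_bytes=2 assumes FP16/BF16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left over for the KV cache."""
    return kv_budget_bytes // bytes_per_token


def fits_worst_case(
    max_model_len: int,
    concurrent_seqs: int,
    kv_budget_bytes: int,
    bytes_per_token: int,
) -> bool:
    """True if all sequences could reach max length without eviction."""
    needed = max_model_len * concurrent_seqs
    return needed <= max_cached_tokens(kv_budget_bytes, bytes_per_token)
```

This is deliberately conservative: a paged KV cache can serve more sequences than the worst case suggests when most requests are short, at the cost of preemption under load.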
Open questions & validation¶
- Confirm operator versions: MPI Operator (v2beta1), Volcano pytorch plugin, KServe API for the cluster.
- Validate the RDMA resource name (rdma/...) emitted by the Network Operator (the Kubernetes platform).
- Benchmark a real model end-to-end and record MFU / TTFT as the acceptance baseline (distributed training, inference serving).
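One way to validate the RDMA resource name is to read it from what the nodes actually advertise. The sketch below assumes the parsed JSON output of `kubectl get nodes -o json` and filters each node's `status.allocatable` map for keys under a prefix. The function names are hypothetical, and the `rdma/` default prefix is taken from this page; the actual resource name depends on the cluster's configuration.

```python
from typing import Dict, List


def rdma_resources(node: dict, prefix: str = "rdma/") -> Dict[str, str]:
    """Allocatable resources on one node whose name starts with prefix."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {k: v for k, v in allocatable.items() if k.startswith(prefix)}


def nodes_missing_rdma(node_list: dict, prefix: str = "rdma/") -> List[str]:
    """Names of nodes that advertise no resource under the prefix.

    `node_list` is the parsed JSON of a Kubernetes NodeList.
    """
    missing: List[str] = []
    for node in node_list.get("items", []):
        if not rdma_resources(node, prefix):
            missing.append(node.get("metadata", {}).get("name", "<unnamed>"))
    return missing
```

The keys returned by `rdma_resources` are the names to use under `resources.limits` in pod specs; an empty result on a GPU node means the pod would have no RDMA device to request.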
References¶
- nccl-tests: https://github.com/NVIDIA/nccl-tests
- Kubeflow MPI Operator: https://github.com/kubeflow/mpi-operator
- Kubeflow Training Operator (PyTorchJob): https://www.kubeflow.org/docs/components/trainer/
- Volcano: https://volcano.sh/en/docs/
- vLLM serving: https://docs.vllm.ai/en/latest/
- KServe: https://kserve.github.io/website/
Related: Fabric · Commissioning · Storage · Training · Inference · K8s Platform · Telemetry · Glossary