Workload & bring-up recipes¶

Scope: index and decision overview for runnable GPU-cluster workloads (fabric validation, distributed training, inference serving) and the order they run in. The detailed, copy-pasteable manifests live in the focused child pages below; this page explains which recipe to run when and ties the recipes into the end-to-end bring-up. It is the applied counterpart to distributed training and inference serving.

Reference templates. Pin images, set model/secret/storage names, and validate on a small scale first.

flowchart LR
  FABRIC["Fabric validation"] --> TRAIN["Distributed training job"]
  TRAIN --> SERVE["Inference deployment"]
  SERVE --> ACCEPT["Acceptance evidence"]

Overview¶

Three workloads prove a cluster works, in order: a collective benchmark proves the fabric (networking fabric), a distributed training job proves scheduling + storage + comms (distributed training), and an inference deployment proves the serving path (inference serving). Each is the acceptance test for the layer beneath it.

Focused pages¶

This page is the index; the full manifests and step-by-step procedures live in these children:

Recipe: fabric validation with nccl-tests: use this when you need the MPIJob to run all_reduce_perf and read bus bandwidth against the topology.
Recipe: gang-scheduled training: use this when you need the Volcano gang-scheduled torchrun job with FSDP and a shared-checkpoint PVC.
Recipe: vLLM inference deployment: use this when you need the OpenAI-compatible vLLM Deployment + Service + HPA and its smoke test.
Playbook: end-to-end workload bring-up: use this when you are driving a fresh cluster from facility-ready to first workload in order, with proof at each step.

1. Fabric validation: nccl-tests (MPIJob)¶

Run all_reduce_perf across nodes and read bus bandwidth against the topology expectation (GPU performance and health) via the MPI Operator (kubeflow.org/v2beta1). Pass criterion: busbw approaches line rate for the message size; NCCL_DEBUG=INFO shows [GDRDMA], not TCP (performance tuning). Full MPIJob manifest and tuning: Recipe: fabric validation with nccl-tests.
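The pass criterion above can be sketched as a small check. This is a minimal illustration, not the recipe's tooling: the bus-bandwidth factor 2(n-1)/n for all-reduce follows the nccl-tests convention, while the 0.9 threshold, the function names, and the log substrings matched (`GDRDMA`, `NET/Socket`) are assumptions to adjust for your fabric and NCCL version.

```python
def allreduce_busbw(algbw_gbs: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth (GB/s) to bus bandwidth (GB/s).

    nccl-tests defines busbw = algbw * 2 * (n - 1) / n for all-reduce.
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least 2 ranks")
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks


def fabric_passes(busbw_gbs: float, line_rate_gbs: float,
                  nccl_log: str, min_fraction: float = 0.9) -> bool:
    """Return True if busbw is near line rate and the log shows GPUDirect RDMA.

    min_fraction is an assumed threshold; use the value recorded at commissioning.
    The substrings below are assumptions about NCCL_DEBUG=INFO output.
    """
    uses_gdr = "GDRDMA" in nccl_log
    fell_back_to_tcp = "NET/Socket" in nccl_log
    near_line_rate = busbw_gbs >= min_fraction * line_rate_gbs
    return near_line_rate and uses_gdr and not fell_back_to_tcp


if __name__ == "__main__":
    # Illustrative numbers only: 8 ranks, 100 GB/s algorithm bandwidth.
    print(allreduce_busbw(100.0, 8))
```

Keep both arguments in the same unit: nccl-tests reports GB/s, so a link quoted in Gb/s must be divided by 8 before comparison.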

2. Distributed training: gang-scheduled torchrun (Volcano Job)¶

Use gang scheduling so that all workers start together (Kubernetes for GPUs), and a shared RWX PVC for sharded checkpoints (storage and data). Watch MFU and step time in Grafana (telemetry and monitoring); if MFU < 35%, profile (performance tuning). Full Volcano Job manifest: Recipe: gang-scheduled training.
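To make the MFU threshold concrete, here is a rough sketch of how the figure is usually estimated. It assumes the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass; the model size, token count, and peak FLOP/s below are placeholder values, not measurements from this cluster.

```python
def mfu(params: float, tokens_per_step: float, step_time_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate model FLOPs utilisation as achieved FLOP/s over peak FLOP/s.

    Assumes ~6 * params FLOPs per trained token (dense transformer,
    forward + backward, attention FLOPs ignored).
    """
    if step_time_s <= 0 or n_gpus <= 0:
        raise ValueError("step_time_s and n_gpus must be positive")
    achieved_flops_per_s = 6 * params * tokens_per_step / step_time_s
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)


def needs_profiling(mfu_value: float, threshold: float = 0.35) -> bool:
    """Flag the run for profiling when MFU is below the 35% line."""
    return mfu_value < threshold


if __name__ == "__main__":
    # Placeholder inputs: 7e9 params, 4e6 tokens/step, 10 s/step,
    # 64 GPUs, 1e15 peak FLOP/s per GPU (take the real peak from the datasheet).
    value = mfu(7e9, 4e6, 10.0, 64, 1e15)
    print(f"MFU = {value:.2%}, profile: {needs_profiling(value)}")
```

Use the peak figure for the precision you actually train in; quoting a sparse or lower-precision peak makes MFU look worse than it is.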

3. Inference: vLLM (Deployment + Service + HPA)¶

OpenAI-compatible endpoint, tensor-parallel across GPUs, scaled by an HPA on vllm:num_requests_waiting. Smoke-test with curl .../v1/completions and track TTFT/TPOT against the SLO (inference serving). For multi-node or disaggregated serving, front this with Dynamo; for managed serving, wrap it as a KServe InferenceService. Full Deployment + HPA manifest and smoke test: Recipe: vLLM inference deployment.
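As an illustration of the two latency figures tracked against the SLO, the sketch below derives TTFT (time to first token) and TPOT (mean time per output token after the first) from client-side timestamps of a streamed response. The function names and SLO values are hypothetical; how timestamps are collected is left to the smoke test.

```python
from typing import Sequence, Tuple


def ttft_tpot(request_sent_s: float,
              token_times_s: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from token arrival timestamps.

    TTFT is first-token arrival minus request send time.
    TPOT is the mean gap between consecutive tokens after the first.
    """
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) == 1:
        return ttft, 0.0
    decode_span = token_times_s[-1] - token_times_s[0]
    return ttft, decode_span / (len(token_times_s) - 1)


def meets_slo(ttft_s: float, tpot_s: float,
              ttft_slo_s: float, tpot_slo_s: float) -> bool:
    """True when both latencies are within their (assumed) SLO limits."""
    return ttft_s <= ttft_slo_s and tpot_s <= tpot_slo_s


if __name__ == "__main__":
    ttft, tpot = ttft_tpot(0.0, [0.5, 0.75, 1.0, 1.25])
    print(ttft, tpot, meets_slo(ttft, tpot, ttft_slo_s=1.0, tpot_slo_s=0.3))
```

A single request is only a smoke test; an SLO is normally stated on a percentile over many requests at the target concurrency.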

4. End-to-end bring-up runbook (ordered)¶

Summary below; the full ordered procedure with per-step commands lives in the Playbook: end-to-end workload bring-up.

| # | Step | How | Proof |
|---|------|-----|-------|
| 1 | Facility ready | datacentre readiness | power/cooling/weight signed off |
| 2 | Fabric up | networking fabric | `ibdiagnet` clean, SM converged |
| 3 | Nodes provisioned | Ansible (Ansible bring-up) | `dcgmi diag -r 3` pass, `ibstat` Active |
| 4 | K8s GPU stack | Helm (the Kubernetes platform) | `cuda-smoke` pod prints GPU table |
| 5 | Telemetry | telemetry and monitoring | DCGM metrics in Grafana, alerts firing on test |
| 6 | Fabric proof | nccl-tests above | busbw near line rate, GDR engaged |
| 7 | Acceptance | commissioning | recorded thresholds met |
| 8 | First workload | training/inference above | MFU / TTFT within target |
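The ordering rule in the table, where no step starts until the previous proof passes, can be sketched as a simple gate runner. This is an assumed structure for illustration only: the step names come from the table, but the proof callables are hypothetical stand-ins for the real checks (`ibdiagnet`, `dcgmi diag -r 3`, and so on).

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A step is a name plus a proof callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_bringup(steps: Sequence[Step]) -> Tuple[List[str], Optional[str]]:
    """Run proofs in order and stop at the first failure.

    Returns (names of steps that passed, name of the failing step or None).
    Later steps are never executed once a proof fails.
    """
    passed: List[str] = []
    for name, proof in steps:
        if not proof():
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Hypothetical proofs; replace each lambda with a real check.
    steps = [
        ("Facility ready", lambda: True),
        ("Fabric up", lambda: True),
        ("Nodes provisioned", lambda: False),
        ("K8s GPU stack", lambda: True),
    ]
    passed, failed_at = run_bringup(steps)
    print("passed:", passed, "stopped at:", failed_at)
```

Stopping at the first failure keeps the diagnosis local: a failed proof points at the layer just brought up rather than at something several layers above it.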

Don't-miss checklist¶

Validate the fabric with nccl-tests before trusting a training benchmark (commissioning).
Gang-schedule distributed jobs; checkpoint to a parallel-FS RWX volume (storage and data).
Request the RDMA resource in pod specs, or NCCL drops to TCP (performance tuning).
Put an SLO and an autoscaler on inference; size KV cache to memory (inference serving).
Drive the bring-up in order; each step is the prior step's acceptance test.
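One checklist item lends itself to a pre-submit check: whether a pod spec requests an RDMA resource at all. The sketch below walks a pod spec given as a plain dictionary and looks for any resource key with the `rdma/` prefix. The exact resource name depends on the Network Operator configuration, so the name used in the example (`rdma/example_ib`) is purely hypothetical.

```python
from typing import Any, Dict


def requests_rdma(pod_spec: Dict[str, Any], prefix: str = "rdma/") -> bool:
    """Return True if any container requests or limits a resource named prefix*."""
    for container in pod_spec.get("containers") or []:
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            names = resources.get(section) or {}
            if any(name.startswith(prefix) for name in names):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical pod spec; "rdma/example_ib" is a placeholder resource name.
    spec = {
        "containers": [
            {
                "name": "worker",
                "resources": {"limits": {"nvidia.com/gpu": 8, "rdma/example_ib": 1}},
            }
        ]
    }
    print(requests_rdma(spec))
```

A passing check only shows that the resource was requested; whether NCCL actually used RDMA is still confirmed from the NCCL_DEBUG=INFO output.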

Failure modes¶

Training launched before fabric validation; a degraded link misread as a model/GPU problem (reliability and RAS).
Distributed job partially placed (no gang scheduler): placed ranks wait on missing peers, GPUs sit idle, and the job deadlocks.
vLLM OOM because --max-model-len or the batch size is too large for the available KV cache.
Checkpoints to a single-writer volume, throttling all ranks (storage and data).
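The KV-cache failure mode above comes down to arithmetic that can be checked before deployment. The sketch uses the standard per-token size for a transformer KV cache (keys plus values, per layer, per KV head); the model dimensions and memory budget are illustrative assumptions, and a real server's block allocation and memory reservation will shift the numbers somewhat.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * dtype size."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left for KV cache after weights."""
    return kv_budget_bytes // bytes_per_token


def fits(max_model_len: int, concurrent_seqs: int, capacity_tokens: int) -> bool:
    """Worst case: every concurrent sequence grows to max_model_len."""
    return max_model_len * concurrent_seqs <= capacity_tokens


if __name__ == "__main__":
    # Illustrative model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
    per_token = kv_bytes_per_token(32, 8, 128, 2)
    capacity = max_cached_tokens(16 * 2**30, per_token)  # assumed 16 GiB KV budget
    print(per_token, capacity, fits(8192, 16, capacity))
```

If the worst case does not fit, lower the maximum context length or the concurrency limit, or add memory through tensor parallelism, rather than discovering the limit under load.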

Open questions & validation¶

Confirm operator versions: MPI Operator (v2beta1), Volcano pytorch plugin, KServe API for the cluster.
Validate the RDMA resource name (rdma/...) emitted by the Network Operator (the Kubernetes platform).
Benchmark a real model end-to-end and record MFU / TTFT as the acceptance baseline (distributed training, inference serving).

References¶

nccl-tests: https://github.com/NVIDIA/nccl-tests
Kubeflow MPI Operator: https://github.com/kubeflow/mpi-operator
Kubeflow Training Operator (PyTorchJob): https://www.kubeflow.org/docs/components/trainer/
Volcano: https://volcano.sh/en/docs/
vLLM serving: https://docs.vllm.ai/en/latest/
KServe: https://kserve.github.io/website/

Related: Fabric · Commissioning · Storage · Training · Inference · K8s Platform · Telemetry · Glossary