Workload & bring-up recipes

Scope: index and decision overview for runnable GPU-cluster workloads (fabric validation, distributed training, inference serving) and the order in which they run. The detailed, copy-pasteable manifests live in the focused child pages below; this page explains which recipe to run when and ties the recipes into the end-to-end bring-up. It is the applied counterpart to distributed training and inference serving.

The recipes are reference templates: pin images, set model, secret, and storage names, and validate at small scale first.

flowchart LR
  FABRIC["Fabric validation"] --> TRAIN["Distributed training job"]
  TRAIN --> SERVE["Inference deployment"]
  SERVE --> ACCEPT["Acceptance evidence"]

Overview

Three workloads prove that a cluster works, and they run in order: a collective benchmark proves the fabric (networking fabric), a distributed training job proves scheduling, storage, and communication (distributed training), and an inference deployment proves the serving path (inference serving). Each is the acceptance test for the layer beneath it.

Focused pages

This page is the index; the full manifests and step-by-step procedures live in these children:

1. Fabric validation: nccl-tests (MPIJob)

Run all_reduce_perf across nodes via the MPI Operator (kubeflow.org/v2beta1) and compare the reported bus bandwidth (busbw) against the topology expectation (GPU performance and health). Pass criterion: busbw approaches line rate for the message size, and NCCL_DEBUG=INFO shows [GDRDMA], not TCP (performance tuning). Full MPIJob manifest and tuning: Recipe: fabric validation with nccl-tests.
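The pass criterion can be checked numerically. The sketch below follows the nccl-tests definition of bus bandwidth for all_reduce, busbw = algbw × 2(n − 1)/n for n ranks, and compares the result to a per-GPU line rate. The function names, the 400 Gb/s line rate, and the 0.9 pass fraction are illustrative assumptions, not thresholds taken from this page; substitute the values recorded for your topology.

```python
def allreduce_busbw_gbps(size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for all_reduce, per the nccl-tests definition."""
    if n_ranks < 2 or time_us <= 0:
        raise ValueError("need at least 2 ranks and a positive time")
    # Algorithm bandwidth: bytes moved per second, in GB/s (1 GB = 1e9 bytes).
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # all_reduce correction factor so the figure is comparable to link speed.
    return algbw * 2 * (n_ranks - 1) / n_ranks


def fabric_passes(busbw_gbps: float, line_rate_gbit: float,
                  min_fraction: float = 0.9) -> bool:
    """True if busbw reaches min_fraction of the line rate (assumed threshold)."""
    line_rate_gbytes = line_rate_gbit / 8  # Gb/s -> GB/s
    return busbw_gbps >= min_fraction * line_rate_gbytes


if __name__ == "__main__":
    # Hypothetical measurement: 8 GiB message, 350 ms, 16 ranks.
    busbw = allreduce_busbw_gbps(8 * 2**30, 350_000, 16)
    print(f"busbw = {busbw:.1f} GB/s, pass = {fabric_passes(busbw, 400)}")
```

The size and time inputs correspond to the size and time columns of the all_reduce_perf output; checking the largest message sizes is usually the most informative, since small messages are latency-bound.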

2. Distributed training: gang-scheduled torchrun (Volcano Job)

Use gang scheduling so that all workers start together (Kubernetes for GPUs), and mount a shared RWX PVC for sharded checkpoints (storage and data). Watch MFU and step time in Grafana (telemetry and monitoring); if MFU falls below 35%, profile the job (performance tuning). Full Volcano Job manifest: Recipe: gang-scheduled training.
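To make the 35% trigger concrete, MFU (model FLOPs utilization) can be estimated from step time. The sketch below uses the common approximation of 6 FLOPs per parameter per token for a dense transformer's forward and backward pass, which ignores attention FLOPs. The function names and all numbers in the example are hypothetical; take the peak FLOP/s figure from the vendor datasheet for your GPU and precision.

```python
def mfu(params: float, tokens_per_step: float, step_time_s: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Estimate model FLOPs utilization as achieved FLOP/s over peak FLOP/s."""
    if step_time_s <= 0 or n_gpus < 1 or peak_flops_per_gpu <= 0:
        raise ValueError("step time, GPU count and peak FLOP/s must be positive")
    # Approximation: 6 FLOPs per parameter per token (forward + backward).
    achieved_flops_per_s = 6.0 * params * tokens_per_step / step_time_s
    return achieved_flops_per_s / (n_gpus * peak_flops_per_gpu)


def needs_profiling(mfu_value: float, threshold: float = 0.35) -> bool:
    """Apply the page's rule of thumb: profile when MFU is below 35%."""
    return mfu_value < threshold


if __name__ == "__main__":
    # Hypothetical job: 7e9 parameters, ~4M tokens per step, 8 s per step,
    # 64 GPUs with an assumed peak of 1e15 FLOP/s each.
    value = mfu(7e9, 4_194_304, 8.0, 64, 1e15)
    print(f"MFU = {value:.1%}, profile = {needs_profiling(value)}")
```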

3. Inference: vLLM (Deployment + Service + HPA)

Deploy vLLM as an OpenAI-compatible endpoint, tensor-parallel across GPUs, and scale it with an HPA on vllm:num_requests_waiting. Smoke-test with curl .../v1/completions and track TTFT (time to first token) and TPOT (time per output token) against the SLO (inference serving). For multi-node or disaggregated serving, front the deployment with Dynamo; for managed serving, wrap it as a KServe InferenceService. Full Deployment + HPA manifest and smoke test: Recipe: vLLM inference deployment.
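For intuition about how an HPA on vllm:num_requests_waiting behaves, the sketch below reproduces the core Kubernetes HPA calculation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within the tolerance (0.1 by default). It is a simplification that leaves out stabilization windows and the handling of unready pods. The target of 5 waiting requests per pod and the replica bounds are hypothetical values, not settings from the recipe.

```python
import math


def desired_replicas(current_replicas: int, avg_waiting_per_pod: float,
                     target_waiting_per_pod: float, min_replicas: int = 1,
                     max_replicas: int = 8, tolerance: float = 0.1) -> int:
    """Core HPA scaling rule applied to an average per-pod queue depth."""
    if target_waiting_per_pod <= 0:
        raise ValueError("target must be positive")
    ratio = avg_waiting_per_pod / target_waiting_per_pod
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the HPA leaves the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical state: 2 replicas, 12 requests waiting per pod, target 5.
    print(desired_replicas(2, 12, 5))
```

A lower target scales out earlier and protects TTFT at the cost of more idle GPUs, so the target is best chosen from the measured queue depth at which the SLO starts to slip.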

4. End-to-end bring-up runbook (ordered)

Summary below; the full ordered procedure with per-step commands lives in the Playbook: end-to-end workload bring-up.

| # | Step | How | Proof |
|---|------|-----|-------|
| 1 | Facility ready | datacentre readiness | power/cooling/weight signed off |
| 2 | Fabric up | networking fabric | ibdiagnet clean, SM converged |
| 3 | Nodes provisioned | Ansible (Ansible bring-up) | dcgmi diag -r 3 pass, ibstat Active |
| 4 | K8s GPU stack | Helm (the Kubernetes platform) | cuda-smoke pod prints GPU table |
| 5 | Telemetry | telemetry and monitoring | DCGM metrics in Grafana, alerts firing on test |
| 6 | Fabric proof | nccl-tests (section 1 above) | busbw near line rate, GDR engaged |
| 7 | Acceptance | commissioning | recorded thresholds met |
| 8 | First workload | training/inference (sections 2 and 3 above) | MFU / TTFT within target |
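The ordering rule in the runbook, where a step runs only after the previous step's proof has passed, can be sketched as a simple gate runner. The gate names mirror the table; the check functions here are placeholder lambdas, and in practice each would wrap the proof named in the table (for example, running dcgmi diag -r 3 and inspecting its result). `run_bring_up` and the `Gate` type are hypothetical helpers for illustration.

```python
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_bring_up(gates: List[Gate]) -> List[Tuple[str, str]]:
    """Run gates in order; after the first failure, mark the rest as skipped."""
    results: List[Tuple[str, str]] = []
    blocked = False
    for name, check in gates:
        if blocked:
            results.append((name, "skipped"))
            continue
        passed = check()
        results.append((name, "pass" if passed else "fail"))
        blocked = not passed
    return results


if __name__ == "__main__":
    # Placeholder checks: replace each lambda with a real proof.
    gates: List[Gate] = [
        ("Fabric up", lambda: True),
        ("Nodes provisioned", lambda: True),
        ("Fabric proof", lambda: False),   # e.g. busbw below threshold
        ("First workload", lambda: True),  # never runs: blocked by the failure
    ]
    for name, status in run_bring_up(gates):
        print(f"{name}: {status}")
```

Stopping at the first failed gate keeps a fabric or node fault from being misattributed to a later layer.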

Don't-miss checklist

  • Validate the fabric with nccl-tests before trusting a training benchmark (commissioning).
  • Gang-schedule distributed jobs; checkpoint to a parallel-FS RWX volume (storage and data).
  • Request the RDMA resource in pod specs, or NCCL drops to TCP (performance tuning).
  • Put an SLO and an autoscaler on inference; size KV cache to memory (inference serving).
  • Drive the bring-up in order; each step is the prior step's acceptance test.

Failure modes

  • Training launched before fabric validation; a degraded link misread as a model/GPU problem (reliability and RAS).
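  • Distributed job partially placed (no gang scheduler): the placed workers hold idle GPUs while waiting for the rest, and the job deadlocks.
  • vLLM runs out of GPU memory because --max-model-len or the batch size is too large for the available KV cache.
  • Checkpoints written to a single-writer volume, which throttles all ranks (storage and data).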
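The vLLM OOM failure mode comes down to arithmetic that can be checked before deploying. Each cached token costs 2 × layers × KV heads × head dimension × bytes per element (one key and one value per layer). The sketch below computes that cost and tests whether a worst case of every concurrent sequence reaching --max-model-len fits a given KV cache budget. The model dimensions and the 16 GiB budget are hypothetical, and the worst case is deliberately conservative, since a paged KV cache allocates blocks on demand.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit in the memory left over for the KV cache."""
    return kv_budget_bytes // bytes_per_token


def fits(max_model_len: int, max_concurrent_seqs: int,
         kv_budget_bytes: int, bytes_per_token: int) -> bool:
    """Conservative check: all sequences at full length at the same time."""
    worst_case_tokens = max_model_len * max_concurrent_seqs
    return worst_case_tokens <= max_cached_tokens(kv_budget_bytes, bytes_per_token)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
    per_token = kv_bytes_per_token(32, 8, 128, 2)
    budget = 16 * 2**30  # assumed 16 GiB left for KV cache across the TP group
    print(per_token, max_cached_tokens(budget, per_token),
          fits(8192, 16, budget, per_token))
```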

Open questions & validation

  • Confirm operator versions: MPI Operator (v2beta1), Volcano pytorch plugin, KServe API for the cluster.
  • Validate the RDMA resource name (rdma/...) emitted by the Network Operator (the Kubernetes platform).
  • Benchmark a real model end-to-end and record MFU / TTFT as the acceptance baseline (distributed training, inference serving).

References

  • nccl-tests: https://github.com/NVIDIA/nccl-tests
  • Kubeflow MPI Operator: https://github.com/kubeflow/mpi-operator
  • Kubeflow Training Operator (PyTorchJob): https://www.kubeflow.org/docs/components/trainer/
  • Volcano: https://volcano.sh/en/docs/
  • vLLM serving: https://docs.vllm.ai/en/latest/
  • KServe: https://kserve.github.io/website/

Related: Fabric · Commissioning · Storage · Training · Inference · K8s Platform · Telemetry · Glossary