Recipe: generic vLLM inference deployment¶
Scope: a standalone, model-agnostic recipe to stand up a vLLM OpenAI-compatible server on Kubernetes, covering Deployment + Service + HPA, model/token config, a smoke request, and SLO/HPA wiring. Distinct from the model-specific cookbooks (Qwen3-235B, Llama-4-Maverick, DeepSeek-R1, GLM-4, Kimi-K2); use this as the base manifest those specialize.
Reference template. Manifests are not executed or hardware-tested here. Pin the vLLM image tag and model revision, set namespace/secret/storage names, and validate at small scale before production.
flowchart LR
CLIENT["Client / gateway"] --> SVC["Service :8000"]
SVC --> POD["vLLM pod(s)"]
POD --> GPU["nvidia.com/gpu (TP shards)"]
POD --> METRICS["/metrics :8000"]
METRICS --> PROM["Prometheus adapter"]
PROM --> HPA["HPA: num_requests_waiting"]
HPA -->|"scale replicas"| POD
What it is¶
A vLLM server packaged as a Kubernetes Deployment fronted by a Service, autoscaled by a HorizontalPodAutoscaler. vLLM exposes an OpenAI-compatible HTTP API on port 8000 (/v1/chat/completions, /v1/completions, /v1/models), a liveness/readiness path /health, and a Prometheus scrape path /metrics. One pod loads one model and shards it across its GPUs via tensor parallelism (--tensor-parallel-size); the HPA adds or removes whole pods (replicas) based on queue depth.
This recipe is the generic substrate: swap --model, --tensor-parallel-size, --max-model-len, and the GPU count, and you have any of the cookbook deployments. Request scheduling, continuous batching, and KV-cache management are vLLM's job; this page covers only the cluster wiring.
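As an illustration of that swap, the fragment below sketches the fields that would change for a hypothetical four-GPU pod. The model id and context length are placeholders, not values taken from any cookbook; the rest of the base manifest is assumed to stay the same.

```yaml
# Hypothetical overlay: only the fields that differ from the base manifest.
containers:
  - name: vllm
    args:
      - --model=<org>/<model>              # placeholder model id
      - --tensor-parallel-size=4           # must match the GPU count below
      - --max-model-len=<context-length>   # placeholder; size to KV-cache headroom
    resources:
      limits: { nvidia.com/gpu: 4 }
      requests: { nvidia.com/gpu: 4 }
```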
Why it matters¶
- Continuous batching + paged KV cache. vLLM packs in-flight requests and pages the KV cache, so a single replica sustains far higher throughput than naive per-request serving. The platform's job is to feed it and scale it, not to re-implement it.
- Replica autoscaling on the right signal. GPU utilization is a poor autoscale trigger: a saturated engine can read 100% util while latency is fine, or read low util while the queue backs up. vllm:num_requests_waiting (queue depth) tracks the user-visible backlog directly (QoS and admission control).
- SLO accountability. vLLM emits TTFT/TPOT/latency histograms at /metrics, which feed the burn-rate alerts in the SLO/SLI catalog and the inference SLO-breach runbook.
- One manifest, many models. A single reviewed base manifest reduces drift across the model cookbooks and the broader serving open-weight models catalog.
When it is needed (and when not)¶
Use this recipe when:
- You self-host an open-weight model on your own GPUs and want an OpenAI-compatible endpoint.
- One model fits on a node's GPUs (single-node tensor parallelism) and you scale by adding replicas.
- You want HPA-driven elasticity behind a TTFT/TPOT SLO.
Reach for something else when:
- Model exceeds one node: you need pipeline/expert parallelism across nodes. See inference parallelism strategies and expert-parallel inference, and front the pods with a multi-node coordinator rather than a plain Deployment.
- Prefill/decode disaggregation is required for tail latency. See disaggregated inference.
- Managed lifecycle (canary, scale-to-zero, revisions): wrap as a KServe InferenceService instead of a raw Deployment.
- Multi-tenant fairness / priority: add an admission layer (QoS and admission control); HPA alone does not arbitrate tenants.
How: implement, integrate, maintain¶
1. Prereqs¶
- GPU stack installed (NVIDIA device plugin advertising nvidia.com/gpu): Kubernetes GPU platform, health-gated nodes (GPU health gating).
- Prometheus + an adapter exposing custom/external metrics (e.g. prometheus-adapter) so the HPA can read vllm:num_requests_waiting (see telemetry and monitoring).
- A Hugging Face token secret for gated models. The Deployment in the next step reads key token from a secret named hf in the serving namespace.
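A minimal sketch of that token secret follows. The name hf, key token, and namespace serving match what the Deployment in the next step references; the token value is a placeholder, and in practice the secret would more likely come from a secret manager than from a committed file.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf              # referenced by secretKeyRef.name in the Deployment
  namespace: serving
type: Opaque
stringData:
  token: <your-hf-token>   # placeholder; do not commit a real token
```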
2. Deployment + Service¶
Pin the image tag and model revision. --tensor-parallel-size must equal the GPU count in the pod. Size --max-model-len to KV-cache headroom; mount a large /dev/shm for the NCCL/IPC traffic between TP shards.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-server }
  template:
    metadata:
      labels: { app: vllm-server }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:<pinned-tag>   # pin exact tag
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct   # swap per model
            - --served-model-name=default                # stable client-facing id
            - --tensor-parallel-size=1                   # == GPU count below
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - { containerPort: 8000, name: http }
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom: { secretKeyRef: { name: hf, key: token } }
          resources:
            limits: { nvidia.com/gpu: 1 }
            requests: { nvidia.com/gpu: 1 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 60   # raise for large weights
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 120
            periodSeconds: 30
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 16Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  namespace: serving
  labels: { app: vllm-server }
spec:
  selector: { app: vllm-server }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
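To size --max-model-len against KV-cache headroom, a back-of-envelope estimate can help. The sketch below assumes the published architecture of the Llama-3.1-8B example (32 layers, 8 KV heads, head dimension 128) and a 16-bit KV cache; these figures are assumptions about the model, not something this recipe measures.

```latex
\begin{aligned}
\text{KV bytes per token}
  &= 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per element} \\
  &= 2 \times 32 \times 8 \times 128 \times 2 = 131{,}072\ \text{B} = 128\ \text{KiB} \\
\text{KV bytes per full-length sequence}
  &= 128\ \text{KiB} \times 8192 = 1\ \text{GiB}
\end{aligned}
```

Under these assumptions, each request that fills the 8192-token window pins about 1 GiB of cache, so the memory left after weights (within --gpu-memory-utilization) bounds how many such sequences can run concurrently.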
3. Scrape metrics¶
vLLM serves Prometheus metrics at /metrics on the same port. Add a ServiceMonitor (Prometheus Operator) so the adapter can surface vllm:num_requests_waiting:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-server
  namespace: serving
  labels: { release: kube-prom }
spec:
  selector:
    matchLabels: { app: vllm-server }
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
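Scraping alone does not make the metric visible to the HPA; the adapter also needs a rule that maps the series onto pods. A minimal sketch of a prometheus-adapter rule follows, assuming the scraped series carry namespace and pod labels (the usual result of a ServiceMonitor scrape). Where this block lives depends on how the adapter is installed, so treat the placement as an assumption.

```yaml
# prometheus-adapter config fragment (sketch, not tested here).
rules:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: { resource: namespace }   # label -> Kubernetes resource
        pod: { resource: pod }
    # No "name" override: the metric keeps its series name.
    # If the colon causes trouble in your setup, rename it here
    # and update the HPA metric name to match.
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```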
4. HPA on queue depth¶
Scale on vllm:num_requests_waiting per pod (a Pods metric averaged across replicas), not GPU utilization. The averageValue is the target backlog per replica:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server
  namespace: serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric: { name: vllm:num_requests_waiting }
        target: { type: AverageValue, averageValue: "5" }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Pods, value: 2, periodSeconds: 60 }]
    scaleDown:
      stabilizationWindowSeconds: 300   # slow down: cold starts are expensive
      policies: [{ type: Pods, value: 1, periodSeconds: 120 }]
scaleDown is deliberately slow: loading large weights into GPU memory is a multi-minute cold start, so flapping replicas burns SLO. Apply:
kubectl apply -f vllm-deploy.yaml -f vllm-servicemonitor.yaml -f vllm-hpa.yaml
kubectl -n serving rollout status deploy/vllm-server
kubectl -n serving get hpa vllm-server
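To read the HPA status, it helps to know how the averageValue target turns into a replica count. The documented HPA calculation is shown below with illustrative numbers (2 replicas averaging 12 waiting requests each, against the target of 5); the load figures are made up for the example.

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 2 \times \frac{12}{5} \right\rceil
  = \lceil 4.8 \rceil = 5
```

With the scaleUp policy above (at most 2 pods per 60 s), the Deployment would grow from 2 to 4 replicas in the first period and reach 5 in the next, provided the backlog persists.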
5. Smoke request¶
Port-forward and confirm the model is loaded and answering. --served-model-name is the model value clients send:
kubectl -n serving port-forward svc/vllm-server 8000:8000 &

# model is loaded and listed
curl -s http://localhost:8000/v1/models | jq '.data[].id'

# chat completion
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
    "max_tokens": 8,
    "temperature": 0
  }' | jq -r '.choices[0].message.content'
Also covered by the platform smoke tests. For an end-to-end bring-up sequence see workload bring-up recipes.
6. SLO wiring¶
vLLM's histograms back the SLO/SLI catalog. TTFT compliance as PromQL (fraction of requests under a 500 ms target):
# TTFT SLI: fraction under 500ms
sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
/ sum(rate(vllm:time_to_first_token_seconds_count[5m]))
# live queue depth driving the HPA
sum(vllm:num_requests_waiting)
# running vs waiting (saturation view)
sum(vllm:num_requests_running) / (sum(vllm:num_requests_running) + sum(vllm:num_requests_waiting))
Wire the multi-window burn-rate alerts from the SLO/SLI catalog; every page links to the inference SLO-breach runbook. Metric names carry the vllm: prefix and may shift between vLLM versions. Confirm against your pinned image's /metrics output.
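As a sketch of what one such burn-rate alert might look like, the PrometheusRule below pairs a 1 h and a 5 m window at a 14.4x burn rate, following the multi-window pattern in the SRE Workbook. The 99% objective, the rule and alert names, and the severity label are assumptions for illustration; the real thresholds belong to the SLO/SLI catalog.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-ttft-burn-rate          # hypothetical name
  namespace: serving
  labels: { release: kube-prom }     # same selector label as the ServiceMonitor
spec:
  groups:
    - name: vllm-slo
      rules:
        - alert: VllmTtftFastBurn    # hypothetical alert name
          # Error ratio = 1 - (requests under 500 ms / all requests).
          # Fires when both windows burn faster than 14.4x a 1% error budget.
          expr: |
            (
              1 - sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[1h]))
                / sum(rate(vllm:time_to_first_token_seconds_count[1h]))
            ) > (14.4 * 0.01)
            and
            (
              1 - sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
                / sum(rate(vllm:time_to_first_token_seconds_count[5m]))
            ) > (14.4 * 0.01)
          labels: { severity: page }
          annotations:
            summary: "vLLM TTFT SLO fast burn (assumed 99% under 500 ms)"
```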
Maintain¶
- Upgrades: bump the pinned image tag, roll one replica, re-run the smoke request and TTFT query before completing the rollout. Treat metric-name changes as a breaking change for the HPA and SLO rules.
- Capacity: if num_requests_waiting sits above target with HPA at maxReplicas, add GPUs/nodes or shed load via QoS admission control. Autoscaling cannot create capacity that is not there.
- OOM: a pod CrashLooping on KV-cache OOM means --max-model-len or concurrency is too high for the GPU memory; lower --max-model-len or concurrency (--max-num-seqs) before changing topology. Lowering --gpu-memory-utilization leaves headroom for allocations outside vLLM's preallocated pool, but it also shrinks the KV cache itself.
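For the upgrade step above, one way to roll a single replica at a time is a surge-only rolling update on the Deployment. This is a sketch of standard Kubernetes fields, not part of the base manifest; it assumes the cluster has spare GPUs for one extra pod during the rollout.

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring up one new-version pod at a time
      maxUnavailable: 0   # keep existing replicas serving until it is Ready
```

Holding the rollout after the first new pod is still a manual step, for example with kubectl rollout pause while the smoke request and TTFT query are re-run.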
Failure modes¶
- HPA never scales: adapter not exposing vllm:num_requests_waiting, or ServiceMonitor label mismatch. kubectl get --raw the custom-metrics API to confirm the metric is visible.
- Readiness flaps or restarts on large models: initialDelaySeconds too low for weight load (a failing liveness probe restarts the pod mid-load); raise it.
- TP mismatch: --tensor-parallel-size not equal to the nvidia.com/gpu limit; the engine typically fails or hangs at startup when TP exceeds the GPUs visible in the pod, and leaves GPUs idle when TP is smaller.
- /dev/shm too small: TP shards fail NCCL IPC; size the emptyDir memory volume to the model.
- Scale-down thrash: cold starts re-load weights repeatedly; lengthen scaleDown.stabilizationWindowSeconds.
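For the probe-timing failure above, one option beyond raising initialDelaySeconds is a Kubernetes startupProbe, which holds off liveness and readiness checks until the server first reports healthy. A minimal sketch follows, assuming a ten-minute budget for weight loading; the threshold is illustrative, not measured.

```yaml
# Added alongside readinessProbe/livenessProbe on the vllm container.
startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 60   # 60 x 10 s = up to 10 min before the pod is restarted
```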
References¶
- vLLM OpenAI-compatible server: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- vLLM production metrics (vllm:*): https://docs.vllm.ai/en/latest/usage/metrics.html
- vLLM engine args (--tensor-parallel-size, --max-model-len, --gpu-memory-utilization): https://docs.vllm.ai/en/latest/serving/engine_args.html
- Kubernetes HorizontalPodAutoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
- HPA with custom/external metrics: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#scaling-on-custom-metrics
- prometheus-adapter: https://github.com/kubernetes-sigs/prometheus-adapter
- Prometheus Operator ServiceMonitor: https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.ServiceMonitor
- Google SRE Workbook — alerting on SLOs: https://sre.google/workbook/alerting-on-slos/
Related: Inference serving · Serving OSS models · SLO/SLI catalog · Inference SLO-breach runbook · QoS & admission control · Workload bring-up recipes · K8s GPU platform · Telemetry · Glossary