Recipe: generic vLLM inference deployment

Scope: a standalone, model-agnostic recipe to stand up a vLLM OpenAI-compatible server on Kubernetes, covering Deployment + Service + HPA, model/token config, a smoke request, and SLO/HPA wiring. Distinct from the model-specific cookbooks (Qwen3-235B, Llama-4-Maverick, DeepSeek-R1, GLM-4, Kimi-K2); use this as the base manifest those specialize.

Reference template. Manifests are not executed or hardware-tested here. Pin the vLLM image tag and model revision, set namespace/secret/storage names, and validate at small scale before production.

flowchart LR
  CLIENT["Client / gateway"] --> SVC["Service :8000"]
  SVC --> POD["vLLM pod(s)"]
  POD --> GPU["nvidia.com/gpu (TP shards)"]
  POD --> METRICS["/metrics :8000"]
  METRICS --> PROM["Prometheus adapter"]
  PROM --> HPA["HPA: num_requests_waiting"]
  HPA -->|"scale replicas"| POD

What it is

A vLLM server packaged as a Kubernetes Deployment fronted by a Service and autoscaled by a HorizontalPodAutoscaler. vLLM exposes an OpenAI-compatible HTTP API on port 8000 (/v1/chat/completions, /v1/completions, /v1/models), a liveness/readiness path /health, and a Prometheus scrape path /metrics. One pod loads one model and shards it across its GPUs via tensor parallelism (--tensor-parallel-size); the HPA adds or removes whole pods (replicas) based on queue depth.

This recipe is the generic substrate: swap --model, --tensor-parallel-size, --max-model-len, and the GPU count, and you have any of the cookbook deployments. Request scheduling, continuous batching, and KV-cache management are vLLM's job; this page covers only the cluster wiring.

Why it matters

  • Continuous batching + paged KV cache. vLLM packs in-flight requests and pages the KV cache, so a single replica sustains far higher throughput than naive per-request serving. The platform's job is to feed it and scale it, not to re-implement it.
  • Replica autoscaling on the right signal. GPU utilization is a poor autoscale trigger: a saturated engine can read 100% util while latency is fine, or read low util while the queue backs up. vllm:num_requests_waiting (queue depth) tracks the user-visible backlog directly (see QoS and admission control).
  • SLO accountability. vLLM emits TTFT/TPOT/latency histograms at /metrics, which feed the burn-rate alerts in the SLO/SLI catalog and the inference SLO-breach runbook.
  • One manifest, many models. A single reviewed base manifest reduces drift across the model cookbooks and the broader serving open-weight models catalog.

When it is needed (and when not)

Use this recipe when:

  • You self-host an open-weight model on your own GPUs and want an OpenAI-compatible endpoint.
  • One model fits on a node's GPUs (single-node tensor parallelism) and you scale by adding replicas.
  • You want HPA-driven elasticity behind a TTFT/TPOT SLO.

Reach for something else when:

  • Model exceeds one node: you need pipeline/expert parallelism across nodes. See inference parallelism strategies and expert-parallel inference, and run it under a multi-node coordinator rather than a plain Deployment.
  • Prefill/decode disaggregation is required for tail latency. See disaggregated inference.
  • Managed lifecycle (canary, scale-to-zero, revisions): wrap as a KServe InferenceService instead of a raw Deployment.
  • Multi-tenant fairness / priority: add an admission layer (QoS and admission control); HPA alone does not arbitrate tenants.

How: implement, integrate, maintain

1. Prereqs

  • GPU stack installed, with the NVIDIA device plugin advertising nvidia.com/gpu (see Kubernetes GPU platform) and nodes health-gated (see GPU health gating).
  • Prometheus + an adapter exposing custom/external metrics (e.g. prometheus-adapter) so the HPA can read vllm:num_requests_waiting (see telemetry and monitoring).
  • A Hugging Face token secret for gated models:
kubectl -n serving create secret generic hf \
  --from-literal=token="$HF_TOKEN"

2. Deployment + Service

Pin the image tag and model revision. --tensor-parallel-size must equal the GPU count in the pod. Size --max-model-len to KV-cache headroom; mount a large /dev/shm for the NCCL/IPC traffic between TP shards.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-server }
  template:
    metadata:
      labels: { app: vllm-server }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:<pinned-tag>          # pin exact tag
          args:
            - --model=meta-llama/Llama-3.1-8B-Instruct  # swap per model
            - --served-model-name=default                # stable client-facing id
            - --tensor-parallel-size=1                   # == GPU count below
            - --max-model-len=8192
            - --gpu-memory-utilization=0.90
            - --port=8000
          ports:
            - { containerPort: 8000, name: http }
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom: { secretKeyRef: { name: hf, key: token } }
          resources:
            limits:   { nvidia.com/gpu: 1 }
            requests: { nvidia.com/gpu: 1 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 60        # raise for large weights
            periodSeconds: 10
          livenessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 120       # must outlast weight load, or the kubelet restarts the pod mid-load
            periodSeconds: 30
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 16Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  namespace: serving
  labels: { app: vllm-server }
spec:
  selector: { app: vllm-server }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
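
The "size --max-model-len to KV-cache headroom" rule above can be sanity-checked with rough arithmetic before deploying. The sketch below estimates KV-cache bytes per token as 2 (K and V) × layers × KV heads × head dim × bytes per element. The shape values are those of the Llama-3.1-8B config (32 layers, 8 KV heads, head dim 128, bf16); read them from config.json for any model you swap in. The 80 GiB GPU and 16 GiB weight footprint are illustrative assumptions, and the estimate ignores activations and other runtime overhead, so treat the result as an upper bound.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes: int, gpu_mem_util: float, weight_bytes: int, per_token: int) -> int:
    """Tokens that fit in the memory budget left after the weights are loaded.

    Rough upper bound: ignores activations and other runtime overhead.
    """
    budget = int(gpu_bytes * gpu_mem_util) - weight_bytes
    return max(budget, 0) // per_token


GiB = 1024 ** 3
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)  # Llama-3.1-8B shape
tokens = max_cached_tokens(
    gpu_bytes=80 * GiB,       # assumed 80 GiB GPU
    gpu_mem_util=0.90,        # matches --gpu-memory-utilization in the manifest
    weight_bytes=16 * GiB,    # assumed ~16 GiB of bf16 weights
    per_token=per_token,
)
print(per_token)              # 131072 bytes, i.e. 128 KiB per token
print(tokens // 8192)         # 56 full-length 8192-token sequences fit at once
```

If the second number is smaller than the concurrency you expect per replica, lower --max-model-len or plan for more replicas.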

3. Scrape metrics

vLLM serves Prometheus metrics at /metrics on the same port. Add a ServiceMonitor (Prometheus Operator) so the adapter can surface vllm:num_requests_waiting:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-server
  namespace: serving
  labels: { release: kube-prom }   # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels: { app: vllm-server }
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
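
Before debugging the adapter, it helps to confirm that the pod's /metrics output actually carries the gauge. The sketch below parses Prometheus text-format output and pulls every sample of a named series. The sample text and its model_name label are illustrative assumptions, not captured output; paste in what your pinned image returns.

```python
def gauge_values(metrics_text: str, name: str) -> list[float]:
    """Return every sample value for series `name` in Prometheus text-format output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                  # skip blanks, HELP and TYPE lines
        if "{" in line:
            series, _, tail = line.partition("{")     # name before the label set
            _, _, tail = tail.rpartition("}")         # value follows the closing brace
        else:
            series, _, tail = line.partition(" ")
        if series != name:
            continue
        fields = tail.split()
        if fields:
            values.append(float(fields[0]))           # a second field would be a timestamp
    return values


# Illustrative sample only; the label set is an assumption.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="default"} 3.0
vllm:num_requests_running{model_name="default"} 12.0
"""

print(gauge_values(SAMPLE, "vllm:num_requests_waiting"))   # [3.0]
```

An empty list means the series name differs in your vLLM version, which would also leave the HPA without a signal.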

4. HPA on queue depth

Scale on vllm:num_requests_waiting per pod (a Pods metric averaged across replicas), not GPU utilization. The averageValue is the target backlog per replica:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server
  namespace: serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric: { name: vllm:num_requests_waiting }
        target: { type: AverageValue, averageValue: "5" }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies: [{ type: Pods, value: 2, periodSeconds: 60 }]
    scaleDown:
      stabilizationWindowSeconds: 300        # slow down: cold starts are expensive
      policies: [{ type: Pods, value: 1, periodSeconds: 120 }]

scaleDown is deliberately slow: loading large weights into GPU memory is a multi-minute cold start, so flapping replicas burns error budget. Apply:

kubectl apply -f vllm-deploy.yaml -f vllm-servicemonitor.yaml -f vllm-hpa.yaml
kubectl -n serving rollout status deploy/vllm-server
kubectl -n serving get hpa vllm-server
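
To reason about what the HPA will do with a given backlog, the core calculation can be sketched in a few lines. This follows the documented Kubernetes formula, desiredReplicas = ceil(currentReplicas × currentValue / targetValue), with the default 10% tolerance and the min/max clamp from the manifest above. It is a simplification: it does not model the scaleUp/scaleDown policies, stabilization windows, or pods with missing metrics, and the function name is hypothetical.

```python
import math


def desired_replicas(
    per_pod_waiting: list[float],
    target_avg: float = 5.0,      # averageValue in the HPA above
    min_replicas: int = 1,
    max_replicas: int = 8,
    tolerance: float = 0.10,      # Kubernetes default tolerance
) -> int:
    """Replica count the HPA would aim for, before behavior policies apply."""
    current = len(per_pod_waiting)
    if current == 0:
        raise ValueError("need at least one pod reporting the metric")
    ratio = (sum(per_pod_waiting) / current) / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current            # close enough to target: no change
    # ceil(current * ratio) is the same as ceil(total backlog / per-pod target).
    wanted = math.ceil(sum(per_pod_waiting) / target_avg)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas([12, 9, 15]))   # 8: backlog of 36 at 5 per pod
print(desired_replicas([0, 1]))        # 1: nearly idle, shrink to the floor
print(desired_replicas([40] * 8))      # 8: clamped at maxReplicas
```

With the scaleUp policy above (2 pods per 60 s), going from 3 to 8 replicas takes about three policy periods rather than one step.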

5. Smoke request

Port-forward and confirm the model is loaded and answering. --served-model-name is the model value clients send:

kubectl -n serving port-forward svc/vllm-server 8000:8000 &

# model is loaded and listed
curl -s http://localhost:8000/v1/models | jq '.data[].id'

# chat completion
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
    "max_tokens": 8,
    "temperature": 0
  }' | jq -r '.choices[0].message.content'

This check is also covered by the platform smoke tests. For an end-to-end bring-up sequence see workload bring-up recipes.
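
The same smoke request can be scripted for CI. The sketch below uses only the Python standard library and mirrors the curl call above; BASE_URL assumes the port-forward is still running, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # assumes the port-forward above is running


def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build the POST for /v1/chat/completions; `model` is the --served-model-name."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response_json["choices"][0]["message"]["content"]


def smoke(prompt: str = "Reply with the single word: ok", timeout: float = 30.0) -> str:
    """Send one chat completion and return the reply text; raises on HTTP errors."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Calling smoke() against a healthy replica should return a short reply; a connection error usually means the port-forward has exited or the pod is not Ready yet.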

6. SLO wiring

vLLM's histograms back the SLO/SLI catalog. TTFT compliance as PromQL (fraction of requests under a 500 ms target):

# TTFT SLI: fraction under 500ms
sum(rate(vllm:time_to_first_token_seconds_bucket{le="0.5"}[5m]))
  / sum(rate(vllm:time_to_first_token_seconds_count[5m]))

# live queue depth driving the HPA
sum(vllm:num_requests_waiting)

# running vs waiting (saturation view)
sum(vllm:num_requests_running) / (sum(vllm:num_requests_running) + sum(vllm:num_requests_waiting))

Wire the multi-window burn-rate alerts from the SLO/SLI catalog, and link every paging alert to the inference SLO-breach runbook. Metric names carry the vllm: prefix and may change between vLLM versions, so confirm them against your pinned image's /metrics output.
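
The arithmetic behind a burn-rate alert is small enough to sketch. The TTFT SLI is the "good" bucket count over the total count, and the burn rate is the observed error ratio divided by the error ratio the SLO allows. The 99% objective and the request counts below are assumptions for illustration; the 14.4 threshold is the fast-burn value the SRE Workbook derives for a 30-day window (2% of budget in one hour). Use the objectives and windows from your own catalog.

```python
def ttft_sli(under_target: float, total: float) -> float:
    """Fraction of requests whose TTFT landed in the target bucket (1.0 when idle)."""
    return 1.0 if total == 0 else under_target / total


def burn_rate(sli: float, slo: float) -> float:
    """Error-budget burn rate: observed error ratio over the allowed error ratio."""
    return (1.0 - sli) / (1.0 - slo)


def should_page(long_window_burn: float, short_window_burn: float, threshold: float = 14.4) -> bool:
    """Multi-window rule: page only if both the long and short windows exceed the threshold."""
    return long_window_burn > threshold and short_window_burn > threshold


sli = ttft_sli(under_target=9_700, total=10_000)   # assumed counts: 97% under 500 ms
rate = burn_rate(sli, slo=0.99)                    # assumed 99% objective
print(round(rate, 2))                              # 3.0: budget burning 3x faster than allowed
print(should_page(rate, rate))                     # False: below the fast-burn threshold
```

Requiring both windows keeps a brief spike from paging while still catching a sustained burn quickly.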

Maintain

  • Upgrades: bump the pinned image tag, roll one replica, re-run the smoke request and TTFT query before completing the rollout. Treat metric-name changes as a breaking change for the HPA and SLO rules.
  • Capacity: if num_requests_waiting sits above target with HPA at maxReplicas, add GPUs/nodes or shed load via QoS admission control. Autoscaling cannot create capacity that is not there.
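  • OOM: a pod in CrashLoopBackOff on a KV-cache capacity error at startup means --max-model-len needs more cache than is left after the weights load; lower --max-model-len (or raise --gpu-memory-utilization if the GPU has headroom). For CUDA OOM under load, lower concurrency (--max-num-seqs) or --gpu-memory-utilization. Try these before changing topology.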

Failure modes

  • HPA never scales: the adapter is not exposing vllm:num_requests_waiting, or the ServiceMonitor labels do not match. Run kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 to confirm the metric is listed, and kubectl describe hpa to see the fetch error.
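  • Pod never becomes Ready, or restarts, on large models: initialDelaySeconds is shorter than the weight load, so probes fail and the liveness probe restarts the container mid-load; raise it or add a startupProbe.
  • TP mismatch: --tensor-parallel-size differs from the nvidia.com/gpu limit. If it is larger, the engine fails or hangs at startup waiting for GPUs it cannot see; if smaller, the extra GPUs sit idle.
  • /dev/shm too small: TP shards fail NCCL shared-memory setup. Raise the emptyDir sizeLimit (memory-backed volumes count against the pod's memory) rather than relying on the container runtime's small default.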
  • Scale-down thrash: cold starts re-load weights repeatedly; lengthen scaleDown.stabilizationWindowSeconds.
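
Two of the failure modes above, TP mismatch and a missing /dev/shm mount, can be caught before apply with a small lint over the container spec. The sketch below works on an already-parsed manifest (a plain dict, for example from a YAML loader); the function names are hypothetical, and it only recognizes the long --tensor-parallel-size flag, not any short alias.

```python
def tp_size(args: list[str]) -> int:
    """Read --tensor-parallel-size from container args; vLLM defaults to 1."""
    for i, arg in enumerate(args):
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
        if arg == "--tensor-parallel-size" and i + 1 < len(args):
            return int(args[i + 1])
    return 1


def check_container(container: dict) -> list[str]:
    """Return human-readable problems found in one container spec."""
    problems = []
    tp = tp_size(container.get("args", []))
    limits = container.get("resources", {}).get("limits", {})
    gpus = int(limits.get("nvidia.com/gpu", 0))
    if tp != gpus:
        problems.append(f"--tensor-parallel-size={tp} but nvidia.com/gpu limit={gpus}")
    mounts = {m["mountPath"] for m in container.get("volumeMounts", [])}
    if tp > 1 and "/dev/shm" not in mounts:
        problems.append("tensor parallelism > 1 without a /dev/shm mount")
    return problems


container = {
    "args": ["--model=meta-llama/Llama-3.1-8B-Instruct", "--tensor-parallel-size=2"],
    "resources": {"limits": {"nvidia.com/gpu": 1}},
    "volumeMounts": [],
}
for problem in check_container(container):
    print(problem)
```

Running a check like this in CI against each cookbook's specialization keeps the base manifest's invariants from drifting.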

References

  • vLLM OpenAI-compatible server: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
  • vLLM production metrics (vllm:*): https://docs.vllm.ai/en/latest/usage/metrics.html
  • vLLM engine args (--tensor-parallel-size, --max-model-len, --gpu-memory-utilization): https://docs.vllm.ai/en/latest/serving/engine_args.html
  • Kubernetes HorizontalPodAutoscaler: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
  • HPA with custom/external metrics: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#scaling-on-custom-metrics
  • prometheus-adapter: https://github.com/kubernetes-sigs/prometheus-adapter
  • Prometheus Operator ServiceMonitor: https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.ServiceMonitor
  • Google SRE Workbook — alerting on SLOs: https://sre.google/workbook/alerting-on-slos/

Related: Inference serving · Serving OSS models · SLO/SLI catalog · Inference SLO-breach runbook · QoS & admission control · Workload bring-up recipes · K8s GPU platform · Telemetry · Glossary