
Inference serving & optimization

Scope: serving models in production: the engines, the servers, the optimisations specific to inference (batching, KV cache, quantisation, disaggregation), and autoscaling. This is the run/optimise half of the cluster that the training pages do not cover, and increasingly the larger share of GPU spend.

Overview

Inference is a different regime from training: latency-sensitive, bursty, multi-tenant, and, during decode, memory-bandwidth-bound where training is compute-bound. Serving well means hitting a latency SLO at the highest throughput the hardware allows. The skill is knowing the engine landscape, the two-phase structure of LLM inference that drives every optimisation, and how to scale it on Kubernetes (see Kubernetes for GPUs) without missing the SLO.

Flow: prefill then decode

sequenceDiagram
  participant C as Client
  participant P as Prefill
  participant D as Decode
  C->>P: prompt
  P->>P: build KV cache
  P->>D: hand off KV cache
  loop each output token
    D->>D: generate token
    D-->>C: stream token
  end

Core knowledge

The two phases (this drives everything)

  • Prefill: process the prompt, fill the KV cache. Compute-bound, parallel over prompt tokens.
  • Decode: generate one token at a time. Memory-bandwidth-bound, sequential.
  • Metrics: TTFT (time to first token, dominated by prefill), TPOT/ITL (per-token latency, dominated by decode), and goodput (throughput within SLO). Throughput and latency trade off against each other through batch size.
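
The three metrics above can be computed from per-token arrival timestamps. The following is a minimal sketch, not a benchmark harness: the `RequestTrace` record shape and the function names are hypothetical, and goodput is taken here as requests per second that meet both the TTFT and TPOT SLOs, which is one common reading of "throughput within SLO".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (seconds) for one streamed request. Hypothetical record shape."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: dominated by prefill."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token after the first: dominated by decode."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single-token response has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def goodput(traces: List[RequestTrace], window_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met BOTH SLOs within the measurement window."""
    ok = sum(1 for t in traces
             if ttft(t) <= ttft_slo_s and tpot(t) <= tpot_slo_s)
    return ok / window_s
```

A request that streams quickly but starts late still counts against goodput, which is why raw tokens per second can look healthy while the SLO is being missed.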

KV cache and batching

  • The KV cache is the dominant serving memory consumer after the weights; it grows with batch size × sequence length. PagedAttention (vLLM) manages it in fixed-size blocks (pages) to avoid fragmentation; RadixAttention / prefix caching (SGLang) reuses shared prefixes across requests.
  • Continuous (in-flight) batching: add and retire requests from the running batch every step rather than fixed batches. The single biggest serving throughput win. Assume it is on.
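
To make "grows with batch size × sequence length" concrete, here is a back-of-envelope sizing sketch. It assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per KV head per layer; it ignores paging granularity, and compressed-KV schemes would need a different per-token formula. The function names and the example configuration are illustrative, not taken from any engine.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = one K plus one V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(batch_size: int, seq_len: int, **cfg) -> int:
    """Total KV cache for `batch_size` sequences of `seq_len` tokens each."""
    return batch_size * seq_len * kv_bytes_per_token(**cfg)


def max_concurrent_sequences(kv_budget_bytes: int, seq_len: int, **cfg) -> int:
    """How many full-length sequences fit in a given KV memory budget."""
    return kv_budget_bytes // (seq_len * kv_bytes_per_token(**cfg))


# Illustrative config: 32 layers, 32 KV heads of dimension 128, 16-bit cache.
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128, bytes_per_elem=2)
GIB = 1024 ** 3
per_token = kv_bytes_per_token(**cfg)
total_gib = kv_cache_bytes(16, 4096, **cfg) / GIB
fit = max_concurrent_sequences(40 * GIB, 4096, **cfg)
```

For this configuration each token costs 524,288 bytes (0.5 MiB), a batch of 16 sequences at 4,096 tokens needs 32 GiB, and a 40 GiB KV budget holds 20 such sequences. Grouped-query attention shrinks `num_kv_heads` and an 8-bit cache halves `bytes_per_elem`, which is why both matter so much for concurrency.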

Engines (as of mid-2026)

  • vLLM: PagedAttention, continuous batching, broad model coverage, fastest to production, best cost-efficiency. The community default.
  • SGLang: RadixAttention/prefix caching and structured generation; throughput sits between vLLM and TensorRT-LLM, strong on prefix-heavy and structured workloads.
  • TensorRT-LLM: compiled engines, lowest latency and highest throughput on NVIDIA, at the cost of a compilation step and less runtime flexibility.
  • Others: TGI (Hugging Face), LMDeploy, llama.cpp/Ollama (local/edge).

vLLM cookbook entrypoints

The concept page stops at engine and SLO mechanics; the model-specific cookbooks carry the runnable commands and manifests:

| Model family | Cookbook | Use when |
|---|---|---|
| DeepSeek-R1 | Serve DeepSeek-R1 with vLLM | reasoning-heavy math/code/agent tasks |
| Kimi K2 | Serve Kimi K2 with vLLM | 16-GPU agentic/tool-calling deployments |
| GLM-5.2-FP8 | Serve GLM-5.2 with vLLM | flagship long-horizon agentic coding at up to 1M context |
| GLM-4.7-FP8 | Serve GLM-4.7-FP8 with vLLM | coding agents and MIT-licensed tool workflows |
| Qwen3-235B-A22B | Serve Qwen3-235B-A22B with vLLM | Apache-2.0 general-purpose 256K context serving |
| Llama 4 Maverick | Serve Llama 4 Maverick with vLLM | multimodal image+text serving with Llama ecosystem tooling |

Minimal single-node shape, before applying a model-specific cookbook:

vllm serve <hf-model-id> \
  --served-model-name <public-model-name> \
  --tensor-parallel-size <gpus-per-replica> \
  --max-model-len <route-context-limit> \
  --gpu-memory-utilization 0.85

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<public-model-name>",
    "messages": [{"role": "user", "content": "Return one sentence."}],
    "max_tokens": 64
  }'

Do not copy that generic command to production unchanged. The large current MoE models need model-specific parser flags, context caps, expert-parallel layouts, gated-token handling, or multi-node launch shapes.

NVIDIA Dynamo (1.0 GA, GTC 2026)

  • A datacenter-scale distributed inference framework that orchestrates the engines above (vLLM, SGLang, TensorRT-LLM) rather than replacing them. It adds disaggregated serving (prefill and decode on separate GPU pools, scaled independently), KV-aware routing to avoid recompute, and KV cache offload to cheaper memory/storage tiers. Grove automates placement on rack-scale NVLink systems like GB300 NVL72 (Blackwell Ultra). NVIDIA reports up to ~7x MoE throughput on GB200 NVL72 with disaggregation plus wide expert parallelism. NIM is the packaged microservice form for enterprise deployment.
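
The idea behind KV-aware routing can be illustrated with a toy scorer: send a request to the worker that already holds the longest matching prefix in its KV cache, discounted by how busy that worker is. This is a conceptual sketch only, not Dynamo's router or API; `pick_worker`, the data shapes, and the `load_penalty` weighting are all hypothetical, and a real router would track cache contents in blocks rather than raw token lists.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt_tokens: Sequence[int],
                cached_prefixes: Dict[str, List[Sequence[int]]],
                load: Dict[str, int],
                load_penalty: float = 1.0) -> str:
    """Choose the worker with the best (reusable prefix - load) score.

    cached_prefixes: worker name -> token sequences assumed resident in its cache.
    load: worker name -> number of in-flight requests.
    """
    def score(worker: str) -> float:
        reuse = max((shared_prefix_len(prompt_tokens, p)
                     for p in cached_prefixes.get(worker, [])), default=0)
        return reuse - load_penalty * load.get(worker, 0)

    # sorted() makes tie-breaking deterministic (alphabetical by worker name).
    return max(sorted(load), key=score)
```

Every prefix token served from cache is a token of prefill that does not have to be recomputed, so the score trades saved prefill work against queueing on a busy worker.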

Serving optimisations

  • Quantisation: FP8/INT8/INT4, AWQ/GPTQ, and NVFP4 on Blackwell. Validate accuracy, do not quantise blindly.
  • Speculative decoding: a small draft model proposes several tokens and the target model verifies them in one parallel forward pass, so accepted tokens cost fewer target forward passes than generating them one at a time.
  • Chunked prefill, prefix caching, multi-LoRA serving, and disaggregated prefill/decode for the largest deployments.
  • Chunked prefill does not reduce a prompt's total attention compute. Splitting prefill into chunks lets the scheduler interleave it with decode in the same batch, balancing compute-bound prefill against memory-bound decode. That smooths per-iteration latency (at the cost of a slightly higher TTFT for the chunked prompt) and improves throughput and GPU utilisation, but the total attention FLOPs are unchanged: causal prefill self-attention still takes N(N+1)/2 QK dot products, O(N²), and the per-chunk counts sum to the same total. Only a smaller effective context or local/sparse attention reduces the total. [1]
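
The chunked-prefill claim can be checked with a few lines of arithmetic. The sketch below counts causal query-key dot products for an N-token prompt, once in a single pass and once chunk by chunk, where each chunk attends to all previously cached tokens plus itself. The function names are illustrative; it counts QK pairs only and ignores heads, layers, and the value-side matmul, all of which scale the total by the same factor either way.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """QK dot products for one-shot causal prefill: token i attends to i keys."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_attention_pairs(n_tokens: int, chunk_size: int) -> int:
    """Same count when the prompt is prefetched in chunks of `chunk_size`."""
    total = 0
    done = 0  # tokens whose K/V are already in the cache
    while done < n_tokens:
        c = min(chunk_size, n_tokens - done)
        # c new queries each see `done` cached keys, plus causal attention
        # among the c tokens inside the chunk itself.
        total += c * done + c * (c + 1) // 2
        done += c
    return total


n = 20_000
one_shot = causal_attention_pairs(n)
by_chunk = {size: chunked_attention_pairs(n, size) for size in (512, 2048, 8192)}
```

For a 20,000-token prompt both paths give 200,010,000 pairs, roughly the 200M figure cited in the footnote, whatever the chunk size. What chunking changes is when that work is scheduled, not how much of it there is.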

Orchestration and autoscaling

  • Triton Inference Server (multi-framework, dynamic batching, model ensembles), KServe (K8s-native serving, autoscaling incl. scale-to-zero), Ray Serve, the NIM Operator.
  • Autoscale on queue depth / SLO / GPU signals. Watch the cold-start problem: scale-to-zero plus a multi-minute model load misses latency SLOs on the first request.
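
A queue-depth autoscaling rule and the cold-start check can be sketched as follows. The ratio rule mirrors the one documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)); the function names, the default bounds, and the choice of per-replica queue depth as the metric are assumptions for illustration, not the configuration of any particular serving stack.

```python
import math


def desired_replicas(current_replicas: int,
                     observed_queue_per_replica: float,
                     target_queue_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 16) -> int:
    """HPA-style ratio rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas
                    * observed_queue_per_replica / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, raw))


def cold_start_breaks_slo(model_load_s: float, ttft_slo_s: float) -> bool:
    """True if a request that triggers scale-from-zero cannot meet the TTFT SLO."""
    return model_load_s > ttft_slo_s
```

With 4 replicas each seeing 30 queued requests against a target of 10, the rule asks for 12 replicas. If loading the model takes minutes and the TTFT SLO is seconds, `cold_start_breaks_slo` is true, which argues for keeping `min_replicas` at 1 or more on latency-sensitive routes.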

Don't-miss checklist

  • Measure TTFT and TPOT against an SLO and report goodput, not raw throughput.
  • Continuous batching on; size KV cache / max sequences to the real memory budget (reliability and RAS).
  • Pick the engine by constraint: vLLM (flexible default), TensorRT-LLM (lowest latency, can absorb a compile), SGLang (prefix-heavy/structured).
  • Use Dynamo to orchestrate multi-node/disaggregated serving at rack scale, not to replace the engine.
  • Quantise with measured accuracy; keep an eval gate.

Failure modes

  • KV-cache exhaustion under load (max sequences/length set too high), causing OOM or preemption thrash.
  • Throughput tuned at the cost of the TTFT SLO (batches too large).
  • Scale-to-zero plus slow model load: cold-start SLO misses.
  • One compiled engine forced onto a churning, multi-model dev endpoint.
  • Disaggregation added before scale justifies it: complexity with no payoff.

Open questions & validation

  • Deploy a model-specific vLLM cookbook, benchmark TTFT/TPOT, and size the KV cache to the memory budget (open-weight serving).
  • Dynamo disaggregated serving on a multi-node / NVL72 setup.
  • A quantisation accuracy/throughput trade-off measured on a real model.

References

  • vLLM: https://docs.vllm.ai/en/latest/
  • SGLang: https://docs.sglang.ai/ · TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/
  • NVIDIA Dynamo: https://docs.dynamo.nvidia.com/ · repo: https://github.com/ai-dynamo/dynamo
  • Triton Inference Server: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
  • KServe: https://kserve.github.io/website/

Related: Quantization for inference · Open-weight serving · LLM request routing · Platform · Kubernetes · Observability · Optimization · FMware performance engineering · Cloud & Cost · Glossary


  1. Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 16 "Dynamic Batching, Scheduling, and Routing". Table 16-5 shows that a 20K-token prompt costs 200M attention ops regardless of chunk size: "Chunking does not reduce this total... they do not change the total amount of attention work." See also the vLLM docs, Optimization and Tuning (https://docs.vllm.ai/en/latest/configuration/optimization.html): chunked prefill "allows vLLM to process large prefills in smaller chunks and batch them together with decode requests," a scheduling optimisation that balances compute-bound and memory-bound work.