Inference serving & optimization
Scope: serving models in production, covering the engines, the servers, the inference-specific optimisations (batching, KV cache, quantisation, disaggregation), and autoscaling. This is the run/optimise half of the cluster that the training pages do not cover, and increasingly the larger share of GPU spend.
Overview
Inference is a different regime from training: latency-sensitive, bursty, multi-tenant, and, in its decode phase, memory-bandwidth-bound where training is compute-bound. Serving well means hitting a latency SLO at the highest throughput the hardware allows. The skill is knowing the engine landscape, the two-phase structure of LLM inference that drives every optimisation, and how to scale it on Kubernetes (see Kubernetes for GPUs) without missing the SLO.
Flow: prefill then decode
sequenceDiagram
participant C as Client
participant P as Prefill
participant D as Decode
C->>P: prompt
P->>P: build KV cache
P->>D: hand off KV cache
loop each output token
D->>D: generate token
D-->>C: stream token
end
Core knowledge
The two phases (this drives everything)
- Prefill: process the prompt, fill the KV cache. Compute-bound, parallel over prompt tokens.
- Decode: generate one token at a time. Memory-bandwidth-bound, sequential.
- Metrics: TTFT (time to first token, dominated by prefill), TPOT/ITL (time per output token, also called inter-token latency, dominated by decode), and goodput (throughput delivered within the SLO). Throughput and latency trade off against each other through batch size: larger batches raise throughput but lengthen each step.
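Why decode is memory-bandwidth-bound can be made concrete with a back-of-envelope bound. The sketch below assumes a dense model whose weights must all be streamed from HBM once per decode step, and it ignores KV-cache reads and kernel overheads, so real TPOT is higher. The function names and the example numbers (70e9 one-byte parameters, 3.35e12 bytes/s of memory bandwidth) are illustrative assumptions, not measurements.

```python
def decode_tpot_floor_ms(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency, in milliseconds.

    Every decode step reads all weights from HBM at least once, so a step
    cannot finish faster than weight_bytes / bandwidth.
    """
    if weight_bytes <= 0 or hbm_bytes_per_s <= 0:
        raise ValueError("weight_bytes and hbm_bytes_per_s must be positive")
    return weight_bytes / hbm_bytes_per_s * 1e3


def decode_tokens_per_s(batch_size: int, tpot_ms: float) -> float:
    """Aggregate decode throughput when batch_size sequences share each step."""
    return batch_size * 1e3 / tpot_ms


# Assumed example: 70e9 parameters at 1 byte each, 3.35e12 bytes/s bandwidth.
floor_ms = decode_tpot_floor_ms(70e9 * 1, 3.35e12)  # about 20.9 ms per token
single = decode_tokens_per_s(1, floor_ms)
batched = decode_tokens_per_s(32, floor_ms)
print(f"TPOT floor {floor_ms:.1f} ms, {single:.0f} tok/s at batch 1, "
      f"{batched:.0f} tok/s at batch 32 (upper bounds)")
```

The weight read is shared by every sequence in the batch, which is why batching raises throughput so sharply while each step stays bandwidth-limited.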
KV cache and batching
- The KV cache is the dominant serving memory consumer; it grows with batch size × sequence length. PagedAttention (vLLM) allocates it in fixed-size pages to avoid fragmentation; RadixAttention / prefix caching (SGLang) reuses shared prefixes across requests.
- Continuous (in-flight) batching: admit new requests into the running batch and retire finished ones at every step, rather than waiting for a fixed batch to drain. The single biggest serving throughput win. Assume it is on.
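The "batch size × sequence length" growth can be turned into a sizing estimate. The sketch below assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per KV head per layer; architectures that compress the cache (for example latent-attention variants) need a different formula. `ModelShape` and the example dimensions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    num_kv_heads: int    # KV heads; fewer than query heads under GQA
    head_dim: int
    kv_dtype_bytes: int  # 2 for FP16/BF16, 1 for FP8


def kv_bytes_per_token(m: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per head per layer.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.kv_dtype_bytes


def max_cached_tokens(m: ModelShape, kv_budget_bytes: int) -> int:
    """Tokens that fit in the memory left over after weights and activations."""
    return kv_budget_bytes // kv_bytes_per_token(m)


def max_concurrent_seqs(m: ModelShape, kv_budget_bytes: int, tokens_per_seq: int) -> int:
    """Worst-case concurrent sequences if each one reaches tokens_per_seq."""
    return max_cached_tokens(m, kv_budget_bytes) // tokens_per_seq


# Hypothetical 80-layer GQA model with a BF16 cache and a 40 GiB KV budget.
shape = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128, kv_dtype_bytes=2)
budget = 40 * 2**30
print(kv_bytes_per_token(shape))                 # 327680 bytes per token
print(max_cached_tokens(shape, budget))          # 131072 tokens
print(max_concurrent_seqs(shape, budget, 8192))  # 16 sequences
```

Sizing against the worst case is conservative; paged allocation lets an engine admit more sequences when most of them stay short.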
Engines (as of mid-2026)
- vLLM: PagedAttention, continuous batching, and broad model coverage; usually the fastest route to production and a strong cost-efficiency baseline. The community default.
- SGLang: RadixAttention/prefix caching and structured generation; throughput sits between vLLM and TensorRT-LLM, strong on prefix-heavy and structured workloads.
- TensorRT-LLM: compiled engines, lowest latency and highest throughput on NVIDIA, at the cost of a compilation step and less runtime flexibility.
- Others: TGI (Hugging Face), LMDeploy, llama.cpp/Ollama (local/edge).
vLLM cookbook entrypoints
This concept page stops at engine and SLO mechanics; the model-specific cookbooks carry the runnable commands and manifests:
| Model family | Cookbook | Use when |
|---|---|---|
| DeepSeek-R1 | Serve DeepSeek-R1 with vLLM | reasoning-heavy math/code/agent tasks |
| Kimi K2 | Serve Kimi K2 with vLLM | 16-GPU agentic/tool-calling deployments |
| GLM-5.2-FP8 | Serve GLM-5.2 with vLLM | flagship long-horizon agentic coding at up to 1M context |
| GLM-4.7-FP8 | Serve GLM-4.7-FP8 with vLLM | coding agents and MIT-licensed tool workflows |
| Qwen3-235B-A22B | Serve Qwen3-235B-A22B with vLLM | Apache-2.0 general-purpose 256K context serving |
| Llama 4 Maverick | Serve Llama 4 Maverick with vLLM | multimodal image+text serving with Llama ecosystem tooling |
Minimal single-node shape, before applying a model-specific cookbook:
vllm serve <hf-model-id> \
--served-model-name <public-model-name> \
--tensor-parallel-size <gpus-per-replica> \
--max-model-len <route-context-limit> \
--gpu-memory-utilization 0.85
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "<public-model-name>",
"messages": [{"role": "user", "content": "Return one sentence."}],
"max_tokens": 64
}'
Do not copy that generic command to production unchanged. The large current MoE models need model-specific parser flags, context caps, expert-parallel layouts, gated-token handling, or multi-node launch shapes.
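To benchmark the endpoint rather than just smoke-test it, the same request can be sent with `"stream": true`, in which case OpenAI-compatible servers return server-sent-event lines of the form `data: {...}` ending with `data: [DONE]`. The sketch below turns timestamped lines into TTFT and a mean inter-token latency. It assumes the caller records an arrival time for each line, and it treats each content chunk as one token, which is only an approximation because a chunk may carry more than one token. The function names are hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def content_delta(sse_line: str) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if it has none."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta") or {}
    return delta.get("content") or None


def stream_latencies(
    request_sent_at: float,
    timed_lines: Iterable[Tuple[float, str]],
) -> Tuple[float, float]:
    """Return (ttft_s, mean_inter_token_s) from (arrival_time, line) pairs."""
    arrivals = [t for t, line in timed_lines if content_delta(line)]
    if not arrivals:
        raise ValueError("stream contained no content chunks")
    ttft = arrivals[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap


# Canned example; in a real run the timestamps come from time.monotonic().
example = [
    (0.30, 'data: {"choices":[{"delta":{"role":"assistant"}}]}'),
    (0.50, 'data: {"choices":[{"delta":{"content":"One"}}]}'),
    (0.55, 'data: {"choices":[{"delta":{"content":" sentence"}}]}'),
    (0.60, 'data: {"choices":[{"delta":{"content":"."}}]}'),
    (0.61, "data: [DONE]"),
]
ttft_s, itl_s = stream_latencies(0.0, example)
print(f"TTFT {ttft_s:.2f} s, mean inter-token latency {itl_s:.3f} s")
```

The role-only first chunk is skipped on purpose, so TTFT reflects the first visible text rather than the first byte on the wire.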
NVIDIA Dynamo (1.0 GA, GTC 2026)
- A datacenter-scale distributed inference framework that orchestrates the engines above (vLLM, SGLang, TensorRT-LLM) rather than replacing them. It adds disaggregated serving (prefill and decode on separate GPU pools, scaled independently), KV-aware routing (requests go to workers that already hold the relevant KV cache, avoiding recompute), and KV cache offload to cheaper memory/storage tiers. Grove automates placement on rack-scale NVLink systems such as GB300 NVL72 (Blackwell generation). NVIDIA reports up to ~7x MoE throughput on GB200 NVL72 with disaggregation plus wide expert parallelism. NIM is the packaged microservice form for enterprise deployment.
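The idea behind KV-aware routing can be pictured with a toy scorer: prefer the worker whose cache already covers the longest prefix of the incoming prompt, and break ties by load. This is an illustrative sketch only, not Dynamo's router or API. All names are hypothetical, and a production router would match at cache-block granularity and weigh overlap against load rather than ranking strictly.

```python
from typing import Dict, List, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(
    prompt: Sequence[int],
    cached: Dict[str, List[Sequence[int]]],
    load: Dict[str, int],
) -> str:
    """Choose the worker with the longest cached prefix, then the lowest load.

    cached maps worker name -> token sequences whose KV is resident there.
    load maps worker name -> number of active requests (every worker listed).
    """
    def score(worker: str):
        best_overlap = max(
            (shared_prefix_len(prompt, seq) for seq in cached.get(worker, [])),
            default=0,
        )
        return (best_overlap, -load[worker])

    return max(load, key=score)


cached = {"a": [[1, 2, 9]], "b": [[1, 2, 3, 4, 7]]}
load = {"a": 0, "b": 5, "c": 0}
# Worker "b" is busier but already holds four of the five prompt tokens.
print(pick_worker([1, 2, 3, 4, 5], cached, load))
```

Ranking overlap strictly ahead of load is the simplest possible policy; it shows the recompute saving but would overload a hot worker in practice.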
Serving optimisations
- Quantisation: FP8/INT8/INT4, AWQ/GPTQ, and NVFP4 on Blackwell. Validate accuracy; do not quantise blindly.
- Speculative decoding: a small draft model proposes several tokens and the target model verifies them in a single parallel forward pass, so accepted tokens cost fewer target forward passes.
- Chunked prefill, prefix caching, multi-LoRA serving, and disaggregated prefill/decode for the largest deployments.
- Chunked prefill does not reduce a prompt's total attention compute. Splitting prefill into chunks lets the scheduler interleave it with decode in the same batch, balancing compute-bound prefill against memory-bound decode. This smooths per-iteration latency for in-flight decodes (at the cost of a slightly higher TTFT for the chunked prompt) and improves throughput and GPU utilisation, but the total attention FLOPs are unchanged: causal prefill self-attention over N tokens still needs N(N+1)/2 QK dot products, which is O(N²), and the per-chunk counts sum to the same total. Only a smaller effective context or local/sparse attention reduces the total. [1]
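The claim that chunking leaves total attention work unchanged can be checked by counting. The sketch below counts causal QK dot products per head per layer, once for a monolithic prefill and once chunk by chunk, where each new token attends to every token already cached plus the causal part of its own chunk. The function names are hypothetical and the count ignores everything except the QK products.

```python
def causal_qk_dot_products(n_tokens: int) -> int:
    """QK dot products for one causal pass over n_tokens: N(N+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_qk_dot_products(n_tokens: int, chunk_size: int) -> int:
    """Same count when the prompt is prefilled chunk_size tokens at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    done = 0  # tokens whose keys are already in the KV cache
    while done < n_tokens:
        c = min(chunk_size, n_tokens - done)
        # Each of the c new tokens attends to all `done` cached tokens,
        # plus a causal triangle inside the chunk itself.
        total += c * done + c * (c + 1) // 2
        done += c
    return total


n = 20_000
print(causal_qk_dot_products(n))  # 200010000, roughly 200M
for chunk in (512, 2048, 8192):
    assert chunked_qk_dot_products(n, chunk) == causal_qk_dot_products(n)
```

What chunking changes is when the work happens: it is spread over several scheduler iterations, which is where the latency smoothing comes from.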
Orchestration and autoscaling
- Triton Inference Server (multi-framework, dynamic batching, model ensembles), KServe (K8s-native serving, autoscaling incl. scale-to-zero), Ray Serve, the NIM Operator.
- Autoscale on queue depth / SLO / GPU signals. Watch the cold-start problem: scale-to-zero plus a multi-minute model load misses latency SLOs on the first request.
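Queue-depth autoscaling and the cold-start trap can be sketched as two small checks. This is a minimal illustration under assumed inputs, not the KServe or Kubernetes autoscaler API: the function names, the per-replica concurrency target, and the example numbers are all hypothetical.

```python
import math


def desired_replicas(
    queued: int,
    in_flight: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed to hold total demand at target_per_replica each."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil((queued + in_flight) / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def cold_start_breaks_slo(model_load_s: float, ttft_slo_s: float, min_replicas: int) -> bool:
    """True when scale-to-zero makes the first request wait past the TTFT SLO."""
    return min_replicas == 0 and model_load_s > ttft_slo_s


# Assumed example: 30 queued + 10 running requests, 8 concurrent per replica.
print(desired_replicas(queued=30, in_flight=10, target_per_replica=8,
                       min_replicas=1, max_replicas=10))  # 5
# A 3-minute model load against a 2-second TTFT SLO with scale-to-zero enabled.
print(cold_start_breaks_slo(model_load_s=180, ttft_slo_s=2, min_replicas=0))  # True
```

Keeping `min_replicas` at one or more trades idle GPU cost for a first request that meets the SLO.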
Don't-miss checklist
- Measure TTFT and TPOT against an SLO and report goodput, not raw throughput.
- Continuous batching on; size KV cache / max sequences to the real memory budget (reliability and RAS).
- Pick the engine by constraint: vLLM (flexible default), TensorRT-LLM (lowest latency, can absorb a compile), SGLang (prefix-heavy/structured).
- Use Dynamo to orchestrate multi-node/disaggregated serving at rack scale, not to replace the engine.
- Quantise with measured accuracy; keep an eval gate.
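The first checklist item, reporting goodput rather than raw throughput, comes down to counting only the requests that met both latency targets. The sketch below assumes per-request TTFT and TPOT have already been measured; `RequestTiming` and the SLO values are hypothetical, and goodput is expressed in requests per second (a token-based variant would sum output tokens instead).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestTiming:
    ttft_s: float  # time to first token
    tpot_s: float  # mean time per output token


def goodput_rps(
    requests: Iterable[RequestTiming],
    window_s: float,
    ttft_slo_s: float,
    tpot_slo_s: float,
) -> float:
    """Requests per second that met BOTH the TTFT and the TPOT SLO."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    within = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return within / window_s


# Four requests completed in a 2-second window; SLO: TTFT 0.5 s, TPOT 50 ms.
timings = [
    RequestTiming(0.3, 0.04),  # within SLO
    RequestTiming(0.6, 0.04),  # TTFT miss
    RequestTiming(0.4, 0.06),  # TPOT miss
    RequestTiming(0.5, 0.05),  # exactly on the limit, counted as within
]
raw_rps = len(timings) / 2.0
good_rps = goodput_rps(timings, window_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05)
print(f"raw {raw_rps} req/s, goodput {good_rps} req/s")  # raw 2.0, goodput 1.0
```

A tuning change that raises raw throughput while lowering goodput is a regression under this definition.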
Failure modes
- KV-cache OOM under load (max sequences/length set too high), causing preemption thrash.
- Throughput tuned at the cost of the TTFT SLO (batches too large).
- Scale-to-zero plus slow model load: cold-start SLO misses.
- One compiled engine forced onto a churning, multi-model dev endpoint.
- Disaggregation added before scale justifies it: complexity with no payoff.
Open questions & validation
- Deploy a model-specific vLLM cookbook, benchmark TTFT/TPOT, and size the KV cache to the memory budget (open-weight serving).
- Dynamo disaggregated serving on a multi-node / NVL72 setup.
- A quantisation accuracy/throughput trade-off measured on a real model.
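For the quantisation trade-off, an eval gate can be as simple as comparing per-task scores against the unquantised baseline with a fixed tolerance. The sketch below assumes accuracy-style scores in [0, 1] produced by whatever eval harness is already in use; the function name, task names, and the 0.02 tolerance are hypothetical.

```python
from typing import Dict, Tuple


def passes_eval_gate(
    baseline: Dict[str, float],
    quantised: Dict[str, float],
    max_drop: float,
) -> Tuple[bool, Dict[str, float]]:
    """Return (passed, regressions) where regressions maps task -> score drop.

    A task regresses when its quantised score falls more than max_drop
    below the baseline. Every baseline task must be present in quantised.
    """
    missing = set(baseline) - set(quantised)
    if missing:
        raise ValueError(f"quantised run is missing tasks: {sorted(missing)}")
    regressions = {
        task: baseline[task] - quantised[task]
        for task in baseline
        if baseline[task] - quantised[task] > max_drop
    }
    return (not regressions, regressions)


# Hypothetical scores: the code task drops by about 0.04, beyond the tolerance.
ok, regressions = passes_eval_gate(
    baseline={"math": 0.80, "code": 0.70},
    quantised={"math": 0.79, "code": 0.66},
    max_drop=0.02,
)
print(ok, sorted(regressions))  # False ['code']
```

Gating per task rather than on an average keeps one badly degraded capability from hiding behind the others.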
References
- vLLM: https://docs.vllm.ai/en/latest/
- SGLang: https://docs.sglang.ai/ · TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/
- NVIDIA Dynamo: https://docs.dynamo.nvidia.com/ · repo: https://github.com/ai-dynamo/dynamo
- Triton Inference Server: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
- KServe: https://kserve.github.io/website/
Related: Quantization for inference · Open-weight serving · LLM request routing · Platform · Kubernetes · Observability · Optimization · FMware performance engineering · Cloud & Cost · Glossary
[1] Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 16 "Dynamic Batching, Scheduling, and Routing". Table 16-5 shows that a 20K-token prompt costs 200M attention ops regardless of chunk size: "Chunking does not reduce this total... they do not change the total amount of attention work." vLLM docs, Optimization and Tuning (https://docs.vllm.ai/en/latest/configuration/optimization.html): chunked prefill "allows vLLM to process large prefills in smaller chunks and batch them together with decode requests," a scheduling optimisation that balances compute-bound and memory-bound work.