
Inference serving & optimization¶

Scope: serving models in production, covering the engines, the servers, the optimisations specific to inference (batching, KV cache, quantisation, disaggregation), and autoscaling. This is the run/optimise half of the cluster that the training pages do not cover, and it is increasingly the larger share of GPU spend.

Overview¶

Inference is a different regime from training: latency-sensitive, bursty, multi-tenant, and (during decode) memory-bandwidth-bound where training is compute-bound. Serving well means hitting a latency SLO at the highest throughput the hardware allows. The skill is knowing the engine landscape, the two-phase structure of LLM inference that drives every optimisation, and how to scale it on Kubernetes (Kubernetes for GPUs) without missing the SLO.

Flow: prefill then decode

sequenceDiagram
  participant C as Client
  participant P as Prefill
  participant D as Decode
  C->>P: prompt
  P->>P: build KV cache
  P->>D: hand off KV cache
  loop each output token
    D->>D: generate token
    D-->>C: stream token
  end

Core knowledge¶

The two phases (this drives everything)¶

Prefill: process the prompt, fill the KV cache. Compute-bound, parallel over prompt tokens.
Decode: generate one token at a time. Memory-bandwidth-bound, sequential.
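Metrics: TTFT (time to first token, dominated by prefill), TPOT/ITL (time per output token / inter-token latency, dominated by decode), and goodput (throughput that stays within the SLO). Throughput and latency trade off against each other through batch size: larger batches raise throughput but lengthen each request's latency.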
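The metrics above can be computed from per-request timestamps. The following is a minimal sketch, not a benchmark harness: `RequestTrace` and its field names are hypothetical, and goodput is taken here to mean requests per second that meet both a TTFT and a TPOT SLO, which is one common reading of "throughput within SLO".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (illustrative structure)."""
    sent_at: float
    token_times: List[float]  # arrival time of each streamed output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean gap between output tokens after the first (decode phase only)."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def goodput(traces: List[RequestTrace], window_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met both SLOs during the window."""
    ok = sum(1 for t in traces
             if ttft(t) <= ttft_slo_s and tpot(t) <= tpot_slo_s)
    return ok / window_s
```

A request that streams quickly but starts late fails the TTFT check and is excluded from goodput even though it still counts towards raw throughput.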

KV cache and batching¶

The KV cache is the dominant serving memory consumer; it grows with batch size × sequence length. PagedAttention (vLLM) manages it in fixed-size pages to avoid fragmentation; RadixAttention / prefix caching (SGLang) reuses shared prefixes across requests.
Continuous (in-flight) batching: add requests to, and retire them from, the running batch at every step rather than waiting for a fixed batch to finish. The single biggest serving throughput win. Assume it is on.
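To see why continuous batching matters, a toy decode-step count is enough. This sketch is a deliberate simplification under stated assumptions: every request needs at least one output token, prefill cost and KV-cache limits are ignored, and the function names are illustrative rather than any engine's scheduler API.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], max_batch: int) -> int:
    """Decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(output_lens), max_batch):
        steps += max(output_lens[i:i + max_batch])
    return steps


def continuous_batching_steps(output_lens: List[int], max_batch: int) -> int:
    """Decode steps when finished requests are retired and replaced every step."""
    waiting = deque(output_lens)
    running: List[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the step runs.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step emits one token per running request;
        # requests with nothing left are dropped from the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps
```

With a mix of long and short outputs, the static scheduler keeps idle slots waiting on the longest request in each batch, while the continuous scheduler refills them immediately.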

Engines (as of mid-2026)¶

vLLM: PagedAttention, continuous batching, broad model coverage, fastest to production, best cost-efficiency. The community default.
SGLang: RadixAttention/prefix caching and structured generation; throughput sits between vLLM and TensorRT-LLM, strong on prefix-heavy and structured workloads.
TensorRT-LLM: compiled engines, lowest latency and highest throughput on NVIDIA, at the cost of a compilation step and less runtime flexibility.
Others: TGI (Hugging Face), LMDeploy, llama.cpp/Ollama (local/edge).

vLLM cookbook entrypoints¶

This concept page stops at engine and SLO mechanics; the model-specific cookbooks carry the runnable commands and manifests:

Model family	Cookbook	Use when
DeepSeek-R1	Serve DeepSeek-R1 with vLLM	reasoning-heavy math/code/agent tasks
Kimi K2	Serve Kimi K2 with vLLM	16-GPU agentic/tool-calling deployments
GLM-5.2-FP8	Serve GLM-5.2 with vLLM	flagship long-horizon agentic coding at up to 1M context
GLM-4.7-FP8	Serve GLM-4.7-FP8 with vLLM	coding agents and MIT-licensed tool workflows
Qwen3-235B-A22B	Serve Qwen3-235B-A22B with vLLM	Apache-2.0 general-purpose 256K context serving
Llama 4 Maverick	Serve Llama 4 Maverick with vLLM	multimodal image+text serving with Llama ecosystem tooling

Minimal single-node shape, before applying a model-specific cookbook:

vllm serve <hf-model-id> \
  --served-model-name <public-model-name> \
  --tensor-parallel-size <gpus-per-replica> \
  --max-model-len <route-context-limit> \
  --gpu-memory-utilization 0.85

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<public-model-name>",
    "messages": [{"role": "user", "content": "Return one sentence."}],
    "max_tokens": 64
  }'

Do not copy that generic command to production unchanged. The large current MoE models need model-specific parser flags, context caps, expert-parallel layouts, gated-token handling, or multi-node launch shapes.
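For scripted smoke tests, the curl request above can be reproduced with the Python standard library. This is a minimal sketch: `build_chat_request` and `send` are hypothetical helper names, and the base URL, model name, and timeout are placeholders to replace with your own values. It assumes the server exposes the same `/v1/chat/completions` route the curl example calls.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 64) -> urllib.request.Request:
    """Build the same POST as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches a live endpoint.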

NVIDIA Dynamo (1.0 GA, GTC 2026)¶

A datacenter-scale distributed inference framework that orchestrates the engines above (vLLM, SGLang, TensorRT-LLM) rather than replacing them. It adds disaggregated serving (prefill and decode on separate GPU pools, scaled independently), KV-aware routing to avoid recompute, and KV cache offload to cheaper memory/storage tiers. Grove automates placement on rack-scale NVLink systems like GB300 NVL72 (the Blackwell platform). NVIDIA reports up to ~7x MoE throughput on GB200 NVL72 with disaggregation plus wide expert parallelism. NIM is the packaged microservice form for enterprise deployment.

Serving optimisations¶

Quantisation: FP8/INT8/INT4, AWQ/GPTQ, and NVFP4 on Blackwell. Validate accuracy, do not quantise blindly.
Speculative decoding: a small draft model proposes several tokens and the target model verifies them in a single parallel pass, so each generated token needs fewer target forward passes.
Chunked prefill, prefix caching, multi-LoRA serving, and disaggregated prefill/decode for the largest deployments.
Chunked prefill does not reduce a prompt's total attention compute. Splitting prefill into chunks interleaves it with decode in the same batch, balancing compute-bound prefill against memory-bound decode. This smooths per-iteration latency (at the cost of slightly higher TTFT) and improves throughput and GPU utilisation, but the total attention FLOPs are unchanged: causal prefill self-attention over N tokens still needs N(N+1)/2 QK dot products, which is O(N²), and the per-chunk counts sum to the same figure. Only a smaller effective context or local/sparse attention reduces the total. ¹
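The chunked-prefill claim can be checked by counting QK dot products directly. The sketch below assumes causal attention and counts per head per layer; the function names are illustrative. Each chunk attends to every token already in the KV cache plus the causal triangle inside the chunk itself.

```python
def full_prefill_dot_products(n_tokens: int) -> int:
    """Causal self-attention: token i attends to i keys, so the sum is N(N+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_prefill_dot_products(n_tokens: int, chunk_size: int) -> int:
    """Count the same work chunk by chunk, including attention to the cached prefix."""
    total = 0
    for start in range(0, n_tokens, chunk_size):
        end = min(start + chunk_size, n_tokens)
        c = end - start
        # Each of the c query tokens sees `start` cached keys,
        # plus the causal triangle within its own chunk.
        total += c * start + c * (c + 1) // 2
    return total
```

For a 20,000-token prompt both functions return 200,010,000, roughly the 200M figure cited in the footnote, whatever chunk size is chosen. Chunking changes when the work is scheduled, not how much of it there is.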

Orchestration and autoscaling¶

Triton Inference Server (multi-framework, dynamic batching, model ensembles), KServe (K8s-native serving, autoscaling incl. scale-to-zero), Ray Serve, the NIM Operator.
Autoscale on queue depth / SLO / GPU signals. Watch the cold-start problem: scale-to-zero plus a multi-minute model load misses latency SLOs on the first request.
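The queue-depth signal and the cold-start risk can be sketched as two small functions. This is an illustration of the reasoning only, not the KServe or any autoscaler API: the function names, the per-replica queue target, and the replica bounds are all hypothetical.

```python
import math


def desired_replicas(queue_depth: int, target_queue_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Replicas needed to keep the per-replica queue at or below target, clamped."""
    if target_queue_per_replica <= 0:
        raise ValueError("target_queue_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_queue_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def cold_start_breaks_slo(model_load_s: float, ttft_slo_s: float,
                          min_replicas: int) -> bool:
    """With scale-to-zero, the first request waits for the full model load."""
    return min_replicas == 0 and model_load_s > ttft_slo_s
```

Setting `min_replicas` to at least 1 trades idle GPU cost for a first request that never waits on a multi-minute model load.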

Don't-miss checklist¶

Measure TTFT and TPOT against an SLO and report goodput, not raw throughput.
Continuous batching on; size KV cache / max sequences to the real memory budget (reliability and RAS).
Pick the engine by constraint: vLLM (flexible default), TensorRT-LLM (lowest latency, can absorb a compile), SGLang (prefix-heavy/structured).
Use Dynamo to orchestrate multi-node/disaggregated serving at rack scale, not to replace the engine.
Quantise with measured accuracy; keep an eval gate.
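For the KV-cache sizing item above, a back-of-the-envelope budget helps before any load test. This sketch assumes standard multi-head or grouped-query attention, where each token stores one key and one value vector per layer per KV head; models with compressed KV layouts will differ. It ignores activation memory and other runtime overhead, so treat the result as an upper bound. All names and the example model shape are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """K and V for one token: 2 * layers * kv_heads * head_dim * element size."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(gpu_mem_bytes: int, utilization: float,
                             weight_bytes: int, seq_len: int,
                             per_token_bytes: int) -> int:
    """Full-length sequences that fit in the memory left after the weights."""
    budget = int(gpu_mem_bytes * utilization) - weight_bytes
    if budget <= 0:
        return 0
    return budget // (seq_len * per_token_bytes)


GIB = 1 << 30
# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_token = kv_bytes_per_token(32, 8, 128)  # 131072 bytes = 128 KiB per token
fits = max_concurrent_sequences(80 * GIB, 0.85, 15 * GIB + GIB // 2,
                                seq_len=8192, per_token_bytes=per_token)
```

In this example each 8,192-token sequence needs 1 GiB of cache, so a maximum-sequences setting far above `fits` would invite the preemption thrash described under failure modes.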

Failure modes¶

KV-cache OOM under load (max sequences/length set too high), causing preemption thrash.
Throughput tuned at the cost of the TTFT SLO (batches too large).
Scale-to-zero plus slow model load: cold-start SLO misses.
One compiled engine forced onto a churning, multi-model dev endpoint.
Disaggregation added before scale justifies it: complexity with no payoff.

Open questions & validation¶

Deploy a model-specific vLLM cookbook, benchmark TTFT/TPOT, and size the KV cache to the memory budget (open-weight serving).
Dynamo disaggregated serving on a multi-node / NVL72 setup.
A quantisation accuracy/throughput trade-off measured on a real model.

References¶

vLLM: https://docs.vllm.ai/en/latest/
SGLang: https://docs.sglang.ai/ · TensorRT-LLM: https://nvidia.github.io/TensorRT-LLM/
NVIDIA Dynamo: https://docs.dynamo.nvidia.com/ · repo: https://github.com/ai-dynamo/dynamo
Triton Inference Server: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
KServe: https://kserve.github.io/website/

¹ Chris Fregly, AI Systems Performance Engineering (O'Reilly), Ch. 16 "Dynamic Batching, Scheduling, and Routing", Table 16-5, which shows that a 20K-token prompt costs 200M attention ops regardless of chunk size: "Chunking does not reduce this total... they do not change the total amount of attention work." vLLM docs, Optimization and Tuning (https://docs.vllm.ai/en/latest/configuration/optimization.html): chunked prefill "allows vLLM to process large prefills in smaller chunks and batch them together with decode requests," a scheduling optimisation that balances compute-bound and memory-bound work.