Markdown

Serving open-weight models¶

Scope: model-selection / decision overview and index for standing up the current open-weight flagships: DeepSeek, Kimi K2, GLM, Qwen, Llama. It frames the trade-offs and sizing, then delegates the detailed HOW to one vLLM cookbook per model family (parallelism sizing, quantisation, parser flags, model-specific gotchas) and a generic deploy recipe. The applied layer over inference serving, feeding disaggregation (disaggregated inference).

Specs are mid-2026 and move fast; the model families iterate (point releases monthly). Verify the exact checkpoint, context, and licence on the model card before deploying. GPU counts below are first-order from parameter math, not vendor-tested. Confirm with the engine's recipe page.

flowchart LR
  MODEL["Model architecture"] --> MEMORY["Weights and KV memory"]
  MEMORY --> PRECISION["Precision or quantization"]
  PRECISION --> PARALLEL["TP, EP, PP layout"]
  PARALLEL --> ENGINE["vLLM or SGLang"]
  ENGINE --> SLO["TTFT, TPOT, goodput"]

Focused pages¶

This is the decision/index layer. To get started, read the paradigm below, pick a model from the cheat-sheet, then open its cookbook for the concrete vllm serve command and gotchas.

Serve DeepSeek-R1 with vLLM. Use this when you want an MIT reasoning endpoint (671B/37B MoE, EP-sensitive, MLA).
Serve Kimi K2 with vLLM. Use this for an agentic/tool-calling endpoint (1T/32B MoE, MLA, 16-GPU smallest 128K FP8 shape).
Serve GLM-5.2 with vLLM. Use this for the flagship open-weight long-horizon coding/agent endpoint (~744B/40B MoE, Dynamic Sparse Attention + 1M context, MTP, MIT).
Serve GLM-4.7-FP8 with vLLM. Use this for a coding/agent endpoint that needs GLM tool and reasoning parser flags.
Serve Qwen3-235B-A22B with vLLM. Use this as the Apache-2.0 general default (256K native context, optional ~1M route).
Serve Llama 4 Maverick with vLLM. Use this when you need a gated multimodal text+image endpoint.
vLLM inference deployment recipe. Use this for the engine-agnostic deploy mechanics (Deployment/Service/HPA, rollout, autoscaling) once a model-specific launch shape is validated.

Paradigm: size from the architecture¶

Memory is driven by total params; compute by active params. A 1T-param MoE with 32B active needs ~1T-params of HBM but runs at ~32B-dense compute. So big MoE models are memory-bound to fit, cheap to run.
Weights memory ≈ params × bytes/param: FP8/INT8 ≈ 1 B/param, BF16 ≈ 2, INT4 ≈ 0.5. Add KV cache (grows with batch × context) and activation overhead (~15-20%).
Parallelism: TP within a node (NVLink), EP (expert parallel) for MoE to spread experts, PP across nodes only when a model exceeds one node. Keep TP/EP inside the NVLink domain (networking fabric, distributed training).
MLA (Multi-head Latent Attention, DeepSeek/Kimi) compresses the KV cache ~an order of magnitude, far larger batches/context per GB.

Model cheat-sheet (mid-2026)¶

Model	Total / active	Context	Notable	Cookbook
DeepSeek-R1	671B / 37B MoE	128K	MIT, reasoning, DeepSeek-V3 architecture, EP-sensitive	DeepSeek-R1 with vLLM
Kimi K2 Instruct	1T / 32B MoE	128K	modified MIT, MLA, agentic/tool calling, 16-GPU smallest 128K FP8 shape	Kimi K2 with vLLM
GLM-5.2-FP8	~744B / ~40B MoE	up to 1M (128K default)	MIT, flagship agentic coding, DSA + IndexShare, MTP, glm47/glm45 parsers	GLM-5.2 with vLLM
GLM-4.7-FP8	358B / ~32B MoE	vendor route-specific	MIT, coding/agentic, GLM tool and reasoning parsers	GLM-4.7-FP8 with vLLM
Qwen3-235B-A22B-Instruct-2507	235B / 22B MoE	256K native; optional ~1M route	Apache-2.0, non-thinking instruct, strong general default	Qwen3-235B with vLLM
Llama 4 Maverick FP8	400B / 17B MoE	1M on model card	gated, custom Llama 4 license, multimodal text+image	Llama 4 Maverick with vLLM

Recipes¶

Each per-model cookbook (linked under Focused pages) is a standalone WHAT / WHY / WHEN / HOW page with a concrete vllm serve command, Kubernetes or multi-node launch shape, smoke test, production knobs, failure modes, and references. Pick a model from the cheat-sheet, open its cookbook for the validated launch shape, then wrap the stable endpoint behind the engine-agnostic vLLM inference deployment recipe (or the broader Deployment/Service/HPA / KServe patterns in workload recipes) only after the model-specific shape is validated.

Networking & hardware optimisations (for these models)¶

MoE → Expert Parallelism over NVLink: all-to-all expert routing is bandwidth-heavy; keep EP inside the NVLink domain (intra-node, or intra-rack on NVL72). Crossing IB for EP collapses throughput (networking fabric, performance tuning).
FP8/INT4 + Blackwell NVFP4: native low-precision is the difference between one node and several; validate accuracy with an eval gate (SRE and MLOps practices).
MLA models: the KV saving lets you raise --max-num-seqs and context; size KV to the memory budget rather than capping conservatively (inference serving).
Confirm GDR/NVLink for any multi-node serving; NCCL_DEBUG=INFO should show [GDRDMA].

Don't-miss checklist¶

Pick precision/quant first; it sets the node count. FP8 is the practical baseline for the current large MoE checkpoints.
Keep TP/EP inside the NVLink domain; only go PP/multi-node when the model truly exceeds one node.
Exploit MLA: larger batch/context, not the default cap.
Pin the exact checkpoint and --trust-remote-code only for vetted repos (security and multi-tenancy).
Gate any quantised deployment on an accuracy eval (SRE and MLOps practices).

Failure modes¶

BF16 chosen for a 671B/1T model → needs 2-4× the GPUs, or OOM.
EP routed across IB instead of NVLink → all-to-all bottleneck, low throughput.
--max-model-len set to the model max with a large batch → KV-cache OOM (inference serving).
Quantised weights shipped without an accuracy check → silent quality regression.

Open questions & validation¶

Confirm current checkpoint names, licences, and the engine's recommended version per model card.
Benchmark TTFT/TPOT and max concurrent sequences for the chosen precision on the target GPU (the SLO/SLI catalog).
Validate quantised accuracy against the BF16 reference on a task eval before promotion.

References¶

vLLM model recipes: https://docs.vllm.ai/projects/recipes/en/latest/
DeepSeek-R1: https://huggingface.co/deepseek-ai/DeepSeek-R1
Kimi K2 (Moonshot): https://huggingface.co/moonshotai/Kimi-K2-Instruct
GLM-4.7-FP8 (Z.ai / zai-org): https://huggingface.co/zai-org/GLM-4.7-FP8
Qwen3-235B-A22B-Instruct-2507: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
Llama 4 Maverick FP8: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
SGLang: https://docs.sglang.ai/