Serving open-weight models¶
Scope: model-selection / decision overview and index for standing up the current open-weight flagships: DeepSeek, Kimi K2, GLM, Qwen, Llama. It frames the trade-offs and sizing, then delegates the detailed HOW to one vLLM cookbook per model family (parallelism sizing, quantisation, parser flags, model-specific gotchas) and a generic deploy recipe. The applied layer over inference serving, feeding disaggregation (disaggregated inference).
Specs are mid-2026 and move fast; the model families iterate (point releases monthly). Verify the exact checkpoint, context, and licence on the model card before deploying. GPU counts below are first-order from parameter math, not vendor-tested. Confirm with the engine's recipe page.
flowchart LR
MODEL["Model architecture"] --> MEMORY["Weights and KV memory"]
MEMORY --> PRECISION["Precision or quantization"]
PRECISION --> PARALLEL["TP, EP, PP layout"]
PARALLEL --> ENGINE["vLLM or SGLang"]
ENGINE --> SLO["TTFT, TPOT, goodput"]
Focused pages¶
This is the decision/index layer. To get started, read the paradigm below, pick a model from the cheat-sheet, then open its cookbook for the concrete vllm serve command and gotchas.
- Serve DeepSeek-R1 with vLLM. Use this when you want an MIT reasoning endpoint (671B/37B MoE, EP-sensitive, MLA).
- Serve Kimi K2 with vLLM. Use this for an agentic/tool-calling endpoint (1T/32B MoE, MLA, 16-GPU smallest 128K FP8 shape).
- Serve GLM-5.2 with vLLM. Use this for the flagship open-weight long-horizon coding/agent endpoint (~744B/40B MoE, Dynamic Sparse Attention + 1M context, MTP, MIT).
- Serve GLM-4.7-FP8 with vLLM. Use this for a coding/agent endpoint that needs GLM tool and reasoning parser flags.
- Serve Qwen3-235B-A22B with vLLM. Use this as the Apache-2.0 general default (256K native context, optional ~1M route).
- Serve Llama 4 Maverick with vLLM. Use this when you need a gated multimodal text+image endpoint.
- vLLM inference deployment recipe. Use this for the engine-agnostic deploy mechanics (Deployment/Service/HPA, rollout, autoscaling) once a model-specific launch shape is validated.
Paradigm: size from the architecture¶
- Memory is driven by total params; compute by active params. A 1T-param MoE with 32B active needs ~1T-params of HBM but runs at ~32B-dense compute. So big MoE models are memory-bound to fit, cheap to run.
- Weights memory ≈ params × bytes/param: FP8/INT8 ≈ 1 B/param, BF16 ≈ 2, INT4 ≈ 0.5. Add KV cache (grows with batch × context) and activation overhead (~15-20%).
- Parallelism: TP within a node (NVLink), EP (expert parallel) for MoE to spread experts, PP across nodes only when a model exceeds one node. Keep TP/EP inside the NVLink domain (networking fabric, distributed training).
- MLA (Multi-head Latent Attention, DeepSeek/Kimi) compresses the KV cache ~an order of magnitude, far larger batches/context per GB.
Model cheat-sheet (mid-2026)¶
| Model | Total / active | Context | Notable | Cookbook |
|---|---|---|---|---|
| DeepSeek-R1 | 671B / 37B MoE | 128K | MIT, reasoning, DeepSeek-V3 architecture, EP-sensitive | DeepSeek-R1 with vLLM |
| Kimi K2 Instruct | 1T / 32B MoE | 128K | modified MIT, MLA, agentic/tool calling, 16-GPU smallest 128K FP8 shape | Kimi K2 with vLLM |
| GLM-5.2-FP8 | ~744B / ~40B MoE | up to 1M (128K default) | MIT, flagship agentic coding, DSA + IndexShare, MTP, glm47/glm45 parsers | GLM-5.2 with vLLM |
| GLM-4.7-FP8 | 358B / ~32B MoE | vendor route-specific | MIT, coding/agentic, GLM tool and reasoning parsers | GLM-4.7-FP8 with vLLM |
| Qwen3-235B-A22B-Instruct-2507 | 235B / 22B MoE | 256K native; optional ~1M route | Apache-2.0, non-thinking instruct, strong general default | Qwen3-235B with vLLM |
| Llama 4 Maverick FP8 | 400B / 17B MoE | 1M on model card | gated, custom Llama 4 license, multimodal text+image | Llama 4 Maverick with vLLM |
Recipes¶
Each per-model cookbook (linked under Focused pages) is a standalone WHAT / WHY / WHEN / HOW page with a concrete vllm serve command, Kubernetes or multi-node launch shape, smoke test, production knobs, failure modes, and references. Pick a model from the cheat-sheet, open its cookbook for the validated launch shape, then wrap the stable endpoint behind the engine-agnostic vLLM inference deployment recipe (or the broader Deployment/Service/HPA / KServe patterns in workload recipes) only after the model-specific shape is validated.
Networking & hardware optimisations (for these models)¶
- MoE → Expert Parallelism over NVLink: all-to-all expert routing is bandwidth-heavy; keep EP inside the NVLink domain (intra-node, or intra-rack on NVL72). Crossing IB for EP collapses throughput (networking fabric, performance tuning).
- FP8/INT4 + Blackwell NVFP4: native low-precision is the difference between one node and several; validate accuracy with an eval gate (SRE and MLOps practices).
- MLA models: the KV saving lets you raise
--max-num-seqsand context; size KV to the memory budget rather than capping conservatively (inference serving). - Confirm GDR/NVLink for any multi-node serving;
NCCL_DEBUG=INFOshould show[GDRDMA].
Don't-miss checklist¶
- Pick precision/quant first; it sets the node count. FP8 is the practical baseline for the current large MoE checkpoints.
- Keep TP/EP inside the NVLink domain; only go PP/multi-node when the model truly exceeds one node.
- Exploit MLA: larger batch/context, not the default cap.
- Pin the exact checkpoint and
--trust-remote-codeonly for vetted repos (security and multi-tenancy). - Gate any quantised deployment on an accuracy eval (SRE and MLOps practices).
Failure modes¶
- BF16 chosen for a 671B/1T model → needs 2-4× the GPUs, or OOM.
- EP routed across IB instead of NVLink → all-to-all bottleneck, low throughput.
--max-model-lenset to the model max with a large batch → KV-cache OOM (inference serving).- Quantised weights shipped without an accuracy check → silent quality regression.
Open questions & validation¶
- Confirm current checkpoint names, licences, and the engine's recommended version per model card.
- Benchmark TTFT/TPOT and max concurrent sequences for the chosen precision on the target GPU (the SLO/SLI catalog).
- Validate quantised accuracy against the BF16 reference on a task eval before promotion.
References¶
- vLLM model recipes: https://docs.vllm.ai/projects/recipes/en/latest/
- DeepSeek-R1: https://huggingface.co/deepseek-ai/DeepSeek-R1
- Kimi K2 (Moonshot): https://huggingface.co/moonshotai/Kimi-K2-Instruct
- GLM-4.7-FP8 (Z.ai / zai-org): https://huggingface.co/zai-org/GLM-4.7-FP8
- Qwen3-235B-A22B-Instruct-2507: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
- Llama 4 Maverick FP8: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
- SGLang: https://docs.sglang.ai/
Related: Quantization for inference · Fabric · Inference · LLM request routing · Optimization · Workloads · Disaggregated · SLO/SLI · Glossary