Skip to content
Markdown

Serving open-weight models

Scope: model-selection / decision overview and index for standing up the current open-weight flagships: DeepSeek, Kimi K2, GLM, Qwen, Llama. It frames the trade-offs and sizing, then delegates the detailed HOW to one vLLM cookbook per model family (parallelism sizing, quantisation, parser flags, model-specific gotchas) and a generic deploy recipe. The applied layer over inference serving, feeding disaggregation (disaggregated inference).

Specs are mid-2026 and move fast; the model families iterate (point releases monthly). Verify the exact checkpoint, context, and licence on the model card before deploying. GPU counts below are first-order from parameter math, not vendor-tested. Confirm with the engine's recipe page.

flowchart LR
  MODEL["Model architecture"] --> MEMORY["Weights and KV memory"]
  MEMORY --> PRECISION["Precision or quantization"]
  PRECISION --> PARALLEL["TP, EP, PP layout"]
  PARALLEL --> ENGINE["vLLM or SGLang"]
  ENGINE --> SLO["TTFT, TPOT, goodput"]

Focused pages

This is the decision/index layer. To get started, read the paradigm below, pick a model from the cheat-sheet, then open its cookbook for the concrete vllm serve command and gotchas.

Paradigm: size from the architecture

  • Memory is driven by total params; compute by active params. A 1T-param MoE with 32B active needs ~1T-params of HBM but runs at ~32B-dense compute. So big MoE models are memory-bound to fit, cheap to run.
  • Weights memory ≈ params × bytes/param: FP8/INT8 ≈ 1 B/param, BF16 ≈ 2, INT4 ≈ 0.5. Add KV cache (grows with batch × context) and activation overhead (~15-20%).
  • Parallelism: TP within a node (NVLink), EP (expert parallel) for MoE to spread experts, PP across nodes only when a model exceeds one node. Keep TP/EP inside the NVLink domain (networking fabric, distributed training).
  • MLA (Multi-head Latent Attention, DeepSeek/Kimi) compresses the KV cache ~an order of magnitude, far larger batches/context per GB.

Model cheat-sheet (mid-2026)

Model Total / active Context Notable Cookbook
DeepSeek-R1 671B / 37B MoE 128K MIT, reasoning, DeepSeek-V3 architecture, EP-sensitive DeepSeek-R1 with vLLM
Kimi K2 Instruct 1T / 32B MoE 128K modified MIT, MLA, agentic/tool calling, 16-GPU smallest 128K FP8 shape Kimi K2 with vLLM
GLM-5.2-FP8 ~744B / ~40B MoE up to 1M (128K default) MIT, flagship agentic coding, DSA + IndexShare, MTP, glm47/glm45 parsers GLM-5.2 with vLLM
GLM-4.7-FP8 358B / ~32B MoE vendor route-specific MIT, coding/agentic, GLM tool and reasoning parsers GLM-4.7-FP8 with vLLM
Qwen3-235B-A22B-Instruct-2507 235B / 22B MoE 256K native; optional ~1M route Apache-2.0, non-thinking instruct, strong general default Qwen3-235B with vLLM
Llama 4 Maverick FP8 400B / 17B MoE 1M on model card gated, custom Llama 4 license, multimodal text+image Llama 4 Maverick with vLLM

Recipes

Each per-model cookbook (linked under Focused pages) is a standalone WHAT / WHY / WHEN / HOW page with a concrete vllm serve command, Kubernetes or multi-node launch shape, smoke test, production knobs, failure modes, and references. Pick a model from the cheat-sheet, open its cookbook for the validated launch shape, then wrap the stable endpoint behind the engine-agnostic vLLM inference deployment recipe (or the broader Deployment/Service/HPA / KServe patterns in workload recipes) only after the model-specific shape is validated.

Networking & hardware optimisations (for these models)

  • MoE → Expert Parallelism over NVLink: all-to-all expert routing is bandwidth-heavy; keep EP inside the NVLink domain (intra-node, or intra-rack on NVL72). Crossing IB for EP collapses throughput (networking fabric, performance tuning).
  • FP8/INT4 + Blackwell NVFP4: native low-precision is the difference between one node and several; validate accuracy with an eval gate (SRE and MLOps practices).
  • MLA models: the KV saving lets you raise --max-num-seqs and context; size KV to the memory budget rather than capping conservatively (inference serving).
  • Confirm GDR/NVLink for any multi-node serving; NCCL_DEBUG=INFO should show [GDRDMA].

Don't-miss checklist

  • Pick precision/quant first; it sets the node count. FP8 is the practical baseline for the current large MoE checkpoints.
  • Keep TP/EP inside the NVLink domain; only go PP/multi-node when the model truly exceeds one node.
  • Exploit MLA: larger batch/context, not the default cap.
  • Pin the exact checkpoint and --trust-remote-code only for vetted repos (security and multi-tenancy).
  • Gate any quantised deployment on an accuracy eval (SRE and MLOps practices).

Failure modes

  • BF16 chosen for a 671B/1T model → needs 2-4× the GPUs, or OOM.
  • EP routed across IB instead of NVLink → all-to-all bottleneck, low throughput.
  • --max-model-len set to the model max with a large batch → KV-cache OOM (inference serving).
  • Quantised weights shipped without an accuracy check → silent quality regression.

Open questions & validation

  • Confirm current checkpoint names, licences, and the engine's recommended version per model card.
  • Benchmark TTFT/TPOT and max concurrent sequences for the chosen precision on the target GPU (the SLO/SLI catalog).
  • Validate quantised accuracy against the BF16 reference on a task eval before promotion.

References

  • vLLM model recipes: https://docs.vllm.ai/projects/recipes/en/latest/
  • DeepSeek-R1: https://huggingface.co/deepseek-ai/DeepSeek-R1
  • Kimi K2 (Moonshot): https://huggingface.co/moonshotai/Kimi-K2-Instruct
  • GLM-4.7-FP8 (Z.ai / zai-org): https://huggingface.co/zai-org/GLM-4.7-FP8
  • Qwen3-235B-A22B-Instruct-2507: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
  • Llama 4 Maverick FP8: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  • SGLang: https://docs.sglang.ai/

Related: Quantization for inference · Fabric · Inference · LLM request routing · Optimization · Workloads · Disaggregated · SLO/SLI · Glossary