Cookbook: serve GLM-5.2 with vLLM¶
Scope: a vLLM reference template for serving zai-org/GLM-5.2-FP8, Z.ai's current flagship long-horizon agentic coding and reasoning model: what GLM-5.2 is (a ~744-753B / ~40B-active MoE with Dynamic Sparse Attention and a 1M-token context), why and when to use it, how to launch the 8-GPU tensor-parallel FP8 shape with MTP speculative decoding and the glm47/glm45 parsers, and how to verify thinking mode and tool calls.
Reference template. Commands and manifests are not executed here. GLM-5.2 needs vLLM 0.23.0 or newer, and the model card notes that using tool calling and MTP speculative decoding at the same time currently requires the vLLM main branch. Pin an exact vLLM image (main-derived if you need both features), DeepGEMM-capable where required, and the model revision, then validate on one node before production.
flowchart LR
PROMPT["Long-horizon coding / agent task"] --> VLLM["vLLM server (TP=8)"]
VLLM --> DSA["Dynamic Sparse Attention + IndexShare (up to 1M ctx)"]
VLLM --> MTP["MTP speculative decode"]
DSA --> PARSERS["glm47 tool parser / glm45 reasoning parser"]
MTP --> PARSERS
PARSERS --> API["OpenAI-compatible response"]
What¶
GLM-5.2 is Z.ai's flagship model for long-horizon tasks, released under MIT on 2026-06-17 (initial rollout to paying coding customers on 2026-06-13). It is a Mixture-of-Experts checkpoint of roughly 744B total parameters with ~40B active per token (A40B); the Hugging Face card lists 753B total. It succeeds GLM-5 and GLM-5.1 and is a large step up from the GLM-4.7 line for agentic coding.
Two architectural pieces define it:
- Dynamic Sparse Attention (DSA) with IndexShare. Every four transformer layers share a lightweight indexer that selects which keys each query attends to, so three of every four layers skip the full indexer dot-product and top-k. Z.ai reports this cuts per-token FLOPs by ~2.9x at 1M-token context, which is what makes the long-context window affordable to serve.
- Improved MTP layer. The Multi-Token Prediction module used for speculative decoding shares the indexer and KV cache, raising acceptance length by up to ~20% over the earlier GLM-5 MTP.
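The IndexShare grouping can be illustrated with a little arithmetic. This is a minimal sketch, not Z.ai's implementation: the layer count of 80 is a placeholder (the source does not state it), and placing the indexer on the first layer of each group of four is an assumption made for illustration.

```python
def indexer_layers(num_layers: int, group_size: int = 4) -> list[int]:
    """Return the indices of layers that run the full indexer.

    Assumption: the first layer of each group of `group_size` computes the
    top-k key selection and the other layers in the group reuse it.
    """
    return [i for i in range(num_layers) if i % group_size == 0]


def skipped_fraction(num_layers: int, group_size: int = 4) -> float:
    """Fraction of layers that reuse a shared selection instead of recomputing it."""
    return 1 - len(indexer_layers(num_layers, group_size)) / num_layers


# Hypothetical layer count, chosen only for illustration.
NUM_LAYERS = 80
print(len(indexer_layers(NUM_LAYERS)))  # number of layers that run the indexer
print(skipped_fraction(NUM_LAYERS))     # share of layers that skip it
```

This only counts indexer invocations. The ~2.9x per-token FLOPs figure is Z.ai's reported end-to-end number and cannot be derived from the grouping alone.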
Context is up to 1M tokens (128K standard, expandable to 1M) with up to 131,072 output tokens. It exposes thinking mode with effort control (high, max) and structured tool calling, and integrates with coding agents such as Claude Code, OpenCode, and ZCode. It is strongest on long-horizon agentic coding and design/front-end generation; reported scores include SWE-bench Pro 62.1 (GLM-5.1: 58.4), DeepSWE 46.2 (SOTA among open-weight models at release), and Terminal-Bench 2.1 81.0.
Why¶
Use GLM-5.2 when the product is a long-running coding or tool-use agent and you want an open-weight (MIT) model that sustains quality over hundreds of tool-call rounds at a very long context. The small ~40B active fraction keeps decode cheap relative to the ~744B footprint, DSA keeps long-context attention sub-quadratic, and the MTP layer adds speculative-decoding throughput. It is the highest-capability open-weight agentic coder in this cookbook set at release.
When¶
Use it when:
- Long-horizon agentic coding, repository editing, terminal tasks, or multi-step tool use dominate.
- You need a very long context (well beyond 128K) at a serving cost that a dense model cannot match.
- A permissive MIT license and an OpenAI-compatible tool/reasoning path matter.
- The platform can run vLLM 0.23.0+ (or main when you need tool calling and MTP together).
Avoid it when:
- The serving platform only allows conservative stable images and cannot validate a recent/main vLLM build.
- The workload is short chat or minimal-latency completion where a smaller dense model meets the SLO.
- You cannot allocate an 8-GPU (H200/H20/B200-class) node; the FP8 checkpoint targets a single 8-GPU node and BF16 is multi-node.
How¶
1. Size the replica¶
| Item | Practical note |
|---|---|
| Model | zai-org/GLM-5.2-FP8 (FP8) or zai-org/GLM-5.2 (BF16) |
| Params | ~744B total / ~40B active (A40B) MoE; HF card lists 753B total |
| Attention | Dynamic Sparse Attention (DSA) + IndexShare |
| Context | up to 1M tokens (128K standard, expandable to 1M) |
| Max output | 131,072 tokens |
| Precision | FP8 (E4M3) or BF16; an nvidia/GLM-5.2-NVFP4 variant also exists |
| Speculative | improved MTP layer (+~20% acceptance length) |
| License | MIT |
| Starting hardware | FP8: 8x H200 / H20 single node (TP=8); BF16: multi-node; full 1M context: 8x B200 |
| vLLM | 0.23.0+ stable; main for tool calling + MTP simultaneously |
The official vLLM recipe uses --tensor-parallel-size 8 for the FP8 checkpoint on a single 8x H200/H20 node. Use that as a smoke baseline, then raise context or lower --max-num-seqs after measuring KV-cache capacity. SGLang (v0.5.13.post1+) and Transformers are also supported per the model card.
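A rough back-of-envelope check shows why the FP8 checkpoint fits one 8-GPU node while BF16 does not. This is a sketch under stated assumptions: it uses the 753B total from the Hugging Face card, treats every parameter as 1 byte (FP8) or 2 bytes (BF16), assumes an even tensor-parallel split, and assumes 141 GB of HBM per H200. Real checkpoints keep some tensors in higher precision, so treat the result as an estimate to confirm against the vLLM startup logs.

```python
def weights_per_gpu_gb(total_params_b: float, bytes_per_param: float, tp_size: int) -> float:
    """Approximate weight memory per GPU in decimal GB, assuming an even TP split."""
    return total_params_b * bytes_per_param / tp_size


# 753B is the Hugging Face card figure; FP8 is 1 byte/param, BF16 is 2 bytes/param.
fp8_per_gpu = weights_per_gpu_gb(753, 1.0, 8)
bf16_per_gpu = weights_per_gpu_gb(753, 2.0, 8)

# Assumption: 141 GB of HBM per H200 GPU.
H200_HBM_GB = 141
headroom = H200_HBM_GB - fp8_per_gpu

print(f"FP8 weights per GPU:  {fp8_per_gpu:.1f} GB")
print(f"BF16 weights per GPU: {bf16_per_gpu:.1f} GB")
print(f"FP8 headroom on H200: {headroom:.1f} GB before KV cache, activations, and overhead")
```

Under these assumptions the BF16 weights alone exceed a single H200's memory at TP=8, which is consistent with the table listing BF16 as multi-node.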
2. Bare-metal vLLM server (FP8, 8 GPUs, MTP + tool calling)¶
vllm serve zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
GLM-5.2 reuses the GLM-4.5/4.7 parsers: --tool-call-parser glm47 and --reasoning-parser glm45. The official vLLM recipe writes the speculative config in the equivalent dotted form (--speculative-config.method mtp --speculative-config.num_speculative_tokens 5); the JSON-string form above is the form shown in the vLLM MTP docs and both are accepted. Tool calling and MTP together currently need the vLLM main branch; if you pin 0.23.0 stable and hit a conflict, drop --speculative-config or move to a main-derived image.
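If the launch command is generated by tooling, the MTP fallback can be a single switch. The helper below is a hypothetical sketch (`build_vllm_args` is not part of vLLM); it only assembles the flags already shown in this section and omits `--speculative-config` when MTP is disabled.

```python
import json
import shlex


def build_vllm_args(model: str, served_name: str, tp_size: int = 8,
                    enable_mtp: bool = True, num_speculative_tokens: int = 5) -> list[str]:
    """Assemble the `vllm serve` argv used in this section (hypothetical helper)."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--kv-cache-dtype", "fp8",
        "--tool-call-parser", "glm47",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
        "--served-model-name", served_name,
    ]
    if enable_mtp:
        # JSON-string form of the speculative config, as in the command above.
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        args += ["--speculative-config", json.dumps(spec)]
    return args


# main-derived image: MTP and tool calling together.
main_cmd = build_vllm_args("zai-org/GLM-5.2-FP8", "glm-5.2-fp8")
# 0.23.0 stable fallback: keep tool calling, drop MTP.
stable_cmd = build_vllm_args("zai-org/GLM-5.2-FP8", "glm-5.2-fp8", enable_mtp=False)

print(shlex.join(main_cmd))
print(shlex.join(stable_cmd))
```

Building the argv as a list and quoting it with `shlex.join` avoids hand-escaping the JSON value when the command is pasted into a shell.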
The full 1M-context variant (8x B200) keeps the FP8 KV cache (written explicitly as fp8_e4m3), caps concurrency at 32 sequences, and skips the DeepGEMM warmup:
VLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 8 \
--kv-cache-dtype fp8_e4m3 \
--max-num-seqs 32 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-5.2-fp8
3. Kubernetes Deployment template¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-glm-5-2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-glm-5-2 }
  template:
    metadata:
      labels: { app: vllm-glm-5-2 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-0.23+-or-main-tag>
          args:
            - --model=zai-org/GLM-5.2-FP8
            - --tensor-parallel-size=8
            - --kv-cache-dtype=fp8
            - '--speculative-config={"method": "mtp", "num_speculative_tokens": 5}'
            - --tool-call-parser=glm47
            - --reasoning-parser=glm45
            - --enable-auto-tool-choice
            - --served-model-name=glm-5.2-fp8
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 240
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-glm-5-2
  namespace: serving
spec:
  selector: { app: vllm-glm-5-2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
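Before running the smoke tests, a client can wait for the same `/health` endpoint that the readiness probe uses. The sketch below uses only the standard library; the function names, retry counts, and delays are illustrative assumptions, and the probe is injected so the retry loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` between failures."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A typical call would be `wait_until_ready(lambda: http_health_ok("http://<service-url>:8000/health"))`. Large checkpoints can take several minutes to load, so size `attempts * delay_s` to the load time you actually observe.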
4. Thinking-mode smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Design and implement a minimal LRU cache in Python, then explain the eviction invariant."}],
"temperature": 0.7,
"top_p": 1.0,
"max_tokens": 4096
}'
Thinking is on by default ("think max"). To disable it for a latency-sensitive route, pass the chat-template flag the model card documents:
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Return only the final answer: 2**20 - 1."}],
"max_tokens": 256,
"chat_template_kwargs": {"enable_thinking": false}
}'
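The two requests above can be driven from a small Python client. This is a hedged sketch using only the standard library. The helper names are hypothetical, and the name of the reasoning field in the response is an assumption: vLLM releases have exposed it as `reasoning_content` and, more recently, `reasoning`, so the code checks both. Confirm the field name against the pinned image.

```python
import json
import urllib.request
from typing import Optional


def build_chat_payload(prompt: str, thinking: bool = True, max_tokens: int = 4096) -> dict:
    """Build a chat-completions body; thinking is left at the model default unless disabled."""
    payload = {
        "model": "glm-5.2-fp8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if not thinking:
        # Chat-template flag documented by the model card.
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload


def extract_reasoning(response: dict) -> Optional[str]:
    """Return the reasoning text if present, else None.

    Assumption: the field is named `reasoning` or `reasoning_content`,
    depending on the vLLM version.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning") or message.get("reasoning_content")


def post_chat(base_url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload to the OpenAI-compatible endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A smoke run would call `post_chat` once with `thinking=True` and once with `thinking=False`, then check that `extract_reasoning` returns text in the first case and `None` in the second.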
5. Tool-call smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "glm-5.2-fp8",
"messages": [{"role": "user", "content": "Use the search_docs tool to find CUDA driver upgrade instructions."}],
"tools": [{
"type": "function",
"function": {
"name": "search_docs",
"description": "Search internal documentation.",
"parameters": {
"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"]
}
}
}],
"tool_choice": "auto",
"max_tokens": 1024
}'
Pass criteria:
- The model loads on the pinned vLLM image and /health returns 200 with a non-zero KV-cache size in the logs.
- Tool calls are structured, not plain text.
- Reasoning metadata is present when thinking is on, and absent when enable_thinking is false.
- With MTP enabled, the logs report a non-trivial speculative acceptance rate; measure TTFT/TPOT separately for coding and tool-call routes.
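The "structured, not plain text" criterion can be checked mechanically. The sketch below assumes the standard OpenAI chat-completions response shape, where `message.tool_calls[*].function.arguments` is a JSON-encoded string. The function name is hypothetical.

```python
import json


def structured_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs for every parsed tool call.

    Raises ValueError if any call's arguments are not a valid JSON object,
    which usually means the tool parser did not run or mis-parsed the output.
    """
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # json.JSONDecodeError is a ValueError
        if not isinstance(args, dict):
            raise ValueError(f"arguments for {fn['name']} are not a JSON object")
        calls.append((fn["name"], args))
    return calls
```

For the smoke test in step 5, a passing response yields one `search_docs` entry with a `query` key. An empty list alongside tool-like text in `content` points at the parser flags listed under failure modes.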
Failure modes¶
- glm5_moe/GLM-5.2 unsupported: vLLM is older than 0.23.0; upgrade or use a main-derived image.
- MTP and tool calling conflict on stable: using both at once currently needs vLLM main; drop --speculative-config or move to a main-derived image.
- Tool calls not parsed: missing --enable-auto-tool-choice or --tool-call-parser glm47.
- Reasoning content missing: missing --reasoning-parser glm45, or the client ignores the extra field, or enable_thinking was set false.
- OOM at long context: 1M is a ceiling, not a default. Cap --max-model-len to the route's real need, lower --max-num-seqs (32 for the 1M B200 shape), and size KV cache accordingly (KV cache management).
- BF16 on one node: the BF16 checkpoint is a multi-node deployment; use zai-org/GLM-5.2-FP8 for a single 8-GPU node.
- Regression after image update: main/nightly images can change parser or MTP behaviour; pin a digest and run a tool-call + reasoning regression test before rollout.
References¶
- GLM-5.2 model card: https://huggingface.co/zai-org/GLM-5.2
- GLM-5.2-FP8 model card: https://huggingface.co/zai-org/GLM-5.2-FP8
- GLM-5.2 blog (DSA/IndexShare, MTP, benchmarks): https://huggingface.co/blog/zai-org/glm-52-blog
- vLLM recipe for GLM-5.2 (TP8, MTP, glm47/glm45 parsers, 1M-context B200 shape): https://recipes.vllm.ai/zai-org/GLM-5.2
- vLLM MTP speculative decoding (--speculative-config JSON form): https://docs.vllm.ai/en/latest/features/speculative_decoding/mtp/
- vLLM tool calling and reasoning parsers: https://docs.vllm.ai/en/latest/features/tool_calling/
Related: Inference serving · Open-weight serving · GLM-4.7-FP8 · Generic vLLM recipe · Speculative decoding · KV cache management · SLO/SLI catalog · Glossary