
Cookbook: serve GLM-5.2 with vLLM

Scope: a vLLM reference template for serving zai-org/GLM-5.2-FP8, Z.ai's current flagship model for long-horizon agentic coding and reasoning. It covers what GLM-5.2 is (a Mixture-of-Experts model with ~744B total parameters, 753B per the Hugging Face card, ~40B active per token, Dynamic Sparse Attention, and a context window of up to 1M tokens), why and when to use it, how to launch the 8-GPU tensor-parallel FP8 shape with MTP speculative decoding and the glm47/glm45 parsers, and how to verify thinking mode and tool calls.

Reference template. Commands and manifests are not executed here. GLM-5.2 needs vLLM 0.23.0 or newer, and the model card notes that combining tool calling with MTP speculative decoding currently requires the vLLM main branch. Pin an exact vLLM image (main-derived if you need both features, and DeepGEMM-capable where required) and an exact model revision, then validate on one node before production.

flowchart LR
  PROMPT["Long-horizon coding / agent task"] --> VLLM["vLLM server (TP=8)"]
  VLLM --> DSA["Dynamic Sparse Attention + IndexShare (up to 1M ctx)"]
  VLLM --> MTP["MTP speculative decode"]
  DSA --> PARSERS["glm47 tool parser / glm45 reasoning parser"]
  MTP --> PARSERS
  PARSERS --> API["OpenAI-compatible response"]

What

GLM-5.2 is Z.ai's flagship model for long-horizon tasks, released under MIT on 2026-06-17 (initial rollout to paying coding customers on 2026-06-13). It is a Mixture-of-Experts checkpoint of roughly 744B total parameters with ~40B active per token (A40B); the Hugging Face card lists 753B total. It succeeds GLM-5 and GLM-5.1 and is a large step up from the GLM-4.7 line for agentic coding.

Two architectural pieces define it:

  • Dynamic Sparse Attention (DSA) with IndexShare. Every four transformer layers share a lightweight indexer that selects which keys each query attends to, so three of every four layers skip the full indexer dot-product and top-k. Z.ai reports this cuts per-token FLOPs by ~2.9x at 1M-token context, which is what makes the long-context window affordable to serve.
  • Improved MTP layer. The Multi-Token Prediction module used for speculative decoding shares the indexer and KV cache, raising acceptance length by up to ~20% over the earlier GLM-5 MTP.
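The IndexShare grouping can be illustrated with a toy sketch. It assumes, hypothetically, that the first layer of each group of four runs the indexer and the other three reuse its top-k selection; the real layer assignment, the scoring function, and every name below are illustrative and not taken from the GLM-5.2 code. A single score list stands in for the per-group indexer output.

```python
# Toy illustration of IndexShare: one indexer pass per group of layers.
# Assumption: the first layer of each group owns the indexer; the actual
# assignment in GLM-5.2 may differ.

def indexer_owner(layer: int, group_size: int = 4) -> int:
    """Return the layer whose indexer output `layer` reuses."""
    return layer - (layer % group_size)


def top_k_keys(scores: list[float], k: int) -> list[int]:
    """Pick the k highest-scoring key positions (the sparse attention set)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def shared_selection(num_layers: int, scores: list[float], k: int,
                     group_size: int = 4) -> tuple[dict[int, list[int]], int]:
    """Map every layer to its selected keys and count indexer passes."""
    cache: dict[int, list[int]] = {}
    selection: dict[int, list[int]] = {}
    for layer in range(num_layers):
        owner = indexer_owner(layer, group_size)
        if owner not in cache:  # only the group owner runs scoring + top-k
            cache[owner] = top_k_keys(scores, k)
        selection[layer] = cache[owner]
    return selection, len(cache)


selection, passes = shared_selection(num_layers=8, scores=[0.1, 0.9, 0.3, 0.7], k=2)
print(passes)        # 2 indexer passes for 8 layers
print(selection[5])  # [1, 3]
```

With eight layers the indexer runs twice instead of eight times, which is the "three of every four layers skip it" saving in miniature.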

Context is up to 1M tokens (128K standard, expandable to 1M) with up to 131,072 output tokens. It exposes thinking mode with effort control (high, max) and structured tool calling, and integrates with coding agents such as Claude Code, OpenCode, and ZCode. It is strongest on long-horizon agentic coding and design/front-end generation; reported scores include SWE-bench Pro 62.1 (GLM-5.1: 58.4), DeepSWE 46.2 (SOTA among open-weight models at release), and Terminal-Bench 2.1 81.0.

Why

Use GLM-5.2 when the product is a long-running coding or tool-use agent and you want an open-weight (MIT) model that sustains quality over hundreds of tool-call rounds at a very long context. The small ~40B active fraction keeps decode cheap relative to the ~744B footprint, DSA cuts long-context attention cost (Z.ai reports ~2.9x fewer per-token FLOPs at 1M tokens), and the MTP layer adds speculative-decoding throughput. It is the highest-capability open-weight agentic coder in this cookbook set at release.

When

Use it when:

  • Long-horizon agentic coding, repository editing, terminal tasks, or multi-step tool use dominate.
  • You need a very long context (well beyond 128K) at a serving cost that a dense model cannot match.
  • A permissive MIT license and an OpenAI-compatible tool/reasoning path matter.
  • The platform can run vLLM 0.23.0+ (or main when you need tool calling and MTP together).

Avoid it when:

  • The serving platform only allows conservative stable images and cannot validate a recent/main vLLM build.
  • The workload is short chat or minimal-latency completion where a smaller dense model meets the SLO.
  • You cannot allocate an 8-GPU (H200/H20/B200-class) node; the FP8 checkpoint targets a single 8-GPU node and BF16 is multi-node.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | zai-org/GLM-5.2-FP8 (FP8) or zai-org/GLM-5.2 (BF16) |
| Params | ~744B total / ~40B active (A40B) MoE; HF card lists 753B total |
| Attention | Dynamic Sparse Attention (DSA) + IndexShare |
| Context | up to 1M tokens (128K standard, expandable to 1M) |
| Max output | 131,072 tokens |
| Precision | FP8 (E4M3) or BF16; an nvidia/GLM-5.2-NVFP4 variant also exists |
| Speculative | improved MTP layer (+~20% acceptance length) |
| License | MIT |
| Starting hardware | FP8: 8x H200 / H20 single node (TP=8); BF16: multi-node; full 1M context: 8x B200 |
| vLLM | 0.23.0+ stable; main for tool calling + MTP simultaneously |

The official vLLM recipe uses --tensor-parallel-size 8 for the FP8 checkpoint on a single 8x H200/H20 node. Use that as a smoke baseline, then raise context or lower --max-num-seqs after measuring KV-cache capacity. SGLang (v0.5.13.post1+) and Transformers are also supported per the model card.
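A rough weight-footprint check shows why FP8 fits a single 8-GPU node and BF16 does not. This is a back-of-envelope sketch, assuming 141 GB of HBM per H200 and the ~744B parameter count; it ignores KV cache, activations, CUDA graphs, and any tensors the checkpoint keeps at higher precision. The function names are illustrative.

```python
# Back-of-envelope weight footprint for one replica.
# Assumptions: ~744B parameters, 141 GB HBM per GPU (H200), decimal GB.
GB = 1e9


def weight_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight memory in GB for a given storage width."""
    return total_params * bytes_per_param / GB


def fits_one_node(total_params: float, bytes_per_param: float,
                  gpus: int = 8, hbm_gb_per_gpu: float = 141.0) -> tuple[bool, float]:
    """Return (weights fit, GB left over for KV cache and runtime overhead)."""
    need = weight_gb(total_params, bytes_per_param)
    have = gpus * hbm_gb_per_gpu
    return need < have, have - need


print(fits_one_node(744e9, 1))  # FP8:  (True, 384.0)
print(fits_one_node(744e9, 2))  # BF16: (False, -360.0)
```

The ~384 GB left over in the FP8 case is the budget that KV cache, activations, and runtime overhead share, which is why context length and --max-num-seqs should be raised only after measuring.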

2. Bare-metal vLLM server (FP8, 8 GPUs, MTP + tool calling)

vllm serve zai-org/GLM-5.2-FP8 \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

GLM-5.2 reuses the GLM-4.5/4.7 parsers: --tool-call-parser glm47 and --reasoning-parser glm45. The official vLLM recipe writes the speculative config in the equivalent dotted form (--speculative-config.method mtp --speculative-config.num_speculative_tokens 5); the JSON-string form above is the one shown in the vLLM MTP docs, and both are accepted. Tool calling and MTP together currently need the vLLM main branch; if you pin 0.23.0 stable and hit a conflict, drop --speculative-config or move to a main-derived image.
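If the launch command is generated by a deploy script, the MTP toggle can be kept in one place. The helper below is a minimal sketch: build_serve_args is a hypothetical name, and it only assembles the flags already shown in this section, emitting the JSON-string form of --speculative-config.

```python
import json
import shlex


def build_serve_args(enable_mtp: bool = True,
                     num_speculative_tokens: int = 5) -> list[str]:
    """Assemble the vllm serve argv for the FP8 / TP=8 shape."""
    args = [
        "vllm", "serve", "zai-org/GLM-5.2-FP8",
        "--tensor-parallel-size", "8",
        "--kv-cache-dtype", "fp8",
        "--tool-call-parser", "glm47",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
        "--served-model-name", "glm-5.2-fp8",
    ]
    if enable_mtp:
        # JSON-string form; drop it when a stable pin conflicts with tool calling.
        spec = {"method": "mtp", "num_speculative_tokens": num_speculative_tokens}
        args += ["--speculative-config", json.dumps(spec)]
    return args


# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(build_serve_args()))
print(shlex.join(build_serve_args(enable_mtp=False)))
```

Passing enable_mtp=False yields the fallback command for a 0.23.0 stable pin without touching the parser flags.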

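The full 1M-context variant (8x B200) keeps the FP8 KV cache but names the format explicitly (fp8_e4m3, which is what fp8 resolves to on CUDA), caps concurrency at 32 sequences, and skips the DeepGEMM warmup: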

VLLM_DEEP_GEMM_WARMUP=skip vllm serve zai-org/GLM-5.2-FP8 \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-seqs 32 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-5.2-fp8

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-glm-5-2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-glm-5-2 }
  template:
    metadata:
      labels: { app: vllm-glm-5-2 }
    spec:
      nodeSelector:
        # Cluster-specific label; replace it with the one your 8-GPU H200 node pool carries.
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-0.23+-or-main-tag>
          args:
            - --model=zai-org/GLM-5.2-FP8
            - --tensor-parallel-size=8
            - --kv-cache-dtype=fp8
            - '--speculative-config={"method": "mtp", "num_speculative_tokens": 5}'
            - --tool-call-parser=glm47
            - --reasoning-parser=glm45
            - --enable-auto-tool-choice
            - --served-model-name=glm-5.2-fp8
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 240
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-glm-5-2
  namespace: serving
spec:
  selector: { app: vllm-glm-5-2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Thinking-mode smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-5.2-fp8",
    "messages": [{"role": "user", "content": "Design and implement a minimal LRU cache in Python, then explain the eviction invariant."}],
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens": 4096
  }'

Thinking is on by default ("think max"). To disable it for a latency-sensitive route, pass the chat-template flag the model card documents:

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-5.2-fp8",
    "messages": [{"role": "user", "content": "Return only the final answer: 2**20 - 1."}],
    "max_tokens": 256,
    "chat_template_kwargs": {"enable_thinking": false}
  }'
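To turn the two requests above into an automated check, the response can be split into reasoning and final answer. This is a sketch under one loud assumption: the reasoning parser returns the trace in a message field named reasoning or reasoning_content, depending on the vLLM build. Confirm the field name against the pinned image. The sample responses are hand-written stand-ins, not captured server output.

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response["choices"][0]["message"]
    # Assumption: the field name varies by vLLM version; accept either.
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    return reasoning, message.get("content") or ""


# Hypothetical response bodies for the two smoke tests.
thinking_on = {"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Use an OrderedDict and move keys on access.",
    "content": "class LRUCache: ...",
}}]}
thinking_off = {"choices": [{"message": {
    "role": "assistant",
    "content": "1048575",
}}]}

reasoning, answer = split_reasoning(thinking_on)
print(bool(reasoning), bool(answer))   # True True
print(split_reasoning(thinking_off))   # ('', '1048575')
```

A route passes when the first call yields non-empty reasoning and the enable_thinking=false call yields none.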

5. Tool-call smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-5.2-fp8",
    "messages": [{"role": "user", "content": "Use the search_docs tool to find CUDA driver upgrade instructions."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'
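"Structured" can be checked mechanically: in the OpenAI-compatible schema, each entry in message.tool_calls carries a function name and a JSON-encoded arguments string. The helper below is a minimal sketch; parse_tool_calls is a hypothetical name and the sample body is hand-written rather than real server output.

```python
import json


def parse_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs; raises if arguments are not valid JSON."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hypothetical response body for the search_docs request above.
sample = {"choices": [{"message": {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {
            "name": "search_docs",
            "arguments": "{\"query\": \"CUDA driver upgrade\"}",
        },
    }],
}}]}

print(parse_tool_calls(sample))  # [('search_docs', {'query': 'CUDA driver upgrade'})]
```

An empty list alongside tool-call-looking text in content usually points at the parser flags rather than the model.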

Pass criteria:

  • The model loads on the pinned vLLM image and /health returns 200 with a non-zero KV-cache size in the logs.
  • Tool calls are structured, not plain text.
  • Reasoning content is returned in its own field of the response message when thinking is on, and is absent when enable_thinking is false.
  • With MTP enabled, the logs report a non-trivial speculative acceptance rate; measure TTFT/TPOT separately for coding and tool-call routes.
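The MTP and latency criteria reduce to a few ratios. The sketch below assumes you can read three counts from the server's speculative-decoding logs or metrics (number of draft steps, tokens drafted, tokens accepted) and that mean acceptance length is defined as one target-model token per step plus the accepted draft tokens; where those counts come from depends on the pinned build. Function names are illustrative.

```python
def acceptance_stats(num_drafts: int, draft_tokens: int,
                     accepted_tokens: int) -> tuple[float, float]:
    """Return (acceptance rate, mean acceptance length) for a measurement window."""
    if num_drafts <= 0 or draft_tokens <= 0:
        raise ValueError("need at least one draft step and one drafted token")
    rate = accepted_tokens / draft_tokens
    # Each step always emits one token from the target model itself.
    mean_length = 1 + accepted_tokens / num_drafts
    return rate, mean_length


def tpot_seconds(total_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token, excluding the first token."""
    if output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (output_tokens - 1)


# Hypothetical window: 1000 steps of 5 drafted tokens, 3000 accepted.
print(acceptance_stats(1000, 5000, 3000))  # (0.6, 4.0)
# Hypothetical request: 10.5 s total, 0.5 s to first token, 401 output tokens.
print(tpot_seconds(10.5, 0.5, 401))        # 0.025
```

Compute these per route, since coding completions and tool-call turns tend to have different acceptance behaviour.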

Failure modes

  • glm5_moe/GLM-5.2 unsupported: vLLM is older than 0.23.0; upgrade or use a main-derived image.
  • MTP and tool calling conflict on stable: using both at once currently needs vLLM main; drop --speculative-config or move to a main-derived image.
  • Tool calls not parsed: missing --enable-auto-tool-choice or --tool-call-parser glm47.
  • Reasoning content missing: missing --reasoning-parser glm45, or the client ignores the extra field, or enable_thinking was set false.
  • OOM at long context: 1M is a ceiling, not a default. Cap --max-model-len to what the route actually needs, lower --max-num-seqs (the 1M B200 shape uses 32), and size the KV cache accordingly (see KV cache management).
  • BF16 on one node: the BF16 checkpoint is a multi-node deployment; use zai-org/GLM-5.2-FP8 for a single 8-GPU node.
  • Regression after image update: main/nightly images can change parser or MTP behaviour; pin a digest and run a tool-call + reasoning regression test before rollout.
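The digest-pinning advice in the last item can be enforced in CI before a manifest is applied. The check below is a minimal sketch, assuming image references use the standard repo@sha256:<64 hex> form; is_digest_pinned is a hypothetical helper name.

```python
import re

# A digest-pinned reference ends in @sha256: followed by 64 lowercase hex digits.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True when the image reference is pinned to an immutable digest."""
    return bool(_DIGEST_REF.match(image))


print(is_digest_pinned("vllm/vllm-openai@sha256:" + "a" * 64))  # True
print(is_digest_pinned("vllm/vllm-openai:latest"))              # False
```

A tag alone can be re-pointed by a later push, so a rollout gate that rejects unpinned references keeps parser and MTP behaviour fixed until the regression test has run against the new digest.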

References

  • GLM-5.2 model card: https://huggingface.co/zai-org/GLM-5.2
  • GLM-5.2-FP8 model card: https://huggingface.co/zai-org/GLM-5.2-FP8
  • GLM-5.2 blog (DSA/IndexShare, MTP, benchmarks): https://huggingface.co/blog/zai-org/glm-52-blog
  • vLLM recipe for GLM-5.2 (TP8, MTP, glm47/glm45 parsers, 1M-context B200 shape): https://recipes.vllm.ai/zai-org/GLM-5.2
  • vLLM MTP speculative decoding (--speculative-config JSON form): https://docs.vllm.ai/en/latest/features/speculative_decoding/mtp/
  • vLLM tool calling and reasoning parsers: https://docs.vllm.ai/en/latest/features/tool_calling/

Related: Inference serving · Open-weight serving · GLM-4.7-FP8 · Generic vLLM recipe · Speculative decoding · KV cache management · SLO/SLI catalog · Glossary