Cookbook: serve GLM-4.7-FP8 with vLLM¶
Scope: a vLLM reference template for serving zai-org/GLM-4.7-FP8: what GLM-4.7 is, why and when to use it for coding and agentic workloads, how to launch the current vLLM command with reasoning/tool parsers, how to wrap it in Kubernetes, and how to verify thinking mode and tool calls.
Reference template. Commands and manifests are not executed here. The GLM-4.7 model card states that vLLM and SGLang support GLM-4.7 on their main branches; pin and test a nightly/main-derived image instead of assuming an older stable vLLM image works.
flowchart LR
PROMPT["Coding / agent prompt"] --> VLLM["vLLM server"]
VLLM --> MTP["MTP speculative config"]
VLLM --> PARSERS["glm47 tool parser / glm45 reasoning parser"]
PARSERS --> API["OpenAI-compatible response"]
What¶
GLM-4.7 is Z.ai's coding, reasoning, and agentic model family. The FP8 checkpoint (zai-org/GLM-4.7-FP8) is a 358B-total, ~32B-active (A32B) MoE release under the MIT license. It supports up to 131,072 output tokens and three thinking modes: interleaved, preserved, and turn-level. It succeeds GLM-4.6 and GLM-4.5. Z.ai has since shipped GLM-5 and GLM-5.2 (MIT, ~1M context), which outperform 4.7; choose this checkpoint only when you specifically need it, and check the current flagship before committing.
Why¶
Use GLM-4.7-FP8 when the product needs an open-weight coding/agent model with tool-use support and a permissive license. It is a better fit than general chat models for coding assistants, terminal agents, browser/search agents, and long-horizon tool workflows.
When¶
Use it when:
- Coding, repository editing, terminal tasks, or agent tool use dominate.
- The platform can run a vLLM build with GLM-4.7 support.
- A permissive MIT model license matters.
- The request path can expose or consume reasoning/tool metadata.
Avoid it when:
- The serving platform only allows conservative stable images and cannot validate nightly vLLM.
- The workload is pure short chat and a smaller dense model satisfies the SLO.
- Tool-use parsing is not wired through the client stack.
How¶
1. Size the replica¶
| Item | Practical note |
|---|---|
| Model | zai-org/GLM-4.7-FP8 |
| Params | 358B total / ~32B active (A32B) MoE |
| Precision | FP8 / compressed tensors |
| License | MIT |
| vLLM build | main/nightly-derived, per model card |
| Starting hardware | 4-8 H100/H200/B200 GPUs; FP8 weights alone are roughly 358 GB, so a 4-GPU replica needs more than 80 GB per GPU. Then benchmark context/concurrency |
The official GLM examples use --tensor-parallel-size 4 for GLM-4.7-FP8. Use that as a smoke-test baseline, then raise the GPU count or context length after measuring KV-cache capacity. A note on flag form: vLLM documents --speculative-config as a single JSON string, and recent vLLM also accepts the dotted --speculative-config.method form used in some official recipes. Use whichever form your pinned vLLM version documents.
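The sizing advice above can be sanity-checked with back-of-envelope arithmetic before any benchmark. The sketch below assumes roughly one byte per parameter for FP8 weights (some layers may be stored at higher precision, so treat the result as a lower bound) and uses 0.9 as the usable memory fraction, which matches the default of vLLM's --gpu-memory-utilization. The function names and the nominal GPU sizes are illustrative, not taken from the model card.

```python
def weights_gb(total_params_billions: float, bytes_per_param: float = 1.0) -> float:
    """Approximate weight footprint in GB; 1.0 byte/param models FP8."""
    return total_params_billions * bytes_per_param


def kv_headroom_gb(gpu_mem_gb: float, num_gpus: int,
                   total_params_billions: float,
                   gpu_memory_utilization: float = 0.9) -> float:
    """Memory left for KV cache and activations after loading weights.

    A negative result means the weights alone do not fit in the replica.
    """
    usable = gpu_mem_gb * num_gpus * gpu_memory_utilization
    return usable - weights_gb(total_params_billions)


if __name__ == "__main__":
    # 358B total parameters comes from the table above; GPU sizes are nominal.
    candidates = [("4x 80 GB", 80, 4), ("8x 80 GB", 80, 8), ("4x 141 GB", 141, 4)]
    for label, mem_gb, count in candidates:
        headroom = kv_headroom_gb(mem_gb, count, 358)
        print(f"{label}: {headroom:.0f} GB headroom")
```

A negative headroom for a candidate means that replica shape cannot load the checkpoint at all; a small positive value means it loads but leaves little room for long contexts or concurrency.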
2. Bare-metal vLLM server¶
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-fp8
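When the launch command is generated by a script rather than typed by hand, the JSON value for --speculative-config is a common source of quoting mistakes. A minimal sketch, assuming a launcher written in Python: it builds the same argument list as the command above, serialising the speculative config with json.dumps. The helper name build_vllm_argv is hypothetical; the flags and values are the ones shown in this section.

```python
import json
import shlex


def build_vllm_argv(model: str, tp_size: int, served_name: str) -> list[str]:
    """Build the argv for `vllm serve` with the flags used in this cookbook."""
    # Serialise the config with json.dumps instead of hand-writing the quotes.
    spec = json.dumps({"method": "mtp", "num_speculative_tokens": 1})
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--speculative-config", spec,
        "--tool-call-parser", "glm47",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
        "--served-model-name", served_name,
    ]


if __name__ == "__main__":
    argv = build_vllm_argv("zai-org/GLM-4.7-FP8", 4, "glm-4.7-fp8")
    # shlex.join produces a shell-quoted string that can be pasted or logged.
    print(shlex.join(argv))
```

Passing the list directly to subprocess.run avoids the shell altogether, so the JSON reaches vLLM as a single argument.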
3. Kubernetes Deployment template¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-glm-4-7
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-glm-4-7 }
  template:
    metadata:
      labels: { app: vllm-glm-4-7 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-or-newer
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-nightly-or-main-tag>
          args:
            - --model=zai-org/GLM-4.7-FP8
            - --tensor-parallel-size=4
            - '--speculative-config={"method": "mtp", "num_speculative_tokens": 1}'
            - --tool-call-parser=glm47
            - --reasoning-parser=glm45
            - --enable-auto-tool-choice
            - --served-model-name=glm-4.7-fp8
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 4 }
            requests: { nvidia.com/gpu: 4 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-glm-4-7
  namespace: serving
spec:
  selector: { app: vllm-glm-4-7 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
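One easy mistake when editing this manifest is changing --tensor-parallel-size without changing the nvidia.com/gpu limit, or the reverse. The sketch below is a small lint for that, assuming a single-node replica with no pipeline parallelism, so the GPU count must equal the tensor-parallel size. It works on values already extracted from the manifest; the function names are hypothetical.

```python
def tensor_parallel_size(args: list[str]) -> int:
    """Read --tensor-parallel-size from '--flag=value' style container args."""
    for arg in args:
        if arg.startswith("--tensor-parallel-size="):
            return int(arg.split("=", 1)[1])
    return 1  # vLLM's default when the flag is absent


def gpu_request_matches_tp(args: list[str], gpu_limit: int) -> bool:
    """True when the pod's GPU limit equals the tensor-parallel size."""
    return tensor_parallel_size(args) == gpu_limit


if __name__ == "__main__":
    container_args = ["--model=zai-org/GLM-4.7-FP8", "--tensor-parallel-size=4"]
    print(gpu_request_matches_tp(container_args, gpu_limit=4))
```

Running a check like this in CI catches the mismatch before the pod is scheduled and fails at engine start-up.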
4. Thinking-mode smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-fp8",
    "messages": [{"role": "user", "content": "Write a minimal Python function that topologically sorts a DAG."}],
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens": 2048
  }'
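To turn the request above into an automated check, the JSON response can be inspected for both a reasoning trace and a final answer. This is a minimal sketch that operates on an already-parsed response body. It assumes the reasoning parser places the trace in a message field named reasoning_content or reasoning; the field name has varied across vLLM versions, so confirm it against your pinned build. The function name is hypothetical.

```python
def check_thinking_response(response: dict) -> list[str]:
    """Return a list of problems found in a chat completion response."""
    choices = response.get("choices") or []
    if not choices:
        return ["no choices in response"]
    message = choices[0].get("message") or {}
    problems = []
    # Assumption: the field name depends on the vLLM version in use.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    if not reasoning:
        problems.append("no reasoning field; check --reasoning-parser glm45")
    if not message.get("content"):
        problems.append("empty final content; max_tokens may be too low")
    return problems


if __name__ == "__main__":
    sample = {"choices": [{"message": {
        "reasoning_content": "Use Kahn's algorithm.",
        "content": "def toposort(graph): ...",
    }}]}
    print(check_thinking_response(sample))
```

An empty list means the response carried both parts; any other result names what to investigate first.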
5. Tool-call smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-fp8",
    "messages": [{"role": "user", "content": "Use the search_docs tool to find CUDA driver upgrade instructions."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'
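The response to the request above can be checked mechanically for a structured tool call. The sketch below assumes the OpenAI-compatible response shape, where message.tool_calls is a list and each entry carries function.name plus function.arguments as a JSON-encoded string. It validates against the search_docs schema used in the request; the function name check_tool_call is hypothetical.

```python
import json


def check_tool_call(response: dict, expected_name: str = "search_docs") -> list[str]:
    """Return a list of problems found in a tool-call response."""
    choices = response.get("choices") or []
    if not choices:
        return ["no choices in response"]
    message = choices[0].get("message") or {}
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        return ["no structured tool_calls; the call may have leaked into content"]
    function = tool_calls[0].get("function") or {}
    problems = []
    if function.get("name") != expected_name:
        problems.append(f"unexpected tool name: {function.get('name')!r}")
    try:
        # In the OpenAI-compatible format, arguments is a JSON string.
        arguments = json.loads(function.get("arguments") or "")
    except (json.JSONDecodeError, TypeError):
        return problems + ["arguments are not a valid JSON string"]
    if not isinstance(arguments, dict) or not isinstance(arguments.get("query"), str):
        problems.append("arguments lack a string 'query'")
    return problems


if __name__ == "__main__":
    sample = {"choices": [{"message": {"tool_calls": [{
        "type": "function",
        "function": {"name": "search_docs",
                     "arguments": "{\"query\": \"CUDA driver upgrade\"}"},
    }]}}]}
    print(check_tool_call(sample))
```

The same function can serve as the regression test to rerun whenever the vLLM image changes.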
Pass criteria:
- The model loads on the pinned vLLM image.
- Tool calls are structured, not plain text.
- Reasoning metadata is present when the client requests/parses it.
- TTFT (time to first token) and TPOT (time per output token) are recorded separately for coding and tool-call routes.
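For the latency criterion, both metrics can be derived from timestamps captured while reading a streamed response. A minimal sketch, assuming one timestamp per received token: TTFT is the gap between sending the request and the first token, and TPOT is the mean gap between subsequent tokens. Streamed chunks do not always map one-to-one to tokens, so treat TPOT computed from chunk arrival times as an approximation. The function name is hypothetical.

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute (TTFT, TPOT) in seconds from a request start and token arrival times."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        # A single token has no inter-token gap to average.
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Timestamps in seconds, e.g. from time.perf_counter() around a stream.
    ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Recording the two values per route keeps a slow prefill on long coding prompts from being averaged away by fast decode on short tool-call turns.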
Failure modes¶
- glm4_moe unsupported: vLLM image is too old; use a tested main/nightly build.
- Tool calls not parsed: missing --enable-auto-tool-choice or --tool-call-parser glm47.
- Reasoning content missing: missing --reasoning-parser glm45 or client ignores the extra field.
- OOM with full context: cap max context, lower concurrency, or increase TP size.
- Regression after image update: nightly/main images can change parser behaviour; pin digest and run a tool-call regression test before rollout.
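The digest-pinning advice in the last item can be enforced with a small check in CI. The sketch below assumes images are referenced in the usual repo@sha256:<64 hex characters> form and rejects tag-only references such as a floating nightly tag. The function name and the example references are illustrative.

```python
import re

# A digest-pinned reference ends with @sha256: followed by 64 hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True when the image reference is pinned to an immutable digest."""
    return bool(_DIGEST.search(image))


if __name__ == "__main__":
    for ref in ["vllm/vllm-openai:nightly",
                "vllm/vllm-openai@sha256:" + "0" * 64]:
        print(ref[:40], is_digest_pinned(ref))
```

A tag can be repointed to a new build at any time, while a digest cannot, which is why the check accepts only the latter.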
References¶
- GLM-4.7-FP8 model card: https://huggingface.co/zai-org/GLM-4.7-FP8
- GLM-4.5/4.6/4.7 repository and deployment notes: https://github.com/zai-org/GLM-4.5
- GLM technical report: https://arxiv.org/abs/2508.06471
- GLM-5.2 model card (current flagship, MIT, ~1M context): https://huggingface.co/zai-org/GLM-5.2
- vLLM MTP speculative decoding (--speculative-config JSON form): https://docs.vllm.ai/en/latest/features/speculative_decoding/mtp/
- vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/
- vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
Related: GLM-5.2 (current flagship) · Inference serving · Open-weight serving · Workload recipes · SLO/SLI catalog · Security