Cookbook: serve GLM-4.7-FP8 with vLLM

Scope: a vLLM reference template for serving zai-org/GLM-4.7-FP8: what GLM-4.7 is, why and when to use it for coding and agentic workloads, how to launch the current vLLM command with reasoning/tool parsers, how to wrap it in Kubernetes, and how to verify thinking mode and tool calls.

Reference template. Commands and manifests are not executed here. The GLM-4.7 model card states that vLLM and SGLang support GLM-4.7 on their main branches; pin and test a nightly/main-derived image instead of assuming an older stable vLLM image works.

flowchart LR
  PROMPT["Coding / agent prompt"] --> VLLM["vLLM server"]
  VLLM --> MTP["MTP speculative config"]
  VLLM --> PARSERS["glm47 tool parser / glm45 reasoning parser"]
  PARSERS --> API["OpenAI-compatible response"]

What

GLM-4.7 is Z.ai's coding, reasoning, and agentic model family. The FP8 checkpoint (zai-org/GLM-4.7-FP8) is a 358B-total, ~32B-active (A32B) MoE release under MIT, with up to 131,072 output tokens and interleaved, preserved, and turn-level thinking modes. It succeeds GLM-4.6 and GLM-4.5. Z.ai has since shipped GLM-5 and GLM-5.2 (MIT, ~1M context), which outperform 4.7; choose this checkpoint only when you specifically need GLM-4.7, and check the current flagship before committing.

Why

Use GLM-4.7-FP8 when the product needs an open-weight coding/agent model with tool-use support and a permissive license. It is a better fit than general chat models for coding assistants, terminal agents, browser/search agents, and long-horizon tool workflows.

When

Use it when:

  • Coding, repository editing, terminal tasks, or agent tool use dominate.
  • The platform can run a vLLM build with GLM-4.7 support.
  • A permissive MIT model license matters.
  • The request path can expose or consume reasoning/tool metadata.

Avoid it when:

  • The serving platform only allows conservative stable images and cannot validate nightly vLLM.
  • The workload is pure short chat and a smaller dense model satisfies the SLO.
  • Tool-use parsing is not wired through the client stack.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | zai-org/GLM-4.7-FP8 |
| Params | 358B total / ~32B active (A32B) MoE |
| Precision | FP8 / compressed tensors |
| License | MIT |
| vLLM build | main/nightly-derived, per model card |
| Starting hardware | 4-8 H100/H200/B200 GPUs, then benchmark context/concurrency |

The official GLM examples use --tensor-parallel-size 4 for GLM-4.7-FP8. Use that as a smoke-test baseline, then raise the GPU count or context length after measuring KV-cache capacity.

A note on flag form: vLLM documents --speculative-config as a single JSON string; recent vLLM also accepts the dotted --speculative-config.method form used in some official recipes. Use whichever form your pinned vLLM version documents.
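Before benchmarking, the KV-cache capacity mentioned in this step can be roughly estimated by hand. The sketch below is a back-of-the-envelope calculation, not vLLM's own accounting. The function names are hypothetical, and the layer count, KV-head count, head dimension, and free-memory figure in the usage lines are placeholders, not confirmed GLM-4.7 values. Read the real numbers from the checkpoint's config.json and from the vLLM startup log. It also assumes the KV heads split evenly across tensor-parallel ranks.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    dtype_bytes is 2 for a 16-bit cache and 1 for an 8-bit cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes_per_gpu: int, tensor_parallel_size: int,
                      bytes_per_token: int) -> int:
    """Tokens that fit if the cache is sharded evenly across TP ranks."""
    return (free_bytes_per_gpu * tensor_parallel_size) // bytes_per_token


if __name__ == "__main__":
    # Placeholder architecture values (assumptions); replace with config.json.
    per_token = kv_cache_bytes_per_token(num_layers=92, num_kv_heads=8, head_dim=128)
    # Placeholder: 20 GiB per GPU left for the cache after weights load.
    capacity = max_cached_tokens(20 * 2**30, tensor_parallel_size=4,
                                 bytes_per_token=per_token)
    print(f"{per_token} bytes/token, about {capacity} cacheable tokens")
```

Dividing the token capacity by the target context length gives a rough upper bound on how many full-length requests can be resident at once, which is the number to compare against the measured concurrency.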

2. Bare-metal vLLM server

vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-fp8
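
When the launch command is generated by a script, hand-written JSON quoting in --speculative-config is an easy place to make a mistake. The following is a minimal sketch that assembles the same arguments as the command above and serializes the speculative config with json.dumps. The helper name build_vllm_command is hypothetical; the flags are the ones already used in this section.

```python
import json
import shlex


def build_vllm_command(model: str, tensor_parallel_size: int,
                       num_speculative_tokens: int, served_name: str) -> list[str]:
    """Return the argv list for the launch command shown above."""
    # json.dumps guarantees valid JSON regardless of shell quoting.
    spec = json.dumps({"method": "mtp",
                       "num_speculative_tokens": num_speculative_tokens})
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--speculative-config", spec,
        "--tool-call-parser", "glm47",
        "--reasoning-parser", "glm45",
        "--enable-auto-tool-choice",
        "--served-model-name", served_name,
    ]


if __name__ == "__main__":
    argv = build_vllm_command("zai-org/GLM-4.7-FP8", 4, 1, "glm-4.7-fp8")
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list directly to subprocess.run avoids the shell entirely; shlex.join is only needed when the command is printed or written to a script.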

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-glm-4-7
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-glm-4-7 }
  template:
    metadata:
      labels: { app: vllm-glm-4-7 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-or-newer
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-nightly-or-main-tag>
          args:
            - --model=zai-org/GLM-4.7-FP8
            - --tensor-parallel-size=4
            - '--speculative-config={"method": "mtp", "num_speculative_tokens": 1}'
            - --tool-call-parser=glm47
            - --reasoning-parser=glm45
            - --enable-auto-tool-choice
            - --served-model-name=glm-4.7-fp8
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 4 }
            requests: { nvidia.com/gpu: 4 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-glm-4-7
  namespace: serving
spec:
  selector: { app: vllm-glm-4-7 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Thinking-mode smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-fp8",
    "messages": [{"role": "user", "content": "Write a minimal Python function that topologically sorts a DAG."}],
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens": 2048
  }'
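
To check the response to the request above, a client might separate the reasoning text from the final answer as sketched here. This assumes the server returns reasoning in a message field named reasoning_content or reasoning; which name appears depends on the pinned vLLM version, so confirm it against a real response. The names split_reasoning and REASONING_KEYS are hypothetical.

```python
from typing import Any, Optional

# Assumed field names for parsed reasoning; verify on the pinned vLLM build.
REASONING_KEYS = ("reasoning_content", "reasoning")


def split_reasoning(response: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from a chat-completions response dict."""
    message = response["choices"][0]["message"]
    # Take the first non-empty reasoning field, if any is present.
    reasoning = next((message[k] for k in REASONING_KEYS if message.get(k)), None)
    return reasoning, message.get("content")


if __name__ == "__main__":
    sample = {"choices": [{"message": {
        "role": "assistant",
        "reasoning_content": "Kahn's algorithm fits here.",
        "content": "def topo_sort(graph): ...",
    }}]}
    reasoning, answer = split_reasoning(sample)
    print("reasoning present:", reasoning is not None)
    print("answer:", answer)
```

If the reasoning value is None while the answer contains raw thinking markup, the reasoning parser is probably not active on the server.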

5. Tool-call smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-fp8",
    "messages": [{"role": "user", "content": "Use the search_docs tool to find CUDA driver upgrade instructions."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'
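
A regression check for the request above can assert that the call arrives as structured data. The sketch below assumes the OpenAI-compatible response shape, where each entry in tool_calls carries a function name and a JSON-encoded arguments string. The function names are hypothetical, and search_docs is the tool defined in the request.

```python
import json
from typing import Any


def extract_tool_calls(response: dict[str, Any]) -> list[tuple[str, Any]]:
    """Return (name, parsed_arguments) for every tool call in the response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # The arguments field is a JSON string, not an object.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


def has_valid_search_docs_call(response: dict[str, Any]) -> bool:
    """True if a search_docs call with a string 'query' argument is present."""
    try:
        calls = extract_tool_calls(response)
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return any(
        name == "search_docs"
        and isinstance(args, dict)
        and isinstance(args.get("query"), str)
        for name, args in calls
    )
```

A response that only mentions the tool in plain content fails this check, which is the symptom of a missing --enable-auto-tool-choice or --tool-call-parser flag.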

Pass criteria:

  • The model loads on the pinned vLLM image.
  • Tool calls are structured, not plain text.
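  • Reasoning text arrives in its own message field, separate from the answer content, when the reasoning parser is enabled and the client reads that field.
  • TTFT (time to first token) and TPOT (time per output token) are recorded separately for coding and tool-call routes.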
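
The latency criterion above can be computed from three timestamps and the output token count. This is a minimal sketch using the common definitions: TTFT is the delay until the first token, and TPOT is the remaining generation time averaged over the tokens after the first. The function name is hypothetical, and the output token count is assumed to come from the response usage data instead of counting streamed chunks, since one chunk may carry several tokens when speculative decoding is on.

```python
def ttft_and_tpot(request_sent: float, first_token_at: float,
                  last_token_at: float, output_tokens: int) -> tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs."""
    if output_tokens < 1:
        raise ValueError("at least one output token is required")
    ttft = first_token_at - request_sent
    if output_tokens == 1:
        # No inter-token interval exists for a single-token response.
        return ttft, 0.0
    tpot = (last_token_at - first_token_at) / (output_tokens - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Timestamps in seconds, e.g. from time.perf_counter() around a streamed call.
    ttft, tpot = ttft_and_tpot(0.0, 0.5, 2.5, 101)
    print(f"TTFT={ttft:.3f}s TPOT={tpot * 1000:.1f}ms")
```

Recording the pair per route keeps a slow prefill on long coding prompts from being hidden by fast decoding on short tool-call turns.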

Failure modes

  • glm4_moe unsupported: vLLM image is too old; use a tested main/nightly build.
  • Tool calls not parsed: missing --enable-auto-tool-choice or --tool-call-parser glm47.
  • Reasoning content missing: missing --reasoning-parser glm45 or client ignores the extra field.
  • OOM with full context: cap max context (--max-model-len), lower concurrency (--max-num-seqs), or increase TP size.
  • Regression after image update: nightly/main images can change parser behaviour; pin digest and run a tool-call regression test before rollout.

References

  • GLM-4.7-FP8 model card: https://huggingface.co/zai-org/GLM-4.7-FP8
  • GLM-4.5/4.6/4.7 repository and deployment notes: https://github.com/zai-org/GLM-4.5
  • GLM technical report: https://arxiv.org/abs/2508.06471
  • GLM-5.2 model card (current flagship, MIT, ~1M context): https://huggingface.co/zai-org/GLM-5.2
  • vLLM MTP speculative decoding (--speculative-config JSON form): https://docs.vllm.ai/en/latest/features/speculative_decoding/mtp/
  • vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/
  • vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/

Related: GLM-5.2 (current flagship) · Inference serving · Open-weight serving · Workload recipes · SLO/SLI catalog · Security