Cookbook: serve Qwen3-235B-A22B with vLLM

Scope: a vLLM reference template for serving Qwen/Qwen3-235B-A22B-Instruct-2507: what the model is, why and when to use it, how to deploy the 235B/22B MoE endpoint, how to handle 256K and optional 1M context modes, and how to verify sampling, tool use, and KV-cache sizing.

Reference template. Commands and manifests are not executed here. Qwen's model card recommends vLLM >= 0.8.5 for the base 262K checkpoint; pin the exact vLLM image and model revision before production. The optional 1M-context route (step 5) is different: it depends on vLLM's V0 engine, which was removed in late 2025, so it needs a pinned legacy build rather than a current image.

flowchart LR
  INPUT["Prompt"] --> VLLM["vLLM server"]
  VLLM --> MOE["235B total / 22B active MoE"]
  VLLM --> CTX["256K native context"]
  CTX --> SLO["TTFT budget"]
  MOE --> OUT["Instruction response"]

What

Qwen3-235B-A22B-Instruct-2507 is Alibaba Qwen's updated non-thinking MoE instruct model. The model card lists 235B total parameters, 22B activated parameters, 128 experts with 8 activated per token, native 262,144-token context, and Apache-2.0 licensing. It also documents an optional path to approximately 1M context with config changes and a custom attention backend.

Why

Use Qwen3-235B-A22B-Instruct-2507 when the product needs a strong general-purpose open-weight model with broad language, coding, math, and tool-use capability, and responses should not carry thinking traces. Apache-2.0 licensing and broad Qwen ecosystem support make it a good default for internal platforms.

When

Use it when:

  • The endpoint needs a balanced model for chat, coding, long documents, and tool use.
  • Apache-2.0 license compatibility is important.
  • Native 256K context is useful, and any 1M-context traffic can be isolated to a separate route.

Avoid it when:

  • The workload is latency-sensitive short chat that a 30B model can satisfy.
  • The platform cannot allocate 8 GPUs for the 235B checkpoint.
  • The product requires explicit chain-of-thought output by default; this specific checkpoint is non-thinking.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | Qwen/Qwen3-235B-A22B-Instruct-2507 |
| Params | 235B total / 22B activated |
| Context | 262,144 native; optional ~1,010,000 with config swap |
| License | Apache-2.0 |
| vLLM version | >= 0.8.5 for the base 262K checkpoint; the optional 1M route needs a legacy V0-era build (step 5) |
| Starting hardware | 8 GPUs for the full checkpoint; reduce context first on OOM |
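
The "reduce context first" advice can be made concrete with a rough KV-cache estimate. The sketch below is an approximation under stated assumptions: 94 layers, 4 KV heads (GQA), a head dimension of 128, and a 2-byte KV dtype, all of which should be confirmed against the checkpoint's config.json. It also assumes each tensor-parallel rank holds at least one KV head, so heads are replicated once the TP size exceeds the KV head count. The function names are illustrative, and the result ignores vLLM's block and activation overhead.

```python
# Rough per-GPU KV-cache estimate. The architecture values are assumptions
# to confirm against the checkpoint's config.json before relying on them.
NUM_LAYERS = 94      # assumed num_hidden_layers
NUM_KV_HEADS = 4     # assumed num_key_value_heads (GQA)
HEAD_DIM = 128       # assumed head_dim
KV_DTYPE_BYTES = 2   # assumed BF16/FP16 KV cache


def kv_bytes_per_token_per_gpu(tp_size: int) -> int:
    """KV-cache bytes one token occupies on each GPU, summed over layers.

    Assumes every rank holds at least one KV head, i.e. heads are
    replicated once tp_size exceeds the KV head count.
    """
    heads_on_gpu = max(1, NUM_KV_HEADS // tp_size)
    # Factor 2 covers the key tensor and the value tensor.
    return 2 * NUM_LAYERS * heads_on_gpu * HEAD_DIM * KV_DTYPE_BYTES


def kv_gib_per_gpu(tokens: int, tp_size: int) -> float:
    """GiB of KV cache per GPU for `tokens` cached tokens."""
    return tokens * kv_bytes_per_token_per_gpu(tp_size) / 2**30


def full_length_seqs(kv_capacity_tokens: int, max_model_len: int) -> int:
    """Whole sequences at full context that fit in a given token capacity."""
    return kv_capacity_tokens // max_model_len


for ctx in (32_768, 262_144):
    gib = kv_gib_per_gpu(ctx, tp_size=8)
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache per GPU")
```

Under these assumptions a single 262,144-token sequence takes about 11.75 GiB of KV cache on each of the 8 GPUs, against roughly 1.5 GiB at 32,768 tokens. `full_length_seqs` is meant to be fed the KV-cache capacity, in tokens, that the server reports at startup.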

2. Bare-metal vLLM server

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --served-model-name qwen3-235b-a22b \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90

If the first run runs out of memory, keep the same model and lower the context length before changing precision or topology:

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --served-model-name qwen3-235b-a22b \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-235b
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen3-235b }
  template:
    metadata:
      labels: { app: vllm-qwen3-235b }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=Qwen/Qwen3-235B-A22B-Instruct-2507
            - --served-model-name=qwen3-235b-a22b
            - --tensor-parallel-size=8
            - --max-model-len=32768
            - --gpu-memory-utilization=0.85
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen3-235b
  namespace: serving
spec:
  selector: { app: vllm-qwen3-235b }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Smoke-test the endpoint

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3-235b-a22b",
    "messages": [{"role": "user", "content": "Summarize why tensor parallelism is usually kept inside one NVLink domain."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_tokens": 1024
  }'
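
The same request can be built from Python. This is a minimal standard-library sketch: `build_chat_payload` and `post_chat` are hypothetical helper names, and the base URL is whatever the Service resolves to in your cluster. It places `top_k` at the top level of the JSON body, on the assumption that this is where vLLM's OpenAI-compatible server reads non-OpenAI sampling parameters. The `extra_body` wrapper is an OpenAI Python client convention that the client merges into the body before sending.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, model: str = "qwen3-235b-a22b") -> dict:
    """Chat-completions body using the sampling values from the smoke test."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,          # top-level field in a raw HTTP request
        "max_tokens": 1024,
    }


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and parse the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `post_chat("http://<service-url>:8000", build_chat_payload("ping"))` returns the parsed response; the first choice's message content is the text to inspect.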

5. Optional 1M-context route

The Qwen card documents a 1M-token path that swaps config.json for config_1m.json and launches vLLM with Dual Chunk Flash Attention. Treat this as a separate endpoint with low concurrency.

Currency warning: this path sets VLLM_USE_V1=0 and uses the DUAL_CHUNK_FLASH_ATTN backend. vLLM removed the V0 engine (and the VLLM_USE_V1 switch) in late 2025, so on a current image the command below fails outright. Run the 1M route only on a pinned legacy vLLM build (V0-era, roughly <= 0.9.x), on a separate endpoint from the base 262K service, or use whatever current long-context path the card documents at deploy time.

export MODELNAME=Qwen3-235B-A22B-Instruct-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json

VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-235B-A22B-Instruct-2507 \
  --served-model-name qwen3-235b-a22b-1m \
  --tensor-parallel-size 8 \
  --max-model-len 1010000 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85
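
One way to keep the 1M endpoint isolated is to route on request length in front of both services. The sketch below is hypothetical. The route names reuse the --served-model-name values from the two launch commands, the limits reuse their --max-model-len values, and the prompt token count is assumed to come from the caller's own tokenizer.

```python
BASE_ROUTE = "qwen3-235b-a22b"       # served name of the base endpoint
LONG_ROUTE = "qwen3-235b-a22b-1m"    # served name of the 1M endpoint
BASE_MAX_LEN = 262_144               # --max-model-len of the base endpoint
LONG_MAX_LEN = 1_010_000             # --max-model-len of the 1M endpoint


def pick_route(prompt_tokens: int, max_new_tokens: int,
               base_max_len: int = BASE_MAX_LEN) -> str:
    """Return the served model name that can hold prompt plus generation."""
    needed = prompt_tokens + max_new_tokens
    if needed <= base_max_len:
        return BASE_ROUTE
    if needed <= LONG_MAX_LEN:
        return LONG_ROUTE
    raise ValueError(f"{needed} tokens exceeds the 1M route limit of {LONG_MAX_LEN}")
```

If the base service runs with a reduced context such as 32768, pass that value as `base_max_len` so that longer requests go to the long route instead of failing on the base one.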

Pass criteria:

  • The startup log of the 32K/256K service reports enough KV-cache capacity for the target concurrency.
  • The 1M service is isolated to --max-num-seqs 1 until measured.
  • Outputs do not include <think> blocks for this non-thinking checkpoint.
  • Tool-calling tests use Qwen's documented function-calling path if exposed to agents.
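
The criterion on `<think>` blocks can be checked mechanically over a batch of sampled responses. This is a minimal sketch that assumes the response texts are already collected as strings; the helper names are hypothetical. It flags a stray opening or closing tag as well as a complete block.

```python
import re

# Matches <think> or </think>, case-insensitively.
_THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def contains_think_tag(text: str) -> bool:
    """True if the text contains an opening or closing think tag."""
    return _THINK_TAG.search(text) is not None


def failing_outputs(outputs: list[str]) -> list[int]:
    """Indices of the responses that contain a think tag."""
    return [i for i, text in enumerate(outputs) if contains_think_tag(text)]
```

An empty list from `failing_outputs` means the batch passes. Prompts that ask the model to discuss the tag itself will show up as false positives, so exclude them from the sample.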

Failure modes

  • OOM at 256K - lower --max-model-len to 32K and re-measure before changing the model.
  • Unexpected thinking traces - wrong checkpoint or chat template; this 2507 instruct model is documented as non-thinking.
  • 1M route starves normal traffic - isolate it behind a separate Service and HPA.
  • FP8 tensor-parallel mismatch on quantized variants - Qwen docs note block-wise FP8 can require lower TP or expert parallelism.
  • Long prompt cost hidden by aggregate latency - split TTFT and TPOT in metrics.
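
Splitting TTFT from TPOT comes down to reporting two numbers per request instead of one. This is a minimal sketch that assumes the client records the send time and the arrival time of every streamed token. The function and argument names are hypothetical, and TPOT is taken here as the mean gap between output tokens after the first.

```python
from typing import Optional, Sequence, Tuple


def ttft_and_tpot(request_sent: float,
                  token_times: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    TTFT is the delay until the first token arrives. TPOT is the mean
    interval between later tokens, or None if only one token arrived.
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, None
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot
```

A long prompt mostly inflates the first number and leaves the second almost unchanged. An aggregate latency metric hides that difference.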

References

  • Qwen3-235B-A22B-Instruct-2507 model card: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
  • Qwen vLLM deployment guide: https://qwen.readthedocs.io/en/latest/deployment/vllm.html
  • Qwen3 technical report: https://arxiv.org/abs/2505.09388
  • Qwen2.5-1M technical report: https://arxiv.org/abs/2501.15383
  • vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
  • vLLM V0 engine removal (breaks the 1M VLLM_USE_V1=0 path): https://github.com/vllm-project/vllm/issues/18571
