Cookbook: serve Qwen3-235B-A22B with vLLM
Scope: a vLLM reference template for serving Qwen/Qwen3-235B-A22B-Instruct-2507: what the model is, why and when to use it, how to deploy the 235B/22B MoE endpoint, how to handle 256K and optional 1M context modes, and how to verify sampling, tool use, and KV-cache sizing.
Reference template. Commands and manifests are not executed here. Qwen's model card recommends vLLM >= 0.8.5 for the base 262K checkpoint; pin the exact vLLM image and model revision before production. The optional 1M-context route (step 5) is different: it depends on vLLM's V0 engine, which was removed in late 2025, so it needs a pinned legacy build rather than a current image.
flowchart LR
INPUT["Prompt"] --> VLLM["vLLM server"]
VLLM --> MOE["235B total / 22B active MoE"]
VLLM --> CTX["256K native context"]
CTX --> SLO["TTFT budget"]
MOE --> OUT["Instruction response"]
What
Qwen3-235B-A22B-Instruct-2507 is Alibaba Qwen's updated non-thinking MoE instruct model. The model card lists 235B total parameters, 22B activated parameters, 128 experts with 8 activated per token, native 262,144-token context, and Apache-2.0 licensing. It also documents an optional path to approximately 1M context with config changes and a custom attention backend.
Why
Use Qwen3-235B-A22B when the product needs a strong general-purpose open-weight model with broad language, coding, math, and tool-use capability, and does not want thinking traces in the output. Apache-2.0 licensing and broad Qwen ecosystem support make it a good default for internal platforms.
When
Use it when:
- The endpoint needs a balanced model for chat, coding, long documents, and tool use.
- Apache-2.0 license compatibility is important.
- Native 256K (262,144-token) context is useful, and any 1M-context traffic can be isolated to a separate route.
Avoid it when:
- The workload is latency-sensitive short chat that a 30B model can satisfy.
- The platform cannot allocate 8 GPUs for the 235B checkpoint.
- The product requires explicit chain-of-thought output by default; this specific checkpoint is non-thinking.
How
1. Size the replica
| Item | Practical note |
|---|---|
| Model | Qwen/Qwen3-235B-A22B-Instruct-2507 |
| Params | 235B total / 22B activated |
| Context | 262,144 native; optional ~1,010,000 with config swap |
| License | Apache-2.0 |
| vLLM version | >= 0.8.5 for the base 262K checkpoint; the optional 1M route needs a legacy V0-era build (step 5) |
| Starting hardware | 8 GPUs for the full checkpoint; reduce context first on OOM |
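The sizing table can be backed by rough KV-cache arithmetic. The sketch below is illustrative only: the architecture values (94 layers, 4 KV heads, head dimension 128) are assumptions to verify against the checkpoint's config.json, the 2-byte dtype assumes an unquantized BF16/FP16 KV cache, and the rule that each tensor-parallel rank keeps at least one KV head (so heads are replicated when the TP size exceeds the KV head count) is an assumption about vLLM's behavior rather than something stated in the source.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2, tp_size=1):
    """KV-cache bytes one token occupies on a single tensor-parallel rank."""
    # Each layer stores one key and one value vector per KV head.
    # Assumption: a rank always holds at least one KV head, so heads are
    # replicated (not split further) when tp_size > num_kv_heads.
    heads_per_rank = max(1, num_kv_heads // tp_size)
    return 2 * num_layers * heads_per_rank * head_dim * dtype_bytes

# Assumed architecture values; check them against config.json before relying on them.
NUM_LAYERS = 94
NUM_KV_HEADS = 4
HEAD_DIM = 128
CONTEXT_TOKENS = 262_144

logical_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
per_rank_per_token = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, tp_size=8)

full_context_gib = logical_per_token * CONTEXT_TOKENS / GIB
per_rank_full_context_gib = per_rank_per_token * CONTEXT_TOKENS / GIB

# Unquantized weights at 2 bytes per parameter, in decimal gigabytes.
weights_gb = 235e9 * 2 / 1e9

print(f"KV bytes per token (logical): {logical_per_token}")
print(f"KV for one {CONTEXT_TOKENS}-token sequence: {full_context_gib:.2f} GiB")
print(f"KV per rank at TP=8 for that sequence: {per_rank_full_context_gib:.2f} GiB")
print(f"Weights at 2 bytes/param: {weights_gb:.0f} GB")
```

Under these assumptions a single full-length sequence costs about 47 GiB of KV cache in total, or about 11.75 GiB per GPU at tensor-parallel size 8, on top of roughly 470 GB of weights. That is why lowering the context length is the first lever on OOM. Treat the numbers as a planning estimate and compare them with the KV-cache capacity vLLM reports at startup.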
2. Bare-metal vLLM server
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
--served-model-name qwen3-235b-a22b \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--gpu-memory-utilization 0.90
If the first run OOMs, preserve the same model and lower context before changing precision or topology:
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
--served-model-name qwen3-235b-a22b \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85
3. Kubernetes Deployment template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-235b
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-qwen3-235b }
  template:
    metadata:
      labels: { app: vllm-qwen3-235b }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=Qwen/Qwen3-235B-A22B-Instruct-2507
            - --served-model-name=qwen3-235b-a22b
            - --tensor-parallel-size=8
            - --max-model-len=32768
            - --gpu-memory-utilization=0.85
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen3-235b
  namespace: serving
spec:
  selector: { app: vllm-qwen3-235b }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
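A rollout or CI script can mirror the manifest's readiness probe by polling the same /health path before sending traffic. The following is a minimal sketch using only the Python standard library; the function names, the 30-minute default deadline, and the injectable probe, sleep, and clock parameters are illustrative choices, not part of vLLM or Kubernetes.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 503 while loading).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: the pod is not listening yet.
        return None

def wait_until_ready(url, deadline_s=1800.0, interval_s=10.0,
                     probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until it returns 200 or `deadline_s` seconds have elapsed."""
    start = clock()
    while clock() - start < deadline_s:
        if probe(url) == 200:
            return True
        sleep(interval_s)
    return False

# Hypothetical usage against the Service defined above:
#   wait_until_ready("http://vllm-qwen3-235b.serving.svc:8000/health")
```

Loading a checkpoint of this size can take much longer than the probe's initial delay, so a generous deadline with a short polling interval is usually the safer default for scripts like this.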
4. Smoke test with recommended sampling
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "qwen3-235b-a22b",
"messages": [{"role": "user", "content": "Summarize why tensor parallelism is usually kept inside one NVLink domain."}],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 1024,
"top_k": 20
}'
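The same smoke test can be scripted. The sketch below builds an equivalent request with the Python standard library, using the sampling values from this step and the served model name from step 2. It sends top_k as a top-level JSON field, which vLLM's OpenAI-compatible server accepts as an extra sampling parameter. The helper names and the base URL are hypothetical.

```python
import json
import urllib.request

def build_chat_request(base_url, prompt, model="qwen3-235b-a22b"):
    """Build a POST request for /v1/chat/completions with the recommended sampling."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,  # top-level field in the raw HTTP body
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout_s=120.0):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Hypothetical usage:
#   req = build_chat_request("http://localhost:8000", "Say hello.")
#   print(send(req))
```

Keeping request construction separate from sending makes the payload easy to assert on in tests without a running server.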
5. Optional 1M-context route
The Qwen card documents a 1M-token path that swaps config.json for config_1m.json and launches vLLM with Dual Chunk Flash Attention. Treat this as a separate endpoint with low concurrency.
Currency warning: this path sets VLLM_USE_V1=0 and uses the DUAL_CHUNK_FLASH_ATTN backend. vLLM removed the V0 engine (and the VLLM_USE_V1 switch) in late 2025, so on a current image the command below fails outright. Run the 1M route only on a pinned legacy vLLM build (V0-era, roughly <= 0.9.x), on a separate endpoint from the base 262K service, or use whatever current long-context path the card documents at deploy time.
export MODELNAME=Qwen3-235B-A22B-Instruct-2507
huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
vllm serve ./Qwen3-235B-A22B-Instruct-2507 \
--served-model-name qwen3-235b-a22b-1m \
--tensor-parallel-size 8 \
--max-model-len 1010000 \
--enable-chunked-prefill \
--max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1 \
--gpu-memory-utilization 0.85
Pass criteria:
- The 32K/256K service logs enough KV-cache capacity for target concurrency.
- The 1M service is isolated to --max-num-seqs 1 until measured.
- Outputs do not include <think> blocks for this non-thinking checkpoint.
- Tool-calling tests use Qwen's documented function-calling path if exposed to agents.
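The check for thinking traces is easy to automate over smoke-test outputs. This is a minimal sketch with a hypothetical helper name; it flags any opening or closing think tag, since a stray closing tag alone can also indicate the wrong checkpoint or chat template.

```python
import re

# Matches <think> or </think>, case-insensitively.
THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)

def has_thinking_trace(text):
    """Return True if the model output contains any think tag."""
    return THINK_TAG.search(text) is not None

samples = [
    "Tensor parallelism is kept inside one NVLink domain to limit all-reduce latency.",
    "<think>Let me reason about this.</think>The answer is 8 GPUs.",
]
for sample in samples:
    print(has_thinking_trace(sample), "-", sample[:40])
```

Running this over a batch of responses from the deployed endpoint gives a simple pass/fail signal for the non-thinking criterion.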
Failure modes
- OOM at 256K - lower --max-model-len to 32K and re-measure before changing model.
- Unexpected thinking traces - wrong checkpoint or chat template; this 2507 instruct model is documented as non-thinking.
- 1M route starves normal traffic - isolate it behind a separate Service and HPA.
- FP8 tensor-parallel mismatch on quantized variants - Qwen docs note block-wise FP8 can require lower TP or expert parallelism.
- Long prompt cost hidden by aggregate latency - split TTFT and TPOT in metrics.
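Splitting TTFT from TPOT, as the last failure mode suggests, needs only the request start time and the arrival time of each streamed token. The sketch below assumes the caller has already recorded those timestamps in seconds (for example while reading a streaming response); the names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class LatencySplit:
    ttft_s: float  # time to first token
    tpot_s: float  # mean time per output token after the first

def split_latency(request_start_s, token_times_s):
    """Split end-to-end latency into TTFT and TPOT from token arrival timestamps."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_start_s
    n = len(token_times_s)
    # TPOT averages the gaps between consecutive tokens; undefined for a single token.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return LatencySplit(ttft_s=ttft, tpot_s=tpot)

# A long prompt: 2.0 s of prefill, then tokens every 0.5 s.
result = split_latency(0.0, [2.0, 2.5, 3.0])
print(result)
```

Reporting the two values separately keeps a slow prefill on long prompts from being averaged away by fast decoding, which is the effect an aggregate latency metric hides.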
References
- Qwen3-235B-A22B-Instruct-2507 model card: https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
- Qwen vLLM deployment guide: https://qwen.readthedocs.io/en/latest/deployment/vllm.html
- Qwen3 technical report: https://arxiv.org/abs/2505.09388
- Qwen2.5-1M technical report: https://arxiv.org/abs/2501.15383
- vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
- vLLM V0 engine removal (breaks the 1M VLLM_USE_V1=0 path): https://github.com/vllm-project/vllm/issues/18571