Cookbook: serve DeepSeek-V3.2-Exp with vLLM
Scope: a vLLM reference template for serving deepseek-ai/DeepSeek-V3.2-Exp, DeepSeek's sparse-attention model: what DeepSeek Sparse Attention (DSA) changes, why and when to use it for long-context work, how to launch the expert/data-parallel shape on an 8-GPU H200/B200 node (including the required DeepGEMM install), and how to verify.
Reference template. Commands and manifests are not executed here. V3.2-Exp is an experimental line and needs a current vLLM with day-0 support plus a specific DeepGEMM build; pin the exact vLLM image, DeepGEMM version, and model revision before production, and validate on one node.
flowchart LR
LONG["Long prompt (tens to hundreds of K tokens)"] --> VLLM["vLLM server"]
VLLM --> DSA["DeepSeek Sparse Attention: lightning indexer selects keys"]
DSA --> MOE["685B MoE, small active fraction"]
MOE --> OUT["Reasoning / answer at lower long-context cost"]
What
DeepSeek-V3.2-Exp is a 685B-total MoE (256 experts) built on the DeepSeek-V3.1 architecture, released under MIT. The headline change is DeepSeek Sparse Attention (DSA): a fine-grained sparse-attention mechanism, where a lightning indexer picks which keys each query attends to, so attention no longer costs O(N^2) over the whole context. That cuts the cost of long-context prefill and decode while tracking V3.1-class quality. vLLM shipped day-0 support. Precision is FP8 block-scale or BF16, and context follows the V3.1 lineage (128K; confirm on the card). The -Exp suffix marks it as an experimental release validating efficiency changes for future DeepSeek models.
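A rough way to see the saving, as a sketch rather than DeepSeek's published cost model: write N for the context length and k for the number of keys the indexer keeps per query. The symbol k is an assumption introduced here; take its actual value from the model card or paper. Per layer, ignoring constant factors and the cost of running the indexer itself:

```latex
\[
C_{\text{dense}}(N) \propto N^{2},
\qquad
C_{\text{DSA}}(N) \propto N \cdot k \quad (k \ll N),
\qquad
\frac{C_{\text{dense}}(N)}{C_{\text{DSA}}(N)} \approx \frac{N}{k}.
\]
```

Under these assumptions the relative saving grows roughly linearly with context length once N exceeds k, and disappears for prompts shorter than k, which is consistent with the guidance to reserve this model for long-context work.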
Why
Use V3.2-Exp when long context is the cost driver: large documents, repository-scale code, or long agent histories. DSA keeps attention sub-quadratic, so time-to-first-token and decode on long prompts are cheaper than the equally capable full-attention DeepSeek-R1 or V3.1 at the same length. It keeps the small-active-MoE compute profile (per-token compute far below the total weight), so throughput behaves like the other DeepSeek MoEs; the specific win is long-context efficiency, not raw tokens per second.
When
Use it when:
- Prompts are long (tens to hundreds of K tokens) and attention is the bottleneck.
- You already run DeepSeek-V3.1 or R1 and want cheaper long context on the same class of node.
- MIT licensing and the DeepSeek ecosystem fit.
Avoid it when:
- Prompts are short: DSA's benefit is long-context, and a dense model is simpler to operate.
- You cannot install the required DeepGEMM build or run a current, day-0-class vLLM.
- You need a settled, non-experimental checkpoint for a conservative platform; this is an -Exp release and a successor may replace it.
How
1. Size the replica
| Item | Practical note |
|---|---|
| Model | deepseek-ai/DeepSeek-V3.2-Exp |
| Params | 685B total, 256 experts (small active fraction per token, V3.1 lineage) |
| Attention | DeepSeek Sparse Attention (DSA) for long context |
| Context | 128K (V3.1 lineage; confirm on the card) |
| Precision | FP8 block-scale or BF16 |
| License | MIT |
| Starting hardware | 8x H200 (141 GB) / B200 (192 GB), Hopper or Blackwell with FP8; H20 (96 GB) is tight for 685B |
| vLLM | current build with day-0 V3.2 support, plus DeepGEMM |
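The hardware row can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes roughly 1 byte per parameter for FP8 weights and ignores block scales, activations, and KV cache, so it only estimates how much aggregate HBM is left once the weights are resident. The function name `headroom_gb` is illustrative.

```shell
#!/usr/bin/env bash
# Back-of-envelope headroom: aggregate HBM minus FP8 weight footprint.
# Assumption: ~1 byte per parameter for FP8, so 685B params ~= 685 GB.
# Ignores block scales, activations, CUDA context, and KV cache.

PARAMS_B=685        # total parameters in billions (from the sizing table)
BYTES_PER_PARAM=1   # FP8 assumption; use 2 for BF16

# headroom_gb <gpu_count> <gb_per_gpu> -> GB left after weights
headroom_gb() {
  local gpus=$1 per_gpu=$2
  echo $(( gpus * per_gpu - PARAMS_B * BYTES_PER_PARAM ))
}

for spec in "H200 141" "B200 192" "H20 96"; do
  set -- $spec   # split "NAME GB" into $1 and $2
  echo "8x $1: $(headroom_gb 8 "$2") GB left for KV cache and activations"
done
```

With these inputs the script reports 443 GB for 8x H200, 851 GB for 8x B200, and 83 GB for 8x H20, which is why the table calls H20 tight. Real headroom is smaller still, because data-parallel ranks replicate the non-expert weights.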
2. Install (DeepGEMM is required)
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@<pinned-deepgemm-tag> --no-build-isolation
DeepGEMM provides the MQA-logits kernels that DSA relies on, so the install is not optional. Setting VLLM_USE_DEEP_GEMM=0 disables only the DeepGEMM MoE optimization; use it for debugging, not as a way to skip the install.
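Before launching the server it helps to confirm that both packages import in the serving environment. This is a minimal sketch: `check_py_module` is a hypothetical helper, and `deep_gemm` is assumed to be the import name the DeepGEMM package installs under.

```shell
#!/usr/bin/env bash
# check_py_module <module> -> prints "ok" if the module imports, else "missing"
check_py_module() {
  if python3 -c "import $1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "missing"
  fi
}

echo "vllm:      $(check_py_module vllm)"
echo "deep_gemm: $(check_py_module deep_gemm)"   # assumed import name for DeepGEMM
```

A "missing" result for `deep_gemm` points to the install step above rather than to a vLLM problem. An "ok" result only shows the module imports; it does not prove the build matches the version vLLM expects.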
3. Bare-metal vLLM server (expert + data parallel on 8 GPUs)
DeepSeek's kernels are optimized for TP=1, so the recommended shape is data plus expert parallelism, not tensor parallelism:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp \
--data-parallel-size 8 \
--enable-expert-parallel \
--trust-remote-code
Tensor-parallel fallback when EP is unavailable:
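A minimal sketch of that fallback, assuming it simply swaps the data/expert-parallel flags for vLLM's standard `--tensor-parallel-size` across the same 8 GPUs; check the vLLM recipe for the exact recommended flags. The helper `tp_fallback_cmd` is hypothetical and only prints the command so it can be reviewed before launch.

```shell
#!/usr/bin/env bash
# Sketch of a tensor-parallel launch, used only when expert parallelism is unavailable.
MODEL="deepseek-ai/DeepSeek-V3.2-Exp"
TP_SIZE=8   # assumption: one shard per GPU on the 8-GPU node

# Print the launch command instead of running it.
tp_fallback_cmd() {
  printf 'vllm serve %s --tensor-parallel-size %s --trust-remote-code\n' "$MODEL" "$TP_SIZE"
}

tp_fallback_cmd
# To launch for real on a GPU node: eval "$(tp_fallback_cmd)"
```

Since the kernels are tuned for TP=1, treat this shape as a way to get the model up, and benchmark it against the DP/EP launch before keeping it.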
KV cache defaults to FP8; pass --kv-cache-dtype bfloat16 if you need it. If startup hits a config error, cap --max-num-seqs 256 (the default is 1024). For thinking output and tool calls, wire DeepSeek's documented reasoning and tool-call parsers (the DeepSeek-R1 lineage uses --reasoning-parser deepseek_r1); confirm the exact parser names for V3.2 against the card and vLLM docs before exposing them to agents.
4. Kubernetes Deployment template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-v3-2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-deepseek-v3-2 }
  template:
    metadata:
      labels: { app: vllm-deepseek-v3-2 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag-with-deepgemm>
          args:
            - --model=deepseek-ai/DeepSeek-V3.2-Exp
            - --served-model-name=deepseek-v3-2
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 240
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek-v3-2
  namespace: serving
spec:
  selector: { app: vllm-deepseek-v3-2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
The image must carry the DeepGEMM build; a stock vLLM image without it fails the DSA kernels.
5. Smoke test
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-v3-2",
"messages": [{"role": "user", "content": "Summarize the tradeoff sparse attention makes versus dense attention."}],
"temperature": 0.6,
"max_tokens": 512
}'
Pass criteria:
- /health returns 200; the vLLM logs show the DeepGEMM kernels loaded and a non-zero KV-cache size.
- On a long-context prompt, TTFT grows sub-quadratically compared with a dense model at the same length. That is the DSA payoff; measure it rather than assume it.
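One way to turn the second criterion into a number is the log-log slope between two TTFT measurements taken at different prompt lengths. This is a sketch under stated assumptions: `growth_exponent` is a hypothetical helper, and the TTFT values come from whatever load tool you already use. The figures in the example calls are illustrative, not measurements.

```shell
#!/usr/bin/env bash
# growth_exponent <tokens1> <ttft1_s> <tokens2> <ttft2_s>
# Log-log slope between two (prompt length, TTFT) points:
#   about 1.0 means linear growth, about 2.0 means quadratic growth.
growth_exponent() {
  awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
    'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# Illustrative numbers only:
growth_exponent 16000 1.0 64000 4.0    # prints 1.00 (linear)
growth_exponent 16000 1.0 64000 16.0   # prints 2.00 (quadratic)
```

An exponent clearly below 2 across several length pairs supports the sub-quadratic claim. Two points are noisy, so repeat each measurement and compare against a full-attention baseline on the same hardware where possible.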
Failure modes
- DeepGEMM missing or wrong version - MQA-logits/kernel errors at load. Install the pinned DeepGEMM build.
- Tensor parallelism too aggressive - DeepSeek kernels favor TP=1 plus EP/DP; pure high-TP is suboptimal.
- FP8 KV surprises - the default KV dtype is FP8; set --kv-cache-dtype bfloat16 if numerics require it.
- Config error at startup - cap --max-num-seqs 256.
- Treating -Exp as stable - it is experimental; pin the revision and watch for a successor before betting a platform on it.
- Wrong hardware - FP8 needs Hopper or Blackwell (H200/H20/B200); older GPUs are unsupported.
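The wrong-hardware case can be caught before any weights download with a small preflight. The sketch below assumes Hopper reports compute capability 9.x and Blackwell 10.x or higher, and that the installed driver's `nvidia-smi` supports the `compute_cap` query field. `supports_fp8_path` is a hypothetical helper, and the threshold mirrors this page's hardware guidance rather than an official vLLM check.

```shell
#!/usr/bin/env bash
# supports_fp8_path <compute_cap> -> "ok" if the major version is >= 9, else "unsupported"
# Assumption: Hopper is 9.x, Blackwell is 10.x or higher.
supports_fp8_path() {
  local major=${1%%.*}
  if [ "$major" -ge 9 ]; then
    echo "ok"
  else
    echo "unsupported"
  fi
}

# On a GPU node, classify every visible device; skipped when nvidia-smi is absent.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
    while IFS=, read -r name cap; do
      cap=$(echo "$cap" | tr -d ' ')   # strip the space after the CSV comma
      echo "$name: $(supports_fp8_path "$cap")"
    done
fi
```

Running this in an init container or a node-validation job makes the failure explicit instead of surfacing later as a kernel error at model load.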
References
- DeepSeek-V3.2-Exp model card: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
- vLLM recipe for DeepSeek-V3.2-Exp (EP/DP command, DeepGEMM, KV dtype): https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html
- DeepGEMM (required kernels): https://github.com/deepseek-ai/DeepGEMM
- vLLM expert parallel deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
- DeepSeek-R1 cookbook (full-attention sibling in this KB): DeepSeek-R1
Related: DeepSeek-R1 · Inference serving · Open-weight serving · Disaggregated inference · KV cache management · FlashAttention / MLA · Performance tuning · SLO/SLI catalog · Glossary