Cookbook: serve DeepSeek-V3.2-Exp with vLLM

Scope: a vLLM reference template for serving deepseek-ai/DeepSeek-V3.2-Exp, DeepSeek's sparse-attention model: what DeepSeek Sparse Attention (DSA) changes, why and when to use it for long-context work, how to launch the expert/data-parallel shape on an 8-GPU H200/B200 node (including the required DeepGEMM install), and how to verify.

Reference template. Commands and manifests are not executed here. V3.2-Exp is an experimental line and needs a current vLLM with day-0 support plus a specific DeepGEMM build; pin the exact vLLM image, DeepGEMM version, and model revision before production, and validate on one node.

flowchart LR
  LONG["Long prompt (tens to hundreds of K tokens)"] --> VLLM["vLLM server"]
  VLLM --> DSA["DeepSeek Sparse Attention: lightning indexer selects keys"]
  DSA --> MOE["685B MoE, small active fraction"]
  MOE --> OUT["Reasoning / answer at lower long-context cost"]

What

DeepSeek-V3.2-Exp is a 685B-total MoE (256 experts) built on the DeepSeek-V3.1 architecture and released under MIT. The headline change is DeepSeek Sparse Attention (DSA), a fine-grained sparse-attention mechanism in which a lightning indexer picks the keys each query attends to, so the main attention no longer scores every key in the context and its cost stops growing as O(N^2). That cuts the cost of long-context prefill and decode while tracking V3.1-class quality. vLLM shipped day-0 support. Precision is FP8 block-scale or BF16, and the context length follows the V3.1 lineage (128K; confirm on the card). The -Exp suffix marks an experimental release that validates efficiency changes for future DeepSeek models.
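To make the scaling argument concrete, the sketch below counts how many keys the main attention scores for a single query at position N, with and without top-k selection. The top-k budget of 2048 is an illustrative assumption, not a figure from this page; take the real value from the model card. The sketch also ignores the indexer's own per-key scoring cost, so treat the ratio as an upper bound on the saving.

```shell
# Keys scored by the main attention for one query at position N.
# K is the top-k budget; the 2048 used below is an assumed, illustrative value.
keys_per_query() {
  n=$1; k=$2
  if [ "$n" -lt "$k" ]; then echo "$n"; else echo "$k"; fi
}

# Ratio of dense keys to sparse keys for one query at position N.
dense_to_sparse_ratio() {
  n=$1; k=$2
  sparse=$(keys_per_query "$n" "$k")
  awk -v n="$n" -v s="$sparse" 'BEGIN { printf "%.1f\n", n / s }'
}

for len in 4096 32768 131072; do
  echo "N=$len sparse_keys=$(keys_per_query "$len" 2048) ratio=$(dense_to_sparse_ratio "$len" 2048)"
done
```

Under these assumptions the ratio grows linearly with N once N exceeds the budget, which is why the benefit only shows up on long prompts.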

Why

Use V3.2-Exp when long context is the cost driver: large documents, repository-scale code, or long agent histories. Because DSA bounds the keys each query attends to, time-to-first-token and decode on long prompts should be cheaper than on the full-attention DeepSeek-V3.1 or R1 at the same length. It keeps the small-active-MoE compute profile (per-token compute far below what the total weight count suggests), so throughput behaves like the other DeepSeek MoEs; the specific win is long-context efficiency, not raw tokens per second.

When

Use it when:

  • Prompts are long (tens to hundreds of K tokens) and attention is the bottleneck.
  • You already run DeepSeek-V3.1 or R1 and want cheaper long context on the same class of node.
  • MIT licensing and the DeepSeek ecosystem fit.

Avoid it when:

  • Prompts are short: DSA's benefit is long-context, and a full-attention model is simpler to operate.
  • You cannot install the required DeepGEMM build or run a current, day-0-class vLLM.
  • You need a settled, non-experimental checkpoint for a conservative platform; this is an -Exp release and a successor may replace it.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | deepseek-ai/DeepSeek-V3.2-Exp |
| Params | 685B total, 256 experts (small active fraction per token, V3.1 lineage) |
| Attention | DeepSeek Sparse Attention (DSA) for long context |
| Context | 128K (V3.1 lineage; confirm on the card) |
| Precision | FP8 block-scale or BF16 |
| License | MIT |
| Starting hardware | 8x H200 (141 GB) or B200 (192 GB), Hopper or Blackwell with FP8; H20 (96 GB) is tight for 685B |
| vLLM | current build with day-0 V3.2 support, plus DeepGEMM |
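The "tight for 685B" note can be checked with rough arithmetic. The sketch below subtracts FP8 weights from the aggregate HBM of an 8-GPU node. It assumes about 1 byte per parameter and ignores block scales, activations, CUDA graphs, and any weights replicated across data-parallel ranks, so real headroom will be smaller than these figures.

```shell
# Rough headroom estimate: aggregate HBM minus FP8 weights, in GB.
# Assumes ~1 byte per parameter; ignores scales, activations, and
# non-expert weights that are replicated on every data-parallel rank.
headroom_gb() {
  gpus=$1; gb_per_gpu=$2; params_b=$3
  echo $(( gpus * gb_per_gpu - params_b ))
}

echo "8x H200: $(headroom_gb 8 141 685) GB left for KV cache and overhead"
echo "8x B200: $(headroom_gb 8 192 685) GB left for KV cache and overhead"
echo "8x H20:  $(headroom_gb 8 96 685) GB left for KV cache and overhead"
```

Whatever is left has to hold the KV cache for every concurrent sequence, which is why the H20 row leaves little room for long prompts.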

2. Install (DeepGEMM is required)

uv pip install -U vllm --torch-backend auto
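uv pip install git+https://github.com/deepseek-ai/DeepGEMM@<pinned-deepgemm-tag> --no-build-isolation

Replace <pinned-deepgemm-tag> with the DeepGEMM tag named in the vLLM recipe for this model. DeepGEMM is used in two places: the MoE GEMMs and the MQA-logits kernels that DSA relies on. Setting VLLM_USE_DEEP_GEMM=0 disables only the MoE path, which is useful for debugging; the MQA-logits kernels remain required.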
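Before launching, it can help to confirm that both packages import in the serving environment. This is a minimal sketch that assumes DeepGEMM's Python module is named deep_gemm and that python3 is the environment's interpreter (override it with the PYTHON variable, which is a convention of this sketch). A clean import shows only that the package is present, not that its kernels run on your GPU.

```shell
# Return 0 if the named Python module imports in the current environment.
# PYTHON defaults to python3; point it at the venv interpreter if needed.
can_import() {
  "${PYTHON:-python3}" -c "import $1" >/dev/null 2>&1
}

for mod in vllm deep_gemm; do
  if can_import "$mod"; then
    echo "ok: $mod"
  else
    echo "MISSING: $mod"
  fi
done
```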

3. Bare-metal vLLM server (expert + data parallel on 8 GPUs)

DeepSeek's kernels are optimized for TP=1, so the recommended shape is data plus expert parallelism, not tensor parallelism:

vllm serve deepseek-ai/DeepSeek-V3.2-Exp \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --trust-remote-code

Tensor-parallel fallback when EP is unavailable:

vllm serve deepseek-ai/DeepSeek-V3.2-Exp --tensor-parallel-size 8 --trust-remote-code

The KV cache defaults to FP8; pass --kv-cache-dtype bfloat16 if your numerics require BF16. If the server fails with a configuration error, cap --max-num-seqs at 256 (the default is 1024).

For thinking output and tool calls, wire DeepSeek's documented reasoning and tool-call parsers. The DeepSeek-R1 lineage uses --reasoning-parser deepseek_r1; confirm the exact parser names for V3.2 against the card and the vLLM docs before exposing them to agents.
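One way to keep the optional flags explicit is a small wrapper that assembles the command and prints it for review instead of running it. This is a sketch: the environment variable names DP_SIZE, KV_DTYPE, and MAX_NUM_SEQS are hypothetical conventions of this example, while the flags themselves are the ones described above.

```shell
# Print the vllm serve command for the DP+EP shape. Optional flags are
# added only when the (hypothetical) environment variables are set.
build_serve_cmd() {
  cmd="vllm serve deepseek-ai/DeepSeek-V3.2-Exp"
  cmd="$cmd --data-parallel-size ${DP_SIZE:-8} --enable-expert-parallel"
  cmd="$cmd --trust-remote-code"
  if [ -n "${KV_DTYPE:-}" ]; then
    cmd="$cmd --kv-cache-dtype $KV_DTYPE"
  fi
  if [ -n "${MAX_NUM_SEQS:-}" ]; then
    cmd="$cmd --max-num-seqs $MAX_NUM_SEQS"
  fi
  echo "$cmd"
}

build_serve_cmd
```

Setting KV_DTYPE=bfloat16 and MAX_NUM_SEQS=256 before the call appends both overrides; run the printed command once it looks right.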

4. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-v3-2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-deepseek-v3-2 }
  template:
    metadata:
      labels: { app: vllm-deepseek-v3-2 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag-with-deepgemm>
          args:
            - --model=deepseek-ai/DeepSeek-V3.2-Exp
            - --served-model-name=deepseek-v3-2
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 240
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek-v3-2
  namespace: serving
spec:
  selector: { app: vllm-deepseek-v3-2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

The image must carry the DeepGEMM build; a stock vLLM image without it fails the DSA kernels.
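Rolling the template out might look like the sketch below. The manifest file name is hypothetical, and the 30-minute rollout timeout is an arbitrary allowance for weight loading rather than a measured value. With DRY_RUN=1 the helper prints the kubectl commands instead of running them.

```shell
# Echo the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# MANIFEST is a hypothetical file holding the Deployment and Service above.
deploy() {
  manifest=${MANIFEST:-vllm-deepseek-v3-2.yaml}
  run kubectl apply -f "$manifest" &&
  run kubectl -n serving rollout status deployment/vllm-deepseek-v3-2 --timeout=30m
}

# Usage: DRY_RUN=1 deploy   (prints the commands), or just: deploy
```

Once the rollout reports success, kubectl -n serving port-forward svc/vllm-deepseek-v3-2 8000:8000 exposes the Service locally for the smoke test.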

5. Smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v3-2",
    "messages": [{"role": "user", "content": "Summarize the tradeoff sparse attention makes versus dense attention."}],
    "temperature": 0.6,
    "max_tokens": 512
  }'

Pass criteria:

  • /health returns 200; the vLLM logs show the DeepGEMM kernels loaded and a non-zero KV-cache size.
  • On long-context prompts, TTFT grows more slowly with prompt length than on a full-attention model at the same lengths. That is the DSA payoff; measure it rather than assume it.
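The second criterion can be turned into a number by fitting an exponent p in TTFT ~ N^p from two prompt lengths. The sketch below approximates TTFT as the wall time of a non-streaming request with "max_tokens": 1, which is an assumption of this example rather than an official metric; the request-body file is a hypothetical input you prepare yourself. Use distinct prompts for each length so that prefix caching, if enabled, does not flatter the result.

```shell
# Estimate p in TTFT ~ N^p from two measurements (n1, t1) and (n2, t2).
scaling_exponent() {
  awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
    'BEGIN { printf "%.2f\n", log(t2 / t1) / log(n2 / n1) }'
}

# Approximate TTFT as the wall time of a one-token, non-streaming request.
# $1 = base URL, $2 = path to a JSON request body with "max_tokens": 1.
ttft_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    -H 'Content-Type: application/json' \
    -d @"$2" "$1/v1/chat/completions"
}

# Example with made-up timings: doubling N doubles TTFT, so p = 1.00.
scaling_exponent 32000 2.0 64000 4.0
```

As a rule of thumb, p close to 2 points at quadratic attention dominating, while p closer to 1 is consistent with the sparse path doing its job; compare against a full-attention model on the same node before drawing conclusions.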

Failure modes

  • DeepGEMM missing or wrong version - MQA-logits/kernel errors at load. Install the pinned DeepGEMM build.
  • Tensor parallelism too aggressive - DeepSeek kernels favor TP=1 plus EP/DP; pure high-TP is suboptimal.
  • FP8 KV surprises - the default KV dtype is FP8; set --kv-cache-dtype bfloat16 if numerics require it.
  • Config error at startup - cap --max-num-seqs at 256 (the default is 1024).
  • Treating -Exp as stable - it is experimental; pin the revision and watch for a successor before betting a platform on it.
  • Wrong hardware - FP8 needs Hopper or Blackwell (H200/H20/B200); older GPUs are unsupported.
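The hardware check in the last item can be scripted against the compute capability that nvidia-smi reports. This sketch assumes that major version 9 corresponds to Hopper and major version 10 to data-center Blackwell; that mapping is an assumption of the example, and passing it does not guarantee that a specific SKU is supported by the kernels.

```shell
# Return 0 if a CUDA compute capability string looks like Hopper (9.x)
# or data-center Blackwell (10.x). The mapping is an assumption of this sketch.
is_hopper_or_blackwell() {
  case "${1%%.*}" in
    9|10) return 0 ;;
    *) return 1 ;;
  esac
}

# On a GPU node, read the real value with:
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
for cap in 8.0 9.0 10.0; do
  if is_hopper_or_blackwell "$cap"; then
    echo "$cap: candidate"
  else
    echo "$cap: unsupported"
  fi
done
```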

References

  • DeepSeek-V3.2-Exp model card: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp
  • vLLM recipe for DeepSeek-V3.2-Exp (EP/DP command, DeepGEMM, KV dtype): https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2-Exp.html
  • DeepGEMM (required kernels): https://github.com/deepseek-ai/DeepGEMM
  • vLLM expert parallel deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
  • DeepSeek-R1 cookbook (full-attention sibling in this KB): DeepSeek-R1

Related: DeepSeek-R1 · Inference serving · Open-weight serving · Disaggregated inference · KV cache management · FlashAttention / MLA · Performance tuning · SLO/SLI catalog · Glossary