Cookbook: serve DeepSeek-R1 with vLLM
Scope: a production-oriented vLLM reference template for serving deepseek-ai/DeepSeek-R1 as an OpenAI-compatible endpoint: what the model is, why and when to use it, how to size it, how to launch it on an 8-GPU node with vLLM expert parallelism, how to wrap it in Kubernetes, and how to verify TTFT/TPOT and reasoning behaviour.
Reference template. Commands and manifests are not executed here. Pin the vLLM image, CUDA driver, NCCL, and model revision before production; validate on one node before exposing traffic.
flowchart LR
REQ["Client request"] --> VLLM["vLLM OpenAI server"]
VLLM --> ATTN["Dense attention weights"]
VLLM --> MOE["MoE experts sharded by EP"]
ATTN --> TOK["Reasoning tokens"]
MOE --> TOK
TOK --> SLO["TTFT / TPOT / goodput"]
What
DeepSeek-R1 is a 671B-total / 37B-activated MoE reasoning model with a 128K context window, released under the MIT license. Its Hugging Face model card shows vLLM usage and states that the model is based on the DeepSeek-V3 architecture. At this size, serving is a cluster-planning exercise, not a one-GPU endpoint. The current release is deepseek-ai/DeepSeek-R1-0528, published as a separate Hugging Face repository with the same 671B/37B architecture. It adds system-prompt support and function calling and no longer requires a forced <think> prefix, so the no-system-prompt rule under Production knobs does not apply if you deploy 0528. Pin whichever revision you validate.
Why
Use DeepSeek-R1 when the workload rewards explicit reasoning: math, code review, symbolic tasks, multi-step troubleshooting, and agent plans where answer quality matters more than minimum decode latency. Because only 37B parameters are active per token, per-token compute is close to that of a 30-40B dense model, but all 671B parameters must still be resident in GPU memory.
When
Use it when:
- The request mix has complex prompts and long outputs.
- Latency SLOs tolerate reasoning-token overhead.
- The platform has at least an 8-GPU H200/B200-class node for FP8 weights, or multiple nodes for BF16.
- The team can evaluate reasoning quality, not just benchmark tokens/s.
Avoid it when:
- The product needs short factual answers at very low cost.
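- The hardware is a single 8x80GB node: roughly 640 GB of HBM cannot hold the BF16 weights (about 1.3 TB), and even the FP8 weights (about 671 GB) leave no room for KV cache.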
- The application cannot tolerate reasoning traces or model-specific prompt rules.
How
1. Size the replica
| Item | Practical note |
|---|---|
| Model | deepseek-ai/DeepSeek-R1 |
| Params | 671B total / 37B activated |
| Context | 128K maximum on the model card; start smaller for production SLOs |
| License | MIT |
| vLLM mode | MoE expert parallelism; DeepSeek-V3 architecture |
| Starting hardware | 8x H200/B200-class for FP8; BF16 needs more memory |
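The hardware row above follows from simple weight arithmetic. The sketch below is a back-of-envelope check, not a capacity planner. It assumes 1 byte per parameter for FP8 and 2 bytes for BF16, uses nominal HBM sizes (80 GB and 141 GB per GPU), and ignores activations, CUDA graphs, and any weights replicated across ranks. The helper names are hypothetical.

```python
# Back-of-envelope weight sizing. Every constant is an assumption to verify
# against the checkpoint and the GPUs you actually have.
PARAMS_TOTAL = 671e9                      # total parameters, from the sizing table
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}   # storage width per parameter
HBM_GB = {"h100-80gb": 80, "h200": 141}   # nominal per-GPU HBM, not usable capacity


def weight_gb(dtype: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return PARAMS_TOTAL * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(gpu: str, dtype: str, num_gpus: int = 8,
                utilization: float = 0.90) -> float:
    """HBM left for KV cache and activations after weights.

    `utilization` mirrors --gpu-memory-utilization. A negative result means
    the weights alone do not fit on the node.
    """
    budget = HBM_GB[gpu] * num_gpus * utilization
    return budget - weight_gb(dtype)


for gpu in HBM_GB:
    for dtype in BYTES_PER_PARAM:
        print(f"8x {gpu:10s} {dtype:5s} headroom: {headroom_gb(gpu, dtype):8.1f} GB")
```

A small positive headroom is not a pass: the KV cache and runtime buffers must fit in it, so treat the vLLM startup logs as the real answer.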
The first production run should cap --max-model-len at 32K or 64K and raise it only after the vLLM startup logs show enough KV-cache capacity for the target concurrency.
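To see why the context cap matters, the following sketch estimates how many full-length sequences fit in a given KV-cache budget. It assumes the DeepSeek-V3-style MLA cache layout (one compressed latent plus one RoPE key per token per layer), the constants 61 layers, 512 latent dimensions, and 64 RoPE dimensions as read from the published config.json, a 2-byte cache dtype, and a hypothetical 40 GiB budget. Verify all of these against your pinned revision; the function names are illustrative.

```python
# MLA KV-cache sizing sketch. The architecture constants are assumptions taken
# from the DeepSeek-V3-style config.json; check them for the revision you pin.
NUM_LAYERS = 61       # num_hidden_layers
KV_LORA_RANK = 512    # kv_lora_rank: compressed KV latent per token
ROPE_HEAD_DIM = 64    # qk_rope_head_dim: decoupled RoPE key per token


def kv_bytes_per_token(dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return NUM_LAYERS * (KV_LORA_RANK + ROPE_HEAD_DIM) * dtype_bytes


def max_full_length_sequences(kv_budget_gib: float, max_model_len: int,
                              dtype_bytes: int = 2) -> int:
    """Sequences of max_model_len tokens that fit in the given KV budget."""
    budget_bytes = kv_budget_gib * 2**30
    per_sequence = kv_bytes_per_token(dtype_bytes) * max_model_len
    return int(budget_bytes // per_sequence)


# Hypothetical 40 GiB KV budget: doubling the context cap halves concurrency.
for cap in (32768, 65536):
    print(cap, max_full_length_sequences(40, cap))
```

Real requests rarely use the full cap, so this is a worst-case bound. The KV cache size and maximum concurrency that vLLM prints at startup take precedence over this estimate.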
2. Bare-metal smoke server
export MODEL=deepseek-ai/DeepSeek-R1
export SERVED_MODEL=deepseek-r1
vllm serve "$MODEL" \
  --served-model-name "$SERVED_MODEL" \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code
Why this shape: vLLM's expert-parallel guide uses DeepSeek-V3 with 1-way tensor parallelism and 8-way data/expert parallelism on an 8-GPU H200/H20-class node. R1 shares that architecture family, so this is the lowest-risk starting point. If the node has no EP-capable vLLM build, fall back to 8-way tensor parallelism (--tensor-parallel-size 8 without --enable-expert-parallel). Each expert is then split across all GPUs instead of living on one, so expect lower MoE locality.
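The two launch shapes described above can be generated from one place so the bare-metal command and the Kubernetes args do not drift. This is a minimal sketch: build_vllm_args is a hypothetical helper, and it emits only flags that already appear in this page.

```python
import shlex


def build_vllm_args(model: str, served_name: str, num_gpus: int = 8,
                    expert_parallel: bool = True, max_model_len: int = 32768,
                    gpu_mem_util: float = 0.90) -> list[str]:
    """Return the `vllm serve` argv for the EP shape or the TP fallback."""
    if expert_parallel:
        tp, dp = 1, num_gpus      # TP=1, DP=num_gpus, experts sharded by EP
    else:
        tp, dp = num_gpus, 1      # fallback: pure tensor parallelism
    args = [
        "vllm", "serve", model,
        "--served-model-name", served_name,
        "--tensor-parallel-size", str(tp),
        "--data-parallel-size", str(dp),
    ]
    if expert_parallel:
        args.append("--enable-expert-parallel")
    args += [
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_mem_util),
        "--trust-remote-code",
    ]
    return args


# Print both shapes as copy-pasteable shell commands.
for ep in (True, False):
    print(shlex.join(build_vllm_args("deepseek-ai/DeepSeek-R1", "deepseek-r1",
                                     expert_parallel=ep)))
```

Generating the argv rather than hand-editing two copies keeps the parallelism flags consistent with the GPU count requested in the pod spec.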
3. Kubernetes Deployment template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-r1
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-deepseek-r1 }
  template:
    metadata:
      labels: { app: vllm-deepseek-r1 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=deepseek-ai/DeepSeek-R1
            - --served-model-name=deepseek-r1
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --max-model-len=32768
            - --gpu-memory-utilization=0.90
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
            periodSeconds: 10
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek-r1
  namespace: serving
spec:
  selector: { app: vllm-deepseek-r1 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
4. Smoke test
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Solve 17 * 23. Show the final answer only."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256
  }'
Pass criteria:
- /health returns HTTP 200 after model load.
- The response completes without repetition or empty reasoning markers.
- vLLM startup logs show a non-zero GPU KV cache size and acceptable maximum concurrency for the configured --max-model-len.
- NCCL_DEBUG=INFO does not show TCP fallback if the replica spans nodes.
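The response-level criteria can be checked mechanically once the curl output is parsed as JSON. The sketch below assumes the standard OpenAI chat-completion shape (choices[0].message.content and choices[0].finish_reason) and that reasoning text, if any, arrives inside content. If your server separates reasoning into another field, adapt the lookup. The repetition check is a crude heuristic and both function names are hypothetical.

```python
import re
from typing import Optional

EMPTY_THINK = re.compile(r"<think>\s*</think>")


def has_repetition(text: str, min_len: int = 20, repeats: int = 3) -> bool:
    """Flag any line of at least min_len characters that occurs `repeats`+ times."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= min_len:
            counts[line] = counts.get(line, 0) + 1
    return any(count >= repeats for count in counts.values())


def check_smoke_response(resp: dict, expected: Optional[str] = None) -> list[str]:
    """Return a list of failed criteria; an empty list means the check passed."""
    choices = resp.get("choices") or []
    if not choices:
        return ["no choices in response"]
    choice = choices[0]
    content = (choice.get("message") or {}).get("content") or ""
    problems = []
    if choice.get("finish_reason") != "stop":
        # "length" here usually means max_tokens cut the reasoning short.
        problems.append(f"finish_reason={choice.get('finish_reason')!r}, expected 'stop'")
    if not content.strip():
        problems.append("empty content")
    if EMPTY_THINK.search(content):
        problems.append("empty <think> block")
    if has_repetition(content):
        problems.append("repeated lines in output")
    if expected is not None and expected not in content:
        problems.append(f"expected {expected!r} in the answer")
    return problems
```

For the prompt above, call check_smoke_response(json.loads(body), expected="391"), since 17 * 23 = 391. A finish_reason of "length" suggests raising max_tokens before judging answer quality.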
5. Production knobs
| Knob | Start | Change when |
|---|---|---|
| --max-model-len | 32768 | Raise only after TTFT and KV-cache capacity are measured. |
| --gpu-memory-utilization | 0.90 | Lower if CUDA graph capture or KV allocation OOMs. |
| --enable-expert-parallel | on | Keep on for MoE unless the build lacks EP support. |
| temperature | 0.6 | DeepSeek recommends 0.5-0.7 for R1 reasoning. |
| System prompt | none | R1 (original) recommends all instructions in the user prompt; R1-0528 supports system prompts. |
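Several knobs above depend on measured TTFT and TPOT. As a minimal sketch, both can be derived from client-side timestamps of a streamed response: TTFT is the delay to the first token, and TPOT is the mean gap between subsequent tokens. The goodput definition used here (the fraction of requests meeting both SLOs) is one common convention and an assumption, as are the function names and the example thresholds.

```python
def ttft_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Compute (TTFT, TPOT) in seconds from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT averages the decode gaps only, so the first token is excluded.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def goodput(samples: list[tuple[float, float]], ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of (ttft, tpot) samples that meet both SLO thresholds."""
    if not samples:
        return 0.0
    ok = sum(1 for ttft, tpot in samples if ttft <= ttft_slo and tpot <= tpot_slo)
    return ok / len(samples)


# Timestamps would come from time.perf_counter() around a streaming request.
ttft, tpot = ttft_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

For a reasoning model, count the reasoning tokens in TPOT as well, since they dominate the decode time the user waits through.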
Failure modes
- OOM at load: BF16 weights or too much KV cache for the node. Use FP8 where supported, lower --max-model-len, or move to more GPUs.
- Low throughput despite enough HBM: MoE layers are not using expert parallelism; check vLLM flags and build support.
- Reasoning loops or empty <think> blocks: use DeepSeek's recommended sampling range and prompt shape.
- Bad TTFT: context cap is too high or prefill is saturated; use chunked prefill or a smaller max length.
- TCP fallback: multi-node deployment lacks GPUDirect RDMA or the pod lacks shared memory / IPC_LOCK settings.
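The TCP-fallback case can be caught by scanning pod logs captured with NCCL_DEBUG=INFO. The sketch below assumes NCCL reports its chosen transport with a line containing "Using network Socket" when it falls back to TCP. That wording varies across NCCL versions, so treat the marker as an assumption and confirm it against a known-good log from your pinned image. The function name is hypothetical.

```python
# Marker strings are an assumption about NCCL's INFO output; adjust them to
# match what your pinned NCCL version actually prints.
DEFAULT_MARKERS = ("Using network Socket",)


def find_tcp_fallback(log_text: str,
                      markers: tuple[str, ...] = DEFAULT_MARKERS) -> list[str]:
    """Return NCCL log lines suggesting the socket (TCP) transport was selected."""
    hits = []
    for line in log_text.splitlines():
        if "NCCL" in line and any(marker in line for marker in markers):
            hits.append(line.strip())
    return hits
```

Feed it the output of kubectl logs for each pod in the replica. Any hit on a multi-node deployment is worth investigating before load testing.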
References
- DeepSeek-R1 model card: https://huggingface.co/deepseek-ai/DeepSeek-R1
- DeepSeek-R1-0528 (current revision: system prompts + function calling, no forced think prefix): https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
- DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
- vLLM expert parallel deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
- vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
- vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/
Related: Inference serving · Open-weight serving · Kubernetes GPU platform · Performance tuning · SLO/SLI catalog