Cookbook: serve DeepSeek-R1 with vLLM

Scope: a production-oriented vLLM reference template for serving deepseek-ai/DeepSeek-R1 as an OpenAI-compatible endpoint. It covers what the model is, why and when to use it, how to size it, how to launch it on an 8-GPU node with vLLM expert parallelism, how to wrap it in Kubernetes, and how to verify TTFT/TPOT and reasoning behaviour.

Reference template. Commands and manifests are not executed here. Pin the vLLM image, CUDA driver, NCCL, and model revision before production; validate on one node before exposing traffic.

flowchart LR
  REQ["Client request"] --> VLLM["vLLM OpenAI server"]
  VLLM --> ATTN["Dense attention weights"]
  VLLM --> MOE["MoE experts sharded by EP"]
  ATTN --> TOK["Reasoning tokens"]
  MOE --> TOK
  TOK --> SLO["TTFT / TPOT / goodput"]

What

DeepSeek-R1 is a 671B-total / 37B-activated mixture-of-experts (MoE) reasoning model with a 128K context window, released under the MIT license. The Hugging Face model card documents vLLM usage and states that the model is based on the DeepSeek-V3 architecture. At this size, serving is a cluster-planning exercise, not a one-GPU endpoint. The newer checkpoint is published separately as deepseek-ai/DeepSeek-R1-0528 (same 671B/37B architecture): it adds system-prompt support and function calling and no longer requires a forced <think> prefix, so if you deploy 0528 the no-system-prompt rule under Production knobs does not apply. Pin whichever revision you validate.

Why

Use DeepSeek-R1 when the workload rewards explicit reasoning: math, code review, symbolic tasks, multi-step troubleshooting, and agent plans where answer quality matters more than minimum decode latency. The small active-parameter count keeps per-token compute closer to a 30-40B dense model, but total weight memory still behaves like a 671B model.

When

Use it when:

The request mix has complex prompts and long outputs.
Latency SLOs tolerate reasoning-token overhead.
The platform has at least an 8-GPU H200/B200-class node for FP8 weights, or multiple nodes for BF16.
The team can evaluate reasoning quality, not just benchmark tokens/s.

Avoid it when:

The product needs short factual answers at very low cost.
The hardware is a single 8x80GB node: roughly 671 GB of FP8 weights already exceeds its 640 GB of HBM, and BF16 weights need about twice that.
The application cannot tolerate reasoning traces or model-specific prompt rules.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | `deepseek-ai/DeepSeek-R1` |
| Params | 671B total / 37B activated |
| Context | 128K maximum on the model card; start smaller for production SLOs |
| License | MIT |
| vLLM mode | MoE expert parallelism; DeepSeek-V3 architecture |
| Starting hardware | 8x H200/B200-class for FP8; BF16 needs roughly twice the weight memory, so plan for multiple nodes |
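
The hardware row can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate only: it counts weights at 1 byte (FP8) or 2 bytes (BF16) per parameter and ignores KV cache, activations, CUDA graphs, and any weights replicated across data-parallel ranks, so real headroom is smaller. The per-GPU capacities (80 GB, and 141 GB for H200) and the helper names are assumptions for illustration.

```python
# Back-of-envelope weight sizing in decimal GB. This is a lower bound on
# memory use: KV cache, activations, and replicated weights are ignored.
TOTAL_PARAMS = 671e9  # total parameters, from the model card

BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}

# Assumed per-node HBM capacities in GB; adjust to your hardware.
NODES = {
    "8x80GB": 8 * 80,
    "8xH200 (141 GB each)": 8 * 141,
}


def weight_gb(dtype: str) -> float:
    """Approximate weight footprint in GB for the given dtype."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(node_hbm_gb: float, dtype: str, utilization: float = 0.90) -> float:
    """HBM left after weights when vLLM may use `utilization` of the node."""
    return node_hbm_gb * utilization - weight_gb(dtype)


for node, hbm in NODES.items():
    for dtype in BYTES_PER_PARAM:
        left = headroom_gb(hbm, dtype)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{node:>22} {dtype:>5}: weights {weight_gb(dtype):6.0f} GB, "
              f"headroom {left:7.1f} GB -> {verdict}")
```

Under these assumptions only the FP8 / 8x H200 combination leaves positive headroom, which matches the starting-hardware guidance in the table.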

The first production run should cap --max-model-len at 32K or 64K and raise it only after the vLLM logs show enough KV-cache capacity for the target concurrency.
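
To turn the logged KV-cache capacity into a context cap, divide the cache size in tokens by the context length. The sketch below assumes you have read a KV-cache token count from the startup logs (the exact log wording varies by vLLM version, and with data parallelism each engine may report its own cache); the 400,000-token figure and the function names are hypothetical.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """Number of full-length requests whose KV cache fits at once."""
    return kv_cache_tokens / max_model_len


def largest_max_model_len(kv_cache_tokens: int, target_concurrency: int) -> int:
    """Largest context cap that still admits the target concurrency."""
    return kv_cache_tokens // target_concurrency


# Hypothetical KV-cache size; replace with the value from your startup logs.
kv_tokens = 400_000

for length in (32_768, 65_536, 131_072):
    fits = max_concurrency(kv_tokens, length)
    print(f"max-model-len {length:>7}: {fits:.2f} full-length requests")

print("cap for 8 concurrent full-length requests:",
      largest_max_model_len(kv_tokens, 8))
```

This is a worst-case view: most requests use far less than the full context, so observed concurrency is usually higher than this ratio suggests.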

2. Bare-metal smoke server

export MODEL=deepseek-ai/DeepSeek-R1
export SERVED_MODEL=deepseek-r1

vllm serve "$MODEL" \
  --served-model-name "$SERVED_MODEL" \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code

Why this shape: vLLM's expert-parallel guide serves DeepSeek-V3 with 1-way tensor parallelism and 8-way data/expert parallelism on an 8-GPU H200/H20-class node. R1 shares that architecture, so this is the lowest-risk starting point. If the vLLM build lacks EP support, fall back to tensor parallelism (--tensor-parallel-size 8 without --data-parallel-size or --enable-expert-parallel); each expert's weights are then sharded across all GPUs rather than kept whole on one rank.
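
To make the parallel shape concrete, the sketch below shows how routed experts could divide across ranks. It assumes the expert-parallel group spans all TP x DP ranks and uses the 256 routed experts per MoE layer of the DeepSeek-V3 architecture; the contiguous layout is illustrative only, and vLLM's actual placement may differ.

```python
ROUTED_EXPERTS = 256  # routed experts per MoE layer in the DeepSeek-V3 architecture


def ep_size(tensor_parallel: int, data_parallel: int) -> int:
    """EP group size when expert parallelism spans all TP x DP ranks."""
    return tensor_parallel * data_parallel


def experts_for_rank(rank: int, ep: int, num_experts: int = ROUTED_EXPERTS) -> range:
    """Contiguous block of expert ids owned by one EP rank (illustrative layout)."""
    if num_experts % ep != 0:
        raise ValueError("this sketch assumes experts divide evenly across ranks")
    per_rank = num_experts // ep
    return range(rank * per_rank, (rank + 1) * per_rank)


ep = ep_size(tensor_parallel=1, data_parallel=8)
for rank in range(ep):
    owned = experts_for_rank(rank, ep)
    print(f"rank {rank}: experts {owned.start}-{owned.stop - 1} "
          f"({len(owned)} per layer)")
```

With 8 ranks, each GPU holds 32 whole experts per layer, so a token's selected experts are reached through all-to-all communication instead of every GPU computing a slice of every expert.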

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deepseek-r1
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-deepseek-r1 }
  template:
    metadata:
      labels: { app: vllm-deepseek-r1 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=deepseek-ai/DeepSeek-R1
            - --served-model-name=deepseek-r1
            - --tensor-parallel-size=1
            - --data-parallel-size=8
            - --enable-expert-parallel
            - --max-model-len=32768
            - --gpu-memory-utilization=0.90
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          env:
            - { name: NCCL_DEBUG, value: INFO }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
            periodSeconds: 10
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-deepseek-r1
  namespace: serving
spec:
  selector: { app: vllm-deepseek-r1 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-r1",
    "messages": [{"role": "user", "content": "Solve 17 * 23. Show the final answer only."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 256
  }'

Pass criteria:

/health returns HTTP 200 after model load.
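The response completes with finish_reason "stop" and without repetition or empty reasoning markers. If it stops on "length", raise max_tokens, because reasoning tokens count against that limit.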
vLLM startup logs show a non-zero GPU KV cache size and acceptable maximum concurrency for the configured --max-model-len.
If the replica spans nodes, the NCCL_DEBUG=INFO output shows no fallback to TCP sockets.
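
The response-level criteria can be automated with a small checker. This is a minimal sketch, assuming the OpenAI chat-completions response shape and that reasoning appears inline in the message content between think markers (with a reasoning parser enabled, reasoning moves to a separate field and these marker checks will not see it). The function names and the repetition threshold are hypothetical.

```python
from collections import Counter


def reasoning_is_empty(content: str) -> bool:
    """True when a closing </think> marker has no reasoning text before it."""
    head, marker, _ = content.partition("</think>")
    if not marker:
        return False  # no reasoning marker in the content; nothing to judge
    return head.replace("<think>", "").strip() == ""


def has_repetition(content: str, max_repeats: int = 3) -> bool:
    """Crude loop detector: a non-empty line occurring more than max_repeats times."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    return any(count > max_repeats for count in Counter(lines).values())


def check_response(response: dict) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    problems = []
    if choice.get("finish_reason") != "stop":
        problems.append(f"finish_reason={choice.get('finish_reason')!r}; raise max_tokens?")
    if not content.strip():
        problems.append("empty content")
    if reasoning_is_empty(content):
        problems.append("empty reasoning block")
    if has_repetition(content):
        problems.append("repeated lines; possible reasoning loop")
    return problems


# Synthetic response for illustration only, not real model output.
sample = {
    "choices": [{
        "finish_reason": "stop",
        "message": {
            "role": "assistant",
            "content": "<think>\n17 * 23 = 17 * 20 + 17 * 3 = 391\n</think>\n391",
        },
    }]
}
print(check_response(sample) or "pass")
```

The line-count heuristic only catches verbatim loops; treat it as a first filter and review sampled outputs by hand before trusting it.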

5. Production knobs

| Knob | Start | Change when |
| --- | --- | --- |
| `--max-model-len` | `32768` | Raise only after TTFT and KV-cache capacity are measured. |
| `--gpu-memory-utilization` | `0.90` | Lower if CUDA graph capture or KV allocation OOMs. |
| `--enable-expert-parallel` | on | Keep on for MoE unless the build lacks EP support. |
| `temperature` | `0.6` | DeepSeek recommends 0.5-0.7 for R1 reasoning. |
| System prompt | none | R1 (original) recommends all instructions in the user prompt; R1-0528 supports system prompts. |
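
The sampling and prompt rows can be applied in one client-side helper. The sketch below is an assumption-laden illustration: the `revision` labels `"r1"` and `"r1-0528"`, the function name, and the default `max_tokens` are hypothetical, while the temperature and top_p values are the ones used in the smoke test.

```python
def build_request(user_text: str, system_text: str = None, revision: str = "r1",
                  model: str = "deepseek-r1", max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload that follows the prompt rule per revision."""
    if revision not in ("r1", "r1-0528"):
        raise ValueError(f"unknown revision label: {revision!r}")

    if system_text and revision == "r1-0528":
        # R1-0528 supports a real system message.
        messages = [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ]
    elif system_text:
        # Original R1: fold the instructions into the user prompt instead.
        messages = [{"role": "user", "content": f"{system_text}\n\n{user_text}"}]
    else:
        messages = [{"role": "user", "content": user_text}]

    return {
        "model": model,
        "messages": messages,
        "temperature": 0.6,  # inside DeepSeek's recommended 0.5-0.7 range
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


print(build_request("Solve 17 * 23.", system_text="Answer tersely."))
```

Keeping this logic in one place means a later move from the original checkpoint to R1-0528 changes a single argument rather than every call site.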

Failure modes

OOM at load: BF16 weights or too much KV cache for the node. Use FP8 where supported, lower --max-model-len, or move to more GPUs.
Low throughput despite enough HBM: MoE layers are not using expert parallelism; check vLLM flags and build support.
Reasoning loops or empty <think> blocks: use DeepSeek's recommended sampling range and prompt shape.
Bad TTFT: context cap is too high or prefill is saturated; use chunked prefill or a smaller max length.
TCP fallback: multi-node deployment lacks GPUDirect RDMA or the pod lacks shared memory / IPC_LOCK settings.
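
To diagnose the latency failure modes above, TTFT and TPOT can be measured from a streamed response. The sketch below assumes the OpenAI-style server-sent-event stream (`data:` lines ending with `[DONE]`) and treats each text-bearing chunk as roughly one token, which is an approximation. The `reasoning_content` field name is an assumption about how a reasoning parser might expose reasoning text; the function names are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterable, Iterator, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text carried by one `data:` line, or None if it carries none."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta", {})
    # Assumed field name for separately streamed reasoning text.
    return delta.get("content") or delta.get("reasoning_content") or None


def latency_stats(sent_at: float, arrivals: list) -> dict:
    """TTFT and mean TPOT (seconds) from the send time and chunk arrival times."""
    if not arrivals:
        raise ValueError("no text-bearing chunks arrived")
    ttft = arrivals[0] - sent_at
    gaps = len(arrivals) - 1
    tpot = (arrivals[-1] - arrivals[0]) / gaps if gaps else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "chunks": len(arrivals)}


def measure(lines: Iterable[str], sent_at: float, clock: Callable[[], float]) -> dict:
    """Timestamp each text-bearing line as it is consumed, then summarise."""
    arrivals = []
    for line in lines:
        if parse_sse_line(line) is not None:
            arrivals.append(clock())
    return latency_stats(sent_at, arrivals)


def stream_lines(url: str, payload: dict) -> Iterator[str]:
    """Lazily POST the payload (it must set "stream": true) and yield decoded lines."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:
            yield raw.decode("utf-8").strip()


# Usage against a live endpoint (not executed here):
#   import time
#   sent = time.perf_counter()
#   stats = measure(stream_lines(url, payload), sent, time.perf_counter)
```

A single request only shows unloaded latency; repeat the measurement at the target concurrency before concluding that a context cap or prefill setting is acceptable.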

References

DeepSeek-R1 model card: https://huggingface.co/deepseek-ai/DeepSeek-R1
DeepSeek-R1-0528 (current revision: system prompts + function calling, no forced think prefix): https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
vLLM expert parallel deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/