Cookbook: serve Kimi K2 with vLLM
Scope: a vLLM reference template for serving moonshotai/Kimi-K2-Instruct: what Kimi K2 is, why and when to use it for agentic/tool workloads, how to size the 1T-parameter MoE, how to launch the vendor-recommended 16-GPU vLLM shapes, and how to verify tool calling and latency.
Reference template. Commands and manifests are not executed here. Kimi's own deployment guide says the smallest FP8 128K deployment unit on mainstream H200/H20 platforms is 16 GPUs. Do not silently shrink this to one 8-GPU pod for production.
flowchart LR
USER["Agent request"] --> API["vLLM OpenAI endpoint"]
API --> TOOL["Tool-call parser: kimi_k2"]
API --> MOE["1T MoE weights"]
MOE --> EP["TP16 or DP+EP16"]
TOOL --> RESP["Action / answer"]
What
Kimi K2 Instruct is Moonshot AI's mixture-of-experts (MoE) language model with 1T total and 32B activated parameters. The model card describes it as agentic, tool-use-oriented, and long-context, with 128K context and MLA attention. The checkpoint is released in block-FP8 form under a modified MIT license. The current instruct revision, moonshotai/Kimi-K2-Instruct-0905, extends context to 256K. For agentic reasoning, moonshotai/Kimi-K2-Thinking (1T/32B, 256K, native INT4) adds interleaved thinking; serve it with --reasoning-parser kimi_k2 alongside the tool-call parser. Pin whichever checkpoint you validate.
Why
Use Kimi K2 when the target product is an agent rather than a simple chatbot: repository editing, terminal workflows, multi-tool orchestration, function calling, and long multi-turn tasks. The 32B activated parameters keep per-token compute manageable, but roughly 1 TB of FP8 weights plus KV cache for 128K context make the deployment larger than a typical single-node vLLM service.
When
Use it when:
- Tool calling is a first-class requirement.
- The platform can reserve 16 GPUs for one replica at 128K context.
- The workload benefits from long agent state and high-quality coding/tool traces.
Avoid it when:
- The serving platform only has one 8-GPU node per replica.
- The application only needs short Q&A or extraction.
- The team cannot validate the model's modified MIT license and third-party notices.
How
1. Size the replica
| Item | Practical note |
|---|---|
| Model | moonshotai/Kimi-K2-Instruct |
| Params | 1T total / 32B activated |
| Context | 128K |
| Precision | block-FP8 checkpoint |
| License | modified MIT |
| Smallest vendor shape | 16 GPUs for FP8 128K on H200/H20-class platforms |
For an 8-GPU development node, lower --max-model-len, use a smaller Kimi variant if available, or treat the test as a non-production smoke check only. The production recipe below assumes 16 GPUs.
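The 16-GPU floor can be sanity-checked with rough arithmetic. The sketch below is an estimate under assumptions that are not from Kimi's guide: about one byte per parameter for the FP8 checkpoint, weights sharded evenly across GPUs, and 141 GB of HBM per H200. It ignores scale tensors, replicated layers, activations, and KV cache, so treat the numbers as a lower bound on memory pressure.

```shell
#!/usr/bin/env bash
# Rough per-GPU weight footprint in whole GB.
# Assumptions (not from Kimi's guide): ~1 byte/param (FP8), even sharding,
# 141 GB HBM per H200, nothing counted except weights.

# per_gpu_weight_gb <params_in_billions> <bytes_per_param> <num_gpus>
per_gpu_weight_gb() {
  local params_b=$1 bytes_per_param=$2 gpus=$3
  # 1e9 params * 1 byte = 1 GB, so billions * bytes/param gives GB.
  echo $(( params_b * bytes_per_param / gpus ))
}

# usable_gb <hbm_gb> <utilization_percent>
usable_gb() {
  local hbm_gb=$1 util_pct=$2
  echo $(( hbm_gb * util_pct / 100 ))
}

budget=$(usable_gb 141 85)   # mirrors --gpu-memory-utilization 0.85
for gpus in 8 16; do
  weights=$(per_gpu_weight_gb 1000 1 "$gpus")
  echo "gpus=$gpus weights_per_gpu=${weights}GB budget=${budget}GB left_for_kv=$(( budget - weights ))GB"
done
```

Under these assumptions the 8-GPU case has no headroom left for KV cache at 0.85 utilization, which is why an 8-GPU node is only suitable for a reduced-context smoke check.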
2. Tensor-parallel 16-GPU launch
Kimi's guide shows pure tensor parallelism for parallelism degree <= 16. Start a Ray cluster across two 8-GPU nodes first, then run the server from the head node.
export MODEL_PATH=/models/Kimi-K2-Instruct
vllm serve "$MODEL_PATH" \
  --port 8000 \
  --served-model-name kimi-k2 \
  --trust-remote-code \
  --tensor-parallel-size 16 \
  --distributed-executor-backend ray \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --max-model-len 131072
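The Ray bring-up that this launch depends on might look like the following sketch. The HEAD_IP value, the port 6379, and the ROLE switch are placeholders introduced here for illustration, not values from Kimi's guide. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: form a two-node Ray cluster before running `vllm serve` on the head node.
# HEAD_IP, RAY_PORT, and ROLE are illustrative placeholders (assumptions).
HEAD_IP="${HEAD_IP:-10.0.0.1}"
RAY_PORT="${RAY_PORT:-6379}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

start_head()    { run ray start --head --port="$RAY_PORT"; }           # node 0
start_worker()  { run ray start --address="${HEAD_IP}:${RAY_PORT}"; }  # node 1
check_cluster() { run ray status; }                                    # should list 16 GPUs

# ROLE selects what this node does; the default prints the whole plan.
case "${ROLE:-plan}" in
  head)   start_head ;;
  worker) start_worker ;;
  check)  check_cluster ;;
  *)      DRY_RUN=1; start_head; start_worker; check_cluster ;;
esac
```

Run it with ROLE=head DRY_RUN=0 on node 0 and ROLE=worker DRY_RUN=0 on node 1, then confirm with ROLE=check that both nodes have joined before starting vLLM.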
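3. DP + expert-parallel 16-GPU launch
This shape follows Moonshot's H200 example: 16 data-parallel ranks, 8 per node, with expert parallelism enabled so that MoE experts are distributed across GPUs instead of every layer being tensor-split. Node 1 runs headless and joins node 0 starting at rank 8. It is the better starting point when token routing and expert locality dominate throughput.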
# node 0
export MASTER_IP=<node-0-private-ip>
export PORT=13345
export MODEL_PATH=/models/Kimi-K2-Instruct
vllm serve "$MODEL_PATH" \
  --port 8000 \
  --served-model-name kimi-k2 \
  --trust-remote-code \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-address "$MASTER_IP" \
  --data-parallel-rpc-port "$PORT" \
  --enable-expert-parallel \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2
# node 1
vllm serve "$MODEL_PATH" \
  --headless \
  --data-parallel-start-rank 8 \
  --port 8000 \
  --served-model-name kimi-k2 \
  --trust-remote-code \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-address "$MASTER_IP" \
  --data-parallel-rpc-port "$PORT" \
  --enable-expert-parallel \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2
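Loading a checkpoint of this size can take a while, so it helps to wait for the API server before sending traffic. A minimal sketch, assuming the /health route of vLLM's OpenAI-compatible server and a placeholder base URL; the attempt count and sleep interval are arbitrary values chosen here:

```shell
#!/usr/bin/env bash
# Sketch: poll the vLLM health endpoint until it answers or attempts run out.
# The base URL, attempt count, and sleep interval are illustrative assumptions.

# wait_for_health <base_url> [max_attempts] [sleep_seconds]
wait_for_health() {
  local base_url=$1 max_attempts=${2:-60} sleep_s=${3:-10} i
  for (( i = 1; i <= max_attempts; i++ )); do
    # -f makes curl fail on HTTP errors; --max-time bounds a hung connection.
    if curl -sf --max-time 5 "${base_url}/health" >/dev/null 2>&1; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$sleep_s"
  done
  echo "not ready after ${max_attempts} attempt(s)" >&2
  return 1
}

# Example (left commented so sourcing the file has no side effects):
# wait_for_health "http://<service-url>:8000" 90 10 && \
#   curl -s "http://<service-url>:8000/v1/models"
```

The follow-up call to /v1/models should list kimi-k2, the name set by --served-model-name.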
4. Kubernetes shape
Use KubeRay, LeaderWorkerSet, or another multi-pod gang-scheduled pattern for a 16-GPU Kimi replica. A single Kubernetes Deployment with nvidia.com/gpu: 8 is not equivalent to the vendor 16-GPU shape.
Minimal scheduler constraints for each worker pod:
spec:
  hostNetwork: true
  containers:
    - name: vllm
      image: vllm/vllm-openai:<tested-vllm-tag>
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      env:
        - { name: NCCL_DEBUG, value: INFO }
        - { name: GLOO_SOCKET_IFNAME, value: eth0 }
      resources:
        limits: { nvidia.com/gpu: 8 }
        requests: { nvidia.com/gpu: 8 }
      volumeMounts:
        - { name: dshm, mountPath: /dev/shm }
  volumes:
    - name: dshm
      emptyDir: { medium: Memory, sizeLimit: 64Gi }
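Before applying a manifest like this, it can be worth confirming that the cluster actually has two nodes with 8 allocatable GPUs each. The helper names below are hypothetical, and the sketch assumes kubectl access and the nvidia.com/gpu resource name used in the manifest:

```shell
#!/usr/bin/env bash
# Sketch: check that enough nodes expose 8 allocatable GPUs for one 16-GPU replica.
# Function names are illustrative; requires kubectl access to the cluster.

# Prints "<node-name><TAB><allocatable nvidia.com/gpu>" for every node.
list_node_gpus() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}

# count_nodes_with_gpus <min_gpus>
# Reads list_node_gpus output on stdin and prints how many nodes qualify.
count_nodes_with_gpus() {
  awk -v min="$1" -F'\t' '$2 + 0 >= min { n++ } END { print n + 0 }'
}

# Example: list_node_gpus | count_nodes_with_gpus 8   # want 2 or more per replica
```

Nodes without the GPU resource report an empty value, which the awk filter treats as zero.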
5. Tool-calling smoke test
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "kimi-k2",
    "messages": [{"role": "user", "content": "Use the get_weather tool for Athens."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0.6,
    "max_tokens": 512
  }'
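To check the reply mechanically rather than by eye, the response can be piped through a small filter. This is a sketch that assumes jq is installed and that the response follows the OpenAI chat-completions schema, where structured calls appear under choices[0].message.tool_calls; the function name has_tool_call is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a chat-completion response contains a structured tool call.
# Assumes jq is available and the OpenAI chat-completions response schema.

# has_tool_call <expected_function_name>
# Reads the response JSON on stdin; exit status 0 means a matching call was found.
has_tool_call() {
  jq -e --arg name "$1" '
    (.choices[0].message.tool_calls // [])
    | any(.function.name == $name)
  ' >/dev/null
}

# Example: save the curl output to response.json, then
# has_tool_call get_weather < response.json && echo PASS
```

A reply that only mentions get_weather inside message.content fails this check, which is the distinction the first pass criterion draws.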
Pass criteria:
- The response includes a structured tool call, not plain text pretending to call a tool.
- --tool-call-parser kimi_k2 is present in the launch command.
- vLLM logs show all 16 ranks joined.
- The load test records TTFT and TPOT at the configured --max-num-seqs.
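For the latency criterion, the two metrics can be derived from three timestamps per streamed request. The sketch below assumes the common definitions: TTFT is the time from request start to the first token, and TPOT is the remaining decode time divided by the number of output tokens after the first. The sample timestamps are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive TTFT and TPOT for one streamed request from timestamps in seconds.
# Assumed definitions: TTFT = first token - request start;
# TPOT = (last token - first token) / (output tokens - 1).

# latency_metrics <t_start> <t_first_token> <t_last_token> <output_tokens>
latency_metrics() {
  awk -v s="$1" -v f="$2" -v l="$3" -v n="$4" 'BEGIN {
    ttft = f - s
    tpot = (n > 1) ? (l - f) / (n - 1) : 0   # undefined for a single token, report 0
    printf "ttft_ms=%.1f tpot_ms=%.1f\n", ttft * 1000, tpot * 1000
  }'
}

# Illustrative request: first token at 0.5 s, last at 5.5 s, 201 output tokens.
latency_metrics 0 0.5 5.5 201
```

Aggregate these per-request values into percentiles while holding concurrency at the configured --max-num-seqs, so the numbers reflect the loaded server rather than a single idle request.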
Failure modes
- OOM on 8 GPUs: expected for full 128K FP8 Kimi K2; use the 16-GPU shape or reduce scope for a dev smoke test.
- Tool calls emitted as text: missing --enable-auto-tool-choice or --tool-call-parser kimi_k2.
- Ranks hang on startup: Ray or data-parallel RPC networking is not reachable on the private interface.
- Poor decode throughput: expert-parallel backend or all-to-all path is suboptimal; evaluate DP+EP versus TP16.
- Long-context SLO miss: 128K context increases TTFT; cap context per route rather than exposing max context to every request.
References
- Kimi K2 model card: https://huggingface.co/moonshotai/Kimi-K2-Instruct
- Kimi-K2-Instruct-0905 (current instruct, 256K context): https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
- Kimi-K2-Thinking (256K, native INT4, reasoning parser): https://huggingface.co/moonshotai/Kimi-K2-Thinking
- Kimi K2 deployment guide: https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/docs/deploy_guidance.md
- vLLM expert parallel deployment: https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/
- vLLM parallelism and scaling: https://docs.vllm.ai/en/latest/serving/parallelism_scaling/
Related: Inference serving · Open-weight serving · Disaggregated inference · Kubernetes GPU platform · SLO/SLI catalog