Cookbook: serve MiniMax-M2 with vLLM¶
Scope: a vLLM reference template for serving MiniMaxAI/MiniMax-M2, MiniMax's efficient MoE agent and reasoning model: what it is (230B total / 10B active, interleaved thinking), why and when to use it for agentic and coding workloads, how to launch the 4-GPU tensor-plus-expert-parallel shape with the minimax_m2 parsers, and how to verify tool calls and thinking preservation.
Reference template. Commands and manifests are not executed here. MiniMax-M2 needs a recent, verified vLLM build (older versions can emit corrupted output); pin the exact vLLM commit/image and model revision before production, and validate on one node.
flowchart LR
AGENT["Agent / coding request"] --> API["vLLM OpenAI endpoint"]
API --> THINK["Interleaved <think> reasoning"]
API --> MOE["230B MoE, 10B active"]
THINK --> TOOL["Tool call via minimax_m2 parser"]
MOE --> RESP["Action / answer"]
What¶
MiniMax-M2 is a 230B-total, 10B-active MoE model under a modified MIT license, tuned for agentic and coding workloads. It is an interleaved-thinking model: it emits <think>...</think> reasoning that you must keep in the conversation history across turns, or quality degrades. Context reaches up to 196K tokens per sequence. The small active count (10B of 230B) makes decode cheap for the capability, so it serves on a 4-GPU node rather than the 8 to 16 the trillion-parameter MoEs need. It follows MiniMax-Text-01 and MiniMax-M1 (which used lightning attention for 1M context).
Why¶
Use MiniMax-M2 when the product is an agent or a coding assistant and you want strong tool use and multi-step reasoning at a low serving cost. The 10B active footprint keeps tokens per second high and the deployment small, and interleaved thinking drives good long-horizon behaviour. The modified-MIT license suits internal platforms. Among the models in this set, it is the cheapest capable agent model to stand up, because it fits four GPUs.
When¶
Use it when:
- Agentic tool use, coding, and multi-step reasoning dominate.
- You want a capable model on a 4-GPU node rather than a 16-GPU one.
- Your client stack can preserve the model's thinking tokens in history (it must round-trip <think> content).
Avoid it when:
- The client strips assistant thinking content: M2 needs it retained in history and loses quality without it.
- You need a non-thinking, minimal-latency short-chat model.
- You cannot run a recent, verified vLLM build (older versions produce corrupted output for M2).
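The round-trip requirement in the lists above can be sketched on the client side. This is a minimal illustration, not the documented MiniMax or vLLM client contract: the helper name assistant_turn_for_history is hypothetical, and the field names reasoning_content and reasoning are assumptions about where a reasoning parser places the thinking text. Check the response your pinned vLLM build actually returns before relying on either name.

```python
from typing import Any

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def assistant_turn_for_history(message: dict[str, Any]) -> dict[str, Any]:
    """Build the assistant entry to append to history, keeping thinking content.

    `message` is the assistant message from a chat completion response.
    Hypothetical helper: the reasoning field names below are assumptions.
    """
    content = message.get("content") or ""
    # Assumed field names; builds may populate one, the other, or neither.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""

    # Re-embed the thinking only if it was split out of the content.
    if reasoning and THINK_OPEN not in content:
        content = f"{THINK_OPEN}\n{reasoning.strip()}\n{THINK_CLOSE}\n\n{content}"

    turn: dict[str, Any] = {"role": "assistant", "content": content}
    # Keep structured tool calls so the following tool result still matches.
    if message.get("tool_calls"):
        turn["tool_calls"] = message["tool_calls"]
    return turn


if __name__ == "__main__":
    reply = {"content": "It is sunny.", "reasoning_content": "I need the weather."}
    print(assistant_turn_for_history(reply)["content"])
```

Whether the server expects the thinking inline in content or as a separate field depends on the chat template and parser in the pinned build, so treat the inline form here as one option to validate, not a rule.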
How¶
1. Size the replica¶
| Item | Practical note |
|---|---|
| Model | MiniMaxAI/MiniMax-M2 |
| Params | 230B total / 10B active MoE |
| Context | up to 196K tokens per sequence |
| Precision | BF16 or FP8 |
| License | modified MIT |
| Thinking | interleaved <think>; retain it in history |
| Starting hardware | 4x H200 / H20 / H100 or 4x A100 / A800 |
| vLLM | recent verified or nightly build (see references) |
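A quick way to sanity-check the table is to work out the weight footprint per GPU. The sketch below is arithmetic only, under loud assumptions: weights are sharded evenly across the four GPUs, KV cache and activations are ignored, and the HBM sizes are nominal figures for common SKUs that are not taken from this page. Confirm them against your actual hardware.

```python
def weight_gb_per_gpu(total_params: float, bytes_per_param: float, gpus: int) -> float:
    """Weights only, evenly sharded; ignores KV cache, activations, and overhead."""
    return total_params * bytes_per_param / gpus / 1e9


TOTAL_PARAMS = 230e9  # 230B total parameters, from the table above
GPUS = 4

# Nominal HBM per GPU in GB. Assumed figures; verify for your SKU.
NOMINAL_HBM_GB = {"H200": 141, "H100": 80, "A100": 80}

for precision, nbytes in (("BF16", 2), ("FP8", 1)):
    per_gpu = weight_gb_per_gpu(TOTAL_PARAMS, nbytes, GPUS)
    for gpu, hbm in NOMINAL_HBM_GB.items():
        headroom = hbm - per_gpu
        verdict = "fits" if headroom > 0 else "does not fit"
        print(f"{precision} on 4x {gpu}: {per_gpu:.1f} GB weights/GPU, "
              f"{headroom:.1f} GB left, {verdict}")
```

Under these assumptions BF16 weights come to 115 GB per GPU and FP8 weights to 57.5 GB per GPU, so the 80 GB cards only have room for the FP8 form, and whatever headroom remains is what the KV cache has to live in.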
2. Bare-metal vLLM server (tensor + expert parallel on 4 GPUs)¶
vllm serve MiniMaxAI/MiniMax-M2 \
--served-model-name minimax-m2 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2 \
--enable-auto-tool-choice \
--compilation-config '{"mode": 3, "pass_config": {"fuse_minimax_qk_norm": true}}' \
--trust-remote-code
Pin a verified vLLM build: M2 is new enough that older releases can emit corrupted output, so use the commit the vLLM recipe verifies (or a nightly after it). The --compilation-config fuses the QK-norm kernel; keep it for throughput.
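Because a missing parser flag fails quietly (tool calls come back as plain text), it can help to lint the launch arguments before rollout. The following is a hypothetical pre-flight helper, not part of vLLM: the function name missing_flags and the REQUIRED table are illustrative, and the check only covers the three flags this page names. It accepts both the "--flag value" form used on the command line and the "--flag=value" form used in container args.

```python
from typing import Optional

# Flags this page requires for tool calling and reasoning; None means "no value".
REQUIRED = {
    "--tool-call-parser": "minimax_m2",
    "--reasoning-parser": "minimax_m2",
    "--enable-auto-tool-choice": None,
}


def missing_flags(args: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the flags look right."""
    seen: dict[str, Optional[str]] = {}
    i = 0
    while i < len(args):
        tok = args[i]
        if tok.startswith("--"):
            if "=" in tok:  # --flag=value
                flag, value = tok.split("=", 1)
                seen[flag] = value
            elif i + 1 < len(args) and not args[i + 1].startswith("--"):
                seen[tok] = args[i + 1]  # --flag value
                i += 1
            else:
                seen[tok] = None  # boolean switch
        i += 1

    problems = []
    for flag, want in REQUIRED.items():
        if flag not in seen:
            problems.append(f"{flag} is missing")
        elif want is not None and seen[flag] != want:
            problems.append(f"{flag} is {seen[flag]!r}, expected {want!r}")
    return problems


if __name__ == "__main__":
    import shlex

    cmd = ("vllm serve MiniMaxAI/MiniMax-M2 --tensor-parallel-size 4 "
           "--tool-call-parser minimax_m2 --reasoning-parser minimax_m2")
    print(missing_flags(shlex.split(cmd)))
```

Running the file as written reports that --enable-auto-tool-choice is missing from the sample command, which is the same omission that produces text-only tool calls at runtime.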
3. Kubernetes Deployment template¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-minimax-m2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-minimax-m2 }
  template:
    metadata:
      labels: { app: vllm-minimax-m2 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-4gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<verified-vllm-tag>
          args:
            - --model=MiniMaxAI/MiniMax-M2
            - --served-model-name=minimax-m2
            - --tensor-parallel-size=4
            - --enable-expert-parallel
            - --tool-call-parser=minimax_m2
            - --reasoning-parser=minimax_m2
            - --enable-auto-tool-choice
            - '--compilation-config={"mode": 3, "pass_config": {"fuse_minimax_qk_norm": true}}'
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 4 }
            requests: { nvidia.com/gpu: 4 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-minimax-m2
  namespace: serving
spec:
  selector: { app: vllm-minimax-m2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
4. Tool-call and thinking smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "minimax-m2",
"messages": [{"role": "user", "content": "Use the get_weather tool for Athens, then summarize."}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Return current weather for a city.",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"temperature": 1.0,
"top_p": 0.95,
"top_k": 40,
"max_tokens": 1024
}'
Pass criteria:
- The response includes a structured tool call, not plain text pretending to call a tool.
- --reasoning-parser minimax_m2 and --tool-call-parser minimax_m2 are present in the launch command.
- In multi-turn runs, the client sends the assistant's prior <think> content back in history. MiniMax documents that dropping it degrades quality.
- Sampling uses the card's recommended temperature 1.0, top_p 0.95, top_k 40.
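The first pass criterion can be checked mechanically against the JSON the smoke test returns. The sketch below assumes the standard OpenAI-compatible chat completion shape (choices[0].message.tool_calls, with function.arguments as a JSON string). The helper name structured_tool_call is hypothetical, and the sample response is hand-written for illustration, not captured server output.

```python
import json
from typing import Any


def structured_tool_call(response: dict[str, Any], name: str) -> dict[str, Any]:
    """Return the parsed arguments of the first tool call named `name`.

    Raises AssertionError when the model answered in plain text instead.
    """
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") == name:
            # Arguments arrive as a JSON-encoded string, not an object.
            return json.loads(fn.get("arguments") or "{}")
    raise AssertionError(
        f"no structured call to {name!r}; content was: {message.get('content')!r}"
    )


if __name__ == "__main__":
    # Hand-written sample in the OpenAI-compatible shape (illustrative only).
    sample = {
        "choices": [{
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": "call_0",
                    "type": "function",
                    "function": {"name": "get_weather",
                                 "arguments": "{\"city\": \"Athens\"}"},
                }],
            }
        }]
    }
    print(structured_tool_call(sample, "get_weather"))
```

Pipe the curl output into a file, load it with json.load, and pass it to the helper; an AssertionError whose message contains tool-like text in content points at a missing parser flag.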
Failure modes¶
- Client strips <think> - the defining M2 hazard: quality falls over turns. Retain assistant thinking content in the message history.
- Corrupted output - the vLLM build is too old. Use the verified commit or a later nightly per the recipe.
- Tool calls emitted as text - missing --enable-auto-tool-choice or --tool-call-parser minimax_m2.
- Slower than expected - the QK-norm fusion is off; keep the recommended --compilation-config.
- OOM on long context - 196K per sequence is a ceiling, not a default to expose. Cap --max-model-len to the route's real need and size KV cache accordingly (KV cache management).
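To put numbers on the long-context item, the usual per-token KV cache estimate for standard grouped-query attention is 2 (keys and values) x layers x KV heads x head dimension x bytes per element. The sketch below applies that formula with placeholder architecture values; they are NOT the real MiniMax-M2 configuration. Read the actual values from the model's config.json, and treat the 90 GiB node-wide KV budget as an arbitrary example too.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Keys plus values for one token across all layers (standard GQA layout)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(kv_budget_bytes: int, max_model_len: int, bytes_per_token: int) -> int:
    """How many sequences at full --max-model-len fit in the KV budget."""
    return kv_budget_bytes // (max_model_len * bytes_per_token)


GIB = 2**30

# Placeholder architecture values for illustration only; not M2's real config.
per_token = kv_bytes_per_token(layers=60, kv_heads=8, head_dim=128, dtype_bytes=2)

# Hypothetical KV budget summed across the whole 4-GPU node.
budget = 90 * GIB

for max_len in (196_608, 32_768):
    per_seq_gib = max_len * per_token / GIB
    fits = max_concurrent_sequences(budget, max_len, per_token)
    print(f"max_model_len={max_len}: {per_seq_gib:.1f} GiB per full sequence, "
          f"{fits} full-length sequences fit")
```

With these placeholder numbers a full 196K-token sequence costs 45 GiB of KV cache and only two fit, while capping at 32K tokens costs 7.5 GiB each and twelve fit. The real ratio depends on the model's actual configuration and the KV cache dtype, but the scaling is linear in the cap either way.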
References¶
- MiniMax-M2 model card: https://huggingface.co/MiniMaxAI/MiniMax-M2
- vLLM recipe for MiniMax-M2 (verified commit, TP4+EP, minimax_m2 parsers, compilation-config): https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html
- vLLM tool calling and reasoning parsers: https://docs.vllm.ai/en/latest/features/tool_calling/
- MiniMax-M1 (lineage, lightning attention, 1M context): https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
Related: Inference serving · Open-weight serving · Kimi K2 · Generic vLLM recipe · KV cache management · Constrained decoding · SLO/SLI catalog · Glossary