Cookbook: serve MiniMax-M2 with vLLM

Scope: a vLLM reference template for serving MiniMaxAI/MiniMax-M2, MiniMax's efficient MoE agent and reasoning model: what it is (230B total / 10B active, interleaved thinking), why and when to use it for agentic and coding workloads, how to launch the 4-GPU tensor-plus-expert-parallel shape with the minimax_m2 parsers, and how to verify tool calls and thinking preservation.

Reference template. Commands and manifests are not executed here. MiniMax-M2 needs a recent, verified vLLM build (older versions can emit corrupted output); pin the exact vLLM commit/image and model revision before production, and validate on one node.

flowchart LR
  AGENT["Agent / coding request"] --> API["vLLM OpenAI endpoint"]
  API --> THINK["Interleaved <think> reasoning"]
  API --> MOE["230B MoE, 10B active"]
  THINK --> TOOL["Tool call via minimax_m2 parser"]
  MOE --> RESP["Action / answer"]

What

MiniMax-M2 is a 230B-total, 10B-active MoE model released under a modified MIT license and tuned for agentic and coding workloads. It is an interleaved-thinking model: it emits <think>...</think> reasoning that you must keep in the conversation history across turns, or quality degrades. Context reaches up to 196K tokens per sequence. The small active count (10B of 230B) makes decode cheap for the capability, so it serves on a 4-GPU node rather than the 8 to 16 GPUs that trillion-parameter MoEs need. It follows MiniMax-Text-01 and MiniMax-M1 (which used lightning attention for 1M context).
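Keeping the thinking in history is a client-side job, so a small Python sketch may help. It assumes the server has split the reasoning into a separate `reasoning_content` field on the assistant message, as can happen when a reasoning parser is enabled. That field name, the helper name `to_history_message`, and the exact inline layout are assumptions, not something this page specifies. Check what your vLLM build returns and what the model's chat template expects before relying on it.

```python
from typing import Any


def to_history_message(assistant_msg: dict[str, Any]) -> dict[str, Any]:
    """Rebuild an assistant turn for the next request without losing its thinking.

    ASSUMPTION: reasoning may arrive in a separate `reasoning_content` field.
    If it does, it is wrapped back into <think>...</think> ahead of the answer.
    """
    content = assistant_msg.get("content") or ""
    reasoning = assistant_msg.get("reasoning_content") or ""
    # Only re-wrap when the content does not already carry the thinking inline.
    if reasoning and "<think>" not in content:
        content = f"<think>{reasoning}</think>{content}"
    history_msg: dict[str, Any] = {"role": "assistant", "content": content}
    # Tool calls belong to the same turn and must travel with it.
    if assistant_msg.get("tool_calls"):
        history_msg["tool_calls"] = assistant_msg["tool_calls"]
    return history_msg


# Usage: append the rebuilt turn, not a stripped copy, before the next request.
# messages.append(to_history_message(response["choices"][0]["message"]))
```

The point of the helper is that the answer text is never appended alone: whatever reasoning the server returned goes back with it.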

Why

Use MiniMax-M2 when the product is an agent or a coding assistant and you want strong tool use and multi-step reasoning at a low serving cost. The 10B active footprint keeps tokens per second high and the deployment small, and interleaved thinking drives good long-horizon behaviour. The modified-MIT license suits internal platforms. Among the models in this set, it is the cheapest capable agent model to stand up, because it fits four GPUs.

When

Use it when:

  • Agentic tool use, coding, and multi-step reasoning dominate.
  • You want a capable model on a 4-GPU node rather than an 8- to 16-GPU deployment.
  • Your client stack can preserve the model's thinking tokens in history (it must round-trip <think> content).

Avoid it when:

  • The client strips assistant thinking content: M2 needs it retained and loses quality without it.
  • You need a non-thinking, minimal-latency short-chat model.
  • You cannot run a recent, verified vLLM build (older versions produce corrupted output for M2).

How

1. Size the replica

Item              | Practical note
Model             | MiniMaxAI/MiniMax-M2
Params            | 230B total / 10B active MoE
Context           | up to 196K tokens per sequence
Precision         | BF16 or FP8
License           | modified MIT
Thinking          | interleaved <think>; retain it in history
Starting hardware | 4x H200 / H20 / H100 or 4x A100 / A800
vLLM              | recent verified or nightly build (see references)
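As a rough check on the sizing above, the weight footprint per GPU can be worked out in a few lines of Python. This is a back-of-envelope sketch, not a capacity plan. It assumes the weights shard evenly across the four GPUs and ignores KV cache, activations, and runtime overhead. The helper name is hypothetical. Note that all 230B parameters must be resident, even though only 10B are active per token.

```python
def weight_gb_per_gpu(total_params_b: float, bytes_per_param: float, gpus: int) -> float:
    """Weight memory per GPU in GB (10^9 bytes), assuming an even shard."""
    return total_params_b * bytes_per_param / gpus


for name, nbytes in (("BF16", 2), ("FP8", 1)):
    print(f"{name}: {weight_gb_per_gpu(230, nbytes, 4):.1f} GB of weights per GPU on 4 GPUs")
# BF16: 115.0 GB of weights per GPU on 4 GPUs
# FP8: 57.5 GB of weights per GPU on 4 GPUs
```

Under these assumptions the BF16 figure alone is larger than an 80 GB card, so confirm the precision and GPU memory pairing before committing to a node shape.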

2. Bare-metal vLLM server (tensor + expert parallel on 4 GPUs)

vllm serve MiniMaxAI/MiniMax-M2 \
  --served-model-name minimax-m2 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --compilation-config '{"mode": 3, "pass_config": {"fuse_minimax_qk_norm": true}}' \
  --trust-remote-code

Pin a verified vLLM build: M2 is new enough that older releases can emit corrupted output, so use the commit the vLLM recipe verifies (or a nightly after it). The --compilation-config fuses the QK-norm kernel; keep it for throughput.
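A model of this size takes a while to load, so it can help to wait for the server before sending traffic. The Python sketch below polls the /health path that the readiness probe in this template also uses. The base URL, the 30-minute deadline, and the helper names are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(
    base_url: str,
    deadline_s: float = 1800.0,
    interval_s: float = 10.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll {base_url}/health until it succeeds or the deadline passes."""
    waited = 0.0
    while True:
        if probe(f"{base_url}/health"):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s


# Usage (hypothetical local address):
# ready = wait_until_ready("http://localhost:8000")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.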

3. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-minimax-m2
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-minimax-m2 }
  template:
    metadata:
      labels: { app: vllm-minimax-m2 }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h200-4gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<verified-vllm-tag>
          args:
            - --model=MiniMaxAI/MiniMax-M2
            - --served-model-name=minimax-m2
            - --tensor-parallel-size=4
            - --enable-expert-parallel
            - --tool-call-parser=minimax_m2
            - --reasoning-parser=minimax_m2
            - --enable-auto-tool-choice
            - '--compilation-config={"mode": 3, "pass_config": {"fuse_minimax_qk_norm": true}}'
            - --trust-remote-code
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 4 }
            requests: { nvidia.com/gpu: 4 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-minimax-m2
  namespace: serving
spec:
  selector: { app: vllm-minimax-m2 }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

4. Tool-call and thinking smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "minimax-m2",
    "messages": [{"role": "user", "content": "Use the get_weather tool for Athens, then summarize."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "max_tokens": 1024
  }'

Pass criteria:

  • The response includes a structured tool call, not plain text pretending to call a tool.
  • --reasoning-parser minimax_m2 and --tool-call-parser minimax_m2 are present in the launch command.
  • In multi-turn runs, the client sends the assistant's prior <think> content back in history. MiniMax documents that dropping it degrades quality.
  • Sampling uses the card's recommended temperature 1.0, top_p 0.95, top_k 40.
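The first criterion can be checked mechanically. The Python sketch below inspects a parsed chat-completions response for a structured call to get_weather with a city argument. The function name `check_tool_call` is hypothetical, and it only looks at the first choice and the first tool call.

```python
import json
from typing import Any


def check_tool_call(response: dict[str, Any], expected_name: str = "get_weather") -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    choices = response.get("choices") or []
    if not choices:
        return ["response has no choices"]
    message = choices[0].get("message") or {}
    calls = message.get("tool_calls") or []
    if not calls:
        return ["no structured tool_calls (the call may have been emitted as text)"]
    problems: list[str] = []
    function = calls[0].get("function") or {}
    if function.get("name") != expected_name:
        problems.append(f"unexpected tool name: {function.get('name')!r}")
    try:
        # In the OpenAI-style schema, arguments arrive as a JSON-encoded string.
        arguments = json.loads(function.get("arguments") or "")
    except json.JSONDecodeError:
        problems.append("tool arguments are not valid JSON")
    else:
        if not isinstance(arguments, dict) or "city" not in arguments:
            problems.append("tool arguments lack the required 'city' field")
    return problems


# Usage: pipe the curl output to a file, then
# print(check_tool_call(json.load(open("response.json"))) or "tool call OK")
```

An empty result covers only the structured-call criterion; thinking retention and sampling settings still need to be checked on the client side.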

Failure modes

  • Client strips <think> - the defining M2 hazard: quality falls over turns. Retain assistant thinking content in the message history.
  • Corrupted output - the vLLM build is too old. Use the verified commit or a later nightly per the recipe.
  • Tool calls emitted as text - missing --enable-auto-tool-choice or --tool-call-parser minimax_m2.
  • Slower than expected - the QK-norm fusion is off; keep the recommended --compilation-config.
  • OOM on long context - 196K per sequence is a ceiling, not a default to expose. Cap --max-model-len to the route's real need and size the KV cache accordingly (see KV cache management).
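To see why capping --max-model-len matters, a per-sequence KV cache estimate can be sketched in Python. The formula assumes standard attention with a key and a value vector stored per KV head in every layer, at 2 bytes per element. The layer count, KV head count, and head dimension below are placeholders and not M2's published configuration; read the real values from the model's config.json.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence in GiB: keys plus values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30


# PLACEHOLDER shape for illustration only, not M2's actual configuration.
EXAMPLE_SHAPE = dict(layers=60, kv_heads=8, head_dim=128)

# Compare a capped route with the 196K ceiling (taken here as 196 * 1024 tokens).
for max_len in (32 * 1024, 196 * 1024):
    gib = kv_cache_gib(max_len, **EXAMPLE_SHAPE)
    print(f"max-model-len {max_len}: {gib:.1f} GiB per full-length sequence")
```

Whatever the real shape, the cost is linear in the cap, so a route limited to 32K tokens needs roughly a sixth of the per-sequence KV memory of one that exposes the full 196K.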

References

  • MiniMax-M2 model card: https://huggingface.co/MiniMaxAI/MiniMax-M2
  • vLLM recipe for MiniMax-M2 (verified commit, TP4+EP, minimax_m2 parsers, compilation-config): https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html
  • vLLM tool calling and reasoning parsers: https://docs.vllm.ai/en/latest/features/tool_calling/
  • MiniMax-M1 (lineage, lightning attention, 1M context): https://huggingface.co/MiniMaxAI/MiniMax-M1-80k

Related: Inference serving · Open-weight serving · Kimi K2 · Generic vLLM recipe · KV cache management · Constrained decoding · SLO/SLI catalog · Glossary