Cookbook: serve Llama 4 Maverick with vLLM

Scope: a vLLM reference template for serving meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8: what Llama 4 Maverick is, why and when to use it, how to handle gated access and multimodal requests, how to run a vLLM endpoint, and how to verify text and image-text chat completions.

Reference template. Commands and manifests are not executed here. Llama 4 models are gated and use the Llama 4 Community License; validate access, license terms, and acceptable-use requirements before mirroring weights into a cluster.

flowchart LR
  TEXT["Text prompt"] --> API["vLLM OpenAI API"]
  IMAGE["Image URL"] --> API
  API --> MOE["Llama 4 Maverick MoE"]
  MOE --> RESP["Text/code answer"]

What

Llama 4 Maverick is Meta's 17B-activated / 400B-total MoE model with native multimodal support for text and image inputs. The FP8 instruct checkpoint is published as meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8; the model card lists 1M context for Maverick, 12 supported languages, and the Llama 4 Community License.

Why

Use Llama 4 Maverick when the serving platform needs a mainstream open-weight multimodal model with a large ecosystem, image understanding, multilingual coverage, and license terms that are commercial-friendly but custom rather than OSI-approved. Of the models in this set, it is the most natural choice for image+text applications.

When

Use it when:

  • The product needs text plus image understanding behind one endpoint.
  • The organization can accept the Llama 4 license and gated download workflow.
  • The workload benefits from the Llama ecosystem and safety tooling.

Avoid it when:

  • An OSI license is required; Llama 4 is open-weight but custom-licensed.
  • The platform cannot store Hugging Face tokens safely.
  • The application is text-only and a smaller dense model meets the SLO.

How

1. Size the replica

| Item | Practical note |
| --- | --- |
| Model | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 |
| Params | 400B total / 17B activated |
| Modalities | Text and image input; text/code output |
| Context | 1M on model card; expose smaller route limits first |
| License | Llama 4 Community License |
| Access | Gated Hugging Face repository; requires approved token |

Start with the FP8 checkpoint and a reduced --max-model-len for SLO baselining. The model card says FP8 Maverick fits on a single H100 DGX host while maintaining quality; still validate actual HBM, KV cache, and image-preprocessor memory on the target vLLM build. --tensor-parallel-size 8 maps the replica to that 8-GPU host; it is an operational mapping, not a value the card prescribes. Cap images per request with --limit-mm-per-prompt so a burst of image tokens cannot exhaust the KV cache.
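
The KV-cache side of this sizing can be estimated before any GPU is allocated. The sketch below is a rough upper bound, not a measurement: the layer count, KV-head count, head dimension, and 2-byte cache dtype are assumptions that should be replaced with the values in the checkpoint's config.json, and the per-GPU figure assumes KV heads shard evenly across the tensor-parallel group. It also treats every layer as full attention, so the real footprint may be lower.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, bytes_per_token: int) -> float:
    """GiB of KV cache needed to hold `tokens` tokens."""
    return tokens * bytes_per_token / 2**30


# ASSUMED architecture values for illustration only; read the real ones
# from the checkpoint's config.json before trusting the result.
ASSUMED = dict(num_layers=48, num_kv_heads=8, head_dim=128, dtype_bytes=2)

per_token = kv_cache_bytes_per_token(**ASSUMED)
one_seq = kv_cache_gib(131072, per_token)  # one request at --max-model-len 131072

print(f"{per_token} bytes/token, {one_seq:.1f} GiB per full-length sequence")
print(f"{one_seq / 8:.1f} GiB per GPU at --tensor-parallel-size 8")
```

Comparing the per-GPU figure against the HBM left after weights at the chosen --gpu-memory-utilization gives a first guess at how many full-length sequences one replica can hold concurrently; the vLLM startup logs remain the authoritative number.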

2. Create the Kubernetes token secret

kubectl create namespace serving
kubectl -n serving create secret generic hf-token \
  --from-literal=token='<approved-hugging-face-token>'
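
Before the token goes into the secret, it can be worth confirming that it actually has access to the gated repository. The following is a minimal preflight sketch using only the standard library. It assumes the repository exposes config.json on the main branch through the Hub's resolve URL and that the Hub answers 401 or 403 for tokens without access; check_gated_access and classify_status are hypothetical helper names, not part of any Hugging Face library.

```python
import urllib.error
import urllib.request

REPO = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"


def config_url(repo: str) -> str:
    """URL of a small file in the repo (assumed to exist on the main branch)."""
    return f"https://huggingface.co/{repo}/resolve/main/config.json"


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse access verdict."""
    if status == 200:
        return "ok"
    if status in (401, 403):
        return "no-access"
    return "unexpected"


def check_gated_access(repo: str, token: str, timeout: float = 10.0) -> str:
    """Fetch config.json with the token and report whether access is granted."""
    req = urllib.request.Request(
        config_url(repo),
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the status code is still what matters here.
        return classify_status(err.code)
```

Calling check_gated_access(REPO, token) from a workstation with the same token should return "ok" before the secret is created; "no-access" points at a missing license approval rather than a cluster problem.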

3. Bare-metal vLLM server

export HF_TOKEN='<approved-hugging-face-token>'

vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --served-model-name llama-4-maverick \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --limit-mm-per-prompt '{"image": 1}' \
  --gpu-memory-utilization 0.85
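
Loading a checkpoint of this size can take several minutes, so smoke tests should wait until the server reports healthy. The sketch below polls the /health path on port 8000, the same endpoint the Kubernetes readiness probe in this cookbook uses. The attempt count, delay, and helper names are illustrative assumptions; the probe is injected so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay_s` between tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay_s)
    return False
```

A typical call would be wait_until_ready(lambda: http_health_probe("http://localhost:8000")), run after the vllm serve command has been started in another shell.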

4. Kubernetes Deployment template

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-4-maverick
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama-4-maverick }
  template:
    metadata:
      labels: { app: vllm-llama-4-maverick }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
            - --served-model-name=llama-4-maverick
            - --tensor-parallel-size=8
            - --max-model-len=131072
            - '--limit-mm-per-prompt={"image": 1}'
            - --gpu-memory-utilization=0.85
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: token }
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-4-maverick
  namespace: serving
spec:
  selector: { app: vllm-llama-4-maverick }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }

5. Text smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-4-maverick",
    "messages": [{"role": "user", "content": "Give a concise checklist for rolling a CUDA driver upgrade."}],
    "temperature": 0.2,
    "max_tokens": 512
  }'
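
The same text check can be scripted so that a CI job asserts on the response shape instead of a person reading curl output. This is a minimal sketch using the standard library; it assumes the OpenAI-style response layout (choices[0].message.content), and build_text_request, extract_answer, and post_chat are hypothetical helper names.

```python
import json
import urllib.request


def build_text_request(model: str, prompt: str, max_tokens: int = 512,
                       temperature: float = 0.2) -> dict:
    """Build a chat-completions payload equivalent to the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_answer(response: dict) -> str:
    """Return the first choice's text, or raise if the shape is not as expected."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response has no choices")
    content = choices[0].get("message", {}).get("content")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("first choice has no text content")
    return content


def post_chat(base_url: str, payload: dict, timeout: float = 120.0) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and parse the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

In a smoke-test job, extract_answer(post_chat(base_url, build_text_request("llama-4-maverick", "..."))) either returns text or fails loudly, which is usually enough for a pass/fail gate.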

6. Image-text smoke test

curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-4-maverick",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
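
When the cluster has no egress to public image hosts, the image can be sent inline instead of by URL. The sketch below builds the same message shape with a base64 data URL and refuses to exceed the per-request image cap. It assumes the tested vLLM build accepts data URLs in the image_url field, which should be verified on the target tag; the helper names and the max_images default (mirroring the server's limit of 1) are illustrative.

```python
import base64


def image_part_from_bytes(data: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an image_url content part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def build_image_request(model: str, question: str, image_parts: list,
                        max_images: int = 1, max_tokens: int = 128) -> dict:
    """Build an image-text chat payload, enforcing the image cap client-side."""
    if len(image_parts) > max_images:
        raise ValueError(
            f"{len(image_parts)} images exceed the cap of {max_images}")
    content = [{"type": "text", "text": question}, *image_parts]
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }
```

Checking the cap on the client gives a clearer error than waiting for the server to reject the request, but the server-side --limit-mm-per-prompt setting remains the real enforcement point.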

Pass criteria:

  • The model loads with an approved HF token and no interactive license prompt.
  • Text and image-text requests both return valid chat completions.
  • The service enforces route-specific input limits rather than exposing the full theoretical context to every client.
  • Logs show enough KV-cache capacity for the configured max context.
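
The route-specific limit in the third criterion is usually enforced in a gateway in front of vLLM rather than in vLLM itself. The following is one possible shape for that check, not a prescribed design: the route names and numbers are hypothetical, and input_tokens is assumed to come from a tokenizer count made upstream of this function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteLimit:
    max_input_tokens: int
    max_output_tokens: int
    max_images: int


# Hypothetical routes and limits; tune them to the SLOs of each client class.
ROUTE_LIMITS = {
    "chat": RouteLimit(8192, 1024, 0),
    "vision": RouteLimit(16384, 512, 1),
    "long-context": RouteLimit(120000, 2048, 0),
}

SERVER_MAX_MODEL_LEN = 131072  # matches --max-model-len in this cookbook

# No route may promise more than the server can hold in one sequence.
for name, lim in ROUTE_LIMITS.items():
    assert lim.max_input_tokens + lim.max_output_tokens <= SERVER_MAX_MODEL_LEN, name


def admit(route: str, input_tokens: int, max_tokens: int,
          images: int = 0) -> tuple[bool, str]:
    """Decide whether a request fits its route; return (allowed, reason)."""
    limit = ROUTE_LIMITS.get(route)
    if limit is None:
        return False, f"unknown route: {route}"
    if input_tokens > limit.max_input_tokens:
        return False, "input too long for route"
    if max_tokens > limit.max_output_tokens:
        return False, "max_tokens too large for route"
    if images > limit.max_images:
        return False, "too many images for route"
    return True, "ok"
```

Rejecting at the gateway keeps oversized requests from ever reaching the scheduler, so a single client cannot consume the KV cache that other routes depend on.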

Failure modes

  • 401/403 during model download - HF token lacks Llama 4 access or was not mounted as HF_TOKEN.
  • License violation risk - redistribution or UI attribution terms were not reviewed before deployment.
  • Image requests fail but text works - the vLLM build in the container image lacks the required multimodal support, or the request format is wrong.
  • OOM on long context - lower --max-model-len and cap image count/size per route.
  • Unsafe endpoint defaults - put Prompt Guard/Llama Guard or equivalent policy filters in front of the endpoint where the use case requires them.

References

  • Llama 4 Maverick FP8 model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
  • Llama 4 Maverick BF16 model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
  • Llama Cookbook: https://github.com/meta-llama/llama-cookbook
  • Llama 4 license: https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE
  • vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/

Related: Inference serving · Open-weight serving · Security · SLO/SLI catalog · Workload recipes