Cookbook: serve Llama 4 Maverick with vLLM¶
Scope: a vLLM reference template for serving meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8: what Llama 4 Maverick is, why and when to use it, how to handle gated access and multimodal requests, how to run a vLLM endpoint, and how to verify text and image-text chat completions.
Reference template. Commands and manifests are not executed here. Llama 4 models are gated and use the Llama 4 Community License; validate access, license terms, and acceptable-use requirements before mirroring weights into a cluster.
flowchart LR
TEXT["Text prompt"] --> API["vLLM OpenAI API"]
IMAGE["Image URL"] --> API
API --> MOE["Llama 4 Maverick MoE"]
MOE --> RESP["Text/code answer"]
What¶
Llama 4 Maverick is Meta's 17B-activated / 400B-total MoE model with native multimodal support for text and image inputs. The FP8 instruct checkpoint is published as meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8; the model card lists 1M context for Maverick, 12 supported languages, and the Llama 4 Community License.
Why¶
Use Llama 4 Maverick when the serving platform needs a mainstream open-weight multimodal model with a large ecosystem, image-understanding support, multilingual coverage, and commercial-friendly but custom license terms. It is the most natural choice in this set for image+text applications.
When¶
Use it when:
- The product needs text plus image understanding behind one endpoint.
- The organization can accept the Llama 4 license and gated download workflow.
- The workload benefits from the Llama ecosystem and safety tooling.
Avoid it when:
- An OSI license is required; Llama 4 is open-weight but custom-licensed.
- The platform cannot store Hugging Face tokens safely.
- The application is text-only and a smaller dense model meets the SLO.
How¶
1. Size the replica¶
| Item | Practical note |
|---|---|
| Model | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 |
| Params | 400B total / 17B activated |
| Modalities | Text and image input; text/code output |
| Context | 1M on model card; expose smaller route limits first |
| License | Llama 4 Community License |
| Access | Gated Hugging Face repository; requires approved token |
Start with the FP8 checkpoint and a reduced --max-model-len for SLO baselining. The model card says FP8 Maverick fits on a single H100 DGX host while maintaining quality; still validate actual HBM, KV cache, and image-preprocessor memory on the target vLLM build. --tensor-parallel-size 8 maps the replica to that 8-GPU host; it is an operational mapping, not a value the card prescribes. Cap images per request with --limit-mm-per-prompt so a burst of image tokens cannot exhaust the KV cache.
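The sizing argument above can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch, not a measurement: it assumes 80 GB of HBM per H100 and about one byte per parameter for FP8 weights, and neither figure comes from the model card. Real checkpoints keep some tensors in higher precision, and vLLM also needs memory for activations and CUDA graphs, so treat the result as an upper bound on KV-cache headroom.

```python
# Back-of-envelope HBM budget for one replica.
# Every constant marked "assumption" is illustrative, not measured.
TOTAL_PARAMS = 400e9            # 400B total parameters (from the table above)
BYTES_PER_PARAM = 1.0           # assumption: FP8 weights at ~1 byte each
GPUS = 8                        # matches --tensor-parallel-size 8
HBM_PER_GPU_GB = 80.0           # assumption: 80 GB H100
GPU_MEMORY_UTILIZATION = 0.85   # matches --gpu-memory-utilization 0.85


def hbm_budget_gb():
    """Return (weights_gb, usable_gb, headroom_gb) under the assumptions above."""
    weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9
    usable_gb = GPUS * HBM_PER_GPU_GB * GPU_MEMORY_UTILIZATION
    return weights_gb, usable_gb, usable_gb - weights_gb


weights, usable, headroom = hbm_budget_gb()
print(f"weights ~{weights:.0f} GB, usable ~{usable:.0f} GB, "
      f"headroom ~{headroom:.0f} GB for KV cache and activations")
```

Under these assumptions the weights take roughly 400 GB of about 544 GB usable, which is why a reduced --max-model-len is the safer starting point.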
2. Create the Kubernetes token secret¶
kubectl create namespace serving
kubectl -n serving create secret generic hf-token \
  --from-literal=token='<approved-hugging-face-token>'
3. Bare-metal vLLM server¶
export HF_TOKEN='<approved-hugging-face-token>'
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --served-model-name llama-4-maverick \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --limit-mm-per-prompt '{"image": 1}' \
  --gpu-memory-utilization 0.85
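Loading a checkpoint of this size can take many minutes, so any smoke test should wait for the server before sending requests. Here is a minimal sketch, assuming the /health path that the readiness probe in this page uses. The function names are hypothetical, and the 30-minute deadline and 10-second interval are arbitrary placeholders.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code      # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None          # connection refused, DNS failure, timeout, ...


def wait_until_healthy(base_url, deadline_s=1800.0, interval_s=10.0,
                       probe=http_status, sleep=time.sleep, clock=time.monotonic):
    """Poll <base_url>/health until it returns 200 or the deadline passes."""
    health_url = base_url.rstrip("/") + "/health"
    start = clock()
    while clock() - start < deadline_s:
        if probe(health_url) == 200:
            return True
        sleep(interval_s)
    return False
```

For a local server, wait_until_healthy("http://localhost:8000") returns True once the endpoint answers 200 and False if the deadline passes first. The probe, sleep, and clock parameters exist only so the loop can be exercised without a running server.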
4. Kubernetes Deployment template¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-4-maverick
  namespace: serving
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama-4-maverick }
  template:
    metadata:
      labels: { app: vllm-llama-4-maverick }
    spec:
      nodeSelector:
        accelerator.nvidia.com/class: h100-8gpu
      containers:
        - name: vllm
          image: vllm/vllm-openai:<tested-vllm-tag>
          args:
            - --model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
            - --served-model-name=llama-4-maverick
            - --tensor-parallel-size=8
            - --max-model-len=131072
            - '--limit-mm-per-prompt={"image": 1}'
            - --gpu-memory-utilization=0.85
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: { name: hf-token, key: token }
          ports:
            - { containerPort: 8000, name: http }
          resources:
            limits: { nvidia.com/gpu: 8 }
            requests: { nvidia.com/gpu: 8 }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 180
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 64Gi }
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-4-maverick
  namespace: serving
spec:
  selector: { app: vllm-llama-4-maverick }
  ports:
    - { name: http, port: 8000, targetPort: 8000 }
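The bare-metal command and the Deployment args repeat the same flag values, and drift between them is easy to miss. One way to keep them aligned is to render both from a single mapping. This is a minimal sketch: SERVE_CONFIG, container_args, and cli_command are hypothetical helper names, and the flag values are copied from this page.

```python
import json

MODEL = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"

# Flag values copied from the commands in this page; keeping them in one
# place is the only thing this sketch adds.
SERVE_CONFIG = {
    "served-model-name": "llama-4-maverick",
    "tensor-parallel-size": 8,
    "max-model-len": 131072,
    "limit-mm-per-prompt": {"image": 1},
    "gpu-memory-utilization": 0.85,
}


def flag_value(value):
    """Render dict values as JSON and everything else with str()."""
    return json.dumps(value) if isinstance(value, dict) else str(value)


def container_args(model=MODEL, config=SERVE_CONFIG):
    """Args in the --flag=value form used by the Deployment template."""
    return [f"--model={model}"] + [
        f"--{name}={flag_value(value)}" for name, value in config.items()
    ]


def cli_command(model=MODEL, config=SERVE_CONFIG):
    """Argument vector for the bare-metal `vllm serve` form."""
    argv = ["vllm", "serve", model]
    for name, value in config.items():
        argv += [f"--{name}", flag_value(value)]
    return argv


print("\n".join(container_args()))
```

Returning an argument vector rather than a shell string avoids quoting problems with the JSON value of --limit-mm-per-prompt.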
5. Text smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-4-maverick",
    "messages": [{"role": "user", "content": "Give a concise checklist for rolling a CUDA driver upgrade."}],
    "temperature": 0.2,
    "max_tokens": 512
  }'
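For an automated check, the curl request above can be expressed with the Python standard library so that the response shape is verified rather than eyeballed. This is a sketch under assumptions: the helper names are hypothetical, and the response is assumed to follow the OpenAI chat completion layout (choices[0].message.content).

```python
import json
import urllib.request


def build_text_request(prompt, model="llama-4-maverick",
                       max_tokens=512, temperature=0.2):
    """Build the same JSON body as the curl smoke test."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def first_message_text(completion):
    """Return the assistant text from a chat completion, or raise ValueError."""
    try:
        text = completion["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as err:
        raise ValueError(f"not a chat completion: {err!r}") from err
    if not isinstance(text, str) or not text.strip():
        raise ValueError("assistant message is empty")
    return text


def post_chat(base_url, body, timeout=120.0):
    """POST to /v1/chat/completions and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A smoke test would then call first_message_text(post_chat(base_url, build_text_request("..."))) with the real service URL and fail on any ValueError or HTTP error.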
6. Image-text smoke test¶
curl -s http://<service-url>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-4-maverick",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
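The request above depends on the server being able to fetch a public URL, which clusters without egress cannot do. The OpenAI chat format also allows an image to be sent inline as a base64 data URL. The sketch below builds such a request; it assumes the tested vLLM build accepts data URLs in image_url, which should be verified, and the helper names are hypothetical.

```python
import base64


def image_part_from_bytes(data, mime_type="image/jpeg"):
    """Wrap raw image bytes as an image_url content part using a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_image_request(question, image_part,
                        model="llama-4-maverick", max_tokens=128):
    """Same message shape as the curl test: one text part plus one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{"type": "text", "text": question}, image_part],
        }],
        "max_tokens": max_tokens,
    }
```

Reading a local file with open(path, "rb").read() and passing the bytes to image_part_from_bytes keeps the test independent of outbound network access. Inline images make the request body larger, so any gateway body-size limit still applies.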
Pass criteria:
- The model loads with an approved HF token and no interactive license prompt.
- Text and image-text requests both return valid chat completions.
- The service enforces route-specific input limits rather than exposing the full theoretical context to every client.
- Logs show enough KV-cache capacity for the configured max context.
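The route-specific limit in the pass criteria can be enforced in a gateway before a request reaches vLLM. The following is a minimal sketch of such a check. The route names and numeric limits are hypothetical placeholders, and character counts are only a crude stand-in for token counts; a production check would count tokens with the model's tokenizer.

```python
# Hypothetical per-route limits; names and numbers are placeholders.
ROUTE_LIMITS = {
    "chat": {"max_images": 0, "max_prompt_chars": 32_000, "max_tokens": 1024},
    "vision": {"max_images": 1, "max_prompt_chars": 8_000, "max_tokens": 512},
}


def count_parts(messages):
    """Return (text_chars, image_count) across all messages."""
    chars = images = 0
    for message in messages:
        content = message.get("content", "")
        if isinstance(content, str):
            chars += len(content)
            continue
        for part in content or []:
            if part.get("type") == "text":
                chars += len(part.get("text", ""))
            elif part.get("type") == "image_url":
                images += 1
    return chars, images


def check_request(route, body, limits=ROUTE_LIMITS):
    """Return a list of violations; an empty list means the request may pass."""
    limit = limits[route]
    chars, images = count_parts(body.get("messages", []))
    requested = body.get("max_tokens") or 0
    problems = []
    if images > limit["max_images"]:
        problems.append(f"{images} images > {limit['max_images']}")
    if chars > limit["max_prompt_chars"]:
        problems.append(f"{chars} prompt chars > {limit['max_prompt_chars']}")
    if requested > limit["max_tokens"]:
        problems.append(f"max_tokens {requested} > {limit['max_tokens']}")
    return problems
```

Returning every violation, instead of failing on the first, lets the gateway give the client one complete error message.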
Failure modes¶
- 401/403 during model download - HF token lacks Llama 4 access or was not mounted as HF_TOKEN.
- License violation risk - redistribution or UI attribution terms were not reviewed before deployment.
- Image requests fail but text works - vLLM image lacks the required multimodal support or request format is wrong.
- OOM on long context - lower --max-model-len and cap image count/size per route.
- Unsafe endpoint defaults - expose Prompt Guard/Llama Guard or equivalent policy filters where required by the use case.
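The 401/403 case can be narrowed down before a pod is scheduled by requesting a small file from the gated repository with the same token. This is a sketch under assumptions: the resolve/main/config.json URL pattern and the reading of 401 as a bad token versus 403 as missing access are heuristics to confirm against the Hugging Face Hub documentation, and the helper names are hypothetical.

```python
import os
import urllib.error
import urllib.request

REPO = "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"


def explain_status(status):
    """Map an HTTP status from the Hub to a likely cause (heuristic)."""
    if status == 200:
        return "token can read the repository"
    if status == 401:
        return "token missing or invalid; check that HF_TOKEN is set in the container"
    if status == 403:
        return "token is valid but lacks Llama 4 access; request access on the model page"
    return f"unexpected status {status}; inspect the download logs"


def probe_config(repo=REPO, token=None, timeout=10.0):
    """Request config.json from the repo and return the HTTP status code."""
    if token is None:
        token = os.environ.get("HF_TOKEN", "")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(
        f"https://huggingface.co/{repo}/resolve/main/config.json",
        headers=headers,
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling explain_status(probe_config()) from a machine with the same HF_TOKEN as the cluster secret gives a quick answer without pulling any weights.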
References¶
- Llama 4 Maverick FP8 model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
- Llama 4 Maverick BF16 model card: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct
- Llama Cookbook: https://github.com/meta-llama/llama-cookbook
- Llama 4 license: https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE
- vLLM supported models: https://docs.vllm.ai/en/latest/models/supported_models/