Cookbook: serve small models on consumer GPUs with vLLM
Scope: running the current small open-weight models (roughly 1B to 32B) on a single consumer or workstation GPU with vLLM: which model fits which card, the VRAM math, which quantization to use per GPU generation, the memory-constrained launch flags, and the consumer-only gotchas (no NVLink, no MIG, FP8 by generation). The small-model counterpart to the large-model cookbooks (DeepSeek-R1, Kimi K2, Qwen3-235B).
Reference template, not hardware-tested. VRAM budgets and launch flags are assembled from vLLM docs and model cards, not measured boots; validate on your card before quoting throughput. The small-model landscape moves fast, so the shortlist below is a mid-2026 snapshot: re-check each model card for the current checkpoint, context length, and license before you ship.
```mermaid
flowchart LR
    VRAM["Card VRAM (24 / 32 / 96 GB)"] --> SIZE["Pick model size"]
    SIZE --> QUANT["Quantize to fit (AWQ / FP8 / NVFP4 / MXFP4)"]
    QUANT --> KV["Budget KV cache (context x concurrency)"]
    KV --> RUN["vllm serve on one whole GPU"]
```
What
A modern small model on one consumer card is a genuinely useful endpoint: an 8B to 32B open-weight model, quantized to fit 24 to 96 GB of VRAM, served by vLLM as an OpenAI-compatible API on a single GPU. The strong small models of mid-2026 (Qwen3, gpt-oss-20b, Gemma, Phi-4, Mistral Small) match last year's much larger models on many tasks, so a 4090 or 5090 can host a capable assistant, coder, or reasoning model locally.
This is deliberately the simple case: one model, one GPU, no tensor parallelism, no NVLink, no MIG. The skill is fitting the model (weights plus KV cache) into the card and picking the right quantization for that GPU's generation. For models that do not fit even at 4-bit, use the large-model cookbooks and data-center hardware instead.
Why
- Cost and privacy. Local inference on a card you already own has no per-token cost and keeps data on-prem. A single 4090/5090 serves a team's coding or chat assistant.
- The small tier caught up. Native-thinking Qwen3, gpt-oss-20b, and Phi-4 reasoning deliver strong math, code, and tool use at sizes that fit consumer VRAM, so you rarely need a frontier model for internal tooling.
- Simplicity. One model on one whole GPU avoids the parallelism, fabric, and gang-scheduling machinery the large cookbooks require. vLLM preallocates the VRAM left under `--gpu-memory-utilization` as KV cache, so spare memory turns into concurrency and throughput.
When
Use this when:
- One small model on a single 24 to 96 GB card meets the quality bar (dev tools, internal assistants, edge, on-prem, batch).
- Cost, latency, or data locality favour self-hosting over an API.
- You want a reproducible OpenAI-compatible endpoint without a cluster.
Avoid it when:
- You need frontier quality or very long context that only a large MoE gives: use the large-model cookbooks on H100/H200/B200.
- You need hardware multi-tenant isolation on one card: consumer GeForce has no MIG (see gotchas); use data-center MIG or one GPU per tenant.
- The model does not fit even at 4-bit (for example Llama-4-Scout at 109B total needs about 55 to 60 GB at 4-bit, so an RTX PRO 6000 96 GB, not a 24 GB card).
How
1. Pick the model and card
A mid-2026 shortlist of models that each fit a single consumer card. Confirm the current checkpoint on each model card before shipping; newer flagships (Qwen3.5, Gemma 4) may have superseded these.
| Model | Params | Context | License | Good fit |
|---|---|---|---|---|
| `Qwen/Qwen3-30B-A3B-Instruct-2507` | 30.5B / 3.3B active MoE | 262K | Apache-2.0 | Best all-round on 24 GB at 4-bit (~18-19 GB) |
| `openai/gpt-oss-20b` | 21B / 3.6B active MoE | 128K | Apache-2.0 | ~16 GB native MXFP4; reasoning-effort control |
| `Qwen/Qwen3-8B` | 8.2B | 128K (YaRN) | Apache-2.0 | 24 GB in BF16 with room for context |
| `Qwen/Qwen3-14B` | 14.8B | 128K | Apache-2.0 | 24 GB via FP8 (Ada+) or AWQ 4-bit |
| `google/gemma-3-27b-it` (QAT int4) | 27B | 128K | Gemma | 24 GB at QAT int4; text+image |
| `microsoft/Phi-4-reasoning-plus` | 14B | 32K | MIT | STEM reasoning; note the short 32K context |
| `mistralai/Mistral-Small-3.2-24B-Instruct-2506` | 24B | 128K | Apache-2.0 | 24-32 GB at 4-bit; text+image |
| `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` | 8B | 128K | MIT | Best small reasoning distill |
| `HuggingFaceTB/SmolLM3-3B` | 3B | 128K (YaRN) | Apache-2.0 | Fully open; edge and tiny cards |
MoE models with a small active count (Qwen3-30B-A3B, gpt-oss-20b) are the value picks: they carry big-model quality but activate only a few experts per token, so per-token compute stays low. Every expert still has to sit in VRAM, so it is the total parameter count that must fit, which both models do on a 24 GB card at 4-bit.
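To make the MoE trade-off concrete, here is a minimal sketch that separates the two costs: total parameters drive the weight footprint, active parameters drive per-token compute. The parameter counts come from the table above. The 0.5 bytes/param figure is the idealized 4-bit value, which is why it lands below the table's ~18-19 GB (scales and unquantized layers add overhead). The "2 FLOPs per active parameter" rule of thumb is an assumption of this sketch, not a figure from the model cards.

```python
# Sketch: total params set the VRAM bill, active params set per-token compute.
# Both helper names are illustrative, not part of any library.

def weight_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 bytes) for total_params_b billion parameters."""
    return total_params_b * bytes_per_param


def gflops_per_token(active_params_b: float) -> float:
    """Rough forward-pass cost, assuming ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params_b


# (total billions, active billions); a dense model activates everything.
models = {
    "Qwen3-30B-A3B": (30.5, 3.3),
    "gpt-oss-20b": (21.0, 3.6),
    "Qwen3-14B (dense)": (14.8, 14.8),
}

for name, (total_b, active_b) in models.items():
    print(
        f"{name}: ~{weight_gb(total_b, 0.5):.1f} GB weights at 4-bit (idealized), "
        f"~{gflops_per_token(active_b):.1f} GFLOPs/token"
    )
```

The 30B MoE needs about twice the weight memory of the dense 14B but, under this approximation, less than a quarter of its per-token compute.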
2. Budget the VRAM
Two consumers of memory, weights and KV cache:
- Weights = params x bytes/param. BF16 is 2 B, FP8/INT8 is 1 B, 4-bit (AWQ/GPTQ/NVFP4/MXFP4) is about 0.5 B plus overhead. So a 14B model is roughly 28 GB in BF16, 14 GB at FP8, 7-8 GB at 4-bit; a 32B is roughly 64 / 32 / 16-18 GB.
- KV cache = `2 (K,V) x layers x kv_heads x head_dim x seq_len x bytes x batch`. GQA shrinks `kv_heads`; `--kv-cache-dtype fp8` halves `bytes`. Worked example: Qwen3-8B (36 layers, 8 KV heads, head_dim 128) is about 144 KiB per token, so a 32K context is roughly 4.5 GB in FP16, about 2.3 GB at FP8 KV. See KV cache management for the concurrency math.
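The two formulas above can be sketched as a pair of small helpers. This is illustrative arithmetic only: the function names are made up for this example, and the result ignores quantization scales, activations, and allocator overhead.

```python
# Sketch of the weight and KV-cache formulas from the list above.

GIB = 2**30


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights = params x bytes/param, in GB (1e9 bytes)."""
    return params_billion * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """2 (K,V) x layers x kv_heads x head_dim x bytes, for a single token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                 batch: int = 1, bytes_per_elem: int = 2) -> float:
    """Full KV-cache size in GiB for `batch` sequences of `seq_len` tokens."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * seq_len * batch / GIB


# Worked example from the text: Qwen3-8B, 36 layers, 8 KV heads, head_dim 128.
per_token = kv_bytes_per_token(36, 8, 128)                  # FP16/BF16 KV
fp16_32k = kv_cache_gib(36, 8, 128, 32768)                  # one 32K sequence
fp8_32k = kv_cache_gib(36, 8, 128, 32768, bytes_per_elem=1)  # FP8 KV

print(f"{per_token / 1024:.0f} KiB/token, {fp16_32k:.2f} GiB FP16, {fp8_32k:.2f} GiB FP8")
print(f"14B weights: {weights_gb(14, 2):.0f} GB BF16, {weights_gb(14, 1):.0f} GB FP8, "
      f"{weights_gb(14, 0.5):.0f} GB at 4-bit (before overhead)")
```

Multiplying `batch` into the same call gives the worst case where every concurrent sequence reaches the full context length.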
vLLM fills the card up to `--gpu-memory-utilization` (default 0.9) with weights, KV cache, activations, and CUDA-graph capture. Leave headroom, and cap context (`--max-model-len`) to what you actually serve, since KV cache scales with it.
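A rough way to see how much KV cache is left after weights is to subtract everything else from the utilization budget. The sketch below assumes the card's nominal VRAM can be treated as GiB and uses a placeholder 2 GiB for activations and CUDA-graph capture. That overhead figure is a guess for illustration, not a measured value, and `kv_token_budget` is a hypothetical helper, not a vLLM function.

```python
# Sketch: how many KV-cache tokens fit once weights and overhead are subtracted.

GIB = 2**30


def kv_token_budget(vram_gib: float, utilization: float, weights_gib: float,
                    overhead_gib: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit under the utilization cap (0 if nothing is left)."""
    kv_bytes = (vram_gib * utilization - weights_gib - overhead_gib) * GIB
    if kv_bytes <= 0:
        return 0
    return int(kv_bytes // kv_bytes_per_token)


# Qwen3-8B in BF16 on a 24 GB card: 8.2B params x 2 bytes is about 15.3 GiB of weights,
# and 147456 bytes of KV per token (36 layers, 8 KV heads, head_dim 128, FP16 KV).
tokens = kv_token_budget(
    vram_gib=24, utilization=0.90, weights_gib=15.3,
    overhead_gib=2.0,               # assumed placeholder, validate on your card
    kv_bytes_per_token=147456,
)
print(f"about {tokens} KV tokens to share across all concurrent sequences")
```

Under these assumptions the card holds roughly one 32K-token sequence, or several shorter ones; vLLM logs the real figure at startup, which is the number to trust.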
3. Choose quantization by GPU generation
Consumer cards differ by what low precision the silicon supports, which decides the quant.
| Card | Arch | VRAM | Use |
|---|---|---|---|
| RTX 3090 | Ampere (SM 8.6) | 24 GB | AWQ or GPTQ 4-bit (no hardware FP8; FP8 checkpoints fall back to slow W8A16 Marlin) |
| RTX 4090 | Ada (SM 8.9) | 24 GB | FP8 for a 14B, AWQ/GPTQ 4-bit for 24-32B |
| RTX 5090 | Blackwell (SM 12.0) | 32 GB | FP8 or NVFP4; MXFP4 for gpt-oss |
| RTX PRO 6000 Blackwell | Blackwell | 96 GB | FP8/NVFP4 with long context, or several models side by side |
vLLM auto-detects the quant from a pre-quantized repo, so --quantization is optional; set it to force a kernel (awq_marlin, gptq_marlin, fp8). Notes: GGUF is llama.cpp's format and vLLM's support is experimental and under-optimised, so serve GGUF with llama.cpp/Ollama and use AWQ/GPTQ/FP8 repos with vLLM. bitsandbytes 4-bit works but is slow; prefer AWQ/GPTQ. See quantization for inference.
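The table's logic can be written down as a lookup on CUDA compute capability. This is a sketch of the table above, not vLLM's own selection code: `recommend_quant` and the format labels it returns are illustrative names, and the thresholds cover only the consumer cards listed here. On a real machine the `(major, minor)` tuple would typically come from something like `torch.cuda.get_device_capability()`.

```python
# Sketch: map a compute capability to the quantization families the table suggests.

def recommend_quant(compute_capability: tuple) -> list:
    """Return preferred weight formats for a (major, minor) compute capability."""
    cc = tuple(compute_capability)
    if cc >= (12, 0):   # consumer Blackwell: RTX 5090, RTX PRO 6000 Blackwell
        return ["fp8", "nvfp4", "mxfp4"]
    if cc >= (8, 9):    # Ada (RTX 4090): hardware FP8, 4-bit for larger models
        return ["fp8", "awq", "gptq"]
    if cc >= (8, 0):    # Ampere (RTX 3090 is 8.6): no hardware FP8
        return ["awq", "gptq"]
    return []           # older cards are outside the scope of this table


for name, cc in [("RTX 3090", (8, 6)), ("RTX 4090", (8, 9)), ("RTX 5090", (12, 0))]:
    print(f"{name} (SM {cc[0]}.{cc[1]}): {', '.join(recommend_quant(cc))}")
```

Tuple comparison keeps the thresholds readable: `(8, 6) >= (8, 9)` is false, so the 3090 falls through to the 4-bit-only branch.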
4. Launch on one GPU
Reference templates. Pin the vLLM image; validate on your card.
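Qwen3-30B-A3B (best 24 GB pick, 4-bit). The final `--kv-cache-dtype fp8` flag stretches context on Ada/Blackwell; omit it on a 3090:

```bash
# awq_marlin expects a pre-quantized AWQ checkpoint; it does not quantize a BF16 repo on the fly.
# Substitute the AWQ build of this model that you have validated for the model ID below.
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8
```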
gpt-oss-20b (native MXFP4, about 16 GB; runs on Ada, native MXFP4 on Blackwell):

```bash
vllm serve openai/gpt-oss-20b \
  --max-model-len 32768 \
  --enable-auto-tool-choice --tool-call-parser openai
```
A 14B dense via AWQ on a 24 GB card, with the thinking and tool parsers wired:

```bash
vllm serve Qwen/Qwen3-14B-AWQ \
  --quantization awq_marlin \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes
```
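If the launch settings live in a deployment script rather than a shell history, one option is to assemble the command from explicit parameters so optional flags (such as FP8 KV cache on a 3090) are simply left out. The helper below is a hypothetical convenience wrapper, not a vLLM API; it only emits flags that already appear in the templates above.

```python
# Sketch: build a `vllm serve` argv list from explicit settings.
import shlex


def vllm_serve_argv(model, *, max_model_len, max_num_seqs=None,
                    gpu_memory_utilization=None, quantization=None,
                    kv_cache_dtype=None, extra=()):
    """Return the argv list for `vllm serve`, skipping any flag left as None."""
    argv = ["vllm", "serve", model, "--max-model-len", str(max_model_len)]
    optional = [
        ("--quantization", quantization),
        ("--max-num-seqs", max_num_seqs),
        ("--gpu-memory-utilization", gpu_memory_utilization),
        ("--kv-cache-dtype", kv_cache_dtype),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv + list(extra)


# The 14B AWQ template, with no --kv-cache-dtype so it also suits a 3090.
cmd = vllm_serve_argv(
    "Qwen/Qwen3-14B-AWQ",
    quantization="awq_marlin",
    max_model_len=16384,
    max_num_seqs=32,
    gpu_memory_utilization=0.92,
    extra=["--reasoning-parser", "qwen3",
           "--enable-auto-tool-choice", "--tool-call-parser", "hermes"],
)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues.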
5. The memory-constrained knobs
When you are a few hundred MB short, in order of preference:
| Flag | Effect |
|---|---|
| `--max-model-len` | Lower the context cap; KV cache scales with it. The biggest lever. |
| `--kv-cache-dtype fp8` | Halve KV-cache memory (CUDA 11.8+; small accuracy hit). |
| `--max-num-seqs` | Fewer concurrent sequences, less KV cache. |
| `--quantization awq/gptq/fp8` | Shrink weights (4-bit or FP8) per your card generation. |
| `--enforce-eager` | Skip CUDA-graph capture to reclaim its VRAM, at some decode-speed cost. |
| `--cpu-offload-gb N` | Last resort: stream N GiB of weights from CPU RAM over PCIe (slow). |
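The first three knobs all act on the same product (context x concurrency x bytes per token), so their effect can be estimated before relaunching. The sketch below assumes the worst case, where every concurrent sequence grows to the full context length; vLLM shares one KV pool across sequences, so real deployments often tolerate a higher cap. `largest_context` is a hypothetical helper for illustration.

```python
# Sketch: largest --max-model-len that fits a KV budget under worst-case concurrency.

GIB = 2**30


def largest_context(kv_budget_bytes: int, kv_bytes_per_token: int,
                    max_num_seqs: int, step: int = 1024) -> int:
    """Largest context (rounded down to a multiple of `step`) such that
    `max_num_seqs` full-length sequences fit in the KV budget."""
    tokens_per_seq = kv_budget_bytes // (kv_bytes_per_token * max_num_seqs)
    return int(tokens_per_seq // step * step)


budget = 4 * GIB          # assumed KV budget left after weights and overhead
fp16_token = 147456       # Qwen3-8B bytes per token with FP16 KV
fp8_token = fp16_token // 2

baseline = largest_context(budget, fp16_token, max_num_seqs=4)
with_fp8_kv = largest_context(budget, fp8_token, max_num_seqs=4)
fewer_seqs = largest_context(budget, fp16_token, max_num_seqs=2)

print(f"baseline {baseline}, FP8 KV {with_fp8_kv}, half the sequences {fewer_seqs}")
```

Halving the bytes per token and halving the concurrency buy the same context here, which is why FP8 KV cache is usually tried before cutting `--max-num-seqs`.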
6. Smoke test

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "In one sentence, why does GQA shrink the KV cache?"}],
    "max_tokens": 256
  }'
```
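The same smoke test can be run from Python with only the standard library, which is handy in a CI check. This sketch mirrors the curl call above; it assumes the server listens on `localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` shape. `build_chat_request` and `first_reply` are illustrative helper names.

```python
# Sketch: stdlib-only version of the curl smoke test.
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000",
    "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "In one sentence, why does GQA shrink the KV cache?",
)
# With the server running, uncomment to send the request:
# print(first_reply(req))
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.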
For a thinking model, pass the reasoning parser (--reasoning-parser qwen3 / deepseek_r1 / gemma4) so the reasoning content is separated from the answer; --enable-reasoning no longer exists. For gpt-oss, use the Harmony format and set reasoning effort in the system prompt.
Failure modes
- OOM at load or first long request. Weights plus KV cache exceed VRAM. Lower `--max-model-len`, add `--kv-cache-dtype fp8`, drop `--max-num-seqs`, quantize weights, or `--enforce-eager`, in that order (inference KV-cache OOM).
- FP8 checkpoint slow on a 3090. Ampere has no FP8 tensor cores, so vLLM falls back to weight-only Marlin. Use an AWQ or GPTQ 4-bit repo instead.
- GGUF underperforms or errors on vLLM. GGUF support is experimental; serve GGUF with llama.cpp and use AWQ/GPTQ/FP8 with vLLM.
- Two GeForce cards slower than one. GeForce has no NVLink and disabled PCIe P2P, so `--tensor-parallel-size 2` across two 4090/5090 cards is bandwidth-bound. Prefer one bigger card (RTX PRO 6000 96 GB) for models that need more VRAM.
- No MIG on GeForce. You cannot partition a 4090/5090; run one vLLM instance per whole GPU. See dynamic and fractional GPU sharing for software sharing options and their limits.
- Blackwell falls back to slow kernels. The 5090 and RTX PRO 6000 need a recent driver, CUDA 12.8+, and a vLLM build with Blackwell kernels (often FlashInfer) for FP8/NVFP4/MXFP4 to engage.
- Thermals and power. A 5090 (~575 W) or 4090 (~450 W) pins the card under sustained decode. Ensure PSU and airflow headroom; a modest `nvidia-smi -pl` power cap tames heat for a small throughput cost (power and thermal tuning).
References
- vLLM quantization overview: https://docs.vllm.ai/en/latest/features/quantization/
- vLLM FP8 (W8A8, hardware support): https://docs.vllm.ai/en/stable/features/quantization/fp8/
- vLLM GGUF (experimental): https://docs.vllm.ai/en/latest/features/quantization/gguf.html
- vLLM quantized KV cache: https://docs.vllm.ai/en/latest/features/quantization/quantized_kvcache/
- vLLM conserving memory (offload, eager, KV dtype): https://docs.vllm.ai/en/latest/configuration/conserving_memory/
- vLLM engine args (defaults): https://docs.vllm.ai/en/stable/configuration/engine_args/
- vLLM reasoning outputs and tool parsers: https://docs.vllm.ai/en/latest/features/reasoning_outputs/
- vLLM gpt-oss recipe (MXFP4, Harmony): https://docs.vllm.ai/projects/recipes/en/stable/OpenAI/GPT-OSS.html
- Model cards: https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 · https://huggingface.co/openai/gpt-oss-20b · https://huggingface.co/google/gemma-3-27b-it · https://huggingface.co/microsoft/Phi-4-reasoning-plus · https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506 · https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Related: Open-weight serving · Inference serving · Generic vLLM recipe · Quantization for inference · KV cache management · RTX consumer & workstation GPUs · Dynamic & fractional GPU sharing · Glossary