LLM inference efficiency: the convergence map
Scope: an orienting map of LLM serving efficiency, following the six-part structure of Alex Smola's MLSS 2026 tutorial "Efficiency in LLMs": prefill versus decode economics, the hardware physics, the serving layer, weight compression, and KV compression. Each pillar here is a short synthesis that routes to the focused pages: roofline and arithmetic intensity, continuous batching internals, KV cache management, quantization for inference, and disaggregated inference. Read this page to decide which lever applies; read the linked pages to pull it.
The Python block below is executed and asserted (pure Python); every number quoted about it matches the printed output. The tutorial's own figures were, per its author, "verified in June 2026 (and probably wrong by December)": treat the hardware ratios as direction, not datasheet values, and re-verify before capacity decisions.
flowchart TB
M["Models: MoE sparsity, GQA, quantization-friendly weights"] --> C["Convergent serving efficiency"]
HW["Hardware: bandwidth ladder, FP8/FP4, unified memory"] --> C
A["Algorithms: batching, paging, prefix caching, PD split"] --> C
C --> P1["Pillar 1-2: prefill vs decode roofline"]
C --> P2["Pillar 3: serving layer (vLLM, SGLang)"]
C --> P3["Pillar 4: weight compression (MoE, 4-bit)"]
C --> P4["Pillar 5: KV compression"]
Overview
Smola's tutorial frames serving efficiency as a convergent evolution of models, hardware, and algorithms: model architectures moved toward sparsity (MoE) and smaller KV footprints (GQA's shared KV heads, MLA's latent compression) precisely because hardware bandwidth is the binding constraint, and the serving stack (continuous batching, paged KV, prefix caching, prefill/decode disaggregation) exists to keep that scarce bandwidth doing useful work. The tutorial is a six-part practitioner's guide of about 150 slides, with Qwen3 (the dense 8B and the 30B-A3B mixture of experts) at a 40k-token context as the running example throughout.
The one-sentence version of the whole map: prefill is compute-bound and decode is bandwidth-bound, so almost every technique in the stack is either raising decode's arithmetic intensity (batching), shrinking the bytes it must move (weight and KV compression), or separating the two phases so each runs on hardware that suits it (disaggregation). Where a workload sits on the roofline decides which of those levers pays.
Core knowledge
Prefill versus decode economics
Prefill pushes a whole prompt through the model as one large matrix multiplication per layer: thousands of tokens reuse each streamed weight, so arithmetic intensity is high and the phase is compute-bound. Decode generates one token per step per sequence: the entire weight set streams from HBM to produce a handful of new tokens, so intensity collapses and the phase is bandwidth-bound. The roofline plot, in Smola's phrasing, tells you which of the two you are fighting.
The executed example below computes both placements from first principles for Qwen3-8B (real config values: hidden 4096, 36 layers, 32 query and 8 KV heads, head dim 128, MLP intermediate 12288, vocab 151936) on H100 SXM public specs (about 989 TFLOP/s dense bf16, 3.35 TB/s HBM3, so the ridge point is 295.2 FLOP/byte). The traffic model is deliberately first-order: weights stream once per step, KV reads scale with batch, activations are ignored.
# ai_phase.py -- validated: prefill is compute-bound, decode is bandwidth-bound. pure Python.
# Qwen3-8B config (huggingface.co/Qwen/Qwen3-8B config.json): hidden 4096, 36 layers,
# 32 query / 8 KV heads, head_dim 128, MLP intermediate 12288, vocab 151936, untied lm_head.
H, LAYERS, HEADS, KV_HEADS, HDIM, FFN, VOCAB = 4096, 36, 32, 8, 128, 12288, 151936
BYTES = 2  # bf16 weights and KV
PEAK_FLOPS, PEAK_BW = 989e12, 3.35e12  # H100 SXM: dense bf16, HBM3
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOP/byte where the roofs meet


def matmul_params() -> float:
    """Weight-matrix parameters touched by every forward step (embedding lookup excluded)."""
    attn = H * HEADS * HDIM + 2 * H * KV_HEADS * HDIM + HEADS * HDIM * H  # q, k, v, o
    mlp = 3 * H * FFN  # gate, up, down
    return float(LAYERS * (attn + mlp) + VOCAB * H)  # + untied lm_head


def decode_ai(batch: int, kv_len: int) -> float:
    """FLOP/byte for one decode step. Weights stream once per step; KV reads scale with batch."""
    w = matmul_params()
    flops_tok = 2 * w + LAYERS * 4 * HEADS * HDIM * kv_len  # matmuls + QK^T and AV
    kv_read = LAYERS * 2 * kv_len * KV_HEADS * HDIM * BYTES  # K and V for one sequence
    return batch * flops_tok / (w * BYTES + batch * kv_read)


def prefill_ai(chunk: int) -> float:
    """FLOP/byte for a from-scratch prefill chunk (weight traffic + causal attention + KV write)."""
    w = matmul_params()
    flops = chunk * 2 * w + LAYERS * 4 * HEADS * HDIM * (chunk * (chunk + 1) // 2)
    kv_write = LAYERS * 2 * chunk * KV_HEADS * HDIM * BYTES
    return flops / (w * BYTES + kv_write)


w_gb = matmul_params() * BYTES / 1e9
kv_gb = LAYERS * 2 * 40960 * KV_HEADS * HDIM * BYTES / 1e9
floor_ms = (w_gb + kv_gb) / (PEAK_BW / 1e9) * 1e3  # bandwidth floor on B=1 per-token latency
attn_ai = (4 * HEADS * HDIM) / (2 * KV_HEADS * HDIM * BYTES)  # attention FLOPs per KV byte
asymptote = decode_ai(10**9, 40960)  # batch so large weights are fully amortized

print(f"ridge point : {RIDGE:8.1f} FLOP/byte")
print(f"prefill AI (T=2048) : {prefill_ai(2048):8.1f} FLOP/byte")
print(f"decode AI (B=1, 4k) : {decode_ai(1, 4096):8.2f} FLOP/byte")
print(f"decode AI (B=8, 4k) : {decode_ai(8, 4096):8.2f} FLOP/byte")
print(f"decode AI (B=32, 4k) : {decode_ai(32, 4096):8.2f} FLOP/byte")
print(f"decode AI (B=32, 40k) : {decode_ai(32, 40960):8.2f} FLOP/byte")
print(f"attention AI (GQA 32/8): {attn_ai:8.2f} FLOP/byte; large-batch 40k asymptote {asymptote:.2f}")
print(f"weights {w_gb:.1f} GB + 40k KV {kv_gb:.1f} GB = {w_gb + kv_gb:.1f} GB/step -> >= {floor_ms:.1f} ms/token at {PEAK_BW/1e12:.2f} TB/s")

assert prefill_ai(2048) > RIDGE, "prefill chunk must sit above the ridge (compute-bound)"
assert decode_ai(1, 40960) < RIDGE / 10, "single-stream decode must sit far below the ridge"
assert decode_ai(8, 4096) > decode_ai(1, 4096), "batching must raise decode intensity"
assert decode_ai(32, 40960) < decode_ai(32, 4096), "at fixed batch, longer KV must lower intensity"
assert attn_ai == HEADS / KV_HEADS, "attention AI must equal the GQA ratio exactly"

gains = [decode_ai(2 * b, 40960) - decode_ai(b, 40960) for b in (1, 32)]
assert gains[1] < gains[0], "batching gains must shrink once KV traffic dominates"

for b in (1, 32, 4096):  # adversarial: no batch size crosses the KV asymptote
    flops_tok = 2 * matmul_params() + LAYERS * 4 * HEADS * HDIM * 40960
    kv_read = LAYERS * 2 * 40960 * KV_HEADS * HDIM * BYTES
    assert decode_ai(b, 40960) < flops_tok / kv_read, "AI must stay below FLOPs/KV-bytes"

assert asymptote < RIDGE / 40, "even infinite batching leaves 40k decode >40x under the ridge"
print("all assertions passed")
Executed output, and what each line means:
ridge point : 295.2 FLOP/byte
prefill AI (T=2048) : 2088.1 FLOP/byte
decode AI (B=1, 4k) : 1.12 FLOP/byte
decode AI (B=8, 4k) : 7.03 FLOP/byte
decode AI (B=32, 4k) : 16.30 FLOP/byte
decode AI (B=32, 40k) : 6.03 FLOP/byte
attention AI (GQA 32/8): 4.00 FLOP/byte; large-batch 40k asymptote 6.51
weights 15.1 GB + 40k KV 6.0 GB = 21.2 GB/step -> >= 6.3 ms/token at 3.35 TB/s
all assertions passed
A 2048-token prefill chunk sits at 2088.1 FLOP/byte, about seven times the 295.2 ridge: compute-bound. Single-stream decode at a 4k context sits at 1.12 FLOP/byte, roughly 260x below the ridge: the ALUs mostly idle while 15.1 GB of weights stream past. At a 40k context a single-stream step moves 21.2 GB (weights plus KV), so at 3.35 TB/s it cannot beat 6.3 ms/token no matter how fast the ALUs are. Batching amortizes the weight stream (1.12 to 7.03 to 16.30 as batch goes 1 to 8 to 32 at 4k), which is why continuous batching is the single biggest serving win. But KV reads do not amortize: at a 40k context the same 32-way batch drops back to 6.03, and no batch size can pass the 6.51 asymptote set by FLOPs per KV byte.
Two structural facts fall out of the asserts. First, with bf16 KV, attention's intensity equals the GQA ratio exactly (32/8 = 4.00 FLOP/byte), so every KV head removed raises the FLOPs done per byte read, which is why architectures shrink KV heads. Second, once context is long the memory bus belongs to the KV cache, which is Smola's motivation for the whole KV-compression pillar.
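The placements above can be turned into throughput ceilings with the standard roofline bound, attainable FLOP/s = min(peak compute, intensity x peak bandwidth). The sketch below is a minimal illustration under the same first-order assumptions; the helper names `attainable_flops` and `bound` are hypothetical, and the intensity values are copied from the executed output rather than recomputed.

```python
# Roofline sketch: attainable FLOP/s = min(compute roof, intensity * bandwidth).
# H100 SXM public specs as in the example; AI values copied from its printed output.
PEAK_FLOPS, PEAK_BW = 989e12, 3.35e12
RIDGE = PEAK_FLOPS / PEAK_BW


def attainable_flops(ai: float, peak_flops: float = PEAK_FLOPS, peak_bw: float = PEAK_BW) -> float:
    """Upper bound on sustained FLOP/s at arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_flops, ai * peak_bw)


def bound(ai: float) -> str:
    """Which roof limits a placement: compute at or above the ridge, bandwidth below it."""
    return "compute-bound" if ai >= RIDGE else "bandwidth-bound"


placements = {
    "prefill T=2048": 2088.1,
    "decode B=1, 4k": 1.12,
    "decode B=32, 4k": 16.30,
    "decode B=32, 40k": 6.03,
}
for name, ai in placements.items():
    frac = attainable_flops(ai) / PEAK_FLOPS
    print(f"{name:18s} {bound(ai):16s} <= {frac:6.1%} of peak FLOP/s")

# Doubling the compute roof leaves a bandwidth-bound placement unchanged.
assert attainable_flops(1.12, peak_flops=2 * PEAK_FLOPS) == attainable_flops(1.12)
```

Under this bound, single-stream decode at 4k can use well under one percent of peak compute, and the final assert shows why extra TFLOPs alone do not help it.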
Deep dives: roofline and arithmetic intensity for the general model, goodput for turning placement into SLO math.
The hardware physics
The tutorial's hardware section is about ratios, not SKUs. Compute grows roughly 4x per hardware generation while memory bandwidth only doubles, so the ridge point moves right every generation and decode falls further behind. Energy tells the same story: a single DRAM access costs about as much energy as 500 multiplies, so moving a byte is the expensive act and the FP8 and FP4 number formats are as much bandwidth tools as compute tools. The bandwidth ladder (HBM on datacenter parts, GDDR on workstations, LPDDR unified memory on desktops) sets what each tier can serve: Smola's at-home angle is that a DGX Spark, an AMD Strix Halo box, or an Apple Silicon Mac with plenty of unified memory is a surprisingly good decode machine, because single-stream decode needs capacity and bandwidth, not TFLOPs. See DGX Spark and the GPU memory hierarchy.
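A rough way to see both effects is to extrapolate the ridge point under the quoted growth ratios and to divide bandwidth by bytes per step for a single-stream decode ceiling. The sketch below is only illustrative: the later "generations" are a hypothetical extrapolation from the H100 SXM figures, and the bandwidth tiers are assumed round numbers, not the specifications of any named product.

```python
# Ridge drift under the quoted ratios: compute x4, bandwidth x2 per generation.
# Starting point is the H100 SXM ridge from the example; later generations are
# a hypothetical extrapolation, not datasheet values.
COMPUTE_GROWTH, BANDWIDTH_GROWTH = 4.0, 2.0
RIDGE_0 = 989e12 / 3.35e12


def ridge_after(generations: int) -> float:
    """Ridge point (FLOP/byte) after `generations` steps of the quoted growth ratios."""
    return RIDGE_0 * (COMPUTE_GROWTH / BANDWIDTH_GROWTH) ** generations


def decode_tokens_per_s(bandwidth_gb_s: float, bytes_per_step_gb: float) -> float:
    """Bandwidth ceiling on single-stream decode: one full byte stream per token."""
    return bandwidth_gb_s / bytes_per_step_gb


for g in range(4):
    print(f"generation +{g}: ridge ~ {ridge_after(g):7.1f} FLOP/byte")

# Illustrative bandwidth tiers (assumed round numbers, not specific products).
STEP_GB = 15.1  # Qwen3-8B bf16 weight stream from the example, KV traffic ignored
tiers = {"HBM-class": 3350.0, "GDDR-class": 1000.0, "LPDDR unified": 250.0}
for tier, bw in tiers.items():
    print(f"{tier:14s} {bw:6.0f} GB/s -> <= {decode_tokens_per_s(bw, STEP_GB):6.1f} tok/s")
```

With compute growing twice as fast as bandwidth, the ridge doubles each step, so a workload that is bandwidth-bound today sits proportionally further below the ridge on the next part.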
The serving layer
The serving pillar is the algorithmic response to decode's economics, and the stack has converged hard across engines: continuous batching keeps the weight stream amortized every iteration; PagedAttention removes KV fragmentation so more sequences fit (KV cache management); prefix caching and RadixAttention reuse shared prompt prefixes instead of recomputing them; and prefill/decode disaggregation splits the compute-bound and bandwidth-bound phases onto separate hardware pools (disaggregated inference). Smola's one-liner for the caching economics: your API bill is probably a cache-hit-rate report. The named tools are vLLM and SGLang; inference serving and serving open-weight models cover engine selection, and speculative decoding attacks the same decode bottleneck by spending spare compute to move more than one token per weight stream.
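To make the cache-hit-rate point concrete, here is a toy block-level prefix cache. It is a sketch of the general idea only (hash fixed-size token blocks, chain each hash to its parent so a block matches only after an identical prefix); the class `PrefixCache`, the block size of 16, and the token IDs are all hypothetical, and this is not how vLLM or SGLang implement it.

```python
# Toy block-level prefix cache. All names are hypothetical; this is not the vLLM or
# SGLang implementation, only the idea of reusing KV blocks for a shared prompt prefix.
from typing import Sequence

BLOCK = 16  # tokens per cacheable block (assumed)


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: set[int] = set()
        self.hit_tokens = 0
        self.total_tokens = 0

    def admit(self, tokens: Sequence[int]) -> int:
        """Return the number of leading tokens already cached, then cache the rest."""
        key, cached, matching = 0, 0, True
        full = len(tokens) - len(tokens) % BLOCK  # only whole blocks are cacheable
        for start in range(0, full, BLOCK):
            # Chain the parent key so a block only matches after an identical prefix.
            key = hash((key, tuple(tokens[start:start + BLOCK])))
            if matching and key in self.blocks:
                cached += BLOCK
            else:
                matching = False
                self.blocks.add(key)
        self.hit_tokens += cached
        self.total_tokens += len(tokens)
        return cached

    def hit_rate(self) -> float:
        """Fraction of all admitted prompt tokens that skipped prefill."""
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


system_prompt = list(range(64))  # 64 shared tokens = 4 full blocks
cache = PrefixCache()
first = cache.admit(system_prompt + [1001] * 20)
second = cache.admit(system_prompt + [2002] * 20)
print(first, second, f"{cache.hit_rate():.2f}")  # 0 64 0.38
```

The second request skips prefill for its 64 shared tokens and pays only for the 20-token suffix, which is the sense in which prefill cost tracks misses rather than requests.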
Weight compression
Two mechanisms shrink the 15.1 GB weight stream in the example above. Mixture of experts is, in the tutorial's framing, "do not touch most of the weights": Qwen3-30B-A3B holds 30B parameters but activates about 3B per token, so decode streams a fraction of the model (MoE sparse scaling, expert parallelism). Quantization then shrinks the bytes that are touched, down to 4-bit in current practice (quantization for inference). The tutorial adds a neat information-theoretic observation: the exponents of trained weights are almost losslessly compressible to about 4.7 bits each, which says trained weight distributions carry far less entropy than their storage format allots and explains why aggressive formats keep working.
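The arithmetic behind both mechanisms is simple enough to sketch: bytes streamed per single-sequence decode step is parameters touched times bits per weight, and dividing by bandwidth gives a latency floor. The figures below are first-order assumptions: the MoE active count is rounded to 3B, quantization scales and zero-points are ignored, KV traffic is left out, and the helper names are hypothetical.

```python
# Bytes streamed per single-sequence decode step and the resulting bandwidth floor.
# Rounded parameter counts; quantization metadata and KV traffic are ignored.
PEAK_BW = 3.35e12  # H100 SXM HBM3, bytes/s
DENSE_PARAMS = 7_568_097_280  # Qwen3-8B matmul parameters (15.1 GB at bf16)
MOE_ACTIVE_PARAMS = 3e9  # Qwen3-30B-A3B, approximate active parameters per token


def stream_gb(params_touched: float, bits_per_weight: float) -> float:
    """Gigabytes of weights read to produce one token for one sequence."""
    return params_touched * bits_per_weight / 8 / 1e9


def floor_ms(gb: float) -> float:
    """Lower bound on per-token latency from weight traffic alone."""
    return gb * 1e9 / PEAK_BW * 1e3


cases = {
    "dense 8B, bf16": stream_gb(DENSE_PARAMS, 16),
    "dense 8B, 4-bit": stream_gb(DENSE_PARAMS, 4),
    "MoE 3B active, bf16": stream_gb(MOE_ACTIVE_PARAMS, 16),
    "MoE 3B active, 4-bit": stream_gb(MOE_ACTIVE_PARAMS, 4),
}
for name, gb in cases.items():
    print(f"{name:22s} {gb:6.2f} GB/step -> >= {floor_ms(gb):5.2f} ms/token")
```

The MoE rows hold for a single sequence only: in a batch, different tokens can route to different experts, so the weights touched per step grow toward the full model as batch size rises.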
KV compression
Once the weight stream is compressed and batched, the KV cache is, in Smola's phrase, the other thing that saturates your memory bus once the context gets long: the executed example puts a 40k-context KV read at 6.0 GB per sequence per step against 15.1 GB of weights, and the 6.51 FLOP/byte asymptote means batched long-context decode is KV-bound, not weight-bound. The levers are architectural (GQA shrinks KV heads; MLA compresses the cache into latents, see FlashAttention and MLA), numeric (KV quantization), and structural (paging and eviction in KV cache management, cross-node reuse via KV transfer).
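The architectural and numeric levers can be compared with a per-token KV size formula, 2 x layers x KV heads x head dim x bytes per element. The sketch below applies it to the Qwen3-8B config at a 40k context; the 32-KV-head row is a hypothetical what-if rather than a shipped model, the 80 GB capacity is the H100 SXM figure, and activations, fragmentation, and runtime overhead are ignored, so real limits will be lower.

```python
# KV cache sizing sketch for Qwen3-8B (config values from the executed example).
# 80 GB is the H100 SXM HBM capacity; activations, fragmentation, and runtime
# overhead are ignored, so real limits are lower.
LAYERS, HEADS, KV_HEADS, HDIM = 36, 32, 8, 128
WEIGHT_BYTES = 15_136_194_560  # bf16 weight stream, 15.1 GB
HBM_BYTES = 80e9


def kv_bytes_per_token(kv_heads: int, bytes_per_elem: float) -> float:
    """K and V for one token across all layers."""
    return 2 * LAYERS * kv_heads * HDIM * bytes_per_elem


def max_sequences(context: int, kv_heads: int, bytes_per_elem: float) -> int:
    """Whole sequences of `context` tokens that fit beside the weights."""
    per_seq = kv_bytes_per_token(kv_heads, bytes_per_elem) * context
    return int((HBM_BYTES - WEIGHT_BYTES) // per_seq)


variants = {
    "MHA-style, 32 KV heads, bf16 (hypothetical)": (HEADS, 2),
    "GQA, 8 KV heads, bf16 (as shipped)": (KV_HEADS, 2),
    "GQA, 8 KV heads, 8-bit KV": (KV_HEADS, 1),
}
for name, (kv_heads, nbytes) in variants.items():
    gb = kv_bytes_per_token(kv_heads, nbytes) * 40960 / 1e9
    fit = max_sequences(40960, kv_heads, nbytes)
    print(f"{name:45s} {gb:5.1f} GB/seq at 40k, fits {fit} sequences")
```

Each halving of KV bytes roughly doubles the number of long-context sequences that fit, and it halves the KV traffic per decode step by the same factor.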
Don't-miss checklist
- Identify the phase before the fix: prefill and decode sit on opposite sides of the ridge, and a fix for the wrong side does nothing.
- Compute the ridge point for the actual SKU (peak FLOPs over peak bandwidth), not a remembered one; it moves every generation.
- Track prefix-cache hit rate as a first-class serving metric; it is a direct multiplier on cost.
- Size batched long-context decode against the KV asymptote (FLOPs per KV byte), not against the weight stream.
- Prefer architectures with small KV footprints (GQA ratio, MLA) when long context is the product.
- Re-verify tutorial-era hardware ratios before purchase or capacity decisions; the source itself expects them to drift within months.
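The KV-asymptote item can be made operational. In the first-order traffic model, decode intensity is AI(B) = B·F / (W + B·K), with F the FLOPs per token, W the weight bytes, and K the KV bytes per sequence, so the fraction of the asymptote F/K reached at batch B is B·K / (W + B·K). Solving for B gives the smallest batch that reaches a target fraction. The helper name `batch_for_fraction` is hypothetical, and the constants are the Qwen3-8B bf16 values used elsewhere on this page.

```python
# Smallest batch that reaches a given fraction of the KV asymptote, from
# AI(B) / (F/K) = B*K / (W + B*K)  =>  B = f / (1 - f) * W / K.
import math

W_BYTES = 15_136_194_560  # Qwen3-8B bf16 weight stream (15.1 GB)
KV_BYTES_PER_TOKEN = 147_456  # 2 * 36 layers * 8 KV heads * 128 dims * 2 bytes


def batch_for_fraction(kv_len: int, fraction: float) -> int:
    """Batch size at which decode intensity reaches `fraction` of its asymptote."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    kv_bytes = KV_BYTES_PER_TOKEN * kv_len  # KV read per sequence per step
    return math.ceil(fraction / (1.0 - fraction) * W_BYTES / kv_bytes)


for kv_len in (4096, 40960):
    b = batch_for_fraction(kv_len, 0.9)
    print(f"context {kv_len:6d}: batch {b} reaches 90% of the asymptote")
```

At a 40k context the model says a batch of 23 already captures 90% of what batching can deliver, while at 4k it takes 226, a batch whose KV cache (about 136 GB in bf16) would not fit in 80 GB of HBM, so capacity rather than intensity is likely the limit there.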
Failure modes
- Optimizing prefill when decode-bound. Faster attention kernels do not move a workload sitting at 1.12 FLOP/byte; only batching, compression, or more bandwidth do.
- Buying compute when bandwidth-starved. A part with more TFLOPs but the same memory system moves the roof, not the diagonal; decode throughput does not change.
- Ignoring cache hit rate. Serving identical agent prompts without prefix caching re-runs compute-bound prefill on every request; the bill scales with misses.
- Quantizing without an eval gate. 4-bit weights and quantized KV are bandwidth wins, but ship them through the same quality gates as any model change (evaluation harness).
- Batching past the KV asymptote. Growing batch at long context stops paying once KV traffic dominates; the fix is KV compression or disaggregation, not more batch.
- Treating a hub number as a datasheet. The figures here are first-order traffic models and June 2026 tutorial values; validate against profiler output (Nsight workflow) before committing capacity.
Open questions and validation
- The tutorial's own caveat applies to the whole map: numbers verified June 2026, expected wrong by December. The ratios (4x compute per generation vs 2x bandwidth; 500 multiplies per DRAM access) are directionally stable but should be re-derived per platform.
- The 4.7-bits-per-exponent compressibility figure is an empirical property of trained checkpoints reported in the tutorial; whether it holds across model families and training recipes is worth testing before building a format on it.
- The first-order traffic model here ignores activations, overlap, and kernel efficiency; measured rooflines (see roofline and arithmetic intensity) sit below these bounds. The asserts validate orderings and asymptotes, not absolute throughput.
References
- Alex Smola, "Efficiency in LLMs" (MLSS 2026 tutorial announcement and summary): https://alex.smola.org/posts/45-mlss-efficiency/
- Slides, "Efficiency in LLMs" (PDF, ~47 MB): https://alex.smola.org/posts/45-mlss-efficiency/main.pdf
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv 2309.06180): https://arxiv.org/abs/2309.06180
- Zheng et al., SGLang: Efficient Execution of Structured Language Model Programs, includes RadixAttention (arXiv 2312.07104): https://arxiv.org/abs/2312.07104
- vLLM: https://github.com/vllm-project/vllm
- SGLang: https://github.com/sgl-project/sglang
- Qwen3-8B model card and config (dimensions used in the executed example): https://huggingface.co/Qwen/Qwen3-8B
- NVIDIA H100 (public specifications used for the ridge point): https://www.nvidia.com/en-us/data-center/h100/
Related: Roofline and arithmetic intensity · Inference serving · Continuous batching internals · KV cache management · Quantization for inference · Disaggregated inference · Speculative decoding · MoE sparse scaling · Glossary