Continuous batching and scheduler internals

Scope: how a modern serving scheduler (vLLM, SGLang) runs token-level continuous (in-flight) batching, admitting and retiring requests on every step; the waiting and running queues; chunked prefill, which interleaves prefill with decode; and preemption with recompute under KV pressure. This is the per-step engine loop underneath inference serving and serving open-weight models.

What it is

Continuous batching (also called in-flight batching or iteration-level scheduling) refills the batch on every token-generation iteration instead of waiting for whole sequences to finish. It evicts completed requests and pulls in new ones based on GPU readiness, so a new request can join an ongoing batch mid-generation without blocking on the longest sequence.¹ Orca introduced this mechanism as "iteration-level scheduling", and it is a major reason vLLM delivers multiples of naive-batching throughput.²

The scheduler keeps two queues. vLLM's V1 scheduler holds a waiting queue for new requests and a running queue for active ones; each engine step admits requests from waiting into running subject to available KV blocks and a per-step token budget, and returns KV blocks to the free pool on completion.³ SGLang mirrors this with a waiting_queue and a running_batch, rebuilding the batch each iteration via get_next_batch_to_run(), removing finished requests and admitting new ones when space is free.⁶

Three batching strategies, contrasted:

| Aspect | Static | Dynamic | Continuous |
| --- | --- | --- | --- |
| Trigger | fixed batch size / seq length | batch-size target or timeout | token-generation completion event |
| Latency bound | unbounded (wait for full batch) | bounded by `max_batch_delay_ms` | minimal; evicts and refills mid-batch |
| Padding | pad to longest sequence | pad within each batch | slots refilled without padding wait |
| GPU idle | poor under variable loads | better, but a timer firing on a small batch idles the GPU | keeps GPU saturated every iteration |
| Implementation | simple | timers + queue mgmt | per-token coordination and state tracking |

Source: Fregly, Ch. 16, Table 16-4.¹

```mermaid
flowchart LR
  N["New request"] --> W["waiting queue (FCFS)"]
  W -->|"admit if KV blocks + token budget"| R["running batch"]
  R -->|"one token / step"| R
  R -->|"stop condition met"| D["retire: free KV blocks"]
  R -->|"KV pressure"| P["preempt: recompute, re-queue"]
  P --> W
```

Why it matters

Decode generates one token at a time; a single sequence leaves the GPU memory-bandwidth-bound and underutilized.¹ Waiting for an entire static batch to finish wastes slots because output lengths vary wildly; the batch stalls on its longest member. Continuous batching removes that idle time by replacing each finished sequence immediately, raising throughput while holding per-request added latency to a few milliseconds.¹ Batching 4–8 requests can double or triple throughput over 1–2 request batches on large models because the math units and memory pipelines are better fed.¹
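The stall on the longest member can be made concrete with a toy step count. The following is a minimal sketch, assuming every decode step costs the same and ignoring prefill, KV limits, and the token budget; the function names and the sample output lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a finished sequence's slot is refilled on the next step."""
    waiting = deque(output_lens)
    running = []  # tokens still to generate, one entry per active sequence
    steps = 0
    while waiting or running:
        # Refill free slots before the forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step: every running sequence emits one token; finished ones leave.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (tokens) for eight requests, batch of four.
lens = [2, 16, 3, 4, 16, 2, 3, 5]
print(static_batching_steps(lens, 4), continuous_batching_steps(lens, 4))
```

With these lengths the static schedule needs 32 steps and the continuous one 18: the second long request starts at step 3 instead of waiting for the first batch to drain.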

The harder problem is head-of-line blocking: a long prefill monopolizes the step and spikes inter-token latency (ITL) for every decode in flight. Chunked prefill is the fix: it bounds prefill work per step so decodes keep progressing. Without it, p95 ITL degrades under any traffic mix that interleaves long prompts with active generations.

When it is needed (and when not)

Default on. Continuous batching is the baseline in vLLM and SGLang for any concurrent serving: chat assistants, multi-tenant APIs, agent backends. There is no production reason to disable it.
Chunked prefill: on for mixed or long-prompt traffic. In vLLM V1, chunked prefill is always enabled by default.⁴ On engines where it is optional, enable it explicitly when prompts are long or when prefill/decode interleaving causes ITL spikes.
Tune, do not disable, under tight ITL SLOs. Lower the per-step token budget for better ITL; raise it for better TTFT (see How). A single global knob trades the two.
Marginal at very low RPS / single-stream. With one request and no concurrency there is nothing to interleave; the machinery adds bookkeeping for no batching gain. It still costs nothing meaningful to leave on.
Not a substitute for disaggregation. When prefill and decode contend hard for the same GPUs at scale, split the pools (disaggregated inference) rather than relying on chunked prefill alone to hide the contention.

How: implement, integrate, maintain

The per-step loop

Each step, the scheduler (1) admits or refills from waiting, (2) runs one forward pass over the running batch, producing one token per sequence, (3) retires sequences that hit a stop condition and frees their KV blocks, and (4) preempts if KV is exhausted. vLLM V1 prioritizes decode requests already in running: for each one it computes how many new tokens to schedule and calls allocate_slots to reserve KV blocks before it schedules any prefill from waiting, then spends the remaining token budget on prefill.³
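Steps (1) to (3) of the loop can be sketched as a toy scheduler. This is an illustration only, not vLLM's or SGLang's code: `Request`, `ToyScheduler`, and the fixed per-request `kv_blocks` count are hypothetical, the forward pass is reduced to a counter decrement, and preemption (step 4) is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    kv_blocks: int    # KV blocks held while running (fixed here for simplicity)
    tokens_left: int  # decode tokens still to generate


class ToyScheduler:
    """Waiting/running queues with per-step admit, decode, and retire."""

    def __init__(self, total_blocks, max_num_seqs):
        self.free_blocks = total_blocks
        self.max_num_seqs = max_num_seqs
        self.waiting = deque()
        self.running = []

    def add_request(self, req):
        self.waiting.append(req)

    def step(self):
        # (1) Admit in FCFS order while sequence slots and KV blocks allow.
        while (self.waiting
               and len(self.running) < self.max_num_seqs
               and self.waiting[0].kv_blocks <= self.free_blocks):
            req = self.waiting.popleft()
            self.free_blocks -= req.kv_blocks
            self.running.append(req)
        # (2) Stand-in for one forward pass: one token per running sequence.
        for req in self.running:
            req.tokens_left -= 1
        # (3) Retire finished sequences and return their KV blocks to the pool.
        finished = [r for r in self.running if r.tokens_left == 0]
        self.running = [r for r in self.running if r.tokens_left > 0]
        for req in finished:
            self.free_blocks += req.kv_blocks
        return [r.rid for r in finished]


sched = ToyScheduler(total_blocks=4, max_num_seqs=2)
for r in (Request("a", 2, 1), Request("b", 2, 3), Request("c", 2, 1)):
    sched.add_request(r)
finished_per_step = [sched.step() for _ in range(3)]
print(finished_per_step)
```

Request "c" is admitted on the second step, as soon as "a" retires and frees its blocks, while "b" is still generating.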

Token budget and chunked prefill

The per-step ceiling is max_num_batched_tokens. The scheduler batches all pending decodes first, then fills the leftover budget with prefill tokens, chunking a long prefill across multiple steps so it never starves decodes.⁴ This is "stall-free scheduling": a 20K-token prompt split into 5K chunks bounds per-iteration work and smooths per-iteration latency.¹ Chunking does not reduce total attention cost: prefill self-attention is N(N+1)/2 QK dot products per layer per head, quadratic in N; chunking changes only when work becomes available to the decoder, i.e. overlap and latency, not the total. Reducing the total requires a smaller effective context or local/sparse attention (O(NW) for a fixed window W).¹
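The claim that chunking leaves the total unchanged can be checked by counting. The sketch below assumes causal attention in which each new chunk attends to the already cached prefix plus the causal triangle inside the chunk; the function names are illustrative.

```python
def causal_qk_pairs(n):
    """QK dot products for causal self-attention over n tokens (per layer, per head)."""
    return n * (n + 1) // 2


def chunk_costs(n, chunk):
    """QK dot products for each prefill chunk of an n-token prompt."""
    costs, done = [], 0
    while done < n:
        c = min(chunk, n - done)
        # Each of the c new tokens attends to the `done` cached tokens,
        # plus the causal triangle inside the chunk itself.
        costs.append(c * done + c * (c + 1) // 2)
        done += c
    return costs


costs = chunk_costs(20_000, 5_000)
print(costs, sum(costs), causal_qk_pairs(20_000))
```

The per-chunk counts grow as the cached prefix grows, yet they sum to exactly N(N+1)/2 for the full prompt, so chunking redistributes the work across steps without shrinking it.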

The budget is the ITL/TTFT lever: smaller values (e.g. 2048) give better ITL because fewer prefill tokens share each step with the decodes; higher values give better TTFT because more prefill lands per step.⁴
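The decode-first, prefill-with-the-remainder policy described above can be sketched as a per-step planner. This is a simplified model under stated assumptions: each decode costs one token, prefills are served in FCFS order, and `plan_step` with its arguments is a hypothetical helper, not an engine API.

```python
def plan_step(decode_ids, prefill_remaining, max_num_batched_tokens):
    """Return {request_id: tokens scheduled this step}.

    decode_ids: running requests in decode, one token each.
    prefill_remaining: request_id -> prompt tokens not yet prefilled,
        in FCFS (insertion) order.
    """
    budget = max_num_batched_tokens
    plan = {}
    # Pending decodes are scheduled first.
    for rid in decode_ids:
        if budget == 0:
            break
        plan[rid] = 1
        budget -= 1
    # The leftover budget goes to prefill; a long prompt gets only a chunk.
    for rid, remaining in prefill_remaining.items():
        if budget == 0:
            break
        chunk = min(remaining, budget)
        plan[rid] = chunk
        budget -= chunk
    return plan


plan = plan_step(["d1", "d2", "d3"], {"p1": 5000, "p2": 100}, 2048)
print(plan)
```

With a budget of 2048 and three decodes in flight, the 5000-token prompt receives a 2045-token chunk this step and continues on later steps; a smaller budget shrinks that chunk and with it the work each decode has to wait behind.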

```bash
# vLLM: chunked prefill is always on in V1. Tune the per-step token budget.
# Lower -> better inter-token latency; higher -> better TTFT / throughput.
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 --trust-remote-code \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 256
```

```bash
# SGLang: chunk size is the prefill budget per step (tokens).
# --enable-mixed-chunk lets prefill and decode share one batch.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 --tp 8 \
  --chunked-prefill-size 2048 \
  --enable-mixed-chunk
```

vLLM's scheduler config defaults max_num_batched_tokens to 2048, max_num_seqs to 128, enable_chunked_prefill to True, and policy to fcfs (set --scheduling-policy priority for priority-aware admission, the hook for inference QoS / admission control).⁵ Defaults shift across releases: the max_num_batched_tokens default has moved over versions (512 in early chunked-prefill builds).⁴ Pin and confirm against the installed version rather than assuming.

Preemption and recompute under KV pressure

When KV blocks run out, the scheduler cannot keep every running sequence resident. It preempts: a victim sequence is evicted from the batch, its KV is dropped, and the request is re-queued; when blocks free up, it is recomputed from its tokens (the prefill is re-run) rather than resumed.⁴ vLLM V1's default preemption mode is RECOMPUTE (not SWAP to host), because recomputation has lower overhead in the V1 architecture.⁴ A recompute preemption costs extra GPU compute and raises latency, so a steady stream of warnings like the sample log line from the book means the KV pool is undersized:

```text
WARNING ... scheduler.py:1057 Sequence group 0 is preempted by
PreemptionMode.RECOMPUTE because not enough KV cache space.
total_cumulative_preemption_cnt=1
```

Source: Fregly, Ch. 16, sample vLLM log.¹

Recommended actions when these warnings appear: raise the GPU memory-utilization threshold, reduce max_num_batched_tokens, or rely on PagedAttention's block allocation to pack KV tighter.¹ SGLang exposes priority preemption (--enable-priority-scheduling with --priority-scheduling-preemption-threshold) so high-priority requests evict low-priority ones, saving state and re-queuing the victim,⁶ and --schedule-conservativeness to bias the admission estimate away from OOM.⁶
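Recompute-style preemption can be sketched as follows. The victim choice (the most recently admitted request, at the tail of `running`) and the re-queue position (front of `waiting`) are assumptions made for illustration, and `allocate_or_preempt` is a hypothetical helper rather than an engine function.

```python
from collections import deque


def allocate_or_preempt(rid, needed, free_blocks, running, waiting, held):
    """Try to give `rid` (already in `running`) `needed` more KV blocks.

    running: request ids, earliest-admitted first.
    waiting: deque of request ids.
    held: request_id -> KV blocks currently held.
    Returns (free_blocks, preempted_ids, scheduled).
    """
    preempted = []
    while free_blocks < needed:
        victim = running.pop()                # assumed policy: newest request first
        free_blocks += held.pop(victim, 0)    # recompute mode: KV is dropped, not swapped
        waiting.appendleft(victim)            # re-queued ahead of new arrivals
        preempted.append(victim)
        if victim == rid:
            # The requester itself was evicted; its prefill is re-run later.
            return free_blocks, preempted, False
    held[rid] = held.get(rid, 0) + needed
    return free_blocks - needed, preempted, True


running = ["a", "b", "c"]
waiting = deque(["new"])
held = {"a": 4, "b": 3, "c": 3}
result = allocate_or_preempt("a", 2, 0, running, waiting, held)
print(result, running, list(waiting), held)
```

Here request "c" loses its three blocks so that "a" can grow by two, and "c" goes back to the head of the waiting queue to be prefilled again once space frees up.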

What makes it sustain ~100% utilization

vLLM's PagedAttention slices the KV cache into fixed-size pages and groups page-level compute across active sequences, interleaving large prefill GEMMs with small decode kernels so the scheduler can pack work without padding.¹ SGLang's RadixAttention uses tree-based KV grouping with lazy eviction for shared-prefix reuse, reaching similar utilization.¹ Both are open source; read the scheduler implementations directly. The block-level bookkeeping is what lets the scheduler track each request's state separately and refill slots every iteration.¹

Maintain

Watch preemption counters: vLLM increments total_cumulative_preemption_cnt; a rising count means KV starvation, not a transient. Pair with prefix-cache hit metrics (vllm:gpu_prefix_cache_queries, vllm:gpu_prefix_cache_hits).¹
Plot p50/p95/p99 latency against RPS; with batching p50 stays flat or drops as throughput rises up to an inflection point. Stay under it (the SLO/SLI catalog).¹
Treat max_num_batched_tokens / --chunked-prefill-size as SLO controls, retuned per traffic shape, not set-and-forget.
KV swap/recompute thrash shows as a copy-engine spike with low SM utilization; confirm with profiling before changing paging params (CUDA streams and concurrency).¹
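The first check above, separating a rising preemption count from a one-off, can be sketched as a rate over sampled counter values. How the samples are collected is left open, and the one-per-minute threshold is an arbitrary placeholder, not a recommended value.

```python
def preemption_rate_per_min(samples):
    """samples: list of (unix_seconds, total_cumulative_preemption_cnt), oldest first."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive time window")
    return (c1 - c0) * 60.0 / (t1 - t0)


def kv_starved(samples, max_rate_per_min=1.0):
    """True when preemptions keep accumulating faster than the chosen threshold."""
    return preemption_rate_per_min(samples) > max_rate_per_min


steady = [(0, 10), (60, 10), (120, 10)]
rising = [(0, 10), (60, 24), (120, 40)]
print(kv_starved(steady), kv_starved(rising))
```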

References

Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Chapter 16 — Profiling, Debugging, and Tuning Inference at Scale (Continuous Batching, Continuous Scheduling, Stall-Free/Chunked Prefill, Table 16-4)
vLLM, "Inside vLLM: Anatomy of a High-Throughput LLM Inference System": https://vllm.ai/blog/2025-09-05-anatomy-of-vllm
vLLM, "Optimization and Tuning" (chunked prefill, max_num_batched_tokens, preemption mode): https://docs.vllm.ai/en/latest/performance/optimization.html
vLLM engine/scheduler arguments: https://docs.vllm.ai/en/latest/serving/engine_args.html
vLLM SchedulerConfig defaults (max_num_batched_tokens, max_num_seqs, enable_chunked_prefill, policy): https://docs.vllm.ai/en/latest/api/vllm/config/scheduler/
SGLang, "Server Arguments" (--chunked-prefill-size, --enable-mixed-chunk, --schedule-conservativeness, --enable-priority-scheduling): https://docs.sglang.io/advanced_features/server_arguments.html
Sarathi-Serve (chunked-prefill / decode coalescing, iteration-level scheduling): https://arxiv.org/abs/2403.02310

¹ Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Ch. 16: "Continuous Batching," "Continuous Scheduling," "Stall-Free Scheduling (Chunked Prefill)," Table 16-4, and the sample vLLM scheduler preemption log. Continuous batching "maintains high GPU utilization by refilling batches on every token-generation iteration," chunked prefill "schedules prompt prefill processing and token generation in a tightly interleaved way," and prefill self-attention is N(N+1)/2 QK dot products per layer per head.
² Orca introduced iteration-level scheduling; vLLM and SGLang adopt it as the basis of continuous batching. vLLM, "Inside vLLM: Anatomy of a High-Throughput LLM Inference System," https://vllm.ai/blog/2025-09-05-anatomy-of-vllm.
³ vLLM V1 scheduler maintains a waiting and a running queue; each step admits from waiting subject to KV availability and a token_budget (max_num_batched_tokens), prioritizes decode requests already in running (computing the number of new tokens for each and reserving KV blocks via allocate_slots) before scheduling prefill, and frees KV blocks on completion. vLLM, "Inside vLLM: Anatomy of a High-Throughput LLM Inference System," https://vllm.ai/blog/2025-09-05-anatomy-of-vllm.
⁴ "In vLLM V1, chunked prefill is always enabled by default." Smaller max_num_batched_tokens (e.g. 2048) yields better ITL; higher yields better TTFT. Default preemption mode in V1 is RECOMPUTE rather than SWAP. vLLM, "Optimization and Tuning," https://docs.vllm.ai/en/v0.9.1/configuration/optimization.html (also https://docs.vllm.ai/en/latest/performance/optimization.html).
⁵ SchedulerConfig defaults: max_num_batched_tokens=2048, max_num_seqs=128, enable_chunked_prefill=True ("prefill requests can be chunked based on the remaining max_num_batched_tokens"), policy="fcfs" (or "priority"). vLLM, "config.scheduler" API reference, https://docs.vllm.ai/en/latest/api/vllm/config/scheduler/.
⁶ SGLang maintains a waiting_queue and running_batch, rebuilds the batch each iteration via get_next_batch_to_run(), and chunks prefill via --chunked-prefill-size (tokens) with --enable-mixed-chunk to share prefill and decode in one batch; priority preemption via --enable-priority-scheduling / --priority-scheduling-preemption-threshold, admission bias via --schedule-conservativeness. SGLang docs, "Server Arguments," https://docs.sglang.io/advanced_features/server_arguments.html (source: https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/server_arguments.md).