Inference QoS, admission control, and routing

Scope: protecting inference SLOs under load. Priority classes and per-request SLO targets, admission control / load shedding when the queue grows, latency-aware routing across replicas, and prefill/decode-aware scheduling that keeps TTFT and TPOT inside budget.

Vendor flags, thresholds, and log lines below are quoted from engine documentation and the book (Chris Fregly, "AI Systems Performance Engineering"; see References); none of them have been hardware-tested here. The Python blocks are runnable and self-checking. Validate every threshold against your own traffic and SLOs before relying on it.

What it is

QoS for an LLM inference fleet is the set of policies that decide, under contention, whose request runs when and where, so that latency targets hold instead of degrading uniformly for everyone. It sits one layer above the per-step engine loop (continuous batching internals) and one layer below capacity planning. Four mechanisms compose it:

Priority classes / per-request SLO targets. Requests carry a priority (or a class that maps to one), and the scheduler orders the waiting and running queues by it rather than purely by arrival time.
Admission control / load shedding. When the queue or KV pool is saturated, the system rejects, defers, or downgrades low-priority work instead of accepting everything and blowing every SLO.
Latency-aware routing. A router places each request on the replica that minimizes expected completion time, accounting for prefill cost (quadratic in prompt length), queue depth, and KV-cache affinity.
Prefill/decode-aware scheduling. Chunked prefill and prefill/decode disaggregation stop a long prompt from monopolizing a step and spiking inter-token latency for every decode in flight.

The book frames the load-shedding lever bluntly: a long request cannot be allowed to delay others past their tail-latency promise. "If you promise a p99 latency of 2 seconds for a request of a certain length, you can't afford to delay one request by 500 ms waiting in a batch queue."¹

Priority scheduling in vLLM is explicit. The engine's scheduling policy is fcfs (first-come-first-served) by default or priority; under priority "requests are handled based on given priority (lower value means earlier handling) and time of arrival deciding any ties."⁸ Set it with the --scheduling-policy priority engine argument; each request then carries an integer priority passed through add_request(..., priority=...) (default 0), where lower runs first.⁹ When the KV cache is exhausted under priority, the scheduler preempts the request with the highest (priority, arrival_time) tuple (that is, the lowest-priority, newest victim) back to the waiting queue, so a higher-priority arrival can run immediately.⁸ SGLang exposes the analogous knobs via --enable-priority-scheduling with --priority-scheduling-preemption-threshold, plus --schedule-conservativeness to bias the admission estimate away from OOM.¹¹

Why use it

LLM requests are non-uniform: unlike traditional microservice calls that are "relatively uniform and predictable in their execution time," LLM invocations "are nonuniform and can vary wildly in terms of latency."² A single 20K-token prefill can stall a step long enough to break inter-token latency for every decode sharing that GPU: classic head-of-line blocking. The book's troubleshooting table lists exactly this: high tail latency (its example threshold is p95 > 200 ms) with probable cause "decode-node hotspot or head-of-line blocking," and the recommended action is to inspect router logs, tune the prefetch threshold, and enable speculative-decoding paths.³

Without QoS, overload degrades everyone equally: interactive chat traffic and bulk batch jobs miss their targets together. With priority classes plus admission control, the fleet sheds or defers the low-value work and preserves the SLOs that matter. The book's system-orchestration guidance calls for "multitenancy isolation and per-user quotas to prevent noisy neighbors" and "GPU isolation with MIG or stream priorities to enforce QoS."⁶ Latency-aware routing is the other half: the book's worked example shows a naive FIFO scheduler overloading one GPU (critical path 38,007,000 self-attention QK ops) while a latency-aware scheduler rebalances the same prompts into even batches (22,005,000 each), cutting the critical path "by about 42%" and "significantly reduc[ing] TTFT."⁴

The cost of dynamic batching is the other side of the same coin: it "improves throughput by amortizing fixed costs ... at the expense of individual-request latency."¹ QoS is how you spend that latency budget deliberately instead of letting it fall where it may.

When to use it (and when not)

Needed when:

Mixed-priority traffic shares a fleet: interactive vs. batch, paid tiers vs. free, foreground vs. background agents. Priority classes are the whole point.
Load is bursty and can exceed capacity. Admission control / load shedding is what keeps the admitted set inside SLO instead of accepting an unservable queue. Pair with autoscaling "warm spare instances to hide model load times."⁶
Prompt lengths are heterogeneous. Latency-aware routing and chunked prefill prevent long prompts from starving short ones; FIFO overloads a replica when "the arrival order is not globally ideal."⁴
You run a multi-replica or disaggregated deployment where placement is a free variable. KV-cache-affinity routing turns prefix reuse into latency reduction.

Not needed (or over-engineering) when:

Single replica, single uniform workload, no contention. FCFS is fine; priority queues add coordination cost for no gain. Adopt the simplest policy that meets the SLO.
Latency is already far inside budget at peak. The book warns the heavier mechanisms, speculative decoding (adds a draft model) and Medusa (multi-head parallel decode), "are typically reserved for extreme cases such as ultralong contexts or erratic latency variances. Lighter-weight methods, including sparsity, batching, and disaggregation, deliver the bulk of benefits in production."⁶
Offline / throughput-only batch jobs with no per-request latency target. Maximize batch size; QoS routing is irrelevant.

Define the targets before reaching for mechanisms; see the SLO/SLI catalog for TTFT/TPOT/end-to-end definitions and the inference SLO-breach runbook for response.

Architecture

A request crosses three decision points before it streams a token: admission control (admit, defer, or shed), the latency-aware router (which replica, local prefix-cache hit or remote), and the per-step scheduler (priority queue plus chunked prefill, with preemption of the lowest (priority, arrival) victim on KV exhaustion).

flowchart LR
  I["Incoming request<br/>(priority class, prompt len)"] --> AC{"Admission<br/>control"}
  AC -->|"shed / 429"| X["Reject or queue-defer<br/>low-priority"]
  AC -->|"admit"| RT["Latency-aware router"]
  RT -->|"prefix-cache hit"| L["Local replica"]
  RT -->|"miss / overloaded"| R["Remote replica<br/>(balanced batch)"]
  L --> SCH["Per-step scheduler<br/>(priority queue + chunked prefill)"]
  R --> SCH
  SCH -->|"preempt lowest (priority, arrival)"| SCH
  SCH --> O["Stream tokens<br/>within TTFT / TPOT budget"]

The engine-level knobs (--scheduling-policy, max_num_batched_tokens, max_num_seqs) govern the scheduler box; the router box is a separate process (NVIDIA Dynamo, or your own gateway) that owns admission and placement. The sections below walk each box: how to configure the priority policy, how to integrate a router and gateway, how to run and maintain it in production, and how to scale it out.

How to use it: priority classes and per-request SLO targets

Map product tiers to a small integer priority space and set the engine policy. In vLLM, lower value = earlier handling; arrival time breaks ties.⁸

# vLLM: enable priority scheduling at the engine level
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --scheduling-policy priority \
  --max-num-seqs 256 \
  --max-num-batched-tokens 2048

# Reference template (needs vLLM). Per-request priority: lower integer = served
# first; ties broken by arrival. Offline LLMEngine path; the OpenAI server accepts
# priority via its request body when the engine runs with --scheduling-policy priority.
engine.add_request(
    request_id="req-interactive-001",
    prompt=prompt,
    params=sampling_params,
    priority=0,          # interactive / paid tier
)
engine.add_request(
    request_id="req-batch-77",
    prompt=bulk_prompt,
    params=sampling_params,
    priority=100,        # background / best-effort
)
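
The tier-to-priority mapping described at the start of this section can be sketched in plain Python, with no engine required. The tier names, the specific integers, and the helper names `priority_for` and `serve_order` are hypothetical; the only rule taken from the vLLM documentation is that a lower integer is served first and arrival time breaks ties.

```python
# Hypothetical tier names and priority values. Only the ordering rule
# (lower integer = served first, ties broken by arrival) comes from the docs.
TIER_PRIORITY = {
    "interactive": 0,
    "paid_api": 10,
    "free": 50,
    "batch": 100,
}
DEFAULT_PRIORITY = 100  # assumption: unknown tiers fall back to best-effort


def priority_for(tier):
    """Map a product tier to the integer priority passed with the request."""
    return TIER_PRIORITY.get(tier, DEFAULT_PRIORITY)


def serve_order(requests):
    """Order (tier, arrival_time) pairs ascending by (priority, arrival_time)."""
    return sorted(requests, key=lambda r: (priority_for(r[0]), r[1]))


queue = [("batch", 1.0), ("interactive", 3.0), ("free", 2.0), ("interactive", 2.5)]
ordered = serve_order(queue)
assert ordered[0] == ("interactive", 2.5)   # earliest of the top class
assert ordered[-1] == ("batch", 1.0)        # oldest arrival, but lowest class
```

Keeping the priority space small and sparse (0, 10, 50, 100) leaves room to insert a new tier later without renumbering existing ones.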

Under priority, KV exhaustion preempts the highest-(priority, arrival_time) victim (the lowest-priority, newest request) and re-queues it; recompute is the default reinstatement path, not host swap.⁸ The victim-selection rule is small enough to state exactly and check against a slow reference. This numpy-only block reproduces the core math vLLM applies (it needs no engine):

# vLLM preemption victim: under `priority`, the scheduler evicts the request with
# the HIGHEST (priority, arrival_time) tuple -- lowest priority, newest. Lower
# priority value = served first, so the highest value is first to be preempted.
import numpy as np

def preemption_victim(priorities, arrivals):
    """Index of the request vLLM preempts first: max over (priority, arrival)."""
    pr = np.asarray(priorities); ar = np.asarray(arrivals)
    order = np.lexsort((ar, pr))   # ascending by (priority, then arrival)
    return int(order[-1])          # last = largest (priority, arrival)

def _reference(priorities, arrivals):
    """Slow, obviously-correct reference: pick the max (priority, arrival) pair."""
    best, best_key = None, None
    for i, (p, a) in enumerate(zip(priorities, arrivals)):
        key = (p, a)
        if best_key is None or key > best_key:
            best_key, best = key, i
    return best

# Happy path: three classes, distinct priorities. priority=100 (batch) is evicted.
assert preemption_victim([0, 100, 0], [1.0, 2.0, 3.0]) == 1
# Edge: TIE on priority -> newest (largest arrival) wins. idx 1 arrived latest.
assert preemption_victim([5, 5, 5], [10.0, 30.0, 20.0]) == 1
# Adversarial: a newer high-priority (0) request must NOT be chosen over an older
# low-priority (100) one -- priority dominates arrival in the key.
assert preemption_victim([100, 0], [1.0, 999.0]) == 0

# Equivalence vs the slow reference over 5000 random fleets (dense ties).
rng = np.random.default_rng(0)
for _ in range(5000):
    n = int(rng.integers(1, 12))
    p = rng.integers(0, 4, size=n).tolist()               # few distinct -> ties
    a = (rng.integers(0, 6, size=n) + np.arange(n) * 1e-6).tolist()  # unique max
    assert preemption_victim(p, a) == _reference(p, a)

print("v1 preemption-victim: all asserts passed")

SGLang's equivalent policy and preemption controls:

# SGLang: priority preemption + admission conservativeness
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-priority-scheduling \
  --priority-scheduling-preemption-threshold 10 \
  --schedule-conservativeness 1.0

--schedule-conservativeness (default 1.0) biases the admission estimate: lower values admit more aggressively (risking preemption), higher values hold back to avoid OOM.¹¹

How to integrate it: admission control and load shedding

The cheapest admission lever is the per-step token budget. max_num_batched_tokens (default 2048 for chunked prefill) caps tokens per iteration; smaller values protect ITL because fewer prefill tokens compete with in-flight decodes in each step, while larger values favor TTFT.⁸,¹⁰ max_num_seqs (default 128) caps concurrent sequences.⁸ Set the dynamic-batch max delay "well below the p99 latency requirement (on the order of 1–2 ms)" so queuing never eats the budget, and use an adaptive delay that "can dynamically drop to near 0 ms at low RPS and increase higher to 5–10 ms at peak load."¹
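
One way to picture the adaptive delay quoted above is a ramp from near 0 ms at low load to a ceiling at peak load. This is a minimal sketch under stated assumptions: the linear shape, the function name `adaptive_batch_delay_ms`, and the `peak_rps` parameter are illustrative choices, not taken from the book or from any engine.

```python
def adaptive_batch_delay_ms(rps, peak_rps, min_delay_ms=0.0, max_delay_ms=5.0):
    """Batch-queue delay that grows with load.

    Assumption: a linear ramp between min_delay_ms (idle) and max_delay_ms
    (at or above peak_rps). The 5 ms ceiling mirrors the lower end of the
    book's quoted 5-10 ms peak range.
    """
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    load = min(max(rps / peak_rps, 0.0), 1.0)   # clamp load fraction to [0, 1]
    return min_delay_ms + load * (max_delay_ms - min_delay_ms)


assert adaptive_batch_delay_ms(0, 100) == 0.0      # idle: no added wait
assert adaptive_batch_delay_ms(50, 100) == 2.5     # half load: half the ceiling
assert adaptive_batch_delay_ms(100, 100) == 5.0    # peak: full ceiling
assert adaptive_batch_delay_ms(500, 100) == 5.0    # overload: clamped
```

Whatever shape the ramp takes, the ceiling should stay well below the p99 target so that queuing alone cannot consume the budget.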

Beyond the engine, shed at the gateway: return 429 or defer low-priority classes when admitted concurrency or queue depth crosses a watermark, rather than accepting an unservable backlog. The admission decision is a small, testable function: serve by priority up to the watermark, drop the rest. This numpy-only block implements it and checks it against a slow reference plus a starvation-resistance edge case:

# Admission control / load shedding at the gateway. Serve by priority (lower value
# = more important); admit until the concurrency watermark is full, then shed (429)
# the rest. Under a flood, this keeps the ADMITTED set inside SLO instead of
# accepting an unservable backlog: the least important requests are the ones dropped.
import numpy as np

def admit(priorities, watermark):
    """Bool mask: True = admit, False = shed (429). Admit the `watermark` most
    important requests; ties broken by arrival order (stable sort)."""
    pr = np.asarray(priorities)
    order = np.argsort(pr, kind="stable")     # most important first, stable by arrival
    admitted = np.zeros(len(pr), dtype=bool)
    admitted[order[:watermark]] = True
    return admitted

def _reference(priorities, watermark):
    """Slow reference: admit exactly the `watermark` smallest (most important)
    priorities, ties broken by original arrival index."""
    idx = sorted(range(len(priorities)), key=lambda i: (priorities[i], i))
    keep = set(idx[:watermark])
    return np.array([i in keep for i in range(len(priorities))])

# Happy path: watermark=2. The two most important (0,0) stay, batch job (100) sheds.
assert admit([0, 100, 0], 2).tolist() == [True, False, True]
# Boundary: watermark == fleet size -> admit everyone; watermark == 0 -> shed all.
assert admit([0, 5, 100, 100], 4).all()
assert not admit([0, 5, 100, 100], 0).any()
# Adversarial: 9 batch (100) + 1 interactive (0), watermark=3. The lone interactive
# request must be admitted, not starved by the flood.
m = admit([100] * 9 + [0], 3)
assert m[-1] and m.sum() == 3

# Equivalence vs slow reference over random mixes (dense priority ties).
rng = np.random.default_rng(7)
for _ in range(5000):
    n = int(rng.integers(1, 20))
    p = rng.integers(0, 4, size=n).tolist()   # few distinct values -> dense ties
    w = int(rng.integers(0, n + 1))
    assert (admit(p, w) == _reference(p, w)).all()
# Invariant: admitted count is exactly min(watermark, fleet size), always.
for _ in range(2000):
    n = int(rng.integers(1, 20)); w = int(rng.integers(0, 25))
    assert admit(rng.integers(0, 50, size=n).tolist(), w).sum() == min(w, n)

print("v3 admission: all asserts passed")

The book's preemption warning is the signal that you are already past admission limits. A steady stream of PreemptionMode.RECOMPUTE "because not enough KV cache space" means the KV pool is undersized for the admitted load:³

WARNING 2025-05-03 14:22:07 scheduler.py:1057 Sequence group 0 is preempted by
PreemptionMode.RECOMPUTE because not enough KV cache space.
total_cumulative_preemption_cnt=1

Recommended responses from the book: raise the GPU memory-utilization threshold, reduce max_num_batched_tokens, or rely on PagedAttention's block allocation; if recompute thrash persists, the load itself must be shed or scaled out.³
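
To turn the warning into a signal you can alert on, a small parser can pull the cumulative counter out of the log text. This is a sketch that assumes the exact wording of the line quoted above; the regular expression and the helper names `preemption_events` and `preemptions_in_window` are illustrative, and log formats can differ between engine versions.

```python
import re

# Matches the warning quoted above, which wraps across several lines.
PREEMPT_RE = re.compile(
    r"preempted by\s+PreemptionMode\.(?P<mode>\w+)"
    r".*?total_cumulative_preemption_cnt=(?P<count>\d+)",
    re.DOTALL,
)


def preemption_events(log_text):
    """Return [(mode, cumulative_count), ...] in log order."""
    return [(m["mode"], int(m["count"])) for m in PREEMPT_RE.finditer(log_text)]


def preemptions_in_window(events):
    """Growth of the cumulative counter between the first and last event."""
    if not events:
        return 0
    return events[-1][1] - events[0][1]


QUOTED = (
    "WARNING 2025-05-03 14:22:07 scheduler.py:1057 Sequence group 0 is preempted by\n"
    "PreemptionMode.RECOMPUTE because not enough KV cache space.\n"
    "total_cumulative_preemption_cnt=1\n"
)
# Synthetic second event: the quoted line with only the counter edited.
LATER = QUOTED.replace("preemption_cnt=1", "preemption_cnt=4")

events = preemption_events(QUOTED + LATER)
assert events == [("RECOMPUTE", 1), ("RECOMPUTE", 4)]
assert preemptions_in_window(events) == 3
```

A counter that keeps growing across successive windows is the "steady stream" case: the admitted load exceeds the KV pool, and the responses listed above apply.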

How to run it in production: latency-aware routing across replicas

Route on expected completion time, not round-robin. Because prefill self-attention cost is triangular (N(N+1)/2 QK dot products per layer per head, quadratic in prompt length N), a router that knows prompt lengths can pack balanced batches and move short requests ahead so they "finish prefill sooner and begin decode without waiting for the heavier ... prefill to complete."⁴ The book's worked example is concrete enough to reproduce exactly. This numpy-only block recomputes the cost model, the FIFO critical path, the balanced split, and the "about 42%" reduction:

# Prefill self-attention cost and latency-aware balancing. Book worked example
# (Fregly Ch.16): prompts [6K,2K,6K,2K,2K,2K] tokens. Per layer per head, prefill
# does N(N+1)/2 QK dot products (triangular). FIFO packs both 6K prompts on GPU1
# (heavy); latency-aware balances the load -> critical path ~42% shorter.
import numpy as np

def qk_ops(n_tokens):
    """Triangular self-attention QK dot-product count for a length-N prefill."""
    n = np.asarray(n_tokens, dtype=np.int64)
    return n * (n + 1) // 2

def balance_two_gpus(prompt_tokens):
    """Greedy longest-processing-time packing onto 2 GPUs; return per-GPU cost."""
    load = [0, 0]
    for c in sorted((int(qk_ops(n)) for n in prompt_tokens), reverse=True):
        g = 0 if load[0] <= load[1] else 1
        load[g] += c
    return load

# A single 6K / 2K prefill: N(N+1)/2.
assert int(qk_ops(6000)) == 6000 * 6001 // 2 == 18_003_000
assert int(qk_ops(2000)) == 2000 * 2001 // 2 == 2_001_000
# FIFO worst arrival order lands both 6K prompts on GPU1: the book's 38,007,000.
fifo_gpu1 = 2 * int(qk_ops(6000)) + int(qk_ops(2000))
assert fifo_gpu1 == 38_007_000
# Latency-aware balances the six prompts evenly: 22,005,000 per GPU.
balanced = balance_two_gpus([6000, 2000, 6000, 2000, 2000, 2000])
assert balanced == [22_005_000, 22_005_000], f"expected even split, got {balanced}"
# Total work is conserved: balancing moves the critical path, not the sum.
total = sum(int(qk_ops(n)) for n in [6000, 2000, 6000, 2000, 2000, 2000])
assert sum(balanced) == total == 44_010_000
# Critical-path reduction matches the book's "about 42%".
reduction = (fifo_gpu1 - max(balanced)) / fifo_gpu1
assert abs(reduction - 0.42) < 0.005, f"expected ~42% cut, got {reduction:.4f}"
# Edge: cost is quadratic -- doubling N roughly quadruples ops; boundaries defined.
assert int(qk_ops(4000)) > 3.9 * int(qk_ops(2000))
assert int(qk_ops(0)) == 0 and int(qk_ops(1)) == 1

print(f"v2 routing: reduction={reduction:.4f}, balanced={balanced}; all asserts passed")

Add KV-cache affinity: prefer the replica that already holds the request's prefix. The book's NVIDIA Dynamo router log shows the pattern: a 90% prefix-cache hit keeps prefill local; a miss dispatches the prefill to a remote worker:³

[Router] 2025-05-03T14:23:11Z INFO KVRouter: prefix-cache hit (90%) for
model=DeepSeek-R1; routing to local vLLM worker
[Router] 2025-05-03T14:23:12Z INFO KVRouter: cache miss; dispatching remote
prefill to GPU-node-03

NVIDIA Dynamo's KV-aware router hashes requests and tracks cache locations to "load balance for both compute (on prefill) and memory (on decode)," and its SLA-based planner scales prefill vs. decode workers from pre-deployment profiling to hold TTFT and ITL targets within a GPU budget.¹² The illustrative threshold in the book is p95 > 200 ms for "decode-node hotspot or head-of-line blocking"; inspect router logs and tune the prefetch threshold when it trips.³ (Numeric thresholds in the book's tables are explicitly illustrative, not measured.³)
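
The trade-off between cache affinity and queue depth can be sketched with the same triangular cost model used earlier. This is not Dynamo's routing algorithm: the scoring rule (queued work plus uncached prefill work), the replica names, and the helpers `prefill_cost` and `pick_replica` are assumptions for illustration only.

```python
def qk_ops(n):
    """Triangular QK dot-product count for a length-n prefill."""
    return n * (n + 1) // 2


def prefill_cost(prompt_len, cached_prefix_len):
    """QK dot products still to compute when the first cached_prefix_len tokens
    already have KV resident: token t (1-indexed) attends to t keys, so the
    remaining work is the sum over t = cached_prefix_len+1 .. prompt_len."""
    return qk_ops(prompt_len) - qk_ops(cached_prefix_len)


def pick_replica(prompt_len, replicas):
    """replicas maps name -> (cached_prefix_len, queued_ops).
    Hypothetical score: work already queued plus uncached prefill work."""
    return min(
        replicas,
        key=lambda name: replicas[name][1] + prefill_cost(prompt_len, replicas[name][0]),
    )


# A 90% prefix hit on an idle local replica beats a cold remote one.
assert prefill_cost(6000, 5400) == 3_420_300
assert pick_replica(6000, {"local": (5400, 0), "remote": (0, 0)}) == "local"
# The same hit loses once the local queue is deep enough.
assert pick_replica(6000, {"local": (5400, 20_000_000), "remote": (0, 0)}) == "remote"
```

Because the cost is quadratic, a 90% prefix hit removes about 81% of the prefill work here, which is why affinity is worth a moderately longer queue but not an arbitrarily long one.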

How to scale it: prefill/decode-aware scheduling and isolation

Chunked prefill is the head-of-line-blocking fix: split a long prompt across steps so decodes keep progressing. It is enabled by default in vLLM V1; the scheduler batches pending decodes first, then spends the remaining max_num_batched_tokens budget on prefill, chunking long prompts so they never starve decodes.¹⁰ long_prefill_token_threshold marks which prompts count as "long" for this treatment (default 0, meaning derive from context length).⁸ Chunking changes only the timing of prefill work (when it becomes available, which affects overlap and latency); it yields "not fewer total attention dot-product operations."⁵ (The v2 routing block above already asserts that total QK work is conserved under rebalancing; the same holds for chunking.)
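
The conservation claim for chunking can be checked directly. The sketch below splits a prompt into chunks of at most `chunk_tokens` tokens and counts QK dot products per step under the same triangular model; the function name `chunked_prefill_ops` is hypothetical and the model ignores everything except attention scores.

```python
def chunked_prefill_ops(prompt_len, chunk_tokens):
    """Per-step QK dot products when a prompt is prefilled in chunks.

    The token at position t (1-indexed) attends to t keys, so a chunk covering
    positions start+1 .. end costs end(end+1)/2 - start(start+1)/2.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    steps, start = [], 0
    while start < prompt_len:
        end = min(start + chunk_tokens, prompt_len)
        steps.append(end * (end + 1) // 2 - start * (start + 1) // 2)
        start = end
    return steps


steps = chunked_prefill_ops(6000, 2048)
assert len(steps) == 3                                # 2048 + 2048 + 1904 tokens
assert sum(steps) == 6000 * 6001 // 2 == 18_003_000   # total work unchanged
assert max(steps) < 18_003_000                        # but no step pays it all
```

Later chunks cost more per token than earlier ones because they attend to a longer prefix, so an equal token budget per step does not mean equal attention work per step.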

For a stronger isolation guarantee, run prefill/decode disaggregation so a prefill surge cannot touch decode latency at all (disaggregated inference), and move KV between pools with NIXL transfer. At the hardware tier, enforce isolation with MIG partitions or CUDA stream priorities so a noisy tenant cannot steal another class's SMs.⁶ Scale the router itself with Dynamo's SLA-based planner, which adjusts prefill and decode worker counts from profiling to hold TTFT and ITL within a GPU budget.¹²

How to maintain it

QoS is a measurement loop, not a static config. The book is emphatic: "optimizations should not be considered successful until they are verified with actual metrics."⁷

Plot p50/p95/p99 latency against RPS; "with batching, you'll often see overall p50 latency stay flat, or even drop, as throughput increases ... up until an inflection point. Make note of this inflection point and stay under this value."¹ That inflection is your admission watermark (the watermark argument in the admission block above).
Watch total_cumulative_preemption_cnt and prefix-cache hit rate (vllm:gpu_prefix_cache_queries, vllm:gpu_prefix_cache_hits; confirm the exact metric names against your build's metrics output); a rising preemption count indicates KV starvation, and a falling hit rate signals router/affinity drift.³
Track per-class TTFT and TPOT separately; a fleet-wide average hides a starved low-priority class. Validate against the SLO/SLI catalog; on breach, follow the inference SLO-breach runbook.
Re-confirm defaults per release: vLLM's max_num_batched_tokens default has moved across versions; pin and verify against the installed build rather than assuming.¹⁰
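
Locating the inflection point from a load sweep can be sketched as follows. The tolerance rule (p50 more than 10% above its low-load baseline) and the function name `inflection_rps` are assumptions, and the sample numbers are made up purely to exercise the function; they are not measurements.

```python
def inflection_rps(rps, p50_ms, tolerance=0.10):
    """Highest load level reached before p50 first exceeds its low-load
    baseline by more than `tolerance`. Inputs must be sorted by ascending RPS.

    Assumption: the first sample is taken at low load and serves as baseline.
    """
    if not rps or len(rps) != len(p50_ms):
        raise ValueError("rps and p50_ms must be non-empty and the same length")
    baseline = p50_ms[0]
    last_ok = rps[0]
    for load, latency in zip(rps, p50_ms):
        if latency > baseline * (1 + tolerance):
            break                      # first point past the knee
        last_ok = load
    return last_ok


# Synthetic sweep: p50 stays flat (even dips) until 80 RPS, then climbs.
sweep_rps = [10, 20, 40, 80, 160]
sweep_p50 = [100.0, 98.0, 97.0, 105.0, 180.0]
assert inflection_rps(sweep_rps, sweep_p50) == 80
```

The returned load level is a starting point for the admission watermark; repeat the sweep per priority class, since a fleet-wide curve can hide a class that bends earlier.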

Failure modes

Head-of-line blocking. A single long prefill stalls a step and spikes ITL for every co-resident decode; the book's symptom is high tail latency (illustrative p95 > 200 ms) tagged "decode-node hotspot or head-of-line blocking." Fix with chunked prefill and latency-aware routing, then disaggregation if it persists.³,⁴
KV-cache starvation / recompute thrash. A steady PreemptionMode.RECOMPUTE "because not enough KV cache space" and a rising total_cumulative_preemption_cnt mean the admitted load exceeds the KV pool. Raise the memory-utilization threshold or reduce max_num_batched_tokens; if it persists, shed or scale out.³
Low-priority starvation. Aggressive priority ordering can indefinitely defer a best-effort class. Track per-class TTFT/TPOT separately (a fleet-wide average hides it) and cap deferral; the admission block above asserts a lone high-priority arrival survives a low-priority flood, but the converse (a starved batch class) is the risk to watch.
Queuing eats the latency budget. A dynamic-batch delay set too high, or a watermark set too high, admits an unservable backlog. Keep the batch delay "well below the p99 latency requirement" and stay under the latency-vs-RPS inflection point.¹
Router/affinity drift. A falling prefix-cache hit rate means requests are landing on replicas that do not hold their prefix, re-paying prefill remotely; inspect KVRouter logs and tune the prefetch threshold.³,¹²
Over-engineering the policy. Priority queues add coordination cost, and speculative decoding and Medusa are "typically reserved for extreme cases"; on a single uniform workload FCFS meets the SLO and the heavier machinery is pure overhead.⁶
Trusting illustrative thresholds. The numeric values in the book's tables are "illustrative to explain the concepts," not measured; a threshold copied verbatim (p95 > 200 ms, delay 1–2 ms) must be re-derived from your own traffic.³
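
One way to cap deferral for the low-priority starvation case above is to promote any request that has waited past a limit. This is a sketch of a generic aging rule, not a feature of vLLM or SGLang; the 30-second cap and the helper names `effective_priority` and `aged_serve_order` are hypothetical.

```python
def effective_priority(priority, waited_s, max_wait_s=30.0, top_priority=0):
    """Promote a request to top_priority once it has waited max_wait_s or more;
    otherwise keep its own priority (lower value = served first)."""
    return top_priority if waited_s >= max_wait_s else priority


def aged_serve_order(requests, now_s, max_wait_s=30.0):
    """requests: list of (request_id, priority, arrival_s).
    Order ascending by (effective priority, arrival time)."""
    return sorted(
        requests,
        key=lambda r: (effective_priority(r[1], now_s - r[2], max_wait_s), r[2]),
    )


pending = [("batch-1", 100, 0.0), ("chat-1", 0, 29.0), ("chat-2", 0, 30.5)]
# Before the cap, the batch job stays behind interactive traffic.
assert [r[0] for r in aged_serve_order(pending, now_s=31.0, max_wait_s=60.0)] == [
    "chat-1", "chat-2", "batch-1"]
# Past the cap, it is promoted and wins the tie as the oldest arrival.
assert [r[0] for r in aged_serve_order(pending, now_s=31.0, max_wait_s=30.0)] == [
    "batch-1", "chat-1", "chat-2"]
```

The cap bounds worst-case wait for the best-effort class at the cost of occasionally delaying an interactive request, so it should be derived from the batch class's own SLO rather than picked arbitrarily.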

References

Chris Fregly, "AI Systems Performance Engineering" (O'Reilly), Chapter 16: Dynamic Batching, Latency-Aware Scheduling and Dynamic Routing, the troubleshooting table (Table 16-1) with the vLLM preemption log and Dynamo router log, and the QoS/scaling and system-orchestration rows (Table 16-2)
vLLM, scheduler configuration / scheduling policy (fcfs vs priority, preemption by (priority, arrival_time)): https://docs.vllm.ai/en/latest/api/vllm/config/scheduler/
vLLM, sampling/engine add_request priority parameter and --scheduling-policy: https://docs.vllm.ai/en/latest/api/vllm/sampling_params/ and https://docs.vllm.ai/en/stable/configuration/engine_args/
vLLM, "Optimization and Tuning" (chunked prefill default, max_num_batched_tokens, ITL/TTFT trade-off, preemption mode): https://docs.vllm.ai/en/latest/configuration/optimization/
SGLang, "Server Arguments" (--enable-priority-scheduling, --priority-scheduling-preemption-threshold, --schedule-conservativeness, --schedule-policy): https://docs.sglang.io/advanced_features/server_arguments.html
NVIDIA Dynamo, KV-aware routing and SLA-based planner (TTFT/ITL targets, prefill/decode worker autoscaling): https://docs.nvidia.com/dynamo/latest/architecture/sla_planner.html

1. Fregly, Ch. 16, "Dynamic Batching": batching "improves throughput by amortizing fixed costs ... at the expense of individual-request latency"; "if you promise a p99 latency of 2 seconds ... you can't afford to delay one request by 500 ms waiting in a batch queue"; default dynamic batch delay "on the order of 1–2 ms," adaptive delay "drop to near 0 ms at low RPS and increase higher to 5–10 ms at peak load"; latency-percentile-vs-load inflection point ("stay under this value").
2. Fregly, Ch. 16, "Monitoring System Metrics and Counters": LLM invocations "are nonuniform and can vary wildly in terms of latency," unlike uniform microservice calls.
3. Fregly, Ch. 16, Table 16-1 and "Inference Troubleshooting Recipes": "High tail latency (p95 > 200 ms)" maps to "Decode-node hotspot or head-of-line blocking," action inspect router logs, tune prefetch threshold, enable speculative decoding; the sample vLLM PreemptionMode.RECOMPUTE "because not enough KV cache space" log and the NVIDIA Dynamo KVRouter prefix-cache-hit / cache-miss routing log. "The numeric values in all metrics tables are illustrative to explain the concepts."
4. Fregly, Ch. 16, "Latency-Aware Scheduling and Dynamic Routing": FIFO vs. latency-aware worked example over prompts [6K, 2K, 6K, 2K, 2K, 2K]: FIFO critical path 38,007,000 QK ops on GPU 1; latency-aware rebalanced to 22,005,000 per GPU, "reduces the critical path self-attention by about 42% ... can significantly reduce TTFT"; "a naive FIFO scheduler can overload one of the GPUs if the arrival order is not globally ideal."
5. Fregly, Ch. 16, "Stall-Free Scheduling (Chunked Prefill)": prefill self-attention performs N(N+1)/2 QK dot products per layer per head, quadratic in N; "Chunking does not reduce this total. Chunking only changes when work becomes available to the decoder." Total attention cost falls only with reduced effective context or local/sparse attention (O(NW) for window W).
6. Fregly, Ch. 16, "Full-Stack Inference Optimizations," Table 16-2 (System orchestration and QoS/scaling rows): "Multitenancy isolation and per-user quotas to prevent noisy neighbors," "Autoscaling with warm spare instances to hide model load times," "GPU isolation with MIG or stream priorities to enforce QoS"; speculative decoding/Medusa "are typically reserved for extreme cases ... Lighter-weight methods, including sparsity, batching, and disaggregation, deliver the bulk of benefits in production."
7. Fregly, Ch. 16, "Debugging Correctness Issues": "optimizations should not be considered successful until they are verified with actual metrics that demonstrate increased throughput, reduced latency, improved utilization."
8. vLLM SchedulerConfig: policy field default 'fcfs', alternative 'priority' ("lower value means earlier handling," arrival time breaks ties); on KV exhaustion under priority the scheduler preempts the request with the highest (priority, arrival_time) tuple back to the waiting queue. max_num_batched_tokens default 2048, max_num_seqs default 128, long_prefill_token_threshold default 0, enable_chunked_prefill default True. https://docs.vllm.ai/en/latest/api/vllm/config/scheduler/
9. vLLM LLMEngine.add_request(..., priority: int = 0): "the priority of the request. Only applicable with priority scheduling"; enabled at the engine level via --scheduling-policy priority. https://docs.vllm.ai/en/stable/configuration/engine_args/
10. vLLM, "Optimization and Tuning": "In vLLM V1, chunked prefill is always enabled by default"; smaller max_num_batched_tokens (e.g. 2048) yields better ITL, higher yields better TTFT; the scheduler "prioritizes decode requests, batching all pending decode requests before scheduling any prefill," then spends the remaining budget on prefill; default max_num_batched_tokens has shifted across versions (pin and verify). https://docs.vllm.ai/en/latest/configuration/optimization/
11. SGLang, "Server Arguments": --schedule-policy (lpm longest-prefix-match default, fcfs, lof, random, dfs-weight); --enable-priority-scheduling with --priority-scheduling-preemption-threshold; --schedule-conservativeness (default 1.0) biases admission away from OOM. https://docs.sglang.io/advanced_features/server_arguments.html
12. NVIDIA Dynamo: KV-aware router hashes requests and tracks cache locations to "load balance for both compute (on prefill) and memory (on decode)"; the SLA-based planner monitors TTFT and ITL and adjusts prefill/decode worker counts from pre-deployment profiling to meet latency targets within a GPU budget. https://docs.nvidia.com/dynamo/latest/architecture/sla_planner.html