Markdown

Constrained and structured decoding¶

Scope: forcing model output to a grammar, JSON schema, or regex by masking the next-token logits to only the tokens a compiled finite-state machine / pushdown automaton allows at each step (XGrammar, Outlines, llguidance), guaranteeing structurally valid output, and the compute overhead of computing that mask every step.

Code blocks below come in two kinds. The vLLM / SGLang / TensorRT-LLM snippets are reference templates on real vendor APIs (those engines are not installed here); pin versions and validate before production. The numpy blocks are runnable, self-checking validations of the core math the page teaches (logit masking, the automaton walk, XGrammar's token split); each runs on a stock python3 with numpy and asserts its result, including adversarial cases.

What it is¶

Decoding is a loop: the model produces a logit vector over the vocabulary, a sampler picks a token, repeat. Constrained decoding inserts one operation into that loop: logit masking. Before sampling, every token that would violate the target structure has its logit set to -inf, so the chosen token is always a legal continuation. Stack enough of these per-step masks and the full output is guaranteed to parse: valid JSON, a string matching a regex, or a sentence in a context-free grammar (CFG).¹

The structure is compiled, once, into an automaton:

Regex / JSON-as-regex compiles to a finite-state machine (FSM). Each state has a precomputed set of allowed next tokens.²
Context-free grammar (EBNF) / full JSON schema compiles to a pushdown automaton (PDA): an FSM plus a stack, needed for nesting/recursion that a flat FSM cannot express.¹

At each decode step the engine reads the automaton's current state, looks up (or computes) the allowed-token set, builds a boolean mask over the vocabulary, applies it to the logits, samples, then advances the automaton with the accepted token. The mask is a sampling-layer operation on the logits (the same layer the engine already uses for temperature, top-p, and repetition penalties), not a change to the model weights or attention.³

The single load-bearing operation is masking-then-renormalising the logits. The block below is the exact math: it sets illegal tokens to -inf, applies a numerically stable softmax over the survivors, and asserts the three properties the guarantee rests on (valid distribution, support confined to the legal set, equivalence to renormalising the legal logits directly), plus two adversarial cases (an empty allowed-set must raise rather than emit NaNs, and an illegal token with a huge logit must still get zero mass).

# Runnable on system python3 (numpy). Core operation: masked-softmax over legal tokens.
import numpy as np


def masked_softmax(logits: np.ndarray, allowed: np.ndarray) -> np.ndarray:
    """logits: (V,) float; allowed: (V,) bool. Returns (V,) probs, 0 on illegal tokens."""
    assert logits.shape == allowed.shape and logits.ndim == 1
    assert allowed.any(), "at least one token must be legal, else no continuation exists"
    masked = np.where(allowed, logits, -np.inf)
    m = masked.max()                       # stable softmax; m is finite since allowed.any()
    ex = np.exp(masked - m)                # illegal -> exp(-inf) = 0 exactly
    return ex / ex.sum()


rng = np.random.default_rng(0)
V = 128
logits = rng.standard_normal(V) * 5.0
allowed = rng.random(V) < 0.1              # ~10% of the vocab is a legal next token
allowed[allowed.argmax()] = True           # guarantee the set is non-empty

p = masked_softmax(logits, allowed)

# 1. Valid distribution.
assert np.isclose(p.sum(), 1.0), p.sum()
assert np.all(p >= 0.0)
# 2. Support is EXACTLY the allowed set: illegal tokens get zero mass, legal tokens
#    strictly positive mass. This is the structural guarantee the page claims.
assert np.all(p[~allowed] == 0.0)
assert np.all(p[allowed] > 0.0)
# 3. Equivalence to a slow reference: masking-then-renormalising must equal a plain
#    softmax over ONLY the legal logits, scattered back into place.
idx = np.flatnonzero(allowed)
ref_sub = np.exp(logits[idx] - logits[idx].max())
ref_sub /= ref_sub.sum()
ref = np.zeros(V)
ref[idx] = ref_sub
assert np.allclose(p, ref, atol=1e-12), np.abs(p - ref).max()
# 4. Adversarial: an all-illegal mask means the grammar admits no next token; the
#    engine must refuse rather than emit garbage, so masked_softmax asserts.
raised = False
try:
    masked_softmax(logits, np.zeros(V, dtype=bool))
except AssertionError:
    raised = True
assert raised, "empty allowed-set must raise, not silently produce NaNs"
# 5. Adversarial numeric: an ILLEGAL token with a huge logit must still get zero mass;
#    masking dominates logit magnitude.
adv = logits.copy()
adv[~allowed] = 1e9
p_adv = masked_softmax(adv, allowed)
assert np.all(p_adv[~allowed] == 0.0), "illegal high-logit token leaked probability mass"
assert np.isclose(p_adv.sum(), 1.0)

print("masked_softmax OK:", f"legal={idx.size}/{V}", "max|p-ref|=%.2e" % np.abs(p - ref).max())

Why use it¶

Decode is the latency-critical, memory-bandwidth-bound phase of serving (inference serving, continuous batching internals). Constrained decoding buys correctness that no amount of prompting guarantees, but it inserts a CPU-side, per-step, per-request computation into exactly that hot loop:

Correctness is structural, not statistical. A masked decoder cannot emit a token that breaks the grammar: no retry loop, no JSON-repair pass, no "please respond only with valid JSON." This is the right tool for tool/function calling, typed agent outputs, and data extraction where a parse failure is a hard failure.
The overhead is real but engineered down. Mask construction is the cost. A slow constraint backend serialises the GPU behind a CPU computation every token; a fast one (XGrammar's precompute plus cache, Outlines' index) hides most of it. The win depends on grammar reuse: vLLM notes the xgrammar backend "performs best when grammars are reused, thanks to effective caching," so the per-step cost amortises across many requests sharing a schema.⁴
Compilation is a one-time-per-schema cost, paid on the first request. Building the FSM/PDA for a large or pathological schema is not free; it lands on the first request that uses it and is then cached. A schema that changes every request defeats the cache.
It changes the output distribution by construction. Masking renormalises probability over the legal tokens only. That is the point, but it means a poorly written grammar can force the model into low-probability regions and degrade content quality; the structure is guaranteed, the usefulness is not.

When to use it (and when not)¶

Reach for it when:

The consumer is a parser, not a human: function/tool calls, JSON APIs, classification into a fixed label set, regex-shaped fields (dates, IPs, SKUs, enums).
You need a guarantee, not a high probability; a single malformed response is an incident.
The same schema is reused across many requests, so compilation and the mask cache amortise.⁴

Skip or be cautious when:

Output is free-form prose; there is no structure to enforce and the mask is pure overhead.
The grammar is unique per request (cache never hits) or so large that compilation latency dominates short generations.
The constraint is trivial and cheaply post-validated (e.g. "yes/no"): a guided_choice of two options is fine, but do not reach for a full CFG where a downstream check suffices.
An over-tight grammar would force structurally valid but semantically wrong output; loosen the schema or let the model reason in free text first, then constrain a final extraction step.

Architecture¶

The constraint is compiled once into an automaton; every decode step then reads the current state, masks the logits to that state's allowed tokens, samples, and advances the automaton with the chosen token.

flowchart LR
  A["Schema / regex / EBNF"] -->|"compile once"| B["FSM or pushdown automaton"]
  C["Model forward → logits"] --> D["Mask: allowed tokens from current state"]
  B --> D
  D --> E["Sample legal token"]
  E --> F["Advance automaton state"]
  F -->|"next step"| C

The naive cost is the problem this field exists to solve: deciding validity for every token in a 128k vocabulary, against a stack of automaton states, every step. XGrammar's insight is that the vocabulary splits into context-independent tokens (valid/invalid regardless of the stack, precomputable) and context-dependent tokens (must be checked at runtime). For Llama-3.1 with a JSON grammar the context-dependent set is "less than 1% (1134 out of 128k)"; the rest is served from an adaptive token mask cache, and a persistent execution stack makes automaton rollback constant-time.¹ XGrammar reports per-token mask generation "in less than 40us," "up to 3x speedup in the setting of JSON schema, and more than 100x speedup in the case of JSON grammar," and near-zero end-to-end overhead by overlapping mask computation (on CPU) with the GPU forward pass.¹ Outlines takes the FSM route: precompute, per state, an index of allowed tokens so the online step is an "O(1) on average" hash lookup.²

The block below is that split, made concrete: it builds an oracle grammar where most vocabulary tokens are accepted or rejected identically in every state (context-independent, precomputable) and only a small structural subset flips with the state (context-dependent). It then reconstructs each step's mask by reusing the precomputed part and touching only the context-dependent subset, and asserts the reconstruction is bit-identical to a full per-token recompute. The adversarial case proves the runtime check is load-bearing: a "buggy cache" that skips it corrupts the mask.

# Runnable on system python3 (numpy). Core idea: context-independent vs context-dependent
# token split (XGrammar), so the per-step mask is reconstructed without scanning the vocab.
import numpy as np

rng = np.random.default_rng(3)
V = 4096                                   # stand-in vocabulary
n_states = 16

# Oracle grammar accept(state, token), built to mirror a real grammar's structure: MOST
# tokens (arbitrary word-pieces) are accepted/rejected the SAME way in every state
# (context-independent); only a small "structural" subset flips with the stack.
base = rng.random(V) < 0.5                  # per-token acceptance shared across all states
truth = np.tile(base, (n_states, 1)).copy()
n_cd_true = 24                              # small state-dependent subset
cd_true_tokens = rng.choice(V, size=n_cd_true, replace=False)
truth[:, cd_true_tokens] = rng.random((n_states, n_cd_true)) < 0.5   # flip per state

# A token is context-INDEPENDENT iff its acceptance is identical across every state.
same_across_states = (truth == truth[0]).all(axis=0)
ci_mask = same_across_states
cd_mask = ~ci_mask
cd_index = np.flatnonzero(cd_mask)          # the small set to check online

# Precompute ONCE: context-independent tokens keep their state-agnostic acceptance.
ci_accept = truth[0].copy()
ci_accept[cd_mask] = False                  # value irrelevant; overwritten at runtime


def fast_mask(state: int) -> np.ndarray:
    """Reconstruct the step mask touching ONLY the context-dependent tokens."""
    m = ci_accept.copy()                    # precomputed part, zero per-token work
    m[cd_index] = truth[state, cd_index]    # live check on the minority subset only
    return m


def slow_mask(state: int) -> np.ndarray:
    """Reference: recompute validity for EVERY token (the naive cost)."""
    return truth[state].copy()


# 1. Exactness: fast reconstruction equals the full recompute for every state, bit for bit.
for s in range(n_states):
    assert np.array_equal(fast_mask(s), slow_mask(s)), s
# 2. The context-dependent set is a small minority (the property XGrammar exploits; the
#    paper measures <1% for a real Llama-3.1 JSON grammar). Detection lands inside the
#    injected budget with no false positives.
frac_cd = cd_index.size / V
assert 0 < cd_index.size <= n_cd_true, (cd_index.size, n_cd_true)
assert frac_cd < 0.01, frac_cd
assert set(cd_index).issubset(set(cd_true_tokens))
assert ci_mask.sum() + cd_mask.sum() == V   # partition is exhaustive and disjoint
# 3. Adversarial / corruption detection: a "buggy cache" that skips the runtime check
#    must diverge from truth for at least one state -> the online check is load-bearing.
def buggy_mask(state: int) -> np.ndarray:
    return ci_accept.copy()                 # forgot to overwrite cd_index

assert any(not np.array_equal(buggy_mask(s), slow_mask(s)) for s in range(n_states)), \
    "skipping the context-dependent check MUST corrupt the mask"
# 4. Equivalence under a stack-independent grammar (a flat regex): cd_index is empty and
#    fast_mask needs ZERO runtime token checks.
flat = np.tile(rng.random(V) < 0.5, (n_states, 1))
assert (flat == flat[0]).all(axis=0).all(), "flat grammar makes every token context-independent"

saved = 1.0 - cd_index.size / V
print("XGrammar split OK:", f"context-dependent={cd_index.size}/{V} ({100*frac_cd:.1f}%),",
      f"per-step work saved={100*saved:.1f}%")

How to use it¶

Use the engine's built-in backend; do not hand-roll logit masking or the automaton. Backends are interchangeable across engines: XGrammar (PDA, default in vLLM-auto and SGLang), Outlines (FSM), llguidance. The snippets below are copy-correct against current vendor docs. None of these have been hardware-tested here. Validate parse rate, per-token latency (TPOT), and output quality on your own model, schema, and traffic before rollout.

vLLM: `guided_*` params / `GuidedDecodingParams`¶

Online (OpenAI-compatible server), via extra_body:⁴

# Reference template (needs a running vLLM server + openai client). Not executed here.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# JSON schema
resp = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user", "content": "Emit a person record."}],
    extra_body={"guided_json": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    }},
)

The supported request fields are guided_choice, guided_regex, guided_json, guided_grammar (context-free grammar), and structural_tag.⁴ Offline, configure GuidedDecodingParams inside SamplingParams:

# Reference template (needs vllm installed + a GPU). Not executed here.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="<model>")
sp = SamplingParams(
    guided_decoding=GuidedDecodingParams(
        choice=["positive", "negative", "neutral"]
    )
)
out = llm.generate("Classify: the food was cold.", sp)

GuidedDecodingParams accepts json, regex, choice, grammar, and structural_tag.⁴ Backend selection is the server flag --guided-decoding-backend; the default is auto, which picks a backend per request. Pin a backend explicitly with e.g. xgrammar, and append options after a colon. xgrammar:no-fallback forbids silent fallback to another backend on a grammar XGrammar cannot compile.⁴

SGLang: `json_schema` / `regex` / `ebnf`¶

SGLang's grammar backend defaults to XGrammar; alternatives are Outlines and llguidance, selected at launch with --grammar-backend.⁵

python3 -m sglang.launch_server --model-path <model> --grammar-backend xgrammar

# Reference template (needs a running SGLang server). Not executed here.
import requests
r = requests.post("http://localhost:30000/generate", json={
    "text": "Give me a date.",
    "sampling_params": {
        "regex": r"\d{4}-\d{2}-\d{2}",   # or "json_schema": ..., or "ebnf": ...
        "max_new_tokens": 32,
    },
})

The three constraint parameters are json_schema (typed objects), regex (fixed-format strings), and ebnf (domain grammars / code fragments JSON Schema cannot express).⁵

TensorRT-LLM: `GuidedDecodingParams`¶

Backends are xgrammar or llguidance (llguidance is PyTorch-backend only), set on the LLM:⁶

# Reference template (needs tensorrt_llm installed + an NVIDIA GPU). Not executed here.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import GuidedDecodingParams

llm = LLM("nvidia/Llama-3.1-8B-Instruct-FP8", guided_decoding_backend="xgrammar")
sp = SamplingParams(
    guided_decoding=GuidedDecodingParams(json=PersonSchema.model_json_schema())
)
llm.generate("Emit a person record.", sampling_params=sp)

GuidedDecodingParams fields are json (JSON schema), regex, grammar (EBNF), and structural_tag; the JSON schema is typically produced from a pydantic model.⁶ The Executor (C++) API requires a GuidedDecodingConfig built with encodedVocab (and optional tokenizerStr, stopTokenIds) at construction.⁶

How to integrate it¶

Whatever backend you pick, it implements one contract: at each step, look up the allowed tokens for the current automaton state, mask the logits to them, sample a legal token, and advance the state. The property that makes constrained decoding a hard guarantee (not a heuristic) is that no reachable step can append an illegal token, so any string the loop can produce is accepted by the automaton by construction.

The block below is that contract, exercised end to end on the ISO-date regex \d{4}-\d{2}-\d{2} used in the SGLang/vLLM examples above. It compiles the regex to an FSM with a precomputed allowed-token index per state (the Outlines route), then drives greedy masked decoding from 500 arbitrary (and one deliberately adversarial) logit streams and asserts every output is a valid date, cross-checked against Python's own re engine. Corruption detection and the "empty allowed-set only at the accepting state" invariant are asserted too.

# Runnable on system python3 (numpy). Core mechanism: compile-to-FSM, mask, advance;
# any string the constrained loop emits is accepted by the automaton, by construction.
import numpy as np
import re

DIGITS = tuple("0123456789")
VOCAB = DIGITS + ("-", "END")               # token alphabet
TID = {t: i for i, t in enumerate(VOCAB)}
V = len(VOCAB)


def build_date_fsm() -> list[dict[int, int]]:
    """FSM for YYYY-MM-DD. State = chars matched (0..10); 10 accepts. Compiled ONCE."""
    pattern = "dddd-dd-dd"                   # 'd' = any digit, '-' = literal dash
    T: list[dict[int, int]] = [dict() for _ in range(len(pattern) + 1)]
    for s, ch in enumerate(pattern):
        if ch == "d":
            for d in DIGITS:
                T[s][TID[d]] = s + 1
        else:
            T[s][TID["-"]] = s + 1
    T[len(pattern)][TID["END"]] = len(pattern)  # only END is legal once complete
    return T


FSM = build_date_fsm()
ACCEPT = len(FSM) - 1


def allowed_mask(state: int) -> np.ndarray:
    m = np.zeros(V, dtype=bool)
    for tok in FSM[state]:                   # O(1) avg lookup per allowed token
        m[tok] = True
    return m


def masked_argmax_decode(logits_seq):
    """Greedy constrained decode driven by an arbitrary logit stream."""
    state, out = 0, []
    for raw in logits_seq:
        mask = allowed_mask(state)
        if not mask.any():                   # accepting state: nothing left to emit
            break
        tok = int(np.where(mask, raw, -np.inf).argmax())  # always a legal token
        if VOCAB[tok] == "END":
            break
        out.append(VOCAB[tok])
        state = FSM[state][tok]              # advance the automaton
    return "".join(out), state


def fsm_accepts(s: str) -> bool:
    state = 0
    for ch in s:
        tok = TID.get(ch)
        if tok is None or tok not in FSM[state]:
            return False
        state = FSM[state][tok]
    return state == ACCEPT


CANON = re.compile(r"^\d{4}-\d{2}-\d{2}$")
rng = np.random.default_rng(7)

# 1. Fuzz: 500 arbitrary logit streams all yield a valid date, per BOTH the FSM and re.
for _ in range(500):
    seq = rng.standard_normal((12, V)) * 10.0
    s, end_state = masked_argmax_decode(seq)
    assert len(s) == 10 and fsm_accepts(s) and CANON.match(s), s
    assert end_state == ACCEPT
# 2. Adversarial: spike the '-' logit at every step so the model "wants" a dash in every
#    slot; masking must override the preference and still emit a digit where required.
adv = np.full((12, V), -1e9)
adv[:, TID["-"]] = 1e9
s2, _ = masked_argmax_decode(adv)
assert fsm_accepts(s2) and CANON.match(s2), s2
assert s2[0] in DIGITS                       # position 0 is a digit despite the dash spike
# 3. Corruption detection: off-grammar strings must be REJECTED (acceptor is not trivial).
assert not fsm_accepts("2026/07/02")         # wrong separator
assert not fsm_accepts("2026-7-02")          # month not two digits
assert not fsm_accepts("2026-07-0")          # truncated
assert not fsm_accepts("x026-07-02")         # non-digit
# 4. Well-formedness: every non-accepting state has at least one legal continuation, so an
#    empty allowed-set only ever means "done".
for st in range(len(FSM)):
    if st != ACCEPT:
        assert allowed_mask(st).any(), st

print("FSM decode OK: 500/500 fuzz outputs valid dates; adversarial ->", s2)

Because the masks are token-level and tied to the exact vocabulary, the integration is only correct against the tokenizer the automaton was compiled for. A tokenizer or model swap invalidates precomputed masks and requires a recompile. Compose carefully with speculative decoding: both live in the sampling/verification path, so masking must apply to proposed and verified tokens; confirm your engine supports the combination before assuming the speedups stack.

How to run it in production¶

Overlap the mask with the forward pass. The reason XGrammar reports near-zero end-to-end overhead is that mask computation runs on the CPU concurrently with the GPU forward.¹ A backend that does not overlap serialises the GPU behind a per-token CPU computation; watch for it as decode-step CPU stalls (CUDA streams).
Measure mask overhead, not just parse rate. Add NVTX markers around the sampling layer and watch per-token latency (TPOT). Parse rate can be 100% while TPOT quietly doubles because the constraint backend is not overlapping.
Pin the backend. auto can change which backend handles a request as schemas vary; pin xgrammar (or test each) so behaviour is reproducible, and decide whether no-fallback should turn an uncompilable grammar into a hard error rather than a silent backend swap.⁴
Warm the compile. The FSM/PDA build lands on the first request using a schema and is then cached; pre-warm hot schemas at startup so the first live request does not pay compilation latency.

How to maintain it¶

Pin and version the backend. auto can change which backend handles a request as schemas vary; pin xgrammar (or test each) so behaviour is reproducible, and decide whether no-fallback should turn an uncompilable grammar into a hard error rather than a silent backend swap.⁴
Re-validate after tokenizer or model swaps. The automaton's token-level masks are tied to the exact vocabulary; a tokenizer change invalidates precomputed masks. Recompile and re-test.
Version schemas alongside code. A schema is an interface contract with the downstream parser; treat a schema change like an API change (review, test, roll out), because it re-triggers compilation and can shift the output distribution.
Keep an eye on distribution drift. Because masking renormalises over legal tokens only, tightening a grammar can push the model into low-probability regions; re-check output quality, not just validity, whenever the grammar changes.

How to scale it¶

Maximise grammar reuse. Share a small set of compiled schemas across requests so the mask cache and compiled-automaton cache hit; a per-request unique schema pays compilation every time and defeats the cache. vLLM's xgrammar backend "performs best when grammars are reused, thanks to effective caching."⁴
Amortise the per-step mask. The context-dependent subset (the part re-checked at runtime) is tiny (XGrammar: under 1% for a Llama-3.1 JSON grammar), and the adaptive token mask cache reduces its memory footprint (the paper cites 160 MB down to 0.46 MB).¹ Across many requests sharing a schema, that cheap per-step cost is what lets constrained decoding stay near-free at scale.
Bound schema size. A large or pathological schema makes compilation latency dominate short generations; cap schema complexity, or split reasoning (free text) from a small final constrained extraction step so the constrained portion stays cheap.
Route by schema, not just by load. When running many replicas, sending requests that share a schema to the same replica keeps its compiled-automaton and mask caches hot, the same locality argument as adapter-aware routing in multi-LoRA serving.

Failure modes¶

Empty allowed-set (over-tight or contradictory grammar). A state with no legal continuation deadlocks the decoder; a well-formed automaton only reaches an empty set at the accepting state (asserted in the FSM block above). Validate that every non-accepting state has a continuation.
Structurally valid but semantically wrong output. Masking guarantees the shape, not the meaning; an over-tight grammar forces the model into low-probability regions and degrades content quality. Loosen the schema, or reason in free text first and constrain only a final extraction step.
Cache miss from per-request unique schemas. A schema that changes every request never hits the compiled-automaton or mask cache and pays compilation plus full mask cost every time; share a small set of schemas.⁴
CPU stall from a non-overlapping backend. A backend that computes the mask without overlapping the GPU forward serialises decode behind a per-token CPU computation; parse rate stays perfect while TPOT regresses. Measure TPOT, not just validity (CUDA streams).
Silent backend fallback. Under auto, a grammar XGrammar cannot compile can be handed to another backend with different behaviour; use xgrammar:no-fallback to make it a hard error where reproducibility matters.⁴
Stale masks after a tokenizer/model swap. Token-level masks are tied to the exact vocabulary; a tokenizer change silently invalidates them. Recompile and re-test after any model or tokenizer change.
Compilation-latency spike on first use. A large or pathological schema pays a heavy one-time compile on the first request that uses it; pre-warm hot schemas at startup.
Fragile composition with speculative decoding. Both occupy the sampling/verification path; combining them means masking applies to proposed and verified tokens, and the speedups do not automatically stack. Confirm engine support (speculative decoding).

References¶

Chris Fregly, AI Systems Performance Engineering (O'Reilly): the adaptive-inference chapter (Ch. 19) covers logit-level sampling, the memory-bound decode phase, and names vLLM / SGLang / TensorRT-LLM as the production engines, but does not treat constrained/structured decoding directly; the grammar/FSM mechanics and all numbers here are grounded in the primary sources below.
XGrammar: Dong et al., "XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models," arXiv:2411.15100: https://arxiv.org/abs/2411.15100
Outlines: Willard & Louf, "Efficient Guided Generation for LLMs," arXiv:2307.09702: https://arxiv.org/abs/2307.09702
vLLM, "Structured Outputs": https://docs.vllm.ai/en/latest/features/structured_outputs.html
SGLang, "Structured Outputs": https://docs.sglang.io/advanced_features/structured_outputs.html
TensorRT-LLM, "Guided Decoding": https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/guided-decoding.md

Dong et al., arXiv:2411.15100. Vocabulary split into context-independent (precheckable) vs context-dependent tokens; "context-dependent tokens account for only a minor proportion, amounting to less than 1% (1134 out of 128k)" for Llama-3.1 JSON grammar; adaptive token mask cache (memory reduced "to 0.2% (from 160 MB to 0.46 MB)"); persistent execution stack with constant-time rollback; per-token mask "in less than 40us"; "up to 3x speedup in the setting of JSON schema, and more than 100x speedup in the case of JSON grammar"; grammar compute overlapped with GPU execution for near-zero end-to-end overhead. ↩↩↩↩↩↩
Willard & Louf, arXiv:2307.09702. Regex/JSON-as-regex compiled to an FSM; per-state precomputed index of allowed tokens makes the online masking step "O(1) on average"; extends to CFG/LALR(1) for JSON, Python, SQL, etc. ↩↩
Fregly, AI Systems Performance Engineering, Ch. 19. Decode is the memory-bandwidth-bound phase; sampling operates on the per-step logit vector over the vocabulary (the layer where masking is applied); vLLM, SGLang, and NVIDIA TensorRT-LLM are the named production inference engines. The book does not describe grammar masking itself. ↩
vLLM, "Structured Outputs." Online fields guided_choice, guided_regex, guided_json, guided_grammar, structural_tag; offline GuidedDecodingParams(json|regex|choice|grammar|structural_tag) inside SamplingParams; --guided-decoding-backend default auto; xgrammar:no-fallback option; xgrammar "performs best when grammars are reused, thanks to effective caching." https://docs.vllm.ai/en/latest/features/structured_outputs.html ↩↩↩↩↩↩↩↩↩↩↩
SGLang, "Structured Outputs." Grammar backends XGrammar (default), Outlines, llguidance via --grammar-backend; constraint params json_schema, regex, ebnf. https://docs.sglang.io/advanced_features/structured_outputs.html ↩↩
TensorRT-LLM, "Guided Decoding." guided_decoding_backend in {xgrammar, llguidance} (llguidance PyTorch-backend only) set on LLM; GuidedDecodingParams fields json, regex, grammar (EBNF), structural_tag; Executor API needs GuidedDecodingConfig(encodedVocab, tokenizerStr?, stopTokenIds?). https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/guided-decoding.md ↩↩↩