
Rejection sampling and best-of-N

Scope: rejection sampling is the simplest, most reliable preference-based post-training step: generate N completions per prompt, score them with a reward model or a verifiable checker, then run SFT on the policy using only the best-scoring ones. Best-of-N is the same generate-and-score procedure applied at inference instead of training. Together they bridge SFT and online RL (GRPO) in the post-training pipeline.

The vLLM/TRL blocks on this page are reference templates written against real APIs and are labelled as such; pin versions and validate them before production use. The numpy-only blocks are runnable and self-checking (system python3 with numpy 2.4+).

What it is

Rejection sampling (RS), used as a fine-tuning method, is one of the most widely used yet least documented techniques in preference fine-tuning. It samples new candidate completions, filters them with a trained reward model, and then fine-tunes the original model only on the top completions, using the same loss function as instruction tuning (SFT). Following the notation of the RLHF book (Lambert; see References), x denotes prompts and y denotes completions; methods operate on full prompt-completion pairs.

The name comes from computational statistics: when you cannot sample a complex target distribution directly, you sample from a simpler one and use a heuristic to accept or reject each draw. For language models the target is high-quality completions to prompts, the filter is a reward model, and the sampling distribution is the current model. It is versatile: it can be applied after instruction tuning, after RL, or even after RLVR, which, combined with its underdocumented nature, makes it a hard tool to place.
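For intuition, the classical statistical procedure can be sketched in a few lines. This is a minimal illustration and not part of the RSFT pipeline: the triangular target density, the uniform proposal, and the names `target_pdf` and `rejection_sample` are assumptions chosen for the example.

```python
import random

def target_pdf(x: float) -> float:
    """Hypothetical target: triangular density f(x) = 2x on [0, 1]."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def rejection_sample(n: int, seed: int = 0) -> list[float]:
    """Draw n samples from target_pdf using a Uniform(0, 1) proposal.

    The envelope constant c = 2 satisfies target_pdf(x) <= c * 1 on [0, 1].
    """
    rng = random.Random(seed)
    c = 2.0
    samples: list[float] = []
    while len(samples) < n:
        x = rng.random()              # draw from the simple proposal
        u = rng.random()              # uniform acceptance variable
        if u * c <= target_pdf(x):    # accept with probability f(x) / (c * g(x))
            samples.append(x)
    return samples

samples = rejection_sample(20_000)
mean = sum(samples) / len(samples)    # the true mean of f(x) = 2x is 2/3
```

The analogy to RSFT is loose: the reward-based selection replaces the acceptance test, there is no envelope constant, and nothing guarantees that the kept completions follow an exact target distribution.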

It is a core component of many prominent RLHF pipelines: WebGPT, Anthropic's Helpful and Harmless assistant, OpenAI's process-reward-model work, and Llama 2 Chat; modern open recipes (Llama 3, Tülu 3) use it as a post-training stage, and later work formalises it (RAFT for reward-ranked fine-tuning, and Statistical Rejection Sampling Optimization, RSO).

Why use it

Simple: the training step is plain instruction tuning (SFT) on the kept completions; no policy-gradient loop, no critic, no per-step weight sync.
Stable and reliable: because it never runs an online RL update, it inherits SFT's stability; it is the lowest-risk way to pull capability out of a reward signal.
No RL infrastructure: generation and training are decoupled stages, not a tight rollout-update loop like GRPO; you can generate once and train offline on the result.
A versatile bridge: it sits between SFT and full RL, and can be slotted in after instruction tuning (IFT/SFT), after RL, or after RLVR. It is the natural first step before committing to an online RL run.

When to use it (and when not)

Use rejection-sampling fine-tuning (RSFT) when you have a reward model or a verifiable checker and want gains beyond what your SFT demonstrations contain, but want SFT-level simplicity and stability rather than an online RL loop. It samples like RL (generate N, score) but trains like SFT (instruction-tune on the kept set).
Do not use it when you have no way to score completions (no reward model, no checker), or when your generation budget is near zero: with no scores there is nothing to select on, and with no samples there is nothing to score.
Prefer DPO when you only have offline (chosen, rejected) preference pairs and no generation budget; DPO needs no reward model and no sampling, so it is cheaper still.
Prefer GRPO when a reward can be computed and you want to push reasoning/agentic skill as far as possible; online RL extracts more from the same reward than a single RSFT pass, at much higher engineering cost.
Rule of thumb: SFT cold-starts behaviour; RSFT is the cheapest way to use a reward; DPO aligns from pairs; GRPO is the heavyweight that squeezes the most out of a verifiable reward. RSFT is the bridge from the first to the last.

Architecture

Rejection sampling and Best-of-N share one generate-and-score core and split at the last step: RSFT feeds the selected completions back into an SFT trainer and loops, while Best-of-N serves the top-scored answer without ever touching the weights.

flowchart LR
  P["Prompt set (reuse SFT prompts)"] --> GEN["Generate N completions (vLLM)"]
  GEN --> SCORE["Score each (reward model / verifiable check)"]
  SCORE --> SEL["Select best per prompt / top-K above threshold"]
  SEL -->|"training: RSFT"| SFT["SFT on kept completions"]
  SFT -->|"next iteration"| GEN
  SEL -->|"inference: Best-of-N"| SERVE["Serve highest-scored answer (no weight update)"]

Components:

Prompt set: the inputs to sample on; the simplest choice reuses the SFT/IFT prompts (with the overfitting caveat below).
Generator: the current policy checkpoint, served through vLLM for throughput; it produces the M x N completion matrix and is the cost bottleneck.
Scorer: a scalar reward model (reward model training) or a programmatic verifiable checker, applied to every (prompt, completion) pair to fill an M x N reward matrix.
Selector: pure array logic over that matrix, best-per-prompt argmax or global top-K above a threshold, with no model calls.
Training plane (RSFT): the selected completions become an SFT dataset; instruction-tuning the checkpoint on it closes the loop for the next iteration.
Inference plane (Best-of-N): the selector's top answer is served directly, with no weight update, trading compute for quality per query.

How it works (step by step)

Rejection sampling follows four stages (book §9.1):

Prompt and reward model selection. Choose the prompts to train on; the simplest choice is to re-use every prompt from the SFT/IFT stage, though that can cause overfitting. You must already have a trained reward model (reward model training).
Generate completions. For a set of M prompts, sample N completions each from the current checkpoint, giving an M x N matrix Y. Tunable settings: sampling temperature, top-p/top-k, max sequence length, and N.
Score with the reward model. Pass every pair through the reward model r to get an M x N reward matrix R, where R[i,j] = r(y[i,j] | x[i]).
SFT on the top completions. Instruction-fine-tune the starting checkpoint on the selected completions, then optionally iterate.
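The four stages above can be sketched as one data flow with stub components. This is a toy illustration only: `toy_generate` and `toy_reward` are hypothetical stand-ins for the policy and the reward model, and the SFT step itself is not run; the sketch stops at the kept set that would feed it.

```python
import random

def toy_generate(prompt: str, n: int, rng: random.Random) -> list[str]:
    """Hypothetical stand-in for the policy: returns n candidate strings."""
    return [f"{prompt} -> draft {rng.randint(0, 999)}" for _ in range(n)]

def toy_reward(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for a reward model: scores by the trailing number."""
    return float(completion.rsplit(" ", 1)[-1])

def rsft_round(prompts: list[str], n: int, seed: int = 0):
    rng = random.Random(seed)
    # Stage 2: M x N completion matrix Y (one row per prompt).
    Y = [toy_generate(x, n, rng) for x in prompts]
    # Stage 3: M x N reward matrix R with R[i][j] = r(Y[i][j] | x[i]).
    R = [[toy_reward(x, y) for y in row] for x, row in zip(prompts, Y)]
    # Input to stage 4: the best completion per prompt.
    kept = []
    for x, ys, rs in zip(prompts, Y, R):
        j = max(range(n), key=rs.__getitem__)   # ties resolve to the first index
        kept.append({"prompt": x, "completion": ys[j], "reward": rs[j]})
    return Y, R, kept

Y, R, kept = rsft_round(["p0", "p1", "p2"], n=4)
```

Iterating means calling a function like `rsft_round` again with the newly trained checkpoint behind the generator.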

Two selection strategies (book §9.1.2):

Best per prompt: take argmax over each row of R. Keeps one completion per prompt, preserving prompt diversity.
Top-K overall: flatten R, argsort, and take the K highest pairs across the whole set, maximising overall quality (some prompts may contribute several completions, others none).

Runnable check (numpy only): the selection math both strategies rely on, validated against slow references plus the tie and boundary cases. This is the core algorithm the generate/score/select templates below depend on.

import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 8  # M prompts, N completions each

# Surrogate deterministic reward standing in for a scalar reward model r(y|x).
# We validate the SELECTION math that consumes the M x N reward matrix R.
R = rng.normal(size=(M, N))

def best_per_prompt(R):                 # argmax over each row (keeps 1 per prompt)
    return np.argmax(R, axis=1)

def best_per_prompt_ref(R):
    out = []
    for row in R:
        bj, bv = 0, row[0]
        for j, v in enumerate(row):
            if v > bv:                  # strict > => ties resolve to FIRST index
                bj, bv = j, v
        out.append(bj)
    return np.array(out)

bpp = best_per_prompt(R)
assert np.array_equal(bpp, best_per_prompt_ref(R))
assert bpp.shape == (M,)                # exactly one kept per prompt (diversity)

def top_k_overall(R, K):               # flatten, argsort, take K globally-highest
    return np.argsort(R.ravel())[::-1][:K]

def top_k_overall_ref(R, K):
    pairs = sorted(((v, i) for i, v in enumerate(R.ravel())),
                   key=lambda t: t[0], reverse=True)
    return [i for _, i in pairs[:K]]

for K in (1, 3, 7, M * N, M * N + 5):
    got, ref = top_k_overall(R, K), top_k_overall_ref(R, K)
    assert np.allclose(np.sort(R.ravel()[got]),
                       np.sort([R.ravel()[i] for i in ref]))
    assert len(got) == min(K, M * N)

# explicit claim: for a single prompt, argmax == top-K with K=1
single = R[0:1, :]
assert best_per_prompt(single)[0] == (top_k_overall(single, 1)[0] % N)

def keep_above(R, threshold):          # threshold filter on best-per-prompt
    bpp = best_per_prompt(R)
    vals = R[np.arange(R.shape[0]), bpp]
    return [(i, int(bpp[i])) for i in range(R.shape[0]) if vals[i] >= threshold]

assert len(keep_above(R, float("-inf"))) == M          # -inf keeps all
tau = float(np.median(R))
kept = keep_above(R, tau)
assert all(R[i, j] >= tau for i, j in kept)
assert len(kept) <= M

# adversarial: exact ties must resolve to the FIRST index
tied = np.array([[2.0, 2.0, 1.0], [0.5, 3.0, 3.0]])
assert np.array_equal(best_per_prompt(tied), np.array([0, 1]))
assert np.array_equal(best_per_prompt(tied), best_per_prompt_ref(tied))

print("NP-1 OK: best-per-prompt argmax == ref; top-K argsort == ref; "
      "K=1==argmax; threshold; ties->first")

Implementation details that matter (book §9.2):

Sampling parameters: use temperatures above zero, commonly 0.7-1.0, with top-p (nucleus) or top-k; the method depends entirely on the completions the model produces.
Completions per prompt: successful runs use 10 to 30 or more; too few makes training biased and/or noisy.
Heterogeneous generations: some implementations mix in completions from models other than the one being trained; best practices are not established.
Throughput: when batching reward-model inference, sort tokenised completions by length so batches are uniform and you waste fewer padding tokens.
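The length-sorting point can be made concrete by counting padding tokens with and without sorting. A minimal sketch, assuming each batch is padded to its longest member; the token counts and the helper names `padding_tokens` and `length_sorted_order` are made up for illustration.

```python
def padding_tokens(lengths: list[int], batch_size: int) -> int:
    """Padding tokens needed when each batch is padded to its longest item."""
    wasted = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        wasted += sum(max(batch) - n for n in batch)
    return wasted

def length_sorted_order(lengths: list[int]) -> list[int]:
    """Indices that visit completions from shortest to longest."""
    return sorted(range(len(lengths)), key=lengths.__getitem__)

lengths = [12, 480, 35, 510, 20, 495, 40, 470]   # hypothetical token counts
order = length_sorted_order(lengths)

unsorted_waste = padding_tokens(lengths, batch_size=4)                     # 1958
sorted_waste = padding_tokens([lengths[i] for i in order], batch_size=4)   # 138
```

Keep `order` around: scores come back in sorted order and have to be scattered back to their original (prompt, completion) positions before selection.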

How to use it

Generate with vLLM, score every pair with a reward model, keep the best per prompt, write a JSONL SFT dataset, then instruction-tune on it with TRL's SFTTrainer. The three blocks below are a reference template (needs vllm, transformers, trl, datasets, and a GPU); the selection math they feed is validated by the numpy check above, and the SFT loss they invoke is validated by the numpy check that follows.

# rsft_generate_score_select.py: vLLM generation + reward-model scoring
# Reference template. Pin versions (vllm, transformers, trl) and validate
# output shapes on your stack.
import json
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForSequenceClassification, AutoTokenizer

N = 16                                   # completions per prompt (10-30+ in practice)
prompts = [
    "Write a haiku about GPUs.",
    "Explain why rejection sampling is simpler than online RL.",
    "Give three tips for debugging an NCCL hang.",
]                                        # in practice, reuse the full SFT/IFT prompt set

# 1. Generate N completions per prompt with vLLM
policy = LLM(model="Qwen/Qwen3-8B")
sampling = SamplingParams(n=N, temperature=0.9, top_p=0.95, max_tokens=2048)
gen = policy.generate(prompts, sampling)              # one RequestOutput per prompt

# 2. Score every (prompt, completion) with a scalar reward model
rm_name = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"    # any chat RM; see reward model training
rm_tok = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_name, num_labels=1, torch_dtype=torch.bfloat16).to("cuda").eval()

@torch.no_grad()
def score(prompt: str, completion: str) -> float:
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": completion}]
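    # return_dict=True requests input_ids + attention_mask explicitly, so the code
    # does not depend on the version-specific default return type.
    enc = rm_tok.apply_chat_template(
        chat, tokenize=True, return_dict=True, return_tensors="pt").to(rm.device)
    return rm(**enc).logits[0, 0].item()              # scalar reward r(y | x)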

# 3. Keep the best completion per prompt (preserves prompt diversity).
#    Set THRESHOLD to keep only "good enough" completions; or sort globally for top-K.
#    Reference template: consumes the vLLM `gen` and `score` objects above.
THRESHOLD = float("-inf")
kept: list[dict[str, str]] = []
for req in gen:
    scored = [(c.text, score(req.prompt, c.text)) for c in req.outputs]
    text, reward = max(scored, key=lambda t: t[1])    # arg max per prompt
    if reward >= THRESHOLD:
        kept.append({"prompt": req.prompt, "completion": text})

# 4. Write the SFT dataset for this RSFT iteration
with open("rsft_data.jsonl", "w") as f:
    for row in kept:
        f.write(json.dumps(row) + "\n")

# 5. SFT the starting checkpoint on the kept set (standard instruction tuning),
#    then loop: regenerate from the new checkpoint and repeat.
#    Reference template: needs trl + datasets + a GPU.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="rsft_data.jsonl", split="train")
SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=SFTConfig(max_length=4096, packing=True, bf16=True),
    train_dataset=ds,                                 # prompt-completion rows
).train()

The RSFT training step (block 5) is the same loss as instruction tuning: a cross-entropy over the completion tokens only, with the prompt tokens masked out. The runnable check below reproduces that loss against a slow reference and pins its boundaries, so the one claim the TRL template rests on is verified even though TRL itself is not run here.

import numpy as np

def log_softmax(z, axis=-1):            # stable: no log(0)
    m = z.max(axis=axis, keepdims=True)
    return z - m - np.log(np.exp(z - m).sum(axis=axis, keepdims=True))

def masked_ce(logits, targets, loss_mask):
    # logits [T,V], targets [T], loss_mask [T] in {0,1}; train on masked tokens
    logp = log_softmax(logits, axis=-1)
    tok = -logp[np.arange(len(targets)), targets]
    denom = loss_mask.sum()
    assert denom > 0
    return float((tok * loss_mask).sum() / denom)

def masked_ce_ref(logits, targets, loss_mask):
    total, count = 0.0, 0
    for t in range(len(targets)):
        if loss_mask[t] == 0:
            continue
        z = logits[t]
        m = z.max()
        logZ = m + np.log(np.exp(z - m).sum())
        total += -(z[targets[t]] - logZ)
        count += 1
    return total / count

rng = np.random.default_rng(1)
T, V = 6, 10
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
mask = np.array([0, 0, 0, 1, 1, 1], dtype=float)   # first 3 prompt, last 3 completion

# 1. equivalence to a slow explicit reference
assert abs(masked_ce(logits, targets, mask) - masked_ce_ref(logits, targets, mask)) < 1e-12

# 2. masking the prompt changes the loss vs training on every token (adversarial)
assert not np.isclose(masked_ce(logits, targets, mask),
                      masked_ce(logits, targets, np.ones(T)))

# 3. perfect prediction => loss ~ 0 (boundary)
perfect = np.full((T, V), -1e9)
perfect[np.arange(T), targets] = 1e9
assert masked_ce(perfect, targets, mask) < 1e-6

# 4. uniform logits => loss == log(V) exactly (boundary)
assert abs(masked_ce(np.zeros((T, V)), targets, mask) - np.log(V)) < 1e-12

# 5. changing a PROMPT-token target must not move the masked loss
base = masked_ce(logits, targets, mask)
t2 = targets.copy(); t2[0] = (t2[0] + 1) % V
assert np.isclose(masked_ce(logits, t2, mask), base)

print("NP-2 OK: masked SFT cross-entropy == ref; prompt-mask changes loss; "
      "perfect->0; uniform->log(V); prompt tokens excluded")

For a verifiable reward instead of a reward model (maths, code, exact-match), swap score for a programmatic checker returning 1.0/0.0, the same shape as a GRPO reward function. Generation is the bottleneck, so back it with vLLM.
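A programmatic checker of that shape might look like the sketch below. It assumes completions end with a line of the form `Answer: <number>` and that gold answers are available per prompt; both the format and the helper name `make_exact_match_scorer` are assumptions for illustration, not something the templates above prescribe.

```python
import re

# Assumed output convention: the completion ends with "Answer: <number>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")

def make_exact_match_scorer(gold: dict[str, str]):
    """Build score(prompt, completion) -> 1.0/0.0 from a prompt -> answer map."""
    def score(prompt: str, completion: str) -> float:
        match = ANSWER_RE.search(completion.strip())
        if match is None or prompt not in gold:
            return 0.0                      # unparseable or unknown prompt: no reward
        return 1.0 if float(match.group(1)) == float(gold[prompt]) else 0.0
    return score

score = make_exact_match_scorer({"What is 6 * 7?": "42"})
```

Because the returned callable has the same `score(prompt, completion) -> float` signature as the reward-model scorer, the selection loop does not need to change.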

How to integrate with it

RSFT is a thin orchestration over interfaces you already have; the seams are what matter.

Reward interface: the scorer is any callable score(prompt, completion) -> float. A scalar chat reward model (reward model training) is the default and the single most load-bearing dependency; a weak or hackable RM propagates straight into the kept set.
Verifiable checker as a drop-in scorer: a programmatic checker that returns 1.0/0.0 has the same shape as a GRPO reward function (reward design), so one checker serves both an RSFT pass and a later online RL run without change.
Generation backend: decouple generation behind vLLM. The only artifact that crosses into training is the kept JSONL, so the generator and trainer never share process state.
Training backend: because RSFT reuses the exact SFT loss (validated above), it drops into an existing SFT pipeline unchanged; TRL's SFTTrainer is one option, any instruction-tuning stack works.
Pipeline position: RSFT is a stage in post-training. Its output checkpoint can feed another RSFT iteration or graduate to an online RL run (GRPO).
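The seams listed above can be written down as a small typed interface. A minimal sketch, assuming candidates arrive as a prompt-to-completions mapping; the `Scorer` protocol, `select_rows`, `to_jsonl`, and the two toy scorers are hypothetical names introduced for illustration.

```python
import json
from typing import Protocol

class Scorer(Protocol):
    """Anything callable as score(prompt, completion) -> float."""
    def __call__(self, prompt: str, completion: str) -> float: ...

def select_rows(candidates: dict[str, list[str]], scorer: Scorer,
                threshold: float = float("-inf")) -> list[dict[str, str]]:
    """Keep the best-scoring completion per prompt if it clears the threshold."""
    rows = []
    for prompt, completions in candidates.items():
        scored = [(scorer(prompt, c), c) for c in completions]
        reward, best = max(scored, key=lambda t: t[0])   # first max on ties
        if reward >= threshold:
            rows.append({"prompt": prompt, "completion": best})
    return rows

def to_jsonl(rows: list[dict[str, str]]) -> str:
    """Serialise kept rows; this text is the only artifact the trainer reads."""
    return "".join(json.dumps(row) + "\n" for row in rows)

# Two interchangeable toy scorers (hypothetical, for illustration only).
def brevity_scorer(prompt: str, completion: str) -> float:
    return -float(len(completion))

def keyword_checker(prompt: str, completion: str) -> float:
    return 1.0 if "42" in completion else 0.0

candidates = {"What is 6 * 7?": ["It is 42.", "Probably around forty.", "41"]}
rows_rm = select_rows(candidates, brevity_scorer)
rows_check = select_rows(candidates, keyword_checker, threshold=1.0)
jsonl = to_jsonl(rows_check)
```

The two scorers pick different completions from the same candidates, which is the point made above about the scorer being the load-bearing dependency: the selection code is identical, only the scoring callable differs.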

How to run it in production

Run the stages offline and independently. Generate once, score, select, then train; there is no tight rollout-update loop, so each stage can checkpoint and resume on its own, and a failed trainer never wastes the generation pass.
Batch the scorer with length-sorted inputs. Reward-model inference dominates the scoring cost; sort tokenised completions by length before batching so uniform batches waste fewer padding tokens (book §9.2).
Version every artifact per iteration. Persist the prompt set, the raw M x N completions, the reward matrix, and the kept JSONL as versioned artifacts (experiment tracking and model registry) so a round is reproducible and a bad one is auditable.
Gate on a held-out eval before promotion. A round can raise reward while regressing real quality; do not promote a checkpoint until it clears an eval the reward model never saw (SRE and MLOps practices).
Track KL alongside the eval. Watch the KL divergence from the base policy together with the eval metric; rising reward with rising KL and a falling eval is the over-optimisation signature (see Failure modes and the Best-of-N KL bound below).
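The eval gate and the KL check can be combined into one promotion rule. The sketch below is one possible shape under stated assumptions: the `RoundMetrics` fields, the `promotion_decision` helper, and the default thresholds (`max_kl`, `min_eval_gain`) are hypothetical and would need to be set from your own runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoundMetrics:
    mean_reward: float     # mean reward-model score of the checkpoint's samples
    heldout_eval: float    # metric the reward model never saw (higher is better)
    kl_from_base: float    # estimated KL divergence from the base policy

def promotion_decision(prev: RoundMetrics, new: RoundMetrics,
                       max_kl: float = 10.0, min_eval_gain: float = 0.0) -> str:
    """Return 'promote' or a rejection reason for the new checkpoint."""
    if new.kl_from_base > max_kl:
        return "reject: KL budget exceeded"
    if new.heldout_eval - prev.heldout_eval < min_eval_gain:
        reward_up = new.mean_reward > prev.mean_reward
        kl_up = new.kl_from_base > prev.kl_from_base
        if reward_up and kl_up:
            return "reject: over-optimisation signature"
        return "reject: held-out eval did not improve enough"
    return "promote"

prev = RoundMetrics(mean_reward=1.0, heldout_eval=0.60, kl_from_base=2.0)
good = RoundMetrics(mean_reward=1.4, heldout_eval=0.63, kl_from_base=3.0)
hacked = RoundMetrics(mean_reward=2.5, heldout_eval=0.55, kl_from_base=6.0)
far = RoundMetrics(mean_reward=3.0, heldout_eval=0.70, kl_from_base=12.0)
```

Returning a reason string rather than a boolean keeps the decision auditable alongside the versioned artifacts of the round.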

How to maintain it

Treat every hyperparameter as empirical. N, temperature, top-p, the keep threshold, and the SFT-step epochs are not well documented in the literature; ablate them rather than trusting defaults.
Refresh the prompt pool between iterations. Re-using the full SFT prompt set every round overfits the original instruction distribution; rotate or sub-sample prompts.
Re-validate the reward model. A stale or drifting RM silently degrades the kept set; periodically re-check it against fresh judgements (reward model training) before starting a new round.
Keep rollback points. Version datasets and checkpoints per iteration so a round that regresses can be reverted without re-deriving the earlier state.
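Prompt rotation is easy to make reproducible by deriving the sampling seed from the iteration number. A minimal sketch; the helper name `sample_prompt_pool`, the 50% fraction, and the seed scheme are assumptions, not a recommendation from the literature.

```python
import random

def sample_prompt_pool(prompts: list[str], iteration: int,
                       fraction: float = 0.5, base_seed: int = 0) -> list[str]:
    """Deterministic sub-sample of the prompt pool for one RSFT iteration."""
    rng = random.Random(base_seed * 1_000_003 + iteration)  # one seed per round
    k = max(1, round(len(prompts) * fraction))
    return rng.sample(prompts, k)

prompts = [f"prompt-{i}" for i in range(100)]
round0 = sample_prompt_pool(prompts, iteration=0)
round0_again = sample_prompt_pool(prompts, iteration=0)
round1 = sample_prompt_pool(prompts, iteration=1)
```

Because the sample depends only on the pool, the iteration number, and the base seed, a round can be re-derived exactly when auditing or rolling back.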

How to scale it

Scale generation first. Sampling N completions per prompt makes generation the cost bottleneck; raise throughput with vLLM continuous batching (continuous batching internals) and tensor/data-parallel serving (inference serving) before anything else.
Scale breadth before depth. Adding prompts tends to generalise better than raising N; successful runs typically use 10 to 30 or more completions per prompt, and pushing N much higher mostly buys over-optimised, higher-KL gains.
Parallelise scoring. Reward-model scoring is embarrassingly parallel over (prompt, completion) pairs; shard it across GPUs independently of generation.
Know when to graduate. More RSFT iterations compound gains but drift further in KL; when a single offline pass stops paying off, move to online RL (GRPO) for more headroom at higher engineering cost.
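Sharding the scoring stage amounts to splitting the M x N cells across workers and merging the results back into the reward matrix. The sketch below simulates the workers in-process; `shard_pairs`, `merge_scores`, and `toy_score` are hypothetical names, and a real deployment would dispatch each shard to a separate GPU worker.

```python
def shard_pairs(m: int, n: int, num_workers: int) -> list[list[tuple[int, int]]]:
    """Round-robin assignment of the M x N (i, j) cells to workers."""
    shards: list[list[tuple[int, int]]] = [[] for _ in range(num_workers)]
    for flat in range(m * n):
        shards[flat % num_workers].append(divmod(flat, n))   # (row i, column j)
    return shards

def merge_scores(m: int, n: int, shard_results) -> list[list[float]]:
    """Scatter per-shard ((i, j), reward) results back into an M x N matrix."""
    R = [[0.0] * n for _ in range(m)]
    for results in shard_results:
        for (i, j), reward in results:
            R[i][j] = reward
    return R

def toy_score(i: int, j: int) -> float:
    """Stand-in for reward-model inference on completion j of prompt i."""
    return float(10 * i + j)

M, N, WORKERS = 3, 4, 5
shards = shard_pairs(M, N, WORKERS)
# Each inner list would be computed by a different worker in practice.
shard_results = [[(cell, toy_score(*cell)) for cell in shard] for shard in shards]
R = merge_scores(M, N, shard_results)
```

Round-robin keeps shard sizes within one cell of each other; combining it with length-sorted batching inside each worker is a separate choice.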

Best-of-N at inference

Best-of-N (BoN) is the close relative: it follows the same generate-and-score procedure but does not fine-tune the model. Instead it selects the best completion for a prompt at inference time, which is one way a "Pro" tier can spend extra compute per query. It is a pure sampling technique: it does not modify the underlying model, so comparisons against online methods such as PPO remain valid (you can still measure the KL divergence of the BoN policy from any reference policy). For a single prompt, picking the argmax and taking top-K with K=1 are the same selection.

# Reference template: needs the vLLM `policy` and the reward-model `score` above.
def best_of_n(prompt: str, n: int = 16) -> str:
    sampling = SamplingParams(n=n, temperature=0.9, top_p=0.95, max_tokens=2048)
    req = policy.generate([prompt], sampling)[0]
    return max(req.outputs, key=lambda c: score(prompt, c.text)).text

BoN trades inference compute (N sampled completions plus N reward-model scores per query) for answer quality, with no weight update, which is why it is a common baseline when reporting RLHF results.
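The compute-for-quality trade has a simple closed form in the idealised verifiable case. The sketch below assumes a perfect checker and independent samples with per-sample pass rate p, so that at least one of N samples passes with probability 1 - (1 - p)^N; with a learned reward model the selector can pick a wrong answer, so treat this as an upper-end illustration. The helper names are hypothetical.

```python
def bon_success_probability(p: float, n: int) -> float:
    """P(at least one of n independent samples passes), per-sample pass rate p."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n

def samples_needed(p: float, target: float) -> int:
    """Smallest n whose Best-of-N success probability reaches target."""
    if not 0.0 < p <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < p <= 1 and 0 <= target < 1")
    n = 1
    while bon_success_probability(p, n) < target:
        n += 1
    return n

# A policy that passes 20% of the time needs 11 samples to reach 90%.
n_for_90 = samples_needed(0.2, 0.9)
```

Returns diminish quickly: each extra sample multiplies the remaining failure probability by (1 - p), while the cost grows linearly in N.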

How far BoN drifts from the base policy is quantifiable. For a continuous reward with no ties, the KL divergence of the BoN policy from the base policy has the classical closed form log N - (N-1)/N; Beirami et al. (2024) show this is in general an upper bound on the true KL, tight in that idealised limit and looser for discrete outputs, where individual completions carry non-negligible probability mass or rewards tie. KL grows with N, which is exactly why very large N over-optimises (see Failure modes). The runnable check below validates the selection (argmax over the scored completions) and the KL bound against an exact discrete reference and a Monte-Carlo estimate.
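The closed form above follows from a short calculation under the idealised assumptions (continuous reward, no ties). A sketch of the derivation, where u denotes the reward quantile of a completion under the base policy; this symbol is introduced here for the derivation and is not part of the book's notation.

```latex
% u = F(r(y)) is Uniform(0,1) under \pi_{\mathrm{ref}};
% the maximum of N i.i.d. draws has density N u^{N-1} on [0, 1].
\begin{aligned}
\mathrm{KL}\bigl(\pi_{\mathrm{BoN}} \,\|\, \pi_{\mathrm{ref}}\bigr)
  &= \int_0^1 N u^{N-1} \log\!\bigl(N u^{N-1}\bigr)\, du \\
  &= \log N + (N-1) \int_0^1 N u^{N-1} \log u \, du \\
  &= \log N + (N-1)\left(-\frac{1}{N}\right)
   = \log N - \frac{N-1}{N}.
\end{aligned}
```

The middle integral is evaluated by parts: the boundary term of u^N log u vanishes at both ends, leaving minus the integral of u^{N-1}, which is -1/N.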

import numpy as np

# ---- best-of-n SELECTION (core of the best_of_n reference template) ----
def best_of_n_idx(scores):
    return int(np.argmax(scores))            # ties -> first index (numpy)

def best_of_n_idx_ref(scores):
    bi, bv = 0, scores[0]
    for i, v in enumerate(scores):
        if v > bv:
            bi, bv = i, v
    return bi

rng = np.random.default_rng(2)
for _ in range(1000):
    s = rng.normal(size=int(rng.integers(1, 12)))
    assert best_of_n_idx(s) == best_of_n_idx_ref(s)
assert best_of_n_idx(np.array([3.14])) == 0            # n=1 identity
assert best_of_n_idx(np.array([1.0, 1.0, 1.0])) == 0   # all-tie -> first

# ---- KL(pi_bon || pi_ref): classical estimate log n - (n-1)/n ----
# Beirami et al. 2024 (2401.01879): this is an UPPER BOUND on the true KL,
# exact only in the continuous-reward, no-ties limit.
def kl_bound(n):
    return np.log(n) - (n - 1) / n

# Exact best-of-n distribution over a finite support (ref probs p, rewards r;
# ties allowed). Within a tied reward group the selected mass splits prop. to p.
def bon_distribution(p, r, n):
    p, r = np.asarray(p, float), np.asarray(r, float)
    order = np.argsort(r, kind="stable")
    q = np.zeros_like(p)
    below, i = 0.0, 0
    while i < len(order):
        j = i
        while j < len(order) and r[order[j]] == r[order[i]]:
            j += 1
        grp = order[i:j]
        Pg = p[grp].sum()
        mass = (below + Pg) ** n - below ** n          # P(top reward in group)
        if Pg > 0:
            q[grp] = (p[grp] / Pg) * mass
        below += Pg
        i = j
    return q

def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

# (a) distinct rewards on a fine grid -> KL approaches the bound from below
for n in (2, 4, 8, 16):
    K = 4000
    p = np.full(K, 1.0 / K)
    r = np.linspace(0, 1, K)                           # all distinct
    q = bon_distribution(p, r, n)
    assert abs(q.sum() - 1.0) < 1e-9
    kln = kl(q, p)
    assert kln <= kl_bound(n) + 1e-9                   # it is an upper bound
    assert kl_bound(n) - kln < 5e-3                    # tight in continuous limit

# (b) ties make the bound strict (KL strictly below log n - (n-1)/n)
p = np.array([0.25, 0.25, 0.25, 0.25])
q_alltie = bon_distribution(p, np.ones(4), 8)
assert np.allclose(q_alltie, p)                        # no selection pressure
assert kl(q_alltie, p) < kl_bound(8) - 0.1            # strictly below bound
q_pairs = bon_distribution(p, np.array([0.0, 0.0, 1.0, 1.0]), 8)
assert abs(q_pairs.sum() - 1.0) < 1e-9
assert kl(q_pairs, p) < kl_bound(8)                   # still strictly below

# (c) Monte-Carlo cross-check of the continuous idealization == bound
for n in (2, 5, 10):
    umax = rng.random(size=(400_000, n)).max(axis=1)
    kl_mc = np.log(n) + (n - 1) * np.log(umax).mean()  # E[log(n u^{n-1})]
    assert abs(kl_mc - kl_bound(n)) < 5e-3

# (d) boundaries: n=1 => no update; KL monotonically non-decreasing in n
p, r = np.array([0.2, 0.3, 0.5]), np.array([0.1, 0.7, 0.4])
assert np.allclose(bon_distribution(p, r, 1), p)       # n=1 identity
assert kl(bon_distribution(p, r, 1), p) < 1e-12        # n=1 => KL 0 (float)
kls = [kl(bon_distribution(p, r, n), p) for n in range(1, 9)]
assert all(b >= a - 1e-12 for a, b in zip(kls, kls[1:]))

print("NP-3 OK: best-of-n argmax == ref; KL <= log n-(n-1)/n "
      "(tight continuous, strict with ties); n=1 no-op; KL monotone in n")

Failure modes

Too few completions per prompt leads to biased and/or noisy training; use 10-30+ per prompt.
Re-using all SFT prompts overfits the original instruction set; vary or sub-sample the prompt pool.
Weak or hackable reward model: the kept set inherits the RM's blind spots, so the model learns to satisfy the proxy rather than the goal. The reward model heavily determines the result. Design and validate it first (reward model training).
Over-optimising at high N: very large N (in RSFT or BoN) pushes outputs toward whatever the RM scores highest and drifts further in KL from the base policy; watch the KL and a held-out eval.
Padding waste in scoring: unsorted reward-model batches burn compute on padding tokens; sort completions by length before batching.
Treating configs as settled: the instruction-tuning details for the SFT step are not well documented in the literature; treat hyperparameters as empirical and ablate them.
No eval gate: a run that improved reward but regressed real quality reaches production; gate on a held-out eval (SRE and MLOps practices).

References

The RLHF book (Lambert), ch. 9 Rejection Sampling: https://rlhfbook.com
Llama 2 (rejection sampling in RLHF): https://arxiv.org/abs/2307.09288
Tülu 3 (open post-training recipe): https://arxiv.org/abs/2411.15124
WebGPT (best-of-n / rejection sampling): https://arxiv.org/abs/2112.09332
RAFT (reward-ranked fine-tuning, Dong et al. 2023): https://arxiv.org/abs/2304.06767
RSO (Statistical Rejection Sampling Improves Preference Optimization, Liu et al. 2023): https://arxiv.org/abs/2309.06657
Best-of-N KL bound (Theoretical guarantees on the best-of-n alignment policy, Beirami et al. 2024): https://arxiv.org/abs/2401.01879
vLLM docs (generation engine): https://docs.vllm.ai/en/latest/