SFT & LoRA/QLoRA

Scope: supervised fine-tuning on demonstrations, and the parameter-efficient adapters, LoRA (low-rank) and QLoRA (4-bit base), that make it fit large models on little memory. The cheapest, first stage of post-training (fine-tuning and post-training); the foundation the preference (DPO) and RL (GRPO) stages build on.

Reference templates run against real APIs (TRL, PEFT, vLLM). Pin versions and validate before production use. The numpy blocks below depend only on numpy and are self-checking; they validate the core math each template teaches.

What it is

SFT (supervised fine-tuning) continues training a pretrained model on curated (input, output) demonstrations with the standard next-token loss. It sets format and behaviour and cold-starts a model before any alignment or RL.
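As a rough illustration of "the standard next-token loss", the numpy sketch below computes shifted cross-entropy over one sequence. The `loss_mask` argument, which restricts the loss to completion tokens, is an assumption added for illustration (the text above only specifies next-token loss); the function name, shapes, and toy data are likewise hypothetical and not taken from any trainer.

```python
import numpy as np

def sft_loss(logits, token_ids, loss_mask):
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits:    (T, V) scores; row t is the prediction made after reading token t.
    token_ids: (T,)   the training sequence (prompt followed by completion).
    loss_mask: (T,)   1 where the token should be learned, 0 elsewhere (assumption).
    """
    # Shift by one: logits at position t are scored against token t+1.
    pred, target = logits[:-1], token_ids[1:]
    mask = np.asarray(loss_mask[1:], dtype=np.float64)
    # Numerically stable log-softmax over the vocabulary.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]
    return float((nll * mask).sum() / max(mask.sum(), 1.0))

rng = np.random.default_rng(0)
T, V = 6, 11
tokens = rng.integers(0, V, size=T)
mask = np.array([0, 0, 0, 1, 1, 1])          # hypothetical: first 3 tokens are the prompt

# Uniform logits cost log(V) per scored token.
uniform = sft_loss(np.zeros((T, V)), tokens, mask)
assert abs(uniform - np.log(V)) < 1e-12

# Logits that put nearly all mass on the true next token drive the loss to ~0.
peaked = np.zeros((T, V))
peaked[np.arange(T - 1), tokens[1:]] = 50.0
assert sft_loss(peaked, tokens, mask) < 1e-9
```

The shift is the part most often misread: the model's output at position t is graded against the token at t+1, so a sequence of T tokens yields T-1 scored positions.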

LoRA (Low-Rank Adaptation) freezes the base weights and trains small low-rank matrices A and B injected into chosen linear layers (typically the attention projections). The effective weight becomes W + (alpha/r)·B·A; only the adapter (megabytes, against gigabytes for the base) is trained and saved, updating on the order of 100x fewer parameters than full fine-tuning.

QLoRA goes further: it quantises the frozen base to 4-bit (NF4) and trains LoRA adapters on top in BF16, fine-tuning a model whose full weights would not otherwise fit (quantization for inference covers NF4 and the other low-bit formats). Introduced in the LoRA and QLoRA papers (References).

Why use it

Cheapest adaptation: pure forward/backward, no rollouts, no reward model; the lightest post-training stage.
Fits big models on little memory: LoRA trains a fraction of the params; QLoRA's 4-bit base lets a single GPU/node tune models that full fine-tuning could not load.
Cheap to store and swap: adapters are small; many task adapters share one base and can be served together (multi-LoRA, below).
Composable: LoRA plugs into DPO and GRPO to cut their memory too.

When to use it (and when not)

Use SFT(+LoRA) as the default first step for any task adaptation: instruction-following, domain style, format, tool syntax, and as the cold-start before DPO/GRPO (fine-tuning and post-training).
Use QLoRA when the model plus optimiser will not fit in memory otherwise; accept a small speed/quality cost from 4-bit.
Prefer full-parameter SFT when you have the GPUs and need maximum quality, or are changing the model substantially (FSDP, DeepSpeed and ZeRO).
Go beyond SFT to DPO for preference alignment, GRPO for verifiable-reward reasoning.
Do not reach for LoRA when the rank you would need approaches the layer width. For a square d×d layer the adapter matches the full layer's parameter count at r = d/2 and exceeds it beyond that (validated in "How to scale it"), so full-parameter SFT is the better trade.
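The d/2 threshold in the last bullet is the square-layer case of a general break-even: the adapter costs r·(d_in + d_out) parameters and the full layer costs d_in·d_out. A minimal sketch of that arithmetic follows; the function names and the rectangular 4096 by 1024 shape are hypothetical examples, not values from this page.

```python
def break_even_rank(d_in, d_out):
    """Rank at which r*(d_in+d_out) equals d_in*d_out (may be fractional)."""
    return d_in * d_out / (d_in + d_out)

def lora_is_smaller(d_in, d_out, r):
    """True while the adapter has strictly fewer parameters than the full layer."""
    return r * (d_in + d_out) < d_in * d_out

# Square layer: the break-even is d/2.
assert break_even_rank(4096, 4096) == 2048.0
assert lora_is_smaller(4096, 4096, 2047)
assert not lora_is_smaller(4096, 4096, 2048)          # equal cost, no saving

# Hypothetical rectangular projection: the break-even sits below min(d_in, d_out).
assert break_even_rank(4096, 1024) == 819.2
assert lora_is_smaller(4096, 1024, 819) and not lora_is_smaller(4096, 1024, 820)
```

In practice ranks of 8 to 64 sit far below either threshold, so this guard only matters when a task seems to demand ranks in the hundreds or thousands.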

Architecture

flowchart LR
  X["Input tokens"] --> BASE["Frozen base (4-bit NF4 for QLoRA)"]
  BASE --> PROJ["Attention proj: q / k / v / o"]
  PROJ --> ADD(("+"))
  X --> LA["LoRA A: down-project (r, d_in)"]
  LA --> LB["LoRA B: up-project (d_out, r), BF16"]
  LB -->|"scale alpha/r"| ADD
  ADD --> OUT["Output / loss"]
  OUT -.->|"grad: adapter A,B only"| LB

The base path and the adapter path both read the input and sum: y = W·x + (alpha/r)·B·(A·x). Only A and B receive gradients. At serving time the two paths can be folded into one weight, W + (alpha/r)·B·A, which is exactly equivalent (proven in "How to run it in production").
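To make "only A and B receive gradients" concrete, the sketch below derives the adapter gradients by hand for a squared-error loss and checks them against finite differences. The loss, shapes, and function names are illustrative assumptions; real training uses autograd and a cross-entropy loss. It also shows a consequence of the B = 0 initialisation: on the first step A receives a zero gradient and only B moves.

```python
import numpy as np

def lora_forward(x, W, A, B, scale):
    return x @ W.T + scale * (x @ A.T) @ B.T

def adapter_grads(x, W, A, B, scale, target):
    """Gradients of L = 0.5 * sum((y - target)**2) with respect to A and B only."""
    G = lora_forward(x, W, A, B, scale) - target       # dL/dy, shape (n, d_out)
    h = x @ A.T                                        # down-projected input, (n, r)
    grad_B = scale * G.T @ h                           # (d_out, r)
    grad_A = scale * (G @ B).T @ x                     # (r, d_in)
    return grad_A, grad_B

rng = np.random.default_rng(0)
n, d_in, d_out, r, scale = 8, 12, 10, 3, 2.0
x, target = rng.standard_normal((n, d_in)), rng.standard_normal((n, d_out))
W, A = rng.standard_normal((d_out, d_in)), rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

def loss(A_, B_):
    return 0.5 * np.sum((lora_forward(x, W, A_, B_, scale) - target) ** 2)

# Check one entry of each analytic gradient against a central finite difference.
gA, gB = adapter_grads(x, W, A, B, scale, target)
eps = 1e-4
dA = np.zeros_like(A); dA[1, 2] = eps
dB = np.zeros_like(B); dB[4, 0] = eps
fd_A = (loss(A + dA, B) - loss(A - dA, B)) / (2 * eps)
fd_B = (loss(A, B + dB) - loss(A, B - dB)) / (2 * eps)
assert np.isclose(fd_A, gA[1, 2], rtol=1e-5, atol=1e-6)
assert np.isclose(fd_B, gB[4, 0], rtol=1e-5, atol=1e-6)

# At the B = 0 initialisation, A gets no gradient on the first step; B does.
gA0, gB0 = adapter_grads(x, W, A, np.zeros_like(B), scale, target)
assert np.all(gA0 == 0.0) and np.abs(gB0).max() > 0.0
```

W never appears on the left of an update here, which is the whole memory argument: no gradient buffer and no optimiser state are needed for the base.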

How to use it

TRL's SFTTrainer plus a PEFT LoraConfig is the minimal path; target_modules picks which linear layers get adapters.

# REFERENCE TEMPLATE (needs trl, peft, datasets, torch, a GPU) - not run here.
# Verify field names on the installed versions; pin them before production.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=SFTConfig(max_length=4096, packing=True, bf16=True,
                   gradient_checkpointing=True),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()

accelerate launch sft.py

For QLoRA on a single GPU, wrap the base in a 4-bit BitsAndBytesConfig and pass the same LoraConfig:

# REFERENCE TEMPLATE (needs transformers, bitsandbytes, peft, trl, datasets, torch) - not run here.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 storage for the frozen base; matmuls run in BF16.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", quantization_config=bnb)
ds = load_dataset("trl-lib/Capybara", split="train")  # same dataset as the LoRA template
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(gradient_checkpointing=True, bf16=True),
    train_dataset=ds,
    peft_config=LoraConfig(r=16, lora_alpha=32,
                           target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]),
)
trainer.train()

What the template actually does to each targeted linear layer is inject the B·A branch and scale it by alpha/r. The block below implements that forward math in numpy and proves the properties the trainer relies on: the zero-initialised adapter is a no-op at step 0, alpha scales the contribution linearly, the delta is rank-bounded by r, and a dimension-mismatched adapter is rejected rather than silently reshaped.

import numpy as np

def lora_delta(A, B, alpha, r, d_out=None, d_in=None):
    """LoRA weight delta = (alpha/r) * B @ A, shape (d_out, d_in)."""
    assert A.shape[0] == r and B.shape[1] == r, "A is (r,d_in), B is (d_out,r)"
    if d_in is not None:
        assert A.shape[1] == d_in, "A cols must equal base d_in"
    if d_out is not None:
        assert B.shape[0] == d_out, "B rows must equal base d_out"
    return (alpha / r) * (B @ A)

def forward_base_plus_adapter(x, W, A, B, alpha, r):
    """Frozen base plus adapter branch: two matmuls (training / multi-LoRA path)."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r, n = 32, 24, 4, 10
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))       # down-projection
B = rng.standard_normal((d_out, r))      # up-projection
x = rng.standard_normal((n, d_in))
alpha = 8.0

y = forward_base_plus_adapter(x, W, A, B, alpha, r)

# Zero-init adapter (B=0, the PEFT default) is a strict no-op: training starts from base.
y0 = forward_base_plus_adapter(x, W, A, np.zeros_like(B), alpha, r)
assert np.allclose(y0, x @ W.T, atol=1e-12), "B=0 adapter must not change base output"

# alpha/r is the scaling factor: doubling alpha doubles the adapter contribution exactly.
contrib = y - x @ W.T
y_2a = forward_base_plus_adapter(x, W, A, B, 2 * alpha, r)
assert np.allclose(y_2a - x @ W.T, 2 * contrib, atol=1e-10), "alpha scales linearly"

# The delta injects at most rank r; generic A,B give exactly rank r.
delta = lora_delta(A, B, alpha, r, d_out=d_out, d_in=d_in)
assert np.linalg.matrix_rank(delta) == r, "LoRA delta rank equals r for generic A,B"

# Adversarial: an adapter whose d_in disagrees with the base must raise, not reshape.
bad = False
try:
    lora_delta(rng.standard_normal((r, d_in + 1)), B, alpha, r, d_out=d_out, d_in=d_in)
except (AssertionError, ValueError):
    bad = True
assert bad, "adapter/base dimension mismatch must raise, never silently reshape"

print("OK: B=0 no-op; alpha linear; rank(delta)==r; dim guard fires")

Run output:

OK: B=0 no-op; alpha linear; rank(delta)==r; dim guard fires

How to integrate with it

Config-first frameworks (Axolotl, LLaMA-Factory) make the same run a YAML, which is the common dev surface for sweeps:

# LLaMA-Factory / Axolotl-style SFT + QLoRA sketch. Key names differ between
# frameworks and versions; check each key against the pinned framework's schema.
base_model: Qwen/Qwen3-8B
adapter: qlora
load_in_4bit: true            # NF4 4-bit base (QLoRA)
lora_r: 16
lora_alpha: 32
lora_target: [q_proj, k_proj, v_proj, o_proj]
sequence_len: 4096
sample_packing: true          # pack short samples to fill the context
gradient_checkpointing: true
bf16: true
datasets: [{ path: ./sft_data.jsonl, type: chat }]

lora_r / lora_alpha: rank and scaling; alpha = 2·r is a common heuristic. Too low a rank under-fits.
target_modules: attention projections are the safe default; adding the MLP (gate_proj / up_proj / down_proj) raises capacity and cost. Wrong or empty targets mean the adapter learns almost nothing.
Sequence packing: concatenates short samples to fill the context window, a large throughput win for SFT.

Sequence packing is the one integration knob most likely to silently corrupt data if implemented wrong (tokens dropped at block boundaries, or duplicated). The block below implements greedy packing and proves it conserves every token in order, confines padding to under one block, beats per-sample padding on utilisation, and handles a sample longer than the block plus an empty input.

import numpy as np

def pad_batch(samples, max_len, pad_id=0):
    """Naive SFT batching: one sample per row, right-pad to max_len."""
    out = np.full((len(samples), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(samples):
        s = s[:max_len]
        out[i, : len(s)] = s
    return out

def pack_sequences(samples, block_len, pad_id=0):
    """Greedy packing: concatenate samples into fixed-length blocks.
    Returns (blocks, real_token_count). Tokens are never dropped or duplicated."""
    assert block_len > 0
    flat = (np.concatenate([np.asarray(s, dtype=np.int64) for s in samples])
            if samples else np.zeros(0, np.int64))
    real = int(flat.size)
    n_blocks = int(np.ceil(real / block_len)) if real else 0
    blocks = np.full((n_blocks, block_len), pad_id, dtype=np.int64)
    if real:
        blocks.reshape(-1)[:real] = flat          # fill row-major, pad the tail only
    return blocks, real

rng = np.random.default_rng(2)
samples = [rng.integers(1, 1000, size=int(k)).tolist()   # ids >=1 so pad_id 0 is distinct
           for k in rng.integers(5, 40, size=50)]
block_len = 128
total_real = sum(len(s) for s in samples)

blocks, real = pack_sequences(samples, block_len)

# Token conservation: every real token survives exactly once, in order.
assert real == total_real, "packing must not lose or add tokens"
recovered = blocks.reshape(-1)[:real]
assert np.array_equal(recovered, np.concatenate([np.asarray(s) for s in samples]))

# Padding is confined to at most the final block's tail.
pad_tokens = int((blocks == 0).sum())
assert pad_tokens == blocks.size - real and pad_tokens < block_len

# Throughput win vs per-sample padding on the SAME data.
naive = pad_batch(samples, max_len=max(len(s) for s in samples))
packed_util = real / blocks.size
naive_util = real / naive.size
assert packed_util > naive_util and packed_util > 0.99

# Adversarial: a sample longer than the block spans blocks losslessly.
b2, r2 = pack_sequences([list(range(1, 300))], block_len)
assert r2 == 299 and np.array_equal(b2.reshape(-1)[:299], np.arange(1, 300))
assert b2.shape[0] == int(np.ceil(299 / 128)) == 3

# Boundary: empty input yields zero blocks, not a crash.
b0, r0 = pack_sequences([], block_len)
assert r0 == 0 and b0.shape == (0, block_len)

print(f"OK: conserved {real} tokens; util packed={packed_util:.3f} vs naive={naive_util:.3f}; "
      f"long+empty handled")

Run output:

OK: conserved 1149 tokens; util packed=0.997 vs naive=0.589; long+empty handled
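On the target_modules knob listed earlier in this section, the capacity and cost of adding the MLP projections can be estimated before launching a run. The sketch below counts trainable adapter parameters for a set of targets. The per-layer shapes and the 36-layer depth are hypothetical placeholders (read the real values from the model config), and raising on an unknown module name is an illustrative choice rather than a description of any framework's behaviour.

```python
# Hypothetical per-layer (d_in, d_out) shapes for one decoder block.
SHAPES = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 12288), "up_proj": (4096, 12288),
    "down_proj": (12288, 4096),
}

def adapter_params(target_modules, r, n_layers, shapes=SHAPES):
    """Trainable LoRA parameters: r * (d_in + d_out) per targeted layer."""
    missing = [m for m in target_modules if m not in shapes]
    if missing:
        raise KeyError(f"target_modules not found in model: {missing}")
    return n_layers * sum(r * (shapes[m][0] + shapes[m][1]) for m in target_modules)

attn = ["q_proj", "k_proj", "v_proj", "o_proj"]
mlp = ["gate_proj", "up_proj", "down_proj"]
n_attn = adapter_params(attn, r=16, n_layers=36)
n_all = adapter_params(attn + mlp, r=16, n_layers=36)
print(f"attention only: {n_attn:,}  attention+MLP: {n_all:,}  ratio: {n_all / n_attn:.2f}x")

# An empty target list trains nothing at all.
assert adapter_params([], r=16, n_layers=36) == 0
```

Under these assumed shapes, adding the MLP roughly triples the adapter size, which is the "raises capacity and cost" trade in numbers.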

How to run it in production

The adapter must be reunited with the base to serve. Two options:

Merge the LoRA weights into the base (merge_and_unload()) and serve the merged checkpoint as a normal model on the serving stack. Merging into a 4-bit (QLoRA) base needs care: merge into the dequantised/BF16 base, not the quantised one.
Multi-LoRA serving: keep the base loaded once and hot-swap many adapters per request. vLLM supports this with --enable-lora and --lora-modules name=path; one GPU then serves many task adapters cheaply (multi-LoRA serving).

# REFERENCE TEMPLATE (needs peft, transformers, torch) - not run here.
from peft import AutoPeftModelForCausalLM
m = AutoPeftModelForCausalLM.from_pretrained("./out/checkpoint")  # base + adapter
m.merge_and_unload().save_pretrained("./merged")                 # one standalone model

# REFERENCE TEMPLATE (needs a built vLLM with LoRA support). Pin the vLLM version.
vllm serve Qwen/Qwen3-8B --enable-lora \
  --lora-modules sql=./adapters/sql chat=./adapters/chat \
  --max-lora-rank 16        # request a LoRA by its name at inference time

The correctness guarantee behind both paths is that merging is exactly equivalent to keeping the adapter separate: a single matmul against W + (alpha/r)·B·A produces the same output as the two-matmul multi-LoRA path, to floating-point tolerance. The block below proves that equivalence and the rank bound, and cross-checks the vectorised merge against an explicit per-row reference.

import numpy as np

def lora_delta(A, B, alpha, r):
    return (alpha / r) * (B @ A)

def forward_split(x, W, A, B, alpha, r):        # multi-LoRA serving path
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def forward_merged(x, W, A, B, alpha, r):       # merge_and_unload path
    W_eff = W + lora_delta(A, B, alpha, r)
    return x @ W_eff.T

rng = np.random.default_rng(0)
d_in, d_out, r, n = 32, 24, 4, 10
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))
x = rng.standard_normal((n, d_in))
alpha = 8.0

y_split = forward_split(x, W, A, B, alpha, r)
y_merged = forward_merged(x, W, A, B, alpha, r)

# Merge equivalence: serving with an adapter == serving the merged checkpoint.
assert np.allclose(y_split, y_merged, atol=1e-10), "merge must equal base+adapter"

# The merged delta added at most rank r of new directions.
assert np.linalg.matrix_rank(lora_delta(A, B, alpha, r)) <= r

# Adversarial equivalence-to-reference: an explicit loop over rows must match the vectorised merge.
W_eff = W + lora_delta(A, B, alpha, r)
ref = np.stack([W_eff @ x[i] for i in range(n)])
assert np.allclose(y_merged, ref, atol=1e-10), "vectorised merge == per-row reference"

print(f"OK: merge==split (max diff {np.max(np.abs(y_split - y_merged)):.1e}); "
      f"rank(delta)={np.linalg.matrix_rank(lora_delta(A,B,alpha,r))}<=r={r}; loop-ref matches")

Run output:

OK: merge==split (max diff 4.3e-14); rank(delta)=4<=r=4; loop-ref matches

Serve the merged model, or a base + adapters set, on the same serving stack you use for OSS models. Gate merged checkpoints behind the same eval you trained against before promotion.
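The multi-LoRA option can be pictured as one shared base matmul plus a per-request low-rank correction. The numpy sketch below illustrates that idea for a single layer; it is not how vLLM implements it, and the function name, the `names` routing list, and the "sql"/"chat" adapters with random weights are illustrative assumptions.

```python
import numpy as np

def multi_lora_forward(x, W, adapters, names, alpha, r):
    """One shared base matmul, then a per-request low-rank correction.

    adapters: dict name -> (A, B) with A (r, d_in) and B (d_out, r).
    names:    one adapter name per row of x; None means "base model only".
    """
    y = x @ W.T                                     # base path, computed once
    for name in set(names) - {None}:
        rows = np.array([n == name for n in names]) # rows that requested this adapter
        A, B = adapters[name]
        y[rows] += (alpha / r) * (x[rows] @ A.T) @ B.T
    return y

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8.0
W = rng.standard_normal((d_out, d_in))
adapters = {k: (rng.standard_normal((r, d_in)), rng.standard_normal((d_out, r)))
            for k in ("sql", "chat")}
x = rng.standard_normal((5, d_in))
names = ["sql", "chat", None, "sql", "chat"]

y = multi_lora_forward(x, W, adapters, names, alpha, r)

# Each row must match the merged weight of the adapter it asked for.
for i, name in enumerate(names):
    if name is None:
        W_eff = W
    else:
        A, B = adapters[name]
        W_eff = W + (alpha / r) * B @ A
    assert np.allclose(y[i], W_eff @ x[i], atol=1e-10)
```

The base matmul dominates the cost and is paid once per batch; each adapter adds only two thin matmuls over its own rows, which is why many adapters can share one resident base.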

How to maintain it

SFT is fine-tuning, the first and cheapest stage of the post-training pipeline (SFT then DPO then GRPO). It produces the cold-start policy the later stages depend on; DPO and GRPO typically start from an SFT'd checkpoint, and LoRA adapters carry through all three. Maintenance is therefore mostly about not regressing that cold-start:

Hold out an eval set and gate every re-train on it. Over-fitting a tiny SFT set causes memorisation and capability regression that a held-out eval catches (SRE and MLOps practices).
Version adapters by (base checkpoint, r, alpha, target_modules, data hash). An adapter is only valid against the exact base it was trained on; swapping the base silently changes behaviour.
Re-evaluate after any rank or target-module change: raising r (and lora_alpha) fixes under-fitting on harder tasks but can over-fit small sets. For calibrated LR and rank starting points (the 10x LoRA LR rule, params-vs-tokens capacity), see LoRA hyperparameter scaling.
When maintaining a merged checkpoint, keep the source adapter. The adapter cannot be recovered from the merged weights alone, and merging into a 4-bit base is a known silent-quality-loss trap (merge into BF16, see "How to run it in production" and model merging).
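One way to act on the versioning bullet is to derive a deterministic ID from exactly the fields it lists. The sketch below uses only the standard library; the function name, the record layout, the 16-character truncation, and the "@rev-a" / "@rev-b" revision labels are hypothetical choices for illustration.

```python
import hashlib
import json

def adapter_version(base_checkpoint, r, alpha, target_modules, data_bytes):
    """Deterministic ID over the fields that decide whether an adapter is valid."""
    record = {
        "base": base_checkpoint,
        "r": r,
        "alpha": alpha,
        "target_modules": sorted(target_modules),   # order must not change the ID
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16], record

data = b'{"messages": []}\n'                        # stand-in for the training file bytes
v1, rec = adapter_version("Qwen/Qwen3-8B@rev-a", 16, 32, ["q_proj", "v_proj"], data)
v2, _ = adapter_version("Qwen/Qwen3-8B@rev-a", 16, 32, ["v_proj", "q_proj"], data)
v3, _ = adapter_version("Qwen/Qwen3-8B@rev-b", 16, 32, ["q_proj", "v_proj"], data)

assert v1 == v2          # same configuration, same ID
assert v1 != v3          # a different base revision is a different adapter
```

Storing the full record next to the adapter lets a serving layer refuse to load an adapter whose recorded base does not match the checkpoint it is about to be attached to.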

How to scale it

Full-parameter SFT and large LoRA runs scale across nodes with FSDP (FSDP) or DeepSpeed ZeRO (DeepSpeed and ZeRO) via accelerate / torchrun. FSDP+LoRA shards the (frozen) base while training adapters:

# REFERENCE TEMPLATE: multi-node SFT (full or LoRA) with FSDP2 via Accelerate.
accelerate launch --config_file fsdp.yaml --num_processes 16 sft.py \
  --model Qwen/Qwen3-32B

For multi-instance/multi-node, HSDP shards intra-node over NVLink and replicates inter-node over IB (FSDP). QLoRA's 4-bit base often keeps even large models single-node, avoiding multi-node entirely; reach for sharding only when the base plus activations exceed one node.

The scaling decision is memory arithmetic. Full-SFT holds, per parameter: BF16 weights (2 B) + BF16 grads (2 B) + FP32 master weight (4 B) + FP32 Adam states m and v (4 B + 4 B) = 16 bytes. QLoRA's resident base is 4-bit NF4 (0.5 bytes/param) plus a tiny adapter optimiser footprint. The block below proves the parameter-count reduction, where it breaks down, and the memory ratio the page claims.

import numpy as np

def lora_params(d_in, d_out, r):
    """LoRA adds A:(r,d_in) and B:(d_out,r) -> r*(d_in+d_out) trainable params."""
    assert r > 0 and d_in > 0 and d_out > 0
    return r * (d_in + d_out)

def full_params(d_in, d_out):
    return d_in * d_out

def reduction(d_in, d_out, r):
    return full_params(d_in, d_out) / lora_params(d_in, d_out, r)

d = 4096                                   # 8B-scale hidden size
r = 16
assert lora_params(d, d, r) == 2 * d * r == 131072
assert full_params(d, d) == d * d == 16777216
assert abs(reduction(d, d, r) - 128.0) < 1e-9, "square layer, r=16 -> d/(2r)=128x fewer"
assert reduction(d, d, r) >= 100.0, "page claim: ~100x fewer trainable params holds"

# Monotonic: higher rank -> more params -> less reduction.
assert reduction(d, d, 8) > reduction(d, d, 16) > reduction(d, d, 64)
# Adversarial breakdown: LoRA only saves while r < d/2.
assert abs(reduction(d, d, d // 2) - 1.0) < 1e-9, "r=d/2 -> no saving"
assert reduction(d, d, d) < 1.0, "r>d/2 costs MORE than full fine-tuning"

# Memory bytes/param: full-SFT=16 (bf16 w+grad, fp32 master+Adam m,v); QLoRA base=0.5 (4-bit).
n = 8.0e9
full_bytes = n * (2 + 2 + 4 + 4 + 4)
qlora_base = n * 0.5
assert full_bytes == n * 16
assert abs(full_bytes / qlora_base - 32.0) < 1e-9, "QLoRA base is 32x smaller than full-SFT footprint"

print(f"OK: r=16 -> {reduction(d,d,r):.0f}x fewer params (>=100x); "
      f"full-SFT 16 B/param vs QLoRA base 0.5 B/param = {full_bytes/qlora_base:.0f}x")

Run output:

OK: r=16 -> 128x fewer params (>=100x); full-SFT 16 B/param vs QLoRA base 0.5 B/param = 32x
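The "tiny adapter optimiser footprint" can be put next to the 4-bit base with the same arithmetic. The sketch below is an estimate under stated assumptions: the adapter is charged the same 16 bytes per parameter as full SFT, activations and quantisation constants are ignored, and the adapter shape (36 layers, four square 4096 by 4096 projections, r = 16) is hypothetical.

```python
def full_sft_bytes(n_params):
    return n_params * 16                       # 2 + 2 + 4 + 4 + 4 bytes per parameter

def qlora_bytes(n_base, n_adapter, adapter_bytes_per_param=16):
    """4-bit base (0.5 B/param) plus a trained adapter.
    Assumption: the adapter pays the full-SFT 16 B/param; activations and
    quantisation constants are left out."""
    return n_base * 0.5 + n_adapter * adapter_bytes_per_param

n_base = 8.0e9
# Hypothetical adapter: 36 layers x 4 square projections, r*(d_in+d_out) each.
n_adapter = 36 * 4 * 16 * (4096 + 4096)
adapter_part = n_adapter * 16
total = qlora_bytes(n_base, n_adapter)

assert n_adapter == 18_874_368
assert adapter_part / total < 0.08             # adapter state is a small slice of the total
print(f"full SFT {full_sft_bytes(n_base) / 1e9:.0f} GB, "
      f"QLoRA {total / 1e9:.2f} GB (adapter {adapter_part / 1e9:.2f} GB)")
```

Under these assumptions the base dominates the QLoRA footprint, so the 32x ratio above is only slightly optimistic for weights and optimiser state; activation memory is a separate budget.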

QLoRA trades compute (dequantise on the fly) for that 4-bit resident base. The 4-bit format is NF4 (NormalFloat): a 16-entry codebook placed at the quantiles of a standard normal, with blockwise absmax scaling. Its whole point is to represent normally-distributed weights better than uniform int4. The block below reproduces the NF4 codebook, proves the round-trip error bound, shows NF4 beats linear int4 on Gaussian data, and guards the all-zero block that would otherwise divide by zero.

import numpy as np

# NF4: 16 levels at the quantiles of a standard normal, normalised to [-1, 1].
NF4 = np.array([
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0], dtype=np.float64)

def quant_nf4(w, levels=NF4):
    """Blockwise NF4: scale by absmax to [-1,1], snap to nearest level."""
    absmax = np.max(np.abs(w))
    zero_idx = int(np.abs(levels).argmin())
    if absmax == 0.0:
        return np.full(w.shape, zero_idx, dtype=np.int8), 1.0
    idx = np.abs((w / absmax)[..., None] - levels).argmin(axis=-1).astype(np.int8)
    return idx, absmax

def dequant_nf4(idx, absmax, levels=NF4):
    return levels[idx] * absmax

def quant_int4_linear(w):
    """Reference: symmetric uniform grid, step 2*absmax/15, clipped to [-absmax, absmax]."""
    absmax = np.max(np.abs(w))
    if absmax == 0.0:
        return np.zeros_like(w)
    step = 2 * absmax / 15
    return np.clip(np.round(w / step) * step, -absmax, absmax)

# The code is a valid 4-bit codebook: 16 levels, strictly increasing, asymmetric, has 0.
assert len(NF4) == 16 and np.all(np.diff(NF4) > 0)
assert NF4[0] == -1.0 and NF4[-1] == 1.0 and (NF4 == 0.0).any()
assert not np.allclose(NF4, -NF4[::-1]), "levels are not mirror-symmetric about 0"

rng = np.random.default_rng(1)
w = rng.standard_normal(4096)
idx, absmax = quant_nf4(w)
w_hat = dequant_nf4(idx, absmax)

# Round-trip error is within half the widest gap times absmax.
max_gap = np.max(np.diff(NF4))
assert np.max(np.abs(w - w_hat)) <= (max_gap / 2) * absmax + 1e-9

# NF4 beats uniform int4 on Gaussian data (its design goal).
mse_nf4 = np.mean((w - w_hat) ** 2)
mse_lin = np.mean((w - quant_int4_linear(w)) ** 2)
assert mse_nf4 < mse_lin, "NF4 must beat linear int4 on Gaussians"

# On-grid values are lossless; an all-zero block must not produce NaN.
on_grid = NF4 * 3.0
gi, ga = quant_nf4(on_grid)
assert np.allclose(dequant_nf4(gi, ga), on_grid, atol=1e-12)
zi, za = quant_nf4(np.zeros(64))
assert np.all(dequant_nf4(zi, za) == 0.0) and np.isfinite(dequant_nf4(zi, za)).all()

print(f"OK: 16-level asymmetric code; MSE nf4={mse_nf4:.4f} < linear={mse_lin:.4f} "
      f"({(1-mse_nf4/mse_lin)*100:.0f}% better); zero-block safe")

Run output:

OK: 16-level asymmetric code; MSE nf4=0.0115 < linear=0.0210 (45% better); zero-block safe
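The block above quantises one block at a time; the "blockwise" part of the scheme matters when a tensor contains outliers. The sketch below applies the same codebook with one absmax per block (64 here, the block size used in the QLoRA paper) and compares it with a single per-tensor absmax. The function name, the injected outlier, and the requirement that the size divide evenly are illustrative assumptions.

```python
import numpy as np

NF4 = np.array([
    -1.0, -0.6961928, -0.5250731, -0.3949175, -0.2844414, -0.1847734,
    -0.0910500, 0.0, 0.0795803, 0.1609302, 0.2461123, 0.3379152,
    0.4407098, 0.5626170, 0.7229568, 1.0], dtype=np.float64)

def nf4_roundtrip(w, block_size=None):
    """Quantise then dequantise w with one absmax per block (or per tensor)."""
    flat = np.asarray(w, dtype=np.float64).reshape(-1)
    bs = flat.size if block_size is None else block_size
    assert flat.size % bs == 0, "sketch assumes the size is a multiple of block_size"
    blocks = flat.reshape(-1, bs)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(absmax == 0.0, 1.0, absmax)          # all-zero block guard
    idx = np.abs((blocks / safe)[..., None] - NF4).argmin(axis=-1)
    return (NF4[idx] * safe).reshape(np.shape(w))

rng = np.random.default_rng(3)
w = rng.standard_normal(4096)
w[0] = 50.0                                              # a single outlier weight

mse_tensor = np.mean((w - nf4_roundtrip(w)) ** 2)
mse_block = np.mean((w - nf4_roundtrip(w, block_size=64)) ** 2)

# With one absmax for the whole tensor the outlier stretches every level;
# blockwise scaling confines the damage to the outlier's own block.
assert mse_block < mse_tensor
print(f"per-tensor MSE={mse_tensor:.4f}  blockwise(64) MSE={mse_block:.4f}")
```

The price of blockwise scaling is one stored absmax per block, which is the small overhead on top of the nominal 0.5 bytes per parameter.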

Hardware notes for sharded runs:

Gradient checkpointing trades recompute for activation memory; paged optimisers spill optimiser state to host on spikes.
NCCL for sharded runs: set NCCL_IB_HCA, NCCL_NET_GDR_LEVEL=SYS; confirm [GDRDMA] in NCCL_DEBUG=INFO; ACS off for P2P/GDR (performance tuning).
BF16 is the default compute dtype; on Blackwell, FP8 training paths apply to full-parameter SFT. NVLink/NVSwitch carries the FSDP all-gather intra-node (the Blackwell platform).

Failure modes

Wrong target_modules: adapter learns little; include the attention projections at minimum, add MLP for capacity.
Rank too low: under-fitting on harder tasks; raise r (and lora_alpha) and re-evaluate.
Merging into a 4-bit base: silent quality loss; merge into the BF16/dequantised base, not the quantised one.
No sequence packing: wasted compute on padding for short-sample SFT datasets (the utilisation gap is measured in "How to integrate with it": 0.997 vs 0.589 on the sample data).
Over-fitting a tiny SFT set: memorisation and capability regression; hold out an eval and gate on it (SRE and MLOps practices).
Treating LoRA as free quality: for large distribution shifts, full-parameter SFT can be materially better, and past r = d/2 LoRA costs more parameters than full fine-tuning anyway.
Serving an adapter against the wrong base: an adapter is only valid against the exact checkpoint it trained on; mismatches produce wrong outputs, not errors.

References

LoRA paper: https://arxiv.org/abs/2106.09685
QLoRA paper: https://arxiv.org/abs/2305.14314
TRL SFT Trainer docs: https://huggingface.co/docs/trl/sft_trainer
PEFT docs (LoRA): https://huggingface.co/docs/peft/index
vLLM LoRA serving: https://docs.vllm.ai/en/stable/features/lora/