Markdown

Quantization for inference¶

Scope: quantizing LLM weights and activations to INT8, INT4, FP8 and Blackwell NVFP4/MXFP4 for inference on GPU clusters: the data types, scale/zero-point math, granularity, calibration (static vs dynamic, symmetric vs asymmetric), and the production methods (GPTQ, AWQ, SmoothQuant, NF4/QLoRA, QAT) that cut memory and bandwidth without retraining.

Reference templates run against real APIs (torch, vLLM). Pin versions and validate before production use. The numpy blocks are self-contained and were executed with system python3; every one asserts its result, including an edge or adversarial case.

What it is¶

Quantization stores and computes a model's tensors in a low-bit format (INT8, INT4, FP8, FP4) instead of FP32/BF16, recovering an approximation of each value through a stored scale (and, for asymmetric schemes, a zero-point). For LLM serving it is first and foremost a memory-and-bandwidth optimisation applied to the checkpoint before or at load time: a 70B model needs roughly 140 GB in BF16 but about 35 GB in 4-bit, which is the difference between four GPUs and one. The connected agent never sees it; the model's outputs are meant to be near-identical.

Any scheme is defined by two axes: which tensors are quantised (weight-only, or weight-and-activation, and separately whether the KV cache is quantised) and at what granularity the scale is shared. The arithmetic itself is trivial: divide, round, clamp to quantise; multiply to dequantise (CUDA for Deep Learning, Ch. 9). The engineering is choosing the scale well and dequantising inside the matmul kernel, so the low-bit weights stay compressed in HBM until the instant they feed the tensor cores. This page is the optimisation layer under inference serving and serving open-weight models.

Why use it¶

The savings cascade through the whole system (CUDA for Deep Learning, Ch. 9):

Memory. INT8 is 4x smaller than FP32 (2x vs BF16); INT4 is 8x (4x vs BF16). One billion parameters drop from 4 GB (FP32) to 1 GB (INT8) to 0.5 GB (INT4). This sets how many GPUs a model needs and how much HBM is left for the KV cache and activations.
Bandwidth. Decode is memory-bandwidth bound; every generated token streams the full weight set (plus the KV cache) out of HBM. Quartering the weight bytes quarters that traffic and moves the kernel rightward on the roofline toward the compute roof. The book's framing: "4x less bandwidth to move data around, and inference runs twice as fast."
Cache. Each cache line now holds 4x more parameters, so L2 and shared memory effectively grow 4x; "many operations go from cache-bound to compute-bound, completely changing the performance profile of your kernels."

It is tolerable because a network's knowledge lives in the relationships between weights, not their exact values. Rounding to a few bits injects structured, bounded quantisation noise that a well-trained model absorbs: "a carefully quantized model often maintains 99% of its original accuracy." The catch: error grows with the square of the step size (double the step, 4x the error), so the scale factor is the single most important choice.

When to use it (and when not)¶

Reach for quantization when:

The checkpoint does not fit, or you want fewer GPUs. Weight-only INT4 (GPTQ, AWQ) turns a 140 GB BF16 70B into roughly 35 GB, moving four GPUs down to one; FP8 halves it to roughly 70 GB, moving four GPUs down to two.
Decode is bandwidth-bound, which it is at small batch. Quartering the weight bytes quarters HBM traffic and speeds token generation.
You are post-training on a budget. NF4/QLoRA keeps a 4-bit base plus small adapters on a single node for SFT, DPO, and GRPO.
You serve at high QPS with a stable input distribution and want a compute win, not just a memory one. Static W8A8 INT8 (SmoothQuant) or FP8 unlock integer and FP8 tensor-core GEMMs.

Stay in BF16, or proceed with care, when:

The model already fits with room for the KV cache. The accuracy risk buys little.
You cannot validate accuracy on a representative held-out eval. Never ship a quantized model you have not measured.
Latency at batch 1 is all that matters and you expected an INT8/FP8 compute win. Small-batch decode is bandwidth-bound, so weight-only INT4 helps, but INT8/FP8 compute speedups need tiles large enough to saturate the tensor cores.
You need the last point of accuracy at very low bit-width and cannot afford QAT.
The target hardware lacks the format: FP8 before Hopper, NVFP4/MXFP4 before Blackwell, or INT4 without Marlin kernels.

Architecture¶

Every quantization pipeline has the same shape: pick a granularity and a symmetric or asymmetric scheme, set the scales (offline by calibration for static, or per-token at runtime for dynamic), store the low-bit values alongside their scales, then dequantise inside the matmul kernel so the compressed weights stay in HBM until the moment they feed the tensor cores, which accumulate in FP16/FP32. The methods below slot into this pipeline: GPTQ and AWQ produce weight-only INT4 stores; SmoothQuant rebalances activations and weights before an INT8 GEMM.

flowchart TD
    W["FP32 / BF16 weights and activations"] --> G["Choose granularity + symmetric/asymmetric"]
    G --> CAL["Calibrate offline (static, percentile)"]
    G --> DYN["Per-token ranges at runtime (dynamic)"]
    CAL --> STORE["Store INT8 / INT4 / FP8 + scales"]
    DYN --> STORE
    STORE --> DEQ["Dequant in the kernel"]
    DEQ --> MM["Tensor-core matmul (accumulate in FP16/FP32)"]
    GPTQ["GPTQ (Hessian error compensation)"] -.->|"weight-only INT4"| STORE
    AWQ["AWQ (scale salient channels)"] -.->|"weight-only INT4"| STORE
    SQ["SmoothQuant (migrate activation outliers)"] -.->|"W8A8 INT8"| G

Data types¶

Format	Bits	Layout	Range / levels	Typical inference use
FP8 E4M3	8	4 exp, 3 mantissa	~±448	Weights and activations (W8A8) on Hopper/Blackwell
FP8 E5M2	8	5 exp, 2 mantissa	~±57344	Wider range, lower precision; gradients in FP8 training
INT8	8	integer	−127..127 (sym), 0..255 (asym)	W8A8 serving via DP4A / INT8 tensor cores
INT4	4	integer	−8..7 (often ±7), packed 2/byte	Weight-only, group-wise (GPTQ, AWQ)
NF4	4	non-uniform (normal quantiles)	16 levels, per-block absmax	QLoRA / bitsandbytes weight storage
NVFP4	4	E2M1 (2 exp, 1 mantissa)	±6; 16-elem block E4M3 scale + per-tensor FP32	Blackwell 4-bit inference
MXFP4	4	E2M1 (2 exp, 1 mantissa)	±6; 32-elem block E8M0 (pow-2) scale	OCP Microscaling 4-bit on Blackwell

Integer formats can slide the zero-point and scale to fit a tensor's actual distribution (adaptive range placement) while floating-point formats fix the exponent/mantissa split and force your data into a predetermined range (CUDA for Deep Learning, Ch. 9). That adaptivity is why calibrated INT8/INT4 often beats naive low-precision float for weights. FP8 and FP4 trade it back for hardware-native tensor-core throughput: a 16-element NVFP4 microblock carries an FP8 E4M3 scale plus a per-tensor FP32 scale (two-level micro-scaling), which is what lets 4-bit storage retain usable accuracy, see tensor cores and mixed precision. The same low-bit formats also shrink the KV cache (FP8/INT8 KV), the other half of decode memory.

Quant and dequant (scale, zero-point)¶

Quantisation is an affine map between a real range and an integer range [qmin, qmax]:

scale       = (x_max - x_min) / (qmax - qmin)
zero_point  = round(qmin - x_min / scale)
quantize:   q = clamp(round(x / scale) + zero_point, qmin, qmax)
dequantize: x_hat = (q - zero_point) * scale

Symmetric quantisation pins zero to integer 0 and sets scale = max(|x|) / qmax with zero_point = 0, the natural fit for zero-centred weights (e.g. weights in [-2.5, 2.5] give scale = 2.5/127 ≈ 0.0197). Asymmetric quantisation lets zero land anywhere, using the full unsigned range for skewed activations: for a range [-2, 18], scale = 20/255 ≈ 0.0784 and zero_point = 26, so the value 5.0 quantises to 90 (worked through in CUDA for Deep Learning, Ch. 9, and reproduced by the code in how to use it). Dequantisation is just a subtract-and-multiply, cheap enough to fold into the weight load so its latency hides behind memory.

Granularity¶

The scale can be shared at progressively finer grain, trading metadata and scale-lookups for precision (CUDA for Deep Learning, Ch. 9):

Per-tensor: one scale for the whole tensor. Cheapest; fine for small, uniform tensors (biases, norms) but wastes precision when ranges vary across the tensor.
Per-channel (per output feature): one scale per column/head. The standard for weights; cuts error 2-3x when features have heterogeneous magnitudes, at the cost of one scale per channel.
Per-token: one scale per row of the activation matrix, computed at runtime; the dynamic-quantisation default for activations.
Group / block-wise: one scale per contiguous block of 64 or 128 elements. The sweet spot for large LLM weight matrices: group size 128 keeps scale overhead under ~1% while still adapting locally. INT4 weight-only checkpoints (GPTQ, AWQ, NF4) are almost always group-wise.

A typical transformer uses channel/head-wise scales for attention outputs, group-wise (g=128) for the big feed-forward matrices, and per-tensor for norms and biases.

Finer grain strictly lowers per-feature error when magnitudes vary, and buys nothing once the ranges already match. Running this prints per-tensor=0.836 per-channel=0.138 (6.1x); uniform gap=0.000:

import numpy as np

rng = np.random.default_rng(1)


def rt_int4(x, scale):
    """Symmetric int4 round-trip at a given (broadcastable) scale."""
    return np.clip(np.round(x / scale), -7, 7) * scale


def per_feature_rel_mae(W, Wq):
    """Mean over columns of each feature's relative reconstruction error."""
    num = np.abs(W - Wq).mean(axis=0)
    den = np.abs(W).mean(axis=0) + 1e-12
    return float((num / den).mean())


# a weight matrix whose columns (output features) span very different magnitudes
d_out, d_in = 256, 512
col_mag = 10.0 ** rng.integers(-2, 3, size=d_in)  # magnitudes from 1e-2 to 1e2
W = rng.standard_normal((d_out, d_in)) * col_mag

# per-tensor: a single absmax scale, dominated by the largest column
s_tensor = np.abs(W).max() / 7
rel_pt = per_feature_rel_mae(W, rt_int4(W, s_tensor))

# per-channel: one absmax scale per column, adapting to each feature
s_chan = np.abs(W).max(axis=0, keepdims=True) / 7
rel_pc = per_feature_rel_mae(W, rt_int4(W, s_chan))

# finer granularity strictly lowers per-feature error on heterogeneous ranges
assert rel_pc < rel_pt
assert rel_pt / rel_pc > 3  # small-magnitude features are near-destroyed per-tensor

# edge: when every column already shares one absmax, the scales coincide
U = rng.standard_normal((d_out, d_in))
U = U / np.abs(U).max(axis=0, keepdims=True)  # force identical column ranges
rel_u_pt = per_feature_rel_mae(U, rt_int4(U, np.abs(U).max() / 7))
rel_u_pc = per_feature_rel_mae(U, rt_int4(U, np.abs(U).max(axis=0, keepdims=True) / 7))
assert abs(rel_u_pt - rel_u_pc) < 1e-9  # no free lunch when ranges already match

print(f"heterogeneous per-feature rel-MAE: per-tensor={rel_pt:.3g} "
      f"per-channel={rel_pc:.3g} ({rel_pt / rel_pc:.1f}x); "
      f"uniform gap={abs(rel_u_pt - rel_u_pc) / rel_u_pt:.3f}")

Calibration: static vs dynamic, symmetric vs asymmetric¶

Calibration computes the scales from representative data: run a few hundred to a few thousand in-distribution samples, observe the per-slice ranges, and set scales. Min-max spans the full observed range but is wrecked by a single outlier; percentile calibration (e.g. the 99.9th percentile) clips rare extremes so the common values keep the precision. Bad calibration hurts more than aggressive bit-width: "if you quantize to 8 bits with terrible calibration, you might lose 20% accuracy." Always validate on held-out data, never the calibration set.

Static quantisation bakes the scales into the checkpoint at conversion time, so inference is pure compute, the right default for high-QPS serving. Dynamic quantisation recomputes activation scales per batch/token at runtime; it adapts to wildly varying inputs but adds a reduction pass on every forward. Symmetric INT8 suits zero-centred weights; asymmetric UINT8 suits ReLU-style positive activations. Most production stacks converge on the hybrid the book recommends: symmetric INT8 weights, asymmetric activations, static where the input distribution is stable.

Methods (GPTQ, AWQ, SmoothQuant, NF4 and QLoRA, QAT)¶

The backdrop is LLM.int8(): above ~6.7B parameters, a handful of emergent outlier feature dimensions blow up naive INT8 activation quantisation. Its fix, keep the outlier dimensions in FP16 and quantise the rest INT8 (mixed-precision decomposition), motivated the outlier-handling in the methods below (Dettmers et al.).

GPTQ: one-shot post-training quantisation using approximate second-order (Hessian) information. It quantises a weight matrix column by column and, after each column, updates the not-yet-quantised weights to compensate for the rounding error just introduced, so the layer's output stays close to the FP16 original. Lazy batched updates (128 columns at a time) and a Cholesky reformulation make it fast and stable; weight-only INT4 (or 3-bit), dequantised to FP16 inside the matmul (Frantar et al.).
AWQ: activation-aware weight quantisation. Weight importance is set by activation magnitude, not weight magnitude: only ~0.1-1% of channels are salient. AWQ searches for a per-channel scale that protects those channels (scale the salient weights up, the matching activations down, a mathematically equivalent transform), then quantises everything to group-wise INT4. No backprop, no mixed precision, so it generalises and runs fast. The book's intuition: rescale w' = w / sqrt(|a|) so the high-impact products land in the precise part of the range (Lin et al.).
SmoothQuant: enables full W8A8 (INT8 weights and activations) by migrating quantisation difficulty from activations to weights with a per-channel smoothing factor s_j = max(|X_j|)^α / max(|W_j|)^(1−α) (typically α=0.5), applied as the equivalent transform X̂ = X·diag(s)⁻¹, Ŵ = diag(s)·W. Both tensors become easy to quantise, unlocking INT8 tensor-core GEMMs, a compute speedup, not just a memory one (Xiao et al.). Contrast: AWQ/GPTQ are weight-only INT4 for memory/bandwidth; SmoothQuant quantises activations too for INT8 compute.
NF4 and QLoRA: NF4 (4-bit NormalFloat) is a non-uniform data type whose 16 levels are the quantiles of a standard normal, matched to how weights actually distribute; it is block-wise (64 weights) with absmax scales, and double quantisation compresses the scales themselves. QLoRA freezes an NF4 base model and trains BF16 LoRA adapters on top, dequantising NF4 to BF16 just-in-time, fine-tuning a 65B model on one 48 GB GPU (Dettmers et al.). It is the standard way to run RL post-training on a budget: a 4-bit base under SFT, DPO, and GRPO keeps the policy (and reference) small enough for a single node. For pure serving, prefer GPTQ/AWQ/FP8.

Quantization is also a rollout-phase lever for RL workloads: in GRPO/PPO the generation phase dominates wall-clock, so serving the policy's rollouts in FP8 (Hopper/Blackwell) or NVFP4 (Blackwell) cuts generation cost while training stays in BF16, the same precision split this page describes, applied to the inference engine inside the RL loop.

QAT: quantisation-aware training simulates the round in the forward pass and passes gradients straight through it (STE), so the model learns to be robust to quantisation noise. It reaches the best accuracy at very low bit-widths but costs a full fine-tune; everything above is post-training quantisation, which gets ~90% of the way at a fraction of the cost.

The four numpy blocks below reduce each method to the core linear-algebra claim it rests on and check it. They use only numpy, so they run anywhere.

SmoothQuant: an exact rebalancing that makes W8A8 easier¶

The smoothing transform leaves the GEMM identical while shrinking the activation peak, so per-tensor INT8 stops collapsing on the outlier channels. Prints act max 97.1 -> 6.25; W8A8 GEMM MAE 0.2898 -> 0.1113 (2.6x better):

import numpy as np

rng = np.random.default_rng(2)


def q8_per_tensor(a):
    """Per-tensor symmetric int8 round-trip."""
    s = max(np.abs(a).max() / 127, 1e-12)
    return np.clip(np.round(a / s), -127, 127) * s


n, d, m = 64, 256, 128  # tokens, in-features, out-features
X = rng.standard_normal((n, d))
X[:, ::37] *= 30.0  # emergent activation outliers in a few channels
W = rng.standard_normal((d, m)) * 0.1

# SmoothQuant smoothing factor s_j = max(|X_j|)^a / max(|W_j|)^(1-a)
alpha = 0.5
act_max = np.abs(X).max(axis=0)
wt_max = np.abs(W).max(axis=1)
s = act_max ** alpha / np.maximum(wt_max, 1e-8) ** (1 - alpha)

X_hat = X / s  # diag(s)^-1 on activation columns
W_hat = s[:, None] * W  # diag(s) on weight rows

# 1) the transform is mathematically exact: the GEMM is unchanged
assert np.allclose(X_hat @ W_hat, X @ W, rtol=1e-10, atol=1e-8)

# 2) difficulty migrates off the activations (their peak shrinks)
assert np.abs(X_hat).max() < np.abs(X).max()

# 3) adversarial payoff: W8A8 int8 GEMM error drops after smoothing
ref = X @ W
err_naive = np.abs(q8_per_tensor(X) @ q8_per_tensor(W) - ref).mean()
err_smooth = np.abs(q8_per_tensor(X_hat) @ q8_per_tensor(W_hat) - ref).mean()
assert err_smooth < err_naive

print(f"act max {np.abs(X).max():.1f} -> {np.abs(X_hat).max():.2f}; "
      f"W8A8 GEMM MAE {err_naive:.4g} -> {err_smooth:.4g} "
      f"({err_naive / err_smooth:.1f}x better)")

AWQ: protect a salient channel by scaling it into the range¶

A salient channel is a small weight paired with a large activation; a single per-tensor scale rounds that weight to zero and loses the product. AWQ scales the weight up and the activation down by the same factor, an exact transform that leaves the weight representable. Prints salient product x*w = 1.250; dot-product error naive=1.250 -> awq=0.233 (5.4x better):

import numpy as np

rng = np.random.default_rng(3)

d = 64
w = rng.standard_normal(d)
x = rng.standard_normal(d)
scale = np.abs(w).max() / 7  # the bulk of the weights sets the scale

# snap the bulk onto the int4 grid so they quantize losslessly, isolating the
# one salient channel: a small weight paired with a large activation.
w = np.round(w / scale) * scale
sal = 7
w[sal] = 0.05  # off-grid, smaller than half a step -> rounds to zero
x[sal] = 25.0  # ...but its activation makes the product matter


def q4(v):  # weight-only int4 at the fixed bulk scale
    return np.clip(np.round(v / scale), -7, 7) * scale


assert q4(w)[sal] == 0.0  # naive int4 destroys the salient weight
ref = x @ w
err_naive = abs(x @ q4(w) - ref)

# AWQ: scale the salient weight up by g, the matching activation down by g.
g = np.ones(d)
g[sal] = 8.0
w_awq = g * w  # protected weight now occupies part of the range
x_awq = x / g  # equivalent transform: every product is preserved
assert np.isclose(x_awq @ w_awq, x @ w, rtol=1e-10)
assert q4(w_awq)[sal] != 0.0  # salient weight is now representable

err_awq = abs(x_awq @ q4(w_awq) - ref)
assert err_awq < err_naive  # the important product survives quantization

print(f"salient product x*w = {x[sal] * w[sal]:.3f}; dot-product error "
      f"naive={err_naive:.3f} -> awq={err_awq:.3f} "
      f"({err_naive / max(err_awq, 1e-9):.1f}x better)")

GPTQ: error compensation beats round-to-nearest on the same grid¶

GPTQ quantises column by column and pushes each column's rounding residual onto the not-yet-quantised columns through the inverse Hessian H = X Xᵀ, minimising the layer's output error ‖(W − Wq) X‖. Both results land on the same legal INT4 grid, so the win is compensation, not a wider format. Prints output error RTN=71.52 GPTQ=66.69 (1.07x better on the same grid):

import numpy as np

rng = np.random.default_rng(4)


def to_grid(w, scale):
    """Symmetric int4 grid (dequantized values)."""
    return np.clip(np.round(w / scale), -7, 7) * scale


out_f, in_f, n = 16, 64, 256
W = rng.standard_normal((out_f, in_f))
mix = np.eye(in_f) + 0.3 * rng.standard_normal((in_f, in_f)) / np.sqrt(in_f)
X = mix @ rng.standard_normal((in_f, n))  # correlated features -> non-diagonal H
scale = np.abs(W).max() / 7

# round-to-nearest baseline
Wq_rtn = to_grid(W, scale)

# GPTQ: quantize column by column, pushing each column's rounding error onto
# the not-yet-quantized columns via the inverse Hessian H = X X^T.
H = X @ X.T
H += np.eye(in_f) * (0.02 * np.mean(np.diag(H)))  # damping for invertibility
Hinv = np.linalg.inv(H)
Wg = W.astype(np.float64).copy()
Wq = np.zeros_like(Wg)
for i in range(in_f):
    q = to_grid(Wg[:, i], scale)
    Wq[:, i] = q
    e = (Wg[:, i] - q) / Hinv[i, i]  # error scaled by the pivot
    if i + 1 < in_f:
        Wg[:, i + 1:] -= np.outer(e, Hinv[i, i + 1:])  # compensate the rest

# GPTQ minimizes output error ||(W - Wq) X||, so it must beat round-to-nearest
err_rtn = np.linalg.norm((W - Wq_rtn) @ X)
err_gptq = np.linalg.norm((W - Wq) @ X)
assert err_gptq < err_rtn

# both land on the same legal int4 grid (compensation, not cheating the format)
assert np.all(np.abs(np.round(Wq / scale)) <= 7)

print(f"output error  RTN={err_rtn:.2f}  GPTQ={err_gptq:.2f} "
      f"({err_rtn / err_gptq:.2f}x better on the same grid)")

NF4: normal-quantile levels beat uniform on normal weights¶

NF4 places its 16 levels at the quantiles of a standard normal, matching where weight mass actually sits. On normally distributed weights it beats a uniform 16-level grid; on uniform data the advantage reverses, which is the honest boundary of the claim. Prints NF4 MAE=0.09683 uniform MAE=0.1368 (1.41x); uniform data flips it: NF4=0.04187 uniform=0.03356:

import numpy as np

rng = np.random.default_rng(5)


def to_levels(x, levels):
    """Nearest-level (non-uniform) quantize."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]


# NF4-style levels: 16 values placed at the quantiles of a standard normal,
# normalized so the outermost level sits at +-1 (numpy has no erfinv, so
# estimate the quantiles empirically from a large normal sample).
sample = rng.standard_normal(4_000_000)
nf4 = np.quantile(sample, (np.arange(16) + 0.5) / 16)
nf4 = nf4 / np.abs(nf4).max()

# a fair 16-vs-16 contest isolating level *placement*, not level *count*
uniform = np.linspace(-1.0, 1.0, 16)

# block-wise absmax scaling into [-1, 1], as NF4/QLoRA does per block
w = rng.standard_normal(8192)
absmax = np.abs(w).max()
w_norm = w / absmax

err_nf4 = np.abs(w - to_levels(w_norm, nf4) * absmax).mean()
err_uniform = np.abs(w - to_levels(w_norm, uniform) * absmax).mean()

# normal-quantile placement beats uniform on normally distributed weights
assert err_nf4 < err_uniform

# edge: on uniformly distributed data the advantage reverses (matched levels win)
u = rng.uniform(-1.0, 1.0, 8192)
e_nf4_u = np.abs(u - to_levels(u, nf4)).mean()
e_uni_u = np.abs(u - to_levels(u, uniform)).mean()
assert e_uni_u < e_nf4_u  # NF4 is optimal for normal data, not every distribution

print(f"normal weights: NF4 MAE={err_nf4:.4g} uniform MAE={err_uniform:.4g} "
      f"({err_uniform / err_nf4:.2f}x); uniform data flips it: "
      f"NF4={e_nf4_u:.4g} uniform={e_uni_u:.4g}")

How to use it¶

The whole scheme rests on one affine map. This numpy version reproduces the book's worked example ([-2, 18] gives scale ≈ 0.0784, zero_point = 26, and 5.0 -> 90), bounds the symmetric round-trip error, catches a too-small scale by its clipping signature, and checks the vectorised path against a slow scalar reference. Prints scale=0.04213 MAE=0.01053 clip_frac=0.51 ratio_bad=1 ratio_good=0.014:

import numpy as np


def calibrate(x, symmetric=True, num_bits=8):
    """Affine calibration: scale = (max - min) / (qmax - qmin)."""
    x = np.asarray(x, dtype=np.float64)
    if symmetric:  # zero-centred weights, signed int
        qmax = 2 ** (num_bits - 1) - 1
        return max(np.abs(x).max() / qmax, 1e-12), 0, -qmax, qmax
    qmin, qmax = 0, 2 ** num_bits - 1  # skewed activations, unsigned int
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-12)
    zp = int(round(qmin - x.min() / scale))
    return scale, zp, qmin, qmax


def quantize(x, scale, zp, qmin, qmax):
    return np.clip(np.round(np.asarray(x) / scale) + zp, qmin, qmax).astype(np.int32)


def dequantize(q, scale, zp):
    return (q.astype(np.float64) - zp) * scale


# 1) reproduce the book's asymmetric worked example: [-2, 18] -> 5.0 maps to 90
scale, zp, qmin, qmax = calibrate([-2.0, 18.0], symmetric=False)
assert abs(scale - 20 / 255) < 1e-12 and zp == 26, (scale, zp)
assert int(quantize(5.0, scale, zp, qmin, qmax)) == 90

# 2) symmetric round-trip error is bounded by half a step (scale / 2)
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096))
s, z, lo, hi = calibrate(w, symmetric=True)
recon = dequantize(quantize(w, s, z, lo, hi), s, z)
err = w - recon
mae_g, mse_g = np.abs(err).mean(), (err ** 2).mean()
assert np.abs(err).max() <= s / 2 + 1e-9
assert mae_g < s / 3  # mean abs error tracks a quarter-step (step = scale)

# 3) adversarial: a scale 8x too small clips hard and fattens the error tail
recon_bad = np.clip(np.round(w / (s / 8)), lo, hi) * (s / 8)
clipped = np.mean(np.abs(np.round(w / (s / 8))) >= hi)
err_b = w - recon_bad
mae_b, mse_b = np.abs(err_b).mean(), (err_b ** 2).mean()
assert clipped > 0.1  # over 10% of values saturate at the clamp
assert mse_b / mae_b > 5 * (mse_g / mae_g)  # MSE jumps far above MAE under clipping

# 4) equivalence to a slow scalar reference
ref = np.array([np.clip(round(v / s) + z, lo, hi) for v in w[0]])
assert np.array_equal(ref, quantize(w[0], s, z, lo, hi))

print(f"scale={s:.4g} MAE={mae_g:.4g} clip_frac={clipped:.2f} "
      f"ratio_bad={mse_b / mae_b:.3g} ratio_good={mse_g / mae_g:.3g}")

In a framework you express the same affine map over whole tensors and let the tensor cores do the GEMM. This is the torch form of the block above (reference template, needs torch; the numpy block is the validated core it implements):

# Reference template (needs torch >= 2.1). Core math validated by the numpy block above.
import torch


def quantize_affine(
    x: torch.Tensor, num_bits: int = 8, symmetric: bool = True
) -> tuple[torch.Tensor, float, int]:
    """Affine quant to int. scale = (max - min) / (qmax - qmin)."""
    if symmetric:  # weights: zero-centred, INT8 in [-127, 127]
        qmax = 2 ** (num_bits - 1) - 1
        qmin = -qmax
        scale = x.abs().max().item() / qmax or 1.0
        zero_point = 0
    else:  # activations: skewed, UINT8 in [0, 255]
        qmin, qmax = 0, 2 ** num_bits - 1
        x_min, x_max = x.min().item(), x.max().item()
        scale = (x_max - x_min) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - x_min / scale))
    q = torch.round(x / scale) + zero_point
    return q.clamp(qmin, qmax).to(torch.int32), scale, zero_point


def dequantize_affine(q: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    return (q.to(torch.float32) - zero_point) * scale

How to integrate with it¶

In production you rarely quantise by hand: you serve a prequantized checkpoint and let the engine pick kernels. vLLM auto-detects the method from the checkpoint config; --quantization makes it explicit, and it auto-selects the faster Marlin kernels for AWQ/GPTQ on Ampere and newer (reference template, needs vllm; the quantization math is the same affine/method core validated above):

# vLLM >= 0.6, OpenAI-compatible server. Reference template; pin versions.

# AWQ: INT4 weight-only, group-wise (g=128).
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ --quantization awq --max-model-len 8192

# GPTQ: INT4 weight-only with Hessian error compensation.
vllm serve Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --quantization gptq

# FP8 (W8A8) on Hopper/Blackwell tensor cores:
#  (a) quantize a BF16 model to FP8 E4M3 on the fly at load time,
vllm serve Qwen/Qwen2.5-7B-Instruct --quantization fp8
#  (b) or load a prequantized FP8 checkpoint (method auto-detected).
vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8

FP8 needs Hopper or newer; NVFP4/MXFP4 need Blackwell. INT4 weight-only runs on Ampere+ via the AWQ/GPTQ Marlin kernels. Validate accuracy on your own eval set after any of these. The throughput numbers assume accuracy holds. See the vLLM quantization docs for the full method matrix and supported hardware.

Two integration points are worth calling out. First, the KV cache: the same low-bit formats (FP8/INT8 KV) quantise the other half of decode memory, configured separately from the weights. Second, the RL loop: in GRPO/PPO the rollout engine is just an inference server, so serving rollouts in FP8/NVFP4 while training stays BF16 plugs quantization straight into post-training. This is the optimisation layer under inference serving and serving open-weight models.

How to run it in production¶

Prefer static scales for high-QPS serving so inference is pure compute. Calibrate with a few hundred to a few thousand in-distribution samples, use percentile (not min-max) so a lone outlier does not steal the range, and gate promotion on a held-out eval, never the calibration set. Confirm the format matches the GPU before you ship (FP8 needs Hopper+, NVFP4/MXFP4 need Blackwell, INT4 needs Marlin kernels). Keep operands low-bit but accumulate in FP16/FP32 so numerics do not drift, see tensor cores and mixed precision. Profile end-to-end rather than one kernel: you can land 4x memory but only ~1.2x speed, so measure the whole request.

How to maintain it¶

Re-calibrate when the input distribution shifts; unrepresentative calibration is the dominant failure mode, and yesterday's scales silently degrade on today's traffic. Pin the engine and kernel versions (the Marlin kernels and the checkpoint format both move), and re-validate at your maximum context length whenever you quantise the KV cache, because long-context quality degrades first. Run an accuracy regression on every model or engine upgrade: a kernel change moves numerics, and the only proof that accuracy still holds is a fresh eval on held-out data.

How to scale it¶

The memory math sets the fleet: at ~0.5 GB per billion parameters in INT4, model size decides how many GPUs a replica needs and how much HBM is left for the KV cache. Quantising the KV cache (FP8/INT8) frees that HBM for more concurrent sequences and longer context. On the roofline, cutting the weight bytes moves decode toward the compute roof, which is where added batch turns into throughput. Group size 128 keeps scale overhead under ~1% while adapting locally, so it stays the default for the big feed-forward matrices even at scale. Combine quantization with tensor parallelism across GPUs, and keep batches large enough to feed the tensor cores, otherwise the INT8/FP8 compute speedup never materialises and you are left with the bandwidth win alone.

Failure modes¶

Unrepresentative calibration. The dominant failure: scales tuned on the wrong distribution clip or waste range and tank accuracy. Gather diverse, in-distribution samples and validate on held-out data, never the calibration set.
Activation outliers. Naive INT8 activation quantisation collapses on models above ~6.7B because of emergent outlier features. Use SmoothQuant (migrate the outliers) or LLM.int8()-style mixed precision; weight-only INT4 sidesteps it by leaving activations in FP16/BF16.
Clipping vs wasted range. A scale too small clips to ±max (MSE jumps far above MAE); too large bunches values into a narrow band, effectively using fewer bits. The MSE/MAE gap is the diagnostic, exactly the signal asserted in the how to use it block.
Quantisation is not free. Dynamic-scale reductions and per-channel/group scale lookups have cost; you can land 4x memory but only ~1.2x speed. At decode (small batch) weight-only INT4 helps bandwidth, but INT8/FP8 compute speedups need tiles large enough to feed the tensor cores. Profile end-to-end, not one kernel.
Accumulating in low precision. Operands are low-bit; accumulators are not. Always accumulate in FP16/FP32 or numerics drift, see tensor cores and mixed precision.
KV-cache quantisation drift. FP8/INT8 KV shrinks decode memory but can degrade long-context quality; validate at your maximum context length, see KV-cache management.
Format/hardware mismatch. FP8 on pre-Hopper, NVFP4/MXFP4 on pre-Blackwell, or an INT4 checkpoint without Marlin kernels will fall back or fail. Confirm the engine and GPU support the format you ship.

References¶

CUDA for Deep Learning (Manning MEAP), Chapter 9 (Quantization). Data types, quant/dequant kernels, granularity, calibration, static vs dynamic, symmetric vs asymmetric, AWQ, GPTQ, NF4, QAT.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. https://arxiv.org/abs/2306.00978
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. https://arxiv.org/abs/2211.10438
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. https://arxiv.org/abs/2208.07339
QLoRA: Efficient Finetuning of Quantized LLMs (NF4, double quantization). https://arxiv.org/abs/2305.14314
FP8 Formats for Deep Learning (E4M3 / E5M2). https://arxiv.org/abs/2209.05433
vLLM, Quantization documentation. https://docs.vllm.ai/en/latest/features/quantization/index.html