Markdown

NVFP4: four-bit precision across the model lifecycle¶

Scope: the FP4/NVFP4 story as told by NVIDIA's EAI lab post "Pushing Intelligence to 4-bit" (June 2026): why four bits is representationally hard, how NVFP4's two-level scaling differs from MXFP4, and what the format now covers in practice: W4A4 LLM training and inference, quantization-aware 4-bit experts in frontier MoE models, end-to-end FP4 long-video generation, the NVFP4 KV cache, and FP4 attention. The broader quantization method catalog (GPTQ, AWQ, SmoothQuant, calibration) is quantization for inference; the tensor-core mechanics of low-precision MMA are tensor cores and mixed precision.

All benchmark and accuracy numbers on this page are the blog's (and its cited NVIDIA reports and papers), not reproduced here; re-verify against current releases before capacity or quality decisions. The numpy example is executed and asserted; it validates the format mechanics, not a Blackwell kernel. Tooling pointers are as of 2026-07.

flowchart TB
  T["FP32/BF16 tensor"] --> TS["FP32 per-tensor scale<br/>(normalizes block scales into FP8 range)"]
  TS --> B["16-value micro-blocks"]
  B --> BS["FP8 (E4M3) scale per block"]
  BS --> Q["4-bit E2M1 codes<br/>(15 values: 0, +/-0.5 ... +/-6)"]
  Q --> TC["Blackwell Tensor Cores<br/>W4A4 GEMM, FP4 attention"]
  Q --> KV["NVFP4 KV cache<br/>(native FP4-to-FP8 dequant datapath)"]
  KV --> TC

What it is¶

NVFP4 is NVIDIA's 4-bit floating-point format and the hardware path that makes it usable across the model lifecycle. The element type is E2M1 (1 sign, 2 exponent, 1 mantissa bit), which yields exactly fifteen distinct values: {0, +/-0.5, +/-1, +/-1.5, +/-2, +/-3, +/-4, +/-6}. A 4-bit code only picks a mark on that tiny ruler; everything about accuracy rides on the scale that places the ruler over the data.¹

Block scaling is the established fix for the one-ruler problem. MXFP4, the Open Compute Project microscaling format, gives every block of 32 values its own power-of-two scale (an E8M0 exponent), so one outlier only distorts its local block; the catch is that a power-of-two dial is coarse, and the rounding of the ideal scale is paid for out of the 4-bit budget. NVFP4 sharpens both knobs: one shared FP8 (E4M3) scale per 16-value block, plus a higher-level FP32 per-tensor scale that normalizes the tensor so each block scale fits cleanly into FP8. Smaller blocks halve how many values one outlier contaminates, and the mantissa-bearing scale lands where the data actually is; NVIDIA reports about 88% lower quantization error than power-of-two scaling. The cost is metadata: roughly 4.5 effective bits per value versus MXFP4's 4.25, and the format needs Blackwell to run natively.¹

Why use it¶

The arithmetic itself gets faster, not just the storage. W4A4 (weights and activations both 4-bit) runs the dominant matrix multiplies directly on Blackwell Tensor Cores: up to a 4x GEMM throughput ceiling over BF16 at a quarter of the memory traffic per element, and up to about 3x the peak inference throughput of FP8 on the same hardware.²
Training too. The blog reports about 1.9x faster pretraining than FP8 on Llama-3.1 405B with NVFP4-aware training, and a head-to-head pretraining comparison in which MXFP4 needed 36% more tokens to reach the same loss as NVFP4.²
Accuracy holds at the reported margins. Post-training quantization of DeepSeek-R1 to NVFP4 stays within about 1% of FP8 across reasoning and knowledge benchmarks (MMLU-Pro 85 to 84, GPQA 81 to 80, AIME 2024 up 89 to 91).²
The KV cache shrinks on the same format. NVIDIA reports the NVFP4 KV cache cutting footprint up to about 50% versus FP8 at under 1% accuracy loss, beating an MXFP4 cache by about 5%, and dequantizing along Blackwell's native FP4-to-FP8 datapath where generic INT4/INT2 caches must dequantize in software.³
It ships. NVFP4 inference is in TensorRT-LLM, the recipes are in the TensorRT Model Optimizer, fused quantize-and-GEMM kernels are in TransformerEngine, and NVFP4 checkpoints of frontier models (including DeepSeek-R1) are published for Blackwell deployment.²

When to use it (and when not)¶

Use PTQ NVFP4 for serving on Blackwell when the eval gate confirms the roughly 1% margin holds on your tasks; it is the cheapest path to the memory and throughput gains.
Use QAT when the model can be built for 4-bit. The blog's frontier examples bake the format into the architecture: DeepSeek-V4 (1.6T-parameter MoE) trains its expert weights and sparse-attention indexer directly in FP4 with quantization-aware training and keeps the rest in FP8; OpenAI's GPT-OSS trains its MoE experts (over 90% of parameters) quantization-aware in MXFP4, which is what fits the 120B model on a single 80 GB GPU. Same tradeoff as the format comparison, decided inside the models.²
Do not expect native speed off Blackwell. The two-level scaling exists as a hardware datapath there; elsewhere FP4 degrades to a storage format with software dequantization on the critical path.
Budget the cast overhead on small layers. Every activation must be quantized just before the multiply; as a standalone kernel that adds an HBM round-trip that can quietly erode the 4x ceiling on memory-bound layers. The fix is fusion (quantization in the GEMM prologue, normalization in the epilogue), which is what the production kernels do.²
Treat FP4 attention as the sharp edge. It works (see Production below) but still carries more accuracy risk than FP4 GEMMs; the blog lists closing that gap as open research.⁵

Architecture¶

One NVFP4 tensor is three things: E2M1 codes, one E4M3 scale per 16-value block, and one FP32 scale per tensor. Quantization divides each block by its effective scale and rounds onto the 15-mark grid; dequantization multiplies back, and on Blackwell the FP4-to-FP8 path is a native datapath rather than a software loop. In a W4A4 layer both operands enter the tensor core in 4-bit and accumulate in higher precision. Two co-design details from the blog matter beyond the GEMM. First, K-smoothing: keys carry channel offsets and outliers that waste the tiny grid, so production recipes center or smooth K before quantization (LongLive-2.0 subtracts each key vector's channel mean; SageAttention3 inherits K-smoothing from SageAttention). Second, the softmax map P is the hardest tensor in the attention path: after softmax its values sit in [0, 1] crammed near zero, so SageAttention3 stretches each row by a per-token FP32 factor (a 1/(448x6) arrangement) to fill the representable range before quantizing, lifting P's cosine similarity from 93.3% to 99.5% while keeping both attention matmuls in NVFP4.⁵

How to use it¶

The deployment path is tooling, not custom kernels: quantize with the TensorRT Model Optimizer's NVFP4 recipes (post-training or quantization-aware), serve with TensorRT-LLM, or train through TransformerEngine's fused quantize-and-GEMM path; published NVFP4 checkpoints skip the quantization step entirely. Pin versions and re-run the eval gate on your own tasks, since the roughly 1% PTQ margin is a report about specific benchmarks, not a guarantee.

How to develop with it¶

The format mechanics are small enough to validate directly, and doing so makes the two design choices legible: finer continuous scales beat power-of-two scales, and block scaling confines outlier damage. This block is executed and asserted: the E2M1 grid is exactly the blog's fifteen values; on a tensor mixing unit-scale values with injected outliers the relative MSE orders NVFP4-style < MXFP4-style < global scale; one outlier disturbs only its own 16-value block under block scaling but almost the whole tensor under a global scale; and the adversarial cases cover an all-zero block (no NaN), deterministic midpoint rounding, and the dequantized-magnitude bound:

# nvfp4_mechanics.py - validated: the E2M1 grid and two-level block scaling that
# define NVFP4, compared against MXFP4-style power-of-two scaling and a single
# global scale. Format-mechanics validation in numpy, not a Blackwell kernel.
import numpy as np

E2M1_POS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GRID = np.array(sorted({-v for v in E2M1_POS} | set(E2M1_POS)))
E4M3_MAX = 448.0


def round_to_grid(x: np.ndarray) -> np.ndarray:
    """Nearest E2M1 grid point; an exact midpoint goes to the even grid index
    (the on-grid analogue of round-to-nearest-even), so ties are deterministic."""
    d = np.abs(np.asarray(x, dtype=np.float64)[..., None] - GRID)
    lo = np.argmin(d, axis=-1)
    dmin = np.take_along_axis(d, lo[..., None], -1)[..., 0]
    hi = np.minimum(lo + 1, GRID.size - 1)
    tie = np.isclose(np.abs(np.asarray(x) - GRID[hi]), dmin) & (hi != lo)
    pick = np.where(tie & (hi % 2 == 0), hi, lo)
    return GRID[pick]


def e4m3_scale(raw: np.ndarray) -> np.ndarray:
    """Quantize a positive block scale to 3 mantissa bits (E4M3-style precision),
    rounding up so the block maximum still fits the 15-mark grid."""
    raw = np.where(raw <= 0.0, 2.0 ** -6, raw)
    e = np.floor(np.log2(raw))
    return np.clip(np.ceil(raw / 2.0 ** e * 8.0) / 8.0 * 2.0 ** e, 2.0 ** -6, E4M3_MAX)


def dequant_global(x: np.ndarray) -> tuple[np.ndarray, float]:
    """One scale for the whole tensor: amax maps to the grid maximum of 6."""
    s = float(np.abs(x).max() / 6.0) or 1.0
    return round_to_grid(x / s) * s, s


def dequant_blocked(x: np.ndarray, block: int, po2: bool,
                    tensor_scale: float | None = None) -> tuple[np.ndarray, np.ndarray]:
    """Per-block scaling. po2=True is MXFP4-style (power-of-two E8M0 per block);
    po2=False is NVFP4-style (E4M3-precision scale per block on top of an FP32
    per-tensor scale that normalizes block scales into FP8 range)."""
    pad = (-x.size) % block
    blocks = np.pad(x.astype(np.float64), (0, pad)).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1)
    if po2:
        s = np.where(amax == 0.0, 1.0, 2.0 ** np.ceil(np.log2(np.where(amax == 0, 1, amax) / 6.0)))
    else:
        st = tensor_scale if tensor_scale is not None else float(np.abs(x).max() / (E4M3_MAX * 6.0)) or 1.0
        s = e4m3_scale(amax / (6.0 * st)) * st
    dq = round_to_grid(blocks / s[:, None]) * s[:, None]
    return dq.reshape(-1)[: x.size], s


def rel_mse(x: np.ndarray, dq: np.ndarray) -> float:
    return float(np.mean((x - dq) ** 2) / np.mean(x ** 2))


# 1) The grid is exactly the blog's fifteen values.
assert GRID.size == 15
assert set(np.abs(GRID).tolist()) == set(E2M1_POS)

# 2) Mixed magnitudes plus injected outliers: NVFP4-style < MXFP4-style < global.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 4096)
x[::128] *= 60.0                                   # 32 outliers, one per 128 values
dq_g, s_g = dequant_global(x)
dq_mx, _ = dequant_blocked(x, block=32, po2=True)
dq_nv, s_nv = dequant_blocked(x, block=16, po2=False)
mses = (rel_mse(x, dq_nv), rel_mse(x, dq_mx), rel_mse(x, dq_g))
assert mses[0] < mses[1] < mses[2], mses
print(f"relative MSE  nvfp4-style {mses[0]:.5f} < mxfp4-style {mses[1]:.5f} < global {mses[2]:.5f}")

# 3) Blast radius of one outlier: local under block scaling, global under one scale.
clean = rng.normal(0.0, 1.0, 512)
dirty = clean.copy()
dirty[40] = 80.0                                   # one outlier, block 2 of 16
nv_clean, _ = dequant_blocked(clean, 16, po2=False, tensor_scale=float(np.abs(dirty).max() / (E4M3_MAX * 6.0)))
nv_dirty, _ = dequant_blocked(dirty, 16, po2=False, tensor_scale=float(np.abs(dirty).max() / (E4M3_MAX * 6.0)))
changed = np.flatnonzero(~np.isclose(nv_clean, nv_dirty))
assert changed.size and set(changed.tolist()) <= set(range(32, 48)), changed
g_clean, _ = dequant_global(clean)
g_dirty, _ = dequant_global(dirty)
outside = np.flatnonzero(~np.isclose(g_clean, g_dirty))
assert np.any((outside < 32) | (outside >= 48)), "global scale must spread the damage"
print(f"outlier blast radius: block-scaled changes confined to its 16-value block "
      f"({changed.size} values); global scale disturbs {outside.size} of 512")

# 4) Adversarial: an all-zero block dequantizes to zeros, no NaN anywhere.
z, _ = dequant_blocked(np.zeros(64), 16, po2=False)
assert np.array_equal(z, np.zeros(64)) and not np.isnan(z).any()

# 5) Deterministic midpoint: 2.5 sits between 2 and 3; the even grid index wins.
assert round_to_grid(np.array([2.5]))[0] == 3.0 == round_to_grid(np.array([2.5]))[0]
assert round_to_grid(np.array([-2.5]))[0] == -3.0

# 6) Dequantized magnitudes never exceed scale times the grid maximum of 6.
assert np.abs(dq_nv).max() <= 6.0 * s_nv.max() + 1e-12
assert np.abs(dq_g).max() <= 6.0 * s_g + 1e-12
print("all NVFP4 mechanics assertions passed")

Output of the run: relative MSE nvfp4-style 0.00671 < mxfp4-style 0.01999 < global 0.05683 (the finer 16-value scale is about 3x more accurate than the power-of-two 32-value scale on this tensor, and about 8x better than one global scale), and the single outlier changed 16 values (its own block) under block scaling versus 464 of 512 under a global scale. Beyond the format itself, the blog's open problems are the development frontier: smarter per-block scale search, rotations and Hadamard transforms for outlier handling, faster fused KV dequantization, and cheap recipes that reach QAT-level quality.¹

How to maintain it¶

Gate on your evals, not the blog's. The reported margins (about 1% PTQ loss, about 5% KV-cache edge over MXFP4) are benchmark-specific; wire NVFP4 promotion through the same eval gate as any model change, with the hardest reasoning and long-context tasks in the set.
Pin the kernel stack as a unit. The gains depend on fused quantize-and-GEMM kernels and the hardware dequant path; TransformerEngine, Model Optimizer, TensorRT-LLM, and the driver move together, so re-benchmark after any bump rather than assuming the 2026 ratios persist.
Track the smoothing recipes. K-smoothing is a shared ingredient with per-system variants (channel-mean subtraction, semantic-aware smoothing, inherited smoothing in the attention kernel); a checkpoint quantized under one recipe does not automatically carry to a stack using another.
Watch the QAT/PTQ boundary per model family. Frontier MoE models increasingly train their experts in 4-bit natively (NVFP4 or MXFP4); for those, the maintenance question inverts from "does PTQ hold" to "which components stayed FP8 and why".

Running it in production¶

The blog's end-to-end proof is video. LongLive-2.0, a 5B autoregressive long-video model built on Wan2.2-TI2V-5B, runs NVFP4 in both training and inference (to the authors' knowledge the first end-to-end NVFP4 long-video recipe) and generates minute-long 720p video in real time: 45.7 FPS versus 3.3 FPS for the 50-step base, about 14x. Training: NVFP4 plus balanced sequence parallelism cuts a 64-second AR-training iteration from 1372.9 s (BF16+SP) to 639.5 s, up to 2.1x, while plain BF16 without SP runs out of memory; the DMD distillation stage co-locates three networks per GPU as frozen W4A4 NVFP4 backbones with only small LoRA adapters trainable (an idea from QeRL), walking peak memory from 70.5 to 49.0 GB. Inference: W4A4 plus an NVFP4 KV cache drops peak memory from 36.4 GB to 19.4 GB, and a fused kernel dequantizes every in-window cache chunk in a single launch, keeping the recurring per-step dequant tax under 2% on a cache about 3.6x smaller.⁴

Attention has climbed the same precision ladder in production kernels: FlashAttention-2 set the FP16/BF16 baseline, FlashAttention-3 added the FP8 path on Hopper (about 1.2 PFLOP/s on H100), SageAttention and SageAttention2 pushed Q/K to INT8 then INT4 (about 2.1x and 3x over FlashAttention-2), and SageAttention3 runs both attention matmuls in NVFP4 on Blackwell: 1038 TOPS on an RTX 5090, roughly 5x the fastest FlashAttention available there, and about 2.4 to 3x end-to-end on video diffusion, with NVFP4 ahead of MXFP4 (99.52% versus 98.37% on the blog's quality figure).⁵ The operational reading: keep the whole path in one format family where possible (an NVFP4 cache feeds an NVFP4 attention kernel with no format boundary), and treat attention precision as a separately gated rollout from GEMM precision.

Failure modes¶

One global scale on an outlier-bearing tensor. The grid's fifteen marks collapse onto empty range and small values quantize to zero; this is the failure block scaling exists to prevent (validated above: 464 of 512 values disturbed by one outlier).
Power-of-two scale rounding. When the ideal scale sits between two powers of two, MXFP4 pays the rounding out of the 4-bit budget; the blog's pretraining comparison (36% more tokens to the same loss) is that tax measured end to end.²
Unfused activation casts. A standalone quantize kernel before every GEMM adds an HBM round-trip; on memory-bound layers the op-overhead tax can consume much of the promised speedup. Confirm the serving stack uses fused prologue quantization.²
FP4 storage without the FP4 datapath. Off Blackwell, INT4/INT2-style caches dequantize in software inside the per-step decode loop; the recurring cost lands exactly where latency hurts most.³
Naive 4-bit softmax. Quantizing P without range stretching wastes nearly the whole grid on [0, 1] values near zero; the blog quantifies the repair (cosine similarity 93.3% to 99.5% with the two-level stretch).⁵
Serial per-chunk cache dequantization. One kernel launch per cached chunk stacks launch latency across the window every step; the fused single-launch rebuild is what keeps the tax under 2% in LongLive-2.0.⁴
Format boundaries in the attention path. A capacity-first INT2 cache feeding an FP4 attention kernel inserts a conversion at the hottest boundary; choosing quality-first NVFP4 end to end avoids it, at twice the cache bits.³

References¶

Huang, Chen, Mao, Wang, Yang, Han (NVIDIA EAI lab), Pushing Intelligence to 4-bit (June 2026): https://research.nvidia.com/labs/eai/blogs/pushing-intelligence-to-4-bit/
NVIDIA Technical Blog, Introducing NVFP4 for Efficient and Accurate Low-Precision Inference: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
NVIDIA Technical Blog, NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit: https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/
NVIDIA Technical Blog, Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache: https://developer.nvidia.com/blog/optimizing-inference-for-long-context-and-large-batch-sizes-with-nvfp4-kv-cache/
Rouhani et al., Microscaling Data Formats for Deep Learning (the MX/MXFP4 formats, arXiv 2310.10537): https://arxiv.org/abs/2310.10537
NVIDIA, Pretraining Large Language Models with NVFP4 (arXiv 2509.25149): https://arxiv.org/abs/2509.25149
Zhang et al., SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training (arXiv 2505.11594): https://arxiv.org/abs/2505.11594
Chen et al., LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation (arXiv 2605.18739): https://arxiv.org/abs/2605.18739 (code: https://github.com/NVlabs/LongLive)
Hooper et al., KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (arXiv 2401.18079): https://arxiv.org/abs/2401.18079
Huang et al., QeRL: Beyond Efficiency, Quantization-enhanced Reinforcement Learning for LLMs (arXiv 2510.11696): https://arxiv.org/abs/2510.11696
Xi et al., Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization (arXiv 2602.02958): https://arxiv.org/abs/2602.02958
NVIDIA TransformerEngine (fused low-precision quantize-and-GEMM kernels): https://github.com/NVIDIA/TransformerEngine
NVIDIA TensorRT Model Optimizer (NVFP4 PTQ and QAT recipes): https://github.com/NVIDIA/TensorRT-Model-Optimizer
DeepSeek-V4-Pro model card (FP4 QAT expert weights): https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
OpenAI gpt-oss-120b model card (MXFP4 QAT experts): https://huggingface.co/openai/gpt-oss-120b

Blog sections 1-2 and closing: E2M1 gives fifteen values {0, +/-0.5, +/-1, +/-1.5, +/-2, +/-3, +/-4, +/-6}; MXFP4 uses an E8M0 power-of-two scale per 32-value block; NVFP4 uses an E4M3 scale per 16-value block plus an FP32 per-tensor scale, about 88% lower quantization error than power-of-two scaling per NVIDIA, at about 4.5 effective bits per value versus MXFP4's 4.25; open problems listed: scale search, rotations/Hadamard transforms and outlier handling, KV dequant efficiency, FP4 attention quality, and closing the QAT/PTQ gap. ↩↩↩
Blog section 3: W4A4 GEMMs on Blackwell at up to 4x the BF16 throughput ceiling and a quarter of the memory traffic; up to about 3x FP8 peak inference throughput and about 1.9x FP8 pretraining speed (Llama-3.1 405B); MXFP4 needed 36% more tokens than NVFP4 to reach the same pretraining loss; DeepSeek-R1 PTQ within about 1% of FP8 (MMLU-Pro 85 to 84, GPQA 81 to 80, AIME 2024 89 to 91); fused activation-quantization prologues via TransformerEngine and recipes via TensorRT Model Optimizer; DeepSeek-V4 (1.6T MoE) trains expert weights and its sparse-attention indexer in FP4 QAT with FP8 elsewhere; GPT-OSS trains MoE experts (over 90% of parameters) QAT in MXFP4, fitting the 120B model on one 80 GB GPU. ↩↩↩↩↩↩↩↩
Blog section 5: NVFP4 KV cache up to about 50% smaller than FP8 at under 1% accuracy loss and about 5% better than an MXFP4 cache; Blackwell dequantizes FP4 to FP8 natively while generic INT4/INT2 caches dequantize in software; K-smoothing (channel-mean subtraction, semantic-aware smoothing) tightens key ranges before quantization; KVQuant reaches 10M-token contexts near 3 bits; Quant VideoGen stores AR-video KV at 2 bits (up to 7x smaller, under 4% latency overhead) as the capacity-first contrast to quality-first NVFP4. ↩↩↩
LongLive-2.0 (arXiv 2605.18739), as summarized by the blog: 5B AR video model on Wan2.2-TI2V-5B, NVFP4 in training and inference; 45.7 FPS at 1280x720 versus 3.3 FPS for the 50-step base (about 14x, about 2x real time); 64 s AR-training iteration 1372.9 s (BF16+SP) to 1196.5 s (balanced SP) to 639.5 s (NVFP4 + balanced SP, up to 2.1x; BF16 without SP OOMs); DMD distillation with three co-located frozen W4A4 backbones and trainable LoRA only, peak memory 70.5 to 49.0 GB; adaptive 4-or-6 block-scale search; inference peak memory 36.4 (BF16, 24.8 FPS) to 29.7 (+NVFP4, 32.0 FPS) to 19.4 GB (+NVFP4 KV cache, 29.7 FPS); fused single-launch window dequantization keeps overhead under 2% on a cache about 3.6x smaller. ↩↩
Blog section 6: FlashAttention-2 (FP16/BF16 exact baseline); FlashAttention-3 FP8 on Hopper at about 1.2 PFLOP/s (H100); SageAttention INT8 QK at about 2.1x FlashAttention-2; SageAttention2 INT4 Q/K with FP8 P/V at about 3x; SageAttention3 runs both matmuls in NVFP4 on Blackwell at 1038 TOPS on an RTX 5090 (about 5x the fastest FlashAttention there, about 2.4 to 3x end-to-end on video diffusion); the softmax map P in [0, 1] is stretched per token in FP32 (a 1/(448x6) arrangement) before 4-bit quantization, raising cosine similarity from 93.3% to 99.5%, NVFP4 99.52% versus MXFP4 98.37%. ↩↩↩↩