KV cache quantization for autoregressive video generation¶
Scope: the video half of NVIDIA's KV-compression infrastructure story, centered on Quant VideoGen (QVG, arXiv 2602.02958, ICML 2026): why autoregressive video diffusion models grow LLM-shaped KV caches that outgrow their own weights within seconds of footage, how semantic-aware smoothing and progressive residual quantization make a 2-bit cache survivable, the Triton kernel design, and what running it in production requires. The LLM-side compression walls (FlashAttention scores, paged reclamation) are KV cache token eviction; the quality-first NVFP4 cache and the LongLive-2.0 recipe are NVFP4 quantization; dtype and pool mechanics for text serving are KV cache management.
Benchmark, accuracy, and overhead numbers on this page are the cited paper's, repo's, and blog's, not reproduced here. The numpy block is executed and asserted, including adversarial cases; it validates the core mechanism (smoothing, progressive residuals, the compression arithmetic), not the Triton kernels. Repo pointers are as of 2026-07; pin versions and validate before production use.
What it is¶
Autoregressive video diffusion models (LongCat-Video, HY-WorldPlay, Self-Forcing-style causal Wan variants) generate footage chunk by chunk, and each new chunk attends over cached K/V for every latent token of every previous frame. Tokens here are spatial patches, so the cache grows as 2 x layers x (H x W x T) x d_model x bytes: linear in frames, like an LLM trace, but with tens of thousands of tokens per few seconds of video. The paper's worked example: 5 seconds of 480p on LongCat-Video is roughly 38K latent tokens and roughly 34 GB of KV cache, which is more than the 13B model's own 27 GB of BF16 weights. The cache, not the model, is the binding constraint on generation length.12
Quant VideoGen (QVG) is a training-free framework that stores that cache at 2 (or 4) bits per element and dequantizes on the fly at attention time. Weights are untouched and no fine-tuning or calibration run is needed. Two ideas carry it:2
- Semantic-aware smoothing. A uniform quantizer's error is proportional to its scale, and the scale is set by the largest magnitude in each group. Video gives you a way to shrink that magnitude: adjacent frames are nearly identical, so k-means over the chunk's tokens (256 centroids) groups near-duplicates, the shared structure moves into the centroids, and what remains to quantize is each token's small, homogeneous residual from its centroid. This one move cuts key-cache quantization error by about 6.9x and value-cache error by about 2.6x.
- Progressive residual quantization. Smoothing can be reapplied to its own residuals. The shipped operating points are QVG (one stage, quantization group size 64) and QVG-Pro (four stages, group size 16); the final-stage residual is quantized to symmetric INT2 with one FP8 (E4M3) scale per group. The first stage does most of the work (5.83x MSE reduction in the paper's ablation) and later stages add at least 1.10x each, with diminishing returns.
The compressed cache holds four things per stage-and-chunk: packed INT2 codes, FP8 per-group scales, BF16 centroids (256 x d_model), and a uint8 centroid assignment per token. Codes dominate at 65% or more of the bytes, and the metadata is exactly why the delivered ratio is about 7x rather than the naive 16-bit-to-2-bit 8x.2
Why use it¶
Because the cache exceeds the weights, every gigabyte saved converts directly into longer horizons, higher resolution, or a cheaper GPU for the same job. QVG's published results, INT2 configuration: 6.94x cache compression on LongCat-Video-13B (PSNR 28.72 against the BF16 output) and 7.05x on HY-WorldPlay-8B (PSNR 29.17), with end-to-end latency overhead of 2.1%, 1.5%, and 4.3% on LongCat-Video, HY-WorldPlay, and Self-Forcing respectively (the abstract's headline is "under 4%"; the Self-Forcing worst case slightly exceeds it). The INT4 configuration trades ratio for fidelity: 3.72x at PSNR 37.14 on LongCat-Video, 3.75x at PSNR 34.45 on HY-WorldPlay. The repo reports 6.21x to 7.01x across its four model integrations.23
The long-horizon result is the sharper argument. Generic 2-bit KV quantization built for text (KIVI) holds up on short clips and then collapses as errors compound across autoregressive chunks: at 700 frames on Self-Forcing, VBench image quality is 69.52 for QVG versus 57.18 for KIVI, and at 1,400 frames QVG holds 67.28 while KIVI degrades to 35.85 (the BF16 baseline sits at 65.87). Exploiting video's temporal redundancy is not an optimization on top of standard KV quantization; it is what makes 2 bits survivable at all on this workload.24
When to use it (and when not)¶
- Use it on autoregressive or causal video generators whose per-frame KV growth is what limits generation length: long-horizon video, interactive world models, streaming generation. It needs no retraining, so it also fits models you cannot fine-tune.
- Use INT4 first if the fidelity budget is tight. Roughly half the savings for about 8 PSNR points; step down to INT2 only after the quality gate passes at your horizon.2
- On Blackwell with a quality-first posture, weigh the NVFP4 cache instead. LongLive-2.0 quantizes weights and KV to NVFP4 on a 5B generator for a blog-reported 1.84x throughput at negligible quality cost, riding the native FP4-to-FP8 dequant datapath; QVG's Triton INT2 path runs on stock CUDA GPUs but dequantizes in software. See NVFP4 quantization.15
- Do not deploy it for latency. It costs 1.5-4.3% end to end; the payoff is memory and horizon, never speed.2
- Do not use it on single-shot bidirectional diffusion. With no autoregressive cache growing across steps, there is nothing for it to compress.
- Do not transplant it to text LLM caches. The lever is near-duplication of adjacent frames; text tokens carry no such structure, and the established text options (FP8 KV, KIVI-style channel/token quantization) already live in KV cache management and quantization for inference.
- Measure first on content with hard cuts or heavy motion. The whole mechanism rides on temporal redundancy; the validation below shows the gain collapsing when that structure is absent, and the paper does not characterize scene-cut behaviour.
Architecture¶
flowchart TB
X["KV chunk: one AR step's latent frame tokens (12-20 frames)"] --> KM["k-means over tokens (256 centroids, warm-started from previous chunk)"]
KM --> META["Metadata: BF16 centroids + uint8 assignment per token"]
KM --> RES["Residual = token - centroid (small, homogeneous)"]
RES -->|"QVG-Pro: re-smooth, up to S=4 stages"| KM
RES -->|"final stage"| Q["Symmetric INT2 codes + FP8 (E4M3) scale per group (B=64 or 16)"]
Q --> MEM["Compressed cache: codes 65%+ of bytes, ~7x smaller than BF16"]
META --> MEM
MEM --> DQ["Fused Triton dequant at attention read: codes to residuals, add centroids stage by stage, intermediates in registers"]
DQ --> ATT["Attention over reconstructed K/V for the next chunk"]
Compression runs once per generated chunk, at the model's native chunking (20 frames on LongCat-Video's 73-frame window, 12 on HY-WorldPlay, 16 on Self-Forcing). Two kernel-side decisions keep the recurring cost small:2
- Centroid caching. Each chunk's k-means warm-starts from the previous chunk's centroids instead of a cold init, cutting the clustering overhead by about 3x. Adjacent chunks are as redundant as adjacent frames, so the old codebook is already close.
- Fused staged dequantization. Reconstruction inverts the pipeline (dequantize codes, add back centroids stage by stage), and a naive implementation would round-trip each stage through global memory. The fused kernel applies all stages in one pass with intermediates held in registers. This is the same lesson LongLive-2.0's single-launch window dequantizer teaches on the NVFP4 side: the per-step dequant tax is a launch-and-bandwidth problem before it is a math problem (NVFP4 quantization).5
The block below validates the mechanism end to end on a stock python3 with numpy: quantizer error tracks the group scale (one outlier forces every neighbour to the zero code); smoothing wins by orders of magnitude on redundant video-shaped tokens and by nothing on structureless ones; the smoothing stages are lossless bookkeeping (all loss lives in the final residual codes); progressive stages help monotonically with diminishing returns; warm-started k-means beats a cold start; and the byte accounting lands at about 7x for the QVG layout, below the codes-only 8x.
# Runnable on system python3 (numpy). Core mechanism: semantic-aware smoothing
# (k-means grouping + centroid subtraction) makes near-duplicate video tokens
# INT2-quantizable; all loss lives in the final residual codes, and the metadata
# (centroids, assignments, scales) is what pulls the 8x raw ratio down to ~7x.
import numpy as np
def quantize_symmetric(x: np.ndarray, bits: int, group: int) -> np.ndarray:
"""Per-group symmetric quantization: s = max|x| / (2**(bits-1) - 1)."""
assert x.ndim == 2 and x.shape[1] % group == 0
qmax = 2 ** (bits - 1) - 1
g = x.reshape(x.shape[0], -1, group)
s = np.abs(g).max(axis=2, keepdims=True) / qmax
s = np.where(s == 0, 1.0, s) # all-zero group: codes stay 0
codes = np.clip(np.round(g / s), -qmax, qmax)
return (codes * s).reshape(x.shape)
def kmeans(x: np.ndarray, k: int, iters: int,
centroids: np.ndarray | None = None,
seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
"""Lloyd k-means; warm-startable; empty clusters keep their centroid."""
if centroids is None:
rng = np.random.default_rng(seed)
centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
else:
centroids = centroids.copy()
labels = np.zeros(len(x), dtype=np.int64)
for _ in range(iters):
d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)
for c in range(k):
members = x[labels == c]
if len(members):
centroids[c] = members.mean(axis=0)
d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
return d2.argmin(axis=1), centroids
def inertia(x: np.ndarray, labels: np.ndarray, centroids: np.ndarray) -> float:
return float(((x - centroids[labels]) ** 2).sum())
def qvg_compress(x: np.ndarray, stages: int, k: int, bits: int, group: int,
iters: int = 8) -> tuple[list[tuple[np.ndarray, np.ndarray]], np.ndarray]:
"""S smoothing stages, then symmetric low-bit quantization of the residual."""
residual = x
meta: list[tuple[np.ndarray, np.ndarray]] = []
for s in range(stages):
labels, cents = kmeans(residual, k, iters, seed=s)
meta.append((labels, cents))
residual = residual - cents[labels]
return meta, quantize_symmetric(residual, bits, group)
def qvg_decompress(meta: list[tuple[np.ndarray, np.ndarray]],
deq_residual: np.ndarray) -> np.ndarray:
x_hat = deq_residual
for labels, cents in reversed(meta):
x_hat = x_hat + cents[labels]
return x_hat
def mse(a: np.ndarray, b: np.ndarray) -> float:
return float(((a - b) ** 2).mean())
rng = np.random.default_rng(11)
N, D, K, GROUP = 512, 64, 16, 16
# 1. Quantization error tracks the scale (the paper's eq. 4 motivation): the
# quantizer is scale-equivariant, so 10x the data means exactly 10x the error;
# and one outlier in a group stretches that group's scale so every OTHER
# element in the group gets worse.
x = rng.standard_normal((N, D))
dq1 = quantize_symmetric(x, bits=2, group=GROUP)
dq10 = quantize_symmetric(10.0 * x, bits=2, group=GROUP)
assert np.allclose(dq10, 10.0 * dq1)
assert np.isclose(mse(10.0 * x, dq10), 100.0 * mse(x, dq1))
spiked = x.copy()
spiked[:, 0] = 50.0 # one hot element per first group
dq_spiked = quantize_symmetric(spiked, bits=2, group=GROUP)
assert (dq_spiked[:, 1:GROUP] == 0).all() # scale 50: everything else -> code 0
assert mse(spiked[:, 1:GROUP], dq_spiked[:, 1:GROUP]) > 2.0 * mse(x[:, 1:GROUP], dq1[:, 1:GROUP])
# 2. The video case: adjacent-frame tokens are near-duplicates. 8 spatial
# patterns, each drifting slowly across frames plus small noise. Smoothing
# (S=1) beats direct INT2 by a wide margin because residuals are small and
# homogeneous after the shared structure moves into the centroids.
patterns = 3.0 * rng.standard_normal((8, D))
assign_true = rng.integers(0, 8, size=N)
drift = 0.05 * rng.standard_normal((N, D)).cumsum(axis=0) / np.sqrt(np.arange(1, N + 1))[:, None]
video = patterns[assign_true] + drift + 0.05 * rng.standard_normal((N, D))
direct = quantize_symmetric(video, bits=2, group=GROUP)
meta_v, deq_res_v = qvg_compress(video, stages=1, k=K, bits=2, group=GROUP)
smoothed = qvg_decompress(meta_v, deq_res_v)
gain_video = mse(video, direct) / mse(video, smoothed)
assert gain_video > 5.0, gain_video
# 3. Smoothing is lossless bookkeeping: reconstruction error equals the final
# residual quantization error exactly; the stages themselves add no loss.
residual_v = video.copy()
for labels, cents in meta_v:
residual_v = residual_v - cents[labels]
assert np.isclose(mse(video, smoothed), mse(residual_v, deq_res_v))
# 4. Progressive stages: hierarchical structure (coarse pattern + fine offset)
# gives stage 2 something left to explain. MSE falls monotonically with S and
# the first stage dominates (the paper: 5.83x from stage 1, >=1.10x after).
fine = 0.6 * rng.standard_normal((4, D))
hier = patterns[assign_true] + fine[rng.integers(0, 4, size=N)] \
+ 0.02 * rng.standard_normal((N, D))
errs = []
for s in (0, 1, 2, 3):
if s == 0:
rec = quantize_symmetric(hier, bits=2, group=GROUP)
else:
m, dq = qvg_compress(hier, stages=s, k=K, bits=2, group=GROUP)
rec = qvg_decompress(m, dq)
errs.append(mse(hier, rec))
assert errs[0] > errs[1] > errs[2] >= errs[3]
gain_s1, gain_s2 = errs[0] / errs[1], errs[1] / errs[2]
assert gain_s1 > gain_s2 > 1.0, (gain_s1, gain_s2)
# 5. Adversarial: no temporal redundancy (i.i.d. tokens, a scene-cut-like chunk).
# The smoothing gain collapses toward 1; the video-shaped chunk must beat the
# structureless one by a wide margin. The lever IS the redundancy.
noise_chunk = 3.0 * rng.standard_normal((N, D))
direct_n = quantize_symmetric(noise_chunk, bits=2, group=GROUP)
meta_n, deq_res_n = qvg_compress(noise_chunk, stages=1, k=K, bits=2, group=GROUP)
gain_noise = mse(noise_chunk, direct_n) / mse(noise_chunk, qvg_decompress(meta_n, deq_res_n))
assert gain_video > 3.0 * gain_noise, (gain_video, gain_noise)
# 6. Edge cases: identical tokens reconstruct exactly (residual is zero), and an
# all-zero chunk produces no NaN and exact zeros.
same = np.tile(video[0], (32, 1))
m_same, dq_same = qvg_compress(same, stages=1, k=4, bits=2, group=GROUP)
assert np.allclose(qvg_decompress(m_same, dq_same), same)
zeros = np.zeros((32, D))
m_z, dq_z = qvg_compress(zeros, stages=1, k=4, bits=2, group=GROUP)
out_z = qvg_decompress(m_z, dq_z)
assert np.isfinite(out_z).all() and np.allclose(out_z, zeros)
# 7. Centroid caching: warm-starting the next chunk's k-means from this chunk's
# centroids reaches a better clustering in one iteration than a cold start.
chunk2 = video + 0.05 * rng.standard_normal((N, D))
lab_cold, cen_cold = kmeans(chunk2, K, iters=1, seed=99)
lab_warm, cen_warm = kmeans(chunk2, K, iters=1, centroids=meta_v[0][1])
assert inertia(chunk2, lab_warm, cen_warm) < inertia(chunk2, lab_cold, cen_cold)
# 8. Compression arithmetic for a representative layer slice (N tokens, d model
# dim), QVG layout: INT2 codes packed 4/byte, one FP8 scale per 64-value
# group, one uint8 assignment per token per stage, BF16 centroids (256 x d)
# per stage. Metadata pulls the raw 8x down to ~7x (QVG) and further down for
# QVG-Pro (S=4, B=16); codes stay >=65% of the compressed bytes (Figure 7a).
def compressed_bytes(n: int, d: int, stages: int, block: int, k: int) -> tuple[float, float]:
codes = n * d / 4
scales = n * d / block
meta_b = stages * (n + k * d * 2)
return codes + scales + meta_b, codes
n_tok, d_model = 38_400, 4_096 # ~5 s of 480p latents, 13B-class dim
bf16 = n_tok * d_model * 2
qvg_total, qvg_codes = compressed_bytes(n_tok, d_model, stages=1, block=64, k=256)
pro_total, pro_codes = compressed_bytes(n_tok, d_model, stages=4, block=16, k=256)
ratio_qvg, ratio_pro = bf16 / qvg_total, bf16 / pro_total
assert bf16 / qvg_codes == 8.0 # codes alone would be exactly 8x
assert 6.5 < ratio_qvg < 7.6, ratio_qvg
assert ratio_pro < ratio_qvg # more stages, more metadata
assert qvg_codes / qvg_total >= 0.65 and pro_codes / pro_total >= 0.65
print("qvg OK:", f"smoothing gain {gain_video:.1f}x on video vs {gain_noise:.2f}x on noise;",
f"stage gains {gain_s1:.2f}x then {gain_s2:.2f}x;",
f"compression {ratio_qvg:.2f}x (QVG) / {ratio_pro:.2f}x (Pro) vs 8x codes-only")
Output of the run: smoothing gain 3200.7x on the redundant video-shaped chunk versus 1.11x on the structureless one; stage gains 59.18x then 15.85x (monotone, diminishing); layout arithmetic 7.16x (QVG) and 5.45x (QVG-Pro) against the codes-only 8x. The synthetic clusters are far cleaner than a real cache, which is why the measured gains dwarf the paper's 6.9x/2.6x error reductions; the ordering and the collapse without redundancy are the transferable facts.
How to use it¶
- Start from the repo's per-model scripts.
github.com/svg-project/Quant-VideoGenships paired bash scripts per integration (BF16 baseline and quantized) for LongCat-Video, Self-Forcing, HY-WorldPlay, and LingBot-World, with block size (64) and centroid count (256) configurable in-script and outputs landing inresults/. Python 3.12 era, conda or uv, FlashAttention recommended.3 - Run the baseline first. Every fidelity number is relative to the BF16 output of the same model on the same prompts; without the baseline run there is nothing to gate against.
- Pick the operating point by quality budget, not ratio. INT4 (about 3.7x) at PSNR 34-37, INT2 (about 7x) at PSNR 28-29 on the paper's models; QVG-Pro spends metadata for fidelity at tight group size. Re-measure on your own model and content; the numbers do not transfer.2
- Evaluate at the deployment horizon. Fidelity (PSNR, SSIM, LPIPS against BF16) plus perceptual VBench dimensions (subject/background consistency, image and aesthetic quality), on clips as long as production will generate. Short-clip parity says nothing about frame 1,400.2
How to develop with it¶
- Two integration surfaces. Compression hooks the KV write path at each chunk boundary (cluster, subtract, quantize, store codes plus metadata); decompression hooks the attention read path (fused dequant reconstructing K/V before the kernel). Nothing else in the generator changes; that is what training-free buys.
- Keep the chunk aligned to the model's native AR step (12-20 frames on the paper's integrations). Smaller chunks mean more k-means passes and proportionally more metadata (a 6-frame chunk costs 3.3% overhead in the ablation); larger chunks amortize both.2
- Carry centroids across chunks. The warm start is worth about 3x on clustering cost, validated above in miniature. Treat the previous chunk's codebook as part of the cache state, including across any engine restarts mid-session.2
- Resolution scales the token count, not the codebook. At 720p the paper holds 256 centroids and lets the chunk's token count grow, landing at PSNR 25.99 under INT2; budget quality headroom before raising resolution.2
- Unit-test against the invariants, not the pictures. The numpy block above is an oracle: reconstruction error must equal final-residual quantization error exactly (smoothing stages are lossless bookkeeping), the quantizer must be scale-equivariant, and the byte accounting must match the configured layout. A fused kernel that violates any of these is wrong before any video is rendered.
How to run it in production¶
- Gate promotion on fidelity at horizon. The failure mode is silent quality drift late in long generations, not errors. Compare PSNR/LPIPS against BF16 on a fixed prompt slice at full production length, and keep a no-reference metric only as a secondary signal: VBench image quality can score the quantized model above the BF16 baseline (67.28 versus 65.87 at 1,400 frames) while fidelity to the baseline has genuinely degraded.2
- Plan capacity with the measured ratio, including metadata. About 7x for the QVG layout, less for QVG-Pro and small chunks; never the naive 8x. The validation block's arithmetic is the sizing oracle.
- Track the overhead share. The 1.5-4.3% end-to-end tax rides on the fused kernels and centroid caching; a Triton or driver bump that de-optimizes either shows up as a growing quant/dequant share at flat memory. Benchmark per release on one replica before fleet rollout.
- Canary content that stresses the assumption. Hard cuts, rapid motion, and scene resets minimize inter-frame redundancy, which is the entire mechanism. Route a stress slice through the canary alongside the standard eval.
- Verify savings at the allocator, not the device. Engines that pre-reserve GPU memory show flat device-level usage regardless of cache contents; judge the win by cache byte accounting, the same wrong-gauge lesson as the no-savings runbook.
How to maintain it¶
- Pin the repo, Triton, and torch together. The kernels are Triton, so performance is re-derived per GPU generation and per compiler release; re-run the overhead benchmark after any bump.
- Re-gate on every checkpoint swap. There are no calibration artifacts to rebuild (the method is training-free and clusters at runtime), but the quality operating point moves with the model; re-run the fidelity gate at horizon before promoting a new checkpoint.
- Track upstream integrations and paper revisions. The repo added LingBot-World after the paper's evaluation set, and the arXiv entry has revised several times (v5 as of May 2026); re-verify headline numbers against the current version before quoting them in capacity plans.32
- Watch the NVFP4 path as hardware turns over. On Blackwell fleets the native FP4 datapath shifts the trade toward NVFP4 caches for quality-first deployments; revisit the INT2-vs-NVFP4 decision at each hardware refresh (NVFP4 quantization).5
Failure modes¶
- Redundancy collapse. A chunk with no temporal structure (hard cut, scene reset, chaotic motion) leaves residuals as large as the signal, and INT2 error reverts toward direct 2-bit quantization (validated above: 1.11x gain on structureless tokens versus three orders of magnitude on redundant ones). The paper does not characterize scene-cut behaviour; measure on your content mix before trusting the headline ratios.
- Sizing at 8x. Centroids, assignments, and scales are real bytes; the delivered ratio is about 7x for QVG and lower for QVG-Pro and small chunks. Plans written at the naive bits ratio come up short exactly when the cache is fullest.
- Stale warm-start across a content boundary. Centroid caching assumes the next chunk resembles this one; after a hard cut the inherited codebook starts wrong and k-means pays extra iterations to recover. The 3x saving is an average over redundant content, not a guarantee.
- A text-born quantizer transplanted to video. KIVI-style 2-bit caches look fine on short clips and collapse over long horizons as autoregressive error compounds (35.85 versus 67.28 VBench image quality at 1,400 frames). Bit width is not the design; the video-aware structure is.24
- Unfused dequantization on the hot path. One launch per cache block, or per-stage round-trips through global memory, turns the recurring dequant tax from under a few percent into the bottleneck. The fused staged kernel with register-resident intermediates (and LongLive-2.0's single-launch analogue) is the required pattern, not an optimization.25
- Quality gated on short clips. Degradation is horizon-dependent by mechanism; a gate that stops at a few seconds certifies nothing about production-length sessions.
- Savings judged at the wrong layer. Pre-reserved device memory does not move when the cache shrinks; a flat
nvidia-smireading says nothing either way (the no-savings runbook).
References¶
- NVIDIA Efficient AI Lab, "KV Cache Compression and Its Infra Problems" (June 2026): https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/
- Xi et al., "Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization" (ICML 2026, arXiv:2602.02958): https://arxiv.org/abs/2602.02958
- Quant-VideoGen code: https://github.com/svg-project/Quant-VideoGen
- Chen et al., "LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation" (arXiv:2605.18739): https://arxiv.org/abs/2605.18739 (code: https://github.com/NVlabs/LongLive)
- Liu et al., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache" (ICML 2024, arXiv:2402.02750): https://arxiv.org/abs/2402.02750
Related: NVFP4 Quantization · KV Cache Token Eviction · KV-Cache Management · Quantization for Inference · No-Savings Runbook · Inference KV-Cache OOM · vLLM-Omni · LLM Inference Efficiency · Glossary
-
NVIDIA Efficient AI Lab, "KV Cache Compression and Its Infra Problems" (June 2026), video section: tokens are spatial patches and AR generators cache KV per frame exactly as LLMs cache past tokens; within seconds of 480p the cache outgrows model weights; Quant VideoGen summarized as up to 7x compression, no retraining, small latency overhead; LongLive 2.0 summarized as NVFP4 weights plus KV on a 5B generator at 1.84x throughput with a fused parallel dequantization kernel keeping quant/dequant overhead below 2%. ↩↩
-
Xi et al., "Quant VideoGen" (arXiv:2602.02958, ICML 2026). Cache formula and the 38K-token / ~34 GB vs 27 GB LongCat-Video example; semantic-aware smoothing (k-means over chunk tokens, 256 centroids, centroid subtraction; K-cache error down ~6.9x, V-cache ~2.6x); progressive residual quantization (QVG: S=1, group 64; QVG-Pro: S=4, group 16; symmetric INT2/INT4 with FP8 E4M3 per-group scales; stage ablation 5.83x then >=1.10x); centroid caching (~3x k-means overhead cut); fused staged dequant with register-resident intermediates; INT2 results 6.94x/PSNR 28.72 (LongCat-Video-13B) and 7.05x/PSNR 29.17 (HY-WorldPlay-8B), INT4 3.72x/37.14 and 3.75x/34.45; latency overhead 2.1%/1.5%/4.3% (LongCat-Video/HY-WorldPlay/Self-Forcing); chunk sizes 20/12/16 frames; long-horizon VBench image quality at 700 frames 69.52 (QVG) vs 57.18 (KIVI) and at 1,400 frames 67.28 vs 35.85 with BF16 at 65.87; 720p at PSNR 25.99 with unchanged centroid count; compressed bytes dominated (>=65%) by codes; 6-frame chunks at 3.3% overhead. ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩
-
github.com/svg-project/Quant-VideoGen README: four integrations (LongCat-Video, Self-Forcing, HY-WorldPlay, LingBot-World) at 6.89x/6.97x/7.01x/6.21x cache compression, ~85% memory reduction claim, Triton k-means and staged product-quantization kernels, per-model baseline-vs-quantized scripts, block size 64 and 256 centroids as defaults. ↩↩↩
-
Liu et al., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache," ICML 2024. Per-channel key and per-token value 2-bit quantization for text LLM caches; used by the QVG paper as the generic-2-bit baseline that degrades over long video horizons. https://arxiv.org/abs/2402.02750 ↩↩
-
Chen et al., "LongLive-2.0" (arXiv:2605.18739), as summarized by the blog: NVFP4 weights and KV cache on a 5B generator, 1.84x throughput at negligible quality cost, fused parallel dequantization kernel mapping threads one-to-one onto packed FP4 pairs across the cache window in a single launch, quant/dequant overhead under 2%. Full treatment on NVFP4 quantization. ↩↩↩↩