
Training-data curation & decontamination

Scope: turning raw or synthetic data into a training set that helps rather than hurts, through deduplication (exact, fuzzy/MinHash, semantic), quality filtering, benchmark decontamination, and dataset mixing, plus the GPU-accelerated tooling that does it at scale. This is the filtering half of the data engine behind SFT, distillation, and synthetic data, and it is the difference between a dataset that lifts the model and one that leaks the eval or collapses it.

The datasketch / NeMo Curator / datatrove snippets are reference templates on real APIs: pin versions and validate before production use. The pure-Python and numpy blocks below are runnable and self-checked.

What it is

Curation is the pipeline that stands between raw data (scraped or synthetic) and the trainer. It runs a sequence of stages, each dropping or reshaping examples:

Extract & clean: pull text from HTML, strip boilerplate, normalize.
Language & format ID: keep the target languages/formats.
Quality filtering: heuristics (length, symbol ratio, repetition) and learned classifiers; FineWeb-Edu's classifier-based educational filter drove large MMLU/ARC gains, showing quality filtering beats raw volume.¹
Deduplication: exact (hashing), fuzzy (MinHash/LSH on n-gram shingles), and semantic (embedding-based); SemDeDup reports removing ~50% of web data with minimal performance loss while improving out-of-distribution performance.²
Decontamination: remove any training example overlapping an evaluation benchmark, so reported scores are real.
Mixing: weight domains deliberately (upsample high-value sources) into the final blend.

Tools do this at scale: NVIDIA NeMo Curator (GPU-accelerated via RAPIDS + Ray) and datatrove (platform-agnostic pipeline blocks, Slurm/Ray executors).³⁴
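The stage sequence above can be sketched as a list of named per-row functions. This is a minimal illustration, not how NeMo Curator or datatrove structure their pipelines: `run_pipeline`, `clean`, and `min_length` are hypothetical names, and the toy stages stand in for real extractors and filters. Counting drops per stage keeps losses visible.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A stage takes a row and returns it (possibly reshaped), or None to drop it.
Stage = Callable[[dict], Optional[dict]]

def run_pipeline(rows: Iterable[dict], stages: list[tuple[str, Stage]]):
    """Apply stages in order and count how many rows each stage drops."""
    kept, dropped = [], Counter()
    for row in rows:
        for name, stage in stages:
            row = stage(row)
            if row is None:
                dropped[name] += 1      # attribute the drop to the stage that caused it
                break
        else:
            kept.append(row)            # survived every stage
    return kept, dropped

# Hypothetical toy stages (assumptions for illustration only).
def clean(row):
    text = " ".join(row["text"].split())            # collapse whitespace runs
    return {**row, "text": text} if text else None  # drop rows that end up empty

def min_length(row, min_tokens=3):
    return row if len(row["text"].split()) >= min_tokens else None

rows = [{"text": "  a  b   c d "}, {"text": "   "}, {"text": "too short"}]
kept, dropped = run_pipeline(rows, [("clean", clean), ("min_length", min_length)])
print(kept, dict(dropped))
```

Per-row stages cover cleaning and filtering. Dedup and decontamination need state across rows (a seen-set or a banned n-gram set), so in this shape they would be closures over that state or separate passes.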

Why use it

Quality dominates quantity. Classifier-filtered data (FineWeb-Edu) beats larger unfiltered corpora on knowledge/reasoning benchmarks; the same lesson holds for SFT sets.¹
Dedup buys efficiency and cuts memorization. Removing near-duplicates can shrink the training set substantially at little cost (SemDeDup reports ~50% on web data) and reduces verbatim memorization and over-weighting of repeated text.²
Decontamination protects the eval gate. Training on data that overlaps your benchmarks inflates scores and invalidates the promotion decision (evaluation integrity); it is the one stage you cannot skip if you report numbers.
Cost. Less-but-better data means fewer GPU-hours to the same quality.

When to use it (and when not)

Use it for every non-trivial training set: pretraining, continued pretraining, and even a small SFT mix.
Even tiny SFT/preference sets need decontamination against your evals and exact dedup; these are cheap and non-negotiable.
Decontamination becomes unskippable the moment you quote a benchmark; every other stage trades effort for quality.
Do not over-curate. Aggressive dedup or a biased quality classifier can strip valid diversity (dialects, rare domains); measure what you drop.

Architecture

flowchart LR
  RAW["Raw / synthetic data"] --> EXT["Extract + clean"]
  EXT --> LANG["Language / format ID"]
  LANG --> QUAL["Quality filter (heuristic + classifier)"]
  QUAL --> DEDUP["Dedup: exact, fuzzy (MinHash), semantic"]
  DEDUP --> DECON["Decontaminate vs eval benchmarks"]
  DECON --> MIX["Mix / reweight domains"]
  MIX --> OUT["Training set"]
  DECON -.->|"overlap found, drop"| RAW

How to use it

At scale, drive a pipeline with NeMo Curator (GPU) or datatrove (CPU/Slurm/Ray). For the two stages you must get right, fuzzy dedup and decontamination, the mechanics are small enough to show directly. Fuzzy dedup with MinHash/LSH catches near-duplicates that exact hashing misses; in production you reach for datasketch:

# fuzzy_dedup.py: MinHash + LSH near-duplicate removal (reference template, needs `pip install datasketch`).
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    toks = text.split()
    return {" ".join(toks[i:i+k]) for i in range(max(1, len(toks) - k + 1))}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

def dedup(rows, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, r in enumerate(rows):
        mh = minhash(r["text"], num_perm)
        if lsh.query(mh):            # a near-duplicate already kept
            continue
        lsh.insert(str(i), mh)
        kept.append(r)
    return kept

datasketch is not magic: a MinHash signature estimates Jaccard similarity by the fraction of matching min-hash slots, and LSH banding turns that estimate into a near-linear candidate search. Here is that core math in pure numpy, checked against a slow exact-Jaccard reference and at the boundary cases (identical and fully disjoint documents):

# minhash_lsh.py: the core math behind fuzzy dedup -- MinHash Jaccard estimation + LSH banding.
# Pure numpy + hashlib: what datasketch does under the hood, made explicit and checked.
import hashlib
import numpy as np

P = (1 << 61) - 1  # Mersenne prime modulus (M61) for universal hashing a*x+b mod P

def shingles(text, k=5):
    toks = text.split()
    return {" ".join(toks[i:i + k]) for i in range(max(1, len(toks) - k + 1))}

def _base_hash(s):
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def make_perms(num_perm, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.integers(1, P, size=num_perm, dtype=np.uint64).astype(object)
    b = rng.integers(0, P, size=num_perm, dtype=np.uint64).astype(object)
    return a, b

def signature(text, a, b):
    sh = shingles(text)
    h = np.array([_base_hash(s) for s in sh], dtype=object)
    permuted = (a[:, None] * h[None, :] + b[:, None]) % P   # exact big-int arithmetic
    return permuted.min(axis=1)

def est_jaccard(sig_a, sig_b):
    return float(np.mean(sig_a == sig_b))                   # fraction of agreeing min-slots

def true_jaccard(x, y):
    A, B = shingles(x), shingles(y)
    return len(A & B) / len(A | B)

def lsh_candidate(sig_a, sig_b, bands, rows):
    for band in range(bands):
        s = band * rows
        if np.array_equal(sig_a[s:s + rows], sig_b[s:s + rows]):
            return True                                     # agree on a whole band -> candidate
    return False

def dedup(rows, a, b, bands, rows_per_band):
    kept, sigs = [], []
    for r in rows:
        sig = signature(r["text"], a, b)
        if any(lsh_candidate(sig, s, bands, rows_per_band) for s in sigs):
            continue                                        # near-duplicate of something kept
        sigs.append(sig)
        kept.append(r)
    return kept

NUM_PERM, BANDS, ROWS = 512, 64, 8                          # 64 * 8 == 512
a, b = make_perms(NUM_PERM)

doc = " ".join(f"w{i}" for i in range(80))                  # 80 distinct tokens
near = " ".join("changed" if i == 40 else f"w{i}" for i in range(80))  # 1 token differs
far = " ".join(f"z{i}" for i in range(80))                  # fully disjoint vocab

sig_doc, sig_near, sig_far = (signature(t, a, b) for t in (doc, near, far))

# 1. Signature length is exactly num_perm; identical text estimates Jaccard 1.0 exactly.
assert sig_doc.shape == (NUM_PERM,)
assert est_jaccard(sig_doc, signature(doc, a, b)) == 1.0

# 2. Disjoint vocab -> true Jaccard 0, estimate collapses to ~0.
assert true_jaccard(doc, far) == 0.0
assert est_jaccard(sig_doc, sig_far) < 0.05

# 3. MinHash is an unbiased Jaccard estimator: matches the slow exact reference within tolerance.
assert abs(est_jaccard(sig_doc, sig_near) - true_jaccard(doc, near)) < 0.08

# 4. Equivalence to a slow reference across many random pairs (bounded estimator error).
rng = np.random.default_rng(7)
max_err = 0.0
for _ in range(60):
    n = int(rng.integers(30, 90))
    base = [f"t{int(rng.integers(0, 120))}" for _ in range(n)]
    frac = float(rng.uniform(0.0, 1.0))
    other = [tok if rng.random() > frac else f"t{int(rng.integers(0, 120))}" for tok in base]
    x, y = " ".join(base), " ".join(other)
    err = abs(est_jaccard(signature(x, a, b), signature(y, a, b)) - true_jaccard(x, y))
    max_err = max(max_err, err)
assert max_err < 0.12, max_err

# 5. LSH banding follows a similarity S-curve: the near-duplicate agrees on at least one band, the disjoint doc on none.
assert lsh_candidate(sig_doc, sig_near, BANDS, ROWS) is True
assert lsh_candidate(sig_doc, sig_far, BANDS, ROWS) is False

# 6. End-to-end dedup: an exact copy AND a 1-token near-duplicate are dropped; the distinct doc stays.
rows = [{"id": 0, "text": doc}, {"id": 1, "text": doc}, {"id": 2, "text": near}, {"id": 3, "text": far}]
kept = dedup(rows, a, b, BANDS, ROWS)
assert [r["id"] for r in kept] == [0, 3]                    # kept the first doc and the disjoint one
print(f"minhash+lsh: all checks passed (max estimator error {max_err:.3f})")
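The S-curve named in check 5 has a closed form. If each signature slot agrees independently with probability s (the Jaccard similarity), a band of r rows agrees with probability s^r, and at least one of b bands agrees with probability 1 - (1 - s^r)^b. The sketch below evaluates that formula; the function names are illustrative, and `approx_threshold` is the usual rule-of-thumb location of the steep part of the curve, not an exact cutoff.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """P(at least one of `bands` bands agrees on all `rows` slots) at Jaccard s."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb similarity where the S-curve rises steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)

# Tabulate the curve for the 64 x 8 configuration used in the block above.
for s in (0.2, 0.4, 0.6, 0.8, 0.9):
    print(f"s={s:.1f}  P(candidate)={candidate_probability(s, 64, 8):.4f}")
print(f"approx threshold: {approx_threshold(64, 8):.3f}")
```

More rows per band push the steep part toward higher similarity; more bands pull it lower. With 64 bands of 8 rows the steep part sits near 0.59, so pairs well below a 0.8 Jaccard target still surface as candidates; a common follow-up is to confirm candidates with the estimated Jaccard before dropping anything.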

Decontamination drops any training row whose n-grams overlap a held-out eval set (the same check the synthetic-data and evaluation-integrity pages use). It is the stage that most often ships broken, so it is worth a runnable, self-checked block:

# decontaminate.py: drop training rows that share an n-gram with any eval example (n-gram containment).
def ngrams(s, n=13):
    """Space-joined n-gram windows. For text shorter than n, the whole text is the single window."""
    toks = s.split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

def decontaminate(rows, eval_texts, n=13):
    """Keep only rows whose text shares no n-gram with any eval text."""
    banned = set().union(*(ngrams(t, n) for t in eval_texts)) if eval_texts else set()
    return [r for r in rows if not (ngrams(r["text"], n) & banned)]

words = "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu".split()
assert len(words) == 13                                    # exactly one 13-gram
ev = [" ".join(words)]

# 1. Happy path: a row echoing the eval 13-gram is dropped; an unrelated row survives.
dirty = {"text": "please continue: " + " ".join(words)}
clean = {"text": "a totally different sentence with zero token overlap against the benchmark set here"}
assert decontaminate([dirty, clean], ev) == [clean]

# 2. Contamination buried inside a longer row: the eval span embedded mid-document is still caught.
buried = {"text": "prefix junk " + " ".join(words) + " trailing junk tokens"}
assert decontaminate([buried], ev) == []

# 3. Boundary: 13 shared tokens form a 13-gram and are caught; 12 tokens cannot and survive.
assert decontaminate([{"text": " ".join(words)}], ev) == []
row12 = {"text": " ".join(words[:12])}
assert decontaminate([row12], ev) == [row12]

# 4. Whitespace corruption must not smuggle contamination past the filter (.split collapses runs).
noisy = {"text": "   ".join(words) + "\n\t"}
assert decontaminate([noisy], ev) == []

# 5. Empty eval set bans nothing; every row survives.
assert decontaminate([dirty, clean], []) == [dirty, clean]

# 6. Containment is span-based, not bag-of-words: a long paraphrase sharing no 13-token run leaks.
paraphrase = {"text": "the identical claim conveyed with completely reordered wording and fresh "
                      "vocabulary throughout this entire rewritten passage here"}
assert len(paraphrase["text"].split()) > 13
assert decontaminate([paraphrase], ev) == [paraphrase]     # honest limit: semantic dedup needed

# 7. Equivalence to a slow brute-force reference, with a planted contamination hit.
import random
def slow(rows, eval_texts, n=13):
    banned = set()
    for t in eval_texts:
        tk = t.split()
        banned |= {tuple(tk[i:i + n]) for i in range(max(1, len(tk) - n + 1))}
    out = []
    for r in rows:
        tk = r["text"].split()
        rw = {tuple(tk[i:i + n]) for i in range(max(1, len(tk) - n + 1))}
        if not (rw & banned):
            out.append(r)
    return out
random.seed(0)
vocab = list("abcdefgh")
corpus = [{"text": " ".join(random.choice(vocab) for _ in range(25))} for _ in range(50)]
rand_eval = [" ".join(random.choice(vocab) for _ in range(25)) for _ in range(4)]
rand_eval.append(corpus[11]["text"])                       # plant a guaranteed hit
fast_out = decontaminate(corpus, rand_eval)
assert fast_out == slow(corpus, rand_eval)                 # fast == slow reference
assert corpus[11] not in fast_out                          # planted contamination removed
print("decontaminate: all checks passed")
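The check above compares whitespace tokens verbatim, so a copy of an eval item that differs only in case or punctuation slips through. A minimal sketch of one way to close that gap is to normalize both sides before building n-grams. The `normalize` rules here (NFKC, casefold, punctuation replaced by spaces) and the name `decontaminate_normalized` are assumptions for illustration, not a standard; pick the rules to match how your benchmarks are formatted.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """NFKC-normalize, casefold, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"[^\w\s]", " ", text)

def ngrams(s, n=13):
    """Space-joined n-gram windows; text shorter than n yields one window."""
    toks = s.split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

def decontaminate_normalized(rows, eval_texts, n=13):
    """Keep rows that share no normalized n-gram with any normalized eval text."""
    banned = set()
    for t in eval_texts:
        banned |= ngrams(normalize(t), n)
    return [r for r in rows if not (ngrams(normalize(r["text"]), n) & banned)]

words = "alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu".split()
ev = [" ".join(words)]
# Same 13 tokens as the eval item, disguised with case and punctuation changes.
shouty = {"text": "Alpha, Beta, GAMMA; delta. Epsilon zeta eta theta iota kappa lambda mu NU!"}
clean = {"text": "a totally different sentence with zero token overlap against the benchmark set here"}
print(decontaminate_normalized([shouty, clean], ev))
```

Normalization is applied to both the training row and the eval text, so any token splitting it causes (for example around apostrophes) is consistent on both sides. It widens what counts as a match, so sample the drops to check the false-positive rate.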

How to integrate with it

Curation is a stage in a larger pipeline, not a standalone script. It reads from the storage/data platform and its output is the only thing the trainer ever sees:

Upstream. Raw scrapes and synthetic generations land in object storage; curation reads shards from there and writes cleaned shards back, so every stage is restartable from its input.
Order matters. Wire the stages exactly as the architecture shows: extract, language/format ID, quality filter, the dedup ladder, decontamination, then mixing. Cheap, high-yield filters run first so the expensive stages see less data.
Shared decontamination. The n-gram check is the same primitive the synthetic-data and evaluation-integrity pages use; run it against one shared, versioned benchmark manifest so every dataset is decontaminated against the same list.
Downstream consumers. The curated blend feeds SFT, on-policy distillation, and reward-model training; a broken curation run silently corrupts all of them, so treat its output as a release artifact.
Tooling. Both NeMo Curator and datatrove expose curation as composable blocks (readers, filters, dedup, extractors) over pluggable executors, so the same pipeline definition runs locally, on Ray, or on Slurm.³⁴

How to tune each stage

Tune each stage against a measurable target, not by feel:

Quality filter. Start with cheap heuristics (drop very short/very repetitive docs), then add a classifier trained on "good" exemplars (the FineWeb-Edu pattern) or an LLM-judge score; hold out a sample to check the filter is not dropping valid data.¹
Dedup ladder. Run exact, then fuzzy, then semantic in that order (cheapest first). Fuzzy threshold ~0.7-0.8 Jaccard is typical; semantic dedup (embed + cluster, drop intra-cluster near-neighbours) catches paraphrases the others miss.²
Decontamination coverage. Decontaminate against every benchmark you will report, including test items and few-shot prompts; n-gram containment with windows of roughly 8 to 13 tokens is common practice.
Mixing. Set domain weights deliberately and upsample scarce high-value domains; measure the downstream effect rather than assuming more of a domain helps.
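The mixing step can be sketched as weighted sampling over domains. This is one simple scheme under stated assumptions, not the only way to mix: each draw picks a domain by weight, then a row uniformly with replacement, so a scarce domain is upsampled by repeating its rows. The function name `mix`, the domain names, and the weights are all hypothetical.

```python
import random

def mix(sources: dict[str, list], weights: dict[str, float], total: int, seed: int = 0):
    """Draw `total` (domain, row) pairs; domains are chosen in proportion to weights."""
    rng = random.Random(seed)                     # fixed seed keeps the blend reproducible
    names = sorted(sources)                       # stable order, independent of dict order
    w = [weights[name] for name in names]
    picks = rng.choices(names, weights=w, k=total)
    return [(name, rng.choice(sources[name])) for name in picks]

# Illustrative inputs: a large web domain and a scarce code domain.
sources = {"web": [f"web-{i}" for i in range(100)], "code": ["code-0", "code-1"]}
weights = {"web": 0.7, "code": 0.3}
blend = mix(sources, weights, total=1000, seed=0)
code_share = sum(name == "code" for name, _ in blend) / len(blend)
print(f"code share of blend: {code_share:.2f}")
```

Here the two code rows make up about 30% of the blend, which means each repeats many times. That is exactly the upsampling trade-off to measure downstream: heavy repetition of a tiny domain can cost more than it gains.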

How to run it in production

Curation is a batch job on the cluster, and a few disciplines make its output trustworthy and auditable:

Decontamination is a release gate, not an optional pass. Data that fails the n-gram check never reaches SFT or DPO; wire it as a hard gate before any run whose scores you will publish (evaluation integrity).
Idempotent, resumable stages. Checkpoint each stage's output so a preempted job resumes instead of recomputing, and a rerun skips completed work; keep intermediate shards until the final blend is accepted.
Keep a drop manifest. Record why every example was dropped (which stage, which rule), so an over-aggressive filter is debuggable after the fact rather than a silent hole in the data.
Redact PII. Un-redacted personal data is a compliance and safety hole; run redaction as an explicit stage (security & multi-tenancy). NeMo Curator ships a PII stage for this.³
Determinism. Fix the MinHash permutation seed and record the thresholds so the same corpus curates to the same output; non-reproducible curation makes a regression impossible to bisect.
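The drop manifest described above can be as small as one record per dropped row. The sketch below is a minimal illustration: the record fields (`id`, `stage`, `rule`), the rule names, and `filter_with_manifest` are assumptions, not a format any of the cited tools prescribe.

```python
import json
from collections import Counter

def filter_with_manifest(rows, rules, stage, manifest):
    """Keep rows passing every rule; record the first failing rule for each drop."""
    kept = []
    for row in rows:
        failed = next((name for name, ok in rules if not ok(row["text"])), None)
        if failed is None:
            kept.append(row)
        else:
            manifest.append({"id": row["id"], "stage": stage, "rule": failed})
    return kept

# Hypothetical rules: (name, predicate that returns True when the text is acceptable).
rules = [
    ("min_tokens", lambda t: len(t.split()) >= 8),
    ("has_letters", lambda t: any(c.isalpha() for c in t)),
]
rows = [
    {"id": "a", "text": "one two three four five six seven eight"},
    {"id": "b", "text": "too short"},
    {"id": "c", "text": "1 2 3 4 5 6 7 8"},
]
manifest = []
kept = filter_with_manifest(rows, rules, "quality", manifest)

# One JSON object per line is easy to append to, sample from, and aggregate.
print("\n".join(json.dumps(rec, sort_keys=True) for rec in manifest))
print(Counter(rec["rule"] for rec in manifest))
```

Aggregating the manifest by rule shows which filter is doing the dropping; sampling it by rule is how an over-aggressive filter gets caught.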

How to maintain it

A curated dataset is a living artifact, not a frozen file:

Re-decontaminate on every benchmark change. A set that was clean last quarter is contaminated the moment you adopt a new eval whose text it happens to contain; re-run the n-gram check against the current benchmark manifest (evaluation integrity).
Refresh the quality classifier. As the data distribution shifts, a stale "good-exemplar" classifier drifts; retrain it periodically and re-check that it is not silently dropping a valid domain.¹
Version the blend. Treat each dataset as an immutable, versioned artifact with its config (thresholds, seeds, source weights, benchmark list); a new mix is a new version, not a patch, so a regression is traceable to a data change (SRE/MLOps practices).
Audit what you drop. Periodically sample the drop manifest for false positives (valid dialects, rare domains, real long documents killed as "repetitive").
Watch the dedup ratio run over run. A sudden jump in near-duplicate rate flags an upstream source that started emitting boilerplate; a sudden drop flags a broken dedup stage (observability).
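Versioning the blend with its config can be sketched as a fingerprint over canonical JSON, so the same settings always map to the same identifier and any change produces a new one. The config keys below (`minhash_seed`, `fuzzy_threshold`, and so on) are hypothetical, and truncating the digest to 16 hex characters is an arbitrary choice for readability.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """SHA-256 over canonical JSON: key order and whitespace do not affect the id."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Illustrative curation config (all names and values are assumptions).
config = {
    "minhash_seed": 0,
    "fuzzy_threshold": 0.8,
    "decontam_ngram": 13,
    "benchmarks": sorted(["mmlu", "arc"]),   # sort lists so their order cannot change the id
    "weights": {"web": 0.7, "code": 0.3},
}
version = config_fingerprint(config)
print(version)
```

`sort_keys=True` canonicalizes dictionary order but not list order, which is why the benchmark list is sorted before hashing. Store the fingerprint alongside the dataset so a regression can be traced to the exact config that produced it.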

How to scale it

Dedup is the expensive stage: naive all-pairs comparison is quadratic, so production uses LSH banding (MinHash) or embedding clustering (semantic) to make it near-linear. NeMo Curator GPU-accelerates fuzzy dedup with RAPIDS (cuDF/cuGraph) and Ray for multi-node scale (NVIDIA reports ~16x over a CPU baseline); datatrove runs the same shape on Slurm/Ray executors.³⁴ Stage the corpus on the storage/data platform, run curation as a Ray job, and checkpoint intermediate stages so a rerun skips completed work.
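The step from quadratic to near-linear can be shown with a band index. The pure-numpy dedup earlier on this page compares each new signature against every kept one; hashing each band into a shared set replaces that scan with a fixed number of lookups per document. This is a single-process sketch with illustrative names (`band_keys`, `dedup_indexed`), not how NeMo Curator or datatrove implement it.

```python
def band_keys(sig, bands, rows):
    """One hashable key per band: (band index, tuple of that band's slots)."""
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]

def dedup_indexed(sigs, bands, rows):
    """Return indices of kept signatures; drop any that shares a band with a kept one."""
    seen = set()                  # band keys of every kept document
    kept = []
    for i, sig in enumerate(sigs):
        keys = band_keys(sig, bands, rows)
        if any(k in seen for k in keys):
            continue              # collides with a kept document on some band
        seen.update(keys)
        kept.append(i)
    return kept

# Tiny hand-made signatures: 4 slots = 2 bands x 2 rows.
sigs = [
    [1, 2, 3, 4],
    [1, 2, 9, 9],   # shares band 0 with doc 0
    [7, 7, 7, 7],
    [9, 9, 3, 4],   # shares band 1 with doc 0
]
print(dedup_indexed(sigs, bands=2, rows=2))
```

Each document costs `bands` set lookups regardless of how many documents are already kept. The band index is part of the key so that equal slot values in different bands do not collide. At cluster scale the same idea becomes a group-by on band keys followed by connected components, which is the shape the GPU and multi-node tools parallelize.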

Cookbook (common use cases)

1. Fuzzy-dedup an SFT set: apply the datasketch-based dedup(rows, threshold=0.8) from fuzzy_dedup.py (above) after exact dedup; near-duplicate instructions are a common cause of SFT over-fitting.

2. Decontaminate before reporting: run decontaminate(train_rows, eval_texts) against every benchmark's items before any run whose scores you will publish.

3. Heuristic quality gate: a cheap first pass that drops empty, over-short, or symbol-spam rows before the (expensive) classifier runs.

# quality_gate.py: cheap first-pass filter -- drop empty, over-short, or symbol-spam rows.
def quality_ok(text):
    toks = text.split()
    if len(toks) < 8:                                      # too short to teach anything
        return False
    symbol_ratio = sum(not c.isalnum() and not c.isspace() for c in text) / max(1, len(text))
    return symbol_ratio < 0.3                              # drop symbol spam

# 1. Empty / whitespace-only text is rejected (zero tokens).
assert quality_ok("") is False
assert quality_ok("   \n\t ") is False

# 2. Token-count boundary: 7 tokens fail, 8 clean tokens pass.
assert quality_ok("one two three four five six seven") is False
assert quality_ok("one two three four five six seven eight") is True

# 3. Symbol-spam adversary: enough tokens but junk characters is rejected.
assert quality_ok("!!!! @@@@ #### $$$$ %%%% ^^^^ &&&& ****") is False

# 4. A normal sentence passes.
assert quality_ok("The mitochondria is the powerhouse of the cell and stores energy") is True

# 5. symbol_ratio boundary around 0.3: just under passes, at/above fails.
letters = "ab cd ef gh ij kl mn op"                        # 8 tokens, 23 chars, 0 symbols
assert quality_ok(letters) is True
just_under = letters + " " + "#" * 9                        # 9/33 = 0.2727 < 0.3
assert 9 / 33 < 0.3 and quality_ok(just_under) is True
at_or_above = letters + " " + "#" * 11                      # 11/35 = 0.3143 >= 0.3
assert 11 / 35 >= 0.3 and quality_ok(at_or_above) is False
print("quality gate: all checks passed")
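Cookbook item 1 assumes exact dedup has already run. A minimal sketch of that first rung of the ladder hashes each row's text and keeps the first occurrence. Collapsing whitespace before hashing is an assumption made here so that copies differing only in spacing count as exact duplicates; drop that step if byte-identical matching is what you want. The names `content_key` and `exact_dedup` are illustrative.

```python
import hashlib

def content_key(text: str) -> str:
    """SHA-256 of the whitespace-collapsed text."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def exact_dedup(rows):
    """Keep the first row for each distinct content key, preserving input order."""
    seen, kept = set(), []
    for r in rows:
        key = content_key(r["text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept

rows = [
    {"id": 0, "text": "Explain MinHash in one line."},
    {"id": 1, "text": "Explain   MinHash in one line.\n"},   # whitespace-only variant
    {"id": 2, "text": "Explain MinHash in two lines."},
]
print([r["id"] for r in exact_dedup(rows)])
```

Storing fixed-size digests rather than full texts keeps the seen-set small, which is what makes this stage cheap enough to run first.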

Failure modes

No decontamination. Train/eval overlap inflates scores and silently breaks the eval gate; decontaminate against every reported benchmark.
Over-dedup / biased filter. Aggressive dedup or a skewed quality classifier strips valid diversity (dialects, rare domains); audit a sample of what you drop.
Language-ID errors. A misfiring language filter silently discards valid target-language data.
PII leakage. Un-redacted personal data in the training set is a compliance and safety hole; redact it (security & multi-tenancy).
Synthetic-only curation. Filtering synthetic data without mixing in real data still risks model collapse; keep real data in the blend (synthetic data).
Skipping semantic dedup. Exact/fuzzy dedup miss paraphrased near-duplicates that semantic dedup catches.²
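Semantic dedup, the stage the last item warns against skipping, can be sketched as a greedy cosine-similarity filter over precomputed embeddings. This is a simplification and not SemDeDup's algorithm: SemDeDup clusters the embeddings first and compares only within clusters, while this version compares each row against every kept row and is therefore quadratic. The 0.95 threshold and the toy 2-D vectors are assumptions for illustration; in practice the embeddings come from whatever encoder you choose.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Keep a row only if its cosine similarity to every kept row is below threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)     # unit vectors; guard zero rows
    kept: list[int] = []
    for i in range(len(unit)):
        if kept and float(np.max(unit[kept] @ unit[i])) >= threshold:
            continue                                    # too close to something kept
        kept.append(i)
    return kept

# Toy embeddings: rows 0 and 1 point almost the same way, row 2 is orthogonal.
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(semantic_dedup(emb))
```

Restricting the comparison to rows in the same cluster is what turns this into something that scales; the threshold then needs tuning against a sample of dropped pairs, because a value that merges paraphrases in one domain can merge distinct examples in another.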

References

The FineWeb Datasets (curation ablations, FineWeb-Edu classifier filtering): https://arxiv.org/abs/2406.17557
SemDeDup (semantic deduplication): https://arxiv.org/abs/2303.09540
NVIDIA NeMo Curator (GPU-accelerated curation): https://github.com/NVIDIA/NeMo-Curator
datatrove (large-scale data-processing pipelines): https://github.com/huggingface/datatrove
datasketch (MinHash / LSH): https://github.com/ekzhu/datasketch

¹ Penedo et al., The FineWeb Datasets: documents and ablates web-data dedup and filtering; the FineWeb-Edu classifier-based educational filter drives large gains on MMLU/ARC, showing quality filtering beats raw volume. https://arxiv.org/abs/2406.17557
² Abbas et al., SemDeDup: embedding-based semantic deduplication removes semantically near-duplicate (not exact-duplicate) data, cutting ~50% of web data with minimal loss and improving OOD performance. https://arxiv.org/abs/2303.09540
³ NVIDIA NeMo Curator: GPU-accelerated (RAPIDS cuDF/cuGraph + Ray) data-curation pipelines covering extraction, language ID, quality filtering, exact/fuzzy/semantic dedup, classification, PII redaction, and task decontamination. https://github.com/NVIDIA/NeMo-Curator
⁴ datatrove: platform-agnostic pipeline blocks (readers, filters, extractors, MinHash + exact dedup, contamination stats) with Local/Slurm/Ray executors. https://github.com/huggingface/datatrove