Training Data Curation: Web Filtering, Deduplication, and Quality Selection

Category: training Updated: 2026-02-27

Common Crawl contains 400B+ tokens of raw web text; quality filtering (perplexity scoring, deduplication, URL filtering) retains ~5–20% as training data; Penedo et al. (2024) showed with FineWeb that quality filtering improves benchmark scores by 2–4 points.

Key Data Points
| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| Common Crawl raw size | 400+ | billion tokens | Monthly snapshots; 2021 snapshot ≈ 3.1 TB compressed; quality varies widely |
| Retention rate after quality filtering | 5–20 | % | Typical filtering pipeline retains 5–20% of raw Common Crawl; varies by pipeline |
| Deduplication improvement | ~1.5× | perplexity improvement | Lee et al. (2022): removing duplicates improves perplexity ~1.5× at the same training compute |
| Near-deduplication threshold | 0.8 | MinHash Jaccard similarity | Typical threshold for near-duplicate detection with MinHash LSH |
| Code data impact | +10–15% | reasoning benchmarks | Chen et al. (2021): including code in pre-training improves mathematical reasoning |

Training data quality is at least as important as model architecture for language model performance. Raw web text contains spam, templated content, low-information pages, and near-duplicate documents. Systematic curation pipelines convert hundreds of terabytes of raw text into training corpora that enable effective language model pre-training.

Filtering Pipeline Stages

| Stage | Method | Typical Reduction |
| --- | --- | --- |
| URL filtering | Blocklist of spam/adult domains | 10–30% |
| Language identification | fastText classifier | 30–60% (for English-only) |
| Length & content heuristics | Min/max document length, symbol ratios | 5–15% |
| Quality scoring | Perplexity vs. reference LM; content classifier | 30–70% |
| Near-deduplication | MinHash LSH (Jaccard ≥ 0.8) | 20–40% |
| Exact deduplication | Hash-based | 5–10% |
| Safety/PII filtering | Rule-based + classifier | 2–5% |

Combined pipeline: retains ~5–20% of raw Common Crawl as high-quality training data.
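The cheap early stages can be sketched as a single predicate per document. This is a minimal illustration; the thresholds, the blocklist, and the function name are placeholders, not values from any published pipeline:

```python
# Illustrative thresholds; real pipelines tune these per language and source.
MIN_WORDS, MAX_WORDS = 50, 100_000
MAX_SYMBOL_RATIO = 0.10
BLOCKED_DOMAINS = {"spam.example", "adult.example"}  # placeholder blocklist

def passes_heuristics(text: str, domain: str) -> bool:
    """URL, length, and symbol-ratio checks from the stages above."""
    if domain in BLOCKED_DOMAINS:                 # URL filtering
        return False
    n_words = len(text.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):   # length heuristics
        return False
    # Symbol ratio: fraction of non-alphanumeric, non-whitespace characters.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / max(len(text), 1) <= MAX_SYMBOL_RATIO
```

Quality scoring and deduplication would follow as separate stages; pipelines typically run stages in order of increasing per-document cost so expensive steps see less data.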

Data Source Composition for Large Models

| Source | Quality | Scale | Common Use |
| --- | --- | --- | --- |
| Common Crawl (filtered) | Variable → high after filtering | 400B+ tokens/snapshot | Primary pre-training data |
| Wikipedia | High | ~3B tokens | Factual grounding |
| Books corpora | High | 12–100B tokens | Long-form structure |
| GitHub/code | High for code | 100B+ tokens | Reasoning improvement |
| Scientific papers | High | 50B+ tokens | STEM reasoning |
| Web text (curated) | High | 20–50B tokens | Instruction quality |

Deduplication Methods

| Method | Type | Granularity | Complexity |
| --- | --- | --- | --- |
| Exact match | Hash (SHA-256) | Document, paragraph | O(n) |
| MinHash LSH | Approximate | Document | O(n log n) |
| SimHash | Approximate | Document | O(n) |
| Suffix array | Exact | n-gram | O(n log n), high memory |

Lee et al. (2022) found suffix array-based substring deduplication to be the most thorough method: removing repeated sequences of ≥50 tokens reduced memorization most effectively and improved model perplexity ~1.5× at the same compute.
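Near-deduplication with MinHash can be sketched in pure Python. Shingle size and permutation count below are illustrative, and production systems add LSH banding so that only candidate pairs near the ~0.8 Jaccard threshold are ever compared, rather than comparing all pairs as here:

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(doc_shingles: set, num_perm: int = 128) -> list:
    """Minimum of one seeded hash per slot approximates a random permutation."""
    def h(s: str, seed: int) -> int:
        digest = hashlib.blake2b(s.encode(), digest_size=8,
                                 salt=seed.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big")
    return [min(h(s, seed) for s in doc_shingles) for seed in range(num_perm)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching slots estimates Jaccard similarity of shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated Jaccard similarity exceeds the ~0.8 threshold would be flagged as near-duplicates and one copy dropped.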

Impact on Benchmark Performance

Penedo et al. (2024) systematically compared filtering strategies on Common Crawl while building FineWeb, finding:

  • High-quality filtered data (FineWeb) outperforms unfiltered CC by 2–4 points on MMLU
  • Mixing filtered web with curated sources (books, Wikipedia) consistently improves over web-only
  • Raising training token count with low-quality data can hurt performance relative to fewer high-quality tokens

See pre-training for how curated data is used in the training loop, and scaling-laws for how dataset quality interacts with compute-optimal token count decisions.

Frequently Asked Questions

What steps are in a typical web data filtering pipeline?

A typical pipeline includes: (1) URL-level filtering — blocklist of spam, adult content, and low-quality domains; (2) language identification — removing non-target language text; (3) quality filtering — perplexity scoring against a reference LM, text length filtering, symbol/punctuation ratio filters; (4) near-deduplication — MinHash LSH to remove near-duplicate documents; (5) content filtering — removing PII, harmful content. Each stage further reduces data volume while improving quality.

Why does deduplication improve language model training?

Lee et al. (2022) showed that training on deduplicated data significantly improves model quality at the same compute budget. The key mechanism: memorization. Models trained on highly duplicated data (e.g., the same document 100× in Common Crawl) memorize specific text verbatim rather than learning generalizable patterns. Deduplication forces the model to generalize rather than memorize, improving held-out perplexity by approximately 1.5× at identical compute.
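Exact deduplication, the simplest form, reduces to content hashing. A minimal sketch, assuming an in-memory corpus (real pipelines normalize text first and shard the seen-set across machines):

```python
import hashlib

def exact_dedup(docs: list) -> list:
    """Keep only the first occurrence of each byte-identical document."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Catching near-duplicates (the "same document 100×" case with minor boilerplate differences) additionally requires the approximate methods such as MinHash LSH.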

How is data quality measured without human labeling?

The most common automated quality signals are: (1) reference model perplexity (filter out text to which a small reference LM trained on high-quality sources assigns high perplexity, i.e., text unlike those sources); (2) content classification (train a binary classifier on known-good vs. known-bad examples); (3) linguistic features (sentence count, token-to-word ratio, average word length, punctuation density); (4) URL quality scores (domain-level reputation from human-curated allow/blocklists). These signals are individually noisy but combine to significantly improve corpus quality.
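The linguistic-feature signals above can be computed without any model at all. A minimal sketch; the exact feature set and any downstream thresholds are illustrative:

```python
def linguistic_features(text: str) -> dict:
    """Cheap per-document quality signals: counts and ratios only."""
    words = text.split()
    n_words = max(len(words), 1)  # guard against empty documents
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "punct_density": sum(ch in ".,;:!?" for ch in text) / max(len(text), 1),
        "sentences": sum(text.count(ch) for ch in ".!?"),
    }
```

A filtering rule might drop documents whose features fall outside tuned ranges, or feed the features into the binary quality classifier described in (2).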
