Masked Language Modeling: BERT's Pre-Training Objective and Bidirectional Context

Category: training · Updated: 2026-02-27

BERT's MLM selects 15% of tokens for prediction (80% replaced with [MASK], 10% with a random token, 10% left unchanged), enabling bidirectional context encoding; BERT-large reached 80.5 on the GLUE benchmark, a 7.7-point absolute improvement over the prior state of the art (Devlin et al., 2019).

Key Data Points

| Measure | Value | Unit | Notes |
|---|---|---|---|
| Masking rate | 15% | of tokens | Per input sequence; chosen to balance learning signal against input corruption |
| Masking strategy breakdown | 80% [MASK], 10% random, 10% unchanged | of selected tokens | Random and unchanged tokens prevent the model from learning to predict only at [MASK] positions |
| BERT-large GLUE score | 80.5 | GLUE average | Devlin et al. (2019); +7.7 points over prior best |
| RoBERTa masking improvement | +1.2 | GLUE points | Dynamic masking (new mask per epoch) vs. static masking; Liu et al. (2019) |
| BERT training tokens | ~132 | billion | 3.3B-word corpus (BooksCorpus + English Wikipedia) × ~40 epochs; 90% of steps at seq_len 128, final 10% at 512 |

Masked language modeling (MLM), introduced by Devlin et al. in “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (NAACL 2019), trains encoder-only transformers to predict randomly masked tokens using full bidirectional context. This differs fundamentally from causal language modeling, which uses only left-to-right context.

The BERT Masking Procedure

For each training sequence:

  1. Randomly select 15% of token positions for prediction
  2. For each selected position:
    • 80% of the time: replace with [MASK] token
    • 10% of the time: replace with a random token from the vocabulary
    • 10% of the time: keep the original token unchanged
  3. Train the model to predict the original token at all selected positions using cross-entropy loss

Only selected positions contribute to the loss — the other 85% of tokens are used as context but not predicted.
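The procedure above can be sketched in a few lines. This is a minimal illustration, not BERT's actual implementation: the toy vocabulary and the convention of using `None` labels at unselected positions are assumptions for the example.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """BERT-style corruption: select ~mask_rate of positions; of those,
    80% become [MASK], 10% a random vocabulary token, 10% stay unchanged.
    Returns (corrupted, labels): labels holds the original token at every
    selected position and None elsewhere -- only non-None labels would
    enter the cross-entropy loss."""
    rng = rng or random.Random(0)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:              # select ~15% of positions
            labels[i] = tok                       # always predict the ORIGINAL
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # else: 10% keep the original token unchanged
    return corrupted, labels
```

Production implementations (e.g. Hugging Face's `DataCollatorForLanguageModeling`) work on token IDs and mark unselected labels with -100 so the loss ignores them; the selection logic is the same.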

Why the 80/10/10 Split?

| Strategy | Benefit | Problem |
|---|---|---|
| 100% [MASK] | Strong learning signal | [MASK] never appears at inference; train/test mismatch |
| 100% unchanged | No train/test mismatch | No masking signal; the model doesn't learn to predict |
| 80/10/10 (BERT) | Strong signal for 80% of selected tokens | Only slight mismatch from the 10% random tokens; the 10% identity adds robustness |
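A quick worked computation helps here: the 80/10/10 percentages apply only to the 15% of selected positions, so the effective rates over the whole input are much lower.

```python
# Effective per-token rates implied by 15% selection and the 80/10/10 split.
mask_rate = 0.15
effective = {
    "[MASK]":    mask_rate * 0.80,  # 12.0% of all input tokens
    "random":    mask_rate * 0.10,  # 1.5%
    "unchanged": mask_rate * 0.10,  # 1.5% (selected but left as-is)
}
# 85% of tokens are never selected, so 85% + 1.5% = 86.5% of the input
# is exactly the original text, and only 13.5% is actually corrupted.
untouched_input = 0.85 + effective["unchanged"]
print(effective, untouched_input)
```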

MLM vs Causal LM Comparison

| Property | MLM (BERT-style) | Causal LM (GPT-style) |
|---|---|---|
| Context direction | Bidirectional | Left-to-right only |
| Primary architecture | Encoder-only | Decoder-only |
| Good for | Classification, extraction, NER | Text generation, completion |
| Pre-training data efficiency | Higher (each token is seen from both directions) | Lower |
| Fine-tuning approach | Add a classification head; fine-tune all weights | Prompting/few-shot or fine-tuning |
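The context-direction row can be made concrete as attention masks. A minimal sketch in plain Python, where boolean matrices stand in for the additive masks real transformer implementations apply to attention scores:

```python
# Attention masks: mask[i][j] is True if position i may attend to position j.
seq_len = 5

# MLM / encoder (BERT-style): full bidirectional attention.
bidirectional = [[True] * seq_len for _ in range(seq_len)]

# Causal LM / decoder (GPT-style): lower-triangular mask, so position i
# sees only positions j <= i and no information leaks from the future.
causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Position 2 sees all 5 tokens under MLM but only tokens 0..2 under causal LM.
print(sum(bidirectional[2]), sum(causal[2]))  # 5 3
```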

BERT Performance on Downstream Tasks

| Task | Metric | BERT-large | Prior SOTA | Improvement |
|---|---|---|---|---|
| GLUE | Average | 80.5 | 72.8 | +7.7 |
| SQuAD v1.1 | F1 | 93.2 | 91.6 | +1.6 |
| SQuAD v2.0 | F1 | 83.1 | 78.0 | +5.1 |
| MultiNLI | Accuracy | 86.7 | 82.1 | +4.6 |

RoBERTa Improvements Over BERT

Liu et al. (2019) identified several training choices that significantly impacted BERT’s performance:

| Change | GLUE improvement |
|---|---|
| Dynamic masking (new mask per epoch) | +1.2 |
| Removing next-sentence prediction (NSP) | +0.9 |
| Larger batch size (8K vs. 256) | +0.8 |
| More training data (160 GB vs. 16 GB) | +1.4 |
| Longer training | +0.5 |

See next-token-prediction for the causal LM objective comparison, fine-tuning for how MLM-pre-trained models are adapted for downstream tasks, and scaling-laws for how MLM pre-training scales with compute.


Frequently Asked Questions

Why does BERT use a mix of [MASK], random, and unchanged tokens?

If all 15% of selected tokens were always replaced with [MASK], the model would learn to predict tokens only when seeing [MASK] — but [MASK] never appears at inference time. To prevent this train/test mismatch, 10% of selected tokens are replaced with a random word and 10% are left unchanged. The model must learn to predict the original token even when the input appears normal, making representations more robust and usable for downstream tasks without masking.

What is the difference between MLM and causal language modeling?

MLM uses bidirectional context — when predicting a masked token, the model can attend to tokens both before and after the mask. Causal LM (next-token prediction) uses only left context. Bidirectional context makes MLM-trained models (like BERT) better for understanding tasks (classification, question answering, named entity recognition) but unable to generate text autoregressively. Causal LM models are naturally generative but rely on left-to-right context only.

What is dynamic masking and why does it help?

Static masking (original BERT) generates the mask once during data preprocessing, so the model sees the same masked positions repeatedly across epochs. Dynamic masking (RoBERTa, Liu et al. 2019) generates a new random mask for each training instance at each epoch, so the model never sees the same (sequence, mask) pair twice. Liu et al. found this improves GLUE by ~1.2 points and is one of several optimizations in RoBERTa that improved on BERT without architectural changes.
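The static/dynamic distinction can be sketched as follows. The `sample_mask` helper is a toy assumption for illustration; real implementations corrupt token IDs rather than just recording positions.

```python
import random

def sample_mask(seq_len, rate=0.15, rng=None):
    """Draw one set of positions to mask, BERT-style."""
    rng = rng or random
    return frozenset(i for i in range(seq_len) if rng.random() < rate)

seq_len, epochs = 512, 3

# Static masking (original BERT): one mask drawn at preprocessing time and
# reused, so every epoch trains on the identical (sequence, mask) pair.
static = [sample_mask(seq_len, rng=random.Random(42))] * epochs

# Dynamic masking (RoBERTa): a fresh mask each time the sequence is served,
# so the model (almost surely) never sees the same corruption twice.
rng = random.Random(42)
dynamic = [sample_mask(seq_len, rng=rng) for _ in range(epochs)]

print(len(set(static)), len(set(dynamic)))  # static reuses 1 mask; dynamic varies
```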
