Hallucination Mechanisms: Why Language Models Generate Plausible but Incorrect Text

Category: evaluation · Updated: 2026-02-27

Ji et al. (ACM Computing Surveys 2023): hallucination = content unsupported or contradicted by source; Mallen et al. (ACL 2023): entities in bottom-25% training frequency show 4–14× higher hallucination rates than top-25%; Maynez et al. (ACL 2020): ~30% of abstractive summaries contain hallucinated content.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Hallucination rate in abstractive summarization | ~30% | % summaries affected | Maynez et al. (2020): ~30% of abstractive summaries contain hallucinated content; extractive <1% |
| Frequency effect: rare vs common entities | 4–14× higher rate | × hallucination rate | Mallen et al. (2023): bottom-25% training-frequency entities hallucinate 4–14× vs top-25% |
| Intrinsic vs extrinsic hallucination | Both in ~30–50% of evaluated summaries | % examples | Ji et al. (2023): intrinsic = contradicts source; extrinsic = adds information absent from source |
| RLHF effect on hallucination | Reduces on common topics; persists on rare ones | n/a | RLHF reward model trained by raters who cannot detect specialized errors; confident hallucinations remain |

Hallucination in language models refers to generated text that is factually incorrect, unsupported by provided source documents, or internally inconsistent, despite appearing fluent and confident. Ji et al. (2023) classify hallucinations as intrinsic (contradicting a provided source) or extrinsic (adding information absent from the source), with both arising from distinct failure modes in training and inference.

Taxonomy of Hallucination

| Type | Definition | Example |
|---|---|---|
| Intrinsic | Contradicts the source document | Summary says “Tuesday” when source says “Monday” |
| Extrinsic | Adds information absent from source | Summary adds an unstated cause for an event |
| World-knowledge | Contradicts factual reality | Attributing a quote to the wrong person |
| Numerical | Incorrect quantities, dates, or statistics | Wrong year, wrong percentage |
| Entity-level | Wrong name, location, or relationship | Misattributing a discovery to the wrong researcher |

Frequency-Based Hallucination Rates (Mallen et al., 2023)

Mallen et al. evaluated entity-level factual accuracy across entities stratified by training corpus frequency:

| Entity Frequency Quartile | Hallucination Rate | Relative Rate |
|---|---|---|
| Top 25% (most frequent) | ~5% | 1× (baseline) |
| 2nd quartile | ~10% | ~2× |
| 3rd quartile | ~20% | ~4× |
| Bottom 25% (least frequent) | ~20–70% | 4–14× |

The practical implication: model confidence is anti-correlated with accuracy for rare entities. Confident-sounding claims about obscure topics are significantly less reliable than claims about common topics.
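The quartile breakdown above can be made concrete with a small helper that stratifies evaluation records by entity training frequency and computes rates relative to the most-frequent quartile. The records here are invented to mirror the table, not real evaluation data:

```python
from collections import defaultdict

def rates_by_quartile(records):
    """records: (quartile, is_hallucinated) pairs, quartile 1 = most frequent.
    Returns per-quartile hallucination rate and rate relative to quartile 1."""
    counts = defaultdict(lambda: [0, 0])   # quartile -> [errors, total]
    for quartile, hallucinated in records:
        counts[quartile][0] += int(hallucinated)
        counts[quartile][1] += 1
    rates = {q: err / total for q, (err, total) in counts.items()}
    baseline = rates[1]                    # most-frequent entities
    relative = {q: r / baseline for q, r in rates.items()}
    return rates, relative

# Toy records matching the table: Q1 ~5% errors, Q4 ~20% errors.
records = [(1, i < 1) for i in range(20)] + [(4, i < 4) for i in range(20)]
rates, relative = rates_by_quartile(records)
```

With these toy numbers the bottom quartile comes out at roughly 4× the baseline rate, the low end of the reported 4–14× range.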

Mechanistic Sources of Hallucination

1. Training Data Conflicts

Corpora contain contradictory facts about the same entity. The model learns a distribution over conflicting claims and samples from it, generating internally consistent but externally incorrect text.
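A toy sketch of this mechanism (the corpus and the conflicting dates are invented): a "model" that simply matches the training distribution over conflicting claims will emit the minority, incorrect claim at roughly its corpus frequency:

```python
import random

# Hypothetical corpus: two conflicting dates for the same event;
# 70% of documents say "1912", 30% say "1921".
corpus = ["1912"] * 7 + ["1921"] * 3

def sample_claim(rng):
    # A distribution-matching "model": emit each claim in proportion
    # to how often it appears in training.
    return rng.choice(corpus)

rng = random.Random(0)
samples = [sample_claim(rng) for _ in range(1000)]
wrong_rate = samples.count("1921") / len(samples)  # close to 0.3
```

Each individual generation is fluent and internally consistent; the error rate is a property of the learned distribution, not of any single decoding step.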

2. Decoding Error Amplification

Step 1: Model generates incorrect claim C with moderate probability
Step 2: C becomes part of context for subsequent tokens
Step 3: Subsequent tokens now condition on C as if it were fact
Step 4: Sequence appears internally consistent but factually wrong
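The four steps above can be simulated with a toy greedy decoder (the probability table is invented): once the wrong claim enters the context, every subsequent argmax step prefers tokens consistent with it:

```python
def next_token_probs(context):
    # Toy P(next | context): the model strongly prefers continuations
    # consistent with whatever day is already claimed in the context.
    if any("Tuesday" in tok for tok in context):
        return {"so the meeting is Tuesday": 0.9, "so the meeting is Monday": 0.1}
    return {"so the meeting is Monday": 0.9, "so the meeting is Tuesday": 0.1}

def greedy_decode(context, steps):
    for _ in range(steps):
        probs = next_token_probs(context)
        context.append(max(probs, key=probs.get))  # take the argmax token
    return context

# Step 1's error ("Tuesday") conditions every later step, producing a
# self-consistent but factually wrong sequence.
trace = greedy_decode(["It happened Tuesday"], steps=3)
```

Starting the same decoder from a correct claim yields a uniformly correct continuation, which is the point: the decoder amplifies whatever entered the context first.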

3. Exposure Bias

Training with teacher forcing (always feeding ground-truth prefixes) means models never learn to recover from their own errors. At inference, early mistakes compound.
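A minimal contrast between the teacher-forced training view and free-running inference (the bigram transition table is invented): the model is only ever fit on gold prefixes, so it has no behavior defined for its own mistakes:

```python
# Toy bigram "model" trained only on the gold sequence "the cat sat down".
TRAINED = {"the": "cat", "cat": "sat", "sat": "down"}

def predict(prev):
    return TRAINED.get(prev, "<unk>")  # off-distribution input: no recovery

def teacher_forced(gold):
    # Training-time view: every prediction conditions on the gold prefix,
    # so the model never observes the consequences of an earlier error.
    return [predict(tok) for tok in gold[:-1]]

def free_running(start, steps):
    # Inference-time view: each prediction conditions on the model's own
    # previous output, so one bad token derails everything after it.
    out = [start]
    for _ in range(steps):
        out.append(predict(out[-1]))
    return out
```

Under teacher forcing the predictions all look correct; in free-running mode a single off-distribution token puts the model in a state it was never trained on, and the error compounds.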

Mitigation Approaches

| Approach | Mechanism | Limitation |
|---|---|---|
| RAG | Ground generation in retrieved documents | Retrieval failures propagate |
| RLHF | Human feedback on factual errors | Raters miss specialized errors |
| Self-consistency | Multiple samples; majority vote | Confident wrong answers can dominate |
| Chain-of-thought | Explicit checkable reasoning steps | Reasoning chains themselves can hallucinate |
| Uncertainty calibration | Output confidence alongside claims | Models poorly calibrated for rare entities |
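The self-consistency row and its limitation are easy to make concrete (the sampled answers are invented): majority voting recovers the truth when errors are scattered across different wrong answers, but reinforces a confidently wrong answer that dominates the samples:

```python
from collections import Counter

def majority_vote(samples):
    """Self-consistency: sample the model several times, keep the mode."""
    return Counter(samples).most_common(1)[0][0]

# Errors scattered across different wrong answers: the vote recovers "42".
scattered = ["42", "41", "42", "43", "42"]

# Confident hallucination: one wrong answer dominates every sample,
# so voting reinforces the error instead of correcting it.
confident = ["41", "41", "41", "42", "41"]
```

This is why self-consistency helps most against stochastic decoding errors and least against the distribution-level errors described under training data conflicts.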

See rag for retrieval-based factual grounding, rlhf for alignment training that partially mitigates hallucination, and training-data-curation for how data quality affects hallucination rates.



Frequently Asked Questions

What are the main mechanistic causes of hallucination?

Four primary mechanisms: (1) Training data conflicts — corpora contain contradictory facts about entities; the model learns a distribution over these and generates a plausible but incorrect resolution. (2) Knowledge sparsity — entities appearing rarely in training have high weight uncertainty; the model generates plausible-sounding but incorrect attributes. (3) Decoding error amplification — greedy or beam decoding reinforces early factual errors across the sequence, since incorrect claims have high conditional probability given themselves. (4) Exposure bias — training on teacher-forced correct prefixes leaves models unprepared to recover from errors in their own previous output.

Why does RLHF reduce but not eliminate hallucination?

RLHF trains on human preference comparisons, penalizing outputs raters identify as incorrect. For frequently-encountered topics, raters can detect errors and the reward model learns to penalize them. For rare or specialized topics, human raters often fail to recognize incorrect claims — so the reward model does not penalize them and the trained policy learns to sound confident and fluent regardless of factual accuracy. The defining characteristic of hallucination is plausible language paired with wrong content, and RLHF primarily optimizes plausibility.
