Hallucination Mechanisms: Why Language Models Generate Plausible but Incorrect Text
Ji et al. (ACM Computing Surveys 2023): hallucination = content unsupported by or contradicted by the source; Mallen et al. (ACL 2023): entities in the bottom 25% of training frequency show 4–14× higher hallucination rates than the top 25%; Maynez et al. (ACL 2020): ~30% of abstractive summaries contain hallucinated content.
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Hallucination rate in abstractive summarization | ~30% | % summaries affected | Maynez et al. (2020): ~30% of abstractive summaries contain hallucinated content; extractive <1% |
| Frequency effect: rare vs common entities | 4–14× higher rate | × baseline hallucination rate | Mallen et al. (2023): bottom-25% training-frequency entities hallucinate 4–14× as often as top-25% |
| Intrinsic vs extrinsic hallucination | Both in ~30–50% of evaluated summaries | % examples | Ji et al. (2023): intrinsic = contradicts source; extrinsic = adds information absent from source |
| RLHF effect on hallucination | Reduces on common topics; persists on rare ones | qualitative | Reward model is trained by raters who cannot detect specialized errors, so confident hallucinations remain |
Hallucination in language models refers to generated text that is factually incorrect, unsupported by provided source documents, or internally inconsistent, despite appearing fluent and confident. Ji et al. (2023) classify hallucinations as intrinsic (contradicting a provided source) or extrinsic (adding information absent from the source), each arising from distinct failure modes in training and inference.
Taxonomy of Hallucination
| Type | Definition | Example |
|---|---|---|
| Intrinsic | Contradicts the source document | Summary says “Tuesday” when source says “Monday” |
| Extrinsic | Adds information absent from source | Summary adds an unstated cause for an event |
| World-knowledge | Contradicts factual reality | Attributing a quote to the wrong person |
| Numerical | Incorrect quantities, dates, or statistics | Wrong year, wrong percentage |
| Entity-level | Wrong name, location, or relationship | Misattributing a discovery to the wrong researcher |
Frequency-Based Hallucination Rates (Mallen et al., 2023)
Mallen et al. evaluated entity-level factual accuracy across entities stratified by training corpus frequency:
| Entity Frequency Quartile | Hallucination Rate | Relative Rate |
|---|---|---|
| Top 25% (most frequent) | ~5% | 1× (baseline) |
| 2nd quartile | ~10% | 2× |
| 3rd quartile | ~20% | 4× |
| Bottom 25% (least frequent) | ~20–70% | 4–14× |
The practical implication: for rare entities, a model's fluent, confident tone carries little signal about accuracy. Confident-sounding claims about obscure topics are substantially less reliable than equally confident claims about common ones.
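The quartile stratification above can be sketched mechanically. The records below are invented purely to illustrate the bucketing; the real numbers come from Mallen et al.'s evaluation, not from this toy data.

```python
from statistics import mean

# Hypothetical evaluation records: (entity training-corpus frequency, answer correct?).
records = [
    (50_000, True), (42_000, True), (30_000, True), (21_000, True),   # frequent entities
    (9_000, True), (6_500, True), (4_000, True), (2_800, False),
    (900, True), (700, False), (450, True), (300, False),
    (40, False), (25, False), (12, True), (5, False),                 # rare entities
]

# Sort by frequency (most frequent first) and split into quartiles,
# mirroring the stratification in the table above.
records.sort(key=lambda r: r[0], reverse=True)
n = len(records)
quartiles = [records[i * n // 4:(i + 1) * n // 4] for i in range(4)]

for label, q in zip(["top 25%", "2nd quartile", "3rd quartile", "bottom 25%"], quartiles):
    rate = 1 - mean(1.0 if correct else 0.0 for _, correct in q)
    print(f"{label:>12}: hallucination rate ~{rate:.0%}")
```

With this made-up data the rate rises monotonically from the top quartile to the bottom one, which is the qualitative pattern the table reports.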
Mechanistic Sources of Hallucination
1. Training Data Conflicts
Corpora contain contradictory facts about the same entity. The model learns a distribution over conflicting claims and samples from it, generating internally consistent but externally incorrect text.
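A minimal sketch of this mechanism, using a made-up entity and made-up claim counts: the model's learned distribution is approximated here by the empirical frequency of each conflicting claim, and sampling from it occasionally emits a minority (likely wrong) claim.

```python
import random
from collections import Counter

# Hypothetical corpus statements about one entity's founding year.
# The corpus disagrees, so training effectively fits a mixture over claims.
corpus_claims = ["1998"] * 6 + ["1996"] * 3 + ["2001"] * 1

counts = Counter(corpus_claims)
total = sum(counts.values())
learned_dist = {year: c / total for year, c in counts.items()}

# Sampling from the learned distribution: each output is fluent,
# but a minority claim is emitted at its learned probability.
random.seed(0)
samples = random.choices(list(learned_dist), weights=list(learned_dist.values()), k=1000)
minority_rate = sum(s != "1998" for s in samples) / len(samples)
print(learned_dist, minority_rate)
```

The point of the sketch: nothing is "broken" at sampling time; the error is already baked into the distribution the model fit.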
2. Decoding Error Amplification
1. The model generates an incorrect claim C with moderate probability.
2. C becomes part of the context for subsequent tokens.
3. Subsequent tokens condition on C as if it were fact.
4. The sequence appears internally consistent but factually wrong.
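The steps above can be sketched with a toy conditional model (the prefixes, continuations, and probabilities below are all invented for illustration): once the wrong claim enters the context, greedy decoding produces a continuation that treats it as fact.

```python
# Toy next-step model: continuation probabilities depend on the prefix.
NEXT = {
    (): {"Monday": 0.55, "Tuesday": 0.45},  # step 1: the wrong day has moderate probability
    ("Monday",): {"the Monday deadline holds": 0.95, "something else": 0.05},
    ("Tuesday",): {"the Tuesday deadline holds": 0.95, "something else": 0.05},
}

def greedy_continue(prefix):
    """Pick the most likely continuation given the prefix (greedy decoding)."""
    dist = NEXT[tuple(prefix)]
    return max(dist, key=dist.get)

# Suppose step 1 happened to sample the wrong day (45% chance in this toy model):
context = ["Tuesday"]
# Steps 2-4: the continuation now conditions on "Tuesday" as if it were fact.
context.append(greedy_continue(context))
print(" -> ".join(context))
```

The resulting sequence is internally consistent (every token is highly probable given its prefix) yet factually wrong from the first step on.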
3. Exposure Bias
Training with teacher forcing (always feeding ground-truth prefixes) means models never learn to recover from their own errors. At inference, early mistakes compound.
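One standard mitigation for exposure bias is scheduled sampling, which mixes the model's own predictions into the training prefix. The sketch below assumes a hypothetical `flaky_model` stand-in; only the prefix-construction logic is the point.

```python
import random

def build_training_prefix(gold_tokens, model_predict, p_model):
    """Scheduled sampling: at each position, feed the model's own prediction
    with probability p_model, otherwise the ground-truth token.
    p_model = 0.0 recovers pure teacher forcing."""
    prefix = []
    for gold in gold_tokens:
        if random.random() < p_model:
            prefix.append(model_predict(prefix))  # model's own, possibly wrong, token
        else:
            prefix.append(gold)                   # ground truth, as in teacher forcing
    return prefix

# Hypothetical stand-in for a model that errs 30% of the time:
def flaky_model(prefix):
    return "<err>" if random.random() < 0.3 else "<ok>"

random.seed(1)
print(build_training_prefix(list("abcd"), flaky_model, p_model=0.5))
```

Because the model occasionally sees its own (possibly wrong) tokens in the prefix during training, it gets gradient signal on how to continue after an error, which pure teacher forcing never provides.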
Mitigation Approaches
| Approach | Mechanism | Limitation |
|---|---|---|
| RAG | Ground generation in retrieved documents | Retrieval failures propagate |
| RLHF | Human feedback on factual errors | Raters miss specialized errors |
| Self-consistency | Multiple samples; majority vote | Confident wrong answers can dominate |
| Chain-of-thought | Explicit checkable reasoning steps | Reasoning chains themselves can hallucinate |
| Uncertainty calibration | Output confidence alongside claims | Models poorly calibrated for rare entities |
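Of the approaches above, self-consistency is simple enough to sketch directly. The answer strings below are hypothetical samples for a single factual question; the vote share doubles as a rough confidence signal.

```python
from collections import Counter

def self_consistency_vote(samples):
    """Majority vote over independently sampled answers.
    Returns (winning answer, vote share)."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

# Voting across samples filters out low-probability slips:
print(self_consistency_vote(["1912", "1912", "1913", "1912", "1915"]))

# The limitation noted in the table: a confidently wrong model simply
# votes its wrong answer into the majority.
print(self_consistency_vote(["1915", "1915", "1915", "1912"]))
```

The second call illustrates why the table lists "confident wrong answers can dominate" as the limitation: agreement across samples measures the model's internal consistency, not ground truth.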
Related Pages
See rag for retrieval-based factual grounding, rlhf for alignment training that partially mitigates hallucination, and training-data-curation for how data quality affects hallucination rates.
Sources
- Ji et al. (2023) — Survey of Hallucination in Natural Language Generation. ACM Computing Surveys
- Mallen et al. (2023) — When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023
- Maynez et al. (2020) — On Faithfulness and Factuality in Abstractive Summarization. ACL 2020
Frequently Asked Questions
What are the main mechanistic causes of hallucination?
Four primary mechanisms: (1) Training data conflicts — corpora contain contradictory facts about entities; the model learns a distribution over these and generates a plausible but incorrect resolution. (2) Knowledge sparsity — entities appearing rarely in training have high weight uncertainty; the model generates plausible-sounding but incorrect attributes. (3) Decoding error amplification — greedy or beam decoding reinforces early factual errors across the sequence, since later tokens condition on the earlier incorrect claim as if it were fact. (4) Exposure bias — training on teacher-forced correct prefixes leaves models unprepared to recover from errors in their own previous output.
Why does RLHF reduce but not eliminate hallucination?
RLHF trains on human preference comparisons, penalizing outputs raters identify as incorrect. For frequently-encountered topics, raters can detect errors and the reward model learns to penalize them. For rare or specialized topics, human raters often fail to recognize incorrect claims — so the reward model does not penalize them and the trained policy learns to sound confident and fluent regardless of factual accuracy. The defining characteristic of hallucination is plausible language paired with wrong content, and RLHF primarily optimizes plausibility.