Hallucination Mechanisms: Why Language Models Generate Plausible but Incorrect Text

Category: evaluation · Updated: 2026-02-27

Ji et al. (ACM Computing Surveys 2023): hallucination = content unsupported or contradicted by source; Mallen et al. (ACL 2023): entities in bottom-25% training frequency show 4–14× higher hallucination rates than top-25%; Maynez et al. (ACL 2020): ~30% of abstractive summaries contain hallucinated content.

Key Data Points
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Hallucination rate in abstractive summarization | ~30% | % summaries affected | Maynez et al. (2020): ~30% of abstractive summaries contain hallucinated content; extractive <1% |
| Frequency effect: rare vs common entities | 4–14× higher rate | × hallucination rate | Mallen et al. (2023): bottom-25% training-frequency entities hallucinate 4–14× vs top-25% |
| Intrinsic vs extrinsic hallucination | Both in ~30–50% of evaluated summaries | % examples | Ji et al. (2023): intrinsic = contradicts source; extrinsic = adds information absent from source |
| RLHF effect on hallucination | Reduces on common topics; persists on rare ones | n/a | RLHF reward model trained by raters who cannot detect specialized errors; confident hallucinations remain |

Hallucination in language models refers to generated text that is factually incorrect, unsupported by provided source documents, or internally inconsistent, despite appearing fluent and confident. Ji et al. (2023) classify hallucinations as intrinsic (contradicting a provided source) or extrinsic (adding information absent from the source), with both arising from distinct failure modes in training and inference.

Taxonomy of Hallucination

| Type | Definition | Example |
|---|---|---|
| Intrinsic | Contradicts the source document | Summary says “Tuesday” when source says “Monday” |
| Extrinsic | Adds information absent from source | Summary adds an unstated cause for an event |
| World-knowledge | Contradicts factual reality | Attributing a quote to the wrong person |
| Numerical | Incorrect quantities, dates, or statistics | Wrong year, wrong percentage |
| Entity-level | Wrong name, location, or relationship | Misattributing a discovery to the wrong researcher |

Frequency-Based Hallucination Rates (Mallen et al., 2023)

Mallen et al. evaluated entity-level factual accuracy across entities stratified by training corpus frequency:

| Entity Frequency Quartile | Hallucination Rate | Relative Rate |
|---|---|---|
| Top 25% (most frequent) | ~5% | 1× (baseline) |
| 2nd quartile | ~10% | ~2× |
| 3rd quartile | ~20% | ~4× |
| Bottom 25% (least frequent) | ~20–70% | 4–14× |

The practical implication: model confidence is anti-correlated with accuracy for rare entities. Confident-sounding claims about obscure topics are significantly less reliable than claims about common topics.
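The quartile breakdown above can be made concrete with a small helper that stratifies evaluation records by entity training frequency and computes rates relative to the most-frequent quartile. The records here are invented to mirror the table, not real evaluation data:

```python
from collections import defaultdict

def rates_by_quartile(records):
    """records: (quartile, is_hallucinated) pairs, quartile 1 = most frequent.
    Returns per-quartile hallucination rate and rate relative to quartile 1."""
    counts = defaultdict(lambda: [0, 0])   # quartile -> [errors, total]
    for quartile, hallucinated in records:
        counts[quartile][0] += int(hallucinated)
        counts[quartile][1] += 1
    rates = {q: err / total for q, (err, total) in counts.items()}
    baseline = rates[1]                    # most-frequent entities
    relative = {q: r / baseline for q, r in rates.items()}
    return rates, relative

# Toy records matching the table: Q1 ~5% errors, Q4 ~20% errors.
records = [(1, i < 1) for i in range(20)] + [(4, i < 4) for i in range(20)]
rates, relative = rates_by_quartile(records)
```

With these toy numbers the bottom quartile comes out at roughly 4× the baseline rate, the low end of the reported 4–14× range.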

Mechanistic Sources of Hallucination

1. Training Data Conflicts

Corpora contain contradictory facts about the same entity. The model learns a distribution over conflicting claims and samples from it, generating internally consistent but externally incorrect text.
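A toy sketch of this mechanism (the corpus and the conflicting dates are invented): a "model" that simply matches the training distribution over conflicting claims will emit the minority, incorrect claim at roughly its corpus frequency:

```python
import random

# Hypothetical corpus: two conflicting dates for the same event;
# 70% of documents say "1912", 30% say "1921".
corpus = ["1912"] * 7 + ["1921"] * 3

def sample_claim(rng):
    # A distribution-matching "model": emit each claim in proportion
    # to how often it appears in training.
    return rng.choice(corpus)

rng = random.Random(0)
samples = [sample_claim(rng) for _ in range(1000)]
wrong_rate = samples.count("1921") / len(samples)  # close to 0.3
```

Each individual generation is fluent and internally consistent; the error rate is a property of the learned distribution, not of any single decoding step.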

2. Decoding Error Amplification

Step 1: Model generates incorrect claim C with moderate probability
Step 2: C becomes part of context for subsequent tokens
Step 3: Subsequent tokens now condition on C as if it were fact
Step 4: Sequence appears internally consistent but factually wrong
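The four steps above can be simulated with a toy greedy decoder (the probability table is invented): once the wrong claim enters the context, every subsequent argmax step prefers tokens consistent with it:

```python
def next_token_probs(context):
    # Toy P(next | context): the model strongly prefers continuations
    # consistent with whatever day is already claimed in the context.
    if any("Tuesday" in tok for tok in context):
        return {"so the meeting is Tuesday": 0.9, "so the meeting is Monday": 0.1}
    return {"so the meeting is Monday": 0.9, "so the meeting is Tuesday": 0.1}

def greedy_decode(context, steps):
    for _ in range(steps):
        probs = next_token_probs(context)
        context.append(max(probs, key=probs.get))  # take the argmax token
    return context

# Step 1's error ("Tuesday") conditions every later step, producing a
# self-consistent but factually wrong sequence.
trace = greedy_decode(["It happened Tuesday"], steps=3)
```

Starting the same decoder from a correct claim yields a uniformly correct continuation, which is the point: the decoder amplifies whatever entered the context first.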

3. Exposure Bias

Training with teacher forcing (always feeding ground-truth prefixes) means models never learn to recover from their own errors. At inference, early mistakes compound.
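A minimal contrast between the teacher-forced training view and free-running inference (the bigram transition table is invented): the model is only ever fit on gold prefixes, so it has no behavior defined for its own mistakes:

```python
# Toy bigram "model" trained only on the gold sequence "the cat sat down".
TRAINED = {"the": "cat", "cat": "sat", "sat": "down"}

def predict(prev):
    return TRAINED.get(prev, "<unk>")  # off-distribution input: no recovery

def teacher_forced(gold):
    # Training-time view: every prediction conditions on the gold prefix,
    # so the model never observes the consequences of an earlier error.
    return [predict(tok) for tok in gold[:-1]]

def free_running(start, steps):
    # Inference-time view: each prediction conditions on the model's own
    # previous output, so one bad token derails everything after it.
    out = [start]
    for _ in range(steps):
        out.append(predict(out[-1]))
    return out
```

Under teacher forcing the predictions all look correct; in free-running mode a single off-distribution token puts the model in a state it was never trained on, and the error compounds.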

Mitigation Approaches

| Approach | Mechanism | Limitation |
|---|---|---|
| RAG | Ground generation in retrieved documents | Retrieval failures propagate |
| RLHF | Human feedback on factual errors | Raters miss specialized errors |
| Self-consistency | Multiple samples; majority vote | Confident wrong answers can dominate |
| Chain-of-thought | Explicit checkable reasoning steps | Reasoning chains themselves can hallucinate |
| Uncertainty calibration | Output confidence alongside claims | Models poorly calibrated for rare entities |
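The self-consistency row and its limitation are easy to make concrete (the sampled answers are invented): majority voting recovers the truth when errors are scattered across different wrong answers, but reinforces a confidently wrong answer that dominates the samples:

```python
from collections import Counter

def majority_vote(samples):
    """Self-consistency: sample the model several times, keep the mode."""
    return Counter(samples).most_common(1)[0][0]

# Errors scattered across different wrong answers: the vote recovers "42".
scattered = ["42", "41", "42", "43", "42"]

# Confident hallucination: one wrong answer dominates every sample,
# so voting reinforces the error instead of correcting it.
confident = ["41", "41", "41", "42", "41"]
```

This is why self-consistency helps most against stochastic decoding errors and least against the distribution-level errors described under training data conflicts.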

See rag for retrieval-based factual grounding, rlhf for alignment training that partially mitigates hallucination, and training-data-curation for how data quality affects hallucination rates.



Frequently Asked Questions

What are the main mechanistic causes of hallucination?

Four primary mechanisms: (1) Training data conflicts — corpora contain contradictory facts about entities; the model learns a distribution over these and generates a plausible but incorrect resolution. (2) Knowledge sparsity — entities appearing rarely in training have high weight uncertainty; the model generates plausible-sounding but incorrect attributes. (3) Decoding error amplification — greedy or beam decoding reinforces early factual errors across the sequence, since incorrect claims have high conditional probability given themselves. (4) Exposure bias — training on teacher-forced correct prefixes leaves models unprepared to recover from errors in their own previous output.

Why does RLHF reduce but not eliminate hallucination?

RLHF trains on human preference comparisons, penalizing outputs raters identify as incorrect. For frequently-encountered topics, raters can detect errors and the reward model learns to penalize them. For rare or specialized topics, human raters often fail to recognize incorrect claims — so the reward model does not penalize them and the trained policy learns to sound confident and fluent regardless of factual accuracy. The defining characteristic of hallucination is plausible language paired with wrong content, and RLHF primarily optimizes plausibility.
