Next-Token Prediction: Causal Language Modeling Objective and Perplexity
Causal language modeling maximizes log P(x) = Σₜ log P(x_t | x_{<t}); perplexity = exp(−(1/N)Σ log P(x_t|context)); GPT-2 117M achieved perplexity 35.1 on Penn Treebank without fine-tuning (Radford et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| Training objective | max Σₜ log P(x_t | x₁,…,x_{t−1}) | — | Equivalently, minimize cross-entropy H(y, ŷ) = −Σ y_i log ŷ_i |
| Perplexity formula | PPL = exp(−(1/N) Σₜ log P(x_t | x_{<t})) | — | Geometric mean of inverse probabilities; lower is better |
| GPT-2 117M perplexity (Penn Treebank) | 35.1 | PPL | Radford et al. (2019); zero-shot, no fine-tuning; SOTA at that time was ~34 with fine-tuning |
| Context for causal mask | Left-only | — | Attention mask sets upper triangle to −∞ before softmax; tokens cannot attend to future positions |
| Tokens per batch (GPT-3) | 3.2 million | tokens/batch | Large batches reduce gradient variance; 3.2M tokens across sequences of 2,048 tokens |
Next-token prediction (causal language modeling) is the training objective that transforms a transformer decoder into a language model. By training the model to predict each token from its preceding context, the model learns general-purpose language representations without any task-specific supervision.
The Objective
For a sequence of tokens (x₁, x₂, …, x_N), the language model objective is to maximize:
log P(x) = Σₜ₌₁ᴺ log P(x_t | x₁, x₂, …, x_{t-1})
Each P(x_t | x₁,…,x_{t-1}) is computed by:
- Running the causal transformer to obtain hidden state h_t at position t
- Projecting h_t through the output (unembedding) layer: logits = h_t · W_E^T
- Applying softmax to get a probability distribution over vocabulary
The loss is the sum of cross-entropies over all positions.
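The three steps above can be sketched in a few lines. This is a minimal pure-Python illustration with toy logit vectors standing in for h_t · W_E^T; a real implementation would operate on tensors, but the arithmetic is the same:

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def causal_lm_loss(logits_per_position, targets):
    """Sum of cross-entropies: -Sum_t log P(x_t | x_{<t}).

    logits_per_position[t] is the vocabulary logit vector at position t
    (the output of projecting h_t through the unembedding layer);
    targets[t] is the id of the true next token.
    """
    total = 0.0
    for logits, y in zip(logits_per_position, targets):
        probs = softmax(logits)
        total -= math.log(probs[y])  # negative log-likelihood of the true token
    return total

# Toy example: 3-word vocabulary, two positions
loss = causal_lm_loss([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
```

With uniform logits the per-position loss is log |V|, the worst-case baseline for an untrained model.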
Causal Masking
The autoregressive property is enforced via a triangular attention mask:
| Query \ Key | t=1 | t=2 | t=3 | t=4 |
|---|---|---|---|---|
| Position 1 | ✓ | ✗ | ✗ | ✗ |
| Position 2 | ✓ | ✓ | ✗ | ✗ |
| Position 3 | ✓ | ✓ | ✓ | ✗ |
| Position 4 | ✓ | ✓ | ✓ | ✓ |
Positions marked ✗ are set to −∞ before softmax, producing attention weight ≈ 0. This ensures that when computing the representation for position t, only tokens 1..t are visible.
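The mask-then-softmax step can be sketched directly. A minimal pure-Python version (real implementations vectorize this over batched tensors, but the mechanics are identical):

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """n x n mask: 0 where key position j <= query position i, -inf for future positions."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Add the mask to raw attention scores, then softmax each row.

    exp(-inf) evaluates to 0.0, so masked positions get exactly zero weight.
    """
    out = []
    for row_s, row_m in zip(scores, mask):
        masked = [s + m for s, m in zip(row_s, row_m)]
        mx = max(masked)  # finite: the diagonal is always unmasked
        exps = [math.exp(v - mx) for v in masked]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

# Toy 3x3 attention scores; every row gets the same raw scores
weights = masked_softmax([[1.0, 2.0, 3.0]] * 3, causal_mask(3))
```

Position 1 can only attend to itself, so its entire attention weight lands on token 1, matching the first row of the table above.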
Perplexity Benchmarks
| Model | Parameters | Penn Treebank PPL | Notes |
|---|---|---|---|
| 5-gram (Kneser-Ney) | — | 141 | Classic n-gram baseline |
| LSTM (Merity et al., 2018) | 33M | 57.3 | State-of-the-art LSTM with fine-tuning |
| GPT-2 117M | 117M | 35.1 | Zero-shot; Radford et al. (2019) |
| Transformer-XL (Dai et al.) | 257M | 21.8 | Recurrence for long context |
Teacher Forcing
During training, the model receives the true tokens as input at each position (not its own predictions). This technique, called “teacher forcing,” provides stable training gradients — if the model makes a prediction error, subsequent positions still receive the correct context. At inference time, the model must use its own predictions autoregressively.
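Concretely, teacher forcing means the training inputs and targets are the same sequence offset by one position. A sketch of how the pairs are constructed:

```python
def teacher_forcing_pairs(tokens):
    """Build (inputs, targets) for causal LM training.

    Inputs are the true tokens x_1..x_{N-1}; targets are the shifted
    tokens x_2..x_N, so the model at position t predicts token t+1.
    """
    return tokens[:-1], tokens[1:]

inputs, targets = teacher_forcing_pairs(["The", "cat", "sat", "."])
# inputs  = ["The", "cat", "sat"]
# targets = ["cat", "sat", "."]
```

Because every position's input is the ground-truth token, one forward pass yields a valid prediction target at every position simultaneously; at inference time the model instead feeds its own sampled token back in as the next input.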
Related Pages
See pre-training for the broader pre-training data and compute context, perplexity-metric for how perplexity is used for model evaluation, and temperature-sampling for how the probability distribution from next-token prediction is used to generate text.
Sources
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI Technical Report
- Brown et al. (2020) — Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020
- Bengio et al. (2003) — A Neural Probabilistic Language Model. JMLR 2003
Frequently Asked Questions
Why is next-token prediction an effective pre-training objective?
Next-token prediction is a general-purpose objective — to predict what comes next in text, a model must implicitly learn syntax, semantics, factual relationships, reasoning patterns, and conversational structure. The training signal is derived entirely from the text itself (no human labels needed), enabling training on internet-scale data. Radford et al. (2019) demonstrated that GPT-2 acquires diverse capabilities (summarization, translation, QA) purely from next-token prediction.
What is the relationship between cross-entropy loss and perplexity?
The average negative log-likelihood (NLL) per token is the cross-entropy loss: CE = −(1/N) Σ log P(x_t | context). Perplexity is the exponential of this: PPL = exp(CE). A model with perplexity 35 assigns the correct next token a (geometric) average probability of approximately 1/35 ≈ 2.9%. Lower perplexity indicates a better fit to the data distribution. Comparing perplexities across models requires identical tokenization, since the per-token loss depends on how the text is segmented.
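The exp(CE) relationship is easy to verify numerically. A small sketch using toy per-token log-probabilities:

```python
import math

def perplexity(log_probs):
    """PPL = exp(-(1/N) * sum of per-token log-probabilities)."""
    ce = -sum(log_probs) / len(log_probs)  # average NLL = cross-entropy
    return math.exp(ce)

# If the model assigns every true token probability exactly 1/35,
# the perplexity is exactly 35, matching the interpretation above.
ppl = perplexity([math.log(1 / 35)] * 10)
```

This also makes clear why perplexity is the geometric mean of inverse probabilities: averaging in log space and exponentiating is exactly the geometric mean.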
Why can't the model attend to future tokens during training?
During pre-training with causal language modeling, the model must predict x_t using only x_1,...,x_{t-1}. If the model could attend to x_{t+1} when predicting x_t, the task becomes trivially easy — the answer is always in the context. The causal attention mask sets all attention weights from position t to positions > t to −∞ before softmax, effectively zeroing them out and enforcing this constraint. This mask also makes the architecture directly usable for autoregressive generation at inference time.