Byte-Pair Encoding: Algorithm, Vocabulary Construction, and Byte-Level BPE
BPE iteratively merges the most frequent adjacent symbol pair until the vocabulary reaches a target size. GPT-2 applies BPE directly to raw UTF-8 bytes, producing a 50,257-token vocabulary with guaranteed full Unicode coverage and zero unknown tokens (Radford et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE; includes all 256 bytes as base tokens + 50,000 merges + special tokens |
| Base vocabulary (byte-level) | 256 | bytes | All UTF-8 bytes as initial symbols; zero unknown token issue |
| Number of BPE merge operations (GPT-2) | 50,000 | merges | Each merge combines two symbols into one new subword token |
| BPE time complexity | O(V·I) | — | V = vocabulary size, I = number of merge iterations; each iteration scans the corpus once |
| Typical multilingual vocabulary | 250,000+ | tokens | Larger vocabularies needed to cover diverse scripts without excessive character-level decomposition |
Byte-pair encoding, originally a data compression algorithm, was adapted for neural machine translation by Sennrich et al. (2016) to address the open-vocabulary problem. It builds a vocabulary of subword units by greedily merging the most frequent adjacent symbol pairs from a training corpus.
The Algorithm
Initialization: Build a corpus as a frequency-weighted list of words, each split into characters with an end-of-word marker (e.g., “low” → “l o w </w>”).
Iteration:
- Count frequency of all adjacent symbol pairs in the corpus
- Find the most frequent pair (e.g., “e s” appearing 8,000 times)
- Merge that pair into a new symbol (“es”)
- Update the corpus vocabulary with the merged symbol
- Repeat until vocabulary reaches target size
Result: A vocabulary where common words appear as single tokens, and rare words decompose into several subword pieces.
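The training loop above can be sketched in a few dozen lines. This is a minimal illustration on a toy corpus, not GPT-2's optimized implementation; the `train_bpe` helper and the example word frequencies are hypothetical.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a {word: frequency} corpus."""
    # Each word becomes a tuple of symbols with an end-of-word marker,
    # e.g. "low" -> ('l', 'o', 'w', '</w>')
    vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequency of all adjacent symbol pairs
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

corpus = {"low": 5, "lower": 2, "lowest": 6, "newest": 3}
merges, vocab = train_bpe(corpus, num_merges=10)
print(merges[:3])
```

The merge list is the learned artifact: its order encodes merge priority and is what an encoder replays at inference time.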
Merge Priority Example
| Step | Most Frequent Pair | New Symbol | Corpus Change |
|---|---|---|---|
| 1 | (“e”, “s”) 8,432× | “es” | “lowest” → “l o w es t” |
| 2 | (“es”, “t”) 6,891× | “est” | “lowest” → “l o w est” |
| 3 | (“l”, “o”) 5,234× | “lo” | “lowest” → “lo w est” |
| 4 | (“lo”, “w”) 4,102× | “low” | “lowest” → “low est” |
| … | … | … | … |
After sufficient merges, common words like “the” and “lowest” are single tokens, while rare words like “electroencephalography” decompose into frequent morphemes.
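At encoding time, a word is segmented by replaying the learned merges in priority order: always apply the earliest-learned merge that still matches. A minimal sketch, assuming the toy merge list from the table above (the `segment` helper is illustrative):

```python
def segment(word, merges):
    """Tokenize one word by applying learned merges in priority order."""
    symbols = list(word) + ["</w>"]
    # Earlier merges have higher priority (lower rank)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        candidates = [(rank.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learnable pair remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(segment("lowest", merges))  # -> ['low', 'est', '</w>']
```

A rare word with no matching merges simply stays as characters, which is why BPE never needs an unknown token at the subword level.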
Byte-Level BPE vs Character-Level BPE
| Property | Character-level BPE | Byte-level BPE |
|---|---|---|
| Base vocabulary | Unicode characters (1,114,112 possible) | 256 UTF-8 bytes |
| Unknown tokens | Possible for unusual Unicode | None — every string is encodable |
| Sequence length | Baseline | Slightly longer for non-ASCII text |
| Implementation | Simpler | Requires byte-to-char mapping |
| Example (GPT-2) | — | 256 bytes + 50,000 merges = 50,257 tokens |
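The byte-level base vocabulary falls out of UTF-8 itself: every string decodes to a sequence of bytes in 0–255, so 256 base symbols cover all scripts. A quick demonstration (plain Python, no tokenizer library needed):

```python
text = "héllo"  # contains one non-ASCII character
byte_ids = list(text.encode("utf-8"))

print(byte_ids)  # 'é' expands to two bytes: 0xC3 0xA9
print(len(text), "characters ->", len(byte_ids), "bytes")
```

Every byte value is a valid base token, so no input can fall outside the vocabulary; the cost is that non-ASCII characters occupy two to four base symbols before any merges apply, which is the "slightly longer sequences" tradeoff in the table above.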
Vocabulary Size Tradeoffs
| Vocabulary Size | Pros | Cons |
|---|---|---|
| Small (16K) | Short sequences, small embedding table | More character-level splits for rare words |
| Medium (32K–50K) | Balance for monolingual English | Moderate sequence length |
| Large (100K+) | Better coverage of morphology and code | Large embedding/output matrices |
| Multilingual (250K+) | All scripts covered efficiently | Very large embedding tables |
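The embedding-table cost in the table above is simple arithmetic: the input embedding holds V × d parameters (and the output projection the same again if untied). A quick back-of-envelope calculation, assuming a hypothetical hidden size of 1,024:

```python
d_model = 1024  # hypothetical hidden size for illustration

for vocab_size in (16_000, 50_257, 250_000):
    params = vocab_size * d_model  # input embedding parameters
    print(f"{vocab_size:>7} tokens -> {params / 1e6:6.1f}M embedding parameters")
```

Going from a 50K to a 250K vocabulary multiplies the embedding cost by roughly five, which is why multilingual models pay a substantial parameter budget just for token coverage.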
Related Pages
See tokenization for the broader tokenization landscape, and word-embeddings for how these token indices are converted to dense continuous vectors for the transformer.
Sources
- Sennrich et al. (2016) — Neural Machine Translation of Rare Words with Subword Units. ACL 2016
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI Technical Report
- Gage (1994) — A New Algorithm for Data Compression. C Users Journal
Frequently Asked Questions
What problem does BPE solve compared to word-level tokenization?
Word-level tokenization must assign an [UNK] token to any word not seen during training — a critical problem for names, technical terms, and morphological variants. BPE avoids this by decomposing rare words into frequent subword pieces: “internationalization” might not appear in the vocabulary, but “inter”, “national”, and “ization” do. Sennrich et al. (2016) showed that BPE-based NMT systems achieve near-zero unknown token rates while keeping sequence length manageable.
What is the difference between character-level BPE and byte-level BPE?
Character-level BPE starts with Unicode code points as the initial vocabulary, which can number in the thousands for multilingual text. Byte-level BPE (introduced by Radford et al., 2019 for GPT-2) starts with raw UTF-8 bytes as the 256-token base vocabulary. Since all text is ultimately bytes, byte-level BPE guarantees zero unknown tokens regardless of language or script, at the cost of slightly longer sequences for some Unicode characters.