Byte-Pair Encoding: Algorithm, Vocabulary Construction, and Byte-Level BPE
BPE iteratively merges the most frequent adjacent symbol pair until the vocabulary reaches a target size. GPT-2 applies BPE directly to raw UTF-8 bytes, producing a 50,257-token vocabulary with guaranteed full Unicode coverage and zero unknown tokens (Radford et al., 2019).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| GPT-2 vocabulary size | 50,257 | tokens | Byte-level BPE; includes all 256 bytes as base tokens + 50,000 merges + special tokens |
| Base vocabulary (byte-level) | 256 | bytes | All UTF-8 bytes as initial symbols; zero unknown token issue |
| Number of BPE merge operations (GPT-2) | 50,000 | merges | Each merge combines two symbols into one new subword token |
| BPE time complexity | O(V·I) | — | V = vocabulary size, I = number of merge iterations; each iteration scans the corpus once |
| Typical multilingual vocabulary | 250,000+ | tokens | Larger vocabularies needed to cover diverse scripts without excessive character-level decomposition |
Byte-pair encoding, originally a data compression algorithm, was adapted for neural machine translation by Sennrich et al. (2016) to address the open-vocabulary problem. It builds a vocabulary of subword units by greedily merging the most frequent adjacent symbol pairs from a training corpus.
The Algorithm
Initialization: Build a corpus as a frequency-weighted list of words, each split into characters with an end-of-word marker (e.g., “low” → “l o w </w>”).
Iteration:
- Count frequency of all adjacent symbol pairs in the corpus
- Find the most frequent pair (e.g., “e s” appearing 8,000 times)
- Merge that pair into a new symbol (“es”)
- Update the corpus vocabulary with the merged symbol
- Repeat until vocabulary reaches target size
Result: A vocabulary where common words appear as single tokens, and rare words decompose into several subword pieces.
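The training loop above can be sketched in a few dozen lines. This is a minimal illustration on a toy corpus, not GPT-2's optimized implementation; the `train_bpe` helper and the example word frequencies are hypothetical.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a {word: frequency} corpus."""
    # Each word becomes a tuple of symbols with an end-of-word marker,
    # e.g. "low" -> ('l', 'o', 'w', '</w>')
    vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequency of all adjacent symbol pairs
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

corpus = {"low": 5, "lower": 2, "lowest": 6, "newest": 3}
merges, vocab = train_bpe(corpus, num_merges=10)
print(merges[:3])
```

The merge list is the learned artifact: its order encodes merge priority and is what an encoder replays at inference time.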
Merge Priority Example
| Step | Most Frequent Pair | New Symbol | Corpus Change |
|---|---|---|---|
| 1 | (“e”, “s”) 8,432× | “es” | “lowest” → “l o w es t” |
| 2 | (“es”, “t”) 6,891× | “est” | “lowest” → “l o w est” |
| 3 | (“l”, “o”) 5,234× | “lo” | “lowest” → “lo w est” |
| 4 | (“lo”, “w”) 4,102× | “low” | “lowest” → “low est” |
| … | … | … | … |
After sufficient merges, common words like “the” and “lowest” are single tokens, while rare words like “electroencephalography” decompose into frequent morphemes.
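At encoding time, a word is segmented by replaying the learned merges in priority order: always apply the earliest-learned merge that still matches. A minimal sketch, assuming the toy merge list from the table above (the `segment` helper is illustrative):

```python
def segment(word, merges):
    """Tokenize one word by applying learned merges in priority order."""
    symbols = list(word) + ["</w>"]
    # Earlier merges have higher priority (lower rank)
    rank = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank
        candidates = [(rank.get((a, b), float("inf")), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learnable pair remains
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(segment("lowest", merges))  # -> ['low', 'est', '</w>']
```

A rare word with no matching merges simply stays as characters, which is why BPE never needs an unknown token at the subword level.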
Byte-Level BPE vs Character-Level BPE
| Property | Character-level BPE | Byte-level BPE |
|---|---|---|
| Base vocabulary | Unicode characters (1,114,112 possible) | 256 UTF-8 bytes |
| Unknown tokens | Possible for unusual Unicode | None — every string is encodable |
| Sequence length | Baseline | Slightly longer for non-ASCII text |
| Implementation | Simpler | Requires byte-to-char mapping |
| Example (GPT-2) | — | 256 bytes + 50,000 merges = 50,257 tokens |
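The byte-level base vocabulary falls out of UTF-8 itself: every string decodes to a sequence of bytes in 0–255, so 256 base symbols cover all scripts. A quick demonstration (plain Python, no tokenizer library needed):

```python
text = "héllo"  # contains one non-ASCII character
byte_ids = list(text.encode("utf-8"))

print(byte_ids)  # 'é' expands to two bytes: 0xC3 0xA9
print(len(text), "characters ->", len(byte_ids), "bytes")
```

Every byte value is a valid base token, so no input can fall outside the vocabulary; the cost is that non-ASCII characters occupy two to four base symbols before any merges apply, which is the "slightly longer sequences" tradeoff in the table above.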
Vocabulary Size Tradeoffs
| Vocabulary Size | Pros | Cons |
|---|---|---|
| Small (16K) | Short sequences, small embedding table | More character-level splits for rare words |
| Medium (32K–50K) | Balance for monolingual English | Moderate sequence length |
| Large (100K+) | Better coverage of morphology and code | Large embedding/output matrices |
| Multilingual (250K+) | All scripts covered efficiently | Very large embedding tables |
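The embedding-table cost in the table above is simple arithmetic: the input embedding holds V × d parameters (and the output projection the same again if untied). A quick back-of-envelope calculation, assuming a hypothetical hidden size of 1,024:

```python
d_model = 1024  # hypothetical hidden size for illustration

for vocab_size in (16_000, 50_257, 250_000):
    params = vocab_size * d_model  # input embedding parameters
    print(f"{vocab_size:>7} tokens -> {params / 1e6:6.1f}M embedding parameters")
```

Going from a 50K to a 250K vocabulary multiplies the embedding cost by roughly five, which is why multilingual models pay a substantial parameter budget just for token coverage.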
Related Pages
See tokenization for the broader tokenization landscape, and word-embeddings for how these token indices are converted to dense continuous vectors for the transformer.
Sources
- Sennrich et al. (2016) — Neural Machine Translation of Rare Words with Subword Units. ACL 2016
- Radford et al. (2019) — Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI Technical Report
- Gage (1994) — A New Algorithm for Data Compression. C Users Journal
Frequently Asked Questions
What problem does BPE solve compared to word-level tokenization?
Word-level tokenization must assign an [UNK] token to any word not seen during training — a critical problem for names, technical terms, and morphological variants. BPE avoids this by decomposing rare words into frequent subword pieces: “internationalization” might not appear in the vocabulary, but “inter”, “national”, and “ization” do. Sennrich et al. (2016) showed that BPE-based NMT systems achieve near-zero unknown token rates while keeping sequence length manageable.
What is the difference between character-level BPE and byte-level BPE?
Character-level BPE starts with Unicode code points as the initial vocabulary, which can number in the thousands for multilingual text. Byte-level BPE (introduced by Radford et al., 2019 for GPT-2) starts with raw UTF-8 bytes as the 256-token base vocabulary. Since all text is ultimately bytes, byte-level BPE guarantees zero unknown tokens regardless of language or script, at the cost of slightly longer sequences for some Unicode characters.