Word Embeddings: Distributed Representations, word2vec, and Semantic Geometry
word2vec skip-gram learns 300-dimensional embeddings in which cosine similarity encodes semantics; the vector arithmetic king − man + woman ≈ queen holds with ~76% accuracy (Mikolov et al., 2013); GloVe reaches 75.0% on semantic word-analogy tasks (Pennington et al., 2014).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| word2vec embedding dimension | 300 | dimensions | Mikolov et al. (2013); cosine similarity captures semantic + syntactic relationships |
| GloVe word analogy accuracy | 75.0% | % accuracy | Pennington et al. (2014) on semantic analogy task; 65.5% on combined benchmark |
| word2vec skip-gram training objective | max P(w_{t±c} \| w_t) | — | Predict context words within a window of size c from the center word; c=5 is typical |
| Typical vocabulary size | 100,000–500,000 | tokens | Embedding matrix shape: V × d, e.g., 100K × 300 = 30M parameters |
| Transformer embedding initialization | N(0, d_model^{-0.5}) | — | Vaswani et al. (2017) scale embeddings by √d_model to match the expected scale at the attention input |
Word embeddings are the interface between discrete text tokens and continuous vector spaces where neural networks can compute. Before transformers, static embeddings like word2vec and GloVe were the dominant pre-training approach. In modern transformers, the embedding table provides the initial token representation, which is refined by attention layers into contextual representations.
Static Embeddings: word2vec
Mikolov et al. (2013) introduced two architectures for learning word embeddings:
- Skip-gram: predicts surrounding context words given a center word — better for infrequent words
- CBOW: predicts center word from surrounding context — faster training
The skip-gram objective maximizes:
(1/T) Σₜ Σ_{-c≤j≤c, j≠0} log P(w_{t+j} | w_t)
The published vectors were trained on ~100 billion words of Google News text, yielding 300-dimensional embeddings for a vocabulary of 3 million words and phrases.
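The objective above can be sketched in a few lines. The following is a minimal, illustrative numpy version: `skipgram_pairs` and `log_likelihood` are hypothetical helper names, and the full softmax shown here is what the objective defines mathematically (production word2vec replaces it with negative sampling or hierarchical softmax for speed).

```python
import numpy as np

def skipgram_pairs(tokens, c=5):
    """Generate (center, context) training pairs from a token sequence,
    using a symmetric window of size c (c=5 is typical)."""
    pairs = []
    for t in range(len(tokens)):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((tokens[t], tokens[t + j]))
    return pairs

def log_likelihood(pairs, W_in, W_out, vocab):
    """Average log P(context | center) under the full-softmax skip-gram model.
    W_in: center-word embeddings (V x d); W_out: context-word embeddings (V x d)."""
    total = 0.0
    for center, context in pairs:
        scores = W_out @ W_in[vocab[center]]              # (V,) dot products
        log_probs = scores - np.log(np.sum(np.exp(scores)))  # log softmax
        total += log_probs[vocab[context]]
    return total / len(pairs)
```

Training maximizes `log_likelihood` over the corpus by gradient ascent on `W_in` and `W_out`; the rows of `W_in` are the word vectors that get published.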
Geometric Properties
| Analogy Type | Example | Accuracy (word2vec) |
|---|---|---|
| Semantic (capitals) | Paris:France :: Berlin:? Germany | ~91% |
| Semantic (currency) | Dollar:USA :: Euro:? Germany | ~83% |
| Syntactic (plurals) | Car:Cars :: Bus:? Buses | ~87% |
| Syntactic (comparative) | Good:Better :: Big:? Bigger | ~78% |
| Gender | King − Man + Woman ≈ Queen | ~76% |
Embedding Methods Comparison
| Method | Training Signal | Dimensionality | Key Advantage |
|---|---|---|---|
| word2vec (SG) | Local context window | 300 | Fast; captures analogies |
| GloVe | Global co-occurrence | 300 | Global statistical information |
| FastText | Character n-grams | 300 | Handles morphology, OOV words |
| Transformer (learned) | End-to-end task / pre-training loss | d_model | Architecture-native; refined into contextual representations |
Transformer Embedding Layer
In the transformer, the embedding layer E ∈ ℝ^{V×d_model} maps token index t to a vector E[t] ∈ ℝ^{d_model}. Vaswani et al. (2017) multiply these embeddings by √d_model to match the expected scale of the positional encodings:
embedding_input = E[token_id] × √d_model
The embedding matrix (E) is typically shared with the output projection (the unembedding layer), reducing parameter count: with V=50K and d_model=512, this saves 50K × 512 = 25.6M parameters.
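The lookup, √d_model scaling, and weight tying described above can be sketched as follows; this is an illustrative numpy stand-in, not the reference implementation, and the random `E` here is untrained.

```python
import numpy as np

V, d_model = 50_000, 512
rng = np.random.default_rng(0)
# One shared matrix serves as both the embedding table and the output
# projection (weight tying); with V=50K and d_model=512 this avoids a
# second 50K x 512 ≈ 25.6M-parameter matrix.
E = rng.normal(0.0, d_model ** -0.5, size=(V, d_model)).astype(np.float32)

def embed(token_ids):
    """Look up rows of E and scale by sqrt(d_model), as in Vaswani et al.
    (2017), so embedding magnitudes match the positional encodings."""
    return E[token_ids] * np.sqrt(d_model)

def unembed(hidden):
    """Tied output projection: logits over the vocabulary via E^T."""
    return hidden @ E.T

h = embed(np.array([42, 7, 1000]))   # shape (3, d_model)
logits = unembed(h)                  # shape (3, V)
```

In a real transformer, positional encodings are added to `embed`'s output and the attention stack transforms it before `unembed` produces next-token logits.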
Related Pages
See tokenization for how tokens are assigned indices, context-window for how embeddings are positioned in sequence, and pre-training for how modern models learn contextual representations.
Sources
- Mikolov et al. (2013) — Distributed Representations of Words and Phrases. NeurIPS 2013
- Pennington et al. (2014) — GloVe: Global Vectors for Word Representation. EMNLP 2014
- Bojanowski et al. (2017) — Enriching Word Vectors with Subword Information (FastText). TACL 2017
Frequently Asked Questions
Why does vector arithmetic like 'king − man + woman ≈ queen' work?
Word2vec embeddings capture distributional semantics — words appearing in similar contexts have similar vectors. The analogy works because the woman − man difference vector approximates a gender direction in the embedding space, so king + (woman − man) lands near queen. This holds with ~76% accuracy on the 8,869 semantic analogy questions from Mikolov et al. (2013). It is a property that emerges from training, not a design constraint.
How are word embeddings in transformers different from word2vec?
word2vec produces static embeddings — each word has one vector regardless of context. Transformer embeddings (contextual embeddings) are dynamically computed: the same token 'bank' gets different representations in 'river bank' vs 'bank account' because attention layers mix information from surrounding tokens. The embedding table in a transformer provides an initial lookup; the hidden states after each attention layer become progressively more contextual.
What dimension should word embeddings be?
Empirically, performance scales logarithmically with embedding dimension. word2vec's original 300-dim vectors capture most useful structure; increasing to 1000+ dims yields diminishing returns for static embeddings. In transformers, d_model (embedding dimension) is coupled to the rest of the architecture. The original transformer uses d_model=512; larger models use 768, 1024, 2048, 4096, or 8192 dimensions.