Word Embeddings: Distributed Representations, word2vec, and Semantic Geometry
word2vec skip-gram learns 300-dimensional embeddings in which cosine similarity encodes semantics; the vector arithmetic king − man + woman ≈ queen holds with ~76% accuracy (Mikolov et al., 2013); GloVe reaches 75.0% on semantic word-analogy tasks (Pennington et al., 2014).
| Measure | Value | Unit | Notes |
|---|---|---|---|
| word2vec embedding dimension | 300 | dimensions | Mikolov et al. (2013); cosine similarity captures semantic + syntactic relationships |
| GloVe word analogy accuracy | 75.0% | % accuracy | Pennington et al. (2014) on semantic analogy task; 65.5% on combined benchmark |
| word2vec skip-gram training objective | max P(w_{t±c} \| w_t) | — | Predict context words within a window of size c from the center word; c=5 is typical |
| Typical vocabulary size | 100,000–500,000 | tokens | Embedding matrix shape: V × d, e.g., 100K × 300 = 30M parameters |
| Transformer embedding initialization | N(0, d_model^{-0.5}) | — | Vaswani et al. (2017) scale embeddings by √d_model to match the expected scale at the attention input |
Word embeddings are the interface between discrete text tokens and continuous vector spaces where neural networks can compute. Before transformers, static embeddings like word2vec and GloVe were the dominant pre-training approach. In modern transformers, the embedding table provides the initial token representation, which is refined by attention layers into contextual representations.
Static Embeddings: word2vec
Mikolov et al. (2013) introduced two architectures for learning word embeddings:
- Skip-gram: predicts surrounding context words given a center word — better for infrequent words
- CBOW: predicts center word from surrounding context — faster training
The skip-gram objective maximizes:
(1/T) Σₜ Σ_{-c≤j≤c, j≠0} log P(w_{t+j} | w_t)
The published vectors were trained on ~100 billion words of Google News text, yielding 300-dimensional embeddings for a vocabulary of 3 million words and phrases.
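The objective above can be sketched in a few lines. The following is a minimal, illustrative numpy version: `skipgram_pairs` and `log_likelihood` are hypothetical helper names, and the full softmax shown here is what the objective defines mathematically (production word2vec replaces it with negative sampling or hierarchical softmax for speed).

```python
import numpy as np

def skipgram_pairs(tokens, c=5):
    """Generate (center, context) training pairs from a token sequence,
    using a symmetric window of size c (c=5 is typical)."""
    pairs = []
    for t in range(len(tokens)):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((tokens[t], tokens[t + j]))
    return pairs

def log_likelihood(pairs, W_in, W_out, vocab):
    """Average log P(context | center) under the full-softmax skip-gram model.
    W_in: center-word embeddings (V x d); W_out: context-word embeddings (V x d)."""
    total = 0.0
    for center, context in pairs:
        scores = W_out @ W_in[vocab[center]]              # (V,) dot products
        log_probs = scores - np.log(np.sum(np.exp(scores)))  # log softmax
        total += log_probs[vocab[context]]
    return total / len(pairs)
```

Training maximizes `log_likelihood` over the corpus by gradient ascent on `W_in` and `W_out`; the rows of `W_in` are the word vectors that get published.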
Geometric Properties
| Analogy Type | Example | Accuracy (word2vec) |
|---|---|---|
| Semantic (capitals) | Paris:France :: Berlin:? Germany | ~91% |
| Semantic (currency) | Dollar:USA :: Euro:? Germany | ~83% |
| Syntactic (plurals) | Car:Cars :: Bus:? Buses | ~87% |
| Syntactic (comparative) | Good:Better :: Big:? Bigger | ~78% |
| Gender | King − Man + Woman ≈ Queen | ~76% |
Embedding Methods Comparison
| Method | Training Signal | Dimensionality | Key Advantage |
|---|---|---|---|
| word2vec (SG) | Local context window | 300 | Fast; captures analogies |
| GloVe | Global co-occurrence | 300 | Global statistical information |
| FastText | Character n-grams | 300 | Handles morphology, OOV words |
| Transformer (learned) | End-to-end task / pre-training loss | d_model | Architecture-native; refined into contextual representations |
Transformer Embedding Layer
In the transformer, the embedding layer E ∈ ℝ^{V×d_model} maps token index t to a vector E[t] ∈ ℝ^{d_model}. Vaswani et al. (2017) multiply these embeddings by √d_model to match the expected scale of the positional encodings:
embedding_input = E[token_id] × √d_model
The embedding matrix (E) is typically shared with the output projection (the unembedding layer), reducing parameter count: with V=50K and d_model=512, this saves 50K × 512 = 25.6M parameters.
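The lookup, √d_model scaling, and weight tying described above can be sketched as follows; this is an illustrative numpy stand-in, not the reference implementation, and the random `E` here is untrained.

```python
import numpy as np

V, d_model = 50_000, 512
rng = np.random.default_rng(0)
# One shared matrix serves as both the embedding table and the output
# projection (weight tying); with V=50K and d_model=512 this avoids a
# second 50K x 512 ≈ 25.6M-parameter matrix.
E = rng.normal(0.0, d_model ** -0.5, size=(V, d_model)).astype(np.float32)

def embed(token_ids):
    """Look up rows of E and scale by sqrt(d_model), as in Vaswani et al.
    (2017), so embedding magnitudes match the positional encodings."""
    return E[token_ids] * np.sqrt(d_model)

def unembed(hidden):
    """Tied output projection: logits over the vocabulary via E^T."""
    return hidden @ E.T

h = embed(np.array([42, 7, 1000]))   # shape (3, d_model)
logits = unembed(h)                  # shape (3, V)
```

In a real transformer, positional encodings are added to `embed`'s output and the attention stack transforms it before `unembed` produces next-token logits.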
Related Pages
See tokenization for how tokens are assigned indices, context-window for how embeddings are positioned in sequence, and pre-training for how modern models learn contextual representations.
Sources
- Mikolov et al. (2013) — Distributed Representations of Words and Phrases. NeurIPS 2013
- Pennington et al. (2014) — GloVe: Global Vectors for Word Representation. EMNLP 2014
- Bojanowski et al. (2017) — Enriching Word Vectors with Subword Information (FastText). TACL 2017
Frequently Asked Questions
Why does vector arithmetic like 'king − man + woman ≈ queen' work?
Word2vec embeddings capture distributional semantics — words appearing in similar contexts have similar vectors. The analogy works because the woman − man difference vector approximates a gender direction in the embedding space, so king + (woman − man) lands near queen. This holds with ~76% accuracy on the 8,869 semantic analogy questions from Mikolov et al. (2013). It is a property that emerges from training, not a design constraint.
How are word embeddings in transformers different from word2vec?
word2vec produces static embeddings — each word has one vector regardless of context. Transformer embeddings (contextual embeddings) are dynamically computed: the same token 'bank' gets different representations in 'river bank' vs 'bank account' because attention layers mix information from surrounding tokens. The embedding table in a transformer provides an initial lookup; the hidden states after each attention layer become progressively more contextual.
What dimension should word embeddings be?
Empirically, performance scales logarithmically with embedding dimension. word2vec's original 300-dim vectors capture most useful structure; increasing to 1000+ dims yields diminishing returns for static embeddings. In transformers, d_model (embedding dimension) is coupled to the rest of the architecture. The original transformer uses d_model=512; larger models use 768, 1024, 2048, 4096, or 8192 dimensions.