Word Embeddings: Distributed Representations, word2vec, and Semantic Geometry

Category: representation · Updated: 2026-02-27

word2vec skip-gram learns 300-dimensional embeddings in which cosine similarity encodes semantics; the vector arithmetic king − man + woman ≈ queen holds with ~76% accuracy (Mikolov et al., 2013); GloVe reaches 75.0% on the semantic word-analogy task (Pennington et al., 2014).

Key Data Points

| Measure | Value | Unit | Notes |
| --- | --- | --- | --- |
| word2vec embedding dimension | 300 | dimensions | Mikolov et al. (2013); cosine similarity captures semantic + syntactic relationships |
| GloVe word analogy accuracy | 75.0 | % accuracy | Pennington et al. (2014), semantic analogy task; 65.5% on the combined benchmark |
| word2vec skip-gram training objective | max P(w_{t±c} \| w_t) | — | Predict context words within window c of the center word; window size c = 5 is typical |
| Typical vocabulary size | 100,000–500,000 | tokens | Embedding matrix shape V × d, e.g., 100K × 300 = 30M parameters |
| Transformer embedding initialization | N(0, d_model^{-0.5}) | — | Vaswani et al. (2017) scale embeddings by √d_model to match the expected input scale of attention |

Word embeddings are the interface between discrete text tokens and continuous vector spaces where neural networks can compute. Before transformers, static embeddings like word2vec and GloVe were the dominant pre-training approach. In modern transformers, the embedding table provides the initial token representation, which is refined by attention layers into contextual representations.

Static Embeddings: word2vec

Mikolov et al. (2013) introduced two architectures for learning word embeddings:

  • Skip-gram: predicts surrounding context words given a center word — better for infrequent words
  • CBOW: predicts center word from surrounding context — faster training

The skip-gram objective maximizes:

(1/T) Σ_{t=1}^{T} Σ_{−c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)

Trained on ~100 billion words from Google News, producing 300-dimensional embeddings for 3 million words.
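The objective above can be made concrete by enumerating the (center, context) pairs it sums over. A minimal sketch (the toy corpus and window size are illustrative, not the original training setup):

```python
def skipgram_pairs(tokens, c=2):
    """Yield (center, context) pairs for every context word
    within a window of size c around each center position."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

corpus = ["the", "king", "wears", "the", "crown"]
pairs = skipgram_pairs(corpus, c=2)
# skip-gram maximizes the sum of log P(context | center) over these pairs
```

In practice the log-probabilities are approximated with negative sampling or hierarchical softmax rather than a full softmax over the vocabulary.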

Geometric Properties

| Analogy Type | Example | Accuracy (word2vec) |
| --- | --- | --- |
| Semantic (capitals) | Paris : France :: Berlin : ? → Germany | ~91% |
| Semantic (currency) | Dollar : USA :: Euro : ? → Germany | ~83% |
| Syntactic (plurals) | Car : Cars :: Bus : ? → Buses | ~87% |
| Syntactic (comparative) | Good : Better :: Big : ? → Bigger | ~78% |
| Gender | King − Man + Woman ≈ Queen | ~76% |
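The gender row can be illustrated with hand-built toy vectors. The numbers below are contrived so the geometry works out exactly; real 300-dimensional word2vec vectors only approximate this:

```python
import numpy as np

# Toy 2-dim "embeddings", chosen so the analogy is exact by construction.
vocab = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.0, 1.0]),
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([0.0, 2.0]),
    "throne": np.array([2.0, 1.0]),   # distractor
    "apple":  np.array([5.0, 0.1]),   # distractor
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then nearest neighbor by cosine,
# excluding the query words themselves (standard evaluation protocol).
target = vocab["king"] - vocab["man"] + vocab["woman"]
candidates = {w: v for w, v in vocab.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Excluding the query words matters: with real embeddings, the nearest neighbor of king − man + woman is often king itself.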

Embedding Methods Comparison

| Method | Training Signal | Dimensionality | Key Advantage |
| --- | --- | --- | --- |
| word2vec (SG) | Local context window | 300 | Fast; captures analogies |
| GloVe | Global co-occurrence counts | 300 | Global statistical information |
| FastText | Character n-grams | 300 | Handles morphology and OOV words |
| Transformer | Pre-training / task-specific fine-tuning | d_model | Architecture-native; contextual |

Transformer Embedding Layer

In the transformer, the embedding layer E ∈ ℝ^{V×d_model} maps token index t to a vector E[t] ∈ ℝ^{d_model}. Vaswani et al. (2017) multiply these embeddings by √d_model to match the expected scale of the positional encodings:

embedding_input = E[token_id] × √d_model

The embedding matrix E is typically shared with the output projection (the unembedding layer), reducing parameter count: with V = 50K and d_model = 512, tying saves 50K × 512 = 25.6M parameters.
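A numpy sketch of the lookup-and-scale step and the tied output projection (toy sizes here; as noted above, real models use values like V = 50K and d_model = 512):

```python
import numpy as np

V, d_model = 10, 8                       # toy vocabulary size and model width
rng = np.random.default_rng(0)
E = rng.normal(0.0, d_model ** -0.5, size=(V, d_model))  # shared embedding matrix

def embed(token_ids):
    """Lookup then scale by sqrt(d_model), as in Vaswani et al. (2017)."""
    return E[token_ids] * np.sqrt(d_model)

def unembed(hidden):
    """Tied output projection: reuse E^T for vocabulary logits,
    so no separate V x d_model output matrix is stored."""
    return hidden @ E.T

h = embed(np.array([3, 7]))              # shape (2, d_model)
logits = unembed(h)                      # shape (2, V)
```

Positional encodings would be added to `h` before the first attention layer; they are omitted here to isolate the embedding step.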

See tokenization for how tokens are assigned indices, context-window for how embeddings are positioned in sequence, and pre-training for how modern models learn contextual representations.



Frequently Asked Questions

Why does vector arithmetic like 'king − man + woman ≈ queen' work?

word2vec embeddings capture distributional semantics: words appearing in similar contexts get similar vectors. The analogy works because woman − man captures a consistent gender offset in the embedding space, so adding that offset to king lands near queen. This holds with ~76% accuracy on the 8,869 semantic analogy questions from Mikolov et al. (2013). It is a property that emerges from training, not a design constraint.

How are word embeddings in transformers different from word2vec?

word2vec produces static embeddings — each word has one vector regardless of context. Transformer embeddings (contextual embeddings) are dynamically computed: the same token 'bank' gets different representations in 'river bank' vs 'bank account' because attention layers mix information from surrounding tokens. The embedding table in a transformer provides an initial lookup; the hidden states after each attention layer become progressively more contextual.
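The static/contextual distinction can be sketched in a few lines: a static table returns the identical vector for 'bank' in both sentences, while even one round of mixing with neighboring tokens (uniform averaging here, a crude stand-in for learned attention) makes the two occurrences diverge:

```python
import numpy as np

rng = np.random.default_rng(1)
table = {w: rng.normal(size=4) for w in ["river", "bank", "account"]}

def static_embed(tokens):
    """word2vec-style: one fixed vector per word, context ignored."""
    return np.stack([table[w] for w in tokens])

def contextual_embed(tokens):
    """One round of uniform mixing over the sequence -- a toy
    stand-in for an attention layer, enough to show context-dependence."""
    X = static_embed(tokens)
    return 0.5 * X + 0.5 * X.mean(axis=0)

static_a = static_embed(["river", "bank"])[1]
static_b = static_embed(["bank", "account"])[0]
ctx_a = contextual_embed(["river", "bank"])[1]
ctx_b = contextual_embed(["bank", "account"])[0]

print(np.allclose(static_a, static_b))  # True: static 'bank' is identical
print(np.allclose(ctx_a, ctx_b))        # False: contextual 'bank' diverges
```

A real transformer replaces the uniform average with learned, query-dependent attention weights and stacks many such layers, but the qualitative effect is the same.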

What dimension should word embeddings be?

Empirically, performance scales logarithmically with embedding dimension. word2vec's original 300-dim vectors capture most useful structure; increasing to 1000+ dims yields diminishing returns for static embeddings. In transformers, d_model (embedding dimension) is coupled to the rest of the architecture. The original transformer uses d_model=512; larger models use 768, 1024, 2048, 4096, or 8192 dimensions.
