Word Embeddings: Complete Guide — Word2Vec, GloVe, FastText
Summary — Word embeddings are dense vector representations of words in a continuous space. Unlike classical representations (one-hot encoding, bag-of-words) that treat each word independently, embeddings capture semantic similarity: words that are close in meaning have close vectors in the space. Word2Vec (2013), GloVe (2014), and FastText (2016) are the three historical methods that revolutionized NLP before the era of Transformers.
Mathematical Principle
1. Skip-gram (Word2Vec)
The objective is to maximize the probability of observing context words around the center word.
Objective function:
$$
\max \prod_{t=1}^{T} \prod_{-c \leq j \leq c, j \neq 0} P(w_{t+j} \mid w_t)
$$
With softmax:
$$
P(w_O \mid w_I) = \frac{\exp(v_{w_O}^{\prime T} \cdot v_{w_I})}{\sum_{w=1}^{V} \exp(v_w^{\prime T} \cdot v_{w_I})}
$$
Where $v_{w_I}$ is the center (input) vector and $v'_{w_O}$ the context (output) vector; each word has one of each. The denominator sums over the whole vocabulary, so evaluating this softmax costs $O(V)$ per training pair. The following optimization techniques are used to avoid that cost:
- Negative Sampling: instead of normalizing over the entire vocabulary, each update uses only the positive context word plus $K$ negative words drawn at random from a noise distribution $P_n(w)$.
- Hierarchical Softmax: binary tree over the vocabulary reducing computation to $O(\log V)$.
With negative sampling, the loss for one (center, context) pair becomes:
$$
\text{Loss} = -\log \sigma(v_{w_O}^{\prime T} v_{w_I}) - \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[\log \sigma(-v_{w_k}^{\prime T} v_{w_I})\right]
$$
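To make the cost difference concrete, here is a minimal sketch in plain Python that evaluates the full softmax probability and the negative sampling loss for one (center, context) pair. The random vectors, the vocabulary size, and the word indices are hypothetical stand-ins, and negatives are drawn uniformly for simplicity (the original Word2Vec implementation samples them from the unigram distribution raised to the power 3/4).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_prob(center, context_vectors, target):
    """P(w_O | w_I) with the full softmax: one score per vocabulary word, O(V)."""
    scores = [dot(c, center) for c in context_vectors]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    return exps[target] / sum(exps)

def negative_sampling_loss(center, positive, negatives):
    """-log sigma(v'_O . v_I) - sum_k log sigma(-v'_k . v_I), using only K + 1 vectors."""
    loss = -math.log(sigmoid(dot(positive, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

# Hypothetical setup: random vectors stand in for a partially trained model.
rng = random.Random(0)
V, dim, K = 1000, 8, 5
center_vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
context_vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]

center_id, context_id = 3, 7
v_in = center_vectors[center_id]

# Full softmax: touches all V context vectors.
p_full = softmax_prob(v_in, context_vectors, context_id)

# Negative sampling: touches the positive vector and K negatives only.
# Assumption: uniform sampling, excluding the positive word itself.
candidates = [i for i in range(V) if i != context_id]
negative_ids = rng.sample(candidates, K)
loss = negative_sampling_loss(
    v_in, context_vectors[context_id], [context_vectors[i] for i in negative_ids]
)

print(f"full softmax probability: {p_full:.6f}")
print(f"negative sampling loss:   {loss:.4f}")
```

The softmax call loops over all 1,000 context vectors, while the negative sampling loss only needs 6 dot products, which is where the speed-up comes from.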
2. CBOW (Continuous Bag of Words)
The inverse of Skip-gram: the center word is predicted from the context.
$$
P(w_I \mid w_{context}) = \text{softmax}(V^{\prime} \cdot \bar{v}_{context})
$$
CBOW is faster than Skip-gram and performs better on frequent words, but Skip-gram captures rare words better.
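The CBOW prediction step can be sketched in a few lines: average the input vectors of the context words, score every vocabulary word against that average, and apply a softmax. The 4-word vocabulary and the 2-dimensional vectors below are made up for illustration; a trained model would learn them.

```python
import math

def cbow_distribution(context_ids, input_vectors, output_vectors):
    """Return P(word | context) for every vocabulary word."""
    dim = len(input_vectors[0])
    # Average of the context words' input vectors (the "bag" in bag-of-words).
    mean = [
        sum(input_vectors[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # One score per vocabulary word: output vector dotted with the context average.
    scores = [sum(o * m for o, m in zip(out, mean)) for out in output_vectors]
    top = max(scores)  # for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy model: vocabulary, input (context) and output (center) vectors.
vocab = ["le", "chat", "dort", "canapé"]
input_vectors = [[0.1, 0.3], [0.8, -0.2], [0.4, 0.9], [-0.5, 0.2]]
output_vectors = [[0.2, 0.1], [0.7, 0.0], [0.3, 0.8], [-0.4, 0.3]]

# Context made of "le" and "dort" (indices 0 and 2).
probs = cbow_distribution([0, 2], input_vectors, output_vectors)
predicted = vocab[probs.index(max(probs))]

for word, p in zip(vocab, probs):
    print(f"{word:8s} {p:.3f}")
print("predicted center word:", predicted)
```

Because the context vectors are averaged, word order inside the window is ignored, which is one reason CBOW trains faster than Skip-gram.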
3. GloVe (Global Vectors)
GloVe combines the advantages of matrix factorization and local context learning.
$$
J = \sum_{i,j} f(X_{ij}) \left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2
$$
Where $X_{ij}$ is the number of co-occurrences between words $i$ and $j$, and $f(x)$ is a weighting that limits the influence of very frequent co-occurrences:
- $f(x) = (x/x_{max})^\alpha$ if $x < x_{max}$
- $f(x) = 1$ otherwise
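As a rough illustration, the weighting function and the objective $J$ can be written directly from the formulas above. The defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper and are used here as assumptions; the tiny co-occurrence matrix and the vectors are invented.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows with the co-occurrence count, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    loss = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue  # f(0) = 0, so pairs that never co-occur contribute nothing
            pred = sum(a * c for a, c in zip(W[i], W_tilde[j])) + b[i] + b_tilde[j]
            loss += glove_weight(x_ij) * (pred - math.log(x_ij)) ** 2
    return loss

# Hypothetical 2-word vocabulary with 2-dimensional vectors.
X = [[0, 12], [12, 250]]              # co-occurrence counts
W = [[0.5, 0.1], [0.3, 0.9]]          # word vectors
W_tilde = [[0.4, 0.2], [0.1, 0.8]]    # context vectors
b = [0.0, 0.1]
b_tilde = [0.0, 0.2]

print(f"f(12)  = {glove_weight(12):.3f}")
print(f"f(250) = {glove_weight(250):.3f}")
print(f"J      = {glove_loss(X, W, W_tilde, b, b_tilde):.4f}")
```

Rare pairs get a small weight, and any pair at or above `x_max` gets the same weight of 1, so very frequent pairs do not dominate the sum.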
4. FastText (subwords / character n-grams)
FastText improves Word2Vec by representing each word as the sum of its subwords (character n-grams).
$$
v_w = \sum_{g \in G_w} z_g
$$
Where $G_w$ is the set of character n-grams of word $w$. Example for “chat” with $n=3$: the word is first wrapped in boundary markers, giving “<chat>”, whose 3-grams are “<ch”, “cha”, “hat”, and “at>”. This allows handling out-of-vocabulary (OOV) words and sharing information between morphologically related words (French “marcher”, “marchait”, “marchant”: to walk, was walking, walking).
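The n-gram extraction itself is easy to sketch. The helper below is a hypothetical illustration, not FastText's actual code: the real implementation typically uses n-gram lengths from 3 to 6, also keeps the whole word as its own token, and hashes the n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

print(char_ngrams("chat"))  # ['<ch', 'cha', 'hat', 'at>']

# Morphologically related words share many n-grams, hence part of their vector.
shared = set(char_ngrams("marcher")) & set(char_ngrams("marchait"))
print(sorted(shared))  # ['<ma', 'arc', 'mar', 'rch']
```

Since the word vector is the sum of its n-gram vectors, the shared n-grams pull “marcher” and “marchait” toward each other even if one of them is rare.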
5. Cosine Similarity
To compare two word embeddings, cosine similarity is used rather than Euclidean distance:
$$
\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}
$$
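A short sketch shows why cosine is usually preferred: it depends only on the direction of the vectors, not on their length. The vectors below are arbitrary examples.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length

print(f"cosine similarity:  {cosine_similarity(u, v):.4f}")   # 1.0000
print(f"euclidean distance: {euclidean_distance(u, v):.4f}")  # 3.7417
```

The two vectors point the same way, so their cosine similarity is 1 even though the Euclidean distance between them is far from 0. In trained embeddings, vector length tends to vary with word frequency, which is why direction is the more reliable signal.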
Intuition
Imagine a huge geographic map where each word is a point.
On this map:
- “King” and “queen” are neighbors, like Paris and Lyon
- “Apple” and “pear” are in the “fruits” neighborhood
- “Sad” and “melancholy” live on the same street of emotions
The magic? This map was not drawn by hand. It was learned automatically by reading billions of words and observing which words appear together. “King” and “queen” are neighbors because they appear in similar contexts: “the king decreed” ~ “the queen decreed.”
Even better: the geometry of this map enables algebraic analogies. The operation vector(king) – vector(man) + vector(woman) yields a vector very close to vector(queen). It is as if the meaning of words had become computable.
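The analogy arithmetic can be tried on a toy example. The 2-dimensional vectors below are hand-made for illustration (first axis roughly “royalty”, second axis roughly “gender”); real embeddings have hundreds of dimensions with no such clean labels.

```python
import math

# Hypothetical hand-crafted vectors, not learned from data.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Subtracting “man” and adding “woman” flips the second axis while leaving the first untouched, which lands on “queen”.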
Python Implementation
1. Word2Vec with Gensim
from gensim.models import Word2Vec

# Toy corpus (French): far too small for meaningful vectors, it only demonstrates the API
sentences = [
    ["le", "chat", "dort", "sur", "le", "canapé"],
    ["le", "chien", "court", "dans", "le", "jardin"],
    ["la", "reine", "vit", "dans", "le", "château"],
    ["le", "roi", "gouverne", "le", "royaume"],
    ["le", "chat", "et", "le", "chien", "sont", "des", "animaux"],
    ["la", "femme", "du", "roi", "est", "une", "reine"],
    # Added so that "homme" is in the vocabulary for the analogy below
    ["un", "homme", "et", "une", "femme", "vivent", "dans", "le", "royaume"],
]

# Skip-gram training
model = Word2Vec(
    sentences,
    vector_size=100,
    window=3,
    min_count=1,
    sg=1,  # 1 = Skip-gram, 0 = CBOW
    epochs=100,
    negative=5,
)

# Words closest to "chat"
print(model.wv.most_similar("chat", topn=3))

# Analogy: roi - homme + femme
# (every word must be in the vocabulary, otherwise a KeyError is raised)
similar = model.wv.most_similar(
    positive=["roi", "femme"],
    negative=["homme"],
    topn=1,
)
print(f"Roi - Homme + Femme = {similar[0][0]}")
2. Similarity and analogies
# Cosine similarity between two words
sim = model.wv.similarity("roi", "reine")
print(f"Similarity roi-reine: {sim:.4f}")

# Cosine similarity between the mean vectors of two word sets
# (all words must be in the vocabulary)
set_sim = model.wv.n_similarity(["roi", "reine"], ["chat", "chien"])
print(f"Set similarity: {set_sim:.4f}")

# Word that doesn't match the group
# (words missing from the vocabulary are ignored, so only known words are used)
odd = model.wv.doesnt_match(["chat", "chien", "château"])
print(f"Odd one out: {odd}")  # ideally "château", but not guaranteed on a toy corpus
3. FastText with subwords
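from gensim.models import FastText

# Reuses the tokenized `sentences` corpus defined in the Word2Vec example
ft_model = FastText(
    sentences,
    vector_size=100,
    window=3,
    min_count=1,
    sg=1,
    epochs=100,
)

# Even an out-of-vocabulary word has a vector,
# because FastText builds it from character n-grams
oov_word = "chaton"  # not in the corpus, but shares n-grams with "chat"
print(oov_word in ft_model.wv.key_to_index)  # False: the word was never seen
vec = ft_model.wv.get_vector(oov_word)
print(f"Vector '{oov_word}' shape: {vec.shape}")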
4. 2D Visualization with t-SNE
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Extract vectors for all words
words = list(model.wv.index_to_key)
vectors = np.array([model.wv[word] for word in words])

# 2D reduction with t-SNE (perplexity must be smaller than the number of words)
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
vectors_2d = tsne.fit_transform(vectors)

# Visualization
plt.figure(figsize=(10, 8))
for i, word in enumerate(words):
    plt.scatter(vectors_2d[i, 0], vectors_2d[i, 1])
    plt.annotate(word, (vectors_2d[i, 0], vectors_2d[i, 1]))
plt.title("Word Embeddings 2D (t-SNE)")
plt.show()
Hyperparameters
| Hyperparameter | Typical Value | Description |
|---|---|---|
| vector_size | 100-300 | Vector dimension (more = more nuance, but slower) |
| window | 3-10 | Context window size (small = syntax, large = topic) |
| min_count | 1-5 | Ignore words appearing fewer than X times |
| sg | 0 or 1 | 0 = CBOW (fast), 1 = Skip-gram (better for rare words) |
| epochs | 5-100 | Number of passes over the corpus (more for small corpora) |
| negative | 5-20 | Number of negative samples for negative sampling |
| hs | 0 or 1 | 1 = use hierarchical softmax instead of negative sampling |
Advantages of Word Embeddings
- Dramatic dimensionality reduction: a vocabulary of 50,000 words goes from a sparse 50,000-dimensional one-hot encoding to a dense 300-dimensional vector.
- Capturing semantic relationships: word similarities and analogies are encoded in vector geometry.
- Handling the unknown (FastText): subwords allow representing words never seen during training.
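- Lightweight and fast: a 300D embedding table for 100,000 words weighs ~120 MB in float32, and retrieving a vector is a simple table lookup with no forward pass through a network. By comparison, BERT-base has about 110 million parameters (roughly 440 MB in float32) and must run the full model on every input.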
- Interpretable: unlike BERT representations which are contextual (sentence-dependent), pre-trained embeddings offer a fixed and analyzable representation of each word.
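The size figures quoted above are easy to check by hand. The helper below is a small illustrative calculation, assuming 4-byte float32 values and 1 MB = 10^6 bytes.

```python
def embedding_table_megabytes(vocab_size, dim, bytes_per_value=4):
    """Size of a dense embedding table in MB (float32 by default)."""
    return vocab_size * dim * bytes_per_value / 1e6

size_mb = embedding_table_megabytes(100_000, 300)
print(f"{size_mb:.0f} MB")  # 120 MB

# One-hot vs dense: dimensions needed to represent a single word.
one_hot_dims, dense_dims = 50_000, 300
print(f"reduction factor: {one_hot_dims / dense_dims:.0f}x")  # 167x
```

Storing the table in float16 would halve the footprint, at the cost of some precision.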
Limitations of Word Embeddings
- Static representation: “bank” has the same vector in “river bank” and “bank account.” Contextual models such as BERT address this by computing a different vector for each occurrence of a word.
- Corpus size dependence: good general-purpose embeddings are typically trained on hundreds of millions to billions of words. A corpus of 10,000 sentences is unlikely to produce meaningful embeddings.
- No deep syntax handling: embeddings primarily capture statistical associations, not grammar or logic.
- Polysemy: a single vector per word cannot represent all of its possible meanings.
4 Concrete Use Cases
1. Document Search (Semantic Engine)
A news website uses Word2Vec to represent article titles as averages of word vectors. When a user reads an article, the most similar articles in the embedding space are found. Unlike an exact keyword-based engine, it works even when articles use different vocabulary but talk about the same topic.
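A minimal sketch of this idea, assuming a dictionary of pre-trained word vectors: each title becomes the average of its word vectors, and titles are ranked by cosine similarity to the query. The 2-dimensional vectors and the titles below are invented for illustration; in practice the vectors would come from a trained Word2Vec model.

```python
import math

# Hypothetical word vectors (axis 1 ~ sports, axis 2 ~ finance).
word_vectors = {
    "football": [0.90, 0.10], "match": [0.80, 0.20],
    "soccer": [0.85, 0.15], "game": [0.80, 0.10],
    "stocks": [0.10, 0.90], "market": [0.20, 0.80],
}

def average_vector(tokens):
    """Mean of the vectors of known tokens, or None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_titles(query_tokens, titles):
    """Titles sorted from most to least similar to the query."""
    query_vec = average_vector(query_tokens)
    scored = []
    for title in titles:
        title_vec = average_vector(title)
        if query_vec is not None and title_vec is not None:
            scored.append((cosine(query_vec, title_vec), " ".join(title)))
    return sorted(scored, reverse=True)

titles = [["stocks", "market"], ["football", "match"]]
ranked = rank_titles(["soccer", "game"], titles)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

The query shares no word with “football match”, yet that title ranks first because its average vector points in nearly the same direction. Averaging discards word order, so this works best on short texts such as titles.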
2. Sentiment Analysis in French
A company trains a linear classifier on top of French pre-trained FastText embeddings (Facebook’s cc.fr model) to classify customer reviews as positive or negative. FastText is well suited here because its subword vectors tolerate the spelling errors and morphological variants that are common in online reviews.
3. Semantic Plagiarism Detection
A university system converts student submissions into average embedding vectors and computes cosine similarity. Unlike classic copy-paste detection, this approach also detects paraphrases where words have been replaced by synonyms.
4. E-Commerce Product Recommendation
Product descriptions are encoded with Word2Vec embeddings. When a customer views a product, products whose descriptions are semantically similar are recommended. The approach works particularly well for products that don’t share an explicit category but are functionally similar.
Conclusion
Word embeddings were the first revolution in deep NLP before the arrival of Transformers. Word2Vec, GloVe, and FastText demonstrated that the meaning of words could be learned automatically from raw data, without manual annotation.
Even though the contextual representations of BERT and GPT have surpassed them in performance, pre-trained embeddings remain relevant for lightweight applications, resource-constrained environments, or as an input layer for deep models.

