t-SNE: Complete Guide — High-Dimensional Data Visualization
Summary — t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction algorithm developed by Laurens van der Maaten and Geoffrey Hinton in 2008. Unlike PCA which preserves global variance, t-SNE preserves local neighborhoods: points that are close in high dimensions remain close in 2D/3D, while distant points are pushed apart. It is the reference tool for visualizing clusters and complex structures in high-dimensional data (images, text, genomics).
Mathematical Principle
1. High-dimensional similarities
t-SNE begins by computing, for each pair of points (x_i, x_j), a conditional probability that measures their similarity:
$$p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / (2\sigma_i^2))}$$
where σ_i is a bandwidth parameter that depends on the local density around x_i. In dense regions, σ_i is small; in sparse regions, it is large.
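As a rough illustration of this formula, the sketch below computes the matrix of conditional probabilities with NumPy for bandwidths that are simply given. The function name `conditional_probabilities`, the synthetic data, and the choice σ_i = 1 for every point are assumptions made for the example, not part of any library.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return an (n, n) matrix whose row i holds p_{j|i}, with p_{i|i} = 0."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip rounding noise below zero

    # Exponent of the Gaussian kernel, one bandwidth sigma_i per row
    logits = -sq_dists / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)             # drop the k = i term
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials

    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))          # 6 synthetic points in 4 dimensions
P_cond = conditional_probabilities(X_demo, sigmas=np.ones(6))
print(P_cond.sum(axis=1))                 # every row is a probability distribution
```

Each row is normalized separately, which is why p_{j|i} and p_{i|j} generally differ.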
The value of σ_i is calibrated, typically by binary search, so that the perplexity (a measure of effective neighborhood size) takes the same user-chosen value for every point:
$$\text{Perplexity}(P_i) = 2^{H(P_i)}$$
where H(P_i) = -Σ_j p_{j|i} · log₂(p_{j|i}) is the Shannon entropy of the distribution around point i.
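One common way to find a bandwidth that hits a target perplexity is bisection, since the perplexity grows monotonically with σ_i. The sketch below shows the idea for a single point; the helper names `row_perplexity` and `calibrate_sigma`, the search bounds, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def row_perplexity(sq_dists, sigma):
    """Perplexity 2^H of p_{.|i}, given squared distances from x_i to the others."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()
    p = np.exp(logits)
    p /= p.sum()
    p = p[p > 0]                                  # 0 * log(0) is taken as 0
    return 2.0 ** (-np.sum(p * np.log2(p)))

def calibrate_sigma(sq_dists, target, n_steps=100):
    """Bisection on log(sigma): perplexity grows monotonically with sigma."""
    lo, hi = 1e-8, 1e8
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)                    # geometric midpoint
        if row_perplexity(sq_dists, mid) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
sq_dists_0 = np.sum((X_demo[1:] - X_demo[0]) ** 2, axis=1)  # distances from point 0
sigma_0 = calibrate_sigma(sq_dists_0, target=10.0)
print(sigma_0, row_perplexity(sq_dists_0, sigma_0))
```

Repeating the search for every point gives each one its own σ_i, smaller in dense regions and larger in sparse ones.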
The conditional probabilities are then symmetrized:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
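The division by 2n makes the p_ij sum to 1 over all pairs, because each of the n rows of conditional probabilities sums to 1. A small check of that property, using a random row-stochastic matrix as a stand-in for real conditional probabilities (an assumption for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Stand-in for p_{j|i}: random non-negative rows, zero diagonal, each row sums to 1
A = rng.random((n, n))
np.fill_diagonal(A, 0.0)
P_cond = A / A.sum(axis=1, keepdims=True)

# Symmetrization as in the formula above
P = (P_cond + P_cond.T) / (2.0 * n)

print(P.sum())               # total mass over all pairs
print(np.allclose(P, P.T))   # p_ij equals p_ji
print(P.sum(axis=1))         # every point keeps at least 1 / (2n) of the mass
```

The last line matters for outliers: even a point that no other point regards as a neighbor still contributes to the cost function.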
2. Low-dimensional similarities (2D/3D)
In the projection space (y_i, y_j), t-SNE uses a Student’s t-distribution with one degree of freedom (Cauchy distribution):
$$q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}$$
The choice of the t-distribution is crucial: its heavy tails make q_ij decay much more slowly with distance than the Gaussian kernel used for p_ij. As a result, points that are moderately far apart in high dimensions can be placed even farther apart in the projection. This relieves the crowding of points in the center of the map and leaves visible gaps between clusters.
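To make the difference in tails concrete, the sketch below computes the q_ij for a small random layout and then compares the two unnormalized kernels at a few distances. The function name `low_dim_affinities` and the sample distances are assumptions for illustration.

```python
import numpy as np

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities q_ij, normalized over all pairs."""
    diff = Y[:, None, :] - Y[None, :, :]
    weights = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
    np.fill_diagonal(weights, 0.0)      # pairs with k = l are excluded
    return weights / weights.sum()

rng = np.random.default_rng(0)
Q = low_dim_affinities(rng.normal(size=(8, 2)))

# Unnormalized kernels at a few distances: the Gaussian collapses, the t-kernel does not
distances = np.array([1.0, 3.0, 10.0])
gaussian_kernel = np.exp(-distances ** 2 / 2.0)
student_kernel = 1.0 / (1.0 + distances ** 2)
for d, g, s in zip(distances, gaussian_kernel, student_kernel):
    print(f"d={d:4.1f}  gaussian={g:.2e}  student-t={s:.2e}")
```

Unlike the p_{j|i}, the q_ij share a single normalization over all pairs, so Q is a joint distribution from the start.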
3. KL Divergence minimization
The goal of t-SNE is to make the distribution Q (low dimension) as close as possible to the distribution P (high dimension), by minimizing the Kullback-Leibler divergence:
$$\text{KL}(P || Q) = \sum_{i \neq j} p_{ij} \cdot \log\left(\frac{p_{ij}}{q_{ij}}\right)$$
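A minimal sketch of this cost in NumPy follows. The helper `random_joint` is hypothetical and only produces test distributions. The last lines compare the contribution of a single pair in both mismatch directions, which suggests why the cost cares far more about keeping neighbors together than about keeping strangers apart.

```python
import numpy as np

def kl_divergence(P, Q):
    """KL(P || Q) summed over the pairs with p_ij > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def random_joint(rng, n):
    """Hypothetical helper: a random symmetric joint distribution with zero diagonal."""
    A = rng.random((n, n))
    A = A + A.T
    np.fill_diagonal(A, 0.0)
    return A / A.sum()

rng = np.random.default_rng(0)
P, Q = random_joint(rng, 6), random_joint(rng, 6)
print(kl_divergence(P, P), kl_divergence(P, Q))

# Contribution of a single pair to the sum, in both mismatch directions
neighbors_placed_far = 0.1 * np.log(0.1 / 0.001)     # large p_ij, small q_ij
strangers_placed_near = 0.001 * np.log(0.001 / 0.1)  # small p_ij, large q_ij
print(neighbors_placed_far, strangers_placed_near)
```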
The gradient with respect to each point y_i is:
$$\frac{\partial \text{KL}}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \cdot (y_i - y_j) \cdot (1 + ||y_i - y_j||^2)^{-1}$$
This gradient contains two forces:
- Attraction when p_ij > q_ij: points that are close in high dimensions are brought together.
- Repulsion when q_ij > p_ij: distant points are pushed apart.
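The gradient formula can be checked numerically. The sketch below implements it in NumPy and compares it with central finite differences of the KL cost on a small random problem. All function names and the random test data are assumptions for the example.

```python
import numpy as np

def student_t(Y):
    diff = Y[:, None, :] - Y[None, :, :]               # y_i - y_j, shape (n, n, d)
    weights = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))  # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum(), weights, diff

def kl_cost(P, Y):
    Q, _, _ = student_t(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def kl_gradient(P, Y):
    Q, weights, diff = student_t(Y)
    coeff = (P - Q) * weights     # under descent: positive entries attract, negative repel
    return 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)

rng = np.random.default_rng(0)
n = 7
A = rng.random((n, n))
A = A + A.T
np.fill_diagonal(A, 0.0)
P = A / A.sum()                   # random symmetric joint distribution
Y = rng.normal(size=(n, 2))

analytic = kl_gradient(P, Y)
numeric = np.zeros_like(Y)
h = 1e-6
for i in range(n):
    for k in range(2):
        step = np.zeros_like(Y)
        step[i, k] = h
        numeric[i, k] = (kl_cost(P, Y + step) - kl_cost(P, Y - step)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```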
4. Early Exaggeration
During the first iterations, t-SNE multiplies the p_ij by a factor (typically 4–12). This strengthens the attraction between close neighbors and allows clusters to form more quickly before repulsion sets in.
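The sketch below puts the pieces together in a deliberately simplified optimization loop with an early exaggeration phase. It uses plain gradient descent without momentum, a fixed bandwidth instead of a perplexity search, and synthetic two-blob data. The exaggeration factor of 4, the step size, and the iteration counts are assumptions chosen to keep this toy problem stable, not recommended settings.

```python
import numpy as np

def student_t(Y):
    diff = Y[:, None, :] - Y[None, :, :]
    weights = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum(), weights, diff

def kl_cost(P, Y):
    Q, _, _ = student_t(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of 15 points each in 5 dimensions
X_demo = np.vstack([rng.normal(0.0, 1.0, (15, 5)), rng.normal(8.0, 1.0, (15, 5))])
n = X_demo.shape[0]

# Joint probabilities P with a fixed bandwidth of 2 (no perplexity search, for brevity)
sq_dists = np.sum((X_demo[:, None, :] - X_demo[None, :, :]) ** 2, axis=2)
affinities = np.exp(-sq_dists / (2.0 * 2.0 ** 2))
np.fill_diagonal(affinities, 0.0)
P_cond = affinities / affinities.sum(axis=1, keepdims=True)
P = (P_cond + P_cond.T) / (2.0 * n)

Y = rng.normal(scale=1e-4, size=(n, 2))   # small random start
cost_start = kl_cost(P, Y)

exaggeration, n_exaggerated, n_total, step_size = 4.0, 50, 500, 1.0
for it in range(n_total):
    # P is multiplied by the exaggeration factor only during the first iterations
    factor = exaggeration if it < n_exaggerated else 1.0
    Q, weights, diff = student_t(Y)
    coeff = (factor * P - Q) * weights
    grad = 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)
    Y -= step_size * grad

cost_end = kl_cost(P, Y)
print(cost_start, cost_end)
```

Production implementations add momentum and adaptive step sizes on top of this loop, which is one reason their learning rates are not comparable with the step size used here.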
Intuition
Imagine you have thousands of photographs of faces and you want to arrange them on a two-meter by one-meter wall. Each photo has thousands of pixels (high dimension), but you only have a 2D plane.
PCA would place the photos from left to right according to overall lighting. Light faces on one side, dark ones on the other. It’s global but not very useful.
t-SNE, on the other hand, says: “Faces that look alike should be side by side, and faces that are very different should be at opposite ends of the wall.” Result: you see groups form — smiles together, frontal faces together, profiles together — without being told what a smile is.
The magnet and spring analogy: t-SNE imagines that each high-dimensional point is connected to all others by springs. The closer two points are in high dimensions, the stronger the spring. In the 2D projection, the points move until equilibrium: neighbors stay together, distant ones are pushed apart. The t-distribution acts like a spring that never stops pushing — even at great distances.
Python Implementation
Example 1: Basic t-SNE with scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
# Data: 64 pixels per handwritten digit
digits = load_digits()
X = digits.data # (1797, 64)
y = digits.target # 10 classes (0-9)
# t-SNE in 2D
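# The iteration count is left at its default (1000); the keyword is n_iter in
# older scikit-learn releases and max_iter from version 1.5 onward
tsne = TSNE(n_components=2, perplexity=30, random_state=42)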
X_embedded = tsne.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Projected shape: {X_embedded.shape}")
# Visualization
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y,
                     cmap='tab10', alpha=0.8, s=50, edgecolors='black', linewidth=0.5)
plt.colorbar(scatter, ticks=range(10), label='Digit')
ax.set_title('t-SNE: Handwritten Digits (perplexity=30)')
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
plt.tight_layout()
plt.savefig('tsne_digits.png', dpi=150)
Example 2: Impact of Perplexity
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
perplexities = [5, 15, 30, 50]
for ax, perp in zip(axes.ravel(), perplexities):
    # One embedding per perplexity value; iteration count left at its default (1000)
    tsne_p = TSNE(n_components=2, perplexity=perp, random_state=42)
    embedded_p = tsne_p.fit_transform(X)
    sc = ax.scatter(embedded_p[:, 0], embedded_p[:, 1], c=y,
                    cmap='tab10', alpha=0.8, s=30)
    ax.set_title(f'Perplexity = {perp}')
    ax.set_xticks([])
    ax.set_yticks([])
plt.suptitle('Impact of perplexity on t-SNE visualization')
plt.tight_layout()
plt.savefig('tsne_perplexity.png', dpi=150)
# Observations:
print("Low perplexity (5): small local clusters, fragmentation")
print("Medium perplexity (30): good balance, coherent groups")
print("High perplexity (50): global clusters, loss of fine details")
Example 3: Comparison with PCA
from sklearn.decomposition import PCA
# PCA in 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', alpha=0.7, s=40)
axes[0].set_title(f'PCA — Explained variance: {pca.explained_variance_ratio_.sum()*100:.1f}%')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[1].scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap='tab10', alpha=0.7, s=40)
axes[1].set_title('t-SNE — Stochastic Neighbor Embedding')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
plt.suptitle('PCA vs t-SNE on handwritten digits')
plt.tight_layout()
plt.savefig('pca_vs_tsne.png', dpi=150)
Example 4: 3D t-SNE
from mpl_toolkits.mplot3d import Axes3D
# Iteration count left at its default (1000); see the note on n_iter / max_iter
tsne_3d = TSNE(n_components=3, perplexity=30, random_state=42)
X_3d = tsne_3d.fit_transform(X)
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
sc = ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y,
                cmap='tab10', alpha=0.7, s=30)
plt.colorbar(sc, ticks=range(10))
ax.set_title('3D t-SNE: Handwritten Digits')
plt.tight_layout()
plt.savefig('tsne_3d.png', dpi=150)
Hyperparameters
| Hyperparameter | Typical value | Description |
|---|---|---|
| n_components | 2 or 3 | Dimension of the projection space. 2 for flat plots, 3 for rotatable 3D views |
| perplexity | 5–50 (default 30) | Effective neighborhood size. Higher = more global view. Generally 5–50 for datasets < 10,000 points |
| learning_rate | 10–1000 (default 200 in older scikit-learn, 'auto' since version 1.2) | Optimization step size. If points collapse, increase it. If unstable, reduce it |
| early_exaggeration | 4–12 (default 12) | Multiplier for p_ij during the first iterations. Strengthens cluster separation |
| n_iter (max_iter since scikit-learn 1.5) | 250–5000 (default 1000) | Number of iterations. 250 for exploration, 1000+ for refinement |
| init | 'random' or 'pca' (default 'pca' since scikit-learn 1.2) | Initialization. 'pca' often gives better results and converges faster |
| method | 'barnes_hut' or 'exact' | 'barnes_hut' for large datasets (O(n·log(n))), 'exact' (O(n²)) for maximum precision |
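As a sketch of how these hyperparameters map onto scikit-learn's `TSNE` constructor, the example below sets most of them explicitly on a small subset of the digits data. The subset size and the specific values are assumptions for illustration. The iteration count is left at its default because its keyword name depends on the installed version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X_small = load_digits().data[:300]   # small subset so the sketch runs quickly

tsne_cfg = TSNE(
    n_components=2,           # 2D map
    perplexity=30,            # effective neighborhood size
    learning_rate=200.0,      # explicit step size instead of the version-dependent default
    early_exaggeration=12.0,  # multiplier for p_ij in the first iterations
    init='pca',               # start from the first two principal components
    method='barnes_hut',      # approximate gradient, O(n log n)
    random_state=42,
)
X_small_2d = tsne_cfg.fit_transform(X_small)

# kl_divergence_ holds the final value of the cost after optimization
print(X_small_2d.shape, tsne_cfg.kl_divergence_)
```

Comparing `kl_divergence_` across runs with the same perplexity is one way to pick among several random restarts. Values obtained with different perplexities are not directly comparable because P itself changes.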
Advantages of t-SNE
- Preservation of nonlinear structures: Unlike PCA which only captures linear variance, t-SNE reveals complex structures (clusters, loops, branches) that linear methods miss.
- Visually excellent results: t-SNE visualizations are often impressive: clusters separate clearly, subgroups emerge naturally. It is the preferred tool for visual data exploration.
- No assumption about data shape: t-SNE assumes neither linearity nor Gaussian distribution. It works on any data where a distance is defined.
- Reveals local topology: t-SNE excels at preserving close neighborhoods, which is exactly what you want to see in cluster visualization.
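Neighborhood preservation can also be measured rather than judged by eye. One option is the `trustworthiness` score in `sklearn.manifold`, which lies between 0 and 1 and penalizes points that appear as neighbors in the embedding without being neighbors in the original space. The sketch below compares PCA and t-SNE on a subset of the digits data. The subset size and `n_neighbors=10` are assumptions for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X_small = load_digits().data[:300]   # small subset so the sketch runs quickly

emb_pca = PCA(n_components=2).fit_transform(X_small)
emb_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_small)

# Score in [0, 1]; higher means the embedding's neighbors are true neighbors
t_pca = trustworthiness(X_small, emb_pca, n_neighbors=10)
t_tsne = trustworthiness(X_small, emb_tsne, n_neighbors=10)
print(f"PCA: {t_pca:.3f}  t-SNE: {t_tsne:.3f}")
```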
Limitations of t-SNE
- Does not preserve global distances: The distance between two clusters on a t-SNE visualization has no intrinsic meaning. You cannot say “this cluster is farther than that one” in a quantitative way.
- Non-deterministic results: Each run (with a different seed) can produce radically different visualizations, even if local clusters remain consistent.
- High computational cost: Barnes-Hut is O(n·log(n)) but remains slow on millions of points. The exact algorithm is O(n²) and impractical beyond 10,000 points.
- No out-of-sample projection: Unlike PCA, where a new point can be transformed without recomputing the decomposition, t-SNE learns no mapping from the input space to the embedding. scikit-learn's TSNE exposes fit_transform but no transform method, so adding new data means rerunning the whole optimization.
- Sensitivity to hyperparameters: Perplexity and learning rate must be tuned for each dataset. A bad choice can produce a misleading visualization (everything grouped together, or everything scattered).
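Two of these limitations are easy to observe directly with scikit-learn. The sketch below first checks which estimators expose a `transform` method for unseen data, then embeds the same subset twice with random initialization and different seeds. The subset size and seed values are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_small = load_digits().data[:200]   # small subset so the sketch runs quickly

# PCA learns a reusable linear map; TSNE only returns coordinates for the fitted points
pca_has_transform = hasattr(PCA(), 'transform')
tsne_has_transform = hasattr(TSNE(), 'transform')
print(pca_has_transform, tsne_has_transform)

# Same data, same settings, different seeds with random initialization
emb_a = TSNE(n_components=2, init='random', random_state=0).fit_transform(X_small)
emb_b = TSNE(n_components=2, init='random', random_state=1).fit_transform(X_small)
print(np.abs(emb_a - emb_b).max())   # coordinates differ from run to run
```

Fixing `random_state`, or using `init='pca'`, makes plots easier to reproduce. It does not make the absolute coordinates any more meaningful.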
4 Concrete Use Cases
1. Word Embedding Visualization (NLP)
In natural language processing, t-SNE is used to visualize Word2Vec or GloVe embeddings: each word is a point in 300 dimensions, and t-SNE projects it into 2D. We then discover that semantically close words (king/queen, man/woman, Paris/France) naturally group together in clusters.
2. Customer Cluster Exploration
Customer data (purchasing behavior, demographics, engagement) is often in 50–200 dimensions after feature engineering. t-SNE projects this data into 2D to visually reveal natural customer segments, validating or invalidating K-Means results.
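A minimal sketch of that workflow on synthetic data follows. The 50 features come from `make_blobs` and stand in for engineered customer features, and the choice of 4 segments is an assumption for the example rather than a property of any real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 300 customers described by 50 engineered features
features, _ = make_blobs(n_samples=300, n_features=50, centers=4, random_state=0)
features = StandardScaler().fit_transform(features)

# Segments are computed in the original feature space, not on the 2D map
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# The 2D embedding is used only to look at the segments
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# Position of each K-Means segment on the t-SNE map
for label in np.unique(segments):
    center = embedding[segments == label].mean(axis=0)
    print(f"segment {label}: {np.sum(segments == label)} points, map center {center.round(1)}")
```

Coloring a scatter plot of `embedding` by `segments` then shows whether each K-Means segment occupies its own region of the map or is spread across several.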
3. Genomic Data Analysis
In genomics, gene expression data typically has 20,000 features (one per gene). t-SNE reveals cell subtypes and patient clusters that correspond to distinct molecular profiles, used in oncology for tumor classification.
4. Autoencoder Latent Space Visualization
When an autoencoder is trained on images, its latent space (bottleneck) is a compressed space where each dimension encodes an abstract feature. t-SNE projects this space into 2D to verify that similar images are indeed close in the learned representation.
Conclusion
t-SNE is the essential tool for seeing what is happening in high-dimensional data. Its ability to reveal nonlinear structures and clusters invisible to PCA makes it a standard in every data scientist’s toolbox.
But be careful: t-SNE is an exploration tool, not a modeling tool. Its visualizations are beautiful but should be interpreted with caution. Distances between clusters have no absolute meaning, and results depend on the hyperparameters used.
For very large datasets, UMAP is a faster alternative that often preserves more of the global structure. But for pure visualization quality, t-SNE remains the reference.