t-SNE: Principles, Examples, and Python Implementation

t-SNE : Guide Complet — Principes, Exemples et Implémentation Python

t-SNE: Complete Guide — Principles, Examples, and Python Implementation

Summary

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction algorithm designed specifically for visualizing high-dimensional data. Unlike PCA, which preserves global distances, t-SNE aims to preserve the local neighborhood structure: points that are close in the original space remain close after projection into 2D or 3D. This property makes it the reference tool for exploring complex datasets such as images, embedded text, or genomic profiles.

Mathematical Principle

The operation of t-SNE is based on an elegant sequence of probabilistic transformations.

Step 1: Gaussian similarities in high dimension

For each data point $x_i$, t-SNE defines a Gaussian distribution centered on that point. The conditional probability $p_{j|i}$ that point $x_j$ is considered a neighbor of $x_i$ is given by:

$$p_{j|i} = \frac{\exp(-|x_i – x_j|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-|x_i – x_k|^2 / 2\sigma_i^2)}$$

where $\sigma_i$ is the variance (or width) of the Gaussian centered on $x_i$. This parameter is not set arbitrarily: it is automatically determined to match the perplexity specified by the user.

Perplexity is defined as:

$$\text{Perp}(P_i) = 2^{H(P_i)}$$

where $H(P_i)$ is the Shannon entropy of the distribution $P_i = {p_{j|i}}_j$. A perplexity of 30 means that each point considers approximately 30 significant neighbors. It is an intuitive way to control the size of the local neighborhood without explicitly specifying the number of neighbors.

Next, these probabilities are symmetrized to obtain the joint distribution:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

Step 2: Student’s t similarities in low dimension

In the embedding space (usually 2D or 3D), an analogous distribution $q_{ij}$ is defined between projected points $y_i$ and $y_j$. The fundamental difference is that t-SNE uses a Student’s t-distribution with one degree of freedom (also called the Cauchy distribution) rather than a Gaussian:

$$q_{ij} = \frac{(1 + |y_i – y_j|^2)^{-1}}{\sum_{k \neq l} (1 + |y_k – y_l|^2)^{-1}}$$

This choice is not trivial. The heavier tail of the Student’s t-distribution allows modeling the fact that, in the low-dimensional space, distant points become much more numerous (density decreases considerably). Without this, clusters would tend to compress toward the center of the visualization — this is the well-known crowding problem.

Step 3: Minimizing the Kullback-Leibler divergence

The goal of t-SNE is to find the positions ${y_i}$ such that the distribution $Q$ is as close as possible to the distribution $P$. The dissimilarity measure used is the Kullback-Leibler (KL) divergence:

$$KL(P | Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

This divergence is minimized via gradient descent. The gradient with respect to each point $y_i$ is written:

$$\frac{\partial KL}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} – q_{ij}) (y_i – y_j) (1 + |y_i – y_j|^2)^{-1}$$

This term is remarkably interpretable: it is a sum of attractive forces (when $p_{ij} > q_{ij}$) and repulsive forces (when $p_{ij} < q_{ij}$). Points that are close in high dimension but far apart in low dimension are attracted to each other, and vice versa.

Intuition: flattening a globe

Imagine you need to create a flat map of the world from a globe. You want neighboring countries to remain neighbors on the map: France should stay close to Germany, Brazil close to Argentina. However, it is acceptable — and even inevitable — that distances between continents be distorted. The Pacific may appear larger than it is, Europe more compressed.

This is exactly the philosophy of t-SNE: preserve local neighborhood at the expense of global distances. If two images of dogs are similar, t-SNE will place them side by side on the 2D map. However, the exact distance between the dog cluster and the cat cluster does not carry strong meaning: it may vary from one run to another.

This property is both the strength and the weakness of the algorithm. For visual exploration of local structures, it is ideal. For comparing clusters with each other, one must remain cautious.

Python Implementation with scikit-learn

Basic example on handwritten digits

The load_digits dataset from scikit-learn contains 1,797 images of handwritten digits (8×8 pixels), i.e., 64-dimensional vectors. It is an excellent starting point for visualizing t-SNE:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Load data
digits = load_digits()
X, y = digits.data, digits.target

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, cmap="tab10", s=30, alpha=0.8)
ax.set_title("t-SNE des chiffres manuscrits (2D)")
plt.colorbar(scatter, label="Chiffre", ticks=range(10))
plt.tight_layout()
plt.show()

The result typically shows 10 well-separated clusters, each corresponding to a digit. Fuzzy transitions between certain clusters (for example between 4s and 9s) reflect the real visual similarity between those digits.

Impact of perplexity

Perplexity is the most important hyperparameter. Let’s explore its effect:

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
perplexities = [5, 30, 100]

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_emb = tsne.fit_transform(X)
    ax.scatter(X_emb[:, 0], X_emb[:, 1], c=y, cmap="tab10", s=20, alpha=0.7)
    ax.set_title(f"Perplexité = {perp}")
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.show()

With low perplexity (5), t-SNE captures very local structures but risks fragmenting clusters into multiple islands. With high perplexity (100), the view becomes more global and clusters may merge. The default value of 30 is a good compromise for medium-sized data.

Comparison with PCA

It is instructive to compare t-SNE with Principal Component Analysis (PCA) on the same dataset:

from sklearn.decomposition import PCA

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=20, alpha=0.7)
axes[0].set_title("ACP (linéaire)")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=20, alpha=0.7)
axes[1].set_title("t-SNE (non linéaire)")
for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
plt.show()

PCA, being linear, does not always separate clusters as well. t-SNE reveals nonlinear structures that PCA cannot capture. However, PCA preserves global distances and is much faster.

Iteration Animation

To understand how t-SNE converges, we can visualize the evolution of the map over iterations:

import matplotlib.animation as animation
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Retrieve intermediate states via callback
states = []

def callback(Xi, i):
    states.append(Xi.copy())

tsne = TSNE(n_components=2, perplexity=30, n_iter=500,
            random_state=42, callback=callback, callback_every=50)
tsne.fit_transform(X)

# Animate
fig, ax = plt.subplots(figsize=(8, 6))
scat = ax.scatter(states[0][:, 0], states[0][:, 1], c=y, cmap="tab10", s=15)
ax.set_xticks([])
ax.set_yticks([])
titre = ax.set_title("Itération 0")

def update(frame):
    scat.set_offsets(states[frame])
    titre.set_text(f"Itération {frame * 50}")
    return scat, titre

ani = animation.FuncAnimation(fig, update, frames=len(states), interval=400)
ani.save("tsne_animation.gif", writer="pillow")
plt.show()

We typically observe an initial early exaggeration phase where clusters form quickly, followed by progressive refinement of positions.

t-SNE Hyperparameters

Mastering t-SNE requires understanding its hyperparameters:

Hyperparameter Default value Role
n_components 2 Dimension of the output space (2 for visualization, 3 for a three-axis rendering)
perplexity 30 Effective size of the local neighborhood. Typical values: 5 to 50. Must be less than the number of samples
early_exaggeration 12.0 Amplifies attractions at the beginning to create well-separated clusters. Increasing this value produces denser clusters but computation time increases
learning_rate “auto” Learning rate. “auto” automatically adjusts based on the number of samples. A rate too low gives compressed maps; too high, ball-shaped artifacts
n_iter 1000 Number of gradient descent iterations. Minimum 250. For high-quality visualizations, 2,000 to 5,000 iterations are recommended
metric “euclidean” Distance metric for the original space. Can be “cosine”, “manhattan”, “chebyshev”, etc.
init “pca” Initialization of points. “pca” (recommended) uses a prior PCA for more stable convergence. “random” can trap the algorithm in local minima
method “exact” “exact” for small datasets, “barnes_hut” (approximate approach) to speed up on large datasets

Advantages and Limitations

Advantages

  • Exceptional cluster separation: t-SNE excels at revealing local structures that linear methods like PCA cannot detect.
  • Robustness to nonlinearities: since t-SNE assumes no underlying linear structure, it applies to very complex data.
  • Relative scale invariance: the algorithm focuses on relative proximities rather than absolute distances, making it less sensitive to normalization.
  • Widespread adoption: visualizing embeddings with t-SNE has become standard practice in research and industry.

Limitations

  • Non-determinism: even with the same data, different runs (unless random_state is fixed) produce different maps. The overall shape may vary.
  • Global distance interpretation: the distance between two clusters is not meaningful. One cannot conclude that cluster A is twice as far from B as from C.
  • Cluster density and size: the apparent size of a cluster does not necessarily reflect the true density of the data.
  • Computational cost: in exact mode, the complexity is quadratic, limiting application to datasets of moderate size. The Barnes-Hut approximation helps but remains expensive beyond a few hundred thousand points.
  • No transform function: unlike PCA, t-SNE does not provide a function to project new data. Each run is independent.

Four Concrete Use Cases

1. Genomic Data Exploration

In bioinformatics, t-SNE is used to visualize gene expression profiles of thousands of genes. Each sample becomes a point in a space of several thousand dimensions, and t-SNE reveals invisible subtypes to the naked eye. Tight clusters often correspond to distinct cell types, while isolated points may indicate aberrant cells or rare types.

2. Natural Language Visualization

Natural language processing models (such as BERT or Word2Vec) produce high-dimensional vector representations of words or documents. Projecting these vectors with t-SNE allows verifying the quality of embeddings: semantically close words should appear in the same neighborhood. This is an essential diagnostic tool for NLP practitioners.

3. Image Analysis and Style Transfer

In convolutional neural networks, activations from intermediate layers can be reduced with t-SNE to understand what the network sees. It is often observed that early layers capture simple patterns (edges, textures) while deeper layers represent more abstract concepts (faces, objects). This analysis helps debug architectures and identify model biases.

4. Industrial Anomaly Detection

In industrial monitoring, hundreds of sensors are collected in real time. Projecting these observations with t-SNE reveals normal operation as a large dense cloud, while anomalous states (imminent failures, production defects) appear as isolated points on the periphery. This visualization guides the setup of alert thresholds and facilitates operator interpretation.

Best Practices

  1. Always set random_state for reproducibility.
  2. Run multiple times with different seeds to verify cluster stability.
  3. Start with PCA to reduce the dimension below 50 before applying t-SNE (especially if the original data has thousands of dimensions).
  4. Vary the perplexity to examine the robustness of the detected structure.
  5. Do not interpret distances between clusters or the relative size of clusters.
  6. Preserve labels to color points and visually validate cluster coherence.

See also