UMAP: Principles, Examples, and Python Implementation

UMAP : Guide Complet — Principes, Exemples et Implémentation Python

UMAP: Complete Guide — Principles, Examples, and Python Implementation

Summary — UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm that simultaneously preserves both the local and global structure of data. Faster than t-SNE and capable of extrapolation, it is based on topological manifold theory and graph optimization. Developed by Leland McInnes and John Healy, UMAP has become the reference tool for high-dimensional data visualization in many scientific fields.


Mathematical principle

UMAP rests on three fundamental mathematical pillars that clearly distinguish it from its predecessors:

1. Fuzzy k-NN graph construction: For each point $x_i$, its $k$ nearest neighbors are identified. An adaptive local distance $\rho_i$ (distance to the $k$-th neighbor) and a scale $\sigma_i$ are defined via binary search so that the sum of edge weights from $x_i$ is consistent. The similarity between $x_i$ and $x_j$ is given by:

$$w_{ij} = \exp\left(-\frac{\max(0, d(x_i, x_j) – \rho_i)}{\sigma_i}\right)$$

The key trick here is local adaptation: each point has its own radius $\sigma_i$, meaning UMAP automatically adapts to density variations in the data. Where other algorithms use a single global parameter, UMAP adjusts its sensitivity locally.

2. Symmetric graph (fuzzy union): The edges computed above are directed ($w_{ij} \neq w_{ji}$). To obtain an undirected graph, UMAP applies a probabilistic fuzzy union:

$$w’{ij} = w$$} + w_{ji} – w_{ij} \cdot w_{ji

This operation can be understood as a “probabilistic OR”: the strength of the connection between two points is strong if at least one of them considers the other a close neighbor. This is what allows UMAP to handle clusters of very different densities well.

3. Low-dimensional embedding optimization: In the target dimension (typically 2), similarities are defined with a shifted Student distribution:

$$q_{ij} = (1 + a \cdot ||y_i – y_j||^{2b})^{-1}$$

where $a$ and $b$ are parameters determined by min_dist and spread. Optimization minimizes cross-entropy (not KL-divergence like t-SNE) via stochastic gradient descent:

$$CE = \sum_{i \neq j} w’{ij} \log\left(\frac{w’\right) + (1 – w’}}{q_{ij}{ij}) \log\left(\frac{1 – w’\right)$$}}{1 – q_{ij}

Theoretical foundation: Unlike t-SNE, which was essentially heuristic, UMAP is grounded in algebraic topology. The authors demonstrate that under the assumption of uniform sampling on a Riemannian manifold, the fuzzy k-NN graph is an approximation of the nerve of a covering of the manifold. This connection with Čech theory provides a solid mathematical foundation for the algorithm.


Intuition

Imagine you are drawing a subway map of a large city. Neighboring stations must remain close on the map, but you can distort distances between distant neighborhoods so that everything fits on one page. t-SNE does this with remarkable precision for details, but at the cost of losing all information about global distances — it’s impossible to tell whether two groups of stations are close to or far from each other.

UMAP, on the other hand, manages to preserve both the immediate neighborhood and the overall structure. Neighborhoods remain in the correct relative order, and you can even distinguish the main arteries connecting them. It’s as if you had t-SNE’s precision for details, but PCA’s fidelity for the big picture.

The fabric analogy: Imagine a crumpled piece of fabric in 3D. PCA flattens it brutally by tearing it. t-SNE cuts it into pieces that it arranges nicely but without respecting the original order. UMAP, on the other hand, carefully unfolds the fabric, preserving both local patterns (the fabric pattern remains recognizable) and the global shape (the edges stay in the right places).

Why is it faster than t-SNE?: t-SNE computes similarity for all pairs of points, which costs $O(n^2)$. UMAP only computes the $k$ neighbors of each point, bringing the cost down to $O(n \cdot k \cdot \log(n))$ through approximate nearest neighbor data structures (k-d trees, RP-trees). On 70,000 MNIST points, t-SNE takes several minutes while UMAP finishes in a few seconds.


Python implementation

Example 1: UMAP on MNIST with umap-learn

import umap
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Load MNIST (sample for speed)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data[:7000], mnist.target[:7000]

# Configure the UMAP reducer
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='euclidean',
    random_state=42
)

# Transform
X_embedded = reducer.fit_transform(X)

# Visualization
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
scatter = ax.scatter(
    X_embedded[:, 0], X_embedded[:, 1],
    c=y.astype(int), cmap='tab10', s=5, alpha=0.7
)
plt.colorbar(scatter, ticks=range(10))
ax.set_title('UMAP: MNIST Visualization in 2D')
ax.set_xlabel('UMAP Component 1')
ax.set_ylabel('UMAP Component 2')
plt.tight_layout()
plt.savefig('umap_mnist.png', dpi=150)
print(f"UMAP complete. Embedding shape: {X_embedded.shape}")
print(f"The 10 digits form well-separated clusters in 2D.")

Example 2: UMAP vs t-SNE vs PCA comparison

import umap
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import time

X, y = load_digits(return_X_y=True)

# PCA (linear reference)
t0 = time.time()
pca = PCA(n_components=2).fit_transform(X)
pca_time = time.time() - t0
print(f"PCA: {pca_time:.3f}s")

# t-SNE
t0 = time.time()
tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
tsne_time = time.time() - t0
print(f"t-SNE: {tsne_time:.3f}s")

# UMAP
t0 = time.time()
umap_res = umap.UMAP(n_components=2, n_neighbors=10, random_state=42).fit_transform(X)
umap_time = time.time() - t0
print(f"UMAP: {umap_time:.3f}s")
print(f"UMAP is {tsne_time/umap_time:.1f}x faster than t-SNE")

# Comparative display
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, res, title in zip(axes, [pca, tsne, umap_res], ['PCA', 't-SNE', 'UMAP']):
    sc = ax.scatter(res[:, 0], res[:, 1], c=y, cmap='tab10', s=10)
    ax.set_title(f'{title}\n({len(set(y))} digits)')
    ax.set_xticks([])
    ax.set_yticks([])
plt.suptitle('PCA / t-SNE / UMAP Comparison: Handwritten Digits')
plt.tight_layout()
plt.savefig('umap_comparison.png', dpi=150)

Example 3: UMAP with supervised data (y as information)

import umap
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# iris
X, y = load_iris(return_X_y=True)

# Search for the best n_neighbors via silhouette
best_score, best_k = -1, None
results = []
for k in [3, 5, 10, 15, 20, 30]:
    embedding = umap.UMAP(n_neighbors=k, n_components=2, random_state=42).fit_transform(X)
    score = silhouette_score(embedding, y)
    results.append((k, score))
    if score > best_score:
        best_score, best_k = score, k
    print(f"  k={k:2d} → silhouette={score:.3f}")

print(f"\nBest n_neighbors: {best_k} (silhouette={best_score:.3f})")

Example 4: Transforming new data (extrapolation capability)

import umap
import numpy as np
from sklearn.datasets import make_classification

# Training data
X_train, y_train = make_classification(n_samples=2000, n_features=50, n_classes=3, random_state=42)

# Fit UMAP on training data
reducer = umap.UMAP(n_components=2, random_state=42)
X_train_2d = reducer.fit_transform(X_train)
print(f"Training complete. Shape: {X_train_2d.shape}")

# Transform NEW data without retraining
X_new = np.random.randn(200, 50)
X_new_2d = reducer.transform(X_new)
print(f"New data transformation: {X_new_2d.shape}")
print(f"UMAP is one of the few unsupervised algorithms that supports transform().")

Hyperparameters

Hyperparameter Default value Impact Recommendation
n_neighbors 15 Controls the local/global balance. Low (5-10) = fine clusters, high (30-100) = global view Start at 15, adjust according to the desired scale
min_dist 0.1 Minimum distance in the embedding. 0.01 = very tight points, 0.99 = spread out points 0.05-0.1 for clustering, 0.3-0.5 for visualization
n_components 2 Target dimension: 2 for visualization, 5-50 for supervised preprocessing 2 or 3 to visualize, 10-50 for ML pipeline
metric ‘euclidean’ Distance metric between points ‘cosine’ for text/NLP, ‘correlation’ for bio
spread 1.0 Effective scale of the embedding (with min_dist determines a and b) Keep 1.0 in most cases
alpha 1.0 Initial learning rate for optimization 0.5-1.0; lower if optimization oscillates
negative_sample_rate 5 Negative/positive sample ratio 5 (default) for speed, 10 for maximum quality
random_state None Seed for reproducibility Always set for reproducible results

Advantages of UMAP

  1. Global preservation: Unlike t-SNE which only preserves local neighborhoods, UMAP also preserves large-scale structure. Relationships between clusters are meaningful.
  2. Exceptional speed: Thanks to k-NN approximation (RP-trees), UMAP is 5 to 50 times faster than t-SNE on medium to large datasets.
  3. Extrapolation capability: Via transform(), UMAP can project new data into an existing embedding without full recomputation — unique among nonlinear visualization algorithms.
  4. Dimensional flexibility: Although mostly used in 2D/3D, UMAP works effectively up to 50-100 dimensions for preprocessing before supervised models, offering a nonlinear alternative to PCA.
  5. Wide range of metrics: Native support for dozens of metrics (Euclidean, cosine, Manhattan, Hamming, Jaccard, Hellinger, Wasserstein…) allowing the processing of text, images, sequences, and categorical data.

Limitations of UMAP

  1. Sensitivity to hyperparameters: n_neighbors and min_dist have a major impact on the result. There is no universally optimal value — grid search tuning is often necessary.
  2. Lack of axis interpretability: Like t-SNE, UMAP components have no direct physical meaning, unlike PCA’s principal components.
  3. Stochastic variability: Different random seeds produce different visualizations, which can be unsettling for novice users.
  4. Dependency on numba: The umap-learn library depends on numba, which compiles functions on first run (initial delay) and can cause installation issues on some Python/OS version combinations.
  5. Memory cost on very large datasets: Despite its algorithmic efficiency, k-NN graph construction consumes $O(n \cdot k)$ memory, which becomes problematic beyond a few million points.

4 concrete use cases

1. Visualization of transcriptomic data (single-cell RNA-seq)

In genomics, gene expression data measures the activity of 20,000+ genes in thousands of individual cells. UMAP has become the standard tool for visualization in this field, having largely replaced t-SNE in pipelines like Scanpy and Seurat. It allows biologists to visually identify cell subpopulations (cell types, differentiation states, cell cycles) and explore developmental trajectories.

2. UMAP + HDBSCAN assisted clustering

A powerful approach consists of projecting data into 10-50 dimensions with UMAP, then applying HDBSCAN or DBSCAN. The UMAP embedding makes cluster structures denser and better separated than in the original space. This is particularly effective for data where clusters have non-spherical shapes and varying densities — situations where K-Means and PCA fail.

3. Exploration of recommendation spaces

In recommendation systems (e-commerce, streaming, news), users and content are represented by vectors of 100 to 500 dimensions. UMAP allows product teams to visualize the structure of this space: which content items are close? Are there empty zones (opportunities for missing content)? Do clusters correspond to genres or business categories? This visual understanding helps design better recommendation strategies and detect biases.

4. Anomaly detection in cybersecurity and industry

In network monitoring, normal connections form a dense continuum in UMAP space. Anomalies (DDoS attacks, intrusions, phishing attempts) appear as isolated points or small groups, clearly separated from the main mass. Similarly, in industry, deviating sensor measurements (early failure, abnormal wear) stand out visually. This property is leveraged in real-time monitoring dashboards where a human operator watches the UMAP topology to detect abnormal drifts.


Best practices for using UMAP

  • Always normalize data before UMAP (StandardScaler or MinMaxScaler), as with any distance-based method.
  • Try multiple n_neighbors values: start at 5, 15, 30, and 50. The lower the value, the more UMAP focuses on fine local structure.
  • Set random_state for reproducibility, especially if presenting results to colleagues or in a publication.
  • Combine with PCA: for very large datasets, first apply PCA to reduce to 50 dimensions, then UMAP for 2D visualization. This significantly speeds up computation without significant quality loss.
  • Use transform() with caution: new data is projected into the existing embedding, which is fast but may give worse results than full retraining if the new data distribution is very different.

Conclusion

UMAP represents a significant advance in the field of dimensionality reduction. By combining a solid theoretical foundation (algebraic topology) with efficient algorithmic approximations (fuzzy k-NN, stochastic gradient descent), it offers an unparalleled trade-off between preservation quality, computation speed, and versatility.

When choosing between PCA, t-SNE, and UMAP, the practical rule is: PCA for interpretability and maximum speed, t-SNE for fine visualizations of small datasets, UMAP for everything else — large datasets, global structure, supervised preprocessing, and extrapolation.


See also