HDBSCAN: Principles, Examples, and Python Implementation

HDBSCAN : Guide Complet — Principes, Exemples et Implémentation Python

HDBSCAN: Complete Guide — Principles, Examples, and Python Implementation

Summary

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that combines ideas from DBSCAN and hierarchical clustering. Unlike K-Means which requires setting the number of clusters K in advance, or DBSAN which requires a carefully calibrated ε radius, HDBSCAN automatically explores the hierarchical structure of data across all density scales and extracts the most stable clusters. The result: it detects clusters of arbitrary shapes and varying densities, naturally identifies outliers (noise), and requires no choice of the number of clusters. This article details the mathematical foundations, algorithmic intuition, practical Python implementation, and concrete use cases of HDBSCAN.


Mathematical Principle

HDBSCAN’s operation relies on a chain of rigorous mathematical transformations that convert the original data space into a cluster hierarchy. Understanding each of these steps is essential to mastering the algorithm.

1. Core distance

For each point x, we calculate its core distance core_dist(x), defined as the distance to the k-th nearest neighbor, where k = min_samples. This concept, inherited from DBSCAN, quantifies the local density around each point: a point in a dense region will have a small core distance, while a point in a sparse region will have a high core distance.

Formally:

core_dist(x) = d(x, k-NN(x))

where k-NN(x) denotes the k-th nearest neighbor of x and d is the chosen distance metric (usually Euclidean distance).

2. Mutual reachability distance

This is the cornerstone of HDBSCAN. For any pair of points (x, y), we define the mutual reachability distance as:

d_reachability(x, y) = max(core_dist(x), core_dist(y), d(x, y))

This formula is fundamental. Let’s examine why:

  • If both points are in a dense region: their core distances are small, so d(x, y) dominates and d_reachability(x, y) ≈ d(x, y). Close points remain close.
  • If one or both points are in a sparse region: their core distances are large, so d_reachability(x, y) is at least as large as the larger core distance. This “artificially pushes” less dense points away from their neighbors, effectively dissolving thin bridges between potential clusters.

In other words, this distance transforms the space so that low-density regions are “dilated” while dense regions remain contracted. It is this mechanism that allows HDBSCAN to naturally distinguish clusters of different densities.

3. Dendrogram construction (single-linkage hierarchical clustering)

On the complete graph of mutual reachability distances, HDBSCAN applies single-linkage hierarchical clustering. This process builds a dendrogram — a binary or multi-node tree — where:

  • Each point starts as its own cluster (a leaf of the tree).
  • At each step, the two closest clusters (by the minimum mutual reachability distance between their members) are merged.
  • The merge level corresponds to the mutual reachability distance between the points triggering the merge.

The result is a complete hierarchy: cutting the tree at different levels yields partitions with different numbers of clusters. The higher you go in the tree (high threshold), the larger and fewer the clusters; the lower you go (low threshold), the more numerous and finer the clusters.

4. Flat extraction via cluster persistence and stability

This step that distinguishes HDBSCAN from simple hierarchical clustering. Instead of asking the user to choose a cut level (which would be arbitrary), HDBSCAN automatically evaluates the stability of each candidate cluster across the hierarchy.

For a given cluster C, we define:

λ_min = 1 / entry_distance_into_C    (inverse of the distance at which C appears)
λ_max = 1 / exit_distance_from_C     (inverse of the distance at which C splits)
stability(C) = Σ_{x ∈ C} (λ_max - λ_min(x))

More precisely, for each point x in the cluster, we measure how long (in terms of inverse distance λ = 1/d) it remains loyal to this cluster before moving to a child cluster or becoming noise. The total stability of a cluster is the sum of these individual contributions.

The algorithm then performs a descending traversal of the tree: for each parent cluster, it compares its stability to the sum of its children’s stabilities. If the parent is more stable, it is kept; otherwise, the children are retained. This approach ensures that only truly persistent clusters across scales are selected.

5. Unassigned points = noise

Points that cannot be reliably assigned to any stable cluster are labeled noise, with label -1. This noise detection is intrinsic to the algorithm and requires no additional parameter.


Intuition: Improved DBSCAN

To properly understand HDBSCAN, it is useful to position it relative to its predecessor, DBSCAN.

DBSCAN relies on two parameters: eps (the neighborhood radius) and min_samples (the minimum number of neighbors for a point to be considered a core point). The problem is that eps is difficult to choose:

  • A eps that is too small → too much noise, clusters fracture.
  • A eps that is too large → all points merge into one giant cluster.
  • If clusters have different densities, no single eps can correctly capture all structures. A eps suited to a dense cluster will be too small for a sparse cluster, and vice versa.

HDBSCAN solves this dilemma elegantly: instead of manually fixing eps, it explores ALL possible eps simultaneously. For each possible value, the algorithm “sees” a different partition of the data. By assembling these partitions across all scales, it builds a complete hierarchical tree. Then, using the stability mechanism described above, it automatically selects the most stable clusters across scales.

Think of it this way: instead of taking a single blurry photo of your data with a poorly adjusted lens (fixed eps), HDBSCAN makes a complete movie of your data at all zoom levels, then identifies structures that persist throughout the film. It is this multi-scale approach that makes HDBSCAN so powerful for real-world data.


Python Implementation

Installation

pip install hdbscan scikit-learn matplotlib seaborn

The hdbscan library is the reference implementation, developed by the original authors of the algorithm (Leland McInnes and John Healy).

Fundamental example with varying densities

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# ------------------------------------------------------------------
# Create data with 3 clusters of different densities
# ------------------------------------------------------------------
centers = [[1, 1], [-1, -1], [1, -1]]
cluster_sizes = [200, 80, 40]      # Very varied sizes
cluster_stds = [0.2, 0.5, 0.9]     # Different standard deviations = different densities

X_parts = []
labels_true = []
for i, (center, size, std) in enumerate(zip(centers, cluster_sizes, cluster_stds)):
    np.random.seed(42 + i)
    blob = np.random.randn(size, 2) * std + center
    X_parts.append(blob)
    labels_true.extend([i] * size)

# Add uniform noise points
noise = np.random.uniform(low=-3, high=3, size=(30, 2))
X_parts.append(noise)

X = np.vstack(X_parts)
X = StandardScaler().fit_transform(X)

# ------------------------------------------------------------------
# HDBSCAN Clustering
# ------------------------------------------------------------------
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,   # Minimum cluster size
    min_samples=10,        # Core distance conservatism
    metric='euclidean',
    cluster_selection_method='eom',  # Excess of Mass (default)
    gen_min_span_tree=True   # Required for plot_hierarchy()
)
clusterer.fit(X)
labels = clusterer.labels_
probas = clusterer.probabilities_  # Cluster membership probability

# ------------------------------------------------------------------
# Visualization
# ------------------------------------------------------------------
palette = sns.color_palette('deep', np.unique(labels).max() + 1)
colors = [palette[int(l)] if l >= 0 else (0.5, 0.5, 0.5) for l in labels]

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Scatter plot by cluster
axes[0].scatter(X[:, 0], X[:, 1], c=colors, s=8, alpha=0.7)
axes[0].set_title(f'HDBSCAN — {labels.max() + 1} clusters detected\n'
                  f'{(labels == -1).sum()} noise points')
axes[0].set_xlabel('X1')
axes[0].set_ylabel('X2')

# Hierarchical visualization (dendrogram)
clusterer.condensed_tree_.plot(select_clusters=True,
                                selection_palette=palette)
axes[1].set_title('HDBSCAN Condensed Tree')

# Single-linkage tree
clusterer.single_linkage_tree_.plot(cmap='viridis',
                                     colorbar=True)
axes[2].set_title('Single-Linkage Tree')

plt.tight_layout()
plt.savefig('hdbscan_demo.png', dpi=150, bbox_inches='tight')
plt.close()

# ------------------------------------------------------------------
# Display results
# ------------------------------------------------------------------
for cluster_id in range(labels.max() + 1):
    mask = labels == cluster_id
    print(f"Cluster {cluster_id}: {mask.sum()} points, "
          f"mean probability = {probas[mask].mean():.3f}")
print(f"Noise: {(labels == -1).sum()} points")

Probabilistic membership

One of HDBSCAN’s most useful features is the soft clustering probability:

# Display ambiguous points (probability between 0.4 and 0.7)
ambiguous = (probas > 0.4) & (probas < 0.7) & (labels >= 0)
print(f"Ambiguous points: {ambiguous.sum()}")

# These points are at cluster boundaries.
# We can visualize them separately to understand the transitions.

Unlike K-Means which assigns each point to a single cluster deterministically, HDBSCAN provides a confidence for each assignment. A point at the center of a dense cluster will have a probability close to 1.0, while a point at the boundary between two clusters will have a lower probability.

Comparison with DBSCAN on the same data

from sklearn.cluster import DBSCAN

# DBSCAN — requires manually choosing eps
dbscan = DBSCAN(eps=0.5, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)

n_clusters_db = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise_db = list(dbscan_labels).count(-1)

print(f"DBSCAN (eps=0.5): {n_clusters_db} clusters, {n_noise_db} noise points")
print(f"HDBSCAN          : {labels.max() + 1} clusters, {(labels == -1).sum()} noise points")

On data with varying densities, HDBSCAN will typically detect all three clusters more correctly, while DBSCAN will miss one or two depending on the chosen eps. This comparison concretely illustrates the advantage of HDBSCAN’s multi-scale approach.


Detailed Hyperparameters

Mastering HDBSCAN’s hyperparameters is crucial for obtaining relevant results. Here is a detailed guide to each:

min_cluster_size (required, integer ≥ 2)

This is the most important parameter. It sets the minimum size of a valid cluster. A cluster is any group of points that persists in the hierarchical tree with at least min_cluster_size members.

  • Small value (5-10): detects fine structures, but risks over-segmentation.
  • Large value (50+): only captures large clusters, may merge distinct structures.
  • Practical rule: start with 5-10% of the total data size, then adjust.

min_samples (integer ≥ 1)

Determines the k used to calculate core distances. This parameter controls the conservatism of the density metric:

  • Equal to min_cluster_size (default): standard behavior, well balanced.
  • Smaller than min_cluster_size: makes the metric more “permissive”, clusters form more easily.
  • Larger than min_cluster_size: makes the metric more “conservative”, only very dense cores are identified, increasing noise detection.

A good practice is to set min_samples ≥ min_cluster_size to emphasize the separation between dense clusters and noise.

cluster_selection_method (string)

Two available methods:

  • 'eom' (Excess of Mass, default): favors persistence across scales. Selects clusters that survive the longest in the hierarchical tree. Ideal when seeking globally coherent structures.
  • 'leaf': traverses the leaves of the tree (the finest clusters) and only goes up if necessary. Generally produces more, smaller clusters. Useful when the data contains interesting fine sub-structures.

cluster_selection_epsilon (float ≥ 0)

Sets a maximum scale for cluster extraction. Any cluster whose stability lies beyond this distance (in terms of 1/distance) will be split. The default (0) means the algorithm explores the entire hierarchy. This parameter is useful when you approximately know the maximum scale of the structures of interest.

allow_single_cluster (boolean)

When True, allows the case where all points form a single cluster (degenerate case). The default is False, meaning that if the tree root is the only cluster, the algorithm returns only noise. Only enable this option if you are certain your data forms a single coherent group.

metric (string or callable)

The distance metric used to build the mutual reachability space. Common options:

  • 'euclidean': standard Euclidean distance (default).
  • 'manhattan': L1 distance, robust to outliers.
  • 'cosine': cosine similarity, ideal for text or high-dimensional data.
  • 'hamming': for binary/categorical data.
  • Custom function: callable accepting two arrays and returning a float.

The choice of metric is critical: HDBSCAN can only find structures that exist in the chosen metric space.


Advantages and Limitations

Advantages

  1. No number of clusters to set — unlike K-Means or classical hierarchical clustering.
  2. Handles varying densities — where DBSCAN fails, HDBSCAN excels thanks to mutual reachability distance.
  3. Natural noise detection — unassigned points are automatically identified, no additional threshold needed.
  4. Probabilistic membership — provides a confidence measure for each assignment, not just a binary label.
  5. Exploitable hierarchical structure — the condensed tree reveals sub-structures at different scales.
  6. Robustness — less sensitive to parameter choice than DBSCAN due to its multi-scale exploration.

Limitations

  1. Computational complexity — O(n²) in the worst case, although in practice it is significantly better thanks to optimizations (k-d trees, Borůvka’s algorithm). For millions of points, approximations or sampling are often necessary.
  2. Sensitivity to min_cluster_size — while easier to tune than DBSCAN’s eps, this parameter remains influential. A poor choice can under- or over-segment.
  3. Critical distance metric — like all distance-based algorithms, poorly scaled or very high-dimensional (>100) data can give poor results.
  4. Non-deterministic on some datasets — ambiguities in tree construction can lead to slight variations between two runs.
  5. No natural incrementation — like most clustering algorithms, adding new data requires a recalculation (or the use of approximate_predict from the hdbscan library).

4 Concrete Use Cases

Use Case 1: Anomaly Detection in Cybersecurity

In a network flow, the majority of connections form dense clusters (normal traffic). HDBSCAN naturally identifies atypical connections as noise. The membership probability allows sorting alerts by severity: points with very low probability are the most suspicious.

# Simplified scenario: intrusion detection
anomalies = X[labels == -1]
print(f"{len(anomalies)} suspicious connections detected out of {len(X)}")

Use Case 2: Customer Segmentation in Marketing

Customers do not divide into perfectly separated groups. Some are loyal buyers (dense clusters), others occasional browsers (diffuse clusters), and some are anomalies (suspicious purchases or data errors). HDBSCAN captures this reality much better than K-Means which would force an artificial partition.

Use Case 3: Document Analysis and NLP

After embedding texts into a vector space (via TF-IDF, Word2Vec, or modern embeddings like BERT), HDBSCAN groups documents by semantic theme. The cosine metric is then ideal. “Noise” documents are those that clearly do not correspond to any theme — potentially multi-thematic or atypical documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

texts = ["article about ML", "recipe for cooking", ...]
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(texts)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
    metric='precomputed',
    cluster_selection_method='leaf'
)
labels = clusterer.fit_predict(cosine_distances(X_tfidf))

Use Case 4: Computational Biology — Single-Cell Clustering

In scRNA-seq (single-cell RNA sequencing), each cell is a point in high dimension. Cell populations have very different densities (some cells are rare, others abundant). HDBSCAN has become a standard in this field because it detects both major populations and rare sub-populations while identifying low-quality cells as noise. The Scanpy tool for single-cell data analysis natively integrates HDBSCAN.


Best Practices

  1. Always normalize your data — HDBSCAN is sensitive to feature scaling. Use StandardScaler or MinMaxScaler before clustering.
  2. Reduce dimensionality if necessary — for data with > 50 dimensions, apply UMAP or PCA first. The curse of dimensionality affects all distance metrics.
  3. Start with min_cluster_size ≈ n/20 — a reasonable empirical rule for a first attempt.
  4. Visualize the condensed tree — it is the best diagnostic for understanding whether detected clusters are robust or artifacts.
  5. Compare with DBSCAN — if HDBSCAN and DBSCAN give similar results on the same data, it is a good sign. If they differ radically, investigate your data structure.
  6. Use cluster_selection_method='leaf' when you suspect fine sub-structures within your main clusters.

Conclusion

HDBSCAN represents a major advance in the arsenal of unsupervised clustering. By combining the rigor of mutual reachability distance with systematic exploration of the hierarchy at all scales, it offers robustness few algorithms can match. Its main asset — not having to guess the number of clusters — makes it a tool of choice for exploring real data, where the underlying structure is rarely known in advance. With accessible Python libraries and an active community, HDBSCAN is today a de facto standard for exploratory clustering in data science.


See Also