SOM (Self-Organizing Map): Kohonen Self-Organizing Map

SOM (Self-Organizing Map) : Guide Complet — Carte Auto-Organisatrice de Kohonen

SOM (Self-Organizing Map) — Kohonen Self-Organizing Map

Summary

The Self-Organizing Map (SOM), or Kohonen self-organizing map, is an unsupervised learning algorithm that projects high-dimensional data onto a two-dimensional grid while preserving its topological structure. Introduced by Teuvo Kohonen in 1982, the SOM remains one of the most elegant approaches for visualizing and clustering complex data. Unlike methods such as K-Means, which simply produce flat clusters, the SOM organizes its neurons on a spatial grid where neighboring regions correspond to similar data. This unique topological property makes it a preferred tool for data exploration, pattern detection, and concept mapping in fields as varied as bioinformatics, finance, and image processing.


Mathematical Principle

The Neuron Grid

The fundamental architecture of a SOM is based on a regular grid of neurons (usually 1D, 2D hexagonal, or rectangular). Each neuron i has a weight vector wi ∈ ℝ^d of the same dimension as the input data. These weight vectors play the role of “prototypes” or “adaptive centroids”, similar to K-Means centers, but spatially organized on the grid.

Step 1: Finding the BMU (Best Matching Unit)

For each input vector x, the algorithm identifies the Best Matching Unit — the BMU — by computing the Euclidean distance between x and each weight vector:

BMU = argmin_i || x - w_i ||

The BMU is the neuron whose weight vector is closest to the current input vector. This step is analogous to K-Means assignment, but instead of simply assigning the point to the nearest cluster, the SOM uses this information to update an entire region of the grid.

Step 2: Weight Update with Gaussian Neighborhood

Unlike K-Means which only updates the winning centroid, the SOM updates the BMU and its neighbors on the grid. The update rule is:

w_j ← w_j + η · h_ij · (x - w_j)

where:

  • η (eta) is the learning rate, a positive scalar that decays over iterations.
  • h_ij is the Gaussian neighborhood function between the BMU (neuron i) and neighboring neuron j:
h_ij(t) = exp( -d(i,j)² / (2 · σ(t)²) )
  • d(i,j) is the grid distance between neurons i and j (Euclidean or Manhattan distance on grid coordinates).
  • σ(t) is the neighborhood radius at iteration t, which decreases progressively.

Temporal Decay of the Learning Rate and Radius

A crucial aspect of the SOM is the progressive decay of two key hyperparameters:

η(t) = η₀ · exp(-t / τ₁)
σ(t) = σ₀ · exp(-t / τ₂)

where η₀ is the initial learning rate, σ₀ is the initial neighborhood radius, and τ₁, τ₂ are time constants linked to the total number of iterations T:

τ₁ = T / ln(η₀)
τ₂ = T / ln(σ₀)

This dual decay ensures two distinct phases:

  1. Global organization phase (large σ, high η): at the start of training, the wide neighborhood allows entire regions of the map to move collectively, creating a coarse topological order. The map gradually folds to fit the underlying data distribution, like a piece of elastic fabric molding itself to a complex shape.
  2. Local refinement phase (small σ, low η): toward the end of training, only a small number of neighbors near the BMU are affected, allowing fine-tuning of the map. The neighborhood shrinks until it only concerns the BMU itself, similar to a local gradient descent.

This two-phase mechanism is the keystone of the algorithm: without the organization phase, the map would remain trapped in local minima; without the refinement phase, precision would remain mediocre.

Why is Topology Preserved?

The strength of the Gaussian neighborhood is that two close points in the original input space will likely activate neighboring BMUs on the grid. Consequently, their weight vectors will be pulled in similar directions, maintaining a spatial coherence between proximity in the original space and proximity on the map. This topology preservation property fundamentally distinguishes the SOM from simple clustering.


Geographic Intuition: A Map of the Data World

Imagine you have thousands of customer profiles, each described by fifty variables (age, income, purchase frequency, preferences, etc.). Directly visualizing this fifty-dimensional space is impossible for the human mind. The SOM acts as an algorithmic cartographer: it takes this multidimensional landscape and projects it onto a readable 2D grid.

On this map, neighboring regions group similar profiles. In the top left, you might find young urban professionals with high purchasing power. In the bottom right, rural retirees with occasional purchases. In between, transition zones represent hybrid customer bases. If two points are side by side on the map, they resemble each other in the original space; if they are far apart, they differ significantly.

It is precisely this ability to reveal the latent structure of the data through an intuitive spatial representation that makes the SOM so powerful. It turns the invisible into the visible.


Complete Python Implementation

Installation

pip install minisom matplotlib numpy seaborn scikit-learn

Complete Code with U-Matrix, Heatmap, and Clustering

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from minisom import MiniSom
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import euclidean

# 1. Loading and preparing data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
class_names = iris.target_names

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Initializing and training the SOM
grid_size = (10, 10)
sigma = 2.0
learning_rate = 0.5
n_iter = 1000

som = MiniSom(
    grid_size[0], grid_size[1],
    input_len=X_scaled.shape[1],
    sigma=sigma,
    learning_rate=learning_rate,
    topology="hexagonal",
    neighborhood_function="gaussian",
    random_seed=42
)

som.random_weights_init(X_scaled)

print("Training the self-organizing map...")
som.train_random(X_scaled, n_iter)
print(f"Training complete after {n_iter} iterations.")

# 3. Visualization: U-Matrix (Distance Map)
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

u_matrix = som.distance_map()
im0 = axes[0, 0].imshow(u_matrix, cmap="bone_r")
axes[0, 0].set_title("U-Matrix - Distance Map", fontsize=13, fontweight="bold")
axes[0, 0].set_xlabel("Grid position")
axes[0, 0].set_ylabel("Grid position")
plt.colorbar(im0, ax=axes[0, 0], label="Average distance")

# 4. Marking classes on the map
colors_list = ["red", "green", "blue"]
markers_list = ["o", "s", "^"]

win_map = {}
for idx, sample in enumerate(X_scaled):
    winner = som.winner(sample)
    key = (winner[0], winner[1])
    if key not in win_map:
        win_map[key] = []
    win_map[key].append(y[idx])

for cls in range(len(class_names)):
    positions = []
    for (i, j), labels in win_map.items():
        if cls in labels:
            count = labels.count(cls)
            for _ in range(min(count, 3)):
                noise_x = np.random.uniform(-0.15, 0.15)
                noise_y = np.random.uniform(-0.15, 0.15)
                positions.append((i + noise_x, j + noise_y))

    if positions:
        pos_arr = np.array(positions)
        axes[0, 0].scatter(
            pos_arr[:, 0], pos_arr[:, 1],
            c=[colors_list[cls]], marker=markers_list[cls],
            label=class_names[cls], alpha=0.7, s=40
        )

axes[0, 0].legend(fontsize=10)

# 5. Per-variable heatmaps
for idx, name in enumerate(feature_names):
    row = idx // 2
    col = idx % 2
    if row < 2 and col < 2:
        component = som.get_weights()[:, :, idx]
        axes[row, col].imshow(component, cmap="viridis")
        axes[row, col].set_title(
            f"Component: {name}",
            fontsize=11, fontweight="bold"
        )
        plt.colorbar(
            axes[row, col].images[-1],
            ax=axes[row, col], shrink=0.8
        )

plt.tight_layout()
plt.show()

# 6. Clustering from the SOM
from sklearn.cluster import KMeans

weights = som.get_weights()
weight_vectors = weights.reshape(-1, weights.shape[2])

k_som = 3
kmeans = KMeans(n_clusters=k_som, random_state=42, n_init=10)
som_clusters = kmeans.fit_predict(weight_vectors)

point_clusters = []
for sample in X_scaled:
    winner = som.winner(sample)
    cluster_idx = winner[0] * grid_size[1] + winner[1]
    point_clusters.append(som_clusters[cluster_idx])

point_clusters = np.array(point_clusters)

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
for cls in range(k_som):
    mask = point_clusters == cls
    ax.scatter(
        X_scaled[mask, 0], X_scaled[mask, 1],
        alpha=0.6, label=f"SOM Cluster {cls}", s=50
    )
ax.set_title("SOM + K-Means Clustering", fontsize=13, fontweight="bold")
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# 7. Clustering quality evaluation
from sklearn.metrics import adjusted_rand_score

ari = adjusted_rand_score(y, point_clusters)
print(f"\nAdjusted Rand Index (true labels vs SOM clusters): {ari:.4f}")

# 8. Quantification error
quantization_errors = []
for sample in X_scaled:
    winner = som.winner(sample)
    w = som.get_weights()[winner[0], winner[1]]
    error = euclidean(sample, w)
    quantization_errors.append(error)

mean_qe = np.mean(quantization_errors)
print(f"Mean quantization error: {mean_qe:.4f}")
print(f"Error standard deviation: {np.std(quantization_errors):.4f}")

# 9. Predicting a new data point
def predict_som(new_sample, som_model, scaler_ref):
    sample_scaled = scaler_ref.transform(new_sample.reshape(1, -1))
    winner = som_model.winner(sample_scaled[0])
    return winner

new_data = np.array([[5.1, 3.5, 1.4, 0.2]])
winner = predict_som(new_data, som, scaler)
print(f"\nBMU for new point: ({winner[0]}, {winner[1]})")
print("This point lies in the map region corresponding to Iris setosa.")

Code Explanation

The above code implements a complete SOM analysis pipeline:

  • Preparation phase: The Iris data is normalized with StandardScaler, an essential step since the SOM relies on Euclidean distances.
  • Training: MiniSom’s train_random function presents samples randomly, avoiding any bias related to data order.
  • U-Matrix (Unified Distance Matrix): This visualization displays the average distances between adjacent neurons. Light regions indicate cluster boundaries (large distances), while dark regions correspond to homogeneous areas.
  • Component maps: Each heatmap shows how a specific variable varies across the grid, revealing the contribution of each feature to the data structure.
  • Hybrid clustering: By applying K-Means to the SOM weight vectors, you get more robust clustering than K-Means alone, since the weights are already topologically organized.
  • Evaluation: The Adjusted Rand Index (ARI) measures the agreement between the clusters obtained and the true labels.

Essential Hyperparameters

The choice of hyperparameters profoundly influences map quality. Here is a practical guide:

Hyperparameter Role Typical values Effect of a poor choice
grid_size Grid dimensions (rows × columns) (10,10) to (50,50) depending on dataset size Too small: under-specification, merged clusters. Too large: overfitting, empty regions.
sigma Initial Gaussian neighborhood radius 1/3 to 1/2 of the grid dimension Too small: no topological organization. Too large: all neurons move together.
learning_rate Initial learning rate (η₀) 0.3 to 0.7 Too high: instability and oscillations. Too low: slow convergence, poorly organized map.
n_iter Total number of iterations 500 to 10,000 Too low: incomplete training. Too high: excessive computation time, marginal gains.

Practical Heuristics

  • Rule of thumb for grid size: grid_size ≈ 5 × √n where n is the number of samples. For 150 samples, this gives about 61 neurons, i.e. an 8×8 or 10×10 grid.
  • Initial sigma: about one-third of the largest grid dimension. For a 10×10 grid, σ₀ = 3 is a good starting point.
  • Learning rate: start with 0.5, then adjust by observing the stability of the quantization error.
  • Hexagonal vs rectangular topology: hexagonal offers better isotropy (6 neighbors instead of 4), which produces more natural cluster boundaries.

Advantages and Limitations

Advantages

  1. Topological preservation: The SOM maintains neighborhood relationships between data points, providing a coherent spatial representation that K-Means cannot provide.
  2. Intuitive visualization: Projection onto a 2D grid allows analysts to visually explore multidimensional data. The U-Matrix instantly reveals the cluster structure.
  3. Robustness to noise: The Gaussian neighborhood mechanism smooths the effects of outliers, as an outlier’s influence is diffused across multiple neurons.
  4. No distributional assumption: Unlike Gaussian models (GMM), the SOM does not assume any particular form for the data distribution.
  5. Interpretability of weights: Each neuron has a weight vector that is directly interpretable as a representative prototype of its region.

Limitations

  1. Computational cost: Finding the BMU requires computing the distance to all neurons at each iteration, i.e. O(N × M × d) where N is the number of samples, M the number of neurons, and d the dimension.
  2. Initialization dependence: Initial weights influence the final result. Multiple runs with different seeds are recommended for stability.
  3. Sensitive hyperparameters: Grid size, sigma, and learning rate must be carefully adjusted. There is no universally optimal rule.
  4. Difficulty with categorical data: The SOM relies on continuous distances. Categorical data requires prior encoding (one-hot, embedding) which can bias results.
  5. Partial non-determinism: Even with a fixed seed, the random order of sample presentation introduces variability that complicates strict reproducibility.

4 Concrete Use Cases

1. Customer Data Exploration in Marketing

A bank wants to segment its customer base using 30 behavioral variables. The SOM reveals five distinct areas on the map: young connected professionals, stable families, conservative retirees, high-yield entrepreneurs, and precarious customers. The U-Matrix clearly shows the boundaries between these segments. Marketing can then tailor offers to each zone with targeted campaigns far more effective than arbitrary business-rule segmentation.

2. Spectrum Analysis in Bioinformatics

In proteomics, the mass spectra of thousands of proteins form very high-dimensional signals. The SOM enables mapping structural similarities: proteins of similar functions spontaneously cluster together on the map, revealing functional families even in the absence of prior annotation. Researchers have used this approach to discover new families of previously unknown proteins.

3. Industrial Anomaly Detection

On a production line equipped with hundreds of sensors, the SOM learns a map of normal operation. When a new measurement point falls on a region of the map not covered during training (or very far from any known BMU), the exceptionally high quantization error signals a potential anomaly. This approach is used in the oil industry to detect drilling deviations before they become critical.

4. Organizing Textual Documents

By combining the SOM with word embeddings (Word2Vec, BERT), thousands of documents can be projected onto a 2D map where similar themes naturally cluster together. This technique is used by digital libraries and search engines to create navigable “topic maps”, allowing users to visually explore a document corpus. Unlike a rigid hierarchical classification, the SOM captures the overlapping zones between disciplines, revealing interdisciplinary fields.


See Also