Local Outlier Factor (LOF): Principles, Examples and Python Implementation


Summary — The Local Outlier Factor (LOF) is an anomaly detection algorithm based on local relative density. Unlike global methods such as the Elliptic Envelope, LOF identifies anomalies by comparing the density of each point to that of its neighbors. Proposed by Breunig et al. in 2000, it excels in situations where clusters have very heterogeneous densities.


Mathematical principle

LOF is based on four nested concepts that measure the locality of each point:

1. k-distance: The k-distance of a point p is the distance to the k-th nearest neighbor. All points at a distance less than or equal to it are the neighbors of p, denoted $N_k(p)$.

2. Reachability Distance: The reachability distance of $p$ with respect to $o$ is the larger of two values: the k-distance of $o$ and the actual distance $d(p, o)$ between $p$ and $o$ (Euclidean by default):

$$reachdist_k(p, o) = \max(k\text{-distance}(o), d(p, o))$$

This trick is crucial: it smooths distances inside a cluster. Every point $p$ that lies within the k-distance of $o$ gets the same reachability distance, namely the k-distance of $o$, which makes the density comparison less sensitive to small random fluctuations in distance.

3. Local Reachability Density (LRD): The local density of a point $p$ is the inverse of the average reachability distance from $p$ to its $k$ neighbors:

$$LRD(p) = \frac{1}{\frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} reachdist_k(p, o)}$$

A high LRD means that $p$ is in a dense area (its neighbors are close), while a low LRD indicates an isolated environment.

4. Local Outlier Factor: The final LOF score is the average ratio between the LRD of the neighbors and the LRD of the point itself:

$$LOF(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{LRD(o)}{LRD(p)}$$

Interpretation:
– $LOF \approx 1$: the point has a density comparable to its neighbors → normal
– $LOF \gg 1$: the point is in a much less dense area than its neighbors → anomaly
– $LOF < 1$: the point is in a denser area than its neighbors → cluster core
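The four definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name `lof_from_scratch` and the toy data are assumptions made for the example, and ties at the k-distance are ignored (exactly k neighbors are kept per point). The result is compared with scikit-learn's `negative_outlier_factor_`, which stores the opposite of the LOF.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors


def lof_from_scratch(X, k):
    """Hypothetical helper: return (k_distance, lrd, lof) for every row of X."""
    # Step 1: k nearest neighbors of each point. Calling kneighbors() without
    # arguments excludes the query point itself from its own neighbor list.
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nn.kneighbors()
    k_distance = dist[:, -1]  # distance to the k-th nearest neighbor

    # Step 2: reachdist_k(p, o) = max(k-distance(o), d(p, o)) for each neighbor o
    reach = np.maximum(k_distance[idx], dist)

    # Step 3: LRD(p) = inverse of the mean reachability distance to the neighbors
    lrd = 1.0 / reach.mean(axis=1)

    # Step 4: LOF(p) = mean of LRD(o) / LRD(p) over the neighbors o
    lof = lrd[idx].mean(axis=1) / lrd
    return k_distance, lrd, lof


rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),  # dense cluster
    rng.normal(4.0, 1.0, size=(50, 2)),   # sparse cluster
    [[10.0, -10.0]],                      # isolated point (last row)
])

k_distance, lrd, lof = lof_from_scratch(X, k=10)
sk_lof = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_

print("Matches scikit-learn:", np.allclose(lof, sk_lof))
print("LOF of the isolated point:", round(float(lof[-1]), 2))
```

The isolated point ends up with a low LRD and an LOF well above 1, while most points inside either cluster stay close to 1 regardless of how dense their cluster is.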


Intuition

Imagine you are trying to find your way in a big city. Downtown, everyone is packed together and the density of people is very high. In the suburbs, people are more spread out.

Now imagine a Parisian in the middle of the Sahara. He is not abnormal because he is alone — he is abnormal because the density around him is radically different from what it should be for him.

Conversely, imagine a Bedouin in the middle of the Sahara. He will not be detected as an anomaly, because the density around him is typical of the place.

This is the full power of LOF: it does not measure absolute isolation (as Isolation Forest would), but relative isolation to the local context. A point in the middle of a dense cluster is not suspicious, but a point at the edge of a cluster, where the density drops sharply, is.

Comparison with other methods:
– Isolation Forest: detects anomalies by ease of isolation; efficient globally but can miss local anomalies.
– One-Class SVM: learns a single global boundary; poorly suited to clusters of varying densities.
– DBSCAN: labels points as noise using one global density threshold; does not provide a graduated score.
– LOF: measures local relative density; of the four, the one designed to catch subtle local anomalies as well as global ones.
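To make the difference between absolute and relative isolation concrete, here is a small deterministic sketch. It uses the distance to the k-th nearest neighbor as a stand-in for a "global" score; that choice, the grid data and the position of the outlier are assumptions made for the illustration, not part of the comparison above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

# Two regular 10 x 10 grids: one tightly packed, one widely spaced
g = np.arange(10)
xx, yy = np.meshgrid(g, g)
grid = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

dense = grid * 0.1           # spacing 0.1, occupies [0, 0.9] x [0, 0.9]
sparse = grid * 1.0 + 5.0    # spacing 1.0, occupies [5, 14] x [5, 14]
local_outlier = np.array([[1.4, 0.45]])  # 0.5 away from the dense grid
X = np.vstack([dense, sparse, local_outlier])

k = 5
out = len(X) - 1             # index of the local outlier
sparse_idx = slice(100, 200)

# "Global" view: raw distance to the k-th nearest neighbor
knn_dist = NearestNeighbors(n_neighbors=k).fit(X).kneighbors()[0][:, -1]

# Local view: LOF (scikit-learn stores the opposite of the score)
lof = -LocalOutlierFactor(n_neighbors=k).fit(X).negative_outlier_factor_

print("k-NN distance, outlier:", round(float(knn_dist[out]), 3))
print("k-NN distance, smallest in sparse grid:", round(float(knn_dist[sparse_idx].min()), 3))
print("Index with the highest LOF:", int(lof.argmax()), "(outlier index:", out, ")")
```

Measured by raw distance, the outlier looks less isolated than every point of the sparse grid, so a single global threshold would either miss it or flag the whole sparse grid. LOF compares it with its dense neighbors and ranks it first.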


Python implementation

Example 1: Basic anomaly detection

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# Fix the seed so the example is reproducible
np.random.seed(42)

# Data with clusters of different densities
X1 = np.random.randn(300, 2) * 0.3          # dense cluster
X2 = np.random.randn(100, 2) * 0.8 + 3      # sparse cluster
X3 = np.array([[6, 1], [6.2, 0.8], [5.8, 1.2]])  # isolated anomalies
X = np.vstack([X1, X2, X3])

# LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination='auto')
labels = lof.fit_predict(X)  # -1 = anomaly, 1 = normal
scores = lof.negative_outlier_factor_  # native scores (more negative = more anomalous)

# Visualization
fig, ax = plt.subplots(figsize=(8, 6))
is_anomaly = labels == -1
# Plot the two groups separately so that each gets its own legend entry
ax.scatter(X[~is_anomaly, 0], X[~is_anomaly, 1], c='steelblue', s=30,
           alpha=0.7, label='Normal')
ax.scatter(X[is_anomaly, 0], X[is_anomaly, 1], c='red', s=150,
           alpha=0.7, edgecolors='darkred', label='Anomaly')
ax.set_title('LOF: Local anomaly detection')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.legend()
plt.tight_layout()
plt.savefig('lof_detection.png', dpi=150)
print(f"Anomalies detected: {(labels == -1).sum()} out of {len(labels)} points")
print(f"LOF scores min: {scores.min():.3f}, max: {scores.max():.3f}")

Example 2: Novelty detection mode (detecting new anomalies)

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# make_classification produces two classes; for this illustration we treat
# class 1 as "normal" and class 0 as the anomalies (an arbitrary choice)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training in novelty mode: fit only on the "normal" class
# novelty=True is required to call predict() and score_samples() on new data
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True, contamination=0.1)
lof_novelty.fit(X_train[y_train == 1])  # learns the distribution of normal data

# Detection on test data
predictions = lof_novelty.predict(X_test)        # -1 = anomaly, 1 = normal
scores_test = lof_novelty.score_samples(X_test)  # continuous score, higher = more normal

# Evaluation: higher scores should correspond to the "normal" class
print(f"AUC-ROC: {roc_auc_score(y_test == 1, scores_test):.3f}")

# Number of test points flagged as anomalies
n_detected = (predictions == -1).sum()
n_total = len(predictions)
print(f"Anomalies detected in novelty mode: {n_detected}/{n_total}")
print(f"Score range: [{scores_test.min():.3f}, {scores_test.max():.3f}]")

Example 3: Continuous LOF score visualization

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

# Two Gaussian clusters with different spreads
np.random.seed(42)
X_train = np.vstack([
    np.random.randn(150, 2) * 0.4,
    np.random.randn(80, 2) * 0.6 + [3, 2],
])

# LOF
lof = LocalOutlierFactor(n_neighbors=15, contamination=0.05)
lof.fit_predict(X_train)
scores = lof.negative_outlier_factor_

# Sort by score and visualize
sorted_idx = np.argsort(scores)
plt.figure(figsize=(10, 3))
plt.bar(range(len(scores)), scores[sorted_idx], color='steelblue')
plt.axhline(y=np.mean(scores), color='red', linestyle='--', label='Mean')
plt.title('Distribution of LOF scores (more negative = more anomalous)')
plt.xlabel('Points (sorted by score)')
plt.ylabel('LOF score (negative_outlier_factor_)')
plt.legend()
plt.tight_layout()
plt.savefig('lof_scores_distribution.png', dpi=150)
print("LOF score visualization saved")

Hyperparameters

| Hyperparameter | Default value | Description | Recommendation |
|---|---|---|---|
| n_neighbors | 20 | Number of neighbors for estimating local density | 10-30; low = fine local detection, high = more global |
| contamination | 'auto' | Expected proportion of anomalies | 'auto' by default; set if anomaly rate is known |
| metric | 'minkowski' | Distance metric between points | 'euclidean', 'manhattan', or 'cosine' for text |
| p | 2 | Minkowski metric parameter | 1 = Manhattan, 2 = Euclidean |
| novelty | False | If True, allows calling predict() on new data | True for production, False for exploration |
| algorithm | 'auto' | Neighbor search algorithm | 'auto' chooses between 'brute', 'kd_tree', 'ball_tree' |
| leaf_size | 30 | Leaf size for the tree-based searches | Affects build and query speed and memory use, not the scores |
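The effect of contamination is easiest to see through the fitted `offset_` attribute, the threshold applied to `negative_outlier_factor_`. The sketch below uses assumed synthetic data: with 'auto' the threshold is fixed at -1.5, whereas a numeric value sets it so that roughly that share of the training points is flagged.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # assumed toy data: one Gaussian blob

# contamination='auto': fixed threshold on negative_outlier_factor_
auto = LocalOutlierFactor(n_neighbors=20, contamination='auto')
labels_auto = auto.fit_predict(X)

# contamination=0.1: threshold chosen so that about 10% of points are flagged
fixed = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels_fixed = fixed.fit_predict(X)

print("offset_ with 'auto':", auto.offset_)
print("offset_ with 0.1:   ", round(float(fixed.offset_), 3))
print("share flagged with 'auto':", (labels_auto == -1).mean())
print("share flagged with 0.1:   ", (labels_fixed == -1).mean())
```

The scores themselves are identical in both cases; contamination only moves the cut-off that turns them into -1 / 1 labels.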

Advantages of LOF

  1. Local detection: Identifies anomalies in varying density contexts, where global methods fail.
  2. Continuous score: Unlike DBSCAN which simply assigns “noise” or “cluster”, LOF provides a graduated score that allows prioritizing investigations.
  3. No distribution assumption: No assumption about the shape or distribution of the data — LOF works with clusters of any geometry.
  4. Interpretability: An LOF score of 2.5 reads roughly as “on average, the neighbors of this point sit in an area 2.5 times denser than its own”, a statement that business stakeholders can follow without statistical training.
  5. Metric flexibility: Support for numerous distance metrics (Euclidean, Manhattan, cosine, Jaccard), adaptable to data type.
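As a sketch of how the continuous score supports prioritization, the example below ranks points from most to least anomalous and reads the top scores as density ratios. The data (one Gaussian cluster plus three far-away points) is an assumption made for the illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, size=(200, 2))
far_points = np.array([[20.0, 20.0], [-25.0, 20.0], [25.0, -25.0]])  # rows 200-202
X = np.vstack([cluster, far_points])

lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_scores = -lof.negative_outlier_factor_   # back to the usual "LOF >= ~1" scale

# Indices sorted from most to least anomalous
ranking = np.argsort(lof_scores)[::-1]
top3 = ranking[:3]

for i in top3:
    print(f"point {i}: LOF = {lof_scores[i]:.1f} "
          f"(neighbors are about {lof_scores[i]:.1f}x denser on average)")
```

An analyst can then review the list from the top and stop wherever the scores become unremarkable, instead of relying on a single binary threshold.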

Limitations of LOF

  1. Computational cost: $O(n^2)$ in naive implementation, since pairwise distances must be computed. Approximations exist but sacrifice precision.
  2. Critical choice of k: The n_neighbors parameter strongly influences results. Too small a k gives unstable scores; too large a k dilutes local anomalies.
  3. Poor in very high dimensions: Like all distance-based methods, LOF suffers from the curse of dimensionality. Combining with a PCA upstream is often necessary.
  4. No out-of-sample scoring by default: With novelty=False (the default), scikit-learn's LocalOutlierFactor only offers fit_predict(), so new data cannot be evaluated without recomputing on the entire set. Even with novelty=True, computation remains costly on large datasets.
  5. Parameter sensitivity: Unlike Isolation Forest which is relatively robust, LOF requires careful tuning of n_neighbors and contamination for reliable results.
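The out-of-sample limitation can be checked directly on the scikit-learn estimator. A minimal sketch, with assumed toy data: in the default mode `predict` is not exposed at all, and in novelty mode `fit_predict` is the method that disappears.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
X_new = np.array([[0.1, -0.2],   # close to the training cloud
                  [6.0, 6.0]])   # far from it

outlier_mode = LocalOutlierFactor(n_neighbors=20)                # novelty=False
novelty_mode = LocalOutlierFactor(n_neighbors=20, novelty=True)

# Each mode hides the method that does not apply to it
print("outlier mode has predict():    ", hasattr(outlier_mode, "predict"))
print("novelty mode has fit_predict():", hasattr(novelty_mode, "fit_predict"))

# Only the novelty estimator can score points it has not seen during fit
novelty_mode.fit(X_train)
new_labels = novelty_mode.predict(X_new)        # -1 = anomaly, 1 = normal
new_scores = novelty_mode.score_samples(X_new)  # higher = more normal
print(new_labels, new_scores)
```

Every call to `predict` or `score_samples` still runs a nearest-neighbor query against the stored training set, which is where the cost on large datasets comes from.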

4 concrete use cases

1. Fraud detection in banking

In financial transactions, fraudsters constantly adapt their techniques to resemble normal behavior. LOF excels here because it detects local anomalies: a transaction that seems normal in absolute terms (average amount, usual time) but is atypical compared to the customer’s similar transactions. For example, a €500 electronics purchase may be normal for a customer who makes them regularly, but abnormal for a customer who usually only buys food products under €50.

2. Industrial equipment monitoring (predictive maintenance)

On oil platforms, sensors monitor vibration, temperature, pressure, etc. A slight rise in vibration may be normal in itself, but if it occurs in a context where all other measurements are at their minimum, LOF flags it as suspicious. This early detection allows intervention before complete breakdown, avoiding costly production shutdowns.

3. Rare disease detection in healthcare

In laboratory result analysis, a patient may have individual values all within the statistical “normal range”, but their combination is unusual. LOF detects this type of multidimensional anomaly — a profile where each measurement is normal in isolation but whose combination is suspicious. This is particularly useful for early screening of rare pathologies or drug interactions.

4. Cybersecurity: network intrusion detection

In a corporate network, “normal” connections vary by department, time of day, and user. LOF can detect behaviors that would be within the network’s global norm but abnormal for a specific context. For example, an access to the accounting server at 3 a.m. by a developer is a LOF anomaly: even if server access is not prohibited, it is unusual for that user’s profile at that time.


Best practices for LOF

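  • Normalize the data: Like all distance-based methods, LOF is sensitive to feature scales. Standardizing the features (for example with StandardScaler) is usually necessary when they are measured in different units.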
  • Find the right k: Test several values of n_neighbors (10, 20, 30, 50) and compare the stability of scores. Visualize the distribution of LOF scores for each value.
  • Dimensionality reduction upstream: For data with more than 20-30 dimensions, a prior PCA or UMAP significantly improves the relevance of LOF results.
  • Combine with Isolation Forest: Use both algorithms and only retain as anomalies points detected by both — this reduces false positives.
  • Use novelty=True in production: For a deployment where new data arrives continuously, novelty mode allows scoring new requests without recomputing the entire model.
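Several of these practices can be combined in one short sketch: scaling inside a pipeline, novelty mode for new data, and keeping only the points that both LOF and Isolation Forest flag. The feature meanings, values and the fixed `random_state` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical features on very different scales
X_train = np.column_stack([
    rng.normal(100.0, 20.0, 500),  # e.g. an amount
    rng.normal(0.5, 0.1, 500),     # e.g. a ratio
])
X_new = np.array([
    [105.0, 0.52],  # typical on both features
    [100.0, 3.0],   # typical amount, extreme ratio
])

# Scaling is part of the pipeline, so new data is scaled the same way
lof_pipe = make_pipeline(StandardScaler(),
                         LocalOutlierFactor(n_neighbors=20, novelty=True))
iso_pipe = make_pipeline(StandardScaler(),
                         IsolationForest(random_state=0))

lof_pipe.fit(X_train)
iso_pipe.fit(X_train)

lof_pred = lof_pipe.predict(X_new)  # -1 = anomaly, 1 = normal
iso_pred = iso_pipe.predict(X_new)

# Keep only the points flagged by both detectors
flagged_by_both = (lof_pred == -1) & (iso_pred == -1)
print("LOF:", lof_pred, "Isolation Forest:", iso_pred, "both:", flagged_by_both)
```

Requiring agreement trades recall for precision: a purely local anomaly that only LOF sees would be dropped, so this rule fits cases where false alarms are the main cost.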

Conclusion

The Local Outlier Factor is one of the more nuanced anomaly detection algorithms in the data scientist’s toolbox. By measuring local relative density rather than absolute distance, it captures subtle anomalies that global methods can miss.

Its main drawback — computational complexity — is largely offset by its ability to work on heterogeneous density data and its continuous score output, which allows for much richer analysis than a simple abnormal/normal binary.

To choose between the main anomaly methods:
Isolation Forest: fast, robust, good default first choice.
One-Class SVM: when you want a precise boundary and the data is well separable.
Elliptic Envelope: when the normal data approximately follows a Gaussian distribution.
LOF: when clusters have very varying densities and you want to detect subtle local anomalies.

