PCA (Principal Component Analysis): Dimensionality Reduction

Summary

Principal Component Analysis (PCA) is one of the most fundamental and most used dimensionality reduction methods in machine learning. Its goal is simple but powerful: transform a set of potentially correlated variables into a new set of uncorrelated variables, called principal components, while preserving as much information (variance) as possible from the original data.

In practice, this can mean going from, say, a 100-dimensional space to a 10-dimensional one while preserving around 95% of the data’s variability, although the achievable reduction depends on how strongly the variables are correlated. PCA is an unsupervised method (it requires no labels) and rests on solid linear algebra foundations: eigenvalue decomposition, SVD (Singular Value Decomposition), and maximization of projected variance.

Mathematical Principle

Covariance Matrix

Let X be a centered data matrix of dimensions n × p, where n is the number of observations and p is the number of variables. The first step is to compute the covariance matrix:

S = (1 / (n − 1)) · XᵀX

This symmetric p × p matrix encodes all linear relationships between variables. The diagonal elements represent the variances of each variable, while the off-diagonal elements capture the covariances between pairs of variables.
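As a minimal sketch of this step, assuming a small synthetic data matrix (the names rng, X_raw, and X_c are illustrative and not part of the original text), the covariance matrix can be computed from the formula and compared with NumPy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))      # hypothetical data: n = 200, p = 3
X_raw[:, 1] += 0.8 * X_raw[:, 0]       # make two variables correlated

X_c = X_raw - X_raw.mean(axis=0)       # center each column
n = X_c.shape[0]
S = X_c.T @ X_c / (n - 1)              # S = X^T X / (n - 1)

# np.cov treats rows as variables unless rowvar=False is passed
print(np.allclose(S, np.cov(X_raw, rowvar=False)))
print(np.round(S, 3))
```

The diagonal of S holds the per-variable variances, and the entry for the first two variables is clearly non-zero because of the correlation injected above.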

Eigenvalue Decomposition

The theoretical core of PCA rests on the eigenvalue decomposition of the covariance matrix:

S · V = V · Λ

where:
V is the matrix whose columns are the eigenvectors of S. Each eigenvector vⱼ satisfies S · vⱼ = λⱼ · vⱼ and defines a direction in the data space: this is a principal component.
Λ is the diagonal matrix of the associated eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ ≥ 0. Each eigenvalue measures the amount of variance captured by the corresponding principal component.

The principal directions are mutually orthogonal, so the coordinates of the data along them are uncorrelated, and they are ordered by decreasing importance (explained variance).
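The decomposition can be sketched with NumPy, assuming a small synthetic centered matrix (X_c, the mixing matrix, and the variable names are illustrative). The sketch checks the two properties stated above: the eigenvectors are orthonormal, and the projected coordinates are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])        # hypothetical mixing matrix
X_c = rng.normal(size=(200, 3)) @ mixing
X_c -= X_c.mean(axis=0)                     # center the data
S = X_c.T @ X_c / (X_c.shape[0] - 1)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]           # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_c @ eigvecs                      # coordinates along the principal directions
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))                   # orthonormal directions
print(np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))   # uncorrelated scores
```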

SVD Decomposition (Singular Value Decomposition)

In practice, most modern PCA implementations use Singular Value Decomposition (SVD) directly on the matrix X, rather than explicitly computing the covariance matrix. SVD decomposes X as follows:

X = U · Σ · Vᵀ

where:
U (n × p): matrix of left singular vectors, with orthonormal columns (thin SVD, assuming n ≥ p)
Σ (p × p): diagonal matrix of singular values σ₁ ≥ σ₂ ≥ … ≥ σₚ ≥ 0
V (p × p): orthogonal matrix of right singular vectors, which are the eigenvectors of S (the principal components)

The relationship between eigenvalues and singular values is direct:

λⱼ = σⱼ² / (n − 1)

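Working on X directly avoids forming XᵀX, which squares the condition number of the problem. SVD is therefore numerically more stable than an eigendecomposition of S, especially when variables are numerous or highly correlated. This is why scikit-learn's PCA relies primarily on SVD-based solvers.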
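The relationship between singular values and eigenvalues can be checked numerically. The following is a minimal sketch on synthetic centered data (the names X_c, lam_svd, and lam_eig are illustrative), using the thin SVD so that U is n × p:

```python
import numpy as np

rng = np.random.default_rng(1)
X_c = rng.normal(size=(100, 4))
X_c -= X_c.mean(axis=0)                 # PCA assumes centered data
n = X_c.shape[0]

# Thin SVD: U is (n, p), sigma holds the p singular values, Vt is (p, p)
U, sigma, Vt = np.linalg.svd(X_c, full_matrices=False)
lam_svd = sigma**2 / (n - 1)            # eigenvalues recovered from singular values

# Same eigenvalues via the covariance matrix, sorted in descending order
lam_eig = np.linalg.eigvalsh(X_c.T @ X_c / (n - 1))[::-1]

print(np.allclose(lam_svd, lam_eig))
```

The rows of Vt are the principal directions; they match the eigenvectors of the covariance matrix up to sign.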

Explained Variance

The proportion of variance explained by the k-th principal component (the explained variance ratio) is defined as:

vₖ = λₖ / Σⱼ λⱼ

That is, the ratio of the k-th eigenvalue to the total sum of all eigenvalues. This proportion indicates what fraction of the total information is contained in each component.

The cumulative explained variance for the first m components is:

V_cumulative(m) = (Σₖ₌₁ᵐ λₖ) / (Σⱼ₌₁ᵖ λⱼ)

This criterion is essential for choosing the optimal number of components to retain.
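As a small worked example, assuming five made-up eigenvalues and a 90% threshold (both are illustrative choices, not values from the text), the number of components to retain follows directly from the cumulative ratio:

```python
import numpy as np

eigvals = np.array([4.0, 2.5, 1.0, 0.3, 0.2])   # hypothetical eigenvalues, sorted descending
ratio = eigvals / eigvals.sum()                  # share of variance per component
cumulative = np.cumsum(ratio)                    # cumulative share for the first m components

threshold = 0.90
# Index of the first cumulative value that reaches the threshold, converted to a count
m = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 4))   # [0.5, 0.8125, 0.9375, 0.975, 1.0]
print(m)                         # 3
```

Two components reach only 81.25% here, so three are needed to pass the 90% mark.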

Intuition: Finding the Best Viewing Angles

Imagine you need to photograph a three-dimensional object, but you only have a 2D camera. What is the best angle to capture the most information about the shape of the object? You would naturally choose the view that reveals the most “details,” the greatest variety of shapes and outlines.

PCA works the same way, but in p dimensions. It automatically finds the viewing angles that show the most detail: the directions in which the data exhibits the most dispersion (variance). The first principal component is the direction that maximizes this variance; the second is the orthogonal direction that captures the most remaining variance; and so on.

Instead of getting lost in hundreds of redundant variables, PCA extracts the few essential axes that summarize the behavior of your data, capturing as much information as possible with the fewest “photos.”

Let’s take a concrete example: if you measure the height, weight, waist circumference, arm length, and leg length of thousands of people, these variables are highly correlated. PCA might discover that it all essentially reduces to two dimensions: one “overall size” component and one “body proportion” component. These two synthetic axes capture the essence of the information with far fewer variables.

Python Implementation

Installation and Preparation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import StandardScaler

Basic PCA with Iris Data

# Load data
iris = load_iris()
X = iris.data  # 150 samples, 4 variables
y = iris.target

# Standardize so that every variable contributes on the same scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Display eigenvalues
print("Eigenvalues:", pca.explained_variance_)
print("Variance explained per component:", np.round(pca.explained_variance_ratio_, 4))
print("Cumulative explained variance:", np.round(np.cumsum(pca.explained_variance_ratio_), 4))

Typical output (rounded to three decimals):

Eigenvalues: [2.938, 0.920, 0.148, 0.021]
Variance explained per component: [0.730, 0.229, 0.037, 0.005]
Cumulative explained variance: [0.730, 0.958, 0.995, 1.000]

The first two components capture about 95.8% of the total variance. We can thus reduce from 4 to 2 dimensions while losing only about 4.2% of the variance.

Scree Plot — Visualizing Eigenvalues

# Scree plot
plt.figure(figsize=(8, 5))
components = np.arange(1, len(pca.explained_variance_) + 1)
plt.bar(components, pca.explained_variance_ratio_, color='steelblue', edgecolor='black')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), 'ro-', linewidth=2, markersize=8)
plt.axhline(y=0.95, color='green', linestyle='--', label='95% threshold')
plt.xlabel('Principal component')
plt.ylabel('Proportion of explained variance')
plt.title('Scree Plot — Principal Component Analysis (Iris)')
plt.xticks(components)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

The scree plot shows a clear “elbow”: the steep drop between the first components and the following ones indicates how many dimensions to retain.

2D Visualization of Projected Data

# Reduction to 2 dimensions
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(9, 6))
for label, color in zip(range(3), ['navy', 'darkorange', 'green']):
    mask = y == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1],
                c=color, edgecolor='black', s=60,
                label=iris.target_names[label], alpha=0.8)

pct0 = pca_2d.explained_variance_ratio_[0]
pct1 = pca_2d.explained_variance_ratio_[1]
plt.xlabel(f"PC1 ({pct0:.1%} variance)")
plt.ylabel(f"PC2 ({pct1:.1%} variance)")
plt.title("2D PCA Projection — Iris Dataset")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In the space of the first two principal components, Iris setosa is completely separated from the other two species, while versicolor and virginica form adjacent clusters that overlap slightly.

3D Visualization

from mpl_toolkits.mplot3d import Axes3D

pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(X_scaled)

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

for label, color in zip(range(3), ['navy', 'darkorange', 'green']):
    mask = y == label
    ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2],
               c=color, edgecolor='black', s=50,
               label=iris.target_names[label], alpha=0.8)

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("3D PCA Projection — Iris Dataset")
ax.legend()
plt.tight_layout()
plt.show()

Inverse Reconstruction

A powerful aspect of PCA is the ability to map the reduced representation back to the original space and obtain an approximate reconstruction of the data:

# Reduction to 2 dimensions
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

# Inverse reconstruction (return to 4-dimensional space)
X_reconstructed = pca_2d.inverse_transform(X_2d)

# Compute reconstruction error
mse = np.mean((X_scaled - X_reconstructed) ** 2)
variance_lost = 1 - np.sum(pca_2d.explained_variance_ratio_)
print(f"Mean squared reconstruction error: {mse:.6f}")
print(f"Total variance lost: {variance_lost:.4f} ({variance_lost*100:.1f}%)")

The reconstruction is generally lossy unless all components are retained, but with roughly 96% of the variance kept by two components, the reconstructed data stays close to the original.
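The link between reconstruction error and discarded variance can be made explicit. The sketch below assumes the same standardized Iris data and loops over the number of retained components (the names results, pca_k, and X_hat are illustrative). Because StandardScaler gives every column unit variance, the mean squared reconstruction error should coincide with the fraction of variance discarded:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

results = []
for k in range(1, 5):
    pca_k = PCA(n_components=k).fit(X_scaled)
    X_hat = pca_k.inverse_transform(pca_k.transform(X_scaled))   # back to 4 dimensions
    mse = np.mean((X_scaled - X_hat) ** 2)
    lost = 1 - pca_k.explained_variance_ratio_.sum()
    results.append((k, mse, lost))
    print(f"k={k}: MSE={mse:.4f}, variance lost={lost:.4f}")
```

On data that has not been standardized, the two quantities are still proportional, but the error is expressed in the original units rather than as a fraction.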

PCA Hyperparameters

n_components

The most important parameter. It determines the number of components retained:

  • int: fixed number of components (e.g., n_components=2 for 2D visualization)
  • float (between 0 and 1): minimum proportion of variance to preserve (e.g., n_components=0.95 to retain 95% of variance). PCA automatically determines the number of components needed.
  • "mle": uses maximum likelihood estimation (Minka, 2000) to choose the number of dimensions automatically; it requires at least as many samples as features.

# Auto-selection to preserve 95% of variance
pca_auto = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca_auto.fit_transform(X_scaled)
print(f"Components retained: {X_reduced.shape[1]} out of {X_scaled.shape[1]}")
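After fitting, the number of components actually selected is available in the n_components_ attribute. The sketch below, which assumes the standardized Iris data again, compares the variance-threshold and "mle" settings. The "mle" result is printed rather than stated, since it depends on the data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Variance threshold: keep the smallest number of components reaching 95%
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

# Maximum likelihood estimate of the dimensionality (needs n_samples >= n_features)
pca_mle = PCA(n_components="mle", svd_solver="full").fit(X_scaled)

print("95% threshold:", pca_95.n_components_)
print("MLE estimate:", pca_mle.n_components_)
```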

svd_solver

The decomposition method used:

  • "auto" (default): chosen automatically based on the data shape and n_components
  • "full": exact SVD via LAPACK, ideal for moderately sized datasets
  • "arpack": truncated SVD via ARPACK that computes only the requested components; requires n_components to be strictly less than min(n, p)
  • "randomized": fast randomized approximation for very large datasets (thousands of features)

# For very large matrices
pca_fast = PCA(n_components=50, svd_solver='randomized', random_state=42)
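To get a feel for how close the randomized approximation is to the exact solver, one option is to compare both on the digits dataset (1,797 samples, 64 features). This is only a sketch: the dataset is small enough that the randomized solver offers no real speed benefit here, and the choice of 10 components is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data

pca_full = PCA(n_components=10, svd_solver="full").fit(X_digits)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=42).fit(X_digits)

# Largest absolute gap between the two sets of explained variance ratios
max_diff = np.max(np.abs(pca_full.explained_variance_ratio_
                         - pca_rand.explained_variance_ratio_))
print(f"Largest difference in explained variance ratio: {max_diff:.2e}")
```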

whiten

The whiten=True option normalizes each principal component to have unit variance:

pca_whiten = PCA(whiten=True)

This transformation “whitens” the data: all components then share the same variance scale. It can be useful as a preprocessing step for algorithms that benefit from uniformly scaled, uncorrelated inputs, such as some neural networks or SVMs, at the cost of discarding the relative variance of the components.
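The effect of whitening can be verified directly. In this sketch, which assumes the standardized Iris data, the covariance matrix of the whitened scores should be the identity matrix:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

X_plain = PCA().fit_transform(X_scaled)              # variances equal the eigenvalues
X_white = PCA(whiten=True).fit_transform(X_scaled)   # variances rescaled to 1

print(np.round(X_plain.var(axis=0, ddof=1), 3))
print(np.round(np.cov(X_white, rowvar=False), 3))    # approximately the identity matrix
```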

tol

Tolerance for the “arpack” solver. Defines the convergence criterion:

pca_precis = PCA(n_components=10, svd_solver='arpack', tol=1e-6)

A stricter tolerance gives more precise results but can slow convergence.

Advantages and Limitations

Advantages

  • Effective noise reduction: low-variance components often capture noise rather than signal, so removing them can improve data quality.
  • Multicollinearity elimination: principal components are by construction orthogonal and uncorrelated, which solves problems of correlation between variables.
  • Visualization: reducing to 2 or 3 dimensions allows visualization of intrinsically multidimensional data.
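  • General-purpose preprocessing: by reducing the input dimension, PCA often speeds up subsequent learning algorithms and can improve them when the discarded components are mostly noise.
  • Deterministic: unlike t-SNE or UMAP, PCA with an exact solver is reproducible (same data, same results); the randomized solver needs a fixed random_state for the same guarantee.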
  • Relative interpretability: you can analyze the weights (loadings) of each variable in each component to understand what it captures.
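To illustrate the last point, the weights of each original variable can be read from the components_ attribute, where each row is one unit-length principal direction. This sketch assumes the standardized Iris data; note that some texts reserve the word “loadings” for these weights multiplied by the square root of the eigenvalue.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# One row per component, one weight per original variable
for i, component in enumerate(pca.components_, start=1):
    weights = ", ".join(f"{name}: {w:+.2f}"
                        for name, w in zip(iris.feature_names, component))
    print(f"PC{i} -> {weights}")
```

Variables with large absolute weights on a component are the ones that drive it.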

Limitations

  • Linearity: PCA only captures linear relationships between variables. For complex nonlinear structures, methods like t-SNE, UMAP, or autoencoders are more appropriate.
  • Sensitivity to outliers: since variance is based on squared deviations from the mean, extreme values can pull the principal components toward themselves.
  • Loss of direct interpretability: each principal component is a linear combination of all original variables, which sometimes makes its interpretation difficult.
  • Standardization usually required: PCA is sensitive to variable scales. Without standardization, variables measured on large scales artificially dominate the first components.
  • No guarantee of class separability: PCA maximizes variance, not discrimination between classes. For classification, LDA (Linear Discriminant Analysis) is sometimes more relevant.
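The scale sensitivity mentioned above is easy to reproduce. In this sketch, one Iris column is multiplied by 10 (a hypothetical change of unit from centimeters to millimeters; X_mm is an illustrative name), and PCA is run with and without standardization:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_mm = X.copy()
X_mm[:, 0] *= 10          # hypothetical: sepal length expressed in mm instead of cm

ratio_raw = PCA().fit(X_mm).explained_variance_ratio_
ratio_std = PCA().fit(StandardScaler().fit_transform(X_mm)).explained_variance_ratio_
ratio_ref = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("Unscaled PC1 share:    ", round(ratio_raw[0], 3))
print("Standardized PC1 share:", round(ratio_std[0], 3))
print("Unit change ignored after standardization:", np.allclose(ratio_std, ratio_ref))
```

Without standardization, the rescaled column dominates the first component; after standardization, the change of unit has no effect.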

4 Concrete Use Cases

1. Image Compression and Processing

PCA is widely used for image compression. By decomposing a set of faces into principal components (the famous Eigenfaces of Turk & Pentland, 1991), a face can be approximated by on the order of 50 to 100 coefficients instead of thousands of pixel values.

from sklearn.datasets import fetch_olivetti_faces

faces = fetch_olivetti_faces()
X_faces = faces.data  # 400 images of 64×64 pixels, already scaled to [0, 1] by the loader

pca_faces = PCA(n_components=0.95)
X_faces_reduced = pca_faces.fit_transform(X_faces)
print(f"Reduction: {X_faces.shape[1]} → {X_faces_reduced.shape[1]} dimensions")

# Visualize Eigenfaces
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca_faces.components_[i].reshape(64, 64), cmap='gray')
    ax.axis('off')
    ax.set_title(f"PC{i+1}")
plt.suptitle("First 10 principal components (Eigenfaces)")
plt.tight_layout()
plt.show()

With this approach, each 4,096-pixel image is replaced by a much shorter coefficient vector (its exact length is printed above), while the essential facial features are preserved.

2. Genomic Analysis and Bioinformatics

In genomic studies, the expression of tens of thousands of genes is often measured on a few hundred samples. PCA makes it possible to identify dominant expression patterns, detect groups of similar patients, and visualize otherwise inaccessible data.

A classic example: reducing 20,000 genes to 3 components to reveal that samples naturally separate by cancer type — without any label information.

3. Finance — Portfolio Analysis

In quantitative finance, PCA identifies the latent risk factors that explain market movements:

  • First component: overall market effect (all assets move in the same direction)
  • Second component: sector or size effect (large vs. small capitalizations)
  • Third component: additional specific dynamics

# PCA on simulated stock returns
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
# Simulation: 500 days, 30 stocks
returns = np.random.multivariate_normal(
    mean=np.zeros(30),
    cov=np.eye(30) * 0.5 + 0.3,  # moderate correlation
    size=500
)

pca_finance = PCA()
X_pca_fin = pca_finance.fit_transform(returns)

print("Variance explained by factor:")
for i in range(5):
    prop = pca_finance.explained_variance_ratio_[i]
    print(f"  Factor {i+1}: {prop:.2%}")
cum3 = np.sum(pca_finance.explained_variance_ratio_[:3])
print(f"  Cumulative 3 factors: {cum3:.2%}")

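In this simulation the covariance contains a single common factor, so the first component explains roughly 40% of the variance (9.5 / 24 in the population) and the remaining components are nearly equal noise directions. On real return data the first few factors can explain a much larger share, but the exact figure depends on the asset universe and the period studied.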
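For the simulated covariance matrix used above, the population eigenvalues can be computed directly, which shows what the sample PCA is estimating. This is a sketch for this specific matrix only (0.8 on the diagonal, 0.3 elsewhere), not a statement about real markets:

```python
import numpy as np

cov = np.eye(30) * 0.5 + 0.3                  # same matrix as in the simulation
eigvals = np.linalg.eigvalsh(cov)[::-1]       # descending order
ratio = eigvals / eigvals.sum()

print(f"Largest eigenvalue: {eigvals[0]:.2f}")                # 0.5 + 30 * 0.3 = 9.50
print(f"Other eigenvalues:  {eigvals[1]:.2f}")                # 0.50, repeated 29 times
print(f"Share of factor 1:  {ratio[0]:.2%}")                  # 9.5 / 24 = 39.58%
print(f"Share of factors 1-3: {ratio[:3].sum():.2%}")         # 10.5 / 24 = 43.75%
```

The sample estimates printed by the simulation should land close to these values, with the second and third factors slightly inflated by sampling noise.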

4. Natural Language Processing (NLP)

Although Word2Vec and transformers dominate NLP today, PCA remains useful for:

  • Reducing embedding dimension (from 768 to 128 dimensions) before using them in a lighter model
  • Denoising word vectors by removing low-variance components
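  • Visualizing the semantic space: projecting word embeddings into 2D to observe thematic clusters

# Embedding reduction for a lightweight model
import numpy as np
from sklearn.decomposition import PCA

# Random placeholder standing in for real 768-dimensional embeddings.
# Random data has no low-dimensional structure, so the preserved variance will be low here.
embeddings = np.random.randn(10000, 768)  # 10,000 words, 768 dimensions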

pca_embed = PCA(n_components=128)
embeddings_reduced = pca_embed.fit_transform(embeddings)

variance_preserved = np.sum(pca_embed.explained_variance_ratio_)
print(f"Dimensions: 768 → 128 ({variance_preserved:.1%} variance preserved)")
print(f"Memory reduction: {embeddings.nbytes / 1024:.0f} KB → {embeddings_reduced.nbytes / 1024:.0f} KB")

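This reduces embedding storage by a factor of 6 (768 / 128). On real embeddings, whose dimensions are strongly correlated, much of the variance can be preserved; the random data used above has no such structure, so its preserved-variance figure will be low.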

See Also