Principal Component Analysis (PCA): Complete Guide
Summary
Principal Component Analysis (PCA) is one of the most fundamental and most used dimensionality reduction methods in machine learning. Its goal is simple but powerful: transform a set of potentially correlated variables into a new set of uncorrelated variables, called principal components, while preserving as much information (variance) as possible from the original data.
In practice, this can mean going from a 100-dimensional space to a 10-dimensional space while preserving, say, 95% of the data’s variability; the achievable reduction depends on how strongly the variables are correlated. PCA is an unsupervised method (it requires no labels) and rests on solid linear algebra foundations: eigenvalue decomposition, Singular Value Decomposition (SVD), and maximization of projected variance.
Mathematical Principle
Covariance Matrix
Let X be a centered data matrix of dimensions n × p, where n is the number of observations and p is the number of variables. The first step is to compute the covariance matrix:
S = (1 / (n − 1)) · XᵀX
This symmetric p × p matrix encodes all linear relationships between variables. The diagonal elements represent the variances of each variable, while the off-diagonal elements capture the covariances between pairs of variables.
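As a minimal sketch of the formula above, the covariance matrix can be computed by hand and compared with NumPy's own estimate. The data here is randomly generated for illustration only and is not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations, 3 variables, two of them correlated
raw = rng.normal(size=(200, 3))
raw[:, 1] += 0.8 * raw[:, 0]

# Center each column, then apply S = XᵀX / (n − 1)
X = raw - raw.mean(axis=0)
n = X.shape[0]
S = X.T @ X / (n - 1)

# np.cov treats rows as variables by default, hence rowvar=False
print("matches np.cov:", np.allclose(S, np.cov(raw, rowvar=False)))
print("symmetric:", np.allclose(S, S.T))
print("variances on the diagonal:", np.round(np.diag(S), 3))
```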
Eigenvalue Decomposition
The theoretical core of PCA rests on the eigenvalue decomposition of the covariance matrix:
S · V = V · Λ
where:
– V is the p × p matrix whose columns are the eigenvectors of S. Each eigenvector defines a direction in the data space: a principal axis, along which the projected data form a principal component.
– Λ is the diagonal matrix of the associated eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ ≥ 0. Column by column, the equation reads S · vⱼ = λⱼ · vⱼ. Each eigenvalue measures the amount of variance captured by the corresponding principal component.
The principal directions are orthogonal to each other, the resulting component scores are uncorrelated, and the components are ordered by decreasing importance (explained variance).
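The properties just stated can be checked numerically. The sketch below uses randomly generated data (an assumption for illustration) and `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so they are re-sorted here.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data with correlated columns
raw = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = raw - raw.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)

eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # decreasing eigenvalues
eigvals, V = eigvals[order], V[:, order]

scores = X @ V                              # data projected onto the eigenvectors
S_scores = np.cov(scores, rowvar=False)

print("orthonormal directions:", np.allclose(V.T @ V, np.eye(4)))
print("uncorrelated scores:   ", np.allclose(S_scores, np.diag(eigvals)))
```

The covariance matrix of the scores is diagonal, and its diagonal holds the eigenvalues: each component's variance is exactly its eigenvalue.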
SVD Decomposition (Singular Value Decomposition)
In practice, most modern PCA implementations use Singular Value Decomposition (SVD) directly on the matrix X, rather than explicitly computing the covariance matrix. SVD decomposes X as follows:
X = U · Σ · Vᵀ
where:
– U (n × p): matrix of left singular vectors, with orthonormal columns (thin SVD, assuming n ≥ p)
– Σ (p × p): diagonal matrix of singular values σ₁ ≥ σ₂ ≥ … ≥ σₚ ≥ 0
– V (p × p): orthogonal matrix of right singular vectors, which are the eigenvectors of S (the principal directions)
The relationship between eigenvalues and singular values is direct:
λⱼ = σⱼ² / (n − 1)
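SVD is numerically more stable than explicitly forming S and computing its eigendecomposition, because forming XᵀX squares the condition number of the problem; this matters most when variables are numerous or highly correlated. This is why scikit-learn’s PCA has traditionally relied on SVD-based solvers.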
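The relationship λⱼ = σⱼ² / (n − 1) can be verified with a short sketch. The data is random and purely illustrative; the comparison with scikit-learn uses `svd_solver="full"` to request the exact solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
raw = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))
X = raw - raw.mean(axis=0)
n = X.shape[0]

# Thin SVD: U is n × p, sigma holds the p singular values, Vt is p × p
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
eig_from_svd = sigma**2 / (n - 1)

# Eigenvalues of the covariance matrix, sorted in decreasing order
eig_from_cov = np.sort(np.linalg.eigvalsh(X.T @ X / (n - 1)))[::-1]

# PCA centers the data itself, so the uncentered matrix is passed in
pca = PCA(svd_solver="full").fit(raw)

print("SVD vs covariance eigenvalues:", np.allclose(eig_from_svd, eig_from_cov))
print("SVD vs scikit-learn:          ", np.allclose(eig_from_svd, pca.explained_variance_))
```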
Explained Variance
The proportion of variance explained by the k-th principal component (the explained variance ratio) is defined as:
vₖ = λₖ / Σⱼ λⱼ
That is, the ratio of the k-th eigenvalue to the total sum of all eigenvalues. This proportion indicates what fraction of the total information is contained in each component.
The cumulative explained variance for the first m components is:
V_cumulative = Σₖ₌₁ᵐ λₖ / Σⱼ₌₁ᵖ λⱼ
This criterion is essential for choosing the optimal number of components to retain.
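One way to apply the cumulative criterion is sketched below. The helper name `components_for_threshold` and the eigenvalues in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold=0.95):
    """Return the smallest m whose cumulative explained variance
    reaches `threshold`, together with the cumulative ratios."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative ratio that is >= threshold
    m = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding error when threshold is 1.0
    return min(m, len(eigenvalues)), cumulative

m, cumulative = components_for_threshold([4.0, 2.0, 1.0, 0.5, 0.5], threshold=0.85)
print(m, cumulative)
```

With these eigenvalues the cumulative ratios are 0.5, 0.75, 0.875, 0.9375 and 1.0, so three components are needed to reach 85%.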
Intuition: Finding the Best Viewing Angles
Imagine you need to photograph a three-dimensional object, but you only have a 2D camera. What is the best angle to capture the most information about the shape of the object? You would naturally choose the view that reveals the most “details,” the greatest variety of shapes and outlines.
PCA works in the same way, but in p dimensions. It automatically finds the viewing angles that show the most detail: the directions in which the data exhibits the most dispersion (variance). The first principal component is the direction that maximizes this variance; the second is the orthogonal direction that captures the most remaining variance; and so on.
Like photographing a 3D object: PCA automatically finds the best angles that capture the maximum information with the fewest photos. Instead of getting lost in hundreds of redundant variables, it extracts the few essential axes that summarize the behavior of your data.
Let’s take a concrete example: if you measure the height, weight, waist circumference, arm length, and leg length of thousands of people, these variables are highly correlated. PCA might discover that it all essentially reduces to two dimensions: one “overall size” component and one “body proportion” component. These two synthetic axes capture the essence of the information with far fewer variables.
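The idea that the first principal component is the direction of maximum variance can be checked empirically. The sketch below, on made-up 3D data, compares the variance along the first component with the variance along 1,000 random directions; none of them should do better.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 3D point cloud, stretched unevenly along different axes
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
raw = rng.normal(size=(500, 3)) @ mixing
X = raw - raw.mean(axis=0)
S = np.cov(X, rowvar=False)

eigvals, V = np.linalg.eigh(S)   # ascending order
pc1 = V[:, -1]                   # eigenvector of the largest eigenvalue

# Variance of the data projected onto 1,000 random unit directions
directions = rng.normal(size=(1000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
projected_var = np.var(X @ directions.T, axis=0, ddof=1)

print(f"variance along PC1:              {np.var(X @ pc1, ddof=1):.3f}")
print(f"best of 1,000 random directions: {projected_var.max():.3f}")
```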
Python Implementation
Installation and Preparation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import StandardScaler
Basic PCA with Iris Data
# Load data
iris = load_iris()
X = iris.data # 150 samples, 4 variables
y = iris.target
# Standardization is essential for PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Display eigenvalues
print("Eigenvalues:", pca.explained_variance_)
print("Variance explained per component:", np.round(pca.explained_variance_ratio_, 4))
print("Cumulative explained variance:", np.round(np.cumsum(pca.explained_variance_ratio_), 4))
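Typical output (eigenvalues rounded):
Eigenvalues: [2.938 0.920 0.148 0.021]
Variance explained per component: [0.7296 0.2285 0.0367 0.0052]
Cumulative explained variance: [0.7296 0.9581 0.9948 1.    ]
The eigenvalues sum to slightly more than 4 because StandardScaler divides by n while explained_variance_ divides by n − 1. The first two components capture about 95.8% of the total variance. We can thus reduce from 4 to 2 dimensions while losing only about 4.2% of the variance.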
Scree Plot — Visualizing Eigenvalues
# Scree plot
plt.figure(figsize=(8, 5))
components = np.arange(1, len(pca.explained_variance_) + 1)
plt.bar(components, pca.explained_variance_ratio_, color='steelblue', edgecolor='black')
plt.plot(components, np.cumsum(pca.explained_variance_ratio_), 'ro-', linewidth=2, markersize=8)
plt.axhline(y=0.95, color='green', linestyle='--', label='95% threshold')
plt.xlabel('Principal component')
plt.ylabel('Proportion of explained variance')
plt.title('Scree Plot — Principal Component Analysis (Iris)')
plt.xticks(components)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()
The scree plot shows a clear “elbow”: the steep drop between the first components and the following ones indicates how many dimensions to retain.
2D Visualization of Projected Data
# Reduction to 2 dimensions
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
plt.figure(figsize=(9, 6))
for label, color in zip(range(3), ['navy', 'darkorange', 'green']):
    mask = y == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1],
                c=color, edgecolor='black', s=60,
                label=iris.target_names[label], alpha=0.8)
pct0 = pca_2d.explained_variance_ratio_[0]
pct1 = pca_2d.explained_variance_ratio_[1]
plt.xlabel(f"PC1 ({pct0:.1%} variance)")
plt.ylabel(f"PC2 ({pct1:.1%} variance)")
plt.title("2D PCA Projection — Iris Dataset")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
We can observe that Iris setosa is clearly separated from the other two species in the space of the first two principal components, while Iris versicolor and Iris virginica lie close together and partially overlap.
3D Visualization
from mpl_toolkits.mplot3d import Axes3D
pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(X_scaled)
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
for label, color in zip(range(3), ['navy', 'darkorange', 'green']):
    mask = y == label
    ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2],
               c=color, edgecolor='black', s=50,
               label=iris.target_names[label], alpha=0.8)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
ax.set_title("3D PCA Projection — Iris Dataset")
ax.legend()
plt.tight_layout()
plt.show()
Inverse Reconstruction
A useful property of PCA is that the original data can be approximately reconstructed from the reduced components:
# Reduction to 2 dimensions
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Inverse reconstruction (return to 4-dimensional space)
X_reconstructed = pca_2d.inverse_transform(X_2d)
# Compute reconstruction error
mse = np.mean((X_scaled - X_reconstructed) ** 2)
variance_lost = 1 - np.sum(pca_2d.explained_variance_ratio_)
print(f"Mean squared reconstruction error: {mse:.6f}")
print(f"Total variance lost: {variance_lost:.4f} ({variance_lost*100:.1f}%)")
The reconstruction is never perfect (unless all components are retained), but with roughly 96% of the variance retained, the reconstructed data stays close to the original: the squared error corresponds to the variance carried by the discarded components.
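The link between reconstruction error and discarded variance can be made explicit. The following sketch recomputes the mean squared error on the standardized Iris data and compares it with the sum of the discarded eigenvalues; the factor (n − 1) / (n · p) accounts for scikit-learn's n − 1 convention and for averaging over all n · p entries.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
n, p = X_scaled.shape

pca_full = PCA().fit(X_scaled)
pca_2d = PCA(n_components=2).fit(X_scaled)
X_rec = pca_2d.inverse_transform(pca_2d.transform(X_scaled))

mse = np.mean((X_scaled - X_rec) ** 2)

# Sum of the eigenvalues of the discarded components (3rd and 4th)
discarded = pca_full.explained_variance_[2:].sum()
mse_from_eigenvalues = discarded * (n - 1) / (n * p)

print(f"MSE measured:         {mse:.6f}")
print(f"MSE from eigenvalues: {mse_from_eigenvalues:.6f}")
```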
PCA Hyperparameters
n_components
The most important parameter. It determines the number of components retained:
- int: fixed number of components (e.g., n_components=2 for 2D visualization)
- float (between 0 and 1): minimum proportion of variance to preserve (e.g., n_components=0.95 to retain 95% of variance). PCA automatically determines the number of components needed.
- “mle”: uses maximum likelihood estimation (Minka, 2000) to automatically choose the number of dimensions.
# Auto-selection to preserve 95% of variance
pca_auto = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca_auto.fit_transform(X_scaled)
print(f"Components retained: {X_reduced.shape[1]} out of {X_scaled.shape[1]}")
svd_solver
The decomposition method used:
- “auto” (default): automatically chosen based on data size
- “full”: classical SVD (LAPACK), ideal for moderately sized datasets
- “arpack”: uses ARPACK to compute only the leading components (n_components must be strictly smaller than min(n_samples, n_features)), useful for large datasets
- “randomized”: fast stochastic approximation for very large datasets (thousands of features)
# For very large matrices
pca_fast = PCA(n_components=50, svd_solver='randomized', random_state=42)
whiten
The whiten=True option normalizes each principal component to have unit variance:
pca_whiten = PCA(whiten=True)
This transformation “whitens” the data — all components then have the same variance scale. Useful as a preprocessing step for certain algorithms like neural networks or SVMs, which benefit from uniformly scaled data.
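The effect of whitening can be seen by comparing the variance of each component with and without the option. This is a small sketch on the standardized Iris data; `ddof=1` is used so that the variance matches the n − 1 convention of `explained_variance_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

scores_plain = PCA().fit_transform(X_scaled)
scores_white = PCA(whiten=True).fit_transform(X_scaled)

var_plain = np.var(scores_plain, axis=0, ddof=1)
var_white = np.var(scores_white, axis=0, ddof=1)

print("without whitening:", np.round(var_plain, 3))  # equal to the eigenvalues
print("with whitening:   ", np.round(var_white, 3))  # all close to 1
```

Whitening discards the relative scale of the components, so it should be avoided when that scale is meaningful for the downstream step.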
tol
Tolerance for the “arpack” solver. Defines the convergence criterion:
pca_precis = PCA(n_components=10, svd_solver='arpack', tol=1e-6)
A stricter tolerance gives more precise results but can slow convergence.
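To get a feel for how the solvers compare, the sketch below fits the first 10 components of the bundled digits dataset with three solvers and reports the largest relative difference in explained variance against the exact "full" solver. The helper `max_rel_diff` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 1,797 samples, 64 features

full = PCA(n_components=10, svd_solver="full").fit(X)
arpack = PCA(n_components=10, svd_solver="arpack", tol=1e-6,
             random_state=0).fit(X)
randomized = PCA(n_components=10, svd_solver="randomized",
                 random_state=42).fit(X)

def max_rel_diff(a, b):
    """Largest relative difference between two positive arrays."""
    return float(np.max(np.abs(a - b) / np.abs(b)))

diff_arpack = max_rel_diff(arpack.explained_variance_, full.explained_variance_)
diff_random = max_rel_diff(randomized.explained_variance_, full.explained_variance_)

print(f"arpack vs full:     {diff_arpack:.2e}")
print(f"randomized vs full: {diff_random:.2e}")
```

On a dataset this small all three solvers are fast; the approximate solvers pay off mainly when the matrix is large and only a few components are needed.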
Advantages and Limitations
Advantages
- Effective noise reduction: low-variance components often capture noise rather than signal. Removing them improves data quality.
- Multicollinearity elimination: principal components are by construction orthogonal and uncorrelated, which solves problems of correlation between variables.
- Visualization: reducing to 2 or 3 dimensions allows visualization of intrinsically multidimensional data.
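- General-purpose preprocessing: by reducing the input dimension, PCA can speed up many downstream learning algorithms and sometimes improves their accuracy, although it can also discard useful signal.
- Deterministic: unlike t-SNE or UMAP, PCA with an exact solver is reproducible: the same data gives the same components, up to the sign of each component. The randomized solver needs a fixed random_state for repeatable results.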
- Relative interpretability: you can analyze the weights (loadings) of each variable in each component to understand what it captures.
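Two of the points above, interpretability through the component weights and the absence of correlation between components, can be illustrated on the standardized Iris data. This is a sketch for inspection purposes, not a full interpretation workflow.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ is a unit-length direction; its entries are the
# weights of the original variables in that component
for i, weights in enumerate(pca.components_, start=1):
    summary = ", ".join(f"{name}: {w:+.2f}"
                        for name, w in zip(iris.feature_names, weights))
    print(f"PC{i} -> {summary}")

# Correlation between the two score columns is zero up to rounding error
scores = pca.transform(X_scaled)
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 6))
```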
Limitations
- Linearity: PCA only captures linear relationships between variables. For complex nonlinear structures, methods like t-SNE, UMAP, or autoencoders are more appropriate.
- Sensitivity to outliers: since variance is based on quadratic means, extreme values can bias the principal components.
- Loss of direct interpretability: each principal component is a linear combination of all original variables, which sometimes makes its interpretation difficult.
- Standardization required: PCA is sensitive to variable scales. Without normalization, variables with large scales artificially dominate.
- No guarantee of class separability: PCA maximizes variance, not discrimination between classes. For classification, LDA (Linear Discriminant Analysis) is sometimes more relevant.
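The sensitivity to variable scales is easy to reproduce. In the sketch below, three independent synthetic variables carry the same amount of information, but the first is expressed in units 1,000 times larger; without standardization it takes over the first component almost entirely.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
X[:, 0] *= 1000.0  # first variable expressed in much larger units

raw_pca = PCA().fit(X)
scaled_pca = PCA().fit(StandardScaler().fit_transform(X))

print("raw PC1 direction:      ", np.round(raw_pca.components_[0], 3))
print("raw variance ratios:    ", np.round(raw_pca.explained_variance_ratio_, 4))
print("scaled variance ratios: ", np.round(scaled_pca.explained_variance_ratio_, 3))
```

After standardization the three ratios are close to one third each, which reflects the fact that the variables are independent.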
4 Concrete Use Cases
1. Image Compression and Processing
PCA is widely used for image compression. By decomposing a set of faces into principal components (the famous Eigenfaces of Turk & Pentland, 1991), any face can be approximated by just 50 to 100 coefficients instead of thousands of pixel values.
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()
X_faces = faces.data # 400 images of 64×64 pixels, already scaled to [0, 1] by the loader
pca_faces = PCA(n_components=0.95)
X_faces_reduced = pca_faces.fit_transform(X_faces)
print(f"Reduction: {X_faces.shape[1]} → {X_faces_reduced.shape[1]} dimensions")
# Visualize Eigenfaces
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca_faces.components_[i].reshape(64, 64), cmap='gray')
    ax.axis('off')
    ax.set_title(f"PC{i+1}")
plt.suptitle("First 10 principal components (Eigenfaces)")
plt.tight_layout()
plt.show()
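This approach shrinks each image from 4,096 pixel values to a much smaller set of coefficients (the exact count depends on the variance threshold and is reported by the print statement above) while preserving the essential facial features.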
2. Genomic Analysis and Bioinformatics
In genomic studies, the expression of tens of thousands of genes is often measured on a few hundred samples. PCA makes it possible to identify dominant expression patterns, detect groups of similar patients, and visualize otherwise inaccessible data.
A classic example: reducing 20,000 genes to 3 components to reveal that samples naturally separate by cancer type — without any label information.
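The situation described here, far more variables than samples, can be imitated with synthetic data. The sketch below is an assumption-laden toy: the "expression matrix" is pure noise, except that the second group of samples has a shifted mean on the first 100 "genes". The group labels are used only to inspect the result, never to fit the PCA.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
n_per_group, n_genes = 50, 2000

# Synthetic expression matrix: noise everywhere, plus a shift in the
# first 100 genes for the second group of samples
expr = rng.normal(size=(2 * n_per_group, n_genes))
expr[n_per_group:, :100] += 2.0
groups = np.repeat([0, 1], n_per_group)  # for inspection only

# With 100 samples, at most 99 components can carry non-zero variance
scores = PCA(n_components=3).fit_transform(expr)

pc1_means = [scores[groups == g, 0].mean() for g in (0, 1)]
print("reduced shape:", scores.shape)
print("mean PC1 score per group:", np.round(pc1_means, 2))
```

The two groups end up on opposite sides of the first component, even though PCA never saw the labels.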
3. Finance — Portfolio Analysis
In quantitative finance, PCA identifies the latent risk factors that explain market movements:
- First component: overall market effect (all assets move in the same direction)
- Second component: sector or size effect (large vs. small capitalizations)
- Third component: additional specific dynamics
# PCA on simulated stock returns
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(42)
# Simulation: 500 days, 30 stocks
returns = np.random.multivariate_normal(
    mean=np.zeros(30),
    cov=np.eye(30) * 0.5 + 0.3,  # variance 0.8, covariance 0.3: moderate correlation
    size=500
)
pca_finance = PCA()
X_pca_fin = pca_finance.fit_transform(returns)
print("Variance explained by factor:")
for i in range(5):
    prop = pca_finance.explained_variance_ratio_[i]
    print(f"  Factor {i+1}: {prop:.2%}")
cum3 = np.sum(pca_finance.explained_variance_ratio_[:3])
print(f"  Cumulative 3 factors: {cum3:.2%}")
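In this simulation a single common factor is built in, so the first component explains roughly 40% of the variance and the remaining components share the rest about equally; the first three factors together stay below about half. On real return data the share depends on the asset universe: the more strongly the assets move together, the more the first few factors explain.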
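As a reference point for the simulated example, the eigenvalues of the covariance matrix used to generate the returns can be computed directly. They show what the sample PCA would report with an unlimited number of days.

```python
import numpy as np

# Same covariance matrix as in the simulation above
cov = np.eye(30) * 0.5 + 0.3

eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
ratios = eigvals / eigvals.sum()

print("top 3 eigenvalues:", np.round(eigvals[:3], 3))
print("top 3 variance ratios:", np.round(ratios[:3], 4))
print(f"cumulative, 3 factors: {ratios[:3].sum():.4f}")
```

The matrix has one eigenvalue equal to 0.5 + 30 × 0.3 = 9.5 (the common "market" factor) and 29 eigenvalues equal to 0.5, for a total variance of 24. The market factor therefore accounts for 9.5 / 24, about 39.6%, and the first three factors for 10.5 / 24, about 43.8%.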
4. Natural Language Processing (NLP)
Although learned embeddings such as Word2Vec and transformer models dominate NLP today, PCA remains useful for:
- Reducing embedding dimension (for example, from 768 to 128 dimensions) before using them in a lighter model
- Denoising word vectors by removing low-variance components
- Visualizing the semantic space: projecting word embeddings into 2D to observe thematic clusters
# Embedding reduction for a lightweight model
from sklearn.decomposition import PCA
# Placeholder data: random vectors standing in for real 768-dimensional embeddings
embeddings = np.random.randn(10000, 768) # 10,000 words, 768 dimensions
pca_embed = PCA(n_components=128)
embeddings_reduced = pca_embed.fit_transform(embeddings)
variance_preserved = np.sum(pca_embed.explained_variance_ratio_)
print(f"Dimensions: 768 → 128 ({variance_preserved:.1%} variance preserved)")
print(f"Memory reduction: {embeddings.nbytes / 1024:.0f} KB → {embeddings_reduced.nbytes / 1024:.0f} KB")
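This reduces embedding storage by 6× (768 / 128). Note that the random placeholder vectors above have no correlation structure, so the printed variance figure will be low; real embeddings are correlated across dimensions and generally retain far more variance at the same reduced size.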
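How much variance survives the reduction depends on how much structure the vectors contain. The sketch below contrasts isotropic random vectors with hypothetical "structured" vectors built from 20 latent factors; both are synthetic stand-ins, and the sizes are reduced (2,000 vectors, 256 dimensions) to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
n_words, dim, k = 2000, 256, 32

# Isotropic noise: no direction is more informative than another
random_vecs = rng.normal(size=(n_words, dim))

# Hypothetical structured vectors: 20 latent factors mixed into 256
# dimensions, plus a small amount of noise
latent = rng.normal(size=(n_words, 20))
mixing = rng.normal(size=(20, dim))
structured_vecs = latent @ mixing + 0.1 * rng.normal(size=(n_words, dim))

def variance_kept(vectors, n_components):
    """Share of total variance kept by the first n_components."""
    pca = PCA(n_components=n_components, svd_solver="full").fit(vectors)
    return float(pca.explained_variance_ratio_.sum())

kept_random = variance_kept(random_vecs, k)
kept_structured = variance_kept(structured_vecs, k)
print(f"random: {kept_random:.1%}, structured: {kept_structured:.1%}")
```

The same 8× reduction keeps only a small share of the variance for the random vectors and nearly all of it for the structured ones.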

