PLS Regression: Complete Guide - Principles, Examples, and Python Implementation

Summary: PLS regression (Partial Least Squares Regression) is a supervised modeling method that projects the predictor variables and the target onto latent components by maximizing their covariance. Particularly effective against multicollinearity and in situations where the number of variables exceeds the number of observations, PLS regression has become an indispensable tool in chemometrics, bioinformatics, and economic forecasting.

Mathematical Principle

PLS regression models the relationship between a predictor matrix X ∈ ℝ^(n×p) and a target vector y ∈ ℝ^n through latent components.

PLS Decomposition

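Model:
– X = TPᵀ + E (predictor decomposition)
– y = Tq + f (target decomposition)

T (scores), P (X loadings), q (y loadings), E and f (residuals). Each score vector is built from a weight vector w as t = Xw, computed on the deflated X; the weight vectors are collected in W.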

Covariance Maximization

Unlike PCA, which only maximizes the variance of X, PLS regression maximizes the covariance between t = Xw and y:

w* = argmax_w Cov(Xw, y)   s.t. ‖w‖ = 1
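
For a single target this problem has a closed-form solution: on centered data, Cov(Xw, y) is proportional to wᵀXᵀy, so by the Cauchy-Schwarz inequality the maximizer is w* = Xᵀy / ‖Xᵀy‖. The sketch below checks this numerically by comparing w* with random unit vectors. The synthetic data, the seed, and the helper name cov_with_y are illustrative assumptions, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data (illustrative only)
X = rng.standard_normal((200, 6))
y = X @ rng.standard_normal(6) + 0.1 * rng.standard_normal(200)
X = X - X.mean(axis=0)
y = y - y.mean()

def cov_with_y(w):
    """Sample covariance between the score t = Xw and y (data already centered)."""
    return (X @ w) @ y / (len(y) - 1)

# Closed-form maximizer under the unit-norm constraint
w_star = X.T @ y / np.linalg.norm(X.T @ y)
best_cov = cov_with_y(w_star)

# Compare against 1000 random unit-norm directions
random_dirs = rng.standard_normal((1000, 6))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_covs = np.array([cov_with_y(w) for w in random_dirs])

print(f"Covariance at w*: {best_cov:.4f}")
print(f"Best random direction: {random_covs.max():.4f}")
```

No random direction should beat w*, which is why the first NIPALS weight vector is simply the normalized Xᵀy.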

NIPALS Algorithm

1. u = y
2. w = Xᵀu / ‖Xᵀu‖
3. t = Xw
4. q = yᵀt / (tᵀt)
5. u = yq / (qᵀq)
6. Repeat 2-5 until convergence (with a single target y, one pass suffices)
7. Deflate X ← X − tpᵀ where p = Xᵀt/(tᵀt), y ← y − tq

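Final coefficients: β_pls = W(PᵀW)⁻¹q with q = (TᵀT)⁻¹Tᵀy, where W, P, and T collect the vectors w, p, and t of all components (X and y centered); therefore ŷ = Xβ_pls.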
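
The algorithm can be sketched in a few lines of NumPy for the single-target case (PLS1), where the inner loop converges in one pass. This is a minimal illustration and not scikit-learn's implementation: the function name pls1_nipals and the synthetic data are assumptions, the data are centered but not scaled, and the coefficients are computed as W(PᵀW)⁻¹q. As a sanity check, using as many components as there are (full-rank) predictors should reproduce ordinary least squares.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS1 via NIPALS-style deflation. Returns (beta, T) for centered data."""
    Xk = X - X.mean(axis=0)          # centered working copy of X
    yk = y - y.mean()                # centered working copy of y
    n, p = Xk.shape
    W = np.zeros((p, n_components))  # weights
    P = np.zeros((p, n_components))  # X loadings
    T = np.zeros((n, n_components))  # scores
    q = np.zeros(n_components)       # y loadings
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)       # unit-norm weight vector
        t = Xk @ w                   # score vector
        tt = t @ t
        p_a = Xk.T @ t / tt          # X loading
        q_a = yk @ t / tt            # y loading
        Xk = Xk - np.outer(t, p_a)   # deflate X
        yk = yk - q_a * t            # deflate y
        W[:, a], P[:, a], T[:, a], q[a] = w, p_a, t, q_a
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, T

# Illustrative data: 30 observations, 4 predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(30)

beta_full, T = pls1_nipals(X, y, n_components=4)

# OLS on the same centered data, for comparison
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
print("Matches OLS with all components:", np.allclose(beta_full, beta_ols, atol=1e-6))
```

With fewer components than predictors, the same function returns a regularized solution restricted to the first latent directions.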

Intuition

Imagine predicting the quality of a wine from 150 spectral measurements. The absorbance at 520 nm is almost identical to that at 521 nm: extreme multicollinearity. Ordinary least squares becomes unstable here, because XᵀX is nearly singular and the coefficient estimates have very large variance.

PLS regression constructs a few latent components that summarize X while remaining correlated with y.

PCA vs PLS:

Method	Criterion	Supervised?
PCA	Variance of X	No
PLS Regression	Covariance (X, y)	Yes

PLS can be seen as a supervised counterpart of PCA. It is well suited to cases where p > n or where the predictors are strongly multicollinear.

Python Implementation

Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

Multicollinear Data

n = 100; p = 50
Z = np.random.randn(n, 3)  # latent variables
A = np.random.randn(3, p)
X = Z @ A + 0.5 * np.random.randn(n, p)
y = Z @ [2.0, -1.5, 3.0] + 0.3 * np.random.randn(n)

Cross-Validation

scaler_x = StandardScaler()
scaler_y = StandardScaler()
Xs = scaler_x.fit_transform(X)
ys = scaler_y.fit_transform(y.reshape(-1,1)).ravel()

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for k in range(1, min(n,p)+1):
    pls = PLSRegression(n_components=k)
    s = cross_val_score(pls, Xs, ys, cv=kf, scoring='neg_mean_squared_error')
    scores.append(-s.mean())

opt_k = np.argmin(scores) + 1
print(f"Optimal components: {opt_k}")

PLS vs PCA+LR

pls = PLSRegression(n_components=opt_k)
pls.fit(Xs, ys)
yp = pls.predict(Xs)

pca = PCA(n_components=opt_k)
Xp = pca.fit_transform(Xs)
lr = LinearRegression().fit(Xp, ys)
yp2 = lr.predict(Xp)

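# In-sample R² on the standardized training data (optimistic; not a test score)
r2_pls = r2_score(ys, yp)
r2_pca = r2_score(ys, yp2)
print(f"R² PLS: {r2_pls:.4f} | R² PCA+LR: {r2_pca:.4f} | PLS improvement: {r2_pls - r2_pca:+.4f}")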

Variable Weights

wts = pls.x_weights_[:, 0]
top10 = np.argsort(np.abs(wts))[::-1][:10]
for i in top10:
    print(f"  Var {i}: {wts[i]:.4f}")

Explained Variance

var_cum = []
for k in range(1, 21):
    m = PLSRegression(n_components=k).fit(Xs, ys)
    var_cum.append(r2_score(ys, m.predict(Xs)))
plt.plot(range(1,21), var_cum, "go-")
plt.xlabel('Components'); plt.ylabel('Cumulative R²'); plt.show()

Hyperparameters

Parameter	Type	Default	Description
`n_components`	int	2	Number of latent components. The key hyperparameter: choose it via cross-validation.
`scale`	bool	True	Automatically standardizes X and y. Recommended by default.
`max_iter`	int	500	Max NIPALS iterations per component.
`tol`	float	1e-6	NIPALS convergence tolerance.
`copy`	bool	True	Copy input data (memory vs. safety).

Choosing n_components: the 1-SE rule

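# 1-SE rule: keep the smallest k whose CV error is within one standard error of the minimum
cv10 = KFold(n_splits=10, shuffle=True, random_state=42)
mse, se = [], []
for k in range(1, 31):
    fold_mse = -cross_val_score(PLSRegression(n_components=k), Xs, ys,
                                cv=cv10, scoring='neg_mean_squared_error')
    mse.append(fold_mse.mean())
    se.append(fold_mse.std(ddof=1) / np.sqrt(len(fold_mse)))  # standard error across folds
mse, se = np.array(mse), np.array(se)
best = np.argmin(mse)
opt = int(np.where(mse <= mse[best] + se[best])[0][0]) + 1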

Advantages and Limitations

Advantages

Multicollinearity: PLS regression handles strong correlations between predictors without coefficient divergence.
p > n: Works even when variables outnumber observations.
Supervised: Orients components toward predicting y, not just the variance of X.
Noise filtering: Variation in X that is unrelated to y receives little weight in the components.
Multi-response: PLS2 handles multi-dimensional Y.
Interpretable: VIP weights and scores identify key variables.

Limitations

Mixed components: Linear combinations of all variables, hard to interpret physically.
Sensitive to n_components: A poor choice causes overfitting or underfitting.
Non-selective: Uses all variables (sparse sPLS is not in scikit-learn).
Linear: Kernel PLS is needed for nonlinearities.
Outliers: Sensitive to outliers like linear regression.
Not probabilistic: No closed-form confidence intervals or statistical tests in standard implementations; resampling (bootstrap, jackknife) is the usual workaround.

Use Cases

1. Spectroscopy and Chemometrics

The historical domain of PLS regression. In NIR spectroscopy, absorbance is measured at hundreds of highly correlated wavelengths. Predicting the protein or moisture concentration of a food sample is the typical application.

# X_spectra, y_concentration, and X_new are placeholders for your own data
pls = PLSRegression(n_components=8)
pls.fit(X_spectra, y_concentration)
y_pred = pls.predict(X_new)

VIP scores identify the most informative wavelengths.

2. Omics Data (p >> n)

In genomics or metabolomics, datasets often contain thousands of genes or metabolites measured on only a few dozen patients. PLS regression exploits the biological correlation structure between co-expressed genes, creating components relevant to the target phenotype.

3. Economic Forecasting

A typical task is combining 150 correlated macroeconomic indicators to forecast quarterly GDP. PLS extracts the dimensions of economic activity that are predictive of GDP and gives little weight to factors unrelated to it.

4. Sensor Data

Industrial sensor networks (temperature, pressure, vibration, flow rate) produce highly correlated signals. A typical task is predicting the energy efficiency of an HVAC system from 50 intercorrelated sensors.

from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr)
Xte_s = scaler.transform(Xte)
pls = PLSRegression(n_components=5).fit(Xtr_s, ytr)
print(f"Test R²: {r2_score(yte, pls.predict(Xte_s)):.4f}")

Going Further

Wold, Sjöström & Eriksson (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.
Hervé Abdi (2010). Partial Least Squares Regression and Projection on Latent Structure. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1), 97-106.
scikit-learn documentation: cross_decomposition.PLSRegression

About Salah YAHIAOUI
