PLS Regression: Principles, Examples, and Python Implementation

Summary: PLS regression (Partial Least Squares Regression) is a supervised modeling method that projects the predictor variables and the target onto latent components by maximizing their covariance. Particularly effective against multicollinearity and in situations where the number of variables exceeds the number of observations, PLS regression has become an indispensable tool in chemometrics, bioinformatics, and economic forecasting.


Mathematical Principle

PLS regression models the relationship between a predictor matrix X ∈ ℝ^(n×p) and a target vector y ∈ ℝ^n through latent components.

PLS Decomposition

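Model:
X = TPᵀ + E    (predictor decomposition)
y = Tq + f     (target decomposition)

T (scores), P (X loadings), q (y loadings), E and f (residuals). The weights W define how the scores are built from X; they are distinct from the loadings P once X is deflated.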

Covariance Maximization

Unlike PCA, which only maximizes the variance of X, PLS regression maximizes the covariance between t = Xw and y:

w* = argmax_w Cov(Xw, y)   s.t. ‖w‖ = 1
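For centered data, Cov(Xw, y) is proportional to wᵀXᵀy, so by the Cauchy-Schwarz inequality the unit vector that maximizes it is w* = Xᵀy / ‖Xᵀy‖. The sketch below checks this numerically on synthetic data; the variable names (`X_c`, `y_c`, `cov_with_y`) and the data itself are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X_c = rng.standard_normal((n, p))
y_c = X_c @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 3.0]) + rng.standard_normal(n)

# Center both blocks so that sample covariance reduces to a dot product
X_c = X_c - X_c.mean(axis=0)
y_c = y_c - y_c.mean()

def cov_with_y(w):
    """Sample covariance between the projection X_c @ w and y_c."""
    return (X_c @ w) @ y_c / (n - 1)

# Closed-form maximizer under the unit-norm constraint
w_star = X_c.T @ y_c
w_star /= np.linalg.norm(w_star)

# Compare against many random unit-norm directions
R = rng.standard_normal((1000, p))
R /= np.linalg.norm(R, axis=1, keepdims=True)
best_random = max(cov_with_y(r) for r in R)

print(f"Cov at w*: {cov_with_y(w_star):.4f}, best random direction: {best_random:.4f}")
```

No random direction should beat w*, which is the first PLS weight vector up to sign.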

NIPALS Algorithm

  1. u = y
  2. w = Xᵀu / ‖Xᵀu‖
  3. t = Xw
  4. q = yᵀt / (tᵀt)
  5. u = yq / ‖q‖
  6. Repeat 2-5 until convergence
  7. Deflate: X ← X − tpᵀ where p = Xᵀt / (tᵀt), and y ← y − tq

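Final coefficients: β_pls = W(PᵀW)⁻¹q with q = (TᵀT)⁻¹Tᵀy, where P stacks the loading vectors p from step 7 as columns; therefore ŷ = Xβ_pls. The factor (PᵀW)⁻¹ accounts for the deflation: each w is computed on a deflated X, so the scores satisfy T = XW(PᵀW)⁻¹ rather than T = XW.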
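The steps above can be sketched in plain NumPy for the single-response case. With one y, u stays proportional to y, so the inner loop converges after a single pass and is omitted here. The function name `pls1_nipals` and the demo data are hypothetical; this is a minimal sketch assuming centering only (no scaling), not a replacement for a library implementation.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS1 via NIPALS with deflation; returns (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # work on centered copies
    p = X.shape[1]
    W = np.zeros((p, n_components))          # weights
    P = np.zeros((p, n_components))          # X loadings
    q = np.zeros(n_components)               # y loadings
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)               # step 2 with u = y
        t = Xk @ w                           # step 3
        tt = t @ t
        q[a] = yk @ t / tt                   # step 4
        P[:, a] = Xk.T @ t / tt
        W[:, a] = w
        Xk = Xk - np.outer(t, P[:, a])       # step 7: deflate X
        yk = yk - t * q[a]                   # step 7: deflate y
    beta = W @ np.linalg.solve(P.T @ W, q)   # W (P^T W)^{-1} q
    intercept = y_mean - x_mean @ beta
    return beta, intercept

# Synthetic data with predictors on different scales
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((40, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = X_demo @ np.array([1.0, -0.5, 0.25, 0.0, 2.0]) + 0.1 * rng.standard_normal(40)

beta_full, b0_full = pls1_nipals(X_demo, y_demo, n_components=5)

# Sanity check: with as many components as predictors, PLS1 should match OLS
A_ols = np.column_stack([np.ones(len(y_demo)), X_demo])
coef_ols = np.linalg.lstsq(A_ols, y_demo, rcond=None)[0]
print(np.allclose(beta_full, coef_ols[1:], atol=1e-6))
```

Using fewer components than predictors is what gives PLS its regularizing effect; the full-rank case is only used here as a correctness check.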


Intuition

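Imagine predicting the quality of a wine from 150 spectral measurements. The absorbance at 520 nm is almost identical to that at 521 nm: extreme multicollinearity. Ordinary least squares breaks down here: XᵀX is nearly singular, so the coefficient estimates become unstable and highly sensitive to noise.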

PLS regression constructs a few latent components that summarize X while remaining correlated with y.

PCA vs PLS:

| Method         | Criterion         | Supervised? |
|----------------|-------------------|-------------|
| PCA            | Variance of X     | No          |
| PLS Regression | Covariance (X, y) | Yes         |

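PLS can be seen as a supervised counterpart of PCA: the components are chosen for their covariance with y, not only for the variance they capture in X. It is well suited to p > n or strongly collinear predictors.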


Python Implementation

Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

Multicollinear Data

n = 100; p = 50
Z = np.random.randn(n, 3)  # latent variables
A = np.random.randn(3, p)
X = Z @ A + 0.5 * np.random.randn(n, p)
y = Z @ [2.0, -1.5, 3.0] + 0.3 * np.random.randn(n)

Cross-Validation

scaler_x = StandardScaler()
scaler_y = StandardScaler()
Xs = scaler_x.fit_transform(X)
ys = scaler_y.fit_transform(y.reshape(-1,1)).ravel()

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for k in range(1, min(n,p)+1):
    pls = PLSRegression(n_components=k)
    s = cross_val_score(pls, Xs, ys, cv=kf, scoring='neg_mean_squared_error')
    scores.append(-s.mean())

opt_k = np.argmin(scores) + 1
print(f"Optimal components: {opt_k}")

PLS vs PCA+LR

pls = PLSRegression(n_components=opt_k)
pls.fit(Xs, ys)
yp = pls.predict(Xs)

pca = PCA(n_components=opt_k)
Xp = pca.fit_transform(Xs)
lr = LinearRegression().fit(Xp, ys)
yp2 = lr.predict(Xp)

r2_pls = r2_score(ys, yp)
r2_pca = r2_score(ys, yp2)
print(f"R² PLS: {r2_pls:.4f} | R² PCA+LR: {r2_pca:.4f} | PLS improvement: {r2_pls - r2_pca:+.4f}")

Variable Weights

wts = pls.x_weights_[:, 0]
top10 = np.argsort(np.abs(wts))[::-1][:10]
for i in top10:
    print(f"  Var {i}: {wts[i]:.4f}")

Explained Variance

var_cum = []
for k in range(1, 21):
    m = PLSRegression(n_components=k).fit(Xs, ys)
    var_cum.append(r2_score(ys, m.predict(Xs)))
plt.plot(range(1,21), var_cum, "go-")
plt.xlabel('Components'); plt.ylabel('Cumulative R²'); plt.show()

Hyperparameters

| Parameter    | Type  | Default | Description |
|--------------|-------|---------|-------------|
| n_components | int   | 2       | Number of latent components. The key hyperparameter; choose via cross-validation. |
| scale        | bool  | True    | Automatically standardizes X and y. Recommended by default. |
| max_iter     | int   | 500     | Max NIPALS iterations per component. |
| tol          | float | 1e-6    | NIPALS convergence tolerance. |
| copy         | bool  | True    | Copy input data (memory vs. safety). |

Choosing n_components: the 1-SE rule

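mse, se = [], []
for k in range(1, 31):
    # Per-fold MSE, so the standard error of the CV estimate is available
    fold_mse = -cross_val_score(PLSRegression(n_components=k), Xs, ys,
                                cv=10, scoring='neg_mean_squared_error')
    mse.append(fold_mse.mean())
    se.append(fold_mse.std(ddof=1) / np.sqrt(len(fold_mse)))
best = int(np.argmin(mse))
# 1-SE rule: smallest k whose CV error is within one standard error of the minimum
opt = int(np.where(np.array(mse) <= mse[best] + se[best])[0][0]) + 1
print(f"1-SE choice: {opt} components (minimum CV error at {best + 1})")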

Advantages and Limitations

Advantages

  • Multicollinearity: PLS regression handles strong correlations between predictors without coefficient divergence.
  • p > n: Works even when variables outnumber observations.
  • Supervised: Orients components toward predicting y, not just the variance of X.
  • Noise filtering: Variation in X that is unrelated to y receives little weight in the leading components.
  • Multi-response: PLS2 handles multi-dimensional Y.
  • Interpretable: VIP weights and scores identify key variables.

Limitations

  • Mixed components: Linear combinations of all variables, hard to interpret physically.
  • Sensitive to n_components: A poor choice causes overfitting or underfitting.
  • Non-selective: Uses all variables (sparse sPLS is not in scikit-learn).
  • Linear: Kernel PLS is needed for nonlinearities.
  • Outliers: Sensitive to outliers like linear regression.
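  • Not probabilistic: Standard implementations provide no closed-form confidence intervals or significance tests; resampling methods such as the bootstrap are the usual workaround.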

Use Cases

1. Spectroscopy and Chemometrics

The historical domain of PLS regression. In NIR spectroscopy, absorbance is measured at hundreds of highly correlated wavelengths. Predicting the protein or moisture concentration of a food sample is the typical application.

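# X_spectra, y_concentration, and X_new are placeholders for your own data:
# spectra as rows, wavelengths as columns, one reference concentration per sample.
pls = PLSRegression(n_components=8)  # in practice, choose by cross-validation
pls.fit(X_spectra, y_concentration)
y_pred = pls.predict(X_new)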

VIP scores identify the most informative wavelengths.

2. Omics Data p >> n

In genomics or metabolomics, thousands of genes or metabolites are measured for only a few dozen patients. PLS regression exploits the biological correlation structure between co-expressed genes, creating components relevant to the target phenotype.

3. Economic Forecasting

A typical task is combining 150 correlated macroeconomic indicators to forecast quarterly GDP. PLS extracts the dimensions of economic activity that are predictive of GDP and gives little weight to factors unrelated to it.

4. Sensor Data

Industrial sensor networks (temperature, pressure, vibrations, flow rate) produce highly correlated readings. A typical task is predicting the energy efficiency of an HVAC system from 50 intercorrelated sensors.

from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr)
Xte_s = scaler.transform(Xte)
pls = PLSRegression(n_components=5).fit(Xtr_s, ytr)
print(f"Test R²: {r2_score(yte, pls.predict(Xte_s)):.4f}")

Going Further

  • Wold, Sjöström & Eriksson (2001): PLS-regression: a basic tool of chemometrics, Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.
  • Hervé Abdi (2010): Partial Least Squares Regression and Projection on Latent Structure Regression, Wiley Interdisciplinary Reviews: Computational Statistics, 2(1), 97-106.
  • scikit-learn documentation: cross_decomposition.PLSRegression