PLS Regression: Complete Guide - Principles, Examples, and Python Implementation
Summary: PLS regression (Partial Least Squares Regression) is a supervised modeling method that projects the predictor variables and the target onto latent components chosen to maximize their covariance. It is particularly effective against multicollinearity and when the number of variables exceeds the number of observations, which has made it a standard tool in chemometrics, bioinformatics, and economic forecasting.
Mathematical Principle
PLS regression models the relationship between a predictor matrix $X \in \mathbb{R}^{n \times p}$ and a target vector $y \in \mathbb{R}^{n}$ through latent components.
PLS Decomposition
Model:
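– $X = TP^\top + E$ (predictor decomposition)
– $y = Tq + f$ (target decomposition)
$T$ (scores), $P$ (X loadings), $q$ (y loadings), $E$ and $f$ (residuals). The weights $W$ define how each score is built from $X$: every component satisfies $t = Xw$ on the current (deflated) $X$.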
Covariance Maximization
Unlike PCA, which only maximizes the variance of X, PLS regression maximizes the covariance between t = Xw and y:
$w^{*} = \arg\max_{w} \operatorname{Cov}(Xw, y) \quad \text{s.t.} \quad \|w\| = 1$
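For a single centered target, this problem has a closed-form solution: by the Cauchy-Schwarz inequality, the maximizer is proportional to $X^\top y$. The sketch below checks this numerically on synthetic data by comparing the candidate direction against many random unit vectors. The data, the helper name `cov_with_y`, and the number of random trials are illustrative assumptions, not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Center both blocks so that Cov(Xw, y) is proportional to w^T X^T y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def cov_with_y(w):
    """Sample covariance between the score t = Xc w and the centered target."""
    return (Xc @ w) @ yc / (n - 1)


# Candidate maximizer: w proportional to X^T y, rescaled to unit norm.
w_star = Xc.T @ yc
w_star /= np.linalg.norm(w_star)

# Compare against many random unit-norm directions.
W_rand = rng.standard_normal((5000, p))
W_rand /= np.linalg.norm(W_rand, axis=1, keepdims=True)
best_random = max(cov_with_y(w) for w in W_rand)

print(f"Covariance at w*:         {cov_with_y(w_star):.4f}")
print(f"Best random unit vector:  {best_random:.4f}")
```

No random direction should beat `w_star`, which is why step 2 of NIPALS reduces to a normalized matrix-vector product when y is a single column.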
NIPALS Algorithm
1. $u = y$
2. $w = X^\top u / \|X^\top u\|$
3. $t = Xw$
4. $q = y^\top t / (t^\top t)$
5. $u = yq / \|q\|$
6. Repeat steps 2-5 until convergence
7. Deflate: $X \leftarrow X - tp^\top$ where $p = X^\top t / (t^\top t)$, and $y \leftarrow y - tq$
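Final coefficients: $\hat{\beta}_{\text{pls}} = W(P^\top W)^{-1}(T^\top T)^{-1}T^\top y$, where the columns of $P$ are the loading vectors $p$ from the deflation step; therefore $\hat{y} = X\hat{\beta}_{\text{pls}}$. The factor $(P^\top W)^{-1}$ is needed because each $w$ is computed on the deflated $X$, not on the original matrix.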
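The steps above can be sketched in plain NumPy for the single-target case (PLS1). With one target column the inner loop converges in a single pass, so the sketch skips the iteration and goes straight from the weight vector to the deflation. The function name `pls1_nipals` and the synthetic data are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Minimal PLS1 sketch on centered data; returns (beta, T, W, P, q)."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    n, p = Xk.shape
    W = np.zeros((p, n_components))  # weights
    P = np.zeros((p, n_components))  # X loadings
    T = np.zeros((n, n_components))  # scores
    q = np.zeros(n_components)       # y loadings
    for a in range(n_components):
        w = Xk.T @ yk                 # step 2 with u = y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # step 3
        tt = t @ t
        q[a] = yk @ t / tt            # step 4
        p_a = Xk.T @ t / tt           # X loading used for deflation
        Xk = Xk - np.outer(t, p_a)    # step 7: deflate X
        yk = yk - t * q[a]            # step 7: deflate y
        W[:, a], P[:, a], T[:, a] = w, p_a, t
    # Coefficients in terms of the original (centered) X.
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, T, W, P, q


rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(60)

beta, T, W, P, q = pls1_nipals(X, y, n_components=5)

# Sanity checks: score vectors are mutually orthogonal, and with as many
# components as predictors PLS1 coincides with ordinary least squares.
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
gram = T.T @ T
print("Scores orthogonal:", np.allclose(gram, np.diag(np.diag(gram)), atol=1e-8))
print("Matches OLS at full rank:", np.allclose(beta, beta_ols, atol=1e-6))
```

Using fewer components than predictors is what gives PLS its regularizing effect; the full-rank comparison is only a correctness check.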
Intuition
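Imagine predicting the quality of a wine from 150 spectral measurements. The absorbance at 520 nm is almost identical to that at 521 nm: extreme multicollinearity. Classical linear regression breaks down here because the Gram matrix of X is nearly singular, so the coefficient estimates become unstable.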
PLS regression constructs a few latent components that summarize X while remaining correlated with y.
PCA vs PLS:
| Method | Criterion | Supervised? |
|---|---|---|
| PCA | Variance of X | No |
| PLS Regression | Covariance (X, y) | Yes |
PLS can be viewed as a supervised counterpart of PCA. It is well suited to cases where p > n or where the predictors are strongly collinear.
Python Implementation
Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
Multicollinear Data
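n, p = 100, 50
Z = np.random.randn(n, 3)  # 3 latent variables drive both X and y
A = np.random.randn(3, p)  # loadings mapping the latent variables to the 50 predictors
X = Z @ A + 0.5 * np.random.randn(n, p)  # strongly collinear predictors
y = Z @ np.array([2.0, -1.5, 3.0]) + 0.3 * np.random.randn(n)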
Cross-Validation
# Standardize X and y (PLSRegression also scales internally when scale=True)
scaler_x = StandardScaler()
scaler_y = StandardScaler()
Xs = scaler_x.fit_transform(X)
ys = scaler_y.fit_transform(y.reshape(-1, 1)).ravel()
# 5-fold CV error for each candidate number of components
scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for k in range(1, min(n, p) + 1):
    pls = PLSRegression(n_components=k)
    s = cross_val_score(pls, Xs, ys, cv=kf, scoring='neg_mean_squared_error')
    scores.append(-s.mean())
opt_k = int(np.argmin(scores)) + 1
print(f"Optimal components: {opt_k}")
PLS vs PCA+LR
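# PLS with the selected number of components
pls = PLSRegression(n_components=opt_k)
pls.fit(Xs, ys)
yp = pls.predict(Xs)
# PCA followed by linear regression, with the same number of components
pca = PCA(n_components=opt_k)
Xp = pca.fit_transform(Xs)
lr = LinearRegression().fit(Xp, ys)
yp2 = lr.predict(Xp)
r2_pls = r2_score(ys, yp)
r2_pca = r2_score(ys, yp2)
print(f"PLS R²: {r2_pls:.4f} | PCA+LR R²: {r2_pca:.4f} | PLS improvement: {r2_pls - r2_pca:+.4f}")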
Variable Weights
# Weights of the first PLS component, ranked by absolute value
wts = pls.x_weights_[:, 0]
top10 = np.argsort(np.abs(wts))[::-1][:10]
for i in top10:
    print(f"  Var {i}: {wts[i]:.4f}")
Explained Variance
# Training R² as a function of the number of components
var_cum = []
for k in range(1, 21):
    m = PLSRegression(n_components=k).fit(Xs, ys)
    var_cum.append(r2_score(ys, m.predict(Xs)))
plt.plot(range(1, 21), var_cum, "go-")
plt.xlabel('Components')
plt.ylabel('Cumulative R²')
plt.show()
Hyperparameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_components | int | 2 | Number of latent components. The key hyperparameter; choose it via cross-validation. |
| scale | bool | True | Automatically standardizes X and y. Recommended by default. |
| max_iter | int | 500 | Max NIPALS iterations per component. |
| tol | float | 1e-6 | NIPALS convergence tolerance. |
| copy | bool | True | Copy input data (memory vs. safety). |
Choosing n_components: the 1-SE rule
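# 1-SE rule: smallest k whose CV error is within one standard error of the minimum
cv = KFold(n_splits=10, shuffle=True, random_state=42)
mse_mean, mse_se = [], []
for k in range(1, 31):
    fold_mse = -cross_val_score(PLSRegression(n_components=k), Xs, ys, cv=cv, scoring='neg_mean_squared_error')
    mse_mean.append(fold_mse.mean())
    mse_se.append(fold_mse.std(ddof=1) / np.sqrt(len(fold_mse)))  # standard error across folds
best = int(np.argmin(mse_mean))
threshold = mse_mean[best] + mse_se[best]
opt = int(np.where(np.array(mse_mean) <= threshold)[0][0]) + 1
print(f"Minimum-error k: {best + 1}, 1-SE k: {opt}")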
Advantages and Limitations
Advantages
- Multicollinearity: PLS regression handles strong correlations between predictors without coefficient divergence.
- p > n: Works even when variables outnumber observations.
- Supervised: Orients components toward predicting y, not just the variance of X.
- Noise filtering: Variation in X that is unrelated to y receives little weight, because components are built to covary with y.
- Multi-response: PLS2 handles multi-dimensional Y.
- Interpretable: VIP weights and scores identify key variables.
Limitations
- Mixed components: Linear combinations of all variables, hard to interpret physically.
- Sensitive to n_components: A poor choice causes overfitting or underfitting.
- Non-selective: Uses all variables (sparse sPLS is not in scikit-learn).
- Linear: Kernel PLS is needed for nonlinearities.
- Outliers: Sensitive to outliers like linear regression.
- Not probabilistic: No analytical confidence intervals, no statistical tests.
Use Cases
1. Spectroscopy and Chemometrics
The historical domain of PLS regression. In NIR spectroscopy, absorbance is measured at hundreds of highly correlated wavelengths. Predicting the protein or moisture concentration of a food sample is the typical application.
pls = PLSRegression(n_components=8)
pls.fit(X_spectra, y_concentration)
y_pred = pls.predict(X_new)
VIP scores identify the most informative wavelengths.
2. Omics Data (p >> n)
In genomics or metabolomics, datasets often contain thousands of genes for only a few dozen patients. PLS regression exploits the correlation structure between co-expressed genes to build components relevant to the target phenotype.
3. Economic Forecasting
A typical task is combining 150 correlated macroeconomic indicators to forecast quarterly GDP. PLS extracts the dimensions of economic activity that covary with GDP, rather than spending components on factors unrelated to it.
4. Sensor Data
Industrial sensor networks (temperature, pressure, vibrations, flow rate) produce highly correlated readings. A typical task is predicting the energy efficiency of an HVAC system from 50 intercorrelated sensors.
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr)
Xte_s = scaler.transform(Xte)
pls = PLSRegression(n_components=5).fit(Xtr_s, ytr)
print(f"Test R²: {r2_score(yte, pls.predict(Xte_s)):.4f}")
Going Further
- Wold, Sjöström & Eriksson (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2), 109-130.
- Hervé Abdi (2010). Partial Least Squares Regression and Projection on Latent Structure. Computational Statistics, 2(1), 97-106.
- scikit-learn documentation: cross_decomposition.PLSRegression

