Lasso Regression: Complete Guide — Principles, Examples, and Python Implementation
Lasso regression is a supervised learning algorithm that combines linear prediction with an L1 regularization penalty. Unlike Ridge regression, which shrinks coefficients without ever canceling them completely, lasso regression produces sparse solutions: it forces certain coefficients to become exactly zero, thereby performing automatic variable selection. This unique property makes Lasso an indispensable tool when working with high-dimensional data.
Lasso regression, whose name is an acronym for Least Absolute Shrinkage and Selection Operator, was introduced by Robert Tibshirani in 1996. It has established itself as one of the fundamental techniques of machine learning, particularly in domains where the number of features far exceeds the number of observations. In this guide, we explore the mathematical foundations, geometric intuition, practical implementation in Python, and concrete applications of this algorithm.
Mathematical Principle
The Lasso Objective Function
Consider a dataset composed of $n$ observations and $p$ explanatory variables. Let $X \in \mathbb{R}^{n \times p}$ be the feature matrix, $y \in \mathbb{R}^n$ the target vector, and $w \in \mathbb{R}^p$ the vector of coefficients to be estimated. Lasso regression solves the following optimization problem:
$$
\hat{w}^{\text{lasso}} = \underset{w}{\arg\min} \left( \frac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1 \right)
$$
where:
- $\frac{1}{2n} \|y - Xw\|_2^2$ is the quadratic loss term (the residual sum of squares, RSS, scaled by $1/(2n)$), the same loss that ordinary linear regression minimizes. It measures the prediction error on training data.
- $\|w\|_1 = \sum_{j=1}^{p} |w_j|$ is the L1 norm of the coefficient vector. It is the sum of the absolute values of the coefficients.
- $\alpha \geq 0$ is the regularization hyperparameter that controls the tradeoff between data fit (fidelity) and model parsimony (sparsity).
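The objective above can be written out directly in NumPy. The sketch below is illustrative only: the helper name `lasso_objective` and the synthetic data are assumptions, not part of the original guide. It checks that the coefficients returned by scikit-learn's `Lasso` (fitted with `fit_intercept=False` so that the model matches the formula term by term) reach a lower objective value than either the OLS solution or the all-zero vector.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_objective(X, y, w, alpha):
    """Hypothetical helper: (1 / 2n) * ||y - Xw||_2^2 + alpha * ||w||_1."""
    n = X.shape[0]
    residual = y - X @ w
    return residual @ residual / (2 * n) + alpha * np.abs(w).sum()


# Small synthetic problem (assumed for illustration): 2 useful features out of 10
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.array([2.0, -1.0] + [0.0] * 8)
y = X @ w_true + 0.1 * rng.standard_normal(100)

alpha = 0.1
# fit_intercept=False so the fitted model minimizes exactly the formula above
model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=10000).fit(X, y)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]

obj_lasso = lasso_objective(X, y, model.coef_, alpha)
obj_ols = lasso_objective(X, y, w_ols, alpha)
obj_zero = lasso_objective(X, y, np.zeros(10), alpha)

print(f"Objective at Lasso solution: {obj_lasso:.4f}")
print(f"Objective at OLS solution:   {obj_ols:.4f}")
print(f"Objective at w = 0:          {obj_zero:.4f}")
```

The OLS vector has a smaller quadratic loss, but once the penalty term is added the Lasso solution wins, which is the tradeoff that $\alpha$ controls.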
Why Does the L1 Norm Produce Exact Zeros?
This is where the fundamental difference between Lasso (L1 norm) and Ridge (L2 norm) lies. To understand this phenomenon, we need to examine the geometry of both approaches.
Equivalent constrained formulation. The Lasso problem can be reformulated in constrained form:
$$
\underset{w}{\arg\min} \ \|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_1 \leq t
$$
In a 2D space (two coefficients $w_1$ and $w_2$), the region defined by $\|w\|_1 \leq t$ has the shape of a diamond (or rhombus) whose vertices lie on the axes. The RSS level curves, on the other hand, are ellipses centered on the ordinary least squares (OLS) coefficients.
Intersection at vertices. The optimal solution lies at the first point of contact between an expanding RSS ellipse and the L1 diamond. Because the diamond has sharp corners located exactly on the axes, the contact point frequently falls on a vertex, where one of the two coefficients is exactly zero. In higher dimensions the same holds for the vertices, edges, and other low-dimensional faces of the L1 ball, on which several coefficients vanish at once. The Ridge constraint region, by contrast, is a disk with no corners, so its contact point almost never lies on an axis. This geometry explains the variable selection property: Lasso automatically identifies relevant features (non-zero coefficients) and eliminates useless features (exactly zero coefficients).
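The geometric argument can be checked empirically by counting exact zeros. The following is a minimal sketch on assumed synthetic data (3 informative features out of 20); the penalty strengths `alpha=0.2` for Lasso and `alpha=10.0` for Ridge are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed for illustration): only the first 3 features matter
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.5 * rng.standard_normal(200)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Count coefficients that are exactly zero, not merely small
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Exact zeros with Lasso: {n_zero_lasso} / 20")
print(f"Exact zeros with Ridge: {n_zero_ridge} / 20")
print(f"Smallest |coef| with Ridge: {np.abs(ridge.coef_).min():.2e}")
```

Ridge shrinks the noise coefficients toward zero but leaves them non-zero, while Lasso sets many of them to exactly zero.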
Special Case: Analytical Solution for a Single Variable
In the univariate case ($p = 1$), with a normalized feature $x$ (zero mean, unit variance), the Lasso solution admits an explicit form called the soft-thresholding operator:
$$
\hat{w}_j = \text{sign}(z_j) \cdot \max(|z_j| - \alpha, 0)
$$
where $z_j$ is the corresponding OLS coefficient. This formula clearly shows that if $|z_j| \leq \alpha$, then $\hat{w}_j = 0$: the coefficient is canceled. This is the algebraic manifestation of the sparsity property.
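The soft-thresholding formula is easy to verify numerically. The sketch below assumes a single feature standardized with the population standard deviation (so that $x^T x / n = 1$) and compares the closed form with scikit-learn's `Lasso` fitted without an intercept. The helper name `soft_threshold` and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso


def soft_threshold(z, alpha):
    """sign(z) * max(|z| - alpha, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)


rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
x = (x - x.mean()) / x.std()  # zero mean, unit variance (ddof=0), so x @ x / n == 1
y = 0.8 * x + 0.3 * rng.standard_normal(n)
y = y - y.mean()

z = x @ y / n  # OLS coefficient for a standardized single feature

alphas_tested = [0.1, 0.5, 1.0]
w_formula = []
w_sklearn = []
for alpha in alphas_tested:
    w_formula.append(float(soft_threshold(z, alpha)))
    model = Lasso(alpha=alpha, fit_intercept=False).fit(x.reshape(-1, 1), y)
    w_sklearn.append(float(model.coef_[0]))

for alpha, wf, ws in zip(alphas_tested, w_formula, w_sklearn):
    print(f"alpha={alpha}: formula={wf:.6f}, sklearn={ws:.6f}")
```

For the largest value of `alpha` tested, $|z| \leq \alpha$ and both approaches return zero, which is the thresholding behavior described above.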
Intuition — How to Understand It?
Lasso as Automatic Variable Selection
Imagine you are working with a dataset containing 1,000 explanatory variables: age, income, location, dozens of biological measurements, etc. You suspect that only 20 or 30 of these variables are truly useful for predicting your target. Ordinary linear regression would use all 1,000 variables, leading to massive overfitting and an unreadable model.
Lasso regression solves this problem elegantly. By progressively increasing the parameter $\alpha$, Lasso forces the coefficients of the least informative variables to become exactly zero. At the end of the process, only the “important” variables remain — those with non-zero coefficients. Lasso has automatically selected the best features, without you needing to manually test combinations.
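In code, the selected variables are simply the indices of the non-zero coefficients. Here is a small sketch on assumed synthetic data (50 features, 5 of them informative, with `alpha=0.1` picked by hand for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): features 0..4 carry the signal
rng = np.random.default_rng(42)
n_samples, n_features = 200, 50
X = rng.standard_normal((n_samples, n_features))
informative = np.array([0, 1, 2, 3, 4])
w_true = np.zeros(n_features)
w_true[informative] = [3.0, -2.5, 1.8, -1.2, 0.7]
y = X @ w_true + 0.5 * rng.standard_normal(n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)

# Indices of features whose coefficient is not exactly zero
selected = np.flatnonzero(lasso.coef_)

print(f"Selected {len(selected)} of {n_features} features: {selected.tolist()}")
for j in informative:
    print(f"feature {j}: true={w_true[j]:+.2f}, lasso={lasso.coef_[j]:+.2f}")
```

A hand-picked `alpha` may let a few noise features through or drop a weak true signal, which is why the penalty is usually tuned by cross-validation.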
The Effect of $\alpha$ on Selection
The parameter $\alpha$ acts like a sensitivity knob:
- $\alpha = 0$: No regularization term. Lasso reduces to ordinary linear regression. All coefficients are potentially non-zero.
- $\alpha$ small: A few very weak coefficients are canceled. The model stays close to OLS while starting to eliminate noise.
- $\alpha$ moderate: A significant number of coefficients become zero. The model is parsimonious and generalizes better.
- $\alpha$ very large: Almost all coefficients are zero. The model underfits and lacks predictive capacity.
The optimal choice of $\alpha$ is generally made through cross-validation, seeking the value that minimizes the error on unseen data.
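How large is "very large"? For the objective used here, a standard result (an addition to this guide, not stated in the original text) is that all coefficients are zero as soon as $\alpha \geq \alpha_{\max} = \max_j |x_j^T y| / n$, computed on centered data. The sketch below uses assumed synthetic data and scans a few fractions of $\alpha_{\max}$ to show the knob in action.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 2 informative features out of 30
rng = np.random.default_rng(0)
X = rng.standard_normal((150, 30))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(150)

# Center X and y, as Lasso does internally when fit_intercept=True
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / X.shape[0]

counts = {}
for factor in (1.01, 0.5, 0.1, 0.01):
    model = Lasso(alpha=factor * alpha_max, max_iter=10000).fit(X, y)
    counts[factor] = int(np.count_nonzero(model.coef_))

print(f"alpha_max = {alpha_max:.4f}")
for factor, count in counts.items():
    print(f"alpha = {factor:>4} * alpha_max -> {count} non-zero coefficients")
```

Searching for $\alpha$ on a logarithmic grid below `alpha_max` avoids wasting fits on values that all produce the empty model.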
Lasso vs Ridge: When to Choose Which?
The choice between Lasso and Ridge depends on the underlying structure of your data:
- Prefer Lasso when you believe that only a few variables are truly important (sparsity assumption). Lasso will select these variables and eliminate the rest.
- Prefer Ridge when you believe that all (or nearly all) variables contribute to the prediction, each with a modest effect (typically the case in polygenic genomics or image processing).
- Consider Elastic Net when your variables are strongly correlated with each other. In that case Lasso alone tends to keep just one variable from a correlated group, and which one it keeps is largely arbitrary.
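The behavior on correlated variables can be illustrated with an extreme case: two identical columns. This is a deliberately artificial sketch (the data, `alpha=0.1`, and `l1_ratio=0.5` are assumptions chosen for illustration). Lasso puts essentially all the weight on one copy, while Elastic Net splits it between the two.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 300
x = rng.standard_normal(n)
X = np.column_stack([x, x])  # two identical columns: perfectly correlated features
y = 2.0 * x + 0.1 * rng.standard_normal(n)

# Tight tolerance so both solvers run close to their optimum
lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100000).fit(X, y)

# Treat coefficients below 1e-6 in magnitude as zero to ignore rounding noise
n_active_lasso = int(np.sum(np.abs(lasso.coef_) > 1e-6))

print(f"Lasso coefficients:       {lasso.coef_}")
print(f"Elastic Net coefficients: {enet.coef_}")
print(f"Features kept by Lasso:   {n_active_lasso}")
```

With real data the correlation is rarely perfect, but the same tendency appears: small perturbations of the data can change which member of a correlated group Lasso keeps.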
Python Implementation — Complete Example
Installing Dependencies
pip install scikit-learn numpy matplotlib
Complete Code
# Step 1: Generating high-dimensional synthetic data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
np.random.seed(42)
n_samples = 200 # Number of observations
n_features = 100 # Total number of variables (high dimension)
n_informative = 5 # Only 5 variables are truly informative
# Creating the X matrix with independent standard normal features
X = np.random.randn(n_samples, n_features)
# Building the true coefficient vector: only 5 are non-zero
w_true = np.zeros(n_features)
w_true[:n_informative] = np.array([3.0, -2.5, 1.8, -1.2, 0.7])
# Generating target with Gaussian noise
noise = np.random.randn(n_samples) * 0.5
y = X @ w_true + noise
print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"Truly informative variables: {n_informative}")
print(f"True coefficients: {w_true[:n_informative]}")
# Step 2: Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Step 3: Fitting Lasso for different alpha values
alphas = np.logspace(-4, 2, 50) # 50 values from 0.0001 to 100
coef_paths = [] # Stores coefficients for each alpha
n_nonzero = [] # Number of non-zero coefficients
train_scores = [] # R2 score on train
test_scores = [] # R2 score on test
for alpha in alphas:
    # Creating and fitting the Lasso model
    lasso = Lasso(alpha=alpha, max_iter=10000, random_state=42)
    lasso.fit(X_train, y_train)
    # Storing results
    coef_paths.append(lasso.coef_)
    n_nonzero.append(np.count_nonzero(lasso.coef_))
    # Performance evaluation
    y_pred_train = lasso.predict(X_train)
    y_pred_test = lasso.predict(X_test)
    train_scores.append(r2_score(y_train, y_pred_train))
    test_scores.append(r2_score(y_test, y_pred_test))
coef_paths = np.array(coef_paths)
# Step 4: Visualization of regularization paths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left chart: coefficient evolution as a function of log(alpha)
ax1 = axes[0]
for j in range(n_informative):
    ax1.plot(np.log10(alphas), coef_paths[:, j], linewidth=2,
             label=f"Coefficient {j} (true = {w_true[j]})")
# Plotting non-informative variable coefficients (in gray)
for j in range(n_informative, n_features):
    ax1.plot(np.log10(alphas), coef_paths[:, j], color='gray',
             alpha=0.15, linewidth=0.5)
ax1.axvline(0, color='red', linestyle='--', alpha=0.7, label='alpha = 1')
ax1.set_xlabel('log10(alpha)', fontsize=12)
ax1.set_ylabel('Coefficient value', fontsize=12)
ax1.set_title('Regularization Paths - Lasso Regression', fontsize=13)
ax1.legend(fontsize=8, loc='upper left')
ax1.grid(True, alpha=0.3)
# Right chart: number of selected features vs alpha
ax2 = axes[1]
ax2.semilogx(alphas, n_nonzero, 'b-o', linewidth=2, markersize=4)
ax2.axhline(n_informative, color='red', linestyle='--',
            label=f'True count ({n_informative})')
ax2.set_xlabel('Alpha (logarithmic scale)', fontsize=12)
ax2.set_ylabel('Number of non-zero coefficients', fontsize=12)
ax2.set_title('Variable selection as a function of alpha', fontsize=13)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lasso_coefficient_paths.png', dpi=150, bbox_inches='tight')
plt.show()
# Step 5: LassoCV - automatic alpha selection by cross-validation
print("\n" + "="*60)
print("LassoCV - Automatic alpha selection by cross-validation")
print("="*60)
lasso_cv = LassoCV(
    # alphas is left at its default, so LassoCV builds its own grid of candidate values
    cv=5,  # 5-fold cross-validation
    max_iter=10000,
    random_state=42,
    n_jobs=-1,  # Using all CPU cores
)
lasso_cv.fit(X_train, y_train)
alpha_optimal = lasso_cv.alpha_
print(f"Optimal alpha selected: {alpha_optimal:.6f}")
print(f"Number of non-zero coefficients: {np.count_nonzero(lasso_cv.coef_)}")
print(f"Score R2 (train): {lasso_cv.score(X_train, y_train):.4f}")
print(f"Score R2 (test): {r2_score(y_test, lasso_cv.predict(X_test)):.4f}")
# Displaying estimated vs true coefficients
fig, ax = plt.subplots(figsize=(10, 5))
indices = np.arange(n_informative + 3) # First 8 features: the 5 informative ones plus 3 noise features
ax.bar(indices - 0.2, w_true[indices], width=0.4,
       label='True coefficients', color='steelblue', alpha=0.8)
ax.bar(indices + 0.2, lasso_cv.coef_[indices], width=0.4,
       label='LassoCV coefficients', color='coral', alpha=0.8)
ax.set_xticks(indices)
ax.set_xticklabels([f'Feature {i}' for i in indices])
ax.set_ylabel('Coefficient value', fontsize=12)
ax.set_title('Comparison: true vs LassoCV-estimated coefficients', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('lasso_coefficients_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
# Step 6: Detailed analysis - summary table
print("\n" + "="*60)
print("Summary table of coefficients for optimal alpha")
print("="*60)
print(f"{'Index':<8} {'True':>10} {'Lasso':>10} {'Selected':>14}")
print("-" * 44)
for j in range(n_features):
    selected = "Yes" if lasso_cv.coef_[j] != 0 else "No"
    print(f"{j:<8} {w_true[j]:>10.4f} {lasso_cv.coef_[j]:>10.4f} {selected:>14}")
# Step 7: Train/test performance curve
fig, ax = plt.subplots(figsize=(8, 5))
ax.semilogx(alphas, train_scores, 'b-o', markersize=3, linewidth=1.5,
            label='Score R2 (train)')
ax.semilogx(alphas, test_scores, 'r-s', markersize=3, linewidth=1.5,
            label='Score R2 (test)')
ax.axvline(alpha_optimal, color='green', linestyle='--',
           label=f'Optimal alpha ({alpha_optimal:.4f})')
ax.set_xlabel('Alpha (logarithmic scale)', fontsize=12)
ax.set_ylabel('Score R2', fontsize=12)
ax.set_title('Lasso performance as a function of alpha', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lasso_perf_vs_alpha.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nAnalysis complete!")
Step-by-Step Explanation
Steps 1 and 2: Synthetic data and split. We create a deliberately difficult problem: 200 observations for 100 variables, of which only 5 are informative. This is exactly the scenario in which Lasso excels: $p$ is large but the true model is sparse. We then hold out 30% of the observations as a test set.
Steps 3 and 4: Sweeping alphas. We test 50 values of $\alpha$ logarithmically spaced from 0.0001 to 100. For each value, we train a Lasso model and record the resulting coefficients, the number of selected features, and the $R^2$ scores. We then plot the regularization paths and the number of non-zero coefficients.
Step 5: LassoCV. Instead of choosing $\alpha$ manually, LassoCV automatically performs cross-validation to find the optimal value. This is the recommended method in practice.
Steps 6 and 7: Coefficient comparison and performance curve. The summary table shows for each variable the true coefficient, the coefficient estimated by LassoCV, and whether the variable was selected (non-zero coefficient). The final plot shows the train and test $R^2$ scores as a function of $\alpha$, with the value chosen by LassoCV marked.
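To see what LassoCV does when it picks $\alpha$, one can inspect its fitted attributes `alphas_` (the grid of candidates) and `mse_path_` (the validation error for each candidate and each fold). The sketch below regenerates data similar to the example above (an assumption made to keep the snippet self-contained) and recomputes the winning value by hand.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Data similar to the main example (regenerated here so the snippet stands alone)
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 100))
w_true = np.zeros(100)
w_true[:5] = [3.0, -2.5, 1.8, -1.2, 0.7]
y = X @ w_true + 0.5 * rng.standard_normal(200)

lasso_cv = LassoCV(cv=5, max_iter=10000).fit(X, y)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = lasso_cv.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_mse))

print(f"Grid size:            {len(lasso_cv.alphas_)}")
print(f"Folds:                {lasso_cv.mse_path_.shape[1]}")
print(f"Best alpha by hand:   {lasso_cv.alphas_[best_idx]:.6f}")
print(f"alpha_ from LassoCV:  {lasso_cv.alpha_:.6f}")
```

Plotting `mean_mse` against `lasso_cv.alphas_` on a logarithmic axis gives the usual U-shaped validation curve.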
Hyperparameters
| Hyperparameter | Role | Typical Values | Remarks |
|---|---|---|---|
| alpha | L1 penalty strength | 1e-4 to 10 | The larger alpha, the more the model forces coefficients to zero. |
| fit_intercept | Adds an intercept term | True (default), False | Leave on True unless data is already centered and intercept is unnecessary. |
| max_iter | Maximum number of iterations | 1000 to 10000 | Increase if optimization does not converge. |
| tol | Convergence tolerance | 1e-4 (default), 1e-6 | A lower value gives a more precise solution but increases computation time. |
| precompute | Uses a precomputed Gram matrix | False, True, matrix | Can speed up training when $n > p$. |
| selection | Coordinate choice in descent | 'cyclic', 'random' | 'random' can speed up convergence on certain datasets. |
Practical recommendation. In the vast majority of cases, use LassoCV rather than Lasso. Cross-validation reduces the risk of choosing a penalty that is too weak (overfitting) or too strong (underfitting).
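When tuning `max_iter` and `tol`, it helps to check whether the solver actually stopped early. A fitted `Lasso` exposes `n_iter_` (iterations run) and `dual_gap_` (the final duality gap). The sketch below uses assumed synthetic data and treats hitting the iteration cap as a hint of non-convergence, which is a heuristic rather than a guarantee.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(200)

max_iter = 10000
model = Lasso(alpha=0.05, max_iter=max_iter, tol=1e-6, selection="cyclic").fit(X, y)

# If the solver stopped before the cap, the tolerance criterion was met
converged = model.n_iter_ < max_iter

print(f"Iterations used: {model.n_iter_} (cap: {max_iter})")
print(f"Final dual gap:  {model.dual_gap_:.3e}")
print(f"Stopped early:   {converged}")
```

If `n_iter_` equals `max_iter`, scikit-learn normally also emits a `ConvergenceWarning`; raising `max_iter`, loosening `tol`, or standardizing the features are the usual remedies.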
Advantages and Limitations
Advantages
- Automatic variable selection — Lasso identifies and retains only the relevant features by forcing other coefficients to exactly zero. This is its major strength: a model that performs both prediction and variable selection.
- Interpretable models — By eliminating useless variables, Lasso produces simpler, easier-to-interpret models, which is crucial in scientific research and business decision-making.
- Computational efficiency — The coordinate descent algorithm used by scikit-learn is very fast, even for thousands of variables.
- Resistance to overfitting — L1 regularization reduces model variance, improving generalization ability, especially in high dimensions ($p > n$).
- No arbitrary threshold — Unlike stepwise selection methods, Lasso does not rely on arbitrary statistical thresholds (p-values). Selection emerges naturally from optimization.
Limitations
- Poor handling of correlated features — When several variables are strongly correlated, Lasso tends to select only one arbitrarily and cancel the others. This can lead to unstable selection: a slight change in data changes the chosen variable. To address this problem, Elastic Net combines L1 and L2 penalties.
- Selection limit of $n$ variables — Lasso can select at most $n$ variables (where $n$ is the number of observations). If $p \gg n$, Lasso saturates: it cannot go beyond $n$ selected features. Elastic Net lifts this limitation.
- Sensitivity to feature scaling: The L1 penalty applies the same $\alpha$ to every coefficient, and the magnitude of a coefficient depends on the units of its feature, so Lasso is sensitive to the scale of the variables. Standardize features (zero mean, unit variance) before applying Lasso.
- No closed-form analytical solution — Unlike Ridge, which admits a closed-form solution ($\hat{w} = (X^T X + \alpha I)^{-1} X^T y$), Lasso requires an iterative algorithm (coordinate descent), making it slightly slower.
- Estimation bias — Lasso systematically underestimates the magnitude of non-zero coefficients (shrinkage bias). For large coefficients, this bias can be significant. A two-step approach (Lasso for selection, then OLS on selected features) can mitigate this problem.
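Two of the limitations above, scale sensitivity and shrinkage bias, have simple practical workarounds that can be sketched together. The example below is a minimal illustration on assumed synthetic data with wildly different feature scales: a pipeline standardizes the features before LassoCV, then an unpenalized OLS model is refit on the selected columns only (the two-step approach mentioned above). The variable names and data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 40 features measured in very different units
rng = np.random.default_rng(0)
n_samples, n_features = 200, 40
scales = rng.uniform(0.1, 100.0, size=n_features)
X = rng.standard_normal((n_samples, n_features)) * scales
w_true = np.zeros(n_features)
w_true[:3] = np.array([3.0, -2.0, 1.5]) / scales[:3]  # effect of 3, -2, 1.5 per std
y = X @ w_true + 0.5 * rng.standard_normal(n_samples)

# Step 1: standardize, then let LassoCV choose alpha and the selected features
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000)).fit(X, y)
lasso = pipe.named_steps["lassocv"]
selected = np.flatnonzero(lasso.coef_)

# Step 2: refit plain OLS on the selected standardized columns to undo shrinkage
X_std = pipe.named_steps["standardscaler"].transform(X)
ols = LinearRegression().fit(X_std[:, selected], y)

l1_lasso = float(np.abs(lasso.coef_[selected]).sum())
l1_ols = float(np.abs(ols.coef_).sum())

print(f"Selected features: {selected.tolist()}")
print(f"L1 norm of Lasso coefficients on the support: {l1_lasso:.3f}")
print(f"L1 norm of refit OLS coefficients:            {l1_ols:.3f}")
```

The refit coefficients are larger in magnitude because they are no longer penalized. Note that standard errors and p-values from the second-stage OLS ignore the selection step, so they should be read with caution.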
Use Cases
1. Genomics and Molecular Biology
This is the flagship application of Lasso. In genomics, the expression of tens of thousands of genes ($p \approx 20,000$) is measured on a limited number of patients ($n \approx 100$ to $500$). Most genes have no connection to the disease being studied. Lasso automatically identifies the small subset of genes relevant for prediction (e.g., tumor status or treatment response). This ultra-high-dimensional selection capability makes Lasso a standard tool in bioinformatics.
2. Natural Language Processing (NLP) and Text Classification
Text vectorization (TF-IDF, Bag-of-Words) produces extremely sparse representations with tens of thousands of words/features per document. An L1 penalty (for classification, usually applied through L1-regularized logistic regression rather than Lasso regression itself) helps identify the most discriminating terms for a task such as spam/ham detection, sentiment analysis, or thematic categorization, while ignoring irrelevant vocabulary. The resulting model is compact and interpretable: you can directly read which words contribute to the prediction.
3. Quantitative Finance and Credit Scoring
In credit scoring, financial institutions have hundreds of descriptive variables on each borrower (income, banking history, debt, age, occupation, etc.). Lasso helps build parsimonious default prediction models, retaining only the variables that truly contribute predictive information. These models are more robust, easier to audit by regulators, and less prone to overfitting.
4. Econometrics and Social Sciences
In econometrics, researchers often work with a large number of potential covariates (demographic, geographic, temporal) and need to identify the real determinants of an economic phenomenon. Lasso, used as a pre-selection tool, helps build more reliable econometric models by eliminating spurious variables. Variants such as adaptive Lasso and post-selection Lasso are commonly used in modern econometric literature.

