Bayesian Regression: Complete Guide — Principles, Examples, and Python Implementation
Summary — Bayesian regression is a probabilistic approach to linear regression that applies Bayes’ theorem to estimate a complete distribution over the model weights, rather than a single point estimate. Unlike classical regression which returns a unique value for each coefficient, Bayesian regression provides a posterior distribution that allows quantifying the uncertainty of each prediction. This guide covers the fundamental theory, geometric intuition, a complete Python implementation with scikit-learn, and concrete use cases.
Mathematical principle of Bayesian regression
Bayes’ theorem as foundation
At the heart of Bayesian regression lies Bayes’ theorem, which allows inverting a conditional probability. In the context of regression, we seek to know the distribution of weights w given the observed data D = (X, y):
P(w | D) = P(D | w) × P(w) / P(D)
where:
- P(w | D) is the posterior distribution — what we believe about the weights after observing the data.
- P(D | w) is the likelihood — the probability of observing the data given particular weights.
- P(w) is the prior distribution — what we believe about the weights before seeing the data.
- P(D) is the marginal probability (or evidence), serving as a normalization constant.
Gaussian likelihood model
We assume that observations follow a linear model with Gaussian noise:
y = Xw + ε, where ε ~ N(0, α⁻¹ I)
The likelihood is then written as the product of N independent Gaussians, each centered on the linear prediction wᵀxᵢ with variance α⁻¹.
Gaussian prior on weights (Ridge regularization)
The most common choice for the prior distribution is a Gaussian centered at zero:
P(w | λ) = N(w | 0, λ⁻¹ I)
This Gaussian prior has a major consequence: the Maximum A Posteriori (MAP) under this prior is exactly equivalent to Ridge regression. This means that classical Ridge regression can be interpreted as a MAP point estimate in a Bayesian framework with a Gaussian prior. The Ridge regularization parameter λ corresponds to the inverse of the prior variance on the weights.
Analytic posterior
Thanks to the conjugacy between the Gaussian likelihood and the Gaussian prior, the posterior distribution admits a closed analytic form:
p(w | X, y) = N(w | m_N, S_N)
with:
S_N = (α XᵀX + λ I)⁻¹
m_N = α S_N Xᵀy
This result is fundamental: we obtain not only a central estimate of the weights (m_N), but also a covariance matrix (S_N) that measures the uncertainty on each coefficient and their correlations.
Predictive distribution with epistemic variance
For a new input x_*, the predictive distribution is obtained by integrating over the posterior of the weights:
p(y_ | x_, X, y) = ∫ p(y_ | x_, w) · p(w | X, y) dw
Which gives:
p(y_ | x_, X, y) = N(y_ | m_Nᵀ x_, σ*²)
where the predictive variance decomposes into two terms:
σ² = α⁻¹ + x_ᵀ S_N x_*
- The first term (α⁻¹) represents the inherent noise in the data (aleatoric variance).
- The second term (x_ᵀ S_N x_) represents the epistemic uncertainty — the one that decreases as more data is added.
It is this ability to separate the two sources of uncertainty that makes Bayesian regression so powerful.
Intuition: why estimate a distribution rather than a point?
In classical regression (ordinary least squares or Ridge), we obtain a single weight vector ŵ. It’s as if we were saying: “The coefficients are exactly these values, period.” But this approach has a major flaw: it says nothing about the confidence we can have in these estimates.
Bayesian regression adopts a radically different philosophy. Instead of estimating a single point for the weights, it estimates a complete distribution over the space of possible weights. This distribution captures everything the model “knows” and everything it “doesn’t know.”
Imagine you are trying to guess the location of buried treasure. The classical approach would give you a single coordinate: “The treasure is here.” The Bayesian approach would rather say: “There is a 68% chance it is within this 5-meter circle, a 95% chance within this 10-meter circle.” The second answer is infinitely more useful for making informed decisions.
The more data we have, the more the posterior distribution tightens around the true value of the weights. With an infinite amount of data, the posterior converges to a Dirac mass centered on the true parameters. It is this convergence property that guarantees the consistency of the Bayesian approach.
The geometric intuition is also illuminating: in weight space, the Gaussian prior penalizes weights of large norm (like Ridge), but the posterior explores a volume around the maximum, rather than a single peak. Regions where the data is uninformative retain high variance, clearly signaling to the practitioner: “Be careful, I’m not sure here.”
Complete Python implementation
Installing dependencies
pip install scikit-learn numpy matplotlib seaborn
Comparison BayesianRidge vs Ridge
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Style configuration
sns.set_style("whitegrid")
sns.set_palette("deep")
# 1. Data generation
random_state = 42
X, y, coef_true = make_regression(
n_samples=150,
n_features=1,
n_informative=1,
noise=15.0,
coef=True,
random_state=random_state
)
# Add heteroscedastic noise to illustrate uncertainty
np.random.seed(random_state)
y += np.random.normal(0, np.abs(X).flatten() * 2, size=y.shape)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=random_state
)
# 2. Training — Bayesian regression
bayesian_ridge = BayesianRidge(
n_iter=300,
tol=1e-3,
alpha_1=1e-6,
alpha_2=1e-6,
lambda_1=1e-6,
lambda_2=1e-6,
compute_score=True,
fit_intercept=True
)
bayesian_ridge.fit(X_train, y_train)
# 3. Training — Ridge (for comparison)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# 4. Predictions with uncertainty (BayesianRidge only)
X_plot = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
y_pred_bayes, sigma_bayes = bayesian_ridge.predict(X_plot, return_std=True)
y_pred_ridge = ridge.predict(X_plot)
# 95% uncertainty bands (μ ± 2σ)
y_lower = y_pred_bayes - 2 * sigma_bayes
y_upper = y_pred_bayes + 2 * sigma_bayes
# 5. Evaluation
y_pred_test_bayes, sigma_test = bayesian_ridge.predict(X_test, return_std=True)
y_pred_test_ridge = ridge.predict(X_test)
print("=" * 60)
print("BAYESIAN REGRESSION VS RIDGE")
print("=" * 60)
print(f"BayesianRidge — MSE: {mean_squared_error(y_test, y_pred_test_bayes):.4f}")
print(f"BayesianRidge — R²: {r2_score(y_test, y_pred_test_bayes):.4f}")
print(f"Ridge — MSE: {mean_squared_error(y_test, y_pred_test_ridge):.4f}")
print(f"Ridge — R²: {r2_score(y_test, y_pred_test_ridge):.4f}")
print(f"\nEstimated weight (Bayes): {bayesian_ridge.coef_}")
print(f"Estimated weight (Ridge): {ridge.coef_}")
print(f"True weight: {coef_true}")
print(f"\nEstimated alpha (noise precision): {bayesian_ridge.alpha_:.4f}")
print(f"Estimated lambda (weight precision): {bayesian_ridge.lambda_:.4f}")
print(f"Evidence score (log marginal): {bayesian_ridge.scores_[-1]:.4f}")
# 6. Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Plot 1: Predictions with uncertainty bands
axes[0].scatter(X_train, y_train, color="steelblue", alpha=0.6, label="Training data", zorder=5)
axes[0].scatter(X_test, y_test, color="coral", alpha=0.6, label="Test data", zorder=5)
axes[0].plot(X_plot, y_pred_bayes, color="navy", linewidth=2, label="Mean prediction (Bayes)")
axes[0].fill_between(
X_plot.flatten(), y_lower, y_upper,
color="navy", alpha=0.15, label="95% interval (μ ± 2σ)"
)
axes[0].set_title("Bayesian regression — Uncertainty", fontweight="bold")
axes[0].set_xlabel("X")
axes[0].set_ylabel("y")
axes[0].legend(fontsize=8)
# Plot 2: Bayes vs Ridge comparison
axes[1].scatter(X_train, y_train, color="steelblue", alpha=0.4, zorder=5)
axes[1].plot(X_plot, y_pred_bayes, color="navy", linewidth=2, label="BayesianRidge")
axes[1].plot(X_plot, y_pred_ridge, color="crimson", linewidth=2, linestyle="--", label="Ridge")
axes[1].set_title("BayesianRidge vs Ridge", fontweight="bold")
axes[1].set_xlabel("X")
axes[1].set_ylabel("y")
axes[1].legend(fontsize=8)
# Plot 3: Evidence score evolution
axes[2].plot(bayesian_ridge.scores_, color="navy", linewidth=2)
axes[2].set_title("Evidence score (log P(y|X))", fontweight="bold")
axes[2].set_xlabel("Iteration")
axes[2].set_ylabel("Log marginal likelihood")
axes[2].axhline(y=bayesian_ridge.scores_[-1], color="crimson", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.savefig("bayesian_ridge_results.png", dpi=150, bbox_inches="tight")
plt.show()
# 7. Using ARDRegression (Automatic Relevance Determination)
from sklearn.linear_model import ARDRegression
ard = ARDRegression(
n_iter=300,
tol=1e-3,
alpha_1=1e-6,
alpha_2=1e-6,
lambda_1=1e-6,
lambda_2=1e-6,
threshold_lambda=10_000,
fit_intercept=True
)
ard.fit(X_train, y_train)
y_pred_ard = ard.predict(X_test)
print(f"\nARDRegression — MSE: {mean_squared_error(y_test, y_pred_ard):.4f}")
print(f"ARDRegression — R²: {r2_score(y_test, y_pred_ard):.4f}")
print(f"ARD coefficients: {ard.coef_}")
Key code points
return_std=True: exclusive method ofBayesianRidgethat returns the predictive standard deviation σ for each point. This is what allows plotting uncertainty bands.compute_score=True: records the evidence score (log marginal likelihood) at each iteration. Useful for diagnosing convergence.alpha_1, alpha_2, lambda_1, lambda_2: hyperparameters of the Gamma prior distributions on the precisions α and λ. Values close to zero correspond to non-informative priors.- ARDRegression: variant that assigns one λ_j per coefficient, enabling Automatic Relevance Determination — irrelevant variables see their coefficient tend toward zero.
Hyperparameters of BayesianRidge
| Hyperparameter | Type | Default | Role |
|---|---|---|---|
n_iter |
int | 300 | Maximum number of iterations of the iterative update algorithm |
tol |
float | 1e-3 | Tolerance for the stopping criterion — convergence when the relative change in evidence is less than tol |
alpha_1 |
float | 1e-6 | Shape parameter of the Gamma prior distribution on α (noise precision) |
alpha_2 |
float | 1e-6 | Scale parameter of the Gamma prior distribution on α |
lambda_1 |
float | 1e-6 | Shape parameter of the Gamma prior distribution on λ (weight precision) |
lambda_2 |
float | 1e-6 | Scale parameter of the Gamma prior distribution on λ |
compute_score |
bool | False | If True, computes and stores the log-marginal evidence score at each iteration |
fit_intercept |
bool | True | If True, fits a bias term separately from the weights |
How to choose hyperparameters
The parameters alpha_1, alpha_2, lambda_1, and lambda_2 control the Gamma priors on the precisions. By default, with very small values (1e-6), we obtain weakly informative priors that let the data speak. If you have strong prior knowledge about the noise level or the magnitude of the coefficients, you can increase these values.
For n_iter, 300 iterations are generally sufficient. Monitor the evidence curve: if it plateaus before 300 iterations, the model has converged. You can reduce n_iter to speed up training.
Advantages and Limitations
Advantages
- Uncertainty quantification — This is the main advantage. Each prediction comes with a natural confidence interval. In safety, medicine, or finance, knowing “I’m not sure” is sometimes as important as the prediction itself.
- Automatic regularization — The hyperparameters α and λ are learned automatically from the data via marginal likelihood maximization (Type-II Maximum Likelihood). No need for expensive cross-validation to find the right regularization parameter.
- Rigorous probabilistic interpretation — Unlike frequentist methods that rely on asymptotic approximations, the Bayesian framework provides an exact interpretation (under the model assumptions).
- Equivalence with Ridge at MAP — We recover the classical L2 regularization guarantees while benefiting from full uncertainty quantification.
- Robustness with little data — The prior naturally regularizes the model, preventing overfitting even when the number of samples is small relative to the number of features.
Limitations
- Computational cost — Computing the posterior requires inverting a matrix of size (D × D) where D is the number of features. This requires O(D³) operations, which becomes prohibitive beyond a few thousand dimensions.
- Gaussian assumption — The closed-form result relies on the assumption of a Gaussian prior and Gaussian likelihood. If the data has heavy tails, significant outliers, or non-linear relationships, the model may be poorly calibrated.
- Linear model — Like all linear regressions, Bayesian regression does not capture complex non-linear interactions without explicit feature engineering (polynomials, kernels, etc.).
- Sensitivity to hyperprior choice — Although α and λ are learned automatically, the hyperparameters of the Gamma priors can influence results, especially with little data.
- No native variable selection — Unlike Lasso, standard Bayesian regression does not force coefficients to zero. To obtain sparsity,
ARDRegressionmust be used, which assigns an individual prior to each coefficient.
Concrete use cases
1. Uncertainty modeling in medicine
In medical diagnosis or risk prediction, it is crucial to know the margin of error of a prediction. Bayesian regression can be used to predict a health indicator (for example, blood glucose levels) from clinical variables, while providing a confidence interval. A wide interval signals to the physician that the prediction is uncertain and that further examination is needed.
Example: Prediction of post-operative recovery time. A wide 95% confidence interval allows planning hospital resources with an appropriate safety margin.
2. Data with few samples (small data)
In many fields — pharmaceutical research, aerospace engineering, rare climate studies — the number of observations is limited. Bayesian regression excels in this regime thanks to its prior which acts as a natural regularizer. Where a classical regression quickly overfits, Bayesian regression maintains reasonable estimates by relying on prior knowledge encoded in the priors.
Example: Predicting the performance of a new material from only 30 laboratory experiments, incorporating an informative prior from physical simulations.
3. A/B testing and experimentation
In an A/B testing context, we want not only to estimate the effect of a treatment (the difference between groups A and B) but also to quantify the certainty of that estimate. Bayesian regression naturally provides this information through the posterior distribution of the coefficient associated with the treatment variable.
Example: Measuring the impact of a new feature on user session time. The posterior distribution of the coefficient directly gives the probability that the effect is positive (e.g., 97% chance of a positive effect).
4. Bayesian exploration and optimization
Bayesian regression is often used as a surrogate model in Bayesian optimization. The epistemic variance guides exploration toward regions of the search space where the model is most uncertain, allowing an optimal balance between exploration and exploitation.
Example: Hyperparameter optimization of a neural network. The Bayesian model predicts performance and identifies the most promising parameter combinations to evaluate next.
See also
- Calculating the Sum of Products in Python: Complete Guide and Practical Tips
- Exploring $k$-Markov Numbers with Python: Complete Guide for Algorithms and Programming

