Stacking: Model Stacking

Stacking: Combining Models for Superior Predictions

Summary

Stacking (or Stacked Generalization, also called model stacking) is an advanced ensemble learning technique that combines the predictions of several base models (called level-zero models) using a meta-model (or level-one model). Unlike Bagging, which trains the same algorithm on bootstrap resamples of the data, or Boosting, which corrects errors sequentially, Stacking exploits algorithmic diversity: each base model learns in its own way, and the meta-model learns how to weight their contributions. This approach often outperforms any individual model, at the cost of increased computational complexity.


Mathematical Principles of Stacking

Stacking operates on two distinct hierarchical levels, each with a precise role in the prediction chain.

Level 0: K Base Models

Suppose we have a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ with $n$ observations. We select $K$ different base models:

$$h_1, h_2, \dots, h_K$$

Each model $h_k$ is a distinct algorithm — for example a Random Forest, SVM, Logistic Regression, Naive Bayes, or Gradient Boosting. Each model is trained (potentially with different hyperparameters) on the training set.

The crucial point of Stacking is that each model’s out-of-fold predictions are used to build a new level-one dataset. This prevents overfitting that would occur if predictions on the training data themselves were used.

Cross-Validation for Out-of-Fold Predictions

To construct meta-features without data leakage, the process is as follows:

  1. Split $\mathcal{D}$ into $J$ folds $\mathcal{F}_1, \mathcal{F}_2, \dots, \mathcal{F}_J$.
  2. For each fold $j$ and each model $h_k$:
    – Train $h_k$ on $\mathcal{D} \setminus \mathcal{F}_j$ (all data except fold $j$).
    – Predict only on $\mathcal{F}_j$.
  3. Concatenate all out-of-fold predictions to form the meta-features.

Mathematically, for each observation $x_i$, we obtain a meta-feature vector:

$$z_i = [h_1^{-i}(x_i), h_2^{-i}(x_i), \dots, h_K^{-i}(x_i)]$$

where $h_k^{-i}$ denotes model $h_k$ trained on every fold except the one containing observation $i$. This vector $z_i \in \mathbb{R}^K$ captures how each base model views observation $x_i$.
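The fold loop described above can be sketched by hand. This is a minimal illustration, not scikit-learn's internal implementation: the synthetic dataset, the three base models, and the names `Z`, `folds`, and `fold_model` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and base models (assumptions, not taken from the article).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
base_models = [
    LogisticRegression(max_iter=1000),
    GaussianNB(),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]

J = 5
folds = StratifiedKFold(n_splits=J, shuffle=True, random_state=0)

# Z[i, k] will hold model k's out-of-fold probability of class 1 for row i.
Z = np.full((X.shape[0], len(base_models)), np.nan)

for train_idx, held_out_idx in folds.split(X, y):
    for k, model in enumerate(base_models):
        fold_model = clone(model)  # fresh, unfitted copy for this fold
        fold_model.fit(X[train_idx], y[train_idx])  # train on D \ F_j
        # Predict only on the held-out fold F_j.
        Z[held_out_idx, k] = fold_model.predict_proba(X[held_out_idx])[:, 1]

print(Z.shape)
```

Every row of `Z` is filled exactly once, by a model that never saw that row during training, which is what keeps the meta-features free of leakage.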

Level 1: The Meta-Model

The meta-model $g$ is then trained on the transformed set $\{(z_i, y_i)\}_{i=1}^{n}$:

$$\hat{y}_i = g(z_i) = g(h_1^{-i}(x_i), h_2^{-i}(x_i), \dots, h_K^{-i}(x_i))$$

The role of the meta-model is to learn how best to combine the base model predictions. A linear meta-model learns one weight per base model, so a consistently reliable model receives more weight. A non-linear meta-model can go further and learn, for example, that $h_1$ should be trusted when it is confident while $h_2$ is the better guide elsewhere. This learned weighting is more flexible than a simple average or a fixed weighting.

In classification, meta-features can be either predicted labels (discrete) or predicted probabilities (continuous). Using probabilities is generally more informative, as the meta-model then has a confidence measure for each prediction.
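As a small illustration of the two options, scikit-learn's `StackingClassifier` exposes a `stack_method` parameter that selects which base-model output becomes the meta-feature. The sketch below uses an illustrative dataset and two arbitrary base models; the helper `meta_features` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
estimators = [
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
]


def meta_features(stack_method):
    """Fit a stacker and return the meta-features its meta-model would see."""
    clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(),
        stack_method=stack_method,
        cv=5,
    )
    clf.fit(X, y)
    return clf.transform(X)


Z_labels = meta_features("predict")        # discrete class labels
Z_probas = meta_features("predict_proba")  # continuous probabilities

print("distinct label values:", np.unique(Z_labels))
print("distinct probability values:", len(np.unique(Z_probas)))
```

With labels, the meta-model only sees hard votes; with probabilities, it also sees how confident each base model was.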

At Prediction Time

During the testing phase, a new observation $x_{\text{new}}$ first passes through all base models:

$$z_{\text{new}} = [h_1(x_{\text{new}}), h_2(x_{\text{new}}), \dots, h_K(x_{\text{new}})]$$

Then the meta-model produces the final prediction:

$$\hat{y}_{\text{new}} = g(z_{\text{new}})$$
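Putting the two levels together, the training and prediction phases might look like the following sketch. It assumes, as scikit-learn's stacking estimators do, that each base model is refitted on the full training set before being used on new data. The dataset, the two base models, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

base_models = [GaussianNB(), DecisionTreeClassifier(max_depth=4, random_state=0)]

# Training phase: out-of-fold probabilities form the meta-features z_i.
# cross_val_predict clones each model, so base_models stay unfitted here.
Z_train = np.column_stack(
    [
        cross_val_predict(h, X_train, y_train, cv=5, method="predict_proba")[:, 1]
        for h in base_models
    ]
)
g = LogisticRegression().fit(Z_train, y_train)  # level-1 meta-model

# Refit every base model on the full training set for use at prediction time.
for h in base_models:
    h.fit(X_train, y_train)

# Prediction phase: z_new = [h_1(x_new), ..., h_K(x_new)], then y_hat = g(z_new).
Z_new = np.column_stack([h.predict_proba(X_new)[:, 1] for h in base_models])
y_hat_new = g.predict(Z_new)

print(Z_train.shape, Z_new.shape, y_hat_new.shape)
```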


Intuition: The Jury of a Singing Competition

Imagine a singing competition jury, with several judges of very different profiles. Each judge evaluates contestants according to their own criteria:

  • The classical musician judge (our Random Forest): scores technique, vocal accuracy, rhythm mastery. Excellent for evaluating technical solidity but sometimes too rigid.
  • The singer-songwriter judge (our RBF kernel SVM): sensitive to subtleties, nuances, originality. Detects complex patterns others miss, but can be temperamental on noisy data.
  • The hit producer judge (our Naive Bayes): gets straight to the point, focuses on a few key indicators. Sometimes simplistic, but surprisingly effective and fast.

Now imagine a head judge (our Logistic Regression as meta-model). This head judge doesn’t listen to the contestants directly — they look at the scores of the three judges and learn from experience:

  • When the classical musician judge gives 8/10 and the singer-songwriter 9/10, the contestant is almost always excellent.
  • The producer often gets it wrong when the singer-songwriter is very enthusiastic but the classical musician has mixed feelings.
  • When all three agree, the head judge can be about 95% confident.

This is exactly Stacking: the meta-model isn’t a simple average calculator. It’s a strategist that learns the strengths and weaknesses of each model and combines their opinions to make the best possible decision.


Python Implementation with Scikit-Learn

Scikit-learn provides StackingClassifier (for classification) and StackingRegressor (for regression) since version 0.22. Here is a complete, commented implementation.

from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Generate synthetic dataset
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    n_classes=2,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Define base models (level 0)
estimators = [
    (
        "rf",
        RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42),
    ),
    (
        "svm",
        make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)),
    ),
    ("nb", GaussianNB()),
]

# 3. Create StackingClassifier
stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000, random_state=42),
    cv=5,  # 5-fold cross-validation for out-of-fold predictions
    n_jobs=-1,  # Parallelize across all cores
    passthrough=False,  # Original features are not passed to the meta-model
)

# 4. Train and evaluate
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)

print("=== Stacking Classifier ===")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

# 5. Compare with individual models
print("\n=== Individual model comparison ===")
for name, estimator in estimators:
    estimator.fit(X_train, y_train)
    y_pred_i = estimator.predict(X_test)
    acc_i = accuracy_score(y_test, y_pred_i)
    print(f"{name}: {acc_i:.4f}")

# 6. Cross-validation
cv_scores = cross_val_score(stacking_clf, X_train, y_train, cv=5, scoring="accuracy")
print(f"\n5-fold Cross-validation: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

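# 7. Meta-model weight analysis
# In this binary task each base model contributes one probability column,
# so coef_[0] holds exactly one weight per base model.
meta_weights = stacking_clf.final_estimator_.coef_[0]
for (name, _), weight in zip(estimators, meta_weights):
    print(f"Meta-model weight for {name}: {weight:.4f}")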

Comparing Ensemble Approaches

It’s instructive to compare Stacking with a single Random Forest and with soft Voting:

# VotingClassifier (soft voting: unweighted average of predicted probabilities)
voting_clf = VotingClassifier(
    estimators=estimators, voting="soft"  # soft = average of probabilities
)

# Complete comparison
models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Voting (Soft)": voting_clf,
    "Stacking": stacking_clf,
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")

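Stacking often matches or outperforms Voting because it learns the combination weights from data instead of using a fixed average, and it often outperforms individual models because it exploits the complementarity between algorithms. Neither gain is guaranteed: when the base models make highly correlated errors, the three scores can be nearly identical.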


Key Hyperparameters of StackingClassifier

Configuring Stacking is crucial to achieve good performance. Here are the main hyperparameters to master.

estimators

This is the list of base models (level 0). The choice of these models is the most important factor:

  • Diversity above all: use fundamentally different algorithms (trees, linear models, SVM, Bayesian). Three different Random Forests won’t add much more value than a single one.
  • Minimum quality: each model should be at least reasonably performant. A catastrophic model can harm the meta-model.
  • 3 to 7 models is a good range. Too few = not enough diversity. Too many = risk of overfitting and computational explosion.

final_estimator

The meta-model (level 1). For StackingClassifier the default is LogisticRegression, and this is generally an excellent choice:

  • Logistic Regression: simple, interpretable, fast. Its coefficients reveal the relative importance of each base model.
  • Linear models for regression tasks: StackingRegressor uses RidgeCV by default.
  • Random Forest or Gradient Boosting as meta-model: possible to capture non-linear interactions between predictions, but beware of overfitting.
  • The practical rule: a simple meta-model is often enough. If the base models are already complex, a complex meta-model risks overfitting on their predictions.
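For the regression case, a minimal sketch with `StackingRegressor` shows the default meta-model in action. The dataset and the two base regressors are illustrative choices; `final_estimator` is deliberately left unset so that the library default is used.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

reg = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=30, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    cv=5,
)  # final_estimator is not set, so the default meta-model is used
reg.fit(X, y)

# The fitted meta-model and its one-weight-per-base-model coefficients.
print(type(reg.final_estimator_).__name__)
print(reg.final_estimator_.coef_)
```

Because the meta-model is linear, its coefficients can be read the same way as in the classification case: one weight per base model.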

cv

The number of cross-validation folds for generating out-of-fold predictions:

  • cv=5: good default compromise.
  • cv=10: more robust but slower.
  • cv=3: faster, useful for initial exploration.
  • The smaller the dataset, the more folds are needed so that each level-0 model has enough training data.
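Besides an integer, `cv` also accepts a cross-validation splitter object, which gives control over shuffling and reproducibility. The sketch below is one possible configuration on illustrative data; the choice of 10 shuffled, stratified folds is an assumption made for the example, not a recommendation from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Explicit splitter: 10 stratified folds, shuffled with a fixed seed.
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

clf = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=splitter,  # used only to build the out-of-fold meta-features
)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print(f"Training accuracy: {train_accuracy:.4f}")
```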

n_jobs

Controls parallelism:

  • n_jobs=-1: uses all available cores.
  • n_jobs=1: sequential execution (useful for debugging).
  • Stacking is inherently parallelizable because the base models are independent.

passthrough

This hyperparameter determines whether original features are added to the meta-features:

  • passthrough=False (default): the meta-model sees only the base model predictions.
  • passthrough=True: the meta-model sees both the base model predictions and the original raw features. This can be useful if the base models don’t capture all the information, but it increases dimensionality and the risk of overfitting.
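The effect of `passthrough` is easiest to see in the shape of the meta-model's input. The following sketch uses an illustrative binary dataset with 10 features and two base models; `build_stacker` is a hypothetical helper introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)


def build_stacker(passthrough):
    """Return an unfitted stacker with two base models (illustrative choice)."""
    return StackingClassifier(
        estimators=[
            ("nb", GaussianNB()),
            ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
        passthrough=passthrough,
    )


# transform() returns exactly what the meta-model receives as input.
Z_only = build_stacker(passthrough=False).fit(X, y).transform(X)
Z_plus_X = build_stacker(passthrough=True).fit(X, y).transform(X)

print(Z_only.shape)    # one probability column per base model
print(Z_plus_X.shape)  # same columns plus the 10 original features
```

In this binary setting the meta-model goes from 2 inputs to 12, which illustrates why the option should be weighed against the size of the training set.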

Advantages and Limitations of Stacking

Advantages

  1. Superior performance: Stacking often achieves higher scores than its best individual model, as it exploits algorithmic complementarity. In Kaggle competitions, stacked ensembles regularly dominate the leaderboards.
  2. Increased robustness: By combining models with different biases and variances, Stacking can reduce both bias (through algorithmic diversity) and variance (through aggregation), although neither improvement is guaranteed.
  3. Total flexibility: You can mix any algorithms, such as Random Forest, SVM, neural networks, Naive Bayes, KNN, and XGBoost. There is no restriction on the nature of the base models.
  4. Spatial adaptivity: Unlike Voting, which applies the same fixed weights everywhere, the meta-model learns how much to trust each model, and a non-linear meta-model (or passthrough=True) can let that trust vary across regions of the feature space. It’s second-order intelligence.
  5. Partial interpretability: The meta-model’s coefficients (if it’s a linear model) reveal the relative importance of each base model, offering a useful view of ensemble dynamics.

Limitations

  1. Computational complexity: Generating out-of-fold predictions for $K$ models over $J$ folds requires $K \times J$ trainings. For 5 models and 5 folds, that’s 25 trainings, before counting the final refit of each base model on the full training set and the fit of the meta-model. This is significantly slower than a single model.
  2. Risk of overfitting of the meta-model: If the base models are too correlated or the meta-model is too complex, the ensemble can overfit. Rigorous cross-validation is essential.
  3. Difficulty of global interpretation: Although the meta-model’s weights are readable, understanding why the ensemble made a specific decision remains complex — it’s a second-order “black box”.
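  4. Production maintenance: In production, you need to deploy and maintain $K+1$ models. If the base models run in parallel, prediction latency is that of the slowest base model plus the meta-model; if they run sequentially, it is the sum of all their latencies.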
  5. Sensitivity to base model quality: If all base models share the same systematic bias (for example, they all fail on the same types of observations), Stacking cannot correct this bias.
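To make the training cost concrete, the count of model fits can be written out as a small function. This sketch assumes the scheme used by scikit-learn's stacking estimators, where each base model is fitted once per fold, refitted once on the full training set, and followed by a single meta-model fit. The function name `stacking_fit_count` is hypothetical.

```python
def stacking_fit_count(n_base_models: int, n_folds: int) -> int:
    """Count model fits for one stacking run under the stated assumptions."""
    fold_fits = n_base_models * n_folds  # K x J fits for out-of-fold predictions
    full_fits = n_base_models            # each base model refitted on all data
    meta_fits = 1                        # the meta-model itself
    return fold_fits + full_fits + meta_fits


# 5 base models, 5 folds: 25 fold fits + 5 refits + 1 meta-model fit.
print(stacking_fit_count(5, 5))  # 31
```

Wrapping the stacker in an outer `cross_val_score` with `cv=5` multiplies this count by 5 again, which is worth keeping in mind when budgeting experiments.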

4 Concrete Use Cases

1. Machine Learning Competitions (Kaggle)

Stacking is a classic weapon in competitions. Top participants often build ensembles of 5 to 15 heterogeneous models (XGBoost, LightGBM, CatBoost, neural networks, regularized linear models) and stack them via a meta-model. The idea is simple: each algorithm has its blind spots, and combining them fills their mutual gaps. Victories on competitive platforms frequently rely on multi-level Stacking architectures (meta-stacking where the meta-model itself is stacked).

2. Medical Diagnostic Assistance

In the medical domain, reliability is paramount. A stacked ensemble can combine:

  • A model based on tabular data (blood tests, medical history)
  • A computer vision model (X-ray or MRI analysis)
  • A natural language processing model (medical report analysis)

Each model provides a complementary perspective. The meta-model learns to trust the imaging model more for visually obvious pathologies, and the text model for complex diagnoses requiring analysis of described symptoms. In medicine, missing nothing is often more important than maximizing overall accuracy, so the meta-model can be tuned to favor recall, for example by lowering its decision threshold or by weighting the positive class more heavily.
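One simple way to favor recall is to lower the decision threshold applied to the stacked model's predicted probability. The sketch below uses a synthetic imbalanced dataset as a stand-in for real clinical data; the thresholds 0.5 and 0.3 are arbitrary illustrative values, and in practice the threshold would be chosen on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 20% positives (illustrative assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], flip_y=0.05, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
clf.fit(X_train, y_train)

# Probability of the positive class from the meta-model.
proba = clf.predict_proba(X_test)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.3f}")
```

Lowering the threshold can only keep or increase recall, at the price of more false positives, so the trade-off has to be judged against the cost of each kind of error.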

3. Financial Fraud Detection

Fraud detection presents an extreme class imbalance (less than 1% of fraudulent transactions). A stacked ensemble can combine:

  • An Isolation Forest for anomaly detection
  • A weighted XGBoost for imbalanced classification
  • A neural network to capture complex temporal patterns

The meta-model learns to dynamically weight these signals: when Isolation Forest detects an anomaly AND XGBoost classifies it as fraud, confidence is maximal. When only the neural network is alerted, the meta-model can moderate its response. This approach reduces false positives while maintaining a high detection rate.

4. Customer Churn Prediction

Telecom and SaaS companies fight churn. A stacked ensemble can bring together:

  • A model based on usage history (frequency, duration, features used)
  • A model based on customer service (number of complaints, satisfaction, resolution time)
  • A model based on demographic and contractual data (tenure, plan, price)

The meta-model discovers subtle interactions: for example, a long-satisfied customer who starts reducing usage AND recently contacted customer service presents a very high risk, even if each model individually only flags it moderately.


Best Practices for Stacking

  1. Check error correlation: use a correlation matrix of predictions between base models. If two models are correlated above 0.9, one of them is likely redundant.
  2. Start simple: begin with 3 very different models and Logistic Regression as the meta-model. Add complexity only if the gains justify it.
  3. Keep a strict validation set: never reuse the level-0 cross-validation data to evaluate final performance. Set aside a completely independent test set.
  4. Monitor training time: training a stacked ensemble can take a long time. Use n_jobs=-1 and consider fast base models if time is a constraint.
  5. Consider passthrough=True cautiously: it can improve performance but significantly increases dimensionality and overfitting risk.
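The correlation check from the first practice can be sketched as follows. The example correlates the out-of-fold predicted probabilities of three illustrative base models and lists any pair above the 0.9 threshold mentioned above; the dataset, models, and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
]

# One column of out-of-fold class-1 probabilities per base model.
oof = np.column_stack(
    [
        cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
        for _, model in estimators
    ]
)

# Pearson correlation between the columns (rowvar=False: columns are variables).
corr = np.corrcoef(oof, rowvar=False)
names = [name for name, _ in estimators]

# Pairs of base models whose predictions are correlated above 0.9.
redundant_pairs = [
    (names[a], names[b], corr[a, b])
    for a in range(len(names))
    for b in range(a + 1, len(names))
    if corr[a, b] > 0.9
]

print(np.round(corr, 3))
print(redundant_pairs)
```

A pair that shows up in `redundant_pairs` is a candidate for pruning: dropping one of the two models usually saves training time without costing much diversity.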

Conclusion

Stacking represents one of the pinnacles of ensemble learning. By intelligently stacking heterogeneous models via a meta-model trained on out-of-fold predictions, it exploits algorithmic diversity to achieve often unmatched performance. Although it is more complex and slower than a single model, its gains in accuracy, robustness, and adaptivity make it an indispensable tool in the data scientist’s toolkit.

Whether you’re preparing for a Kaggle competition, a medical diagnostic system, or a churn prediction model, Stacking offers a rigorous methodology to get the best from your algorithms — and ensure that the whole is truly greater than the sum of its parts.

