Voting Ensemble: Majority Voting of Models

Voting Ensemble: The Complete Guide to Majority Voting in Machine Learning

Summary

Voting Ensemble is one of the most intuitive and effective ensemble methods in machine learning. Rather than relying on a single model, this approach combines the predictions of several distinct classifiers using a voting mechanism. Each model “votes” for a predicted class, and the class that receives the most votes becomes the final prediction of the Voting Ensemble. Two strategies exist: hard voting, based on predicted labels, and soft voting, based on the probabilities estimated by each model. This simple method often improves performance, provided the combined models are sufficiently diverse.

Mathematical Principles

The Voting Ensemble rests on two fundamental mathematical formulations, each corresponding to a different voting strategy.

Hard Voting

In hard voting, each classifier $f_i$ produces a discrete class prediction for an observation $x$. The final class is determined by the majority rule:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{n} \mathbb{I}(f_i(x) = c)$$

Where:
– $n$ is the number of classifiers in the ensemble
– $f_i(x)$ is the prediction of the $i$-th classifier for observation $x$
– $c$ ranges over all possible classes
– $\mathbb{I}(f_i(x) = c)$ is the indicator function equal to 1 if classifier $i$ predicts class $c$, and 0 otherwise
– $\hat{y}$ is the final predicted class, the one that receives the most votes

In case of a tie, the result depends on the tie-breaking rule of the implementation. Scikit-learn's VotingClassifier, for example, selects the tied class that comes first in ascending sort order of the class labels.
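As a minimal sketch of the majority rule above, the following pure-Python function counts votes and returns the winning class. The name `hard_vote` and the tie-breaking rule (smallest label wins, chosen here to mirror ascending sort order) are illustrative assumptions, not a library API.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class with the most votes among the classifiers' labels.

    Ties are resolved in favour of the smallest label (an assumption made
    for this sketch; other implementations may break ties differently).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    # Keep every class that reached the top count, then pick the smallest.
    return min(c for c, k in counts.items() if k == top)

print(hard_vote([2, 0, 2]))  # class 2 holds two of the three votes
print(hard_vote([1, 0]))     # tie between 0 and 1, smallest label wins
```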

Soft Voting

Soft voting is more refined: instead of counting discrete votes, it averages the probabilities estimated by each model:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{n} w_i \cdot p_i(c)$$

Where:
– $p_i(c)$ is the probability estimated by model $i$ for class $c$
– $w_i$ is the weight assigned to model $i$ (by default $w_i = 1$ for all models)
– $\hat{y}$ is the class with the highest weighted sum of probabilities (dividing by $\sum_i w_i$ gives the weighted average and does not change the winning class)

Soft voting often performs better because it exploits the fine-grained probabilistic information of each model, whereas hard voting retains only the final label. The benefit depends on the probabilities being reasonably well calibrated.
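To make the difference between the two rules concrete, here is a small pure-Python sketch of the soft-voting formula. The function name `soft_vote` and the probability values are hypothetical, chosen so that one confident model overturns two hesitant ones.

```python
def soft_vote(probas, weights=None):
    """Return the index of the class with the highest weighted probability sum.

    probas: one list of class probabilities per model.
    weights: one weight per model (defaults to 1 for every model).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    n_classes = len(probas[0])
    scores = [
        sum(w * p[c] for w, p in zip(weights, probas))
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=scores.__getitem__)

# Hypothetical outputs of three models on a two-class problem.
probas = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]

hard_labels = [max(range(2), key=p.__getitem__) for p in probas]
print(hard_labels)        # per-model labels are [0, 1, 1], so hard voting picks 1
print(soft_vote(probas))  # summed scores are 1.75 vs 1.25, so soft voting picks 0
```

Passing `weights=[1, 3, 3]` to the same call shifts the scores to 3.45 vs 3.55 and flips the result back to class 1, which shows how the weights $w_i$ enter the decision.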

Why it works: Condorcet’s Theorem

The Voting Ensemble relies on a fundamental statistical principle. If each individual classifier has accuracy better than random ($p > 0.5$ for a binary problem) and if the models’ errors are sufficiently independent, then the accuracy of the ensemble grows with the number of models. This is the analog of Condorcet’s jury theorem in social choice theory.

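More precisely, for an odd number $n$ of independent classifiers, each with accuracy $p$ on a binary problem, the probability that the majority votes correctly is:

$$P(\text{correct}) = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} p^k (1-p)^{n-k}$$

The lower bound $\lfloor n/2 \rfloor + 1$ is the smallest strict majority (for odd $n$ it equals $\lceil n/2 \rceil$). This probability tends toward 1 exponentially fast as $n$ increases, provided $p > 0.5$. For example, with $p = 0.6$ it is $0.648$ for $n = 3$.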
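The binomial sum above is easy to evaluate directly. The sketch below is an illustration only: `majority_accuracy` is a hypothetical helper name, and the independence assumption rarely holds exactly for real classifiers.

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a strict majority of n independent voters is correct."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Accuracy of the majority for a growing (odd) number of voters with p = 0.6.
for n in (1, 3, 5, 11, 51):
    print(n, round(majority_accuracy(n, 0.6), 4))
```

With $p = 0.6$, the values rise from 0.6 for a single voter to 0.648 for three voters and keep climbing as $n$ grows.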

Intuition: the medical consultation

Imagine you consult a doctor for a difficult diagnosis. A single practitioner, however competent, can be wrong. Now imagine you bring together three independent doctors, each with their own specialty and experience. Each doctor examines the symptoms and proposes their diagnosis. If two out of three doctors arrive at the same conclusion, you have much more confidence than in the opinion of a single doctor.

This is exactly the principle of the Voting Ensemble:

  • Each model is an expert with its strengths and weaknesses
  • Errors are independent: one model errs where another succeeds
  • The majority is more reliable: individual errors cancel each other out

The medical analogy illustrates why combined models must be diversified. Three doctors trained in the same school, with the same approach, risk making the same mistakes. Similarly, three Random Forests trained on the same data with the same hyperparameters will vote almost identically, which removes most of the benefit of the ensemble.

The key to success therefore lies in complementarity: combining fundamentally different algorithms (a decision tree, a logistic regression, an SVM) that capture distinct aspects of the data structure.
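The role of diversity can be checked with a small simulation. This is a simplified sketch under stated assumptions: a binary task, three models that are each right 70% of the time, and two extreme scenarios (errors fully shared or fully independent). The function `simulate` and its parameters are hypothetical.

```python
import random

def simulate(n_models, p, n_samples, shared_errors, seed=0):
    """Estimate majority-vote accuracy on a binary task.

    shared_errors=True: every model is wrong on exactly the same samples.
    shared_errors=False: each model errs independently of the others.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        if shared_errors:
            hit = rng.random() < p
            hits = [hit] * n_models
        else:
            hits = [rng.random() < p for _ in range(n_models)]
        # The ensemble is right when more than half of the models are right.
        if sum(hits) > n_models / 2:
            correct += 1
    return correct / n_samples

same = simulate(3, 0.7, 20_000, shared_errors=True)
diverse = simulate(3, 0.7, 20_000, shared_errors=False)
print(f"identical errors:   {same:.3f}")
print(f"independent errors: {diverse:.3f}")
```

With shared errors the ensemble stays near the individual accuracy of 0.70, while independent errors push it toward the theoretical value of about 0.784. Real ensembles fall somewhere between these two extremes.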

Python Implementation with Scikit-Learn

Basic Setup with VotingClassifier

The scikit-learn library provides the VotingClassifier class which implements both voting strategies. Here is a complete implementation comparing individual models and the ensemble:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import seaborn as sns

# ─── Data generation ───
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ─── Create individual classifiers ───
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)
svm_model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))

# ─── VotingClassifier — Hard Voting ───
voting_hard = VotingClassifier(
    estimators=[
        ('rf', rf),
        ('lr', lr),
        ('svm', svm_model)
    ],
    voting='hard'
)

# ─── VotingClassifier — Soft Voting ───
voting_soft = VotingClassifier(
    estimators=[
        ('rf', rf),
        ('lr', lr),
        ('svm', svm_model)
    ],
    voting='soft'
)

# ─── Train and evaluate ───
models = {
    'Random Forest': rf,
    'Logistic Regression': lr,
    'SVM': svm_model,
    'Voting Hard': voting_hard,
    'Voting Soft': voting_soft
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    results[name] = {'test_acc': acc, 'cv_acc': cv_score}
    print(f"{name:25s} — Test: {acc:.4f} | CV 5-fold: {cv_score:.4f}")

# ─── Visual comparison ───
names = list(results.keys())
test_accs = [results[n]['test_acc'] for n in names]
cv_accs = [results[n]['cv_acc'] for n in names]

x = np.arange(len(names))
width = 0.35

fig, ax = plt.subplots(figsize=(12, 6))
bars1 = ax.bar(x - width/2, test_accs, width, label='Test', color='steelblue')
bars2 = ax.bar(x + width/2, cv_accs, width, label='CV 5-fold', color='coral')

ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Comparison: Individual Models vs Voting Ensemble', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(names, rotation=15, ha='right')
ax.legend()
ax.set_ylim(0.7, 1.0)

for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('voting_ensemble_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Custom Weight Voting

When some models are significantly more performant than others, it makes sense to give them more influence in the vote. Scikit-learn allows specifying weights via the weights parameter:

# ─── Weighted Voting — Random Forest has more weight ───
voting_weighted = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)))
    ],
    voting='soft',
    weights=[3, 1, 2]  # RF ×3, LR ×1, SVM ×2
)

voting_weighted.fit(X_train, y_train)
y_pred_weighted = voting_weighted.predict(X_test)
acc_weighted = accuracy_score(y_test, y_pred_weighted)
print(f"Weighted Voting — Test Accuracy: {acc_weighted:.4f}")

# ─── Individual vote analysis (transparency) ───
# Show each model's predictions for a sample
for i in range(5):
    votes = [model.predict(X_test[[i]])[0] for model in models.values()]
    print(f"Sample {i}: votes = {votes}, true = {y_test[i]}")

Detailed Performance Comparison

# ─── Detailed classification report ───
print("\n" + "="*60)
print("CLASSIFICATION REPORT — VOTING SOFT")
print("="*60)
print(classification_report(y_test, voting_soft.predict(X_test)))

# ─── Confusion matrix ───
cm = confusion_matrix(y_test, voting_soft.predict(X_test))
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Class 0', 'Class 1', 'Class 2'],
            yticklabels=['Class 0', 'Class 1', 'Class 2'])
plt.xlabel('Prediction', fontsize=12)
plt.ylabel('Ground Truth', fontsize=12)
plt.title('Confusion Matrix — Voting Ensemble (soft)', fontsize=14)
plt.tight_layout()
plt.savefig('voting_confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

# ─── Cross-validation of different voting strategies ───
cv_results = {}
for voting_type in ['hard', 'soft']:
    vc = VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
            ('lr', LogisticRegression(max_iter=1000, random_state=42)),
            ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)))
        ],
        voting=voting_type
    )
    scores = cross_val_score(vc, X_train, y_train, cv=5, scoring='accuracy')
    cv_results[voting_type] = {'mean': scores.mean(), 'std': scores.std()}
    mean_val = scores.mean()
    std_val = scores.std()
    print(f"{voting_type} voting: {mean_val:.4f} (+/- {std_val*2:.4f})")

Hyperparameters of VotingClassifier

The VotingClassifier object of scikit-learn exposes several essential hyperparameters:

estimators

Type: list of tuples (name, estimator)

This is the central parameter. It defines the list of classifiers that make up the ensemble. Each estimator is a tuple containing a textual identifier and the model instance.

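from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

estimators = [
    ('random_forest', RandomForestClassifier(n_estimators=200)),
    ('logistic_reg', LogisticRegression(max_iter=2000)),
    ('svm', SVC(probability=True)),  # probability=True enables predict_proba
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('gbm', GradientBoostingClassifier(n_estimators=100))
]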

Recommendations:
– Include at minimum 3 classifiers to benefit from the voting effect
– Prioritize algorithmic diversity: combine different families (trees, linear, kernels, k-nearest neighbors)
– Avoid redundant models that would systematically vote the same way
– All classifiers must support predict; soft voting additionally requires predict_proba

voting

Type: string, 'hard' or 'soft'

  • 'hard': each model votes for a class, the majority class wins. Works with all classifiers.
  • 'soft': weighted average of probabilities. Often more accurate, but requires that all classifiers implement predict_proba.

When to choose which:
– Soft voting is preferable when models produce well-calibrated probabilities
– Hard voting is appropriate when some models don’t support predict_proba (e.g., SVC without probability=True, LinearSVC, some custom classifiers)
– In practice, soft voting often outperforms hard voting on standard classification problems, but the gain is not guaranteed, so compare both strategies by cross-validation
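The choice rule above can be automated with a simple duck-typing check. The helper `choose_voting` is a hypothetical name for this sketch; it only inspects whether each estimator exposes predict_proba and says nothing about how well calibrated the probabilities are.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

def choose_voting(estimators):
    """Return 'soft' if every estimator exposes predict_proba, else 'hard'."""
    all_proba = all(hasattr(est, "predict_proba") for _, est in estimators)
    return "soft" if all_proba else "hard"

with_proba = [("lr", LogisticRegression()), ("nb", GaussianNB())]
without_proba = [("lr", LogisticRegression()), ("lsvc", LinearSVC())]

print(choose_voting(with_proba))     # both models provide predict_proba
print(choose_voting(without_proba))  # LinearSVC does not, so fall back to hard voting
```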

weights

Type: list of floats or None

Allows assigning different importance to each classifier. The default is None, which gives every classifier the same weight.

# Random Forest counts double, Logistic Regression counts single, SVM counts 1.5 times
weights=[2, 1, 1.5]

Tuning strategy:
– Start with equal weights (default value)
– Adjust weights based on individual cross-validation performance
– Use grid search (GridSearchCV) to optimize automatically:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'weights': [[1, 1, 1], [2, 1, 1], [1, 2, 1], [1, 1, 2],
                [3, 1, 2], [2, 1, 3], [1, 2, 3], [2, 3, 1]]
}

grid = GridSearchCV(
    VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
            ('lr', LogisticRegression(max_iter=1000, random_state=42)),
            ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)))
        ],
        voting='soft'
    ),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid.fit(X_train, y_train)
print(f"Best weights: {grid.best_params_['weights']}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")

n_jobs

Type: integer or None

Number of jobs to run in parallel when fitting the individual classifiers. None generally means a single job, and -1 uses all available cores. This can speed up training considerably when the individual classifiers are expensive.

flatten_transform

Type: boolean, default True

Controls the shape of the array returned by transform when voting='soft' (useful when VotingClassifier is used as a transformer in a pipeline). Does not affect prediction performance.
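The effect of flatten_transform is easiest to see from the shapes returned by transform. The following sketch uses a small synthetic dataset and two arbitrary classifiers chosen for illustration; the shapes in the comments follow the scikit-learn documentation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)), ("nb", GaussianNB())]

flat = VotingClassifier(estimators, voting="soft", flatten_transform=True).fit(X, y)
stacked = VotingClassifier(estimators, voting="soft", flatten_transform=False).fit(X, y)
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

print(flat.transform(X).shape)     # (n_samples, n_classifiers * n_classes)
print(stacked.transform(X).shape)  # (n_classifiers, n_samples, n_classes)
print(hard.transform(X).shape)     # (n_samples, n_classifiers): one label per model
```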

Advantages of Voting Ensemble

Conceptual Simplicity

Voting Ensemble is probably the easiest ensemble method to understand and explain. Unlike stacking which requires a meta-model, or boosting which proceeds through sequential iterations, majority voting relies on an immediately understandable principle. This transparency is a major asset in regulated domains where explainability is required.

No Additional Training Phase

Unlike Stacking (article 087) which requires training a meta-model on base model predictions, Voting Ensemble requires no additional training. Once the individual classifiers are trained, the assembly is instantaneous. This reduces computation time and the risk of meta-model overfitting.

Robustness to Overfitting

By combining models with different biases, Voting Ensemble tends to smooth predictions. Models that overfit on specific patterns in the training set have less influence when their predictions are contradicted by other ensemble members.

Maximum Flexibility

Any combination of classifiers is possible. You can mix algorithms from radically different families: tree forests, support vector machines, logistic regressions, neural networks, naive Bayesian classifiers. This heterogeneity is the primary source of performance gain.

Good Baseline Performance

Even without fine-tuning weights, a well-designed Voting Ensemble often matches or outperforms each of its individual members. It’s an excellent starting point before exploring more complex ensemble methods.

Limitations and Pitfalls to Avoid

Critical Dependence on Diversity

If classifiers are too similar (same algorithmic family, same hyperparameters, same data), they will make the same errors and Voting Ensemble will provide no gain. This is the most common beginner mistake: combining three similar Random Forests is pointless.

Multiplied Computational Cost

Training $n$ classifiers costs roughly the sum of their individual training costs, and every prediction must also query all $n$ models. For large datasets or expensive algorithms (such as SVMs on large sets), this cost can become prohibitive.

Need for predict_proba for Soft Voting

Soft voting requires all classifiers to implement predict_proba. Some popular models, such as the standard SVM (SVC without probability=True) or LinearSVC, do not provide it natively, so they must be configured specifically or the ensemble must fall back to hard voting.

Less Powerful than Stacking

Voting Ensemble treats all models symmetrically (except explicit weighting). Stacking, on the other hand, learns to combine models through a meta-learner, which can capture more subtle interactions between predictions. On complex problems, Stacking often outperforms Voting.

Sensitivity to Weak Models

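A very high-performing classifier (95% accuracy) can be “diluted” by two mediocre classifiers (60% accuracy). With hard voting, the two mediocre models outvote the strong one whenever they agree with each other. On a binary problem with independent errors, the three-model majority is correct only about 81.6% of the time, well below the 95% of the best model alone. It is crucial that each model significantly outperforms random guessing and that no member is far weaker than the others.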
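The dilution effect can be quantified by enumerating every right/wrong pattern across the models. This sketch assumes a binary task and fully independent errors, which is a simplification; `majority_accuracy` is a hypothetical helper name.

```python
from itertools import product

def majority_accuracy(accuracies):
    """Exact majority-vote accuracy on a binary task with independent errors."""
    n = len(accuracies)
    total = 0.0
    # Enumerate every right/wrong pattern across the models.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for right, acc in zip(outcome, accuracies):
            prob *= acc if right else 1 - acc
        if sum(outcome) > n / 2:  # a strict majority of models is correct
            total += prob
    return total

print(round(majority_accuracy([0.95, 0.60, 0.60]), 3))  # 0.816, below the best model's 0.95
```

Replacing the two weak models with 90% ones raises the same computation to about 0.981, above the best individual model, which is why screening out weak members matters.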

4 Concrete Use Cases

Case 1: Bank Fraud Detection

In a fraud detection context, both precision and recall are critical. A Voting Ensemble combining a business-rule-based model, a Gradient Boosting, and a neural network can capture known patterns (business rules), complex non-linear interactions (Gradient Boosting), and abstract representations (neural network). Requiring agreement between models can reduce false positives, a major issue for customer experience.

Case 2: Medical Diagnostic Assistance

For a diagnostic support system, different analysis modalities (radiological image analysis, biological data, patient history) are handled by specialized models. Voting Ensemble aggregates their conclusions with transparency: you know exactly how many models voted for each diagnosis. This explainability is essential in the medical domain, where every decision must be justifiable.

Case 3: Multi-Source Sentiment Analysis

In natural language processing (NLP), combining a BERT-based model, a TF-IDF model with logistic regression, and a sentiment lexicon-based model yields robust results. The BERT model captures deep semantic context, TF-IDF detects strong keywords, and the lexicon identifies explicit polarities. Voting Ensemble harmonizes these complementary perspectives.

Case 4: Customer Churn Prediction

A telecommunications company wants to predict subscriber churn. A Voting Ensemble combining a Random Forest (excellent for tabular data with interactions), a Logistic Regression (interpretable for the marketing team), and a Gradient Boosting (raw performance) provides both reliable predictions and a basis for understanding churn factors through the diversity of underlying models.

Best Practices to Maximize Performance

  1. Select complementary models: aim for at least 3 distinct algorithmic families
  2. Validate each model individually: discard classifiers whose accuracy is close to random
  3. Prefer soft voting when all models support predict_proba
  4. Optimize weights through cross-validation when some models are clearly superior
  5. Use nested cross-validation to avoid optimism bias in evaluation
  6. Calibrate probabilities (CalibratedClassifierCV) before soft voting for more reliable estimates
  7. Document votes: recording each model’s predictions for each sample allows diagnosing systematic biases
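Practice 6 can be sketched as follows. The example wraps a LinearSVC, which has no predict_proba of its own, in CalibratedClassifierCV so that it can take part in soft voting. The dataset, the choice of models, and the hyperparameters are arbitrary assumptions made for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# LinearSVC has no predict_proba; the calibration wrapper adds one.
calibrated_svm = CalibratedClassifierCV(
    make_pipeline(StandardScaler(), LinearSVC()), method="sigmoid", cv=3
)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", calibrated_svm),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

proba = ensemble.predict_proba(X_test)
print(proba.shape, round(ensemble.score(X_test, y_test), 3))
```

The same wrapper can be applied to models that already expose predict_proba when their probabilities are suspected to be poorly calibrated.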

See Also