Machine Learning: Bagging and Boosting

Rahul S
3 min read · Sep 6, 2023

Bagging and boosting are both ensemble machine learning techniques that combine the predictions of many base models (often decision trees) to perform better than any single one of them.

However, they differ in their approach and the way they handle training data and errors. Let’s explore the differences between bagging and boosting with detailed examples:

BAGGING (BOOTSTRAP AGGREGATION)

  1. Bootstrap Sampling: In bagging, multiple subsets of the training dataset are created through bootstrap sampling: for each subset, we randomly sample data points with replacement from the original dataset. Each subset is typically the same size as the original dataset, so it contains some duplicate data points and omits others (on average, roughly 63% of the distinct original points appear in a given sample).
  2. Parallel Training: Each subset is used to train a separate base model. Because the base models do not depend on one another, they can be trained independently and in parallel. Since each subset is slightly different, the base models capture different patterns and make different errors.
  3. Voting or Averaging: During inference, all base models make individual predictions. The final prediction is determined through majority voting (for classification) or averaging (for regression). This ensemble approach helps reduce variance and can improve model stability.
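
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy dataset, the one-threshold "stump" used as the base model, and the helper names (`bootstrap_sample`, `fit_stump`, `bagged_predict`) are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) points from data, with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Toy base model: the threshold on x with the fewest training errors.
    The stump predicts 1 when x > threshold, otherwise 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_predict(thresholds, x):
    """Step 3: majority vote over all base models."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)

# Hypothetical toy data: the label is 1 when x > 5, with 10% label noise.
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    label = int(x > 5)
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

# Step 2: train one stump per bootstrap sample. The loop runs sequentially
# here, but no iteration depends on another, so it could run in parallel.
n_models = 25
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]

# Fraction of distinct original points in one bootstrap sample
# (typically around 0.63 for a sample of this size).
unique_fraction = len(set(bootstrap_sample(data, rng))) / len(data)

print(bagged_predict(thresholds, 9.0), bagged_predict(thresholds, 1.0))
print(round(unique_fraction, 2))
```

An odd number of models is used so that a two-class vote cannot tie. For regression, the vote would be replaced by the mean of the base models' predictions.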

Example:

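Random Forest: A Random Forest builds multiple decision trees, each trained on a bootstrapped subset of the data, and combines their predictions to make the final decision. It also considers only a random subset of the features at each split, which makes the trees less correlated with one another. It is less prone to overfitting than a single decision tree and can handle high-dimensional data well.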
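
As a rough illustration, the comparison below fits a single decision tree and a Random Forest with scikit-learn. The synthetic dataset and every parameter value are assumptions chosen for the sketch, and the exact accuracies will depend on the data and the library version. Note that scikit-learn's `RandomForestClassifier` also samples a random subset of features at each split (`max_features`), on top of the bootstrap sampling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A single, fully grown decision tree as the baseline.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees, each fit on a bootstrap sample (bootstrap=True is the default),
# with a random subset of features considered at each split.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"single tree: {tree_acc:.3f}  random forest: {forest_acc:.3f}")
```

On data like this, the forest usually scores higher on the held-out set than the single tree, which is the variance reduction that bagging aims for.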

BOOSTING

  1. Weighted Training Data: In boosting, the focus is on sequentially improving the performance of base models. Initially, all data points are assigned equal weights. A first base model is trained on the data, and its performance is evaluated.
  2. Weight Adjustment: In subsequent iterations, more weight is given to data points that were misclassified in previous iterations. This means that the algorithm focuses on correcting errors made by earlier models. Each base model is trained to perform better on the data points…
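
One round of the reweighting described above can be sketched as follows. The update rule shown is the one used by AdaBoost, which is an assumption here since the text does not name a specific boosting algorithm. The labels and predictions are made-up toy values.

```python
import math

# Labels and one base model's predictions, both in {-1, +1} (toy values).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, 1, 1]  # the fourth point is misclassified

# Start with equal weights on every data point.
n = len(y_true)
weights = [1.0 / n] * n

# Weighted error: total weight of the misclassified points.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# AdaBoost's alpha: this model's say in the final vote (larger for lower error).
alpha = 0.5 * math.log((1 - error) / error)

# t * p is +1 for a correct prediction and -1 for a mistake, so this
# shrinks the weights of correct points and grows those of mistakes.
updated = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]

# Renormalise so the weights sum to 1 again.
total = sum(updated)
weights = [w / total for w in updated]

print([round(w, 3) for w in weights])  # [0.125, 0.125, 0.125, 0.5, 0.125]
```

After the update, the single misclassified point carries half of the total weight, so the next base model is pushed to get that point right.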
