Boosting (AdaBoost)


In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones. (Wikipedia)

Boosting

Today, I would like to introduce the Boosting method in Machine Learning. Basically, Boosting is a set of algorithms (classifiers) that turn weak learners into strong learners. It is one of the ensemble methods for improving the predictions of any given learning algorithm. However, unlike a regular ensemble method, where we group several models and use them in parallel, Boosting trains a single type of model sequentially, with different weights at each step.

Boosting also uses random sampling with replacement. It starts training models from Sample 1 to Sample N. In each round of training, there will be well-classified data and misclassified data. Since sampling with replacement is allowed, some of the misclassified data may or may not appear in later samples.

Boosting Algorithm

Since each previous model affects the current model by assigning higher weights to the data it couldn't classify correctly, this is not parallel - it's sequential. At the end, it uses all trained models, each with a different weight, to generate the output. One of the disadvantages of Boosting is that it is easily corrupted by outliers, because it will assign heavy weights to those outliers and they will mislead the models.
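As an aside, you rarely have to hand-roll this loop in practice; for example, scikit-learn ships an AdaBoost implementation. The snippet below is just an illustration of the sequential, weighted ensemble idea on a synthetic placeholder dataset, not part of the worked example later in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic placeholder data, only to show the API
X, y = make_classification(n_samples=200, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 weak learners, trained sequentially
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy of the weighted ensemble
```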

AdaBoost (Adaptive Boosting)

Adaboost flow chart

AdaBoost works by putting more weight on data points that are difficult to classify and less on those that are already handled well.

In AdaBoost, we use something called Decision Stumps, the simplest model we could construct on the data. A stump splits the data into two subsets based on a single feature. To find the best decision stump, we lay out every feature of the data along with every possible threshold and look for the one that gives us the best accuracy, as sketched in the code below.
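Here is a rough Python sketch of that exhaustive stump search, assuming the data sits in a NumPy array `X`, the labels `y` are ±1, and each point has a weight `w` (with uniform weights this reduces to plain accuracy); the helper name `best_stump` is my own, not from the original post.

```python
import numpy as np

def best_stump(X, y, w):
    """Try every feature, every observed threshold, and both polarities;
    keep the stump with the smallest weighted error."""
    best = None
    for j in range(X.shape[1]):                      # each feature
        for thr in np.unique(X[:, j]):               # each candidate threshold
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= thr, polarity, -polarity)
                err = np.sum(w * (pred != y))        # weighted misclassification
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best  # (error, feature index, threshold, polarity)
```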

In this example, I will construct AdaBoost from scratch with some mathematical explanation.

Example

X1 = (−1,0,+), X2 = (−0.5,0.5,+)
X3 = (0,1,−), X4 = (0.5,1,−)
X5 = (1,0,+), X6 = (1,−1,+)
X7 = (0,−1,−), X8 = (0,0,−)
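For later reference, the same eight points written as NumPy arrays (the encoding is my own choice: + as +1 and − as −1):

```python
import numpy as np

X = np.array([[-1.0,  0.0], [-0.5,  0.5], [0.0,  1.0], [0.5,  1.0],
              [ 1.0,  0.0], [ 1.0, -1.0], [0.0, -1.0], [0.0,  0.0]])
y = np.array([+1, +1, -1, -1, +1, +1, -1, -1])   # labels of X1 ... X8
```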

figure1

How to?

Constructing $D_t$: $D_t(i)$ is basically the weight of the $i^{th}$ data point.

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{\alpha_t}$ if $y_i \ne h_t(x_i)$,

where $Z_t$ is the normalization constant, $Z_t = \sum\limits_{i=1}^{m} D_t(i)$,

and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.

$\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}(y_i \ne h_t(x_i))$, where the indicator $\mathbb{1}(y_i \ne h_t(x_i))$ is 1 if $y_i \ne h_t(x_i)$ and 0 otherwise.
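To make these update rules concrete, here is a minimal Python sketch of one boosting round. The function name, the 0/1 `miss` indicator input, and the NumPy layout are my own choices, not from the original post; the example values at the bottom anticipate the first iteration worked out below.

```python
import numpy as np

def boost_round(D, miss):
    """One round of the weight update above.

    D    : current weights D_t(i)
    miss : 1 where h_t misclassified the point, 0 where it was correct
    """
    eps = np.sum(D * miss)                       # epsilon_t
    alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t
    Z = np.sum(D)                                # Z_t = sum of D_t
    D_next = D / Z * np.exp(np.where(miss == 1, alpha, -alpha))
    return eps, alpha, D_next

# First iteration of the worked example below:
D1 = np.full(8, 1 / 8)
miss1 = np.array([0, 1, 0, 0, 0, 0, 1, 1])       # h_1 gets x2, x7, x8 wrong
eps1, alpha1, D2 = boost_round(D1, miss1)        # 0.375, ~0.255, new weights ~0.097 / ~0.161
```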

Step 1

Put down an initial, random decision stump, a.k.a. classifier $h_1$.

figure2

h1 is a randomly assigned decision stump.
Let's set $D_1(i) = \frac{1}{m}$ for every $i$, where $m = 8$.

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |

Now, in a second row, I will put 1 if $h_1$ classifies $x_i$ incorrectly, else 0.

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |

$h_1$ has classified $x_1, x_3, x_4, x_5, x_6$ correctly.

Once we have $D_1(i)$ and the result of the classification by $h_1$, we can calculate $\epsilon_1$, where $\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}(y_i \ne h_t(x_i))$.

In a third row, I will put all the values $D_1(i)\,\mathbb{1}(y_i \ne h_1(x_i))$, and if we take the summation of these values, we get $\epsilon_1$.

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |

Let's get $\alpha_1$ and $Z_1$. Remember, $Z_t$ is the normalization constant $\sum_i D_t(i)$, and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
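Plugging the numbers from the table into those formulas: $\epsilon_1 = 0.125 + 0.125 + 0.125 = 0.375$, so $\alpha_1 = \frac{1}{2}\ln\left(\frac{1-0.375}{0.375}\right) \approx 0.255$, and $Z_1 = \sum_i D_1(i) = 8 \times 0.125 = 1.000$.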

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |

If you look at the table above, you will notice that d2, d7, and d8 have more weight, 0.13, because $h_1$ failed to classify them correctly. Based on this information, we can move on to the second iteration. In order to calculate $D_2(i)$ we need some information beyond the table above.

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{\alpha_t}$ if $y_i \ne h_t(x_i)$

We need to know what $e^{-\alpha_1}$ and $e^{\alpha_1}$ are.

$e^{-\alpha_1}=0.775$

$e^{\alpha_1}=1.291$
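To see where the next set of weights comes from, take one point $h_1$ classified correctly and one it missed: $D_2(1) = \frac{0.125}{1.000} \times 0.775 \approx 0.097$ and $D_2(2) = \frac{0.125}{1.000} \times 1.291 \approx 0.161$.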

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |

We've got new weights for the data. d1 was 0.125; it has become 0.097. d2 has become 0.161 from 0.125 because $h_1$ failed to classify this data point. Let's get $h_2$ - I was focusing on having all + in the same group.

figure3

If you do the same calculation we've done above, you will get this table:

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |

From the table, $\epsilon_2 = 0.161 + 0.161 \approx 0.323$ and $\alpha_2 = \frac{1}{2}\ln\left(\frac{1-0.323}{0.323}\right) \approx 0.371$, so:

$e^{-\alpha_2}=0.690$

$e^{\alpha_2}=1.449$

Lastly, let’s do the final iteration.

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 |  |  |  | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |

figure4

The reasoning behind $h_3$ is to have all of the o's (the − points) in the same group. Also, d5 and d6 have never been misclassified, so I want my model to capture the information in d5 and d6. And the result will be:

| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|-----|-------|---|----|----|----|----|----|----|----|----|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 | 0.138 | 0.916 | 0.943 | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |
|   |  |  |  | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0.07 | 0.07 | 0 | 0 |
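Filling in the last row of the table: $\epsilon_3 = D_3(5) + D_3(6) = 0.069 + 0.069 = 0.138$, $\alpha_3 = \frac{1}{2}\ln\left(\frac{1-0.138}{0.138}\right) \approx 0.916$, and $Z_3 = \sum_i D_3(i) \approx 0.943$.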

Graphically, it will look like:

figure5

Lastly, the final classifier is:

$H_{final}(x)=\text{sign}(0.255\,h_1(x)+0.371\,h_2(x)+0.916\,h_3(x))$
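To make the final vote concrete, here is a small Python sketch. I cannot read the exact thresholds off the figures, so the three stumps below are just one choice that reproduces the misclassification patterns in the tables above; treat them as an assumption rather than the stumps actually drawn.

```python
import numpy as np

# Assumed stumps, consistent with the tables above (the figures may use different thresholds):
def h1(x):
    return 1 if x[1] < 0.5 else -1     # + when the second coordinate is below 0.5

def h2(x):
    return 1 if x[1] < 1.0 else -1     # + when the second coordinate is below 1

def h3(x):
    return 1 if x[0] < -0.25 else -1   # + when the first coordinate is below -0.25

def H_final(x):
    """Sign of the alpha-weighted vote of the three stumps."""
    return np.sign(0.255 * h1(x) + 0.371 * h2(x) + 0.916 * h3(x))

print(H_final(np.array([-1.0, 0.0])))  # X1 -> 1.0, i.e. classified as +
```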
