In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones (Wikipedia).
Boosting
Today, I would like to introduce the Boosting method in Machine Learning. Basically, Boosting is a family of algorithms (classifiers) that combine weak learners into a strong learner. It is one of the ensemble methods for improving the predictions of a given learning algorithm. However, unlike a regular ensemble method, where we group several models and use them in parallel, Boosting trains models sequentially, one after another, with different weights on the data.
Boosting can also use random sampling with replacement. It starts training models from Sample 1 through Sample N. In each round of training, some data points are classified correctly and some are misclassified. Since sampling is done with replacement, a misclassified data point may or may not appear in later samples.
Since each previous model affects the current model by assigning larger weights to the data it could not classify correctly, the process is not parallel - it is sequential. At the end, all trained models are combined with different weights to generate the output. One disadvantage of Boosting is that it is easily corrupted by outliers, because it keeps assigning heavy weights to those outliers, which misleads the later models.
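To make this sequential idea concrete, here is a minimal sketch of a generic boosting loop in Python. The helper names `fit_weak_learner` and `update_weights` are placeholders of my own, not any particular library's API.

```python
import numpy as np

def boost(X, y, n_rounds, fit_weak_learner, update_weights):
    """Generic boosting loop: train weak learners one after another,
    reweighting the data after every round."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # start with uniform weights
    learners, alphas = [], []
    for t in range(n_rounds):
        h = fit_weak_learner(X, y, D)          # fit the next weak learner on weighted data
        D, alpha = update_weights(D, h, X, y)  # increase the weights of misclassified points
        learners.append(h)
        alphas.append(alpha)
    # the final prediction is a weighted vote of all weak learners
    return learners, alphas
```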
AdaBoost (Adaptive Boosting)
AdaBoost works by putting more weight on data points that are difficult to classify and less weight on those that are already handled well.
In AdaBoost, we use something called a Decision Stump, the simplest model we could construct on the data. It splits the data into two subsets based on a single feature and a threshold. To find the best decision stump, we lay out every feature along with every possible threshold and look for the combination that gives us the best (weighted) accuracy.
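As a rough sketch of that search, here is one way to fit a weighted decision stump from scratch (the function name `fit_stump` and its return format are my own choices, not a standard API):

```python
import numpy as np

def fit_stump(X, y, D):
    """Exhaustively search features, thresholds, and polarities for the
    decision stump with the lowest weighted error.
    y holds labels +1/-1 and D holds the example weights."""
    m, n = X.shape
    best = {"err": np.inf}
    for j in range(n):                        # every feature
        for thr in np.unique(X[:, j]):        # every candidate threshold
            for polarity in (+1, -1):         # which side predicts +1
                pred = np.where(X[:, j] <= thr, polarity, -polarity)
                err = np.sum(D[pred != y])    # weighted error of this stump
                if err < best["err"]:
                    best = {"feature": j, "thr": thr,
                            "polarity": polarity, "err": err}
    return best
```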
In this example, I will construct AdaBoost from scratch with some mathematical explanation.
Example
X1 = (−1, 0, +), X2 = (−0.5, 0.5, +)
X3 = (0, 1, −), X4 = (0.5, 1, −)
X5 = (1, 0, +), X6 = (1, −1, +)
X7 = (0, −1, −), X8 = (0, 0, −)
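For the code sketches in this post, the same eight points can be written as NumPy arrays, encoding the labels + and − as +1 and −1:

```python
import numpy as np

# Eight 2-D points with labels + / - encoded as +1 / -1
X = np.array([[-1.0, 0.0], [-0.5, 0.5], [0.0, 1.0], [0.5, 1.0],
              [1.0, 0.0], [1.0, -1.0], [0.0, -1.0], [0.0, 0.0]])
y = np.array([+1, +1, -1, -1, +1, +1, -1, -1])
```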
How to?
Constructing $D_t$: it is basically the weight of the $i^{th}$ data point at iteration $t$.
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{\alpha_t}$ if $y_i \ne h_t(x_i)$,
where $Z_t$ is the normalization constant (here, the sum of the current weights $D_t(i)$)
and $\alpha_t$ = $\frac{1}{2}ln(\frac{1-\epsilon_t}{\epsilon_t})$.
For a classifier $h_t$: $\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}[y_i \ne h_t(x_i)]$, where the indicator $\mathbb{1}[y_i \ne h_t(x_i)]$ is 1 if $y_i \ne h_t(x_i)$ and 0 otherwise.
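Translated almost literally into code (a sketch following the convention above, where $Z_t$ is the sum of the current weights):

```python
import numpy as np

def adaboost_round(D, y, pred):
    """One AdaBoost weight update.
    D: current weights D_t, y: true labels (+1/-1), pred: h_t's predictions."""
    miss = (pred != y)
    eps = np.sum(D[miss])                      # epsilon_t: weighted error of h_t
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t
    Z = np.sum(D)                              # normalization constant (sum of D_t)
    # multiply by e^{+alpha} on mistakes, e^{-alpha} on correctly classified points
    D_next = (D / Z) * np.exp(np.where(miss, alpha, -alpha))
    return D_next, alpha, eps
```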
Step 1
Put in an initial decision stump as the first classifier $h_1$; $h_1$ is a randomly chosen stump.
Let's set $D_1(i) = \frac{1}{m}$, where $m = 8$, so every point starts with weight 0.125.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
Now, in the second row of the table, I will put 1 if $h_1$ classifies $x_i$ incorrectly and 0 otherwise.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
$h_1$ has correctly classified $x_1, x_3, x_4, x_5, x_6$ and misclassified $x_2, x_7, x_8$.
Once we have $D_1(i)$ and the result of the classification by $h_1$, we can calculate $\epsilon_1$, where $\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}[y_i \ne h_t(x_i)]$.
In the third row, I will put $D_1(i)\,\mathbb{1}[y_i \ne h_1(x_i)]$ for each point; summing all of these values gives our $\epsilon_1$.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
Let's get $\alpha_1$ and $Z_1$. Remember that $Z_t$ is the normalization constant (the sum of the weights $D_t(i)$), and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
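Plugging in the numbers for the first round: $\epsilon_1 = 0.125 + 0.125 + 0.125 = 0.375$, $\alpha_1 = \frac{1}{2}\ln\left(\frac{1-0.375}{0.375}\right) = \frac{1}{2}\ln(1.667) \approx 0.255$, and $Z_1 = 8 \times 0.125 = 1.000$.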
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
If you look at the table above, you can see that d2, d7, and d8 each show up with 0.13 in the bottom row because $h_1$ failed to classify them correctly; these are the points that will get more weight in the next round. Based on this information, we can move on to the second iteration. In order to calculate $D_2(i)$, we need some additional information beyond the table above.
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{\alpha_t}$ if $y_i \ne h_t(x_i)$
We need to know what $e^{-\alpha_1}$ and $e^{\alpha_1}$ are:
$e^{-\alpha_1}=0.775$
$e^{\alpha_1}=1.291$
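With $Z_1 = 1$, the update rule gives $D_2(i) = 0.125 \times 0.775 \approx 0.097$ for the correctly classified points and $D_2(i) = 0.125 \times 1.291 \approx 0.161$ for the misclassified points $x_2$, $x_7$, $x_8$.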
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
We've got new weights for the data. d1 was 0.125; it has become 0.097. d2 has gone from 0.125 to 0.161 because $h_1$ failed to classify this point. Let's get $h_2$ - this time I was focusing on having all + points in the same group.
If you do the same calculation we've done above, you will get this table:
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
$e^{-\alpha_2}=0.690$
$e^{\alpha_2}=1.449$
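Working through the second round: $\epsilon_2 = 0.161 + 0.161 = 0.323$, $\alpha_2 = \frac{1}{2}\ln\left(\frac{1-0.323}{0.323}\right) \approx 0.371$, and $Z_2 = \sum_i D_2(i) = 0.968$. The update rule then gives, for example, $D_3(1) = \frac{0.097}{0.968} \times 0.690 \approx 0.069$ and $D_3(7) = \frac{0.161}{0.968} \times 1.449 \approx 0.241$.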
Lastly, let’s do the final iteration.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 |  |  |  | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |
The reasoning behind $h_3$ is to have all the − points in the same group. Also, d5 and d6 have never been misclassified, so their weights are the smallest; $h_3$ can afford to misclassify them because the earlier classifiers already carry the information about d5 and d6. And the result will be:
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 | 0.138 | 0.916 | 0.943 | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |
|   |  |  |  | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0.07 | 0.07 | 0 | 0 |
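Working through the third round the same way: $\epsilon_3 = 0.069 + 0.069 = 0.138$, $\alpha_3 = \frac{1}{2}\ln\left(\frac{1-0.138}{0.138}\right) \approx 0.916$, and $Z_3 = \sum_i D_3(i) = 0.943$.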
Lastly, the final classifier is:
$H_{final}(x)=\mathrm{sign}(0.255\,h_1(x)+0.371\,h_2(x)+0.916\,h_3(x))$
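To tie everything together, here is a compact from-scratch AdaBoost in Python, run on the eight example points. It is only a sketch of the procedure described above: the stumps are picked automatically by weighted error rather than by hand, and the weights are renormalized every round (a common convention), so the intermediate numbers may differ slightly from the hand-computed table. The function names (`fit_stump`, `adaboost`, `predict`) are my own, not a library API.

```python
import numpy as np

def fit_stump(X, y, D):
    """Find the decision stump (feature, threshold, polarity) with the
    lowest weighted error on data weighted by D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] <= thr, pol, -pol)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(X[:, j] <= thr, pol, -pol)

def adaboost(X, y, n_rounds=3):
    """Train AdaBoost with decision stumps; return (alpha_t, stump_t) pairs."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    ensemble = []
    for t in range(n_rounds):
        stump = fit_stump(X, y, D)
        pred = stump_predict(stump, X)
        miss = (pred != y)
        eps = np.sum(D[miss])               # epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # shrink the weights of correct points, grow the misclassified ones,
        # then renormalize so the weights sum to 1
        D = D * np.exp(np.where(miss, alpha, -alpha))
        D = D / np.sum(D)
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t * h_t(x))"""
    scores = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(scores)

# The eight example points from this post
X = np.array([[-1.0, 0.0], [-0.5, 0.5], [0.0, 1.0], [0.5, 1.0],
              [1.0, 0.0], [1.0, -1.0], [0.0, -1.0], [0.0, 0.0]])
y = np.array([+1, +1, -1, -1, +1, +1, -1, -1])

ensemble = adaboost(X, y, n_rounds=3)
print(predict(ensemble, X))  # predictions of the combined classifier on the training set
print(y)                     # true labels, for comparison
```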