In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners to strong ones (Wikipedia).
Boosting
Today, I would like to introduce the Boosting method in Machine Learning. Basically, Boosting is a family of algorithms (classifiers) that combine weak learners into a strong learner. It is one of the ensemble methods for improving the predictions of a given learning algorithm. However, unlike a regular ensemble method, where we group several models and use them in parallel, Boosting trains models sequentially, one after another, with different weights on the data.
Boosting can also use random sampling with replacement. It starts training models from Sample 1 through Sample N. In each round of training, some data points are classified correctly and some are misclassified. Since sampling is done with replacement, a misclassified data point may or may not appear in later samples.
Since each previous model affects the current model by assigning larger weights to the data it could not classify correctly, the process is not parallel - it is sequential. At the end, all trained models are combined with different weights to generate the output. One disadvantage of Boosting is that it is easily corrupted by outliers, because it keeps assigning heavy weights to those outliers, which misleads the later models.
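To make this sequential idea concrete, here is a minimal sketch of a generic boosting loop in Python. The helper names `fit_weak_learner` and `update_weights` are placeholders of my own, not any particular library's API.

```python
import numpy as np

def boost(X, y, n_rounds, fit_weak_learner, update_weights):
    """Generic boosting loop: train weak learners one after another,
    reweighting the data after every round."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # start with uniform weights
    learners, alphas = [], []
    for t in range(n_rounds):
        h = fit_weak_learner(X, y, D)          # fit the next weak learner on weighted data
        D, alpha = update_weights(D, h, X, y)  # increase the weights of misclassified points
        learners.append(h)
        alphas.append(alpha)
    # the final prediction is a weighted vote of all weak learners
    return learners, alphas
```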
AdaBoost (Adaptive Boosting)
AdaBoost works by putting more weight on data points that are difficult to classify and less weight on those that are already handled well.
In AdaBoost, we use something called a Decision Stump, the simplest model we could construct on the data. It splits the data into two subsets based on a single feature and a threshold. To find the best decision stump, we lay out every feature along with every possible threshold and look for the combination that gives us the best (weighted) accuracy.
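As a rough sketch of that search, here is one way to fit a weighted decision stump from scratch (the function name `fit_stump` and its return format are my own choices, not a standard API):

```python
import numpy as np

def fit_stump(X, y, D):
    """Exhaustively search features, thresholds, and polarities for the
    decision stump with the lowest weighted error.
    y holds labels +1/-1 and D holds the example weights."""
    m, n = X.shape
    best = {"err": np.inf}
    for j in range(n):                        # every feature
        for thr in np.unique(X[:, j]):        # every candidate threshold
            for polarity in (+1, -1):         # which side predicts +1
                pred = np.where(X[:, j] <= thr, polarity, -polarity)
                err = np.sum(D[pred != y])    # weighted error of this stump
                if err < best["err"]:
                    best = {"feature": j, "thr": thr,
                            "polarity": polarity, "err": err}
    return best
```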
In this example, I will construct AdaBoost from scratch with some mathematical explanation.
Example
X1 = (−1, 0, +), X2 = (−0.5, 0.5, +)
X3 = (0, 1, −), X4 = (0.5, 1, −)
X5 = (1, 0, +), X6 = (1, −1, +)
X7 = (0, −1, −), X8 = (0, 0, −)
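For the code sketches in this post, the same eight points can be written as NumPy arrays, encoding the labels + and − as +1 and −1:

```python
import numpy as np

# Eight 2-D points with labels + / - encoded as +1 / -1
X = np.array([[-1.0, 0.0], [-0.5, 0.5], [0.0, 1.0], [0.5, 1.0],
              [1.0, 0.0], [1.0, -1.0], [0.0, -1.0], [0.0, 0.0]])
y = np.array([+1, +1, -1, -1, +1, +1, -1, -1])
```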
How to?
Constructing $D_t$: it is basically the weight of the $i^{th}$ data point at iteration $t$.
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{\alpha_t}$ if $y_i \ne h_t(x_i)$,
where $Z_t$ is the normalization constant (here, the sum of the current weights $D_t(i)$)
and $\alpha_t$ = $\frac{1}{2}ln(\frac{1-\epsilon_t}{\epsilon_t})$.
For a classifier $h_t$: $\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}[y_i \ne h_t(x_i)]$, where the indicator $\mathbb{1}[y_i \ne h_t(x_i)]$ is 1 if $y_i \ne h_t(x_i)$ and 0 otherwise.
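Translated almost literally into code (a sketch following the convention above, where $Z_t$ is the sum of the current weights):

```python
import numpy as np

def adaboost_round(D, y, pred):
    """One AdaBoost weight update.
    D: current weights D_t, y: true labels (+1/-1), pred: h_t's predictions."""
    miss = (pred != y)
    eps = np.sum(D[miss])                      # epsilon_t: weighted error of h_t
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # alpha_t
    Z = np.sum(D)                              # normalization constant (sum of D_t)
    # multiply by e^{+alpha} on mistakes, e^{-alpha} on correctly classified points
    D_next = (D / Z) * np.exp(np.where(miss, alpha, -alpha))
    return D_next, alpha, eps
```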
Step 1
Put in an initial decision stump as the first classifier $h_1$; $h_1$ is a randomly chosen stump.
Let's set $D_1(i) = \frac{1}{m}$, where $m = 8$, so every point starts with weight 0.125.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
Now, in the second row of the table, I will put 1 if $h_1$ classifies $x_i$ incorrectly and 0 otherwise.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
$h_1$ has correctly classified $x_1, x_3, x_4, x_5, x_6$ and misclassified $x_2, x_7, x_8$.
Once we have $D_1(i)$ and the result of the classification by $h_1$, we can calculate $\epsilon_1$, where $\epsilon_t = \sum\limits_{i=1}^{m} D_t(i)\,\mathbb{1}[y_i \ne h_t(x_i)]$.
In the third row, I will put $D_1(i)\,\mathbb{1}[y_i \ne h_1(x_i)]$ for each point; summing all of these values gives our $\epsilon_1$.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 |  |  | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
Let's get $\alpha_1$ and $Z_1$. Remember that $Z_t$ is the normalization constant (the sum of the weights $D_t(i)$), and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
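Plugging in the numbers for the first round: $\epsilon_1 = 0.125 + 0.125 + 0.125 = 0.375$, $\alpha_1 = \frac{1}{2}\ln\left(\frac{1-0.375}{0.375}\right) = \frac{1}{2}\ln(1.667) \approx 0.255$, and $Z_1 = 8 \times 0.125 = 1.000$.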
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
If you look at the table above, you can see that d2, d7, and d8 each show up with 0.13 in the bottom row because $h_1$ failed to classify them correctly; these are the points that will get more weight in the next round. Based on this information, we can move on to the second iteration. In order to calculate $D_2(i)$, we need some additional information beyond the table above.
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{-\alpha_t}$ if $y_i = h_t(x_i)$
$D_{t+1}(i)$ = $\frac{D_t(i)}{Z_t} * e^{\alpha_t}$ if $y_i \ne h_t(x_i)$
We need to know what $e^{-\alpha_1}$ and $e^{\alpha_1}$ are:
$e^{-\alpha_1}=0.775$
$e^{\alpha_1}=1.291$
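With $Z_1 = 1$, the update rule gives $D_2(i) = 0.125 \times 0.775 \approx 0.097$ for the correctly classified points and $D_2(i) = 0.125 \times 1.291 \approx 0.161$ for the misclassified points $x_2$, $x_7$, $x_8$.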
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
We've got new weights for the data. d1 was 0.125; it has become 0.097. d2 has gone from 0.125 to 0.161 because $h_1$ failed to classify this point. Let's get $h_2$ - this time I was focusing on having all + points in the same group.
If you do the same calculation we've done above, you will get this table:
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
$e^{-\alpha_2}=0.690$
$e^{\alpha_2}=1.449$
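Working through the second round: $\epsilon_2 = 0.161 + 0.161 = 0.323$, $\alpha_2 = \frac{1}{2}\ln\left(\frac{1-0.323}{0.323}\right) \approx 0.371$, and $Z_2 = \sum_i D_2(i) = 0.968$. The update rule then gives, for example, $D_3(1) = \frac{0.097}{0.968} \times 0.690 \approx 0.069$ and $D_3(7) = \frac{0.161}{0.968} \times 1.449 \approx 0.241$.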
Lastly, let’s do the final iteration.
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 |  |  |  | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |
The reasoning behind $h_3$ is to have all the − points in the same group. Also, d5 and d6 have never been misclassified, so their weights are the smallest; $h_3$ can afford to misclassify them because the earlier classifiers already carry the information about d5 and d6. And the result will be:
| t | err | alpha | Z | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.375 | 0.255 | 1.000 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 | 0.125 |
|   |  |  |  | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0.13 | 0 | 0 | 0 | 0 | 0.13 | 0.13 |
| 2 | 0.323 | 0.371 | 0.968 | 0.097 | 0.161 | 0.097 | 0.097 | 0.097 | 0.097 | 0.161 | 0.161 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0 | 0 | 0.16 | 0.16 |
| 3 | 0.138 | 0.916 | 0.943 | 0.069 | 0.115 | 0.069 | 0.069 | 0.069 | 0.069 | 0.241 | 0.241 |
|   |  |  |  | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
|   |  |  |  | 0 | 0 | 0 | 0 | 0.07 | 0.07 | 0 | 0 |
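Working through the third round the same way: $\epsilon_3 = 0.069 + 0.069 = 0.138$, $\alpha_3 = \frac{1}{2}\ln\left(\frac{1-0.138}{0.138}\right) \approx 0.916$, and $Z_3 = \sum_i D_3(i) = 0.943$.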
Lastly, the final classifier is:
$H_{final}(x)=\mathrm{sign}(0.255\,h_1(x)+0.371\,h_2(x)+0.916\,h_3(x))$
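To tie everything together, here is a compact from-scratch AdaBoost in Python, run on the eight example points. It is only a sketch of the procedure described above: the stumps are picked automatically by weighted error rather than by hand, and the weights are renormalized every round (a common convention), so the intermediate numbers may differ slightly from the hand-computed table. The function names (`fit_stump`, `adaboost`, `predict`) are my own, not a library API.

```python
import numpy as np

def fit_stump(X, y, D):
    """Find the decision stump (feature, threshold, polarity) with the
    lowest weighted error on data weighted by D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thr in np.unique(X[:, j]):
            for pol in (+1, -1):
                pred = np.where(X[:, j] <= thr, pol, -pol)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(X[:, j] <= thr, pol, -pol)

def adaboost(X, y, n_rounds=3):
    """Train AdaBoost with decision stumps; return (alpha_t, stump_t) pairs."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                 # D_1(i) = 1/m
    ensemble = []
    for t in range(n_rounds):
        stump = fit_stump(X, y, D)
        pred = stump_predict(stump, X)
        miss = (pred != y)
        eps = np.sum(D[miss])               # epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # shrink the weights of correct points, grow the misclassified ones,
        # then renormalize so the weights sum to 1
        D = D * np.exp(np.where(miss, alpha, -alpha))
        D = D / np.sum(D)
    return ensemble

def predict(ensemble, X):
    """H_final(x) = sign(sum_t alpha_t * h_t(x))"""
    scores = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(scores)

# The eight example points from this post
X = np.array([[-1.0, 0.0], [-0.5, 0.5], [0.0, 1.0], [0.5, 1.0],
              [1.0, 0.0], [1.0, -1.0], [0.0, -1.0], [0.0, 0.0]])
y = np.array([+1, +1, -1, -1, +1, +1, -1, -1])

ensemble = adaboost(X, y, n_rounds=3)
print(predict(ensemble, X))  # predictions of the combined classifier on the training set
print(y)                     # true labels, for comparison
```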