Naive Bayes from scratch


Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifier. (Wikipedia)
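
The model combines Bayes' theorem with the "naive" assumption that the features are conditionally independent given the class:

$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$

Training then reduces to counting: we estimate $P(y)$ and each $P(x_i \mid y)$ from frequencies in the training data, which is the closed-form, linear-time step mentioned above.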

Use case - Spam filter

We will use the Naive Bayes algorithm to fit a spam filter.

Spam filters are used in all email services to classify received emails as “Spam” or “Not Spam”. A simple approach involves maintaining a vocabulary of words that commonly occur in “Spam” emails and classifying an email as “Spam” if the number of words from that vocabulary present in the email is over a certain threshold.
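
For illustration, here is a minimal sketch of that threshold rule; the word list and the threshold are made up for this example, not part of the model we build below:

'''
Hypothetical threshold-based filter: count vocabulary hits in the email.
'''
SPAM_WORDS = {"secret", "offer", "million", "dollar"}  # made-up word list

def is_spam_by_threshold(email, threshold=2):
    hits = sum(1 for word in email.lower().split() if word in SPAM_WORDS)
    return hits >= threshold

print(is_spam_by_threshold("million dollar offer"))  # True: 3 vocabulary hits
print(is_spam_by_threshold("sports is healthy"))     # False: 0 hits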

Assume we are given a vocabulary consisting of 15 words:

$V$ = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}

We will use $V_i$ to represent the $i$th word in $V$. As our training dataset, we are also given 3 example spam messages:

• million dollar offer

• secret offer today

• secret is secret

and 4 example non-spam messages:

• low price for valued customer

• play secret sports today

• sports is healthy

• low price pizza

import numpy as np

V = ['secret', 'offer', 'low', 'price', 'valued', 'customer', 'today', 'dollar', 'million', 'sports', 'is', 'for', 'play', 'healthy', 'pizza']

message = ['million dollar offer', 'secret offer today', 'secret is secret',
           'low price for valued customer', 'play secret sports today',
           'sports is healthy', 'low price pizza']

train is our input matrix of x vectors, one row per training message, and each row has length n = 15 (the length of V). Since we have 7 training messages, train is a 7 by 15 matrix - 7 examples, 15 features.

train = np.zeros((7, 15))
label = [0, 0, 0, 1, 1, 1, 1]  # 0 = spam, 1 = not spam
'''
Converting the 7 training messages into x vectors of length n = 15
'''
for i in range(len(train)):
    for j in range(len(V)):
        for k in range(len(message[i].split(" "))):
            if V[j] == message[i].split(" ")[k]:
                train[i, j] += 1
'''
This is the feature matrix x.
Each row is one message; each column is the count of the i_th vocabulary word.
'''
train
array([[0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
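
As a cross-check, the same bag-of-words matrix can be produced with scikit-learn's CountVectorizer (assuming scikit-learn is available); passing V as the vocabulary keeps our column order:

from sklearn.feature_extraction.text import CountVectorizer

# Build the same 7 x 15 count matrix directly from the raw messages.
vectorizer = CountVectorizer(vocabulary=V)
X = vectorizer.fit_transform(message).toarray()
print(np.array_equal(X, train))  # should print True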

Let’s list them separately

for i in range(len(label)):
    if label[i] == 0:
        print("this is {}th message - It's SPAM".format(i))
        print(train[i])
    else:
        print("this is {}th message - It's NOT SPAM".format(i))
        print(train[i])
this is 0th message - It's SPAM
[0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
this is 1th message - It's SPAM
[1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
this is 2th message - It's SPAM
[2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
this is 3th message - It's NOT SPAM
[0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
this is 4th message - It's NOT SPAM
[1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.]
this is 5th message - It's NOT SPAM
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
this is 6th message - It's NOT SPAM
[0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
'''
spam and ham hold, for each word x_i, the total count of x_i
across the spam messages and the non-spam messages respectively.
'''
spam = np.zeros((1, 15))
ham = np.zeros((1, 15))

for c in range(len(label)):
    for i in range(train.shape[1]):
        if label[c] == 0:  # Spam
            spam[0, i] += train[c, i]
        else:  # Not Spam
            ham[0, i] += train[c, i]


spam
array([[3., 2., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0.]])

Looking at the result above, spam[0] = 3, spam[1] = 2, spam[2] = 0: the word x0 = secret occurs three times in total across the training messages classified as spam, offer occurs twice, and low does not occur at all.
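
The same counts can be computed without the nested loops by masking rows with NumPy; a quick sketch:

# Vectorized equivalent: sum the word counts over the rows of each class.
labels = np.array(label)
spam_vec = train[labels == 0].sum(axis=0, keepdims=True)
ham_vec = train[labels == 1].sum(axis=0, keepdims=True)
print(np.array_equal(spam_vec, spam), np.array_equal(ham_vec, ham))  # True True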

'''
In this section, we calculate the probability of each word x_i given
that a message is spam or not:
prob_spam_x_i = P(x_i|y=0)
prob_ham_x_i = P(x_i|y=1)
estimated as the class-wise count of x_i divided by the number of
messages in that class.
'''
spam_counter = 0
ham_counter = 0

for c in range(len(label)):
    if label[c] == 0:
        spam_counter += 1
    else:
        ham_counter += 1

prob_spam_x_i = spam / spam_counter
prob_ham_x_i = ham / ham_counter
prob_spam_x_i 
array([[1.        , 0.66666667, 0.        , 0.        , 0.        ,
        0.        , 0.33333333, 0.33333333, 0.33333333, 0.        ,
        0.33333333, 0.        , 0.        , 0.        , 0.        ]])
prob_ham_x_i 
array([[0.25, 0.  , 0.5 , 0.5 , 0.25, 0.25, 0.25, 0.  , 0.  , 0.5 , 0.25,
        0.25, 0.25, 0.25, 0.25]])
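
Note the zeros: a word that never occurs in one class gets probability 0 there, and a single zero wipes out the whole product later on. The standard remedy is Laplace (add-one) smoothing; one common variant, sketched on our counts but not used in the rest of this post:

# Add 1 to each word count and 2 to each denominator so that no
# estimated probability is exactly 0 or 1.
prob_spam_x_i_smooth = (spam + 1) / (spam_counter + 2)
prob_ham_x_i_smooth = (ham + 1) / (ham_counter + 2)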
'''
Bayes' Theorem
Calculating P(y=0|x_i) and P(y=1|x_i) for each word:
prob_spam_per_word = P(y=0|x_i)
prob_ham_per_word = P(y=1|x_i)
'''
prob_spam_per_word = np.zeros((1, 15))
prob_ham_per_word = np.zeros((1, 15))
prob_spam = 3/7  # prior P(y=0): 3 of the 7 messages are spam
prob_ham = 4/7   # prior P(y=1): 4 of the 7 messages are not spam

for i in range(prob_spam_per_word.shape[1]):
    spam_a = prob_spam_x_i[0, i] * prob_spam
    spam_b = spam_a + (prob_ham_x_i[0, i] * prob_ham)
    spam_c = spam_a / spam_b

    ham_a = prob_ham_x_i[0, i] * prob_ham
    ham_b = ham_a + (prob_spam_x_i[0, i] * prob_spam)
    ham_c = ham_a / ham_b

    prob_spam_per_word[0, i] = spam_c
    prob_ham_per_word[0, i] = ham_c
prob_spam_per_word
array([[0.75, 1.  , 0.  , 0.  , 0.  , 0.  , 0.5 , 1.  , 1.  , 0.  , 0.5 ,
        0.  , 0.  , 0.  , 0.  ]])

The result above is the probability that a message is spam given that it contains the i_th word. For example, the probability of spam given secret is 75%, and the probability of spam given offer is 100%.
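
In formula form, this is Bayes' theorem applied to each word:

$P(y=0 \mid x_i) = \frac{P(x_i \mid y=0) P(y=0)}{P(x_i \mid y=0) P(y=0) + P(x_i \mid y=1) P(y=1)}$

For secret, that is $(1 \cdot 3/7) / (1 \cdot 3/7 + 0.25 \cdot 4/7) = 0.75$, matching the first entry above.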

prob_ham_per_word # This is the probability of ham for each word.
array([[0.25, 0.  , 1.  , 1.  , 1.  , 1.  , 0.5 , 0.  , 0.  , 1.  , 0.5 ,
        1.  , 1.  , 1.  , 1.  ]])

Since we don’t have enough data, I will just use the training dataset to evaluate the Bayes classifier.

threshold = 0.50  # Probability above which we classify an email as spam.

for t in range(train.shape[0]):
    conditional_prob_spam = prob_spam
    conditional_prob_ham = prob_ham

    for i in range(prob_spam_per_word.shape[1]):
        if train[t][i] >= 1:  # the word is present (counts can be greater than 1)
            conditional_prob_spam = conditional_prob_spam * prob_spam_per_word[0, i]
            conditional_prob_ham = conditional_prob_ham * prob_ham_per_word[0, i]

    if conditional_prob_spam != 0:
        prob = conditional_prob_spam / (conditional_prob_spam + conditional_prob_ham) * 100
    else:
        prob = 0.0

    if prob > threshold * 100:
        pred = "SPAM"  # a new name, so the label list is not overwritten
    else:
        pred = "NOT SPAM"

    print("{} th email is {} with a probability of being spam {}%".format(t, pred, prob))
0 th email is SPAM with a probability of being spam 100.0%
1 th email is SPAM with a probability of being spam 100.0%
2 th email is SPAM with a probability of being spam 69.23076923076923%
3 th email is NOT SPAM with a probability of being spam 0.0%
4 th email is NOT SPAM with a probability of being spam 0.0%
5 th email is NOT SPAM with a probability of being spam 0.0%
6 th email is NOT SPAM with a probability of being spam 0.0%
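
A practical aside: with a realistic vocabulary, the product of many small probabilities underflows to zero, so real implementations compare sums of log-probabilities instead. A sketch of the same decision rule in log space (here -inf simply means “impossible under this model”):

def log_posterior_score(x, prior, word_probs):
    # Sum the log-probabilities of the words present in x.
    with np.errstate(divide='ignore'):  # allow log(0) -> -inf
        return np.log(prior) + np.log(word_probs[0, x >= 1]).sum()

for t in range(train.shape[0]):
    spam_score = log_posterior_score(train[t], prob_spam, prob_spam_per_word)
    ham_score = log_posterior_score(train[t], prob_ham, prob_ham_per_word)
    print(t, "SPAM" if spam_score > ham_score else "NOT SPAM")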

Given a new message “today is secret”, decide whether it is spam or not spam, based on the Naive Bayes classifier learned from the above data.

'''
"today is secret"
Vectorizing the email..
'''
target = np.array([1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.])
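
Equivalently, the vector can be built programmatically from V instead of by hand; a small helper sketch:

def vectorize(text):
    # Count occurrences of each vocabulary word in the message.
    x = np.zeros(len(V))
    for word in text.split(" "):
        if word in V:
            x[V.index(word)] += 1
    return x

print(np.array_equal(vectorize("today is secret"), target))  # True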
'''
Please play around with different thresholds.
'''
threshold = 0.50  # Probability above which we classify an email as spam.

conditional_prob_spam = prob_spam
conditional_prob_ham = prob_ham

for i in range(prob_spam_per_word.shape[1]):
    if target[i] >= 1:
        conditional_prob_spam = conditional_prob_spam * prob_spam_per_word[0, i]
        conditional_prob_ham = conditional_prob_ham * prob_ham_per_word[0, i]

if conditional_prob_spam != 0:
    prob = conditional_prob_spam / (conditional_prob_spam + conditional_prob_ham) * 100
else:
    prob = 0.0

if prob > threshold * 100:
    pred = "SPAM"
else:
    pred = "NOT SPAM"

print("The new email is {} with a probability of being spam {}%".format(pred, prob))
The new email is SPAM with a probability of being spam 69.23076923076923%