
Classifying Text Data with a Naive Bayes Classifier

  • Writer: Shreya Malraju
  • Aug 29, 2022
  • 4 min read

Updated: Nov 15, 2022





Aim


The goal of this assignment is to build a Naive Bayes classifier[1] to classify sentences from the Ford Sentence Classification Dataset.



About the dataset - The dataset consists of 60115 rows and 4 columns (index, sentence_id, new_sentence and type). Each sentence belongs to one of six classes - Responsibility, Requirement, Skill, SoftSkill, Education, Experience.


The code for the classifier is available on my GitHub and in the ipynb file below.







Knowledge


The Naive Bayes classifier is based on Bayes' theorem of probability.


Bayes' theorem gives the probability of an event occurring based on prior knowledge of conditions related to that event.


According to Bayes' theorem, if A and B are two events with Prob(B) > 0, then

Prob(A|B) = Prob(B|A) * Prob(A) / Prob(B)


Let us understand it with an example:


Suppose you are planning to play cricket, but this morning it was cloudy. We know that 50 percent of rainy days start with a cloudy morning. Cloudy mornings are common, occurring 40 percent of the time, and since this is a dry month there is only a 10 percent chance of rain on any given day. To find the chance of rain given that the morning was cloudy, we proceed as follows:


Given,

Prob(cloudy|rain) = 50 percent = 0.5

Prob(rain) = 10 percent = 0.1

Prob(cloudy) = 40 percent = 0.4


Now, we have to find Prob(rain|cloudy)


From Bayes' theorem, Prob(rain|cloudy) = Prob(cloudy|rain) * Prob(rain) / Prob(cloudy)

= (0.5) * (0.1) / (0.4)

= 0.125 = 12.5 percent
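The same arithmetic, checked in a couple of lines of Python:

# Bayes' theorem applied to the cricket example above.
p_cloudy_given_rain = 0.5
p_rain = 0.1
p_cloudy = 0.4

p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy
print(p_rain_given_cloudy)  # 0.125, i.e. 12.5 percent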


To conclude, Bayes' theorem allows us to make reasoned deductions about events in the real world based on prior knowledge of observations that may imply them.


In the Naive Bayes classifier, class probabilities are estimated by dividing the frequency of each class by the total number of instances.


So, applying Bayes' theorem as in the example above, we can compute class probabilities and perform classification on different types of text data.


The Naive Bayes classifier classifies data into categories quite accurately. It is easy to implement and has good training and testing runtimes, i.e., O(n) and O(1) respectively.



Code


IMPORT DATASET

The opendatasets module is used to download the dataset into the environment directly from Kaggle. We then load the data as a pandas DataFrame.
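A minimal sketch of this step (the Kaggle URL and file name are illustrative placeholders, not taken from the original notebook):

import opendatasets as od
import pandas as pd

# Download the dataset from Kaggle (prompts for Kaggle credentials).
od.download('https://www.kaggle.com/datasets/<owner>/ford-sentence-classification-dataset')

# Load the CSV into a pandas DataFrame; the actual file name may differ.
df = pd.read_csv('ford-sentence-classification-dataset/data.csv')
print(df.shape)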


A sample of the data is shown below.

REMOVE NULL VALUES


It is very important to remove null values, as they may affect the accuracy and performance of the model. We first check whether there are any null values in the dataset.


It is observed that the New_Sentence column has some null values, hence we have to remove them.


After the null values are removed, we have 59002 rows in the dataset.
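A sketch of the check-and-drop step, assuming the column name shown above:

# Count null values per column.
print(df.isnull().sum())

# Drop rows where New_Sentence is null.
df = df.dropna(subset=['New_Sentence']).reset_index(drop=True)
print(len(df))  # 59002 rows remain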


FACTORIZE THE TYPE COLUMN


We have six classes for the type of sentence - Responsibility, Requirement, Skill, SoftSkill, Education, Experience. Since it is inconvenient to work with text labels directly, we assign a Type_Id to each type using the factorize() method.
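The factorization itself is a one-liner:

# Assign an integer Type_Id to each sentence type.
df['Type_Id'], type_names = df['Type'].factorize()
print(dict(enumerate(type_names)))  # e.g. {0: 'Responsibility', 1: 'Requirement', ...}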



SPLIT THE DATA INTO TRAIN, DEV, TEST

The entire dataset is divided into train, dev and test sets in the ratio 8:1:1.
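One way to get an 8:1:1 split with scikit-learn (the random seed is an arbitrary choice, not from the original post):

from sklearn.model_selection import train_test_split

# First carve off 20%, then split that 20% evenly into dev and test.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(dev), len(test))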


We find the probabilities of occurrence of each record type in the train set and observe the following.
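These class priors can be read off the training set directly:

# Probability of each sentence type in the train set.
print(train['Type'].value_counts(normalize=True))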


DATA CLEANING

With the help of regular expressions[4], we clean the data by removing unwanted symbols and numbers.
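A sketch of such a cleaning function; the exact patterns the original code used may differ:

import re

def clean_text(text):
    # Replace anything that is not a letter with a space, then
    # collapse whitespace and lowercase the result.
    text = re.sub(r'[^A-Za-z]+', ' ', text)
    return text.strip().lower()

for split in (train, dev, test):
    split['New_Sentence'] = split['New_Sentence'].apply(clean_text)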


BUILD A VOCABULARY AS LIST



REMOVE THE WORDS WITH OCCURRENCE LESS THAN 5


We omit rare words because their contribution to determining the probability of a class is low.

The final vocabulary consists of 4959 words.
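A sketch covering both steps, building the vocabulary from the cleaned training sentences and then dropping rare words:

from collections import Counter

# Count how often each word occurs across the training sentences.
word_counts = Counter()
for sentence in train['New_Sentence']:
    word_counts.update(sentence.split())

# Keep only words that occur at least 5 times.
vocabulary = [word for word, count in word_counts.items() if count >= 5]
vocab_set = set(vocabulary)
print(len(vocabulary))  # 4959 words in the author's run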


CALCULATING CONDITIONAL PROBABILITIES BASED ON TYPE


Prob(word | type) = (number of documents of the given type that contain the word) / (total number of documents of that type)
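A sketch of these per-type conditional probabilities, counting each document at most once per word:

from collections import Counter, defaultdict

doc_counts = defaultdict(Counter)  # type -> word -> number of docs containing it
type_totals = Counter()            # type -> number of docs

for sentence, t in zip(train['New_Sentence'], train['Type']):
    type_totals[t] += 1
    doc_counts[t].update(set(sentence.split()))  # set(): one count per document

def p_word_given_type(word, t):
    return doc_counts[t][word] / type_totals[t]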


CALCULATE ACCURACY USING DEV SET


We use 5-fold cross validation[3] to measure accuracy on the development set, and observe the following accuracies.


Cross validation is usually used when we do not have enough data to rely on a single train/dev/test split.
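A sketch of the 5-fold loop; train_and_score here is a hypothetical helper that fits the classifier on one split and returns its accuracy on the held-out fold:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

data = pd.concat([train, dev]).reset_index(drop=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(data):
    # train_and_score is hypothetical: fit on data.iloc[train_idx],
    # evaluate on data.iloc[val_idx], and return the accuracy.
    scores.append(train_and_score(data.iloc[train_idx], data.iloc[val_idx]))
print(np.mean(scores))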







Experiments


Smoothing[5]


Laplace smoothing[6] helps us deal with zero-probability cases in the Naive Bayes classifier.


We are going to discuss add-k smoothing, where a count k is added to the numerator and, scaled by the number of possible outcomes, to the denominator:

Prob_k(x) = (c(x) + k) / (N + k * |X|)

where c(x) is the observed count of x, N is the total count of all observations, and |X| is the number of possible values of x (here, the vocabulary size).
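Under the assumptions of the earlier sketches, the smoothed estimate looks like:

def p_word_given_type_smoothed(word, t, k=1):
    # Add-k smoothing: every word gets a pseudo-count of k, so unseen
    # words receive a small nonzero probability instead of zero.
    return (doc_counts[t][word] + k) / (type_totals[t] + k * len(vocabulary))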



Accuracy after smoothing


Comparing before and after smoothing

For this dataset, it is concluded that accuracy is better without smoothing.


Calculating top 10 words predicting each class


We calculate Prob[class|word] for each word, store the results in an array, and sort it in descending order of probability. We then retrieve the first 10 words from the list.

Now, we find the top 10 words that predict each class separately.
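A sketch that ranks words by Prob[class|word], computed via Bayes' theorem from the per-type conditionals and class priors defined earlier:

priors = {t: type_totals[t] / sum(type_totals.values()) for t in type_totals}

def p_class_given_word(word, t):
    # Bayes' theorem: P(t|word) = P(word|t) * P(t) / P(word),
    # with P(word) = sum over classes c of P(word|c) * P(c).
    denom = sum(p_word_given_type(word, c) * priors[c] for c in priors)
    return p_word_given_type(word, t) * priors[t] / denom if denom else 0.0

def top_words(t, n=10):
    return sorted(vocabulary, key=lambda w: p_class_given_word(w, t), reverse=True)[:n]

for t in priors:
    print(t, top_words(t))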




Testing


We use the optimal hyperparameters obtained above to calculate the accuracy on the test data.



This calculates accuracy on the test data using the optimal hyperparameters found before and after smoothing.
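For reference, a hedged sketch of what that final evaluation might look like, reusing the helpers defined above (k stands for whichever smoothing value the cross validation selected):

import math

def predict(sentence, k=1):
    # Naive Bayes decision rule: argmax over types of
    # log P(type) + sum of log P(word | type) over in-vocabulary words.
    best_type, best_score = None, -math.inf
    for t in priors:
        score = math.log(priors[t])
        for word in sentence.split():
            if word in vocab_set:
                score += math.log(p_word_given_type_smoothed(word, t, k))
        if score > best_score:
            best_type, best_score = t, score
    return best_type

# Test sentences are assumed to be cleaned the same way as the train set.
correct = sum(predict(s) == t for s, t in zip(test['New_Sentence'], test['Type']))
print(correct / len(test))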


The final accuracy obtained is 95.0687%.


Contribution


1. Implemented smoothing and observed that for this dataset, accuracy is better without smoothing.


2. Plotted graph to display various accuracies - before and after smoothing.


3. Performed 5-fold cross validation on the train and dev sets, took the optimal hyperparameters, and evaluated on the test set.



Challenges faced


1. I was very new to 5-fold cross validation, hence I had to spend a lot of time learning about it.


2. The dataset contains a lot of null values and text with assorted symbols and numbers, hence it was difficult to remove all of them and keep only words.


3. Had to brainstorm a lot about why the accuracy decreases after performing Laplace smoothing. (Still could not find the reason.)


References











