Classifying Amazon review data using 7 different classifiers and analyzing their performance

  • Writer: Shreya Malraju
  • Aug 29, 2022
  • 7 min read

Updated: Nov 30, 2022




The .ipynb notebook for the code is found below -


The GitHub link for the code is -


The demo video is here - https://www.youtube.com/watch?v=FRmL5AZJFiE



Aim: This project analyzes the performance of different algorithms by classifying the reviews into their rating class. We compare the algorithms using accuracy and running time as performance metrics.


We try to determine the rating based on the review text.


About Dataset



This dataset contains product reviews, metadata and links from Amazon and Flipkart. In this project, I used the 'Product Review Large Data.csv' file. It consists of 10971 rows and 27 columns.


The CSV file is loaded as a pandas DataFrame and we observe the data as below -
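As a minimal sketch (assuming the CSV sits in the working directory), the loading step could look like this:

```python
import pandas as pd

# Load the dataset; the file name is taken from the description above.
df = pd.read_csv('Product Review Large Data.csv')

print(df.shape)  # expected: (10971, 27)
df.head()        # observe the first few rows
```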

The following are the different columns of the dataset -

Algorithms Used-

  1. Multinomial Naive Bayes

  2. Random Forest Classifier

  3. Logistic Regression

  4. Decision Tree Classifier

  5. kNN

  6. AdaBoost

  7. SVM

Import Libraries-


The necessary libraries such as sklearn, numpy, pandas and matplotlib are imported so they can be used later.
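A sketch of the typical imports for a project like this (the notebook may import more or fewer):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
```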


Data Cleaning / Pre-processing-


This dataset consists of many columns (27 in total), which is very large and affects both run-time and accuracy. Hence, we must drop the columns that are not necessary for our classification.

The data now has only 3 columns, i.e. rating, text and title. Hence, we can proceed with the removal of null values.


The rows that contain null values are deleted, leaving 10551 rows without null values.
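A sketch of these two cleaning steps, assuming the column names 'rating', 'text' and 'title' mentioned above:

```python
# Keep only the columns needed for classification, then drop null rows.
df = df[['rating', 'text', 'title']]
df = df.dropna()

print(df.shape)  # rows drop from 10971 to 10551 per the text above
```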


Next, we convert the float rating values into integers to simplify classification.

There are 5 different classes of rating, i.e. 1, 2, 3, 4 and 5.

We now observe that the data is updated with the rating as integer type and with no null values.
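A sketch of the type conversion:

```python
# Cast the float ratings to integers so the classes become 1..5.
df['rating'] = df['rating'].astype(int)

print(sorted(df['rating'].unique()))  # [1, 2, 3, 4, 5]
```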


We also remove various symbols and extra spaces from the data and then perform the train-test split.

The training size is 82 percent and the test size is 18 percent.


Data1 is used for training, where we drop the column 'rating', and Data2 is used for testing, i.e. for predicting the rating based on the review text.
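A sketch of the cleaning and split; the regex and random_state are illustrative choices, not necessarily those in the notebook:

```python
from sklearn.model_selection import train_test_split

# Remove symbols: anything that is not a word character or whitespace.
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)

# 82 percent train / 18 percent test, as described above.
X = df['text']    # review text used to predict the rating
y = df['rating']  # target class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.18, random_state=42)
```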


Knowledge - TF-IDF Vectorizer


TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. This algorithm is used to transform text into a meaningful numerical representation, which is then used to fit a machine learning algorithm for prediction.


It consists of 2 parts - Term Frequency and Inverse Document Frequency.


The term frequency is the number of occurrences of a specific term in a document. Term frequency indicates how important a specific term is in a document.


Inverse document frequency is the weight of a term; it aims to reduce the weight of a term whose occurrences are scattered throughout all the documents.
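A sketch of how the vectorizer is applied: it is fit on the training text only and then used to transform both splits, so no information leaks from the test set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary + IDF weights
X_test_tfidf = vectorizer.transform(X_test)        # reuse the same vocabulary
```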


We now perform training and test the data against 7 different models.


1) Multinomial Naive Bayes


The Naive Bayes Classifier [1] is based on the Bayes Theorem of probability. The Bayes Theorem is used to find the probability of an event occurring based on prior knowledge related to that event.


According to the Bayes theorem, if A and B are two events (with Prob(B) > 0), then

Prob(A|B) = Prob(B|A) * Prob(A) / Prob(B)

We try to understand it with an example -

Suppose you are planning to play cricket, but this morning it was cloudy. We know that there is a 50 percent chance of rain when mornings are cloudy. However, cloudy mornings are common (about 40 percent of mornings are cloudy), and since it is a dry month, there is only a 10 percent chance of rain on any given day. If we want to find the chance of rain given that the morning was cloudy, we proceed as follows -

Given:

Prob(cloudy|rain) = 50 percent = 0.5

Prob(rain) = 10 percent = 0.1

Prob(cloudy) = 40 percent = 0.4

Now, we have to find Prob(rain|cloudy).

From the Bayes theorem,

Prob(rain|cloudy) = Prob(cloudy|rain) * Prob(rain) / Prob(cloudy)
                  = (0.5) * (0.1) / (0.4)
                  = 0.125 = 12.5 percent


To conclude, the Bayes Theorem allows us to make reasoned deductions about events happening in the real world based on prior knowledge of observations that may imply them. In the Naive Bayes Classifier, we can interpret class probabilities by finding the frequency of each instance of an event divided by the total number of instances. So, as in the above example, by applying the Bayes theorem we can find the class probabilities and perform classification on different types of text data.


The Naive Bayes classifier is very accurate at classifying data into different categories. It is very easy to code a Naive Bayes classifier, and it also has very good training and testing runtimes, i.e. O(n) and O(1) respectively.



EXPERIMENTS:


We experiment with different hyperparameter values, i.e. different alpha values, plot the confusion matrix in each iteration and finally observe the accuracies.
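A sketch of the alpha sweep; alpha is the Laplace/Lidstone smoothing parameter, and the values shown here are illustrative:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:  # illustrative values
    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train_tfidf, y_train)
    pred = nb.predict(X_test_tfidf)
    print(alpha, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```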



The accuracy plot is observed to be as under -

The maximum accuracy obtained with the Naive Bayes classifier is 76.8 percent, with an execution time of 1.3 sec.


2) Random Forest Classifier


The Random Forest is a supervised machine learning algorithm that uses decision trees for classification and regression.


It is a collection of decision trees built from randomly selected subsets of the training set; it then collects the votes from the different decision trees to decide the final prediction.

Procedure involved-

  1. 'k' records are taken randomly from the dataset having 'n' records.

  2. Decision trees are constructed for each sample.

  3. Each decision tree will generate an output.

  4. Final output is considered based on majority voting.

EXPERIMENTS:


We also experiment with different hyper-parameters, i.e. estimator values, and plot the confusion matrix for each iteration.
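A sketch of the estimator sweep (only the value 5 is confirmed by the text below; the others are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for n in [5, 10, 50, 100]:
    rf = RandomForestClassifier(n_estimators=n)
    rf.fit(X_train_tfidf, y_train)
    print(n, accuracy_score(y_test, rf.predict(X_test_tfidf)))
```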

The accuracies are observed to be as under, and the training time is 1.31 sec.

We got the highest accuracy, 81 percent, with 5 estimators.


3) Logistic Regression


Logistic Regression is a classification model that uses input features to predict a categorical outcome variable that takes one of several classes; in our project, this is the rating.


Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.


The sigmoid function is f(x)=1/(1+e^-x)


EXPERIMENTS:


We experiment with different hyperparameters, i.e. the penalties l1, l2 and none, and plot the confusion matrix at each iteration.
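A sketch of the penalty sweep; the 'saga' solver is chosen here because it supports both l1 and l2 (in older scikit-learn versions None is spelled 'none'):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for penalty in ['l1', 'l2', None]:
    lr = LogisticRegression(penalty=penalty, solver='saga', max_iter=1000)
    lr.fit(X_train_tfidf, y_train)
    print(penalty, accuracy_score(y_test, lr.predict(X_test_tfidf)))
```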

We plot the accuracies obtained and observe that the highest accuracy is 87.5 percent, with the l1 penalty.


The training time for this algorithm is 5.8 sec.


4) Decision Tree Classifier


A decision tree is a flow-chart-like structure which has 4 parts - the root node, internal nodes, branches and leaf nodes.


The root node is the topmost node, an internal node represents a feature, a branch represents a decision rule and each leaf node represents an outcome.




EXPERIMENTS:


We experiment with different hyper-parameters, i.e. impurities, and plot the confusion matrix for min_impurity_decrease = 0, 0.001, 0.1, 0.5 and 0.9.
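A sketch of the impurity sweep, using the min_impurity_decrease values listed above:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for imp in [0, 0.001, 0.1, 0.5, 0.9]:
    dt = DecisionTreeClassifier(min_impurity_decrease=imp)
    dt.fit(X_train_tfidf, y_train)
    print(imp, accuracy_score(y_test, dt.predict(X_test_tfidf)))
```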


We observe that lower impurity values give the highest accuracies.


The comparison graph is plotted for the different impurity values and the accuracies obtained.

The maximum accuracy obtained is 81 percent, with 0 impurity. The algorithm takes 4.7 sec for execution.


5) kNN


The k-nearest neighbors (kNN) algorithm is a data classification method that estimates the likelihood that a data point belongs to one group or another based on the groups of the data points nearest to it.


We use the Euclidean distance formula to calculate the distance between two points.


Let the data points be A(x1,y1) and B(x2,y2); the distance between A and B is given by

d(A,B) = sqrt[(x1-x2)^2 + (y1-y2)^2]


EXPERIMENTS:


We experiment with different hyper-parameters, i.e. different k values where k is 5, 11, 30, 50 and 70, and observe the confusion matrix.
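A sketch of the k sweep, using the k values listed above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for k in [5, 11, 30, 50, 70]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_tfidf, y_train)
    print(k, accuracy_score(y_test, knn.predict(X_test_tfidf)))
```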




We plot the accuracies obtained for each k value and observe the highest accuracy for k=30.


The time taken for training this algorithm is approximately 9 sec.


6) AdaBoost


AdaBoost stands for Adaptive Boosting, a machine learning algorithm used as an ensemble method. The most common base learner used with AdaBoost is a decision tree with one level, i.e. a decision tree with only 1 split (a decision stump).





EXPERIMENTS:


We experiment with different hyper-parameters, i.e. the learning rate, observe the confusion matrix and plot the graph of accuracies after each iteration.
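A sketch of the learning-rate sweep (the rates shown are illustrative; by default AdaBoostClassifier boosts depth-1 decision trees, matching the description above):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

for rate in [0.01, 0.1, 0.5, 1.0]:
    ada = AdaBoostClassifier(learning_rate=rate)
    ada.fit(X_train_tfidf, y_train)
    print(rate, accuracy_score(y_test, ada.predict(X_test_tfidf)))
```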


The graph is as follows, and the total time taken for training is approximately 33 seconds.


7) Support Vector Machines


SVM is a supervised machine learning algorithm that can be used for both classification and regression. It aims to find the best hyperplane that separates the data in such a way that every label is properly classified into its type.



The degree is a hyperparameter used when the kernel is 'poly'. It is the degree of the polynomial used to find the hyperplane that splits the data into different categories.


EXPERIMENTS:


We experiment with different hyper-parameters, i.e. the degree, observe the confusion matrix and plot the graph of accuracies after each iteration.
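A sketch of the degree sweep (only degree=10 is confirmed by the text below; the other values are illustrative):

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for degree in [2, 3, 5, 10]:
    svm = SVC(kernel='poly', degree=degree)
    svm.fit(X_train_tfidf, y_train)
    print(degree, accuracy_score(y_test, svm.predict(X_test_tfidf)))
```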

The graph is observed as follows; the training time is 56 sec and we see that the accuracy is 88.7 percent for degree=10.


COMPARISON OF PERFORMANCE - IN TERMS OF ACCURACY


The accuracies of all the models are noted; we find the maximum accuracy among all the iterations and plot a bar graph to compare the accuracies of the different models on the same dataset.
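A sketch of the comparison plot; the five accuracies shown are the ones reported explicitly in this post (kNN and AdaBoost would be filled in from the notebook):

```python
import matplotlib.pyplot as plt

best_acc = {'Naive Bayes': 76.8, 'Random Forest': 81.0,
            'Logistic Regression': 87.5, 'Decision Tree': 81.0,
            'SVM': 88.7}

plt.bar(list(best_acc.keys()), list(best_acc.values()))
plt.ylabel('Best accuracy (percent)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
```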

We obtain the results as under


We see that SVM gives the highest accuracy amongst all the models and AdaBoost gives the lowest accuracy.


The following is the order of the models from best to worst according to accuracy -

  1. SVM

  2. Logistic Regression

  3. kNN

  4. Random Forest

  5. Decision Tree

  6. Naive Bayes Classifier

  7. AdaBoost


COMPARISON OF PERFORMANCE - IN TERMS OF TIME TAKEN


The run time of an algorithm is calculated as End_time - Start_time. Both times are noted, and we calculate the runtime of each model and plot the comparison graph.
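A sketch of the timing measurement, shown here for Naive Bayes (the same pattern applies to every model):

```python
import time
from sklearn.naive_bayes import MultinomialNB

start = time.time()
MultinomialNB().fit(X_train_tfidf, y_train)  # train the model being timed
end = time.time()

print(f'training time: {end - start:.2f} sec')
```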




We observe that SVM takes the maximum time for execution, whereas the Naive Bayes Classifier takes the least time.


The algorithms from best to worst according to time taken are -

  1. Naive Bayes Classifier

  2. Random Forest

  3. Decision Tree

  4. Logistic Regression

  5. kNN

  6. AdaBoost

  7. SVM

CHOOSING THE BEST MODEL CONSIDERING ACCURACY AND TIME


We see that even though SVM has the highest accuracy, it takes comparatively the longest time for execution.


So, after analyzing both accuracy and time, Random Forest is the best model; after that, we can also choose kNN or Decision Tree.


CONTRIBUTION

  1. Performed various experiments with all the algorithms by hyper-parameter tuning.

  2. Analyzed the confusion matrix after each iteration with every hyper-parameter.

  3. Plotted graphs comparing the performance of each algorithm with every change in parameters.

  4. Understood and clearly explained how the hyperparameters of every algorithm affect its performance.

  5. Plotted the graph comparing the accuracies and run times of all 7 algorithms.

  6. Analyzed the best model according to both accuracy and time.

  7. Improved the accuracy by removing the null values.

CHALLENGES FACED AND RESOLVED

  1. The data consisted of a large number of columns (27), so I had to find all the unnecessary columns and drop them.

  2. For selecting the hyper-parameters, I had to learn the concepts behind them and analyze for which hyper-parameter values the accuracy improves or decreases.

  3. The review text contained a lot of unwanted symbols (noise), which led to a decrease in performance in both accuracy and time, so I had to remove all those symbols and replace them with "".

  4. The AdaBoost algorithm gave the least accuracy and took more time for execution even after fine-tuning the hyper-parameters.

  5. There was no significant improvement in the accuracy of kNN even after changing the number of nearest neighbours, i.e. k - I am still trying to find out the reason.


CITATION


The major part of the code referred to is https://www.kaggle.com/code/vivekgediya/topic-modeling-on-e-commerce-review

REFERENCES

[1] https://www.dataquest.io/blog/naive-bayes-tutorial/
[2] https://stackabuse.com/the-naive-bayes-algorithm-in-python-with-scikit-learn/
[3] https://stackoverflow.com/questions/51085553/scikit-learn-5-fold-cross-validation-train-test-split
[4] https://www.pythonpool.com/remove-punctuation-python/
[5] https://www.analyticsvidhya.com/blog/2021/04/improve-naive-bayes-text-classifier-using-laplace-smoothing/
[6] https://www.codingninjas.com/codestudio/library/naive-bayes-and-laplace-smoothing-3905
[7] https://www.jcchouinard.com/confusion-matrix-in-scikit-learn/
[8] https://www.geeksforgeeks.org/random-forest-classifier-using-scikit-learn/
[9] https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/?ref=lbp
[10] https://www.datacamp.com/tutorial/adaboost-classifier-python
[11] https://stackabuse.com/implementing-svm-and-kernel-svm-with-pythons-scikit-learn/
[12] https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
[13] https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

 
 
 
