
dc.contributor.author: Gateru, Nicholas M
dc.date.accessioned: 2016-04-28T06:57:18Z
dc.date.available: 2016-04-28T06:57:18Z
dc.date.issued: 2015
dc.identifier.uri: http://hdl.handle.net/11295/95203
dc.description.abstract: Classification is a supervised learning task whose goal is to infer a prediction model from a training dataset containing instances whose category membership is known, and then to use the model to assign class labels to testing instances whose labels are unknown. For example, in spam filtering, mail already labelled as either spam or not spam is used to train a classifier, and the classifier is then used to automatically place mail of unknown category into the spam or not-spam class. Training a classifier begins with gathering a training set that is representative of the real world; the input data is then represented as a feature vector containing the features that describe each object. With the input features in place, a training algorithm, e.g. SVM or Naïve Bayes, is selected and run on the training set to produce a predicting function. The function is then run on the testing set and its prediction accuracy and performance are measured.

Owing to the recent proliferation of easily available textual data, the need and interest to classify that data has increased. In the real world, the ability to automatically classify documents into a fixed set of categories is highly desirable, and machine learning offers powerful tools for doing so. A technique's performance depends not only on the algorithm in use but also on the characteristics of the data, so it is prudent to apply various techniques to the same dataset and analyse the performance of each technique relative to that particular data.

In this project, we compared the performance of the Support Vector Machine and Naïve Bayes algorithms in the task of text classification using the '20 newsgroups' dataset, which comprises around 20,000 newsgroup posts on 20 topics split into two subsets: one for training and one for testing. Data pre-processing, training of the classifiers, testing of the classifiers and performance evaluation were accomplished with a Python script. Performance was evaluated by comparing training time, testing time, precision, recall and F-measure scores for each classifier when run against 4,887 documents and against 10,794 documents.

We found that SVM achieved an F-score of 0.969 and Naïve Bayes an F-score of 0.964 when tested on 4,887 documents. When tested on 10,794 documents, SVM achieved an F-score of 0.900 and Naïve Bayes an F-score of 0.869. We also found that, for 4,887 documents, SVM took 0.676 s to train while Naïve Bayes took 0.026 s; for 10,794 documents, SVM took 3.733 s while Naïve Bayes took 0.106 s. The findings show that the size of the dataset affected the performance of both classifiers, i.e. with more documents, both classifiers were less able to place documents in their correct classes. The findings also confirm existing results on the suitability of SVM, compared with other classifiers, for classifying text.

Keywords: Text classification, SVM, Naïve Bayes, Recall, Precision, F-measure
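To make the pipeline in the abstract concrete, the following is a minimal sketch of the kind of comparison the project describes, written in Python (the language the abstract names). The thesis does not say which libraries its script used, so the use of scikit-learn here, including fetch_20newsgroups, TfidfVectorizer, MultinomialNB and LinearSVC, is an assumption for illustration only; the sketch also uses the dataset's standard train/test split rather than the 4,887- and 10,794-document subsets reported in the results.

    # Hypothetical sketch of the SVM vs. Naive Bayes comparison described
    # in the abstract. Assumes scikit-learn; the thesis does not name the
    # libraries its Python script actually used.
    import time
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import precision_recall_fscore_support

    # Load the pre-split '20 newsgroups' corpus: one subset for training,
    # one for testing, as the abstract describes.
    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")

    # Represent each document as a feature vector (TF-IDF term weights).
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)

    for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
        # Measure training time.
        t0 = time.time()
        clf.fit(X_train, train.target)
        train_time = time.time() - t0

        # Measure testing time.
        t0 = time.time()
        predicted = clf.predict(X_test)
        test_time = time.time() - t0

        # Compute precision, recall and F-measure, macro-averaged over
        # the 20 classes.
        p, r, f, _ = precision_recall_fscore_support(
            test.target, predicted, average="macro"
        )
        print(f"{name}: train {train_time:.3f}s, test {test_time:.3f}s, "
              f"precision {p:.3f}, recall {r:.3f}, F-measure {f:.3f}")

Running the sketch prints, for each classifier, the training time, testing time and macro-averaged precision, recall and F-measure: the same quantities the abstract compares.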
dc.language.iso: en
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.title: Comparing the Performance of Naïve Bayes and Support Vector Machines in Text Classification
dc.type: Thesis


