
dc.contributor.author: Gateru, Nicholas M
dc.date.accessioned: 2016-04-28T06:57:18Z
dc.date.available: 2016-04-28T06:57:18Z
dc.date.issued: 2015
dc.identifier.uri: http://hdl.handle.net/11295/95203
dc.description.abstract: Classification is a supervised learning task whose goal is to infer a prediction model from a training dataset containing instances whose category membership is known, and then to use the model to assign class labels to testing instances whose labels are unknown. For example, in spam filtering, mail already labelled as either spam or not spam is used to train a classifier, and the classifier is then used to automatically place mail of unknown category into the spam or not-spam class. Training a classifier begins with gathering a training set that is representative of the real world; the input data is then represented as a feature vector containing the features that describe each object. With the input features in place, a training algorithm, e.g. SVM or Naïve Bayes, is selected and run on the training set to produce a predicting function. The function is then run on the testing set and its prediction accuracy and performance are measured.

Owing to the recent proliferation of easily available textual data, the need and interest to classify that data has increased. In the real world, the ability to automatically classify documents into a fixed set of categories is highly desirable, and machine learning offers powerful tools for doing so. A technique's performance depends not only on the algorithm in use but also on the characteristics of the data, so it is prudent to apply various techniques to the same dataset and analyse the performance of each technique relative to that particular data.

In this project, we compared the performance of the Support Vector Machine and Naïve Bayes algorithms in the task of text classification using the '20 newsgroups' dataset, which comprises around 20,000 newsgroup posts on 20 topics split into two subsets: one for training and one for testing. Data pre-processing, training of the classifiers, testing of the classifiers and performance evaluation were accomplished with a Python script. Performance was evaluated by comparing training time, testing time, precision, recall and F-measure scores for each classifier when run against 4,887 documents and against 10,794 documents.

We found that SVM achieved an F-score of 0.969 and Naïve Bayes an F-score of 0.964 when tested on 4,887 documents. When tested on 10,794 documents, SVM achieved an F-score of 0.900 and Naïve Bayes an F-score of 0.869. We also found that, for 4,887 documents, SVM took 0.676 s to train while Naïve Bayes took 0.026 s; for 10,794 documents, SVM took 3.733 s while Naïve Bayes took 0.106 s. The findings show that the size of the dataset affected the performance of both classifiers, i.e. with more documents, both classifiers were less able to place documents in their correct classes. The findings also confirm existing results on the suitability of SVM, compared with other classifiers, for classifying text.

Keywords: Text classification, SVM, Naïve Bayes, Recall, Precision, F-measure
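To make the pipeline in the abstract concrete, the following is a minimal sketch of the kind of comparison the project describes, written in Python (the language the abstract names). The thesis does not say which libraries its script used, so the use of scikit-learn here, including fetch_20newsgroups, TfidfVectorizer, MultinomialNB and LinearSVC, is an assumption for illustration only; the sketch also uses the dataset's standard train/test split rather than the 4,887- and 10,794-document subsets reported in the results.

    # Hypothetical sketch of the SVM vs. Naive Bayes comparison described
    # in the abstract. Assumes scikit-learn; the thesis does not name the
    # libraries its Python script actually used.
    import time
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import precision_recall_fscore_support

    # Load the pre-split '20 newsgroups' corpus: one subset for training,
    # one for testing, as the abstract describes.
    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")

    # Represent each document as a feature vector (TF-IDF term weights).
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)

    for name, clf in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
        # Measure training time.
        t0 = time.time()
        clf.fit(X_train, train.target)
        train_time = time.time() - t0

        # Measure testing time.
        t0 = time.time()
        predicted = clf.predict(X_test)
        test_time = time.time() - t0

        # Compute precision, recall and F-measure, macro-averaged over
        # the 20 classes.
        p, r, f, _ = precision_recall_fscore_support(
            test.target, predicted, average="macro"
        )
        print(f"{name}: train {train_time:.3f}s, test {test_time:.3f}s, "
              f"precision {p:.3f}, recall {r:.3f}, F-measure {f:.3f}")

Running the sketch prints, for each classifier, the training time, testing time and macro-averaged precision, recall and F-measure: the same quantities the abstract compares.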
dc.language.iso: en
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.title: Comparing the Performance of Naïve Bayes and Support Vector Machines in Text Classification
dc.type: Thesis


