An approach for using Twitter to perform sentiment analysis in Kenya
Interest in sentiment analysis as a research area has grown with the development of new social interaction technologies. Twitter, being one of these technologies, presents a unique environment in which one can track sentiments expressed about various topics. This report therefore considers the problem of classifying sentiments expressed on Twitter about certain products, services or personalities as positive, negative or neutral. The approach adopted to solve this problem uses machine learning methods; in particular, the Naïve Bayes model is chosen to build the classifier. This being a learning problem, training data and testing data are required. Two methods of collecting training data are considered and their impact on the performance of the classifier is discussed. The first method is distant supervision, where emoticons are used as labels to identify and collect training data that contains sentiment information. The other method is manual supervision, where a human trainer manually identifies and labels training data that contains the necessary sentiment information. It is found that using distant supervision to collect training data results in poorer performance than using manual supervision, even where the training set collected using distant supervision is larger than the manually labeled one. With 5000 tweets labeled by emoticons as training data, the classifier achieved an accuracy of 70.3%, compared with 76.3% when 500 hand-labeled tweets were used as training data. A third method for collecting training data, also based on manual supervision, is suggested and its performance is discussed. This method, which uses hand-labeled keywords grouped according to word characteristics, yields a performance of 80.3%.
This report concludes by recommending models to start with when developing a Twitter-based sentiment classifier. A software tool, developed using the learning model to classify live streams of data from Twitter into positive, negative or neutral classes and provide a summary of results, is also demonstrated.
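As an illustration of the distant supervision and Naïve Bayes steps described above, the following is a minimal sketch, not the implementation used in this report. The emoticon lists, the tokenizer, the add-one smoothing and the example tweets are all assumptions made for illustration, and the sketch covers only the positive and negative classes, since emoticon labels alone do not supply neutral examples.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical emoticon lists; the report does not state which were used.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def distant_label(tweet):
    """Return (label, text without emoticons), or None if the tweet is unusable."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # no emoticon, or conflicting emoticons
        return None
    # Strip the emoticons so the classifier cannot simply memorise the label.
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        tweet = tweet.replace(e, " ")
    return ("positive" if has_pos else "negative", tweet)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labelled):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for label, text in labelled:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up example tweets, for illustration only.
tweets = [
    "I love this phone :)",
    "great service today :D",
    "terrible network again :(",
    "I hate the delays :-(",
    "no emoticon here",
]
labelled = [pair for pair in map(distant_label, tweets) if pair is not None]
model = NaiveBayes().fit(labelled)
print(model.predict("love the great service"))
print(model.predict("terrible delays"))
```

Swapping `distant_label` for a list of hand-labeled `(label, text)` pairs would give the manual supervision variant with the same classifier.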