dc.contributor.author: Ombui, Edward O
dc.date.accessioned: 2022-03-30T08:34:43Z
dc.date.available: 2022-03-30T08:34:43Z
dc.date.issued: 2021
dc.identifier.uri: http://erepository.uonbi.ac.ke/handle/11295/157166
dc.description.abstract: Classifying short text messages containing hate speech from the massive volume of content generated by social media users is difficult. Social media data poses significant challenges for conventional natural language processing approaches, which struggle to extract high-quality features from noisy, high-dimensional, codeswitched, and large unstructured data. In addition, a detailed review of past studies revealed a dearth of publicly available annotated datasets for comparative studies, a lack of theoretical support for the annotation schemes employed, and a scarcity of research on codeswitched data. To address these shortcomings, this study takes a data-driven approach to finding qualitative and discriminative characteristics in hate text messages from social media platforms. The objective is to use these attributes to construct a more effective machine classification model for detecting subtle hate speech text messages. Approximately 400k messages were crawled from social media during the 2017 Kenyan general election period, using a combination of problematic hashtags, ethnic epithets, hate patterns, and messages from pro-hate user accounts. A random sample of 50k messages was manually classified by a team of 27 human annotators into three categories: Hate Speech, Offensive, or Neither. This dataset was then reduced further using a hierarchical probability modeling technique to derive a psychosocial feature subset (PDC) informed by the conceptual framework. To analyze and select the best model, a grid search was conducted over all possible feature combinations using 5-fold cross-validation, with a tenth of the dataset reserved for evaluation to guard against over-fitting.
The findings show that the psychosocial feature set (PDC) was effective at identifying hate speech and outperformed traditional features when used to train the best classifier, the linear support vector machine algorithm, which reached an accuracy of 82.5 percent. The Passion (P) and Distance (D) factors were the most significant, with 74.3 percent and 74.2 percent accuracy, respectively. Further, the psychosocial feature framework generalized better than conventional features and classifiers in handling additional types of hate speech in codeswitched text messages. This study makes three contributions. First, it provides a gold-standard annotated dataset that other researchers may use for comparative studies. Second, it provides an empirical framework and methodology, anchored in theory, for identifying hate speech in short text messages. Third, this approach was important in developing a text classification model capable of generalizing effectively to various forms of hate speech on social media. The classifier's outputs could in turn inform evidence-based decisions by relevant security authorities and data-driven policy formation on monitoring hate speech on social media during future presidential elections in Kenya.
Keywords: Hate Speech, Psychosocial features, Dimensionality reduction, Supervised learning, Codeswitching [en_US]
dc.language.iso: en [en_US]
dc.publisher: University of Nairobi [en_US]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States [*]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/ [*]
dc.subject: Psycho-social Features and Machine Learning [en_US]
dc.title: A Model for Classifying Hate Speech Text From Social Media Leveraging on Psycho-social Features and Machine Learning [en_US]
dc.type: Thesis [en_US]
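
The model-selection procedure summarized in the abstract (a search over all feature combinations with 5-fold cross-validation, a linear support vector machine, and a tenth of the data held out for evaluation) can be sketched roughly as follows. This is only an illustrative sketch using scikit-learn, not the thesis code: the data is synthetic, the label encoding, the two-columns-per-group layout of the P, D and C feature groups, and the regularization grid are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): NOT the annotated thesis dataset.
rng = np.random.default_rng(0)
n_messages = 300
# Hypothetical encoding: 0 = Hate Speech, 1 = Offensive, 2 = Neither.
labels = rng.integers(0, 3, size=n_messages)

# Hypothetical layout: each psychosocial feature group owns two columns.
feature_groups = {"P": [0, 1], "D": [2, 3], "C": [4, 5]}
features = rng.normal(size=(n_messages, 6))
features[:, 0] += labels  # make one synthetic P column informative
features[:, 2] -= labels  # make one synthetic D column informative

# Reserve a tenth of the data for the final evaluation.
X_dev, X_test, y_dev, y_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=0
)

# Try every non-empty combination of feature groups; for each one, tune a
# linear SVM with 5-fold cross-validation on the development split.
results = {}
for size in range(1, len(feature_groups) + 1):
    for combo in combinations(feature_groups, size):
        columns = [col for name in combo for col in feature_groups[name]]
        search = GridSearchCV(
            LinearSVC(max_iter=10000),
            # "C" here is LinearSVC's regularization strength,
            # unrelated to the feature group of the same name.
            param_grid={"C": [0.1, 1.0, 10.0]},
            cv=5,
            scoring="accuracy",
        )
        search.fit(X_dev[:, columns], y_dev)
        results[combo] = search

# Select the combination with the best mean cross-validated accuracy,
# then score it once on the held-out tenth.
best_combo = max(results, key=lambda combo: results[combo].best_score_)
best_columns = [col for name in best_combo for col in feature_groups[name]]
holdout_accuracy = results[best_combo].score(X_test[:, best_columns], y_test)

print("best feature combination:", "+".join(best_combo))
print("cross-validated accuracy:", round(results[best_combo].best_score_, 3))
print("held-out accuracy:", round(holdout_accuracy, 3))
```

Scoring the held-out tenth only once, after the combination has been chosen, keeps that estimate independent of the selection step. The numbers printed for the synthetic data say nothing about the accuracies reported in the abstract.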

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States