
dc.contributor.author: Kwale, Francis M
dc.date.accessioned: 2018-10-19T12:14:57Z
dc.date.available: 2018-10-19T12:14:57Z
dc.date.issued: 2018
dc.identifier.uri: http://hdl.handle.net/11295/104275
dc.description.abstract:
The focus of this thesis is on clustering algorithms. Clustering is a data mining technique for grouping related data together. Studies on improving clustering have been ongoing, especially over the past twenty years, yet clustering algorithms still need performance improvement, particularly in accuracy. In addition, there is no established framework for comparing clustering algorithms, which makes it hard to evaluate such performance improvements. The goal of this research is to explore ways of improving clustering performance, with special emphasis on accuracy, for applications with medium-sized, structured data. The strategy is to design an algorithm that improves accuracy using an approach that performs well on accuracy and also on the other performance factors (so as to allow later enhancements that address those factors as well), thereby obtaining ways of improving overall performance.

We first use a literature review to identify the factors that significantly affect clustering performance, as well as the well-performing clustering approaches and their representative algorithms. We then derive a clustering comparison framework that addresses the shortcomings of previous comparisons. Using the proposed framework, we compare the well-performing approaches to obtain one winning approach based on the identified performance factors and to derive its areas for further improvement. Finally, we design and evaluate an algorithm that addresses those areas of improvement, together with recommendations for further improving the other performance factors, thus aiding improved clustering performance in general.

The literature review identified the following significant factors: accuracy, efficiency, scalability, and robustness. The following algorithms were selected for further comparison: HierarchicalClusterer, EM, KMeans, and DBSCAN. A comparison framework was then developed and evaluated using test cases and experiments, and the four algorithms were compared using it. Results showed KMeans (a distance-based approach) to be among the most accurate and the highest performing overall; the recommendations were to avoid KMeans's dependence on parameters, and to enable the clustering of outliers and the production of overlapping clusters. A new algorithm was therefore designed using the winning (distance-based) approach while addressing the KMeans recommendations. The algorithm, named ASIC, applies the idea of finding sub-clusters along each axis of the Vector Space Model (VSM) and then finding all intersections of those sub-clusters (an illustrative sketch of this idea follows this record). ASIC was evaluated with the developed comparison framework against the above four algorithms. Using the Error Rate metric, ASIC attained an average of 38.9%, compared to HierarchicalClusterer (50.2%), EM (47.4%), KMeans (52.1%), and DBSCAN (54.4%). ASIC's higher accuracy was found to be statistically significant. It is therefore concluded that ASIC should be implemented to improve clustering accuracy, and that the recommendations should also be implemented to ensure higher efficiency and scalability (on which ASIC ranked lowest among the compared algorithms), thus improving clustering performance in general.

Keywords: Clustering, data mining, clustering algorithms, comparison framework, ASIC. [en_US]
dc.language.iso: en [en_US]
dc.publisher: University of Nairobi [en_US]
dc.title: Clustering of structured data using axes sub-clustering: the ASIC algorithm [en_US]
dc.type: Thesis [en_US]
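
Illustrative sketch: the abstract describes ASIC only at a high level — find sub-clusters along each axis of the Vector Space Model, then intersect them. The Python sketch below shows that general idea under stated assumptions; it is not the thesis's implementation. The equal-width 1-D binning used for the per-axis sub-clusters, the parameter n_bins, the all-axes intersection rule, and the function names axis_subclusters and intersect_axes are all illustrative assumptions.

    # Illustrative sketch only: the 1-D binning and the intersection rule below
    # are assumptions standing in for the thesis's actual ASIC definitions.
    from collections import defaultdict
    import numpy as np

    def axis_subclusters(values, n_bins=3):
        """Assign each point to a 1-D sub-cluster along a single axis.
        Equal-width binning is a stand-in for whatever 1-D grouping
        ASIC actually performs (an assumption)."""
        lo, hi = float(values.min()), float(values.max())
        if hi == lo:
            return np.zeros(len(values), dtype=int)
        edges = np.linspace(lo, hi, n_bins + 1)
        # Digitizing against the interior edges yields labels 0..n_bins-1.
        return np.digitize(values, edges[1:-1])

    def intersect_axes(X, n_bins=3):
        """Group points by intersecting their per-axis sub-cluster labels:
        two points land in the same cluster only if they share a
        sub-cluster on every axis of the vector space model."""
        X = np.asarray(X, dtype=float)
        per_axis = [axis_subclusters(X[:, j], n_bins) for j in range(X.shape[1])]
        clusters = defaultdict(list)
        for i in range(X.shape[0]):
            key = tuple(labels[i] for labels in per_axis)  # one label per axis
            clusters[key].append(i)
        return list(clusters.values())

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two well-separated blobs; intersecting the axis sub-clusters recovers them.
        data = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                          rng.normal(3.0, 0.3, (20, 2))])
        for members in intersect_axes(data, n_bins=3):
            print(len(members), "points")

In this sketch, empty intersections simply produce no cluster; how ASIC merges, prunes, or otherwise post-processes the resulting intersections is not specified in the abstract.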

