
dc.contributor.author: Kwale, Francis M
dc.date.accessioned: 2018-10-19T12:14:57Z
dc.date.available: 2018-10-19T12:14:57Z
dc.date.issued: 2018
dc.identifier.uri: http://hdl.handle.net/11295/104275
dc.description.abstract:
The focus of this thesis is on clustering algorithms. Clustering is a data mining technique for grouping related data together. Studies on improving clustering have been ongoing, especially over the past twenty years, yet clustering algorithms still need performance improvement, particularly in accuracy. In addition, there is no established framework for comparing clustering algorithms, which makes it hard to evaluate such performance improvements. The goal of this research is to explore ways of improving clustering performance, with special emphasis on accuracy, for applications with medium-sized, structured data. The strategy is to design an algorithm that improves accuracy using an approach that performs well on accuracy and also on the other performance factors (so as to allow later enhancements that address those factors as well), thereby obtaining ways of improving overall performance.

We first use a literature review to identify the factors that significantly affect clustering performance, as well as the well-performing clustering approaches and their representative algorithms. We then derive a clustering comparison framework that addresses the shortcomings of previous comparisons. Using the proposed framework, we compare the well-performing approaches to obtain one winning approach based on the identified performance factors and to derive its areas for further improvement. Finally, we design and evaluate an algorithm that addresses those areas of improvement, together with recommendations for further improving the other performance factors, thus aiding improved clustering performance in general.

The literature review identified the following significant factors: accuracy, efficiency, scalability, and robustness. The following algorithms were selected for further comparison: HierarchicalClusterer, EM, KMeans, and DBSCAN. A comparison framework was then developed and evaluated using test cases and experiments, and the four algorithms were compared using it. Results showed KMeans (a distance-based approach) to be among the most accurate and the highest performing overall; the recommendations were to avoid KMeans's dependence on parameters, and to enable the clustering of outliers and the production of overlapping clusters. A new algorithm was therefore designed using the winning (distance-based) approach while addressing the KMeans recommendations. The algorithm, named ASIC, applies the idea of finding sub-clusters along each axis of the Vector Space Model (VSM) and then finding all intersections of those sub-clusters (an illustrative sketch of this idea follows this record). ASIC was evaluated with the developed comparison framework against the above four algorithms. Using the Error Rate metric, ASIC attained an average of 38.9%, compared to HierarchicalClusterer (50.2%), EM (47.4%), KMeans (52.1%), and DBSCAN (54.4%). ASIC's higher accuracy was found to be statistically significant. It is therefore concluded that ASIC should be implemented to improve clustering accuracy, and that the recommendations should also be implemented to ensure higher efficiency and scalability (on which ASIC ranked lowest among the compared algorithms), thus improving clustering performance in general.

Keywords: Clustering, data mining, clustering algorithms, comparison framework, ASIC. [en_US]
dc.language.iso: en [en_US]
dc.publisher: University of Nairobi [en_US]
dc.title: Clustering of structured data using axes sub-clustering: the ASIC algorithm [en_US]
dc.type: Thesis [en_US]
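
Illustrative sketch: the abstract describes ASIC only at a high level — find sub-clusters along each axis of the Vector Space Model, then intersect them. The Python sketch below shows that general idea under stated assumptions; it is not the thesis's implementation. The equal-width 1-D binning used for the per-axis sub-clusters, the parameter n_bins, the all-axes intersection rule, and the function names axis_subclusters and intersect_axes are all illustrative assumptions.

    # Illustrative sketch only: the 1-D binning and the intersection rule below
    # are assumptions standing in for the thesis's actual ASIC definitions.
    from collections import defaultdict
    import numpy as np

    def axis_subclusters(values, n_bins=3):
        """Assign each point to a 1-D sub-cluster along a single axis.
        Equal-width binning is a stand-in for whatever 1-D grouping
        ASIC actually performs (an assumption)."""
        lo, hi = float(values.min()), float(values.max())
        if hi == lo:
            return np.zeros(len(values), dtype=int)
        edges = np.linspace(lo, hi, n_bins + 1)
        # Digitizing against the interior edges yields labels 0..n_bins-1.
        return np.digitize(values, edges[1:-1])

    def intersect_axes(X, n_bins=3):
        """Group points by intersecting their per-axis sub-cluster labels:
        two points land in the same cluster only if they share a
        sub-cluster on every axis of the vector space model."""
        X = np.asarray(X, dtype=float)
        per_axis = [axis_subclusters(X[:, j], n_bins) for j in range(X.shape[1])]
        clusters = defaultdict(list)
        for i in range(X.shape[0]):
            key = tuple(labels[i] for labels in per_axis)  # one label per axis
            clusters[key].append(i)
        return list(clusters.values())

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two well-separated blobs; intersecting the axis sub-clusters recovers them.
        data = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                          rng.normal(3.0, 0.3, (20, 2))])
        for members in intersect_axes(data, n_bins=3):
            print(len(members), "points")

In this sketch, empty intersections simply produce no cluster; how ASIC merges, prunes, or otherwise post-processes the resulting intersections is not specified in the abstract.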

