
dc.contributor.author: Terer, Mercy, C
dc.date.accessioned: 2021-01-27T11:10:46Z
dc.date.available: 2021-01-27T11:10:46Z
dc.date.issued: 2020
dc.identifier.uri: http://erepository.uonbi.ac.ke/handle/11295/154303
dc.description.abstract: Data quality assurance is a key component of research. It is almost impossible to routinely check for errors in large datasets unless automated, smart mechanisms are put in place. The quality of results from data analysis relies heavily on the underlying state of the data, and quality data leads to effective and unbiased reporting. Errors introduced into the data are inevitable, hence the need for error-checking mechanisms. Error-checking mechanisms such as range checks, quantile ranges and z-scores are limited to continuous data types and are effective only for data with a small feature space. Errors in dichotomous and character data types are easily missed, hence the need for methods that scan for anomalies across all data types and for extremely large datasets. Two-pass verification, on the other hand, is a gold-standard method for checking the quality state of data: a random sample of observations is re-entered from the same source documents to measure the accuracy and consistency of the data. It is an accurate process; however, it is tedious and manual and relies on random sampling for larger datasets. We propose possible alternative methods for error checking by applying machine learning outlier detection algorithms. Outlying observations are subjected to cross-referencing for possible errors instead of randomly selecting a set of observations. We evaluated k-means clustering and isolation forest, two unsupervised machine learning algorithms, for detecting outliers. The outliers form the sample of observations to be validated and verified. We then compared two-pass verification anomaly scores, k-means anomaly scores and isolation forest anomaly scores. The normalized mutual information score and the coefficient of determination were used to determine the strength of the correlation. The results indicate that unsupervised machine learning methods are possible alternatives for data quality assurance, with flexibility for future considerations and improvements. Isolation forest performed better than k-means clustering.
dc.language.iso: en
dc.publisher: University of Nairobi
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject: A Comparative Analysis of unsupervised outlier detection methods for Data Quality Assurance
dc.title: A Comparative Analysis of unsupervised outlier detection methods for Data Quality Assurance
dc.type: Thesis
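
As a rough illustration of the approach described in the abstract, the sketch below (Python with scikit-learn; the dataset, contamination rate, cluster count and flagging threshold are all assumptions for illustration, not taken from the thesis) flags candidate outliers with isolation forest and with k-means distances, then compares the two with the normalized mutual information score and a coefficient of determination computed as the squared Pearson correlation. Since k-means has no native anomaly score, the distance to the nearest centroid is used as a proxy.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                           # hypothetical clean records
X[:20] += rng.normal(loc=8.0, scale=1.0, size=(20, 5))   # injected "entry errors"

# Isolation forest: score_samples is higher for normal points, so negate it
iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
iso_scores = -iso.score_samples(X)                       # higher = more anomalous
iso_flags = (iso.predict(X) == -1).astype(int)           # 1 = flagged for re-checking

# k-means: use distance to the nearest centroid as an anomaly score
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
km_scores = km.transform(X).min(axis=1)                  # higher = more anomalous
km_flags = (km_scores > np.quantile(km_scores, 0.98)).astype(int)  # flag top 2%

# Agreement between the two candidate verification samples / score series
nmi = normalized_mutual_info_score(iso_flags, km_flags)
r2 = np.corrcoef(iso_scores, km_scores)[0, 1] ** 2       # squared Pearson correlation
print(f"NMI between flaggings: {nmi:.3f}")
print(f"R^2 between anomaly scores: {r2:.3f}")
```

The flagged observations would then be cross-referenced against the source documents in place of a purely random two-pass verification sample.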


