Principal component analysis and linear discriminant analysis in gene expression data

Kagereki, Edwin M

dc.contributor.author	Kagereki, Edwin M
dc.date.accessioned	2013-11-25T11:57:32Z
dc.date.available	2013-11-25T11:57:32Z
dc.date.issued	2013-11
dc.identifier.citation	A Thesis Submitted In Partial Fulfillment For The Degree Of Masters Of Science In Medical Statistics, 2013	en
dc.identifier.uri	http://erepository.uonbi.ac.ke:8080/xmlui/handle/11295/60021
dc.description.abstract	Datasets from microarray experiments enable the measurement of the gene expression profile of cells. Statistical models may be used to classify samples into various physiological categories based on the gene expression profile. However, gene classification as a domain of research is not straightforward because of some inherent properties of the data, mainly multidimensionality and noise. The thesis studied three aspects of gene expression analysis: dimension reduction, classification of the expression profiles, and description of the variability of the gene expression data due to covariates such as age and gender. The dataset used in the thesis is the GEO dataset GSE34105. Principal Component Analysis and Eigen-R2 methods were applied to dissect the overall variation. Subsequently, a linear discriminant classifier was built, and the effect of the number of principal components retained on the accuracy of the linear discriminant classifier was assessed using the leave-one-out cross-validation approach. All the data analysis was done in R 3.0.1 and R 2.6.2 and the relevant packages. The first three components accounted for a cumulative 33.34% of the total variance (23.26%, 6.02% and 4.06% respectively). The error rate of the linear discriminant classifier increased systematically as the number of retained principal components increased from three to seventy (6% to 33%). In our study, age explained 0.8% of the variance, the disease condition 26.5% and gender only 1.59%. The accuracy of the linear discriminant classifier was highly dependent on the number of principal components retained. The error rate increased systematically from 6% to 33% when the components retained were increased from 3 to 70.
The fact that the first few principal components explained a large proportion of the variance suggests that only a few genes accounted for a significant amount of the variance. This aligns with the knowledge that only a small number of genes present relevant attributes and that gene expression data come with noise, which can be described as technical and biological distortions of the data. In conclusion, a proper understanding of the variability of gene expression data is key to drawing sound biological conclusions. Appreciating the variability contributed by other biological factors is important in study design.	en
dc.language.iso	en	en
dc.publisher	University of Nairobi	en
dc.title	Principal component analysis and linear discriminant analysis in gene expression data	en
dc.type	Thesis	en
dc.description.department	Department of Psychiatry, University of Nairobi; Department of Mental Health, School of Medicine, Moi University, Eldoret, Kenya
local.publisher	School of Medicine	en
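
The procedure summarized in the abstract (retain the first k principal components, fit a linear discriminant classifier, and estimate its error by leave-one-out cross-validation) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the thesis's R code. The data are synthetic and are not GSE34105, the function names (pca_fit, lda_fit, lda_predict, loocv_error) are hypothetical, and refitting the PCA inside every fold is an assumption made here to keep the held-out sample out of the projection; the record does not say how the thesis handled this.

```python
import numpy as np

def pca_fit(X, k):
    """Return the column means and first k principal axes of X (samples x genes)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T

def lda_fit(Z, y):
    """Fit LDA with a pooled within-class covariance on component scores Z."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]
    pooled = resid.T @ resid / (len(y) - len(classes))
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # The pseudo-inverse keeps the fit defined if the pooled covariance is singular.
    return classes, means, np.linalg.pinv(pooled), log_prior

def lda_predict(model, z):
    """Assign z to the class with the largest linear discriminant score."""
    classes, means, prec, log_prior = model
    quad = np.einsum("ij,jk,ik->i", means, prec, means)
    scores = means @ prec @ z - 0.5 * quad + log_prior
    return classes[np.argmax(scores)]

def loocv_error(X, y, k):
    """Leave-one-out error rate of LDA on the first k principal components."""
    n = len(y)
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        mean, axes = pca_fit(X[train], k)  # PCA refit without sample i (assumption)
        model = lda_fit((X[train] - mean) @ axes, y[train])
        errors += int(lda_predict(model, (X[i] - mean) @ axes) != y[i])
    return errors / n

# Synthetic stand-in data: 40 samples, 200 "genes", 10 of which differ by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, :10] += 3.0

for k in (3, 10, 30):
    print(k, loocv_error(X, y, k))
```

Running the loop over a range of k values gives an error-versus-components curve of the kind the abstract reports. The numbers printed for this synthetic data have no bearing on the thesis's reported 6% to 33% range.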

Files in this item

Name: Kagereki_Principal component ...
Size: 1.061Mb
Format: PDF
Description: FullText

This item appears in the following Collection(s)

Faculty of Health Sciences (FHS) [4267]
