dc.contributor.author | Kagereki, Edwin M | |
dc.date.accessioned | 2013-11-25T11:57:32Z | |
dc.date.available | 2013-11-25T11:57:32Z | |
dc.date.issued | 2013-11 | |
dc.identifier.citation | A Thesis Submitted In Partial Fulfillment For The Degree Of Masters Of Science In Medical Statistics, 2013 | en |
dc.identifier.uri | http://erepository.uonbi.ac.ke:8080/xmlui/handle/11295/60021 | |
dc.description.abstract | Datasets from microarray experiments enable the measurement of gene
expression profiles in cells. Statistical models may be used to classify samples
into various physiological categories based on their gene expression profiles.
However, gene classification as a domain of research is not straightforward due to
some inherent properties of the data, mainly multidimensionality and noise.
The thesis studied three aspects of gene expression analysis: dimension
reduction, classification of the expression profiles, and the variability of
the gene expression data due to covariates such as age and gender. The dataset
used in the thesis is the GEO dataset GSE34105. Principal Component Analysis
and Eigen-R2 methods were applied to dissect the overall variation. Subsequently,
a linear discriminant classifier was built, and the effect of the number of
principal components retained on the accuracy of the classifier was assessed
using the leave-one-out cross-validation approach. All data analysis
was done in R 3.0.1 and R 2.6.2 with the relevant packages.
The first three components accounted for a cumulative 33.34% of the total
variance (23.26%, 6.02% and 4.06% respectively). The accuracy of the linear
discriminant classifier was highly dependent on the number of principal
components retained: the error rate increased systematically from 6% to 33% as
the number of retained components increased from 3 to 70. In our study, age
explained 0.8% of the variance, the disease condition 26.5% and gender only
1.59%.
The fact that the first few principal components explained a large proportion of
the variance suggests that only a few genes accounted for a significant amount
of the variance. This aligns with the knowledge that only a small number of
genes carry relevant attributes and that gene expression data contains noise,
which can be described as technical and biological distortions of the data.
In conclusion, a proper understanding of the variability of gene expression data is
key to drawing sound biological conclusions. An appreciation of the variability
contributed by other biological factors is important in study
design. | en |
dc.language.iso | en | en |
dc.publisher | University of Nairobi | en |
dc.title | Principal component analysis and linear discriminant analysis in gene expression data | en |
dc.type | Thesis | en |
dc.description.department | Department of Psychiatry, University of Nairobi; Department of Mental Health, School of Medicine, Moi University, Eldoret, Kenya | |
local.publisher | School of Medicine | en |
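
The abstract describes projecting expression data onto a small number of principal components, fitting a linear discriminant classifier on those components, and estimating its error rate by leave-one-out cross-validation. The following is a minimal sketch of that kind of pipeline, not the thesis's own code: the thesis used R, whereas this sketch uses Python with NumPy, runs on synthetic data rather than GSE34105, and refits the PCA inside every fold, which is an assumption about the procedure. All function names (`make_data`, `pca_fit`, `lda_fit`, `lda_predict`, `loocv_error`) are hypothetical.

```python
import numpy as np

def make_data(seed=0, n_per_class=20, n_genes=200, n_informative=20, shift=3.0):
    """Synthetic two-class 'expression' matrix (an illustration, not GSE34105)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_class, n_genes))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :n_informative] += shift  # only a few genes carry class signal
    return X, y

def pca_fit(X, k):
    """Return the training mean and the top-k principal axes (as rows)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def lda_fit(Z, y):
    """Fit LDA with a pooled within-class covariance matrix."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    priors = np.array([(y == c).mean() for c in classes])
    resid = Z - means[np.searchsorted(classes, y)]
    pooled = resid.T @ resid / (len(y) - len(classes))
    # Pseudo-inverse guards against a singular pooled covariance.
    return classes, means, priors, np.linalg.pinv(pooled)

def lda_predict(model, z):
    """Assign z to the class with the largest linear discriminant score."""
    classes, means, priors, prec = model
    quad = np.einsum("ij,jk,ik->i", means, prec, means)
    scores = means @ prec @ z - 0.5 * quad + np.log(priors)
    return classes[np.argmax(scores)]

def loocv_error(X, y, k):
    """Leave-one-out error rate of LDA on the first k principal components."""
    n = len(y)
    wrong = 0
    for i in range(n):
        train = np.arange(n) != i
        # PCA is refit on the training samples only (assumed procedure).
        mean, axes = pca_fit(X[train], k)
        Z_train = (X[train] - mean) @ axes.T
        z_test = (X[i] - mean) @ axes.T
        model = lda_fit(Z_train, y[train])
        wrong += lda_predict(model, z_test) != y[i]
    return wrong / n

if __name__ == "__main__":
    X, y = make_data()
    for k in (3, 10, 30):
        print(f"k={k}: LOOCV error rate = {loocv_error(X, y, k):.3f}")
```

Calling `loocv_error` for several values of `k` gives the kind of error-versus-components comparison the abstract reports. The values obtained on this synthetic data are illustrative only and are not expected to match the thesis's 6% to 33% range.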