Principal component analysis and linear discriminant analysis in gene expression data
Datasets from microarray experiments enable the measurement of gene expression profiles in cells. Statistical models may be used to classify samples into physiological categories based on their gene expression profiles. However, gene classification as a domain of research is not straightforward because of some inherent properties of the data, mainly its high dimensionality and noise. This thesis studied three aspects of gene expression analysis: dimension reduction, classification of expression profiles, and the variability in gene expression data attributable to covariates such as age and gender. The dataset used in the thesis is the GEO dataset GSE34105. Principal Component Analysis and Eigen-R2 methods were applied to dissect the overall variation. Subsequently, a linear discriminant classifier was built, and the effect of the number of principal components retained on the accuracy of the classifier was assessed using leave-one-out cross-validation. All data analysis was done in R 3.0.1 and R 2.6.2 with the relevant packages. The first three components accounted for a cumulative 33.34% of the total variance (23.26%, 6.02% and 4.06% respectively). The error rate of the linear discriminant classifier increased systematically from 6% to 33% as the number of retained principal components increased from 3 to 70. Age explained 0.8% of the variance, disease condition 26.5% and gender only 1.59%. The accuracy of the linear discriminant classifier was therefore highly dependent on the number of principal components retained.
The fact that the first few principal components explained a large proportion of the variance suggests that only a few genes accounted for a significant amount of the variance. This aligns with the knowledge that only a small number of genes carry relevant attributes and that gene expression data contain noise, which can be described as technical and biological distortions of the data. In conclusion, a proper understanding of the variability of gene expression data is key to drawing sound biological conclusions. Appreciating how much of the variability is attributable to other biological factors is important in study design.
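The procedure summarised above (project onto the first k principal components, fit a linear discriminant classifier, estimate the error by leave-one-out cross-validation, repeat for several k) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R code used in the thesis. The data are synthetic, the function names (`pca_fit`, `lda_fit`, `lda_predict`, `loocv_error`) are hypothetical, and refitting the PCA inside every fold is an assumption made here to keep the held-out sample out of the projection; the thesis does not state how this was handled.

```python
import numpy as np

def pca_fit(X, k):
    """Return the training mean and the top-k principal axes (rows)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def lda_fit(Z, y):
    """Fit LDA with a pooled within-class covariance matrix."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(y) - len(classes))
    precision = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular cov
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, precision, log_priors

def lda_predict(model, z):
    """Assign z to the class with the largest linear discriminant score."""
    classes, means, precision, log_priors = model
    quad = np.einsum("ij,jk,ik->i", means, precision, means)
    scores = means @ precision @ z - 0.5 * quad + log_priors
    return classes[np.argmax(scores)]

def loocv_error(X, y, k):
    """Leave-one-out error of PCA(k) followed by LDA."""
    n = len(y)
    wrong = 0
    for i in range(n):
        train = np.arange(n) != i
        mean, axes = pca_fit(X[train], k)        # PCA on training samples only
        model = lda_fit((X[train] - mean) @ axes.T, y[train])
        pred = lda_predict(model, (X[i] - mean) @ axes.T)
        wrong += int(pred != y[i])
    return wrong / n

# Synthetic stand-in for an expression matrix (samples x genes); NOT GSE34105.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, :10] += 3.0  # only a few "genes" differ between the two groups

errors = {k: loocv_error(X, y, k) for k in (3, 10, 30)}
for k, err in errors.items():
    print(f"components retained: {k:2d}  LOOCV error rate: {err:.2f}")
```

Running the loop over a wider range of k on real data would give an error curve comparable in kind to the one reported above. The numbers from this synthetic example say nothing about the thesis results.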