Research and development of a weighted most recent common ancestor algorithm for metagenomic taxonomic assignment

Butungi, Hellen

dc.contributor.author	Butungi, Hellen
dc.date.accessioned	2013-02-13T11:50:36Z
dc.date.available	2013-02-13T11:50:36Z
dc.date.issued	2012
dc.identifier.citation	Master of Science Degree	en
dc.identifier.uri	http://erepository.uonbi.ac.ke:8080/xmlui/handle/123456789/9760
dc.description.abstract	The new generation of metagenomics has gained tremendous popularity in recent years. This is largely due to rapid advances in DNA sequencing technology, which produces large amounts of sequence data in relatively short times compared to conventional DNA sequencing methods. These data need to be taxonomically characterised by assigning individual sequence reads to their constituent taxa. However, there is a lack of up-to-date, customized software tools for this task, and taxonomic characterisation requires an automated classification scheme. The overall objective of this study was to improve the accuracy of the most recent common ancestor (MRCA) estimation method used in scoring metagenomic reads in the pathogen profiling pipeline (PPP). The specific objectives were to investigate sequence comparison algorithms, other than the MRCA, that have been used for assigning sequence reads to taxa; to compare the taxonomic classification accuracy of MEGAN and MRCA on the same simulated metagenomic dataset; and finally to design a weighted MRCA algorithm that attains the maximum possible classification accuracy and to implement it in the PPP. A novel "weighted most recent common ancestor" (weighted MRCA) algorithm was developed as a set of Perl scripts and evaluated for taxonomic accuracy. The evaluation datasets were simulated with the QSA Read simulator from reference viral and prokaryotic (Bacteria and Archaea) genomes obtained from the NCBI RefSeq database. The results showed improved mapping at the species rank, compared to MEGAN and MRCA, of up to 3.6% for viral sequences and 8.4% for prokaryotic sequences (p-values as low as 0.0043 at a significance level of α = 0.05). In the context of environmental science and medicine, improvements of this size are highly significant because they inform key decisions in public health.
For large-scale pathogen discovery projects, this method facilitates more accurate analysis and reporting of candidate etiological agents in complex nucleic acid mixtures, which strengthens outbreak preparedness by improving the capacity for early recognition and containment of pathogens.	en
dc.language.iso	en	en
dc.publisher	University of Nairobi	en
dc.subject	METAGENOMIC TAXONOMIC. RESEARCH AND DEVELOPMENT. COMMON ANCESTOR ALGORITHM	en
dc.title	Research and development of a weighted most recent common ancestor algorithm for metagenomic taxonomic assignment	en
dc.type	Thesis	en
local.publisher	Centre for Biotechnology and Bioinformatics	en

Files in this item

Name: Butungi_Research and development ...
Size: 3.014Mb
Format: PDF
Description: Full-text

This item appears in the following Collection(s)

Faculty of Science & Technology (FST) [4085]