Research and development of a weighted most recent common ancestor algorithm for metagenomic taxonomic assignment
Abstract
The new generation of metagenomics has gained tremendous popularity in recent years,
largely because of rapid advances in DNA sequencing technology, which produce large
amounts of sequence data in much less time than conventional DNA sequencing methods.
These data need to be characterised taxonomically by assigning individual sequence reads
to their constituent taxa. However, up-to-date and customised software tools for this task
are lacking, and an automated taxonomic classification scheme is necessary. The overall
objective of this study
was to improve the accuracy of the most recent common ancestor (MRCA) estimation
method used in scoring metagenomic reads in the pathogen profiling pipeline (PPP). The
specific objectives were to investigate sequence comparison algorithms, other than the
MRCA, that have been used for assigning sequence reads to taxa; to compare the taxonomic
classification accuracy of MEGAN and MRCA on the same simulated metagenomic dataset;
and finally to design a weighted MRCA algorithm that attains the maximum possible
classification accuracy and to implement it in the PPP. A novel "weighted most recent common
ancestor" (weighted MRCA) algorithm was implemented as a set of Perl scripts and
evaluated for taxonomic accuracy. The evaluation datasets were simulated with the
QSA Read simulator using reference viral and prokaryotic (Bacteria and Archaea) genomes
obtained from the NCBI RefSeq database. At the species rank, the weighted MRCA improved
mapping by up to 3.6% for viral sequences and 8.4% for prokaryotic sequences compared
with MEGAN and MRCA (p-values as low as 0.0043 at a significance level of α = 0.05). In
the context of environmental science and medicine, these improvements are highly relevant
because they inform key decisions in public health. For large-scale pathogen discovery
projects, this method facilitates more accurate analysis and reporting of candidate
etiological agents in complex nucleic acid mixtures, which improves outbreak preparedness
by strengthening the capacity for early recognition and containment of pathogens.
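
The abstract does not describe how the weighted MRCA weights its hits, so the following is only a minimal Python sketch of one plausible reading, not the Perl implementation used in the PPP. It assumes each database hit for a read carries a root-to-leaf lineage and a positive weight (for example an alignment bit score), and it assigns the read to the deepest taxon that holds at least a chosen fraction of the total weight. The function name `weighted_mrca`, the `min_fraction` parameter, and the taxon labels are all hypothetical.

```python
from collections import defaultdict


def weighted_mrca(hits, min_fraction=0.8):
    """Return the deepest taxon holding at least min_fraction of the hit weight.

    hits: list of (lineage, weight) pairs. A lineage is a root-to-leaf
    sequence of taxon names; a weight is a positive number (assumed here
    to be something like a bit score). Returns None when there are no hits.
    """
    total = sum(weight for _, weight in hits)
    if total <= 0:
        return None

    # Accumulate weight on every lineage prefix, so a taxon is identified
    # by its full path from the root and not by its name alone.
    support = defaultdict(float)
    for lineage, weight in hits:
        for depth in range(1, len(lineage) + 1):
            support[tuple(lineage[:depth])] += weight

    # Walk down from the root, following the best-supported child, and stop
    # as soon as no child reaches the required fraction of the total weight.
    path = ()
    while True:
        children = {
            p: w for p, w in support.items()
            if len(p) == len(path) + 1 and p[:len(path)] == path
        }
        if not children:
            break
        best = max(children, key=children.get)
        if children[best] / total < min_fraction:
            break
        path = best
    return path[-1] if path else None


# Hypothetical hits for one read: two strong hits to species_1 and one
# weak hit to a sibling species in the same genus.
example_hits = [
    (("root", "FamilyA", "GenusA", "species_1"), 90.0),
    (("root", "FamilyA", "GenusA", "species_1"), 85.0),
    (("root", "FamilyA", "GenusA", "species_2"), 25.0),
]

species_call = weighted_mrca(example_hits, min_fraction=0.8)  # "species_1"
plain_mrca_call = weighted_mrca(example_hits, min_fraction=1.0)  # "GenusA"
```

With `min_fraction=1.0` every hit must agree, so the sketch reduces to an unweighted MRCA and the read stops at the genus. Lowering the threshold lets a well-supported species win over a weak conflicting hit, which is the kind of species-rank gain the abstract reports.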
Citation
Master of Science Degree
Publisher
University of Nairobi Centre for Biotechnology and Bioinformatics