Research and development of a weighted most recent common ancestor algorithm for metagenomic taxonomic assignment
The new generation of metagenomics has gained tremendous popularity in recent years. This is largely due to rapid advances in DNA sequencing technology, which produces large amounts of sequence data in less time than conventional DNA sequencing methods. These data need to be taxonomically characterised by assigning individual sequence reads to their constituent taxa. However, up-to-date, customized software tools for this task are lacking, and an automated taxonomic classification scheme is needed. The overall objective of this study was to improve the accuracy of the most recent common ancestor (MRCA) estimation method used for scoring metagenomic reads in the pathogen profiling pipeline (PPP). The specific objectives were to investigate sequence comparison algorithms, other than the MRCA, that have been used to assign sequence reads to taxa; to compare the taxonomic classification accuracy of MEGAN and MRCA on the same simulated metagenomic dataset; and to design a weighted MRCA algorithm that attains the maximum possible classification accuracy and implement it in the PPP. A novel "weighted most recent common ancestor" (weighted MRCA) algorithm was implemented as a set of Perl scripts and evaluated for taxonomic accuracy. The evaluation datasets were simulated with the QSA Read simulator from reference viral and prokaryotic (Bacteria and Archaea) genomes obtained from the NCBI RefSeq database. At the species rank, the results showed mapping improvements over MEGAN and MRCA of up to 3.6% for viral sequences and 8.4% for prokaryotic sequences (p-values as low as 0.0043 at a significance level of α = 0.05). In environmental science and medicine, gains of this size matter because they inform key decisions in public health.
For large-scale pathogen discovery projects, this method enables more accurate analysis and reporting of candidate etiological agents in complex nucleic acid mixtures, which strengthens outbreak preparedness by improving the capacity for early recognition and containment of pathogens.