ELM: enhanced lowest common ancestor based method for detecting a pathogenic virus from a large sequence dataset
© Ueno et al.; licensee BioMed Central Ltd. 2014
Received: 9 August 2013
Accepted: 7 July 2014
Published: 28 July 2014
Emerging viral diseases, most of which are caused by the transmission of viruses from animals to humans, pose a threat to public health. Discovering pathogenic viruses through surveillance is the key to preparedness for this potential threat. Next generation sequencing (NGS) helps us to identify viruses without the design of a specific PCR primer. The major task in NGS data analysis is taxonomic identification for vast numbers of sequences. However, taxonomic identification via a BLAST search against all the known sequences is a computational bottleneck.
Here we propose an enhanced lowest-common-ancestor based method (ELM) to effectively identify viruses from massive sequence data. To reduce the computational cost, ELM uses a customized database composed only of viral sequences for the BLAST search. At the same time, ELM adopts a novel criterion to suppress the rise in false positive assignments caused by the small database. As a result, identification by ELM is more than 1,000 times faster than the conventional methods without loss of accuracy.
We anticipate that ELM will contribute to direct diagnosis of viral infections. The web server and the customized viral database are freely available at http://bioinformatics.czc.hokudai.ac.jp/ELM/.
Keywords: Next generation sequencing; Virus discovery; Diagnostic virology; Virome; Taxonomic identification
Most emerging infectious diseases are zoonoses, the pathogens of which are transmitted between humans and animals. The 2009 pandemic H1N1 influenza virus spread worldwide through reassortment that exchanged a gene segment between pigs and humans [1]. Recently, cases of influenza A virus H7N9 transmitted from birds to humans have been reported [2]. The 2003 severe acute respiratory syndrome (SARS) outbreak originated from the transmission of a novel bat coronavirus [3]. For the sporadically endemic Ebola virus, bats are suspected to be the natural reservoir, but this is still controversial [4]. Vector-borne zoonoses caused by transmission of viruses through mosquitoes and ticks have also become a public health concern. The 1999 outbreak of West Nile virus (WNV) in New York was caused by the transmission of WNV among birds, horses and humans via mosquitoes [5]. Similarly, severe fever with thrombocytopenia syndrome (SFTS) was found to be due to a virus transmitted by ticks [6].
To prepare for the risk of emerging infectious diseases, we need to identify pathogenic viruses through surveillance of livestock and wild animals. Although universal PCR primers against 16S ribosomal RNA are available for the identification of bacteria, specific PCR primers are needed to identify viruses. In recent years, NGS technologies have become available for identifying novel viruses that cannot be found by Sanger sequencing due to the difficulty of isolation and passage culture [7].
The taxonomic classification of metagenomic sequences is an important task in NGS data analyses [8]. It has been widely applied to investigate the relationship between human health and the microbiome [9]. Recently, a metagenomic analysis of the virome in monkeys infected with simian immunodeficiency virus (SIV) was conducted, suggesting that the virome was associated with the enteropathy caused by SIV [10]. Through a first screening with NGS, the novel influenza virus H17N10 was identified in metagenomic samples from bats [11].
The taxonomic classification of NGS data uses sequence similarity searches such as BLASTX and BLASTN [12] to assign each sequence to a specific taxon based on the hits. However, with the similarity-based approach it is difficult to decide the resolution of assignments because the resolution depends on whether the sequences are conserved or species specific. The metagenome analyzer (MEGAN) employs the lowest common ancestor (LCA) concept from graph theory to estimate the taxonomic content of samples [8]. MEGAN expresses the resolution of similarity-based assignments as the taxonomic level of the LCA.
The LCA is the closest taxon shared among two or more taxa found by a BLAST search for a read. When multiple taxa are found by the BLAST search with sufficiently reliable BLAST scores, the common ancestor is a high-level taxon. The LCA assignments to high-level taxa are associated with conserved sequences. When a single taxon is found by a BLAST search for a read, the common ancestor still remains a low-level taxon. The LCA assignments to low-level taxa are associated with species-specific sequences. Thus, the LCA assignments to low-level taxa are more suitable for resolving closely related organisms than those to high-level taxa.
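The LCA concept described above can be illustrated with a minimal Python sketch. The lineages below are a hypothetical toy taxonomy written by hand for illustration (a real analysis would take lineages from the NCBI taxonomy), and the function name is our own, not part of MEGAN or ELM.

```python
from typing import Dict, List, Sequence

# Hypothetical toy lineages, listed from the root down to the taxon itself.
LINEAGES: Dict[str, List[str]] = {
    "Lassa virus": ["Viruses", "Arenaviridae", "Arenavirus", "Lassa virus"],
    "Mopeia virus": ["Viruses", "Arenaviridae", "Arenavirus", "Mopeia virus"],
    "Influenza A virus": ["Viruses", "Orthomyxoviridae", "Influenzavirus A",
                          "Influenza A virus"],
}


def lowest_common_ancestor(taxa: Sequence[str],
                           lineages: Dict[str, List[str]] = LINEAGES) -> str:
    """Return the deepest taxon shared by the lineages of all given taxa."""
    paths = [lineages[taxon] for taxon in taxa]
    lca = None
    # Walk down from the root, one level at a time, while all lineages agree.
    for level in zip(*paths):
        if all(name == level[0] for name in level):
            lca = level[0]
        else:
            break
    if lca is None:
        raise ValueError("no taxa given or no shared ancestor")
    return lca


# A single hit taxon stays at a low level (species-specific sequence).
print(lowest_common_ancestor(["Lassa virus"]))
# Hits to two related species are resolved only to their shared genus.
print(lowest_common_ancestor(["Lassa virus", "Mopeia virus"]))
# Hits spread across families are pushed up to a high-level taxon.
print(lowest_common_ancestor(["Lassa virus", "Influenza A virus"]))
```

The three calls show the behaviour discussed in the text: the more taxa a read hits, the higher the level of its LCA assignment.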
The SOrt-ITEMS [13] and CARMA3 [14] methods extended the LCA using a reciprocal BLAST search to reduce false positives in assignments. CARMA3 introduced the concept of the mutation rate into the LCA algorithm, and reinforced the reciprocal BLAST search to identify a novel taxon whose relatives are limited in number [14].
While the accuracy of taxonomic classification of metagenomic sequences has improved, NGS technologies continue to increase sequencing throughput, so taxonomic classification requires considerable computational time and resources. The throughput of Roche 454 sequencing is 700 Mb with an average read length of 400–800 bases. Other current NGS platforms deliver 1 Gb or more: Illumina yields 600 Gb with an average read length of ~100 bases, SOLiD yields 20 Gb with ~50 bases, and Ion Torrent PGM yields 1 Gb with ~200 bases. Such massive sequencing data hinder the fast detection of infecting viruses in metagenomic samples.
To reduce the computational time, we constructed a customized database composed only of viruses for the BLAST search. However, customized databases also increase accidental hits, i.e. the match of host sequences to viral genomic sequences. Here, we introduce ELM with a customized viral database for taxonomic identification. The method is based on the assumption that valid hits, the match of viral sequences to viral genomic sequences, raise the probability of finding other similar genomic sequences in the BLAST search. In other words, true assignments with the LCA should be sensitive to the threshold of the bit score in the BLAST search. Consequently, ELM can suppress the rise of false positive assignments while saving computational time and resources.
Construction and content
BLAST search for customized database
To reduce the computational time and save disk space, we constructed a customized database composed only of viral genomic sequences for a BLASTN search. First, the RefSeq genomic sequences were downloaded from the NCBI. Then a total of 3,336 viral genomic sequences were selected using a custom-made script and converted into BLAST databases by the formatdb command in the NCBI BLAST package. We used the BLASTN program in the NCBI BLAST+ version 2.2.26 package with the default parameters to search for similar sequences. The hits with an E-value under 10⁻⁴ were used for subsequent analyses.
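As an illustration of the E-value filter, the following Python sketch keeps only hits below the cutoff. It assumes that the BLASTN results were written in the 12-column tabular format of BLAST+ (`-outfmt 6`), in which the E-value and bit score are the last two columns; the output format actually used for ELM is not stated here, and the `Hit` type, function name and example records are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str      # read identifier
    subject: str    # database sequence identifier
    evalue: float
    bitscore: float


E_VALUE_CUTOFF = 1e-4  # hits with an E-value under this value are kept


def parse_hits(lines: Iterable[str],
               cutoff: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Parse tabular BLAST lines and keep hits with E-value below cutoff."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed column order: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        hit = Hit(fields[0], fields[1], float(fields[10]), float(fields[11]))
        if hit.evalue < cutoff:
            hits.append(hit)
    return hits


# Made-up records for illustration only.
example = [
    "\t".join(["read1", "virus_genome_1", "98.5", "200", "3", "0",
               "1", "200", "501", "700", "1e-30", "350"]),
    "\t".join(["read2", "virus_genome_2", "80.0", "40", "8", "0",
               "1", "40", "10", "49", "0.5", "30.1"]),
]
for hit in parse_hits(example):
    print(hit.query, hit.subject, hit.evalue, hit.bitscore)
```

Only the first example record passes the filter; the second, with an E-value of 0.5, is discarded.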
LCA analysis for taxonomic classification
The LCA method assigns sequence reads to taxa with a criterion for the resolution of assignments [8]. Let h(q, s) be the set of taxa found by a BLAST search for a sequence read q under the bit score threshold s. For a set of taxa h(q, s), the common ancestor located farthest from the root of the taxonomic tree defines the LCA as the representative taxon. Thus, the LCA allows the assignment of a read to a single taxon. At the same time, the taxonomic levels indicate the resolution of assignments because the LCA assigns broad hits to high-level taxa but specific hits to low-level taxa. It also means that the number of reads assigned to a taxon by the LCA depends on the bit score threshold applied to the BLAST hits.
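The dependence of LCA assignments on the bit score threshold can be sketched in Python as follows. The sketch assumes that h(q, s) keeps the taxa of hits whose bit score is at least s; the taxonomy, reads and scores are hypothetical toy values, and the function names are our own.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Hit = Tuple[str, float]  # (taxon, bit score)

# Hypothetical toy lineages, root first.
LINEAGES: Dict[str, List[str]] = {
    "virus A1": ["Viruses", "family A", "genus A", "virus A1"],
    "virus A2": ["Viruses", "family A", "genus A", "virus A2"],
}


def h(hits: List[Hit], s: float) -> Set[str]:
    """Taxa hit by one read with a bit score of at least s (assumed rule)."""
    return {taxon for taxon, score in hits if score >= s}


def lca(taxa: Set[str]) -> str:
    """Deepest taxon shared by the lineages of all taxa in a non-empty set."""
    paths = [LINEAGES[taxon] for taxon in taxa]
    shared = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        shared = level[0]
    return shared


def count_assignments(reads: Dict[str, List[Hit]], s: float) -> Counter:
    """Number of reads assigned to each taxon by the LCA at threshold s."""
    counts: Counter = Counter()
    for hits in reads.values():
        taxa = h(hits, s)
        if taxa:  # reads without any hit above the threshold stay unassigned
            counts[lca(taxa)] += 1
    return counts


reads = {
    "read1": [("virus A1", 200.0), ("virus A2", 90.0)],
    "read2": [("virus A1", 150.0)],
    "read3": [("virus A2", 60.0)],
}
for threshold in (50.0, 100.0):
    print(threshold, dict(count_assignments(reads, threshold)))
```

With the looser threshold, read1 hits two species and is assigned to their genus. With the stricter threshold, read1 moves down to "virus A1" while read3 loses its only hit, so the per-taxon counts change with s.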
We used MEGAN software version 4.62.5 for the LCA analysis [8]. MEGAN assigns sequence reads to taxa at different hierarchical levels of the taxonomy, such as family, genus, and species.
ELM for evaluating the LCA assignments
where μ is the average of Δn for all assigned taxa, and σ is the standard deviation. Because IFs are compared under nine top percent score filters, ranging from 20% to 100% at 10% intervals, a P value of less than 0.05/9 is accepted as statistically significant after Bonferroni correction. Accordingly, IF > 2.54 (one-tailed) is accepted as statistically significant.
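The significance threshold quoted above can be checked with a short Python sketch. It assumes that the inflation index (IF) is a standard score of the form (Δn − μ)/σ compared against the standard normal distribution; this form is inferred from the definitions of μ and σ and is not confirmed by the text shown here. The use of the population standard deviation, the function names and the Δn values are likewise illustrative assumptions.

```python
from statistics import NormalDist, mean, pstdev
from typing import Dict


def bonferroni_z_threshold(alpha: float = 0.05, n_tests: int = 9) -> float:
    """One-tailed standard normal cutoff for P < alpha / n_tests."""
    return NormalDist().inv_cdf(1 - alpha / n_tests)


def standard_scores(delta_n: Dict[str, float]) -> Dict[str, float]:
    """Assumed IF form: (delta_n - mu) / sigma over all assigned taxa."""
    values = list(delta_n.values())
    mu = mean(values)
    sigma = pstdev(values)  # population SD; the source may use another form
    if sigma == 0:
        raise ValueError("all delta_n values are equal; scores are undefined")
    return {taxon: (value - mu) / sigma for taxon, value in delta_n.items()}


threshold = bonferroni_z_threshold()
print(round(threshold, 2))  # 2.54

# Hypothetical delta_n values for four taxa, for illustration only.
scores = standard_scores({"taxon A": 1, "taxon B": 1, "taxon C": 1, "taxon D": 10})
significant = [taxon for taxon, z in scores.items() if z > threshold]
print(significant)
```

Nine tests at a family-wise level of 0.05 give a per-test level of about 0.0056, whose one-tailed normal quantile rounds to 2.54, in agreement with the stated cutoff.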
Utility and discussion
Benchmark tests for NGS datasets
To evaluate the ability of ELM to detect pathogenic viruses from large sequence datasets, five real datasets were used. Dataset 1 consisted of 4,449,766 unassembled reads from a rodent sample in Zambia [15]. Reads with an average length of 236 bases were obtained by Ion Torrent Personal Genome Machine (PGM) sequencing. Dataset 2 consisted of 4,146,547 unassembled reads from a reptile sample (SRR: 527074) deposited in the NCBI Sequence Read Archive (SRA). Reads with an average length of 200 bases were obtained by Illumina sequencing [16]. Dataset 3 consisted of 12,393,506 unassembled reads from a simian sample (SRR: 167721) deposited in the SRA. Reads with an average length of 73 bases were obtained by Illumina sequencing [17]. We selected these three datasets to evaluate the effects of the read length, host and NGS platform. Furthermore, we applied ELM to fecal samples including multiple virus and phage taxa in dataset 4 (SRR: 1055974 for 12-day-old piglets) and dataset 5 (SRR: 1055972 for 54-day-old piglets). Reads with an average length of 291 bases in dataset 4 and 400 bases in dataset 5 were obtained by 454 GS FLX Titanium sequencing [18]. In these benchmark tests, the BLAST searches were performed on a workstation with a 2.6 GHz Intel Sandy Bridge processor. We compared the results of the BLASTN search against the customized database with those against the NCBI NT database.
Identification of infecting viruses using the LCA with BLASTN-NT
[Table: Elapsed time for the LCA with BLASTN-NT. Column headers: # of reads; # of BLASTN hits]
Taxonomic classification using the LCA with BLASTN viruses
[Table: Elapsed time for ELM with BLASTN viruses. Column header: # of BLASTN hits]
Identification of infecting viruses using ELM with BLASTN viruses
[Table: Detection of the viral genomes using ELM with BLASTN viruses. Column headers: # of reads; # of contigs; Average contig length]
Next, we evaluated the effect of the BLAST hit score on the inflation indices. The results showed that the inflation indices had little association with the E-value in the BLAST search (Additional file 1: Figure S1). We also investigated the coverage of the BLAST hits. Valid hits in dataset 1 were distributed across target genomic sequences but not a specific genomic sequence, something not seen in datasets 2 and 3 (Additional file 1: Figure S2).
Virome analyses using ELM with BLASTN viruses
Table: Detection of abundant virus genera in fecal viromes of piglets using ELM with BLASTN viruses
ELM with BLASTN viruses (# of reads): Dependovirus^a (759), Bocavirus (133)
LCA with BLAST-NT (# of reads): Dependovirus (754), Bocavirus (528), Chimpanzee stool associated circular ssDNA virus^b (106)
Interpretation of results
ELM with a specific database drastically reduced the computational time and saved disk space. Furthermore, ELM was effective even for short reads. Although short reads can reduce the accuracy of BLAST searches, in this study we verified ELM for average read lengths between 73 and 400 bases. The results showed no difference in the capability for taxonomic assignment across these read lengths.
One approach to reduce the computational time needed for the BLAST search is the subtraction of reads by mapping host-derived reads onto reference sequences [21, 22]. This approach might be considered effective for reducing analyzed sequence data but is limited to known hosts. It is not suitable for surveillance of wild animals or metagenomic analysis because the host sequences have yet to be deposited in databases. Therefore, we need to decide a moderate threshold for NGS data before the mapping.
For ELM we adopted another approach, using specific databases composed only of target sequences to reduce the computational cost. The difficulty in applying this approach directly to virus identification was the increase in false positive assignments (Figure 4). We tested several ways to solve this problem. As shown in Additional file 1: Figure S1, changing the E-value threshold according to the size of the database is probably not effective for discriminating between true and false assignments. Criteria based on breadth coverage (the proportion of reads mapped across the hit genomes) and depth coverage (the number of reads mapped at a position) also failed to identify the target viruses (Additional file 1: Figure S2). In contrast, ELM analyzes how sequence similarity to the relatives changes. This extension of the LCA method suppressed the rise in false positive assignments. A limitation of ELM is false-negative errors, because ELM cannot detect viruses that are only distantly related to their known relatives (Table 4). Therefore, viruses without relatives should be handled carefully, without relying on the inflation index.
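For readers unfamiliar with the coverage criteria mentioned above, the following Python sketch computes depth and breadth of coverage for a single hit genome. It uses the common definitions (depth as the number of reads covering a position, breadth as the fraction of genome positions covered by at least one read), which may differ in detail from the criteria evaluated in this study; the interval convention and function names are assumptions made for illustration.

```python
from typing import List, Tuple


def depth_per_position(genome_length: int,
                       alignments: List[Tuple[int, int]]) -> List[int]:
    """Depth at each position; alignments are 0-based half-open [start, end)."""
    depth = [0] * genome_length
    for start, end in alignments:
        # Clip each alignment to the genome boundaries before counting.
        for position in range(max(start, 0), min(end, genome_length)):
            depth[position] += 1
    return depth


def breadth_of_coverage(depth: List[int]) -> float:
    """Fraction of positions covered by at least one read."""
    if not depth:
        raise ValueError("genome length must be positive")
    return sum(1 for d in depth if d > 0) / len(depth)


# Toy genome of 10 bases with two overlapping reads.
depth = depth_per_position(10, [(0, 4), (2, 6)])
print(depth)                       # [1, 1, 2, 2, 1, 1, 0, 0, 0, 0]
print(breadth_of_coverage(depth))  # 0.6
```

In this toy case the two reads cover six of ten positions, with a maximum depth of 2 where they overlap.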
ELM is especially useful for the first screening of infectious diseases caused by viruses. In surveillance for pathogenic viruses, taxonomic assignment of the host sequence is not necessary for the initial screening. For this, sensitivity for detecting viruses is particularly required. Our results suggest that ELM recovers most reads assigned to target viruses. Therefore, we can apply these results to further sophisticated analyses. ELM will contribute to analyses of NGS data for limited targets such as the direct diagnosis of viral infections.
Availability and requirements
ELM is freely accessible at http://bioinformatics.czc.hokudai.ac.jp/ELM/. The web interface has been tested in the following web browsers: Google Chrome (version 36), Microsoft Internet Explorer (version 11) and Mozilla Firefox (version 31).
Abbreviations
NGS: Next generation sequencing
LCA: Lowest common ancestor
ELM: Enhanced LCA-based method.
The work was supported by the Japan Initiative for Global Research Network on Infectious Diseases.
1. Garten RJ, Davis CT, Russell CA, Shu B, Lindstrom S, Balish A, Sessions WM, Xu X, Skepner E, Deyde V, Okomo-Adhiambo M, Gubareva L, Barnes J, Smith CB, Emery SL, Hillman MJ, Rivailler P, Smagala J, de Graaf M, Burke DF, Fouchier RA, Pappas C, Alpuche-Aranda CM, Lopez-Gatell H, Olivera H, Lopez I, Myers CA, Faix D, Blair PJ, Yu C, et al: Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans. Science. 2009, 325 (5937): 197-201. 10.1126/science.1176225.
2. Gao R, Cao B, Hu Y, Feng Z, Wang D, Hu W, Chen J, Jie Z, Qiu H, Xu K, Xu X, Lu H, Zhu W, Gao Z, Xiang N, Shen Y, He Z, Gu Y, Zhang Z, Yang Y, Zhao X, Zhou L, Li X, Zou S, Zhang Y, Yang L, Guo J, Dong J, Li Q, Dong L, et al: Human infection with a novel avian-origin influenza A (H7N9) virus. N Engl J Med. 2013, 368 (20): 1888-1897. 10.1056/NEJMoa1304459.
3. Li W, Shi Z, Yu M, Ren W, Smith C, Epstein JH, Wang H, Crameri G, Hu Z, Zhang H, Zhang J, McEachern J, Field H, Daszak P, Eaton BT, Zhang S, Wang LF: Bats are natural reservoirs of SARS-like coronaviruses. Science. 2005, 310 (5748): 676-679. 10.1126/science.1118391.
4. Feldmann H, Wahl-Jensen V, Jones SM, Stroher U: Ebola virus ecology: a continuing mystery. Trends Microbiol. 2004, 12 (10): 433-437. 10.1016/j.tim.2004.08.009.
5. Nash D, Mostashari F, Fine A, Miller J, O'Leary D, Murray K, Huang A, Rosenberg A, Greenberg A, Sherman M, Wong S, Layton M: The outbreak of West Nile virus infection in the New York City area in 1999. N Engl J Med. 2001, 344 (24): 1807-1814. 10.1056/NEJM200106143442401.
6. Yu XJ, Liang MF, Zhang SY, Liu Y, Li JD, Sun YL, Zhang L, Zhang QF, Popov VL, Li C, Qu J, Li Q, Zhang YP, Hai R, Wu W, Wang Q, Zhan FX, Wang XJ, Kan B, Wang SW, Wan KL, Jing HQ, Lu JX, Yin WW, Zhou H, Guan XH, Liu JF, Bi ZQ, Liu GH, Ren J: Fever with thrombocytopenia associated with a novel bunyavirus in China. N Engl J Med. 2011, 364 (16): 1523-1532. 10.1056/NEJMoa1010095.
7. Barzon L, Lavezzo E, Militello V, Toppo S, Palu G: Applications of next-generation sequencing technologies to diagnostic virology. Int J Mol Sci. 2011, 12 (11): 7861-7884.
8. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res. 2007, 17 (3): 377-386. 10.1101/gr.5969107.
9. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI: An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006, 444 (7122): 1027-1031. 10.1038/nature05414.
10. Handley SA, Thackray LB, Zhao G, Presti R, Miller AD, Droit L, Abbink P, Maxfield LF, Kambal A, Duan E, Stanley K, Kramer J, Macri SC, Permar SR, Schmitz JE, Mansfield K, Brenchley JM, Veazey RS, Stappenbeck TS, Wang D, Barouch DH, Virgin HW: Pathogenic simian immunodeficiency virus infection is associated with expansion of the enteric virome. Cell. 2012, 151 (2): 253-266. 10.1016/j.cell.2012.09.024.
11. Tong S, Li Y, Rivailler P, Conrardy C, Castillo DA, Chen LM, Recuenco S, Ellison JA, Davis CT, York IA, Turmelle AS, Moran D, Rogers S, Shi M, Tao Y, Weil MR, Tang K, Rowe LA, Sammons S, Xu X, Frace M, Lindblade KA, Cox NJ, Anderson LJ, Rupprecht CE, Donis RO: A distinct lineage of influenza A virus from bats. Proc Natl Acad Sci U S A. 2012, 109 (11): 4269-4274. 10.1073/pnas.1116200109.
12. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.
13. Monzoorul Haque M, Ghosh TS, Komanduri D, Mande SS: SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics. 2009, 25 (14): 1722-1730. 10.1093/bioinformatics/btp317.
14. Gerlach W, Stoye J: Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 2011, 39 (14): e91. 10.1093/nar/gkr225.
15. Ishii A, Thomas Y, Moonga L, Nakamura I, Ohnuma A, Hang’ombe B, Takada A, Mweene A, Sawa H: Novel arenavirus, Zambia. Emerg Infect Dis. 2011, 17 (10): 1921-1924. 10.3201/eid1710.10452.
16. Stenglein MD, Sanders C, Kistler AL, Ruby JG, Franco JY, Reavill DR, Dunker F, Derisi JL: Identification, characterization, and in vitro culture of highly divergent arenaviruses from boa constrictors and annulated tree boas: candidate etiological agents for snake inclusion body disease. MBio. 2012, 3 (4): e00180-00112.
17. Chen EC, Yagi S, Kelly KR, Mendoza SP, Tarara RP, Canfield DR, Maninger N, Rosenthal A, Spinner A, Bales KL, Schnurr DP, Lerche NW, Chiu CY: Cross-species transmission of a novel adenovirus associated with a fulminant pneumonia outbreak in a new world monkey colony. PLoS Pathog. 2011, 7 (7): e1002155. 10.1371/journal.ppat.1002155.
18. Sachsenroder J, Twardziok SO, Scheuch M, Johne R: The general composition of the faecal virome of pigs depends on age, but not on feeding with a probiotic bacterium. PLoS One. 2014, 9 (2): e88888. 10.1371/journal.pone.0088888.
19. Warren RL, Sutton GG, Jones SJ, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 2007, 23 (4): 500-501. 10.1093/bioinformatics/btl629.
20. Cheung AK, Ng TF, Lager KM, Bayles DO, Alt DP, Delwart EL, Pogranichniy RM, Kehrli ME: A divergent clade of circular single-stranded DNA viruses from pig feces. Arch Virol. 2013, 158 (10): 2157-2162. 10.1007/s00705-013-1701-z.
21. Chang Y, Cesarman E, Pessin MS, Lee F, Culpepper J, Knowles DM, Moore PS: Identification of herpesvirus-like DNA sequences in AIDS-associated Kaposi’s sarcoma. Science. 1994, 266 (5192): 1865-1869. 10.1126/science.7997879.
22. Simons JN, Pilot-Matias TJ, Leary TP, Dawson GJ, Desai SM, Schlauder GG, Muerhoff AS, Erker JC, Buijk SL, Chalmers ML, van Sant CL, Mushahwar IK: Identification of two flavivirus-like genomes in the GB hepatitis agent. Proc Natl Acad Sci U S A. 1995, 92 (8): 3401-3405. 10.1073/pnas.92.8.3401.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.