Reads2Type: a web application for rapid microbial taxonomy identification
© Saputra et al. 2015
Received: 14 April 2015
Accepted: 17 November 2015
Published: 25 November 2015
Identification of bacteria may be based on sequencing and molecular analysis of a specific locus such as 16S rRNA, or a set of loci such as in multilocus sequence typing. In the near future, healthcare institutions and routine diagnostic microbiology laboratories may need to sequence the entire genome of microbial isolates. Therefore we have developed Reads2Type, a web-based tool for taxonomy identification based on whole bacterial genome sequence data.
Raw sequencing data provided by the user are mapped against a set of marker probes that are derived from currently available complete bacterial genomes. Using a dataset of 1003 whole-genome-sequenced bacteria from various sequencing platforms, Reads2Type was able to identify the species with 99.5 % accuracy within minutes.
In comparison with other tools, Reads2Type offers the advantage of not needing to transfer sequencing files, as the entire computational analysis is done on the computer of the person using the web application. This also prevents data privacy issues from arising. The Reads2Type tool is available at http://www.cbs.dtu.dk/~dhany/reads2type.html.
Keywords: Microbial identification, Marker genes, Whole genome sequencing, Bacterial isolate
Identification of bacteria is important for making accurate clinical diagnoses and for narrowing down the list of potential antibiotics that may be used against the pathogens, and therefore for quickly initiating medical therapy for the patient. In the past, traditional phenotypic and biochemical methods were widely used for bacterial identification [1, 2], as bacterial whole genome sequencing (WGS) was too expensive and difficult to implement. WGS has recently started showing its potential as a cost-effective and rapid solution for medical diagnostics and outbreak prevention. For example, via WGS one can identify species and strain [3, 4] and antibiotic resistance genes, as well as predict pathogenicity and identify novel genes.
The latest developments in sequencing technology have contributed to lowering sequencing error, producing longer sequence reads, increasing throughput on modern sequencers, and decreasing sequencing cost. Therefore it is expected that in the near future, clinical and industrial microbiological laboratories will have access to their own sequencers. The issue to be faced will then be how to handle and analyze the large amounts of sequencing data to produce useful biological and epidemiological information, for example regarding the identity of pathogens.
The major challenge for taxonomy identification based on sequencing data is the selection of marker genes. The 16S rRNA gene is commonly used for deriving phylogeny and taxonomy of microbes [2, 8], and for bacterial identification in metagenomics samples. This is due to the presence of the 16S rRNA gene in all bacteria, as well as its conserved function. However, the 16S rRNA gene has low discriminatory power at the species level for several taxonomic groups [11, 12], for example the Enterobacteriaceae family [11, 13–17]. This lack of accuracy in identifying Enterobacteriaceae species using 16S is due to the high similarity of 16S sequences within the family. To increase the bacterial identification accuracy, one could instead use a combination of several housekeeping genes. However, the larger the marker sequence database is, the slower the bacterial identification process becomes. Therefore, a smaller bacterial marker sequence database needs to be constructed, which contains sufficient data for bacterial identification.
Aligning millions of sequencing reads against tens of thousands of marker sequences is the second challenge in identifying species based on WGS data. String matching is computationally demanding and the most time-consuming part of the in-silico taxonomy identification process. Thus, a strategy to efficiently match two sets of sequences is needed.
A web-based taxonomy identification tool must be able to cope with both an unstable Internet connection (on the client side) and network bottlenecks (on the server side), as file upload failure may halt the analysis before it even starts and too many clients uploading files may cause network issues, respectively. Therefore, to avoid these drawbacks, the development of web-based taxonomy identification should focus on minimizing the amount of data transfer through the Internet.
In this paper we describe the web application Reads2Type, with which one can rapidly identify the taxonomy of bacterial isolates based on raw WGS data. The user does not need to upload the sequencing data to the server. As far as we are aware, this is the first bacterial identification web server that assigns the computational analysis to the client side, thus avoiding network issues and minimizing data transfer. It was previously shown that Reads2Type performs approximately 2.5 times faster than other tools at the same taxonomy identification accuracy. This result is based on a benchmark study that compared the performance of the console application of Reads2Type with other tools for taxonomy identification from raw sequencing files. The old web-based version of Reads2Type used Java. However, Java support on the web has been limited since early 2014, and we therefore rebuilt the web-based Reads2Type in a Node.js environment, bundled for the browser with Browserify. This makes the version of Reads2Type presented here faster and more reliable for microbial identification than the old web-based version.
Evaluation dataset and computational resources
The evaluation set for selecting Enterobacteriaceae marker genes for the probe database consisted of 30,680 Enterobacteriaceae short-read sequencing files from 24 different species, obtained from the NCBI SRA (Sequence Read Archive). To examine the performance of Reads2Type, we evaluated 1003 raw sequencing data sets of Campylobacter jejuni, Enterococcus faecalis, Escherichia fergusonii, Escherichia coli, Klebsiella pneumoniae, Staphylococcus aureus, and Salmonella enterica isolates, sequenced at the Technical University of Denmark. For real-life outbreak data, Reads2Type was tested on 6 publicly available raw sequencing data files from the German 2011 E. coli outbreak. In addition, Reads2Type was tested on ERR025475_2, one of the K. pneumoniae raw read files provided by Sanger (http://www.ebi.ac.uk/ena/data/view/ERR025475). All computational experiments were conducted on a 2.6 GHz Intel Core i5 CPU with 8 GB memory running Mac OS X Yosemite, except the latter test, which was conducted in Amman, Jordan, on a 2.20 GHz Intel Core i7 CPU with 6 GB memory running Windows 7 Home Basic SP-1 64-bit.
A probe database, consisting of fragments of selected marker genes, was used as the reference database for Reads2Type to identify the taxonomy of prokaryotes based on WGS reads. The length of the probe sequences was set to 50 base pairs (bp), as nowadays most sequencing platforms produce reads longer than 50 bp. Moreover, reads shorter than 50 bp have an insufficient proportion of unique sequences that can be mapped to the genome.
16S rRNA was used as the main marker gene for the probe database. To increase Reads2Type prediction accuracy, the 16S rRNA sequences for the probe database were predicted from the collection of prokaryotic complete genomes using RNAmmer instead of retrieving 16S rRNA sequences from publicly available databases of targeted sequencing and partial coding sequences. RNAmmer is highly accurate in predicting 16S sequences and may even predict 16S sequences that are not yet submitted to the public RNA databases. The complete genomes of bacteria and archaea were obtained from the NCBI Genome Database on 26 August 2012. This dataset of 2045 different strains consists of 969 different bacterial species, 150 bacterial unspecified species, 105 different archaeal species, and 13 archaeal unspecified species such as genomospecies, endosymbionts, uncultured microbes, and “sp.” organisms (i.e., organisms that have only been typed to the genus but not the species level).
Although 16S rRNA was the main marker gene, DNA gyrase subunit B (gyrB) was a better marker gene for Enterobacteriaceae, as shown in the Results section. Hence, the probe sequences were generated by, first, gathering 11,481 16S rRNA sequences, which were predicted by RNAmmer, and 1620 Enterobacteriaceae gyrB sequences, which were downloaded from the NCBI nucleotide database. Then, all possible 50 bp fragments were generated from these gathered sequences. The outcome formed the probe database.
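The fragment generation step can be illustrated with a minimal Python sketch. It is not the Reads2Type source code; the function names `all_fragments` and `raw_probes` and the `(organism, sequence)` input layout are assumptions made only for this illustration. The probe length of 50 bp is taken from the text.

```python
from typing import Iterable, List, Tuple

PROBE_LENGTH = 50  # probe length in bp, as stated in the text


def all_fragments(sequence: str, k: int = PROBE_LENGTH) -> List[str]:
    """Return every overlapping k-bp window of a marker sequence."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty range, hence no fragments.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def raw_probes(markers: Iterable[Tuple[str, str]],
               k: int = PROBE_LENGTH) -> List[Tuple[str, str]]:
    """Return (organism, fragment) pairs for every marker sequence.

    `markers` is a hypothetical list of (organism name, marker gene sequence)
    tuples, e.g. RNAmmer-predicted 16S rRNA or downloaded gyrB sequences.
    """
    return [(organism, fragment)
            for organism, sequence in markers
            for fragment in all_fragments(sequence, k)]
```

A marker gene of length n contributes n - 49 fragments, so a 16S rRNA gene of roughly 1500 bp yields about 1450 raw fragments before any size reduction.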
Size reduction of probe database
To improve the performance of Reads2Type, we reduced the size of the probe database in three consecutive steps, removing 1) all the 16S rRNA probes that were unique to Enterobacteriaceae, as these have low species identification accuracy, 2) the duplicates that resulted from chopping conserved regions of different strains, and 3) the consecutive probes. To remove the duplicates we followed this procedure: for each 50-mer fragment extracted from a marker gene, we derived the list of organisms sharing that fragment, and called these fragments probe sequences. Of the 1,268,055 probe sequences that were produced, 1,040,203 were found in only one organism and are defined as unique probes; the rest were shared between organisms and are called shared probes. To reduce consecutive probes we retained one 50-mer every 25 bp along the marker genes, as a window step of 25 bp is considered dense enough to identify the species of the given isolate.
Via the three steps mentioned above we managed to reduce the size of the probe database to 61,462 probes, which is ∼20 times smaller than the original size; 40,085 of them are unique probes, and the rest are shared probes. The file size of the probe database is 4.6 MB. This database is loaded into the client computer’s memory once the Reads2Type website is accessed by the client.
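Steps 2 and 3 of the reduction can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `build_probe_index` and `split_unique_shared` are hypothetical, the input is assumed to be a list of `(organism, marker sequence)` pairs, and step 1 (dropping the 16S rRNA probes unique to Enterobacteriaceae) is omitted.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

ProbeIndex = Dict[str, FrozenSet[str]]


def build_probe_index(markers: List[Tuple[str, str]],
                      k: int = 50, step: int = 25) -> ProbeIndex:
    """Collapse duplicate k-mers and keep one probe every `step` bp."""
    # Duplicate removal: map every distinct k-mer to the organisms sharing it.
    owners = defaultdict(set)
    for organism, sequence in markers:
        for i in range(len(sequence) - k + 1):
            owners[sequence[i:i + k]].add(organism)
    # Consecutive-probe removal: retain only k-mers starting every `step` bp.
    kept: ProbeIndex = {}
    for _, sequence in markers:
        for i in range(0, len(sequence) - k + 1, step):
            fragment = sequence[i:i + k]
            kept[fragment] = frozenset(owners[fragment])
    return kept


def split_unique_shared(index: ProbeIndex) -> Tuple[ProbeIndex, ProbeIndex]:
    """Separate probes found in one organism from those shared by several."""
    unique = {p: o for p, o in index.items() if len(o) == 1}
    shared = {p: o for p, o in index.items() if len(o) > 1}
    return unique, shared
```

Storing the owning organisms as a set per probe is what later allows a shared probe to restrict the list of candidate species.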
To improve Reads2Type performance, each time a read matches a shared probe, only the probe sequences whose organisms form a subset of the organisms sharing the matching probe are kept, and all other sequences are removed, thus significantly accelerating the search for a read that matches a unique probe. For instance, if a read matches a probe sequence that is shared by organisms A, B, and C, then the probe database is reduced in such a way that it only contains the unique probes of A, B, and C, as well as the shared probes of A and B, A and C, and B and C. We define the above as the ‘narrow down’ approach. Also, DNA string matching is done by FM-indexing. This consists of a combination of Burrows-Wheeler transform indexing and suffix array indexing that can be used to efficiently find exact matches to a pattern.
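The ‘narrow down’ approach can be sketched as below. This is an illustrative Python sketch, not the Reads2Type code: the names `narrow_down` and `identify` are hypothetical, a plain dictionary lookup stands in for the FM-index, and reverse-complement matching is ignored. Following the A, B, C example in the text, the sketch keeps only probes whose organism set is a proper subset of the matched set.

```python
from typing import Dict, FrozenSet, Iterable, Optional

ProbeIndex = Dict[str, FrozenSet[str]]


def narrow_down(index: ProbeIndex, candidates: FrozenSet[str]) -> ProbeIndex:
    """Keep probes whose organisms are a proper subset of `candidates`."""
    return {probe: orgs for probe, orgs in index.items() if orgs < candidates}


def identify(reads: Iterable[str], index: ProbeIndex, k: int = 50) -> Optional[str]:
    """Return the first organism reached through a unique probe, else None."""
    for read in reads:
        for i in range(len(read) - k + 1):
            orgs = index.get(read[i:i + k])
            if orgs is None:
                continue  # this window matches no remaining probe
            if len(orgs) == 1:
                return next(iter(orgs))  # unique probe: species identified
            # Shared probe: shrink the database to the candidate organisms.
            index = narrow_down(index, orgs)
    return None
```

With toy 4-bp probes, a read matching a probe shared by A, B, and C removes all probes of other organisms, so later reads are compared against a much smaller database.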
The Enterobacteriaceae marker genes choice
Accuracy test on in-house raw reads
To investigate how much a slow Internet connection (54.0 Mbps) may affect the bacterial identification process, Reads2Type was tested in Amman (Jordan), on a different continent from the server, which is located in Denmark. It turned out that, although the download speed was about 6 times lower than in Denmark, the time needed for bacterial identification in Jordan was similar to that needed in Denmark (namely less than 10 s), suggesting that the run time is independent of the speed of the Internet connection on the client side.
The results from our study show that by using the web-based Reads2Type application it is possible to identify species based on raw WGS reads. Also, we show that replacing the unique probe sequences of Enterobacteriaceae that are derived from 16S rRNA with ones derived from gyrB increases the accuracy of Reads2Type.
Reads2Type is a user-friendly web application that can be accessed via different types of platforms, and which provides species identification within minutes. This relatively short time scale is due to the ‘narrow down’ treatment (see Section “Reads2Type”), the fast FM index for string matching, the small probe database, and the fact that the computational analysis is performed on the client computers, instead of on the server. In contrast with standalone applications, Reads2Type does not require downloading and installing software.
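To make the role of the FM index concrete, the following is a minimal textbook-style sketch of exact matching by backward search. It is a generic illustration and not the Reads2Type implementation: the class name `FMIndex` is an assumption, the suffix array is built naively by sorting suffixes, and the occurrence table is stored in full, which is only reasonable for small inputs.

```python
from typing import Dict, List


class FMIndex:
    """Toy FM-index: BWT plus suffix array, queried by backward search."""

    def __init__(self, text: str) -> None:
        text += "$"  # sentinel, sorts before A, C, G, T
        # Naive suffix array: start positions of suffixes in sorted order.
        self.sa: List[int] = sorted(range(len(text)), key=lambda i: text[i:])
        # Burrows-Wheeler transform: character preceding each sorted suffix.
        bwt = "".join(text[i - 1] for i in self.sa)
        alphabet = sorted(set(text))
        # first[c]: number of characters in the text smaller than c.
        self.first: Dict[str, int] = {}
        total = 0
        for c in alphabet:
            self.first[c] = total
            total += text.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ: Dict[str, List[int]] = {c: [0] * (len(bwt) + 1) for c in alphabet}
        for i, ch in enumerate(bwt):
            for c in alphabet:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def find(self, pattern: str) -> List[int]:
        """Return sorted start positions of exact matches of `pattern`."""
        lo, hi = 0, len(self.sa)
        for c in reversed(pattern):  # process the pattern right to left
            if c not in self.first:
                return []
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])
```

Each pattern character costs two table lookups, so the search time grows with the pattern length rather than with the size of the indexed text, which is what makes this kind of index attractive for matching many reads against a fixed probe set.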
The web-based BLAST tool may be used as an alternative to Reads2Type for species identification. Like Reads2Type, BLAST can identify the taxonomy of single isolates based on WGS data. However, BLAST requires conversion of raw sequencing data to FASTA format and uploading of data files to the server. This procedure may be difficult for untrained users. Also, it is time-consuming for users who need immediate results or only have a slow Internet connection at their disposal. The accuracy of BLAST is higher than that of Reads2Type, as BLAST uses the nucleotide database, while Reads2Type uses subsets of 16S rRNA sequences and gyrB sequences from the complete genome and nucleotide databases, respectively. However, the web-based BLAST returns an error message when the sequencing file is very large and, as a consequence, may not be able to complete the computation within an hour, while Reads2Type can deal with files of any size, as users do not need to upload data files and Reads2Type stops reading the file when the species is identified. The web-based BLAST also returns an error if hundreds of megabytes or more of high-scoring segment pairs are produced, thus requiring users to readjust the BLAST parameters and re-run the BLAST search.
Short read aligners such as BWA and Bowtie2 may also be used for species identification of a bacterial isolate, although both require downloading executables. However, both BWA and Bowtie2 require huge computational resources, and these may not be available to a number of users. Although the web-based BLAST and the short read aligners provide accurate results, Reads2Type may be practical when a quick identification of species from raw sequencing files is needed.
Kraken requires:
- at least 75 GB of computer memory,
- at least 160 GB of disk space,
- complex and computationally expensive pre-runs, which consist of downloading the installer, running the installation, setting the UNIX environment variables, and downloading and prebuilding the reference database from the NCBI complete and draft genomes database, and
- reading the whole sequencing file before delivering results. Thus, Kraken execution time depends on the size of the sequencing file.
In contrast, Reads2Type requires:
- ∼6 MB of free memory to load the marker database into the browser,
- no disk space to run,
- no pre-runs, and
- no reading of entire sequencing files, as Reads2Type analysis ends when a read matches a unique probe.
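The early-stopping behaviour can be illustrated with a short Python sketch that reads a FASTQ file record by record and stops at the first read containing a unique probe. It is a simplified illustration, not the Reads2Type code: the names `fastq_sequences` and `identify_streaming` are hypothetical, records are assumed to span exactly four lines, and `unique_probes` is assumed to map each unique probe directly to a species name.

```python
import io
from typing import Dict, Iterator, Optional, TextIO, Tuple


def fastq_sequences(handle: TextIO) -> Iterator[str]:
    """Lazily yield the sequence line of each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield sequence


def identify_streaming(handle: TextIO, unique_probes: Dict[str, str],
                       k: int = 50) -> Tuple[Optional[str], int]:
    """Return (species, reads consumed), stopping at the first unique match."""
    reads_seen = 0
    for sequence in fastq_sequences(handle):
        reads_seen += 1
        for i in range(len(sequence) - k + 1):
            species = unique_probes.get(sequence[i:i + k])
            if species is not None:
                return species, reads_seen  # the rest of the file is never read
    return None, reads_seen


# Toy usage with 4-bp probes; the third record is never consumed.
demo = io.StringIO("@r1\nTTTTTT\n+\nIIIIII\n@r2\nAACGTA\n+\nIIIIII\n@r3\nGGGGGG\n+\nIIIIII\n")
result = identify_streaming(demo, {"ACGT": "Escherichia coli"}, k=4)
```

Because the file is consumed lazily, the work done depends on how early a read hitting a unique probe appears, not on the total file size.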
One should keep in mind that Reads2Type does not provide prediction confidence as Kraken does. Also, Reads2Type can only be used to identify species of single isolate samples, whereas Kraken can be used for species identification in metagenomic samples. Therefore, if single isolate samples are investigated, one would expect Kraken to predict only one species with high confidence. Furthermore, the prediction accuracy of Kraken is generally higher than that of Reads2Type because Kraken uses both the complete and draft genome sequences as the reference database, while Reads2Type only uses fragments of selected marker genes derived from complete genome sequences.
The clinical advantage of using web-based WGS tools is that they provide not only a taxonomic identification, but also information regarding antibiotic resistance, virulence factors, novel genes, predictions regarding pathogenicity, and spatiotemporal data concerning previous outbreaks caused by the same or closely related pathogens. This is particularly useful for healthcare personnel who need to quickly identify an upcoming threat, to help control ongoing outbreaks, and to contribute to the development of a global epidemiology map.
When contamination occurs during the sequencing stage, Reads2Type may misidentify the species. The current availability of complete reference genomes limits the identification power of Reads2Type. This is because the lower the number of strains available as complete genomes is, the less accurate are the marker probes that can be extracted for Reads2Type, and hence the lower the accuracy in microbial identification. However, the idea behind Reads2Type is not to substitute the use of accurate tools such as BLAST and BWA, but rather to give an educated guess regarding the identity of bacteria. Furthermore, when using Reads2Type one does not need to be concerned about data privacy issues, as data are not sent across the Internet, and the data analysis is entirely done on the client side.
Availability of supporting data
The 1003 whole-genome-sequenced bacteria in the dataset were all sequenced at the Technical University of Denmark. The datasets that are already published refer to S. aureus, E. faecalis, and E. fergusonii (from project accession [ENA:PRJEB8647]). The remaining raw reads, which refer to C. jejuni, K. pneumoniae, S. enterica and E. coli, will be made available upon publication of the papers describing the epidemiology of these strains. Among these, the raw reads of Salmonella-spp-B25, Salmonella-spp-02-03-002, Salmonella-spp-02-03-008, Salmonella-spp-05-102, and Salmonella-spp-07-022 are public. The raw sequencing data files from the 2011 German E. coli outbreak were downloaded from ftp://ftp.genomics.org.cn/pub/Ecoli_TY-2482/110601_I238_FCB067HABXX_L3_ESCqslRAADIAAPEI-2_1.fq.gz for the BGI Illumina read data and http://www.hpa-bioinformatics.org.uk/lgp/genomes for the five HPA Illumina read data.
This work was supported by the Center for Genomic Epidemiology at the Technical University of Denmark and funded by grant 09-067103/DSF from the Danish Council for Strategic Research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Clarridge 3rd JE. Impact of 16s rrna gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol Rev. 2004; 17(4):840–62. doi:10.1128/CMR.17.4.840-862.2004.
- Janda JM, Abbott SL. 16s rrna gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol. 2007; 45(9):2761–4. doi:10.1128/JCM.01228-07.
- Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, et al. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol. 2012; 50(4):1355–61. doi:10.1128/JCM.06094-11.
- Lukjancenko O, Thomsen MC, Larsen MV, Ussery DW. Panfunpro: Pan-genome analysis based on functional profiles. F1000Research. 2013; 2:265.
- Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012; 67(11):2640–4. doi:10.1093/jac/dks261.
- Cosentino S, Voldby Larsen M, Møller Aarestrup F, Lund O. Pathogenfinder–distinguishing friend from foe using bacterial whole genome sequence data. PLoS One. 2013; 8(10):77302. doi:10.1371/journal.pone.0077302.
- Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012; 30(5):434–9. doi:10.1038/nbt.2198.
- Woo PCY, Lau SKP, Teng JLL, Tse H, Yuen KY. Then and now: use of 16s rdna gene sequencing for bacterial identification and discovery of novel bacteria in clinical microbiology laboratories. Clin Microbiol Infect. 2008; 14(10):908–34. doi:10.1111/j.1469-0691.2008.02070.x.
- Shah N, Tang H, Doak TG, Ye Y. Comparing bacterial communities inferred from 16s rrna gene sequencing and shotgun metagenomics. Pac Symp Biocomput. 2011; 16:165–76.
- Stackebrandt E, Goebel B. Taxonomic note: a place for dna-dna reassociation and 16s rrna sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol. 1994; 44(4):846–9.
- Hedegaard J, Steffensen SA, Nørskov-Lauritsen N, Mortensen KK, Sperling-Petersen HU. Identification of enterobacteriaceae by partial sequencing of the gene encoding translation initiation factor 2. Int J Syst Bacteriol. 1999; 49 Pt 4:1531–8.
- Naum M, Brown EW, Mason-Gamer RJ. Is 16s rdna a reliable phylogenetic marker to characterize relationships below the family level in the enterobacteriaceae?. J Mol Evol. 2008; 66(6):630–42. doi:10.1007/s00239-008-9115-3.
- Delmas J, Breysse F, Devulder G, Flandrois JP, Chomarat M. Rapid identification of enterobacteriaceae by sequencing dna gyrase subunit b encoding gene. Diagn Microbiol Infect Dis. 2006; 55(4):263–8. doi:10.1016/j.diagmicrobio.2006.02.003.
- Pham HN, Ohkusu K, Mishima N, Noda M, Monir Shah M, Sun X, et al. Phylogeny and species identification of the family enterobacteriaceae based on dnaj sequences. Diagn Microbiol Infect Dis. 2007; 58(2):153–61. doi:10.1016/j.diagmicrobio.2006.12.019.
- Miñana-Galbis D, Urbizu-Serrano A, Farfán M, Fusté MC, Lorén JG. Phylogenetic analysis and identification of aeromonas species based on sequencing of the cpn60 universal target. Int J Syst Evol Microbiol. 2009; 59(Pt 8):1976–83. doi:10.1099/ijs.0.005413-0.
- Paradis S, Boissinot M, Paquette N, Bélanger SD, Martel EA, Boudreau DK, et al. Phylogeny of the enterobacteriaceae based on genes encoding elongation factor tu and f-atpase beta-subunit. Int J Syst Evol Microbiol. 2005; 55(Pt 5):2013–5. doi:10.1099/ijs.0.63539-0.
- Roggenkamp A. Phylogenetic analysis of enteric species of the family enterobacteriaceae using the oric-locus. Syst Appl Microbiol. 2007; 30(3):180–8. doi:10.1016/j.syapm.2006.06.004.
- Spröer C, Mendrock U, Swiderski J, Lang E, Stackebrandt E. The phylogenetic position of serratia, buttiauxella and some other genera of the family enterobacteriaceae. Int J Syst Bacteriol. 1999; 49 Pt 4:1433–8.
- Jolley KA, Feil EJ, Chan MS, Maiden MC. Sequence type analysis and recombinational tests (start). Bioinformatics. 2001; 17(12):1230–1.
- Larsen MV, Cosentino S, Lukjancenko O, Saputra D, Rasmussen S, Hasman H, et al. Benchmarking of methods for genomic taxonomy. J Clin Microbiol. 2014; 52(5):1529–39. doi:10.1128/JCM.02981-13.
- Kodama Y, Shumway M, Leinonen R. International Nucleotide Sequence Database Collaboration: The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012; 40(Database issue):54–6. doi:10.1093/nar/gkr854.
- Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009; 20(6):327–8. doi:10.1007/s00335-009-9187-4.
- Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW. Rnammer: consistent and rapid annotation of ribosomal rna genes. Nucleic Acids Res. 2007; 35(9):3100–8. doi:10.1093/nar/gkm160.
- Ferragina P, Manzini G. Opportunistic data structures with applications. In: Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium On. Redondo Beach, CA: IEEE: 2000. p. 390–8. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=892127.
- Kwok AYC, Chow AW. Phylogenetic study of staphylococcus and macrococcus species based on partial hsp60 gene sequences. Int J Syst Evol Microbiol. 2003; 53(Pt 1):87–92.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15(3):46. doi:10.1186/gb-2014-15-3-r46.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10. doi:10.1016/S0022-2836(05)80360-2.
- Li H, Durbin R. Fast and accurate long-read alignment with burrows-wheeler transform. Bioinformatics. 2010; 26(5):589–95. doi:10.1093/bioinformatics/btp698.
- Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012; 9(4):357–9. doi:10.1038/nmeth.1923.
- Price LB, Stegger M, Hasman H, Aziz M, Larsen J, Andersen PS, et al. Staphylococcus aureus cc398: host adaptation and emergence of methicillin resistance in livestock. MBio. 2012;3(1). doi:10.1128/mBio.00305-11.
- Leekitcharoenphon P, Friis C, Zankari E, Svendsen CA, Price LB, Rahmani M, et al. Genomics of an emerging clone of salmonella serovar typhimurium st313 from nigeria and the democratic republic of congo. J Infect Dev Ctries. 2013; 7(10):696–706.