Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations

Background Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, has led to the challenge of identifying microbial organisms by DNA sequence. We sought to validate a Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results. Results After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 1011 base pairs of 3.3 × 106 sequences from 9.25 × 105 species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally using a same set of training sequences we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching with an average throughput of over 1.95 × 106 reads classified per minute. With permutation matching, we achieved sensitivity greater than 83 % and precision greater than 94 % with simulated datasets at the species classification level. We demonstrated the application of this technique applied to conjunctival and gut microbiome metagenomics sequencing results. In our head to head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half the amount of RAM of CLARK. Conclusions SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank and allows for a single step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived.


Background
Next generation sequencing technology has enabled characterization of metagenomics through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, has led to the challenge of identifying microbial organisms by DNA sequence [1,2]. The library of sequenced DNA fragments mapped to an identified taxonomy species has been growing in parallel; the latest release of NCBI Genbank (v209) has catalogued 1.99 × 10 11 basepairs of cDNA and genomic DNA from 1.87 × 10 8 records [3] (http://www.ncbi.nlm.nih.gov/news/ 08-19-2015-genbank-release-209/). The computational challenge has been to rapidly and accurately identify species level DNA sequences from next generation metagenomic shotgun sequencing data.
Currently the most widely used classification algorithm, BLAST [4], relies on indexing unique fragments of DNA that narrow the search space. While BLAST works well for small numbers of sequences, the algorithm scales poorly to the large number of reads generated by next generation sequencing files [5]. Other sequence alignment software has been created specifically adapted to next generation sequencing output such as Bowtie2 [6], Burrows-Wheeler Aligner [7], and Short Oligonucleotide Analysis Package [8]. These alignment software work well for the precise alignment of a large number of next generation sequencing reads against single organism genomes but scale poorly when attempting to align reads against all known DNA sequences. MEGAN and Meta-Phyler have been developed to work with BLAST specifically for the use of metagenomic sequencing classification [9,10]. However even though these probabilistic approaches have high accuracy [11,12], they remain limited by the computational expensive nature of BLAST. In addition, empirical approaches have also used machine learning algorithms with both supervised [13][14][15][16] and unsupervised methods [17][18][19].
Recently Kraken was developed to specifically address the problem of classifying next generation sequencing output from metagenomics projects [5]. Briefly, Kraken works by creating a k-mer database mapped to the lowest common ancestor, reducing the search space significantly. By doing so, Kraken performs exact k-mer matching and maps reads against its database with high speed and throughput. Validation of Kraken suggested processing of 4 million reads per minute at a rate over 900 times faster than MegaBlast [5]. The limitations of Kraken includes the long execution time and memory consumption during the database construction as well as the current databases being limited to bacterial, archaeal, and viral genomes, necessitating the elimination of host genomic DNA prior to classification using Kraken.
In addition to Kraken, a number of other approaches have been published. In particular CLARK [20] and LMAT [21] have been shown to have similar if not higher accuracy while maintaining the impressive throughput of Kraken. LMAT, similar to Kraken, attempts to utilize taxonomy information to reduce the database of k-mers, but current implementations are limited to microbial genomes and do not include mammalian sequences. CLARK attempts to decrease the k-mer search space by only indexing keys that uniquely identify a given taxonomy level and offers several modes of execution, including a version called CLARK-L that is optimized for limited RAM environments by subsampling the database to smaller fraction. All three techniques, Kraken, LMAT, and CLARK, attempt to limit the k-mer search space by either finding the least common ancestor (LCA) kmers or finding discriminatory k-mers that uniquely identify an organism at a given taxonomy level.
Recently, the MapReduce programming model [22] has caused a substantial shift in the way that large data sets may be distributed in parallel within a computing cluster. For example, Google used the MapReduce [23] framework to regenerate their index of the Internet, and the MapReduce framework has become popularized as a generic framework to solve big data bioinformatics problems in many-core cluster systems [24][25][26][27]. Database sharding has been used in other fields to horizontally scale very large sets of data and can reduce the each subset of the database into a datastructure in memory limited environments [28,29]. Unlike prior algorithms which limited the k-mer search space, we sought to leverage parallel computing and a MapReduce computational framework with a sharded database to create a scalable complete search heuristic for next generation sequencing files from metagenomics projects.

Computational infrastructure
The University of Washington provides a shared highperformance computing cluster known as Hyak. Currently UW Hyak has 9,028 Intel Xeon processing cores with 834 computational nodes. Each node used to test computational scaling contained 16 CPU cores with 64 GB of memory.

Construction of database
The v209 release of NCBI Genbank was downloaded (September 2015) and each Genbank accession was linked using the NCBI Taxonomy database to a single species and class. Using parallelization across 156 cores and a MapReduce framework, the genomic DNA was then virtually cut at every 30 basepairs, and each 30-mer was linked to the corresponding species and class and sorted. Finally merge sort was used to combine all the sorted 30-mers for classification. The dataset was then split into shards based on the first four basepairs of each 30mer creating a 256 separate databases that could be deterministically searched. The databases were saved in a hashtable format that could be loaded at runtime into memory by each search program.

Description of search heuristic
A total of 256 search programs are started asynchronously in parallel with each program assigned a 4 basepair shard as part of the mapping step. Each search program then iterates through the list of sequences in FASTA or FASTQ format and slides a 30 basepair window if the first 4 basepairs match the assigned shard definition of the executing program. The remaining 26 basepairs are then used to execute an in-memory hash-table lookup (Fig. 1). The reverse complement is also checked for every read. Each successful match to a species, genus, or class is kept and recorded. In addition, a 1-edit distance permutation algorithm was created to generate every possible one base-pair substitution permutation of the 30-mer search to account for sequencing errors and single nucleotide polymorphisms, without accounting for insertions or deletions. The results of each program are sequentially reduced to create the final classification results. Matching is performed at the species level and multiple matches against different organisms are collected. If any match is mammalian then the read is classified as mammalian; the highest voted match at the species, genus, and class taxonomy levels are calculated for each read for the final classification. If the highest classification for a read is a tie, then the read is labeled as ambiguous for a given taxonomy level.

Datasets tested
Simulated datasets (HiSeq, MiSeq, and simBA5) were taken from the publicly available datasets that were used to evaluate Kraken [5].
In a previous clinical trial of acute conjunctivitis/epidemic keratoconjunctivitis (NV-422 Phase IIB/III, NovaBay, clinicaltrials.gov: NCT01532336), a total 500 patients with clinical signs and symptoms of epidemic keratoconjunctivitis were recruited worldwide. Institutional review board approval was obtained through Goodwyn IRB (Cincinnati, OH, approval number: CL1104) Clinical research adhered to the tenets of the Declaration of Helsinki and was conducted in accordance with Health Insurance Portability and Accountability Act regulations. Written informed consent was obtained before participation for all participants in the study. Conjunctival samples from the upper/lower tarsal conjunctiva and fornix were collected using sterile dry swabs (Copan diagnostics inc., Murrieta, CA). Genomic DNA was isolated from conjunctival swabs using Qiagen Blood & Tissue DNA Kit (Qiagen, Inc., Venlo, the Netherlands) as per protocol. Three samples were randomly selected for whole genome sequencing (WGS). One nanogram of genomic DNA from each sample was used to prepare libraries for WGS according to the manufacturer's instruction using Illumina Nextera XT Sample Prep Kit (Illumina, Inc, San Diego, CA). The DNA libraries were sequenced using MiSeq System following the manufacturer's standard protocols (Illumina, Inc, San Diego, CA). Three conjunctival samples were used from this clinical trial collected from patients on the day of enrollment prior to the initiation of either placebo or the investigative drug. The FASTQ files for Fig. 1 Sketch of search strategy in pseudocode these samples have been uploaded to the NCBI SRA archive (SRR3033169, SRR3033245, and SRR3033274). Flash was used to preprocess the paired end libraries and Sickle was used for quality trimming [30].
In addition, data from the Human Microbiome Project [31] was downloaded as an additional metagenomic dataset. Specifically, three gut microbiome datasets (SRS019120, SRS014468 and SRS015055) were downloaded from the NCBI Sequence Read Archive, and Sickle again was applied prior to analysis of the samples.

Evaluation of accuracy and speed
To allow for direct comparison of performance statistics, the same definition of sensitivity and precision were used as described by Wood et al. [5]. Briefly sensitivity was defined as the number of correct classifications of reads divided by the total number in each dataset. Precision was defined as the number of correct classifications divided by the total number of reads attempted to be classified.

Comparison of SMART to Kraken and CLARK
In order to compare the accuracy and performance of the three tools, DNA sequence files from all the bacterial, viral, and archaeal sections of RefSeq were downloaded. For Kraken, CLARK, and SMART, the same sequences were used to build a database in each tool respectively following the documentation provided. The simulated datasets were then analyzed by each tool on the same computational node (16 CPU cores with 64 GB of RAM) in the UW Hyak with multithreading enabled to the maximum number of CPUs. For Kraken, the database was preloaded into memory for maximal performance as suggested by the creators of Kraken for users with NFS filesystems. For CLARK, the standard mode (−m 1) was used to analyze the simulated files as the program failed to start with other modes due to the RAM limitation. In order to calculate throughput, each program was run sequentially three times and the lowest execution time was utilized to calculate throughput.

Software and statistics
Custom software was written in C++ and Ruby. Statistics were performed using R (http://r-project.org). Conjunctival classification results from Kraken were obtained using Illumina BaseSpace and NCBI Blast was run with the database downloaded on November 2015. Software depends on Google SparseHash (https:// github.com/sparsehash/sparsehash) and GNU parallel (http://www.gnu.org/software/parallel/). Software used to run SMART, prebuilt libraries, and training of

Results
After transferring all genomic DNA reads from the latest release of the NCBI GenBank (version 209), a total of over 1 × 10 11 bp of 3.34 × 10 9 sequences from 9.26 × 10 5 species of 1.49 × 10 3 classes were indexed. The number of sequences indexed and the total number of uniquely identifying sequences from the 20 most abundantly represented classes and species are shown in Tables 1 and 2 respectively. Over 3.28 × 10 9 sequences (98.3 %) and 3.32 × 10 9 sequences (99.6 %) were uniquely identifying of a single species and class respectively. With a 4 basepair shards, 256 separate hashtables were created and indexed using a quadratic probing hashtable structure. The uncompressed sharded files used 137 GB of hard disk space to store, with each shard on average consuming 0.53GB of space. Total database construction was completed within 1.5 h and each thread consumed less than 1GB of memory.
Using the same simulated datasets that were used to evaluate Kraken [5], we measured the sensitivity and precision at the species, genus, and class taxonomy levels (Fig. 2). On a single node with 12 search programs executing in parallel, each of the simulated datasets took a total of 30 min to finish. The maximum memory consumed by a single search program was 7.45 GB with an average of 3.78 GB used by each program. A 100 % utilization of each CPU core was noted during the execution of each search program. Using multiple nodes to further parallelize the computation, we achieved linear scaling in throughput with inversely proportional decreases in total computational time (Fig. 3). By increasing the number of nodes to 12, we achieved a maximum throughput of over 2.3 million reads per minute and the ability to classify each of the simulated datasets in under 5 min. Performance of classifying a "real-world" human conjunctival derived metagenomic next generation sequencing result did not show any difference in computational scaling (Fig. 3g and h). When the cost of 1 bp permutations was measured, there was on average a 12.17 times increase in execution time ( Fig. 3i and j). However an average of 6.83 × 10 5 additional reads (6.9 %) was classified in the three simulated datasets.
Because many metagenomic projects come from a single host organism, a major bioinformatic challenge is to effectively filter the host organism genomic DNA from the DNA of the microbial organisms. Indexing the totality of known DNA from the NCBI GenBank and using the NCBI taxonomy classes allows for simultaneous classification of all reads to both mammalian genomes and non-mammalian genomes without a need for a pre-filtering stage. To prevent false positive match for microbial DNA, a conservative approach was used in that if a read was classified even once as mammalian then it was considered to be mammalian in origin. Of note, in GenBank only 11.1 % and 0.4 % of all known 30-mers have perfect matches for bacterial and viral DNA, respectively, at the class taxonomy level. Using this strategy, the whole genome sequencing results from three separate conjunctival samples and three gut microbiome samples from the Human Microbiome Project (SRS019120, SRS014468 and SRS015055) were analyzed with 1 basepair permutations (Table 3). In the human gut samples, on average 42.44 % of all reads were classified with 38.62 % matching non-mammalian DNA. On average, in the paucibacterial conjunctival samples 98.6 % of all the reads were classified; of these, 0.02 % matched non-mammalian DNA. The total reads by classified genus were normalized by the depth of coverage of the human genome in each sample to account for sequencing depth variability. The top twenty organisms from each sample are shown in Fig. 4.
To compare the three methods, one conjunctival sample was analyzed. Human reads were filtered using Illumina Basespace, and was run through Kraken and CLARK with libraries built using all the bacterial, viral, and archaeal sequences from RefSeq. Kraken attempted to classify 6.4 × 10 5 non-human reads but 98 % were unable to be identified. Comparison of the same read results with SMART revealed that 83 % of unclassified reads by Kraken were mammalian DNA in origin. In addition, 69.8 % of microbial classified reads by Kraken also matched mammalian DNA by SMART. A comparison of the microbial matched reads by Kraken against BLAST revealed a similar trend (Table 4). In addition, a similar comparison was made with the results from CLARK; the majority of the reads classified by CLARK as microbial were identified by SMART as having mammalian origin and this was confirmed independently using BLAST (Table 5).
When comparing SMART to Kraken and CLARK directly, a separate database for SMART was developed with all the bacterial, viral, and archaeal sequences from RefSeq. A total of 11,061 sequences were indexed by each tool. During execution each tool utilized all 16 CPUs for multithreading. Sensitivity, precision, throughput, and memory utilization are shown in Fig. 5. SMART utilized on average 2.24 GB of RAM per search program. Disk space of databases for Kraken, CLARK, and SMART were 151 GB, 113 GB, and 29 GB respectively.

Discussion
By indexing every 30-mer in the NCBI GenBank with a multiplexed, parallel searching strategy, we were able to achieve an unsurpassed ability to classify reads against all currently catalogued DNA simultaneously while maintaining similar throughput, sensitivity and precision to Kraken and CLARK on simulated datasets. To the authors' knowledge, this is the first metagenomic classification algorithm capable of efficiently matching against all the species and sequences present in the NCBI GenBank, allowing for a single step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample Kraken represented an improvement in throughput and accuracy in classification algorithms when released in 2014 [5]. During the construction of the Kraken-GB database, Wood et al. noted that there were several draft genomes that had included mislabeled DNA or included adapter sequences and cautioned against the interpretation of Kraken's matches [5]. Our approach of searching the entire GenBank genomic DNA catalogue would protect against these false-positive matches as erroneous sequences would be present in multiple organisms and these results would label the read as ambiguous. However, this highlights the limitation and potential biases introduced by selective over-representation of certain species in the NCBI GenBank. For example, many of the animal models used in the biomedical science are overrepresented in the genomic DNA catalogued, as scientists are most interested in these organisms ( Table 2). Hence false-positive matching may occur against these organisms if the true organism has not been sequenced yet. Statistical modeling could be used to generate matching likelihoods to each organism based on relative database representation.
With integration into Illumina BaseSpace, Kraken has rapidly become the bioinformatics pipeline used to analyze metagenomics next generation sequencing results. However, SMART has a number of advantages over Kraken. SMART employs a scalable infrastructure that is not dependent on a common database and can distribute the workload across many computational nodes. In addition, many metagenomics samples come from host-rich environments and Kraken suffers from false positive identification of microbial organisms. In our study, when comparing the  (Table 4). In a direct comparison among Kraken, CLARK, and SMART using the same training RefSeq sequences and the same computing environment, CLARK and SMART were noted to have higher sensitivity and precision compared to Kraken at the species classification level. Without 1 basepair permutations, SMART was noted to have similar throughput to CLARK and with 1 basepair permutations, SMART was noted to have similar throughput to Kraken. SMART was noted to use significantly lower RAM compared to Kraken and CLARK. The main advantage of SMART appears to be utilizing a many-shard database approach to achieve horizontal scaling of a very large training set. While Kraken and CLARK have similar throughput and accuracy, they are unable to index a large training set that includes many mammalian, plant, fungal, and other protozoan organisms, both in the database construction phase and in analysis due to limitations in RAM. Since the sharded database can be loaded asynchronously in pieces, SMART can work in limited RAM environments without any changes to the algorithm by lowering the number of threads.
The exact k-mer matching approach has been used in several prior classification algorithms. SMART is similar to Kraken, CLARK, and LMAT in using exact k-mer matching for classification. However, SMART utilizes substantially lower RAM usage in the database construction phase by avoiding linking k-mers to a taxonomy tree and determining LCA. Unlike CLARK, SMART keeps all k-mers in the database and does not limit the search space by only keeping discriminatory k-mers. By using a  deterministic sharding scheme, SMART is able to handle the expanded search space by asynchronously loading shards of database and allows for scalability. While the matching approach is similar to prior algorithms, SMART scales efficiently in a many-CPU, many-node environment and allows for accessing the entire NCBI GenBank in a single classification step. Despite filtering human sequences in the conjunctival sample using BaseSpace prior to classification, Kraken (Table 4) and CLARK (Table 5) had many reads classified as bacterial or viral which were classified as mammalian by SMART. BLAST verified that the majority of these sequences were indeed mammalian. If an improved filtering step were implemented, or if Kraken or CLARK included mammalian genomes in their databases, their performance in host-rich metagenomics samples would have likely been improved. Unfortunately due to memory constraints on the database construction steps of both Kraken and CLARK, it was not possible for us to construct a database to include mammalian genomes in the evaluation databases for Kraken and CLARK. Inclusion of human and mammalian sequence filtering as an intrinsic component of the SMART protocol resulted in higher specificity of sequences assigned to non-host sources.
As the number of species sequenced grows, the NCBI GenBank will continue to expand, and the database shards used in this approach will also grow and consume more memory. At a certain point in the future each shard may consume too much memory and the database may need to be split with larger barcodes. However, computational infrastructure have also been growing in accordance to Moore's law [32] and with cheaper costs in computer memory, this tipping point may be further away.
While we only benchmarked this approach in a cluster-computing environment, this deep search technique could be easily translated to a cloud computing infrastructure [33,34]. These on-demand high-memory instances could be elastically created in parallel to handle each workload and destroyed after their use, allowing another layer of parallelization to occur. One limitation of the UW Hyak computing cluster that we faced was the relatively slow Input and Output (IO) performance of the network filesystem. In contrast, many of the cloud computing infrastructures are optimized for IO performance Fig. 5 Comparison of accuracy, throughput, and memory utilization among Kraken, CLARK, and SMART built from the same RefSeq sequences. a, b, c Sensitivity at the level of species, genus, and class for simulated datasets (HiSeq, MiSeq, and simBA5). d, e, f Precision at the level of species, genus, and class. Throughput (g) and memory utilization (h) of datasets with 16 parallel threads in the same computing environment and this approach may benefit from implementation and tuning in a cloud environment.
Further improvements in this approach are possible to increase the throughput. In particular, the generation of 1 basepair permutations of the query may benefit from further optimization and from another MapReduce step. In addition, higher throughput would be achieved with the recruitment of more computational nodes in the cluster. This approach would also be applicable to RNA-Seq data in identifying gene transcripts and pathogen RNA by using a similar approach to index all the cDNA data in the NCBI GenBank. In particular viral transcripts may be proportionally enriched both in the GenBank catalogue as well as in the biological samples.

Conclusions
We present the first scalable complete search approach to effectively classify metagenomics sequencing data using both exact 30-mer matching and 1 basepair permutations using the entirety of the NCBI GenBank. We anticipate this approach will be useful in identifying pathogens, characterizing complex microbiomes, and be extendable into labeling transcripts in RNASeq data.