CLAST accuracy evaluation by comparison with the Smith–Waterman algorithm
To measure the search accuracy of CLAST, we compared the output of BLAST 2.2.25 [4], BLAT 34 [5], and CLAST against that of SSEARCH version 36.3.6 [7] (hereafter referred to as the accuracy test). We included only BLAST and BLAT in the accuracy test because these two tools are widely used in metagenomic analyses (e.g. MEGAN, a commonly used taxonomic and functional assignment tool for metagenomics, uses BLAST results for its taxonomic and functional assignment [17]; MG-RAST, a commonly used web service for metagenomic analysis, uses BLAT for its sequence similarity analyses [18]). This comparison consisted of six phases. First, we obtained the reference genome sequences of all bacteria and archaea in the National Center for Biotechnology Information (NCBI) RefSeq Genome database (October 2011, 4.3 GB, 2,314 sequences) [19] that were completely sequenced and had full taxonomic information. Second, we created two query sets (100-base test: 10,000 reads of 100 bases as simulated Illumina reads; 800-base test: 10,000 reads of 800 bases as simulated 454 reads) by randomly retrieving 100-base and 800-base sequence fragments from the 2,314 reference genome sequences. Third, these query sets were searched against the reference genome sequences using SSEARCH, BLAST, BLAT, and CLAST. Fourth, we removed hits from the results of each alignment tool if the assigned region and the query were from the same reference genome sequence. This step makes the result equivalent to a search against a reference sequence database that lacks the genome from which the query originated. Fifth, we selected the best non-self hit from the results of each tool, using the scoring criterion appropriate to that tool. Sixth, BLAST, BLAT, and CLAST were considered to have accurately found a hit when they reported the same hit and alignment position as SSEARCH.
This accuracy test was performed on a desktop computer with an Intel Xeon X5670 6 core 2.93 GHz CPU, 48 GB main memory, and two NVIDIA Tesla C2050 GPUs.
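Phases four to six of the accuracy test can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the `Hit` record, its field names, and the assumption that a higher score is always better are hypothetical simplifications, since each tool was ranked by its own scoring criterion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical hit record; field names are illustrative only."""
    query_id: str       # read identifier
    source_genome: str  # genome the read was sampled from
    target_genome: str  # genome in which the hit was found
    start: int          # alignment start on the target
    end: int            # alignment end on the target
    score: float        # tool-specific score (assumed: higher is better)


def best_non_self_hit(hits: list[Hit]) -> Optional[Hit]:
    """Drop hits to the read's own genome, then keep the top-scoring one."""
    non_self = [h for h in hits if h.target_genome != h.source_genome]
    return max(non_self, key=lambda h: h.score, default=None)


def agrees_with_reference(tool_hit: Optional[Hit], ref_hit: Optional[Hit]) -> bool:
    """True when the tool reports the same target and position as the reference."""
    if tool_hit is None or ref_hit is None:
        return False
    return (tool_hit.target_genome, tool_hit.start, tool_hit.end) == (
        ref_hit.target_genome, ref_hit.start, ref_hit.end)


if __name__ == "__main__":
    # Toy data: one read sampled from genome gA.
    ssearch_hits = [Hit("r1", "gA", "gA", 10, 110, 200.0),
                    Hit("r1", "gA", "gB", 5, 105, 150.0)]
    tool_hits = [Hit("r1", "gA", "gA", 10, 110, 190.0),
                 Hit("r1", "gA", "gB", 5, 105, 140.0)]
    print(agrees_with_reference(best_non_self_hit(tool_hits),
                                best_non_self_hit(ssearch_hits)))
```

In this toy case the self hit to gA is discarded for both tools, and the remaining best hit to gB coincides in target and position, so the tool would be counted as accurate for this read.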
Results of comparison of search accuracy
In both the 100- and 800-base accuracy tests, the search accuracy of CLAST was comparable to that of BLAST, both when bit scores were high (>90 in the 100-base test, >200 in the 800-base test) and when they were low (<90 in the 100-base test, <200 in the 800-base test). The search accuracy of CLAST was greater than that of BLAT in nearly all cases (Figure 5 and Additional file 4).
Evaluating speed, sensitivity, and accuracy of taxonomic assignments
Large-scale metagenomic analyses generally depend on aligning each read against reference genomes to assign a taxonomy to the read. Therefore, we designed a simulated metagenomic analysis test to evaluate the sensitivity and accuracy of the taxonomic assignments as well as the calculation time.
The simulated metagenomic analysis test consisted of six phases. First, we created two query sets (100,000 reads of 100 or 800 bases) from 2,314 reference genome sequences as in the accuracy test. Second, we searched for similar regions between each query sequence and the reference genome sequences. Third, we removed hits from the results if the assigned region and query were from the same reference genome sequence. Fourth, we selected the best non-self hits from the result of each tool, as in the previous accuracy test. Fifth, taxonomic assignment of the query sequences was performed using taxonomy of the best non-self hits (Figure 6A). Finally, we counted the number of query sequences that had similar regions in any reference genome (total reported hits) and the number of queries that were correctly taxonomically assigned (correct genus assignments) (Figure 6B). We compared the number of total reported hits, the number of correct genus assignments, and the correct genus assigned ratio (CGA ratio; number of correct genus assignments/number of total reported hits) among the tools tested. Total reported hits is a measure of the alignment tool sensitivity, and CGA ratio is a measure of the accuracy of taxonomic assignment. Since more sensitive similarity search tools will detect weaker similarity, and consequently will have a greater number of total reported hits, these tools are more useful for motif searching (Figure 6).
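The three quantities compared among the tools follow directly from the per-read assignments. The sketch below is illustrative only: the function name, the dictionary layout, and the toy genus labels are assumptions and are not taken from the study.

```python
def cga_summary(true_genus, assigned_genus):
    """Return (total reported hits, correct genus assignments, CGA ratio).

    true_genus:     read id -> genus of the genome the read was sampled from
    assigned_genus: read id -> genus of the best non-self hit; reads for which
                    the tool reported no hit are simply absent
    """
    total = len(assigned_genus)
    correct = sum(1 for read, genus in assigned_genus.items()
                  if true_genus.get(read) == genus)
    ratio = correct / total if total else float("nan")
    return total, correct, ratio


if __name__ == "__main__":
    # Toy example: four reads, three of which received a hit.
    truth = {"r1": "Escherichia", "r2": "Bacillus",
             "r3": "Vibrio", "r4": "Bacillus"}
    assigned = {"r1": "Escherichia", "r2": "Clostridium", "r3": "Vibrio"}
    total, correct, ratio = cga_summary(truth, assigned)
    print(total, correct, f"{ratio:.0%}")
```

Here the tool reports hits for three of four reads (sensitivity) and assigns two of those three to the correct genus, giving a CGA ratio of two thirds. A read without any hit lowers the total reported hits but does not affect the CGA ratio.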
In the simulated metagenomic analysis test, we compared CLAST with other similar tools, namely BLAST 2.2.25, BLAT 34, FR-HIT v0.6, Burrows–Wheeler Aligner (BWA)/BWA-SW 0.5.9, Bowtie 2 2.0.4, and G-BLASTN 1.1, which depends on BLAST 2.2.28+ [4]-[6],[20]-[22]. G-BLASTN was separately compared with CLAST because G-BLASTN was designed for the NVIDIA Kepler architecture GPU. The default command line options were used for each alignment tool tested (Additional file 5). BWA/BWA-SW 0.5.9, Bowtie 2 2.0.4, and BLAT 34 cannot handle databases larger than 4 GB [23]-[25]. Therefore, we separated the reference genome sequences into three sets for testing these programs. Similarity search results from the three sets were merged for comparison with the results from BLAST, CLAST, and FR-HIT. The best non-self hits were selected using CIGAR code and MD tag (BWA), E-value (FR-HIT), and alignment score (Bowtie 2, BWA-SW, BLAST, BLAT, and CLAST). The simulated metagenomic analysis test (except for G-BLASTN) was performed on the same desktop computer as the accuracy test.
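Splitting the reference sequences into size-limited sets and merging the per-set results can be sketched as follows. How the three sets were actually formed is not described here, so the greedy grouping, the function names, and the rule of keeping the highest score per query are all assumptions made for illustration.

```python
def split_by_size(seq_lengths, limit):
    """Greedily group sequences so each group's total length stays <= limit.

    seq_lengths: iterable of (sequence name, length in bases) pairs
    """
    groups, current, current_size = [], [], 0
    for name, length in seq_lengths:
        if length > limit:
            raise ValueError(f"{name} alone exceeds the size limit")
        if current and current_size + length > limit:
            groups.append(current)          # close the group that is full
            current, current_size = [], 0
        current.append(name)
        current_size += length
    if current:
        groups.append(current)
    return groups


def merge_best(per_set_results):
    """Merge per-set {query: (target, score)} dicts, keeping the top score."""
    merged = {}
    for results in per_set_results:
        for query, (target, score) in results.items():
            if query not in merged or score > merged[query][1]:
                merged[query] = (target, score)
    return merged


if __name__ == "__main__":
    print(split_by_size([("a", 3), ("b", 3), ("c", 2), ("d", 4)], limit=6))
    print(merge_best([{"q1": ("g1", 50)},
                      {"q1": ("g2", 70), "q2": ("g3", 10)}]))
```

Merging by score assumes that scores from separate searches are directly comparable, which holds for raw alignment scores under identical scoring parameters.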
Results of comparison of calculation time between CLAST and other tools
In the 100-base test, Bowtie 2 (global mode) was the fastest tool, followed by Bowtie 2 (local mode), BWA, CLAST (global mode), CLAST (local mode), BLAT, FR-HIT (both global and local modes), and BLAST. CLAST (global mode) was 72.6 times faster than BLAST. CLAST (local mode) speed was comparable to CLAST (global mode) and 2.35 times faster than BLAT.
In the 800-base test, Bowtie 2 (global mode) and CLAST (both global and local modes) were the fastest tools. The calculation time of CLAST was comparable to Bowtie 2 (global mode) and faster than BWA-SW (Figure 7). CLAST (global mode) was 9.64 and 80.8 times faster than BLAT and BLAST, respectively.
Results of comparison of similarity search sensitivity and accuracy of taxonomic assignment
In the 100-base test, the highest number of total reported hits (highest sensitivity) was obtained with BLAST (Figure 8 and Additional file 6), followed by FR-HIT (local mode), CLAST (local mode), and the remaining tools. In the 800-base test, the highest number of total reported hits was also obtained with BLAST; however, CLAST (local mode) obtained nearly as many total reported hits, whereas the other tools obtained lower numbers. These results indicate that, for both read lengths, BLAST, FR-HIT (local mode), and CLAST (local mode) achieved high sensitivity, and that CLAST (local mode) is sensitive enough to map metagenomic reads to reference genome sequences.
For both read lengths, Bowtie 2 (global mode), CLAST (global mode), and FR-HIT (global mode) achieved higher CGA ratios (greater accuracy) than the other tools. In the 100-base test, BWA and Bowtie 2 (global mode) achieved very high CGA ratios (98 and 96%, respectively), but these algorithms produced fewer total reported hits than the other tools. In the 800-base test, the number of total reported hits from BWA-SW was 0.96 and 1.16 times that from CLAST (global mode) and FR-HIT (global mode), respectively. However, the CGA ratio of BWA-SW was much lower than that of FR-HIT (global mode) and CLAST (global mode) (Figure 8B) because the number of incorrect genus assignments of BWA-SW was 1.58 and 3.79 times that of CLAST (global mode) and FR-HIT (global mode), respectively (Additional file 6). Similarly, the number of total reported hits of Bowtie 2 (local mode) was 1.10 and 1.33 times that of CLAST (global mode) and FR-HIT (global mode), respectively, but the number of incorrect genus assignments of Bowtie 2 (local mode) was 1.80 and 4.31 times that of CLAST (global mode) and FR-HIT (global mode), respectively (Additional file 6). These results indicate that global alignment is useful for the purpose of taxonomic assignment. BWA, Bowtie 2 (global mode), FR-HIT (global mode), and CLAST (global mode) are able to assign reads to taxonomic groups with reasonably high accuracy. Among these four tools, CLAST (global mode) and FR-HIT (global mode) in particular achieved not only high accuracy of taxonomic assignment but also moderate search sensitivity (Figure 9). The accuracy of taxonomic assignment of CLAST (global mode) and FR-HIT (global mode) exceeded that of Bowtie 2 (local mode) and BWA-SW, and the search sensitivity of CLAST (global mode) and FR-HIT (global mode) exceeded that of Bowtie 2 (global mode) and BWA.
In addition, by changing the identity threshold and the coverage threshold, the relationships between total reported hits and correct genus assignments of BLAST, BLAT, and CLAST (both global and local modes) were plotted as curves (Figure 10 and Additional file 7). Although the curves of all tools resembled each other, the curve of CLAST (global mode) was slightly higher than that of the other tools in the 100-base test. The point for Bowtie 2 (global mode) was close to that of CLAST (global mode) with a 90% identity threshold in both tests. The point for Bowtie 2 (local mode) had a lower CGA ratio than the curves of BLAST, BLAT, and CLAST (global mode) in the 100-base test, and was close to the curve of BLAT in the 800-base test.
Calculation time using multiple GPUs
We ran CLAST on one, two, and eight GPUs with actual metagenomic reads to investigate the effect of the number of GPUs on the calculation time. The reference genome sequences were the same as those used in the simulated metagenomic analysis test. The query sequences, which are Illumina Genome Analyzer IIx reads from a human gut microbial community, were obtained from Qin et al. [2] (NCBI SRA accession number ERR011343; 75 bp, 21,739,219 reads). For this test, we used a 4-node GPU server. Each node had an Intel Xeon X5690 6 core 3.47 GHz CPU, 64 GB main memory, and two NVIDIA Tesla C2075 GPUs.
Results of calculation time using multiple GPUs
The total similarity search calculation time of CLAST with the real metagenomic reads was approximately inversely proportional to the number of GPUs (Figure 11). With only one GPU, the calculation times of CLAST were 355 min (global mode) and 373 min (local mode). With two GPUs, the calculation times were 188 min (global mode) and 192 min (local mode), and with eight GPUs, they were 49 min (global mode) and 50 min (local mode). This result indicates that CLAST can be greatly accelerated by using multiple GPUs.
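The scaling behaviour can be made explicit by converting the reported times into speedup and parallel efficiency relative to one GPU. The short script below is only a worked calculation on the times given above; the function and variable names are illustrative.

```python
# Wall-clock minutes reported for the real metagenomic reads,
# keyed by alignment mode and then by number of GPUs.
REPORTED_MINUTES = {
    "global": {1: 355, 2: 188, 8: 49},
    "local": {1: 373, 2: 192, 8: 50},
}


def scaling_table(minutes_by_gpus):
    """Map GPU count -> (speedup over one GPU, parallel efficiency)."""
    base = minutes_by_gpus[1]
    table = {}
    for n, minutes in sorted(minutes_by_gpus.items()):
        speedup = base / minutes
        table[n] = (speedup, speedup / n)
    return table


if __name__ == "__main__":
    for mode, minutes in REPORTED_MINUTES.items():
        for n, (speedup, eff) in scaling_table(minutes).items():
            print(f"{mode:6s} {n} GPU(s): "
                  f"speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

With the reported times, eight GPUs give a speedup of roughly 7.2 (global mode) and 7.5 (local mode) over a single GPU, which corresponds to a parallel efficiency slightly above 90%.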
Comparison with G-BLASTN
In addition to comparing CLAST with CPU-based tools, we compared the speed, sensitivity, and accuracy of CLAST taxonomic assignments with those of G-BLASTN (the BLAST algorithm optimized for Kepler architecture GPU computing). The dataset and the analysis pipeline for the comparison with G-BLASTN were the same as those of the simulated metagenomic analysis test. We used a workstation with two Intel Xeon E5-2687W 8 core 3.10 GHz CPUs, 62.9 GB main memory, and two NVIDIA Tesla K20m GPUs (hereafter referred to as the two K20 machine). If the CLAST algorithm achieved the same speed as G-BLASTN, G-BLASTN (default settings) would be approximately twice as fast as CLAST (default settings) on the two K20 machine, because G-BLASTN automatically uses all available GPUs whereas one CLAST process uses only the one specified GPU. We compared CLAST with G-BLASTN run in the megablast mode (designed to identify only similar sequences) and in the blastn mode (command line parameters are -use_gpu true -outfmt 6 -task megablast and -use_gpu true -outfmt 6 -task blastn, respectively).
Results of comparison with G-BLASTN
In the simulated metagenomic analysis test, the G-BLASTN (blastn mode) analysis took 15,970 s when the query length was 100 bases, and 136,560 s when the query length was 800 bases, on the two K20 machine. In contrast, CLAST took 210 s (global mode) and 215 s (local mode) for the 100-base query length, and 1,248 s (global mode) and 1,352 s (local mode) for the 800-base query length on the same machine. In other words, CLAST was 150–200 times faster than G-BLASTN (blastn mode). Furthermore, G-BLASTN (megablast mode) took 199 s when the query length was 100 bases, and 724 s when the query length was 800 bases. Thus, CLAST was 1.07–1.85 times faster than G-BLASTN (megablast mode). These results suggest that CLAST is much faster than G-BLASTN (blastn mode) and slightly faster than G-BLASTN (megablast mode).
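The speed ratios can be reproduced from the wall-clock times if the comparison is normalized by the number of GPUs used, with G-BLASTN running on both GPUs and one CLAST process on a single GPU. That normalization, and the use of the CLAST local-mode times, are assumptions inferred from the arithmetic rather than stated explicitly; the function name is hypothetical.

```python
def per_gpu_speedup(t_other, t_clast, gpus_other=2, gpus_clast=1):
    """Ratio of GPU-seconds: (time x GPUs) of the other tool over CLAST's."""
    return (t_other * gpus_other) / (t_clast * gpus_clast)


# Reported wall-clock seconds on the two K20 machine, keyed by query length.
G_BLASTN_BLASTN = {100: 15970, 800: 136560}
G_BLASTN_MEGABLAST = {100: 199, 800: 724}
CLAST_LOCAL = {100: 215, 800: 1352}

if __name__ == "__main__":
    for length in (100, 800):
        blastn = per_gpu_speedup(G_BLASTN_BLASTN[length], CLAST_LOCAL[length])
        mega = per_gpu_speedup(G_BLASTN_MEGABLAST[length], CLAST_LOCAL[length])
        print(f"{length}-base: vs blastn {blastn:.0f}x, vs megablast {mega:.2f}x")
```

Under these assumptions the ratios against the blastn mode are about 149 and 202, and those against the megablast mode are 1.85 and 1.07, matching the reported ranges. In raw wall-clock time, without the per-GPU normalization, G-BLASTN (megablast mode) finished sooner than CLAST in both tests.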
The total reported hits and correct genus assignments of G-BLASTN (blastn mode) were 99,841 and 56,151, respectively (CGA ratio: 58%), when the query length was 100 bases. The total reported hits and correct genus assignments of G-BLASTN (blastn mode) were 100,000 and 62,728, respectively (CGA ratio: 63%), when the query length was 800 bases. Thus, G-BLASTN (blastn mode) performed similarly to BLAST in the simulated metagenomic analysis test. This result suggests that CLAST (local mode) can detect as much information as G-BLASTN (blastn mode) when the query length is 800 bases.
The total reported hits and correct genus assignments of G-BLASTN (megablast mode) were 46,720 and 42,664, respectively (CGA ratio: 91%), when the query length was 100 bases. The total reported hits and correct genus assignments of G-BLASTN (megablast mode) were 65,108 and 52,754, respectively (CGA ratio: 81%), when the query length was 800 bases. Thus, G-BLASTN (megablast mode) was similar to Bowtie 2 (local mode) in the simulated metagenomic analysis test. This result showed that the accuracy of taxonomic assignments of CLAST (global mode) is greater than that of G-BLASTN (megablast mode) and that the sensitivity of CLAST (local mode) is greater than that of G-BLASTN (megablast mode).