DiScRIBinATE algorithm
The DiScRIBinATE approach takes the BLASTx output obtained for all the reads against a sequence database (e.g nr database) as the input [7]. The appropriate taxonomic level (TL) where the assignment of each read needs to be restricted is identified using the approach followed by SOrt-ITEMS [5]. This is done by analysing various alignment parameters obtained using the best hit for a given read. The DiScRIBinATE approach uses the same thresholds of these alignment parameters as used in SOrt-ITEMS [5]. Once TL is identified, the taxon name of the best hit is replaced by the taxon name at TL. The taxon names of other hits are also substituted with the respective taxa names that occur at TL. For example, if the TL of a read lies at the level of family, the taxon name obtained for the best hit (e.g Burkholderia ambifaria AMMD) is substituted by the name of the corresponding family (i.e, Burkholderiaceae). Similarly, taxon names of the remaining hits are also substituted with their corresponding taxa names that occur at the family level. These substituted names are used in all subsequent steps. Therefore, the taxonomic level of the final assignment of the read can only occur either at the TL or at levels above TL, i.e. to a taxon in the phylogenetic path from the TL to the root. For tracing the path between the root to the taxon name corresponding to each hit, the NCBI taxonomy tree has been utilized as the reference tree.
Once the taxa names corresponding to the hits are substituted with the corresponding taxa names occurring at their respective TLs, reads with only one hit are assigned to the substituted taxon/clade corresponding to the hit. If a read has two hits, it is assigned to the LCA of the taxa/clades corresponding to these two hits. However, for reads with three or more hits, the following steps are performed for the final assignment of the read. The flowchart illustrating the first three steps of the 'DiScRIBinATE' work-flow is given in Figure 1 and that of the last step (step 4) is given in Figure 2.
Step 1. Obtaining a set of candidate ancestors
The nodes of the NCBI taxonomic tree follow a typical parent-child hierarchy, wherein the descendant nodes (intermediate or terminal) are referred to as 'child nodes'. All child nodes unidirectionally descend from the 'root node'. Intermediate nodes, which lie in the path from the root to the child, are referred to as 'ancestor nodes' for the corresponding child node. The set of unique candidate ancestors corresponding to each read is progressively obtained by finding the LCA of each of the substituted taxa names as explained below.
Among all the hits obtained for a given read, first the LCA of the taxon names corresponding to the top two scoring hits is obtained. This LCA is identified as the first candidate ancestor. The hit with the next highest score is then added to the previous hits considered, and the combined LCA of the corresponding taxa names is obtained from these hits. If this LCA is not the same as the one obtained earlier, it is added to the existing set of candidate ancestors. In this way a set of candidate ancestors for a given read is obtained by traversing the entire list of hits till the last hit is reached.
Step 2. Obtaining distances between candidate ancestors and hits
In this step, distances are calculated between the candidate ancestors and the set of taxa corresponding to the hits. The level of a taxon in the NCBI taxonomy tree denotes the distance traversed in terms of the number of edges from the root to that taxon. This level information is used for obtaining distances between any two taxa. If the taxon corresponding to a hit is a child node of a candidate ancestor, then the distance between the candidate ancestor and that taxon is equal to difference of levels between them. If the taxon corresponding to a hit is not a child node of a candidate ancestor, then the LCA of the candidate ancestor and that taxon is obtained, and the distance between them is subsequently calculated using the following formula:
Step 3. Assignment of reads to the 'Most Probable Ancestor'
In this step, the 'bit-score/distance' ratio between a hit and a candidate ancestor is obtained by dividing the bit-score of the hit by the distance of its corresponding taxon from the candidate ancestor (obtained as described in the previous step). These ratios are obtained considering all the hits and the candidate ancestors. For each candidate ancestor, the 'bit-score/distance' ratios obtained with each of the hits are then summed. The candidate ancestor with the highest sum, referred to as the 'Most Probable Ancestor' (MPA), is selected, and the read is assigned to the taxon corresponding to this candidate ancestor.
Step 4. Re-classifying assignments to isolated taxa
The objective of this step is to reduce the number of misclassified sequences. For this purpose, all assignments to isolated taxa are re-analyzed using the work-flow illustrated in Figure 2. Isolated taxa are defined as those taxa to which the number of reads assigned is less than a given threshold. This threshold is hereafter referred to as 'Isolated Taxon Size'.
In this step, reads assigned to each isolated taxon are reassigned to the taxon corresponding to their immediate phylogenetic ancestor, if the cumulative number of reads assigned to the latter taxon is greater than the Isolated Taxon Size. If this cumulative number does not exceed the Isolated Taxon Size, the above mentioned step is iterated for progressively higher taxonomic levels (up to the level of super-kingdom). This iteration is performed till an ancestor taxon is found for which the cumulative number of reads assigned exceeds the Isolated Taxon Size. However, if the cumulative number of reads for any of the ancestor taxa does not exceed the 'Isolated Taxon Size', reads assigned to the isolated taxon are categorized as 'unassigned'.
In the present study, an 'Isolated Taxon Size' of 300 or 1% of the total number of reads (whichever is less) was used. A value of 300 was used keeping in mind the presence of rare organisms in typical metagenomic samples. This value ensured that reads originating from rare organisms or organisms with small genome sizes are not unnecessarily reassigned at non-specific taxonomic levels. For example, the value of 300 ensures that an organism having a genome length as low as 0.6 Mb and also with extremely low coverage (as low as 0.5X, 0.2X, 0.13X and 0.05X, for read lengths of 1000 bp, 400 bp, 250 bp and 100 bp, respectively) is not picked up as an isolated taxon.
Consequently, reads belonging to this organism are not re-assigned at non-specific taxonomic levels during the reclassification step.
Strategy for evaluating the binning time
In order to evaluate the binning time achieved using the 'bit-score/distance ratio approach' (adopted by DiScRIBinATE) as compared to the 'orthology based approach' (used by SOrt-ITEMS) and standard LCA-based approach (used by MEGAN), the time taken by DiScRIBinATE, SOrt-ITEMS and MEGAN for binning 100000, 200000, 500000 and 1000000 sequences was determined. These tests were performed on a work station with Intel(R) Xeon(R) CPU, 1.86 GHz processor.
Data sets used for evaluating binning accuracy and specificity
Thirty-five completely sequenced bacterial genomes belonging to diverse taxonomic clades were downloaded from NCBI website ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. Using MetaSim [8], four data sets were created namely, Sanger, 454-400, 454-250 and 454-100. Each data set consisted of 35,000 reads (1000 reads from each genome) simulating the typical read lengths and errors models associated with Sanger (read lengths centred round 800 base pairs), 454-Titanium (400 bp), 454-FLX (250 bp), and 454-GS20 (100 bp) respectively.
Database variants used for evaluating binning accuracy and specificity
To validate the performance of DiScRIBinATE, with respect to the assignment of reads originating from 'known' as well as 'unknown' organisms, input reads were queried against the following three variants of the nr database.
a. 'Species unknown'
nr database where sequences belonging to the query species are absent. This scenario mimics a scenario wherein the read sequences belong to an unknown species.
b. 'Genus unknown'
nr database where sequences belonging to the query genus are absent. This scenario simulates a situation wherein read sequences originate from an unknown genus.
c. 'Family unknown'
nr database where sequences belonging to the query family are absent. This scenario simulates a situation wherein read sequences originate from an unknown family.
Categorization of taxonomic assignments
To quantify the accuracy and specificity of assignment of reads, the results obtained for the four validation data sets against the three database variants (mentioned above) were categorized as follows:
a. Correct assignments
Assignment of a read to a taxon that lies in the path between the root and the taxon corresponding to the source organism of the read was categorized as 'correct assignment'. For example, if the read originated from Burkholderia ambifaria AMMD, then its assignment to any of the taxa mentioned below was categorized as a correct assignment.
root; cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Burkholderiaceae; Burkholderia; Burkholderia cepacia complex; Burkholderia ambifaria; Burkholderia ambifaria AMMD.
To quantify the specificity, the 'Correct assignments' were sub-grouped into the following categories:
1. Higher (or Non-Specific) level: Correct assignments of the reads at the level of root or cellular organisms or super-kingdom.
2. Intermediate levels: Correct assignments of the reads at the level of phylum or class or order.
3. Specific levels: Assignments of the reads at the level of family or below.
b. Wrong assignments
Assignment of a read to a taxon that does not lie in the path between the root and the taxon corresponding to the source organism of the read was categorized as 'Wrong assignment'.
c. Unassigned
Those reads for which none of the hits had a bit-score greater than or equal to 35 and/or an alignment length of greater than 25 were classified as 'Unassigned'.
d. No hits
All reads with no BLAST hits were categorized as 'No hits'.
Results obtained using DiScRIBinATE were categorized into the above classes and were compared with corresponding results generated by the SOrt-ITEMS and the MEGAN program. Both SOrt-ITEMS and MEGAN were run using a "minimum bin-size" threshold of two. Results obtained for DiScRIBinATE after applying the reclassification step were compared with those obtained with SOrt-ITEMS and MEGAN.
Validation results
Comparative evaluation of binning time
As shown in Figure 3, DiScrIBinATE takes just half the time (similar to MEGAN), as compared to SOrt-ITEMS, for binning an equivalent number of reads. This is expected since the 'bit-score/distance ratio' approach adopted in DiScRIBinATE is an alignment-free method and involves simple mathematical calculations, as against the orthology step (involving alignment of sequences) used in SOrt-ITEMS.
Comparative evaluation of binning accuracy and specificity
Figure 4 and Additional file 1 summarize results obtained by DiScRIBinATE, MEGAN and SOrt-ITEMS for all four validation data sets using three database variants. For the current validation study, the three programs were run with the Minimum Bit-Score threshold of 35. SOrt-ITEMS and MEGAN were run with a minimum bin size (min-support) value of two.
As seen from Figure 4, the percentage of 'correct' assignments is consistently higher for DiScRIBinATE as compared to SOrt-ITEMS and MEGAN (Figure 4A, D, G). Moreover, this pattern is observed to be consistent across all the data sets and database variants. While, DiScRIBinATE has the lowest misclassification rate across all data sets and database variants, MEGAN has the highest percentage of reads being misclassified (Figure 4B, E, H). On the other hand, percentage of reads being categorized as 'Unassigned' is roughly similar for all three programs across all data sets/database scenarios (Figure 4C, F, I). The detailed results of assignments to various bin categories are summarized below:
Correct assignments
The percentage of reads correctly assigned by DiScRIBinATE is seen to be 2-18% higher as compared to SOrt-ITEMS, and 11 - 30% higher as compared to MEGAN. Interestingly, this quantum of difference becomes higher and noticeable as sequences belonging to progressively higher clade levels are removed from the reference database (e.g. results with 'Genus/Family unknown' database variants). Given the composition of typical metagenomic data sets (where majority of sequences originate from hitherto unknown taxa/clades), results obtained with DiScRIBinATE (especially with Genus/Family unknown database variants) indicate its suitability for use with metagenomic sequence data sets.
Specific levels
Results (Additional file 1) indicate that the percentage of reads assigned by DiScRIBinATE and SOrt-ITEMS at specific levels is higher than that of MEGAN for the 454 data sets/database variants. However, MEGAN assigns a higher percentage of reads in the Sanger data set at specific levels as compared to DiScRIBinATE and SOrt-ITEMS. The likely reason for the latter observation is the following. Since, the Sanger data set has reads with a higher length range (600-1100 bp), the hits generated for these reads generally have higher bit-scores. Besides, it is observed that the best hits (obtained for a read) generally have a bit-score which is relatively much higher than the remaining hits (obtained for that read). Applying the top percent criteria (of MEGAN) therefore results in very few hits remaining for analysis. Consequently, the LCA tends to converge at relatively specific taxonomic levels. This resulting in higher specificity for some of the assignments.
However, it should be noted that higher bit-scores do not necessarily correspond to higher alignment quality. Consequently, MEGAN's approach is specific and accurate only in cases where the best hits are from a correct taxon/clade. However, in typical metagenomic scenarios, majority of query sequences originate from hitherto unknown organisms. In such scenarios, the best hits for a majority of query sequences are usually from an incorrect taxon/clade. Using MEGAN's LCA approach in such scenarios will naturally lead to incorrect assignments. This is evident from our results (see later sections) wherein MEGAN, in spite of having a higher specificity in the case of SANGER reads, has a mis-classification rate which is 3 - 30 times higher than DiScRIBinATE and SOrt-ITEMS.
It is also observed that the binning specificity of DiScRIBinATE is slightly lower as compared to SOrt-ITEMS. This marginal decrease of specificity (1-2%) can be attributed to the reclassification step used in DiScRIBinATE, wherein assignments to isolated taxa are moved to relatively higher taxonomic levels.
Intermediate levels
The percentage of reads assigned by DiScRIBinATE at intermediate levels is around 4-10% higher than SOrt-ITEMS, and around 9-40% higher than MEGAN (Additional File 1). This is expected since the reclassification step in DiScRIBinATE essentially reassigns reads (those which were earlier assigned to isolated taxa) at progressively higher taxonomic levels. Since assignments to such isolated taxa are usually found to be false positive assignments, the reclassification step facilitates their reclassification (albeit at slightly higher taxonomic levels), instead of discarding them as 'unassigned'.
Higher levels
Except for the 454-100 data set, it is seen that MEGAN assigns a higher percentage of reads at higher levels, as compared to DiScRIBinATE or SOrt-ITEMS. As compared to SOrt-ITEMS, DiScRIBinATE is seen to assign a marginally higher percentage of reads at higher levels. This marginal increase is a likely result of the reclassification step adopted by DiScRIBinATE.
Wrong assignments
As previously mentioned, the misclassification rate of DiScRIBinATE is significantly lower than both SOrt-ITEMS and MEGAN. Results indicate that MEGAN has the highest misclassification rate. Interestingly, as the lengths of the input reads become smaller, there is a progressive increase in the misclassification rate by SOrt-ITEMS and MEGAN as compared to that by DiScRIBinATE. Overall, the misclassification rate of DiScRIBinATE is seen to be around 3 - 7%, which is 1.5 - 3 times lower as compared to that by SOrt-ITEMS, and 3 - 30 times lower as compared to that by MEGAN. The above observations demonstrate the immense utility of the reclassification step in reducing the percentage of incorrect assignments.
Unassigned
The percentage of reads categorized as 'unassigned' by all the three methods is comparable across all the data sets/database variants (Figure 4, Additional File 1).