We have analysed the most commonly used workflows applied for bacterial diversity studies including simple, one-tool, workflows and tool combinations, and compared them for accuracy, results variability and efficiency under a variety of conditions representative of Illumina-based, 454-based, near full-length and differential diversity studies.
While more complex workflows can be devised (e. g. using RDP web alignments or ESPRIT distance calculations or clustering as optional combinatorial steps), these are rarely reported except when evaluating new algorithms [23, 26] and are inconvenient for routine use. The most relevant patterns from the wealth of information generated by our analysis are highlighted below.
The relevance of initial data quality to the reliability of the results has been repeatedly noted, leading to the recommendation of filtering raw data according to quality, removal of chimeras and experimental noise, and ensuring all sequences correspond to the same region of 16S rDNA. This last step usually relies on the selection of experimental primers, but may sometimes benefit from the additional removal of non-overlapping sequences ("trump" filtering in Mothur). This step may arguably remove important elements in an alignment (e. g. if terminal gaps are significant), but it was the only way to obtain sensible results with the datasets from Quince et al. . However, its use with our artificially mutated sequences resulted in useless alignments, showing that this option must be applied with caution.
When a large number of short length sequences are used , it is easy to find many individuals with the same sequence. Identifying and using unique sequences exerts a major impact on reducing computation time, especially when dealing with modern vast datasets (in the hundreds of thousands or millions of sequences).
ESPRIT, which uses pairwise alignments, generally produces more accurate results than workflows relying on MSAs. Pairwise alignments; however, do not consider likely evolutionary relationships or structural properties, which is why MSA tools are often preferred. Alignments based on a reference alignment are shown to be very useful, while improving quality, as can already be seen from the RDP pipeline results; however, we observe that using MAFFT or MUSCLE to obtain ab initio MSAs gave better results than Mothur reference-guided alignments for short sequence lengths. This is most evident with our engineered mutated datasets, and is likely due to reference guided alignments selecting different reference sequences for related mutants, leading to an MSA where related sequences include more variability than actually exists. Hence, Mothur MSAs should always be filtered to remove redundant gaps if other tools are to be used for distance calculation . MAFFT and MUSCLE frequently identify the mutation correctly and produce more accurate alignments, as can be seen by inspecting the aligned stacked dataset. This advantage is diluted as sequence length increases (datasets from Quince, Hao and Youssef), since the availability of more information increases the accuracy of reference identification.
Another major difference is related to the computational time required for constructing the alignments. Both MAFFT and MUSCLE have a well-defined behaviour: MAFFT time complexity is O(N2L) , N being the number of sequences and L their length, and MUSCLE uses multiple iterations with a cost of O(NL+L2) per iteration; however, ESPRIT, CROP and Mothur compute more expensive Needleman-Wunsch alignments (cost O(L1·L2), or O(L2) for equal-length sequences) and rely on a previous filtering step to reduce N, the number of sequences to be compared. ESPRIT and CROP require (N·(N-1)) comparisons to build the distance matrix; one may expect ESPRIT k-mer filtering or CROP unique identification step to become less efficient for longer, relatively distant sequences. In the case of Mothur only N comparisons -each sequence against its reference, selected using one of various filtering methods (k-mer, blastn and suffix trees)- are needed. Additional file 2, Table S2 summarises the computational time needed for each step in the diversity analysis.
Summarizing, ESPRIT works better for short and closely related sequence reads and is usually very fast; however, it is much slower when tens of thousands of sequences are considered. MUSCLE is preferred for short sequence lengths, but its long running time puts it at a disadvantage when sequences are long, deficiently grouped, or too many are used. Mothur generally achieves the fastest alignments when using a reference database, although there are concerns with the relative merits of this approach [19, 35]. In the case of long sequences, MAFFT offers a very good balance between speed and quality.
Special mention must be made for MSAs of huge datasets in the order of hundreds of thousands of sequences: in these cases, MUSCLE is unable to directly align the data and requires a previous "clumping" and filtering step with USEARCH to guide the final alignment, or selection of unique sequences prior to alignment. In any case, subsequent processing makes it advisable to favour working with unique sequences or a pre-clustering step when available, both to reduce computation time and produce manageable datasets. Regarding computation time, this is the most expensive step, but only becomes a real issue with modern computer architectures when dealing with very large sequences (which is not usually the case in diversity studies) or huge numbers of sequences. In the last case, the selection of unique sequences and/or preclustering helps to reduce computation to acceptable times (from days or even months down to hours). In the case of MUSCLE it may additionally be required to reduce the number of iterations and increase the maximum memory used (--maxiter 2 --maxmb 10000) (Additional file 2, Table S2).
It has been argued that the direct calculation of distances from pairwise comparisons yields better distance matrices than distances calculated from MSAs, and indeed one expects less gaps in a pairwise than a multiple alignment, as reflected in the ESPRIT package, which tends to produce the lowest number of OTUs, in agreement with earlier findings [20, 25]. However, the computational cost increases with the number of sequences, making it more attractive to use MSA-based methods for large datasets (Additional file 2, Table S2). We see that both Jukes-Cantor corrected distances and Mothur distances give consistent and sensible results. Mothur distances are inclined to produce lower diversity and MAFFT distances tend to give 30-35% overestimations in the number of OTUs. The behaviour of MAFFT distance calculation is explained by its k-mer-based approach, where few changes in short sequences will have a bigger impact on overall distance than similar changes in longer ones.
In conclusion, distances derived from pairwise alignments produce better results, but may become inconvenient for long or large numbers of sequences. In these cases, Mothur distances produce lower OTU counts than Jukes-Cantor. K-mer-based distances are not currently recommended.
It is worth noting here that distance matrices inherently grow in size with the number of sequences squared. This may become a problem with modern datasets consisting of hundreds of thousands or millions of sequences, where terabyte-sized matrices may be generated. Thus, methods that avoid generating distance matrices may be more advisable for this data.
New versions of other popular alignment methods tuned for NGS analysis are expected soon (Higgins, D, personal communication), and it will be interesting to test them as soon as they become available.
We have compared the results obtained with different clustering strategies: CROP, ESPRIT, Mothur, USEARCH and RDP. To simplify the study, we did not consider other clustering tools such as CD-HIT-454, which uses heuristics that are too stringent for richness analysis . It is difficult to compare the different clustering strategies as ESPRIT, CROP, USEARCH and RDP have only been considered in their own workflow, except for the fact that Mothur is the easiest to integrate in complex workflows and offers a wider range of clustering options (average, furthest and nearest neighbour). However, Mothur failed to process the huge, terabyte-sized, distance matrices derived from full MSAs of the extensive dataset , forcing us to reduce information content by including only unique sequences in the MSA. Regarding the results obtained with CROP, while proposing a new, promising and rigorous approach, it was unable to complete the analyses or produced anomalous results in several cases. This may be due to CROP being a new method that is still actively refined.
We have also considered variability in the results of the clustering algorithms. There are two sources of variability. One is due to the sub sampling required to build rarefaction curves and estimate actual diversity. The other source depends on algorithmic randomness, and after Studentized correction, the results from 20 repeated analyses show that ESPRIT consistently produces the same results, Mothur produces a slight variability, depending on the analysed dataset, and will not normally affect the diversity results, and CROP, which uses an MCMC step, has a higher but still low variability that should have a minimal impact on the results, although this may rarely lead to counter-intuitive diversity estimates as separate runs are used for each similarity level.
Suitability for differential diversity studies
The results obtained from the analysis of the short-length hypervariable V6 16S rDNA sequences from soil were in all cases (with the exception of CROP for GTZ and glyphosate data) in good agreement with those previously reported for the same set of sequences, confirming that the reduction in species richness was more evident in the GTZ-treated soil at the first sampling time, and that the relative recovery from this herbicide treatment appeared to be poorer than that from the glyphosate treatment . All of the workflows used, except CROP, were able to show differences in species richness among the different herbicide treatments used, and the untreated soil was the one that contained the highest bacterial diversity. This is in accordance with the results we have obtained from other field studies (data not shown), yet a general conclusion may not be reached just from our soil data.
Use of our reference set of 60 short-length hypervariable V6 16S rDNA sequences shows that MSAs obtained using MUSCLE and MAFFT produce the same results, regardless of the distance calculation method (with the exception of MAFFT), while the Mothur guided alignment underestimates diversity when dealing with a small number of OTUs. The method used for distance calculation also affects the number of OTUs, as MAFFT is unable to discriminate between different dissimilarity levels (even at 18%, data not shown), considering the small number of OTUs found. Jukes-Cantor or Mothur distances slightly underestimate the number of OTUs.
No aberrant combinations were found when nearly full-length 16S rDNA sequences were analysed, thus indicating that when dealing with long sequences, Mothur reference-guided alignments may be preferred. When dealing with short length sequences, the ESPRIT package and the RDP pipeline are the most accurate on the three dissimilarity levels. When analysing a large number of long sequences, computational time requirements adversely affect the use of the ESPRIT package due to the longer time needed, while the use of Mothur alone produces acceptable estimates in a reasonable time.
For even larger numbers of short length sequences in the range of hundreds of thousands , a filtering step that reduces sequence numbers is needed. ESPRIT and Otupipe automatically apply an initial filtering step and hence can directly work with these huge datasets, producing acceptable results. The RDP web-based pipeline may experience troubles or long delays with some datasets, but does generally give good results. In the case of MAFFT, Mothur and MUSCLE, the initial selection of unique sequences and optional preclustering steps helps by significantly reducing the data to manageable sizes.
In conclusion, the RDP web-based pipeline is most convenient for general use, but when dealing with very large or many datasets, or when timely results are needed, ESPRIT and Otupipe are very efficient and can be used locally to produce acceptable results in every circumstance. In all the cases tested, RDP and ESPRIT always rendered more accurate and similar results, although perhaps in specific circumstances, and depending on the number of sequences, length and variability, combination workflows may still be an attractive option.