SuRankCo: supervised ranking of contigs in de novo assemblies

Background: Evaluating the quality and reliability of a de novo assembly, and of single contigs in particular, is challenging, since a ground truth is commonly not readily available and numerous factors may influence results. Currently available procedures provide assembly scores but lack a comparative quality ranking of contigs within an assembly.

Results: We present SuRankCo, which relies on a machine learning approach to predict quality scores for contigs and to enable the ranking of contigs within an assembly. The result is a sorted contig set which allows selective contig usage in downstream analysis. Benchmarking on datasets with known ground truth shows promising sensitivity and specificity and favorable comparison to existing methodology.

Conclusions: SuRankCo analyzes the reliability of de novo assemblies on the contig level and thereby allows quality control and ranking prior to further downstream and validation experiments.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0644-7) contains supplementary material, which is available to authorized users.


GC-Content
The fraction of GC-content in the contig.
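As a minimal illustration (the function name is ours, not SuRankCo's), this feature is simply the fraction of G and C bases:

```python
def gc_content(contig: str) -> float:
    """Fraction of G and C bases in a contig sequence."""
    seq = contig.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```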

Coverage
Pooled number of reads contributing to each position in the contig. Additional variants: the coverage of the contig ends is reported in addition, with the end size equal to the mean read length.

Core Coverage
Pooled number of reads contributing to each position in the contig with the same nucleotide as the one selected for the consensus. Additional variants: the core coverage of the contig ends is reported in addition, with the end size equal to the mean read length.
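Both coverage variants can be sketched as follows, assuming reads are given as (start position, gap-aligned bases) pairs; this input format and the function name are illustrative, not SuRankCo's internal representation:

```python
def coverage_profiles(consensus, alignments):
    """Per-position coverage and core coverage for a contig.

    `alignments` is a list of (start, read_bases) pairs, with read_bases
    already gap-aligned to the consensus (illustrative input format).
    """
    n = len(consensus)
    cov = [0] * n
    core = [0] * n
    for start, bases in alignments:
        for i, b in enumerate(bases):
            pos = start + i
            if 0 <= pos < n:
                cov[pos] += 1                # any read base counts toward coverage
                if b == consensus[pos]:      # agreeing base counts toward core coverage
                    core[pos] += 1
    return cov, core
```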

Base Confirmation
Significance of the core coverage in contrast to the coverage per position, tested with a binomial test with k = core coverage, n = coverage and p = 0.98. Here p = 1 − error rate, where the error rate denotes the average sequencing error. With an error rate of 2%, the expected fraction of reads contributing the same correct nucleotide to each position is therefore 98%. Additional variants: the base confirmation of the contig ends is reported in addition, with the end size equal to the mean read length.
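A minimal re-implementation of this test; the original does not state which tail is used, so the lower tail is our assumption (fewer agreeing reads than expected indicates a problematic position):

```python
from math import comb

def base_confirmation_pvalue(core: int, cov: int, p: float = 0.98) -> float:
    """Lower-tail binomial test: probability of observing `core` or fewer
    agreeing reads out of `cov`, when each read matches the consensus with
    probability p = 1 - error_rate (illustrative re-implementation)."""
    return sum(comb(cov, k) * p**k * (1 - p) ** (cov - k) for k in range(core + 1))
```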

Coverage Comparison
Coverage comparison within an assembly, represented by the ratio of the contig coverage to the mean coverage of all contigs in the assembly. Additional variants: coverage comparisons of the contig ends are reported in addition, with the end size equal to the mean read length.

Coverage Curve Drops
Coverage curve drops indicate local minima in the coverage of a contig with a value of less than 25% and less than 50%, respectively, of their adjacent maxima within a fixed window size w. The coverage is preprocessed with sliding-window smoothing with window size w, which is chosen as the mean read length of the contig. The number of drops is reported normalized by the contig length, as well as the biggest drop, i.e. the maximal difference between a minimum and its smaller adjacent maximum.
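The description above can be sketched as follows; the exact window handling is our assumption, since the original only fixes the drop ratios (25% and 50%, i.e. `ratio=0.25` and `ratio=0.5` below) and the smoothing window w:

```python
def smooth(cov, w):
    """Sliding-window mean with window size w (w ~ mean read length)."""
    out = []
    for i in range(len(cov)):
        lo = max(0, i - w // 2)
        window = cov[lo:i + w // 2 + 1]
        out.append(sum(window) / len(window))
    return out

def coverage_drops(cov, w, ratio=0.5):
    """Count local minima falling below `ratio` of the smaller of their
    adjacent maxima (searched within w positions).  Returns the drop count
    normalized by contig length, plus the biggest drop."""
    s = smooth(cov, w)
    drops, biggest = 0, 0.0
    for i in range(1, len(s) - 1):
        if s[i] < s[i - 1] and s[i] <= s[i + 1]:       # local minimum
            left = max(s[max(0, i - w):i] or [s[i]])   # adjacent maxima
            right = max(s[i + 1:i + 1 + w] or [s[i]])
            smaller = min(left, right)
            if smaller > 0 and s[i] < ratio * smaller:
                drops += 1
                biggest = max(biggest, smaller - s[i])
    return drops / len(cov), biggest
```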

K-mer Uniqueness Global
Number of K-mers unique to a contig in contrast to other contigs within the assembly, normalized by the contig length (since longer contigs comprise more unique K-mers by chance). K-mers are extracted with a size of 8, and only K-mers containing standard nucleotide symbols (i.e. A, C, G and T) are considered.
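A sketch of the global K-mer uniqueness, treating a K-mer as unique when it occurs in no other contig (per-contig presence rather than raw multiplicity is an assumption on our part):

```python
from collections import Counter

def kmer_uniqueness(contigs, k=8):
    """Per-contig count of k-mers that occur in no other contig, normalized
    by contig length; k-mers with non-ACGT symbols are skipped."""
    valid = set("ACGT")
    per_contig = []
    for c in contigs:
        kmers = {c[i:i + k] for i in range(len(c) - k + 1)
                 if set(c[i:i + k]) <= valid}
        per_contig.append(kmers)
    totals = Counter()                 # in how many contigs each k-mer appears
    for kmers in per_contig:
        totals.update(kmers)
    return [sum(1 for km in kmers if totals[km] == 1) / len(c)
            for kmers, c in zip(per_contig, contigs)]
```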

K-mer Uniqueness Ends
Number of K-mers unique to a contig end in contrast to both ends of other contigs within the assembly. Reported as minimal and maximal K-mer uniqueness to avoid an implicit orientation of the contig. The mean read length of a contig is chosen as the end size.

S2 Contig Scores
Supplementary Table 2 provides an overview of the single contig scores calculated by SuRankCo. The scores are either based on match counts or on error counts (edit distance, including mismatches and gaps) of contig-reference alignments.

Normed Error Count
The edit distance, i.e. the error count, normalized by the contig length.

Large Error Scores
Account for very large errors or unstable regions, which might originate from mis-joins or badly sequenced/covered regions, respectively. While small errors are only considered by the General Scores, critical large errors are additionally penalized here.

Max. Contiguous Error
The largest contiguous stretch of alignment errors, normalized by the contig length.

Max. Region Error
The largest number of alignment errors in a fixed region size (100 bp).

End Scores
Similar to Large Error Scores but applied to contig ends only. Errors in this region are particularly critical for subsequent applications (e.g. scaffolding) and are therefore additionally penalized.

Max. End Error Stretch
The largest stretch of errors right at the contig ends (unfixed length), normalized by the contig length.

Max. End Error Count
The largest number of alignment errors at the ends (fixed length of 100 bp).

Other Scores
Additionally account for insertions and critical mis-joins.

Normed Contig Length
The ratio of contig length to alignment length.
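The two Large Error Scores can be sketched from a per-position error mask; the function name and the mask representation are illustrative:

```python
def error_scores(errors, region=100):
    """From a per-position boolean error mask, return
    (max contiguous error stretch / contig length,
     max number of errors in any window of `region` bp)."""
    n = len(errors)
    longest = run = 0
    for e in errors:                      # longest contiguous error stretch
        run = run + 1 if e else 0
        longest = max(longest, run)
    max_region = window = 0               # sliding-window error count
    for i in range(n):
        window += errors[i]
        if i >= region:
            window -= errors[i - region]
        max_region = max(max_region, window)
    return longest / n, max_region
```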

S3 Training Class Definitions
Each contig score applied by SuRankCo is separated into two classes to allow for binary classification. The separation into the two classes can either be set manually or automatically by fitting exponential distributions. The threshold selection may be supported by histograms provided by the SuRankCo-Score module (as shown in Supplementary Figure 2).
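SuRankCo performs this fitting in R via the MASS package; the same computation can be sketched in a few lines of Python, since the maximum-likelihood rate of an exponential distribution is simply the reciprocal of the sample mean:

```python
from math import log

def exponential_quantile_threshold(scores, q=0.25):
    """Fit an exponential distribution by maximum likelihood (rate = 1/mean)
    and return its q-quantile as the class-separation threshold."""
    mean = sum(scores) / len(scores)
    return -mean * log(1 - q)   # inverse CDF of Exp(1/mean) at q
```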
The automatic exponential fitting makes use of the MASS R package (Venables and Ripley, 2002). It fits an exponential distribution to each single score distribution of the training contigs. Finally, a certain quantile of a fit is considered as the threshold for the score class separation. The selection of the quantile should be based on the concrete training data and be adjusted accordingly; however, 25% yielded a good separation for the E. coli contigs (see Supplementary Figure 2).

The reads were assembled using Mira (Chevreux et al., 1999) with basic settings. A sample configuration for one experiment is provided in Listing 1. Finally, the four resulting assemblies were randomly divided into three training assemblies (SRR400617, SRR400618 and SRR400619) and one test assembly (SRR400620).

To compare the SuRankCo results of the E. coli experiment to ALE, the reads of the prediction data (SRR400620) were mapped against the corresponding contigs using Bowtie2 with default settings; thereby, ambiguous reads were assigned according to the best alignment. The resulting sam files were sorted and, together with the contigs, provided to ALE. Since ALE does not provide a score per contig, the ALE sub-scores were transformed to error counts and summed up per contig. For each ALE sub-score, a histogram over all contigs and positions was created to manually choose a threshold (as shown in Supplementary Figure 1). Each contig position below the threshold of a sub-score is counted as a potential error. The counts were summed for each contig and normalized by the contig length and the total number of sub-scores.

To demonstrate the usage of SuRankCo in conjunction with several organisms and assemblers, we make use of the staggered mock community of the Human Microbiome Project. The data set is available in the NCBI Sequence Read Archive [SRA:SRR172903] with a total number of 7,932,819 reads and 595M bases.
Organisms in the mock community are listed in Table 4, as well as the reference sequences used to classify reads and contigs and for evaluation. The mock data is evaluated in three different settings: a meta-assembly, single organism assemblies and a merged evaluation of single organism assemblies of different assemblers.
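The ALE sub-score transformation described earlier (counting positions below a manually chosen threshold per sub-score, summing per contig, and normalizing) can be sketched as follows; the names and data layout are illustrative:

```python
def ale_error_scores(subscores, thresholds, contig_length):
    """Convert per-position ALE sub-scores for one contig into a single
    error score: count positions below each sub-score's threshold, sum over
    sub-scores, and normalize by contig length and number of sub-scores."""
    total = sum(1
                for name, values in subscores.items()
                for v in values
                if v < thresholds[name])
    return total / (contig_length * len(subscores))
```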

S4.3.1 Mock Meta-Assembly
The meta-assembly of the mock community is constructed using MetaVelvet (Namiki et al., 2012) with a k-mer size of 31 and no scaffolding. The resulting contigs are assigned to organisms by using Blast (Altschul et al., 1990) against all reference sequences and selecting the best hits according to the e-value. For training and prediction of SuRankCo scores, the organisms are randomly divided into two equal groups. The grouping and the number of assigned contigs are depicted in Table 4. The MetaVelvet output is converted to ace files using AMOS (Treangen et al., 2011).

S4.3.2 Mock Single Assembly
For the single organism approach, the mock reads are mapped against all references and thereby assigned to organisms by using Bowtie2 (Langmead and Salzberg, 2012) with default settings. We selected all organisms with sufficient coverage for the following assemblies, including E. coli (∼9x), M. smithii (∼11x), R. sphaeroides (∼30x), S. aureus (∼38x), S. epidermidis (∼35x) and S. mutans (∼20x). This selection agrees with the organisms comprising a suitable number of contigs in the meta-assembly, as shown in Table 4. The reads for each organism are then assembled separately with Mira (Chevreux et al., 1999), Soap (Luo et al., 2012) and Velvet (Zerbino and Birney, 2008) with default settings, except in the following cases. For Soap and Velvet, assemblies are constructed over a range of k-mer sizes from 1 to 75, and for each organism the assembly with the highest N50 is selected for further analysis. Since not all assemblers used here provide alignment information but contig sequences only, the corresponding reads are remapped to the assemblies by using Bowtie2 with default settings to produce sam files as input for SuRankCo. For training and prediction, organisms are assigned to the same groups as for the meta-assemblies.

S4.3.3 Mock Single Assembly Merged
For the third evaluation, the single organism assemblies from Mira, Soap and Velvet are merged into combined datasets. Thus, the training set and the prediction set each consist of the assemblies of three organisms from three assemblers.
In general, SuRankCo is used with default settings for all mock experiments. However, contigs are filtered for a minimum size of 350 bases, since commonly no valuable information, such as genes, is expected to be covered by shorter sequences.

S4.4 GAGE Study
To further demonstrate the usage of SuRankCo in conjunction with several assemblers, we make use of the bacterial assemblies provided by the GAGE study. We evaluate all available assemblies of Staphylococcus aureus and Rhodobacter sphaeroides, including ABySS, ABySS2, Allpaths-LG, Bambus2, MSR-CA, SGA, SOAPdenovo and Velvet. However, the CABOG assembly of R. sphaeroides could not be evaluated, since no CABOG assembly of S. aureus is available for training.
Since none of the GAGE assemblies provide alignment information but contig sequences only, the corresponding reads are remapped to the assemblies by using Bowtie2 with default settings to produce sam files as input for SuRankCo. For each assembly, we used either the original read set or a corrected read set, in accordance with the GAGE supplementary material.
The GAGE bacteria assemblies are evaluated in two different settings: the evaluation of single assemblies and a merged evaluation of the assemblies of different assemblers. In both settings, S. aureus has been used for training and R. sphaeroides for prediction. In the first setting, each assembly of R. sphaeroides is evaluated by training SuRankCo on the corresponding assembly of S. aureus from the same assembler. For the second evaluation, the assemblies of S. aureus are merged into a combined training dataset. Then, each R. sphaeroides assembly is evaluated based on this merged training. SuRankCo is used with default settings for all GAGE experiments. However, contigs are filtered for a minimum size of 350 bases, since commonly no valuable information, such as genes, is expected to be covered by shorter sequences.

Figure caption (fragment): Each plot comprises a ROC curve of the contig evaluation score grouping in contrast to a varying grouping of the SuRankCo scores. Thereby, the changing color of the graph represents the changing threshold for the SuRankCo score grouping.

Figure 5: Evaluation of the SuRankCo predictions of the GAGE assemblies. Here, one ROC curve represents the evaluation of the R. sphaeroides assemblies classified by the single training dataset. Each plot comprises a ROC curve of the contig evaluation score grouping in contrast to a varying grouping of the SuRankCo scores. Thereby, the changing color of the graph represents the changing threshold for the SuRankCo score grouping.

Figure caption: For completeness and comparability, this figure features the ROC curve-based evaluation of the E. coli experiment as applied for the mock and GAGE experiments. The ROC curve is based on the contig evaluation score grouping in contrast to a varying grouping of the SuRankCo scores. Thereby, the changing color of the graph represents the changing threshold for the SuRankCo score grouping.
Tables

Table 4: Organisms of the mock community data, their grouping for the experiments (T: training, P: prediction), the assigned contigs of the meta-assembly and the references used.

Organism    Group    No. Ctgs    Accession No.