BESST - Efficient scaffolding of large fragmented assemblies

Background: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a link between two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance.
Results: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on the GAGE datasets and, moreover, outperforms the other methods when the library insert size distribution is wide.
Conclusion: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Electronic supplementary material: The online version of this article (doi:10.1186/1471-2105-15-281) contains supplementary material, which is available to authorized users.

The Assemblathon 2 conclusions were not made available until more than a year later. In general, the take-home message of all previously cited studies is the following: (i) no single metric is able to fully and easily describe assembly quality and correctness; (ii) the most commonly used length-based statistics, such as N50 and NG50 (given a set of contigs, each with its own length, the N50 is defined as the longest length for which the collection of all contigs of that length or longer contains at least half of the total assembly length; NG50 is defined similarly, with the total assembly length replaced by the estimated genome length) and total assembly length, are poor quality predictors and cannot be used to deduce overall assembly quality (refer to [5,6] for a detailed discussion); (iii) even in the presence of a high-quality reference sequence, de novo assembly evaluation is an extremely difficult task (see the GAGE and Assemblathon 1 cases); (iv) the same tool employed on a different dataset (i.e., a different genome, different libraries, different sequencers, etc.) can produce utterly different results.
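The N50/NG50 definitions above can be made concrete in a few lines of Python. This is an illustrative sketch, not code from BESST or the GAGE scripts; the function name `n50` is ours:

```python
# Minimal sketch: compute N50 (and, with an estimated genome size, NG50)
# from a list of contig lengths. Illustrative only, not the authors' code.

def n50(lengths, target=None):
    """Return the length L such that contigs of length >= L together
    cover at least half of `target` (total assembly length by default)."""
    if target is None:
        target = sum(lengths)          # N50: half of the total assembly length
    half = target / 2.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0                           # assembly shorter than half the target

contigs = [100, 80, 60, 40, 30, 20]    # toy contig lengths, total 330
print(n50(contigs))                    # N50 -> 80
print(n50(contigs, target=400))        # NG50 with estimated genome size 400 -> 60
```

Note how the same contig set yields a smaller NG50 than N50 when the estimated genome size exceeds the assembly length, which is one reason the two statistics should not be used interchangeably.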
In the last 5 years more than 20 assemblers have been published. More often than not, new tools have claimed superiority over then-existing ones by showing better performance (usually via contiguity-based statistics) on a single dataset or a few datasets. The same holds true for scaffolders: not only is scaffolding performance typically evaluated using poor quality predictors (e.g., NG50), but, to the best of our knowledge, nobody has explored how contigs produced by different assemblers may or may not influence stand-alone scaffolder performance.
Conscious of the current limitations in assembly evaluation, we tried to make the evaluation of our tool as unbiased as possible, so as to provide final users with a complete picture of the advantages of BESST over other stand-alone scaffolders.
No single metric is able to fully or easily describe the quality of an assembler/scaffolder. For this reason we computed several metrics: length-based statistics (NG50, number of scaffolds) and, thanks to the availability of a reference sequence, reference-based statistics (corrected NG50 and number of mis-assemblies). For each tool, we also recorded the required run time (the speed of a tool says nothing about the quality of its output; however, a user must know whether the tool can at least provide an output in feasible time).

Mate-pair distribution bias
In the main manuscript we emphasize the fact that BESST performs better than other stand-alone scaffolders in the presence of wide (i.e., high standard deviation) insert size distributions. Generating mate-pair (MP) libraries with a tight insert size variation (e.g., 10% of the expected and/or observed mean insert size) has been a difficult step in many early de novo assembly projects based on Illumina technology (see [1]). Recently, the introduction of the new Nextera Mate Pair protocol has allowed labs to produce good MP libraries. However, this is true only if enough DNA (i.e., 4 or more micrograms) is available during library preparation. When DNA is scarce, gel-cut selection cannot be performed, yielding a much wider insert size distribution. In Figure S1 we show the insert size distribution plots of two Nextera Mate Pair libraries: the library in Figure S1a was obtained starting from 5 micrograms of DNA, thus applying the gel-cut step, whereas the library in Figure S1b was obtained starting from 1 microgram of DNA, therefore without the gel-cut procedure. Both samples were sequenced from a human genome and aligned against the human reference genome Hg19. Distribution plots were obtained with Picard tools. Insert size distributions characterized by a large standard deviation are also common when trying to obtain large inserts (> 8 kbp). Therefore, there are still many situations where a narrow insert size distribution cannot be obtained.

Figure S1: Insert size distribution of mate pairs with two variants of the Nextera Mate Pair Kit. RF: reverse-forward oriented reads. FR: forward-reverse oriented reads. Tandem: forward-forward or reverse-reverse oriented reads.

Experimental setup
GAGE [2] is the study that, so far, offers the best datasets for testing new software. Datasets for three highly different organisms, together with finished reference sequences, are provided. Each of the organisms has in turn been assembled with 8 different assemblers. Scripts to evaluate (but not rank) the assemblies are also provided to the community. The three GAGE datasets are Staphylococcus aureus (genome size 2,872,915 bp), Rhodobacter sphaeroides (genome size 4,603,060 bp), and Human chromosome 14, or Hc14 (ungapped size 88,289,540 bp). All three datasets consist of two libraries: one paired-end library (average insert size 200 bp) and one mate-pair library (average insert size 3 kbp). GAGE provides assemblies (both contigs and scaffolds) obtained with 8 different assemblers (ABySS, ALLPATHS-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet); however, only 7 are available for the Staphylococcus aureus dataset, as CABOG failed. These assemblies can reasonably be considered the best achievable, as they were produced by de novo assembly experts.
Our analysis and evaluation have been performed on the three original GAGE datasets, plus a fourth partially simulated dataset. This last dataset was partially simulated in order to show how libraries characterized by insert sizes with large standard deviations (e.g., many of the mate-pair libraries produced in practice) badly affect the majority of scaffolders. In particular, a mate-pair-like library characterized by a mean insert size of 3 kbp and an insert standard deviation of 1.5 kbp was simulated using the Rhodobacter sphaeroides reference sequence. This simulated library provided a raw coverage of 30×. The simulated library, together with the original paired-end library, was used to scaffold all Rhodobacter sphaeroides contig-level assemblies.
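The simulation of a wide-insert mate-pair-like library can be sketched as follows. This is an illustrative reimplementation of the idea, not the simulator actually used in the study; read length, rejection rules, and function names are our assumptions:

```python
# Illustrative sketch (not the authors' pipeline): draw mate-pair-like
# fragments with mean insert 3000 bp and s.d. 1500 bp from a reference
# sequence, and emit the two 100 bp reads at the ends of each fragment.

import random

def simulate_pairs(reference, n_pairs, mean=3000, sd=1500, read_len=100):
    pairs = []
    while len(pairs) < n_pairs:
        insert = int(random.gauss(mean, sd))       # wide insert distribution
        if insert < 2 * read_len or insert > len(reference):
            continue                               # reject unusable insert sizes
        start = random.randrange(len(reference) - insert + 1)
        left = reference[start:start + read_len]
        right = reference[start + insert - read_len:start + insert]
        pairs.append((left, right))
    return pairs

random.seed(1)
ref = "".join(random.choice("ACGT") for _ in range(100_000))  # toy reference
pairs = simulate_pairs(ref, 1000)
print(len(pairs))  # 1000
```

With sd = 1500 bp the rejection step discards a substantial fraction of draws (negative or too-short inserts), which mirrors how a wide distribution also wastes sequencing when inserts fall below a usable length.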
For each available GAGE assembly (in its contig version) we ran 4 different stand-alone scaffolders (BESST, SSPACE, OPERA, and SOPRA). All scaffolders were run with default parameters; see section 1.5 for details. The script that automatically runs all the scaffolders and performs the reference-based evaluation is available at http://gage.cbcb.umd.edu/results/index.html (note that the script needs a number of programs, such as samtools and mummer, to be available in the main path).
To summarize, we ran a total of 124 scaffolding experiments on the original GAGE datasets. However, "only" 117 were evaluated, as SOPRA and OPERA did not finish in time on 3 and 4 Hc14 instances, respectively. In particular, SOPRA and OPERA were unable to provide output after 48 hours.
All experiments were run on a machine with 1 TB of RAM and 24 CPUs. All scaffolders were run using a single CPU. Although we employed such a powerful machine, none of the tested scaffolders actually required an unreasonable amount of RAM.
For each of the 117 successfully completed scaffolding experiments we used the GAGE evaluation script to compute:
• NG50: the size of the longest scaffold such that the sum of the lengths of all scaffolds of that length or longer is at least half of the genome size;
• corrNG50: the original assembly is broken every time a mis-assembly is found; corrNG50 is the NG50 computed on this set of "error-free" scaffolds;
• mis-assemblies: the number of inversions, relocations, and translocations identified by the getScaffoldStats script found on the GAGE homepage.
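The corrNG50 computation described above can be sketched as follows. This is an illustrative reimplementation of the idea, not the GAGE getScaffoldStats script; the data layout (a dict of scaffold lengths and a dict of breakpoint positions) is our assumption:

```python
# Sketch of the corrected-NG50 idea: break each scaffold at its reported
# mis-assembly positions, then compute NG50 on the resulting pieces.
# Illustrative only; not the GAGE evaluation code.

def corrected_ng50(scaffold_lengths, breakpoints, genome_size):
    pieces = []
    for name, length in scaffold_lengths.items():
        prev = 0
        for cut in sorted(breakpoints.get(name, [])):
            pieces.append(cut - prev)    # piece up to the mis-assembly
            prev = cut
        pieces.append(length - prev)     # remainder after the last break
    half = genome_size / 2.0
    running = 0
    for piece in sorted(pieces, reverse=True):
        running += piece
        if running >= half:
            return piece
    return 0

scaffolds = {"s1": 500, "s2": 300}
print(corrected_ng50(scaffolds, {}, 800))             # no errors -> 500
print(corrected_ng50(scaffolds, {"s1": [250]}, 800))  # s1 broken at 250 -> 250
```

The toy example shows why corrNG50 penalizes mis-assemblies: a single breakpoint in the largest scaffold halves the statistic even though the total assembled length is unchanged.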
Moreover, for each entry, we also computed:
• the number of initial contigs and the number of produced scaffolds;
• the time required by the scaffolder (without considering the time required to align reads).

Figure S4 (subfigures S4a, S4b, and S4c) represents the same data as Figure S3, but for one dataset (i.e., genome) at a time. From these figures it is clear that there is no winner, i.e., no scaffolder always performs better than the others. In general, looking also at Tables S2-S24, BESST and SSPACE achieve good results; however, both tools have some outliers. As an example, consider the Hc14 dataset with the ABySS assembly (big red square and cross in the bottom right corner of Figures S3 and S4c). On this specific dataset all scaffolders perform badly, being unable to scaffold the highly fragmented contigs produced by ABySS. Figure S5 summarizes, for each dataset (the three original GAGE datasets, Figures S5a-S5c, plus the partially simulated dataset, Figure S5d), the relationship between the number of mis-assemblies (x-axis) and the corrected NG50 (y-axis). Figures S5a-S5c clearly highlight the good performance of BESST and SSPACE over OPERA and SOPRA. This behaviour is even more evident in Figure S5d. SOPRA assemblies form a cloud in the bottom left corner (i.e., few errors but an extremely fragmented assembly), while OPERA is almost always the scaffolder introducing the highest number of mis-assemblies.

Supplementary Results
By comparing Figures S5b and S5d we can appreciate how BESST is less affected by the large insert size variation. When scaffolding ABySS, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, and Velvet contigs, BESST is not affected by the large variation, while SSPACE is badly affected (for example, in the MSR-CA case, corrNG50 decreases from 2.5 Mbp down to 1 Mbp). In the ABySS case, both scaffolders are unable to increase the corrNG50. When working with ALLPATHS-LG contigs, both BESST and SSPACE improve the assembly, with BESST always being the one with the lowest number of mis-assemblies (i.e., 0).

Run commands
For scaffolders requiring the specification of the library insert size, these values were provided.

Function definitions
Since the derivation of Theorem 1 builds upon theory from the work of GapEst, we adopt its notation. The following functions are defined, and motivated, in [3].
• f denotes the standard normal density.

Derivation
We have σ²_{o|d} = Var(o|d) = E[o²|d] − E[o|d]². We will calculate E[o²|d] and E[o|d] separately. Note that o = x − d, and assume that the gap is known, with length k, i.e., d = k. Let h be the gap estimation model introduced in [3] with the corresponding functions p, f, g, I. For brevity, the functions are written without their parameters, with their dependence on a variable shown as a subscript, e.g., p(x − k|c₁, c₂) = p_x when integrating over the variable x. We leave the expression for E[o|d = k] as it stands for now and go on to calculate E[o²|d].
The second equality is obtained by integrating by parts twice, with the antiderivative of f and the derivative of p. Now, from Equation (1) we get

Var(o|d) = [ (2σ²μ g′_d + (μ² + σ²) g_d + σ⁴ g″_d) g_d − (σ² g′_d + μ g_d)² ] / g_d²,

which is equivalent to the result we want to prove, since a variance is always non-negative, i.e., the square root operation is well defined.
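The two-moment identity on which the derivation rests can be checked numerically. The toy model below is not the GapEst observation model; it merely conditions a normal fragment length on spanning a gap of size k and verifies that Var(o|d = k) computed as E[o²|d] − E[o|d]² agrees with the directly estimated variance:

```python
# Toy numerical check (not the GapEst model itself): for observations
# o = x - k, with x drawn from a normal restricted to x >= k, the identity
# Var(o|d=k) = E[o^2|d=k] - E[o|d=k]^2 used in the derivation holds.

import random

random.seed(0)
mu, sigma, k = 3000.0, 300.0, 2500.0
obs = []
while len(obs) < 50_000:
    x = random.gauss(mu, sigma)
    if x >= k:                     # condition: fragment spans the gap of size k
        obs.append(x - k)

n = len(obs)
e1 = sum(obs) / n                        # E[o | d = k]
e2 = sum(o * o for o in obs) / n         # E[o^2 | d = k]
var_identity = e2 - e1 * e1              # variance via the two-moment identity
var_direct = sum((o - e1) ** 2 for o in obs) / n  # variance computed directly

print(abs(var_identity - var_direct) / var_direct < 1e-3)  # True
```

The agreement is exact up to floating-point rounding, since the two expressions are algebraically identical; the Monte Carlo setup only serves to produce a concrete conditional sample.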