SIS: a program to generate draft genome sequence scaffolds for prokaryotes

Background Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome. Results We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis has overall better performance. Conclusions sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.

shows the variation in the number of correct adjacencies determined by each scaffold program when the reference genome is the closest to the query genome. Figures 2 and 3 show the boxplots when the correct number of adjcencies is averaged over the results obtained using the 10 closest and 20 closest genomes, respectively. The decrease in performance for all programs when more distant genomes are included is clear. SIS (nucmer) has the best median value in the closest reference genome case; Mauve has the best median value in the 10-closest case, and SIS (promer) has the best median value in the 20-closest case. Tables with results according to genome   Tables 1, 2, and 3 show the results of all programs on each individual replicon sequence tested. Most programs do well on the replicon sequence with the fewest contigs if only the closest reference genome is used (arguably the "easiest" case). Based on the totals of the last rows in these tables SIS (promer or nucmer) are clearly the best programs.

Variation of performance depending on fraction of correct adjacencies
We separated results based on average number of correct adjacencies according to the following bins: 0-20%, 20-40%, 40-60%, 60-80%, and 80-100%. Thus, for example, a program that gets between 40 and 60% correct adjacencies in 80% of the tests would have a bar with length proportional to 80% in the bar corresponding to the 20-40% bin. The results are shown in Figs. 4, 5, and 6.
Considering the bin with more than 80% correct adjacencies, the graphs show that SIS (promer) is the best program in all three scenarios tested (top 1, top 10, and top 20). The same happens for the cumulative bins with more than 40% correct adjacencies. For the case where replicon sequences have 25 or fewer contigs, SIS (promer or nucmer) are the best programs in all three scenarios. For 26 or more contigs, the results show that there are cases when r2cat or Mauve achieve the best results. The graphs also suggest that r2cat and Mauve are less sensitive to the number of contigs in the input.

Variation of performance depending on number of contigs
As noted in the main text, these results require some caution because of the small number of cases in each bin (between five and seven).

Tests using simulated contigs
For this test we used 21 complete genomes of the Mycobacterium genus, 18 complete genomes of the Pseudomonadaceae family, 20 complete genomes of the Shewanella genus, and 9 complete genomes of the Xanthomonas genus. All genomes in these groups have one main circular chromosome. Some members have small plasmids; we do not take plasmids into consideration. Pairwise whole genome comparisons of these genomes show a distinctive 'X' pattern, indicating a predominance of symmetric inversions (See We simulated contigs by the following procedure: each genome was divided up into 100 substrings of the same size. Then a section with length 2.5% of the total contig length was removed from each end, with the goal of simulating the fact that draft genomes usually contain less than the complete genome sequence.
The resulting sequence corresponds to 95% of the complete genome. Then the 100 contigs were randomly shuffled, and with 0.5 probability each contig was reverse complemented. The reverse complementation with 0.5 probability simulates the fact that when contigs are generated in the sequencing and assembly processes they can come out in any of the two possible orientations.
Let R = {r 1 , r 2 , . . . , r n } be the original set of genomes in a group and Q = {q 1 , q 2 , . . . , q n } be the set of simulated draft genomes obtained by the above process, where q i is the draft genome obtained from r i . We carried out n(n − 1) tests for each program and each group (with values of n being 21, 18, 20, and 9), where in each test r i is the reference genome and q j is the draft genome, with i = j.

Results
Figs. 11, 12, 13 and 14 shows the results averaging over all pairs of genomes for each program and for each bacterial group.
In practice, and as stated in the main text, the choice of a reference genome would be guided by phylogenetic distance. The closest genome to the draft genome is the one most likely to yield best results.
With this in mind we computed the pairs of closest Mycobacterium, Pseudomonadaceae, Shewanella and Xanthomonas genomes based on the genomic distance MUMi [8], and tabulated the outcomes for these pairs only. The results are shown in Fig. 15

Effect of duplications
We investigated the effect of genome duplications on the performance of SIS. For each test case (from the test set of the main text) we determined the existence of duplicated segments of size at least k bp, with k varying from a minimum of 100 to a maximum of 2000. For each value of k we determined the contigs that contained such duplicated segments, and placed them in set D. Duplications were determined by running nucmer (or promer) on the pair of genomes (draft, reference) and then script delta-filter (part of the MUMmer package). With SIS results in hand, we computed correct adjacencies in contigs in D and contigs not in D. Tables 4 and 5 show the results for all pairs for which set D was not empty. The decrease in average performance did not exceed 13 percentage points. Variation in the number of correct adjacencies determined by each scaffold program when the reference genome is the closest to the query genome. The diamond is the median. File S1.pdf.         Pairwise whole genome comparison of two Pseudomonas species. The comparison was done using nucmer [9]. File S10.pdf.          File S20.pdf.