CAR: contig assembly of prokaryotic draft genomes using rearrangements
© Lu et al.; licensee BioMed Central Ltd. 2014
Received: 11 July 2014
Accepted: 5 November 2014
Published: 28 November 2014
Next generation sequencing technology has allowed efficient production of draft genomes for many organisms of interest. However, most draft genomes are just collections of independent contigs, whose relative positions and orientations along the genome being sequenced are unknown. Although several tools have been developed to order and orient the contigs of draft genomes, more accurate tools are still needed.
In this study, we present a novel reference-based contig assembly (or scaffolding) tool, named CAR, that can efficiently and more accurately order and orient the contigs of a prokaryotic draft genome based on a reference genome of a related organism. Given a set of contigs in multi-FASTA format and a reference genome in FASTA format, CAR outputs a list of scaffolds, each of which is a set of ordered and oriented contigs. For validation, we tested CAR on a real dataset composed of several prokaryotic genomes and compared its performance with several other reference-based contig assembly tools. Our experimental results show that CAR performs better than all these other reference-based contig assembly tools in terms of sensitivity, precision and genome coverage.
CAR serves as an efficient tool that can more accurately order and orient the contigs of a prokaryotic draft genome based on a reference genome. The web server of CAR is freely available at http://genome.cs.nthu.edu.tw/CAR/ and its stand-alone program can also be downloaded from the same website.
Keywords: Bioinformatics, Contig assembly, Rearrangement
The draft genomes produced by most assemblers for next generation sequencing (NGS) are just collections of independent contigs, whose relative positions and orientations along the genome being sequenced are unknown. To address this problem, a process called scaffolding is used to order and orient the contigs of a draft genome. Accurate scaffolding is critical for the subsequent finishing process, which applies the primer walking technique to close the gaps between ordered and oriented contigs. Currently, many NGS assemblers utilize the information of paired-end reads (or mate-pair reads) to produce the scaffolds, each of which is a set of ordered and oriented contigs. Such paired-end reads can be generated by sequencing both ends of large DNA molecules like bacterial artificial chromosomes (BAC), thus producing pairs of sequenced reads with known relative orientation and approximate distance. As a result, if the two paired-end reads can be mapped to two individual contigs unambiguously, the relative order and the distance between these two contigs can be correctly identified. In practice, a mixture of paired-end reads with various distances is needed to improve the accuracy of the scaffolding by reducing the impact of experimental errors. Computationally, such a scaffolding process can be modeled as a combinatorial optimization problem, which aims to order and orient the input contigs in a manner that maximizes the number of supporting paired-end reads. Unfortunately, this problem has been shown to be NP-hard, meaning that an efficient polynomial time algorithm for it is highly unlikely to exist. An alternative approach to order and orient the contigs of a draft genome is to use the finished genome of a related organism as a reference.
In principle, the contigs of a draft genome can be mapped to a reference genome and their positions on the reference genome are then used to infer the scaffolding of contigs. Thus far, several tools using this approach have been developed, such as Projector 2, OSLay, ABACAS, Mauve Aligner, fillScaffolds, r2cat, CONTIGuator and SIS.
In this study, we present a novel reference-based contig assembly (or scaffolding) tool named CAR (short for “Contig Assembly using Rearrangements”) that can efficiently and more accurately order and orient the contigs of a prokaryotic draft genome based on a reference genome of a related organism. The kernel program of CAR was implemented using a different but more accurate algorithm we recently developed. In principle, we formulated the reference-based scaffolding problem as the following combinatorial optimization problem: Given a set of contigs for a draft genome and a reference genome, the goal is to order and orient the contigs of the draft genome in a way that minimizes the rearrangement distance between the assembled draft genome and the reference genome. The rationale for defining the reference-based scaffolding problem in this way is as follows. The draft and reference genomes in this problem are represented by signed permutations of n integers, where each integer represents a conserved genetic marker (gene or synteny block) shared between the draft and reference genomes and its associated sign indicates the strandedness of the corresponding genetic marker. If the draft and reference genomes are phylogenetically closely related, then the contig assembly of the draft genome is expected to have a genetic-marker order similar to that of the reference genome, since the global (or large-scale) mutations of genome rearrangements between them are relatively rare. Note that the reference-based scaffolding problem we formulated above is a variant of the one defined by Gaul and Blanchette, because the reference genome used by Gaul and Blanchette can also be a draft genome (rather than necessarily a finished genome, as required here).
As already shown in our previous study, we used permutation groups to design an efficient algorithm to solve this reference-based scaffolding problem, where the rearrangement distance in the problem was measured by reversals and block-interchanges (also called generalized transpositions) with the weight ratio 1:2. Reversal and block-interchange are two different kinds of genome rearrangements that can affect the genomic organization of DNA molecules. A reversal affects a segment on a chromosome by reversing this segment as well as exchanging its strands, while a block-interchange is a generalized transposition that exchanges two nonoverlapping (but not necessarily adjacent) segments on a chromosome. Usually, transpositions, as well as block-interchanges, occur less frequently than reversals in many evolutionary scenarios. As also discussed in our previous studies, it is biologically meaningful to assign block-interchanges twice the weight of reversals, based on the observation of real biological data and the result of computer simulations. It is worth mentioning here that the contigs of a draft genome can be ordered and oriented by our algorithm in O(n) time, where n is the number of genetic markers.
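To make the two rearrangement operations concrete, the following minimal sketch applies a reversal and a block-interchange to a signed permutation. It is only an illustration and not part of CAR: it assumes a linear chromosome stored as a Python list with 0-based, inclusive segment indices, and the function names (reversal, block_interchange, weighted_cost) are hypothetical.

```python
def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip the sign of each marker."""
    segment = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + segment + perm[j + 1:]


def block_interchange(perm, i, j, k, l):
    """Exchange the non-overlapping segments perm[i..j] and perm[k..l] (i <= j < k <= l)."""
    if not (0 <= i <= j < k <= l < len(perm)):
        raise ValueError("segments must be non-overlapping and given in left-to-right order")
    return perm[:i] + perm[k:l + 1] + perm[j + 1:k] + perm[i:j + 1] + perm[l + 1:]


def weighted_cost(num_reversals, num_block_interchanges):
    """Cost of a rearrangement scenario under the 1:2 weight ratio described in the text."""
    return 1 * num_reversals + 2 * num_block_interchanges


genome = [1, 2, 3, 4, 5, 6]
print(reversal(genome, 1, 3))                  # [1, -4, -3, -2, 5, 6]
print(block_interchange(genome, 1, 2, 4, 5))   # [1, 5, 6, 4, 2, 3]
print(weighted_cost(3, 2))                     # 7
```

Under this weighting, a scenario with three reversals and two block-interchanges costs 3 + 2×2 = 7.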
CAR is an easy-to-use tool for contig assembly of a prokaryotic draft genome. Given a set of contigs in multi-FASTA format and a reference genome in FASTA format, it can output a list of scaffolds, each consisting of the ordered and oriented contigs. To validate CAR, we have tested it on a real dataset composed of several prokaryotic genomes and also compared its performance with several other reference-based contig assembly tools. As a consequence, our experimental results have shown that CAR indeed performs better than all these other reference-based tools in terms of sensitivity, precision and genome coverage.
The method we used to implement CAR is described as follows. Note that the genomes considered below are unichromosomal. For the calculation of rearrangement distance, the input draft genome π and the reference genome σ of our algorithm must be represented as two signed permutations of n integers between 1 and n, where each integer represents a conserved genetic marker between the draft and reference genomes and its associated sign indicates the strandedness of the corresponding genetic marker. For this purpose, we first used MUMmer’s programs, NUCmer and PROmer, with default settings to detect the conserved genetic markers between the draft and reference genomes, where NUCmer is performed on the input nucleotide sequences and PROmer on the amino acid sequences translated from the input nucleotide sequences in all six reading frames. The delta-filter utility program of MUMmer with parameter ‘-1’ was then used to remove the repeated genetic markers from the draft and reference genomes. Subsequently, we applied our algorithm to the obtained signed permutations to order and orient the contigs of the draft genome π based on the reference genome σ.
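The conversion from filtered marker matches to signed permutations can be sketched as follows. This is an assumption-laden illustration rather than CAR's actual code: each match is taken to be a hypothetical tuple (ref_start, contig_id, contig_start, strand) with strand equal to +1 or −1, the matches are assumed to be one-to-one already (as after repeat filtering), and markers are simply numbered 1..n in their order along the reference.

```python
from collections import defaultdict


def signed_permutations(matches):
    """Turn one-to-one marker matches into signed permutations.

    matches: iterable of (ref_start, contig_id, contig_start, strand),
             with strand = +1 (same strand as reference) or -1 (opposite).
    Returns (reference, draft): reference is [1, ..., n]; draft maps each
    contig id to its list of signed markers in contig order.
    """
    # Number the markers 1..n by their position on the reference genome.
    by_ref = sorted(matches, key=lambda m: m[0])
    reference = list(range(1, len(by_ref) + 1))

    # Collect the signed markers of each contig together with their contig position.
    per_contig = defaultdict(list)
    for marker, (_, contig_id, contig_start, strand) in enumerate(by_ref, start=1):
        per_contig[contig_id].append((contig_start, strand * marker))

    # Order the markers of each contig by their position along that contig.
    draft = {
        contig_id: [signed for _, signed in sorted(items, key=lambda t: t[0])]
        for contig_id, items in per_contig.items()
    }
    return reference, draft


example_matches = [
    (100, "c2", 10, +1),
    (900, "c1", 500, -1),
    (400, "c2", 300, +1),
    (1500, "c1", 50, -1),
]
reference, draft = signed_permutations(example_matches)
print(reference)    # [1, 2, 3, 4]
print(draft["c2"])  # [1, 2]
print(draft["c1"])  # [-4, -3]
```

In this toy input, contig c1 matches the reference on the opposite strand, so its markers appear reversed and negated.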
Basic idea of algorithm
The algorithm we designed was based on permutation groups in algebra, which have been proven to be very useful in the studies of genome rearrangements. Basically, we consider the assembly (scaffolding, i.e., ordering and orienting) of two contigs as a rearrangement, called fusion, that joins these two contigs into one. Assume that there are m contigs in the draft genome π. The main job of our algorithm is then to find m−1 fusions to join the m contigs in π such that the rearrangement distance between the resulting contig assembly of π and the reference genome σ is minimized. For proper modeling of the contigs using permutation groups, we initially add two caps (i.e., dummy genetic markers) to the ends of each contig of π and σ, resulting in a capped draft genome and a capped reference genome. We then show that the fusion of two contigs in π can be mimicked by a special translocation acting on the corresponding contigs in the capped draft genome, where a translocation is a kind of rearrangement that acts on two chromosomes by exchanging their end fragments. Next, we calculate the product of the capped reference genome and the inverse of the capped draft genome, from which we can further derive m−1 special translocations to act on the capped draft genome such that their rearrangement effects on the original π are m−1 fusions. In particular, we show that these m−1 fusions can be used to optimally join the m contigs of π, and the whole process of this contig assembly can be finished in linear time. For full details on this algorithm, we refer the reader to our original paper.
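The notion of joining m contigs by m−1 fusions can be illustrated with a toy brute-force search. The sketch below is emphatically not the permutation-group algorithm used by CAR: it enumerates every order and orientation of the contigs (exponential time, usable only for a handful of contigs), and it swaps in the number of breakpoints against the identity reference 1..n as a simple stand-in for the weighted reversal and block-interchange distance. All function names are hypothetical.

```python
from itertools import permutations, product


def flip(contig):
    """Reverse-complement a contig: reverse marker order and negate signs."""
    return [-x for x in reversed(contig)]


def breakpoints(genome):
    """Count breakpoints of a linear signed genome against the identity 1..n."""
    n = len(genome)
    framed = [0] + list(genome) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def best_assembly(contigs):
    """Try all orders and orientations; return (score, layout, genome) with fewest breakpoints.

    layout is a list of (contig_index, '+' or '-') describing the m-1 joins.
    """
    best = None
    for order in permutations(range(len(contigs))):
        for flips in product((False, True), repeat=len(contigs)):
            genome = []
            for idx, flipped in zip(order, flips):
                genome.extend(flip(contigs[idx]) if flipped else contigs[idx])
            score = breakpoints(genome)
            if best is None or score < best[0]:
                layout = [(idx, "-" if f else "+") for idx, f in zip(order, flips)]
                best = (score, layout, genome)
    return best


contigs = [[-4, -3], [1, 2], [5, 6]]
score, layout, genome = best_assembly(contigs)
print(score)   # 0
print(layout)  # [(1, '+'), (0, '-'), (2, '+')]
print(genome)  # [1, 2, 3, 4, 5, 6]
```

Here the three contigs are joined by two fusions, with the first contig placed in reverse orientation between the other two.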
Usage of CAR
Results and discussion
Draft chromosomal genomes used in the testing dataset
Aciduliprofundum boonei T469
Bacillus subtilis 168
Bifidobacterium longum DJO 10A
Brucella melitensis bv 1 16M (I)
Brucella melitensis bv 1 16M (II)
Brucella pinnipedialis B2 94 (I)
Brucella pinnipedialis B2 94 (II)
Burkholderia thailandensis E264 (II)
Burkholderia thailandensis E264 (I)
Chlamydia muridarum Nigg
Clostridium cellulovorans 743B
Corynebacterium aurimucosum ATCC 700975
Corynebacterium efficiens YS 314
Micrococcus luteus NCTC 2665
Mycobacterium tuberculosis H37Ra
Mycoplasma genitalium G37
Saccharopolyspora erythraea NRRL 2338
Selenomonas sputigena ATCC 35185
Stigmatella aurantiaca DW 431
Streptococcus pneumoniae TIGR4
Vibrio Ex25 (I)
Vibrio Ex25 (II)
Yersinia pestis Nepal 516
Comparisons on sensitivity and precision
The number of correct contig joins (or adjacencies) is the main quality measure for a scaffold. A join of two contigs in a scaffold is said to be correct if they are also consecutive in the completely finished query genome. Note that in the above dataset the genomic sequences of the species are completely finished and available from the GenBank of NCBI. Using these completely finished genomes, we can thus derive a reference order for the collection of contigs of each draft chromosomal genome to serve as the standard of truth in our evaluation. The reference order was derived by mapping the contigs to their corresponding finished chromosomal genome and placing them at the positions where they gained the most matches. Contigs that were not matched at all were excluded from the reference order. Let P denote the number of all contig joins in the reference order. For the output of each contig assembly tool, we compared it with the reference order by counting the number of contig joins that also occur in the corresponding reference order as true positives (denoted by TP) and the number of the others as false positives (denoted by FP). Using these values for each contig assembly tool, we computed the sensitivity, defined as (TP × 100)/P, and the precision, defined as (TP × 100)/(TP + FP).
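The two measures defined above can be computed as in the following minimal sketch. It makes simplifying assumptions that are not stated in the text: a join is modeled as an unordered pair of neighboring contig identifiers (contig orientation is ignored), the reference order is treated as linear, and the names joins and evaluate are hypothetical.

```python
def joins(ordered_contigs):
    """Return the set of joins (unordered pairs of neighboring contigs) in an ordering."""
    return {frozenset(pair) for pair in zip(ordered_contigs, ordered_contigs[1:])}


def evaluate(reference_order, scaffolds):
    """Compute TP, FP, sensitivity (%) and precision (%) of a set of scaffolds."""
    truth = joins(reference_order)
    predicted = set()
    for scaffold in scaffolds:
        predicted |= joins(scaffold)

    tp = len(predicted & truth)   # joins also present in the reference order
    fp = len(predicted - truth)   # joins absent from the reference order
    p = len(truth)                # all joins in the reference order

    sensitivity = tp * 100 / p if p else 0.0
    precision = tp * 100 / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, sensitivity, precision


reference_order = ["c1", "c2", "c3", "c4", "c5"]
scaffolds = [["c1", "c2", "c3", "c5"], ["c4"]]
tp, fp, sensitivity, precision = evaluate(reference_order, scaffolds)
print(tp, fp)                  # 2 1
print(sensitivity)             # 50.0
print(round(precision, 2))     # 66.67
```

In this toy case the reference order has P = 4 joins; the tool recovers two of them and reports one wrong join (c3 next to c5), giving a sensitivity of 50% and a precision of about 66.67%.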
Comparison of average sensitivity for various contig assembly tools
Comparison of average precision for various contig assembly tools
All the contig assembly tools used in this study can be classified into the following two categories: (a) alignment-based tools and (b) rearrangement-based tools. Projector 2, OSLay, ABACAS, Mauve Aligner, r2cat and CONTIGuator belong to the former category of alignment-based tools, while fillScaffolds, SIS and CAR belong to the latter category of rearrangement-based tools. The alignment-based tools align contigs or contig ends of a draft genome against a reference sequence, and then order and orient the contigs according to their positions (matches) in the reference. The performance of these tools for ordering and orienting the contigs is highly dependent on the similarity between the draft and reference genomes. If the draft and reference genomes are not sufficiently similar, or their phylogenetic relationship is not very close, the alignment-based tools may place the contigs in an incorrect order. The rearrangement-based tools, in contrast, attempt to order and orient the contigs by comparing the genetic-marker orders of the draft and reference genomes. Basically, DNA molecules are subject to local mutations (such as nucleotide substitutions, insertions and deletions) and global mutations (such as genome rearrangements) during their evolution. In contrast to local mutations that normally accumulate rather quickly, genome rearrangements are relatively rare events during evolution, implying that the genetic-marker orders between two species should be more conserved than their nucleotide sequences. This suggests that the rearrangement-based tools should be better suited than the alignment-based tools for correctly ordering and orienting the contigs of a draft genome, especially when the draft genome is phylogenetically distant from the reference genome. On the other hand, among the three rearrangement-based tools mentioned above, CAR has better performance when compared to SIS and fillScaffolds.
The reason may be as follows. SIS deals with only reversals and searches for inversion signatures to order and orient the contigs in a draft genome. In addition to reversals, fillScaffolds considers other rearrangements, such as transpositions and translocations (including fissions and fusions). It treats each contig as a (linear) chromosome and uses an existing rearrangement algorithm, such as the one proposed by Tesler, to order and orient the contigs in a draft genome. However, the existing rearrangement algorithm itself is not dedicated to the ordering and orientation of contigs. CAR, by contrast, considers both reversals and block-interchanges (generalized transpositions) and further utilizes an exact algorithm that can optimally solve the reference-based scaffolding problem we formulated in this study. As compared to the exact algorithm used by CAR that can produce mathematically optimal solutions, the algorithms adopted by SIS and fillScaffolds are heuristics that can produce only approximate solutions.
Comparison on genome coverage
Comparison of genome coverage for various contig assembly tools
Additional performance results of all contig assembly tools on individual query chromosomes can be found in Additional file 1.
It should be noted that the process of identifying conserved genetic markers between draft and reference chromosomes dominates the overall running time of CAR. For example, in the experiments performed above, the average running time of CAR for a pair of draft and reference chromosomes is 15.96 seconds when running with NUCmer and 86.51 seconds with PROmer. In the former case, however, NUCmer takes about 14.56 seconds and in the latter case, PROmer takes about 76.06 seconds. Considering both cases, CAR itself takes on average between 1.40 and 10.45 seconds to finish the assembly of contigs.
Contig assembly (scaffolding) is the process of ordering and orienting the contigs of a draft genome, which is important and helpful to the finishing of a genome sequencing project. In this study, we introduced CAR, an easy-to-use contig assembly tool that can efficiently produce a more accurate contig assembly of a prokaryotic draft genome based on a reference genome of a related organism. CAR was implemented based on a linear time algorithm we recently developed using genome rearrangements and permutation groups in algebra. For the size of prokaryotic chromosomes, CAR was able to finish its contig assembly job in several seconds to a couple of minutes. When compared to other tools using a real dataset composed of several prokaryotic genomes, CAR exhibited the best performance in sensitivity, precision and genome coverage in reference-based contig assembly.
Availability and requirements
Project name: CAR
Project home page: http://genome.cs.nthu.edu.tw/CAR/
Operating system(s): Linux
Programming language: PHP
Other requirements: MUMmer
License: GNU GPL
Any restrictions to use by non-academics: None
This work was supported in part by National Science Council of Taiwan under grant NSC100-2221-E-007-129-MY3.
- Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res. 2004, 14: 149-159. 10.1101/gr.1536204.
- Dayarian A, Michael TP, Sengupta AM: SOPRA: scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010, 11: 345. 10.1186/1471-2105-11-345.
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27: 578-579. 10.1093/bioinformatics/btq683.
- Huson DH, Reinert K, Myers EW: The greedy path-merging algorithm for contig scaffolding. J ACM. 2002, 49: 603-615. 10.1145/585265.585267.
- Bentley DR: Whole-genome re-sequencing. Curr Opin Genet Dev. 2006, 16: 545-552. 10.1016/j.gde.2006.10.009.
- van Hijum SA, Zomer AL, Kuipers OP, Kok J: Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res. 2005, 33: W560-W566. 10.1093/nar/gki356.
- Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics. 2007, 23: 1573-1579. 10.1093/bioinformatics/btm153.
- Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics. 2009, 25: 1968-1969. 10.1093/bioinformatics/btp347.
- Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT: Reordering contigs of draft genomes using the Mauve Aligner. Bioinformatics. 2009, 25: 2071-2073. 10.1093/bioinformatics/btp356.
- Muñoz A, Zheng CF, Zhu QA, Albert VA, Rounsley S, Sankoff D: Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinformatics. 2010, 11: 304. 10.1186/1471-2105-11-304.
- Husemann P, Stoye J: r2cat: synteny plots and comparative assembly. Bioinformatics. 2010, 26: 570-571. 10.1093/bioinformatics/btp690.
- Galardini M, Biondi EG, Bazzicalupo M, Mengoni A: CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code Biol Med. 2011, 6: 11. 10.1186/1751-0473-6-11.
- Dias Z, Dias U, Setubal JC: SIS: a program to generate draft genome sequence scaffolds for prokaryotes. BMC Bioinformatics. 2012, 13: 96. 10.1186/1471-2105-13-96.
- Li CL, Chen KT, Lu CL: Assembling contigs in draft genomes using reversals and block-interchanges. BMC Bioinformatics. 2013, 14 Suppl 5: S9. 10.1186/1471-2105-14-S5-S9.
- Fertin G, Labarre A, Rusu I, Tannier E, Vialette S: Combinatorics of Genome Rearrangements. Cambridge: The MIT Press; 2009.
- Gaul E, Blanchette M: Ordering partially assembled genomes using gene arrangements. Lect Notes Comput Sci. 2006, 4205: 113-128. 10.1007/11864127_10.
- Huang YL, Lu CL: Sorting by reversals, generalized transpositions, and translocations using permutation groups. J Comput Biol. 2010, 17: 685-705. 10.1089/cmb.2009.0025.
- Huang YL, Huang CC, Tang CY, Lu CL: SoRT 2: a tool for sorting genomes and reconstructing phylogenetic trees by reversals, generalized transpositions and translocations. Nucleic Acids Res. 2010, 38: W221-W227. 10.1093/nar/gkq520.
- Blanchette M, Kunisawa T, Sankoff D: Parametric genome rearrangement. Gene. 1996, 172: GC11-GC17. 10.1016/0378-1119(95)00878-0.
- Eriksen N: (1+ε)-approximation of sorting by reversals and transpositions. Theor Comput Sci. 2002, 289: 517-529. 10.1016/S0304-3975(01)00338-3.
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12. 10.1186/gb-2004-5-2-r12.
- Tesler G: Efficient algorithms for multichromosomal genome rearrangements. J Comput Syst Sci. 2002, 65: 587-609. 10.1016/S0022-0000(02)00011-9.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.