- Research article
- Open Access
SIS: a program to generate draft genome sequence scaffolds for prokaryotes
© Dias et al.; licensee BioMed Central Ltd. 2012
- Received: 19 December 2011
- Accepted: 19 March 2012
- Published: 14 May 2012
Decreasing costs of DNA sequencing have made prokaryotic draft genome sequences increasingly common. A contig scaffold is an ordering of contigs in the correct orientation. A scaffold can help genome comparisons and guide gap closure efforts. One popular technique for obtaining contig scaffolds is to map contigs onto a reference genome. However, rearrangements that may exist between the query and reference genomes may result in incorrect scaffolds, if these rearrangements are not taken into account. Large-scale inversions are common rearrangement events in prokaryotic genomes. Even in draft genomes it is possible to detect the presence of inversions given sufficient sequencing coverage and a sufficiently close reference genome.
We present a linear-time algorithm that can generate a set of contig scaffolds for a draft genome sequence represented in contigs given a reference genome. The algorithm is aimed at prokaryotic genomes and relies on the presence of matching sequence patterns between the query and reference genomes that can be interpreted as the result of large-scale inversions; we call these patterns inversion signatures. Our algorithm is capable of correctly generating a scaffold if at least one member of every inversion signature pair is present in contigs and no inversion signatures have been overwritten in evolution. The algorithm is also capable of generating scaffolds in the presence of any kind of inversion, even though in this general case there is no guarantee that all scaffolds in the scaffold set will be correct. We compare the performance of sis, the program that implements the algorithm, to seven other scaffold-generating programs. The results of our tests show that sis has overall better performance.
sis is a new easy-to-use tool to generate contig scaffolds, available both as stand-alone and as a web server. The good performance of sis in our tests adds evidence that large-scale inversions are widespread in prokaryotic genomes.
- Genome assembly
- Contig order
With the decreasing costs of DNA sequencing it is now very common for prokaryotic genomes to be sequenced at “draft” status only. This means that the generated sequence will be a set of contigs (a contig is a substring of the string over the DNA alphabet that represents the genome sequence). The number of contigs depends on the sequencing fold coverage and DNA sequencing technology, and typically ranges from half a dozen to a few hundred.
As of December 9, 2011, the number of draft microbial genome sequences in GenBank is 2324, compared to 1814 complete sequences. This difference is growing (in favor of draft sequences) over time (on March 15, 2011 the numbers were 1821 and 1485, respectively). Therefore there is an increasing need for tools that can improve the sequencing and assembly results beyond a simple contig set.
One technique that can be used to improve automated assembly results is to generate contig scaffolds from the contig set. A scaffold is an ordered set of contigs, with the desired order being the correct genome order, and with each contig in the correct orientation. Scaffolds help in whole genome comparisons and they can guide the finishing process, showing where the gaps are.
Some references in the literature adopt the term scaffold only for the situation where the contig ordering and orientation is given by paired-end read information. Here we adopt a more general meaning, assuming a scaffold is a contig ordering obtained by any technique and/or additional data, or combination thereof.
There are various techniques that can help to create a scaffold. In addition to paired-end reads, one can use data from physical or optical maps [3, 4]. Information about the order of specific contigs sometimes can be obtained by searching for genes that have been split by the gaps between those contigs.
Another technique, which became popular in the last few years, is to use a reference genome. In this case we assume the query (draft) genome A has a close phylogenetic relative B that has been fully sequenced, and the genome of B can be used to guide the assembly of A and to generate a scaffold as well. This technique is used by programs such as ABACAS, fillScaffolds, Mauve Aligner, OSLay, Projector 2, r2cat, PGA4genomics, and CONTIGuator.
When using a reference genome B to create a scaffold for A, one possible problem is the existence of rearrangements in A with respect to B. In prokaryotes, the most common rearrangement is the inversion, leading to the ‘X’ patterns that are frequently seen in dotplot graphs representing whole genome alignments of prokaryotic genomes (when inversions are symmetric with respect to the origin of replication).
In this work we present an algorithm for obtaining contig scaffolds that explicitly takes into consideration the presence of inversions in A with respect to B. We do this by searching for inversion signatures in the input contigs. To our knowledge the only existing program that deals directly with inversions is fillScaffolds, which also deals with several other kinds of rearrangements. One other program seems to deal at least indirectly with rearrangements: Mauve Aligner, which is based on the multiple genome alignment program Mauve and on GRIL.
We created program sis as an implementation of our algorithm, and tested it on real draft genomes that have been completed, comparing its performance to seven other programs. The criterion used to select genome sequences for testing was solely the availability of each genome in two formats: incomplete (several contigs) and complete (one contig per replicon). In these tests sis had the best overall performance.
A replicon is a self-replication unit of a genome, such as a chromosome or a plasmid.
We represent a pair of single-replicon genomes by signed permutations, where numbers represent conserved genes or blocks between the two genomes, and the signs represent the plus or minus strand. We represent the reference genome by the identity permutation $\iota_n = [+1, +2, +3, \ldots, +n]$, and the query genome by the permutation $\Pi = [\Pi_1, \Pi_2, \Pi_3, \ldots, \Pi_n]$, where $1 \le |\Pi_i| \le n$ and $|\Pi_i| \ne |\Pi_j|$ if $i \ne j$.
An inversion $\rho(i,j)$ is a rearrangement event that reverses the order and the signs of a consecutive section of a genome: $\Pi\rho(i,j) = [\Pi_1, \ldots, \Pi_{i-1}, -\Pi_j, \ldots, -\Pi_i, \Pi_{j+1}, \ldots, \Pi_n]$, such that $1 \le i \le j \le n$. Two consecutive conserved blocks $(\Pi_i, \Pi_{i+1})$ are a breakpoint if $\Pi_i \ne \Pi_{i+1} - 1$; otherwise the consecutive blocks are an adjacency. A strip is a maximal substring $[\Pi_i, \Pi_{i+1}, \ldots, \Pi_j]$ such that every pair $(\Pi_k, \Pi_{k+1})$ is an adjacency, for $i \le k < j$. We say that an inversion $\rho(r,s)$ acts on a strip $[\Pi_i, \Pi_{i+1}, \ldots, \Pi_j]$ if $i < r \le s < j$.
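The definitions above map onto a few list operations. The following is a minimal Python sketch (the function names are illustrative and are not taken from sis) that applies an inversion to a signed permutation and then lists its breakpoints and strips.

```python
from typing import List


def invert(perm: List[int], i: int, j: int) -> List[int]:
    """Apply the inversion rho(i, j) (1-based, inclusive) to a signed permutation."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("an inversion requires 1 <= i <= j <= n")
    # Reverse the order and flip the sign of every block in positions i..j.
    middle = [-block for block in reversed(perm[i - 1:j])]
    return perm[:i - 1] + middle + perm[j:]


def breakpoints(perm: List[int]) -> List[int]:
    """Return the 1-based positions i where (perm_i, perm_{i+1}) is a breakpoint."""
    return [k + 1 for k in range(len(perm) - 1) if perm[k] != perm[k + 1] - 1]


def strips(perm: List[int]) -> List[List[int]]:
    """Split a permutation into maximal runs in which every pair is an adjacency."""
    result: List[List[int]] = []
    for block in perm:
        if result and result[-1][-1] == block - 1:
            result[-1].append(block)   # extends the current strip
        else:
            result.append([block])     # a breakpoint starts a new strip
    return result


identity = [1, 2, 3, 4, 5]
query = invert(identity, 2, 4)
print(query)               # [1, -4, -3, -2, 5]
print(breakpoints(query))  # [1, 4]
print(strips(query))       # [[1], [-4, -3, -2], [5]]
```

Applying the same inversion a second time restores the identity permutation, which is a quick sanity check on the sign and order handling.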
Any sequence of symmetric inversions can be sorted so that the result is a series of nested inversions, with the same effect on a genome. Every series of nested inversions is a series of safe inversions, but the converse is not true.
Let $P_1(n)$, $P_2(n)$, $P_3(n)$ and $P_4(n)$ denote the sets of all permutations that can be obtained from $\iota_n$ by applying a series of symmetric inversions, a series of nested inversions, a series of safe inversions, and a series of generic inversions, respectively. Clearly, $P_1(n) \subseteq P_2(n) \subseteq P_3(n) \subseteq P_4(n)$.
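A collection of $m$ contigs from a genome $\Pi$ is represented by a collection of substrings $C_1, C_2, \ldots, C_m$, such that, for $1 \le k \le m$, $C_k = [\Pi_{i_k}, \ldots, \Pi_{j_k}]$ or $C_k = [-\Pi_{j_k}, \ldots, -\Pi_{i_k}]$, and the intervals in $\Pi$ that correspond to any two contigs $C_{k_1}$ and $C_{k_2}$ cannot have an overlap, that is, $[i_{k_1}, j_{k_1}] \cap [i_{k_2}, j_{k_2}] = \emptyset$, for $1 \le k_1 < k_2 \le m$.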
The algorithm we have developed generates correct scaffolds in the presence of symmetric, nested, or safe inversions (as long as at least one IS is present for every IS pair); in addition it can also produce scaffolds, not necessarily 100% correct, for generic inversions.
The algorithm depends on the following assumptions in order to generate one single correct scaffold: (1) There are no duplicated conserved blocks in either of the two genomes; (2) all contigs are correctly assembled; (3) both genomes contain only one chromosome; and (4) conserved block 01 is in the first position (with + sign) or in the last position (with - sign), in some contig.
Assumption (1) means that there is no conserved block that contains a sequence that is a repeat in any of the two genomes. This is somewhat unrealistic, since all genomes in different degrees have repeats. In this sense, assumption (2) also depends on the lack of repeats. But as the tests below show, these assumptions do not seem to be major obstacles. In particular, repeats account for no more than 13% of errors in contig adjacencies in our tests. We require assumption (3) primarily for simplicity of exposition; in a real setting it is generally possible to decompose scaffold creation for separate replicons as separate problems. Assumption (4) helps simplify the algorithm description. In case block 01 does not follow assumption (4), a remapping of conserved blocks can be made in time O(n) such that the result does follow assumption (4).
The algorithm can detect lack of information, caused by absence of an IS pair in the input or by the overwriting of an earlier (in evolution) IS by a later unsafe inversion, and still generate a scaffold. In such cases the algorithm chooses a contig that has not been positioned yet, and begins generation of a new scaffold (so the result will be a set of scaffolds). When IS pairs are missing there are no guarantees that the scaffolds generated are correct.
The input for the algorithm is a list of contigs, each contig represented by the conserved blocks it contains. Note that we deal only with conserved blocks; insertions and deletions of the query genome with respect to the reference genome are not considered.
We now give a detailed example to illustrate the various concepts as well as to present the algorithm. In this example, there are 40 conserved blocks distributed in 9 contigs as follows:
1. [−23, −22, +19, +20]
2. [−15, −14, +12, +13, −11]
3. [+34, +35]
4. [+36, +37, +38, +39, +40]
5. [−03, −02, −01]
6. [+05, +06, +07, +08]
7. [+31, +32, +33, −30, −29]
8. [−28, −27, +26, −25, −24, +04]
9. [+21, −18, −17, −16, +09, +10]
Recall that the reference genome is the identity permutation. The IS pairs are the following:
1. [+03, −23] × [−04, +24]
2. [−22, +19] × [+21, −18]
3. [−16, +09] × [+15, −08]
4. [+11, −13] × [−12, +14]
5. [+25, −26] × [−26, +27]
6. [+28, −35] × [−29, +36]
7. [−34, +31] × [+33, −30]
An inspection reveals that the query genome has 7 safe inversions with respect to the reference genome. However, note that only the following ISs are present in the contigs:
1. × [−04, +24]
2. [−22, +19] × [+21, −18]
3. [−16, +09] ×
4. [+11, −13] × [−12, +14]
5. [+25, −26] × [−26, +27]
7. × [+33, −30]
Note that IS pair 6 is missing from the input contigs.
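Which members of each IS pair survive in the input can be checked mechanically. The sketch below is a hedged illustration, not the sis implementation: it assumes that an IS is written as two consecutive signed blocks and that it counts as present when some contig contains that pair, either as written or reverse-complemented. It uses eight of the example contigs; all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Signature = Tuple[int, int]

# Eight contigs from the example, as lists of signed conserved blocks.
CONTIGS: List[List[int]] = [
    [-23, -22, 19, 20],
    [-15, -14, 12, 13, -11],
    [34, 35],
    [36, 37, 38, 39, 40],
    [5, 6, 7, 8],
    [31, 32, 33, -30, -29],
    [-28, -27, 26, -25, -24, 4],
    [21, -18, -17, -16, 9, 10],
]

# The seven IS pairs of the example, in the order given in the text.
IS_PAIRS: List[Tuple[Signature, Signature]] = [
    ((3, -23), (-4, 24)),
    ((-22, 19), (21, -18)),
    ((-16, 9), (15, -8)),
    ((11, -13), (-12, 14)),
    ((25, -26), (-26, 27)),
    ((28, -35), (-29, 36)),
    ((-34, 31), (33, -30)),
]


def reverse_complement(contig: Sequence[int]) -> List[int]:
    """Reverse the block order and flip every sign."""
    return [-block for block in reversed(contig)]


def is_present(sig: Signature, contigs: Sequence[Sequence[int]]) -> bool:
    """True if the two blocks of sig are consecutive in some contig, on either strand."""
    for contig in contigs:
        for strand in (list(contig), reverse_complement(contig)):
            for k in range(len(strand) - 1):
                if (strand[k], strand[k + 1]) == sig:
                    return True
    return False


presence = [(is_present(a, CONTIGS), is_present(b, CONTIGS)) for a, b in IS_PAIRS]
missing_pairs = [n for n, (a, b) in enumerate(presence, start=1) if not (a or b)]
print(presence)
print(missing_pairs)  # [6]
```

Under these assumptions the check reproduces the list above: pairs 2, 4 and 5 have both members present, pairs 1, 3 and 7 have one member each, and pair 6 has none.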
The algorithm builds the scaffold incrementally, choosing one contig following certain rules and placing it in the scaffold immediately after the previously chosen contig. The first contig chosen is the one that contains block +01 (or -01).
Execution of the algorithm on the example
The next conserved block to be searched is +21. It is found at the beginning of contig 9, on the plus strand as expected, so this contig is placed in the third scaffold position. The next block searched is +11, which is found inverted in the last position of contig 2. This means that contig 2 will be inserted in the fourth scaffold position after it is reverse-complemented. The next block to be searched is +16, which is found on the minus strand in the middle of the already placed contig 9. The algorithm then uses the IS [−16, +09] to determine that the next block to be searched is −08. Block −08 is found on the plus strand in the last position of contig 6. The subsequent steps proceed along the same lines. The reconstructed scaffold is [−C5, −C1, +C9, −C2, −C6, −C8, −C7, +C3, +C4]. This scaffold contains errors, because the correct scaffold given by permutation Π is [−C5, −C1, +C9, −C2, −C6, −C8, −C3, +C7, +C4]. The errors arise because from the given input it is impossible to tell that contig 3 should be reverse-complemented and that contig 7 should remain as it is. This stems from the fact that the endpoints of IS pair 6 are not to be found in the input contigs. But the algorithm still managed to get 7 correct adjacencies out of 9.
In addition to the data structures described above the algorithm uses two auxiliary vectors: used indicates which contigs have been placed in the scaffold; and searched records conserved blocks that have been searched so far. Variable nextC indicates the contig with smallest ID not yet placed in the scaffold.
The main loop in lines 7–39 is executed until all contigs have been placed in the scaffold. Variable ok indicates whether the algorithm found a candidate contig to be placed in the scaffold. Line 9 ensures that the algorithm is not searching for an invalid conserved block. Lines 21–24 check whether the algorithm is searching an already searched conserved block.
If the algorithm decides to insert a new contig in the scaffold (test in line 25) lines 26–39 are executed. If no valid contig was found, the first contig not yet placed is determined in lines 26–28. Lines 29–30 place that contig as the first in a new scaffold. Vector used and variables nextC and i are updated in lines 31–34.
Recalling that n is the number of conserved blocks and m is the number of contigs, with n≥m, we now state the algorithm’s complexity. Data structure inContig can be built in O(n). Initialization takes O(n + m). The main loop is executed n + m times, since in each iteration either a new position in vector searched is marked or a new contig is placed in the scaffold. There is one operation inside the main loop that does not take constant time, which is the loop in lines 33–34. However, variable nextC can be incremented at most m times, so the amortized cost of each update is O(1). This leads to an overall complexity of O(n).
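The linear bound relies on locating any conserved block in constant time. A hedged sketch of how an index such as inContig could be built in O(n) follows. The layout used here (block number mapped to contig ID, position and sign) is an assumption made for illustration, since only the name of the structure appears in the text.

```python
from typing import Dict, List, NamedTuple


class Location(NamedTuple):
    contig_id: int  # ID of the contig that holds the block
    position: int   # 0-based index of the block inside that contig
    sign: int       # +1 or -1: strand of the block as written in the contig


def build_in_contig(contigs: Dict[int, List[int]]) -> Dict[int, Location]:
    """Index every conserved block by its absolute value in one pass over the input."""
    index: Dict[int, Location] = {}
    for contig_id, blocks in contigs.items():
        for position, block in enumerate(blocks):
            key = abs(block)
            if key in index:
                # Assumption (1) of the algorithm: no duplicated conserved blocks.
                raise ValueError(f"conserved block {key} appears more than once")
            index[key] = Location(contig_id, position, 1 if block > 0 else -1)
    return index


# Contigs 2 and 9 of the worked example.
contigs = {
    2: [-15, -14, 12, 13, -11],
    9: [21, -18, -17, -16, 9, 10],
}
in_contig = build_in_contig(contigs)
print(in_contig[21])  # Location(contig_id=9, position=0, sign=1)
print(in_contig[11])  # Location(contig_id=2, position=4, sign=-1)
print(in_contig[16])  # Location(contig_id=9, position=3, sign=-1)
```

The three lookups match the walk-through of the example: block 21 opens contig 9 on the plus strand, block 11 closes contig 2 on the minus strand, and block 16 sits inside contig 9 on the minus strand.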
We have implemented the algorithm creating program sis (Scaffolds from Inversion Signatures). We stress that sis does not require the algorithm assumptions listed above in order to be used. We used real contigs available in GenBank for testing as described below.
The main quality measure for a scaffold is the number of correct contig adjacencies. A contig adjacency for contigs x and y is correct if they are consecutive in the (completed) query genome as well as in the generated scaffold set. In the case of circular replicons, all correct scaffolds with m contigs have exactly m correct adjacencies. Even though SIS in general outputs a set of scaffolds, for the purposes of this test we consider the result to be just one scaffold, with scaffolds being placed side by side as the algorithm creates them.
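As an illustration of this measure, the sketch below counts correct adjacencies for the worked example, treating each scaffold as circular. It is a minimal sketch under stated assumptions: a scaffold is encoded as a list of signed contig IDs, and an adjacency (a, b) is taken to be the same as (−b, −a), that is, the same junction read from the opposite strand. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Adjacency = Tuple[int, int]


def canonical(a: int, b: int) -> Adjacency:
    """Pick one representative for an adjacency and its reverse complement."""
    return min((a, b), (-b, -a))


def adjacencies(scaffold: List[int], circular: bool = True) -> Set[Adjacency]:
    """Return the set of contig adjacencies in a scaffold of signed contig IDs."""
    pairs = list(zip(scaffold, scaffold[1:]))
    if circular and len(scaffold) > 1:
        pairs.append((scaffold[-1], scaffold[0]))  # close the circle
    return {canonical(a, b) for a, b in pairs}


def correct_adjacencies(generated: List[int], reference: List[int]) -> int:
    """Count adjacencies of the generated scaffold that also occur in the reference."""
    return len(adjacencies(generated) & adjacencies(reference))


# Scaffolds from the worked example: reconstructed versus correct.
generated = [-5, -1, 9, -2, -6, -8, -7, 3, 4]
reference = [-5, -1, 9, -2, -6, -8, -3, 7, 4]
print(correct_adjacencies(generated, reference))  # 7
```

The junction between contigs 3 and 7 is counted as correct in both scaffolds because [−C7, +C3] and [−C3, +C7] describe the same adjacency on opposite strands; only the junctions on either side of that pair differ.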
A second quality measure is genome coverage. In this measure we attempt to determine how much of the genome to which the contigs belong is actually covered by the scaffolds generated. The working definition we use is as follows. If both ends of a contig have correct adjacencies, then we count the entire length of that contig as contributing to the coverage. If only one end has a correct adjacency, then half of the contig length is used. If both ends have incorrect adjacencies, then the contig is not used. We define coverage as the ratio of the sum of contig lengths considered per the above rules and the sum of all contig lengths.
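The coverage rule can be written as a short function. The following sketch assumes that, for each contig, we already know whether its left and right ends have correct adjacencies; the contig names and lengths are hypothetical values chosen only to exercise the three cases.

```python
from typing import Dict, Tuple


def coverage(lengths: Dict[str, int],
             ends_correct: Dict[str, Tuple[bool, bool]]) -> float:
    """Fraction of total contig length credited under the full/half/none rule."""
    total = sum(lengths.values())
    if total == 0:
        raise ValueError("total contig length must be positive")
    counted = 0.0
    for name, length in lengths.items():
        left_ok, right_ok = ends_correct[name]
        # Two correct ends count the full length, one end half of it, none nothing.
        counted += length * (int(left_ok) + int(right_ok)) / 2
    return counted / total


# Hypothetical contigs: A has both ends correct, B one end, C neither.
lengths = {"A": 1000, "B": 2000, "C": 1000}
ends = {"A": (True, True), "B": (True, False), "C": (False, False)}
value = coverage(lengths, ends)
print(value)  # 0.5
```

Here contig A contributes 1000 bases, contig B contributes half of its 2000 bases, and contig C contributes nothing, giving 2000 of 4000 bases.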
It is important to mention that among programs tested fillScaffolds was the only one designed specifically for eukaryote sequences, whereas all of our tests are done on prokaryote sequences. FillScaffolds is a sophisticated tool that takes into account many kinds of rearrangements in addition to inversions, such as transpositions, reciprocal translocations, and chromosome fusions and fissions. We have included fillScaffolds in the present study for completeness.
One important consideration is block conservation detection. In the case of sis this was done by programs nucmer and promer, with default settings. We postprocessed the outputs using the MUMmer script delta-filter with parameter -1, which ensures that only nonrepeated blocks are processed.
Mauve Aligner uses progressiveMauve for sequence comparison. OSLay can use nucmer and that is what we chose (with default settings). r2cat does sequence comparison internally. Projector 2 can use BLAT (default) or BLAST, and we chose BLAT. CONTIGuator uses BLAST+. ABACAS can use nucmer or promer; we chose nucmer. (We observed that choosing promer made ABACAS run for more than two days without providing any scaffold.) For fillScaffolds we have used both nucmer and promer.
Chromosome Sequences Used in Tests
Aciduliprofundum boonei T469
Bacillus subtilis 168
Bifidobacterium longum DJO10A
Brucella melitensis bv 1 16M
Brucella melitensis bv 1 16M
Brucella pinnipedialis B2 94
Brucella pinnipedialis B2 94
Burkholderia thailandensis E264
Burkholderia thailandensis E264
Chlamydia muridarum Nigg
Clostridium cellulovorans 743B
Corynebacterium aurimucosum ATCC 700975
Corynebacterium efficiens YS 314
Micrococcus luteus NCTC 2665
Mycobacterium tuberculosis H37Ra
Mycoplasma genitalium G37
Saccharopolyspora erythraea NRRL 2338
Selenomonas sputigena ATCC 35185
Stigmatella aurantiaca DW4 3 1
Streptococcus pneumoniae TIGR4
Yersinia pestis Nepal516
For each of the query chromosomes we obtained a list of the 20 closest genomes (excluding the query genome itself), among 1331 complete prokaryotic genomes available at GenBank on May 5, 2011. We used NUCMi (a variation of MUMi) to compute the distance between each of the 1331 genomes and the query genome. Our rationale for selecting the 20 closest genomes instead of only the closest was as follows. In practice, the choice of a reference genome should be guided by phylogenetic distance. The closest genome to the draft genome is the one most likely to yield the best results. On the other hand, given the extremely small sample of the prokaryotic world that has been sequenced to date, it is perfectly possible that new strains will be sequenced in the future for which the closest reference genome will not be particularly close. Therefore it is important to understand how the performance of a scaffolding program changes with increased distance between the query and possible reference genomes.
We ran each scaffold program on each of the 23 query chromosomes using as reference each of the 20 closest chromosomes to the query chromosome. This resulted in 460 scaffold sets for each program. For a given program and a given set of reference chromosomes, we computed average results over the 23 query chromosomes.
Average Performance (Correct Adjacencies)
Best Scaffolds (Correct Adjacencies)
Average Performance (Coverage)
Best Scaffolds (Coverage)
It is important to note that the pairs of query-reference genomes tested were not selected for containing inversions, so any given query-reference pair could have any kind of rearrangement. The success of sis in the cases tested suggests that inversions are indeed widespread in prokaryotic genomes (to the extent that genomes in Table 2 are a representative sample).
Here we briefly describe additional results for which tables and graphs can be found in Additional file 1.
Another way of parsing results is to ask which program yields the best results for each query chromosome sequence, in terms of the average number of correct adjacencies. sis with promer and sis with nucmer are the best programs under this measure.
Another question to ask is what influence the number of contigs has on a program’s performance. We have found that all programs obtain their best results for chromosome sequences that have few contigs (between 4 and 25). As for worst results, the outcome was less clear, since it is not always the case that the largest number of contigs results in the poorest performance. We also determined that fillScaffolds and r2cat seem to be less sensitive to variation in the number of contigs in the input than other programs. These results need to be taken with caution, because the bins we used had a relatively small number of instances in each (between five and seven).
Finally, we investigated the influence of duplications on the performance of SIS in terms of correct adjacencies. We found that on average no more than 13% of incorrect adjacencies are due to duplications.
Running times of the tests are dominated by the sequence comparison step. For example, in the case of sis, nucmer takes on average 1 minute for a genome pair, and promer takes on average 5 minutes. sis itself takes less than a second to compute a scaffold set. OSLay takes about 3 seconds. After obtaining the conserved blocks, fillScaffolds takes about 3 seconds. r2cat takes between 15 and 20 seconds to determine conserved blocks and about 2 seconds to generate a scaffold. ABACAS takes between 10 and 30 seconds total time. CONTIGuator takes about 2 minutes to determine conserved blocks using BLAST+ and to generate a scaffold. Mauve Aligner takes between 8 and 322 minutes per genome pair (on average, 46 minutes). All these tests were executed on the same machine, a standard desktop computer with Intel Core2 Duo processor, 3.0 GHz and 4 GB RAM. Projector 2 can only be run on a web server. We observed that the total computation time on the server was about 1 minute, when the server appeared to have a light load.
Details of these tests and other tests
In Additional file 1 we present more detailed results of the tests above as well as results using pairs of real genomes and simulated contigs. sis also came out as the best program in these other tests. These results are a refinement of preliminary tests presented in the earlier conference paper on which this work is based.
Availability of SIS
SIS is available as a web server at http://marte.ic.unicamp.br:8747. It can also be freely downloaded as a stand-alone program from the same website. SIS generates its scaffolds both as ordered lists of contigs and in nucleotide sequence FASTA format.
We have presented a new linear-time algorithm for generating scaffolds for draft genomes, based on the concept of inversion signatures. We implemented this algorithm creating program sis, which is to our knowledge the first scaffold program that explicitly models the biological phenomenon of replicon inversion in prokaryotes. We compared sis to seven other programs, and demonstrated that sis achieves better performance relative to these other programs using a real and diverse suite of test cases.
In a real world setting, the scaffolds generated by sis can help in the gap closure process. For example, sis output can easily be used as input to primer generation programs such as Primer3. If paired-end read data is available, it can be used to check the scaffolds provided and to connect separate scaffolds, thus generating more complete and reliable contig orderings. Ideally, sis should be able to use paired-end read information, but this will require a substantial change to the algorithm presented here, as evidenced by the sophistication of the recent algorithm for paired-end read scaffolding presented by Gao et al. Another possible improvement in the scaffold generating process is to use several reference genomes instead of only one, under the rationale that some inversions may be missed by using reference genome X but may be detected using some other reference genome Y. An idea similar to this is described for the program PGA4genomics.
This work was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (Brazil) [PhD fellowship 2007/05574-4 to UD]; and by Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazil) [200815/2010-5, 483177/2009-1, 473867/2010-9 to ZD].
This work is based on the earlier work: Using Inversion Signatures to Generate Draft Genome Sequence Scaffolds, in Proceedings of the 2nd ACM International Conference on Bioinformatics, Computational Biology and Biomedicine. ACM New York, NY, USA Ⓒ2011. ACM ISBN: 978-1-4503-0796-3.
1. Gao S, Sung WK, Nagarajan N: Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 2011, 18(11):1681–1691. 10.1089/cmb.2011.0170
2. Warren RL, Varabei D, Platt D, Huang X, et al.: Physical map-assisted whole-genome shotgun sequence assemblies. Genome Res 2006, 16: 768–775. 10.1101/gr.5090606
3. Nagarajan N, Read TD, Pop M: Scaffolding and validation of bacterial genome assemblies using optical restriction maps. Bioinformatics 2008, 24: 1229–1235. 10.1093/bioinformatics/btn102
4. Valouev A, Zhang Y, Schwartz DC, Waterman MS: Refinement of optical map assemblies. Bioinformatics 2006, 22: 1217–1224. 10.1093/bioinformatics/btl063
5. Assefa S, Keane TM, Otto TD, Newbold C, Berriman M: ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics 2009, 25: 1968–1969. 10.1093/bioinformatics/btp347
6. Munoz A, Zheng C, Zhu Q, Albert VA, Rounsley S, Sankoff D: Scaffold filling, contig fusion and comparative gene order inference. BMC Bioinf 2010, 11: 304. 10.1186/1471-2105-11-304
7. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT: Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 2009, 25: 2071–2073. 10.1093/bioinformatics/btp356
8. Richter DC, Schuster SC, Huson DH: OSLay: optimal syntenic layout of unfinished assemblies. Bioinformatics 2007, 23: 1573–1579. 10.1093/bioinformatics/btm153
9. van Hijum, Zomer AL, Kuipers OP, Kok J: Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. Nucleic Acids Res 2005, 33: W560–566. 10.1093/nar/gki356
10. Husemann P, Stoye J: r2cat: synteny plots and comparative assembly. Bioinformatics 2010, 26: 570–571. 10.1093/bioinformatics/btp690
11. Zhao F, Hou H, Bao Q, Wu J: PGA4genomics for comparative genome assembly based on genetic algorithm optimization. Genomics 2009, 94: 284–286. 10.1016/j.ygeno.2009.06.006
12. Galardini M, Biondi EG, Bazzicalupo M, Mengoni A: CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code Biol Med 2011, 6: 11.
13. Darling AE, Miklós I, Ragan MA: Dynamics of genome rearrangement in bacterial populations. PLoS Genet 2008, 4(7):e1000128. 10.1371/journal.pgen.1000128
14. Eisen JA, Heidelberg JF, White O, Salzberg SL: Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol 2000, 1(6):research0011.1–0011.9. 10.1186/gb-2000-1-6-research0011
15. Darling AE, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 2004, 14: 1394–1403. 10.1101/gr.2289704
16. Darling AE, Mau B, Blattner FR, Perna NT: GRIL: genome rearrangement and inversion locator. Bioinformatics 2004, 20: 122–124. 10.1093/bioinformatics/btg378
17. Swenson KM, Moret BM: Inversion-based genomic signatures. BMC Bioinformatics 2009, 10 Suppl 1: S7.
18. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12
19. Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res 2002, 12: 656–664.
20. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
21. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinf 2009, 10: 421. 10.1186/1471-2105-10-421
22. Dias U, Dias Z, Setubal JC: Two new whole-genome distance measures. In Proceedings of the 6th Brazilian Symposium on Bioinformatics (BSB’2011). 2011:61–64.
23. Deloger M, El Karoui, Petit MA: A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 2009, 191: 91–99. 10.1128/JB.01202-08
24. Dias Z, Dias U, Setubal JC: Using Inversion Signatures to Generate Draft Genome Sequence Scaffolds. In Proceedings of the 2nd ACM International Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB 2011). 2011:39–48.
25. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods in Molecular Biology 2000, 132: 365–386.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.