- Open Access
MISHIMA - a new method for high speed multiple alignment of nucleotide sequences of bacterial genome scale data
© Kryukov and Saitou; licensee BioMed Central Ltd. 2010
- Received: 17 July 2009
- Accepted: 18 March 2010
- Published: 18 March 2010
Large nucleotide sequence datasets are becoming increasingly common objects of comparison. Complete bacterial genomes are reported almost everyday. This creates challenges for developing new multiple sequence alignment methods. Conventional multiple alignment methods are based on pairwise alignment and/or progressive alignment techniques. These approaches have performance problems when the number of sequences is large and when dealing with genome scale sequences.
We present a new method of multiple sequence alignment, called MISHIMA (Method for Inferring Sequence History In terms of Multiple Alignment), that does not depend on pairwise sequence comparison. A new algorithm is used to quickly find rare oligonucleotide sequences shared by all sequences. Divide and conquer approach is then applied to break the sequences into fragments that can be aligned independently by an external alignment program. These partial alignments are assembled together to form a complete alignment of the original sequences.
MISHIMA provides improved performance compared to the commonly used multiple alignment methods. As an example, six complete genome sequences of bacteria species Helicobacter pylori (about 1.7 Mb each) were successfully aligned in about 6 hours using a single PC.
- Alignment Score
- Alignment Quality
- Pairwise Sequence Comparison
- Complete Alignment
- Partial Alignment
Alignment of vast number of nucleotide sequences is one of the central problems in current genomic studies. Sequence alignment is used in many kinds of biological studies, including phylogenetic analysis, identification of conserved domains, prediction of protein structure and database searches.
Alignment of multiple sequences remains a considerable challenge in terms of computation time required to complete an alignment of a reasonably large sequence dataset. Even pairwise sequence alignment becomes hard when the sequences are very long. Most alignment algorithms use some sorts of heuristics to reduce the computation time. Every alignment method was designed with some expectations on the size of data to be analyzed and what sort of computer hardware is likely to be used. The progress in alignment techniques during the last 30 years would not be possible without the parallel constant improvement in computer technology. The alignment methods being developed these days work better than those of 20 ago partly just because they are designed to run on faster processors and use more memory.
The most simple and widely used such heuristic is the method of progressive alignment , which was employed in a number of alignment programs such as MULTALIGN , MULTAL , MultAlin , CLUSTAL W [5, 6], and MUSCLE . Under the idea of progressive alignment, first the sequences are compared pairwise, one to one. Each pair of sequences is aligned with the dynamic programming algorithm, and the evolutionary distance between the sequences is estimated. Based on those distances the phylogenetic tree is constructed and then used for building multiple alignment.
An alternative approach to solve the problem of multiple sequence alignment is the iterative procedure of refining the global alignment, for example in PRRP program . Also a genetic algorithm for multiple sequence alignment was proposed . MAFFT package  contains both progressive and iterative alignment methods, depending on a novel approach of finding homologous regions with the help of fast Fourier transform.
Another direction focuses on increasing the performance of traditional dynamic programming based methods. This includes Divide-and-Conquer techniques, implemented in DCA  and OMA . There are also studies attempting to take advantage of the parallel computation, especially popular for progressive alignment techniques like CLUSTAL W [13–15].
Recently it became popular to locate regions of local conservations before attempting to construct the global alignment. This idea was used for pairwise alignment in MUMmer , Shuffle-LAGAN , Avid  and GS-Aligner . For multiple sequences a similar technique was used in DIALIGN , CHAOS , and MAFFT .
While there is considerable progress in recent multiple alignment research, it still remains a challenging problem to align multiple genomic sequences. Therefore we investigated an non-conventional heuristic approach to multiple sequence analysis, which we present in this work. A prototype of MISHIMA was briefly introduced by Kryukov and Saitou .
Extraction of important information
It is believed that only a certain part of a genome is functional, and most of the genome content is never expressed or used. As a result some parts of the genome remain mostly unchanged during the evolution, while other parts quickly accumulate mutations through neutral evolution . Conserved parts of the genome can be very helpful in identifying homology. If we could quickly locate conserved areas in the genome, that information obtained could be used to assist the process of alignment.
This idea has already been implemented in other methods, in which homology was first detected by pairwise sequence comparison and the obtained information was then used to build the alignment (DIALIGN , MLAGAN ). However for large genomic sequences the pairwise comparison step is very time consuming. Also pairwise comparison does not reveal features shared by multiple sequences. On the other hand, it is often the case that several sequences share some regions of high similarity. We propose a new heuristic method for locating conserved patterns shared by multiple sequences that can be directly used in a sequence alignment procedure.
Rearrangements of DNA fragments are known to happen very rarely in the evolutionary process, once sequences started to evolve independently in different species. Majority of the mutations result from substitutions or insertion/deletion events. Therefore in most cases it is a valid approach to use divide-and-conquer procedure for sequence alignment. Its principle is that sequences can be split into parts that will be aligned independently and a complete alignment will finally be constructed by assembling the partial alignments together [11, 25].
The main difficulty in this method is to find the splitting points. The DCA program uses a pairwise sequence comparison with a dynamic programming-based method to find the splitting positions . However the cost of pairwise comparison increases rapidly with the increase of sequence length and/or number of sequences. Below we propose a more rapid method to find the splitting positions.
Once the sequences are divided into sufficiently short segments, these segments can be aligned by calling a conventional alignment program. DCA uses MSA alignment program [26, 27] to align the parts of sequences. In this study we use CLUSTAL W  and MAFFT  to align the sequence segments separately.
CLUSTAL W is not suitable for aligning long sequences or large number of sequences because of high computational requirements. MAFFT can be very fast with small datasets, but its computation time increases rapidly when aligning larger datasets, like complete bacterial genomes. Application of these programs can be extended to much larger datasets by combining with the Divide-and-Conquer approach.
It should be noted that such Divide-and-Conquer approach is not able to detect duplications, inversions and genomic rearrangements. This approach is valid under the assumption that all sequences can be aligned in a linear fashion to each other to form a multiple alignment.
Using k-tuple frequencies
The core idea of MISHIMA is to analyze k-tuples found in the original sequences and to evaluate them based on their frequencies. When a particular nucleotide k-tuple is found exactly once in each sequence, we consider it as a likely homology signal. A k-tuple that has its close variants found once in each sequence is considered to be less likely, but still possible homology signal. Analyzing all k-tuples up to certain length allows us to select those k-tuples that represent the most probable homology shared by multiple sequences. These k-tuples can then be used to anchor the sequences before employing Divide and Conquer method to complete the alignment.
Find potentially useful k-tuples based on the number of their occurrences in the sequence data.
Analyze the potentially useful k-tuples, and select those that represent most probable local homology.
Use the selected k-tuples as anchors, split the sequences into segments.
Align the segments independently from each other.
Join partial alignments to complete the final multiple sequence alignment.
Dictionary of k-tuples
As a first step we count the number of occurrences of each short k-tuple in the original sequence dataset. We keep this information in a dictionary structure, indexed by a k-tuple sequence. The number of possible nucleotide k-tuples is 4K, so the maximum k we can use is limited by the amount of memory we can use for the dictionary. We store k-tuple frequencies as 32-bit numbers, so 4*4K bytes is required to store the frequencies of all nucleotide k-tuples. This allowed us to use k of 13 and 14 on 32-bit machines.
We found that knowing the number of occurrences of each k-tuple in the original sequence dataset is not enough to efficiently decide which k-tuples are more likely to represent the local homology. Therefore we decided to also store the number of sequences exhibiting each k-tuple. Additionally we store the index of the last sequence where the k-tuple was found, which allows us to collect all the frequencies using single read through the sequence dataset.
Choosing the right lengths of a k-tuple
Length k is a very important factor affecting the chance for a k-tuple to be useful for alignment purposes. A random sequence of length L is expected to contain L/4K occurrences of each k-tuple. Let us take L as A*N, where N is the number of sequences and A is the average sequence length. The expected number of occurrences for one k-tuple is A*N/4K. Since we are mostly interested in finding k-tuples with exactly N occurrences (once in each sequence), an ideal case for this method would be where A*N/4K = N, which means A = 4K. Thus, to have the best results with this method, the length of sequences to be analyzed should be comparable to, or shorter than 4K.
This is an intuitive estimation, to give some feeling of relation between the dataset size and k-tuple's k. Of course the real biological sequences, especially those that we are trying to align, are often not completely random, which suggests that longer sequences can be aligned if they contain significant homology. Therefore the capability of MISHIMA to align the real sequences should be higher than this estimate. But non-conserved intergenic regions in a genomic sequences may well be considered as random sequence, and the total size of such regions in a sequence data is a limiting factor for MISHIMA. This shows that the length k is the basic parameter affecting the performance of this method.
Choosing the optimal length of a k-tuple is critical for the successful alignment. If k is too small, too many occurrences of each k-tuple will be found. Although to some extent this depends on the length of the sequences, it is difficult to predict whether the particular k is too small or not. Therefore we use all k-tuples with k from 1 to M in this method. For selecting M - the upper limit of k - we suggest to depend on the memory limitation. Most of the steps of the MISHIMA algorithm are performed in linear time, which makes it practical to use the largest M allowed by the amount of available memory.
We ended up using 12 bytes for each k-tuple in a dictionary. The total amount of memory required for dictionary is therefore 12*(4+42+...+4M) bytes. This allows us to use M of 12, with the dictionary occupying less than 300 MB of RAM.
Our initial method of using identical k-tuples shared by all sequences has a low sensitivity, and we decided to use inexact matching instead. We do this by updating the dictionary to include the number of inexact copies of a k-tuple in the sequence dataset. Our method allows up to one substitution difference between two k-tuples.
After a k-tuple dictionary is constructed and updated with inexact match counts, it becomes possible to select k-tuples which represent potential homology. Potentially useful k-tuples are defined as those that have one inexact match in every sequence. Such k-tuples, or seeds, are extracted and saved for further analysis.
The next step is to find the locations of seeds in sequence data. Locations are extracted in a second read through the sequence data. It should be noted that this step also completes in linear time.
In cases where no seeds could be found, the external aligner is used to align the whole dataset.
Now we will introduce measure for compatibility of two seeds, based on their relative coordinates in sequence data. For now we will consider only seeds that are found not more than once in every sequence. If both of two seeds, A and B, can be found in a sequence, there are two possible orders: either the coordinate of A is smaller than that of B, or vice versa. We can mark these two cases as A->B and B->A.
If all sequences in a dataset exhibit the same order of seeds A and B, these two seeds are compatible. However, if sequences have these seeds some in A->B order while some other in B->A order, we can then define an incompatibility distance as number of sequences, where A and B order is different from the majority case. A distance defined in this way can be effectively used to evaluate the possibility that two seeds together represent a possible homology signal. This distance is used to construct a maximum non-conflicting set of seeds in the following procedure.
A constant number of T best seeds is selected (T is 600 in current MISHIMA, because of memory limitations). Then the matrix of size TxT is constructed. Each cell of the matrix contains the incompatibility distance between two seeds. In the first step all of the selected T seeds are included and the matrix contains TxT distance values. Then the matrix is analyzed to find the sum of all numbers for every row. This sum corresponds to the total amount of incompatibility introduced by one seed. Now the seed which introduces the largest amount of incompatibility is removed from the set. The matrix is recalculated and the procedure iterates until no incompatibility is left unresolved. Finally only compatible seeds remain, which are used as alignment anchors. If all seeds are mutually incompatible with each other, there will be still one anchor left after removing the incompatibility. This anchor is used to divide the sequences into 2 parts, which are aligned separately.
The number of anchors found in all our example datasets is included in Additional file 1, in *.anchors files.
Iterative procedure of dictionary analysis
After the dictionary step has finished and a set of non-conflicting seeds is selected, the algorithm will have a set of anchors, connecting the sequences. The sequences are then divided and regions between the anchors are aligned independently from each other. After all partial alignments are complete, they are concatenated to construct a final complete alignment.
Alignment of the regions between the seeds is performed using an external aligner provided the sequences are short enough. Otherwise MISHIMA is used again (recursively) to divide the sequences into shorter parts. The depth of the recursive operation in MISHIMA can be set with the command line option '-max-depth = x'. Since MISHIMA takes a constant time for each step of finding seeds and dividing the sequences, using a limited depth may improve the performance on some datasets.
Assessing the alignment speed and quality
The performance of MISHIMA was evaluated on several example dataset, detailed in the Results section. We used several popular aligners for comparison: CLUSTAL W, MUSCLE, MAFFT and MLAGAN. The details, including version and parameters we used with these methods, are included in Additional file 1.
For assessing the alignment quality we used the sum-of-pairs alignment score , based on the pairwise sequence identity. The score is computed as arithmetic mean of pairwise identity scores of all sequence pairs in the alignment. Pairwise identity score is computed as following: Sij = 100*Mij/L, where Mij is the number of alignment positions where sequences i and j have the same nucleotide (not including gaps), and L is the length of complete multiple alignment (including gaps).
While a simulated dataset is necessary for a serious discussion of alignment quality, we believe that using the sum-of-pairs score is sufficient for evaluating MISHIMA's performance, for the following reasons: 1. Our main focus is on comparing MISHIMA+external aligner combination to external aligner alone. 2. We are mainly interested in verifying whether MISHIMA would not produce particularly big misalignments; the simplified score is sufficient for this purpose. 3. MISHIMA is targeting rather closely related sequences, which means that the correct alignments produced by almost any method will have about the same score, and misalignments will result in a noticeable score decrease.
The sum-of-pairs identity score was computed for all example alignments produced by MISHIMA and other methods.
Human mtDNA genomes
Complete human mitochondrial DNA (mtDNA) genome sequences were used as an example of very closely related sequences. As of July 16, 2009, a total of 4,718 complete human mtDNA genome sequences are available in DDBJ/EMBL/GenBank International Nucleotide Sequence Database. MISHIMA was able to successfully align hundreds of such genomes. Computation results can be seen in Additional file 1.
Alignment time comparison of various numbers of human mtDNA genomes (complete sequences)
No. of sequences compared
MISHIMA + MAFFT
MISHIMA + CLUSTAL W -max-depth = 1
CLUSTAL W is the slowest among the compared aligners, even with [-quicktree] option, followed by MUSCLE (Fig. 2). Fig. 2 shows that computation time of MISHIMA and MAFFT increases much slower than that of CLUSTAL W and MUSCLE. CLUSTAL W took 90-100 times longer time for complete mtDNA sequence alignment than it took aligning 2 kb fragments for all four cases. MISHIMA alignment time increased only 10 times between 2 kb and full length datasets. MUSCLE was unable to complete the alignment of 200 and 400 complete sequences. We used two kinds of option sets for MAFFT; [--retree 1 --maxiterate 0] (shown in orange) and [--retree 2 --maxiterate 1000] (shown in blue). Both option sets showed much faster results than either MUSCLE or CLUSTAL W. MISHIMA is as fast as MAFFT set at faster option (shown in orange) either when CLUSTAL W was used (shown in light blue) or when MAFFT was used (shown in red) as the external alignment program when 50, 100, and 200 human mtDNA sequences were compared.
When 400 human mtDNA sequences are compared, MAFFT set at faster option is clearly faster than MISHIMA. However, if we add [-max-depth = 1] option to MISHIMA (shown in light blue), both aligners finish in similar time. Because human mtDNA sequences are quite similar with each other, good seeds are found in abundance in the first iteration of dictionary analysis, and average block length (length of sequences between two adjacent anchor seeds) is already short enough for CLUSTAL W usage. Therefore, we recommend using this option when aligning closely related sequences, such as sequences from same species.
We also tried MLAGAN  on these datasets, but it did not work well, either producing corrupted alignment, or taking very long time (even compared to CLUSTAL W). Therefore MLAGAN results are not included in the figures.
Mammalian mtDNA genomes
MISHIMA is powerful for aligning closely related nucleotide sequences, as shown in the above section. If more divergent sequences are compared, how does MISHIMA perform? We thus retrieved complete mtDNA genomes from various order of mammalian species. Average nucleotide difference for these datasets are within the range of 0.25-0.27.
Alignment time comparison of various numbers of mammalian mtDNA genomes (complete sequences)
No. of sequences compared
MISHIMA + MAFFT
If we compare computation times shown in Table 1 and Table 2, those of MISHIMA are certainly much slower for different mammalian species than for human individuals; 23 times and 13 times more for 50 and 100 sequences, respectively. This is because the number of good seeds for anchoring is reduced as sequence divergence increases. Computation results can be found in Additional file 1.
Complete genomes of four different strains of bacteria
We are now moving from mtDNA genome sequences to much larger bacterial genomes. We first chose 4 strains of Streptococcus pyogenes as example. Each genome is about 2 MB, so the total dataset size is about 8 MB. CLUSTAL W could not produce the alignment even after 15 days, and computation was aborted. In good contrast, MISHIMA+CLUSTAL W could successfully produce the multiple alignment of these four genomes in less than three hours. In the course of alignment the dataset was separated into 485 segments, divided by 484 seeds. The average length of one segment was about 4 kb, which is much easier to align than complete sequences.
Alignment time of 4 complete genomes of Streptococcus pyogenes with various settings of MISHIMA
MISHIMA -max-depth = 1 -aligner = clustalw
MISHIMA -max-depth = 2 -aligner = clustalw
MISHIMA -max-depth = 3 -aligner = clustalw
MISHIMA (unlimited depth) -aligner = clustalw
MISHIMA -max-depth = 1 -align-seeds-only
In any case, it is clear that MISHIMA can be used to align multiple complete bacterial genomes, as long as they are closely related. The alignment of four strains of Streptococcus pyogenes revealed several large insertions and deletions, some more than 10 kb long. If one is interested in quickly finding such large deletions, we recommend use of [-align-seeds-only] option, so that the external aligner is not activated. In this way only seed anchoring is conducted, which is much faster (only 27 seconds in S. pyogenes example) than producing full multiple alignment.
Complete genomes of six different strains of bacteria
MISHIMA+MAFFT combination performed well on this dataset. Adding options [-max-depth = 1] and [-max-depth = 2] allowed to save time and improve the alignment score with this dataset.
Interestingly, MISHIMA alignment time starts to decrease when the sequences are getting closer to complete genome. The reason is that more complete sequences allow for the construction of a set of anchors that leaves smaller unaligned sequence block in the end. This effect is particularly apparent with 14 sequences (Fig. 8).
At present, we were not able to compare the time and score of MISHIMA+MAFFT with just MAFFT, however from previous examples we have seen that MISHIMA+MAFFT alignment is similar in quality to MAFFT alignment. Thus by using MISHIMA we can produce MAFFT-quality alignment even for large datasets where we can't use MAFFT alone.
This example shows that MISHIMA is effective on large datasets, including those that are very difficult to align with other programs.
Alignment time of 20 kb random sequences
Number of sequences
We therefore recommend to first try using MISHIMA with [-align-seeds-only -max-depth = 1] options on new data. By this way you will be able to quickly know if there are any problems with the sequences. If MISHIMA was able to find seeds and produce a rough alignment, then the sequences are fine and normal MISHIMA alignment procedure can be used. In fact, computation times using this option set for human complete mtDNA genomes are 00:00:18, 00:01:15, and 00:14:38 for 50, 100, and 200 sequences, respectively. When [-max-depth = 1] option is also activated, computation times for 200 and 400 sequences are 00:00:51 and 00:02:27, respectively. These are much shorter than MISHIMA default for producing full multiple alignments.
We developed a heuristic method for multiple sequence alignment that can greatly speed up alignment of large datasets - those consisting of hundreds of sequences, as well as those with very long sequences, such as complete bacteria genomes.
MISHIMA depends on either MAFFT (default option), or CLUSTAL W (with [-aligner = clustalw]) for aligning the areas between the "seeds" to produce the complete alignment. However even without other programs MISHIMA can be used to quickly produce a rough alignment where only seeds are aligned. This can be very useful for quick evaluation of a new dataset before applying a more thorough alignment procedure.
Given the rapidly increasing number of sequenced organisms, there is a growing interest in comparing closely related species. MISHIMA is particularly helpful for such analysis due to its ability to quickly locate homology signals.
Project name: MISHIMA
Project home page: http://esper.lab.nig.ac.jp/study/mishima/
Operating system: Windows
Programming language: C (not open source).
Other requirements: For using with CLUSTAL W, CLUSTAL W version 2.0.10 or later should be installed. For using with MAFFT, MAFFT version 6.707b or later should be installed.
License: Binary is free for any use.
Any restrictions to use by non-academics: None.
This work was supported by grants from Ministry of Education, Culture, Sports, Science and Technology. KK was additionally supported by Human Genome Network Project led by Drs. Takashi Gojobori and Kazuho Ikeo.
- Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25(4):351–360. 10.1007/BF02603120View ArticlePubMedGoogle Scholar
- Barton GJ, Sternberg MJE: A Strategy for the Rapid Multiple Alignment of Protein Sequences. Confidence Levels from Tetriary Structure Compositions. J Mol Biol 1987, 198: 327–337. 10.1016/0022-2836(87)90316-0View ArticlePubMedGoogle Scholar
- Taylor WR: A Flexible Method to Align Large Number of Biological Sequences. J Mol Evol 1988, 28(1–2):161–169. 10.1007/BF02143508View ArticlePubMedGoogle Scholar
- Corpet F: Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research 1988, 16(22):10881–10890. 10.1093/nar/16.22.10881View ArticlePubMedPubMed CentralGoogle Scholar
- Higgins DG, Sharp PM: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988, 73: 237–244. 10.1016/0378-1119(88)90330-7View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22: 4673–4680. 10.1093/nar/22.22.4673View ArticlePubMedPubMed CentralGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 2004, 32(5):1792–1797. 10.1093/nar/gkh340View ArticlePubMedPubMed CentralGoogle Scholar
- Gotoh O: Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments. J Mol Biol 1996, 264: 823–838. 10.1006/jmbi.1996.0679View ArticlePubMedGoogle Scholar
- Notredame C, Higgins DG: SAGA: Sequence Alignment by Genetic Algorithm. Nucleic Acids Research 1996, 24: 1515–1524. 10.1093/nar/24.8.1515View ArticlePubMedPubMed CentralGoogle Scholar
- Katoh K, Misawa K, Kuma KI, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 2002, 30(14):3059–3066. 10.1093/nar/gkf436View ArticlePubMedPubMed CentralGoogle Scholar
- Tonges U, Perrey SW, Stoye J, Dress AWM: A General Method for Fast Multiple Sequence Alignment. Gene 1996, 172(1):GC33-GC41. 10.1016/0378-1119(96)00123-0View ArticlePubMedGoogle Scholar
- Reinert K, Stoye J, Will T: An iterative method for faster sum-of-pairs multiple sequence alignment. Bioinformatics 2000, 16(9):808–814. 10.1093/bioinformatics/16.9.808View ArticlePubMedGoogle Scholar
- Mikhailov D, Cofer H, Gompets R: Performance optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL. SGI White Paper 2001.Google Scholar
- Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon PJ: Parallel CLUSTAL W for PC Clusters. Proceedings of the International Conference on Computational Science and Its Applications: Montreal; Springer Lecture Notes in Computer Science 2003, 2667(Part II):300–309.Google Scholar
- Li KB: ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 2003, 19(12):1585–1586. 10.1093/bioinformatics/btg192View ArticlePubMedGoogle Scholar
- Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL: Alignment of whole genomes. Nucleic Acids Research 1999, 27(11):2369–2376. 10.1093/nar/27.11.2369View ArticlePubMedPubMed CentralGoogle Scholar
- Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics 2003, 19(Suppl 1):i54-i62. 10.1093/bioinformatics/btg1005View ArticlePubMedGoogle Scholar
- Bray N, Dubchak I, Pachter L: AVID: A Global Alignment Program. Genome Research 2003, 13: 97–102. 10.1101/gr.789803View ArticlePubMedPubMed CentralGoogle Scholar
- Shih ACC, Li WH: GS-Aligner: A Novel Tool for Aligning Genomic Sequences Using Bit-Level Operations. Mol Biol Evol 2003, 20(8):1299–1309. 10.1093/molbev/msg139View ArticlePubMedGoogle Scholar
- Morgenstern B, Frech K, Dress A, Werner T: DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 1998, 14(3):290–294. 10.1093/bioinformatics/14.3.290View ArticlePubMedGoogle Scholar
- Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 2003, 4: 66. 10.1186/1471-2105-4-66View ArticlePubMedPubMed CentralGoogle Scholar
- Kryukov K, Saitou N: MISHIMA: a Method for Identifying Sequence History In terms of Multiple Alignment. In proceedings of the RECOMB 2005; Boston 2005, 177–178.Google Scholar
- Kimura M: The Neutral Theory of Molecular Evolution. Cambridge, Cambridge University Press; 1983.View ArticleGoogle Scholar
- Brudno M, Do C, Cooper G, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 2003, 13(4):721–731. 10.1101/gr.926603View ArticlePubMedPubMed CentralGoogle Scholar
- Stoye J, Moulton V, Dress AWM: DCA: An Efficient Implementation of the Divide-and-Conquer Multiple Sequence Alignment Algorithm. CABIOS 1997, 13(6):625–626.PubMedGoogle Scholar
- Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proc Natl Acad Sci USA 1989, 86(12):4412–4415. 10.1073/pnas.86.12.4412View ArticlePubMedPubMed CentralGoogle Scholar
- MISHIMA Homepage[http://esper.lab.nig.ac.jp/study/mishima/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.