- Research article
- Open Access
Optimizing substitution matrix choice and gap parameters for sequence alignment
- Robert C Edgar^{1}Email author
https://doi.org/10.1186/1471-2105-10-396
© Edgar; licensee BioMed Central Ltd. 2009
- Received: 10 August 2009
- Accepted: 2 December 2009
- Published: 2 December 2009
Abstract
Background
While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.
Results
POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.
Conclusion
The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data is freely available from the author's web site at http://www.drive5.com/pop.
Keywords
- Global Alignment
- Substitution Matrix
- Alignment Accuracy
- Reference Alignment
- Substitution Matrice
Background
Sequence alignment is a fundamental tool in contemporary biology. Most algorithmic formulations of the problem seek an alignment maximizing a function known as the objective score. Objective scores are usually defined as a sum of terms for matching pairs of letters (tabulated in a substitution matrix) and penalties for gaps. While the effects of different substitution matrices and gap parameters have been extensively studied in the context of local alignment and homolog recognition (see for example [1] and references therein), their effects on alignment accuracy, especially for global alignment, are not so well understood. Several heuristics are in common use, for example CLUSTALW's choice of low-identity matrices for aligning low-identity sequences [2], which have not to the best of my knowledge been empirically tested. One factor impeding such testing is the lack of effective automated methods for optimizing parameters for a given objective function. Previous work in this area has included unsupervised expectation maximization [3], discriminative training [4], inverse parametric alignment [5–9] and maximum-margin structured learning [10, 11]. Katoh et al. [12] reported using golden section search to optimize gap penalties, but this generally assumes a unimodal function which does not hold in this case. Exhaustive search has been attempted several times, for examples see [1, 13].
In this work I describe a new parameter optimization procedure, POP, compare it with the IPA method of [6], and use it to investigate a number of questions related to global protein alignment accuracy, including: which is the best substitution matrix, do the best choices of matrix and gap penalties vary with sequence identity, should terminal gaps be penalized differently from internal gaps, and does the loss of precision due to integer rounding in substitution matrices degrade alignment accuracy? IPA was chosen for comparison because it is a recently published method, code was readily available, and because it optimizes parameters exactly equivalent to those studied in this work. To the best of my knowledge there has been no prior comparative study of alternative gap optimization methods, so IPA's effectiveness relative to other approaches is not known. In the tests performed here, POP was shown to achieve from 0.2% to 1.3% higher training-set accuracy than IPA, and evidence is presented that POP finds parameters within about 0.1% of the true optimum, i.e. an improvement in training set error of up to an order of magnitude. Differences of a fraction of a percent will be shown to be informative in assessing differences between substitution matrices, for which a method with the improved precision of POP is required. POP, like IPA, is designed to maximize training set accuracy, which may be achieved at some cost in generalization error, i.e. worse performance on novel data. In the case of POP this is by design: the problem of minimizing training set error is simpler than minimizing generalization error and the experiments reported here were not designed to compute the most biologically appropriate parameters for a given algorithm but rather to compare models with small and equal numbers of free parameters. It is then reasonable to assume that training set accuracy is a good measure of the relative performance of those models.
Methods
IPA
Kim and Kececioglu [6] described an inverse parametric alignment method that can be trained on so-called "partial examples" such as those found in BALIBASE [14] in which only a subset of columns are annotated as reliably aligned. Eagu Kim (personal communication) kindly provided a software implementation of this algorithm (IPA). This implementation extended the method to allow separate terminal and internal gap penalties to be learned.
POP
The central idea in POP is to search for changes in parameters that result in "Goldilocks" changes in Q--not too big, not too small, but changes in just of the right size to have a good chance of indicating coarse trends rather than small-scale noise. This approach can be regarded as a generalization of line search optimization (see for example [15]). As theoretical and empirical methods for dealing with non-convex, non-continuous Q functions are currently limited, practical experience, especially examination of plots such as Fig. 1, is the best guide to determining the size of a "just right" change. In the experiments reported here, values around 0.5% were most effective.
POP proceeds in three stages. In the first stage, an N-dimensional hypercuboid is explored by evaluating Q(w) at each point in a regularly spaced grid. Local maxima are identified as points at which Q is greater than all neighboring points, and the best of these are used as starting points for the second stage. (It is important to do this rather than to use the best Q values found in the grid because these will tend to cluster around the same maxima and it is better to find a diverse set of starting points to avoid local maxima.) The second and third stages use a hill-climbing strategy to approach a local maximum given a starting point. The second stage uses a faster but less accurate variant of the hill-climbing method than the third stage, which starts from the best local optima found in the second stage. The final result is the best maximum found by the third stage. To save computation time, the three stages typically use increasingly large subsets of the training samples, with the first two stages using randomly chosen subsets and the third stage the entire training set.
Hill-climbing
Hill climbing repeats the following procedure until no improvement is found. Starting from a point w, each axis i is explored in both the positive and negative direction; i.e., all parameters are held fixed but one (w_{ i }). Let δ be a proposed change in w_{ i }, = w_{ i }+ δ, w' = {w_{ j }, j ≠ i, }, Q' = Q(w'), i.e. Q' is the value of Q after a proposed move along one axis, and Δ = Q' - Q. In each direction a δ is found that gives Δ > 0 (improves Q), or, failing that, a δ that reduces Q by an amount that is small, but not too small, say in the range 0.1% to 1%. Very small changes are undesirable because they will tend to be dominated by "noise" (Fig 1) rather than indicating a systematic trend. Large reductions in Q will tend to take w_{ i }out of the range where it might make an improvement when combined with a change along another axis (further discussed below). A heuristic designed to minimize these problems introduces a maximum (μ) and minimum (λ) reduction in Q (i.e., λ ≤ -Δ ≤ μ). After computing the proposed changes, a new w is determined as described in Making a move below.
Proposed changes
A proposed change δ is determined as follows. In the first iteration δ is set to a small fraction (say, 10%) of the absolute value of w_{ i }or a small value (say, 0.1) if w_{ i }is zero, or a "hint" value set by the user. In subsequent iterations the initial value of δ is its final value from the previous iteration. Having set the initial value, the following procedure is then repeated. The value of Δ is computed. If Δ = 0, δ is multiplied by 4. If Δ < 0 and -Δ < λ, i.e. Q is reduced by too small an amount, δ is doubled. If Δ < 0 and -Δ > μ, i.e. Q is reduced by too much, δ is halved. Otherwise, Δ > 0 (Q is improved) and further exploration is tried. First, smaller values of δ (specifically, mδ/4 for m = 1, 2 and 3) are evaluated in an attempt to discover an intermediate maximum; if one is found this is repeated to investigate increasingly small changes. If smaller values give no further improvement, larger values are tried (mδ/4 for m = 5, 6, 7 and 8) in a similar way. A maximum improvement in Q, denoted Γ, is imposed in order to avoid extreme changes in a single parameter when far from an optimum (except when the smallest improvement found gives Δ > Γ, in which case it is accepted).
Making a move
It also turns out that allowing moves along axes that reduce Q can, when combined with moves on other axes, improve Q and that allowing this possibility also gives significantly improved optimization in some cases. For example, a small increase in the gap open penalty and a small decrease in the gap extend penalty might each reduce Q, but increase Q when both changes are made simultaneously.
Fast hill-climbing
A "fast" variant of the hill-climbing procedure sets K = 1 (no multi-axis moves) and immediately applies any proposed move that is found to improve Q. Speed is also improved by increasing μ, λ and Γ. These modifications reduce the number of times the Q function is invoked, saving execution time but sometimes yielding significantly inferior parameters. Directions to try, and the sign (+ or -) first tried, may be selected at random or cycled to minimize systematic bias.
Gap models and substitution matrices
Let a model be the set of parameters associated with a given objective function. In this work the substitution matrix is regarded as fixed; only gap penalty parameters are included. A gap is a series of indel symbols; formally, a maximal consecutive sequence of indels. If a gap includes the first or last column of an alignment it is described as terminal; all other gaps are internal. While POP has been implemented for a wide variety of models, most of these are works in progress and I will therefore report results for just two models: g2, in which the same affine penalties are applied to all gaps, and g4, in which internal and terminal gaps have different open and extend penalties. The parameters are g, e, G and E; the penalty for an internal gap of length L is g + (L - 1) e and G + (L - 1) E for a terminal gap. Thus g2 is a special case of g4 in which G = g and E = e. The objective score is then the sum of substitution scores minus gap penalties; a maximum-scoring global alignment is found using standard dynamic programming techniques.
The following substitution matrix types are considered: BLOSUM [16], RBLOSUM [17], PAM [18], JTT [19] and VT/VTML [20, 21]. The RBLOSUM matrices were constructed using a bug-fixed version of the program used to compute BLOSUM matrices from the BLOCKS database [22]. Surprisingly, the corrected BLOSUM62 matrix was found to slightly degrade performance in homolog recognition [17]. Each matrix family is a series with members defined by a measure of evolutionary distance: percent identity cutoff in the case of BLOSUM and RBLOSUM, PAM distance for the rest. Conventionally, integer valued matrices are used in which log-odds scores in fractional bits have been rounded to one or two significant figures. Presumably this is for historical reasons: in older computer processors integer arithmetic was faster than floating point. This is no longer the case for many general-purpose processors, and regardless it is of interest to investigate whether the loss of precision due to integer rounding has an effect on alignment accuracy. Unless otherwise stated, full precision, one bit unit matrices were used.
Benchmark data
Reference data was taken from three protein alignment benchmarks: PREFAB [23] version 4 and BALIBASE versions 2 and 3. There are 1,681 pair-wise reference alignments in PREFAB v4. Extracting all pairs from the multiple alignments in BALIBASE versions 2 and 3 gives 8,135 and 297,960 pairs respectively. The large number of pairs in version 3 motivated the use of the more tractable version 2 for all-pairs tests. Subsets 4 (alignments with long terminal gaps) and 5 (long internal gaps) of version 3 were used to investigate optimizing terminal gap penalties.
Accuracy measure
The accuracy measure (q) for a single pair is the number of correctly aligned residue pairs divided by the number of residue pairs in the reference alignment. The total accuracy score (Q) is the weighted average of q over all pairs, where the weighting is uniform in the case of PREFAB and the inverse of the number of pairs in the original multiple alignment in the case of BALIBASE. It would have been desirable to estimate error bars using a method such as the Bayesian bootstrap [1]; however this proved to be infeasible due to limitations in available computer time (a CPU year was needed to generate the results reported here). For brevity, the reference sets will be denoted Bali (all-pairs from BALIBASE v2), Prefab (all-pairs from PREFAB v4), TermGaps (1,000 randomly selected pairs from BALIBASE v3, subset 4), and IntGaps (1,000 randomly selected pairs from BALIBASE v3, subset 5). To investigate the effects of evolutionary distance three subsets of BALIBASE v3 pairs were constructed: 1,000 randomly selected pairs with identities in the range 0-33% (Id0_33), 33-66% (Id33_66) and 66-99% (Id66_99), respectively. These were selected from the full-length rather than domain-trimmed sequences.
Results
Comparison of POP and IPA
Training set accuracy
Set | Model | POP | IPA | >Acc |
---|---|---|---|---|
Bali | g2 | 80.3% | 80.0% | 0.4% |
Bali | g4 | 80.6% | 79.6% | 1.3% |
TermGaps | g2 | 80.2% | 79.7% | 0.6% |
TermGaps | g4 | 84.0% | 83.8% | 0.2% |
Substitution matrix family
Matrix selection by evolutionary distance
Recommended matrix choice
In all the tests reported here, and others not shown, VTML200 is the best or close to best choice and is recommended as the general-purpose choice. In the VT and BLOSUM series, VT200 and BLOSUM70 respectively are recommended. In the clearly inferior PAM series, PAM100 to PAM150 appear to be best.
Improving CLUSTALW accuracy
Improvement in CLUSTALW using VTML200
Test | Defaults | VTML200 | >Acc | MUSCLE | PROBCONS |
---|---|---|---|---|---|
B3 TC | 44.8% | 48.4% | 8% | 53.2% | 61.3% |
B3 SPS | 78.6% | 80.9% | 3% | 84.4% | 88.3% |
P4 SPS | 61.7% | 64.6% | 5% | 67.7% | 71.7% |
Terminal gaps
Example parameters
Set | Model | g | e | G | E | Q |
---|---|---|---|---|---|---|
Bali | g2 | 5.370 | 0.423 | 80.7% | ||
Bali | g4 | 5.874 | 0.473 | 2.332 | 1.488 | 81.0% |
Prefab | g2 | 6.914 | 0.320 | 61.2% | ||
Prefab | g4 | 6.742 | 0.476 | 3.964 | 0.322 | 62.3% |
TermGaps | g2 | 6.508 | 0.133 | 80.8% | ||
TermGaps | g4 | 5.010 | 0.635 | 15.507 | 0.026 | 84.2% |
IntGaps | g2 | 5.725 | 0.304 | 78.3% | ||
IntGaps | g4 | 5.725 | 0.304 | 6.419 | 3.343 | 78.4% |
Random element
POP has a random element due to the selection of random subsets of the training data in its first two stages. It is therefore of interest to investigate how results vary for the same training data when different random number seeds are used. I chose to do this using the TermGaps set as practical experience shows it to have the most challenging parameter landscape of the sets considered here. I ran POP ten times for the g4 model on this set and found the following characteristics of the resulting Q values: mean 0.8412, standard deviation 5.3 × 10-4, maximum 0.8419, minimum 0.8401. These results suggest a low sensitivity to subset selection and are consistent with finding the global optimum to within approximately three significant figures.
Integer rounding
Integer rounding causes a small, but consistent, degradation in accuracy of around 0.1% to 0.3% in VTML and 0.05% to 0.1% in BLOSUM (detailed results not shown).
Corrected BLOSUM matrices
RBLOSUM gave a small but again consistent improvement in accuracy over BLOSUM of around 0.1%. This result is surprising considering that the opposite effect was found when testing homolog recognition [17].
Discussion
POP is an ad hoc algorithm without a strong theoretical foundation and has more heuristic parameters than one would ideally like. It does not scale well to larger numbers of parameters. Implementing and using POP requires an understanding of the input data and some trial and error. However, consistency of trends across different benchmarks, even when differences are very small, and consistency when run with different random number seeds combine to suggest that POP may find a global optimum to within approximately three significant figures on the pair-wise global alignment tests considered here, and regardless provide strong evidence that POP provides a sensitive test of differences between models with the same number of parameters. It is also readily parallelized and, unlike some other approaches, can easily be applied to multiple alignment. The hill-climbing in POP can start from parameters proposed by some other method, such as IPA, providing a lower bound on the training set error. It should be noted that the reduced training set error achieved by POP may come at some cost in generalization error; this is a topic for further study.
Typical compute resources required by POP.
Set | Model | Time (mins.) | Memory (Mb) |
---|---|---|---|
Bali | g2 | 135 | 72 |
Bali | g4 | 840 | 72 |
Prefab | g2 | 27 | 24 |
Prefab | g4 | 68 | 24 |
TermGaps | g2 | 55 | 26 |
TermGaps | g4 | 192 | 27 |
IntGaps | g2 | 13 | 12 |
IntGaps | g4 | 61 | 25 |
An examination of Fig. 2 shows that BLOSUM60-70 achieves almost the same accuracy as VTML150-200 on BALIBASE but is roughly 1% worse on PREFAB. This suggests that BALIBASE may be biased towards BLOSUM62 due to the use of sequence methods in alignment construction (87% of BALIBASE sequences have unknown structure, so are necessarily aligned by sequence alone). The source code for an example implementation of POP is available from the author's web site at http://www.drive5.com/pop. This is designed for optimization of the g2 and g4 models described here, but can be modified relatively easily for other models by replacing the appropriate Q and alignment functions.
As I can attest from an abundance of personal experience, manual optimization of alignment parameters is a tedious process that rarely leaves the practitioner feeling confident in the results. Having an automated method at one's experimentation with new and modified algorithms by enabling a relatively trustworthy and painless evaluation of their relative effectiveness. I plan to use POP to explore ideas for improved pair-wise and multiple global alignment algorithms.
Conclusion
On the basis of this analysis, the VTML200 matrix is recommended as the most appropriate in general when global alignment accuracy is desired. My results suggest a bias in BALIBASE towards the BLOSUM series of matrices of around 1% in accuracy. While the effect is small, a bias of this magnitude could be significant in validations of multiple alignment methods because differences between the better methods on BALIBASE are of comparable size. Bias towards substitution matrices or gap penalty functions is not unexpected as only 13% of BALIBASE sequences have solved structures, and its alignments were therefore constructed mostly by the use of sequence rather than structure methods. Future studies of alignment accuracy should use data derived independently of sequence in order to avoid such biases.
Declarations
Acknowledgements
I am indebted to Chuong B. Do for many helpful discussions and to Eagu Kim for assistance with IPA.
Authors’ Affiliations
References
- Price GA, Crooks GE, Green RE, Brenner SE: Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics 2005, 21: 3824–31. 10.1093/bioinformatics/bti627View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–80. 10.1093/nar/22.22.4673PubMed CentralView ArticlePubMedGoogle Scholar
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15: 330–40. 10.1101/gr.2821705PubMed CentralView ArticlePubMedGoogle Scholar
- Do CB, Woods DA, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22: e90–8. 10.1093/bioinformatics/btl246View ArticlePubMedGoogle Scholar
- Kececioglu J, Kim E: Learning Scoring Schemes for Sequence Alignment from Partial Examples. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2008, PP: 1.Google Scholar
- Kim E, Kececioglu J: Learning scoring schemes for sequence alignment from partial examples. IEEE/ACM Trans Comput Biol Bioinform 2008, 5: 546–56. 10.1109/TCBB.2008.57View ArticlePubMedGoogle Scholar
- Waterman MS, Eggert M, Lander E: Parametric sequence comparisons. Proc Natl Acad Sci USA 1992, 89: 6090–3. 10.1073/pnas.89.13.6090PubMed CentralView ArticlePubMedGoogle Scholar
- Dewey CN, Huggins PM, Woods K, Sturmfels B, Pachter L: Parametric alignment of Drosophila genomes. PLoS Comput Biol 2006, 2: e73. 10.1371/journal.pcbi.0020073PubMed CentralView ArticlePubMedGoogle Scholar
- Gusfield D, Balasubramanian K, Naor D: Parametric optimization of sequence alignment. Algorithmica 1994, 12: 312–326. 10.1007/BF01185430View ArticleGoogle Scholar
- Chapelle O, Do C, Le Q, Smola A, Teo C: Tighter bounds for structured estimation. Proc NIPS 2009 2009.Google Scholar
- Flannick J, Novak A, Do CB, Srinivasan BS, Batzoglou S: Automatic parameter learning for multiple local network alignment. J Comput Biol 2009, 16: 1001–22. 10.1089/cmb.2009.0099PubMed CentralView ArticlePubMedGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33: 511–8. 10.1093/nar/gki198PubMed CentralView ArticlePubMedGoogle Scholar
- Blackshields G, Wallace IM, Larkin M, Higgins DG: Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol 2006, 6: 321–39.PubMedGoogle Scholar
- Thompson JD, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15: 87–8. 10.1093/bioinformatics/15.1.87View ArticlePubMedGoogle Scholar
- Box M, Davies D, Swann W: Non-Linear optimisation techniques. Oliver & Boyd; 1969.Google Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–9. 10.1073/pnas.89.22.10915PubMed CentralView ArticlePubMedGoogle Scholar
- Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G: BLOSUM62 miscalculations improve search performance. Nat Biotechnol 2008, 26: 274–5. 10.1038/nbt0308-274View ArticlePubMedGoogle Scholar
- Dayhoff MO: Atlas of Protein Sequence and Structure. Washington: N.B.R.F; 1978.Google Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 1992, 8: 275–82.PubMedGoogle Scholar
- Muller T, Spang R, Vingron M: Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol 2002, 19: 8–13.View ArticlePubMedGoogle Scholar
- Muller T, Vingron M: Modeling amino acid replacement. J Comput Biol 2000, 7: 761–76. 10.1089/10665270050514918View ArticlePubMedGoogle Scholar
- Henikoff JG, Henikoff S: Blocks database and its applications. Methods Enzymol 1996, 266: 88–105. full_textView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–7. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Lassmann T, Sonnhammer EL: Kalign--an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005, 6: 298. 10.1186/1471-2105-6-298PubMed CentralView ArticlePubMedGoogle Scholar
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23: 2947–8. 10.1093/bioinformatics/btm404View ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5: 113. 10.1186/1471-2105-5-113PubMed CentralView ArticlePubMedGoogle Scholar
- Phuong TM, Do CB, Edgar RC, Batzoglou S: Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Res 2006, 34: 5932–42. 10.1093/nar/gkl511PubMed CentralView ArticlePubMedGoogle Scholar
- Van Walle I, Lasters I, Wyns L: Align-m--a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics 2004, 20: 1428–35. 10.1093/bioinformatics/bth116View ArticlePubMedGoogle Scholar
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256: 1443–5. 10.1126/science.1604319View ArticlePubMedGoogle Scholar
- Pei J, Grishin NV: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 2006, 34: 4364–74. 10.1093/nar/gkl514PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.