How reliably can we predict the reliability of protein structure predictions?
 István Miklós^{1},
 Ádám Novák^{1},
 Balázs Dombai^{2} and
 Jotun Hein^{1}
DOI: 10.1186/1471-2105-9-137
© Miklós et al; licensee BioMed Central Ltd. 2008
Received: 06 December 2007
Accepted: 03 March 2008
Published: 03 March 2008
Abstract
Background
Comparative methods have been the standard techniques for in silico protein structure prediction. The prediction is based on a multiple alignment that contains both reference sequences with known structures and the sequence whose unknown structure is predicted. Intensive research has been devoted to improving the quality of multiple alignments, since misaligned parts of the multiple alignment yield misleading predictions. However, sometimes all methods fail to predict the correct alignment, because the evolutionary signal is too weak to find the homologous parts due to the large number of mutations that separate the sequences.
Results
Stochastic sequence alignment methods define a posterior distribution of possible multiple alignments. They can highlight the most likely alignment, and beyond that, they can give posterior probabilities for each alignment column. We made a comprehensive study on the HOMSTRAD database of structural alignments, predicting secondary structures in four different ways. We showed that alignment posterior probabilities correlate with the reliability of secondary structure predictions, though the strength of the correlation differs between protocols. The correspondence between the reliability of secondary structure predictions and alignment posterior probabilities is closest to the identity function when the secondary structure posterior probabilities are calculated from the posterior distribution of multiple alignments. The largest deviation from the identity function was obtained when predicting secondary structures from a single optimal pairwise alignment. We also showed that alignment posterior probabilities correlate with the 3D distances between the C_{α} atoms of amino acids in superimposed tertiary structures.
Conclusion
Alignment posterior probabilities can be used to a priori detect errors in comparative models on the sequence alignment level.
Background
Due to the increasing speed and number of genome sequencing projects, the gap between the number of known structures and the number of known protein sequences keeps increasing. As a result, demand for reliable computational methods today is higher than ever, while in silico estimation of protein structures remains one of the most challenging tasks in bioinformatics.
The central assumption of comparative bioinformatics methods for proteins is that the structures of proteins are more conserved than their amino acid sequences. This allows homology modelling, namely, mapping the structure of a sequence onto homologous sequences. As insertions and deletions separating two homologous sequences accumulate, homologous characters in the two sequences will occupy different positions, which makes identifying homologous positions a nontrivial problem. This problem can be solved by sequence alignment algorithms [1–4], which maximise the similarity between aligned positions while minimising the insertions and deletions needed to align the sequences.
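As an illustration of the trade-off between similarity and gaps that such algorithms optimise, the following is a minimal sketch of global alignment scoring by dynamic programming in the style of Needleman and Wunsch [1]. The function name and the match, mismatch and linear gap values are assumptions chosen for illustration; they are not the scoring scheme of any method evaluated in this paper.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] is the best score for aligning the previous prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap
                            curr[j - 1] + gap))  # b[j-1] against a gap
        prev = curr
    return prev[-1]
```

Only the score is returned here; recovering the alignment itself would additionally require storing the full table and tracing back through it.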
The relationship between gap penalties and similarity scores can be set such that they maximise the number of correctly aligned positions in a benchmark set of alignments [5, 6]. By contrast, stochastic models are capable of calibrating their parameters by applying a Maximum Likelihood approach even if no benchmark set is available. Hidden Markov Models were the first such stochastic models, appearing in bioinformatics more than ten years ago [7]. Thorne, Kishino and Felsenstein introduced time-continuous Markov models for describing insertion and deletion events [8, 9], and they showed on simulated data that the Maximum Likelihood method could correctly estimate the insertion-deletion as well as the substitution parameters with which the simulated data had been generated. The TKF models have subsequently been improved [10, 11], and have been tested for alignment accuracy on biological data [11]. Beyond automatic parameter estimation, the other main advantage of stochastic models is that they can provide posterior probabilities for each estimated alignment column as well as for the whole alignment, and these posterior probabilities correlate with the probability that the alignment column is correctly aligned [11–13].
The uncertainty in the sequence alignment can be slightly reduced when more than two sequences are aligned simultaneously, and hence much effort has been put into developing accurate multiple sequence alignment methods. Although efficient algorithms exist for any type of pairwise alignment problem, the multiple sequence alignment problem is hard. It has been proved that the optimal multiple sequence alignment problem under the sum-of-pairs scoring scheme is NP-hard [14], and it is strongly believed that the statistical approach to multiple sequence alignment is algorithmically not simpler than score-based approaches. Since it is unlikely that fast algorithms exist for any type of exact multiple sequence alignment problem, heuristic approaches have become widespread. Profile-HMM methods [15, 16] align sequences to a profile-HMM instead of to each other, and the multiple sequence alignment is obtained by aligning sequences together via the profile-HMM. Since the jumping and emission parameters of the HMM are learned from the data, this approach needs many sequences for parameter optimisation. Moreover, profile-HMMs do not consider evolutionary relationships amongst sequences, and hence they cannot properly handle overrepresentation of evolutionary groups.
Iterative approaches were introduced for score-based methods in the eighties [17, 18] and have recently been extended to stochastic methods [13, 19] using the transducer theory [20, 21]. The drawback of iterative approaches is that in each iteration, they consider only a single, locally optimal alignment that might not lead to a globally optimal alignment. Moreover, as they consider only locally optimal partial solutions, they naturally underestimate the uncertainty of posterior probabilities.
The Markov chain Monte Carlo (MCMC) method represents a third way to attack the multiple stochastic alignment problem. It was first introduced for assessing the Bayesian distribution of evolutionary parameters of the TKF91 model aligning two sequences [22], and has subsequently been extended to multiple sequence alignment [23–28]. The general theory of Markov chain Monte Carlo [29, 30] states that the Markov chain will be in the prescribed distribution after an infinite number of random steps. Obviously, we cannot wait for infinitely many steps in practice, and therefore the success of MCMC methods depends on fast convergence: if the Markov chain converges quickly to the prescribed distribution, the bias of samples from the Markov chain after a limited number of steps will be negligible. Convergence can be checked by measuring autocorrelation in the log-likelihood trace or a few other statistics of the Markov chain, and by running several parallel chains with different random starting points [31].
Since the above-mentioned methods for multiple stochastic sequence alignment problems have been introduced only recently, no large-scale, comprehensive analysis of the performance of these methods for protein structure prediction has been published yet. In this paper, we present a survey on how stochastic alignment methods can be used for protein secondary structure predictions. The prediction can be based on pairwise or multiple alignments and in both cases, either only a single, optimal alignment or the whole posterior distribution of alignments is used for prediction. We are interested in the question of how much one can gain by involving more sequences and the posterior distribution of the alignments in the secondary structure prediction.
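One of the diagnostics mentioned above, autocorrelation in a trace of a chain statistic, can be sketched as follows. This is a generic sample autocorrelation estimate; the function name is hypothetical and the code is not taken from the authors' implementation. Values near zero at moderate lags suggest that successive samples are close to independent, whereas values near 1 suggest slow mixing.

```python
def autocorrelation(trace, lag):
    """Sample autocorrelation of a numeric trace at the given non-negative lag."""
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace)
    if var == 0 or lag >= n:
        return 0.0  # constant trace or lag too large: no information
    cov = sum((trace[t] - mean) * (trace[t + lag] - mean)
              for t in range(n - lag))
    return cov / var
```

In practice the trace would be the log-likelihood recorded at each sampled step after burn-in, and the estimate would be inspected over a range of lags rather than at a single one.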
Results
Implementation of the methods
We implemented a stochastic pairwise and a stochastic multiple sequence alignment method in the Java programming language (see Additional file 1), and we made a study of the methods on the HOMSTRAD database as described in the Methods section.
The stochastic pairwise alignment method was tested on all 9494 possible pairs of sequences belonging to the same family. The analysis took two days on an Intel Xeon 3.0 GHz computer with the SUSE Linux 9.3 operating system and JVM 1.5.0. The most time-consuming part of the analysis was the Maximum Likelihood parameter optimisation, which took approximately 90% of the total running time.
Twelve families were selected for testing the stochastic multiple sequence alignment method, see Table 1. The families were selected such that they reasonably cover the percentage identity distribution of the HOMSTRAD database and contain relatively many, and approximately the same number of, sequences. There are 541 possible pairs of homologous sequences obtainable from the 12 families, which is 5.7% of the possible homologous sequence pairs of the HOMSTRAD database. The analysis was performed on the ZUSE cluster of the Oxford Supercomputing Centre; each job ran on a dual Intel Xeon 3.6 GHz processor under JVM 1.6.0. On each family, 1,000,000 MCMC steps were taken after convergence. The running time of the analysis varied between 2.5 hours (7 sequences, length of 105 amino acids on average) and two days (11 sequences, 294 amino acids on average). Convergence was verified based on the log-likelihood trace and by comparing sampling distributions from parallel chains with different starting points; see Fig. 1 for an example.
Table 1. Selected families from the HOMSTRAD database for testing the performance of stochastic multiple sequence alignment methods
Family name | Class | Number of sequences | Average length | Average sequence identity
Xylose isomerase | Alpha beta barrel | 6 | 388 | 69%
Annexin | All alpha | 6 | 317 | 57%
Calcium-binding protein – parvalbumin-like | All alpha | 7 | 107 | 56%
Starch binding domain | All beta | 8 | 105 | 52%
Glycosyl hydrolase family 22 (lysozyme) | Alpha+beta | 12 | 126 | 51%
Legume lectin | All beta | 12 | 234 | 50%
Papain family cysteine proteinase | Alpha+beta | 13 | 223 | 40%
Subtilase | Alpha/beta | 11 | 294 | 40%
Src homology 2 domains | Alpha+beta | 11 | 105 | 35%
C-type lectin | Alpha+beta | 8 | 126 | 27%
Haloperoxidase | Alpha/beta | 9 | 286 | 25%
Response regulator receiver domain | Alpha/beta | 13 | 122 | 25%
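The pair count quoted in the text can be checked directly from the numbers of sequences in Table 1: a family of n sequences contributes n(n-1)/2 pairs. The short sketch below, with variable names chosen here for illustration, reproduces the arithmetic using the family sizes from the table and the 9494 database-wide pairs reported above.

```python
from math import comb

# Numbers of sequences per family, in the row order of Table 1.
family_sizes = [6, 6, 7, 8, 12, 12, 13, 11, 11, 8, 9, 13]

# Each family of n sequences yields n choose 2 homologous pairs.
pairs = sum(comb(n, 2) for n in family_sizes)

total_pairs = 9494  # all within-family pairs in HOMSTRAD, as stated in the text
share = 100 * pairs / total_pairs  # percentage covered by the 12 families
```

The sum is 541 pairs, and 541 out of 9494 is about 5.7%, in agreement with the figures given in the text.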
Postprocessing the data
Secondary structure predictions have been given in four ways:

Based on the Viterbi alignment (referred to as "Viterbi"). In this case, the most likely – a.k.a. Viterbi – alignment was obtained for all pairs of sequences and was used to map the secondary structure of one of the sequences onto the other sequence.

Based on the posterior distribution of pairwise alignments using the Forward-Backward algorithm ("Forward"). In this case, the posterior probabilities that two amino acids are aligned together were obtained for all pairs of sequences and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.

Based on the Maximum Posterior Decoding estimation from samples of a Markov chain Monte Carlo (MCMC) stochastic multiple alignment ("MPD"). In this case, the Maximum Posterior Decoding (MPD) alignments were predicted from MCMC samples and were used to map the secondary structure of one of the sequences onto the other sequences. The MPD alignment maximises the product of the posterior probabilities of its alignment columns. See the Methods section for an explanation of why the MPD alignment can be more accurately estimated from MCMC samples than the Viterbi alignment.

Based on the posterior distribution of multiple alignments obtained by MCMC stochastic multiple alignment ("Bayesian"). In this case, the posterior probabilities that two amino acids are aligned together were estimated from the MCMC samples for all pairs of sequences that can be chosen from a multiple alignment and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.
Our results indicate that methods predicting secondary structures based on a single alignment are overpessimistic about their performance on alpha helices and beta sheets, namely, the posterior probabilities associated with the prediction are lower than the actual probability that the prediction is correct. Methods that predict structures based on the whole distribution of sequence alignments are less pessimistic – the alignment posterior probabilities better approximate the observed probabilities that the prediction is correct. All pairwise alignment methods proved to be overoptimistic in estimating the reliability of their predictions for alpha helices and beta sheets with posterior probability above 0.8.
Estimating the correctness of 3_{10} helix predictions turned out to be the hardest among all secondary structure types. Every method except the Bayesian estimation on multiple sequence alignments is markedly overoptimistic about its power to predict 3_{10} helices. MPD is less optimistic than the pairwise methods.
Among all methods studied, Bayesian estimation based on multiple alignments was the only one that was able to correctly estimate its prediction power for all secondary structure types, including 3_{10} helices, which makes MCMC-based multiple alignment methods strong candidates for becoming a fundamental tool in protein structure prediction.
To show that the alignment posterior probabilities correlate not only with the goodness of secondary structure predictions but also with the similarities in the 3D structures, we calculated from the HOMSTRAD superimposed 3D structures the 3D distances between the C_{α} atoms for each aligned pair of amino acids. The alignment posterior probabilities were evenly divided into 10 categories, and the average 3D distances as well as the low and high quartiles have been plotted for each category.
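The binning step described above can be sketched as follows. The sketch assumes that "evenly divided" means ten equal-width intervals of posterior probability, which the text does not state explicitly. The function name and the input format, a list of (posterior probability, distance) pairs, are likewise assumptions made for illustration.

```python
from statistics import mean, quantiles

def summarise_by_bin(pairs, n_bins=10):
    """Group (posterior, distance) pairs into equal-width posterior bins.

    Returns {bin index: (mean distance, lower quartile, upper quartile)}
    for every bin that holds at least two distances.
    """
    bins = [[] for _ in range(n_bins)]
    for p, d in pairs:
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[k].append(d)
    summary = {}
    for k, distances in enumerate(bins):
        if len(distances) >= 2:  # statistics.quantiles needs two or more points
            q1, _, q3 = quantiles(distances, n=4)
            summary[k] = (mean(distances), q1, q3)
    return summary
```

Bins with fewer than two observations are skipped rather than reported with undefined quartiles; a real analysis might prefer to flag them.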
Discussion
Comparing predictions on different secondary structure types
The differences between the predictions of different secondary structure elements can be explained by their general attributes. Alpha helices are typically formed by 10 amino acids or more. Substitutions are frequent in alpha helices and they are surrounded by loop sequences where insertions and deletions often occur, therefore stochastic alignment methods realise some uncertainty, which yields relatively low posterior probabilities when aligning these regions. However, since alpha helices are relatively long, and the substitutions that occur in them rarely change the chemical behaviour of the affected amino acids, the long runs of chemically similar amino acids in the two sequences to be aligned give a strong statistical signal that helps align alpha helices.
Beta sheet elements are typically shorter than alpha helices, and are also surrounded by nonstructured fragments accumulating insertions and deletions, which also yields relatively low alignment posterior probabilities. However, beta sheet elements are more likely to be misaligned, since their short length keeps them from carrying the statistical signal that alpha helices carry.
There is a similar explanation for the overoptimism in the region of 0.8 and higher posterior probabilities in the case of alpha helices and beta sheets: slight structural changes might shift the position where an alpha helix or a beta sheet starts or ends, even if the amino acids in the positions in question do not change. Fig. 8 also shows examples of such variations of secondary structure elements. For instance, the first alignment column is indicated to have a beta sheet structure in some sequences while it is nonstructural in others.
Comparing predictions of different protocols
Predictions based on a single, optimal pairwise or multiple alignment are overpessimistic: alignment columns from both the Viterbi alignments and the MPD multiple alignments are labelled with posterior probabilities that are typically lower than the actual probability that the secondary structure predictions are correct for these columns. When the whole posterior ensemble of alignments is the basis of the secondary structure prediction, the posterior probabilities are closer to the actual probabilities that the prediction is correct. One main difference between the two strategies – prediction based on a single optimal alignment and prediction based on the posterior distribution of alignments – is that in the latter case posterior probabilities of all secondary structure types are given for each amino acid, while in the former case, the Viterbi or MPD alignment assigns at most one secondary structure element to each amino acid. This suggests the hypothesis that prediction methods based on the posterior distribution of alignments are less overpessimistic because they include false positive predictions with small posterior probabilities that are not part of a Viterbi or MPD alignment-based estimation.
To test this hypothesis, we predicted alpha helices and beta sheets from the posterior distribution of pairwise alignments in an alternative way. In this alternative prediction, each amino acid has been assigned to at most one secondary structure element that had maximal posterior probability (if the posterior probability of not harbouring a secondary structure type was maximal, then no secondary structure has been associated to the amino acid in question).
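The alternative assignment rule can be sketched as a simple maximum over the posterior probabilities of one amino acid. The dictionary layout, the label "none" for the absence of a secondary structure, and the function name are assumptions made for this illustration.

```python
def assign_single_structure(posteriors):
    """Pick at most one secondary structure type for one amino acid.

    posteriors maps each structure label, plus the label "none", to its
    posterior probability. Returns (label, probability) for the most
    probable structure, or None when "none" is the most probable outcome.
    """
    label = max(posteriors, key=posteriors.get)
    return None if label == "none" else (label, posteriors[label])
```

For example, an amino acid with posteriors of 0.45 for alpha helix, 0.2 for beta sheet and 0.35 for no structure would be assigned an alpha helix with probability 0.45, while one with 0.3, 0.2 and 0.5 would receive no assignment.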
The correlation between alignment posterior probabilities and probabilities of correctly predicting a secondary structure type is obviously the same under the two different protocols if the posterior probability is greater than 0.5, since an event having probability greater than 0.5 must be the most likely event. The two types of curves split very soon below 0.5 (data not shown), and the second type of prediction protocol (considering at most one secondary structure type prediction for an amino acid) gets less overpessimistic than the other protocol. This means that there are more true positive predictions than false positive predictions with nonmaximal posterior probabilities.
Correlation between 3D structure similarities and alignment posterior probabilities
High alignment posterior probabilities indicate that the aligned residues are close to each other in the superimposed 3D structures. The average 3D distance between the aligned residues increases as the alignment posterior probability decreases. However, the distribution of residue distances becomes flatter for small alignment posterior probabilities, namely, a small alignment posterior probability does not necessarily mean that the aligned residues are far from each other. For example, 0.5 alignment posterior probability in a pairwise alignment means that there is still about 25% probability that the aligned residues are closer to each other than the average distance between amino acids that are aligned together with more than 0.9 posterior probability. The distance distribution is even flatter in the case of multiple alignments. One possible explanation is that the alignment posterior probabilities are calculated for multiple alignment columns while distances are calculated for all possible pairs of amino acids in alignment columns. A small alignment posterior probability indicates possible differences in the 3D structures; however, some of the 3D structures might still be similar. Averaging the 3D distances in alignment columns naturally makes the distribution more centred (data not shown).
Conclusion
In this paper, we studied how posterior probabilities of aligning characters in pairwise or multiple alignments might indicate whether secondary structure predictions based on the alignments in question are correct. We found that pairwise alignment methods are overpessimistic on predicting alpha helices and beta sheets, namely, posterior probabilities of alignment columns are lower than the actual probability that the structure prediction based on the alignment column is correct, while they are overoptimistic on predicting 3_{10} helices, i.e., posterior probabilities for these alignment columns are greater than the probabilities that the secondary structure prediction for these amino acids is correct. Multiple alignment methods provide slightly more reliable estimates of the reliability of their secondary structure predictions – they are less overoptimistic on 3_{10} helix predictions.
Secondary structure predictions can be given based on single, optimal pairwise or multiple alignments and also based on the posterior ensemble of alignments. In the latter case, posterior probabilities are closer to the probabilities that the secondary structure prediction is correct, especially when the structure prediction is based on the posterior distribution of multiple sequence alignments.
The multiple sequence alignment is the Holy Grail of bioinformatics [32] since what "one or two homologous sequences whisper ... a full multiple sequence alignment shouts out loud" [33]. Our experiments show that multiple sequence alignments not only highlight conserved positions better than pairwise alignments, but they also more reliably indicate the reliability of their prediction capabilities. This extra information could be exploited in 3D protein structure prediction: high posterior probabilities indicate the regions of the sequence alignment where the alignment accuracy is significantly better than the average alignment accuracy, see Figs. 5 and 6. These parts can be used as a reliable scaffold in homology modelling. On the remaining, unreliable parts, homology modelling is expected to have a low quality, and hence the 3D structure of these regions should be predicted with alternative methods, like ab initio threading methods [34–36].
It is worth mentioning that the alignment methods we applied in this work do not consider any information about how secondary structures evolve. It is well-known that different secondary structure elements follow different substitution processes, and this difference in the substitution pattern can be used for secondary structure prediction [37]. It is fairly straightforward to incorporate into current alignment methods a priori knowledge on the substitution, insertion and deletion processes of secondary structures, and we expect that such combined approaches will have a better performance in structure prediction. Moreover, secondary structures can be predicted not only in a comparative way, but also using a single sequence, based on the statistical properties of the amino acids in different secondary structure types [38, 39]. Potential prior distributions for secondary structure elements might be derived from such statistics and might be used in Bayesian analysis.
The running time of the methods obviously increases with the complexity of the background models, and analyses utilising such combined methods currently take too long to be applicable for everyday use on personal computers. However, the speed of processors keeps increasing exponentially following Moore's law, and will soon reach a level at which it no longer poses a barrier to such combined approaches. There are also promising ways to improve the running time of the methods. The standard approach for statistical multiple alignment is going to be MCMC, and current implementations make use of very basic tricks only, like the alignment window cut algorithm described in the Methods section. Several groups are working on making MCMC alignment methods more efficient and faster mixing, and significant improvements are expected in the coming years.
Methods
The HOMSTRAD database [40] was downloaded and used as a benchmark set for the methods we tested. As of December 2007, the database contains 1032 families of sequences; the members of each family share a common 3D structure. Each sequence in the database is annotated in JOY format [41] that, among other information, describes the secondary structure type of each amino acid (one of alpha helix, beta sheet, 3_{10} helix or none). We predicted the secondary structures of the sequences as described below.
Pairwise sequence alignments
The stochastic model
Predicting secondary structure based on a single optimal alignment ("Viterbi")
For each family in the HOMSTRAD database, each pair of sequences was aligned using the pair-HMM described above. Since the jumping probabilities in the pair-HMM are interdependent via common parameters, the usual EM algorithm [12] cannot be applied; instead, we used the conjugate gradient method [44] to obtain a numerical approximation of the Maximum Likelihood parameters for each pair of sequences. The Viterbi alignment [12] was obtained for each sequence pair using the ML parameters, and for each alignment column in the Viterbi alignment, posterior probabilities were calculated with the Forward and Backward algorithms [12]. The Viterbi alignment was used to map the secondary structure of one sequence onto the other.
Predicting secondary structure based on the distribution of alignments ("Forward")
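The displayed equation that belongs here is not present in this text. A form consistent with the definitions that follow, offered as a reconstruction in which the left-hand symbol P_{s}(a_{i}) is an assumed notation for the posterior probability that character a_{i} of sequence A has secondary structure s, would be:

$$P_{s}(a_{i}) = \sum_{j=1}^{m} P(a_{i}, b_{j})\,\delta_{s,b_{j}}$$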
where m is the length of sequence B, P(a_{i}, b_{j}) is the posterior probability that characters a_{i} and b_{j} are aligned, and $\delta_{s,b_{j}}$ is 1 if the known secondary structure of character b_{j} is s, and 0 otherwise.
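The fuzzy mapping built from these quantities can be sketched in a few lines: for each character of sequence A, the alignment posteriors are summed over those characters of sequence B whose known structure is s. The function name and the nested-list layout of the posterior matrix are assumptions for illustration; in the study the posteriors come from the Forward and Backward algorithms, which are not reproduced here.

```python
def fuzzy_structure_posterior(align_post, structure_b, s):
    """Posterior probability of structure s for every character of sequence A.

    align_post[i][j] is the posterior that a_i is aligned with b_j, and
    structure_b[j] is the known secondary structure label of b_j.
    """
    return [sum(p for p, label in zip(row, structure_b) if label == s)
            for row in align_post]
```

Because each row of alignment posteriors sums to at most 1 (the remainder being the probability that a_{i} is aligned to a gap), the returned values are valid probabilities.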
Multiple sequence alignment methods
Bayesian model for sequence alignments, evolutionary trees and model parameters
The transducer theory [20] has been used to construct a multiple-HMM along an evolutionary tree from pair-HMMs. The same pair-HMM described in the previous section was used in the construction, and the so-obtained multiple-HMM gives the likelihood of a multiple alignment and an evolutionary tree. This multiple-HMM describes sequence evolution as independent events on the branches of the evolutionary tree. This means that the sequence fragmentation on an edge of the evolutionary tree is not inherited on descending branches. Moreover, the fragmentations on sibling branches are independent of each other. Uninformative, exponential priors with expectation 1 have been used as priors for edge lengths and insertion-deletion parameters in the TKF92 model. All tree topologies were equally probable a priori. These priors together with the likelihood of a tree and multiple alignment on the tree define the joint posterior distribution of multiple sequence alignments, evolutionary trees and model parameters.
Markov chain Monte Carlo inference of sequence alignments
Since the joint distribution of alignments, trees and parameters is a high-dimensional distribution that is too complicated for direct, analytical inference, Markov chain Monte Carlo [29, 30] has been used for sampling from the posterior distribution. One of the key questions here is how far we can go with the analytical calculations. For the biologically less realistic, but computationally more tractable TKF91 model [8], we developed a fast algorithm [24, 25] that calculates the likelihood of an evolutionary tree and a multiple sequence alignment of observed sequences. No such fast algorithm is known for the TKF92 model, and hence more data augmentation is necessary. This data augmentation includes sequences associated with the internal nodes and pairwise sequence alignments of neighbouring nodes associated with the edges of the evolutionary tree. Since the likelihood of substitution events can be efficiently calculated with Felsenstein's algorithm [45], we only store the distribution of conditional likelihoods – also known as "Felsenstein's wildcards" [23] – at internal nodes of the evolutionary tree. We call this structure an extended alignment.
The Markov chain performs a random walk on the space comprising the following components:

Edge lengths of the tree

Model parameters

Extended alignment, described above

Tree topology
We applied Metropolis-Hastings moves to change one of the components randomly; each component was selected with a fixed, prescribed probability that was chosen to maximise the mixing of the Markov chain. Standard techniques were used for modifying edge lengths and parameters in the model; for a reference, see [25].
Because the MCMC analysis is time-consuming, we selected 12 families from the HOMSTRAD database (see Table 1) on which we performed an MCMC analysis. Convergence was verified based on the log-likelihood trace, and one million steps were taken in each Markov chain after its burn-in period. Each chain was sampled every 100 steps, so 10000 samples were collected from each chain. In a few cases, alternative chains with different starting points were set up, and the MPD alignment was estimated from both chains, see Fig. 1.
Predicting secondary structures based on the MPD estimation of multiple sequence alignment ("MPD")
In an earlier work [25], we showed that the maximum a posteriori (MAP) estimation from an MCMC sample is unstable, since there are many suboptimal alignments, and typically almost all sampled alignments from a Markov chain will be different. The same alignment showing up occasionally in multiple samples is merely due to the nonoptimal mixing of the Markov chain, and such an alignment cannot be regarded as the most probable in the posterior distribution in any sense. Instead, we estimated the Maximum Posterior Decoding (MPD) alignment [12, 47] that maximises the product of the posterior single-column probabilities. This method offers a significantly more reliable result since many alignments share particular columns. The estimation for the MPD alignment from an MCMC sample can be obtained by a simple dynamic programming algorithm which first creates a directed acyclic graph whose vertices are the alignment columns of the MCMC samples, and then estimates the posterior probability for each alignment column by the relative frequencies of alignment columns in the sample. The MPD estimation is the path that maximises the product of the relative frequencies. The MPD alignment is used to map the secondary structure of one of the sequences onto other sequences.
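The dynamic programming idea described above can be sketched as follows. The encoding is an assumption made for this illustration and is not claimed to match the authors' data structures: each column is stored as a pair (before, after) giving how many characters of every sequence have been consumed before and after the column, so that one column may follow another exactly when the first column's after state equals the second column's before state. Alignments are given as lists of equal-length gapped strings with "-" as the gap symbol, all-gap columns are assumed not to occur, and the function names are hypothetical.

```python
from collections import Counter
from math import log

def columns_of(alignment):
    """Encode each column as (before, after) tuples of consumed character counts."""
    consumed = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        before = tuple(consumed)
        for s, ch in enumerate(chars):
            if ch != "-":
                consumed[s] += 1
        cols.append((before, tuple(consumed)))
    return cols

def mpd_alignment(samples):
    """Path of columns maximising the product of relative column frequencies."""
    counts = Counter(col for aln in samples for col in columns_of(aln))
    n = len(samples)
    start = tuple(0 for _ in samples[0])
    end = columns_of(samples[0])[-1][1]  # all samples share the same end state
    # best[state] = (best log-score of reaching state, last column on that path)
    best = {start: (0.0, None)}
    # Each column consumes at least one character, so ordering columns by the
    # total count in their before state is a topological order of the graph.
    for col in sorted(counts, key=lambda c: sum(c[0])):
        before, after = col
        if before in best:
            score = best[before][0] + log(counts[col] / n)
            if after not in best or score > best[after][0]:
                best[after] = (score, col)
    # Trace back from the end state to the start state.
    path, state = [], end
    while state != start:
        col = best[state][1]
        path.append(col)
        state = col[0]
    return path[::-1]
```

Working with sums of logarithms instead of products of frequencies avoids numerical underflow on long alignments. The returned path may combine columns from different samples, which is what distinguishes the MPD estimate from simply picking the most frequent sampled alignment.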
Predicting secondary structures based on the posterior distribution of multiple alignments ("Bayesian")
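The displayed equation that belongs here is not present in this text. A form consistent with the definitions that follow, offered as a reconstruction in which the left-hand symbol P_{s}(a_{i}) is an assumed notation and the term for a sample is taken to be 0 when a_{i} is aligned to a gap, would be:

$$P_{s}(a_{i}) = \frac{1}{N}\sum_{k=1}^{N} \delta_{s,f_{k}(a_{i})}$$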
where N is the number of alignments sampled from the Markov chain, f_{k}(a_{i}) is the amino acid in sequence B with which a_{i} is aligned in the k-th alignment, and δ_{s,x} is 1 if the known secondary structure of character x is s, and 0 otherwise.
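The sample-based estimate built from these quantities amounts to counting, over the sampled alignments, how often a_{i} is aligned to a character of B with known structure s. The sketch below assumes that the aligned partner of a_{i} in each sample is available as an index into sequence B, with None marking a gap; this layout and the function name are illustrative assumptions.

```python
def sampled_structure_posterior(aligned_partners, structure_b, s):
    """Fraction of samples in which a_i is aligned to a character with structure s.

    aligned_partners[k] is the index in sequence B of the character aligned
    with a_i in the k-th sampled alignment, or None if a_i faces a gap.
    structure_b[j] is the known secondary structure label of b_j.
    """
    hits = sum(1 for j in aligned_partners
               if j is not None and structure_b[j] == s)
    return hits / len(aligned_partners)
```

Samples in which a_{i} is aligned to a gap contribute nothing to any structure type, so the estimates over all types sum to less than 1 whenever such samples occur.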
Declarations
Acknowledgements
This research was supported by BBSRC grant BB/C509566/1. IM was also supported by a Bolyai postdoctoral fellowship and an OTKA grant F61730. The authors would like to thank the two anonymous referees for their valuable comments.
Authors’ Affiliations
References
 Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443–53. 10.1016/00222836(70)900574View ArticlePubMed
 Waterman M, Smith T, Beyer W: Some biological sequence metrics. Advan Math 1976, 20: 367–387. 10.1016/00018708(76)902024View Article
 Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–7. 10.1016/00222836(81)900875View ArticlePubMed
 Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162: 705–708. 10.1016/00222836(82)903989View ArticlePubMed
 Waterman M: Parametric and ensemble sequence alignment algorithms. Bulletin of Mathematical Biology 1994, 5(4):743–767.View Article
 Kececioglu J, Kim E: Simple and Fast Inverse Alignment. Lecture Notes in Computer Science 2006, 3909: 441–455.
 Krogh A, Brown M, Mian I, Sjolander K, Haussler D: Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104
 Thorne JL, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 1991, 33(2):114–24. 10.1007/BF02193625
 Thorne JL, Kishino H, Felsenstein J: Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 1992, 34: 3–16. 10.1007/BF00163848
 Knudsen B, Miyamoto M: Sequence alignments and pair hidden Markov models using evolutionary history. J Mol Biol 2003, 333: 453–460. 10.1016/j.jmb.2003.08.015
 Miklós I, Lunter GA, Holmes I: A 'long indel' model for evolutionary sequence alignment. Mol Biol Evol 2004, 21(3):529–540. 10.1093/molbev/msh043
 Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Probabilistic models of proteins and nucleic acids. Cambridge University Press; 1998.
 Löytynoja A, Milinkovitch M: A hidden Markov model for progressive multiple alignment. Bioinformatics 2003, 19(12):1505–1513. 10.1093/bioinformatics/btg193
 Wang L, Jiang T: On the complexity of multiple sequence alignment. J Comp Biol 1994, 1(4):337–348.
 Karplus K, Barrett C, Hughey R: Hidden Markov Models for Detecting Remote Protein Homologies. Bioinformatics 1998, 14(10):846–856. 10.1093/bioinformatics/14.10.846
 Eddy S: Profile Hidden Markov Models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755
 Hogeweg P, Hesper B: The alignment of sets of sequences and the construction of phyletic trees: An integrated method. J Mol Evol 1984, 20(2):175–186. 10.1007/BF02257378
 Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–360. 10.1007/BF02603120
 Löytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. PNAS 2005, 102(30):10557–10562. 10.1073/pnas.0409137102
 Holmes I: Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics 2003, 19: i147–i157. 10.1093/bioinformatics/btg1019
 Bradley R, Holmes I: An Emerging Probabilistic Framework for Modeling Indels on Trees. Bioinformatics 2007. 10.1093/bioinformatics/btm402
 Metzler D, Fleissner R, von Haeseler A, Wakolbinger A: Assessing variability by joint sampling of alignments and mutation rates. J Mol Evol 2001, 53: 660–669. 10.1007/s002390010253
 Holmes I, Bruno W: Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 2001, 17(9):803–820. 10.1093/bioinformatics/17.9.803
 Lunter G, Miklós I, Drummond A, Jensen J, Hein J: Bayesian phylogenetic inference under a statistical indel model. Lecture Notes in Bioinformatics 2003, 2812: 228–244.
 Lunter G, Miklós I, Drummond A, Jensen J, Hein J: Bayesian Coestimation of Phylogeny and Sequence Alignment. BMC Bioinformatics 2005, 6: 83. 10.1186/1471-2105-6-83
 Fleissner R, Metzler D, von Haeseler A: Simultaneous Statistical Multiple Alignment and Phylogeny Reconstruction. Systematic Biology 2005, 54: 548–561. 10.1080/10635150590950371
 Redelings B, Suchard M: Joint Bayesian estimation of alignment and phylogeny. Syst Biol 2005, 54: 401–418. 10.1080/10635150590947041
 Suchard M, Redelings B: BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 2006, 22(16):2047–2048. 10.1093/bioinformatics/btl175
 Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E: Equation of state calculations by fast computing machines. J Chem Phys 1953, 21(6):1087–1091. 10.1063/1.1699114
 Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57: 97–109. 10.1093/biomet/57.1.97
 Ronquist F, Huelsenbeck J: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 2003, 19(12):1572–1574. 10.1093/bioinformatics/btg180
 Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press; 1997.
 Hubbard T, Lesk A, Tramontano A: Gathering them into the fold. Nature Structural Biology 1996, 3: 313. 10.1038/nsb0496-313
 Skolnick J, Kolinski A, Kihara D, Betancourt M, Rotkiewicz PMB, M B: Ab initio protein structure prediction via a combination of threading, lattice folding, clustering, and structure refinement. Proteins 2002, 44(S5):149–156.
 Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biology 2007, 5: 17. 10.1186/1741-7007-5-17
 Zhou H, Skolnick J: Ab Initio Protein Structure Prediction Using Chunk-TASSER. Biophysical Journal 2007, 93: 1510–1518. 10.1529/biophysj.107.109959
 Goldman N, Thorne J, Jones D: Using Evolutionary Trees in Protein Secondary Structure Prediction and Other Comparative Sequence Analyses. J Mol Biol 1996, 263(2):196–208. 10.1006/jmbi.1996.0569
 Kneller D, Cohen F, Langridge R: Improvements in Protein Secondary Structure Prediction by an Enhanced Neural Network. J Mol Biol 1990, 214: 171–182. 10.1016/0022-2836(90)90154-E
 Garnier J, Gibrat JF, Robson B: GOR secondary structure prediction method version IV. Methods in Enzymology 1996, 266: 540–553.
 Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 1998, 7: 2469–2471.
 Mizuguchi K, Deane C, Johnson M, Blundell T, Overington J: JOY: protein sequence-structure representation and analysis. Bioinformatics 1998, 14: 617–623. 10.1093/bioinformatics/14.7.617
 Dayhoff M, Schwartz R, Orcutt B: Atlas of protein sequence and structure. Volume 5. National Biomedical Research Foundation, Washington, D.C., chap. A model of evolutionary changes in proteins; 1978:345–352.
 Holmes I, Rubin G: An expectation maximization algorithm for training hidden substitution models. J Mol Biol 2002, 317: 757–768. 10.1006/jmbi.2002.5405
 Press W, Flannery B, Teukolsky S, Vetterling W: Numerical Recipes in C. The Art of Scientific Computing. Cambridge University Press; 2001.
 Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17: 368–376. 10.1007/BF01734359
 Drummond A, Nicholls G, Rodrigo A, Solomon W: Estimating Mutation Parameters, Population History and Genealogy Simultaneously From Temporally Spaced Sequence Data. Genetics 2002, 161(3):1307–1320.
 Holmes I, Durbin R: Dynamic programming alignment accuracy. J Comp Biol 1998, 5: 493–504.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.