How reliably can we predict the reliability of protein structure predictions?
© Miklós et al. 2008
Received: 06 December 2007
Accepted: 03 March 2008
Published: 03 March 2008
Comparative methods have been the standard techniques for in silico protein structure prediction. The prediction is based on a multiple alignment that contains both reference sequences with known structures and the sequence whose unknown structure is predicted. Intensive research has been carried out to improve the quality of multiple alignments, since misaligned parts of the multiple alignment yield misleading predictions. However, sometimes all methods fail to predict the correct alignment, because the evolutionary signal is too weak to find the homologous parts due to the large number of mutations that separate the sequences.
Stochastic sequence alignment methods define a posterior distribution of possible multiple alignments. They can highlight the most likely alignment, and beyond that, they can give posterior probabilities for each alignment column. We carried out a comprehensive study on the HOMSTRAD database of structural alignments, predicting secondary structures in four different ways. We showed that alignment posterior probabilities correlate with the reliability of secondary structure predictions, though the strength of the correlation differs between protocols. The correspondence between the reliability of secondary structure predictions and alignment posterior probabilities is closest to the identity function when the secondary structure posterior probabilities are calculated from the posterior distribution of multiple alignments. The largest deviation from the identity function was obtained when predicting secondary structures from a single optimal pairwise alignment. We also showed that alignment posterior probabilities correlate with the 3D distances between the Cα atoms of amino acids in superimposed tertiary structures.
Alignment posterior probabilities can be used to a priori detect errors in comparative models on the sequence alignment level.
Due to the increasing speed and number of genome sequencing projects, the gap between the number of known structures and the number of known protein sequences keeps increasing. As a result, demand for reliable computational methods today is higher than ever, while in silico estimation of protein structures remains one of the most challenging tasks in bioinformatics.
The central assumption of comparative bioinformatics methods for proteins is that the structures of proteins are more conserved than their amino-acid sequences. This allows homology modelling, namely, mapping the structure of a sequence onto homologous sequences. As insertions and deletions separating two homologous sequences accumulate, homologous characters in the two sequences will occupy different positions, which causes a non-trivial problem of identifying homologous positions. This problem can be solved by sequence alignment algorithms [1–4], which maximise the similarity between aligned positions while also minimising the insertions and deletions needed to align the sequences.
The relationship between gap penalties and similarity scores can be set such that they maximise the number of correctly aligned positions in a benchmark set of alignments [5, 6]. By contrast, stochastic models are capable of calibrating their parameters by a Maximum Likelihood approach even if no benchmark set is available. Hidden Markov Models were the first such stochastic models, appearing in bioinformatics more than ten years ago. Thorne, Kishino and Felsenstein introduced time-continuous Markov models for describing insertion and deletion events [8, 9], and they showed on simulated data that the Maximum Likelihood method could correctly estimate the insertion-deletion as well as the substitution parameters with which the simulated data had been generated. The TKF models have subsequently been improved [10, 11], and have been tested for alignment accuracy on biological data. Beyond automatic parameter estimation, the other main advantage of stochastic models is that they can provide posterior probabilities for each estimated alignment column as well as for the whole alignment, and these posterior probabilities correlate with the probability that the alignment column is correctly aligned [11–13].
The uncertainty in the sequence alignment can be slightly reduced when more than two sequences are aligned simultaneously, and hence much effort has been put into developing accurate multiple sequence alignment methods. Although efficient algorithms exist for any type of pairwise alignment problem, the multiple sequence alignment problem is hard. It has been proved that the optimal multiple sequence alignment problem under the sum-of-pairs scoring scheme is NP-hard, and it is strongly believed that the statistical approach to multiple sequence alignment is algorithmically not simpler than score-based approaches. Since it is unlikely that fast algorithms exist for any type of exact multiple sequence alignment problem, heuristic approaches have become widespread. Profile-HMM methods [15, 16] align sequences to a profile-HMM instead of to each other, and the multiple sequence alignment is obtained by aligning the sequences together via the profile-HMM. Since the jumping and emission parameters of the HMM are learned from the data, this approach needs many sequences for parameter optimisation. Moreover, profile-HMMs do not consider evolutionary relationships amongst sequences, and hence they cannot properly handle the over-representation of evolutionary groups.
Iterative approaches have been introduced for score-based methods in the eighties [17, 18] and have recently been extended for stochastic methods [13, 19] using the transducer theory [20, 21]. The drawback of iterative approaches is that in each iteration, they consider only a single, locally optimal alignment that might not lead to a globally optimal alignment. Moreover, as they consider only locally optimal partial solutions, they naturally underestimate the uncertainty of posterior probabilities.
The Markov chain Monte Carlo (MCMC) method represents a third way to attack the multiple stochastic alignment problem. It was first introduced for assessing the Bayesian distribution of evolutionary parameters of the TKF91 model aligning two sequences, and has subsequently been extended to multiple sequence alignment [23–28]. The general theory of Markov chain Monte Carlo [29, 30] states that the Markov chain will be in the prescribed distribution after an infinite number of random steps. Obviously, we cannot wait infinitely many steps in practice, and therefore the success of MCMC methods depends on fast convergence: if the Markov chain converges quickly to the prescribed distribution, the bias of samples from the Markov chain after a limited number of steps will be negligible. The convergence can be checked by measuring autocorrelation in the log-likelihood trace or a few other statistics of the Markov chain and by running several parallel chains with different random starting points.
Since the above-mentioned methods for multiple stochastic sequence alignment problems have been introduced only recently, no large-scale, comprehensive analysis of the performance of these methods for protein structure prediction has been published yet. In this paper, we present a survey on how stochastic alignment methods can be used for protein secondary structure predictions. The prediction can be based on pairwise or multiple alignments and in both cases, either only a single, optimal alignment or the whole posterior distribution of alignments is used for prediction. We are interested in the question of how much one can gain by involving more sequences and the posterior distribution of the alignments in the secondary structure prediction.
We implemented a stochastic pairwise and a stochastic multiple sequence alignment method in the Java programming language (see Additional file 1), and we studied the methods on the HOMSTRAD database as described in the Methods section.
The stochastic pairwise alignment method was tested on all 9494 possible pairs of sequences belonging to the same family. The analysis took two days on an Intel Xeon 3.0 GHz computer running the SUSE Linux 9.3 operating system and JVM 1.5.0. The most time-consuming part of the analysis was the Maximum Likelihood parameter optimisation, which took approximately 90% of the total running time.
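As an illustration of the autocorrelation check mentioned above, the following is a minimal sketch that computes the sample autocorrelation of a log-likelihood trace at a given lag. The function names `autocorrelation` and `first_lag_below`, and the threshold of 0.1, are assumptions made for this example; they are not taken from the authors' Java implementation.

```python
from statistics import fmean


def autocorrelation(trace, lag):
    """Sample autocorrelation of a scalar MCMC trace at the given lag."""
    n = len(trace)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(trace)")
    mean = fmean(trace)
    variance_sum = sum((x - mean) ** 2 for x in trace)
    if variance_sum == 0.0:
        raise ValueError("trace is constant; autocorrelation is undefined")
    covariance_sum = sum(
        (trace[t] - mean) * (trace[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum


def first_lag_below(trace, threshold=0.1):
    """Smallest positive lag whose autocorrelation falls below the threshold.

    Returns None if no such lag exists. The threshold is an illustrative
    choice, not a value given in the paper.
    """
    for lag in range(1, len(trace)):
        if autocorrelation(trace, lag) < threshold:
            return lag
    return None
```

A slowly decaying autocorrelation suggests that the chain should be run longer or sampled less frequently; comparing several chains started from different points remains a separate check.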
Table 1. Selected families from the HOMSTRAD database for testing the performance of stochastic multiple sequence alignment methods (columns: family, number of sequences, average sequence identity).
Alpha beta barrel
Calcium-binding protein – parvalbumin-like
Starch binding domain
Glycosyl hydrolase family 22 (lysozyme)
Papain family cysteine proteinase
Src homology 2 domains
Response regulator receiver domain
Secondary structure predictions have been given in four ways:
Based on the Viterbi alignment (referred to as "Viterbi"). In this case, the most likely – a.k.a. Viterbi – alignment was obtained for all pairs of sequences and was used to map the secondary structure of one of the sequences onto the other sequence.
Based on the posterior distribution of pairwise alignments using the Forward-Backward algorithm ("Forward"). In this case, the posterior probabilities that two amino acids are aligned together were obtained for all pairs of sequences and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.
Based on the Maximum Posterior Decoding estimation from samples of a Markov chain Monte Carlo (MCMC) stochastic multiple alignment ("MPD"). In this case, the Maximum Posterior Decoding (MPD) alignments were predicted from MCMC samples and were used to map the secondary structure of one of the sequences onto the other sequences. The MPD alignment maximizes the product of the posterior probabilities of its alignment columns. See the Methods section for an explanation of why the MPD alignment can be more accurately estimated from MCMC samples than the Viterbi alignment.
Based on the posterior distribution of multiple alignments obtained by MCMC stochastic multiple alignment ("Bayesian"). In this case, the posterior probabilities that two amino acids are aligned together were estimated from the MCMC samples for all pairs of sequences that can be chosen from a multiple alignment and all pairs of amino acids. The secondary structure of one of the sequences was mapped onto the other sequence in a fuzzy way using the posterior probabilities.
Our results indicate that methods predicting secondary structures based on a single alignment are over-pessimistic about their performance on alpha helices and beta sheets, namely, the posterior probabilities associated with the prediction are lower than the actual probability that the prediction is correct. Methods that predict structures based on the whole distribution of sequence alignments are less pessimistic – the alignment posterior probabilities better approximate the observed probabilities that the prediction is correct. All pairwise alignment methods proved to be over-optimistic in estimating the reliability of their predictions for alpha helices and beta sheets with posterior probability above 0.8.
Estimating the correctness of 3₁₀ helix predictions turned out to be the hardest among all secondary structure types. Every method except the Bayesian estimation on multiple sequence alignments is considerably over-optimistic about its power to predict 3₁₀ helices. MPD is less optimistic than the pairwise methods.
Among all methods studied, Bayesian estimation based on multiple alignments was the only one that was able to correctly predict its prediction power for all secondary structure types, including 3₁₀ helices, which makes MCMC-based multiple alignment methods strong candidates for becoming a fundamental tool in protein structure prediction.
To show that the alignment posterior probabilities correlate not only with the quality of secondary structure predictions but also with the similarities in the 3D structures, we calculated from the HOMSTRAD superimposed 3D structures the 3D distances between the Cα atoms for each aligned pair of amino acids. The alignment posterior probabilities were evenly divided into 10 categories, and the average 3D distances as well as the low and high quartiles were plotted for each category.
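The comparison between stated posterior probabilities and observed accuracy can be illustrated with a reliability curve. The sketch below groups predictions by their posterior probability and reports, for each group, the mean posterior and the observed fraction of correct predictions. The function name, the input layout and the choice of ten equal-width bins are assumptions made for illustration and do not reproduce the authors' code.

```python
def reliability_curve(predictions, n_bins=10):
    """Compare stated posterior probabilities with observed accuracy.

    predictions: iterable of (posterior_probability, is_correct) pairs.
    Returns a list of (mean_posterior, observed_fraction_correct, count),
    one entry per non-empty bin, ordered from low to high posterior.
    """
    bins = [[] for _ in range(n_bins)]
    for prob, correct in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError("posterior probabilities must lie in [0, 1]")
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, bool(correct)))

    curve = []
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(1 for _, c in members if c) / len(members)
        curve.append((mean_prob, observed, len(members)))
    return curve
```

In this picture, a method is over-pessimistic in a bin when the observed fraction exceeds the mean posterior, and over-optimistic when it falls below it; a perfectly calibrated method lies on the identity function.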
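The summary statistics described above can be sketched as follows. The example assumes that "evenly divided" means ten equal-width probability intervals, which the text does not state explicitly, and the function name `distance_summary` and the input layout are likewise hypothetical.

```python
from statistics import fmean, quantiles


def distance_summary(pairs, n_bins=10):
    """Summarise 3D distances per alignment posterior probability category.

    pairs: iterable of (alignment_posterior, c_alpha_distance) tuples.
    Returns {bin_index: (lower_quartile, mean, upper_quartile)}.
    Bins with fewer than two distances are skipped, because quartiles
    cannot be estimated from a single value.
    """
    bins = [[] for _ in range(n_bins)]
    for prob, distance in pairs:
        if not 0.0 <= prob <= 1.0:
            raise ValueError("posterior probabilities must lie in [0, 1]")
        bins[min(int(prob * n_bins), n_bins - 1)].append(distance)

    summary = {}
    for index, distances in enumerate(bins):
        if len(distances) < 2:
            continue
        lower, _median, upper = quantiles(distances, n=4)
        summary[index] = (lower, fmean(distances), upper)
    return summary
```

Plotting the three values per category against the category midpoint gives the kind of curve described in the text.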
The differences between the predictions of different secondary structure elements can be explained by their general attributes. Alpha helices are typically formed by 10 amino acids or more. Substitutions are frequent in alpha helices, and they are surrounded by loop sequences where insertions and deletions often occur; stochastic alignment methods therefore register some uncertainty, which yields relatively low posterior probabilities when aligning these regions. However, since alpha helices are relatively long, and the substitutions that occur in them rarely change the chemical behaviour of the affected amino acids, the long runs of chemically similar amino acids in the two sequences to be aligned give a strong statistical signal that helps align alpha helices.
Beta sheet elements are typically shorter than alpha helices, and are also surrounded by non-structured fragments accumulating insertions and deletions, which also yields relatively low alignment posterior probabilities. However, beta sheet elements are more likely to be misaligned, since their short length prevents them from carrying the statistical signal that alpha helices do.
There is a similar explanation for the overoptimism in the region of 0.8 and higher posterior probabilities in the case of alpha helices and beta sheets: slight structural changes might shift the position where an alpha helix or a beta sheet starts or ends, even if the amino acids in the positions in question do not change. Fig. 8 also shows examples of such variations of secondary structure elements. For instance, the first alignment column is indicated to have a beta sheet structure in some sequences while it is non-structural in others.
Predictions based on a single, optimal pairwise or multiple alignment are over-pessimistic: alignment columns from both the Viterbi alignments and the MPD multiple alignments are labelled with posterior probabilities that are typically lower than the actual probability that the secondary structure predictions are correct for these columns. When the whole posterior ensemble of alignments is the basis of the secondary structure prediction, the posterior probabilities are closer to the actual probabilities that the prediction is correct. One main difference between the two strategies – prediction based on a single optimal alignment and prediction based on the posterior distribution of alignments – is that in the latter case posterior probabilities of all secondary structure types are given for each amino acid, while in the former case, the Viterbi or MPD alignment assigns at most one secondary structure element to each amino acid. This suggests the hypothesis that prediction methods based on the posterior distribution of alignments are less over-pessimistic because they include false positive predictions with small posterior probabilities that are not part of a Viterbi or MPD alignment-based estimation.
To test this hypothesis, we predicted alpha helices and beta sheets from the posterior distribution of pairwise alignments in an alternative way. In this alternative prediction, each amino acid was assigned at most one secondary structure element, namely the one with maximal posterior probability (if the posterior probability of not harbouring a secondary structure type was maximal, then no secondary structure was associated with the amino acid in question).
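The alternative protocol can be sketched as a per-residue choice of the most probable label. The example below assumes that the probability of "no secondary structure" is the remainder after summing the probabilities of the structure types; the function name and the one-letter labels `H` and `E` are illustrative assumptions.

```python
def assign_most_probable(posteriors):
    """Assign at most one secondary structure type to each residue.

    posteriors: list with one dict per residue, mapping a structure label
    (for example "H" or "E") to its posterior probability. The probability
    of carrying no secondary structure is taken to be 1 minus their sum.
    Returns a list of (label_or_None, probability) tuples.
    """
    assignments = []
    for residue in posteriors:
        total = sum(residue.values())
        if total > 1.0 + 1e-9:
            raise ValueError("structure probabilities sum to more than 1")
        none_probability = max(0.0, 1.0 - total)
        best_label, best_probability = None, none_probability
        for label, probability in residue.items():
            # A structure type wins only if it is strictly more probable
            # than every alternative seen so far, including "none".
            if probability > best_probability:
                best_label, best_probability = label, probability
        assignments.append((best_label, best_probability))
    return assignments
```

Any label with probability above 0.5 is necessarily the winner, so the two protocols can only differ for predictions with posterior probability of at most 0.5.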
The correlation between alignment posterior probabilities and probabilities of correctly predicting a secondary structure type is obviously the same under the two different protocols if the posterior probability is greater than 0.5, since an event having probability greater than 0.5 must be the most likely event. The two types of curves split very soon below 0.5 (data not shown), and the second type of prediction protocol (considering at most one secondary structure type prediction for an amino acid) gets less over-pessimistic than the other protocol. This means that there are more true positive predictions than false positive predictions with non-maximal posterior probabilities.
High alignment posterior probabilities indicate that the aligned residues are close to each other in the superimposed 3D structures. The average 3D distance between the aligned residues increases as the alignment posterior probability decreases. However, the distribution of residue distances becomes flatter for small alignment posterior probabilities, namely, a small alignment posterior probability does not necessarily mean that the aligned residues are far from each other. For example, 0.5 alignment posterior probability in a pairwise alignment means that there is still about 25% probability that the aligned residues are closer to each other than the average distance between amino acids that are aligned together with more than 0.9 posterior probability. The distance distribution is even flatter in the case of multiple alignments. One possible explanation is that the alignment posterior probabilities are calculated for multiple alignment columns while distances are calculated for all possible pairs of amino acids in alignment columns. A small alignment posterior probability indicates possible differences in the 3D structures; however, some of the 3D structures might still be similar. Averaging the 3D distances in alignment columns naturally makes the distribution more centred (data not shown).
In this paper, we studied how posterior probabilities of aligning characters in pairwise or multiple alignments might indicate whether secondary structure predictions based on the alignments in question are correct. We found that pairwise alignment methods are over-pessimistic on predicting alpha helices and beta sheets, namely, posterior probabilities of alignment columns are lower than the actual probability that the structure prediction based on the alignment column is correct, while they are overoptimistic on predicting 3₁₀ helices, i.e., posterior probabilities for these alignment columns are greater than the probabilities that the secondary structure prediction for these amino acids is correct. Multiple alignment methods provide slightly more reliable estimates of the reliability of their secondary structure predictions – they are less overoptimistic on 3₁₀ helix predictions.
Secondary structure predictions can be given based on single, optimal pairwise or multiple alignments and also based on the posterior ensemble of alignments. In the latter case, posterior probabilities are closer to the probabilities that the secondary structure prediction is correct, especially when the structure prediction is based on the posterior distribution of multiple sequence alignments.
The multiple sequence alignment is the Holy Grail of bioinformatics since what "one or two homologous sequences whisper ... a full multiple sequence alignment shouts out loud". Our experiments show that multiple sequence alignments not only highlight conserved positions better than pairwise alignments, but they also more reliably indicate the reliability of their prediction capabilities. This extra information could be exploited in 3D protein structure prediction: high posterior probabilities indicate the regions of the sequence alignment where the alignment accuracy is significantly better than the average alignment accuracy, see Figs. 5 and 6. These parts can be used as a reliable scaffold in homology modelling. On the remaining, unreliable parts, homology modelling is expected to have a low quality, and hence the 3D structure of these regions should be predicted with alternative methods, like ab initio threading methods [34–36].
It is worth mentioning that the alignment methods we applied in this work do not consider any information about how secondary structures evolve. It is well known that different secondary structure elements follow different substitution processes, and this difference in the substitution pattern can be used for secondary structure prediction. It is fairly straightforward to incorporate into current alignment methods a priori knowledge on the substitution, insertion and deletion processes of secondary structures, and we expect that such combined approaches will have a better performance in structure prediction. Moreover, secondary structures can be predicted not only in a comparative way, but also using a single sequence, based on the statistical properties of the amino acids in different secondary structure types [38, 39]. Potential prior distributions for secondary structure elements might be derived from such statistics and might be used in Bayesian analysis.
The running time of the methods obviously increases with the complexity of the background models, and analyses utilising such combined methods currently take too long to be applicable for everyday use on personal computers. However, the speed of processors keeps increasing exponentially following Moore's law, and will soon reach a level at which it no longer poses a barrier to such combined approaches. There are also promising ways to improve the running time of the methods. The standard approach for statistical multiple alignment is going to be MCMC, and current implementations make use of very basic tricks only, like the alignment window cut algorithm described in the Methods section. Several groups are working on making MCMC alignment methods more efficient and faster mixing, and significant improvements are expected in the coming years.
The HOMSTRAD database was downloaded and used as a benchmark set for the methods we tested. As of December 2007, the database contains 1032 families of sequences; each family shares a common 3D structure. Each sequence in the database is annotated in JOY format, which, among other information, describes the secondary structure type of each amino acid (one of alpha helix, beta sheet, 3₁₀ helix or none). We predicted the secondary structures of the sequences as described below.
For each family in the HOMSTRAD database, each pair of sequences was aligned using the above-described pair-HMM. Since the jumping probabilities in the pair-HMM are interdependent via common parameters, the usual EM algorithm cannot be applied; instead, we used the conjugate gradient method to obtain a numerical approximation of the Maximum Likelihood parameters for each pair of sequences. The Viterbi alignment was obtained for each sequence pair using the ML parameters, and for each alignment column in the Viterbi alignment, posterior probabilities were calculated with the Forward and Backward algorithms. The Viterbi alignment was used to map the secondary structure of one sequence to the other.
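In the Forward protocol, the posterior probability that character $a_i$ of sequence A has secondary structure type $s$ is obtained as

$$P(s \mid a_i) = \sum_{j=1}^{m} P(a_i, b_j)\,\delta_{s,b_j}$$

where $m$ is the length of sequence B, $P(a_i, b_j)$ is the posterior probability that characters $a_i$ and $b_j$ are aligned, and $\delta_{s,b_j}$ is 1 if the known secondary structure of character $b_j$ is $s$, and 0 otherwise.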
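The fuzzy mapping described by these definitions, in which each residue of sequence A collects the alignment posterior probabilities of those residues of sequence B that carry a given structure type, might be sketched as follows. The posterior matrix is assumed to come from a Forward-Backward computation that is not shown here, and the labels `H`, `E` and `G` (helix, sheet, 3₁₀ helix) are hypothetical one-letter codes rather than the JOY annotation itself.

```python
def map_structure_fuzzy(posterior, structure_b, types=("H", "E", "G")):
    """Map the known structure of sequence B onto sequence A in a fuzzy way.

    posterior: matrix (list of rows) where posterior[i][j] is the posterior
        probability that residue i of A is aligned with residue j of B.
    structure_b: string with one structure label per residue of B; any
        label not listed in `types` is treated as "no secondary structure".
    Returns one dict per residue of A, mapping each type to its probability.
    """
    m = len(structure_b)
    result = []
    for row in posterior:
        if len(row) != m:
            raise ValueError("each posterior row needs one entry per residue of B")
        probabilities = {label: 0.0 for label in types}
        for probability, label in zip(row, structure_b):
            if label in probabilities:
                probabilities[label] += probability
        result.append(probabilities)
    return result
```

Because the alignment posteriors in a row sum to at most 1, the resulting type probabilities also sum to at most 1; the remainder corresponds to the residue being unaligned or aligned to an unstructured position.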
The transducer theory has been used to construct a multiple-HMM along an evolutionary tree from pair-HMMs. The same pair-HMM described in the previous section was used in the construction, and the multiple-HMM thus obtained gives the likelihood of a multiple alignment and an evolutionary tree. This multiple-HMM describes sequence evolution as independent events on the branches of the evolutionary tree. This means that the sequence fragmentation on an edge of the evolutionary tree is not inherited on descending branches. Moreover, the fragmentations on sibling branches are independent of each other. Uninformative, exponential priors with expectation 1 have been used as priors for edge lengths and insertion-deletion parameters in the TKF92 model. All tree topologies were equally probable a priori. These priors together with the likelihood of a tree and multiple alignment on the tree define the joint posterior distribution of multiple sequence alignments, evolutionary trees and model parameters.
Since the joint distribution of alignments, trees and parameters is a high dimensional distribution that is too complicated for direct, analytical inference, Markov chain Monte Carlo [29, 30] has been used for sampling from the posterior distribution. One of the key questions here is how far we can go with the analytical calculations. For the biologically less realistic, but computationally more tractable TKF91 model, we developed a fast algorithm [24, 25] that calculates the likelihood of an evolutionary tree and a multiple sequence alignment of observed sequences. No such fast algorithm is known for the TKF92 model, and hence more data augmentation is necessary. This data augmentation includes sequences associated with the internal nodes and pairwise sequence alignments of neighbouring nodes associated with the edges of the evolutionary tree. Since the likelihood of substitution events can be efficiently calculated with Felsenstein's algorithm, we only store the distribution of conditional likelihoods – also known as "Felsenstein's wildcards" – at internal nodes of the evolutionary tree. We call this structure an extended alignment.
The Markov chain performs a random walk on the space comprising the following components:
Edge lengths of the tree
Extended alignment, described above
We applied Metropolis-Hastings moves to change one of the components randomly, each component selected with a fixed, prescribed probability that was chosen to maximise the mixing of the Markov chain. Standard techniques were used for modifying edge lengths and parameters in the model, for a reference, see .
Because the MCMC analysis is time-consuming, we selected 12 families from the HOMSTRAD database (see Table 1), on which we performed an MCMC analysis. The convergence was verified based on the log-likelihood trace, and one million steps were taken in each Markov chain after its burn-in period. Each chain was sampled every 100 steps, so 10,000 samples were collected from each chain. In a few cases, alternative chains with different starting points were set up, and the MPD alignment was estimated from both chains, see Fig. 1.
In an earlier work, we showed that the maximum a posteriori (MAP) estimation from an MCMC sample is unstable, since there are many suboptimal alignments, and typically almost all sampled alignments from a Markov chain will be different. The same alignment showing up occasionally in multiple samples is merely due to the non-optimal mixing of the Markov chain, and such an alignment cannot be regarded as the most probable in the posterior distribution in any sense. Instead, we estimated the Maximum Posterior Decoding (MPD) alignment [12, 47] that maximises the product of the posterior single-column probabilities. This method offers a significantly more reliable result since many alignments share particular columns. The estimate of the MPD alignment from an MCMC sample can be obtained by a simple dynamic programming algorithm, which first creates a directed acyclic graph whose vertices are the alignment columns of the MCMC samples, and then estimates the posterior probability of each alignment column by the relative frequency of that column in the sample. The MPD estimate is the path that maximises the product of the relative frequencies. The MPD alignment is used to map the secondary structure of one of the sequences onto the other sequences.
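The dynamic programming step can be illustrated with a minimal sketch. It assumes that each sampled alignment is given as a list of equal-length gapped strings over the same sequences, and it identifies an alignment column by the number of residues of each sequence consumed before and after it; this encoding, the function names and the use of log frequencies are choices made for the example rather than details of the authors' implementation.

```python
import math
from collections import Counter


def columns_of(alignment):
    """Encode each column as (residues consumed before, residues consumed after)."""
    consumed = [0] * len(alignment)
    columns = []
    for position in range(len(alignment[0])):
        before = tuple(consumed)
        for s, row in enumerate(alignment):
            if row[position] != "-":
                consumed[s] += 1
        after = tuple(consumed)
        if after != before:  # ignore columns consisting of gaps only
            columns.append((before, after))
    return columns


def mpd_alignment(samples):
    """Estimate the MPD alignment from sampled alignments of the same sequences.

    Returns (path, product), where path is the list of chosen columns and
    product is the product of their relative frequencies in the sample.
    """
    if not samples:
        raise ValueError("at least one sampled alignment is required")
    n = len(samples)
    counts = Counter()
    for alignment in samples:
        counts.update(columns_of(alignment))

    start = tuple(0 for _ in samples[0])
    end = tuple(len(row.replace("-", "")) for row in samples[0])
    # best[node] = (best log score of a path from start to node, last column used)
    best = {start: (0.0, None)}
    # Every column consumes at least one residue, so sorting by the total
    # number of residues consumed after the column is a topological order.
    for column in sorted(counts, key=lambda c: sum(c[1])):
        before, after = column
        score = best[before][0] + math.log(counts[column] / n)
        if after not in best or score > best[after][0]:
            best[after] = (score, column)

    path = []
    node = end
    while node != start:
        column = best[node][1]
        path.append(column)
        node = column[0]
    path.reverse()
    return path, math.exp(best[end][0])


def render(path, sequences):
    """Turn a column path back into gapped rows, given the ungapped sequences."""
    return [
        "".join(seq[b[s]] if a[s] > b[s] else "-" for b, a in path)
        for s, seq in enumerate(sequences)
    ]
```

Because the path is assembled from columns rather than whole alignments, the result may combine columns from different samples and need not coincide with any single sampled alignment.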
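In the Bayesian protocol, the posterior probability that character $a_i$ of sequence A has secondary structure type $s$ is estimated as

$$P(s \mid a_i) = \frac{1}{N} \sum_{k=1}^{N} \delta_{s,f_k(a_i)}$$

where $N$ is the number of alignments in the Markov chain, $f_k(a_i)$ is the amino acid in sequence B with which $a_i$ is aligned in the $k$th alignment, and $\delta_{s,x}$ is 1 if the known secondary structure of character $x$ is $s$, and 0 otherwise.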
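The estimate described by these definitions is a relative frequency over the sampled alignments, which might be sketched as follows. The example assumes that each sample has already been projected onto the two sequences of interest and is given as a pair of gapped strings; a residue aligned to a gap contributes nothing in that sample. The function name and the labels `H`, `E` and `G` are hypothetical.

```python
def structure_posterior_from_samples(samples, structure_b, types=("H", "E", "G")):
    """Estimate secondary structure posteriors for sequence A from samples.

    samples: list of (row_a, row_b) pairs, each a pairwise alignment of the
        same two sequences written with "-" for gaps.
    structure_b: string with one structure label per residue of B.
    Returns one dict per residue of A, mapping each type to the fraction of
    samples in which that residue is aligned to a residue of B of that type.
    """
    if not samples:
        raise ValueError("at least one sampled alignment is required")
    n = len(samples)
    length_a = len(samples[0][0].replace("-", ""))
    counts = [{label: 0 for label in types} for _ in range(length_a)]

    for row_a, row_b in samples:
        i = j = 0  # indices of the next residue of A and of B
        for char_a, char_b in zip(row_a, row_b):
            if char_a != "-" and char_b != "-":
                label = structure_b[j]
                if label in counts[i]:
                    counts[i][label] += 1
            if char_a != "-":
                i += 1
            if char_b != "-":
                j += 1

    return [{label: c / n for label, c in residue.items()} for residue in counts]
```

With a well-mixed chain, these frequencies approximate the posterior probabilities integrated over alignments, trees and model parameters.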
This research was supported by BBSRC grant BB/C509566/1. IM was also supported by a Bolyai postdoctoral fellowship and an OTKA grant F61730. The authors would like to thank the two anonymous referees for their valuable comments.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.