The representative set of 2482 amino acid sequences (PDB_SELECT) used for preliminary statistical analysis was obtained from [44]. The procedure used to generate the PDB_SELECT list was described earlier [45]. In this set, the percentage of identity between any pair of sequences is less than 25%. The 3324 "sequence-unique" proteins were downloaded from the EVA server ftp site, as of 2004_05_09, and a copy of the data was placed at [46]. The proteins in this set, which was used in leave-one-out cross-validation experiments, were selected to satisfy the condition that the percentage of identity between any pair of sequences does not exceed a length-dependent threshold S (for instance, for sequences longer than 450 amino acids, S = 19.5) [47]. CASP6 targets were downloaded from [48], and the PDB definitions were used for the amino acid sequences and secondary structure assignments. PSIPRED training data was downloaded from [49].
Bayesian formulation
The linear sequence that defines the secondary structure of a protein can be described by a pair of vectors (S, T), where S denotes the structural segment end (border) positions and T determines the structural state of each segment (α-helix, β-strand or loop). For instance, for the secondary structure shown in Fig. 2, S = (4, 9, 12, 16, 21, 28, 33) and T = (L, E, L, E, L, H, L).
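To make the (S, T) representation concrete, the following minimal sketch (not from the paper's software; the function name is ours) expands the segment borders and segment types into one structural label per residue:

```python
# Illustrative sketch: expand the segment representation (S, T) into a
# per-residue secondary structure string. S holds the end (border)
# position of each segment, T its structural state (H, E or L).

def expand_segmentation(S, T):
    """Return one label per residue; segment j spans S_{j-1}+1 .. S_j."""
    labels = []
    prev_end = 0
    for end, state in zip(S, T):
        labels.extend([state] * (end - prev_end))  # length = S_j - S_{j-1}
        prev_end = end
    return "".join(labels)

# The example of Fig. 2:
S = (4, 9, 12, 16, 21, 28, 33)
T = ("L", "E", "L", "E", "L", "H", "L")
print(expand_segmentation(S, T))
# -> "LLLLEEEEELLLEEEELLLLLHHHHHHHLLLLL"
```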
Given a statistical model specifying probabilistic dependencies between sequence and structure elements, the problem of protein secondary structure prediction can be stated as maximizing the a posteriori probability of a structure given the primary sequence. Thus, given the sequence of amino acids, R, one has to find the vector (S, T) with maximum a posteriori probability P(S, T | R) as defined by an appropriate statistical model. Using Bayes' rule, this probability can be expressed as:

P(S, T | R) = P(R | S, T) P(S, T) / P(R),    (5)

where P(R | S, T) denotes the likelihood and P(S, T) is the a priori probability. Since P(R) is a constant with respect to (S, T), maximizing P(S, T | R) is equivalent to maximizing P(R | S, T)P(S, T). To proceed further, we need models for each of these probabilistic terms. We model the a priori probability P(S, T) as follows:
P(S, T) = ∏_{j=1}^{m} P(T_j | T_{j-1}) P(S_j | S_{j-1}, T_j).    (6)

Here, m denotes the total number of uniform secondary structure segments. P(T_j | T_{j-1}) is the probability of transition from a segment with secondary structure type T_{j-1} to a segment with secondary structure type T_j. Table 3 shows the transition probabilities P(T_j | T_{j-1}), estimated from a representative set of 2482 "unrelated" proteins (see the Methods section). The third term, P(S_j | S_{j-1}, T_j), reflects the length distribution of the uniform secondary structure segments. It is assumed that

P(S_j | S_{j-1}, T_j) = P(S_j - S_{j-1} | T_j),    (7)

where S_j - S_{j-1} is equal to the segment length (Fig. 2). The segment length distributions for different types of secondary structure have been determined earlier [50].
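The factorization of the prior in Eq. 6 can be sketched as follows. The transition probabilities and the geometric length distribution below are hypothetical stand-ins; the actual values come from Table 3 and from the empirical length histograms [50]:

```python
# Minimal sketch of the prior P(S, T) of Eq. 6 under placeholder
# parameters (NOT the paper's estimates).
import math

# Hypothetical transition probabilities P(T_j | T_{j-1}); "^" marks the start.
trans = {
    ("^", "L"): 1.0,
    ("L", "H"): 0.4, ("L", "E"): 0.6,
    ("H", "L"): 1.0, ("E", "L"): 1.0,
}

def length_prob(length, state):
    """Hypothetical geometric stand-in for P(S_j - S_{j-1} | T_j)."""
    p = {"H": 0.15, "E": 0.25, "L": 0.30}[state]
    return p * (1.0 - p) ** (length - 1)

def log_prior(S, T):
    """log P(S, T) = sum_j [log P(T_j | T_{j-1}) + log P(S_j - S_{j-1} | T_j)]."""
    logp = 0.0
    prev_end, prev_state = 0, "^"
    for end, state in zip(S, T):
        logp += math.log(trans[(prev_state, state)])
        logp += math.log(length_prob(end - prev_end, state))
        prev_end, prev_state = end, state
    return logp
```

Applied to the Fig. 2 segmentation, `log_prior((4, 9, 12, 16, 21, 28, 33), ("L", "E", "L", "E", "L", "H", "L"))` returns a finite negative log probability.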
The likelihood term P(R | S, T) can be written as:

P(R | S, T) = ∏_{j=1}^{m} P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j).    (8)

Here, R[p:q] denotes the sequence of amino acid residues with position indices from p to q. The probability of observing a particular amino acid sequence in a segment adopting a particular type of secondary structure is P(R[S_{j-1}+1 : S_j] | S, T). This term is assumed to be equal to P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j); thus, this probability depends only on the secondary structure type of the given segment and not on those of adjacent segments. Note that we ignore the non-local interactions observed in β-sheets. This simplification allows us to implement an efficient hidden semi-Markov model.
To elaborate on the segment likelihood terms in Eq. 8, we have to consider the correlation patterns within the segment with uniform secondary structure. These patterns reflect the secondary structure specific physico-chemical interactions. For instance, α-helices are strengthened by hydrogen bonding between amino acid pairs situated at specific distances. To correctly define the likelihood term we should also pay attention to proximal positions, typically the four initial and the four final positions of a secondary structure segment. In particular, α-helices include capping boxes, where the hydrogen bonding patterns and side-chain interactions are different from the internal positions [51, 52]. The observed distributions of amino acid frequencies in proximal (capping boxes) and internal positions of α-helix segments are depicted in Schmidler et al. [14], and show noticeably distinct patterns.
The presence of this inhomogeneity in the statistical model leads to the following expression for P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j):

P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j) = ∏_{i=k_b}^{k_b+l_N-1} P_N(R_i | R[k_b : i-1], T_j) × ∏_{i=k_b+l_N}^{k_e-l_C} P_Int(R_i | R[k_b : i-1], T_j) × ∏_{i=k_e-l_C+1}^{k_e} P_C(R_i | R[k_b : i-1], T_j).    (9)

Here, the first and third sub-products represent the probabilities of observing l_N and l_C specific amino acids at the segment's N-terminus and C-terminus, respectively. The second sub-product defines the observation probability of the given amino acids at the segment's internal positions. Note that k_b and k_e designate S_{j-1} + 1 and S_j, respectively. The probabilistic expression (9) is generic for α-helices, β-strands and loops. Formula (9) assumes that the probabilistic model is fully dependent within a segment, i.e. the observation of an amino acid at a given position depends on all previous amino acids within the segment. However, at this time, the Protein Data Bank (PDB, [53]) does not contain a sufficient amount of experimental data to reliably estimate all the parameters of a fully dependent model. Therefore, it is important to reduce the dependency structure and keep only the most important correlations. To achieve this goal, we performed the statistical analysis described in the following section.
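The position-class structure of Eq. 9 can be sketched as below. For tractability the sketch conditions each residue only on its predecessor rather than on all previous residues; the probability table and the segment string are hypothetical placeholders:

```python
# Sketch of the three sub-products of Eq. 9: l_N N-terminal positions,
# internal positions, and l_C C-terminal positions, each scored with its
# own position-specific conditional probability table.
import math

def segment_log_likelihood(residues, l_N, l_C, cond_prob):
    """log-likelihood of a segment; `cond_prob(position_class, aa, prev_aa)`
    supplies the (reduced, first-order) conditional probabilities."""
    k_e = len(residues)
    logp = 0.0
    for i, aa in enumerate(residues):
        if i < l_N:
            pos_class = "N%d" % (i + 1)        # N-terminal sub-product
        elif i >= k_e - l_C:
            pos_class = "C%d" % (k_e - i)      # C-terminal sub-product
        else:
            pos_class = "Int"                  # internal sub-product
        prev_aa = residues[i - 1] if i > 0 else None
        logp += math.log(cond_prob(pos_class, aa, prev_aa))
    return logp

# A flat hypothetical table: every conditional probability equals 1/20.
uniform = lambda pos_class, aa, prev: 1.0 / 20.0
print(round(segment_log_likelihood("AVKLLEG", 2, 2, uniform), 3))  # -> -20.97
```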
Correlation patterns of amino acids
Amino acids have distinct propensities for adopting secondary structure conformations [54]. These propensities are at the heart of many secondary structure prediction methods [51, 52, 55–62]. Our goal is to come up with a dependency pattern that is comprehensive enough to capture the essential correlations, yet simple enough in terms of the number of model parameters to allow reliable parameter estimation from the available training data. Therefore, we performed a χ2-test to identify the most significant correlations between amino acid pairs located at adjacent and non-adjacent positions for each type of secondary structure segment. The χ2-test compared the empirical distribution of amino acid pairs with the respective product of marginal distributions. To this end, a 20 × 20 contingency table was computed for the frequencies of possible amino acid pairs observed in different structural states. In this test, the threshold was computed as 404.6 for a statistical significance level of 0.05.
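The mechanics of this pair-correlation test can be sketched as follows, assuming SciPy is available; the toy segments below are hypothetical and stand in for the real training set:

```python
# Sketch of the chi-square independence test: a 20x20 contingency table of
# amino acid pairs at a fixed separation within segments of one structural
# type, compared against the product of marginal distributions.
import numpy as np
from scipy.stats import chi2_contingency

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def pair_table(segments, sep):
    """Count pairs (residue at i - sep, residue at i) over all segments."""
    table = np.ones((20, 20))  # pseudocount so no row/column is empty
    for seg in segments:
        for i in range(sep, len(seg)):
            table[IDX[seg[i - sep]], IDX[seg[i]]] += 1
    return table

segments = ["ALKHELLGW", "VVKLAEDFG", "MLEEKAKAL"]  # toy helix segments
stat, p, dof, expected = chi2_contingency(pair_table(segments, sep=1))
print(dof)  # 19 * 19 = 361 degrees of freedom
```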
We first considered the correlations between amino acid pairs at various separation distances (Table 4). In α-helix segments, a residue at position i is highly correlated with residues at positions i - 2, i - 3 and i - 4. Similarly, a β-strand residue had its highest correlations with residues at positions i - 1, i - 2, and a loop residue had its most significant correlation with a residue at position i - 1. The test statistics for the remaining pairs were above the threshold for statistical significance but these values were considerably lower than the ones listed above. The dependencies that were identified by the statistical analysis are in agreement with the well known physical nature of the secondary structure conformations.
Next, we analyzed proximal positions and a representative set of internal positions. Frequency patterns at proximal positions deviate from the patterns observed at internal positions [51, 58]. For a better quantification, we computed the Kullback-Leibler (KL) distance between the probability distributions of the proximal and internal positions (Table 5). The observation that the KL distance is significantly higher for positions closer to segment borders suggests that amino acids at proximal locations have significantly different distributions from those at internal positions.
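The KL comparison can be sketched as below; the two 20-component amino acid distributions are hypothetical illustrations, not the empirical ones of Table 5:

```python
# Sketch of the KL distance between the amino acid distribution at a
# proximal position and the internal-position distribution.
import math

def kl_distance(p, q):
    """Kullback-Leibler distance D(p || q) = sum_a p(a) log(p(a)/q(a))."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

internal = [1.0 / 20.0] * 20               # flat background, for illustration
proximal = [0.25, 0.15] + [0.6 / 18] * 18  # skewed toward two residues

print(kl_distance(internal, internal))     # identical distributions -> 0.0
print(kl_distance(proximal, internal) > 0)  # -> True
```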
Finally, we performed a χ2-test for proximal positions to identify the correlations between amino acid pairs at various separation distances. As can be seen from the results for α-helix segments (Table 6), the general assumption of segment independence does not hold as statistically significant correlations were observed between residues situated on both sides of the segment borders. For instance, the second amino acid i in the α-helix N-terminal significantly correlates with the previous amino acid at position i - 2, which is outside the segment. This correlation can be caused by physical interactions between nearby residues [51]. Also, the strength of correlation for the i + (downstream) residues was different from the strength observed for i - (upstream) residues (Table 6). This fact indicates an asymmetry in correlation behavior for i+ and i- residues. The parameters of position specific correlations were also computed for β-strand and loop segments (data not shown).
A similar asymmetry in the correlation patterns was also observed at internal positions (data not shown). For instance, for α-helices, the residue at internal position i is highly correlated with the residues at positions i - 2, i - 3, i - 4, i + 2 and i + 4. The parameters of correlation between the residues at positions i and i - 2 differ from the parameters of correlation between the residues at positions i and i + 2.
In the next section, we will refine the probabilistic model needed to determine P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j) using the most significant correlations identified by the statistical analysis.
Reduced dependency model
Correlation analysis allows one to reduce the alphabet size in the likelihood expression (Eq. 9) by selecting only the most significant correlations. The dependence patterns revealed by the statistical analysis are shown in Table 7, divided into panels for α-helix (H), β-strand (E) and loop (L) structures. To reduce the dimension of the parameter space, we grouped the amino acids into three and five hydrophobicity classes; five classes were used only for those positions that have significantly high correlation measures. In Table 7, an entry denoting the dependency of an amino acid at position i on the hydrophobicity class of the amino acid at position i - 1 carries a superscript 3, which represents the total number of hydrophobicity classes.
To better characterize the features that define the secondary structure, we distinguished positions within a segment as well as segments of different lengths. We identified as proximal positions those in which the amino acid frequency distributions significantly deviate from those at internal positions in terms of the KL distance (Table 5). Based on the available training data, we chose 6 proximal positions (N1-N4, C1-C2) for α-helices, 4 proximal positions (N1-N2, C1-C2) for β-strands, and 8 proximal positions (N1-N4, C1-C4) for loops. The remaining positions are defined as internal positions (Int). In addition to position-specific dependencies, we derived separate patterns for segments of different lengths. Table 7 shows the dependence patterns for segments longer than L residues, where L is 5 for α-helices, and 3 for β-strands and loops. For shorter segments, we selected a representative set of patterns from Table 7 according to the available training data.
To fully utilize the dependency structure, we found it useful to derive three separate dependency models. The first model (θ = 1) uses only dependencies on upstream positions (i-); the second model (θ = 2) includes dependencies on upstream (i-) and downstream (i+) positions simultaneously; and the third model (θ = 3) incorporates only downstream (i+) dependencies. For each dependency model, the probability of observing an amino acid at a given position is defined using the dependence patterns selected from Table 7. For instance, according to the second model, the conditional probability of observing an amino acid at position i = N3 of an α-helix segment becomes P_N3(R_i | ·), with the conditioning set selected from Table 7. By multiplying the conditional probabilities selected from Table 7 as formulated in Eq. 9, we obtain the propensity value for the observation of the amino acid segment under the specified model. In the case of the first and third models, this product gives the segment likelihood expression, which is a properly normalized probability value P(R[S_{j-1}+1 : S_j] | S, T); hence, these two are probabilistic models. For the second model, we instead obtain a score Q(R[S_{j-1}+1 : S_j] | S, T) that represents the potential of a given amino acid segment to adopt a particular secondary structure conformation. This scoring system can be used to characterize amino acid segments in terms of their propensity to form structures of different types and, when uniformly applied to compute segment potentials, allows us to implement algorithms following the theory of hidden semi-Markov models. Implementing three different models enables us to generate three predictions, each specializing in a different section of the dependency structure. These predictions can then be combined into a final prediction sequence, as explained in the next section.
The hidden semi-Markov model and computational methods
Amino acid and DNA sequences have been successfully analyzed using hidden Markov models (HMMs) as character strings generated in the left-to-right direction. For a comprehensive introduction to HMMs, see [63].
Here, we consider a hidden semi-Markov model (HSMM), also known as an HMM with duration. This type of model was earlier used in gene-finding methods such as Genie [64], GenScan [65] and GeneMark.hmm [66]. The HSMM technique was introduced to protein structure prediction by Schmidler et al. [14]. In an HSMM, a transition from a hidden state into itself cannot occur, while a hidden state can emit a whole string of symbols rather than a single symbol. The hidden states of the model used in protein secondary structure prediction are the structural states {H, E, L}, designating α-helix, β-strand and loop segments, respectively. The state transitions occur with probabilities P(T_j | T_{j-1}), thus forming a first-order Markov chain. At each hidden state, an amino acid segment with uniform structure is generated according to a given length distribution P(S_j | S_{j-1}, T_j) and the likelihood P(R[S_{j-1}+1 : S_j] | S_{j-1}, S_j, T_j) (Fig. 3).
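The generative process of Fig. 3 can be sketched as follows; the transition probabilities and length sampler are hypothetical placeholders, and only the control flow (no self-transitions, whole-segment emission) mirrors the HSMM:

```python
# Generative sketch of the HSMM: alternate hidden structural states
# {H, E, L}, draw a segment length for each, and record segment borders.
import random

random.seed(0)
trans = {"H": {"E": 0.2, "L": 0.8},
         "E": {"H": 0.1, "L": 0.9},
         "L": {"H": 0.5, "E": 0.5}}  # no self-transitions in an HSMM

def sample_length(state):
    """Stand-in for the empirical length distribution P(S_j - S_{j-1} | T_j)."""
    return random.randint({"H": 4, "E": 2, "L": 1}[state], 12)

def generate(n_segments, start="L"):
    S, T, state, end = [], [], start, 0
    for _ in range(n_segments):
        end += sample_length(state)
        S.append(end)
        T.append(state)
        # draw the next state (never the current one)
        r, acc = random.random(), 0.0
        for nxt, p in trans[state].items():
            acc += p
            if r < acc:
                state = nxt
                break
    return S, T

S, T = generate(5)
```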
Having defined this HSMM, we can consider the protein secondary structure prediction problem as the problem of finding the sequence of hidden states with the highest a posteriori probability given the amino acid sequence. One efficient algorithm to solve this optimization problem is well known. Given an amino acid sequence R, the vector (S, T)* = arg max P(S, T | R) can be found using the Viterbi algorithm. Here lies a subtle difference between the result that can be delivered by the Viterbi algorithm and the result needed in the traditional statement of the protein secondary structure prediction problem.
The Viterbi path does not directly optimize the three-state-per-residue accuracy (Q3):

Q3 = (number of correctly predicted residues / total number of residues) × 100%.    (10)
Also, the Viterbi algorithm might generate many different segmentations, which can carry significant probability mass but are not optimal [14]. As an alternative to the Viterbi algorithm, we can determine the sequence of structural states that are most likely to occur at each position. This approach uses the forward and backward algorithms (posterior decoding) generalized for the HSMM [63]. Although the prediction sequence obtained by the forward and backward algorithms might not be a perfectly valid state sequence (i.e. it might not be realizable given the parameters of the HSMM), the prediction measure defined as the marginal posterior probability distribution (Eq. 14) correlates very strongly with the prediction accuracy (Q3) [14]. The performance of the Viterbi and forward-backward algorithms is compared in Schmidler et al. [14]. Here, the forward and backward variables are defined as follows (n is the total number of amino acids in a sequence):

α_θ(j, t) = Σ_{v=0}^{j-1} Σ_{t'} α_θ(v, t') P(t | t') P(j - v | t) P(R[v+1:j] | v, j, t),    (11)

β_θ(j, t) = Σ_{v=j+1}^{n} Σ_{t'} P(t' | t) P(v - j | t') P(R[j+1:v] | j, v, t') β_θ(v, t').    (12)
The forward variable α_θ(j, t) is the joint probability of observing the amino acid sequence up to position j and a secondary structure segment that ends at position j with type t. Here, θ represents the statistical dependency model. Similarly, the backward variable β_θ(j, t) defines the conditional probability of observing the amino acid sequence in positions j + 1 to n, given a secondary structure segment that ends at position j with type t. Then, the a posteriori probability for the hidden state at position i to be an α-helix, β-strand or loop is computed via all possible segmentations that include position i:

P_θ(t_i = t | R) = (1 / P(R)) Σ_{v=0}^{i-1} Σ_{j=i}^{n} Σ_{t'} α_θ(v, t') P(t | t') P(j - v | t) P(R[v+1:j] | v, j, t) β_θ(j, t).    (13)

The hidden state at position i is identified as the state with maximum a posteriori probability. Finally, the whole predicted sequence of hidden states is defined as:

t_i* = argmax_t P_θ(t_i = t | R), i = 1, ..., n.    (14)
The computational complexity of this algorithm is O(n³). If the maximum size of a segment is limited by a value D, the first summation in Eq. 11 starts at (j - D) and the first summation in Eq. 12 ends at (j + D), reducing the computational cost to O(nD²).
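The duration-bounded forward recursion of Eq. 11 can be sketched as below, computed in log space. The three model callbacks (`log_trans`, `log_len`, `log_seg`) are hypothetical uniform placeholders; only the control flow, in particular the inner sum over previous borders v in [j - D, j - 1], follows the recursion:

```python
# Sketch of the forward recursion (Eq. 11) with a maximum segment size D.
import math

STATES = "HEL"

def forward(n, D, log_trans, log_len, log_seg):
    """alpha[j][t] = log P(R[1:j], segment ending at position j with type t).
    log_trans(t_prev, t), log_len(d, t) and log_seg(v, j, t) are model
    callbacks (placeholders in this sketch)."""
    NEG = float("-inf")
    alpha = [{t: NEG for t in STATES} for _ in range(n + 1)]
    alpha[0] = {t: 0.0 for t in STATES}  # uninformative start, for simplicity
    for j in range(1, n + 1):
        for t in STATES:
            terms = []
            for v in range(max(0, j - D), j):  # summation starts at j - D
                for tp in STATES:
                    if v > 0 and tp == t:
                        continue  # no self-transitions in an HSMM
                    terms.append(alpha[v][tp] + log_trans(tp, t)
                                 + log_len(j - v, t) + log_seg(v, j, t))
            m = max(terms)  # log-sum-exp, for numerical stability
            alpha[j][t] = m + math.log(sum(math.exp(x - m) for x in terms))
    return alpha

# Uniform placeholder model:
u = math.log(1.0 / 3.0)
alpha = forward(n=10, D=5,
                log_trans=lambda tp, t: u,
                log_len=lambda d, t: math.log(0.2),
                log_seg=lambda v, j, t: (j - v) * math.log(0.05))
```

The backward recursion of Eq. 12 is symmetric, scanning right-to-left with the summation ending at j + D.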
Note that the forward and backward variables are computed by multiplying probabilities that are less than 1; as the sequence gets longer, these variables approach zero beyond a certain position. Hence, it is necessary to introduce a scaling procedure to prevent numerical underflow. The scaling for a "classic" HMM is described in [63]. This procedure can easily be generalized to an HSMM, where the scaling coefficients are introduced every D positions.
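A toy illustration of the underflow problem and of the scaling idea (the interval of 25 positions and the probability value are arbitrary choices for the demonstration):

```python
# Multiplying many probabilities < 1 underflows; accumulating a scaling
# coefficient at regular intervals keeps the computation representable.
import math

probs = [0.05] * 300  # 300 residue-level probabilities

naive = 1.0
for p in probs:
    naive *= p
print(naive)  # -> 0.0 (underflow)

scaled, log_scale = 1.0, 0.0
for i, p in enumerate(probs, 1):
    scaled *= p
    if i % 25 == 0:  # introduce a scaling coefficient every D positions
        log_scale += math.log(scaled)
        scaled = 1.0
log_scale += math.log(scaled)
print(log_scale)  # log of the true product, recovered without underflow
```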
This completes the derivation of the algorithm for a single model. Since we utilize three dependency models, θ = 1, 2, 3, it becomes necessary to combine the outputs of the three models with an appropriate function. In our simulations, we implemented averaging and maximum functions to perform this task and observed that the averaging function gives better performance. The final prediction sequence is then computed as:

t_i* = argmax_t (1/3) Σ_{θ=1}^{3} P_θ(t_i = t | R), i = 1, ..., n.    (15)
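The averaging combination can be sketched as follows; the per-position posteriors below are hypothetical numbers standing in for the outputs of the three dependency models:

```python
# Sketch of the model-combination step: average the per-position posterior
# distributions of the three models and take the argmax state.

def combine(posteriors_per_model):
    """posteriors_per_model: one list per model, each holding a
    {state: probability} dict per residue position."""
    n = len(posteriors_per_model[0])
    prediction = []
    for i in range(n):
        avg = {t: sum(m[i][t] for m in posteriors_per_model) /
                  len(posteriors_per_model) for t in "HEL"}
        prediction.append(max(avg, key=avg.get))
    return "".join(prediction)

m1 = [{"H": 0.6, "E": 0.1, "L": 0.3}, {"H": 0.2, "E": 0.5, "L": 0.3}]
m2 = [{"H": 0.5, "E": 0.2, "L": 0.3}, {"H": 0.1, "E": 0.7, "L": 0.2}]
m3 = [{"H": 0.4, "E": 0.1, "L": 0.5}, {"H": 0.3, "E": 0.4, "L": 0.3}]
print(combine([m1, m2, m3]))  # -> "HE"
```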
Iterative model training
To improve the estimation of the model parameters, we implemented an iterative training procedure. Upon obtaining an initial secondary structure prediction for a given amino acid sequence, we re-adjust the HSMM parameters using proteins that have similar structural features and repeat the prediction step. That is, once we obtain the prediction result for a test sequence, we compute the α-helix, β-strand, and loop compositions (the percentages of α-helix, β-strand, and loop predictions). We then remove from the training set those sequences that do not have a similar secondary structure composition. To assess the similarity between the prediction sequence and a training set protein, we compute the absolute value differences of the composition values and apply a fixed threshold. If the differences for all secondary structure types are less than the threshold, then the two sequences are assumed to be similar. The dataset reduction step is followed by the re-estimation of the HSMM parameters and the prediction of the secondary structure (Fig. 4). Note that, in the second and all subsequent iterations, we always start from the initial data set of training sequences and use the predicted sequence to reconstruct the training set. This approach prevents the iterations from sidetracking and converging to an incorrect result.
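The training-set reduction step described above can be sketched as below; the threshold value and the toy sequences are hypothetical choices for illustration:

```python
# Sketch of the dataset reduction: keep only training proteins whose
# H/E/L composition is within a fixed threshold of the composition of
# the current prediction, for every secondary structure type.

def composition(ss):
    n = len(ss)
    return {t: ss.count(t) / n for t in "HEL"}

def reduce_training_set(prediction, training, threshold=0.15):
    pred_comp = composition(prediction)
    kept = []
    for seq_ss in training:
        comp = composition(seq_ss)
        if all(abs(comp[t] - pred_comp[t]) < threshold for t in "HEL"):
            kept.append(seq_ss)
    return kept

prediction = "HHHHHLLLLHHHHHLL"        # helix-rich prediction
training = ["HHHHHHLLLHHHHLLL",        # helix-rich: kept
            "EEEELLEEEELLEEEE"]        # strand-rich: removed
print(len(reduce_training_set(prediction, training)))  # -> 1
```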
Although affinity in structural composition does not guarantee structural similarity, using this measure allows us to reduce the training set to proteins that belong to more closely related SCOP families [67]. Thus, for example, a prediction of a structure from an all-α class is likely to be followed by a training using proteins having high α-helix content. In our simulations, we observed that after several iterations (no more than 3) the predicted secondary structure sequence did not change indicating the algorithm convergence.