Use of a structural alphabet for analysis of short loops connecting repetitive structures

Background: Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we called Protein Blocks (PBs). They allow an accurate description of every region of 3D protein backbones and have been used in local structure prediction. In the present study, we use this structural alphabet to analyze and predict the loops connecting two repetitive structures.

Results: We first analyzed the secondary structure assignments. Use of five different assignment methods (DSSP, DEFINE, PCURVE, STRIDE and PSEA) showed the absence of consensus: 20% of the residues were assigned to different states. The discrepancies were particularly important at the extremities of the repetitive structures. We used PBs to describe and predict the short loops because they can help analyze and in part explain these discrepancies. An analysis of the PB distribution in these regions showed some specificities in the sequence-structure relationship. Of the amino acid over- or under-representations observed in the short loop databank, 20% did not appear in the entire databank. Finally, predicting 3D structure in terms of PBs with a Bayesian approach yielded an accuracy rate of 36.0% for all loops and 41.2% for the short loops. Specific learning in the short loops increased the latter by 1%.

Conclusion: This work highlights the difficulties of assigning repetitive structures and the advantages of using more precise descriptions, that is, PBs. We observed some new amino acid distributions in the short loops and used this information to enhance local prediction. Instead of describing entire loops, our approach predicts each position in the loops locally. It can thus be used to propose many different structures for the loops and to probe and sample their flexibility. It can be a useful tool in ab initio loop prediction.


Background
Since the first descriptions of protein structures by Pauling and Corey [1,2], their repetitive secondary structures have been widely analyzed. They have been studied from two principal points of view: assignment and prediction. Different approaches can be used for assigning secondary structures to a 3D protein structure. The most common is DSSP [3], which is based on hydrogen bonding patterns. STRIDE [4] relies on the same criteria with slightly different parameters and computes backbone dihedral angles. DEFINE [5] uses an inter-Cα distance matrix that corresponds to ideal repetitive secondary structures. PCURVE [6] is based on the helicoidal parameters of each peptide unit and generates a global peptide axis. Finally, PSEA [7] bases its assignments only on the Cα position, using distance and angle criteria. Not surprisingly, these methods do not assign the same state to all residues, especially those located at the beginning and end of repetitive structures. For instance, DSSP, DEFINE and PCURVE only assign 65% of residues to the same state [8].
Several prediction methods have been developed [9], and accuracy rates climb to 80% with neural networks and sequence homology [10]. Secondary structures do not, however, entirely describe the 3D protein structure. Coils account for more than 40% of residues. In the conventional 3-state description, they are associated with only one state, defined as non-helicoidal and non-extended. The coil state is in fact composed of really distinct local folds, such as turns [11]. Several studies have attempted to analyze loops [12,13] and predict their conformations [14], but they still fail to take a significant portion of residues into account.
Protein structure descriptions that use a library or set of small prototypes, i.e., N states rather than the conventional three, can help improve definitions of these regions and may also improve prediction. Such a library constitutes a structural alphabet [15,16] and is composed of structural prototypes. Because these describe all the local folds, repetitive structures as well as coils, they allow a better approximation of the entire protein structure. Thus, they can be used to reconstruct protein structures [17] or to predict the local structure [18]. In a previous study, we defined a structural alphabet composed of 16 protein fragments, each 5 residues in length, called Protein Blocks (PBs, cf. Figure 1) [19]. They have been used to describe 3D protein backbones [20-22] and to predict local structures [19,23]. Our structural alphabet is particularly informative [24] and is thus useful for pre-processing before ab initio and new fold prediction.
We focus here on the study of small loops that connect two repetitive structures. We first analyze the classic secondary structure assignments with the five above-mentioned methods. Secondly, we describe the short loops with our structural alphabet and analyze the sequence-structure relationship in these local structures. Finally, we make local predictions based on the amino acid sequences.

Secondary structure assignments
As noted by Woodcock et al. [25], a serious problem raised by the variety of methods for secondary structure assignment is that they often yield differing results. A consensus method has been proposed to lessen this effect [8]. Here we used an agreement rate, denoted as C 3 , which is the proportion of residues associated with the same state. Table 1 summarizes the correspondence between the secondary structure assignments from the five methods. It clearly highlights three points: (i) with its default parameters, DEFINE yielded results very different from the other methods, as shown by its C 3 values, close to 62%; (ii) DSSP and STRIDE produced nearly identical assignments, with C 3 equal to 95%. Of the remaining assignments, 4% corresponded to confusion between α-helices and coils, and the remaining 1% to confusion between β-strands and coils; (iii) all the other comparisons gave a mean C 3 of 80%, with 6-7% confusion between α-helices and coils and 12-13% between β-strands and coils.
In addition, DEFINE was the only method to confuse α-helices and β-strands. This confusion ranged from 2% to 5% between DEFINE and the other methods, while for all other comparisons, it was less than 0.05%. These results did not change when β-strands were described by 'E' (extended strand participating in a β-ladder) and 'B' (residue in isolated β-bridge) [9] labels for DSSP and STRIDE rather than only 'E'.
These results show the difficulties related to defining an appropriate length for α-helices, β-strands and coils and locating their ends [26]. These inaccuracies in defining the repetitive structures have direct repercussions on the definition of loops. Figures 2 and 3 use the example of the ribosomal protein S15 from Bacillus stearothermophilus (PDB code 1A32; another example, proto-oncogene Mtcp-1, PDB code 1A1X, is given [see Additional file 1 and Additional file 2]) to show the multiple secondary structure assignments that can ensue. Here, 79% of the residues are assigned to the same state, rather more than for many other proteins. The repetitive structure caps remain quite confusing (cf. Figure 3), however, despite good agreement. For instance, the C-cap of the first helix is defined over three residues, depending on the assignment method (positions 13 to 15). The connecting zone between helices 2 and 3 is fuzzy. DSSP and STRIDE assign positions 44-48, PSEA 45-50 and PCURVE 45-47 as coils whereas DEFINE assigns positions 47-50 as a small β-strand. In this example, we see that the 16 Protein Blocks (PBs), labeled a-p, describe every part of the protein structures specifically. This description includes the repetitive structures, their edges, and the coils that the secondary structures define only as non-helicoidal and non-extended.

[Figure caption: Each prototype is five residues in length and corresponds to eight dihedral angles (φ,ψ). The PBs m and d can be roughly described as prototypes for the central α-helix and the central β-strand, respectively. For each PB, the N-cap extremity is on the left and the C-cap on the right.]

Protein Blocks
Each PB is a fragment five residues long that corresponds to a local fold and is defined by eight dihedral angles. PBs m and d correspond roughly to the core of α-helices and β-strands, respectively. In the example in Figure 2, several series of PB m accurately describe the helix cores. Where a secondary structure assignment method assigned a β-strand (PSEA, positions 14-17, DEFINE positions 16-18 and 47-50), the PB assignment gave PB b, or PB c and e, all close to β-strand geometry. Thus, PBs may explain the ambiguity of the assignments. In this case, PBs b, c and e can take the variability of the β-strand into account. The structural alphabet was more structurally informative (16 states instead of 3 states) and better approximated the protein backbone. It is thus a relevant alternative for describing loops.

Describing short loops in terms of Protein Blocks
Loops are defined as protein fragments that connect two series of PBs m and/or d and contain no repetitive PBs. Short loops have a length of 2 to 6 PBs. The short loop databank contains 3,319 fragments: 644 for mm/mm, 801 for dd/dd, 989 for dd/mm and 886 for mm/dd. Table 2 summarizes the properties of the PBs in the overall databank as well as in the loop and short loop databanks. We focused on the frequency of occurrence of PBs in these regions and on the main transitions between successive PBs, since previous studies observed only a limited number of transitions [19,23]. Table 2 points out the specificities of the transitions of some PBs in the short loops (for comparison, this information on the PBs in the complete databank is given [see Additional file 3]).
We observed that PBs k, l, n, o and p were relatively more specific to short loops. Their frequencies were 1% higher in short loops than in all loops. Inversely, the frequency of PB b dropped from 9.0% in all loops to only 3.7% in short loops. Moreover, it was slightly less frequent in the short loops than in the overall databank (4.4%). The frequencies of the other PBs were the same in loops and short loops.
The transition frequencies between successive PBs varied substantially between the complete databank and the short loops. We noted three main categories. (i) The principal transitions became more pronounced for most PBs (i.e., 11). For example, the transition from PB a to PB c increased by more than 20% (50.9% versus 71.8%), c to d more than 20%, e to h more than 10%, f to k (24%) and l to m (15%). For PBs h, i, k, n, o, and p, the increase was smaller, ranging from +2 to +10%. (ii) For two PBs, the first preference transitions were inverted. The second most common transition of PB g (PB c) in the databank took over first place for short loops, and its frequency climbed from 28.0% to 39.7%. PB j was the fuzziest PB (rmsd = 0.74 Å) and had a high number of "main" transitions (6 with a transition rate greater than 10%). In the short loops, its third most common transition, PB l, became first (and its rate went from 16.1% to 25.0%). (iii) PB b

[Caption: Comparison of methods for secondary structure assignment]
[Caption: Representation of the secondary structure assignments]

We analyzed the distribution of the classic secondary structures in our short loop definitions. The secondary structure assignments (with PSEA [7]) changed substantially from their distribution in the entire databank. The frequency of PBs a, c and e in β-strands increased by 5%, 2% and 9%, respectively, and the frequency of PBs k, l, n and o in α-helices by more than 12%. The frequency of PB b in coils increased from 85.4% to 95.8%. Other methods of secondary structure assignment yield similar results.

Figure 4 reports the amino acid occurrence matrices, normalized into Z-scores, and their asymmetric Kullback-Leibler index (KLd [27]) for two PBs, c and l, calculated from the complete databank and from the short loop set. The PBs are five residues in length (noted from -2 to +2 and centered in 0).
We showed in a previous study [19] that prediction can be improved only by enlarging the sequence window to 15 residues (noted from -7 to +7 and still centered in 0). We therefore computed the occurrence matrices for fragments of 15 residues. Positive Z-scores (respectively negative) correspond to overrepresented (respectively underrepresented) amino acids and provide information for each amino acid at each position. The KLd analyzes the contrast between the amino acid distribution observed in a given position of the occurrence matrix and the reference amino acid distribution in the protein set. Hence, it measures the sequence information content and highlights the most informative positions.

Analysis of the sequence-structure relationship
PB l behaved distinctively. Its amino acid distributions in the short loop set differed from those in the entire databank (cf. Figures 4b and 4f). The informative region was restricted to only three positions (-2, -1, 0) with KLd values of 0.23, 0.13 and 0.13 respectively (cf. Figure 4d). In the short loops, position (+2) increased significantly, to 0.11 and became equivalent to position (-1). Position (0) lost specificity (-0.01), but position (-2) remained most specific, increasing to 0.03 (cf. Figure 4h).

Table 3 summarizes the 149 amino acid over- and underrepresentations observed in the short loop set, fewer than in the overall databank. This was due mainly to the number of occurrences, by definition lower in the short loops. Nevertheless, 20% of the significant amino acids had not previously been found. Nearly all PBs had at least one amino acid over- or underrepresented. As expected, in most cases, it was glycine (9 times), although 8 other types of amino acids were involved. We note two specific examples: (i) the overrepresentation of methionine in position (+1) of PB p (the only methionine overrepresented in all the short loops), and (ii) the underrepresentation of glycine in position (-2) of PB f, although it was overrepresented in the global distribution.

Table 4 summarizes the predictions. A training set corresponding to 2/3 of the dataset was used to learn the sequence-structure relationship for all predictions, and a test set corresponding to the remaining 1/3 to evaluate the results. We ran three different sets of predictions: the first two used occurrence matrices computed from the complete databank, and the third, matrices computed only from the short loop regions. We computed Q 16 and Q 14 ratios to analyze the quality of the predictions. Q 16 corresponds to the total number of true predicted PBs over the total number of predicted PBs. The Q 14 value is specific for loops, i.e., PB m and d are not taken into account.
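The two accuracy ratios can be illustrated with a minimal Python sketch. The function name `q_score` and the toy PB strings are hypothetical, and the sketch assumes that "not taken into account" means skipping the positions whose true PB is m or d; the source does not spell out this detail.

```python
def q_score(true_pbs, predicted_pbs, excluded=""):
    """Fraction of correctly predicted PBs.

    Positions whose TRUE PB is listed in `excluded` are skipped
    (assumption: this is how PBs m and d are left out of Q14).
    """
    if len(true_pbs) != len(predicted_pbs):
        raise ValueError("both PB sequences must have the same length")
    pairs = [(t, p) for t, p in zip(true_pbs, predicted_pbs)
             if t not in excluded]
    if not pairs:
        raise ValueError("no position left to score")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy data (hypothetical, not taken from the databank).
true_pbs = "mmmmklmmddfkl"
pred_pbs = "mmmmklpmddbkl"

q16 = q_score(true_pbs, pred_pbs)                 # all 16 PBs
q14 = q_score(true_pbs, pred_pbs, excluded="md")  # loop PBs only
print(f"Q16 = {q16:.3f}, Q14 = {q14:.3f}")
```

In this toy case 11 of 13 positions are correct overall, and 4 of the 5 positions whose true PB is neither m nor d are correct.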

Predicting with PBs in the short loops
The first prediction (init) is the conventional Bayesian prediction, run with all 16 PBs. It yielded a global prediction rate Q 16 equal to 35.2%. This value is close to that in our previous study (Q 16 = 34.4% [19]) and far superior to the value of 7.5% obtained with random assignment. The Q 14 value equals 36.0% for both the short and long loops. This computation shows that the non-repetitive PBs were predicted as accurately as the PBs m (39.3%) and d (27.7%). Prediction was thus not biased in favor of the most populated blocks.

Discussion
We have observed that the secondary structure assignment methods can produce highly discordant results. In most cases, only 80% of the residues are assigned to the same state. The capping regions of repetitive secondary structures are particularly mismatched. The difficulties of describing clearly repetitive regions have often been pointed out [28-30].
PBs allow more precise description than do the secondary structures. In addition, they overlap. Accordingly, a small modification of PB assignment has fewer consequences than changing a secondary structure assignment; for example, a PB m is relatively similar to a PB n whereas an α-helix should be highly distinct from a coil. Analysis of series of PBs demonstrates their structural relevance [23]. All these points justify the use of our structural alphabet to describe and analyze short loops. A recent approach has shown that most short loop fragments can be approximated correctly in the Protein Data Bank [31].
The behavior of PB b in short loops differs from that in all loops: it appears to be a β-strand N-cap mainly involved in long loops. This point may partly explain its poor prediction rate in the short loops. Similarly, we observe that most of the rates of leading transitions are lower in the complete databank than in the short loops. This indicates that the less frequent transitions are associated with longer loops, i.e., fragments of more than 6 PBs.
Analysis of the sequence-structure relationship shows that most of the PBs in short loops have specific amino acid distributions that differ in many cases from the reference PB distribution. Nonetheless, as noted with PB l (see Figure 4), some positions lose amino acid specificity.
Because of the limited number of short loops in our nonredundant databank, we ran three different sets of predictions so that we could carefully observe the behavior of the PBs. (i) The global prediction shows that the loops were predicted as accurately as the repetitive structures (Q 16 = 35.2% and Q 14 = 36.0%), i.e., this method did not introduce artificial bias resulting in preferential prediction of repetitive regions. (ii) The sequence-structure relationship in the short loops was strongly deterministic and thus significantly improved the prediction (Q 14 = 41.2%). The use of the global occurrence matrices, however, induced an imbalance in the prediction of certain PBs: PBs associated with the repetitive PB m were favored over other PBs mainly associated with the coil state. (iii) Accordingly, a specific approach dedicated to the short loops yielded more accurate predictions, better balanced between the different PBs (Q 14 = 42.3%), with no particular bias.
PB j is the only PB for which results really suffer with this approach. It is the least frequent PB and the most variable. Consequently, its poor prediction rate may be explained by the lack of information about it in the databank. We also noted substantial over-fitting for this PB (a difference of more than 20% between the learning set and the validation set), much higher than for the other blocks.
One advantage of such an approach is that it enables us to compute the most significant series of PBs and from this information propose alternative 3D candidate structures. Figure 5 shows an example of short loop prediction with the PB probabilities associated with a given sequence window and the corresponding possible 3D structures.

Conclusions
Loop prediction, despite the considerable work devoted to it and the numerous methods developed, remains a difficult research topic [14,32,33]. Prediction methods are often used in comparative modeling and propose one "complete" loop [14,33]. Here, instead of describing entire loops, we predict each position of the loops locally. This Bayesian approach can be used to propose not just one, but many different loops. Because each PB at each position is associated with a corresponding probability score, correlated in turn with the prediction accuracy [19,23]

[Figure caption: Example of prediction: scoring PB combinations]

Data sets
The main set of proteins (PAPIA), based on the PAPIA/PDB-REPRDB database [40], comprises 717 protein chains and 180,854 residues [41]. It has been used in previous work [23] and is available at http://www.ebgm.jussieu.fr/~debrevern. The set contains no more than 30% pairwise sequence identity. The selected chains have X-ray crystallographic resolutions less than 2.0 Å and an R-factor less than 0.2. Each structure selected has an rmsd value larger than 10 Å from all representative chains. Each chain was carefully examined with geometric criteria to avoid bias from zones with missing density. An updated databank has been built with the same criteria; it is composed of 1,403 proteins and 320,005 residues.

Protein Blocks
They correspond to a set of 16 local prototypes, labeled from a to p (cf. Figure 1), 5 residues in length and based on Φ, Ψ dihedral angle description [19]. They were obtained by an unsupervised classifier similar to Kohonen Maps [38] and Hidden Markov Models [39]. The PBs m and d can be roughly described as prototypes for central α-helices and central β-strands, respectively. PBs a through c primarily represent β-strand N-caps and PBs e and f, C-caps; PBs g through j are specific to coils, PBs k and l to α-helix N-caps, and PBs n through p to C-caps. This structural alphabet allows a reasonable approximation of local protein 3D structures [19,23] with a root mean square deviation (rmsd) now evaluated at 0.42 Å.

Short loop description
We defined the short loops as PB series 2 to 6 PBs long. These series must be composed of non-repetitive PBs, i.e., all PBs except d and m. They must have flanking regions composed of series of PBs mm and/or dd.
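As an illustration of this definition, the following Python sketch extracts short loops from a protein encoded as a string of PB letters. The function name `find_short_loops` and the example string are hypothetical; the sketch takes "flanking regions composed of series of PBs mm and/or dd" to mean at least two consecutive m or two consecutive d on each side.

```python
import re

# 2 to 6 non-repetitive PBs (any letter a-p except d and m),
# preceded and followed by "mm" or "dd" (lookarounds keep flanks unconsumed).
SHORT_LOOP = re.compile(r"(?<=mm|dd)[a-ce-ln-p]{2,6}(?=mm|dd)")


def find_short_loops(pb_sequence):
    """Return (start index, loop PBs, flank type) for every short loop."""
    loops = []
    for match in SHORT_LOOP.finditer(pb_sequence):
        start, end = match.span()
        flanks = pb_sequence[start - 2:start] + "/" + pb_sequence[end:end + 2]
        loops.append((start, match.group(), flanks))
    return loops


# Hypothetical PB string: two short loops and one loop that is too long.
example = "mmmmnopacddddfklmmmmghiacfklpmmdd"
for start, loop, flanks in find_short_loops(example):
    print(start, loop, flanks)
```

The nine-PB stretch `ghiacfklp` in the example is skipped because it exceeds the 6-PB limit.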

Agreement rate
To compare two distinct secondary structure assignment methods, we used an agreement rate denoted C 3 and defined as the proportion of residues associated with the same state (α-helix, β-strand and coil).
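A minimal Python sketch of this agreement rate follows. The function name `agreement_rate` and the two assignment strings are hypothetical; the sketch assumes each method's output has already been reduced to one letter per residue (H for α-helix, E for β-strand, C for coil).

```python
def agreement_rate(assignment_a, assignment_b):
    """C3: proportion of residues given the same 3-state label by two methods."""
    if len(assignment_a) != len(assignment_b):
        raise ValueError("assignments must cover the same residues")
    if not assignment_a:
        raise ValueError("empty assignment")
    same = sum(a == b for a, b in zip(assignment_a, assignment_b))
    return same / len(assignment_a)


# Toy 3-state assignments (hypothetical): the two methods disagree
# only at the ends of the repetitive structures.
method_1 = "CHHHHHHHCCCEEEEECC"
method_2 = "CCHHHHHHHCCEEEECCC"

c3 = agreement_rate(method_1, method_2)
print(f"C3 = {100 * c3:.1f}%")
```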

Z-score
The amino acid occurrences for each PB were normalized into a Z-score:

$$Z(n_{i,j}) = \frac{n_{i,j}^{obs} - n_{i,j}^{th}}{\sqrt{n_{i,j}^{th}}}$$

with $n_{i,j}^{obs}$ the number of times amino acid i was observed in position j for a given PB and $n_{i,j}^{th}$ the number expected. The product of the number of observations in position j and the frequency of amino acid i in the entire databank equals $n_{i,j}^{th}$. Positive Z-scores (respectively negative) correspond to amino acids that are overrepresented (respectively underrepresented); threshold values of 4.42 and 1.96 were chosen (probability less than $10^{-5}$ and $5 \times 10^{-2}$, respectively).
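The normalization can be sketched in Python as follows, assuming the usual form for count data: observed minus expected, divided by the square root of the expected count. The function names and the numerical values are hypothetical and serve only as a worked example.

```python
import math


def z_score(n_obs, n_position, freq_db):
    """Z-score of one amino acid at one position of a PB occurrence matrix.

    n_obs      : observed count of the amino acid at this position
    n_position : total number of observations at this position
    freq_db    : frequency of the amino acid in the entire databank
    """
    n_th = n_position * freq_db  # expected count
    return (n_obs - n_th) / math.sqrt(n_th)


def significance(z, strong=4.42, weak=1.96):
    """Label a Z-score with the two thresholds quoted in the text."""
    sign = "over" if z > 0 else "under"
    if abs(z) >= strong:
        return f"strongly {sign}represented"
    if abs(z) >= weak:
        return f"{sign}represented"
    return "not significant"


# Hypothetical numbers: 1,000 observations at the position, a databank
# frequency of 8%, and 120 observed occurrences (80 expected).
z = z_score(n_obs=120, n_position=1000, freq_db=0.08)
print(round(z, 2), significance(z))
```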

Asymmetric Kullback-Leibler measure
The Kullback-Leibler measure or relative entropy [27], denoted by KLd, makes it possible to compute the contrast between two amino acid distributions, i.e., that observed in a given position j and the reference distribution in the protein set (DB). The relative entropy $KLd(j \mid PB_x)$ in the site j for the block $PB_x$ is expressed as:

$$KLd(j \mid PB_x) = \sum_{i=1}^{20} P(aa_j = i \mid PB_x) \, \ln\left(\frac{P(aa_j = i \mid PB_x)}{P(aa_j = i \mid DB)}\right)$$

where $P(aa_j = i \mid PB_x)$ is the probability of observing the amino acid i in position j (j = -w, ..., 0, ..., +w) of the sequence window, given protein block $PB_x$, and $P(aa_j = i \mid DB)$ the probability of observing the same amino acid in the databank (named DB).
Thus, it enables us to detect the "informative" positions in terms of amino acids for a given protein block [19].
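The following Python sketch computes this relative entropy for one position. It assumes the natural logarithm and, for brevity, uses a reduced three-letter alphabet with hypothetical frequencies; the function name `kld` is also an assumption.

```python
import math


def kld(p_position, p_db):
    """Relative entropy between the amino acid distribution observed at one
    position of a PB and the reference databank distribution.

    Both arguments map amino acid -> probability. Terms with a zero observed
    probability contribute nothing, by the usual 0 * ln(0) = 0 convention.
    """
    return sum(p * math.log(p / p_db[aa])
               for aa, p in p_position.items() if p > 0)


# Reduced alphabet with hypothetical frequencies (a real matrix has 20 rows).
p_db = {"G": 0.25, "A": 0.25, "L": 0.50}
p_pos = {"G": 0.50, "A": 0.25, "L": 0.25}

print(round(kld(p_pos, p_db), 4))  # informative position: value above 0
print(kld(p_db, p_db))             # identical distributions give 0.0
```

A position whose amino acid distribution matches the databank yields 0, so larger values flag the more informative positions of the window.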

Prediction
In a strategy of structure prediction from sequence [19,23], we must compute for a given sequence window $S_{aa} = \{aa_{-w}, \ldots, aa_0, \ldots, aa_{+w}\}$ the probability of observing a given protein block $PB_x$, i.e., $P(PB_x \mid S_{aa})$. For this purpose, each PB is associated with an occurrence matrix of dimension l × 20 centered upon the PB, with l = 2w + 1 (in the study, w = 7). Using the Bayes theorem to compute this a posteriori probability $P(PB_x \mid S_{aa})$ from the conditional probability $P(S_{aa} \mid PB_x)$ deduced from the occurrence matrix allows us to define the odds score $R_x$:

$$R_x = \frac{P(PB_x \mid S_{aa})}{P(PB_x)} = \frac{P(S_{aa} \mid PB_x)}{P(S_{aa})}$$

The highest score $R_x$ corresponds to the most probable PB [19,23]. The Q 16 value computed is the total number of true predicted PBs over the total number of predicted PBs. We also computed a Q 14 value, specific for loops, i.e., the PB m and d are not taken into account in the accuracy rate computation.
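The scoring step can be sketched in Python under stated assumptions: positions of the window are treated as independent, so that the probability of the window given a PB is the product of the per-position frequencies, and the probability of the window alone is the product of the databank frequencies. The function names, the reduced three-letter alphabet, the window of 3 residues (w = 1 instead of 7) and all frequencies are hypothetical.

```python
import math


def log_odds_score(window, freq_matrix, freq_db):
    """ln(R_x) for one PB, assuming independent window positions.

    window      : amino acid string of length l = 2w + 1
    freq_matrix : list of l dicts, amino acid -> frequency given the PB
    freq_db     : dict, amino acid -> frequency in the databank
    """
    return sum(math.log(freq_matrix[j][aa] / freq_db[aa])
               for j, aa in enumerate(window))


def predict_pb(window, matrices, freq_db):
    """Return the PB with the highest odds score and all log scores."""
    scores = {pb: log_odds_score(window, matrix, freq_db)
              for pb, matrix in matrices.items()}
    return max(scores, key=scores.get), scores


# Toy frequencies (hypothetical) for two PBs over a 3-residue window.
freq_db = {"G": 0.2, "A": 0.4, "L": 0.4}
matrices = {
    "m": [{"G": 0.1, "A": 0.5, "L": 0.4},
          {"G": 0.1, "A": 0.5, "L": 0.4},
          {"G": 0.1, "A": 0.4, "L": 0.5}],
    "l": [{"G": 0.6, "A": 0.2, "L": 0.2},
          {"G": 0.3, "A": 0.4, "L": 0.3},
          {"G": 0.2, "A": 0.4, "L": 0.4}],
}

best, scores = predict_pb("GAL", matrices, freq_db)
print(best, {pb: round(s, 3) for pb, s in scores.items()})
```

Working with logarithms avoids numerical underflow when the product runs over 15 positions. Sorting the scores instead of keeping only the maximum gives a ranked list of candidate PBs for each position.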