Directionality in protein fold prediction

Ellis, Jonathan J; Huard, Fabien PE; Deane, Charlotte M; Srivastava, Sheenal; Wood, Graham R

doi:10.1186/1471-2105-11-172

Research article
Open access
Published: 07 April 2010

BMC Bioinformatics volume 11, Article number: 172 (2010)

Abstract

Background

Ever since the ground-breaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein.

Results

A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense of translation, from the C-terminus to N-terminus. Two secondary results emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation and second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta.

Conclusions

There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.

Background

The purpose of this paper is to investigate whether directionality of synthesis can have an impact on the accuracy of protein structure prediction. To do this, a sequential structure prediction algorithm, based on Rosetta, the most successful free modelling method of our time, was developed and used to predict structure, first starting from the N-terminus and then starting from the C-terminus. Free modelling protein structure prediction methodology has improved in recent years, but is still not accurate enough to be considered satisfactory (see results of CASP6 [1] and CASP7 [2, 3] and the more recent CASP8 [4]). Given the noisy nature of current free modelling structure prediction techniques, the pairwise comparison design used here appears to be required; it succeeded in detecting a consistent directionality effect. We begin, however, by summarizing the area.

Almost fifty years ago Anfinsen et al. [5, 6] showed that denatured small globular proteins could refold to their native state. On the other hand, experimentalists have known for many years that cotranslation can play an important role in protein folding [7–12]. Polypeptides are synthesized sequentially, and translation can occur at variable rates according to codon speed [13–17]. In Escherichia coli, for example, translation proceeds on the order of 0.05 s/codon [13, 18–20]. By contrast, helices and sheets have been shown to fold on the low millisecond timescale [21–23]. Therefore, some proteins fold faster than they elongate, and it is reasonable to assume that nascent chains can adopt secondary or tertiary structures cotranslationally. Experimental evidence for cotranslational folding dates back to the 1960s, with an in vivo study reporting that ribosome-bound β-galactosidase showed enzymatic activity [24]. More recently it has been shown that the Semliki Forest Virus Protein (SFVP), which contains a protease domain that folds to autocatalytically cleave the protein from a larger polyprotein precursor, gains its enzymatic activity before complete synthesis of the polyprotein [25]. Moreover, the rapid cotranslational folding of SFVP does not require additional cellular components [26].

In addition to enzymatic activity whilst still bound to the ribosome, intermediate stages of cotranslational folding may have native-like structures. α-globins of various lengths have been shown to have specific heme binding activity on several truncated ribosome-bound nascent chains. The shortest of these contained only the first 86 residues (from a total of 147 residues), demonstrating that the nascent chain has native-like structure [27]. NMR studies of nascent chains containing tandem Ig domains and still attached to the ribosome revealed that the N-terminal domain folds to its native state while the C-terminal domain is largely unfolded and flexible [28]. Recent molecular dynamics simulations also conclude that small peptides may adopt a conformation that is similar to the one adopted in full proteins [29]. The discovery of the formation of disulphide bonds in nascent immunoglobulin peptides also confirms the ability of proteins to begin to fold whilst they are being synthesized [30, 31].

As well as adopting native-like conformations while still attached to the ribosome, there is evidence that peptides can begin to fold whilst still in the ribosomal exit tunnel. Analysis of the ribosomal exit tunnel reveals that peptides can traverse the tunnel in an α-helical conformation [32], but that at no point is the tunnel big enough to accommodate structures larger than α-helices [33, 34]. Peptides are not restricted to an α-helix, however, and may adopt more extended conformations [35]. Analysis of the exit tunnel has also shown that the tunnel can entropically stabilize α-helical conformations as they pass through [36].

The rate of in vitro refolding has often been observed to be slower than the corresponding rate in vivo [37, 38]. Cotranslation has been studied in the bacterial luciferase αβ heterodimer, and the formation of the heterodimer is faster when the β monomer is translated in the presence of the folded α monomer than when the β monomer is refolded from a denatured state [38]. This shows that, under cotranslational folding, the β monomer is able to obtain a conformation that is more receptive to the formation of the dimer, thus avoiding kinetic traps associated with refolding from a denatured state [39]. Native-like structure has also been observed in cotranslationally folding monomeric firefly luciferase; again, cotranslational and in vitro folding pathways appear to be different, with cotranslational folding being faster [40]. Cotranslational folding in P22 tailspike protein has been shown to guide the peptide away from aggregation-prone conformations that are frequently encountered when refolding in vitro, leading to the hypothesis that cotranslational folding could be an efficient strategy for the folding of β-sheet topologies, and for large, multidomain proteins in general [41]. One possible explanation for this is that the peptide begins to fold while still attached to the ribosome [42, 43]. Another possible explanation is the existence of additional folding machinery contained in the cell; however, only approximately 20% of proteins associate, for example, with chaperones [44, 45]. The removal of major chaperones, such as DnaK and Hsp70, in E. coli has no adverse effect on cell growth or viability [46, 47]. This suggests that chaperones alone cannot account for the higher folding rates observed in vivo.

Complementing these experimental findings, computational models of cotranslational folding have also been explored; an early, incidental use of this idea appears in [48]. Simple computational models of protein folding incorporating cotranslation demonstrate that such folding favours local contacts in intermediate and final folds [49, 50]. More recently the effect of energy barriers on simple cotranslational models was studied, and it was found that the ground state of proteins folded sequentially was not necessarily the one of lowest energy [51]. Computational models have provided evidence that nascent chains may adopt partial structures similar to the corresponding parts of the complete protein [52]. Other lattice studies present a differing view of cotranslation where nascent peptides can remain largely unstructured until the final stages of synthesis (estimated to be when 90% or more of the protein has been extruded) [53]. This finding is dependent on the involvement of the C-terminus in tertiary interactions, and may not be applicable to all proteins. There is also evidence arising from lattice models that cotranslational folding pathways and refolding pathways are different [53]. Computational simulations of real proteins folding cotranslationally compared to refolding from a denatured state show mixed results. Chymotrypsin inhibitor 2 (CI2) and barnase were shown to fold mostly posttranslationally, with intermediates similar to those observed in refolding [54]. An alternative computational, cotranslational approach using dynamic optimisation in [55] found that major elements of the CI2 tertiary structure only form when the amino acid string is fully translated. For SFVP, which is known to fold cotranslationally [25], different pathways were taken during synthesis to those taken when folding from a denatured state [54]. A further promising approach is found in [56]. Pathways which minimize the difficulty of folding to the native state (for example, those which avoid having the chain pass through an opening) are found; results indicate that earlier folding is more likely around the N-terminus than the C-terminus, pointing to an asymmetry of the folding process that is confirmed in the current work.

Finally, there is also evidence of cotranslational protein folding that arises from numerical summaries of known protein structures. An analysis of structures in the Protein Data Bank (PDB) found that residues are, in general, closer to previously synthesized residues than those synthesized later, and that the N-terminal region was more compact than the C-terminal region [57]. It was argued that this provided evidence of cotranslational folding; however, these findings were contradicted by a later analysis of a larger set of proteins [58]. In the second study it was observed that the C-terminal regions were more compact and contained greater numbers of local contacts than the N-terminal regions. Further analysis that considered topological accessibility (the ability of a protein to fold from a given residue as a starting point using only local contacts) found this to be more evident towards the N-terminus in the α/β class of proteins [59]. In a similar vein, Deane et al. [60] developed a measure of previous contacts which assesses the extent to which the chain forms contacts with previously extruded residues. They also found that the α/β class and ancient folds [61] exhibited such evidence of cotranslation.

To date, protein structure prediction methods do not incorporate cotranslational effects. This paper describes such an algorithm and evaluates its performance. This evaluation reveals that, in more than 94% of cases, a sequential algorithm that follows the sense of translation, that is, from N-terminus to C-terminus, is more accurate than an algorithm that follows the reverse sense, from C-terminus to N-terminus. The success of the sequential algorithm is greater the more the target shows evidence of cotranslational folding. It is also found that a sequential algorithm can match, and on occasion better (in 51% of proteins tested), the performance of a leading non-sequential protein structure prediction algorithm, namely Rosetta.

Methods

Structure prediction algorithms

A sequential algorithm (SAINT, a Sequential Algorithm Initiated at the Nitrogen Terminus) was developed and used to predict the structure of a number of proteins. This algorithm uses the Rosetta program [62] (version 2.1.0), extending it to incorporate cotranslational aspects of protein folding. To investigate the importance of following the direction of translation, the sequential algorithm was adapted to predict the structure of proteins produced in the reverse direction, from the C-terminus to the N-terminus. Predictions from the sequential and reverse sequential algorithms were compared with each other, and both were in turn compared with predictions made using an unmodified version of Rosetta. These algorithms are now described.

Sequential algorithm

SAINT extends the peptide by a nine residue fragment at each iteration, starting with the N-terminus. Each fragment is added in a fully extended conformation (ϕ = -150°, ψ = 150° and ω = 180°). The final fragment may contain fewer than nine residues; it will contain as many residues as are required to complete the full protein chain. At each extension the peptide is allowed to fold and the conformation reached is used as the starting structure for the next extension, with Rosetta ab initio used to perform the structure predictions at each stage. In order to make comparisons between the sequential and non-sequential algorithms fair, each uses the same total number of cycles. For the sequential algorithm these cycles were distributed evenly amongst each extension of the peptide with the number of cycles calculated as follows. If b is a base number of cycles and l is the protein length then the total number of cycles t is b(l/100) and the number of extrusions e is ⌈l/9⌉. This results in n = ⌊t/e⌋ cycles for the first e - 1 extrusions and t - n(e - 1) cycles for the final extrusion.
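The cycle bookkeeping described above can be sketched in a few lines. This is only an illustration of the stated formulas, not the authors' implementation: the function name `extension_schedule` is hypothetical, the default `base_cycles=34000` is the base number quoted for Rosetta in this study, and truncating t to an integer is an assumption, since the text does not say how a fractional total is handled.

```python
def extension_schedule(length, base_cycles=34000, fragment=9):
    """Return (fragment_sizes, cycles_per_extension) for a chain of `length` residues.

    Hypothetical helper following the formulas in the text:
    t = b(l/100), e = ceil(l/9), n = floor(t/e).
    """
    total = base_cycles * length // 100              # t (integer truncation is an assumption)
    extrusions = (length + fragment - 1) // fragment  # e = ceil(l / 9)
    per_step = total // extrusions                   # n = floor(t / e)

    # First e - 1 extensions: full nine-residue fragments with n cycles each.
    sizes = [fragment] * (extrusions - 1)
    cycles = [per_step] * (extrusions - 1)

    # Final extension: the remaining residues and the remaining cycles.
    sizes.append(length - fragment * (extrusions - 1))
    cycles.append(total - per_step * (extrusions - 1))
    return sizes, cycles


if __name__ == "__main__":
    sizes, cycles = extension_schedule(143)
    print(len(sizes), sizes[-1], cycles[0], cycles[-1], sum(cycles))
```

For a 143-residue chain this gives 16 extensions, the last one of eight residues, and the per-extension cycles sum to the same total that the non-sequential run would receive.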

Reverse sequential algorithm

The reverse sequential algorithm is identical to the sequential algorithm except that the peptide is extended from the C-terminus to the N-terminus.

Non-sequential algorithm

In non-sequential folding a protein is folded from a fully extended state. The Rosetta ab initio algorithm is employed for this process, using insertion from a library of fragments to build decoys (predicted structures). This has proved a successful technique for protein structure prediction in recent years [3, 63–65]. Rosetta can select fragments from the target, so the algorithm as used here is not strictly ab initio. The number of cycles (fragment insertions) used by Rosetta varies with protein length in this study. A base number of 34,000 cycles was used for a protein of 100 residues, and this number increased proportionately; for example, for a protein with 143 residues the number of cycles is increased by a factor of 1.43. This is reasonable as in the cell longer proteins take more time to be synthesized, and thus have more time to explore conformational space before synthesis is completed.

Selection of targets

In Deane et al. [60] a measure was developed, an Average Logarithmic Ratio (ALR), which assesses the extent of previous contacts within a peptide chain; proteins with positive ALR are expected to be those for which the cotranslational aspect of folding has a substantial impact, whilst proteins with negative ALR are expected to be those for which cotranslation has lesser impact. Two sets of targets were created from a PISCES [66] data set (< 30% sequence identity, resolution better than 3 Å, at least 100 residues and no missing residues, downloaded 6 February, 2009). The first set contained protein chains with an ALR value of 0.15 or greater (total of 34 proteins), and the second contained chains with an ALR of -0.15 or less (total of 34 proteins); these two sets are referred to as the positive and negative sets respectively. For each protein in the two sets, 1000 decoys were generated with each of the algorithms described above (sequential, reverse sequential and non-sequential). GDT_TS values [67] were calculated for each of the resulting predictions. GDT_TS is defined as (N₁ + N₂ + N₄ + N₈)/(4N), where Nᵢ is the number of corresponding residues within i Å and N is the total number of residues. It measures the closeness of corresponding residues in known and predicted structures, more heavily weighting closer pairs. It is helpful to see it in non-cumulative form as (4n₁ + 3n₂ + 2n₄ + n₈)/(4N), where n₁ = N₁, n₂ = N₂ - N₁, n₄ = N₄ - N₂ and n₈ = N₈ - N₄ count the corresponding residues falling in each successive distance band.
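As a small numerical check of the GDT_TS definition, the sketch below computes the score from a list of per-residue distances. It makes two assumptions that the text does not state: the distances come from a single fixed superposition of prediction and experimental structure (full GDT implementations search over superpositions), and "within i Å" is read as less than or equal to i. The function names are hypothetical. The second function weights each residue by its distance band (4, 3, 2 or 1 quarters), which is algebraically the same sum.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS as a fraction: (N1 + N2 + N4 + N8) / (4N).

    `distances` holds one distance in angstroms per corresponding residue pair,
    assumed to come from one fixed superposition.
    """
    n = len(distances)
    counts = [sum(1 for d in distances if d <= c) for c in cutoffs]
    return sum(counts) / (4 * n)


def gdt_ts_banded(distances):
    """Same score, computed by weighting each residue by its distance band."""
    def weight(d):
        if d <= 1.0:
            return 4
        if d <= 2.0:
            return 3
        if d <= 4.0:
            return 2
        if d <= 8.0:
            return 1
        return 0

    return sum(weight(d) for d in distances) / (4 * len(distances))


if __name__ == "__main__":
    example = [0.5, 1.5, 3.0, 6.0, 10.0]  # made-up distances for illustration
    print(gdt_ts(example), gdt_ts_banded(example))
```

Multiplying the result by 100 gives the percentage scale on which GDT_TS values are quoted in the results.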

Larger sample size

To establish whether the sample size (that is, the number of decoys produced for each protein) has an effect on the results, two proteins were subjected to a larger sampling. An additional 100,000 decoys were generated for the FliG C-terminal domain of Thermotoga maritima (1qc7A) and also for 1ji4A, using the SAINT algorithm.

Variability in peptide termini

As the differences between mean GDT_TS scores for SAINT and reverse SAINT, for a given protein, prove to be generally small, additional tests were conducted to ascertain whether terminus loop regions could be causing the observed effects. The termini of proteins are often unstructured, and their structure can be highly variable and difficult to predict. Small mistakes in the terminus regions could lead to the small differences observed between the mean GDT_TS scores.

The first N-terminus and last C-terminus secondary structure elements were identified in the experimental structure for each protein, and the termini up to the identified secondary structure element of the corresponding predicted model with the highest GDT_TS were removed. A secondary structure element was defined as a run of four residues with identical secondary structure assignment. Secondary structure was assigned from the experimentally determined structure with DSSP. In addition to these conditions the N-terminus and C-terminus secondary structure element had to be separated by at least five residues. GDT_TS scores were recalculated and counts taken of how often SAINT outperformed reverse SAINT and how often SAINT outperformed Rosetta.
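One possible reading of this trimming rule is sketched below on a DSSP-style string with one character per residue. Several details are assumptions rather than statements from the text: which DSSP codes count as structured (`STRUCTURED` here), that an element is a maximal run of one code at least four residues long, and that the separation is measured from the end of the first element to the start of the last. All function names are hypothetical.

```python
from itertools import groupby

# Assumption: DSSP codes treated as secondary structure; the text does not list them.
STRUCTURED = set("HGIEB")


def elements(ss, min_run=4):
    """Return (start, end) spans, end exclusive, of runs of one structured code."""
    spans, pos = [], 0
    for code, group in groupby(ss):
        run = len(list(group))
        if code in STRUCTURED and run >= min_run:
            spans.append((pos, pos + run))
        pos += run
    return spans


def core_span(ss, min_run=4, min_gap=5):
    """Residue range kept after removing both termini, or None if the rule fails."""
    spans = elements(ss, min_run)
    if len(spans) < 2:
        return None
    first, last = spans[0], spans[-1]
    if last[0] - first[1] < min_gap:   # elements must be at least min_gap apart
        return None
    return first[0], last[1]


if __name__ == "__main__":
    ss = "--HHHHHH-----EEEE---"        # made-up assignment for illustration
    print(core_span(ss))               # range of residues retained for rescoring
```

The returned range would then be applied to both the experimental structure and the predicted model before GDT_TS is recalculated.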

Clash analysis

A possible reason for the better performance of SAINT was conjectured to be that extrusion from the N-terminus produces fewer steric clashes than does extrusion from the C-terminus. In order to investigate this, ten protein sequences were selected on the basis of their mean GDT_TS scores: four in which SAINT performed better, three in which reverse SAINT performed better, and three in which SAINT and reverse SAINT performed comparably. For each protein, two of the 1000 models generated were selected for each of SAINT and reverse SAINT. The extent of steric clashes in conformations following folding, for five extruded lengths (18, 36, 54, 72, 90), was assessed using MolProbity [68], a web server that calculates a "clashscore", equal to the number of steric overlaps that are greater than 0.4 Å per 1000 atoms. Nine residues in fully extended conformation were then added at the C-terminus (for SAINT) or the N-terminus (for reverse SAINT) to produce strings of length 27, 45, 63, 81, and 99, and these were checked again for steric clashes. For each of the five positions, the clashscore before the addition of nine residues was subtracted from the clashscore after the addition of the 9-mer fragment. An average of the differences in clashscores, across all five lengths, was taken for each protein sequence and each algorithm.
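The arithmetic of this comparison is simple enough to sketch. The function name and the clashscore values below are made up for illustration; how the two selected models per protein were combined is not specified in the text, so the sketch handles one model at a time.

```python
FOLDED_LENGTHS = (18, 36, 54, 72, 90)                       # lengths scored after folding
EXTENDED_LENGTHS = tuple(n + 9 for n in FOLDED_LENGTHS)     # after adding an extended 9-mer


def mean_clash_increase(before, after):
    """Average of (after - before) clashscores over the sampled lengths."""
    if len(before) != len(after) or not before:
        raise ValueError("need one before/after clashscore pair per length")
    diffs = [a - b for b, a in zip(before, after)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Hypothetical clashscores for one model, one value per length.
    before = [10.0, 12.0, 8.0, 9.0, 11.0]
    after = [12.0, 15.0, 8.0, 10.0, 15.0]
    print(EXTENDED_LENGTHS, mean_clash_increase(before, after))
```

A smaller average for SAINT than for reverse SAINT on the same protein would be consistent with the conjecture that forward extrusion introduces fewer clashes.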

The importance of sense

To investigate why SAINT might perform consistently better than reverse SAINT, measures of secondary structure prediction quality were developed. For a given decoy, structural alignments for every overlapping fragment of 11 residues against the experimental structure were obtained, and the average Cα-Cα distance of the alignment was assigned to the fragment's center residue (fragments of 11 residues were chosen to provide insight into prediction accuracy on a more local scale than, for example, taking an entire secondary structure element). These residue-assigned distance measures were averaged across all residues in α-helices in the decoy (residue secondary structure was assigned by DSSP for the experimentally determined model) and these in turn averaged over all 1000 decoys. This was done for both the forward and reverse decoy sets. Finally, the forward helical prediction quality measure was subtracted from the reverse helical prediction quality measure. The same process was followed for β-strands. If directionality is not important in folding we would expect the accuracy of helical or strand predictions to be similar regardless of the direction of synthesis, resulting in the difference calculated above being zero. A positive difference would indicate that forward predictions were more accurate than reverse predictions while negative differences would indicate that reverse predictions were more accurate. One of the proteins in the positive set (1qc7A) and four in the negative set (1kf6D, 1mkaA, 1nekC and 1uz3A) contained no β-strand residues and, therefore, were not considered in the analysis.
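The averaging steps of this measure can be sketched as follows. The structural alignment itself is left out: each decoy is represented by a list of already-computed average Cα-Cα distances, one per overlapping 11-residue fragment, in order of fragment start position. The function names are hypothetical, and treating only the DSSP code "H" as helix is an assumption.

```python
def center_assigned(fragment_scores, window=11):
    """Map the score of the fragment starting at residue i to its center residue."""
    half = window // 2
    return {start + half: score for start, score in enumerate(fragment_scores)}


def class_mean(fragment_scores, ss, codes, window=11):
    """Mean center-assigned score over residues whose DSSP code is in `codes`."""
    per_residue = center_assigned(fragment_scores, window)
    values = [score for idx, score in per_residue.items() if ss[idx] in codes]
    if not values:
        raise ValueError("no scored residues of the requested class")
    return sum(values) / len(values)


def directional_difference(forward_decoys, reverse_decoys, ss, codes="H", window=11):
    """Reverse measure minus forward measure; positive favours forward prediction."""
    fwd = sum(class_mean(d, ss, codes, window) for d in forward_decoys) / len(forward_decoys)
    rev = sum(class_mean(d, ss, codes, window) for d in reverse_decoys) / len(reverse_decoys)
    return rev - fwd


if __name__ == "__main__":
    ss = "---HHHHHHHHHH----"               # made-up DSSP string, 17 residues
    forward = [[1.0] * 7, [1.2] * 7]       # 7 fragments of 11 residues per decoy
    reverse = [[1.5] * 7, [1.7] * 7]
    print(directional_difference(forward, reverse, ss))
```

Because the scores are distances, a lower value means a better local fit, which is why a positive reverse-minus-forward difference indicates that the forward predictions were more accurate. Passing `codes="E"` would give the corresponding β-strand measure under the same assumptions.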

Results and Discussion

The emerging partial conformations produced by SAINT for sequence 1qc7A are shown in Figure 1, using the most successful decoy. The six helices are seen to progressively take shape as the chain is extruded, with early conformations largely preserved.

Results for SAINT, reverse SAINT and Rosetta for each of the proteins in the positive set (ALR ≥ 0.15, see Methods, Selection of targets) and negative set (ALR ≤ -0.15) are summarized in Table 1 and Table 2 respectively. The mean performance and best models produced by SAINT show that it predicts structures better than reverse SAINT in the majority of cases (Table 3). For example, SAINT yielded a higher mean GDT_TS than reverse SAINT for 32 of the 34 proteins with positive ALR and equally, for 32 of the 34 proteins with negative ALR.

Table 1 Results from positive set. Accuracy of models obtained for 34 proteins with ALR ≥ 0.15 using SAINT, reverse SAINT and Rosetta.

Table 2 Results from negative set. Accuracy of models obtained for 34 proteins with ALR ≤ -0.15 using SAINT, reverse SAINT and Rosetta.

Table 3 Summary of results. Pairwise (SAINT vs reverse SAINT and SAINT vs Rosetta) comparison of the algorithms.

Plots of the mean scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2A, with proteins ordered from smallest to largest mean SAINT GDT_TS score. Corresponding plots for the negative set are given in Figure 3A. The consistent superiority of SAINT over reverse SAINT is evident, with the difference being slightly greater for the positive set. The largest such difference seen in all the data is 8.49%, observed between the means of SAINT and reverse SAINT for 3ezmA (negative set), and representing an increase in GDT_TS from 20.25% to 28.74%. Mean performances of SAINT and Rosetta indicate that Rosetta outperforms SAINT in both the positive (Rosetta 19.72, SAINT 19.50) and negative (Rosetta 18.26, SAINT 17.84) sets. The difference is greater for the negative set (Table 3).

Plots of the maximum scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2B, with proteins ordered from smallest to largest maximum SAINT GDT_TS score. Corresponding plots for the negative set are shown in Figure 3B. When considering best performance, SAINT is again superior to reverse SAINT, and more so in the positive set. Rosetta is no longer superior when best performance is considered; SAINT outperforms Rosetta, for example, in 19 of the 34 proteins in the positive set. The most successful SAINT prediction in the positive set was found for 3vubA. It is shown superposed on the native conformation in Figure 4, together with superpositions of the best reverse SAINT and Rosetta predictions on the native conformation. SAINT captures the structure better than either reverse SAINT or Rosetta.

A GDT_TS value of 30% or above is generally considered to ensure that a reasonable prediction is found [4]; a scan of Table 1 indicates that roughly one half (15 out of 34) of the best SAINT predictions are satisfactory, and similarly for Rosetta (16 out of 34).