

Directionality in protein fold prediction



Ever since the ground-breaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein.


A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense of translation, from the C-terminus to N-terminus. Two secondary results emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation and second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta.


There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.


The purpose of this paper is to investigate whether directionality of synthesis can have an impact on the accuracy of protein structure prediction. In order to do this a sequential structure prediction algorithm, based on the most successful free modelling method of our time, Rosetta, was developed and used to predict structure, first starting from the nitrogen terminus and then starting from the carbon terminus. Free modelling protein structure prediction methodology has improved in recent years, but is still not accurate enough to be considered satisfactory (see results of CASP6 [1] and CASP7 [2, 3] and the more recent CASP8 [4]). Given the noisy nature of current free modelling structure prediction techniques, the pairwise comparison design used here appears to be required; it succeeded in detecting a consistent directionality effect. We begin, however, by summarizing the area.

Almost fifty years ago Anfinsen et al. [5, 6] showed that denatured small globular proteins could refold to their native state. On the other hand, experimentalists have known for many years that cotranslation can play an important role in protein folding [7–12]. Polypeptides are synthesized sequentially, and translation can occur at variable rates according to codon speed [13–17]. In Escherichia coli, for example, translation can occur in the order of 0.05 s/codon [13, 18–20]. On the other hand, it has been shown that helices and sheets fold in the low millisecond scale [21–23]. Therefore, some proteins fold faster than they elongate, and it is reasonable to assume that nascent chains can adopt secondary or tertiary structures cotranslationally. Experimental evidence for cotranslational folding dates back to the 1960s with a study on cotranslation in vivo reporting that ribosome-bound β-galactosidase was showing enzymic activity [24]. More recently it has been shown that the Semliki Forest Virus Protein (SFVP), which contains a protease domain that folds to autocatalytically cleave the protein from a larger polyprotein precursor, gains its enzymic activity before complete synthesis of the polyprotein [25]. Moreover, the rapid cotranslational folding of SFVP does not require additional cellular components [26].

In addition to enzymatic activity whilst still bound to the ribosome, intermediate stages of cotranslational folding may have native-like structures. Various length α-globins have been shown to have specific heme binding activity on several truncated ribosome-bound nascent chains. The shortest of these contained only the first 86 residues (from a total of 147 residues), demonstrating that the nascent chain has native-like structure [27]. NMR studies of nascent chains containing tandem Ig domains and still attached to the ribosome revealed that the N-terminus domain folds to its native state while the C-terminus domain is largely unfolded and flexible [28]. Recent molecular dynamics simulations also conclude that small peptides may adopt a conformation that is similar to the one adopted in full proteins [29]. The discovery of the formation of disulphide bonds in nascent immunoglobulin peptides also confirms the ability of proteins to begin to fold whilst they are being synthesized [30, 31].

As well as adopting native-like conformations while still attached to the ribosome, there is evidence that peptides can begin to fold whilst still in the ribosomal exit tunnel. Analysis of the ribosomal exit tunnel reveals that peptides can traverse the tunnel in an α-helical conformation [32], but that at no point is the tunnel big enough to accommodate structures larger than α-helices [33, 34]. Peptides are not restricted to an α-helix, however, and may adopt more extended conformations [35]. Analysis of the exit tunnel has also shown that the tunnel can entropically stabilize α-helical conformations as they pass through [36].

The rate of in vitro refolding has often been observed to be slower than the corresponding rate in vivo [37, 38]. Cotranslation has been studied in the bacterial luciferase αβ heterodimer, and the formation of the heterodimer is faster when the β monomer is translated in the presence of the folded α monomer than when the β monomer is refolded from a denatured state [38]. This shows that, under cotranslational folding, the β monomer is able to obtain a conformation that is more receptive to the formation of the dimer, thus avoiding kinetic traps associated with refolding from a denatured state [39]. Native-like structure has also been observed in cotranslationally folding monomeric firefly luciferase; again, cotranslational and in vitro folding pathways appear to be different, with cotranslational folding being faster [40]. Cotranslational folding in P22 tailspike protein has been shown to guide the peptide away from aggregation-prone conformations that are frequently encountered when refolding in vitro, leading to the hypothesis that cotranslational folding could be an efficient strategy for the folding of β-sheet topologies, and for large, multidomain proteins in general [41]. One possible explanation for this is that the peptide begins to fold while still attached to the ribosome [42, 43]. Another possible explanation is the existence of additional folding machinery contained in the cell; however, only approximately 20% of proteins associate, for example, with chaperones [44, 45]. The removal of major chaperones, such as DnaK and Hsp70, in E. coli has no adverse effect on cell growth or viability [46, 47]. This suggests that chaperones alone cannot account for the higher folding rates observed in vivo.

Complementing these experimental findings, computational models of cotranslational folding have also been explored, an early, incidental, use of this idea appearing in [48]. Simple computational models of protein folding incorporating cotranslation demonstrate that such folding favours local contacts in intermediate and final folds [49, 50]. More recently the effect of energy barriers on simple cotranslational models was studied, and it was found that the ground state of proteins folded sequentially was not necessarily the one of lowest energy [51]. Computational models have provided evidence that nascent chains may adopt partial structures similar to the corresponding parts of the complete protein [52]. Other lattice studies present a differing view of cotranslation where nascent peptides can remain largely unstructured until the final stages of synthesis (estimated to be when 90% or more of the protein has been extruded) [53]. This finding is dependent on the involvement of the C-terminal in tertiary interactions, and may not be applicable to all proteins. There is also evidence arising from lattice models that cotranslational folding pathways and refolding pathways are different [53]. Computational simulations of real proteins folding cotranslationally compared to refolding from a denatured state show mixed results. Chymotrypsin inhibitor 2 (CI2) and barnase were shown to fold mostly posttranslationally, with intermediates similar to those observed in refolding [54]. An alternative computational, cotranslational approach using dynamic optimisation in [55] found that major elements of the CI2 tertiary structure only form when the amino acid string is fully translated. For SFVP, which is known to fold cotranslationally [25], different pathways were taken during synthesis to those taken when folding from a denatured state [54]. A further promising approach is found in [56]. Pathways which minimize the difficulty of folding to the native state (for example, those which avoid having the chain pass through an opening) are found; results indicate that earlier folding is more likely around the N-terminus than the C-terminus, so pointing to an asymmetry of the folding process that is confirmed in the current work.

Finally, there is also evidence of cotranslational protein folding that arises from numerical summaries of known protein structures. An analysis of structures in the Protein Data Bank (PDB) found that residues are, in general, closer to previously synthesized residues than those synthesized later, and that the N-terminal region was more compact than the C-terminal region [57]. It was argued that this provided evidence of cotranslational folding; however, these findings were contradicted by a later analysis of a larger set of proteins [58]. In the second study it was observed that the C-terminals were more compact and contained greater numbers of local contacts than N-terminals. Further analysis that considered topological accessibility (the ability of a protein to fold from a given residue as a starting point using only local contacts) found this to be more evident towards the N-terminus in the α/β class of proteins [59]. In a similar vein, Deane et al. [60] developed a measure of previous contacts which assesses the extent to which the chain forms contacts with previously extruded residues. They also found that the α/β class and ancient folds [61] exhibited such evidence of cotranslation.

To date, protein structure prediction methods do not incorporate cotranslational effects. This paper describes such an algorithm and evaluates its performance. This evaluation reveals that, in more than 94% of cases, a sequential algorithm that follows the sense of translation, that is, from N-terminus to C-terminus, is more accurate than an algorithm that follows the reverse sense, from C-terminus to N-terminus. The success of the sequential algorithm is greater the more the target shows evidence of cotranslational folding. It is also found that a sequential algorithm can match, and on occasion better (in 51% of proteins tested), the performance of a leading non-sequential protein structure prediction algorithm, namely Rosetta.


Structure prediction algorithms

A sequential algorithm (SAINT, a Sequential Algorithm Initiated at the Nitrogen Terminus) was developed and used to predict the structure of a number of proteins. This algorithm uses the Rosetta program [62] (version 2.1.0), extending it to incorporate cotranslational aspects of protein folding. To investigate the importance of following the direction of translation, the sequential algorithm was adapted to predict the structure of proteins produced in the reverse direction, from the C-terminus to the N-terminus. Predictions from the sequential and reverse sequential algorithms were compared with each other, and both were in turn compared with predictions made using an unmodified version of Rosetta. These algorithms are now described.

Sequential algorithm

SAINT extends the peptide by a nine-residue fragment at each iteration, starting with the N-terminus. Each fragment is added in a fully extended conformation (ϕ = -150°, ψ = 150° and ω = 180°). The final fragment may contain fewer than nine residues; it will contain as many residues as are required to complete the full protein chain. At each extension the peptide is allowed to fold and the conformation reached is used as the starting structure for the next extension, with Rosetta ab initio used to perform the structure predictions at each stage. In order to make comparisons between the sequential and non-sequential algorithms fair, each uses the same total number of cycles. For the sequential algorithm these cycles were distributed evenly amongst the extensions of the peptide, with the number of cycles calculated as follows. If b is a base number of cycles and l is the protein length then the total number of cycles t is b(l/100) and the number of extrusions e is l/9, rounded up to allow for a shorter final fragment. This results in n = t/e cycles for each of the first e - 1 extrusions and t - n(e - 1) cycles for the final extrusion.
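The cycle allocation described above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the base of 34,000 cycles per 100 residues is taken from the description of the non-sequential algorithm, and the rounding choices (integer division for t and n, rounding e up) are assumptions made so that the per-extrusion counts sum exactly to t.

```python
import math

def fragment_lengths(length, fragment_size=9):
    """Lengths of the successive extruded fragments; the last may be shorter."""
    full, rest = divmod(length, fragment_size)
    return [fragment_size] * full + ([rest] if rest else [])

def allocate_cycles(length, base_cycles=34000, fragment_size=9):
    """Cycles assigned to each extrusion for a protein of the given length."""
    # t = b(l/100); integer division is an assumption of this sketch.
    total = (base_cycles * length) // 100
    # e = l/9, rounded up so that a short final fragment gets its own extrusion.
    extrusions = math.ceil(length / fragment_size)
    # n = t/e for the first e - 1 extrusions (integer division, an assumption).
    per_step = total // extrusions
    cycles = [per_step] * (extrusions - 1)
    # The final extrusion takes the remainder, t - n(e - 1).
    cycles.append(total - per_step * (extrusions - 1))
    return cycles

if __name__ == "__main__":
    # A 101-residue chain: eleven 9-residue fragments and a final 2-residue one.
    print(fragment_lengths(101))
    print(allocate_cycles(101))
```

Under these assumptions a 101-residue chain is built in twelve extrusions, and the final extrusion absorbs whatever cycles remain after the even split.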

Reverse sequential algorithm

The reverse sequential algorithm is identical to the sequential algorithm except that the peptide is extended from the C-terminus to the N-terminus.

Non-sequential algorithm

In non-sequential folding a protein is folded from a fully extended state. The Rosetta ab initio algorithm is employed for this process, using insertion from a library of fragments to build decoys (predicted structures). This has proved a successful technique for protein structure prediction in recent years [3, 63–65]. Rosetta can select fragments from the target, so the algorithm as used here is not strictly ab initio. The number of cycles (fragment insertions) used by Rosetta varies with protein length in this study. A base number of 34,000 cycles was used for a protein of 100 residues, and this number increased proportionately; for example, for a protein with 143 residues the number of cycles is increased by a factor of 1.43. This is reasonable as in the cell longer proteins take more time to be synthesized, and thus have more time to explore conformational space before synthesis is completed.

Selection of targets

In Deane et al. [60] a measure was developed, an Average Logarithmic Ratio (ALR), which assesses the extent of previous contacts within a peptide chain; proteins with positive ALR are expected to be those for which the cotranslational aspect of folding has a substantial impact, whilst proteins with negative ALR are expected to be those for which cotranslation has lesser impact. Two sets of targets were created from a PISCES [66] data set (< 30% sequence identity, resolution better than 3 Å, at least 100 residues and no missing residues, downloaded 6 February, 2009). The first set contained protein chains with an ALR value of 0.15 or greater (total of 34 proteins), and the second contained chains with an ALR of -0.15 or less (total of 34 proteins); these two sets are referred to as the positive and negative sets respectively. For each protein in the two sets, 1000 decoys were generated with each of the algorithms described above (sequential, reverse sequential and non-sequential). GDT_TS values [67] were calculated for each of the resulting predictions. GDT_TS is defined as (N1 + N2 + N4 + N8)/(4N), where Ni is the number of corresponding residues within i Å and N is the total number of residues. It measures the closeness of corresponding residues in known and predicted structures, more heavily weighting closer pairs. It is helpful to see it in non-cumulative form as (4n1 + 3n2 + 2n4 + n8)/(4N), where n1 is the number of corresponding residues within 1 Å and n2, n4 and n8 are the numbers of corresponding residues whose separations lie between 1 and 2 Å, 2 and 4 Å, and 4 and 8 Å respectively.
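As an illustration of the GDT_TS definition, the following sketch computes the score from a list of per-residue distances. It makes several simplifying assumptions: a single fixed superposition is used (the full GDT procedure searches over superpositions for each cutoff), "within i Å" is taken as inclusive, and the function names are hypothetical. The second function uses the equivalent weighted form, in which each residue contributes a weight of 4, 3, 2, 1 or 0 according to the distance band it falls in, which makes the heavier weighting of closer pairs explicit.

```python
CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance thresholds in Å

def gdt_ts(distances):
    """Cumulative form: (N1 + N2 + N4 + N8) / (4N)."""
    n = len(distances)
    if n == 0:
        raise ValueError("at least one residue is required")
    counts = [sum(1 for d in distances if d <= c) for c in CUTOFFS]
    return sum(counts) / (4 * n)

def gdt_ts_weighted(distances):
    """Non-cumulative form: each residue scores once, by its distance band."""
    n = len(distances)
    if n == 0:
        raise ValueError("at least one residue is required")
    total = 0
    for d in distances:
        # A residue within the tightest cutoff is counted by all four Ni,
        # so it carries weight 4; one beyond 8 Å carries weight 0.
        total += sum(1 for c in CUTOFFS if d <= c)
    return total / (4 * n)

if __name__ == "__main__":
    example = [0.5, 1.5, 3.0, 6.0, 10.0]  # hypothetical distances in Å
    print(gdt_ts(example), gdt_ts_weighted(example))
```

Both functions return a fraction between 0 and 1; multiplying by 100 gives the percentage scale used in the tables and figures.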

Larger sample size

To establish whether the sample size (that is, the number of decoys produced for each protein) has an effect on the results, two proteins were subjected to a larger sampling. An additional 100,000 decoys were generated for the FLiG C-terminal domain of Thermotoga maritima (1qc7A) and also for 1ji4A, using the SAINT algorithm.

Variability in peptide termini

As the differences between mean GDT_TS scores for SAINT and reverse SAINT, for a given protein, prove to be generally small, additional tests were conducted to ascertain whether terminus loop regions could be causing the observed effects. The termini of proteins are often unstructured, and their structure can be highly variable and difficult to predict. Small mistakes in the terminus regions could lead to the small differences observed between the mean GDT_TS scores.

The first N-terminus and last C-terminus secondary structure elements were identified in the experimental structure for each protein, and the termini up to the identified secondary structure element of the corresponding predicted model with the highest GDT_TS were removed. A secondary structure element was defined as a run of four residues with identical secondary structure assignment. Secondary structure was assigned from the experimentally determined structure with DSSP. In addition to these conditions the N-terminus and C-terminus secondary structure element had to be separated by at least five residues. GDT_TS scores were recalculated and counts taken of how often SAINT outperformed reverse SAINT and how often SAINT outperformed Rosetta.
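The trimming rule can be sketched as follows, working on a DSSP-style secondary structure string for the experimental structure. This is only an interpretation of the description above: the set of DSSP codes treated as structured, the function names, and the way the separation between the two terminal elements is measured (from the end of the first element to the start of the last) are assumptions.

```python
STRUCTURED = set("HGIEB")  # assumption: DSSP helix, strand and bridge codes

def first_element_start(ss, run=4):
    """Index of the first run of `run` identical structured codes, or None."""
    for i in range(len(ss) - run + 1):
        window = ss[i:i + run]
        if window[0] in STRUCTURED and window == window[0] * run:
            return i
    return None

def trimmed_range(ss, run=4, min_gap=5):
    """Return (start, end) of the residues kept after removing both termini.

    `end` is exclusive. None is returned when no qualifying pair of
    terminal elements exists.
    """
    n_start = first_element_start(ss, run)
    rev_start = first_element_start(ss[::-1], run)
    if n_start is None or rev_start is None:
        return None
    c_end = len(ss) - rev_start  # exclusive end of the last element
    # Extend the first element forwards to find where it finishes.
    n_end = n_start
    while n_end < len(ss) and ss[n_end] == ss[n_start]:
        n_end += 1
    # Extend the last element backwards to find where it begins.
    c_start = c_end
    while c_start > 0 and ss[c_start - 1] == ss[c_end - 1]:
        c_start -= 1
    # The two terminal elements must be separated by at least `min_gap` residues.
    if c_start - n_end < min_gap:
        return None
    return n_start, c_end

if __name__ == "__main__":
    ss = "--HHHHHH------EEEEE---"  # hypothetical DSSP string
    start, end = trimmed_range(ss)
    print(ss[start:end])
```

The returned range would then be applied to both the experimental structure and the predicted model before GDT_TS is recalculated.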

Clash analysis

A possible reason for better performance of SAINT was conjectured to be that extrusion from the nitrogen terminus produces fewer steric clashes than does extrusion from the carbon terminus. In order to investigate this, ten protein sequences were selected on the basis of their mean GDT_TS scores: four in which SAINT performed better, three in which reverse SAINT performed better, and three in which SAINT and reverse SAINT performed comparably. For each protein, two of the 1000 models generated were selected for each of SAINT and reverse SAINT. The extent of steric clashes in conformations following folding, for five extruded lengths (18, 36, 54, 72, 90), were assessed using MolProbity [68], a web server that calculates a "clashscore", equal to the number of steric overlaps that are greater than 0.4 Å per 1000 atoms. Nine residues in fully extended conformation were then added at the C-terminus (for SAINT) or the N-terminus (for reverse SAINT) to produce strings of length 27, 45, 63, 81, and 99 and these checked again for steric clashes. For each of the five positions, the clashscore before the addition of nine residues was subtracted from the clashscore after the addition of the 9-mer fragment. An average of the differences in clashscores, across all five lengths, was taken for each protein sequence and each algorithm.
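The bookkeeping for the clash comparison can be sketched as below. The clashscores themselves would come from MolProbity; the numbers used here are hypothetical placeholders, and the function name is not from the original study.

```python
EXTRUDED_LENGTHS = (18, 36, 54, 72, 90)
# Lengths after adding one fully extended nine-residue fragment.
EXTENDED_LENGTHS = tuple(n + 9 for n in EXTRUDED_LENGTHS)

def mean_clash_increase(before, after):
    """Average change in clashscore caused by adding the 9-mer fragment.

    `before` and `after` hold clashscores at matched positions, for example
    at lengths 18..90 and 27..99 respectively.
    """
    if len(before) != len(after) or not before:
        raise ValueError("before and after must be non-empty and equal in length")
    differences = [a - b for b, a in zip(before, after)]
    return sum(differences) / len(differences)

if __name__ == "__main__":
    # Hypothetical clashscores for one model, for illustration only.
    before = [10.0, 12.0, 8.0, 15.0, 11.0]
    after = [14.0, 13.0, 12.0, 15.0, 16.0]
    print(EXTENDED_LENGTHS)
    print(mean_clash_increase(before, after))
```

Computing this average separately for SAINT and reverse SAINT models of the same protein gives the pair of values that is compared in the clash analysis.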

The importance of sense

To investigate why SAINT might perform consistently better than reverse SAINT, measures of secondary structure prediction quality were developed. For a given decoy, structural alignments for every overlapping fragment of 11 residues against the experimental structure were obtained, and the average Cα-Cα distance of the alignment was assigned to the fragment's center residue (fragments of 11 residues were chosen to provide insight into prediction accuracy on a more local scale than, for example, taking an entire secondary structure element). These residue-assigned distance measures were averaged across all residues in α-helices in the decoy (residue secondary structure was assigned by DSSP for the experimentally determined model) and these in turn averaged over all 1000 decoys. This was done for both the forward and reverse decoy sets. Finally, the forward helical prediction quality measure was subtracted from the reverse helical prediction quality measure. The same process was followed for β-strands. If directionality is not important in folding we would expect the accuracy of helical or strand predictions to be similar regardless of the direction of synthesis, resulting in the difference calculated above being zero. A positive difference would indicate that forward predictions were more accurate than reverse predictions while negative differences would indicate that reverse predictions were more accurate. One of the proteins in the positive set (1qc7A) and four in the negative set (1kf6D, 1mkaA, 1nekC and 1uz3A) contained no β-strand residues and, therefore, were not considered in the analysis.
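The averaging and subtraction steps of this measure can be sketched as follows. The sketch assumes that the per-residue distances from the 11-residue fragment alignments have already been computed (the structural alignment itself is not implemented here), that residues without a full window carry `None`, and that DSSP codes `H` and `E` identify α-helix and β-strand residues. All function names are hypothetical.

```python
from statistics import mean

def window_centres(n_residues, size=11):
    """Indices of residues that are the centre of a full `size`-residue window."""
    half = size // 2
    return list(range(half, n_residues - half))

def class_mean_distance(per_residue, ss, code):
    """Mean residue-assigned distance over residues with the given DSSP code."""
    values = [d for d, s in zip(per_residue, ss) if s == code and d is not None]
    return mean(values) if values else None

def quality_measure(decoys, ss, code):
    """Average the per-decoy class means over every decoy in the set."""
    per_decoy = [class_mean_distance(d, ss, code) for d in decoys]
    per_decoy = [v for v in per_decoy if v is not None]
    return mean(per_decoy) if per_decoy else None

def sense_difference(forward_decoys, reverse_decoys, ss, code):
    """Reverse measure minus forward measure; positive favours forward folding."""
    forward = quality_measure(forward_decoys, ss, code)
    reverse = quality_measure(reverse_decoys, ss, code)
    if forward is None or reverse is None:
        return None  # for example, a protein with no strand residues
    return reverse - forward

if __name__ == "__main__":
    ss = "HHHHEEEE"  # hypothetical DSSP assignment of the native structure
    forward = [[1.0] * 4 + [2.0] * 4] * 2  # hypothetical distances in Å
    reverse = [[1.5] * 4 + [4.0] * 4] * 2
    print(sense_difference(forward, reverse, ss, "H"))
    print(sense_difference(forward, reverse, ss, "E"))
```

Because smaller distances mean better local agreement with the native structure, a positive difference indicates that the forward decoys are the more accurate ones for that class of residue.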

Results and Discussion

The emerging partial conformations produced by SAINT for sequence 1qc7A are shown in Figure 1, using the most successful decoy. The six helices are seen to progressively take shape as the chain is extruded, with early conformations largely preserved.

Figure 1

Cotranslational structure prediction of the FLiG C-terminal domain (1qc7A; 101 residues). Segments of nine residues are extruded at a time except for the last segment which consists of two residues. One thousand decoys were produced; the particular simulation above produced the structure with the highest GDT_TS of 63.12%. In each sub-figure the N-terminal is coloured dark blue and appears at the center adopting approximately the same orientation; it cannot always be the same orientation due to changes in conformation as the protein folds.

Results for SAINT, reverse SAINT and Rosetta for each of the proteins in the positive set (ALR ≥ 0.15, see Methods, Selection of targets) and negative set (ALR ≤ -0.15) are summarized in Table 1 and Table 2 respectively. The mean performance and best models produced by SAINT show that it predicts structures better than reverse SAINT in the majority of cases (Table 3). For example, SAINT yielded a higher mean GDT_TS than reverse SAINT for 32 of the 34 proteins with positive ALR and equally, for 32 of the 34 proteins with negative ALR.

Table 1 Results from positive set. Accuracy of models obtained for 34 proteins with ALR ≥ 0.15 using SAINT, reverse SAINT and Rosetta.
Table 2 Results from negative set. Accuracy of models obtained for 34 proteins with ALR ≤ -0.15 using SAINT, reverse SAINT and Rosetta.
Table 3 Summary of results. Pairwise (SAINT vs reverse SAINT and SAINT vs Rosetta) comparison of the algorithms.

Plots of the mean scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2A, with proteins ordered from smallest to largest mean SAINT GDT_TS score. Corresponding plots for the negative set are given in Figure 3A. The consistent superiority of SAINT over reverse SAINT is evident, with the difference being slightly greater for the positive set. The largest such difference seen in all the data is 8.49%, observed between the means of SAINT and reverse SAINT for 3ezmA (negative set), and representing an increase in GDT_TS from 20.25% to 28.74%. Mean performances of SAINT and Rosetta indicate that Rosetta outperforms SAINT in both the positive (Rosetta 19.72, SAINT 19.50) and negative (Rosetta 18.26, SAINT 17.84) sets. The difference is greater for the negative set (Table 3).

Figure 2

Plots of mean and maximum GDT_TS for the positive set. Graphic A shows the mean GDT_TS scores for the 34 proteins in the positive set, for SAINT (red squares), reverse SAINT (blue circles) and Rosetta (green triangles), with the proteins ordered according to ascending mean SAINT GDT_TS. SAINT and Rosetta perform similarly and consistently better than reverse SAINT. Graphic B plots maximum GDT_TS in the same way, ordered this time by ascending maximum SAINT GDT_TS, revealing greater variation but still a consistent and generally larger improvement of SAINT on reverse SAINT.

Figure 3

Plots of mean and maximum GDT_TS for the negative set. Graphic A shows the mean GDT_TS scores for the 34 proteins in the negative set, for SAINT (red squares), reverse SAINT (blue circles) and Rosetta (green triangles), with the proteins ordered according to ascending mean SAINT GDT_TS. Graphic B plots maximum GDT_TS for proteins in the negative set, ordered by ascending maximum SAINT GDT_TS. Outcomes are the same as for the positive set, with all differences less marked.

Plots of the maximum scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2B, with proteins ordered from smallest to largest maximum SAINT GDT_TS score. Corresponding plots for the negative set are shown in Figure 3B. When considering best performance, SAINT is again superior to reverse SAINT, and more so in the positive set. Rosetta is no longer superior when best performance is considered; SAINT outperforms Rosetta, for example, in 19 of the 34 proteins in the positive set. The most successful SAINT prediction in the positive set was found for 3vubA. It is shown superposed on the native conformation in Figure 4, together with superpositions of the best reverse SAINT and Rosetta predictions on the native conformation. SAINT captures the structure better than either reverse SAINT or Rosetta.

Figure 4

Superpositions of the best predictions for 3vubA on the native structure. The best decoy produced overall was by SAINT for 3vubA, whose native conformation is shown in a). The remaining graphics show the superposition of this native conformation with the best decoy produced by b) SAINT (GDT_TS = 67.57), c) reverse SAINT (GDT_TS = 37.62) and d) Rosetta (GDT_TS = 51.24). The SAINT decoy best captures the native loop and sheet conformation; a loop error causes the C-terminal helix to be incorrectly oriented.

A GDT_TS value of 30% or above is generally considered to ensure that a reasonable prediction is found [4]; a scan of Table 1 indicates that roughly one half (15 out of 34) of the best SAINT predictions are satisfactory, and similarly for Rosetta (16 out of 34).

Larger sample size

Summaries of the distribution of GDT_TS scores indicate that the size of the decoy sets used (that is, 1000) does not significantly influence their values (for 1qc7A, sample size of 1000 has min. 23.0, max. 69.8, mean 40.6, std devn 7.9; sample size of 100,000 has min. 22.0, max. 73.0, mean 40.9, std devn 8.2). When repeated with 1ji4A, similar results were produced (sample size of 1000 has min. 19.79, max. 49.31, mean 30.37, std devn 4.07; sample size of 100,000 has min. 17.71, max. 56.94, mean 30.78, std devn 4.38).

Variability in peptide termini

The results of this test indicate that the differences in GDT_TS observed are not due to variability in the terminus regions of the peptides (data presented in Tables 4 and 5).

Table 4 Variability in peptide termini: Results from positive set.
Table 5 Variability in peptide termini: Results from negative set.

Clash analysis

The results are shown in Table 6. Four of the ten protein conformations examined have higher steric clashscores for SAINT than for reverse SAINT. The steric clashscore of a protein appears to be unrelated to its mean GDT_TS score: two (1mf7A and 2d00A) of the four proteins with higher mean GDT_TS scores for SAINT have greater steric clashscores for SAINT than for reverse SAINT. Steric clashes produced by SAINT and reverse SAINT are generally comparable, so providing no evidence that fewer steric clashes are the reason for the better performance of SAINT.

Table 6 Clash analysis.

The importance of sense

The differences obtained from both the positive and negative sets are shown in Figure 5. These results show that for both types of secondary structure SAINT is generally producing better predictions, but that the difference is more pronounced for strand residues. In 28 of the 33 proteins (85%) in the positive set the difference between forward and reverse folding is greater for strands than for helices (with 16 (48%) having a β-strand difference more than twice the α-helix difference). Similarly, in 26 of the 30 proteins (87%) in the negative set the difference between forward and reverse folding is greater for strands than for helices (with 19 (63%) having a β-strand difference more than twice the α-helix difference). These results indicate that in general SAINT is more accurate when predicting strands than is reverse SAINT. These differences are small, but they would account for the differences observed in the results.

Figure 5

Accuracy of helix and strand predictions. Accuracy of helix and strand predictions separately for (A) positive and (B) negative sets. Plots show the difference (reverse SAINT minus SAINT) in the secondary structure distance measure for helical (grey) and strand (black) residues. Positive values here indicate that SAINT is producing predictions that are more accurate than those of reverse SAINT. Evidently SAINT outperforms reverse SAINT for both types of secondary structure, but more strongly for strands and the negative set.


A consistent difference in prediction accuracy was seen between SAINT and reverse SAINT. SAINT is markedly superior to reverse SAINT, and slightly more so for proteins with positive ALR values. When looking in detail at SAINT and reverse SAINT, the differences observed are most likely due to the detrimental effect on strand prediction observed when elongating a peptide from the C-terminus to the N-terminus. SAINT produced decoys with a higher mean GDT_TS than reverse SAINT for more than 94% of proteins in both the positive and negative protein sets. The differences between mean GDT_TS scores for SAINT and reverse SAINT decoys were also bigger than those between SAINT and Rosetta decoys. If directionality played no part in the folding process it would be expected that there would be no difference in the predictive accuracy of extrusions from the N-terminus to C-terminus and extrusion from C-terminus to the N-terminus. Three possible explanations for these results are outlined below.

Peptides, when extruded from the ribosome, start at the N-terminus. For this reason, fragments near the N-terminus are less influenced in their folding by the remainder of the peptide, since this has yet to emerge from the ribosome. On the other hand, fragments towards the C-terminus must fold in the presence of the bulk of the peptide. Thus the conformation assumed by the early fragment is a local choice, in that it depends largely on the amino acid sequence of the fragment. The conformation reached by a later fragment is determined by more than its amino acid sequence, in that it also depends on surrounding structure. This behaviour is mimicked by SAINT but not by reverse SAINT, so providing an explanation for the consistently better predictive accuracy of SAINT.

A second explanation arises from the way that the two algorithms allocate fragment insertions. At any stage, due to the constraints of Rosetta, fragment insertions are made uniformly across the currently extruded peptide length. The upshot is that more fragment insertions are attempted at the N-terminus than the C-terminus for SAINT while the opposite is true for reverse SAINT. Should it be the case that the N-terminus of the peptide is harder to predict than the C-terminus, SAINT would be more successful than reverse SAINT since SAINT puts in effort where it is needed. Due to the reasons stated above, however, we expect the N-terminus to be more easily predicted than the C-terminus.

A third possibility is that Rosetta itself has some inherent directionality, so favouring SAINT over reverse SAINT. A study of Rosetta, however, provides no indication of such a directional bias.

A strong correlation between mean GDT_TS and chain length is seen for both the positive and negative sets and for all three algorithms: as the chain length increases the GDT_TS decreases. 1oaaA is the only target over 200 residues in length that produced a set of decoys with mean GDT_TS greater than 20%, indicating that the versions of the algorithms employed in this study are not sufficient to accurately predict the structure of chains with more than 200 residues (this accounts for 50% of the positive set and 24% of the negative set). Excluding these data from the analysis, however, makes no difference to the overall findings.

Given that SAINT outperforms reverse SAINT it might be expected that SAINT would also outperform Rosetta, Rosetta being, in some senses, midway between the two. In best performance, arguably more important than mean performance, there is weak evidence that SAINT does outperform Rosetta: for the positive set SAINT outperforms Rosetta in 19 out of 33 instances (there is one tie) and for the negative set SAINT outperforms Rosetta in 16 out of 30 instances (there are four ties). One explanation for why this evidence remains weak is that SAINT remains crude, barely exploiting spatial and temporal advantages which may be available in cotranslational folding; we have simply used an iterative version of Rosetta. For example, at each extrusion, fragment insertions are chosen uniformly along the extruded peptide, whereas use of an insertion location distribution skewed towards the carbon terminus might be more realistic. To its credit, however, the SAINT versus reverse SAINT investigation exploits the power of a "paired comparison" design more effectively than does the SAINT versus Rosetta investigation, in that it contrasts opposites and so is more likely to reveal an effect.
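One way to see why the win counts quoted above amount to only weak evidence is an exact sign test. This is an illustration chosen here, not necessarily the analysis used in the study, and it assumes that the quoted totals (33 and 30) already exclude the ties, so that the losses are 14 in each set.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided exact sign test.

    Returns P(X >= wins) for X ~ Binomial(wins + losses, 0.5), i.e. the
    chance of at least this many wins if neither method were better.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Assumed split of the quoted counts (ties excluded): 19 of 33 and 16 of 30.
p_positive = sign_test_p_value(19, 14)
p_negative = sign_test_p_value(16, 14)
```

Both p-values come out well above 0.05, which is consistent with describing the SAINT versus Rosetta evidence as weak.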


This study has presented an algorithm that builds cotranslation into protein structure prediction. To assess the importance of the direction of translation the sequential algorithm was compared to a reverse sequential algorithm where the protein was produced from the C-terminus to N-terminus. Two sets of proteins were chosen: one where the residues have, on average, more contacts with previous residues than successive residues and the other where the residues have, on average, more contacts with successive residues than previous residues. The performance of the sequential algorithm for protein structure prediction was also compared with Rosetta, which folds from a fully elongated chain.

When SAINT was compared to reverse SAINT a very pronounced difference was observed. When mean GDT_TS was used as the performance measure SAINT outperformed reverse SAINT for over 94% of targets from both the positive and negative sets. These figures were still high when the maximum GDT_TS was used as the performance measure, with SAINT outperforming reverse SAINT in over 91% of targets from the positive set and over 73% of targets from the negative set.
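The paired comparison behind these percentages can be sketched as follows. The data layout (a dictionary from target identifier to a list of decoy GDT_TS values), the function name and the numbers are all hypothetical; the sketch only shows how the same decoy sets give different win fractions depending on whether the mean or the maximum GDT_TS is used as the summary.

```python
from statistics import mean


def fraction_won(scores_a, scores_b, summary=mean):
    """Fraction of shared targets on which method A's summarised score beats B's.

    scores_a and scores_b map target identifiers to lists of GDT_TS values,
    one value per decoy. `summary` reduces each list to a single number.
    """
    targets = sorted(set(scores_a) & set(scores_b))
    wins = sum(summary(scores_a[t]) > summary(scores_b[t]) for t in targets)
    return wins / len(targets)


# Made-up GDT_TS values for two hypothetical targets, for illustration only.
saint = {"t1": [30.0, 35.0, 40.0], "t2": [20.0, 22.0, 31.0]}
reverse_saint = {"t1": [25.0, 28.0, 33.0], "t2": [21.0, 24.0, 29.0]}

by_mean = fraction_won(saint, reverse_saint)       # compare mean GDT_TS
by_best = fraction_won(saint, reverse_saint, max)  # compare best decoy
```

Because each target is compared only with itself, differences in difficulty between targets cancel out, which is what makes the paired design sensitive to a small but consistent effect.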

The results show that Rosetta produces decoy sets with higher mean GDT_TS scores than SAINT for both the positive and negative protein sets, but that this superiority of Rosetta is not seen when the models with the highest GDT_TS scores are compared. If it were possible to always select the most accurate structure from the set of decoys then SAINT would, overall, produce a better prediction than Rosetta. The selection of the best decoy from a set, however, is a separate problem that is not addressed in this study. While Rosetta is producing decoy sets with higher mean GDT_TS scores than SAINT, examination of the differences between the means shows that the difference is always small. Only on one occasion does a Rosetta decoy set have a mean GDT_TS greater than 2% above the corresponding SAINT decoy set (an increase in mean GDT_TS from SAINT to Rosetta of 2.4% for 1ji4A). It has been established that the size of the decoy set and flexibility of peptide terminus residues do not affect the distribution of GDT_TS scores.

The sequential algorithm described in this study is in its earliest stages of development. Future work will include investigation of the effect of translation speed, allowing extruded segments to have variable length and the number of fragment insertion attempts at each iteration to vary. Improvements should also include incorporation of spatial restrictions to simulate the constraint of the ribosome tunnel.


  1. Vincent JJ, Tai CH, Sathyanarayana BK, Lee B: Assessment of CASP6 predictions for new and nearly new fold targets. Proteins 2005, 61(Suppl 7):67–83. 10.1002/prot.20722
  2. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction-Round VII. Proteins 2007, 69(Suppl 8):3–9. 10.1002/prot.21767
  3. Jauch R, Yeo HC, Kolatkar PR, Clarke ND: Assessment of CASP7 structure predictions for template free targets. Proteins 2007, 69(Suppl 8):57–67. 10.1002/prot.21771
  4. Kryshtafovych A, Fidelis K, Moult J: CASP8 results in context of previous experiments. Proteins 2009, 77(Suppl 9):217–228. 10.1002/prot.22562
  5. Anfinsen CB, Haber E, Sela M, White FH Jr: The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 1961, 47: 1309–14. 10.1073/pnas.47.9.1309
  6. Anfinsen CB: Principles that govern the folding of protein chains. Science 1973, 181(96):223–230. 10.1126/science.181.4096.223
  7. Fedorov AN, Baldwin TO: Cotranslational protein folding. J Biol Chem 1997, 272(52):32715–32718. 10.1074/jbc.272.52.32715
  8. Basharov MA: Cotranslational folding of proteins. Biochemistry (Mosc) 2000, 65(12):1380–1384. 10.1023/A:1002800822475
  9. Basharov MA: Protein folding. J Cell Mol Med 2003, 7(3):223–237. 10.1111/j.1582-4934.2003.tb00223.x
  10. Kolb VA: Cotranslational protein folding. Mol Biol 2001, 35(4):584–590. 10.1023/A:1010579111510
  11. Giglione C, Fieulaine S, Meinnel T: Cotranslational processing mechanisms: towards a dynamic 3D model. Trends Biochem Sci 2009, 34: 417–426. 10.1016/j.tibs.2009.04.003
  12. Kadokura H, Beckwith J: Detecting folding intermediates of a protein as it passes through the bacterial translocation channel. Cell 2009, 138: 1164–1173. 10.1016/j.cell.2009.07.030
  13. Pedersen S: Escherichia coli ribosomes translate in vivo with variable rate. EMBO J 1984, 3(12):2895–2898.
  14. Wilson KS, Noller HF: Molecular movement inside the translational engine. Cell 1998, 92(3):337–349. 10.1016/S0092-8674(00)80927-7
  15. Clarke T, Clark P: Rare codons cluster. PLoS ONE 2008, 3: e3412. 10.1371/journal.pone.0003412
  16. Zhang G, Hubalewska M, Ignatova Z: Transient ribosomal attenuation coordinates protein synthesis and co-translational folding. Nat Struct Mol Biol 2009, 16: 274–280. 10.1038/nsmb.1554
  17. Zhang G, Ignatova Z: Generic algorithm to predict the speed of translational elongation: implications for protein biogenesis. PLoS ONE 2009, 4: e5036. 10.1371/journal.pone.0005036
  18. Krüger MK, Pedersen S, Hagervall TG, Sørensen MA: The modification of the wobble base of tRNAGlu modulates the translation rate of glutamic acid codons in vivo. J Mol Biol 1998, 284(3):621–631. 10.1006/jmbi.1998.2196
  19. Sørensen MA, Pedersen S: Absolute in vivo translation rates of individual codons in Escherichia coli. The two glutamic acid codons GAA and GAG are translated with a threefold difference in rate. J Mol Biol 1991, 222(2):265–280. 10.1016/0022-2836(91)90211-N
  20. Varenne S, Buc J, Lloubes R, Lazdunski C: Translation is a non-uniform process. Effect of tRNA availability on the rate of elongation of nascent polypeptide chains. J Mol Biol 1984, 180(3):549–576. 10.1016/0022-2836(84)90027-5
  21. Roder H, Elöve GA, Englander SW: Structural characterization of folding intermediates in cytochrome c by H-exchange labelling and proton NMR. Nature 1988, 335(6192):700–704. 10.1038/335700a0
  22. Briggs MS, Roder H: Early hydrogen-bonding events in the folding reaction of ubiquitin. Proc Natl Acad Sci USA 1992, 89(6):2017–2021. 10.1073/pnas.89.6.2017
  23. Lu J, Dahlquist FW: Detection and characterization of an early folding intermediate of T4 lysozyme using pulsed hydrogen exchange and two-dimensional NMR. Biochemistry 1992, 31(20):4749–4756. 10.1021/bi00135a002
  24. Kiho Y, Rich A: Induced enzyme formed on bacterial polyribosomes. Proc Natl Acad Sci USA 1964, 51: 111–118. 10.1073/pnas.51.1.111
  25. Nicola AV, Chen W, Helenius A: Co-translational folding of an alphavirus capsid protein in the cytosol of living cells. Nat Cell Biol 1999, 1(6):341–345. 10.1038/14032
  26. Sánchez IE, Morillas M, Zobeley E, Kiefhaber T, Glockshuber R: Fast folding of the two-domain semliki forest virus capsid protein explains co-translational proteolytic activity. J Mol Biol 2004, 338: 159–167. 10.1016/j.jmb.2004.02.037
  27. Komar AA, Kommer A, Krasheninnikov IA, Spirin AS: Cotranslational folding of globin. J Biol Chem 1997, 272(16):10646–10651. 10.1074/jbc.272.16.10646
  28. Hsu STD, Fucini P, Cabrita LD, Launay H, Dobson CM, Christodoulou J: Structure and dynamics of a ribosome-bound nascent chain by NMR spectroscopy. Proc Natl Acad Sci USA 2007, 104(42):16516–16521. 10.1073/pnas.0704664104
  29. Voelz VA, Shell MS, Dill KA: Predicting peptide structures in native proteins from physical simulations of fragments. PLoS Comput Biol 2009, 5(2):e1000281. 10.1371/journal.pcbi.1000281
  30. Bergman LW, Kuehl WM: Formation of an intrachain disulfide bond on nascent immunoglobulin light chains. J Biol Chem 1979, 254(18):8869–8876.
  31. Bergman LW, Kuehl WM: Formation of intermolecular disulfide bonds on nascent immunoglobulin polypeptides. J Biol Chem 1979, 254(13):5690–5694.
  32. Lim VI, Spirin AS: Stereochemical analysis of ribosomal transpeptidation. Conformation of nascent peptide. J Mol Biol 1986, 188(4):565–574. 10.1016/S0022-2836(86)80006-7
  33. Jenni S, Ban N: The chemistry of protein synthesis and voyage through the ribosomal tunnel. Curr Opin Struct Biol 2003, 13(2):212–219. 10.1016/S0959-440X(03)00034-4
  34. Voss NR, Gerstein M, Steitz TA, Moore PB: The geometry of the ribosomal polypeptide exit tunnel. J Mol Biol 2006, 360(4):893–906. 10.1016/j.jmb.2006.05.023
  35. Tsalkova T, Odom OW, Kramer G, Hardesty B: Different conformations of nascent peptides on ribosomes. J Mol Biol 1998, 278(4):713–723. 10.1006/jmbi.1998.1721
  36. Ziv G, Haran G, Thirumalai D: Ribosome exit tunnel can entropically stabilize alpha-helices. Proc Natl Acad Sci USA 2005, 102(52):18956–18961. 10.1073/pnas.0508234102
  37. Seckler R, Fuchs A, King J, Jaenicke R: Reconstitution of the thermostable trimeric phage P22 tailspike protein from denatured chains in vitro. J Biol Chem 1989, 264(20):11750–11753.
  38. Fedorov AN, Baldwin TO: Process of biosynthetic protein folding determines the rapid formation of native structure. J Mol Biol 1999, 294(2):579–586. 10.1006/jmbi.1999.3281
  39. Evans MS, Clarke TF, Clark PL: Conformations of co-translational folding intermediates. Protein Pept Lett 2005, 12(2):189–195. 10.2174/0929866053005908
  40. Frydman J, Erdjument-Bromage H, Tempst P, Hartl FU: Co-translational domain folding as the structural basis for the rapid de novo folding of firefly luciferase. Nat Struct Biol 1999, 6(7):697–705. 10.1038/10754
  41. Evans MS, Sander IM, Clark PL: Cotranslational folding promotes β-helix formation and avoids aggregation in vivo. J Mol Biol 2008, 383(3):683–692. 10.1016/j.jmb.2008.07.035
  42. Tsou CL: Folding of the nascent peptide chain into a biologically active protein. Biochemistry 1988, 27(6):1809–1812. 10.1021/bi00406a001
  43. Fedorov AN, Baldwin TO: Contribution of cotranslational folding to the rate of formation of native protein structure. Proc Natl Acad Sci USA 1995, 92(4):1227–1231. 10.1073/pnas.92.4.1227
  44. Frydman J: Folding of newly translated proteins in vivo: the role of molecular chaperones. Annu Rev Biochem 2001, 70: 603–647. 10.1146/annurev.biochem.70.1.603
  45. Hartl FU, Hayer-Hartl M: Molecular chaperones in the cytosol: from nascent chain to folded protein. Science 2002, 295(5561):1852–1858. 10.1126/science.1068408
  46. Deuerling E, Schulze-Specking A, Tomoyasu T, Mogk A, Bukau B: Trigger factor and DnaK cooperate in folding of newly synthesized proteins. Nature 1999, 400(6745):693–696. 10.1038/23301
  47. Teter SA, Houry WA, Ang D, Tradler T, Rockabrand D, Fischer G, Blum P, Georgopoulos C, Hartl FU: Polypeptide flux through bacterial Hsp70: DnaK cooperates with trigger factor in chaperoning nascent chains. Cell 1999, 97(6):755–765. 10.1016/S0092-8674(00)80787-4
  48. Srinivasan R, Rose G: LINUS: A hierarchical procedure to predict the fold of a protein. Proteins 1995, 22: 81–99. 10.1002/prot.340220202
  49. Bornberg-Bauer E: How are model protein structures distributed in sequence space? Biophys J 1997, 73(5):2393–2403. 10.1016/S0006-3495(97)78268-7
  50. Morrissey MP, Ahmed Z, Shakhnovich EI: The role of cotranslation in protein folding: a lattice model study. Polymer 2004, 45: 557–571. 10.1016/j.polymer.2003.10.090
  51. Huard FPE, Deane CM, Wood GR: Modelling sequential protein folding under kinetic control. Bioinformatics 2006, 22(14):e203-e210. 10.1093/bioinformatics/btl248
  52. Lu HM, Liang J: A model study of protein nascent chain and cotranslational folding using hydrophobic-polar residues. Proteins 2008, 70(2):442–449. 10.1002/prot.21575
  53. Wang P, Klimov DK: Lattice simulations of cotranslational folding of single domain proteins. Proteins 2008, 70(3):925–937. 10.1002/prot.21547
  54. Elcock AH: Molecular simulations of cotranslational protein folding: fragment stabilities, folding cooperativity, and trapping in the ribosome. PLoS Comput Biol 2006, 2(7):e98. 10.1371/journal.pcbi.0020098
  55. Senturk S, Baday S, Arkun Y, Erman B: Optimum folding pathways for growing protein chains. Phys Biol 2007, 4(4):305–316. 10.1088/1478-3975/4/4/007
  56. Norcross T, Yeates T: A framework for describing topological frustration in models of protein folding. J Mol Biol 2006, 362: 605–621. 10.1016/j.jmb.2006.07.054
  57. Alexandrov N: Structural argument for N-terminal initiation of protein folding. Protein Sci 1993, 2(11):1989–1991. 10.1002/pro.5560021121
  58. Laio A, Micheletti C: Are structural biases at protein termini a signature of vectorial folding? Proteins 2006, 62: 17–23. 10.1002/prot.20712
  59. Taylor WR: Topological accessibility shows a distinct asymmetry in the folds of βα proteins. FEBS Lett 2006, 580(22):5263–5267. 10.1016/j.febslet.2006.08.070
  60. Deane CM, Dong M, Huard FPE, Lance BK, Wood GR: Cotranslational protein folding-fact or fiction? Bioinformatics 2007, 23(13):i142-i148. 10.1093/bioinformatics/btm175
  61. Winstanley HF, Abeln S, Deane CM: How old is your fold? Bioinformatics 2005, 21(Suppl 1):i449-i458. 10.1093/bioinformatics/bti1008
  62. Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–225. 10.1006/jmbi.1997.0959
  63. Simons KT, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins 1999, (Suppl 3):171–176. 10.1002/(SICI)1097-0134(1999)37:3+<171::AID-PROT21>3.0.CO;2-Z
  64. Chivian D, Kim DE, Malmström L, Bradley P, Robertson T, Murphy P, Strauss CEM, Bonneau R, Rohl CA, Baker D: Automated prediction of CASP-5 structures using the Robetta server. Proteins 2003, 53(Suppl 6):524–533. 10.1002/prot.10529
  65. Chivian D, Kim DE, Malmström L, Schonbrun J, Rohl CA, Baker D: Prediction of CASP6 structures using automated Robetta protocols. Proteins 2005, 61(Suppl 7):157–166. 10.1002/prot.20733
  66. Wang G, Dunbrack RL: PISCES: a protein sequence culling server. Bioinformatics 2003, 19(12):1589–1591. 10.1093/bioinformatics/btg224
  67. Zemla A: LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31(13):3370–3374. 10.1093/nar/gkg571
  68. Davis IW, Leaver-Fay A, Chen VB, Block JN, Kapral GJ, Wang X, Murray LW, Arendall WB III, Snoeyink J, Richardson JS, Richardson DC: MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Res 2007, (35 Web Server):W375-W383. 10.1093/nar/gkm216


Author information

Correspondence to Graham R Wood.

Additional information

Authors' contributions

Conceived and designed the experiments: FPEH, GRW, CMD and JJE. Performed the experiments: JJE, FPEH and SS. Analyzed the data: JJE, GRW, FPEH and CMD and SS. Wrote the paper: JJE, FPEH, GRW, SS and CMD. All authors read and approved the final manuscript.

About this article

Cite this article

Ellis, J.J., Huard, F.P., Deane, C.M. et al. Directionality in protein fold prediction. BMC Bioinformatics 11, 172 (2010).


  • Sequential Algorithm
  • Protein Structure Prediction
  • Steric Clash
  • Nascent Chain
  • Fragment Insertion