Directionality in protein fold prediction

Background Ever since the ground-breaking work of Anfinsen et al. in which a denatured protein was found to refold to its native state, it has been frequently stated by the protein fold prediction community that all the information required for protein folding lies in the amino acid sequence. Recent in vitro experiments and in silico computational studies, however, have shown that cotranslation may affect the folding pathway of some proteins, especially those of ancient folds. In this paper aspects of cotranslational folding have been incorporated into a protein structure prediction algorithm by adapting the Rosetta program to fold proteins as the nascent chain elongates. This makes it possible to conduct a pairwise comparison of folding accuracy, by comparing folds created sequentially from each end of the protein. Results A single main result emerged: in 94% of proteins analyzed, following the sense of translation, from N-terminus to C-terminus, produced better predictions than following the reverse sense of translation, from the C-terminus to N-terminus. Two secondary results emerged. First, this superiority of N-terminus to C-terminus folding was more marked for proteins showing stronger evidence of cotranslation and second, an algorithm following the sense of translation produced predictions comparable to, and occasionally better than, Rosetta. Conclusions There is a directionality effect in protein fold prediction. At present, prediction methods appear to be too noisy to take advantage of this effect; as techniques refine, it may be possible to draw benefit from a sequential approach to protein fold prediction.


Background
The purpose of this paper is to investigate whether directionality of synthesis can have an impact on the accuracy of protein structure prediction. To do this, a sequential structure prediction algorithm, based on Rosetta, the most successful free modelling method of our time, was developed and used to predict structure, first starting from the N-terminus and then starting from the C-terminus. Free modelling protein structure prediction methodology has improved in recent years, but is still not accurate enough to be considered satisfactory (see results of CASP6 [1] and CASP7 [2,3] and the more recent CASP8 [4]). Given the noisy nature of current free modelling structure prediction techniques, the pairwise comparison design used here appears to be required; it succeeded in detecting a consistent directionality effect. We begin, however, by summarizing the area.
Almost fifty years ago Anfinsen et al. [5,6] showed that denatured small globular proteins could refold to their native state. On the other hand, experimentalists have known for many years that cotranslation can play an important role in protein folding [7][8][9][10][11][12]. Polypeptides are synthesized sequentially, and translation can occur at variable rates according to codon speed [13][14][15][16][17]. In Escherichia coli, for example, translation can occur in the order of 0.05 s/codon [13][18][19][20]. By comparison, it has been shown that helices and sheets fold on the low millisecond scale [21][22][23]. Therefore, some proteins fold faster than they elongate, and it is reasonable to assume that nascent chains can adopt secondary or tertiary structures cotranslationally. Experimental evidence for cotranslational folding dates back to the 1960s, with an in vivo study reporting that ribosome-bound β-galactosidase showed enzymic activity [24]. More recently it has been shown that the Semliki Forest Virus Protein (SFVP), which contains a protease domain that folds to autocatalytically cleave the protein from a larger polyprotein precursor, gains its enzymic activity before complete synthesis of the polyprotein [25]. Moreover, the rapid cotranslational folding of SFVP does not require additional cellular components [26].
In addition to enzymatic activity whilst still bound to the ribosome, intermediate stages of cotranslational folding may have native-like structures. Ribosome-bound nascent chains of α-globin, truncated at several lengths, have been shown to have specific heme binding activity. The shortest of these contained only the first 86 residues (from a total of 147 residues), demonstrating that the nascent chain has native-like structure [27]. NMR studies of nascent chains containing tandem Ig domains and still attached to the ribosome revealed that the N-terminal domain folds to its native state while the C-terminal domain is largely unfolded and flexible [28]. Recent molecular dynamics simulations also conclude that small peptides may adopt a conformation that is similar to the one adopted in full proteins [29]. The discovery of the formation of disulphide bonds in nascent immunoglobulin peptides also confirms the ability of proteins to begin to fold whilst they are being synthesized [30,31].
As well as adopting native-like conformations while still attached to the ribosome, there is evidence that peptides can begin to fold whilst still in the ribosomal exit tunnel. Analysis of the ribosomal exit tunnel reveals that peptides can traverse the tunnel in an α-helical conformation [32], but that at no point is the tunnel big enough to accommodate structures larger than α-helices [33,34]. Peptides are not restricted to an α-helix, however, and may adopt more extended conformations [35]. Analysis of the exit tunnel has also shown that the tunnel can entropically stabilize α-helical conformations as they pass through [36].
The rate of in vitro refolding has often been observed to be slower than the corresponding rate in vivo [37,38]. Cotranslation has been studied in the bacterial luciferase αβ heterodimer, and the formation of the heterodimer is faster when the β monomer is translated in the presence of the folded α monomer than when the β monomer is refolded from a denatured state [38]. This shows that, under cotranslational folding, the β monomer is able to obtain a conformation that is more receptive to the formation of the dimer, thus avoiding kinetic traps associated with refolding from a denatured state [39]. Native-like structure has also been observed in cotranslationally folding monomeric firefly luciferase; again, cotranslational and in vitro folding pathways appear to be different, with cotranslational folding being faster [40]. Cotranslational folding in P22 tailspike protein has been shown to guide the peptide away from aggregation-prone conformations that are frequently encountered when refolding in vitro, leading to the hypothesis that cotranslational folding could be an efficient strategy for the folding of β-sheet topologies, and for large, multidomain proteins in general [41]. One possible explanation for this is that the peptide begins to fold while still attached to the ribosome [42,43]. Another possible explanation is the existence of additional folding machinery contained in the cell; however, only approximately 20% of proteins associate, for example, with chaperones [44,45]. The removal of major chaperones, such as DnaK and Hsp70, in E. coli has no adverse effect on cell growth or viability [46,47]. This suggests that chaperones alone cannot account for the higher folding rates observed in vivo.
Complementing these experimental findings, computational models of cotranslational folding have also been explored, an early, incidental, use of this idea appearing in [48]. Simple computational models of protein folding incorporating cotranslation demonstrate that such folding favours local contacts in intermediate and final folds [49,50]. More recently the effect of energy barriers on simple cotranslational models was studied, and it was found that the ground state of proteins folded sequentially was not necessarily the one of lowest energy [51]. Computational models have provided evidence that nascent chains may adopt partial structures similar to the corresponding parts of the complete protein [52]. Other lattice studies present a differing view of cotranslation where nascent peptides can remain largely unstructured until the final stages of synthesis (estimated to be when 90% or more of the protein has been extruded) [53]. This finding is dependent on the involvement of the C-terminus in tertiary interactions, and may not be applicable to all proteins. There is also evidence arising from lattice models that cotranslational folding pathways and refolding pathways are different [53]. Computational simulations of real proteins folding cotranslationally compared to refolding from a denatured state show mixed results. Chymotrypsin inhibitor 2 (CI2) and barnase were shown to fold mostly posttranslationally, with intermediates similar to those observed in refolding [54]. An alternative computational, cotranslational approach using dynamic optimisation in [55] found that major elements of the CI2 tertiary structure only form when the amino acid string is fully translated. For SFVP, which is known to fold cotranslationally [25], different pathways were taken during synthesis to those taken when folding from a denatured state [54]. A further promising approach is found in [56], which identifies pathways that minimize the difficulty of folding to the native state (for example, those which avoid having the chain pass through an opening); its results indicate that earlier folding is more likely around the N-terminus than the C-terminus, pointing to an asymmetry of the folding process that is confirmed in the current work.
Finally, there is also evidence of cotranslational protein folding that arises from numerical summaries of known protein structures. An analysis of structures in the Protein Data Bank (PDB) found that residues are, in general, closer to previously synthesized residues than to those synthesized later, and that the N-terminal region was more compact than the C-terminal region [57]. It was argued that this provided evidence of cotranslational folding; however, these findings were contradicted by a later analysis of a larger set of proteins [58]. In the second study it was observed that the C-termini were more compact and contained greater numbers of local contacts than the N-termini. Further analysis that considered topological accessibility (the ability of a protein to fold from a given residue as a starting point using only local contacts) found this to be more evident towards the N-terminus in the α/β class of proteins [59]. In a similar vein, Deane et al. [60] developed a measure of previous contacts which assesses the extent to which the chain forms contacts with previously extruded residues. They also found that the α/β class and ancient folds [61] exhibited such evidence of cotranslation.
To date, protein structure prediction methods do not incorporate cotranslational effects. This paper describes such an algorithm and evaluates its performance. This evaluation reveals that, in more than 94% of cases, a sequential algorithm that follows the sense of translation, that is, from N-terminus to C-terminus, is more accurate than an algorithm that follows the reverse sense, from C-terminus to N-terminus. The success of the sequential algorithm is greater the more the target shows evidence of cotranslational folding. It is also found that a sequential algorithm can match, and on occasion better (in 51% of proteins tested), the performance of a leading non-sequential protein structure prediction algorithm, namely Rosetta.

Structure prediction algorithms
A sequential algorithm (SAINT, a Sequential Algorithm Initiated at the Nitrogen Terminus) was developed and used to predict the structure of a number of proteins. This algorithm uses the Rosetta program [62] (version 2.1.0), extending it to incorporate cotranslational aspects of protein folding. To investigate the importance of following the direction of translation, the sequential algorithm was adapted to predict the structure of proteins produced in the reverse direction, from the C-terminus to the N-terminus. Predictions from the sequential and reverse sequential algorithms were compared and they in turn compared to predictions made using an unmodified version of Rosetta. These algorithms are now described.

Sequential algorithm
SAINT extends the peptide by a nine residue fragment at each iteration, starting with the N-terminus. Each fragment is added in a fully extended conformation (ϕ = -150°, ψ = 150° and ω = 180°). The final fragment may contain fewer than nine residues; it will contain as many residues as are required to complete the full protein chain. At each extension the peptide is allowed to fold and the conformation reached is used as the starting structure for the next extension, with Rosetta ab initio used to perform the structure predictions at each stage. In order to make comparisons between the sequential and non-sequential algorithms fair, each uses the same total number of cycles. For the sequential algorithm these cycles were distributed evenly amongst the extensions of the peptide, with the number of cycles calculated as follows. If $b$ is a base number of cycles and $l$ is the protein length then the total number of cycles is $t = b(l/100)$ and the number of extrusions is $e = \lceil l/9 \rceil$. This results in $n = \lfloor t/e \rfloor$ cycles for each of the first $e - 1$ extrusions and $t - n(e - 1)$ cycles for the final extrusion.
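The cycle allocation described above can be sketched in a few lines. This is only an illustration, not the SAINT source: the function name `cycle_schedule` is hypothetical, the default base of 34,000 cycles is the value quoted for the non-sequential algorithm, and truncating t to an integer is an assumption, since the text does not say how a fractional total is handled.

```python
import math

def cycle_schedule(length, base_cycles=34000, fragment_len=9):
    """Return the number of fragment-insertion cycles used at each extrusion."""
    # t = b(l/100); integer truncation is an assumption of this sketch
    total = base_cycles * length // 100
    # e = ceil(l/9): one extrusion per nine-residue fragment, plus a shorter last one if needed
    extrusions = math.ceil(length / fragment_len)
    # n = floor(t/e) cycles for each of the first e - 1 extrusions
    per_step = total // extrusions
    # the final extrusion receives whatever is left, t - n(e - 1)
    last = total - per_step * (extrusions - 1)
    return [per_step] * (extrusions - 1) + [last]

# Example: the 101-residue chain 1qc7A
schedule = cycle_schedule(101)
print(len(schedule), schedule[0], schedule[-1], sum(schedule))
```

Because the remainder is given to the final extrusion, the schedule always sums to the same total that the non-sequential run would use for a chain of that length.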

Reverse sequential algorithm
The reverse sequential algorithm is identical to the sequential algorithm except that the peptide is extended from the C-terminus to the N-terminus.

Non-sequential algorithm
In non-sequential folding a protein is folded from a fully extended state. The Rosetta ab initio algorithm is employed for this process, using insertion from a library of fragments to build decoys (predicted structures). This has proved a successful technique for protein structure prediction in recent years [3][63][64][65]. Rosetta can select fragments from the target, so the algorithm as used here is not strictly ab initio. The number of cycles (fragment insertions) used by Rosetta varies with protein length in this study. A base number of 34,000 cycles was used for a protein of 100 residues, and this number increased proportionately; for example, for a protein with 143 residues the number of cycles is increased by a factor of 1.43. This is reasonable as in the cell longer proteins take more time to be synthesized, and thus have more time to explore conformational space before synthesis is completed.

Selection of targets
In Deane et al. [60] a measure was developed, an Average Logarithmic Ratio (ALR), which assesses the extent of previous contacts within a peptide chain; proteins with positive ALR are expected to be those for which the cotranslational aspect of folding has a substantial impact, whilst proteins with negative ALR are expected to be those for which cotranslation has lesser impact. Two sets of targets were created from a PISCES [66] data set (<30% sequence identity, resolution better than 3 Å, at least 100 residues and no missing residues, downloaded 6 February, 2009). The first set contained protein chains with an ALR value of 0.15 or greater (total of 34 proteins), and the second contained chains with an ALR of -0.15 or less (total of 34 proteins); these two sets are referred to as the positive and negative sets respectively. For each protein in the two sets, 1000 decoys were generated with each of the algorithms described above (sequential, reverse sequential and non-sequential). GDT_TS values [67] were calculated for each of the resulting predictions. GDT_TS is $100\,(C_1 + C_2 + C_4 + C_8)/(4N)$, where $C_i$ is the number of corresponding residues within $i$ Å and $N$ is the total number of residues. It measures the closeness of corresponding residues in known and predicted structures, more heavily weighting closer pairs. It is helpful to see it in non-cumulative form as $100\,(4D_1 + 3D_2 + 2D_4 + D_8)/(4N)$, where $D_1 = C_1$, $D_2 = C_2 - C_1$, $D_4 = C_4 - C_2$ and $D_8 = C_8 - C_4$.
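As a rough illustration of the score, the sketch below computes GDT_TS from a list of distances between corresponding residues, once in the usual cumulative form (counts within 1, 2, 4 and 8 Å) and once in a binned form with weights 4, 3, 2 and 1. It assumes the distances come from a single, already chosen superposition; standard GDT_TS implementations search over superpositions for each cutoff, which is not attempted here. The function names and the example distances are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS (%) from distances (in Å) between corresponding residues."""
    n = len(distances)
    # cumulative counts: residues within 1, 2, 4 and 8 Å
    counts = [sum(d <= c for d in distances) for c in cutoffs]
    return 100.0 * sum(counts) / (len(cutoffs) * n)

def gdt_ts_binned(distances):
    """The same score in non-cumulative form: closer pairs carry larger weights."""
    weight = 0
    for d in distances:
        if d <= 1.0:
            weight += 4   # counted under all four cutoffs
        elif d <= 2.0:
            weight += 3
        elif d <= 4.0:
            weight += 2
        elif d <= 8.0:
            weight += 1
    return 100.0 * weight / (4 * len(distances))

# Hypothetical distances for a five-residue toy example
example = [0.5, 1.5, 3.0, 6.0, 10.0]
print(gdt_ts(example), gdt_ts_binned(example))
```

The two functions agree because a residue within 1 Å is counted under every cutoff, a residue between 1 and 2 Å under three of them, and so on.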

Larger sample size
To establish whether the sample size (that is, the number of decoys produced for each protein) has an effect on the results, two proteins were subjected to a larger sampling. An additional 100,000 decoys were generated for the FLiG C-terminal domain of Thermotoga maritima (1qc7A) and also for 1ji4A, using the SAINT algorithm.

Variability in peptide termini
As the differences between mean GDT_TS scores for SAINT and reverse SAINT, for a given protein, prove to be generally small, additional tests were conducted to ascertain whether terminus loop regions could be causing the observed effects. The termini of proteins are often unstructured, and their structure can be highly variable and difficult to predict. Small mistakes in the terminus regions could lead to the small differences observed between the mean GDT_TS scores. The first N-terminus and last C-terminus secondary structure elements were identified in the experimental structure for each protein, and the termini up to the identified secondary structure element of the corresponding predicted model with the highest GDT_TS were removed. A secondary structure element was defined as a run of four residues with identical secondary structure assignment. Secondary structure was assigned from the experimentally determined structure with DSSP. In addition to these conditions the N-terminus and C-terminus secondary structure element had to be separated by at least five residues. GDT_TS scores were recalculated and counts taken of how often SAINT outperformed reverse SAINT and how often SAINT outperformed Rosetta.
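The trimming rule can be sketched on a DSSP-style string with one character per residue. Several details are assumptions of this sketch rather than statements from the text: a "run of four" is read as at least four identical codes, loop-like codes (`-`, `T`, `S` and blank) are not counted as elements, and the five-residue separation is measured from the end of the first element to the start of the last. The function names are hypothetical.

```python
def find_runs(ss, min_run=4, loop_codes="-TS "):
    """Return (start, end) pairs, end exclusive, of runs of identical non-loop codes."""
    runs = []
    i = 0
    while i < len(ss):
        j = i
        while j < len(ss) and ss[j] == ss[i]:
            j += 1
        if j - i >= min_run and ss[i] not in loop_codes:
            runs.append((i, j))
        i = j
    return runs

def core_span(ss, min_run=4, min_gap=5):
    """Residue range kept after removing both termini up to the first and last elements.

    Returns None when no pair of elements satisfies the separation condition.
    """
    runs = find_runs(ss, min_run)
    if not runs:
        return None
    first, last = runs[0], runs[-1]
    if last[0] - first[1] < min_gap:
        return None
    return first[0], last[1]

# Hypothetical assignment string for a 23-residue chain
ss = "--HHHHHH---TT--EEEEE-SS"
print(core_span(ss))
```

GDT_TS would then be recomputed on the residues inside the returned range only.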

Clash analysis
A possible reason for the better performance of SAINT was conjectured to be that extrusion from the N-terminus produces fewer steric clashes than does extrusion from the C-terminus. In order to investigate this, ten protein sequences were selected on the basis of their mean GDT_TS scores: four in which SAINT performed better, three in which reverse SAINT performed better, and three in which SAINT and reverse SAINT performed comparably. For each protein, two of the 1000 models generated were selected for each of SAINT and reverse SAINT. The extent of steric clashes in conformations following folding, for five extruded lengths (18, 36, 54, 72 and 90 residues), was assessed using MolProbity [68], a web server that calculates a "clashscore", equal to the number of steric overlaps that are greater than 0.4 Å per 1000 atoms. Nine residues in fully extended conformation were then added at the C-terminus (for SAINT) or the N-terminus (for reverse SAINT) to produce strings of length 27, 45, 63, 81 and 99, and these were checked again for steric clashes. For each of the five positions, the clashscore before the addition of nine residues was subtracted from the clashscore after the addition of the 9-mer fragment. An average of the differences in clashscores, across all five lengths, was taken for each protein sequence and each algorithm.
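The bookkeeping for this comparison is simple enough to show directly. In the sketch below the clashscores are invented numbers used only to exercise the arithmetic (real values would come from MolProbity), and the function name is hypothetical; how the two models per protein were combined is not stated, so a single model is shown.

```python
FOLDED_LENGTHS = (18, 36, 54, 72, 90)

def mean_clash_increase(folded, extended, lengths=FOLDED_LENGTHS, step=9):
    """Average change in clashscore caused by adding a nine-residue extended fragment.

    folded:   clashscore after folding, keyed by extruded length
    extended: clashscore after the next fragment is added, keyed by the new length
    """
    diffs = [extended[n + step] - folded[n] for n in lengths]
    return sum(diffs) / len(diffs)

# Invented clashscores for one model of one protein
folded = {18: 10.0, 36: 12.0, 54: 8.0, 72: 9.0, 90: 11.0}
extended = {27: 12.0, 45: 13.0, 63: 11.0, 81: 9.0, 99: 15.0}
print(mean_clash_increase(folded, extended))
```

A larger value means the newly added fragment introduced more clashes on average; comparing this quantity between SAINT and reverse SAINT is the test described above.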

The importance of sense
To investigate why SAINT might perform consistently better than reverse SAINT, measures of secondary structure prediction quality were developed. For a given decoy, structural alignments for every overlapping fragment of 11 residues against the experimental structure were obtained, and the average Cα-Cα distance of the alignment was assigned to the fragment's center residue (fragments of 11 residues were chosen to provide insight into prediction accuracy on a more local scale than, for example, taking an entire secondary structure element). These residue-assigned distance measures were averaged across all residues in α-helices in the decoy (residue secondary structure was assigned by DSSP for the experimentally determined model) and these in turn averaged over all 1000 decoys. This was done for both the forward and reverse decoy sets. Finally, the forward helical prediction quality measure was subtracted from the reverse helical prediction quality measure. The same process was followed for β-strands. If directionality is not important in folding we would expect the accuracy of helical or strand predictions to be similar regardless of the direction of synthesis, resulting in the difference calculated above being zero. A positive difference would indicate that forward predictions were more accurate than reverse predictions while negative differences would indicate that reverse predictions were more accurate. One of the proteins in the positive set (1qc7A) and four in the negative set (1kf6D, 1mkaA, 1nekC and 1uz3A) contained no β-strand residues and, therefore, were not considered in the analysis.
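The averaging steps of this measure can be sketched as follows. The structural alignment itself is not shown: each decoy is represented by a list of fragment scores (the mean Cα-Cα distance of the fragment starting at each position), assumed to have been computed elsewhere. The DSSP codes `H` for α-helix and `E` for β-strand, the function names, and the toy values are assumptions of the sketch.

```python
def per_residue_scores(fragment_scores, window=11):
    """Assign the score of the fragment starting at i to its center residue."""
    half = window // 2
    return {start + half: s for start, s in enumerate(fragment_scores)}

def class_mean(scores, ss, codes):
    """Average the residue-assigned scores over residues whose DSSP code is in codes."""
    vals = [s for idx, s in scores.items() if ss[idx] in codes]
    return sum(vals) / len(vals) if vals else None

def set_mean(decoys, ss, codes, window=11):
    """Average the per-decoy class means over a whole decoy set."""
    means = [class_mean(per_residue_scores(f, window), ss, codes) for f in decoys]
    means = [m for m in means if m is not None]
    return sum(means) / len(means) if means else None

def directionality_difference(forward, reverse, ss, codes, window=11):
    """Reverse minus forward; a positive value means forward decoys are closer to native."""
    f = set_mean(forward, ss, codes, window)
    r = set_mean(reverse, ss, codes, window)
    if f is None or r is None:
        return None
    return r - f

# Toy 13-residue chain: three overlapping 11-residue fragments, centered on residues 5, 6, 7
ss = "-HHHHHHHHHEE-"
forward = [[1.0, 2.0, 3.0]]   # one decoy, invented distances in Å
reverse = [[2.0, 3.0, 4.0]]
print(directionality_difference(forward, reverse, ss, "H"))
```

Because the measure is a distance, smaller is better, which is why the forward value is subtracted from the reverse value to make a positive difference favour the forward direction. Proteins with no residues of a given class return `None`, mirroring their exclusion from the strand analysis.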

Results and Discussion
The emerging partial conformations produced by SAINT for sequence 1qc7A are shown in Figure 1, using the most successful decoy. The six helices are seen to progressively take shape as the chain is extruded, with early conformations largely preserved. Results for SAINT, reverse SAINT and Rosetta for each of the proteins in the positive set (ALR ≥ 0.15, see Methods, Selection of targets) and negative set (ALR ≤ -0.15) are summarized in Table 1 and Table 2 respectively. The mean performance and best models produced by SAINT show that it predicts structures better than reverse SAINT in the majority of cases (Table 3). For example, SAINT yielded a higher mean GDT_TS than reverse SAINT for 32 of the 34 proteins with positive ALR and equally, for 32 of the 34 proteins with negative ALR. Plots of the mean scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2A, with proteins ordered from smallest to largest mean SAINT GDT_TS score. Corresponding plots for the negative set are given in Figure 3A. The consistent superiority of SAINT over reverse SAINT is evident, with the difference being slightly greater for the positive set. The largest such difference seen in all the data is 8.49%, observed between the means of SAINT and reverse SAINT for 3ezmA (negative set), and representing an increase in GDT_TS from 20.25% to 28.74%. Mean performances of SAINT and Rosetta indicate that Rosetta outperforms SAINT in both the positive (Rosetta 19.72, SAINT 19.50) and negative (Rosetta 18.26, SAINT 17.84) sets. The difference is greater for the negative set (Table 3).

Figure 1 Cotranslational structure prediction of the FLiG C-terminal domain (1qc7A; 101 residues). Segments of nine residues are extruded at a time except for the last segment which consists of two residues. One thousand decoys were produced; the particular simulation above produced the structure with the highest GDT_TS of 63.12%. In each sub-figure the N-terminus is coloured dark blue and appears at the center adopting approximately the same orientation; it cannot always be the same orientation due to changes in conformation as the protein folds.

Tables 1 and 2 The mean GDT_TS and maximum GDT_TS for all 1000 decoys produced for each combination of protein and algorithm are shown. For both the mean and maximum GDT_TS the highest GDT_TS is shown in bold while the lowest is shown in italics.
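To give a sense of how unlikely a 32 out of 34 split would be if direction made no difference, a one-sided sign test can be worked through. This calculation is an illustration added here and is not part of the reported analysis; it assumes that each protein is an independent trial with probability one half of favouring either direction, and that the two remaining proteins are losses rather than ties. The function name is hypothetical.

```python
from math import comb

def sign_test_upper_tail(wins, n):
    """P(X >= wins) for X ~ Binomial(n, 1/2): the one-sided sign test p-value."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# 32 of 34 proteins with a higher mean GDT_TS for SAINT than for reverse SAINT
p_value = sign_test_upper_tail(32, 34)
print(f"{p_value:.2e}")
```

Under these assumptions the tail probability is 596 / 2^34, which is why a paired design can expose a small but consistent effect that would be hard to see in noisy unpaired scores.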
Plots of the maximum scores for SAINT, reverse SAINT and Rosetta for the positive set are given in Figure 2B, with proteins ordered from smallest to largest maximum SAINT GDT_TS score. Corresponding plots for the negative set are shown in Figure 3B. When considering best performance, SAINT is again superior to reverse SAINT, and more so in the positive set. Rosetta is no longer superior when best performance is considered; SAINT outperforms Rosetta, for example, in 19 of the 34 proteins in the positive set. The most successful SAINT prediction in the positive set was found for 3vubA. It is shown superposed on the native conformation in Figure 4, together with superpositions of the best reverse SAINT and Rosetta predictions on the native conformation. SAINT captures the structure better than either reverse SAINT or Rosetta.
A GDT_TS value of 30% or above is generally considered to ensure that a reasonable prediction is found [4]; a scan of Table 1 indicates that roughly one half (15 out of 34) of the best SAINT predictions are satisfactory, and similarly for Rosetta (16 out of 34).

Larger sample size
Summaries of the distribution of GDT_TS scores indicate that the size of the decoy sets used (that is, 1000) does not significantly influence their values (for 1qc7A, sample size of 1000 has min. 23

Variability in peptide termini
The results of this test indicate that the differences in GDT_TS observed are not due to variability in the terminus regions of the peptides (data presented in Tables 4 and 5).

Clash analysis
The results are shown in Table 6. Four of the ten protein conformations examined have higher steric clashscores for SAINT than reverse SAINT. The steric clashscore appears not to be influenced by its mean GDT_TS score, evidenced by two (1mf7A and 2d00A) out of the four proteins with higher mean GDT_TS scores for SAINT having greater steric clashscores than reverse SAINT. Steric clashes produced by SAINT and reverse SAINT are generally comparable, so providing no evidence that fewer steric clashes are the reason for the better performance of SAINT.

The importance of sense
The differences obtained from both the positive and negative sets are shown in Figure 5. These results show that for both types of secondary structure SAINT is generally producing better predictions, but that the difference is more pronounced for strand residues. In 28 of the 33 proteins (85%) in the positive set the difference between forward and reverse folding is greater for strands than for helices (with 16 (48%) having a β-strand difference more than twice the α-helix difference). Similarly, in 26 of the 30 proteins (87%) in the negative set the difference between forward and reverse folding is greater for strands than for helices (with 19 (63%) having a β-strand difference more than twice the α-helix difference). These results indicate that in general SAINT is more accurate when predicting strands than is reverse SAINT. These differences are small, but they would account for the differences observed in the results.

Discussion
A consistent difference in prediction accuracy was seen between SAINT and reverse SAINT. SAINT is markedly superior to reverse SAINT, and slightly more so for proteins with positive ALR values. When looking in detail at SAINT and reverse SAINT, the differences observed are most likely due to the detrimental effect on strand prediction observed when elongating a peptide from the C-terminus to the N-terminus. SAINT produced decoys with a higher mean GDT_TS than reverse SAINT for more than 94% of proteins in both the positive and negative protein sets. The differences between mean GDT_TS scores for SAINT and reverse SAINT decoys were also bigger than those between SAINT and Rosetta decoys. If directionality played no part in the folding process it would be expected that there would be no difference in the predictive accuracy of extrusion from the N-terminus to the C-terminus and extrusion from the C-terminus to the N-terminus. Three possible explanations for these results are outlined below.
Peptides, when extruded from the ribosome, start at the N-terminus. For this reason, fragments near the N-terminus are less influenced in their folding by the remainder of the peptide, since this has yet to emerge from the ribosome. On the other hand, fragments towards the C-terminus must fold in the presence of the bulk of the peptide. Thus the conformation assumed by the early fragment is a local choice, in that it depends largely on the amino acid sequence of the fragment. The conformation reached by a later fragment is determined by more than its amino acid sequence, in that it also depends on surrounding structure. This behaviour is mimicked by SAINT but not by reverse SAINT, so providing an explanation for the consistently better predictive accuracy of SAINT.
A second explanation arises from the way that the two algorithms allocate fragment insertions. At any stage, due to the constraints of Rosetta, fragment insertions are made uniformly across the currently extruded peptide length. The upshot is that more fragment insertions are attempted at the N-terminus than the C-terminus for SAINT while the opposite is true for reverse SAINT. Should it be the case that the N-terminus of the peptide is harder to predict than the C-terminus, SAINT would be more successful than reverse SAINT since SAINT puts in effort where it is needed. Due to the reasons stated above, however, we expect the N-terminus to be more easily predicted than the C-terminus.

Figure 3 Plots of mean and maximum GDT_TS for the negative set. Graphic A shows the mean GDT_TS scores for the 34 proteins in the negative set, for SAINT (red squares), reverse SAINT (blue circles) and Rosetta (green triangles), with the proteins ordered according to ascending mean SAINT GDT_TS. Graphic B plots maximum GDT_TS for proteins in the negative set, ordered by ascending maximum SAINT GDT_TS. Outcomes are the same as for the positive set, with all differences less marked.
A third possibility is that Rosetta itself has some inherent directionality, so favouring SAINT over reverse SAINT. A study of Rosetta, however, provides no indication of such a directional bias.
A strong correlation between mean GDT_TS and chain length is seen for both the positive and negative sets and for all three algorithms: as the chain length increases the GDT_TS decreases. 1oaaA is the only target over 200 residues in length that produced a set of decoys with mean GDT_TS greater than 20%, indicating that the versions of the algorithms employed in this study are not sufficient to accurately predict the structure of chains with more than 200 residues (this accounts for 50% of the positive set and 24% of the negative set). Excluding this data from the analysis, however, makes no difference to the overall findings.
Given that SAINT outperforms reverse SAINT it might be expected that SAINT would also outperform Rosetta, Rosetta being, in some senses, midway between the two. In best performance, arguably more important than mean performance, there is weak evidence that SAINT does outperform Rosetta; for the positive set SAINT outperforms Rosetta in 19 out of 33 instances (there is one tie) and for the negative set SAINT outperforms Rosetta in 16 out of 30 instances (there are four ties). An explanation why this remains weak at this stage is that SAINT remains crude, barely exploiting spatial and temporal advantages which may be available in cotranslational folding; we have simply used an iterative version of Rosetta. For example, at each extrusion, fragment insertions are chosen uniformly along the extruded peptide, whereas use of an insertion location distribution skewed towards the C-terminus might be more realistic. To its credit, however, the SAINT versus reverse SAINT investigation exploits the power of a "paired comparison" design more effectively than does the SAINT versus Rosetta investigation, in that it contrasts opposites and so is more likely to reveal an effect.

Figure 4 Superpositions of the best predictions for 3vubA on the native structure. The best decoy produced overall was by SAINT for 3vubA, whose native conformation is shown in a). The remaining graphics show the superposition of this native conformation with the best decoy produced by b) SAINT (GDT_TS = 67.57), c) reverse SAINT (GDT_TS = 37.62) and d) Rosetta (GDT_TS = 51.24). The SAINT decoy best captures the native loop and sheet conformation; a loop error causes the C-terminal helix to be incorrectly oriented.

Table 4 Among the 1000 decoys produced for each protein with ALR ≥ 0.15 by each of SAINT, reverse SAINT, and Rosetta the best model (with highest GDT_TS) was found (as indicated in Table 1 by Maximum GDT_TS). Each of these selected models was then altered by chopping off the first N-terminus and last C-terminus secondary structure elements identified in its native structure. GDT_TS scores were then recalculated for each algorithm and are displayed below. The highest GDT_TS is shown in bold while the lowest is shown in italics. Sample size was reduced to 33 as no secondary structural element at least five residues in length was found at either terminus of the protein chain 2j01V.

Table 5 Among the 1000 decoys produced for each protein with ALR ≤ -0.15 by each of SAINT, reverse SAINT, and Rosetta the best model (with highest GDT_TS) was found (as indicated in Table 2 by Maximum GDT_TS). Each of these selected models was then altered by chopping off the first N-terminus and last C-terminus secondary structure elements identified in its native structure. GDT_TS scores were then recalculated for each algorithm and are displayed below. The highest GDT_TS is shown in bold while the lowest is shown in italics.

Conclusions
This study has presented an algorithm that builds cotranslation into protein structure prediction. To assess the importance of the direction of translation the sequential algorithm was compared to a reverse sequential algorithm where the protein was produced from the C-terminus to N-terminus. Two sets of proteins were chosen: one where the residues have, on average, more contacts with previous residues than successive residues and the other where the residues have, on average, more contacts with successive residues than previous residues. The performance of the sequential algorithm for protein structure prediction was also compared with Rosetta, which folds from a fully elongated chain.
When SAINT was compared to reverse SAINT a very pronounced difference was observed. When mean GDT_TS was used as the performance measure SAINT outperformed reverse SAINT for over 94% of targets from both the positive and negative sets. These figures were still high when the maximum GDT_TS was used as the performance measure, with SAINT outperforming reverse SAINT in over 91% of targets from the positive set and over 73% of targets from the negative set.
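The per-target comparison behind these percentages can be sketched as follows: each method's decoy set for a target is reduced to a single summary (mean or maximum GDT_TS), and the fraction of targets on which one method's summary exceeds the other's is reported. This is a minimal illustration, not the code used in the study; `win_fraction` and the type alias `Scores` are hypothetical names, and the toy scores are invented purely to make the example runnable.

```python
from statistics import mean
from typing import Callable, Dict, List

Scores = Dict[str, List[float]]  # target ID -> GDT_TS of each decoy


def win_fraction(a: Scores, b: Scores,
                 summary: Callable[[List[float]], float]) -> float:
    """Fraction of shared targets on which method a beats method b under the
    chosen per-target summary (e.g. mean or max); ties count as non-wins."""
    targets = sorted(set(a) & set(b))
    if not targets:
        raise ValueError("no targets in common")
    wins = sum(summary(a[t]) > summary(b[t]) for t in targets)
    return wins / len(targets)


# Made-up toy scores purely for illustration; NOT data from the study.
forward = {"t1": [30.0, 42.0, 35.0], "t2": [20.0, 25.0, 22.0], "t3": [50.0, 51.0, 49.0]}
reverse = {"t1": [28.0, 33.0, 31.0], "t2": [18.0, 27.0, 19.0], "t3": [40.0, 45.0, 44.0]}

by_mean = win_fraction(forward, reverse, mean)  # compare decoy-set means
by_max = win_fraction(forward, reverse, max)    # compare best decoys
```

Passing the summary as a function keeps the mean-based and maximum-based comparisons on exactly the same footing, so any difference between the two fractions reflects the performance measure rather than the bookkeeping.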
The results show that Rosetta produces decoy sets with higher mean GDT_TS scores than SAINT for both the positive and negative protein sets, but that this superiority of Rosetta is not seen when the models with the highest GDT_TS scores are compared. If it were possible to always select the most accurate structure from the set of decoys then SAINT would, overall, produce a better prediction than Rosetta. The selection of the best decoy from a set, however, is a separate problem that is not addressed in this study. While Rosetta produces decoy sets with higher mean GDT_TS scores than SAINT, examination of the differences between the means shows that the difference is always small. Only on one occasion does a Rosetta decoy set have a mean GDT_TS greater than 2% above the corresponding SAINT decoy set (an increase in mean GDT_TS from SAINT to Rosetta of 2.4% for 1ji4A). It has been established that the size of the decoy set and flexibility of peptide terminus residues do not affect the distribution of GDT_TS scores.

The sequential algorithm described in this study is in its earliest stages of development. Future work will include investigation of the effect of translation speed, allowing extruded segments to have variable length and the number of fragment insertion attempts at each iteration to vary. Improvements should also include incorporation of spatial restrictions to simulate the constraint of the ribosome tunnel.

Mean difference in clashscores for each protein sequence; the larger the mean difference, the more clashes created by the extrusion. The first four proteins in the table have higher mean GDT_TS scores for SAINT, the next three have higher mean GDT_TS scores for reverse SAINT and the remaining three have comparable mean GDT_TS scores for SAINT and reverse SAINT. There is no evidence that SAINT creates more clashes.

SAINT) in the secondary structure distance measure for helical (grey) and strand (black) residues. Positive values here indicate that SAINT is producing predictions that are more accurate than those of reverse SAINT. Evidently SAINT outperforms reverse SAINT for both types of secondary structure, but more strongly for strands and the negative set.