Reduction of the secondary structure topological space through direct estimation of the contact energy formed by the secondary structures

Background Electron cryomicroscopy is a fast developing technique aiming at the determination of the 3-dimensional structures of large protein complexes. Using this technique, protein density maps can be generated with 6 to 10 Å resolution. At such resolutions, the secondary structure elements such as helices and β-strands appear to be skeletons and can be computationally detected. However, it is not known which segment of the protein sequence corresponds to which of the skeletons. The topology in this paper refers to the linear order and the directionality of the secondary structures. For a protein with N helices and M strands, there are (N!2N)(M!2M) different topologies, each of which maps N helix segments and M strand segments on the protein sequence to N helix and M strand skeletons. Since the backbone position is not available in the skeleton, each topology of the skeletons corresponds to additional freedom to position the atoms in the skeletons. Results We have developed a method to construct the possible atomic structures for the helix skeletons by sampling the solution space of all the possible topologies of the skeletons. Our method also ranks the possible structures based on the contact energy formed by the secondary structures, rather than the entire chain. If we assume that the backbone atomic positions are known for the skeletons, then the native topology of the secondary structures can be found in the top 30% of the ranked list of all possible topologies for all the 30 proteins tested, and within the top 5% for most of the 30 proteins. Without assuming the backbone location of the skeletons, the possible atomic structures of the skeletons can be constructed using the axis of the skeleton and the sequence segments. The best constructed structure for the skeletons has RMSD to native between 4 and 5 Å for the four tested α-proteins. These best constructed structures were ranked the 17th, 31st, 16th and 5th respectively for the four proteins out of 32066, 391833, 98755 and 192935 possible assignments in the pool. Conclusion Our work suggested that the direct estimation of the contact energy formed by the secondary structures is quite effective in reducing the topological space to a small subset that includes a near native structure for the skeletons.


Conclusion:
Our work suggested that the direct estimation of the contact energy formed by the secondary structures is quite effective in reducing the topological space to a small subset that includes a near native structure for the skeletons.
Although not yet at the atomic resolution, this technique can be used to generate protein density maps at 6 to 10 Å resolution [1][2][3][4][5][6][7][8][9][10][11]. Computational methods have been developed to detect the secondary structures such as helices and β-strands from the density map at this resolution [12][13][14][15][16]. Since the secondary structures are often the major building blocks of a structure, the detected secondary structures appear as the skeletons (marked as sticks in Figure 1) of the density map. In the density map of 6-10 Å resolution, alpha helices look like sausages with 5-6 Å in diameter, and beta-sheets appear as extended volumes with varying sizes. The connection relationship between the skeletons is often unavailable in the density maps.
Given a protein sequence, the location of secondary structures on the sequence can be roughly predicted using existing secondary structure prediction methods. Such methods can generally predict the secondary structures to about 70-80% accuracy [17][18][19]. Although not com-pletely correct, the predicted secondary structure segments of the sequence provide a rough location of the secondary structures. Wu et al has used the secondary structure geometrical relationships existing in the PDB to reduce the topological space. The reduced possible topologies were then used to add the loops and a simulation of the entire chain was used to identify the true topology [20]. We have also shown previously that the decoys generated by Rosetta ab initio structure prediction software can be used to reduce the topological space [21,22]. In this paper, we will introduce a method that directly measures the contact energy of the secondary structures to reduce the topological space of the skeletons. This is the first such demonstration. The secondary structure topology problem of the CryoEM density map is to find the true mapping between the predicted secondary structure segments on the sequence and the skeletons of the density map. Secondary structure topology in this paper refers to the order and the direction of the secondary structures such as helices and strands with respect to the protein sequence. For a protein with N helices and M strands, there are (N!2 N )(M!2 M ) different topologies. For example, there are N! different permutations for assigning N helix segments to N helix skeletons and two directions to assign a sequence segment to each skeleton. This paper investigates the following question. Given the skeletons detected in the protein density map, is it possible to derive a small subset of the topological space that contains the native topology of the secondary structures using a direct measurement of the energy formed by the secondary structures? The determination of favourable secondary structure topologies has always been one of the most essential problems in tertiary structure prediction. Although predicting the topology, in general, is as hard as predicting the tertiary structure, it is not clear if it is true when the rough location of the secondary structures is known.
Systematic mutation analysis has shown recently that the native protein sequence for a specific 3-dimensional structure is close to optimal [23]. This is intuitively not surprising since nature might have derived a near optimal sequence to achieve the function through evolution. The research in this paper is guided by the intuition that when the relative geometry of the secondary structures is fixed in the 3-dimensional space, there might only be a small number of topologies that can provide energetically favorable structures. In this paper, we describe a method to Protein density map and the detected helix skeletons Figure 1 Protein density map and the detected helix skeletons. The density map simulated using protein 1AGW (PDB Id) to 10 Å resolution using EMAN [31,32]. The helix skeletons (sticks) were detected using Helix Tracer [13]. perform mutation analysis for the secondary structures. It generates possible atomic structures for the skeletons through the exhaustive search in the topology space of the secondary structures.

Methods
The overall framework to generate the possible secondary structure topologies contains two stages. The first stage is part of the second stage, but it demonstrates the potential of the overall approach. The question for the first stage is how to generate all the topologies if the positions of the Cα atoms in the secondary structures are given. Note that the Cα atoms are not resolved in the low resolution density map, although the location of the secondary struc-tures can be computationally detected [12][13][14][15][16]. The second stage starts with the axial location of the skeletons, rather than the backbone atom locations. It generates the coordinates of the Cα atoms through translation and rotation around the axis. The second stage in this paper only implemented the helices, although the first stage implemented both the helices and β-strands. Figure 2 shows a flowchart of the topology generation. We have done a test of stage 1 using 30 proteins. Our preliminary result of stage 2 involves 4 proteins.
Generation of the secondary structure topologies for the skeletons Figure 2 Generation of the secondary structure topologies for the skeletons. The three major steps in the first stage are included in two boxes with solid lines. The additional steps in the second stage are marked with a box in dashed lines.

Stage 1 -Secondary structure mutation and side chain optimization
For each of the tested proteins, a total of 2 N 2 M (N!M!)of all the possible topologies were generated, where N is the number of helices and M is the number of beta strands. A mutated topology of the secondary structures was generated using the coordinate of the backbone atoms of the secondary structures and the sequence segments that form the secondary structures. Two major steps were involved: (1) mapping sequence segments to the skeletons and (2) construction of the side chain atoms. Figure 3 shows an example of a native topology (R 1 -R 2 -R 3 ) and a mutated topology (R 3 -R 2 -R 1 ) of the secondary structures.
In the native topology, R 1 , R 2 and R 3 skeletons were assigned to the sequence segments H 1 , H 2 , and H 3 respectively. In the mutated topology, R 1 , R 2 and R 3 were assigned to H 3 , H 2 , and H 1 respectively. When the length of the sequence segment (i.e. H 1 in Figure 3) does not match that of the skeleton (i.e. R 3 in Figure 3), we used the length of the skeleton as the reference. The sequence segment was either truncated (crossed boxes in Figure 3) or padded (amino acids in red in Figure 3) based on the dis-Secondary structure mutation of 1DV5 (PDB Id) Figure 3 Secondary structure mutation of 1DV5(PDB Id). The location and orientation of the three helix skeletons (cylinder R 1 , R 2 and R 3 ) are shown. The three segments of the sequence (H1, H2 and H3) forming helices are labelled on the sequence. A mutated topology of the secondary structures was generated by swapping the sequence assignment for (R 1 , R 3 ). Amino acids in the boxes: the deleted amino acids during the mutation; Amino acids in red: the padded amino acid. The thicker side chains are in the native structure. The thinner side chains are in the mutated structure after side chain relaxation. For viewing clarity, only the side chains in the vicinity of the secondary structure interaction are shown.
tance from the center amino acid of the segment. For example, when R 3 is mutated from H 3 to H 1 , three amino acids were deleted from each end of H 1 (Figure 3). When R 1 is mutated from H 1 to H 3 , H 3 is padded with six amino acids to match the length of R 1 . In this case, since H 3 is only two amino acids away from the end of the sequence, it is padded more at the other end.
After the sequence segments were assigned to the skeletons, the side chains were constructed. The Rotamer library was used and the side chains were relaxed using the simulated annealing [24,25].

Multi-well energy function
We have developed a multi-well energy function by incorporating the statistical data to a Lennard Jones shaped function [26]. The multi-well function is composed of the original Lennard Jones shaped function (U 0 in formula (1)) [26] and the statistically derived function (U 1 in (1))using two window functions f 1 and f 2 .
The e ij is the contact energy between two amino acids [27,28]. The p, q and r 0 are adjustable parameters. r mn is the distance between the side-chain centers of residue m and n. The signs are chosen to be positive if e ij > 0, or negative if e ij < 0. There is a potential well near the favorable sidechain distance r 0 in U 0 .
, N peak and c are statistically derived parameters based on an analysis of 413 selected proteins from the PDB. Here is a non-dimensional coefficient varying with distance r mn and represents the depth of the potential well. r k is the position of the k-th residue distance peak in the distribution curve of the residue pair distance. Here the window functions f 0 , f 1 satisfy: are the first and the last peak positions respectively in the distance distribution. δ is a small constant to allow the smooth connection between U 1 and U 0 that are combined using a set of window functions f 1 and f 0 . By statistical analysis of 413 proteins in PDB, a new multiwell contact energy function was developed for each of the 210 residue pair interaction. The contact energy measured is the mean of the energy from all the contact pairs in the secondary structures. Both the intra and inter secondary structure energy were considered. The contact energy from a pair of residues within a helix or a strand is the intra energy, and the contact energy from a pair of residues in different helices or stands is the inter energy. The average of inter and intra energy, named as effective contact energy U eff , was used as a measurement of the stability of the secondary structures. We are preparing another paper to with more details about the construction of the energy function.

Stage 2 -Sequence assignment to the skeletons
In stage 1, we assumed that the number of detected skeletons and the number of the sequence segments of the secondary structures are the same. In a general case, suppose that there are Kα sequence segments forming helices and Kβ sequence segments forming β-strands in a native protein structure. Let's also suppose that there are N helix skeletons and M strand skeletons detected from the protein density map. Based on our experience, Kα and Kβ are usually larger numbers than N and M respectively. The total number of different sequence combination is: For each set of the sequence segments picked from Kα and Kβ, the total number of possible sequence assignment is (N! × 2 N )(M! × 2 M ), as was dealt with in the first stage.
When not all the secondary structures are detected in the density map, the total number of topologies is: When the sequence segments are predicted using a secondary structure prediction tool, such as PSIPRED [29], the predicted segments are often shifted from their corresponding true locations. In order to estimate the magnitude of the computation, let's suppose that each sequence segment can have a shift in both directions with a maximal number of p amino acids. Then there are 2p+1 possible sequence segments for each true segment. The total number of assignments for the skeletons is Note that the above estimation of the total assignments assumes each segment having the same maximum number of shifts. In reality, different segments might have different maximum number of shifts depending on the nature of the particular sequence. In the current implementation of stage 2, we allowed the maximum shift to be 1 (p = 1) and all the tested proteins are α-helical proteins.
The total possible assignment number n total is still a huge number, even when M, N, p are relatively small.
In order to ignore the obvious wrong assignment between a sequence segment and the skeleton where their lengths differ significantly, we used a threshold d = 6 to require them to have similar lengths.
Here l skl , l seq is the length of skeleton and sequence segment.

Stage 2 -Construction of the atomic structures for the skeletons
Once a set of sequence segments are assigned to a set of helix skeletons, the possible atomic structures can be constructed for the skeletons using the information of the segments and the spatial position of the skeletons. However, since the skeleton does not provide the exact location of the backbone atoms at low resolution, this step generates the possible atomic coordinates of the atoms. By CSAW method [30], the protein structure backbone and the side chains were created and moved to the corresponding spatial positions of the helix skeletons. Each constructed helix structure was then rotated around the central axis of the skeleton as part of the structure adjustment. A rigid shift along the axis was also built in the method, although we did not use it in this paper due to the large solution space. The rotation step size was 30° in this paper. For each candidate structure, side chain Rotamer library was used to relax the side chain overlap and Simulated Annealing method was used to minimize contact energy. Protein structures passed the overlap screening were ranked according to their contact energy formed by the helices.

Results
The dataset used to test the first stage involves 30 proteins from the Protein Data Bank (PDB). Some of the proteins are listed in the paper of Nanias et. al [26]. Others were randomly selected from the proteins that satisfy the following criteria: (1) has single domain (2) has 1.5 Å or better resolution (3) share less than 30% sequence similarity (4) has less than 7 secondary structures, due to the amount of computation. The key question in the first stage was to see if the protein is comfortable when the segments of the sequence are assigned to the skeleton in all possible topologies. Intuitively, when the positions of the backbone atoms are fixed, only a small number of the all possible sequence assignments to the skeletons will be energetically favorable. However, there has not been any data to support this hypothesis and as to how much of the population in the mutated topologies is energetically acceptable. Our program was used to construct the allatom structure for each topology in the total topological space of (N! × 2 N )(M! × 2 M ). The topologies with collision of atoms after side chain relaxation and those with large variation in the per-pair contact energy were eliminated. The rest of the topologies were ranked based on the contact energy. The numbers of topologies with lower contact energy than that of the native topology is shown in column 7 (Table 1). For example, in the case of protein 1HDJ (row 4), there is only 1 topology (column 7) that has the contact energy lower than that of the native topology, out of total 384 (column 6) topologies (Table 1). It appears that the native topology is ranked within the top 30% among all the topologies. For most of the proteins, there is only less than 5% of the total population with lower energy than that of the native. This finding suggests that if the locations of the backbone atoms of the secondary structures are fixed, only a small portion of the huge topological space will result in energetically stable topologies for the secondary structures.
We tested the approach in stage 2 with four small α-proteins. The helix skeletons were first detected using Helix Tracer [13]. Our program was then used to construct the possible all-atom structures exhaustively. It uses the axis location of the helix skeleton and performs rotational sampling of the Cα coordinates every 30°. It also allows the inexact positioning of a sequence segment for a maximum of 1 amino acid shift. All the constructed structures of the secondary structures were sorted based on their contact energy. Table 2 shows an example (PDB Id = 1LRE) of the ranked structures for the skeletons. Since the total number of the possible structures is too large, only those with negative contact energy were stored and analyzed. One of the key questions in stage 2 is to see if the native assignment of the sequence segments is still included in the top percentage of the list when only the axes of the skeletons are known. Unlike in stage 1 where the true backbone location was used for the skeleton, stage 2 samples the backbone locations through rotation around the skeleton axis. Therefore, our rotation sampling may not hit the true structure exactly, although a structure close to the native can be reached. In fact, if one applies our approach for a protein with unknown structure, this is the situation encountered. Therefore, it is interesting to see even with 30° of sampling step size, how accurate the resulting structure is. Table 2 illustrates the trend of the constructed structures after they were ranked by the contact energy (column 5). The one with the smallest RMSD to native (column 6 row 5) was identified from the entire list. It appears that the structure with the smallest RMSD to native (4.781 Å), the one most similar to the true structure, was ranked the 17 th among all the possible constructed structures for the skeletons. This constructed structure indeed has the true topology with the topology Id = 123000. 123000 (column 3 row 5) means that there are three helices in the skeleton and they were assigned to the correct helix sequence segments with the correct direction. We used the first three digits of the topology label to indicate the permutation of the assignment and the last three digits to indicate the assignment direction for each of skeletons. As an example, a topology label of 213001 means the sequence assignment for the first and the second helix segments is swapped, and the third segment is assigned to the correct skeleton but with an opposite direction from the true direction. The structure that is the most similar to the true structure has a shift of 1 amino acid in the 1st and the 3 rd segment respectively (column 3 row 5 of Table 2).
The results of the four tested proteins are summarized in Table 3. For each of the proteins, a table similar to Table 2 was constructed (data not shown). Then the structure with the smallest RMSD to native from each of the four tables is included as a row in Table 3. For all the four proteins, the structure most similar to the native was ranked in the top portion of the huge list. For example, in the case of 1GXGA, such a structure was ranked the 5 th based on its contact energy comparing to other assignments. In this case, Helix Tracer only detected three of the four helices in the true structure. Therefore, only three helices (1, 2, and 4) were used in the calculation. Our direct estimation of the contact energy formed by the secondary structures appears to be quite effective for the four proteins. The structure that is the most similar to the native structure was ranked at the 17 th , 31 st , 16 th , and 5 th out of 32066, 391833, 98755 and 192935 constructed structures for the four tested proteins respectively (column 2 and 3 of Table  3). The RMSD to native of the best constructed structure is N AA : number of amino acids Pct N : the percentage of amino acids in the secondary structures (helices and strands); N α : the number of alpha helix; N b : gthe number of beta strands; N m : the gnumber of secondary structure mutations, the cross mutation between a helix and a strand is ignored; N eff : the gnumber of the mutated topologies with lower effective contact energy than that of the native; Pct eff : the percentage of the mutated topologies with lower effective contact energy than that of the native by multi-well function;  Rank: the rank of the structure by the contact energy (5 th column) Topology: the topology Id, the 1 st half of the digits: permutation of the assignment, the last half of the digits: directions (0 or 1) of the assignment for each helix; Shift: the amino acid position shift from the true helix segment for each assignment, "-" left, "+" right; Rot: rotation angle around the skeleton axis for each helix, in radian; CE: Effective contact energy of the constructed helices; RMSD: the root mean square deviation of the Cα atoms between the constructed candidate structure and the native structure, in Å. between 4 Å and 5 Å for all the four proteins. The small RMSD to native value suggests that the best constructed structure is close to the native structure for all the four proteins.
To illustrate the nature of the top ranked structures, we show a distribution of the topologies, the contact energy, and the RMSD to native for the top 100 constructed structures in Figure 4. The structures of the skeleton of 1LRE were ranked by the contact energy of the secondary structures (top penal of Figure 4). The two most popular topologies in the top 100 structures are 123100 and 123000, although a few 123110 (i.e. rank between 70 ~80) and 123001 (i.e. rank between 60~70) exist (lower panel of Figure 4). Here, the topology Id of 123000 means the three helix segments of sequence were correctly assigned to the corresponding skeletons with the correct directions. A topology ID of 123100 means the three helix segments of the sequence were correctly assigned to the skeletons. However, the direction of H 1 sequence segment was assigned to the opposite direction from the true direction. Note that a wrong topology (i.e. 123100) can be energetically favourable, since we observed that the structure with the lowest contact energy has a wrong topology. However, our result shows that the great majority of the wrong topologies correspond to energetically non-favorable structures. The near native structure of the skeletons can be found near the top of the list, the 17 th that are marked in red (middle and lower panel of Figure 4).

Discussion
We introduced a method to construct the possible atomic structures for skeletons of the density map. The method is built on the direct estimation of the contact energy formed by the skeletons. The evaluation of the constructed structures is independent from the knowledge of the conformation of the entire protein chain. This approach can be potentially useful when the size of the protein is large where the ab initio prediction of the entire chain is often hard. Our approach suggests that it is possible to derive a set of possible topologies of the secondary structures without constructing the conformation of the entire chain. Although our approach is expected to encounter the huge computation expenses when large proteins are the targets, it is interesting to see if the approach can be optimized to tackle the computation problem.
We performed a small test in stage 2 using four α-proteins.
Our current method samples the possible structures for the skeletons using a fixed step size. The advantage of this sampling method is that it provides details about the solution space. The disadvantage obviously is its computational expense. This approach was able to rank the most near native structure at the 17 th of 32066, the 16 th of 98755, the 5 th of 192935 and the 31 st of 391833 structures for 1LRE, 1DP3A, 1GXGA, and 1JW2A respectively ( Table  3). Note that the base number here only includes the structures with negative contact energy, and not the entire topological space. A test including more proteins, particularly the non-alpha proteins, will provide more concrete results. It is also possible to use simulated annealing in stead of the exhaustive sampling of the topological space.
We have developed a method to eliminate the secondary structure topologies through an direct estimation of the contact energy of the secondary structures. It will be interesting to see how stable this screen method is compared to a geometrical screening method introduced by Wu et. al [20]. It is possible that the combination of the two approaches can be more effective in reducing the topological space for the skeletons. Assignments: the total number of assignments with the negative contact energy Rank: the rank of the structure with the smallest RMSD to native; Topology: the topology Id, the 1 st half of the digits: permutation of the assignment, the last half of the digits: directions (0 or 1) of the assignment for each helix; Shift: the amino acid position shift from the true helix segment for each assignment, "-" left, "+" right; Rot: rotation angle around the skeleton axis for each helix, in radian; CE: Effective contact energy of the constructed helices; RMSD: the root mean square deviation of the Cα atoms between the constructed candidate structure and the native structure, in Å.

Conclusion
This paper explores the question if it is possible to derive a small set of atomic structures for the skeletons through the sampling of the entire topological space of the secondary structures. We have developed an approach to enumerate the possible atomic structures for the skeletons and to rank the structures using a direct estimation of the contact energy formed by the secondary structures. We demonstrate for the first time that it is possible to use the secondary structure contact energy to eliminate the great majority of the wrong structures for the skeletons. A test of 30 proteins suggests that when the backbone atoms are given for the skeletons, there are only a small number of topologies with more comfortable structures than that of the native. Even when the backbone atoms are not known for the skeletons, it is still possible to construct a small set of structures that contains a near native structure for the skeletons. Our results suggest that without constructing the structure of the entire chain, it is still possible to derive a small set of the atomic structures for the skeletons, out of the huge solution space, using our recent development of the multi-well energy function.

Competing interests
The authors declare that they have no competing interests.
Top 100 structures for the skeletons in 1LRE (PDB Id) Figure 4 Top 100 structures for the skeletons in 1LRE(PDB Id). The constructed atomic structures for the helix skeletons were ranked by the contact energy (top panel). A topology Id (bottom panel) includes six digits. The first three digits represent the permutation of the assignment, and the last three represent the relative direction (0, or 1) between the sequence segment and the skeleton for each of the skeletons. The RMSD (middle panel) to the native structure is shown for each of the 100 constructed structures for the skeletons. The constructed structure with the smallest RMSD to native (the 17 th of the 100) is marked in red for its topology and the RMSD.