- Open Access
Reduction of the secondary structure topological space through direct estimation of the contact energy formed by the secondary structures
BMC Bioinformatics volume 10, Article number: S40 (2009)
Electron cryomicroscopy is a fast developing technique aiming at the determination of the 3-dimensional structures of large protein complexes. Using this technique, protein density maps can be generated with 6 to 10 Å resolution. At such resolutions, the secondary structure elements such as helices and β-strands appear to be skeletons and can be computationally detected. However, it is not known which segment of the protein sequence corresponds to which of the skeletons. The topology in this paper refers to the linear order and the directionality of the secondary structures. For a protein with N helices and M strands, there are (N!2N)(M!2M) different topologies, each of which maps N helix segments and M strand segments on the protein sequence to N helix and M strand skeletons. Since the backbone position is not available in the skeleton, each topology of the skeletons corresponds to additional freedom to position the atoms in the skeletons.
We have developed a method to construct the possible atomic structures for the helix skeletons by sampling the solution space of all the possible topologies of the skeletons. Our method also ranks the possible structures based on the contact energy formed by the secondary structures, rather than the entire chain. If we assume that the backbone atomic positions are known for the skeletons, then the native topology of the secondary structures can be found in the top 30% of the ranked list of all possible topologies for all the 30 proteins tested, and within the top 5% for most of the 30 proteins. Without assuming the backbone location of the skeletons, the possible atomic structures of the skeletons can be constructed using the axis of the skeleton and the sequence segments. The best constructed structure for the skeletons has RMSD to native between 4 and 5 Å for the four tested α-proteins. These best constructed structures were ranked the 17th, 31st, 16th and 5th respectively for the four proteins out of 32066, 391833, 98755 and 192935 possible assignments in the pool.
Our work suggested that the direct estimation of the contact energy formed by the secondary structures is quite effective in reducing the topological space to a small subset that includes a near native structure for the skeletons.
Electron cryomicroscopy (Cryo-EM) is a fast developing experimental technique aiming at the determination of the 3-dimensional structure of large protein complexes [1–11]. Although not yet at the atomic resolution, this technique can be used to generate protein density maps at 6 to 10 Å resolution [1–11]. Computational methods have been developed to detect the secondary structures such as helices and β-strands from the density map at this resolution [12–16]. Since the secondary structures are often the major building blocks of a structure, the detected secondary structures appear as the skeletons (marked as sticks in Figure 1) of the density map. In the density map of 6–10 Å resolution, alpha helices look like sausages with 5–6 Å in diameter, and beta-sheets appear as extended volumes with varying sizes. The connection relationship between the skeletons is often unavailable in the density maps.
Given a protein sequence, the location of secondary structures on the sequence can be roughly predicted using existing secondary structure prediction methods. Such methods can generally predict the secondary structures to about 70–80% accuracy [17–19]. Although not completely correct, the predicted secondary structure segments of the sequence provide a rough location of the secondary structures. Wu et al has used the secondary structure geometrical relationships existing in the PDB to reduce the topological space. The reduced possible topologies were then used to add the loops and a simulation of the entire chain was used to identify the true topology . We have also shown previously that the decoys generated by Rosetta ab initio structure prediction software can be used to reduce the topological space [21, 22]. In this paper, we will introduce a method that directly measures the contact energy of the secondary structures to reduce the topological space of the skeletons. This is the first such demonstration. The secondary structure topology problem of the CryoEM density map is to find the true mapping between the predicted secondary structure segments on the sequence and the skeletons of the density map. Secondary structure topology in this paper refers to the order and the direction of the secondary structures such as helices and strands with respect to the protein sequence. For a protein with N helices and M strands, there are (N!2N)(M!2M) different topologies. For example, there are N! different permutations for assigning N helix segments to N helix skeletons and two directions to assign a sequence segment to each skeleton.
This paper investigates the following question. Given the skeletons detected in the protein density map, is it possible to derive a small subset of the topological space that contains the native topology of the secondary structures using a direct measurement of the energy formed by the secondary structures? The determination of favourable secondary structure topologies has always been one of the most essential problems in tertiary structure prediction. Although predicting the topology, in general, is as hard as predicting the tertiary structure, it is not clear if it is true when the rough location of the secondary structures is known.
Systematic mutation analysis has shown recently that the native protein sequence for a specific 3-dimensional structure is close to optimal . This is intuitively not surprising since nature might have derived a near optimal sequence to achieve the function through evolution. The research in this paper is guided by the intuition that when the relative geometry of the secondary structures is fixed in the 3-dimensional space, there might only be a small number of topologies that can provide energetically favorable structures. In this paper, we describe a method to perform mutation analysis for the secondary structures. It generates possible atomic structures for the skeletons through the exhaustive search in the topology space of the secondary structures.
The overall framework to generate the possible secondary structure topologies contains two stages. The first stage is part of the second stage, but it demonstrates the potential of the overall approach. The question for the first stage is how to generate all the topologies if the positions of the Cα atoms in the secondary structures are given. Note that the Cα atoms are not resolved in the low resolution density map, although the location of the secondary structures can be computationally detected [12–16]. The second stage starts with the axial location of the skeletons, rather than the backbone atom locations. It generates the coordinates of the Cα atoms through translation and rotation around the axis. The second stage in this paper only implemented the helices, although the first stage implemented both the helices and β-strands. Figure 2 shows a flowchart of the topology generation. We have done a test of stage 1 using 30 proteins. Our preliminary result of stage 2 involves 4 proteins.
Stage 1 – Secondary structure mutation and side chain optimization
For each of the tested proteins, a total of 2N2M(N!M!)of all the possible topologies were generated, where N is the number of helices and M is the number of beta strands. A mutated topology of the secondary structures was generated using the coordinate of the backbone atoms of the secondary structures and the sequence segments that form the secondary structures. Two major steps were involved: (1) mapping sequence segments to the skeletons and (2) construction of the side chain atoms. Figure 3 shows an example of a native topology (R1 - R2 - R3) and a mutated topology (R3 - R2 - R1) of the secondary structures.
In the native topology, R1, R2 and R3 skeletons were assigned to the sequence segments H1, H2, and H3 respectively. In the mutated topology, R1, R2 and R3 were assigned to H3, H2, and H1 respectively. When the length of the sequence segment (i.e. H1 in Figure 3) does not match that of the skeleton (i.e. R3 in Figure 3), we used the length of the skeleton as the reference. The sequence segment was either truncated (crossed boxes in Figure 3) or padded (amino acids in red in Figure 3) based on the distance from the center amino acid of the segment. For example, when R3 is mutated from H3 to H1, three amino acids were deleted from each end of H1 (Figure 3). When R1 is mutated from H1 to H3, H3 is padded with six amino acids to match the length of R1. In this case, since H3 is only two amino acids away from the end of the sequence, it is padded more at the other end.
Multi-well energy function
We have developed a multi-well energy function by incorporating the statistical data to a Lennard Jones shaped function . The multi-well function is composed of the original Lennard Jones shaped function (U0 in formula (1))  and the statistically derived function (U1 in (1))using two window functions f1 and f2.
The e ij is the contact energy between two amino acids [27, 28]. The p, q and r0 are adjustable parameters. r mn is the distance between the side-chain centers of residue m and n. The signs are chosen to be positive if e ij > 0, or negative if e ij < 0. There is a potential well near the favorable side-chain distance r0 in U0. , N peak and c are statistically derived parameters based on an analysis of 413 selected proteins from the PDB. Here is a non-dimensional coefficient varying with distance r mn and represents the depth of the potential well. r k is the position of the k-th residue distance peak in the distribution curve of the residue pair distance. Here the window functions f0, f1 satisfy:
r1, are the first and the last peak positions respectively in the distance distribution. δ is a small constant to allow the smooth connection between U1 and U0 that are combined using a set of window functions f1 and f0. By statistical analysis of 413 proteins in PDB, a new multi-well contact energy function was developed for each of the 210 residue pair interaction. The contact energy measured is the mean of the energy from all the contact pairs in the secondary structures. Both the intra and inter secondary structure energy were considered. The contact energy from a pair of residues within a helix or a strand is the intra energy, and the contact energy from a pair of residues in different helices or stands is the inter energy. The average of inter and intra energy, named as effective contact energy Ueff, was used as a measurement of the stability of the secondary structures. We are preparing another paper to with more details about the construction of the energy function.
Stage 2 – Sequence assignment to the skeletons
In stage 1, we assumed that the number of detected skeletons and the number of the sequence segments of the secondary structures are the same. In a general case, suppose that there are Kα sequence segments forming helices and Kβ sequence segments forming β-strands in a native protein structure. Let's also suppose that there are N helix skeletons and M strand skeletons detected from the protein density map. Based on our experience, Kα and Kβ are usually larger numbers than N and M respectively. The total number of different sequence combination is:
For each set of the sequence segments picked from Kα and Kβ, the total number of possible sequence assignment is (N! × 2N)(M! × 2M), as was dealt with in the first stage. When not all the secondary structures are detected in the density map, the total number of topologies is:
When the sequence segments are predicted using a secondary structure prediction tool, such as PSIPRED , the predicted segments are often shifted from their corresponding true locations. In order to estimate the magnitude of the computation, let's suppose that each sequence segment can have a shift in both directions with a maximal number of p amino acids. Then there are 2p+1 possible sequence segments for each true segment. The total number of assignments for the skeletons is
Note that the above estimation of the total assignments assumes each segment having the same maximum number of shifts. In reality, different segments might have different maximum number of shifts depending on the nature of the particular sequence. In the current implementation of stage 2, we allowed the maximum shift to be 1 (p = 1) and all the tested proteins are α-helical proteins. The total possible assignment number n total is still a huge number, even when M, N, p are relatively small.
In order to ignore the obvious wrong assignment between a sequence segment and the skeleton where their lengths differ significantly, we used a threshold d = 6 to require them to have similar lengths.
|l skl - l seq | ≤ d
Here l skl , l seq is the length of skeleton and sequence segment.
Stage 2 – Construction of the atomic structures for the skeletons
Once a set of sequence segments are assigned to a set of helix skeletons, the possible atomic structures can be constructed for the skeletons using the information of the segments and the spatial position of the skeletons. However, since the skeleton does not provide the exact location of the backbone atoms at low resolution, this step generates the possible atomic coordinates of the atoms. By CSAW method , the protein structure backbone and the side chains were created and moved to the corresponding spatial positions of the helix skeletons. Each constructed helix structure was then rotated around the central axis of the skeleton as part of the structure adjustment. A rigid shift along the axis was also built in the method, although we did not use it in this paper due to the large solution space. The rotation step size was 30° in this paper. For each candidate structure, side chain Rotamer library was used to relax the side chain overlap and Simulated Annealing method was used to minimize contact energy. Protein structures passed the overlap screening were ranked according to their contact energy formed by the helices.
The dataset used to test the first stage involves 30 proteins from the Protein Data Bank (PDB). Some of the proteins are listed in the paper of Nanias et. al . Others were randomly selected from the proteins that satisfy the following criteria: (1) has single domain (2) has 1.5 Å or better resolution (3) share less than 30% sequence similarity (4) has less than 7 secondary structures, due to the amount of computation. The key question in the first stage was to see if the protein is comfortable when the segments of the sequence are assigned to the skeleton in all possible topologies. Intuitively, when the positions of the backbone atoms are fixed, only a small number of the all possible sequence assignments to the skeletons will be energetically favorable. However, there has not been any data to support this hypothesis and as to how much of the population in the mutated topologies is energetically acceptable. Our program was used to construct the all-atom structure for each topology in the total topological space of (N! × 2N)(M! × 2M). The topologies with collision of atoms after side chain relaxation and those with large variation in the per-pair contact energy were eliminated. The rest of the topologies were ranked based on the contact energy. The numbers of topologies with lower contact energy than that of the native topology is shown in column 7 (Table 1). For example, in the case of protein 1HDJ (row 4), there is only 1 topology (column 7) that has the contact energy lower than that of the native topology, out of total 384 (column 6) topologies (Table 1). It appears that the native topology is ranked within the top 30% among all the topologies. For most of the proteins, there is only less than 5% of the total population with lower energy than that of the native. This finding suggests that if the locations of the backbone atoms of the secondary structures are fixed, only a small portion of the huge topological space will result in energetically stable topologies for the secondary structures.
We tested the approach in stage 2 with four small α-proteins. The helix skeletons were first detected using Helix Tracer . Our program was then used to construct the possible all-atom structures exhaustively. It uses the axis location of the helix skeleton and performs rotational sampling of the Cα coordinates every 30°. It also allows the inexact positioning of a sequence segment for a maximum of 1 amino acid shift. All the constructed structures of the secondary structures were sorted based on their contact energy. Table 2 shows an example (PDB Id = 1LRE) of the ranked structures for the skeletons. Since the total number of the possible structures is too large, only those with negative contact energy were stored and analyzed. One of the key questions in stage 2 is to see if the native assignment of the sequence segments is still included in the top percentage of the list when only the axes of the skeletons are known. Unlike in stage 1 where the true backbone location was used for the skeleton, stage 2 samples the backbone locations through rotation around the skeleton axis. Therefore, our rotation sampling may not hit the true structure exactly, although a structure close to the native can be reached. In fact, if one applies our approach for a protein with unknown structure, this is the situation encountered. Therefore, it is interesting to see even with 30° of sampling step size, how accurate the resulting structure is. Table 2 illustrates the trend of the constructed structures after they were ranked by the contact energy (column 5). The one with the smallest RMSD to native (column 6 row 5) was identified from the entire list. It appears that the structure with the smallest RMSD to native (4.781 Å), the one most similar to the true structure, was ranked the 17th among all the possible constructed structures for the skeletons. This constructed structure indeed has the true topology with the topology Id = 123000. 123000 (column 3 row 5) means that there are three helices in the skeleton and they were assigned to the correct helix sequence segments with the correct direction. We used the first three digits of the topology label to indicate the permutation of the assignment and the last three digits to indicate the assignment direction for each of skeletons. As an example, a topology label of 213001 means the sequence assignment for the first and the second helix segments is swapped, and the third segment is assigned to the correct skeleton but with an opposite direction from the true direction. The structure that is the most similar to the true structure has a shift of 1 amino acid in the 1st and the 3rd segment respectively (column 3 row 5 of Table 2).
The results of the four tested proteins are summarized in Table 3. For each of the proteins, a table similar to Table 2 was constructed (data not shown). Then the structure with the smallest RMSD to native from each of the four tables is included as a row in Table 3. For all the four proteins, the structure most similar to the native was ranked in the top portion of the huge list. For example, in the case of 1GXGA, such a structure was ranked the 5th based on its contact energy comparing to other assignments. In this case, Helix Tracer only detected three of the four helices in the true structure. Therefore, only three helices (1, 2, and 4) were used in the calculation. Our direct estimation of the contact energy formed by the secondary structures appears to be quite effective for the four proteins. The structure that is the most similar to the native structure was ranked at the 17th, 31st, 16th, and 5th out of 32066, 391833, 98755 and 192935 constructed structures for the four tested proteins respectively (column 2 and 3 of Table 3). The RMSD to native of the best constructed structure is between 4 Å and 5 Å for all the four proteins. The small RMSD to native value suggests that the best constructed structure is close to the native structure for all the four proteins.
To illustrate the nature of the top ranked structures, we show a distribution of the topologies, the contact energy, and the RMSD to native for the top 100 constructed structures in Figure 4. The structures of the skeleton of 1LRE were ranked by the contact energy of the secondary structures (top penal of Figure 4). The two most popular topologies in the top 100 structures are 123100 and 123000, although a few 123110 (i.e. rank between 70 ~80) and 123001 (i.e. rank between 60~70) exist (lower panel of Figure 4). Here, the topology Id of 123000 means the three helix segments of sequence were correctly assigned to the corresponding skeletons with the correct directions. A topology ID of 123100 means the three helix segments of the sequence were correctly assigned to the skeletons. However, the direction of H1 sequence segment was assigned to the opposite direction from the true direction. Note that a wrong topology (i.e. 123100) can be energetically favourable, since we observed that the structure with the lowest contact energy has a wrong topology. However, our result shows that the great majority of the wrong topologies correspond to energetically non-favorable structures. The near native structure of the skeletons can be found near the top of the list, the 17th that are marked in red (middle and lower panel of Figure 4).
We introduced a method to construct the possible atomic structures for skeletons of the density map. The method is built on the direct estimation of the contact energy formed by the skeletons. The evaluation of the constructed structures is independent from the knowledge of the conformation of the entire protein chain. This approach can be potentially useful when the size of the protein is large where the ab initio prediction of the entire chain is often hard. Our approach suggests that it is possible to derive a set of possible topologies of the secondary structures without constructing the conformation of the entire chain. Although our approach is expected to encounter the huge computation expenses when large proteins are the targets, it is interesting to see if the approach can be optimized to tackle the computation problem.
We performed a small test in stage 2 using four α-proteins. Our current method samples the possible structures for the skeletons using a fixed step size. The advantage of this sampling method is that it provides details about the solution space. The disadvantage obviously is its computational expense. This approach was able to rank the most near native structure at the 17th of 32066, the 16th of 98755, the 5th of 192935 and the 31st of 391833 structures for 1LRE, 1DP3A, 1GXGA, and 1JW2A respectively (Table 3). Note that the base number here only includes the structures with negative contact energy, and not the entire topological space. A test including more proteins, particularly the non-alpha proteins, will provide more concrete results. It is also possible to use simulated annealing in stead of the exhaustive sampling of the topological space.
We have developed a method to eliminate the secondary structure topologies through an direct estimation of the contact energy of the secondary structures. It will be interesting to see how stable this screen method is compared to a geometrical screening method introduced by Wu et. al . It is possible that the combination of the two approaches can be more effective in reducing the topological space for the skeletons.
This paper explores the question if it is possible to derive a small set of atomic structures for the skeletons through the sampling of the entire topological space of the secondary structures. We have developed an approach to enumerate the possible atomic structures for the skeletons and to rank the structures using a direct estimation of the contact energy formed by the secondary structures. We demonstrate for the first time that it is possible to use the secondary structure contact energy to eliminate the great majority of the wrong structures for the skeletons. A test of 30 proteins suggests that when the backbone atoms are given for the skeletons, there are only a small number of topologies with more comfortable structures than that of the native. Even when the backbone atoms are not known for the skeletons, it is still possible to construct a small set of structures that contains a near native structure for the skeletons. Our results suggest that without constructing the structure of the entire chain, it is still possible to derive a small set of the atomic structures for the skeletons, out of the huge solution space, using our recent development of the multi-well energy function.
Chiu W: Electron microscopy of frozen, hydrated biological specimens. Annu Rev Biophys Biophys Chem. 1986, 15: 237-57.
Chiu W: What does electron cryomicroscopy provide that X-ray crystallography and NMR spectroscopy cannot?. Annu Rev Biophys Biomol Struct. 1993, 22: 233-55.
Chiu W, Schmid MF, Prasad BV: Teaching electron diffraction and imaging of macromolecules. Biophys J. 1993, 64: 1610-25.
Bottcher B, Wynne SA, Crowther RA: Determination of the fold of the core protein of hepatitis B virus by electron cryomicroscopy. Nature. 1997, 386: 88-91.
Chiu W, Schmid MF: Pushing back the limits of electron cryomicroscopy. Nat Struct Biol. 1997, 4: 331-3.
Conway JF, Cheng N, Zlotnick A, Wingfield PT, Stahl SJ, Steven AC: Visualization of a 4-helix bundle in the hepatitis B virus capsid by cryo-electron microscopy. Nature. 1997, 386: 91-4.
Mancini EJ, Clarke M, Gowen BE, Rutten T, Fuller SD: Cryo-electron microscopy reveals the functional organization of an enveloped virus, Semliki Forest virus. Mol Cell. 2000, 5: 255-66.
Zhou ZH, Dougherty M, Jakana J, He J, Rixon FJ, Chiu W: Seeing the herpesvirus capsid at 8.5 Å. Science. 2000, 288: 877-880.
Zhou ZH, Baker ML, Jiang W, Dougherty M, Jakana J, Dong G, Lu G, Chiu W: Electron cryomicroscopy and bioinformatics suggest protein fold models for rice dwarf virus. Nat Struct Biol. 2001, 8: 868-73.
Chiu W, Baker ML, Jiang W, Zhou ZH: Deriving folds of macromolecular complexes through electron cryomicroscopy and bioinformatics approaches. Curr Opin Struct Biol. 2002, 12: 263-9.
Medalia O, Weber I, Frangakis AS, Nicastro D, Gerisch G, Baumeister W: Macromolecular architecture in eukaryotic cells visualized by cryoelectron tomography. Science. 2002, 298: 1209-13.
Jiang W, Baker ML, Ludtke SJ, Chiu W: Bridging the information gap: computational tools for intermediate resolution structure interpretation. J Mol Biol. 2001, 308: 1033-44.
Del Palu A, He J, Pontelli E, Lu Y: Identification of Alpha-Helices from Low Resolution Protein Density Maps. Proceeding of Computational Systems Bioinformatics Conference(CSB). 2006, 89-98.
Kong Y, Ma J: A structural-informatics approach for mining beta-sheets: locating sheets in intermediate-resolution density maps. J Mol Biol. 2003, 332: 399-413.
Kong Y, Zhang X, Baker TS, Ma J: A Structural-informatics approach for tracing beta-sheets: building pseudo-C(alpha) traces for beta-strands in intermediate-resolution density maps. J Mol Biol. 2004, 339: 117-30.
Baker ML, Ju T, Chiu W: Identification of secondary structure elements in intermediate-resolution density maps. Structure. 2007, 15: 7-19.
Birzele F, Kramer S: A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics. 2006, 22: 2628-34.
Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2002, 47: 228-35.
McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics. 2000, 16: 404-5.
Wu Y, Chen M, Lu M, Wang Q, Ma J: Determining protein topology from skeletons of secondary structures. J Mol Biol. 2005, 350: 571-86.
Y Lu, J He, CEM Strauss: Deriving Topology and Sequence Alignment for the Helix Skeleton in Low Resolution Protein Density Maps. Proceeding of Asia-Pacific Bioinformatics Conference. 2007, 143-52.
Lu Y, He J, Strauss CE: Deriving topology and sequence alignment for the helix skeleton in low-resolution protein density maps. J Bioinform Comput Biol. 2008, 6: 183-201.
Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA. 2000, 97: 10383-8.
Cerny V: A thermodynamical approach to the travelling salesman problem: an efficient simulation algorithm. Journal of Optimization Theory and Applications. 1985, 45: 41-51.
Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by Simulated Annealing. Science. 1983, 220: 671-680.
Nanias M, Chinchio M, Pillardy J, Ripoll DR, Scheraga HA: Packing helices in proteins by global optimization of a potential energy function. Proc Natl Acad Sci USA. 2003, 100: 1706-10.
Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules. 1985, 18: 534-552.
Miyazawa S, Jernigan RL: Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol. 1996, 256: 623-44.
Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202.
Sun W: Protein folding simulation by all-atom CSAW method. IEEE International Conference on Bioinformatics and Biomedicine. 2007, 2: 45-52.
The work in this paper was sponsored by NSF HRD-0420407, Army High Performance Computing Center, and the Active Researcher Supporting Foundation of Tsinghua University.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
The authors declare that they have no competing interests.
WS and JH developed the methodology. WS developed the code and performed the tests. JH provided overall guidance. Both are involved in developing the manuscript.
About this article
Cite this article
Sun, W., He, J. Reduction of the secondary structure topological space through direct estimation of the contact energy formed by the secondary structures. BMC Bioinformatics 10, S40 (2009). https://doi.org/10.1186/1471-2105-10-S1-S40
- Secondary Structure
- Topological Space
- Protein Data Bank
- Backbone Atom
- Sequence Segment