Hinge Atlas: relating protein sequence to sites of structural flexibility
© Flores et al. 2007
Received: 23 December 2006
Accepted: 22 May 2007
Published: 22 May 2007
Relating features of protein sequences to structural hinges is important for identifying domain boundaries, understanding structure-function relationships, and designing flexibility into proteins. Efforts in this field have been hampered by the lack of a proper dataset for studying characteristics of hinges.
Using the Molecular Motions Database we have created a Hinge Atlas of manually annotated hinges and a statistical formalism for calculating the enrichment of various types of residues in these hinges.
We found various correlations between hinges and sequence features. Some of these are expected; for instance, we found that hinges tend to occur on the surface and in coils and turns and to be enriched with small and hydrophilic residues. Others are less obvious and intuitive. In particular, we found that hinges tend to coincide with active sites, but unlike the latter they are not at all conserved in evolution. We evaluate the potential for hinge prediction based on sequence.
Motions play an important role in catalysis and protein-ligand interactions. Hinge bending motions comprise the largest class of known motions. Therefore it is important to relate the hinge location to sequence features such as residue type, physicochemical class, secondary structure, solvent exposure, evolutionary conservation, and proximity to active sites. To do this, we first generated the Hinge Atlas, a set of protein motions with the hinge locations manually annotated, and then studied the coincidence of these features with the hinge location. We found that all of the features have bearing on the hinge location. Most interestingly, we found that hinges tend to occur at or near active sites and yet unlike the latter are not conserved. Less surprisingly, we found that hinge residues tend to be small, not hydrophobic or aliphatic, and occur in turns and random coils on the surface. A functional sequence based hinge predictor was made which uses some of the data generated in this study. The Hinge Atlas is made available to the community for further flexibility studies.
Motions play an essential role in catalysis and protein-ligand interactions. In particular, hinge bending motions account for 45% of motions in a representative set from the Database of Macromolecular Motions comprising domain hinge motions (31% of the total) and fragment hinge motions (14%) [2–4]. Thus understanding fundamental aspects of hinge bending mechanisms may lead to an improved understanding of the relationship between structure and function.
There are three levels of hinge prediction. The easiest case occurs when the atomic coordinates are available for two or more conformations of a given protein. In this case it is possible to visually inspect the motion to determine the hinge location, as we have done here. The process can also be automated with various available packages, including FlexProt[5, 6], Hingefind and DynDom. A much more difficult problem is that of predicting hinges when only one set of structural atomic coordinates is available. Several algorithms have been developed for this purpose [9–15]. The hardest case occurs when the sequence is known but no atomic coordinates are available at all.
The problem of finding flexible hinges between rigid regions based on sequence is in some ways similar to the problem of finding domain boundaries, which can be flexible or inflexible. Although little work has been done on the former problem, several algorithms exist to address the latter. In one significant contribution, Nagarajan and Yona analyzed multiple sequence alignments and were able to identify domains with some accuracy. Marsden et al. focused on the case of proteins with no significant sequence homology to well-characterized proteins and found that predicted secondary structure contained information about domain boundaries. Jones et al. combined PUU, DETECTIVE, and DOMAK to make a consensus-based domain boundary predictor. Heger et al. created the Automatic Domain Decomposition Algorithm (ADDA) and associated online database. Murzin et al. created the SCOP (Structural Classification of Proteins) database. For purposes of the above algorithms and classifications, however, domains are defined as proteins or regions of proteins having a common evolutionary origin[22, 24]; flexibility is not a consideration. Indeed most small and medium-sized proteins, such as those prevalent in the Hinge Atlas, consist of a single domain. Therefore the problem of finding flexible hinges is not solved by finding domain boundaries as defined for these methods. Schlessinger et al. developed a method to predict B-factors from sequence, but it is not clear that B-factors obtained in this way would yield accurate flexibility predictions[26, 27]. In light of the limitations of existing methods, the prediction of domain hinges from sequence is considered an open problem.
In this article we focus on the characterization of these hinges based on sequence. To that end, we compiled the Hinge Atlas, a manually annotated dataset of hinge bending motions, as well as a separate computer annotated dataset, both available for further studies. The Hinge Atlas has several applications. First, the statistical properties of hinges can be studied (composition, sequence correlations, coincidence with active sites, etc). Second, it can be used to benchmark hinge prediction programs. Third, by homology hinge annotations could potentially be transferred to proteins where the existence and location of a hinge are unknown. Fourth, the annotations could conceivably be used in future protein motion prediction programs. The first application was of most interest to us in the current work.
1. Are certain residue types differentially represented in hinges?
2. Do certain pairs of amino acids coincide with hinges?
3. Can sequence be used to predict hinges?
4. Do hinges coincide with active sites?
5. Do hinges prefer certain secondary structural elements?
6. Do hinge residues share physicochemical or steric properties?
7. Are hinge residues conserved in evolution?
As our first task, we computed the rate of occurrence of each residue type in the Hinge Atlas. Certain amino acids were found to be differentially represented in hinges in a statistically significant fashion. We also investigated whether certain consecutive pairs of residues were differentially represented in hinges. In the course of the above, we observed that one of the overrepresented residues (serine) is potentially catalytic; this was the original motivation for question 4 above. To answer that question, we searched the Catalytic Sites Atlas (CSA) for close homologs to the proteins in our dataset, and extracted the active site residue numbers from those proteins for comparison to the Hinge Atlas annotation.
Our next task was to investigate hinge coincidence with secondary structure. Hinges are generally believed to occur in disordered regions, but this belief has never been tested or quantified rigorously to our knowledge.
Following up on our finding that hinges coincide with active site residues, we went on to the question, are hinge residues more likely to be conserved than other residues, as active sites are? We ranked the residues by relative conservation and examined the differences between hinge and non-hinge residues.
Significant correlations between sequence features and hinges were found in the above analyses. We computed Hinge Indices for each of these which may be used to relate sequence features to flexibility. We then sought to determine what predictive value sequence might have on its own and whether various sequence features collectively could be used for prediction.
We first made a simple GOR (Garnier-Osguthorpe-Robson) [30, 31]-like predictor. We computed the log-odds rate of occurrence for residues located at the -8 to +8 positions along the sequence in the training set. We used this table to make predictions on the test set and examined their predictive power.
As a second approach, we made a composite Hinge Index, which we call HingeSeq, from the Hinge Indices of each of the sequence features found to be the strongest indicators of flexibility. The statistical significance of this measure was computed much as for the individual sequence features. To show that the measure is predictive, we again divided the Hinge Atlas into training and test sets and recomputed the relevant Hinge Indices to include only training set data. We used the regenerated HingeSeq to predict hinges in the test set and generated a Receiver Operating Characteristic (ROC) curve.
As a final step, we examined MolMovDB as a whole to determine whether any particular database bias was in evidence. We also used resampling to check for sampling artifacts in the Hinge Atlas. Lastly, we compared the Hinge Atlas to our computer annotated dataset. The resulting work provides insight into the composition, physicochemical properties, geometry, and evolution of hinge regions in proteins.
Prior to generating the manually annotated Hinge Atlas, we used computational methods to generate a dataset of hinge residues for our statistical studies. We began by running FlexProt, a leading hinge identification tool, on all morphs (pairs of homologous protein structures) in the Database of Macromolecular Motions[1, 2, 4, 9, 28, 33, 34]. FlexProt works by matching and structurally aligning fragments in one structure with corresponding fragments in the other. The goal is to find fragment pairs which (1) have minimal RMSD and (2) are maximal in size. The hinges are then reported as the boundaries separating those fragments. Goal (2) is equivalent to minimizing the number of these hinges. Since domains are never completely rigid, RMSD tends to grow with fragment size and therefore goal (1) is in conflict with goal (2). This conflict is dealt with by providing the user with a series of adjustable parameters, and further by reporting not one but several alternative hinge locations from which the user can choose. We used a combination of computer and manual culling to select those morphs for which the identified hinges met the following criteria:
Motion was domain wise, i.e. two or more domains could be observed moving approximately as rigid bodies with respect to each other.
The identified hinge was located in the flexible region connecting two rigid domains, rather than in the domains themselves.
The morph trajectory was sterically reasonable, i.e. chains were not broken in the attempt to interpolate motion.
We found that FlexProt's Maximal[35, 36] RMSD (Root Mean Square Deviation) parameter had a strong effect on the results. Therefore when FlexProt gave visibly incorrect results for a given morph, we reran the program, systematically varying this parameter. If one of these runs gave sufficiently accurate results, the annotation for that morph was entered into the database. We discarded immediately those morphs that did not exhibit clear hinge bending motion. Lastly, we removed redundant morphs using nrdb90.
Note that the definition of a hinge given in the introduction allows for a hinge of zero length. FlexProt indeed often returned such hinges. To deal with this, in all cases one residue on each side of the hinge was also taken to belong to the hinge. Thus most hinges are two residues long. At the end of this process, the computer annotated set contained 273 morphs.
As described, the computer annotation of hinges requires significant human intervention and the results were often debatable. Many of the hinge annotations differed slightly but visibly from the boundary between rigid domains, such that the backbone flexions that could account for the domain motion were not seen in the predicted hinge region. In other cases hinges were missed, and some annotations appeared where no hinge existed. The more flagrantly misannotated hinges were removed from the dataset, but making the manual culling too stringent would simply have resulted in a dataset too small to be statistically meaningful. For these reasons, the computer annotated dataset was not used in most of this work. Nonetheless, the computer annotated dataset is arguably more objective than the manually annotated set described below, and so is made available to the community.
To address the accuracy issues, we decided to generate a manually annotated set of hinges – the Hinge Atlas. To generate this set we first created the Hinge Annotation Tool which can also be used by the public as we will now explain.
The creation of publicly accessible tools for manual annotation of hinges involved significant changes to the morph page. The morph page is the primary point on MolMovDB for analyzing single morphs. It is accessible from the "movies" page or through our search tool, both linked to or visible on our front page. Our server also provides a link to this page in an email sent to the submitter of each morph request. We added all of the new tools to the "Hinge Analysis" tab on this page. The first of these is the Hinge Annotation tool. Each of three rows of "arrow" buttons on this tool moves a highlighted window of two residues along the protein chain, allowing the user to highlight up to three hinges in a protein. The "Show all" button then highlights all selected residues in the Jmol viewer window. Once the user is satisfied with the hinge selection, clicking "Submit" records this selection in the database. Once the morph page is regenerated, a "Show public hinge" button will be visible which, when clicked, highlights the selected residues. Lastly, the user can use a pointing device to reorient the protein in the Jmol window to his/her liking. A GIF image based on that view can be generated by clicking on the "color by domain" link. The animation will be rendered using VMD's "new cartoon" style, with the identified hinge region and two rigid domains each colored distinctively. The hinge annotations made in this way persist in our database for visualization and use by others, until overwritten. With minor modification, these tools were used to generate the Hinge Atlas dataset of manually annotated hinges. The criteria we used for selection are described in the following section.
Highlighting the Hinge Atlas hinges (described below) on the animated morph movie is a matter of going to the morph page and clicking on the "Hinge Analysis" tab as above and clicking the "Show Hinge Atlas hinge" button. The annotated hinge location will be rendered in green spacefill style, which contrasts with the white trace used elsewhere in the protein.
The tools described above answer only the technical question of how we annotated hinges. In this section we clarify the motivation for the Hinge Atlas and its applications and answer the scientific question of how we decided on the precise location of the hinge for each morph.
For each morph in the Hinge Atlas, we used the Hinge Annotation Tool as described to select the hinge location. Motivated in part by our long term goal of providing a resource that could be used in motion prediction work, and in part by a desire to deepen basic understanding of protein motion, we asked ourselves the following question:
Would it be possible to approximately reproduce the observed motion by allowing flexure at the hinge points but keeping the regions between hinges rigid?
In order for this question to be answered in the affirmative, the hinge selection should be the one to best meet the following criteria:
1. The φ, ψ, and α (effective α-carbon to α-carbon) torsion angles of hinge residues may often (but not always) be larger than those of their neighbors.
2. Amino acids on either side of the hinge residues must be co-moving with their respective rigid regions.
3. Rotations of one of the rigid regions about the hinge region must not result in significant and irreconcilable steric clashes.
In order to use (1) as a useful guide to selecting the hinge location, we made use of the torsion angle charts and graphs in the structure analysis tools section on the morph page. However often large rotations of the main chain are induced by multiple cooperative torsions in the hinge, and these may be individually small, particularly in α-helices. The usefulness of this flexibility measure is further limited by the frequent occurrence of large torsion angles which do not coincide with hinges. Nonetheless, when the precise location of the hinge was otherwise unclear, torsion angles were often examined to help adjust the selection.
Criterion (2) is a definition of a hinge. Sometimes the hinge was slightly longer than others, and in those cases we added more residues to the hinge, up to a limit of about five residues in total. If the hinge was distributed over too many residues such that no one short stretch could be said to constitute the entire hinge, then the morph was discarded from the Hinge Atlas, since the motion was not hingelike. Criterion (3) is a practical requirement of a working hinge. If substantial flexure at points outside the hinge is required to avoid domain interpenetration, then the choice of hinge location is incorrect, or the motion is not hinge but rather shear or unclassifiable.
The next question was, how to select the morphs which would be annotated and included in the Hinge Atlas. The entire Database of Macromolecular Motions (MolMovDB) with (at the time) over 17000 morphs, could clearly not all be annotated given limited manpower. Further, only a minority of morphs (albeit a large one) exhibited hinge bending motion, and even within this group much redundancy existed.
To address these issues and make the annotation work manageable, we first selected a nonredundant subset of the morphs in MolMovDB by aligning all sequences to nrdb90. This reduced the dataset to 1000 morphs. This was more manageable, but still the set contained many proteins which did not exhibit hinge bending motions. Fortunately we found that the score output by FlexProt, normalized by dividing by the number of residues, provided an accurate measure of the degree to which a protein exhibited hinge bending. High scores, close to unity, indicated proteins more likely to exhibit hinge bending motion. Proteins with lower scores, below 0.9 or so, were very unlikely to do so. We sorted the 1000 nonredundant morphs by descending normalized FlexProt score and annotated them in that order. Those proteins for which we could find hinges allowing a positive answer to the question above were annotated and added to the Hinge Atlas. Those proteins which did not exhibit hinge bending motion or for which no suitable hinge could be found were discarded. At the end of this culling and annotation effort, the Hinge Atlas contained 214 nonredundant annotated morphs. We also manually annotated a small set of specifically fragment (rather than domain) hinge bending motions which may be useful for some studies, described below.
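The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Morph` record, its field names, and the example values are hypothetical, and the raw FlexProt score is assumed to be already available for each morph.

```python
from typing import List, NamedTuple, Tuple


class Morph(NamedTuple):
    morph_id: str          # hypothetical identifier
    flexprot_score: float  # raw score reported by FlexProt (assumed given)
    n_residues: int        # chain length used for normalization


def rank_by_normalized_score(morphs: List[Morph]) -> List[Tuple[str, float]]:
    """Return (morph_id, score / n_residues) sorted by descending normalized score."""
    scored = [(m.morph_id, m.flexprot_score / m.n_residues) for m in morphs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy input with made-up numbers, for illustration only.
example = [
    Morph("m1", 180.0, 200),
    Morph("m2", 99.0, 100),
    Morph("m3", 140.0, 200),
]
ranking = rank_by_normalized_score(example)
```

Morphs near the top of `ranking` (normalized score close to unity) would be inspected first; the 0.9 figure in the text is a rough guide rather than a hard cutoff.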
In the course of this study we compiled a number of sets of morphs which can be viewed on our online galleries listed and linked to on our sets page. The Hinge Atlas and computer annotated sets are compared more rigorously in the "Statistical comparison of datasets" section. The galleries provide easy browsing and visual inspection of morph movies sharing certain characteristics. The sets offered include:
No two morphs in this set have more than 90% sequence homology. This set was compiled by alignment to proteins in nrdb90.
All morphs in this set have annotated active sites which can be highlighted in the Jmol viewer.
Same as above, but with redundant morphs removed by comparison to nrdb90.
Computer annotated set used in parts of this study and described above. We consider it to be less useful than the Hinge Atlas, but the data is nonetheless made available.
A small set of hinge bending motions involving fragments smaller than domains, as alluded to in the previous section.
Contains the manually annotated protein pairs used in this study. A link on the sets page permits the download of the sequence data (including residue number, residue type, hinge annotation, catalytic site annotation, and secondary structure) in mySQL format. The same data is available in tab-delimited text format which is human readable and importable into MS Excel and other packages. Another link on the same page facilitates the download of the interpolated structure files associated with each morph in the Hinge Atlas set.
Clicking on the thumbnail image leads to the "movies" page, where users can browse through the 214 proteins in the Hinge Atlas. Clicking on any of the protein thumbnail images, in turn, leads to the corresponding morph page, where the hinge annotation can be viewed as described in the "Hinge Annotation Tool" section.
Throughout this study, we will be comparing how often a particular entity (be it a certain amino acid, a certain pair of amino acids, a certain class of amino acids, a certain secondary structural element etc.) occurs in hinges versus everywhere in the Hinge Atlas or another of the datasets described above. The statistical analysis will be the same regardless of the particulars, so we will here present the general approach and later only mention adjustments particular to the specific question addressed.
First we defined the following variables:
D = total number of residues in the dataset
H = total number of residues in hinges in the dataset
C = classification scheme used to create groups of residue positions. For example, C could be secondary structure, degree of conservation, etc.
c = a particular grouping of residues, where c ∈ C. For instance, if C = secondary structure, then c = helix is the class of all residues in helices, c = strand is the class of all residues in strands, etc. Another example might be C = evolutionary conservation, with c = cons1 = top 20% most conserved residues, c = cons2 = second 20% most conserved, etc.
a_c = set of all residues of class c in the dataset.
d_c = number of times residues of class c occurred anywhere in the dataset.
h_c = number of times residues of a particular class c occurred in hinges.
These can be used to estimate various probabilities as follows:
p(a_c) = d_c/D is the prior probability of c; in other words, the probability that residues of class c occur anywhere in the dataset.
p(a_c|h) = h_c/H is the conditional probability that a residue belongs to class c, given it is a hinge.
Bayes' rule then gives the posterior probability that a residue of class c is a hinge, p(h|a_c) = p(a_c|h) p(h) / p(a_c), where the prior probability that a residue is a hinge is given by p(h) = H/D.
The Hinge Index is defined as HI(a_c) = log[p(a_c|h) / p(a_c)]. The argument of the log is the ratio of the observed frequency of occurrence of classes of amino acids a_c in hinges, over the expected. Note that this argument is close to the likelihood ratio used in Bayesian statistics because H is so small compared to D. The quantity HI yields an intuitive measure of the enrichment of certain classes of residues in hinges, with positive numbers indicating enrichment and negative numbers indicating scarcity. A nonzero HI does not by itself mean, however, that the differential representation is statistically significant. To establish the latter, we considered two statistical hypotheses:
H0: The null hypothesis.
If this is true, then the hinge set is chosen without replacement in an unbiased fashion from the dataset, and p(a_c|h) is given by the hypergeometric distribution (Equation 3).
H1: The alternate hypothesis.
We reject H0 if and only if our p-value falls below the chosen significance threshold.
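The formalism above can be illustrated with a short sketch. Several details are assumptions rather than statements from the text: the logarithm is taken base 10, the p-value is taken as the smaller of the two one-sided hypergeometric tail probabilities, and the function names are illustrative.

```python
from math import comb, log10


def hinge_index(h_c: int, H: int, d_c: int, D: int) -> float:
    """log10 of observed over expected frequency of class c in hinges.

    The base-10 logarithm is an assumption made for this sketch.
    """
    if h_c == 0:
        return float("-inf")  # class never observed in hinges
    return log10((h_c / H) / (d_c / D))


def hypergeom_pmf(k: int, D: int, d_c: int, H: int) -> float:
    """P(exactly k of the H hinge residues belong to class c) under H0."""
    return comb(d_c, k) * comb(D - d_c, H - k) / comb(D, H)


def p_value(h_c: int, H: int, d_c: int, D: int) -> float:
    """Smaller of the two one-sided tail probabilities (assumed convention)."""
    k_min = max(0, H - (D - d_c))
    k_max = min(H, d_c)
    upper = sum(hypergeom_pmf(k, D, d_c, H) for k in range(h_c, k_max + 1))
    lower = sum(hypergeom_pmf(k, D, d_c, H) for k in range(k_min, h_c + 1))
    return min(upper, lower)
```

For example, `hinge_index(20, 100, 100, 1000)` is positive because the class makes up 20% of hinge residues but only 10% of the dataset. Exact integer binomials are used so that the dataset sizes quoted in this work do not overflow.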
We applied the described statistical formalism to the problem of amino acid frequency of occurrence in hinges by taking C = amino acid type, and c to designate each of the 20 canonical amino acids. HI scores and p-values were thus calculated for each of 20 identifications of c corresponding to the 20 canonical amino acids.
[Table: Amino acid frequency of occurrence in hinges. Recovered column header: Occurrence in hinges.]
As mentioned earlier, the fact that one of the overrepresented residues is potentially catalytic led us to suspect that hinge residues are more likely to occur in active sites, or within a few residues of an active site, than would be expected by chance. This would make sense from a biochemical and mechanical perspective. Hinge motions are often opening and closing motions of domains intended to expose the active site, which often would be located at the center of the motion, i.e. the hinge.
Prior work shows that active sites are more likely to occur at regions of low first normal mode displacement. Such regions have been shown to coincide with hinges. Here we close the loop, comparing active sites directly with the Hinge Atlas annotation and quantifying the correspondence.
In order to annotate the active site locations, we BLASTed the morph sequences in the computer annotated dataset against the sequences in the Catalytic Sites Atlas and considered a morph in the hinge dataset to match a protein in the CSA if they had sequence identity ≥ 99%. This high threshold was chosen to minimize the possibility of incorrectly labeling a residue in the Hinge Atlas and thereby diminishing the significance of the results. For each such pair, we transferred the catalytic site annotation to the morph. We described earlier how to browse the CSA morphs online. Of the 214 proteins in the Hinge Atlas, 94 were annotated with active site information from the CSA; the rest had no close CSA homologs. The 94 proteins comprised the dataset for this calculation. We analyzed this set using the statistical formalism described earlier, with the following variable definitions:
C = distance from the nearest active site, in residues.
c = successively: active site residues, amino acids 1 residue away from the nearest active site residue, 2 residues away, etc.
D = 28050 residues in the dataset of 94 proteins
H = 378 hinge residues in the dataset
d_c = residues of class c in the dataset
h_c = residues of class c in hinges.
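The classification by distance from the nearest active site can be sketched as below. This is an illustration under stated assumptions: residues are indexed from 0 along a single chain, distance is measured in sequence positions, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def distance_to_nearest_active_site(n_residues: int,
                                    active_sites: Iterable[int]) -> List[int]:
    """For each position 0..n_residues-1, sequence distance to the nearest active site."""
    sites = list(active_sites)
    if not sites:
        raise ValueError("at least one active site position is required")
    return [min(abs(i - a) for a in sites) for i in range(n_residues)]


def counts_by_distance(distances: List[int],
                       hinge_positions: Iterable[int]) -> Tuple[Counter, Counter]:
    """Return (d_c, h_c): residue counts per distance class, overall and in hinges."""
    d_c = Counter(distances)
    h_c = Counter(distances[i] for i in hinge_positions)
    return d_c, h_c


# Toy chain of 6 residues with one active site at position 2 and a hinge at 2-3.
dist = distance_to_nearest_active_site(6, [2])
d_c, h_c = counts_by_distance(dist, [2, 3])
```

The resulting `d_c[m]` and `h_c[m]` play the roles of d_c and h_c for the class "m residues from the nearest active site", pooled over all proteins in the dataset.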
[Table: HI and associated p-value for hinge residue coincidence with active site, and with residues at certain distances from active site residues. Columns: m = distance from nearest active site (residues); residues at positions m; hinge residues at positions m.]
It is generally accepted that hinges tend to avoid secondary structure. However this belief has, to our knowledge, never been tested on a quantitative basis, and indeed numerous counterexamples can be found. For instance, the hinge in calmodulin and troponin C[26, 46] occurs in an α-helix, and in glutamine binding protein it occurs in two parallel beta strands. Thus we do not know which particular types of secondary structure are avoided or preferred, or to what degree. To obtain this information, we tabulated the number of hinge residues occurring in the various types of secondary structural elements, and compared this with the distribution of all residues, proceeding as follows.
STRIDE recognizes secondary structural elements from atomic coordinates. We used this program to assign secondary structural classes to all residues in the Hinge Atlas. We then tabulated the number of residues assigned to each class, both in hinges and elsewhere in the dataset. Lastly, we calculated the HI scores and the p-values as before, letting C = secondary structural element type and c designate e.g. helix, coil, etc.
[Table: Hinge frequency of occurrence in various types of secondary structure. Columns: Hinge residues (count); All residues (count); Hinge residues (expected). Recovered row label: Coil (none of the others).]
[Table: Hinge frequency of occurrence in various physicochemical classifications. Columns: Hinge residues (count); All residues (count); Hinge residues (expected). Residue groups as recovered (class names not recovered): I, L, V; H, F, W, Y; A, C, G, H, I, L, K, M, F, T, W, Y, V; R, D, E, H, Y; R, H, Y; R, N, D, E, Q, H, Y, S, T, W, Y; A, N, D, C, G, P, S (truncated); G, A, S.]
We next investigated whether hinge residues are conserved. Since certain residue classes are preferred in hinges, one might suspect that hinge residues would be conserved. First, we BLASTed each of the Hinge Atlas sequences against nrdb90, a non-redundant sequence database in which protein sequences have no more than 90% sequence identity with each other. Next, we extracted up to 50 top-aligned sequences for a given morph to generate a multiple sequence alignment using Clustal W. For each position in the multiple sequence alignment, we used the formalism developed by Schneider et al. to compute the information content of the corresponding alignment column[50, 51].
We sorted the residues in Hinge Atlas morphs according to the magnitude of the information content scores. We then divided the residues into five bins of equal size. If hinge residues are conserved, then there should be an enrichment of hinge residues in the top bins, which correspond to the most conserved residues. On the other hand, if hinge residues are hypermutable, there should be more of them in the bottom bins, corresponding to the least conserved residues. Because it is widely agreed that active sites should be conserved, we used the conservation of active sites as a control.
To quantify the enrichment, we calculated the HI scores as described previously. Here, c is a label applied to residues that ranked in a given percentile bin, e.g. the top 20% most conserved. For that bin, p(a_c|h) = h_c/H is thus the number of hinge residues in the bin divided by the total number of hinge residues. Similarly, p(a_c) = d_c/D is the number of residues in the dataset in the bin divided by the grand total of residues in the dataset. To determine the statistical significance of HI scores, we calculated the p-values using the hypergeometric distribution with the d_c, h_c, D, H defined above.
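The conservation scoring and binning can be sketched as follows. This is a simplified illustration: the information content is computed as log2(20) minus the Shannon entropy of the column, the small-sample correction of the original formalism is omitted, gap characters are ignored, and ties in score are broken by position. All function names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def information_content(column: Sequence[str]) -> float:
    """Information content (bits) of one alignment column, without small-sample correction."""
    residues = [r for r in column if r != "-"]  # ignore gaps (assumption)
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((c / n) * log2(c / n) for c in Counter(residues).values())
    return log2(20) - entropy


def quintile_bins(scores: Sequence[float]) -> List[int]:
    """Assign each residue a bin 1..5 of equal size; bin 1 holds the most conserved 20%."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * 5 // len(scores) + 1
    return bins


example_bins = quintile_bins([5, 4, 3, 2, 1, 0, 9, 8, 7, 6])
```

Counting hinge residues and all residues per bin then yields h_c and d_c for each conservation class.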
For the control set, we performed the same calculation but made the following changes to the variable definitions:
1. Our dataset was no longer the Hinge Atlas, but rather the "Catalytic Sites Atlas (nonredundant)" set described earlier. D is the total number of residues in this set.
2. a_c still represents residues in the dataset belonging to a given conservation rank bin. d_c is the total number of residues in that bin.
3. h_c now represents the number of active site residues in a given bin corresponding to c. Similarly, H represents the total number of active site residues in the dataset.
[Table: Hinge frequency of occurrence vs. conservation bin. Recovered column headers: Conservation score bin; Active site residue propensity; Hinge residue propensity; Enzymes in Hinge Atlas; Active site residues.]
The Hinge Atlas pools enzymes together with non-catalytic proteins. We reasoned therefore that perhaps only hinges in non-catalytic proteins are hypermutable, and that if we analyzed a set consisting only of enzymes, then the propensity of active sites to occur in hinges would lead to conservation, rather than hypermutability of hinge residues for that set.
Even this test, however, pools together hinges that are near the active site (or contain one or more active site residues) with hinges that occur at some distance from it. So we selected from the 94 proteins a small set that had at least one active site residue in the hinge, and removed the active site residues themselves. We then calculated the propensity of hinge residues to occur in the five conservation bins. This set was found to be too small, however, and statistical significance was too low to draw a conclusion (data not shown). A study using the set of fragment hinge motions described earlier was similarly inconclusive.
The hypermutability of hinge residues that we found is reasonable because hinge residues tend to be on the surface of proteins (see below) rather than in the more highly conserved core. Hinges are less likely to be buried inside domains because they would then be highly coordinated with near neighbors and hence less flexible. The apparent contradiction of hypermutability on the one hand and enrichment of active sites on the other is dealt with in the Discussion section.
Hinge Index and p-value for differential representation of residues binned by solvent accessible surface area (bin #1 represents largest area).
Number of hinge points per protein in the Hinge Atlas (table columns: number of hinge points; number of protein pairs (morphs)).
The GOR [30, 31] method is useful for predicting secondary structure from sequence with fair accuracy. We implemented a GOR-like method to determine whether sequence contained enough information for hinge prediction. We divided the dataset into a training set and a test set for this study. The log-odds frequencies of occurrence of amino acids in the training set were tabulated not only at a given hinge residue, but also at positions ranging from -8 to +8 from the given residue in sequence space. For simplicity, hinge residues at positions less than eight residues from either end of the chain were not included.
Once the table was generated, it was used on the test set. The score for a given residue was taken to be the sum of the scores for the residues in positions -8 to +8 from that residue. The scores were computed for all residues in the test set, except those less than eight residues from either end of the chain. The idea is that a threshold score can be chosen and residues scoring higher than this threshold are considered more likely to be hinges. Note that where Robson and Suzuki used a different fitting parameter for each type of secondary structure, we used no fitting parameter, since we were interested in only one "secondary structure": the hinges. The rates of true and false positives and negatives were calculated for each choice of score threshold over a range.
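The tabulation and scoring steps described above can be sketched as follows. This is a minimal illustration only: the function names, the pseudocount, and the use of the composition of eligible positions as the background frequency are assumptions and are not taken from the implementation used in this work.

```python
import math
from collections import defaultdict

WINDOW = 8  # positions -8 to +8 around the residue
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_table(chains, pseudocount=1.0):
    """chains: list of (sequence, set of hinge indices).
    Returns table[offset][aa] = log-odds of aa at that offset from a hinge."""
    offsets = range(-WINDOW, WINDOW + 1)
    hinge_counts = {off: defaultdict(float) for off in offsets}
    background = defaultdict(float)
    n_hinges = n_background = 0
    for seq, hinges in chains:
        # Skip residues closer than WINDOW positions to either chain end.
        for i in range(WINDOW, len(seq) - WINDOW):
            background[seq[i]] += 1
            n_background += 1
            if i in hinges:
                n_hinges += 1
                for off in offsets:
                    hinge_counts[off][seq[i + off]] += 1
    table = {}
    for off in offsets:
        table[off] = {}
        for aa in AMINO_ACIDS:
            p_hinge = (hinge_counts[off][aa] + pseudocount) / (n_hinges + 20 * pseudocount)
            p_bg = (background[aa] + pseudocount) / (n_background + 20 * pseudocount)
            table[off][aa] = math.log(p_hinge / p_bg)
    return table

def score_chain(seq, table):
    """Score each eligible residue as the sum of log-odds over its window."""
    scores = {}
    for i in range(WINDOW, len(seq) - WINDOW):
        scores[i] = sum(
            table[off].get(seq[i + off], 0.0)  # unknown residue types score 0
            for off in range(-WINDOW, WINDOW + 1)
        )
    return scores
```

Residues whose score exceeds a chosen threshold would then be treated as candidate hinges, and the threshold can be swept to obtain true and false positive rates.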
Our training set numbered 136 proteins from the computer annotated set. We tested the method on a test set of 137 proteins from the same set and obtained a ROC curve (not shown; ROC curves are explained later in this work). The area under this curve was nearly 0.5, indicating negligible predictive value.
For simplicity, statistical independence of the various features was assumed in creating this definition. Here the i's correspond to individual amino acids in the protein sequence. For each i, j designates one of the 20 amino acid types, k designates the secondary structural classification, and l designates active site versus non-active site classification.
Thus HI amino·acid(i) is assigned according to residue type by looking up the corresponding value in Table 1. Similarly, HI secondary·structure(i) is obtained according to secondary structure type from Table 3. Following Table 2 approximately, we assign HI active·site(i) as 0.4 for residues four or fewer amino acid positions away from the nearest active site residue, and 0.0 elsewhere. The highest values of HS(i) correspond to residues most likely to occur in hinges.
Clearly, extending this method is only a matter of obtaining amino acid propensities to occur in hinges according to additional classifications. The resulting index can then simply be included as an additional term in the above formula, with no need for adjustable weighting factors.
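The following minimal sketch illustrates how such a score could be assembled. It assumes that HS(i) is the unweighted sum of the three indices, as suggested by the description of each index as a term with no adjustable weighting factors. The function name hinge_seq_scores and the lookup dictionaries hi_aa and hi_ss are hypothetical; real values would come from Table 1 and Table 3, which are not reproduced here.

```python
def hinge_seq_scores(sequence, sec_struct, active_sites, hi_aa, hi_ss,
                     active_site_index=0.4, max_distance=4):
    """Return a list of HS(i) values, one per residue.

    sequence:     one-letter amino acid codes
    sec_struct:   one secondary structure label per residue
    active_sites: indices of annotated active site residues
    hi_aa, hi_ss: hypothetical lookup tables of hinge indices
    Assumes HS(i) is a plain sum of the three indices.
    """
    scores = []
    for i, (aa, ss) in enumerate(zip(sequence, sec_struct)):
        # 0.4 within four positions of the nearest active site, 0.0 elsewhere.
        near = any(abs(i - site) <= max_distance for site in active_sites)
        scores.append(
            hi_aa.get(aa, 0.0)
            + hi_ss.get(ss, 0.0)
            + (active_site_index if near else 0.0)
        )
    return scores
```

Adding a further classification would amount to passing one more lookup table and adding one more term to the sum.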
Statistical analysis of HingeSeq predictor (table rows: total residues in Hinge Atlas; hinges in Hinge Atlas; total residues with HingeSeq score > 0.5; hinge residues with HingeSeq score > 0.5).
We nonetheless wished to show that HingeSeq is predictive, rather than simply reflecting peculiarities of the dataset. To this end, we divided the 214 proteins of the Hinge Atlas into a training set numbering 161 proteins, and a test set numbering 53. Of the 214 Hinge Atlas proteins, the 94 proteins with annotation from the CSA were apportioned such that 71 were included in the training set and 23 in the test set. We tested the performance of the predictor by means of ROC (Receiver Operating Characteristic) curves. We need to define a few terms in order to use these:
Test positives: Residues with HS(i) greater than or equal to a certain threshold.
Test negatives: Residues with HS(i) less than a certain threshold.
Gold standard positives: Residues annotated as hinges in the Hinge Atlas.
Gold standard negatives: Residues which are not in hinges according to the Hinge Atlas annotation.
True positives (TP): Those residues that are both test positives and gold standard positives.
True negatives (TN): Residues that are both test negatives and gold standard negatives.
False positives (FP): Residues that are test positives and gold standard negatives.
False negatives (FN): Residues that are test negatives and gold standard positives.
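These definitions translate directly into counts at each score threshold. The following is a minimal sketch of how ROC points and the area under the curve could be computed from per-residue scores and hinge annotations. The function names roc_points and auc are hypothetical, and the trapezoidal area is a standard choice rather than a procedure confirmed by this work.

```python
def roc_points(scores, is_hinge, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for score, hinge in zip(scores, is_hinge):
            if score >= t:          # test positive
                if hinge:
                    tp += 1
                else:
                    fp += 1
            else:                   # test negative
                if hinge:
                    fn += 1         # test negative, gold standard positive
                else:
                    tn += 1
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

An area near 0.5 corresponds to a predictor with negligible value, while an area approaching 1.0 corresponds to near-perfect separation of hinge and non-hinge residues.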
These findings assume that the dataset used does not contain significant bias or artifacts, either in the composition of the entire dataset or of the hinges within it. To substantiate this, we performed various studies as follows.
Frequency of Gene Ontology terms in PDB vs. Hinge Atlas (table columns: Gene Ontology term; counts in PDB; counts in Hinge Atlas; terms listed include nucleic acid binding, nucleotide binding, oxidoreductase, protein binding, and electron transporter).
To compare the Hinge Atlas counts to the PDB counts in an overall fashion, we used the chi-square distribution with 162 degrees of freedom (from 163 GO terms and 2 datasets) and obtained a chi-square value of 121.1. This corresponds to a p-value of 0.9931, so there is no statistically significant difference in the distribution of these terms in the Hinge Atlas vs. the entire Protein Data Bank.
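The arithmetic of this comparison can be checked with a short sketch. It assumes a standard two-sample contingency-table chi-square, which gives (163 - 1) x (2 - 1) = 162 degrees of freedom, and it uses the Wilson-Hilferty normal approximation for the upper-tail probability so that only the standard library is needed. The function names are hypothetical, and the exact test procedure used in this work may differ.

```python
import math

def chi_square_two_samples(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two count vectors
    over the same categories (e.g. GO terms in two datasets)."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    stat = 0.0
    used = 0
    for a, b in zip(counts_a, counts_b):
        row = a + b
        if row == 0:
            continue  # a term absent from both sets carries no information
        used += 1
        exp_a = row * total_a / grand
        exp_b = row * total_b / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat, (used - 1) * (2 - 1)

def chi_square_upper_tail(x, dof):
    """Approximate P(chi-square >= x) via the Wilson-Hilferty transform."""
    z = ((x / dof) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * dof))) / math.sqrt(2.0 / (9.0 * dof))
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Calling chi_square_upper_tail(121.1, 162) gives a value of about 0.993, in agreement with the p-value quoted above to within the accuracy of the approximation.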
The Hinge Atlas and computer annotated sets were compiled differently; one might therefore suspect that the hinges from one set comprise a statistically different population from the hinges of the other set. If this were the case, then one of the two sets would be preferable to the other; if instead the populations were essentially the same, then the two sets could potentially be used interchangeably. It is therefore necessary to quantitatively compare these two populations. It is also necessary to confirm that within one set, the hinge residues are a statistically distinct population from the rest of the set; if this were not true then the amino acid propensity data reported earlier would not be meaningful.
The hinges within the computer annotated set comprise a distinct population from the rest of the set (p-value = 0.017).
Comparisons tabulated: computer annotated set, hinge vs. non-hinge residues; Hinge Atlas, hinge vs. non-hinge residues; Hinge Atlas hinges vs. computer annotated hinges.
We conclude from this calculation that for both the Hinge Atlas and the computer annotated set, the hinge population is different from the non-hinge population; therefore statistically significant information can be extracted from both. However, the hinge population of the Hinge Atlas is different from that of the computer annotated set, albeit with much lower significance. We argue that one of the two sets should therefore be preferred for statistical studies. The preferred set should be the Hinge Atlas, since the computer annotated set contains numerous annotations which are slightly different from the correct and visually verifiable hinge location.
We next asked the question, do the morphs in the Hinge Atlas reflect intrinsic flexibility of the protein, or is the apparent conformational change the result of sequence differences between the two structures in the pair? That is, do the morphs display motions observable in a single protein, or do they instead represent evolutionary change? To answer this we counted the number of times both structures in the morph came from the same vs. different organisms. Of the 214 morphs, 123 had structures downloaded directly from the PDB rather than uploaded by users, and also had valid source organism data. For 109 of the 123, both proteins in the pair came from the same species, while for 14 the two proteins came from different species. Of the 14, 11 pairs were of proteins that were somewhat related to each other (7 pairs of bacterial, and 4 pairs of mammalian), while only three pairs consisted of two proteins from different kingdoms. Thus the conformational changes are likely to reflect experimentally observable motions rather than evolutionary effects.
As a further test of confidence in the Hinge Atlas, we decided to look for sampling artifacts in the hinge set. Resampling or bootstrapping is a technique suited for this purpose. We bootstrapped the frequency of occurrence of amino acid types. The method consists of drawing random samples and computing the frequency of occurrence of a given amino acid type in that sample. We present the results for glycine, the residue type most overrepresented in hinges.
We randomly chose 1/8 of the 214 proteins in the Hinge Atlas and labelled the sample with an index j. Within that sample we counted or computed the following:
the number of hinge residues of all amino acid types in sample j,
the number of non-hinge residues of all amino acid types in sample j,
the number of glycine residues in hinges in sample j,
the number of glycine residues among non-hinge residues in sample j,
the sample frequency of occurrence of glycines within hinges in sample j, and
the sample frequency of occurrence of glycines among non-hinge residues in sample j.
We repeated the above for j = 1 to 10000, randomizing the sample each time. For glycine (a_GLY), we generated bins 0.02 wide and counted the number of times the hinge and non-hinge sample frequencies occurred in each interval.
From the cumulative Gaussian distribution, events 1.42 or more standard deviations from the mean have a probability of occurrence of 0.077, giving us an additional measure of confidence that the distribution of glycines is different in the hinge vs. non-hinge sets. Note that we would expect this to be a conservative estimate of the significance (p-value is actually much lower, since the process of resampling subdivides the dataset). The main point of this analysis is that the sample is not biased by particular anomalous proteins.
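The resampling procedure and the Gaussian tail probability can be sketched as follows. This is a minimal illustration under stated assumptions: proteins are drawn without replacement within each sample, the input format (a sequence paired with a set of hinge indices) is hypothetical, and the function names are not taken from the code used in this work.

```python
import math
import random

def bootstrap_glycine(proteins, n_samples=10000, fraction=1 / 8, seed=0):
    """proteins: list of (sequence, set of hinge indices).
    Returns two lists: glycine frequency in hinges and in non-hinge
    residues, one value per random sample."""
    rng = random.Random(seed)
    k = max(1, round(len(proteins) * fraction))
    hinge_freqs, nonhinge_freqs = [], []
    for _ in range(n_samples):
        sample = rng.sample(proteins, k)
        hinge_total = nonhinge_total = gly_hinge = gly_nonhinge = 0
        for seq, hinges in sample:
            for i, aa in enumerate(seq):
                if i in hinges:
                    hinge_total += 1
                    gly_hinge += aa == "G"
                else:
                    nonhinge_total += 1
                    gly_nonhinge += aa == "G"
        if hinge_total and nonhinge_total:
            hinge_freqs.append(gly_hinge / hinge_total)
            nonhinge_freqs.append(gly_nonhinge / nonhinge_total)
    return hinge_freqs, nonhinge_freqs

def gaussian_upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

For z = 1.42 the tail probability evaluates to roughly 0.078, consistent with the value of 0.077 quoted above.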
Correlations were found between hinges and several sequence features. We found that some amino acid types are overrepresented in hinges, and much of this can be explained on the basis of physicochemical properties. Small residues appear to be preferred, especially the "tiny" Ser, Gly, and Ala. Aliphatic and hydrophobic residues tend not to be in hinges. We found that residues within four amino acid positions of an active site are significantly more likely to be hinges. This is most likely related to the fact that hinge bending motion is often related to the catalytic mechanism of the enzyme. Active site residues most logically occur inside the binding cleft and therefore are likely to be in the hinge or close by. Some of these results are intuitive, but are nonetheless useful in buttressing the less expected results. Further, even the intuitive results have in many cases never been rigorously tested or put on a quantitative footing.
Surprisingly, hypermutable residues are more likely than conserved residues to occur in hinges. This was found to be true not only for the Hinge Atlas set of 214 proteins (which includes proteins with no annotated active sites), but also for the subset of 94 enzymes with CSA annotation (Figure 5, Figure 6). This may appear to contradict our earlier result that active site residues and their near neighbors are enriched in hinges. However, although the catalytic residue enrichment has very high statistical significance, the number of active site residues in hinges is still small compared to the total number of residues in hinges. Thus their presence is insufficient to counter the wider tendency of hinge residues to be hypermutable. Also, the near neighbors of active site residues have no particular reason to be conserved and thus their enrichment in hinges seems unlikely to counter the tendency toward hypermutability.
This raises the question, why would residues that are functionally important not be conserved? The answer may be that it is the intricate network of interactions within the hydrophobic core of rigid regions on either side of the hinge that needs to be conserved, and not the hinges themselves. The importance of the stability of these domains rather than of any detailed properties of the hinges themselves is underscored by the significant success of structure-based hinge predictors which analyze the interactions within the domains and between the domains and the solvent, but which pay no particular attention to the hinge region itself (Flores and Gerstein, submitted), or which implicitly or explicitly find highly interconnected regions of the protein.
One might also ask, is it possible that co-evolution (alternatively called compensatory mutation or mutational correlation) occurs in hinge residues even in the absence of independent (single-site) conservation? Investigators have repeatedly found that co-evolving residue pairs tend to be proximal in space and stabilize proteins, for instance by periodically bridging consecutive turns of α-helices or by interacting across the contact interface between two such helices. This is an active area of research with possible future implications for hinge finding.
Sequence in the immediate neighborhood of a hinge was not found to be sufficient for substantive hinge prediction by a GOR-like method, although the latter is successful at predicting secondary structure. Similarly, no particular sequential pairs of amino acid types were found to be overrepresented in hinges. However, we did find that combining amino acid propensity data with hinge propensities of active sites and secondary structure yielded some predictive information. The prediction method we present can easily be extended as additional hinge propensity data are reported. Indeed the publicly available Hinge Atlas can be used not only to obtain such data but also to test the resulting predictors. As an additional application, the Hinge Atlas can potentially be used to help find hinges by homology. We note, for instance, that a hinge occurring (unusually) in the helix connecting the two EF hands of calmodulin has also been found in the evolutionarily related Troponin C.
We found that the amino acids glycine and serine are more likely to occur in hinges, whereas phenylalanine, alanine, valine, and leucine are less likely to occur. No evidence was found for sequence bias in hinges by a GOR-like method, nor for propensity towards sequential pairs of residues. Hinge residues tend to be small, but not hydrophobic or aliphatic. They are found less often in α-helices, and more often in turns or random coils. Active site residues were found to coincide significantly with hinges. Interestingly, however, hinge residues were not conserved. Lastly, hinges are also more likely to occur on the protein surface than in the core.
A consistent picture of hinge residues is suggested. In this view, hinges often occur near the active site, probably to participate in the bending motion needed for catalysis. They avoid regions of secondary structure. They are hypermutable, possibly due to the fact that they occur more often on the surface than in the core. These correlations yield insights into protein flexibility and the structure-function relationship. Strong sequence-based hinge prediction, however, remains a goal for future work.
The authors acknowledge support from the National Institutes of Health. S. Flores thanks Leslie Kuhn for annotating numerous proteins in the Hinge Atlas, Cheryl Leung for significant editing of this manuscript, Mihali Felipe for systems help, Alexander Karpikov for valuable discussions on low pass filters in Fourier space, Thomas Royce and Andrea Sboner for statistical advice, and Michel DuMontier for discussions on Armadillo and testing the hypermutability of functional hinges. This work was funded by NIH/NHLBI grant # N01-HV-28186.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.