Structural alphabets derived from attractors in conformational space
© Pandini et al; licensee BioMed Central Ltd. 2010
Received: 24 August 2009
Accepted: 20 February 2010
Published: 20 February 2010
The hierarchical and partially redundant nature of protein structures justifies the definition of frequently occurring conformations of short fragments as 'states'. Collections of selected representatives for these states define Structural Alphabets, describing the most typical local conformations within protein structures. These alphabets form a bridge between the string-oriented methods of sequence analysis and the coordinate-oriented methods of protein structure analysis.
A Structural Alphabet has been derived by clustering all four-residue fragments of a high-resolution subset of the protein data bank and extracting the high-density states as representative conformational states. Each fragment is uniquely defined by a set of three independent angles corresponding to its degrees of freedom, capturing in simple and intuitive terms the properties of the conformational space. The fragments of the Structural Alphabet are equivalent to the conformational attractors and therefore yield a most informative encoding of proteins. Proteins can be reconstructed within the experimental uncertainty in structure determination and ensembles of structures can be encoded with accuracy and robustness.
The density-based Structural Alphabet provides a novel tool to describe local conformations and it is specifically suitable for application in studies of protein dynamics.
Most proteins have arisen by natural selection to adopt a hierarchical three-dimensional fold, where regularly shaped structural motifs are packed together and form a hydrophobic core. The first description of two of these motifs introduced the concept of secondary structures (α-helix and β-sheet) and demonstrated that some local structures have a repetitive nature. Later it was discovered that almost all regions of a protein backbone can be rebuilt from a few substructures common to different proteins. With the increasing availability of high quality structures, it also became clear that some of the adopted conformations are realised much more frequently than others. More recently, a detailed analysis of the Ramachandran space for structures of different crystallographic resolution showed clustering of both secondary structure types and random coil conformations at distinct conformational attractors. These attractors can be labelled as conformational 'states' and the protein structure can be considered as a sequence of conformational states. Indeed, classical secondary structure attribution encodes a protein structure into a sequence of states.
However, the protein fold cannot be fully reconstructed from the secondary structure sequence alone, because this code describes the conformation of single residues and provides too few states to capture the entire variety of local conformations. To overcome these limitations, comprehensive libraries of frequently occurring fragments spanning several, typically 4-7, residues were derived [5–12]. These libraries provide a richer choice of conformational states and they comprise intrinsically the structural correlation between consecutive residues. Using fragments, protein fold reconstruction can be achieved by superimposing chains of fragments in a head-to-tail arrangement.
Structural Alphabets are fragment libraries composed of a relatively small number of fragments that complement each other to form a 'universal code' of local conformations. Several Structural Alphabets have been derived using methods such as cluster analysis [6, 8, 10, 12, 14], Kohonen maps and Hidden Markov Models [11, 15]. Generally, machine learning strategies yielded better performing alphabets at the price of an indirect description of the conformational space. Despite their relative novelty, the potential of Structural Alphabets has been exploited for decoy generation, local structure prediction [9, 17, 18], sequence-based structural comparison, combined sequence-structure alignments, 3D structure alignment, structure mining [12, 22–25], structure reconstruction from Cα, fold classification, fold prediction, structure generation, de novo prediction [30, 31] and de novo backbone design, but not yet for molecular motions and conformational transitions. Therefore, a description of high quality Structural Alphabets is needed that allows for a projection of these properties into the conformational space of the alphabet, which would facilitate the development of applications that combine a static and dynamic description of proteins.
In this paper we devise a simple and explicit description of four-residue long fragments, the conformation of each being defined by three internal angles. All protein fragments were mapped as points in a three-dimensional space of these internal angles. Structural Alphabets were extracted directly from the conformational attractors on that fragment map and assessed in terms of their accuracy in reconstructing protein structures. A performance comparison was made with other Structural Alphabets of four-residue fragments. Finally the suitability of our best performing alphabet for the description of protein dynamics was assessed by encoding the different structures in a test set of conformation ensembles and measuring the correlation between local flexibility and encoding variability.
A reference set of high quality protein structures was selected from ASTRAL SCOP 10 (v1.73), which includes domains with less than 10% sequence identity [33, 34]. The degree of quality was measured by the Summary PDB ASTRAL Check Index (SPACI). This index provides information on the reliability and precision of protein structures. It includes three contributions: the quality of the experimental data (resolution), the quality of the fitting procedure (R-factor), and the quality of the deposited model structure (stereochemical accuracy). Only X-ray structures with complete backbone chains and SPACI quality scores > 0.5 were included, yielding a total of 1830 protein domains. The list of SCOP ids is available for download at http://mathbio.nimr.mrc.ac.uk/download/MK.dataset.txt.
A molecule composed of N atoms possesses 3N degrees of freedom. For a four-residue fragment described by its four Cα atoms (N = 4), removal of the six trivial rigid body motions and the three bond constraints yields 3 × 4 - 6 - 3 = 3 degrees of freedom, which are entirely described by three independent internal angles: the two planar angles ϕ1 and ϕ2 and the torsional angle θ. Advantages of this representation are its conceptual simplicity, the ease of visualisation and the fast comparison of fragment geometries by angle differences instead of atom superposition.
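As an illustration of this representation, the following minimal sketch computes the three internal angles from four Cα positions. It assumes that ϕ1 and ϕ2 are the planar angles at the second and third Cα atom and that θ is the dihedral around the central virtual bond; the function names are hypothetical and the sign convention of θ is the common atan2 one, which may differ from the convention used for the published alphabet.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def planar_angle(p, q, r):
    """Angle p-q-r in degrees, in the range [0, 180]."""
    u, v = _sub(p, q), _sub(r, q)
    c = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def torsion_angle(p0, p1, p2, p3):
    """Dihedral p0-p1-p2-p3 in degrees, in the range (-180, 180]."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n23 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n23)
    x = _dot(_cross(b1, b2), n23)
    return math.degrees(math.atan2(y, x))

def fragment_angles(ca):
    """Return (phi1, phi2, theta) for a list of four C-alpha coordinates.

    Assumption: phi1/phi2 are planar angles, theta is the torsion.
    """
    phi1 = planar_angle(ca[0], ca[1], ca[2])
    phi2 = planar_angle(ca[1], ca[2], ca[3])
    theta = torsion_angle(ca[0], ca[1], ca[2], ca[3])
    return phi1, phi2, theta
```

Because the angles are internal coordinates, the result does not change when the fragment is rotated or translated, which is what allows fragments to be compared without superposition.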
Density-based cluster analysis
Cluster analysis of the fragments extracted from the selected set of 1830 protein domains (325923 fragments) was performed in the conformational space of the three internal fragment angles (ϕ1, ϕ2, θ). The values of these angles were neither normalised nor standardised, in order to preserve the original ratio between the variance in the torsional angle θ and that in the planar angles ϕ. This information is central to correctly detecting the geometries associated with secondary structures and the transitions between them. A cubic grid with 2° resolution was defined to obtain an initial fragment density estimate. To reduce the computational complexity, irrelevant data were removed by an initial filtering step: fragments in cubes containing fewer than 10 fragments in total were removed. The final dataset included 133254 fragments.
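The grid-based filtering step can be sketched as follows. This is only an illustration under stated assumptions: fragments are given as (ϕ1, ϕ2, θ) tuples in degrees, cells are aligned to multiples of the grid step, and the function names are hypothetical.

```python
import math
from collections import Counter

def grid_cell(angles, step=2.0):
    """Index of the cubic grid cell containing a (phi1, phi2, theta) point."""
    return tuple(math.floor(a / step) for a in angles)

def filter_sparse(fragments, step=2.0, min_count=10):
    """Keep only fragments whose grid cell holds at least min_count fragments."""
    counts = Counter(grid_cell(f, step) for f in fragments)
    return [f for f in fragments if counts[grid_cell(f, step)] >= min_count]
```

The cell counts double as the initial density estimate, so the same `Counter` could be reused to inspect the density map before clustering.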
Extraction of a Structural Alphabet from the data cloud in this conformational space requires the identification of representatives within the high-density regions of point clusters. However, the data did not lend themselves to standard clustering methods, because of the large variation of cluster densities and the partial overlap of clusters with different densities. To overcome this problem, the OPTICS (Ordering Points To Identify the Clustering Structure) method was implemented in C and applied to the dataset. A flow chart of the algorithm is given in Additional File 1: Supplementary Figure S1. OPTICS is based on a nearest neighbour walk through the data space, thereby ordering and recording pairwise point distances. The approach is particularly suitable for the extraction of a Structural Alphabet, because fragments can be clustered hierarchically by density and representatives may be selected amidst the highest point densities. The algorithm requires only two input parameters: a neighbourhood radius (ε) and a minimum number of neighbour points (MinPts). We set ε to 200°; since the largest angular RMSD in the dataset is 78°, each point is reachable and this parameter had no influence on the results.
Briefly, the algorithm starts at a random data point, calculates the distance to all points within the neighbourhood radius (ε) and, if at least a minimal number of points (MinPts) is encountered, it records the nearest neighbour distance (Reachability Distance) and the smallest radius that encircles MinPts objects (Core Distance). If fewer than MinPts points fall within ε, the point is considered as noise. The algorithm repeats the same procedure for the nearest neighbour point and proceeds iteratively until all data points have been visited, thereby generating an ordered list. Our specific choice of ε implies that none of the fragments is labelled 'noise' and each is included in at least one cluster. This choice allows the ordered list to be scanned afterwards for clusters at any density. Distances d_ij were calculated as the root mean square deviation in angular coordinates (aRMSD) between fragment pairs. Angle differences of ϕ1 and ϕ2 naturally fell into the value range [0,180], while for θ periodicity was removed to retain the value range [0,180]. The ordered list of Reachability Distances (RDs) can be drawn as a comprehensive nearest neighbour distance plot (called Reachability Plot).
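A minimal sketch of the angular distance is given below. It assumes that the aRMSD is the root of the mean squared difference over the three angles and that the periodicity of θ is removed by taking the shorter way around the circle; both the exact normalisation and the function names are assumptions for illustration.

```python
import math

def torsion_diff(a, b):
    """Smallest absolute difference between two torsion angles, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def armsd(f, g):
    """Angular RMSD between two fragments given as (phi1, phi2, theta) in degrees."""
    d1 = abs(f[0] - g[0])           # planar angles already lie in [0, 180]
    d2 = abs(f[1] - g[1])
    dt = torsion_diff(f[2], g[2])   # periodic torsion angle
    return math.sqrt((d1 * d1 + d2 * d2 + dt * dt) / 3.0)
```

For example, two fragments that differ only in θ, with values of 170° and -170°, are 20° apart in θ rather than 340°.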
Structural Alphabet extraction
1. Generate a list in which the fragments are sorted by decreasing Reachability Distance (increasing density) with 'merge sort'.
2. Parse the sorted list to find two fragments that are more than MinPts apart in the Reachability Plot; these fragments enclose a candidate cluster.
3. If the size of the candidate cluster is at least MinPts smaller than the parent cluster, label the candidate cluster as accepted cluster and remove all its points from the sorted list.
4. Repeat 2 until reaching the end of the sorted list.
In the first iteration, the entire Reachability Plot is scanned for root clusters; in following iterations, new clusters are extracted by processing the clusters that were identified at the previous step. The fragment with the lowest Core Distance (highest density of fragments in its neighbourhood) is taken as the representative of a cluster.
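The choice of a cluster representative by lowest Core Distance can be sketched with a brute-force calculation. The sketch assumes that the Core Distance of a point is the distance to its MinPts-th nearest neighbour, not counting the point itself (conventions differ on this), and it uses the Euclidean `math.dist` as a stand-in where the angular distance would be used in practice; the function names are hypothetical.

```python
import math

def core_distance(i, points, min_pts, dist=math.dist):
    """Distance from points[i] to its min_pts-th nearest neighbour (brute force)."""
    d = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
    return d[min_pts - 1] if len(d) >= min_pts else float("inf")

def representative(points, min_pts, dist=math.dist):
    """Point with the lowest core distance, i.e. the densest neighbourhood."""
    best = min(range(len(points)),
               key=lambda i: core_distance(i, points, min_pts, dist))
    return points[best]
```

The brute-force search is quadratic in the cluster size, which is acceptable for a sketch but would need a spatial index for several hundred thousand fragments.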
As previously reported, the distribution of pairwise Euclidean distances between short protein fragments of fixed length is multi-modal, with one peak corresponding to intra-cluster distances (same fragment conformation) and the others to inter-cluster distances (different fragment conformation). Using the same principle, we derived a cutoff distance to remove redundant representatives from our set.
The distribution of pairwise distances was calculated for a random sample of 13326 fragments from the dataset. The Euclidean distance between the Cα atoms after optimal superposition of all fragment pairs was computed. The resulting distance distribution of the intra-cluster peak is log-normal for values smaller than 1.0 Å and it was fitted using the maximum-likelihood method of the R package MASS.
Occurrences within one standard deviation (0.307 Å) account for 84% of the data and 0.307 Å was selected as the redundancy distance cutoff for cluster representatives.
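The relation between the fitted log-normal distribution and the 84% figure can be illustrated with a short sketch. It assumes that 'one standard deviation' refers to one log-scale standard deviation above the log-scale mean, which is consistent with 84% being the one-sided normal probability Φ(1) ≈ 0.841; this reading, and the function names, are assumptions and not a statement of how the published cutoff was computed.

```python
import math

def fit_lognormal(samples):
    """Closed-form maximum-likelihood estimate (mu, sigma) on the log scale."""
    logs = [math.log(x) for x in samples]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def lognormal_cdf(x, mu, sigma):
    """Cumulative probability of a log-normal distribution at x > 0."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def one_sigma_cutoff(mu, sigma):
    """Distance lying one log-scale standard deviation above the log-scale mean."""
    return math.exp(mu + sigma)
```

Whatever the fitted parameters are, the cumulative probability at `one_sigma_cutoff(mu, sigma)` is always Φ(1), so a cutoff defined this way covers about 84% of the intra-cluster distances.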
Assessment by fit quality
The quality of a Structural Alphabet is generally assessed by its accuracy in approximating real protein structures. For this purpose the performance measure previously used by other authors was adopted: local fit and global fit accuracy. A local fit is obtained if each position of a given protein structure template is overlaid with the best-fitting alphabet fragment. The coordinate root mean square deviation (cRMSD) on the position of each Cα atom from the template protein is then calculated. A global fit is obtained if the best-fitting sequence of fragments is given and the protein structure is reconstructed by progressively overlaying the ends of neighbouring fragments in the sequence. In this case the cRMSD value is calculated after aligning the reconstructed structure with the template.
In both cases the median of the protein RMSD fit was calculated for a test set including 798 proteins from ASTRAL SCOP superfamily level with SPACI scores >0.5 [33, 34]. Dataset and test set did not overlap. The list of SCOP ids of the test set is available for download at http://mathbio.nimr.mrc.ac.uk/download/MK.testset.txt.
For this analysis a previously described global fit procedure was implemented in C. Whenever possible, RMSD calculations were performed with Theobald's fast quaternion method. A heap size of 2000 was used to ensure convergence of the encoding.
Assessment by information content
The Structural Alphabets obtained for different values of MinPts were ranked according to their Akaike Information Criterion (AIC). The residual errors were calculated in the form of a local fit RMSD of the 325923 fragments in the dataset. The Structural Alphabet with the lowest AIC was selected as the most informative alphabet, providing the highest fit performance for the lowest number of constituting fragments.
Genetic Algorithm optimisation of combined fragments
To overcome any potential bias by the hierarchical cluster extraction, all fragments of the alphabets obtained with MinPts parameters in the range [10, 99] (initially 1709 representative fragments, reduced to 106 non-redundant fragments using a 0.1 Å cRMSD cut-off) were submitted to a global optimisation within the framework of a Genetic Algorithm. The purpose of this optimisation is an independent alphabet derivation to verify that any potential methodological biases of the described OPTICS selection do not interfere with the performance of the selected alphabets.
The target size of the optimised Structural Alphabet was 25 fragments to match the size of our best performing alphabet (see below). Each fragment was represented by a gene in the form of a binary number, where [1/0] indicates either inclusion or exclusion with respect to the final subset. The fitness of the genome was calculated as the average local fit on the 10 top quality proteins (according to their SPACI score) in ASTRAL SCOP 40, which cover a diverse set of folds. The GA was run three times with a population size of 5000 genomes over 50 generations, cross-over breeding of the fittest 5% genomes and elitism (fittest genomes survive). The algorithm converged to a unique solution in 17 generations. The optimised subset was assessed by local fit, global fit and AIC.
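A Genetic Algorithm of this kind can be sketched as a search over binary inclusion masks of fixed size. The sketch keeps the elements stated above (binary genes, breeding from the fittest 5%, elitism) but single-point cross-over, the mutation rate, the repair step that restores the target size and the small default population are assumptions chosen for brevity; `fitness` is a user-supplied function where lower values are better, for instance an average local fit RMSD.

```python
import random

def repair(mask, k, rng):
    """Flip randomly chosen genes until exactly k genes are switched on."""
    mask = list(mask)
    on = [i for i, g in enumerate(mask) if g]
    off = [i for i, g in enumerate(mask) if not g]
    rng.shuffle(on)
    rng.shuffle(off)
    while len(on) > k:
        mask[on.pop()] = 0
    while len(on) < k:
        i = off.pop()
        mask[i] = 1
        on.append(i)
    return tuple(mask)

def evolve(n_genes, k, fitness, pop_size=200, generations=50,
           elite_frac=0.05, mut_rate=0.02, seed=0):
    """Select k of n_genes items minimising fitness(mask); returns the best mask."""
    rng = random.Random(seed)
    pop = [repair([0] * n_genes, k, rng) for _ in range(pop_size)]
    n_elite = max(2, int(elite_frac * pop_size))
    for _ in range(generations):
        pop.sort(key=fitness)                 # lower fitness value is better
        elite = pop[:n_elite]                 # elitism: fittest genomes survive
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(elite, 2)       # breed only from the elite
            cut = rng.randrange(1, n_genes)   # single-point cross-over (assumed)
            child = [1 - g if rng.random() < mut_rate else g
                     for g in a[:cut] + b[cut:]]
            children.append(repair(child, k, rng))
        pop = elite + children
    return min(pop, key=fitness)
```

With `n_genes=106` and `k=25` the mask would select 25 of the 106 non-redundant fragments; because the elite is carried over unchanged, the best fitness can never get worse from one generation to the next.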
Analysis of native contacts and intrinsic flexibility
The local and global fit accuracy are robust and general measures of reconstruction quality. For applications other than reconstruction, specific quality measures should complement the assessment. The Structural Alphabets introduced here are intended to also capture the intrinsic flexibility of protein structures. This implies that the network of interactions in the native structure is correctly described in the reconstructed structures. Therefore, reconstructed structures were assessed in terms of their dynamics and residue interactions. A useful approach to test this is provided by the Gaussian Network Model (GNM) [44–46]. With this simple but elegant model it is possible to calculate the protein contact map and to derive an estimate of the Root Mean Square Fluctuation (RMSF) of the atom positions directly from a single structure.
The overlap ranges from 0 (no overlap) to 1 (identical matrices).
Analysis of conformational states in structural ensembles
The suitability of the proposed Structural Alphabet to analyse protein dynamics was further tested by investigating both the robustness of the fragments to small fluctuations and their ability to describe conformational transitions. To limit the computational effort, the analysis was performed on a set of 24 proteins. The conformational space of each protein was explored with the tCONCOORD method [48–50] that provides a more accurate model than GNM, since an all-atom representation of the system is used and anharmonicities in atom motions are allowed, but it is still simpler and faster than Molecular Dynamics (MD) simulations.
In tCONCOORD, ensembles of structures are generated by fulfilling a set of distance constraints between atom pairs. The permitted distance intervals are determined on the basis of the distance values found in the starting structure and of the type of the interaction (e.g. covalent bonds, hydrogen bonds, salt bridges or hydrophobic interactions), so that lower tolerances are used to describe stronger interactions. All the contacts in the original structure are preserved, except for 'under-wrapped' hydrogen bonds [49, 51], which are considered unstable since they are not sufficiently shielded from the environment by hydrophobic groups. It has been shown that the detection of unprotected hydrogen bonds, together with the calibration of the distance constraint definition, allows the prediction of conformational transitions. Moreover, even if the molecule description is less accurate than that provided by the force fields generally used in MD simulations and there is no explicit representation of the solvent, the collective motions and the overall RMSF profiles extracted from tCONCOORD ensembles have been found to be in good agreement with both MD and experimental results [48, 49, 52, 53].
Table: Dataset for the test on accuracy and robustness in encoding structural ensembles (column: SCOP superfamily classification).
A 'per-fragment' flexibility profile was obtained for each protein by calculating the RMSFs of Cα over N-3 sliding windows of 4 residues. The roto-translational motion was eliminated by least-squares superposition of the fragment in each frame to the reference starting structure. The value assigned to each window was calculated as the quadratic mean of the RMSF values of each Cα in the fragment.
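The last step of this profile, the quadratic mean over each window, can be sketched as follows. The per-atom RMSF values are taken as given, so the per-frame superposition of each fragment is not part of the sketch, and the function name is hypothetical.

```python
import math

def fragment_flexibility(rmsf, width=4):
    """Quadratic mean of per-atom RMSF values over sliding windows.

    For N input values and width 4 this returns N - 3 window values.
    """
    return [math.sqrt(sum(v * v for v in rmsf[i:i + width]) / width)
            for i in range(len(rmsf) - width + 1)]
```

A single highly mobile atom raises the value of every window that contains it, so flexible residues influence up to four consecutive fragment positions.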
For the encoding entropy of fragment position i, p_ij is the fraction of structures where fragment i was encoded by letter j and k is the total number of letters in the alphabet.
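The variability of the encoding at one fragment position can be quantified from these fractions. The sketch below assumes a Shannon entropy over the letter fractions with logarithm base 2; the exact form and base of the entropy used for the published results are not stated in this section, so both are assumptions, as is the function name.

```python
import math
from collections import Counter

def encoding_entropy(letters, base=2.0):
    """Shannon entropy of the letters that encode one position across an ensemble.

    letters: one alphabet letter per structure in the ensemble.
    Letters that never occur contribute nothing to the sum.
    """
    n = len(letters)
    counts = Counter(letters)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())
```

A position that is always encoded by the same letter has zero entropy, while a position that uses all k letters equally often reaches the maximum of log k.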
Structural Alphabets from conformational attractors
This density plot was then processed with a variant of the Drop-Down algorithm, in which a density threshold was progressively increased, to extract the cluster structure (see Methods). A diagram of the hierarchical extraction is presented in the top panel of Figure 3, the resulting fragments and their location in the Reachability Plot are shown in the middle panel. In this scheme, frequently occurring fragments are selected first, rarer conformations later, which can also be interpreted as the importance of the attractor (cluster) in the conformational space. The collection of extracted fragments forms the Structural Alphabet. The rugged fine structure of the data density combined with the sensitive clustering method yielded some representatives with near-identical conformations. A plot of all pairwise fragment distances (cRMSD) in the dataset showed that the intra-cluster peak follows a log-normal distribution; fragments were deemed redundant and removed if their distance to an accepted representative within the intra-cluster peak was shorter than a cutoff value (0.307 Å, see Methods). An example of the distribution of fragments of a Structural Alphabet is shown in Figure 2; each fragment is indicated by an annotated circle. It is noticeable that the fragments representing helical conformations (S-W) are spaced much closer than those of extended conformations (A-I). Helical conformations can be well represented by a few similar states, while extended conformations are more versatile and require more representatives to capture the variability of strands in proteins.
Structural Alphabet performance assessment
Structural Alphabets capture the essence of conformational variability of the folded protein backbone in a small number of states. By extracting the states of highest density as representatives, we maximise the probability of matching this state in any given structure. Therefore, in terms of a Structural Alphabet, a protein structure can be thought of as a sequence of conformational states and the representation of a protein structure can be reduced to a string of alphabet characters (structure string). This translation can be thought of as an encoding, achieved by matching the best fitting alphabet fragment to each position of the protein structure. This can be done for each position independently (local fit) allowing for non-exact fragment overlaps, or by searching for the sequence of fragments that comprehensively best approximates the geometry of the template structure (global fit) with exact fragment overlaps. In both cases the cRMSD between fitted and template structure is a measure for the error associated with the encoding. The two fit procedures also exemplify two extreme cases of the structure prediction problem: local structure prediction and fold prediction. As the aim of this work is not to provide a new tool for structure prediction, these two fit assessments should not be interpreted as a measure of the predictive ability of the alphabets; further tests would be required to validate the alphabets in the context of fold prediction.
Performance assessment of Structural Alphabets in terms of the local and global fit quality.
For all fragment sets the quality of the global fit is comparable to the experimental uncertainty in protein structure determination: the median cRMSD is in the range 0.70-1.00 Å with an IQD in the range 0.10-0.25 Å. The local fit results show that a representative fragment can be found for any local conformation with an average fit error in the range 0.2-0.3 Å.
Other groups have devised optimal alphabets of size 27 and 28. While in those studies the identification of the optimal size was not done by fit performance, it is noteworthy that the range of optimal values is similar to the one identified by OPTICS.
The robustness of the AIC test was confirmed by bootstrapping. The results of 10000 bootstraps are reported as error bars in the plot in Additional File 1: Supplementary Figure S3.
Structural Alphabet M32K25
Angle values of the Structural Alphabet M32K25.
While all three angles (ϕ1, ϕ2, θ) are needed to fully describe the variety of conformations, the torsion angle θ provides most of the information. This is confirmed by the order of fragments in the Reachability Plot and illustrated by the selected fragments on the right panel of Figure 2: the order extended (A), loop (P), helical (U) and turn (Y) corresponds to a progressive decrease of the torsion angle θ.
As previously reported, one should not expect a strict correspondence between the states in a Structural Alphabet and those obtained from secondary structure assignment for the same local structures. The possibility to fit some of the fragments to different structural environments is important to achieve high accuracy in protein reconstruction.
Optimisation of combined alphabets
To investigate how the hierarchical cluster extraction scheme has biased the alphabet selection, we performed a separate extraction. A Genetic Algorithm optimisation was performed on the collection of all non-redundant fragments of all alphabets within the MinPts range 10-99. The optimisation was designed to select those 25 out of 106 fragments in the collection that yield the best local fit score on a set of 10 high quality structures.
The resulting alphabet (MxK25GA) was assessed by local fit, global fit and AIC performance (Table 2). The improvement in local fit (0.01 Å) and global fit (0.02 Å) is smaller than the variance in the corresponding fit processes and the AIC difference is also relatively small (8.3 kbit). The GA optimised set is indeed equivalent to the M32K25 alphabet, confirming the results of the hierarchical extraction procedure.
Intrinsic flexibility of reconstructed proteins
The fit assessment is a simple and robust method to measure the accuracy of a Structural Alphabet in approximating real proteins, but it does not guarantee that all native features are correctly reproduced in the reconstructed structure.
While the M32K25 alphabet has satisfied the necessary requirement of reconstructing static structures within the experimental error, it is important to validate its ability to capture the intrinsic flexibility of proteins. This can be done by comparing the flexibility of the native and reconstructed structure: the dynamical properties should be unaffected by the discretisation imposed by the Structural Alphabet. An elegant and fast method to perform such an analysis is provided by the GNM [44, 45], which has already been used in several structural studies and proven to be a reliable approximation for the dynamic properties of proteins.
For our purpose, the cross-correlation of atomic fluctuations was derived by applying the GNM and compared in the native and reconstructed structures (see Methods). Both the atomic RMSF profiles and the cross-correlation matrices demonstrated the suitability of the M32K25 encoding. The reconstructed structures preserved the required native features: both the RMSF correlation (0.95 ± 0.04) and the matrix overlap (0.93 ± 0.02) are close to 1 for a large set of high quality structures. The former correlation is also higher than the one reported with B-factors, confirming that the encoding error is within the experimental uncertainty. Additionally the two indices are independent of the global and local fit quality measures (correlations < 0.3).
This is also an indirect test of the ability of a structural alphabet encoding to preserve native contacts. Indeed in this harmonic model the conformational freedom of each atom is a function of the number of neighbour interactions [44, 45]. The preservation of native contacts is a necessary precondition to obtain similar flexibility profiles.
For comparison purposes, this test was performed also on the Structural Alphabets CGT2004 and MSM2000. Both alphabets performed as well as the M32K25 in preserving the native contacts and the intrinsic flexibility. The RMSF correlations were 0.95 ± 0.03 (CGT2004) and 0.92 ± 0.06 (MSM2000), while the matrix overlap was 0.93 ± 0.03 for both.
Although with a different aim and procedure, a previous study has also highlighted that pairwise contact specificity is greater in terms of structural letters than amino acids.
Accuracy and robustness in encoding structural ensembles
To be able to capture the dynamical behaviour of a protein, a Structural Alphabet should be stable to small fluctuations on one side and it should reproduce transitions between different states on the other. Thus a first requisite of an alphabet is that, when used to encode different structures in a conformational ensemble, the variability of the letters is correlated with the flexibility of the position that they describe. This means that a position encoded by many different letters should also show large fluctuations. On the other hand, if a fragment is relatively rigid, only a few letters should be sufficient to accurately represent it. We analysed the relationship between flexibility profiles and encoding variability by generating ensembles of 500 structures for a set of 24 proteins with the tCONCOORD method. This relies on a more accurate description of the molecule than the GNM approach used in the previous section. Moreover, it allows the breaking of native contacts, so that the generated ensembles also contain transitions between significantly different conformations. A generally good agreement has been found between tCONCOORD and Molecular Dynamics (MD) [49, 53]. However, the aim of the present test is to assess the performance of the alphabet in reproducing structural variability in ensembles, independently of the method used to generate them. Finally, since the increased computational cost of tCONCOORD prevented its application to the entire dataset used for the other alphabet assessments, a limited number of proteins had to be selected.
Correlation coefficient between fragment RMSF and encoding entropy.
Performance of Structural Alphabets in local and global fit reconstruction of tCONCOORD ensembles.
We have derived the structural alphabet M32K25 from the conformational attractors of protein structures. The intention of our approach was to devise a simple and generic description of local conformations that is readily amenable to visualisation and computational analysis. A solution was found in the conformational space spanned by three internal angles corresponding to the fragment's degrees of freedom.
The OPTICS algorithm provided two important functions for the analysis of the data space. Firstly, cluster representatives were extracted in the order of decreasing density, which is equivalent to decreasing importance for encoding. Secondly, the unique data ordering corresponds to a minimum distance path, providing a gradual inter-conversion among the states. Therefore, despite the mutual independence between fragments, the resulting Structural Alphabet includes important connective fragments to allow for a smooth protein reconstruction. Connectivity is partially implied by the overlap of neighbouring fragments in the original structures: the ϕ2 angle of a fragment in a given structure is identical to the ϕ1 angle of the next C-terminal fragment. The addition of a structural alphabet fragment to a growing reconstruction model adds one unconstrained atom (while three atoms overlap with the previous fragment) and two unconstrained angles ϕ2 and θ (while ϕ1 overlaps with ϕ2 of the previous fragment).
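The geometric content of this last statement, one new atom fixed by the two new angles, can be illustrated by placing a fourth Cα from the previous three. This is a sketch in internal coordinates and not the superposition-based global fit used for the assessment; it assumes a fixed virtual Cα-Cα bond length of 3.8 Å, planar angles for ϕ and the atan2 torsion convention, and the function names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def planar_angle(p, q, r):
    """Angle p-q-r in degrees."""
    u, v = _unit(_sub(p, q)), _unit(_sub(r, q))
    return math.degrees(math.acos(max(-1.0, min(1.0, _dot(u, v)))))

def torsion_angle(p0, p1, p2, p3):
    """Dihedral p0-p1-p2-p3 in degrees."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n23 = _cross(b2, b3)
    return math.degrees(math.atan2(math.sqrt(_dot(b2, b2)) * _dot(b1, n23),
                                   _dot(_cross(b1, b2), n23)))

def place_next_ca(a, b, c, phi_deg, theta_deg, bond=3.8):
    """Place atom d so that angle b-c-d = phi and torsion a-b-c-d = theta.

    a, b, c must not be collinear; bond = 3.8 is an assumed C-alpha spacing.
    """
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal of the a-b-c plane
    m = _cross(n, bc)                   # completes the local frame at c
    phi, theta = math.radians(phi_deg), math.radians(theta_deg)
    x = -bond * math.cos(phi)
    y = bond * math.sin(phi) * math.cos(theta)
    z = bond * math.sin(phi) * math.sin(theta)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))
```

Calling `place_next_ca` repeatedly with the (ϕ2, θ) pair of each successive letter would grow a Cα trace one atom at a time, mirroring the head-to-tail chaining of overlapping fragments.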
The OPTICS algorithm has been recently applied for sequence clustering, but to the knowledge of the authors, this is the first use in Structural Bioinformatics and it suggests a general suitability of density-based approaches for protein structure analysis.
A further objective of this study was to minimise the number of free parameters in design and analysis. Excluding the descriptive conformational parameters (ϕ1, ϕ2, θ), only the MinPts parameter influenced the set of representative fragments. We explored the range of suitable values for MinPts and selected the most informative alphabet. The high entropy of the M32K25 alphabet allows for protein reconstruction with an error comparable to that of structure resolution techniques. Theoretical studies on small libraries of local structures [6, 10] predict the global fit accuracy for an alphabet of this size to be 0.60 Å, in agreement with our results (0.70 ± 0.11 Å).
A comparison can be drawn directly only to other alphabets composed of fragments of the same type, i.e. of length four Cα atoms. Previous Structural Alphabets used a fragment representation in the form of a set of Cartesian coordinates (MSM2000) or of a four-vector descriptor composed of three non-consecutive Cα-Cα distances and the projection of the fourth atom onto the plane formed by the other three (CGT2004). Our angular representation has the advantage of being an internal metric that is independent of the molecular orientation, as for CGT2004, but with only three parameters. The other alphabets include collections of 27 fragments (CGT2004) and 28 fragments (MSM2000), while our best alphabet M32K25 includes 25 states. The performance as measured by the global fit (shown in Table 2) is 0.70 ± 0.11 Å for M32K25, 0.67 ± 0.15 Å for CGT2004 and 0.95 ± 0.41 Å for MSM2000, indicating that the M32K25 alphabet achieves similar or better performance with only 25 states.
However, the main difference between the M32K25 alphabet and other Structural Alphabets is its stringency in representing high-density states, as shown by the fragment locations in Figure 4: other approaches were equally successful in describing only some of the attractors. The efficiency of our extraction was achieved by including a minimal number (three) of the most informative (angular) degrees of freedom to describe each fragment and by analysing selected high-quality structures.
Associating physico-chemical properties with the M32K25 fragments automatically maps these properties onto the most important conformational states. The simplicity of this mapping, together with the option to visualise the map, should be useful for protein structure analysis and design.
The main advantage of a density-based selection is the ability to directly capture the most highly populated conformations; these also have a higher chance of being sampled during protein dynamics. Borrowing Anfinsen's 'thermodynamic hypothesis', one may speculate that the alphabet fragments correspond to low energy conformations, because proteins can be reconstructed using solely these fragments.
We investigated the suitability of the M32K25 alphabet and its associated mapping in the analysis of conformational ensembles of protein structures.
A precondition for this type of conformational analysis is that the alphabet encoding correctly preserves the intrinsic flexibility of a protein structure. This was demonstrated for the M32K25 alphabet by an assessment based on GNM calculations: the native contacts and the flexibility calculated with this harmonic model were completely preserved in the reconstructed structures. An extension of the GNM calculations to structures reconstructed with the CGT2004 and MSM2000 alphabets also suggests that this fidelity is a general property of structural alphabets, but is not directly correlated with the accuracy in fit reconstruction. A previous study has also demonstrated that the CGT2004 alphabet has more specificity than the amino acid code in capturing inter-residue contacts in protein complexes.
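The comparison of native contacts between an original and a reconstructed structure can be sketched as follows. The example builds the GNM connectivity (Kirchhoff) matrix from Cα coordinates and reports the fraction of native contacts retained. The 7.0 Å cutoff, the function names and the toy coordinates are assumptions made for illustration; the fluctuation calculation itself, which requires the pseudo-inverse of the matrix, is not shown.

```python
import math

def kirchhoff(ca, cutoff=7.0):
    """GNM connectivity matrix: -1 for C-alpha pairs within `cutoff` (assumed
    value, in angstrom), contact count on the diagonal, 0 elsewhere."""
    n = len(ca)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca[i], ca[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma

def contacts(gamma):
    """Set of contact pairs (i, j), i < j, read from the off-diagonal entries."""
    n = len(gamma)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if gamma[i][j] == -1}

def fraction_preserved(native_ca, rebuilt_ca, cutoff=7.0):
    """Fraction of native contacts that are also present in the rebuilt model."""
    native = contacts(kirchhoff(native_ca, cutoff))
    rebuilt = contacts(kirchhoff(rebuilt_ca, cutoff))
    return len(native & rebuilt) / len(native) if native else 1.0

# Toy coordinates (angstrom): an extended four-atom trace and a perturbed copy
# standing in for a reconstruction. Both are hypothetical.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
rebuilt = [(0.0, 0.2, 0.0), (3.7, 0.0, 0.0), (7.6, 0.1, 0.0), (11.3, 0.0, 0.2)]
print(fraction_preserved(native, rebuilt))
```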
The ability to capture the dynamical behaviour of a protein was tested by encoding the different structures in conformational ensembles generated by tCONCOORD. We measured the accuracy and robustness of the M32K25, CGT2004 and MSM2000 alphabets by the correlation between the Shannon Entropy of the encoded ensemble and its fragment flexibility in terms of RMSF.
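The entropy-based measure can be sketched as follows: each structure of an ensemble is encoded as a string of alphabet letters, the Shannon Entropy is computed for every fragment position, and the resulting profile is correlated with the per-fragment RMSF. The letters, the RMSF values and the use of the Pearson coefficient are assumptions made for this illustration and are not taken from the study.

```python
import math
from collections import Counter

def position_entropy(encoded, base=2):
    """Shannon Entropy per fragment position for equal-length encoded strings."""
    length = len(encoded[0])
    if any(len(s) != length for s in encoded):
        raise ValueError("all encoded structures must have the same length")
    n = len(encoded)
    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in encoded)
        profile.append(sum(-(c / n) * math.log(c / n, base)
                           for c in counts.values()))
    return profile

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ensemble of four encoded structures: only the third fragment
# position changes state across the ensemble.
ensemble = ["AAAB", "AABB", "AACB", "AADB"]
rmsf = [0.1, 0.1, 0.9, 0.1]  # hypothetical per-fragment RMSF values (angstrom)
entropy = position_entropy(ensemble)
print(entropy, pearson(entropy, rmsf))
```

A position that always maps to the same letter has zero entropy, while a position visiting many letters with equal frequency reaches the maximum for that number of states.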
Correlations are generally higher for local than for global fit encoding, because optimal global reconstruction is achieved at the expense of local accuracy. This suggests the importance of designing strategies to estimate the weight of this inaccuracy in the encoding. All three alphabets have comparable results for β-class proteins, but the performances are significantly better for M32K25 and MSM2000 in the other SCOP classes, where the former has the best performance in global fit and the latter in local fit (see Table 4 for details).
M32K25 is the most efficient in capturing the average flexibility per protein (see Figure 11).
The performance difference between the structural alphabets can be explained in terms of robustness. A set of representatives that efficiently samples the conformational space with low redundancy will be less affected by small fluctuations, while a set that contains groups of relatively similar fragments describing the same state will tend to overestimate the conformational difference. This is a possible explanation for the performance of CGT2004 for α-helix containing proteins: the alphabet includes a group of closely located fragments in the α region of the [ϕ1, ϕ2, θ] space (see Figure 4). Conversely, a set of well-spaced fragments does not imply an accurate encoding: the good performance of MSM2000 does not correspond to a good accuracy in the reconstruction (see Table 5).
The fragment composition of a structural alphabet depends on the strategy employed to select conformational representatives. This can affect the overall encoding stability. Both M32K25 and MSM2000 were derived by indirectly maximising the geometrical diversity, while CGT2004 was optimised for statistical representativity. The former strategy provides a clear advantage in terms of encoding stability at the expense of a minor (M32K25) to significant (MSM2000) decrease in the encoding accuracy, while the reverse is true for the CGT2004 set: the inclusion of statistically significant but geometrically similar helical conformations can decrease the stability but provides a very accurate description of linear, kinked and curved helices. This limitation in encoding stability could be overcome by considering the states as not strictly independent and consequently by either weighting their contributions according to their geometrical dissimilarity or by constructing a suitable substitution matrix. A successful example of the latter approach has already been used in the context of 3D structural alignment, where the performance of string-based structural comparison was increased by allowing non-exact matches by means of a structural alphabet substitution matrix.
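The idea of weighting states by their geometrical dissimilarity can be illustrated with a minimal sketch. Two encoded strings are compared by the mean distance between the representative angles of aligned letters instead of by counting exact mismatches. The three letters, their (ϕ1, ϕ2, θ) values and the Euclidean angular distance are hypothetical choices made for this example; they are not part of any published alphabet or substitution matrix.

```python
import math

# Hypothetical representative angles (phi1, phi2, theta) in radians.
# "A" and "B" are geometrically close; "C" lies in a distant region.
FRAGMENTS = {
    "A": (1.55, 1.55, 0.90),
    "B": (1.60, 1.58, 0.95),
    "C": (2.10, 2.15, -2.90),
}

def angular_difference(a, b):
    """Smallest absolute difference between two angles (2*pi periodic)."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def fragment_distance(f, g):
    """Euclidean distance in angle space between two fragment descriptors."""
    return math.sqrt(sum(angular_difference(x, y) ** 2 for x, y in zip(f, g)))

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def weighted_mismatch(s, t, fragments=FRAGMENTS):
    """Mean geometrical distance between aligned letters of two strings."""
    if len(s) != len(t):
        raise ValueError("encoded strings must have the same length")
    total = sum(fragment_distance(fragments[a], fragments[b])
                for a, b in zip(s, t))
    return total / len(s)

# Both substitutions count as one exact mismatch, but A->B is a much smaller
# geometrical change than A->C.
print(hamming("AAA", "ABA"), hamming("AAA", "ACA"))
print(weighted_mismatch("AAA", "ABA"), weighted_mismatch("AAA", "ACA"))
```

Under such a weighting, fluctuations between closely located fragments contribute little to the measured conformational difference, which is the effect sought when states are not treated as strictly independent.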
We do not aim to provide an alternative framework for structure prediction, but a novel tool for studies of protein structures and their dynamics. The addition of this newly designed assessment to the ones previously proposed in the literature is in line with the purpose of our alphabet. The density-based M32K25 alphabet has proven to be accurate for protein reconstruction and stable for ensemble encoding. The combination of these features suggests that M32K25 is specifically suitable for studies of protein dynamics.
The density-based Structural Alphabet provides a two-fold advantage: ensembles of protein structures can be encoded with high accuracy and sufficient robustness to correctly describe local flexibility.
Future developments may involve the employment of this Structural Alphabet to analyse and annotate structure ensembles from Molecular Simulations to easily map molecular motions onto the fragment space. The attractors can act as a guide to classify dynamics features and to compare protein families or different energetic states of the same protein. This can help in understanding, for example, binding specificity to multiple partners or conserved biological mechanisms.
We would like to thank ACC Coolen, F Fraternali, D Frenkel and J MacDonald for insightful discussions. We are grateful for funding by the European Science Foundation [exchange grant within the ESF Research Networking Programme "Frontiers of Functional Genomics" (FFG) to AP], the EU Marie Curie programme [220256 of call PEOPLE-2007-2-1-IEF to AP] and the Leverhulme Trust [F/07 040/AL to AF].
- Corey RB, Pauling L: Fundamental dimensions of polypeptide chains. Proceedings Royal Society London, B, Biological Sciences 1953, 141(902):10–20. 10.1098/rspb.1953.0011
- Jones TA, Thirup S: Using known substructures in protein model building and crystallography. EMBO Journal 1986, 5(4):819–22.
- Ramachandran GN, Ramakrishnan C, Sasisekharan V: Stereochemistry of polypeptide chain configurations. Journal of Molecular Biology 1963, 7: 95–9. 10.1016/S0022-2836(63)80023-6
- Walther D, Cohen FE: Conformational attractors on the Ramachandran map. Acta Crystallographica D Biological Crystallography 1999, 55(Pt 2):506–17. 10.1107/S0907444998013353
- Rooman MJ, Rodriguez J, Wodak SJ: Automatic definition of recurrent local structure motifs in proteins. Journal of Molecular Biology 1990, 213(2):327–36. 10.1016/S0022-2836(05)80194-9
- Park BH, Levitt M: The complexity and accuracy of discrete state models of protein structure. Journal of Molecular Biology 1995, 249(2):493–507. 10.1006/jmbi.1995.0311
- Bystroff C, Baker D: Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of Molecular Biology 1998, 281(3):565–77. 10.1006/jmbi.1998.1943
- Micheletti C, Seno F, Maritan A: Recurrent oligomers in proteins: an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies. Proteins 2000, 40(4):662–74. 10.1002/1097-0134(20000901)40:4<662::AID-PROT90>3.0.CO;2-F
- de Brevern AG, Etchebest C, Hazout S: Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000, 41(3):271–87. 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z
- Kolodny R, Koehl P, Guibas L, Levitt M: Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology 2002, 323(2):297–307. 10.1016/S0022-2836(02)00942-7
- Camproux AC, Gautier R, Tufféry P: A Hidden Markov Model derived structural alphabet for proteins. Journal of Molecular Biology 2004, 339(3):591–605. 10.1016/j.jmb.2004.04.005
- Tung CH, Huang JW, Yang JM: Kappa-alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biology 2007, 8(3):R31. 10.1186/gb-2007-8-3-r31
- Offmann B, Tyagi M, de Brevern AG: Local Protein Structures. Current Bioinformatics 2007, 2(3):165–202. 10.2174/157489307781662105
- Hunter CG, Subramaniam S: Protein fragment clustering and canonical local shapes. Proteins 2003, 50(4):580–8. 10.1002/prot.10309
- Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S: Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng 1999, 12(12):1063–73. 10.1093/protein/12.12.1063
- Kolodny R, Levitt M: Protein decoy assembly using short fragments under geometric constraints. Biopolymers 2003, 68(3):278–85. 10.1002/bip.10262
- Fourrier L, Benros C, de Brevern AG: Use of a structural alphabet for analysis of short loops connecting repetitive structures. BMC Bioinformatics 2004, 5: 58. 10.1186/1471-2105-5-58
- Etchebest C, Benros C, Hazout S, de Brevern AG: A structural alphabet for local protein structures: improved prediction methods. Proteins 2005, 59(4):810–27. 10.1002/prot.20458
- Friedberg I, Harder T, Kolodny R, Sitbon E, Li Z, Godzik A: Using an alignment of fragment strings for comparing protein structures. Bioinformatics 2007, 23(2):e219–24. 10.1093/bioinformatics/btl310
- Schenk G, Margraf T, Torda AE: Protein sequence and structure alignments within one framework. Algorithms for Molecular Biology 2008, 3: 4.
- Guyon F, Camproux AC, Hochez J, Tufféry P: SA-Search: a web tool for protein structure mining based on a Structural Alphabet. Nucleic Acids Research 2004, (32 Web Server):W545–8. 10.1093/nar/gkh467
- Yang JM, Tung CH: Protein structure database search and evolutionary classification. Nucleic Acids Research 2006, 34(13):3646–59. 10.1093/nar/gkl395
- Tung CH, Yang JM: fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies. Nucleic Acids Research 2007, (35 Web Server):W438–43. 10.1093/nar/gkm288
- Tyagi M, de Brevern AG, Srinivasan N, Offmann B: Protein structure mining using a structural alphabet. Proteins 2008, 71(2):920–37. 10.1002/prot.21776
- Pandini A, Bonati L, Fraternali F, Kleinjung J: MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database. Bioinformatics 2007, 23(4):515–6. 10.1093/bioinformatics/btl637
- Maupetit J, Gautier R, Tufféry P: SABBAC: online Structural Alphabet-based protein BackBone reconstruction from Alpha-Carbon trace. Nucleic Acids Research 2006, (34 Web Server):W147–51. 10.1093/nar/gkl289
- Le Q, Pollastri G, Koehl P: Structural Alphabets for Protein Structure Classification: A Comparison Study. Journal of Molecular Biology 2009, 387(2):431–50. 10.1016/j.jmb.2008.12.044
- Deschavanne P, Tufféry P: Enhanced protein fold recognition using a structural alphabet. Proteins 2009, 76: 129–37. 10.1002/prot.22324
- Tuffery P, Derreumaux P: Dependency between consecutive local conformations helps assemble protein structures from secondary structures using Go potential and greedy algorithm. Proteins 2005, 61(4):732–40. 10.1002/prot.20698
- Maupetit J, Derreumaux P, Tuffery P: PEP-FOLD: an online resource for de novo peptide structure prediction. Nucleic Acids Research 2009, (37 Web Server):W498–503. 10.1093/nar/gkp323
- Maupetit J, Derreumaux P, Tufféry P: A fast method for large-scale De Novo peptide and miniprotein structure prediction. Journal of Computational Chemistry 2010, 31(4):726–38.
- MacDonald JT, Maksimiak K, Sadowski MI, Taylor WR: De novo backbone scaffolds for protein design. Proteins: Structure, Function, and Bioinformatics 2009, 78(5):1311–1325. 10.1002/prot.22651
- Chandonia JM, Hon G, Walker NS, Conte LL, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Research 2004, (32 Database):D189–92. 10.1093/nar/gkh034
- Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Research 2000, 28: 254–6. 10.1093/nar/28.1.254
- Ankerst M, Breunig MM, Kriegel HP, Sander J: OPTICS: Ordering Points To Identify the Clustering Structure. In SIGMOD Proceedings ACM SIGMOD International Conference on Management of Data, June 1–3, 1999, Philadelphia, Pennsylvania, USA. Edited by: Delis A, Faloutsos C, Ghandeharizadeh S. ACM Press; 1999:49–60.
- Daszykowski M, Walczak B, Massart DL: Looking for natural patterns in analytical data. 2. Tracing local density with OPTICS. Journal of Chemical Information and Computer Sciences 2002, 42(3):500–7.
- Kriegel H, Brecheisen S, Januzaj E, Kröger P: Visual Mining of Cluster Hierarchies. Proceedings 3rd International Workshop on Visual Data Mining (VDM@ICDM2003) 2003, 151–165.
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2009. [http://www.R-project.org]
- Venables WN, Ripley BD: Modern applied statistics with S. 4th edition. Springer, New York; 2002.
- Theobald DL: Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallographica A Foundations of Crystallography 2005, 61(Pt 4):478–80. 10.1107/S0108767305015266
- Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19(6):716–723. 10.1109/TAC.1974.1100705
- Konishi S, Kitagawa G: Information Criteria and Statistical Modeling. Springer Publishing Company, Incorporated; 2007.
- Mitchell M: An Introduction to Genetic Algorithms. MIT Press; 1998.
- Bahar I, Atilgan AR, Erman B: Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding & Design 1997, 2(3):173–81. 10.1016/S1359-0278(97)00024-2
- Haliloglu T, Bahar I, Erman B: Gaussian dynamics of folded proteins. Physical Review Letters 1997, 79: 3090–3093. 10.1103/PhysRevLett.79.3090
- Bahar I, Rader AJ: Coarse-grained normal mode analysis in structural biology. Current Opinion in Structural Biology 2005, 15(5):586–92. 10.1016/j.sbi.2005.08.007
- Hess B: Convergence of sampling in protein simulations. Physical Review E 2002, 65(3 Pt 1):031910.
- de Groot BL, van Aalten DM, Scheek RM, Amadei A, Vriend G, Berendsen HJ: Prediction of protein conformational freedom from distance constraints. Proteins 1997, 29(2):240–251. 10.1002/(SICI)1097-0134(199710)29:2<240::AID-PROT11>3.0.CO;2-O
- Seeliger D, Haas J, de Groot BL: Geometry-based sampling of conformational transitions in proteins. Structure 2007, 15(11):1482–1492.
- Seeliger D, De Groot BL: tCONCOORD-GUI: visually supported conformational sampling of bioactive molecules. Journal of Computational Chemistry 2009, 30(7):1160–1166. 10.1002/jcc.21127
- Fernández A, Berry RS: Extent of hydrogen-bond protection in folded proteins: a constraint on packing architectures. Biophysical Journal 2002, 83(5):2475–2481. 10.1016/S0006-3495(02)75258-2
- Barrett CP, Hall BA, Noble ME: Dynamite: a simple way to gain insight into protein motions. Acta Crystallographica D Biological Crystallography 2004, 60(Pt 12 Pt 1):2280–2287. 10.1107/S0907444904019171
- Eyrisch S, Helms V: What induces pocket openings on protein surface patches involved in protein-protein interactions? Journal of Computer-Aided Molecular Design 2009, 23(2):73–86. 10.1007/s10822-008-9239-y
- Higurashi M, Ishida T, Kinoshita K: PiSite: a database of protein interaction sites using multiple binding states in the PDB. Nucleic Acids Research 2009, (37 Database): gkn659.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–637. 10.1002/bip.360221211
- van der Spoel D, Lindahl E, Hess B, Groenhof G, Mark AE, Berendsen HJ: GROMACS: fast, flexible, and free. Journal of Computational Chemistry 2005, 26(16):1701–1718. 10.1002/jcc.20291
- Kaminski GA, Friesner RA, Tirado-Rives J, Jorgensen WL: Evaluation and Reparametrization of the OPLS-AA Force Field for Proteins via Comparison with Accurate Quantum Chemical Calculations on Peptides†. The Journal of Physical Chemistry B 2001, 105(28):6474–6487. 10.1021/jp003919d
- Shannon CE: A Mathematical Theory of Communication. The Bell System Technical Journal 1948, 27: 379–423.
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23(4):566–79. 10.1002/prot.340230412
- Kundu S, Melton JS, Sorensen DC, Phillips GN: Dynamics of proteins in crystals: comparison of experiment with simple models. Biophysical Journal 2002, 83(2):723–32. 10.1016/S0006-3495(02)75203-X
- Martin J, Regad L, Etchebest C, Camproux AC: Taking advantage of local structure descriptors to analyze interresidue contacts in protein structures and protein complexes. Proteins 2008, 73(3):672–689. 10.1002/prot.22091
- Chen Y, Reilly KD, Sprague AP, Guan Z: SEQOPTICS: a protein sequence clustering system. BMC Bioinformatics 2006, 7(Suppl 4):S10. 10.1186/1471-2105-7-S4-S10
- Ligges U, Mächler M: Scatterplot3d - an R Package for Visualizing Multivariate Data. Journal of Statistical Software 2003, 8(11):1–20.
- Delano WL: The PyMOL Molecular Graphics System. Palo Alto, CA, USA; 2008. [http://www.pymol.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.