Reduced representation of protein structure: implications on efficiency and scope of detection of structural similarity
© Zhang et al. 2010
Received: 27 October 2009
Accepted: 26 March 2010
Published: 26 March 2010
Skip to main content
© Zhang et al. 2010
Received: 27 October 2009
Accepted: 26 March 2010
Published: 26 March 2010
Computational comparison of two protein structures is the starting point of many methods that build on existing knowledge, such as structure modeling (including modeling of protein complexes and conformational changes), molecular replacement, or annotation by structural similarity. In a commonly used strategy, significant effort is invested in matching two sets of atoms. In a complementary approach, a global descriptor is assigned to the overall structure, thus losing track of the substructures within.
Using a small set of geometric features, we define a reduced representation of protein structure, together with an optimizing function for matching two representations, to provide a pre-filtering stage in a database search. We show that, in a straightforward implementation, the representation performs well in terms of resolution in the space of protein structures, and its ability to make new predictions.
Perhaps unexpectedly, a substantial discriminating power already exists at the level of main features of protein structure, such as directions of secondary structural elements, possibly constrained by their sequential order. This can be used toward efficient comparison of protein (sub)structures, allowing for various degrees of conformational flexibility within the compared pair, which in turn can be used for modeling by homology of protein structure and dynamics.
The comparison of two protein structures is most efficiently handled as a hierarchical problem, more or less closely following the protocol laid out over a decade ago by Singh and Brutlag .
Its first step, fold recognition on the level of secondary structural element (SSE) correspondence, has been tackled repeatedly in the literature [2–14], building on the founding body of work related to aligning protein structures at the atomic-resolution level [15–19]. Most of the latter adopted an iterative resolution approach by starting from variously defined fragments of protein structure [20–23]. However, prominent methods capable of doing fast and conformationally tolerant search , such as SSM  and Fatcat  still take several tens of minutes of CPU time to perform a database search, doing a thorough but perhaps unnecessary job in order to eliminate bad candidates for a match. The development of complementary, ultra-fast methods for rigid structural comparison of proteins seems to have migrated to the realm of computer science, and typically relies on index or hash based database retrieval [24–29]. The algorithm from this family possibly the closest in spirit, if not in the scope, to the one we will propose below is TableauSearch. With its high level of abstraction of protein structure, it indeed proves capable of searching databases approaching 105 entries as a matter of minutes. The method records and then discretizes the relative angle between any two pairs of SSEs in a structure, and stores it as a tableau  to be used in the database search. The entries, however, in TableauSearch database are rigid domains, and the algorithm thus dispenses with conformationally flexible searches right at the outset. Looking for a substructural match in this approach is not a completely straightforward affair either .
From mathematical biology side, several global descriptors of protein structure have been proposed  (for statistical takes on the problem see [32, 33]) with an eye on organizing our knowledge of protein (or fold) universe . An example is provided by SGM [24, 35], a method that relies on a set of measures from differential geometry to describe overall structure of a protein domain, and which we will use below as a representative of its class of methods.
Even though the idea of a pre-filter based on gross features of the protein structure is implicit in several methods that have undergone a steady development [2, 12], it has seldom been discussed and documented as a computational problem on its own. We propose, therefore, to take a look specifically at the question of the smallest possible set of features needed to describe protein structure, and propose its intuitively motivated reduced representation, equipped with a scoring function capable of detecting both rigid matches at domain level, and conformational changes involving relative motion of structural domains.
In the current literature there exists a broad selection of methods for pairwise structural comparison of proteins [36–38]. Typically centered on backbone atom matching [18, 21, 39, 40], they take several seconds or even several minutes to compare two structures and decide that the match is not possible. Why is it, then, that a human observer can establish, after a single glance that two protein structures in cartoon representation are (dis)similar? Certainly we are not mentally matching the backbone atoms, nor the angles through which the secondary structural elements (SSEs) are joined. Rather, the human observer will try to orient the two structures so that the SSEs point roughly in the same direction, and then perhaps check if the two diagrams are showing the same sequential ordering of the SSEs.
that is, the distance of the value F from the average over all rotations R, measured in units of standard deviation (the denominator in Eq. 4). The rotations resulting in near-zero (or even positive) z F can then be dismissed as insignificant given the geometry of the problem. The estimate of the z-score requires evaluation of the first two moments of F(R; X, Y) over the set of all possible rotations R, which can be done explicitly in the case of the average, and numerically in the case of the average square [Additional File 1].
where ≺ indicates the precedence on the primary sequence of the protein.
Depending on the algorithm and parametrization, the alignment procedure may assign various gap penalties for the SSEs that do not map onto the other structure. T (D) ignores such SSEs. By retaining only the matched pairs which optimize T (D), we obtain a good orientation match between the pairs of SSEs, that at the same time complies with the sequential ordering in both structures.
A conformationally flexible match in this picture consists of two local minima in F(R; X, Y) for two different R's, thus incorporating a model in which structural domains maintain their internal structural organization during conformational changes in a protein. In practice, it has to be verified that the two minima in R are different in a way that is statistically significant; the details may depend on the implementation.
(with ΔL iM(i) the absolute value of the length difference, and tol ΔL an adjustable parameter) may improve the performance. The size of the improvement depends on the value used for δ and on the nature of the test the method is subjected to (i.e. do we count as a hit the cases of overall structural similarity allowing wide range of length mismatch in SSE, or not; how many levels of CATH classification we are trying to reproduce, etc.) [Additional File 1].
where s i is the SSE type indicator, as in the previous section, and here is the Kronecker δ, not to be confused with the Gaussian width parameter used in the rest of the paper. This leads to the worse case scenario (when all SSEs in both structure are of the same type) of complexity of , where N x and N y are the number of the SSEs in the two structures. This number may be substantially smaller (down to 0) in the case of mixed α/β structures. It can also be alleviated by grouping the directions nearly parallel in space, in which case the complexity becomes where n x and n y are the numbers of distinct SSE directions in structures X and Y. (This compactification is later "unfolded" to do the pairwise alignment of SSEs.) In practice, the second and the fourth sum can be truncated at j = i + m and k = l + m, respectively, where m is a small number (2 or 3), without significant loss in performance. This makes the complexity O(N x N y ), or O(n x n y ) in the case of grouped directions.
In a pairwise comparison of two structures, the minima occurring at different R's are stored if the z score is deemed statistically significant; later they are sorted in the order of the ascending z-score, and the best one is reported. A certain top number of suboptimal minima from this sorted list is checked for the complementarity of the match, and the complementary pair assigning the highest total score is reported as a flexible match. This part can be generalized to n, rather than only two complementary matches.
where z 1 and z 2 are the z-scores, Eq. 4, for F(R; X, Y) evaluated at two rotations, R 1 and R 2 which match two different structural domains, and T (D max), given in Eq. 6, is the quantity optimized in the pairwise alignment. The two maps corresponding to the two rotations result in two different matrices D ij (R 1) and D ij (R 2); T is evaluated based on the larger of the two values of matrix elements .
It is quite a strong claim that the three features we have selected (direction, type and sequential ordering of SSEs) provide enough information to distinguish two protein structures (and, conversely, to detect them as similar). While proving it may not be feasible, we may demonstrate that, statistically, it works quite reliably. Below we investigate the distinguishing and classifying power of the representation, and point out its two possible applications: annotation of novel structures and modeling of protein dynamics by homology. Parametrization, data sets, and related statistics can be found in Additional File 1.
Consistent with the qualitative description underlying the CATH classification, in this experiment we obtained better sensitivity/specificity tradeoff by relaxing the criterion for the directional match by setting δ = 0.5. Together with the results for the exact match problem above, this suggests a possibility of simulated annealing approach with stepwise decreasing δ, to obtain a distribution of hits with varying strictness in structural similarity to the query. (For a remote structural similarity, no significant value of F may be obtainable for very small δ. Conversely, large F for small δ indicates a very small variation in direction of SSEs, typically related to strong overall similarity.)
The time taken for this computational experiment, of approximately 170 CPU minutes for δ = 0.5 (and 90 minutes for δ = 0.3) on a 3 GHZ processor, compares favorably with some 40 to 1000 CPU hours needed for the high resolution methods to complete the same task (see Fig. 3 and experiments in [14, 19, 46]). It should be kept in mind, of course, that the approach we have laid out functions only as a pre-filter and classifier, and does not produce the actual structural alignment on the atomic level. Rather, we argue, a small subset of the structural features is responsible for the classification we are trying to reproduce, and this can be used to economize with computational resources.
Among the existing tools for structure comparison, the most proficient ones do include some way of pre-filtering (e.g. the difference in sum of rows and columns in the contact matrix used by DALI, ), discussed to a larger or smaller extent. Since most of such methods have been designed to pass the test like the domain classification discussed here, this is a reasonable place to point out the novel aspects of our reasoning, and their performance implications. Hence we conclude this section with a brief comparison with VAST and SSM, both well established methods with their own servers [2, 12, 47], and both using SSEs as the basic elements of protein structure.
In the course of development of SSM, the authors have devoted a whole publication  to the discussion of its preliminary step. In its pre-filtering stage, SSM constructs a graph for each structure. The vertices of the graph carry the information about each SSE (length, type), while the edges are associated with the attribute of each pair of vertices/SSEs they are connecting (distance, angles). The algorithm then looks for a common subgraph using a set of heuristic rules, divided into 5 levels, to decide on the "sameness" between vertices and edges from the two graphs. The necessity to introduce the cutoffs (in order to deduce the existence of common subgraph elements) enforces discretization of the scoring function, in contrast to the approach proposed in this work. An attempt to run the CATH experiment using the five discrete levels of similarity in SSM is shown in Fig. 3, dotted line in the inset.
Despite the comparatively small area under the ROC curve obtained this way, the information manipulated by SSM is actually quite seizable. In an attempt to extract more resolution from the SSM pre-filtering, we took the outcome from all five levels simultaneously, and treated the optimization of the ROC area as a machine learning problem [Additional File 1]. The outcome is shown in Fig. 3, inset, blue line. It approaches the ROC achieved by the method we are proposing, but at the cost of introducing fifty parameters difficult to grasp intuitively.
VAST also relies on a graph matching tactics in its pre-filtering stage, but in a substantially different way ([2, 47]; also see ). In contrast to SSM, where one graph represents each structure, in VAST graph is constructed for a map between two structures, as follows.
In VAST, SSE elements are represented by direction vectors that, aside from the information about the type, explicitly retain the information about the SSE length (i.e. while our representation consists of unit vectors, VAST uses vectors of the length proportional to the length of the SSE). A discrete set of rotations acting on one of the structures/representations is attempted, each rotation satisfying the following: (i) the rotation maps one of the vectors from the rotated structure exactly onto one of the vectors from the fixed structure (defined to be the z direction), and (ii) it brings another vector from the rotated structure into the plane defined by z and some vector from the fixed structure. This contrasts with the continuous space of rotations our algorithm is exploring. In both (i) and (ii) the vectors mapped between the two structures have to be of the same type.
For each of the rotations from this set, a structure termed "digraph" is constructed: the vertices correspond to pairs of vectors, one from each structure, that fall within certain cutoffs regarding the angle between the two and the allowed distance between their endpoints. Digraph's edges exist if the two pairings involved respect the sequential ordering on the input structures, and if they carry the weight inversely proportional to the difference in the angles between two vectors corresponding to the same structure. The cutoffs and the weight are parametrized and the parameters optimized for a typical search. In our comparison runs we used the default set of parameters. The digraph can be traversed quickly to find the best alignment for a given rotation, and consequently the best scoring alignment over all rotations chosen.
This algorithm is similar to certain extent to the one proposed in this work, in the limit of small Gaussian width δ (which does not allow for much exploration of rotational space away from the initial guess), and with the length penalty (Eq. 7) included. The digraph traversal is an alternative to the dynamic programming approach we are taking.
The crucial difference here, however, is that VAST operates on the level of pairs of SSEs, somewhat analogous to the possibility of using F 2 as a scoring function in our approach. The requirements this imposes are stricter, explaining the somewhat higher sensitivity of VAST in the low false positive region (Fig. 3; it might be worth re-iterating here that in considering Fig. 3 one should keep in mind that the method in question is not VAST proper, but, rather, its pre-filtering stage only.) The rotations, furthermore, that are tested are not optimized to match the rest of the structure (only the pair of pairs defining the rotation) and thus are not readily applicable to further refinement of the matching transformation, an extension that our approach is in principle amenable to. Also, insisting on the length match (something that we do not do necessarily), in order to get rid of false positive matches, is a strategy which might backfire in the attempts to extend the search to a more distant structural similarity.
As is, however, this method of pre-filtering is very fast (the initial guess for a rotation is not further optimized, a major time consuming step in our approach), and slightly better in the small FP region than our best take on the CATH test, (Fig. 3; inset; full blue line). Nevertheless, our ROC does show signs of greater robustness and higher resolution capability, as indicated by the larger area under the curve.
To indicate an impossibility first, methods which assign a set of global descriptors to the structure (such as SGM, dash-dotted line in Fig. 4) cannot tackle this problem. SGM is used as an example here: this is equally true for all methods relying on assigning a hash function to a predefined structural domain.
In the following, we choose not to reinterpret the methods used, and refer the reader to the original publications. Instead, we would like to highlight the specific algorithmic choices, which, we believe, give these methods advantage over other approaches, in particular the one proposed in this work. Notably here, MAMMOTH  and SABERTOOTH  were never designed with large database scanning primarily in mind (even though they perform well in that respect), but rather with the goal of comparing a model structure to a template (MAMMOTH) or obtaining a precise structural alignment between two structures (SABERTOOTH).
The basic unit of structure being compared in MAMMOTH  is a heptapeptide. Compared to our current implementation, which enforces an SSE as an elementary unit, irrespective of its length (see "Discussion"), the heptapeptide approach enables better resolution in the high specificity region.
Sabertooth  uses the idea that the correct alignment may be, among other things, recognized by the similar environment ("connectivity pattern") seen by the aligned structural motifs. This approach also results in strong performance in the small FP regime, and something that our approach hardly generalizes to.
TMalign  uses SSE alignment, before trying SSE mapping which would be a nonsense from the sequence point of view. This is an opportunity our implementation misses (see the the first paragraph in "Implementation" section). Otherwise, TMalign manipulates very similar input information to the approach proposed here, and is therefore encouraging to see it trace out a practically identical ROC curve, both in this test (Fig. 4), and in the test shown in Fig. 3.
3Dhit  shows the most robust ROC curve in this test. It dissects the peptide into 13-residue fragments, which first conveys the same advantage over our approach as discussed for MAMMOTH above. The algorithm then chops up the fragment into clusters. In mapping clusters the identity of amino acid (types, presumably) is enforced , an algorithmic move that speeds up the search substantially, but we would like to stay away from as long as possible in development of our algorithm. Further on, however, the fact is exploited that for similar structures a transformation that matches two subsets will match larger regions of protein structure, the fact that our approach also builds on, and that can be used to increase resolution of the match toward a full backbone match.
SSM , a veteran high resolution method, does the best in the small FP region, but at the cost of CPU time two orders of magnitude longer than required to do the rough comparison we are proposing here. The high resolution is achieved by imposing a series of geometric requirements on matched pairs of SSE from the two structures (see the sections "Comparison with pre-filters used in other methods," and "Discussion," as well as the original publication, of course).
The overall message seems to be that judging by the speed of our prefiltering and quite competitive sensitivity/specificity tradeoff it achieves it provides a good base for protein comparison engine.
The larger and the smaller domain appear rearranged with respect to each other in the two structures, as can be seen by using a linear coordinate transformation that overlaps the larger domain (Fig. 6). One structure, this suggests, could then be used as a scaffolding to model conformational change of the other. Could this conformational change appear as part of physiological functioning of either of the two proteins? Is it at all physically possible to get from one conformation to another at room temperature? Finding examples with structural similarity in two structural domains, rearranged in space, may indicate a possibility of conformational change in one or both proteins. The reasonableness of such conjecture is, of course, subject to further testing through targeted molecular dynamics, or some related approach. The type of structural comparison we are advocating, however, should produce more examples for this type of study.
We have suggested in the Methods section that after dispensing with translation and the length of the SSEs, the structure is effectively represented by a set of points on a unit sphere. With these points we associated information about underlying SSE type and sequential order. By settling on the minimal representation of the protein structure, we set out to analyze its sufficiency for structure description and retrieval.
As noted in subsection "Self-scoring in a large database of structures", Fig. 2 and the related discussion, the directions themselves, except when taken very narrowly (δ = 0.1 in our formulation), may be matched by quite diverse protein structures. To get rid of false positive matches that arise that way, we have suggested imposing the requirement that the matched SSEs follow the same sequential order in the two structures. This, however is not the only possible way around the problem: as discussed in Mizuguchi and Go , and later elaborated by Krissinel and Henrick  in development of SSM (discussed above), the directions of SSEs can be supplemented by various other pieces of information: the length of SSEs, the distance and torsion angle between all possible pairs of SSEs in a structure and/or the angles between their directions and the direction of the line passing through their geometric centers. The advantage of using this type of information, rather than requiring the common sequential order of the SSEs, then is in the ability to look for pairs of proteins with different connectivity between SSEs, that still result in the overall comparable structures. On the flip-side, the set of requirements might end up being too restrictive in the search of similar (but non-identical) structures, as we have illustrated in the inset of Fig. 4.
Contrary to the model of similarity adopted here, where similar structures are assumed to share to certain extent the underlying SSE arrangement, it is conceivable that two proteins might share a common function as long as they offer a common geometry of the surface to their common (or similar) interacting partners . In that case one might be interested in a method for detection and retrieval of proteins sharing the same shape, irrespective of the underlying secondary structure. It is a possibility not explored here. Methods for retrieval by global shape similarity have been discussed in literature (see  and references therein) and extremely short retrieval times (~10-4 s on a 3 GHz CPU) reported ). Some questions remain outside the scope of these methods, such as detection of a common substructure or structural motif.
Sticking to a more conservative model of shared protein structure, the problem which ultimately needs to be resolved is the correspondence: which SSEs (and later, on a finer detail level, which backbone atoms) on two structures correspond to one another. Function F(R; X, Y) enables us to initially sidestep this problem, in principle at least, because the fast fall-off of the closeness measure D ij (R) (Eq. 2) makes possible the double sum over all elements without the danger of obtaining as the optimal a solution where no actual match exists, but the sum over many distant neighbors artificially increases the score. By starting the protein structure comparison by minimization of F(R; X, Y), we are effectively adopting, on the SSE level, the match-first-align-later approach, popularized by Gerstein and Levitt  (see also  for a further development of the idea).
Ideally, the scoring function F would quantify, in a single expression, the geometric match under the constraint of sequential ordering of the pairs, a problem which we leave open. On the high-resolution end of the spectrum of related ideas lies the URMS-RMS hybrid algorithm [8, 23, 57]. There, a set of directions in space is also considered, however not along the SSEs, but along the lines connecting neighboring C α atoms within a heptapeptide. Being a high-resolution method, it comes with the computational burden comparable to the other backbone-matching approaches (and, of course, with the final reward of the actual detailed matching of two backbone traces). The match scoring function used in that work is different from the one suggested here, but it runs into a similar difficulty of estimating the statistical significance for a match of different structures. A solution offered there is comparison with an empirically derived background distribution of match probabilities using existing, unrelated protein structures.
Instead, we opted for a solution which separates the geometric match from the alignment. The fuzziest point in the algorithm we have outlined, therefore, is that the averages in Eq. 4 should properly be evaluated not over the set of all rotations R, but only over those rotations which allow, through the matrix D ij (R), the alignment of subsequences of the two proteins of substantial length. Numerical evaluation of these proper averages would effectively grind the search to a halt, so in our prototype evaluation we keep the averages over all R as an approximation. The approximation works well for the rigid search, where it is used to dispense with bad solutions, rather than score good ones. In the case of the flexible search we resort to the total assigned score as a scoring function, coupled with the requirement that both maps have a high rotational z-score on their own.
In terms of the implementation, the room for improvement is certainly ample. The relatively large number of false positives is attributable, at least in part, to parallel beta sheets and helix bundles, which can be amended by more careful grouping of the representation vectors. Also, in the implementation used here, each β strand is represented by a single vector - a rather crude approximation for most β strands, which are often bent.
Perhaps stating the obvious, the ultimate degree of success of an approach will depend on the choices made in the implementation, as much so as on the underlying idea. In this work, the available implementations (steepest descent and Needleman-Wunsch) decided the way in which the three features we selected to describe a protein were used. Even though a faster, or more robust, implementation could perhaps be achieved by a different choice of optimization or alignment algorithm, these are replaceable components, and the main points of improvement are in the representation itself, in the distance (or match scoring) function, and in its statistical evaluation.
In an attempt to build a pre-filtering tool for a search through a database of protein structures, we proposed (i) reducing the representation of protein structure to an ordered set of unit vectors carrying the information about the direction and the type of the secondary structure element they represent, Eq. 1, (ii) measuring the distance between two elements of the same type in terms of a quantity falling off exponentially with the increasing angle between the two, Eq. 2, (iii) measuring the distance between the two representations as sum over pairwise distances between their elements, Eq. 3, and (iv) ordering the near matches by their total aligned score, Eq. 6.
The representation is easily extendable to other descriptors of protein geometry by generalization of the type, currently restricted to α-helix or β-strand, and interesting statistics may result from allowing the Gaussian width δ to be type dependent.
We have shown that an implementation which minimizes the distance defined in Eq. 3 through a steepest descent calculation, and subsequently enforces the sequential order between the matched SSEs using standard sequence alignment approach, performs well in terms of the resolution in the structure space. Notable, also, is the speed that can be achieved in structure comparison without tying up the information in the form of a single index - it is precisely this feature which enables us to generalize the search to flexible and multidomain cases, and makes this idea uniquely versatile among structural comparison algorithms. The main concern addressed in this work has been whether this minimalist description of protein structure contains enough information to uniquely describe protein (sub)structures, and structural classes. The conclusion is that the information is certainly sufficient for a unique self match of each protein structure studied (Fig. 2), and represents the large chunk of the signal detected by the high resolution methods.
(Fig. 3). When extended to detection of distant structural similarity, the approach starts to suffer from "false positive" matches (note that the information about the translational degrees of freedom is absent), but it stays within the acceptable limits of accuracy set by high resolution methods, and its speed certainly allows for an improvement by extending the number of elements and types in the description.
The straightforward motivation for this description of protein structure makes clear what the pitfalls and directions of improvement are, but even the existing implementation indicates that the approach may prove valuable in making novel predictions, in terms of both rigid and flexible structural comparison. The server to accompany this paper, as well as the code used in the analysis presented in the text is available a http://epsf.bmad.bii.a-star.edu.sg.
secondary structural element
receiver operating characteristic
The authors thank E. Krissinel, P. Roegen, T. Madej, M. Porto and D. Plewczynski for tips and explanations regarding their implementations of SMM, SGM, VAST, SABERTOOTH and 3Dhit respectively, and S. Hayward and Guoying Qi for providing the geometric descriptors for the user-created Dyndom database. Special thanks to R. Kolodny for letting us use the data originally collected for , and R. Robinson and I. Reš for encouragement and critical reading of the manuscript. Funding by Biomedical Research Council of A*STAR Singapore is gratefully acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.