Improved protein surface comparison and application to low-resolution protein structure data
© Kihara and Sael; licensee BioMed Central Ltd. 2010
Published: 14 December 2010
Skip to main content
© Kihara and Sael; licensee BioMed Central Ltd. 2010
Published: 14 December 2010
Recent advancements of experimental techniques for determining protein tertiary structures raise significant challenges for protein bioinformatics. With the number of known structures of unknown function expanding at a rapid pace, an urgent task is to provide reliable clues to their biological function on a large scale. Conventional approaches for structure comparison are not suitable for a real-time database search due to their slow speed. Moreover, a new challenge has arisen from recent techniques such as electron microscopy (EM), which provide low-resolution structure data. Previously, we have introduced a method for protein surface shape representation using the 3D Zernike descriptors (3DZDs). The 3DZD enables fast structure database searches, taking advantage of its rotation invariance and compact representation. The search results of protein surface represented with the 3DZD has showngood agreement with the existing structure classifications, but some discrepancies were also observed.
The three new surface representations of backbone atoms, originally devised all-atom-surface representation, and the combination of all-atom surface with the backbone representation are examined. All representations are encoded with the 3DZD. Also, we have investigated the applicability of the 3DZD for searching protein EM density maps of varying resolutions. The surface representations are evaluated on structure retrieval using two existing classifications, SCOP and the CE-based classification.
Overall, the 3DZDs representing backbone atoms show better retrieval performance than the original all-atom surface representation. The performance further improved when the two representations are combined. Moreover, we observed that the 3DZD is also powerful in comparing low-resolution structures obtained by electron microscopy.
The number of protein structures deposited to the Protein Data Bank (PDB) is increasing at a rapid pace. Particularly, more and more protein structures of unknown function are being solved by structural genomics projects [1, 2]. The flood of structures raises significant challenges to develop computational methods that will provide critical information for hypothesizing function and evolution of the proteins from their structural information . The study of protein structures has now entered the informatics era, just as biological sequence analyses has done in the previous decade when efficient reuse of knowledge from existing databases became crucial. Speed is an essence in such analyses, since biologists would need to run various database searches using different tools, and such analyses would be conveniently performed if they are done in real-time. Most of the current structure comparison methods , such as those that compare main-chain orientations or corresponding residue positions[5, 6], are designed for pair wise comparison and are not suitable for a fast database search.
There are other new tasks which have been brought up by recent experimental techniques, such as electron microscopy (EM), which provides low-resolution structure data. Here, challenges include how to use a low-resolution EM density map for fitting high resolution structures [7–9] or guiding protein structure prediction , and how to efficiently and accurately compare global and local structures [11, 12]. Thus, development of a new generation of structure analysis tools, which allow a fast screening of large structure databases and can handle low resolution structure data, is needed. With this in mind, we represent protein structures as surface shapes by the 3D Zernike descriptors (3DZD) [13–17]. There have been previous works which employ a protein surface representation , such as volumetric representation , convex hull , and the spin image . Compared to those works, the 3DZD has the following properties that make it ideal for use in protein shape analyses. First, it is rotation and translation invariant. Thus, prior time-consuming structure alignment is not necessary for structure comparison since their position in space does not change their 3DZDs. This property enables direct comparison of the EM density maps where atomic coordinates are not available. Another advantage is its compact representation; a 3D shape is effectively represented in only 121 (when the order is 20) numbers called the invariants. The 3DZD can also represent physicochemical properties such as the electrostatic potential and the hydrophobicity on the protein surfaces . Recently the 3DZD has been further applied for protein-protein docking prediction , local surface comparison , pocket shape matching for structure-based function prediction , and ligand molecule screening [13, 24].
In the previous work, we showed that the all atom surface representation by the 3DZD agrees sufficiently with the main-chain based structure comparison by CE , but some differences were also observed due to the variations in the representation . The new contribution of this paper is two fold. First, we introduce three main-chain atom based surface representations which are found to better agree with the two existing structure classifications, the CE and the SCOP database , as compared with the previous all-atom surface representation. Second, we show that the proposed representation also allows for a fast and accurate database search for EM density maps.
To examine structure retrieval performance of the proposed methods, we use a data set of 2337 representative protein structures, which are arbitrarily selected from 185 fold groups defined in a protein classification database by CE (ftp://ftp.sdsc.edu/pub/sdsc/biology/CE/db/ata_3.8_jul-2004.txt.gz). CE is one of the frequently used protein structure comparison programs that compares Cα positions of proteins using a dynamic programming algorithm. These representative structures have a resolution of 3.0 Å or better, have no more than 10 missing residues in the solved structure, have all heavy atom positions solved, and are longer than 100 residues. In addition, the structure similarity of each pair is less than a Z-score of 3.8 by CE.
This dataset also provides the SCOP classification code of proteins, which classifies the proteins into 8 class groups, 149 folds groups, 187 superfamily groups, and 279 family groups. We use both CE-based and SCOP classifications in our study since they have the following complementary features: The CE classification is automatic without human intervention and considers main-chain orientation, while SCOP is curated manually at a certain degree to take evolution into account.
At this juncture, it is important to note that there is no golden standard in classification of proteins. The structure similarity measured for different representations can be largely different for distantly related proteins since they capture different aspects of the structures . In our previous paper , we showed that CE and SCOP do not fully agree and also that DALI , which compares distance maps of proteins, and CE have poorer agreement than CE and the 3DZD. Each method has its own strength and thus an appropriate method should be selected depending on the purpose of each study. We have further shown examples of proteins whose surface shape similarity infers functional similarity, which are not detected by the conventional sequence or main-chain structure comparison methods . In this study, we demonstrate that the new main-chain surface representations encoded by the 3DZD have a better agreement to CE and SCOP as compared to the original all-atom surface representation introduced in the previous study .
For a protein structure, four different surface representations are computed: one that uses all heavy atoms (AASurf), the backbone conformation with all heavy atoms in the main-chain, i.e. Cα, C, N, and O atoms (CACNO), the backbone with Cα, C, and N atoms (CACN), and the backbone Cα atoms only (CA). For the set of extracted atoms, the surface is generated using the MSMS program . MSMS rolls a probe sphere on the atoms and defines the surface as the path of the center of the probe. The radius of the probe sphere is set to default value of 1.5Å for AASurf, CACNO, and CACN and the radius is set to 2.0Å for CA to generate a smoother representation. The generated surface is then mapped on a 3D grid. A grid cell (voxel) is assigned a value of 1 if it is on the surface and 0 otherwise. Because the 3DZD is defined within a unit sphere, the protein surfaces represented by voxels are scaled into a unit sphere. Therefore, the size information of the protein is lost. The resulting voxels are considered as an input 3D function, f(x), which is used as input for computing the 3DZD as described in the next section.
The 3DZD is a series expansion of an input 3D function, which allows for a compact representation of the 3D object (i.e. the input 3D function) [17, 28]. The mathematical foundation of the 3DZD was laid out by Canterakis (1999) and was applied on 3D object retrieval by Novotni and Klein (2003). For readers’ convenience, a brief mathematical derivation of the 3DZD is shown below. For detailed derivations and discussions, refer to the aforementioned two papers Canterakis  and Novotni and Klein .
The database retrieval performance of the four surface representations is evaluated with precision-recall curves. The precision-recall curves are often confused with the receiver operator characteristic curves. Although these two curves are related, the precision-recall curve is considered to be a better measure when the dataset is skewed . The number of proteins in a group in the dataset used varies from 3 to 180 and thus a precision-recall is used here. For each protein in the dataset, the rest of the proteins described with the 3DZD are sorted by the Euclidean distance (L2 norm) to the 3DZD of the query protein. Then, the precision and the recall values are computed at each distance threshold value. The precision is defined as the fraction of the retrieved proteins of the same group with the query among all proteins retrieved above the distance threshold. The recall is defined as the fraction of the retrieved proteins of the same group with the query among all the proteins in the same group. Finally, we calculate the average precision and recall for each distance threshold. The precision-recall curves of different representations are evaluated by the area under curve (AUC).
As employed in the previous work , we also apply pre-filtering of the proteins by their sequence length. For a query, a protein in the database is filtered out if it is longer than 135% or shorter than 65% to the length of the query protein. This is done because of the loss of the size of the proteins during the process of computing the 3DZD, since the proteins are scaled to fit into a unit sphere.
where y and x are the two proteins compared and i is the index of 3DZD invariants of AASurf, S, and CACNO, B. w yS and w yB are weights for AASurf and CACNO of the query protein y, and m1 and m2 are the number of invariants in the 3DZD of AASurf and CACNO, respectively. In this study, the 3DZD of AASurf and CACNO is set to the same size, i.e. m1=m2=121. Eqn. 4 is asymmetric since the weights w yS and w yB depend on the query protein, y.
where N is the number of atoms in protein x j , cog, is center of gravity of protein x j , and R is the approximate radius of atoms in which 1.5Å is used .
where RS(x) is the radius of a sphere that has the same volume as the protein all-atom surface representation computed by the MSMS program. A larger value indicates that the protein is more spherical.
Figure 2C, D, E, and F shows the retrieval performance based on the four hierarchical SCOP classifications: the class, the fold, the superfamily, and the family. The length-based pre-filtering was not applied. The improvement of the retrieval by the backbone representations over AASurf becomes more obvious as a more detailed level hierarchy is used, depicting that the backbone structure is more relevant in the family classification. In all the graphs in Figure 2 except for Figure 2C, CACN has the best retrieval performance followed by CACNO, CA, and AASurf. The reason of the better performance by CACNO, CACN, and CA over AASurf is that the former representations have more ruggedness, which can reflect the trace of the main-chain orientation better than AASurf. Using all atoms by AASurf makes many proteins look close to spheres or ellipsoids, and it is not advantageous in general to distinguish proteins of different folds.
Although CACN performs the best, the difference in performance by CACNO and CA is marginal with a average AUC difference of 0.003. Since the three backbone representations show almost identical performance, we will only show results of CACNO along with AASurf for further analyses.
Figure 3A indicates that CACNO tends to perform better than AASurf for spherical proteins (proteins with positive sphericity value). This is because CACNO gives more distinctive structural features which are characteristics for each group. For spherical proteins, AASurf shapes alone do not clearly separate proteins of the same group from spherical proteins from the other groups. On the other hand, AASurf performs better for proteins with a long tail structure (Fig. 3B). The 3DZD also tends to overestimate the similarity of the tail-like structures for proteins of different fold groups. A larger volume and structural information from the main body of the structure seems to help AASurf in recognizing the correct structure from the same group with the query.
The observed difference by AASurf and CACNO for individual cases inspired us to combine the two representations to improve the retrieval performance. Those results are shown in the next section. For this combined representation, spherical proteins (those which have the sphericity value of larger than 0.15, the gray regions in Fig. 3A) and proteins with tail-like structures (those which have more than 3 residues in a tail, the gray regions in Fig. 3B) use different weights when combining the distance.
Precision-recall AUC improvement using weighted distance.
Among the different weight values tested, the weight combination 0.3 and 0.7 performed the best. In this combination, the weights for AASurf (w yS ) and CACNO (w yB ) are set to 0.7 and 0.3, respectively, for query proteins with a tail-like structure. On the other hand, if a query protein is spherical and has no tail-like structure, w yS is set to 0.3 and w yB is set to 0.7. Otherwise w yS and w yB are set to 0.4 and 0.6, respectively. Overall, the 0.3/0.7 weight combination results in an AUC increase of 0.015 and 0.138 when compared with the retrieval results using CACNO and AASurf, respectively. Out of 185 fold groups in the dataset, 116 groups show improvement by Surf-Back. There are 53 groups where CACNO shows a better performance and 16 groups where AASurf performs better than Surf-Back.
Observing the agreement between the EM isosurface and structures coded by the 3DZD, we investigate further if the 3DZD can be used for searching similar EM density maps of protein structures (Table 2). To perform this experiment, we prepared datasets of the 3DZDs of EM density maps as follows: First, EM density maps of the 2337 representative protein structures are computed with pdb2mrc for two different resolutions, 10 and 15Å. Then, for an EM density map of a protein, two 3DZDs are computed, one using voxels with a high density value range (e.g. 9-11 for the resolution of 10) and another one using voxels with a low density value range (e.g. 5-8 for the resolution of 10; Table 2). Thus, a total of four datasets of 3DZDs of simulated EM isosurfaces are prepared. Voxels with a high density value in an EM density map locate at the core of a protein and thus its isosurface resembles the CACNO representation of the protein structure. On the other hand, the isosurface of a lower density value in the EM map looks more similar to the AASurf representation of the protein structure.
Precision-recall AUC value of database retrieval of EM isosurfaces.
Density Range a)
3DZD order 15
3DZD order 20
1.46 x 10-4
PDB database scan a)
1 day 23 hours 36 minutes
In this work we examined the applicability of the 3DZD for two important tasks in structural bioinformatics. The first task is the real-time protein structure database search. In contrast to our previous work  in which the 3DZD is used to represent an all-atom surface of protein structures (called in AASurf representation in this work), we have now examined backbone amino acid-based surface representations (e.g. CACNO). The backbone-based representations showed significantly better performance than the AASurf when the retrieval performance was evaluated on the agreement to the CE and SCOP classifications. Combinations of AASurf and CACNO showed further improvement over CACNO.
The second task explores the applicability of the 3DZD on representation and comparison of low-resolution structural data by evaluating the database retrieval performance of simulated EM density maps. Intuitively, isosurfaces of the density maps and molecular surfaces represented by 3DZDs look similar to each other. Indeed, we showed that the 3DZD is well suited for database retrieval of EM maps, achieving comparable accuracy to regular protein structure database retrieval in identifying proteins of the same fold to the query protein. This is the most comprehensive study so far published in identifying the fold class of proteins by comparing EM density map of proteins. It is noteworthy that the 3DZD can identify proteins of the same fold with EM maps even at 15Å resolution. Using EM maps of a higher resolution, which have now become more and more available, the retrieval accuracy will only get better. Here we compared the EM maps of single proteins as the proof of concept that 3DZD is suitable for handling EM maps. We expect this work will stimulate further investigations for applying 3DZD or similar descriptors for handling EM maps of multiple protein complexes and other low-resolution structure data, such as electron tomography.
Altogether, we are in a new exciting informatics era of structural biology, and we believe surface representation with the 3DZD will provide new directions and ideas that lead us to new findings through surfing ever expanding molecular structural information.
three dimensional Zernike descriptor
structural classification of proteins (name of a database)
combinatorial extension (name of a protein structure comparison program)
all heavy atom surface
a surface representation using C-alpha, C, N, O atoms in the main-chain
surface representation using C-alpha, C, N, atoms in the main-chain
molecular surface calculation program by Michel Sanner
area under the curve
receiver operator characteristics
Distance matrices alignment algorithm
Protein Data Bank
This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM075004). DK also acknowledges grants from NSF (DMS0800568, EF0850009, IIS0915801). The authors are grateful to Gregg Thomas for proofreading the manuscript.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 11, 2010: Proceedings of the 21st International Conference on Genome Informatics (GIW2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S11.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.