Three-dimensional protein model similarity analysis based on salient shape index
- Bo Yao^{1},
- Zhong Li^{1},
- Meng Ding^{1} and
- Minhong Chen^{1}
https://doi.org/10.1186/s12859-016-0983-z
© Yao et al. 2016
Received: 14 September 2015
Accepted: 9 March 2016
Published: 18 March 2016
Abstract
Background
Proteins play a special role in bioinformatics. The surface shape of a protein is one of its important characteristics: it defines the geometric and biochemical domain in which the protein interacts with other proteins. Similarity analysis of protein models has therefore become an important topic in protein analysis, because it helps reveal the structure and function of proteins.
Results
In this paper, a new protein similarity analysis method based on three-dimensional protein models is proposed. For each protein model, it computes the shape index (SI) and the related salient geometric feature (SGF), combines them into a feature matrix descriptor, and then analyzes protein model similarity using this feature matrix and an extended grey relation analysis.
Conclusions
We compare our method with the Multi-resolution Reeb Graph (MRG) skeleton method, the L1-medial skeleton method and the local-diameter descriptor method. Experimental results show that our protein similarity analysis method is accurate and reliable while remaining computationally efficient.
Background
Protein similarity analysis is an important topic in bioinformatics because it helps us understand the structure and function of proteins. Protein shape analysis plays an important role in medical research, computer-aided molecular design, and protein structure retrieval and prediction, among others. However, the analysis is highly challenging because of the complexity of a protein’s three-dimensional surface shape, which can deform enough during molecular interactions to change its topological structure [1].
Many researchers have contributed similarity analysis methods for comparing protein shapes. Via et al. [2] surveyed the current knowledge of protein surface similarity. Comparing shape features is a common approach. Sael et al. [3] proposed a popular protein surface similarity method based on a 3D Zernike descriptor. The compactness and rotational invariance of this descriptor enable fast comparison suitable for protein database searches. However, capturing protein surface similarity at high resolution requires more terms in the series expansion, which increases the computation, and the descriptor is not applicable to protein models with complex topology such as holes. Osada et al. [4] provided a shape distribution method based on a statistical histogram: it measures the vertex distribution over the whole model surface, forms a shape feature distribution histogram, and obtains a geometric similarity measure for three-dimensional models by comparing the resulting distributions. Horn [5] proposed an algorithm based on the extended Gaussian image, which maps each grid element of the model surface onto a unit sphere and thus obtains an extended Gaussian ball vector. Ohbuchi et al. [6] presented a statistical histogram algorithm in which the vertices of the three-dimensional model are sampled and a three-dimensional coordinate axis histogram is then used to generate three statistics describing the model’s geometric features. Vranic et al. [7] introduced a functional analysis method that assesses three-dimensional model similarity using the moduli of spherical harmonic coefficients.
Shape similarity methods based on topology have also been widely studied. For example, Hilaga et al. [8] proposed a multi-resolution Reeb graph (MRG) method, which uses the geodesic distance on the model surface as a Morse function to build the multi-resolution Reeb graph of a three-dimensional model. Bronstein et al. [9] provided a method based on heat kernel signatures (HKS). It draws analogies with feature-based image representations to construct shape descriptors that are invariant to a wide class of transformations yet remain discriminative. Foskey et al. [10] proposed a method based on the simplified medial axis, which is parameterized by a separation angle. The angle is formed by the vectors connecting a point on the medial axis to the closest points on the boundary. Du et al. [11] proposed a method based on the skeleton graph. It first calculates the skeleton nodes of a three-dimensional model and then constructs the corresponding skeleton graph between nodes. Both of these methods are computationally expensive, sensitive to holes in the three-dimensional models, and not robust to noise. Morris et al. [12] compared three-dimensional protein models by spherical harmonic expansion. Fang et al. [13] proposed a shape comparison method based on the local diameter (LD). Li et al. [14] introduced an improved MRG skeleton algorithm. In this process, sample points are used to build the local diameter (LD) for a model similarity comparison, but it is a computationally expensive approach. Qin et al. [15] presented a method based on an improved L1-medial protein skeleton, but they apply it only to the CPK protein model. Hence, their method is not generally applicable to all three-dimensional protein models.
Motivated by the salient theory of a three-dimensional model proposed by Hoffman and Singh [16], and by the shape index (SI) concept by Bradford et al. [17], we propose a new shape comparison method for three-dimensional protein models. We first compute the shape index which reflects the protein surface’s geometric feature including the concave and convex properties. Then we construct the salient geometric feature (SGF) through the region-related shape index information. The shape index and the salient geometric feature are then combined to form the feature matrix of each protein model. We finally use the extended grey relation analysis to analyze the feature matrix and obtain the final shape similarity results of protein models.
Methods
Shape index (SI)
The shape index at a vertex of the protein surface takes the standard form

$$SI = \frac{2}{\pi}\arctan\frac{k_1 + k_2}{k_1 - k_2},$$

where k _{1} and k _{2} denote the maximum and the minimum principal curvatures, respectively. From this formula, SI lies between -1 and 1. A shape index close to 1 indicates that the surface is convex at the given vertex, whereas a shape index close to -1 indicates that it is concave. When k _{1} = - k _{2}, the shape index is 0.
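As a minimal sketch of this definition, the function below evaluates the shape index from two principal curvatures. It assumes the common form SI = (2/π)·arctan((k _{1} + k _{2})/(k _{1} − k _{2})), which matches the properties stated above (range [-1, 1], zero when k _{1} = - k _{2}). The function name and the handling of the degenerate cases k _{1} = k _{2} are illustrative choices, not taken from the paper, and which sign corresponds to "convex" depends on the orientation convention used for the surface normals.

```python
import math


def shape_index(k1: float, k2: float) -> float:
    """Shape index from the two principal curvatures of a surface vertex.

    Assumed form: SI = (2 / pi) * atan((k1 + k2) / (k1 - k2)), with k1 >= k2.
    """
    # Enforce k1 = maximum and k2 = minimum principal curvature.
    if k1 < k2:
        k1, k2 = k2, k1
    if k1 == k2:
        # Umbilic point: the quotient is undefined, so use the limiting value.
        # Returning 0.0 for a perfectly flat point is an assumption of this sketch.
        return 0.0 if k1 == 0.0 else math.copysign(1.0, k1)
    return (2.0 / math.pi) * math.atan((k1 + k2) / (k1 - k2))
```

For example, `shape_index(1.0, -1.0)` gives 0 (a symmetric saddle), while equal positive curvatures give 1 and equal negative curvatures give -1.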
Salient Geometric Feature (SGF)
where F is a cluster of vertices i; w _{1} and w _{2} are weights, both set to 0.5; Area(i) is the area of the patch associated with vertex i relative to the cluster size; N(SI) is the number of local minima or maxima of the shape index in the cluster; Var(SI) is the variance of the shape index in the cluster; and SI(i) is the shape index at vertex i.
For the 1HLB protein, the salient geometric feature model is shown in Fig. 2e. Red regions are the more salient parts and blue regions are the less salient parts. We also use the 1HLB model to illustrate the different scales of SI and SGF on the protein surface. For vertex A in Fig. 2(a-1), the SI value is 0.5368 and the SGF value is 0.6425. For vertex B in Fig. 2(a-2), the SI value is 0.5279 and the SGF value is 0.2981. The two SI values are close, so SI alone hardly reflects the difference in local features. The SGF values, in contrast, differ considerably because the SGF value depends on the local geometric region: when that region varies saliently, the SGF value is high. From this model we conclude that point A has a salient geometric feature, since its SGF value is high.
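The quantities listed above (Area(i), SI(i), N(SI), Var(SI) and the weights w _{1}, w _{2}) can be combined in code as sketched below. The exact combination is an assumption: the sketch follows the saliency grade of Gal and Cohen-Or [19], with the shape index substituted for curvature, that is w _{1}·Σ Area(i)·SI(i)^3 + w _{2}·N(SI)·Var(SI). The function name, the `neighbors` adjacency input and the strict-inequality test for local extrema are hypothetical illustration choices.

```python
from statistics import pvariance


def salient_geometric_feature(areas, si, neighbors, w1=0.5, w2=0.5):
    """Saliency of one cluster of vertices (assumed Gal/Cohen-Or style form).

    areas[i]     -- patch area of vertex i relative to the cluster size
    si[i]        -- shape index of vertex i
    neighbors[i] -- indices of the vertices adjacent to i inside the cluster
    """
    # Area-weighted term: large, strongly curved patches contribute most.
    weighted_term = sum(a * s ** 3 for a, s in zip(areas, si))

    # N(SI): vertices whose shape index is a strict local minimum or maximum.
    n_extrema = 0
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        nbr_values = [si[j] for j in nbrs]
        if si[i] > max(nbr_values) or si[i] < min(nbr_values):
            n_extrema += 1

    # Var(SI): population variance of the shape index over the cluster.
    variance = pvariance(si)
    return w1 * weighted_term + w2 * n_extrema * variance
```

Under this form, a cluster whose shape index barely changes scores low even if its individual SI values are moderate, which is the behaviour described for vertex B above.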
Feature descriptor structure
The shape indexes and the salient geometric features of all vertices of a three-dimensional model each constitute an n-dimensional vector, where n is the number of model vertices. Because the number of surface vertices differs from protein to protein, these vectors cannot be compared and analyzed directly. In our approach, we cluster all feature values into the same number of groups with a clustering algorithm [20]. The number of clusters K in K-means clustering involves a trade-off: a high value of K improves the accuracy of the shape analysis but increases the cost of shape comparison, whereas a low value of K keeps the running time low but cannot guarantee the accuracy of the shape analysis. Here, we set K = 48. We then calculate the mean t _{ i } of each group and obtain a feature descriptor vector of shape indexes T = (t _{1}, t _{2}, …, t _{ K }). For the shape index feature clustering of a protein model, we randomly select K data points from the n values as the initial cluster centers of the K-means clustering algorithm and iterate until the change in the cluster centers meets a convergence condition. This yields the final K clusters of data points.
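The clustering step can be sketched as a one-dimensional K-means over the per-vertex feature values, returning the K group means as the fixed-length descriptor. This is a plain Lloyd-style iteration written for illustration, not the specific algorithm of [20]. The function name, the `max_iter`, `tol` and `seed` parameters, and the final sorting of the means (used here so that descriptors of different models have a comparable ordering) are assumptions.

```python
import random


def kmeans_feature_vector(values, k=48, max_iter=100, tol=1e-9, seed=0):
    """Reduce n scalar feature values to k cluster means (requires n >= k)."""
    rng = random.Random(seed)
    # Randomly pick k of the data points as the initial cluster centers.
    centers = rng.sample(list(values), k)
    for _ in range(max_iter):
        # Assignment step: each value joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group;
        # an empty group keeps its previous center.
        new_centers = [
            sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)
        ]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # convergence condition on the center movement
            break
    # Sorting is an assumption of this sketch, not stated in the paper.
    return sorted(centers)
```

Calling `kmeans_feature_vector(si_values)` on the shape indexes of a model with at least 48 vertices would give a 48-entry vector in the role of T.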
To better reflect the shape features of a three-dimensional protein model, feature matrix representations have become popular in model shape analysis [21]. Here, we combine the two vectors, T for the shape index and P for the salient geometric feature, into a matrix that carries richer feature information and serves as the feature descriptor. We denote Q _{2×K} = [T; P], the matrix whose rows are T and P, and use Q ^{ T } Q as a K × K (K = 48) feature matrix, on which we perform the similarity analysis of protein models. The feature matrix of the 1HLB protein model is shown in Fig. 3c.
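A small sketch of the descriptor construction follows. It assumes Q is the 2 × K matrix whose rows are the SI vector T and the SGF vector P, so that Q ^{ T } Q is K × K with entry (i, j) equal to t _{ i } t _{ j } + p _{ i } p _{ j }. The function name is hypothetical.

```python
def feature_matrix(t, p):
    """K x K descriptor Q^T Q for Q = [T; P] (rows T and P, both of length K)."""
    if len(t) != len(p):
        raise ValueError("T and P must have the same length K")
    k = len(t)
    # (Q^T Q)[i][j] sums over the two rows of Q: t_i * t_j + p_i * p_j.
    return [[t[i] * t[j] + p[i] * p[j] for j in range(k)] for i in range(k)]
```

The result is symmetric by construction, so a comparison between two models only needs to consider the K × K entries of two such matrices.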
Similarity measurement
For the shape analysis of a protein sequence or its surface model, common methods use measures such as the Euclidean distance, the Manhattan distance, the angle cosine and the correlation coefficient [22]. One problem with these measures is that their values are not guaranteed to lie in the standard interval [0, 1]. If we normalize the values into [0, 1], the result depends on the maximum and minimum measure values over all protein models, and this dependence affects the accuracy of the shape analysis.
Because we construct a matrix-based feature descriptor, the vector-based methods above are not directly applicable to our similarity analysis of three-dimensional protein models. At the same time, we want a scalar similarity value that lies directly between 0 and 1, where higher values represent greater similarity between two protein models. We therefore extend the vector-based grey relation analysis [23] to a matrix-based grey relation analysis, which keeps the value in [0, 1] and preserves the other properties of grey relation analysis. We then apply it to measure the similarity of three-dimensional protein models.
It follows from this calculation that the grey relation degree lies between 0 and 1, and a degree close to 1 indicates high similarity between two models.
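To illustrate how a grey relation degree in [0, 1] can be obtained from two feature matrices, the sketch below applies Deng's classical grey relational grade entry by entry. This is a stand-in for illustration only: it is not the matrix-based extension used in the paper, whose exact formula is not reproduced here. The function name and the resolution coefficient `rho = 0.5` (a common default) are assumptions.

```python
def grey_relation_degree(a, b, rho=0.5):
    """Deng-style grey relational grade between two equally sized matrices."""
    # Absolute differences between corresponding entries.
    deltas = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0.0:
        return 1.0  # identical matrices are maximally similar
    # Grey relational coefficient per entry, always in (0, 1].
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    # The degree is the mean coefficient over all entries.
    return sum(coeffs) / len(coeffs)
```

Each coefficient is at most 1 because the smallest difference appears in the numerator, so the mean also stays in (0, 1] without any normalization over the whole set of models.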
Results
The algorithm presented in this paper is implemented on a desktop computer with an Intel(R) Core(TM) i3-3110M CPU @ 2.5 GHz and 4 GB RAM running MS Windows 7. The software environment of the experiments is MathWorks’ MATLAB R2010a.
Comparison of the grey relation degrees obtained by Qin et al.’s algorithm [15] and by our algorithm
Similarity values by different algorithms
Query model | Algorithm | 1BPD | 2BPG | 1WRP | 3WRP |
---|---|---|---|---|---|
1BPD | Algorithm in [15] | - | 0.9713 | 0.5625 | 0.5816 |
1BPD | Our algorithm | - | 0.9856 | 0.3853 | 0.3605 |
2BPG | Algorithm in [15] | 0.9713 | - | 0.6497 | 0.6823 |
2BPG | Our algorithm | 0.9856 | - | 0.3844 | 0.3596 |
1WRP | Algorithm in [15] | 0.5625 | 0.6497 | - | 0.9597 |
1WRP | Our algorithm | 0.3853 | 0.3844 | - | 0.9723 |
3WRP | Algorithm in [15] | 0.5816 | 0.6823 | 0.9597 | - |
3WRP | Our algorithm | 0.3605 | 0.3596 | 0.9723 | - |
Running time comparison between Qin et al.’s algorithm [15] and our algorithm (the time unit is ms)
Grey relation distance comparison by different similarity measure methods
Similarity measure methods | 1BPD and 2BPG | 1WRP and 3WRP |
---|---|---|
Feature vectors of SI (T) | 0.9346 | 0.9517 |
Feature vectors of SGF (P) | 0.9419 | 0.9486 |
Combined feature vectors of SI and SGF ((T + P)/2) | 0.9357 | 0.9541 |
Matrix-based feature descriptor (Q ^{ T } Q) | 0.9856 | 0.9723 |
Similarity results for six protein models from Fig. 5
Models | 2B3I | 1NIN | 1RCD | 1IER | 1DBW | 1B00 |
---|---|---|---|---|---|---|
2B3I | 1.0000 | 0.9500 | 0.8812 | 0.4460 | 0.9350 | 0.9244 |
1NIN | 0.9500 | 1.0000 | 0.9269 | 0.4918 | 0.9241 | 0.9176 |
1RCD | 0.8812 | 0.9269 | 1.0000 | 0.9630 | 0.9109 | 0.7513 |
1IER | 0.4460 | 0.4918 | 0.9630 | 1.0000 | 0.4749 | 0.4785 |
1DBW | 0.9350 | 0.9241 | 0.9109 | 0.4749 | 1.0000 | 0.9930 |
1B00 | 0.9244 | 0.9176 | 0.7513 | 0.4785 | 0.9930 | 1.0000 |
Running time comparison between the skeleton extraction algorithm [14] and our algorithm
Model | 2B3I A | 2B3I B | 1NIN A | 1NIN B | 1RCD A | 1RCD B | 1IER A | 1IER B | 1DBW A | 1DBW B | 1B00 A | 1B00 B |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2B3I | 2480 | 104 | 2481 | 104 | 2465 | 105 | 2481 | 105 | 2480 | 104 | 2464 | 105 |
1NIN | 7472 | 215 | 8472 | 202 | 7753 | 208 | 6131 | 210 | 11809 | 198 | 10733 | 205 |
1B00 | 7457 | 208 | 10733 | 205 | 7738 | 207 | 10743 | 200 | 10732 | 194 | 10749 | 202 |
1RCD | 7488 | 210 | 7753 | 208 | 7737 | 212 | 7759 | 198 | 7753 | 207 | 7738 | 207 |
1DBW | 7456 | 204 | 11809 | 198 | 7753 | 207 | 6130 | 194 | 11825 | 195 | 10732 | 194 |
1IER | 7472 | 199 | 6131 | 210 | 7753 | 198 | 6147 | 194 | 6130 | 194 | 10732 | 200 |
Similarity results for ten protein models from Fig. 6
Models | 1HLM | 1HLB | 2LHB | 1MBA | 5MBN | 1LH2 | 1CHR | 2MNR | 5P21 | 1GNP |
---|---|---|---|---|---|---|---|---|---|---|
1HLM | 1.000 | 0.973 | 0.562 | 0.676 | 0.437 | 0.752 | 0.653 | 0.549 | 0.789 | 0.752 |
1HLB | 0.973 | 1.000 | 0.608 | 0.763 | 0.574 | 0.647 | 0.564 | 0.653 | 0.742 | 0.698 |
2LHB | 0.562 | 0.608 | 1.000 | 0.969 | 0.579 | 0.695 | 0.745 | 0.659 | 0.732 | 0.634 |
1MBA | 0.676 | 0.763 | 0.969 | 1.000 | 0.614 | 0.713 | 0.697 | 0.705 | 0.720 | 0.744 |
5MBN | 0.437 | 0.574 | 0.579 | 0.614 | 1.000 | 0.987 | 0.353 | 0.438 | 0.493 | 0.464 |
1LH2 | 0.752 | 0.647 | 0.695 | 0.713 | 0.987 | 1.000 | 0.434 | 0.468 | 0.464 | 0.413 |
1CHR | 0.653 | 0.564 | 0.745 | 0.697 | 0.353 | 0.434 | 1.000 | 0.993 | 0.563 | 0.615 |
2MNR | 0.549 | 0.653 | 0.659 | 0.705 | 0.438 | 0.468 | 0.993 | 1.000 | 0.595 | 0.685 |
5P21 | 0.789 | 0.742 | 0.732 | 0.720 | 0.493 | 0.464 | 0.563 | 0.595 | 1.000 | 0.981 |
1GNP | 0.752 | 0.698 | 0.634 | 0.744 | 0.464 | 0.413 | 0.615 | 0.685 | 0.981 | 1.000 |
Running time comparison including searching the dataset between the skeleton extraction algorithm [14] and our algorithm
Method | 1HLM | 2LHB | 5MBN | 1CHR | 5P21 |
---|---|---|---|---|---|
A | 20 min 28 s | 25 min 05 s | 22 min 49 s | 18 min 58 s | 21 min 36 s |
B | 5 min 43 s | 7 min 19 s | 6 min 57 s | 5 min 34 s | 6 min 26 s |
Discussion
Our method is based on surface analysis. Its advantage is speed, because it does not need to perform skeleton extraction; its disadvantage is that it cannot be applied to other protein representations such as CPK models. The skeleton-based method can be applied to both the protein surface (triangular mesh model) and the protein CPK model (point cloud representation), but its skeleton extraction is time-consuming.
Conclusion and future work
In this paper, we propose a similarity analysis algorithm for three-dimensional protein models based on the salient shape index. We first calculate the shape index (SI) and the salient geometric feature (SGF) of the protein models. We then construct the matrix-based feature descriptor from the SI and SGF information. Finally, we compare the similarity of protein models using the matrix-based grey relation degree. Experimental results show the effectiveness of our protein similarity analysis method.
Currently, we consider only the shape index (the convex and concave properties of the protein surface) and the salient geometric feature when analyzing the similarity of protein models. We do not take into account the physical properties of the protein molecules. In fact, properties such as pH, polarity and hydrophilicity also affect the structure and function of protein molecules. How to incorporate these factors into protein shape similarity analysis will be part of our future research. For the clustering in our similarity analysis, we find that the cluster sizes are normally unequal and that the clustering is sometimes dominated by a few large clusters. The cluster sizes might be highly relevant in describing the global shape of the protein model, which is another interesting direction for future work.
Declarations
Acknowledgements
We thank Prof. Stefan Heller of the School of Medicine, Stanford University, USA, for helpful discussions and suggestions. This research was supported by the Scientific Research Foundation of the Ministry of Education of China under Grant No. [2009]1590, the Zhejiang Provincial Natural Science Foundation of China under Grant No. LY14A010032, the Zhejiang Province Key Science and Technology Innovation Team Project (2013TD18) and the Project of 521 Excellent Talent of Zhejiang Sci-Tech University.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Xu SC, Li Z, Zhang SP, et al. Primary structure similarity analysis of proteins sequences by a new graphical representation. SAR QSAR Environ Res. 2014;25(10):791–803.
- Via A, Ferre F, Brannetti B, et al. Protein surface similarities: a survey of methods to describe and compare protein surfaces. CMLS Cell Mol Life Sci. 2000;57:1970–7.
- Sael L, La D, Li B, et al. Rapid comparison of properties on protein surface. Proteins Struct Funct Bioinforma. 2008;73:1–10.
- Osada R, Funkhouser T, Chazelle B, et al. Matching 3D models with shape distributions. Geneva: Proceeding of Shape Modeling International; 2001. p. 07–11.
- Horn BK. Extended Gaussian image. Proc IEEE. 1984;1671–1686.
- Ohbuchi R, Nakazawa M, Takei T. Retrieving 3D shapes based on their appearance. Berkeley, California, USA: Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval; 2003.
- Vranic DV, Saupe D. 3D model retrieval with spherical harmonics and moments. London: The 23rd DAGM Symposium on Pattern Recognition, Springer-Verlag; 2001. p. 392–7.
- Hilaga M, Shinagawa Y, Komura T, et al. Topology matching for fully automatic similarity estimation of 3D shapes. Los Angeles, California, USA: Computer Graphics, Proceedings of Annual Conference Series, ACM SIGGRAPH; 2001. p. 203–12.
- Bronstein MM, Kokkinos I. Scale-invariant heat kernel signatures for non-rigid shape recognition. CVPR. 2010;1704–1711.
- Foskey M, Lin MC, Manocha D. Efficient computation of a simplified medial axis. Seattle, Washington, USA: Proceedings of the Eighth ACM Symposium on Solid Modeling and Applications; 2003. p. 96–107.
- Du HX, Qin H. Medial axis extraction and shape manipulation of solid objects using parabolic PDEs. Genoa: ACM Symposium on Solid Modeling and Applications; 2004. p. 25–34.
- Morris RJ, Najmanovich RJ, Kahraman A, et al. Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics. 2005;21:2347–55.
- Fang Y, Liu YS, Ramani K. Three dimensional shape comparison of flexible proteins using the local-diameter descriptor. BMC Struct Biol. 2009;29(9):01–15.
- Li Z, Qin SW, Yu ZY, et al. Skeleton-based shape analysis of protein models. J Mol Graph Model. 2014;53:72–81.
- Qin SW, Li Z, Jin Y, et al. Shape similarity comparison of CPK models based on improved L1-medial skeleton. SAR QSAR Environ Res. 2014;25(9):747–59.
- Hoffman D, Singh MD. Salience of visual parts. Dep Cogn Sci. 1997;63(1):29–78.
- Bradford R, Westhead R. Improved prediction of protein–protein binding sites using a support vector machines approach. Bioinformatics. 2008;1487–1494.
- Dyn N, Hormann K, Kim S-J, Levin D. Optimizing 3d triangulations using discrete curvature analysis. In: Lyche T, Schumaker LL. (eds.) Mathematical Methods in CAGD, Oslo 2000, pp. 135–146 (2001).
- Gal R, Cohen-Or D. Salient Geometric Features for Partial Shape Matching and Similarity. ACM Trans Graph. 2006;25(1):134–8.
- Kanungo T, Mount DM, Netanyahu NS, et al. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell. 2002;24(7):881–92.
- Hu R, Fan L, Liu L. Co-segmentation of 3D shapes via subspace clustering. Eurographics Symp Geom Process. 2012;31(5):1703–13.
- Choi S, Tappert S. A survey of binary similarity and distance measures. Cybern Inform. 2010;8(1):43–8.
- Zhang S. Matrix absolute grey relational degree of B-Mode and its significance. J Grey Syst. 2012;9:135–41.
- Protein Data Bank [http://www.rcsb.org/pdb/home/home.do].
- Pelta DA, Gonzalez JR, Vega MM. A simple and fast heuristic for protein structure comparison. BMC Bioinforma. 2008;9(1):156–61.
- Krasnogor N, Pelta DA. Measuring the similarity of protein structure by means of the universal similarity metric. BMC Bioinforma. 2004;20(7):1015–21.