Protein structure similarity from principle component correlation analysis

Background Owing to rapid expansion of protein structure databases in recent years, methods of structure comparison are becoming increasingly effective and important in revealing novel information on functional properties of proteins and their roles in the grand scheme of evolutionary biology. Currently, the structural similarity between two proteins is measured by the root-mean-square-deviation (RMSD) in their best-superimposed atomic coordinates. RMSD is the golden rule of measuring structural similarity when the structures are nearly identical; it, however, fails to detect the higher order topological similarities in proteins evolved into different shapes. We propose new algorithms for extracting geometrical invariants of proteins that can be effectively used to identify homologous protein structures or topologies in order to quantify both close and remote structural similarities. Results We measure structural similarity between proteins by correlating the principle components of their secondary structure interaction matrix. In our approach, the Principle Component Correlation (PCC) analysis, a symmetric interaction matrix for a protein structure is constructed with relationship parameters between secondary elements that can take the form of distance, orientation, or other relevant structural invariants. When using a distance-based construction in the presence or absence of encoded N to C terminal sense, there are strong correlations between the principle components of interaction matrices of structurally or topologically similar proteins. Conclusion The PCC method is extensively tested for protein structures that belong to the same topological class but are significantly different by RMSD measure. The PCC analysis can also differentiate proteins having similar shapes but different topological arrangements. Additionally, we demonstrate that when using two independently defined interaction matrices, comparison of their maximum eigenvalues can be highly effective in clustering structurally or topologically similar proteins. We believe that the PCC analysis of interaction matrix is highly flexible in adopting various structural parameters for protein structure comparison.


Background
Conformational resemblance between proteins, whether remote or close, is often used to infer functional properties of proteins and to reveal distant evolutionary relationships between two proteins exhibiting no similarity in their amino acid sequences. Traditionally, high-resolution structure determination succeeds the biological and biochemical studies of proteins to further provide mechanistic details of the function of proteins. The biological function of these proteins have usually been suggested prior to their structural studies by in vitro binding assays, in vivo gene knock-out experiments, and sequence homology with proteins of known function. However, with the completion of the sequencing of the genomes of human and other organisms, major structural biology resources have been harnessed to solve structures of large numbers of proteins encoded by the genomes in a high throughput but less specific fashion, under the name 'structural genomics' [1]. Subsequently, large sets of protein structures are accumulated in the public domain databases for which we know little about their biological roles. This shortfall calls for the development of cost-effective computational methods to predict protein function based on three-dimensional structures, with the aim of providing preliminary information to guide biological experiments later.
In the post-genomic era, large amounts of new protein sequences are available for statistics-based recognition of their biological properties. It has been shown in many cases that with the help of elegant computational algorithms, amino acid sequence information alone can be used to successfully predict a protein's structural class [2][3][4], sub-cellular location [5,6], and even enzymatic activities [7][8][9][10]. These approaches, however, are often limited by sequence noise arose from natural mutations throughout the evolutionary path, in which proteins are structurally and functionally conserved, but divergent in amino acid sequences. It is a recurring theme in structural biology that proteins with completely different sequences can adopt very similar global fold. Hence, incorporating structural information into functional genomics would potentially upgrade predictions to the next level of accuracy. Owing to the rapid technical advances in X-ray crystallography and liquid-state NMR spectroscopy, protein structure determination becomes more routine than before. It is reasonable to predict that full-scale structure determination can be the first step towards characterizing the biological role and mechanism of a newly sequenced protein.
In the 13,000-large protein structure database (PDB), there are only approximately 4,000 different folds represented in the PDB, with a fold/structure ratio of approximately 1/5 (in the protein data bank) [11]. Therefore, (a) Ribbon representation of 1IP9, showing two α helixes and four β strands, and (b) the corresponding symmetric interaction matrix (defined in eq. 2), where h 3 and h 5 are the two α helices, and h 1 , h 2 , h 4 and h 6 are the four β strands Figure 1 (a) Ribbon representation of 1IP9, showing two α helixes and four β strands, and (b) the corresponding symmetric interaction matrix (defined in eq. 2), where h 3 and h 5 are the two α helices, and h 1 , h 2 , h 4 and h 6 are the four β strands. The gray-level values denote the distance between any two C α atoms with white corresponding to the shortest distance, i.e., 0.
given a new protein structure determined experimentally, chances are high that its topological arrangement of secondary fragments already exists in PDB either as an individual protein, or as a domain within a larger protein.
Structure comparison is traditionally based on coordinate RMSD [12,13]. While the RMSD approach is effective in comparing two close topologic structures with similar chain length, it fails when proteins are of different shapes or lengths. One outstanding example is Calmodulin, a ubiquitous Ca 2+ binding protein that plays a key role in numerous cellular Ca 2+ -dependent signaling pathways [14]. The backbone RMSD between the Ca 2+ -bound and apo states of individual calmodulin domain (~64 residues) is as large as 4Å, despite the fact that they are the same molecules with the same topology. When using the Ca 2+ -bound structure as a starting model, a homology based NMR residual dipolar coupling (RDC) refinement scheme, which relies heavily on the model having the correct topology, is able to converge the model to an accurate apo structure using RDCs measured for the apo state [15]. There are numerous proteins with similar secondary element arrangements in the 3D space yet acquire different overall shapes. Clearly for these proteins, algorithms different from the RMSD must be used to reveal their topological similarities. Another well-known software called Matching Molecular Models Obtained from Theory (MAMMOTH) is a sequence-independent protein structural alignment method [16]. It compares an experimental protein structure using an arbitrary low-resolution protein tertiary model. The distance defined in MAM-MOTH is quite different from our approach. There are also many other methods of protein structure comparison, such as [17][18][19][20][21]. Note that all of the aforementioned methods used sequence based comparison. In contrast, our method adopts secondary structure based comparison and focuses on extracting invariant topological features.
In our study, we measure structural similarity between proteins by correlating the principle components of their secondary structure interaction matrix. In this method, referred here as the principle component correlation (PCC) analysis, the symmetric matrix for an individual protein is constructed with relationship parameters between secondary elements that can take the form of distance, orientation, or other relevant structural invariants. It is first demonstrated that the maximum eigenvalues of these interaction matrices can be effectively used to group structurally or topologically homologous proteins. Then by taking into account both maximum eigenvalues and their corresponding eigenvectors, a more refined pair-wise structure comparison is performed, which is able to differentiate structures of similar shape but different topological backbone traces. It is also shown that the results of PCC analysis are highly comparable to those given by the scaled Gauss metric (SGM) calculations [22] for the data sets studied. We believe the PPC method is flexible in adopting various structural parameters for pair-wise structure comparison.
(a) The plot of scaled λ 2 (the second largest eigenvalue) ver-sus λ 1 (the maximum eigenvalue), calculated using the PD matrix, for all proteins in the four data sets, and (b) the plot of λ 1 of PID matrix versus that of PD matrices

Materials
A total of fifty-six protein structures, grouped into 6 different sets according to CATH [23,24] are used to test our algorithms. Proteins in structure set I belong to the "mainly alpha" class, including mostly apoptosis regulators in the BCL-x L super family as well as others with remote conformational resemblance; all have the "Orthogonal Bundle" architecture. The atomic coordinates were retrieved from PDB with accession codes 1A4F, 1A6G, 1COL

Clustering of structurally similar proteins by SMEC method
One of the goals of this study is to compare and identify structurally or topologically similar proteins. In other words, given a new experimentally determined protein structure, the proposed method is expected to rapidly place the structure into a group of structurally or topologically similar proteins in the database, thereby aiding in correlating topological similarity with functional similarity. To illustrate the application of the SMEC approach, we compute the scaled eigenvalues of PD and PID interaction matrices (Section Methods). Figure 2a shows the plot of scaled λ 2 versus λ 1 , calculated using the PD matrix, for all proteins in the four data sets. Figure 2b shows the plot of λ 1 of PID matrix versus that of PD matrices. The different symbols represent different structural groups. These plots were used to resolve clusters of structurally similar structures.

Pair-Wise structural comparison by PCC method
In addition to correlating the maximum eigenvalues, the PCC method described in Section Methods, which compares both eigenvalues and eigenvectors, was tested for the four selected data sets. Using the pair-wise distance matrix defined in Section Methods, the difference metric R defined in Eq. 5 between all pairs of protein structures in the four data sets were calculated and shown in Tables 1-6. Additionally for the same data sets, writhing numbers computed using the SGM method were presented in the same corresponding tables. The R values between a few selected proteins from different groups were also shown to provide a negative control (Table 2).

Discussion
The concept of principle component analysis (PCA) is widely used in mathematics and pattern recognition to simplify a data set. In mathematical terms, it is a transform that chooses a new coordinate system for the data set, such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principle component), the second greatest variance on the second axis, and so on. Because of the large amount of information stored along the first axis, the maximum eigenvalue itself can be characteristic enough to represent structural features of a protein.  ing. It is therefore expected that smaller components of interaction matrices are not effective for this purpose. Similarly, when using the first number computed with the SGM algorithm, the four structure sets can be resolved (see Fig. 3).
In addition to the PD matrix, PID matrix defined above was used to provide further separation between clusters of eigenvalues. This was demonstrated in Fig. 2b, in which the plot of λ 1 of PID matrices versus that of PD matrices achieves a much better grouping of the four structural sets in the vertical dimension as compared to the plot in Fig.  2a. This further emphasizes the importance of the maximum eigenvalues and variations in the definition of the interaction matrix that provides independent structural information. It does not escape our notice that even better resolution can be achieved by correlating λ 1 with three or more different types of interaction matrices in a multidimensional plot. The caveat, however, is that definitions of invariant relation constructing the matrices should not be redundant as there are a limited number of independent invariants in a protein structure. Nevertheless, the results here show that the PCA method using secondary interaction matrix is highly flexible in adopting various structural parameters as a means of structure comparison. We also investigate how much the first eigenvalue captures the eigenvalue spectrum in the BCL-x L family. We found that the first eigenvalue captures 45.78% of the sum of the 105 eigenvalues. That indicates that more eigenvalues could be helpful in protein structure classification in our future work.
A more elaborate method built on PCA is explored in this study to utilize the directional information contained in the eigenvector corresponding to λ 1 , named here as the PCC analysis as described in Section Methods. This method is particularly suited for the pair-wise structural comparison. Using the simple PD matrix definition (Section Methods), the pair-wise difference metrics, R, are all small (< 0.4) within each of the four known structural sets (Tables 1 and Figure 5(a)-(f)). The SGM score in Figure 5 is defined as the absolute difference between the SGM values of two proteins. The symbol 'o' denotes that the R score is smaller than SGM score, and the '*' denotes the R score is bigger than SGM score. Furthermore, as a negative control, R values between structures from different sets are much larger, typically greater than 2.0 ( Figure 5(e)). Based on the R values in Table 1 and Figure 5(a)-(f) , we found empirically that by setting the cutoff R value to 0.4, the PCC method can faithfully place all structures in their designated groups.
To provide a more in-depth view of the PCC method, the analysis of data set I is described here in detail. This set consists of mainly α helical structures having the "Orthogonal Bundle" architecture. Proteins 2BID, 1F16, 1G5M, 1GJH, 1MAZ, and 1DDB are apoptosis regulators of celldeath pathways associated with mitochondrion. Since mitochondria originated from prokaryotes, these proteins    The plot of R score versus the SGM score: (a)-(f) are plotted for datasets from I to VI, respectively Figure 5 The plot of R score versus the SGM score: (a)-(f) are plotted for datasets from I to VI, respectively. The SGM score is defined as the absolute difference between the SGM values of two proteins. The symbol '*' denotes that the R score is smaller than SGM score, and the 'o' denotes the R score is bigger than SGM score.
are believed to have evolved from the same ancient design. Although they differ substantially in amino acid sequence as well as in shape, the overall scaffold and topology are similar. As expected, the R values among them are all less than 0.4 (Table 1). Other proteins in this set, including bacterial toxins that are capable of forming membrane pores (1MDT and 1COL) and myoglobin (1A6G), have remote conformational resemblance with the BCL-x L proteins. The R values between these structures and the apoptosis regulators are also less than 0.3 and are comparable to those found within the BCL-x L family. It is interesting to note that although 1MDT and 1COL are not related to the BCL-x L proteins in terms of physiological roles, they do share a similarity with the BCL-x L members other than topology; that is, they all are able to form large pores when inserted into cellular membrane.
In summing the results of Table 1 and Figure 5(a)-(f), the R values within individual sets are on average very small, with a mean of 0.1102 and standard deviation of 0.1269. This is expected because the structures have been manually examined and pre-grouped into topologically similar sets. The comparison results from PCC analyses are generally comparable to that of SGM for the data sets under study (see Table 1 and Figure 5(a)-(f)). However, in a few isolated cases, the difference in the scaled writhing numbers within the same structure set can exceed the threshold of 0.4 that governs similarity (for example, protein pairs (1MAZ, 2BID), (1F16, 1DDB) in Table 1, and pro-tein pairs (1C78, 1FM0), (1C78, 1NDD), and (1C78, 1IBQ) in Figure 5(b). This is because the PCC analysis using the PD matrix emphasizes more on spatial separation and orientation of secondary segments. It must be mentioned that the PD matrix alone is not expected to detect pure topological similarities. The results for structure sets with predominately β strands and mixed α/β proteins show similar R values ( Figure 5(c) and 5(d)), indicating the generality of this method in protein structure comparison. We also tested these six data sets using MAMMOTH, it can also separate the six classes well.
Another variation of the PD matrix definition is to take into account the N -C terminal sense, in attempt to further emphasize protein topological features. A good example is the comparison between structures 1COL and 1DDB in data set I. A visual examination of the two structures reveals that they share similar shape, but are considerably different in topological arrangement of helices 1 and 3. In protein 1COL, the first and third helices are antiparallel, whereas they are parallel in 1DDB (see Figure 4). This is not identified by the PCC analysis using the PD matrix as R = 0.029. The great similarity in shape prevailed in the comparison. However, by applying the PDS matrix defined in Section Methods, the R-value considerably increases to 1.707, clearly highlighting the difference in backbone topological traces. Finally we also would like to pint out that the definition of R could be improved by introducing more eigenvalues.

Figure 4
Ribbon representation of protein structures of (a) 1COL and (b) 1DDB. The two proteins have similar shape, but different topological arrangements in helices 1 and 3.

Conclusion
PCC analysis of secondary interaction matrix is a conceptually simple method that yields results highly comparable to the SGM method. Both are able to distinguish protein conformations based on the more subtle topological features. While the SGM method compares structures in a more topological sense, the outcome of PCC analysis is more dependent on the definition of the interaction matrix. With the PD matrix, the PCC analysis puts more weight on the detailed structure and shape, while it is also capable, to a certain extent, of distinguishing different topological traces. In certain cases of pair-wise comparison, such as that between 1COL and 1DDB, protein shapes can overwhelm their topological features in the analysis; yet the PCC analysis of the PDS matrix is able to completely differentiate between 1COL and 1DDB.
Owing to the flexibility offered by the new method, a more effective definition of interaction matrix can be explored to provide a more efficient structure comparison. There exist many invariants in each protein. Some invariants are important for protein classification, but some are not. Hence, our future work will further explore feature selection, automated classification of PDB, modeling and statistical learning, as well as protein domain matching.

Principle component analysis of secondary interaction matrix
Assuming a protein having n secondary fragments denoted by h 1 , h 2 ,..., h n , and the number of residues in each secondary structure denoted by l 1 , l 2 ,..., l n , respectively, the total number of residues belonging to second- The principle components of the interaction matrix is then obtained by orthogonal decomposition as shown below: where λ 1 ≥ λ 2 ≥ ʜ ≥ λ N are the sorted eigenvalues, the corresponding eigenvectors are e 1 , e 2 ,..., e N , and E = [e 1 , e 2 ,..., e N ] is an invertible matrix. Generally, the maximum eigenvalue, λ 1 , and its corresponding eigenvector in Ndimensional space encode the most dominant features in the structure and therefore can be effectively used to directly compare structures, as well as to identify the less obvious topological features common to the proteins.
Since the eigenvalues depend largely on the dimension of interaction matrix, they are divided by the matrix size N, a treatment similar to the scaling of writhing numbers in the SGM method (Rogen P. and Fain B., 2003). In a relatively crude analysis, λ 1 can be directly compared to infer structural similarity. This method is referred here as the Scaled Maximum Eigenvalue Comparison (SMEC).
In addition to the maximum eigenvalues, their corresponding eigenvectors can also be used to correlate similar structures. Particularly for pair-wise structure comparison, degree of similarity can be more accurately measured by comparing both eigenvalue and eigenvector. Since proteins are generally not of the same length, their eigenvectors cannot be directly correlated due to different dimensionality. Therefore, a "sliding window" approach is employed to correlate the smaller protein to all matching segments (length-wise) in the larger protein.
secondary structure residues 1 ... N, (λ B 2 , e B 2 ) are from secondary structure residues 2 ... N+1, and so on. To quantify structural similarity, we define a difference metric, R, between Î of protein A and Î of the jth matching segment of protein B as Obviously, smaller R j indicates better correlation or higher degree of structural similarity. The overall difference between the two proteins is defined as The minimum of R 1 , R 2 , ..., R M-N+1 is used here to measure similarity because this potentially allows mapping a smaller structure onto a homologous domain within a larger protein. This method is called the Principle Component Correlation (PCC) analysis.

Defining the matrix elements
The definition of block matrix elements, d( , ), depends on the desired structural features to be extracted.
In the current study, we focus structural comparison on protein backbone conformation. Clearly the simplest invariant describing the backbone conformation is the Euclidian distance between a pair of C α atoms from two different secondary segments. Formally, the elements are defined as d( , ) = || -|| where and are the coordinates of the two C α atoms of residues u of h i and v of h j , respectively. For conciseness, we name the interaction matrix so defined as the Pair-wise Distance (PD) matrix. For illustration purpose, the interaction matrix for the structure of Pb1, Domain of Bem1P (PDB accession code 1IP9), is shown in Fig. 1. This structure, consisting of two α helices and four β strands (Fig. 1a), is used here to provide distances between all pairs of C α atoms in the six secondary elements (Fig. 1b).
Furthermore, two variations of the PD matrix definition are explored in attempt to provide a better resolution in structural comparison and classification. Since physical energy of interaction between a pair of atoms typically increase monotonically as the inverse of their separation, inverse of distance is used to mimic physical interactions between secondary elements. Here the elements of F(h i , h j ) are defined as where u 0 represent a hard-sphere boundary below which the interaction is constant. In this study, we arbitrarily set u 0 to 3Å. This definition is referred as Pair-wise Inverse Distance (PID) matrix.
Another variation of the PD matrix definition is to take into account the N -C terminal sense, in attempt to further emphasize protein topological features. For a secondary element, h i , its direction vector v i is defined by two points in Cartesian space: the center of mass of the five consecutive N-terminal C α and the center of mass of the five consecutive C-terminal C α atoms. Given a pair of secondary elements h i and h j , the new matrix elements are defined as where sgn(x) is a symbol function which is 1 when x ≥ 0 and -1 when x < 0. This variation is referred as Pair-wise Distance with Sense (PDS) matrix in this study.

Linking/Writhing numbers
To evaluate the ability of PCC analysis in extracting pure topological features, the linking and writhing numbers, which are good measures of global topology, are also calculated for the four sets of structures for comparison. The linking number of two curves is defined by the Călugăreanu-Fuller-White formula [25][26][27]: Lk = Wr + Tw, where the linking number Lk counts the sum of signed crossings between the ribbon's two boundary curves, the writhing number Wr counts the sum of signed self-crossings of the curve, averaged over all projection directions [28], and Tw is the twist number.Lk is an invariant to any smooth deformation that avoids self-intersections [29], and it is also independent of projection direction. Wr and Tw are invariant to some transformations, such as rigid body motions. Here we compute the writhing numbers using the Scaled Gauss Metric (SGM) approach previously described by Rogen and Fain [22].
Given two curves c 1 and c 2 , which are two closed nonintersecting curves in 3-dimentional space, and define e(s, t) = (c 2 (t) -c 1 (s))/||c 2 (t) -c 1 (s)||, where ||·|| denotes the Euclidean norm. For two closed curves, the vector field e(s, t) is doubly periodic. Such mappings have an integer-valued degree that is invariant under topological deforma- Writhing number is not invariant under general smooth deformations such as translations, rotations, re-parameterizations, and dilations (Murasugi, 1996). Since the backbone of a protein is a polygonal curve, the writhing number of c 1 (t) can be calculated by where W(i 1 , i 2 ) is the writhing number between the i 1 th and the i 2 th segment; s and t denote two different C α atoms, and N is the total number of C α atoms. The SGM method is defined as the normalized writhing number, namely, Wr is divided by N [22]. The absolute difference between their writhing numbers is used to infer topological similarity.