- Methodology article
- Open access

# Protein structure similarity from principle component correlation analysis

*BMC Bioinformatics*
**volume 7**, Article number: 40 (2006)

## Abstract

### Background

Owing to the rapid expansion of protein structure databases in recent years, methods of structure comparison are becoming increasingly effective and important in revealing novel information on the functional properties of proteins and their roles in evolutionary biology. Currently, the structural similarity between two proteins is usually measured by the root-mean-square deviation (RMSD) of their best-superimposed atomic coordinates. RMSD is the gold standard for measuring structural similarity when the structures are nearly identical; it fails, however, to detect higher-order topological similarities in proteins that have evolved into different shapes. We propose new algorithms for extracting geometrical invariants of proteins that can be used to identify homologous protein structures or topologies, and thereby to quantify both close and remote structural similarities.

### Results

We measure structural similarity between proteins by correlating the principal components of their secondary structure interaction matrices. In our approach, the Principal Component Correlation (PCC) analysis, a symmetric interaction matrix for a protein structure is constructed from relationship parameters between secondary structure elements; these parameters can take the form of distance, orientation, or other relevant structural invariants. With a distance-based construction, whether or not the N- to C-terminal sense is encoded, the principal components of the interaction matrices of structurally or topologically similar proteins are strongly correlated.

### Conclusion

The PCC method is extensively tested for protein structures that belong to the same topological class but are significantly different by RMSD measure. The PCC analysis can also differentiate proteins having similar shapes but different topological arrangements. Additionally, we demonstrate that when using two independently defined interaction matrices, comparison of their maximum eigenvalues can be highly effective in clustering structurally or topologically similar proteins. We believe that the PCC analysis of interaction matrix is highly flexible in adopting various structural parameters for protein structure comparison.

## Background

Conformational resemblance between proteins, whether remote or close, is often used to infer functional properties of proteins and to reveal distant evolutionary relationships between proteins that exhibit no similarity in their amino acid sequences. Traditionally, high-resolution structure determination has followed biological and biochemical studies of a protein, providing further mechanistic details of its function. The biological functions of such proteins have usually been suggested prior to their structural studies by *in vitro* binding assays, *in vivo* gene knock-out experiments, and sequence homology with proteins of known function. However, with the completion of the sequencing of the human genome and the genomes of other organisms, major structural biology resources have been harnessed to solve the structures of large numbers of genome-encoded proteins in a high-throughput but less specific fashion, under the name 'structural genomics' [1]. As a result, the public databases are accumulating large sets of protein structures about whose biological roles little is known. This shortfall calls for the development of cost-effective computational methods that predict protein function from three-dimensional structure, with the aim of providing preliminary information to guide later biological experiments.

In the post-genomic era, large numbers of new protein sequences are available for statistics-based recognition of their biological properties. It has been shown in many cases that, with the help of elegant computational algorithms, amino acid sequence information alone can be used to successfully predict a protein's structural class [2–4], sub-cellular location [5, 6], and even enzymatic activities [7–10]. These approaches, however, are often limited by sequence noise arising from natural mutations along the evolutionary path, in which proteins remain structurally and functionally conserved but diverge in amino acid sequence. It is a recurring theme in structural biology that proteins with completely different sequences can adopt very similar global folds. Hence, incorporating structural information into functional genomics could potentially raise predictions to the next level of accuracy. Owing to rapid technical advances in X-ray crystallography and liquid-state NMR spectroscopy, protein structure determination has become more routine than before. It is reasonable to predict that full-scale structure determination can be the first step towards characterizing the biological role and mechanism of a newly sequenced protein. Among the 13,000 structures in the Protein Data Bank (PDB), only approximately 4,000 different folds are represented, with a fold/structure ratio of approximately 1/5 [11]. Therefore, given a new experimentally determined protein structure, chances are high that its topological arrangement of secondary structure fragments already exists in the PDB, either as an individual protein or as a domain within a larger protein.

Structure comparison is traditionally based on coordinate RMSD [12, 13]. While the RMSD approach is effective in comparing two topologically close structures of similar chain length, it fails when proteins differ in shape or length. One outstanding example is calmodulin, a ubiquitous Ca^{2+}-binding protein that plays a key role in numerous Ca^{2+}-dependent cellular signaling pathways [14]. The backbone RMSD between the Ca^{2+}-bound and apo states of an individual calmodulin domain (~64 residues) is as large as 4 Å, despite the fact that they are the same molecule with the same topology. When the Ca^{2+}-bound structure is used as a starting model, a homology-based NMR residual dipolar coupling (RDC) refinement scheme, which relies heavily on the model having the correct topology, is able to converge the model to an accurate apo structure using RDCs measured for the apo state [15]. There are numerous proteins whose secondary structure elements are similarly arranged in 3D space yet which acquire different overall shapes. Clearly, for these proteins, algorithms other than RMSD must be used to reveal their topological similarities. Matching Molecular Models Obtained from Theory (MAMMOTH) is a well-known sequence-independent protein structural alignment method [16]. It compares an experimental protein structure with an arbitrary low-resolution protein tertiary model. The distance defined in MAMMOTH is quite different from that of our approach. There are also many other methods of protein structure comparison, such as [17–21]. Note that all of the aforementioned methods use sequence-based comparison. In contrast, our method adopts secondary-structure-based comparison and focuses on extracting invariant topological features.

In our study, we measure structural similarity between proteins by correlating the principal components of their secondary structure interaction matrices. In this method, referred to here as the principal component correlation (PCC) analysis, the symmetric matrix for an individual protein is constructed from relationship parameters between secondary structure elements that can take the form of distance, orientation, or other relevant structural invariants. It is first demonstrated that the maximum eigenvalues of these interaction matrices can be used effectively to group structurally or topologically homologous proteins. Then, by taking into account both the maximum eigenvalues and their corresponding eigenvectors, a more refined pair-wise structure comparison is performed, which is able to differentiate structures of similar shape but different topological backbone traces. It is also shown that the results of the PCC analysis are highly comparable to those given by scaled Gauss metric (SGM) calculations [22] for the data sets studied. We believe the PCC method is flexible in adopting various structural parameters for pair-wise structure comparison.

## Results

### Materials

A total of fifty-six protein structures, grouped into six sets according to CATH [23, 24], are used to test our algorithms. Proteins in Set I belong to the "mainly alpha" class and include mostly apoptosis regulators in the BCL-x_{L} superfamily as well as others with remote conformational resemblance; all have the "Orthogonal Bundle" architecture. The atomic coordinates were retrieved from the PDB with accession codes 1A4F, 1A6G, 1COL (A), 1DDB (A), 1F16 (A), 1G5M (A), 1GJH (A), 1MAZ, 1MDT (A), and 2BID (A), where (A) denotes chain A. Set II is also "mainly alpha" and has the same architecture as Set I; it includes structures 1CK7 (A), 1CXW (A), 1E8B (A), 1E88 (A), 1J7M (A), 1KS0 (A), 1PDC, and 2FN2. However, this set consists of DNA helicase domains whose topology differs vastly from that of Set I. Set II is used here to test the ability of our method to separate proteins that are in the same class of secondary structure but have different topologies. Set III belongs to the "mainly beta" class and has the barrel architecture, consisting of acid protease structures 1A5T, 1BVS (A), 1CUK, 1DV (A), 1F4I (A), 1G4A (E), 1G41 (A), 1HJP, 1IM2 (A), and 1JR3 (E). Set IV consists of "alpha/beta" class proteins with the roll architecture, including structures 1FM0 (D), 1D4B (A), 1C78 (A), 1LM8 (B), 1NDD (A), 1UBQ, 1IBQ (A), and 1IP9 (A). The structures in Set IV all have the Ubiquitin-like topology. Set V consists of "mainly alpha" proteins with the Alpha/alpha barrel architecture, including 1C82 (A), 1CB8 (A), 1EGU (A), 1F1S (A), 1F9G (A), 1HM2 (A), 1HM3 (A), 1HMU (A), 1HMW (A), 1HV6 (A), 1I8Q (A), and 1QAZ (A). The structures in Set V all have the Glycosyltransferase topology. Set VI consists of "mainly beta" proteins with the ribbon architecture, including 1AIW, 1E6N (A), 1E6P (A), 1E6R (A), 1E6Z (A), 1E15 (A), 1ED7 (A), and 1GOI (A). The structures in Set VI all have the Seminal Fluid Protein PDC-109 (domain B) topology.

### Clustering of structurally similar proteins by SMEC method

One of the goals of this study is to compare and identify structurally or topologically similar proteins. In other words, given a new experimentally determined protein structure, the proposed method is expected to rapidly place the structure into a group of structurally or topologically similar proteins in the database, thereby helping to correlate topological similarity with functional similarity. To illustrate the application of the Scaled Maximum Eigenvalue Comparison (SMEC) approach, we compute the scaled eigenvalues of the PD and PID interaction matrices (see Methods). Figure 2a plots the scaled λ_{2} versus λ_{1}, calculated using the PD matrix, for all proteins in the four data sets. Figure 2b plots λ_{1} of the PID matrices versus that of the PD matrices. The different symbols represent different structural groups. These plots were used to resolve clusters of structurally similar proteins.

### Pair-Wise structural comparison by PCC method

In addition to correlating the maximum eigenvalues, the PCC method described in Methods, which compares both eigenvalues and eigenvectors, was tested on the four selected data sets. Using the pair-wise distance matrix defined in Methods, the difference metric *R* defined in Eq. 5 was calculated between all pairs of protein structures in the four data sets; the results are shown in Tables 1-6. For the same data sets, the writhing numbers computed using the SGM method are presented in the corresponding tables. The *R* values between a few selected proteins from different groups are also shown to provide a negative control (Table 2).

## Discussion

The concept of principal component analysis (PCA) is widely used in mathematics and pattern recognition to simplify a data set. In mathematical terms, it is a transform that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. Because of the large amount of information stored along the first axis, the maximum eigenvalue itself can be characteristic enough to represent the structural features of a protein. Figure 2a plots the eigenvalues λ_{1} versus λ_{2} derived from the PD matrices of all four sets of structures under study. Clearly, the λ_{1} values alone are distinct enough from each other to group most of the structures into their known conformation sets. The same plot also illustrates that the second largest eigenvalue λ_{2} is generally not powerful enough to accomplish the grouping. It is therefore expected that the smaller components of the interaction matrices are not effective for this purpose. Similarly, the four structure sets can be resolved when using the first number computed with the SGM algorithm (see Fig. 3).

In addition to the PD matrix, the PID matrix defined in Methods was used to provide further separation between clusters of eigenvalues. This is demonstrated in Fig. 2b, in which the plot of λ_{1} of the PID matrices versus that of the PD matrices achieves a much better grouping of the four structural sets in the vertical dimension than the plot in Fig. 2a. This further emphasizes the importance of the maximum eigenvalues and of variations in the definition of the interaction matrix that provide independent structural information. It does not escape our notice that even better resolution can be achieved by correlating the λ_{1} values of three or more different types of interaction matrices in a multi-dimensional plot. The caveat, however, is that the invariant relations used to construct the matrices should not be redundant, as there are a limited number of independent invariants in a protein structure. Nevertheless, the results here show that the PCA method using the secondary interaction matrix is highly flexible in adopting various structural parameters as a means of structure comparison. We also investigated how much of the eigenvalue spectrum the first eigenvalue captures in the BCL-x_{L} family. We found that the first eigenvalue captures 45.78% of the sum of the 105 eigenvalues. This indicates that additional eigenvalues could be helpful for protein structure classification in our future work.

A more elaborate method built on PCA is explored in this study to utilize the directional information contained in the eigenvector corresponding to λ_{1}; this is the PCC analysis described in Methods. This method is particularly suited to pair-wise structural comparison. Using the simple PD matrix definition (see Methods), the pair-wise difference metrics, *R*, are all small (< 0.4) within each of the four known structural sets (Table 1 and Figure 5(a)–(f)). The SGM score in Figure 5 is defined as the absolute difference between the SGM values of two proteins. The symbol 'o' denotes that the *R* score is smaller than the SGM score, and the symbol '*' denotes that the *R* score is larger than the SGM score. Furthermore, as a negative control, *R* values between structures from different sets are much larger, typically greater than 2.0 (Figure 5(e)). Based on the *R* values in Table 1 and Figure 5(a)–(f), we found empirically that, by setting the cutoff *R* value to 0.4, the PCC method can faithfully place all structures in their designated groups.
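
The article does not spell out how the 0.4 cutoff is turned into group assignments. One possible reading, sketched below in plain Python, links every pair of structures whose *R* value falls below the cutoff and reports the resulting connected groups (single linkage). The function name `group_by_cutoff` and the *R* values in the toy matrix are hypothetical and are not taken from Table 1.

```python
def group_by_cutoff(r_values, cutoff=0.4):
    """Assign a group index to each structure from a symmetric matrix of R values.

    Two structures share a group when a chain of pairs with R < cutoff
    connects them (single linkage; an assumed grouping rule).
    """
    n = len(r_values)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if r_values[i][j] < cutoff:
                parent[find(i)] = find(j)

    # Relabel representatives as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical R values for four structures: two similar pairs, far apart.
toy_r = [
    [0.0, 0.1, 2.5, 2.2],
    [0.1, 0.0, 2.3, 2.4],
    [2.5, 2.3, 0.0, 0.2],
    [2.2, 2.4, 0.2, 0.0],
]
groups = group_by_cutoff(toy_r)
```

With these made-up values the first two structures fall into one group and the last two into another; a real analysis would use the measured *R* values instead.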

To provide a more in-depth view of the PCC method, the analysis of data set I is described here in detail. This set consists of mainly α-helical structures having the "Orthogonal Bundle" architecture. Proteins 2BID, 1F16, 1G5M, 1GJH, 1MAZ, and 1DDB are apoptosis regulators of cell-death pathways associated with the mitochondrion. Since mitochondria originated from prokaryotes, these proteins are believed to have evolved from the same ancient design. Although they differ substantially in amino acid sequence as well as in shape, their overall scaffold and topology are similar. As expected, the *R* values among them are all less than 0.4 (Table 1). Other proteins in this set, including bacterial toxins that are capable of forming membrane pores (1MDT and 1COL) and myoglobin (1A6G), have a remote conformational resemblance to the BCL-x_{L} proteins. The *R* values between these structures and the apoptosis regulators are also less than 0.3 and are comparable to those found within the BCL-x_{L} family. It is interesting to note that although 1MDT and 1COL are not related to the BCL-x_{L} proteins in terms of physiological roles, they do share a similarity with the BCL-x_{L} members other than topology: they are all able to form large pores when inserted into a cellular membrane.

Summarizing the results of Table 1 and Figure 5(a)–(f), the *R* values within individual sets are on average very small, with a mean of 0.1102 and a standard deviation of 0.1269. This is expected because the structures have been manually examined and pre-grouped into topologically similar sets. The comparison results from the PCC analyses are generally comparable to those of SGM for the data sets under study (see Table 1 and Figure 5(a)–(f)). However, in a few isolated cases, the difference in the scaled writhing numbers within the same structure set can exceed the threshold of 0.4 that governs similarity (for example, protein pairs (1MAZ, 2BID) and (1F16, 1DDB) in Table 1, and protein pairs (1C78, 1FM0), (1C78, 1NDD), and (1C78, 1IBQ) in Figure 5(b)). This is because the PCC analysis using the PD matrix places more emphasis on the spatial separation and orientation of secondary segments. It must be mentioned that the PD matrix alone is not expected to detect pure topological similarities. The results for structure sets with predominantly β strands and for mixed α/β proteins show similar *R* values (Figure 5(c) and 5(d)), indicating the generality of this method in protein structure comparison. We also tested these six data sets using MAMMOTH, which can also separate the six classes well.

Another variation of the PD matrix definition takes into account the N- to C-terminal sense, in an attempt to further emphasize protein topological features. A good example is the comparison between structures 1COL and 1DDB in data set I. A visual examination of the two structures reveals that they share a similar shape but differ considerably in the topological arrangement of helices 1 and 3. In protein 1COL, the first and third helices are anti-parallel, whereas they are parallel in 1DDB (see Figure 4). This difference is not identified by the PCC analysis using the PD matrix, which gives *R* = 0.029; the great similarity in shape prevails in the comparison. However, when the PDS matrix defined in Methods is applied, the *R* value increases considerably to 1.707, clearly highlighting the difference in backbone topological traces. Finally, we would also like to point out that the definition of *R* could be improved by introducing more eigenvalues.

## Conclusion

PCC analysis of secondary interaction matrix is a conceptually simple method that yields results highly comparable to the SGM method. Both are able to distinguish protein conformations based on the more subtle topological features. While the SGM method compares structures in a more topological sense, the outcome of PCC analysis is more dependent on the definition of the interaction matrix. With the PD matrix, the PCC analysis puts more weight on the detailed structure and shape, while it is also capable, to a certain extent, of distinguishing different topological traces. In certain cases of pair-wise comparison, such as that between 1COL and 1DDB, protein shapes can overwhelm their topological features in the analysis; yet the PCC analysis of the PDS matrix is able to completely differentiate between 1COL and 1DDB. Owing to the flexibility offered by the new method, a more effective definition of interaction matrix can be explored to provide a more efficient structure comparison. There exist many invariants in each protein. Some invariants are important for protein classification, but some are not. Hence, our future work will further explore feature selection, automated classification of PDB, modeling and statistical learning, as well as protein domain matching.

## Methods

### Principal component analysis of secondary interaction matrix

Assume a protein has $n$ secondary fragments denoted by $\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n$, with the number of residues in each secondary structure denoted by $l_1, l_2, \ldots, l_n$, respectively. The total number of residues belonging to secondary structures is then $N = \sum_{i=1}^{n} l_i$. The invariant relation between a pair of secondary elements $(\mathbf{h}_i, \mathbf{h}_j)$ is described by a block matrix $\mathbf{F}(\mathbf{h}_i, \mathbf{h}_j)$, in which the individual matrix elements represent a particular relation between residues of the two secondary structures. Since $\mathbf{h}_i$ has $l_i$ residues (denoted by $c_i^1, c_i^2, \ldots, c_i^{l_i}$) and $\mathbf{h}_j$ has $l_j$ residues (denoted by $c_j^1, c_j^2, \ldots, c_j^{l_j}$), the elements of the $l_i \times l_j$ block matrix $\mathbf{F}$, $g(c_i^u, c_j^v)$, are defined as

$$g(c_i^u, c_j^v) = \begin{cases} d(c_i^u, c_j^v), & i \ne j \\ 0, & i = j \end{cases} \qquad (1)$$

where $1 \le u \le l_i$, $1 \le v \le l_j$, and $d(c_i^u, c_j^v)$ is a real number representing an arbitrary invariant relation between residues of $\mathbf{h}_i$ and $\mathbf{h}_j$. Note that this approach allows the definition of $d(c_i^u, c_j^v)$ to be rather arbitrary. The full interaction matrix of a protein structure is square and symmetric and is defined as

$$\widehat{I} = \begin{bmatrix} 0 & \mathbf{F}(\mathbf{h}_1, \mathbf{h}_2) & \cdots & \mathbf{F}(\mathbf{h}_1, \mathbf{h}_n) \\ \mathbf{F}(\mathbf{h}_2, \mathbf{h}_1) & 0 & \cdots & \mathbf{F}(\mathbf{h}_2, \mathbf{h}_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{F}(\mathbf{h}_n, \mathbf{h}_1) & \mathbf{F}(\mathbf{h}_n, \mathbf{h}_2) & \cdots & 0 \end{bmatrix}_{N \times N} \qquad (2)$$

The principal components of the interaction matrix are then obtained by orthogonal decomposition as shown below:

$$\widehat{I} = \mathbf{E}^{T} \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_N \end{bmatrix} \mathbf{E} \qquad (3)$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ are the sorted eigenvalues, the corresponding eigenvectors are $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N$, and $\mathbf{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N]$ is an invertible matrix. Generally, the maximum eigenvalue, $\lambda_1$, and its corresponding eigenvector in $N$-dimensional space encode the most dominant features in the structure and can therefore be used to compare structures directly, as well as to identify the less obvious topological features common to the proteins. Since the eigenvalues depend largely on the dimension of the interaction matrix, they are divided by the matrix size $N$, a treatment similar to the scaling of writhing numbers in the SGM method (Rogen and Fain [22]). In a relatively crude analysis, the scaled $\lambda_1$ values can be compared directly to infer structural similarity. This method is referred to here as the Scaled Maximum Eigenvalue Comparison (SMEC).
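
As a minimal sketch of Eqs. 1-3 and the SMEC scaling, the following NumPy code builds a distance-based interaction matrix and extracts its scaled leading eigenpair. The function names `interaction_matrix` and `scaled_leading_eigenpair`, the array layout (one coordinate row per secondary-structure residue plus a parallel list of element labels), the eigenvector sign convention, and the toy coordinates are assumptions made for illustration; they are not taken from the article. Euclidean distance is used for $d$, anticipating the PD definition in the next subsection.

```python
import numpy as np


def interaction_matrix(coords, element_ids):
    """Build the N x N interaction matrix of Eq. 2 with d = Euclidean distance.

    coords      -- (N, 3) coordinates, one row per secondary-structure residue
    element_ids -- length-N labels giving the secondary element of each residue
    """
    coords = np.asarray(coords, dtype=float)
    element_ids = np.asarray(element_ids)
    # All pair-wise distances between residues.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Eq. 1: entries within the same secondary element (i = j) are zero.
    dist[element_ids[:, None] == element_ids[None, :]] = 0.0
    return dist


def scaled_leading_eigenpair(matrix):
    """Return (lambda_1 / N, e_1) for a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # ascending order
    vector = eigenvectors[:, -1]
    # An eigenvector is defined only up to sign; fix it (assumed convention).
    if vector.sum() < 0:
        vector = -vector
    return eigenvalues[-1] / matrix.shape[0], vector


# Toy input: two elements of two residues each (hypothetical coordinates).
toy_coords = [[0, 0, 0], [1, 0, 0], [0, 3, 0], [1, 3, 0]]
toy_matrix = interaction_matrix(toy_coords, [0, 0, 1, 1])
lam1, e1 = scaled_leading_eigenpair(toy_matrix)
```

In an SMEC-style comparison, the scalar `lam1` would be computed once per structure and the resulting values plotted or compared directly.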

In addition to the maximum eigenvalues, their corresponding eigenvectors can also be used to correlate similar structures. Particularly for pair-wise structure comparison, the degree of similarity can be measured more accurately by comparing both eigenvalue and eigenvector. Since proteins are generally not of the same length, their eigenvectors cannot be directly correlated because of their different dimensionality. Therefore, a "sliding window" approach is employed to correlate the smaller protein with all matching segments (length-wise) of the larger protein. Let us consider two proteins, A and B, having $N$ and $M$ secondary structure residues, respectively, with $N \le M$. For the protein with fewer secondary structure residues, $\lambda^{A}$ and $\mathbf{e}^{A}$ are respectively the maximum eigenvalue and its corresponding $N$-dimensional eigenvector. For the protein with more secondary structure residues, $M-N+1$ interaction matrices are decomposed, where $(\lambda_1^{B}, \mathbf{e}_1^{B})$ represents the principal components of the interaction matrix constructed from secondary structure residues $1, \ldots, N$, $(\lambda_2^{B}, \mathbf{e}_2^{B})$ those from secondary structure residues $2, \ldots, N+1$, and so on. To quantify structural similarity, we define a difference metric, $R_j$, between $\widehat{I}$ of protein A and $\widehat{I}$ of the $j$th matching segment of protein B as

$$R_j = \left\| \mathbf{e}^{A} - \mathbf{e}_j^{B} \right\| \, \left| \lambda^{A} - \lambda_j^{B} \right|, \qquad 1 \le j \le M-N+1. \qquad (4)$$

Obviously, a smaller $R_j$ indicates better correlation or a higher degree of structural similarity. The overall difference between the two proteins is defined as

$$R = \min(R_1, R_2, \ldots, R_{M-N+1}). \qquad (5)$$

The minimum of $R_1, R_2, \ldots, R_{M-N+1}$ is used here to measure similarity because this potentially allows a smaller structure to be mapped onto a homologous domain within a larger protein. This method is called the Principal Component Correlation (PCC) analysis.
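
The sliding-window comparison of Eqs. 4 and 5 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `scaled_leading_eigenpair` and `pcc_difference` are hypothetical, each window is taken as the principal submatrix of the larger protein's interaction matrix over $N$ consecutive secondary-structure residues, eigenvalues are divided by the matrix size, and the eigenvector sign is fixed so that its components sum to a non-negative number (the article does not state a sign convention).

```python
import numpy as np


def scaled_leading_eigenpair(matrix):
    """Return (lambda_1 / N, e_1) for a symmetric matrix, with a fixed sign."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)  # ascending order
    vector = eigenvectors[:, -1]
    if vector.sum() < 0:  # assumed sign convention
        vector = -vector
    return eigenvalues[-1] / matrix.shape[0], vector


def pcc_difference(matrix_a, matrix_b):
    """Difference metric R of Eq. 5 between two interaction matrices."""
    matrix_a = np.asarray(matrix_a, dtype=float)
    matrix_b = np.asarray(matrix_b, dtype=float)
    # Make A the smaller protein (N <= M).
    if matrix_a.shape[0] > matrix_b.shape[0]:
        matrix_a, matrix_b = matrix_b, matrix_a
    n, m = matrix_a.shape[0], matrix_b.shape[0]
    lam_a, e_a = scaled_leading_eigenpair(matrix_a)
    scores = []
    for j in range(m - n + 1):
        # Window over residues j .. j+n-1 of the larger protein.
        window = matrix_b[j:j + n, j:j + n]
        lam_b, e_b = scaled_leading_eigenpair(window)
        # Eq. 4: eigenvector distance times eigenvalue difference.
        scores.append(np.linalg.norm(e_a - e_b) * abs(lam_a - lam_b))
    return min(scores)  # Eq. 5


# Toy check: a random symmetric matrix and one of its own windows.
rng = np.random.default_rng(0)
raw = rng.random((6, 6))
big = raw + raw.T
np.fill_diagonal(big, 0.0)
small = big[1:5, 1:5].copy()
r_toy = pcc_difference(small, big)
```

Because `small` is literally a window of `big`, the minimum in Eq. 5 is reached at that window and `r_toy` is zero up to rounding, which is the behavior the text describes for a small structure that matches a domain of a larger one.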

### Defining the matrix elements

The definition of the block matrix elements, $d(c_i^u, c_j^v)$, depends on the structural features to be extracted. In the current study, we focus structural comparison on protein backbone conformation. Clearly, the simplest invariant describing the backbone conformation is the Euclidean distance between a pair of $C^{\alpha}$ atoms from two different secondary segments. Formally, the elements are defined as $d(c_i^u, c_j^v) = \|c_i^u - c_j^v\|$, where $c_i^u$ and $c_j^v$ are the coordinates of the $C^{\alpha}$ atoms of residue $u$ of $\mathbf{h}_i$ and residue $v$ of $\mathbf{h}_j$, respectively. For conciseness, we name the interaction matrix so defined the Pair-wise Distance (PD) matrix. For illustration, the interaction matrix for the structure of the Pb1 domain of Bem1P (PDB accession code 1IP9) is shown in Fig. 1. This structure, consisting of two α helices and four β strands (Fig. 1a), is used here to provide distances between all pairs of $C^{\alpha}$ atoms in the six secondary elements (Fig. 1b).

Furthermore, two variations of the PD matrix definition are explored in an attempt to provide better resolution in structural comparison and classification. Since the physical energy of interaction between a pair of atoms typically varies monotonically with the inverse of their separation, the inverse of distance is used to mimic physical interactions between secondary elements. Here the elements of $\mathbf{F}(\mathbf{h}_i, \mathbf{h}_j)$ are defined as

$$d(c_i^u, c_j^v) = \begin{cases} \dfrac{1}{\|c_i^u - c_j^v\|}, & \|c_i^u - c_j^v\| \ge u_0 \\ \dfrac{1}{u_0}, & \|c_i^u - c_j^v\| < u_0 \end{cases} \qquad (6)$$

where $u_0$ represents a hard-sphere boundary below which the interaction is constant. In this study, we arbitrarily set $u_0$ to 3 Å. This definition is referred to as the Pair-wise Inverse Distance (PID) matrix.
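
Eq. 6 amounts to taking the reciprocal of the distance after clamping it from below at $u_0$. A short NumPy sketch is given here; the function name `pid_matrix`, the input layout (coordinates plus per-residue element labels), and the toy coordinates are illustrative assumptions.

```python
import numpy as np


def pid_matrix(coords, element_ids, u0=3.0):
    """Pair-wise inverse distance matrix following Eqs. 1 and 6.

    coords      -- (N, 3) C-alpha coordinates in Angstrom
    element_ids -- length-N labels giving the secondary element of each residue
    u0          -- hard-sphere boundary in Angstrom (3 in the article)
    """
    coords = np.asarray(coords, dtype=float)
    element_ids = np.asarray(element_ids)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # 1/d for d >= u0 and 1/u0 for d < u0, i.e. 1 / max(d, u0).
    inverse = 1.0 / np.maximum(dist, u0)
    # Eq. 1: entries within the same secondary element are zero.
    inverse[element_ids[:, None] == element_ids[None, :]] = 0.0
    return inverse


# Toy input: one single-residue element and one two-residue element.
toy_pid = pid_matrix([[0, 0, 0], [0, 0, 1], [0, 0, 5]], [0, 1, 1])
```

Clamping before inverting also avoids a division by zero on the diagonal, where the distance is exactly zero.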

Another variation of the PD matrix definition takes into account the N- to C-terminal sense, in an attempt to further emphasize protein topological features. For a secondary element $\mathbf{h}_i$, its direction vector $\mathbf{v}_i$ is defined by two points in Cartesian space: the center of mass of the five consecutive N-terminal $C^{\alpha}$ atoms and the center of mass of the five consecutive C-terminal $C^{\alpha}$ atoms. Given a pair of secondary elements $\mathbf{h}_i$ and $\mathbf{h}_j$, the new matrix elements are defined as

$$d'(c_i^u, c_j^v) = d(c_i^u, c_j^v)\,\operatorname{sgn}(\mathbf{v}_i \cdot \mathbf{v}_j) \qquad (7)$$

where $\operatorname{sgn}(x)$ is the sign function, which is 1 when $x \ge 0$ and $-1$ when $x < 0$. This variation is referred to as the Pair-wise Distance with Sense (PDS) matrix in this study.
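
A possible implementation of Eq. 7 is sketched below. Several details are assumptions, since the article does not specify them: the names `direction_vectors` and `pds_matrix`, the direction of each vector (taken from the N-terminal centroid to the C-terminal centroid), equal atomic masses so that the center of mass is the plain centroid, and the handling of elements shorter than ten residues, for which the end windows are shrunk so that they do not overlap.

```python
import numpy as np


def direction_vectors(coords, element_ids, k=5):
    """Direction vector of each secondary element (N-terminal to C-terminal)."""
    coords = np.asarray(coords, dtype=float)
    element_ids = np.asarray(element_ids)
    vectors = {}
    for element in np.unique(element_ids):
        residues = coords[element_ids == element]  # assumed to be in N -> C order
        # Assumption: shrink the window for short elements so the ends do not overlap.
        width = max(1, min(k, len(residues) // 2))
        vectors[element] = residues[-width:].mean(axis=0) - residues[:width].mean(axis=0)
    return vectors


def pds_matrix(coords, element_ids, k=5):
    """Pair-wise distance matrix with N-C sense, following Eqs. 1 and 7."""
    coords = np.asarray(coords, dtype=float)
    element_ids = np.asarray(element_ids)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dist[element_ids[:, None] == element_ids[None, :]] = 0.0
    vectors = direction_vectors(coords, element_ids, k)
    per_residue = np.array([vectors[e] for e in element_ids])
    # sgn(v_i . v_j): +1 when the dot product is >= 0, otherwise -1.
    sense = np.where(per_residue @ per_residue.T >= 0, 1.0, -1.0)
    return dist * sense


# Toy input: two straight ten-residue strands, parallel and anti-parallel.
strand_a = [[i, 0, 0] for i in range(10)]
strand_parallel = [[i, 4, 0] for i in range(10)]
strand_antiparallel = [[9 - i, 4, 0] for i in range(10)]
ids = [0] * 10 + [1] * 10
m_parallel = pds_matrix(strand_a + strand_parallel, ids)
m_antiparallel = pds_matrix(strand_a + strand_antiparallel, ids)
```

In the toy example the inter-element block is positive for the parallel pair and negative for the anti-parallel pair, which is the kind of contrast the text reports for helices 1 and 3 of 1DDB and 1COL.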

### Linking/Writhing numbers

To evaluate the ability of the PCC analysis to extract pure topological features, the linking and writhing numbers, which are good measures of global topology, are also calculated for the four sets of structures for comparison. The linking number of two curves is defined by the Călugăreanu-Fuller-White formula [25–27]: *Lk* = *Wr* + *Tw*, where the linking number *Lk* counts the sum of signed crossings between the ribbon's two boundary curves, the writhing number *Wr* counts the sum of signed self-crossings of the curve, averaged over all projection directions [28], and *Tw* is the twist number. *Lk* is invariant under any smooth deformation that avoids self-intersections [29], and it is also independent of projection direction. *Wr* and *Tw* are invariant under some transformations, such as rigid body motions. Here we compute the writhing numbers using the Scaled Gauss Metric (SGM) approach previously described by Rogen and Fain [22].

Let $c_1$ and $c_2$ be two closed non-intersecting curves in 3-dimensional space, and define $e(s, t) = (c_2(t) - c_1(s))/\|c_2(t) - c_1(s)\|$, where $\|\cdot\|$ denotes the Euclidean norm. For two closed curves, the vector field $e(s, t)$ is doubly periodic. Such mappings have an integer-valued degree that is invariant under topological deformations. The linking number of the two curves is further defined as

$$Lk(c_1, c_2) = \frac{1}{4\pi}\int_{c_1}\int_{c_2} [e, e_s, e_t]\, ds\, dt = \frac{1}{4\pi}\int_{c_1}\int_{c_2} \frac{\left(c_1'(s) \times c_2'(t)\right) \cdot \left(c_1(s) - c_2(t)\right)}{\|c_1(s) - c_2(t)\|^{3}}\, ds\, dt \qquad (8)$$

where $e_s$ and $e_t$ are the tangents of $e(s, t)$ at the point $(s, t)$, and $c_1'(s)$ and $c_2'(t)$ are the tangents along $c_1$ and $c_2$ at $s$ and $t$. Note that here $e_s$, $e_t$, $c_1'(s)$, and $c_2'(t)$ are vectors. Define $w(s, t) = (c_1(t) - c_1(s))/\|c_1(t) - c_1(s)\|$. The writhing number for a single curve $c_1$ is defined as

$$Wr(c_1) = \frac{1}{4\pi}\int_{c_1}\int_{c_1} [w, w_s, w_t]\, ds\, dt = \frac{1}{4\pi}\int_{c_1}\int_{c_1} \frac{\left(c_1'(s) \times c_1'(t)\right) \cdot \left(c_1(s) - c_1(t)\right)}{\|c_1(s) - c_1(t)\|^{3}}\, ds\, dt \qquad (9)$$

where $w_s$ and $w_t$ are the tangents of $w(s, t)$ at the point $(s, t)$. The writhing number is not invariant under general smooth deformations, although it is invariant under translations, rotations, re-parameterizations, and dilations (Murasugi, 1996). Since the backbone of a protein is a polygonal curve, the writhing number of $c_1(t)$ can be calculated by

$$Wr(c_1) = \sum_{0 < i_1 < i_2 < N} W(i_1, i_2), \qquad W(i_1, i_2) = \frac{1}{2\pi}\int_{i_1}^{i_1+1}\int_{i_2}^{i_2+1} [w, w_s, w_t]\, ds\, dt \qquad (10)$$

where $W(i_1, i_2)$ is the contribution to the writhing number from the $i_1$th and the $i_2$th backbone segments, each segment joining two consecutive $C^{\alpha}$ atoms; $s$ and $t$ run along these two segments, and $N$ is the total number of $C^{\alpha}$ atoms. The SGM value is the normalized writhing number, namely *Wr* divided by $N$ [22]. The absolute difference between the SGM values of two proteins is used to infer topological similarity.
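
For orientation, the writhing number of a $C^{\alpha}$ trace can be approximated numerically as sketched below. This is a crude midpoint discretization of the double integral in Eq. 9, not the exact polygonal evaluation of Eq. 10 used in the SGM method [22]: each backbone segment is represented by its midpoint and its segment vector. The names `writhe_midpoint`, `sgm_value`, and `sgm_score`, and the synthetic test curves, are hypothetical.

```python
import numpy as np


def writhe_midpoint(points):
    """Approximate writhing number of an open polygonal curve (midpoint rule)."""
    points = np.asarray(points, dtype=float)
    tangents = points[1:] - points[:-1]           # segment vectors, c'(s) ds
    midpoints = 0.5 * (points[1:] + points[:-1])  # one sample point per segment
    total = 0.0
    for i in range(len(tangents)):
        for j in range(len(tangents)):
            if i == j:
                continue  # the integrand is singular on the diagonal
            r = midpoints[i] - midpoints[j]
            # Integrand of Eq. 9: (c'(s) x c'(t)) . (c(s) - c(t)) / |c(s) - c(t)|^3
            total += np.dot(np.cross(tangents[i], tangents[j]), r) / np.linalg.norm(r) ** 3
    return total / (4.0 * np.pi)


def sgm_value(points):
    """Writhing number divided by the number of points (C-alpha atoms)."""
    return writhe_midpoint(points) / len(points)


def sgm_score(points_a, points_b):
    """Absolute difference between the scaled writhing numbers of two curves."""
    return abs(sgm_value(points_a) - sgm_value(points_b))


# Synthetic curves: a two-turn helix, its mirror image, and a planar wave.
t = np.linspace(0.0, 4.0 * np.pi, 40)
helix = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
mirror = helix * np.array([1.0, 1.0, -1.0])
planar = np.column_stack([t, np.sin(t), np.zeros_like(t)])
```

Two sanity checks follow from the definition: a planar curve has zero writhe, and mirroring a curve flips the sign of its writhe, so `sgm_score(helix, mirror)` is twice the magnitude of `sgm_value(helix)`.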

## References

1. Kim S: **Shining a light on structural genomics.** *Nat Struct Biol* 1998, **5:** 643–645. 10.1038/1334
2. Chou PY, Fasman GD: **Prediction of the secondary structure of proteins from their amino acid sequence.** *Adv Enzymol Relat Areas Mol Biol* 1978, **47:** 45–148.
3. Chou KC, Zhang CT: **Predicting protein folding types by distance functions that make allowances for amino acid interactions.** *Journal of Biological Chemistry* 1994, **269:** 22014–22020.
4. Bahar I, Atilgan AR, Jernigan RL, Erman B: **Understanding the recognition of protein structural classes by amino acid composition.** *PROTEINS: Structure, Function, and Genetics* 1997, **29:** 172–185. 10.1002/(SICI)1097-0134(199710)29:2<172::AID-PROT5>3.0.CO;2-F
5. Chou KC, Cai YD: **Using functional domain composition and support vector machines for prediction of protein subcellular location.** *Journal of Biological Chemistry* 2002, **227:** 45765–45769. 10.1074/jbc.M204161200
6. Nakai K, Horton P: **PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization.** *Trends in Biochemical Science* 1999, **24:** 34–36. 10.1016/S0968-0004(98)01336-X
7. Chou K, Elrod DW: **Prediction of enzyme family classes.** *J Proteome Res* 2003, **2**(2): 183–190. 10.1021/pr0255710
8. Cai YD, Lin S, Chou KC: **Support vector machines for prediction of protein signal sequences and their cleavage sites.** *Peptides* 2003, **24:** 159–161. 10.1016/S0196-9781(02)00289-9
9. Chou JJ: **Predicting cleavability of peptide sequences by HIV protease via correlation-angle approach.** *Journal of Protein Chemistry* 1993, **12:** 291–302. 10.1007/BF01028191
10. Chou KC: **Prediction of HIV protease cleavage sites in proteins.** *Analytical Biochemistry* 1996, **233:** 1–14. 10.1006/abio.1996.0001
11. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: **The Protein Data Bank.** *Nucleic Acids Research* 2000, **28:** 235–242. 10.1093/nar/28.1.235
12. Eidhammer I, Jonassen I, Taylor WR: **Structure comparison and structure patterns.** *Journal of Computational Biology* 2000, **7:** 685–716. 10.1089/106652701446152
13. Koehl P: **Protein structure similarities.** *Current Opinion in Structural Biology* 2001, **11:** 348–353. 10.1016/S0959-440X(00)00214-1
14. Cohen P, Klee CB: *Calmodulin.* New York: Elsevier; 1988.
15. Chou JJ, Li SP, Bax A: **Study of conformational rearrangement and refinement of structural homology models by the use of heteronuclear dipolar couplings.** *Journal of Biomolecular NMR* 2000, **18**(3): 217–227. 10.1023/A:1026563923774
16. Ortiz AR, Strauss CE, Olmea O: **MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison.** *Protein Sci* 2002, **11**(11): 2606–21. 10.1110/ps.0215902
17. Bostick DL, Shen M, Vaisman II: **A simple topological representation of protein structure: implications for new, fast, and robust structural classification.** *Proteins* 2004, **56**(3): 487–501. 10.1002/prot.20146
18. Carugo O, Pongor S: **Protein fold similarity estimated by a probabilistic approach based on C(alpha)-C(alpha) distance comparison.** *J Mol Biol* 2002, **315**(4): 887–98. 10.1006/jmbi.2001.5250
19. Kedem K, Chew LP, Elber R: **Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories.** *Proteins* 1999, **37**(4): 554–64. 10.1002/(SICI)1097-0134(19991201)37:4<554::AID-PROT6>3.0.CO;2-1
20. Zemla A: **LGA: A method for finding 3D similarities in protein structures.** *Nucleic Acids Res* 2003, **31**(13): 3370–4. 10.1093/nar/gkg571
21. Bastolla U, *et al*.: **Principal eigenvector of contact matrices and hydrophobicity profiles in proteins.** *Proteins* 2005, **58**(1): 22–30. 10.1002/prot.20240
22. Rogen P, Fain B: **Automatic classification of protein structure by using Gauss integrals.** *Proc Natl Acad Sci USA* 2003, **100**(1): 119–124. 10.1073/pnas.2636460100
23. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: **CATH – A hierarchic classification of protein domain structures.** *Structure* 1997, **5:** 1093–1108. 10.1016/S0969-2126(97)00260-8
24. Pearl FMG, Lee D, Bray JE, Sillitoe I, Todd AE, Harrison AP, Thornton JM, Orengo CA: **Assigning genomic sequences to CATH.** *Nucleic Acids Research* 2000, **28:** 277–282. 10.1093/nar/28.1.277
25. Calugareanu G: **Sur les classes d'isotopie des noeuds tridimensionnels et leurs invariants.** *Czechoslovak Math* 1961, **11:** 588–625.
26. Fuller FB: **The writhing number of a space curve.** *Proc Natl Acad Sci USA* 1971, **68**(4): 815–9.
27. Bauer WR, Crick FH, White JH: **Supercoiled DNA.** *Sci Am* 1980, **243**(1): 100–13.
28. Erdmann MA: **Protein similarity from knot theory and geometric convolution.** In *Proceedings of the eighth annual international conference on Computational molecular biology*. San Diego, California, USA; 2004.
29. Murasugi K: *Theory and Its Applications.* Boston, USA: Birkhäuser; 1996.

## Acknowledgements

This research was supported by the Center for Bioinformatics Research Program Grant of Harvard Center for Neurodegeneration and Repair to Dr. Stephen TC Wong. Dr. James Chou is supported by the Smith Family Award for Young Investigators and the PEW Scholarship.

## Additional information

### Authors' contributions

X.Z. and J.C. played the major role in carrying out the proposed approach and experiments, and drafted the manuscript. S.T.C.W. was involved in and guided the research discussion and the preparation of the manuscript, and gave final approval of the version to be published.

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Zhou, X., Chou, J. & Wong, S.T. Protein structure similarity from principle component correlation analysis.
*BMC Bioinformatics* **7**, 40 (2006). https://doi.org/10.1186/1471-2105-7-40
