Principal component analysis of the secondary interaction matrix
Assume a protein has n secondary structure fragments, denoted by $h_1, h_2, \ldots, h_n$, containing $l_1, l_2, \ldots, l_n$ residues, respectively. The total number of residues belonging to secondary structures is then $N = \sum_{i=1}^{n} l_i$. The invariant relation between a pair of secondary elements $(h_i, h_j)$ is described by a block matrix $F(h_i, h_j)$, in which the individual matrix elements represent a particular relation between residues of the two secondary structures. Since $h_i$ has $l_i$ residues (denoted by $r^i_1, r^i_2, \ldots, r^i_{l_i}$) and $h_j$ has $l_j$ residues (denoted by $r^j_1, r^j_2, \ldots, r^j_{l_j}$), the elements of the $l_i \times l_j$ block matrix $F(h_i, h_j)$, written $g(r^i_u, r^j_v)$, are defined as
where $1 \le u \le l_i$, $1 \le v \le l_j$, and $d(r^i_u, r^j_v)$ is a real number representing an arbitrary invariant relation between residues of $h_i$ and $h_j$. Note that this approach leaves the definition of $d(r^i_u, r^j_v)$ largely open. The full interaction matrix of a protein structure is square and symmetric, of size $N \times N$, and is defined as

$$F = \begin{bmatrix} F(h_1, h_1) & F(h_1, h_2) & \cdots & F(h_1, h_n) \\ F(h_2, h_1) & F(h_2, h_2) & \cdots & F(h_2, h_n) \\ \vdots & \vdots & \ddots & \vdots \\ F(h_n, h_1) & F(h_n, h_2) & \cdots & F(h_n, h_n) \end{bmatrix} \tag{2}$$
The principal components of the interaction matrix are then obtained by orthogonal decomposition:

$$F = E \Lambda E^{-1}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N) \tag{3}$$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$ are the sorted eigenvalues, $e_1, e_2, \ldots, e_N$ are the corresponding eigenvectors, and $E = [e_1, e_2, \ldots, e_N]$ is an invertible matrix. Generally, the maximum eigenvalue, $\lambda_1$, and its corresponding eigenvector in N-dimensional space encode the most dominant features of the structure. They can therefore be used to compare structures directly, as well as to identify less obvious topological features common to the proteins. Since the eigenvalues depend largely on the dimension of the interaction matrix, they are divided by the matrix size N, a treatment similar to the scaling of writhing numbers in the SGM method (Rogen P. and Fain B., 2003). In a relatively crude analysis, the scaled $\lambda_1$ values can be compared directly to infer structural similarity. This method is referred to here as the Scaled Maximum Eigenvalue Comparison (SMEC).
In addition to the maximum eigenvalues, their corresponding eigenvectors can also be used to correlate similar structures. Particularly for pair-wise structure comparison, the degree of similarity can be measured more accurately by comparing both the eigenvalue and the eigenvector. Since proteins are generally not of the same length, their eigenvectors have different dimensions and cannot be correlated directly. Therefore, a "sliding window" approach is employed to correlate the smaller protein with every length-matched segment of the larger protein. Consider two proteins, A and B, having N and M secondary structure residues, respectively, with N ≤ M. For protein A, which has fewer secondary structure residues, $\lambda_A$ and $e_A$ are the maximum eigenvalue and its corresponding N-dimensional eigenvector. For protein B, which has more secondary structure residues, M − N + 1 interaction matrices are decomposed, where $(\lambda_{B_1}, e_{B_1})$ are the principal components of the interaction matrix constructed from secondary structure residues 1, ..., N, $(\lambda_{B_2}, e_{B_2})$ are those from secondary structure residues 2, ..., N + 1, and so on. To quantify structural similarity, we define a difference metric, $R_j$, between the principal components $(\lambda_A, e_A)$ of protein A and $(\lambda_{B_j}, e_{B_j})$ of the jth matching segment of protein B as
Obviously, a smaller $R_j$ indicates better correlation, that is, a higher degree of structural similarity. The overall difference between the two proteins is defined as

$$R = \min(R_1, R_2, \ldots, R_{M-N+1}). \tag{5}$$

The minimum of $R_1, R_2, \ldots, R_{M-N+1}$ is used to measure similarity because it potentially allows a smaller structure to be mapped onto a homologous domain within a larger protein. This method is called the Principle Component Correlation (PCC) analysis.
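The sliding-window procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation. Because the expression for $R_j$ is not reproduced in this text, the sketch takes the metric as a caller-supplied function; the default `placeholder_metric` is a hypothetical stand-in and should not be read as the paper's definition. The sketch also assumes that the interaction matrix of each window equals the matching principal submatrix of protein B's full matrix, which holds for purely pair-wise definitions such as inter-residue distances.

```python
import numpy as np


def top_eigenpair(F):
    """Scaled maximum eigenvalue and its eigenvector for a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(F)  # eigenvalues in ascending order
    return eigvals[-1] / F.shape[0], eigvecs[:, -1]


def placeholder_metric(lam_a, e_a, lam_b, e_b):
    """HYPOTHETICAL stand-in for R_j; NOT the paper's definition.

    Adds the eigenvalue gap to one minus the absolute cosine between the
    unit-length eigenvectors, so identical inputs give (numerically) zero.
    """
    return abs(lam_a - lam_b) + (1.0 - abs(float(np.dot(e_a, e_b))))


def pcc_difference(F_a, F_b, metric=placeholder_metric):
    """Slide the smaller matrix along the larger one; return (R, [R_1, ...])."""
    F_a = np.asarray(F_a, dtype=float)
    F_b = np.asarray(F_b, dtype=float)
    if F_a.shape[0] > F_b.shape[0]:
        F_a, F_b = F_b, F_a  # make A the protein with fewer residues
    n, m = F_a.shape[0], F_b.shape[0]
    lam_a, e_a = top_eigenpair(F_a)
    scores = []
    for j in range(m - n + 1):  # M - N + 1 matching segments
        # Assumption: the window's interaction matrix is the principal
        # submatrix of B covering residues j .. j + N - 1 (0-based).
        window = F_b[j:j + n, j:j + n]
        lam_b, e_b = top_eigenpair(window)
        scores.append(metric(lam_a, e_a, lam_b, e_b))
    return min(scores), scores
```

Passing the metric as an argument keeps the window logic separate from the choice of $R_j$, so a different definition can be substituted without touching the loop.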
Defining the matrix elements
The definition of the block matrix elements, $d(r^i_u, r^j_v)$, depends on the structural features to be extracted. In the current study, we focus the structural comparison on protein backbone conformation. The simplest invariant describing the backbone conformation is the Euclidean distance between a pair of Cα atoms from two different secondary segments. Formally, the elements are defined as $d(r^i_u, r^j_v) = \lVert x^i_u - x^j_v \rVert$, where $x^i_u$ and $x^j_v$ are the coordinates of the Cα atoms of $r^i_u$ (residue u of $h_i$) and $r^j_v$ (residue v of $h_j$), respectively. For conciseness, we call the interaction matrix so defined the Pair-wise Distance (PD) matrix. For illustration, the interaction matrix for the structure of the PB1 domain of Bem1p (PDB accession code 1IP9) is shown in Fig. 1. This structure, consisting of two α helices and four β strands (Fig. 1a), is used here to provide distances between all pairs of Cα atoms in the six secondary elements (Fig. 1b).
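A minimal sketch of building a PD matrix from Cα coordinates is given below. The input layout is an assumption made for this example: `ca_coords` is an array of Cα positions in chain order, and `segments` is a list of hypothetical `(start, end)` index pairs (end exclusive), one per secondary element. The sketch fills every block, including the diagonal blocks $F(h_i, h_i)$, with distances; whether the original work treats within-element pairs the same way is not stated here.

```python
import numpy as np


def pd_matrix(ca_coords, segments):
    """Pair-wise Distance (PD) matrix over residues in secondary elements.

    ca_coords: (n_residues, 3) array of C-alpha coordinates (assumed input).
    segments:  list of (start, end) index pairs, end exclusive, one per
               secondary element (hypothetical representation).
    """
    ca_coords = np.asarray(ca_coords, dtype=float)
    # Keep only residues that belong to a secondary element, in chain order.
    idx = np.concatenate([np.arange(start, end) for start, end in segments])
    x = ca_coords[idx]
    # Broadcasting yields every pairwise difference vector at once.
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Four residues; the third one lies in a loop and is skipped.
coords = [[0, 0, 0], [3, 4, 0], [9, 9, 9], [0, 0, 5]]
F = pd_matrix(coords, [(0, 2), (3, 4)])
print(F.shape)
```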
Furthermore, two variations of the PD matrix definition are explored in an attempt to provide better resolution in structural comparison and classification. Since the physical energy of interaction between a pair of atoms typically increases monotonically with the inverse of their separation, the inverse of the distance is used to mimic physical interactions between secondary elements. Here the elements of $F(h_i, h_j)$ are defined as

$$d(r^i_u, r^j_v) = \begin{cases} 1 / \lVert x^i_u - x^j_v \rVert, & \lVert x^i_u - x^j_v \rVert > u_0 \\ 1 / u_0, & \text{otherwise} \end{cases} \tag{6}$$

where $x^i_u$ and $x^j_v$ are the Cα coordinates of residue u of $h_i$ and residue v of $h_j$, and $u_0$ represents a hard-sphere boundary below which the interaction is constant. In this study, we arbitrarily set $u_0$ to 3 Å. This definition is referred to as the Pair-wise Inverse Distance (PID) matrix.
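The inverse-distance variant can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the constant value used below the hard-sphere boundary is taken to be $1/u_0$ (so the function is continuous at $u_0$), and the diagonal, where the distance is zero, receives the same constant. The function name `pid_matrix` is hypothetical.

```python
import numpy as np


def pid_matrix(x, u0=3.0):
    """Pair-wise Inverse Distance (PID) matrix for C-alpha coordinates x.

    x:  (N, 3) array of coordinates of the secondary-structure residues.
    u0: hard-sphere boundary in Angstrom (3 A in the text).
    """
    x = np.asarray(x, dtype=float)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # Assumption: clamp distances to u0 before inverting, so that all pairs
    # closer than u0 (including the zero diagonal) get the constant 1 / u0.
    return 1.0 / np.maximum(dist, u0)


M = pid_matrix([[0, 0, 0], [6, 0, 0], [1, 0, 0]])
print(M)
```

Clamping before inversion also avoids a division by zero on the diagonal.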
Another variation of the PD matrix definition takes into account the N-to-C terminal sense, in an attempt to further emphasize protein topological features. For a secondary element $h_i$, its direction vector $v_i$ is defined by two points in Cartesian space: the center of mass of the five consecutive N-terminal Cα atoms and the center of mass of the five consecutive C-terminal Cα atoms. Given a pair of secondary elements $h_i$ and $h_j$, the new matrix elements are defined as

$$d'(r^i_u, r^j_v) = d(r^i_u, r^j_v)\,\mathrm{sgn}(v_i \cdot v_j) \tag{7}$$

where $d(r^i_u, r^j_v)$ is the PD element for residue u of $h_i$ and residue v of $h_j$, and sgn(x) is the sign function, which is 1 when x ≥ 0 and -1 when x < 0. This variation is referred to as the Pair-wise Distance with Sense (PDS) matrix in this study.
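The sense-aware variant can be sketched as below. This is a minimal example, not the authors' implementation. Assumptions made here: the direction vector points from the N-terminal centroid to the C-terminal centroid, the centroid of equally weighted Cα atoms stands in for the center of mass, and elements too short for two distinct five-residue windows fall back to a shorter window. The `(start, end)` segment representation and all function names are hypothetical.

```python
import numpy as np


def direction_vector(seg_coords, k=5):
    """Vector from the N-terminal to the C-terminal C-alpha centroid."""
    seg = np.asarray(seg_coords, dtype=float)
    # Assumption: for elements of k residues or fewer, shrink the window so
    # that the two centroids do not coincide.
    k = min(k, max(1, len(seg) - 1))
    return seg[-k:].mean(axis=0) - seg[:k].mean(axis=0)


def pds_matrix(ca_coords, segments):
    """Pair-wise Distance with Sense (PDS) matrix.

    segments: list of (start, end) index pairs, end exclusive (hypothetical).
    """
    ca = np.asarray(ca_coords, dtype=float)
    idx = np.concatenate([np.arange(s, e) for s, e in segments])
    x = ca[idx]
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    dirs = np.array([direction_vector(ca[s:e]) for s, e in segments])
    # sgn(v_i . v_j) per pair of elements: +1 when the dot product is >= 0.
    seg_sign = np.where(dirs @ dirs.T >= 0, 1.0, -1.0)
    # Expand each element-level sign to its l_i x l_j block of residues.
    lengths = [e - s for s, e in segments]
    res_sign = np.repeat(np.repeat(seg_sign, lengths, axis=0), lengths, axis=1)
    return dist * res_sign


# Two antiparallel six-residue strands, 4 A apart.
strand1 = [[i, 0, 0] for i in range(6)]
strand2 = [[5 - i, 4, 0] for i in range(6)]
M = pds_matrix(strand1 + strand2, [(0, 6), (6, 12)])
print(M[0, 1], M[0, 6])
```

In this toy case the two strands run in opposite directions, so the off-diagonal blocks carry negative distances while the within-strand blocks stay positive.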
Linking/Writhing numbers
To evaluate the ability of PCC analysis to extract pure topological features, the linking and writhing numbers, which are good measures of global topology, are also calculated for the four sets of structures for comparison. The linking number of two curves is given by the Călugăreanu-Fuller-White formula [25–27]: Lk = Wr + Tw, where the linking number Lk counts the sum of signed crossings between the ribbon's two boundary curves, the writhing number Wr counts the sum of signed self-crossings of the curve, averaged over all projection directions [28], and Tw is the twist number. Lk is invariant under any smooth deformation that avoids self-intersections [29], and it is also independent of the projection direction. Wr and Tw are invariant under some transformations, such as rigid body motions. Here we compute the writhing numbers using the Scaled Gauss Metric (SGM) approach previously described by Rogen and Fain [22].
Let $c_1$ and $c_2$ be two closed, non-intersecting curves in three-dimensional space, and define $e(s, t) = (c_2(t) - c_1(s)) / \lVert c_2(t) - c_1(s) \rVert$, where $\lVert \cdot \rVert$ denotes the Euclidean norm. For two closed curves, the vector field e(s, t) is doubly periodic. Such mappings have an integer-valued degree that is invariant under topological deformations. The linking number of the two curves is then defined as

$$Lk(c_1, c_2) = \frac{1}{4\pi} \iint e \cdot (e_s \times e_t)\, ds\, dt = \frac{1}{4\pi} \iint \frac{(c_1(s) - c_2(t)) \cdot (\dot{c}_1(s) \times \dot{c}_2(t))}{\lVert c_1(s) - c_2(t) \rVert^3}\, ds\, dt \tag{8}$$

where $e_s$ and $e_t$ are the tangents (partial derivatives) of e(s, t) at the point (s, t), and $\dot{c}_1(s)$ and $\dot{c}_2(t)$ are the tangents along $c_1$ and $c_2$ at s and t. Note that $e_s$, $e_t$, $\dot{c}_1(s)$, and $\dot{c}_2(t)$ are vectors. Define $w(s, t) = (c_1(t) - c_1(s)) / \lVert c_1(t) - c_1(s) \rVert$. The writhing number for a single curve $c_1$ is defined as

$$Wr(c_1) = \frac{1}{4\pi} \iint w \cdot (w_s \times w_t)\, ds\, dt \tag{9}$$

where $w_s$ and $w_t$ are the tangents of w(s, t) at the point (s, t). The writhing number is not invariant under general smooth deformations, but it is invariant under translations, rotations, re-parameterizations, and dilations (Murasugi, 1996). Since the backbone of a protein is a polygonal curve, the writhing number of $c_1(t)$ can be calculated by
where $W(i_1, i_2)$ is the writhing number between the $i_1$th and the $i_2$th segment; s and t denote two different Cα atoms, and N is the total number of Cα atoms. The SGM measure is the normalized writhing number, namely, Wr divided by N [22]. The absolute difference between the normalized writhing numbers of two structures is used to infer topological similarity.
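For a polygonal backbone, the writhing number can be accumulated segment pair by segment pair. The sketch below is an illustration only and may differ from the authors' and from Rogen and Fain's implementation. It evaluates each $W(i_1, i_2)$ with a commonly used closed-form expression: the signed solid angle from which the two straight segments appear to cross, divided by 2π. Adjacent segments share a vertex and are skipped because they contribute nothing, and degenerate configurations are treated as contributing zero. All function names are hypothetical.

```python
import numpy as np


def _unit(v):
    """Return v / |v|, or None if v is (numerically) the zero vector."""
    n = np.linalg.norm(v)
    return v / n if n > 1e-12 else None


def segment_pair_writhe(p1, p2, p3, p4):
    """W(i1, i2) for segments p1->p2 and p3->p4: signed solid angle / (2 pi)."""
    r13, r14, r23, r24 = p3 - p1, p4 - p1, p3 - p2, p4 - p2
    normals = [_unit(np.cross(r13, r14)), _unit(np.cross(r14, r24)),
               _unit(np.cross(r24, r23)), _unit(np.cross(r23, r13))]
    if any(n is None for n in normals):
        return 0.0  # assumption: degenerate pairs contribute nothing
    # Unsigned solid angle spanned by the directions joining the two segments.
    omega = sum(
        np.arcsin(np.clip(np.dot(normals[k], normals[(k + 1) % 4]), -1.0, 1.0))
        for k in range(4)
    )
    # Handedness of the crossing (zero for coplanar segments).
    sign = np.sign(np.dot(np.cross(p4 - p3, p2 - p1), r13))
    return float(sign * omega / (2.0 * np.pi))


def writhe(ca_coords):
    """Writhing number of the open polygon through the C-alpha atoms."""
    p = np.asarray(ca_coords, dtype=float)
    n_seg = len(p) - 1
    total = 0.0
    for i1 in range(n_seg):
        for i2 in range(i1 + 2, n_seg):  # skip adjacent segments
            total += segment_pair_writhe(p[i1], p[i1 + 1], p[i2], p[i2 + 1])
    return total


def sgm_score(ca_coords):
    """Writhing number divided by the number of C-alpha atoms N."""
    return writhe(ca_coords) / len(ca_coords)


# A short two-turn helix as a stand-in for a backbone trace.
t = np.linspace(0.0, 4.0 * np.pi, 30)
helix = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
print(writhe(helix), sgm_score(helix))
```

Two structures would then be compared through `abs(sgm_score(a) - sgm_score(b))`. Mirroring the coordinates flips the sign of the result, and uniform scaling leaves it unchanged, which gives a quick sanity check of any implementation.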