### Mutlipolar representation

The notion of multipoles comes from physics where they are used to describe the field generated by the spatial distribution of a scalar quantity such as mass (gravitational field) or charge (electrostatic field) density. The potential of the field created by such a quantity, at a given distance outside the region it occupies in space, can be conveniently expressed in the form of a multipole expansion [13]. Each multipole in the series accounts for the contribution of a certain type of deviation of the density field from a spherically symmetric distribution and, in general, the higher the order of the multipole the smaller the spatial scale of the deviation it describes. In this sense, the multipoles can be viewed as descriptors for the shape of the scalar distribution. The use of multipoles as shape descriptors is also closely related to the more general methods of 3D moments used in the field of object recognition in computer science [21–23]. In the above physical example the multipoles of higher orders typically account for relatively small contributions in the force field compared to lower order multipoles, and they may be neglected for many practical purposes. In a similar way, higher order multipoles representing small scale details of the shape, can be ignored in the process of describing protein structures when only a rough comparison is needed. Before giving a formal definition of multipoles in general, let us start by discussing the few lower order multipoles which are more familiar and more widely encountered in the research literature.

Quantitatively, the multipoles associated with the space distribution of a scalar property (density of mass in our example) form a sequence of tensors over the three dimensional position space. The multipole of rank zero, or the monopole, is just the space integral of the scalar property. When the scalar property is the mass density then the monopole is the total mass of the set of atoms.

The multipole of rank one is proportional to the position of the center of mass. We will use it to set the origin of the coordinate system with respect to which all multipoles are calculated. Therefore, in our calculations, the multipole of rank one, (the dipole as it is commonly known), is always going to be a null vector. For completeness, we should mention that for the multipoles of a distribution of charge this can not always be done. If the total charge is zero there may be a non-zero dipole moment that can not be made to vanish by a translation of coordinates. This is however a technical problem and it has been addressed before [19].

The multipole of rank two, or the quadrupole has nine Cartesian components. For our discrete distribution, the components are given by the following expression:

{\displaystyle \sum _{\alpha =1}^{N}\left(3{x}_{\alpha ,i}{x}_{\alpha ,j}-{r}_{\alpha}^{2}{\delta}_{i,j}\right)}.\left(1\right)

Here, *δ*_{
i,j
}= 1 for *i* = *j* and 0 otherwise. The sum runs over all *N* C*α* atoms in the structure, *x*_{α,i}is the component *i* of the {\overrightarrow{x}}_{\alpha} position vector (one of the *x, y, z* Cartesian components) and *r*_{
α
}represents the length of the {\overrightarrow{x}}_{\alpha} position vector. For example, the first diagonal component is {\displaystyle \sum \left(3{x}_{\alpha}^{2}-{r}_{\alpha}^{2}\right)={\displaystyle \sum \left(2{x}_{\alpha}^{2}-{y}_{\alpha}^{2}-{z}_{\alpha}^{2}\right)}} and one of the non-diagonal terms is ∑*x*_{
α
}*y*_{
α
}.

For higher order multipoles, enumerating the Cartesian components in closed form is not a simple task. Moreover, there is a large number of symmetry properties obeyed by the Cartesian components and therefore keeping all components of a given multipole would be redundant. Instead, more commonly, the *irreducible spherical components* are defined since they have a compact form when represented in terms of spherical harmonic functions and are independent. This makes them suitable for analytical and numerical calculations. Within the rest of the paper we will use the term multipoles to denote these irreducible components which represent the focus of our attention. We will explicitely name the Cartesian components if needed to distinguish them from the *spherical components*.

For a discrete set of *N* atoms the multipoles of rank *l* are defined as:

{q}_{lm}={\displaystyle \sum _{i=1}^{N}{r}_{i}^{l}{Y}_{lm}^{*}\left({\theta}_{i},{\phi}_{i}\right)}\left(2\right)

where *r*_{
i
}, *θ*_{
i
}, *φ*_{
i
}represent the spherical coordinates of atom *i*, *Y*_{
lm
}denotes a spherical harmonic function and the * denotes complex conjugation. For the definition and summary properties of these functions, see for example [24]. Here, since we will only consider the C*α* atoms, we set the mass of each atom to unity to simplify the notation. For a set of arbitrary atoms, each term in the sum would be weighted by the mass of the atom (or another scalar property in a generalized case).

The rank *l* can take any integer value from 0 to ∞ and for each given *l* the number *m* can take values in the range -*l*...*l*. Then, the number of irreducible components, specified by the index *m*, increases linearly with the rank *l* of the multipole as 2*l* + 1. When all multipoles with rank from 0 to *n* are used, the total number of independent components describing the shape of the protein is (*n* + 1)^{2}. As *n* increases, this number approaches the number of Cartesian components of the position vectors of the C*α* atoms. When the two numbers are equal, the description provided by the set of multipolar components for the structure of the protein is of the same level of detail as the original description offered by the atomic coordinates. We fully recover the amount of information provided by those coordinates. When this happens, from a mathematical standpoint the multipole series is just a coordinate transformation and if not singular, at least in principle, we can transform back and forth from one description to the other.

As a last remark, we will note that the set of multipoles that we use here is only a subset of a larger set which, in its entirety, uniquely describes the potential field surrounding the distribution of charge (for example) [13], for a given set of boundary conditions. The formalism that we present can be extended to include any portion of this complete set of multipoles and this leaves open the possibility for further optimization of a protein comparison process. The reason for retaining this particular subset of multipoles is that they describe the field outside the region occupied by the molecule and therefore they are more likely to correlate with its interaction capabilities and consequently its function.

### Constructing a distance function in the protein conformation space

What makes the set of multipoles defined in Eq. (2) a good set of descriptors for comparison purposes is that they form a series of quantities with remarkable symmetry properties. Specifically, for any given rank *l*, the 2*l* + 1 components *q*_{
lm
}, *m* = *l*, *l* -1, ... -*l* + 1, -*l* form an irreducible tensorial set of order *l* [25]. This means that under regular rotations in the three-dimensional (3D) physical space, these components are transformed according to a well defined induced rotation matrix (see e.g. [26]), in a way similar to the behavior of a 2*l* + 1 dimensional vectorial quantity. The immediate benefit is that one can apply the regular operations with vectors to the multipoles of a given rank and one can construct invariant quantities following the known rules from Euclidian vectors. In particular, if we denote by **q**_{
1
}the set of all components {\left\{{q}_{lm}\right\}}_{m=\overline{-l,l}}, we can define the length of the multipole of rank *l* using the scalar product:

\Vert {q}_{1}\Vert =({\displaystyle \sum _{m=-l}^{l}{q}_{lm}{q}_{lm}^{*}{)}^{1/2}}=({\displaystyle \sum {(-1)}^{m}{q}_{lm}{q}_{l-m}{)}^{1/2}},\left(3\right)

where the last part of the equation follows from the definition of the multipoles and well known symmetry properties of the spherical harmonic functions [24]. This norm can then be used to define a distance between two structures inside the subspace defined by the multipolar components (say **q**_{
1
}, {{q}^{\prime}}_{1}) of a given rank, provided that either the structures have been previously spatially superimposed, or, a "canonical orientation" of the structures has been set in some consistent manner:

*δ*_{
l
}(**q**_{
1
}, **q**_{
1
}**'**) = ||**q**_{
1
}- **q**_{
1
}**'**||. (4)

There is no a priori prescription for combining distances in subspaces of different ranks *l* to construct a global distance function. Such a prescription needs to be extracted from numerical experimentation with the problem to be modeled and/or from more general principles. The alternative discussed here is the result of our tests of the sensitivity and selectivity in discerning protein structures. Since the dimensionality of multipoles differs with *l*, in order to combine distances from different subspaces to construct a global metric one has to first define quantities with the same dimensionality. The solution adopted in this paper, is to redefine the distance in all ranks so that it has the same dimensionality, say dimensionality of length. Except eventually for a general factor (with the dimension of mass in our example), the dimensionality of the multipoles is a power of length equal to their rank. The general factor can be rendered dimensionless by a proper rescaling. Then, one can obtain a quantity with dimension of length by taking an appropriate root of the Euclidian distance as follows:

*d*_{
l
}(**q**_{
1
}, **q**_{
1
}**'**) = ||**q**_{
1
}- **q**_{
1
}**'**||^{1/l}, *l* > 0. (5)

Note that, once the multipole components have been calculated, any reference to the original Cartesian coordinates disappears from the representation. As a consequence, unlike the *rmsd* which requires a one-to-one correspondence between the set of atoms in the structures compared, the distance in Eq. (5) is defined for arbitrary structures, without any restriction with respect to their number or sequence of amino-acids. Therefore no alignment is implied. In practice, a normalization with respect to the "size" of the structures involved may still be necessary. For that purpose, each multipole in Eq. (5) can be separately rescaled with a factor inversely dependent on the "size" of the corresponding molecule. Then, instead of Eq. (5), we will use

The notation we use in this formula for the "size" dependent factors (|*q*_{0}|, |{{q}^{\prime}}_{0}|) is motivated by the fact that the multipole of rank 0 (monopole) is up to a constant numerical factor the "size" of the molecule (the number of atoms for example). When the two structures have the same size, the rescaling of the multipoles in Eq. (6) reduces to a rescaling of the distance (5) by a factor inversely proportional to the common size of the two molecules. This is qualitatively equivalent to the 1/*N* factor in the *rmsd* distance (Eq. (10)). It can be shown that Eq. (6) satisfies the triangle inequality and therefore a global distance function which also satisfies the triangle inequality can be defined by adding the distances (multiplied eventually by a weight factor) for all ranks of the multipoles. Here, we use the following formula:

This function will be used as a dissimilarity measure for proteins in our study. The upper limit in the summation is the maximum rank of the multipoles retained in the representation and determines the dimensionality of the representation and, implicitely, its level of detail.

The interest for reduced representations is manifest in the literature. From a shape perspective, similar to ours, such representations emerge in approaches such as that described in [10, 11]. From a different perspective, starting from individual atomic coordinates and using an averaging approach, an alternative method is presented in [27].

### Defining a canonical reference frame

The multipoles behave like vectorial quantities and the numerical values of their components depend on the location and orientation of the reference frame. For the comparison of structures to be meaningful, we need to either minimize the distance in Eq. (7) over all rigid transformations (translations and rotations) of one of the molecules, or to choose a standard for the reference frame with respect to which the multipolar configurations of the molecules are calculated [8]. Since the second approach is much more efficient for large scale computations, we chose to test this second alternative.

The problem of choosing such a standard arises in many research areas where 3D systems are involved [21], and various schemes can be found in the literature, depending of the research field. A common choice is to select a system of axes that is placed at the center of mass and having its three orthogonal directions along the eigenvectors of a suitable, symmetric matrix (principal axes reference frame). While the location at the center of mass is natural, its orientation as described above is ambiguous. This prescription is not appropriate in our case since it does not uniquely define the axes: any combination of permutations and inversions of the versors of a given principal axes frame form also a principal axes frame. Since we are using multipoles of ranks higher than the quadrupole (rank two), which are sensitive to these various orientations, we can not allow an arbitrary choice. We need a prescription that uniquely defines a frame. Our choice is based on the use of lower order vectorial (magnetic) counterparts of the multipoles introduced above. Specifically, we start by defining the following vectors:

{\overrightarrow{e}}_{1}={\displaystyle \sum _{i=0}^{N-1}\left({\overrightarrow{x}}_{i+1}-{\overrightarrow{x}}_{i}\right)=\left({\overrightarrow{x}}_{N}-{\overrightarrow{x}}_{0}\right)}.\left(8\right)

{\overrightarrow{e}}_{2}=\frac{1}{2}{\displaystyle \sum _{i=0}^{N-1}\left({\overrightarrow{x}}_{i+1}+{\overrightarrow{x}}_{i}\right)\times \left({\overrightarrow{x}}_{i+1}-{\overrightarrow{x}}_{i}\right)}.\left(9\right)

The first vector reduces to the relative position of the last amino acid with respect to the first, while the second one is a more complex quantity that is sensitive to the details of the path of the protein backbone. Except for special cases (for example when the two vectors in Eqs. (8, 9) are not well-defined, or they become parallel), these vectors are independent. Then, they can be orthonormalized and the resulting unit vectors will serve as the first two versors of our *canonical* reference frame. The third one will be their cross product.

The "canonical" reference frame defined by Eqs. (8, 9) is unique by construction. However, other unique definitions can be developed [8]. We are not aware of any rigorous prescription for constructing such a reference frame and therefore our choice remains heuristic.

### Testing the multipole representation

To test our representation of protein structure, we performed a number of calculations with the goal of assessing both its discriminatory power and, where meaningful, its correlation with the Cartesian description.

#### Comparing biologically relevant molecules

As already stated, the use of multipoles opens the possibility of protein shape comparisons without the need for a pre-existing amino acid alignment. However, while technically our representation allows for the comparison of arbitrary collections of atoms, in biological applications, such as protein classification, not any comparison will make sense: we need to restrict the comparison to those portions of the proteins which are relevant to the problem, for example, the functional regions. In principle, the multipoles can be used in identifying corresponding domains in structures, however, as of this moment we do not have fully functional tools to do that. Therefore, as a benchmark for testing the method, we use a manual alignment of the catalytic cores from the protein kinase-like superfamily [20]. The set contains 25 typical protein kinases (TPK) and 6 atypical protein kinases (AK) which phosphorylate non-proteins. As has been shown [20] these diverse structures can be traced to a common ancestor, but today their sequence identity is below 15% in some cases and significant structural changes have taken place, particularly in the C-terminal lobe so that a variety of substrates can be phosphorylated using the same ATP gamma-phosphate transference mechanism. These structures represent an excellent test case for structure recognition since the accurate hand curated alignment provides a valuable benchmark.

We performed an all-against-all comparison of the proteins in the benchmark set, using as input coordinates of C*α* atoms identified as part of the catalytic core [20]. Some of these atoms needed to be omitted because of unresolved portions of some structures as will be discussed later in this section. In this first set of calculations, we did not use any alignment information and there is significant variability in the number of amino acids used for each structure. Eq. (7) was used to generate the distance matrix for all pairs of structures. The results are shown in Figure 1. The ordering of the proteins along the two axis is the one provided by the authors of the set [20]. Therefore, the first six proteins are each a representative of an AK family and the rest are all TPKs. Within the TPKs, the proteins are ordered roughly according to its various groups [28, 29] (Figure 1). This implies grouping of all pairs involving an AK kinase in either the upper or left hand band of six rows or columns, respectively. The shading is such that greater distances map into darker squares. A sharp discrimination between any TPK and any AK family on one hand, and between any pair of AK families on the other hand, is observed. Thus, the multipole representation is capable of distinguishing between the major different families included in the test, without the need for detailed alignment and spatial superimposition.

Even at the subfamily level with relatively little shape discrimination, the distance matrix retains some of its discriminatory power. A close examination reveals distinct patterns along the diagonal corresponding to the various groups of kinases in the test set (Figure 1).

The discriminatory power at the family level is limited by unresolved portions of some structures. The lack of coordinates for parts of the polypeptide chain affects the calculated distances both directly (a missing piece of chain is seen as a difference in shape) and indirectly (a missing piece of chain leads to a different canonical reference frame). To reduce these perturbations, we chose to ignore in our calculations any portion of the alignment corresponding to missing parts in at least one of the proteins in the set. Most unresolved portions are relatively short (approximately 20 amino acids) and do not affect the shape dramatically.

#### Correlation with the Cartesian representation

The multipolar description offers a hierarchical approach to characterizing the shape of a molecule. While at the coarsest level there is no information about the shape, except that defined by the length of the chain, at the most refined level of details (when the number of multipole components is of the same order of magnitude as the original Cartesian coordinates) the description is as rich as the original amino acid coordinate set. At this end of the spatial spectrum we would expect a good correlation with results provided by the Cartesian coordinates. To empirically prove this, we need to devise experiments in which both representations can be applied and then compare the results.

An obvious choice is the comparison of aligned proteins, alignment being necessary for the *rmsd* to be defined. Since as yet we do not have tools for aligning proteins based on their multipole representation, we used the high quality expert alignments provided by the authors of the benchmark set [20]. We performed an analysis of how well distances calculated based on the multipole representation compare with the ones based on the Cartesian coordinates. Two different cases were considered, each defined by how the Cartesian and multipolar representation were calculated in the two proteins.

#### Case 1

In the first case, the *rmsd* distances were calculated based on a prior spatial superposition of the aligned structures (the typical approach for assessing structural similarity). The *rmsd*-distance was calculated using the formula

rmsd=\sqrt{\frac{1}{N}{\displaystyle \sum _{i}{\left({\overrightarrow{x}}_{i}-{\overrightarrow{y}}_{i}\right)}^{2}}}.\left(10\right)

The two vectors in Eq. (10) denote the coordinates of aligned residues. The multipoles of the aligned portions of the proteins were calculated with the coordinates expressed in the canonical reference frame defined by Eqs. (8, 9).

The multipoles of each protein in a given pair were computed from the C*α* coordinates only. The distances were calculated several times, each time retaining a different range of multipoles to analyze the results at different levels of structural detail. The coarsest calculation corresponds to retaining only multipoles up to rank 2 (quadrupole) and the finest one contains all multipoles up to rank 12. For each set of distances obtained in this way we calculated their linear correlation coefficients with the set of *rmsd* values.

The correlation calculations are shown in Figure 2 as a function of the highest rank of the multipoles retained in the description, i.e. of the level of detail. In this way, we have an image of how the level of detail of the representation affects the relationship between the two descriptions. The results in Figure 2 are highly significant. The total number of pairs compared in the correlation set is 465 which places even the lowest point on the curve well below a *p* value of 0.001. It is apparent from the correlation curve that, as expected from a theoretical standpoint, as the rank of multipoles retained in the representation increases, the similarity with the *rmsd* results improves. This tendency saturates after a certain level of detail is reached. If the relationship between the two sets of distances was linear, then the saturation would happen at a value equal to unity. In our case, the definition of the multipoles makes the relationship between the two quantities very non-trivial and therefore the linearity between them is, in general, excluded. This is one of the factors that contribute to the saturation of the correlation curve at values smaller than unity.

In Figure 3 we show side-by-side the distance matrix in the multipole (a) (where *l*_{
max
}= 4, the point of saturation in Figure 2) and Cartesian (b) representations. Again, we see a clear discrimination between the various families in the set, especially in the multipole description. There are a number of extra features inside the TPK family that show up in the Cartesian description and not in our representation. There are also some extra features in the multipole representation, especially inside the AK category, that do not appear in the distance matrix calculated from the Cartesian coordinates. These extra features represent another factor that affects the correlation coefficient. As will be seen from the next calculations, the differences in these details are to a great extent related to the different prescription in the positioning of the proteins, for the purpose of comparison.

For a better understanding of how the two descriptions correlate, we need to analyze more carefully the distances in the two representations. To make the discussion more quantitative, in Figure 4 we show the distances between all pairs of non-identical structures in a subset of the original collection of proteins. The subset contains nine structures: 1bol, 1ia9, 1e8x, 1cja, 1nwl, 1j7u, 1cdk, 1csn and 1ir3. The first six are all AK representatives while the last three all belong to the TPK family. The ordering of the 36 resulting independent pairs is as follows: the first structure in the set is paired with all the rest in order; then the second in the set is paired with the rest, etc. For enhanced visibility, we rescaled the multipoles-based distances (after centering at the mean value) and then shifted the whole set of values by the same amount. Both Figure 3 and Figure 4 show that the multipolar distances follow in general big inter-family variations in the *rmsd* distances, but tend to smoothen the intrafamily variations. This is especially obvious inside the TPK family where more pronounced features are visible in the distance matrix in Figure 3b. A closer examinations of Figure 4 also reveals a clearly better correlation in the region of proteins placed closer in our ordering to the TPK family. This region includes roughly all pairs between the subset of structures from 1cja to the end of the list of representatives. The conclusion of a stronger correlation in this area is reinforced if we recalculate the correlation curve with only this subset of structures. The new curve, shown in Figure 5, saturates at values around 0.9 for the correlation coefficient. The major disconnect between the two representations can be attributed to the left hand series of about 15 points in Figure 4. We can see that there is a high variability in the *rmsd* distances between AT and TPK proteins which is not followed accurately by the multipoles-based distance. With our ordering, these are pairs of less related proteins. In this cases we are in the common situation where the *rmsd* in general has little informative value [30] regarding the similarity of the structures and we need to look at global feature in order to characterize similarity. Shape-based descriptors like ours offer a natural path towards a meaningful extension of the notion of similarity.

#### Case 2

In a second set of calculations, both the multipole and *rmsd*-based distances were calculated in either the "canonical" reference frame, or after spatial superposition. In the second case the superposition was done by minimizing the *rmsd* between the structures. The results are shown in Figures 6 and 7. There is a visibly higher similarity between the two distance matrices than in Figure 3. This is quantitatively confirmed by a recalculation of the correlation coefficients which show a significant improvement over the previous case.

It is clear that, while the multipole description differs in some intrafamily details from the typical *rmsd* results, especially when the latest are calculated with spatial superposition, the results are quite robust in their capacity to discriminate between the families. The level of correlation increases with the level of detail of the representation, i.e. with the number of multipole approaching the dimensionality of the fold space. These latest results, when compared with the previous ones, suggest that the two ways of positioning the structures for the purpose of comparison are not entirely equivalent and the use of a "canonical" positioning produces a more robust clustering of the structures.

## Comments

View archived comments (1)