Datasets
To examine the structure retrieval performance of the proposed methods, we use a dataset of 2337 representative protein structures, which are arbitrarily selected from 185 fold groups defined in a protein classification database by CE (ftp://ftp.sdsc.edu/pub/sdsc/biology/CE/db/ata_3.8_jul-2004.txt.gz). CE is a frequently used protein structure comparison program that compares the Cα positions of proteins using a dynamic programming algorithm. These representative structures have a resolution of 3.0 Å or better, have no more than 10 missing residues in the solved structure, have all heavy atom positions solved, and are longer than 100 residues. In addition, the structural similarity of each pair is below a CE Z-score of 3.8.
This dataset also provides the SCOP classification codes of the proteins, which classify the proteins into 8 class groups, 149 fold groups, 187 superfamily groups, and 279 family groups. We use both the CE-based and SCOP classifications in our study since they have complementary features: the CE classification is automatic, without human intervention, and considers main-chain orientation, while SCOP is manually curated to a certain degree to take evolution into account.
At this juncture, it is important to note that there is no gold standard for the classification of proteins. The structural similarity measured with different representations can differ substantially for distantly related proteins since the representations capture different aspects of the structures [4]. In our previous paper [14], we showed that CE and SCOP do not fully agree and also that DALI [26], which compares distance maps of proteins, agrees less well with CE than the 3DZD does. Each method has its own strengths, and thus an appropriate method should be selected depending on the purpose of each study. We have further shown examples of proteins whose surface shape similarity implies functional similarity and which are not detected by conventional sequence or main-chain structure comparison methods [14]. In this study, we demonstrate that the new main-chain surface representations encoded by the 3DZD agree better with CE and SCOP than the original all-atom surface representation introduced in the previous study [16].
Computing protein surfaces
For each protein structure, four different surface representations are computed: one that uses all heavy atoms (AASurf); the backbone with all main-chain heavy atoms, i.e. Cα, C, N, and O atoms (CACNO); the backbone with Cα, C, and N atoms (CACN); and the backbone Cα atoms only (CA). For the set of extracted atoms, the surface is generated using the MSMS program [27]. MSMS rolls a probe sphere on the atoms and defines the surface as the path of the center of the probe. The radius of the probe sphere is set to the default value of 1.5 Å for AASurf, CACNO, and CACN, and to 2.0 Å for CA to generate a smoother representation. The generated surface is then mapped onto a 3D grid. A grid cell (voxel) is assigned a value of 1 if it is on the surface and 0 otherwise. Because the 3DZD is defined within a unit sphere, the protein surfaces represented by voxels are scaled into a unit sphere; the size information of the protein is therefore lost. The resulting voxels are treated as a 3D function, f(x), which is the input for computing the 3DZD as described in the next section.
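The voxelization and unit-sphere scaling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `voxelize_surface`, the grid size of 64, centering on the centroid of the surface points, and the nearest-voxel rounding are all assumptions, and the surface points are assumed to have been produced beforehand (e.g. by MSMS).

```python
import numpy as np

def voxelize_surface(points, grid_size=64):
    """Map surface points onto a binary cubic grid scaled to the unit sphere.

    points: (N, 3) array of surface point coordinates (hypothetical input).
    grid_size: number of voxels per axis (assumed value, not from the paper).
    """
    points = np.asarray(points, dtype=float)
    # Center on the centroid of the surface points (an assumption).
    centered = points - points.mean(axis=0)
    max_radius = np.linalg.norm(centered, axis=1).max()
    if max_radius == 0:
        raise ValueError("all surface points coincide")
    # Scale so the farthest point lies on the unit sphere; size is lost here.
    scaled = centered / max_radius
    # Convert coordinates in [-1, 1] to the nearest voxel index in [0, grid_size - 1].
    idx = np.floor((scaled + 1.0) / 2.0 * (grid_size - 1) + 0.5).astype(int)
    grid = np.zeros((grid_size,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1  # 1 on the surface, 0 elsewhere
    return grid
```

Because every structure is rescaled to the same unit sphere, two proteins of very different size can yield similar grids, which is why a separate size filter is needed at retrieval time.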
3D Zernike descriptors
The 3DZD is a series expansion of an input 3D function, which allows for a compact representation of the 3D object (i.e. the input 3D function) [17, 28]. The mathematical foundation of the 3DZD was laid out by Canterakis (1999), and it was applied to 3D object retrieval by Novotni and Klein (2003). For the readers' convenience, a brief mathematical derivation of the 3DZD is given below. For detailed derivations and discussions, refer to the papers by Canterakis [29] and Novotni and Klein [30].
The first step of computing the 3DZD is the derivation of the 3D Zernike moments. The 3D Zernike polynomials of order n, degree l, and repetition m are given by
$$Z_{nl}^{m}(r, \vartheta, \phi) = R_{nl}(r)\, Y_{l}^{m}(\vartheta, \phi) \qquad (1)$$
subject to −l ≤ m ≤ l, 0 ≤ l ≤ n, and (n − l) being even. The spherical harmonics, $Y_{l}^{m}(\vartheta, \phi)$, are functions of a polar angle, ϑ, and an azimuthal angle, ϕ. The radial function, $R_{nl}(r)$, incorporates the radial information into the basis function and is constructed so that the $Z_{nl}^{m}$ are polynomials when written in terms of Cartesian coordinates. The 3D Zernike moments of an input 3D function, f(x), are defined as the coefficients of its expansion in this orthonormal basis:
$$\Omega_{nl}^{m} = \frac{3}{4\pi} \int_{|\mathbf{x}| \le 1} f(\mathbf{x})\, \overline{Z_{nl}^{m}(\mathbf{x})}\, d\mathbf{x} \qquad (2)$$
After computing the 3D Zernike moments, a normalization step is necessary to obtain rotation invariance. This is done by taking the L2 norm of the 3D Zernike moments as the descriptor. That is, the moments are collected into (2l+1)-dimensional vectors $\Omega_{nl} = (\Omega_{nl}^{l}, \Omega_{nl}^{l-1}, \ldots, \Omega_{nl}^{-l})^{T}$, and rotational invariance is obtained by defining the 3DZD, $F_{nl}$, as the norm of the vectors $\Omega_{nl}$:
$$F_{nl} = \|\Omega_{nl}\| = \sqrt{\sum_{m=-l}^{l} \left|\Omega_{nl}^{m}\right|^{2}} \qquad (3)$$
The size of the 3DZD vector is set by the parameter n, called the order, which determines the resolution of the descriptor. The 3DZD is a series of invariants (Eqn. 3), one for each combination of n and l, where n ranges from 0 to the specified order. For example, n ranges from 0 to 20 for a 3DZD of order 20. An order of n=20, which yields a total of 121 numbers, or invariants, is used in our study based on the success of the previous works [14, 30]. The last step is to normalize the descriptor by its norm. This normalization was found to reduce the dependency of the 3DZD on the number of voxels used to represent a protein [14]. Figures 1A through D show the surfaces generated from the four representations. Figure 1E shows the 3DZD of the four representations for the protein PDB:1hdmA. It can be seen that the 3DZDs of CACNO, CACN, and CA differ little from the 3DZD of AASurf in this particular case. The correlation coefficients among the three backbone representations (CACNO, CACN, and CA) range from 0.997 to 0.999. The correlation coefficients of CACNO, CACN, and CA with AASurf are 0.934, 0.938, and 0.941, respectively. Although there is little difference between the four representations in this particular example, we will show later that the four representations make a difference in terms of overall database retrieval performance.
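The construction of the descriptor from the moments (one invariant per allowed (n, l) pair, then a final normalization) can be sketched as below. This is an illustrative sketch only: the function names and the `moments` dictionary layout are hypothetical, the moments themselves are assumed to be precomputed, and the final normalization is assumed to use the L2 norm.

```python
import numpy as np

def invariant_indices(order):
    """List the (n, l) pairs with 0 <= l <= n <= order and (n - l) even."""
    return [(n, l)
            for n in range(order + 1)
            for l in range(n + 1)
            if (n - l) % 2 == 0]

def descriptor_from_moments(moments, order):
    """Build a normalized invariant vector from 3D Zernike moments.

    moments: hypothetical dict mapping (n, l) to a complex array of
             length 2l + 1 holding the moments for m = -l .. l.
    """
    # The norm over m of each moment vector gives one rotation-invariant value.
    values = np.array([np.linalg.norm(moments[(n, l)])
                       for n, l in invariant_indices(order)])
    # Final step: normalize the descriptor by its own norm (L2 assumed).
    return values / np.linalg.norm(values)
```

For order 20, `invariant_indices(20)` yields 121 pairs, matching the descriptor length stated in the text.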
Evaluating database retrieval performance
The database retrieval performance of the four surface representations is evaluated with precision-recall curves. Precision-recall curves are often confused with receiver operating characteristic curves. Although the two curves are related, the precision-recall curve is considered a better measure when the dataset is skewed [31]. The number of proteins in a group in our dataset varies from 3 to 180, and thus the precision-recall curve is used here. For each query protein in the dataset, the remaining proteins are sorted by the Euclidean distance (L2 norm) between their 3DZD and that of the query. Then, the precision and recall values are computed at each distance threshold. The precision is defined as the fraction of retrieved proteins that belong to the same group as the query among all proteins retrieved within the distance threshold. The recall is defined as the fraction of retrieved proteins that belong to the same group as the query among all proteins in that group. Finally, we calculate the average precision and recall at each distance threshold. The precision-recall curves of the different representations are compared by the area under the curve (AUC).
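The per-query evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and argument layout are hypothetical, precision and recall are computed at each rank cutoff (each cutoff corresponds to one distance threshold), and the averaging over queries and the AUC computation are left out.

```python
import numpy as np

def precision_recall(query_index, descriptors, groups):
    """Precision and recall at every rank cutoff for one query.

    descriptors: (N, D) array, one descriptor per protein.
    groups: length-N sequence of group labels (e.g. fold identifiers).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    groups = np.asarray(groups)
    # All proteins except the query itself.
    others = np.array([i for i in range(len(groups)) if i != query_index])
    # Euclidean distance between each remaining descriptor and the query.
    dist = np.linalg.norm(descriptors[others] - descriptors[query_index], axis=1)
    ranked = others[np.argsort(dist, kind="stable")]
    relevant = (groups[ranked] == groups[query_index]).astype(float)
    if relevant.sum() == 0:
        raise ValueError("query has no other member in its group")
    hits = np.cumsum(relevant)                        # same-group proteins retrieved so far
    precision = hits / np.arange(1, len(ranked) + 1)  # divided by all retrieved so far
    recall = hits / relevant.sum()                    # divided by all same-group proteins
    return precision, recall
```

Here the group size used for recall excludes the query itself, which is one possible convention and may differ from the one used in the study.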
As in the previous work [14], we also pre-filter the proteins by their sequence length. For a query, a protein in the database is filtered out if it is longer than 135% or shorter than 65% of the length of the query protein. This is done because the size information of the proteins is lost when computing the 3DZD, since the proteins are scaled to fit into a unit sphere.
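The length pre-filter amounts to a simple range check, sketched below. The function name and the treatment of the exact boundary values as passing are assumptions.

```python
def passes_length_filter(query_length, candidate_length, lower=0.65, upper=1.35):
    """Keep a database protein only if its length is within 65% to 135% of the query length."""
    return lower * query_length <= candidate_length <= upper * query_length
```

For a query of 200 residues, candidates between roughly 130 and 270 residues would be kept and compared by their 3DZD, while all others would be skipped.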
Combining 3DZD of AASurf and CACNO
We also examine database retrieval with combinations of the 3DZDs of AASurf and a backbone surface representation. Among the three backbone representations, we choose CACNO since no significant difference in performance was observed among the three (see Results). CACNO is also a natural choice since it is the full heavy-atom representation of the protein backbone. For the AASurf and CACNO combination, the distances measured independently are linearly combined with weighting factors:
$$d(y, x) = w_{yS} \sqrt{\sum_{i=1}^{m_1} \left(y_i^{S} - x_i^{S}\right)^{2}} + w_{yB} \sqrt{\sum_{i=1}^{m_2} \left(y_i^{B} - x_i^{B}\right)^{2}} \qquad (4)$$
where y and x are the two proteins compared and i is the index of the 3DZD invariants of AASurf, S, and CACNO, B. $w_{yS}$ and $w_{yB}$ are the weights for AASurf and CACNO of the query protein y, and $m_1$ and $m_2$ are the numbers of invariants in the 3DZD of AASurf and CACNO, respectively. In this study, the 3DZDs of AASurf and CACNO are set to the same size, i.e. $m_1 = m_2 = 121$. Eqn. 4 is asymmetric since the weights $w_{yS}$ and $w_{yB}$ depend on the query protein, y.
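The weighted combination can be sketched as below, assuming that each of the two independently measured distances is the Euclidean distance used elsewhere in this study. The function and argument names are hypothetical, and the weights are passed in directly rather than derived from the query shape.

```python
import numpy as np

def combined_distance(query_s, query_b, target_s, target_b, w_s, w_b):
    """Linear combination of AASurf and CACNO descriptor distances.

    query_s, target_s: AASurf descriptors of the query and the database protein.
    query_b, target_b: CACNO descriptors of the query and the database protein.
    w_s, w_b: weights chosen for the query protein.
    """
    # Euclidean distance between the AASurf descriptors.
    d_s = np.linalg.norm(np.asarray(query_s, dtype=float) - np.asarray(target_s, dtype=float))
    # Euclidean distance between the CACNO descriptors.
    d_b = np.linalg.norm(np.asarray(query_b, dtype=float) - np.asarray(target_b, dtype=float))
    return w_s * d_s + w_b * d_b
```

Because the weights belong to the query, swapping the roles of query and database protein can change the result, which reflects the asymmetry noted in the text.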
The weights for AASurf and CACNO for a query protein are determined by two characteristics of its shape: 1) the existence of a tail-like structure and 2) the sphericity. A tail is an elongated region of the structure that is longer than three amino acids and located farther than twice the radius of gyration (RG) of the protein from its center of gravity. The radius of gyration is defined as follows:
where N is the number of atoms in protein $x_j$, cog is the center of gravity of protein $x_j$, and R is the approximate radius of atoms, for which 1.5 Å is used [32].
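A tail check along these lines can be sketched as follows. Several assumptions are made and should be read as such: the radius of gyration here is the conventional root-mean-square distance from the unweighted centroid and does not include the atomic radius term R, each residue is represented by a single position, and a tail is taken to be a run of at least four consecutive residues beyond twice the radius of gyration. The function names are hypothetical.

```python
import numpy as np

def radius_of_gyration(coords):
    """Conventional RMS distance from the unweighted centroid (atomic radius R not included)."""
    coords = np.asarray(coords, dtype=float)
    cog = coords.mean(axis=0)
    return np.sqrt(((coords - cog) ** 2).sum(axis=1).mean())

def has_tail(residue_coords, min_length=4, factor=2.0):
    """Detect a run of consecutive residues lying far from the center of gravity.

    residue_coords: (N, 3) array with one representative position per residue,
                    in chain order (hypothetical input).
    min_length: 4 encodes "longer than three amino acids".
    """
    residue_coords = np.asarray(residue_coords, dtype=float)
    cog = residue_coords.mean(axis=0)
    cutoff = factor * radius_of_gyration(residue_coords)
    far = np.linalg.norm(residue_coords - cog, axis=1) > cutoff
    run = 0
    for flag in far:
        run = run + 1 if flag else 0  # length of the current run of distant residues
        if run >= min_length:
            return True
    return False
```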
The sphericity measures how compactly a protein structure fits to a sphere:
where RS(x) is the radius of a sphere that has the same volume as the protein all-atom surface representation computed by the MSMS program. A larger value indicates that the protein is more spherical.
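The radius of a sphere with the same volume as the protein surface follows directly from the volume of a sphere, V = (4/3)πr³. A minimal sketch, assuming the enclosed volume of the all-atom surface is already available as a number (the function name is hypothetical, and the sphericity formula itself is not reproduced here):

```python
import math

def equivalent_sphere_radius(volume):
    """Radius of the sphere whose volume equals the given volume."""
    return (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
```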