A new protein binding pocket similarity measure based on comparison of clouds of atoms in 3D: application to ligand prediction
- Brice Hoffmann†1, 3, 4,
- Mikhail Zaslavskiy†1, 2, 3, 4Email author,
- Jean-Philippe Vert1, 3, 4 and
- Véronique Stoven1, 3, 4
© Hoffmann et al; licensee BioMed Central Ltd. 2010
Received: 27 August 2009
Accepted: 22 February 2010
Published: 22 February 2010
Predicting which molecules can bind to a given binding site of a protein with known 3D structure is important to decipher the protein function, and useful in drug design. A classical assumption in structural biology is that proteins with similar 3D structures have related molecular functions, and therefore may bind similar ligands. However, proteins that do not display any overall sequence or structure similarity may also bind similar ligands if they contain similar binding sites. Quantitatively assessing the similarity between binding sites may therefore be useful to propose new ligands for a given pocket, based on those known for similar pockets.
We propose a new method to quantify the similarity between binding pockets, and explore its relevance for ligand prediction. We represent each pocket by a cloud of atoms, and assess the similarity between two pockets by aligning their atoms in the 3D space and comparing the resulting configurations with a convolution kernel. Pocket alignment and comparison is possible even when the corresponding proteins share no sequence or overall structure similarities. In order to predict ligands for a given target pocket, we compare it to an ensemble of pockets with known ligands to identify the most similar pockets. We discuss two criteria to evaluate the performance of a binding pocket similarity measure in the context of ligand prediction, namely, area under ROC curve (AUC scores) and classification based scores. We show that the latter is better suited to evaluate the methods with respect to ligand prediction, and demonstrate the relevance of our new binding site similarity compared to existing similarity measures.
This study demonstrates the relevance of the proposed method to identify ligands binding to known binding pockets. We also provide a new benchmark for future work in this field. The new method and the benchmark are available at http://cbio.ensmp.fr/paris/.
Predicting which molecules can bind to a given binding site of a protein with known 3D structure is important to decipher the protein function, and useful in drug design to identify drug precursors or predict potential side effects if a drug candidate is predicted to bind to many protein pockets. A classical assumption in structural biology is that the 3D structure of a protein is related to its molecular function, i.e., the nature of its partner molecules. Most available programs for structure visualization provide tools for 3D structure superposition and comparison, which may help to predict the nature of a protein ligand from those of other proteins with overall similar 3D structure . However, proteins that do not display any overall sequence or structure similarity may present similar binding sites, and consequently also share similar ligands. Therefore, comparison of binding pockets is a more appropriate approach in order to predict if two proteins bind similar ligands , and many ligand prediction methods rely on local 3D comparisons at the binding site, using various ways to perform the comparison. For example,  compared pockets described with real spherical harmonic expansion coefficients,  used a specialized geometric hashing procedure as the core of the SitesBase web server,  developed a method that detects multiple common sets of points. An approach proposed by  is based on the representation of binding pockets by triangle-discretized spheres.  and  considered graph-based representations of binding pockets and applied graph matching algorithms. Finally, [9, 10] combines the identification of a binding site on a whole protein 3D structure and its comparison to a reference binding site, using a geometric hashing procedure.
Our contribution in this paper is twofold. First, we propose a new similarity measure to compare binding pockets of proteins. For that purpose, we represent a binding pocket by a cloud of atoms in the 3D space, potentially baring labels such as partial charges or atom types. The method relies on the modeling of local protein structures are rigid bodies, and we therefore represent a protein pocket as a cloud of points with fixed relative positions. The new similarity measure is based on a convolution kernel between clouds of points, and allows to align protein pockets. The method provides a superposition of two pockets even if their corresponding proteins present no overall sequence or 3D structure similarity. One important difference between this approach and most existing methods is that it does not require any pairwise matching of atoms (or superatoms), or residues, in order to compare protein binding pockets. Instead we attempt to capture the similarity of atom densities in the 3D space. This confers smoothness properties to the proposed similarity measure. Second, we propose to use a classification method to predict ligands for target pockets according to their similarity scores with a set of pockets with known ligands. This approach is able to handle the difficult case where different families of pockets binding the same ligand are present. This may be observed when the ligand is flexible and can be bound in various conformations by pockets displaying different topologies.
An important question debated is how to compare the quality of similarity measures. We underline that it is not possible to define an intrinsic quality for a similarity measure, because there is no absolute reference. Similarity measures can only be compared according to the question of interest.
Here, we evaluate quality of similarity measures with respect to their ability to predict a ligand for a pocket. Although the area under ROC curves (AUC scores) are commonly used , we show that classification-based scores better compare the performances of similarity measures for ligand prediction.
We test our method on a benchmark proposed by other authors, in order to compare our new method to other published algorithms. We also test the methods on a new benchmark containing non redundant protein pockets binding ligands of similar sizes, typical of that of drug molecules, corresponding to a more realistic problem. We provide this new dataset as a publicly available benchmark.
Convolution kernel between clouds of atoms
In our model, a binding pocket is described by a set of atoms in the 3D space. Our objective is to construct a similarity measure between pockets, which may be used to identify pockets binding the same ligand.
Let denote a binding pocket consisting of N atoms, where x i ∈ ℝ3 is a 3D vector representing atom coordinates, and l i is a label (discrete or real valued) that may be used to store additional information on the atoms (for example, atom type, atom partial charge, or amino acid type).
A classical approach for pocket comparison is to iteratively align two pockets and further count the number of overlapping atoms, usually within a tolerance of 1 Å. Different implementations of this principle can be found in such methods as the Tanimoto index , the SitesBase algorithm (Poisson index, ), or the MultiBind algorithm . The alignment is made to maximize the number of overlapping atoms, which is generally a good indicator of pocket similarity. However, atoms may have different positions but play equivalent roles in ligand binding (for example, the side chain of a basic residue may bind a phosphate group of an ATP molecule from different positions), and the role of one atom in one pocket may be played by a group of atoms in another one. These observations suggest the idea of an alternative smooth score which would not count the number of overlapping atoms, but rather use a weighted number of atoms having similar positions. We first consider the case where labels are ignored, and only atom coordinates are used to measure the similarity between pockets. Then, we explain how the information on atom labels may be introduced in the new similarity measure.
The parameter σ characterizes the sensitivity of the similarity measure (1) to points relative displacements. When σ is small, only atoms which are close to each other significantly contribute to K(P1, P2). On the contrary, when σ is large, almost all pairs of atoms contribute to K(P1, P2). The kernel (1) is an example of a convolution kernel [14, 15] between sets of points. Alternative kernels may be constructed by substituting the Gaussian kernel by any other kernel between 3D vectors x i and y j
This probabilistic interpretation shows that, intuitively, the similarity score is preserved as far as the corresponding mass distributions are stable. It is therefore robust to small uncertainty in 3D coordinates.
Similarly, an approximation of the optimal rotation matrix R is the rotation that superposes the first principal axis of P2 with the first principal axis of P1, the second one with the second one, and the third one with the third one. Since principal vectors are defined up to a sign, the two signs for all principal vectors of one of the pockets have to be tested (there are 23 = 8 combinations, each combination defining one initial point). If some of the pocket axes have close lengths, it may also be interesting to consider rotations which superpose the first principal axis of one pocket with the second principal axis of the other one.
Since the set of rotation matrices is a 3D manifold embedded in 9D space, we cannot differentiate (1) with respect to each element of matrix R.
This optimization step defines the best pocket superposition, according to the sup-CK similarity measure.
As mentioned above, it may be interesting to use additional information on binding pocket atoms, such as atom types or charges. Let us suppose that this information is represented by labels l i (which may be discrete or real variables, or multidimensional vectors) and that it is associated to a similarity measure. For example, to measure the similarity between categorical labels like atom types, one can use the Dirac function . In our experiments, we used atom partial charges as atom labels, with a Gaussian kernel . Of course, other similarity measures may be employed.
where parameter λ controls the sensitivity of our measure to atom labels, for example to partial charges. When λ is large, the impact of labels is negligible, which corresponds to a purely geometrical approach. When λ is close to zero, only pairs of atoms which have the same partial charge contribute to our measure. In general, the smaller λ, the greater the contribution of the atom labels to the binding pocket similarity measure. Since the function K L does not depend on R and y t in (5), the same optimization procedure for pockets superposition can be used to optimize (3) or (5). Finally, it is important to notice that the sup-CK measure of similarity can be used to compare any set of atoms in 3D. As mentioned in the introduction section, the superposition method and the similarity measure may be applied to superpose and compare pockets, even when they belong to proteins displaying no sequence and no overall structure similarity. This point will be illustrated in Results on the example of two unrelated ATP binding proteins.
In the following, we briefly recall the principals of a few other methods proposed to measure similarity between pockets, because we compare them to the sup-CK method defined in the present study.
Spherical harmonic decomposition (SHD)
 proposed to model pockets by star-shapes built using the SURFNET program. The star-shape representation is defined by a function f (θ, ϕ), representing the distance from the pocket center to the pocket surface for a given (θ, ϕ),. To measure the similarity of binding pockets P1 and P2, the corresponding functions f1 and f2 are first decomposed into spherical harmonics, and the pocket similarity is then computed as the standard Euclidean metric between vectors of decomposition coefficients.
 presented three different variants of SHD, using only the shapes or the sizes of the binding pockets (keeping only the zero-th order in the spherical harmonics expansion), and their combination. In the Results section, we recall the results that they obtained with the combination, because it provided the best performance.
Poisson index (sup-PI)
As mentioned in the Introduction, many binding pockets similarity measures are based on pocket alignment with further counting of overlapping atoms. This kind of approach is used in the Poisson index model . More precisely, the Poisson index model is based on a normalized number of overlapping atoms , where L is the number of overlapping atoms, and #P1 and #P2 are the numbers of atoms in P1 and P2, respectively. The PI score may be computed for any pocket superposition method. While  used the geometric hashing algorithm, we used the superposition made by the sup-CK method, with further superposition refining to maximize the number of overlapping atoms.
 represents pockets by pseudo-atoms labeled with physico-chemical properties. Pockets are aligned using a geometric hashing technique. This algorithm was mainly designed for multiple alignment of binding sites, but it may be used for pairwise alignment of pockets, as performed in this study.
Other simple methods
We also consider two simple methods based on the comparison of simple binding pockets characteristics. These methods represent each pocket by an ellipsoid constructed on the basis of the pocket's principal axis. The first one, referred to as Vol, estimates the similarity between pockets P1 and P2 by the absolute value of the difference between the volumes of their corresponding ellipsoids:
Vol(P1, P2) = |Vol(P1) - Vol(P2)|. The second one, called Princ-Axis, estimates the similarity score between pockets by , where and are the lengths of the three principle axis of pockets P1 and P2, respectively.
Combination of sup-CK and Vol
Since volume information was found to be important by , we also tested a linear combination of the sup-CK and Vol methods, called sup-CK-Vol, where the coefficient of linear combination is learned as other model parameters (σ, λ, or the distance cutoff R discussed in the Datasets section) in the double cross validation scheme. This linear combination takes advantage of the Vol method to separate pockets binding ligands of very different sizes like PO4 and NAD, and of the sup-CK algorithm to allow finer discrimination.
To compare our method based on local 3D similarity to a simple and classical approach based on sequence comparison, we conducted a pairwise alignment of all protein sequences for the different datasets, in order to build a matrix of distance between proteins. This matrix was built with the algorithm of Needleman and Wunsch, using the default settings [17, 18].
There are various ways to measure the similarity between binding pockets, and some of them were discussed in the previous section. To evaluate the quality of a given similarity measure, one may compare it to some "ideal" similarity measure, but the problem is that such measure does not exist. As an example, if two alternative measures SM1 and SM2 compare two pockets P1 and P2 so that SM1(P1, P2) = 0.3 and SM2(P1, P2) = 0.4, there is no way to decide which one is the best, because we do not have any absolute reference. The choice of the optimal measure, thus, depends on the problem of interest. In the context of ligand prediction, the quality of a similarity measure can be evaluated according to its ability to cluster together pockets that bind the same ligand. This can then help to predict ligands for previously unseen pockets. To evaluate this clustering ability, we considered two different scores.
These remarks lead us to employ another quality score based on a classification error. To estimate the quality of the similarity measure SM, we try to predict a ligand (class) for each pocket from that of its neighbors. The smaller the classification error (proportion of bad predictions), the better the similarity measure. In this work, we used a K nearest neighbors (KNN) classifier. To evaluate the classification error, we applied a leave-one-out double cross validation methodology. Namely, each pocket P from the dataset is considered one by one, and all other pockets are used as the training set. Parameters of the model (K -- number of neighbors, σ and λ in the case of the sup-CK method) are estimated on the training set via cross-validation technique, and the class (i.e. the ligand) of the pocket P under consideration is predicted using the training set and the estimated parameters of the model. More precisely, in the case of a dataset containing 100 proteins, double cross validation is performed according to the following scheme: each of the 100 pockets is extracted in turn from the dataset in a leave one out procedure. Then, each of the other 99 pockets is selected in turn and its class is predicted from the 98 remaining pockets. This operation is repeated for different values of σ and λ, and the σ* and λ* values providing the highest number (over 99) of well predicted pockets are retained and used to predict a class for the initially extracted pocket. Note, that all datasets contained proteins that presented less than 30% global sequence identity , to ensure that there were no duplicates or very close elements in the datasets. This allowed to use a leave-one-out scheme without risk of bias.
For all protein structures, binding pockets were extracted as follows: protein atoms situated at less than R Å of one of the ligand atoms were selected, where R is a parameter of the model (as the number of neighbors k, or the σ and λ parameters), and is also learned in the double cross-validation scheme. In most cases, the optimal value of R was found to equal to 5.3 Å, a value which was retained in this study. However, experiments where R is varied are also presented in the discussion section. Finally, pockets are represented by 3D clouds of atoms labeled by their partial charge, attributed according to the GROMACS (FFG43a1) force field . Atom partial charges were assigned according to the protein structure alone, in absence of the ligand. However, the presence of a ligand would potentially modify these calculated charges, but this could not be taken into account since the method aimed at predicting the ligand. Other labels representing chemical properties such as amino-acid type, hydrogen donor or acceptor, or hydrophobic atom could be included.
We considered three benchmark datasets. The first one, proposed by  and referred to as the Kahraman dataset, comprises 100 protein crystal structures in complex with one of ten ligands (AMP, ATP, PO4, GLC, FAD, HEM, FMN, EST, AND, NAD). The second one is an extended version of the Kahraman dataset (called extended Kahraman Dataset below), in which we added protein structures in complex with one of the same ten ligands, leading to a total of 972 crystal structures (see Additional file 1). The added proteins presented pairwise sequence identities less or equal to 30%, to avoid potential bias by inclusion of close homologues.
Ligands descriptors for the Kahraman dataset
28.3 ± 15.0
420.7 ± 222.8
7.5 ± 5.1
4.4 ± 3.2
5.2 ± 4.9
Ligands descriptors for the homogeneous dataset
19.9 ± 4.0
296.4 ± 61.5
5.6 ± 3.0
3.9 ± 2.9
8.7 ± 3.5
All methods were tested on three datasets described in the Data section. The performance of all methods is evaluated on the basis of the AUC score and of the classification error (see Performance criteria). The sup-CK method is compared to sup-PI, SHD, Vol, Princ-Axis and MultiBind algorithms (see Related methods). Among the pocket extraction methods used in the SHD approach, we considered the results corresponding to the Interact Cleft Model, which is similar to our pocket extraction method, and allows to compare the sup-CK and SHD approaches. Algorithms, benchmark datasets and distance matrices for the SupCK method are available at http://cbio.ensmp.fr/paris/.
Performance on the Kahraman benchmark
0.858 ± 0.14
0.861 ± 0.13
0.889 ± 0.14
sup-CK L -Vol
0.895 ± 0.12
0.875 ± 0.14
0.853 ± 0.13
0.815 ± 0.13
0.715 ± 0.17
0.55 ± 0.08
To evaluate the classification error, we tried to predict a ligand (a class) for each pocket using the k-nearest neighbors classifier (see Performance criteria). Note that in a ten class (10 ligands) classification problem, a random classifier would have an error of 0.9, which represents baseline performance for all classifiers.
Table 3 shows that methods with higher AUC scores tend to have smaller classification errors, but this correlation is not strict. For example, the SHD and Vol methods have the same classification error, although the latter displayed a better AUC score than the former. Conversely, the sup-CK and sup-CK L -Vol methods had similar AUC scores, but the latter performs much better than the former in terms of classification error. This indicates that the AUC score is not appropriate to compare the quality of similarity measures with respect to the problem of ligand identification, and underlines the interest of the classification approach.
The sup-CK and sup-CK-Vol algorithms have lower classification errors than other methods, which means that they are well suited to the problem of ligand prediction. Interestingly, atom partial charges information significantly reduces classification errors of both methods, which was not the case for AUC scores. The use of additional atom labels such as amino-acid type, hydrogen donor or acceptor, or hydrophobic atom may again improve the quality of ligand prediction.
No method reaches the AUC score of 1.0, or perfectly predicts the ligands. Several remarks might explain this fact. First, pockets have to be extracted from the protein structure. Whatever the employed method might be, it is difficult to extract all atoms interacting with the ligand, and only these atoms. In particular, atoms that do not interact with the ligand might have been included in the pockets, which could reduce the observed similarity between pockets that bind this ligand. Second, ligands are flexible molecules that can adopt different conformations. Therefore, protein pockets that bind the same ligand may display various shapes. In such situations, correct prediction is still possible if the learning dataset contains pockets in which the ligand conformations correctly samples its accessible conformational space. The present dataset contains only 10 pockets per ligand, which might be too small for the most flexible ligands.
When analyzing results in Table 3, one must remember that the Vol and Princ-Axis methods do not require pockets superposition, while all other methods do. The superposition algorithms of the latter are different, which contributes to the observed scores. However, the sup-PI and sup-CK methods only differ by their similarity measures. After superposition, sup-PI requires to determine the number of overlapping atoms, while sup-CK relies on a weighted number of atoms having close positions. This seems to confer some smoothness properties to the latter, and robustness with respect to variations observed between pockets binding the same ligand.
Moreover, similar residues, playing equivalent roles in ATP binding, are found in equivalent positions in the superposed structures. In particular, N951 and K807 interact with the γ phosphate of ATP in PDB: 1E8X and are found close respectively to K288 and H236 that play the same role in PDB: 1DV2. We also observe that, K833 interacting with the β and α phosphates of ATP in PDB: 1E8X, is found close to K116 in PDB: 1DV2 after pockets superposition. These residues form equivalent positively charged regions, as shown in Figure 3c. Similarly, the hydrophobic region interacting with the adenine ring of ATP in PDB: 1E8X and involving residues W812, I831, I879, I881, V882, A885, M953, F961, and I963 is equivalent to the hydrophobic region involving residues V131, V156, I157, L204, L278, I287, I437 in the superposed PDB: 1DV2 structure. These hydrophobic patches overlap after pockets superposition, as shown in Figure 3d. Overall, these observations indicate that the sup-CK algorithm proposed a reasonable superposition for these two unrelated ATP-binding pockets.
Extension of Kahraman dataset
To evaluate the ability of the sup-CK method to improve its performance when trained on a larger dataset, we considered an extension of Kahraman dataset consisting of 972 of non redundant pockets that bind one of the 10 ligands of the original dataset (see Data). Therefore, the new dataset consists of 100 Kahraman pockets and 872 new pockets from the PDB.
Classification error on the extended Kahraman benchmark
sup-CK L -Vol
However, sup-CK L provides the best performance. The quality of its predictions might again improve by including more structures available at the PDB. Column B presents the results on the 100 original pockets extracted from those presented in column A. It shows that 79% of the binding pockets of the original Kahraman dataset were correctly classified by sup-CK L , compared to 73% when they were classified using only the original dataset (a classification error of 0.27 in Table 3). Finally, column C shows the prediction errors for the 872 new pockets when the 100 original pockets are not used in the leave one out procedure. The scores in this column may be seen as a test on an external independent dataset (as mentioned above, the optimal parameters σ and λ used here were those learned only on the 100 original pockets). It shows that the performance of the sup-CK methods does not degrade on the 872 new pockets, and remains above those of the other methods.
Homogeneous dataset (HD)
The Kahraman dataset contains ligands of very different sizes, which might not be typical of real problems. Therefore, we built the Homogeneous dataset because it was important to test methods on a benchmark containing pockets binding ligands of more similar sizes.
Performance on the HD benchmark
0.710 ± 0.19
0.752 ± 0.16
0.722 ± 0.18
sup-CK L -Vol
0.766 ± 0.17
0.648 ± 0.15
0.650 ± 0.18
0.702 ± 0.19
0.69 ± 0.14
0.577 ± 0.09
The best AUC score is obtained by the sup-CK L -Vol algorithm. The AUC scores of all other methods are significantly lower according to the Wilcoxon signed-rank test (see Figure 2b), except sup-CK L . Indeed, volume information only provides a slight improvement of 1%, compared to 3% on the Kahraman dataset. On the contrary, information on partial charges leads to an improvement of 4% for the sup-CK and sup-CK-Vol algorithms, which was not observed on the Kahraman dataset. This shows that addition of physico-chemical information is critical to better compare pockets of similar sizes. The lowest classification errors are obtained by the sup-CK L and sup-CK L -Vol algorithms, which again shows that volume information is not critical on this benchmark. On the contrary, partial charge information leads to an improvement of 9% between sup-CK and sup-CK L , and of 8% between sup-CK-Vol and sup-CK L -Vol.
Computer vision methods
An important topic is the relation between methods for binding pockets comparison and algorithms in field of computer vision for comparison of 3D shapes. A complete review of 3D shapes comparison methods is out of scope of this article, and interested readers may consult  for a detailed review. Interestingly, most of the existing methods for binding pocket comparison have an analogue in the domain of computer vision. For example, methods based on real spherical harmonic expansion used in  for binding pocket comparison are also discussed by [22, 23] in the context of general 3D shape matching. Principles used in another popular method for matching and comparison of 3D forms, called Iterative Closest Point algorithm , and its variants are used in Poisson index and MultiBind algorithms. Examples of approaches based on graph representation of 3D forms and graph matching methods may be found in  for binding pockets comparison, as well as in  for 3D shapes comparison. Nevertheless, binding pockets are not continuous shapes but discrete clouds of points. They can be transformed into 3D shapes [2, 3], but this transformation may be a source of noise. Moreover, a similarity measure between binding pockets should be rotationally and translationally invariant, which is not always the case in computer vision methods. However, we believe that the adaptation of appropriate methods may be very fruitful for the recognition of binding pockets.
Choice of optimal parameters
Robustness of the method with respect to pockets definition
It is important to discuss the impact of using the R parameter, a cutoff distance used for pocket definition. This could lead to situations where an atom is excluded from the pocket in one protein, when a similar atom is included in the pocket of another protein to which it is compared. However, as briefly mentioned in the background section, the principle of the method is to compare pockets based on the optimal superposition of their clouds of atoms. The method does not define or use pairwise matching of atoms of the two pockets, as most other available methods do. Figure 4 illustrates this point: the method did not lead to local pairwise superposition of blue and red points, but rather proposed a global superposition of the red and blue atoms densities. Therefore, the method is expected to be robust with respect to potential inclusion or exclusion of a small number of atoms in one of the pockets. As mentioned in the above paragraph, the fact that the performance of the method remains interesting when R varies between 4.5 and 8 Å is also an indirect illustration of this idea. One could wonder if the use of atom labels such as partial charges would decrease the robustness of the method with respect to pockets definition using R. Indeed, a cutoff distance could split a strong dipole in one of the proteins, and not in the other (for example an N-H group). However, the addition of atom labels like partial charges is only one option of the method. Results using only atom positions (corresponding to a pure geometrical approach) already show good performances. Addition of partial charges labels still improves the results, despite the risk that strong dipoles might have been cut. This can probably be explained by the facts that such events are rare, and that the method searches an overall best superposition of atoms densities, despite possible local mismatches in atoms positions or labels. Nevertheless, it would be interesting to explore other cutoff criteria taking atom labels into account (including other types of labels such as hydrogen bond acceptor, donor,...), in future developments of the method.
We did not tackle the problem of pocket detection, which relies on totally different algorithms than those discussed in this paper, and which was out of the scope of the present study. However, the similarity measured between two pockets strongly depends on pocket definition. We extracted pockets as the set of all protein atoms within about 6 Å of the bound ligand. Similar approaches were used by  (Interacted Cleft Model), and similar pockets may also be retrieved by methods like Q-SiteFinder without any information on ligand coordinates. Another alternative could be to employ one the various programs that have been developed to locate depressions on protein surfaces, particularly in the case where no holo structure is available , or in the case of orphan proteins for which the ligand and the binding site is unknown. However, existing pocket extraction algorithms have difficulty to define the rim of a binding pocket, and tend to extract protein cavities that are larger than the binding pocket itself, as defined by the ensemble of residues involved in ligand binding. Although we observed that our method had some robustness with respect to the definition of the binding pocket, global similarity measures like those proposed in this paper may loose some performance on automatically extracted pockets.
In the Result Section, we showed the example of the PDB: 1E8X and PDB: 1DV2 unrelated structures, binding ATP in similar conformations, and whose pockets were correctly superposed by the sup-CK method. In the case of even more dissimilar pockets, binding ATP in different conformations, sup-CK still allows superposition of the pockets so that similar regions overlap. For example, rabbit muscle pyruvate kinase (530 residues in PDB: 1A49) and E. coli 7,8-dihydro 6-hydroxymethylpterin pyrophosphokinase (158 residues in PDB: 1DY3) of the Kahraman dataset have no sequence or structures homologies, and bind ATP in different conformations as shown in Figure 3a. However, according to the sup-CK superposition of these two pockets, shown in Figure 3c, the two patches of hydrophobic residues that interact with the adenine ring of ATP are found to overlap. Note that these two pockets where correctly classified by sup-CK (an ATP ligand was correctly predicted), on the basis of other more similar pockets present in the dataset.
Nevertherless, a reasonable pocket superposition for these highly different proteins with significant pockets deviations, was proposed by the sup-CK method.
The sup-CK algorithm had a good performance in ligand prediction for holo structures. It also showed robustness with respect to atom displacements. This is an important characteristic for future application of the method to real case studies where the ligand is unknown, and one must extract pockets from apo structures. Local structural rearrangements are frequent upon ligand binding, and methods displaying some smoothness with respect to atoms positions are required when working with apo structures. This would also be necessary for proteins with no available experimental structure but for which a homology model can be constructed, since the modeled pocket may somewhat differ from the true, but unknown, pocket. We expected that, for large flexible ligands, the performance of the sup-CK method might decrease, but this was not observed for the two datasets that we used (Kahraman dataset and Homogeneous dataset). However, we cannot rule out the possibility that this could be observed if the method is trained on other small training datasets.
The running time of the sup-CK method depends on the value of the stopping criterion used in the gradient ascent method, and on the number of atoms. In our experiments, the algorithm running time varied between 0.2 and 1.3 seconds (2.5 GHz CPU) per pockets pair. This running time is already quite reasonable to process large protein databanks. The method is presented on datasets of moderate sizes because our aim was to validate the methodology. However, it can be applied on ligand prediction problems, where the number of pockets (and ligands) included in the learning dataset needs to be larger. For future applications in the domain of screening using all ligands available in the Protein Data Bank, a pre-filtering on the basis of simple pocket descriptors (like volume or size) could further accelerate the sup-CK method. Future application of the method proposed could include identification of new ligands for protein pockets according to those known for the most similar pockets. This is of interest in the context of identification of drug precursors or of side effects prediction.
we have developed a new method to measure the similarity between protein binding sites. In this method, binding pockets are described as clouds of points in the 3D space, each point corresponding to an atom. These points may bare additional labels representing various characteristics such as atom partial charges, atom types, or other atomic features. The proposed method showed good performance in the classification of binding pockets according to their respective ligands. It relies on the search for the best global superposition of clouds of atoms, which confers robustness with respect to binding site definition or variations in ligand conformation. This method may be used to compare any type of binding sites in the 3D space, even in absence of overall sequence or structure similarity between their corresponding proteins.
We thank Mines ParisTech and Cannot Mines for financial support to the project and funding of BH and MZ.
- Xie L, Xie L, Bourne PE: A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics 2009, 25(12):i305-i312. 10.1093/bioinformatics/btp220View ArticlePubMedPubMed CentralGoogle Scholar
- Kahraman A, Morris RJ, Laskowski RA, Thornton JM: Shape variation in protein binding pockets and their ligands. J Mol Biol 2007, 368: 283–301. 10.1016/j.jmb.2007.01.086View ArticlePubMedGoogle Scholar
- Morris RJ, Najmanovich R, Kahraman A, Thornton J: Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics 2005, 21(10):2347–2355. 10.1093/bioinformatics/bti337View ArticlePubMedGoogle Scholar
- Gold N, Jackson R: SitesBase: a database for structure-based protein-ligand binding site comparisons. Nucleic Acids Res 2006, 34: D231-D234. 10.1093/nar/gkj062View ArticlePubMedPubMed CentralGoogle Scholar
- Shulman-Peleg A, Shatsky M, Nussinov R, Wolfson HJJ: MultiBind and MAPPIS: webservers for multiple alignment of protein 3D-binding sites and their interactions. Nucleic Acids Res 2008, 36: W260-W264. 10.1093/nar/gkn185View ArticlePubMedPubMed CentralGoogle Scholar
- Schalon C, Surgand JS, Kellenberger E, Rognan D: A simple and fuzzy method to align and compare druggable ligand-binding sites. Proteins 2008, 71(4):1755–1778. 10.1002/prot.21858View ArticlePubMedGoogle Scholar
- Weskamp N, Hullermeier E, Kuhn D, Klebe G: Multiple Graph Alignment for the Structural Analysis of Protein Active Sites. IEEE/ACM Trans Comput Biol Bioinformatics 2007, 4(2):310–320. 10.1109/TCBB.2007.358301View ArticleGoogle Scholar
- Najmanovich R, Kurbatova N, Thornton J: Detection of 3D atomic similarities and their use in the discrimination of small molecule protein-binding sites. Bioinformatics 2008, 24(16):i105-i111. 10.1093/bioinformatics/btn263View ArticlePubMedGoogle Scholar
- Shulman-Peleg A, Nussinov R, Wolfson HJ: Recognition of functional sites in protein structures. J Mol Biol 2004, 339(3):607–633. 10.1016/j.jmb.2004.04.012View ArticlePubMedGoogle Scholar
- Shulman-Peleg A, Nussinov R, Wolfson HJ: SiteEngines: recognition and comparison of binding sites and protein-protein interfaces. Nucleic Acids Res 2005, (33 Web Server):W337-W341. 10.1093/nar/gki482Google Scholar
- Willett P, Winterman V, Bawden D: Implementation of nearest-neighbor searching in an online chemical structure search system. J Chem Inform Comput Sci 1986, 26: 36–41.View ArticleGoogle Scholar
- Davies J, Jackson R, Mardia K, Taylor C: The Poisson Index: a new probabilistic model for protein ligand binding site similarity. Bioinformatics 2007, 23(22):3001–3008. 10.1093/bioinformatics/btm470View ArticlePubMedGoogle Scholar
- Schölkopf B, Tsuda K, Vert JP: Kernel Methods in Computational Biology. The MIT Press, Cambridge, Massachussetts: MIT Press; 2004.Google Scholar
- Haussler D: Convolution Kernels on Discrete Structures. Tech Rep UCSC-CRL-99–10, UC Santa Cruz 1999.Google Scholar
- Gärtner T, Flach P, Kowalczyk A, Smola A: Multi-Instance Kernels. In Proceedings of the Nineteenth International Conference on Machine Learning. Edited by: Sammut C, Hoffmann A. Morgan Kaufmann; 2002:179–186.Google Scholar
- Kondor R, Jebara T: A kernel between sets of vectors. ICML '03: Proceedings of the 20th international conference on Machine learning 2003.Google Scholar
- Needleman S, Wunsch C: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2View ArticlePubMedGoogle Scholar
- Scott WRP, Tironi IG, Mark AE, Billeter SR, F J, Torda AE, Huber T, Kruger P: The Gromos biomolecular simulation program package. J Phys Chem A 1999, 103: 3596–3607. 10.1021/jp984217fView ArticleGoogle Scholar
- Schölkopf B, Smola A, Müller KR: Kernel principal component analysis. In Advances in Kernel Methods Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press; 1999:327–352.Google Scholar
- Iyer N, Jayanti S, Lou K, Kalyanaraman Y, Ramani K: Three-dimensional shape searching: state-of-the-art review and future trends. Computer-Aided Design 2005, 37(5):509–530. 10.1016/j.cad.2004.07.002View ArticleGoogle Scholar
- Papadakis P, Pratikakis I, Perantonis S, Theoharis T: Efficient 3D shape matching and retrieval using a concrete radialized spherical projection representation. Pattern Recogn 2007, 40(9):2437–2452. 10.1016/j.patcog.2006.12.026View ArticleGoogle Scholar
- Saupe D, Vranic DV: 3D Model Retrieval with Spherical Harmonics and Moments. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition. London, UK: Springer-Verlag; 2001:392–397.Google Scholar
- Zhang Z: Iterative Point Matching for Registration of Free-form Curves. In Tech rep. Institut National de Recherche en Informatique et en Automatique (INRIA); 1992.Google Scholar
- Biasotti S, Marini S, Mortara M, Patane G, Spagnuolo M, Falcidieno B: 3D Shape Matching through Topological Structures. In Discrete Geometry for Computer Imagery. Springer Berlin/Heidelberg; 2004:194–203.Google Scholar
- Laurie ATR, Jackson RM: Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21(9):1908–1916. 10.1093/bioinformatics/bti315View ArticlePubMedGoogle Scholar
- Glaser F, Morris RJ, Najmanovich RJ, Laskowski RA, Thornton JM: A method for localizing ligand binding pockets in protein structures. Proteins 2006, 62(2):479–488. 10.1002/prot.20769View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.