- Research
- Open access
- Published:
A multilayer dynamic perturbation analysis method for predicting ligand–protein interactions
BMC Bioinformatics volume 23, Article number: 456 (2022)
Abstract
Background
Ligand–protein interactions play a key role in defining protein function, and detecting natural ligands for a given protein is thus a very important bioengineering task. In particular, with the rapid development of AI-based structure prediction algorithms, batch structural models with high reliability and accuracy can be obtained at low cost, giving rise to the urgent requirement for the prediction of natural ligands based on protein structures. In recent years, although several structure-based methods have been developed to predict ligand-binding pockets and ligand-binding sites, accurate and rapid methods are still lacking, especially for the prediction of ligand-binding regions and the spatial extension of ligands in the pockets.
Results
In this paper, we proposed a multilayer dynamics perturbation analysis (MDPA) method for predicting ligand-binding regions based solely on protein structure, which is an extended version of our previously developed fast dynamic perturbation analysis (FDPA) method. In MDPA/FDPA, ligand binding tends to occur in regions that cause large changes in protein conformational dynamics. MDPA, examined using a standard validation dataset of ligand-protein complexes, yielded an averaged ligand-binding site prediction Matthews coefficient of 0.40, with a prediction precision of at least 50% for 71% of the cases. In particular, for 80% of the cases, the predicted ligand-binding region overlaps the natural ligand by at least 50%. The method was also compared with other state-of-the-art structure-based methods.
Conclusions
MDPA is a structure-based method to detect ligand-binding regions on protein surface. Our calculations suggested that a range of spaces inside the protein pockets has subtle interactions with the protein, which can significantly impact on the overall dynamics of the protein. This work provides a valuable tool as a starting point upon which further docking and analysis methods can be used for natural ligand detection in protein functional annotation. The source code of MDPA method is freely available at: https://github.com/mingdengming/mdpa.
Introduction
Proteins are ubiquitous in the cellular environment. They are not only the main components of cells but also the vital workforce to maintain the biochemical function of life. Proteins achieve their functions by interacting with a variety of molecules in cells, including some macromolecules (DNA, RNA, membranes) and small molecules (catalytic substrates, nucleotides, peptides, and artificial chemicals, often called ligands) [1, 2]. These interactions usually occur at several key amino acids in proteins, known as ligand-binding sites(LBSs), so the identification and characterization of protein LBSs and their associated ligands is key to understanding enzyme catalytic mechanism, disease pathogenesis and provide indispensable knowledge for enzyme engineering and drug design [3,4,5,6]. Furthermore, identifying protein LBSs is typically an indispensable starting step in virtual screening calculations, such as finding inhibitors or screening enzymes that bind to specific substrates [7].
Over the years, biological experiments have provided a wealth of protein structural data spanning all domains of life, which are available from the Protein Data Bank (PDB [8]), providing valuable, highly accurate protein LBS data. The rapid accumulation of experimental protein LBS data has been compiled into different databases, such as PoSSuM [9], BioLip [10], PLIC [11], sc-PDB [12], PDBbind [13], eModel-DBD [14], MOAD [15], etc., for different applications, giving birth to many computational methods for detecting and analyzing protein LBSs [16]. In recent years, not only has the number of known protein structures characterized experimentally increased rapidly, but advanced protein structure prediction programs are giving an ever-increasing number of new structures with reliable accuracy at an unprecedented rate [17, 18], which poses a great new challenge for the computational characterization of protein LBSs. At present, a variety of computational methods have been developed in the field of protein LBS prediction. According to their main algorithms, they can be divided into geometric-, energy-, template-/consensus-, knowledge-based methods and machine-learning methods; see references [19, 20] for details. Here, we mainly limited ourselves to a few geometric- or structure-based dynamic approaches that are closely related to this study, with particular emphasis on those that are freely available and freestanding.
As an early geometric method, POCKET [21] and the derivative LigSite [22] used geometric grids to explore protein-solvent-protein events and predicted those grids that begin and end with “protein” and into which solvents can be inserted as pockets. SURFNET [23] and PASS [24] used a probing ball randomly rotating on protein surface to search candidate pockets. FPOCKET [25] is a stand-alone geometric method that produces grid points based on Voronoi tessellation. For a given protein, FPOCKET usually generates many pockets which cover most protein LBSs, but they are not always in the top pocket. However, due to its easy-to-use, stand-alone and fast, it is very popular in the field and has been used as a built-in component by other programs, such as P2Rank [19]. Other methods introduce energy calculation on the basis of geometric representation. Q-SiteFinder [26] places methyl probes (-CH3) at grid points, calculates van der Waals interaction energies between probe atoms and neighboring protein atoms, and then clusters low-energy probes into candidate pockets as predictions. SiteHound [27] calculates the interaction forces between grid points and the protein, and clusters those points with higher interaction energies as candidate pockets. Then comes the consensus method MetaPocket 2.0 [28], a popular web server that combines the results produced by the above mentioned methods, including LigSite, PASS, Q-SiteFinder, Surfnet, FPOCKET and other three methods to improve the overall prediction success rate. Other consensus methods include COACH [29], G-LoSA [30], Libra [31], bSiteFinder [32], eMatchSite [33], etc. The template-based or consensus methods were reported to be the most successful and useful LBS prediction tools in CASP [34], which computationally determine conserved sites for certain protein families that bind target ligands based on sequence alignments [35]. Machine learning methods classify LBSs or protein functional residues based on structural, physico chemical features of biomolecular sequences/structures in the training sets [36,37,38,39,40,41]. In recent years, just like their development in protein structure prediction, machine learning and deep learning methods have also developed the fastest in the field of ligand-binding pocket and LBS prediction, as driven by the rapid accumulation of large amounts of experimental data [19, 42,43,44,45].
We have developed a template-free, structure-based and stand-alone method to predict protein LBS based on the protein dynamics perturbation analysis (DPA, and its Fast version FDPA) [46, 47]. The algorithm is based on the observation that external interactions, such as ligand binding, tend to occur in protein regions where interactions cause large conformational distribution changes, as measured by the allosteric potential Dx [46], which is the Kullback–Leibler divergence between protein conformational distributions with and without interaction. The MSMS program [48] was used to generate a layer of test points with a 1.5 Å radius on the surface of the protein, and those points with higher Dx values were selected and clustered to form prediction regions. In DPA, proteins were treated as some elastic network structures, and the external interactions between test points and neighboring protein atoms were simulated by connecting springs. The method was evaluated using 305 proteins in GOLD test set [49] of the time, and DPA yielded 287 protein-pocket predictions (rate of 94%), of which 250 predictions (rate of 87%) gave at least one correct LBS, while FDPA predicted 267 (rate of 86%) and 251 (rate of 94%), respectively. FDPA costs about 3 s to make a prediction for a protein of 130 aa with 2.2 GHz Intel Core i7 processor. We noted that since the DPA/FDPA algorithms only use a set of two-dimensional test points that are closely parallel to the protein surface (with an average distance of 1.5 Å from the protein surface), they cannot give the spatial distribution of target compounds (ligand or substrate) inside the pocket. However, in addition to LBSs and pockets, the spatial extensions of ligands within pockets are also important in practice [50] and their predictions have also received increasing attention over the years [51,52,53,54,55].
In this study, we presented a new version of DPA, called the multilayer dynamics perturbation analysis (MDPA), to predict both the protein LBSs and the spatial extension of ligands within the predicted pocket. The method was evaluated with a popular CCDC/Astex dataset for ligand-binding site prediction [56] and compared with the popular algorithms FPOCK [25] and CAVIAR [57]. The output of the algorithm might be used as input parameters for docking programs such as AutoDock [58], Vina [59], and others, and may also help optimize the conformation of the target ligand.
Methods
Implementation of MDPA
MDPA is an extension of the previously developed fast DPA or FDPA method. See the original work[47] for details of the FDPA. Here, only the principle and calculation process are outlined, with special emphasis on the extension of the method. This method is based on the observation that interactions introduced at LBSs tend to cause large perturbations to the protein conformational distribution, thereby making the protein activity susceptible to modulation by external molecular interactions at particular LBSs[46]. The conformational distribution of a protein containing N atoms, \(P\left(\mathbf{X}\right)\propto exp(-U\left(\mathbf{X}\right)/{k}_{B}T)\), is a probability function of the protein configuration \(\mathbf{X}=(\overrightarrow{{\mathbf{r}}_{1}},\overrightarrow{{\mathbf{r}}_{2}},\cdots ,\overrightarrow{{\mathbf{r}}_{{\varvec{N}}}})\), where \(\overrightarrow{{\mathbf{r}}_{\mathbf{j}}}\) is the coordinate of jth atom. In DPA/FDPA, the potential energy \(U(\mathbf{X})\) was modeled in harmonic approximation as a quadratic function \(U(\mathbf{X})=E(\Delta \mathbf{X})=\frac{1}{2}{\Delta {\varvec{X}}}^{T}\mathbf{H}\Delta \mathbf{X}\), where \(\Delta \mathbf{X}=\mathbf{X}-{\mathbf{X}}_{0}\) measures the protein deformation from its equilibrium configuration \({\mathbf{X}}_{0}\), and \(\mathbf{H}\), called the Hessian matrix, is the second partial mass-weighted derivative matrix of the potential energy \(U\) evaluated at \({\mathbf{X}}_{0}\) with respect to local coordinate changes. Harmonic approximation of the potential energy function U is usually valid because, under normal conditions, biomacromolecules must keep their structural fluctuations within a narrow range to perform their function correctly [60].
The key step of DPA/FDPA was to introduce external test interactions at random position “a” on the protein surface. For a given protein conformation \(\mathbf{X}\), the algorithm determined the new configuration distribution \({P}^{({\varvec{a}})}(\mathbf{X})\) with the external interaction at “a” and measured its difference from unperturbed distribution \(P(\mathbf{X})\) based on the Kullback–Leibler divergence, giving the D-value: \({D}^{(a)}=\int d\mathbf{X}\mathbf{l}\mathbf{n}\left(\frac{{P}^{({\varvec{a}})}(\mathbf{X})}{P(\mathbf{X})}\right){P}^{({\varvec{a}})}(\mathbf{X})\) [61]. In practice, the configuration distribution \(P(\mathbf{X})\) was approximated by Boltzmann distribution with the potential energy \(U(\mathbf{X})\) modeled using the \({C}_{\alpha }\)-based elastic network model [62, 63], and an analytic solution of D(a) was derived (depending on the protein equilibrium conformation \({\mathbf{X}}_{0}\)). In DPA/FDPA, a set of random test points, also called surface points, denoted as \({\mathbf{L}}_{1}=\left\{{a}_{j}^{(1)},j=\mathrm{1,2},...,{S}_{1}\right\},\) was generated by the MSMS program [48], and those points with higher D-value (\({D}^{({a}^{(1)})})\) were screened out and grouped into small clusters \({O}^{(1)},{P}^{(1)},...\), by using the OPTICS[64] clustering algorithm. Each cluster represented a predicted pocket, and protein residues near the cluster were then predicted as LBSs.
In DPA/FDPA, the set of test points generated by MSMS usually forms a closed surface parallel to the protein surface, which in turn makes the predicted pocket consisting of selected surface points very close (1.5 Å) to protein residues in any case. This often differs from what we observed when small molecules bind to proteins. In fact, except for a few marginal atoms, the entire small molecule tends to be located in the center of the binding pocket, some distance from the surface residues. Considering the importance of this ligand spatial extension in pockets, here we introduce a simple generalization of FDPA, called multilayer DPA or MDPA, for predicting protein LBSs and ligand spatial extension in pockets. MDPA treats the surface points \({\mathbf{L}}_{1}\) as a layer of virtual alpha-carbon atoms covering the protein structure, and then used MSMS program to generate a new set of surface points \({\mathbf{L}}_{2}=\left\{{a}_{j}^{(2)},j=\mathrm{1,2},...,{S}_{2}\right\}\) outside \({\mathbf{L}}_{1}\), using complex structure composed of the protein and \({\mathbf{L}}_{1}\). The D-values \({D}^{({a}^{(2)})}\) for \({\mathbf{L}}_{2}\) surface points \(a_{j}^{\left( 2 \right)} \;^{\prime } s\) are determined using a similar analytic solution formula in FDPA, and those with higher D-values were screened out, and grouped into clusters \({O}^{(2)},{P}^{(2)},...\). Compared to pockets \(\left\{{O}^{(1)},{P}^{(1)},...\right\}\) predicted based on layer 1 surface points, pockets \(\left\{{O}^{(2)},{P}^{(2)},...\right\}\) are predicted to be farther from the protein surface, with an average distance of twice the average distance from \({\mathbf{L}}_{1}\) to the protein surface, i.e., twice the diameter of the probe sphere. Repeating this process, the third layer of surface points \({\mathbf{L}}_{3}\) was built on top of \({\mathbf{L}}_{2}\), giving predicted pockets \(\left\{{O}^{(3)},{P}^{(3)},...\right\}\). Higher-layered pockets can also be generated in this way. In this study, only the pockets generated by the first three layers of test points were used, which was made after considering the balance between the prediction accuracy requirement and the computational load. Finally, the predicted pockets of all layers are collected, merged, and reclustered according to their spatial connectivity to obtain the final pocket predictions: \({O}_{M}, {P}_{M}\)… Therefore, in many cases, a final predicted pocket, say \({O}_{M}\), may be the result of connection of pockets from multiple layers such that, say \({O}_{M}= {O}^{(1)}+{O}^{(2)}+{P}^{(3)}+...\), and each pocket was re-ranked according to the averaged D-value of test points in it. The prediction of protein LBSs was straightforward by simply finding amino acids whose Cα atom is within a distance of 6 Å from a point in the rank-1 predicted cluster (denoted as cluster \({O}_{m}\)). LBSs predicted by lower-ranked pockets, if any, were also analyzed for comparison. The predicted clusters and protein structures were visualized by PyMOL (The PyMOL Molecular Graphics System, Version 2.5.2, Schrödinger, LLC.).
Validation dataset
To verify the accuracy of the algorithm in predicting ligand-binding pockets, the CCDC/Astex dataset of 85 protein-ligand complexes were used in this study [56]. The dataset consists of a diverse, high-quality test set originally developed to evaluate protein-ligand docking programs. Most structures in the dataset have little global conformational similarity, but some of them are selected from several major protein families, including 11 kinases, 9 nuclear receptors, 5 serine proteases, and 3 members of the phosphodiesterase family. This dataset was originally based on the statistical analysis and classification of all protein-ligand complexes in the PDB database. Starting from the sequence of protein, the structure of the ligand, and the structure of the complex, the clustering of proteins with similar sequences was used to display as many different proteins as possible. Characterization of the complexes was accomplished by extracting the structure of the ligand, based on the number of heavy atoms and rotatable bonds in the ligand, and determining whether it has therapeutic or agrochemical use in the complex. 90% of these complexes were related protein targets, contain similar drug complexes, and contain high-quality experimental structures that facilitate the evaluation of the ligand binding modes.
Validation of prediction LBSs and ligand spatial extension inside the binding pocket
To evaluate the prediction accuracy, we selected the set of protein residues \({R}_{P}\) whose Cα atoms were within 6 Å of any point in the rank-1 cluster \({O}_{m}\) and defined \({R}_{P}\) as the prediction LBSs. The prediction was compared with the set of LBSs \({R}_{L}\) found in the ligand-protein complex as collected from the SITE sections of the examined PDB files. The intersection of Rp and RL defined a set of residues \({R}_{P\cap L}={R}_{P}\cap {R}_{L}\), recoding residues found in both the prediction and experimental determined LBSs. The overlap of the prediction with the experiment LBSs was then evaluated with the conventional prediction precision and recall: Precision = \({R}_{P\cap L}/{R}_{P}\), Recall = \({R}_{P\cap L}/{R}_{L}\).
Compared with DPA/FDPA, the new output of MDPA was a prediction of the spatial extension of ligands inside the predicted binding pocket. To evaluate the quality of predictions, we defined an evaluation strategy similar to evaluating LBS predictions. Let SL be the set of atoms of a ligand found in the complex, and Sl be the subset of SL whose elements are ligand atoms located within 3.5 Å of any test point of the predicted pocket \({O}_{M}\). Let \({O}_{m}\) be the subset of \({O}_{M}\), whose elements are test points located within 3.5 Å of any ligand atom. The overlap of the prediction with the spatial extension of the experimental ligand within the binding pocket was then assessed using precision and recall as follows: Precision = \({O}_{m}/ {O}_{M}\), and Recall = \({S}_{l}/ {S}_{L}\).
Compared with FPOCKET method
As a fast geometric stand-alone method, FPOCKET has become one of the most popular tools for analyzing protein structural pockets in recent years. The method is based on the determination of alpha spheres by Voronoi tessellation [25]. An alpha sphere is a sphere that contacts four atoms at the protein boundary, and the radius of the alpha sphere reflects the local curvature defined by the four atoms. For a protein, very small spheres are located inside the protein, large spheres are located outside the protein, and cracks and cavities correspond to spheres of intermediate radii. Therefore, spheres of different radius sizes are filtered to solve the pocket detection problem. The algorithm first inputs the PDB structure, then returns the pre-filtered set of spheres, and finally identifies groups of closely connected alpha spheres, returns the pocket information, and scores each pocket. FPOCKET typically gives a large number of predicted pockets, covering most known LBSs and ligand molecules, however, LBSs and ligands are not always associated with the top predicted pockets. For quantitative comparisons, similar precision and recall are also defined to evaluate the predictions of FPOCKET's top-ranked pockets, where test points are replaced with alpha spheres.
Results
MDPA detects regions causing large perturbation to protein dynamics
The lysozyme from turkey egg-white (PDB code: 1JEF [65]) has been used as an example to illustrate the performance of DPA. Like many other lysozymes from different organisms, this protein consists of two domains connected by a helix linker, with its substrate polysaccharide bound deep in between the two lobes (Fig. 1A). Experimental data and simulation calculations show that the enzyme is constantly switching between open and fully closed conformations by adjusting the distance between the two lobes [66, 67]. As mentioned above, from random protein surface points, DPA selects a subset of surface points that most perturb the protein conformational distribution, which are then predicted to be associated with the binding site for protein substrates (Fig. 1). In the case of lysozyme, the interactions introduced at those surface points predicted by DPA should intuitively have the highest efficiency in blocking the key intrinsic opening/closing motion of the enzyme. The original DPA algorithm only considers the first layer (~ 1.5 Å from the protein surface) surface points (Fig. 1B), and it has the obvious disadvantage that it provides little information about the position and spatial extension of the substrate in the binding pocket (Fig. 1C, F). In MDPA, multiple layers of surface points are created on the protein surface, each layer is added to the top of the previous layer, then a subset of surface points with high DPA values are selected for each layer, and finally, all selected surface points are collected and clustered to form predicted binding region (Fig. 1C,D,E,G). For turkey egg-white lysozyme, both DPA and MDPA predicted the same ligand binding sites, namely D52, Q57, I58, N59, W62, W63, A107, and W108, giving a recall rate of 62% and an accuracy rate of 83% compared to the sites listed in the SITE entries of the PDB file. Compared to DPA (Additional file 1: Table S1), since it uses multiple layers of surface points, the main improvement of MDPA lies in the overlap between the predicted spatial region and the physical occupancy of the bound ligand, from 67% in DPA increased to 100% in MDPA, with a slight sacrifice in accuracy from 87 to 82%.
As a second example, we applied MDPA to detect the ligand-binding pocket in the main protease of the recent global pandemic coronavirus. The COVID-19 caused by the novel coronavirus SARS-CoV-2 virus, coupled with the lack of targeted drugs, urgently needs to find new antiviral drugs. The main protease (Mpro) of coronaviruses has been identified as a key drug target for the development of inhibitors to block viral RNA protein processing. The overall structure of the protease (PDB code 6Y2F[68]) can be divided into 3 consecutive and mutually contacting domains (domain I, II, and III) corresponding to residues 10 to 99, 100 to 182, and 198 to 303, respectively. Figure 2 shows the 3 predicted inhibitor-binding regions in the structure given by MDPA, with the rank-1 predicted region located between domains I and II, and the 2nd and 3rd prediction regions between domains II and III. Compared to recently reported α-ketoamide inhibitors[68, 69], in which Cys145 mediates the nucleophilic reaction and Gly143 and Glu166 form hydrogen bonds with the inhibitor, the predicted rank-1 region perfectly matches where the α-ketoamide inhibitor binds within the pocket. The prediction identified important binding sites such as Glu166, His164, Cys145, Gly143, His41, and Arg188, which accounts for half of the listed ligand-binding residues. The predicted binding region overlapped with the α-ketoamide inhibitor with 89% accuracy and 81% recall.
Evaluation of MDPA for prediction of ligand-binding sites
One of the key features of protein pockets is the distribution of ligand-binding sites within them, which largely determines protein function. To evaluate the ability of MDPA to predict ligand-binding sites, MDPA was applied to the protein structures of the CCDC/Astex dataset of 85 protein-ligand complexes. The residues identified by 3-layered MDPA predictions were then compared to the ligand-binding sites listed in the SITE records for each complex structure (Table 1). MDPA made at least one prediction ligand-binding region for 94% of the listed complexes in the dataset, with the exception of the four cases where the ligand-binding region was embedded inside the internal protein cavities. Compared to the listed ligand binding sites in the PDB entry, the predictions were at least 30% accurate in 65% of cases and at least 50% accurate in 25% of cases; the recall rate of residues is at least 30% in 87% of cases and at least 50% in 71% of cases.
Evaluation of MDPA for prediction of ligand spatial extension within the pocket
Compared with previous DPA studies, the development of the MDPA algorithm aims to improve the prediction of the spatial extension of ligands in protein pockets and provide useful information for subsequent ligand prediction and ligand-protein docking calculations. Any medium to large-sized ligand usually adopts a specific spatial distribution within the pocket, with parts of the ligand structure close to the protein and others away from the protein. Both DPA and MPA identify surface point clusters that impose high perturbation on the overall dynamics of the protein through interactions with neighboring protein amino acids, and according to this method, it is easier to detect interacting point sets closer to the protein surface. A key step in MDPA is to expand the collection of highly perturbing points at increasing distances from the protein surface in a layer-by-layer manner. Therefore, it would be interesting to verify the correlation between the direction of the predicted region increase and the spatial extension of ligands in the pocket. As an example, Fig. 3 shows that MPDA successfully computed the growth of the predicted region along the direction of the spatial distribution of the ligand 6-substituted 2-naphthamidine inhibitor in the human urokinase pocket (PDB code: 1OWE[70]). In this case, the predicted overlap of the binding region with the protein inhibitor increased from 30% in DPA to 100% in MDPA. To evaluate its ability to predict the spatial distribution of ligands in the pocket, MDPA was applied to the CCDC/Astex dataset of 85 protein-ligand complexes, and the overlap ratios of the predicted regions with the natural ligands found in the complexes were calculated (Table 1). Calculations showed that the predicted binding region of MDPA overlapped with the natural ligand in 82% of the complexes. Among them, the prediction accuracy is at least 30% in 87% of cases and at least 50% in 62% of cases, with the recall rate at least 30% in 95% of cases and at least 50% in 80% of cases.
Discussions
Compared to FPOCKET and CAVIAR
For comparison, we selected two free, standalone protein pocket prediction programs, FPOCKET [25] and CAVIAR, to examine MDPA by comparing their predictions of protein pockets of the complexes in the dataset (Additional file 1: Tables S2 and S3). FPOCKET identifies cavities through spherical probes, while CAVIAR detects pockets according to the grid algorithm, so most binding sites can be presented by a large number of pockets. Although calculations show that FPOCKET and CAVIAR can effectively identify the ligand-binding pockets of most proteins, the overlap rate between the predicted point sets and the bound ligand of the complexes is relatively small, with an average of < 10%, which provides little information about the ligand-binding position and spatial extension inside the pockets. Compared with FPOCKET and CAVIAR (Table 2), MDPA has higher accuracy in predicting binding sites (60%) and ligand spatial extension (71%), avoiding redundancy. We also noted that if the pocket is completely embedded inside the protein, MDPA does not generate any surface points in this region with the MSMS program, and its predicted binding region has zero overlap with the corresponding ligand (Fig. 4, PDB code: 1R1H[71]). Therefore, MDPA does not work for pockets buried in the protein core, whereas both FPOCKET and CAVIAR may correctly predict the embedded binding pocket. In this sense, MDPA provides a better description of the ligand-binding pose in pockets located on the protein’s outer surface, and this information is expected to be useful for further molecular docking calculations and other ligand–protein interaction characterizations.
Combining deep-learning models to improve the prediction accuracy of ligand binding pockets
Data-driven methods have long been used to predict important ligand binding residues in proteins and physicochemical properties involving ligand-protein interactions [19]. Recent years have seen an increasing number of deep-learning models reported to predict ligand-binding pockets or ligand-binding sites [19, 42, 72,73,74,75]. For example, Zhang and colleagues [76] developed a free and standalone deep-learning model called DeepbindPoc, which uses a 3D convolutional neural network to rank the predicted pockets identified by FPOCKET to screen for the pockets most likely to bind natural ligands. An interesting feature of DeepbindPoc is that the algorithm uses the mol2vec tool to add information about natural ligands to the training dataset, so it has the advantage of ranking near-native pockets for proteins with unknown binding sites but whose ligands are known in advance, and it may also help predict natural ligands for a given protein. Combining DeepbindPoc with MDPA, DeepbindPoc helps reprioritize the predicted regions identified by MDPA by using natural ligands(Additional file 1: Table S4). Of the 85 complexes, the highest-scoring pockets given by DeepbindPoc were consistent with the rank-1 pockets predicted by MDPA in 51% of cases, while the remaining 48% were different. In the cases where the two do not match, there are pros and cons between MDPA prediction ranking and DeepbindPoc scoring. For example, in the case of the influenza virus neuraminidase-inhibitor complex (PDB code: 1L7F [77]), DeepbindPoc gave the highest score to the fourth-ranked MDPA pocket, which does bind the ligand BCX-1812, while the rank-1 MDPA pocket has no binding ligand. However, in most cases (72%), the pockets screened by DeepbindPoc score do not seem as reasonable as the top-ranked pockets predicted by MDPA. The DeepbindPoc scoring function provides some references for MDPA-predicted pockets that do not bind to natural ligands. Therefore, screening candidate ligands of these pockets using tools such as DeepbindPoc is worth further study.
Effects of protein conformation changes on protein pocket detection
Just as flexible docking or flexible ligand recognition [78, 79] must take into account a certain degree of flexibility in the pocket side chain or backbone, it may be useful to examine the effect of protein conformational changes on the prediction of binding pockets. As an example, the human lysozyme was examined here, where structural changes in the binding pocket and different substrate binding modes have been characterized (PDB code: 1LZS [80]). The typical opening/closing conformational changes are simulated by the lowest frequency normal mode solved in MDPA (Fig. 5), where the entire structure rotates around a central axis connecting the two lobs.
Since no direct measurement data were available for the magnitude of the conformational change defined by the normal mode, a series of 12 conformations were generated, evenly distributed through a full cycle, with amplitude \(A\mathrm{cos}(i\pi /12), i=\mathrm{1,2},\dots ,12\), where A is an arbitrary value. To determine the maximum possible value of A, the Procheck[81] program was used to examine the distribution of unreasonable dihedral angles in the resulting conformations. Specifically, the largest A value was chosen under the condition that the Procheck program reports that at least 90% of the dihedral angles are within a reasonable area. MDPA was then applied to each conformation, and the predictions were compared with ligands for which binding sites were found (Table 3). The pocket was given a variety of flexibility from open to closed conformations, which mimics the conformational sampling of a flexible pocket. In the absence of conformational changes, the ligand-binding site predictions achieved 20% accuracy and 100% recall, and the ligand-spatial extension prediction achieved 80% accuracy and 100% recall. For comparison, a large conformational change \(A\) =200 and a medium-sized conformational change \(A=100\) were applied to the protein (Table 3). Calculations showed that the predictions given by MDPA vary slightly with conformational changes. When \(A=100\), the precision rate of LBS prediction reaches 31% in conformation 4 and 8, and the prediction accuracy of ligand spatial extension increases to 86%. In contrast, for the conformations derived by setting \(A\) =200, the highest ligand spatial extension prediction accuracy is only 83%. The calculations revealed that properly increasing the flexible deformation of protein conformation can improve the accuracy of ligand spatial extension prediction. This may be particularly useful for predicting ligand-binding regions within relatively large binding pockets.
On the other hand, we also noticed that, in many cases, ligand binding might lead to protein conformation changes that are not well described by the low-frequency normal mode motions described above, such as induced-fit conformational changes that cause strong inelastic deformations. In this case, it might be difficult to make accurate LBS predictions using the apo protein structures. Therefore, comparing LBS predictions based on apo structures with those based on complex structures may be another topic worthy of further study.
Conclusion
In this study, a multilayer dynamic perturbation analysis method (MDPA) was developed to predict ligand-binding pockets in proteins. This structural-based method is a direct improvement on our previously established method. The method has been validated using a standard ligand-protein complex dataset, with an LBS prediction accuracy of at least 30% in 65% of cases and recall of at least 30% in 87% of cases. One of the key features of MDPA is that MDPA is designed to predict the spatial extension of ligands within the binding pocket. The predicted binding region overlapped with the natural ligand in 82% of the complexes in the test dataset, with a prediction accuracy of at least 30% in 87% of the cases and a recall of at least 30% in 95% of cases, indicating that MDPA can give a reasonable prediction for protein pickets.
The combination of MDPA with deep learning methods based on ligand-protein interaction big data (such as DeepbindPoc) provides a good starting point for MDPA to study further biomolecular functions, including the prediction of natural ligands and their involved physicochemical interactions.
Availability of data and materials
All data is provided; the MDPA program code is also uploaded to https://github.com/mingdengming/mdpa
References
Lins L, Thomas A, Brasseur R. Analysis of accessible surface of residues in proteins. Protein Sci. 2003;12(7):1406–17.
DeWitte RS, Ishchenko AV, Shakhnovich EI. SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 2. Case studies in molecular design. J Am Chem Soc. 1997;119(20):4608–17.
Meyers J, Brown N, Blagg J. Mapping the 3D structures of small molecule binding sites. J Cheminformatics. 2016. https://doi.org/10.1186/s13321-016-0180-0.
Monzon AM, et al. Conformational diversity analysis reveals three functional mechanisms in proteins. PLoS Comput Biol. 2017;13(2): e1005398.
Shen Q, et al. Proteome-scale investigation of protein allosteric regulation perturbed by somatic mutations in 7000 cancer genomes. Am J Hum Genet. 2017;100(1):5–20.
Bhagavat R, et al. An augmented pocketome: detection and analysis of small-molecule binding pockets in proteins of known 3D structure. Structure. 2018;26(3):499-512 e2.
Sun J, Xia Y, Ming D. Whole-genome sequencing and bioinformatics analysis of apiotrichum mycotoxinivorans: Predicting putative zearalenone-degradation enzymes. Front Microbiol. 2020;11:1866.
Rose PW, et al. The RCSB protein data bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015;43(D1):D345–56.
Ito J, et al. PoSSuM: a database of similar protein-ligand binding and putative pockets. Nucleic Acids Res. 2012;40(D1):D541–8.
Yang J, Roy A, Zhang Y. BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013;41(D1):D1096–103.
Anand P, et al. PLIC: protein-ligand interaction clusters. Database (Oxford), 2014. 2014(0): p. bau029
Desaphy J, et al. sc-PDB: a 3D-database of ligandable binding sites–10 years on. Nucleic Acids Res. 2015;43(D1):D399–404.
Liu Z, et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics. 2015;31(3):405–12.
Naderi M, Govindaraj RG, Brylinski M. eModel-BDB: a database of comparative structure models of drug-target interactions from the binding database. Gigascience. 2018. https://doi.org/10.1093/gigascience/giy091.
Smith RD, et al. Updates to binding MOAD (mother of all databases): polypharmacology tools and their utility in drug repurposing. J Mol Biol. 2019;431(13):2423–33.
Vajda S, Guarnieri F. Characterization of protein-ligand interaction sites using experimental and computational methods. Curr Opin Drug Discov Devel. 2006;9(3):354–62.
Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021. https://doi.org/10.1126/science.abj8754.
Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021. https://doi.org/10.1038/s41586-021-03819-2.
Krivak R, Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform. 2018;10(1):39.
Zhao J, Cao Y, Zhang L. Exploring the computational methods for protein-ligand binding site prediction. Comput Struct Biotechnol J. 2020;18:417–26.
Levitt DG, Banaszak LJ. POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph. 1992;10(4):229–34.
Hendlich M, Rippmann F, Barnickel G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model. 1997;15(6):359–63.
Laskowski RA. SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph. 1995;13(5):323–30.
Brady GP, Stouten PFW. Fast prediction and visualization of protein binding pockets with PASS. J Comput Aided Mol Des. 2000;14(4):383–401.
Le Guilloux V, Schmidtke P, Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics. 2009;10:168.
Laurie AT, Jackson RM. Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics. 2005;21(9):1908–16.
Ghersi D, Sanchez R. EasyMIFS and SiteHound: a toolkit for the identification of ligand-binding sites in protein structures. Bioinformatics. 2009;25(23):3185–6.
Zhang Z, et al. Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics. 2011;27(15):2083–8.
Yang J, Roy A, Zhang Y. Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics. 2013;29(20):2588–95.
Lee HS, Im W. Ligand binding site detection by local structure alignment and its performance complementarity. J Chem Inf Model. 2013;53(9):2462–70.
le Hung V, et al. LIBRA: ligand binding site recognition application. Bioinformatics. 2015;31(24):4020–2.
Gao J, et al. bSiteFinder, an improved protein-binding sites prediction server based on structural alignment: more accurate and less time-consuming. J Cheminform. 2016;8:38.
Brylinski M. Local alignment of ligand binding sites in proteins for polypharmacology and drug repositioning. Methods Mol Biol. 2017;1611:109–22.
Gallo Cassarino T, Bordoli L, Schwede T. Assessment of ligand binding site predictions in CASP10. Proteins. 2014;82(Suppl 2):154–63.
Caffrey DR, et al. Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 2004;13(1):190–202.
Gutteridge A, Bartlett GJ, Thornton JM. Using a neural network and spatial clustering to predict the location of active sites in enzymes. J Mol Biol. 2003;330(4):719–34.
Ofran Y, Rost B. ISIS: interaction sites identified from sequence. Bioinformatics. 2007;23(2):e13–6.
Kauffman C, Karypis G. LIBRUS: combined machine learning and homology information for sequence-based ligand-binding residue prediction. Bioinformatics. 2009;25(23):3099–107.
Qiu ZJ, Wang XC. Improved prediction of protein ligand-binding sites using random forests. Protein Pept Lett. 2011;18(12):1212–8.
Chen P, Huang JHZ, Gao X. LigandRFs: random forest ensemble to identify ligand-binding residues from sequence information alone. Bmc Bioinform. 2014. https://doi.org/10.1186/1471-2105-15-S15-S4.
Shrihari S, Pinak C. Prediction of active site cleft using support vector machines. J Chem Inf Model. 2010;50(12):2266–73.
Jimenez J, et al. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics. 2017;33(19):3036–42.
Pu LM, et al. DeepDrug3D: classification of ligand-binding pockets in proteins with a convolutional neural network. Plos Comput Biol. 2019;15(2):e1006718.
Zhang HP, et al. DeepBindPoc: a deep learning method to rank ligand binding pockets using molecular vector representation. Peerj. 2020;8:e8864–e8864.
Simonovsky M, Meyers J. deeplytough: learning structural comparison of protein binding sites. J Chem Inf Model. 2020;60(4):2356–66.
Ming D, Wall ME. Interactions in native binding sites cause a large change in protein dynamics. J Mol Biol. 2006;358(1):213–23.
Ming D, Cohn JD, Wall ME. Fast dynamics perturbation analysis for prediction of protein functional sites. BMC Struct Biol. 2008;8:5.
Sanner MF, Olson AJ, Spehner JC. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers. 1996;38(3):305–20.
Jones G, et al. Development and validation of a genetic algorithm for flexible docking. J Mol Biol. 1997;267(3):727–48.
Feinstein WP, Brylinski M. Calculating an optimal box size for ligand docking and virtual screening against experimental and predicted binding pockets. J Cheminform. 2015;7:18.
Hocker HJ, Rambahal N, Gorfe AA. LIBSA–a method for the determination of ligand-binding preference to allosteric sites on receptor ensembles. J Chem Inf Model. 2014;54(2):530–8.
Heo L, et al. GalaxySite: ligand-binding-site prediction by using molecular docking. Nucleic Acids Res. 2014;42(W1):W210–4.
Yang Y, Qian J, Ming D. Docking polysaccharide to proteins that have a Tryptophan box in the binding pocket. Carbohydr Res. 2015;414:78–84.
Sayama M, et al. Probing the hydrophobic binding pocket of G-protein-coupled lysophosphatidylserine receptor GPR34/LPS1 by docking-aided structure-activity analysis. J Med Chem. 2017;60(14):6384–99.
Desaphy J, et al. Encoding protein-ligand interaction patterns in fingerprints and graphs. J Chem Inf Model. 2013;53(3):623–37.
Hartshorn MJ, et al. Diverse, high-quality test set for the validation of protein-ligand docking performance. J Med Chem. 2007;50(4):726–41.
Marchand JR, et al. CAVIAR: a method for automatic cavity detection, description and decomposition into subcavities. J Comput Aided Mol Des. 2021;35(6):737–50.
Morris GM, et al. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J Comput Chem. 2009;30(16):2785–91.
Eberhardt J, et al. AutoDock Vina 1.2.0: new docking methods, expanded force field, and python bindings. J Chem Inf Model. 2021;61(8):3891–8.
Ma J. Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure. 2005;13(3):373–80.
Ming D, Wall ME. Allostery in a coarse-grained model of protein dynamics. Phys Rev Lett. 2005;95(19): 198103.
Atilgan AR, et al. Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys J. 2001;80(1):505–15.
Tirion MM. Large amplitude elastic motions in proteins from a single-parameter atomic analysis. Phys Rev Lett. 1996;77(9):1905–8.
Ankerst M, Breunig MM, Kriegel HP, Sander J. OPTICS: ordering points to identify the clustering structur. In: Proceedings of the ACM SIGMON International Conference on Management of Data. 1999; 28: 49–60
Harata K, Muraki M. X-ray structure of turkey-egg lysozyme complex with tri-N-acetylchitotriose. Lack of binding ability at subsite A. Acta Crystallogr D Biol Crystallogr. 1997;53(6):650–7.
McHaourab HS, et al. Conformation of T4 lysozyme in solution. Hinge-bending motion and the substrate-induced conformational transition studied by site-directed spin labeling. Biochemistry. 1997;36(2):307–16.
Goto NK, et al. What is the average conformation of bacteriophage T4 lysozyme in solution? a domain orientation study using dipolar couplings measured by solution NMR11Edited by P E Wright. J Mole Biol. 2001;308(4):745–64.
Zhang L, et al. Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved & #x3b1;-ketoamide inhibitors. Science. 2020;368(6489):409–12.
Sabbah DA, et al. An updated review on SARS-CoV-2 main proteinase (M(Pro)): protein structure and small-molecule inhibitors. Curr Top Med Chem. 2021;21(6):442–60.
Wendt MD, et al. Identification of novel binding interactions in the development of potent, selective 2-naphthamidine inhibitors of urokinase synthesis, structural analysis, and SAR of N-phenyl amide 6-substitution. J Med Chem. 2004;47(2):303–24.
Oefner C, et al. Structural analysis of neprilysin with various specific and potent inhibitors. Acta Crystallogr D Biol Crystallogr. 2004;60(Pt 2):392–6.
Saberi Fathi SM, Tuszynski JA. A simple method for finding a protein’s ligand-binding pockets. BMC Struct Biol. 2014;14:18.
Pu L, et al. DeepDrug3D: classification of ligand-binding pockets in proteins with a convolutional neural network. PLoS Comput Biol. 2019;15(2): e1006718.
Aggarwal R, et al. deeppocket: ligand binding site detection and segmentation using 3D convolutional neural networks. J Chem Inf Model. 2021
Kandel J, Tayara H, Chong KT. PUResNet: prediction of protein-ligand binding sites using deep residual neural network. J Cheminform. 2021;13(1):65.
Zhang H, et al. DeepBindPoc: a deep learning method to rank ligand binding pockets using molecular vector representation. PeerJ. 2020;8: e8864.
Smith BJ, et al. Structural studies of the resistance of influenza virus neuramindase to inhibitors. J Med Chem. 2002;45(11):2207–12.
Shin WH, Seok C. GalaxyDock: protein-ligand docking with flexible protein side-chains. J Chem Inf Model. 2012;52(12):3225–32.
Ollikainen N, de Jong RM, Kortemme T. Coupling protein side-chain and backbone flexibility improves the re-design of protein-ligand specificity. PLoS Comput Biol. 2015;11(9): e1004335.
Song H, et al. Structural changes of active site cleft and different saccharide binding modes in human lysozyme co-crystallized with hexa-N-acetyl-chitohexaose at pH 4.0. J Mol Biol. 1994;244(5):522–40.
Laskowski RA, et al. PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr. 1993;26(2):283–91.
Acknowledgements
We are grateful to the High-Performance Computing Center of Nanjing Tech University for supporting the computational resources.
Funding
This work was supported, in part, by the National Key Research and Development Program of China, 2019YFA0905700, 2021YFC2102700.
Author information
Authors and Affiliations
Contributions
D.M. designed the study and made the funding acquisition. L.G. performed MDPA calculations and validations. B.L. did the DBP calculation and validations. D.M and L.G. wrote the main manuscript. L.G. prepared figures and tables. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Table S1.
Prediction results of DPA for protein structures in CCDC/Asterx dataset; Table S2. Prediction results of FPOCKET for protein structures in CCDC/Asterx dataset; Table S3. Prediction results of CAVIAR for protein structures in CCDC/Asterx dataset; Table S4. Selection of optimal pockets for protein structures in CCDC/Asterx dataset by combining DeepbindPoc; Figure S1. MDPA calculation results of SARS-COV-2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gu, L., Li, B. & Ming, D. A multilayer dynamic perturbation analysis method for predicting ligand–protein interactions. BMC Bioinformatics 23, 456 (2022). https://doi.org/10.1186/s12859-022-04995-2
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-022-04995-2