Large-scale reverse docking profiles and their applications
© Lee and Kim; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Reverse docking approaches have been explored in previous drug discovery studies to overcome some problems in traditional virtual screening. However, current reverse docking approaches are limited in that the target spaces of those studies were rather small, and their applications were restricted to identifying new drug targets. In this study, we expanded the target space to the set of all protein structures currently available and developed several new applications of the reverse docking method.
We generated a 2D matrix of docking scores between all available protein structures in yeast and human and 35 well-known drugs. By clustering the docking profile data and comparing the result with fingerprint-based clustering of the drugs, we first showed that our data contain accurate information on the chemical properties of the drugs. Next, we showed that our method could be used to predict the druggability of target proteins. We also showed that a combination of sequence similarity and docking profile similarity could predict enzyme EC numbers more accurately than sequence similarity alone. In two case studies, 5-fluorouracil and cycloheximide, we showed that our method can successfully identify target proteins.
By using a large number of protein structures, we improved the sensitivity of reverse docking and showed that using as many protein structures as possible was important in finding real binding targets.
Identifying disease genes and the target proteins of drugs is a critical step in drug discovery. Once the disease genes are identified, designing lead compounds that can modulate those genes or their protein products may lead to a successful new drug. The growth in the number of available 3D protein structures and in computing power has enabled high-throughput computational screening of lead compounds, which is known as virtual screening. Conventionally, these virtual screening methods have focused on searching chemical space for chemicals that can specifically bind to a protein target.
A complication in this structure-based drug discovery strategy is that unknown off-target proteins may bind to the lead compounds unexpectedly. This poses difficulties such as severe side effects, but it also provides a new opportunity: upon discovering novel targets for existing drugs, we can expand the indications of those drugs by drug repositioning. Motivated by this, reverse (or inverse) docking approaches have received increasing interest as a way to find unknown targets of natural products and existing drugs [2–4]. In reverse docking, one tries to find the protein targets that can bind to a particular ligand.
In previous studies, based on the assumption that the number of predicted potential protein targets is quite low compared to the number of genes, researchers tried to find new drug targets among a relatively small number of potential target proteins. For example, a reverse docking study by Gao et al. used ~1,100 targets, and that by Hui-fang et al. used 1,714 targets and 8 compounds. However, this may result in poor coverage of the protein structure space in reverse docking. Moreover, the only intended application of those reverse docking methods was to find the targets of drugs. On the other hand, various approaches, including statistical methods using sequence and structure similarity, calculation of binding site similarity [9, 10], and prediction of druggability from descriptors, have been developed.
Here, we present a large-scale reverse docking study. The main difference from previous studies is that we used all available protein structures in human and yeast. To the best of our knowledge, our docking profile contains the largest number of protein structures. The reverse docking profile was merged into a matrix that can be easily interpreted. We showed some properties of the large-scale docking profile and demonstrated the usefulness of these docking profile data. We also developed several new applications, such as predicting the druggability of protein targets and predicting protein function based on docking profile similarity. We discussed two interesting case studies, 5-fluorouracil and cycloheximide. In particular, we demonstrated that using as many protein structures as possible was important in improving the sensitivity of reverse docking and finding real binding targets.
The list of ligands used to generate reverse docking profiles
The similarities among hierarchical clusterings in ligand space.
The "druggability" of a target protein represents how likely the protein is to be a real target of drugs, and it has been investigated in many previous studies [14–16]. In one such method, the druggability of a protein was inferred from homologous proteins whose druggabilities were already known. The weakness of this method is that the number of targets with known druggability is limited. Other approaches attempted to define "druggable" as "highly likely to bind to putative drugs", i.e., "bindability" [18, 19].
Here, as an example, we tested a simple combination of sequence and docking profile information. To compensate for the low sensitivity of docking fitness at low FDR, a new distance was defined as follows: if the BLAST e-value of a pair is less than 1e-5, the e-value is used as the distance; otherwise, the Euclidean distance between the docking profiles is used. The performance of this metric is shown in Figure 5 (red). Note that this simple metric is not based on any training, feature extraction, or machine learning technique. Even without considering which elements of the 35-dimensional docking profile are important, simply adding docking profile information yields better performance over the whole range. In summary, this implies that using docking profile information together with other useful measures as features for state-of-the-art machine learning techniques, and increasing the size of the docking profile, i.e., appending the reverse docking results of additional ligands, would lead to more precise prediction of protein function.
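The combined distance described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the use of `None` for "no BLAST hit", and the toy two-ligand profiles are assumptions made for the example; only the 1e-5 cutoff and the e-value/Euclidean switch come from the text.

```python
import math

E_VALUE_CUTOFF = 1e-5  # threshold quoted in the text


def euclidean(profile_a, profile_b):
    """Euclidean distance between two equal-length docking profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def hybrid_distance(e_value, profile_a, profile_b, cutoff=E_VALUE_CUTOFF):
    """Use the BLAST e-value when it is significant, else the profile distance.

    e_value is the BLAST e-value of the pair, or None when BLAST reports
    no hit (an assumption of this sketch).
    """
    if e_value is not None and e_value < cutoff:
        return e_value
    return euclidean(profile_a, profile_b)


# Toy two-ligand profiles (real profiles in the study have 35 entries).
close_homologs = hybrid_distance(1e-30, [1.0, 2.0], [4.0, 6.0])
no_blast_hit = hybrid_distance(None, [1.0, 2.0], [4.0, 6.0])
print(close_homologs)  # 1e-30
print(no_blast_hit)    # 5.0
```

One consequence of this definition is that the two branches live on different scales: any pair with a significant e-value is ranked closer than any pair that falls back to the Euclidean distance, unless two docking profiles differ by less than 1e-5.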
The docking profile data generated in this study can be applied in a variety of ways. As discussed in the previous section, they can be used to infer protein function. A more common application, which has been explored in several previous studies, is to infer new binding targets for known drugs. Here, we present two case studies.
5-fluorocytosine (5-FC) and 5-fluorouracil (5-FU) are both fluorinated analogues of pyrimidine. The structures of the two ligands are quite similar. Therefore, not surprisingly, their docking profiles are quite similar as well. Moreover, the top-ranking binding site of both ligands is in the structure of a yeast exosome component, the protein product of the gene rrp6 (PDB id: 2hbm). The structure was determined relatively recently, so 2hbm has never been annotated as a putative target, let alone as druggable. The previously known mechanism of action of 5-FU is inhibition of thymidylate synthetase. Thus, the top-ranking structure, 2hbm, might be considered a false positive. However, a genome-wide study using tagged heterozygous yeast mutants provided strong evidence that the rrp6-related rRNA processing exosome is a target of 5-FU. The direct binding target of 5-FU was not identified in that study, but its result together with the docking scores strongly suggests that the protein product of rrp6 is the direct binding target of 5-FU in yeast.
Compared to the protein sets used in previous studies, the set used in this study is quite large and has some redundancy. One may question whether all these structures contribute to the sensitivity of reverse docking. This is an important issue because docking still requires substantial computing power and is time-consuming.
In our human dataset, 8,717 of the 10,886 structures have hits that share a UniProt ID, corresponding to 1,339 unique UniProt IDs. In other words, those 8,717 structures could be reduced to 1,339 structures, removing at most 7,378 structures, if we filtered the set with respect to sequence redundancy only. However, there are many cases where the docking fitness profiles of similar sequences are quite different.
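The sequence-only filtering count can be illustrated with a short sketch. The structure and UniProt identifiers below are hypothetical placeholders, and the function name is an assumption for this example; the final line only re-checks the arithmetic quoted in the text.

```python
from collections import defaultdict


def sequence_redundancy(structure_to_uniprot):
    """Count structures, unique UniProt IDs, and structures that
    sequence-only filtering would remove (one kept per UniProt ID)."""
    groups = defaultdict(list)
    for structure_id, uniprot_id in structure_to_uniprot.items():
        groups[uniprot_id].append(structure_id)
    n_structures = len(structure_to_uniprot)
    n_unique = len(groups)
    return n_structures, n_unique, n_structures - n_unique


# Hypothetical toy mapping: five structures covering two UniProt IDs.
toy = {"s1": "U1", "s2": "U1", "s3": "U1", "s4": "U2", "s5": "U2"}
print(sequence_redundancy(toy))  # (5, 2, 3)

# The same arithmetic with the numbers reported in the text.
print(8717 - 1339)  # 7378
```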
To show this property, we first carried out hierarchical clustering of the docking profiles of the proteins. For each sub-cluster, if all the members were derived from the same UniProt ID, the members were merged into one. This procedure was repeated until there were no sub-clusters whose members all shared the same UniProt ID. As a result, 1,710 structures were eventually filtered out, i.e., only about 20% of the sequence-redundant protein structures were also redundant in their docking profiles. This is due to heterogeneity in the PDB: there are many modified structures, such as oxidized, reduced, multimeric, metal-containing, and truncated forms, even for a single protein sequence. Thus, we concluded that the sets of protein structures used in previous reverse docking studies are insufficient. For example, the interesting results from the docking of cycloheximide, which were discussed in the previous section, would not have been obtained.
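The merging procedure can be sketched as a single bottom-up pass over a dendrogram. This is an illustrative sketch under stated assumptions: the dendrogram is assumed to have been computed already by some hierarchical clustering routine and is represented here as nested tuples of leaf names; the function name `collapse_homogeneous` and all identifiers are hypothetical. The clustering method and linkage used by the authors are not specified in the text.

```python
def collapse_homogeneous(tree, uniprot_of):
    """Collapse every sub-cluster whose members share one UniProt ID.

    tree: a leaf name (str) or a tuple of sub-trees.
    uniprot_of: dict mapping leaf name -> UniProt ID.
    Returns (kept_leaves, number_of_removed_leaves).
    """
    def walk(node):
        if isinstance(node, str):
            return [node], {uniprot_of[node]}, [node]
        leaves, ids, kept = [], set(), []
        for child in node:
            child_leaves, child_ids, child_kept = walk(child)
            leaves += child_leaves
            ids |= child_ids
            kept += child_kept
        if len(ids) == 1:
            # Whole sub-cluster maps to one UniProt ID: keep one representative.
            kept = [leaves[0]]
        return leaves, ids, kept

    leaves, _, kept = walk(tree)
    return kept, len(leaves) - len(kept)


# Toy dendrogram: a3 shares a UniProt ID with a1 and a2 but clusters elsewhere.
tree = ((("a1", "a2"), "b1"), ("a3", ("c1", "c2")))
uniprot_of = {"a1": "P1", "a2": "P1", "a3": "P1",
              "b1": "P2", "c1": "P3", "c2": "P3"}
print(collapse_homogeneous(tree, uniprot_of))  # (['a1', 'b1', 'a3', 'c1'], 2)
```

In the toy example, sequence-only filtering would remove three structures (six structures, three UniProt IDs), whereas profile-aware merging removes only two, because `a3` docks differently from `a1` and `a2`. If the "about 20%" in the text is taken relative to the 8,717 structures sharing UniProt IDs, the reported numbers give 1,710 / 8,717 ≈ 19.6%.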
Another interesting example is the main binding target of hydrocortisone, the glucocorticoid receptor (GCR). There are nine structures of the GCR in the PDB. However, datasets used for reverse docking, such as the potential drug target database (PDTD), included only two of them (PDB 1nhz and 1p93). A reverse docking study of hydrocortisone by others using PDTD could not detect the GCR as the target. In our docking profile, the top-ranking protein target was PDB 3bqd, which is another structure of the GCR. If we had removed redundancy based on sequence similarity, we might not have detected the GCR as the real target of hydrocortisone. Therefore, our reverse docking experiment suggests that using as many protein structures as possible in reverse docking is worthwhile for finding unknown drug targets or unexpected modes of action, despite the high computational cost.
In this study, we generated large-scale reverse docking profiles for all X-ray protein structures in human and yeast. These data can serve as a reference for future binding assays and can be used to find unexpected binding targets of drugs. Furthermore, they would be useful for finding unknown therapeutic uses in drug repositioning. In some case studies, targets that were not previously annotated as druggable or stored in target databases exhibited high binding fitness, and they are highly likely to be real binding targets considering previous functional experiments. By using a large number of protein structures, we improved the sensitivity of reverse docking and showed that using as many protein structures as possible was important in finding real binding targets. Although we used only 35 ligands in docking, we were able to demonstrate the usefulness of our data. Generating this kind of reverse docking profile for a large number of ligands would be valuable in future studies.
All available X-ray protein structures in human and the budding yeast Saccharomyces cerevisiae were retrieved from the RCSB Protein Data Bank (PDB) [34, 35]. The best putative binding sites of each PDB structure were generated using the program Fpocket [36, 37]. To make the pockets appropriate inputs for docking, Open Babel was used to protonate all the pockets. Thirty-five well-known ligands (Table 1) were manually selected from previous high-throughput experimental studies [29, 39] for high-throughput reverse docking, after excluding ligands that were too large or too small for a molecular docking study. The 3D structures of the ligands were retrieved from PubChem and converted from the sdf file format into the Tripos mol2 file format.
All the protonated pockets were docked against the ligand set using GOLD. We used a 'flexible ligand-rigid protein' mode. All other options involved in GOLD's search algorithm and termination criteria were left at their defaults. Given several putative docking conformations, we chose only the highest-ranking binding pose for each ligand-binding site pair. The GOLD fitness value was used as the measure of binding fitness. As a result, a 10,886 × 35 matrix and a 1,165 × 35 matrix of docking fitness scores were generated for human and yeast, respectively, and used in this study (Additional files 1, 2).
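The step from individual docking poses to the fitness matrix can be sketched as follows. This is a minimal sketch, not the authors' pipeline: it assumes the poses are already available as (binding site, ligand, fitness) records, that "highest-ranking" means the largest fitness value, and that missing pairs are stored as `None`; the function name and all identifiers are hypothetical.

```python
def build_fitness_matrix(poses, sites, ligands):
    """Keep the best fitness per (site, ligand) pair and lay it out as a
    len(sites) x len(ligands) matrix (list of rows)."""
    best = {}
    for site, ligand, fitness in poses:
        key = (site, ligand)
        # Assumption: a higher fitness value means a better-ranked pose.
        if key not in best or fitness > best[key]:
            best[key] = fitness
    return [[best.get((site, ligand)) for ligand in ligands] for site in sites]


# Toy input: several poses per pair, two binding sites, two ligands.
poses = [
    ("siteA", "lig1", 40.0), ("siteA", "lig1", 55.5), ("siteA", "lig2", 30.0),
    ("siteB", "lig1", 20.0), ("siteB", "lig2", 62.0), ("siteB", "lig2", 10.0),
]
matrix = build_fitness_matrix(poses, ["siteA", "siteB"], ["lig1", "lig2"])
print(matrix)  # [[55.5, 30.0], [20.0, 62.0]]
```

With the study's data, the same layout would give 10,886 rows for human (1,165 for yeast) and 35 columns, one per ligand.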
A predefined non-redundant set of druggable and less-druggable binding sites (NRDLD set) was retrieved from the study by Krasowski et al. Among the 71 druggable and 44 less-druggable binding sites in the NRDLD set, 43 druggable and 8 less-druggable binding sites overlap with the human protein structures used in this study. These 51 binding sites were used for the druggability analysis.
Putative druggable and less-druggable protein binding sites were assigned by the following rules: a binding site is druggable when all 35 of its docking values are larger than the corresponding overall average values, and less-druggable when all 35 of its docking values are smaller than the corresponding average values.
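The assignment rule can be written down directly. This sketch assumes that the "corresponding overall average" is the mean of each ligand's column over all binding sites in the matrix; the text does not spell this out. The function name and the "unassigned" label for sites meeting neither rule are also assumptions of the example.

```python
def assign_druggability(matrix):
    """Label each row (binding site) of a fitness matrix.

    A site is 'druggable' if every value exceeds its column mean,
    'less-druggable' if every value is below it, else 'unassigned'.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Assumption: the average is taken per ligand over all binding sites.
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    labels = []
    for row in matrix:
        if all(value > mean for value, mean in zip(row, col_means)):
            labels.append("druggable")
        elif all(value < mean for value, mean in zip(row, col_means)):
            labels.append("less-druggable")
        else:
            labels.append("unassigned")
    return labels


# Toy matrix with two ligands; both column means equal 50.
toy = [[60, 70], [40, 30], [50, 65], [50, 35]]
print(assign_druggability(toy))
# ['druggable', 'less-druggable', 'unassigned', 'unassigned']
```

Because the rule requires agreement across all 35 ligands, it is deliberately strict: most binding sites would be expected to fall into neither class.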
To obtain the EC number composition of the assigned druggable and less-druggable sets, non-redundant (NR) putative druggable and less-druggable sets were defined. In this study, an NR set is a set that does not contain any pair of proteins sharing the same UniProt ID. Note that we did not use any sequence identity measure to remove redundancy.
Although there are several ortholog databases, none of them provides a PDB-based mapping table. Therefore, we obtained the ortholog mapping between human and yeast protein structures by the following procedure. First, we retrieved the human-yeast ortholog table from InParanoid [45, 46]. In this table, human proteins and yeast proteins are annotated by Ensembl protein ID (ENSP) and yeast ORF name, respectively. These identifiers were then mapped to PDB IDs using PICR to complete the PDB-based mapping.
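The mapping procedure amounts to joining three tables. The sketch below assumes the ortholog table and the two identifier-to-PDB mappings have already been downloaded and parsed into Python structures; it does not call InParanoid or PICR, and all identifiers and the function name are hypothetical placeholders.

```python
def pdb_ortholog_pairs(orthologs, ensp_to_pdb, orf_to_pdb):
    """Expand (ENSP, ORF) ortholog pairs into (human PDB, yeast PDB) pairs.

    orthologs: iterable of (human Ensembl protein ID, yeast ORF name).
    ensp_to_pdb / orf_to_pdb: dicts mapping an identifier to a list of PDB IDs.
    Ortholog pairs lacking a structure on either side are dropped.
    """
    pairs = set()
    for ensp, orf in orthologs:
        for human_pdb in ensp_to_pdb.get(ensp, ()):
            for yeast_pdb in orf_to_pdb.get(orf, ()):
                pairs.add((human_pdb, yeast_pdb))
    return sorted(pairs)


# Hypothetical placeholder identifiers, for illustration only.
orthologs = [("ENSP_X", "YORF_1"), ("ENSP_Y", "YORF_2")]
ensp_to_pdb = {"ENSP_X": ["h001", "h002"]}          # ENSP_Y has no structure
orf_to_pdb = {"YORF_1": ["y001"], "YORF_2": ["y002"]}
print(pdb_ortholog_pairs(orthologs, ensp_to_pdb, orf_to_pdb))
# [('h001', 'y001'), ('h002', 'y001')]
```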
We thank all members of the Bioinformatics and Computational Biology Laboratory at KAIST for helpful discussions. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government, the Ministry of Education, Science & Technology (MEST) [2009-0086964].
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.