Prediction of compound-target interactions of natural products using large-scale drug and protein information
© Keum et al. 2016
Published: 28 July 2016
Identifying the proteins targeted by compounds in natural herbs helps in selecting herb-based drug candidates. However, clarifying these interactions through in vitro or in vivo experiments takes a great deal of effort. In silico prediction of the interactions between compounds and target proteins can ease this effort.
In this study, we performed in silico prediction of the target proteins of herbal compounds. First, data on compounds, target proteins, and the interactions between them were taken from the DrugBank database. We then characterized six classes of compound-target interaction in humans: G-protein-coupled receptors (GPCRs), ion channels, enzymes, receptors, transporters, and other proteins. Classification-prediction models that predict the interactions between compounds and target proteins were then constructed from similarity matrices of the compounds and target proteins using a machine learning method. The resulting AUC values for the six classes were 0.94, 0.93, 0.90, 0.89, 0.91, and 0.76, respectively. Finally, the interactions of compounds from natural products were predicted using the constructed classification models. Among the predicted results, we confirmed that several important disease-related proteins were predicted as targets of natural herbal compounds.
We constructed classification-prediction models that predict the interactions between compounds and target proteins. The models showed good prediction performance, and a number of potential target proteins of natural compounds were predicted.
The medicinal use of natural products dates back thousands of years. In more recent years, compounds derived from natural products have shown promising effects in drug discovery and drug development. For example, oseltamivir (trade name Tamiflu), an antiviral medication used to treat influenza A and influenza B, is synthesized from shikimic acid, a naturally occurring substance found in the Chinese star anise herb. However, the detailed mechanism of action, including the target proteins of the compounds, is known for only a few natural products. Moreover, identifying compound-target interactions through in vitro or in vivo experiments requires considerable effort. Accurate in silico screening methods are therefore needed to predict interactions between compounds and target proteins.
Numerous studies on the prediction of interactions between compounds and target proteins have been reported. Yamanishi et al. carried out a systematic study on the prediction of compound-target protein interactions. They suggested that an interaction can be predicted from the structural similarity of compounds and the genomic sequence similarity of proteins. They computed the sequence similarities between proteins using normalized Smith-Waterman scores and the structural similarities between compounds using SIMCOMP, a graph-based method for comparing chemical structures [3, 4]. With respect to prediction methods, Bleakley et al. provided a useful method, referred to as the bipartite local model (BLM), to accurately predict compound-target protein interactions. BLM predicts the target proteins of a given compound using the structural similarity of compounds, the genomic similarity of proteins, and the known interactions between compounds and targets. Since this method shows promising performance in drug-target prediction, we adopted it in our study to predict the interactions between herbal compounds and target proteins.
Compound, target protein, interaction data
The number of compounds, target proteins, and interactions of each type used to construct the prediction model
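Structural similarity between two compounds is measured by the Tanimoto coefficient of their fingerprints:

$$T(A, B) = \frac{N_{AB}}{N_A + N_B - N_{AB}}$$

where $N_A$ is the number of bits set in the fingerprint of molecule A, $N_B$ is the number of bits set in the fingerprint of molecule B, and $N_{AB}$ is the number of bits set in both fingerprints. Applying this calculation to all compound pairs yields the compound similarity matrix SC.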
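As a minimal sketch of how a compound similarity matrix such as SC could be built, the following assumes each fingerprint is represented as a Python set of "on" bit positions. The function names, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are illustrative assumptions, not details taken from the study.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    shared = len(fp_a & fp_b)                 # bits set in both fingerprints
    total = len(fp_a) + len(fp_b) - shared    # bits set in at least one fingerprint
    # Assumed convention: two empty fingerprints get similarity 0.0.
    return shared / total if total else 0.0


def compound_similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities for a list of fingerprints."""
    n = len(fingerprints)
    return [[tanimoto(fingerprints[i], fingerprints[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical bit positions, not real compounds.
toy_fingerprints = [{1, 2, 3, 4}, {3, 4, 5}, {7, 8}]
SC = compound_similarity_matrix(toy_fingerprints)
```

The resulting matrix is symmetric with ones on the diagonal, which is the form a similarity-based classifier expects.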
Applying this approach to all target protein pairs, the target protein genomic similarity matrix SP is constructed.
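One way to build a protein similarity matrix such as SP is sketched below. It assumes the normalization described by Yamanishi et al., in which the Smith-Waterman score of a pair is divided by the geometric mean of the two self-alignment scores. The toy match, mismatch, and gap values stand in for a real substitution matrix, and the sequences and function names are hypothetical.

```python
import math


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty (toy scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def protein_similarity_matrix(seqs):
    """Normalized scores: SW(p, q) / sqrt(SW(p, p) * SW(q, q)). Sequences must be non-empty."""
    self_scores = [sw_score(s, s) for s in seqs]
    n = len(seqs)
    return [[sw_score(seqs[i], seqs[j]) / math.sqrt(self_scores[i] * self_scores[j])
             for j in range(n)]
            for i in range(n)]


# Hypothetical short sequences, not real target proteins.
toy_sequences = ["MKTAYIAK", "MKTAYLAK", "GGGG"]
SP = protein_similarity_matrix(toy_sequences)
```

The normalization puts every self-similarity at 1 so that long proteins do not dominate the matrix simply because they produce larger raw scores.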
Herbal compound data
The number of herbs and their compounds used to predict target proteins. All data are classified by phenotype
Bipartite local model
Support Vector Machines (SVMs)
Support vector machines (SVMs) are used as the classifiers for the bipartite local model. An SVM is a supervised learning model used for classification and regression. It separates two classes with the maximum margin, which tends to help it generalize to new data. For this reason, SVMs are used in many applications that need good classification performance. Many SVM libraries are available; LIBSVM (v. 3.20) is used for the bipartite local model (BLM). Given similarity information about the vertices (either the compounds or the target proteins), each local SVM learns a function that assigns a continuous score to a compound or target from the labels of these vertices. The sign of the score indicates whether an interaction is predicted: if the sign is positive, the compound and the target protein are predicted to interact; otherwise, they are predicted not to interact. The absolute value of the score reflects the confidence of the prediction: the higher it is, the more credible the prediction.
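The structure of a bipartite local model can be sketched as follows. To keep the example self-contained, the local classifier here is a plain similarity-weighted vote, not an SVM, and LIBSVM is not called. Combining the two local scores with `max` is one common choice and is an assumption here, as are the function names and the toy matrices.

```python
def local_score(similarities, labels):
    """Stand-in for a local SVM: similarity-weighted vote over labeled vertices."""
    return sum(s * y for s, y in zip(similarities, labels))


def blm_score(i, j, Y, SC, SP):
    """Score the putative interaction between compound i and target j.

    Y[c][t] is 1 for a known interaction and 0 otherwise (unknown pairs are
    treated as non-interactions); SC and SP are the similarity matrices.
    """
    # Fixed target j: label the other compounds by whether they interact with j.
    other_c = [c for c in range(len(Y)) if c != i]
    fixed_target = local_score([SC[i][c] for c in other_c],
                               [1 if Y[c][j] else -1 for c in other_c])
    # Fixed compound i: label the other targets by whether they interact with i.
    other_t = [t for t in range(len(Y[0])) if t != j]
    fixed_compound = local_score([SP[j][t] for t in other_t],
                                 [1 if Y[i][t] else -1 for t in other_t])
    # Assumed combination rule: keep the larger of the two local scores.
    return max(fixed_target, fixed_compound)


# Hypothetical toy data: 3 compounds, 3 targets.
SC = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
SP = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.05], [0.1, 0.05, 1.0]]
Y = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]

likely = blm_score(0, 0, Y, SC, SP)    # compound 0 resembles compound 1, which hits target 0
unlikely = blm_score(0, 2, Y, SC, SP)  # nothing similar to compound 0 hits target 2
```

As in the text, a positive score is read as a predicted interaction and a negative score as a predicted non-interaction, with the absolute value serving as a rough confidence.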
Performance of the prediction model for each type of dataset. Columns: AUC (fixed target) and AUC (fixed drug)
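For readers unfamiliar with the metric, AUC can be computed directly from prediction scores as the fraction of positive-negative pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses this pairwise definition; the function name and the toy scores are assumptions, and it does not reproduce the cross-validation procedure of the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four compound-target pairs (1 = known interaction).
toy_scores = [0.9, 0.7, 0.6, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives context for the reported range of 0.76 to 0.94.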
The number of predicted herbs, their compounds, target proteins, and interactions by protein type. All data are classified by phenotype
Conclusions and discussion
In this research, we predicted whether compounds bind to target proteins by using chemical structure similarity and genomic sequence similarity information on a large scale. Most of the data used to construct the prediction models were taken from the DrugBank database. The classification models that predict whether compounds bind to GPCRs, ion channels, transporters, other receptors, enzymes, and other proteins were validated by 10-fold cross-validation. The AUC of each prediction model is about 0.90 (0.89 to 0.94), except for the model for other proteins (0.76). Since far fewer data are available for the other-proteins type than for the remaining types, it is difficult to construct an accurate prediction model for it. Nonetheless, most of the prediction models are reliable, and we can therefore use them to predict the interactions between herbal compounds and target proteins. Among the prediction results, there are many herbal compounds with zero violations of Lipinski's rule of five (ROF), which suggests that these compounds are drug-like. Furthermore, many proteins are predicted to bind to the herbal compounds of each phenotype, and previous studies report that some of these proteins are related to the corresponding phenotype. This supports the predictive ability of the suggested model. Moreover, the predicted targets include not only proteins whose relationship with each phenotype has already been identified by other studies but also proteins with no previously reported relationship.
We note that only a limited number of target proteins were used in our predictions compared with the total number of genes found in humans. For example, our dataset contains 404 target proteins of the enzyme type, whereas the human genome contains about 2,700 proteins of that type. There is therefore room to improve the performance of the prediction models by increasing the size of the training data set. The prediction performance could also be improved by using additional information about compounds beyond structural similarity alone. Many useful properties can distinguish compounds, such as human intestinal absorption (HIA) and blood-brain barrier (BBB) penetration. Lastly, the bipartite local model (BLM) method treats every unknown compound-target protein interaction as a non-interaction, even though some of these pairs may in fact interact. This can degrade the performance of the prediction model. Once these problems are addressed, better classification models for predicting the interactions between target proteins and compounds can be constructed.
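The rule-of-five check mentioned above can be sketched as a simple count of violations, using the standard thresholds: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The function name and the descriptor values are hypothetical, and computing the descriptors from a structure is outside the scope of this sketch.

```python
def rof_violations(mol_weight, logp, h_bond_donors, h_bond_acceptors):
    """Count how many of Lipinski's four rule-of-five criteria a compound violates."""
    checks = [
        mol_weight > 500,        # molecular weight above 500 Da
        logp > 5,                # octanol-water partition coefficient above 5
        h_bond_donors > 5,       # more than 5 hydrogen-bond donors
        h_bond_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical descriptor values, not measurements of real compounds.
drug_like = rof_violations(300.0, 2.5, 2, 5)
not_drug_like = rof_violations(650.0, 6.1, 2, 5)
```

A compound with a count of zero satisfies all four criteria.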
Availability of data and materials
The dataset used to construct the prediction models is available in the DrugBank database (http://www.drugbank.ca, version 4.0). The herb dataset supporting the conclusions of this article is available in the TCMID (http://www.megabionet.org/tcmid/), TCM-ID (http://bidd.nus.edu.sg/group/TCMsite/Default.aspx), KTKP (http://www.koreantk.com), and KAMPO (http://www.kampo.ca).
This work was supported by the Bio-Synergy Research Project (NRF-2014M3A9C4066449) of the Ministry of Science, ICT and Future Planning through the National Research Foundation, and supported by the National Research Foundation of Korea grant funded by the Korea government (MSIP) (NRF-2015R1C1A1A01051578).
The publication of this article was funded by the Bio-Synergy Research Project (NRF-2014M3A9C4066449) of the Ministry of Science, ICT and Future Planning through the National Research Foundation, and supported by the National Research Foundation of Korea grant funded by the Korea government (MSIP) (NRF-2015R1C1A1A01051578).
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 6, 2016: Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Chen W, Lim CE, Kang HJ, Liu J. Chinese herbal medicines for the treatment of type A H1N1 influenza: a systematic review of randomized controlled trials. PLoS One. 2011;6(12):e28093.
- Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008;24(13):i232–40.
- Hattori M, Okuno Y, Goto S, Kanehisa M. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J Am Chem Soc. 2003;125(39):11853–65.
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
- Bleakley K, Yamanishi Y. Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics. 2009;25(18):2397–403.
- Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, et al. DrugBank 3.0: a comprehensive resource for ‘omics’ research on drugs. Nucleic Acids Res. 2011;39(Database issue):D1035–41.
- Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, et al. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(Database issue):D1091–7.
- Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36(Database issue):D901–6.
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34(Database issue):D668–72.
- O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open babel: an open chemical toolbox. J Cheminform. 2011;3:33.
- Ji ZL, Zhou H, Wang JF, Han LY, Zheng CJ, Chen YZ. Traditional Chinese medicine information database. J Ethnopharmacol. 2006;103(3):501.
- Xue R, Fang Z, Zhang M, Yi Z, Wen C, Shi T. TCMID: Traditional Chinese Medicine integrative database for herb molecular mechanism analysis. Nucleic Acids Res. 2013;41(Database issue):D1089–95.
- Pawson AJ, Sharman JL, Benson HE, Faccenda E, Alexander SP, Buneman OP, Davenport AP, McGrath JC, Peters JA, Southan C, et al. The IUPHAR/BPS Guide to PHARMACOLOGY: an expert-driven knowledgebase of drug targets and their ligands. Nucleic Acids Res. 2014;42(Database issue):D1098–106.
- Zhao M, Lee WP, Garrison EP, Marth GT. SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One. 2013;8(12):e82138.
- Haahtela T, Tuomisto LE, Pietinalho A, Klaukka T, Erhola M, Kaila M, Nieminen MM, Kontula E, Laitinen LA. A 10 year asthma programme in Finland: major change for the better. Thorax. 2006;61(8):663–70.
- Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78.
- Williams AJ, Tkachenko V, Golotvin S, Kidd R, McCann G. ChemSpider - building a foundation for the semantic web by hosting a crowd sourced databasing platform for chemistry. J Cheminform. 2010;2(Suppl 1):O16.
- Vapnik VN. An overview of statistical learning theory. IEEE Trans Neural Netw. 1999;10(5):988–99.
- Chang C-C, Lin C-J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Tech (TIST). 2011;2(3):27.
- Lipinski CA. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov Today Technol. 2004;1(4):337–41.
- Pertea M, Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome Biol. 2010;11(5):206.