Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data
- Bin Chen†1,
- Xiao Dong†1,
- Dazhi Jiao†1, 2,
- Huijun Wang†1,
- Qian Zhu†1,
- Ying Ding†2 and
- David J Wild†1
© Chen et al; licensee BioMed Central Ltd. 2010
Received: 23 November 2009
Accepted: 17 May 2010
Published: 17 May 2010
Recently there has been an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Integration of these data sources and the identification of patterns that span them are of critical interest. Initiatives such as Bio2RDF and LODD have tackled the problem of linking biological data and drug data respectively using RDF. Thus far, the inclusion of chemogenomic and systems chemical biology information that crosses the domains of chemistry and biology has been very limited.
We have created a single repository called Chem2Bio2RDF by aggregating data from multiple chemogenomics repositories that is cross-linked into Bio2RDF and LODD. We have also created a linked-path generation tool to facilitate SPARQL query generation, and have created extended SPARQL functions to address specific chemical/biological search needs. We demonstrate the utility of Chem2Bio2RDF in investigating polypharmacology, identification of potential multiple pathway inhibitors, and the association of pathways with adverse drug reactions.
We have created a new semantic systems chemical biology resource, and have demonstrated its potential usefulness in specific examples of polypharmacology, multiple pathway inhibition and adverse drug reaction - pathway mapping. We have also demonstrated the usefulness of extending SPARQL with cheminformatics and bioinformatics functionality.
Recent advances in chemical and biological sciences have led to an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Through integrated and intelligent data mining, this information could provide important insights into the complex functions of biological systems and the actions of chemical compounds or drugs on these systems. However, this can only be achieved when data is semantically integrated (i.e. using multiple data sources that are connected in meaningful ways) and in particular when chemical and biological resources are brought together in such a framework [1, 2].
There are critical problems in biology that can only be addressed through computational analysis of this kind of integrated chemical and biological information. For example, it is considered increasingly important to profile existing and potential new drugs for their effects across many protein targets, not just a single target of interest (this is known as polypharmacology [3, 4]). Only by exploring the relationships of the drugs to a wide body of target information can we determine this profile. Further, the polypharmacologic action of drugs on targets that fall within the same pathway can determine the drug's ability to interrupt pathways at multiple points, and thus provide more robust efficacy. Relationships between these pathways and potential side effects of drugs or chemicals that are being considered as drugs (such as undesirable inhibition of a pathway) can only be determined by large-scale analysis of the impact of the chemicals on known pathway systems [5, 6]. The need to address these kinds of problems has led to the emergence of the field of Systems Chemical Biology, which covers the computational analysis of integrated chemical and biological information for the enhancement of biological understanding, including chemogenomics (the relationship of compounds to genes specifically).
Implementing such an integrated system involves the creation of large networks of linked compounds, protein targets, genes, pathways, drugs, diseases and side effects from multiple, heterogeneous sources. It must be possible to query these data in ways that go beyond querying of a single source and allow inferences that cross domains: for example, a positive experimental test of a chemical compound in a biological enzymatic assay where the enzyme is associated with a particular metabolic pathway constitutes a probable action of that compound on the pathway. Currently, there are significant barriers to carrying out this kind of analysis. Many of the needed data sources overlap and cover similar data (we refer to them as homogeneous or semi-homogeneous data sources) but with slightly different foci. Data sources also tend to be published in very diverse formats (text files, scholarly journal articles, XML, relational databases, and so on) and may be structured or unstructured. The semantic relationship of these datasets to each other is often unclear.
Recent Semantic Web technologies provide efficient ways to integrate heterogeneous data. The Semantic Web, initially proposed by Tim Berners-Lee, has demonstrated its utility in the life sciences, healthcare and drug discovery [2, 9–11]. Various semantic languages have been established to represent and query the semantic meaning of data and relationships. In this work we use RDF to model chemogenomic and systems chemical biology data and use SPARQL to query them.
A variety of RDF-based Semantic Web resources have already been created for biological data and drug data separately. Bio2RDF provides a platform and a strategy for generation and querying of biological RDF data in a distributed framework, with around 4 billion RDF triples across over 30 biological resources. Linking Open Drug Data (LODD), led by the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS IG), links RDF data from the Linked Clinical Trials dataset derived from ClinicalTrials.gov, DrugBank (a repository of almost 5000 FDA-approved drugs), and many other sources, with more than 8.4 million RDF triples and 388,000 links to external data sources. Similar efforts are YeastHub, LinkHub, BioDash and BioGateway.
Approaches to querying across heterogeneous data sources in the life sciences have been discussed previously. In the work reported in this paper, we have created an RDF resource for integrated chemical and biological information. We have further employed methods to facilitate the easy generation of SPARQL queries and have implemented a variety of searching options for the RDF datasets by extending the SPARQL query language to include domain-specific cheminformatics and bioinformatics functionality. We refer to this combination of new RDF triples, links to Bio2RDF and LODD, and searching capabilities as Chem2Bio2RDF. We present three specific examples of how Chem2Bio2RDF can be used in the previously described important areas of polypharmacology, pathway inhibition and adverse drug reaction analysis.
Our datasets are organized into six categories based on the kinds of biological and chemical concepts they contain. These categories are: chemical & drug, protein & gene, chemogenomics, systems (i.e., PPI and pathway), phenotype (i.e., disease and side effect) and literature. Some data sources are listed in multiple categories. Some of the data had previously been held in relational database format for our prior work, and in this case they were simply converted into RDF/XML via a D2R server. For the rest of the datasets, we acquired the raw data (by downloading from web sites) and loaded them into our relational database using customized scripts. These are then published as RDF through the D2R server. The data can be queried via a D2R SPARQL endpoint.
Storage and querying architecture
We linked our data to LODD and Bio2RDF using the owl:sameAs construct. Since LODD and Bio2RDF have strict namespace definitions and dereferenceable URIs, it is straightforward to link to them via a D2R mapping file. For example, the drug Lepirudin http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB00001 is linked to the following URIs: http://bio2rdf.org/drugbank_drugs:DB00001, http://www.dbpedia.org/resource/Lepirudin, http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/Lepirudin, and http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001.
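To make the linking concrete, the following minimal sketch prints the owl:sameAs statements for the Lepirudin example as N-Triples lines. It only illustrates the shape of the resulting triples: in the work described here the links are produced through a D2R mapping file, and the helper name `same_as_triples` is a hypothetical one introduced for this illustration.

```python
# Standard IRI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(subject, targets):
    """Return one N-Triples line per owl:sameAs link from subject to each target.

    Hypothetical helper for illustration; not part of Chem2Bio2RDF or D2R.
    """
    return [f"<{subject}> <{OWL_SAME_AS}> <{target}> ." for target in targets]


# URIs taken from the Lepirudin example in the text.
lepirudin = "http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB00001"
linked = [
    "http://bio2rdf.org/drugbank_drugs:DB00001",
    "http://www.dbpedia.org/resource/Lepirudin",
    "http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/Lepirudin",
    "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001",
]

triples = same_as_triples(lepirudin, linked)
for line in triples:
    print(line)
```

Each printed line states that the Chem2Bio2RDF resource and the external resource denote the same drug, which is what allows a query to follow the link into Bio2RDF or LODD.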
Implementation of cheminformatics and bioinformatics functionality in SPARQL
Where N_C is the number of bits that are set in the fingerprints of both A and B, and N_A and N_B are the total number of bits set in A and B, respectively.
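The quantities defined above are the ones used by the Tanimoto coefficient, a common fingerprint similarity measure that divides the number of bits in common by N_A + N_B minus the number of bits in common. Assuming that this is the measure intended here (the formula itself is not reproduced in this text), a minimal sketch follows. Fingerprints are represented as sets of on-bit positions, and the function name and example bit positions are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions.

    Assumes the similarity measure is Tanimoto; returns 0.0 when both
    fingerprints are empty to avoid dividing by zero.
    """
    n_a = len(fp_a)          # bits set in A
    n_b = len(fp_b)          # bits set in B
    n_c = len(fp_a & fp_b)   # bits set in both A and B
    denominator = n_a + n_b - n_c
    return n_c / denominator if denominator else 0.0


# Hypothetical fingerprints: 4 bits set in A, 3 in B, 2 in common.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
score = tanimoto(fp_a, fp_b)
print(round(score, 3))
```

With these toy fingerprints the score is 2 / (4 + 3 - 2). Identical fingerprints score 1 and fingerprints with no bits in common score 0.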
Creation of the Chem2Bio2RDF repository
Table: Chem2Bio2RDF datasets and the number of RDF triples for each (some data sources map to multiple RDF resources; listed sources include the public QSAR sets).
Case study 1: Linking DrugBank and PubChem to investigate Dexamethasone polypharmacology
Since approximately 35% of known drugs have more than one target, the efficacy of many drugs is increasingly thought to come from their effect on multiple targets. This is known as polypharmacology. We recently studied the utility of data in PubChem for identifying cases of polypharmacology as well as how chemical and biological data can be mined on a large scale. We can now extend this, using Chem2Bio2RDF, to incorporate data from DrugBank as well as PubChem. In particular, if a compound has the same multiple targets as a marketed drug but has a different chemical structure, that compound could be a candidate for a novel therapy. Conversely, if we have already established polypharmacologic action of known drugs, can we find other interesting drug-like compounds that also show similar polypharmacology? These questions can be formulated as a query: find all the drug-like compounds in PubChem BioAssay that share at least two targets with a drug in DrugBank. We can now translate this into a SPARQL query of Chem2Bio2RDF (in this example using Dexamethasone - an anti-inflammatory 9-fluoro-glucocorticoid which interacts with six proteins - as the drug of interest). The exact SPARQL query used is available on the chem2bio2rdf.org website.
Nine of the retrieved active compounds are active against at least two of the same protein targets, and all of them are drug-like (in terms of Lipinski's Rule of Five). These compounds make sense from a medicinal chemistry perspective. For example, dexamethasone and one result, tocris-1126 (CID: 6603742), have similar activities against NFKB1 and NR3C1, because they differ only slightly in stereochemistry. The activity of dexamethasone is also similar to that of another search hit, hydrocortisone (CID: 5754), where the addition of the methyl and fluorine to hydrocortisone has no effect on the activity but improves its drug-likeness as measured by the rule-of-five. In our generalized mapping process, we found 55 significant proteins in DrugBank that are studied in PubChem BioAssay. 27 drugs have corresponding active compounds showing polypharmacology.
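The logic of the query (compounds sharing at least two targets with a drug) can be sketched in plain Python over toy data. This is not the SPARQL query used in the study, which is available on the project website. The function name, the compound identifiers and the targets `TARGET_X` and `TARGET_Y` are hypothetical, and the drug-likeness filter is left out for brevity.

```python
from collections import defaultdict


def compounds_sharing_targets(drug_targets, actives, min_shared=2):
    """Map each compound to the drug targets it is active against.

    drug_targets: set of target names for the drug of interest.
    actives: iterable of (compound, target) pairs from bioassay results.
    Only compounds sharing at least min_shared targets are kept.
    """
    shared = defaultdict(set)
    for compound, target in actives:
        if target in drug_targets:
            shared[compound].add(target)
    return {c: t for c, t in shared.items() if len(t) >= min_shared}


# Toy data: NR3C1 and NFKB1 are targets named in this case study,
# everything else is made up for the illustration.
drug_targets = {"NR3C1", "NFKB1", "TARGET_X"}
actives = [
    ("cmpd_1", "NR3C1"),
    ("cmpd_1", "NFKB1"),
    ("cmpd_2", "NR3C1"),      # shares only one target, so it is dropped
    ("cmpd_3", "TARGET_Y"),   # active against an unrelated target
]

hits = compounds_sharing_targets(drug_targets, actives)
print(hits)
```

In the RDF setting the same join is expressed as graph patterns linking a DrugBank drug to its targets and those targets to active PubChem BioAssay compounds.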
Case study 2: Linking KEGG/Reactome Pathways and PubChem to identify potential multiple pathway inhibitors for MAPK
The MAPK signalling pathway plays important roles in coordinating cell proliferation, differentiation and death. The inhibitors of proteins involved in the pathway are widely studied, but the robustness of this pathway leads to drug resistance. Cisplatin, for example, is used to treat ovarian cancer, but the development of resistant cell populations limits its efficacy in long-term trials. It has been suggested that targeting the ERK-MKP-1 system could destroy this pathway and further overcome Cisplatin resistance in human ovarian cancer treatment. One compound (CID: 573747) was found in the retrieved results that has never been reported in the literature, but which can apparently inhibit both ERK2 and MKP-1. We might consider this a candidate to provide a new direction for the design of inhibitors of both ERK and MKP-1 to reduce Cisplatin resistance. After iterating over all the known pathways, we found 36 pathways in which at least two proteins are inhibited simultaneously by at least one compound in PubChem.
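The iteration over pathways can be illustrated with a small sketch that, for each pathway, keeps the compounds inhibiting at least two of its member proteins. The data below are toy values: only ERK2 and MKP-1 come from the text, while the other protein names, the compound identifiers and the function name are hypothetical.

```python
def multi_target_inhibitors(pathway_proteins, inhibits, min_hits=2):
    """Return compounds that inhibit at least min_hits proteins of one pathway.

    pathway_proteins: set of protein names in the pathway.
    inhibits: dict mapping compound -> set of proteins it inhibits.
    """
    hits = {}
    for compound, proteins in inhibits.items():
        shared = proteins & pathway_proteins
        if len(shared) >= min_hits:
            hits[compound] = shared
    return hits


# Toy pathway membership and inhibition data (hypothetical apart from
# the ERK2 / MKP-1 pair discussed in the text).
pathways = {
    "MAPK signalling": {"ERK2", "MKP-1", "PROTEIN_A"},
    "other pathway": {"PROTEIN_B", "PROTEIN_C"},
}
inhibits = {
    "cmpd_1": {"ERK2", "MKP-1"},
    "cmpd_2": {"ERK2"},
    "cmpd_3": {"PROTEIN_B"},
}

# Keep only pathways with at least one multi-target inhibitor.
hit_pathways = {}
for name, proteins in pathways.items():
    found = multi_target_inhibitors(proteins, inhibits)
    if found:
        hit_pathways[name] = found

print(hit_pathways)
```

Counting the keys of `hit_pathways` corresponds to the pathway count reported in this case study, here on toy data only.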
Case study 3: Linking KEGG and DrugBank to associate pathways with drug hepatotoxicity
Adverse drug reactions are of serious consequence and are often the subject of rigorous investigation in pharmaceutical R&D processes. Here, we apply Chem2Bio2RDF to study the most significant pathways that are associated with a given adverse drug reaction. The association between side effect and pathway is made using the pathways' gene components that are targets of drugs with known side effects. More specifically, we consider a gene to be related to a certain side effect if and only if at least two drugs targeting this gene have reported the same side effect. Further, if there exists a pathway that contains more than 2 gene targets that are associated with that side effect, an associative relationship between the pathway and side effect can be drawn. Clearly, the more such associative paths are discovered, the stronger the evidence for the pathway-adverse drug effect association becomes.
In this study, we examined hepatotoxicity (liver toxicity) as the side effect. Drug-induced liver injury is a major cause of safety-related drug withdrawal (e.g., Ticrynafen, Benoxaprofen, Bromfenac, Troglitazone, Nefazodone) both before and after a drug goes to market, and thus identifying pathways in the body that might be associated with liver function and toxicity is important. Here we define drugs associated with hepatotoxicity as those with side effect terms that include necrosis, hepatitis and hepatomegaly.
We posed the specific question: find the top 5 pathways in the KEGG pathway dataset that contain at least two efficient targets that have drugs that are associated with hepatotoxicity. A gene target is considered efficient if the gene is targeted by at least two drugs that cause the given side effect. This question can be formed into a SPARQL query which is available on the chem2bio2rdf.org website.
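The question can be broken into two steps: find the efficient targets, then rank pathways by how many of them they contain. The sketch below shows this over toy data and is not the SPARQL query used in the study. All drug, gene and pathway names are hypothetical, and the thresholds are parameters so that either wording of the pathway criterion can be applied.

```python
from collections import defaultdict


def rank_pathways(drug_targets, flagged_drugs, pathway_genes,
                  min_drugs=2, min_genes=2, top=5):
    """Rank pathways by their number of 'efficient' gene targets.

    drug_targets: dict drug -> set of genes it targets.
    flagged_drugs: set of drugs associated with the side effect.
    pathway_genes: dict pathway -> set of member genes.
    A gene is efficient if at least min_drugs flagged drugs target it.
    """
    drugs_per_gene = defaultdict(set)
    for drug, genes in drug_targets.items():
        if drug in flagged_drugs:
            for gene in genes:
                drugs_per_gene[gene].add(drug)
    efficient = {g for g, d in drugs_per_gene.items() if len(d) >= min_drugs}

    scored = []
    for pathway, genes in pathway_genes.items():
        count = len(genes & efficient)
        if count >= min_genes:
            scored.append((pathway, count))
    # Highest count first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top]


# Toy data, entirely hypothetical.
drug_targets = {
    "drug_a": {"G1", "G2"},
    "drug_b": {"G1", "G2", "G3"},
    "drug_c": {"G3"},
    "drug_d": {"G4"},
}
flagged_drugs = {"drug_a", "drug_b", "drug_c"}
pathway_genes = {
    "pathway_1": {"G1", "G2", "G3"},
    "pathway_2": {"G1", "G4"},
    "pathway_3": {"G2", "G3", "G5"},
}

ranking = rank_pathways(drug_targets, flagged_drugs, pathway_genes)
print(ranking)
```

Here G1, G2 and G3 are each targeted by two flagged drugs, so `pathway_1` contains three efficient targets, `pathway_3` contains two, and `pathway_2` is excluded.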
The difficulties of polypharmacology lie in exploring combinations of targets and then identifying compounds that are active against those sets of targets. Linking chemical, biological, systems, and phenotype data is a promising way to address these problems. For example, linking bioassay data to marketed drug data makes it possible to explore compounds similar to drugs that already show polypharmacology. Quinacrine, which has been used as an anthelmintic and in the treatment of giardiasis and malignant effusions, shows polypharmacology. One compound, Loxapine (CID: 71399), was found to show polypharmacology similar to that of quinacrine. Loxapine is active in both BioAssay 859 and BioAssay 377, whose targets are CHRM1 and ABCB1 respectively. As Loxapine tends to be hydrophobic, medicinal chemists would not be surprised that it is active in BioAssay 377, which identifies substrates (or inhibitors) for the multidrug resistance transporter. It is also reported that Loxapine might be metabolized to Amoxapine, which is a considerably weak antagonist in BioAssay 859. Other than Loxapine, many identified compounds such as Oxybutynin were confirmed through literature review to show polypharmacology.
By linking bioassay data to pathways, we can identify the compounds that inhibit at least two of the proteins in a pathway, leading to pathway dysfunction. For example, compound CID 6419769 could interact with proteins HSD11B1 and AKR1C4, which are in different branches of the C21-Steroid hormone metabolism pathway. The blocking of the pathway might partially explain why CID 6419769 has side effects. In protein-protein interaction networks, two proteins are connected if they physically interact. In terms of polypharmacology, the deletion of one protein does not affect the whole network, but if two connected nodes with high degree were deleted, the network would be disturbed. For example, by linking bioassay data to PPI data, we found that two compounds (CID: 460747 and CID: 9549688) are active against two high-degree proteins (PLK1 and TP53) which are associated with cancer.
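The network argument above can be sketched as a degree computation followed by a filter for compounds that are active against both ends of an interaction between two high-degree proteins. The interaction list, the degree threshold, the compound identifiers and the function names are hypothetical. In particular, the PLK1-TP53 edge is assumed for the toy network and is not taken from any dataset.

```python
from collections import Counter


def node_degrees(edges):
    """Count interaction partners per protein; assumes no duplicate edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def compounds_hitting_connected_hubs(edges, actives, min_degree):
    """Find compounds active against both ends of a hub-to-hub interaction.

    edges: list of (protein, protein) physical interactions.
    actives: dict compound -> set of proteins it is active against.
    """
    degree = node_degrees(edges)
    hub_edges = [(a, b) for a, b in edges
                 if degree[a] >= min_degree and degree[b] >= min_degree]
    hits = {}
    for compound, proteins in actives.items():
        pairs = [(a, b) for a, b in hub_edges if a in proteins and b in proteins]
        if pairs:
            hits[compound] = pairs
    return hits


# Toy network (hypothetical edges): PLK1 and TP53 each have three partners.
edges = [
    ("PLK1", "TP53"), ("PLK1", "P1"), ("PLK1", "P2"),
    ("TP53", "P3"), ("TP53", "P4"), ("P1", "P2"),
]
actives = {
    "cmpd_1": {"PLK1", "TP53"},
    "cmpd_2": {"PLK1", "P1"},
}

hub_hits = compounds_hitting_connected_hubs(edges, actives, min_degree=3)
print(hub_hits)
```

Only `cmpd_1` is reported, because `cmpd_2` is active against P1, whose degree falls below the threshold.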
Table: PPI data source distribution (number of records per source).
Table: Pathway data source distribution.
Table: Chemogenomics data source distribution (number of records per source).
Table: Results of linking sample drugs to pathways.
We have created a new systems chemical biology resource called Chem2Bio2RDF that integrates small molecule, target, gene, pathway and drug information and permits cross-source linking with LODD and Bio2RDF. We have demonstrated the utility of this approach in specific examples of polypharmacology, multiple pathway inhibition, and adverse drug reaction - pathway mapping. We also demonstrated the usefulness of extending the SPARQL query language with cheminformatics and bioinformatics functionality, and have discussed the importance of integrating not just heterogeneous data but data sources which cover the same kinds of data.
We propose three further developments of this work. First, we hope to include more resources and datasets into Chem2Bio2RDF as they become available. Second, we see a variety of applications of using large-scale identification and ranking of paths of interest between data sources, and we are working on developing methods for this. Third, we are linking Chem2Bio2RDF with a variety of network and data visualization tools.
We thank Li Huang for helping make the figures.
- Wild DJ: Mining large heterogeneous datasets in drug discovery. Expert Opinion on Drug Discovery 2009, 4(10):995–1004. 10.1517/17460440903233738
- Slater T, Bouton C, Huang ES: Beyond data integration. Drug Discovery Today 2008, 13(13–14):584–9. 10.1016/j.drudis.2008.01.008
- Chen B, Wild DJ, Guha R: PubChem as a Source of Polypharmacology. J Chem Inf and Model 2009, 49(9):2044–2055. 10.1021/ci9001876
- Hopkins AL: Network Pharmacology: The Next Paradigm in Drug Discovery. Nat. Chem. Biol 2008, 4: 682–690. 10.1038/nchembio.118
- Scheiber J, Chen B, Milik M, Sukuru SC, Bender A, Mikhailov D, Whitebread S, Hamon J, Azzaoui K, Urban L, Glick M, Davies JW, Jenkins JL: Gaining insight into off-target mediated effects of drug candidates with a comprehensive systems chemical biology analysis. J Chem Inf Model 2009, 49(2):308–17. 10.1021/ci800344p
- Xie L, Li J, Xie L, Bourne PE: Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors. PLoS Comput Biol 2009, 5(5):e1000387. 10.1371/journal.pcbi.1000387
- Oprea TI, Tropsha A, Faulon J, Rintoul MD: Systems chemical biology. Nat Chem Biol 2007, 3: 447–450. 10.1038/nchembio0807-447
- Berners-Lee T, Hendler J, Lassila O: The semantic web. Scientific American 2001.
- Neumann EK: A life science semantic web: are we there yet? Science 2005, 283: 22–5.
- Neumann EK, Miller E, Wilbanks J: What the semantic web could do for the life sciences. Drug Discovery Today:BIOSILICO 2006, 2: 228–34. 10.1016/S1741-8364(04)02420-5
- Chen H, Ding L, Wu Z, Yu T, Dhanapalan L, Chen JY: Semantic web for integrated network analysis in biomedicine. Brief Bioinform 2009, 10(2):177–92. 10.1093/bib/bbp002
- Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J: Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 2008, 41: 706–716. 10.1016/j.jbi.2008.03.004
- Jentzsch A, Zhao J, Hassanzadeh O, Cheung K, Samwald K, Andersson B: Linking open drug data. Proceedings of the International Conference on Semantic Systems (I-SEMANTICS'09) 2009; Graz, Austria
- Cheung K, Yip K, Smith A, Deknikker R, Masiar A, Gerstein M: YeastHub: A semantic web use case for integrating data in the life sciences domain. Bioinformatics 2005, 21(Suppl 1):i85–96. 10.1093/bioinformatics/bti1026
- Villanueva-Rosales N, Osbahr K, Dumontier M: Towards a Semantic Knowledge base for Yeast biologists. J Biomed Inform 2008, 41(5):779–89. 10.1016/j.jbi.2008.05.001
- Neumann EK, Quan D: Biodash: a semantic web dashboard for drug development. Pac Symp on Biocomput 2006, 11: 176–187.
- Antezana E, Blondé W, Egaña M, Rutherford A, Stevens R, De Baets B, Mironov V, Kuiper M: BioGateway: a semantic systems biology tool for the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S11. 10.1186/1471-2105-10-S10-S11
- Cheung K, Frost HR, Marshall MS, Prud'hommeaux E, Samwald M, Zhao J, Paschke A: A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S10. 10.1186/1471-2105-10-S10-S10
- Bizer C, Cyganiak R: D2R Server - Publishing Relational Databases on the Semantic Web. Poster at the 5th International Semantic Web Conference 2006.
- Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen EL: The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 2003, 43(2):493–500.
- Dong X, Gilbert KE, Guha R, Heiland R, Kim J, Pierce ME, Fox GC, Wild DJ: Web service infrastructure for chemoinformatics. J Chem Inf Model 2007, 47(4):1303–1307. 10.1021/ci6004349
- Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M, Schreiber MJ: BioJava: an open-source framework for bioinformatics. Bioinformatics 2008, 24(18):2096–2097. 10.1093/bioinformatics/btn397
- Durant JL, Leland BA, Henry DR, Nourse JG: Reoptimization of MDL Keys for Use in Drug Discovery. J Chem Inf Comput Sci 2002, 42(6):1273–1280.
- Holliday JD, Hu CY, Willett P: Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screen 2002, 5(2):155–66.
- Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH: PubChem: a public information system for analyzing bioactivities of small molecules. Nucl Acids Res 2009, 37: W623-W633. 10.1093/nar/gkp456
- Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J: DrugBank: A Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucleic Acids Res 2006, 34: D668–72. 10.1093/nar/gkj067
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 2006, 34: D354–357. 10.1093/nar/gkj102
- Mattingly CJ, Colby GT, Forrest JN, Boyer JL: The Comparative Toxicogenomics Database (CTD). Environ Health Perspect 2003, 111(6):793–795. 10.1289/ehp.6028
- Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK: BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucl Acids Res 2007, 35: D198-D201. 10.1093/nar/gkl999
- Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, Hewett M, Lin Z, Liu Y, Liu S, Oliver DE, Rubin DL, Shafa F, Stuart JM, Altman RB: Integrating Genotype and Phenotype Information: An Overview of the PharmGKB Project. The Pharmacogenomics Journal 2001, 1: 167–170.
- Günther S, Kuhn M, Dunkel M, Campillos M, Senger C, Petsalaki E, Ahmed J, Urdiales EG, Gewiess A, Jensen LJ, Schneider R, Skoblo R, Russell RB, Bourne PE, Bork P, Preissner R: SuperTarget and Matador: resources for exploring drug-target relationships. Nucl Acids Res 2008, 36: D919–922. 10.1093/nar/gkm862
- QSAR sets [http://www.cheminformatics.org]
- Wang H, Klinginsmith J, Dong X, Lee AC, Guha R, Wu Y, Crippen GM, Wild DJ: Chemical data mining of the NCI human tumor cell line database. J Chem Inf Model 2007, 47(6):2063–2076. 10.1021/ci700141x
- Keith CT, Borisy AA, Stockwell BR: Multicomponent Therapeutics for Networked Systems. Nat Rev Drug Discovery 2005, 4: 71–78. 10.1038/nrd1609
- Wang J, Zhou JY, Wu GS: ERK-Dependent MKP-1-Mediated Cisplatin Resistance in Human Ovarian Cancer Cells. Cancer Res 2007, 67: 3–1194.
- Jones BE, Czaja MJ: III Intracellular signaling in response to toxic liver injury. Am J Physiol 1998, 275(5 Pt 1):G874–878.
- Gong G, Waris G, Tanveer R, Siddiqui A: Human hepatitis C virus NS5A protein alters intracellular calcium levels, induces oxidative stress, and activates STAT-3 and NF-kappa B. Proc Natl Acad Sci USA 2001, 98(17):9599–9604. 10.1073/pnas.171311298
- Coupet J, Fisher SK, Rauh CE, Lai F, Beer B: Interaction of Amoxapine with Muscarinic Cholinergic Receptors - an in Vitro Assessment. Eur J Pharmacol 1985, 112: 231–235. 10.1016/0014-2999(85)90500-X
- Andrews RC, Rooyackers O, Walker BR: Effects of the 11 Beta-hydroxysteroid Dehydrogenase Inhibitor Carbenoxolone on Insulin Sensitivity in Men with Type 2 Diabetes. J Clin Endocrinol Metab 2003, 88: 285–291. 10.1210/jc.2002-021194
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.