Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data
© Chen et al. 2010
Received: 23 November 2009
Accepted: 17 May 2010
Published: 17 May 2010
Recently there has been an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Integrating these data sources and identifying patterns that cut across them is of critical interest. Initiatives such as Bio2RDF and LODD have tackled the problem of linking biological data and drug data, respectively, using RDF. Thus far, the inclusion of chemogenomic and systems chemical biology information that crosses the domains of chemistry and biology has been very limited.
We have created a single repository called Chem2Bio2RDF by aggregating data from multiple chemogenomics repositories that is cross-linked into Bio2RDF and LODD. We have also created a linked-path generation tool to facilitate SPARQL query generation, and have created extended SPARQL functions to address specific chemical/biological search needs. We demonstrate the utility of Chem2Bio2RDF in investigating polypharmacology, identification of potential multiple pathway inhibitors, and the association of pathways with adverse drug reactions.
We have created a new semantic systems chemical biology resource, and have demonstrated its potential usefulness in specific examples of polypharmacology, multiple pathway inhibition and adverse drug reaction - pathway mapping. We have also demonstrated the usefulness of extending SPARQL with cheminformatics and bioinformatics functionality.
Recent advances in chemical and biological sciences have led to an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Through integrated and intelligent data mining, this information could provide important insights into the complex functions of biological systems and the actions of chemical compounds or drugs on these systems. However, this can only be achieved when data are semantically integrated (i.e. using multiple data sources that are connected in meaningful ways) and in particular when chemical and biological resources are brought together in such a framework [1, 2].
There are critical problems in biology that can only be answered through computational analysis of this kind of integrated chemical and biological information. For example, it is considered increasingly important to profile existing and potential new drugs for their effects across many protein targets, not just a single target of interest (this is known as polypharmacology [3, 4]). Only by exploring the relationships of the drugs to a wide body of target information can we determine this profile. Further, the polypharmacologic action of drugs on targets that fall within the same pathway can determine the drug's ability to interrupt pathways at multiple points, and thus provide more robust efficacy. Relationships between these pathways and potential side effects of drugs or chemicals that are being considered as drugs (such as undesirable inhibition of a pathway) can only be determined by large-scale analysis of the impact of the chemicals on known pathway systems [5, 6]. The need to address these kinds of problems has led to the emergence of the field of Systems Chemical Biology, which covers the computational analysis of integrated chemical and biological information for the enhancement of biological understanding, including chemogenomics (the relationship of compounds to genes specifically).
Implementing such an integrated system involves the creation of large networks of linked compounds, protein targets, genes, pathways, drugs, diseases and side effects from multiple, heterogeneous sources. It must be possible to query these data in ways that go beyond querying a single source and that allow inferences across domains: for example, a positive experimental test of a chemical compound in a biological enzymatic assay, where the enzyme is associated with a particular metabolic pathway, constitutes a probable action of that compound on the pathway. Currently, there are significant barriers to carrying out this kind of analysis. Many of the needed data sources overlap and cover similar data (we refer to them as homogeneous or semi-homogeneous data sources) but with slightly different foci. Data sources also tend to be published in very diverse formats (text files, scholarly journal articles, XML, relational databases, and so on) and may be structured or unstructured. The semantic relationship of these datasets to each other is often unclear.
Recent Semantic Web technologies provide efficient ways to integrate heterogeneous data. The Semantic Web, initially proposed by Tim Berners-Lee, has demonstrated its utility in the life sciences, healthcare and drug discovery [2, 9–11]. Various semantic languages have been established to represent and query the semantic meaning of data and relationships. In this work we use RDF to model chemogenomic and systems chemical biology data and use SPARQL to query them.
A variety of RDF-based Semantic Web resources have already been created for biological data and drug data separately. Bio2RDF provides a platform and a strategy for generation and querying of biological RDF data in a distributed framework, with around 4 billion RDF triples across over 30 biological resources. Linking Open Drug Data (LODD), led by the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS IG), links RDF data from the Linked Clinical Trials dataset derived from ClinicalTrials.gov, DrugBank (a repository of almost 5000 FDA-approved drugs), and many other sources, with more than 8.4 million RDF triples and 388,000 links to external data sources. Similar efforts include YeastHub, LinkHub, BioDash and BioGateway.
Approaches to querying across heterogeneous data sources in the life sciences have been discussed previously. In the work reported in this paper, we have created an RDF resource for integrated chemical and biological information. We have further employed methods to facilitate the easy generation of SPARQL queries and have implemented a variety of searching options for the RDF datasets by extending the SPARQL query language to include domain-specific cheminformatics and bioinformatics functionality. We refer to this combination of new RDF triples, links to Bio2RDF and LODD, and searching capabilities as Chem2Bio2RDF. We present three specific examples of how Chem2Bio2RDF can be used in the previously described important areas of polypharmacology, pathway inhibition and adverse drug reaction analysis.
Our datasets are organized into six categories based on the kinds of biological and chemical concepts they contain. These categories are: chemical & drug, protein & gene, chemogenomics, systems (i.e., PPI and pathway), phenotype (i.e., disease and side effect) and literature. Some data sources are listed in multiple categories. Some of the data were already available in relational database format from our prior work, and in this case they were simply converted into RDF/XML via a D2R server. For the rest of the datasets, we acquired the raw data (by downloading from web sites) and loaded them into our relational database using customized scripts. These are then published as RDF through the D2R server. The data can be queried via a D2R SPARQL endpoint.
We linked our data to LODD and Bio2RDF using the owl:sameAs construct. Since LODD and Bio2RDF have strict namespace definitions and dereferenceable URIs, it is straightforward to link to them simply via a D2R mapping file. For example, the drug Lepirudin http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB00001 is linked to the following URIs: http://bio2rdf.org/drugbank_drugs:DB00001, http://www.dbpedia.org/resource/Lepirudin, http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/Lepirudin, and http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001
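As a minimal sketch of what these links amount to, the snippet below writes the Lepirudin example as owl:sameAs statements in N-Triples syntax. The helper name `same_as_triples` is hypothetical and used only for illustration; in Chem2Bio2RDF itself the links are produced through the D2R mapping file, not by code like this.

```python
# Illustrative only: emit owl:sameAs links as N-Triples lines.
# The helper name is hypothetical; the URIs are the ones quoted in the text.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(subject, targets):
    """Return one N-Triples line per owl:sameAs link from subject to each target URI."""
    return [f"<{subject}> <{OWL_SAME_AS}> <{target}> ." for target in targets]


lepirudin = "http://chem2bio2rdf.org/drugbank/resource/drugbank_drug/DB00001"
external = [
    "http://bio2rdf.org/drugbank_drugs:DB00001",
    "http://www.dbpedia.org/resource/Lepirudin",
    "http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/Lepirudin",
    "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00001",
]

triples = same_as_triples(lepirudin, external)
for line in triples:
    print(line)
```

Each printed line is a single subject-predicate-object statement, so a consumer that follows owl:sameAs can move from the Chem2Bio2RDF resource to the equivalent Bio2RDF, DBpedia, DailyMed and DrugBank resources.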
Where $N_C$ is the number of bits that are set in the fingerprints of both A and B, and $N_A$ and $N_B$ are the total number of bits set in A and B, respectively.
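The three counts defined here are those used by the standard Tanimoto coefficient, $T(A,B) = N_C / (N_A + N_B - N_C)$. Assuming that this is the similarity measure intended (the formula itself does not appear in the text above), a minimal sketch over fingerprints represented as sets of "on" bit positions might look as follows. The function name and the toy fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two fingerprints given as collections of set-bit positions.

    Assumes the standard definition N_C / (N_A + N_B - N_C).
    """
    a, b = set(bits_a), set(bits_b)
    n_c = len(a & b)          # bits set in both A and B
    n_a, n_b = len(a), len(b)  # bits set in A and in B
    denominator = n_a + n_b - n_c
    # Two empty fingerprints share nothing; return 0.0 rather than dividing by zero.
    return n_c / denominator if denominator else 0.0


# Hypothetical toy fingerprints (bit positions only).
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
score = tanimoto(fp_a, fp_b)
print(score)
```

Here the two fingerprints share two bits out of five distinct bits overall, so the score is 2/5.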
Table: Chem2Bio2RDF datasets and their number of RDF triples; some data sources map to multiple RDF resources.
Since approximately 35% of known drugs have more than one target, the efficacy of many drugs is increasingly thought to come from their effect on multiple targets. This is known as polypharmacology. We recently studied the utility of data in PubChem for identifying cases of polypharmacology as well as how chemical and biological data can be mined on a large scale. We can now extend this, using Chem2Bio2RDF, to incorporate data from DrugBank as well as PubChem. In particular, if a compound has the same multiple targets as a marketed drug but has a different chemical structure, that compound could be a candidate for a novel therapy. Conversely, if we have already established polypharmacologic action of known drugs, can we find other interesting drug-like compounds that also show similar polypharmacology? These questions can be formulated as a query: find all the drug-like compounds in PubChem BioAssay that share at least two targets with a drug in DrugBank. We can now translate this into a SPARQL query of Chem2Bio2RDF (in this example using Dexamethasone - an anti-inflammatory 9-fluoro-glucocorticoid which interacts with six proteins - as the drug of interest). The exact SPARQL query used is available on the chem2bio2rdf.org website.
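The logic of this query can be sketched outside SPARQL as a simple set intersection. The snippet below is an illustration under stated assumptions, not the query published on the website: the function name, the compound labels and the target labels (P1, P2, ...) are all hypothetical toy values, and the drug-likeness filter is omitted.

```python
def compounds_sharing_targets(drug_targets, compound_actives, min_shared=2):
    """Return {compound: shared targets} for compounds active against
    at least `min_shared` of the drug's targets."""
    hits = {}
    for compound, targets in compound_actives.items():
        shared = drug_targets & targets
        if len(shared) >= min_shared:
            hits[compound] = shared
    return hits


# Hypothetical toy data: targets of one drug, and targets each compound is active against.
drug_targets = {"P1", "P2", "P3"}
compound_actives = {
    "cmpd_A": {"P1", "P2"},
    "cmpd_B": {"P1"},
    "cmpd_C": {"P2", "P3", "P9"},
}

hits = compounds_sharing_targets(drug_targets, compound_actives)
for compound in sorted(hits):
    print(compound, sorted(hits[compound]))
```

In the RDF setting the two input mappings would come from the DrugBank drug-target links and the PubChem BioAssay activity links, joined on the shared target identifiers.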
Nine of the retrieved active compounds are active against at least two of the same protein targets, and all of them are drug-like (in terms of Lipinski's Rule of Five). These compounds make sense from a medicinal chemistry perspective. For example, dexamethasone and one result, tocris-1126 (CID: 6603742), have similar activities against NFKB1 and NR3C1 because they differ only slightly in stereochemistry. The activity of dexamethasone is also similar to that of another search hit, hydrocortisone (CID: 5754), where the addition of the methyl and fluorine to hydrocortisone has no effect on the activity but improves its drug-likeness as measured by the Rule of Five. In our generalized mapping process, we found 55 significant proteins in DrugBank that are studied in PubChem BioAssay. A total of 27 drugs have corresponding active compounds showing polypharmacology.
The MAPK signalling pathway plays important roles in coordinating cell proliferation, differentiation and death. Inhibitors of proteins involved in the pathway are widely studied, but the robustness of this pathway leads to drug resistance. Cisplatin, for example, is used to treat ovarian cancer, but the development of resistant cell populations limits its efficacy in long-term trials. It has been suggested that targeting the ERK-MKP-1 system could destroy this pathway and further overcome Cisplatin resistance in human ovarian cancer treatment. One compound (CID: 573747) was found in the retrieved results that has never been reported in the literature, but which can apparently inhibit both ERK2 and MKP-1. We might consider this a candidate to provide a new direction for the design of inhibitors of both ERK and MKP-1 to reduce Cisplatin resistance. After iterating over all the known pathways, we found 36 pathways in which at least two proteins are inhibited simultaneously by at least one compound in PubChem.
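The iteration over pathways described above can be sketched as follows. This is an illustration only: the function name and all pathway, protein and compound labels are hypothetical toy values, and the real analysis was run as SPARQL queries over the linked data.

```python
def multi_hit_pathways(pathway_proteins, compound_targets, min_hits=2):
    """Return {pathway: {compound: proteins hit}} for every pathway in which
    a single compound inhibits at least `min_hits` member proteins."""
    result = {}
    for pathway, proteins in pathway_proteins.items():
        for compound, targets in compound_targets.items():
            hit = proteins & targets
            if len(hit) >= min_hits:
                result.setdefault(pathway, {})[compound] = hit
    return result


# Hypothetical toy data.
pathway_proteins = {
    "pathway_X": {"P1", "P2", "P3"},
    "pathway_Y": {"P4", "P5"},
}
compound_targets = {
    "cmpd_A": {"P1", "P3"},   # hits two proteins in pathway_X
    "cmpd_B": {"P2", "P4"},   # hits only one protein per pathway
}

result = multi_hit_pathways(pathway_proteins, compound_targets)
print(result)
```

Counting the keys of the returned mapping gives the number of pathways with at least one multi-point inhibitor, which is the quantity reported in the text.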
Adverse drug reactions are of serious consequence and are often the subject of rigorous investigation in pharmaceutical R&D processes. Here, we apply Chem2Bio2RDF to study the most significant pathways that are associated with a given adverse drug reaction. The association between side effect and pathway is made using the pathways' gene components that are targets of drugs with known side effects. More specifically, we consider a gene to be related to a certain side effect if and only if at least two drugs targeting this gene have reported the same side effect. Further, if there exists a pathway that contains more than 2 gene targets that are associated with that side effect, an associative relationship between the pathway and side effect can be drawn. Clearly, the more of these associative paths that can be discovered, the stronger the evidence for such a pathway-adverse drug effect association becomes.
In this study, we examined hepatotoxicity (liver toxicity) as the side effect. Drug-induced liver injury is a major cause of safety-related drug withdrawal (e.g., Ticrynafen, Benoxaprofen, Bromfenac, Troglitazone, Nefazodone) both before and after a drug goes to market, and thus identifying pathways in the body that might be associated with liver function and toxicity is important. Here we define drugs associated with hepatotoxicity as those with side effect terms that include necrosis, hepatitis and hepatomegaly.
We posed the specific question: find the top 5 pathways in the KEGG pathway dataset that contain at least two efficient targets of drugs that are associated with hepatotoxicity. A gene target is considered efficient if the gene is targeted by at least two drugs that cause the given side effect. This question can be expressed as a SPARQL query, which is available on the chem2bio2rdf.org website.
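The two thresholds in this question (at least two drugs per gene, at least two efficient genes per pathway) can be sketched as below. This is not the published SPARQL query: the function name and the drug, gene and pathway labels are hypothetical, and ranking pathways by their number of efficient targets is an assumption, since the text does not state how the top 5 are ordered.

```python
from collections import defaultdict


def rank_pathways(drug_side_effects, drug_targets, pathway_genes,
                  side_effect_terms, min_drugs=2, min_genes=2, top_n=5):
    """Rank pathways by how many 'efficient' gene targets they contain."""
    # Drugs reporting at least one of the side effect terms of interest.
    flagged = {drug for drug, effects in drug_side_effects.items()
               if effects & side_effect_terms}
    # For each gene, collect the flagged drugs that target it.
    drugs_per_gene = defaultdict(set)
    for drug in flagged:
        for gene in drug_targets.get(drug, set()):
            drugs_per_gene[gene].add(drug)
    # A gene is efficient if at least `min_drugs` flagged drugs target it.
    efficient = {gene for gene, drugs in drugs_per_gene.items()
                 if len(drugs) >= min_drugs}
    scored = []
    for pathway, genes in pathway_genes.items():
        hits = genes & efficient
        if len(hits) >= min_genes:
            scored.append((pathway, sorted(hits)))
    # Assumed ranking: most efficient targets first, ties broken by name.
    scored.sort(key=lambda item: (-len(item[1]), item[0]))
    return scored[:top_n]


# Hypothetical toy data; the three terms are those named in the text.
terms = {"necrosis", "hepatitis", "hepatomegaly"}
drug_side_effects = {
    "d1": {"hepatitis"},
    "d2": {"necrosis", "nausea"},
    "d3": {"hepatomegaly"},
    "d4": {"headache"},
}
drug_targets = {
    "d1": {"G1", "G2"},
    "d2": {"G1", "G2", "G3"},
    "d3": {"G3"},
    "d4": {"G1", "G4"},
}
pathway_genes = {
    "pw_A": {"G1", "G2", "G3"},
    "pw_B": {"G1", "G4"},
    "pw_C": {"G2", "G3", "G5"},
}

ranked = rank_pathways(drug_side_effects, drug_targets, pathway_genes, terms)
print(ranked)
```

Keeping both thresholds as parameters makes it easy to tighten the pathway criterion (for example to more than two associated genes) without changing the rest of the logic.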
The difficulties of polypharmacology lie in exploring combinations of targets and then identifying compounds that are active against those sets of targets. Linking chemical, biological, systems, and phenotype data is a promising way to address these problems. For example, linking bioassay data to marketed drug data makes it possible to explore compounds similar to drugs that already show polypharmacology. Quinacrine, which has been used as an anthelmintic and in the treatment of giardiasis and malignant effusions, shows polypharmacology. One compound, Loxapine (CID: 71399), is found to show polypharmacology similar to that of quinacrine. Loxapine is active in both BioAssay 859 and BioAssay 377, whose targets are CHRM1 and ABCB1 respectively. As Loxapine tends to be hydrophobic, medicinal chemists would not be surprised that it is active in BioAssay 377, which identifies substrates (or inhibitors) for the multidrug resistance transporter. It is also reported that Loxapine might be metabolized to Amoxapine, which is a considerably weak antagonist in BioAssay 859. Other than Loxapine, many identified compounds such as Oxybutynin were confirmed by literature review to show polypharmacology.
By linking bioassay data to pathways, we can identify the compounds that inhibit at least two of the proteins in a pathway, leading to pathway dysfunction. For example, compound CID 6419769 could interact with proteins HSD11B1 and AKR1C4, which are in different branches of the C21-Steroid hormone metabolism pathway. The blocking of the pathway might partially explain why CID 6419769 has side effects. In protein-protein interaction networks, two proteins are connected if they physically interact. In terms of polypharmacology, the deletion of one protein does not affect the whole network, but if two connected nodes with high degree were deleted, the network would be disturbed. For example, by linking bioassay data to PPI data, we found that two compounds (CID: 460747 and CID: 9549688) are active against two high-degree proteins (PLK1 and TP53) which are associated with cancer.
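The network argument above can be sketched as a search for compounds whose targets include two connected high-degree proteins. The text does not define "high degree", so the `min_degree` threshold below is an assumption, as are the function names and the toy interaction list.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (protein, protein) interactions."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions when counting degree
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_hub_pairs(edges, compound_targets, min_degree=3):
    """Return {compound: [(hub, hub), ...]} for compounds active against
    two interacting proteins that both have degree >= min_degree."""
    adjacency = build_adjacency(edges)
    hubs = {node for node, nbrs in adjacency.items() if len(nbrs) >= min_degree}
    result = {}
    for compound, targets in compound_targets.items():
        pairs = [(a, b) for a, b in combinations(sorted(targets & hubs), 2)
                 if b in adjacency[a]]
        if pairs:
            result[compound] = pairs
    return result


# Hypothetical toy network: P1 and P2 each have three partners and interact with each other.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
         ("P2", "P5"), ("P2", "P6"), ("P3", "P4")]
compound_targets = {
    "cmpd_A": {"P1", "P2"},   # two connected hubs
    "cmpd_B": {"P1", "P5"},   # only one hub
    "cmpd_C": {"P3", "P4"},   # connected, but neither is a hub
}

hub_hits = connected_hub_pairs(edges, compound_targets)
print(hub_hits)
```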
Table: PPI data source distribution.
Table: Pathway data source distribution.
Table: Chemogenomics data source distribution.
Table: Results of linking sample drugs to pathways.
We have created a new systems chemical biology resource called Chem2Bio2RDF that integrates small molecule, target, gene, pathway and drug information and permits cross-source linking with LODD and Bio2RDF. We have demonstrated the utility of this approach in specific examples of polypharmacology, multiple pathway inhibition, and adverse drug reaction - pathway mapping. We also demonstrated the usefulness of extending the SPARQL query language with cheminformatics and bioinformatics functionality, and have discussed the importance of integrating not just heterogeneous data but data sources which cover the same kinds of data.
We propose three further developments of this work. First, we hope to include more resources and datasets into Chem2Bio2RDF as they become available. Second, we see a variety of applications of using large-scale identification and ranking of paths of interest between data sources, and we are working on developing methods for this. Third, we are linking Chem2Bio2RDF with a variety of network and data visualization tools.
We thank Li Huang for helping make the figures.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.