IPAD: the Integrated Pathway Analysis Database for Systematic Enrichment Analysis
© Zhang and Drabier; licensee BioMed Central Ltd. 2012
Published: 11 September 2012
Next-Generation Sequencing (NGS) technologies and Genome-Wide Association Studies (GWAS) generate millions of reads and hundreds of datasets, and there is an urgent need for a better way to accurately interpret and distill such large amounts of data. Extensive pathway and network analysis allows for the discovery of highly significant pathways from a set of disease vs. healthy samples in NGS and GWAS. Knowledge of activation of these processes will lead to elucidation of the complex biological pathways affected by drug treatment, to patient stratification studies of new and existing drug treatments, and to understanding the underlying anti-cancer drug effects. There are approximately 141 biological human pathway resources as of Jan 2012 according to the Pathguide database. However, most currently available resources do not contain disease, drug or organ specificity information such as disease-pathway, drug-pathway, and organ-pathway associations. Systematically integrating pathway, disease, drug and organ specificity is becoming increasingly crucial for understanding the interrelationships between signaling, metabolic and regulatory pathways, drug action, disease susceptibility, and organ specificity from high-throughput omics data (genomics, transcriptomics, proteomics and metabolomics).
We designed the Integrated Pathway Analysis Database for Systematic Enrichment Analysis (IPAD, http://bioinfo.hsc.unt.edu/ipad), defining inter-association between pathway, disease, drug and organ specificity, based on six criteria: 1) comprehensive pathway coverage; 2) gene/protein to pathway/disease/drug/organ association; 3) inter-association between pathway, disease, drug, and organ; 4) multiple and quantitative measurement of enrichment and inter-association; 5) assessment of enrichment and inter-association analysis with the context of the existing biological knowledge and a "gold standard" constructed from reputable and reliable sources; and 6) cross-linking of multiple available data sources.
IPAD is a comprehensive database covering about 22,498 genes, 25,469 proteins, 1956 pathways, 6704 diseases, 5615 drugs, and 52 organs integrated from databases including BioCarta, KEGG, NCI-Nature curated, Reactome, CTD, PharmGKB, DrugBank, and HOMER. The database has a web-based user interface that allows users to perform enrichment analysis from genes/proteins/molecules and inter-association analysis from a pathway, disease, drug, and organ.
Moreover, the quality of the database was validated in the context of the existing biological knowledge and against a "gold standard" constructed from reputable and reliable sources. Two case studies were also presented to demonstrate: 1) self-validation of enrichment analysis and inter-association analysis on brain-specific markers, and 2) identification of previously undiscovered components by the enrichment analysis from a prostate cancer study.
IPAD is a new resource for analyzing, identifying, and validating pathway, disease, drug, organ specificity and their inter-associations. The statistical method we developed for enrichment and similarity measurement and the two criteria we described for setting the threshold parameters can be extended to other enrichment applications. Enriched pathways, diseases, drugs, organs and their inter-associations can be searched, displayed, and downloaded from our online user interface. The current IPAD database can help users address a wide range of biological pathway related, disease susceptibility related, drug target related and organ specificity related questions in human disease studies.
With the age of big data approaching, bioinformatics for Next-Generation Sequencing (NGS) and Genome-Wide Association Studies (GWAS) will be one of the biggest areas of disruptive innovation in life science tools over the next few years. Next-Generation Sequencing technologies and Genome-Wide Association Studies generate millions of reads and hundreds of datasets, and there is an urgent need for a better way to accurately interpret and distill such large amounts of data. Large-scale gene expression analysis has proven useful in identifying differentially expressed genes to classify and predict various disease subtypes. However, it is often difficult to assign biological significance to the large number of genes that are implicated. This problem persists even when users are able to reduce the number of differentially expressed genes substantially via hierarchical clustering methods.
As more information is revealed through large-scale "omics" techniques, it is becoming increasingly apparent that genes do not function alone but through complex biological pathways. Unraveling these intricate pathways is essential to understanding biological mechanisms, disease states, and the function of drugs that transform them. Extensive pathway and network analysis allows for the discovery of highly significant pathways from a set of disease vs. healthy samples in NGS and GWAS. Knowledge of activation of these processes will lead to elucidation of the complex biological pathways affected by drug treatment, to patient stratification studies of new and existing drug treatments, and to understanding the underlying anti-cancer drug effects.
Pathway databases serve as repositories of current knowledge on cell signaling, enzymatic reaction, and genetic regulation. There are more than 300 pathway repositories listed in the Pathguide resource http://www.pathguide.org, of which over 141 specialize in human reactions as of Jan 2012, for example, BioCarta http://www.biocarta.com, KEGG http://www.genome.jp/kegg/, NCI-Nature curated http://pid.nci.nih.gov/PID/index.shtml, Reactome http://www.reactome.org, and Wikipathways http://www.wikipathways.org/. However, these resources have several limitations. First, most currently available resources do not contain disease, drug or organ specificity information such as disease-pathway, drug-pathway, and organ-pathway associations. Next, these resources have been developed with variable degrees of data coverage, quality, and utility. In addition, only half of them provide pathways and reactions in the computer-readable formats needed for automatic retrieval and processing. Lastly, many pathway databases are in distinct formats.
A component is a biomedical concept such as a pathway, disease, drug or organ (nodes in Figure 1). Some pilot studies of connections between such components have been done in the past. For example, Li et al. investigated disease relationships based on their shared pathways. First, they extracted disease-associated genes by literature mining. Then, they connected diseases to biological pathways through overlapping genes. Lastly, they built a disease network by connecting diseases sharing common pathways. Smith et al. combined pathway analysis and drug analysis to identify common biological pathways and drug targets across multiple respiratory viruses based on human host gene expression analysis. Their study suggested that multiple and diverse respiratory viruses invoked several common host response pathways. One study found that disease candidate genes were functionally related in the form of protein complexes or biological pathways and that complex disease ensued from the malfunction of one or a few specific signaling pathways. Another study aimed to explore complex relationships among diseases, drugs, genes, and target proteins altogether and found that mapping the polypharmacology network onto the human disease-gene network revealed not only that drugs commonly acted on multiple targets but also that drug targets were often involved with multiple diseases. Berger and Iyengar also discussed how analysis of biological networks had contributed to the genesis of systems pharmacology and how these studies had improved global understanding of drug targets.
They described that an emerging area of pharmacology, systems pharmacology, which utilizes biological network analysis of drug action as one of its approaches, is becoming increasingly important in: providing new approaches for drug discovery for complex diseases; considering drug actions and side effects in the context of the regulatory networks within which the drug targets and disease gene products function; understanding the relationships between drug action and disease susceptibility genes; and increasing knowledge of the mechanisms underlying the multiple actions of drugs.
Therefore, we created the Integrated Pathway Analysis Database for Systematic Enrichment Analysis (IPAD) for users to query information about genes, diseases, drugs, organ specificity, and signaling and metabolic pathways. First, we integrated data from four kinds of sources: 1) pathway databases from BioCarta, KEGG, NCI-Nature curated, and Reactome, 2) disease databases from CTD http://ctdbase.org/ and PharmGKB http://www.pharmgkb.org, 3) drug databases from DrugBank http://www.drugbank.ca and PharmGKB, and 4) organ-specific genes/proteins from HOMER http://discern.uits.iu.edu:8340/Homer/index.html. Next, we created inter-associations between pathway, disease, drug, and organ specificity. Then, we built a web interface for users to perform 1) enrichment analysis from genes/proteins/molecules, and 2) inter-association analysis from a pathway, disease, drug and organ. Lastly, we presented three case studies: 1) breast cancer related markers, 2) brain-specific markers, and 3) a prostate cancer model, to demonstrate that IPAD can enable users to analyze enrichment and inter-association between pathway, disease, drug and organ, to discover previously undiscovered pathways, diseases, drugs and organs, and to validate the enrichments.
The Integrated Pathway Analysis Database for Systematic Enrichment Analysis (IPAD), located at http://bioinfo.hsc.unt.edu/ipad, is a comprehensive database covering about 22,498 genes, 25,469 proteins, 1956 pathways, 6704 diseases, 5615 drugs, and 52 organs integrated from databases including BioCarta, KEGG, NCI-Nature curated, Reactome, CTD, PharmGKB, DrugBank, and HOMER.
It is the first comprehensive database that can be used to analyze, discover, and validate enrichment and inter-association between pathway, disease, drug and organ. The inter-associations allow further identification of enriched pathways, diseases, drugs and organs. The quality of the database is validated on a "gold standard" constructed from reputable and reliable sources. The ability to choose multiple quantitative parameters (p-value, Absolute Enrichment Value (AE), Relative Enrichment Value (RE), and Mean Jaccard Index (MJI)) provides powerful statistics and computation for accurately calculating enrichment and inter-association. The cross-linking of multiple data sources enables subsequent validation.
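To make the role of these four parameters concrete, the following is a minimal sketch of set-overlap statistics of the same general kind. It is not IPAD's implementation: this section does not give the formulas, so the sketch assumes a hypergeometric tail for the p-value, the raw overlap count as an absolute measure, the observed-to-expected overlap ratio as a relative measure, and the plain Jaccard index in place of the Mean Jaccard Index. The function names, dictionary keys, and toy gene symbols are hypothetical.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n items are drawn from N, of which K belong to the component."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def enrichment_stats(query, component, background_size):
    """Overlap statistics between a query gene list and one component gene set.

    Assumption: these are generic stand-ins for p-value, AE, RE and MJI,
    not the definitions used by IPAD.
    """
    query, component = set(query), set(component)
    overlap = len(query & component)
    expected = len(query) * len(component) / background_size
    union = len(query | component)
    return {
        "p_value": hypergeom_tail(overlap, len(component), len(query), background_size),
        "overlap": overlap,                                  # absolute measure
        "ratio": overlap / expected if expected else float("inf"),  # relative measure
        "jaccard": overlap / union if union else 0.0,        # similarity measure
    }


# Toy example with made-up gene symbols and a background of 20 genes.
stats = enrichment_stats({"A", "B", "C", "D"}, {"C", "D", "E"}, background_size=20)
print(stats)
```

Reporting several measures side by side, rather than the p-value alone, is what lets a user filter on effect size as well as significance.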
[Table 1: Current Statistics of Database. Pathways: 1956 (BioCarta: 310, KEGG: 247, NCI-Nature curated: 222, Reactome: 1177). Further rows: Molecules in Pathway, Molecules in Disease, Molecules in Drug, Molecules in Organ.]
[Table: A Comparison of Human Pathways in IPAD against Several Common Pathway Data Sources. Compared features include Organ Specificity Association and Enrichment Score Quantitative.]
In response to the query input, IPAD can retrieve a list of related components (pathways, diseases, drugs, and organs) in a highly flexible table, with which users can further explore details about inter-association between the components. For example, users can browse the inter-association between each component's molecule and pathway, disease, drug and organ by clicking on the link in the molecule column, and look through the component-related inter-association between pathway, disease, drug and organ by clicking on the inter-association icon in the last column. There are sixteen types of inter-association between pathway, disease, drug and organ in IPAD in total: Pathway-Pathway, Pathway-Disease, Pathway-Drug, Pathway-Organ, Disease-Pathway, Disease-Disease, Disease-Drug, Disease-Organ, Drug-Pathway, Drug-Disease, Drug-Drug, Drug-Organ, Organ-Pathway, Organ-Disease, Organ-Drug, and Organ-Organ. User-queried inter-association pathway/disease/drug/organ data stored in IPAD can also be freely downloaded as tab-delimited text files using links below each enrichment or inter-association table.
Assessing the capabilities of any pathway/disease/drug/organ enrichment analysis in real experiments is a challenge in itself because the full truth of what really occurred between the components and how they are actually inter-associated, if at all, may never be known. In the absence of a "gold standard" (a reference standard against which to establish the performance of the filter), the best alternative is to analyze the results of the enrichment analysis method in the context of the existing biological knowledge. We first used two identified studies to illustrate how well the significant pathways/diseases/drugs/organs identified by the enrichment analysis and inter-association analysis of IPAD fit with the existing biological knowledge. Then we constructed a "gold standard" of 30161 known associations and used it to assess the inter-association analysis of IPAD.
The absence of a definitive answer regarding the involvement of a particular pathway/disease/drug/organ in a given condition makes it impossible to calculate exact values for sensitivity, specificity, ROCs, etc. Therefore, we compared the results of IPAD's enrichment analysis and inter-association analysis and tested whether the significant pathways/diseases/drugs/organs fit with the existing biological context. This type of assessment is the current best practice in this area of enrichment analysis.
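The sixteen inter-association types are simply the ordered pairs of the four component kinds, and the downloads are plain tab-delimited text. The sketch below enumerates the pair types and parses a small tab-delimited sample. The column names `source_id`, `target_id` and `p_value` are hypothetical, chosen only for illustration; the actual header of an IPAD download may differ.

```python
import csv
import io
from itertools import product

COMPONENTS = ("Pathway", "Disease", "Drug", "Organ")

# Ordered pairs, so Pathway-Disease and Disease-Pathway are distinct types.
PAIR_TYPES = [f"{source}-{target}" for source, target in product(COMPONENTS, repeat=2)]


def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical file layout; the single data row reuses a pathway-pathway pair
# and p-value quoted in this article.
sample = "source_id\ttarget_id\tp_value\nhsa05212\thsa05200\t3.04e-66\n"
rows = read_tab_delimited(sample)

print(len(PAIR_TYPES), PAIR_TYPES[:4])
print(rows[0]["source_id"], rows[0]["target_id"], float(rows[0]["p_value"]))
```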
[Table 3: Enrichment Analysis of Breast Cancer Related Markers. Surviving entries:
Pathways: Non-small cell lung cancer; Pathways in cancer; Influence of Ras and Rho proteins on G1 to S Transition; Chronic myeloid leukemia; Small cell lung cancer; E-cadherin signaling in the nascent adherens junction; FOXM1 transcription factor network; a6b1 and a6b4 Integrin signaling; Signaling events mediated by Hepatocyte Growth Factor Receptor (c-Met).
Diseases: Muscular Atrophy, Spinal; Colorectal Neoplasms, Hereditary Nonpolyposis; Gastrointestinal Stromal Tumors.]
By the pathway analysis (p-value ≤ 1.69 × 10^-4, AE ≥ 3.03, RE ≥ 20.01 and MJI ≥ 0.158), we identified 18 associated pathways, most of which are linked with cancer, such as hsa05212 Pancreatic cancer, hsa05213 Endometrial cancer, hsa05215 Prostate cancer, hsa05223 Non-small cell lung cancer, hsa05218 Melanoma, hsa05219 Bladder cancer, hsa05200 Pathways in cancer, hsa05214 Glioma, hsa05220 Chronic myeloid leukemia, hsa05222 Small cell lung cancer, and hsa05210 Colorectal cancer (Table 3). We also discovered 107 diseases (p-value ≤ 1.59 × 10^-4, AE ≥ 4.35, RE ≥ 6.31 and MJI ≥ 0.17; Table 3 shows only the top 12 diseases due to space limitations). Most of them are linked with cancer, such as MESH:D002528 Cerebellar Neoplasms, MESH:D016510 Corneal Neovascularization, MESH:D002282 Adenocarcinoma, Bronchiolo-Alveolar, MESH:D044483 Intestinal Polyposis, PA443756 Colonic Neoplasms, PA445062 Neoplasms, MESH:D003123 Colorectal Neoplasms, Hereditary Nonpolyposis, and MESH:D046152 Gastrointestinal Stromal Tumors.
By the inter-association analysis, we found that the number 1 pathway (hsa05212, pancreatic cancer) we identified from the enrichment analysis is also highly associated with the pathway (hsa05200, pathways in cancer, p-value = 3.04 × 10^-66, 46 orders of magnitude more significant than the pathway-pathway p-value threshold 2.13 × 10^-19), disease (MESH:D046152 Gastrointestinal Stromal Tumors, p-value = 1.89 × 10^-32, 25 orders of magnitude more significant than the pathway-disease p-value threshold 1.28 × 10^-6), and drug (PA450191 lecithin, p-value = 4.55 × 10^-11, 7 orders of magnitude more significant than the pathway-drug p-value threshold 5.73 × 10^-4). Here "highly" is measured by p-value: an individual p-value at least three orders of magnitude lower than the currently used p-value threshold is called "highly significant."
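The "orders of magnitude" comparison can be reproduced with a base-10 logarithm of the ratio between the threshold and the observed p-value. The sketch below is illustrative only; the function names are hypothetical, and rounding down to whole orders is an assumption that happens to match the figures quoted above.

```python
from math import floor, log10


def orders_below(p_value, threshold):
    """How many orders of magnitude p_value lies below the threshold."""
    return log10(threshold / p_value)


def is_highly_significant(p_value, threshold, min_orders=3.0):
    """Rule from the text: at least three orders of magnitude below the threshold."""
    return orders_below(p_value, threshold) >= min_orders


# (observed p-value, threshold) pairs quoted in the text for hsa05212.
quoted = {
    "pathway-pathway": (3.04e-66, 2.13e-19),
    "pathway-disease": (1.89e-32, 1.28e-6),
    "pathway-drug": (4.55e-11, 5.73e-4),
}

for kind, (p, threshold) in quoted.items():
    print(kind, floor(orders_below(p, threshold)), is_highly_significant(p, threshold))
```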
The pathway "hsa05200, pathways in cancer" and disease "MESH:D046152 Gastrointestinal Stromal Tumors" were already included in our previous enrichment analysis and were validated by the inter-association analysis. The drug PA450191 lecithin was filtered out in the enrichment analysis due to its insignificant measurement (p-value = 0.0472, AE = 2, RE = 9.04, MJI = 0.0884) and was discovered by the inter-association analysis as a previously undiscovered drug (p-value = 4.55 × 10^-11, AE = 14, RE = 14.53, MJI = 0.2334). Similarly, the number 1 disease (MESH:D002528 Cerebellar Neoplasms) was found to be inter-associated with hsa05200 Pathways in cancer (validated, p-value = 6.86 × 10^-42, AE = 79, RE = 9.39, MJI = 0.2536), MESH:D016410 Lymphoma, T-Cell, Cutaneous (previously undiscovered, p-value = 3.76 × 10^-100, AE = 320, RE = 6.15, MJI = 0.5389), and PA449780 glutathione (previously undiscovered, p-value = 4.41 × 10^-18, AE = 37, RE = 8.20, MJI = 0.3173); and the number 1 drug (PA451581 tamoxifen) was found to be inter-associated with 211859 Biological oxidations (previously undiscovered, p-value = 9.31 × 10^-25, AE = 24, RE = 30.06, MJI = 0.2654), PA443560 Breast Neoplasms (previously undiscovered, p-value = 3.26 × 10^-50, AE = 49, RE = 35.43, MJI = 0.4042), and PA449503 estradiol (previously undiscovered, p-value = 1.2 × 10^-21, AE = 30, RE = 15.45, MJI = 0.3558).
We also assessed the enrichment analysis through the "self-validation" in Case Study 1. Self-validation makes the result of enrichment analysis more reliable, more meaningful, and consistent with the existing biological context. If a result of enrichment analysis can be validated by its subsequent inter-association analysis, this also indicates that the enrichment analysis and the inter-association analysis are consistent with each other and both reasonably reliable.
Compared to sensitivity, specificity and accuracy, the prediction rates are relatively low because the testing set is much larger than the "gold standard" set. When more "gold standards" of inter-associations become available in the future, the prediction rates and F_measure can improve because pairs that currently cannot be confirmed will then be predicted correctly. Figure 5 also gives a global evaluation for all 30161 inter-associations (Precision 60.73%, Accuracy 89.90%, Sensitivity 78.69%, Specificity 91.72%, F_measure 68.56%). Overall, the balanced F_measure (68.56%) shows that our inter-association analysis method is reliable and can be used for further enrichment analysis.
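As a quick consistency check, the balanced F_measure is the harmonic mean of precision and sensitivity (recall). The sketch below uses the standard textbook definitions, which this section does not spell out, so treating them as the ones behind Figure 5 is an assumption. The confusion-matrix counts in the second example are made up for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, tn, fn):
    """Standard evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": f_measure(precision, sensitivity),
    }


# Reported global precision and sensitivity.
reported_f = f_measure(0.6073, 0.7869)
print(f"{reported_f:.4f}")

# Hypothetical counts, only to show the function in use.
print(metrics(tp=8, fp=2, tn=85, fn=5))
```

With the rounded inputs the result comes out at about 0.6855, which agrees with the reported 68.56% up to rounding of the published percentages.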
We show two case studies of increasing complexity and biological significance to achieve two goals: 1) to demonstrate the IPAD's ability to self-validate by using it to perform enrichment analysis and inter-association analysis on the 369 brain-specific markers, and 2) to demonstrate the ability of IPAD to identify previously undiscovered components by the enrichment analysis based on differentially expressed genes identified from a prostate cancer study.
[Table: Enrichment Analysis of Brain-Specific Markers. Surviving entries:
Pathways: Transmission across Chemical Synapses; Retrograde endocannabinoid signaling; Neurotransmitter Receptor Binding And Downstream Transmission In The Postsynaptic Cell; Neuroactive ligand-receptor interaction; GABA A receptor activation; Ligand-gated ion channel transport; GABA receptor activation; Ion channel transport; Class C/3 (Metabotropic glutamate/pheromone receptors); GABA synthesis, release, reuptake and degradation; Trafficking of AMPA receptors; Glutamate Binding, Activation of AMPA Receptors and Synaptic Plasticity; Neurotransmitter Release Cycle.
Diseases: REM Sleep Behavior Disorder; Brain Damage, Chronic; Affective Disorders, Psychotic.]
The 10 identified diseases: 1) MESH:D001764, Blepharospasm, 2) MESH:D012563, Schizophrenia, Paranoid, 3) MESH:D002385, Cataplexy, 4) MESH:D020187, REM Sleep Behavior Disorder, 5) MESH:D020821, Dystonic Disorders, 6) MESH:D015877, Miosis, 7) MESH:D001925, Brain Damage, Chronic, 8) MESH:D000341, Affective Disorders, Psychotic, 9) MESH:D007415, Intestinal Obstruction, and 10) MESH:D011681, Pupil Disorders, have on average 766 inter-associations between pathway, disease, drug and organ, which shows a strong association with those 369 brain-specific markers.
A blepharospasm is any abnormal contraction or twitch of the eyelid. There have been several important advances in understanding the brain mechanisms associated with blepharospasm. Baker et al. identified blinking-induced functional magnetic resonance imaging (fMRI) activation patterns in five benign essential blepharospasm (BEB) patients and five age-matched control subjects and concluded that the activations observed might represent a hyperactive cortical circuit linking visual cortex, limbic system, supplementary motor cortex, cerebellum, and supranuclear motor pathways innervating the periorbital muscles. Antal et al. examined whether magnetic or electrical stimulation of the brain could improve the involuntary closure of the eyelids in patients with blepharospasm or Meige syndrome.
Schizophrenia is a brain disorder that affects the way a person acts, thinks, and sees the world. People with schizophrenia have an altered perception of reality, often a significant loss of contact with reality. Chen et al. utilized a multivariate approach to identify genomic risk components associated with brain function abnormalities in schizophrenia. They first derived 5157 candidate single nucleotide polymorphisms (SNPs) from a genome-wide array based on their possible connections with schizophrenia and further investigated their associations with brain activations captured with functional magnetic resonance imaging (fMRI) during a sensorimotor task. Then, they identified 222 SNPs which showed a significant difference between 92 schizophrenia patients and 116 healthy controls. Their further pathway analysis showed that the genes associated with the identified SNPs participated in four neurotransmitter pathways: GABA receptor signaling, dopamine receptor signaling, neuregulin signaling and glutamate receptor signaling. Their finding is consistent with our inter-association analysis from the 369 brain-specific markers.
The 16 pathways identified by inter-association analysis using IPAD include 1) Neurotransmitter Receptor Binding And Downstream Transmission In The Postsynaptic Cell, 2) Neuroactive ligand-receptor interaction, 3) GABAergic synapse, 4) GABA receptor activation, 5) Glutamate Binding, Activation of AMPA Receptors and Synaptic Plasticity, 6) Neurotransmitter Release Cycle, 7) GABA synthesis, release, reuptake and degradation, 8) Class C/3 (Metabotropic glutamate/pheromone receptors), and 9) GABA A receptor activation.
The other 7 diseases (all except Intestinal Obstruction) also show strong links with the brain: Cataplexy; REM Sleep Behavior Disorder; Dystonic Disorders; Miosis; Brain Damage, Chronic; Affective Disorders, Psychotic; and Pupil Disorders.
The 7 identified drugs: 1) DB00349, Clobazam, 2) DB00475, Chlordiazepoxide, 3) DB00683, Midazolam, 4) DB00690, Flurazepam, 5) DB00842, Oxazepam, 6) DB01558, Bromazepam, and 7) DB01595, Nitrazepam have on average 63 inter-associations between pathway, disease, drug and organ. All seven show strong links with the brain.
In conclusion, this case study shows that the self-validation of IPAD is an innovation over traditional enrichment analysis and can be useful for validating any pathways, diseases, drugs or organs that users identify with their own data and methods.
RNA-seq is an emerging technology for surveying gene expression and transcriptome content by directly sequencing the mRNA molecules in a sample. RNA-seq can provide gene expression measurements and is regarded as an attractive approach to analyze a transcriptome in an unbiased and comprehensive manner. In this case study, we demonstrate the use of IPAD to identify previously undiscovered components by the enrichment analysis based on differentially expressed genes identified from transcriptional profiling sequencing data. The original purpose of that dataset was to provide a general guide for analysis of gene expression and alternative splicing by deep sequencing. In the prostate cancer study, the prostate cancer cell line LNCaP was treated with androgen/DHT. Mock-treated and androgen-stimulated LNCaP cells were sequenced using the Illumina 1G Genome Analyzer. For the mock-treated cells, there were four lanes totaling ~10 million reads. For the DHT-treated cells, there were three lanes totaling ~7 million reads. All replicates were technical replicates. Samples labeled s1 through s4 are from mock-treated cells. Samples labeled s5, s6, and s8 are from DHT-treated cells. The read sequences are stored in FASTA files. The sequence IDs break down as follows: seq_(unique sequence id)_(number of times this sequence was seen in this lane). We first downloaded the publicly available transcriptional profiling sequencing data from the author's Web Site at http://yeolab.ucsd.edu/yeolab/Papers.html and computed the digital gene expression, next identified 278 differentially expressed genes in RNA-seq data from hormone-treated prostate cancer cell line samples, then performed the enrichment analysis of the 278 genes with IPAD, and lastly carried out the inter-association analysis for these enriched components with IPAD.
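Because each FASTA record stands for a collapsed sequence with its own count, the per-lane read total has to be summed from the IDs rather than taken from the number of records. The sketch below parses IDs of the form described above. The exact header layout (a leading ">" and an underscore-free unique id) is an assumption, and the example IDs and sequences are invented.

```python
import re

# Assumed layout: ">seq_<unique id>_<count>", per the description in the text.
SEQ_ID = re.compile(r"^>?seq_(?P<uid>[^_\s]+)_(?P<count>\d+)$")


def parse_seq_id(header):
    """Split a sequence ID into (unique id, number of times seen in the lane)."""
    match = SEQ_ID.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected sequence ID: {header!r}")
    return match.group("uid"), int(match.group("count"))


def total_reads(fasta_lines):
    """Sum the per-sequence counts over all header lines of one lane."""
    return sum(parse_seq_id(line)[1] for line in fasta_lines if line.startswith(">"))


# Invented example records.
lane = [">seq_1042_7", "ACGTACGTACGT", ">seq_1043_1", "GGCATTACGGTA"]
print(parse_seq_id(lane[0]))
print(total_reads(lane))
```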
[Table: Identification of Previously Undiscovered Components by IPAD. Surviving pathway entries: Metabolism of lipids and lipoproteins; Mitotic G1-G1/S phases; AP-1 transcription factor network; Direct p53 effectors; Cell Cycle, Mitotic; Fatty acid, triacylglycerol, and ketone body metabolism; Metabolism of amino acids and derivatives.]
We found that some of the components that were previously undiscovered but identified by inter-association analysis still showed strong association with prostate cancer. For example, previous studies reported that the top 5 drugs we identified with inter-association analysis (docetaxel, glutathione, gefitinib, rosiglitazone, and carboplatin) were all associated with prostate cancer. Docetaxel is a drug used in men whose prostate cancer no longer responds to hormone therapy. Tannock et al. compared docetaxel plus prednisone with mitoxantrone plus prednisone in men with advanced, hormone-refractory prostate cancer. They found that treatment with docetaxel every three weeks led to superior survival and improved rates of response in terms of pain, serum PSA level, and quality of life, as compared with mitoxantrone plus prednisone. A deficiency in the glutathione enzyme system has been proposed to increase the likelihood of developing both an enlarged prostate and prostate cancer. Nelson discovered that a genetic defect in prostate cancer cells prevents the body from producing glutathione S-transferase (GST), an enzyme needed by the liver to detoxify harmful chemicals. The function of a particular glutathione enzyme, glutathione S-transferase pi 1 (GSTP1), is almost universally lost in both cancerous and pre-cancerous prostate cells. The inactivation of this glutathione enzyme is an early event in the development of prostate cancer. Many studies have linked the loss of GSTP1 to malignant transformation of prostatic tissues.
One study found that gefitinib and bicalutamide showed synergistic effects in primary cultures of prostate cancer derived from androgen-dependent naive patients. Another study discovered that a gefitinib-trastuzumab combination showed promising clinical activity in hormone-refractory prostate cancer. Smith et al. assessed the biological activity of rosiglitazone, a peroxisome proliferator-activated receptor gamma agonist that has been approved to treat type 2 diabetes, in men with recurrent prostate carcinoma using change in prostate specific antigen (PSA) doubling time (PSADT) as the primary outcome variable and concluded that rosiglitazone did not increase PSADT or prolong the time to disease progression more than placebo in men with a rising PSA level after radical prostatectomy and/or radiation therapy. However, rosiglitazone was found to suppress human lung carcinoma cell growth through PPARγ-dependent and PPARγ-independent signal pathways. The number 3 drug, carboplatin, is a chemotherapy agent used for treatment of many types of cancer. Some studies examined the efficacy of carboplatin as a second-line chemotherapy agent (after the failure of taxotere) as well as along with taxotere therapy for men with advanced prostate cancer [57, 58]. A phase II study assessed the outcome and predictive factors for prognosis and toxicity following intermittent chemotherapy with docetaxel, estramustine phosphate, and carboplatin (DEC) in patients with castrate-resistant prostate cancer (CRPC) and found that combination chemotherapy with DEC has a potential effect on CRPC with acceptable toxicity. Jeske et al. conducted a retrospective, bi-institutional review of patients with advanced CRPC treated with carboplatin plus paclitaxel after docetaxel and concluded that carboplatin/paclitaxel chemotherapy following docetaxel in metastatic CRPC is well tolerated, with favorable PSA response rates and survival, and that the combination is a viable option after progression on docetaxel-based therapy.
This case study shows that, compared to traditional enrichment analysis, IPAD's inter-association analysis can be more powerful and useful in identifying previously undiscovered pathways, diseases, drugs or organ specificity.
We developed IPAD as an integrated database system to analyze, identify, and validate pathway, disease, drug, organ specificity and their inter-associations. IPAD integrates many different types of pathway, disease, drug and organ-specificity information: pathway-gene relationships from the BioCarta, KEGG, NCI-Nature curated, and Reactome databases; disease-gene relationships from the CTD and PharmGKB databases; drug-gene relationships from the DrugBank and PharmGKB databases; and organ-specific genes/proteins from the HOMER database.
Enriched pathways, diseases, drugs, organs and their inter-associations can be searched, displayed, and downloaded from our online user interface. The current IPAD database can help users address a wide range of pathway related, disease related, drug related and organ specificity related questions in human disease studies. We also developed a statistical method for enrichment and similarity measurement and described two criteria for setting the threshold parameters, which can be extended to other enrichment applications. Lastly, our database was evaluated by comparison to other known databases, a constructed "gold standard" of 30161 known associations, and two case studies.
In this paper, we have demonstrated that IPAD can be used to discover, analyze, and validate pathway, disease, drug, and organ specificity from experimental data. We illustrated the features of IPAD by testing the inter-association between breast cancer markers related pathway, disease, drug and organ. In Case Study 1, we demonstrated the IPAD's ability to self-validate by using it to perform enrichment analysis and inter-association analysis on the 369 brain-specific markers. In Case Study 2, we further demonstrated the ability of IPAD to identify previously undiscovered components by the enrichment analysis based on differentially expressed genes identified from a prostate cancer study.
Selecting the appropriate statistical parameters for enrichment analysis and inter-association analysis is important. We presented a novel algorithm to measure relationships among the annotation terms based on p-value, Absolute Enrichment Value (AE), Relative Enrichment Value (RE) and Mean Jaccard Index (MJI). We also described the two criteria for setting the threshold parameters: 1) p-value below the 5% quantile and 2) 1 sigma lower control limits for AE, RE and MJI. However, defining each threshold parameter and implementing them effectively can still be challenging, because the gene list size affects the enrichment score and the sizes of the four types of component differ greatly (Table 1: 11663 molecules in 1956 pathways, 17925 molecules in 6704 diseases, 3735 molecules in 5615 drugs, and 5599 molecules in 52 organs).
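The two threshold criteria can be sketched as follows. This is one possible reading, not the authors' code: the "1 sigma lower control limit" is taken here as the mean minus one population standard deviation, and the 5% quantile is computed with the inclusive method of Python's `statistics.quantiles`. Both choices are assumptions, the function names are hypothetical, and the input values are made up.

```python
from statistics import mean, pstdev, quantiles


def p_value_threshold(p_values):
    """5% quantile of the observed p-values (first of the 19 cut points for n=20)."""
    return quantiles(p_values, n=20, method="inclusive")[0]


def lower_control_limit(values, sigmas=1.0):
    """Mean minus `sigmas` population standard deviations (assumed LCL definition)."""
    return mean(values) - sigmas * pstdev(values)


# Made-up score distributions for one component type.
example_p_values = [i / 100 for i in range(101)]    # 0.00, 0.01, ..., 1.00
example_ae_values = [2, 4, 4, 4, 5, 5, 7, 9]

print(p_value_threshold(example_p_values))      # cut-off for "p-value below the 5% quantile"
print(lower_control_limit(example_ae_values))   # cut-off for AE (same recipe for RE and MJI)
```

Deriving the cut-offs from each component type's own score distribution is one way to cope with the very different set sizes noted above, since a fixed cut-off would favor whichever type has the largest gene sets.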
In our website we provide all results for users to cut off according to the specificity of their input data. The number of enriched component sets depends on the structure of the data and the problem space. If no enriched component sets or a very large number of enriched component sets pass the thresholds, users first check whether too few or too many genes are loaded. If there are no such issues, users can tighten up the thresholds for too many significant component sets and relax them for no significant component sets.
In this paper, we introduced organ-pathway, organ-disease, organ-drug, and organ-organ inter-associations for the first time. Throughout the paper, "organ" is shorthand for organ specificity. An organ is a group of tissues that perform a specific function or group of functions. Organ specificity refers to the specificity of the expression level of a gene or protein in a certain type of organ. Identifying organ-gene, organ-pathway, organ-disease, organ-drug, and organ-organ associations can help in several ways: discovering potentially therapeutic genes related to specific organs; measuring and understanding the function and characteristics of cells and tissues in an organ from the perspective of cooperative networks, disease diagnosis, and drug targets; providing important clues about gene function, network signaling, disease treatment and drug targets; and monitoring organ integrity during both preclinical toxicological assessment and clinical safety testing of investigational drugs.
We show an overview of the data integration process in Figure 1. Pathway data in IPAD were collected from the four most commonly used sources, i.e., BioCarta, KEGG, NCI-Nature curated, and Reactome.
BioCarta includes expert-curated interactive graphic models of many pathways from diverse fields such as apoptosis, cell cycle, cell signaling, development, immunology, neuroscience, adhesion, and metabolism. BioCarta data from June 2004 were imported from its website.
KEGG pathway is a collection of manually drawn pathway maps representing knowledge of the molecular interaction and reaction networks in Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Organismal Systems, Human Diseases, and Drug Development. The KEGG data were downloaded from its ftp site.
The NCI-Nature curated pathways are created by Nature Publishing Group editors and reviewed by experts in the field. Biomolecules are annotated with UniProt protein identifiers and relevant post-translational modifications. Interactions are annotated with evidence codes and references. The NCI-Nature curated data were downloaded from its website.
Reactome is an expert-authored, peer-reviewed knowledgebase of human reactions and pathways that provides infrastructure for computation and data mining across the biologic reaction network. Human pathways from Reactome were downloaded from its website.
Disease data in IPAD were downloaded from two different sources: CTD and PharmGKB. The Comparative Toxicogenomics Database (CTD) is a public website and research tool that curates scientific data describing relationships between chemicals, genes, and human diseases. The Pharmacogenetics Knowledge Base (PharmGKB) is a curated knowledgebase about the impact of genetic variation on drug response, with a focus on clinical interpretation of variants associated with drug response, drug dosing guidelines and genetic tests, drug-centered pathways, important PGx gene summaries, and relationships among genes, drugs and diseases.
Drug data in IPAD were downloaded from two different sources, DrugBank and PharmGKB. The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e., chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e., sequence, structure, and pathway) information.
The organ specificity data in IPAD were downloaded from HOMER. HOMER is an integrated Human Organ-specific Molecular Electronic Repository, defining human organ-specific genes/proteins and covering about 22,598 proteins, 52 organs, and 4,290 diseases integrated and filtered from organ-specific protein/gene and disease databases such as dbEST, TiSGeD, HPA, CTD, and Disease Ontology.
We used Perl to parse the downloaded text data, and xml.dom.minidom, a lightweight implementation of the Document Object Model interface in Python 2.7.1, to parse the XML-format data.
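As an illustration of the XML parsing step, the following minimal sketch uses xml.dom.minidom to extract pathway-to-gene mappings. The XML layout, element names (`pathway`, `gene`) and attribute names (`name`, `symbol`) are hypothetical and do not reflect the actual schema of any source database; the function name `parse_pathway_genes` is likewise an assumption made for this example.

```python
from xml.dom.minidom import parseString

# Hypothetical XML layout: element and attribute names are illustrative only.
SAMPLE_XML = """<pathways>
  <pathway id="P1" name="Apoptosis">
    <gene symbol="CASP3"/>
    <gene symbol="BAX"/>
  </pathway>
  <pathway id="P2" name="Cell cycle">
    <gene symbol="CDK1"/>
  </pathway>
</pathways>"""


def parse_pathway_genes(xml_text):
    """Return {pathway name: set of gene symbols} parsed from XML text."""
    dom = parseString(xml_text)
    result = {}
    for pathway in dom.getElementsByTagName("pathway"):
        name = pathway.getAttribute("name")
        # Collect the symbol attribute of every <gene> child of this pathway.
        genes = {g.getAttribute("symbol")
                 for g in pathway.getElementsByTagName("gene")}
        result[name] = genes
    return result


pathway_genes = parse_pathway_genes(SAMPLE_XML)
```

Storing each component as a set of molecule identifiers makes duplicates disappear automatically, which suits the set-based similarity measures used later.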
The Jaccard Index measures similarity between pathways, diseases, drugs and organs, and is defined as the size of the intersection divided by the size of the union of the component sets. The component similarity measure can be defined as the extent of overlap, e.g., the number of genes/proteins shared between two different components. In IPAD, we have four types of components: pathway, disease, drug and organ.
JI(P_i, P_j) = |P_i ∩ P_j| / |P_i ∪ P_j|
where N and M denote the total numbers of components; P_i and P_j denote two components, which can be of the same or different types; and |P_i| and |P_j| are the numbers of molecules in these two components. Their intersection P_i ∩ P_j is the set of all molecules that appear in both P_i and P_j, while their union P_i ∪ P_j is the set of all molecules that appear in either P_i or P_j. Duplicates are eliminated in the intersection and union sets.
With the equations above, we can calculate similarity scores (Jaccard Index, Left Jaccard Index, Right Jaccard Index, and Mean Jaccard Index) for pathway-pathway, disease-disease, drug-drug, organ-organ, pathway-disease, pathway-drug, pathway-organ, disease-drug, disease-organ, and drug-organ associations.
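The similarity scores can be sketched in a few lines of Python. Only the Jaccard Index is defined explicitly in the text (intersection size over union size). The definitions used here for the Left, Right and Mean Jaccard Index (intersection over |P_i|, intersection over |P_j|, and their arithmetic mean) are assumptions made for illustration, as is the function name `jaccard_scores`.

```python
def jaccard_scores(p_i, p_j):
    """Compute set-overlap similarity scores between two components.

    JI follows the definition in the text. LJI, RJI and MJI are ASSUMED
    to be |Pi & Pj| / |Pi|, |Pi & Pj| / |Pj| and their arithmetic mean.
    """
    a, b = set(p_i), set(p_j)  # sets eliminate duplicate molecules
    inter = len(a & b)
    union = len(a | b)
    ji = inter / union if union else 0.0
    lji = inter / len(a) if a else 0.0
    rji = inter / len(b) if b else 0.0
    mji = (lji + rji) / 2
    return {"JI": ji, "LJI": lji, "RJI": rji, "MJI": mji}


# Toy example with made-up gene lists.
scores = jaccard_scores(["TP53", "BRCA1", "BRCA2", "ATM"],
                        ["BRCA1", "BRCA2", "ESR1"])
```

In this toy example the two components share 2 of 5 distinct genes, so JI is 0.4, while the assumed left and right indices are 0.5 and about 0.667.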
where L is the total number of genes in component i, M is the total number of genes in component j, N is the total number of genes in the type of component, p = M/N, x is the number of genes of component i that are found in component j, and the binomial coefficient (L choose x) is the number of possible combinations of x genes from a set of L genes.
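The symbols defined above (L, M, N, p = M/N, x, and the number of combinations of x genes from L) are consistent with a binomial model. The sketch below assumes the p-value is the upper-tail binomial probability of observing at least x shared genes; the exact tail used by IPAD is an assumption here, and the function name `binomial_enrichment_pvalue` is hypothetical. It requires Python 3.8+ for `math.comb`.

```python
from math import comb


def binomial_enrichment_pvalue(x, L, M, N):
    """Upper-tail binomial probability P(X >= x), X ~ Binomial(L, p = M/N).

    ASSUMPTION: the enrichment p-value is the probability of seeing at
    least x overlapping genes by chance. Suitable for modest L only:
    very large L may overflow when converting comb(L, k) to float.
    """
    p = M / N
    return sum(comb(L, k) * p**k * (1 - p)**(L - k)
               for k in range(x, L + 1))


# Toy numbers: 2 genes in component i, half of all genes in component j.
pval_both = binomial_enrichment_pvalue(x=2, L=2, M=1, N=2)
pval_any = binomial_enrichment_pvalue(x=1, L=2, M=1, N=2)
```

For larger gene lists, a numerically stable library routine for the binomial survival function would be preferable to the direct sum.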
To address the multiple testing problem, IPAD adjusts the p-values using the Benjamini & Hochberg method.
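A minimal sketch of the Benjamini & Hochberg adjustment is shown below. It follows the standard step-up procedure (each sorted p-value is multiplied by m/rank, then monotonicity is enforced from the largest p-value downward). The function name `benjamini_hochberg` is chosen for this example and is not taken from IPAD.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is never larger than the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The adjusted values are capped at 1.0 and returned in the same order as the input, so they can be attached back to the tested associations directly.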
The relative enrichment value (RE) of component i in component j is defined as AE/EE.
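The ratio can be illustrated with a short sketch. The text here does not define AE and EE, so this example assumes that AE is the observed number of shared genes x and that EE is the number expected by chance, L × M/N, using the symbols from the p-value definition. Both assumptions, and the function name `relative_enrichment`, are illustrative only.

```python
def relative_enrichment(x, L, M, N):
    """Return RE = AE / EE under the assumptions stated below.

    ASSUMED: AE = x (observed overlap between components i and j).
    ASSUMED: EE = L * M / N (overlap expected by chance).
    """
    ae = x
    ee = L * M / N
    if ee == 0:
        raise ValueError("expected enrichment is zero; RE is undefined")
    return ae / ee


# Toy numbers: 10 of 100 genes overlap where 5 would be expected by chance.
re_value = relative_enrichment(x=10, L=100, M=50, N=1000)
```

Under these assumptions, an RE above 1 indicates more overlap than expected by chance, and an RE below 1 indicates less.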
Thresholds for Inter-association Analysis in IPAD
Table 7. A Comparison of the Five Quantile Thresholds (columns: # Associations in Pathway, # Associations in Disease, # Associations in Drug, # Associations in Organ)
A p-value below the 5% quantile performs better than the other p-value thresholds, giving a balanced F_measure and an appropriate total number of inter-associations (Table 7). First, the threshold (p-value ≤ Quantile 3%) is too strict: it filters out about half of the inter-associations that are identified by the threshold (p-value ≤ Quantile 7%). Second, the thresholds (p-value ≤ Quantile 6%) and (p-value ≤ Quantile 7%) do not perform better in F_measure than the threshold (p-value ≤ Quantile 5%). Finally, we chose (p-value ≤ Quantile 5%) as the best threshold because it identifies 23% more inter-associations than (p-value ≤ Quantile 4%), although the F_measure of the threshold (p-value ≤ Quantile 4%) is slightly higher than that of the threshold (p-value ≤ Quantile 5%).
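The threshold comparison described above can be sketched as follows: pick the p-value at a given quantile as the cutoff, keep the inter-associations at or below it, and score the kept set against a gold standard. This is a sketch under assumptions: the F_measure is taken to be the usual harmonic mean of precision and recall, the quantile uses linear interpolation, and all function names and the toy data are hypothetical.

```python
def quantile(values, q):
    """Quantile of a non-empty list, with linear interpolation."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def select_by_quantile(pvalues_by_pair, q=0.05):
    """Keep the associations whose p-value is at or below the q-quantile."""
    cutoff = quantile(list(pvalues_by_pair.values()), q)
    return {pair for pair, p in pvalues_by_pair.items() if p <= cutoff}


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall (ASSUMED F_measure definition)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data: five associations; a 50% quantile is used only to keep it small.
toy_pvalues = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4, "e": 0.5}
kept = select_by_quantile(toy_pvalues, q=0.5)
score = f_measure(kept, gold={"b", "c", "d"})
```

Repeating the last two steps for several values of q reproduces the kind of trade-off discussed above: a stricter quantile keeps fewer associations, while the F_measure indicates how well the kept set matches the gold standard.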
A Comparison of the Four Sigma Thresholds (columns: # Associations in Pathway, # Associations in Disease, # Associations in Drug, # Associations in Organ)
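The sigma-based criterion (a lower control limit for AE, RE and MJI) can be sketched as below. The sketch assumes the conventional control-chart definition, mean minus k standard deviations, and uses the population standard deviation; neither choice is confirmed by the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def lower_control_limit(values, k=1.0):
    """ASSUMED definition: mean - k * (population standard deviation)."""
    return mean(values) - k * pstdev(values)


def pass_lower_control_limit(scores_by_pair, k=1.0):
    """Keep associations whose score is at or above the k-sigma limit."""
    limit = lower_control_limit(list(scores_by_pair.values()), k)
    return {pair for pair, s in scores_by_pair.items() if s >= limit}


# Toy scores (for example MJI values scaled up for readability).
toy_scores = {"p1": 2, "p2": 4, "p3": 4, "p4": 4,
              "p5": 5, "p6": 5, "p7": 7, "p8": 9}
limit_1sigma = lower_control_limit(list(toy_scores.values()), k=1.0)
kept_1sigma = pass_lower_control_limit(toy_scores, k=1.0)
```

Raising k lowers the limit and admits more associations, so comparing several values of k mirrors the comparison of sigma thresholds.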
If a user's gene list is treated as a component, then the similarity measures and the statistics for genes-pathway, genes-disease, genes-drug and genes-organ can be similarly computed with the equations in the sections: "Similarity Measure for the Inter-association Analysis" and "Statistics for the Inter-association Analysis".
The online version of the IPAD database is a typical 3-tier web application, with a SQL Server 2008 R2 database at the backend database service layer, Apache/PHP server scripts at the middleware application web server layer, and CSS-driven web pages presented in the browser.
We thank Brian Denton, Woody Hagar, Anthony Tissera, and Lynley Dungan for help with database design and web development. We also thank three anonymous reviewers for comments that helped us improve this manuscript.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 15, 2012: Proceedings of the Ninth Annual MCBIOS Conference. Dealing with the Omics Data Deluge. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S15
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.