Extending pathways based on gene lists using InterPro domain signatures
- Florian Hahne†1,
- Alexander Mehrle†1,
- Dorit Arlt1,
- Annemarie Poustka1,
- Stefan Wiemann1 and
- Tim Beissbarth†1Email author
© Hahne et al; licensee BioMed Central Ltd. 2008
Received: 16 July 2007
Accepted: 04 January 2008
Published: 04 January 2008
High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method is however, that most genes that are investigated in the experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. This works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways.
In order to validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we create test gene lists by randomly selecting pathway genes, removing these genes from the known pathways and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that we can recover pathway memberships based on the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example.
Results based on simulation and data analysis show that domain based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R based software package domainsignatures, to routinely perform this analysis on the results of high-throughput screening, is available via Bioconductor.
Many high-throughput techniques such as DNA microarray analysis, siRNA screens or proteomic approaches result in extensive data. After careful statistical analysis the result of such an experiment is typically a list of candidate genes, relevant for certain biological processes, or an ordered gene list, sorted according to the significance in one or more biological processes . The data analysis and interpretation of such lists provides a major bottleneck and a task for bioinformatics and systems biology. Many approaches have been published that assess the significant over-representation of biological functions or pathways as annotated in GO or pathway databases through gene set enrichment analysis [2–8].
However, for many of the screened genes there is hardly any functional annotation available. For example, the number of human genes annotated in the KEGG database  is only about 4,000. This contrasts the estimated number of putative protein coding genes which exceeds 23,000 (counted as the number of Entrez gene ids in the IPI-human database) [10, 11]. Many approaches rely on automatically inferred functional annotations . Such annotations can be assigned, for example, based on the protein sequences and predicted domains, as well as on protein interactions or co-expression. Structured vocabularies such as the Gene Ontologies (GO) comprise many such kinds of annotation .
Especially interesting and accessible is the functional annotation that is based on predicted protein domains, as there are already highly reliable prediction methods and databases available. Of the 23,000 genes in the IPI-human database, approximately 19,000 have at least one InterPro-domain assigned, of the 4,000 genes in the human-KEGG pathways, nearly all have at least one InterPro-domain. Together, these comprise approximately 3,000 distinct InterPro-domains . Protein domains very often directly correspond to some of the core biological functions, such as e.g. DNA binding, kinase or phosphorylation activity, or otherwise to cellular localization. Therefore, predicted protein domains are often utilized to predict these annotations, for instance in the GO database. To our knowledge, however, protein-domain signatures have thus far not been used as a classifier to predict assignment of genes to pathways, which constitute the biological processes in a cell. Here we want to critically assess the utility of domain information to predict pathway membership and provide a tool to utilize this information. As we deem it unlikely to reliably predict pathway memberships for individual genes, our main aim is to find relevant pathways significantly enriched in lists of genes, in particular lists from high-throughput experiments. Pathways consist of a series of chemical reactions occurring within a cell. Of special importance for the functioning of biological systems are the signalling pathways, in which signals or cellular stimuli are transmitted mainly through protein interaction and phosphorylation events, often leading to altered gene expression within a cell. Known signalling pathways are collected within different pathway databases . Pathways thus represent functional units for collecting genes or proteins. Crucial to gain hints of the function of genes is the mapping of these to one or more known pathways.
In many high-throughput experiments, we are facing the problem that we want to test for enrichment of pathways in gene lists, but the genes are often not annotated to pathways at all, or may even lack any form of functional annotation. We tried to answer this problem with the following idea: It should be possible to map a high fraction of the human genes to pathways via their protein domains. Hypothesis: Protein domains can be treated as functional elements of the proteins, the set of interpro-domains of the proteins of one pathway could serve as its functional fingerprint. In order to test the validity of this idea, we designed a simulation study based on the KEGG database and the protein domain assignment in the IPI database. Virtually removing genes from the KEGG pathways, we demonstrate that we can place these genes back into the respective pathway with high accuracy. We have further applied the method to a biological dataset from a high-throughput siRNA screen and demonstrate, that we are able to get biologicaly meaningful results.
Results and Discussion
the number of intersecting domains for both sets,
the number of domains that are unique to set 1,
the number of domains that are unique to set 2,
the number of domains not part of any of the two sets.
By means of a binomial test, we compute on this contingency table the probability of over-representation of domains of set 1 in set 2 given all available domains. The negative log-transform of this probability can now be used as a measure for signature similarity. Note that higher values indicate a higher similarity between the two sets.
In order to assess whether InterPro domains can be used as a tool for functional classification we conducted a simulation experiment. We chose the pathways represented in the KEGG database as test environment, since this database is readily available. In principle, however, the method should be applicable to any kind of functional classification. Assuming, that the domain structure of either a single protein or, more likely, of a group of proteins with related function should be sufficient for functional classification (i.e. to assign a pathway membership), we sampled different numbers of genes that are known to be part of a given KEGG pathway. These genes were then removed from the pathway to construct somewhat smaller virtual pathways, which were used in the further test. In a subsequent step, we tried to recover the pathway membership of the collections of removed genes, based on the computed similarity of their domain signature to the domain signature derived from all remaining proteins of the respective pathway. To assess the significance of this score, the similarity measure of the sampled genes was compared to that of a control subset consisting of randomly chosen genes not specifically annotated to the respective pathway. We expect the latter to be less similar to the pathway signature.
Simulating a single pathway
Simulating all 181 pathways
In a second step, we performed a similar simulation for all 181 KEGG pathways, this time sampling sets containing from one up to ten pathway genes, each 1,000 times. To assess how well the pathway gene similarity and the random sample similarity separate, we computed ROC curves for each pathway and each sample number and, subsequently, the area under the curve (AUC) as a measure of separation. ROC curves are plots of the true positive rate against the false positive rate at different cutoff levels of a binary classifier. Thus, an AUC of 1 represents an ideal classification while an AUC of 0.5 is what can be achieved by random guessing. In the box plots in Figure 2b we plot the distribution of AUC values for all 181 pathways for the different numbers of sampled genes. For most of the pathways we achieve almost complete separation with only a small number of sampled genes. Already for two sampled genes, the bulk of the data lies above 0.9 indicating low misclassification rates. Sampling of five genes is sufficient to achieve complete separation for the majority of the 181 pathways. Still, some of the pathways can not be classified at all, or only very poorly. Detailed analysis of these pathways revealed that they often comprise only small numbers of proteins or that they belong to the class of metabolic pathways. In both cases, this result can be expectet: For small pathways we run into problems during the sampling process, because due to the removal of genes from the training set, we end up with sparse domain signatures based on a very small number of genes. Moreover, we do not expect a good performance of our method for many of the metabolic pathways since they are based on a functional classification that may only be weakly reflected in the composition of protein domains. Most of these domains are extremely unspecific, leading to a poor specificity of the overall domain signature. A detailed list of the performance of all 181 pathways can be found as Additional file 2.
Simulating noisy data
Assessment of significance
In order to apply our method to real biological data, one has to derive a measure of significance indicating that the similarity of the gene list to a given pathway is not merely by chance. Since we do not have a clear concept of probability distributions or statistical models of the highly complex pathway data, we applied a non-parametric sampling method to derive p-values for statistical significance. To this end, we sample for each pathway 10,000 random gene lists of the same size as the gene list analysed and compute their similarities to the pathway domain signature. The p-value is defined as the fraction of data points in this empirical distribution larger than the similarity measurement of the original gene list.
Simulation of pathway mixtures
Application to screening data
In order to demonstrate the applicability of our approach, we tested the method on a data set generated in the department in a genome wide high-throughput screen aiming to identify modulators of cell adhesion. This screen resulted in a list of approximately 1,000 candidate proteins. Application of our method revealed a significant similarity to the domain signature of the KEGG pathway hsa00531 for glycosaminoglycan degradation. Glycosaminoglycans (GAG) are linear polysaccharides built from disaccharidic units and they are covalently attached to core proteins, forming proteoglycans (PG) . They provide mechanical links between the extracellular matrix and the cell surface, which influence signal transduction pathways and the cytoskeleton [16, 17]. Proteins involved in glycosaminoglycan degradation are essential components in the regulation of GAG and PG functions. Down-regulation of genes involved in glycosaminoglycan degradation induces loss of cell adherence, and tumor stroma contains proteoglycans and glycosaminoglycans often in higher proportion than normal tissue . As the biological role of the KEGG pathway hsa00531 fits to the phenotype which we screened with the assay for cell detachment, we conclude that our method is suitable for application to high-throughput data, revealing relevant proteins for potentially any cellular process. Furthermore, we could not detect a significant over-representation of hsa00531 in a classical hypergeometric test based on available KEGG annotations, indicating that our method adds additional information to gene lists compared to well established techniques.
We have established a technique to use predicted protein domains in order to test for enrichment of pathways based on the domain signatures of the gene lists of interest. In a simulation study we show that our method can reliably predict the pathway membership of genes that were previously removed from a database of known pathways. To our knowledge, this approach is unique to assign pathway membership based on sequence information. This is important for prediction of biological functions and the basis for systems biological approaches. We demonstrate the applicability of our method in the functional analysis of high-throughput siRNA screening data and identify the enrichment of a biologically significant pathway in a poorly annotated gene list. This strategy will improve the field of gene set enrichment analysis and help to point out the biologically meaningful aspect of gene sets. The method is available as open source software in the Bioconductor package domainsignatures .
Gene to pathway mapping was done using information obtained as tab-delimited text files from the KEGG FTP-site , files hsa_pathway.list and hsa_ncbi-geneid.list. The files used in this analysis were downloaded in November 2006 and contain references to 181 pathways including 4,085 genes. The mapping of proteins to InterPro-domains was based on the information stored in the IPI database. The database files were downloaded from the IPI web site  in November 2006. This version of the IPI database contains 23,423 human genes (coding for at least one protein) and information for 3,968 genes which are also listed in the KEGG database. The Entrez GeneID  was used as the universal identifier. On average, each pathway contains 46,6 genes and 68.1 unique domains. On average each protein in the KEGG database has 3.5 unique domains. Each domain on average appears in 3.95 different pathways. Domains appear on average in 6.7% of all genes in a pathway and there are only very few domains that appear in the majority of genes of a pathway. Some further statistics on the distribution of domains in pathways and proteins are available in Additional file 2.
All databases were stored and queried via a local MS SQL server database.
All computations were done using the statistical programming language R in combination with Bioconductor tools. We provide a software package domainsignatures containing the means and source code to compute pathway enrichment in gene lists based on domain signatures (Additional file 1). The package uses the latest KEGG version based on the Bioconductor KEGG package and the domain annotation based on the BiomaRt package  and the annotation in the Ensembl  database.
We used a biological dataset generated in the department as an example to test the utility of our method on the outcome of high-throuput experiments. The data is based on an RNAi screening experiment utilizing a cell detachment assay. This assay measured the number of floating cells and thus the loss of adherence after gene knock down in HEK293 cells transfected with siRNA pools. The siRNAs are available from Dharmacon (Lafayette). Transfections were performed using Lipofectamine (Invitrogen), and after 72 hours the culture medium was centrifuged to collect non-adherent cells. The cells were then fixed and counted by flow cytometry.
This work was supported by grants (01GR450 and 01GR0420) in the context of the German National Genome Research Network (NGFN) by the German Federal Ministry for Education and Research (BMBF). We thank Dirk Ledwinka for IT support.
- Beissbarth T: Interpreting experimental results using gene ontologies. Methods Enzymol 2006, 411: 340–352. 10.1016/S0076-6879(06)11018-6View ArticlePubMedGoogle Scholar
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
- Alexa A, Rahnenfuehrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006, 22(13):1600–1607. 10.1093/bioinformatics/btl140View ArticlePubMedGoogle Scholar
- Manoli T, Gretz N, Groene HJ, Kenzelmann M, Eils R, Brors B: Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics 2006, 22(20):2500–2506. 10.1093/bioinformatics/btl424View ArticlePubMedGoogle Scholar
- Al-Shahrour F, Minguez P, Tárraga J, Medina I, Alloza E, Montaner D, Dopazo J: FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res 2007, 35(Web Server issue):W91-W96. 10.1093/nar/gkm260PubMed CentralView ArticlePubMedGoogle Scholar
- Froehlich H, Speer N, Poustka A, Beissbarth T: GOSim – An R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 2007, 8: 166. 10.1186/1471-2105-8-166View ArticleGoogle Scholar
- Froehlich H, Fellmann M, Sueltmann H, Poustka A, Beissbarth T: Large scale statistical inference of signaling pathways from RNAi and microarray data. BMC Bioinformatics 2007, 8: 386. 10.1186/1471-2105-8-386PubMed CentralView ArticlePubMedGoogle Scholar
- Tresch A, Beissbarth T, Sueltmann H, Kuner R, Poustka A, Buness A: Discrimination of direct and indirect interactions in a network of regulatory effects. J Comput Biol 2007, 14(9):1217–1228. 10.1089/cmb.2007.0085View ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, (32 Database):D277-D280. 10.1093/nar/gkh063
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4(7):1985–1988. 10.1002/pmic.200300721View ArticlePubMedGoogle Scholar
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2007, (35 Database):D26-D31. 10.1093/nar/gkl993
- Thomas PD, Mi H, Lewis S: Ontology annotation: mapping genomic regions to biological function. Curr Opin Chem Biol 2007, 11: 4–11. 10.1016/j.cbpa.2006.11.039View ArticlePubMedGoogle Scholar
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJA, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: New developments in the InterPro database. Nucleic Acids Res 2007, (35 Database):D224-D228. 10.1093/nar/gkl841Google Scholar
- Schaefer CF: Pathway databases. Ann N Y Acad Sci 2004, 1020: 77–91. 10.1196/annals.1310.009View ArticlePubMedGoogle Scholar
- Raman R, Sasisekharan V, Sasisekharan R: Structural insights into biological roles of protein-glycosaminoglycan interactions. Chem Biol 2005, 12(3):267–277. 10.1016/j.chembiol.2004.11.020View ArticlePubMedGoogle Scholar
- Wegrowski Y, Maquart FX: Involvement of stromal proteoglycans in tumour progression. Crit Rev Oncol Hematol 2004, 49(3):259–268. 10.1016/j.critrevonc.2003.10.005View ArticlePubMedGoogle Scholar
- Bass MD, Humphries MJ: Cytoplasmic interactions of syndecan-4 orchestrate adhesion receptor and growth factor receptor signalling. Biochem J 2002, 368(Pt 1):1–15. 10.1042/BJ20021228PubMed CentralView ArticlePubMedGoogle Scholar
- Timar J, Lapis K, Dudis J, Sebestyen A, Kopper L, Kovalszky I: Proteoglycans and tumor progression: Janus-faced molecules with contradictory functions in cancer. Semin Cancer Biol 2002, 12(3):173–186. 10.1016/S1044-579X(02)00021-4View ArticlePubMedGoogle Scholar
- Kyoto Encyclopedia of Genes and Genomes[http://www.genome.jp/kegg/]
- International Protein Index[http://www.ebi.ac.uk/IPI]
- Entrez Gene[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene]
- Durinck S, Moreau Y, Kasprzyk A, Davis S, Moor BD, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 2005, 21(16):3439–3440. 10.1093/bioinformatics/bti525View ArticlePubMedGoogle Scholar
- Ensembl Genome Database[http://www.ensembl.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.