GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists
- Eran Eden†1,
- Roy Navon†2, 3,
- Israel Steinfeld3, 4,
- Doron Lipson4 and
- Zohar Yakhini3, 4
© Eden et al; licensee BioMed Central Ltd. 2009
Received: 16 October 2008
Accepted: 03 February 2009
Published: 03 February 2009
Since the inception of the GO annotation project, a variety of tools have been developed that support exploring and searching the GO database. In particular, a variety of tools that perform GO enrichment analysis are currently available. Most of these tools require as input a target set of genes and a background set and seek enrichment in the target set compared to the background set. A few tools also exist that support analyzing ranked lists. The latter typically rely on simulations or on union-bound correction for assigning statistical significance to the results.
GOrilla is a web-based application that identifies enriched GO terms in ranked lists of genes, without requiring the user to provide explicit target and background sets. This is particularly useful in the many typical cases where genomic data may be naturally represented as a ranked list of genes (e.g. by level of expression or of differential expression). GOrilla employs a flexible threshold statistical approach to discover GO terms that are significantly enriched at the top of a ranked gene list. Building on a complete theoretical characterization of the underlying distribution, called mHG, GOrilla computes an exact p-value for the observed enrichment, taking threshold multiple testing into account without the need for simulations. This enables rigorous statistical analysis of thousands of genes and thousands of GO terms within seconds. The output of the enrichment analysis is visualized as a hierarchical structure, providing a clear view of the relations between enriched GO terms.
GOrilla is an efficient GO analysis tool with unique features that make it a useful addition to the existing repertoire of GO enrichment tools. GOrilla's unique features and advantages over other threshold-free enrichment tools include rigorous statistics, fast running time and an effective graphical representation. GOrilla is publicly available at: http://cbl-gorilla.cs.technion.ac.il
The availability of functional genomics data has increased dramatically over the last decade, mostly due to the development of high-throughput microarray-based technologies such as expression profiling. Automatic mining of these data for meaningful biological signals requires systematic annotation of genomic elements at different levels. The Gene Ontology (GO) project is a collaborative effort aimed at providing a controlled vocabulary to describe gene product attributes in all organisms. GO consists of three hierarchically structured vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions. The building blocks of GO are terms, and the relationships between them are described by a directed acyclic graph (DAG). Each gene product may be annotated to one or more terms in each ontology.
Since its inception, many tools have been developed to explore, filter and search the GO database. A comprehensive list of available tools is provided at the Gene Ontology web site http://www.geneontology.org. One of the most common applications of the GO vocabulary is enrichment analysis – the identification of GO terms that are significantly overrepresented in a given set of genes. Enrichment may suggest possible functional characteristics of the given set. For example, enriched GO terms in a set of genes that are significantly over-expressed in a specific condition may suggest possible mechanisms of regulation that are put into play, or functional pathways that are activated in that condition.
A large repertoire of tools for enrichment analysis has been developed in recent years, including GoMiner, FatiGO, BiNGO, GOAT, DAVID and others. In general, these tools accept as input a target set of genes that is compared to a given background set of genes, or to a default "complete" background set. Some subset of GO terms from one or more of the three ontologies is scanned for enrichment in the target set relative to the background set, and terms for which significant enrichment is discovered are reported. The statistical test used for enrichment analysis is typically based on a hypergeometric or binomial model.
The most common form of output is a list of enriched terms. This simple approach allows the user to identify terms that are most significantly enriched but may lose substantial information regarding the relations between these terms. A more informative approach is to present the enrichment results in the context of the DAG structure of the respective ontology. In a typical case, the list of significantly enriched GO terms may include several related terms at varying significance levels. Identifying the clusters of enriched terms in the GO hierarchy becomes much simpler if the DAG structure is made available. A few tools visualize the results of enrichment analysis in the DAG structure, including the downloadable version of GoMiner, the Cytoscape plug-in BiNGO, GOLEM, GOEAST and GOTM. A particularly friendly and useful GO enrichment analysis tool is GO::TermFinder, which is provided at the Saccharomyces Genome Database (SGD). This tool provides a color-coded map of the enriched GO terms. It is, however, limited to analysis of S. cerevisiae genes and requires specifying an explicit target set.
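As a rough illustration of the hypergeometric model mentioned above, the following minimal sketch computes the upper-tail probability of seeing at least b annotated genes in a target set. The function name and the symbols N (genes in the background), B (background genes annotated with the term), n (target set size) and b (annotated genes in the target set) are assumptions chosen for this example; they are not taken from any particular tool.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """Return P(X >= b) for X ~ Hypergeometric(N, B, n).

    N: total number of genes, B: genes annotated with the GO term,
    n: size of the target set, b: annotated genes observed in the target set.
    """
    upper = min(n, B)
    # Count the target sets of size n that contain at least b annotated genes.
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers: 20 genes, 5 annotated, target set of 4 with 3 annotated.
p = hypergeom_tail(20, 5, 4, 3)
```

A small value of `p` indicates that the term is overrepresented in the target set relative to the background. A real analysis would additionally correct for the number of GO terms tested.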
In many practical cases, functional genomic information used as the input for the GO enrichment analysis may be naturally represented as a ranked list. For most applications, the requirement for an input target gene set forces the user to set some arbitrary threshold and define the target set as all genes with ranks above (or below) the threshold. For example, genes may be naturally scored and ranked according to their differential expression between two conditions. However, defining the specific set of genes that are differentially expressed requires setting an arbitrary threshold. Unfortunately, the results of the enrichment analysis may often depend on the specific threshold that is set. Tools that use the simple hypergeometric distribution require setting such a fixed threshold.
A few tools have been developed that use a threshold-free approach, including GSEA, FatiScan, GO-stat, GeneTrail and iGA. The widely used GSEA tool uses a statistic that is similar to Kolmogorov-Smirnov but assigns different weights to the occurrences of genes at different ranks in the list. The tool is not specifically aimed at GO enrichment, and therefore does not offer visualization in terms of the GO DAG structure. In addition, GSEA does not provide an exact p-value; it estimates the p-value using permutations. The p-values assigned by GSEA are therefore limited by the number of permutations performed. FatiScan is another threshold-free tool by the creators of FatiGO. It tests a number of thresholds determined by the user (the default is 30 thresholds) and then corrects for multiple testing using FDR. Again, this tool does not provide an exact p-value. The iGA method uses an iterative approach that circumvents the need for a fixed cutoff by computing the hypergeometric score at all possible cutoffs. iGA does not produce an exact p-value either. Goeman and Bühlmann study the advantages of sample re-sampling. The sample permutation approach is applicable to the analysis of differential gene expression data but not to other applications of gene set enrichment where the ranking is inferred otherwise.
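To make the limitation of permutation-based p-values concrete, here is a minimal sketch of a generic permutation test over a ranked 0/1 annotation vector. It is not GSEA's procedure: the function name, the add-one estimator and the toy statistic (annotated genes among the top five) are all assumptions made for illustration. The point is only that the smallest p-value such a test can report is 1/(n_perm + 1).

```python
import random


def permutation_pvalue(labels, statistic, n_perm=1000, seed=0):
    """Estimate P(statistic >= observed) by shuffling the annotation labels.

    labels: list of 0/1 flags in rank order (1 = gene annotated with the term).
    statistic: function mapping such a list to a number (larger = more enriched).
    """
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one estimator: the result can never be smaller than 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)


# Toy example: all 5 annotated genes sit at the very top of 100 ranked genes.
labels = [1] * 5 + [0] * 95
p = permutation_pvalue(labels, lambda lab: sum(lab[:5]), n_perm=999)
```

Even though the true probability of this configuration is far below 1/1000, the estimate cannot go below 1/1000 with 999 permutations. Reaching smaller p-values requires proportionally more permutations, which is the running-time cost that an exact computation avoids.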
As a matter of practicality it is also important for GO enrichment tools to perform in an interactive manner. Many of the existing tools, in particular those based on flexible thresholds that require time-consuming simulations, fall short on this desired property.
In this note we describe an interactive web-based software tool, called GOrilla, that enables GO enrichment analysis in ranked lists of genes. It is based on previous work, in which we describe a statistical framework, called mHG, for enrichment analysis in ranked lists. The method identifies, independently for each GO term, the threshold at which the most significant enrichment is obtained. The significance score is accurately and tightly corrected for threshold multiple testing without the need for time-consuming simulations. Consequently, GOrilla performs the enrichment analysis on thousands of genes and thousands of GO terms in a few seconds. In the Results section we demonstrate how GOrilla can capture relevant biological processes and visualize the results with an easy-to-use graphical representation of the GO hierarchy that emphasizes the enriched nodes.
In words, the mHG score is the optimal (smallest) hypergeometric tail (HGT) probability found over all possible partitions induced by the gene ranking. As such, this score must be corrected for multiple testing. In previous work we describe a dynamic programming algorithm for computing the exact p-value of a given mHG score. More specifically, given a ranked list of genes, a GO term associated with some of these genes and a corresponding mHG score s, the mHG p-value is the exact probability of observing an mHG score s' ≤ s under the null assumption that all GO term occurrence configurations in the ranked list are equiprobable. We describe various considerations for efficient implementation of the mHG p-value algorithm elsewhere.
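The verbal description above can be sketched in code. This is a minimal, unoptimized illustration under stated assumptions, not GOrilla's implementation: it assumes a list of N ranked genes of which B are annotated with the term, writes b_n for the number of annotated genes among the top n, and takes the score to be the minimum over n of HGT(b_n; N, B, n). The p-value routine counts, by dynamic programming, the label configurations whose prefixes never reach an HGT value as small as s. All function names are hypothetical, and exact fractions are used to avoid floating-point ties.

```python
from fractions import Fraction
from math import comb


def hgt(N, B, n, b):
    """Hypergeometric tail P(X >= b) for X ~ Hypergeometric(N, B, n), exactly."""
    hits = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1))
    return Fraction(hits, comb(N, n))


def mhg_score(labels):
    """Minimum HGT over all prefixes of a ranked list of 0/1 annotation flags."""
    N, B = len(labels), sum(labels)
    best, b = Fraction(1), 0
    for n, flag in enumerate(labels, start=1):
        b += flag
        best = min(best, hgt(N, B, n, b))
    return best


def mhg_pvalue(N, B, s):
    """Exact P(mHG <= s) when all C(N, B) placements of the B flags are equiprobable."""
    # safe[b]: number of prefixes with b annotated genes that never reached HGT <= s.
    safe = [0] * (B + 1)
    safe[0] = 1 if hgt(N, B, 0, 0) > s else 0
    for n in range(1, N + 1):
        new = [0] * (B + 1)
        for b in range(max(0, n - (N - B)), min(n, B) + 1):
            if hgt(N, B, n, b) > s:
                # Extend a safe prefix by an unannotated gene or by an annotated one.
                new[b] = safe[b] + (safe[b - 1] if b > 0 else 0)
        safe = new
    return 1 - Fraction(safe[B], comb(N, B))


# Toy ranked list: the two annotated genes occupy the top two of four ranks.
labels = [1, 1, 0, 0]
s = mhg_score(labels)
p = mhg_pvalue(len(labels), sum(labels), s)
```

In this toy case only one of the six possible configurations attains a score as small as the observed one, so the score and its corrected p-value coincide at 1/6. In general the p-value is larger than the raw score, since it accounts for the search over thresholds. The sketch recomputes each tail from scratch, which is far slower than the efficient implementation the text refers to.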
Description of the Tool
- Discovery of enriched GO terms at the top of a ranked list of genes using the mHG statistic (as explained in this paper).
- Discovery of enriched GO terms in a target set versus a background set using a hypergeometric model (as commonly done in other applications). In this case the ranking inside these lists is ignored.
Results and discussion
To test the performance of GOrilla we used the van't Veer et al. breast cancer dataset, a landmark study in the clinical use of gene expression data. This dataset consists of expression profiles of 14,565 genes measured on 77 breast cancer patients. The patients showed different prognostic characteristics: 44 with more than 5 years of survival versus 33 with less than 5 years of survival. All genes were ranked according to how well they differentiate between the two groups using a simple t-test, so that the top of the list contained the genes that best separate the two groups. The ranked list was given as input to GOrilla with default running parameters. The result is shown in Fig 2 and highlights a unique set of enriched GO terms that were identified at different cutoffs. The enriched GO terms include mitosis (p < 10^-12, top 717 genes), cell cycle (p < 10^-13, top 692 genes) and microtubule cytoskeleton organization and biogenesis (p < 10^-8, top 927 genes). These enriched GO terms are attributed to genes that were over-expressed in patients with poor prognosis and under-expressed in patients with good prognosis, which is in accordance with biological common sense and supports their relevance. The total analysis running time was less than 10 seconds.
A comparison to other web-based tools was performed on the same dataset using each tool's default parameters. Web-based tools have several advantages over standalone tools, and we therefore compare GOrilla only to other web-based tools. The flexible threshold tools FatiScan and GO-stat were given the ranked list as input, while the fixed threshold methods GOEAST and DAVID were given a target set containing all the genes with t-test p < 10^-3, which includes the top 124 genes, and a background set containing the rest of the genes. We note that the p < 10^-3 choice of fixed threshold for cutting the data, which is often used for this type of task, is inherently arbitrary.
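The ranking step described above can be sketched as follows. This is an illustration only: the text says a simple t-test was used, but the choice of Welch's unequal-variance statistic, the ordering by absolute value, the function names and the toy expression values are all assumptions of this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two samples (assumes non-zero sample variances)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def rank_genes(expr, group_a, group_b):
    """Order genes from best to worst separator of the two sample groups.

    expr: dict mapping gene name -> list of expression values, one per sample.
    group_a, group_b: lists of sample indices for the two groups.
    """
    scores = {
        gene: welch_t([values[i] for i in group_a], [values[i] for i in group_b])
        for gene, values in expr.items()
    }
    # Assumption: the best separators are those with the largest |t|.
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Hypothetical toy data: three genes measured on six samples (three per group).
expr = {
    "geneA": [5.0, 5.1, 4.9, 1.0, 1.1, 0.9],
    "geneB": [1.0, 2.0, 3.0, 1.5, 2.5, 2.0],
    "geneC": [2.0, 3.0, 4.0, 1.0, 2.0, 3.0],
}
ranking = rank_genes(expr, group_a=[0, 1, 2], group_b=[3, 4, 5])
```

The resulting list of gene names, best separator first, is the kind of ranked input that a threshold-free enrichment analysis consumes. Ranking by the signed statistic instead would place genes over-expressed in one group at the top and those over-expressed in the other at the bottom.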
A comparison of web-based GO enrichment tools.
P-value and statistical method
- Exact mHG p-value computation (no need for simulations)
- Fisher exact (FDR corrected for number of thresholds; predetermined steps of 30)
- Wilcoxon rank-sum/Kolmogorov-Smirnov
- Modified Fisher exact
- (only in the downloadable version)
Most other approaches to GO enrichment analysis assume that a target set and a background set are given or evident from the process. It is often the case, however, that the genomic data subjected to enrichment analysis are naturally represented as a ranked list of elements. GOrilla uses a statistical model that supports the discovery of GO terms that are enriched at the top of a ranked list, enabling a threshold to be determined in a data-driven manner. It also provides an exact p-value for the observed event, which is not the case for most other available flexible threshold tools.
GOrilla provides a simple and informative graphical representation of the significantly enriched terms, in the context of the complete DAG representation of the ontology used in the analysis. This graphical representation is color-coded based on the p-value attained for each GO term.
The application is highly interactive, with a running time of a few seconds per analysis. This is achieved by using an efficient algorithm for computing the exact mHG p-value, which circumvents the need for simulations, and an efficient software implementation.
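One way to picture a p-value color-coded DAG is to emit a Graphviz DOT description in which each term's fill color depends on its p-value. The sketch below is a hypothetical illustration, not GOrilla's rendering code: the cutoffs, the color names, the function names and the example edge and p-values are all assumptions.

```python
# Hypothetical color scale: smaller p-values map to stronger colors.
COLOR_SCALE = [(1e-9, "red"), (1e-7, "orange"), (1e-5, "gold"), (1e-3, "lightyellow")]


def color_for(p):
    """Return the fill color for a p-value, or white if it is not significant."""
    for cutoff, color in COLOR_SCALE:
        if p <= cutoff:
            return color
    return "white"


def to_dot(edges, pvalues):
    """Build DOT text for a GO sub-graph.

    edges: list of (parent term, child term) pairs.
    pvalues: dict mapping term name -> enrichment p-value.
    """
    lines = ["digraph GO {", "  node [shape=box, style=filled, fillcolor=white];"]
    for term in sorted(pvalues):
        lines.append(f'  "{term}" [fillcolor="{color_for(pvalues[term])}"];')
    for parent, child in edges:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Illustrative input only; the edge and the p-values are made up for the example.
dot_text = to_dot(
    edges=[("biological process", "cell cycle"), ("cell cycle", "mitosis")],
    pvalues={"cell cycle": 1e-13, "mitosis": 1e-12},
)
```

The returned text can be saved to a file and rendered with the Graphviz `dot` command. Terms that appear only in an edge, without a p-value, keep the default white fill, so enriched terms stand out against their unenriched ancestors.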
Availability and requirements
Project name: GOrilla
Project home page: http://cbl-gorilla.cs.technion.ac.il
Operating system(s): (Platform independent) web-based application
Programming language: Java
EE, RN, IS and ZY were partially supported by a European Union FP6 grant, under the framework of the MultiKnowledge Project. EE is partially supported by the Yeshaya Horowitz Association through the Center for Complexity Science. We acknowledge constructive comments by BMC's anonymous reviewers. We thank Amir Ben-Dor, Anya Tsalenko and Diego Ardigo for useful discussions.
- Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, Harris M, Hill D, Issel-Tarver L, Kasarskis A, Lewis S, Matese J, Richardson J, Ringwald M, Rubin G, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–9. 10.1038/75556
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587–3595. 10.1093/bioinformatics/bti565
- Zeeberg B, Feng W, Wang G, Wang M, Fojo A, Sunshine M, Narasimhan S, Kane D, Reinhold W, Lababidi S, Bussey K, Riss J, Barrett J, Weinstein J: GoMiner: A Resource for Biological Interpretation of Genomic and Proteomic Data. Genome Biology 2003, 4(4):R28. 10.1186/gb-2003-4-4-r28
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20(4):578–80. 10.1093/bioinformatics/btg455
- Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448–9. 10.1093/bioinformatics/bti551
- Xu Q, Shaulsky G: GOAT: An R Tool for Analysing Gene Ontology Term Enrichment. Applied Bioinformatics 2005, 4(4):281–283. 10.2165/00822942-200504040-00008
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. 10.1186/gb-2003-4-5-p3
- Sealfon R, Hibbs M, Huttenhower C, Myers C, Troyanskaya O: GOLEM: an interactive graph-based gene-ontology navigation and analysis tool. BMC Bioinformatics 2006, 7: 443. 10.1186/1471-2105-7-443
- Zheng Q, Wang XJ: GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 2008, 1464–1465.
- Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 2004, 5: 16. 10.1186/1471-2105-5-16
- Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder-open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 2004, 20(18):3710–3715. 10.1093/bioinformatics/bth456
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
- Al-Shahrour F, Arbiza L, Dopazo H, Huerta-Cepas J, Minguez P, Montaner D, Dopazo J: From genes to functional classes in the study of biological systems. BMC Bioinformatics 2007, 8: 114. 10.1186/1471-2105-8-114
- Beissbarth T, Speed TP: GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088
- Backes C, Keller A, Kuentzer J, Kneissl B, Comtesse N, Elnakady YA, Muller R, Meese E: GeneTrail – advanced gene set enrichment analysis. Nucleic Acids Research 2007, 35: 186–192. 10.1093/nar/gkm323
- Breitling R, Amtmann A, Herzyk P: Iterative Group Analysis (iGA): A simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinformatics 2004, 5: 34. 10.1186/1471-2105-5-34
- Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007.
- Eden E, Lipson D, Yogev S, Yakhini Z: Discovering Motifs in Ranked Lists of DNA Sequences. PLoS Comp Bio 2007, 3(3):e39. 10.1371/journal.pcbi.0030039
- Eden E: Discovering Motifs in Ranked Lists of DNA Sequences. Master's thesis [http://bioinfo.cs.technion.ac.il/people/zohar/thesis/eran.pdf]
- Gansner E, North S: An open graph visualization system and its applications to software engineering. [http://www.graphviz.org/] Software – Practice and Experience 2000, 30(11):1203–1233. 10.1002/1097-024X(200009)30:11<1203::AID-SPE338>3.0.CO;2-N
- van't Veer LJ, Dai H, Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415: 530–536. 10.1038/415530a
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.