GObar: A Gene Ontology based analysis and visualization tool for gene sets
© Lee et al; licensee BioMed Central Ltd. 2005
Received: 18 April 2005
Accepted: 25 July 2005
Published: 25 July 2005
Microarray experiments, as well as other genomic analyses, often result in large gene sets containing up to several hundred genes. The biological significance of such sets of genes is, usually, not readily apparent.
Identification of the functions of the genes in the set can help highlight features of interest. The Gene Ontology Consortium  has annotated genes in several model organisms using a controlled vocabulary of terms and placed the terms on a Gene Ontology (GO), which comprises three disjoint hierarchies for Molecular functions, Biological processes and Cellular locations. The annotations can be used to identify functions that are enriched in the set, but this analysis can be misleading since the underlying distribution of genes among various functions is not uniform. For example, a large number of genes in a set might be kinases just because the genome contains many kinases.
We use the Gene Ontology hierarchy and the annotations to pick significant functions and pathways by comparing the distribution of functions in a given gene list against the distribution of all the genes in the genome, using the hypergeometric distribution to assign probabilities. GObar is a web-based visualizer that implements this algorithm.
The public website for GObar  can analyse gene lists from the yeast (S. cervisiae), fly (D. Melanogaster), mouse (M. musculus) and human (H. sapiens) genomes. It also allows visualization of the GO tree, as well as placement of a single gene on the GO hierarchy. We analyse a gene list from a genomic study of pre-mRNA splicing to demonstrate the utility of GObar.
Large scale genomic studies, especially expression microarrays, routinely identify large genes sets (sometimes a few hundred or more) of interest. Researchers are faced with the problem of identifying interesting features in such datasets. A listing of gene annotations (e.g. functions, process) can help identify interesting features, but this is impractical with large sets, due to the labor involved and the difficulty in picking statistically significant trends from large datasets. Thus, a user-friendly method is required for the routine analysis of such datasets.
The Gene Ontology consortium  curates the annotations of genes of several model organisms using a controlled vocabulary of GO terms. It also places the GO terms on a hierarchy called the Gene Ontology(GO). There are separate hierarchies for Molecular Functions, Cellular Components and Biological Processes. The terms on the hierarchy share a parent-child relationship in which a child term is either a specific instance or a part of its parent term. The terms get more specific the lower they are on the hierarchy . Each node(GO term) on the hierarchy can have multiple parents and children.
In Figure 1, each node shows the number of genes annotated in the D. Melanogaster genome associated with the term. There are a total of approximately 14,000 genes in the genome. If a hypothetical microarray experiment in D. Melanogaster picks out 100 significant genes, of which 10 are double-stranded RNA (dsRNA) binding genes, then it is intuitively obvious that dsRNA binding is affected by the experiment and pathways using this function might be responding to the experiment. GObar uses the hypergeometric distribution (explained below) to quantify this intuition.
The basic idea of the algorithm is to compare the distribution of the genes from a set on the GO tree against the distribution of all the genes of the genome on the GO tree, identify and highlight branches that are improbably enriched by chance alone, and render an image of the GO tree that will allow user interactions to further explore the data.
The GO tree is first populated with all the genes in the genome, which involves placing genes at all the nodes that describe the gene. This operation is carried out only once, and is redone each time the genomic data gets updated. At each node two sets of counts are maintained, the contribution of genes from nodes that are children of the current node (distributed count) and the contribution from the genes at the current node (bare count). The total count at each node (bare count + distributed count), is divided equally amongst the distributed counts of each parent (please note that each node can have multiple parents). To calculate the counts at each node the leaves (nodes with no children) are first considered, and the counts are progressively transmitted up in a level-by-level manner, until the root is reached.
For the analysis the GO tree is populated with the list of genes to be analysed. Once again a set of distributed and bare counts is calculated at each node for this list. The distribution of these counts is compared to the genomic set and significance is assigned to the deviations from the expected counts, using the hypergeometric distribution, which is now described. If there is a collection of N objects of three types, A (count = a), B (count = b), and C (count = c) so that, N = a + b + c, then, a random selection of n objects from these N objects will contain α A objects, β B objects and γ C objects (where α + β + γ = n) with a probability given by the hypergeometric distribution
This equation can be generalized to arbitrary collections of objects. Using this probability distribution, highly improbable deviations from the expected numbers are highlighted under the assumption that they are likely to have a biological significance.
GO data can be downloaded from the Gene Ontology website . The data contains two sets of information that are used, the parent-child relationships for each node and the definitions of each node or term. The data collection and analysis techniques are described in this section.
The downloaded GO data is used to populate one table with GO IDs and the ID definitions, and another table with a description of relationships between the GO IDs, which can use terms such as is_a or part_of to define the relationships.
In order to associate GO terms with gene IDs (accession), the files gene2go and gene2accession were retrieved from Entrez Gene  for the human and mouse genomes. A similar dataset for D. Melanogaster is acquired from Flybase . Each gene can have multiple GO annotations, so this is a many-to-many association table.
Columns in the GO statistics database table.
GO node statistics
num of trails up
Level (depth in a tree): A recursive depth-first search in a bottom-up fashion is carried out to determine the level of GO terms associated with the experiment, as explained in Figure 5.
Number of trails up: This is obtained from the table of GO ID relationships by counting the number of parents for a node.
The following two items are calculated once in the beginning for all the genes in the genome and for each analysis of a gene-list.
BC: Bare count (BC) is a number of genes associated with each GO term (node).
DC: Starting from the lowest node(s) in the tree (determined by the Level column), the total count, BC + DC is propagated to the node's immediate parent. If a node has more than one parent then total count is divided by the number of trails up, which is the same as the number of parents.
Populating the tree with the reference dataset
We have populated the GO tree with datasets from Entrez Gene for human and mouse data , Flybase  for fly data and SGD  for yeast data. In the case of Entrez Gene, two sets of maps exist, a gene id to GO map and a gene to gene id map. At the end of this process each GO node gets a list of genes. The term bare counts denotes the counts of genes at each node. The genes on children nodes also contribute to the counts on any given node, which are tracked separately and called distributed counts. Thus, the distributed count of a node is the sum of contributions of the nodes below it in the gene ontology hierarchy. Each node contributes the sum of its bare count and its distributed count equally to the distributed counts of each of its parents. This process can be recursively applied, starting from the lowest levels (or greatest depths) of the tree and working the way up the tree.
If the accounting of distributed counts is to be done properly, defining the depth of each node in the tree is important. The rule for assigning depth to each node is that, if a node gets multiple levels, then the highest depth is always assigned to it. This can be done by picking the leaves of the tree (nodes with no children) and travelling recursively all the way up to the root (node with no parents). For each path to the top, depth is assigned to each node based on the number of steps to the node from the root. If a node already has a depth assigned to it, then the depth is replaced with the current depth only if it is bigger. This is explained in Figure 5. Once the leaves have been exhausted, all the nodes in the tree will have depths assigned to them.
In order to calculate the distributed counts for each node, the list of nodes is ordered based on their depths. Starting from nodes with the highest depths the counts are propagated up, as described above, summing up the bare count and distributed count and partitioning the sum equally amongst all the parents. After exhausting the list of nodes, all the nodes should have a bare count and a distributed count assigned to them.
Populating the tree with the experimental dataset
In order to calculate probabilities for a given experimental dataset, we need to first populate the GO tree with the experimental dataset. A procedure identical to the one used in the previous section is implemented, resulting in a GO tree with just the dataset of interest on it.
Calculating the probabilities
Let BC i , DC i be the bare and distributed counts respectively at node i for the genomic dataset and let bc i , dc i be the bare and distributed counts respectively at node i for the experimental dataset. Then, for the Node 0 in Figure 8.
DC0 = (BC1 + DC1) + (BC2 + DC2) + (BC3 + DC3)/2 (3)
dc0 = (bc1 + dc1) + (bc2 + dc2) + (bc3 + dc3)/2 (4)
The following are defined for ease of notation:
N1 = (BC1 + DC1) (5)
N2 = (BC2 + DC2) (6)
N3 = (BC3 + DC3)/2 (7)
N0 = DC0 = N1 + N2 + N3 (8)
n1 = (bc1 + dc1) (9)
n2 = (bc2 + dc2) (10)
n3 = (bc3 + dc3)/2 (11)
n0 = dc0 = n1 + n2 + n3 (12)
Then, the probability that a dataset is a random selection from the genes in the genome is given by the Hypergeometric formula (explained above)
The expected value for n1 is given by
We define PD as the deviation of the counts on a node i from its expected number and is given by
We use P and PD to prune the trees, as described in the next section.
Pruning the tree
Listing all the nodes of the GO tree for a given dataset is not very informative, especially if only a few nodes are populated or if a large number of GO terms are populated by a small number of genes. This also defeats the purpose of helping users narrow down the GO terms of interest.
if n0 <n c , stop traversing the tree, that is, do not show anything below such a node. The population cutoff, n c can be set by the level of details option on the GObar webpage at step 4, shown in figure 3. This determines how low the population of genes in a node can go before it gets pruned. Less Detailed corresponds to a minimum of 6 genes, Detailed corresponds to a minimum of 3 genes and Very Detailed shows every node.
Prune nodes that have P > P th . The threshold P th is arbitrarily set at 0.1.
if n i deviates significantly up from <n i >, then the path is hightlighted using red color. PD can be set using the deviation stringency option in step 4 on the webpage, shown in figure 3. Less strict corresponds to a deviation cutoff value of 0.2, Strict corresponds to a cutoff of 0.5 and very strict corresponds to a cutoff value of 0.8.
The pruning is done starting with leaves (nodes with no children) on the tree, and stops when it reaches a node that should not be pruned according to the rules above. Fine variations of the pruning conditions are not allowed as these do not offer useful biological information and make the tool difficult to use.
Visualizing the tree and user-interaction
The tool also allows the downloading of all the genes in the GO tree below any node. The downloaded list is in the form of a comma separated valued (csv) file, which contains the gene, the GO terms for each gene and a short description.
In the downloaded list, the uploaded genes are highlighted, since the list will also contain genes that belong to the nodes but are not in the uploaded list.
Use of the tool
The numbers on the lines between the nodes are the deviation from expected counts (whose calculation is described above). A red line implies that the child node connected to the path in the GO hierarchy is over-represented in the uploaded dataset and the corresponding GO term might be biologically significant.
If a single gene (or a few genes) needs to be placed on the GO tree, then the tree should not be pruned, which is achieved by selecting the "No" option in step 4 on the front page. Pruning is the removal of branches of the GO tree which do not carry much information (either a small number of genes or results that can be explained by random selection). The result of placing Dicer-1 (FBgn0039016) in D. Melanogaster is shown in Figure 2.
We will apply GObar to the analysis of results from a genomic study of splicing . Splicing is the excision of introns from the pre-mRNA after transcription [9, 10]. Broadly, there are two classes of splicing machinery (spliceosomes), U2-dependent and U12-dependent, named on the basis of the snRNPs involved in the splicing reaction. Each of these spliceosomes is involved in the excision of two sub-classes of introns, defined by the consensus sequences of the 5' and 3' ends of the intron, GT-AG-U2, GC-AG-U2, GT-AG-U12 and AT-AC-U12 type splice sites. The number of U12-dependent splice sites in the genome is dwarfed by the number of U2-dependent splice sites, numbering less than 2% of the total . The snRNPs comprising the U12-dependent splicing machinery are also relatively less abundant. The D. Melanogaster, mouse and human genomes have U12-dependent splicing, while C. elegans seems to have lost it. This conservation leads to speculation regarding the reason for the persistence of the U12-dependent splice sites over evolutionary time scales.
Significant GO terms in the human AT-AC-U12 set, which accompanies the analysis shown in Figure 5. We show only the more specific terms in the table that appears as a pop-up webpage along with the GO tree shown in Figure 5. The level is the depth from the root, and its calculation is described in Figure 4.
integral to endoplasmic reticulum membrane
voltage-gated sodium channel complex
voltage-gated calcium channel complex
transcription elongation factor complex
transcription factor complex
low voltage-gated calcium channel activity
voltage-gated sodium channel activity
voltage-gated calcium channel activity
calcium channel activity
sodium channel activity
sodium ion transport
calcium ion transport
actin filament-based process
In the molecular function branch, analysis of the mouse set highlighted GTPase regulator activity, cation channel activity and calcium channel activity. The human set also highlighted sodium channel activity and voltage-gated calcium channel activity. In the biological process branch, the analysis highlighted intracellular signalling cascade and actin-filament based process. The Drosophila set was too small (7 genes), to give any detailed statistics, but GObar did highlight transporter activity, which is a parent of the cation channel activity. The highlights that are present in both human and mouse genomes could point to reasons for the persistence of AT-AC U12-dependent splice sites over evolutionary time-scales. We believe this might have some basis in the biological control of the rates of splicing reactions of these genes but reaching a firm conclusion requires an investigation that is beyond the scope of this study.
Our proposed method of analysis is mathematically robust and allows visualization and identification of pathways. We can identify sub-groups of genes that cannot be explained by chance alone. This in turn can identify pathways that are of interest in the process under study. Identification of the pathways then allows study of other genes in the pathway that are not picked up in the experiment, allowing for a clearer understanding of subtle effects and quantifying the errors in the experiment.
The conclusions we reach using our method of analysis does depend on the accuracy of the gene annotations. Thus, if the role of a gene in a pathway were unknown, or if a small set of genes could have a strong phenotypic effect (without triggering major changes in the mRNA levels of a large number of genes, which is the only quantity measured in a microarray experiment), then GObar will mislead the investigator. Such phenomena are, in general, more difficult to study using microarrays and require supplementary biological assays to uncover the underlying mechanisms.
Using the GO tree allows us to ameliorate some of the problems inherent in gene annotations, so that genes below the term of interest are also counted as part of the function being studied. Our method involves a one-time analysis of the whole genome dataset, which then allows us to decide, in a straightforward manner, the significance of any number of datasets and allows easy navigation and analysis of the data. The convenience and robustness of the method are the novel contributions here.
Another rigorous approach identifies biological themes from gene lists using GO, by calculating the over-representation of categories in the experimental list relative to a background list (all genes on the chip or all genes in the genome). A problem with this is the underlying assumption that the GO annotations at each node are accurate. In our approach, in order to allow for the fact that genes may be placed at different depths due to human biases, we let every gene below a particular node contribute to the counts on that node, but done in a way that prevents multiple counting. This also allows genes with more specific functional annotations to contribute to the more general annotation. Thus, our approach also differs from this one in the way we define hits, by allowing genes that are lower down in the tree to be a part of the node under consideration.
We offer a detailed comparison with GOstat  available at its website . GOstat returns lists of significantly enriched GO terms and genes in those GO terms. In order to see the meaning of the term and its placement in the GO tree the user can click on the link to go to AMIGO . The results are comparable, though we do not offer a P value for the significant nodes. They also offer a list of GO annotations for each gene in the uploaded list. A list of terms is not human friendly and is not a natural method of presenting data that has a tree-like structure. We feel that our SVG based view of the tree is intuitively easier to use and also allows for a quick overview of the data, allowing the user to zoom into relevant sections. Also, our listing of the functions in order of their level in the GO tree allows a user to pay attention to more nodes based on levels (more general corresponds to lower level, while higher levels correspond to more specific terms). The goal of the study determines what level of specificity for a node is preferred.
In addition, the GO browser offered on the GObar website allows an SVG-based exploration of the GO tree, simultaneously showing all the branches and relationships between them, which is different from the text based version offered by the AMIGO website . GObar also allows viewing all the GO annotations of a gene in a single view, as shown in figure 2, which is also a novel feature of this program.
Vladimir Grubor, Michele Hastings, Xavier Roca and Susan Janicki gave many detailed suggestions and comments. Andrew Olson implemented the GO term browser available on the GObar website. Cat Eberstark put tremendous effort into improving the figures. The anonymous reviewers helped improve the paper with several suggestions. The Cancer Center of CSHL funded this work and several members of the cancer center provided encouragement.
- Gene Ontology Consortium website[http://www.geneontology.org]
- GObar website[http://katahdin.cshl.org:9331/GO]
- Smith B, Williams J, Schulze-Kremer S: The Ontology of Gene Ontology. Proceedings of AMIA Symposium 2003. [http://ontology.buffalo.edu/medo/Gene_Ontology.pdf]
- Gene E: Entrez Gene: unified query environment for genes.[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene]
- FlyBase: FlyBase: A database of the drosophila genome.[http://www.flybase.org/]
- SGD: Saccharomyces Genome Database.[http://www.yeastgenome.org/]
- Graphviz website[http://www.graphviz.org/]
- Sachidanandam R: Curated collection of splice sites in various genomes. Unpublished
- Burge CB, Padgett RA, Sharp PA: Evolutionary fates and origins of U12-type introns. Mol Cell 1998, 2(6):773–85. 10.1016/S1097-2765(00)80292-0View ArticlePubMed
- Wu Q, Krainer AR: AT-AC Pre-mRNA Splicing Mechanisms and Conservation of Minor Introns in Voltage-Gated Ion Channel Genes. Molecular and Cellular Biology 1999, 19(5):3225–3236.PubMed CentralPubMed
- Beissbarth T, Speed TP: GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088View ArticlePubMed
- Beissbarth T, Speed TP: GOstat Website.[http://gostat.wehi.edu.au/]
- The GO consortium: Amigo website.[http://www.godatabase.org/cgi-bin/amigo/go.cgi]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.