Comparing gene annotation enrichment tools for functional modeling of agricultural microarray data
© Berg et al; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
The widespread availability of microarray technology has driven functional genomics to the forefront as scientists seek to draw meaningful biological conclusions from their microarray results. Gene annotation enrichment analysis is a functional analysis technique that has gained widespread attention and for which many tools have been developed. Unfortunately, most of these tools have limited support for agricultural species. Here, we evaluate and compare four publicly available computational tools (Onto-Express, EasyGO, GOstat, and DAVID) that support analysis of gene expression datasets in agricultural species. We use AgBase as the functional annotation reference for agricultural species. The selected tools were evaluated based on i) available features, usage and accessibility, ii) implemented statistical computational methods, and iii) annotation and enrichment performance analysis. Annotation was assessed using a randomly selected test gene annotation set and an experimental differentially expressed gene-set – both from chicken. The experimental set was also used to evaluate identification of enriched functional groups.
Comparison of the tools shows that they produce different sets of annotations for the two datasets and different functional groups for the experimental dataset. While DAVID, GOstat and Onto-Express annotate comparable numbers of genes, DAVID provides by far the most annotations per gene. However, many of DAVID's annotations appear to be redundant or are at very high levels in the GO hierarchy. The GOSlim distribution of annotations shows that GOstat, Onto-Express and EasyGO provide GO distributions similar to those found in AgBase, while annotations from DAVID show a different GOSlim distribution, again probably due to duplication and many non-specific terms. No consistent trends were found in the results of GO term over- and under-representation analysis applied to the experimental data using different tools. While GOstat, DAVID and Onto-Express could retrieve some significantly enriched terms, EasyGO did not show any significantly enriched terms. There was little agreement about the enriched terms identified by the tools.
Different tools for functionally annotating gene sets and identifying significantly enriched GO categories differ widely in their results when applied to a test annotation gene set and an experimental dataset from chicken. These results emphasize the need for care when interpreting the results of such analysis and the lack of standardization of approaches.
Systems biology research aims to characterize cellular networks and mechanisms by integrating high-throughput "-omics" data from genomics, proteomics, transcriptomics, and metabolomics experiments. These massive datasets cannot be managed, analyzed and interpreted manually. Researchers have therefore developed a wide array of computational tools over the last decade to help derive biological value from the generated data [1, 2]. Gene annotation enrichment analysis is a widely used approach in which the over- or under-representation of Gene Ontology (GO) terms in a set of genes is determined statistically. Available tools perform a number of similar functions, and each also presents its own unique features. However, the majority of currently available computational tools target well-studied model organisms such as human, mouse, rat and Arabidopsis. Very few publicly available computational tools include equally important but less studied organisms such as agricultural species. In addition, most tools are only compatible with popular commercial arrays (e.g. Affymetrix and Agilent), while other valuable, widely used custom arrays are disregarded. This multitude of available tools makes it difficult for the researcher to choose the right tool for the job. Recently, an extensive comparison and summary of 68 gene annotation enrichment analysis tools was published, categorizing tools into three classes based on their underlying algorithms. This comparison provides the user with a clear overview of the current availability of, and differences between, a multitude of gene annotation enrichment analysis tools. However, the summary does not provide a side-by-side performance comparison of the tools when applied to biological datasets. The tool features and underlying algorithm(s) do not necessarily reflect the value and functionality of a tool.
Our goal is to use an empirical evaluation to provide insight into the obstacles and issues encountered in analysis of gene annotation enrichment, especially when using data generated from agricultural species.
Here, we evaluate and compare four gene annotation enrichment analysis tools: Onto-Express, EasyGO, GOstat, and DAVID. All are categorized by Huang et al. as Class 1 singular enrichment analysis (SEA) tools. Although Huang et al. describe 44 available Class 1 SEA tools, we selected only the tools that directly support chicken gene input for this study. Gene Set Enrichment Analysis (GSEA) tools such as GenePattern were not selected because they do not directly support chicken gene identifiers. In addition to the four selected SEA tools, the AgBase database is used as a baseline for functional annotation of agricultural species. Since Gene Ontology (GO) annotation is the de facto method for functional annotation, we have chosen tools that primarily use GO as their annotation resource in gene annotation enrichment analysis, although some of the tools also have other biological databases integrated (e.g. KEGG, REACTOME). The standard vocabulary provided by GO allows easy comparison of the results produced by different tools.
Table 1. Standard set of tool parameters
Parameters listed: maximum p value; maximum GO depth; false discovery correction; statistical test (Onto-Express and EasyGO: hypergeometric; GOstat and DAVID: Fisher's exact).
Results and discussion
Data set generation
Table: Identifier mapping for experimental data set (FHCRC whole array and FHCRC differentially expressed set, by Entrez Gene ID and UniProtKB accession no.)
Tool feature evaluation
Computational tools are often designed to accomplish a specific goal and then expanded with additional features. Changing statistical methods and researcher needs, combined with the continual generation of new data, make maintenance and regular updating of existing tools essential. We compared feature similarities and differences for the selected tools (Additional file 2). Huang et al. have previously summarized the features, underlying statistical methods and annotation visualization methods of a wide range of tools, but described the annotation databases and species compatibility of only a few tools. For the tools used in this comparison, we present an expanded discussion of species compatibility and databases used, and also discuss several other practical features influencing the usability of the tools.
The core of each tool is its underlying database. Several tools have multiple bio-databases implemented for information retrieval. All the tools support GO modelling, while DAVID and Onto-Express also incorporate other bio-databases (e.g. KEGG, REACTOME). As mentioned earlier, maintenance and updating are essential for a tool, especially for its underlying database(s). We found that database update intervals for the evaluated tools range from weekly to annually. Having compared the update schedules of several major repository databases (RefSeq, GenBank, UniProtKB/SwissProt, GOA, IPI), we suggest that a scheduled monthly database update is the minimum needed to provide the researcher with the latest annotation information. The ability to upload custom annotations into the gene annotation enrichment analysis or the database provides a short-cut to overcome outdated or incomplete annotation information. The tools evaluated here offer either direct custom annotation upload or upload upon request.
Adequate user-support for a tool is essential to enable users to access its full range of tool capabilities and to use the tool efficiently and effectively. All of the tools we evaluated provide a description of the tool, a user's manual, and sometimes additional educational resources. DAVID provides a helpful wizard-style guide through the analysis, which makes the upload and analysis of datasets simple and rapid.
Storing results on the tool's server lets researchers return to previous results rapidly without having to re-analyze entire datasets. EasyGO provides a session ID valid for two weeks to retrieve results, whereas GOstat provides a session ID valid for 24 hours, but also provides an offline result-viewer for researchers to download. DAVID and Onto-Express do not provide data storage.
The annotation evidence code describes the type of evidence used to assign a GO annotation to a gene product (e.g. inferred from direct assay, inferred from genetic interaction or inferred by electronic annotation) and is a reflection of the strength of the evidence supporting the annotation. Recently, a method for evidence code-based Gene Annotation Quality (GAQ) analysis was published. This method calculates a GAQ score that allows researchers to quantitatively assess the quality of the functional annotations assigned to their data set and is currently available upon request at the AgBase database. AgBase is the only annotation resource in this study that provides the annotation evidence code directly in the annotation result export and thus supports GAQ score calculation.
All tools provide researchers the option of using a default or a custom uploaded background gene dataset for gene annotation enrichment analysis. This allows researchers to calculate the true statistical enrichment significance when using microarray data. In microarray analysis, the number of genes that one is able to detect is limited to what is on the slide. When using the entire genome as background, the statistically significant enrichment is biased since more genes are considered than actually can experimentally be detected. Uploading a custom background (i.e. all genes on the microarray) allows the researcher to eliminate this statistical bias.
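The effect of the background choice can be illustrated with a small hypergeometric calculation. The following is a minimal sketch only: all counts (number of annotated genes on the array, genome size, number of genes carrying the term) are hypothetical values chosen for illustration, and the function name is ours rather than part of any evaluated tool.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N
    background genes, K of which are annotated to the GO term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


# Hypothetical study set: 31 differentially expressed genes, 5 carry the term.
study_size, study_hits = 31, 5

# Assumed array background: 8,000 detectable genes, 300 carry the term.
p_array = hypergeom_upper_tail(study_hits, study_size, 300, 8000)

# Assumed genome background: 17,000 genes, 320 carry the term.
p_genome = hypergeom_upper_tail(study_hits, study_size, 320, 17000)

print(f"array background:  p = {p_array:.4g}")
print(f"genome background: p = {p_genome:.4g}")
```

With these assumed counts the term is rarer in the genome-wide background than on the array, so the same five hits appear more significant against the genome than against the genes that could actually be detected.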
DAVID is the only tool in this study that presents only over-represented functional terms. This has the potential to bias the biological conclusion, since under-represented terms also provide valuable information for understanding the biological processes at work. For example, when comparing control and disease datasets, the lack of expression of a certain gene or functional category may be a signature for the disease.
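Under-representation can be tested with the lower tail of the same distribution that is used for over-representation. The sketch below assumes a hypergeometric model and uses hypothetical counts; it is not taken from any of the evaluated tools.

```python
from math import comb


def hypergeom_pmf(i, n, K, N):
    """Probability of exactly i annotated genes in a study set of n genes."""
    return comb(K, i) * comb(N - K, n - i) / comb(N, n)


def over_and_under(k, n, K, N):
    """Return (P(X >= k), P(X <= k)) for k observed annotated genes."""
    upper = sum(hypergeom_pmf(i, n, K, N) for i in range(k, min(n, K) + 1))
    lower = sum(hypergeom_pmf(i, n, K, N) for i in range(0, k + 1))
    return upper, lower


# Hypothetical: the term is carried by 1,500 of 8,000 background genes,
# but by none of the 31 study genes.
p_over, p_under = over_and_under(0, 31, 1500, 8000)
print(f"over-representation p = {p_over:.4g}, under-representation p = {p_under:.4g}")
```

A tool that reports only the upper tail would list this term as unremarkable, although its absence from the study set is unlikely under the assumed background.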
Implemented statistical methods for determining GO term enrichment
Table 3. Statistical tests implemented in evaluated tools
Table: Multiple testing correction methods implemented in evaluated tools
Another point of interest on which most biologists concur is that the arbitrary selection of a statistical significance "cut-off" will often result in a loss of legitimate biological information. Therefore, researchers need to remember that these computational tools are intended to be evaluative and not definitive to the biology. They provide a starting place for hypothesis generation and testing.
GO annotation modelling
AgBase provides researchers with highly curated GO annotations for agricultural species to be used for downstream modelling. The AgBase biocurators provide a preponderance of the chicken GO annotations in the Gene Ontology Annotation project at EBI. Therefore, AgBase is used as a baseline reference for the retrieval of GO terms.
Test gene annotation set
[Table/figure: number of genes input; panel A, test gene annotation set; panel B, experimental chicken gene set]
AgBase retrieves more annotations for the test set than do EasyGO, GOstat and Onto-Express. This could be explained by the manually curated protein annotations that AgBase includes in its database. These curated annotations have been submitted to UniProtKB/Swiss-Prot and are awaiting inclusion in the UniProtKB database.
Gene annotation enrichment analysis
Experimental gene set GO-based modeling
Overall, these results show that analyzing a single dataset with multiple tools can lead to different biological conclusions. Researchers need to keep their overall research goal in mind when validating the retrieved annotations, and should treat the resulting conclusions as evaluative rather than conclusive.
Gene annotation enrichment performance
Each evaluated tool is designed to perform functional enrichment analysis on a gene set. While there are multiple accepted statistical methods available, each has its limitations. As described previously [1, 2, 23], researchers need to decide which methods would be most appropriate for their research model. A comparison of functional enrichment analysis results generated by the evaluated tools provides insight into the performance of each tool. We used the Experimental Set with each tool to generate functional enrichment results. Because no single statistical test was implemented by all tools (see Table 3), we chose to use statistical tests implemented by at least two tools. Therefore, we compared Onto-Express with EasyGO, because they both implement a hypergeometric statistical method, and DAVID with GOstat, because they both provide a Fisher's exact test. DAVID uses a modified Fisher's exact test, called EASE, so comparison with GOstat is not conclusive.
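The relationship between these tests can be sketched in a few lines. The one-sided Fisher's exact p-value for a 2x2 table is the upper tail of a hypergeometric distribution, and an EASE-style score is commonly described as the same test computed after removing one gene from the study hits, which makes it more conservative. The code below is an illustrative sketch with hypothetical counts; the function names are ours, and the exact table construction used inside DAVID may differ from this simplification.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): k hits in a study set of n genes, where K of the
    N background genes carry the term."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    a: study genes with the term      b: study genes without the term
    c: remaining genes with the term  d: remaining genes without the term
    """
    return hypergeom_upper_tail(a, a + b, a + c, a + b + c + d)


def ease_like(a, b, c, d):
    """EASE-style score (simplified assumption): the same test after
    removing one gene from the study hits in cell a."""
    return 1.0 if a < 1 else fisher_one_sided(a - 1, b, c, d)


# Hypothetical counts: 5 of 31 study genes and 300 of 8,000 background genes.
k, n, K, N = 5, 31, 300, 8000
table = (k, n - k, K - k, N - K - (n - k))

p_hyper = hypergeom_upper_tail(k, n, K, N)
p_fisher = fisher_one_sided(*table)
p_ease = ease_like(*table)
print(p_hyper, p_fisher, p_ease)
```

Under these assumptions the hypergeometric and Fisher p-values coincide, while the EASE-style score is larger, which is one reason results from DAVID and GOstat cannot be compared directly.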
Table 6 shows the GO terms that were found significantly enriched (FDR p-value 0.1, GO term depth 5) in the Experimental Set. Both Onto-Express and DAVID found many enriched terms, whereas GOstat and EasyGO found only a small number of enriched terms. GOstat is the only tool that reports under-represented GO terms for Cellular Component.
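As an illustration of how a false discovery rate cutoff of 0.1 is applied, the sketch below implements the Benjamini-Hochberg step-up adjustment. This assumes Benjamini-Hochberg as the correction, which is not necessarily the method each tool applies, and the raw p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five GO terms.
raw = {
    "response to stimulus": 0.001,
    "immune response": 0.02,
    "protein binding": 0.03,
    "metabolic process": 0.2,
    "intracellular": 0.6,
}
terms = list(raw)
adjusted = dict(zip(terms, benjamini_hochberg([raw[t] for t in terms])))
significant = [t for t in terms if adjusted[t] <= 0.1]
print(significant)
```

Because the adjustment depends on the number of terms tested, tools that test different numbers of GO terms can report different adjusted p-values for the same raw result.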
To gain a better understanding of the biological meaning of the enriched GO terms, we compared the GOSlim distributions for the significantly enriched genes found by each tool. Additional file 3 lists the enriched GO terms retrieved by all tools. The functional enrichment results from the Experimental Set show interesting GO term distributions. For the biological process ontology, GOstat did not find any GO terms represented. EasyGO, DAVID and Onto-Express agree that "response to stimulus" is one of the major GO terms represented. However, additional GO terms from DAVID represent an immunological trend, while Onto-Express finds GO terms pointing to a developmental and metabolic trend. The cellular component ontology also shows disagreement: GOstat reports an "intracellular" trend and DAVID an extracellular trend, while EasyGO and Onto-Express represent a more global cell location. For the molecular function ontology, the tools agree that "protein binding" is the major biological trend. Onto-Express additionally reports enzyme activities, while DAVID shows chemokine and cytokine activities.
Although the tools show some agreement for the Experimental Set, there are also substantial differences. This makes it hard to identify a specific biological theme represented in a given dataset. As mentioned earlier, each tool should be considered evaluative and not conclusive in terms of the gene annotation enrichment results and the related biological trends. This comparison demonstrates that even if a dataset is evaluated by multiple tools, it may be difficult to find a general trend that will help the researcher focus on more specific genes of interest.
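One simple way to quantify agreement between tools is the Jaccard index of their enriched term sets. The sketch below uses hypothetical term sets that only loosely echo the trends described above; they are not the results listed in Additional file 3.

```python
from itertools import combinations

# Hypothetical enriched terms per tool (illustration only).
enriched = {
    "Onto-Express": {"response to stimulus", "protein binding", "developmental process"},
    "DAVID": {"response to stimulus", "protein binding", "immune response"},
    "GOstat": {"intracellular"},
    "EasyGO": {"response to stimulus"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


agreement = {
    (t1, t2): jaccard(enriched[t1], enriched[t2])
    for t1, t2 in combinations(sorted(enriched), 2)
}
shared_by_all = set.intersection(*enriched.values())

for pair, score in agreement.items():
    print(pair, round(score, 2))
print("terms shared by all tools:", shared_by_all)
```

A pairwise score near 1 indicates that two tools report nearly the same terms, while an empty intersection across all tools reflects the lack of a common theme.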
No standard GO annotation assignment method has been established in the scientific world. Each tool has advantages and disadvantages in the features it supports and the statistical methods it uses. Having more databases incorporated in a tool does not necessarily increase the number of gene annotations retrieved. Gene/protein identifiers play a critical role in database compatibility and in the annotations retrieved. Availability of GO annotation evidence codes would offer a more valuable quantitative assessment (i.e. GAQ score) of assigned annotation quality in the entire dataset. Researchers in the agricultural community would benefit greatly from inclusion of their species in tools such as GenePattern that implement more sophisticated statistical tests and use different analysis techniques.
Test gene dataset
We selected 60 probes from all the structurally annotated probes on the widely used Fred Hutchinson Cancer Research Center (FHCRC) 13 K chicken cDNA microarray (GEO accession GPL2863) to serve as our test gene annotation set (Test Set). Since each tool accepts different gene identifiers, we selected the 60 probes for which we could retrieve the corresponding Entrez Gene IDs and UniProtKB accessions via the UniGene database and IPI database (Additional file 1). This set serves as equal input for each tool and is used to evaluate the annotation performance.
Experimental gene dataset
For the Experimental Set, we used the custom-made FHCRC 13 K chicken cDNA microarray (containing 13,007 features) to represent a real experimental dataset. We used a previously published differentially expressed gene set: Zhou and Lamont described 53 significantly differentially expressed ESTs using the FHCRC 13 K array. As for the Test Set, we retrieved all possible corresponding Entrez Gene IDs and UniProtKB accessions for each probe. Since multiple ESTs can be assigned to one gene, we removed duplicate genes. In addition, some ESTs may not be structurally annotated. Therefore, from the 53 ESTs, we were able to obtain 31 genes for input into each evaluated tool.
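The reduction from ESTs to a non-redundant gene list can be sketched as follows. The EST names and gene identifiers are hypothetical placeholders, not entries from the FHCRC array, and the helper function is ours.

```python
# Hypothetical EST -> gene mapping; None marks an EST that could not be
# mapped to a gene identifier (not structurally annotated).
est_to_gene = {
    "EST_001": "gene_A",
    "EST_002": "gene_A",  # second EST assigned to the same gene
    "EST_003": "gene_B",
    "EST_004": None,
    "EST_005": "gene_C",
}


def unique_genes(mapping):
    """Return mapped genes without duplicates, in sorted EST order."""
    seen = set()
    genes = []
    for est in sorted(mapping):
        gene = mapping[est]
        if gene is None or gene in seen:
            continue
        seen.add(gene)
        genes.append(gene)
    return genes


gene_list = unique_genes(est_to_gene)
print(len(est_to_gene), "ESTs ->", len(gene_list), "genes:", gene_list)
```

The same two effects, duplicate ESTs per gene and unmapped ESTs, account for the smaller gene count relative to the EST count in the Experimental Set.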
The tools used in this comparative study are Onto-Express, EasyGO, GOstat, and DAVID. These tools were chosen because they fulfilled the criteria of being i) operational and freely accessible online; ii) compatible with agricultural species (e.g. chicken, corn, cow) and iii) supportive of GO-based gene annotation enrichment analysis. We also used GOretriever from AgBase to retrieve all possible GO annotations for our datasets. AgBase currently provides the most comprehensive and recent GO annotations for a majority of agricultural species. This allows us to obtain a core reference set of GO annotations for our experimental dataset.
We evaluated each tool via published literature describing the tool and accessed the tool's website for additional information and available features. We evaluated the tools based on i) available features, usage and accessibility; ii) implemented statistical computational methods; iii) annotation performance analysis. The approach for the latter is described in more detail below.
We accessed each tool online and submitted each differentially expressed data set as input for each tool. Some tools allow users to upload their own background list of genes against which enrichment is calculated. We analyzed our Experimental Set with the parameters listed in Table 1. We analyzed the enrichment using common statistical methods available in the tools when possible.
We analyzed the results of each tool based on the number of genes recognized, the total number of genes annotated, and the total number of GO annotations found. We compared the over- and under-representation of GO terms as calculated by each tool and used GOSlimViewer from AgBase and the "GOA and whole proteome GOSlim set" to compare the distribution of the major GO groups represented for each tool's generated dataset.
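The three counts used to compare tools can be computed from a per-gene annotation result as sketched below. The input structure, gene names and helper function are hypothetical, since each tool exports results in its own format; the GO identifiers shown are those for protein binding (GO:0005515) and response to stimulus (GO:0050896).

```python
from collections import Counter

# Hypothetical output of one tool: gene -> set of GO term IDs.
# None means the tool did not recognize the identifier.
tool_output = {
    "gene_A": {"GO:0005515", "GO:0050896"},
    "gene_B": set(),  # recognized, but no GO annotation returned
    "gene_C": {"GO:0005515"},
    "gene_D": None,
}


def summarize(result):
    """Count recognized genes, annotated genes and total GO annotations."""
    recognized = [g for g, terms in result.items() if terms is not None]
    annotated = [g for g in recognized if result[g]]
    total = sum(len(result[g]) for g in annotated)
    per_term = Counter(term for g in annotated for term in result[g])
    return {
        "genes_input": len(result),
        "genes_recognized": len(recognized),
        "genes_annotated": len(annotated),
        "annotations": total,
        "annotations_per_annotated_gene": total / len(annotated) if annotated else 0.0,
        "term_counts": per_term,
    }


summary = summarize(tool_output)
print(summary)
```

Keeping "recognized" and "annotated" separate matters because a tool may accept an identifier without returning any GO terms for it.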
We would like to acknowledge Dr. Susan J. Lamont (Iowa State University) and Dr. Huaijun Zhou (Texas A&M University) for their contribution of the microarray data used in this work. We would also like to acknowledge Dr. Shane C. Burgess for his support of BVDB. The project was supported by the National Research Initiative of the USDA Cooperative State Research, Education and Extension Service, grant numbers 2004-34481-14513, 2004-35204-14829, and 2007-35205-17941.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009, 37(1):1–13. doi:10.1093/nar/gkn923
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587–3595. doi:10.1093/bioinformatics/bti565
- Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31(13):3775–3781. doi:10.1093/nar/gkg624
- Zhou X, Su Z: EasyGO: Gene Ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 2007, 8:246. doi:10.1186/1471-2164-8-246
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. doi:10.1093/bioinformatics/bth088
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 2003, 4(5):P3. doi:10.1186/gb-2003-4-5-p3
- Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP: GenePattern 2.0. Nature Genetics 2006, 38(5):500–501. doi:10.1038/ng0506-500
- McCarthy FM, Bridges SM, Wang N, Magee GB, Williams WP, Luthe DS, Burgess SC: AgBase: a unified resource for functional analysis in agriculture. Nucleic Acids Res 2007, (35 Database):D599–603. doi:10.1093/nar/gkl936
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, (32 Database):D262–266. doi:10.1093/nar/gkh021
- Galperin MY, Cochrane GR: Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 2009, (37 Database):D1–4. doi:10.1093/nar/gkn942
- UniGene website [http://www.ncbi.nlm.nih.gov/./unigene/]
- Entrez website [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene]
- UniProt/SwissProt Homepage [http://www.uniprot.org/]
- Gene Ontology Annotation (GOA) database [http://www.ebi.ac.uk/GOA/]
- International Protein Index (IPI) [http://www.ebi.ac.uk/IPI/IPIhelp.html]
- Buza TJ, McCarthy FM, Wang N, Bridges SM, Burgess SC: Gene Ontology annotation quality analysis in model eukaryotes. Nucleic Acids Res 2008, 36(2):e12. doi:10.1093/nar/gkm1167
- McCarthy FM, Wang N, Magee GB, Nanduri B, Lawrence ML, Camon EB, Barrell DG, Hill DP, Dolan ME, Williams WP, et al.: AgBase: a functional genomics resource for agriculture. BMC Genomics 2006, 7:229. doi:10.1186/1471-2164-7-229
- Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23(8):980–987. doi:10.1093/bioinformatics/btm051
- Gold DL, Coombes KR, Wang J, Mallick B: Enrichment analysis in high-throughput genomics – accounting for dependency in the NULL. Brief Bioinform 2007, 8(2):71–77. doi:10.1093/bib/bbl019
- Vencio RZ, Shmulevich I: ProbCD: enrichment analysis accounting for categorization uncertainty. BMC Bioinformatics 2007, 8:383. doi:10.1186/1471-2105-8-383
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society B 1995, 57:289–300.
- Yekutieli Y, Benjamini Y: The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 2001, 29:1165–1188. doi:10.1214/aos/1013699998
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(38):13544–13549. doi:10.1073/pnas.0506577102
- Lewin A, Grieve IC: Grouping Gene Ontology terms to improve the assessment of gene set enrichment in microarray data. BMC Bioinformatics 2006, 7:426. doi:10.1186/1471-2105-7-426
- Al-Shahrour F, Minguez P, Tarraga J, Medina I, Alloza E, Montaner D, Dopazo J: FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res 2007, (35 Web Server):W91–96. doi:10.1093/nar/gkm260
- Burnside J, Neiman P, Tang J, Basom R, Talbot R, Aronszajn M, Burt D, Delrow J: Development of a cDNA array for chicken gene expression analysis. BMC Genomics 2005, 6(1):13. doi:10.1186/1471-2164-6-13
- Nanduri B, Lawrence ML, Boyle CR, Ramkumar M, Burgess SC: Effects of subminimum inhibitory concentrations of antibiotics on the Pasteurella multocida proteome. J Proteome Res 2006, 5(3):572–580. doi:10.1021/pr050360r
- Zhou H, Lamont SJ: Global gene expression profile after Salmonella enterica Serovar enteritidis challenge in two F8 advanced intercross chicken lines. Cytogenetic and Genome Research 2007, 117(1–4):131–138. doi:10.1159/000103173
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.