Improving detection of differentially expressed gene sets by applying cluster enrichment analysis to Gene Ontology
© Xu et al; licensee BioMed Central Ltd. 2009
Received: 14 January 2009
Accepted: 5 August 2009
Published: 5 August 2009
Gene set analysis based on Gene Ontology (GO) can be a promising method for the analysis of differential expression patterns. However, current studies that focus on individual GO terms have limited analytical power, because the complex structure of GO introduces strong dependencies among the terms, and some genes that are annotated to a GO term cannot be found by statistically significant enrichment.
We proposed a method for enriching clustered GO terms based on semantic similarity, namely cluster enrichment analysis based on GO (CeaGO), to extend the individual term analysis method. Using an Affymetrix HGU95aV2 chip dataset with simulated gene sets, we illustrated that CeaGO was sensitive enough to detect moderate expression changes. When compared to parent-based individual term analysis methods, the results showed that CeaGO may provide more accurate differentiation of gene expression results. When used with two acute leukemia (ALL and ALL/AML) microarray expression datasets, CeaGO correctly identified specifically enriched GO groups that were overlooked by other individual test methods.
By applying CeaGO to both simulated and real microarray data, we showed that this approach could enhance the interpretation of microarray experiments. CeaGO is currently available at http://chgc.sh.cn/en/software/CeaGO/.
Identifying differentially expressed genes (DEGs) from microarray experiments enables researchers to elucidate related biological processes. In addition to studies focused on individual genes such as SAM, statistical techniques have been successfully employed to determine whether predefined groups, for example those in Gene Ontology (GO) , or in a metabolic pathway, are differentially expressed. There are two main statistical testing approaches: individual gene analysis (IGA) [3, 4] and Gene Set Analysis (GSA) . IGA is performed in two steps: first, genes of interest are selected using a cutoff threshold, and the enriched biological categories are gained by statistically testing these genes against the background: typically all genes in the category (e.g., Fisher's exact test). The major limitation of IGA is that the result is significantly affected by an arbitrarily chosen cutoff in the first step. Hence, the GSA approach was developed to address this issue. GSA methods calculate a score based on all the genes within the gene set. Since it is free of the problems of threshold-based methods, GSA should be more sensitive than IGA, thus identifying gene sets with 'subtle but coordinated' expression changes that cannot be detected by IGA. In previous studies, gene sets were formatted and pre-defined into groups such as independent GO terms and reference KEGG  pathways. Little attention was paid to subtle but coordinated expression changes across gene sets or within gene sets. In other words, the structure of gene sets was usually ignored.
Some authors have proposed taking into account the structure of gene sets when testing for gene set enrichment. Draghici et al.  developed an impact analysis of a signaling pathway that incorporates some crucial factors, such as each gene's position in the given pathway and their interactions. A few studies reported that through an intersect operation on gene sets between different categories, the gene sets could more precisely characterize the biological themes, and more closely represent the true differential expression functions of the data. The SEGS  and ADGO  methods attempted to improve gene set enrichment analysis by intersecting two different GO categories. Enrichment of GO terms with p-values calculated by the IGA method based on its neighbourhood using GO graph topology was also implemented in TopGO . A few attempts [10, 11] have been proposed to address the redundant enrichment problem caused by overlapping annotations according to the fact that GO terms within a Directed Acyclic Graph (DAG) represent an inheritance relationship. We abstracted the simplest common opinion of these methods, called the parent-based approach which was implemented in the CeaGO package, was to join the genes of children to the parent GO term stepwise, and then tested all the GO terms in the graph topology. Elim and weight approaches presented in the study of Alexa et al.  and parent-child in Grossmann et al. are complicated but useful for minimizing false positives and can enrich GO terms more accurately. Otherwise, if an insufficient number of genes are annotated to one GO term, methods may not be sensitive enough to uncover subtle expression changes, similar to the drawbacks of significance analysis of individual genes . Thus, our goal was to identify some novel expression changes by grouping GO terms. Similarity between pairs of GO terms provided an opportunity to enlarge the GO groups to better interpret the gene expression data.
In this article, we present an effective method, cluster enrichment analysis based on GO (CeaGO), to extend enrichment analysis of individual GO terms and discover significantly clustered GO classes from expression data. A sufficiently rigorous standard could not be found to evaluate the algorithm; instead, a simulation study was performed to assess the statistical properties of this method. Finally, we applied this method to the gene expression profiles of ALL  and ALL/AML datasets . The results from both simulated and real data showed that CeaGO is sensitive enough to identify significant expression changes overlooked by individual term tests.
Semantic similarity calculation
Note that S(t) in this definition is the set of all terms in one ontology.
Hierarchical clustering is a multiple step, agglomerative method that sequentially merges samples based on the pair-wise similarity of a given measurement, forming common partitions until all samples are contained in a single group. At each particular stage, the method joins together the two previous clusters that are closest together (most similar). A dendrogram is built by using the semantic similarity metric described above. N classes are obtained when the tree meets a user-defined similarity distance threshold (in this study taken as d < 2). The model of each class can be described as a vector = (G1, G2, ..., G n ), where G i represents the genes annotated to GO terms obtained at the ith step in the clustering process. When such a class is formed, various statistical tests can be used to determine whether the genes within the G i showed coordinated expression.
Gene set enrichment analysis
Where μ is the mean of total fold change  values and σ is the standard deviation of total fold change values of a given microarray data set, x is the mean of the fold change values and n is the total number of genes in the gene set. Fold changes are calculated for all genes between two experimental groups (e.g. control versus treatment). P-values inferred from Z-scores against standard normal distribution are calculated. According to PAGE, 10 samples should be sufficiently close to normal distribution and provide a fairly good statistical test. Therefore, gene sets larger than 10 were used to test the differential gene expression changes.
Next, a vector of significant p-values, which was described as = (p1, p2, ..., p n ), associated with was computed for each GO class. If the p-values were less than a pre-defined value (e.g. < 0.05), those gene sets were considered as significantly differentially expressed. When such gene sets were found, the gene sets were reduced by retaining the GO subsets with the smallest p-values. At this point, the biologically meaningful sets were identified by the tally of GO subsets selected during this step.
The algorithms were implemented in the R programming language http://www.r-project.org. The results were obtained using R version 2.7.1 and the libraries provided by the Bioconductor project http://www.bioconductor.org, version 2.2.
Validation of CeaGO on simulated data
The evaluation of enrichment measurement methods is a challenging task, because biologically meaningful gene sets usually are not known for real datasets. In this study, we introduced an evaluation framework similar to one described previously  to address this issue. To imitate real data as closely possible, an artificial data set derived from a HGU95aV2 chip with all 10,503 probes representing genes annotated by terms from the GO biological process subontology was used, and the resulting graph with 4,913 nodes was used as the underlying dataset. Here, we presumed that the expression values of the control and the treatment groups obeyed a standard normal distribution N(0, 1). After clustering the GO terms, 188 classes, annotated with more than 10 genes each, were obtained by cutting the dendrogram (d < 2). We selected 25 "truly enriched" GO classes at random from the 188 classes. One set was chosen randomly from the subsets in each GO class and denoted as the "truly" differentially expressed gene set. Given those gene sets, N(0,1) distributed genes in the treatment groups were replaced by N(μ,1) distributed genes, with the genes in the control groups remaining as N(0,1). The test of dynamic change of σ was not performed in this study because Z-statistic methods are not sensitive enough to detect changes of sample standard deviation σ (see Eq. (3)).
The score is the number of pre-selected GO sets found among the top k enriched sets. It lies in the interval [0, k], where k is the perfect prediction by this method.
Another score was also implemented for evaluating the detecting power of the "Possible match" members. Such members represent the fraction of perfect matchings of the pre-selected "truly enriched" GO sets, and the "in but wrong" members that failed, but were still detected as the same class of the "truly enriched" GO sets.
Application of CeaGO to ALL data sets
CeaGO was first applied to the well-known expression dataset Acute Lymphocytic Leukemia (ALL) developed by Chiaretti et al. . These data were collected to characterize the relationship between gene expression signatures in ALL-associated cells and genotypic abnormalities in adult patients and the dataset is available from Bioconductor . One use of the dataset has been to examine B-cell lines with this disease and find differential gene expression between the BCR/ABL samples that have rearrangements in the BCR/ABL genes, and NEG samples, which have no evidence of major molecular rearrangements. There are 37 samples for the BCR/ABL group and 42 for the NEG group, each of which has been hybridized to an Affymetrix HGU95Av2 chip containing 12,625 gene-associated probes. We began by normalizing the dataset using the variance stabilizing method VSN . Subsequently, 10,503 genes were successful mapped to GO terms from the BP ontology yielding a list of genes with a GO graph of 3,066 terms.
Top significant GO groups identified between BCR/ABL and NEG phenotypes for the ALL dataset.
regulation of I-kappaB kinase/NF-kappaB ...
positive regulation of I-kappaB kinase/N...
negative regulation of I-kappaB kinase/N...
S phase of mitotic cell cycle
S-phase-specific transcription in mitoti...
negative regulation of interleukin-6 pro...
positive regulation of interleukin-6 pro...
interleukin-6 biosynthetic process
regulation of interleukin-6 biosynthetic...
positive regulation of interleukin-6 bio...
inhibition of NF-kappaB transcription fa...
negative regulation of DNA binding
negative regulation of transcription fac...
activation of JNK activity
positive regulation of JNK activity
Application of CeaGO to the ALL/AML dataset
The purpose of the present study was to examine the applicability of the CeaGO algorithm. In addition, we tested the algorithm with a published dataset called golubEsets , which is also available from Bioconductor. It consists of 7,129 genes from 47 samples of acute lymphoblastic leukemia (ALL), and 25 samples of acute myeloblastic leukemia (AML). Normalization on these samples was also carried out using VSN. This pre-processing resulted in 6,372 genes annotated to GO terms from BP ontology. The induced GO graph contains 2,766 GO terms.
Top significant GO groups identified between AML and ALL for the ALL/AML dataset.
chemokine biosynthetic process
negative regulation of chemokine biosynt...
positive regulation of chemokine biosynt...
positive regulation of fractalkine biosy...
positive regulation of positive chemotax...
induction of positive chemotaxis
negative regulation of tumor necrosis fa...
positive regulation of tumor necrosis fa...
positive regulation of tumor necrosis fa...
negative regulation of tumor necrosis fa...
positive regulation of endocytosis
positive regulation of receptor-mediated...
positive regulation of phagocytosis
Traditional strategies for gene expression analysis have focused on identifying individual genes or pre-defined groups such as those in KEGG pathways and Gene Ontology, which exhibit differences between two states of interest. In this article, we extended expression analysis from prior defined gene sets to a gene set analysis framework that makes use of the structure of GO.
GO has a hierarchical structure that forms a DAG. CeaGO uses the clustering method to combine similar GO terms, rather than using the parent terms, and successfully detected several novel, meaningful categories. One objective of these methods is to enlarge the gene sets to enable identification of new, differentially expressed groups. Methods derived from a parent term with all the genes of its children may introduce unnecessary noise, and fail to detect significant changes. The superior performance of CeaGO than parent-based method using PAGE gene set enrichment algorithm is supported by the results presented here on simulated and ALL data. Similar scenarios were observed with the ALL/AML dataset. However, the number of GO groups enriched by the CeaGO method is related to the cutoff of semantic distances. If the threshold was set to the maximum distance of two GO terms of the ontology, CeaGO gave just one cluster, which is too biologically general. If the threshold was set to 0, all GO terms were independent, which is the same as the individual GO term analysis. Fortunately, we found that CeaGO is not sensitive to this parameter, and an arbitrary 20% of the maximum semantic distance was chosen to balance biological meaning and cluster numbers.
The application of CeaGO relies on the accuracy of the gene set test methods. Fortunately, the proposed method operates within a well-defined statistical framework, so that other statistical tests for assessing the significance of GO sets can be used with CeaGO, for example GSEA  or GlobalANCOVA . The Z-statistic algorithm was employed in this article because of its fast computation advantage over the permutation-based methods. For example, PAGE reduces computation time at least 5,000-fold when performing 5,000 dataset permutations to get a background distribution. However, it suffers from limitations on the normal distribution hypothesis and the minimal size of gene sets. As the accuracy of statistical test methods increases, the performance of CeaGO should improve, as applied to experimental datasets.
In many, if not all cases, analysis with individual GO terms may not be sufficient to reveal changes in specific expression patterns. For example, not all the genes annotated to a "Biological Process" term may exhibit as differentially expressed, but only those with a particular localization such as "membrane" might alter their expression. Through intersection of gene sets between different categories, gene sets can be separated more specifically, facilitating a much more detailed analysis of expression patterns [8, 9]. In future studies, we will introduce a comprehensive algorithm to investigate genes within a single GO term or a clustered GO class, with the goal of uncovering the truly differential expression functions in expression data.
Gene set analysis based on GO is a popular and useful approach to extract biological information from expression data. However, this is limited from showing its full analytical power when only pre-defined GO terms are used, because an insufficient number of genes annotated to one GO term may not be sensitive enough to detect subtle expression changes. Therefore, we developed a novel method to extend the traditional individual GO term analysis. This method dynamically enriches clustered GO terms to identify groups that are significantly differentially expressed in microarray data. Compared to individual GO term analysis (including parent-based enrichment methods), the results obtained from both simulated and real microarray data sets showed that the proposed approach is very promising. Furthermore, the CeaGO model can easily be extended to other gene set analysis methods.
Availability and requirements
Project name: CeaGO
Project homepage: http://chgc.sh.cn/en/software/CeaGO/
Operating system(s): Platform Independent
Programming languages: R and C
The work was supported by the National Natural Sciences Foundation of China (No.30870118), the Doctoral Foundation of Ministry of Education of China (No.20070610080), Program for New Century Excellent Talents in University (NCET-04-0861), Sichuan University Research Grant 985 and China National High-tech 863 Program (2006AA02Z335).
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(9):5116–5121. 10.1073/pnas.091062498PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25(1):25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (Oxford, England) 2005, 21(18):3587–3595. 10.1093/bioinformatics/bti565View ArticleGoogle Scholar
- Rivals I, Personnaz L, Taing L, Potier MC: Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics (Oxford, England) 2007, 23(4):401–407. 10.1093/bioinformatics/btl633View ArticleGoogle Scholar
- Nam D, Kim SY: Gene-set approach for expression pattern analysis. Briefings in bioinformatics 2008, 9(3):189–197. 10.1093/bib/bbn001View ArticlePubMedGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic acids research 1999, 27(1):29–34. 10.1093/nar/27.1.29PubMed CentralView ArticlePubMedGoogle Scholar
- Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome research 2007, 17(10):1537–1545. 10.1101/gr.6202607PubMed CentralView ArticlePubMedGoogle Scholar
- Trajkovski I, Lavrac N, Tolar J: SEGS: search for enriched gene sets in microarray data. Journal of biomedical informatics 2008, 41(4):588–601. 10.1016/j.jbi.2007.12.001View ArticlePubMedGoogle Scholar
- Nam D, Kim SB, Kim SK, Yang S, Kim SY, Chu IS: ADGO: analysis of differentially expressed gene sets using composite GO annotation. Bioinformatics (Oxford, England) 2006, 22(18):2249–2253. 10.1093/bioinformatics/btl378View ArticleGoogle Scholar
- Alexa A, Rahnenfuhrer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (Oxford, England) 2006, 22(13):1600–1607. 10.1093/bioinformatics/btl140View ArticleGoogle Scholar
- Grossmann S, Bauer S, Robinson PN, Vingron M: Improved detection of overrepresentation of Gene-Ontology annotations with parent child analysis. Bioinformatics (Oxford, England) 2007, 23(22):3024–3031. 10.1093/bioinformatics/btm440View ArticleGoogle Scholar
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, et al.: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature genetics 2003, 34(3):267–273. 10.1038/ng1180View ArticlePubMedGoogle Scholar
- Chiaretti S, Li X, Gentleman R, Vitale A, Wang KS, Mandelli F, Foa R, Ritz J: Gene expression profiles of B-lineage adult acute lymphocytic leukemia reveal genetic patterns that identify lineage derivation and distinct mechanisms of transformation. Clin Cancer Res 2005, 11(20):7209–7219. 10.1158/1078-0432.CCR-04-2165View ArticlePubMedGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (New York, NY) 1999, 286(5439):531–537.View ArticleGoogle Scholar
- Jiang DW: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proceeding of International Conference Research on Computational LinguisticsTaipei 1997, 19–33.Google Scholar
- Lin D: An Information-Theoretic Definition of Similarity. Proceedings of the Fifteenth International Conference on Machine Learning 1998.Google Scholar
- Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence 1995, 448–453.Google Scholar
- Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure the semantic similarity of GO terms. Bioinformatics (Oxford, England) 2007, 23(10):1274–1281. 10.1093/bioinformatics/btm087View ArticleGoogle Scholar
- Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC bioinformatics 2005, 6: 144. 10.1186/1471-2105-6-144PubMed CentralView ArticlePubMedGoogle Scholar
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(38):13544–13549. 10.1073/pnas.0506577102PubMed CentralView ArticlePubMedGoogle Scholar
- Hummel M, Meister R, Mansmann U: GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics (Oxford, England) 2008, 24(1):78–85. 10.1093/bioinformatics/btm531View ArticleGoogle Scholar
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC bioinformatics 2007, 8: 242. 10.1186/1471-2105-8-242PubMed CentralView ArticlePubMedGoogle Scholar
- Breitling R, Amtmann A, Herzyk P: Iterative Group Analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC bioinformatics 2004, 5: 34. 10.1186/1471-2105-5-34PubMed CentralView ArticlePubMedGoogle Scholar
- Choe S, Boutros M, Michelson A, Church G, Halfon M: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome biology 2005, 6: R16. 10.1186/gb-2005-6-2-r16PubMed CentralView ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Huber W, von Heydebreck A, Sultmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics (Oxford, England) 2002, 18(Suppl 1):S96–104.View ArticleGoogle Scholar
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Statist 2001, 29(4):1165–1188. 10.1214/aos/1013699998View ArticleGoogle Scholar
- Dierov J, Dierova R, Carroll M: BCR/ABL translocates to the nucleus and disrupts an ATR-dependent intra-S phase checkpoint. Cancer cell 2004, 5(3):275–285. 10.1016/S1535-6108(04)00056-XView ArticlePubMedGoogle Scholar
- Raitano AB, Halpern JR, Hambuch TM, Sawyers CL: The Bcr-Abl leukemia oncogene activates Jun kinase and requires Jun for transformation. Proceedings of the National Academy of Sciences of the United States of America 1995, 92(25):11746–11750. 10.1073/pnas.92.25.11746PubMed CentralView ArticlePubMedGoogle Scholar
- Cambier N, Zhang Y, Vairo G, Kosmopoulos K, Metcalf D, Nicola NA, Elefanty AG: Expression of BCR – ABL in M1 myeloid leukemia cells induces differentiation without arresting proliferation. Oncogene 1999, 18(2):343–352. 10.1038/sj.onc.1202302View ArticlePubMedGoogle Scholar
- Olsnes AM, Motorin D, Ryningen A, Zaritskey AY, Bruserud O: T lymphocyte chemotactic chemokines in acute myelogenous leukemia (AML): local release by native human AML blasts and systemic levels of CXCL10 (IP-10), CCL5 (RANTES) and CCL17 (TARC). Cancer Immunol Immunother 2006, 55(7):830–840. 10.1007/s00262-005-0080-zView ArticlePubMedGoogle Scholar
- Bruserud O, Ryningen A, Olsnes AM, Stordrange L, Oyan AM, Kalland KH, Gjertsen BT: Subclassification of patients with acute myelogenous leukemia based on chemokine responsiveness and constitutive chemokine release by their leukemic cells. Haematologica 2007, 92(3):332–341. 10.3324/haematol.10148View ArticlePubMedGoogle Scholar
- Brailly H, Pebusque MJ, Tabilio A, Mannoni P: TNF alpha acts in synergy with GM-CSF to induce proliferation of acute myeloid leukemia cells by up-regulating the GM-CSF receptor and GM-CSF gene expression. Leukemia 1993, 7(10):1557–1563.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.