Appearance frequency modulated gene set enrichment testing
© Ma et al; licensee BioMed Central Ltd. 2011
Received: 5 July 2010
Accepted: 20 March 2011
Published: 20 March 2011
Gene set enrichment testing has helped bridge the gap from an individual gene to a systems biology interpretation of microarray data. Although gene sets are defined a priori based on biological knowledge, current methods for gene set enrichment testing treat all genes equal. It is well-known that some genes, such as those responsible for housekeeping functions, appear in many pathways, whereas other genes are more specialized and play a unique role in a single pathway. Drawing inspiration from the field of information retrieval, we have developed and present here an approach to incorporate gene appearance frequency (in KEGG pathways) into two current methods, Gene Set Enrichment Analysis (GSEA) and logistic regression-based LRpath framework, to generate more reproducible and biologically meaningful results.
Two breast cancer microarray datasets were analyzed to identify gene sets differentially expressed between histological grade 1 and 3 breast cancer. The correlation of Normalized Enrichment Scores (NES) between gene sets, generated by the original GSEA and GSEA with the appearance frequency of genes incorporated (GSEA-AF), was compared. GSEA-AF resulted in higher correlation between experiments and more overlapping top gene sets. Several cancer related gene sets achieved higher NES in GSEA-AF as well. The same datasets were also analyzed by LRpath and LRpath with the appearance frequency of genes incorporated (LRpath-AF). Two well-studied lung cancer datasets were also analyzed in the same manner to demonstrate the validity of the method, and similar results were obtained.
We introduce an alternative way to integrate KEGG PATHWAY information into gene set enrichment testing. The performance of GSEA and LRpath can be enhanced with the integration of appearance frequency of genes. We conclude that, generally, gene set analysis methods with the integration of information from KEGG PATHWAY performs better both statistically and biologically.
Traditional microarray data analysis mainly focuses on identifying individual genes whose expression levels are associated with a certain phenotype . However, to generate a biologically meaningful hypothesis based on a handful of genes is often challenging. To overcome the limitations of individual gene level analysis, gene set enrichment analysis  was introduced and is now widely used as a strategy for gene expression analysis over pathway knowledge. Gene set enrichment analysis using "cutoff-free" expression data integrated with prior biology knowledge generates consistent and biologically relevant results. Several gene set analysis strategies have been developed, such as GSEA , SAM-GS , LRpath , GAGE , Random Set Scoring , etc. Based on prior biological knowledge, such as chromosomal location, Gene Ontology term assignment, publicly available databases, conserved regulatory motif in the promoter region, etc., each of these approaches defines sets of genes that are considered to be involved in the same underlying biological process. Typically, all genes in a gene set are treated equally in subsequent analysis. However, genes involved in multiple pathways may not be as distinctive as genes specific to one pathway, and therefore may be less informative from a biological perspective. For example, specific receptors interacting with a ligand are more characteristic of a ligand-receptor pathway than signal trasduction kinases, which could play different roles in several pathways. From a gene set analysis point of view, if most of the significant differentially expressed genes in a particular gene set function in multiple pathways, the corresponding gene set might be enriched, but this result could also be a false positive result. Placing excessive concentration on differentially expressed genes with multiple roles could easily disguise the discovery of genes with less significant differential expression but higher specificity.
A similar problem has been studied extensively in the field of Information Retrieval. Given a set of search terms entered by the user, the system task is to find the most relevant documents. In evaluating the relevance of a document, the occurrence of the search terms in the document is considered. If all search terms are treated equally, it has been observed that commonly occurring terms, which are actually less useful in identifying relevant documents, swamp out the more specific search terms that occur less frequently. To address this problem, it is standard practice to weight search terms by their inverse document frequency (IDF) defined for term × as the reciprocal of (the logarithm of) the number of documents (in the repository) in which the search term occurs.
Among the many different ways to define gene sets, public databases are considered to be among the most informative, since these pathways are manually curated by experts in the area and supported by experimental evidence. KEGG (Kyoto Encyclopedia of Genes and Genomes) PATHWAY  is a collection of manually drawn pathway maps representing molecular interaction and reaction networks for Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, and Human Diseases. As part of a systems biology approach, KEGG can be viewed as a virtual biological system containing various types of information essential for the recreation of an organism. In short, KEGG PATHWAY integrates all the molecular interaction knowledge from wet lab experiments into one database.
If we took all the pathways in KEGG as a virtual biological system, the appearance frequency of each gene could be a reliable indicator for the specificity of each gene to a pathway. Low appearance frequency genes are more specific to a pathway; if these genes show significant differential expression, then there is a good chance the corresponding pathway is activated. High appearance frequency genes function in multiple pathways; the significant differential expression of these genes could affect a number of pathways, and they are therefore less likely to lead to the identification of any particular pathway. It is also possible to observe several high appearance frequency genes cause the enrichment of truly affected pathway, while other pathways containing the same set of high frequency genes are indirectly enriched solely due to the overlapping genes. These indirectly enriched pathways interfere with the identification of the true positive pathway.
Motivated by these ideas, we propose a novel approach to characterize genes using information provided by KEGG. We propose a weighting strategy based on the appearance frequency of each gene in KEGG PATHWAY maps and incorporate it into gene set analysis methods. We applied the weights to the analysis of two independent breast cancer datasets and two independent lung cancer datasets. The results with our weights were compared with results using the original GSEA, and showed an increase in consistency between datasets. The weighting method was also incorporated into a novel logistic regression based gene set analysis method, LRpath. The combination of LRpath and our weighting strategy provided more reproducible and biologically meaningful results than using the original LRpath alone. Detailed information can be found at http://eecs.umich.edu/db/think/.
Results and discussion
The distribution of appearance frequency of genes in KEGG PATHWAY
If we have a list of differentially expressed genes, and most of the top genes are signal trasduction kinases, such as MAPK, PI3K and AKT, we can't be sure whether the whole pathway is activated, since these genes are not specific to the mTOR signaling pathway. If S6K1 and 4EBP1 show significant changes, then there is a good chance that mTOR is also activated. If mTOR pathway specific protein Raptor and GβL are among the top genes, alone with the environmental sensing STK11/STRAD complex, then we can confidently conclude that the mTOR signaling pathway is related to the phenotype. Most other pathways in KEGG follow a similar pattern. The above example illustrates that when we interpret microarray results in terms of affected pathways, if we could put higher weight on pathway specific genes while putting less weight on genes with multiple roles, then, with the integration of this new information, the updated ranked list of significantly involved genes could become more biologically meaningful.
Application of GSEA-AF
GSEA  is currently a widely used approach to gene set enrichment testing. It uses a weighted version of the Kolmogorov-Smirnov statistic to measure the degree of differential gene expression in gene sets. We decided to integrate the appearance frequency of genes in KEGG PATHWAY into GSEA.
where rj is the correlation of gene expression of gene g j with the phenotype of interest, N is the total number of genes in the data set, and NH is the number of genes in one gene set.
The ES is the maximum deviation from zero of Phit-Pmiss. In the original implementation of GSEA, the running-sum statistics used equal weights at every step (exponent p = 0), which yielded high scores for sets clustered near the middle of the ranked list . Subramanian et al addressed this issue by weighting the steps according to each gene's correlation with a phenotype, which corresponded to take p = 1. The exponent p can be adjusted to control the magnitude of each step. When p = 0, ES(S) reduces to the standard Kolmogorov-Smirnov statistic, when p = 1, the genes in one gene set are weighted by their correlation with phenotype normalized by the sum correlation over all genes in the set. We adjusted the value of the exponent p according to the appearance frequency of each gene. Since the correlation rj is 0, genes with high appearance frequency should receive an exponent p greater than 1 to reduce the magnitude of its effects. On the other hand, genes with low appearance frequency should receive an exponent p smaller than 1 to increase its effect. In this way, if there is a large fraction of pathway specific genes in a certain gene set ranked near the top or bottom of the list, the corresponding gene set will have a greater ES compared with the original GSEA. The appropriate value of p is determined based on a classical information retrieval term weighting method as described in the methods. In the following sections, we call the new model with the integration of appearance frequency into GSEA as GSEA-AF.
To demonstrate the validity of the new weighting method, we also devised two additional variants of GSEA as "controls": one based on a random appearance frequency, and another on inverse frequency. For the former, the genes are randomized while the proportion of genes with each appearance frequency is kept comparable to the original distribution. RF (Random Frequency) is used to indicate the resulting method. As another control, we generated an inverse appearance frequency, as described in the methods, for each gene in KEGG pathways and indicate this method as IW (Inverse Weight).
Comparison of overlapping gene sets generated by GSEA and GSEA-AF
Overlap Gene Sets
(FDR < 0.05)
Name of Gene Sets
Proteasome, Cell cycle, Biosynthesis of steroids
Proteasome, Cell cycle, Biosynthesis of steroids, Oxidative phosphorylation
Comparison of individual gene sets generated by GSEA and GSEA-AF
Biosynthesis of steroids
P53 signaling Pathway
Application of LRpath-AF
Several gene set analysis strategies have been proposed in the literature in addition to GSEA. Following the same logic, the appearance frequency of genes in KEGG PATHWAY can be applied to these strategies as well. As an example, we modified the logistic regression-based method, LRpath, to integrate this information. The basic concept of LRpath is that if the odds of a gene belonging to a pre-defined gene set increases as the significance of differential expression increases, then this gene set is enriched. In the original LRpath, each gene is assigned a statistical significance in terms of its P-value using a test for differential expression chosen by the user. For each gene set, genes within the gene set are defined as 1, while all other genes are 0. If π is the proportion of genes belonging to the gene set at a specified significance level, then π/(1-π) are the corresponding odds that a gene with significance × is a member of this particular gene set. Logistic regression is used to model the log-odds of a gene belonging to the specific gene set as a linear function of the statistical significance -log(P-value). Whether a gene set is enriched or depleted with deferentially expressed genes is tested using the slope parameter in the logistic regression equation.
The P-values of differential expression for genes, generated by the standard t-test, were transformed to -log(P-value)s and used as input for LRpath. In addition to the significance of each gene, these values could also contain information about appearance frequencies. We integrated this information into the LRpath input as described in the methods section.
Comparing results from two studies of lung cancer
To further demonstrate the advantage of applying appearance frequency of genes in gene set analysis, we applied the same methods as above on two lung cancer datasets, which have been analyzed by the original GSEA method  and several other methods . These two datasets were generated independently by research groups in Boston (62 samples) and Michigan (86 samples). Gene expression profiles in tumor samples from patients with lung adenocarcinomas were obtained by microarray. Based on clinical outcomes, the samples were classified as having a "good" or "poor" outcome. The patients with "good" clinical outcome were defined as control, and differentially expressed genes associated with "poor" outcome were identified. The individual gene level analysis did not show any significant overlap between datasets. The application of GSEA was able to detect more similarities and provided biological insights through the identification of overlapping gene sets.
Current statistical approaches are able to identify significantly differentially expressed genes individually, but knowing a list of significant genes is not sufficient to make conclusions about the underlying biological processes. Development of gene set analysis methods has made the interpretation of microarray results at a systems biology level much easier. Instead of focusing on individual genes, researchers can focus on gene sets, which give more reproducible and interpretable results. Without requiring an arbitrary cutoff of significance, the changes of all genes in an experiment can be considered.
Although the construction of gene sets is based on biological knowledge, most current methods fail to take full advantage of the available resources. Recently, there have been efforts to bring more pathway-specific information into the analysis of microarray data. Draghici et al  reported a novel impact analysis method, in which they tried to capture a number of additional factors that may affect the analysis results. Additional information from KEGG PATHWAY, such as the position of the differentially expressed genes in the pathways, the topology of the pathway that describes gene relations and the types of interactions among them, were taken into account in the analysis of differentially expressed genes. The latest publication from the same group further refined the original model to reduce the false positive rate for short lists of differentially expressed genes and to make the model less sensitive to noise in the expression data . The impact analysis is a good combination of a traditional statistical approach with biology knowledge. However there is no discussion of appearance frequency.
In this paper, we provided the motivation for and proposed a strategy to integrate more biologically meaningful information into gene set analysis. First, the appearance frequency of each gene in KEGG pathways was determined. Further analysis showed that the appearance frequency of genes has potential relations to the property of the gene. Greater appearance frequency indicates less specificity to a particular gene set, and therefore more likely to create a false positive result. Fewer appearances indicate the function of the gene is important to a particular biological process. The significant changes of low appearance genes are more likely to be highly related to the perturbation of the corresponding pathway. We adopted a classical approach in information retrieval to determine the weight to be used for adjustment of the methods. The performance of each method was assessed using two independent breast cancer data sets and two independent lung cancer datasets. The observed concordance of the results was improved with the integration of appearance frequency of genes. Compared with original GSEA and LRpath, the updated versions give more reproducible and biologically meaningful results. The successful integration of the appearance frequency of genes into GSEA and LRpath also suggests the potential to apply this information to other gene set analysis methods.
Determination of the exponent value p in GSEA-AF
where idf i defined as idf i = log(195/f i ), f i is gene appearance frequency, i = 1,..., n.
This resulted in the exponent p ranging from 0.73 to 2. When the appearance frequency is 4, the exponent p approximately equals 1. Low appearance frequency genes get an exponent p less than 1, while high appearance genes get an exponent p greater than 1, which leads to a smaller weight. In additional validations, using more (or less) extreme ranges of the exponent p did not continue to improve results, as assessed by correlation between the breast cancer datasets and lung cancer datasets (additional file 1).
Generation of inverse appearance frequency
The purpose of using inverse appearance frequency and random appearance frequency is to make sure the positive effect of appearance frequency on GSEA and LRpath is specific, rather than an artifact of the method in general. In GSEA-RF, appearance frequencies were assigned to each gene randomly, while the proportions of genes with each appearance frequency were kept comparable to the original distribution. In GSEA-IW and LRpath-IW, higher weights were put on genes that are involved in multiple pathways.
Our method for running the IW tests preserves the true distribution of numbers of pathways to which genes belong. We generated an inverse appearance frequency for each gene in KEGG pathways based on their actual appearance frequency. For example if one gene has appearance frequency of 1, a random appearance frequency between 20-30 is taken as its inverse appearance frequency. If one gene has appearance frequency greater than 20, the inverse appearance frequency of this gene is set to 1. This way the true distribution of appearance frequency of all genes in KEGG pathways are approximately preserved, and the adjustment on the exponent p in IW tests is also closer to the scale used for GSEA-AF and LRpath-AF. Ten sets of inverse appearance frequency were generated and used as f i in equation for determine exponent p in IW tests. The mean and the standard error of the mean of the correlations in IW tests were calculated.
Application of appearance frequency in LRpath
Appearance frequency can be applied to different significance measures in LRpath. In the original LRpath, the statistical significance of each gene is represented by its P-value. We applied the exponent p on the -log(P-value), since LRpath performs logistic regression on this P-value transformation and it preserves the information about the relative significance levels in the P-values. After the -log(P-value) transformation, genes with lower appearance frequency should have greater -log(P-value) to enhance their significance, while genes with higher appearance frequency should move down in significance, corresponding to a lower -log(P-value). We again adopted the idea of TF-IDF described above to determine the parameter used for adjusting the -log(P-values). Finally, the appearance frequency adjusted -log(P-values) were passed into LRpath for analysis.
All authors were supported in part by the National Library of Medicine under grant LM010138-01 and by the National Institutes of Health under Grant #U54DA021519.
- Song S, Black MA: Microarray-based gene set analysis: a comparison of current methods. BMC Bioinformatics 2008, 9: 502. 10.1186/1471-2105-9-502PubMed CentralView ArticlePubMedGoogle Scholar
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34(3):267–273. 10.1038/ng1180View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 2007, 8: 242. 10.1186/1471-2105-8-242PubMed CentralView ArticlePubMedGoogle Scholar
- Sartor MA, Leikauf GD, Medvedovic M: LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 2009, 25(2):211–217. 10.1093/bioinformatics/btn592PubMed CentralView ArticlePubMedGoogle Scholar
- Luo W, Friedman MS, Shedden K, Hankenson KD, Woolf PJ: GAGE: generally applicable gene set enrichment for pathway analysis. BMC Bioinformatics 2009, 10: 161. 10.1186/1471-2105-10-161PubMed CentralView ArticlePubMedGoogle Scholar
- Newton MA, Quintana FA, Boon JAd, Sengupta S, Ahlquist P: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann of Applied Stat 2007, 1(1):85–106. 10.1214/07-AOAS104View ArticleGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27(1):29–34. 10.1093/nar/27.1.29PubMed CentralView ArticlePubMedGoogle Scholar
- Huang da W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4(1):44–57. 10.1038/nprot.2008.211View ArticlePubMedGoogle Scholar
- Ma XM, Blenis J: Molecular mechanisms of mTOR-mediated translational control. Nat Rev Mol Cell Biol 2009, 10(5):307–318. 10.1038/nrm2672View ArticlePubMedGoogle Scholar
- Rennstam K, Hedenfalk I: High-throughput genomic technology in research and clinical management of breast cancer. Molecular signatures of progression from benign epithelium to metastatic breast cancer. Breast Cancer Res 2006, 8(4):213. 10.1186/bcr1528PubMed CentralView ArticlePubMedGoogle Scholar
- Putignani L, Raffa S, Pescosolido R, Aimati L, Signore F, Torrisi MR, Grammatico P: Alteration of expression levels of the oxidative phosphorylation system (OXPHOS) in breast cancer cell mitochondria. Breast Cancer Res Treat 2008, 110(3):439–452. 10.1007/s10549-007-9738-xView ArticlePubMedGoogle Scholar
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 2005, 102(38):13550–13555. 10.1073/pnas.0506230102PubMed CentralView ArticlePubMedGoogle Scholar
- Adams J, Carder PJ, Downey S, Forbes MA, MacLennan K, Allgar V, Kaufman S, Hallam S, Bicknell R, Walker JJ, Cairnduff F, Selby PJ, Perren TJ, Lansdown M, Banks RE: Vascular endothelial growth factor (VEGF) in breast cancer: comparison of plasma, serum, and tissue VEGF and microvessel density and effects of tamoxifen. Cancer Res 2000, 60(11):2898–2905.PubMedGoogle Scholar
- Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome Res 2007, 17(10):1537–1545. 10.1101/gr.6202607PubMed CentralView ArticlePubMedGoogle Scholar
- Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R: A novel signaling pathway impact analysis. Bioinformatics 2009, 25(1):75–82. 10.1093/bioinformatics/btn577PubMed CentralView ArticlePubMedGoogle Scholar
- Salton G, Buckley C: Term-weighting approaches in automatic text retrieval. Information Processing and Management 1988, 24(5):513–523. 10.1016/0306-4573(88)90021-0View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.