- Methodology article
- Open Access
Testing gene set enrichment for subset of genes: Sub-GSE
© Yan and Sun; licensee BioMed Central Ltd. 2008
- Received: 15 April 2008
- Accepted: 02 September 2008
- Published: 02 September 2008
Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set may be related to the phenotypes of interest.
In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets assiated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are associated with the phenotypes. Furthermore, the application of Sub-GSE to two real data sets demonstrates that it can detect more biologically meaningful gene sets than GSEA.
We developed a new method to measure the gene set enrichment. Applications to two simulated datasets and two real datasets show that this method is sensitive to the associations between gene sets and phenotype. The program Sub-GSE can be downloaded from http://www-rcf.usc.edu/~fsun.
- Phenotypic Data
- Association Strength
- Strict Subset
- Cytogenetic Band
- Rank Gene List
Genome-wide gene expression profiling using microarray technologies has been ubiquitously used in biological research. An important problem is to identify gene sets that are significantly changed under a certain treatment (for example, two different cell lines or tissues or the same cell line under different conditions). A gene set is basically a group of genes with related functions, e.g., genes in a biological process or in the same complex. There are a variety of ways by which genes, and, ultimately, gene sets may be defined. For example, gene sets can be defined according to the information provided by several databases, such as GeneOntology , KEGG , Biocarta http://www.biocarta.com, and Pfam . Gene sets may also be defined by cytogenetic bands, by region of genomic sequence or by establishing the functional relationships among them. Importantly, by using a gene set-based approach, a high power can potentially be achieved for detecting differentially expressed gene sets by integrating expression changes of genes inside the same gene set, even when the expression changes of individual genes are modest.
Moreover, because the gene sets have already been annotated by their common functions in the databases, the biological interpretation for a given list of significant gene sets will also be clear. At least one study  showed that using such gene set-based approaches did increase the congruence of the identified gene sets between different data sets addressing the same biological problem. To detect differentially expressed gene sets, several methods have been proposed, which can be roughly categorized into three groups.
The first group identifies a list of significant differentially expressed genes (DEGs) using individual gene analysis methods, and then examines the enrichment of gene sets within this gene list using different statistical tests, such as the binomial test, Fisher's exact test, or the hypergeometric test [5–11]. Khatri and Draghici  compared fourteen different methods within this group. Each of these methods is easy to implement, but flawed by 1) sensitivity to the cutoff value for defining the list of significant DEGs, 2) non-consideration of the relative position of genes inside the significant DEG list, and 3) assumption of independence between the genes, which may make the resulting p-value misleading.
The second group of methods does not depend on the predefined DEG list. Instead, these methods calculate a gene-specific statistic, known as the "local" statistic, which measures the strength of association between the gene expression and the phenotype for each gene. A "global" statistic for a gene set is then constructed as a function of the local statistic for each gene in it. The significance of the global statistic is assessed by permutation test, and different methods arrive at this assessment in different ways [13–21]. In contrast to calculating a gene-specific statistic, the third group of methods directly combines the expression levels of all genes in the gene sets and they are represented as gene set-specific features. These features are then compared between the treatment and the control groups to identify significantly affected gene sets [22–27]. Some methods also integrated the interaction information between genes in the gene sets [28–31].
The available methods generally tested the association of all the genes in a gene set with the phenotype. In reality, however, it is more likely that only genes in a subset of the gene set of interest are associated with the phenotype. Three possible factors may explain this. First, since the function annotation defined in the available databases, such as GO, KEGG, and Biocarta, are incomplete, or even erroneous, some of the genes in a gene set of interest may not truly belong to the set. Second, the gene sets are sometimes defined according to the genomic regions of the genes. Thus, the expression of the genes in the same set may not coordinate with each other. For example, although a group of gene sets has been defined by the cytogenetic bands on human chromosomes , the expression of genes on the same cytogenetic bands do not necessarily correlate with each other. The result is that only a subset of genes in a given cytogenetic band will then be correlated with a phenotype. Third, even if all the genes in the gene set have the same function, or belong to the same complex, it is possible that only genes in one branch of the pathway are associated with the phenotype. The cumulative effect of these considerations strongly suggests that the currently available methods for gene set enrichment analysis may not be powerful enough to detect the association of a given gene set with a phenotype, particularly in the case where only a subset of the genes is associated with the phenotype.
In this paper, therefore, we extend a set-association strategy for genetic polymorphism association studies developed by  to a set-enrichment analysis. In so doing, we want to test the null hypothesis that no subsets of genes in the gene set are associated with the phenotype. We refer to the resulting method as Gene Set Enrichment by testing Subset association, or Sub-GSE. Using two simulated data sets, we first show that Sub-GSE has higher sensitivity in identifying gene sets associated with a phenotype compared to GSEA, SigPath, and GSA, when only a fraction of the genes are associated with the phenotype. Next, we apply Sub-GSE to two real data sets. One involves gene expression data related to gender and the other identified functional gene sets related to p53 mutation status. For the first dataset, Sub-GSE identified cytogenetic bands Xp22 as significantly associated with gender, while GSEA failed to detect this association. For the second dataset, Sub-GSE identified several novel functional gene sets, including DNA damage genes, cell cycle checkpoints genes and programmed cell death genes that are associated with p53 mutation status and that were also not detected by GSEA. Overall, this method provides a complementary approach for identifying gene sets associated with a phenotype, especially when only a subset of genes in a gene set is associated with the phenotype.
In this section, we first give a brief overview of our method. Second, we apply Sub-GSE, GSEA, SigPathway, and GSA to two simulated data sets and compare their performance. Third, we apply Sub-GSE to two real datasets: one related to gender and the other related to p53 mutation status. We also show some new biological findings related to the two real data sets using Sub-GSE.
Outline of the method
To assess the enrichment of a given gene set, we construct a statistical hypothesis testing model. The null hypothesis is that no subsets of genes in the given gene set are associated with the phenotype.
Defining the "strict subsets"
Given the fact that the number of all subsets of a gene set increases exponentially with the number of genes in the gene set, it is impractical and less powerful to test every subset of the gene set. Therefore, we define the "strict subsets" to only those subsets that are more likely to be related with the phenotype. To define the "strict subsets" of a gene set, we first calculate the association strength between the gene expression and the phenotype for each gene in the gene set. Depending on the measurement levels of the phenotypic data, we calculate the absolute t-statistics for comparing the mean gene expression levels for binary phenotypic data, Kruskal-Wallis statistics for comparing the mean expression levels of different groups for discrete phenotypic data, and the absolute Pearson correlation coefficient between the gene expression levels and the phenotype for continuous data. All genes are sorted in decreasing order of the association strength measures. The "strict subsets" are defined to include genes up to each position from the top to the bottom in the ranked gene list, which are most strongly associated with the phenotype. That is to say, for each position in the ranked gene list, we define a strict subset that includes all the genes that are ranked higher than this position. Thus, if there are n genes in the gene set, there will be n "strict subsets" among which the i-th subset contains the top i genes in the ranked gene list according to the association strengths. In this way, the number of subsets to be tested increases linearly with the number of genes in the gene set. Since the strict subsets includes the genes that are most associated with the phenotype, we expect them to be more probable to be related with the phenotype. The strict subsets are defined to be contiguous to include as many as possible subsets that are expected to be more likely related with the phenotype. However, the method we propose here cannot detect the gene sets in which individual genes are not associated with the phenotype but they can interact with each other to affect the phenotype. To overcome the problem that the "strict subsets" contain too few genes, we add a tuning parameter to control the sizes of the "strict subsets". Throughout the paper, we set this tuning parameter to be 5 which means the "strict subsets" are required to contain at least 5 genes. The cutoff for the set size of the strict subsets is set to be 5 so that the method is not too sensitive to detect gene set which has only one gene strongly correlated with the phenotype. There are other ways of deciding the cutoff of the set size as discussed in the Conclusions section.
The hypothesis testing statistic is calculated in three steps. First, for each "strict subset", we calculate the average association strength across all member genes, which is also called the local set association statistic T. Second, the statistical significance (raw p-value) of the local set association statistic T for each "strict subset" is calculated by permuting the phenotypes of the individuals. Finally, the minimum raw p-value among all the "strict subsets" is evaluated and taken as the hypothesis testing statistic. If there is any strict subset related with the phenotype, the minimum p-value will be significantly small.
To assess the significance of the minimum p value, nested permutation is needed since we do not know the distribution of T under the null hypothesis However, nested permutation is computation intensive. Fortunately, previous work  has shown that a single set of permutation is sufficient to accomplish the significance assessment. For the permutation, we decide to permute the phenotypic data and keep the gene expression data intact due to the criticism on gene-based permutation which assumes the independence between genes .
The phenotypic data is permuted for N times. After each permutation, the "strict subsets" are re-defined according to the newly calculated association strengths using the permuted phenotypic data. The strict subsets are defined in the same way as we did with the observed data including the threshold of set size.
The only difference is that the phenotypic data is changed. By comparing the set association statistic T from the observed data and those from the permuted data, raw p values can be calculated for all the observed "strict subsets" and thus the observed P min is obtained. To estimate the distribution of P min under the null hypothesis, as classic permutation does, we replace the observed phenotypic data with every permuted phenotypic data and compare the set association statistic with those from all the other N - 1 permutations. In other words, we repeat exactly the same procedure to obtain the minimum raw p values for every permuted data. Finally, the significance of the gene set will be the percentage of permutations that result in minimum raw p values smaller than the observed P min . If there are more than one given gene set, multiple testing correction can be done using any multiple testing correction method. In this paper, we use the QVALUE R package  to calculate the q-values for the two biological data sets so that the results by Sub-GSE are comparable with other gene set enrichment analysis methods.
Generate 100 gene sets whose sizes are 5,6,7,8,...,104. The total number of genes is 5450;
The gene expression levels in 100 samples for each gene are generated from a standard normal distribution. Different genes are independent of each other. Different samples are also independent of each other;
Generate the phenotypic data from another independent standard normal distribution in 100 samples;
Repeat steps 1–3 for 100 times;
In total, the simulation generates 100 data sets that have gene expression data and a corresponding phenotypic data. We apply Sub-GSE to the 100 data sets separately.
We first evaluate the performance of Sub-GSE using simulated data in which gene expression profiles with different correlations within the gene set are generated. The expression profiles for 1000 genes in 100 samples are simulated. The genes are divided into 50 non-overlapping gene sets with 20 genes in each. The gene expression profiles for the 100 samples represent 100 independent vectors of random variables generated from a multivariate normal distribution. The multivariate normal distribution has 1000 dimensions corresponding to the 1000 genes. The mean is a vector of 1000 zeroes, and the variance of the expression levels of each gene is 1. To simulate the dependence between genes, we randomly select a certain percentage of correlated genes (PCG) = (0%, 10%, ⋯, 90%) in each gene set and let the correlation coefficient between any two of them be ρ = 0, 0.1, 0.2, ⋯, 0.9. The remaining genes are independent of each other and those that are chosen. We use this simulation strategy based on the following considerations. The chosen genes in the gene set correspond to those in the same complex or pathway; thus, their expression profiles are correlated. Also, since the remaining genes represent those not belonging to the group, they are more likely to be independent of the chosen genes and each other.
If a given gene is among those that are chosen, we use its expression levels as the phenotype. The rationale of this step is to determine if our Sub-GSE method can identify the gene set to which this particular gene belongs. We repeat this process for all the chosen genes. Thus, we have a total of 1000 × PCG different phenotypic data. To avoid the problem where a gene has exactly the same expression profile as the phenotype, we eliminate the gene's expression profile from the expression data if it is used as the phenotype.
First, as seen in the left panel in Figure 4, for small PCG = 10%, 20% and 30%, the average rank of the gene set related to the phenotype based on Sub-GSE is always the lowest, irrespective of the coefficient value. On the other hand, for large PCG, the performance of Sub-GSE is similar or slightly worse than GSEA and GSA for small correlation. The right panel in Figure 4 confirms this because the average rank of the gene set related to the phenotype based on Sub-GSE decreases much faster than those for the other methods when PCG is small. The results of Figure 4 can be explained as follows. When PCG is low, only a small fraction of the genes in the target gene set are correlated with the phenotype. GSEA, GSA, and SigPath cannot distinguish the target gene set from the other gene sets since these methods consider all the genes in the gene set of interest in their statistics. In contrast, Sub-GSE incrementally tests each strict set and chooses the smallest p-value across all the strict sets as a test statistic, thus making the test more powerful.
Second, across different combinations of PCG and correlation coefficient, we find that GSA and GSEA achieve similar results. Both GSA and GSEA use t-statistics to obtain the ranking list of genes. For applications in this article, the only diference between them is that GSA restandardizes the statistics before the permutation to reduce the effect of correlation between genes. However, in both panels of Figure 4, the average ranks of the target gene set by GSEA and GSA are quite similar, especially when the PCG is high. Consequently, restandardization in GSA does not seem to be very efficient in this simulation study, especially when there are many correlated genes.
Third, to show the sensitivity and specificity of Sub-GSE, we need a group of gene sets that are related with the phenotype. Therefore, we do another set of simulations similar as simulation 1. The detailed descriptions of the simulation and the resulting ROC curves can be found in the supplementary materials [see Additional file 1]. The results show that the higher the PCG and the correlation coefficient are, the higher the AUC score is. Once the correlation coefficient is higher than 0.4, the AUC score is higher than 0.85 no matter what the PCG is. When PCG is higher than 0.5, the AUC score can be higher than 0.75 regardless of the correlation coefficient.
Simulate the expression profiles of the 1000 genes as in the first simulation for 100 individuals;
- 2.Choose two gene sets K1 and K2 from the 50 gene sets. Let SK1 and SK2 be the correlated genes in K1 and K2, respectively. Define the phenotype for the j-th individual as
Analyze the data using GSEA, SigPath, GSA, and Sub-GSE to rank the gene sets. Rank all the gene sets in increasing order of their q-values.
Repeat steps 1–3 100 times to assess the performance of the different analytic methods by the effects of the different gene expression data.
As shown in Figure 6, the average ranks of the target gene sets based on all the methods are relatively high. This could result from the involvement of two different gene sets when simulating the phenotypic data and the fact that the phenotypic data are the sum of the squared expression levels of correlated genes. Another potential complicating factor is that the phenotype includes a noise in addition to the function of the gene expression levels of the component genes. All these facts can weaken the correlation between the phenotypic data and the gene expression profile of individual genes inside the true gene sets. Despite these problems, Sub-GSE performs relatively well compared to the other three methods, especially when PCG is low. When PCG is high, the performances of Sub-GSE are close to those of GSEA and GSA since both GSEA and GSA consider all the genes inside the gene sets. Again, the performances of GSEA and GSA are similar.
Male Vs. Female Lymphoblastoid Cells
We also apply Sub-GSE to two real data sets from . The first data set measured the mRNA expression profiles from lymphoblastoid cells derived from 15 males and 17 females using Affymetrix U133A chip. The gender of the individuals represents the corresponding phenotypic data. The gene sets are chosen as the cytogenetic sets (C1, 319 gene sets) and the functional gene sets (C2, 522 gene sets) defined in . The cytogenetic sets contain 24 gene sets, one for each of the 24 human chromosomes, and 295 gene sets corresponding to cytogenetic bands along the chromosomes. The functional sets include 472 gene sets containing genes whose products are involved in specific metabolic and signaling pathways, as reported in eight publicly available, manually curated databases, and 50 gene sets containing genes co-expressed in response to genetic and chemical perturbations, as reported in various experimental studies (see supporting text in  for details). We apply Sub-GSE, GSEA, and SigPath to these two types of gene sets independently with the objective of identifying the cytogenetic regions that are differentially expressed between males and females and the functional gene sets related to sex distinction, respectively.
Comparison of the top 7 cytogenetic bands related to gender detected by different methods.
Comparison of the significant functional gene sets related to gender by Sub-GSE and GSEA.
Testis related genes
Genes that escape X inactivation
Female reproductive tissue expressed genes
P53 Status in Cancer Cell Lines
Significant functional gene sets related to p53 mutational status by Sub-GSE.
Hypoxia and p53 in the Cardiovascular system
G1 and S Phases of the Cell Cycle
p53 Signaling Pathway (p53 Pathway)
TrkA Signaling Pathway
P53 Upregulated Genes
p53 Signaling Pathway Genes(p53_signalling)
Cell Cycle: G2/M Checkpoint
G2 and M Phases of the Cell Cycle
Programmed Cell Death
DNA Damage Signaling Pathway
Radiation Sensitivity Genes
Cell Cycle Regulator Genes
ATM Signaling Pathway
Ceramide Signaling Pathway
Drug Resistance and Metabolism Genes
Stress Induction of HSP Regulation
Multi-step Regulation of Transcription by Pitx2
Fas Signaling Pathway
The first group includes gene sets that are directly regulated or affected by p53, including the "p53 signaling pathway", "p53 signaling pathway genes", and "p53 upregulated genes". This group of gene sets was detected by both Sub-GSE and GSEA  (FDR < 0.015).
The second group contains gene sets that are "downstream" of p53. These gene sets can either be induced or inhibited by p53 [35–38]. For example, it is well known that p53 induces cell cycle arrest during the G1/S phase and the G2/M phase checkpoint . By itself, p53 can activate an important death receptor, Fas, which triggers the "Fas Signaling Pathway" and thus leads to apoptosis . It is also well known that p53 functions "upstream" of ceramide in response to genotoxic stress .
The third group includes gene sets related to the "upstream" biological processes or genes for p53. These "upstream" biological processes, such as DNA damage caused by radiation or chemical carcinogens, for example, pass the DNA damage signal down to p53 and further induce some of the "downstream" pathways. Two genes, TrkA and Pitx2, are known to affect apoptosis through regulation of p53 [39, 40]. The gene sets related to these "upstream" biological processes actually include genes related to those "downstream" biological processes in the second group.
In this dataset, Sub-GSE not only detects the gene sets identified by GSEA , but also detects more novel gene sets related to p53. Previous studies from the literature support the findings in that all the significant gene sets identified by Sub-GSE are related to p53, as shown in Figure 7.
To summarize, we have developed a method, called Sub-GSE, to identify gene sets that are associated with a phenotype by testing the association between the strict subsets of genes and the phenotype. In many applications, it is very likely that only a subset of genes in a gene set of interest is associated with the phenotype. However, since currently available methods for gene set enrichment analysis usually test the association of all the genes in a gene set with the phenotype, the power of these methods is correspondingly reduced. In contrast, Sub-GSE is based on the idea of set-association approach first proposed by  and it incrementally tests the association of "strict subsets" with the phenotype. The strict subsets contain the genes having the top association strength of individual genes with the phenotype. We first study the performance of Sub-GSE and compare it with three widely used methods for gene set enrichment analysis: GSEA, GSA, and SigPath. Our simulations show that Sub-GSE outperforms GSEA, GSA, and SigPath in prioritizing gene sets associated with a phenotype when the fraction of genes associated with the phenotype is relatively small. On the other hand, these four methods all achieve similar results when the fraction of associated genes is large. When applied to two real data sets, Sub-GSE is shown to detect more biologically meaningful gene sets than GSEA. For example, Sub-GSE identified cytogenetic band Xp22 as significantly associated with gender (q-value < 0.20), while neither GSEA nor SigPath identified them as significant at a FDR < 0.20. Similarly, Sub-GSE identified many gene sets including, for instance, DNA damage genes, cell cycle checkpoints genes and programmed cell death genes, as significantly associated with p53 mutation status. These were not identified by GSEA. This evidence supports the high sensitivity of Sub-GSE. Most of the detected gene sets have supports from previous studies for the association between them and the p53 mutations.
Usually a large number of sets will be detected as significant for most tests of gene enrichment analysis. Since Sub-GSE is more sensitive in detecting significant gene sets than other tests for gene set enrichment analysis, we expect that many more gene sets will be identified as significant. This may reflect biological reality instead of statistical artifacts. For example, cancer can affect a large number of genes and gene categories. By studying the GO relationship among the significant gene sets, the more specific significant GO categories may represent the real underlying affected function categories.
The advantages of Sub-GSE over other approaches for testing gene set enrichment are most evident when only a fraction of the genes in the gene set of interest are associated with the phenotype. If we believe that most genes in a gene set of interest are associated with the phenotype, other approaches, including GSEA, GSA, and SigPath, may perform better than Sub-GSE. Under this scenario, the use of the minimum p-value across all the strict subsets as a test statistic, which is done in Sub-GSE, would result in the introduction of more noise. It is possible that the minimal p-value may be achieved for some subsets of the gene set of interest, making Sub-GSE less powerful. The results of our simulations are consistent with this observation. On the other hand, our simulations also showed that the performance of Sub-GSE is only marginally worse than the other approaches under the conditions noted above. We do not claim that Sub-GSE is always better than GSEA, GSA, or SigPath. Instead, Sub-GSE complements other approaches for gene set enrichment analysis when the fraction of associated genes is relatively small.
The speed of Sub-GSE is determined by the number of gene sets and the number of genes inside each gene set. To give an example of the running time, we download the gene expression data with accession number GSE5081 from NCBI http://www.ncbi.nlm.nih.gov/ which hybridized total RNAs from gastric biopsy specimens of patients with Helicobacter pylori positive (HP+) and Helicobacter pylori negative (HP-) antrum erosions (ER+), and the corresponding, adjacent normal mucosae (ER-). The gene expression data includes 54675 probes and 32 samples. HP+ and HP- are treated as the phenotype. Mappings between the probes and GO categories are from the R package http://www.r-project.org/ named "hgu133plus2". All the probes are mapped to 8310 GO categories in total. We run Sub-GSE on this data set using a computer with Pentium 4 CPU 3.60 GHz/3.59 GHz, 1.00 GB of RAM. It took 12.7 hours when 1000 permutations are required and the strict set size threshold is 1.
Usually we need to simultaneously test the association of a large number of gene sets with the phenotype. For each gene set, we can use Sub-GSE to test the association of the gene set with the phenotype to obtain a p-value. We have shown in the "Results" section that the p-value is uniformly distributed under the null hypothesis that no subset is associated with the phenotype. When we test for a large number of gene sets, the issue of multiple testing is of concern. To solve this problem, conventional methods such as Bonferroni correction can be used. However, Bonferroni correction is too conservative in most situations. Another currently widely used method dealing with multiple testing is to control false discovery rate (FDR) as implemented in the software package QVALUE . For the QVALUE package to work well, the p-values for all the gene sets need to be weakly dependent. When the sizes of the gene sets are relatively small compared to the total number of genes, we expect that the p-values to be weakly dependent since the genes usually form modules and genes from different modules are more likely to be independent. When these assumptions are in doubt, we can use the p-values obtained from Sub-GSE to indicate the statistical significance of the gene sets.
There are two options in Sub-GSE: the minimal size for the strict sets and the statistic to measure the association strength between gene expression profiles and the phenotype. We set the tuning parameter c to be the minimal size of the strict subsets on which to test. Parameter c can control the sensitivity and the specificity of Sub-GSE, thus having a significant effect on its performance. Generally, the sensitivity of Sub-GSE decreases and the specificity increases as c increases. Therefore, the choice of c should depend on the balance between sensitivity and specificity. Although we set c = 5 in this paper, which restricts the minimal size for the strict sets, we can, instead, require that the minimal size of the strict sets depend on the size of the gene set of interest. For example, one could consider the subsets of genes that cover at least 10% of the given genes inside each gene set. Since the minimal set size of the subset may be different for different gene sets, the effects of this type of restriction need to be further studied. The other Sub-GSE option involves the statistic used to measure the association strength between gene expression profiles and the phenotype. In this paper, we use t-statistics, Kruskal-Wallis statistics, and Pearson's correlation to evaluate the association strength between the gene expression profiles and discrete, categorical, and quantitative phenotypes, respectively. Other statistics can also be applied. The power of Sub-GSE to detect enriched gene sets for different types of statistics also needs to be further studied.
It is well known that genes in the same pathway or complex tend to be correlated. A natural question is whether it is better first to do principal component analysis (PCA) and then apply Sub-GSE to the principal components. We implemented this idea and found the approach less powerful than the method implemented in this paper. A potential explanation is that the expression profiles of the genes in the gene sets among the cases and controls do not satisfy the normality assumption making the PCA approach less powerful. More studies are needed to see under what conditions the combination of PCA and Sub-GSE is more powerful than Sub-GSE alone.
Association strength measures for individual genes
For a given gene, suppose the gene expression levels measured in the experiment are (e1, e2, ⋯, e m ), where m is the number of samples. The corresponding phenotypic data for the m samples are denoted as C = (c1, c2,⋯, c m ). Depending on the measurement levels of C, we measure the association strength between the gene expression and phenotype by the absolute value of t-statistic, Kruskal-Wallis statistic or Pearson correlation coefficient for binary, discrete or continuous phenotypic data, respectively.
Local association statistic for each strict subset and the global statistic for the gene set
where c is the minimum number of genes in the strict sets. Here we take the observed data as the permutation 0.
The testing statistic for the gene set is .
Multiple testing correction for multiple gene sets
In a typical situation, there will be multiple gene sets to be analyzed. After we assess the significance level for each of them according to the procedure described above, q-values of all the given gene sets are calculated using the QVALUE R package  for multiple testing correction.
We thank Dr. Chao Cheng for substantially useful discussion of the method and the results. This work was partly supported by National Institutes of Health (NIH)/National Science Foundation Joint Mathematical Biology Initiative grant DMS-0241102 and NIH grants P50 HG 002790.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 1999, 27: 29–34. 10.1093/nar/27.1.29PubMed CentralView ArticlePubMedGoogle Scholar
- Sonnhammer EL, Eddy SR, R D: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-LView ArticlePubMedGoogle Scholar
- Kim S, Volsky DJ: PAGE: Parametric Analysis of Gene Set Enrichment. BMC Bioinformatics 2005, 6: 144. 10.1186/1471-2105-6-144PubMed CentralView ArticlePubMedGoogle Scholar
- Berriz GF, King OD, Bryant B, Sander C, P RF: Characterizing gene sets with FuncAssociate. Bioinformatics 2003, 19: 2502–2504. 10.1093/bioinformatics/btg363View ArticlePubMedGoogle Scholar
- Hosack DA, Dennis GJ, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biology 2003, 4: R70. 10.1186/gb-2003-4-10-r70PubMed CentralView ArticlePubMedGoogle Scholar
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology 2003, 4: R7. 10.1186/gb-2003-4-1-r7PubMed CentralView ArticlePubMedGoogle Scholar
- Kim CC, Falkow S: Significance analysis of lexical bias in microarray data. Genome Biology 2003, 4: 12. 10.1186/gb-2003-4-2-r12View ArticleGoogle Scholar
- Drǎghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2002, 81: 98–104. 10.1016/S0888-7543(02)00021-6View ArticleGoogle Scholar
- Al-Shahrour F, Díaz-Uriarte R, Dopazo J: FatiGo: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20: 578–580. 10.1093/bioinformatics/btg455View ArticlePubMedGoogle Scholar
- Beißbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20: 1464–1465. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
- Khatri P, Drǎghici S: Ontologcal analysis of gene expression data: current tools, limitations and open problems. Bioinformatics 2005, 21: 3587–3595. 10.1093/bioinformatics/bti565PubMed CentralView ArticlePubMedGoogle Scholar
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1 α -responsive genes involved in oxidative phosphorylaton are coordinately downregulated in human diabetes. Nature Genetics 2003, 34: 267–273. 10.1038/ng1180View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2005, 102: 15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America 2005, 102: 13544–13549. 10.1073/pnas.0506577102PubMed CentralView ArticlePubMedGoogle Scholar
- Goeman JJ, Bühlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23: 980–987. 10.1093/bioinformatics/btm051View ArticlePubMedGoogle Scholar
- Efron B, Tibshirani R: On testing the significance of sets of genes. The Annals of Applied Statistics 2007, 1: 107–129. 10.1214/07-AOAS101View ArticleGoogle Scholar
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the Gene Ontology for Microarray Data Mining: A Comparison of MEthods and Application to Age Effects in Human Prefrontal Cortex. Neurochemical Research 2004, 29: 1213–1222. 10.1023/B:NERE.0000023608.29741.45View ArticlePubMedGoogle Scholar
- Jiang Z, Gentleman R: Extensions to gene set enrichment. Bioinformatics 2007, 23: 306–313. 10.1093/bioinformatics/btl599View ArticlePubMedGoogle Scholar
- Newton MA, Quintana FA, den Boon JA, Sengupta S, Ahlquist P: Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. The Annals of Applied Statistics 2007, 1: 85–106. 10.1214/07-AOAS104View ArticleGoogle Scholar
- Barry WT, Nobel AB, Wright FA: Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 2005, 21: 1943–1949. 10.1093/bioinformatics/bti260View ArticlePubMedGoogle Scholar
- Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics 2005, 6: 225. 10.1186/1471-2105-6-225PubMed CentralView ArticlePubMedGoogle Scholar
- Goeman JJ, Geer SA, Kort FD, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20: 93–99. 10.1093/bioinformatics/btg382View ArticlePubMedGoogle Scholar
- Goeman JJ, Oosting J, Cleton-Jansen AM, Anninga JK, van Houwelingen HC: Testing association of a pathway with survival using gene expression data. Bioinformatics 2005, 21: 1950–1957. 10.1093/bioinformatics/bti267View ArticlePubMedGoogle Scholar
- Ye C, Eskin E: Discovering tightly regulated and differeitially expressed gene sets in whole genome expression data. Bioinformaitcs 2006, 23: e84-e90. 10.1093/bioinformatics/btl315View ArticleGoogle Scholar
- Wei Z, Li H: Nonparametric pathway-based regression models for analysis of genomic data. Biostatistics 2007, 8: 265–284. 10.1093/biostatistics/kxl007View ArticlePubMedGoogle Scholar
- Levine DM, Haynor DR, Castle JC, Stepaniants SB, Pellegrini M, Mao M, Johnson JM: PAGE: Parametric Analysis of Gene Set Enrichment. Genome Biology 2006, 7: R93. 10.1186/gb-2006-7-10-r93PubMed CentralView ArticlePubMedGoogle Scholar
- Wei Z, Li H: A Markov random field model for network-based analysis of genomic data. Bioinformatics 2007, 23: 1537–1544. 10.1093/bioinformatics/btm129View ArticlePubMedGoogle Scholar
- Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-Based Analysis of Affected Biological Processes in Type 2 Diabetes Models. PLoS Genetics 2007, 3: e96. 10.1371/journal.pgen.0030096PubMed CentralView ArticlePubMedGoogle Scholar
- Rahnenführer J, Domingues FS, Maydt J, Lengauer T: Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Statistical Applications in Genetics and Molecular Biology 2004, 3: 16. 10.2202/1544-6115.1055View ArticleGoogle Scholar
- Nacu c, Critchley-Thorne R, Lee P, Holmes S: Gene expression network analysis and applications to immunology. Bioinformatics 2007, 23: 850–858. 10.1093/bioinformatics/btm019View ArticlePubMedGoogle Scholar
- Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Research 2001, 11: 2115–2119. 10.1101/gr.204001PubMed CentralView ArticlePubMedGoogle Scholar
- Ge Y, Dudoit S, P ST: Resampling-based multiple testing for microarray data analysis. Test 2003, 12: 1–77. 10.1007/BF02595811View ArticleGoogle Scholar
- Storey JD: A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 2002, 64: 479–498. 10.1111/1467-9868.00346View ArticleGoogle Scholar
- Vogelstein B, Lane D, Levine AJ: Surfing the p53 network. Nature 2002, 408: 307–310. 10.1038/35042675View ArticleGoogle Scholar
- Giono LE, Manfredi JJ: The p53 Tumor Suppressor Participates in Multiple Cell Cycle Checkpoints. Journal of Cellular Physiology 2006, 209: 13–20. 10.1002/jcp.20689View ArticlePubMedGoogle Scholar
- Dbaibo GS, Pushkareva MY, Rachid RA, Alter N, Smyth MJ, Obeid LM, Hannun YA: p53-dependent Ceramide Response to Genotoxic Stress. The Journal of Clinical Investigation 1998, 102: 329–339. 10.1172/JCI1180PubMed CentralView ArticlePubMedGoogle Scholar
- Li Y, Raffo AJ, Drew L, Mao Y, Tran A, Petrylak DP, Fine RL: Fas-Mediated Apoptosis Is Dependent on Wild-Type p53 Status in Human Cancer Cells Expressing a Temperature-Sensitive p53 Mutant Alanine-143. Cancer Research 2003, 63: 1527–1533.PubMedGoogle Scholar
- Aloyz RS, Bamji SX, Pozniak CD, Toma JG, Atwal J, Kaplan DR, Miller FD: P53 Is Essential For Developmental Neuron Death as Regulated by the TrkA and p75 Neurotrophin Receptors. The Journal of Cell Biology 1998, 143: 1691–1703. 10.1083/jcb.143.6.1691PubMed CentralView ArticlePubMedGoogle Scholar
- Wei Q: Pitx2a binds to human papillomavirus type 18 E6 protein and inhibits E6-mediated P53 degradation in HeLa cells. The Journal of Biological Chemistry 2005, 280: 37790–37797. 10.1074/jbc.M502974200PubMed CentralView ArticlePubMedGoogle Scholar
- Becker T, Knapp M: A Powerful Strategy to Account for Multiple Testing in the Context of Haplotype Analysis. American Journal of Human Genetics 2004, 75: 561–570. 10.1086/424390PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.