Down-weighting overlapping genes improves gene set analysis
© Tarca et al.; licensee BioMed Central Ltd. 2012
Received: 14 February 2012
Accepted: 18 May 2012
Published: 19 June 2012
The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless how specific they are to a given gene set.
In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method P athway A nalysis with D own-weighting of O verlapping G enes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results.
PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/or http://www.bioconductor.org.
KeywordsGene expression Gene set analysis Pathway analysis Overlapping gene sets
Microarray-based gene expression profiling experiments, which are routine today, allow researchers to identify, for instance, genes differentially expressed (DE) between diseased and normal patient samples or genes that change in expression over time during a treatment. Unfortunately, the steady increase in the amount of data generated in the past decade from such experiments was not paralleled by the evolution of analytical methods used to extract knowledge from such datasets and, therefore, there is a gap between our ability to measure gene expression data and to extract workable knowledge from it.
Since the beginning of the microarray-based expression profiling experiments, researchers were interested in finding common “themes” among the genes identified as differentially expressed between two conditions. For instance the identification of Gene Ontology (GO) terms enriched in differentially expressed genes was used as early as 1999 , but became widespread only after the first on-line GO analysis tools were made available [2, 3]. As biological annotations started to include descriptions of gene interactions in the form of pathways (found in resources such as KEGG , BioCarta http://www.biocarta.com, and Reactome ), the identification of the pathways involved in various conditions has emerged as a ubiquitous bioinformatics task.
In general, biological pathways can be divided into gene signaling pathways, and metabolic pathways. Gene signaling pathways are graphs that use nodes to represent genes, or gene products, and edges to represent signals that go from one gene to another. Metabolic pathways are graphs that use nodes to represent biochemical compounds, and edges to describe biochemical reactions that involve such compounds. Since biochemical reactions are usually carried out by enzymes which are coded for by genes, in a metabolic pathway genes are associated with edges rather than nodes. Ideally, a comprehensive pathway analysis method would be able to take into consideration all aspects of the phenomena described by a pathway. These aspects would include the position and role of each gene in a pathway, the types of signals between genes, the efficiency with which a signal travels from one gene to another, or the efficiency with which a certain reaction is carried out, rate limiting conditions, etc. Such methods have been proposed for both signaling pathways [6–9], and metabolic pathways [8, 10], but no method is currently available to analyze both types of pathways taking into consideration all the information available. Hence, even though they do not use all information available, methods that treat the pathways as simple gene sets are still popular because they can be applied equally well to signaling pathways, metabolic pathways, GO terms, as well as arbitrary sets of genes.
Two of the most popular such methods are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . These methods belong to the functional class scoring category of gene set analysis methods [13, 14]. For a simple two group experiment (e.g. disease vs. normal), both GSEA and GSA start with computing a t-statistic for each gene measured on the array. Then, a score is computed for each gene set using the t-scores of all genes in the gene set. The significance of the gene set scores is determined by using permutations of the samples. Both approaches treat the genes in the gene set equally.
In this work, we propose the P athway A nalysis with D own-weighting of O verlapping G enes (PADOG) which is a general gene set analysis method. The method gives more weight to genes that are gene set-specific, than to genes which can be found in multiple gene sets. This is similar to the approach commonly used in information retrieval (e.g. web search engines) that decreases the importance of words that appear in many documents (e.g. “and”, “or”, etc.) in favor of words that are highly specific to given documents, the latter type being considered to carry more information about the informational content of the document. Similarly, in our approach, if the differential expression affects genes that are highly specific to a given pathway (e.g. huntingtin to Hungtington’s disease), it is more likely that the respective pathway is truly relevant in that condition.
The process of down-weighting popular genes does not affect one’s ability to find a gene set to be significant whenever the gene set is composed mostly of ubiquitous genes, but rather increase the contrast between gene sets that overlap by reducing the contribution of the overlapping genes into the gene set scores. As a simple example, with PADOG, a gene set A having 20 out of 50 genes differentially expressed, that appear only in gene set A, will be found more significant than another gene set B of same size that has also 20 differentially expressed genes but which appear in other gene sets as well. Both GSEA and GSA would find the two gene sets equally significant.
Analysis methods that do not treat all genes equally were previously proposed for pathway analysis in an over-representation context [6, 7], or in a functional class scoring context , yet none specifically exploit the frequency of occurrence of genes across the pathways. Moreover, unlike GSA, PADOG does not rely on ordinary t-scores to derive gene set scores but uses moderated t-statistics  instead. A similar idea to use non-ordinary t-scores in the gene set scores computation was illustrated first in  by using SAM statistics  in conjunction with GSEA. Moreover, unlike GSA, PADOG summarizes the gene scores into a gene set score using the mean of absolute values instead of the maxmean statistic.
The 24 data sets used to assess the proposed gene set analysis method
Brain, Entorhinal Cortex
Brain, Primary visual cortex
Postmortem brain putamen
Acute myeloid leukemia
Blood, Bone marrow
Non-Small Cell Lung Cancer
Non-Small Cell Lung Cancer
The two methods we chose to compare PADOG against are the Gene Set Enrichment Analysis (GSEA)  and the Gene Set Analysis (GSA) . Briefly, GSEA works as follows. Let GS i denote the i th gene set, where i=1.. N GS . For each gene j on the array, GSEA computes a t-statistic z j for the differential expression of the gene between the disease group and the control group. A gene set score S(GS i ) is computed similar to a signed version of the Kolmogorov-Smirnov statistic between the values z j j∈G S i and their complement (genes measured on the array but not belonging to the gene set). The class labels of the arrays are permuted and the significance of the gene set score is assessed by determining the null distribution of the gene set score.
where the (+) and (-) signs identify the positive and negative t-scores respectively, and n represents the number of genes in the gene set. Secondly, GSA differs from GSEA by re-standardizing the gene set scores by taking into account scores from sets formed by random selection of genes. Permutations of class labels are then used to infer the significance of the standardized gene set scores. The need for re-standardization is justified by the fact that, given that the genes are correlated (they tend to have either high or low t-scores simultaneously), the gene set score computed with the true class labels will be systematically larger than with permuted class labels and, hence, the significance of all gene sets will be overstated.
Pathway Analysis with Down-weighting of Overlapping Genes (PADOG)
Let GS i with i=1.. N GS be the collection of gene sets to be analyzed, each containing N(GS i ) genes, and G be the set of all genes measured on the array that can be mapped to at least one gene set to be analyzed. Then let be the value of a moderated t-score  of the gene g between the two conditions of interest with g∈G. The moderated t-scores are similar to ordinary t-scores, except that their standard errors have been moderated across genes, i.e., shrunk towards a common value using a Bayesian model . The moderated t-scores are expected to be more reliable than ordinary t-scores because the shrinkage of the gene standard deviations will prevent large t-scores to occur only due to small gene standard deviations.
where I is a function that returns 1 when the argument is true and 0 otherwise, and represents the standardized score obtained with the ite-th permutation of the samples for gene set GS i .
Assessing the sensitivity and gene set ranking capability using real data
Statistics that describe the distribution of the 24 target pathway’s p -values, including the geometric mean and median (the lower the better), and the percentage of target pathways with nominal p<0.05 (the higher the better). This later statistic is an estimate of the sensitivity of a given analysis method. The percentage of target pathways with False Discovery Rate  corrected p-values (called q-values) less than 0.05 is also given.
Statistics that describe the distribution of the 24 target pathways ranks, including mean and median (the lower the better). The rank of a target pathway, having the i th smallest p-value amongst all N GS pathways analyzed for a given dataset, will be equal to i/ N GS ·100.
Statistics that allow to determine if a given pathway analysis method produces better rankings than a reference method, chosen to be GSA since it was the best among the two published methods that we tested. A simple method to test that the ranks produced by a given method for the 24 target pathways are smaller (better) than the reference method would be to use a one-tailed paired Wilcoxon test, the pairing being at data set level. However, the Wilcoxon test assumes that the different ranks are independent between the 24 datasets, yet this is may not be the case because some ranks are obtained for the same pathway in up to 4 datasets (see Table 1). Another approach that we used to analyze the ranks while accounting for the eventual lack of independence among them was to fit a linear mixed-effects model. The dependent variable in this model were the rank values, while the explanatory variables were the analysis method (factor with two levels, with the reference level being GSA) and the dataset ID (to reflect that the ranks are paired at the dataset level), while the random effects were the pathway IDs. Both the coefficient, and one-tailed p-value that a given analysis method produces better (smaller) ranks than the reference method were reported.
Note that the gene set analysis methods could have been compared also in analyzing gene ontology terms instead of pathways, however, choosing one GO term most relevant for each dataset would have been more subjective.
Assessing the sensitivity of gene set analysis using simulated data
Level of the first 15 genes of GS1 was increased by 0.3 units in group 2.
Level of the first 10 genes of GS1 was increased by 0.3 units and the level of the next 5 genes was decreased by 0.3 units in group 2.
Level of the first 8 genes of GS1 was increased by 0.3 units and the level of the next 7 genes was decreased by 0.3 units in group 2.
Level of the first 7 genes of GS1 was increased by 0.4 units and the level of the next 3 genes was decreased by 0.4 units in group 2.
Level of the first 5 genes of GS1 was increased by 0.4 units and the level of the next 5 genes was decreased by 0.4 units in group 2.
A number of 50 data sets were generated for each of the six scenarios above. Orthogonal on the different scenarios we considered three analysis setups that could influence the results of PADOG but not GSA and GSEA, according to whether or not the genes in GS1 are allowed to be present in other gene set as well (e.g. (GS50)). In the first setup I), GS1 did not overlap with other gene sets as in , II) All genes designed to be DE in GS1 were included also in GS50, and III) All non-DE genes of GS1 were included in GS50. With setup I) we are basically interested in assessing if the gene set summarization function of PADOG (mean or absolute values) combined with the moderated t-scores compares favorably to GSA and GSEA, because in the absence of overlap, the genes of GS1 will have the same weight (w=1.0). When the DE genes in GS1 appear also in other gene sets but the non-DE do not (setup II), PADOG is expected to give higher p-values to GS1 compared to the situation when there is no overlap. This is because the weight of the DE genes in this case will be lower than the weight on non-DE genes. In contrary, if the genes that are non-DE in GS1 overlap but the DE genes are specific to GS1 (setup III) then PADOG is expected to produce smaller p-values for GS1 because the DE genes will have more weight and also larger t-scores.
Assessing the specificity of gene set analysis
To test the ability of the gene set analysis methods to not reject the null hypothesis when it is true, i.e. their specificity, we conducted two simulation studies.
Simulation of the null hypothesis by sample labels permutation
In the first simulation study all the 24 data sets were considered, but their array/samples class labels were permuted at random before analysis so that the correlation structure between genes is preserved. In 100 different trials, we computed several of the statistics described above, including the median of target pathways p-values, median ranks, and the percentage of pathways with p<0.05. The average of these statistics over the 100 trials are reported.
The purpose of this simulation was two-fold. First, it allows us to determine if the target pathways-based benchmark works, i.e. if the ranking results are worse for all methods when the labels are permuted compared to when the true class labels are used. Second, it allows us to estimate the false positive rate (1-specificity) of each gene set analysis method and compare it with the level expected under the null hypothesis. All analysis methods were run on the same 100 permutations of the original class labels of each of the 24 data sets to eliminate any differences introduced by random chance. The number of internal iterations used by each analysis method was N ite =500.
Simulation of the null hypothesis by generating random data
At the suggestion of one of the reviewers, a second type of simulation was performed to determine the false positives rate of gene set analysis methods by generating random data from a normal distribution with mean 0 and standard deviation of 1, N(0,1). For each of the 24 real datasets, 50 fake replicas were created by maintaining the actual sample size and number of genes but generating data at random, for a total of 1200 simulated datasets. The structure of the gene sets was preserved as defined by the 229 KEGG metabolic and non-metabolic pathways, therefore maintaining a meaningful overlap between the different genes in the gene sets. The fraction of all significant pathways (false positive rate) at different α thresholds was determined.
For all 24 datasets shown in Table 1 which were available from the Gene Expression Omnibus (GEO), the analysis was performed consistently by: a) removing outlier arrays (if necessary), b) log transforming the data and normalizing it, c) performing a moderated t-test between groups and computing probes/probesets p-values, d) resolving duplicate probes/probesets to Entrez ID mappings by keeping the probe/probeset with smallest p-value for each unique gene and, e) filtering out all genes that could not be mapped on any of the pathways. The normalization of datasets obtained on Affymetrix arrays was performed using the RMA algorithm  implemented in the affy package of Bioconductor, while normalization of of datasets run on Illumina arrays were normalized using the quantile normalization algorithm  implemented in the preprocessCore of Bioconductor. The package limma was used to compute a moderated two-sample paired or unpaired t-score depending on the particular design of each experiment.
Comparison between gene set analysis methods in terms of sensitivity and pathway ranking when analyzing 143 KEGG non-metabolic pathways
p geometric mean
Comparison between pathway analysis methods in terms of sensitivity and pathway ranking when analyzing 229 KEGG metabolic and non-metabolic pathways
p geometric mean
A sensitivity analysis using simulated data in the absence and presence of overlap between gene sets
PADOG Setup I
PADOG Setup II
PADOG Setup III
Determining the contribution of gene weighting and moderated t-scores in PADOG when analyzing 229 KEGG metabolic and non-metabolic pathways
Comparing gene set analysis methods performance under the null hypothesis simulated by class labels permutation
False positive rates when null hypothesis is simulated by generating random expression data
The set of 229 metabolic and non-metabolic pathways and their genes were obtained from the KEGG.db annotation package  of Bioconductor . The split between metabolic and non-metabolic pathways was done based on KEGG’s classification.
All analyses were run under the R statistical language and environment  version 2.14 and using other infrastructure packages available in Bioconductor version 2.9.
Results and discussion
Sensitivity and rank analysis using real data
We compared the PADOG method proposed here with two existing methods (GSA and GSEA). The analysis was performed on i) 143 non-metabolic pathways (which included all target pathways) and ii) 229 metabolic and non-metabolic KEGG pathways. The criteria used in the comparison between these methods were the sensitivity, the ranking, as well as the specificity of the gene set analysis methods considered. Table 2 shows the summary of gene set analysis results for the three different methods based on the panel of 24 datasets described in Table 1 when analyzing only KEGG non-metabolic pathways.
Sensitivity analysis using simulated data
The result of the sensitivity analysis based on 50 simulated data sets in each of the 5 different scenarios are given in Table 4. These results show that when all genes designed to be differentially expressed (DE) in GS1 are changing in the same direction (scenario 1), GSA and GSEA have an advantage over PADOG while the opposite is true in all remaining 4 scenarios. These results can be understood by considering the fact that GSA and GSEA statistics are designed to find such cases when all the genes in the gene set change in the same direction while PADOG’s summary statistic is more flexible to accommodate cases when the changes occur in both directions. When the overlap favors the DE genes in GS1 (Setup III), that is, when its DE genes are specific to this gene set while its non DE-genes are not specific to the gene set, the performance of PADOG increase in all scenarios 1 through 5, as compared to the absence of overlap. However, even when the overlap is not favorable to GS1 (setup I), that is, when all its non-DE genes are specific to this gene set, PADOG still performes better than GSA and GSEA under scenarios 2 through 5.
Sources of improvement in PADOG
Specificity analysis of gene set analysis methods
Two simulation studies were performed to determine whether the improved sensitivity of the PADOG method, i.e. producing lower p-values for the target pathways, comes at the expense of reduced specificity (increased false positive rate). Table 6 shows three of the same statistics introduced in Table 2 (median p-values, median ranks, and percentage of pathways with p<0.05) except that their average was taken over 100 trials in which the class labels of the arrays in all 24 datasets were randomly permuted before the analysis. The percentage of target pathways with p<0.05 is now the false positive rate (FP) because using random class labels models the null hypothesis in which expression levels are dissociated from the studied phenotypes, yet the gene-gene correlations are preserved. Under these circumstances, any pathways that are reported as significant by any method are, in fact, false positives.
Table 6 shows that, under the null hypothesis, the average median p-values, median ranks and fraction of pathways with p<0.05 across the 100 random permutations are 0.49, 48.9% and 0.049, respectively for PADOG and similar values are obtained for GSA and GSEA. This is expected since when class labels are permuted, the p-values of the target pathways should be uniformly distributed between between 0 and 1 (expected mean 0.5), and rank values should be uniformly distributed between 1/ N GS ·100=0.44 and 100 (expected mean 50.22) where N GS is the number of gene sets analyzed. Table 6 also shows that the average median p-values and median ranks are much above (worse) than the level they had when true class labels were used in the analysis (see Table 2). This is the case for all analysis methods. These results prove that: i) the target pathways were indeed in average relevant to their respective phenotypes, ii) the benchmark system was sound, and iii) both the novel, as well as the existing methods were correctly deployed.
An additional simulation in which 1200 datasets were generated by drawing random values from a normal distribution has yielded similar results as the previous simulation. In this case the false positives rate was estimated as the fraction of all pathways across all 1200 datasets with a p-value less than a given threshold α. The estimated false positive rates of all three methods were very close to the expected α levels as shown in Table 7. This again confirms that PADOG is not expected to find significant gene sets more often than expected by chance regardless if gene are correlated (as in the simulation above) or not (this simulation).
Specificity analysis of the set of target pathways
A specificity analysis of the target pathways on 16 cancer data sets
Pathways in cancer
Pathways in cancer
The original contribution of this paper is two-fold. Firstly, this paper introduces the idea of gene weighting in gene set analysis on the basis of gene frequency across the gene sets to analyzed. The reasoning behind this type of gene weighting is that whenever a gene belongs to multiple gene sets, that particular gene is less useful in prioritizing among those gene sets. Conversely, the differential expression of a gene that is present only on a single gene set/pathway represents a stronger evidence that the given gene set/pathway is impacted in the given condition. A second original contribution is the validation procedure deployed here. The classical approach involves analyzing a handful of selected data sets and discussing the results in the light of the existing literature. This is subjective and makes the comparison of various methods practically impossible. The validation proposed here involves the analysis of a large number of data sets (24 in this case) that can be objectively associated with a target gene set/pathway. This objective association is based on the fact that the samples analyzed are collected from tissues affected by the target disease (e.g. in the analysis of colorectal cancer samples, the colorectal cancer pathway is chosen as the target pathway, etc.). This approach allows a comparison of analysis methods in terms of sensitivity and ranking. Such a comparison is: a) objective, b) reproducible, and c) independent of the accuracy and thoroughness of a literature search. Using this approach, we have shown that PADOG is able to identify the target pathways as significant more frequently and rank them consistently higher than two of the best existing methods for the analysis of gene sets based on high-throughput gene expression data.
ALT designed and implemented the research and drafted the manuscript. ALT, SD, and RR evaluated the research and improved the manuscript. GB collected the datasets and performed data pre-processing. All authors read and approved the final manuscript.
This research was supported, in part, by the Intramural Research Program of the National Institute of Child Health and Human Development, NIH/DHHS. SD was also supported by the following grants: NIH RO1 RDK089167-01 and NSF DBI-0965741 (to S.D.), and by the Robert J. Sokol Endowment in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ: Church GM: Systematic determination of genetic network architecture. Nat Genet 1999, 22: 281–285. 10.1038/10343View ArticlePubMedGoogle Scholar
- Khatri P, Drăghici S, Ostermeier GC, Krawetz SA: Profiling Gene Expression Using Onto-Express. Genomics 2002, 79(2):266–270. 10.1006/geno.2002.6698View ArticlePubMedGoogle Scholar
- Drăghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81(2):98–104. 10.1016/S0888-7543(02)00021-6View ArticlePubMedGoogle Scholar
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, et al: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 1999, 27: 29–34. 10.1093/nar/27.1.29PubMed CentralView ArticlePubMedGoogle Scholar
- Joshi-Tope G, Gillespie M, Vasrik I, D’Eustachio P, Schmidt E, de Bone B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33(Database issue):D428–432.PubMed CentralView ArticlePubMedGoogle Scholar
- Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R: A systems biology approach for pathway level analysis. Genome Research 2007, 17(10):1537–1545. 10.1101/gr.6202607PubMed CentralView ArticlePubMedGoogle Scholar
- Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP: Romero R: A novel signaling pathway impact analysis. Bioinformatics 2009, 25: 75–82. 10.1093/bioinformatics/btn577PubMed CentralView ArticlePubMedGoogle Scholar
- Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ: Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol 2009, 10(4):R44. 10.1186/gb-2009-10-4-r44PubMed CentralView ArticlePubMedGoogle Scholar
- Massa MS, Chiogna M, Romualdi C: Gene set analysis exploiting the topology of a pathway. BMC Syst Biol 2010, 4: 121.PubMed CentralPubMedGoogle Scholar
- Rahnenführer J, Domingues FS, Maydt J, Lengauer T: Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Statistical Applications in Genetics and Molecular Biology 2004, 3: Article16.View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceeding of The National Academy of Sciences of the USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102View ArticleGoogle Scholar
- Efron B, Tibshirani R: On testing the significance of sets of genes. Annals of Applied Statistics 2006, 1: 107–129.View ArticleGoogle Scholar
- Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 2007, 23: 980–987. 10.1093/bioinformatics/btm051View ArticlePubMedGoogle Scholar
- Ackermann M, Strimmer K: A general modular framework for gene set enrichment analysis. BMC Bioinformatics 2009, 10: 47. 10.1186/1471-2105-10-47PubMed CentralView ArticlePubMedGoogle Scholar
- Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology 2004, 3: Article3.View ArticlePubMedGoogle Scholar
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 2007, 8: 242. 10.1186/1471-2105-8-242PubMed CentralView ArticlePubMedGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 2001, 98(9):5116–5121. 10.1073/pnas.091062498PubMed CentralView ArticlePubMedGoogle Scholar
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex. Neurochem Res 2004, 29(6):1213–1222.View ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okunom Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database isuue):277–280.View ArticleGoogle Scholar
- Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW: Incipient Alzheimer’s disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U.S.A 2004, 101: 2173–2178. 10.1073/pnas.0308512100PubMed CentralView ArticlePubMedGoogle Scholar
- Liang WS, Dunckley T, Beach TG, Grover A, Mastroeni D, Walker DG, Caselli RJ, Kukull WA, McKeel D, Morris JC, Hulette C, Schmechel D, Alexander GE, Reiman EM, Rogers J, Stephan DA: Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain. Physiol Genomics 2007, 28: 311–322.PubMed CentralView ArticlePubMedGoogle Scholar
- Zheng B, Liao Z, Locascio JJ, Lesniak KA, Roderick SS, Watt ML, Eklund AC, Zhang-James Y, Kim PD, Hauser MA, Grunblatt E, Moran LB, Mandel SA, Riederer P, Miller RM, Federoff HJ, Wullner U, Papapetropoulos S, Youdim MB, Cantuti-Castelvetri I, Young AB, Vance JM, Davis RL, Hedreen JC, Adler CH, Beach TG, Graeber MB, Middleton FA, Rochet JC, Scherzer CR: PGC-1?, a potential therapeutic target for early intervention in Parkinson’s disease. Sci Transl Med 2010, 2(52):52ra73. 10.1126/scitranslmed.3001059PubMed CentralPubMedGoogle Scholar
- Zhang Y, James M, Middleton FA, Davis RL: Transcriptional analysis of multiple brain regions in Parkinson’s disease supports the involvement of specific protein processing, energy metabolism, and signaling pathways and suggests novel disease mechanisms. Am J Med Genet B Neuropsychiatr Genet 2005, 137B: 5–16. 10.1002/ajmg.b.30195View ArticlePubMedGoogle Scholar
- Runne H, Kuhn A, Wild EJ, Pratyaksha W, Kristiansen M, Isaacs JD, Regulier E, Delorenzi M, Tabrizi SJ, Luthi-Carter R: Analysis of potential transcriptomic biomarkers for Huntington’s disease in peripheral blood. Proc Natl Acad Sci U.S.A 2007, 104: 14424–14429. 10.1073/pnas.0703652104PubMed CentralView ArticlePubMedGoogle Scholar
- Hong Y, Ho KS, Eu KW, Cheah PY: A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. Clin Cancer Res 2007, 13: 1107–1114. 10.1158/1078-0432.CCR-06-1633View ArticlePubMedGoogle Scholar
- Sabates-Bellver J, Van der Flier LG, de Palo M, Cattaneo E, Maake C, Rehrauer H, Laczko E, Kurowski MA, Bujnicki JM, Menigatti M, Luz J, Ranalli TV, Gomes V, Pastorelli A, Faggiani R, Anti M, Jiricny J, Clevers H, Marra G: Transcriptome profile of human colorectal adenomas. Mol Cancer Res 2007, 5: 1263–1275. 10.1158/1541-7786.MCR-07-0267View ArticlePubMedGoogle Scholar
- Hong Y, Downey T, Eu KW, Koh PK, Cheah PY, Koh PK, Cheah PY: A ’metastasis-prone’ signature for early-stage mismatch-repair proficient sporadic colorectal cancer patients and its implications for possible therapeutics. Clin Exp Metastasis 2010, 27: 83–90. 10.1007/s10585-010-9305-4View ArticlePubMedGoogle Scholar
- Wang Y, Roche O, Yan MS, Finak G, Evans AJ, Metcalf JL, Hast BE, Hanna SC, Wondergem B, Furge KA, Irwin MS, Kim WY, Teh BT, Grinstein S, Park M, Marsden PA, Ohh M: Regulation of endocytosis via the oxygen-sensing pathway. Nat Med 2009, 15: 319–324. 10.1038/nm.1922View ArticlePubMedGoogle Scholar
- Lenburg ME, Liou LS, Gerry NP, Frampton GM, Cohen HT, Christman MF: Previously unidentified changes in renal cell carcinoma gene expression identified by parametric analysis of microarray data. BMC Cancer 2003, 3: 31. 10.1186/1471-2407-3-31PubMed CentralView ArticlePubMedGoogle Scholar
- Badea L, Herlea V, Dima SO, Dumitrascu T, Popescu I: Combined gene expression analysis of whole-tissue and microdissected pancreatic ductal adenocarcinoma identifies genes specifically overexpressed in tumor epithelia. Hepatogastroenterology 2008, 55: 2016–2027.PubMedGoogle Scholar
- Pei H, Li L, Fridley BL, Jenkins GD, Kalari KR, Lingle W, Petersen G, Lou Z, Wang L: FKBP51 affects cancer cell response to chemotherapy by negatively regulating Akt. Cancer Cell 2009, 16: 259–266. 10.1016/j.ccr.2009.07.016PubMed CentralView ArticlePubMedGoogle Scholar
- Wallace TA, Prueitt RL, Yi M, Howe TM, Gillespie JW, Yfantis HG, Stephens RM, Caporaso NE, Loffredo CA, Ambs S: Tumor immunobiological differences in prostate cancer between African-American and European-American men. Cancer Res 2008, 68: 927–936. 10.1158/0008-5472.CAN-07-2608View ArticlePubMedGoogle Scholar
- He H, Jazdzewski K, Li W, Liyanarachchi S, Nagy R, Volinia S, Calin GA, Liu CG, Franssila K, Suster S, Kloos RT, Croce CM, de la Chapelle A: The role of microRNA genes in papillary thyroid carcinoma. Proc Natl Acad Sci U.S.A 2005, 102: 19075–19080. 10.1073/pnas.0509603102PubMed CentralView ArticlePubMedGoogle Scholar
- Stirewalt DL, Meshinchi S, Kopecky KJ, Fan W, Pogosova-Agadjanyan EL, Engel JH, Cronk MR, Dorcy KS, McQuary AR, Hockenbery D, Wood B, Heimfeld S, Radich JP: Identification of genes with abnormal expression changes in acute myeloid leukemia. Genes Chromosomes Cancer 2008, 47: 8–20. 10.1002/gcc.20500View ArticlePubMedGoogle Scholar
- Sanchez-Palencia A, Gomez-Morales M, Gomez-Capilla JA, Pedraza V, Boyero L, Rosell R, Farez-Vidal ME: Gene expression profiling reveals novel biomarkers in nonsmall cell lung cancer. Int J Cancer 2010, 129(2):355–364.View ArticlePubMedGoogle Scholar
- Hou J, Aerts J, den Hamer, van Ijcken, den Bakker, Riegman P, van der Leest, van der Spek, Foekens JA, Hoogsteden HC, Grosveld F, Philipsen S: Gene expression-based classification of non-small cell lung carcinomas and survival prediction. PLoS ONE 2010, 5: e10312. 10.1371/journal.pone.0010312PubMed CentralView ArticlePubMedGoogle Scholar
- Barth AS, Kuner R, Buness A, Ruschhaupt M, Merk S, Zwermann L, Kaab S, Kreuzer E, Steinbeck G, Mansmann U, Poustka A, Nabauer M, Sultmann H: Identification of a common gene expression signature in dilated cardiomyopathy across independent microarray studies. J Am Coll Cardiol 2006, 48: 1610–1617.View ArticlePubMedGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30: 207–210. 10.1093/nar/30.1.207PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of The Royal Statistical Society B 1995, 57: 289–300.Google Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249View ArticlePubMedGoogle Scholar
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20(3):307–315. 10.1093/bioinformatics/btg405View ArticlePubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Smyth GK: Limma: linear models for microarray data. Springer, New York; 2005.Google Scholar
- Efron B, Tibshirani R: GSA: Gene set analysis. 2010. . [R package version 1.03] http://CRAN.R-project.org/package=GSA . [R package version 1.03]Google Scholar
- Carlson M, Falcon S, Pages H, Li N: KEGG.db: A set of annotation maps for KEGG. [R package version 2.5.0] [R package version 2.5.0]Google Scholar
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2011. . [ISBN 3–900051–07–0] http://www.R-project.org . [ISBN 3-900051-07-0]Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.