GSVA: gene set variation analysis for microarray and RNA-Seq data

Background Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. Results To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. Conclusions GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.

: Distribution of GSVA scores. The left plot shows the distribution of GSVA scores calculated from standard Gaussian deviates N (µ = 0, σ = 1) on p = 20, 000 genes and n = 30 samples using 100 gene sets with sizes uniformly sampled from 10 to 100 genes using the classical maximum deviation enrichment score from a Kolmogorov-Smirnov random walk. The right plot shows normalized enrichment scores (see main paper) calculated from the same simulated data.
2 Accuracy of single-sample GSE methods by simulation Strong signal−to−noise ratio 80% DE genes in gene set AUC P < 0.01 P < 0.01 P < 0.01 Figure S2: Comparison of differential expression prediction of GSVA, PLAGE, single sample GSEA (ssGSEA) and combined z-score (zscore). Each panel shows the area under the ROC curve (AUC) in the y-axis for differentially expressed genes predicted by each method at 1% FDR on 100 simulations (see Methods). The two panels on top correspond to simulations where 50% of the genes in DE gene sets were DE while the two at the bottom contained 80% of DE genes on those DE gene sets. The two panels on the left correspond to a weak signal-to-noise ratio in the DE magnitude while the two on the right correspond to a strong one. Diamonds indicate mean values in boxplots.

Identification of significant pathways
A Genes B Gene sets Adrenocortical Adenoma Adrenocortical Carcinoma Normal Adrenal Figure S3: Differential expression analysis of adrenocortical carcinoma. Heatmaps of genes expression values (A) and GSVA pathway scores (B) for genes and pathways that change significantly across the three subtypes.
The transformation of gene expression levels into GSVA enrichment scores permits gene set enrichment analyses as a classical differential gene expression analysis. To identify gene sets that are differentially activated between the sample groups, we use linear models and moderated t-statistics. These are computed using the empirical Bayes shrinkage method implemented in the Bioconductor package limma [1]. The combination of utilizing GSVA scores and limma allows for the assessment of gene set enrichment directly in complex experimental designs including multiple sample groups and batch effects. We illustrate this application by means of an adrenocortical carcinoma data set with the three groups, adrenocortical carcinoma (ACC), adenoma (ACA) and normal adrenal tissue (NAC) [2].
Adenoma is a benign form of adrenocortical tumors and found in the majority of patients, whereas the malignant tumor (adrenocarcinoma) has an estimated annual incidence of one to two per million population [3]. Most genes disrupted in adrenocarcinoma are either tumor suppressors or oncogenes, such as insulin-like growth factor 1 (IGF1), β-catenin, steroid factor 1 (SF1) or growth factors (details in Table S1 below). ACA and NA tissues exhibit a very similar molecular profile, which leads to difficulties in illuminating the subtle underlying changes. Hence, we apply a non-specific gene and gene set filter, low variation genes and small gene sets are removed, before executing the GSVA function. This approach helps to focus on gene sets representing a more coherent struc-ture and emphasize the fine differences between the three groups. We model the pairwise contrasts of ACA, ACC and normal tissue to identify genes/gene sets that are differentially expressed and can be utilized for clustering. For this, we use a moderated t-statistic [1] and take the intersection of significant gene sets (at a corrected p-value of .001). Subsequently, we use unsupervised clustering on the selected gene sets to generate the heatmap in Figure S3B. Similarly, we perform a gene-centric analysis with the results showing in Figure S3A. In both cases, the use of pathways or genes is capable of separating the three groups, demonstrating GSVA enrichment scores may be used in an analogous fashion to gene expression for modeling multiple groups. The added value of GSVA is that pathways/gene sets illuminate the subtle differences between adenoma and normal tissue clearer.  These gene sets contain growth factors ( IGF1 (Insulin Growth Factor 1), VEGF (Vascular Endothelial Growth Factor)), estrogen receptor (ER) and its targets, as well as TP53 (Cellular tumor antigen p53) and genes involved in cell cycle regulation (cyclins, cell division kinases). All of these genes are associated with ACC [5]. Early stage ACC correlates with ER expression [6]. Topoisomerase II α has been shown to distinguish adenomas from carcinomas in ACC [7]. BUB1B (also contained in the gene sets on the left) has been identified as a marker for survival [8]. The overexpression of gene sets involved in cell cycle regulation (cyclins, cell division kinases) and TP53 agrees with findings from previous functional studies in ACC [2], [5]. Also, p53 tumor suppressor pathways lead to changes from normal cells to tumor cells [9] DELYS THYROID CANCER DN Both retinoic acid production and aldehyde dehydrogenases were found to be decreased in ACC [8], [5].

Survival analysis in ovarian serous carcinoma
In the main paper we describe how pathways can be used to predict survival in ovarian cancer. We fit a Cox proportional hazards model to each gene set to identify predictive pathways, using a permutation approach to estimate FDR. Below we list the most predictive pathways (FDR < .1).  Figure S4: GSVA for RNA-seq (Yale). A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes with sex-specific expression: MSY formed by genes from the male-specific region of the Y chromosome (male-specific), and XiE formed by genes that escape X-inactivation in females (female-specific). Red and blue dots represent female and male samples, respectively. In both cases GSVA scores show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific gene set and male samples show higher enrichment scores in the male-specific gene set.

CNV Analysis in Ovarian Data
GSVA enables comparisons with different molecular data types, such as measurements of copy-number variation (CNV). Our approach is analogous to eQTL analysis, which studies how genetic variation in DNA impacts changes in mRNA expression. We would like to discern how copy number alterations, induced by either deletions or amplifications along a chromosomal region, produce changes in pathway activity. For this, we use the ovarian cancer data (OV) from TCGA. We begin by preprocessing the CNV data. First, we remove any cis-CNV effect on mRNA expression since we are not interested in pathways whose behavior can be explained by occupying a shared genomic region. For this, we fit a linear model between gene expression and the CNV overlapping a gene, and regress out the CNV's effect on expression. CNV data was further reduced in dimension by windowing the data into 60kb blocks, with 30kb steps across each chromosome, resulting in approximately 11,000 loci.
We use this corrected expression data to compute GSVA enrichment scores using the maximum deviation method, followed by CNV-pathway correlation analysis. We separately analyze two gene set databases from MSigDB, the "canonical pathway" gene sets (n = 833) (comprised of REACTOME, BIO-CARTA and KEGG) and the gene ontology (GO) biological process gene sets (n = 885), and plot the number of gene sets with a strong correlation (|ρ| > .4) at a given locus ( Figure S5). The canonical pathways track, shown in light gray, has its largest peak on the p-arm of chromosome 6 and corresponds primarily to immune related pathways, a likely consequence of major histocompatibility complex (MHC) genes found densely in this region. The highest ranked CNVpathway correlations (|ρ| > .8) occur at three loci on chr4:68MB, chr6:26MB, and chr6:32MB corresponding to pathways KEGG PENTOSE AND GLU-CURONATE INTERCONVERSIONS, REACTOME RNA POLYMERASE I PROMOTER OPENING, and KEGG ANTIGEN PROCESSING AND PRE-SENTATION, respectively. The first of these gene sets is particularly interesting, as it may relate to the so-called "Warburg effect" seen in cancer tumors whereby anaerobic glycolysis is increased to supply fast growing tumors with needed energy. The GO gene sets track is depicted in dark gray and shows a large peak on chromosome 17. This region consists of pathways primarily associated with viral response and cation homeostasis. Only one CNV region has a gene set with a correlation above the ρ = |.8| threshold: EXTRACEL-LULAR STRUCTURE ORGANIZATION AND BIOGENESIS at the region chr5:138MB -chr5:143MB.