A multi-omics approach for identifying important pathways and genes in human cancer

Background Cancer develops when pathways controlling cell survival, cell fate or genome maintenance are disrupted by the somatic alteration of key driver genes. Understanding how pathway disruption is driven by somatic alterations is thus essential for an accurate characterization of cancer biology and identification of therapeutic targets. Unfortunately, current cancer pathway analysis methods fail to fully model the relationship between somatic alterations and pathway activity. Results To address these limitations, we developed a multi-omics method for identifying biologically important pathways and genes in human cancer. Our approach combines single-sample pathway analysis with multi-stage, lasso-penalized regression to find pathways whose gene expression can be explained largely in terms of gene-level somatic alterations in the tumor. Importantly, this method can analyze case-only data sets, does not require information regarding pathway topology and supports personalized pathway analysis using just somatic alteration data for a limited number of cancer-associated genes. The practical effectiveness of this technique is illustrated through an analysis of data from The Cancer Genome Atlas using gene sets from the Molecular Signatures Database. Conclusions Novel insights into the pathophysiology of human cancer can be obtained from statistical models that predict expression-based pathway activity in terms of non-silent somatic mutations and copy number variation. These models enable the identification of biologically important pathways and genes and support personalized pathway analysis in cases where gene expression data is unavailable. Electronic supplementary material The online version of this article (10.1186/s12859-018-2476-8) contains supplementary material, which is available to authorized users.

List of Figures S1 Illustration of results from the analysis of TCGA pan cancer data using the MSigDB oncogenic signatures (C6) collection. Rows correspond to the TCGA cohorts and columns to MSigDB gene sets. Each cell represents a pathway and cancer type-specific regression model colored according to the R 2 pred value. Both the columns and rows are ordered according to the output from hierarchical agglomerative clustering using complete link agglomeration and euclidean distance. . . . . . . . . . . . . . . . . . . . 11 S2 Box plot illustrating the distribution of R 2 pred values from the analysis of TCGA pan cancer data using the MSigDB oncogenic signatures (C6) collection. Values are shown for three different analyses: COS-MIC : estimation of pathway regression models using somatic alteration data for COSMIC consensus cancer genes, All TCGA: estimation of models using somatic alteration data for all genes included in the TCGA datasets, and Random: estimation of models using randomized somatic alteration data for COSMIC consensus cancer genes. 1 Supplemental Methods

Pathway Definitions
We assume there are p a genes grouped into m overlapping gene sets or pathways as defined by an m × p a gene set indicator matrix A: where a i,j is 1 if gene j belongs to gene set i and 0 otherwise.
For the reported analyses, the A matrix was populated using gene sets from version 5.2 of the Molecular Signatures Database (MSigDB) (Liberzon et al., 2011 -MSigDB description: "Gene sets represent signatures of cellular pathways which are often dis-regulated in cancer. The majority of signatures were generated directly from microarray data from NCBI GEO or from internal unpublished profiling experiments which involved perturbation of known cancer genes. In addition, a small number of oncogenic signatures were curated from scientific publications."

Cancer Driver Genes
We assume that information on the known associations between p d genes (where p d ≥ p a ) and t cancer types is defined by a t × p d matrix D of indicator variables: where d i,j is 1 if gene j has a known association with cancer type i. Because the set of t cancer types is not comprehensive and type-specific associations may be only approximate or unknown for some genes, we also assume that the overall cancer association for each gene is held in a separate vector of indicator variables: where o i is 1 if gene i has a known association with any cancer type (not necessarily one of the t types defined in D). Let the total number of cancer-associated genes be represented by For the reported analyses, the D matrix and o vector were generated using the cancer type annotations in the COSMIC cancer gene census (release v78, 5th September 2016) (Forbes et al., 2015): • Contains cancer associations for 602 genes, i.e., b, as defined in (4), equals 602.
• COSMIC description: "The cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer." Table S1 lists the driver genes for the 20 TCGA cohorts used in the analysis (see next section for more details on the 20 TCGA cohorts). These lists were computed via string matching on the cancer type annotations in the COSMIC cancer gene census.
• Normalized gene expression data (either microarray gene expression of RNA-seq) measured on the tumor samples associated with cancer type k. Let the number of the p a genes captured in either A C2.CP or A C6 that have non-constant gene expression data for all n k subjects be represented by pe k . This gene expression data is for defined by an n k × pe k matrix E k : where e i,j represents the abundance of the mRNA associated with gene j for sample i from cancer type k.
• Data on non-silent somatic mutations in the protein coding region of genes in tumor samples associated with cancer type k. Let the number of genes with non-constant somatic mutation indicators for all n k subjects be represented by pm k , which can take on two separate values depending on whether the somatic mutations are limited to just the b genes with a cancer association as defined in (4) or include all genes with somatic mutation data in the data set. This somatic mutation data is recorded as gene-level mutation indicators in an n k × pm k matrix M k : where m i,j is 1 if there is at least one non-silent mutation in the protein coding region of the gene j for sample i from cancer type k.
• Copy number variation (CNV) data for genes in tumor samples associated with cancer type k. Let the number of genes with non-constant CNV values for all n k subjects be represented by pc k . Similar to pm k , pc k can take on two separate values depending on whether the CNV values are limited to just the b genes with a cancer association as defined in (4) or include all genes with CNV data in the data set. This CNV data is recorded as gene-level values in an n k × pc k matrix C k : where c i,j is an estimate of the copy number for gene j and sample i from cancer type k.
For the reported analyses, the E k , M k and C k matrices were populated using data from The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research Network et al., 2013) as downloaded from the UCSC Cancer Browser (Goldman et al., 2015) pan cancer data set for the 20 TCGA cohorts that had the most samples with expression, mutation and CNV data (i.e., n k ).  Table S2: Sample and gene counts for the 20 TCGA pan cancer cohorts used in the analysis. Note that for the number of mutation and CNV predictors, the same gene may be counted twice (i.e., both non-silent somatic mutation data and CNV data is available for the gene).
The TCGA gene expression and gene-level somatic alteration data was specifically drawn from the following data sets: • TCGA gene expression data (E k ) -Source file: HiSeqV2 PANCAN-2015-02-15.tgz -Data type description (from UCSC Cancer Browser documentation): "TCGA pan-cancer gene expression by RNAseq. Gene expression measured using the IlluminaHiSeq technology. Data from all TCGA cohorts are combined to produce this dataset." -Number of samples: 9,755 • Gene-level non-silent somatic mutations (M k ) -Source file: TCGA PANCAN mutation xena gene-2015-01-28.tgz -Data type details (from UCSC Cancer Browser documentation): "TCGA pan-cancer somatic mutation data compiled from all cohorts mutation calls are available. Red (=1) indicates that a non-silent somatic mutation (nonsense, missense, frame-shif indels, splice site mutations, stop codon readthroughs) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene. White (=0) indicates that none of the above mutation calls were made in this gene for the specific sample. Somatic mutations calls (even on the same tumor DNA extract) are affected by many factors including library prep, sequencing process, read mapping method, reference genome used, genome annotation, calling algorithms, and ad-hoc pre/postprocessing such as black list genes, target selection regions, and black list samples. This dataset is the effort of the UCSC Xena team. Individual mutation data can be downloaded at TCGA DCC." -Number of samples: 6,901 • Gene-level CNV data (C k ) -Source file: TCGA PANCAN gistic2-2015-02-06.tgz -Data type details (from UCSC Cancer Browser documentation): "TCGA pan-cancer genelevel copy number variation (CNV) estimated using the GISTIC2 method. Copy number profile was measured experimentally using whole genome microarray at Broad TCGA genome characterization center. Subsequently, TCGA FIREHOSE pipeline applied GISTIC2 method to produce segmented CNV data, which was then mapped to genes to produce gene-level estimates. Gistic2 data from all TCGA cohorts are combined to produce this dataset. Reference to GISTIC2 method PMID:21527027." (Mermel et al., 2011) -Number of samples: 10,843

Analysis pipeline
This section provides mathematical and implementation details for three analysis steps described in the Methods Section of the main manuscript. R code implementing this pipeline is available at http://www.dartmouth.edu/∼hrfrost/MutPath/. It is assumed that the analysis is performed separately for each cancer type k, k ∈ 1...t. • Step 1: For each of the m pathways defined in A (1), determine the extent to which expression of the pathway genes in each individual tumor varies from the mean expression for all n k tumors of that cancer type. This computation is performed using the unsupervised, single-sample gene set testing method GSVA (Hänzelmann et al., 2013). This analysis generates an n k × m matrix S k of single-subject pathway enrichment scores: where s i,j holds the enrichment score computed by GSVA for pathway j using the gene expression data for tumor i from cancer type k held in row i of E k . Specifically, this matrix will contain normalized enrichment scores generated using the scoring procedure defined by Eq. 5 in Hanzelmann et al. (Hänzelmann et al., 2013), which, under the null of no deviation from the mean pathway activity level for that cancer type, have a standard normal distribution, i.e., s i,j |H 0 ∼ N (0, 1).
If desired, the S k matrix can be generated using alternative single sample gene set testing methods (e.g., ssGSEA (Barbie et al., 2009)) or variations of the GSVA method (e.g., the GSVA statistic in Eq. 4 of Hanzelman et al. (Hänzelmann et al., 2013) that can identify gene sets containing both up-regulated and down-regulated genes). Researchers interested in enabling such a change using the R implementation of our approach available at http://www.dartmouth.edu/∼hrfrost/MutPath/ can edit the GSVAUtils.R file. Importantly, the GSVA R package leveraged by our implementation also provides implementations of the ssGSEA (Barbie et al., 2009), zscore (Lee et al., 2008), and PLAGE (Tomfohr et al., 2005) methods. To illustrate the impact of alternate single sample methods, tables S24 and S25 below show the top driver genes identified for the TCGA pancreatic cancer cohort using either GSVA or ssGSEA to generate single sample gene set scores. As demonstrated by these results, our approach is able to successfully identify known driver genes using either method (GSVA identified more known drivers in the top 20 in this case). • Step 2: For each pathway j, j = 1, ..., m, fit both penalized and unpenalized multiple linear regression models using column j of the enrichment score matrix , S k , as the dependent variable and the pm k + pc k predictors representing gene-level somatic mutation and CNV values: where β 0 captures the expected pathway enrichment score for this cancer type when none of the predictor genes contain non-silent somatic mutations or have copy number variation.
-β m is a length pm k vector of regression coefficients for the indicators of gene-level non-silent somatic mutations, i.e., the effect of non-silent somatic mutations on pathway enrichment.
-β c is a length pc k vector of regression coefficients for the CNV values, i.e., the effect of copy number variation on pathway enrichment.
For the reported analyzes, the penalized and unpenalized regression models were fit separately for both potential sets of the pm k + pc k somatic alteration predictors. In other words, two pairs of penalized and unpenalized models were estimated for each pathway. Models were also estimated using randomized somatic alteration predictors, i.e., the subject labels for the predictors were randomly permuted. The first pair of models used as predictors the somatic alteration variables for all of the genes with a known cancer association, i.e., the b genes defined by o (3) (4). The second set of models used as predictors the somatic alteration variables for all available genes in the TCGA data. For the 20 analyzed TCGA cancer types, the number of somatic mutation and CNV predictors for both cases are listed in the third column of Table S2.
Because the number of predictors, pm k + pc k , will most likely be larger than the number of samples, n, the model in Eq (9) is initially fit using a LASSO (Tibshirani, 2011) penalty, which performs both variable selection and coefficient shrinkage. Because lasso-penalization tends to retain just one predictor from among a set of correlated predictors (Tibshirani, 2011), penalized estimation will tend to select a parsimonious group of genelevel somatic mutation and CNV variables with minimal overlap. Actual estimation for the reported results was performed using the glmnet R package (Friedman et al., 2010). LASSO-penalized estimation is accomplished by maximizing the following objective function: Note that the intercept term, β 0 , is not penalized in this objective function. The LASSO penalty parameter λ can be selected according to cross-validation or to achieve a specific number of non-zero coefficients. For the reported analyses, λ was set to the value that minimized mean squared error during 10-fold cross-validation with the cross-validation (CV) process repeated 5 times to reduce the variance associated with the random splitting of the data.
The CV results were also used to estimate the predicted coefficient of determination for the model (i.e., R 2 pred ). Specifically, R 2 pred was estimated as the average proportion of null deviance explained by the lasso-penalized models on held-out data for each CV fold with the average value computed across all 50 total estimates, i.e., 5 replications of 10-fold CV. Note that the proportion of null deviance explained by the model is equivalent and the coefficient of determination for multiple linear regression. Let the estimated R 2 pred values for all cancer types for a given gene set collection be held in a t × m matrix R: where r i,j holds the R 2 pred estimated via cross-validation of the LASSO-penalized model for pathway j using the data the genomic data for cancer type i. For the reported analysis, 6 versions of the R matrix were generated: -R C2.CP,COSMIC : holds the R 2 pred values for the MSigDB C2.CP pathway models using as predictors just the somatic alterations associated with COSMIC consensus cancer genes, i.e., the b genes defined by o (3) (4).
-R C2.CP,COSMIC,random : holds the R 2 pred values for the MSigDB C2.CP pathway models using as predictors just the somatic alterations associated with COSMIC consensus cancer genes with the association between subjects and predictor values randomly permuted.
-R C2.CP,all : holds the R 2 pred values for the MSigDB C2.CP pathway models using as predictors somatic alterations for all available TCGA genes.
-R C6,COSMIC : holds the R 2 pred values for the MSigDB C6 pathway models using as predictors just the somatic alterations associated with COSMIC consensus cancer genes, i.e., the b genes defined by o (3) (4).
-R C6,COSMIC,random : holds the R 2 pred values for the MSigDB C6 pathway models using as predictors just the somatic alterations associated with COSMIC consensus cancer genes with the association between subjects and predictor values randomly permuted.
-R C6,all : holds the R 2 pred values for the MSigDB C6 pathway models using as predictors somatic alterations for all available TCGA genes.
Following penalized estimation, an unpenalized multiple linear regression is performed for model (9) using as dependent variables just the subset of the pm k +pc k somatic alteration predictors that have non-zero coefficient estimates at the optimal λ from the penalized regression. Let the p-values associated with the t-statistics for the estimated coefficients in the unpenalized linear models for cancer type k be held in a m × (pm k + pc k ) matrix Pval k : where pval i,j holds the p-value associated with the estimated coefficient for somatic alteration predictor j in the unpenalized regression model for pathway i and cancer type k. Note that the elements of this matrix associated with predictors that had 0 coefficients in the LASSO-penalized models will be undefined. For the reported results, 4 different versions of the Pval k matrix were generated (one for each of the 4 possible combinations of MSigDB collection and somatic alteration predictor sets). • Step 3: Use the fitted regression models to address one of the three primary aims. The Methods and Results Sections in the main manuscript contain details on the motivation, expectations and results for the evaluation procedures. Mathematical and implementation details for the evaluation results shown in the figures and tables of the main manuscript and SI are detailed in the next section.

Analysis of breast cancer subtypes
To determine if the pathway models fit for the high-level TCGA cancer types could be used to characterize the features of cancer subtypes, we explored the differences in predicted pathway activity and somatic mutation predictors for the TCGA BRCA subjects assigned to either the basal (n=124) or luminal (n=667) PANCAN cluster-of-cluster assignments Hoadley et al. (2014). To identify pathways whose activity differed between basal and luminal subtypes, the following steps were performed: • For each of the pathways in either the C2.CP or C6 collection, the unpenalized regression model fit according to the approach outlined above was used to generate a predicated pathway dysregulation score for each breast cancer subject using that subject's somatic alteration values (i.e., the non-silent somatic mutation indicators and CNV estimates for all model predictors with non-zero effect sizes).
• For each pathway, a two-sample t-test was performed comparing the predicated pathway scores for luminal tumors with the predicated pathway scores for basal tumors.
• The pathways were then ranked according to the estimated t-statistics with large positive statistics representing pathways with greater activity in luminal tumors and with large negative statistics representing greater activity in basal tumors.
To identify somatic alteration predictors whose importance in driving pathway activity differs between basal and luminal subtypes, the following steps were performed: • For each somatic mutation or CNV predictor included in at least one pathway model, a subject-level score was computed by taking the sum across all pathway models of the product of the following values: -The absolute value of subject's measurement for that predictor.
-The -log of the p-value for that predictor in the unpenalized regression model.
-The predicted R 2 value for the pathway model.
• The average of the predictor scores was computed among the luminal subjects and separately among the basal subjects.
• The predictors were then ranked according to the difference between the mean luminal and mean basal scores with large positive mean differences corresponding to somatic alterations that are more important in driving pathway activity among luminal tumors and with large negative mean differences corresponding to somatic alterations that are more important in driving pathway activity among basal tumors.
The results of these comparative pathway and predictor analyses are contained in Tables S18-S23 below.

Generation of tables and figures
The tables and figures included in the main manuscript and SI were generated as detailed below. Associated R logic is available at http://www.dartmouth.edu/∼hrfrost/MutPath/.
• Heatmap of R 2 pred values: For Figure 2 in the main manuscript, the heatmap was generated using the R 2 pred values in the matrix R C2.CP,COSMIC with both the columns (i.e., pathways) and rows (i.e., cancer types) ordered according to the results of agglomerative hierarchical clustering using complete link agglomeration and euclidean distance. The heatmap shown in Figure S1 was generating using a similar procedure for the R 2 pred values in the matrix R C6,COSMIC .
• Box plots of R 2 pred values: For Figure 3 in the main manuscript, the box plot was generated using the R 2 pred values in the following matrices: R C2.CP,COSMIC , R C2.CP,COSMIC,random and R C2.CP,all . For Figure S2, the box plot was generated using the R 2 pred values in R C6,COSMIC , R C6,COSMIC,random and R C6,all • Rank correlation between R 2 pred values for different predictor sets: For Figure 4 in the main manuscript, the correlation value for TCGA cancer type k was computed as the Spearman rank correlation between row k of matrix R C2.CP,COSMIC and row k of matrix R C2.CP,all . For Figure S3, the correlation value for TCGA cancer type k was computed as the Spearman rank correlation between row k of matrix R C6,COSMIC and row k of matrix R C6,all .
• Ranking of pathway models by R 2 pred : Table 1 in the main manuscript shows the top 10 MSigDB C2.CP pathway models according to the R 2 pred values estimated for the TCGA lung adenocarcinoma data. The rank ordering is displayed for models estimated using both potential sets of predictors. Specifically, the left side is ordered according to the R 2 pred values in the row of R C2.CP,COSMIC corresponding to lung adenocarcinoma and the right side is ordered according to the R 2 pred values in the row of R C2.CP,all corresponding to lung adenocarcinoma. The # columns specify the rank of that pathway according to the R 2 pred values in R C2.CP,COSMIC . Similar tables displaying the top 20 pathways from the MSigDB C2.CP and C6 collections are included in the SI for the fiveTCGA cohorts with the largest number of samples that have expression, somatic mutation and CNV data: breast cancer (Tables S3 and S4), head and neck cancer (Tables S6 and S7), lower grade glioma (Tables S9 and S10), lung adenocarcinoma (Tables S12 and S13) and thyroid cancer (Tables S15 and S16).
• Mean R 2 pred values: Figures S4 and S5 illustrate the association between cohort sample size and mean R 2 pred values for each TCGA cancer type. For Figure S4, the mean R 2 pred values are computed for each row of matrix R C2.CP,COSMIC . For Figure S5, the mean R 2 pred values are computed for each row of matrix R C6,COSMIC • Ranking of somatic alteration predictors: The individual gene-level somatic mutation and CNV predictors where rank ordered for each TCGA cohort according to a weight w computed the sum average across all pathway models for a specific MSigDB collection of the product of the -log(p-value) from the unpenalized regression model and the R 2 pred estimate for the penalized model. If the index of the predictor among all pm k + pc k predictors is e, then w is computed for cancer type k according to the following formula: All predictor weights can be held in a t × (pm k + pc k ) matrix W : where wl i,j holds the weight for predictor j and cancer type i. For the reported results, 4 different versions of W were computed: -W C2.CP,COSMIC : The weights are computed across the C2.CP pathways estimated using just the COS-MIC cancer consensus gene predictors.
-W C2.CP,all : The weights are computed across the C2.CP pathways estimated using all available TCGA genes as predictors.
-W C6,COSMIC : The weights are computed across the C6 pathways estimated using just the COSMIC cancer consensus gene predictors.
• Rank correlation between somatic alteration predictors for different MSigDB collections: For Figure 6 in the main manuscript, the correlation value for TCGA cancer type k was computed as the Spearman rank correlation between row k of matrix W C2.CP,COSMIC and row k of matrix W C6,COSMIC . For Figure  S6, the correlation value for TCGA cancer type k was computed as the Spearman rank correlation between row k of matrix W C2.CP,all and row k of matrix W C6,all , • Enrichment of known driver genes among top ranked predictors: Figure 5 and S7 illustrate the enrichment of known driver genes for each cancer type among the predictors ranked according to the weights in W. Specifically, for each cancer type k a Wilcoxon rank sum test was performed between the elements of row k of matrix W C2.CP,all (or W C6,all ) associated with genes that have non-zero values in row k of matrix D (i.e., genes with a known association with cancer type k) and the elements of row k of matrix W C2.CP,all (or W C6,all ) associated with genes that have zero values in row k of matrix D (i.e., genes with no known association with cancer type k). False discovery rate (FDR) q-values were computed on the Wilcoxon test p-values using the Benjamini and Hochberg (BH) (Benjamini and Hochberg, 1995) method for a family of hypotheses that includes all t cancer types. In Figure 5, the q-values for each TCGA cohort, as computed using the C2.CP versions of the W matrix, are plotted relative to cohort sample size. Figure S7 illustrates the q-values computed using the C6 versions of the W matrix. Figure S1: Illustration of results from the analysis of TCGA pan cancer data using the MSigDB oncogenic signatures (C6) collection. Rows correspond to the TCGA cohorts and columns to MSigDB gene sets. Each cell represents a pathway and cancer type-specific regression model colored according to the R 2 pred value. Both the columns and rows are ordered according to the output from hierarchical agglomerative clustering using complete link agglomeration and euclidean distance. Figure S2: Box plot illustrating the distribution of R 2 pred values from the analysis of TCGA pan cancer data using the MSigDB oncogenic signatures (C6) collection. Values are shown for three different analyses: COSMIC : estimation of pathway regression models using somatic alteration data for COSMIC consensus cancer genes, All TCGA: estimation of models using somatic alteration data for all genes included in the TCGA datasets, and Random: estimation of models using randomized somatic alteration data for COSMIC consensus cancer genes. Figure S3: Spearman rank correlation between R 2 pred values for C6 pathway models estimated using just COSMIC consensus cancer genes and the R 2 pred values for C2.CP pathway models estimated using all TCGA genes. The rank correlation values are computed separately for each TCGA cohort and are plotted relative to cohort sample size. Figure S4: Mean R 2 pred values for C2.CP pathway models estimated for each TCGA cohort using just COSMIC consensus cancer genes. The mean R 2 pred values are plotted relative to cohort sample size. Figure S5: Mean R 2 pred values for C6 pathway models estimated for each TCGA cohort using just COSMIC consensus cancer genes. The mean R 2 pred values are plotted relative to cohort sample size. Figure S6: Spearman rank correlation between the weights for somatic alteration predictors for all available TCGA genes computed for the MSigDB C2.CP and C6 collections. The rank correlation values are computed separately for each TCGA cohort and are plotted relative to cohort sample size. Figure S7: The false discovery rate (FDR) q-values (as computed using the Benjamini and Hochberg (BH) (Benjamini and Hochberg, 1995) method) for the Wilcox rank sum tests of enrichment of known driver genes among the ranks of all predictors for the MSigDB C6 models using all TCGA genes. The q-value for each cancer type cohort is plotted relative to the cohort sample size.