Relating mutational signature exposures to clinical data in cancers via signeR 2.0

Background: Cancer is a collection of diseases caused by the deregulation of cell processes, which is triggered by somatic mutations. The search for patterns in somatic mutations, known as mutational signatures, is a growing field of study that has already become a useful tool in oncology. Several algorithms have been proposed to perform one or both of the following tasks: (1) de novo estimation of signatures and their exposures, (2) estimation of the exposures to each one of a set of pre-defined signatures.
Results: Our group developed signeR, a Bayesian approach to both of these tasks. Here we present a new version of the software, signeR 2.0, which extends the previous analyses to explore the relation of signature exposures to other data of clinical relevance. signeR 2.0 includes a user-friendly interface developed using the R-Shiny framework and improvements in performance. This version allows the analysis of submitted data or of public TCGA data, which is embedded in the package for easy access.
Conclusion: signeR 2.0 is a valuable tool to generate and explore exposure data, from either de novo or fitting analyses, and is an open-source R package available through the Bioconductor project at (10.18129/B9.bioc.signeR).
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-023-05550-3.


Introduction
signeRFlow is a Shiny app, embedded in the signeR package, that allows users to explore mutational signatures and exposures to the related mutational processes. With the available modules, users can analyze their own data using either of two approaches: de novo signature estimation or the fitting of mutation counts to known signatures. The app also provides a module to explore public datasets from TCGA.
To use the signeRFlow app, you must have the signeR package installed. To install signeR, open R and enter:
    install.packages("BiocManager")
    BiocManager::install("signeR")
Start the app using either RStudio or a terminal:
    # running signeRFlow function
    library(signeR)
    signeRFlow()
The app will open in a new window or in a tab in your browser (Figure 1).

Modules
signeRFlow functionalities are divided into three modules:
• signeR de novo: performs de novo analysis to extract signatures from your data, also estimating related exposures.
• signeR fitting: finds exposures to known signatures in your data. The signatures can be uploaded or chosen from the COSMIC database. Exposures are estimated and can be explored.
• TCGA explorer: provides access to the results of signeR applications to 33 datasets from TCGA.
You can go through the modules independently by using the app sidebar. All modules give access to downstream analysis of exposure data.

signeR de novo analysis
In this module, you can upload an SNV matrix with counts of mutations and execute the signeR de novo algorithm, which applies a Bayesian approach to the non-negative matrix factorization (NMF) of the mutation counts into a matrix product of mutational signatures and exposures to mutational processes.
You can also provide a file with opportunities, which are used as weights for the factorization. Further analysis parameters can be set, results can be visualized in different plots, and the signatures found can be compared interactively to the ones in the COSMIC database.
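The factorization described above can be illustrated with a small simulation. This is only a sketch in base R, not the signeR sampler: it assumes that counts are Poisson distributed around the product of signatures and exposures, and that opportunities enter as element-wise weights on that product. All object names (P, E, W, M) and all values are made up for the illustration, and mutation types are placed in rows here, which is the transpose of the uploaded SNV matrix.

```r
set.seed(42)
n_types <- 96    # mutation types (rows)
n_sig <- 2       # mutational signatures
n_samples <- 4   # genome samples (columns)

# P: signatures, one column per signature, each column sums to 1
P <- matrix(runif(n_types * n_sig), n_types, n_sig)
P <- sweep(P, 2, colSums(P), "/")

# E: exposures, one row per signature and one column per sample
E <- matrix(c(300,  50, 120, 80,
               20, 400,  90, 10), nrow = n_sig, byrow = TRUE)

# W: opportunities used as weights (all 1 when no opportunity file is given)
W <- matrix(1, n_types, n_samples)

# Expected counts, and one simulated matrix of mutation counts
rate <- (P %*% E) * W
M <- matrix(rpois(length(rate), rate), n_types, n_samples)
```

A de novo analysis works in the opposite direction: only M (and optionally W) is observed, and P and E are estimated.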

Load data
You can upload a VCF, MAF or SNV matrix file with your own samples to use in the signeR de novo module. You can upload an opportunity file as well, or use an already built genome opportunity (hg19 or hg38 only). You can also upload a BED file to build an opportunity matrix.

VCF, MAF or SNV matrix
You can upload a VCF, MAF or SNV matrix file from your computer by clicking the Browse button. The SNV matrix is a tab-delimited text file with a matrix of SNV counts found in the analyzed genomes. It must contain one row for each genome sample and 97 columns: the first one with sample IDs and, after that, one column for each mutation type. Mutations should be specified in the column names (headers) by both the base change and the trinucleotide context where it occurs (for example: C>A:ACA). The table below shows an example of the SNV matrix structure.
If you want to upload a VCF or a MAF file, you must select the genome build used in your variant calling analysis to allow signeR to generate an SNV matrix of counts. You can also generate an SNV matrix from a VCF or a MAF file using the methods genCountMatrixFromVcf and genCountMatrixFromMAF.

Columns:
The first column must contain the sample ID and the other columns contain the 96 mutation types (base change plus trinucleotide context).

Rows:
Each row contains the sample ID and the counts for each mutation type.
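The expected layout can be checked before uploading a file. The sketch below builds the 96 column names in the "change:context" form shown in the example header (C>A:ACA, ..., T>G:TTT) and tests a data frame against the rules above. The helper name check_snv_matrix and the toy data are hypothetical and are not part of signeR.

```r
changes <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
bases <- c("A", "C", "G", "T")

# For every base change, combine the reference base with all 5' and 3' neighbours
mutation_types <- unlist(lapply(changes, function(ch) {
  ref <- substr(ch, 1, 1)
  grid <- expand.grid(three = bases, five = bases, stringsAsFactors = FALSE)
  paste0(ch, ":", grid$five, ref, grid$three)
}))

# Hypothetical helper: TRUE if the table has 97 columns (sample IDs plus the
# 96 mutation types) and the counts are non-negative whole numbers
check_snv_matrix <- function(snv) {
  counts <- as.matrix(snv[, -1])
  ncol(snv) == 97 &&
    setequal(colnames(snv)[-1], mutation_types) &&
    is.numeric(counts) && all(counts >= 0) && all(counts == round(counts))
}

# Toy SNV matrix with two samples and zero counts
toy <- data.frame(SampleID = c("G1", "G2"), matrix(0L, 2, 96),
                  check.names = FALSE)
colnames(toy)[-1] <- mutation_types
check_snv_matrix(toy)
```

A real file could be read with read.delim(path, check.names = FALSE) so that the ">" and ":" characters in the headers are preserved.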

Opportunity matrix
You can upload an Opportunity matrix file or a BED file from your computer by clicking the Browse button. You can also use an already built genome opportunity for human reference genomes (hg19 or hg38 only). This is an optional file. The Opportunity matrix is a tab-delimited text file with a matrix of counts of trinucleotide contexts found in the studied genomes. It must be structured as the SNV matrix, with mutations specified on the head line (for each SNV count, the Opportunity matrix shows the total number of genomic loci where the referred mutation could have occurred). The table below shows an example of the opportunity matrix structure.
If you want to upload a BED file, you must select the genome build used in your analysis to allow signeR to generate the opportunities for the regions in your file. Also, you can generate an opportunity matrix from the reference genome using the method:

Columns:
There is no header in this file and each column represents a trinucleotide context.

Rows:
Each row contains the count frequency of the trinucleotides in the whole analyzed region for each sample.
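To make the idea of opportunities concrete, the sketch below counts trinucleotide contexts in a short DNA string. It is an illustration only and not the procedure used by signeR: the function name count_contexts is hypothetical, and the folding of purine-centred trinucleotides onto their reverse complement is an assumption made so that contexts match the pyrimidine-centred mutation types (C>A:ACA and so on).

```r
# Hypothetical helper: counts trinucleotide contexts in one sequence
# (the sequence must have at least three bases)
count_contexts <- function(seq) {
  n <- nchar(seq)
  tri <- substring(seq, 1:(n - 2), 3:n)
  # Fold trinucleotides centred on A or G onto their reverse complement
  purine <- substr(tri, 2, 2) %in% c("A", "G")
  tri[purine] <- vapply(tri[purine], function(x) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
  }, character(1))
  tab <- table(tri)
  setNames(as.integer(tab), names(tab))
}

count_contexts("ACAGT")
```

In this toy sequence the windows are ACA, CAG and AGT; the last two are centred on a purine and are therefore counted as CTG and ACT.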

De novo analysis parameters
There are some parameters that you can define before running the analysis by clicking the Start de novo analysis button:
• EM: number of iterations performed to estimate the hyper-hyperparameters of the signeR model. Ignored if previously computed values are used for those parameters (fast option).
• Warm-up: number of Gibbs sampler iterations performed in the warm-up phase, before signeR assumes that the model has converged.
• Final: number of final Gibbs sampler iterations used to estimate signatures and exposures.
During the execution, a message will appear on the screen showing the progress. After the analysis is finished, you can download the results by clicking the Download Rdata button, below the Start de novo analysis button, and interact with all the plots available in the signeR package. For example, the convergence of the MCMC model used to estimate the signatures along with their exposures can be seen in Figure 5.
Figure 5: Estimated entries for both the signatures matrix and the exposure matrix, along the iterations of the Gibbs sampler.
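The roles of the Warm-up and Final parameters can be seen in a toy Gibbs sampler. The example below is not the signeR model: it samples a bivariate normal distribution with correlation rho, which is a standard textbook case, discards the warm-up draws and keeps the final ones for estimation. The variable names and the values of warm_up and final are arbitrary.

```r
set.seed(7)
rho <- 0.8        # correlation of the toy target distribution
warm_up <- 500    # iterations discarded before convergence is assumed
final <- 2000     # iterations kept for estimation

x <- 0
y <- 0
draws <- matrix(NA_real_, nrow = warm_up + final, ncol = 2)
for (i in seq_len(warm_up + final)) {
  # Each coordinate is drawn from its conditional given the other one
  x <- rnorm(1, mean = rho * y, sd = sqrt(1 - rho^2))
  y <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))
  draws[i, ] <- c(x, y)
}

# Estimates are based on the final iterations only
kept <- draws[-seq_len(warm_up), , drop = FALSE]
posterior_mean <- colMeans(kept)
```

Plotting draws[, 1] against the iteration number gives a trace of the same kind as the convergence plot mentioned above.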

Cosmic cosine
signeRFlow uses COSMIC v3.2 to calculate the cosine distance between the signatures found and those present in COSMIC. A heatmap is shown in the COSMIC Comparison section of the de novo tab.
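A minimal sketch of the comparison, assuming the cosine distance is taken as one minus the cosine similarity between signature vectors. The matrices below are tiny made-up examples with three mutation types instead of 96, and the function name cosine_similarity is hypothetical.

```r
# Cosine similarity between every column of A and every column of B
cosine_similarity <- function(A, B) {
  crossprod(A, B) / outer(sqrt(colSums(A^2)), sqrt(colSums(B^2)))
}

# Toy signatures: rows are mutation types, columns are signatures
found <- cbind(S1 = c(1, 0, 2), S2 = c(0, 3, 0))
reference <- cbind(RefA = c(2, 0, 4), RefB = c(0, 1, 0), RefC = c(1, 1, 1))

sim <- cosine_similarity(found, reference)
cos_dist <- 1 - sim
```

Here S1 is proportional to RefA and S2 to RefB, so those two distances are zero. A matrix like cos_dist could be displayed with heatmap(cos_dist).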

signeR fitting
In this module, you can upload a VCF, MAF or SNV matrix file with counts of mutations, the same as used in the de novo module, and a previous signatures file with known signatures to execute the signeR fitting algorithm, which applies a Bayesian approach to the fitting of mutation counts to known mutational signatures, thus estimating exposures to mutational processes.
You can also provide a file with opportunities, or use an already built genome opportunity, to be used as weights for the factorization. Further analysis parameters can be set and estimated exposures can be visualized interactively in different plots.

Load data
You can upload a VCF, MAF or SNV matrix file with your own samples to use in the signeR fitting module, together with previously known signatures. You can upload an opportunity file as well. The SNV, VCF or MAF file (2.1.1) and the opportunity matrix (2.1.1) are the same as used in the de novo module.

Previous signatures matrix
You can upload a Previous signatures matrix file from your computer by clicking the Browse button. Previous signatures is a tab-delimited text file with a matrix of previously known signatures. It must contain one column for each signature and one row for each of the 96 SNV types (considering trinucleotide contexts). Mutation types should be contained in the first column, in the same form as the column names of the SNV matrix (2.1.1). The table below shows an example of the previous signatures matrix structure.

Columns:
The first column needs to contain the trinucleotide contexts and other columns contain the known signatures.

Rows:
Each row contains the expected frequency of the given mutation in the indicated trinucleotide context.

Fitting parameters
There are some parameters that you can define before running the analysis by clicking the Start Fitting analysis button:
• Warm-up: number of Gibbs sampler iterations performed in the warm-up phase, before signeR assumes that the model has converged.
• Final: number of final Gibbs sampler iterations used to estimate signatures and exposures.
During the execution, a message will appear on the screen showing the progress.
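For intuition about what fitting estimates, the sketch below recovers exposures to fixed signatures with a multiplicative update that decreases the Poisson (Kullback-Leibler) divergence. This is a deliberately simpler, non-Bayesian technique swapped in for illustration; it is not the signeR Gibbs sampler, and the function name fit_exposures and all simulated values are assumptions. Mutation types are in rows, as in the previous signatures file.

```r
set.seed(1)
n_types <- 96
n_sig <- 3
n_samples <- 5

# Known signatures: strictly positive columns that sum to 1
P <- matrix(runif(n_types * n_sig, min = 0.01, max = 1), n_types, n_sig)
P <- sweep(P, 2, colSums(P), "/")

# Simulated "true" exposures and the resulting mutation counts
E_true <- matrix(runif(n_sig * n_samples, min = 100, max = 1000),
                 n_sig, n_samples)
M <- matrix(rpois(n_types * n_samples, P %*% E_true), n_types, n_samples)

# Multiplicative updates of E with P kept fixed
fit_exposures <- function(M, P, n_iter = 200) {
  E <- matrix(1, ncol(P), ncol(M))
  for (i in seq_len(n_iter)) {
    E <- E * crossprod(P, M / (P %*% E)) / colSums(P)
  }
  E
}

E_hat <- fit_exposures(M, P)
```

Because the signature columns sum to 1, the estimated exposures of each sample add up to its total number of mutations. Unlike this point estimate, the Bayesian approach yields a sample of exposure matrices from the final Gibbs iterations.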

TCGA Explorer
Instead of uploading a private dataset, you can use signeRFlow to explore exposure data previously estimated for samples in TCGA public datasets. We previously applied the signeR algorithm to genome samples from 33 cancer types. Estimated mutational signatures and exposures were obtained for each cancer type. Also, known signatures from the COSMIC database were fitted to TCGA mutation data, thus estimating related exposures for each cancer type. You can select the cancer type of interest and the analysis type on the sidebar. Also, samples can be filtered according to the features available in the metadata.
The first time you click the TCGA Explorer button on the sidebar, signeRFlow will download all the necessary files (RData) according to cancer study and analysis type. The files are often small, but depending on the cancer study, this process can take a while. A message will show the download and rendering progress.
Figure 8: TCGA Explorer

Filter dataset
Using the data summary table with all clinical data features downloaded from TCGA, you can select a feature to filter the dataset. Different filtering options are shown according to the feature class. If you filter a dataset using the data summary table, the filtered dataset will be used in the downstream analyses, such as clustering and covariate.
Filtering the dataset is not mandatory; you can use all the cases. The aim of this resource is to allow you to explore the dataset and select the cases you want to work with.
As an example, we selected the feature ajcc pathologic stage from the ACC cancer type and de novo analysis. For each change in features and filters, the available plots are updated according to the filtered samples.
Afterwards, you can download the results by clicking the Download Rdata button, below the Start Fitting analysis button, and interact with all the plots available in the signeR package.

Downstream analysis
Downstream analysis is available in all modules: you can apply it to de novo or fitting results obtained from your own data, or in the TCGA Explorer module. The available analysis options, conditioned by the provided clinical data, are found on the top tabs Clustering and Covariate. In the TCGA Explorer module there is no need to upload a clinical dataset, since the available data is embedded in the software. In this last case, at the top of the Covariate tab you will see, as a reminder, information about the dataset and the filters used.
There are two main downstream analyses:
• Clustering
-Fuzzy Clustering: signeRFlow can apply Fuzzy C-Means Clustering to each generated sample of the exposure matrix. Pertinence (membership) levels of samples to clusters are averaged over the different runs of the algorithm. The means are taken as the final pertinence levels and are shown in a heatmap.
• Covariate
-Categorical feature: differences in exposures among groups can be analyzed and, if some of the samples are unlabeled, they can be labeled based on the similarity of their exposure profiles to those of labeled samples.
-Continuous feature: its correlation to estimated exposures can be evaluated.
-Survival feature: survival data can also be analyzed and the relation of signatures to survival can be assessed.

Clustering

Hierarchical Clustering
In the Hierarchical clustering section, you can select different dist and hclust methods. When you select a new dist or hclust method, the dendrogram plot is updated.
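The dist and hclust options correspond to the base R functions of the same names. A minimal sketch on a made-up exposure matrix (samples in rows, signatures in columns) is given below; the distance and linkage methods chosen here are arbitrary examples, and the sample and signature names are invented.

```r
set.seed(3)
# Toy exposures: two groups of five samples with clearly different levels
exposures <- rbind(matrix(rnorm(15, mean = 10), 5, 3),
                   matrix(rnorm(15, mean = 50), 5, 3))
rownames(exposures) <- paste0("Sample", 1:10)
colnames(exposures) <- paste0("Sig", 1:3)

d <- dist(exposures, method = "euclidean")   # distance between samples
tree <- hclust(d, method = "average")        # agglomeration method
groups <- cutree(tree, k = 2)                # cut the dendrogram in two groups

# plot(tree) draws the dendrogram
```

Changing the method argument of either function (for example "manhattan" in dist or "ward.D2" in hclust) produces a different dendrogram.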

Fuzzy Clustering
In the Fuzzy clustering section, you can set the number of groups or let the algorithm estimate it (set groups to 1), and then click Run fuzzy to start the analysis.
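The pertinence levels produced by fuzzy clustering can be illustrated with a compact implementation of the standard Fuzzy C-Means updates in base R. This is a sketch under simple assumptions (fixed number of groups, Euclidean distance, a single run), not the code used by signeRFlow; the function name fuzzy_cmeans and the toy data are hypothetical.

```r
fuzzy_cmeans <- function(X, k, m = 2, n_iter = 100) {
  # X: samples in rows, features (exposures) in columns; m: fuzzifier
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (it in seq_len(n_iter)) {
    # Squared Euclidean distance from every sample to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    d2 <- pmax(d2, 1e-12)               # avoid division by zero
    # Membership update: closer centers receive higher membership
    w <- d2^(-1 / (m - 1))
    U <- w / rowSums(w)
    # Center update: membership-weighted mean of the samples
    Um <- U^m
    centers <- crossprod(Um, X) / colSums(Um)
  }
  list(membership = U, centers = centers)
}

set.seed(5)
X <- rbind(matrix(rnorm(20, mean = 0), 10, 2),
           matrix(rnorm(20, mean = 8), 10, 2))
res <- fuzzy_cmeans(X, k = 2)
round(res$membership, 2)
```

Each row of the membership matrix sums to 1. Averaging such matrices over the sampled exposure matrices would require matching cluster labels between runs, which this sketch does not attempt.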

Covariate
To perform a Covariate analysis in signeRFlow, you must upload clinical data. You can upload a file by clicking the Browse... button. Clinical data is a tab-delimited text file with a matrix of the available metadata (clinical and/or survival) for each sample, with samples in rows and features in columns. It must have a first column of sample IDs, named "SampleID", whose entries match the row names of the SNV matrix (2.1.1). The number and titles of the remaining columns are free; however, if survival data is included, it must be organized in a column named time (in months) and another named status (which contains 1 for death events and 0 for censored samples). The table below shows an example of the clinical data matrix structure.

Columns:
The first column must contain the sample ID. Other columns may contain sample groupings or other features that you would like to co-analyze with exposure data.

Rows:
Each row contains clinical information for one sample: its ID and all other data of interest.
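The rules above can be checked in R before uploading. The sketch below reads a small made-up table from text and reports any violation; the helper name check_clinical, the feature columns stage and age, and all values are hypothetical.

```r
clinical_txt <- "SampleID\tstage\tage\ttime\tstatus
S1\tI\t61\t24.5\t0
S2\tII\t70\t10.2\t1
S3\tI\t55\t36.0\t0"

clinical <- read.delim(text = clinical_txt, stringsAsFactors = FALSE)

# Hypothetical helper: returns a character vector of problems (empty if none)
check_clinical <- function(clinical, snv_sample_ids) {
  problems <- character(0)
  if (names(clinical)[1] != "SampleID")
    problems <- c(problems, "first column must be named SampleID")
  if (!all(clinical[[1]] %in% snv_sample_ids))
    problems <- c(problems, "some sample IDs are absent from the SNV matrix")
  has_surv <- c("time", "status") %in% names(clinical)
  if (any(has_surv)) {
    if (!all(has_surv)) {
      problems <- c(problems, "survival data needs both time and status columns")
    } else if (!all(clinical$status %in% c(0, 1))) {
      problems <- c(problems, "status must be 1 (death) or 0 (censored)")
    }
  }
  problems
}

check_clinical(clinical, snv_sample_ids = c("S1", "S2", "S3"))
```

For a file on disk, read.delim(path) reads the same tab-delimited format.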
After the upload, a description table summarizes the data, with the features in rows and the class, counts and number of missing values for each feature. When you select a feature (row) in the table, a small panel next to the table summarizes the values, categorical or continuous, of the selected feature.
Sample Classification: classifies samples based on their exposures to mutational processes.

Numeric features
Correlation Analysis: evaluates the correlation of exposure levels for each mutational signature to the provided feature.
Linear Regression: Builds a linear model of the provided feature based on exposure data.Evaluates the relevance of each signature exposure in the final model.
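Both analyses can be sketched on a single exposure matrix with base R. The toy data below are invented, with SigA constructed as a monotone function of age; signeR itself works on the exposure matrices sampled by the Gibbs sampler, which this point-estimate sketch does not reproduce.

```r
# Toy data: eight samples, a numeric feature and exposures to two signatures
age <- c(45, 52, 61, 67, 70, 74, 80, 58)
exposures <- cbind(SigA = age^2,
                   SigB = c(3, 1, 4, 1.5, 9, 2, 6, 5))

# Spearman's rank correlation of the feature with each signature
rho <- apply(exposures, 2, function(e) cor(age, e, method = "spearman"))

# Linear model of the feature based on the exposures
model <- lm(age ~ SigA + SigB, data = data.frame(age, exposures))
coefs <- coef(summary(model))   # estimates, standard errors and p-values
```

Because SigA increases strictly with age, its rank correlation is 1; the p-value column of coefs indicates the relevance of each signature in the fitted model.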

Survival feature
Survival Analysis: evaluates the effect on survival of the exposure levels to each signature.
Cox Regression: uses a Cox Proportional Hazards Model to evaluate the combined effect on survival of exposure levels to different signatures.
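A minimal sketch of both analyses using the survival package, which is distributed with R. The data are simulated, and the split of samples at the median exposure for the log-rank test is an assumption made for this illustration rather than the rule used by signeR.

```r
library(survival)

set.seed(11)
n <- 60
# Simulated exposures to two signatures and survival data in months
sig_a <- rgamma(n, shape = 2, rate = 0.02)
sig_b <- rgamma(n, shape = 2, rate = 0.02)
time <- rexp(n, rate = 0.05 * exp(-0.005 * sig_a))
status <- rbinom(n, 1, 0.7)   # 1 = death event, 0 = censored

# Survival analysis for one signature: log-rank test between
# samples with high and low exposure
group <- ifelse(sig_a > median(sig_a), "high", "low")
logrank <- survdiff(Surv(time, status) ~ group)

# Cox regression: combined effect of the exposures to both signatures
cox <- coxph(Surv(time, status) ~ sig_a + sig_b)
summary(cox)$coefficients
```

The coefficient table lists, for each signature, the log hazard ratio and its p-value, which is the kind of information a forest plot displays.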

Case study
The utility of the signeRFlow app was demonstrated by applying it to data from the stomach adenocarcinoma cohort available in TCGA (STAD, N=439). The mutational spectra of those samples were fitted to known mutational signatures found in COSMIC and previously described in this type of tumor (COSMIC v2; SBS1, SBS2, SBS5, SBS13, SBS15, SBS17, SBS18, SBS20, SBS21, SBS26 and SBS28). Some of the results are shown in the main paper and the others are described below.

Correlation analysis
signeR 2.0 can evaluate the correlation between continuous sample features and exposures to mutational signatures (ExposureCorrelation, Figure 16). For the STAD samples, the age at diagnosis was chosen as an example. Exposures to COSMIC signatures SBS1, 5, 15, 20 and 26 showed significant Spearman's rank correlation with age, which can be explained by their etiology: SBS1 is related to an endogenous mutational process and is expected to correlate with age; SBS15, 20 and 26 are associated with defective DNA mismatch repair.

Linear regression
The software can also build generalized linear models of the studied features based on exposures to mutational processes, exporting the relevance of each signature in the model (ExposureGLM, Figure 17). The age at diagnosis was also used as an example here; however, linear models based on the exposures were not able to predict patients' age satisfactorily.

Survival analysis
Survival data are extremely valuable in cancer studies. If they are available, the effects of exposures on overall survival can be analyzed for each signature. Log-rank tests and Cox proportional hazards models are implemented and ready to be applied to the exposure data of each signature (ExposureSurv, Figure 18). Using the signeRFlow interface to select data before running the tests, we restricted the survival analysis of TCGA STAD samples to the ones classified as MSI-High. Among those, exposures to COSMIC signatures SBS1, 5, 15, 20, 21 and 26 were found to be significantly related to better overall survival.

Cox Regression
The combined effect of signature exposures on overall survival can also be assessed by a multivariate Cox Proportional Hazards model. As can be seen in Figure 19, when all signatures are considered together, none shows a significant impact on survival.

Computational Performance
In this study, we benchmarked signeR 2.0 against the previous version, signeR 1.2 (Bioconductor 3.17). We evaluated the computational efficiency of the two software versions to demonstrate the improvements brought by signeR 2.0. We ran a de novo analysis with the set of possible ranks for the NMF factorization (number of signatures) defined as {2, 3, ..., 11}. This is set via the parameter nlim. The remaining parameters were left at their default values. Three datasets were considered here:
genCountMatrixFromVcf genCountMatrixFromMAF

Figure 3: Opportunity matrix or BED upload.

Figure 16: Scatter plots of exposures vs age at diagnosis.

Figure 17: P-values boxplots showing the relevance of each signature's exposures to the modeling of age at diagnosis.

Figure 19: Forest plot showing the effect on survival of exposure levels for each signature.

Table 1: SNV matrix example.
C>A:ACA C>A:ACC C>A:ACG C>A:ACT C>A:CCA ... T>G:TTT

See the documentation for more details. Also, you must have installed the needed genomes from the BSgenome package in order to use the VCF or MAF upload option.

Table 3: Previous known signatures example

Table 4: Clinical data matrix example

These findings are in accordance with both the literature and the previous result (age correlation test): apparently, exposures to signatures SBS1, 5, 15, 20 and 26, age at diagnosis and overall survival are closely related.