MSA: reproducible mutational signature attribution with confidence based on simulations

Background: Mutational signatures have proved to be a useful tool for identifying patterns of mutations in genomes, often providing valuable insights about mutagenic processes or normal DNA damage. De novo extraction of signatures is commonly performed using Non-Negative Matrix Factorisation methods; however, accurate attribution of these signatures to individual samples is a distinct problem requiring uncertainty estimation, particularly in noisy scenarios or when the acting signatures have similar shapes. Whilst many packages for signature attribution exist, few provide accuracy measures, and most are not easily reproducible nor scalable in high-performance computing environments.

Results: We present Mutational Signature Attribution (MSA), a reproducible pipeline designed to assign signatures of different mutation types on a single-sample basis, using the Non-Negative Least Squares method with optimisation based on configurable simulations. Parametric bootstrap is proposed as a way to measure statistical uncertainties of signature attribution. Supported mutation types include single and doublet base substitutions, indels and structural variants. Results are validated using simulations with reference COSMIC signatures, as well as randomly generated signatures.

Conclusions: MSA is a tool for optimised mutational signature attribution based on simulations, providing confidence intervals using parametric bootstrap. It comprises a set of Python scripts unified in a single Nextflow pipeline with containerisation for cross-platform reproducibility and scalability in high-performance computing environments. The tool is publicly available from https://gitlab.com/s.senkin/MSA.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04450-8.
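The single-sample NNLS attribution described above can be sketched as follows. This is a minimal illustration with a toy signature catalogue and Poisson-simulated counts; the variable names and shapes are assumptions for the example, not MSA's actual API.

```python
# Hypothetical sketch of single-sample signature attribution via NNLS.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Toy catalogue: 96 trinucleotide channels x 3 signatures (columns sum to 1)
signatures = rng.random((96, 3))
signatures /= signatures.sum(axis=0)

# Simulate one sample: known activities plus Poisson noise
true_activities = np.array([500.0, 200.0, 0.0])
spectrum = rng.poisson(signatures @ true_activities).astype(float)

# Non-Negative Least Squares: activities a >= 0 minimising ||S a - v||
activities, residual = nnls(signatures, spectrum)
print(activities)  # estimated per-signature mutation counts
```

The non-negativity constraint is what makes NNLS a natural fit here: signature activities are mutation counts and cannot be negative.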


Comparison of MSA performance with existing tools
Performance of MSA was benchmarked against existing tools using both real and synthetic data published by the PCAWG consortium, available at the following Synapse link: https://www.synapse.org/#!Synapse:syn11726601/wiki/513478. Comparisons of signature attributions extracted from 2,780 PCAWG cancer genomes are shown in Figure 1 using various similarities between the original and reconstructed mutational spectra. The performance appears comparable across the tools; however, since the ground truth is not known, it is difficult to determine which tool performs best on real data.
Benchmarking can be more informative when using synthetic data where the ground truth is known. At first, the performance was tested using simulations mimicking real PCAWG data, available at https://www.synapse.org/#!Synapse:syn18500213. These simulations include 2,700 synthetic whole-genome mutational spectra, with 300 spectra from each of 9 cancer types of the PCAWG dataset, for single base substitutions in the 96 trinucleotide context generated from COSMIC reference signatures.
The performance was also tested on simulations of randomly generated signatures and exposures, available at https://www.synapse.org/#!Synapse:syn18500221.
Along with the simulated spectra and the ground truth signature activities, the PCAWG consortium shared the results of signature extraction and attribution for the SigProfilerExtractor and SignatureAnalyzer tools, which are compared to MSA run on all available or SigProfiler-extracted signatures.
For simulations mimicking real PCAWG data and based on COSMIC signatures, Figure 2 (a) shows the benchmarking results using various metrics, including sensitivity, specificity, precision, accuracy, F1 and MCC (Matthews correlation coefficient). Figure 2 (b) shows the similarities of the reconstructed mutation spectra with the ground truth spectra using cosine similarity and other metrics. It has to be noted that such comparison is aimed only at the evaluation of signature attribution performance, rather than signature extraction. Both SigProfiler and SignatureAnalyzer extract signatures de novo, and then decompose or match them to the reference signatures. On the other hand, MSA only performs signature attribution with a given signature catalogue, either the whole COSMIC reference catalogue (shown as MSA_allsigs in the plots), or a catalogue given by SigProfiler decomposition, shown as MSA_SP. When randomly generated signatures are used, the MSA_allsigs label refers to results obtained with the catalogue of all 30 randomly generated signatures.
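The per-sample metrics listed above all derive from the confusion matrix of attributed versus truly active signatures. A generic sketch of their computation (function and argument names are illustrative, not taken from the MSA codebase):

```python
# Standard binary-classification metrics from true/false positives/negatives.
import math

def attribution_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "F1": f1, "MCC": mcc}

# Example confusion matrix (made-up counts for illustration)
print(attribution_metrics(tp=80, fp=10, tn=95, fn=15))
```

MCC is often the most informative single number here, since the active/inactive signature classes are typically imbalanced within a sample.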
In the case of simulations based on COSMIC signatures, the best performance is achieved with MSA attribution based on the signatures taken from the SigProfiler decomposition, i.e. MSA run on the SigProfiler output, owing to the native integration of the two tools. However, MSA attribution based on the whole COSMIC catalogue shows comparable performance, because the simulations were made only with signatures taken from the reference catalogue. Therefore, the recommended way to run MSA is on the SigProfiler output, as it ensures that any novel signatures that may be present in the cohort are accurately attributed.
In the case of simulations based on randomly generated signatures, MSA shows the best performance when the whole catalogue of randomly generated signatures is used (MSA_allsigs label). However, as the full set of reference signatures is generally not known, the real-life scenario corresponds to MSA attributions based on de novo extracted signatures (MSA_SP label).
The reason why MSA attribution generally achieves higher performance metrics than the other tools is its automated optimisation of the applied penalties, coupled with the application of confidence intervals. SigProfiler uses fixed penalties of 0.01/0.05, which appear to be too conservative in many cases, with the exception of scenarios with randomly generated Poisson signatures that are difficult to discern from noise and from each other. SignatureAnalyzer, on the other hand, tends to overfit the data, as activities of virtually all signatures are attributed in each sample, i.e. no regularisation appears to be implemented. Expectedly, this leads to very high sensitivity, but also low specificity.
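The confidence intervals mentioned above come from a parametric bootstrap: resampling synthetic spectra from the fitted model and refitting. A minimal sketch of the idea, assuming Poisson-distributed mutation counts and reusing plain NNLS for each refit (this mirrors the approach described in the text, not MSA's actual implementation or parameter names):

```python
# Parametric-bootstrap confidence intervals for NNLS signature activities.
import numpy as np
from scipy.optimize import nnls

def bootstrap_intervals(signatures, spectrum, n_boot=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    point, _ = nnls(signatures, spectrum)        # point estimate
    reconstructed = signatures @ point           # fitted (parametric) model
    boot = np.empty((n_boot, signatures.shape[1]))
    for b in range(n_boot):
        # Resample a synthetic spectrum from the fitted model and refit
        resampled = rng.poisson(reconstructed).astype(float)
        boot[b], _ = nnls(signatures, resampled)
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    return point, lower, upper

rng = np.random.default_rng(1)
S = rng.random((96, 3))
S /= S.sum(axis=0)
v = rng.poisson(S @ np.array([400.0, 100.0, 0.0])).astype(float)
est, lo, hi = bootstrap_intervals(S, v)
print(est, lo, hi)
```

Signatures whose interval includes zero can then be treated as unsupported by the data and pruned, which is how interval estimates translate into better specificity than an unpenalised fit.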
Similarity metrics were calculated from the normalised reconstructed and original mutation spectra $x$ and $y$, defined as follows. Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$. Jensen-Shannon divergence: $\mathrm{JSD}(x \| y) = \frac{1}{2} D(x \| m) + \frac{1}{2} D(y \| m)$, where $m = \frac{1}{2}(x + y)$ and $D$ is the Kullback-Leibler divergence (aka relative entropy) for two probability distributions: $D(x \| y) = \sum_i x_i \log \frac{x_i}{y_i}$.
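The metrics above can be computed directly for two normalised spectra. A short sketch (the small epsilon guarding against log of zero is an assumption for the example, not stated in the text):

```python
# Similarity metrics between two normalised mutation spectra.
import numpy as np

def cosine_similarity(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def kl_divergence(x, y, eps=1e-12):
    # Kullback-Leibler divergence D(x || y) = sum_i x_i log(x_i / y_i)
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(x * np.log((x + eps) / (y + eps))))

def js_divergence(x, y):
    # Jensen-Shannon divergence via the mixture m = (x + y) / 2
    m = 0.5 * (np.asarray(x, float) + np.asarray(y, float))
    return 0.5 * kl_divergence(x, m) + 0.5 * kl_divergence(y, m)

x = np.array([0.5, 0.3, 0.2])
print(cosine_similarity(x, x))  # identical spectra give similarity 1.0
print(js_divergence(x, x))      # identical spectra give divergence 0.0
```

Unlike KL divergence, the Jensen-Shannon form is symmetric and always finite, which makes it better behaved when a reconstructed spectrum has empty channels.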