Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator
BMC Bioinformatics volume 21, Article number: 438 (2020)
Performing a statistical test requires a null hypothesis. In cancer genomics, a key challenge is the fast generation of accurate somatic mutational landscapes that can be used as a realistic null hypothesis for making biological discoveries.
Here we present SigProfilerSimulator, a powerful tool that is capable of simulating the mutational landscapes of thousands of cancer genomes at different resolutions within seconds. Applying SigProfilerSimulator to 2144 whole-genome sequenced cancers reveals: (i) that most doublet base substitutions are not due to two adjacent single base substitutions but likely occur as single genomic events; (ii) that an extended sequencing context of ± 2 bp is required to more completely capture the patterns of substitution mutational signatures in human cancer; (iii) information on false-positive discovery rate of commonly used bioinformatics tools for detecting driver genes.
SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis and use it for evaluating the accuracy of other bioinformatics tools or for downstream statistical analysis for biological discoveries. SigProfilerSimulator is freely available at https://github.com/AlexandrovLab/SigProfilerSimulator with an extensive documentation at https://osf.io/usxjz/wiki/home/.
Performing a statistical evaluation to determine whether an observation is seen by chance necessitates the construction of a null hypothesis corresponding with the expected default position. An observation is generally considered statistically significant if it reflects an unlikely outcome of the null hypothesis. In most practical applications, observations seen in less than 5% of outcomes from a null distribution are considered statistically significant.
Large-scale computational analyses of cancer genomes use background mutational models to evaluate driver mutations [1,2,3,4,5,6], mutational signatures , and topographical accumulation of somatic mutations . In almost all cases, a null hypothesis model of the background mutation rate is implicitly incorporated into a bioinformatics tool [6, 9, 10] and used to report statistically significant results. Here we present SigProfilerSimulator, a computationally efficient bioinformatics tool for generating sample specific mutational landscapes that match the mutational signatures operative in each sample (Fig. 1a). SigProfilerSimulator provides a framework for generating a background mutational model for downstream statistical analyses and hypothesis testing. The tool supports generation of simulated single base substitutions (SBSs), small insertions and deletions (IDs), and doublet base substitutions (DBSs) while maintaining their patterns at different resolutions. SigProfilerSimulator is available as both a Python and an R package, provides support for commonly used data formats, and is extensively documented. To demonstrate the wide applicability of SigProfilerSimulator, we illustrate its basic functionality using a single cancer genome and then apply the tool to 2144 whole-genome sequenced cancers and to 1,024 whole-exome sequenced breast cancers to address three different questions in cancer genomics.
The mutational pattern of a cancer genome can be described using distinct classification schemes reflecting the activity of mutational processes at different resolutions . For example, single base substitutions can be described using only the mutated base-pair (6 possible mutational channels; known as SBS-6 classification), or the mutated base-pair with ± 1 bp context (SBS-96), or the mutated base-pair with ± 2 bp context (SBS-1536), or the mutated base-pair with ± 3 bp context (SBS-24576), etc. (Fig. 1B) . Each of these classifications can be subsequently elaborated by considering additional features [11, 12]. For example, SBS-24 extends the SBS-6 classification by including four subtypes for the six possible single base substitutions: substitutions are first split into ones in non-transcribed/intergenic regions and ones in genic regions; substitutions in genetic regions are further subclassified as ones occurring on the transcribed strand, untranscribed strand, or in regions of bi-directional transcription . Similarly, the SBS-384 and SBS-6144 classifications extend SBS-96 and SBS-1536, respectively, by subclassifying each mutational channel into four: non-transcribed, transcribed, untranscribed, and bi-directional . Note that, conventionally, these classifications have been displayed using the mutations only on the transcribed and untranscribed strands (e.g., 192 channel depiction for SBS-384) [7, 9, 13] since, historically, mutational patterns have been predominately investigated in whole-exome sequenced samples that provide little information about mutations outside of transcribed protein coding regions.
By preserving the pattern of mutations at a preselected resolution, SigProfilerSimulator converts a set of real somatic mutations from a cancer genome into another set of randomly generated somatic mutations (Fig. 1a). Maintaining the mutational pattern provides an assurance that the same mutational processes are observed in both the real and the simulated cancer genome. By default, the tool projects these mutations as statistically independent events onto each chromosome by proportionately assigning mutations based on the observed rate of each mutational channel across the complete length of a preselected reference genome. The number of mutations is proportionally assigned to each chromosome based on the number of mutational channels (e.g., 96 channels reflecting trinucleotides) found on that chromosome. The tool also provides a variety of custom options for simulating mutations, including: (i) gender of the sample allowing appropriate incorporation of sex chromosomes; (ii) transcriptional strand bias allowing accurate distribution of mutations to account for the activity of transcription-coupled nucleotide excision repair; (iii) considering mutations as dependent sequential events where each mutation updates the observed rate of a mutational channel for a preselected reference genome, e.g., a C > T mutation at ACT trinucleotide will reduce the number of ACT trinucleotides in the reference genome by one and increase the number of ATT trinucleotides in the reference genome by one, thus, each mutation will modify the overall observed rate of a mutational channel in the genome and affect subsequent mutations; (iv) preserving mutational burden and mutational patterns for each chromosome instead of the complete genome, thus, the number and type of mutations assigned to each chromosome match exactly the ones observed in the original sample (Additional file 1: Fig. 1); (v) exome simulations that generate mutations only in the protein coding regions of the genome; (vi) adding Poisson noise to the number of mutations in each mutational channel of the original data; (vii) allowing the use of a probability mask that can decrease or increase the opportunity for mutations in certain parts of the genome (Additional file 1: Fig. 2); and several other options. Additionally, one can simulate the germline variants in a (matched-)normal sample(s), which can be used for subsequent comparisons against tumor samples. With this collection of features, one can easily tailor an appropriate background mutational model for testing different biological hypotheses or for evaluating existing bioinformatics tools. Importantly, SigProfilerSimulator is computationally efficient. For example, the tool can simulate ~ 37 million somatic mutations found in the 2144 whole-genome sequenced cancers generated by Pan-cancer Analysis of Whole Genomes (PCAWG) initiative  within 90 s.
To illustrate several of SigProfilerSimulator’s features, we provide a detailed visualization for a single TCGA melanoma sample: TCGA-DA-A-A1I8. Simulating TCGA-DA-A-A1I8 using the SBS-6 classification maintains the original sample’s pattern for the six possible types of single base mutations, however, it also results in completely different patterns for classifications at higher resolutions (Fig. 1c). Simulating an extended sequence context (SBS-96; trinucleotides) results in a perfect match with the original landscape when including ± 1 adjacent bases; however, it does not reflect the transcriptional strand bias observed in the sample (Fig. 1c). As such, one can further elaborate these simulations by incorporating transcriptional strand bias (Fig. 1c), by considering ± 2 adjacent bases (Additional file 1: Fig. 1), or by preserving the mutational burden and mutational patterns on each chromosome (Additional file 1: Fig. 1). Similarly, simulations can be performed for the different classification types for small insertions and deletions (ID-83 and ID-415; Fig. 1D). Each of these simulations can be subsequently used to test different hypotheses. To demonstrate this capability, we applied SigProfilerSimulator to three questions in cancer genomics.
First, we used simulations to evaluate whether doublet base substitutions (e.g., CC:GG > TT:AA mutations) are two subsequent single base substitutions occurring simply by chance in adjacent genomic positions. We constructed a null hypothesis by applying the tool to the 2144 PCAWG cancer genomes. Simulations were performed considering SBSs as both statistically independent events (non-updating—simulating with replacement; each mutation has no effect on the observed rate of mutational channels) and dependent events (updating—simulating without replacement; each mutation updates the observed rate of mutational channels). Each sample was simulated 1000 times providing a distribution of doublet base substitutions. After simulating the SBS-96 context for each PCAWG sample, we examined the number of single base substitutions occurring next to one another on the genome simply by chance. For example, in the sample SP99325 (LIRI), we observed on average approximately 23 pairs of adjacent SBSs when considering mutations as statistically independent events and 14 pairs of adjacent SBSs when considering mutations as dependent events (Fig. 2a). In contrast, the actual sample contains 303 doublet base substitutions indicating a 22-fold and a 13-fold enrichment compared to the null hypothesis, respectively. The results indicate that it is highly unlikely that the majority of observed doublet base substitutions in SP99325 are the result of two adjacent SBS events. Applying the same approach to all PCAWG samples reveals between 10- and 1000-fold increase of the real number of DBSs compared to simulated data (Fig. 2b; Additional file 1: Fig. 3). These results confirm the belief that the vast majority of doublet base substitutions in human cancer are not due to adjacent single base substitutions. Rather, doublet base substitutions are likely due either to single genomic events or to higher mutagenic propensities of certain regions of the human genome. Indeed, we recently derived mutational signatures of doublet base substitutions across the PCAWG dataset . Nevertheless, it is important to remember that, especially for hyper-mutated samples, some of the observed DBSs may be due to having two single base substitutions occurring by chance in adjacent positions (Fig. 2a).
Second, we evaluated whether incorporating additional sequence context 5′ and 3′ of single base substitutions increases the specificity of the mutational patterns observed in cancer genomes . Here, we considered two mutational patterns to be the same if their cosine similarity is more than 0.85 (Additional file 1: Fig. 4; Methods). Specifically, we simulated the PCAWG dataset at different resolutions (viz., SBS-6, SBS-96, and SBS-1536; Fig. 1B) and compared them to the patterns of mutations observed in the real samples (Fig. 2c,d). Comparing the ± 2 bp context of data simulated using SBS-6 to the ± 2 bp context of the real data demonstrated that for almost all samples the SBS-6 simulations do not capture the ± 2 bp context as 91% of samples exhibited a cosine similarity below 0.85. Similarly, only half of the samples simulated using SBS-96 (i.e., ± 1 bp) had consistent ± 2 bp context when compared to the real data (44% below 0.85; Fig. 2C). This demonstrates that the mutational patterns of the examined cancer genomes exhibit additional specificity for ± 2 bp adjacent to single base substitutions; note that ± 2 bp contains within itself the ± 1 bp classification. In contrast, comparing the ± 3 bp context of data simulated using SBS-1536 demonstrated that the ± − 2 bp context captures the patterns observed at ± 3 bp for almost all samples (only 6.5% of samples below 0.85; Fig. 2d). Overall, these results suggest that the SBS-1536 classification is necessary to capture additional information for a set of signatures beyond SBS-6 and SBS-96. Moreover, extending this classification to ± 3 bp (SBS-24576) is likely not necessary as the SBS-1536 classification already captures the patterns of ± 3 bp for majority of the examined cancer samples.
Third, we evaluated the false-positive rates of tools commonly used for discovery of cancer driver genes. More specifically, we simulated the somatic mutations observed in the 1024 whole-exome sequenced breast cancers reported in the TCGA MC3 release . The simulations were repeated 100 times and each of these 100 repetitions was analyzed for driver genes using MutSigCV1.41 and MutSigCV2  as well as dNdScv . In principle, since SigProfilerSimulator randomly shuffles somatic mutations, one would not expect to find any genes under selection. However, each of the tools found significantly mutated genes within the simulations using the recommended cutoff threshold of q-value < 0.10 (Fig. 2E). On average MutSig1.41CV found between 1.3 and 1.6 false-positive driver genes per simulation when examining data generated using the SBS-384 and SBS-6144 mutational classifications, respectively. In contrast, MutSig2CV found between 0.3 and 0.2 false-positive driver genes per simulation using SBS-384 and SBS-6144, respectively. Lastly, dNdScv found between 0.03 and 0.02 false-positive driver genes per simulation using SBS-384 and SBS-6144, respectively. Note that by chance, when using a q-value cutoff of 0.1, one would expect to observe less than 0.1 false-positive driver genes per simulation. Lowering the threshold for statistical significance to 0.01 eliminates all false-positive results from dNdScv and MutSig2CV but not for MutSig1.41CV.
Increasingly, there is a need to develop reliable background models of cancer mutational landscapes to allow downstream statistical analysis for biological discoveries. Currently, to the best of our knowledge, there is no tool that allows explicitly simulating accurate background mutational landscapes. This report presents SigProfilerSimulator, a method that allows fast generation of mutational landscapes at different resolutions. As demonstrated by our analyses, SigProfilerSimulator can be used to evaluate the accuracy of other bioinformatics tools or it can be leveraged for making novel discoveries. SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis of mutational landscapes and to identify significance levels of the subsequent results. Overall, SigProfilerSimulator will be a useful tool for any researcher that performs statistical analysis based on mutational data derived from the sequencing of cancer or normal somatic tissues.
SigProfilerSimulator is developed as a computationally efficient Python package and it is available for installation through PyPI. Further, an R-wrapper is available through GitHub. The tool leverages a PCG random number generator that provides a simple, fast, and space-efficient algorithm for generating random numbers with high statistical quality . The tool uses a Monte Carlo approach for randomly generating somatic mutations while considering the observed frequency of a preselected reference genome. More specifically, SigProfilerSimulator randomly shuffles mutations by using the precomputed observed rates of mutational channels in a reference genome. The tool works in unison with SigProfilerMatrixGenerator  to first classify a catalog of somatic mutations prior to simulating it. The final mutational catalog is outputted into commonly used mutation data formats including mutation annotation format (MAF) files and variant annotation format (VCF) files. SigProfilerSimulator is freely available and has been extensively documented.
The computational efficiency of SigProfilerSimulator was benchmarked by simulating the freely available PCAWG dataset, consisting of 2,144 samples with 36,876,213 single base substitutions, for a single iteration using the default parameters. Simulating the complete dataset took approximately 90 s. Simulations were performed on a dedicated computational node with a dual Intel® Xeon® Gold 6132 Processors (19.25 M Cache, 2.60 GHz) and 192 GB of shared DDR4-2666 RAM.
Analysis of doublet base substitutions
We simulated the PCAWG dataset using the SBS-96 classification. Each simulation was performed 1,000 times considering mutations as both statistically independent events (non-updating; each mutation has no effect on the observed rate of mutational channels) and dependent events (updating; each mutation updates the observed rate of mutational channels). To calculate the number of DBS mutations occurring by chance in each sample, we generated the mutational catalogs for DBS-78 using SigProfilerMatrixGenerator . The resulting counts for DBSs were used to plot the distributions of the expected number of DBSs due to two adjacent SBSs happening purely by chance. The fold change was calculated by dividing the mean DBS count observed across the simulations by the total number of DBSs found in the original sample. Derivation of q-values was performed by applying the Benjamini and Hochberg false discovery rate correction to p-values calculated using z-tests based on the DBS distributions found in the simulations and the numbers of DBSs observed in the real data.
Sequence context analysis for mutational signatures
The PCAWG dataset was simulated using the SBS-6, SBS-96, and SBS-1536 classifications while ensuring the respective mutational patterns and mutational burdens on each chromosome match the ones observed in the real data. SigProfilerMatrixGenerator was used to derive the mutational vectors for each sample including vectors incorporating three bases 5′ and three bases 3′ of each mutation, resulting in a classification with 24,576 mutational channels. To avoid comparisons of sparse binary vectors, only samples that had at least 2 mutations per mutational channel were included in the comparative analyses. The simulated and real mutational patterns of a cancer genome were considered the same if their cosine similarity was at least 0.85. Note that the average cosine similarity between two random nonnegative vectors is 0.75 (Additional file 1: Fig. 4). The chance of two nonnegative vectors with 1,536 mutational channels or 24,576 mutational channels to have a similarity of 0.85 simply by chance is less than 10–6 (Additional file 1: Fig. 4).
Benchmarking false-positive driver genes detected by MutSigCV and dNdScv
All whole-exome sequenced breast cancer samples part of the TCGA MC3 release were simulated using SBS-384 and SBS-6144 contexts while maintaining the mutational burden on each chromosome. As recommended , 23 samples with more than 500 exonic mutations were excluded from the analysis. Each simulation was repeated 100 times with different random seeds. The variant annotation predictor  was used to annotate simulated mutations with the appropriate gene name for compatibility with MutSigCV1.41 and MutSigCV2 . We ran MutSigCV1.41 and MutSigCV2 using the recommended default parameters in conjunction with the genome reference sequence for hg19, mutation dictionary file, exome coverage file, and gene covariates file as found at https://software.broadinstitute.org/cancer/cga/mutsig_run. We ran dNdScv  using the default library parameters and filtered out the significant genes using the recommended q-value cutoff of less than 0.10. All rainfall plots were generated using karyoploteR .
Availability and requirements
Project name: SigProfilerSimulator
Project home page: https://github.com/AlexandrovLab/SigProfilerSimulator
Operating system(s): Unix, Linux, and Windows
Programming language: Python 3; R wrapper
Other requirements: None
License: BSD 2-Clause "Simplified" License
Any restrictions to use by non-academics: None
Availability of data and materials
No novel data were generated as part of this study. All source code is freely available and can be downloaded from the links below. Python code: https://github.com/AlexandrovLab/SigProfilerSimulator. R wrapper: https://github.com/AlexandrovLab/SigProfilerSimulatorR. Documentation: https://osf.io/usxjz/wiki/home/
Doublet base substitutions
Small insertions and deletions
Liver cancer—hepatocellular carcinoma (virus associated)
Pan-cancer analysis of whole genomes
Single base substitutions
The cancer genome atlas
Youn A, Simon R. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics. 2011;27(2):175–81.
Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314(5797):268–74.
Greenman C, Wooster R, Futreal PA, Stratton MR, Easton DF. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics. 2006;173(4):2187–98.
Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–75.
Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, et al. Diverse somatic mutation patterns and pathway alterations in human cancers. Nature. 2010;466(7308):869–73.
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8.
Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Borresen-Dale AL, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.
Morganella S, Alexandrov LB, Glodzik D, Zou X, Davies H, Staaf J, Sieuwerts AM, Brinkman AB, Martin S, Ramakrishna M, et al. The topography of mutational processes in breast cancer genomes. Nat Commun. 2016;7:11383.
Alexandrov LB, Nik-Zainal S, Wedge DC, Campbell PJ, Stratton MR. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013;3(1):246–59.
Martincorena I, Raine KM, Gerstung M, Dawson KJ, Haase K, Van Loo P, Davies H, Stratton MR, Campbell PJ. Universal patterns of selection in cancer and somatic tissues. Cell. 2017;171(5):1029–41.
Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics. 2019;20(1):685.
Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578(7793):94–101.
Alexandrov LB, Stratton MR. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr Opin Genet Dev. 2014;24:52–60.
Consortium ITP-CAoWG: Pan-cancer analysis of whole genomes. Nature 2020, 578(7793):82–93.
Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan M, et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 2018;6(3):271.
Lemire D. Fast random integer generation in an interval. ACM Trans Model Comput Simul. 2019;29(1):1–12.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17(1):122.
Gel B, Serra E. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics. 2017;33(19):3088–90.
The authors thank M. Lawrence for sharing the newest code for MutSig2CV. The authors also thank M. Zhivagui for designing the SigProfiler logo.
This work was supported by Cancer Research UK Grand Challenge Award C98/A24032. LBA is an Abeloff V scholar and he is personally supported by an Alfred P. Sloan Research Fellowship and a Packard Fellowship for Science and Engineering. The funders had no roles in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
. Figure 1. Example of an additional resolution for simulating mutational patterns supported by SigProfilerSimulator. The example illustrates the resulting patterns when maintaining the mutational burden on each chromosome and when only relying on proportionate allocation based upon the nucleotide context distribution of the reference genome. Comparison is provided for a single breast cancer sample simulated at an SBS-1536 resolution. Figure 2. Simulating cancer genomics data using a probability mask. An example rainfall plot visualization when simulating a single TCGA melanoma sample, TCGA-DA-A-A1I8, with and without a probability mask on chromosome 2. A) Distribution of single base substitutions across chromosome 2 as found in the original sample. B) Distribution of single base substitutions across chromosome 2 when simulating the sample with default parameters. C) Distribution of single base substitutions across chromosome 2 when simulating the sample using a probability mask with 90% probability for mutations on the p arm and a 10% on the q arm. D) Distribution of single base substitutions across chromosome 2 when simulating the sample with a probability mask that varies in weights across the chromosome. All rainfall plots generated using karyoploteR . Y-axes reflect log-scaled distances between adjacent mutations. X-axes reflect positions on chromosome 2 in TCGA-DA-A-A1I8. Each dot reflects a single base substitution colored using the default coloring scheme of karyoploteR. Figure 3. Evaluating the expected rates of DBSs for mutations simulated as dependent events. The fold increase of DBSs observed in the original PCAWG samples and the average number of DBSs observed in our simulations. The mutational pattern of each sample was generated 1000 times considering somatic mutations as dependent events. Figure 4. Evaluating the average similarity of random nonnegative vectors. A) Comparing the cosine similarities amongst 10,000 randomly generated nonnegative vectors, where each vector has 1536 mutational channels. B) Comparing the cosine similarities amongst 10,000 randomly generated nonnegative vectors, where each vector has 24,576 mutational channels.
About this article
Cite this article
Bergstrom, E.N., Barnes, M., Martincorena, I. et al. Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator. BMC Bioinformatics 21, 438 (2020). https://doi.org/10.1186/s12859-020-03772-3
- Somatic mutations
- Mutational patterns
- Mutational signatures