Skip to main content

Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator

Abstract

Background

Performing a statistical test requires a null hypothesis. In cancer genomics, a key challenge is the fast generation of accurate somatic mutational landscapes that can be used as a realistic null hypothesis for making biological discoveries.

Results

Here we present SigProfilerSimulator, a powerful tool that is capable of simulating the mutational landscapes of thousands of cancer genomes at different resolutions within seconds. Applying SigProfilerSimulator to 2144 whole-genome sequenced cancers reveals: (i) that most doublet base substitutions are not due to two adjacent single base substitutions but likely occur as single genomic events; (ii) that an extended sequencing context of ± 2 bp is required to more completely capture the patterns of substitution mutational signatures in human cancer; (iii) information on false-positive discovery rate of commonly used bioinformatics tools for detecting driver genes.

Conclusions

SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis and use it for evaluating the accuracy of other bioinformatics tools or for downstream statistical analysis for biological discoveries. SigProfilerSimulator is freely available at https://github.com/AlexandrovLab/SigProfilerSimulator with an extensive documentation at https://osf.io/usxjz/wiki/home/.

Background

Performing a statistical evaluation to determine whether an observation is seen by chance necessitates the construction of a null hypothesis corresponding with the expected default position. An observation is generally considered statistically significant if it reflects an unlikely outcome of the null hypothesis. In most practical applications, observations seen in less than 5% of outcomes from a null distribution are considered statistically significant.

Large-scale computational analyses of cancer genomes use background mutational models to evaluate driver mutations [1,2,3,4,5,6], mutational signatures [7], and topographical accumulation of somatic mutations [8]. In almost all cases, a null hypothesis model of the background mutation rate is implicitly incorporated into a bioinformatics tool [6, 9, 10] and used to report statistically significant results. Here we present SigProfilerSimulator, a computationally efficient bioinformatics tool for generating sample specific mutational landscapes that match the mutational signatures operative in each sample (Fig. 1a). SigProfilerSimulator provides a framework for generating a background mutational model for downstream statistical analyses and hypothesis testing. The tool supports generation of simulated single base substitutions (SBSs), small insertions and deletions (IDs), and doublet base substitutions (DBSs) while maintaining their patterns at different resolutions. SigProfilerSimulator is available as both a Python and an R package, provides support for commonly used data formats, and is extensively documented. To demonstrate the wide applicability of SigProfilerSimulator, we illustrate its basic functionality using a single cancer genome and then apply the tool to 2144 whole-genome sequenced cancers and to 1,024 whole-exome sequenced breast cancers to address three different questions in cancer genomics.

Fig. 1
figure1

High-level overview illustrating the functionality of SigProfilerSimulator. a Schematic depiction of SigProfilerSimulator’s general functionality. The tool transforms the real somatic mutational catalog of a cancer genome into a simulated mutational catalog, while maintaining the mutational burden and the mutational pattern at a preselected resolution. b Summary of SBS classifications with corresponding examples at each resolution. The guide summarizes how the number of mutational channels is derived for each classification. c Comparing the simulated catalogues of a single cancer genome at different resolutions. Adding additional sequence context to a simulation creates a more specific and complex mutational model (e.g., SBS-96 provides greater resolution than SBS-6). Similarly, one can preserve the number of mutations in both genic and intergenic regions as well as the transcriptional strand bias by simulating with either SBS-24, SBS-384, or SBS-6144 classifications. Simulating a more complex classification of the data results in matching catalogs for all collapsed versions of the higher matrix (i.e., simulating SBS-384 ensures that the SBS-6, SBS-24, and SBS-96 simulated catalogs match the original data). d Comparing the simulated catalogues of a single cancer genome at different resolutions for small insertions and deletions. One can preserve the number of small insertions and deletions in genic and intergenic regions as well as the transcriptional strand bias by simulating the ID-415 classification

Implementation

The mutational pattern of a cancer genome can be described using distinct classification schemes reflecting the activity of mutational processes at different resolutions [11]. For example, single base substitutions can be described using only the mutated base-pair (6 possible mutational channels; known as SBS-6 classification), or the mutated base-pair with ± 1 bp context (SBS-96), or the mutated base-pair with ± 2 bp context (SBS-1536), or the mutated base-pair with ± 3 bp context (SBS-24576), etc. (Fig. 1B) [12]. Each of these classifications can be subsequently elaborated by considering additional features [11, 12]. For example, SBS-24 extends the SBS-6 classification by including four subtypes for the six possible single base substitutions: substitutions are first split into ones in non-transcribed/intergenic regions and ones in genic regions; substitutions in genetic regions are further subclassified as ones occurring on the transcribed strand, untranscribed strand, or in regions of bi-directional transcription [11]. Similarly, the SBS-384 and SBS-6144 classifications extend SBS-96 and SBS-1536, respectively, by subclassifying each mutational channel into four: non-transcribed, transcribed, untranscribed, and bi-directional [11]. Note that, conventionally, these classifications have been displayed using the mutations only on the transcribed and untranscribed strands (e.g., 192 channel depiction for SBS-384) [7, 9, 13] since, historically, mutational patterns have been predominately investigated in whole-exome sequenced samples that provide little information about mutations outside of transcribed protein coding regions.

By preserving the pattern of mutations at a preselected resolution, SigProfilerSimulator converts a set of real somatic mutations from a cancer genome into another set of randomly generated somatic mutations (Fig. 1a). Maintaining the mutational pattern provides an assurance that the same mutational processes are observed in both the real and the simulated cancer genome. By default, the tool projects these mutations as statistically independent events onto each chromosome by proportionately assigning mutations based on the observed rate of each mutational channel across the complete length of a preselected reference genome. The number of mutations is proportionally assigned to each chromosome based on the number of mutational channels (e.g., 96 channels reflecting trinucleotides) found on that chromosome. The tool also provides a variety of custom options for simulating mutations, including: (i) gender of the sample allowing appropriate incorporation of sex chromosomes; (ii) transcriptional strand bias allowing accurate distribution of mutations to account for the activity of transcription-coupled nucleotide excision repair; (iii) considering mutations as dependent sequential events where each mutation updates the observed rate of a mutational channel for a preselected reference genome, e.g., a C > T mutation at ACT trinucleotide will reduce the number of ACT trinucleotides in the reference genome by one and increase the number of ATT trinucleotides in the reference genome by one, thus, each mutation will modify the overall observed rate of a mutational channel in the genome and affect subsequent mutations; (iv) preserving mutational burden and mutational patterns for each chromosome instead of the complete genome, thus, the number and type of mutations assigned to each chromosome match exactly the ones observed in the original sample (Additional file 1: Fig. 1); (v) exome simulations that generate mutations only in the protein coding regions of the genome; (vi) adding Poisson noise to the number of mutations in each mutational channel of the original data; (vii) allowing the use of a probability mask that can decrease or increase the opportunity for mutations in certain parts of the genome (Additional file 1: Fig. 2); and several other options. Additionally, one can simulate the germline variants in a (matched-)normal sample(s), which can be used for subsequent comparisons against tumor samples. With this collection of features, one can easily tailor an appropriate background mutational model for testing different biological hypotheses or for evaluating existing bioinformatics tools. Importantly, SigProfilerSimulator is computationally efficient. For example, the tool can simulate ~ 37 million somatic mutations found in the 2144 whole-genome sequenced cancers generated by Pan-cancer Analysis of Whole Genomes (PCAWG) initiative [14] within 90 s.

Fig. 2
figure2

Applying SigProfilerSimulator to three distinct cancer genomics problems. a Distribution of the expected number of doublet base substitutions (DBSs) due to the adjacent single base substitutions (SBSs) observed by chance for the PCAWG sample SP99325. The distributions represent the results from 1000 simulations of the mutational pattern of SP99325 treating mutations as statistically independent events (blue) and 1000 simulations of the mutational pattern of SP99325 treating mutations as dependent events. b The fold increase of DBSs observed in the original PCAWG samples and the average number of DBSs observed in our simulations. The mutational pattern of each sample was generated 1000 times considering somatic mutations as statistically independent events. c Comparing the similarities of mutational patterns at ± 2 bp context (SBS-1536) between real and simulated PCAWG samples. Simulations were performed at SBS-6 and SBS-96 resolutions. d Comparing the similarities of mutational patterns at ± 3 bp context (SBS-24576) between real and simulated PCAWG samples. Simulations were performed at SBS-6, SBS-96, and SBS-1536 resolutions. e Evaluating the false-positive rates of MutSigCV1.41, MutSigCV2, and dNdScv driver detection tools using SigProfilerSimulator. All TCGA breast cancer WES samples were simulated 100 times and examined for driver mutations using both MutSigCV and dNdScv. The average number of significant driver genes are plotted using a recommended q-value cutoff of 0.10

Results

To illustrate several of SigProfilerSimulator’s features, we provide a detailed visualization for a single TCGA melanoma sample: TCGA-DA-A-A1I8. Simulating TCGA-DA-A-A1I8 using the SBS-6 classification maintains the original sample’s pattern for the six possible types of single base mutations, however, it also results in completely different patterns for classifications at higher resolutions (Fig. 1c). Simulating an extended sequence context (SBS-96; trinucleotides) results in a perfect match with the original landscape when including ± 1 adjacent bases; however, it does not reflect the transcriptional strand bias observed in the sample (Fig. 1c). As such, one can further elaborate these simulations by incorporating transcriptional strand bias (Fig. 1c), by considering ± 2 adjacent bases (Additional file 1: Fig. 1), or by preserving the mutational burden and mutational patterns on each chromosome (Additional file 1: Fig. 1). Similarly, simulations can be performed for the different classification types for small insertions and deletions (ID-83 and ID-415; Fig. 1D). Each of these simulations can be subsequently used to test different hypotheses. To demonstrate this capability, we applied SigProfilerSimulator to three questions in cancer genomics.

First, we used simulations to evaluate whether doublet base substitutions (e.g., CC:GG > TT:AA mutations) are two subsequent single base substitutions occurring simply by chance in adjacent genomic positions. We constructed a null hypothesis by applying the tool to the 2144 PCAWG cancer genomes. Simulations were performed considering SBSs as both statistically independent events (non-updating—simulating with replacement; each mutation has no effect on the observed rate of mutational channels) and dependent events (updating—simulating without replacement; each mutation updates the observed rate of mutational channels). Each sample was simulated 1000 times providing a distribution of doublet base substitutions. After simulating the SBS-96 context for each PCAWG sample, we examined the number of single base substitutions occurring next to one another on the genome simply by chance. For example, in the sample SP99325 (LIRI), we observed on average approximately 23 pairs of adjacent SBSs when considering mutations as statistically independent events and 14 pairs of adjacent SBSs when considering mutations as dependent events (Fig. 2a). In contrast, the actual sample contains 303 doublet base substitutions indicating a 22-fold and a 13-fold enrichment compared to the null hypothesis, respectively. The results indicate that it is highly unlikely that the majority of observed doublet base substitutions in SP99325 are the result of two adjacent SBS events. Applying the same approach to all PCAWG samples reveals between 10- and 1000-fold increase of the real number of DBSs compared to simulated data (Fig. 2b; Additional file 1: Fig. 3). These results confirm the belief that the vast majority of doublet base substitutions in human cancer are not due to adjacent single base substitutions. Rather, doublet base substitutions are likely due either to single genomic events or to higher mutagenic propensities of certain regions of the human genome. Indeed, we recently derived mutational signatures of doublet base substitutions across the PCAWG dataset [12]. Nevertheless, it is important to remember that, especially for hyper-mutated samples, some of the observed DBSs may be due to having two single base substitutions occurring by chance in adjacent positions (Fig. 2a).

Second, we evaluated whether incorporating additional sequence context 5′ and 3′ of single base substitutions increases the specificity of the mutational patterns observed in cancer genomes [12]. Here, we considered two mutational patterns to be the same if their cosine similarity is more than 0.85 (Additional file 1: Fig. 4; Methods). Specifically, we simulated the PCAWG dataset at different resolutions (viz., SBS-6, SBS-96, and SBS-1536; Fig. 1B) and compared them to the patterns of mutations observed in the real samples (Fig. 2c,d). Comparing the ± 2 bp context of data simulated using SBS-6 to the ± 2 bp context of the real data demonstrated that for almost all samples the SBS-6 simulations do not capture the ± 2 bp context as 91% of samples exhibited a cosine similarity below 0.85. Similarly, only half of the samples simulated using SBS-96 (i.e., ± 1 bp) had consistent ± 2 bp context when compared to the real data (44% below 0.85; Fig. 2C). This demonstrates that the mutational patterns of the examined cancer genomes exhibit additional specificity for ± 2 bp adjacent to single base substitutions; note that ± 2 bp contains within itself the ± 1 bp classification. In contrast, comparing the ± 3 bp context of data simulated using SBS-1536 demonstrated that the ± − 2 bp context captures the patterns observed at ± 3 bp for almost all samples (only 6.5% of samples below 0.85; Fig. 2d). Overall, these results suggest that the SBS-1536 classification is necessary to capture additional information for a set of signatures beyond SBS-6 and SBS-96. Moreover, extending this classification to ± 3 bp (SBS-24576) is likely not necessary as the SBS-1536 classification already captures the patterns of ± 3 bp for majority of the examined cancer samples.

Third, we evaluated the false-positive rates of tools commonly used for discovery of cancer driver genes. More specifically, we simulated the somatic mutations observed in the 1024 whole-exome sequenced breast cancers reported in the TCGA MC3 release [15]. The simulations were repeated 100 times and each of these 100 repetitions was analyzed for driver genes using MutSigCV1.41 and MutSigCV2 [6] as well as dNdScv [10]. In principle, since SigProfilerSimulator randomly shuffles somatic mutations, one would not expect to find any genes under selection. However, each of the tools found significantly mutated genes within the simulations using the recommended cutoff threshold of q-value < 0.10 (Fig. 2E). On average MutSig1.41CV found between 1.3 and 1.6 false-positive driver genes per simulation when examining data generated using the SBS-384 and SBS-6144 mutational classifications, respectively. In contrast, MutSig2CV found between 0.3 and 0.2 false-positive driver genes per simulation using SBS-384 and SBS-6144, respectively. Lastly, dNdScv found between 0.03 and 0.02 false-positive driver genes per simulation using SBS-384 and SBS-6144, respectively. Note that by chance, when using a q-value cutoff of 0.1, one would expect to observe less than 0.1 false-positive driver genes per simulation. Lowering the threshold for statistical significance to 0.01 eliminates all false-positive results from dNdScv and MutSig2CV but not for MutSig1.41CV.

Conclusions

Increasingly, there is a need to develop reliable background models of cancer mutational landscapes to allow downstream statistical analysis for biological discoveries. Currently, to the best of our knowledge, there is no tool that allows explicitly simulating accurate background mutational landscapes. This report presents SigProfilerSimulator, a method that allows fast generation of mutational landscapes at different resolutions. As demonstrated by our analyses, SigProfilerSimulator can be used to evaluate the accuracy of other bioinformatics tools or it can be leveraged for making novel discoveries. SigProfilerSimulator’s breadth of features allows one to construct a tailored null hypothesis of mutational landscapes and to identify significance levels of the subsequent results. Overall, SigProfilerSimulator will be a useful tool for any researcher that performs statistical analysis based on mutational data derived from the sequencing of cancer or normal somatic tissues.

Methods

Tool implementation

SigProfilerSimulator is developed as a computationally efficient Python package and it is available for installation through PyPI. Further, an R-wrapper is available through GitHub. The tool leverages a PCG random number generator that provides a simple, fast, and space-efficient algorithm for generating random numbers with high statistical quality [16]. The tool uses a Monte Carlo approach for randomly generating somatic mutations while considering the observed frequency of a preselected reference genome. More specifically, SigProfilerSimulator randomly shuffles mutations by using the precomputed observed rates of mutational channels in a reference genome. The tool works in unison with SigProfilerMatrixGenerator [11] to first classify a catalog of somatic mutations prior to simulating it. The final mutational catalog is outputted into commonly used mutation data formats including mutation annotation format (MAF) files and variant annotation format (VCF) files. SigProfilerSimulator is freely available and has been extensively documented.

Python code: https://github.com/AlexandrovLab/SigProfilerSimulator

R wrapper: https://github.com/AlexandrovLab/SigProfilerSimulatorR

Documentation: https://osf.io/usxjz/wiki/home/

Computational benchmarking

The computational efficiency of SigProfilerSimulator was benchmarked by simulating the freely available PCAWG dataset, consisting of 2,144 samples with 36,876,213 single base substitutions, for a single iteration using the default parameters. Simulating the complete dataset took approximately 90 s. Simulations were performed on a dedicated computational node with a dual Intel® Xeon® Gold 6132 Processors (19.25 M Cache, 2.60 GHz) and 192 GB of shared DDR4-2666 RAM.

Analysis of doublet base substitutions

We simulated the PCAWG dataset using the SBS-96 classification. Each simulation was performed 1,000 times considering mutations as both statistically independent events (non-updating; each mutation has no effect on the observed rate of mutational channels) and dependent events (updating; each mutation updates the observed rate of mutational channels). To calculate the number of DBS mutations occurring by chance in each sample, we generated the mutational catalogs for DBS-78 using SigProfilerMatrixGenerator [11]. The resulting counts for DBSs were used to plot the distributions of the expected number of DBSs due to two adjacent SBSs happening purely by chance. The fold change was calculated by dividing the mean DBS count observed across the simulations by the total number of DBSs found in the original sample. Derivation of q-values was performed by applying the Benjamini and Hochberg false discovery rate correction to p-values calculated using z-tests based on the DBS distributions found in the simulations and the numbers of DBSs observed in the real data.

Sequence context analysis for mutational signatures

The PCAWG dataset was simulated using the SBS-6, SBS-96, and SBS-1536 classifications while ensuring the respective mutational patterns and mutational burdens on each chromosome match the ones observed in the real data. SigProfilerMatrixGenerator was used to derive the mutational vectors for each sample including vectors incorporating three bases 5′ and three bases 3′ of each mutation, resulting in a classification with 24,576 mutational channels. To avoid comparisons of sparse binary vectors, only samples that had at least 2 mutations per mutational channel were included in the comparative analyses. The simulated and real mutational patterns of a cancer genome were considered the same if their cosine similarity was at least 0.85. Note that the average cosine similarity between two random nonnegative vectors is 0.75 (Additional file 1: Fig. 4). The chance of two nonnegative vectors with 1,536 mutational channels or 24,576 mutational channels to have a similarity of 0.85 simply by chance is less than 10–6 (Additional file 1: Fig. 4).

Benchmarking false-positive driver genes detected by MutSigCV and dNdScv

All whole-exome sequenced breast cancer samples part of the TCGA MC3 release were simulated using SBS-384 and SBS-6144 contexts while maintaining the mutational burden on each chromosome. As recommended [10], 23 samples with more than 500 exonic mutations were excluded from the analysis. Each simulation was repeated 100 times with different random seeds. The variant annotation predictor [17] was used to annotate simulated mutations with the appropriate gene name for compatibility with MutSigCV1.41 and MutSigCV2 [6]. We ran MutSigCV1.41 and MutSigCV2 using the recommended default parameters in conjunction with the genome reference sequence for hg19, mutation dictionary file, exome coverage file, and gene covariates file as found at https://software.broadinstitute.org/cancer/cga/mutsig_run. We ran dNdScv [10] using the default library parameters and filtered out the significant genes using the recommended q-value cutoff of less than 0.10. All rainfall plots were generated using karyoploteR [18].

Availability and requirements

Project name: SigProfilerSimulator

Project home page: https://github.com/AlexandrovLab/SigProfilerSimulator

Operating system(s): Unix, Linux, and Windows

Programming language: Python 3; R wrapper

Other requirements: None

License: BSD 2-Clause "Simplified" License

Any restrictions to use by non-academics: None

Availability of data and materials

No novel data were generated as part of this study. All source code is freely available and can be downloaded from the links below. Python code: https://github.com/AlexandrovLab/SigProfilerSimulator. R wrapper: https://github.com/AlexandrovLab/SigProfilerSimulatorR. Documentation: https://osf.io/usxjz/wiki/home/

Abbreviations

DBSs:

Doublet base substitutions

IDs:

Small insertions and deletions

LIRI:

Liver cancer—hepatocellular carcinoma (virus associated)

PCAWG:

Pan-cancer analysis of whole genomes

SBSs:

Single base substitutions

TCGA:

The cancer genome atlas

References

  1. 1.

    Youn A, Simon R. Identifying cancer driver genes in tumor genome sequencing studies. Bioinformatics. 2011;27(2):175–81.

    CAS  Article  Google Scholar 

  2. 2.

    Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, Mandelker D, Leary RJ, Ptak J, Silliman N, et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314(5797):268–74.

    Article  Google Scholar 

  3. 3.

    Greenman C, Wooster R, Futreal PA, Stratton MR, Easton DF. Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics. 2006;173(4):2187–98.

    CAS  Article  Google Scholar 

  4. 4.

    Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature. 2008;455(7216):1069–75.

    CAS  Article  Google Scholar 

  5. 5.

    Kan Z, Jaiswal BS, Stinson J, Janakiraman V, Bhatt D, Stern HM, Yue P, Haverty PM, Bourgon R, Zheng J, et al. Diverse somatic mutation patterns and pathway alterations in human cancers. Nature. 2010;466(7308):869–73.

    CAS  Article  Google Scholar 

  6. 6.

    Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–8.

    CAS  Article  Google Scholar 

  7. 7.

    Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Borresen-Dale AL, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.

    CAS  Article  Google Scholar 

  8. 8.

    Morganella S, Alexandrov LB, Glodzik D, Zou X, Davies H, Staaf J, Sieuwerts AM, Brinkman AB, Martin S, Ramakrishna M, et al. The topography of mutational processes in breast cancer genomes. Nat Commun. 2016;7:11383.

    CAS  Article  Google Scholar 

  9. 9.

    Alexandrov LB, Nik-Zainal S, Wedge DC, Campbell PJ, Stratton MR. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 2013;3(1):246–59.

    CAS  Article  Google Scholar 

  10. 10.

    Martincorena I, Raine KM, Gerstung M, Dawson KJ, Haase K, Van Loo P, Davies H, Stratton MR, Campbell PJ. Universal patterns of selection in cancer and somatic tissues. Cell. 2017;171(5):1029–41.

    CAS  Article  Google Scholar 

  11. 11.

    Bergstrom EN, Huang MN, Mahto U, Barnes M, Stratton MR, Rozen SG, Alexandrov LB. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics. 2019;20(1):685.

    Article  Google Scholar 

  12. 12.

    Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578(7793):94–101.

    CAS  Article  Google Scholar 

  13. 13.

    Alexandrov LB, Stratton MR. Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr Opin Genet Dev. 2014;24:52–60.

    CAS  Article  Google Scholar 

  14. 14.

    Consortium ITP-CAoWG: Pan-cancer analysis of whole genomes. Nature 2020, 578(7793):82–93.

  15. 15.

    Ellrott K, Bailey MH, Saksena G, Covington KR, Kandoth C, Stewart C, Hess J, Ma S, Chiotti KE, McLellan M, et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 2018;6(3):271.

    CAS  Article  Google Scholar 

  16. 16.

    Lemire D. Fast random integer generation in an interval. ACM Trans Model Comput Simul. 2019;29(1):1–12.

    Article  Google Scholar 

  17. 17.

    McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome Biol. 2016;17(1):122.

    Article  Google Scholar 

  18. 18.

    Gel B, Serra E. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics. 2017;33(19):3088–90.

    CAS  Article  Google Scholar 

Download references

Acknowledgements

The authors thank M. Lawrence for sharing the newest code for MutSig2CV. The authors also thank M. Zhivagui for designing the SigProfiler logo.

Funding

This work was supported by Cancer Research UK Grand Challenge Award C98/A24032. LBA is an Abeloff V scholar and he is personally supported by an Alfred P. Sloan Research Fellowship and a Packard Fellowship for Science and Engineering. The funders had no roles in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Affiliations

Authors

Contributions

E.N.B. developed the Python and R code and wrote the manuscript. M.B. tested and documented the code. I. M. provided advice for both driver gene analysis and the interpretation of the results. L.B.A. supervised the overall development of the code and writing of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ludmil B. Alexandrov.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

. Figure 1. Example of an additional resolution for simulating mutational patterns supported by SigProfilerSimulator. The example illustrates the resulting patterns when maintaining the mutational burden on each chromosome and when only relying on proportionate allocation based upon the nucleotide context distribution of the reference genome. Comparison is provided for a single breast cancer sample simulated at an SBS-1536 resolution. Figure 2. Simulating cancer genomics data using a probability mask. An example rainfall plot visualization when simulating a single TCGA melanoma sample, TCGA-DA-A-A1I8, with and without a probability mask on chromosome 2. A) Distribution of single base substitutions across chromosome 2 as found in the original sample. B) Distribution of single base substitutions across chromosome 2 when simulating the sample with default parameters. C) Distribution of single base substitutions across chromosome 2 when simulating the sample using a probability mask with 90% probability for mutations on the p arm and a 10% on the q arm. D) Distribution of single base substitutions across chromosome 2 when simulating the sample with a probability mask that varies in weights across the chromosome. All rainfall plots generated using karyoploteR [18]. Y-axes reflect log-scaled distances between adjacent mutations. X-axes reflect positions on chromosome 2 in TCGA-DA-A-A1I8. Each dot reflects a single base substitution colored using the default coloring scheme of karyoploteR. Figure 3. Evaluating the expected rates of DBSs for mutations simulated as dependent events. The fold increase of DBSs observed in the original PCAWG samples and the average number of DBSs observed in our simulations. The mutational pattern of each sample was generated 1000 times considering somatic mutations as dependent events. Figure 4. Evaluating the average similarity of random nonnegative vectors. A) Comparing the cosine similarities amongst 10,000 randomly generated nonnegative vectors, where each vector has 1536 mutational channels. B) Comparing the cosine similarities amongst 10,000 randomly generated nonnegative vectors, where each vector has 24,576 mutational channels.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Bergstrom, E.N., Barnes, M., Martincorena, I. et al. Generating realistic null hypothesis of cancer mutational landscapes using SigProfilerSimulator. BMC Bioinformatics 21, 438 (2020). https://doi.org/10.1186/s12859-020-03772-3

Download citation

Keywords

  • Somatic mutations
  • Mutational patterns
  • Mutational signatures