Heterogeneity of the mutation rate
We observe significant heterogeneities of the mutation rate at multiple levels (Fig. 2). The mutation rate varies among cancer types (Fig. 2a): skin cutaneous melanoma, colorectal cancer and lung adenocarcinoma are the cancer types with the highest mean mutation rates (5–10 mut/patient/Mb) while thyroid carcinoma, prostate adenocarcinoma and low-grade glioma have the lowest (0.5–1 mut/patient/Mb). The mutation rate also differs between samples from the same cancer type, with the largest variation seen for skin cutaneous melanoma (the mutation rate ranges from 1 mut/patient/Mb to 150 mut/patient/Mb).
The mutation rate also varies between different genomic contexts (Fig. 2b). As previously shown, we find that mutational signatures are cancer type specific by looking into the mutation rate for different trinucleotide contexts. For instance, the mutation rate at TC* sites is particularly high in skin cutaneous melanoma, with the largest proportion of mutations at TCC positions of all cancer types. The mutation rate at CpG sites is elevated in all the cancer types. In colorectal cancer, we observe a high proportion of mutations at TCG and TCT sites. These can be attributed to mutations in the POLE gene that cause DNA polymerase ε deficiency [10]: we find an increased overall mutation rate, a very high proportion of T[C >A]T and T[C >T]G mutations and a high contribution of COSMIC signature 10 in six out of 42 colon cancer samples [11]. Five of those (and three of the other colon cancer samples) have a nonsynonymous mutation in POLE and one of them in addition in POLD1, which encodes the DNA polymerase δ (Additional file 1: Figure S1). When the samples with POLE mutation pattern are removed, the mutation pattern of the remaining colon cancer samples is very similar to most cancer types with a high proportion of mutations at CpG positions (Additional file 1: Figure S2).
The mutation rate also differes between genomic environments defined by the explanatory variables (Fig. 2b, c). Coding regions tend to have fewer mutations in all cancer types. Mutation rates are elevated for simple repeat regions, which might be related to mapping artefacts and ensuing technical challenges during mutation calling. The effect of CpG islands varies between different cancer types. The mutation rate in CpG islands is higher than in regions outside for thyroid carcinoma, prostate adenocarcinoma, low-grade glioma and kidney chromophobe, while for colorectal cancer, lung adenocarcinoma, lung squamous cell carcinoma and skin cutaneous melanoma the situation is reversed. Regions that are late replicated, GC rich, evolutionarily less conserved, inside DNase 1 peaks and lowly expressed have an elevated mutation rate. The explanatory power of the variables varies across cancer types as shown by the regression lines in Fig. 2c.
Granularity of regression models
We model the mutation probability in cancer genomes using a set of regression models. The most coarse-grained description of the number of mutations in a region is a Poisson regression count model and the most fine-grained is a binomial or multinomial site-specific regression model. Here we describe and investigate in detail the (dis)advantages of these three models. A conceptual overview of the models is given in Fig. 3a.
Poisson count regression model
In Poisson regression, the number of mutations in a genomic region of fixed length is modeled. The whole genome is divided into regions of pre-fixed length or according to the value of explanatory variables (e.g. segmented by genomic element types). Regression modelling is facilitated by summing the mutation counts and summarizing the annotations over the region.
We model the mutation count Nr,sam in the r-th genomic region with length L
r
in sample sam. Furthermore, can is the cancer type of the sample. Mutations arise randomly with probability pr,sam. As pr,sam is small and L
r
is large, we have the Poisson approximation of the binomially distributed number of mutations
$$N_{r, sam} \sim \text{Bi}(L_{r}, p_{r, sam}) \approx \text{Po}(L_{r} p_{r, sam}). $$
For J explanatory variables, the expected mutation count λr,sam=L
r
pr,sam can be modeled by Poisson regression with a log-link:
$$\log \lambda_{r, sam} = \mu_{sam} + \beta_{1, can} x_{r, 1} + \dots + \beta_{J, can} x_{r,J} $$
where for the jth explanatory variable, xr,j is the average value of the annotation across region r if the explanatory variable is continuous and for categorical explanatory variables, xr,j is derived from the proportions of different levels of the annotations in region r, j=1,…,J. μ
sam
is the sample-specific intercept.
Site-specific binomial regression model
In site-specific regression models, the mutation probability is modeled in each position of the genome. We enable regression modelling by binning the continuous annotations, such that we are able to sum mutation counts over positions with the same combination of annotations, and thereby reduce the size of the data set. We consider both site-specific binomial and multinomial regression models.
We model the mutation probability pi,sam at a site i in sample sam of cancer type can. With a logit link, the mutation probability can be modeled by logistic regression:
$$ \log \frac{p_{i, sam}}{1 - p_{i,sam}} = \mu_{sam} + \beta_{1,can}x_{i,1} + \dots + \beta_{J, can}x_{i,J} $$
where xi,j is the value of the jth explanatory variable at site i.
Site-specific multinomial regression model for strand-symmetric mutation types
We model the mutation probability for different mutation types. Assuming strand-symmetry, we are not distinguishing between e.g. A >G (A to T) mutations and T >C mutations, pA>G=pT>C. We consider the strand with the C or T nucleotide, and the mutation probability matrix is
$$\begin{aligned} & \qquad\mathbf{A} \quad\qquad \mathbf{G} \quad\qquad \mathbf{C} \quad\qquad \mathbf{T} \\ \begin{array}{c} \mathbf{A}~\\ \mathbf{G}~\\ \mathbf{C}~\\ \mathbf{T}~\\ \end{array} &\left(\begin{array}{cccc} p^{{T > T}} & \quad p^{{T > C}} & \quad p^{{T > G}} & \quad p^{{T > A}} \\ p^{{C > T}} & \quad p^{{C > C}} & \quad p^{{C > G}} & \quad p^{{C > A}} \\ p^{{C > A}} & \quad p^{{C > G}} & \quad p^{{C > C}} & \quad p^{{C > T}} \\ p^{{T > A}} & \quad p^{{T > G}} & \quad p^{{T > C}} & \quad p^{{T > T}} \\ \end{array}\right) \end{aligned} $$
with only 6 types of mutations.
We model these mutation probabilities by setting up a multinomial logistic regression model, where dummy variables are used to distinguish between mutations from (G:C) base pairs and mutations from (A:T) base pairs. The (G:C) basepairs are modelled with probabilities
$$\left(p_{i,\text{sam}}^{{C > A}}, p_{i,\text{sam}}^{{C > G}},p_{i,\text{sam}}^{{C > C}}, p_{i,\text{sam}}^{{C > T}} \right)$$
for (G:C) position i in sample sam. With J explanatory variables, the mutation probability at G:C position i in sample sam of cancer type can can be written as:
$$\begin{array}{*{20}l} \log \frac{p_{i,\text{sam}}^{{C > A}}}{p_{i,\text{sam}}^{{C > C}}} =\mu_{\text{sam}}^{{C > A}} + \beta_{1,\text{can}}^{{C > A}} x_{i,1} + \cdots + \beta_{k,\text{can}}^{{C > A}} x_{i,J} \\ \log \frac{p_{i,\text{sam}}^{{C > G}}}{p_{i,\text{sam}}^{{C > C}}} = \mu_{\text{sam}}^{{C > G}} + \beta_{1,\text{can}}^{{C > G}} x_{i,1} + \cdots + \beta_{k,\text{can}}^{{C > G}} x_{i,J} \\ \log \frac{p_{i,\text{sam}}^{{C > T}}}{p_{i,\text{sam}}^{{C > C}}} = \mu_{\text{sam}}^{{C > T}} + \beta_{1,\text{can}}^{{C > T}} x_{i,1} + \cdots + \beta_{k,\text{can}}^{{C > T}} x_{i,J}. \end{array} $$
Note that the probability for no mutation \(p_{i, sam}^{{C > C}}\) is the reference. Similarly, (A:T) basepairs are modeled with probabilities
$$\left(p_{i,\text{sam}}^{{T > A}}, p_{i,\text{sam}}^{{T > G}},p_{i,\text{sam}}^{{T > C}}, p_{i,\text{sam}}^{{T > T}} \right)$$
and the reference is the probability for no mutation \(p_{i, sam}^{{T > T}}\). We compare the performance of the three models on 2% of the whole genome data.
The setting for the three models is shown in Fig. 3a. Since we are mainly interested in their predictive performance, we have not included overdispersed models. Overdispersion is a way to model unexplained variance in the data, but it does not qualitatively change the predictions of a model. Each model is trained with replication timing, phyloP, and context information from the reference genome. For the region-based Poisson model, continuous annotation values are averaged over the selected region and GC percentages are calculated for each region. For the site-specific models, we use the site-specific annotations for each site. Continuous values are discretized to simplify the estimation process.
The results are shown in Fig. 3b-e. We compare the performance of these models at different resolutions using different datasets. In Fig. 3b, we predict the mutation counts in large windows of length 100kb. In Fig. 3c, d, e, we predict the mutation counts in sets of 1000 randomly sampled individual sites. When predicting mutation counts in large windows (100 kb), the three regression models perform similarly. For prediction in a randomly selected small number of sites (1 kb), the two site-specific models out-perform the region-based model (Fig. 3c). The site-specific models can capture the mutational heterogeneities between sites and provide a more accurate mutation probability at any resolution. This is in contrast to the region-based model where a large number of sites are required for accurate predictions. In addition to predicting the probability for mutation events, the multinomial regression model can also predict the probabilities of different mutation types. It outperforms the binomial model by taking the different mutation rates into account (Fig. 3d, e).
Model selection
We consider the site-specific multinomial regression model to predict mutation probabilities for different mutation types at a single site. We implement a forward model selection procedure to determine the explanatory variables in the final model (Fig. 1). In each step, we add all possible new variables to the previous model in turn and rank the resulting new models. We identify and include the explanatory variable with the best fit and iterate the procedure several times. With forward model selection, we avoid the preparation of large analytical data tables that contain all potential explanatory variables (“Preparation of the analytical data table” section). We choose the deviance loss that measures the predictive performance of the model to assess the fit. By estimating it by cross-validation, we avoid overfitting, because it assesses the fit on an independent subset of the data (“Cross validation” section).
By construction, the fit improves during the model selection procedure (Fig. 4a). We also evaluate McFadden’s pseudo R2 as a measure of the explained variance that is valid in categorical regression models (“McFadden’s pseudo R2” section).
As a reference model, we start out with a single mutation rate for the whole genome in all samples (Model 1; Fig. 4a). This model cannot explain any of the variation in the mutation rate between samples and positions, so McFadden’s pseudo R2=0. After including the six strand-symmetric mutation types in the model (Model 2), we add cancer and sample specific intercepts (Model 3 and Model 4) to make sure that we account for sample-specific mutation rates. In the next step, we include the left and right neighboring base-pair for each cancer type (Model 5).
Starting with Model 5, additional annotations are added using forward model selection. We consider the phyloP score, replication timing, expression, genomic segments, GC content in 1 kb, CpG islands, simple repeats and DNase I hypersensitivity. For each of these variables, cancer specific regression coefficients are estimated to allow for differences in the mutational process between cancer types. For expression, we use data directly obtained from matching tumor types, so we also take expression differences between cancer types into account.
The annotation with the largest decrease in the deviance loss function is the phyloP score (Model 6). Subsequently, replication timing (Model 7) and gene expression (Model 8) are added. At this point, we stop the model selection procedure to avoid the time-consuming creation of larger count tables (see “Preparation of the analytical data table” section for details) as we also see that the improvement of the fit levels off. Detailed results for each step of the forward selection procedure are provided in Additional file 1: Section S2.1. To assess the robustness of the results, we rerun the forward model selection procedure five times on randomly selected regions that cover 2% of the whole genome. The ranking of the variables is constant for all the experiments (Additional file 1: Section S2.2).
Adding the context in model 5 gives a substantial improvement over model 4. When the phyloP score is added in model 6, it can be seen in Fig. 4b that it considerably changes the predicted mutation rate for the same nucleotide triplet at two positions with different phyloP score. While the trinucleotide context and the phyloP score vary on a basepair scale, both replication timing and expression vary on a kilo-base scale. Even though much of the per-base-pair variation in the mutation rate is already captured in Models 5 and 6, the long-range variation is considerably better explained in Model 7 and Model 8 (Fig. 4b). We can see that adding replication timing in Model 7 considerably changes the predicted mutation rate obtained from Model 6 that is relatively uniform in the 1.1 Mb region shown in the figure, by taking the replication timing gradient in the region into account. Finally, in Model 8, the mutation rate in highly expressed regions is lowered, which again improves the prediction.
Driver detection methods, such as MutSigCV [5] and ncdDetect [9], are generally based on a model of the neutral mutation rate in tumors. We use two typical cancer genes, the oncogene KRAS and the tumor suppressor gene TP53, to contrast the observed mutation pattern in a driver to the predicted neutral mutation rate.
In Additional file 1: Figure S4, it is obvious that the predicted mutation rate is lower at the gene body of KRAS than right outside of it. A zoom onto exon 2 shows a cluster of mutations at positions 25,398,281-5, with 25 mutations at position 25,398,284-5 which cause a change of protein. This cluster is highly unlikely under our neutral model.
For TP53, we also find a lower predicted mutation rate at the gene body of TP53 and the overlapping gene WRAP53 (Additional file 1: Figure S5). Here, the observed mutations are more spread than in KRAS, but they mainly occur in highly conserved exonic regions, where the neutral model predicts a low mutation rate.
Estimation results
Upon determining the final model from the model selection procedure, we estimate parameters for the multinomial logistic regression model on the remaining 98% of the genome. To study the difference between cancer types, all position-specific explanatory variables are stratified by cancer type. The coefficients represent multiplicative changes in mutation rate. Our results confirm the large differences in mutation pattern both between samples and cancer types, but also between different genomic and epigenomic regions.
We observe that regions that have been highly conserved during human evolution have a lower mutation rate in all cancer types and for all mutation types, and this difference is nearly always significant (Fig. 5a). For kidney chromophobe and prostate adenocarcinoma, it is reduced to less than half for some of the mutation types. In breast cancer, head and neck squamous cell carcinoma, kidney chromophobe and thyroid carcinoma, this difference is much more pronounced at A:T positions than at G:C positions, but there is no general pattern with respect to the mutation type.
As previously described [5], we nearly always find a positive association between replication timing and mutation rates (see regression lines in Fig. 2c; later replicating regions have more mutations), but the regression coefficient varies significantly between the different cancer types and the mutation types (Fig. 5b). For the dummy variable that distinguishes gene bodies, where expression is measured, from intergenic regions, we see a mixed pattern of insignificant and negative regression coefficients where the mutation rate in gene bodies is reduced to up to one third of the rate in intergenic regions (low-grade glioma; Fig. 5c). The only exception is melanoma with a slightly increased mutation rate in gene bodies.
Also when we consider regions within gene bodies, we find a reduced mutation probability in highly expressed regions for most cancer types. The most extreme example is the probability of a C > T mutation in highly expressed regions in melanoma, which is only one fifth of the probability in lowly expressed ones (Fig. 5d). Low-grade glioma, prostate adenocarcinoma and thyroid cancer show the opposite pattern for some of the mutation types, though. However, this effect is dampened by the dummy variables for the gene body that have comparably large coefficients of the opposite sign for these three cancer types.
We find that the mutation rates differ between the different types of mutations, but also between the specific contexts that we consider: CpG, to capture the pattern of spontaneous deamination, and TpCp[AT], to capture the APOBEC signature. We find that the C > T mutation rate is higher in CpG sites than in other sites in all cancer types. In skin cutaneous melanoma, we also observe elevated mutation rate for CC context, which is related to the elevated CC > TT mutation rate due to UV light [12]. We observe elevated rates of mutations that fit the APOBEC pattern in breast cancer, bladder urothelial carcinoma, head and neck squamous cell carcinoma, lung squamous cell carcinoma and skin cutaneous melanoma.