Using passenger mutations to estimate the timing of driver mutations and identify mutator alterations

Background Recent developments in high-throughput genomic technologies make it possible to have a comprehensive view of genomic alterations in tumors on a whole genome scale. Only a small number of somatic alterations detected in tumor genomes are driver alterations which drive tumorigenesis. Most of the somatic alterations are passengers that are neutral to tumor cell selection. Although most research efforts are focused on analyzing driver alterations, the passenger alterations also provide valuable information about the history of tumor development. Results In this paper, we develop a method for estimating the age of the tumor lineage and the timing of the driver alterations based on the number of passenger alterations. This method also identifies mutator genes which increase genomic instability when they are altered and provides estimates of the increased rate of alterations caused by each mutator gene. We applied this method to copy number data and DNA sequencing data for ovarian and lung tumors. We identified well known mutators such as TP53, PRKDC, BRCA1/2 as well as new mutator candidates PPP2R2A and the chromosomal region 22q13.33. We found that most mutator genes alter early during tumorigenesis and were able to estimate the age of individual tumor lineage in cell generations. Conclusions This is the first computational method to identify mutator genes and to take into account the increase of the alteration rate by mutator genes, providing more accurate estimates of the tumor age and the timing of driver alterations.


Probability Model
The founder cell of the major clone traces a genealogy back to the fertilized egg. From the fertilized egg, passenger somatic alterations are accumulated in the lineage of the founder cell. The accumulation of passenger somatic alterations can be modeled by a Poisson process with rate λ per cell division if the alteration rate stays constant. (Figure 1 (a)) In order to permit variation among patients, we model the number of passenger somatic alterations N i as following a Poisson distribution with rate λ T i + E i where T i is the number of cell divisions in the lineage of the founder cell from the birth to the tumor detection for sample i and E i is the increase of alterations by unknown factors such as exposure to mutagens by smoking or UV radiation.
We assume an alteration of a driver gene or driver region j increases the alteration rate by ∆ j . It is positive if the driver is a mutator and 0 otherwise. Suppose the alteration of the driver gene or region j occurred in sample i at time X i, j . Then the alteration rate per cell division is λ until the time X i, j and after that, it becomes λ + ∆ j . Therefore, the number of passenger alterations follows a Poisson distribution with rate λ Now, suppose another driver gene or region k is altered in sample i at time X i,k . If X i, j < X i,k , the alteration rate is λ by X i, j , λ + ∆ j between X i, j and X i,k and λ + ∆ j + ∆ k afterwards. Therefore the number of passenger alterations follows a Poisson distribution with rate λ The same result is derived when X i,k < X i, j . This means that the alteration of each driver gene or region j increases the average number of passenger alterations accumulated in the sample by ∆ j (T i − X i, j ) additively. (Figure 1

Derivation of the likelihood
The likelihood element of observing the number of passenger somatic alterations N i and the occurrences of alterations A i, j of selected driver genes or regions j ∈ J given their germline or somatic status G i, j and age of the patient S i is, Gamma distribution with shape and rate parameters α, β ).
The difficult part of the calculation is the multiple integration over X i,k for k ∈ K i . However it can be converted into a sum of products of single integrals as follows :

Parameter estimation
The unknown values of the parameters α, β , α j , β j , ∆ j , p j , ρ are estimated by maximizing the likelihood Before calculating the maximum likelihood estimates (MLE), we first test whether ρ is infinite or not. This is equivalent to testing whether E i = 0 for all samples i or not. Note that E i follows an exponential distribution with the parameter ρ, and therefore when ρ is infinite, E i is always zero. We do a likelihood ratio test of ρ being infinite vs. not. When the null hypothesis is not rejected under the p-value cutoff 0.05, we fix the value of ρ as infinite and estimate other parameters by maximizing the likelihood. If the null hypothesis is rejected, we estimate the value of ρ and other parameters by maximizing the likelihood. When estimating the parameters α j , β j for a gene/region j, we found that there is not a unique MLE for some genes/regions. In those cases, for the given value of β j , there is an interval of α j which gives the same likelihood. In such cases, we find the interval of α j giving the same maximum likelihood and give the median value of the interval as the estimate of α j .

Preselection of mutator genes
The calculation and optimization of the likelihood becomes difficult for a large set of drivers J. Therefore, we predetermine the set of mutators whose ∆ j value is positive and only include these mutators in J and estimate α, β , α j , β j , ∆ j , p j , ρ by maximizing the likelihood. This is because non-mutators do not increase the rate of alterations and thus do not affect the estimates α, β , ρ as well as α j , β j , ∆ j , p j of other genes.
For non-mutators k, ∆ k = 0. The parameters α k , β k , p k of non-mutators are estimated by maximizing using the estimates obtained above.
To predetermine the set of non-mutators whose ∆ k value is zero, we fit a model which includes only one driver gene/region k. Then we obtain an estimate of ∆ k and its 95% confidence interval (CI) by doing 100 bootstraps. If the 95% CI for ∆ k includes 0, we consider the gene as non-mutator.
For bootstrap calculations, we utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Md. (http://biowulf.nih.gov).

Bootstrapping
For the sensitivity analysis of parameter estimates, we did 400 bootstrappings. For each bootstrap, we randomly sampled tumor samples (of equal size to the orginal dataset) with replacement from the original dataset. We applied our method to obtain MLEs of parameters for each bootstrapped dataset. Therefore we obtain 400 sets of parameter estimates. Using these, we construct confidence intervals for ∆ j , the posterior mean of T i and X i, j . When calculating MLEs for each bootstrapped data, we used the MLEs obtained from the original data as the starting value for the optimization of the likelihood. This may have biased the confidence interval to be narrow. Comparison with the previous method of mutation order estimation

List of genes belonging to mutator CNA regions
We compared the result for the lung data with the result for the same data obtained by our previous method of the order estimation [1]. It estimates P k,i , the probability that the k th mutational event involving the selected driver genes occurs in gene i. Table 4 shows the estimates of P k,i and their 90% confidence interval. For better comparison of the results from both methods, we also present the conditional probability that the gene mutates early (a gene mutates at the k th event for k ≤ 3) or late (a gene mutates at the k th event for k > 3) given that the gene is mutated in the sample and their 90% CIs in Table 5. The conditional probabilities clarify whether a gene mutates early or late. Table 6 shows the result obtained by our current method. Although it is difficult to compare the results because each method uses different measures for ordering, both identify EGFR, KRAS, STK11, TP53 as early mutating genes and LRP1B, NF1, PRKDC, PTPRD as late mutating genes.