The MOBSTER R package for tumour subclonal deconvolution from bulk DNA whole-genome sequencing data

Background The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step to determine clonal expansions and their evolutionary trajectories. Results In a recent work we developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This new method integrates, for the first time, an explicit model of the neutral evolutionary forces that participate in clonal expansions; in that work we also showed that our method largely improves on competing data-driven methods. In this Software paper we present mobster, an open-source R package built around our new deconvolution approach, which provides several functions to plot data, fit models, assess their confidence and compute further evolutionary analyses that relate to subclonal deconvolution. Conclusions We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate machine learning and population genetics, which can explicitly model co-existing neutral evolution and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and give an overview of all package functionalities.


Introduction
These notes describe most of the vignettes available at the mobster GitHub Pages website: https://caravagn.github.io/mobster/.
Available notes cover the following topics:
• Note 1: introduction to the input format, simple fits and the data-generation process;
• Note 2: plotting functions for the models and the data;
• Note 3: bootstrap estimation of the model parameters;
• Note 4: inference of population genetics parameters of tumour growth;
• Note 5: clone-specific dN/dS statistics;
• Note 6: clone tree estimation from the model fits.

Input data for mobster
You can run a MOBSTER analysis if, for a set of input mutations (SNVs, indels, etc.), you have VAF or CCF data available. The input data can be loaded using different input formats.
For VAF values you can use:
• a data.frame (or tibble) with a column named VAF whose values are numerical, 0 < VAF < 1, without NA entries;
• a VCF file, which must contain at least a column with the total sequencing depth and the number of reads supporting the variant. In this case you first need to use the function load_VCF to load the VCF content, and then proceed using your data as a data.frame.
For CCF values you can only use the data.frame format. Importantly, you have to store the CCF values in a column again named VAF, which must follow the same convention as a VAF column (i.e., range of values). Since CCF values usually peak at around 1.0 for clonal mutations (i.e., mutations present in 100% of the input cells), we suggest adjusting standard CCF estimates by dividing them by 2, so that the peak of heterozygous clonal mutations maps to the 50% VAF expected for a 100% pure bulk sample.
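The two input routes above can be sketched as follows; the column names besides VAF are illustrative, and the numbers are made up.

```r
library(dplyr)

# VAF input: a tibble with a numerical VAF column, 0 < VAF < 1, no NAs
snvs = dplyr::tibble(
  chr = c("chr2", "chr5", "chr7", "chr12"),
  VAF = c(0.48, 0.51, 0.12, 0.09)
)

# CCF input: store the adjusted CCF values in a column still named VAF.
# Dividing standard CCF estimates by 2 maps clonal mutations (CCF ~ 1)
# to the 50% VAF peak expected for heterozygous clonal mutations.
ccf = c(1.02, 0.98, 0.31)
ccf_input = dplyr::tibble(VAF = ccf / 2)
```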
Example dataset. Diploid mutations from sample LU4 of the Comprehensive Omics Archive of Lung Adenocarcinoma are available in the package under the name LU4_lung_sample. The available object is the result of an analysis with mobster, and the input mutation data is stored inside the object. Other datasets are available through the data command.

Driver annotations
In the context of subclonal deconvolution we are often interested in linking "driver" events to clonal expansions. Since mobster works with somatic mutation data, it is possible to annotate the status of "driver mutation" in the input data; if you do so, the drivers will be reported in some visualisations of the tool, but will not influence any of the computations carried out in mobster.
To annotate one or more driver mutations you need to include 2 extra columns in your data:
• is_driver, a boolean TRUE/FALSE flag;
• driver_label, a character string that will be used as the label in any visualisation that uses drivers.
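For instance, a minimal annotation might look as follows (the driver gene and all values are made up):

```r
# Input data with one annotated driver; is_driver and driver_label are the
# two extra columns described above.
snvs = data.frame(
  VAF          = c(0.48, 0.45, 0.11),
  is_driver    = c(TRUE, FALSE, FALSE),       # boolean flag
  driver_label = c("KRAS G12D", NA, NA)       # label shown in visualisations
)
```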

Generating random models and data
You can sample a random dataset with the random_dataset function, setting:
• the number of mutations (n) and Beta components (k, subclones) to generate;
• ranges and constraints on the size of the components;
• ranges for the mean and variance of the Beta components.
The variance of the Betas is defined as u/s, where u ∼ U[0, 1] and s is the input parameter Beta_variance_scaling.
Roughly, values of Beta_variance_scaling on the order of 1000 give low variance and sharply peaked data distributions; values on the order of 100 give much wider distributions.

dataset = random_dataset(
  seed = 123456789,
  Beta_variance_scaling = 100 # variance ~ U[0, 1]/Beta_variance_scaling
)

A list with 3 components is returned, which contains the actual data, the sampled parameters of the generative model, and a plot of the data.
In mobster we provide the implementation of the model's density function (ddbpmm, density Dirichlet Beta Pareto mixture model), and a sampler (rdbpmm) which is used internally by random_dataset to generate the data.
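As a sketch of these two functions (the exact signatures and argument names are assumptions; check the package manual), one could draw new samples and evaluate the model density on a grid:

```r
# Assumed usage; `x` is a dbpmm model object and argument names are hypothetical.
new_points = rdbpmm(x, n = 1000)          # sample 1000 VAF values from the mixture
grid       = seq(0.01, 0.99, by = 0.01)
dens       = ddbpmm(x, data = grid)       # model density on the VAF grid
```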

Fitting a dataset
Function mobster_fit fits a MOBSTER model.
The function implements a model-selection routine that by default scores models by their reICL (reduced Integrative Classification Likelihood) score, a variant of the popular BIC that uses the entropy of the latent variables of the mixture. reICL is discussed in the main paper.
This function has several parameters to customise the fitting procedure, and a set of special pre-parametrised runs that can be activated with the parameter auto_setup. Here we use auto_setup = "FAST", an automatic setup for a fast run; its parameters are accessible through an internal package function. A call to mobster_fit will return a list with 3 elements:

• the best fit fit$best, according to the selected scoring method;
• fit$runs, a list with the ranked fits; fit$best matches the head of this list;
• fit$fits.table, a table that summarises the scores for each one of the runs.
Each fit object (fit$best or any object stored in fit$runs) is of the S3 class dbpmm. Usually, one keeps working with the best model fit. From that it is possible to extract the results of the fit and the clustering assignments. The output is a copy of the input data, with columns reporting the model's latent variables (LVs) and the cluster assignment (hard clustering). A second call can impose a cut on the assignments with less than 85% probability mass in the LVs.
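This workflow can be sketched as follows; the parameter name cutoff_assignment mirrors the description above and should be checked against the package manual:

```r
library(mobster)

# Fit with the fast automatic setup
fit = mobster_fit(input_data, auto_setup = "FAST")

# All hard clustering assignments
clusters = Clusters(fit$best)

# Keep only assignments with >= 85% probability mass in the LVs
# (parameter name assumed)
confident = Clusters(fit$best, cutoff_assignment = 0.85)
```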

If you want to assign new data to the fitted model you can use the function Clusters_denovo.

Basic plots of a fit
Clusters can be plotted as a histogram with the model density (total and per mixture component). By default, mobster names Beta clusters C1, C2, etc., in decreasing order of their mean; so C1 is always the cluster with the highest Beta mean. If the data are diploid mutations, C1 should represent the clonal mutations.

Mixing proportions
The negative log-likelihood (NLL) of the fit can be plotted against the iteration steps, to check that the trend decreases over time. In the example fit, the Moments Matching routine converged in 75 steps.
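Both visualisations can be obtained as follows (plot_NLL is the function name used in the package vignettes; treat it as an assumption if your version differs):

```r
# Histogram of the data with the model density overlaid (S3 plot method)
plot(fit$best)

# NLL trace across the fitting iterations
plot_NLL(fit$best)
```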

Goodness of fit
The scores for model selection can also be compared graphically.

Scores for model selection
In the above plot, all the computed scoring functions (BIC, AIC, ICL and reICL) are shown. This plot can be used to quickly check whether the best model under the default score (reICL) is also the best under the other scores. In this graphic, the red dot represents the best model according to each possible score.

Model-selection report
A general model-selection report assembles most of the above graphics. We can now compute n.resamples nonparametric bootstraps using the function mobster_bootstrap, passing parameters to the calls of mobster_fit. This function by default runs the fits in parallel (using a default percentage of the available cores); parallel computing capabilities are provided by the package easypar.
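A hedged sketch of such a call (the argument names bootstrap and n.resamples follow the description above; check the manual):

```r
# 25 nonparametric bootstrap resamples of the best fit; extra arguments
# are forwarded to the underlying mobster_fit calls.
boot = mobster_bootstrap(
  fit$best,
  bootstrap   = "nonparametric",
  n.resamples = 25,
  auto_setup  = "FAST"
)
```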

#> [ MOBSTER bootstrap~25 resamples from nonparametric bootstrap ]
The output object includes the bootstrap resamples, the fits and the possible errors returned by the runs. Errors of each run are available, if any.

Bootstrap statistics
Bootstrap statistics can be computed with bootstrapped_statistics.
With nonparametric bootstrap the data co-clustering probability is also computed (the probability of any pair of mutations in the data being clustered together). Note that this probability depends on the joint resample probability of each pair of mutations (each mutation is resampled with probability 1/n, for n input mutations).
bootstrapped_statistics prints several statistics to screen. The mutation rate mu (in cell division units) scaled by the probability of lineage survival beta, mu/beta, is given by

mu/beta = M / (1/f_min − 1/f_max)

where f_min is the minimum VAF, f_max the maximum VAF, and M is the number of mutations with VAF between f_min and f_max. Selection is defined as the relative growth rate of the host tumour cell population (lambda_h) vs the subclone (lambda_s):

1 + s = lambda_s / lambda_h
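As a worked example of the mu/beta formula above (the numbers are made up):

```r
# mu/beta = M / (1/f_min - 1/f_max), with f_min and f_max the VAF range of
# the tail, and M the number of mutations in that range (made-up values)
f_min = 0.05
f_max = 0.25
M     = 1200

mu_over_beta = M / (1 / f_min - 1 / f_max)   # 1200 / (20 - 4) = 75
```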
The mathematical details of these computations are described in the main paper.

Requirements

In order to be able to compute dN/dS values, the mutation data must store their genomic coordinates:
• chromosome location chrom,
• position from,
• reference and alternative alleles ref and alt.
Besides, it is important to know which reference genome was used for the alignment; this information will be used by dndscv to annotate the input mutations.
We show this analysis with the fits for one of the lung samples available in the package:

fit = mobster::LUFF76_lung_sample

The statistics can be computed for a custom grouping of the clusters. Here it does not make much difference because we have only the clonal cluster and the tail; but if we had one subclone C2 we could have pooled together the mutations in the clones using

# Not run here
dnds_stats = dnds(
  clusters,
  mapping = c(`C1` = 'Non-tail', `C2` = 'Non-tail', `Tail` = 'Tail'),
  gene_list = NULL
)

Using custom genes lists
A custom list of genes can be supplied in the call to dnds via the parameter gene_list; the package provides 4 gene lists of interest for this type of computation, which are available to load.

Pooling data from multiple patients
The input format of the dnds function allows data from several fits to be pooled at once. We pool data from the 2 lung datasets available in the package.
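A hedged sketch of the pooling step (the internal structure of the stored fit objects, and pooling by row-binding cluster tables, are assumptions; see the dnds manual):

```r
# Extract cluster assignments from the two lung fits shipped with mobster
clusters_LU4    = Clusters(mobster::LU4_lung_sample$best)
clusters_LUFF76 = Clusters(mobster::LUFF76_lung_sample$best)

# Pool them and compute dN/dS on the pooled table
pooled     = rbind(clusters_LU4, clusters_LUFF76)
dnds_stats = dnds(pooled, gene_list = NULL)
```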
You need to have drivers annotated in your object if you want to use ctree, and every driver_label has to be unique, as it will be used as the variantID column to identify the driver event.
We show the analysis with a synthetic dataset. (Figure: clone sizes for My_MOBSTER_model.)
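A hedged sketch of this final step (get_clone_trees is the function used in the package vignettes; the exact signature is an assumption):

```r
# Compute clone trees from a fit with annotated, uniquely-labelled drivers
trees = get_clone_trees(fit$best)

# Plot the top-ranking tree (ctree S3 plot method, assumed)
plot(trees[[1]])
```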