SMuRF: a novel tool to identify regulatory elements enriched for somatic point mutations

Background Single Nucleotide Variants (SNVs), including somatic point mutations and Single Nucleotide Polymorphisms (SNPs), in noncoding cis-regulatory elements (CREs) can affect gene regulation and lead to disease development. Several approaches have been developed to identify highly mutated regions, but these do not take into account the specific genomic context, and thus likelihood of mutation, of CREs. Results Here, we present SMuRF (Significantly Mutated Region Finder), a user-friendly command-line tool to identify these significantly mutated regions from user-defined genomic intervals and SNVs. We demonstrate this using publicly available datasets in which SMuRF identifies 72 significantly mutated CREs in liver cancer, including known mutated gene promoters as well as previously unreported regions. Conclusions SMuRF is a helpful tool to allow the simple identification of significantly mutated regulatory elements. It is open-source and freely available on GitHub (https://github.com/LupienLab/SMURF). Electronic supplementary material The online version of this article (10.1186/s12859-018-2501-y) contains supplementary material, which is available to authorized users.


Background
With the advent of next-generation sequencing technologies, a growing catalogue of genome-wide datasets has become available. This includes whole-genome sequencing to detect single nucleotide variants (SNVs) in diseased tissue (eg: TCGA Research Network: http://cancergenome.nih.gov/) as well as maps of histone variants and chromatin accessibility [1]. Using these datasets, numerous cis-regulatory elements (CREs) have been identified as recurrently mutated in cancer and other diseases. A notable example is the TERT promoter in glioma, melanoma, medulloblastoma, hepatocellular carcinoma, lung adenocarcinoma, thyroid and bladder cancers [2]. The mutations in this promoter create new transcription factor binding sites [3,4], leading to increased TERT expression and ultimately immortalization and genomic instability [5]. Enhancers and anchors of chromatin interaction can also display recurrent mutation, such as the PAX5 enhancer in chronic lymphocytic leukemia [6,7] and CTCF binding sites in colorectal cancer [8].
Others have previously developed methods to identify important clusters of somatic point mutations based on proximity [9] or an enrichment compared to the local background [10]. However, the mutation rate of a CRE is impacted by its chromatin accessibility and the binding of transcription factors, as demonstrated by a lower rate of mutation in open compared to closed chromatin [11]. Therefore, recurrently mutated CREs should be identified against a background of other regulatory elements with a matched chromatin accessibility in the same cell or tissue type. To achieve this, SMuRF receives a user-defined set of regions of interest as the input rather than relying on a proximity clustering of SNVs and provides a user-friendly tool to identify, filter, and annotate significantly mutated genomic regions.

Implementation
SMuRF consists of two main steps. The first filters, counts, annotates, and intersects the list of SNVs with the set of genomic coordinates, using a custom Bash script and the BEDTools suite [12]. The second consists in running a binomial test in R followed by a mutation rate filter to determine which genomic intervals are significantly enriched in SNVs and producing output figures as well as files for downstream analyses.

Input processing
The SNVs in BED or vcf format, are optionally filtered for known SNPs. This will remove either all known SNPs or only those with a minor allele frequency above 1% to preserve potentially interesting acquired SNVs that also occur as extremely rare polymorphisms in the population.
Subsequently, the input genomic regions are annotated as either gene promoter regions or as distal regulatory elements. This is done by overlapping those genomic intervals with a catalogue of gene promoters, derived from Gencode transcription start site annotations [13].
Finally, the input SNVs and genomic intervals are intersected to map all SNVs to unique genomic intervals, and the resulting data structure forms the starting point of the statistical analysis for mutation enrichment.
All of the above filtering and annotating can be achieved with data from any genome for which the required annotation files are available. Those for human builds hg19 and hg38 are supplied with the tool for convenience.

Identifying significantly mutated regions
The binomial test used by SMuRF to determine whether a given genomic region is significantly enriched for mutations requires an expected mutation rate. Depending on the sample cohort, the user can choose how this mutation rate is calculated. For each sample, the average number of mutations per base pair in input regions is calculated first.
A number of output files are generated and these are detailed within the manual; they include a list of genes whose promoters are significantly mutated for use in gene ontology analyses, as well as a bed-formatted list of mutated regions annotated as distal regulatory elements to allow the user to associate them to target genes through GREAT [15] or C3D [16]. The main output figure is a scatter plot of -log10(q-value) against the number of unique samples mutated in the region, and color-coded to distinguish gene promoters from distal regulatory elements.

Results and discussion
To illustrate the above steps, we used publicly available acquired SNVs from 88 liver cancer samples [17] and chromatin accessibility data from HepG2 [1] that provides a reference set for CREs. The total number of SNVs per sample used in the analysis after filtering ranged from 1344 to 25,121 (Fig. 1a), with an average of 1.2% falling within one of the 278,135 CREs (Fig. 1b) as identified in HepG2. While the input SNV numbers covered a wide range, no subset of patients was abnormally hyper or hypomutated, so we selected the "allsamples" mode to calculate the background mutation rate for each CRE. In total, 9485 individual CREs contained at least one mutation, of which 72 (6 promoters and 66 distal regulatory elements) were found to be significantly enriched for mutations (q-value ≤0.05 and peak mutation rate ≥ threshold) (Fig. 2 and Additional file 1: Table 1). These regulatory elements were each recurrently mutated in 2-5 samples.
Among the highly mutated promoters were those for the TERT, TP53, ACSM1, TNFRSF8, and PCGF5 genes, all previously reported recurrently mutated regions in liver cancer [18]. Also significantly mutated, however, was the promoter of a gene with unknown function, RP11-484D2.2, highlighting the potential of this type of analysis for uncovering novel regions of interest.
To further assess the ability of this approach to identify mutated regulatory elements that are relevant to the samples of interest, we compared the number of significantly mutated CREs identified in HepG2 to those found in other tissue types when using the same liver cancer mutation data. Chromatin accessibility data from eight ENCODE cell lines [1], including HepG2, was randomly sampled five times, matching for peak number and peak length, and SMuRF was run on each iteration using the same settings detailed above (Fig. 3). Significantly fewer (Mann-Whiney U test p-value range: 0.007-0.012) mutated CREs were identified in each of the seven other cell lines compared to HepG2.

Conclusions
Whole-genome sequencing and chromatin accessibility data sets in numerous normal and diseased tissues are  The datasets generated and/or analysed during the current study are available in the following manuscripts: [1] and [17].  Fig. 3 Assessing the sample specificity of SMuRF. SMuRF was run on matched chromatin accessibility data from seven other tissue types. Each peak set was randomly sampled 5 times and SMuRF was run on each iteration. SK-N-SH_RA had the lowest peak number and was not sampled. The selected peak sets were also matched to the HepG2 dataset for peak length. The number of significantly mutated CREs identified by SMuRF in each run are shown as green diamonds, with the height of the bar for each tissue corresponding to the average CRE number