GeDi: applying suffix arrays to increase the repertoire of detectable SNVs in tumour genomes

Background Current popular variant calling pipelines rely on the mapping coordinates of each input read to a reference genome in order to detect variants. Since reads deriving from variant loci that diverge in sequence substantially from the reference are often assigned incorrect mapping coordinates, variant calling pipelines that rely on mapping coordinates can exhibit reduced sensitivity. Results In this work we present GeDi, a suffix array-based somatic single nucleotide variant (SNV) calling algorithm that does not rely on read mapping coordinates to detect SNVs and is therefore capable of reference-free and mapping-free SNV detection. GeDi executes with practical runtime and memory resource requirements, is capable of SNV detection at very low allele frequency (<1%), and detects SNVs with high sensitivity at complex variant loci, dramatically outperforming MuTect, a well-established pipeline. Conclusion By designing novel suffix-array based SNV calling methods, we have developed a practical SNV calling software, GeDi, that can characterise SNVs at complex variant loci and at low allele frequency thus increasing the repertoire of detectable SNVs in tumour genomes. We expect GeDi to find use cases in targeted-deep sequencing analysis, and to serve as a replacement and improvement over previous suffix-array based SNV calling methods.

To evaluate GeDi for low frequency SNV detection we generated multiple simulated targeted deep-sequencing datasets using the method described in this section. We first generated two copies of hg19 chromosomes 1,8,9,15 and 22 (randomly picked from ftp://ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz) and to each copy added between 100-300 random SNPs from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp147.txt.gz resulting in pairs of personalised "maternal" and "paternal" chromosome files. For each pair of chromosome files, we extracted a two 1 Mbp sequences, one from the paternal file, the other from the maternal file, both starting at the same coordinate. We call these sequences the paternal and maternal target sequences respectively. The beginning coordinate of each pair of paternal and maternal target sequences was chosen randomly apart from avoiding coordinates that lead to an overlap between the target sequences with centromeric regions (defined in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz).
To simulate SNVs, we made a copy of each maternal and paternal target sequence and added 100 randomly located SNVs to the copies. This resulted in four target sequences for a given chromosome: two control target sequences, the original paternal and maternal target sequences; and two tumour target sequences, the copied control target sequences with 100 randomly located SNVs added to each.
For each chromosome, the four target sequences were used as templates to produce seven 1000x simulated targeted-deep sequencing datasets using ART NGS dataset simulator (Huang et al., 2012). Supplementary Data, Command 6 shows an example of the command and parameters used when constructing datasets with ART. The seven datasets produced for each chromosome had differing average allele frequencies of 0.50, 0.25, 0.10, 0.05, 0.02, 0.01 and 0.005. To construct, for example, the dataset with 0.005 average allele frequency, we produced the tumour data by generating two 495x coverage datasets using the paternal and maternal control target sequences as a template, and combining these with two 5x coverage datasets using the paternal and maternal tumour target sequence as a template into a single dataset. Similarly, to produce the control data, we combined two 500x coverage datasets using the paternal and maternal control target sequence as a template into a single dataset. All datasets were produced using this method and differed only in the coverages required to generate the necessary average allele frequencies. In total, this process gave 35 1000x simulated targeted deep-sequencing tumour-control paired NGS datasets: For each of the five chromosomes, seven datasets with differing average allele frequencies.

Method 2: Generating simulated sSRSC-containing datasets
To evaluate GeDi's ability to detect sSRSC we built simulated datasets containing sSRSC ranging from size k = 2 to k = 20 using ART NGS dataset simulator following the method described in this section. We first generated two copies of hg19 chromosomes 1, 8, 15, 17 and 22 (randomly picked from ftp://ftp.broadinstitute.org/bundle/ hg19/ucsc.hg19.fasta.gz) and to each copy added between 100-300 random SNPs from http:// hgdownload.cse .ucsc.edu/goldenPath/hg19/database/snp147.txt.gz resulting in pairs of personalised "maternal" and "paternal" chromosome files. To generate control data for a given chromosome, two 15x datasets were generated with ART using the corresponding paternal and maternal chromosome files as templates and combined to give a single 30x control dataset. To generate tumour data for a given chromosome, the corresponding paternal and maternal chromosome files were copied, and 250 sSRSCs were added to copied file. A minimum distance of at least 500 bp between each added sSRSC was enforced to avoid merging and coordinates that resulted sSRSC entering centromeric regions were avoided (centromeric regions defined in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ cytoBand.txt.gz). Apart from these constraints, each added sSRSC's coordinate was determined randomly. Furthermore, for each added sSRSC its size, k, was randomly chosen from a uniform distribution, where 1 ≤ k ≤ 20. For a given chromosome these sSRSC-containing paternal and maternal chromosome files were used as templates to generate two 15x using ART and were then combined into a single 30x tumour dataset.

Method 3: Removing false positives arising from somatic indels
We developed indel filter, to remove false positives caused by somatic indels. When tumour-consensus sequence pairs are examined during the SNV calling stage of GeDi the sequences within a pair must be aligned to one another. Accordingly, during this stage, pairs covering somatic indels will be misaligned unless a correcting gap is added across the indel site. If no correcting gap is added, sometimes this misalignment can giv e the appearance of multiple single character mismatches which GeDi would incorrectly call as SNVs. Indel filter removes false positives by identifying and discarding consensus sequence pairs covering somatic indels; these are often extracted during suffix array-based SNV detection. Indel filter identifies pairs covering somatic indels by performing gapped pair-wise alignment of tumour-consensus sequence pairs using smith-waterman. A scoring regime of match score, mismatch penalty, gap open penalty and gap extension penalty set to 1, 1, 0, 0 respectively minimises gap insertion/extension penalties. Under this scoring regime, consensus pairs producing single-gapped pair-wise alignments are discarded, as such alignments are indicative of somatic indels. Supplementary Data, Table 1 shows the effect of indel filter on false positive reduction in the presence of somatic indels. In future work GeDi will call somatic indels, rather than discarding them.

Method 4: Benchmarking resource requirements
For all analyses each SNV caller was run on a machine with SGI UV2 scheduler and Xeon E5-4650v2 CPUs. GeDi was compiled with g++ version 4.8.5 with O3 optimisation. SMuFin was compiled with using the Makefile given in their release version, which uses gcc 4.8.5 (http://cg.bsc.es/smufin/). MuTect was run with Java(TM) SE Runtime Environment (version 1.6.0 45). The SGI UV2 scheduler allows process-specific user-definable resource allocation that is exclusively reserved for each process. Accordingly, for the results in rows 1-4 of Table 3 from the main paper: all SNV callers were allocated 200 GB of memory; GeDi and MuTect allocated 32 logical threads, and were run with their threading values set to 32 (Supplementary Data, Commands 3 and 4 provide example commands); the memory allocation pool of each Java environment running MuTect was set to 200GB (-Xmx200g); since SMuFin's parallelism is achieved through MPI, the 256 thread run of SMuFin (command given on SMuFin website http://cg.bsc.es/smufin) was allocated 16 compute nodes each with 16 threads whilst the 32 thread run of SMuFin was allocated 2 nodes with 16 threads each (Supplementary Data, Commands 1 and 2 in Section 4 show the exact commands use to run SMuFin in 256 and 32 thread mode respectively). For row 5 of Table 3 from the main paper (analysis of dataset MB:L.A), GeDi and MuTect were allocated 1500 GB of memory with 64 logical threads. Their commands were adjusted accordingly.

Method 6: MB:L.A preprocessing
In Tyler S. Alioto (2015), preprocessing of MB:L.A raw sequencing reads followed best practices of the research group testing each pipeline. Information on these best practices is provided in the supplementary data of the work (Tyler S. Alioto, 2015). A raw sequencing reads was assesed using FastQC. This assessment lead to the trimming of the second reads of each 251bp MiSeq paired end run (runs A3MCW, A4DBD and A4DC6) being trimmed to 220bp (Tyler S. Alioto, 2015). Since the preprocessing practices of MB:L.A was only described for these pipelines, we decided to follow these practices exactly for our analysis of GeDi such that a fair comparison could be made between GeDi's and the aforementioned pipelines' output. Accordingly, we trimmed the reads of files:  (Bolger et al., 2014) and left the rest of the data unchanged. Our own assessment of the raw sequencing data with FastQC agreed with the practices of MB.E, MB.F1 and MB.F2 -the second reads from runs A3MCW, A4DBD and A4DC6 should cleaved to 220bp, and the remaining data should be left unchanged.
1.7 Method 7: Calculation of percentage of GeDi calls within cis-regulatory or transcribed regions SNV from GeDi ran in default mode when analysing MB:L.A were grouped into three catagories All GeDi, sSRSCresiding SNV calls, and non Gold Set as described in the main text. A fourth group, Gold Set, consists of all the SNVs from the Gold Set identified in Tyler S. Alioto (2015). SNVs in each of these groups were matched against a functional dataset consisting of transcribed and cis-regulatory regions. The transcribed regions included in the dataset comprise all introns and exons annotated in Human Reference Genome hg19. The cis-regulatory regions are the totality of DHS loci available from ENCODE Honey Badger DHS (https://personal.broadinstitute.org/ meuleman/reg2map/). Each of the four groups of SNVs was intersected with the functional dataset using bedtools intersect with default parameters to identify calls occuring in functional sites. The percentage of the SNV from each group that fell within the intersection was calculated and the results plotted in Figure 6 of main text.

Figures
2.1 Figure 1. Graphical explanation of GeDi's dual suffix array design. Figure 1: GeDi's dual suffix array design enables SNV detection at low allele frequency. a) A single SNV is covered by four reads R1-4, however only two reads cover the SNV in each DNA orientation (Denoted by +/-in Ori. column). b) In SMuFin's approach to suffix array-based SNV detection, a single suffix array is used (Moncunill et al., 2014). In this suffix array, reads R1-2 form a separate tumour-suffix enriched section to R3-4. As neither of these sections contain at least M SS = 4 tumour-derived suffixes, reads R1-4 are not extracted, even though the SNV is genuine and has a total coverage ≥ M SS. Hence, the SNV is undetectable. Since this SNVs coverage is low, its allele frequency is likely to be low. Therefore, the original GSA approach shows reduced sensitivity for SNV detection at low allele frequency. c) In GeDi's dual suffix array design, reads R1-4 are extracted from the primary GSA as pM SS = 2 and the separate tumour-suffix enriched sections formed from R1-2 and R3-4 both have ≥ pM M S tumour-suffixes. Once extracted, the auxiliary GSA groups together R1-4 with their reverse complements (Rn* denotes reverse complement of read Rn). Since the reverse complement grouping leads to tumour-suffix enriched sections with size ≥ aM SS = 4, reads R1-4 are successfully extracted and the SNV is now detectable. The extracted reads R1-4 are now locally aligned into a variant block using the start positions of the suffixes within the enriched section. For a given read Rn, Rn[i:] is the suffix starting at position i in Rn. Figure 2. Graphical explanation of GeDi's masking filter. The tumour consensus sequence (Tumour cns.) and the aligned control reads that will be used to make the control consensus sequence (Control cns.) are shown. An SNV is present at the coloured site (red 'G', blue 'T'). However, an SNP is also located in close proximity (yellow 'A', green 'T'). In the tumour consensus sequence, the 'A' SNP allele occurs on the same chromosome as the SNV. However in the control dataset, the 'T' allele is most frequent. Middle panel: Without the masking filter, the 'T' allele (most frequent) was the chosen allele for the control consensus sequence. With the masking filter, although the 'T' allele is selected, the masking filter identified the position for masking (asterisk), due to their being a high frequency of two bases at the SNP position (2 A's and 4 T's) and hence, greater than one non-zero component within the SNP containing column of the control sequence's frequency matrix. Rightmost panel: Without masking, the T/A SNP would be reported as an SNV by GeDi resulting in a false positive call. With masking, the 'A' within the tumour consensus sequence is replaced with the 'T' in the control. As a result, a false positive call is avoided. Since the tumour consensus sequence is never aligned, masking does not influence the coordinate position of the called SNV.     2.4 Figure 7. Effect of emfilter on recall and precision for SNV calling across different sSRSC. Figure 7: We analyzed the same datasets used to determine GeDi's an d MuTect's precision and recall for sSRSC detection (Figure 4 of main paper) with GeDi in default mode, with and without emfilter; a full explanation of how these datasets were generated is given in Supplementary Data, Method 2. The plot shows the k-precision and k-recall attained by GeDi for each sSRSC of size 2 ≥ k ≥ 20 with emfilter on (ON:blue) and off (OFF:red). We can see that, although emfilter reduces sensitivity, unlike with reference-based SNV callers, the loss in sensitivity is not proportional to the size of the sSRSC (Figure 4 of main paper). Hence, emfilter does not cause the mappingassociated sensitivity loss that afflicts reference-based SNV callers.

2.2
3 Tables   3.1 Effect of indel filter on false positive reduction. 3.2 Effect of masking filter on false positive reduction.  3.4 SNV calls made by GeDi when combining output from runs with pMSS = 1 and pMSS = 4.

Commands
Commands 1-4: Example commands used for GeDi, MuTect and SMuFin when running resource requirement analysis. Dataset downloaded from http://cg.bsc.es/smufin/ is used as inputs. For commands 5 and 6, all parameters in angled brackets would be set appropriately, the remaining parameters were never changed.