Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Background Somatic copy number alternations (SCNAs) can be utilized to infer tumor subclonal populations in whole genome seuqncing studies, where usually their read count ratios between tumor-normal paired samples serve as the inferring proxy. Existing SCNA based subclonal population inferring tools consider the GC bias of tumor and normal sample is of the same fature, and could be fully offset by read count ratio. However, we found that, the read count ratio on SCNA segments presents a Log linear biased pattern, which influence existing read count ratios based subclonal inferring tools performance. Currently no correction tools take into account the read ratio bias. Results We present Pre-SCNAClonal, a tool that improving tumor subclonal population inferring by correcting GC-bias at SCNAs level. Pre-SCNAClonal first corrects GC bias using Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchy clustering model. We show Pre-SCNAClonal’s superiority to exsiting GC-bias correction methods at any level of subclonal population. Conclusions Pre-SCNAClonal could be run independently as well as serving as pre-processing/gc-correction step in conjuntion with exsiting SCNA-based subclonal inferring tools. Electronic supplementary material The online version of this article (10.1186/s12859-018-2099-0) contains supplementary material, which is available to authorized users.

: GC bias of read count ratio of paired tumor and normal sample of HCC2218. Red line denotes the linear regression line. SCNA segments are obtain by BIC-seq [5] . The WGS sequence alignment data (.bam files) of 'HCC2218' and its paired normal sample are publicly available on Illumina BaseSpace Sequence Hub website 2 Read count ratio bias simulation 2

.1 Method
To further test Pre-SCNAClonal, we develop a package that could simulate data that presents stripes with log-linear bias pattern. It mainly contains two steps to generate simulation data, i) generate the data without any bias; ii) generate the biased data based on this data.
Given the total number of segments n, first we need to allocate n to stripes. We calculate all the possible average copy numberC k byC here, cn max is the maximum copy number pre-defined, sn is the subclonal number pre-defined, τ k are pre-defined subclonal frequency for subclone k, C is the absolute copy number. Then we could get the number of stripes as the unique number of the value of all the possible average copy numbers (denoted as unique(C)). Next we allocate n to unique(C) stripes. The proportions of segments of all stripes p are sampled from a Dirichlet distribution, here, each value of the parameter vector α equals 1, dimension of p and α equals unique(C). Then we get the center of each strip m Yj by, here, we do not simulate the somatic deletions, for log operation causes the m Yj of somatic deletions dispersed and approaching to negative infinite. Next the y coordinates of stripe j, Y j , could be sampled from normal distribution, here, σ j is obtained by For stripe j, x coordinates are sample from a belta distribution, parameter a j , b j define its horizontal density distribution, here, ζ j is the location between the max and min of X j , then x ji is Generally, we set min(X j ) = 0.35, and Next, the number of the segments in stripe j with higher GC content is less than the segments with lower GC content, we set 1 < a j < b j in Equation 6 and shrink y ji by  it is generated with maximum copy number assigned as 20, subclonal number assigned as 1, purity assgned as 0.8. b) The simulation data with GC bias, which is generate from a) with horizontal bias slope set as -1, vertical bias slope assigned as -90.
here s j ∈ (0, 1) is the shrink factor pre-defined for stripe j.
Final step, assign bias to the original data. The vertical bias is generated by given the horizontal bias slope m v and interception c v . And the horizontal bias is generated by given the vertical bias slope m h ,and interception c h .

The performance of Pre-SCNAClonal on simulation data
We simulate the log(D S /D N ) data with maximum copy number 20, subclonal number 1, purity 0.8, as shown in Figure 2a. Then we apply GC bias to this simulated data with vertical GC bias slope -90, horizontal GC bias from -20 to 20. For each combination of vertical and horizontal GC bias, we apply MCMC correction model to correct the horizontal GC bias 10 times. As shown in Figure 3, the boxplot of GC correction result of MCMC model shows that most of horizontal GC bias slope predicted by MCMC model is precisely the horizontal GC bias pre-set, while the linear regression model could not correctly predicted the horizontal GC bias slope.

Hierarchy clustering based baseline selction model
SCNAs based subclonal reconstruction tools, such as MixClone [3] and THetA [4] , use the read coverage and B-allele frequency (BAF) together to estimate the absolute copy number and population frequency. Since there exists difference between the WGS data of tumor-normal paired samples (such as batch effects), the log(D S /D N ) of baseline segments not always locate at 0. MixClone [3] tries to detect the baseline segments through BAF information while THetA [4] leaves this problem to user. MixClone [3] assumes that the average value of segments which are not Lost of Heterzyosity (LOH) segments is the baseline location. This assumption is not correct for the case that all the segments with positive number of the maternal copy and paternal copy, are not LOH segments.
Here we propose the features of baseline segments, • the segment does not lose heterzyosity; • the number of maternal copy and the number of paternal copy of the segment are the same; • for all the segments meet above two conditions, the segments with the minimum log(D S /D N ) are baseline segments.
The top two conditions limit the segments to even copy number strips, then the last condition pick out segments with the minimum even copy number 2 as the baseline segments. Pre-SCNAClonal uses the same method in MixClone [3] to get the LOH status of every segment. Next, for each segment, Pre-SCNAClonal select the segments that contain loci of heterzyogous SNPs with 50% B-allele frequency in sequencing data of tumor sample from the LOH segments, denoted as APM (Average Paternal Maternal) segments. Then Pre-SCNAClonal use hierarchy clustering model to group APM segments according to their log(D S /D N ) value. Finally, Pre-SCNAClonal select the segment group with the minimum log(D S /D N ) value as baseline segments. Generally, there are two typical methods to obtain the BAF sites of each segment, i) find out all the heterzyogous sites from aligned WGS read set of normal sample (i.e. MixClone [3] ); ii) use the pre-known heterzyogous SNP sites as input, filter out the heterzyogous sites from aligned WGS read set of normal sample (i.e. THetA [4] ); The number of the heterzyogou SNP sites detected by the second method is equal or less than the number detected by the first method. Pre-SCNAClonal imposes another limitation, the density threshold of heterzyogou SNP in each segment, to filer out highly reliable baseline segments.

Performance of baseline selection model of Pre-SCNAClonal and MixClone
MixClone [3] obtains baseline by removing outliers of the segments that do not lose heterzygosity. For sequencing data, it is difficult to distinguish LOH from sequencing deviation. As shown in Figure 4, segments that do not lose heterzygosity are randomly distributed everywhere. Baseline selection method of MixClone almost picks out all the segments as baseline while the tumor purity is low. In comparison with baseline obtained by MixClone, as shown in Figure 5, baseline obtained by Pre-SCNAClonal is lower and more consistent than the baseline obtained by MixClone.

The ploidy number
According to the multiple studies as shown in Table 1, tumor sample HCC1954 is tetraploidy. In this subsection, we calculate the ploidy number based on baseline position to validate baseline selection model of Pre-SCNAClonal. Figure 6 shows the distribution of log(D S i /D N i ) of the paired tumor-normal sample of HCC1954 (coverage 58x, purity 99%). We manually pick out the discernible stripes, marked with 'a', 'b', 'c', 'd', 'e', then we mark all the segments lower than 'a' with 'a" and mark all the segments higher than 'e' with 'f'. Then we manually get the upper boundary and lower boundary of each stripes listed in Table 2. According to Equation ?? in Section ??, the distance between stripes with high copy number approaches 0. For the indiscernible area 'f',     we use Equation ?
? to estimate the absolute copy number of each segment in it. We use stripe 'e' to correct the segments location, to make all the absolute copy numbers of segments in 'f' larger than the absolute copy number of stripe 'e'. Then we calculate the plodiy number ξ by Here, n denotes the total segment number, cn i denotes the absolute copy number of segment i and l i denotes the segment length of segment i. As shown in Table 3, the ploidy number is 3.876 while stripe 'b' is baseline, which is closer to the results obtained by the previous studies listed in Table 1 than the ploidy number based on the other two baselines. Let µ i denote the BAF of SCNA segment i of tumor genome on germline heterozygous SNP site, and let C i , G i respectively denote the absolute copy number and genotype of SCNA segment i. As shown in Figure 7, the B allele (non-reference allele) could be either maternal or paternal allele, thus the BAF of SCNA segments of tumor genome presents symmetrical pattern in [0, 1] (as shown in Table 4). Let ξ i denote the BAF of the tumor sample, φ i denote the subclonal population frequency, then,