### GC bias of the tumor WGS data does not have the same feature as its paired normal

Let the coefficient *θ*_{j} denote the effect of mappability and genomic length of segment *j*, \(\bar {C}_{j}\) the average copy number of segment *j*, *λ*_{j} the expected read count of segment *j*, and \(D_{j}^{N}\) and \(D_{j}^{S}\) the read counts of segment *j* in the matched normal genome and in the tumor sample respectively. For segments *i* and *j*, existing SCNA-based tools for inferring tumor subclonal populations [8, 9] assume that \({\lambda _{i}}/{\lambda _{j}} ={\bar {C}_{i}\theta _{i}}/{\bar {C}_{j}\theta _{j}}\) and \(\theta _{i} / \theta _{j} = D_{i}^{N} / D_{j}^{N}\), so that

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \frac{D_{i}^{N}}{D_{j}^{N}}. $$

(1)

Figure 1 shows two libraries prepared from the same normal sample; their two loess lines cross at one point. Suppose that normal Lib 2 is a tumor sample without any variation and that normal Lib 1 is its paired normal sample. According to Eq. 1,

$$ \frac{D_{i}^{Lib2}}{D_{j}^{Lib2}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{2}{2} * \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}} = \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}}, $$

(2)

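If *j* is the crossover point, then \(D_{j}^{Lib2} = D_{j}^{Lib1}\), and Eq. 2 gives \(D_{i}^{Lib2} = D_{i}^{Lib1}\) for every segment *i*, which means that the two loess lines should overlap each other. In Fig. 1 they cross instead of overlapping, so the assumption behind Eq. 1 does not hold. This demonstrates that the GC bias differs between a tumor sample and its paired normal sample.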
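The argument can be checked with a small numerical sketch. The read counts in `lib1` are made up, and `predict_lib2` is a hypothetical helper that simply applies Eq. 2; neither comes from the original analysis.

```python
# Made-up per-segment read counts for normal Lib 1 (illustrative only).
lib1 = [100.0, 120.0, 90.0, 150.0]


def predict_lib2(lib1_counts, lib2_at_j, j):
    """Predict Lib 2 counts from Eq. 2: D_i^Lib2 / D_j^Lib2 = D_i^Lib1 / D_j^Lib1."""
    return [lib2_at_j * d_i / lib1_counts[j] for d_i in lib1_counts]


# At the crossover segment j both libraries have the same read count.
j = 2
predicted = predict_lib2(lib1, lib2_at_j=lib1[j], j=j)
```

Under Eq. 2 the predicted Lib 2 counts equal the Lib 1 counts at every segment, which is why a single crossover point would force the two loess lines to coincide.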

### Modelling the difference of GC bias between paired tumor and normal sample

We find that the difference between the GC bias of a tumor sample and that of its paired normal sample can be modelled by the following equations,

$$ \begin{aligned} &D_{i}^{N} = \frac{f(GC_{i})}{\exp(a_{1} * GC_{i}) / (d_{1} * GC_{i})} \\ &D_{i}^{S} = \frac{f(GC_{i})}{\exp(a_{2} * GC_{i}) / (d_{2} * GC_{i})} \\ \end{aligned}, $$

(3)

In this equation, *f*(*GC*_{i}) is a function of GC content that represents the bias feature shared by the tumor and its paired normal sample. *a*_{1}, *a*_{2}, *d*_{1} and *d*_{2} capture the differences in bias between the two samples: following the superscripts in Eq. 3, *a*_{1} and *d*_{1} belong to the paired normal sample, and *a*_{2} and *d*_{2} belong to the tumor sample. *a*_{1} and *a*_{2} represent the curvature, and *d*_{1} and *d*_{2} represent the distance. As shown in Fig. 2, the differences in bias between the tumor sample HCC1954 and its paired normal HCC1954 BL are well captured by this model.

According to Eq. 3, Eq. 1 is transformed into

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \exp\left[(a_{2}-a_{1})* (GC_{j} - GC_{i})\right] * \frac{D_{i}^{N}}{D_{j}^{N}}, $$

(4)

then,

$$ \log{\frac{D_{i}^{S}}{D_{i}^{N}}} - \log{\frac{D_{j}^{S}}{D_{j}^{N}}} = \log{\frac{\bar{C}_{i}}{\bar{C}_{j}}} + \left(a_{2}-a_{1}\right)* \left(GC_{j} - GC_{i}\right). $$

(5)

Equation 5 reveals that, on SCNAs, the log read count ratio shows a bias that is linear in GC content, which we demonstrate later. Equation 5 also shows that a GC bias remains in the read count ratio between the paired tumor and normal samples whenever the curvatures of the two samples differ. We also observe this phenomenon in HCC2218 (Additional file 1: Figure S1).
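The step from Eq. 3 to Eq. 5 can be verified numerically. The sketch below assumes that the tumor read count additionally scales with the average copy number, as in Eq. 1, and that the normal copy number is 2. The shared bias `shared_bias` and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def shared_bias(gc):
    """Hypothetical shared bias f(GC); any positive function would do here."""
    return 1000.0 * gc * (1.0 - gc)


def read_count(gc, a, d, copy_number=2.0):
    """Eq. 3 read count, scaled by the average copy number (assumption)."""
    return copy_number * shared_bias(gc) / (math.exp(a * gc) / (d * gc))


a1, d1 = 1.5, 0.8      # paired normal sample (illustrative values)
a2, d2 = 2.5, 1.1      # tumor sample (illustrative values)
gc_i, gc_j = 0.35, 0.55
c_i, c_j = 3.0, 2.0    # average copy numbers of segments i and j

# Left-hand side of Eq. 5: difference of the two log read count ratios.
lhs = (math.log(read_count(gc_i, a2, d2, c_i) / read_count(gc_i, a1, d1))
       - math.log(read_count(gc_j, a2, d2, c_j) / read_count(gc_j, a1, d1)))
# Right-hand side of Eq. 5.
rhs = math.log(c_i / c_j) + (a2 - a1) * (gc_j - gc_i)
```

Both `f` and the distance parameters cancel in the difference, so only the copy number ratio and the curvature difference \(a_{2}-a_{1}\) survive.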

### BAF in tumor WGS data presents symmetrical pattern in [0,1] at heterozygous SNP sites

Let *μ*_{i} denote the BAF of SCNA segment *i* of the tumor genome at germline heterozygous SNP sites, and let *C*_{i} and *G*_{i} denote the absolute copy number and genotype of SCNA segment *i* respectively. The B allele (non-reference allele) can be either the maternal or the paternal allele, so the BAF of SCNA segments of the tumor genome presents a symmetrical pattern in [0,1] (see Additional file 1: Supplementary 3.3.2 for a detailed proof). Let *ξ*_{i} denote the BAF of the tumor sample and *ϕ*_{i} the subclonal population frequency; then,

$$ \xi_{i} = \frac{\phi_{i} * C_{i} * \mu_{i} + \left(1- \phi_{i}\right) *2*\frac{1}{2}}{\bar{C}_{i}}, $$

(6)

$$ \bar{C}_{i} = \phi_{i} * C_{i} + \left(1- \phi_{i}\right) *2. $$

(7)

In Eqs. 6 and 7, ‘2’ and ‘\(\frac {1}{2}\)’ are the copy number and heterozygous BAF of the normal sample. Because *μ*_{i} is symmetrical in [0,1], *ξ*_{i} is symmetrical in [0,1] as well.
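As a minimal sketch, Eqs. 6 and 7 can be written as two small functions. The function names and the example values (subclonal frequency 0.6, copy number 3) are illustrative assumptions, not values from the study.

```python
def average_copy_number(phi, c):
    """Eq. 7: average copy number of a segment."""
    return phi * c + (1.0 - phi) * 2.0


def sample_baf(phi, c, mu):
    """Eq. 6: BAF observed in the tumor sample."""
    return (phi * c * mu + (1.0 - phi) * 2.0 * 0.5) / average_copy_number(phi, c)


phi, c = 0.6, 3            # illustrative subclonal frequency and copy number
mu = 1.0 / 3.0             # B allele carried by one of the three copies
xi = sample_baf(phi, c, mu)
xi_mirror = sample_baf(phi, c, 1.0 - mu)   # B allele carried by the other two copies
baseline_xi = sample_baf(phi, 2, 0.5)      # segment without SCNA
```

Swapping *μ* for 1 − *μ* maps *ξ* to 1 − *ξ*, since the two numerators of Eq. 6 add up to \(\bar{C}_{i}\); this is the symmetry stated above.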

### GC bias of read count ratio affects SCNA based subclonal population analysis

When the window size is increased to 5000 bp (Fig. 3c), or further to the SCNA level (Fig. 3b), the 2D plot of GC content against tumor-normal coverage ratio clusters clearly into multiple stripes. On SCNAs, the relationship between GC content and the log ratio of tumor-normal coverage is close to linear (Fig. 3a), and the slope of this linear relation varies across tumors (Additional file 1: Figure S1). We also show that the gaps between the stripes in Fig. 3a are proportional to the subclonal populations (see the sub-figures in the first column of Fig. 4). The SCNA segments clustered into the same stripe present a symmetrical pattern of B allele frequency (BAF) density at the heterozygous loci of the paired normal sample (Fig. 3e), which reveals that the SCNA segments in the same stripe have the same copy number (see Additional file 1: Supplementary 3.3.2 for a detailed proof). Therefore, before the read count ratio of SCNA segments is used to estimate the precise subclonal population of each SCNA, the GC bias of the gaps must be corrected.
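The link between stripe gaps and subclonal populations can be illustrated with a short sketch. It assumes two segments with the same GC content, so that Eq. 5 reduces the difference in log ratio to \(\log(\bar{C}_{i}/\bar{C}_{j})\), and it measures each stripe against a baseline with \(\bar{C}=2\). The helper `stripe_offset` and the frequencies used are hypothetical.

```python
import math


def stripe_offset(phi, c):
    """Offset of a stripe from the baseline stripe: log(C_bar / 2), from Eqs. 5 and 7."""
    c_bar = phi * c + (1.0 - phi) * 2.0
    return math.log(c_bar / 2.0)


# Offsets of a copy-number-3 stripe for three illustrative subclonal frequencies.
offsets = {phi: stripe_offset(phi, 3) for phi in (0.2, 0.5, 1.0)}
```

For small offsets, \(\log(\bar{C}_{i}/2) \approx \phi_{i}(C_{i}-2)/2\), so the gap grows roughly in proportion to the subclonal frequency.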

### Existing read count ratio’s GC bias correction methods are not suitable for SCNA based subclonal population analysis

Existing GC correction methods for WGS data of tumor-normal paired samples, such as CNAnorm [14], rectify the distribution of the read count ratio in small windows, with the aim of finding the positions and absolute copy numbers of SCNAs (Fig. 3d) by merging adjoining small windows with similar ratio properties. Such a method uses a regression model to rectify the distribution of the ratio over GC content and thereby removes the dependency on GC content. However, when this kind of GC correction is used to rectify the bias of the read count ratio for SCNA-based subclonal population analysis, the regression must additionally capture the slope of the gaps between the SCNA stripes correctly. As shown in Fig. 3a and b, linear or loess regression is easily biased by outliers, and the regression lines in Fig. 3a and b are not parallel to the stripes; hence GC content bias remains after the dependency on GC content is removed using these regression lines (see Fig. 4).
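One way in which a pooled regression can fail to parallel the stripes is sketched below on synthetic, noise-free data. The two stripes, their offsets, and the helper `ols_slope` are assumptions made for illustration; this is an ordinary least-squares fit, not the CNAnorm procedure.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


true_slope = -1.0
# Each entry is (GC content, stripe offset). Both stripes share the same slope,
# but the upper stripe happens to cover higher GC content.
stripe_a = [(0.30 + 0.01 * k, 0.0) for k in range(21)]   # GC 0.30 to 0.50
stripe_b = [(0.40 + 0.01 * k, 0.4) for k in range(21)]   # GC 0.40 to 0.60
points = [(gc, true_slope * gc + offset) for gc, offset in stripe_a + stripe_b]

# A single regression over all segments mixes stripe offsets into the slope.
pooled_slope = ols_slope([p[0] for p in points], [p[1] for p in points])
```

In this construction the pooled slope is positive although every stripe has slope −1, so subtracting the pooled fit would leave a GC-dependent trend inside each stripe.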

### Models of Pre-SCNAClonal for read count ratio’s GC bias correction for SCNA based subclonal population analysis

#### MCMC model

Pre-SCNAClonal uses a Markov chain Monte Carlo (MCMC) model to find the stripe slope *m* with the maximum posterior probability, as defined in Eq. 8,

$$ \begin{array}{ll} p(m|Y,X) & \sim p(m) * p(Y,X|m)\\ m & \sim \text{Uniform}(a- \delta, a+\delta)\\ p(Y, X|m) & = \Lambda(D, \tau * \max(cn))\\ D & = density(Y')\\ Y' & = Y-(m*X + c) + \text{median}(Y)\\ \end{array}, $$

(8)

Here *Y* and *X* denote log(*D*^{S}/*D*^{N}) and GC content respectively; *a* and *c* are the slope and intercept pre-determined by two points, whose coordinates are the medians of *X* and *Y* in the low and high GC content areas; *δ* is the pre-specified slope range; *D* denotes the density function; *Λ*(*D*, *τ* ∗ max(*cn*)) denotes the sum of the top (largest) *τ* ∗ max(*cn*) peaks of the density curve *D*; *τ* denotes the number of subclonal populations; and max(*cn*) denotes the pre-defined maximum copy number. *Y*^{′} represents the corrected *Y*.
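The idea behind Eq. 8 can be sketched in a simplified form. The code below is not the Pre-SCNAClonal implementation: a grid search over the support of the uniform prior stands in for the MCMC sampler, and a histogram stands in for the density curve, with *Λ* approximated by the total count in the fullest bins. The function names, the bin width, the grid size and the synthetic data are all assumptions.

```python
import statistics
from collections import Counter


def peak_score(y_corrected, bin_width, n_peaks):
    """Proxy for Lambda in Eq. 8: total count in the n_peaks fullest histogram bins."""
    bins = Counter(round(y / bin_width) for y in y_corrected)
    return sum(count for _, count in bins.most_common(n_peaks))


def best_slope(x, y, a, delta, c, n_peaks, bin_width=0.01, n_grid=11):
    """Grid search over [a - delta, a + delta] for the slope with the highest score."""
    med = statistics.median(y)
    best_m, best_score = None, -1
    for k in range(n_grid):
        m = a - delta + 2.0 * delta * k / (n_grid - 1)
        # Y' = Y - (m * X + c) + median(Y), as in Eq. 8.
        y_corr = [yi - (m * xi + c) + med for xi, yi in zip(x, y)]
        score = peak_score(y_corr, bin_width, n_peaks)
        if score > best_score:
            best_m, best_score = m, score
    return best_m


# Noise-free synthetic data: three stripes with slope -0.8 and offsets 0.0, 0.3, 0.6.
gc = [0.30 + 0.01 * k for k in range(31)]
x = gc * 3
y = [-0.8 * g + 0.5 + off for off in (0.0, 0.3, 0.6) for g in gc]
m_hat = best_slope(x, y, a=-0.6, delta=0.5, c=0.5, n_peaks=3)
```

When the candidate slope matches the stripe slope, each stripe collapses into a single narrow peak of *Y*′, which maximises the score; any other slope smears the stripes over several bins. On real, noisy data the bin width (or a kernel bandwidth) would need to be tuned.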

#### Hierarchical clustering model

Note that, normally, the read counts of tumor segments without SCNA (defined as baseline) are not equal to those of the paired normal sample because of the coverage difference. According to Eqs. 6 and 7, \(\bar {C}_{i}\) and *ξ*_{i} of a baseline segment always equal 2 and \(\frac {1}{2}\) respectively. Moreover, \(\xi _{i} = \frac {1}{2}\) if and only if \(\mu _{i}=\frac {1}{2}\). Then, according to Eqs. 5 and 7, the baseline segments lie in the SCNA stripe with \(\xi _{i} = \frac {1}{2}\) and the smallest \(\log {\frac {D_{i}^{S}}{D_{i}^{N}}}\), because only a positive even *C*_{i} with equal paternal and maternal copies can make \(\mu _{i}=\frac {1}{2}\). Thus, after the GC correction, Pre-SCNAClonal picks out all the segments with \(\xi _{i} = \frac {1}{2}\) and applies a hierarchical clustering model to group them into several clusters; it then selects the cluster with the smallest \(\log {\frac {D_{i}^{S}}{D_{i}^{N}}}\) as the baseline segments.
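The baseline selection can be sketched as follows. This is a simplified stand-in, not the Pre-SCNAClonal code: it uses one-dimensional single-linkage clustering cut at a fixed gap, and the tolerance `baf_tol`, the threshold `max_gap` and the example segments are hypothetical.

```python
def single_linkage_1d(values, max_gap):
    """Single-linkage clustering of 1-D values, cut where neighbours differ by more than max_gap."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > max_gap:
            clusters.append([])
        clusters[-1].append(cur)
    return clusters


def baseline_segments(log_ratio, baf, baf_tol=0.02, max_gap=0.05):
    """Indices of segments with BAF near 1/2 in the cluster with the lowest mean log ratio."""
    balanced = [i for i, b in enumerate(baf) if abs(b - 0.5) <= baf_tol]
    if not balanced:
        return []
    clusters = single_linkage_1d([log_ratio[i] for i in balanced], max_gap)
    lowest = min(clusters, key=lambda c: sum(log_ratio[balanced[k]] for k in c) / len(c))
    return sorted(balanced[k] for k in lowest)


# Hypothetical GC-corrected segments: log read count ratio and BAF per segment.
log_ratio = [-0.20, -0.19, 0.21, 0.05, -0.21, 0.20]
baf = [0.50, 0.49, 0.50, 0.33, 0.51, 0.51]
baseline = baseline_segments(log_ratio, baf)
```

Here segments 2 and 5 also have a BAF near \(\frac{1}{2}\) but a higher log ratio, as would be expected for a balanced gain, so only the lower cluster is returned as baseline.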