Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Chu, Yanshuo; Teng, Mingxiang; Wang, Yadong

doi:10.1186/s12859-018-2099-0

Volume 19 Supplement 5

Selected articles from the Biological Ontologies and Knowledge bases workshop 2017

Research
Open access
Published: 11 April 2018

Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Yanshuo Chu¹,
Mingxiang Teng¹ &
Yadong Wang¹

BMC Bioinformatics volume 19, Article number: 112 (2018) Cite this article

3075 Accesses
3 Citations
1 Altmetric
Metrics details

Abstract

Background

Somatic copy number alternations (SCNAs) can be utilized to infer tumor subclonal populations in whole genome seuqncing studies, where usually their read count ratios between tumor-normal paired samples serve as the inferring proxy. Existing SCNA based subclonal population inferring tools consider the GC bias of tumor and normal sample is of the same fature, and could be fully offset by read count ratio. However, we found that, the read count ratio on SCNA segments presents a Log linear biased pattern, which influence existing read count ratios based subclonal inferring tools performance. Currently no correction tools take into account the read ratio bias.

Results

We present Pre-SCNAClonal, a tool that improving tumor subclonal population inferring by correcting GC-bias at SCNAs level. Pre-SCNAClonal first corrects GC bias using Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchy clustering model. We show Pre-SCNAClonal’s superiority to exsiting GC-bias correction methods at any level of subclonal population.

Conclusions

Pre-SCNAClonal could be run independently as well as serving as pre-processing/gc-correction step in conjuntion with exsiting SCNA-based subclonal inferring tools.

Background

Tumor heterogeneity introduces challenges in cancer tissue diagnosis and subsequent treatment [1]. Currently, projects such as TCGA [2] screened thousands of tumor samples using whole-genome sequencing(WGS) on tissue (bulk) cells, provide more explicit molecular insights on identifying cancer cell types and sub-types than other bioinformatics methods [3–6]. To decipher cell composition in bulk cell WGS, somatic copy number alterations (SCNAs), commonly found in tumor cells [7], are utilized as the representative to determine tumor subclonal populations in a tumor-normal tissue paired manner by existing tools, e.g. MixClone [8], THetA [9]. The benefits of using SCNAs is that the WGS data doesn’t have to be deeply sequenced [8]. However, existing tools lack the ability to properly account for sequencing GC bias which is widely observed in DNA-seq data [10].

Evidences have showed that GC-bias could affect SCNA identification in tumor cells [11–13]. Existing tools consider the SCNA segments have the same sequence properties between the normal and tumor samples, and consider the bias could be offset to use the read count ratios between tumor and normal paired samples [8, 9]. However, We found that, in a GC biased study, the GC contents and read count ratios on SCNA segments present a Log linear biased pattern. Though existing method [14] suggests removing GC bias by modeling GC cotent with tumor-normal coverage difference for small genomic windows, however, we find that small window is not a proper and robust resolution for SCNA.

We present Pre-SCNAClonal, a tool that improving tumor subclonal population inferring by correcting GC-bias at SCNAs level. Pre-SCNAClonal first corrects GC bias using Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchy clustering model. We show Pre-SCNAClonal’s superiority to exsiting GC-bias correction methods for SCNA-based tumor reconstruction tools at any level of subclonal population. We also note that Pre-SCNAClonal could be run independently as well as serving as pre-processing/gc-correction step in conjuntion with exsiting SCNA-based subclonal inferring tools.

Data

The WGS data of human breast cancer HCC2218 and HCC1954 with different levels of normal contamination (coverage 30x) are used to validate the method proposed in this paper. Each of the HCC1954 samples, HCC1954.mix1.n5t95, HCC1954.mix1.n20t80, HCC1954.mix1.n40t60, HCC1954.mix1.n60t40, HCC1954.mix1.n80t20 and HCC1954.mix1.n95t5, contains one tumor subclone. The tumor subclonal frequencies (or tumor purity) of these samples are 0.95, 0.80, 0.60, 0.40, 0.20 and 0.05, respectively. We also use the data of human ovary cancer sample TCGA-13-0723 in Benjamini’s work [11] to show the read count ratio’s GC bias between paired tumor and normal sample.

The WGS sequence alignment data (.bam files) of HCC2218 and its paired normal sample are publicly available on Illumina BaseSpace Sequence Hub website https://basespace.illumina.com. The WGS sequence alignment data (.bam files) of HCC1954 and its paired normal sample and the WGS data with different levels of normal contamination are public available at National Cancer Institute GDC Data Portal https://gdc.cancer.gov/resources-tcga-users/tcga-mutation-calling-benchmark-4-files. The WGS sequence alignment data (.bam files) of TCGA-13-0723 is available at National Cancer Institute GDC Data Portal https://portal.gdc.cancer.gov/ only for authorized user.

Methods

GC bias of the tumor WGS data does not have the same feature as its paired normal

Let coefficient θ_j denote the effect of mappability and genomic length of segment j, $\bar {C}_{j}$ denote the average copy number of segment j, λ_j denote the expected read counts, and let $D_{j}^{N}$ denote the read counts of segment j in matched normal genome, then for segment i and segment j, existing SCNA based tumor subclonal populations inferring tools [8, 9] assume that ${\lambda _{i}}/{\lambda _{j}} ={\bar {C}_{i}\theta _{i}}/{\bar {C}_{j}\theta _{j}}$, and $\theta _{i} / \theta _{j} = D_{i}^{N} / D_{j}^{N}$, then

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \frac{D_{i}^{N}}{D_{j}^{N}}. $$

(1)

Figure 1 shows the two normal libraries from the same normal sample, and there is a crossover point of the two loess lines. Here we suppose the normal Lib 2 is a tumor sample has no variations, and normal Lib 1 is its paired normal sample. According to Eq. 1,

$$ \frac{D_{i}^{Lib2}}{D_{j}^{Lib2}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{2}{2} * \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}} = \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}}, $$

(2)

If j is the crossover point, we have $D_{i}^{Lib2} = D_{i}^{Lib1}$, which means the two loess lines should overlap each other. This demonstrates that the GC bias is different in the tumor and its paired normal sample.

Modelling the difference of GC bias between paired tumor and normal sample

We find that, the difference between the GC bias of tumor and its paired normal could be modelled as following equation,

$$ \begin{aligned} &D_{i}^{N} = \frac{f(GC_{i})}{\exp(a_{1} * GC_{i}) / (d_{1} * GC_{i})} \\ &D_{i}^{S} = \frac{f(GC_{i})}{\exp(a_{2} * GC_{i}) / (d_{2} * GC_{i})} \\ \end{aligned}, $$

(3)

In this equation, f(GC_i) is a function of GC content, which represents the bias feature that shared by tumor and its paired normal sample. a₁, a₂, d₁ and d₂ denote the distinctions of bias feature between tumor and its paired normal sample. a₁ and a₂ represent the curvature of tumor and its paired normal sample respectively; d₁ and d₂ represent the distance of tumor and its paired normal sample respectively; As shown in Fig. 2, the distinctions of bias feature between the paired tumor sample HCC1954 and its paired normal HCC1954 BL could be well captured by this model.

According to Eq. 3, Eq. 1 is transformed into

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \exp\left[(a_{2}-a_{1})* (GC_{j} - GC_{i})\right] * \frac{D_{i}^{N}}{D_{j}^{N}}, $$

(4)

then,

$$ \log{\frac{D_{i}^{S}}{D_{i}^{N}}} - \log{\frac{D_{j}^{S}}{D_{j}^{N}}} = \log{\frac{\bar{C}_{i}}{\bar{C}_{j}}} + \left(a_{2}-a_{1}\right)* \left(GC_{j} - GC_{i}\right). $$

(5)

Equation 5 reveals that the read count ratio presents a Log linear biased pattern on SCNAs which we will prove it later. Equation 5 also shows that the read count ratio’s GC bias between paired tumor and normal sample exists if the curvature of tumor and its paired normal sample are not the same. We also find this phenomenon in HCC2218 (Additional file 1: Figure S1).

BAF in tumor WGS data presents symmetrical pattern in [0,1] at heterozygous SNP sites

Let μ_i denote the BAF of SCNA segment i of tumor genome on germline heterozygous SNP site, and let C_i, G_i respectively denote the absolute copy number and genotype of SCNA segment i. The B allele (non-reference allele) could be either maternal or paternal allele, thus the BAF of SCNA segments of tumor genome presents symmetrical pattern in [ 0,1] (please see Additional file 1: Supplementary 3.3.2 for detail proof). Let ξ_i denote the BAF of the tumor sample, ϕ_i denote the subclonal population frequency, then,

$$ \xi_{i} = \frac{\phi_{i} * C_{i} * \mu_{i} + \left(1- \phi_{i}\right) *2*\frac{1}{2}}{\bar{C}_{i}}, $$

(6)

$$ \bar{C}_{i} = \phi_{i} * C_{i} + \left(1- \phi_{i}\right) *2. $$

(7)

In Eqs. 6 and 7, ‘2’ and ‘$\frac {1}{2}$’ are the copy number and heterozygous BAF of normal sample. Then, ξ_i is symmetrical in [ 0,1], because μ_i is symmetrical in [ 0,1].

GC bias of read count ratio affects SCNA based subclonal population analysis

By increasing the window size to 5000bp (Fig. 3 c) or even larger at SCNA level (Fig. 3 b), the 2D plot between GC content and tumor-normal coverage ratio clearly clustered into multiple stripes. It is noted that the relationship is pretty linear between GC content and log ratio of tumor-normal coverage on SCNAs (Fig. 3 a) and we show that slopes of linear relation vary across tumors (Additional file 1: Figure S1). We also show that the gaps between the stripes in Fig. 3 a are proportional to the subclonal populations (as shown in the sub-figures in the first column of Fig. 4). The SCNA segments which are clustered into the same stripe, present the symmetrical pattern of B allele frequency (BAF) density on the heterozygous allele loci of paired normal sample (Fig. 3 e), which reveals that these SCNA segments in the same stripe contain the same copy number(see Additional file 1: Supplementary 3.3.2 for detail proof). While using the ratio of read counts of SCNA segments to get the precise subclonal population of each SCNA, it needs to correct the GC bias of the gap first.

Existing read count ratio’s GC bias correction methods are not suitable for SCNA based subclonal population analysis

Existing GC correction methods for WGS data of tumor normal paired sample, such as CNAnorm [14], rectifies the distribution of the ratio of read counts of the small window, aiming at finding the position of SCNA and absolute copy number (Fig. 3 d) by merging the adjoining small window with similar ratio properties. This method uses regression model to rectify the GC content distribution of the ratio and hence removing the dependencies on GC content. However, while using this GC correction method to rectify the bias of read count ratio for SCNA based subclonal population analysis, it additionally requires the regression correctly capture the slope of the gaps between the SCNA stripes. As shown in Fig. 3 a and b, linear or loess regression could be easily biased by outliers, regression lines in Fig. 3 a and b do not parallel the stripes, hence there would still exist GC content bias after removing the dependencies on GC content based on these regression lines (see Fig. 4).

Models of Pre-SCNAClonal for read count ratio’s GC bias correction for SCNA based subclonal population analysis

MCMC model

Pre-SCNAClonal uses a Markov chain Monte Carlo (MCMC) model to pick out the maximum posterior probability of stripe slope m listed in Eq. 8,

$$ \begin{array}{ll} p(m|Y,X) & \sim p(m) * p(Y,X|m)\\ m & \sim \text{Uniform}(a- \delta, a+\delta)\\ p(Y, X|m) & = \Lambda(D, \tau * \max(cn))\\ D & = density(Y')\\ Y' & = Y-(m*X + c) + \text{median}(Y)\\ \end{array}, $$

(8)

here Y, X denotes log(D^S/D^N) and GC content respectively; a, c are slope and intercept pre-determined by two points, coordinates of which are the median of Y and X at high and low GC content areas; δ is the slope range pre-specified; D denotes the density function, Λ(D,τ∗ max(cn)) denotes the sum of top (largest) τ∗ max(cn) peaks of density curve of D; τ denotes the number of subclonal populations, max(cn) denotes the maximum copy number pre-defined. Y^′ represents the corrected Y.

Hierarchy clustering model

Note that, normally, the read counts of tumor segments without SCNA (defined as baseline) are not equivalent to those from paired normal samples due to coverage difference. According to Eqs. 6 and 7, the $\bar {C}_{i}$ and ξ_i of baseline segment always equals to 2 and $\frac {1}{2}$ respectively. If and only if $\mu _{i}=\frac {1}{2}$, $\xi _{i} = \frac {1}{2}$. Then according to Eqs. 5 and 7, the baseline segments locate in the SCNA stripe with $\xi _{i} = \frac {1}{2}$ and the smallest $\log {\frac {D_{i}^{S}}{D_{i}^{S}}}$, because only positive even C_i with equal paternal and maternal copy could make $\mu _{i}=\frac {1}{2}$. Thus, after the GC correction, Pre-SCNAClonal picks out all the segments with $\xi _{i} = \frac {1}{2}$, and imports a hierarchy clustering model to group the segments into several clusters, then Pre-SCNAClonal selects the cluster with smallest $\log {\frac {D_{i}^{S}}{D_{i}^{S}}}$ as baseline segments.

Results

We use WGS data of HCC1954 with different levels of normal contamination (coverage 30x) to test the models of Pre-SCNAClonal.

Result of MCMC model

As shown in Fig. 4, MCMC model of Pre-SCNAClonal could better correct the read count ratio’s GC bias than linear regression method [14]. The GC correction results of linear regression are either over-corrected (HCC1954.mix1.n5t95 and HCC1954.mix1.n20t80) or under-corrected(HCC1954.mix1.n60t40, HCC1954.mix1.n80t20 and HCC1954.mix1.n95t5).

To further test MCMC model, we also develop a package that could simulate data with extreme bias (please refer to Additional file 1: Supplementary 2.1 for detail). Results show that the MCMC model of Pre-SCNAClonal is robust and noise tolerant, outperforms the regression method in CNAnorm [14].

Result of hierarchy clustering model

The baseline selection method in MixClone [8] obtains baseline by removing outliers of read count ratios of the segments that do not lose heterzygosity(LOH). In the WGS data, it is difficult to distinguish LOH from sequencing deviation or error. As shown in Fig. 5 a, segments that do not lose heterzygosity are randomly distributed everywhere. Baseline selection method of MixClone almost picks out all the segments as baseline while the tumor purity is low. In comparison, as shown in Fig. 5 b, baseline obtained by Pre-SCNAClonal is lower and more consistent than the baseline obtained by MixClone.

We calculate the ploidy number based on baseline segments’ $\log {\frac {D_{i}^{S}}{D_{i}^{N}}}$ to validate baseline selection model of Pre-SCNAClonal, and the result shows that the tumor sample HCC1954 is tetraploidy which is the same as results of COSMIC [15] and ABSOLUTE [16](for detail procedure, please see Additional file 1: Supplementary 3.3.1). Furthermore, the BAF distribution on germline heterozygous SNP site also shows the baseline segments obtained by hierarchy clustering models are correct (for detail procedure, please see Additional file 1: Supplementary 3.3.2).

Result of pipeline test

We respectively string Pre-SCNAClonal with two typical SCNAs based subclonal inferring tools, MixClone [8] and THetA [9], to test the bias correction and baseline selection models of Pre-SCNAClonal on HCC1954. As shown in Table 1, Pre-SCNAClonal–MixClone pipeline almost precisely estimated the subclonal frequency for all HCC1954 tumor samples, which outperforms MixClone alone a lot. Pre-SCNAClonal–THetA pipeline also provided better estimation than THetA alone (THetA could not run on sample ‘n80t20’ and ‘n95t5’ for their BIC-seq segments number lower than 1000). This result shows that Pre-SCNAClonal could greatly improve the performance of tumor subclonal population inferring algorithms.

Table 1 Pipeline test of Pre-SCNAClonal on HCC1954

Full size table

Discussion

Generally, SCNAs with larger subclonal frequency could be more precisely located relatively. However, due to the twice sequencing procedures of tumor and its paired normal, the read information of the genomic regions with the same copy number in tumor sample is not exactly the same as its paired normal’s. Moreover, the lower read overage of next generation sequencing (NGS) makes the perturbation more likely to be mistaken for a SCNA. As shown in Fig. 6, the number of SCNA breakpoints obtained by SCNA detection tool is proportional to the subclonal frequency. For the samples with higher subclonal frequency, the “true” SCNA segments could be segmented into multiple segments, which causes the Fig. 3 c and b presents the same stripe pattern.

For NGS based SCNA analysis, the read count ratio stripes could serve as a good proxy for bias correction, even if the break points are not correct, because the read count ratio of the “true” SCNA segment is preserved as the center of read count ratio stripe.

Conclusion

Pre-SCNAClonal proposed in this paper is a robust GC bias correction and baseline selection tool for SCNAs based tumor subclonal inferring. Pre-SCNAClonal could correct the read count ratio’s GC bias and improve the performance of SCNA based subclonal inferring tools at all levels of tumor subclonal frequency even the subclonal frequency is very small. Furthermore, Pre-SCNAClonal also provides an user-friendly interface for visualizing and manually correcting the GC bias of read count ratio.

References

Nowell PC. The clonal evolution of tumor cell populations. Science. 1976; 194(4260):23–8.
Article CAS PubMed Google Scholar
McLendon R, Friedman A, Bigner D, Van Meir EG, Brat DJ, Mastrogianakis GM, Olson JJ, Mikkelsen T, Lehman N, Aldape K, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455(7216):1061–8.
Article CAS Google Scholar
Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using dnet. Methods. 2017; 131:104–10.
Article CAS PubMed Google Scholar
Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics. 2017; 18(16):573.
Article PubMed PubMed Central Google Scholar
Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of hpo terms. Int J Data Mining Bioinformatics. 2017; 17(2):173–88.
Article Google Scholar
Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol. 2018;12(Suppl2). In press.
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010; 463(7283):899–905.
Article CAS PubMed PubMed Central Google Scholar
Li Y, Xie X. Mixclone: a mixture model for inferring tumor subclonal populations. BMC Genomics. 2015; 16(Suppl 2):1.
Article Google Scholar
Oesper L, Mahmoody A, Raphael BJ. Theta: inferring intra-tumor heterogeneity from high-throughput dna sequencing data. Genome Biol. 2013; 14(7):1.
Article Google Scholar
Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010; 7(2):111–8.
Article CAS PubMed Google Scholar
Benjamini Y, Speed TP. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res. 2012; 40:e72.
Article CAS PubMed PubMed Central Google Scholar
Farkash-Amar S, Lipson D, Polten A, Goren A, Helmstetter C, Yakhini Z, Simon I. Global organization of replication time zones of the mouse genome. Genome Res. 2008; 18(10):1562–70.
Article CAS PubMed PubMed Central Google Scholar
Desprat R, Thierry-Mieg D, Lailler N, Lajugie J, Schildkraut C, Thierry-Mieg J, Bouhassira E. Predictable dynamic program of timing of dna replication in human cells. Genome Res. 2009; 19:2288–99.
Article CAS PubMed PubMed Central Google Scholar
Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012; 28(1):40–7.
Article CAS PubMed Google Scholar
Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S, et al. Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2014; 43(D1):805–11.
Article Google Scholar
Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. Absolute quantification of somatic dna alterations in human cancer. Nat Biotechnol. 2012; 30(5):413–21.
Article CAS PubMed PubMed Central Google Scholar
Locke MEO, Milojevic M, Eitutis ST, Patel N, Wishart AE, Daley M, Hill KA. Genomic copy number variation in mus musculus. BMC Genomics. 2015; 16(1):497.
Article PubMed PubMed Central Google Scholar
Xi R, Luquette J, Hadjipanayis A, Kim TM, Park PJ. Bic-seq: a fast algorithm for detection of copy number alterations based on high-throughput sequencing data. Genome Biol. 2010; 11(1):1.
Article Google Scholar

Download references

Acknowledgments

We are grateful to members of bioinformatics lab of HIT who have tested Pre-SCNAClonal and the financial support of the funding agencies.

Funding

The publication costs of this article was funded by the Major State Research Development Program of China [no. 2016YFC1202302], the National Natural Science Foundation of China (no. 61571152) and the National High-Tech R&D Program of China (863 Program) [nos. 2015AA020101, and 2015AA020108].

Availability of data and materials

The tool we provided in this paper, Pre-SCNAClonal, is publicly available as a Python package: https://github.com/dustincys/Pre-SCNAClonal.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-5.

Author information

Authors and Affiliations

Center for Bioinformatics, Harbin Institute of Technology, Harbin, China
Yanshuo Chu, Mingxiang Teng & Yadong Wang

Authors

Yanshuo Chu
View author publications
You can also search for this author in PubMed Google Scholar
Mingxiang Teng
View author publications
You can also search for this author in PubMed Google Scholar
Yadong Wang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yadong Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1

Modeling and Correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring. (PDF 2570 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Chu, Y., Teng, M. & Wang, Y. Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring. BMC Bioinformatics 19 (Suppl 5), 112 (2018). https://doi.org/10.1186/s12859-018-2099-0

Download citation

Published: 11 April 2018
DOI: https://doi.org/10.1186/s12859-018-2099-0

Selected articles from the Biological Ontologies and Knowledge bases workshop 2017

Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Abstract

Background

Results

Conclusions

Background

Data

Methods

GC bias of the tumor WGS data does not have the same feature as its paired normal

Modelling the difference of GC bias between paired tumor and normal sample

BAF in tumor WGS data presents symmetrical pattern in [0,1] at heterozygous SNP sites

GC bias of read count ratio affects SCNA based subclonal population analysis

Existing read count ratio’s GC bias correction methods are not suitable for SCNA based subclonal population analysis

Models of Pre-SCNAClonal for read count ratio’s GC bias correction for SCNA based subclonal population analysis

MCMC model

Hierarchy clustering model

Results

Result of MCMC model

Result of hierarchy clustering model

Result of pipeline test

Discussion

Conclusion

References

Acknowledgments

Funding

Availability of data and materials

About this supplement

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Additional file

Additional file 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us