Segmentum: a tool for copy number analysis of cancer genomes
© The Author(s). 2017
Received: 27 May 2016
Accepted: 6 April 2017
Published: 13 April 2017
Somatic alterations, including loss of heterozygosity, can affect the expression of oncogenes and tumor suppressor genes. Whole genome sequencing enables detailed characterization of such aberrations. However, due to the limitations of current high throughput sequencing technologies, this task remains challenging. Hence, accurate and reliable detection of such events is crucial for the identification of cancer-related alterations.
We introduce a new tool called Segmentum for determining somatic copy numbers using whole genome sequencing from paired tumor/normal samples. In our approach, read depth and B-allele fraction signals are smoothed, and double sliding windows are used to detect breakpoints, which makes our approach fast and straightforward. Because the breakpoint detection is performed simultaneously at different scales, it allows accurate detection as suggested by the evaluation results from simulated and real data. We applied Segmentum to paired tumor/normal whole genome sequencing samples from 38 patients with low-grade glioma from the TCGA dataset and were able to confirm the recurrence of copy-neutral loss of heterozygosity in chromosome 17p in low-grade astrocytoma characterized by IDH1/2 mutation and lack of 1p/19q co-deletion, which was previously reported using SNP array data.
Segmentum is an accurate, user-friendly tool for somatic copy number analysis of tumor samples. We demonstrate that this tool is suitable for the analysis of large cohorts, such as the TCGA dataset.
KeywordsSomatic copy number analysis Loss of heterozygosity Segmentation Whole-genome sequencing Cancer
Somatic copy number alterations (SCNA) are a group of genomic aberrations commonly observed in many cancers . Copy number is the number of copies per cell of a particular gene or DNA sequence. Somatically acquired chromosomal rearrangements such as deletions and duplications may change the copy number of a gene. Consequently, the expression level of a gene is often correlated with its copy number  - a phenomenon known as the gene dosage effect. Loss of heterozygosity (LOH) is an event in which one of the two alleles at a heterozygous locus is lost due to segmental aneuploidy, gene conversion, mitotic recombination, or mitotic nondisjunction . LOH events involving tumor suppressor genes such as PTEN, RB1, and TP53 have been observed in many cancer. LOH may alter gene expression. For example, monoallelic expression (MAE), which is the expression of a gene from only one of two alleles in a diploid organism, is associated with LOH . By analyzing a cohort of 23 triple-negative breast cancer patients, Ha et al.  have shown that LOH is a prominent aberration in this type of cancer, and modulates a significant portion of the transcriptome in the form of MAE. Copy-neutral LOH (cnLOH) is a specific type of LOH that occurs when the lost allele is replaced with a duplicated copy of the surviving allele, resulting in the copy number remaining unchanged. Suzuki et al. have shown recurring cnLOH at chromosome 17p (harboring TP53 gene) in low-grade astrocytoma . The altered expression of genes with allelic imbalance due to LOH events may bring about selective advantages for tumorigenesis and tumor progression. Additionally, regions with cnLOH may harbor genes with driver mutations . Hence, accurate and reliable detection and characterization of events, such as SCNAs and LOH, are crucial for the identification of prospective cancer-related genes, such as tumor suppressor genes and oncogenes, and eventually for informing new approaches to treat cancer .
High throughput sequencing (HTS)-based SCNA detection approaches (including both whole exome sequencing (WES) and whole genome sequencing (WGS)) have become popular due to their potential for accurate copy number estimation and breakpoint detection with single nucleotide accuracy. However, the short read length of current HTS technologies makes it difficult to map some reads to unique locations in the genome. Furthermore, due to GC-content bias, GC-content-rich regions in the genome will have increased number of reads. These ambiguities make accurate estimation of coverage and consequently copy number a challenge . Additionally, tumor ploidy and normal cell contamination introduce further challenges in SCNA detection .
HTS-based copy number analysis is, in most cases, based on read depth (RD) estimations at each genomic location and further segmentation and quantification of the RD profiles into segments of consistent copy number (Additional file 1: Table S1 for a list of SCNA tools) [9, 10]. However, such tools are only capable of detecting deletions and duplications. Recently, RD-based analysis has been augmented to identify cnLOH events by incorporating information from an alternate allele’s fraction at heterozygous single nucleotide polymorphism (SNP) positions (or B-allele fraction (BAF)). The BAF of a heterozygous SNP has an expected value of 0.5 in normal diploid cells. Deviation from 0.5 in the heterozygous SNP BAF points to an aberration. In the case of cnLOH, BAF values are expected to be either 0 or 1 in a pure tumor population. Tools such as Control-FREEC , Patchwork , and CLImAT  incorporate BAF data to extend SCNA detection. Control-FREEC determines the breakpoints using a least absolute shrinkage estimator (LASSO) regression. Sample ploidy is provided by the user to Control-FREEC. It also evaluates and corrects for normal cell contamination, GC-content, and mapability biases while inferring the copy number profile of a tumor genome. Patchwork performs GC and positional normalization and segments the genome using a circular binary segmentation (CBS) algorithm. It also estimates normal cell contamination and tumor ploidy. CLImAT implements corrections for GC-content and mapability bias and models the RD and BAF data with a hidden Markov model (HMM) to infer the somatic copy number variation, normal cell contamination and tumor ploidy (Additional file 1: Overview of Tools section for more details on these tools). While the above tools are well-suited for SCNA detection, their use has some limitations. Control-FREEC and Patchwork utilize computationally costly models, which leads to long analysis times. The main motivation of our study was to develop an accurate and user-friendly tool that could be used to analyze large WGS datasets, such as the cancer genome atlas (TCGA) datasets. In our approach, the RD and BAF signals are smoothed, and double sliding windows subsequently are used to detect breakpoints, which makes our approach fast and straightforward. Because the breakpoint detection is performed simultaneously at different scales, it allows accurate detection. Our tool, Segmentum, is freely available under MIT license at: https://github.com/eafyounian/Segmentum (Additional file 2 contains the software code. For the lates version of the software code please visit the project’s online repository).
RD extraction and BAF calculation for heterozygous SNPs
It should be noted that by default reads with mapping quality score 10 are filtered out before RD extraction and BAF calculation in order to address the challenges raised by reads not mapping to a unique region in the genome (the read filtration criterion based on the mapping quality score is a parameter to Segmentum and can be set by the user).
where logr i is the log-ratio of the i th genomic window and tRD i and nRD i are RDs extracted from the i th genomic window of a specific size (determined by user; default is 2 kbp) for the tumor and normal samples, respectively.
Differences in the total number of aligned reads in the normal and tumor samples may bias the estimation of the RD log-ratios. The correction was performed by finding the mode of log-ratio values for each chromosome and subtracting the median of all of the modes from each log-ratio value. It should be noted that median, in the correction step, is robust to the changes in one mode. For instance, one chromosomal arm having a copy number change has no effect on the correction since it only affects one of the chromosomal modes.
Mirroring and smoothing of the BAF values
Segmentation using a double sliding window approach
Detection of cnLOH events within a single sample
Detection of recurrent cnLOH regions across multiple samples
To find genomic regions with recurrent cnLOH events, all cnLOH regions for individual samples are identified following the procedure described earlier. Then, the number of occurrences of a cnLOH event for a specific region across multiple samples is counted using an interval tree data structure (Additional file 1: Figure S1).
To evaluate Segmentum in terms of segmentation accuracy, a simulator capable of simulating whole-genome RD for both normal and tumor samples and BAF based on events such as deletions, amplifications and cnLOH was developed. The simulator receives a normal sample RD data and outputs 4 sets of data including the simulated normal and tumor RD, BAF data and a ground truth. First, the simulator learns the distribution of the RD data from the provided normal sample by simply counting the number of times two consecutive RD values (e.g., 368 and 299) occur together throughout the genome (Additional file 1: Figure S2). The learned distribution also accounts for the inherent noise in the RD data. Next, inverse transform sampling (Smirnov transform) is used to generate RD values for each position in the genome based on the learned distribution. Then, noise is removed using a median filter. A normal RD is constructed by adding independent Poisson noise to the simulated RD data. To construct the tumor RD, two copy number tracks (because autosomal chromosomes come in maternal and paternal pairs) harboring random SCNAs are constructed. The tumor sample RD is calculated using the copy number tracks, the simulated normal sample RD and the normal sample contamination (i.e., a parameter determined by user). To construct the BAF data, heterozygous SNPs are initially randomly distributed across the genome (1 heterozygous SNP per 1.5 Kbp). The number of B-alleles at a heterozygous SNP is calculated using a binomial distribution with the parameters n (total number of reads at heterozygous SNP position) and p (probability that a read is coming from the B-allele). n is extracted from the simulated normal RD at heterozygous SNP positions. p is calculated using the two constructed copy number tracks and the normal sample contamination. Once the number of B-alleles is calculated, it is used to calculate the BAF values (Additional file 1: Figures S3 and S4 for the simulator pipeline and the simulated data visualized in the integrative genomics viewer (IGV) , respectively).
Segmentum segmentation accuracy for the simulated data
Segmentum segmentation accuracy for real data compared to other tools
Similar evaluations using low coverage data (6x average coverage) are shown in Additional file 1: Figures S5 and S6. The low coverage data is comprised of the paired tumor/normal whole genome sequencing samples of 10 individuals diagnosed with prostate adenocarcinoma (PRAD). With regard to the low coverage data, Patchwork produces the most similar results to the SNP array segmentation results with a JSI score of 0.93, followed by Segmentum with a JSI score of 0.88. Additional file 1: Table S4 contains the names of the 10 TCGA PRAD samples (Additional file 1: Tables S6-S10 represent the parameter values used for running the competing tools. Additional file 1: Segmentum’s parameter value selection section provides guidance on selecting parameter values for Segmentum. Additional file 1: Figure S9 represents an example plot made by Segmentum’s 'plot' sub-command that can be used to guide the parameter value selection).
Segmentum segmentation accuracy for the subsampled real data
It should be noted that as the coverage decreases the number of identified heterozygous SNPs decreases (Additional file 1: Figure S8). For instance, for the 10%-fraction subsample only 1997 heterozygous SNPs were identified from the entire genome (in contrast to the original sample where the number of identified heterozygous SNPs was more than 3 million SNPs). Even though Segmentum is shown to work with low coverage data, one should note the implications of low amounts of detected heterozygous SNPs on the reliable detection of cnLOH events.
Time usage evaluation
Average tool analysis time for high coverage data (30x < coverage < 100x)
Average preparation time
Average analysis time
- 10 h 34 min for extracting RD from normal or tumor BAM file
- 4 h 25 min for calculating BAF values
- 1 min 45 s
- 29 h 37 min for creating pileups from normal or tumor BAM file
- 3 h 56 min
- 33 h 28 min for creating pileups from normal or tumor BAM file
- 7 h 11 min
- 2 h 12 min for extracting RD
- 29 min
Recurrent cnLOH detection case study
By comparing the simulated (Fig. 2) and real data (Figs. 3, 4 and 5 and Additional file 1: Figures S5 and S6), we can conclude that Segmentum can recover true copy number aberrations with high accuracy even when the coverage is as low as ~4 reads (Fig. 5, Additional file 1: Figure S7). On average, Segmentum produces results that are the most concordant with the copy number aberrations identified from the SNP array data (i.e. ~90% of concordance) (Fig. 4). As shown in Table 1, our tool is more than twice as fast as the second best performing tool in terms of accuracy. Segmentum is also the second fastest tool after CLImAT compared to the other tools evaluated in this study (Table 1). However, CLImAT ranks last in terms of accuracy (Fig. 4). One explanation for the speed of CLImAT is that it computes the BAF values for a subset of known SNPs (~13.7 million SNPs that are retrieved from the dbSNP database ). In contrast, Segmentum, computes the BAF values for heterozygous SNPs determined from the 1000 Genomes project’s SNP list (~85 million SNPs) . The other reason for the speed of CLImAT might be that it does not require a normal sample for analysis.
As the normal contamination in the simulated data increases, the number of false negatives increases and the recall rate decreases (Fig. 2). However, within the ranges of realistic amounts of normal contamination (i.e. ~30% to 40%), Segmentum performs consistently well.
Segmentum is able to report recurrent cnLOH regions across multiple cancer genome samples; a characteristic of cancer genomes that has been neglected until recently . By applying Segmentum to TCGA data, we were able to recover recurrent cnLOH events from low-grade glioma samples that were reported earlier by SNP array-based data analysis. It is worth mentioning that Segmentum can work in two modes, i.e., with or without BAF value. In the case where BAF values are not used, Segmentum cannot detect regions with cnLOH. Furthermore, Segmentum is capable of reliably segmenting the cancer genome using both high (Figs. 3 and 4) and low (Fig. 5 and Additional file 1: Figures S5 and S6) sequence coverage data. However, with the low sequence coverage data, the estimated BAF values for the heterozygous SNPs will be less reliable. This is also reflected in Additional file 1: Figure S8, where it is shown that the number of detected heterozygous SNPs drop as the average coverage decreases. The implications of low amounts of detected heterozygous SNPs on the reliable detection of cnLOH events should not be overlooked.
Even though we have shown that Segmentum is highly accurate at recovering the true copy number, other tools in this study do more than just segmenting the genome. For instance, CLImAT and Patchwork are capable of estimating tumor ploidy and tumor purity and consequently, reporting the integral copy numbers for each segment. Patchwork and Control-FREEC are also capable of reporting the genotype of each segment and CLImAT reports the genotype for each SNP within each segment. This is in contrast to Segmentum that only reports the mean RD log-ratio and BAF value of each segment. However, tools such as ABSOLUTE  or THetA  can be used to estimate tumor impurity and ploidy from Segmentum’s segmentation result, meaning that Segmentum can be used as part of a larger tumor evolution analysis pipeline. Finally, a strength of our tool is its minimum dependence on third party tools, with the exception of SAMtools, for calculating the RD and BAF.
We have developed Segmentum as a tool for the identification of SCNAs, including cnLOH in tumor samples, using WGS data. We have shown that Segmentum is accurate and fast with regards to other state-of-the-art tools, making it suitable for analyzing cohorts with a large number of samples, such as TCGA cohorts.
Availability and requirements
Project name: Segmentum
Project homepage: https://github.com/eafyounian/Segmentum
Operating system(s): Linux
Programming language: Python
Other requirements: SciPy, Samtools, and matplotlib if the ‘plot’ sub-command is used.
License: MIT license
Any restrictions to use by non-academics: None
Binary alignment map
circular binary segmentation
Array comparative genomic hybridization
Copy-neutral loss of heterozygosity
Copy number variation
Single nucleotide polymorphism database
Fluorescent in situ hybridization
Hidden Markov model
High throughput sequencing
Integrative genomics viewer
Jaccard similarity index
Least absolute shrinkage eStimatOr
Low grade glioma
Loss of heterozygosity
Somatic copy number alteration
Single nucleotide polymorphism
The cancer genome atlas
Whole exome sequencing
Whole genome sequencing
The results published here are in part based upon data generated by The Cancer Genome Atlas project (dbGaP Study Accession: phs000178.v9.p8) established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov. The authors would also like to acknowledge CSC — IT Center for Science Ltd. (https://www.csc.fi/csc) for providing the computational resources.
This work was funded by the Academy of Finland (project no. 269474), Cancer Society of Finland, and Sigrid Juselius Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
EA, MA implemented the method. EA performed the data analysis and drafted the manuscript. MA initiated the study and designed the method. MN supervised the study. All authors edited and approved the manuscript.
EA is a Ph.D. student at Faculty of Medicine and Life Sciences and BioMediTech institute, University of Tampere, Tampere, Finland.
MA is a Ph.D. student at Faculty of Medicine and Life Sciences and BioMediTech institute, University of Tampere, Tampere, Finland.
MN is a professor of bioinformatics at Faculty of Medicine and Life Sciences and BioMediTech institute, University of Tampere, Tampere, Finland.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, Mc Henry KT, Pinchback RM, Ligon AH, Cho YJ, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, Maher E, Kaye FJ, Sasaki H, Tepper JE, Fletcher JA, Tabernero J, Baselga J, Tsao MS, Demichelis F, Rubin MA, Janne PA, Daly MJ, Nucera C, Levine RL, Ebert BL, Gabriel S, Rustgi AK, Antonescu CR, Ladanyi M, Letai A, Garraway LA, Loda M, Beer DG, True LD, Okamoto A, Pomeroy SL, Singer S, Golub TR, Lander ES, Getz G, Sellers WR, Meyerson M. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463(7283):899–905.View ArticlePubMedPubMed CentralGoogle Scholar
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Graf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, METABRIC Group, Langerod A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, Borresen-Dale AL, Brenton JD, Tavare S, Caldas C, Aparicio S. The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–52.PubMedPubMed CentralGoogle Scholar
- Ha G, Roth A, Lai D, Bashashati A, Ding J, Goya R, Giuliany R, Rosner J, Oloumi A, Shumansky K, Chin SF, Turashvili G, Hirst M, Caldas C, Marra MA, Aparicio S, Shah SP. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Res. 2012;22(10):1995–2007.View ArticlePubMedPubMed CentralGoogle Scholar
- Suzuki H, Aoki K, Chiba K, Sato Y, Shiozawa Y, Shiraishi Y, Shimamura T, Niida A, Motomura K, Ohka F, Yamamoto T, Tanahashi K, Ranjit M, Wakabayashi T, Yoshizato T, Kataoka K, Yoshida K, Nagata Y, Sato-Otsubo A, Tanaka H, Sanada M, Kondo Y, Nakamura H, Mizoguchi M, Abe T, Muragaki Y, Watanabe R, Ito I, Miyano S, Natsume A, Ogawa S. Mutational landscape and clonal architecture in grade II and III gliomas. Nat Genet. 2015;47(5):458–68.View ArticlePubMedGoogle Scholar
- Barresi V, Romano A, Musso N, Capizzi C, Consoli C, Martelli MP, Palumbo G, Di Raimondo F, Condorelli DF. Broad copy neutral-loss of heterozygosity regions and rare recurring copy number abnormalities in normal karyotype-acute myeloid leukemia genomes. Genes Chromosomes Cancer. 2010;49(11):1014–23.View ArticlePubMedGoogle Scholar
- Stuart D, Sellers WR. Linking somatic genetic alterations in cancer to therapeutics. Curr Opin Cell Biol. 2009;21(2):304–10.View ArticlePubMedGoogle Scholar
- Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013, 14(Suppl 11);S1-2105-14-S11-S1. Epub 2013 Sep 13.
- Liu B, Morrison CD, Johnson CS, Trump DL, Qin M, Conroy JC, Wang J, Liu S. Computational methods for detecting copy number variations in cancer genome using next generation sequencing: principles and challenges. Oncotarget. 2013;4(11):1868–81.View ArticlePubMedPubMed CentralGoogle Scholar
- Alkodsi A, Louhimo R, Hautaniemi S. Comparative analysis of methods for identifying somatic copy number alterations from deep sequencing data. Brief Bioinform. 2015;16(2):242–54.View ArticlePubMedGoogle Scholar
- Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–76.View ArticlePubMedPubMed CentralGoogle Scholar
- Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Barillot E. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012;28(3):423–5.View ArticlePubMedGoogle Scholar
- Mayrhofer M, DiLorenzo S, Isaksson A. Patchwork: allele-specific copy number analysis of whole-genome sequenced tumor tissue. Genome Biol. 2013;14(3):R24. -2013-14-3-r24.View ArticlePubMedPubMed CentralGoogle Scholar
- Yu Z, Liu Y, Shen Y, Wang M, Li A. CLImAT: accurate detection of copy number alteration and loss of heterozygosity in impure and aneuploid tumor samples using whole-genome sequencing data. Bioinformatics. 2014;30(18):1-8.
- Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Goransson H, Juliusson G, Rosenquist R, Hoglund M, Borg A, Ringner M. Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol. 2008;9(9):R136. -2008-9-9-r136 . Epub 2008 Sep 16.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–11.View ArticlePubMedPubMed CentralGoogle Scholar
- 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.View ArticleGoogle Scholar
- Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, Beroukhim R, Pellman D, Levine DA, Lander ES, Meyerson M, Getz G. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Oesper L, Mahmoody A, Raphael BJ. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 2013;14(7):R80. -2013-14-7-r80.View ArticlePubMedPubMed CentralGoogle Scholar