Skip to main content

GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing



The gain or loss of large chromosomal regions or even whole chromosomes is termed as genomic scarring and can be observed as copy number variations resulting from the failure of DNA damage repair.


In this study, a new algorithm called genomic scar analysis (GSA) has developed and validated to calculate homologous recombination deficiency (HRD) score. The two critical submodules were tree recursion (TR) segmentation and filtering, and the estimation and correction of the tumor purity and ploidy. Then, this study evaluated the rationality of segmentation and genotype identification by the GSA algorithm and compared with other two algorithms, PureCN and ASCAT, found that the segmentation result of GSA algorithm was more logical. In addition, the results indicated that the GSA algorithm had an excellent predictive effect on tumor purity and ploidy, if the tumor purity was more than 20%. Furtherly, this study evaluated the HRD scores and BRCA1/2 deficiency status of 195 clinical samples, and the results indicated that the accuracy was 0.98 (comparing with Affymetrix OncoScan™ assay) and the sensitivity was 95.2% (comparing with BRCA1/2 deficiency status), both were well-behaved. Finally, HRD scores and 16 genes mutations (TP53 and 15 HRR pathway genes) were analyzed in 17 cell lines, the results showed that there was higher frequency in HRR pathway genes in high HRD score samples.


This new algorithm, named as GSA, could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Peer Review reports


Homologous recombination deficiency (HRD) is a functional defect in homologous recombination repair (HRR) pathway, which is responsible for repairing DNA double-stranded breaks (DSBs), and HRD is a potent tumorigenic type of DNA lesion. Tumors with HRD status are arising from germline and/or somatic mutations in BRCA1/2 or other HRR pathway genes, promoter hypermethylation of BRCA1 and/or RAD51C, and other mechanisms [1,2,3]. In particular, it has proved that BRCA1 promoter hypermethylation is also a common epigenetic event in breast and ovarian cancer, ranged from 11 to 57% in different studies [4,5,6]. Over the last years, several studies about HRD in pan-cancer have shown that HRD occurred in many cancers with various frequency, and it was most prevalent in ovarian, breast, prostate and pancreatic cancer [7,8,9].

Some drugs based on DSB or single-strand break (SSB) repair have been developed and used successfully in clinical trials. Platinum salts are currently one of the most important chemotherapeutic drugs and have a broad anticancer spectrum, which introduces DNA DSBs and interstrand crosslinks. Therefore, HRD cells are supposed to be sensitive to platinum salts. The PARP family of proteins, especially PARP1, are essential for SSBs repair by the BER pathway. PARP inhibitors (PARPi) are based on inhibiting the SSBs repairing and then the accumulation of SSBs would lead to the development of fatal DSBs. Thus, PARPi are more sensitive to HRD-positive tumors through synthetic lethal interaction [10, 11]. Several studies have shown that HRD is a potential biomarker for platinum salts and PARPi in many cancers, especially in ovarian and breast cancer, and PARPi have attracted widespread attention in the targeted therapies of multiple cancers due to better efficacy and fewer side effects [12,13,14]. In the past years, FDA has approved niraparib and olaparib combined with bevacizumab, for specific patients with HRD-positive status, according to two critical clinical trials, QUADRA and PAOLA-1 [15, 16].

Currently, HRD testing is mainly carried out by two methods. Firstly, BRCA1/2 or other HRR pathway gene mutation detection and analysis by a custom-designed panel, but the panel gene list is various and variants of unknown significance and secondary mutations lack a standard database [17]. Secondly, identifying the results of HRD by detecting genomic damage patterns, known as genomic scars, including three biomarkers, loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), and large-scale state transition (LST) [15, 16, 18]. Besides, some other methods are also developing, such as BRCA1/RAD51C epigenetic analysis and mutational signature analysis, [9], but these are still difficult for clinical application. Two commercial genomic scar detection kits have been approved by FDA, myChoice HRD CDx (Myriad Genetics, Co., Ltd.) and Foundation Focus CDx (Foundation Medicine, Co., Ltd.) [2].

Genomic scar analysis (GSA) by calculating LOH, LST and TAI scores is a dominant way to evaluate the HRD status of tumors, and the theoretical foundation of this method is that tumors with HRD phenotype would lead to large-scale copy number variation (CNV) [15, 16, 18, 19]. There are four software commonly used to detecting CNV of tumor samples, including PennCNV, ASCAT, ABSOLUTE and PureCN. PennCNV is a free software tool based on a hidden Markov model (HMM), for kilobase-resolution detection of CNVs from Illumina high-density SNP genotyping data [20]. ASCAT and ABSOLUTE were introduced to estimate tumor purity directly from SNP array data [21, 22], and both can effectively solve the impact of purity and ploidy based on SNP array data. NGS data is much higher resolution data than SNP arrays, and provides the opportunity to derive highly accurate estimates of both tumor purity and ploidy [23]. PureCN, is optimized for targeted short read sequencing data, integrates well with standard somatic variant detection pipelines, and has support for matched and/or unmatched tumor samples [24]. Currently, PureCN is a mainstream method for purity correction based on circular binary segmentation (CBS) from NGS data, but it is necessary to artificially judge the best combination from multiple combinations of purity and ploidy.

This study aims to developing and validating an algorithm, named as GSA, which can effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.


Tumor samples and cell lines

Archival FFPE tumor tissues were obtained from 195 ovarian and breast cancer patients who had signed the informed consents, and the study was approved by the Institutional Review Board of BGI Co., Ltd. 17 human cancer cells (HCC38, HCC1428, HCC1143, HCC1806, MX-1, HCC70, ZR-75-1, MDA-MB-453, MDA-MB-231, MDA-MB-361, MDA-MB-415, ZR-75-30, HS-578T, IGR-OV1, A2780, NCI/ADR-RES and OVCAR-4) and 3 matched-wildtype cell lines (HCC38BL, HCC1143BL and HCC1428BL) were purchased from the CoBioer Biosciences Co., Ltd. All cell lines have been verified by STR and were provided in the form of DNA status.

Custom design HRD panel

Variant detection requires comprehensive consideration of the detection frequency and coverage of SNP sites, and relies on statistical analysis based on the relationship between adjacent points to determine the position of fragment breakpoints and eliminate test deviations. Therefore, the principles of probe design mainly include: (1) The target region should contain high-frequency SNP sites of the population; (2) Ensure the capture region has a certain density in the whole genome; (3) Ensure that the regions are as even as possible; (4)Ensure that the target probe is synthesizable (some region probes cannot be synthesized by the supplier due to complex structure or multiple alignments lead to poor specificity); (5)Ensure that the target regions have good capture efficiency and coverage, and no obvious regional preference. Based on the above principles, a total of 93,200 high-frequency SNP loci (frequency ≥ 5%) from the 1000 Genome Database were screened out, which are evenly distributed on each chromosome (except Y chromosomes and mitochondria).

Library preparation, hybridization capture and sequencing

DNA from FFPE tissues was extracted by QIAAMP DNA FFPE TISSUE KIT (Qiagen, Hilden, Germany) according to the manufacturer’s standard protocol. Briefly, 400 ng genomic DNA is fragmented and end-repaired, and a linker with a tag sequence is added to both ends of the DNA by ligase, followed by PCR amplification to form a pre-PCR library. The target DNA fragment in the library is hybridized with a combined probe containing 93,200 SNP sites and additional 2228 capture beds targeting the complete coding region of ATM, BRCA1, BRCA2, BRIP1, BARD1, CDK12, CHEK1, CHEK2, FANCL, PALB2, PPP2R2A, RAD51B, RAD51C, RAD51D, RAD54L and TP53, After purification, the enriched DNA is specifically captured and amplified by PCR to obtain a post-PCR library. The post-PCR library undergoes single-strand separation, circularization and rolling circle replication to generate DNA nano balls (DNB) and sequencing was performed with 2 × 101 bp paired-end reads on MGISEQ-2000 platform (MGI Tech Co., Ltd.).

Raw data quality control

Sequencing data needs to pass the basic standards of quality checks. Raw data quality control includes quality metrics for per-base sequence quality, sequence content, GC content and sequence length distribution, relative percentages of unmatched indices. Usually, the quality control parameters are set as Q30 ≥ 90%, and 40–50% GC content.

Data pre-processing

Raw paired-end reads were subjected to SOAPnuke (v2.0) processing to remove sequencing adapters and low-quality reads. High-quality reads were aligned to the reference human genome (GRCh37.p13), using the BWA sequence alignment software (0.7.17-r1188). PCR deduplication was performed using Picard. Average sequencing depths for tumors samples were ≥ 150× . For each sample, SNVs were called from BAM files using an in-house software, termed as Somatk. B allele frequency (BAF) and Log R ratio (LRR) were obtained from each capture region. BAF represents the median SNP genotype frequency of each capture region, and LRR represents the normalized depth ratio of the tumor and the normal sample (or blood cell control set) in each capture region after GC-bias correction.

TR segmentation and filtering algorithm

The Tree recursion (TR) Segmentation and Filtering Algorithm was developed by C++ . The input data format of the algorithm is (i) BAF data and (ii) LRR data. To reduce the noise in the input data, both BAF and LRR are preprocessed by a specially designed segmentation and filtering algorithm. First, if BAF ≥ 0.95 or BAF ≤ 0.05, defined as homozygous, the data would be removed from the BAF track because of its uselessness. Then the remaining BAF value is mirrored and flipped upward with 0.5 as the center, thus BAF =|BAF − 0.5|+ 0.5. For LRR, the bin LRR values are also first optionally filtered for outliers, defined as the total probability density is below the 30% quantile in all bins.

Next, the in-house TR segmentation algorithm, based on the calculation of the run-length, was used to roughly segment each chromosome, as shown in Additional file 1: Fig. S1. In this algorithm, the whole chromosome is taken as the root node, all the segmented sub-nodes are taken as the child nodes. The segmentation process can be simply described as the following steps:

  1. (a)

    Calculate the cumulative run-length of data (here refers to BAF and LRR) deviated from the mean (\(\sum \left( {{\text{x}} - {\overline{\text{x}}}} \right)\)) and select its maximum and minimum points as candidate breakpoints.

  2. (b)

    Make appropriate trade-offs of candidate breakpoints according to the location of breakpoints, length of segments, number of data points in segments, etc., that is, determine whether breakpoints should be recorded.

  3. (c)

    If none of the subfragment of the current fragment satisfies the record condition, a recursive judgment is initiated. Otherwise, it recursively slices its last child node.

  4. (d)

    After the termination condition is reached, recursion is carried out on horizontal child nodes.

  5. (e)

    If all child objects have been processed, the parent's level object will continue to be processed until it is finally traced back to the root node and no new child objects are created.

Then the fragments are merged in a cyclic manner. Firstly, for each segmented fragment of chromosome was traversed by the kernel density estimation, to find out the two fragments, which are closest to the same distribution and combine them. Secondly, the statistics of the newly merged fragment and its adjacent fragments are recalculated until all indicators meet the requirements. Besides, segmentation of BAF and LRR is carried out separately, and then the union set of the merged BAF and LRR segmentation list is taken, but the regions with too short or insufficient data points are iteratively removed.

Purity and ploidy estimation

BAF and LRR are expressed by a given genomic location as functions of the allele-specific copy numbers nA and nB, where nA denotes the number of copies of the A allele and nB denotes the number of copies of the B allele. Assuming tumor cell purity (p) was 1, BAF and LRR are calculated by:

$$\begin{aligned} BAF & = \frac{{n_{B} }}{{n_{A} + n_{B} }} = \frac{{n_{B} }}{CN} \\ LRR & = \log_{2} \left( {\frac{{n_{A} + n_{B} }}{2}} \right) \\ \end{aligned}$$

Considering the influence of nonaberrant cells in real world tumor samples and assuming that the nonaberrant cells have a total copy number of 2 for all loci, tumor ploidy correction factor (scale_factor), tumor purity (p), the measured CN (CN*) and the measured BAF (BAF*) of the FFPE samples satisfy the following relationship (Additional file 1: Table S1, Table S2).

$$n_{B} = CN \cdot BAF$$
$$CN^{*} \cdot scale\_factor = CN \cdot p + 2\left( {1 - p} \right)$$
$$BAF^{*} = \frac{{1 - p + pn_{B} }}{{CN^{*} \cdot scale\_factor}}$$

Based on Eqs. (1), (2) and (3), Tumor purity can be expressed as below.

$$purity = \frac{{1 - 2 \cdot BAF^{*} }}{{BAF^{*} \cdot CN - 2 \cdot BAF^{*} + 1 - BAF*CN}}$$

Based on the segmented chromosome fragments, the mean of all BAF value in the fragments, the percentage ranking of the BAF mean in all fragments, the theoretical CN (CN = 2LRR*2) and the percentage ranking of the CN in all fragments are calculated. Subsequently, using the density-based scan (DBSCAN) algorithm to perform density clustering on the BAF mean-CN percentage ranking data, the chromosome fragments of the same genotype are clustered into a cluster.

For fragments of the same genotype, the calculated purity value can be approximately regarded as conforming to a normal distribution with the theoretical mean value of purity, that is Eqs. (5).

$$\widehat{purity} = \frac{{1 - 2 \cdot \overline{{BAF^{*} }} }}{{\overline{{BAF^{*} }} \cdot CN - 2 \cdot \overline{{BAF^{*} }} + 1 - BAF*CN}}$$

Therefore, the tumor purity is calculated by clustering the chromosome fragments of the same genotype, and bringing the measured mean value of BAF*, theoretical BAF and CN values of the specific genotype cluster into the Eqs. (5).

In addition, the ploidy value of the entire genome of the sample is the weighted average of the copy number of each segment of the chromosome.

$$Ploidy = \frac{{\mathop \sum \nolimits_{i = 1}^{n} Segs_{{i^{*} }} CN_{i} + \left( {1 - \Sigma_{i = 1}^{n} Segs_{i} } \right)*2}}{n}$$

Segsi is the proportion of each segment on the reference genome, and CNi is the calculated copy number of the segment.

Calculation of LOH, TAI, and LST scores

HRD-LOH score was defined as the number of LOH regions longer than 15 Mb exceeding LOH regions which do not cover the whole chromosome. HRD-TAI score was defined as the number of regions with allelic imbalance that (a) extend to one of the subtelomeres, (b) do not cross the centromere and (c) are longer than 11 Mb. HRD-LST score is the number of break points between regions longer than 10 Mb after filtering out regions shorter than 3 Mb.

Aneuploidy is a common event in cancer patients, so more copy number variations will be detected by the high-throughput sequencing data. However, these copy number abnormalities may not be caused by the failure of homologous recombination repair, it will make the final HRD score calculation biased. Calculating accurately HRD scores depend on BAF and copy number, but the aneuploidy properties and various purity of tumor samples will affect the actual value of BAF and copy number. Thus, it is necessary to make appropriate correction, and the calculation formula of HRD score is preliminarily determined as follows:

$$HRD_{score} = LOH + TAI + LST - K*Ploidy$$

Here, K is the coefficient of correction, which is a constant. Besides, the whole analysis flowchart was shown as Additional file 1: Fig. S2. However, the constant depends on the type of cancer, sample type, target region size, and sequencing platform, This study screened 62 BRCA1/2-deficiency samples and 37 BRCA1/2 wildtype clinical samples of the 195 patients to explore the reasonable constant K. Finally, when the correction coefficient was set as 15.5, the AUC of the model is 88.3%, the sensitivity is 95.2%, and the threshold is 30 (Additional file 1: Supplementary Fig. S3).

BRCA1/2 and other HRR gene mutation analysis

Variants were named according to HGVS (Human Genome Variation Society; Point mutations, short InDels, copy number variants were identified from NGS data, and interpreted in accordance with the “Genetic Variation Annotation Standards and Guidelines” (2015 Edition) issued by the American College of Medical Genetics (ACMG) for germline mutation, and the “Cancer mutation interpretation of guidelines and standards (2017 Edition)” for somatic mutation, respectively. BRCA1/2 locus-specific loss of heterozygosity were analyzed as follows: (a) if the mutation frequency of the SNP on the control sample is between 35 and 65%, it is recorded as a heterozygous mutation; (b) if the mutation frequency of the SNP on the tumor sample is greater than 65% or less than 35%, it is recorded as a homozygous mutation; (c) if the mutation frequency of a SNP site meets the both conditions, the SNP site is marked as a LOH site, otherwise it is marked as a non-LOH site; (d) if the number of LOH sites on BRCA1/2 is greater than the number of non-LOH sites, it is considered that LOH has occurred in BRCA1/2.

BRCA1 promoter methylation quantitative PCR assays

DNA methylation-sensitive and methylation-dependent restriction enzymes were used to selectively digest unmethylated or methylated genomic DNA, respectively. Post-digest DNA was quantified by real-time PCR using a 344-bp PCR-generated primer that spanned BRCA1 exon 1. The relative concentrations of differentially methylated DNA are determined by comparing the amount of each digest with that of a mock digest. A cutoff of 10% was used to define samples as “methylated”.

The definition of BRCA1/2-deficiency

BRCA1/2-deficiency is defined as either (i) one deleterious mutation in BRCA1 or BRCA2, with LOH in the wild-type copy or (ii) two deleterious mutations in the same gene or (iii) promoter methylation of BRCA1 with LOH in the wild-type copy.

Affymetrix OncoScan™ assay

The Affymetrix OncoScan™ assay utilizes the Molecular Inversion Probe (MIP) assay technology for the detection of SNP genotyping, and has subsequently been used for identifying other types of genetic variation including focal insertions and deletions, large fragment CNV, LOH, and even somatic mutation. This assay has been shown over time to perform well with highly degraded DNA, such as that derived from FFPE- preserved tumor samples of various ages and with < 100 ng DNA of starting material, thus making the assay a natural choice in cancer clinical research. This assay captured the alleles of 217,611 SNPs and then the original CEL files were obtained by Affymetrix Genechip Scanner were converted to the OSCHP files by Chromosome Analysis Suite 3.0.

Statistical analysis

All statistical analysis was conducted using R version 3.6.1 (R Core Team, 2013) with an α of 0.05. The statistical tools employed in this study include Student's t-test and one-way ANOVA analysis of variance. All reported P values were two-sided. P < 0.05 was considered to be statistically significant. The Pearson correlation were used to evaluating the consistency of two different methods. The two-dimensional normal distribution function was used to remove outliers. Same distribution statistical test was used to compare the difference between adjacent fragments. DBSCAN density clustering algorithm was used to identify different genotypes.


The density and uniformity of SNP sites

The density and uniformity of SNP sites are the critical factors for the precision and resolution of CNVs identification. Herein, the capture regions distribution of HRD Panel with all chromosomes was compared with Affymetrix OncoScan™ Assay, xGen Exome Research Panel (IDT), and a pan-cancer panel (BGI) (Fig. 1a). Then, chr4, chr13 and chr21 were selected randomly to compare the number and spacing of capture regions in the above four panels (Fig. 1b). The Affymetrix OncoScan™ Assay is a SNP-array based on molecular inversion probe technology, a proven technology for identifying CNVs, LOH, and detecting somatic mutations. Besides, this assay could realize the high-resolution (50–300 kb) copy number detection. HRD panel is designed for detecting large-scale CNVs (at Mb level) to calculate LOH, TAI, LST, and HRD score, thus the total number of SNP sites could reduce appropriately, but not uniformity. The results showed that the SNP sites of HRD panel had similar uniformity with Affymetrix OncoScan™ Assay, but had less SNP sites (93,200 vs. 217,611). Most of the work published to date detecting CNVs is based on SNP array, thus this study defined Affymetrix OncoScan™ Assay as a correlation method to evaluate the accuracy of HRD panel based on NGS approaches [25].

Fig. 1
figure 1

The capture regions distribution of Affymetrix OncoScan™ Assay (red line), HRD panel (yellow line), xGen Exome Research Panel (orange line), and a pan-cancer panel (green line) with all chromosomes (a) and 3 randomly selected chromosomes (b)

The rationality of segmentation and genotype identification

Chromosome segmentation of different genotypes is the key step for detecting CNVs. This algorithm had a variety of built-in statistical methods, which could evaluate whether the adjacent chromosomal segments conform to the same genotype according to the BAF. If the adjacent sub-segmentations belonged to the same genotype, they would be combined based on circular binary segmentation (CBS). Besides, the abnormal points had been removed, thus the segmentation error rate could be reduced greatly (Additional file 1: Fig. S4). Subsequently, the segmentation result of chromosome 2 in one patient of 195 clinical samples was selected as an example to analyze the rationality of segmentation and genotype identification (Fig. 2). The results showed that the BAF and copy number of the 25 M ~ 40 M region were 0.67 and 3 copies, it means that this segmentation was identified as ABB genotype. Similarly, BAF and copy number of the 115 M ~ 120 M region were 1.00 and 2 copies, it means that this segmentation was identified as BB genotype. Evaluating the relationship between BAF and CNV of all segmentation were all logical.

Fig. 2
figure 2

The relationship between B allele frequency (top panel) and copy number (bottom panel) according to genomic position, which could infer the genotype of each segmentation. Here, the chromosome 2 of one clinical patient is shown

The self-consistency of purity and ploidy estimation

The different tumor purity samples (80%, 50%, and 20%) of HCC1143 cell line were used to evaluate the necessity and limitation of tumor purity correction. The genotype of 20 M ~ 120 M in chromosome 1 of HCC1143 was considered as ABB, so the BAF distribution in this region should be characterized by a bimodal distribution with the antimode around 0.5 and peaks around 0.33 and 0.67. The density of BAF in each tumor purity was analyzed, and the results showed that the lower tumor purity, the more difficult to define ABB or AB genotype (Fig. 3a–c). In addition, the ploidy of 195 OC and BC tumor samples had been estimated by the GSA algorithm. The ploidy ranged between 1.41 and 4.07, being characterized by a bimodal distribution with the antimode around 2.5 and peaks around 2 (near-diploid status) and 3 (near-triploid status) (Fig. 3d; Additional file 1: Table S3). Thus, purity and ploidy estimation were considered to the GSA algorithm. The accuracy of the GSA algorithm to predict tumor purity and ploidy were verified by the following two methods. Firstly, mix sequencing reads of different proportions of control samples into the 5 measured FFPE samples to simulate tumor purity dilution; Secondly, mix different proportions of germline DNA into 3 samples of tumor cell lines to obtain tumor cell line samples of different tumor purity. The correlation coefficient R2 between the theoretical tumor purity value and the actual tumor purity value calculated by GSA were 0.9813 and 0.9812, respectively, in simulated diluted tumor samples and the real diluted cell line samples (Fig. 4a, b, Additional file 1: Table S4). The tumor purity calculated by GSA maintains a good linear relationship with the theoretical purity obtained by simulated dilution, indicating that the GSA purity correction algorithm has good stability when the pathological tumor purity is higher than or equal to 20%, and the algorithm can accurately reflect the real tumor cell content. Due that tumor ploidy is an essential attribute of the sample and will not affect by the tumor purity, the smaller the fluctuation and the more accurate the ploidy correction algorithm. Similarly, if purity more than 20%, the ploidy values under different tumor purity calculated by GSA are basically the same, and the HRD score of one sample with different purity was also stable (Fig. 4c, d, Additional file 1: Fig. S5, Additional file 1: Table S4). In a word, GSA algorithm had an excellent predictive effect on tumor purity and tumor genome ploidy, but the pathological tumor purity should more than 20%.

Fig. 3
figure 3

The BAF distribution of HCC1143 with 80% (a), 50% (b) and 20% (c) tumor purity in the region of 20 M ~ 120 M in chromosome 1. The ploidy distribution of 195 clinical samples (d)

Fig. 4
figure 4

The consistency of purity and ploidy estimation with the GSA algorithm. a The consistency between the theoretical tumor purity value and the actual tumor purity value in the simulated diluted tumor samples. b The consistency between the theoretical tumor purity value and the actual tumor purity value in the real diluted cell line samples. c The consistency of ploidy in the simulated diluted tumor samples with different tumor purity. d The consistency of ploidy in the real diluted cell line samples with different tumor purity

The accuracy of LOH, LST, TAI and HRD score

Total of 40 FFPE tumor samples were both detected by Affymetrix OncoScan™ assay and the custom designed HRD Panel. BAF and LRR from NGS data and CEL file were both used to calculate HRD scores by the GSA algorithm. The results showed that the correlation coefficient of HRD scores calculated by the two method was 0.98, and the correlation coefficient of LOH, TAI and LST calculated by the two methods were 0.96, 0.89 and 0.95, respectively. It indicated that our target capture panel, containing 93,200 SNP sites, could represent the whole-genome copy number variations, and thus could generate the accurate HRD scores (Fig. 5; Table 1).

Fig. 5
figure 5

Pearson correlation of LOH, LST, TAI and HRD score between Affymetrix OncoScan™ assay and the custom designed HRD Panel. Panels above the diagonal show correlation. Diagonal panels show density plots. The red dots represent BRCA1/2-deficient subgroup and the green dots represent BRCA1/2-intact subgroup in panels below the diagonal

Table 1 The HRD scores of SNP array and the target panel

The relationship between HRD score and BRCA1/2 deficiency status in clinical samples

Deficiency of BRCA1/2 via biallelic mutations and somatic hypermethylation (for BRCA1) gives rise to a deficiency status in homologous recombination repair, thus the HRD score and BRCA1/2 deficiency status of the 195 clinical samples were all analyzed. The distribution of scores was shown for BRCA1/2-deficient versus BRCA1/2-intact samples in Fig. 6. Higher HRD scores were observed in BRCA1/2-deficient tumors, suggesting that these tumors had a tendency towards genomic instability, and the consistency was up to 95.2% if the biological threshold was set as 30. But for other indicators, LOH, TAI and LST, there was less distinction between BRCA1/2-deficient tumors and BRCA1/2-intact tumors. Therefore, the results indicated that HRD score, combing LOH, TAI and LST, was the optimal indicator.

Fig. 6
figure 6

The distribution of LOH (a), TAI (b), LST (c) and HRD score (d) were shown as BRCA1/2-deficient subgroup (top panel, red bars) and BRCA1/2-intact subgroup (bottom panel, blue bars)

The HRD score and HRR genetic landscape of cell lines and clinical samples

All the 17 ovarian and breast cancer cell lines were determined, 2 cell lines (11.76%) were mBRCA carriers (only including SNV and InDel variation, which are defined as likely pathogenic and pathogenic mutation carriers), and BRCA1/2 copy number loss was detected in 8 cell lines (47.06%) (Table 2; Fig. 7). The HRD scores distributed from -21.19 to 73.21, and the mean was 27.91, and the median was 14.42. Assuming that there is only one primary clone in the tumor samples, and amplification or loss only occur in one chromosome, the theoretic ploidy of diploid is between 1.95 and 2.04. Thus, only 1 cell line (A2780) was calculated as diploid, and other 15 cell lines (93.75%) were defined as aneuploidy (Table 2). The BAF and CN mapping of HCC38 and ZR-75-30, which had the similar ploidy correction factor but the different HRD scores, were shown as Additional file 1: Fig. S6 and Fig. S7. Obviously, the genomic status of HCC38 cell line was more instable than ZR-75-30. In addition, all the cell lines with BRCA1 methylation were showed as high HRD scores but not with any BRCA1 mutation. Meanwhile, mutations of 13 other HRR pathway genes, ATM, BARD1, BRIP1, CDK12, CHEK1, CHEK2, FANCL, PALB2, PPP2R2A, RAD51B, RAD51C, RAD51D, RAD54L, and TP53 were analyzed. TP53, which encodes the tumor-suppressor protein p53, was the most frequently mutated gene (82.35%), but there was no mutation in CDK12, CHEK2, FANCL, PALB2, PPP2R2A and RAD51D (Fig. 7). ZR-75-1 cell line was not detected any mutation in the above 16 genes and BRCA1 methylation, and the HRD score was 14.42. Notably, there was higher frequency in HRR pathway genes in high HRD score samples, except for IGR-OV1, which derived from ovarian endometrioid adenocarcinoma, which is a type I epithelial ovarian cancer (EOC) and have verified as a slow growing and indolent neoplasms. The HRD score of IGR-OV1 was calculated as -14.73 but with many mutations in multiple HRR pathway genes and multi-hit mutations in TP53 and BRCA2, and it indicated that this kind of ovarian cancer type might tend to genomic stability and various threshold should be set in different tumor type. Similar mutational patterns in 15 HRR genes were also analyzed with the 195 clinical samples, and 64.62% (126/195) samples were altered. The result showed that the mutation frequency of the HRR genes in the high HRD score cohort (72.48%) was higher than that in the low HRD score cohort (54.65%), indicating there was a higher genomic instability in the high HRD score cohort, but clinical efficacy data were still lacking (Additional file 1: Fig. S8).

Table 2 The HRD Score and BRCA1/2 mutations and methylation of 17 OC and BC cell lines
Fig. 7
figure 7

The waterfall map of 15 HRR pathway genes and TP53 that mutated in 17 cell lines. The annotation of mutation types was shown on the upper right with various colors. The samples were sorted by HRD scores


This study proposed a new algorithm model called GSA to detect genome aberrations based on high-throughput sequencing data, and this algorithm could accurately realize the segmentation of chromosomal regions, effectively calculate the tumor purity and tumor genome ploidy automatically and could be used for the statistical modeling of various genomic instability indicators.

Herein, the segmentation result of chromosome 2 in one clinical sample as an example to compare the chromosome segmentation effects of GSA, PureCN and ASCAT algorithms through measured samples (Additional file 1: Fig. S9; Additional file 1: Table S5). The results showed that PureCN, ASCAT and GSA algorithms divided chromosome 2 into 4, 26 and 4 fragments respectively. First of all, the BAF and copy number of the 76 M ~ 170 M region were different from the adjacent segments. The BAF of this segment was concentrated around 0.5, indicating that the A/B gene ratio was closed to 1:1, and the average copy number was about 4 excluding the influence of some interference points. The result given by GSA was 4 copies, and the genotype of this fragment is AABB. But the result given by PureCN was 3 copies, which is inconsistent with BAF, and ASCAT algorithm divided the segment into 10 fragments, due to the interference of some noise points. Secondly, PureCN was disconnected at the positions of 21 M, 90-92 M, and 147 M, but there was no significant difference at the breakpoint position judging from BAF and copy number. ASCAT divides the 10 K ~ 76 M region into 7 fragments, which should belong to the same genotype inferring by the data of BAF and copy number. In addition, it should be pointed out that the GSA method has special processing for the chromosome centromere gap region. If the region distribution before and after the gap was the same, it would be merged. Overall, the GSA algorithm is more accurate for chromosome fragment segmentation.

Aneuploidy is commonly observed in cancers, and the result showed that ploidy in tumor samples is characterized by a bimodal distribution in triploid and diploid which is basically consistent with the results reported in many literatures [7]. Then, three tumor cell line samples (HCC1143, HCC1428 and HCC38) and four FFPE samples with different tumor purity were used to predict tumor purity and ploidy using GSA, ASCAT and PureCN software, respectively. The results showed that the tumor purity predicted by GSA was closest to the pathological tumor purity compared to the other two software. ASCAT has an obvious limitation in predicting samples with lower tumor purity. Meanwhile, when the actual tumor purity was low, the tumor ploidy predicted by PureCN was varies greatly. But the tumor ploidy predicted by GSA could maintain good stability (Additional file 1: Table 6). Although the tumor purity calculated by GSA was highly consistent with the theoretical purity of the diluted tumor cell line sample, if the tumor cell content was less than 20%, GSA algorithm could not accurately predict the tumor purity and ploidy. Normal tissue contamination and duplication of the entire genome is the most common initiation event for aneuploidy during cancer progression [3, 15, 26]. Moreover, near-tetraploid tumors show increased TAI and LST scores, but the generation of new HRD-LOH events is theoretically less likely to occur as ploidy increases [7]. The GSA algorithm added the calculation of ploidy and tumor cell purity when detecting CNVs and deducted the ploidy when calculating the HRD score. The results from clinical samples indicated that the accuracy and sensitivity of the combined HRD score calculated by GSA algorithm both were well-behaved, the accuracy was 0.98 (comparing with Affymetrix OncoScan™ assay) and the sensitivity was 95.2% (comparing with BRCA1/2 deficiency status).

To verify whether the HRD score results obtained by the GSA algorithm can be used to assist clinical treatment efficacy, more clinical efficacy data support is needed. Unfortunately, the clinical samples used in this study lack this part of information, but we can initially compare the score of cell lines with the research results of the previous literature. The HRD scores of HCC38 and MDA-MB-231 calculated by GSA were 73.21 and 8.90, respectively. Both of the two cell lines are Claudin-low breast cancer cells, and the BRCA status were wildtype. In the Anne Margriet Heijink study (2019), HCC38 (GR50 = 3.6 μM) was defined as cisplatin-sensitive TNBC cell lines, and MDA-MB-231 (GR50 = 61.0 μM) was defined as cisplatin-resistant TNBC cell lines [27]. Due to the cisplatin introduces both intra- and inter-strand DNA crosslinks (ICLs), which stall replication forks and are therefore especially toxic in proliferating cells, if the homologous recombination function of cell lines is deficient, the cell lines cannot repair the double-strand breaks. Obviously, the HRD score result obtained by the GSA algorithm is consistent with the judgment of cell line drug sensitivity of the previous study [27]. With more and more application of PARP inhibitor in the clinic, we need data on the efficacy of PARP inhibitors to validate the accuracy of GSA algorithm in the real-world data.


This new algorithm, named as GSA, could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Availability of data and materials

The data generated during this study are also available in the CNGB Nucleotide Sequence Archive (CNSA: https://; accession number CNP0002015). All data analyzed during this study are included in this published article and its supplementary information files. The basic information of GSA is available at



Next generation sequencing


Genomic scar analysis


Homologous recombination deficiency


Homologous recombination repair


Double-stranded break


Single-strand break


PARP inhibitors


Loss of heterozygosity


Telomeric allelic imbalance


Large-scale state transition


Copy number variation


Hidden Markov model


Circular binary segmentation


B allele frequency


Log R ratio


Tree recursion


  1. Konstantinopoulos PA, Ceccaldi R, Shapiro GI, D’Andrea AD. Homologous recombination deficiency: exploiting the fundamental vulnerability of ovarian cancer. Cancer Discov. 2015;5(11):1137–54.

    Article  CAS  Google Scholar 

  2. Stover EH, Fuh K, Konstantinopoulos PA, Matulonis UA, Liu JF. Clinical assays for assessment of homologous recombination DNA repair deficiency. Gynecol Oncol. 2020;159(3):887–98.

    Article  CAS  Google Scholar 

  3. Watkins JA, Irshad S, Grigoriadis A, Tutt AN. Genomic scars as biomarkers of homologous recombination deficiency and drug response in breast and ovarian cancers. Breast Cancer Res. 2014;16(3):211.

    Article  Google Scholar 

  4. Glodzik D, Bosch A, Hartman J, Aine M, Vallon-Christersson J, Reutersward C, Karlsson A, Mitra S, Nimeus E, Holm K, et al. Comprehensive molecular comparison of BRCA1 hypermethylated and BRCA1 mutated triple negative breast cancers. Nat Commun. 2020;11(1):3747.

    Article  CAS  Google Scholar 

  5. Tabano S, Azzollini J, Pesenti C, Lovati S, Costanza J, Fontana L, Peissel B, Miozzo M, Manoukian S. Analysis of BRCA1 and RAD51C promoter methylation in italian families at high-risk of breast and ovarian cancer. Cancers. 2020;12(4):66.

    Article  Google Scholar 

  6. Moschetta M, George A, Kaye SB, Banerjee S. BRCA somatic mutations and epigenetic BRCA modifications in serous ovarian cancer. Ann Oncol. 2016;27(8):1449–55.

    Article  CAS  Google Scholar 

  7. Marquard AM, Eklund AC, Joshi T, Krzystanek M, Favero F, Wang ZC, Richardson AL, Silver DP, Szallasi Z, Birkbak NJ. Pan-cancer analysis of genomic scar signatures associated with homologous recombination deficiency suggests novel indications for existing cancer drugs. Biomark Res. 2015;3:9.

    Article  Google Scholar 

  8. Nguyen L, Jeck WMM, Van Hoeck A, Cuppen E. Pan-cancer landscape of homologous recombination deficiency. Nat Commun. 2020;11(1):5584.

    Article  CAS  Google Scholar 

  9. O’Kane GM, Connor AA, Gallinger S. Characterization, detection, and treatment approaches for homologous recombination deficiency in cancer. Trends Mol Med. 2017;23(12):1121–37.

    Article  CAS  Google Scholar 

  10. McMullen M, Karakasis K, Madariaga A, Oza AM. Overcoming platinum and PARP-inhibitor resistance in ovarian cancer. Cancers. 2020;12(6):66.

    Article  Google Scholar 

  11. Arora S, Balasubramaniam S, Zhang H, Berman T, Narayan P, Suzman D, Bloomquist E, Tang S, Gong Y, Sridhara R, et al. FDA approval summary: olaparib monotherapy or in combination with bevacizumab for the maintenance treatment of patients with advanced ovarian cancer. Oncologist. 2021;26(1):e164–72.

    Article  CAS  Google Scholar 

  12. Telli ML, Timms KM, Reid J, Hennessy B, Mills GB, Jensen KC, Szallasi Z, Barry WT, Winer EP, Tung NM, et al. Homologous recombination deficiency (HRD) score predicts response to platinum-containing neoadjuvant chemotherapy in patients with triple-negative breast cancer. Clin Cancer Res. 2016;22(15):3764–73.

    Article  CAS  Google Scholar 

  13. Stronach EA, Paul J, Timms KM, Hughes E, Brown K, Neff C, Perry M, Gutin A, El-Bahrawy M, Steel JH, et al. Biomarker assessment of HR deficiency, tumor BRCA1/2 mutations, and CCNE1 copy number in ovarian cancer: associations with clinical outcome following platinum monotherapy. Mol Cancer Res. 2018;16(7):1103–11.

    Article  CAS  Google Scholar 

  14. Gou R, Dong H, Lin B. Application and reflection of genomic scar assays in evaluating the efficacy of platinum salts and PARP inhibitors in cancer therapy. Life Sci. 2020;261:118434.

    Article  CAS  Google Scholar 

  15. Popova T, Manie E, Rieunier G, Caux-Moncoutier V, Tirapo C, Dubois T, Delattre O, Sigal-Zafrani B, Bollet M, Longy M, et al. Ploidy and large-scale genomic instability consistently identify basal-like breast carcinomas with BRCA1/2 inactivation. Cancer Res. 2012;72(21):5454–62.

    Article  CAS  Google Scholar 

  16. Birkbak NJ, Wang ZC, Kim JY, Eklund AC, Li Q, Tian R, Bowman-Colin C, Li Y, Greene-Colozzi A, Iglehart JD, et al. Telomeric allelic imbalance indicates defective DNA repair and sensitivity to DNA-damaging agents. Cancer Discov. 2012;2(4):366–75.

    Article  CAS  Google Scholar 

  17. Feng Z, Wen H, Ju X, Bi R, Chen X, Yang W, Wu X. Homologous recombination repair gene mutations show no survival benefits in Chinese high-grade serous ovarian cancer patients. Ann Transl Med. 2021;9(5):364.

    Article  CAS  Google Scholar 

  18. Abkevich V, Timms KM, Hennessy BT, Potter J, Carey MS, Meyer LA, Smith-McCune K, Broaddus R, Lu KH, Chen J, et al. Patterns of genomic loss of heterozygosity predict homologous recombination repair defects in epithelial ovarian cancer. Br J Cancer. 2012;107(10):1776–82.

    Article  CAS  Google Scholar 

  19. Timms KM, Abkevich V, Hughes E, Neff C, Reid J, Morris B, Kalva S, Potter J, Tran TV, Chen J, et al. Association of BRCA1/2 defects with genomic scores predictive of DNA damage repair deficiency among breast cancer subtypes. Breast Cancer Res. 2014;16(6):475.

    Article  Google Scholar 

  20. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17(11):1665–74.

    Article  CAS  Google Scholar 

  21. Van Loo P, Nordgard SH, Lingjaerde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci USA. 2010;107(39):16910–5.

    Article  Google Scholar 

  22. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol. 2012;30(5):413–21.

    Article  CAS  Google Scholar 

  23. Oesper L, Mahmoody A, Raphael BJ. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol. 2013;14(7):R80.

    Article  Google Scholar 

  24. Riester M, Singh AP, Brannon AR, Yu K, Campbell CD, Chiang DY, Morrissey MP. PureCN: copy number calling and SNV classification using targeted short read sequencing. Source Code Biol Med. 2016;11:13.

    Article  Google Scholar 

  25. de Luca XM, Newell F, Kazakoff SH, Hartel G, McCart Reed AE, Holmes O, Xu Q, Wood S, Leonard C, Pearson JV, et al. Using whole-genome sequencing data to derive the homologous recombination deficiency scores. NPJ Breast Cancer. 2020;6:33.

    Article  Google Scholar 

  26. Storchova Z, Kuffer C. The consequences of tetraploidy and aneuploidy. J Cell Sci. 2008;121(Pt 23):3859–66.

    Article  CAS  Google Scholar 

  27. Heijink AM, Everts M, Honeywell ME, Richards R, Kok YP, de Vries EGE, Lee MJ, van Vugt M. Modeling of cisplatin-induced signaling dynamics in triple-negative breast cancer cells reveals mediators of sensitivity. Cell Rep. 2019;28(9):e2345–57.

    Article  Google Scholar 

Download references


Not applicable.


Not applicable.

Author information

Authors and Affiliations



DC and MS wrote this manuscript with contributions from all authors. MS, PM and CW designed the GSA and analyzed the variant caller results. DC, QL, YC and CS performed the experiments and interpreted the results. TS directed all aspects of the project from concept, to design, to engineering to experimentation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Taiping Shi.

Ethics declarations

Ethics approval and consent to participate

All patients had signed the informed consents, and the study was approved by the Institutional Review Board of BGI Co., Ltd.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Supplementary material. Additional figures and tables to support on the understanding the GSA algorithm.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, D., Shao, M., Meng, P. et al. GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing. BMC Bioinformatics 22, 562 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: