GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

Background The gain or loss of large chromosomal regions or even whole chromosomes is termed as genomic scarring and can be observed as copy number variations resulting from the failure of DNA damage repair. Results In this study, a new algorithm called genomic scar analysis (GSA) has developed and validated to calculate homologous recombination deficiency (HRD) score. The two critical submodules were tree recursion (TR) segmentation and filtering, and the estimation and correction of the tumor purity and ploidy. Then, this study evaluated the rationality of segmentation and genotype identification by the GSA algorithm and compared with other two algorithms, PureCN and ASCAT, found that the segmentation result of GSA algorithm was more logical. In addition, the results indicated that the GSA algorithm had an excellent predictive effect on tumor purity and ploidy, if the tumor purity was more than 20%. Furtherly, this study evaluated the HRD scores and BRCA1/2 deficiency status of 195 clinical samples, and the results indicated that the accuracy was 0.98 (comparing with Affymetrix OncoScan™ assay) and the sensitivity was 95.2% (comparing with BRCA1/2 deficiency status), both were well-behaved. Finally, HRD scores and 16 genes mutations (TP53 and 15 HRR pathway genes) were analyzed in 17 cell lines, the results showed that there was higher frequency in HRR pathway genes in high HRD score samples. Conclusions This new algorithm, named as GSA, could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04487-9.


Background
Homologous recombination deficiency (HRD) is a functional defect in homologous recombination repair (HRR) pathway, which is responsible for repairing DNA doublestranded breaks (DSBs), and HRD is a potent tumorigenic type of DNA lesion. Tumors with HRD status are arising from germline and/or somatic mutations in BRCA1/2 or other HRR pathway genes, promoter hypermethylation of BRCA1 and/or RAD51C, and other mechanisms [1][2][3]. In particular, it has proved that BRCA1 promoter hypermethylation is also a common epigenetic event in breast and ovarian cancer, ranged from 11 to 57% in different studies [4][5][6]. Over the last years, several studies about HRD in pancancer have shown that HRD occurred in many cancers with various frequency, and it was most prevalent in ovarian, breast, prostate and pancreatic cancer [7][8][9].
Some drugs based on DSB or single-strand break (SSB) repair have been developed and used successfully in clinical trials. Platinum salts are currently one of the most important chemotherapeutic drugs and have a broad anticancer spectrum, which introduces DNA DSBs and interstrand crosslinks. Therefore, HRD cells are supposed to be sensitive to platinum salts. The PARP family of proteins, especially PARP1, are essential for SSBs repair by the BER pathway. PARP inhibitors (PARPi) are based on inhibiting the SSBs repairing and then the accumulation of SSBs would lead to the development of fatal DSBs. Thus, PARPi are more sensitive to HRD-positive tumors through synthetic lethal interaction [10,11]. Several studies have shown that HRD is a potential biomarker for platinum salts and PARPi in many cancers, especially in ovarian and breast cancer, and PARPi have attracted widespread attention in the targeted therapies of multiple cancers due to better efficacy and fewer side effects [12][13][14]. In the past years, FDA has approved niraparib and olaparib combined with bevacizumab, for specific patients with HRD-positive status, according to two critical clinical trials, QUADRA and PAOLA-1 [15,16].
Currently, HRD testing is mainly carried out by two methods. Firstly, BRCA1/2 or other HRR pathway gene mutation detection and analysis by a custom-designed panel, but the panel gene list is various and variants of unknown significance and secondary mutations lack a standard database [17]. Secondly, identifying the results of HRD by detecting genomic damage patterns, known as genomic scars, including three biomarkers, loss of heterozygosity (LOH), telomeric allelic imbalance (TAI), and large-scale state transition (LST) [15,16,18]. Besides, some other methods are also developing, such as BRCA1/RAD51C epigenetic analysis and mutational signature analysis, et.al. [9], but these are still difficult for clinical application. Two commercial genomic scar detection kits have been approved by FDA, myChoice HRD CDx (Myriad Genetics, Co., Ltd.) and Foundation Focus CDx (Foundation Medicine, Co., Ltd.) [2].
Genomic scar analysis (GSA) by calculating LOH, LST and TAI scores is a dominant way to evaluate the HRD status of tumors, and the theoretical foundation of this method is that tumors with HRD phenotype would lead to large-scale copy number variation (CNV) [15,16,18,19]. There are four software commonly used to detecting CNV of tumor samples, including PennCNV, ASCAT, ABSOLUTE and PureCN. PennCNV is a free software tool based on a hidden Markov model (HMM), for kilobase-resolution detection of CNVs from Illumina high-density SNP genotyping data [20]. ASCAT and ABSOLUTE were introduced to estimate tumor purity directly from SNP array data [21,22], and both can effectively solve the impact of purity and ploidy based on SNP array data. NGS data is much higher resolution data than SNP arrays, and provides the opportunity to derive highly accurate estimates of both tumor purity and ploidy [23]. PureCN, is optimized for targeted short read sequencing data, integrates well with standard somatic variant detection pipelines, and has support for matched and/or unmatched tumor samples [24]. Currently, PureCN is a mainstream method for purity correction based on circular binary segmentation (CBS) from NGS data, but it is necessary to artificially judge the best combination from multiple combinations of purity and ploidy.
This study aims to developing and validating an algorithm, named as GSA, which can effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

Custom design HRD panel
Variant detection requires comprehensive consideration of the detection frequency and coverage of SNP sites, and relies on statistical analysis based on the relationship between adjacent points to determine the position of fragment breakpoints and eliminate test deviations. Therefore, the principles of probe design mainly include: (1) The target region should contain high-frequency SNP sites of the population; (2) Ensure the capture region has a certain density in the whole genome; (3) Ensure that the regions are as even as possible; (4)Ensure that the target probe is synthesizable (some region probes cannot be synthesized by the supplier due to complex structure or multiple alignments lead to poor specificity); (5)Ensure that the target regions have good capture efficiency and coverage, and no obvious regional preference. Based on the above principles, a total of 93,200 high-frequency SNP loci (frequency ≥ 5%) from the 1000 Genome Database were screened out, which are evenly distributed on each chromosome (except Y chromosomes and mitochondria).

Library preparation, hybridization capture and sequencing
DNA from FFPE tissues was extracted by QIAAMP DNA FFPE TISSUE KIT (Qiagen, Hilden, Germany) according to the manufacturer's standard protocol. Briefly, 400 ng genomic DNA is fragmented and end-repaired, and a linker with a tag sequence is added to both ends of the DNA by ligase, followed by PCR amplification to form a pre-PCR library. The target DNA fragment in the library is hybridized with a combined probe containing 93,200 SNP sites and additional 2228 capture beds targeting the complete coding region of ATM, BRCA1, BRCA2, BRIP1, BARD1, CDK12, CHEK1, CHEK2, FANCL, PALB2, PPP2R2A, RAD51B, RAD51C, RAD51D, RAD54L and TP53, et.al. After purification, the enriched DNA is specifically captured and amplified by PCR to obtain a post-PCR library. The post-PCR library undergoes single-strand separation, circularization and rolling circle replication to generate DNA nano balls (DNB) and sequencing was performed with 2 × 101 bp paired-end reads on MGISEQ-2000 platform (MGI Tech Co., Ltd.).

Raw data quality control
Sequencing data needs to pass the basic standards of quality checks. Raw data quality control includes quality metrics for per-base sequence quality, sequence content, GC content and sequence length distribution, relative percentages of unmatched indices. Usually, the quality control parameters are set as Q30 ≥ 90%, and 40-50% GC content.

Data pre-processing
Raw paired-end reads were subjected to SOAPnuke (v2.0) processing to remove sequencing adapters and low-quality reads. High-quality reads were aligned to the reference human genome (GRCh37.p13), using the BWA sequence alignment software (0.7.17-r1188). PCR deduplication was performed using Picard. Average sequencing depths for tumors samples were ≥ 150× . For each sample, SNVs were called from BAM files using an in-house software, termed as Somatk. B allele frequency (BAF) and Log R ratio (LRR) were obtained from each capture region. BAF represents the median SNP genotype frequency of each capture region, and LRR represents the normalized depth ratio of the tumor and the normal sample (or blood cell control set) in each capture region after GC-bias correction.

TR segmentation and filtering algorithm
The Tree recursion (TR) Segmentation and Filtering Algorithm was developed by C++ . The input data format of the algorithm is (i) BAF data and (ii) LRR data. To reduce the noise in the input data, both BAF and LRR are preprocessed by a specially designed segmentation and filtering algorithm. First, if BAF ≥ 0.95 or BAF ≤ 0.05, defined as homozygous, the data would be removed from the BAF track because of its uselessness. Then the remaining BAF value is mirrored and flipped upward with 0.5 as the center, thus BAF =|BAF − 0.5|+ 0.5. For LRR, the bin LRR values are also first optionally filtered for outliers, defined as the total probability density is below the 30% quantile in all bins.
Next, the in-house TR segmentation algorithm, based on the calculation of the runlength, was used to roughly segment each chromosome, as shown in Additional file 1: Fig. S1. In this algorithm, the whole chromosome is taken as the root node, all the segmented sub-nodes are taken as the child nodes. The segmentation process can be simply described as the following steps: (a) Calculate the cumulative run-length of data (here refers to BAF and LRR) deviated from the mean ( (x − x) ) and select its maximum and minimum points as candidate breakpoints.
(b) Make appropriate trade-offs of candidate breakpoints according to the location of breakpoints, length of segments, number of data points in segments, etc., that is, determine whether breakpoints should be recorded. (c) If none of the subfragment of the current fragment satisfies the record condition, a recursive judgment is initiated. Otherwise, it recursively slices its last child node. (d) After the termination condition is reached, recursion is carried out on horizontal child nodes. (e) If all child objects have been processed, the parent's level object will continue to be processed until it is finally traced back to the root node and no new child objects are created.
Then the fragments are merged in a cyclic manner. Firstly, for each segmented fragment of chromosome was traversed by the kernel density estimation, to find out the two fragments, which are closest to the same distribution and combine them. Secondly, the statistics of the newly merged fragment and its adjacent fragments are recalculated until all indicators meet the requirements. Besides, segmentation of BAF and LRR is carried out separately, and then the union set of the merged BAF and LRR segmentation list is taken, but the regions with too short or insufficient data points are iteratively removed.

Purity and ploidy estimation
BAF and LRR are expressed by a given genomic location as functions of the allele-specific copy numbers n A and n B , where n A denotes the number of copies of the A allele and n B denotes the number of copies of the B allele. Assuming tumor cell purity (p) was 1, BAF and LRR are calculated by: Considering the influence of nonaberrant cells in real world tumor samples and assuming that the nonaberrant cells have a total copy number of 2 for all loci, tumor ploidy correction factor (scale_factor), tumor purity (p), the measured CN (CN*) and the measured BAF (BAF*) of the FFPE samples satisfy the following relationship (Additional file 1: Table S1, Table S2).
Based on the segmented chromosome fragments, the mean of all BAF value in the fragments, the percentage ranking of the BAF mean in all fragments, the theoretical CN (CN = 2 LRR *2) and the percentage ranking of the CN in all fragments are calculated. Subsequently, using the density-based scan (DBSCAN) algorithm to perform density clustering on the BAF mean-CN percentage ranking data, the chromosome fragments of the same genotype are clustered into a cluster.
For fragments of the same genotype, the calculated purity value can be approximately regarded as conforming to a normal distribution with the theoretical mean value of purity, that is Eqs. (5).
Therefore, the tumor purity is calculated by clustering the chromosome fragments of the same genotype, and bringing the measured mean value of BAF*, theoretical BAF and CN values of the specific genotype cluster into the Eqs. (5).
In addition, the ploidy value of the entire genome of the sample is the weighted average of the copy number of each segment of the chromosome.
Segs i is the proportion of each segment on the reference genome, and CN i is the calculated copy number of the segment.

Calculation of LOH, TAI, and LST scores
HRD-LOH score was defined as the number of LOH regions longer than 15 Mb exceeding LOH regions which do not cover the whole chromosome. HRD-TAI score was defined as the number of regions with allelic imbalance that (a) extend to one of the subtelomeres, (b) do not cross the centromere and (c) are longer than 11 Mb. HRD-LST score is the number of break points between regions longer than 10 Mb after filtering out regions shorter than 3 Mb.
Aneuploidy is a common event in cancer patients, so more copy number variations will be detected by the high-throughput sequencing data. However, these copy number abnormalities may not be caused by the failure of homologous recombination repair, it will make the final HRD score calculation biased. Calculating accurately HRD scores depend on BAF and copy number, but the aneuploidy properties and various purity of tumor samples will affect the actual value of BAF and copy number. Thus, it is necessary to make appropriate correction, and the calculation formula of HRD score is preliminarily determined as follows: Here, K is the coefficient of correction, which is a constant. Besides, the whole analysis flowchart was shown as Additional file 1: Fig. S2 Segs i * 2 n coefficient was set as 15.5, the AUC of the model is 88.3%, the sensitivity is 95.2%, and the threshold is 30 (Additional file 1: Supplementary Fig. S3).

BRCA1/2 and other HRR gene mutation analysis
Variants were named according to HGVS (Human Genome Variation Society; http:// www. hgvs. org/). Point mutations, short InDels, copy number variants were identified from NGS data, and interpreted in accordance with the "Genetic Variation Annotation Standards and Guidelines" (2015 Edition) issued by the American College of Medical Genetics (ACMG) for germline mutation, and the "Cancer mutation interpretation of guidelines and standards (2017 Edition)" for somatic mutation, respectively. BRCA1/2 locus-specific loss of heterozygosity were analyzed as follows: (a) if the mutation frequency of the SNP on the control sample is between 35 and 65%, it is recorded as a heterozygous mutation; (b) if the mutation frequency of the SNP on the tumor sample is greater than 65% or less than 35%, it is recorded as a homozygous mutation; (c) if the mutation frequency of a SNP site meets the both conditions, the SNP site is marked as a LOH site, otherwise it is marked as a non-LOH site; (d) if the number of LOH sites on BRCA1/2 is greater than the number of non-LOH sites, it is considered that LOH has occurred in BRCA1/2.

BRCA1 promoter methylation quantitative PCR assays
DNA methylation-sensitive and methylation-dependent restriction enzymes were used to selectively digest unmethylated or methylated genomic DNA, respectively. Postdigest DNA was quantified by real-time PCR using a 344-bp PCR-generated primer that spanned BRCA1 exon 1. The relative concentrations of differentially methylated DNA are determined by comparing the amount of each digest with that of a mock digest. A cutoff of 10% was used to define samples as "methylated".

The definition of BRCA1/2-deficiency
BRCA1/2-deficiency is defined as either (i) one deleterious mutation in BRCA1 or BRCA2, with LOH in the wild-type copy or (ii) two deleterious mutations in the same gene or (iii) promoter methylation of BRCA1 with LOH in the wild-type copy.

Affymetrix OncoScan ™ assay
The Affymetrix OncoScan ™ assay utilizes the Molecular Inversion Probe (MIP) assay technology for the detection of SNP genotyping, and has subsequently been used for identifying other types of genetic variation including focal insertions and deletions, large fragment CNV, LOH, and even somatic mutation. This assay has been shown over time to perform well with highly degraded DNA, such as that derived from FFPE-preserved tumor samples of various ages and with < 100 ng DNA of starting material, thus making the assay a natural choice in cancer clinical research. This assay captured the alleles of 217,611 SNPs and then the original CEL files were obtained by Affymetrix Genechip Scanner were converted to the OSCHP files by Chromosome Analysis Suite 3.0.

Statistical analysis
All statistical analysis was conducted using R version 3.6.1 (R Core Team, 2013) with an α of 0.05. The statistical tools employed in this study include Student's t-test and oneway ANOVA analysis of variance. All reported P values were two-sided. P < 0.05 was considered to be statistically significant. The Pearson correlation were used to evaluating the consistency of two different methods. The two-dimensional normal distribution function was used to remove outliers. Same distribution statistical test was used to compare the difference between adjacent fragments. DBSCAN density clustering algorithm was used to identify different genotypes.

The density and uniformity of SNP sites
The density and uniformity of SNP sites are the critical factors for the precision and resolution of CNVs identification. Herein, the capture regions distribution of HRD Panel with all chromosomes was compared with Affymetrix OncoScan ™ Assay, xGen Exome Research Panel (IDT), and a pan-cancer panel (BGI) (Fig. 1a). Then, chr4, chr13 and chr21 were selected randomly to compare the number and spacing of capture regions in the above four panels (Fig. 1b). The Affymetrix OncoScan ™ Assay is a SNP-array based on molecular inversion probe technology, a proven technology for identifying CNVs, LOH, and detecting somatic mutations. Besides, this assay could realize the high-resolution (50-300 kb) copy number detection. HRD panel is designed for detecting largescale CNVs (at Mb level) to calculate LOH, TAI, LST, and HRD score, thus the total number of SNP sites could reduce appropriately, but not uniformity. The results showed that the SNP sites of HRD panel had similar uniformity with Affymetrix OncoScan ™ Assay, but had less SNP sites (93,200 vs. 217,611). Most of the work published to date detecting CNVs is based on SNP array, thus this study defined Affymetrix OncoScan ™ Assay as a correlation method to evaluate the accuracy of HRD panel based on NGS approaches [25].

The rationality of segmentation and genotype identification
Chromosome segmentation of different genotypes is the key step for detecting CNVs. This algorithm had a variety of built-in statistical methods, which could evaluate whether the adjacent chromosomal segments conform to the same genotype according to the BAF. If the adjacent sub-segmentations belonged to the same genotype, they would be combined based on circular binary segmentation (CBS). Besides, the abnormal points had been removed, thus the segmentation error rate could be reduced greatly (Additional file 1: Fig. S4). Subsequently, the segmentation result of chromosome 2 in one patient of 195 clinical samples was selected as an example to analyze the rationality of segmentation and genotype identification (Fig. 2). The results showed that the BAF and copy number of the 25 M ~ 40 M region were 0.67 and 3 copies, it means that this segmentation was identified as ABB genotype. Similarly, BAF and copy number of the 115 M ~ 120 M region were 1.00 and 2 copies, it means that this segmentation was identified as BB genotype. Evaluating the relationship between BAF and CNV of all segmentation were all logical.

The self-consistency of purity and ploidy estimation
The different tumor purity samples (80%, 50%, and 20%) of HCC1143 cell line were used to evaluate the necessity and limitation of tumor purity correction. The genotype of 20 M ~ 120 M in chromosome 1 of HCC1143 was considered as ABB, so the BAF distribution in this region should be characterized by a bimodal distribution with the Fig. 2 The relationship between B allele frequency (top panel) and copy number (bottom panel) according to genomic position, which could infer the genotype of each segmentation. Here, the chromosome 2 of one clinical patient is shown antimode around 0.5 and peaks around 0.33 and 0.67. The density of BAF in each tumor purity was analyzed, and the results showed that the lower tumor purity, the more difficult to define ABB or AB genotype (Fig. 3a-c). In addition, the ploidy of 195 OC and BC tumor samples had been estimated by the GSA algorithm. The ploidy ranged between 1.41 and 4.07, being characterized by a bimodal distribution with the antimode around 2.5 and peaks around 2 (near-diploid status) and 3 (near-triploid status) ( Fig. 3d; Additional file 1: Table S3). Thus, purity and ploidy estimation were considered to the GSA algorithm. The accuracy of the GSA algorithm to predict tumor purity and ploidy were verified by the following two methods. Firstly, mix sequencing reads of different proportions of control samples into the 5 measured FFPE samples to simulate tumor purity dilution; Secondly, mix different proportions of germline DNA into 3 samples of tumor cell lines to obtain tumor cell line samples of different tumor purity. The correlation coefficient R 2 between the theoretical tumor purity value and the actual tumor purity value calculated by GSA were 0.9813 and 0.9812, respectively, in simulated diluted tumor samples and the real diluted cell line samples (Fig. 4a, b, Additional file 1: Table S4). The tumor purity calculated by GSA maintains a good linear relationship with the theoretical purity obtained by simulated dilution, indicating that the GSA purity correction algorithm has good stability when the pathological tumor purity is higher than or equal to 20%, and the algorithm can accurately reflect the real tumor cell content. Due that tumor ploidy is an essential attribute of the sample and will not affect by the tumor purity, the smaller the fluctuation and the more accurate the ploidy correction algorithm. Similarly, if purity more than 20%, the ploidy values under different tumor purity calculated by GSA are basically the same, and the HRD score of one sample with different purity was also stable (Fig. 4c, d, Additional file 1: Fig. S5, Additional file 1: Table S4). In a word, GSA algorithm had an excellent predictive effect on tumor purity and tumor genome ploidy, but the pathological tumor purity should more than 20%.

The accuracy of LOH, LST, TAI and HRD score
Total of 40 FFPE tumor samples were both detected by Affymetrix OncoScan ™ assay and the custom designed HRD Panel. BAF and LRR from NGS data and CEL file were both used to calculate HRD scores by the GSA algorithm. The results showed that the correlation coefficient of HRD scores calculated by the two method was 0.98, and the correlation coefficient of LOH, TAI and LST calculated by the two methods were 0.96, 0.89 and 0.95, respectively. It indicated that our target capture panel, containing 93,200 SNP sites, could represent the whole-genome copy number variations, and thus could generate the accurate HRD scores (Fig. 5; Table 1).

The relationship between HRD score and BRCA1/2 deficiency status in clinical samples
Deficiency of BRCA1/2 via biallelic mutations and somatic hypermethylation (for BRCA1) gives rise to a deficiency status in homologous recombination repair, thus the HRD score and BRCA1/2 deficiency status of the 195 clinical samples were all analyzed. The distribution of scores was shown for BRCA1/2-deficient versus BRCA1/2intact samples in Fig. 6. Higher HRD scores were observed in BRCA1/2-deficient tumors, suggesting that these tumors had a tendency towards genomic instability, and the consistency was up to 95.2% if the biological threshold was set as 30. But for other indicators, LOH, TAI and LST, there was less distinction between BRCA1/2-deficient tumors and BRCA1/2-intact tumors. Therefore, the results indicated that HRD score, combing LOH, TAI and LST, was the optimal indicator.

The HRD score and HRR genetic landscape of cell lines and clinical samples
All the 17 ovarian and breast cancer cell lines were determined, 2 cell lines (11.76%) were mBRCA carriers (only including SNV and InDel variation, which are defined as likely pathogenic and pathogenic mutation carriers), and BRCA1/2 copy number loss was detected in 8 cell lines (47.06%) (Table 2; Fig. 7). The HRD scores distributed from -21.19 to 73.21, and the mean was 27.91, and the median was 14.42. Assuming that there is only one primary clone in the tumor samples, and amplification or loss only occur in one chromosome, the theoretic ploidy of diploid is between 1.95 and 2.04. Thus, only 1 cell line (A2780) was calculated as diploid, and other 15 cell lines (93.75%) were defined as aneuploidy ( Table 2). The BAF and CN mapping of HCC38 and ZR-75-30, which had the similar ploidy correction factor but the different HRD scores, were shown as Additional file 1: Fig. S6 and Fig. S7. Obviously, the genomic status of HCC38 cell line  mutated gene (82.35%), but there was no mutation in CDK12, CHEK2, FANCL, PALB2, PPP2R2A and RAD51D (Fig. 7). ZR-75-1 cell line was not detected any mutation in the above 16 genes and BRCA1 methylation, and the HRD score was 14.42. Notably, there was higher frequency in HRR pathway genes in high HRD score samples, except for IGR-OV1, which derived from ovarian endometrioid adenocarcinoma, which is a type I epithelial ovarian cancer (EOC) and have verified as a slow growing and indolent neoplasms. The HRD score of IGR-OV1 was calculated as -14.73 but with many mutations in multiple HRR pathway genes and multi-hit mutations in TP53 and BRCA2, and it indicated that this kind of ovarian cancer type might tend to genomic stability and various threshold should be set in different tumor type. Similar mutational patterns in 15 HRR genes were also analyzed with the 195 clinical samples, and 64.62% (126/195) samples were altered. The result showed that the mutation frequency of the HRR genes in the high HRD score cohort (72.48%) was higher than that in the low HRD score cohort (54.65%), indicating there was a higher genomic instability in the high HRD score cohort, but clinical efficacy data were still lacking (Additional file 1: Fig. S8).

Discussion
This study proposed a new algorithm model called GSA to detect genome aberrations based on high-throughput sequencing data, and this algorithm could accurately realize the segmentation of chromosomal regions, effectively calculate the tumor purity and tumor genome ploidy automatically and could be used for the statistical modeling of various genomic instability indicators. Herein, the segmentation result of chromosome 2 in one clinical sample as an example to compare the chromosome segmentation effects of GSA, PureCN and ASCAT algorithms through measured samples (Additional file 1: Fig. S9; Additional file 1: Table S5). The results showed that PureCN, ASCAT and GSA algorithms divided chromosome 2 into 4, 26 and 4 fragments respectively. First of all, the BAF and copy number of the 76 M ~ 170 M region were different from the adjacent segments. The BAF of this segment was concentrated around 0.5, indicating that the A/B gene ratio was closed to 1:1, and the average copy number was about 4 excluding the influence of some interference points. The result given by GSA was 4 copies, and the genotype of this fragment is AABB. But the result given by PureCN was 3 copies, which is inconsistent with BAF, and ASCAT algorithm divided the segment into 10 fragments, due to the interference of some noise points. Secondly, PureCN was disconnected at the positions of 21 M, 90-92 M, and 147 M, but there was no significant difference at the breakpoint position judging from BAF and copy number. ASCAT divides the 10 K ~ 76 M region into 7 fragments, which should belong to the same genotype inferring by the data of BAF and copy number. In addition, it should be pointed out that the GSA method has special processing for the chromosome centromere gap region. If the region distribution before and after the gap was the same, it would be merged. Overall, the GSA algorithm is more accurate for chromosome fragment segmentation. Aneuploidy is commonly observed in cancers, and the result showed that ploidy in tumor samples is characterized by a bimodal distribution in triploid and diploid which is basically consistent with the results reported in many literatures [7]. Then, three tumor cell line samples (HCC1143, HCC1428 and HCC38) and four FFPE samples with different tumor purity were used to predict tumor purity and ploidy using GSA, ASCAT and PureCN software, respectively. The results showed that the tumor purity predicted by GSA was closest to the pathological tumor purity compared to the other two software. ASCAT has an obvious limitation in predicting samples with lower tumor purity. Meanwhile, when the actual tumor purity was low, the tumor ploidy predicted by PureCN was varies greatly. But the tumor ploidy predicted by GSA could maintain good stability (Additional file 1: Table 6). Although the tumor purity calculated by GSA was highly consistent with the theoretical purity of the diluted tumor cell line sample, if the tumor cell content was less than 20%, GSA algorithm could not accurately predict the tumor purity and ploidy. Normal tissue contamination and duplication of the entire genome is the most common initiation Fig. 7 The waterfall map of 15 HRR pathway genes and TP53 that mutated in 17 cell lines. The annotation of mutation types was shown on the upper right with various colors. The samples were sorted by HRD scores event for aneuploidy during cancer progression [3,15,26]. Moreover, near-tetraploid tumors show increased TAI and LST scores, but the generation of new HRD-LOH events is theoretically less likely to occur as ploidy increases [7]. The GSA algorithm added the calculation of ploidy and tumor cell purity when detecting CNVs and deducted the ploidy when calculating the HRD score. The results from clinical samples indicated that the accuracy and sensitivity of the combined HRD score calculated by GSA algorithm both were well-behaved, the accuracy was 0.98 (comparing with Affymetrix OncoScan ™ assay) and the sensitivity was 95.2% (comparing with BRCA1/2 deficiency status).
To verify whether the HRD score results obtained by the GSA algorithm can be used to assist clinical treatment efficacy, more clinical efficacy data support is needed. Unfortunately, the clinical samples used in this study lack this part of information, but we can initially compare the score of cell lines with the research results of the previous literature. The HRD scores of HCC38 and MDA-MB-231 calculated by GSA were 73.21 and 8.90, respectively. Both of the two cell lines are Claudin-low breast cancer cells, and the BRCA status were wildtype. In the Anne Margriet Heijink study (2019), HCC38 (GR50 = 3.6 μM) was defined as cisplatin-sensitive TNBC cell lines, and MDA-MB-231 (GR50 = 61.0 μM) was defined as cisplatin-resistant TNBC cell lines [27]. Due to the cisplatin introduces both intra-and inter-strand DNA crosslinks (ICLs), which stall replication forks and are therefore especially toxic in proliferating cells, if the homologous recombination function of cell lines is deficient, the cell lines cannot repair the doublestrand breaks. Obviously, the HRD score result obtained by the GSA algorithm is consistent with the judgment of cell line drug sensitivity of the previous study [27]. With more and more application of PARP inhibitor in the clinic, we need data on the efficacy of PARP inhibitors to validate the accuracy of GSA algorithm in the real-world data.

Conclusions
This new algorithm, named as GSA, could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.