Tumor samples and cell lines
Archival FFPE tumor tissues were obtained from 195 ovarian and breast cancer patients who had signed the informed consents, and the study was approved by the Institutional Review Board of BGI Co., Ltd. 17 human cancer cells (HCC38, HCC1428, HCC1143, HCC1806, MX-1, HCC70, ZR-75-1, MDA-MB-453, MDA-MB-231, MDA-MB-361, MDA-MB-415, ZR-75-30, HS-578T, IGR-OV1, A2780, NCI/ADR-RES and OVCAR-4) and 3 matched-wildtype cell lines (HCC38BL, HCC1143BL and HCC1428BL) were purchased from the CoBioer Biosciences Co., Ltd. All cell lines have been verified by STR and were provided in the form of DNA status.
Custom design HRD panel
Variant detection requires comprehensive consideration of the detection frequency and coverage of SNP sites, and relies on statistical analysis based on the relationship between adjacent points to determine the position of fragment breakpoints and eliminate test deviations. Therefore, the principles of probe design mainly include: (1) The target region should contain high-frequency SNP sites of the population; (2) Ensure the capture region has a certain density in the whole genome; (3) Ensure that the regions are as even as possible; (4)Ensure that the target probe is synthesizable (some region probes cannot be synthesized by the supplier due to complex structure or multiple alignments lead to poor specificity); (5)Ensure that the target regions have good capture efficiency and coverage, and no obvious regional preference. Based on the above principles, a total of 93,200 high-frequency SNP loci (frequency ≥ 5%) from the 1000 Genome Database were screened out, which are evenly distributed on each chromosome (except Y chromosomes and mitochondria).
Library preparation, hybridization capture and sequencing
DNA from FFPE tissues was extracted by QIAAMP DNA FFPE TISSUE KIT (Qiagen, Hilden, Germany) according to the manufacturer’s standard protocol. Briefly, 400 ng genomic DNA is fragmented and end-repaired, and a linker with a tag sequence is added to both ends of the DNA by ligase, followed by PCR amplification to form a pre-PCR library. The target DNA fragment in the library is hybridized with a combined probe containing 93,200 SNP sites and additional 2228 capture beds targeting the complete coding region of ATM, BRCA1, BRCA2, BRIP1, BARD1, CDK12, CHEK1, CHEK2, FANCL, PALB2, PPP2R2A, RAD51B, RAD51C, RAD51D, RAD54L and TP53, et.al. After purification, the enriched DNA is specifically captured and amplified by PCR to obtain a post-PCR library. The post-PCR library undergoes single-strand separation, circularization and rolling circle replication to generate DNA nano balls (DNB) and sequencing was performed with 2 × 101 bp paired-end reads on MGISEQ-2000 platform (MGI Tech Co., Ltd.).
Raw data quality control
Sequencing data needs to pass the basic standards of quality checks. Raw data quality control includes quality metrics for per-base sequence quality, sequence content, GC content and sequence length distribution, relative percentages of unmatched indices. Usually, the quality control parameters are set as Q30 ≥ 90%, and 40–50% GC content.
Data pre-processing
Raw paired-end reads were subjected to SOAPnuke (v2.0) processing to remove sequencing adapters and low-quality reads. High-quality reads were aligned to the reference human genome (GRCh37.p13), using the BWA sequence alignment software (0.7.17-r1188). PCR deduplication was performed using Picard. Average sequencing depths for tumors samples were ≥ 150× . For each sample, SNVs were called from BAM files using an in-house software, termed as Somatk. B allele frequency (BAF) and Log R ratio (LRR) were obtained from each capture region. BAF represents the median SNP genotype frequency of each capture region, and LRR represents the normalized depth ratio of the tumor and the normal sample (or blood cell control set) in each capture region after GC-bias correction.
TR segmentation and filtering algorithm
The Tree recursion (TR) Segmentation and Filtering Algorithm was developed by C++ . The input data format of the algorithm is (i) BAF data and (ii) LRR data. To reduce the noise in the input data, both BAF and LRR are preprocessed by a specially designed segmentation and filtering algorithm. First, if BAF ≥ 0.95 or BAF ≤ 0.05, defined as homozygous, the data would be removed from the BAF track because of its uselessness. Then the remaining BAF value is mirrored and flipped upward with 0.5 as the center, thus BAF =|BAF − 0.5|+ 0.5. For LRR, the bin LRR values are also first optionally filtered for outliers, defined as the total probability density is below the 30% quantile in all bins.
Next, the in-house TR segmentation algorithm, based on the calculation of the run-length, was used to roughly segment each chromosome, as shown in Additional file 1: Fig. S1. In this algorithm, the whole chromosome is taken as the root node, all the segmented sub-nodes are taken as the child nodes. The segmentation process can be simply described as the following steps:
-
(a)
Calculate the cumulative run-length of data (here refers to BAF and LRR) deviated from the mean (\(\sum \left( {{\text{x}} - {\overline{\text{x}}}} \right)\)) and select its maximum and minimum points as candidate breakpoints.
-
(b)
Make appropriate trade-offs of candidate breakpoints according to the location of breakpoints, length of segments, number of data points in segments, etc., that is, determine whether breakpoints should be recorded.
-
(c)
If none of the subfragment of the current fragment satisfies the record condition, a recursive judgment is initiated. Otherwise, it recursively slices its last child node.
-
(d)
After the termination condition is reached, recursion is carried out on horizontal child nodes.
-
(e)
If all child objects have been processed, the parent's level object will continue to be processed until it is finally traced back to the root node and no new child objects are created.
Then the fragments are merged in a cyclic manner. Firstly, for each segmented fragment of chromosome was traversed by the kernel density estimation, to find out the two fragments, which are closest to the same distribution and combine them. Secondly, the statistics of the newly merged fragment and its adjacent fragments are recalculated until all indicators meet the requirements. Besides, segmentation of BAF and LRR is carried out separately, and then the union set of the merged BAF and LRR segmentation list is taken, but the regions with too short or insufficient data points are iteratively removed.
Purity and ploidy estimation
BAF and LRR are expressed by a given genomic location as functions of the allele-specific copy numbers nA and nB, where nA denotes the number of copies of the A allele and nB denotes the number of copies of the B allele. Assuming tumor cell purity (p) was 1, BAF and LRR are calculated by:
$$\begin{aligned} BAF & = \frac{{n_{B} }}{{n_{A} + n_{B} }} = \frac{{n_{B} }}{CN} \\ LRR & = \log_{2} \left( {\frac{{n_{A} + n_{B} }}{2}} \right) \\ \end{aligned}$$
Considering the influence of nonaberrant cells in real world tumor samples and assuming that the nonaberrant cells have a total copy number of 2 for all loci, tumor ploidy correction factor (scale_factor), tumor purity (p), the measured CN (CN*) and the measured BAF (BAF*) of the FFPE samples satisfy the following relationship (Additional file 1: Table S1, Table S2).
$$n_{B} = CN \cdot BAF$$
(1)
$$CN^{*} \cdot scale\_factor = CN \cdot p + 2\left( {1 - p} \right)$$
(2)
$$BAF^{*} = \frac{{1 - p + pn_{B} }}{{CN^{*} \cdot scale\_factor}}$$
(3)
Based on Eqs. (1), (2) and (3), Tumor purity can be expressed as below.
$$purity = \frac{{1 - 2 \cdot BAF^{*} }}{{BAF^{*} \cdot CN - 2 \cdot BAF^{*} + 1 - BAF*CN}}$$
(4)
Based on the segmented chromosome fragments, the mean of all BAF value in the fragments, the percentage ranking of the BAF mean in all fragments, the theoretical CN (CN = 2LRR*2) and the percentage ranking of the CN in all fragments are calculated. Subsequently, using the density-based scan (DBSCAN) algorithm to perform density clustering on the BAF mean-CN percentage ranking data, the chromosome fragments of the same genotype are clustered into a cluster.
For fragments of the same genotype, the calculated purity value can be approximately regarded as conforming to a normal distribution with the theoretical mean value of purity, that is Eqs. (5).
$$\widehat{purity} = \frac{{1 - 2 \cdot \overline{{BAF^{*} }} }}{{\overline{{BAF^{*} }} \cdot CN - 2 \cdot \overline{{BAF^{*} }} + 1 - BAF*CN}}$$
(5)
Therefore, the tumor purity is calculated by clustering the chromosome fragments of the same genotype, and bringing the measured mean value of BAF*, theoretical BAF and CN values of the specific genotype cluster into the Eqs. (5).
In addition, the ploidy value of the entire genome of the sample is the weighted average of the copy number of each segment of the chromosome.
$$Ploidy = \frac{{\mathop \sum \nolimits_{i = 1}^{n} Segs_{{i^{*} }} CN_{i} + \left( {1 - \Sigma_{i = 1}^{n} Segs_{i} } \right)*2}}{n}$$
Segsi is the proportion of each segment on the reference genome, and CNi is the calculated copy number of the segment.
Calculation of LOH, TAI, and LST scores
HRD-LOH score was defined as the number of LOH regions longer than 15 Mb exceeding LOH regions which do not cover the whole chromosome. HRD-TAI score was defined as the number of regions with allelic imbalance that (a) extend to one of the subtelomeres, (b) do not cross the centromere and (c) are longer than 11 Mb. HRD-LST score is the number of break points between regions longer than 10 Mb after filtering out regions shorter than 3 Mb.
Aneuploidy is a common event in cancer patients, so more copy number variations will be detected by the high-throughput sequencing data. However, these copy number abnormalities may not be caused by the failure of homologous recombination repair, it will make the final HRD score calculation biased. Calculating accurately HRD scores depend on BAF and copy number, but the aneuploidy properties and various purity of tumor samples will affect the actual value of BAF and copy number. Thus, it is necessary to make appropriate correction, and the calculation formula of HRD score is preliminarily determined as follows:
$$HRD_{score} = LOH + TAI + LST - K*Ploidy$$
Here, K is the coefficient of correction, which is a constant. Besides, the whole analysis flowchart was shown as Additional file 1: Fig. S2. However, the constant depends on the type of cancer, sample type, target region size, and sequencing platform, et.al. This study screened 62 BRCA1/2-deficiency samples and 37 BRCA1/2 wildtype clinical samples of the 195 patients to explore the reasonable constant K. Finally, when the correction coefficient was set as 15.5, the AUC of the model is 88.3%, the sensitivity is 95.2%, and the threshold is 30 (Additional file 1: Supplementary Fig. S3).
BRCA1/2 and other HRR gene mutation analysis
Variants were named according to HGVS (Human Genome Variation Society; http://www.hgvs.org/). Point mutations, short InDels, copy number variants were identified from NGS data, and interpreted in accordance with the “Genetic Variation Annotation Standards and Guidelines” (2015 Edition) issued by the American College of Medical Genetics (ACMG) for germline mutation, and the “Cancer mutation interpretation of guidelines and standards (2017 Edition)” for somatic mutation, respectively. BRCA1/2 locus-specific loss of heterozygosity were analyzed as follows: (a) if the mutation frequency of the SNP on the control sample is between 35 and 65%, it is recorded as a heterozygous mutation; (b) if the mutation frequency of the SNP on the tumor sample is greater than 65% or less than 35%, it is recorded as a homozygous mutation; (c) if the mutation frequency of a SNP site meets the both conditions, the SNP site is marked as a LOH site, otherwise it is marked as a non-LOH site; (d) if the number of LOH sites on BRCA1/2 is greater than the number of non-LOH sites, it is considered that LOH has occurred in BRCA1/2.
BRCA1 promoter methylation quantitative PCR assays
DNA methylation-sensitive and methylation-dependent restriction enzymes were used to selectively digest unmethylated or methylated genomic DNA, respectively. Post-digest DNA was quantified by real-time PCR using a 344-bp PCR-generated primer that spanned BRCA1 exon 1. The relative concentrations of differentially methylated DNA are determined by comparing the amount of each digest with that of a mock digest. A cutoff of 10% was used to define samples as “methylated”.
The definition of BRCA1/2-deficiency
BRCA1/2-deficiency is defined as either (i) one deleterious mutation in BRCA1 or BRCA2, with LOH in the wild-type copy or (ii) two deleterious mutations in the same gene or (iii) promoter methylation of BRCA1 with LOH in the wild-type copy.
Affymetrix OncoScan™ assay
The Affymetrix OncoScan™ assay utilizes the Molecular Inversion Probe (MIP) assay technology for the detection of SNP genotyping, and has subsequently been used for identifying other types of genetic variation including focal insertions and deletions, large fragment CNV, LOH, and even somatic mutation. This assay has been shown over time to perform well with highly degraded DNA, such as that derived from FFPE- preserved tumor samples of various ages and with < 100 ng DNA of starting material, thus making the assay a natural choice in cancer clinical research. This assay captured the alleles of 217,611 SNPs and then the original CEL files were obtained by Affymetrix Genechip Scanner were converted to the OSCHP files by Chromosome Analysis Suite 3.0.
Statistical analysis
All statistical analysis was conducted using R version 3.6.1 (R Core Team, 2013) with an α of 0.05. The statistical tools employed in this study include Student's t-test and one-way ANOVA analysis of variance. All reported P values were two-sided. P < 0.05 was considered to be statistically significant. The Pearson correlation were used to evaluating the consistency of two different methods. The two-dimensional normal distribution function was used to remove outliers. Same distribution statistical test was used to compare the difference between adjacent fragments. DBSCAN density clustering algorithm was used to identify different genotypes.