Identification and Utilization of Copy Number Information for Correcting Hi-C Contact Map of Cancer Cell Line

Motivation Hi-C and its variant techniques have been developed to capture the spatial organization of chromatin. Normalization of Hi-C contact maps is essential for accurate modeling and interpretation of genome-wide chromatin conformation. Most Hi-C correction methods are originally developed for normal cell lines and mainly target systematic biases. In contrast, cancer genomes carry multi-level copy number variations (CNVs). Copy number influences interaction frequency between genomic loci. Therefore, CNV-driven bias needs to be corrected for generating euploid-equivalent chromatin contact maps. Results We developed HiCNAtra framework that extracts read depth (RD) signal from Hi-C or 3C-seq reads to generate the high-resolution CNV profile and use this information to correct the contact map. We proposed the “entire restriction fragment” counting for better estimation of the RD signal and generation of CNV profiles. HiCNAtra integrates CNV information along with other systematic biases for explicitly correcting the interaction matrix using Poisson regression model. We demonstrated that RD estimation of HiCNAtra recapitulates the whole-genome sequencing (WGS)-derived coverage signal of the same cell line. Benchmarking against OneD method (only explicit method to target CNV bias) showed that HiCNAtra fared better in eliminating the impact of CNV on the contact maps. Availability and implementation HiCNAtra is an open source software implemented in MATLAB and is available at https://github.com/AISKhalil/HiCNAtra.


Introduction 1
Chromosome conformation capture (3C)-related methods are a collection of molecular 2 techniques to analyze the three-dimensional (3D) chromatin interactions inside the cell [1]. In 3 the last 15 years, the conventional 3C method [2] has been modified and combined with next-4 generation sequencing (NGS) to interrogate the chromatin interactions of genomic loci at 5 different length scales [1,3,4]. In 2009, Hi-C protocol was developed to study the genome-6 wide chromatin interactions [3]. Hi-C experiments confirmed the chromosome territory 7 hypothesis and established that our genome is hierarchically organized into A/B 8 compartments and topologically-associating domains (TADs) [3,5]. In addition, 3C-seq 9 technique has been developed in recent years as a simple and straight-forward experimental 10 protocol to study the genome-wide interactions of genomic loci. In this approach, a 11 conventional 3C library is prepared first and then the library is sonicated and subsequently 12 sequenced using NGS platform [4,6,7]. 13 Majority of available Hi-C datasets were generated using cancer cell lines. However, cancer 14 genomes are afflicted with large-scale and focal CNVs [8][9][10][11][12] that have pathological 15 consequences [13]. Regions with copy number gains or losses will show concomitant 16 changes in their interaction profiles [14][15][16]. This results in the rewiring of chromatin 17 connectivity that may lead to alterations in the long-range control of gene expression [16]. 18 CNVs can also modulate gene-regulatory mechanisms by altering the copy number of 19 regulatory elements or by modifying the higher-order chromatin structure [17]. Study of 20 prostate cancer cell lines showed that they have smaller-sized TADs with new boundaries. 21 These new TAD boundaries occur at CNVs [18]. Recently,Wu et. al. reported in multiple 22 myeloma (MM) cells with whole/partial chromosomal gains/losses that copy number (CN)-23 amplified regions have higher interaction frequencies than CN-neutral (normal) regions [15]. 24 They identified the TADs by insulation score method [6] using ICE-normalized interaction 25 matrices [19] and found that the numbers of TADs increased by 25% and the average size of 26 TADs decreases significantly compared to the normal B cells. This suggests that CNV is an 27 important source of bias on chromatin contact maps of cancer cell lines. Therefore, detection 28 and correction of CNV effects are necessary to produce euploid-equivalent contact maps of 29 cancer cells. 30 Hi-C sequencing reads can be utilized to estimate the read depth (RD) signal and discovery of 31 CNVs without additional cost of performing WGS. Recently, HiCnv tool has attempted to 32 types of reads. In case of Hi-C, valid pairs generally comprise ~40-65% of the total mapped 23 reads and these reads are solely used to generate contact map. In addition to valid pairs, 24 HiCNAtra also utilizes the non-informative reads to compute the RD signal at higher 25 resolution. On the other hand, in the absence of biotin labeling and pull-down steps, the 26 majority of 3C-seq reads are contributed by genomic reads (~70-80% of the double-sided 27 mapped reads). Therefore, for 3C-seq datasets, we compute the RD signal exclusively from 28 these genomic reads in a similar manner as the RD signal is usually computed from paired-29 end WGS reads. 30 Based on the Hi-C or 3C-seq experiment, reads are utilized differently for computing the RD 31 signal (default bin size = 5 Kb). For Hi-C datasets, HiCNAtra computes the RD signal from all Hi-C reads (Fig 1b(i)) using the "entire restriction fragment" counting approach. For this, 33 we first retain only those reads that are located within the restriction fragment-end windows -34 window of maximum molecule length (MML) next to the restriction site (Fig 1b(ii)). The 35 reason for targeting fragment-end window reads is that the biotin pull-down step of Hi-C 36 protocol should cause most of the reads to map next to the restriction sites. Second, we count 37 the reads for each restriction fragment based on the assumption that each continuous DNA 38 sequence read represents a particular restriction fragment and contribute to the abundance of 39 that fragment. Therefore, we count each non-informative read as a single count whereas for 40 valid pairs, side1 and side2 are counted separately (Fig 1b(iii)). So for a particular restriction 41 fragment, the fragment count is calculated as the sum of the number of reads located in both 42 fragment-end windows ( Fig. 1b(iv)). For example, the fragment count of restriction fragment 43 1 (rFrag1) is the sum of read counts of fragment-end windows, w2 and w3 ( Fig. 1b(iv)). 44 Finally, we assign the fragment count of a restriction fragment to all of its bases and compute 45 the RD signal. For 3C-seq datasets, genomic reads are distributed uniformly along the 46 genome and represent the majority of reads (Fig 1.1b(v)). Therefore, we use these paired-end 47 genomic reads (Fig 1.1b(vi)) to compute the RD signal in an unbiased manner (Fig.  48 1.1b(vii)). 49

CNV identification 50
The CNV calling module of HiCNAtra is based on CNAtra [28] approach. CNAtra CNV 51 caller constitutes the hierarchical framework to delineate the multi-level copy number 52 alterations in the cancer genomes. Briefly, we first define the CN reference (CN=2) by fitting 53 the RD signal to a multimodal distribution. Then we utilize a multi-step framework to first 54 identify large genomic segments with distinct CN state. Segments with CN state other than 2 55 (CN≠ 2) are considered as large-scale copy number variations (LCVs) or segmental 56 aneuploidies. Next, focal amplifications and deletions in each CN-defined segment are 57 detected based on coverage-based thresholding. HiCNAtra then computes the CNV tracks by 58 merging both LCVs and focal alterations (FAs) for Hi-C correction. 59

Correction of the chromatin contact map 60
We employed GLMs to normalize the contact frequencies against all sources of biases similar 61 to HiCNorm [26]. Two GLM with Poisson distribution are used to fit the cis and trans 62 contact maps separately. By default, we apply our normalization approach in a genome-wide 63 manner for better estimation of the GLM parameters. Let , , , and represent the GC content, mappability, effective length, CNV features 84 of bin in chromosome , respectively; whereas , , , and represent the features of 85 bin in chromosome , respectively. We assume that the interaction frequency follows 86 Then, is used to compute the normalized interaction frequency between bins and 93 from chromosomes and , respectively: 94

Utilization of all Hi-C reads for computing high-resolution RD signal 2
Data coverage (number of reads) is an important factor for detecting the CNVs, especially the 3 focal alterations. Although valid pairs are the main target of Hi-C/3C-seq protocols for 4 capturing the chromatin interactions, other non-informative reads can be additionally utilized 5 for estimation of the RD signal at a higher resolution. Based on the analyses of six Hi-C 6 datasets, valid pairs represent 39-64% of the total mapped reads (Supplementary Table 1). 7 Other reads, which represent a significant chunk (36-61%), are usually filtered out during Hi-8 C downstream analysis. However, these reads are still 'informative' and can potentially 9 contribute to the estimation of the genome-wide RD signal. Therefore, we computed RD 10 signal using both valid pairs and non-informative reads. Addition of these non-informative 11 reads increases the resolution of the RD signal by 22-44% for different Hi-C datasets 12 (Supplementary Table 1). The correctness of incorporating non-informative reads to estimate 13 the RD signal is evaluated by comparing "read counts per restriction fragment" from all Hi-C 14 reads versus valid pairs only. We found that read counts are highly correlated (Spearman 15 correlation 80-91%) (Supplementary Table 2) confirming the validity of using non-16 informative Hi-C reads in RD calculation which results in higher resolution of RD signal. 17 This conferred advantage to HiCNAtra over HiCnv and OneD which use only the valid pair 18 reads for estimating the RD signal. 19 On the other hand, 3C-seq technique, which combines 3C library with NGS sequencing step, 20 results in a higher percentage of genomic reads among the double-sided mapped reads (66% 21 for K562 and 77% for NCI-H69). Therefore, similar to WGS datasets, RD signals can be 22 conveniently computed from these genomic reads. Each paired-end genomic read is used to 23 calculate the fragment length (side1-start to side2-end). This results in computing RD signal 24 at high coverage (14.26x for K562 with only 117 million genomic reads and 12.37x for NCI-25 H69 with only 120 million reads). This demonstrates that genomic reads from the 3C-seq 26 datasets can be effectively used for estimating the RD signal in an unbiased manner without 27 additional cost of WGS. 28

Entire restriction fragment counting approach successfully extracts copy number-29 associated features of RD signal 30
Accurate estimation of the RD signal from Hi-C data is the primary key for identification of 31 copy number events. Generally, most Hi-C analysis tools [19][20][21][22][23][24][25][26][27] assign the Hi-C reads to 32 the midpoint of corresponding restriction fragment (midpoint approach). Alternatively, exact cut site of the read can be used (exact-cut approach) [30]. These approaches can be suitable 34 for contact frequency calculation since Hi-C matrices are usually generated using large bin 35 size. Usually, the bin size is at least one order of magnitude larger than the restriction 36 fragment length. However, smaller bins are essential for effective identification of change 37 points at high-resolution for discovery of both LCVs and FAs. Therefore, we introduced 38 "entire restriction fragment" counting approach for RD calculation (Fig. 1b). This allowed us 39 to compute the RD signal at high-resolution. We compared HiCNAtra RD calculation method 40 with exact-cut and midpoint approaches using MCF7 Hi-C data generated using HindIII 41 enzyme (Fig. 2a). We calculated the RD signal at 5-Kb bin using Hi-C reads which is of the 42 same order of magnitude as the experimental resolution (average 4096 bases for 6-bp cutter). 43 Interestingly, we found that our approach can effectively capture LCVs as evidenced by two 44 peaks in the RD signal distribution (Fig 2a right, bottom panel). In addition, FAs are 45 conspicuous using this approach viz. FA1 and FA2 ( Figure.  For a more quantitative comparison, we computed the centralization score (CS) and zero 49 score (ZS) for all the six Hi-C datasets. CS is computed as the percentage of bins that are near 50 their integer CN states. CS can be used as a measure for distributing genomic loci into 51 distinct CN states. We then defined ZS as the percentage of bins with CN=0. ZS can be used 52 as a measure of the sparseness of the RD signal. We found that "entire fragment" counting 53 approach achieved the maximum CS (for both 5-Kb and 100-Kb bin) and minimum ZS in all 54 Hi-C datasets (Fig. 2b, Supplementary Table 3). This is also confirmed by the visual 55 inspection of coverage plot and RD frequency distribution derived from these three 56 approaches (Fig. 2a). High CS implies that RD signal is distributed near distinct CN states 57 which aids in the identification of CNVs. Low ZS is essential to avoid false detection of 58 deleted regions, especially in large restriction fragments. 59 Next, we validated that the "entire restriction fragment" counting approach extract the RD 60 signal correctly from Hi-C reads. We compared the RD signal derived from Hi-C datasets of 61 MCF7 and LNCaP with the WGS-derived coverage signal of same cell lines. First, we found 62 that HiCNAtra RD signal has the highest correlation with the WGS-derived RD signal for 63 both MCF7 (Spearman's ρ =0.714) and LNCaP (Spearman's ρ =0.55) cells (Supplementary 64 Table 3). In comparison, exact-cut and midpoint approaches have the Spearman's ρ of 0.32 65 and 0.31 for MCF7, and 0.26 and 0.24 for LNCaP cells. This confirmed that the HiCNAtra counting approach is the best estimate of the RD signal. Similarly, we computed the RD 67 signal (5-Kb bin) from genomic reads of K562 3C-seq dataset. As expected, RD signal 68 extracted from K562 3C-seq data correlated well with WGS-derived coverage signal of K562 69 cells (Spearman's ρ = 0.573). This confirmed that the RD signal can be reproducibly derived 70 from Hi-C/3C-seq datasets (visually illustrated in Fig. 2c and Supplementary Fig. S2). 71

HiCNAtra can extract high-resolution copy number events from Hi-C/3C-seq 72 datasets 73
Successful cloning of the WGS-derived RD signal from Hi-C/3C-seq reads enables 74 HiCNAtra to accurately detect CNV regions. To the best of our knowledge, HiCnv [20] is the 75 only tool available to detect CNV from Hi-C data. However, HiCnv computes the RD signal 76 using midpoint approach exclusively from the valid pairs. As stated before, valid pairs 77 represent 39-64% of total mapped-Hi-C reads. Therefore, HiCnv sacrifices a large percentage 78 of reads which can be effectively used for calculating the RD signal based on HiCNAtra 79 approach. Moreover, the midpoint approach limits the power of HiCnv to detect only large 80 CNVs (size > 1Mb) [20]. As an alternative solution, we adapted the hierarchical CNV 81 detection approach of CNAtra [28] in HiCNAtra CNV caller module to identify both LCVs 82 and FAs. 83 We analyzed the CNV profiles of all six Hi-C and two 3C-seq datasets using HiCNAtra 84 (Supplementary Table 4 Supplementary Fig. S4). Next, CNV track is generated by integrating LCVs and FAs. This 95 track is used for correcting the contact frequencies of Hi-C/3C-seq datasets. 96

HiCNAtra successfully ameliorates the effects of CNVs on the contact map
(HiCNAtra, OneD and OneD+CN). Then, we computed the Spearman correlations between 130 the biases and corrected contact frequencies. Overall, our results showed that HiCNAtra-131 corrected interaction frequencies achieved the minimum correlation with CNV tracks 132 (median Spearman's ρ = 0.08) as well as with other systematic biases across cell lines (Fig.  133 3a, Supplementary Fig. S5). The OneD+CN-corrected contact frequencies showed the 134 second-lowest correlation with CNV tracks (mean Spearman's ρ = 0.124) and systematic 135 biases (Fig. 3a, Supplementary Fig. S5). This confirms the advantage of HiCNAtra's 136 simultaneous bias correction method over OneD+CN's sequential bias correction. 137 Visualization and manual inspection of contact heatmaps is a simple way to qualitatively 138 ascertain the effects of CNVs on interaction frequencies. Cancer cell lines, which typically 139 harbor CNVs, create visible artifacts on the raw heatmaps. For example, the amplified and 140 deleted regions in CNV track (Fig. 3b) of MCF7 chr1 show a concomitant increase or 141 decrease of contact frequencies in the raw (uncorrected) heatmap (Fig. 3c). Specifically, 142 amplified regions CNV1 and CNV2 (Fig. 3b) resulted in two off-diagonal block patterns 143 (artifacts) in the heatmap with signal higher than their surrounding regions (Fig. 3c). These 144 blocks refer to the overrepresented cis interactions between these CNV regions with other 145 genomic regions. We plotted bias-corrected heatmaps of the MCF7 chr1 normalized using 146 HiCNAtra, OneD and OneD+CN and compared them with the raw heatmap to visually 147 interpret their abilities to ameliorate CNV-induced artifacts. Visually, it can be noticed that 148 off-diagonal lines (contributed by CNV1 and CNV2) are removed in the HiCNAtra-149 normalized heatmap without distorting its overall features (Fig. 3d). As expected, OneD-150 normalized heatmap is almost similar to the raw heatmap in the absence of CNV bias 151 correction (Fig. 3e). Although OneD+CN-normalized heatmap showed improvement over the 152 raw heatmap, however, the footprint of off-diagonal blocks (artifacts) of weaker strength is 153 still visible after bias correction (Fig. 3f). We witnessed similar observations while inspecting 154 other regions of MCF7 as well as in a different cancer cell line ( Supplementary Fig. S6). This 155 confirms that CN information should be integrated with other systematic biases for effective 156 normalization of Hi-C/3C-seq contact frequencies of cancer cells. In case of normal cells, 157 which are largely devoid of CNVs, HiCNAtra approach of systematic bias correction of 158 contact maps is basically same as HiCNorm approach. Therefore, HiCNAtra approach can be 159 applied for both normal and cancer cell lines.

Discussion 1
Cancer genomes are scattered with large-scale CNVs and focal alterations. Amplification and 2 deletion events result in over-and under-representation of chromatin interactions respectively 3 in Hi-C/3C-seq datasets. Therefore, CNVs can greatly influence the interpretation of 4 chromatin contact maps and may lead to false identification of genomic regions with over-or 5 under-represented interaction strengths. The effect CNV bias should be compensated for the 6 amplified/deleted regions along with other systematic biases for accurate interpretation of 7 interaction frequency. 8 High-resolution extraction of RD signal from Hi-C/3C-seq datasets is central to discover 9 copy number events of smaller sizes. We propose two solutions to better estimate RD signal 10 from Hi-C datasets. First, HiCNAtra utilized all types of Hi-C reads that result in higher-11 resolution RD signal unlike other tools which use only valid pairs. Second, in order to better 12 demarcate the RD signal into distinct copy number state, HiCNAtra employs the "entire 13 restriction fragment" counting approach. Consequently, HiCNAtra can better extract the RD 14 signal from the Hi-C reads, which corroborated well with the WGS-derived RD profile. 15 Previous studies have demonstrated the adverse effects of CNV regions on TAD sizes and 16 structures in cancer cells. We observed a positive correlation between the copy number and 17 the strength of the interaction frequencies of genomic loci. This clearly suggests that CNV 18 imparts a dominant explicit bias in case of cancer cell lines. HiCNAtra identified the copy 19 number events and incorporated them with other systematic biases for simultaneously 20 correcting the contact map. Compared to OneD approach, HiCNAtra correction method is 21 better suited to normalize the contact maps of cancer cell lines as evidenced by visual 22 inspection of raw and corrected heatmaps. 23 In conclusion, our results suggest that HiCNAtra provides a better solution for a) computing 24 high-coverage RD signal and detecting large-scale and focal CNVs from Hi-C/3C-seq 25 datasets, and b) for explicitly correcting the chromatin contact frequencies from biases 26 introduced by CNVs.

Acknowledgment 2
We like to thank Ms. Yao Chen for help in analyzing 3C-seq datasets and fruitful discussions. 3 We also acknowledge Sanyal and Chattopadhyay lab members for their valuable comments. 4

Funding 5
This work was supported by Nanyang Technological University's Nanyang Assistant 6 Professorship grant and Singapore Ministry of Education Academic Research Fund Tier 1 7 grant to AS. AC was supported by Nanyang Technological University Start-up grant. 8 Author's contributions 9 AS, AC and AISK conceived the project. AISK developed HiCNAtra software with inputs 10 from AS and AC and performed all the analyses. SRMM prepared the 3C-seq libraries. AS, 11 AISK and AC analyzed the data and prepared the manuscript. All authors read and approved 12 the final manuscript. 13

Ethics approval and consent to participate 14
Not applicable. 15

Competing interests 16
The authors declare that they have no competing interests. Fig. 1 HiCNAtra methodology and RD signal computation. (a) Block diagram of HiCNAtra software. The HiCNAtra pipeline is divided into RD calculator, CNV caller, and contact-map normalization modules. (b) Cartoon illustrations of the RD signal computation method from Hi-C (left) and 3C-seq (right) data. For Hi-C, informative and non-informative (only dangling-end shown) reads (i) that were mapped within the restriction fragment-end windows (ii) are used for RD calculation. Then, number of reads is counted for each window (iii). Base counts are then estimated for each restriction fragment as the summation of the counts of its two fragment-end windows (iv). For 3C-seq libraries (v), genomic reads (vi) are exclusively used to compute the RD signal in an unbiased manner like WGS paired-end reads (vii).

Fig. 2
Entire restriction fragment approach is the best estimate of RD signal. (a) The coverage plot (left) at 5-Kb bin and the RD frequency distribution (right) of MCF7 Chr16:56-88Mb region extracted from Hi-C data using exact-cut (top), midpoint (middle), and HiCNAtra entire fragment (bottom) approaches. Each red dot represents copy number of bin. (b) Centralization score and zero score of the Hi-C RD signals using "entire restriction fragment" counting, exact-cut, and midpoint approaches using bin sizes of 5 Kb and 100 Kb. (c) The genome-wide coverage plot of MCF7 cancer cell computed from Hi-C data (top) and WGS data (bottom). WGS data is obtained from ChIP-seq input control sequencing reads. Each red dot is the RD value per bin from Hi-C data whereas each grey dot represents RD value per bin from the WGS data. between bins. It can be visually observed that amplified and deleted regions (Fig. 3b) show a concomitant increase or decrease of contact frequencies in the raw Hi-C heatmap (Fig. 3c).
For example, amplified regions CNV1 and CNV2 (Fig. 3b) resulted in two off-diagonal block patterns (visible artifacts) with signal higher than their surrounding regions (Fig. 3c).