HiCmapTools: a tool to access HiC contact maps

Background With the development of HiC technology, more and more HiC sequencing data have been produced. Although there are dozens of packages that can turn sequencing data into contact maps, there is no appropriate tool to query contact maps in order to extract biological information from HiC datasets. Results We present HiCmapTools, a tool for biologists to efficiently calculate and analyze HiC maps. The complete program provides multi-query modes and analysis tools. We have validated its utility on two real biological questions: TAD loop and TAD intra-density. Conclusions HiCmapTools supports seven access options so that biologists can quantify contact frequency of the interest sites. The tool has been implemented in C++ and R and is freely available at https://github.com/changlabtw/hicmaptools and documented at https://hicmaptools.readthedocs.io/. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04589-y.


Implementation
HiCmapTools is implemented in C+ + , which facilitates using common programming data structures and functions from the Standard Template Library (STL). Users input HiC maps in either .hic format generated by Juicer [8] or bin-contact pair files following hicpipe [11,12]. The input contact map is stored as a hash structure using pair bins as keys. The size of the bin is specified by the user (-in_hic_resol for .hic) or depends on the input file (bin-contact pair files). A query is binned into a corresponding key based on its position to facilitate efficient extraction of contact frequency via STL hash operation (O(1) for lookup). Also, we measure the significance of the extracted frequencies using permutation tests which rank the frequency among random samples. The usage of the query mode and random test are explained below.

Query mode
We use seven query modes to meet the needs of biologists, as illustrated in Fig. 1. The query input is expected to be in BED format, in which each line is considered as an individual query. Sample query files are available at https:// hicma ptools. readt hedocs. io/ en/ latest/ format. html# query-file.
1. bait: calculate average contacts from downstream to upstream (controlled by -ner_ bin) of a position of interest (white rectangle). For example, biologists can measure the average contact frequency around a PRE binding site. 2. local: list all contacts inside an interval (white cross). All contacts inside a gene body can be extracted by querying specific gene loci. 3. loop: contact frequency between two ends of a loop. As an example query, biologists can test whether gene looping exists [13] by calculating the contact frequency of its promoter with the transcription termination site. A gene of interest is listed as one row in a BED file. 4. pair: contacts between a pair of regions (contact between regions X and Y, white crosses). For instance, contact frequencies between a gene promoter and an enhancer are extracted by querying their positions. 5. sites: contacts between specific sites (contacts between three sites, including diagonal). As an example, given the list of chromatin insulator sites, HiCmapTools calculates all pairwise contacts among these sites, such that users can check whether any pair of binding sites interact with each other. 6. submap: sub contact map of regions of interest. The HiC map is stored efficiently by keeping only selected regions (i.e., a region containing long-range contacts between two loci such as the Drosophila Antp-C and the BX-C). 7. TAD: sum and average of contacts within specific TAD regions (white dashed square at the top right of Fig. 1). Biologists can quantify chromatin compaction within a TAD by measuring the average intra-TAD contact frequency. This might be used to compare different TADs of interest.

Permutation test
The biological meaning of the extracted contact frequency is assessed by the probability that a given frequency occurs by chance. We approximate the distribution of the null hypothesis by shuffling query positions. That is, we generate alternative queries by randomly shifting the query loci around the same chromosome multiple times (default = 100 times, controlled by the -random option). Then, we perform the same query mode for the alternative queries and calculate their contact frequencies.
Finally, the query's contact frequency is evaluated as its ranking among the sampled frequencies.  [14] bait query example Here, one is interested in whether there is local contact enrichment around specific loci, such as the Ubx gene in the BX-C locus. Then, we perform a -bait query where a 30k map is used, up/downstream is 150kb (= 30k x 5bins) with 100 permutation tests.
hicmaptools -in_map fly_30k.n_contact -in_bin fly_30k.cbins -bait Ubx.bed -near_bin 5 -random 100 -output Ubxt-bait.tsv Two outputs are produced: Ubxt-bait.tsv and Ubxt-bait_random_1.txt. The former contains the contact frequency of the query (for example "7237.85") and the average of the sampled frequencies (for example "6363.51") with comparison provided as ratio (for example "1.14") and rank (for example "top 19%"). The latter provides each sampled frequency, including that of the query (the second line), where the suffix "_1" indicates one query entry in the input file. Output details are available online at https:// hicma ptools. readt hedocs. io/ en/ latest/ format. html# output. We provide tools/ visualPermutationTest.R, a R script to visualize the query's output against the distribution of the random samples (Fig. 2).

Rscript
visualPermutationTest.R Ubxt-bait_random_1.txt UbxPermuTest.pdf Besides Drosophila, HiCmapTools could handle other species. We test whether there is local contact enrichment around the Tox gene in mouse T Cell. The HiC map, D0_mm10.hic, is available at https:// doi. org/ 10. 6084/ m9. figsh are. 18586 106 instead of GitHub due to the size limit.

Results
HiC data has shown chromosome structure to be formed by 3D folding of a higher-order unit, the topologically associating domain (TAD) [14][15][16]. In Drosophila, TADs can be further classified into four epiTADs-active, null, PcG, and HP1-based on ChIP-seq binding profiles [14]. Although a substantial proportion of TADs form loops, with loop bases frequently located at the TAD boundaries in mammals [17,18], few examples of TAD loop structure have been reported in the fly genome [19]. These loops generally involve Polycomb binding sites, but whether any preference exists regarding epiTADs has not been systematically addressed. We selected two related experiments to determine whether HiCmapTools can help biologists address these questions.

TAD loop
We calculate the contact frequency between two boundaries of the TADs via the loop query mode: hicmaptools -in_map fly.bimap -in_bin fly.bins -loop epiTAD. bed -output resLoop.tsv Interestingly, PcG TADs show strong loop frequency, whereas active TADs have the lowest frequency (Fig. 3, raw data in Additional file 2). Therefore, we assume that PcG TADs form a loop structure with high probability [20].

TAD intra-density
Furthermore, we check whether intra-TAD contact frequencies also show differences across epi-classes via the TAD query mode: hicmaptools -in_map fly.bimap -in_bin fly.bins -TAD epiTAD. bed -output resTADs.tsv HP1 TADs show a higher average intra-TAD contact density than the other three classes, whereas active TADs show the lowest contact density (Fig. 4, raw data in Additional file 3). This is in agreement with previous studies which reported that active chromatin domains present a weaker inside contact density than inactive domains, PcG, and null epiTADs [16].

Running time benchmark
Each query command generally completes within minutes. However, this varies based on the input scale, especially HiC maps. We conducted the following tests ( Table 2) based on different input map resolutions and file formats on a personal desktop (4 GHz CPU, 32 GB memory).
The most time-consuming step of HiCmapTools is parsing the map and turning it into a hash structure, especially for text format. The running time increases from 6.23 s to 345.62/344.43 s when the map size increases from 67 MB to 3.2 GB in text format. A text HiC map can be saved as a binary format via the genBinMap

Conclusion
We present a C+ + package that provides an efficient way to query HiC maps. HiC-mapTools supports seven access options so that biologists can quantify the contact frequency of the interest sites. Furthermore, the frequency probability is estimated based on a null hypothesis that shuffles the query position. Finally, the frequency is visualized as an output plot: a vertical line in the density plot of the random samples. The authors will continue to develop new functions for comparative HiCs to pursue HiC quantitative analysis.