 Methodology article
 Open access
 Published:
Deciphering hierarchical organization of topologically associated domains through changepoint testing
BMC Bioinformatics volume 22, Article number: 183 (2021)
Abstract
Background
The nucleus of eukaryotic cells spatially packages chromosomes into a hierarchical and distinct segregation that plays critical roles in maintaining transcription regulation. Highthroughput methods of chromosome conformation capture, such as HiC, have revealed topologically associating domains (TADs) that are defined by biased chromatin interactions within them.
Results
We introduce a novel method, HiCKey, to decipher hierarchical TAD structures in HiC data and compare them across samples. We first derive a generalized likelihoodratio (GLR) test for detecting changepoints in an interaction matrix that follows a negative binomial distribution or general mixture distribution. We then employ several optimal search strategies to decipher hierarchical TADs with p values calculated by the GLR test. Largescale validations of simulation data show that HiCKey has good precision in recalling known TADs and is robust against random collisions of chromatin interactions. By applying HiCKey to HiC data of seven human cell lines, we identified multiple layers of TAD organization among them, but the vast majority had no more than four layers. In particular, we found that TAD boundaries are significantly enriched in active chromosomal regions compared to repressed regions.
Conclusions
HiCKey is optimized for processing large matrices constructed from highresolution HiC experiments. The method and theoretical result of the GLR test provide a general framework for significance testing of similar experimental chromatin interaction data that may not fully follow negative binomial distributions but rather more general mixture distributions.
Background
The eukaryotic genome is hierarchically organized in the nucleus, exhibiting wellmaintained threedimensional (3D) structures for its cellular functions. DNA and associated proteins constitute chromatin units, among which interactions are not random but precisely regulate transcription and replication during the cell cycle [1,2,3]. For example, the interactions between enhancers and their distal targeted genes are essential for controlling gene expression strengths and tissuespecific expression patterns [4]. 3D chromosomal studies in prostate cancer, thalassemia, breast cancer and multiple myeloma have revealed that disordered interactions are closely related to gene dysregulation, contributing to the development of cancer and other genetic diseases [5,6,7]. Thus, estimating the 3D organization of chromosomes can provide important insight into not only the role of highorder chromatin compaction in gene regulation but also the way disordered chromatin interactions lead to diseases.
To systematically delineate chromatin interactions and 3D organization, novel experimental methods have been developed by employing highthroughput sequencing techniques. Chromosome conformation capture (3C) [8] and its highthroughput derivatives, such as ChIAPET [9], HiChIP [10] and HiC [11, 12], have granted researchers comprehensive information on chromatin interactions and hierarchical chromosomal organizations, including active or repressive compartments (A/B compartments) [13], topologically associated domains (TADs) [11, 14, 15], CTCF proteinmediated loops [9] and enhancerpromoter interactions [4]. In general, a chromosome can be divided into active or repressed compartments (A/B compartments) corresponding to higher or lower gene expression levels [13]. The analysis of highresolution HiC data has shown that chromosomes can be divided into functional units, called TADs, which are conserved across multiple human and mouse cell lines [11, 12, 16]. Furthermore, ChIAPET data of CTCF, Cohesin and RNA PolII have revealed fine spatial structures of CTCF loops and enhancerpromoter interactions [4, 9]. Compared with ChIAPET and other capturebased methods, HiC provides highresolution unbiased signals of chromatin interactions [12].
TADs can be considered isolated structures that partition chromosomes into discrete functional regions and thus restrict regulatory activities within them [3, 11, 14]. To detect TAD structures from HiC data, many computational methods have been proposed by calculating the insulation scores or defining significance values of TAD boundaries. However, most of them are tools for detecting nonhierarchical TADs, such as Armatus [17], TopDom [18], HiCSeg [19], InsulationScore [20], Arrowhead [12] and DomainCaller [11]. Since TADs were shown to be hierarchically organized [11, 12], the estimated nonhierarchical TADs cannot fully describe the biological hierarchy in cell systems. As shown in Fig. 1a, a \(\sim\,3 \,{\hbox {Mb}}\) region on chr1 of the GM12878 cell line clearly exhibits four layers of TADs with different interaction strengths. To overcome the limitations of nonhierarchical TAD finders, another type of method, such as TADtree [21], ICFinder [22], GMAP [23], Matryoshka [24] and 3DNetMod [25], has been proposed to find TADs and their nested subTAD organizations. Although these methods have given researchers new knowledge in understanding chromosomal organization, they still suffer from low precision or poor robustness against noise or high time consumption [26]. ICFinder [22] employs a constrained hierarchical clustering strategy that iteratively groups objects into a hierarchy of clusters. Although it was robust against noise, it requires high sequencing depth [26]. Another method, GMAP [23], utilizes a Gaussian mixture model to iteratively identify TADs but is limited to two levels of TADs. TADtree [21] finds the best TAD hierarchy via a dynamic programming algorithm that was tested to be time consuming for large size HiC matrices [26]. 3DNetMod [25] uses network modularity theory to hierarchically cluster TADs; however, it is sensitive to multiple parameter settings and is less robust against experimental noise. One major challenge in detecting TAD structures is the experimental noise that mainly comes from random ligation of chromosomal segments during the crosslinking step and the “genomic distance effect” in HiC experiments, reducing the consistency and prevalence of higherorder structures [11, 12, 27]. Another obstacle in HiC data analysis is how interaction frequencies are distributed. Negative binomial (NB) distribution is the most widely used assumption, but it cannot fully capture the characteristics of chromatin interactions [12, 16] since confounding factors of HiC experiments may transform the interaction frequencies into more complicated distributions (e.g., a mixture of unknown discrete distributions). Thus, there are urgent requirements for new TAD detection methods to precisely estimate chromosome structure.
Understanding dynamic changes in TADs is also an important topic in HiC data analysis since disordered TADs are linked with cellspecific gene expression regulation or different developmental conditions. For example, Sauerwald and Kingsford et al. confirmed that conservation and dynamics of TAD boundaries were associated with distinct biological conditions or chromosomal variations by comparing a large number of HiC experiments of cell lines or tissues [28, 29]. Several methods have been proposed for detecting boundary changes in TADs, including HiCcompare [30], localTADSim [29], HOMER [31], HiCDB [32] and TADCompare [33]. The major strategy of these methods is to first detect TADs separately and then compare two sets of TAD boundaries. However, these methods usually require specific data types and lack statistical rigorousness. HOMER [31] only outputs different TAD regions by overlapping two sets of TADs but does not provide significance testing for boundary differences. Another method, TADCompare [33], has arisen as a potentially useful tool for comparing TAD boundaries. This method proposes a new boundary score for differential boundary detection, timecourse analysis of boundary changes, and consensus boundary calling but is limited to five types of boundary changes. LocalTADSim [29] requires using Armatus software or manually formatting their inputs as Armatus output. HiCDB [32] uses a new metric named relative local insulation that is similar to insulation score, but it is biased to topranked insulation scores.
Based on the above observations, we propose a novel computational method, called HiCKey, to decipher the hierarchical organization of chromatin interactions in HiC data (Fig. 1b). We derived a generalized likelihoodratio test (GLR) for calling TAD boundaries (changepoints), which is a matrixvariant changepoint testing method in the literature. HiCKey can be applied to different interaction strength distributions. This is important for statistical analysis of HiC data, which is composed of biological interactions, random missing interactions and random ligation noise. Furthermore, the p values of a changepoint from different HiC matrices can be combined by Fisher’s method, providing a measure of whether a boundary is conserved across different samples (Fig. 1c). We demonstrated the performance and robustness of HiCKey using substantial validations of simulation studies. By applying HiCKey to seven human cell lines, we identified not only multiple layers of TAD organization in each cell line but also TAD structures consisting of different gene expression or histone modification signals. We found that TAD boundaries are significantly enriched in active chromosomal regions, indicating that fine TAD architectures are employed for precise gene transcription control. These results show the advantages of HiCKey in detecting TADs and provide novel biological discoveries revealing the association of chromosomal organization and gene regulation.
Methods
Modelling TADs organization
HiC experiments generate a symmetric n by n matrix \(\mathbf{X }=(x_{ij})\in {\mathbb{R}}^{n\times n}\), where n is the number of bins and \(x_{ij}\) is the frequency of chromatin interactions between a pair of genomic loci i and j. Assume there are K changepoints located at \(1 = \tau _0< \tau _1< \tau _2< \dots< \tau _K < n\). These changepoints divide all chromosomal bins into \(K+1\) nonoverlapping TADs, as shown in Fig. 2a. For changepoints \(\tau _a\), \(\tau _b\) and \(\tau _c\), the TAD between \(\tau _a\) and \(\tau _b\) is \(A_{a,b}\). The rectangle between \(A_{a,b}\) and \(A_{b,c}\) is \(R_{a,b,c}\) (R). We aim to detect (1) all the changepoints and (2) the hierarchical organization of these changepoints.
Previous HiC studies have revealed that withinTAD interactions are much stronger than crossTAD interactions [11, 12, 16], as shown in a HiC matrix of GM12878 (Fig. 1a). In general, the average interactions in neighbouring blocks, \(A_{0,1}\) and \(A_{1,2}\), can be different, but they are both stronger than the interactions in R (Fig. 2a). Additionally, the interaction strength decreases as the distance from the diagonal increases. These biological observations raise statistical insight that a changepoint, \(\tau _1\), will lead to significant distribution differences among \(A_{0,1}\), \(A_{1,2}\) and R.
In theoretical statistics, this problem is known as changepoint analysis. We note that various frequentist and Bayesian methods have been developed for multiple changepoint analyses of uni and multivariate data over the last few decades. In particular, Bayesian methods assume that multiple changepoints follow a stochastic process and solve the inference problem through Markov chain Monte Carlo (MCMC) simulations [34, 35]. Exact Bayesian inference approaches with efficient approximations were also developed for various problems [36,37,38]. The frequentist methods provide more technical tools, including dynamic programming algorithms to solve maximum likelihood estimation [39, 40], binary segmentation [41], information criteria model selection [42, 43], and penalized likelihood or cost function approaches [44, 45]. However, none of these methods can be directly applied to HiC data because they are matrixvariate. Simplifying them to multivariate vectors and feeding them to existing changepoint methods are not optimal. Furthermore, a general tool needs to be developed to address cases that are more complicated than some single distribution assumption (e.g., NB).
Generalized likelihoodratio test for changepoints
In Fig. 2a, the red block is a submatrix from \(\tau _0\) to \(\tau _2\) in matrix X. Its upper triangular part is denoted by A. Assume that all interaction reads are independent random variables from an NB family. If there exists a changepoint, we then have three sets of blockwise constant parameters. Otherwise, all the parameters will be the same. Specifically,
where \(\mu _k\) is the mean of the NB distribution and r is a nuisance parameter with a positive value. In A, we consider the hypothesis test \(H_0\): there is no changepoint against \(H_1\): there is one changepoint at unknown position \(\tau _1 = m (1< m < n)\), such that \(A=A_{0,1}\cup A_{1,2} \cup R\). Let \(S_A\), \(S_{A_k}\) and \(S_R\) be the sums of \(x_{ij}\) in the corresponding regions.
When the location of changepoint \(\tau _1 = m\) is known, the logarithm of the generalized likelihoodratio test statistic is
where \( \cdot \) is the cardinality of a set.
Theorem 1
The GLR statistic, \(GLR_{NB,m}\), is asymptotically equivalent to the following scan statistic if m/n holds constant as \(n \rightarrow \infty\)
where \(S_{A_1 \cup R} = S_{A_1} + S_R\), \(\sigma _0^2\) is the variance under the null hypothesis, which can be estimated.
The first term on the righthand side in (1) describes the difference between \(A_{0,1}\) and R, while the second term describes the difference between \(A_{0,1} \cup R\) and \(A_{1,2}\). Note that if the changepoint position is known, the partition ratio m/n of the matrix A is fixed. Hence, it is natural to assume that m/n holds constant as \(n \rightarrow \infty\). It is easy to see that the asymptotic distribution of \(Z_m\) is chisquare. However, in practice, m is unknown, so we need to define a new test statistic and study its asymptotic properties.
Definition 1
We define the following test statistic:
\(Z_m\) is calculated for each \((\xi < m \le n\xi )\), and we take the supremum, where \(\xi\) is the minimum size of a TAD.
The minimal TAD size is defined by default as the up integer of 100 kb/resolution. Here, 100 kb is estimated by the observed subTAD size in real biological datasets, since more than 95% of subTAD sizes are larger than 100 kb among multiple cell types of Rao’s HiC data [12]. For example, for the interaction matrix with 40 k resolution, 3 (round 100/40 up to 3) is set as the minimal TAD size. Moreover, the p value threshold and the minimal TAD size threshold are flexible in HiCKey software and can be changed by users for specific research goals.
As a result, we can eliminate the assumption on the original distribution. The above \({\widetilde{Z}}\) can be used to detect the change in means and is not limited to an NB distribution. Under the null hypothesis, we take the upper triangular part, A, as a discretetime randomwalk with a twodimensional time index. By Donsker’s invariance principle, we obtain the asymptotic distributions of \(Z_m\) and \({\widetilde{Z}}\) in the following theorem.
Theorem 2
Consider a Gaussian random field G(s, t), with location indices s and t, defined on the upper triangular part of a unit square \(B=\{ (s,t)  0\le s \le t \le 1 \}\). Assuming that \(m/n \rightarrow t \in (0, 1)\) as \(n \rightarrow \infty\), then the regions \(\frac{A_{0,1}}{\sqrt{n^2/2}}, \frac{A_{1,2}}{\sqrt{n^2/2}}\) and \(\frac{R}{\sqrt{n^2/2}}\) converge to regions \({\widetilde{A}}_1= \{ ({\widetilde{s}}, {\widetilde{t}})  0 \le {\widetilde{s}} \le {\widetilde{t}} \le t \}\), \({\widetilde{A}}_2=\{ ({\widetilde{s}}, {\widetilde{t}})  t \le {\widetilde{s}} \le {\widetilde{t}} \le 1 \}\), and \({\widetilde{R}}=B  {\widetilde{A}}_1  {\widetilde{A}}_2\), respectively (note that \(\frac{A}{\sqrt{n^2/2}} \rightarrow B\)). Correspondingly,
and if \(\xi /n \rightarrow \delta >0\),
Details of the Gaussian random field construction and proof are included in Additional file 1. This asymptotic property is extremely helpful in highresolution HiC data. Consider a TAD with a fixed chromosomal size (typically 1 Mb), where the higher the resolution is, the more reads TAD contains in the HiC matrix. In practice, Monte Carlo simulations can be used to obtain the asymptotic distribution from the above theorem. A histogram and a kernel density estimation are included in Additional file 1: Fig. S1. Since our GLR testing theoretically converges for different types of distributions, the parameters can be applied to different datasets.
Detecting hierarchical TADs
We propose an iterative algorithm to implement the GLR test in estimating hierarchical TADs. The first step is binary segmentation to identify all the changepoints (TAD boundaries). In the HiC matrix, we first find one changepoint that has the maximum \(Z_m\) in Eq. (2), resulting in two diagonal submatrices. Iteratively, one changepoint is found for each submatrix, until the submatrix has a size smaller than the lower bound, \(2\xi\). In the second step, we use a pruning process to test each changepoint in reverse order to which they are identified and to remove insignificant changepoints. A pvalue threshold, \(\alpha _0\), is needed for all the tests.
Although binary segmentation in a topdown strategy can provide hierarchical organization as divisive clustering or a decision tree, it may contain more false hierarchical structures. We use a bottomup procedure to merge neighbouring blocks and update the layer labels of boundaries (please see Fig. 2b for a demo example). More specifically, we recalculate the p value for each potential boundary outputted in the topdown step by testing its flanking blocks with the attached rectangle submatrix. The p values of all boundaries are ranked in descending order, and their layer labels are initialized as zero. In descending order, two neighbouring blocks are parallelly merged into one if their inner boundary p value is larger than a threshold \(\alpha _1=1e{}5\). The layer label of the inner boundary of merged blocks increases by one. In each iteration, the boundary p values between merged blocks are recalculated. The iteration continues until no remaining blocks satisfy the merge conditions. Here, \(\alpha _0\) and \(\alpha _1\) are used to control the number of potential TAD boundaries and the number of hierarchical branches, respectively. In HiCKey, \(\alpha _1=1e{}5\) was used by default, which can welldelineate local hierarchical structures in real data analysis. These two parameters can be reset by users for different TAD detection goals. If \(\alpha _1\) increases to \(\alpha _0\), more blocks are considered as individual TADs (less hierarchical). In contrast, if \(\alpha _1\) is smaller, more blocks are grouped into hierarchical structures.
Validating performance by simulation studies
HiC data have random interactions generated in random ligation of DNA segments [12, 16]. Polymer models show a decrease in the random interaction score as the distance between two loci increases. Because there is no true answer for TAD boundaries in real HiC datasets for validation, we first tested HiCKey on two simulation datasets that were originally created for assessing several computational methods by Forcato [46]. These datasets were simulated by a quasinegativebinomial generator modified from Lun [47], with each \(y_{ij}\) specifically designed to approximate real HiC data well. Using the same datasets facilitates the comparison between HiCKey and other methods. The specific datasets we used are as follows:

[(Sim1)] Matrices without nested TADs. It consists of 20 simulated HiC matrices with noise levels of 4%, 8%, 12% and 16%. These matrices contain no nested TAD structure, and each matrix has a size of approximately 4500 with 171 diagonal blocks.

[(Sim2)] Matrices with nested TADs. It consists of 20 simulated HiC matrices with noise levels of 4%, 8%, 12% and 16%. These matrices differ from Sim1 in that they contain nested TADs. In particular, each matrix has a size of approximately 4500 and contains 910 diagonal blocks with three layers of hierarchical structure.
A previous study [47] reported that the noise level, which they refer to as the biological coefficient of variation, varies between 0 and 16%.
The performance was evaluated by four measures. First, the true positive rate (TPR) was defined as the number of detected true boundaries divided by the number of total true boundaries. Second, the false discovery rate (FDR) was defined as the number of falsely detected boundaries divided by the total number of detected boundaries. Third, the difference between the estimated and true number of TAD boundaries was defined as \({\widehat{K}}K\). If there were several matrices in a simulation dataset, we calculated the average of all \({\widehat{K}}K\) of the matrices. Fourth, to evaluate the consistency between true hierarchical TAD structures and HiCKey TADs, we calculated the Fowlkes–Mallows index (\(B_k\)) [48], where k is the hierarchical level. Since there are three levels of hierarchical structures embedded in the Sim2 dataset, \(B_1\), \(B_2\) and \(B_3\) were calculated for each hierarchical level from the bottom layer \(B_1\) to the outer layer \(B_3\). It was noted that the \(B_k\) index lies between 0 and 1. If two partitions were perfectly matched, then \(B_k = 1\). For each noise level, we calculated the average score of \(B_k\) for 1000 random initializations. To obtain the Fowlkes–Mallows indices under the null hypothesis that the two clusterings are unrelated, we calculated control \(B_k\) between the true hierarchical structures and randomly relabelled HiCKey TADs (relabelling) by using the Fowlkes and Mallows formula [48].
Validating the robustness of HiCKey by simulation studies
To evaluate the robustness of HiCKey against different initial boundaries (changepoints), we performed validations on simulated datasets Sim1 and Sim2 and real datasets of hESC and IMR90 cell lines. For each dataset, we first ran HiCKey ordinarily and recorded the number of detected changepoints as well as their locations. Then, we ran HiCKey 1000 times with random selection of the first changepoint in the whole matrix. We evaluated the result consistency of randomly starting and the ordinary run by using the criteria TPR and \({\widehat{K}}K\).
To test the performance of HiCKey on datasets with different distributions, we generated simulation matrices whose entries followed Gaussian and NB distributions. We considered the following two scenarios for the HiC matrix of size \(500\times 500\):

[(Sim3)] Gaussian distribution. Let \(K=31\) (changepoint numbers), and their locations were uniformly drawn with the smallest block size that was larger than 4. We set the mean of each element, \(u_{ij}\), as \(u_{ij} = \mu _k \sim\) Gamma(4, 18) for \((i, j) \in A_{k1,k}\), \(k=1, \dots , 31\), and \(\mu _0=0\) for \((i, j) \in A  \cup _{k=1}^{31} A_{k1,k}\), where the numbers were estimated by real HiC data. The values of \(x_{ij}\) were generated by
$$\begin{aligned} x_{ij} \sim \left\{ \begin{array}{ll} N(u_{ij}, \sigma ^2) &{} \quad {\text{ for }}\, (i, j) \in A_{k1,k}, \\ \max \{ N(0, \sigma ^2), 0 \} &{}\quad {\text{ o.w. }}\, \end{array} \right. \end{aligned}$$ 
[(Sim4)] Poisson and negative binomial distributions. Let \(K=31\). Changepoint locations and element means \(\mu _k\) were similarly generated as (Sim3). Furthermore, \(x_{ij}\) was generated by an NB model [47]
$$\begin{aligned} x_{ij} \sim \left\{ \begin{array}{ll} NB( \nu ^{1}, (1+\nu \mu _k)^{1}) &{} \quad {\text{ for }}\, (i, j) \in A_{k1,k}, \\ 0.5 \mathbf{1 }_{\{0\}} + 0.5 NB( \nu ^{1}, (1+\nu \cdot \min _{k} \{ \mu _k\})^{1}) &{} \quad {\text{ o.w. }} \end{array} \right. \end{aligned}$$Note that \(x_{ij}\) in the complementary region \(A  \cup _{k=1}^{K+1} A_{k1,k}\) follows a mixture of point mass and NB distribution.
\(NB( \nu ^{1}, (1+\nu \mu _k)^{1})\) provides an NB distribution with mean \(\mu _k\) and variance \(\mu _k + \nu \mu _k^2\). The parameter \(\sqrt{v}\), referred to as the biological coefficient of variation (BCV) [47], varies from 0 to 16%. Hence, we set \(\sqrt{\nu } = 0, 0.05, 0.10, 0.15\) in (Sim4). Note that \(\nu =0\) corresponds to the case in which \(x_{ij}\) follows a Poisson distribution with mean \(\mu _k\). In (Sim3), \(\sigma ^2\) was specified as \(\sigma ^2 \approx 72+72^2 \cdot \nu\), where 72 was the mean of Gamma(4, 18).
HiC datasets for real data case studies
Highresolution in situ HiC data of seven cell lines produced by Rao [12] were downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) with the accession number GSE63525. We applied HiCKey on 25 kb resolution HiC data for all seven cell lines, which included GM12878, HMECs, HUVECs, IMR90 cells, K562 cells, KBM7 cells, and NHEKs. In addition, we downloaded their predicted TADs using the Arrowhead method [12] and denoted them by RaoTAD in the rest of the paper. We also downloaded early HiC data generated by Dixon [11] for two human cell lines, HESC and IMR90, whose resolutions are 40 kb. Histone modifications and TF binding peaks were obtained from ENCODE and the Roadmap epigenomics project using the WashU genome browser [49].
Comparing TADs across samples
Since HiCKey outputs p values of TAD boundaries, they can be extended to compare boundary differences across cell lines. For a boundary, m, we calculated its p values, \(p_1(m)\) and \(p_2(m)\), in two different samples of HiC matrices. Assuming two HiC experiments are independent, we used Fisher’s method [50] to combine two p values into one test statistic \(\chi _4^2:= 2\ln (p_1(m))2\ln (p_2(m))\). Here, \(\chi _4^2\) follows a chisquared distribution of 4 degrees of freedom, and its p value is denoted by \(p_f\). The p value \(p_f(m)\) will decrease if \(p_1(m)\) and/or \(p_2(m)\) decrease.
Memory and running time optimization
Recent in situ HiC experiments have generated datasets with a resolution as high as 1 kb, resulting in large matrices. Therefore, it is essential to optimize memory usage and running time. Some HiC data processing pipelines, such as HiCPro [51], place much emphasis on memory efficiency. Many existing TAD detection methods use a matrix as input; however, this is infeasible for highresolution data. For example, storing a single HiC matrix of 5 kb resolution under double precision might require 18 G memory. We used several strategies for computing resource optimization.
First, highresolution HiC matrices are sparse, as most elements are zero. Our program can read both matrix or list forms (Rao’s data consist of nonzero elements and their indices [12]). HiCKey stores only nonzero elements in the upper triangular part of the HiC matrix. Second, the topdown binary segmentation is the most timeconsuming step of HiCKey. To calculate all \(Z_m\) in \(\mathbf{X }=(x_{ij}) \in {\mathbb{R}}^{n\times n}\), we first calculate the sums of every row and column. As m moves from 1 to n, only linear arithmetic operations are needed.
Second, for the best case in which each changepoint is allocated in the middle of matrix A, we need at most \(log_2 n\) iterations with operations \(O(n^2)\). In the worst case, each iteration generates one submatrix as small as possible and the other as large as possible. This results in at most \(n/\xi\) iterations with operations \(O(n^3)\) (see detailed calculations in Additional file 1). All the tests were conducted on a regular laptop with an Intel(R) Core(TM) i57200U CPU with a 2.50 GHz processor and 12 GB memory.
Results
Performance of HiCKey in detecting TADs
To understand the performance of HiCKey in detecting TADs, we first tested it on largescale simulated HiC matrices [46], which included two types of data, Sim1 without nested TAD structures and Sim2 with nested TAD structures. For both Sim1 and Sim2, four noise levels of 4%, 8%, 12% and 16% were added to test the robustness of HiCKey against the random collision noise of interactions. Testing at different noise levels is critical since large numbers of random collision interactions are observed in HiC data [12, 16]. First, on dataset Sim1, HiCKey achieved a high TPR of 0.9988 under a 4% noise level (Fig. 3. Additional file 1: Table S1). As the noise level increased from 4 to 16% (fourfold change), the TPR decreased to 0.9459 (0.947fold change). The fold change ratio of TPR and noise was 0.24 (0.947/4), indicating that the TPR of HiCKey was robust against noise changes. When the noise level increased from 4 to 8% and 12%, the FDR slightly increased from 0 to 0.0173 and 0.1176, respectively. Additionally, the FDR increased to 0.3618 at the 16% noise level. We also compared the number of TADs estimated by HiCKey with the true value. We found that HiCKey produced a very accurate average number of TADs at the 4% noise level (\({\widehat{K}}K=\,0.2\)). As the noise level increased, the estimated number of TADs increased (\({\widehat{K}}K\) as 1.8, 19.4 and 82.6 for noise levels 8%, 12% and 16%, respectively). We then tested HiCKey on dataset Sim2. At a 4% noise level, HiCKey achieved 0.845 TPR and 0.059 FDR (Additional file 1: Table S2). When the noise level increased to 8% and 12%, the TPR decreased to 0.757 and 0.6568, respectively. We noticed that TPR and noise level were linearly correlated (\(R^{2}=0.9926\)), demonstrating that HiCKey can remain stable with these noise changes. In summary, these validation results at different noise levels suggest that HiCKey is robust against random collision noise, especially for noise levels ranging from 0 to 12%.
We calculated the Fowlkes–Mallows index \(B_k (k=1,2,3)\) [48] to evaluate the consistency between true hierarchical TAD structures and HiCKey TADs, as there are three layers of hierarchical structures embedded in the Sim2 dataset. The \(B_k\) index lies between 0 and 1, where larger index scores indicate higher similarities among two compared hierarchical structures. Overall, we found that the \(B_k\) indices were larger than 0.8196 for all three hierarchical layers at the 4% noise level (Additional file 1: Table S3). When noise levels increased to 16%, they decreased slightly but maintained fair scores all larger than 0.6887. At four noise levels, the control Fowlkes–Mallows indices were no more than 0.01 after recalculating \(B_k\) between the true hierarchical structures and randomly relabelled HiCKey TADs (Additional file 1: Table S3). We also found that the \(B_2\) and \(B_3\) indices were more stable than \(B_1\) when noise levels increased, suggesting that HiCKey is more robust against noise in detecting second and third levels of TAD structures.
We examined practical running time and memory usage on a real HiC matrix of chr1 (the largest among 23 chromosomes) in the GM12878 cell line [12]. Under resolutions of 50 kb, 25 kb, 10 kb and 5 kb, the running times of HiCKey were 23 s, 53 s, 106 s and 157 s, respectively. Additionally, the practical running time was approximately \(O(n^2)\) (Additional file 1: Fig. S2a). Its memory usage was 170 Mb, 389 Mb, 768 Mb and 980 Mb, respectively, and was also approximately \(O(n^2)\) (Additional file 1: Fig. S2b). Taken together, HiCKey requires reasonable computing resources in processing highresolution HiC matrices.
Robustness against different distributions and initialization
In HiC data analysis, the read counts of interactions were usually assumed to follow an NB distribution [12, 52, 53] or Poisson distribution [54] or normalized data [30]. However, these distribution models cannot fully capture the characteristics of chromatin interactions in HiC experiments due to the divergent confounding factors observed in real biological systems, which results in a complicated mixed model [8, 27, 46]. In HiCKey, we derived a GLR test that can be broadly used for multiple distributions but is not limited to the NB distribution. To test the performance of HiCKey on different distributions, we simulated 1000 interaction matrices with normal and NB distributions (see details in the Methods section). At four different noise levels, we found that the TPRs were all larger than 0.99, while the FDRs were less than 0.0055 (Additional file 1: Table S4), suggesting that HiCKey is robust against different distributions.
To test whether HiCKey is sensitive to the initial choice of changepoint allocation, we constructed new validations by randomly selecting the initial location. We performed validations on simulation datasets Sim1 (Additional file 1: Table S5) and Sim2 (Additional file 1: Table S6) and real datasets of hESCs (Additional file 1: Table S7) and IMR90 cell lines (Additional file 1: Table S8). Regarding these validations, TPR rates were all larger than 90%, while the means of \({\hat{K}}K\) were very small, indicating that HiCKey is robust against the initial boundary selection.
Comparisons with other methods
We compared HiCKey with four popular methods, HiCSeg [19], TADtree [21], ICFinder [22] and 3DNetMod [25], on simulation datasets Sim1 and Sim2. For Sim1, HiCKey achieved not only higher TPRs at four noise levels but also slow declines (Fig. 3a and Additional file 1: Table S1). The FDR of HiCKey at 16% noise was smaller than those of ICFinder, TADtree and 3DNetMod (Fig. 3b). We found that HiCSeg tended to retrieve large TADs, resulting in fewer detected TAD boundaries. For example, HiCSeg’s \({\widehat{K}}K\) were \({}\) 2.2, \({}\) 29.2, \({}\) 99.2 and \({}\) 150.2 at noise levels of 4%, 8%, 12% and 16%, respectively, explaining why their FDRs were always lower but TPRs decreased sharply. TADtree and ICFinder outputted more TADs than the true value at the 4% noise level (\({\widehat{K}}K = 397.6\) and 207.4, respectively), but the number of falsely identified TADs decreased with increasing noise levels. 3DNetMod tended to output more TADs than the true numbers at four noise levels and thus had higher FDR rates. For Sim2, HiCKey achieved the highest TPRs at noise levels of 4% and 8% but slightly dropped below TADtree and ICFinder at noise levels of 12% and 16% (Fig. 3c and Additional file 1: Table S2). However, TADtree, 3DNetMod and ICFinder suffered from much higher FDRs (Fig. 3d). Taken together, HiCKey achieved good performance, especially for lower noise levels.
We specifically investigated two regions with hierarchical TADs in dataset Sim2 that were used for comparative analysis in a previous study [46] (Fig. 4). The results showed that HiCKey, TADtree and 3DNetMod can detect single and hierarchical TADs for both cases, while ICFinder and HiCSeg can only output bottom single TADs. At E1–E6, we found that HiCKey can correctly detect not only all the TAD boundaries but also their hierarchical organization, while TADtree and 3DNetMod made a few false predictions (E1, E3, E6) or missed outputting bottom TADs (E2, E4, E9). Region E7, consisting of complicated hierarchical structures, was a challenge for HiCKey and TADtree. Although HiCKey correctly detected all the boundaries in E7, it missed the outer layer. TADtree missed several boundaries and wrongly merged a neighbouring TAD into the block. 3DNetMod showed good hierarchical details at E7 and reported a large hierarchical TAD at E6, which seems to have no clear interaction blocks. Overall, these genomewide analyses and detailed examples showed that HiCKey has good performance in detecting TAD boundaries and their hierarchical organization.
Hierarchical architecture of chromosomal organization
We applied HiCKey to in situ HiC data of seven cell lines. HiCKey successfully outputted the boundary positions, p values and hierarchical levels. Following Weinreb and Raphael’s method [21], we defined the root TAD as order 1. If a root TAD had sublevel TADs, the two subTADs were of order 2. Similarly, subsubTADs were of order 3, and so on. This is the same as how we defined the hierarchical order of TADs in the Methods section. In total, we detected 8586, 8200, 8903, 9043, 8801, 6246 and 7726 TADs in the GM12878, hMEC, HUVEC, IMR90, K562, KBM7 and NHEK cell lines, respectively (Additional file 2). In each cell line, we compared our results with RaoTAD. First, we found that their allocations of TADs in the 23 chromosomes were similar (pvalues \(> 0.23\) in all seven cells, two sides chisquare test). Second, we considered that a RaoTAD boundary was matched if there was a HiCKey boundary located within its 2bin (50 kb) distance. The proportions of matches between RaoTADs and HiCKey TADs were 48.90%, 65.56%, 58.71%, 57.98%, 52.22%, 51.14% and 57.90% for GM12878, hMECs, HUVECs, IMR90, K562, KBM7 and NHEKs, respectively. Furthermore, we used the hypergeometric test (all the chromosomal bins were taken as population, boundaries of RaoTAD as successes, HiCKey boundaries as samples, and the matched ones as sampled successes) to measure how significantly HiCKey TAD boundaries matched with RaoTAD boundaries. On seven cell lines, the p values were all calculated as less than 1.0e\({}\)10, indicating that they were significantly matched.
Multiple levels of TADs were detected in each cell line. For example, Fig. 5 demonstrates our estimation of hierarchical TADs and RaoTAD in a local region of chr3: 10700000–11300000 of GM12878. Overall, HiCKey TAD estimations exhibited more hierarchical layers than RaoTADs. There was no RaoTAD within a large subregion (chromosome 3:10700000–11300000); however, clear blocks were observed from the HiC interaction heatmap. Extending the analysis to the genomewide level, we found that although the highest order of TADs can reach 7, most of them exhibited an order 1 or 2 (\(97.86\% \pm 0.71\%\), Additional file 1: Table S9), suggesting that hierarchical TADs are enriched in certain chromosomal regions.
Hierarchical organizations are enriched in active regions compared with repressive regions
To test whether biased hierarchical organizations at different chromosomal regions are related to different biological insights, we performed an integrative analysis of TAD structures and epigenetic markers. Previous association analysis revealed that neighbouring TADs usually have different histone modification patterns [55] and that TAD boundaries are primarily associated with CTCF and Rad21 binding peaks [11, 12]. Here, we downloaded several histone signals and protein binding peaks for GM12878. We examined a \(\sim \,2 \,{\hbox {Mb}}\) region (chr3: 10700000–11300000, Fig. 5) that was partitioned into three parts, an active region flanked by two repressed regions. The active region was exhibited by biological signals, such as high RNAseq signals of genes, Pol2 binding peaks, and multiple histone modifications (e.g., H3K4me3 and H3K27ac). We observed that the active region contained more hierarchical TADs than the repressed regions (Fig. 5, HiCKey track).
We next examined genomewide TAD boundary enrichment in active chromosomal regions. To search genomewide, we used the active/repressive annotations of chromosomal regions for six cell lines [55] and compared the numbers and layers of boundaries between them. First, we confirmed that among all six cell lines, TAD boundaries were enriched in active regions compared with repressive regions (pvalue \(< 0.01\), onesided Fisher exact test, Additional file 1: Table S10). For instance, among the estimated 8200 active/repressive TAD boundaries in the IMR90 cell line, 5494 were located in active regions (64,220 bins of 40 kb), while only 2706 were in repressive regions (46,400 bins of 40 kb). Next, we checked the layer annotation of boundaries. By comparing the layer number distributions, we found that they were significantly different, as the average layer number in active regions was larger than that in repressive regions (pvalue \(< 0.01\), K–S test). Thus, these genomewide analysis results demonstrated that active chromosomal regions usually contain more TAD boundaries and richer TAD structures, indicating that active regions may employ more precise spatial organizations to regulate gene expression.
Detecting conserved and dynamic TAD boundaries between cells
Conserved and dynamic boundaries, as a result of cellspecific gene expression organization/regulation, different developmental conditions, or chromosomal variants in diseases, can be compared by HiC data of two different samples [7, 16]. Here, we examined the 8801 TAD boundaries of the K562 myelogenous leukaemia cell line and 9043 TAD boundaries of the IMR90 normal human fibroblast cell line detected by HiCKey. First, we found that 7286 boundaries were colocalized within a 2bin distance. These high proportions of matched boundaries in K562 (82.79%) and IMR90 (80.57%) cells are consistent with early observations that TADs are conserved among mammalian cells [11, 12, 16]. Second, among the colocalized boundaries, 7280 of them have Fisher’s combined p values \(p_f < 0.01\), indicating that they are conserved in both cell lines. We also detected 1621 K562 TAD boundaries changed in IMR90, and 1763 IMR90 TAD boundaries changed in the K562 cell line, providing potential candidates for TAD boundaries that may be involved in cellspecific regulation.
To demonstrate dynamic changes in TADs and their potential biological functions related to gene transcription, we investigated a large chromosomal region (\(\sim \,4.2 \,{\hbox {Mb}}\), chr2:38900000–43100000) of the K562 and IMR90 cell lines as representative. In this region, three large TADs were estimated in both K562 and IMR90, but a TAD (chr2:40000000–42050000) showed novel hierarchical subTAD structures in IMR90 but not in K562 (Fig. 6). Moreover, two sublevel TADs (chr2:40000000–41000000 and chr2:41000000–42050000) were observed in IMR90 with clear HiC interaction patterns. In contrast, their interactions were uniform in K562 cells. Furthermore, two smaller blocks (chr2:40000000–40700000 and chr2:40700000–41000000) were detected in the subTAD (chr2:40000000–41000000) of IMR90. The smaller block (chr2:40000000–40700000) included the proteincoding gene SLC8A1 and the lncRNA gene SLC8A1AS1. SLC8A1 (solute carrier family 8 member A1) is a proteincoding gene linked to multiple diseases, such as long Qt syndrome 9, cardiac diseases and aromatase deficiency [56,57,58]. We found that this small block was active in IMR90 but not in K562 by several signals. First, SLC8A1 was highly expressed in IMR90 cells but not in K562 cells. This was consistent with the novel binding signals of Pol2, H2A. Z, CTCF and Rd21 in IMR90. In addition, novel histone modifications were observed in IMR90 but not in K562. Another small block (chr2:40700000–41000000) was at the 5’ region of the SLC8A1 gene, but no genes were included. It also contained novel signals of CTCF and Rad21, suggesting that its generation may be mediated by CTCF and Rad21 to regulate SLC8A1 expression. Overall, these results demonstrate that HiCKey can detect not only TAD hierarchies but also their difference across samples. The hierarchical TAD structures in IMR90 further support our findings that active regions of chromosomes usually contain more and richer TAD structures for regulating gene expression.
Discussion
The identification of TADs and their hierarchical structures is extremely important in the study of chromatin interactions. We developed a novel GLR test to detect changepoints in HiC matrices and studied its asymptotic properties. Based on the GLR test, we introduced HiCKey to decipher the hierarchical structure of TADs. The performance of HiCKey is endorsed by extensive simulation and real data analysis. The retrieved hierarchical TADs are consistent with diverse biological signals, including histone modification and ChIPseq data. We further found much more detailed TAD structures in active chromosomal regions. Comparative analysis of TADs across various cell lines revealed that different TAD organizations harbour diseaserelated genes, providing insights into how disordered interactions are linked to different cancer types.
Considerations for different distributions of chromatin interactions
To explain chromatin interaction strength decay, polymer models propose that the average pairwise contact probability decreases and asymptotically follows a power law with a given contour distance [59,60,61]. NB is another popular model used not only in the simulation of HiC data but also in other types of interaction data, such as ChIAPET, CAPTUREseq and HiChIP. Although such methods can approximately explain the interacting decay along the offdiagonal of the HiC matrix, they are challenged by several confounding factors in real biological systems. First, most promoterenhancer interactions are usually regulated by divergent TF proteins or structural proteins, such as Pol2, GATA1/2, CTCF and Rad21. These deterministic factors have been evolutionarily fixed in different cell lines, thereby reducing the randomness of chromatin interactions that is captured in HiC experiments. In fact, most promoterenhancer interactions are observed over short distances (tens of kb) from gene promoters to their neighbouring enhancers. Second, different experimental factors, such as the crosslinking of chromatin, chromatin digestion, and streptavidin pulldown of biotinylated ligations, may also affect the signal/noise ratio of HiC data [62]. Existing distribution models, for example, the NB distribution, cannot fully capture the characteristics of chromatin interactions in HiC experiments [60, 63]. In HiCKey, we derived a GLR test that can be broadly used for multiple distributions and is not limited to NB. This could potentially be very useful for complex HiC or ChIPPET or HiChIP experiments in which interaction strengths may follow some unknown distribution.
Phase transition, loop extrusion and chromosomal hierarchies
Although HiC data provide a landscape of interacting strengths among chromosomes, mechanistic explanations of how chromatin interactions are dynamically formatted and regulated are lacking. At large scales (e.g., A/B compartments and TADs), it seems that interactions can be selforganized by levels of epigenetic modifications that are formalized as the phase transition in the neighbourhood of typical physiological conditions [61, 64, 65]. To further consolidate this phasetransition model, it will be highly desirable to manipulate in vivo histonetail modifications for comparative HiC analysis. At smaller scales and active chromosomal regions, the loop extrusion model is proposed to link the CTCF and cohesin proteins into the formation of local looping structures [66,67,68]. The loop extrusion model suggests that structural maintenance of chromosome proteins (cohesin or condensin) progressively extrudes chromatin until it is blocked by CTCF bound to properly oriented site pairs [69,70,71]. In our comparative studies of K562 and IMR90 cell lines, we found that CTCF and Rad21 binding sites were remarkably changed within the newly established TADs (Fig. 6), indicating that CTCFloop domains and enhancerpromoter interactions may be established via an extrusion process involving cohesin and CTCF. In summary, these observations and studies suggest that the phasetransition model and loop extrusion model take place at different scales and chromatin states.
Integrating other Omics data
With the hierarchical organization of chromatin interactions available, we can deeply investigate the biological functions or principles of how histone modifications are coordinately used, as well as how gene expression is dynamically regulated. Since histone modifications, TF binding sites, and gene expression have been collected for hundreds of cell lines in ENCODE and Roadmap Epigenomics projects, we integrated them for locusspecific and genomewide analysis. As shown in Fig. 5, we demonstrated that TADs estimated by HiCKey are consistent with histone modifications, including H3K27ac, H3Kme1/2/3 and many others. Gene expression from RNAseq and TF binding peaks (Pol2, CTCF and H2A. Z) also confirmed the active and repressive compartment of chromosomal regions. By using multiple omics signals, we further observed different signal patterns in a comparative analysis of K562 and IMR90 (Fig. 6). We found that the newly expressed genes at active regions in IMR90 were occupied by multiple active histone signals. The new binding peaks of CTCF and Rad21 suggest that these structural proteins may be involved in the local chromosomal conformation for SLC8A1 gene expression in IMR90. Both examples indicate that integrative analysis of multiple omics data and hierarchical organizations is a promising method to fully understand chromosomal compartments and functions.
Conclusions
In this work, we presented an efficient method, HiCKey, for detecting and comparing hierarchical TAD structures in HiC datasets. We especially derived a GLR test that worked for general distributions. The theoretical results of the GLR test can be used in similar experimental data (such as HiChIP, ChIAPET and Dropseq), whose signal may not fully follow the NB distribution but more general mixture distributions. HiCKey was evaluated by using large simulation data and real HiC data of mammalian cell lines. First, largescale validations on simulation data (with or without nested HiC structures) show that HiCKey has good precision in recalling known TADs and is robust against random collision noise of chromatin interactions. Second, HiCKey was successfully applied to in situ HiC data of seven human cell lines, and its predictions are supported by diverse epigenetic markers and exhibit novel biological discoveries. We concordantly identified multiple layers of TAD organization among these cell lines. In particular, TAD boundaries were found to be significantly enriched in active chromosomal regions compared to repressed regions. HiCKey was manipulated by C++ language for high operation speed. It accepts multiple input formats of the HiC matrix and is optimized for processing large matrices constructed from highresolution HiC experiments. With more HiC and similar experimental datasets available, we believe our method and theoretical framework will highly inspire computational biologists to design novel pipelines by using the GLR test to elucidate the hierarchical organization of locusspecific chromatin interactions in mammalian genomes or other types of deep sequencing data analysis.
Availability of data and materials
HiCKey is coded in C++. HiCKey is open source available in the GitHub repository (https://github.com/YingruWuGit/HiCKey).
Abbreviations
 TAD:

Topologically associating domain
 GLR:

Generalized likelihood ratio
 NB:

Negative binomial
 MCMC:

Markov chain Monte Carlo
 TPR:

True positive rate
 FDR:

False discovery rate
 GEO:

Gene expression omnibus
 ENCODE:

Encyclopaedia of DNA elements
References
Cavalli G, Misteli T. Functional implications of genome topology. Nat Struct Mol Biol. 2013;03(20):290–9.
Gibcus J, Dekker J. The hierarchy of the 3D genome. Mol Cell. 2013;03(49):773–82.
Bonev B, Cavalli G. Organization and function of the 3D genome. Nat Rev Genet. 2016;10(17):661–78.
Liu X, Chen Y, Zhang Y, Liu Y, Liu N, Botten G, et al. Multiplexed capture of spatial configuration and temporal dynamics of locusspecific 3D chromatin by biotinylated dCas9. Genome Biol. 2020;12:21.
Ramanand SG, Chen Y, Yuan J, Daescu K, Lambros M, Houlahan KE, et al. The landscape of RNA polymerase II associated chromatin interactions in prostate cancer. J Clin Invest. 2020;130:4.
Liu X, Zhang Y, Chen Y, Li M, Zhou F, Li K, et al. In situ capture of chromatin interactions by biotinylated dCas9. Cell. 2017;08(170):1028–43.
Spielmann M, Lupiáñez D, Mundlos S. Structural variation in the 3D genome. Nat Rev Genet. 2018;04:19.
Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002;295(5558):1306–11.
Tang YY, Holzel B, Posner M. The neuroscience of mindfulness meditation. Nat Rev Neurosci. 2015;03:16.
Mumbach MR, Rubin AJ, Flynn RA, Dai C, Khavari PA, Greenleaf WJ, et al. HiChIP: efficient and sensitive analysis of proteindirected genome architecture. Nat Methods. 2016;07(13):919–22.
Dixon J, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in Mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–80.
Rao SSP, Huntley MH, Durand N, Stamenova EK, Bochkov I, Robinson JT, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2015;159(7):1665–80.
Naumova N, Dekker J. Integrating onedimensional and threedimensional maps of genomes. J Cell Sci. 2010;06(123):1979–88.
Crane E, Bian Q, McCord R, Lajoie B, Wheeler B, Ralston E, et al. Condensindriven remodelling of X chromosome topology during dosage compensation. Nature. 2015;06:523.
Nora E, Lajoie B, Schulz E, Giorgetti L, Okamoto I, Servant N, et al. Spatial partitioning of the regulatory landscape of the Xinactivation centre. Nature. 2012;04(485):381–5.
Dixon J, Gorkin DU, Ren B. Chromatin domains: the unit of chromosome organization. Mol Cell. 2016;62(5):668–80.
Filippova D, Patro R, Duggal G, Kingsford C. Identification of alternative topological domains in chromatin. Algorithms Mol Biol AMB. 2014;9:14.
Shin H, Shi Y, Dai C, Tjong H, Gong K, Alber F, et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucl Acids Res. 2015;44:gkv1505.
LévyLeduc C, Delattre M, MaryHuard T, Robin S. Twodimensional segmentation for analyzing HiC data. Bioinformatics. 2014;30(17):i386–92.
Gong Y, Lazaris C, Sakellaropoulos T, Lozano A, Kambadur P, Ntziachristos P, et al. Stratification of TAD boundaries identified in reproducible HiC contact matrices reveals preferential insulation of superenhancers by strong boundaries. Nat Commun. 2017;11:141481.
Weinreb C, Raphael B. Identification of hierarchical chromatin domains. Bioinformatics. 2015;32(11):1601–9.
Haddad N, Vaillant C, Jost D. ICfinder: inferring robustly the hierarchical organization of chromatin folding. Nucl Acids Res. 2017;45(10):e81.
Yu W, He B, Tan K. Identifying topologically associating domains and subdomains by Gaussian Mixture model And Proportion test. Nat Commun. 2017;12:8.
Malik L, Patro R. Rich chromatin structure prediction from HiC data. IEEE/ACM Trans Comput Biol Bioinf. 2018;PP:1.
Norton H, Emerson D, Huang H, Kim J, Titus K, Gu S, et al. Detecting hierarchical genome folding with network modularity. Nat Methods. 2018;02:15.
Zufferey M, Tavernari D, Oricchio E, Ciriello G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 2018;19(217):1–18.
Yardimci G, Ozadam H, Sauria M, Ursu O, Yan KK, Yang T, et al. Measuring the reproducibility and quality of HiC data. Genome Biol. 2019;03:20.
Sauerwald N, Kingsford C. Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics. 2018;07(34):i475–83.
Sauerwald N, Singhal A, Kingsford C. Analysis of the structural variability of topologically associated domains as revealed by HiC. NAR Genom Bioinform. 2020;03:2.
Stansfield J, Cresswell K, Vladimirov V, Dozmorov M. HiCcompare: An Rpackage for joint normalization and comparison of HIC datasets. BMC Bioinformatics. 2018;12:19.
Heinz S, Texari L, Hayes MGB, Urbanowski M, Chang MW, Givarkes N, et al. Transcription elongation can affect genome 3D Structure. Cell. 2018;174(6):1522–36.
Chen F, Li G, Zhang M, Chen Y. HiCDB: a sensitive and robust method for detecting contact domain boundaries. Nucl Acids Res. 2018;09:46.
Cresswell K, Dozmorov M. TADCompare: an R package for differential and temporal analysis of topologically associated domains. Front Genet. 2020;03:11.
Liu J, Lawrence C. Bayesian inference on biopolymer models. Bioinformatics. 1999;15:38–52.
Wang J, Zivot E. A Bayesian time series model of multiple structural changes in level, trend, and variance. J Bus Econ Stat. 2000;18(3):374–86.
Lai TL, Xing H. Stochastic changepoint ARXGARCH models and their applications to econometric time series. Stat Sin. 2013;23(4):1573–94.
Xing H, Mo Y, Liao W, Zhang M. Genomewide localization of proteinDNA binding and histone modification by BCP with ChIPseq data. PLoS Comput Biol. 2012;8(7):e1002613.
Xing H, Sun N, Chen Y. Credit rating dynamics in the presence of unknown structural breaks. J Bank Finance. 2012;36(1):78–89.
Bai J, Perron P. Estimating and testing linear models with multiple structural changes. Econometrica. 1998;66(1):47–78.
Perron P, Qu Z. Estimating and testing multiple structural changes in multivariate regressions. Econometrica. 2007;02(75):459–502.
Matteson DS, James NA. A nonparametric approach for multiple change point analysis of multivariate data. J Am Stat Assoc. 2014;109(505):334–45.
Shen J, Zhang N. Changepoint model on nonhomogeneous Poisson processes with application in copy number profiling by nextgeneration DNA sequencing. Ann Appl Stat. 2012;6:476–96.
Zhang N, Siegmund D. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007;63:22–32.
Lavielle M. Using penalized contrasts for the changepoint problem. Sig Process. 2005;08(85):1501–10.
Harchaoui Z, LévyLeduc C. Multiple changepoint estimation with a total variation penalty. J Am Stat Assoc. 2010;105(492):1480–93.
Forcato M, Nicoletti C, Pal K, Livi C, Ferrari F, Bicciato S. Comparison of computational methods for HiC data analysis. Nat Methods. 2017;14:679–85.
Lun A, Smyth G. DiffHic: a bioconductor package to detect differential genomic interactions in HiC data. BMC Bioinform. 2015;16:258.
Fowlkes EB, Mallows CL. A method for comparing two hierarchical clusterings. J Am Stat Assoc. 1983;78(383):553–69.
Li D, Hsu S, Purushotham D, Sears RL, Wang T. WashU epigenome browser update 2019. Nucl Acides Res. 2019;07(47):158–65.
Fisher RA. Questions and answers #14. Am Stat. 1948;2(5):30–1.
Servant N, Varoquaux N, Lajoie B, Viara E, Chen CJ, Vert JP, et al. HiCPro: an optimized and flexible pipeline for HiC data processing. Genome Biol. 2015;12:16.
Carty M, Zamparo L, Sahin M, González A, Pelossof R, Elemento O, et al. An integrated model for detecting significant chromatin interactions from highresolution HiC data. Nat Commun. 2017;17(8):1–10.
Cook KB, Hristov BH, Roch KGL, Vert JP, Noble WS. Measuring significant changes in chromatin conformation with ACCOST. Nucl Acids Res. 2020;48(5):2303–11.
Djekidel MN, Chen Y, Zhang MQ. FIND: differential chromatin interactions detection using a spatial Poisson process. Genome Res. 2018;28(3):412–22.
Chen Y, Wang Y, Xuan Z, Chen M, Zhang M. De novo deciphering threedimensional chromatin interaction and topological domains by wavelet transformation of epigenetic profiles. Nucl Acids Res. 2016;44:gkw225.
Kepp K, Org E, Sõber S, Kelgo P, Viigimaa M, Veldre G, et al. Hypervariable intronic region in NCX1 is enriched in short insertion–deletion polymorphisms and showed association with cardiovascular traits. BMC Med Genet. 2010;01(11):15.
Kennedy R, Ovsyannikova I, Haralambieva I, Lambert N, Pankratz V, Poland G. Genomewide SNP associations with rubellaspecific cytokine responses in measlesmumpsrubella vaccine recipients. Immunogenetics. 2014;05:66.
Roberts D, Matsuda T, Bose R. Molecular and functional characterization of the human platelet Na\(^{+}\)/Ca\(^{2+}\) exchangers. Br J Pharmacol. 2011;07(165):922–36.
Esposito A, Annunziatella C, Bianco S, Chiariello A, Fiorillo L, Nicodemi M. Models of polymer physics for the architecture of the cell nucleus. Wiley Interdiscip Rev Syst Biol Med. 2018;12(11):e1444.
Nicodemi M, Pombo A. Models of chromosome structure. Curr Opin Cell Biol. 2014;05(28C):90–5.
Haddad N, Jost D, Vaillant C. Perspectives: using polymer modeling to understand the formation and function of nuclear compartments. Chromosome Res. 2017;01:25.
Belton JM, McCord R, Gibcus J, Naumova N, Zhan Y, Dekker J. HiC: a comprehensive technique to capture the conformation of genomes. Methods (San Diego). 2012;05:58.
Rottenberg H, Covian R, Trumpower B. Membrane potential greatly enhances superoxide generation by the cytochrome bc1 complex reconstituted into phospholipid vesicles. J Biol Chem. 2009;06(284):19203–10.
Lesage A, Dahirel V, Victor JM, Barbi M. Polymer coilglobule phase transition is a universal folding principle of Drosophila epigenetic domains. Epigenet Chromatin. 2019;12:12.
Khanna N, Zhang Y, Lucas J, Dudko O, Murre C. Chromosome dynamics near the sol–gel phase transition dictate the timing of remote genomic interactions. Nat Commun. 2019;12:10.
Nuebler J, Fudenberg G, Imakaev M, Abdennur N, Mirny L. Chromatin organization by an interplay of loop extrusion and compartmental segregation. Proc Natl Acad Sci. 2018;07(115):201717730.
Rowley M, Corces V. Organizational principles of 3D genome architecture. Nat Rev Genet. 2018;10:19.
Schwarzer W, Abdennur N, Goloborodko A, Pekowska A, Fudenberg G, LoeMie Y, et al. Two independent modes of chromatin organization revealed by Cohesin removal. Nature. 2017;551:09.
Sanborn A, Rao S, Huang SC, Durand N, Huntley M, Bochkov I, et al. Chromatin extrusion explains key features of loop and domain formation in wildtype and engineered genomes. Proc Natl Acad Sci USA. 2015;10:112.
Alipour E, Marko J. Selforganization of domain structures by DNAloopextruding enzymes. Nucl Acids Res. 2012;10:40.
Fudenberg G, Imakaev M, Lu C, Goloborodko A, Abdennur N, Mirny L. Formation of chromosomal domains by loop extrusion. Cell Rep. 2016;05:15.
Acknowledgements
We would like to thank Catherine Tang and Nakoa Kristen Webber for critical reading. We would like to thank Dr. Mattia Forcato for providing detailed information helping with the comparison of HiCKey with other methods. We thank the four anonymous reviewers for their insightful suggestions.
Funding
Publication costs are funded by Rowan University Startup Grant, 2019, (PI, Yong Chen). Moreover, research reported in this project was partially supported by the NIH Grant R01MH109616, the Cecil H. and Ida Green Endowment, the SKR&DPC Grant (2017YFA0505503) (PI, Michael Q. Zhang), and NSF DMS1612501 (PI, Haipeng Xing). The funders had no roles in the design of the study and collection, analysis, and interpretation of data or in writing the manuscript.
Author information
Authors and Affiliations
Contributions
HX, YC and MQZ initiated the concept and supervised the study. HX, YW and YC designed the methodology. YW and YC performed the data analysis. YW implemented the software. YC, YW and HX drafted and reviewed the paper. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 2
. Including the predicted TAD boundaries of nine cell lines.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xing, H., Wu, Y., Zhang, M.Q. et al. Deciphering hierarchical organization of topologically associated domains through changepoint testing. BMC Bioinformatics 22, 183 (2021). https://doi.org/10.1186/s12859021041138
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859021041138