 Methodology
 Open access
 Published:
Chromatin 3D structure reconstruction with consideration of adjacency relationship among genomic loci
BMC Bioinformatics volume 21, Article number: 272 (2020)
Abstract
Background
Chromatin 3D conformation plays important roles in regulating gene or protein functions. Highthroughout chromosome conformation capture (3C)based technologies, such as HiC, have been exploited to acquire the contact frequencies among genomic loci at genomescale. Various computational tools have been proposed to recover the underlying chromatin 3D structures from in situ HiC contact map data. As connected residuals in a polymer, neighboring genomic loci have intrinsic mutual dependencies in building a 3D conformation. However, current methods seldom take this feature into account.
Results
We present a method called ShNeigh, which combines the classical MDS technique with local dependence of neighboring loci modeled by a Gaussian formula, to infer the best 3D structure from noisy and incomplete contact frequency matrices. We validated ShNeigh by comparing it to two typical distancebased algorithms, ShRec3D and ChromSDE. The comparison results on simulated HiC dataset showed that, while keeping the highspeed nature of classical MDS, ShNeigh can recover the true structure better than ShRec3D and ChromSDE. Meanwhile, ShNeigh is more robust to data noise. On the publicly available human GM06990 HiC data, we demonstrated that the structures reconstructed by ShNeigh are more reproducible between different restriction enzymes than by ShRec3D and ChromSDE, especially at high resolutions manifested by sparse contact maps, which means ShNeigh is more robust to signal coverage.
Conclusions
Our method can recover stable structures in high noise and sparse signal settings. It can also reconstruct similar structures from HiC data obtained using different restriction enzymes. Therefore, our method provides a new direction for enhancing the reconstruction quality of chromatin 3D structures.
Background
Correct 3D organization of chromosomes plays important roles in maintaining chromosomal functions such as gene expression, epigenetic modification and timely copy and separation of chromosomes in mitosis. However, determining chromosomal 3D structures is still an unsettled issue currently. Traditional techniques such as fluorescence microscope and fluorescence in situ hybridization (FISH), usually have low resolution and can only probe a few of individual genome loci at one time. HiC [1], which is derived from Chromatin conformation capture (3C) and depth sequencing technique, provides a new promise for this problem. As a highresolution and highthroughout method of studying chromosomal 3D conformation, HiC can measure the contact frequency between genome loci pairs at the genomewide level. Inferring the 3D structure of the genome from the contact frequency matrix obtained by HiC has become an interesting research topic of bioinformatics since the occurrence of HiC.
However, reconstructing the 3D structures of chromosomes from the HiC data is not so straightforward but an optimization problem essentially. As in other applications, a standard optimization procedure requires clarifying two issues: the objective function to be minimized or maximized and the optimization algorithm. As for the objective function, one strategy is the distancebased formula. That is, this strategy first converts the contact frequency matrix into the spatial distance matrix and then minimizes the discrepancy between the distance matrix calculated from the predicted structure and that converted from the frequency matrix [2,3,4,5,6,7]. Two operations are prerequisite for this strategy: first, the frequency matrix is normalized to remove the biases related to the DNA sequence, among which GC content, sequence mappability and frequency of restriction sites are three most apparent bias resources [8]; second, the conversion factor that modulates the power law relationship between the frequency matrix and the distance matrix [1] is estimated through an additional optimization procedure [2]. Another strategy of selecting the objective function casts the problem of structure inference as a maximum likelihood problem by assuming the contact frequency between genome loci follows a Poisson distribution [9, 10]. HSA [11] constructs the likelihood by integrating multiple contact matrices generated from different enzymes. The advantage of this strategy is that, by modeling the effect of all the three data bias (i.e. GC content, sequence mappability and frequency of restriction sites) and the power law relationship between frequency and distance matrix with a generalized linear formula, all these effects can be absorbed into the final likelihood function. Thus, all parameters  the Cartesian coordinates of all genome loci, the coefficients describing the effect of data bias and the conversion factor parameter  can be derived simultaneously through a unified optimization procedure. Consequently, the normalization of the contact frequency matrix and the additional conversion factor inference procedure, which are requisite for the first strategy, are now unnecessary.
No matter which objective function above is adopted, the issue finally boils down to a nonlinear and large scale optimization problem, for which a simple local searching approach, such as Newton algorithm, is not suitable. Several global searching schemes have been proposed. ChromSDE [2] transforms the problem into a semidefinite programming (SDP) problem by embedding the original 3D Euclidean space into the Hilbert space of higher dimension. It can guarantee recovering the correct structure in the noisefree case. But for noisy input data a local optimization method is needed to refine the solution obtained from the SDP problem. PASTIS [9] uses IPOPT [12], a C++ package that implements an interior point filter algorithm for largescale nonlinear optimization, to maximize the Poisson likelihood. BACH and BACHMIX [10] apply Gibbs sampler with hybrid Monte Carlo to draw samples in the parameter space and output a collection of 3D chromosomal structures from the Bayesian posterior distribution. TADbit [13, 14] contains a module of chromosome 3D reconstruction that was developed around Integrative Modeling Platform (IMP, http://www.integrativemodeling.org), a general framework for restraintbased modeling of 3D biomolecular structures [15]. HSA [11] adopts simulated annealing combined with Hamiltonian dynamics to explore the chromatin comformal space. Different from BACH et al., MCMC5C [16] assumes the contact frequency is normally distributed and employs the Markov chain Monte Carlo (MCMC) with MetropolisHastings sampler [17] to sample from the posterior distribution. Same as BACH, MCMC4C outputs an ensemble of conformations. AutoChrom3D [3] selects LINGO (www.lindo.com/products/lingo), a commercial nonlinear constrained optimizer, to get the best chromatin structure. 3DMax [4] utilizes a stochastic gradient ascent algorithm to maximize the likelihood generated from the normal distribution. MOGEN [5, 6] and LorDG [7] maximized the objective function by using steepest gradient ascent with the backtracking line search algorithm.
The problem of inferring the coordinates of N objects in the 3D space from the distance information between them can be solved perfectly by the classical multidimensional scaling method (MDS) [18]. However, the distance matrix converted from the contact frequency matrix is not complete in that it contains many unknown entries generally, which makes the classical MDS method can not be utilized directly. This is just why various optimization approaches above mentioned were proposed. In order to avoid the timeconsuming optimization procedure, ShRec3D [19] cleverly designed a twostep algorithm. It first completes the distance matrix by using the concept of shortest path in graph theory (i.e. FloydWarshall algorithm), and then exerts the classical MDS to reconstruct 3D genome structures. It is orders of magnitude faster than the above optimizationbased methods. ShRec3D+ [20] corrects the conversion factor by a golden section search before carrying out ShRec3D. MDSGA [21] improves the shortest path distances using a genetic algorithm.
In the above we gave a rough introduction for some reconstruction methods. See [22] for a complete overview of the current state of the art 3D chromosome reconstruction. It should be noted that the positions of genomic loci in the 3D space are not irrelevant to each other. Genomic loci can be taken as a bunch of connected beads that comprise of a polymer. Two loci adjacent in the genome are surely close to each other in the 3D space. However, current methods seldom give consideration to this property of genomes. HSA [11] characterizes the adjacency relationship of neighboring loci by a Gaussian Markov chain to capture the local dependence of genomic loci. In the present work we extend the framework of classical MDS and provide a more flexible way to model the correlations between genomic loci of local proximity. Our algorithm, named ShNeigh, can significantly improve the performance of ShRec3D and simultaneously still runs far faster than the optimizationbased methods, such as ChromSDE.
Results
Simulated data study
We compared our ShNeigh with the existing methods ChromSDE [2], ShRec3D [19] and ShRec3D+ [20]. As for ChromSDE, the quadratic SDP algorithm is adopted. We first test these programs on the simulated helix structure dataset. Figure 1 shows the performance comparison for the programs under different measurements. We draw the mean result of 10 runs for each noise level to reduce the occasional fluctuation. The conversion factor is always assumed equal to 1 in ShRec3D, which is just the true value for our simulated data. Because ShRec3D+ merely adds a conversion factor estimation step upon ShRec3D and can not improve the performance of ShRec3D for the simulation scenario, it is not included in Fig. 1ab. As expected, when the noise level increases, SCC decreases and RMSD increases generally. The RMSD of ChromSDE starts from 0 at zero noise level, which coincides with the claim that ChromSDE can guarantee recovery of the true structure in the noisefree case. Unfortunately, the other three programs do not possess such a good feature. However, when the noise level get larger (> 0.25), the superior behavior of ShNeigh1 and ShNeigh2 begins to emerge, and their superiority enlarges compared to ChromSDE with the increasing noise level (Fig. 1a). ShNeigh1 and ShNeigh2 perform similarly and both significantly outperform ShRec3D, showing that inclusion of the neighboring dependency relationship can offer essential improvement against the underlying ShRec3D method. In summary, our ShNeigh algorithms are more robust and accurate than ChromSDE and ShRec3D, except for comparing to ChromSDE in the noisefree or little noise situation. However, Fig. 1b shows ShNeigh1 and ShNeigh2 have no pronounced improvement against ShRec3D in terms of the SCC measure, and both ShRec3D and ShNeigh programs perform worse than ChromSDE on SCC. It seems that ChromSDE tends to be over faithful to the noisy input data, which may be the reason why ChromSDE is less robust than other programs.
In Fig. 1c the absolute error between the estimated conversion factor α and the true α (=1) rises with increasing noise level generally. At low noise levels (< 0.2), ChromSDE can nearly perfectly estimate α values, consistent with its performance on the RMSD measure. But when the noise level increases the error estimated by ChromSDE ascends dramatically, indicating ChromSDE is prone to give a wrong conversion factor estimation as the data get more noisy. By contrast, ShNeigh2 and ShRec3D+ can estimate the conversion factor α quite accurately across various noise levels. We can see from Fig. 1a, c that the performance of these programs on RMSD and that on the absolute α error interweave with each other, in that samller RMSD leads to smaller α error, and vice versa.
As described in the previous section, ShRec3D and ShNeigh1 have no optimization, while ShRec3D+ includes a univariate optimization step (estimate α) and ShNeigh2 possesses a twovariate minimum searching procedure (estimate α and the weight ρ). Therefore, ShRec3D and ShNeigh1 are most efficient among the tested programs, and ShRec3D+ runs slower than ShRec3D and ShNeigh1 and faster than ShNeigh2 (Fig. 1d). ChromSDE is the most timeconsuming since it needs to explore a space of N^{2} variables (compute a semidefinite kernel matrix).
Figure 2 shows the predicted structures of the simulated helix by different programs (ShRec3D, ShNeigh2 and ChromSDE) under different noise levels. The structures predicted by ShNeigh1 are very similar to ShNeigh2 and so not shown. For the noisefree case drawn in the top row, both ShNeigh2 and ChromSDE can almost perfectly recover the true structure, and ShRec3D seems to give a bit overfat structure. For the case of medium noise level (=0.5, the middle row), the performances of all the three programs get worse, but the reconstruction result of ShNeigh2 is still quite good, and it is difficult to identify the helix structure from ChromSDE’s reconstruction. When the noise level gets the maximum (=1, the bottom row), ShNeigh2 can still present a clear helix structure, and by contrast, the structure by ShRec3D is too fat and obscure, while ChromSDE completely fails. We conclude that, on the whole, ShNeigh outperforms ShRec3D and ChromSDE, especially in the highly noisy circumstance.
At last we investigate the impact of signal coverage on the performance of these programs. Obviously signal coverage is proportional to the number of nearest neighbors parameter K of the simulation code. In fact signal coverage is approximately equal to K/N (Fig. 3f). Figure 3 shows RMSD increases with descending nearest neighbors K for all programs and all noise levels, indicating that reducing signal coverage can substantially deteriorate the reconstruction results. Our programs ShNeigh1 and Sheigh2 perform similarly and both of them give apparent improvement relative to ShRec3D for all noise levels and all signal coverage. And they outperform ChromSDE at most situations. It is only at low noise level or high signal coverage that ChromSDE performs better than ShNeigh1 and ShNeigh2 (Fig. 3ab). The leading status of ShNeigh1 and ShNeigh2 compared to ChromSDE gets more significant when the noise level increases, which coincides with the results shown in Fig. 12. When the signal coverage decreases, ChromSDE’s RMSD gets larger rapidly, while our ShNeigh programs are less sensitive to the signal sparseness. Therefore, the leading status of ShNeigh1 and ShNeigh2 compared to ChromSDE also gets more significant when the frequency matrix turns sparser (Supplementary Figure S1).
Real hiC data study
As for the human GM06990 cell lines, we compute the average RMSD across 23 chromosomes (1–22 and X) between the predicted structures from HindIII and NcoI HiC data and the average Spearman correlation coefficient between the estimated distance matrices (dSCC) of the predicted structures from the two enzyme data, which are shown in Fig. 4 and Supplementary Figure S3. Not surprisingly, all tested programs perform worse as the resolution rises (Fig. 4ab), since the average signal coverage gets lower at higher resolution (Fig. 4c). We first compare the performances of ShRec3D and ChromSDE. Note that the average signal coverage is about 0.96, 0.82, 0.40, 0.17 for 1000 k, 500 k, 200 k, 100 k, respectively. The comparison between ShRec3D and ChromSDE shown in Fig. 4 is very similar to the result shown in Fig. 2 of Ref. [11]. We found the improvement of our shNeigh programs against ShRec3D is not so distinct as the case of simulated data at 1000 k and 500 k resolution, though the difference between them is still remarkable at 200 k and 100 k resolution. ChromSDE behaves the best at 1000 k and 500 k resolution but the worst at 200 k and 100 k resolution. The dSCC value of ChromSDE is even close to zero at 100 k resolution (Fig. 4b), reflecting that ChromSDE completely failed to recover the underlying structure of GM06990 data for very high resolution. On the contrary, ShNeigh1 and ShNeigh2 perform relatively stable across all resolutions, and shNeigh1 performs the best among all tested programs at 200 k and 100 k resolution. On the whole, the advantage of shNeig1 and shNeigh2 approaches maximum at high resolution but is limited at low resolution. Comparing Fig. 4 with Fig. 3, the advantage of ChromSDE shown at 1000 k and 500 k resolution seems that the noise level of GM06990 data is very low. However, we are more convinced by the conjecture that real HiC data are commonly the product of a mixture of diverse structures [2, 23]. What’s more, the estimated conversion factor α by ShNeigh2 gets larger with increasing resolutions (Supplementary Figure S2), which coincides with the conclusion of Ref. [2].
The 3D structures of chromosome X predicted by ShRec3D, shNeigh1 and ChromSDE at different resolutions are drawn in Fig. 5. At 1000 k and 500 k resolution, all three programs can give structures of relatively good reproducibility. However, at 200 k and 100 k resolution, only shNeigh1 generated clear and highly reproducible structures, while Shrec3D and ChromSDE just reconstructed some tangled messes.
Because ChromSDE was computationally overburdened on the frequency matrices at 40 kb resolution, we processed Dixon2012 matrices only with ShRec3D and ShNeigh1. The whole genome of each cell type is reconstructed within one or 2 h per method (A PC with i7 7700K CPU and 32GB RAM). Since only one enzyme is available, the RMSD measure is not applicable for Dixon2012 data. In order to evaluate the performance of the two methods, we compute SCC between the input frequency matrix and the frequency matrix calculated from reconstructed structure. As shown in Fig. 6, the outperformance of shNeigh1 against ShRec3D is overwhelming.
Discussion
We have developed a novel method, named shNeigh, to reconstruct the 3D organization of chromosomes at the genome scale. It uses the classical MDS to minimize the gap between the predicted pairwise distances and those converted from the contact data. Shortest path algorithm is used to complete the converted distance matrix before applying MDS. ShNeigh explicitly models the local dependence of neighboring loci by a Gaussian expression and elaborately integrates the model into the MDS framework. Two strategies are adopted to determine the parameters (i.e. conversion factor α and the weight ρ) involved in the procedure: ShNeigh1 directly gives α = 1 and ρ values by relating ρ with the loci number and the signal coverage, and ShNeigh2 searches for the two parameters through an iterative algorithm by minimizing the difference between the measured and predicted contact matrix.
Though ShNeigh2 has a step of searching for the optimal conversion factor α and the weight ρ, it is still much faster than ChromSDE. ShRec3D and ShNeigh1 runs fastest among the tested programs. This means that ShNeigh can process much more genomic loci within bearable time compared to ChromSDE, which of great significance for experimental HiC data with gradually enhanced resolution in the future. What’s more, our method achieves essential performance improvement compared to ShRec3D at some cost of time consuming (i.e. ShNeigh2) or even no time cost (i.e. ShNeigh1). Such an improvement exists and is quite apparent in most situations. Only for the data of high signal coverage that are generated from diverse structures the improvement gets somewhat weaker. Compared to ChromSDE, our programs are very robust in that they perform excellent for noisy or low signal coverage data, while ChromSDE works well for the data of low noise level and high signal coverage and corresponding to diverse structures. Mathematically, we speculate that the loss of precision at the case of low noise and high signal coverage for ShNeigh and ShRec3D is due to the fact that too many matrix entries are modified by the shortest path algorithm. Considering it is very common for real HiC data to be noisy and sparse, our method is highly attractive. Observing Fig. 4 and supplementary Figure S1, we conservatively conclude that ShNeigh can guarantee to obtain substantial improvement against both ShRec3D and ChromSDE for the HiC data with signal coverage not more than 0.5. On the contrary, the Markov chain that was used in HSA to model the local dependence of neighboring loci showed significant improvement only for very sparse HiC contact matrix (say, 10% signal coverage).
Conclusions
We propose a new method to infer a consensus 3D genome structure from a HiC contact map. The novelty of our method is that it takes into account the adjacency of genomic loci along chromosomes. Mathematically, the proposed method penalizes the optimization problem of the classical MDS with a smoothness constraint weighted by a function of the genomic distance between genomic loci. We demonstrate that this optimization problem can still be solved efficiently by a classical MDS method. We then show that the method can recover stable structures in high noise and sparse signal settings. We also show that it can reconstruct similar structures from HiC data obtained using different restriction enzymes.
Our method provides a new guideline for enhancing the reconstruction quality of chromatin 3D structures. We notice that it is possible to involve our Gaussian adjacency model into most existing methods, including both distance based and likelihood based programs, such as HSA, PASTIS, ChromSDE, and so on. Assessing the performance of these various combinations is an interesting topic that deserves to be further explored in the future.
The software package, deposited in https://github.com/fangzhenli/ShNeigh, contains a minimum code for implementation of our ShNeigh method. It requires a normalized contact matrix as input. The users should preprocess the experimental HiC data by sequencing, mapping, binning and normalizing steps to get the normalized matrices before applying our software. As experimented in this work, our software can cope with at least 40 kb resolution real HiC data, which corresponds to contact matrices of more than 5000 × 5000 size. Higher resolutions or bigger matrices may also be processed within a limited time.
Methods
One HiC experiment generates a library of pairedend reads. Each pairedend read represents one observation that the corresponding two restriction fragments contact each other. Then the reads are mapped to the reference genome and those of low quality are filtered out. After grouping the mapped highquality reads according to genome loci where they locate, we get a contact frequency matrix F, where F_{ij} is a nonnegative integer representing the contact count between loci i and j. Here each locus is a genomic bin with a constant size such as 1Mbp or 40kbp. The resolution, namely the size of each genomic bin, is governed by the sequencing depth. The frequency matrix F is square and symmetric. Note that F may contain many zero entries generally, which indicates that the underlying locus pair are too far in the 3D space to interact with each other.
Given a frequency matrix F, our task is to reconstruct the 3D structure of the chromosome from which F is generated. That is, a coordinate matrix X = (x_{1,}⋯, x_{N}) ′ ∈ R^{N × 3} should be derived from F, where N denotes the number of loci in the chromosome and x_{i} ∈ R^{3 × 1} represents the 3D coordinate of the ith locus. Our approach is based on the classical MDS methods.
Classical MDSbased methods
Classical MDSbased methods, such as ShRec3D and ShRec3D+, generally consist of the following three steps.
First, convert the contact frequency matrix F into a distance matrix D. All existing methods, including MDSbased and likelihoodbased approaches, assume that the contract frequency between two loci and their 3D distance agrees with the following power law relationship [2].
where α is the conversion factor and D_{ij} and F_{ij} are the 3D distance and contact frequency between loci i and j, respectively. The infinite distances D_{ij} = ∞ denote they provide no information for structure reconstruction. Eq.(1) does not consider the scale between the converted distance and the real physical distance. This scale, if necessary, can be described by adding a coefficient β before the term (1/F_{ij})^{α} in Eq.(1). The parameter β is usually expressed explicitly in the objective function of likelihoodbased methods. Our goal is to make the predicted structure align the underlying true structure as accurate as possible after applying scaling, reflection, translation and rotation operations, for which it is not requisite for β to emerge in Eq.(1). Thus, for the MDSbased methods β is calculated solely in assessing algorithm performance, namely in computing the RMSD criterion. The conversion factor α was set to a constant one in ShRec3D. Here we calculate α by the policy used in ChromSDE and ShRec3D+. See the subsection Parameter estimation for detailed description.
Second, complete the distance matrix D. The classical MDS requires a full set of distances between all loci pairs available, but the infinite elements of D represent unknown distances and so must be endowed with finite values before applying MDS. To this end, we model the distance matrix D by a weighted graph whose nodes represent the genomic loci. In this graph two nodes i and j are linked by an edge if and only if the corresponding D_{ij} has finite value, and the length (or weight) of the edge is just the value of D_{ij}. We define the distance between two nodes by the length of the shortest path relating them. Finding the shortest path between any two nodes in the graph is a classical problem in graph theory. As in ShRec3D, we use the FloydWarshall algorithm (implemented by the Matlab function graphallshortestpaths) to calculate the shortest paths and their lengths. FloydWarshall is a dynamic programming algorithm with time complexity O(N^{3}), where N is the number of nodes. The resulting graph becomes a clique, namely a fully connected graph, and the shortestpath distances satisfy the triangular inequality. Note that some original finite distances may change their values after FloydWarshall calculation, reflecting the input data are noisy.
Third, map the distance matrix into 3D structure by multidimensional scaling. Multidimensional scaling (MDS) is a technique of data statistics that can determine the coordinates of n objects in the kdimensional Euclidean space (here k = 3) from their distance measures [24]. In order to elucidate the procedure of MDS, we firstly let I_{N} denote an N × N unity matrix and 1 = (1, ⋯, 1)′ be a column vector of length N with all elements being ones, then we define an N × N matrix \( H={I}_N\frac{1}{N}\mathbf{11}^{\prime } \). H is symmetric and idempotent. Given the distance matrix D, construct the matrix \( A=\left({a}_{ij}\right)=\left(\frac{1}{2}{D}_{ij}^2\right) \) and further define B = (b_{ij}) = HAH. Meanwhile, for the coordinate matrix X = (x_{1,}⋯, x_{N}) ′ ∈ R^{N × 3} to be reconstructed we define its centralized inner product matrix by \( \hat{B}=\left(\hat{b_{ij}}\right)= HXX^{\prime }H \). The classical MDS aims to minimize the following cost function:
where tr(.) denotes the trace of a matrix. To this end, singular value decomposition is applied to B to get its three largest eigenvalues λ_{1} ⩾ λ_{2} ⩾ λ_{3} and their corresponding eigenvectors (γ_{1}, γ_{2}, γ_{3}), with γ_{i} having been normalized to 1. Then the coordinate matrix X is recovered by
With this solution the cost function get minimum:
Therefore, when all eigenvalues other than the top three are equal to zero the reconstruction is exact. But in practice some λ_{i} (i > 3) may be negative, so the classical MDS can only approximately recover the chromosome structure generally.
MDS with consideration of neighboring relationship
Intuitively, if two loci x_{i} and x_{j} are neighbors in the genome, the distance between the spatial coordinates of x_{i} and x_{j} should be small. In order to consider the local dependence of neighboring genomic loci, we define an affinity matrix M = (m_{ij}) with
where σ represents the rate that m_{ij} decays with the genomic distance between loci i and j. Then we add the term ∑m_{ij}‖x_{i} − x_{j}‖^{2} into the cost function Eq.(2), turning the cost to
The second term reflects a distance penalty. It controls the smoothness of the reconstructed structure with a tuning parameter ρ. The extreme scenario ρ = 0 is just the ShRec3D [19] method, which gives a reconstruction entirely relying on the contact maps without smoothing.
After some algebra (see Supplementary text for a detailed derivation), we proved that the above problem is equivalent to minimizing the following object function:
where \( \overset{\sim }{B}=\left(\overset{\sim }{b_{ij}}\right)=B\rho L \), and L is the Laplacian matrix defined by L = D − M where D is the diagonal matrix with entries \( {d}_{ii}={\sum}_j{m}_{ij} \). Therefore, compared with Eq.(2), it is straightforward that we should exert singular value decomposition on \( \overset{\sim }{B}=B\rho L \) and get the top three eigenvalues \( \overset{\sim }{\lambda_1}\geqslant \overset{\sim }{\lambda_2}\geqslant \overset{\sim }{\lambda_3} \) and their corresponding eigenvectors \( \left(\overset{\sim }{\gamma_{1,}}\overset{\sim }{\gamma_{2,}}\overset{\sim }{\gamma_3}\right) \). Then the reconstructed coordinate matrix X becomes
In the present work we only consider an affinity matrix M with the form of Eq.(5). Other forms of M are also desirable to attempt, for example, m_{ij} = 1 for i − j = 1 and m_{ij} = 0 otherwise. This matrix is just the scheme used in HSA [11], which captures the local dependency of the most neighboring loci solely.
Parameter estimation
There are three parameters to be estimated in our method: the conversion factor α in Eq.(1), the distance penalty weight ρ in Eq.(6), the decaying rate σ in Eq.(5). Once their values are given, the reconstruction can be implemented by Eq.(8) straightly. We can either provide their values directly or infer them by an additional optimization procedure. We refer to the former as ShNeigh1 and to the latter as ShNeigh2.
For ShNeigh1, we empirically set α = 1, \( \rho =\mathit{\max}\left\{\left(1 sc\right)\sqrt{N},\mathit{\min}\left(\mathrm{3,0.2}\times \sqrt{N}\right)\right\} \) and σ = 0.023 × N, where N is the number of genomic loci and sc ∈ [0, 1] denotes the signal coverage defined by the percent of nonzero entries in the contact matrix. sc is an indicator of the sparseness of the contact matrix. α = 1 is the policy adopted by ShRec3D. More suitable σ values than 0.023 × N are possible, but our experiments showed that the reconstruction is insensitive to σ. The expression of ρ is partly inspired by HSA. It means that the value of ρ is proportional to both one minus signal coverage and the root square of loci quantity. The term \( \mathit{\min}\left(\mathrm{3,0.2}\times \sqrt{N}\right) \) is used to handle the case of very high (close to 1) signal coverage. Without this term, ρ will tend to be zero as sc approaches 1. For ShNeigh2, we also set σ = 0.023 × N, but we infer α and ρ by minimizing an error function that describes the difference between the predicted frequency matrix \( \hat{F} \) and the input frequency matrix F. Fig. 7 gives a detailed description of the function error(α, ρ, F).
Minimizing error(α, ρ, F) with respect to α and ρ is a twodimensional optimization problem, and it is difficult to calculate the gradient for error(α, ρ, F). ChromSDE used the golden section algorithm to optimize α, but it is a onedimensional derivativefree algorithm and thus unsuitable for our context. Here we adopt the NelderMead simplex (implemented by the Matlab function fminsearch), a multidimensional derivativefree algorithm, to simultaneously optimize α and ρ. A simplex in two dimensions is a triangular. For a given simplex, the NelderMead simplex method first evaluates the objective function on its three vertices and recognizes the vertex with the largest value and the one with the smallest value. Then a new point with value lower than the vertex with the largest value is generated by operations of reflection, expansion and compression. A new simplex is thus constructed by substituting the largest vertex with the new point, or by shrinking toward the smallest vertex. Therefore, the minimum of the objective function can be approached by iteratively updating the simplex.
Data
Both simulated and real HiC datasets are used to test the performance of our method. We generate the simulated datasets based on a helix curve structure with the following formula [2]:
This structure is modeled by a linear polymer consisting of N points. The coordinates of the N points are calculated by Eq.(9) and then transformed to an N × N distance matrix D. In order to imitate the incompleteness nature of real HiC frequency matrix, only distances for K (K < N) nearest neighbors around each of the points are retained, and other distances are assigned to infinity. K directly determines the signal coverage of the transformed distance matrix D (see Fig.3f). The distance matrix D is then converted into the contact frequency matrix by F = (1/D)^{1/α}. We further make the frequency matrix noisy by adding a random noise δ that is uniformly distributed in the region [−S,S], with S ∈ (0, 1) being a given noise level. Specifically, \( \overset{\sim }{F}=F\left(1+\delta \right) \). Finally the frequency matrix is scaled to summation 10^{6}, which is similar to the usual treatment of real HiC data. Thus, the simulation code has 4 input parameters to be given by users: point number N, noise leverl S, conversion factor α and the number of nearest neighbors K. We fix the conversion factor α = 1 throughout the simulation and tune the other three parameters according to different tasks. See Ref. [2] for more detailed description of generating simulated data and the code therein.
There have been lots of in situ HiC data online, of which the human GM06990 cell dataset [1] is commonly used in literature. The advantage of this dataset is that it was generated with two different enzymes (HindIII,NcoI), making it possible to validate the structure of the investigated genome or validate alternative experimental designs. This dataset is also used in our present work. As described in the Introduction, the real HiC data need to be normalized to remove biases before reconstruction for all distancebased methods. The normalized contact frequency matrices of human GM06990 cells can be downloaded directly from the website of Amos Tanay’s group (http://compgenomics.weizmann.ac.il/tanay/?page_id=283). The data generated in Dixon et al. [25] is another real HiC dataset we used (denoted as Dixon2012). This dataset is composed of four cell types: mouse embryonic stem cells (mES), mouse cortex (mCO), human embryonic stem cells (hES), human IMR90 fibroblasts (hIMR90). We downloaded the normalized frequency matrices of 40 kb resolution for all four cell types directly from the website of Bing Ren’s group (http://chromosome.sdsc.edu/mouse/hic/download.html). We compared our ShNeigh with three published programs: ShRec3D [19], ShRec3D+ [20] and ChromSDE [2], which are all distancebased methods, by using both the simulated and the real HiC data.
Performance assessment measures
We use different structure similarity measures for simulated data and real HiC data to assess the performance of ShNeigh. Since the true structure is known for the simulated data, a natural measure is the Root Mean Squared Deviation (RMSD). RMSD measures the similarity of two structures by computing the distance of coordinates of the paired points between them. Given a real structure’s N × 3 3D coordinates P = (p_{1}, ⋯, p_{N})′, and a predicted structure Q = (q_{1,}⋯, q_{N})′ (p_{i} or q_{i} is a 3 × 1 vector of the ith locus’ coordinate, i = 1, · · ·, N), RMSD is defined as
Before performing Eq.(10), some geometric operations: reflecting, rotating, translating and scaling, should be imposed on the predicted structure Q to make it align the true structure P. See [10, 11, 26] for the detailed implementation. Obviously, smaller RMSD value means higher similarity of two structures and hence better performance of the tested program. It is widely used in biomolecular structure comparison, such as protein structures and chromosome structures. In addition, we use the Spearman correlation coefficient (SCC) between the pairwise distances from the predicted structure and those from the true structure to give another performance measure.
As for the real HiC data, the underlying true structures of chromosomes are unknown, so the RMSD measure comes from comparing the two predicted structures of HindIII and NcoI enzymes. We also compute the Spearman correlation between the two estimated frequency matrices of the structures inferred from two different enzymes. It is more unbiased to use Spearman correlation than use Pearson correlation for testing every program, because Spearman correlation is independent of the conversion factor α [2]. Similar to Pearson correlation, the Spearman correlation value varies in [− 1,1], the more close to 1.0 the better.
Availability of data and materials
A Matlab implementation of the proposed method is available at https://github.com/fangzhenli/ShNeigh.
Abbreviations
 MDS:

Multidimensional scaling
 RMSD:

Root mean square deviation
 SCC:

Spearman correlation coefficient
 HiC:

Chromatin conformation capture with highthroughput sequencing
References
LiebermanAiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of longrange interactions reveals folding principles of the human genome. Science. 2009;326:289–93.
Zhang Z, Li G, Toh KC, Sung WK. 3D chromosome modeling with semidefinite programming and hiC data. J Comput Biol. 2013;20:831–46.
Peng C, Fu LY, Dong PF, Deng ZL, Li JX, Wang XT, et al. The sequencing bias relaxed characteristics of hiC derived data and implications for chromatin 3D modeling. Nucleic Acids Res. 2013;41:183.
Oluwadare O, Zhang Y, Cheng J. A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC Genomics. 2018;19:161.
Trieu T, Cheng J. MOGEN: a tool for reconstructing 3D models of genomes from chromosomal conformation capturing data. Bioinformatics. 2016;32:1286–92.
Trieu T, Cheng J. Largescale reconstruction of 3D structures of human chromosomes from chromosomal contact data. Nucleic Acids Res. 2014;42(7):e52.
Trieu T, Cheng J. 3D genome structure modeling by Lorentzian objective function. Nucleic Acids Res. 2017;45:1049–58.
Yaffe E, Tanay A. Probabilistic modeling of hiC contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet. 2011;43:1059–65.
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014;30:i26–33.
Hu M, Deng K, Qin Z, Dixon J, Selvaraj S, Fang J, et al. Bayesian inference of spatial organizations of chromosomes. PLoS Comput Biol. 2013;9:1002893.
Zou C, Zhang Y, Ouyang Z. HSA: integrating multitrack hiC data for genomescale reconstruction of 3D chromatin structure. Genome Biol. 2016;17:40.
Wachter A, Biegler LT. On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming. Math Program. 2006;106:25–57.
Baù D, MartiRenom MA. Genome structure determination via 3Cbased data integration by the integrative modeling platform. Methods. 2012;58:300–6.
Serra F, Baù D, Goodstadt M, Castillo D, Filion GJ, MartiRenom MA. Automatic analysis and 3Dmodelling of hiC data using TADbit reveals structural features of the fly chromatin colors. PLoS Comput Biol. 2017;13:e1005665.
Russel D, Lasker K, Webb B, VelázquezMuriel J, Tjioe E, SchneidmanDuhovny D, et al. Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol. 2012;10:e1001244.
Rousseau M, Fraser J, Ferraiuolo M, Dostie J, Blanchette M. Threedimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC Bioinform. 2011;12:414.
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH. Equation of state calculation by fast computing machines. J Chem Phys. 1953;21:1087–92.
Torgerson WS. Multidimensional Scaling: I. Theory Method Psychometrika. 1952;17:401–19.
Lesne A, Riposo J, Roger P, Cournac A, Mozziconacci J. 3D genome reconstruction from chromosomal contacts. Nat Methods. 2014;11:1141–3.
Li J, Zhang W, Li X. 3D genome reconstruction with ShRec3D+ and hiC data. IEEE/ACM Trans Comput Biol Bioinform. 2018;15:460–8.
Kapilevich V, Seno S, Matsuda H, Takenaka Y. Chromatin 3D reconstruction from chromosomal contacts using a genetic algorithm. IEEE/ACM Trans Comput Biol Bioinform. 2018. https://doi.org/10.1109/TCBB.2018.2814995.
Oluwadare O, Highsmith M, Cheng J. An Overview of Methods for Reconstructing 3D Chromosome and Genome Structures from HiC Data. Biological Procedures Online. 2019;21(1):7.
Ay F, Noble WS. Analysis methods for studying the 3D architecture of the genome. Genome Biol. 2015;16:183–97.
Buja A, Swayne DF, Littman ML, Dean N, Hofmann H, Chen L. Data Visualization With Multidimensional Scaling. J Comput Graph Stat. 2007;17:444–72.
Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–80.
Arun KS, Huang TS, Blostein SD. Leastsquares fitting of two 3d point sets. IEEE Trans Pattern Anal Mach Intell. 1987;9:698–700.
Acknowledgments
We are grateful to Prof. HuiMin Zhu for checking the writing of the manuscript, and reviewers for constructive suggestions.
Funding
This work was supported by the National Natural Science Foundation of China (No. 11604170, 61572286, 61873145, 30600121), NSFC Joint with Zhejiang Integration of Informatization and Industrialization under Key Project (No. U1609218), Scientific Research in Universities of Shandong Province (No. J16LJ06) and the Natural Science Foundation of Shandong Province, China (No. ZR2019MA059, ZR2014AQ018). The funding bodies played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
FZL proposed the method, conducted the formula derivation and drafted the manuscript. ZEL designed the study and implemented the proposed method. XYL implemented the compared programs. LMB and HXB collected data and gave preliminary data process. HL participated in the discussion on MDS. CMZ helped to draft the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, FZ., Liu, ZE., Li, XY. et al. Chromatin 3D structure reconstruction with consideration of adjacency relationship among genomic loci. BMC Bioinformatics 21, 272 (2020). https://doi.org/10.1186/s12859020036124
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859020036124