 Research Article
 Open access
 Published:
Assessing stationary distributions derived from chromatin contact maps
BMC Bioinformatics volume 21, Article number: 73 (2020)
Abstract
Background
The spatial configuration of chromosomes is essential to various cellular processes, notably gene regulation, while architecture related alterations, such as translocations and gene fusions, are often cancer drivers. Thus, eliciting chromatin conformation is important, yet challenging due to compaction, dynamics and scale. However, a variety of recent assays, in particular HiC, have generated new details of chromatin structure, spawning a number of novel biological findings. Many findings have resulted from analyses on the level of native contact data as generated by the assays. Alternatively, reconstruction based approaches often proceed by first converting contact frequencies into distances, then generating a three dimensional (3D) chromatin configuration that best recapitulates these distances. Subsequent analyses can enrich contact level analyses via superposition of genomic attributes on the reconstruction. But, such advantages depend on the accuracy of the reconstruction which, absent gold standards, is inherently difficult to assess. Attempts at accuracy evaluation have relied on simulation and/or FISH imaging that typically features a handful of low resolution probes. While newly advanced multiplexed FISH imaging offers possibilities for refined 3D reconstruction accuracy evaluation, availability of such data is limited due to assay complexity and the resolution thereof is appreciably lower than the reconstructions being assessed. Accordingly, there is demand for new methods of reconstruction accuracy appraisal.
Results
Here we explore the potential of recently proposed stationary distributions, hereafter StatDns, derived from HiC contact matrices, to serve as a basis for reconstruction accuracy assessment. Current usage of such StatDns has focussed on the identification of highly interactive regions (HIRs): computationally defined regions of the genome purportedly involved in numerous longrange intrachromosomal contacts. Consistent identification of HIRs would be informative with respect to inferred 3D architecture since the corresponding regions of the reconstruction would have an elevated number of k nearest neighbors (kNNs). More generally, we anticipate a monotone decreasing relationship between StatDn values and kNN distances. After initially evaluating the reproducibility of StatDns across replicate HiC data sets, we use this implied StatDn  kNN relationship to gauge the utility of StatDns for reconstruction validation, making recourse to both real and simulated examples.
Conclusions
Our analyses demonstrate that, as constructed, StatDns do not provide a suitable measure for assessing the accuracy of 3D genome reconstructions. Whether this is attributable to specific choices surrounding normalization in defining StatDns or to the logic underlying their very formulation remains to be determined.
Background
The spatial configuration of chromosomes is essential to various cellular processes, notably gene regulation. Conversely, architecture related alterations, such as translocations and gene fusions, are often cancer drivers. Accordingly, eliciting chromatin conformation is important. Such elicitation had been challenging due to chromatin compaction, dynamics and scale. However, the emergence of the suite of chromatin conformation capture assays, in particular HiC, generated new details of chromatin structure and spawned a number of subsequent biological findings [2, 9, 10, 18, 23]. Many of these findings have directly resulted from analyses of interaction or contact level data generated by HiC assays. Such data, usually obtained from bulk cell populations, record the frequency with which pairs of genomic loci (or bins thereof) are crosslinked, indicating spatial proximity of those loci within the nucleus. A less common HiC analysis paradigm proceeds by first converting these contact frequencies into distances, this transformation often invoking inverse powerlaws [2, 13, 29, 35, 41]), and then generating a putative three dimensional (3D) reconstruction of the associated chromatin configuration via variants of multidimensional scaling (MDS). Such 3D reconstruction has been shown to enrich analyses based solely on the underlying contact map, these deriving, in part, from superposing genomic features. Examples include identifying colocalized genomic landmarks such as early replication origins [6, 37], expression gradients and colocalization of virulence genes in the malaria parasite Plasmodium falciparum [2], the impact of spatial organization on double strand break repair [14], and elucidation of ‘3D hotspots’ corresponding to overlaid ChIPSeq transcription factor maxima, revealing novel regulatory interactions [7].
But, any potential added value in analyses based on 3D reconstruction is conditional on the accuracy of the corresponding reconstruction and, appropriately, many concerns have been expressed regarding such accuracy. Firstly, the very notion of a single reconstruction being representative of the large (∼10^{6}) cell populations characterizing HiC assays is highly simplistic [19]. This issue has prompted reconstruction approaches [13, 33] that produce an ensemble of solutions, intended to capture intercell variation. However, whether these collections capture biologic, as opposed to algorithmic, variation is unclear [26, 35]. The recent development of highthroughput singlecell HiC assays [22, 31] provides an opportunity for systematic investigation of structural variation. Secondly, even at the singlecell level, genome conformation is dynamic with, for instance, obvious changes over the course of the cell cycle, as well as cell type specific. Finally, the lack of 3D chromatin structure gold standards makes accuracy assessment inherently problematic. To address this obstacle several authors have appealed to simulation [16, 20, 34, 35, 41, 42]. In order to deploy real data referents many of the same reconstruction algorithm developers have made recourse to fluorescence in situ hybridization (FISH) imaging as a means for gauging the accuracy of competing algorithms and/or tuning parameter settings. This approach proceeds by comparing measured distances between imaged probes with corresponding distances obtained from 3D reconstruction algorithms. These standard FISHbased methods, however, are tenuous due to the limited number of imaged probes (∼2−6, [18, 20, 29]) and the poor resolution thereof, many straddling over 1 megabase.
To improve on these accuracy assessment shortcomings we previously devised methods that centered on two newly devised biotechnologies [28]: (i) multiplex FISH [36] which provides an order of magnitude more probes, each at higher resolution, and hence two orders of magnitude more distances than conventional FISH, and (ii) a proximitybased ligationfree method, genome architecture mapping [3], predicated on sequencing DNA from a large collection of randomlyoriented, thin nuclear cryosections which enables determination of an internal measure of accuracy by evaluating how well the reconstruction conforms to the underlying collection of planar nuclear cryosections. However, these approaches to accuracy assessment have their own limitations. The primary drawback is that each biotechnology is experimentally intensive and, accordingly, has had minimal uptake. The resultant dearth of associated public data profoundly restricts the extent to which these approaches can be applied. Additionally, there is a resolution disparity, with HiC data being available at higher resolutions, mandating a coarsening of reconstructions prior to accuracy assessment.
In seeking to devise a more broadly applicable means for reconstruction accuracy assessment we were drawn to the recently proposed (Sobhy et al., [30], hereafter SKLLS) stationary distribution (hereafter StatDn(s)) of a HiC matrix and associated highly interactive regions (HIRs): computationally defined regions of the genome purportedly involved in numerous longrange intrachromosomal contacts. Consistent identification of HIRs would be informative with respect to inferred 3D architecture since the corresponding regions of the reconstruction would have an elevated number of k nearest neighbors (kNNs) compared with nonhighly interacting regions. More generally, we would anticipate a monotone decreasing relationship between StatDn values and kNN distances for fixed values of k. This posited relationship provides one means for evaluating the potential utility of StatDns, ithe objective of this paper, which is organized as follows. Under Methods we first recapitulate how StatDns are derived, highlighting normalization and interpretation issues, and then detail data sources to be used in the evaluation thereof. The “Results” section showcases StatDn findings with respect to reproducibility across replicate HiC data sets, effects of normalization scheme, and performance for 3D reconstruction validation, via assessment of the above monotonicity between StatDn values and kNN distances, based on real and simulated examples. The Discussion frames conclusions based on the foregoing findings.
Methods
Stationary distributions from HiC contact matrices
Given a (possibly normalized – see below) symmetric, nonnegative n×n observed contact matrix O=[o_{ij}] the associated StatDn is generated as follows. First, O is standardized by dividing every entry by its row sum. This enables the key step: treating the resultant matrix, W, as a transition probability matrix (TPM), with entry w_{ij} interpreted as the probability of ‘jumping’ from node i to node j where ‘nodes’ denote a rebranding of the underlying HiC bins or loci, thereby allowing an overlay of graph / network concepts. The fact that, due to row sum based standardization, W is not symmetric complicates this interpretation since the original ‘proximities’ as measured via HiC are symmetric: o_{ij}=o_{ji}. SKLLS proceed by prescribing a Markov model with TPM W. Let p_{i}(t) be the probability of occupying node i at time t and p(t)=(p_{1}(t),p_{2}(t),…,p_{n}(t)) be the corresponding probability distribution. Then, under the Markov assumption, transitions occur according to
The limiting (t→∞) StatDn, designated p(∞), satisfies p(∞)=p(∞)W, and is given by the (left) eigenvector corresponding to the (largest) eigenvalue one, the nonnegative entries of p(∞) being normalized to sum to one. We use the R package RSpectra [21] to perform the requisite spectral decomposition.
SKLLS categorize StatDns, at 30^{th}, 50^{th}, 80^{th} and 90^{th} percentiles, and deploy the resultant ordered categories in downstream analyses, with an emphasis on HIRs corresponding to the latter upper decile. In contrast, we utilize StatDns in their native, continuous form obviating the need for thresholding. As a check, we extracted SKLLSdefined categories and reprised select analyses with concordant findings.
Normalization and interpretation issues
There has been extensive discussion surrounding normalization issues for HiC data and development of companion corrective methods [8, 11, 12, 17, 38]. Much of this effort pertains to mitigating systematic biases affecting observed o_{ij} values deriving from factors such as fragment length, GC content and mappability. A distinct aspect of some normalization strategies concerns removing ‘expected’ contact counts from the observed values so as to adjust for contiguity and thereby emphasize features of interest such as loops. In this context expected values are often computed as a function of genomic distance [2, 10]. This equates to applying a common correction within each diagonal of O, elements thereof being equispaced with respect to genomic distance, presuming equal sized contact matrix bins as is standard. It is this approach that is considered by SKLLS.
Specifically, for each of the n diagonals of O, the median of the corresponding entries is obtained. An n×n expectation matrix E with constant diagonals is then created, the constants being the respective medians. In addition to obtaining StatDns (as detailed above) from (unnormalized) O, they are also generated from O−E and O/E. To satisfy the nonnegativity requirement of a TPM any negative values arising post normalization are replaced with a small positive constant. For O−E normalization, with E based on diagonal medians, this means that approximately half the entries will be replaced by this constant. The ramifications, both interpretive and performancewise, of such wholesale substitution are unclear.
In order to decide between the competing normalization schemes SKLLS assert that O−E normalization produces StatDns with a larger ‘dynamic range’ than O or O/E approaches, and is accordingly preferred. Presuming dynamic range is defined as the difference between maximum and minimum StatDn values, the rationale for its selection as a normalization criterion is obscure. Moreover, it will be susceptible to the influence of outliers as can arise from extreme (normalized) contact matrix row sums. The supporting evidence presented for choosing O−E consists of visually comparing StatDns from the three schemes over a limited range of a single chromosome. Further, it is claimed that, in using O directly, the inclusion of both short and longrange contacts attenuates dynamic range but the basis for this is unclear.
It is pertinent to consider StatDns, as operationalized above, arising from specific patterned matrices. For a compound symmetric (exchangeable) matrix the StatDn is constant (p_{i}(∞)=1/n ∀i) irrespective of the value of the offdiagonal entries, with this same StatDn resulting from a tridiagonal matrix, again independent of the value of the offdiagonal entries [25]. While these patterns don’t reflect O,O−E,O/E matrices arising in practice, the lack of StatDn discrimination between such appreciably different matrices raises interpretative concerns about the proposed approach, at least from the perspective of evaluating 3D reconstructions, and potentially beyond.
Data sources and simulated 3D structures
HiC data [23] for GM12878 cells was obtained from the Gene Expression Omnibus (GEO) with accession GSE63525. Contact matrices deriving from several series of experiments were grouped (by the original authors) into ‘primary’ and ‘replicate’ datasets and we utilize these to assess reproducibility, as has been done previously [28]. HiC data [9] for IMR90 cells was obtained from the Gene Expression Omnibus (GEO) with accession GSE35156. For both cell types analyses were restricted to reads with alignment mapping quality scores ≥30 and conducted with contact matrices at 25kb resolution since this corresponds to the resolution of SKLLS defined HIRs.
Noisedup versions of simulated chainlike and topologically associated domain (TAD)like structures and attendant contact maps obtained under differing regimes have been used to evaluate 3D reconstruction algorithms in settings intended to recapitulate practice [34, 42]. Similarly, simulated helical and random walk structures have been used for this purpose [42]. Here we follow an analogous agenda by (i) computing StatDns from the contact matrices provided using each of the normalization schemes described above, and (ii) comparing these to the corresponding structures using k nearest neighbors as described subsequently.
As an illustration of how such synthetic data is obtained we present a brief overview of the formulation used for helical structures following Zou et al., [42]. O_{ij}, the (i,j)^{th} entry of the observed contact matrix O, is generated as a random Poisson variate with rate parameter λ_{ij}. In turn, this parameter is set using the abovementioned inverse powerlaw transformation: \(\lambda _{ij} = c / d_{ij}^{\alpha }\). Here d_{ij} corresponds to the distance between the i^{th} and j^{th} points on the helix, α is fixed at 1.5, and c varies so as to govern the signal coverage – the percentage of nonzero entries in the contact matrix. For the results presented subsequently we obtain 100 points on a helix defined by coordinate functions
and set c to yield 25% signal coverage, with similar findings at 90% coverage.
Obtaining 3D genome reconstructions from HiC data
Use of simulated 3D architectures and associated contact maps, as above, in evaluating StatDns as a validation tool has the advantage of eliminating uncertainties inherent in the reconstruction process. Nonetheless, it is purposeful to assess StatDns using real data reconstructions, reflecting use in practice.
Multidimensional scaling
As noted in the Background, there are numerous approaches for generating 3D reconstructions from HiC contact maps and, in turn, most of these feature several tuning parameters. In order not to obscure our purpose of appraising StatDns we showcase findings from a simple, minimalassumption approach to reconstruction: multidimensional scaling, fit using the R package smacof [15]. MDS is an established approach to finding configurations that recapitulate dissimilarity measures which, in turn, can be obtained from HiC contacts, by powerlaw transformation for example. Accordingly, MDSbased approaches have been widely used in the context of genome reconstruction [2, 4, 16, 24, 27, 29, 32, 35, 41].
Under MDS we seek a 3D configuration \(X = \{\vec {x}_{1},\ldots,\vec {x}_{n}\}; \vec {x}_{j} \in R^{3}\) that best fits the dissimilarity matrix D according to:
Though confining our attention to MDS, we explored a variety of schemes within this framework, using both metric and nonmetric scaling, and varying dissimilarity weights ω_{ij} whereby downweighting of imprecise contact counts can be accommodated, and powerlaw indices for transforming O to D. We note that irrespective of MDS reconstruction method examined results were largely similar.
Hamiltonian simulated annealing
In order for findings not to be solely reliant on a single (MDS) reconstruction strategy – although, as noted, a range of MDS specifications were examined – we additionally applied the Hamiltonian simulated annealing (HSA, [42]) algorithm. HSA has a number of compelling attributes: (i) it can simultaneously handle multiple data tracks allowing for integration of HiC contact data from differing restriction enzyme digests; (ii) it can adaptively estimate the powerlaw index whereby contacts are transformed to distances, the importance of which has been previously emphasized [41]; and (iii) by using simulated annealing combined with Hamiltonian dynamics it can effectively optimize over for the high dimensional space representing the genomic loci’s 3D coordinates.
Analogous to other 3D reconstruction algorithms [20, 35], HSA models (normalized) contact counts, n, via Poisson regression:
where in (3) k indexes track and \(n_{i_{k} j_{k}}\) is the count for genomic loci i_{k},j_{k}. The parameters β_{k1} are (track specific) powerlaw indices relating expected counts (μ) to Euclidean distances (d). Covariates such as GC content and fragment length can be included in (4) in order to facilitate inline normalization. The \(X_{i_{k}} = (x_{i_{k}},y_{i_{k}},z_{i_{k}})\) and \(X_{j_{k}} = (x_{j_{k}},y_{j_{k}},z_{j_{k}})\) in (5) are the 3D coordinates for loci i_{k},j_{k} and constitute the unknown parameters providing the reconstruction. These are subject to constraints designed to capture the local contiguity of chromatin, represented by induced dependencies of a hidden Gaussian Markov chain. The full loglikelihood for β,X is then
to which a penalty term controlling local smoothness is added. Note that (constrained) X enters (6) through μ and d from (4) and (5) respectively. The resulting penalized likelihood is optimized by iterating between generalized linear model (GLM, cf Poisson regression) fitting to obtain estimates \(\hat \beta \) and simulated annealing to obtain estimates of the 3D coordinates \(\hat X = (\hat x, \hat y, \hat z)\). Several tuning parameters control the simulated annealing search and we used default values, as established by the authors’ for their custom R scripts.
Stationary distribution reproducibility
We assessed the reproducibility – between primary and replicate data series – of StatDns obtained under the differing normalization schemes – using scatterplot smoothing and associated correlations. We contrast these correlations with stratumadjusted correlation coefficients (SCCs) of the corresponding HiC data. SCCs, described below, are custom correlation measures developed for HiC contact matrices that reflects the same constant diagonal expected counts described above which, on average, decreases substantially as genomic distance increases [39].
The SCC is based on the generalized CochranMantelHaenszel statistic, M^{2}, which is used for testing whether two variables are associated while being stratified by a third variable [1]. Since the magnitude of M^{2} depends on sample size it does not provide a direct measure of association strength. In the unstratified setting we have the relationship ρ^{2}=M^{2}/(n−1) where ρ is the Pearson correlation coefficient and n is the number of observations. This relationship underscores the derivation of the SCC to measure association in the presence of stratification. Let (X,Y) denote a pair of samples (here contact matrices) with n observations stratified into K strata (here diagonal bands corresponding to equal genomic distances), each having n_{k} observations so that \(\sum _{k=1}^{K} n_{k} = n\). Let the observations in stratum k be \((x_{i_{k}}, y_{i_{k}}); i = 1, \ldots, K\) with associated random variables (X_{k},Y_{k}).
The Pearson correlation coefficient ρ_{k} for the k^{th} stratum is ρ_{k}=r_{1k}/r_{2k}, where
It is straightforward to represent M^{2} in terms of a weighted sum of the ρ_{k} which gives rise to the SCC defined as
Further aspects of SCCs, including obtaining the variance of ρ_{s}, deploying variance stabilizing weights in computing ρ_{s}, guidelines for determining the number of strata K are detailed in Yang et al., [39], with fitting making recourse to R package hicrep [40].
Comparing stationary distributions and 3D genome reconstructions
For each locus of a 3D structure, either simulated or obtained via reconstruction, we compute the distance to its k^{th} nearest neighbor (kNN) in the structure, for k∈Ω={5,15,25}, using the R package FNN [5]. Since kNN distances are monotone in k it suffices to consider a few select values. We plot these kNN distances against StatDn values obtained from the corresponding contact matrix. We again use scatterplot smoothing (R function lowess) to highlight relationships, with a monotone decreasing association anticipated if StatDn identification of highly (and remotely) interacting loci are supported by the structure. To appreciate the basis for this monotone decreasing relationship consider the antithesis of a HIR, namely a minimally interacting region, characterized by low StatDn values. By virtue of its minimal interactions nearest neighbor distances for given k∈Ω will be large. The converse holds for HIRs and the underlying high StatDn values leading to the monotone decreasing relationship between StatDns and kNN distances.
Results
Our findings are presented largely by way of figures. These are constructed so that comparisons between O,O−E,O/E normalizations are highlighted. But, more important than these internal contrasts are overall assessments of StatDns for the stated objective of appraising 3D reconstructions. In most of the settings considered the overall performance is such that StatDns cannot be endorsed as a 3D reconstruction evaluation technique since the abovementioned monotone decreasing relationship with kNN distances fails to hold. Moreover, examples wherein anomalous behavior of StatDns is exhibited are showcased.
We report results for GM12878 chromosome 9 since this exhibits the highest density (per base) of HIRs as defined by SKLLS. We also present results for GM12878 chromosome 4 which is relatively sparse with respect to HIRs. However, similar trends were consistently observed across all chromosomes examined (not shown). Additionally, findings from select IMR90 cells are illustrated, revealing instances of StatDn breakdown.
Stationary distribution reproducibility
In Fig. 1 we compare the StatDns of GM12878 cells chromosome 9 primary and replicate series corresponding to respective normalizations O,O−E,O/E. The respective correlations are 0.962, 0.937 and 0.977 whereas the SCC between the primary and replicate contact matrices is 0.966. Thus, reproducibility for the O−E normalization chosen by SKLLS is furthest removed from the correlation between the underlying contact matrices.
More interesting findings emerge when we similarly assess reproducibility for IMR90 cells. Figure 2 displays the StatDns for IMR90 chromosome 21 primary and replicate series, again corresponding to respective normalizations O,O−E,O/E. The corresponding correlations are 0.935, 0.936 and 0.966, whereas the SCC between the primary and replicate contact matrices is 0.808. Thus, the StatDn correlations appreciably exceed the SCC between the underlying contact matrices, indicative of possible problems with StatDns in view of the careful and contact map customized construction of SCCs [39].
Also apparent in Fig. 2 are StatDn outliers, for both O and the chosen O−E normalizations, which result from (relatively) extreme contact matrix row sums, indicating possible normalization breakdown for such instances. An even more dramatic example of anomalous StatDn values is shown below with respect to reconstruction (Fig. 8).
Relating stationary distributions to 3D structures
The simulated helical and random walk structures previously used for 3D reconstruction evaluation [42] include instances varying according to the extent of signal coverage, defined as the percentage of nonzero entries in the contact matrix derived from the generated structure. Here we illustrate results for the lowest levels of signal coverage: 25% and 10% for the helix and random walk respectively. Findings at higher levels of signal coverage are similar (not shown) although the helical structure with 90% signal coverage does not display a monotone decreasing relationship between kNN distances and StatDns with O/E normalization.
Results for the simulated helical structure, based on 100 loci, are presented in Fig. 3. The quantal nature of the kNN distances (we display results for k=5,15) – for example, there are only three distinct 5 nearest neighbor distances – reflects the regularity of the helical configuration. The left and right panels, corresponding to O and O/E normalization, exhibit decreasing trends: the higher the StatDn value, nominally corresponding to loci with greater numbers of interactions, the smaller the kNN distance in the structure, as would be expected. However, for the middle panel, corresponding to O−E normalization, no such relationship is evident. Further, by virtue of the manner whereby O−E normalization handles nonpositive values, there is substantial duplication of StatDn values: 47 uniques versus 97 for O,O/E. Results for the random walk structure are presented in Fig. 4. Here we see very similar performance across normalization schemes with the anticipated decreasing relationship exhibited for each.
A comprehensive effort to generate structures and attendant contact matrices that more realistically reflect chromatin architecture has been undertaken by Trussart et al., [34]. Here we focus on two such structures, TADlike and chainlike, each generated with midlevel noise and structural variability corresponding to Trussart et al., parameter settings of α=100 and Δt=10^{3} respectively. Results for the TADlike structure are presented in Fig. 5 and for the chainlike structure in Fig. 6. For both structures we observe StatDns displaying an increasing relationship with kNN distances, this being strongest for O−E normalization.
Results from StatDn evaluation of a reconstruction for GM12878 chromosome 9 via unweighted metric MDS are depicted in Fig. 7. While the left and right panels corresponding to O and O/E normalization display decreasing relationships with kNN distances these are driven by elevated kNN values for small StatDn probabilities. Results for O−E normalization are effectively constant. Analogous findings were obtained from other (weighted, nonmetric) MDS reconstruction approaches, as well as for HSAbased reconstruction.
Similarly, results from StatDn evaluation of a reconstruction for IMR90 chromosome 21 by HSA are depicted in Fig. 8. Here the left and middle panels corresponding to O and O−E normalization display decreasing relationships with kNN for the bulk of the data but exhibit increasing trends in the upper tail: the region containing the HIR. These same trends were evident in reconstructions obtained using MDS.
Discussion
Many potential difficulties surrounding use of StatDns were delineated in Methods under Normalization and Interpretation Issues and these concerns have been borne out by the empirical results. It is important to note that these problems cannot be ascribed to deficiencies of the reconstruction algorithms since they are also exhibited with simulated structures that bypass the reconstruction step. Moreover, for some of the explorations based on chromatin configuration reconstruction, we have deliberately opted to utilize a minimalist MDS approach, thereby limiting the influence of assumptions and parameter tuning. These findings, wherein StatDns do not recapitulate inferred 3D MDS reconstructions, also pertain to an alternate stateoftheart reconstruction algorithm, HSA, and hold across all cell lines and chromosomes examined. Thus, the overall weight of evidence, both theoretic and empirical, is such that StatDns, especially those based on the prescribed O−E normalization, cannot be recommended as a means for evaluating 3D genome reconstruction. Indeed, these problematic underpinnings of StatDns, including the logic surrounding their definition, call into question their usage for any purpose, not just reconstruction assessment as examined here.
This conclusion begs the question as to whether alternate, established structural units derived from HiC contact matrices, such as TADs [9] and contact domains [23], might serve as components for (nonorthogonal) reconstruction assessment. However, these constructs are by definition local and so do not provide a basis for effecting largescale structure interrogation. It was the purported ability of StatDns to capture frequent, longrange interactions that motivated this evaluation of their validation potential. Conversely, TADs [24] and FISH distances [29] have been used to improve the reconstruction process itself. Again, given their uncertain foundation, we see no analogous role for StatDns.
Conclusion
Our analyses demonstrate that, as constructed, StatDns do not provide a suitable measure for assessing the accuracy of 3D genome reconstructions. Whether this is attributable to specific choices surrounding their formulation or to the logic underlying their very definition remains to be determined.
Availability of data and materials
HiC data for GM12878 cells is available from GEO with accession GSE63525: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525. HiC data for IMR90 cells is available from GEO with accession GSE35156: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35156.
Contact maps and associated structures corresponding to chainlike and TADlike models [34] were obtained from http://sgt.cnag.cat/3dg/datasets/.
The noisedup helical (regular) and random walk structures and attendant contact matrices utilized in [42] are available from https://people.umass.edu/ouyanglab/hsa/downloads.html#Data.
Abbreviations
 3D:

Three dimensional
 FISH:

Fluorescence in situ hybridization
 GEO:

Gene expression Omnibus
 HIRs:

Highly interactive regions
 HSA:

Hamiltonian simulated annealing
 kNNs:

k Nearest neighbors
 MDS:

Multidimensional scaling
 SCC:

Stratified correlation coefficient
 SKLLS:

Sobhy, Kumar, Lewerentz, Lizana, Stenberg
 StatDn:

Stationary distribution
 TAD:

Topologically associated domain
 TPM:

Transition probability matrix
References
Agresti A. Categorical data analysis. 3rd Ed. New York: Wiley; 2012.
Ay F, Bunnik EM, Varoquaux N, Bol SM, Prudhomme J, Vert JP, Noble WS, Le Roch KG. Threedimensional modeling of the P. falciparum genome during the erythrocytic cycle reveals a strong connection between genome architecture and gene expression. Genome Res. 2014; 24(6):974–88.
Beagrie RA, Scialdone A, Schueler M, Kraemer DC, Chotalia M, Xie SQ, Barbieri M, de Santiago I, Lavitas LM, Branco MR, Fraser J, Dostie J, Game L, Dillon N, Edwards PA, Nicodemi M, Pombo A. Complex multienhancer contacts captured by genome architecture mapping. Nature. 2017; 543(7646):519–24.
BenElazar S, Yakhini Z, Yanai I. Spatial localization of coregulated genes exceeds genomic gene clustering in the Saccharomyces cerevisiae genome. Nucleic Acids Res. 2013; 41(4):2191–201.
Beygelzimer A, Kakadet S, Langford J, Arya S, Mount D, Li S. FNN: Fast nearest neighbor search algorithms and applications. R package version 1.1.3. 2019. https://CRAN.Rproject.org/package=FNN. Accessed 2019.
Capurso D, Segal MR. Distancebased assessment of the localization of functional annotations in 3D genome reconstructions. BMC Genomics. 2014; 15:992.
Capurso D, Bengtsson H, Segal MR. Discovering hotspots in functional genomic data superposed on 3D chromatin configuration reconstructions. Nucleic Acids Res. 2016; 44(5):2028–35.
Cournac A, MarieNelly H, Marbouty M, Koszul R, Mozziconacci J. Normalization of a chromosomal contact map. BMC Genomics. 2012; 13:436.
Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485(7398):376–80.
Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, Shendure J, Fields S, Blau CA, Noble WS. A threedimensional model of the yeast genome. Nature. 2010; 465(7296):363–7.
Hu M, Deng K, Selvaraj S, Qin Z, Ren B, Liu JS. HiCNorm: removing biases in HiC data via Poisson regression. Bioinformatics. 2012; 28(23):3131–3.
Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, Dekker J, Mirny LA. Iterative correction of HiC data reveals hallmarks of chromosome organization. Nat Meth. 2011; 9:999–1003.
Kalhor R, Tjong H, Jayathilaka N, Alber F, Chen L. Genome architectures revealed by tethered chromosome conformation capture and populationbased modeling. Nat Biotech. 2011; 30:90–8.
Lee CS, Wang RW, Chang HH, Capurso D, Segal MR, Haber JE. Chromosome position determines the success of doublestrand break repair. Proc Natl Acad Sci. 2016; 113(2):E146–54.
de Leeuw J, Mair P. Multidimensional scaling using majorization: SMACOF in R. J Stat Softw. 2009; 31(3):1–30.
Lesne A, Riposo J, Roger P, Cournac A, Mozziconacci J. 3D genome reconstruction from chromosomal contacts. Nat Meth. 2014; 11(11):1141–3.
Li W, Gong K, Li Q, Alber F, Zhou XJ. HiCorrector: A fast, scalable and memoryefficient package for normalizing largescale HiC data. Bioinformatics. 2015; 31(6):960–2.
LiebermanAiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J. Comprehensive mapping of longrange contacts reveals folding principles of the human genome. Science. 2009; 326:289–93.
Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, Laue ED, Tanay A, Fraser P. Singlecell HiC reveals celltocell variability in chromosome structure. Nature. 2013; 502(7469):59–64.
Park J, Lin S. A random effect model for reconstruction of spatial chromatin structure. Biometrics. 2017; 73(1):52–62.
Qiu Y, Mei J. RSpectra: Solvers for largescale eigenvalue and SVD problems. R package version 0.150. 2019. https://CRAN.Rproject.org/package=RSpectra. Accessed 2019.
Ramani V, Deng X, Gunderson KL, Steemers FJ, Disteche CM, Noble WS, Duan Z, Shendure J. Massively multiplex singlecell HiC. Nat Meth. 2017; 14(3):263–6.
Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159(7):1665–80.
Rieber L, Mahony S. miniMDS: 3D structural inference from highresolution HiC data. Bioinformatics. 2017; 33:i261–6.
Segal MR. Representative curves for longitudinal data via regression trees. J Comp Graph Stat. 1994; 3:214–33.
Segal MR, Xiong H, Capurso D, Vazquez M, Arsuaga J. Reproducibility of 3D chromatin configuration reconstructions. Biostatistics. 2014; 15(3):442–56.
Segal MR, Bengtsson HL. Reconstruction of 3D genome architecture via a twostage algorithm. BMC Bioinformatics. 2015; 16:373.
Segal MR, Bengtsson HL. Improved accuracy assessment for 3D genome reconstructions. BMC Bioinformatics. 2018; 19(1):196.
Shavit Y, Hamey FK, Lio P. FisHiCal: an R package for iterative FISHbased calibration of HiC data. Bioinformatics. 2014; 30(21):3120–2.
Sobhy H, Kumar R, Lewerentz J, Lizana L, Stenberg P. Highly interacting regions of the human genome are enriched with enhancers and bound by DNA repair proteins. Sci Rep. 2019; 9(1):4577.
Stevens TJ, Lando D, Basu S, Atkinson L, Cao Y, Lee S, Leeb M, Wohlfahrt KJ, Boucher W, O’ShaughnessyKirwan A, Cramard J, Faure AJ, Ralser M, Blanco E, Morey L, Sansó M, Palayret MGS, Lehner B, Di Croce L, Wutz A, Hendrich B, Klenerman D, Laue ED. 3D structure of individual mammalian genomes studied by single cell HiC. Nature. 2017; 544(7648):59–64.
Szalaj P, Tang Z, Michalski P, Pietal MJ, Luo OJ, Sadowski M, Li X, Radew K, Ruan Y, Plewczynski D. An integrated 3Dimensional Genome Modeling Engine for datadriven simulation of spatial genome organization. Genome Res. 2016; 26(12):1697–709.
Tjong H, Gong K, Chen L, Alber F. Physical tethering and volume exclusion determine higherorder genome organization in budding yeast. Genome Res. 2012; 22:1295–1305.
Trussart M, Serra F, Baú D, Junier I, Serrano L, MartiRenom MA. Assessing the limits of restraintbased 3D modeling of genomes and genomic domains. Nucleic Acids Res. 2015; 43(7):3465–77.
Varoquaux N, Ay F, Noble WS, Vert JP. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014; 30(12):i26–33.
Wang S, Su JH, Beliveau BJ, Bintu B, Moffitt JR, Wu CT, Zhuang X. Spatial organization of chromatin domains and compartments in single chromosomes. Science. 2016; 353(6299):598–602.
Witten DM, Noble WS. On the assessment of statistical significance of threedimensional colocalization of sets of genomic elements. Nucleic Acids Res. 2012; 40(9):3849–55.
Yaffe E, Tanay A. Probabilistic modeling of HiC contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet. 2011; 43:1059–65.
Yang T, Zhang F, Yardimci GG, Song F, Hardison RC, Noble WS, Yue F, Li Q. HiCRep: assessing the reproducibility of HiC data using a stratumadjusted correlation coefficient. Genome Res. 2017; 27(11):1939–49.
Yang T. hicrep: Measuring the reproducibility of HiC data. R package version 1.6.0. 2019. https://www.bioconductor.org/packages/release/bioc/html/hicrep.html.
Zhang Z, Li G, Toh KC, Sung WK. 3D chromosome modeling with semidefinite programming and HiC data. J Comp Biol. 2013; 20(11):831–46.
Zou C, Zhang Y, Ouyang Z. HSA: integrating multitrack HiC data for genomescale reconstruction of 3D chromatin structure. Genome Biol. 2016; 17:40.
Acknowledgements
The authors thank Trevor Hastie for helpful comments.
Funding
Support was provided by NIH grant R01GM109457. The funding body played no role in the design of the study, the collection, analysis, and interpretation of data, or in the writing of the manuscript.
Author information
Authors and Affiliations
Contributions
MRS conceived the study, performed analyses and wrote the manuscript. KFB performed analyses. Both authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
MRS is an Associate Editor for BMC Bioinformatics.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Segal, M.R., FletezBrant, K. Assessing stationary distributions derived from chromatin contact maps. BMC Bioinformatics 21, 73 (2020). https://doi.org/10.1186/s128590203424y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128590203424y