Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling
© Rousseau et al; licensee BioMed Central Ltd. 2011
Received: 18 May 2011
Accepted: 25 October 2011
Published: 25 October 2011
Long-range interactions between regulatory DNA elements such as enhancers, insulators and promoters play an important role in regulating transcription. As chromatin contacts have been found throughout the human genome and in different cell types, spatial transcriptional control is now viewed as a general mechanism of gene expression regulation. Chromosome Conformation Capture Carbon Copy (5C) and its variant Hi-C are techniques used to measure the interaction frequency (IF) between specific regions of the genome. Our goal is to use the IF data generated by these experiments to computationally model and analyze three-dimensional chromatin organization.
We formulate a probabilistic model linking 5C/Hi-C data to physical distances and describe a Markov chain Monte Carlo (MCMC) approach called MCMC5C to generate a representative sample from the posterior distribution over structures from IF data. Structures produced from parallel MCMC runs on the same dataset demonstrate that our MCMC method mixes quickly and is able to sample from the posterior distribution of structures and find subclasses of structures. Structural properties (base looping, condensation, and local density) were defined and their distribution measured across the ensembles of structures generated. We applied these methods to a biological model of human myelomonocyte cellular differentiation and identified distinct chromatin conformation signatures (CCSs) corresponding to each of the cellular states. We also demonstrate the ability of our method to run on Hi-C data and produce a model of human chromosome 14 at 1Mb resolution that is consistent with previously observed structural properties as measured by 3D-FISH.
We believe that tools like MCMC5C are essential for the reliable analysis of data from the 3C-derived techniques such as 5C and Hi-C. By integrating complex, high-dimensional and noisy datasets into an easy to interpret ensemble of three-dimensional conformations, MCMC5C allows researchers to reliably interpret the result of their assay and contrast conformations under different conditions.
In the nucleus, genomic DNA exists in the form of chromatin, which is tightly packaged and organized into higher-level structures required for proper genome function [1, 2]. Chromatin conformation is highly dynamic and modified by several biological processes such as DNA replication, repair and transcription. The three-dimensional chromatin organization itself was recently found to play an important role in transcription regulation [3–5] and can be used to define chromatin signatures [6–9]. For example, it was shown that elements that lie far apart in the one-dimensional genomic sequence or on different chromosomes could functionally interact through physical contacts [10–12]. One such example is the 100-kb imprinted Igf2/H19 locus on human chromosome 11 where there exists an imprinting control region (ICR) located between the Igf2 gene and its enhancer sequence. On the maternal allele, CTCF (a known insulator protein) is able to bind the unmethylated ICR and subsequently forms multiple long-range looping contacts along the locus that block gene-enhancer interaction. However, the paternal ICR is methylated and cannot be bound by CTCF, thus allowing the Igf2 gene and its enhancer sequence to interact through a long-range loop, thereby regulating expression to only the paternal allele [13–16]. Such long-range interactions have been found throughout metazoan genomes where thus far many of them appear to correlate well with the transcriptional state of target genes [6, 17–20].
Although we still do not know how many types of contacts exist or how the majority of them are regulated, it is now clear that spatial transcriptional control is an important mechanism of gene regulation. Thus, mapping of physical contacts within (cis) and between (trans) chromosomes will be essential to fully understand gene regulation.
Several techniques are now available to examine chromatin structure at high-resolution, such as DamID , and more recent approaches including Chromosome Conformation Capture (3C) , Circular Chromosome Conformation Capture (4C) [23, 24], Chromosome Conformation Capture Carbon Copy (5C) , Chromatin interaction analysis with paired-end tag sequencing (ChIA-PET) , the technology developed by Duan et al. , and Hi-C . These techniques combine various high-throughput approaches, such as microarrays and next-generation sequencing, and produce large datasets. In the case of 5C and Hi-C, the measurements obtained consist of pairwise interaction frequency values that are proportional to the proximity of the chromatin fragments in the nuclear space in vivo. These data broadly define the three-dimensional conformation of chromatin. It is important to note that these assays are not performed on a single cell, but rather a population of cells, and these data thereby represent population-average measurements of the degree of interaction between chromatin fragments that require tailored bioinformatics tools for interpretation. In this paper, we propose a computational approach to robustly infer ensembles of chromatin conformations that are supported by a given 5C or Hi-C dataset. These three-dimensional models of chromatin conformation can be analyzed to determine robust structural properties.
Recently, several approaches have been proposed to model chromatin 3D conformation from interaction frequency (IF) data. In previous work , we developed a program called 5C3D that first translates IF values into physical distance estimates and then uses a gradient descent approach to find the 3D conformation with the best fit to the observed data based on a simple misfit objective function. Bau et al. [17, 18] proposed 3D models of the α-globin locus based on 5C data. They formulate an optimization problem where pairwise interactions are modeled with springs whose equilibrium length depends on the observed IF values, subject to certain constraints based on the structure of the 30-nm fiber. They then use the Integrative Modeling Platform (IMP; http://salilab.org/imp/) to produce a set of possible conformations that satisfy the constraints while maximizing the fit to the IF data. Duan et al.  proceed similarly to obtain a model of the budding yeast chromatin conformation based on data obtained using a modification of the 4C technology coupled with high-throughput sequencing. They first convert observed interaction frequencies to Euclidean distances and then seek the chromatin conformation that minimizes the same measure of misfit as 5C3D, with the addition of a set of clash avoidance constraints, and a few biologically-motivated constraints based on prior knowledge about the yeast genome organization. The constrained optimization problem is solved using an optimization package to produce the best fitting structure. A very similar approach is used by Tanizawa et al.  to model the genome of fission yeast. Of all these approaches, 5C3D is the only one we are aware of that comes with stand-alone software.
Although these approaches differ slightly in the manner in which IF data is translated into distance constraints, the set of additional constraints included in the model, and the way the resulting system of equations is solved, they all have the merit of turning a set of noisy IF measurements into a more interpretable read out. By integrating O(n2) noisy IF measurements into O(n) predictions about the 3D location of each fragment, they also potentially produce an output that is more reliable than any of the individual IF measurements it is based on. However, these approaches suffer from two significant drawbacks. First, the objective function (always some form of sum-of-squared differences between predicted and IF-derived distance) is debatable, as, among other things, it assumes that each IF measurement is equally reliable. Second, the structures obtained come with no guarantee of representativity or reliable measure of uncertainty. Acknowledging this limitation, Baú et al. [17, 28] proposed a heuristic approach to generate sets of candidate structures. However, because none of these approaches are based on a probabilistic model integrating an IF noise model, the set of sampled structures may not be representative of the true (probabilistically weighted) set of possible structures. Even though the approach used by Baú et al. produces an ensemble of solutions, the absence of an underlying probabilistic model prevents the calculation of confidence intervals on specific structural properties (e.g. the distance between two sites along the genome) and do not identify statistically significant conformational features.
In this paper, we introduce MCMC5C, a computational probabilistic modeling approach for inferring chromatin three-dimensional structure from 5C or Hi-C experiments. Our approach is based on a formal probabilistic model of interaction frequencies and their link with physical distance and uses a Markov chain Monte Carlo sampling procedure to produce an ensemble of candidate conformations for a given 5C dataset. Unlike gradient descent approaches, MCMC5C allows (at least in theory) a proper sampling of the structure state space. This set of structures can be used to obtain posterior distributions over specific structural properties, contrast structural properties of chromatin under different conditions, or determine the existence of multiple model subclasses that fit the experimental data.
Markov chain Monte Carlo approaches have been widely applied to numerous computational and biological problems, such as the prediction of RNA structure [30, 31] or protein structure [32, 33], phylogenetic inference [34, 35], and sequence alignment [36, 37]. Our particular application shares some resemblance with the problem of inferring protein structure from nuclear magnetic resonance (NMR) data, which measures distances between hydrogen atoms in a molecule [38, 39]. Although existing software for NMR-based protein structure prediction are not applicable to our problem because they are tightly based on specifics of NMR data and amino acid structures, MCMC approaches are commonly used to produce robust ensembles of candidate structures based on noisy distance data.
The rest of this paper is structured as follows. After a brief introduction to the 5C and Hi-C technologies, we introduce a probabilistic model of the link between 5C or Hi-C data and 3D chromatin conformation. We then describe a MCMC-based algorithm that quickly produces an ensemble of structures, and then show how key features of the chromatin structure can be robustly estimated. Our approach is used for the analysis of three 5C datasets generated for the region of human chromosome 7 containing the HoxA gene cluster in both undifferentiated myelomonocytes and differentiated macrophages, revealing key changes in chromatin conformation. We also show that the MCMC5C program can be applied to Hi-C data by generating a three-dimensional model of human chromosome 14 at a 1 Mb resolution from previously published data .
Summary of Chromosome Conformation Capture Carbon Copy (5C) and Hi-C technologies
5C quantifies 3C ligation products using a modified Ligation-Mediated Amplification (LMA) approach, which has also been described in detail elsewhere . Briefly, 3C ligation products are detected with specially designed 5C primers that are complementary to the region(s) of interest and lie immediately upstream of the predicted 3C ligation junctions. Taq DNA ligase is then added to specifically ligate 5C primers at the junction of 3C products. LMA reactions can be performed at high level of multiplexing such that hundreds of primers could potentially be used in a single experiment to measure thousands of predicted chromatin contacts simultaneously. Universal tail sequences, located at the end of 5C primers, are then used to amplify and fluorescently label the library of synthetic 5C ligation products in a single PCR step. The labeled products are finally hybridized to custom microarrays for quantification. Conversion of microarray fluorescence intensities data obtained from a 5C experiment to interaction frequencies (IF) data can be performed using the IFCalculator program as described in [19, 40]. Briefly, IFCalculator starts by excluding probes with intensity signals close to background and then combines the background-subtracted intensities of the remaining probes for the same fragment pair to obtain IF values and their standard deviations.
Hi-C data is generated in a similar manner to 5C data and was first described in Lieberman-Aiden et al. . The technique includes additional steps of biotin fill-in and shearing before pull-down and paired-end sequencing. Hi-C can thereby be performed on a genome-wide scale and obviates the need for designing specific probes for each predicted pairwise junction. Its main drawback is the depth of sequencing required to obtain a good resolution at the IF level.
Although these assays are typically performed on diploid cells, the intra-chromosomal contacts found by both the 5C and Hi-C technologies can be treated as occurring within one homolog, as it has been previously shown that homologous copies of each chromosome occupy distinct nuclear positioning [41, 42].
In this section, we describe a probabilistic model of 5C/Hi-C interaction frequency data and the link between that data and the underlying chromatin 3D conformation. We then describe a Markov chain Monte Carlo (MCMC) approach to generate a representative sample of structures based on the experimental 5C IF data.
Modeling chromatin conformations and 5C/Hi-C data
We model a chromosome (or a region of a chromosome) as a continuous piece-wise linear curve in 3D, where restriction site i is located at position S(i) = (S x (i), S y (i), S z (i)). The set of fragment end positions S = S(1), S(2),...,S(n), where n is the number of restriction sites considered, constitutes the conformation of the genomic region. In order to remain as general as possible and avoid introducing biases, we place no constraint on S. However, we discuss below how various types of priors or constraints could be used.
for some value of α. The choice of the value of α is discussed in Results.
where N (x; μ, σ2) is the normal density function.
The role of κ, which we set to 10, is to avoid having small read counts being assigned too low a variance.
This defines the posterior probability distribution over the space of structures, conditional on the observed IF data. A gradient descent approach, similar to that presented in 5C3D by , could be used to identify locally optimal structures. However, there are often several different structures that fit the data almost equally well, so a probabilistic sampling approach that produces an ensemble of possible structures is advantageous.
Sampling conformations from the posterior distribution
The Markov chain Monte Carlo (MCMC) algorithm is a method used to sample from a complex distribution (in this instance, from the posterior distribution of S given ), resulting in an ensemble of solutions X1, X2,..., X N . Sampling from the posterior distribution consists of selecting an ensemble of conformations, where each conformation is selected with probability equal to its posterior probability. This is in contrast with maximum likelihood approaches, that seek to identify the (usually unique) structure S* with the highest likelihood given the observed data. Usually the structure with the highest likelihood in our ensemble is a good approximation to S*, but the ensemble allows a much deeper understanding of the structure of the solution space. This sampling is performed using the Metropolis-Hastings algorithm . A random structure R0 is initially chosen to seed the process (t = 0), where each point is placed randomly in a cube of side length . We then iterate the following procedure. The current structure R t is randomly perturbed (see below) to obtain a new structure . The posterior probability of the two structures are then compared. If , the perturbation is retained and we set . Otherwise, we retain with probability , but set Rt+1= R t otherwise. Torrie and Valleau  showed that for values of t sufficiently large, and thus that the structures sampled are representative of the true posterior distribution. The period required for the Markov process to mix, known as the burn-in period, depends on the problem size and the type of perturbation performed.
The choice of the type of random perturbation to be performed can have a major impact on the length of the burn-in period. Perturbations must allow a quick and complete exploration of the conformation space, while only modifying the current conformation in a local manner. In addition, it is beneficial if the likelihoods of the new and old structures can be computed and compared quickly. In the context of protein structure prediction, the most commonly used approach is to randomly modify one of the bond angles between consecutive amino acids. Although this approach is in principle applicable to our type of data, it would yield poor results, as a large number of pairwise distances would be significantly modified by any angular change. Instead, we elected to perturb structures by randomly choosing one point S(i) along the structure and moving it by a vector randomly chosen within a sphere of radius r (manual investigation showed that r = 0.25 nm yields good results for both 5C and Hi-C data). Clearly this type of perturbation allows the exploration of the full structure space from any starting configuration. The likelihood of the resulting structure is then quickly obtained from that of the old by updating the terms corresponding to the pairs of points involving i.
During the first iterations of the MCMC sampling process, called the burn-in phase, structures R1,...,R k are highly dependent on R0, the initial structure, and do not represent a proper sample in our conditional probability distribution. It is critical to be able to determine at what point m the Markov process has mixed, i.e. for what value of m is R m essentially independent of R0. After mixing, i.e. for k ≥ m, any sample R k is representative of the target distribution. Furthermore, for δ sufficiently large, samples R k and Rk + δare independent.
Several approaches exist to determine when a Markov chain has mixed, and what value of δ is suitable. The standard approach is to compare the probability distributions over the state space obtained from parallel runs started from different initial conformations, and keep sampling until the two become indistinguishable. Because our state space is continuous and high-dimensional (3 n parameters), no structure is actually ever sampled more than once, making this approach unusable. A literature search did not yield a ready-made solution for assessing the convergence of MCMC for structural inference, so we generalize the standard approach as follows. We run two independent chains R and R' in parallel, from independently chosen initial conformations R0 and . After k iterations, we say that mixing is achieved if the samples and cannot be distinguished from each other. Specifically, the average pairwise structural distances (see below) among structures in is compared to the average pairwise distances between pairs of conformations from . If the two means are within 10% of each other, we conclude that mixing is achieved and start collecting samples every δ = k/20 iterations: X1 = R k , X2 = Rk + δ, X3 = Rk + 2·δ,...,X N = Rk + (N - 1)·δ. This not only ensures that mixing has occurred, but also that subsequent samples, taken every δ iterations, are essentially independent.
Clustering of structure ensemble
The set of structures X1, X2,..., X N sampled by the MCMC5C program is representative of the distribution of structures that fit the observed interaction frequency data. In several cases, it can be useful to cluster structures from this ensemble based on their similarity, for example to identify subfamilies of structures whose properties can be assessed and contrasted. In addition, when ensembles from parallel runs are obtained, mixing can be assessed by verifying whether structures from each run cluster together (in which case mixing is not achieved) or not. Finally, ensembles from MCMC runs executed on different datasets can reveal similarly/dissimilarity between chromatin conformations under different conditions.
Note that two structures that are mirror images of each other will have distance zero. Indeed, such structures cannot be distinguished based on 5C/Hi-C data. The structures from an ensemble X0, X1,..., X N are clustered by first computing dist(X i , X j ) for all 1 ≤ i ≤ j ≤ N and then using Ward hierarchical clustering . This clustering is used to determine the existence and number of structure subfamilies and the members of each subfamily. Visualization is accomplished with both a hierarchical tree dendrogram and a heatmap representation. Visual inspection is performed to determine the tree height cutoff and number of subfamilies and for each subfamily the member structure with the highest posterior probability is chosen as the representative structure for that cluster. Choosing the maximum likelihood structure from each cluster as representative and assigning it a weight proportional to the number of the structures in its cluster allows focusing on a small number of representative structures.
Identification of reliable substructures
The ensemble of structures generated by the MCMC5C program will typically contain substructures that are highly constrained by the data and are thus present in the vast majority of structures, and others that are highly variable. Knowing what aspects of the reported structure are reliable is critical to guide downstream experimental validation. While this can sometimes be done by visual inspection of the superimposition of the structures from the sample, a more automated approach is usually desirable. This can be achieved by identifying a subset of k fragments whose pairwise distances are best conserved across the structures in the ensemble. To this end, we first compute the standard deviation s(i, j) of the intra-structure pairwise distance for each pair of points i and j, across all samples from the ensemble. We then identify the set of k fragments with the smallest total pairwise standard deviation using a greedy algorithm.
Measuring structural properties
Results and Discussion
Our modeling approaches were applied to three sets of 5C data studying the chromatin structure of the HoxA cluster (see Dostie et al.  and Figure 1 for a summary of the 5C procedure). The first pair of experiments (previously published in ) studies the conformation of the HoxA cluster during THP-1 cell differentiation from myelomonocyte to macrophage. 5C libraries were produced in both the undifferentiated myelomonocyte state and in the differentiated macrophage state (96 hours after treatment with phorbol myristate acetate (PMA)). For the third set of experiments (unpublished data), 5C data was generated for the same genomic region in a MLL-ENL fusion cell line (HB-1119) that expresses a different MLL-fusion protein than the THP-1 cell line and induces aberrant over-expression of the 5' HoxA genes [48, 49]. In both datasets, the genomic region analyzed spans 142 kb and contains 11 protein coding genes. The region contains 42 restriction sites for the BglII restriction enzyme, which was used for the experiment. Each 5C library was hybridized onto a custom array with a set of probes corresponding to every potential pair of fragments (due to the forward and reverse primer design used, only interaction frequencies between an even numbered and an odd numbered fragment are measured). The set of probe intensities were normalized using corresponding gene desert regions as previously described in Fraser et al.  and analyzed using the IFCalculator program [19, 40] to perform outlier detection and obtain interaction frequency and standard deviation estimates for every fragment pair considered. Although nearby sites along the sequence have elevated interaction frequencies, IFs between pairs of fragments located more than 10 kb are generally close to background levels, with several notable exceptions likely resulting from chromatin looping (see Additional File 1).
Choice of distance-to-IF transformation
Although it is clear that pairwise interaction frequencies are inversely correlated with the physical distance between any pair of fragments in the chromatin conformation [18, 22], there is no consensus on how IF depends on physical distance. Duan et al.  perform distance-to-IF conversions by first considering only short-range interactions (involving pairs of points that are close together along the sequence) and obtaining physical distances for these pairs based on polymer models. A given long-range IF value is then mapped to the polymer-based distance that is the most likely to have resulted in that value. The resulting conversion approximately follows d ∝ 1/IF. Mateos-Langerak et al.  also suggest a relationship of the form d ∝ IF α . Bau et al.  convert their IF via a linear transformation of the IF's z-score. Tanizawa et al.  relate IF to physical distance by using a loess regression on a set of physical distances measured by 3D-FISH, but do not report the parameters of this regression. The extent to which the function mapping IF values to physical distance depends on the specific experimental protocol remains unclear.
Mixing and convergence
Additional File 2 compares the likelihood of the structures sampled by MCMC5C to those found by several runs of the gradient descent program 5C3D, started from different initial structures. Although both approaches succeed at identifying credible structures, we observe that the structures found by 5C3D generally have lower likelihoods than those sampled by MCMC5C - indeed, the misfit function optimized by 5C3D is not equivalent to the likelihood function, which explains the slight decrease in likelihoods observed for many 5C3D runs past a certain number of iterations. Importantly, the five 5C3D runs converge to three different solutions, hinting that this type of approach is subject to getting stuck in local optima.
Accuracy of structure predictions on simulated data
Interestingly, the set of four maximum likelihood structures found by the four parallel MCMC5C runs actually contained topological mirror-images of otherwise nearly identical structures. These ''enantiomer'' structures have equal probability given our model of IF data and the structures were mirrored as required before the superimposition shown in Figure 6 was performed.
Clustering of conformational ensemble
Upon analysis of the mixing between two parallel MCMC5C runs on the HB-1119 5C dataset, we observed two distinct clusters in the heatmap (see Figure 5) that correspond to two subclasses of structures, each of which is sampled from both of the two parallel runs (seen by the mixing of the blue (run one) and red (run two) labels at the top of the heatmap). The clusters obtained are robust to changes in the clustering algorithm: the cluster membership determined by the hierarchical Ward clustering algorithm agrees at 85% with that obtained by the k-means algorithm, which operates in structure space rather than based on a distance matrix, suggesting that the two main clusters are indeed distinct and well separated. We note that we do not necessarily expect these two clusters to reflect two different chromatin conformations present in the population of cells used to generate the 3C library. Instead, they represent two possible conformations for the population-wide average conformation.
The posterior probability of each class can be estimated as the fraction of the samples belonging to it. The two largest clusters, whose structures mainly differ in the position of the loop in the region lying between the HoxA11 and HoxA13 genes, account for 42% and 58% of the structures sampled (these two main classes are not the two enantiomers discussed above - indeed, because of our structure similarity measure, enantiomers are considered as identical). This finding illustrates one of the benefits of MCMC5C over 5C3D by demonstrating the ability to discover different subclasses of structures that fit the experimental data almost equally well.
Analysis of HoxA conformational ensembles
While visualization is a powerful analysis approach, chromatin regions whose structure is well supported by the 5C data are better identified by our reliable subset identification algorithm, which identifies, from a given ensemble of structures, the subset of fragments whose spatial relationship varies the least within the ensemble. The subset of fragments that are the most conserved across the ensemble of structures (see Additional File 9) are found to lie within the central core region of the structures. These fragments are spatially close to each other and may be involved in looping contacts that are important for the maintenance of the chromatin structure and are therefore highly conserved. These results are observed in the ensembles of structures for both of the cellular states, whereby the most conserved substructures are found to lie within the regions corresponding to the strongest contact points.
Estimation of structural properties
Analysis of a Hi-C dataset
Implementation and running time
The MCMC5C program is implemented in Java and is available at http://Dostielab.biochem.mcgill.ca. The program takes the experimental interaction frequency data (with standard deviations) and the restriction enzyme genomic cut sites as input and produces an ensemble of structures as PDB files as output. Individual runs were performed on a 2.26 Ghz Intel Core 2 duo machine with 4 GB of RAM, while simultaneous parallel runs were performed on a cluster comprised of 20 Apple dual-processor 2.3-GHz compute-node G5s (2 GB of RAM each). Execution time increases with the number n of fragments in the structure. Each MCMC iteration runs in time O(n), as only the fragment pairs involving the fragment that was moved need updating. However execution time is mostly driven by the time to mixing, which not only depends on the size of the structure but also on how "unique" the solution is; a situation where the pool of likely structures is very small will lead to faster mixing than one where the set of possible solutions is much larger and involves many very different structures. Each of our 5C data sets consisted of 41 fragments, yielding between 335 and 398 interaction frequency measurements (the IFCalculator excluded some measurements because of their excessive variance between microarray probe replicates). Mixing was achieved in approximately 3.5 × 107 iterations for each data set, which took ~250 seconds. Ensembles of 1000 structures were then obtained by running the chain for approximately two hours. For the analysis of the Hi-C data from human chromosome 14, which consists of 89 fragments and 3916 IF pairs, mixing was achieved after 4 × 107 iterations (~800 seconds) and 250 structures were obtained in approximately 2.5 hours. However, our attempts to use MCMC5C on the full Hi-C dataset from Lieberman-Aiden et al. , consisting of data from all 23 human chromosomes, failed to achieve mixing after 24 hours of execution.
The role of high-level chromatin conformation in regulating gene expression is now well accepted, although only a few loci have been studied in detail [17, 20, 53–55]. Chromosome conformation capture-based technologies (3C , 5C , Hi-C , and their variants [23, 24, 26, 27, 57]) offer the ability to measure properties of the high-level chromatin organization by measuring the interaction frequency between genomic fragments. The resolution and accuracy of these techniques is rapidly improving, and, with the use of next generation sequencing, their throughput is increasing while their cost is decreasing. For these reasons, these technologies are increasingly popular.
Whereas the technological advances allow increasingly complex assays to be performed, few computational and statistical tools exist to analyze the data resulting from such experiments, although good approaches exist to help design the experiments [18, 19] or handle and visualize their output [17–19, 26, 27]. We previously developed the 5C3D program, which aims at producing the best-fitting conformation for a given dataset. Similar optimization-based approaches have also been used to model the structure of the yeast genome [27, 29], and the α-globin locus [17, 28]. However, the absence of statistical or Bayesian approaches make it impossible to assess the reliability of the predicted conformation. Downstream analyses are thus limited to qualitative observations that may or may not be reliable. In this paper, we introduce a probabilistic framework to address this problem. By sampling from the posterior probability distribution over conformations, MCMC5C produces an ensemble of different structures that are possible given the data and can find subclasses of structures that fit the data equally well. Overlaying these conformations in a visualization tool such as PyMOL readily allows the identification of reliable and less reliable aspects of the conformation. Using ensembles allows the discovery of subclasses of structures and the estimation of structural properties, together with their distribution, which allows the user to focus on statistically sound properties or differences between datasets. Although we acknowledge that more refined probabilistic models of 5C and Hi-C data will eventually be required to improve the accuracy of the structure predictions, those will be easily accommodated with MCMC5C.
None of the existing computational approaches to model 3D chromatin structures make use of advanced physical models of DNA and chromatin, although the approach of Duan et al.  uses a simple polymer physics model to transform interaction frequency, while Tanizawa et al. include simple sets of constraints derived from polymer physics. The methodology described in this paper attempts to model chromatin without specifying any type of hard constraints on the predicted structure, although such constraints could easily be included if desired. Our probabilistic framework also allows for the easy integration of structure priors based on free energy. Although excellent models of polymers exist (e.g. Langowski and Heermann ), it is unclear to what extent these models are informative at the scale we are considering (average fragment size of 4 kb in the case of our 5C data and 1Mb in the case of the Hi-C data).
A number of interesting directions should be investigated in the future. Time to mixing remains the main obstacle to running MCMC5C on very large datasets such as the whole-genome Hi-C dataset of Lieberman-Aiden et al. . We are currently working on considering other types of structural perturbations for the MCMC sampling, such as modifying the torsion of a given fragment or the angle between two fragments, or a combination of several types of perturbation. These advances should allow for more rapid sampling from the structure space, thereby aiding in the discovery of alternative conformations belonging to small subclusters of structures.
To conclude, we believe that probabilistic tools like MCMC5C are essential for the reliable analysis of data from the 3C-derived techniques such as 5C and Hi-C. By integrating complex, high-dimensional and noisy datasets into an easy to interpret ensemble of three-dimensional conformations, MCMC5C allows researchers to reliably interpret the result of their assay and contrast conformations under different conditions.
MCMC5C is available at http://Dostielab.biochem.mcgill.ca. Detailed protocols, 3C and 5C support information (design and analysis) can also be found at this location.
We would like to thank Jérôme Waldispühl for his useful input, as well Mathieu Lavalée-Adam and Pablo Cingolani for stimulating discussions and feedback.
Funding: This work is funded in part by a Discovery Grant from the National Sciences and Engineering Research Council of Canada (NSERC) to MB, and grants from the Canadian Institutes of Health Research (CIHR DC0190GP), and the Canadian Cancer Society Research Institute (CCSRI 019252) to JD. MR is supported by NSERC. JF is supported by funds from the Fonds de la Recherche en Santé du Québec (FRSQ) and CIHR. MAF is supported by a CCSRI fellowship. JD is a CIHR New Investigator and FRSQ Research Scholar.
- Fraser P, Bickmore W: Nuclear organization of the genome and the potential for gene regulation. Nature 2007, 447(7143):413–7. 10.1038/nature05916View ArticlePubMedGoogle Scholar
- Babu MM, Janga SC, de Santiago I, Pombo A: Eukaryotic gene regulation in three dimensions and its impact on genome evolution. Curr Opin Genet Dev 2008, 18(6):571–82. 10.1016/j.gde.2008.10.002View ArticlePubMedGoogle Scholar
- Berger SL: The complex language of chromatin regulation during transcription. Nature 2007, 447(7143):407–12. 10.1038/nature05915View ArticlePubMedGoogle Scholar
- Kharchenko PV, Woo CJ, Tolstorukov MY, Kingston RE, Park PJ: Nucleosome positioning in human HOX gene clusters. Genome Res 2008, 18(10):1554–61. 10.1101/gr.075952.107PubMed CentralView ArticlePubMedGoogle Scholar
- Cook PR: A model for all genomes: the role of transcription factories. J Mol Biol 2010, 395: 1–10. 10.1016/j.jmb.2009.10.031View ArticlePubMedGoogle Scholar
- Ferraiuolo MA, Rousseau M, Miyamoto C, Shenker S, Wang XQD, Nadler M, Blanchette M, Dostie J: The three-dimensional architecture of Hox cluster silencing. Nucleic Acids Res 2010.Google Scholar
- Hon G, Wang W, Ren B: Discovery and annotation of functional chromatin signatures in the human genome. PLoS Comput Biol 2009, 5(11):e1000566. 10.1371/journal.pcbi.1000566PubMed CentralView ArticlePubMedGoogle Scholar
- Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol 2008, 4(10):e1000201. 10.1371/journal.pcbi.1000201PubMed CentralView ArticlePubMedGoogle Scholar
- Won KJ, Chepelev I, Ren B, Wang W: Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics 2008, 9: 547. 10.1186/1471-2105-9-547PubMed CentralView ArticlePubMedGoogle Scholar
- Woodcock CL: Chromatin architecture. Curr Opin Struct Biol 2006, 16(2):213–20. 10.1016/j.sbi.2006.02.005View ArticlePubMedGoogle Scholar
- West AG, Fraser P: Remote control of gene transcription. Hum Mol Genet 2005, 14 Spec No 1: R101–11.View ArticlePubMedGoogle Scholar
- Göndör A, Ohlsson R: Chromosome crosstalk in three dimensions. Nature 2009, 461(7261):212–7. 10.1038/nature08453View ArticlePubMedGoogle Scholar
- Kanduri C, Pant V, Loukinov D, Pugacheva E, Qi CF, Wolffe A, Ohlsson R, Lobanenkov VV: Functional association of CTCF with the insulator upstream of the H19 gene is parent of origin-specific and methylation-sensitive. Curr Biol 2000, 10(14):853–6. 10.1016/S0960-9822(00)00597-2View ArticlePubMedGoogle Scholar
- Hark AT, Schoenherr CJ, Katz DJ, Ingram RS, Levorse JM, Tilghman SM: CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature 2000, 405(6785):486–9. 10.1038/35013106View ArticlePubMedGoogle Scholar
- Bell AC, Felsenfeld G: Methylation of a CTCF- dependent boundary controls imprinted expression of the Igf2 gene. Nature 2000, 405(6785):482–5. 10.1038/35013100View ArticlePubMedGoogle Scholar
- Court F, Baniol M, Hagege H, Petit JS, Lelay-Taha MN, Carbonell F, Weber M, Cathala G, Forne T: Long-range chromatin interactions at the mouse Igf2/H19 locus reveal a novel paternally expressed long non-coding RNA. Nucleic Acids Res 2011.Google Scholar
- Baù D, Sanyal A, Lajoie BR, Capriotti E, Byron M, Lawrence JB, Dekker J, Marti-Renom MA: The three-dimensional folding of the alpha-globin gene do- main reveals formation of chromatin globules. Nat Struct Mol Biol 2011, 18: 107–14. 10.1038/nsmb.1936PubMed CentralView ArticlePubMedGoogle Scholar
- Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326(5950):289–293. 10.1126/science.1181369PubMed CentralView ArticlePubMedGoogle Scholar
- Fraser J, Rousseau M, Shenker S, Ferraiuolo M, Hayashizaki Y, Blanchette M, Dostie J: Chromatin conformation signatures of cellular differentiation. Genome Biology 2009, 10(4):R37. [http://genomebiology.com/2009/10/4/R37] 10.1186/gb-2009-10-4-r37PubMed CentralView ArticlePubMedGoogle Scholar
- Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W: Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell 2002, 10(6):1453–65. 10.1016/S1097-2765(02)00781-5View ArticlePubMedGoogle Scholar
- van Steensel B, Henikoff S: Identification of in vivo DNA targets of chromatin proteins using tethered dam methyltransferase. Nat Biotechnol 2000, 18(4):424–8. 10.1038/74487View ArticlePubMedGoogle Scholar
- Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conformation. Science 2002, 295(5558):1306–11. 10.1126/science.1067799View ArticlePubMedGoogle Scholar
- Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, de Wit E, van Steensel B, de Laat W: Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat Genet 2006, 38(11):1348–54. 10.1038/ng1896View ArticlePubMedGoogle Scholar
- Zhao Z, Tavoosidana G, Sjölinder M, Göndör A, Mariano P, Wang S, Kanduri C, Lezcano M, Sandhu KS, Singh U, Pant V, Tiwari V, Kurukuti S, Ohlsson R: Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra-and interchromosomal interactions. Nat Genet 2006, 38(11):1341–7. 10.1038/ng1891View ArticlePubMedGoogle Scholar
- Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, Green RD, Dekker J: Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006, 16(10):1299–309. 10.1101/gr.5571506PubMed CentralView ArticlePubMedGoogle Scholar
- Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, Ariyaratne PN, Mohamed YB, Ooi HS, Tennakoon C, Wei CL, Ruan Y, Sung WK: ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol 2010, 11(2):R22. 10.1186/gb-2010-11-2-r22PubMed CentralView ArticlePubMedGoogle Scholar
- Duan Z, Andronescu M, Schutz K, McIlwain S, Kim YJ, Lee C, Shendure J, Fields S, Blau CA, Noble WS: A three-dimensional model of the yeast genome. Nature 2010.Google Scholar
- Baù D, Marti-Renom MA: Structure determination of genomic domains by satisfaction of spatial restraints. Chromosome Res 2010.Google Scholar
- Tanizawa H, Iwasaki O, Tanaka A, Capizzi J, Wickramasinghe P, Lee M, Fu Z, Noma K: Mapping of long- range associations throughout the fission yeast genome reveals global genome organization linked to transcriptional regulation. Nucleic Acids Res 2010, 38(22):8164–77. 10.1093/nar/gkq955PubMed CentralView ArticlePubMedGoogle Scholar
- Metzler D, Nebel ME: Predicting RNA secondary structures with pseudoknots by MCMC sampling. J Math Biol 2008, 56(1–2):161–81.View ArticlePubMedGoogle Scholar
- Meyer IM, Miklós I: SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput Biol 2007, 3(8):e149. 10.1371/journal.pcbi.0030149PubMed CentralView ArticlePubMedGoogle Scholar
- Boomsma W, Mardia KV, Taylor CC, Ferkinghoff-Borg J, Krogh A, Hamelryck T: A generative, probabilistic model of local protein structure. Proc Natl Acad Sci USA 2008, 105(26):8932–7. 10.1073/pnas.0801715105PubMed CentralView ArticlePubMedGoogle Scholar
- Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL: Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 2003, 20(10):1692–704. 10.1093/molbev/msg184View ArticlePubMedGoogle Scholar
- Rodrigue N, Kleinman CL, Philippe H, Lartillot N: Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. Mol Biol Evol 2009, 26(7):1663–76. 10.1093/molbev/msp078View ArticlePubMedGoogle Scholar
- Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP: Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001, 294(5550):2310–4. 10.1126/science.1065889View ArticlePubMedGoogle Scholar
- Lunter G, Miklós I, Drummond A, Jensen JL, Hein J: Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 2005, 6: 83. 10.1186/1471-2105-6-83PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu J, Liu JS, Lawrence CE: Bayesian adaptive sequence alignment algorithms. Bioinformatics 1998, 14: 25–39. 10.1093/bioinformatics/14.1.25View ArticlePubMedGoogle Scholar
- Wüthrich K: The way to NMR structures of proteins. Nat Struct Biol 2001, 8(11):923–5. 10.1038/nsb1101-923View ArticlePubMedGoogle Scholar
- Rieping W, Habeck M, Nilges M: Inferential structure determination. Science 2005, 309(5732):303–6. 10.1126/science.1110428View ArticlePubMedGoogle Scholar
- Fraser J, Rousseau M, Blanchette M, Dostie J: Computing chromosome conformation. Methods Mol Biol 2010, 674: 251–68. 10.1007/978-1-60761-854-6_16View ArticlePubMedGoogle Scholar
- Lanctôt C, Kaspar C, Cremer T: Positioning of the mouse Hox gene clusters in the nuclei of developing embryos and differentiating embryoid bodies. Exp Cell Res 2007, 313(7):1449–59. 10.1016/j.yexcr.2007.01.027View ArticlePubMedGoogle Scholar
- Cremer M, Grasser F, Lanctôt C, Müller S, Neusser M, Zinner R, Solovei I, Cremer T: Multicolor 3D fluorescence in situ hybridization for imaging interphase chromosomes. Methods Mol Biol 2008, 463: 205–39. 10.1007/978-1-59745-406-3_15View ArticlePubMedGoogle Scholar
- Torrie GM, Valleau JP: Monte-Carlo free energy estimates using non-Boltzmann sampling. Chemical Physics Letters 1974, 28: 578–581. 10.1016/0009-2614(74)80109-0View ArticleGoogle Scholar
- Metropolis , T HA, Rosenbluth AW, Teller E: Equation of state calculation by fast computing machines. Journal of Chemical Physics 1953, 21: 1087–1092. 10.1063/1.1699114View ArticleGoogle Scholar
- Ward J: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 1963, 58(301):236. 10.2307/2282967View ArticleGoogle Scholar
- Cohen FE, Sternberg MJ: On the prediction of protein structure: The significance of the root-mean-square deviation. J Mol Biol 1980, 138(2):321–33. 10.1016/0022-2836(80)90289-2View ArticlePubMedGoogle Scholar
- Maiorov VN, Crippen GM: Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. J Mol Biol 1994, 235(2):625–34. 10.1006/jmbi.1994.1017View ArticlePubMedGoogle Scholar
- Tkachuk DC, Kohler S, Cleary ML: Involvement of a homolog of Drosophila trithorax by 11q23 chromosomal translocations in acute leukemias. Cell 1992, 71(4):691–700. 10.1016/0092-8674(92)90602-9View ArticlePubMedGoogle Scholar
- Ayton PM, Cleary ML: Transformation of myeloid progenitors by MLL oncoproteins is dependent on Hoxa7 and Hoxa9. Genes Dev 2003, 17(18):2298–307. 10.1101/gad.1111603PubMed CentralView ArticlePubMedGoogle Scholar
- Mateos-Langerak J, Bohn M, de Leeuw W, Giromus O, Manders E, Verschure P, Indemans M, Gierman H, Heermann D, van Driel R, Goetze S: Spatially confined folding of chromatin in the interphase nucleus. Proc Natl Acad Sci USA 2009, 106(10):3812–7. 10.1073/pnas.0809501106PubMed CentralView ArticlePubMedGoogle Scholar
- Bystricky K, Heun P, Gehlen L, Langowski J, Gasser SM: Long-range compaction and exibility of interphase chromatin in budding yeast analyzed by high-resolution imaging techniques. Proc Natl Acad Sci USA 2004, 101(47):16495–500. 10.1073/pnas.0402766101PubMed CentralView ArticlePubMedGoogle Scholar
- FANTOM Consortium, et al.: The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet 2009, 41(5):553–62. 10.1038/ng.375View ArticleGoogle Scholar
- Spilianakis CG, Flavell RA: Long-range intrachromosomal interactions in the T helper type 2 cytokine locus. Nat Immunol 2004, 5(10):1017–27. 10.1038/ni1115View ArticlePubMedGoogle Scholar
- Liu Z, Garrard WT: Long-range interactions between three transcriptional enhancers, active Vkappa gene promoters, and a 3' boundary sequence spanning 46 kilobases. Mol Cell Biol 2005, 25(8):3220–31. 10.1128/MCB.25.8.3220-3231.2005PubMed CentralView ArticlePubMedGoogle Scholar
- Crutchley JL, Wang XQD, Ferraiuolo MA, Dostie J: Chromatin conformation signatures: ideal human disease biomarkers? Biomark Med 2010, 4(4):611–29. 10.2217/bmm.10.68View ArticlePubMedGoogle Scholar
- Dostie J, Dekker J: Mapping networks of physical interactions between genomic elements using 5C technology. Nat Protoc 2007, 2(4):988–1002. 10.1038/nprot.2007.116View ArticlePubMedGoogle Scholar
- Rodley CDM, Bertels F, Jones B, O'Sullivan JM: Global identification of yeast chromosome interactions using Genome conformation capture. Fungal Genet Biol 2009, 46(11):879–86. 10.1016/j.fgb.2009.07.006View ArticlePubMedGoogle Scholar
- Schrödinger, LLC: The PyMOL Molecular Graphics System, Version 1.3r1. 2010.Google Scholar
- Langowski J, Heermann DW: Computational modeling of the chromatin fiber. Semin Cell Dev Biol 2007, 18(5):659–67. 10.1016/j.semcdb.2007.08.011View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.