The corrected gene proximity map
Starting from a genome-wide Hi-C contact map, we extracted the contact frequencies between all protein-coding genes, forming a square matrix W (see Methods for details). This matrix can be viewed as the adjacency matrix of a network in which nodes represent genes and the weight of each edge corresponds to the contact frequency between the two genes it connects. We called this weighted network the “gene proximity map” since its weights represent spatial proximity as detected by Hi-C: the higher the weight, the smaller the distance between two genes.
As we described earlier, the weights in the network can be thought of as a combination of the 1D and 3D components of the gene contact frequency. To extract the 3D component, we began by developing a null model for the 1D component, i.e., a model for the spatial positioning of genes based exclusively on their distance on the 1D genome. We shall use E to denote the weight matrix for this null model. Specifically, we assume that the (i, j)-th element Eij, the expected number of contacts between genes i and j, takes the form:
$$ {\displaystyle \begin{array}{l}{E}_{ij}={k}_i{k}_jf\left({d}_{ij}\right)\\ {}\mathrm{where}f\left({d}_{ij}\right)\propto \Big\{\begin{array}{c}{f}^{\mathrm{intra}}\kern0ex (d),\mathrm{if}\ \mathrm{gene}\ \mathrm{i}\ \mathrm{and}\ \mathrm{j}\ \mathrm{are}\ \mathrm{located}\ \mathrm{at}\ \mathrm{the}\ \mathrm{same}\ \mathrm{chromosome}\ \mathrm{and}\ \mathrm{separated}\ \mathrm{by}\ \mathrm{genomic}\ \mathrm{distance}\ \mathrm{d}\\ {}{f}^{\mathrm{inter}},\kern1.36em \mathrm{i}\mathrm{f}\kern0.34em \mathrm{gene}\kern0.28em \mathrm{i}\kern0.28em \mathrm{and}\kern0.28em \mathrm{j}\kern0.28em \mathrm{are}\kern0.28em \mathrm{located}\kern0.28em \mathrm{at}\kern0.28em \mathrm{d}\mathrm{i}\mathrm{f}\mathrm{f}\mathrm{erent}\kern0.34em \mathrm{chromosome}\end{array}\operatorname{}\end{array}}. $$
(1)
There are two cases here: both gene i and gene j locate on the same chromosome; or they are on different chromosomes. For the first case, fintra(d) is a cell type specific function, numerically estimated from the genome-wide Hi-C contact map, that maps the genomic distance d between a pair of genes to an average contact frequency over all pairs of genomic loci that are separated by d. On the other hand, for any pairs of genes that belong to two different chromosomes, we assume they have equal chances to interact. So finter is a constant that is computed from the average contact frequency of all inter-chromosomal interactions (see Methods for details). We further impose the following set of constraints:
$$ {\sum}_j{E}_{ij}={\sum}_j{W}_{ij},\forall i. $$
(2)
These constraints ensure that the total number of contacts mapped to gene i empirically is the same as the number of contacts assigned to it by the null model. By imposing these constraints, ki can be solved using an iterative scheme (see Methods for details). Intuitively, ki can be viewed as the “visibility” of gene i, which is a biological parameter related to gene size, chromatin accessibility, gene location in the nucleus, etc. Genes that are long and located in the center of the nucleus will normally have higher visibility than short genes and genes located on the periphery of the nucleus. Overall, Equation (1) implies that the expected number of contacts between two genes depends on their genomic distance as well as the product of their intrinsic visibility. Highly visible genes are therefore more likely to be in contact with others.
Having defined the null model, which accounts for the 1D component of the contact frequency, we can define the CGP map, denoted by B, as the difference between the observed gene proximity matrix and the developed null model:
In other words, the CGP map quantifies the corrected spatial proximity between genes by eliminating the gene contact frequency component due to the 1D genomic distance from the results of Hi-C experiments. This is motivated by the conventional definition of modularity matrix in network theory, which is defined as the difference between the adjacency matrix and a matrix explaining the expected connectivity pattern of the network, and is shown to be powerful for detecting network communities [20]. Importantly, in our analysis, W is the adjacency matrix of the gene proximity network and B can be thought of as a generalized modularity matrix for the network (see Supplementary Materials for details). Figure 1 shows a schematic of the construction of the CGP map. In our analysis, all genome-wide Hi-C contact maps had been corrected by the ICE technique [21]. Importantly, our framework can be applied to Hi-C maps corrected by any other software tools such as HOMER [22], Hi-Cdat [23] and HiC-Pro [24]. In the following, we demonstrate that the CGP map (B) can capture the correlation between spatial arrangement and co-regulation of genes in 3D better than the raw gene proximity map (W).
CGP better encodes important functional and structural properties of the genome
We investigated whether the CGP map can unravel functional properties of the genome such as gene co-expression. We used RNA-seq data and Hi-C data of 12 cell lines, including 10 cancer cell lines recently published by the ENCODE consortium [25], and two hematopoietic cell lines (GM12878 and K562) in which we have deeply sequenced Hi-C data [26] (see Methods for details). For each cell line and each chromosome, we used the RNA-seq data to build a co-expression matrix C (see Methods for details). Then, for each of the 23 chromosomes, we computed the Pearson correlation coefficient between C and two matrices: the CGP map B and the gene proximity map W. Figure 2 shows the results for GM12878 and K562 cell lines. For all 23 chromosomes: the correlation between B and C remains positive, whereas the correlation between W and C is low and alternates between positive and negative. Note that a normalized version of the gene proximity map \( \overset{\sim }{W} \), similar to the normalized contact map introduced in [11], exhibits the same behavior as W. This analysis demonstrates that the CGP map is indeed better, information-wise, than the raw gene proximity map at explaining gene co-expression in the genome. Similar patterns can be observed in most of the other 10 cell lines (Fig. S1).
A common technical issue in contact map analysis is the diminishing read counts and thus statistical power as separation between loci increases. We investigated to what extent the CGP map could be limited by the issue. In particular, we checked whether the distant pairs dilute the correlation as observed in Fig. 2. As shown in Fig. S1B, by removing the distant pairs, we found that the correlation between co-expression matrix and the CGP map decreases, suggesting the distant pairs contribute to the signal of CGP.
It is well known that a genome could be divided into the so-called A/B compartments, in which compartment A tends to be active whereas compartment B consists of heterochromatin [2, 11]. Therefore, it is instructive to further examine whether the CGP map can capture such information. Motivated by the use of eigenvectors of the modularity matrix for community detection on networks, we formulated the problem as a classification problem and used the components of the leading eigenvectors of the CGP matrix as features (see Methods for details). We found that the CGP matrix works reasonably well in predicting to which compartment a gene belongs (AUC = 0.86, Fig. S2).
CGP captures the interplay between gene spatial positioning and co-regulation
It has long been observed that co-expressed genes, or functionally related genes are likely to be next to each other on the 1D genome [27]. Recent studies further suggest that they are also spatially clustered in the nucleus [28]. Nevertheless, to what extent spatially clustered genes are co-expressed, and to what extent the tendency is attributed to various spatial organizations like compartments and TADs are not well characterized. Here, we quantify the interplay between the spatial positioning of genes and gene co-regulation by phrasing it as an optimization problem using the CGP map. More specifically, we assume the expression states of genes to be either ON (xi = + 1) or OFF (xi = − 1), and explore how the two states are distributed on the gene proximity network (see Methods for details). We define an objective function:
$$ Q=\frac{1}{\sum_{ij}{W}_{ij}}{\sum}_{ij}{B}_{ij}{x}_i{x}_j. $$
(3)
Here, Q quantifies to what extent gene co-expression pattern is related to the underlying spatial proximity between genes. If there is no particular relationship between the spatial positioning of genes and their regulation, Q is close to 0 as genes with different expression states are randomly distributed in 3D (Bij ≈ 0). On the other hand, if a large fraction of co-expressed genes are proximal (Bij > 0), the overall value of Q is high (Fig. 3a). A similar objective function has previously been used to solve the classic network bisection problem [29].
As shown in Fig. 3b, in all 12 cell lines, the value of Q is much higher than 0. To show the statistical significance of the result, we looked at the distribution of Q in an ensemble of gene configurations where the spatial locations of genes are randomly redistributed. As expected, the values of Q generated from the random configurations appear to follow a Gaussian distribution with zero mean. In all tested cell types, the empirical expression profile yields a substantially greater function value, which is always positive, than the permuted profiles. The separation between the values of Q for the empirical configuration and random configurations is a measure of the tendency in which active genes tend to be proximal in space. We confirmed CGP indeed provides a better measure for such a tendency by repeating the analysis with a modified objective function computed using raw gene proximity map. Specifically, we found that the raw gene proximity map is less efficient at capturing the interplay between the spatial positioning of genes and co-regulation (Fig. S3).
We further explored to what extent the tendency is attributed to characterized spatial structures. For instance, it is well known that the genome is organized into A/B compartments. We therefore generated another ensemble by shuffling the genes in A/B compartments separately. We found that although the resultant distribution of Q is significantly higher than the null distribution (Figure 3c, P = 1.2 × 10−21 in GM12878 and P = 3.1 × 10−19 in K562), it is very much lower than the empirical Q. The analysis suggests that the compartment organization alone cannot account for the co-localization of expressed genes.
Genes within a compartment are further organized. Spatial structures of particular interest are the topologically associating domains [30, 31]. Topologically associating domains (TADs) are domains of self-interacting chromatin that have emerged as a fundamental structural unit in genome organization [32]. To investigate to what extent TADs contribute to the value of Q, we generated two more ensembles: One by shuffling the genes in TADs separately; the other one by permuting the TADs first and then shuffling the genes within the permuted TADs (Figure 3c). Since the permuted TADs are essentially random blocks with the same size distribution as the original TADs, the distribution of Q generated by this model in CGP quantifies the effect of genes from forming TADs in contrast to the effect of genes from simply being close in 1D sequence.
The different ensembles shown in Figure 3c take into account different levels of spatial organization. Since the null ensemble refers to a case in which genes are distributed with no regard to spatial structure, the separation between the null and the distribution of Q in the compartment-preserving ensemble or TAD-preserving ensemble correspond to the relative contribution of compartments or TADs. The observation that the empirical Q is higher than the distribution of TAD-preserving ensemble suggests genes are further organized within TADs in order to make co-expressed members proximal to each other. A natural question is, to what extent the co-expressed genes are organized favoring spatial proximity? To do this, we introduced a Monte Carlo procedure to iteratively search for the maximal value of Q (see Methods for details). More specifically, the positions of ON and OFF genes are swapped in order to increase the value of Q. As shown in Figure 3d, in all the tested CGP maps, given a fixed number of expressed genes, it is possible to further increase the value of Q from the empirical value, meaning that the empirical values are far from optimal.
Change of CGP is highly correlated with gene expression change
We further investigated whether there exists a correspondence between changes in expression profile and spatial configurations among different cell types. To do this, we focused on two hematopoietic cell lines: GM12878 and K562, where the latter is a cell line derived from a patient with chronic myeloid leukemia (CML). We considered two sets of genes based on differential expression: the first set contains the most up-regulated genes in GM12878 as compared with K562 (100 genes with the largest expression fold-change); the second set consists of the most down-regulated ones (100 genes with the smallest expression fold-change). We should expect that the first set of genes to be clustered more tightly in GM12878 than in K562, while the second set should be clustered more tightly in K562. Moreover, we would like to compare the effect measured on the CGP map with the one measured on the raw gene proximity map. We defined a quantity, “tightness”, denoted by T, to measure to what extent a set of genes is tightly positioned (i.e., the average spatial distance between genes is small) in 3D (see Methods for details). When computing T using the CGP, we found that the up-regulated genes in GM12878 are close together in GM12878 (high tightness) but farther apart (low tightness) in K562, while the down-regulated genes are more tightly clustered in K562 than in GM12878 (Figure 4, left panel). The right panel in Figure 4 shows that the result is not visible, when we obtain T using the gene proximity map alone. It contradicts our expectation in that the up-regulated genes, on average, have fewer contact frequencies between themselves than the down-regulated genes in a cell.
We performed the same analysis at the level of pathways. More specifically, we investigated how the spatial configuration of the genes constituting a pathway changes when their expression changes. For each of the 186 metabolic pathways in KEGG database, we calculated the difference in tightness between K562 and GM12878 cell lines. A positive (negative) difference indicates that genes in the pathway are tightly (loosely) positioned in K562 but loosely (tightly) positioned in GM12878. At the same time, we performed the Gene Set Enrichment Analysis (GSEA) and identified pathways that are enriched with genes that are up-regulated in GM12878 with respect to K562. As shown in Figure 5 (outset), we found that these up-regulated pathways are consistently more compact in GM12878 than in K562, which illustrates that, for these pathways to be expressed in GM12878 but not in K562, their genes tend to be spatially reorganized to achieve closer proximity. One example of such pathways is the regulation of autophagy, whose inhibition has been reported as a new strategy to induce cell death of drug-sensitive and drug-resistant CML cells [33]. This is probably due to the fact that autophagy genes are poorly expressed in K562 cells as compared with GM12878 cells and are therefore more sensitive to the inhibition of autophagy. Importantly, the figure shows that this effect is not visible when performing the same analysis using raw contact frequencies, as it fails to identify the correlation between the change in relative positioning of genes in 3D and the change in gene expression between cell lines (Figure 5, inset).
CGP reveals critical inter-chromosomal interactions in the nucleus
The CGP map offers a natural framework to study the inter-chromosomal proximity between genes. Based on the values in the CGP matrices, we analyzed the 20 closest inter-chromosomal gene pairs in each of the K562 and GM12878 cell lines (Supplementary Materials Table S1). We observed that the pairs identified in two cell lines involve quite different sets of chromosomes, naming chr11-chr17 in GM12878 whereas chr9-chr13-chr22 in K562. This difference might due to the large-scale chromosomal rearrangement in the cancer cell line K562.
Apart from spatial proximity at gene level, it is interesting to investigate spatial proximity at chromosomal level. To do this, we applied a size-reduction technique [34] that merges genes belonging to the same chromosome in the CGP map together to form a reduced inter-chromosomal proximity map. Starting from the CGP matrix B, the corresponding “inter-chromosomal proximity matrix”\( \hat{B} \) was obtained by merging the rows and columns according to the gene-chromosome correspondence (see Methods for details). We took CML as a case study here and compared the inter-chromosomal proximity map between GM12878 and K562 cell lines in order to obtain some insights on the role of abnormal chromosomal rearrangements in the diseased cell. Figure 6 summarizes the global change of spatial organization of the genome between K562 and GM12878 at the chromosome level. A positive change between a pair of chromosomes indicates that they are closer in 3D in K562 than in GM12878 and a negative change means the opposite. Weighted blue and red links illustrate the extent and direction of proximity changes: blue represents negative changes while red represents positive changes. The network recapitulates a number of known facts that are consistent with the genome of cells affected by CML. First, there is a red connection between chromosome 9 and 22, resembling the Philadelphia chromosome, a reciprocal translocation between chromosome 9 and chromosome 22 [35]. Second, chromosome 3 and 10 are linked by a red line, indicating this pair are closely located in the K562 cell line. This behavior was recently characterized as a rare chromosomal translocation in leukemia patients [36]. Both aforementioned translocations have been observed in the K562 cell line in a recent study [37]. As shown in Figure 6, chromosome 17 appears to move away from several chromosomes including 1, 11, 14 and 19. This could be attributed to an abnormality described as the single most important cytogenetic abnormality for the prognosis of leukemia [38]. In short, we found the inter-chromosomal proximity map, derived from the CGP map, captures the relative spatial organization of chromosomes between cell types. With no surprise, the critical chromosomal abnormalities cannot be effectively identified using the gene proximity map W (Figure S4).