DiNeR: a Differential graphical model for analysis of co-regulation Network Rewiring

Zhang, Jing; Liu, Jason; Lee, Donghoon; Lou, Shaoke; Chen, Zhanlin; Gürsoy, Gamze; Gerstein, Mark

doi:10.1186/s12859-020-03605-3

Methodology article
Open access
Published: 02 July 2020

DiNeR: a Differential graphical model for analysis of co-regulation Network Rewiring

Jing Zhang¹^na1,
Jason Liu^2,3^na1,
Donghoon Lee^2,3,
Shaoke Lou^2,3,
Zhanlin Chen^4,5,
Gamze Gürsoy^2,3 &
…
Mark Gerstein^2,3,5

BMC Bioinformatics volume 21, Article number: 281 (2020) Cite this article

3444 Accesses
6 Citations
7 Altmetric
Metrics details

Abstract

Background

During transcription, numerous transcription factors (TFs) bind to targets in a highly coordinated manner to control the gene expression. Alterations in groups of TF-binding profiles (i.e. “co-binding changes”) can affect the co-regulating associations between TFs (i.e. “rewiring the co-regulator network”). This, in turn, can potentially drive downstream expression changes, phenotypic variation, and even disease. However, quantification of co-regulatory network rewiring has not been comprehensively studied.

Results

To address this, we propose DiNeR, a computational method to directly construct a differential TF co-regulation network from paired disease-to-normal ChIP-seq data. Specifically, DiNeR uses a graphical model to capture the gained and lost edges in the co-regulation network. Then, it adopts a stability-based, sparsity-tuning criterion -- by sub-sampling the complete binding profiles to remove spurious edges -- to report only significant co-regulation alterations. Finally, DiNeR highlights hubs in the resultant differential network as key TFs associated with disease. We assembled genome-wide binding profiles of 104 TFs in the K562 and GM12878 cell lines, which loosely model the transition between normal and cancerous states in chronic myeloid leukemia (CML). In total, we identified 351 significantly altered TF co-regulation pairs. In particular, we found that the co-binding of the tumor suppressor BRCA1 and RNA polymerase II, a well-known transcriptional pair in healthy cells, was disrupted in tumors. Thus, DiNeR successfully extracted hub regulators and discovered well-known risk genes.

Conclusions

Our method DiNeR makes it possible to quantify changes in co-regulatory networks and identify alterations to TF co-binding patterns, highlighting key disease regulators. Our method DiNeR makes it possible to quantify changes in co-regulatory networks and identify alterations to TF co-binding patterns, highlighting key disease regulators.

Background

Thousands of transcription factors (TFs), their cofactors, and chromatin remodelers are employed in a highly coordinated manner to accurately initiate and control the transcriptional process on DNA sequences [1,2,3,4]. Precise temporal and spatial coordination among these factors is important for determining cell phenotype and maintaining biological function. Studies have reported that disruption of the co-regulation relationships of TFs can result in gene expression alterations, which can consequently introduce phenotypical variations and even lead to disease [5,6,7,8,9]. However, various computational methods have been proposed to infer dysregulations of an individual TF by exploring differential expressions of the TF itself and its gene targets [10], or investigating the direct TF-gene gain and loss events [11], ignoring the higher-order combinatory binding patterns among TFs. Therefore, large-scale mining of TF co-binding changes in disease and normal states could provide new insights into gene dysregulation and opportunities for targeted therapies.

In this study, we aimed to quantify such alterations to the TF co-regulatory relationships and prioritize regulators associated with pathogenesis. This is a challenging task for many reasons. For instance, many of the ~ 1400 known human TFs have highly dynamic binding profiles depending on the cell state and conditions [12]. Therefore, it is essential to carefully curate a dataset of appropriate cell types to precisely capture disease-specific TF co-regulatory disruptions. Furthermore, joint analysis of binding profiles of numerous factors would be beneficial by maximizing our knowledge of TF co-binding events. Researchers have proposed various models to systematically impute all tissue-specific TF regulomes using features such as DNase accessibility and sequence context, but the accuracy of these methods across TFs is still unknown [13, 14].

In light of this, we propose a multi-step computational framework that we call Differential graphical model of Network Rewiring (DiNeR) to infer TF co-binding alterations and pinpoint disease-causing TFs. First, we modeled the cooperative regulation patterns among TFs using a co-regulation network, where the nodes represent TFs and the weighted edges measure the genome-wide co-occurrence between pairwise TFs, all of which are derived from chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. This is distinct from traditional regulatory networks, where edges usually imply the relationship between a TF and the physical interaction it shares with an enhancer or promoter region in order to initiate the transcription process of its target genes. Second, we directly measured the gain or loss of TF co-regulatory events across the genome during the transition from one cellular state to another using a differential graphical model. Intuitively, a higher weight in our differential network indicates a larger alteration in the co-regulatory pattern between two states. As a third step, we adopted the graphical LASSO model and used a stability-based method for penalizing parameter selection to control for the sparsity of the differential network with the goal of removing spurious edges [15,16,17,18]. Lastly, we prioritized the TFs according to their degree in the differential network based on the assumption that disease-driving TFs can demonstrate massive binding profile changes and consequently disrupt their cooperative regulation with many other regulators.

To test the effectiveness of our DiNeR framework, we applied this model to the ENCODE Tier 1 cell lines, K562 and GM12878, roughly representing paired disease-to-normal cells for chronic myelogenous leukemia (CML). We used our DiNeR framework to directly estimate the changes in the TF combinatory regulation relationships between disease and normal states. As a result, we identified several TFs that exhibit a high level of disruption to their co-binding partners. Among this list were several well-known risk factors for leukemia, such as BRCA1, RAD51, BMI1, and H3k27me3 [19,20,21,22], demonstrating the effectiveness of our method.

Results

Collecting appropriate large-scale ChIP-seq data to investigate transcriptional regulation

TF binding sites are highly tissue specific and usually change dramatically across cell types. Therefore, it is important to select appropriately matched and representative cell types to investigate a specific disease. Here, we used K562 and GM12878 to represent a rough disease-to-normal pair in order to investigate the transcriptional regulation dynamics during oncogenesis for CML (see details in “ChIP-seq data collection and pre-processing” section). Specifically, we extracted 94 common TFs among these two cell lines (Fig. 1a, Table S1). In order to investigate alterations in the joint activity between TFs and specific chromatin marks in disease, we also extracted nine histone modification marks and chromatin accessibility data sets from these cell lines. Among these TFs, 31 showed significant expression changes between disease and normal states (Fig. 1c).

Building a genome-wide TF co-regulatory network

Aberrant transcriptional regulation is associated with various diseases [5,6,7,8,9]. Network analysis has been proven to be a powerful tool for identifying and prioritizing genes or regulators associated with pathogenesis. For example, scientists have constructed TF regulatory networks in order to mimic the physical binding of TFs to either enhancer or promoter regions during initiation of the transcriptional process of its target gene (Fig. 2a, b) [11]. Edges in this type of regulatory network only focus on the local interaction between the TF and target gene pair, and do not consider the effect of the rest of the genome. Another approach is to use gene co-expression networks, where a shared edge in the network represents consistent expression patterns between a pair of genes across many samples, which was usually inferred from RNA sequencing or microarray data [23, 24]. However, this output only reflects co-expression patterns, rather than regulatory relationships.

Here, we propose a TF co-regulatory network based on large-scale ChIP-seq data to model a related but different aspect of transcriptional regulation – the cooperative behavior among TFs. Specifically, in our network, nodes represent TFs and the weighted edge between them measures the level of non-random co-binding activity across the genome. It is important to note that this network is distinct from a traditional TF-TF network, which focuses on the mechanism of how one TF gene is regulated by another TF protein. In particular, this type of TF-TF regulatory network describes the binding event of one TF at the non-coding element (e.g., enhancer or promoter) of another TF, but does not indicate whether these two TFs work in a coordinated way to regulate other downstream genes.

We used the Gaussian graphical model (GGM) to construct this TF co-regulatory network [25]. The schematic in Fig. 2c illustrates one example. Two pairs TF₁ and TF_3, TF₂ and TF₃ co-bind over many places in the genome, possibly resulting from interacting domains that are necessary to initiate transcription in many genes. Then we draw two edges in our co-regulation network between TF₁ and TF_3, TF₂ and TF₃ (Fig. 2d)_.

Using differential networks to measure TF co-regulation alterations between states

After building the co-regulatory networks under one condition, we aimed to provide a quantitative measure of co-regulation alterations between two conditions (e.g., disease and normal states). Intuitively, our goal is to try and infer a differential co-regulatory network. In this differential network, while nodes still represent TFs, each edge now describes the level of change that each TF pair experiences during disease progression (Fig. 3). Therefore, we extended the GGM for one condition to a differential graphical model for two conditions by estimating the differences between two precision matrices (details in “Differential graphical model for co-regulation alteration in two states” section).

During the estimation process of these two precision matrices, small non-zero values are introduced and may lead to many spurious edges, resulting in a dense differential network with many false positives. Therefore, we introduced a regulation parameter, λ, to penalize the retained edge count in order to remove potential spurious edges. As shown in Fig. 4, higher λ values indicate a larger penalty in the network edge, resulting in a sparser estimation of the differential network.

Selecting an appropriate λ parameter is key to reliably inferring the TF co-regulation gain and loss events while simultaneously removing false positives. We further used a stability-based model selection method to choose the optimal λ based on sub-sampling of the genome [15]. The intuition of our model is that we encourage the network to be more inclusive of edges for the benefit of allowing us to scrutinize many possible changes associated with disease, while simultaneously ensuring that the results are highly repeatable across many regions in the genome (Fig. 5, details in “Model selection” section).

Applying DiNeR to prioritize key TFs associated with CML

We applied our DiNeR framework to 104 paired factors in K562 and GM12878 cell lines. After model selection, we set λ_opt = 0.2 (details in “Model selection” section). In total, we included 6.49% of all possible edges in the final network (351 out of 5408). We identified eight out of the 104 factors as consistent network hubs (Table 1), and reliably captured all of them by sub-sampling half of the genome (see Supplement S4). These eight factors include many well-known genes that have been previously associated with leukemia (Table 1), indicating that our method can reliably detect key regulators of disease.

Table 1 A list of DiNeR prioritized TFs using the network hubs of the K562 vs. GM12878 differential co-regulation networks upon different subsampling of the entire genome. Perc. Incl.: how many rounds of simulations out of 100 this TF has been claimed as a network hub. Lit. Supp: whether there is literature support to link this TF with cancer.

Full size table

One of the identified factors was the proto-oncogene BMI1, which is a major component of polycomb group complex 1 and plays a central role in DNA damage repair. Although BMI1 showed only moderate expression changes in tumor as compared to normal cells (approximately 20% higher; from 42.90 to 51.19), its co-binding relationship with other factors changed dramatically. BMI1 was a significantly rewired TF and maintained edges with 14 other factors, including a well-known cancer-related histone modification mark H3K27me3. Interestingly, BMI1 is a known biomarker of hematologic malignancies and has been shown to be essential for faithful reprogramming of myeloid progenitors [26, 27]. Studies have reported that levels of BMI1 correlate with prognosis of patients with myelodysplastic syndrome and chronic and acute myelogenous leukemia [21]. Our analysis provides another possible explanation to this phenomenon, suggesting that in addition to aberrant expression patterns, the disruption of the coordination between key regulators during transcription can also contribute to disease progression.

Another interesting factor identified by our method was the tumor suppressor gene BRCA1 (Fig. 6a). BRCA1 has been shown to play a central role in maintaining genomic stability. Moreover, germline defects in this gene have been associated with multiple cancer types, including breast and ovarian cancers [28]. In normal cells, BRCA1 is typically associated with RNA polymerase II across multiple species to faithfully activate transcription of key genes [29]. However, we found that in K562 cells, the key interaction between BRCA1 and RNA polymerase II was significantly disrupted. Specifically, the Jaccard distance between the BRCA1 and POLR2A ChIP-seq peaks decreased by 98% in K562 as compared to GM12878 cells, indicating severe co-localization disruption. Consistent with this finding, we observed a remarkable proximal-to-distal shift in the BRCA1 binding locations in K562 cells (Fig. 6a and b). In particular, approximately 93.6% of the BRCA1 ChIP-seq peaks in GM12878 were located within a 5 kb region of annotated transcription start sites (proximal), and this number was drastically reduced to 16.4% for K562 peaks. Such a remarkable shift indicates that in cancer cells, BRCA1 fails to cooperate with other conserved collaborators to control the transcriptional process. BRCA1 has also been widely reported to affect chromatin structure and introduce chromatin remodeling. Interestingly, we also found that H3K4me1 and open chromatin regions were among the rewired partners of BRCA1. The overlapping peaks between BRCA1 and these factors were significantly reduced during the normal-to-tumor transition (p value < 2 × 10⁻¹⁶ for binomial tests in both cases). Hence, we hypothesize that BRCA1 severely alters not only its transcriptional regulation through promoter region interactions, but also its chromatin remodeling activity.

Further investigating TFs prioritized by DiNeR

To further explore the prognostic value of BRCA1 in leukemia and validate our model, we downloaded 151 patient expression and clinical profiles from The Cancer Genome Atlas (TCGA) for acute myeloid leukemia [30]. If we only consider the expression levels of BRCA1 in K562 and GM12878, we are unable to fully detect the difference in behavior of this gene between the two cell types. However, when considering the regulatory activity, we found that BRCA1 regulatory activity was increased by 50% in K562 as compared to GM12878 cells, suggesting that the regulatory role of BRCA1 could be of interest as compared to just the expression of this gene. In line with large-scale network rewiring events, we combined the BRCA1 regulatory network with patient tumor-to-normal differential expression profiles to measure the regulatory potential of BRCA1 in each patient. In contrast to the survival analysis using expression profiles alone, we found that a severe disruption of BRCA1 regulatory activities usually indicated worse patient survival rates as compared to those patients with strong regulatory (p-value = 0.0035) (Fig. 6c). This analysis demonstrates that the differential graphical model can effectively identify key factors affecting patient survival beyond gene expression.

Discussion

Simultaneous binding of many TFs to proximal and distal regulatory regions of the genome is imperative for precise control of spatiotemporal gene expression patterns. Therefore, it is essential to investigate transcriptional regulation alterations in order to prioritize risk factors associated with disease. Many methods have been developed to address this goal. For instance, researchers investigated TF target gene gain and loss events and found distinct patterns between oncogenes and tumor suppressors [11]. Others combined expression changes with TF regulatory networks and identified TFs responsible for aberrant gene expression patterns, and then validated their discoveries through patient survival analysis [10]. Here, we focused on a unique aspect of transcriptional regulation – coordination among TFs, which goes beyond the binding or expression changes of TFs. For instance, genome-wide chromatin remodeling and histone modifications changes may introduce dramatic binding profile alterations for different TF, resulting in alterations of the combinatory coordination among TFs. For instance, some TFs, such as BRCA1 in our analysis, demonstrate massive binding profile changes. As a result, these TFs disrupt the co-binding relationship with many other TFs and chromatin features and are prioritized with the highest importance by our DiNeR framework. We also found that the target genes of TFs showed distinct expression patterns in tumors compared to normal cell lines, most likely due to extensive binding profiles changes. As a result, our analyses offer related but complementary views of transcriptional dynamics compared to previous methods and provide new insights into disease-related transcriptional regulation.

We also emphasize that model selection is a key step of our DiNeR framework. A larger penalty to the edge numbers captures the more confident co-regulatory network alterations at the cost of excluding weaker, but not necessarily insignificant, changes. Researchers have proposed various model selection methods, such as Akaike and Bayesian information criteria, to solve this problem [25]. In our framework, we used the stability-based model selection method, which provides a more interpretable explanation [15]. Specifically, we include an edge in the network as long as we can reliably discover it from numerous random subsamples of different genomic regions.

In addition, we ranked TFs according to the number of gained and lost partner TFs in the differential network. The intuition behind this scheme is that TFs that show a larger degree of genome-wide binding profile changes represent network hubs and are more likely to have disrupted coordination with many other partner TFs, resulting in a higher impact on transcriptional regulation. This is a reasonable assumption without any prior information. However, it is possible that even the disruption of one canonical TF pair may change the expression of key genes, such as in cases of oncogenes and tumor suppressor genes. Hence, it is also valuable to scrutinize the non-hubs of the network provided by DiNeR.

Finally, we showed that using matched ChIP-seq data from disease and normal cells is beneficial to directly capturing the co-localization events of TFs, as compared to other profiles such as expression. However, this approach requires the availability of hundreds of functional characterization datasets. We believe that as high-throughput sequencing technologies continue to develop, especially single-cell sequencing methods, our proposed differential graphical models could be applied to new opportunities to highlight regulators in disease.

Conclusion

We developed a TF-TF network rewiring and regulator prioritization method by applying non-parametric graphical models on large-scale functional genomics data. This approach allows us to identify DNA binding factors demonstrating large co-localization disruptions. Given the number of genome-wide binding profiles from ChIP-seq data in matched disease and control cells, our DiNeR method can reliably and efficiently highlight the significant changes of pairwise coordinated regulations between different factors. We applied our model to 104 common TF, histone modification, and chromatin accessibly data from a loosely paired tumor and normal cell line in CML. We discovered disruptions between well-known partners of transcriptional regulation, such as BRCA1 and RNA polymerase II, signifying the effectiveness of our method.

Methods

Here, we adopted a differential graphical model to investigate the differences in co-binding patterns of TFs between normal and disease conditions. We hypothesized that factors that dramatically change their partners during the transcription process would be altered to a larger degree in disease samples, and hence would have larger effects than TFs showing little difference in co-binding patterns in driving disease progression.

ChIP-seq data collection and pre-processing

We aim to infer the differential TF co-regulation alterations among normal and disease conditions. In order to assess these alterations, we selected disease-to-normal cell types that shared at least 50 common TFs (ChIP-Seq) as well as had a strong expression correlation between the two cell types (Figure S1). Therefore, we decided to use the 369 and 143 ChIP-seq experiments from K562 and GM12878 cell lines, respectively. After de-duplicating and extracting common ChIP-Seq targets, we identified 94 common TFs among these two cell lines (Table S1). In order to investigate alterations in the joint activity between TFs and specific chromatin marks in disease, we also extracted nine histone modifications and chromatin accessibility datasets from these cell lines (Fig. 1a, b). The majority of the 94 common TFs were sequence-specific binding factors (TFSS in Fig. 1c), 31 of which showed significant expression changes between disease and control.

To uniformly process the data, we first divided the autosomal chromosome (hg38 version) into 100 base pair (bp) bins and removed bins that overlapped with genomic regions that have gaps or low mappability using BEDTools (version 2.27.1-foss-2018b) [31]. To remove any artifacts from non-peak regions, we further removed all 100 bp bins that did not overlap with any peaks from these 104 factors (Fig. 3a). In total, we kept 1,351,140 bins in our analysis.

For each factor, we calculated the fold change of read count between the TF ChIP-seq experiment and its matched ChIP-seq control experiment from ENCODE for all bins in order to normalize for read depth (Fig. 3b). We then calculated the average signal from each of the replicates or experiments from different labs when multiple datasets were present for a single factor. Details of the ChIP-seq signal files used has been listed in the supplementary data. We organized the resultant signal data into a matrix for K562 and GM12878 separately, with columns indicating factors and rows indicating bins in the genome (Fig. 3c). We used these two matrices as the inputs for the following differential graphical model (Fig. 3d).

Infer differential graphical model to TF-TF network rewiring

Gaussian graphical model for co-regulation network in one state

In this section, we describe the details of the differential graphical model. Let G^(D) = (V^(D), E^(D)) represent the network with nodes V^(D) and edges E^(D) for disease status D. Let $ {X}_j^{(D)} $ denote the vector of the average normalized ChIP-seq signal of factor j in status D, where j = 1⋯J . D = 1 indicates disease and 0 indicates normal. Next, for each D, we calculated $ {X}_1^{(D)},{X}_2^{(D)},\cdots, {X}_J^{(D)} $ over 100 bp bins for all the 104 factors. In our analysis, we included 94 TFs, nine histone modification marks, and one chromatin accessibility (J = 104).

Under one condition D, we assume that $ {X}_1^{(D)},{X}_2^{(D)},\cdots, {X}_J^{(D)} $ follows a multivariate Gaussian distribution such that $ {\boldsymbol{X}}^{(D)}={\left({X}_1^{(D)},\cdots, {X}_J^{(D)}\right)}^T\sim {N}_J\left({\boldsymbol{\mu}}^{(D)},{\boldsymbol{\Sigma}}^{(D)}\right) $. We can construct the TF co-regulatory network using a traditional GGM. Here, we are aiming to identify TF true physical interactions by highlighting conditional dependent binding profiles among TFs. Therefore, we used the precision matrix represented as Θ^(D) ≔ (Σ^(D))⁻¹ to infer whether any TF pair has a non-random co-binding interaction. In other words, If $ {\boldsymbol{\Theta}}_{ij}^{(D)}=0, $ then $ {X}_i^{(D)} $ and $ {X}_j^{(D)} $ are independent of each other, conditioned on all the other TFs. As a result, $ \left(i,j\right)\notin {E}^{(D)}\ if\ {\boldsymbol{\Theta}}_{ij}^{(D)}=0 $.

Differential graphical model for co-regulation alteration in two states

Next, we used the difference between two networks G⁽¹⁾ and G⁽⁰⁾, called the differential network, to represent the degree of TF co-regulation alteration under two conditions (D = 1 and D = 0). Given the observed data $ {X}_1^{(1)},{X}_2^{(1)},\cdots, {X}_J^{(1)} $ in the disease cell and $ {X}_1^{(0)},{X}_2^{(0)},\cdots, {X}_J^{(0)} $ for the normal cell, edges in the differential co-regulatory network can be inferred from the difference between the two precision matrices Δ = Θ⁽¹⁾ − Θ⁽⁰⁾ = (Σ⁽¹⁾)⁻¹ − (Σ⁽⁰⁾)⁻¹, where the co-regulation relationship between TF_i and TF_j changes if |Δ_{i, j}| ≠ 0. Note that Σ⁽¹⁾ΔΣ⁽⁰⁾ − (Σ⁽¹⁾ − Σ⁽⁰⁾) = 0. Hence, we can solve the following equation to estimate a reasonable Δ.

$$ {\hat{\boldsymbol{\Sigma}}}^{(1)}\boldsymbol{\Delta} {\hat{\boldsymbol{\Sigma}}}^{(0)}-\left({\hat{\boldsymbol{\Sigma}}}^{(1)}-{\hat{\boldsymbol{\Sigma}}}^{(0)}\right)=0 $$

Here $ {\hat{\boldsymbol{\Sigma}}}^{(D)} $ is the sample covariance matrix. Specifically, we used the penalized D-Trace loss model estimate Δ [16,17,18].

$$ l\left(\boldsymbol{\Delta} \right)=\frac{1}{2}\ tr\left(\boldsymbol{\Delta} {\hat{\boldsymbol{\Sigma}}}^{(1)}\boldsymbol{\Delta} {\hat{\boldsymbol{\Sigma}}}^{(0)}\right)- tr\left(\boldsymbol{\Delta} \left({\hat{\boldsymbol{\Sigma}}}^{(1)}-{\hat{\boldsymbol{\Sigma}}}^{(0)}\right)\right) $$

tr represents the trace of a matrix. To remove spurious differential edges, we introduced a non-negative regularization parameter λ to penalize the number of edges in the network.

$$ l\left(\boldsymbol{\Delta} \right)=\frac{1}{2}\ tr\left(\boldsymbol{\Delta} {\hat{\boldsymbol{\Sigma}}}^{(1)}\boldsymbol{\Delta} {\hat{\boldsymbol{\Sigma}}}^{(0)}\right)- tr\left(\boldsymbol{\Delta} \left({\hat{\boldsymbol{\Sigma}}}^{(1)}-{\hat{\boldsymbol{\Sigma}}}^{(0)}\right)\right)+\lambda {\sum}_{i,j}{\left|{\Delta }_{i,j}\right|}_1 $$

Here, λ controls the sparsity of the rewired network. For example, λ = 0 indicates no penalty and usually will result in a very dense network. In contrast, a large λ value will result in a sparse network.

Co-variance matrix inference

Under the Gaussian assumption, $ {\hat{\boldsymbol{\Sigma}}}^{(D)} $ can be directly obtained from the sample covariance matrix. However, One statistical concern of using a differential graphical model is the distribution of X^(D). Here, we found that even after log transformation, almost all TFs severely contradicted the Gaussian assumption (for details see suppl. sect. S2). Therefore, going forward we used a non-parametric model instead of a GGM. Our assumption is that a set of monotonically increasing functions $ {\left\{{f}_j^{(D)}\right\}}_{j=1}^J $ exists such that, after transformation, $ {f}_1^{(D)}\left({X}_1^{(D)}\right),{f}_2^{(D)}\left({X}_2^{(D)}\right),\cdots, {f}_J^{(D)}\left({X}_J^{(D)}\right) $ follow a multivariate normal distribution N_J(0, Σ^(D)). Similar to the GGM, we can use the precision matrix Θ^(D) ≔ (Σ^(D))⁻¹ to infer the conditional dependence between any pair of factors in the network. As described in [17], we adopted the rank-based scheme to estimate the sample covariance matrix without directly estimating $ {\left\{{f}_j^{(D)}\right\}}_{j=1}^J $. Specifically, let $ {r}_{li}^{(D)} $ represent the rank of bin l for TF i in status D among all the bins, and n is the total number of bins in the genome. The Spearman correlation of TFs i and j are represented as below.

$$ {\rho}_{ij}^{(D)}=\frac{\sum_{l=0}^n\left({r}_{li}^{(D)}-\frac{n+1}{2}\right)\left({r}_{lj}^{(D)}-\frac{n+1}{2}\right)}{\sqrt{\sum_{l=1}^n{\left({r}_{li}^{(D)}-\frac{n+1}{2}\right)}^2{\sum}_{l=1}^n{\left({r}_{li}^{(D)}-\frac{n+1}{2}\right)}^2}} $$

Then we replace the sample covariance matrix $ {\hat{\boldsymbol{\Sigma}}}^{(D)} $ by, $ {\hat{\mathbf{S}}}^{(D)} $ with elements

$$ {\hat{\mathbf{S}}}^{(D)}=\left\{\begin{array}{cc}2\sin \left(\frac{\pi }{6}{\hat{\rho}}_{ij}^{(D)}\right)& i\ne j\\ {}1& i=j\end{array}\right. $$

In cases where $ {\hat{\mathbf{S}}}^{(D)} $ was not positively semi-definite, we used a projection method as described in [17, 32].

Model selection

It is critical to select an appropriate λ to reliably infer network changes in disease samples because different λ values can lead to different conclusions in downstream analyses. Researchers have proposed many methods, cross validation, Akaike information criterion (AIC), and Bayesian information criterion (BIC), to automatically select λ [33,34,35]. We chose a more interpretable approach in our ChIP-seq-based co-regulation network analysis than AIC and BIC, by using the Stability Approach to Regularization Selection (StARS) approach (Fig. 5) [15]. The key characteristic of this method is that it encourages the differential network to be inclusive to account for the true dynamics between disease and control networks, while guaranteeing an acceptable stability in the resultant differential network.

Specifically, we defined $ \Lambda =\raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$\lambda $}\right. $ as an alternative parameter to control network density so that a larger Λ indicates a denser network. During the model section process, we start from subsampling part of the genome for S times. Specifically, during the s^th sampling, X^{(1), s} and X^{(0), s} represent the binding profile matrices. $ {\psi}_{i,j}^s\left(\Lambda \right)=1 $ if there is an edge between TF i and TF j in the rewired network under Λ, otherwise $ {\psi}_{i,j}^s\left(\Lambda \right)=0 $. In the s = 1, 2, ⋯, S randomly sampled datasets, we defined $ {\theta}_{i,j}\left(\Lambda \right)=\frac{1}{S}{\sum}_{s=1}^S{\psi}_{i,j}^s\left(\Lambda \right) $, and ξ_{i, j}(Λ) = 2θ_{i, j}(Λ){1 − θ_{i, j}(Λ)} to be the fraction of times the networks disagree with the existence of the edge (i, j). Then, the overall instability of the networks over the sampling sets is

$$ \hat{D}\left(\Lambda \right)=\frac{\sum_{i<j}{\xi}_{i,j}\left(\Lambda \right)}{\left(\begin{array}{c}J\\ {}2\end{array}\right)} $$

It is clear that $ \hat{D}(0)=0 $ in an empty network because there is no instability when there are no edges. In general, the network becomes denser and more instable as Λ goes larger. However, when the network becomes very dense and even fully connected, $ \hat{D}\left(\Lambda \right) $ goes smaller again and eventually reduces to zero. As suggested in [15], we used the monotone function $ \overline{D}\left(\Lambda \right)={\mathit{\sup}}_{0\ll t\ll \lambda}\hat{D}\left(\Lambda \right) $ to remove such artifact effect in an extremely dense network. As a result, the optimal Λ should be $ {\hat{\Lambda}}_{opt}=\mathit{\sup}\left\{\Lambda :\overline{D}\left(\Lambda \right)\le \beta \right\} $ for a predefine network instability measure β.

In our analysis, we started from a broad spectrum of parameter values from λ₁ = 0.01, λ₂ = 0.05, ⋯λ_m, ⋯, λ_M = 10, representing a wide range of sparse networks, from almost fully connected to empty (Fig. 4). For each λ_m, we randomly selected half of the bins we used in “Infer differential graphical model to TF-TF network rewiring” section to run the LASSO penalized D-Trace loss model. We repeated this process S = 100 times for each λ_m and calculated the average network variance. We used β = 0.5% as our stability threshold and selected the optional λ is 0.2.

Availability of data and materials

The datasets generated and/or analyzed during the current study are available in the ENCODE portal, https://www.encodeproject.org/ [21]. Accession codes can be found in the supplement. Users can directly download the raw data used in this paper from the ENCODE website with the accession code.

Abbreviations

DiNeR:: Differential Graphical Model of co-regulation Network Rewiring to Infer Transcription Factor Co-binding Alterations
TF:: Transcription factor
CML:: Chronic myeloid leukemia
GGM:: Gaussian graphical model
ChIP-seq:: Chromatin immunoprecipitation followed by sequencing
StARS:: Stability Approach to Regularization Selection
TCGA:: The Cancer Genome Atlas

References

Spitz F, Furlong EE. Transcription factors: from enhancer binding to developmental control. Nat Rev Genet. 2012;13(9):613–26.
Article CAS Google Scholar
Djebali S, et al. Landscape of transcription in human cells. Nature. 2012;489(7414):101–8.
Article CAS Google Scholar
Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489(7414):91–100.
Article CAS Google Scholar
Lee TI, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298(5594):799–804.
Article CAS Google Scholar
Golson ML, Kaestner KH. Fox transcription factors: from development to disease. Development. 2016;143(24):4558–70.
Article CAS Google Scholar
Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237–51.
Article CAS Google Scholar
Nebert DW. Transcription factors and cancer: an overview. Toxicology. 2002;181-182:131–41.
Article CAS Google Scholar
Santiago C, Bashaw GJ. Transcription factors and effectors that regulate neuronal morphology. Development. 2014;141(24):4667–80.
Article CAS Google Scholar
Bhagwat AS, Vakoc CR. Targeting transcription factors in Cancer. Trends Cancer. 2015;1(1):53–65.
Article Google Scholar
Jiang P, et al. Inference of transcriptional regulation in cancers. Proc Natl Acad Sci U S A. 2015;112(25):7731–6.
Article CAS Google Scholar
Zhang J, et al. An integrative ENCODE resource for cancer genomics. bioRxiv. 706424. https://doi.org/10.1101/706424.
Tsankov AM, et al. Transcription factor binding dynamics during human ES cell differentiation. Nature. 2015;518(7539):344–9.
Article CAS Google Scholar
Qian J, et al. Identification of regulatory targets of tissue-specific transcription factors: application to retina-specific gene regulation. Nucleic Acids Res. 2005;33(11):3479–91.
Article CAS Google Scholar
Zhong S, He X, Bar-Joseph Z. Predicting tissue specific transcription factor binding sites. BMC Genomics. 2013;14:796.
Article Google Scholar
Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (StARS) for high dimensional graphical models. In: Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2. Vancouver: Curran Associates Inc.; 2010. p. 1432–40.
Google Scholar
Yuan HL, et al. Differential network analysis via lasso penalized D-trace loss. Biometrika. 2017;104(4):755–70.
Article Google Scholar
Zhang XF, et al. DiffGraph: an R package for identifying gene network rewiring using differential graphical models. Bioinformatics. 2018;34(9):1571–3.
Article CAS Google Scholar
Tian DC, Gu QQ, Ma J. Identifying gene regulatory network rewiring using latent differential graphical models. Nucleic Acids Res. 2016;44(17):e140.
Article Google Scholar
Friedenson B. The BRCA1/2 pathway prevents hematologic cancers in addition to breast and ovarian cancers. BMC Cancer. 2007;7:152.
Article Google Scholar
Hamdy MS, et al. RAD51 and XRCC3 gene polymorphisms and the risk of developing acute myeloid leukemia. J Investig Med. 2011;59(7):1124–30.
Article CAS Google Scholar
Saudy NS, et al. BMI1 gene expression in myeloid leukemias and its impact on prognosis. Blood Cells Mol Dis. 2014;53(4):194–8.
Article CAS Google Scholar
Buchi F, et al. Redistribution of H3K27me3 and acetylated histone H4 upon exposure to azacitidine and decitabine results in de-repression of the AML1/ETO target gene IL3. Epigenetics. 2014;9(3):387–95.
Article CAS Google Scholar
Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:Article17.
Article Google Scholar
Carter SL, et al. Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics. 2004;20(14):2242–50.
Article CAS Google Scholar
Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
Article Google Scholar
Sahasrabuddhe AA. BMI1: a biomarker of hematologic malignancies. Biomark Cancer. 2016;8:65–75.
Article CAS Google Scholar
Yuan J, et al. Bmi1 is essential for leukemic reprogramming of myeloid progenitor cells. Leukemia. 2011;25(8):1335–43.
Article CAS Google Scholar
Roy R, Chun J, Powell SN. BRCA1 and BRCA2: different roles in a common pathway of genome protection. Nat Rev Cancer. 2011;12(1):68–78.
Article Google Scholar
Krum SA, et al. BRCA1 associates with processive RNA polymerase II. J Biol Chem. 2003;278(52):52012–20.
Article CAS Google Scholar
Cancer Genome Atlas Research, N, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–20.
Article Google Scholar
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–2.
Article CAS Google Scholar
Liu H, et al. High-dimensional Semiparametric Gaussian copula graphical models. Ann Stat. 2012;40(4):2293–326.
Article Google Scholar
Akaike H. In: Parzen E, Tanabe K, Kitagawa G, editors. Information Theory and an Extension of the Maximum Likelihood Principle, in Selected Papers of Hirotugu Akaike. New York: Springer New York; 1998. p. 199–213.
Chapter Google Scholar
Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6(2):461–4.
Article Google Scholar
Efron B. The jackknife, the bootstrap and other resampling plans. https://doi.org/10.1137/1.9781611970319.

Download references

Acknowledgements

We thank the Yale Center for Research Computing for guidance and use of the research computing infrastructure at Yale university.

Funding

We acknowledge support from the NIH (U24 HG 009446–02) and from AL Williams Professorship funds.

Author information

Jing Zhang and Jason Liu contributed equally to this work.

Authors and Affiliations

Department of Computer Science, University of California, Irvine, CA, 92617, USA
Jing Zhang
Computational Biology and Bioinformatics Program, Yale University, New Haven, CT, 06520, USA
Jason Liu, Donghoon Lee, Shaoke Lou, Gamze Gürsoy & Mark Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA
Jason Liu, Donghoon Lee, Shaoke Lou, Gamze Gürsoy & Mark Gerstein
Department of Molecular Cellular and Developmental Biology, Yale University, New Haven, CT, 06520, USA
Zhanlin Chen
Department of Computer Science, Yale University, New Haven, CT, 06520, USA
Zhanlin Chen & Mark Gerstein

Authors

Jing Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jason Liu
View author publications
You can also search for this author in PubMed Google Scholar
Donghoon Lee
View author publications
You can also search for this author in PubMed Google Scholar
Shaoke Lou
View author publications
You can also search for this author in PubMed Google Scholar
Zhanlin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Gamze Gürsoy
View author publications
You can also search for this author in PubMed Google Scholar
Mark Gerstein
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

JZ and JL designed the methods; JZ, JL, DL, and ZC performed all the analyses; SL and GG helped with expression data processing; all authors were involved in writing the paper. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Mark Gerstein.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

We would like to note that Mark Gerstein is a member of the editorial board of this journal.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Zhang, J., Liu, J., Lee, D. et al. DiNeR: a Differential graphical model for analysis of co-regulation Network Rewiring. BMC Bioinformatics 21, 281 (2020). https://doi.org/10.1186/s12859-020-03605-3

Download citation

Received: 06 August 2019
Accepted: 15 June 2020
Published: 02 July 2020
DOI: https://doi.org/10.1186/s12859-020-03605-3

DiNeR: a Differential graphical model for analysis of co-regulation Network Rewiring

Abstract

Background

Results

Conclusions

Background

Results

Collecting appropriate large-scale ChIP-seq data to investigate transcriptional regulation

Building a genome-wide TF co-regulatory network

Using differential networks to measure TF co-regulation alterations between states

Applying DiNeR to prioritize key TFs associated with CML

Further investigating TFs prioritized by DiNeR

Discussion

Conclusion

Methods

ChIP-seq data collection and pre-processing

Infer differential graphical model to TF-TF network rewiring

Gaussian graphical model for co-regulation network in one state

Differential graphical model for co-regulation alteration in two states

Co-variance matrix inference

Model selection

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us