Constructing non-stationary Dynamic Bayesian Networks with a flexible lag choosing mechanism

Background Dynamic Bayesian Networks (DBNs) are widely used in regulatory network structure inference with gene expression data. Current methods assumed that the underlying stochastic processes that generate the gene expression data are stationary. The assumption is not realistic in certain applications where the intrinsic regulatory networks are subject to changes for adapting to internal or external stimuli. Results In this paper we investigate a novel non-stationary DBNs method with a potential regulator detection technique and a flexible lag choosing mechanism. We apply the approach for the gene regulatory network inference on three non-stationary time series data. For the Macrophages and Arabidopsis data sets with the reference networks, our method shows better network structure prediction accuracy. For the Drosophila data set, our approach converges faster and shows a better prediction accuracy on transition times. In addition, our reconstructed regulatory networks on the Drosophila data not only share a lot of similarities with the predictions of the work of other researchers but also provide many new structural information for further investigation. Conclusions Compared with recent proposed non-stationary DBNs methods, our approach has better structure prediction accuracy By detecting potential regulators, our method reduces the size of the search space, hence may speed up the convergence of MCMC sampling.


Introduction
Recently non-stationary Bayesian network models have attracted significant research interests in modeling gene expression data. In non-stationary Bayesian networks, we assume that the underlying stochastic process that generates the gene expression data may change over time. Non-stationary Bayesian networks have advantage over conventional methods in applications where the intrinsic regulatory networks are subject to changes for adapting to internal or external stimuli. For example, gene expression profiles may go through dramatic changes in different development stages [1], or in the invasion process of viruses [2], or as response to changes of outside environment such as temperature and light intensity [3].
Recent work on non-stationary Bayesian networks could be found in [1,2]. Robinson's method [1] used RJMCMC (Reversible Jump Markov Chain Monte Carlo) to sample underlying changing network structures, in which an extended BDe metric (Bayesian-Drichlet equivalent) is applied. And Grzegorczy et al. [2] proposed a non-homogeneous Bayesian network method to model non-stationary gene regulatory processes, in which they included a Gaussian mixture model based on allocation sampler technique [4], provided an extended non-linear BGe (Bayesian Gaussian likelihood equivalent) metric and employed MCMC (Markov Chain Monte Carlo) to collect samples.
There are several limitations on the existing non-stationary DBNs methods that are discussed above. First, the RJMCMC that is used in Robinson's work [1] is a computationally expensive approach especially in dealing with gene networks. Second, mixture model used by Grzegorczy et al. avoided intensive computational issue by using MCMC, but it does not capture the underlying changing network structures over time. In addition, both methods used a fixed time delay τ = 1 that leads to a relatively low accuracy of prediction on network reconstruction [5].
In this paper, we proposed a new non-stationary DBNs approach extending the work presented in [1] and [5]. Our method modified RJMCMC by employing a systematic approach to determine potential regulators. We designed a flexible lag determine mechanism by considering the delay in the gene expression changes between potential regulators and target genes. In this approach we efficiently reduce the model searching space, capture the dynamics of transcriptional time delay, and speed up computation with a fast convergence.
The early work of applying BNs to analyzing expression data could be found in [7,8]. Many works have been done since then. Hartemink et al. extended the static BNs by including latent variables and annotated edges, and their work focused on scoring the models of regulatory network [10]. Considering the problem of information loss incurred by discretization of expression data, Imoto et al. proposed a continuous BNs and nonparametric regression model [12]. They used Laplace approximation to the marginal probability to infer a BNRC score as the scoring metric for network models. Further, Hartemink and Imoto extended their techniques to DBNs [11,14]. Before the BNs, previous efforts at modeling genetic regulatory networks fell into two categories [9,10]: fine-scale methods utilizing differential equations, and coarse-scale methods using clustering and boolean network models. BNs method is perceived as a good compromise of the two levels. With the challenging of small number of samples, researchers seek additional information such as transcriptional localization data [16], DNA sequences of promoter elements [13], and protein-protein interaction data [15] to improve the accuracy of gene networks reconstruction.

Structure Learning of Non-stationary Bayesian Networks
Bayesian networks (BNs) are a special case of probabilistic graphic models. A static BN is defined by an acyclic directed graph G and a complete joint probability distribution of its nodes P(X) = P(X 1 ,…, X n ). The graph G : G = {X, E} contains a set of variables X = {X 1 ,…, X n }, and a set of directed edges E, defining the causal relations between variables. With a directed acyclic graph, the joint distribution of random variables X = {X 1 ,…, X n } are decomposed as P(X 1 ,…, X n ) = ∏ i P(X i |Π i ), where Π i are the parents of the node (variable) X i .
The topology of bayesian networks must be a directed acyclic graph and hence could not be used to model the case where two genes may be a regulator of each other. As an extension of BNs to model time series data, Dynamic Bayesian Networks (DBNs) lift the limitation of directed acyclic graph by incorporating time in constructing bayesian networks. Given an observed time series data D spanning T time points, the structure learning problem of DBNs is equal to maximizing the posterior probability of the network structure G. By the Bayes' rule, the posterior probability is expressed as the following: The current application of DBNs to gene expression data assumes that the underlying stochastic process generating the data is stationary. Here we provide a new approach to capture the structural dynamics of non-stationary data.
We assume the time series gene expression profile is subdivided to m segments. In each segment, there is one graph G i : 1 ≤ i ≤ m that dominates the segment. Given a sequence of network structures G T = (G 1 ,…, G m ), the posterior probability in Equation 1 is replaced by Equation 2.
In applying DBNs to gene expression data, we first decide the time lag value τ, which is the time delay between causes and effects in the time series data. Most previous work set τ = 1 for modeling a first-order markov chain. However, evidence shows that higher-order markov chain might better model gene expression data and biological networks [5]. Given a maximum lag value τ max , in corresponding to the graph structure sequence G T , we assign a lag vector τ T = (τ 1 ,…,τ m ), in which τ i : 1 ≤ τ i ≤ τ max . So Equation 2 further extends to: In the following discussion, we specify the formula for calculating each component of Equation 4. The prior P (τ max |T) is 1 because we set the τ max value when we find the potential parents for each variable.
We are using the same assumption in [1] that the networks change smoothly over time. We use the exponential priors on the change of network structures. We transform the form of the sequence of graph structures G T : G T = (G 1 ,…,G m ) into G T : G T = (G 1 , ΔG 1 ,…, ΔG m −1 ), where ΔG i : 1 ≤ i ≤ m −1 is the change of edges between G i and G i+1 . we calculate P(G T |m,T) as follows. and s i , is the number of edges' change between G i+1 and G i . We have no prior knowledge on P(G 1 ) and see the uniform distribution as the prior.
We set the exponential prior on the transition times of networks over time and calculate P(m|T) as the following.
We assume that the segments are independent and calculate P(D h \G T , m, τ T , τ max ,T) of each segment as the following.
I h is a segment where a network structure G h and its corresponding lag value τ h work. Θ G h are the parameters associated with the data of one segment I h corre- We assume that the data are complete and multinomially distributed with a Dirichlet prior on the parameters. We weight the hyperparameters of Dirichlet distribution in each segment with the ratio of the segment length over the sample size. We calculate the BDe [17] score of each segment as the following: N is the sample size of the observed data. |I h | is the length of the segment I h . Θ G h are the multinomial parameters of the joint probability distributions corresponding to G h . r i is the number of possible discrete values of x i . q ih is the number of configurations of parents Π i for the variable x i in the segment I h . N ijk (I h ) is the times that x i had value k in the segment I h .
.a ijk (I h ) and a ij (I h ) are the hyperparameters for Dirichlet distributions applied in the segment I h . a ijk (I h ) is assumed to be uniformly distributed inside a segment and is set to a ijk (I h ) = a|I h |/(r i q ih N) a is the equivalent sample size. We calculate the marginal likelihood P(D|G T , m, τ T , τ max , T) by using the modified Bayesian-Dirichlet equivalent ( BDe ) metric introduced in [1]. By multiplying the BDe metric of each segment, we get the extended BDe metric equation as follows: Once the parents are decided, we use a conditional where p j  is the conditional probability of the jth component's value in the lag vector τ T .

Potential regulator detection
We know that the change of expression level of most transcriptional factors (TFs) always precedes or happens simultaneously with that of target genes [18]. This fact provides a useful technique to find potential regulators and relative expression lag value τ. We follow Zou's work [5] to detect the possible TFs. In Zou's work, they used the expression levels of ≥ 1.2-fold and ≤ 0.70-fold compared with the average gene expression level as upregulation and down-regulation cutoff thresholds. Any gene with initial up(down) change of expression level earlier is seen as the potential TFs of genes with change of expression level later. One example of up-regulation is showed in Figure 1. Instead of using a fixed value we relax the cutoff thresholds by taking a range of values. For up-regulation, we use the range 1.0~1.2, and for down-regulation, we take the range 0.6~0.8. In order to get all the possible TFs for each gene, we need to consider all the combinations of possible up(down)-regulation pairs. The yeast cell cycle data set analyzed by Zou has a limited time points (T = 16), which makes the complete search over all possible lag values affordable. However, with the increasing sample size and number of genes in the gene expression profiles, this searching algorithm is unrealistic and will bring more noises and high computational cost. We developed a heuristic to limit the potential regulator-target gene pairs for processing large data sets.
Below is our method. We first discretize the expression data by following the method above. We then search the data and only select the initial up(down)-regulation points. Slide the window with the width τ max from the start (t = 1) of the time series expression data to the end (t = T − τ max + 1), where T is the length of time points. For each moving step, the window slides one time step and only the up(down)-regulation pairs inside the window are calculated. One example of the sliding window is showed in Figure 2. We group the pairs according to their time lag and calculate the posterior probability for each lag value τ : 1 ≤ τ ≤ τ max . For each gene, its potential TFs are also collected to be used as the prior knowledge to limit the search space during the process of structure sampling.

Structure sampling using RJMCMC
We choose sampling approaches rather than heuristic methods to search network structures due to the reason that microarray expression data are usually sparse, which makes the posterior probability of structures to be diffuse [9]. In this approach, a group of most likely structures could explain data better than a single one. We use a sampling method called RJMCMC (Reversible Jump Markov Chain Monte Carlo) to collect structure samples. The details of this method are available on [19].
Compared with the move types introduced in [1], we add one new move type called change lag and modify most of the existing operations by incorporating more restrictions. We also define a vector of time points L T = (L 1 ,...,L m−1 ), where L i : 1 ≤ i ≤ m − 1 is the start time point where G i+1 is applied. We use Metropolis-Hastings algorithm for RJMCMC sampling [20]. The move set of our RJMCMC consists of 11 move types: MT1: add edge to G i . MT2: delete edge from G i . MT3: add edge to ΔG i . MT4: delete edge from ΔG i . MT5: move edge between ΔG i s. MT6: shift time, which changes a single L i 's value. This operation will trigger the checking of τ i 's value under the restriction of Both MT8 and MT9 operations will trigger the change of dimensions of L T and τ T . In MT8, the new component of τ T takes the least value of two merged components. Similarly with MT8 and MT9, M10 and M11 will change the dimensions of L T and τ T . MT1, MT3, MT10 and MT11 follow the restriction that the edges pointed to one target gene should have the origins from its potential regulators.

Experimental study and evaluation
We performed all the experiments on a cluster with 256 Intel Xeon 3.2 Ghz EM64T processors with 4 GB memory each. We implemented our method FLnsDBNs  The parameter configurations of three methods are shown in Figure 4 and 7. Figure 1 One example of detecting a potential up-regulation pair A ! B.
(Flexible Lag Non-Stationary Dynamic Bayesian Networks) in Matlab. We compare three approaches: our approach FLnsDBNs, reversible jump Markov chain Monte Carlo Non-Stationary Dynamic Bayesian Networks (RJnsDBNs) [1], and Allocation Sampler Non-Stationary Dynamic Bayesian Networks (ASnsDBNs) [2]. For RJnsDBNs, we use the default setting of unknown numbers and times of transitions (UNUT) in all of the data sets. RJnsDBNs is implemented in Java, and ASnsDBNs is implemented in Matlab. We show the average elapsed time of three methods on two data sets in Table 1. In FLnsDBNs, we ignore the computational cost on the potential regulator detection process because it takes less than 0.03 second. Although the direct comparison of three approaches by using the elapsed time is unfair due to the difference in implementation, our method shows the comparable computational performance with ASnsDBNs.
Our experimental study is based on three data sets: (i) Bone Marrow-derived Macrophages gene expression time series data (Macrophages data set), (ii) Circadian regulation in Arabidopsis Thaliana gene expression time series data (Arabidopsis data set), and (iii) Drosophila muscle development gene expression time series data (Drosophila data set). To compare the results from different data sets, we follow the evaluation method introduced in [2,9,21]. For each data set, we first collect gold standard reference networks as the ground truth. For the Macrophages data set, such reference networks are available in [2,22,23]. For the Arabidopsis data set, we collect the network information from [3,[24][25][26][27]. For the Drosophila data set, there is no ground truth regarding the network structure. We compare our method with others by showing the commonality and differences. In case where we have ground truth network structure (the Bone Marrow data set and Arabidopsis data set), we use the area under receiver operating characteristic curve (AUROC) values to evaluate the performance. We obtained the ROC curves by postprocessing the posterior probabilities of directed edges and taking different cutoff thresholds in [0, 1]. If the posterior probability of an edge is greater than the threshold, we keep the edge. Otherwise, we do not keep the edge. With the ROC curves, we evaluate the performance of different methods by comparing the AUROC scores. In addition, for each data set, we show the posterior distribution of the number of segments and the locations of changepoints. In all of our experimental study, we find that the method FLnsDBNs produces compatible results with previous methods and demonstrates better network prediction performance in all the data sets. Before we discuss the details of experimental results, we present our data set first below.

Data sets
As mentioned briefly before, we evaluate our method on three data sets used in [1,2]. We preprocess the original data sets by following Zhao's work [28]. We set the  values of a missed time point with the mean of its two neighbors; i.e., X i , t = (X i , t−1 + X i,t+1 )/2 if 1 < t < T. If the missed values are at the beginning or end, simply set the same value as its neighbor; i.e., X i,t = X i,t+1 if t = 1 or X i,t = X i,t−1 if t = T. In the following, we show the details of each data set.
Bone Marrow-derived Macrophages gene expression data. Interferon regulatory factors (IRFs) are proteins crucial for the mammalian innate immunity [29]. These transcription factors are central to the innate immune response to the infection by pathogenic organisms [23]. We use the Macrophage data sets sampled from three external conditions: (I) Infection with Cytomegalovirus (CMV), (II) Treatment with Interferon Gamma (IFN g ), and (III) Infection with Cytomegalovirus after pretreatment with IFN g (CMV+IFN g ). Each data set has 3 genes: Irf1, Irf2 and Irf3, and contains 25 time points with the interval of 30 minutes. We use the network Irf 2 ↔ Irf 1 ¬ Irf 3 as the gold standard and assume the network never changes over the time.
Arabidopsis thaliana circadian regulation gene expression data. A. thaliana circadian gene expression data was sampled to understand the internal clock-signalling network of plant. Two data sets were collected with the interval of 2h from two light-dark conditions: 10h:10h and 14h:14h light/dark cycles, both of which contain 13 time points. We choose a group of 9 genes, LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3 for analysis, which create transcriptional feedback loops. We show the referred biological regulatory network in Figure 3. In this network, CCA1, LHY and TOC1, as core components of the reciprocal regulation , are important for the proper function of this oscillator network in A. thaliana [3]. CCA1 and LHY proteins' direct binding to the promoter of TOC1 represses the expression of TOC1, and ELF3 works as a negative regulator of light signaling to the clock oscillator and enables the induction of oscillator output [24,25]. The pseudoresponse regulators PRR5 and PRR9 are activated by CCA1 and LHY accompanied with light, and repress CCA1 and LHY subsequently. G1 is activated by light and improve the expression of TOC1. ELF4 is repressed by CCA1. And PRR3 is highly correlated with TOC1 and together form a functional complex [30].
Drosophila muscle development gene expression data. The original transcriptional profile on the life cycle of Drosophila melanogaster contains 4028 genes, nearly one third of all of the predicted Drosophila genes. The samples were collected over 66 time steps throughout the life cycle of Drosophila melanogaster consisting of four periods: embryonic, larval, pupal, and adulthood periods [31]. The intervals of sampling are not even, from overlapped 1 hour during the early embryonic period to multiple days in the adulthood. We choose 11 genes for analysis, which are eve, gfl/lmd, twi, mlc1, sls, mhc, prm, actn, up, myo61f, msp300. Those genes were reported to be related with the muscle development of Drosophila.

Experimental results
In this section, we compare the experimental results of three approaches: FLnsDBNs, RJnsDBNs, and ASnsDBNs on three data sets.
The experimental results on Macrophages data. On the Macrophages data, for each method, we run 10,000 iterations for burn-in and then take additional 40,000 iterations to collect samples. In Figure 4, 5 and 6, we show the posterior probabilities of the numbers of segments and changepoints on three Macrophages data sets. The sample collection of FLnsDBNs on the Macrophages data takes about 2 minutes.
For the CM V data, we first observe that there is a high agreement among all three methods in term of the range of the number of identified segments. The ranges are 1~4 for FLnsDBNs, 1~4 for RJnsDBNs, and 24 for ASnsDBNs. When we compare the distributions of the number of segments identified by three methods, we observe that ASnsDBNs clearly identifies a dominant 3segment in the data set while the posterior probabilities produced by FLnsDBNs and RJnsDBNs are flat. For the predicted locations of the changepoints, FLnsDBNs identifies three posterior peaks at time stamps 4, 8, and 14. RJnsDBNs finds four peaks at 5, 11, 14, and 19. In ASnsDBNs, two peaks happen at 1 and 4 with the probabilities more than 0.5. There is a consensus among three methods that the most probable changepoint occurs at the location 4. The results of three methods are consistent with the biological phenomenon that the simultaneous responses of Macrophages happen under the attack of Cytomegalovirus [2]. In order to assess the   network prediction performance, we show the AUROC scores in Table 2. We find that all methods perform well in the CMV data with the AUROC scores equal to 1. For the CMV + IFN g data, all three methods identify 1 segment, which corresponds to a coexistence state between virus and its host cell [2,32], and have the same range of the number of segments 1~3. In Table 2, we find that FLnsDBNs shows a much better network prediction with the AUROC score equal to 1 while in RJnsDBNs the AUROC score is equal to 0.2222 and in ASnsDBNs the AUROc score is equal to 0.6667. For the IFN g data, there is a postulated transition with the immune activation under the treatment of IFNg. FLnsDBNs infers 2 segments and finds two posterior peaks of transition time at 8 and 14.
ASnsDBNs and RJnsDBNs infer only one segment, even though the two methods identify a differnt posterior peak at the location around 5. On the assessment of the predicted network structures, the AUROC scores are 0.8333 in FLnsDBNs, 0.7778 in RJnsDBNs, and 0.6667 in ASnsDBNs. In all of three Macrophages data sets, our approach shows the best network prediction accuracy.
For each Macrophages data set using FLnsDBNs and RJnsDBNs methods, we find that the posterior probability distributions of any edge do not change much across different segments. This finding is consistent with the assumption that the underlying network does not change through the time.
The experimental results on Arabidopsis data. On the Arabidopsis data, we use a larger number of iterations in the MCMC sampling because the data set is much larger than the Macrophages data. We run 10,000 iterations for burn-in and then take additional 990,000 iterations to collect samples. The sample collection of FLnsDBNs on the Arabidopsis data takes about 4 hours.  In Figure 7 and 8, we show the posterior distributions of the numbers of segments and changepoints on two Arabidopsis data sets. For the Arabidopsis T20 data, in FLnsDBNs the range of the number of segments is 23 , and in RJnsDBNs and ALnsDBNs the ranges are 14 . In FLnsDBNs, the dominant samples are the ones with 2 segments while in AlnsDBNs they are 3 segments. For the Arabidopsis T28 data, the ranges are 23 in FLnsDBNs, 1~3 in RJnsDBNs and 3~5 in ASnsDBNs. FLnsDBNs infers 2 segments, RJnsDBNs infers 1 segment, and ASnsDBNs infers 5 segments, respectively on the T28 data. In both data sets, we find that the differences of the posterior probabilities of 2 and 3 segments are low in RJnsDBNs and the difference between the posterior peaks of changepoints and the time points nearby are not noticeable. Hence, for this data set, we only use a single network in RJnsDBNs to compare with other methods. Using ASnsDBNs, the poseterior peaks of changepoints on T20 data are 1, 5 and those on T28 are 2, 7, 10. In [2], the results of ASnsDBNs are explained as a phase shift incurred by different dark/light cycles. However, our approach predicts the posterior peak of changepoints both at the location 6. We evaluated the network reconstruction accuracy of three methods by comparing with the reference network showed in Section 3.2. We show the AUROC scores in Table 3. In addition, we use a new comparative criteria called the TP|FP=5 counts [2,21] to further demonstrate the performance of our method. TP are the true positive counts; FP are the false positive counts; TP|FP=5 are the TP counts when FP is 5. The TP|FP=5 counts of three approaches are shown in Table  4. FLnsDBNs outperforms other two methods in both two evaluation criteria of the AUROC score and TP| FP=5 counts on the Arabidopsis data sets.
The experimental results on Drosophila data. For the Drosophila data, We run 10,000 iterations for burnin and then take additional 990,000 iterations to collect samples. The sample collection of FLnsDBNs on the Drosophila data takes about 10 hours.
We show the results of posterior probabilities of the numbers of segments and changepoints in Figure 9. ASnsDBNs predicts more than 20 segments and fails to provide a meaningful result of changepoints. Therefore,   in the subsequent discussion, we only compare FLnsDBNs and RJnsDBNs approaches. The assumed transition time of four life periods are located at 30, 40 and 58. RJnsDBNs predicts 3 segments with the posterior peaks located at 11 and 21. FLnsDBNs prefers 4 segments with the posterior peaks at 19, 36 and 54, which happen before the assumed changepoints. And our prediction of the Embryonic!Larval transition occurs at 19 much earlier than 30. Both ASnsDBNs and RJnsDBNs methods do not converge well in this fly data set. We show the reconstructed networks of our approach, those of RJnsDBNs (UNUT), a stationary directed network predicted by [33], and the non-stationary undirected networks predicted by [34] in Figure 10 for the purpose of comparison. In addition, we provide the networks predicted by RJnsDBNs with another setting of KNKT to compare because the networks inferred by RJnsDBNs (UNUT) show much difference from other predictions. In the following, we only compare the results of [34], [33], RJnsDBNs (KNKT) and FLnsDBNs.
These four predictions share many similarities and also show some difference. We find that the gene msp-300 may play a key role in the cluster of these 11 genes. myo-61f is only predicted to be a regulated gene by msp-300 in [33], but other three methods show that myo-61f is another key gene in this cluster. In [33], myo-61f is correlated with twi, sls, mlc1, mhc and msp-300. In RJnsDBNs (KNKT), myo-61f serves as the regulators of prm, up and sls. Our approach predicts that myo-61f regulates four genes: sls, prm, actn, and msp-300. FLnsDBNs, [33] and [34] all agree that there are regulation relationships between myo-61f and msp300, while RJnsDBNs (KNKT) did not identify this interaction. Different from the prediction of RJnsDBNs (KNKT), Our approach finds that twi is not separated from other genes and actn serves as the parents of other genes, which is consistent with the networks in [33]. In Figure 10E, twi is the regulator of sls, and actn regulates sls, prm and gfl. We also notice that the regulating effects of myo-61f and msp-300 on other genes intensify over the time. Nearly different from all of three methods, our approach finds that twi and gfl/lmd are regulators of other genes while only [33] sees twi as a regulator. gfl/lmd and twi are direct upstream regulators of mef2 [35,36] that directly regulates some target myosin family genes at all stages of muscle development [37], such as mhc and mlc1. Evidence show the cooperative binding of twi and Mef2 or gfl/lmd and Mef2 to these target genes are attractive models [35,37]. It indicates that a co-regulation role of twi and gfl/lmd with Mef2 to other muscle development genes may exist. The prediction of our method shows this biological behavior. Currently the reference regulatory network on the muscle development of Drosophila melanogaster is not available and the relevant biological literatures are limited. Further biological researches and experiments are needed to verify the regulatory networks.

Conclusion
In this paper we introduced a new non-stationary DBNs method and applied our approach on three time series microarray gene expression data. Our new DBNs method uses a systematic way to determine potential regulators and takes a flexible lag choosing mechanism. Our experimental study demonstrated that compared with recent proposed non-stationary DBNs methods, our approach has better structure prediction accuracy. By detecting potential regulators, our method reduces the size of the search space, hence may speed up the convergence of MCMC sampling.