Inference of regulatory networks with a convergence improved MCMC sampler

Background One of the goals of the Systems Biology community is to have a detailed map of all biological interactions in an organism. One small yet important step in this direction is the creation of biological networks from post-genomic data. Bayesian networks are a very promising model for the inference of regulatory networks in Systems Biology. Usually, Bayesian networks are sampled with a Markov Chain Monte Carlo (MCMC) sampler in the structure space. Unfortunately, conventional MCMC sampling schemes are often slow in mixing and convergence. To improve MCMC convergence, an alternative method is proposed and tested with different sets of data. Moreover, the proposed method is compared with the traditional MCMC sampling scheme. Results In the proposed method, a simpler and faster method for the inference of regulatory networks, Graphical Gaussian Models (GGMs), is integrated into the Bayesian network inference, trough a Hierarchical Bayesian model. In this manner, information about the structure obtained from the data with GGMs is taken into account in the MCMC scheme, thus improving mixing and convergence. The proposed method is tested with three types of data, two from simulated models and one from real data. The results are compared with the results of the traditional MCMC sampling scheme in terms of network recovery accuracy and convergence. The results show that when compared with a traditional MCMC scheme, the proposed method presents improved convergence leading to better network reconstruction with less MCMC iterations. Conclusions The proposed method is a viable alternative to improve mixing and convergence of traditional MCMC schemes. It allows the use of Bayesian networks with an MCMC sampler with less iterations. The proposed method has always converged earlier than the traditional MCMC scheme. We observe an improvement in accuracy of the recovered networks for the Gaussian simulated data, but this improvement is absent for both real data and data simulated from ODE. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0734-6) contains supplementary material, which is available to authorized users.


Introduction
Markov Chain Monte Carlo (MCMC) is a widespread method for obtaining posterior quantities of interest. It samples successively from the posterior distribution where each sample depends on the previous construction creating in this way the notion of a Markov chain. MCMC has been applied successfully in a variety of problems. However, there are many aspects that have to be addressed when applying such methods. For instance it is possible that the chain does not converge, or that the initial values of parameters have a big influence in the samples. Another possible problem is that the sequence of sampled values show correlation between them.
Markov chain is a powerful tool which is easy to apply and this introduces the risk of serious errors including inappropriate modelling, errors in calculation or programming and slow convergence [1].
Usually to perform diagnosis of convergence graphs or descriptive measures are analysed. These methods are described in [1][2][3][4][5][6][7] Most of these methods are developed for problems in which few parameters, usually between one and three, are sampled in the Markov chain. Problems where convergence diagnostics is performed in samplers that simultaneously sample several parameters are scarce. Hereafter it is described and explained a new method to verify the indication of convergence in the case of high dimensional MCMC samplers.

Heuristic visual evaluation
When sampling a network structure each sample is an adjacency matrix. For a network with n nodes the adjacency matrix has dimension n × n. Thus, in this case, we have to evaluate the convergence of n 2 parameters that are the posterior probabilities over the edges of the network. Because in this setting the number of parameters is usually high the convergence of the MCMC simulations are frequently accessed with a simple heuristic approach. Assuming that the matrix of posterior probabilities is P each entry p ij of this matrix indicates the marginal posterior probability of the existence of an edge between nodes i and j.
For accessing the MCMC's convergence two simulations are run from different initializations, obtaining two resulting posterior probability matrices, P 1 and P 2 . After, a scatter plot is produced by plotting p 1 ij against p 2 ij , for 1 < i, j < n.  The graph to the right in the bottom row shows good indication of convergence. As can be seen from the graphs the decision about convergence from the figures can be controversial. It is interesting to note that this diagnostic of convergence is only a necessary condition for convergence. Therefore, this method does not guarantee the convergence and can only assure the lack of convergence.

Proposal of an score for the visual heuristic evaluation
The method for the diagnosis of convergence proposed in this section is based in the visual heuristic method presented in subsection . Instead of relying on visual analysis we propose to calculate a numeric value that summarizes what is visualized.
Consider that in each step of the MCMC the sampled parameter is a matrix of parameters, P, with dimension n × n. Matrix entries are p ij , 1 < i, j < n.
First we calculate de distance of all points to the line y = x. The distance, d ij , from a point P (x i , y i ) to a line with equation ax + by + c = 0 is defined by: Because in our case we are interested in the distance of a point from the identity line, defined by y = x, Equation 1 becomes: We are interested in the spread of the points around the line y = x and we use the rms distance as a measure of this spread. The rms distance is defined by: where N is the total number of points and d i is the distance of the i-th point to the line. If we combine Equation 2 with 3 we get: We call this measure the convergence rms, or simply c rms .

Supplementary Results
In this section we present more detailed results which have not been included in the main article due to space restrictions.
Hyperparameter β GGM One important component of the proposed hierarchical Bayesian model is the sampling of the hyperparameter β GGM . For a uniform prior distribution P (β GGM ) and a symmetric proposal distribution R(β ′ GGM |β GGM ), the following acceptance criterion is applied: which can be rewritten as: In this work we follow [8] and set P (β GGM ) to be the uniform distribution in the interval [0, 30]. In Figure 2 details of the sampled hyperparameter β GGM are presented for all the data sets in this work. Each column of graphs presents one type of data and each row presents a different data set. Left, centre and right columns presents respectively Gaussian, Ecoli and Real data sets. Each graph presents in the horizontal axis the MCMC step and in the vertical axis the value of β GGM . In [8] it is shown that values of the sampled hyperparameter β GGM close to zero indicate that the prior knowledge is not used by the BN. On the other hand high values of the sampled hyperparameter indicate that the prior knowledge is used in learning process.
In the present work the sampled values of β GGM in all simulations are greater than zero, indicating that the information obtained from GGM is being used by BNs. This is the expected behaviour because the prior information in the present case comes from the same data which is used in the inference. This use of the information obtained from the GGM leads to the improvement in convergence that we observe in all the results.

AUC diagnostics
One way of indirectly checking convergence is looking to the results in terms of reconstruction accuracy. In order to achieve this we resort to inspecting the values of the AUC for each step in the MCMC simulations. The AUC value is calculated for each simulation step, i.e., in each step we considered that this is the size of the simulation and calculated the AUC value. With this setting it is possible to analyse which would be the results if the simulation was run for the given number of steps. In the main paper we presented the mean and standard deviation of the AUC over the five data sets. Here, in the supplementary material, we show the individual results. In Figure 3 each column of graphs is related to one type of data set. The data set represented in the left, center and right columns are respectively Gaussian, Ecoli and Real. Each row of graphs in Figure 3 is a particular data set. In each graph it is shown the step of the MCMC simulation in the horizontal axis and the value of the AUC from the BNGGM (thick line) and from the BNMCMC (thin line) in the vertical axis.

Convergence criterion
The results considering the convergence criterion presented in Section are shown in Figure 4. In the main paper we presented the result for only one data set and here in the supplementary material we show the results for all the data sets. In Figure 4 each column of graphs is related to one type of data set. The data set represented in the left, center and right columns are respectively Gaussian, Ecoli and Real. Each row of graphs is a particular data set. In each graph the vertical axis presents the c rms and horizontal axis shows the MCMC step. The thin and thick lines in each graph represents respectively BN-MCMC and BNGGM methods. Each of these lines is the result of running the algorithm twice with the same data set but from different initializations.

RMS Distance
Step

BN-MCMC BNGGM
(o) Real data set 5 Figure 4 Comparison between the convergence of BN-MCMC and BNGGM methods. The data set represented in the left, center and right columns are respectively Gaussian, Ecoli and Real. Each row of graphs is a particular data set. In each graph the vertical axis presents the crms and horizontal axis shows the MCMC step. The thin and thick lines in each graph represents respectively BN-MCMC and BNGGM methods. Each of these lines is the result of running the algorithm twice with the same data set but from different initializations.