Boolean networks using the chi-square test for inferring large-scale gene regulatory networks

Background Boolean network (BN) modeling is a commonly used method for constructing gene regulatory networks from time series microarray data. However, its major drawback is that its computation time is very high or often impractical to construct large-scale gene networks. We propose a variable selection method that are not only reduces BN computation times significantly but also obtains optimal network constructions by using chi-square statistics for testing the independence in contingency tables. Results Both the computation time and accuracy of the network structures estimated by the proposed method are compared with those of the original BN methods on simulated and real yeast cell cycle microarray gene expression data sets. Our results reveal that the proposed chi-square testing (CST)-based BN method significantly improves the computation time, while its ability to identify all the true network mechanisms was effectively the same as that of full-search BN methods. The proposed BN algorithm is approximately 70.8 and 7.6 times faster than the original BN algorithm when the error sizes of the Best-Fit Extension problem are 0 and 1, respectively. Further, the false positive error rate of the proposed CST-based BN algorithm tends to be less than that of the original BN. Conclusion The CST-based BN method dramatically improves the computation time of the original BN algorithm. Therefore, it can efficiently infer large-scale gene regulatory network mechanisms.

In the linear modeling of a genetic network, the expression data is fitted using a regression model, where the change in expression levels is a response for all other genes [1]. Although such standard linear modeling approaches enable the analysis of many different features of the modeled system, they are not effective in genomewide network discovery. This is because the number of candidate parameters and models is very high and therefore it is difficult to search efficiently and reliably with tight control on many false positives.
Bayesian network algorithms have limitations with regard to the determination of an important network structure because of their complex modeling strategies (with a large number of parameters to be estimated) and a long computation time for searching all potential network struc-tures on genome-wide expression data. These limitations of the Bayesian network may be overcome by the dynamic Bayesian network (DBN), which models the stochastic evolution of a set of random variables over time [11,12]. Although some improvements have been proposed, the accuracy of prediction of the DBN is relatively low, and its excessive computational time is still very high [13].
Recently, studies on the hierarchical scale-free network in lower organisms [14,15] have indicated the necessity of a network method for the simultaneous analysis of thousands of genes. The human B-cell network analysis using mutual information [16] is a type of hierarchical scale-free analysis. Although this analysis successfully constructs gene networks with thousands of genes, the method is based on mutual information between two genes; therefore, it cannot obtain the response of a target gene when more than two genes simultaneously affect the target gene.
Among these methods, the Boolean network (BN) is useful to construct gene regulatory networks observed by high-throughput microarray data because it can monitor the dynamic behavior in complex systems based on its binarization of such massive expression profiling data [17,18]. The Boolean function of a gene in a BN can describe the behavior of the target gene according to simultaneous changes in the expression levels of the other genes.

Boolean networks
BN models were first introduced by Kauffman [3]. In these models, a gene expression is simplified with two levels: ON and OFF. A BN G(V, F) is defined by a set of node V = {x 1 , ..., x n } and a set of Boolean functions F = {f 1 , ..., f n }.
A Boolean function f i (x 1 , ..., x k ), where i = {1, ..., n}, with k specified input nodes (indegree) is assigned to node x i . The regulation of nodes is defined by F. More specifically, given the values of the node (V) at time t -1, the Boolean function are used to update these values at time t.
The model system has been developed into a so-called Random BN model [19]. BNs have attracted attention since the introduction of probabilistic Boolean network (PBN) models by Shmulevich et al. [6]. Many algorithms have been proposed for the inference of BNs. For example, the REVEAL algorithm has been introduced by Liang et al. [5] for causal inference by using mutual information, which is the most fundamental and general measure of correlation. Akutsu et al. [4] have constructed a BN structure based on the consistency problem, which can be used to determine the existence of a network that is consistent with the observed data. In one of the most recent studies on the BN algorithm, the Best-Fit Extension problem [20] is used for the inference of PBNs [6]. In PBNs, every node (gene) can have a chance of acquiring different Boolean functions. The probabilistic selection of Boolean functions adds flexibility in the determination of the steady state of BNs and monitoring of the dynamical network behavior for gene perturbation or intervention [18,21].
Recently, several software packages have been developed for constructing BNs. The random BN toolbox [22] and the PBN toolbox [6] are available in Matlab. NetBuilder (version 0.94) [23] is a genetic regulatory network tool used to simulate genetic network using a BN. The BN has been widely used to describe biological processes. For example, Huang [17] has used these networks to represent cell growth, cell differentiation, and apoptosis. A transcriptional network model in yeast has been studied using a random BN [24]. Further, Johnson [25] has studied the signal transduction pathways in a B-cell ligand screen.

Advantages and disadvantages of Boolean networks
The estimation of gene regulatory networks using the BN offers several advantages. First, the BN model effectively explains the dynamic behavior of living systems [17,18,26]. Simplistic Boolean formalism can represent realistic complex biological phenomena such as cellular state dynamics that exhibit switch-like behavior, stability, and hysteresis [17]. It also enables the modeling of nonlinear relations in complex living systems [27]. Second, Boolean algebra is an established science that provides a large set of algorithms that are already available for supervised learning in the binary domain, such as the logical analysis of data [28], and Boolean-based classification algorithms [29]. Finally, dichotomization to binary values improves the accuracy of classification and simplifies the obtained models by reducing the noise level in experimental data [30,31].
However, the BN has some drawbacks. One of the major drawbacks is that it requires extremely high computing times to construct reliable network structures. Therefore, most BN algorithms such as REVEAL can thus be used only with a small number of genes and a low indegree value. For higher indegree values, these algorithms should be accelerated through parallelization in order to increase the search efficiency in the solution space [5]. The consistency problem [4]  for a fixed indegree k; this is because Boolean functions must be checked for each of the possible n C k combinations of variables and for m observations. The Best-Fit Extension problem [21] also works in time complexity O( · ·m·n·poly(k)). Although the improved consistency algorithm and Best-Fit Extension problem work in time complexity O( ·m·n·poly(k)) [32], they still exhibit an exponential increase in the computing time for the parameters n and k. Such high computing times are a major problem in the study of large-scale gene regulatory and gene interaction systems using BNs.

Chi-square-test-based Boolean network
In order to overcome the time complexity problem of the BN method, we propose a variable selection method based on the chi-square test (CST). The proposed CSTbased BN adopts the Best-Fit Extension problem, which is commonly used in the PBN to effectively determine all possible relevant Boolean functions. In our method, the maximum indegree of networks is assumed to be three. We also focus on the Boolean functions that comprise three different literals (input genes in the Boolean function). Each literal is connected by the three Boolean operators NOT(¬), AND(∧), OR(∨); for example, f = X1 ∧ ¬X2 ∨ X3. Then, the time complexity of the CST-based BN reduces to O( · m·poly(k)), where n is the total number of genes, k is the indegree, m is the total number of time points, n 1, j is the number of first selected genes for the jth gene, and n 2, ij is the number of second selected genes when the ith gene is selected in the first step. We have found that the dichotomization of the continuous gene expression values allows us to efficiently perform the independence test for a two-way contingency table on each pairwise network mechanisms. We use the CST to identify genes that are associated with a target gene. A target gene would be expressed in accordance with a Boolean function related to the selected genes. Since the genes have only two levels, (0 and 1), we use 2 × 2 and 2 × 2 × 2 contingency tables to identify the relationship between two and three genes respectively. The proposed method is used along with the Best-Fit Extension problem. This method is described in detail in the Methods section.

Simulation study
For our simulation study, an artificially generated network structure is illustrated in Figure 1. It comprises 40 nodes and the maximum value of the indegree (k) is three. The network structure is composed of 27 Boolean functions that are randomly generated from 40 nodes. Forty sets of binary data are obtained sequentially from the network structure. Each data set has different initial states and seven time points. Since the data sets are generated from definite Boolean functions, the genes in the Boolean functions tend to have strong associations. The CST-based BN uses two thresholds α 1 and α 2 where α 1 is used for selecting variables for the main effect and α 2 is used for the conditional effect. A detailed description is provided in the Methods section. The smaller the values of α 1 and α 2 , the stronger is the association of the variables. In order to select the nodes that have strong associations with the target nodes, we used very small cutoff values -α 1 = 1 -e15 and α 2 = 1 -e15 -for the CST-based BN. Table 1 Figure 2 shows the FPRs for various noise levels. It appears that the CST-based BN reduces the FPR of the original BN.
The CST-based BN method yielded the same estimated Boolean functions as those obtained by the original BN method for various noise levels. However, there are large differences between the computing times. Table 2 shows the selected number of nodes in the first and second steps of the variable selection method. The first column shows the node number. The second and third columns show the number of nodes in the first and second steps, respectively. Each number in the third column, separated by a comma, is the number of selected nodes for each node selected in the first step. For example, in the second line of Table 2, there are two selected nodes in the first step. For these two nodes, there are 39 and 38 selected nodes in the second step, respectively. From the result of the variable selection (Table 2), the ratio of the time complexities of the two methods can be obtained as follows: In summary, the CST-based BN method was approximately 6.9 times faster than the original BN method. If the network had a larger number of nodes (n), then the difference between the computing times of the two algorithms would be significantly high.

Yeast cell cycle data
In order to demonstrate the improvement in the computing times, we apply the proposed variable selection method to yeast cell cycle data [33]. The data comprises 18 time points (alpha-factor-based synchronization experiment). In this example, the computing time and accuracy of network estimation of the original BN method are compared with those of our CST-based BN method.

Comparison between the network structure estimation accuracies of the CST-based BN and the original BN
Our variable selection method significantly improves the computing time of the BNs. However, the accuracy of our method should be assessed before comparing the computing times of the two methods. The improvement in the computing times primarily depends on the cutoff statistical significance levels, α 1 and α 2 , for the selection of genes Time complexity of original BN Time complexity of CST-based BN  Network structure with 40 nodes in simulation data set We define the error rate as the discrepancy between the Boolean functions estimated by the original BN and the proposed CST-based BN as follows: where BF OriginalBN and BF CSTbasedBN are sets of Boolean functions estimated by the original BN and the CST-based BN, respectively. Three data sets with randomly selected 80, 100, and 120 genes were used to compute the error rate.
We have calculated the error rate for various values of α 1 and α 2 (Figure 3(a),(b) and Figure 4(a)-(d)). Figure 3 shows the error rate when the error size of the Best-Fit Extension problem is 0, while Figure 4 shows the error rate when the error size is 1. In each plot, the y-axis represents the error rate and the x-axis represents the value of α 2 . The error rate decreases with an increase in α 1 and α 2 ; this implies that a less conservative cutoff value would be more suitable to construct a BN.
In order to select appropriate values of α 1 and α 2 , the error rates were obtained for various combination of (α 1 , α 2 ). Figures 3 and 4. To obtain an error rate of zero, the values of α 1 and α 2 should be greater than 1% and 2%, respectively, when the error size is zero (Figure 3(a),(b)). However, when the error size is one, larger values of α 1 and α 2 are required to obtain an error rate of zero (Figure 4(a)-(d)). Based on these results, we suggest that the following values be used when the error size is 0, α 1 = 1%, α 2 = 2% and when the error size is 1, α 1 = 7.5%, α 2 = 10%.

Comparison between the computing times of the original and the proposed BN algorithms
In order to compare the computing times, we executed the BN program based on the Best-Fit Extension problem (written in C language). Figure 5 shows the computing times with a change in the number of genes from 40 to 120. We set the value of error size in the Best-Fit Extension problem as 0 and 1 and used the variable selection criteria α 1 = 1%, α 2 = 2% and α 1 = 7.5%, α 2 = 10%, respectively.

Construction of gene networks with yeast cell cycle related 800 genes
We also applied the CST-based BN to the subset of the yeast cell cycle data with 800 genes [33] for demonstration. Figure 6 shows a partial structure of the gene network structure constructed by using the CST-based BN. The Best-Fit Extension problem provided several Boolean functions for a gene at time t. For the demonstration, we randomly selected a Boolean function from the estimated Boolean functions of each gene; the method used for this purposed was similar to that used by the PBN [6,34] to select a set of Boolean functions for a given gene using the coefficient of determination (COD).
For all 800 genes, the CST-based BN required approximately three days to construct the network structure. In order to estimate the total computation time of the original BN, we selected the first target gene and constructed the BN, which required approximately 37,011 s. Therefore, the total computation times for all 800 genes will be Error rate Number of False positive rates Figure 2 False positive rates. The result obtained by the CST-based BN is more accurate than that obtained by the original BN. It appears that the Boolean networks determine Boolean functions with significantly higher false positive genes than those in the case of the proposed variable selection method.

Discussions and Conclusion
Recent studies [14,15] have emphasized that thousands of genes must be considered simultaneously in order to construct gene regulatory networks in an organism. The BN method is useful for constructing a gene regulatory network. If the gene expression data contain a considerable amount of noise, the binary transformation of these data can reduce the error [35]. The BN has been successfully used to model a nonlinear system [27] and the dynamic behavior of living systems [18,26]. Despite these advan-tages, it is difficult to apply the BN method to large-scale gene regulatory network studies due to the extremely high computation times.
In order to overcome this computational drawback, we proposed the variable selection method using the CST for the two-way and three-way contingency tables of Boolean count observations. This method reduces the computation times significantly; for example, for 120 genes, the computation time is approximately 70.8 times faster than that of the original BN method. If the total number of genes and the value of k increase, the improvement in the computation time is expected to be significantly greater than the original BN method. Also the proposed method can be easily implemented with the existing BN modeling algorithms such as the PBNs by efficiently selecting only the most relevant genes for determining the Boolean functions. This method is thus demonstrated to reduce the false positive rate, which is an important problem in network studies conducted on a genome-wide scale.
In our method, the value of k is assumed to be three. However, it is possible to use a large value of k greater than 3. Since our method uses the Best-Fit Extension problem, a gene can be controlled by more than one Boolean function. Therefore, it appears that k = 3 provides a large number of Boolean functions that can model the gene regulatory network successfully.
The proposed CST-based BN used the two-step discovery of 3-indegree Boolean functions because the prediction information of addtional genes in such high-dimensional Boolean functions is mainly observed after considering, or conditional on, primary genes' effects. This strategy, in turn, resulted in much more efficient discovery of the most predictive high-dimensional Boolean functions in our BN modeling.
The result of the proposed CST-based BN may be sensitive to the sample size n. When n is small, the contingency table may contain many cells with low and zero frequencies. To ensure that the expectation value is not equal to zero, a continuity correction is used by adding a small constant 0.1 to the observed frequency in each cell [36]. This simple correction produces a successful result in the real data set [33] that contains 17 time points. However, we should be more careful while applying the CST method to data with small time points because the result of the CST can be less reliable for the sparse data set. In this case, we suggest that the CST be replaced with Fisher's exact test that provides a more reliable result for the small sample size data [36]. The small sample size problem also makes it difficult for the original BN algorithm to produce a reliable result. We think that in the near future the advancement of high throughput techniques and the costdown of microarrays will enable us to solve the sparse data problem by producing large data sets easily.
The improvement of the computing times using the CSTbased BN will significantly increase the utility and appli-Error rates of estimated Boolean functions using variable selection method when the error size is 0  Therefore, for practical application, we suggest that α 1 = 1% and α 2 = 2% be used when the error size of the Best-Error rates of estimated Boolean functions using variable selection method when the error size is 1 Figure 4 Error rates of estimated Boolean functions using variable selection method when the error size is 1. The error rates were calculated from the three data sets with a different number of genes: 80, 100, and 120. The genes were randomly selected from the yeast cell cycle data set. (a), (b), (c), and (d) show the plots when the cutoff values of α 1 are 1%, 2.5%, 5%, and 7.5%, respectively. The appropriate values of α 1 and α 2 are 7.5% and 10%, respectively, when the error size is 1. In addition, a more careful dichotomization is required for a more accurate biological interpretation of the network structure. For example, since microarray data have continuous expression values with a considerably large amount of information, the dichotomization may require the selection of an appropriate threshold value depending on the biological function of each gene [35]. The performance may not be remarkable when a small number of time points and genes are available. However, we show that the proposed variable selection method is significantly more efficient for large-scale gene regulatory network studies. For example, the CST-based BN is approximately 114 times faster than the original BN for 800 genes ( Figure 6).
The next step would be to perform a biological evaluation of the selected network structure. However, the main focus of our study is to improve the computation times of the BN by using the CST. Our approach allows the application of the BN to genome-wide network construction and discovery. A future study will evaluate the accuracy of the BN and compare it with other network methods such as the Bayesian network and hierarchical scale-free network.

Methods
The proposed CST-based BN consists of two steps. The first step is to determine a pair of genes that are associated with each other. The second step is to determine the third gene that is conditionally associated with the pair of genes identified in the first step.

First step for the main effect
Let n be the total number of genes. In the first step, 2 × 2 contingency tables are constructed from the dichot- For the continuity correction, we add an arbitrary small number a to each observed frequency in order to prevent E pq from becoming zero [36]. We use a = 0.  For the given expression value of h, there are two 2 × 2 contingency tables for the i and j genes. We focus on the conditional independence test. The null hypothesis that the ith gene at time t -1 and the jth gene at time t are con-ditionally independent when the hth gene expression level at time t -1 is given by H 0 : π pq|o = π p+|o π q+|o for all p (= 0, 1) and q (= 0,1), where π ..|o represents the conditional probability for the given o.  at least one of the two CST in the second step is significant (the p-value of the test is less than α 2 ).
The rationale for using this conditional independence test is that h affects the association between two genes i and j.
The conditional test approach is very effective in the determination of a relationship for more than two genes.

Implementation of two step variable selection method
We select n 1, j genes at time t -1 that are associated with the jth gene at time t in the first step where 1 ≤ n 1, j ≤ n. If the ith gene is one of the n 1, j genes, we select n 2, ij genes in the second step for 1 ≤ n 2, ij ≤ n -1 (excluding the ith gene).
Then, we consider all possible combinations for selected three genes (when k = 3); one gene is the ith gene selected in the first step and the other two genes are selected in the second step. These combinations are used instead of all possible combinations in the original BN algorithms. The time complexity of determining the Boolean functions for the jth gene is O( · m·poly(k)) and that of variable selection using the CST is O(n 2 + n 1, j ·(n -1)).
Hence, the total time complexity of the proposed algorithm is expressed as follows: As n increases, the time complexity of determining the Boolean functions dominates the time complexity of var-iable selection because the former increases more rapidly than the latter. Table 3 shows the expression profiles of eight genes with 17 time points. Here, only two Boolean functions -f 1 = ¬G3 ∨ (¬G4 ∧ ¬G8) and f 3 = ¬G1 ∨ (G2 ∧ G6) -are true.

EXAMPLE
To determine these functions, the original BN searches all possible combinations of genes at time t -1 for every target gene at time t ( 8 C 3 × 8 = 448 when the indegree is three). We can reduce the BN combinations by using the proposed variable selection method. Table 4 shows the result obtained by using the proposed variable selection method. Five genes at time t (2nd, 4th, 5th, 6th, and 8th genes) are excluded because they do not have any genes selected in the first step.
Figures 7(a) and 7(b) show the 2 × 2 contingency tables using (G1 t , G3 t -1 ) and (G3 t , G1 t -1 ), respectively. The second step is performed only for the selected genes in the first step in order to identify genes that are conditionally associated with the target gene. Figure 8 shows 2 × 2 × 2 tables with three genes -a gene at time t, a gene at time t -1 from the first step, and new gene at time t -1: (a) for (G1 t , G3 t -1 , G4 t -1 ), (b) for (G1 t , G3 t -1 , G8 t -1 ), (c) for (G1 t , G3 t -1 , G2 t -1 ), and (d) for (G1 t , G3 t -1 , G6 t -1 ). The total number of possible combinations of the CST-based BN is 47. Therefore, the total time complexity of the CST-based BN is 9.5 times faster than that of the original BN.    First step for the main effect Second step for the conditional effect Figure 8 Second step for the conditional effect. 2 × 2 × 2 tables with three genes: a gene at time t, a gene at time t -1 from the first step, and a new gene at time t -1. (a) (G1 t , G3 t -1 , G4 t -1 ), (b) (G1 t , G3 t -1 , G8 t -1 ), (c) (G1 t , G3 t -1 , G2 t -1 ), and (d) (G1 t , G3 t -1 , G6 t -1 ).