Network optimization: applied example
We illustrate how our optimization method behaves by following its evolution when applied to a sample data set. We performed 500 independent runs of the optimization method using training set 1 (Additional file 2), PKN 1 (Additional file 4) and the 14 essential nodes discussed in the method section (grey nodes in Fig. 3).
Figure 5a illustrates how the individual components of the fitness function F = (f_T, −N_ess.nodes, N_nodes, −N_edges) evolve during the run giving the network with the minimal fitness function value. Following an initialization of all replicas to the empty network, the optimization process starts with an initial phase where f_T decreases while N_ess.nodes, N_nodes and N_edges all increase as the networks are populated with nodes and edges. The number of nodes and edges is large enough to include all essential nodes (N_ess.nodes = 14) for most networks after the 50th iteration, and when N_ess.nodes saturates the number of nodes starts to decrease. The number of edges tends to increase steadily from the first to the last iteration, but increases faster whenever the number of nodes increases or saturates. This is a consequence of the lexicographical ordering of multi-dimensional fitness functions. The optimization is stopped when the best value of the fitness function fails to decrease during 10 consecutive iterations, and the network with the best value of the fitness function is retained as the output of this optimization run.
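The lexicographical ordering and the stopping rule described above can be sketched as follows. The tuple layout F = (f_T, −N_ess.nodes, N_nodes, −N_edges) and the 10-iteration stopping criterion come from the text; the function names and the way best fitness values are stored are illustrative assumptions, not the paper's implementation.

```python
# Fitness tuples are compared lexicographically: Python tuples already do
# this, so a network's fitness can be encoded directly as a tuple.
# N_ess.nodes and N_edges enter with a minus sign because they are
# maximized while the tuple as a whole is minimized.

def fitness(f_T, n_ess_nodes, n_nodes, n_edges):
    """Build the 4-component fitness tuple F = (f_T, -N_ess, N_nodes, -N_edges)."""
    return (f_T, -n_ess_nodes, n_nodes, -n_edges)

def should_stop(best_history, patience=10):
    """Stop when the best fitness has not decreased for `patience` iterations."""
    if len(best_history) <= patience:
        return False
    # No new minimum appeared in the last `patience` iterations.
    return min(best_history[-patience:]) >= min(best_history[:-patience])

# Example: a network reproducing the whole training set (f_T = 0) with all
# 14 essential nodes beats one with f_T = 0 but only 13 essential nodes,
# regardless of its node and edge counts.
assert fitness(0.0, 14, 15, 58) < fitness(0.0, 13, 14, 60)
```

The sign convention makes a single `min()` over fitness tuples sufficient to pick the best network of a population.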
Figure 5b presents the best network obtained during this optimization run. Its first fitness function component is f_T = 0, which means that it perfectly reproduces all the experiments in the training set. The second component (N_ess.nodes = 14) is also optimal. The third component, N_nodes = 15, is very close to the minimal value of 14. Although it may not be optimal, the fact that no network with N_nodes = 14 and f_T = 0 was found in any of the 500 runs suggests that N_nodes = 15 is the minimum number of nodes compatible with f_T = 0. It is not clear whether the fourth component (N_edges = 58) is optimal, since it is much smaller than the 99 edges of the PKN connecting the 15 nodes of this network. Therefore, while this network may not be an exact global minimum of the fitness function, it seems to be very close, which suggests that our implementation of the genetic algorithm is appropriate to handle this minimization problem.
The goal of our network optimization method is not only to find networks that can reproduce all experiments in the training set (f_T = 0), but also, and most importantly, to find networks that can predict the outcome of experiments against which they were not trained. Since the gold standard network used to generate the PKN and the training set is known, this can be quantified by evaluating the s_all score (defined in the Methods section), which measures the similarity between the average attractors reached by the model network and the average attractors reached by the gold standard network after all possible single node perturbations. Note that contrary to f_T, which only takes into account a subset of the attractors that best match the training set, the s_all score takes into account all attractors reached after perturbation, thus penalizing the situation where only a subset of attractors behave properly. For the model network given in Fig. 5b, s_all ≃ 0.974, so approximately 97.4 % of the predicted node states are correct following single node perturbations. Although only one attractor of the unperturbed gold standard network was part of the training set, the model network nevertheless recovered the same 4 attractors as the gold standard network, with only one error (node CASP8 = 1 instead of 0 in one attractor).
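A minimal sketch of how a similarity score of this kind can be computed; the exact definition of s_all is given in the Methods section, and the array-based representation of average node states used here is a simplifying assumption.

```python
import numpy as np

def similarity_score(model_states, gold_states):
    """Fraction of correctly predicted average node states.

    Both arguments are arrays of shape (n_perturbations, n_nodes) holding
    the average node states reached after each single node perturbation.
    With fractional average states, 1 minus the mean absolute difference
    is a natural analogue of "fraction of correct predictions".
    """
    model_states = np.asarray(model_states, dtype=float)
    gold_states = np.asarray(gold_states, dtype=float)
    return 1.0 - np.abs(model_states - gold_states).mean()

# A perfect prediction gives a score of 1.0; a score of ~0.974 would mean
# about 97.4% of the predicted node states agree with the gold standard.
assert similarity_score([[1, 0], [1, 1]], [[1, 0], [1, 1]]) == 1.0
```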
We next examined the diversity of model networks that can be obtained by running the optimization method multiple times. For each of the 500 independent runs we selected the single best network, and from these 500 optimized networks we then selected the 50 networks with the lowest fitness function values, which were retained as model networks. While none of these networks necessarily represents an exact global minimum of the fitness function, all had optimal values for the first and second components (f_T = 0 and N_ess.nodes = 14). The 50 best model networks were compared to the gold standard model by measuring their s_all scores, and the resulting distribution is shown in Fig. 6 (dark blue) along with the score for the best network (red). The median s_all of the 50 best networks is high, with about 97 % of node states correctly predicted after all single node perturbations, and significantly higher than the distribution of s_all scores for all 500 networks (light blue). This shows that, at least for this sample input data set, networks with small values of the fitness function tend to have high s_all scores, i.e. their predictions are close to the gold standard network predictions.
To check that the good scores obtained by the model networks were not only due to a very informative PKN, which by construction contains all the functional interactions of the gold standard network, we also measured the s_all score of the PKN (black line in Fig. 6). As shown in Fig. 6, scores of the best 50 model networks (dark blue) are significantly higher than the score of the PKN (black), indicating that the optimization method indeed generates models with much improved predictive power compared to the PKN used as input. The s_all scores for random sub-networks of the PKN (shown in grey) are also significantly lower than those for the 50 model networks. This result is not entirely trivial, since these random networks are built with edges of the PKN that correspond to functional interactions of the gold standard network.
Genetic algorithm
The quality of the optimization method, as measured by the s_all score, is determined by the combination of the optimization algorithm (genetic algorithm) and the choice of fitness function. In the previous section, we showed that the optimization method was able to generate model networks with good predictive power (high s_all scores). This result strongly suggests that the choice of fitness function was adequate and that the optimization algorithm was able to find sufficiently good solutions to the minimization problem. However, it does not say anything about the efficiency of the optimization algorithm, and indeed a simple random network generator could also produce optimal solutions in principle, simply by generating a sufficiently large number of networks.
To evaluate our implementation of the genetic algorithm, we compared the evolution of the best fitness function values obtained by our algorithm to the best values obtained on a population of random networks. We started 200 independent runs of the optimization method, using the same input as in the previous section (training set 1 in Additional file 2, PKN 1 in Additional file 4 and the 14 grey nodes in Fig. 3 as essential nodes). During each run, the fitness function value and s_all score of the best network (i.e. the one with minimal fitness function value) were stored for each iteration of the genetic algorithm (black dots in Fig. 5a). The resulting distributions of best fitness function values and corresponding s_all scores obtained with the 200 runs are shown in Fig. 7 (blue boxplots, left panel) as a function of the number of iterations of the genetic algorithm. In parallel, for each run of the optimization method, a corresponding “random run” was created by generating, for each iteration, exactly the same number of random networks (random sub-networks of the PKN) as the number of networks generated by the genetic algorithm. For each iteration of the random run, we stored the fitness function value and s_all score of the best random network (i.e. with minimal fitness function value) obtained since the beginning of the run. The resulting distributions of best fitness function values and corresponding s_all scores obtained with the 200 “random runs” are shown in Fig. 7 (grey boxplots, left panel) as a function of the number of iterations. Note that although Fig. 7 presents data for iterations up to 285, the average run length was 171 iterations. Missing values from the end of each run until iteration 285 were replaced by the fitness function value and s_all score obtained in the last iteration of the run.
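The “random run” baseline described above can be sketched as follows; `sample_network` and `evaluate` are hypothetical stand-ins for the PKN sub-network sampler and the fitness evaluation of the actual method.

```python
import random

def random_run(n_iterations, networks_per_iter, sample_network, evaluate):
    """Track the best (minimal) fitness over purely random sampling.

    For each iteration, generate exactly as many random sub-networks of
    the PKN as the genetic algorithm would, and record the best fitness
    value seen since the beginning of the run.
    """
    best_so_far = None
    history = []
    for _ in range(n_iterations):
        for _ in range(networks_per_iter):
            f = evaluate(sample_network())
            if best_so_far is None or f < best_so_far:
                best_so_far = f
        history.append(best_so_far)
    return history

# Toy stand-ins: a "network" is a bitmask over 20 PKN edges, scored by how
# far its edge count is from an arbitrary target of 12 edges.
random.seed(0)
history = random_run(
    n_iterations=5,
    networks_per_iter=10,
    sample_network=lambda: [random.randint(0, 1) for _ in range(20)],
    evaluate=lambda edges: abs(sum(edges) - 12),
)
assert history == sorted(history, reverse=True)  # best-so-far never worsens
```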
When comparing the results obtained with the genetic algorithm (blue) and the random runs (grey), it is clear that the genetic algorithm is much more efficient at generating networks with low values of the fitness function. This is particularly apparent for the first component of the fitness function (f_T in the upper left panel of Fig. 7), where optimized networks reach a median value f_T ≃ 0.020, while random runs reach only a median f_T ≃ 0.236. In addition, while the other components of the fitness function (N_ess.nodes, N_nodes and N_edges) tend to approach their optimal values in a similar way as in Fig. 5a when using the genetic algorithm, this is not the case for the random runs. More importantly, the s_all score (lower left panel) increases quickly for optimized networks (blue) to reach a median value s_all ≃ 0.938, while it only marginally increases for random runs, reaching a median s_all ≃ 0.788.
In addition to the distribution of best fitness function values obtained during one run, discussed above, we also characterized the distribution of best fitness function values obtained after multiple runs (right panel of Fig. 7). This is particularly relevant in the context of optimization based on genetic algorithms, where it is common practice to start multiple independent runs in parallel. We randomly sampled the specified number of runs (horizontal axis) among the 200 runs, storing the fitness function value and s_all score of the best network (according to the fitness function) found in the sampled runs. This process was repeated 500 times for each number of runs, so that each boxplot summarizes 500 values. Using multiple runs has a dramatic impact on the best value of the fitness function obtained with the genetic algorithm (blue boxes): the median value of the first component (f_T in the upper right panel) decreases from f_T ≃ 0.020 with one run to f_T = 0 with 20 runs, while its dispersion is strongly reduced, with an interquartile range that decreases from 0.052 with one run to 0 with 20 runs. Similarly, while the median s_all score increases from s_all ≃ 0.938 with one run to s_all ≃ 0.970 with 20 runs (lower right panel), its dispersion is reduced, with an interquartile range that decreases from 0.034 to 0.017. The other components of the fitness function (N_ess.nodes, N_nodes and N_edges) are only marginally improved by increasing the number of runs.
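The multi-run sampling procedure can be sketched as follows; the run fitness values in the usage example are made-up numbers, not results from the paper.

```python
import random

def best_of_k_distribution(run_fitnesses, k, n_samples=500, rng=None):
    """Sample k runs (without replacement) from all available runs and
    keep the best (minimal) fitness among them; repeat n_samples times.

    Each returned value corresponds to one point summarized by a boxplot
    as a function of the number of runs.
    """
    rng = rng or random.Random(0)
    return [min(rng.sample(run_fitnesses, k)) for _ in range(n_samples)]

# With made-up per-run fitness values, sampling all runs always yields the
# overall best value, while sampling a single run reproduces the raw
# distribution of per-run results.
runs = [0.1, 0.2, 0.0, 0.3]
assert best_of_k_distribution(runs, k=4, n_samples=10) == [0.0] * 10
```

Increasing k can only lower (or keep) each sampled minimum, which is why both the median and the dispersion of the best fitness shrink as more runs are used.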
The best value of the fitness function obtained with random runs (upper panel, grey boxes) also improves when increasing the number of runs, but it remains far from the best values obtained with the genetic algorithm. The s_all score obtained with random runs (lower panel, grey boxes) only slightly increases, from s_all ≃ 0.788 with one run to s_all ≃ 0.797 with 20 runs.
To summarize, Fig. 7 shows that our implementation of the genetic algorithm is significantly more efficient than random sampling in finding solutions to the fitness function minimization problem.
PKN quality
In the previous sections we used an ideal PKN which contained only (direct or indirect) interactions observed after in silico perturbation experiments of the gold standard network. However, a PKN built from the literature will usually not be perfect; it may contain “noise” in the form of interactions that are not relevant to the particular biological context under study, while some relevant interactions may also be missing.
To investigate the effect of noise in the PKN on the optimization method, we generated five PKNs (PKN 2 to 6 in Additional file 4) by adding 10, 20, 30, 40, and 50 % of noise to the ideal PKN, as defined in the Methods section. For each PKN, we started 500 independent runs of the optimization method and kept the best network (with minimal value of the fitness function) obtained with each run. Among the 500 resulting networks, only the 50 with minimal values of the fitness function were kept as model networks. The left panel of Fig. 8 presents the distribution of fitness function values and s_all scores as a function of the PKN used as input (horizontal axis) for the resulting model networks (blue boxes), for random sub-networks of the PKNs used as input (grey boxes) and for the PKNs themselves (black lines). The values of the fitness function obtained with model networks increase (worsen) as the fraction of noise in the PKN increases, with the median f_T reaching f_T ≃ 0.015 with 50 % noise. The number of nodes and edges also increases with noise, suggesting that it is not possible to find simple networks that are able to reproduce the training set experiments when too many interactions are missing from the PKN.
As expected, the predictive power of model networks (s_all score, lower left panel) decreases with increasing noise in the PKN. However, the median s_all score is always above 0.798, even with 50 % noise in the PKN. In addition, the distribution of s_all scores is always significantly higher for model networks (blue) than for random networks (grey) and the PKN (black). The decrease of s_all scores with increasing PKN noise suggests that our method makes good use of the information contained in the PKN, and that a carefully constructed PKN can increase the quality of the resulting model networks. The fact that model networks always have better s_all scores than the PKNs also shows that our method is rather tolerant to errors in the PKN. Indeed, even with 50 % noise in the PKN, our method is able to output model networks that have more predictive power than the input PKN. However, if the PKN is known to be very noisy, it could be worth considering other methods that are not based on prior knowledge.
Training set quality
In the previous section we used an ideal training set obtained by measuring the response of the gold standard network to perturbations. To evaluate the robustness of the optimization method to errors in the training set, we generated five training sets by adding 10, 20, 30, 40, and 50 % of errors to the ideal training set (training sets 1 with 10 to 50 % error in Additional file 2), as described in the Methods section.
For each training set, we generated 50 model networks from 500 independent runs by following the procedure described in the previous sections. The centre panel of Fig. 8 presents the distribution of fitness function values and s_all scores as a function of the fraction of errors in the input training set for the resulting model networks (blue boxes), for random sub-networks of the PKN (grey boxes) and for the PKN (black lines).
The values of the fitness function dramatically increase (worsen) when the number of errors in the training set is increased. In particular, while model networks obtained with the ideal training set always perfectly reproduce all experiments in the training set (f_T = 0), about 9.9 % of the training set cannot be reproduced by model networks when the training set contains 10 % of errors (median f_T ≃ 0.099). When the training set contains 50 % of errors, f_T reaches a median value of 0.327, which means that model networks fail to reproduce about 33 % of the training set. This suggests that it is not possible to find a model network, built as a sub-graph of the PKN, that is able to reproduce all the errors in our training set.
As expected, the predictive power of model networks (s_all score, lower centre panel) decreases when increasing the fraction of errors in the training set, but this decrease is moderate. For instance, s_all scores are not significantly lower with 10 % of errors in the training set (median s_all ≃ 0.969) than with the ideal training set (median s_all ≃ 0.970). Moreover, for all training sets with up to 30 % of errors, s_all is clearly higher for model networks (blue) than for random sub-networks of the PKN (grey) and for the PKN (black). Only when the training set contains 50 % of errors does the median s_all score of model networks become lower than the s_all score of the PKN. However, even in this case, the median predictive power of model networks is still above 80 %. The moderate decrease of predictive power, together with the rapid increase of fitness function values when the number of errors in the training set increases, suggests that our optimization method is not greatly affected by overfitting. The use of a PKN contributes greatly to this result by drastically reducing the number of parameters in the model.
Training set size
In the previous experiments we used a comprehensive training set that included the states of all 14 essential nodes measured before and after over-expression of TNF, FASL or the combination of TNF and FASL, for seven mutant versions of the gold standard network (training set 1 in Additional file 2). To study the effect of reducing training set coverage on the optimization method, we generated two smaller training sets (see Methods section): a “medium” training set, which contains data for the wild type only (training set 2 in Additional file 2), and a “small” training set, containing only the initial physiological state (training set 3 in Additional file 2). Clearly, the small training set does not contain enough information to properly infer a model network, and it was used to study the behaviour of the optimization method in this limit. Following the same procedure as in the previous sections, for each training set we used the optimization method to generate 50 model networks out of 500 independent runs. The resulting distributions of fitness function values and s_all scores are shown in the right panel of Fig. 8 (blue boxes) as a function of the training set used as input (horizontal axis), together with fitness function values and s_all scores measured on random sub-networks of the PKN (grey boxes) and directly on the PKN (black lines). Although model networks always perfectly reproduce all experiments in the training sets (f_T = 0, upper right panel), other components of the fitness function tend to improve when the size of the training set decreases. In particular, the number of nodes reaches its optimal value (N_nodes = 14) for the medium and small training sets, while networks obtained with the small training set have more edges than networks obtained with the large training set. This is because the smaller training sets impose fewer constraints on the networks, leaving more freedom to optimize the remaining components of the fitness function.
Although the predictive power of model networks decreases with training set size (s_all score, lower right panel), it remains significantly higher than that of random sub-networks of the PKN (grey) and the complete PKN (black) for all training sets. Interestingly, while the small training set contains only the states of the 14 essential nodes in one attractor of the unperturbed gold standard network, the resulting model networks still have a median s_all of 0.937, which means that approximately 94 % of the 1624 node states measured after all single node perturbations are predicted correctly.
To summarize, although the predictive power of model networks decreases with the quantity of information contained in the training set, our optimization method was still able to produce networks with significantly improved predictive power compared to the input PKN and random sub-networks of the PKN when using reduced training sets.
Fitness function: maximizing versus minimizing number of edges
The choice of fitness function is an essential ingredient of network optimization methods. While the first two components of our fitness function (f_T and N_ess.nodes) can be easily understood, our choice of third and fourth components (N_nodes and N_edges) may not be so obvious. Indeed, these last two components were chosen to first minimize the number of nodes and then maximize the number of edges (among networks with the same number of nodes), whereas a more usual choice of regularization consists of minimizing the number of edges.
To motivate our choice of fitness function, we used our optimization method with the modified fitness function F = (f_T, −N_ess.nodes, N_edges), which minimizes the number of edges and does not constrain the number of nodes. For each input data set discussed in the previous sections, 50 model networks were generated out of 500 independent runs. Due to the slower convergence of the genetic algorithm with this fitness function, we had to use 100 replicas instead of 50 to obtain model networks with comparable fitness function values. The resulting distributions of s_all scores are shown in Fig. 9 (red boxes), together with the s_all scores obtained previously by minimizing the number of nodes and maximizing the number of edges (blue boxes). In addition, s_all scores measured on random sub-networks of the PKN (grey boxes) and directly on the PKN (black lines) are shown for comparison. In the following, we will abbreviate the approach of “maximizing the number of edges while minimizing the number of nodes” as simply “maximizing the number of edges”; the additional constraint on the node number has to be kept in mind.
Clearly, the predictive power of model networks obtained by minimizing the number of edges is less sensitive to noise in the PKN used as input (left panel). With no or low noise, maximizing the number of edges is clearly the better strategy, as the s_all scores are significantly lower when minimizing the number of edges. When PKN noise increases, both approaches give comparable results, with slightly, although not significantly, better s_all scores obtained by minimizing the number of edges with 30 and 40 % noise.
Similarly, when the training set contains errors (centre panel), model networks obtained by maximizing the number of edges tend to have higher predictive power. Although the difference is more pronounced for low fractions of errors in the training set, median s_all scores obtained by maximizing the number of edges are always higher than those obtained by minimizing the number of edges.
The advantage of maximizing the number of edges becomes more striking when decreasing the size of the training set used as input (right panel). Indeed, while networks obtained by maximizing the number of edges have a median s_all score that slowly decreases from 0.970 to 0.937, networks obtained by minimizing the number of edges always have significantly lower s_all scores, reaching a median of 0.778 with the small training set. More importantly, although networks obtained by minimizing the number of edges are significantly better than random sub-networks of the PKN and the PKN itself when using the large training set, they are not much better than the PKN when using the medium training set, and significantly worse than the PKN when using the small training set, with a median s_all score of 0.778 barely better than the median s_all score of 0.774 for random sub-networks of the PKN.
These results can be understood by realizing that maximizing the number of edges generates networks that are as close as possible to the PKN, removing only edges that are in contradiction with the training set, and that therefore use as much of the information contained in the PKN as possible. This is particularly interesting in the limit of small training sets, when the information contained in the training set is not sufficient to properly infer a model network. In this limit, in addition to the training set, our method heavily uses the information contained in the PKN, and therefore generates model networks with reasonably good predictive power. By contrast, minimizing the number of edges in the limit of small training sets results in over-simplified model networks that are not able to properly predict anything other than what they were trained for.
This reasoning also explains why maximizing the number of edges is a better strategy when the quality of the PKN is good, and becomes less attractive in the limit of a very noisy PKN. What is interesting, and somewhat unexpected, is the fact that maximizing edges does not give significantly worse results than minimizing edges in the limit of very noisy PKNs.
The main motivation behind minimizing the number of edges is usually based on Occam’s razor, i.e. the principle of parsimony. However, this approach can lead to oversimplification of the resulting model, for instance by removing alternative pathways that are necessary to ensure the known robustness of regulatory networks. The idea behind maximizing the number of edges is to keep this robustness and to reflect the complexity of biological networks. For a node with various documented biological functions, it seems artificial to reduce that node to only one of them, assuming that the input PKN was built carefully.
Combined predictions
In general, one should not expect the optimization problem discussed here to have a unique solution (network). If the training set does not contain enough information to completely constrain the problem, then multiple solutions may be possible. If essential interactions are missing from the PKN, then no sub-network of the PKN (model network) may be able to exactly reproduce all experiments in the training set (f_T = 0), and a potentially large number of solutions with equally good f_T > 0 may be possible. Even when the optimization problem does have a unique optimal solution, heuristic optimization methods, like the genetic algorithm that we use, might be unable to find it, and instead output multiple sub-optimal networks with similar fitness function values. For all these reasons, a large set of model networks may be equally good solutions to the optimization problem.
This multiplicity of model networks is not in contradiction with a unique network describing the underlying biological system (in this work: the gold standard network), but only reflects our lack of knowledge about the system. To summarize all these solutions, a common practice is to generate an average network as the union of all model networks, with a weight attached to each edge, such as the number of model networks in which it appears. However, unless all model networks are very similar, this consensus network is of limited interest, since it does not retain the topology (feed-back loops, connectivity) and, more importantly, the dynamical behaviour (attractor reachability graph) of the original model networks. Instead, we propose to summarize our results at the level of the predictions, by measuring not only the average but also the variability of the node states predicted by all model networks. Intuitively, if a node state prediction varies a lot across networks, this could be due to a lack of information in the PKN and training set, and therefore could be correlated with prediction errors.
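The weighted-union consensus network mentioned above can be sketched as follows; representing each model network as a plain set of directed edges is a simplification of the actual Boolean model representation.

```python
from collections import Counter

def consensus_network(model_networks):
    """Union of all model networks, weighting each edge by the number of
    model networks in which it appears.

    Each network is given as a set of (source, target) edges.
    """
    weights = Counter()
    for edges in model_networks:
        weights.update(edges)
    return dict(weights)

# Three toy model networks sharing some edges (node names from the paper,
# edge choices illustrative):
nets = [
    {("TNF", "CASP8"), ("FASL", "CASP8")},
    {("TNF", "CASP8")},
    {("TNF", "CASP8"), ("FASL", "CASP8")},
]
assert consensus_network(nets) == {("TNF", "CASP8"): 3, ("FASL", "CASP8"): 2}
```

As the text notes, such edge weights summarize topology only; they do not preserve the dynamical behaviour of the individual networks.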
To check whether prediction variability was correlated with prediction errors, we summarized the predictions of the 50 model networks obtained previously with each input data set (PKN and training set). We used the same predictions that were used to measure the s_all score (see Methods section), i.e. the average states reached by a network after each single node perturbation of essential nodes, starting from each attractor of the gold standard network. For each input data set (PKN and training set), single node perturbation and initial state (gold standard network attractor), we evaluated the average and variance of the 50 average states predicted by the 50 model networks. We also evaluated the corresponding prediction error by measuring the absolute difference between the average of the model networks' predictions and the average states reached by the gold standard network. This procedure led to 21,112 error versus variance measurements: one for each of the 14 essential nodes, with 29 single node perturbations (unperturbed, 14 nodes forced to 1, 14 nodes forced to 0), 4 attractors of the gold standard network and 13 input data sets (more details on this procedure are given in Additional file 1).
The resulting distribution of error versus variance is shown in Fig. 10. This figure shows a strong correlation between error and variance, confirmed by a Spearman’s rank correlation coefficient of ≃ 0.96. This result suggests that the prediction error, which is usually unknown, can be (approximately) estimated by measuring the prediction variance provided that not only one, but multiple model networks are kept as solutions of the optimization problem.
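The error-versus-variance analysis can be sketched as follows; a simple tie-free rank correlation built from NumPy stands in for a full Spearman implementation, and the shapes and names are illustrative assumptions.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation for tie-free data: Pearson correlation
    of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx**2).sum() * (ry**2).sum()))

def error_vs_variance(predictions, gold):
    """Per prediction target: variance across model networks, absolute
    error of the mean prediction against the gold standard, and their
    rank correlation.

    `predictions` has shape (n_networks, n_targets); `gold` has shape
    (n_targets,).
    """
    predictions = np.asarray(predictions, dtype=float)
    gold = np.asarray(gold, dtype=float)
    variance = predictions.var(axis=0)
    error = np.abs(predictions.mean(axis=0) - gold)
    return variance, error, spearman_rho(variance, error)
```

On toy data where spread and error grow together, e.g. two "networks" predicting `[[0.0, 0.2, 0.0], [0.0, 0.4, 1.0]]` against a gold standard of `[0, 0, 0]`, the rank correlation is 1; the paper reports ≃ 0.96 on the real 21,112 measurements.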
Ideally, the variance should be estimated from the full population of networks that are solutions of our optimization problem, or at least from a uniformly sampled subset of the solutions. However, due to the heuristic approach used here, our optimization method cannot, in principle, generate an exhaustive list of solutions or sample the set of solutions uniformly. The strong correlation observed in Fig. 10 is therefore an interesting and non-trivial result, which shows that despite these limitations, our optimization method samples the space of solutions sufficiently well.
Related work
A large number of methods focusing on the problem of Boolean network inference have been published. In the following we consider only methods that are similar to ours in the sense that they combine experimental data, against which Boolean networks are trained, with prior knowledge that reduces the size of the search space. These methods are classified according to the type of data used in the training set.
Several methods require training sets in the form of time series of experimental measurements. REACT [13] and CellNopt [14] use evolutionary algorithms to train networks against time series obtained under various network perturbations. Both methods only consider networks with synchronous dynamics. By assuming that consecutive measurements in the time series correspond to successive states of the network under synchronous dynamics, Breindl et al. [4] can use linear programming to solve the network optimization problem. BoolNet [15] implements the Best-Fit Extension [16] and REVEAL [17] algorithms to reconstruct networks from time series obtained under various network perturbations, and can use synchronous as well as asynchronous dynamics. RE:IN [18] uses a Satisfiability Modulo Theories (SMT) solver to exhaustively find all networks that can exactly reproduce the training set, which consists of time series obtained under various network perturbations. Only synchronous dynamics is implemented in this method.
Other methods focus on early responses to perturbations and only consider a small subset of two time points in each time series. Only the initial measurement and a second carefully chosen time point reflecting the initial response to the perturbation are kept in the training set. In addition to time series discussed above, CellNopt [14] also implements a method based on a genetic algorithm to train networks on the early response to perturbations, assuming synchronous dynamics of the networks. Guziolowski et al. [19] and Videla et al. [5] use answer set programming to enumerate all networks that exactly reproduce the initial response to perturbations using synchronous dynamics, but also all suboptimal networks within a user defined tolerance.
Another popular approach is to use a training set containing stable phenotypes assumed to be steady states (attractors with one state) of the network. These training sets contain only equilibrium properties of the network and lack information on the dynamics of the network. XPRED [9] and PRUNET [10] use evolutionary algorithms to find networks that have steady states as close as possible to the phenotypes given in the training set. Knapp and Kaderali [3] reformulated the problem of finding networks with specific steady states obtained under various perturbations as a linear programming problem. A runtime comparison with these methods is given in Additional file 5.
Our optimization method contrasts with the aforementioned approaches in a number of ways. First, our method combines information on both the dynamics (time series) and equilibrium properties (steady states) of the networks, while the aforementioned approaches use only one of these types of information. Indeed, our training set can contain measurements performed on stable phenotypes that are assumed to correspond to attractors of the networks, together with measurements on their reachability upon perturbation of the network. Second, our method uses asynchronous dynamics, which is usually considered more relevant for the description of biological networks than synchronous dynamics [20]. By contrast, all methods discussed above, except BoolNet, use synchronous dynamics. Finally, our method attempts to maximize the number of edges in order to stay as close as possible to the input PKN, while only Videla et al. [5] suggest that minimizing the number of edges may not be the optimal approach.