Model averaging strategies for structure learning in Bayesian networks with limited data
© Broom et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Skip to main content
© Broom et al; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Considerable progress has been made on algorithms for learning the structure of Bayesian networks from data. Model averaging by using bootstrap replicates with feature selection by thresholding is a widely used solution for learning features with high confidence. Yet, in the context of limited data many questions remain unanswered. What scoring functions are most effective for model averaging? Does the bias arising from the discreteness of the bootstrap significantly affect learning performance? Is it better to pick the single best network or to average multiple networks learnt from each bootstrap resample? How should thresholds for learning statistically significant features be selected?
The best scoring functions are Dirichlet Prior Scoring Metric with small λ and the Bayesian Dirichlet metric. Correcting the bias arising from the discreteness of the bootstrap worsens learning performance. It is better to pick the single best network learnt from each bootstrap resample. We describe a permutation based method for determining significance thresholds for feature selection in bagged models. We show that in contexts with limited data, Bayesian bagging using the Dirichlet Prior Scoring Metric (DPSM) is the most effective learning strategy, and that modifying the scoring function to penalize complex networks hampers model averaging. We establish these results using a systematic study of two well-known benchmarks, specifically ALARM and INSURANCE. We also apply our network construction method to gene expression data from the Cancer Genome Atlas Glioblastoma multiforme dataset and show that survival is related to clinical covariates age and gender and clusters for interferon induced genes and growth inhibition genes.
For small data sets, our approach performs significantly better than previously published methods.
In the last ten years there has been a great deal of research published on learning Bayesian networks from data [1–4]. Most of the work in structure discovery in Bayesian networks has focused on designing computationally tractable procedures for searching the space of networks, and on devising metrics for scoring networks during search. However, the problem of determining the quality and robustness of learned structures in the context of limited data remains largely open. Structure learning algorithms are known to be unstable — small changes in training data can cause large changes in the learned structures. As Bayesian network learning begins to find serious application in biology , there is an increasing need for learning protocols that can not only discover network features, such as edges and Markov blankets, from small amounts of data, but can also determine an accurate confidence estimate for those features. In many biological and clinical settings, the size of the dataset may be severely limited, either by high cost or by the limited number of cases from which data can be collected (for instance, people affected by a rare disease) . In such settings, it is worthwhile to use substantial amounts of computation in order to achieve as accurate a result as possible. Model averaging using bootstrap replicates (that is, bagging) with feature selection by thresholding  is a widely used solution . Unfortunately, there is no established method for determining feature-selection thresholds. There is no prescription for the choice of bootstrap technique: the ordinary non-parametric bootstrap , a bias-corrected version of the non-parametric bootstrap , or the more continuous Bayesian bootstrap . Whether the choice of bootstrap method interacts with the choice of scoring function is unknown, as is the number of bootstrap replicates needed to achieve a certain level of confidence in the final model.
In this paper, our goal is to understand the interactions between choice of bootstrap method and scoring function, size of training data and number of bootstrap replicates as well as the choice of feature-selection threshold on averaged models. Unlike linear regression, for which closed form expressions for the performance of model-averaging procedures can be derived , our model class is too complex to permit the analytical derivation of quality guarantees. We therefore use experiments on two moderately sized Bayesian networks with known structure (ALARM and INSURANCE) to systematically study the effects of varying the bootstrap technique, the amount of training data, the number of bootstrap replicates, as well as the scoring function for network learning. We design a variant of permutation testing to automatically select thresholds for feature selection in bootstrap averaged (bagged) models and construct a loose upper bound for the number of bootstrap replicates needed for stable estimates of feature probabilities.
where we weight each feature in proportion to the posterior probability of the structure in which it occurs.
For large networks, the direct estimation of is intractable. Structure learning algorithms approximate the computation in Equations 1 or 2 by only considering high scoring structures encountered during search. These structures tend to be quite similar to one another, as evidenced by the fact that feature (such as, edge) probabilities computed from them are usually close to 0 or 1. Computing feature probabilities by MCMC sampling over structures drawn from the posterior distribution has been proposed by Madigan and York  and refined by Giudici et al. . These techniques do not scale to large networks, and have been shown to be slow-mixing .
The practical application of this algorithm, particularly in the context of limited data requires answers to many questions. Specifically:
How many bootstrap resamples M are required to obtain stable estimates of the feature probabilities in bagged models?
Which scoring functions F s work best with bagging? Friedman et al.  only used BDe as a scoring function in their bootstrap experiments. Hartemink et al.  offer experimental evidence that that BDe outperforms MDL/BIC in a bagging context. It is an open question how scoring functions such as DPSM compare to BDe.
Is the bias in bagged Bayesian networks arising from the discreteness of the bootstrap method  a significant problem? If so, would a continuous version of bootstrap, like the Bayesian bootstrap  be a better choice for bootstrap resampling than the ordinary discrete bootstrap?
How do we select thresholds for learning statistically significant features over averaged models? Does the choice of threshold depend on the scoring function, problem domain, sample size, and bootstrap resampling technique?
Should a single high scoring structure be learned from a bootstrap replicate (as shown in Algorithm 1), or an averaged ensemble of high scoring structures (double averaging, as shown in Figure 2)?
The paper is organized as follows. In the next section, we review the problem of learning the structure of Bayesian networks from data, and describe the state of the art in scoring functions, search algorithms, and bootstrapping techniques. In the following section, we present an extensive experimental study that explores the space of model-averaging strategies in the context of the ALARM network. We then describe our technique for automatically selecting a probability threshold for feature selection. Our threshold selection criterion is designed to minimize the number of false positive features in bagged models. We conclude with a characterization of an effective model-averaging strategy in contexts with limited data.
where r i is the number of values that X i takes, and q i is the number of values that the parents of X i in take. N ijk is the number of instances in in which variable X i has its k th value, and its parents take their j th value, and .
where is the Dirichlet distribution order for variable i with value k and parent value j, and . When the Dirichlet orders for all sets of parameters are set to a constant λ, the score is called DPSM . When we assume , it is called the Bayesian Dirichlet (BDe) metric . The choice of is critical, particularly for small data sets. If these parameters are large, making dominate the N ijk values, the available data has less influence in determining the space of structures explored.
where r i is the number of values that variable X i can take, and q i is the number of values that the parents of X i in can take. Direct maximization of yields overfitted models, because the likelihood of the data given and Θ can be increased by increasing the complexity of (that is, by adding an edge). The second term can be seen as a penalty term to compensate for model complexity. Finding a structure with the highest BIC score is equivalent to finding a structure with the highest approximate posterior probability given the data D. As the sample size N → ∞, the probability that the BIC criterion selects the true model approaches 1. However, with finite sample sizes, BIC tends to select models that are much simpler than the true model, due to the penalty term.
Minimizing the MDL scoring function yields a model with the shortest description length. If we ignore the description length of the graph, , the MDL score equals the BIC score with reversed sign. The MDL scoring function favors graphs in which nodes have fewer parents, because it reduces the encoding length of Θ and .
To summarize, scoring functions belong to one of two families. Bayesian scoring functions, of which K2, DPSM, and BDe are examples, represent the posterior probability of a structure given data . The second family consists of complexity-penalized likelihood measures, of which BIC and MDL are examples. MDL can also be justified from an information-theoretic perspective. These scoring functions embody different assumptions about priors on the space of structures as well as priors on parameter distributions given the structures. There are few systematic studies in the literature to guide choice of scoring functions for specific applications. Below we show a comparative analysis of the performance of these scoring functions in the context of limited data.
The problem of learning a network which maximizes is usually posed as a combinatorial optimization problem. It is known to be NP-complete , so local search algorithms guided by scoring functions are used to find approximate solutions.
The local search algorithm used in our experiments is greedy hill-climbing with randomized restarts and Friedman’s sparse candidate algorithm  with k = 6 (maximum number of parents per node). Every network explored by the algorithm during the search is recorded, and no network is ever considered twice. Initially, the k most likely parents for each node, the candidates, are selected by scoring all networks with a single edge between two nodes. Denote by K the set of all networks in which the parents of every node belong to the set of k candidate parents for that node. A list of starting networks containing all networks in K with up to two edges is then generated. A starting network is picked at random from this initial list. From this starting network, all neighboring networks in K that have not been considered before and which differ by an additional edge, one less edge, or a reversed edge are evaluated. The highest scoring neighboring network, if its score is higher, replaces the current network. The search is continued until no new networks are generated or all generated networks score less than the current network. New sets of k candidate parents are then generated following Friedman’s algorithm, and the search is continued. New candidate parents sets are picked until a previously seen set of candidate parents is revisited, or C = 10 different candidate parent sets have been considered. Such searches starting from a randomly picked member of the initial network list are performed a total of M 1 = 25 times. Another M 2 = 25 such searches are performed starting from a network chosen randomly from all of those seen during the first M 1 searches. Network features are determined from the highest scoring network(s) visited over all searches.
The bootstrap is a general tool for assessing statistical accuracy of models learned from data. Suppose we have a data set , where each x ( j ) is a vector of size n drawn from the cross product of the domains of variables X 1,…, X n . The basic idea is to randomly draw datasets with replacement from , with each sample the same size as the original set, that is, N. This is done B times, producing B bootstrap replicates, as shown on the left in Figure 2. We learn Bayesian networks from each bootstrap resample. We can then analyze the networks generated over the B resamples; for example, producing means and variances on frequencies of structural features.
Each example in is represented between 0 and B times among the bootstrap resamples. Thus one can can think of the standard bootstrap procedure as assigning each example in an integer weight drawn from a multinomial distribution, representing its number of occurrences in the B resamples. The probability of not including a specific example in a resample is about 1/e ≈ 37%. Since an example contributes to the count N ijk in the scoring function; dropping examples biases the counts, and the structures that are learned from them . Whether the discreteness of the bootstrap leads to the generation of more complex graphs as claimed in , with more false positive features, or whether it leads to structures with missing edges (false negative features) because of undersampling of weaker relationships in the data, is an open one.
An alternative approach is to use the Bayesian bootstrap . It is a Bayesian resampling procedure that is operationally similar to the ordinary non-parametric bootstrap. In the Bayesian bootstrap, examples are assigned continuously varying weights drawn from a Dirichlet distribution. Not surprisingly, the Bayesian bootstrap procedure has a Bayesian interpretation. Assume that examples are drawn from some unknown multivariate distribution P(X), and that we have no specific priors on that distribution. The uninformative prior on P combined with the multinomial sample likelihood yields, via Bayes Theorem, a Dirichlet posterior distribution on the fraction of the original population that each sampled example represents. The ensemble of Bayesian bootstrap resamples, and the distribution of statistics derived from them, can be viewed as samples from a Bayesian posterior distribution. The continuously varying weights of the Bayesian bootstrap ensure that there is a vanishingly small chance of assigning a zero weight to any example in a resample. Thus, all of the inter-relationships between examples are preserved in the resample, but in reweighted form. Statistics derived from these resamples do not embody the bias introduced by the discreteness of the regular bootstrap.
We now turn to the question of the number of bootstrap resamples needed to form accurate estimates of structure feature probabilities.
Even for a large p, for instance 0.9, this works out to be 900 resamples.
In practice, our resample feature probabilities are sometimes not exactly 0 or 1, so we expect the above to be slight overestimates.
To understand how different bagging methods, scoring metrics, and their parameters affect learning performance, we conducted an extensive simulation study using the well known ALARM network. This network contains 37 discrete nodes, with two to four values per node, and 46 edges. We drew a master dataset containing 100,000 data points from the ALARM network’s joint probability distribution.
To evaluate the effect of different strategies and parameters on learning performance, we extract a subset from that master dataset and apply the network learning strategy being tested. In all cases, the core learning strategy is the greedy hill-climbing algorithm with random restarts described above. We compute posterior probabilities of network features, such as edges, by determining how often each such feature occurs in the best single network learnt from each bootstrap resample. A slight variation is double averaging, which first computes feature probabilities for each resample by (Bayesian) averaging of multiple high-scoring networks learnt for that resample, and then averages these probabilities across resamples. In either case, the bagging process produces estimates of posterior probabilities for edge features. Given a threshold 0 ≤ t ≤ 1, we will call an edge e present if . An edge is a true positive if it is present in the known ALARM network. An edge is labeled a false positive if its posterior probability exceeds t, and it is absent in the reference network. An edge is a true negative if its posterior probability is less than t and it is absent in the reference network. Edges with posterior probabilities less than t that are present in the reference network are false negatives. We do not consider edge direction in our assessments; an edge is defined to be present if the sum of probabilities of occurrence in both directions exceeds the threshold t. We report the number of false positives and false negatives as a function of the threshold t and use it to assess the quality of models learned. Unless otherwise stated, the learning performance we report are averages obtained by applying the above process to 60 independent subsets of the master dataset.
Below we offer a possible explanation for this result. Ordinary bagging and Bayesian bagging appear to have very similar performance, with Bayesian bagging outperforming ordinary bagging in terms of the number of structural errors for thresholds above 0.72. Since this threshold range is likely to be the range of greatest interest, especially in the context of limited data, we believe that Bayesian bagging is a better choice for structure learning of Bayesian networks.
As λ values are lowered, the edge density of the Bayesian averaged graphs produced from each bootstrap resample increases. Our results suggest that it is a good strategy to learn overfitted models from each bootstrap resample, and to let bagging average out feature probability estimates of false positive edges to zero. The benefit of learning overfitted models is the reduction in the number of false negative edges. This explains why applying a bias correction factor to the scoring metric (see above) reduces learning performance. While bias correction ensures that false positive features are minimized, it increases the number of false negative features because of its conservatism in adding edges. Thus the total number of structural errors is higher with bias corrected scoring functions than with DPSM with λ = 0.1.
Our experimental results suggest that when data is limited, promiscuous scoring functions (such as DPSM with λ < 1) are very effective when combined with Bayesian bagging over a sufficiently large number of bootstrap resamples. Both the false positive and false negative rates decrease as a result of investment of additional computational resources, and not additional data.
When we attempt to learn a network structure using the above bagging procedure, the edge frequencies obtained will range between 0 and 1. We need a global threshold on edge frequencies to determine whether or not an edge is to be included in the final model. The optimal threshold to use is not constant, but depends on the specific dataset being analyzed. Consequently, in practice, when learning a network model from data, both the true model and the optimal threshold to use are unknown. We need a method to estimate the threshold that minimizes the number of structural errors made (false positive and false negatives).
Given randomly permuted (null) data, our bagging process will compute edge frequencies. Our hypothesis is that the edge frequencies of incorrect edges are similar to the edge frequencies obtained from randomly permuted data. Consequently, the number of edges found in a randomly permuted data set that are above a particular threshold will be indicative of the number of incorrect edges above that threshold in the network learnt from the unpermuted data. Our permutation test determines the likely number of incorrect edges above a particular threshold by averaging the number of edges above that threshold across 60 random permutations of the data. If the data is in fact random, this is clearly a reasonable estimate. For non-random data, it appears that the number of edges obtained from the permuted data overestimates the number of incorrect edges (see figure 11). Consequently, we believe that the number of edges in permuted data is a conservative but reasonable estimate of the number of incorrect edges.
To determine the generality of the above results we performed a similar study using the INSURANCE network. As was the case for the ALARM network, learning performance using DPSM improves as λ decreases, but in this case the best performance was obtained using BDe.
We use bagged gene shaving  to extract consensus gene clusters. Clusters containing genes with an obvious common biological function were also given a descriptive name. To generate a Bayesian network, we trichotomized the signed mean gene for each cluster into low, medium, and high values. We included additional nodes for clinical covariates of interest, including age, gender, whether chemotherapy or raditional was given and survival. We dichotomized survival into short and long survivors at 12 months, but marked as not available all censored survival times shorter than 12 months.
According to our analysis, survival is linked most closely with two clinical covariates, age and gender, and two gene clusters, a cluster of interferon induced genes and a cluster of growth inhibition genes.
We explored the space of model-averaging strategies in contexts with limited data with the goal of robustly learning large Bayesian networks. Our results follows.
1. Is bias-correction a good idea? In contexts with limited data, Bayesian bagging with scoring functions that permit the learning of overfitted models is a better strategy than using model-averaging with conservative scoring functions. In our experiments, bias-corrected scoring functions with Bayesian bagging have higher number of structural errors, both in terms of false positives (incorrect features) and false negatives (missing features).
2. Which is better: Bayesian bagging or ordinary bagging? Bayesian bagging yields models with more features than ordinary bagging. At high feature selection threshold values, Bayesian bagging yields better models, especially when sample sizes are small. As sample sizes increase, the two approaches are nearly indistinguishable in performance as measured by the number of false positives and false negatives. In limited data contexts, Bayesian bagging is superior to ordinary bagging.
3. How many bootstrap resamples are needed?: Our experiments show that for features whose true probability is close to the threshold for inclusion in the final model, up to 2500 bootstrap samples are needed for robust estimation. Previous studies [7, 10] have used one to two hundred bootstrap resamples. Noise features (false positives) are averaged out by bagging over such a large ensemble. This improvement comes not at the expense of false negatives, which are also reduced by the protocol.
4. Which class of scoring functions yield the most accurate models?: For the ALARM network, the DPSM scoring metric with small values of λ performed best, whereas for the INSURANCE network the BDe metric performed best. What both of these metrics have in common is that they very readily include additional edges into the network. A promiscuous scoring metric that produces overly complex models works well in the context of bagging because it is more likely to include true edges with weak support in each resample, hence significantly reducing false negatives, while the bagging method effectively eliminates the false positives introduced by the scoring metric.
5. How is a feature selection threshold to be determined?: We developed a permutation based method to compute significance thresholds for feature selection in bagged graphs. In the absence of a true model, the threshold gives us a bound on the number of false positive features in the final graph.
6. How many data samples are needed?: Dataset size is a key determiner of performance of the learning methods. Our model averaging strategy with bayesian bagging, and the DPSM scoring function (λ = 0.1) with feature selection using a threshold estimated by permutation testing, outperforms existing methods and yields excellent models even with samples sizes as low as 125 for the ALARM network. This result suggests that our model averaging approach is suitable for biological applications such as learning regulatory genetic networks, characterized by several tens of nodes and training data sets of the order of a few hundred samples.
One of the open questions is a theoretical characterization of the relationship between model complexity and sample size for Bayesian networks. In this paper, we characterized this relationship empirically for two well-known networks of moderate complexity. Our model averaging strategy learns more accurate models than other approaches when the amount of data is limited relative to the complexity of the model being learned. In future work we plan to explore more networks in biological applications, and refine our protocol for learning high confidence models with limited data.
Aikake Information Criterion
Bayesian Dirichlet metric
Bayesian Information Criterion
Dirichlet Prior Scoring Metric
Markov Chain Monte Carlo
Minimum Description Length
The Cancer Genome Atlas
Uniform Prior Scoring Metric.
This research was supported in part by NIH/NCI grant R01 CA133996 and the Gulf Coast Center for Computational Cancer Research.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1