### Learning Bayesian networks from data: scoring functions

The problem of learning a network from data is normally posed as an optimization problem: given a set of instances *D* drawn from a multivariate joint probability distribution *P*(**X**), find a network *G* which maximizes the posterior probability of the network given the data:

$$ P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)} \propto P(D \mid G)\, P(G). $$

The prior over network structures has two components: a discrete probability distribution *P*(*G*) over graph structures, and, for each possible graph, a density measure *P*(Θ | *G*) over possible values of the parameters Θ. The simplest, and most common, choice for both the prior over graphs and the parameter distribution is the uniform distribution. Assuming that the data is sampled *i.i.d.*, the likelihood of the data given a network (*G*, Θ) is

$$ P(D \mid G, \Theta) = \prod_{j=1}^{N} P(\mathbf{x}^{(j)} \mid G, \Theta). $$

The posterior probability *P*(*G* | *D*) is called a scoring metric or scoring function for Bayesian network learning algorithms. By marginalizing over all choices of the parameters Θ, we can write *P*(*D* | *G*) as

$$ P(D \mid G) = \int P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta. $$

The scoring function simplifies under the assumption of uniform structure and parameter priors to the following, called the K2 metric in [17], and the UPSM score in [18]:

$$ P(G \mid D) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! $$

where *r*_{i} is the number of values that *X*_{i} takes, and *q*_{i} is the number of values that the parents of *X*_{i} in *G* take. *N*_{ijk} is the number of instances in *D* in which variable *X*_{i} has its *k*^{th} value and its parents take their *j*^{th} value, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
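
As a concrete illustration, the per-variable factor of the K2 metric can be evaluated in log space directly from the *N*_{ijk} count table. The sketch below is our own, not from [17] or [18]; `k2_local_score` is a hypothetical helper, and it rewrites the factorials through `math.lgamma` ((*m* − 1)! = Γ(*m*)) to avoid overflow.

```python
import math

def k2_local_score(counts):
    """Log of the per-variable K2 factor.

    counts[j][k] holds N_ijk: the number of instances in which the
    variable takes its k-th value while its parents take their j-th
    configuration."""
    log_score = 0.0
    for row in counts:               # one row per parent configuration j
        n_ij = sum(row)              # N_ij = sum_k N_ijk
        r_i = len(row)               # number of values of X_i
        # log[(r_i - 1)! / (N_ij + r_i - 1)!], via (m - 1)! = Gamma(m)
        log_score += math.lgamma(r_i) - math.lgamma(n_ij + r_i)
        # log of prod_k N_ijk!
        log_score += sum(math.lgamma(n + 1) for n in row)
    return log_score

# Binary variable, two parent configurations, counts [[2, 1], [0, 3]]:
# the factor is (1!/4!) * 2! * 1! * (1!/4!) * 0! * 3! = 1/48.
assert abs(k2_local_score([[2, 1], [0, 3]]) + math.log(48)) < 1e-9
```

The full score multiplies one such factor per variable, so in log space the network score is the sum of these local terms.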

A generalization of the above scoring metric uses Dirichlet priors (*λ* ≥ 1) for the parameters. The parameter priors for each variable are considered independent of one another, an assumption called global parameter independence in [13]:

$$ P(\Theta \mid G) = \prod_{i=1}^{n} P(\Theta_i \mid G). $$

In addition, the parameter modularity assumption is made [19]. For two graph structures *G*_{1} and *G*_{2} such that the parents of node *X*_{i} are the same in both graphs, we have

$$ P(\Theta_i \mid G_1) = P(\Theta_i \mid G_2). $$

The graph structure prior *P*(*G*) is assumed to satisfy structural modularity; that is, it can be factored as a product of distributions over the possible parent sets *π*_{i} of each node *X*_{i} in the graph. Under these assumptions, the posterior probability of the graph structure can be written in fully factored form, one term per variable, as

$$ P(G \mid D) \propto \prod_{i=1}^{n} P(\pi_i)\, P(D \mid X_i, \pi_i). $$

The scoring function with these assumptions reduces to

$$ P(G \mid D) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\lambda_{ij})}{\Gamma(\lambda_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\lambda_{ijk} + N_{ijk})}{\Gamma(\lambda_{ijk})} $$

where *λ*_{ijk} is the Dirichlet distribution order for variable *i* with value *k* and parent value *j*, and $\lambda_{ij} = \sum_{k=1}^{r_i} \lambda_{ijk}$. When the Dirichlet orders *λ*_{ijk} for all sets of parameters are set to a constant *λ*, the score is called DPSM [18]. When the *λ*_{ijk} are instead set in proportion to the prior probability of the corresponding variable-parent configuration, the score is called the Bayesian Dirichlet (BDe) metric [17]. The choice of the *λ*_{ijk} is critical, particularly for small data sets: if these parameters are large enough to dominate the *N*_{ijk} counts, the available data has less influence in determining the space of structures explored.
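
In the constant-order (DPSM) setting, the per-variable factor of the Dirichlet score can be sketched the same way as the K2 factor; `bd_local_score` below is a hypothetical name of our own. A convenient sanity check falls out of the algebra: with *λ* = 1 the gamma ratios collapse to the factorials of the K2 metric.

```python
from math import lgamma, log

def bd_local_score(counts, lam):
    """Log Bayesian-Dirichlet factor for one variable, with a constant
    Dirichlet order lam for every parameter (the DPSM setting).
    counts[j][k] holds N_ijk."""
    log_score = 0.0
    for row in counts:               # one row per parent configuration j
        lam_ij = lam * len(row)      # lambda_ij = sum_k lambda_ijk
        n_ij = sum(row)              # N_ij
        log_score += lgamma(lam_ij) - lgamma(lam_ij + n_ij)
        log_score += sum(lgamma(lam + n) - lgamma(lam) for n in row)
    return log_score

# With lam = 1, lgamma(1 + n) - lgamma(1) is log(n!) and the leading
# ratio becomes (r_i - 1)! / (N_ij + r_i - 1)!: exactly the K2 factor,
# which equals 1/48 for this count table.
assert abs(bd_local_score([[2, 1], [0, 3]], 1.0) + log(48)) < 1e-9
```

Raising *λ* above 1 smooths the posterior toward uniform parameters, which is precisely the "priors dominating the counts" effect discussed above.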

There is another family of scoring functions based on maximizing the likelihood function *P*(*D* | *G*, Θ). The first among these is the BIC score, also known as the Schwarz criterion, which is derived from the Laplace approximation to the logarithm of *P*(*D* | *G*) ([20], page 64) and is defined by

$$ \mathrm{BIC}(G, D) = \log P(D \mid G, \hat{\Theta}) - \frac{d}{2} \log N $$

where $\hat{\Theta}$ is the maximum likelihood estimate of Θ given *G* and *D*, *d* is the number of free parameters in Θ, and *N* is the size of the training sample. The number of free parameters in Θ is

$$ d = \sum_{i=1}^{n} (r_i - 1)\, q_i $$

where *r*_{i} is the number of values that variable *X*_{i} can take, and *q*_{i} is the number of values that the parents of *X*_{i} in *G* can take. Direct maximization of the likelihood yields overfitted models, because the likelihood of the data given *G* and Θ can be increased by increasing the complexity of *G* (that is, by adding an edge). The second term, (*d*/2) log *N*, can be seen as a penalty term that compensates for model complexity. Finding a structure *G* with the highest BIC score is equivalent to finding a structure with the highest approximate posterior probability given the data *D*. As the sample size *N* → ∞, the probability that the BIC criterion selects the true model approaches 1. However, with finite sample sizes, BIC tends to select models that are much simpler than the true model, due to the penalty term.
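
The per-variable BIC contribution can likewise be computed from the count table. This sketch follows the same `counts[j][k] = N_ijk` convention used above; `bic_local_score` is a hypothetical name, and the network score is the sum of such terms over all variables.

```python
import math

def bic_local_score(counts, sample_size):
    """BIC contribution of one variable: maximized log-likelihood minus
    (d/2) log N, with d = (r_i - 1) * q_i free parameters.
    counts[j][k] holds N_ijk."""
    loglik = 0.0
    for row in counts:
        n_ij = sum(row)
        for n_ijk in row:
            if n_ijk > 0:
                # MLE of P(X_i = k | parents = j) is N_ijk / N_ij
                loglik += n_ijk * math.log(n_ijk / n_ij)
    q_i = len(counts)                # number of parent configurations
    r_i = len(counts[0])             # number of values of X_i
    d = (r_i - 1) * q_i
    return loglik - 0.5 * d * math.log(sample_size)

score = bic_local_score([[2, 1], [0, 3]], 6)
assert abs(score - (2 * math.log(2 / 3) + math.log(1 / 3) - math.log(6))) < 1e-9
```

Note how the penalty grows with both the parent-set size (through *q*_{i}) and the sample size, which is the source of BIC's bias toward simple structures on small data.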

The MDL, or Minimum Description Length, scoring function has its origins in data compression and information theory. The best model for a data set *D*, according to the MDL, is the one that minimizes the sum of the length (in bits) of the encoding of the model itself and the length of the encoding of the data given the model. When applied to Bayesian network structure learning, the MDL score is the sum of the description length of the graph *G*, the description length of the conditional probability parameters Θ given *G*, and that of the data *D* given *G* and Θ. The description length of the graph *G* is

$$ DL(G) = \sum_{i=1}^{n} |\pi_i| \log_2 n $$

where |*π*_{i}| is the number of parents of *X*_{i} in *G*, and *n* is the number of variables *X*_{i} in **X**. The description length of Θ is the product of the total number of independent parameters needed to specify each conditional probability table, and the number of bits needed to encode each of these parameters for a data set of size *N*:

$$ DL(\Theta \mid G) = \frac{d}{2} \log_2 N. $$

Finally, the number of bits needed to encode the data given the model is

$$ DL(D \mid G, \Theta) = -\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log_2 \frac{N_{ijk}}{N_{ij}}. $$
Minimizing the MDL scoring function yields a model with the shortest description length. If we ignore the description length of the graph, *DL*(*G*), the MDL score equals the BIC score with reversed sign (up to the change of logarithm base). The MDL scoring function favors graphs in which nodes have fewer parents, because fewer parents reduce the encoding lengths of both Θ and *G*.
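
The model half of the MDL score (graph plus parameters) is a simple function of the network shape and sample size. The sketch below assumes the formulas above; `mdl_model_length` is a hypothetical helper of our own.

```python
import math

def mdl_model_length(parent_counts, free_params, sample_size):
    """Model part of the MDL score, DL(G) + DL(Theta | G), in bits.

    parent_counts[i] -- |pi_i|, the number of parents of X_i;
    free_params      -- d, the total number of independent parameters;
    sample_size      -- N, the number of instances."""
    n = len(parent_counts)           # number of variables
    dl_graph = sum(p * math.log2(n) for p in parent_counts)
    dl_params = 0.5 * free_params * math.log2(sample_size)
    return dl_graph + dl_params

# Three binary variables in a chain X1 -> X2 -> X3: parent counts
# [0, 1, 1] and d = 1 + 2 + 2 = 5 free parameters.
length = mdl_model_length([0, 1, 1], 5, 100)
assert abs(length - (2 * math.log2(3) + 2.5 * math.log2(100))) < 1e-9
```

Adding a parent to any node increases both terms, which makes the preference for sparse graphs explicit.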

To summarize, scoring functions belong to one of two families. Bayesian scoring functions, of which K2, DPSM, and BDe are examples, represent the posterior probability *P*(*G* | *D*) of a structure *G* given data *D*. The second family consists of complexity-penalized likelihood measures, of which BIC and MDL are examples; MDL can also be justified from an information-theoretic perspective. These scoring functions embody different assumptions about priors on the space of structures as well as priors on parameter distributions given the structures. There are few systematic studies in the literature to guide the choice of scoring function for specific applications. Below we show a comparative analysis of the performance of these scoring functions in the context of limited data.

### Bootstrap and Bayesian bootstrap

The bootstrap is a general tool for assessing the statistical accuracy of models learned from data. Suppose we have a data set *D* = {**x**^{(1)}, …, **x**^{(N)}}, where each **x**^{(j)} is a vector of size *n* drawn from the cross product of the domains of the variables *X*_{1}, …, *X*_{n}. The basic idea is to randomly draw data sets with replacement from *D*, with each sample the same size as the original set, that is, *N*. This is done *B* times, producing *B* bootstrap replicates, as shown on the left in Figure 2. We learn Bayesian networks from each bootstrap resample. We can then analyze the networks generated over the *B* resamples, for example by producing means and variances of the frequencies of structural features.

Each example in *D* is represented between 0 and *B* times among the bootstrap resamples. Thus one can think of the standard bootstrap procedure as assigning each example in *D* an integer weight drawn from a multinomial distribution, representing its number of occurrences in a resample. The probability of not including a specific example in a given resample is about 1/*e* ≈ 37%. Since each example contributes to the counts *N*_{ijk} in the scoring function, dropping examples biases the counts, and hence the structures learned from them [10]. Whether the discreteness of the bootstrap leads to the generation of more complex graphs, as claimed in [10], with more false positive features, or whether it leads to structures with missing edges (false negative features) because of undersampling of weaker relationships in the data, is an open question.

An alternative approach is to use the Bayesian bootstrap [11]. It is a Bayesian resampling procedure that is operationally similar to the ordinary non-parametric bootstrap. In the Bayesian bootstrap, examples are assigned continuously varying weights drawn from a Dirichlet distribution. Not surprisingly, the Bayesian bootstrap procedure has a Bayesian interpretation. Assume that examples are drawn from some unknown multivariate distribution *P*(**X**), and that we have no specific priors on that distribution. The uninformative prior on *P* combined with the multinomial sample likelihood yields, via Bayes Theorem, a Dirichlet posterior distribution on the fraction of the original population that each sampled example represents. The ensemble of Bayesian bootstrap resamples, and the distribution of statistics derived from them, can be viewed as samples from a Bayesian posterior distribution. The continuously varying weights of the Bayesian bootstrap ensure that there is a vanishingly small chance of assigning a zero weight to any example in a resample. Thus, all of the inter-relationships between examples are preserved in the resample, but in reweighted form. Statistics derived from these resamples do not embody the bias introduced by the discreteness of the regular bootstrap.
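
The difference between the two resampling schemes shows up directly in the example weights each one produces. The following sketch is our own illustration using NumPy; to use Bayesian-bootstrap weights as case weights in the *N*_{ijk} counts, the Dirichlet draws are conventionally rescaled by *N*.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 100, 5    # examples per data set, number of resamples

# Ordinary bootstrap: integer weights from a multinomial with equal cell
# probabilities; each resample leaves about 37% of the examples at weight 0.
boot_weights = rng.multinomial(N, np.ones(N) / N, size=B)

# Bayesian bootstrap: continuous weights from a flat Dirichlet, rescaled
# by N; a zero weight occurs with probability zero, so every example
# contributes to every resample, just in reweighted form.
bayes_weights = N * rng.dirichlet(np.ones(N), size=B)

print((boot_weights == 0).mean())    # fraction dropped, close to 1/e
print((bayes_weights == 0).any())    # False: no example is dropped
```

Any count-based scoring function that accepts fractional case weights can consume either weight matrix unchanged, which is what makes the two procedures operationally similar.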

We now turn to the question of the number of bootstrap resamples needed to form accurate estimates of structure feature probabilities.

### Estimating a bound on the number of bootstrap resamples

The probability of a feature *e* from a bootstrap resample, $\hat{p}(e)$, will be exactly 0 or 1 if estimated from the single best network. In double averaging, the $\hat{p}(e)$, although estimated from multiple high-scoring networks, have highly bimodal distributions, each being either close to 0 or close to 1, with few exceptions. Consequently, we here approximate the feature probabilities from each resample as either 0 or 1. The averaged, or bagged, feature probabilities then have binomial distributions that in the large-resample limit approximate a normal distribution with mean *p* and variance

$$ \sigma^2 = \frac{p(1-p)}{B} $$

where *p* is the probability of getting a probability value of 1 in each resample, and *B* is the number of resamples. For a fixed number of resamples *B*, the largest variance is obtained when *p* = 0.5. The largest feature probability variance can therefore be estimated by

$$ \sigma^2_{\max} = \frac{0.25}{B}. $$

For *B* = 200 (the experimental setting used in [4] and suggested in [9]), the largest feature probability variance is 0.25/200 = 0.00125, yielding a standard deviation of approximately 0.035. Three standard deviations around a mean of 0.5 yields a probability range of roughly 0.4–0.6. To obtain a more reasonable standard deviation of about 0.01 on link probabilities, and hence a three-sigma range of 0.47–0.53, we need

$$ B = \frac{0.25}{(0.01)^2} = 2500. $$

In practice, we are most concerned about feature probabilities close to the cutoff threshold. For feature probabilities *p* greater than 0.5, the number of resamples required to achieve a standard deviation of 0.01 is approximately

$$ B = \frac{p(1-p)}{(0.01)^2}. $$

Even for a large *p*, for instance 0.9, this works out to 900 resamples.
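
The bound is easy to tabulate; `resamples_needed` below is a hypothetical helper that simply inverts Var = *p*(1 − *p*)/*B* for *B*:

```python
def resamples_needed(p, target_sd):
    """Number of resamples B at which the binomial standard deviation
    sqrt(p * (1 - p) / B) falls to target_sd."""
    return p * (1.0 - p) / target_sd ** 2

# Worst case p = 0.5 with a 0.01 standard deviation requires 2500
# resamples; at p = 0.9 the requirement drops to 900.
assert abs(resamples_needed(0.5, 0.01) - 2500) < 1e-6
assert abs(resamples_needed(0.9, 0.01) - 900) < 1e-6
```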

In practice, our resample feature probabilities are sometimes not exactly 0 or 1, so we expect the above to be slight overestimates.