Many scoring functions take the form of a penalized log-likelihood (LL) function. The LL is the log probability of *D* given *B*. Under the standard i.i.d. assumption, the likelihood of the data given a structure can be calculated as

\begin{aligned}
LL(D|B) &= \sum_{j=1}^{N} \log P(D_j | B) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{N} \log P(D_{ij} | PA_{ij}),
\end{aligned}

where *D*_{ij} is the instantiation of *X*_{i} in data point *D*_{j}, and *PA*_{ij} is the instantiation of *X*_{i}'s parents in *D*_{j}. Adding an arc to a network never decreases the likelihood of the network. Intuitively, the extra arc is simply ignored if it does not add any more information. The extra arcs pose at least two problems, though. First, they may lead to overfitting of the training data and result in poor performance on testing data. Second, densely connected networks increase the running time when using the networks for downstream analysis, such as inference and prediction.
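The decomposition above can be sketched in Python. This is a minimal illustration (the function name and data layout are our assumptions, not code from the paper), estimating each conditional probability by maximum likelihood from the counts in *D*:

```python
# Minimal sketch of the decomposed log-likelihood LL(D|B).
# Data: list of tuples (one tuple per data point D_j); parents: maps
# variable index i to the tuple of X_i's parent indices in B.
import math
from collections import Counter

def log_likelihood(data, parents):
    """LL(D|B) = sum_i sum_j log P(D_ij | PA_ij), with the conditional
    probabilities estimated by maximum likelihood from counts in the data."""
    n = len(data[0])               # number of variables
    ll = 0.0
    for i in range(n):
        pa = parents[i]
        joint = Counter()          # counts of (parent config, value of X_i)
        marg = Counter()           # counts of parent config alone
        for row in data:
            cfg = tuple(row[p] for p in pa)
            joint[(cfg, row[i])] += 1
            marg[cfg] += 1
        for (cfg, _), c in joint.items():
            ll += c * math.log(c / marg[cfg])   # c/marg[cfg] is the ML estimate
    return ll
```

For example, with two binary variables and an arc from the first to the second, `log_likelihood([(0,0),(0,1),(1,1),(1,1)], {0: (), 1: (0,)})` sums the log of each data point's fitted probability.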

A penalized LL function aims to address the overfitting problem by adding a penalty term which penalizes complex networks. Therefore, even though the complex networks may have a very good LL score, a high penalty term may reduce the score to be below that of a less complex network. Here, we focus on decomposable penalized LL (DPLL) scores, which are always of the form

DPLL(B, D) = LL(D|B) - \sum_{i=1}^{n} \text{Penalty}(X_i, B, D).

There are several well-known DPLL scoring functions for learning Bayesian networks. In this study, we consider MDL, AIC, BDeu and fNML. These scoring functions differ only in their penalty terms, so the discussion below focuses on the penalties. In terms of memory and runtime, all of these scoring functions incur similar overhead [32].
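A generic DPLL scorer can be sketched as follows; the scoring functions discussed below then differ only in the `penalty` callable that is plugged in. This is a minimal Python illustration with names and data layout of our choosing, not code from the paper:

```python
# Sketch of a generic DPLL scorer: DPLL(B,D) = LL(D|B) - sum_i Penalty(X_i,B,D).
import math
from collections import Counter

def local_counts(data, i, pa):
    """Counts of (parent configuration, value of X_i) and of the
    parent configuration alone."""
    joint, marg = Counter(), Counter()
    for row in data:
        cfg = tuple(row[p] for p in pa)
        joint[(cfg, row[i])] += 1
        marg[cfg] += 1
    return joint, marg

def dpll(data, parents, penalty):
    """`penalty(i, data, parents)` supplies the per-variable penalty term,
    so MDL, AIC, etc. are obtained by swapping that callable."""
    score = 0.0
    for i in range(len(data[0])):
        joint, marg = local_counts(data, i, parents[i])
        # maximum-likelihood local log-likelihood for X_i
        ll_i = sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        score += ll_i - penalty(i, data, parents)
    return score
```

With a zero penalty, `dpll` reduces to the plain log-likelihood LL(*D*|*B*).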

### Minimum description length (MDL)

The MDL [3] scoring metric for Bayesian networks was defined in [2, 33]. MDL treats scoring Bayesian networks as an information-theoretic task. The basic idea is to minimally encode *D* in two parts: the network structure and the unexplained data. The model can be encoded by storing the conditional probability tables of all variables, which requires \frac{\log N}{2} \cdot p bits, where \frac{\log N}{2} is the expected space required to store one probability value and *p* is the number of individual probability values across all variables. The unexplained part of the data can be encoded with *LL*(*D*|*B*) bits. Therefore, we can write the MDL penalty term as

\text{Penalty}_{MDL}(X_i, B, D) = \frac{\log N}{2} \, p_i,

where *p*_{i} is the number of parameters for *X*_{i}. The MDL penalty term reflects that more complex models require longer encodings. Because its penalty term is larger than that of most other scoring functions, optimal MDL networks tend to be sparser than optimal networks under other scoring functions. As its name suggests, an optimal MDL network minimizes rather than maximizes the scoring function; to interpret the penalty as a subtraction, the scores must be multiplied by -1. The Bayesian information criterion (BIC) [3] is a scoring function whose calculation is equivalent to MDL for Bayesian networks, but it is derived from the asymptotic behavior of the models; that is, BIC assumes a sufficiently large amount of data. BIC also does not require the -1 multiplication.
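As an illustration, the MDL penalty can be computed directly from the variable cardinalities. This sketch assumes the common parameter count *p*_{i} = (*r*_{i} - 1)*q*_{i}, with *q*_{i} the number of parent configurations; the function name and argument layout are ours, not the paper's:

```python
# Sketch of Penalty_MDL(X_i, B, D) = (log N / 2) * p_i.
# Assumes p_i = (r_i - 1) * q_i free parameters in X_i's conditional
# probability table (a standard convention, not stated in the text).
import math

def mdl_penalty(N, r, parents, i):
    """N: number of data points; r: maps variable index to its number of
    values; parents: maps variable index to its tuple of parent indices."""
    q_i = 1
    for p in parents[i]:
        q_i *= r[p]                 # number of parent configurations
    p_i = (r[i] - 1) * q_i          # free parameters for X_i
    return math.log(N) / 2 * p_i
```

For a ternary variable with one binary parent and *N* = 100, this gives (log 100 / 2) · 4.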

### Akaike's information criterion (AIC)

Bozdogan [34] defined the AIC [4] scoring metric for Bayesian networks. Like BIC, it is based on the asymptotic behavior of models with sufficiently large datasets. Its penalty is the MDL penalty without the \frac{\log N}{2} factor:

\text{Penalty}_{AIC}(X_i, B, D) = p_i.

Because its penalty term is less than that of MDL, AIC tends to favor more complex networks than MDL.
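To make the comparison concrete, here is a small illustrative sketch (function names are ours; natural logarithm assumed) contrasting the two penalties for the same parameter count:

```python
# Illustrative comparison (not from the paper): for the same parameter
# count p_i, MDL scales the AIC penalty by log(N)/2, which exceeds 1
# once N >= 8 -- hence MDL's tendency toward sparser networks.
import math

def aic_penalty(p_i):
    return p_i                      # Penalty_AIC = p_i

def mdl_penalty(p_i, N):
    return math.log(N) / 2 * p_i    # Penalty_MDL = (log N / 2) * p_i

# e.g. p_i = 12 parameters, N = 1000 data points:
# log(1000)/2 ~= 3.45, so MDL penalizes this variable ~3.45x as heavily.
```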

### Bayesian Dirichlet with score equivalence and uniform priors (BDeu)

The Bayesian Dirichlet (BD) scoring function was first proposed by Cooper and Herskovits [1]. It computes the joint probability of a network for a given dataset. However, the BD metric requires the user to specify a parameter for every possible variable-parents combination, and it does not assign the same score to equivalent structures, so it is not score equivalent. To address these problems, a single hyperparameter called the *equivalent sample size*, referred to as *α*, was introduced [6]. All of the needed parameters can be calculated from *α* and a prior distribution over network structures. The resulting score, called BDe, is score equivalent. Furthermore, if one assumes all network structures are equally likely, that is, the prior distribution over network structures is uniform, *α* is the only input this scoring function needs. BDe with this additional uniformity assumption is called BDeu [6]; the same score had been proposed earlier, independently, by Buntine [5]. BDeu is also a decomposable penalized LL scoring function, with penalty term

\text{Penalty}_{BDeu}(X_i, B, D) = \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \frac{P(D_{ijk} | D_{ij})}{P(D_{ijk} | D_{ij}, \alpha_{ij})},

where *q*_{i} is the number of possible values of *PA*_{i}, *r*_{i} is the number of possible values of *X*_{i}, *D*_{ijk} is the number of times *X*_{i} = *k* and *PA*_{i} = *j* in *D*, and *α*_{ij} is a parameter calculated from the user-specified *α*. The original derivations [5, 6] include a more detailed description. The density of the optimal network structure learned with BDeu is correlated with *α*: low *α* values typically result in sparser networks than higher *α* values. Recent studies [35] have shown the behavior of BDeu is very sensitive to *α*. If the density of the network to be learned is unknown, selecting an appropriate *α* is difficult.
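In practice, the BDeu local score is usually computed in closed form with log-gamma functions rather than through the penalty ratio above. A minimal sketch (function name and data layout are ours), assuming the standard uniform hyperparameters *α*_{ij} = *α*/*q*_{i} and *α*_{ijk} = *α*/(*r*_{i}*q*_{i}) from [6]:

```python
# Sketch of the closed-form BDeu local score for one variable X_i.
# counts maps each parent configuration to {value k: N_ijk}.
import math

def bdeu_local(counts, r_i, q_i, alpha):
    """Assumes alpha_ij = alpha/q_i and alpha_ijk = alpha/(r_i*q_i),
    the uniform BDeu hyperparameters."""
    a_ij = alpha / q_i
    a_ijk = alpha / (r_i * q_i)
    score = 0.0
    for cfg, kcounts in counts.items():
        n_ij = sum(kcounts.values())           # N_ij for this configuration
        score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
        for n_ijk in kcounts.values():
            score += math.lgamma(a_ijk + n_ijk) - math.lgamma(a_ijk)
    return score
```

As a sanity check, a single observation of a binary variable with no parents and *α* = 1 scores log(1/2), the uniform marginal likelihood.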

### Factorized normalized maximum likelihood (fNML)

Silander *et al.* developed the fNML scoring function, based on the normalized maximum likelihood (NML) function [7], to address the problem of *α* selection in BDeu. NML is a penalized LL scoring function whose penalty term is the *regret*, calculated as

\sum_{D'} P(D' | B),

where the sum ranges over all possible datasets of size *N*. Kontkanen and Myllymäki [36] showed how to efficiently calculate the regret for a single variable. By calculating the regret for each variable in the dataset, NML becomes decomposable, or factorized. The fNML penalty term is

\text{Penalty}_{fNML}(X_i, B, D) = \sum_{j=1}^{q_i} \log C_{N_{ij}}^{r_i},

where the C_{N_{ij}}^{r_i} are the regrets. fNML is not score equivalent.
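The regrets C_N^r can be computed exactly with the linear-time recurrence of Kontkanen and Myllymäki [36]: C_N^1 = 1, C_N^2 is a closed-form sum over binomial terms, and C_N^{m+2} = C_N^{m+1} + (N/m) C_N^m. A minimal sketch (function name is ours):

```python
# Sketch of the multinomial regret C_N^r via the linear recurrence:
#   C_N^1 = 1
#   C_N^2 = sum_k binom(N,k) (k/N)^k ((N-k)/N)^(N-k)
#   C_N^(m+2) = C_N^(m+1) + (N/m) * C_N^m
import math

def regret(N, r):
    """C_N^r for N data points and a variable with r values (r >= 1)."""
    if N == 0:
        return 1.0
    c1 = 1.0                                   # C_N^1
    c2 = sum(math.comb(N, k)                   # C_N^2; 0**0 == 1 handles edges
             * (k / N) ** k * ((N - k) / N) ** (N - k)
             for k in range(N + 1))
    if r == 1:
        return c1
    for m in range(1, r - 1):                  # advance to C_N^(m+2)
        c1, c2 = c2, c2 + (N / m) * c1
    return c2
```

For instance, with *N* = 2 and a binary variable, the regret is 2.5 (the sum of maximized likelihoods over all four datasets of size 2).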