Apples and oranges: avoiding different priors in Bayesian DNA sequence analysis
 Jens Keilwagen†^{1},
 Jan Grau†^{2},
 Stefan Posch^{2} and
 Ivo Grosse^{1, 2}
DOI: 10.1186/1471-2105-11-149
© Keilwagen et al; licensee BioMed Central Ltd. 2010
Received: 21 August 2009
Accepted: 22 March 2010
Published: 22 March 2010
Abstract
Background
One of the challenges of bioinformatics remains the recognition of short signal sequences in genomic DNA such as donor or acceptor splice sites, splicing enhancers or silencers, translation initiation sites, transcription start sites, transcription factor binding sites, nucleosome binding sites, miRNA binding sites, or insulator binding sites. During the last decade, a wealth of algorithms for the recognition of such DNA sequences has been developed and compared with the goal of improving their performance and deepening our understanding of the underlying cellular processes. Most of these algorithms are based on statistical models belonging to the family of Markov random fields such as position weight matrix models, weight array matrix models, Markov models of higher order, or moral Bayesian networks. While in many comparative studies different learning principles or different statistical models have been compared, the influence of choosing different prior distributions for the model parameters when using different learning principles has been overlooked, possibly leading to questionable conclusions.
Results
With the goal of allowing direct comparisons of different learning principles for models from the family of Markov random fields based on the same a priori information, we derive a generalization of the commonly used product-Dirichlet prior. We find that the derived prior behaves like a Gaussian prior close to the maximum and like a Laplace prior in the far tails. In two case studies, we illustrate the utility of the derived prior for a direct comparison of different learning principles with different models for the recognition of binding sites of the transcription factor Sp1 and human donor splice sites.
Conclusions
We find that comparisons of different learning principles using the same a priori information can lead to conclusions different from those of previous studies in which the effect resulting from different priors has been neglected. We implement the derived prior in the open-source library Jstacs to enable an easy application to comparative studies of different learning principles in the field of sequence analysis.
Background
The computational recognition of short signal sequences in genomic DNA is one of the prevalent tasks in bioinformatics. It includes, e.g., the recognition of transcription factor binding sites (TFBSs) [1, 2], donor or acceptor splice sites [3–5], nucleosome binding sites [6, 7], or binding sites of insulators like CTCF [8]. Many different algorithms have been developed for the recognition of such DNA binding sites, with specific strengths and weaknesses, but none of them is perfect. Hence, great efforts have been made over the last decade to evaluate and compare the performance of different algorithms [2, 3, 9–13]. The results of such comparative studies often influence the direction of future research, because they lead to new and superior approaches by combining the advantages of existing algorithms and because they provide a deeper understanding of the mechanisms of protein-DNA interaction. The approaches compared typically differ by (i) the statistical model employed at the heart of these algorithms, (ii) the learning principle chosen for estimating the model parameters, and (iii) the prior used for the parameters of the model, and it is non-trivial to keep the influences of these different contributions apart. Most studies focus on the first two aspects, developing improved statistical models or learning principles, while the choice of the prior is often arbitrary or determined by conjugacy. However, the choice of the prior may have a decisive effect on the recognition performance [14, 15]. The goal of this paper is to derive a common prior for Markov random fields (MRFs) and mixtures of MRFs, which are at the heart of many existing algorithms for binding site recognition, allowing an unbiased comparison of different learning principles for models from this model family.
Many computer algorithms available today use statistical models for representing the distribution of sequences, and many of these statistical models are special cases of MRFs [16, 17]. These models range from simple models like the position weight matrix (PWM) model [1, 18, 19], the weight array matrix (WAM) model [4, 6, 20], or Markov models of higher order [21, 22] to more complex models like moral Bayesian networks [2, 12, 23] or general MRFs [5, 24, 25]. Hence, we restrict our attention to statistical models from the family of MRFs in this paper.
One of the first learning principles used in bioinformatics is the maximum likelihood (ML) principle. However, for many applications, the sequence data available for learning statistical models is very limited. This is especially true for the recognition of TFBSs, where typical data sets contain sometimes as few as 20 and seldom more than 300 sequences. For this reason, the ML principle often leads to suboptimal classification performance, e.g., due to zero-occurrences of some nucleotides or oligonucleotides in the training data sets. The maximum a posteriori (MAP) principle, which applies a prior to the parameters of the models, establishes a theoretical foundation to alleviate this problem and at the same time allows for the inclusion of prior knowledge aside from the training data.
Recently, the application of discriminative principles instead of generative ones has been shown to be promising in the field of bioinformatics [9, 21, 22, 24, 26]. Generative learning principles aim at an accurate representation of the distribution of the training data, whereas discriminative learning principles aim at an accurate classification of the training data. The discriminative analogue to the ML principle is the maximum conditional likelihood (MCL) principle, which has been widely used in the machine learning community [27–31]. However, the effects of limited data may be even more severe when using the MCL principle compared to generative learning principles [11]. To overcome this problem, the maximum supervised posterior (MSP) principle [32, 33] has been proposed as discriminative analogue to the MAP principle.
Many different priors have been used in the past, and their choice seems arbitrary or motivated by technical aspects. Product-Gaussian and product-Laplace priors are widely used for generatively trained MRFs [16] and for discriminatively trained MRFs, also called conditional random fields [17, 34]. For the generative MAP learning of Markov models and Bayesian networks, the most prevalent prior is the product-Dirichlet prior, whereas for the discriminative MSP learning, either a product-Gaussian or a product-Laplace prior is typically employed [26]. Hence, when comparing generatively and discriminatively trained Markov models, Bayesian networks, and MRFs, on many occasions apples are compared to oranges by using different priors.
The comparison of generative and discriminative learning principles is the topic of several recent studies. Ng & Jordan [11] compare generatively and discriminatively trained PWM models. To be specific, they compare the Bayesian MAP principle with the non-Bayesian MCL principle. Pernkopf & Bilmes [30] compare the ML principle to the MCL principle for estimating the parameters of Bayesian networks, while the structures of the networks are estimated by generative as well as discriminative measures. Greiner et al. [29] compare the ML principle with a variant of the MCL principle that prevents overfitting, and they apply these approaches to Bayesian networks. Grau et al. [26] compare the MAP principle for Markov models using a product-Dirichlet prior to the MSP principle using product-Gaussian and product-Laplace priors.
All of these studies use different priors when comparing different learning principles, rendering the conclusions regarding the superiority of one learning principle over the other questionable, because the differing influences of these priors are neglected. In fact, we are not aware of any study that uses the same a priori information when comparing generative to discriminative learning principles.
We therefore seek a prior that
i) can be used for the generative (MAP) and the discriminative (MSP) principles,
ii) is conjugate to the likelihood of MRFs, which include moral Bayesian networks,
iii) contains the widely used product-Dirichlet prior as special case when the structure of the MRF is equivalent to that of a moral Bayesian network including all of its special cases such as PWM models, WAM models, Markov models of higher order, or Bayesian trees.
In section Methods, we present the derivation of such a prior, which is the main result of this paper. With such a prior at hand, it becomes possible to accomplish an unbiased comparison of generative and discriminative learning principles applied to the same model using the same prior. In addition, this prior allows a comparison of different generatively trained models for binding site recognition that are special cases of MRFs including PWM models, WAM models, Markov models of higher order, Bayesian trees, or moral Bayesian networks as well as a comparison of different discriminatively trained models that are special cases of MRFs using the BDeu prior [35]. In section Results and Discussion, we illustrate the applicability of the derived prior using two typical data sets of TFBSs and donor splice sites.
Methods
We denote by x = (x_{1},...,x_{L}) a sequence of length L over an alphabet Σ = {1,2,...,S} with x_{ℓ} ∈ Σ, where S = 4 in case of DNA and RNA sequences, and S = 20 in case of protein sequences. We denote by c ∈ {1,2,...,C} the class of a sequence. In this paper, we consider two-class problems, i.e., C = 2, and we denote the first class containing biological binding sites by foreground, and the second class containing decoy DNA sequences by background. For each sequence x_{n} in the training data set, we know its correct class label c_{n} ∈ {1,2,...,C}. We denote the data set of all sequences by D = (x_{1},...,x_{N}) and we denote the vector of the corresponding class labels by c = (c_{1},...,c_{N}).
In this paper, we consider two Bayesian learning principles, namely the generative maximum a posteriori (MAP) principle and the discriminative maximum supervised posterior (MSP) principle. The goal of both learning principles is to estimate the optimal parameters of some statistical model with respect to the posterior or supervised posterior, respectively.
Using the assumption of i.i.d. sequences and the assumption of independence of the parameters of the classes, generative learning principles, as for instance the MAP principle, can be simplified to class-specific generative learning principles that allow inferring the parameters of the foreground and background class separately. For several simple models like Markov models, generative learning principles amount to computing smoothed relative frequencies of nucleotides and oligonucleotides [18–20].
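For orientation, the two Bayesian objectives and their non-Bayesian counterparts can be written in the following standard form. This is a sketch in notation assumed for illustration, not a restoration of the paper's own equations: x_n and c_n denote the n-th sequence and its class label, θ the model parameters, and h(θ | α) a prior with hyperparameters α.

```latex
\begin{align*}
\hat{\theta}_{\mathrm{ML}}  &= \operatorname*{argmax}_{\theta} \prod_{n=1}^{N} P(c_n, x_n \mid \theta), &
\hat{\theta}_{\mathrm{MAP}} &= \operatorname*{argmax}_{\theta} \; h(\theta \mid \alpha) \prod_{n=1}^{N} P(c_n, x_n \mid \theta), \\
\hat{\theta}_{\mathrm{MCL}} &= \operatorname*{argmax}_{\theta} \prod_{n=1}^{N} P(c_n \mid x_n, \theta), &
\hat{\theta}_{\mathrm{MSP}} &= \operatorname*{argmax}_{\theta} \; h(\theta \mid \alpha) \prod_{n=1}^{N} P(c_n \mid x_n, \theta).
\end{align*}
```

Written this way, MAP and MSP differ only in whether the joint or the conditional likelihood is multiplied by the prior, which is why a fair comparison requires the same h(θ | α) in both.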
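As a minimal illustration of smoothed relative frequencies, the following Python sketch estimates the columns of a PWM model from a toy set of aligned sites. The function name `pwm_estimates`, the toy sequences, and the uniform pseudocount are assumptions made for this example. Whether the smoothed frequencies coincide exactly with the MAP estimate depends on the parameterization in which the mode is taken, which the sketch does not address.

```python
from collections import Counter

ALPHABET = "ACGT"

def pwm_estimates(sequences, pseudocount=0.0):
    """Per-position nucleotide probabilities of a PWM model.

    pseudocount = 0 yields plain relative frequencies (ML estimate);
    pseudocount > 0 adds the same pseudocount to every nucleotide
    at every position before normalising.
    """
    length = len(sequences[0])
    total = len(sequences) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

# Hypothetical toy data: four aligned sites of length 4.
toy_sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
ml = pwm_estimates(toy_sites)                        # G never occurs at position 0
smoothed = pwm_estimates(toy_sites, pseudocount=1.0)
```

With the ML estimate, any test sequence starting with G or C receives probability zero, whereas the smoothed estimate keeps all probabilities strictly positive.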
While the generative ML and MAP principles often lead to analytic solutions for simple models such as Markov models, we must use numerical optimization procedures [36] for the discriminative MCL and MSP principles.
In practical applications, the parameterization ϑ of the models and the priors h(ϑ | α) differ between the MAP and the MSP principle, since both learning principles evolved from different theoretical backgrounds. With the goal of resolving these differences, we present a common parameterization for the likelihood of all models from the class of MRFs, which can be used for the MAP and the MSP principle, and we derive a prior for this parameterization that is equivalent to the well-known product-Dirichlet prior in the remainder of this section.
Foundations of moral Bayesian networks
Graphical models, which combine probability theory and graph theory, are statistical models in which random variables are represented by nodes of a graph and in which the dependency structure of the joint probability distribution is represented by edges [37]. Graphical models can be categorized into directed acyclic graphical models called Bayesian networks and undirected graphical models called MRFs with a non-empty intersection called moral Bayesian networks [38]. For deriving the desired prior, we start with moral Bayesian networks in this subsection, where we give an introduction to moral Bayesian networks, and in the second subsection we present the MRF parameterization for these models. In the third subsection, we present the widely used product-Dirichlet prior for moral Bayesian networks, and transform this prior to the MRF parameterization. Finally, we extend the resulting prior for moral Bayesian networks to the case of general MRFs in the last subsection.
Graphical models are represented by graphs consisting of nodes and edges. The nodes in the graph represent random variables X_{ℓ} having realizations denoted by x_{ℓ}. In case of directed graphical models, the edges are directed from the parent nodes to their children. We denote by Pa(ℓ) the vector of parents of node ℓ representing random variable X_{ℓ}, and we denote by pa(ℓ, x) the realizations of the parents Pa(ℓ) in sequence x. Edges between nodes represent potential statistical dependencies between the random variables, while missing edges between nodes represent conditional independencies of the associated random variables given their parents. Specifically, if there is no edge from i to j, then X_{i} and X_{j} are conditionally independent given Pa(i) and Pa(j), the parents of nodes i and j. For Bayesian networks the underlying graph structure is a directed acyclic graph (DAG). In this paper, we consider models with a given graph structure, such that all parents of each node are predetermined. To simplify notation in the following derivation, we assume the same graph structure for the models of all classes. The extensions to models with different graph structures and to position-dependent alphabets are straightforward.
A Bayesian network is called a moral Bayesian network iff its DAG is moral. A DAG is called moral iff, for each node ℓ, each pair (p_{1},p_{2}), p_{1} ≠ p_{2}, of its parents is connected by an edge [38]. The family of moral Bayesian networks contains popular models such as PWM models, WAM models, Markov models of higher order, and Bayesian trees. When considering the parents Pa (ℓ) of a node ℓ in a moral Bayesian network, we can order the nodes in Pa (ℓ) uniquely according to the topological ordering within the set Pa (ℓ).
with c ∈ {1,...,C}, ℓ ∈ [1, L], and a ∈ Σ^{|Pa(ℓ)|} being a possible observation at the random variables represented by Pa(ℓ) and, hence, corresponding to pa(ℓ, x) for a specific sequence x.
It follows from these constraints that not all parameters of θ are free: if the values of θ_{1}, θ_{2},...,θ_{C−1} are given, the value of θ_{C} is determined, and if the values of θ_{c,ℓ,1,a}, θ_{c,ℓ,2,a},...,θ_{c,ℓ,S−1,a} are given, the value of θ_{c,ℓ,S,a} is determined.
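The likelihood and the constraints referred to in this subsection can be sketched in the usual form for Bayesian networks. The block below is a reconstruction from the surrounding definitions under that assumption, not a verbatim copy of the paper's equations: θ_{c} is the class probability and θ_{c,ℓ,b,a} the probability of symbol b at position ℓ in class c given the parent observation a.

```latex
\begin{align*}
P(c, x \mid \theta) &= \theta_{c} \prod_{\ell=1}^{L} \theta_{c,\ell,x_{\ell},\mathrm{pa}(\ell,x)}, \\
\sum_{c=1}^{C} \theta_{c} &= 1, \qquad
\sum_{b=1}^{S} \theta_{c,\ell,b,a} = 1 \quad \text{for all } c,\ \ell,\ a, \qquad
\theta_{c} \ge 0,\ \theta_{c,\ell,b,a} \ge 0 .
\end{align*}
```

The two sum-to-one constraints are what remove one degree of freedom per class distribution and per conditional distribution.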
MRF Parametrization of moral Bayesian networks
Similar to the θ parameters, there is one parameter λ_{c} ∈ ℝ for each class c ∈ {1,...,C}, and one parameter λ_{c,ℓ,b,a} ∈ ℝ for each class c and each symbol b at X_{ℓ} given the observation a at random variables represented by the nodes Pa(ℓ). In contrast to θ, however, these parameters cannot be interpreted directly as probabilities.
with c ∈ [1, C − 1] and c ∈ [1, C], ℓ ∈ [1, L], b ∈ [1, S − 1], a ∈ Σ^{|Pa(ℓ)|}, respectively.
where Z_{ c }(λ) and Z_{c,ℓ,b, a } ( λ ) are two partial normalisation constants defined in Appendix A of Additional File 1.
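To make the change of parameterization concrete, here is a minimal Python sketch for a single probability vector. It assumes the common log-ratio convention in which the last symbol serves as reference with its λ fixed to 0, which matches the index range b ∈ [1, S − 1] above; the exact normalisation constants of the paper are given in its Appendix A and are not reproduced here. The function names are hypothetical.

```python
import math

def theta_to_lambda(theta):
    """Map strictly positive probabilities to S-1 free real parameters.

    Assumption of this sketch: the last entry is the reference symbol,
    so lambda_b = log(theta_b / theta_S) and lambda_S = 0 is implicit.
    """
    reference = theta[-1]
    return [math.log(t / reference) for t in theta[:-1]]

def lambda_to_theta(lam):
    """Inverse map: exponentiate, append the reference weight 1, normalise."""
    weights = [math.exp(v) for v in lam] + [1.0]
    z = sum(weights)  # plays the role of a partial normalisation constant
    return [w / z for w in weights]

theta = [0.1, 0.2, 0.3, 0.4]
lam = theta_to_lambda(theta)      # three unconstrained real numbers
recovered = lambda_to_theta(lam)  # back to four probabilities
```

The λ values are unconstrained reals, which is convenient for the numerical optimization needed by discriminative learning, while the normalisation restores valid probabilities.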
Prior for moral Bayesian networks
where ϕ = (ϕ_{1}, ϕ_{2},...), and ϕ_{i} stands for θ_{c} or θ_{c,ℓ,b,a}.
where the Kronecker symbol δ is 1 if both indices are equal and 0 otherwise. These constraints ensure that the hyperparameters α of the product-Dirichlet prior can be interpreted as, possibly real-valued, counts stemming from a set of a priori observed pseudo-data. The size of the set of pseudo-data is commonly referred to as equivalent sample size [35, 39], and we denote the equivalent sample size of class c by α_{c}. Hence, a product-Dirichlet prior allows an intuitive and easily interpretable choice of hyperparameters, in contrast to product-Gaussian or product-Laplace priors.
and obtain a general transformed Dirichlet prior (Appendix C of Additional File 1).
where α := Σ_{ c }α_{ c }.
Since the commonly used product-Dirichlet prior for θ defined in equation (7a) is conjugate to the likelihood defined in equation (3), the transformed prior of equation (11) is also conjugate to the likelihood defined in equation (4). While in earlier comparisons of different learning principles for the same moral Bayesian network, different priors have been employed, we are now capable of using the same prior as defined in equation (11) for the MAP and the MSP principle. Employing this prior, we can compare the classification accuracy of two classifiers based on the same model, but trained either by the MAP or the MSP principle, using the same prior, avoiding a potential bias induced by differing priors.
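The shape of a transformed Dirichlet prior can be explored numerically for a single probability vector. The sketch below assumes the log-ratio parameterization with the last symbol as reference; under that assumption, a Dirichlet density with exponents α_b − 1 on θ becomes, after including the Jacobian of the transformation, proportional to the product of θ_b(λ)^{α_b} in λ. The function name is hypothetical and the full multi-class expression from Appendix C is not reproduced.

```python
import math

def log_transformed_dirichlet(lam, alpha):
    """Unnormalised log density over the S-1 free parameters lam.

    alpha holds S pseudocounts; the last one belongs to the reference
    symbol whose lambda is fixed to 0 (assumption of this sketch).
    """
    full = list(lam) + [0.0]
    m = max(full)  # log-sum-exp with shift for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in full))
    return sum(a * v for a, v in zip(alpha, full)) - sum(alpha) * log_z

alpha = [8.0, 8.0, 8.0, 8.0]  # uniform pseudocounts, mode at lam = 0
at_mode = log_transformed_dirichlet([0.0, 0.0, 0.0], alpha)
nearby = log_transformed_dirichlet([0.1, 0.0, 0.0], alpha)

# Far tails: the log density changes linearly, as for a Laplace prior.
right_tail_slope = (log_transformed_dirichlet([31.0, 0.0, 0.0], alpha)
                    - log_transformed_dirichlet([30.0, 0.0, 0.0], alpha))
left_tail_slope = (log_transformed_dirichlet([-30.0, 0.0, 0.0], alpha)
                   - log_transformed_dirichlet([-31.0, 0.0, 0.0], alpha))

# Near the mode: the log density is approximately quadratic, as for a Gaussian.
h = 1e-3
curvature = (log_transformed_dirichlet([h, 0.0, 0.0], alpha)
             - 2.0 * at_mode
             + log_transformed_dirichlet([-h, 0.0, 0.0], alpha)) / h**2
```

In this toy setting the slopes in the two tails are constant but different (the pseudocount of the varied symbol on one side, the remaining pseudocount mass on the other), so the resemblance to a Laplace prior is qualitative rather than exact.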
Choice of hyperparameters
In contrast to the comparison of the MAP and the MSP principle for the same model, the derived prior cannot be used for an unbiased comparison of different models without further premises, since different models typically use different parameters of potentially different dimension, inevitably leading to different priors for these models. One reasonable requirement for the comparison of models with different graph structures is likelihood equivalence [39], stating that models with different graph structures representing the same likelihood also obtain the same marginal likelihood of the data given graph structure and hyperparameters or, equivalently, that the values of the prior density on the parameters of such models must be equal for equivalent parameter values. Examples for different graph structures representing the same likelihood are left-to-right and right-to-left Markov models or differently rooted Bayesian trees with the same undirected graph structure.
Heckerman et al. [39] show that this property is satisfied only by the BDe metric, which corresponds to the consistency condition presented above. This condition also entails that the hyperparameters used for the priors of these models can be derived from a common set of pseudodata. However, the consistency criterion does not determine how a specific set of pseudodata should be chosen in order to minimize the bias imposed on the comparison, and different choices may favor different models in one way or the other. For example, a comparison of different models can be easily biased if the set of pseudodata contains statistical dependencies that can be exploited by some but not by all models, as for instance dinucleotide dependencies that can be captured by a WAM model but not by a PWM model.
where |Pa(ℓ)| is the number of parents Pa(ℓ) of node ℓ, c ∈ {1,...,C}, ℓ ∈ [1, L], b ∈ Σ, and a ∈ Σ^{|Pa(ℓ)|}.
Consider the example that the equivalent sample size for class c is α_{c} = 32 and that the data of each class is modeled either by a PWM or by a WAM model. The PWM model has parameters λ_{c,ℓ,b}, ℓ ∈ [1, L], b ∈ Σ, while the WAM model has parameters λ_{c,1,b}, b ∈ Σ and λ_{c,ℓ,b,a}, ℓ ∈ [2, L], b, a ∈ Σ. In case of the DNA alphabet, the BDeu metric determines the hyperparameters for the PWM model to be α_{c,ℓ,b} = 8, while it determines the hyperparameters for the WAM model to be α_{c,1,b} = 8 and α_{c,ℓ,b,a} = 2. With this choice of hyperparameters, both product-Dirichlet priors represent the same set of pseudo-data. The hyperparameters α_{c,ℓ,b} of the PWM model correspond to pseudo-counts of mononucleotides b, while the hyperparameters α_{c,ℓ,b,a} of the WAM model correspond to conditional pseudo-counts of nucleotides b given nucleotide a observed at the previous position ℓ − 1. This result holds equally for all specializations of MRFs considered in this paper, and we choose the hyperparameters accordingly throughout the case studies.
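The arithmetic of this example can be checked with a few lines of Python. The sketch assumes the usual uniform BDeu rule, in which the equivalent sample size is spread evenly over all S symbols and all S^{|Pa(ℓ)|} parent observations; the function name is hypothetical.

```python
def bdeu_hyperparameter(equivalent_sample_size, alphabet_size, num_parents):
    """Uniform BDeu pseudocount for one symbol given one parent observation.

    Assumes the equivalent sample size is divided evenly over
    alphabet_size ** (num_parents + 1) combinations.
    """
    return equivalent_sample_size / alphabet_size ** (num_parents + 1)

ess = 32
pwm_alpha = bdeu_hyperparameter(ess, 4, num_parents=0)        # every PWM position
wam_first_alpha = bdeu_hyperparameter(ess, 4, num_parents=0)  # WAM position 1
wam_other_alpha = bdeu_hyperparameter(ess, 4, num_parents=1)  # WAM positions 2..L

# Summing the conditional pseudocounts over the four possible predecessors
# recovers the mononucleotide pseudocount of the PWM model.
marginalised = 4 * wam_other_alpha
```

The last line illustrates why both priors describe the same pseudo-data: marginalising the WAM pseudocounts over the previous nucleotide yields the PWM pseudocounts.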
Markov random fields
The prior of equation (11) allows an unbiased comparison of different learning principles including the generative MAP principle and the discriminative MSP principle for different models from the family of moral Bayesian networks including PWM models, WAM models, Markov models of higher order, or Bayesian trees. However, several important models proposed for the recognition of short signal sequences do not belong to this family. Hence, we now focus on the main goal of deriving a prior for the family of MRFs, which contains the family of moral Bayesian networks as special case.
where I_{c} denotes the number of λ parameters conditional on class c, and f_{c,i}(x) ∈ {0, 1} denotes the indicator function of λ_{c,i} [17, 40]. These indicator functions determine the undirected graph structure.
Renaming the parameters in terms of λ_{c, i}and defining the indicator functions f_{c, i}as corresponding Kronecker symbols, we obtain the likelihood in form of equation (12).
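A likelihood built from λ parameters and binary indicator functions can be illustrated with a brute-force Python sketch for very short sequences. It assumes the standard log-linear form, in which the unnormalised score of a sequence is the exponential of the sum of all λ whose indicator function fires; the class index is dropped, and all names, features, and parameter values are hypothetical.

```python
import itertools
import math

ALPHABET = "ACGT"

def make_pair_feature(pos1, sym1, pos2, sym2):
    """Indicator function: 1 iff sym1 is at pos1 and sym2 is at pos2."""
    return lambda x: 1 if x[pos1] == sym1 and x[pos2] == sym2 else 0

def mrf_probability(x, lambdas, features):
    """P(x) = exp(sum_i lambda_i * f_i(x)) / Z, with Z by full enumeration.

    Enumeration costs 4**len(x) evaluations and is only meant for toy lengths.
    """
    def score(seq):
        return math.exp(sum(l * f(seq) for l, f in zip(lambdas, features)))
    z = sum(score(seq) for seq in itertools.product(ALPHABET, repeat=len(x)))
    return score(x) / z

# Two hypothetical features on sequences of length 3; the first one links
# the non-adjacent positions 0 and 2.
features = [make_pair_feature(0, "G", 2, "T"), make_pair_feature(0, "A", 1, "A")]
lambdas = [1.0, -0.5]

p_gat = mrf_probability("GAT", lambdas, features)
p_cat = mrf_probability("CAT", lambdas, features)
total = sum(mrf_probability(seq, lambdas, features)
            for seq in itertools.product(ALPHABET, repeat=3))
```

Which indicator functions are present determines which positions interact, so the feature list plays the role of the undirected graph structure.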
The resulting prior for MRFs, defined in equation (14), contains the transformed Dirichlet prior of equation (11) as special case if the MRF of each class belongs to the family of moral Bayesian networks. Examining the likelihood of equation (12), we find that the prior of equation (14) is conjugate to the likelihood of MRFs. Additionally, it is equivalent to the conjugate prior of the exponential family [44] for the studied family of models.
In summary, the derived prior
i) can be used for the generative MAP and the discriminative MSP principle,
ii) is conjugate to the likelihood of MRFs and, hence, also to the likelihoods of many popular models used for the recognition of short sequence motifs,
iii) includes the commonly used product-Dirichlet prior of equation (7a) as special case if the MRF belongs to the family of moral Bayesian networks including PWM models, WAM models, Markov models of higher order, or Bayesian trees, and
iv) allows prior knowledge to be incorporated intuitively by defining a set of a priori observed pseudo-data.
Hence, it can be employed in comparative studies of generative and discriminative learning principles applied to the same family of models, and of different, generatively or discriminatively trained models. Additionally, the derived prior can be readily extended to mixtures of models from the family of MRFs. In the next section, we illustrate the utility of the derived prior.
Results and Discussion
In this section, we present two case studies that illustrate how the derived prior can be used for an unbiased comparison of different learning principles for different models related to two standard problems in bioinformatics.
In case study 1, we illustrate the comparison of different learning principles for the recognition of TFBSs using the same models and the same priors. Specifically, we investigate the influence of different sizes of data sets on the performance of generatively and discriminatively trained models in close analogy to the pioneering study of Ng & Jordan [11]. Possibly due to the lack of a common prior that could be used for both the generative and the discriminative learning, Ng & Jordan compare the generative Bayesian approach of parameter estimation (MAP) to the discriminative non-Bayesian approach of parameter estimation (MCL). Based on the derived prior, it is now possible to compare the two Bayesian learning principles directly using exactly the same prior in both cases. In case of TFBSs, the number of available training sequences is small, typically ranging from only 20 to at most 300 sequences. Hence, available algorithms for the recognition of TFBSs are far from being perfect, and unbiased comparisons of different learning principles for data sets of this size are of fundamental importance for any further advance in this field.
In case study 2, we illustrate the comparison of different learning principles with different models for the recognition of human donor splice sites using the same a priori information. Donor splice sites exhibit non-adjacent dependencies [3, 45, 46]. Hence, it seems worthwhile to employ MRFs for this task, as they are capable of capturing dependencies between all pairs of positions in a sequence [5]. However, different subclasses of donor splice sites exist [3], so the use of mixtures of MRFs may be favourable. Donor splice sites are highly conserved so that for some pairs of positions some of the 16 possible pairs of nucleotides do not occur. These non-occurrences cause numerical problems when using the ML or MCL principle, but one may adopt a Bayesian approach to circumvent these problems. Interestingly, mixtures of MRFs have not been employed in the past for the classification of donor splice sites, possibly because of the lack of a suitable prior. The derived prior now provides an opportunity to investigate if mixtures of MRFs might be useful for the recognition of splice sites. We compare mixtures of MRFs to single MRFs, mixtures of Markov models, and single Markov models using the MAP and the MSP principle, and we investigate which of these two learning principles may be worthwhile for the recognition of splice sites.
The focus of the case studies presented is not on the identification of the most appropriate model class or learning principle for the recognition problem scrutinized, although undoubtedly this is a welcome side-effect, but primarily we aim at illustrating the benefit of the derived prior for unbiased comparative studies in bioinformatics.
Case Study 1: Discriminative vs. generative parameter estimation
In case study 1, we illustrate a comparison of generatively trained and discriminatively trained Markov models of different orders using the derived prior. We choose the data set of [26] containing 257 aligned binding sites, each of length 16 bp, of the mammalian transcription factor Sp1 as foreground data set and 267 second exons of human genes, which have different lengths and are cut into 100-mers for this study, with a total size of approximately 68 kb as background data set. We use a PWM model as foreground model and Markov models of order 3 as background model. Results for all other combinations of a Markov model of orders 0 or 1 as foreground model and Markov models of orders 0 to 3 as background model are available in Additional File 2. These models are trained by the MAP principle and by the MSP principle using the same priors and the same hyperparameters for both cases. We choose for both cases and all model combinations an equivalent sample size of 4 for the foreground model and an equivalent sample size of 1024 for the background model.
We use a stratified holdout sampling procedure for the comparison of the classification performance of the resulting classifiers. In each iteration of the stratified holdout sampling procedure, we randomly partition both the foreground data set and the background data set into a preliminary training data set comprising 90% of the sequences and a test data set comprising the remaining 10% of the sequences. In order to vary the size of the training data set, we use an additional sampling step, where we randomly draw a given fraction of the preliminary training data sets ranging from 5% to 100% yielding the final training data sets. We train all classifiers corresponding to different learning principles and different model combinations on the same subsets of the preliminary training data sets, and we use the resulting classifiers to classify the same sequences in the test data sets.
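One iteration of such a sampling procedure might look as follows in Python. This is a sketch under assumptions, not the authors' implementation: the function names, the rounding of subset sizes, and the toy data set sizes are hypothetical; only the 90%/10% partition and the additional subsampling step are taken from the description above.

```python
import random

def holdout_split(items, test_fraction, rng):
    """Randomly partition items into (preliminary training set, test set)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def stratified_holdout(foreground, background, train_fraction, rng,
                       test_fraction=0.1):
    """One iteration: split each class separately, then subsample training data.

    Splitting foreground and background independently keeps the class
    proportions of the full data in both the training and the test sets.
    """
    result = {}
    for name, data in (("foreground", foreground), ("background", background)):
        preliminary_train, test = holdout_split(data, test_fraction, rng)
        n_final = max(1, round(len(preliminary_train) * train_fraction))
        final_train = rng.sample(preliminary_train, n_final)
        result[name] = (final_train, test)
    return result

rng = random.Random(0)  # fixed seed so that the split is reproducible
toy_foreground = [f"fg{i}" for i in range(100)]
toy_background = [f"bg{i}" for i in range(200)]
split = stratified_holdout(toy_foreground, toy_background, 0.5, rng)
fg_train, fg_test = split["foreground"]
bg_train, bg_test = split["background"]
```

To compare learning principles fairly, the same returned split would be passed to every classifier within an iteration, so that differences in performance are not caused by different training or test sequences.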
The classification performance increases rapidly with increasing size of the training data set and achieves its optimal value for the largest training data sets. For the largest training data set, the discriminatively trained classifier yields an FPR of 0.4%, an Sn of 76.6%, a PPV of 57.3%, and an AUC-PR of 0.826, whereas the generatively trained classifier yields only an FPR of 0.6%, an Sn of 70.5%, a PPV of 47.0%, and an AUC-PR of 0.803.
Ng & Jordan [11] compare the classification performance of PWMs trained by the MAP principle and the MCL principle on a number of data sets from the UCI machine learning repository. They find that for large data sets the discriminative MCL principle has a lower asymptotic error, corresponding to a higher classification performance, but that the generative MAP principle yields a higher classification performance for small data sets. In contrast to those findings, we find a superior classification performance of the discriminatively trained models compared to the generatively trained models irrespective of the size of the training data set. This result suggests that the choice of the same prior is advisable for an unbiased comparison of generative and discriminative learning principles and, moreover, that it might be worthwhile to re-evaluate the power of the MSP principle for other applications in bioinformatics as well.
Case Study 2: Mixtures of Markov random fields
where ℓ_{1}, ℓ_{2} ∈ [1, L], ℓ_{1} ≠ ℓ_{2}, and b_{1}, b_{2} ∈ Σ. Based on these basic models, we build mixture models with two MMs (mixMM) and two such MRFs (mixMRF), and we compare those four classifiers that are based on a combination of the same kind of model for the foreground and for the background class. For all of these classifiers, we use the derived prior with an equivalent sample size of 32 for each of the four foreground models and an equivalent sample size of 96 for each of the four background models. We train each of these classifiers on the two training data sets using the MAP and the MSP principle, and we evaluate their classification performance on the two test data sets. We use the same performance measures as in case study 1, except that we replace Sn by the area under the receiver operating characteristic curve (AUC-ROC) [48], because AUC-ROC is more commonly used than Sn for the classification of splice sites [5].
In close analogy to Figure 3(a-d), Figure 3(e-h) shows the results using the MSP principle. We find that discriminatively trained mixture models, i.e., mixMM and mixMRF, outperform the two corresponding classifiers based on the single MM and single MRF, and that the mixMM classifier is comparable to or even better than the MRF classifier. The mixMRF classifier yields the best results for FPR (7.0%) and PPV (39.0%), while the mixMM classifier yields a higher AUC-ROC (0.9809) and AUC-PR (0.6876) than the mixMRF classifier.
Comparing Figures 3(a-d) and 3(e-h), we find that the four MSP-trained models outperform the corresponding MAP-trained models. For instance, the MM classifier yields a PPV of 37.8% for the MSP principle and only 35.7% for the MAP principle, and the mixMRF classifier yields a PPV of 39.0% for the MSP principle and only 38.5% for the MAP principle. Interestingly, classifiers based on simple models (MM and mixMM) show the greatest improvement when replacing the MAP principle by the MSP principle. This observation is in accordance with previous findings that discriminative learning seems to be advantageous over generative learning if the model assumption is wrong [29].
Conclusions
The systematic comparison of different statistical models and different learning principles has been the focus of several studies of the last decade [11, 26, 29, 30]. However, these comparisons lose value if different priors are used for different models or different learning principles, and it is questionable if the obtained results from such comparisons are meaningful at all.
In this paper, we derive a prior that allows an unbiased comparison of generative and discriminative learning principles for models from the family of MRFs including PWM models, WAM models, Markov models of higher order, Bayesian trees, moral Bayesian networks, and their mixtures as special cases. The derived prior is conjugate to the likelihood of MRFs and a generalization of the commonlyused productDirichlet prior for moral Bayesian networks. The derived prior provides an interesting interpolation between a productGaussian prior and a productLaplace prior: it is qualitatively similar to a productGaussian prior in the vicinity of the maximum and qualitatively similar to a productLaplace prior in the far tails. In contrast to a productGaussian and a productLaplace prior, the hyperparameters of the derived prior can be easily interpreted as counts stemming from pseudodata, allowing an intuitive choice of these hyperparameters.
We present two case studies using the derived prior for an unbiased comparison, and we find that discriminative parameter learning can be beneficial for sequence classification in the field of bioinformatics. On a set of mammalian TFBSs, we find that it is possible to achieve an improved classification performance by using the discriminative MSP principle instead of the generative MAP principle even if the amount of available training data is small. By varying the size of the training data set, we find that discriminative parameter learning can improve the recognition of TFBSs over generative parameter learning irrespective of the size of the training data set. This result differs from previous findings of Ng & Jordan [11], who did a similar study comparing the generative Bayesian MAP principle to the discriminative non-Bayesian MCL principle. On a data set of donor splice sites [5], we illustrate the utility of the proposed prior for comparing Markov models, mixtures of Markov models, MRFs, and mixtures of MRFs. For this data set, we find that the best classification performance can be achieved by a discriminatively trained mixture of MRFs.
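The difference between the two learning principles compared above can be made concrete with a small sketch. It assumes a two-class classifier with one PWM per class and evaluates both objectives for fixed parameters: the MAP objective sums the log joint probabilities of sequences and labels, the MSP objective sums the log conditional probabilities of the labels given the sequences, and both add the same log-prior term. The prior used here is a simple unnormalized Dirichlet-style term with a hypothetical pseudo-count `pseudo`; all names and numbers are illustrative and do not reflect the Jstacs API or the data of the case studies.

```python
import math


def log_joint(seq, label, class_probs, pwms):
    """log P(label, seq) for a PWM model per class."""
    lp = math.log(class_probs[label])
    for pos, nuc in enumerate(seq):
        lp += math.log(pwms[label][pos][nuc])
    return lp


def log_conditional(seq, label, class_probs, pwms):
    """log P(label | seq) = log P(label, seq) - log sum_c P(c, seq)."""
    joints = [log_joint(seq, c, class_probs, pwms) for c in range(len(class_probs))]
    top = max(joints)
    log_evidence = top + math.log(sum(math.exp(j - top) for j in joints))
    return joints[label] - log_evidence


def log_prior(class_probs, pwms, pseudo=1.0):
    """Assumed unnormalized Dirichlet-style log prior shared by both objectives."""
    lp = sum(pseudo * math.log(p) for p in class_probs)
    for pwm in pwms:
        for column in pwm:
            lp += sum(pseudo * math.log(p) for p in column.values())
    return lp


def map_objective(data, class_probs, pwms, pseudo=1.0):
    """Generative objective: log joint likelihood plus log prior."""
    ll = sum(log_joint(s, c, class_probs, pwms) for s, c in data)
    return ll + log_prior(class_probs, pwms, pseudo)


def msp_objective(data, class_probs, pwms, pseudo=1.0):
    """Discriminative objective: log conditional likelihood plus the same log prior."""
    cll = sum(log_conditional(s, c, class_probs, pwms) for s, c in data)
    return cll + log_prior(class_probs, pwms, pseudo)


# Hypothetical toy parameters: class 0 = binding site, class 1 = background.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PWMS = [
    [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}],
    [dict(UNIFORM), dict(UNIFORM)],
]
CLASS_PROBS = [0.5, 0.5]
DATA = [("AG", 0), ("AG", 0), ("CT", 1), ("TA", 1)]

print("MAP objective:", map_objective(DATA, CLASS_PROBS, PWMS))
print("MSP objective:", msp_objective(DATA, CLASS_PROBS, PWMS))
```

Because the prior term is identical in both functions, any difference between parameters that maximize the two objectives can be attributed to the learning principle rather than to the prior, which is the kind of comparison the derived prior is meant to enable.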
The derived prior might be useful in future comparative studies as it provides less biased guidance to the understanding of molecular mechanisms, and it might lead to further improvements of algorithms for the recognition of short signal sequences including splice sites, TFBSs, nucleosome binding sites, miRNA binding sites, transcription initiation sites, or insulator binding sites. Hence, we make an implementation of this prior available to the scientific community as part of the open source Java library Jstacs (http://www.jstacs.de).
Notes
List of abbreviations
MAP: maximum a posteriori
MCL: maximum conditional likelihood
ML: maximum likelihood
MRF: Markov random field
MSP: maximum supervised posterior
PWM: position weight matrix
TFBS: transcription factor binding site
WAM: weight array matrix.
Declarations
Acknowledgements
We thank André Gohr, Michael Seifert, and Marc Strickert for helpful discussions and comments on the manuscript. This work was supported by grant XP3624HP/0606T by the Ministry of Culture of Saxony-Anhalt.
References
1. Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003, 31(13):3576–3579. 10.1093/nar/gkg585
2. Barash Y, Elidan G, Friedman N, Kaplan T: Modelling dependencies in protein-DNA binding sites. In RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology. New York, NY, USA: ACM Press; 2003:28–37.
3. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268: 78–94. 10.1006/jmbi.1997.0951
4. Salzberg SL: A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput Appl Biosci 1997, 13(4):365–376.
5. Yeo G, Burge CB: Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals. Journal of Computational Biology 2004, 11(2–3):377–394. 10.1089/1066527041410418
6. Segal E, Fondufe-Mittendorf Y, Chen L, Thåström A, Field Y, Moore IK, Wang JPZ, Widom J: A genomic code for nucleosome positioning. Nature 2006, 442(7104):772–778. 10.1038/nature04979
7. Peckham HE, Thurman RE, Fu Y, Stamatoyannopoulos JA, Noble WS, Struhl K, Weng Z: Nucleosome positioning signals in genomic DNA. Genome Res 2007. gr.6101007
8. Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B: Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 2007, 128(6):1231–1245. 10.1016/j.cell.2006.12.048
9. Redhead E, Bailey T: Discriminative motif discovery in DNA and protein sequences using the DEME algorithm. BMC Bioinformatics 2007, 8: 385. 10.1186/1471-2105-8-385
10. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotech 2005, 23: 137–144. 10.1038/nbt1053
11. Ng AY, Jordan MI: On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems. Volume 14. Edited by: Dietterich T, Becker S, Ghahramani Z. Cambridge, MA: MIT Press; 2002:605–610.
12. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, Posch S, Grosse I: Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 2005, 21(11):2657–2666. 10.1093/bioinformatics/bti410
13. Sonnenburg S, Zien A, Rätsch G: ARTS: accurate recognition of transcription starts in human. Bioinformatics 2006, 22(14):e472–480. 10.1093/bioinformatics/btl250
14. Kim NK, Tharakaraman K, Marino-Ramirez L, Spouge J: Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 2008, 9: 262. 10.1186/1471-2105-9-262
15. Narlikar L, Gordan R, Ohler U, Hartemink AJ: Informative priors based on transcription factor structural class improve de novo motif discovery. Bioinformatics 2006, 22(14):e384–392. 10.1093/bioinformatics/btl251
16. Chen S, Rosenfeld R: A Gaussian Prior for Smoothing Maximum Entropy Models. In Tech. rep. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA; 1999.
17. Klein D, Manning C: Maxent Models, Conditional Estimation, and Optimization. HLT-NAACL 2003 Tutorial 2003.
18. Staden R: Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research 1984, 12: 505–519. 10.1093/nar/12.1Part2.505
19. Stormo GD, Schneider TD, Gold LM, Ehrenfeucht A: Use of the 'perceptron' algorithm to distinguish translational initiation sites. NAR 1982, 10: 2997–3010. 10.1093/nar/10.9.2997
20. Zhang M, Marr T: A weight array method for splicing signal analysis. Comput Appl Biosci 1993, 9(5):499–509.
21. Yakhnenko O, Silvescu A, Honavar V: Discriminatively Trained Markov Model for Sequence Classification. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining. Washington, DC, USA: IEEE Computer Society; 2005:498–505.
22. Keilwagen J, Grau J, Posch S, Grosse I: Recognition of splice sites using maximum conditional likelihood. In LWA: Lernen - Wissen - Abstraktion. Edited by: Hinneburg A. 2007, 67–72.
23. Cai D, Delcher A, Kao B, Kasif S: Modeling splice sites with Bayes networks. Bioinformatics 2000, 16(2):152–158. 10.1093/bioinformatics/16.2.152
24. Culotta A, Kulp D, McCallum A: Gene Prediction with Conditional Random Fields. In Tech. Rep. Technical Report UM-CS-2005–028. University of Massachusetts, Amherst; 2005.
25. Bernal A, Crammer K, Hatzigeorgiou A, Pereira F: Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction. PLoS Comput Biol 2007, 3(3):e54. 10.1371/journal.pcbi.0030054
26. Grau J, Keilwagen J, Kel A, Grosse I, Posch S: Supervised posteriors for DNA-motif classification. In German Conference on Bioinformatics, Lecture Notes in Informatics (LNI) - Proceedings. Volume 115. Edited by: Falter C, Schliep A, Selbig J, Vingron M, Walter D. Bonn: Gesellschaft für Informatik (GI); 2007:123–134.
27. Wettig H, Grünwald P, Roos T, Myllymäki P, Tirri H: On Supervised Learning of Bayesian Network Parameters. In Tech. Rep. HIIT Technical Report 2002–1. Helsinki Institute for Information Technology HIIT; 2002.
28. Grossman D, Domingos P: Learning Bayesian network classifiers by maximizing conditional likelihood. ICML, ACM Press; 2004:361–368.
29. Greiner R, Su X, Shen B, Zhou W: Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers. Machine Learning Journal 2005, 59(3):297–322. 10.1007/s10994-005-0469-0
30. Pernkopf F, Bilmes JA: Discriminative versus generative parameter and structure learning of Bayesian network classifiers. Proceedings of the 22nd International Conference on Machine Learning 2005, 657–664.
31. Feelders A, Ivanovs J: Discriminative Scoring of Bayesian Network Classifiers: a Comparative Study. Proceedings of the third European workshop on probabilistic graphical models 2006, 75–82.
32. Grünwald P, Kontkanen P, Myllymäki P, Roos T, Tirri H, Wettig H: Supervised posterior distributions. Presented at the Seventh Valencia International Meeting on Bayesian Statistics 2002.
33. Cerquides J, de Mántaras RL: Robust Bayesian Linear Classifier Ensembles. ECML 2005, 72–83.
34. Goodman J: Exponential Priors for Maximum Entropy Models. Proceedings of HLT-NAACL 2004 2003.
35. Buntine WL: Theory Refinement of Bayesian Networks. In Uncertainty in Artificial Intelligence. Morgan Kaufmann; 1991:52–62.
36. Wallach H: Efficient Training of Conditional Random Fields. In Master's thesis. University of Edinburgh; 2002.
37. Jordan MI: Graphical Models. Statistical Science (Special Issue on Bayesian Statistics) 2004, 19: 140–155.
38. Castelo R: The discrete acyclic digraph Markov model in data mining. PhD thesis. Faculteit Wiskunde en Informatica, Universiteit Utrecht; 2002.
39. Heckerman D, Geiger D, Chickering DM: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 1995, 197–243.
40. Berger AL, Pietra SD, Pietra VJD: A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics 1996, 22: 39–71.
41. Meila-Predoviciu M: Learning with Mixtures of Trees. PhD thesis. Massachusetts Institute of Technology; 1999.
42. Castelo R, Guigo R: Splice site identification by idlBNs. Bioinformatics 2004, 20(suppl_1):i69–76. 10.1093/bioinformatics/bth932
43. Schulte O, Frigo G, Greiner R, Luo W, Khosravi H: A new hybrid method for Bayesian network learning With dependency constraints. Bioinformatics 2009, 53–60.
44. Bishop CM: Pattern Recognition and Machine Learning. Information Science and Statistics. 1st edition. New York: Springer; 2006.
45. Arita M, Tsuda K, Asai K: Modeling splicing sites with pairwise correlations. Bioinformatics 2002, 18(suppl_2):S27–34.
46. Chen TM, Lu CC, Li WH: Prediction of splice sites with dependency graphs and their expanded bayesian networks. Bioinformatics 2005, 21(4):471–482. 10.1093/bioinformatics/bti025
47. Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. In ICML '06: Proceedings of the 23rd international conference on Machine learning. New York, NY, USA: ACM; 2006:233–240.
48. Fawcett T: ROC Graphs: Notes and Practical Considerations for Researchers. In Tech. rep. HP Laboratories; 2004.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.