The Hierarchical Naïve Bayes classifier (HNB) is an extension of the well-known Naïve Bayes classifier (NB). NB assumes that, given a class variable *C* that we aim to predict (say disease yes/disease no) on the basis of a set of *n*_{f} features *X* = {*x*_{1},..., *x*_{n_f}}, the posterior probability of the class given the data *P*(*C*|*X*) is proportional to the product of the prior probability of the class and the conditional probability P\left(X|C\right)=\prod_{f=1}^{n_f}P\left(X_f|C\right), i.e. that the features are independent of each other given the class. NB is a simple and robust classifier which, owing to its strong bias, may conveniently be used also in the context of a large number of features.

HNB assumes that the measurements are stochastic variables with a hierarchical structure in terms of their probability distributions. We suppose that we can collect a number *n*_{rep} of observations, or replicates, on each example, and that an example belongs to one of a set of given classes. Let us suppose that *X* is a stochastic variable representing the replicates, whose probability distribution depends on a vector of parameters *θ*, which corresponds to the single example and may represent, for example, the mean and variance of the probability distribution of the replicates. If we consider the *i*-th example, with *i* in 1,..., *N*, the probability distribution of the vector of replicates is given by p\left(X_i|\theta_i\right), with X_i=\left\{x_{i1},\dots,x_{ij},\dots,x_{in_{rep}}\right\}, while the probability distribution of the individual parameters is p\left(\theta_i|\xi_{C_k}\right), where \xi_{C_k} is a set of population hyper-parameters that depends on the class *C*_{k} in the set *C* = {*C*_{1},..., *C*_{h}} to which the example belongs, and is thus the same for all the examples of the same class. Figure 2 shows the representation of the problem through a graphical model with plates [21].

In a Bayesian framework, the classification step is therefore performed by finding the class with the highest posterior probability. Thanks to the conditional independence assumptions of the hierarchical model described above, we can write P\left(C_k|X\right)\propto P\left(X|\xi_{C_k}\right)P\left(\xi_{C_k}|C_k\right)P\left(C_k\right). Since the population parameters \xi_{C_k} are determined by the knowledge of the class *C*_{k} with probability one, the equation can be simplified as P\left(C_k|X\right)\propto P\left(X|\xi_{C_k}\right)P\left(C_k\right). The posterior is thus dependent on the so-called marginal likelihood, P\left(X|\xi_{C_k}\right), which can be calculated by integrating out the vector of parameters *θ*.

Many replicates are available for each example. The examples are characterized by an individual vector of parameters *θ*; moreover, the examples belonging to the same class have a common set of parameters *ξ*.

P\left(X|C_k,\xi\right)=\int_{\Omega_\theta}P\left(X|C_k,\theta\right)P\left(\theta|C_k,\xi\right)d\theta

(1)

where **Ω**_{θ} is the support of *θ*.

The learning problem therefore consists in estimating the population parameters \xi_{C_k} for each class, while the classification problem mainly amounts to the calculation of the marginal likelihood. To deal with multivariate problems, we resort to the NB assumption that each attribute is conditionally independent of the others given the class:

P\left(X|C_k\right)=\prod_{f=1}^{n_f}P\left(X_f|C_k\right)

(2)

From the computational viewpoint, this allows us to compute the marginal likelihood separately for each variable to perform classification, and to learn a collection of independent univariate models. In the following we show how HNB deals with the classification and learning problems when the variables are discrete with a multinomial distribution.

### Hierarchical Naïve Bayes for discrete variables

In a SNP-based case-control GWAS, the individual-level information is represented by genotype configurations (aa/aA/AA). For the sake of readability we omit the dependence of the vectors on the class *k*. We assume that the vector of the occurrences (counts) of the *i*-th example is X_i=\left\{x_{i1},\dots,x_{ij},\dots,x_{iS}\right\}, where x_{ij} is the number of occurrences of the *j*-th discrete value, or state, of the *i*-th example and *S* is the number of states of the variable *x*. The number of replicates of each example is given by n_{rep_i}=\sum_{j=1}^{S}x_{ij}.

We also assume that the relationship between the data *X*_{i} and the example parameters *θ*_{i} is expressed by a multinomial distribution:

X_i\sim Multin\left(n_{rep_i},\theta_{i1},\dots,\theta_{ij},\dots,\theta_{iS}\right)

(3)

Therefore *θ*_{i} is an *S*-dimensional vector, where *θ*_{ij} represents the probability of occurrence of the *j*-th event in example *i*. The parameters *θ*_{i}, for *i* = 1, 2,..., N_{C_k}, are characterized by the same prior Dirichlet distribution:

\theta_i\sim Dirichlet\left(\alpha\xi_1,\alpha\xi_2,\dots,\alpha\xi_S\right)

(4)

with probability density:

P\left(\theta_i|\alpha,\xi\right)=\frac{\Gamma\left(\alpha\right)}{\prod_{j=1}^{S}\Gamma\left(\alpha\xi_j\right)}\prod_{j=1}^{S}\theta_{ij}^{\alpha\xi_j-1}

(5)

where 0 < *α* < ∞, 0 < *ξ*_{j} < 1 ∀*j* = 1,..., *S* and \sum_{j=1}^{S}\xi_j=1. Following the hierarchical model reported in the previous section, the individual example parameters *θ*_{i} are independent of each other given *ξ* = {*ξ*_{1},...,*ξ*_{S}} and *α*. In the following we assume that the parameter *α* is fixed, and it is thus treated as a design parameter of the algorithm. *α* represents the prior assumption on the degree of similarity of all examples belonging to the same class. A proper setting of *α* allows finding a compromise between a pooling strategy, where all replicates are assumed to belong to the same example, and a full hierarchical strategy, where all examples are assumed to be different.
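The role of *α* can be checked numerically. The sketch below (hypothetical *ξ*, standard-library sampling only, not the authors' code) draws *θ*_{i} from Dirichlet(*αξ*) via normalized Gamma variates and shows that a large *α* concentrates the examples around the population parameters (pooling), while a small *α* lets them differ (full hierarchy):

```python
import random

def sample_dirichlet(alphas, rng):
    """Sample from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_abs_dev(xi, alpha, n_samples=2000, seed=0):
    """Average L1 distance between sampled theta_i and the population xi."""
    rng = random.Random(seed)
    dev = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet([alpha * x for x in xi], rng)
        dev += sum(abs(t - x) for t, x in zip(theta, xi))
    return dev / n_samples

xi = [0.5, 0.3, 0.2]                  # hypothetical population parameters
print(mean_abs_dev(xi, alpha=2.0))    # small alpha: heterogeneous examples
print(mean_abs_dev(xi, alpha=200.0))  # large alpha: theta_i close to xi
```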

### Classification

As described in the previous section, the classification problem requires the computation of the marginal likelihood (1). We assume that an estimate of the population parameters *ξ* is available and that *α*, *β* and *γ* are known. Given an example with counts distributed over the different states *X* = {*x*_{1},..., *x*_{S}}, where n_{rep}=\sum_{j=1}^{S}x_j, we must compute:

P\left(X|C_k,\xi\right)=\int_{\Omega_\theta}P\left(X|\theta\right)P\left(\theta|\xi_{C_k}\right)d\theta

(6)

where *θ* = {*θ*_{1},..., *θ*_{S}} is the vector of the individual example parameters, with \sum_{j=1}^{S}\theta_j=1, and Ω_{θ} is the support of *θ*. This integral can be solved by noting that it contains the product of a multinomial and a Dirichlet distribution.

The marginal likelihood can be thus computed as:

P\left(X|C_k,\xi\right)=\frac{n_{rep}!\,\Gamma\left(\sum_j\alpha\xi_j\right)}{\Gamma\left(\sum_j\left(x_j+\alpha\xi_j\right)\right)}\prod_j\frac{\Gamma\left(x_j+\alpha\xi_j\right)}{x_j!\,\Gamma\left(\alpha\xi_j\right)}

(7)
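Equation (7) is conveniently evaluated in log-space with the log-gamma function. The following sketch is a direct transcription of (7) using the standard library, not the authors' code:

```python
from math import lgamma, log

def log_marginal_likelihood(x, xi, alpha):
    """Log of equation (7): Dirichlet-multinomial marginal likelihood
    of counts x given population parameters xi and precision alpha."""
    n_rep = sum(x)
    a = [alpha * s for s in xi]                       # alpha * xi_j
    res = lgamma(n_rep + 1)                           # log(n_rep!)
    res += lgamma(sum(a)) - lgamma(n_rep + sum(a))    # ratio of Gammas
    for xj, aj in zip(x, a):
        res += lgamma(xj + aj) - lgamma(xj + 1) - lgamma(aj)
    return res

# Sanity check: with S = 2 and alpha*xi = (1, 1) the Dirichlet is uniform,
# so every split of n_rep counts is equally likely: P = 1 / (n_rep + 1).
print(log_marginal_likelihood([3, 1], [0.5, 0.5], alpha=2.0))  # = log(1/5)
```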

The NB approach allows us to exploit this equation for each variable in the problem at hand, and then to apply equation (2) to perform the classification. Computing the marginal likelihood, however, requires estimating the population parameters *ξ* from the data.

### Learning with collapsing

The task of learning the population parameters can be performed by resorting to approximate techniques. Herein we describe a strategy previously presented in [24] and [25].

We suppose that a data set *X* = {*X*_{1},..., *X*_{N}} is available for each class, where *X*_{i} = {*x*_{i1},..., *x*_{iS}} and *N* is the number of examples within each class (the number of examples can differ between the classes). Such a vector is transformed into a new vector *X*\* whose *i*-th element is X_i^*=\left\{\tau_i x_{i1},\dots,\tau_i x_{ij},\dots,\tau_i x_{iS}\right\} with:

\tau_i=\frac{1+\alpha}{n_{rep_i}+\alpha}

(8)

*τ*_{i} is a suitable weight that allows taking into account the prior assumptions on the heterogeneity of the examples belonging to the class. The hierarchical model is then collapsed into a new model, in which the vector of measurements X_i^* is assumed to have a multinomial distribution with parameters *ξ* and \tau_i n_{rep_i}.
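A minimal sketch of the collapsing step, with hypothetical counts, shows the two limiting behaviours of *α* discussed above:

```python
def collapse(counts, alpha):
    """Collapse one example's replicate counts using the weight of eq. (8)."""
    n_rep = sum(counts)
    tau = (1.0 + alpha) / (n_rep + alpha)
    return [tau * c for c in counts]

counts = [6, 3, 1]  # hypothetical genotype counts for one example (n_rep = 10)
# alpha -> 0: full hierarchy, counts shrink toward frequencies (tau ~ 1/n_rep)
print(collapse(counts, alpha=0.001))
# alpha -> infinity: pooling, the raw counts are kept (tau -> 1)
print(collapse(counts, alpha=1e6))
```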

Such an assumption can be justified by computing the first and second moments of P\left(X^*|\xi\right), which are obtained by approximating the distribution of the parameters *θ* given *ξ* with its average value [25].

The Maximum Likelihood (ML) estimate of the parameters *ξ* can thus be obtained for each state of the discrete variable as:

\overline{\xi_j}=\frac{\sum_{i=1}^{N}\tau_i x_{ij}}{\sum_{i=1}^{N}\tau_i n_{rep_i}}

(9)

Within this framework we can also provide a Bayesian estimate of the population parameters *ξ*. We assume that *ξ* is a stochastic vector with a Dirichlet prior distribution: \xi\sim Dirichlet\left(\beta\gamma_1,\dots,\beta\gamma_S\right), where 0 < *β* < ∞, 0 < *γ*_{j} < 1 ∀*j* = 1,..., *S* and \sum_{j=1}^{S}\gamma_j=1.

After collapsing, we may derive that the posterior distribution of *ξ* is still a Dirichlet, with the expected value of the probability of the *j*-th state of the discrete variable given by:

\overline{\xi_j}=\frac{\sum_{i=1}^{N}\tau_i x_{ij}+\beta\gamma_j}{\sum_{i=1}^{N}\tau_i n_{rep_i}+\beta}

(10)

In this setting, the parameters *γ* and *β* assume the same meaning as the parameters usually specified in the Bayesian learning strategies applied in many Machine Learning algorithms. In particular, if we assume *γ*_{j} = 1/*S* and *β* = 1 we obtain an estimate close to the Laplace estimate, while different choices of *γ* and *β* lead to estimates similar to the m-estimate, where *β* plays the role of *m*.
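Equations (9) and (10) reduce to a weighted ratio of counts. A sketch with hypothetical collapsed data (setting *β* = 0 recovers the ML estimate of equation (9)):

```python
def estimate_xi(X_star, n_reps, taus, beta=0.0, gammas=None):
    """ML (beta = 0, eq. 9) or Bayesian (eq. 10) estimate of xi from
    collapsed counts, where X_star[i][j] = tau_i * x_ij."""
    S = len(X_star[0])
    if gammas is None:
        gammas = [1.0 / S] * S  # uniform prior weights by default
    denom = sum(t * n for t, n in zip(taus, n_reps)) + beta
    return [(sum(row[j] for row in X_star) + beta * gammas[j]) / denom
            for j in range(S)]

# Hypothetical collapsed data: two examples, two states.
taus, n_reps = [0.5, 0.25], [3, 7]
X_star = [[1.0, 0.5], [1.0, 0.75]]                   # tau_i * x_ij
print(estimate_xi(X_star, n_reps, taus))             # ML estimate, sums to 1
print(estimate_xi(X_star, n_reps, taus, beta=1.0))   # smoothed Bayesian estimate
```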

### Building the model

The HNB machinery can be conveniently exploited to build a multivariate model for SNPs coming from a GWAS. In the presence of regions in which non-random association of alleles at two or more loci, or Linkage Disequilibrium (LD), is observed [26], a new variable *X* is generated, and all the SNPs belonging to the same block are considered as replicates of the same variable (see Figure 1). On the contrary, if the SNPs are not in LD, they are treated as independent variables in equation (2). For this reason, the model needs a convenient pre-processing step, in which blocks of SNPs characterized by LD are identified and the corresponding variables extracted.
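As a sketch of this pre-processing (the state labels and the block assignment are illustrative, not the authors' pipeline), the genotypes of the SNPs in one LD block can be turned into the count vector X_i used by the multinomial model:

```python
def block_counts(genotypes, states=("aa", "aA", "AA")):
    """Turn the genotypes of the SNPs in one LD block into the count
    vector X_i of the multinomial model (n_rep = number of SNPs)."""
    index = {s: j for j, s in enumerate(states)}
    counts = [0] * len(states)
    for g in genotypes:
        counts[index[g]] += 1
    return counts

# One individual, one hypothetical LD block of 4 SNPs:
print(block_counts(["aa", "aA", "aa", "AA"]))  # [2, 1, 1]
```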

Figure 3 reports a graphical representation of how SNP data can be mapped using the plates notation. According to this representation, each individual is characterized by a vector (the individual parameter *θ*) reporting the genotypes corresponding to the set of SNPs mapping to the same LD region. The sets of individual parameters are then employed to estimate a latent variable *ξ*: each latent variable summarizes the individual-level information deriving from a different LD region. Thus, the complete set of latent variables (along with potentially informative covariates) is used in turn to estimate the probability of being affected or healthy via Bayes' theorem.