We present two algorithms, Hierarchical Top-Down (HTD-DAG) and True Path Rule (TPR-DAG) for Directed Acyclic Graphs (DAG), specifically designed to exploit the DAG structure of the relationships between HPO terms to predict associations between genes and sets of HPO terms. Both hierarchical ensemble methods adopt a two-step learning strategy:

1. Flat learning of the terms of the ontology: each base classifier learns a specific and individual class (HPO term), resulting in a set of dichotomic classification problems that are independent of each other.

2. Hierarchical combination of the predictions: aggregation of the classifications performed by the base learners trained in the previous step. The resulting consensus predictions of the ensemble take the hierarchical relationships of the ontology into account.
This ensemble approach is highly modular: in principle any learning algorithm can be used to train the classifiers in the first step, and in the second step the hierarchical relationships between the HPO terms are exploited to achieve the final ensemble predictions of the set of HPO terms associated with a specific gene.
The main limitation of the flat learning of the HPO terms is that each term is learned separately, without taking into account the relationships between classes. Other methods, such as the ensemble approach proposed in [27], can overcome this limitation by applying global models for multi-label classification, thus explicitly considering class interactions during the learning process. On the other hand, hierarchical ensemble methods are able to correct flat predictions by splitting into two separate steps the process of learning the HPO terms and the process of evaluating the hierarchical relationships between classes.
Basic notation and definitions
The directed acyclic graph G=<V,E> represents the HPO taxonomy, its vertices V={1,2,…,|V|} the terms of the ontology, and the directed edges (i,j)∈E the hierarchical relationships between the parent terms i and their children j. We define child(i) as the set of children of a term i, par(i) as the set of its parents, anc(i) as the set of its ancestors and desc(i) as the set of its descendants.
A “flat multi-label scoring” predictor \(f: X \rightarrow \mathbb {Y}\) provides a score \(f(x) = \hat {\boldsymbol {y}}, \hat {\boldsymbol {y}} \in \mathbb {Y}=[0,1]^{|V|}\) for a given example x∈X, where X is a suitable input space for the predictor f. In other words, a flat predictor provides a score \(\hat {y}_{i} \in [0,1]\) that represents the likelihood that a given gene belongs to a given node/HPO term i∈V of the DAG G, with \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\). We say that a multi-label scoring y is consistent if it obeys the true path rule:
$$ \boldsymbol{y} \; \text{is consistent} \iff \forall i \in V,\; j \in par(i) \Rightarrow y_{j} \geq y_{i} $$
(1)
It is straightforward to see that (1) also applies to flat classifiers that provide not a score but a label \(\hat {y}_{i} \in \{0,1\}\), indicating that a given gene belongs (\(\hat {y}_{i} = 1\)) or does not belong (\(\hat {y}_{i} = 0\)) to a given HPO term i.
It is very unlikely that the true path rule is satisfied by a flat multi-label scoring predictor, but with an additional topology-aware score/label modification step we can easily satisfy the constraints imposed by the true path rule. More precisely, we can provide a prediction function \(g\left (f(x)\right): \mathbb {Y} \rightarrow \mathbb {Y}\) such that the true path rule (1) holds for all the predictions \(g\left (f(x)\right) = \bar {\boldsymbol {y}}\): \(\forall i \in V, j \in par(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
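As a minimal illustration of the consistency condition (1), the check can be sketched in Python (the function name and the toy DAG below are ours, not part of the paper's implementation):

```python
def is_consistent(scores, parents):
    """True path rule check (Eq. 1): no node may score
    higher than any of its parents."""
    return all(scores[j] >= scores[i]
               for i in scores
               for j in parents.get(i, ()))

# Toy DAG: node -> set of parents (the root "r" has none).
parents = {"r": set(), "a": {"r"}, "b": {"r"}, "c": {"a", "b"}}
consistent = is_consistent({"r": 0.9, "a": 0.7, "b": 0.8, "c": 0.6}, parents)
violated = is_consistent({"r": 0.9, "a": 0.5, "b": 0.8, "c": 0.6}, parents)
```

Here `violated` is False because node c scores higher than its parent a; a correction function g must cap such scores.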
Flat learning of the terms of the ontology
The algorithm first utilizes a flat ensemble learning strategy by which each term i∈V of the HPO is independently learned through a term-specific predictor \(f_{i}: X \rightarrow [0,1]\). Accordingly, the output of the flat classifier \(f:X \rightarrow \mathbb {Y}\) on the instance x∈X is \(f(x) = \hat {\boldsymbol {y}}\):
$${}f(x) = <f_{1}(x),\ f_{2}(x),\ \ldots,\ f_{|V|}(x)> = <\hat{y}_1,\ \hat{y}_2,\ \ldots,\ \hat{y}_{|V|}>$$
To this end any supervised or semi-supervised base predictor can be used, including flat binary classifiers. Indeed, both learners that provide a probability or a score related to the likelihood that a gene is annotated with an HPO term (that is, scores \(\hat {y}_{i} \in [0,1]\)), and binary classifiers that directly provide a label (but not a score) for the gene-HPO term association (that is, a label \(\hat {y}_{i} \in \{0,1\}\)) can be used to generate the flat predictions. Note that the training of the per-class predictors \(f_{1}, f_{2}, \ldots, f_{|V|}\) can be performed in parallel, and it is easy to achieve a linear speedup in the number of available processors by adopting simple parallel computational techniques.
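Since the per-class problems are independent, the first step parallelizes trivially; a schematic Python sketch (with a placeholder `train_one_term` standing in for any base learner, not the authors' R implementation) could look like:

```python
from concurrent.futures import ThreadPoolExecutor

def train_one_term(term, X, y):
    """Placeholder base learner for one HPO term: here it simply
    scores every gene with the positive-class prior."""
    prior = sum(y) / len(y)
    return term, (lambda x, p=prior: p)

def train_flat_ensemble(terms, X, labels, workers=4):
    """Fit one independent dichotomic predictor per term in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fitted = pool.map(lambda t: train_one_term(t, X, labels[t]), terms)
        return dict(fitted)

models = train_flat_ensemble(["t1", "t2"], X=[[0], [1]],
                             labels={"t1": [1, 0], "t2": [1, 1]})
```

Any real base learner (RANKS, SVMs, etc.) can be dropped in for `train_one_term` without changing the surrounding structure.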
For the HPO predictions we used semi-supervised (RANKS, [28]) and supervised (Support Vector Machines, SVM [29]) machine learning methods to implement the base learners of the proposed hierarchical ensemble methods.
The semi-supervised approach RAnking of Nodes with Kernelized Score functions (RANKS) is a network-based method that adopts both a local and a global prediction strategy. Through local learning, RANKS measures the similarity between a gene and its neighbors using different score functions. Global learning is accomplished through graph kernels that exploit the overall topology of the network to predict node labels. In principle any valid kernel function can be applied here. In this work we apply the average score function and 1-, 2- and 3-step random walk kernels (respectively able to explore direct neighbors and genes two or three steps away in the network). RANKS was previously successfully applied to the prioritization of disease genes [30], the prediction of gene function [31] and drug repositioning problems [32].
It is worth noting that RANKS returns a score, not a probability, representing the likelihood that a gene belongs to a given class; moreover, the “magnitude” of the scores may vary across different classes [31]. To make the scores comparable across classes, we considered two distinct normalization procedures:

1. Normalization in the sense of the maximum: the scores of each class are normalized by dividing them by the maximum score of that class;

2. Quantile normalization: a method originally designed for the normalization of probe intensity levels of high-density oligonucleotide microarray data across multiple experiments [34]. In our case we applied quantile normalization to make the scores comparable across different HPO terms.
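The two procedures can be sketched as follows (a plain-Python sketch with naive tie handling; the original quantile normalization of [34] is defined on microarray intensity matrices):

```python
def max_normalize(scores):
    """Divide each term's scores by the maximum score of that term."""
    return {t: [v / max(vals) for v in vals] for t, vals in scores.items()}

def quantile_normalize(scores):
    """Replace the k-th ranked score of every term with the mean of
    the k-th ranked scores across all terms (ties handled naively)."""
    terms = list(scores)
    n = len(scores[terms[0]])
    sorted_cols = [sorted(scores[t]) for t in terms]
    means = [sum(col[k] for col in sorted_cols) / len(terms) for k in range(n)]
    out = {}
    for t in terms:
        order = sorted(range(n), key=lambda k: scores[t][k])
        ranks = [0] * n
        for r, k in enumerate(order):
            ranks[k] = r
        out[t] = [means[ranks[k]] for k in range(n)]
    return out

# Two HPO terms scored over the same three genes.
scores = {"t1": [5.0, 3.0, 1.0], "t2": [2.0, 4.0, 6.0]}
```

After quantile normalization both terms share the same score distribution, while each term keeps its own gene ranking.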
SVMs were trained for each term using the R interface of the machine learning library LiblineaR [35] with default parameter settings. Because of the high running time of SVMs we implemented a multi-core version of LiblineaR using the doParallel and foreach R packages. The parallel implementation of SVMs is available upon request from the authors.
Hierarchical Top-Down (HTD-DAG) ensembles
The main idea behind the Hierarchical Top-Down algorithm (HTD-DAG) consists in modifying the predictions of each base learner from “top to bottom”, i.e. from the least to the most specific terms, by exploiting at each step the predictions provided by the less specific predictors, i.e. the predictors associated with the parent HPO terms. This is performed recursively by transmitting the predictions from each node to its children, and from the children to the children of the children, through a propagation of the information towards the descendants of each node of the ontology. For instance, in Fig. 1a the information can flow along the path traversing nodes 1,5,6,7 or 1,3,7, and a prediction for e.g. node 5 depends on the predictions performed by the base learners for the parent nodes 4, 1 and 3. This operating mode of the ensemble is performed in an ordered way from the top to the bottom nodes (Fig. 1a).
More precisely, the HTD-DAG algorithm modifies the flat scores according to the hierarchy of the DAG in a single run across the nodes of the graph. The flat predictions \(f(x) = \hat {\boldsymbol {y}}\) are hierarchically corrected to \(\bar {\boldsymbol {y}}\) by traversing the nodes of the DAG top-down per level, and by recursively applying this rule:
$$ \bar{y}_{i} := \left\{ \begin{array}{lll} \hat{y}_{i} & \text{if} \quad i \in root(G) \\ \min_{j \in par(i)} \bar{y}_{j} & \text{if} \quad \min_{j \in par(i)} \bar{y}_{j} < \hat{y}_{i} \\ \hat{y}_{i} & \text{otherwise} \end{array} \right. $$
(2)
The maximum path length from the root is used to define the node levels. More precisely, given that p(r,i) represents a path from the root node r to a node i∈V, l(p(r,i)) the length of p(r,i), \(\mathcal L = \{0, 1, \ldots, \xi \}\) the set of observed levels, and ξ the maximum node level, we can define a level function \(\psi : V \longrightarrow \mathcal L\):
$$ \psi(i) = \max_{p(r,i)}\ l \left(p(r,i)\right) $$
(3)
The above level function ψ assigns each node i∈V to its level ψ(i). For instance, from this definition root nodes are {i | ψ(i)=0}, while nodes lying at the maximum distance ξ from the root are {i | ψ(i)=ξ}.
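A sketch of how ψ can be computed with a topological (Kahn-style) traversal, under the assumption that the DAG is given as a node-to-children map (identifiers are illustrative):

```python
from collections import deque

def node_levels(children, roots):
    """Level function psi of Eq. 3: the maximum path length from
    the root(s), computed along a topological order of the DAG."""
    indeg = {v: 0 for v in children}
    for v in children:
        for c in children[v]:
            indeg[c] += 1
    level = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        v = queue.popleft()
        for c in children[v]:
            # a node's level is one more than its deepest parent's
            level[c] = max(level.get(c, 0), level[v] + 1)
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return level

# Diamond with a tail: c sits at maximum distance 2, d at 3.
children = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
```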
The consistency of the predictions is guaranteed if and only if the levels are defined according to the maximum path length from the root (Correctness and consistency of the predictions section provides a formal proof of this fact).
Figure 2 shows the pseudocode of the second step of the HTD-DAG algorithm, by which the flat predictions \(\hat {\boldsymbol {y}}\) computed in the first step are combined and updated according to a top-down per-level traversal of the DAG.
Block A of the algorithm (row 1) computes the maximum distance of each node from the root; to this end the classical Bellman-Ford algorithm or methods based on the Topological Sorting algorithm can be applied [36].
A top-down per-level traversal of the graph is implemented in block B: for each level of the graph, starting from the nodes just below the root, the flat predictions \(\hat {y}_{i}\) are top-down corrected according to Eq. 2, and the consensus ensemble predictions \(\bar {y}_{i}\) are obtained. More precisely, the nested loops starting at lines 04 and 06 ensure that nodes are processed level by level in increasing order. Lines 07–11 perform the hierarchical correction of the flat predictions \(\hat {y}_{i}\), i∈{1,…,|V|}. The algorithm ends when the nodes at distance ξ from the root are processed (last iteration of the external loop within lines 04–13), and it finally provides the hierarchically corrected predictions \(\bar {\boldsymbol {y}}\).
The complexity of block A is \(\mathcal {O}(V + E)\) (if the Topological Sort algorithm is used to implement ComputeMaxDist), while it is easy to see that the complexity of block B (rows 3-13) is \(\mathcal {O}(V+E)\). Hence the overall complexity of the top-down step of HTD-DAG is \(\mathcal {O}(V+E)\), which is linear in the number of nodes of the HPO, considering that its underlying DAG is sparse.
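A compact sketch of the top-down step of Eq. 2 (the function name and toy data are ours; the paper's pseudocode in Fig. 2 is the reference):

```python
def htd_correct(flat, parents, levels):
    """HTD-DAG top-down step (Eq. 2): visit nodes by increasing
    maximum distance from the root and cap each score at the
    minimum corrected score of its parents."""
    corrected = dict(flat)
    for i in sorted(parents, key=levels.get):
        if parents[i]:  # skip root nodes
            corrected[i] = min(corrected[i],
                               min(corrected[j] for j in parents[i]))
    return corrected

parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
levels = {"r": 0, "a": 1, "b": 1, "c": 2}
flat = {"r": 0.6, "a": 0.9, "b": 0.7, "c": 0.8}
bar = htd_correct(flat, parents, levels)
```

Here every flat score exceeding the root's 0.6 is capped, so the corrected scores obey the true path rule.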
Hierarchical true path rule (TPR-DAG) ensembles
By considering the opposite flow of information “from bottom to top”, we can construct the prediction of the ensemble by recursively propagating the predictions provided by the most specific nodes towards their parents and ancestors.
For instance, in Fig. 1b a possible flow of information could be along the path 8,5,4,2,1 or 7,6,5,1, and the prediction of the ensemble for e.g. node 3 depends on the children nodes 5, 6 and 7. The proposed True Path Rule for DAGs (TPR-DAG) adopts this bottom-up flow of information to take into account the predictions of the most specific HPO terms, but also the opposite flow from top to bottom to consider the predictions of the least specific terms. Figure 3 provides a pictorial toy example of the operating mode of the TPR-DAG algorithm.
This algorithm is related to the TPR algorithm for tree-structured taxonomies [37], but despite the similarity of their names, the TPR for trees cannot be applied to the HPO, since it does not work on DAG-structured taxonomies and provides inconsistent predictions when applied to the HPO. In contrast to the per-level tree traversal proposed in [37], the DAG per-level traversal has two distinct and strictly separated steps: (1) the DAG is inspected bottom-up per level, followed by (2) a top-down visit. This separation is necessary to assure the true path rule consistency of the predictions in DAG-structured taxonomies (see “Correctness and consistency of the predictions” section for details). The other main difference consists in the way the levels are computed: in this new DAG version the levels are constructed according to the maximum distance from the root, since this guarantees that in the top-down step all the ancestor nodes have been processed before their descendants, thus assuring the true path rule consistency of the predictions (see “Correctness and consistency of the predictions” section for a formal proof of this fact). Moreover, in this paper we also propose novel algorithms for the bottom-up propagation of the predictions from the most to the least specific terms of the HPO.
TPR-DAG provides “consensus” ensemble predictions by integrating the flat predictions \(\hat {y}_{i}\) through a per-level visit of the DAG:
$$ \bar{y}_{i} := \frac{1}{1 + |\phi_{i}|} \left(\hat{y}_{i} + \sum_{j \in \phi_{i}} \bar{y}_{j}\right) $$
(4)
where \(\phi_{i}\) is the set of “positive” children of i.
Note that only positive predictions of the children obey the true path rule. Indeed, according to this rule, we may have a gene annotated to a term t, but not annotated to a term s∈child(t). Hence if we have a negative prediction for the term s, it is not meaningful to use this prediction to predict the term t. It is worth noting that we can combine children predictions using aggregation strategies other than the average. For instance, using the maximum we could likely improve the sensitivity, but with a likely decrease in precision. Different strategies to select the “positive” children \(\phi_{i}\) can be applied, according to the usage of a specific threshold to separate positive from negative examples:

1. Constant Threshold (T) strategy. For each node the same threshold \(\bar {t}\) is a priori selected: \(t_{j} = \bar {t}, \quad \forall j \in V\). In this case ∀i∈V we have:
$$ \phi_{i} := \{ j \in child(i) \mid \bar{y}_{j} > \bar{t} \} $$
(5)
For instance, if the predictions represent probabilities, it could be meaningful to select \(\bar {t} = 0.5\) a priori.

2. Adaptive Threshold (AT) strategy. The threshold is selected to maximize some performance metric \(\mathcal {M}\) estimated on the training data, e.g. the F-score or the AUROC. In other words, the threshold is selected to maximize some measure of accuracy \(\mathcal {M}(j,t)\) of the predictions on the training data for the class j with respect to the threshold t. The corresponding set of positives ∀i∈V is:
$$ \phi_{i} := \left\{ j \in child(i) \mid \bar{y}_{j} > t_{j}^{*},\ t_{j}^{*} = \arg \max_{t} \mathcal{M}(j,t) \right\} $$
(6)
For instance, internal cross-validation can be used to select \(t^{*}_{j}\) from a set of candidate thresholds t∈(0,1).

3. Threshold Free (TF) strategy. This strategy does not require an a priori or experimentally selected threshold. We select as positive those children that increase the score of their parent node i:
$$ \phi_{i} := \{ j \in child(i) \mid \bar{y}_{j} > \hat{y}_{i} \} $$
(7)
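The three selection strategies (Eqs. 5-7) can be sketched as follows (illustrative Python, not the authors' implementation):

```python
def positives_T(i, child, y_bar, t=0.5):
    """Constant threshold (Eq. 5): one fixed threshold for all nodes."""
    return {j for j in child[i] if y_bar[j] > t}

def positives_AT(i, child, y_bar, t_star):
    """Adaptive thresholds (Eq. 6): per-term thresholds t_star[j],
    e.g. tuned on training data by internal cross-validation."""
    return {j for j in child[i] if y_bar[j] > t_star[j]}

def positives_TF(i, child, y_bar, y_hat):
    """Threshold-free (Eq. 7): children scoring above the
    parent's flat score."""
    return {j for j in child[i] if y_bar[j] > y_hat[i]}

# Toy parent "p" with three children and their corrected scores.
child = {"p": ["c1", "c2", "c3"]}
y_bar = {"c1": 0.9, "c2": 0.4, "c3": 0.6}
```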
Accordingly, we can derive three different algorithmic variants of the basic TPR:

a) TPR-T: TPR with constant threshold, corresponding to strategy 1;

b) TPR-AT: TPR with adaptive thresholds, corresponding to strategy 2;

c) TPR-TF: TPR threshold-free, corresponding to strategy 3.
All three TPR algorithms move positive predictions recursively towards the ancestors, so that the predictions are propagated bottom-up. The pseudocode of the TPR-DAG algorithm is shown in Fig. 4 and is structured in three parts. First, in block A, the maximum distance of every node \(v_{i}\) from the root is computed via the Bellman-Ford algorithm. Second, block B updates the predictions \(\hat {y}_{i}\) using Eq. 4 together with one of the three positive selection strategies through a bottom-up visit. After this step the true path rule is not necessarily fulfilled, since the bottom-up step assures the propagation of positive predictions, but not their consistency. To this end, in the final block C, the hierarchical top-down step is performed, as in the HTD-DAG algorithm.
The complexity of the TPR-DAG algorithm is quadratic in the number of nodes for block A if the Bellman-Ford algorithm is used, but becomes \(\mathcal {O}(V+E)\) if the Topological Sort algorithm is used instead. It is easy to see that the complexity of both blocks B and C is \(\mathcal {O}(V+E)\), i.e. \(\mathcal {O}(V)\) when the graph is sparse. Hence, considering the sparseness of the HPO, the algorithm is linear with respect to the number of terms of the HPO.
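Putting the two visits together, a minimal sketch of TPR-DAG with the threshold-free strategy (our simplified rendering of the pseudocode in Fig. 4; toy data and identifiers are illustrative):

```python
def tpr_dag(flat, parents, children, levels):
    """TPR-DAG sketch: bottom-up per-level averaging over the
    'positive' children (Eqs. 4 and 7), then the HTD-DAG
    top-down step (Eq. 2) to enforce consistency."""
    y = dict(flat)
    # block B: bottom-up visit, deepest levels first
    for i in sorted(levels, key=levels.get, reverse=True):
        pos = [j for j in children[i] if y[j] > flat[i]]
        y[i] = (flat[i] + sum(y[j] for j in pos)) / (1 + len(pos))
    # block C: top-down consistency step
    for i in sorted(levels, key=levels.get):
        if parents[i]:
            y[i] = min(y[i], min(y[j] for j in parents[i]))
    return y

children = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
levels = {"r": 0, "a": 1, "b": 1, "c": 2}
flat = {"r": 0.5, "a": 0.3, "b": 0.6, "c": 0.8}
y = tpr_dag(flat, parents, children, levels)
```

The highly scored leaf c pulls its parents' scores up in the bottom-up pass, and the final top-down pass restores true path rule consistency.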
To modulate the contribution to the ensemble prediction of the parent node and its children, a TPR-DAG variant similar to the weighted True Path Rule algorithms for tree-structured taxonomies [23] can be designed for DAGs. The TPR-W (TPR-Weighted) variant can be obtained by weighting the predictions according to this rule:
$$ \bar{y}_{i} := w \, \hat{y}_{i} + \frac{1 - w}{|\phi_{i}|} \sum_{j \in \phi_{i}} \bar{y}_{j} $$
(8)
In this approach a weight w∈[0,1] balances the contribution of node i against that of its “positive” children. If w=1 no weight is attributed to the children and TPR-DAG reduces to the HTD-DAG algorithm, since only the prediction for node i is used in the bottom-up step. If w=0 only the predictors associated with the children nodes “vote” to predict node i. In the intermediate cases we attribute more importance to the predictor for node i or to its children, depending on the value of w. A different weighting strategy could also be pursued by not only balancing the predictions of node i and of the nodes j∈child(i), but also weighting with respect to the accuracy of each base learner, estimated e.g. by internal cross-validation.
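The weighted update of Eq. 8 for a single node can be sketched as follows (we keep the flat score when \(\phi_{i}\) is empty, a convention assumed by us and not stated in Eq. 8 itself):

```python
def tpr_w_update(y_hat_i, positive_children_scores, w):
    """TPR-W rule (Eq. 8): blend the node's flat score with the
    average score of its 'positive' children, weighted by w."""
    if not positive_children_scores:
        # assumed convention: with no positive children, keep the flat score
        return y_hat_i
    avg = sum(positive_children_scores) / len(positive_children_scores)
    return w * y_hat_i + (1 - w) * avg
```

With w=1 the children are ignored and the update reduces to the HTD-DAG behavior; with w=0 only the children vote.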
As shown in [37] for the tree version of the TPR algorithm, the contribution of the descendants of a given node decays exponentially with their distance from the node itself, and it is easy to see that this also holds for the TPR-DAG algorithm. To enhance the contribution of the most specific nodes to the overall decision of the ensemble, a linearly decaying or a constant contribution of the “positive” descendants could be considered instead:
$$ \bar{y}_{i} := \frac{1}{1 + |\Delta_{i}|} \left(\hat{y}_{i} + \sum_{j \in \Delta_{i}} \bar{y}_{j}\right) $$
(9)
where
$$ \Delta_{i} = \left\{ j \in desc(i) \mid \bar{y}_{j} > t_{j} \right\} $$
(10)
In this way all the “positive” descendants of node i provide the same contribution to the ensemble prediction \(\bar {y}_{i}\). We named this TPR variant TPR-D.
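A sketch of the TPR-D update for a single node (descendants gathered by a simple DFS; function names and the per-node thresholds t are illustrative):

```python
def descendants(children, i):
    """desc(i): all nodes reachable from i in the DAG."""
    seen, stack = set(), list(children[i])
    while stack:
        j = stack.pop()
        if j not in seen:
            seen.add(j)
            stack.extend(children[j])
    return seen

def tpr_d_update(i, children, y_hat, y_bar, t):
    """TPR-D rule (Eqs. 9-10): every 'positive' descendant of i
    contributes equally to the ensemble prediction."""
    pos = [j for j in descendants(children, i) if y_bar[j] > t[j]]
    return (y_hat[i] + sum(y_bar[j] for j in pos)) / (1 + len(pos))

# Toy chain r -> a -> b: both descendants of r are "positive".
children = {"r": ["a"], "a": ["b"], "b": []}
```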
Correctness and consistency of the predictions
Hierarchical ensemble methods can improve flat predictions by reducing the number of both false positives (FP) and false negatives (FN). For instance, Additional file 1: Figure S1a shows that hierarchical ensembles can correct FN flat predictions to TP for the gene RGS9 (regulator of G-protein signaling 9), which encodes a member of the RGS family whose mutations cause bradyopsia [38], while Additional file 1: Figure S1b shows the ability of hierarchical ensembles to correct FP to TN for the gene ENAM (enamelin), which encodes the largest protein in the enamel matrix, whose deficiency is associated with amelogenesis imperfecta type 1C [39]. More precisely, in Additional file 1: Figure S1a the hierarchical ensemble TPR-W “recovered” four TP (red rectangles), while in Additional file 1: Figure S1b TPR-W corrected six FP to TN. It is worth noting that hierarchical ensemble methods can improve the correctness of the predictions, but of course they cannot guarantee the correctness of all of them: when, e.g., the flat predicted scores are too poor, hierarchical ensembles may fail to recover FP or FN. For instance, in Additional file 1: Figure S1a TPR-W failed to remove three FN (orange rectangles), and in Additional file 1: Figure S1b it was not able to remove three FP.
Hierarchical ensemble methods can guarantee consistent predictions, that is, predictions that obey the true path rule. To this end, the per-level visit of the hierarchical taxonomy should be realized by using the maximum, and not the minimum, distance from the root (see Additional file 2 for an intuitive example showing this fact). An example of the capability of obtaining hierarchically corrected consistent predictions from inconsistent flat predictions is shown in Fig. 5 for the gene C1QC (complement C1q C chain), which encodes one chain of a protein composed of 18 polypeptide chains, whose deficiency is associated with lupus erythematosus and glomerulonephritis [40].
The consistency of the predictions of both HTD-DAG and TPR-DAG is proved through the following theorems (proofs are available in Additional file 3):
Theorem 1
Given a DAG G=<V,E>, a level function ψ that assigns to each node its maximum path length from the root, and the set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), the top-down hierarchical correction of the HTD-DAG algorithm assures that the set of ensemble predictions \(\bar {\boldsymbol {y}} =< \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) satisfies the following property: \(\forall i \in V, \; j \in par(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
An immediate consequence of this theorem is that HTD-DAG assures consistent predictions not only for the parents, but also for all the ancestors of any node of a hierarchical DAG-structured taxonomy:
Corollary 1
Given a DAG G=<V,E>, the level function ψ and the set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), the HTD-DAG algorithm assures that for the set of ensemble predictions \(\bar {\boldsymbol {y}} = < \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) the following property holds: \(\forall i \in V, \; j \in anc(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
Independently of the choice of the positive children (“Hierarchical true path rule (TPR-DAG) ensembles” subsection), the following consistency theorem holds for TPR-DAG:
Theorem 2
Given a DAG G=<V,E> and a set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), one for each class associated with each node i∈{1,…,|V|}, the TPR-DAG algorithm assures that for the set of ensemble predictions \(\bar {\boldsymbol {y}} = < \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) the following property holds: \(\forall i \in V, \; j \in anc(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
Finally, a desirable property of TPR-DAG is that its sensitivity is always equal to or better than that of HTD-DAG:
Theorem 3
The TPR-DAG ensemble algorithm with “positive” children selected according to (7) always achieves a sensitivity equal to or higher than that of the HTD-DAG ensemble algorithm.
Unfortunately, there is no guarantee that the precision of TPR-DAG is always larger than or equal to that of the HTD-DAG algorithm.
All the proofs and details about the above theorems are available in the Additional file 3.