We present two algorithms, Hierarchical Top-Down (HTD-DAG) and True Path Rule (TPR-DAG) for Directed Acyclic Graphs (DAG), specifically designed to exploit the DAG structure of the relationships between HPO terms to predict associations between genes and sets of HPO terms. Both hierarchical ensemble methods adopt a two-step learning strategy:

1. Flat learning of the terms of the ontology: each base classifier learns a specific and individual class (HPO term), resulting in a set of dichotomic classification problems that are independent of each other.

2. Hierarchical combination of the predictions: aggregation of the classifications performed by the base learners trained in the previous step. The resulting consensus predictions of the ensemble take the hierarchical relationships of the ontology into account.
This ensemble approach is highly modular: in principle any learning algorithm can be used to train the classifiers in the first step, and in the second step the hierarchical relationships between the HPO terms are exploited to achieve the final ensemble predictions of the set of HPO terms associated with a specific gene.
The main limitation of the flat learning of the HPO terms is that each term is learned separately, without taking into account the relationships between classes. Other methods, such as the ensemble approach proposed in [27], can overcome this limitation by applying global models for multi-label classification, thus explicitly considering class interactions during the learning process. On the other hand, hierarchical ensemble methods are able to correct flat predictions by splitting into two separate steps the process of learning the HPO terms and the process of evaluating the hierarchical relationships between classes.
Basic notation and definitions
The directed acyclic graph G=<V,E> represents the HPO taxonomy, its vertices V={1,2,…,|V|} the terms of the ontology, and the directed edges (i,j)∈E the hierarchical relationships between the parent terms i and their children j. We define child(i) as the set of children of a term i, par(i) as the set of its parents, anc(i) as the set of its ancestors and desc(i) as the set of its descendants.
A “flat multi-label scoring” predictor \(f: X \rightarrow \mathbb {Y}\) provides a score \(f(x) = \hat {\boldsymbol {y}}, \hat {\boldsymbol {y}} \in \mathbb {Y}=[0,1]^{|V|}\) for a given example x∈X, where X is a suitable input space for the predictor f. In other words, a flat predictor provides a score \(\hat {y}_{i} \in [0,1]\) that represents the likelihood that a given gene belongs to a given node/HPO term i∈V of the DAG G, with \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\). We say that a multi-label scoring y is consistent if it obeys the true path rule:
$$ \boldsymbol{y} \; \text{is consistent} \iff \forall i \in V,\; j \in par(i) \Rightarrow y_{j} \geq y_{i} $$
(1)
It is straightforward to see that (1) also applies to flat classifiers that provide not a score but a label \(\hat {y}_{i} \in \{0,1\}\), indicating that a given gene belongs (\(\hat {y}_{i} = 1\)) or does not belong (\(\hat {y}_{i} = 0\)) to a given HPO term i.
It is very unlikely that the true path rule is satisfied by a flat multi-label scoring predictor, but with an additional topology-aware score/label modification step we can easily satisfy the constraints imposed by the true path rule. More precisely, we can provide a prediction function \(g\left (f(x)\right): \mathbb {Y} \rightarrow \mathbb {Y}\) such that the true path rule (1) holds for all the predictions \(g\left (f(x)\right) = \bar {\boldsymbol {y}}\): \(\forall i \in V, j \in par(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
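As a minimal illustration of the consistency condition (1), the check can be sketched in Python (the function name and the toy DAG below are ours, not part of the paper's implementation):

```python
def is_consistent(scores, parents):
    """True path rule check (Eq. 1): no node may score
    higher than any of its parents."""
    return all(scores[j] >= scores[i]
               for i in scores
               for j in parents.get(i, ()))

# Toy DAG: node -> set of parents (the root "r" has none).
parents = {"r": set(), "a": {"r"}, "b": {"r"}, "c": {"a", "b"}}
consistent = is_consistent({"r": 0.9, "a": 0.7, "b": 0.8, "c": 0.6}, parents)
violated = is_consistent({"r": 0.9, "a": 0.5, "b": 0.8, "c": 0.6}, parents)
```

Here `violated` is False because node c scores higher than its parent a; a correction function g must cap such scores.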
Flat learning of the terms of the ontology
The algorithm first utilizes a flat ensemble learning strategy by which each term i∈V of the HPO is independently learned through a term-specific predictor \(f_{i}: X \rightarrow [0,1]\). Accordingly, the output of the flat classifier \(f:X \rightarrow \mathbb {Y}\) on the instance x∈X is \(f(x) = \hat {\boldsymbol {y}}\):
$${}f(x) = <f_{1}(x),\ f_{2}(x),\ \ldots,\ f_{|V|}(x)> = <\hat{y}_1,\ \hat{y}_2,\ \ldots,\ \hat{y}_{|V|}>$$
To this end any supervised or semi-supervised base predictor can be used, including flat binary classifiers. Indeed, both learners that provide a probability or a score related to the likelihood that a gene is annotated with an HPO term (that is, scores \(\hat {y}_{i} \in [0,1]\)), and binary classifiers that directly provide a label (but not a score) for the gene-HPO term association (that is, a label \(\hat {y}_{i} \in \{0,1\}\)) can be used to generate the flat predictions. Note that the training of the per-class predictors \(f_{1}, f_{2}, \ldots, f_{|V|}\) can be performed in parallel, and it is easy to achieve a linear speedup in the number of available processors by adopting simple parallel computational techniques.
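Since the per-class problems are independent, the first step parallelizes trivially; a schematic Python sketch (with a placeholder `train_one_term` standing in for any base learner, not the authors' R implementation) could look like:

```python
from concurrent.futures import ThreadPoolExecutor

def train_one_term(term, X, y):
    """Placeholder base learner for one HPO term: here it simply
    scores every gene with the positive-class prior."""
    prior = sum(y) / len(y)
    return term, (lambda x, p=prior: p)

def train_flat_ensemble(terms, X, labels, workers=4):
    """Fit one independent dichotomic predictor per term in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fitted = pool.map(lambda t: train_one_term(t, X, labels[t]), terms)
        return dict(fitted)

models = train_flat_ensemble(["t1", "t2"], X=[[0], [1]],
                             labels={"t1": [1, 0], "t2": [1, 1]})
```

Any real base learner (RANKS, SVMs, etc.) can be dropped in for `train_one_term` without changing the surrounding structure.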
For the HPO predictions we used semi-supervised (RANKS, [28]) and supervised (Support Vector Machines, SVM [29]) machine learning methods to implement the base learners of the proposed hierarchical ensemble methods.
The semi-supervised approach RAnking of Nodes with Kernelized Score functions (RANKS) is a network-based method that adopts both a local and a global prediction strategy. Through local learning, RANKS measures the similarity between a gene and its neighbors using different score functions. Global learning is accomplished through graph kernels that exploit the overall topology of the network to predict node labels. In principle any valid kernel function can be applied here. In this work we apply the average score function and 1-, 2- and 3-step random walk kernels (respectively able to explore direct neighbors and genes two or three steps away in the network). RANKS was previously successfully applied to the prioritization of disease genes [30], the prediction of gene function [31] and drug repositioning problems [32].
It is worth noting that RANKS returns a score, not a probability, representing the likelihood that a gene belongs to a given class; moreover, the “magnitude” of the scores may vary across different classes [31]. To make the scores comparable across classes, we considered two distinct normalization procedures:

1. Normalization in the sense of the maximum: the scores of each class are normalized by dividing them by the maximum score of that class;

2. Quantile normalization: a method originally designed for the normalization of probe intensity levels of high-density oligonucleotide microarray data across multiple experiments [34]. In our case we applied quantile normalization to make the scores comparable across different HPO terms.
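The two procedures can be sketched as follows (a plain-Python sketch with naive tie handling; the original quantile normalization of [34] is defined on microarray intensity matrices):

```python
def max_normalize(scores):
    """Divide each term's scores by the maximum score of that term."""
    return {t: [v / max(vals) for v in vals] for t, vals in scores.items()}

def quantile_normalize(scores):
    """Replace the k-th ranked score of every term with the mean of
    the k-th ranked scores across all terms (ties handled naively)."""
    terms = list(scores)
    n = len(scores[terms[0]])
    sorted_cols = [sorted(scores[t]) for t in terms]
    means = [sum(col[k] for col in sorted_cols) / len(terms) for k in range(n)]
    out = {}
    for t in terms:
        order = sorted(range(n), key=lambda k: scores[t][k])
        ranks = [0] * n
        for r, k in enumerate(order):
            ranks[k] = r
        out[t] = [means[ranks[k]] for k in range(n)]
    return out

# Two HPO terms scored over the same three genes.
scores = {"t1": [5.0, 3.0, 1.0], "t2": [2.0, 4.0, 6.0]}
```

After quantile normalization both terms share the same score distribution, while each term keeps its own gene ranking.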
SVMs were trained for each term using the R interface of the machine learning library LiblineaR [35] with default parameter settings. Because of the high running time of SVMs we implemented a multi-core version of LiblineaR using the doParallel and foreach R packages. The parallel implementation of SVMs is available upon request from the authors.
Hierarchical Top-Down (HTD-DAG) ensembles
The main idea behind the Hierarchical Top-Down algorithm (HTD-DAG) consists in modifying the predictions of each base learner from “top to bottom”, i.e. from the least to the most specific terms, by exploiting at each step the predictions provided by the less specific predictors, i.e. the predictors associated with the parent HPO terms. This is performed recursively by transmitting the predictions from each node to its children, and from the children to the children of the children, through a propagation of the information towards the descendants of each node of the ontology. For instance, in Fig. 1a the information can flow along the path traversing nodes 1,5,6,7 or 1,3,7, and a prediction for e.g. node 5 depends on the predictions performed by the base learners for the parent nodes 4, 1 and 3. This operating mode of the ensemble is performed in an ordered way from the top to the bottom nodes (Fig. 1a).
More precisely, the HTD-DAG algorithm modifies the flat scores according to the hierarchy of the DAG in a single run across the nodes of the graph. The flat predictions \(f(x) = \hat {\boldsymbol {y}}\) are hierarchically corrected to \(\bar {\boldsymbol {y}}\) by traversing the nodes of the DAG top-down per level, and by recursively applying this rule:
$$ \bar{y}_{i} := \left\{ \begin{array}{lll} \hat{y}_{i} & \text{if} \quad i \in root(G) \\ \min_{j \in par(i)} \bar{y}_{j} & \text{if} \quad \min_{j \in par(i)} \bar{y}_{j} < \hat{y}_{i} \\ \hat{y}_{i} & \text{otherwise} \end{array} \right. $$
(2)
The maximum path length from the root is used to define the node levels. More precisely, given that p(r,i) represents a path from the root node r to a node i∈V, l(p(r,i)) the length of p(r,i), \(\mathcal L = \{0, 1, \ldots, \xi \}\) the set of observed levels, and ξ the maximum node level, we can define a level function \(\psi : V \longrightarrow \mathcal L\):
$$ \psi(i) = \max_{p(r,i)}\ l \left(p(r,i)\right) $$
(3)
The above level function ψ assigns each node i∈V to its level ψ(i). For instance, from this definition root nodes are {i | ψ(i)=0}, while nodes lying at the maximum distance ξ from the root are {i | ψ(i)=ξ}.
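A sketch of how ψ can be computed with a topological (Kahn-style) traversal, under the assumption that the DAG is given as a node-to-children map (identifiers are illustrative):

```python
from collections import deque

def node_levels(children, roots):
    """Level function psi of Eq. 3: the maximum path length from
    the root(s), computed along a topological order of the DAG."""
    indeg = {v: 0 for v in children}
    for v in children:
        for c in children[v]:
            indeg[c] += 1
    level = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        v = queue.popleft()
        for c in children[v]:
            # a node's level is one more than its deepest parent's
            level[c] = max(level.get(c, 0), level[v] + 1)
            indeg[c] -= 1
            if indeg[c] == 0:
                queue.append(c)
    return level

# Diamond with a tail: c sits at maximum distance 2, d at 3.
children = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}
```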
The consistency of the predictions is guaranteed if and only if the levels are defined according to the maximum path length from the root (Correctness and consistency of the predictions section provides a formal proof of this fact).
Figure 2 shows the pseudocode of the second step of the HTD-DAG algorithm, by which the flat predictions \(\hat {\boldsymbol {y}}\) computed in the first step are combined and updated according to a top-down per-level traversal of the DAG.
Block A of the algorithm (row 1) computes the maximum distance of each node from the root; to this end the classical Bellman-Ford algorithm or methods based on the Topological Sorting algorithm can be applied [36].
A top-down per-level traversal of the graph is implemented in block B: for each level of the graph, starting from the nodes just below the root, the flat predictions \(\hat {y}_{i}\) are top-down corrected according to Eq. 2, and the consensus ensemble predictions \(\bar {y}_{i}\) are obtained. More precisely, the nested loops starting at lines 04 and 06 ensure that nodes are processed level by level in increasing order. Lines 07–11 perform the hierarchical correction of the flat predictions \(\hat {y}_{i}\), i∈{1,…,|V|}. The algorithm ends when the nodes at distance ξ from the root are processed (last iteration of the external loop within lines 04–13), and it finally provides the hierarchically corrected predictions \(\bar {\boldsymbol {y}}\).
The complexity of block A is \(\mathcal {O}(V + E)\) (if the Topological Sort algorithm is used to implement ComputeMaxDist), while it is easy to see that the complexity of block B (rows 3-13) is \(\mathcal {O}(V+E)\). Hence the overall complexity of the top-down step of HTD-DAG is \(\mathcal {O}(V+E)\), which is linear in the number of nodes of the HPO, considering that its underlying DAG is sparse.
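A compact sketch of the top-down step of Eq. 2 (the function name and toy data are ours; the paper's pseudocode in Fig. 2 is the reference):

```python
def htd_correct(flat, parents, levels):
    """HTD-DAG top-down step (Eq. 2): visit nodes by increasing
    maximum distance from the root and cap each score at the
    minimum corrected score of its parents."""
    corrected = dict(flat)
    for i in sorted(parents, key=levels.get):
        if parents[i]:  # skip root nodes
            corrected[i] = min(corrected[i],
                               min(corrected[j] for j in parents[i]))
    return corrected

parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
levels = {"r": 0, "a": 1, "b": 1, "c": 2}
flat = {"r": 0.6, "a": 0.9, "b": 0.7, "c": 0.8}
bar = htd_correct(flat, parents, levels)
```

Here every flat score exceeding the root's 0.6 is capped, so the corrected scores obey the true path rule.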
Hierarchical true path rule (TPR-DAG) ensembles
By considering the opposite flow of information “from bottom to top”, we can construct the prediction of the ensemble by recursively propagating the predictions provided by the most specific nodes towards their parents and ancestors.
For instance, in Fig. 1b a possible flow of information could be along the path 8,5,4,2,1 or 7,6,5,1, and the prediction of the ensemble for e.g. node 3 depends on the children nodes 5, 6 and 7. The proposed True Path Rule for DAGs (TPR-DAG) adopts this bottom-up flow of information to take into account the predictions of the most specific HPO terms, but also the opposite flow from top to bottom to consider the predictions of the least specific terms. Figure 3 provides a pictorial toy example of the operating mode of the TPR-DAG algorithm.
This algorithm is related to the TPR algorithm for tree-structured taxonomies [37], but despite the similarity of their names, the TPR for trees cannot be applied to the HPO, since it does not work on DAG-structured taxonomies and provides inconsistent predictions when applied to the HPO. In contrast to the per-level tree traversal proposed in [37], the DAG per-level traversal has two distinct and strictly separated steps: (1) the DAG is inspected bottom-up per level, followed by (2) a top-down visit. This separation is necessary to assure the true path rule consistency of the predictions in DAG-structured taxonomies (see “Correctness and consistency of the predictions” section for details). The other main difference consists in the way the levels are computed: in this new DAG version the levels are constructed according to the maximum distance from the root, since this guarantees that in the top-down step all the ancestor nodes have been processed before their descendants, thus assuring the true path rule consistency of the predictions (see “Correctness and consistency of the predictions” section for a formal proof of this fact). Moreover, in this paper we also propose novel algorithms for the bottom-up propagation of the predictions from the most to the least specific terms of the HPO.
TPR-DAG provides “consensus” ensemble predictions by integrating the flat predictions \(\hat {y}_{i}\) through a per-level visit of the DAG:
$$ \bar{y}_{i} := \frac{1}{1 + |\phi_{i}|} \left(\hat{y}_{i} + \sum_{j \in \phi_{i}} \bar{y}_{j}\right) $$
(4)
where \(\phi_{i}\) is the set of “positive” children of i.
Note that only positive predictions of the children obey the true path rule. Indeed, according to this rule, we may have a gene annotated to a term t, but not annotated to a term s∈child(t). Hence if we have a negative prediction for the term s, it is not meaningful to use this prediction to predict the term t. It is worth noting that we can combine children predictions using aggregation strategies other than the average. For instance, using the maximum we could likely improve the sensitivity, but with a likely decrease in precision. Different strategies to select the “positive” children \(\phi_{i}\) can be applied, according to the usage of a specific threshold to separate positive from negative examples:

1. Constant Threshold (T) strategy. For each node the same threshold \(\bar {t}\) is a priori selected: \(t_{j} = \bar {t}, \quad \forall j \in V\). In this case ∀i∈V we have:
$$ \phi_{i} := \{ j \in child(i) \mid \bar{y}_{j} > \bar{t} \} $$
(5)
For instance, if the predictions represent probabilities, it could be meaningful to select \(\bar {t} = 0.5\) a priori.

2. Adaptive Threshold (AT) strategy. The threshold is selected to maximize some performance metric \(\mathcal {M}\) estimated on the training data, e.g. the F-score or the AUROC. In other words, the threshold is selected to maximize some measure of accuracy \(\mathcal {M}(j,t)\) of the predictions on the training data for the class j with respect to the threshold t. The corresponding set of positives ∀i∈V is:
$$ \phi_{i} := \left\{ j \in child(i) \mid \bar{y}_{j} > t_{j}^{*},\ t_{j}^{*} = \arg \max_{t} \mathcal{M}(j,t) \right\} $$
(6)
For instance, internal cross-validation can be used to select \(t^{*}_{j}\) from a set of candidate thresholds t∈(0,1).

3. Threshold Free (TF) strategy. This strategy does not require an a priori or experimentally selected threshold. We select as positive those children that increase the score of their parent node i:
$$ \phi_{i} := \{ j \in child(i) \mid \bar{y}_{j} > \hat{y}_{i} \} $$
(7)
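The three selection strategies (Eqs. 5-7) can be sketched as follows (illustrative Python, not the authors' implementation):

```python
def positives_T(i, child, y_bar, t=0.5):
    """Constant threshold (Eq. 5): one fixed threshold for all nodes."""
    return {j for j in child[i] if y_bar[j] > t}

def positives_AT(i, child, y_bar, t_star):
    """Adaptive thresholds (Eq. 6): per-term thresholds t_star[j],
    e.g. tuned on training data by internal cross-validation."""
    return {j for j in child[i] if y_bar[j] > t_star[j]}

def positives_TF(i, child, y_bar, y_hat):
    """Threshold-free (Eq. 7): children scoring above the
    parent's flat score."""
    return {j for j in child[i] if y_bar[j] > y_hat[i]}

# Toy parent "p" with three children and their corrected scores.
child = {"p": ["c1", "c2", "c3"]}
y_bar = {"c1": 0.9, "c2": 0.4, "c3": 0.6}
```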
Accordingly, we can derive three different algorithmic variants of the basic TPR:

a) TPR-T: TPR with constant threshold, corresponding to strategy 1;

b) TPR-AT: TPR with adaptive thresholds, corresponding to strategy 2;

c) TPR-TF: TPR threshold-free, corresponding to strategy 3.
All three TPR algorithms move positive predictions recursively towards the ancestors, so that the predictions are propagated bottom-up. The pseudocode of the TPR-DAG algorithm is shown in Fig. 4 and is structured in three parts. First, in block A, the maximum distance of every node \(v_{i}\) from the root is computed via the Bellman-Ford algorithm. Second, block B updates the predictions \(\hat {y}_{i}\) using Eq. 4 together with one of the three positive selection strategies through a bottom-up visit. After this step the true path rule is not necessarily fulfilled, since the bottom-up step assures the propagation of positive predictions, but not their consistency. To this end, in the final block C, the hierarchical top-down step is performed, as in the HTD-DAG algorithm.
The complexity of the TPR-DAG algorithm is quadratic in the number of nodes for block A if the Bellman-Ford algorithm is used, but becomes \(\mathcal {O}(V+E)\) if the Topological Sort algorithm is used instead. It is easy to see that the complexity of both blocks B and C is \(\mathcal {O}(V+E)\), i.e. \(\mathcal {O}(V)\) when the graph is sparse. Hence, considering the sparseness of the HPO, the algorithm is linear with respect to the number of terms of the HPO.
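Putting the two visits together, a minimal sketch of TPR-DAG with the threshold-free strategy (our simplified rendering of the pseudocode in Fig. 4; toy data and identifiers are illustrative):

```python
def tpr_dag(flat, parents, children, levels):
    """TPR-DAG sketch: bottom-up per-level averaging over the
    'positive' children (Eqs. 4 and 7), then the HTD-DAG
    top-down step (Eq. 2) to enforce consistency."""
    y = dict(flat)
    # block B: bottom-up visit, deepest levels first
    for i in sorted(levels, key=levels.get, reverse=True):
        pos = [j for j in children[i] if y[j] > flat[i]]
        y[i] = (flat[i] + sum(y[j] for j in pos)) / (1 + len(pos))
    # block C: top-down consistency step
    for i in sorted(levels, key=levels.get):
        if parents[i]:
            y[i] = min(y[i], min(y[j] for j in parents[i]))
    return y

children = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
parents = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
levels = {"r": 0, "a": 1, "b": 1, "c": 2}
flat = {"r": 0.5, "a": 0.3, "b": 0.6, "c": 0.8}
y = tpr_dag(flat, parents, children, levels)
```

The highly scored leaf c pulls its parents' scores up in the bottom-up pass, and the final top-down pass restores true path rule consistency.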
To modulate the contribution to the ensemble prediction of the parent node and its children, a TPR-DAG variant similar to the weighted True Path Rule algorithms for tree-structured taxonomies [23] can be designed for DAGs. The TPR-W (TPR-Weighted) variant can be obtained by weighting the predictions according to this rule:
$$ \bar{y}_{i} := w \, \hat{y}_{i} + \frac{1 - w}{|\phi_{i}|} \sum_{j \in \phi_{i}} \bar{y}_{j} $$
(8)
In this approach a weight w∈[0,1] balances the contribution of node i against that of its “positive” children. If w=1 no weight is attributed to the children and TPR-DAG reduces to the HTD-DAG algorithm, since only the prediction for node i is used in the bottom-up step. If w=0 only the predictors associated with the children nodes “vote” to predict node i. In the intermediate cases we attribute more importance to the predictor for node i or to its children, depending on the value of w. A different weighting strategy could also be pursued by not only balancing the predictions of node i and of the nodes j∈child(i), but also weighting with respect to the accuracy of each base learner, estimated e.g. by internal cross-validation.
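The weighted update of Eq. 8 for a single node can be sketched as follows (we keep the flat score when \(\phi_{i}\) is empty, a convention assumed by us and not stated in Eq. 8 itself):

```python
def tpr_w_update(y_hat_i, positive_children_scores, w):
    """TPR-W rule (Eq. 8): blend the node's flat score with the
    average score of its 'positive' children, weighted by w."""
    if not positive_children_scores:
        # assumed convention: with no positive children, keep the flat score
        return y_hat_i
    avg = sum(positive_children_scores) / len(positive_children_scores)
    return w * y_hat_i + (1 - w) * avg
```

With w=1 the children are ignored and the update reduces to the HTD-DAG behavior; with w=0 only the children vote.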
As shown in [37] for the tree version of the TPR algorithm, the contribution of the descendants of a given node decays exponentially with their distance from the node itself, and it is easy to see that this also holds for the TPR-DAG algorithm. To enhance the contribution of the most specific nodes to the overall decision of the ensemble, a linearly decaying or a constant contribution of the “positive” descendants could be considered instead:
$$ \bar{y}_{i} := \frac{1}{1 + |\Delta_{i}|} \left(\hat{y}_{i} + \sum_{j \in \Delta_{i}} \bar{y}_{j}\right) $$
(9)
where
$$ \Delta_{i} = \left\{ j \in desc(i) \mid \bar{y}_{j} > t_{j} \right\} $$
(10)
In this way all the “positive” descendants of node i provide the same contribution to the ensemble prediction \(\bar {y}_{i}\). We named this TPR variant TPR-D.
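A sketch of the TPR-D update for a single node (descendants gathered by a simple DFS; function names and the per-node thresholds t are illustrative):

```python
def descendants(children, i):
    """desc(i): all nodes reachable from i in the DAG."""
    seen, stack = set(), list(children[i])
    while stack:
        j = stack.pop()
        if j not in seen:
            seen.add(j)
            stack.extend(children[j])
    return seen

def tpr_d_update(i, children, y_hat, y_bar, t):
    """TPR-D rule (Eqs. 9-10): every 'positive' descendant of i
    contributes equally to the ensemble prediction."""
    pos = [j for j in descendants(children, i) if y_bar[j] > t[j]]
    return (y_hat[i] + sum(y_bar[j] for j in pos)) / (1 + len(pos))

# Toy chain r -> a -> b: both descendants of r are "positive".
children = {"r": ["a"], "a": ["b"], "b": []}
```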
Correctness and consistency of the predictions
Hierarchical ensemble methods can improve flat predictions by reducing the number of both false positives (FP) and false negatives (FN). For instance, Additional file 1: Figure S1a shows that hierarchical ensembles can correct FN flat predictions to TP for the gene RGS9 (regulator of G-protein signaling 9), which encodes a member of the RGS family whose mutations cause bradyopsia [38], while Additional file 1: Figure S1b shows the ability of hierarchical ensembles to correct FP to TN for the gene ENAM (enamelin), which encodes the largest protein in the enamel matrix, whose deficiency is associated with amelogenesis imperfecta type 1C [39]. More precisely, in Additional file 1: Figure S1a the hierarchical ensemble TPR-W “recovered” four TP (red rectangles), while in Additional file 1: Figure S1b TPR-W corrected six FP to TN. It is worth noting that hierarchical ensemble methods can improve the correctness of the predictions, but of course they cannot guarantee the correctness of all of them: when, e.g., the flat predicted scores are too poor, hierarchical ensembles may fail to recover FP or FN. For instance, in Additional file 1: Figure S1a TPR-W failed to remove three FN (orange rectangles), and in Additional file 1: Figure S1b it was not able to remove three FP.
Hierarchical ensemble methods can guarantee consistent predictions, that is, predictions that obey the true path rule. To this end, the per-level visit of the hierarchical taxonomy should be realized by using the maximum, and not the minimum, distance from the root (see Additional file 2 for an intuitive example showing this fact). An example of the capability of obtaining hierarchically corrected consistent predictions from inconsistent flat predictions is shown in Fig. 5 for the gene C1QC (complement C1q C chain), which encodes one chain of a protein composed of 18 polypeptide chains, whose deficiency is associated with lupus erythematosus and glomerulonephritis [40].
The consistency of the predictions of both HTD-DAG and TPR-DAG is proved through the following theorems (proofs are available in Additional file 3):
Theorem 1
Given a DAG G=<V,E>, a level function ψ that assigns to each node its maximum path length from the root, and the set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), the top-down hierarchical correction of the HTD-DAG algorithm assures that the set of ensemble predictions \(\bar {\boldsymbol {y}} =< \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) satisfies the following property: \(\forall i \in V, \; j \in par(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
An immediate consequence of this theorem is that HTD-DAG assures consistent predictions not only for the parents, but also for all the ancestors of any node of a hierarchical DAG-structured taxonomy:
Corollary 1
Given a DAG G=<V,E>, the level function ψ and the set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), the HTD-DAG algorithm assures that for the set of ensemble predictions \(\bar {\boldsymbol {y}} = < \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) the following property holds: \(\forall i \in V, \; j \in anc(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
Independently of the choice of the positive children (“Hierarchical true path rule (TPR-DAG) ensembles” subsection), the following consistency theorem holds for TPR-DAG:
Theorem 2
Given a DAG G=<V,E> and a set of flat predictions \(\hat {\boldsymbol {y}} = < \hat {y}_{1}, \hat {y}_{2}, \ldots, \hat {y}_{|V|}>\), one for each class associated with each node i∈{1,…,|V|}, the TPR-DAG algorithm assures that for the set of ensemble predictions \(\bar {\boldsymbol {y}} = < \bar {y}_{1}, \bar {y}_{2}, \ldots, \bar {y}_{|V|}>\) the following property holds: \(\forall i \in V, \; j \in anc(i) \Rightarrow \bar {y}_{j} \geq \bar {y}_{i}\).
Finally, a desirable property of TPR-DAG is that its sensitivity is always equal to or better than that of HTD-DAG:
Theorem 3
The TPR-DAG ensemble algorithm with “positive” children selected according to (7) always achieves a sensitivity equal to or higher than that of the HTD-DAG ensemble algorithm.
Unfortunately, there is no guarantee that the precision of TPR-DAG is always larger than or equal to that of the HTD-DAG algorithm.
All the proofs and details about the above theorems are available in the Additional file 3.