In this section, we introduce the method NHMC (Network CLUS-HMC; a preliminary version of NHMC has been presented in [15]), the major contribution of the paper. NHMC builds autocorrelation-aware models (trees) for HMC. We shall start with a brief description of the algorithm CLUS-HMC, which builds trees for HMC and is the starting point for developing NHMC.

For the HMC task, the input is a dataset *U* consisting of instances (examples) of the form *u*_{i} = (*x*_{i},*y*_{i}) ∈ **X** × 2^{C}, where **X** = *X*_{1} × *X*_{2} × … × *X*_{m} is the space spanned by *m* attributes or features (either continuous or categorical), while 2^{C} is the power set of *C* = {*c*_{1},…,*c*_{K}}, the set of all possible class labels. *C* is hierarchically organized with respect to a partial order ≼ which represents the superclass relationship. Note that each *y*_{i} satisfies the *hierarchical constraint*:

c \in y_i \Rightarrow \forall\, c' \preccurlyeq c : c' \in y_i.

(1)
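The constraint in (1) simply says that a label set must be closed under the superclass relation. As a sketch, it can be checked mechanically given a parent map (the class names below are hypothetical, mirroring the toy hierarchy used later in Figure 1(b)):

```python
# Hypothetical parent map encoding the partial order (child -> set of parents).
PARENTS = {
    "all": set(),
    "A": {"all"}, "B": {"all"}, "C": {"all"}, "D": {"all"},
    "B.1": {"B"}, "D.2": {"D"}, "D.3": {"D"},
}

def satisfies_hierarchical_constraint(y, parents=PARENTS):
    """True iff every superclass of every label in y is also in y."""
    return all(p in y for c in y for p in parents.get(c, set()))
```

For example, {all, B, B.1} is a valid label set, while {B.1} alone violates the constraint because its superclasses B and all are missing.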

The method we propose (NHMC) builds a generalized form of decision trees and is set in the Predictive Clustering (PC) framework [11]. The PC framework views a decision tree as a hierarchy of clusters: the top-node corresponds to one cluster containing all the data, that is recursively partitioned into smaller clusters when moving down the tree. Such a tree is called a predictive clustering tree (PCT). PCT combines elements from both prediction and clustering. As in clustering, clusters of data points that are similar to each other are identified, but, in addition, a predictive model is also associated to each cluster. This predictive model provides a prediction for the target property of new examples that are recognized to belong to the cluster. In addition, besides the clusters themselves, PC approaches also provide symbolic descriptions of the constructed (hierarchically organized) clusters.

The original PC framework is implemented in the CLUS system [11] (http://sourceforge.net/projects/clus/), which can learn both PCT and predictive clustering rules. The induction of PCT is not very different from the induction of standard decision trees (as performed, e.g., by the C4.5 algorithm [16]). The algorithm takes as input a set of training instances and searches for the best acceptable test to put in a node and split the data. If such a test can be found, the algorithm creates a new internal node labeled with the test and calls itself recursively to construct a subtree for each subset (cluster) in the partition induced by the test on the training instances.

### CLUS-HMC

The CLUS-HMC [4] algorithm builds HMC trees, i.e., PCT for hierarchical multi-label classification (see Figure 1(c) for an example of an HMC tree). These are very similar to classification trees, but each leaf predicts a hierarchy of class labels rather than a single label. CLUS-HMC builds the trees in a top-down fashion and the outline of the algorithm is very similar to that of top-down decision tree induction algorithms (see the CLUS-HMC pseudo-code in Additional file 1). The main differences are in the search heuristics and in the way predictions are made. For the sake of completeness, both aspects are described below. Additional details on CLUS-HMC are given by Vens et al. [4].

#### Search heuristics

To select the best test in an internal node of the tree, the algorithm scores the possible tests according to the reduction in variance (defined below) induced on the set *U* of examples associated to the node. In CLUS-HMC, the variance of class labels across a set of examples *U* is defined as follows:

\mathit{Var}(U) = \frac{1}{|U|} \cdot \sum_{u_i \in U} d(L_i, \overline{L})^2,

(2)

where *L*_{i} is the vector associated to the class labels of example *u*_{i} (each element of *L*_{i} is binary and represents the presence/absence of a class label for *u*_{i}), \overline{L} is the average of the *L*_{i} vectors over all examples in *U*, and *d*(·,·) is a distance function on such vectors. The basic idea behind the use of variance reduction is to minimize intra-cluster variance.

In the HMC context, class labels at higher levels of the annotation hierarchy are more important than class labels at lower levels. This is reflected in the distance measure used in the above formula, which is a weighted Euclidean distance:

d(L_1, L_2) = \sqrt{\sum_{k=1}^{K} \omega(c_k) \cdot (L_{1,k} - L_{2,k})^2}

(3)

where *L*_{i,k} is the *k*-th component of the class vector *L*_{i} and the class weights *ω*(*c*_{k}) associated with the labels decrease with the depth of the class in the hierarchy. More precisely, *ω*(*c*) = *ω*_{0} · avg_{j}{*ω*(*p*_{j}(*c*))}, where *p*_{j}(*c*) denotes the *j*-th parent of class *c* and 0 < *ω*_{0} < 1. This definition of the weights allows us to take into account a hierarchy of classes structured as a tree or as a DAG (where a single label can have multiple parents).

For instance, consider the small hierarchy^{a} in Figure 1(b), and two examples (*x*_{1},*y*_{1}) and (*x*_{2},*y*_{2}), where *y*_{1} = {*all*, *B*, *B.1*, *C*, *D*, *D.2*, *D.3*} and *y*_{2} = {*all*, *A*, *D*, *D.2*, *D.3*}. The class vectors for *y*_{1} and *y*_{2} are *L*_{1} = [1,0,0,0,1,1,1,1,0,1,1] and *L*_{2} = [1,1,0,0,0,0,0,1,0,1,1]. The distance between the two class vectors is then:

d([1,0,0,0,1,1,1,1,0,1,1],\; [1,1,0,0,0,0,0,1,0,1,1]) = \sqrt{3 \cdot \omega_0^2 + 4 \cdot \omega_0^3}

(4)

At each node of the tree, the test that maximizes the variance reduction is selected. This is expected to maximize cluster homogeneity with respect to the target variable and improve the predictive performance of the tree. If no test can be found that significantly reduces variance (as measured by a statistical F-test), then the algorithm creates a leaf and labels it with a prediction, which can consist of multiple hierarchically organized labels.

#### Predictions

A classification tree typically associates a leaf with the "majority class", i.e., the label that appears most often among the training examples in the leaf. This label is later used for prediction when a test case reaches that leaf. However, in the case of HMC, where an example may have multiple classes, the notion of "majority class" cannot be applied directly. Instead, CLUS-HMC associates the leaf with the mean \overline{L} of the class vectors of the examples in the leaf. The value of the *k*-th component of \overline{L} is interpreted as the membership score of class *c*_{k}, i.e., the probability that an example arriving at the leaf is labeled with class *c*_{k}.

For an example arriving at a leaf, binary predictions for each class label can be obtained by applying a user-defined threshold *τ* to this probability: if the *i*-th component of \overline{L} exceeds *τ*, then the leaf predicts the class *c*_{i}. To ensure that the predictions satisfy the hierarchical constraint, i.e., that whenever a class is predicted its superclasses are also predicted, it suffices to choose thresholds such that *τ*_{i} ≤ *τ*_{j} whenever *c*_{j} is an ancestor of *c*_{i}.
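One simple way to implement this is to use a single global threshold *τ* and then close the predicted set under the superclass relation, which is equivalent to implicitly using lower thresholds for ancestors. A sketch with hypothetical names:

```python
def predict_labels(mean_vector, classes, parents, tau=0.5):
    """Predict every class whose membership score exceeds tau, then add
    all missing ancestors so the hierarchical constraint always holds."""
    predicted = {c for c, score in zip(classes, mean_vector) if score > tau}
    frontier = list(predicted)
    while frontier:
        c = frontier.pop()
        for p in parents.get(c, set()):
            if p not in predicted:
                predicted.add(p)
                frontier.append(p)
    return predicted
```

With this closure step, even a leaf whose mean vector scores a subclass above *τ* but its superclass below *τ* still yields a prediction that respects the hierarchy.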

### NHMC

We first discuss the network setting that we consider in this paper. We then propose a new network autocorrelation measure for HMC tasks. Subsequently, we introduce NHMC (i.e., Network CLUS-HMC), an extension of CLUS-HMC which takes into account the network autocorrelation (coming from PPI networks) when learning trees for HMC.

#### Network setting for HMC

Some uses of a PPI network in learning gene function prediction models include: treating the interactions between pairs of genes as descriptive attributes (e.g., binary attributes [17]) and generating new features as combinations of PPI data with other descriptive attributes. Both approaches require that the data are pre-processed before applying a network-oblivious learning method (e.g., CLUS-HMC). However, the applicability of predictive models built in this way strongly depends on PPI network information being available for the testing data, i.e., for the proteins whose gene function we want to predict.

In order to learn general models, which can be used to make predictions for any test set, we use protein interactions as a form of background knowledge and exploit them only in the learning phase. More specifically, in the training phase, both gene properties and network structure are considered. In the testing phase, only gene properties are considered and the network structure is disregarded. This key feature of the proposed solution is especially attractive when function prediction concerns new genes, for which interactions with other genes are not known or are still to be confirmed.

Following Steinhaeuser et al. [18], we view a training set as a single network of labeled nodes. Formally, the network is defined as an undirected edge-weighted graph *G* = (*V*,*E*), where *V* is the set of labeled *nodes* and *E* ⊆ {⟨*u*,*v*,*w*⟩ | *u*,*v* ∈ *V*, *w* ∈ ℝ^{+}} is the set of *edges*. Each edge *u* ↔ *v* is assigned a non-negative real number *w*, called the *weight* of the edge. The network can be represented by a symmetric adjacency matrix **W**, whose entries *w*_{ij} are positive (*w*_{ij} > 0) if there is an edge connecting *i* and *j* in *G*, and zero (*w*_{ij} = 0) otherwise. In PPI networks, edge weights can express the strength of the interactions between proteins. Although the proposed method works with any non-negative weight values, in our experiments we mainly focus on binary (0/1) weights.
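As a minimal sketch, **W** can be built from a weighted edge list as follows (node indices are assumed to be 0-based integers):

```python
def adjacency_matrix(n_nodes, edges):
    """Build the symmetric adjacency matrix W from edges (u, v, w), w > 0.
    Missing pairs keep weight 0, i.e., no interaction."""
    W = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v, w in edges:
        W[u][v] = w
        W[v][u] = w  # the graph is undirected, so W is symmetric
    return W
```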

Each node of the network is associated with an example pair *u*_{i} = (*x*_{i},*y*_{i}) ∈ **X** × 2^{C}, where *y*_{i} = (*y*_{i_1},*y*_{i_2},…,*y*_{i_q}), *q* ≤ *K*, is subject to the hierarchical constraint. Given a network *G* = (*V*,*E*) and a function *η* : *V* ↦ (**X** × 2^{C}) which associates each node with the corresponding example, we interpret the task of hierarchical multi-label classification as building a PCT which represents a multi-dimensional predictive function *f* : **X** ↦ 2^{C} that satisfies the hierarchical constraint, maximizes the autocorrelation of the observed classes *y*_{i} for the network *G*, and minimizes the prediction error on *y*_{i} for the training data *η*(*V*).

#### Network autocorrelation for HMC

Network autocorrelation for HMC is a special case of network autocorrelation [19]. It can be defined as the statistical relationship between the observations of a variable (e.g., protein function) on distinct but related (connected) nodes in a network (e.g., interacting proteins). In HMC, the domain values of the variable form a hierarchy, such as the GO hierarchy for protein functions. Therefore, it is possible to define network autocorrelation for individual nodes and for various levels of the hierarchy.

In predictive modeling, network autocorrelation can be a problem, since the i.i.d. assumption is violated, but also an opportunity, if it is properly considered in the model. This is particularly true for the task of hierarchical multi-label classification considered in this work. Indeed, due to non-stationary autocorrelation, PPI network data can provide useful (and diverse) information for each single class at each level of the hierarchy. Intuitively, genes belonging to classes at higher levels of the hierarchy tend to participate in very general types of interactions, while genes belonging to classes at lower levels of the hierarchy tend to participate in very specific and localized interactions. In any case, the effect of autocorrelation changes from level to level (this aspect is also mentioned by Gillis and Pavlidis [20]). For this reason, we explicitly measure autocorrelation and we build a model such that its value is maximized.

#### Geary’s *C* for HMC

In order to measure the autocorrelation of the response variable *Y* in the network setting for HMC, we propose a new statistic, named *A*_{Y}(*U*), whose definition draws inspiration from Global Geary’s *C* [21]. Global Geary’s *C* is a measure of spatial autocorrelation for a continuous variable. Its basic definition (used in spatial data analysis [22]) is given in Additional file 2.

Let *u*_{i} = (*x*_{i},*y*_{i}) ∈ *U* ⊆ **X** × 2^{C} be an example pair in a training set *U* of *N* examples. Let *K* be the number of classes in *C*, possibly defining a hierarchy. We represent *y*_{i} as a binary vector *L*_{i} of size *K*, such that *L*_{i,k} = 1 if *c*_{k} ∈ *y*_{i} and *L*_{i,k} = 0 otherwise; each *L*_{i} satisfies the hierarchical constraint. Let *d*(*L*_{i},*L*_{j}) be a distance measure defined on the binary vectors associated to two examples *u*_{i} = (*x*_{i},*y*_{i}) and *u*_{j} = (*x*_{j},*y*_{j}), which takes the class-label hierarchy into account.

The network autocorrelation measure *A*_{Y}(*U*), based on Geary’s *C*, is defined as follows:

A_Y(U) = 1 - \frac{(N-1) \cdot \sum_i \sum_j w_{ij} \cdot d(L_i, L_j)^2}{4 \cdot \sum_i \sum_j w_{ij} \cdot \sum_i d(L_i, \overline{L})^2}

(5)

where \overline{L} is the mean of the binary class vectors of all example pairs in *U*. The constant 4 in the denominator is included for scaling purposes. The new autocorrelation measure *A*_{Y}(*U*) takes values in the unit interval [0,1], where 1 means strong positive autocorrelation, 0 means strong negative autocorrelation, and 0.5 means no autocorrelation.
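A direct (non-incremental) implementation of Equation (5) can be sketched as follows; uniform class weights are assumed here for brevity, whereas NHMC uses the hierarchy-derived weights *ω*(*c*_{k}):

```python
def weighted_sq_dist(L1, L2, weights):
    # squared weighted Euclidean distance, i.e., Equation (3) without the sqrt
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, L1, L2))

def geary_autocorrelation(W, L, weights):
    """A_Y(U) of Equation (5): 1 minus the ratio of the edge-weighted sum of
    pairwise squared distances to the (scaled) total squared deviation from
    the mean class vector."""
    N = len(L)
    K = len(L[0])
    Lbar = [sum(Li[k] for Li in L) / N for k in range(K)]
    num = sum(W[i][j] * weighted_sq_dist(L[i], L[j], weights)
              for i in range(N) for j in range(N))
    w_sum = sum(W[i][j] for i in range(N) for j in range(N))
    ss = sum(weighted_sq_dist(Li, Lbar, weights) for Li in L)
    return 1.0 - ((N - 1) * num) / (4.0 * w_sum * ss)
```

For intuition: if every edge connects nodes with identical class vectors, the numerator vanishes and *A*_{Y}(*U*) = 1 (perfect positive autocorrelation); edges connecting dissimilar nodes push the value toward 0.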

#### The Algorithm

We can now proceed to describe the top-down induction algorithm for building Network HMC trees. The main difference with respect to CLUS-HMC lies in the search heuristic. The network is considered as background knowledge and exploited only in the learning phase. Below, we first give an outline of the algorithm, then detail the new search heuristic, which takes autocorrelation into account, and finally discuss how it can be computed efficiently.

##### Outline of the algorithm

The top-down induction algorithm for building PCT for HMC from network data is given below (Algorithm 1). It takes as input the network *G* = (*V*,*E*) and the corresponding HMC dataset *U*, obtained by applying *η* : *V* ↦ **X** × 2^{C} to the vertices of the network.

In practice, this means that for each gene *u*_{i} (see Figure 1(b)) there is a set of (discrete and continuous) attributes describing different aspects of the gene. For the experiments with the yeast genome, these include sequence statistics, phenotype, secondary structure, homology, and expression data (see the next section), as well as a class vector *L*_{i}, i.e., the functional annotations associated with the gene.

The algorithm recursively partitions *U* until a stopping criterion is satisfied (Algorithm 1 line 2). Since the implementation of this algorithm is based on the implementation of the CLUS-HMC algorithm, we call this algorithm NHMC (Network CLUS-HMC).

##### Search space

As in CLUS-HMC, for each internal node of the tree, the best split is selected by considering all available attributes. Let *X*_{i} ∈ {*X*_{1},…,*X*_{m}} be an attribute and *Dom*_{X_i} its active domain. A split partitions the current sample space *D* according to a test of the form *X*_{i} ∈ *B*, where *B* ⊆ *Dom*_{X_i}. This means that *D* is partitioned into two sets, *D*_{1} and *D*_{2}, on the basis of the value of *X*_{i}.

For continuous attributes, possible tests are of the form *X* ≤ *β*. For discrete attributes, they are of the form *X* ∈ {*a*_{i_1},*a*_{i_2},…,*a*_{i_o}}, where {*a*_{i_1},*a*_{i_2},…,*a*_{i_o}} is a non-empty subset of the domain *Dom*_{X} of *X*. In the former case, possible values of *β* are determined by sorting the distinct values of *X* in *D* and considering the midpoints between pairs of consecutive values. For *b* distinct values, *b* - 1 thresholds are considered. When selecting a subset of values for a discrete attribute, CLUS-HMC relies on the non-optimal greedy strategy proposed by Mehta et al. [23].
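The threshold enumeration for a continuous attribute is straightforward; a sketch:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous
    attribute: b distinct values yield b - 1 candidate thresholds."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(distinct, distinct[1:])]
```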

##### Heuristics

The major difference between NHMC and CLUS-HMC lies in the heuristic used to evaluate each candidate split. The variance reduction heuristic employed in CLUS-HMC (Additional file 1) aims at finding accurate models, since it considers the homogeneity of the target variable values and reduces the error on the training data. However, it does not consider the dependencies of the target variable values between related examples and therefore neglects the possible presence of autocorrelation in the training data. To address this issue, we introduce network autocorrelation into the search heuristic and combine it with variance reduction to obtain a new heuristic (Algorithm 1).

More formally, the NHMC heuristic *h* is a linear combination of the average autocorrelation measure *A*_{Y}(·) (first term) and variance reduction *Var*(·) (second term):

h = \alpha \cdot \left( \frac{|U_1| \cdot A_Y(U_1) + |U_2| \cdot A_Y(U_2)}{|U|} \right) + (1-\alpha) \cdot \left( \mathit{Var}'(U) - \frac{|U_1| \cdot \mathit{Var}'(U_1) + |U_2| \cdot \mathit{Var}'(U_2)}{|U|} \right)

(6)

where *Var*^{′}(*U*) is the min-max normalization of *Var*(*U*), required to keep the values of the linear combination in the unit interval [0,1], that is:

\mathit{Var}'(U) = \frac{\mathit{Var}(U) - \delta_{\mathit{min}}}{\delta_{\mathit{max}} - \delta_{\mathit{min}}},

(7)

with *δ*_{max} and *δ*_{min} being the maximum and minimum values of *Var*(*U*) over all tests.
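Assuming callables `var` and `autocorr` implementing Equations (2) and (5), and the variance extremes *δ*_{min} and *δ*_{max} collected in a first pass over all candidate tests, the score of one candidate split can be sketched as follows (all names illustrative):

```python
def nhmc_heuristic(U1, U2, var, autocorr, alpha, delta_min, delta_max):
    """Equation (6): convex combination of size-weighted autocorrelation and
    min-max normalized variance reduction for the split U -> (U1, U2)."""
    U = U1 + U2
    n, n1, n2 = len(U), len(U1), len(U2)

    def norm_var(subset):
        # Equation (7): min-max normalization over all evaluated tests
        return (var(subset) - delta_min) / (delta_max - delta_min)

    auto_term = (n1 * autocorr(U1) + n2 * autocorr(U2)) / n
    var_term = norm_var(U) - (n1 * norm_var(U1) + n2 * norm_var(U2)) / n
    return alpha * auto_term + (1 - alpha) * var_term
```

With *α* = 0 the score reduces to normalized variance reduction (the CLUS-HMC criterion); with *α* = 1 only autocorrelation drives the split choice.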

We point out that the heuristic in NHMC combines information on both the network structure, which affects *A*_{Y}(·), and the hierarchical structure of the classes, which is embedded in the computation of the distance *d*(·,·) used in formulas (5) and (2). We also note that the tree structure of the NHMC model makes it possible to consider different effects of the autocorrelation phenomenon at different levels of the tree model, as well as at different levels of the hierarchy (non-stationary autocorrelation). In fact, the effect of the class weights *ω*(*c*_{j}) in Equation (3) is that higher levels of the tree will likely capture the regularities at higher levels of the hierarchy.

However, the efficient computation of the autocorrelation statistic based on the distances of Equation (3) is not straightforward. The difficulty comes from the need to compute *A*_{Y}(*U*_{1}) and *A*_{Y}(*U*_{2}) *incrementally*, i.e., from the statistics already computed for other partitions. Indeed, computing *A*_{Y}(*U*_{1}) and *A*_{Y}(*U*_{2}) from scratch for each partition would increase the time complexity of the algorithm by an order of magnitude and would make the learning process too inefficient for large datasets.

##### Efficient computation of the heuristics

In our implementation, in order to reduce the computational complexity, Equation (6) is not computed from scratch for each test to be evaluated. Instead, the first test to be evaluated is the one that splits *U* into *U*_{1} ≠ ∅ and *U*_{2} ≠ ∅ such that |*U*_{2}| is minimal (1 in most cases, depending on the first available test) and *U*_{1} = *U* - *U*_{2}. Only for this partition is Equation (6) computed from scratch. The subsequent tests to be evaluated progressively move examples from *U*_{1} to *U*_{2}. Consequently, *A*_{Y}(*U*_{1}), *A*_{Y}(*U*_{2}), *Var*(*U*_{1}) and *Var*(*U*_{2}) are computed incrementally by removing/adding quantities to the values computed in the evaluation of the previous test.

*Var*(·) can be computed according to classical methods for the incremental computation of variance. As for *A*_{Y}(·), its numerator (see Equation (5)) only requires distances that can be computed in advance. Therefore, the problem remains only for the denominator of Equation (5). To compute it incrementally, we consider the following algebraic transformations:

\begin{array}{ll}
\sum_{u_i \in U} d(L_i, \overline{L_U})^2 &= \sum_{u_i \in U} \sum_{k=1}^{K} \omega(c_k)\,(L_{i,k} - \overline{L_U}_k)^2 \\
&= \sum_{k=1}^{K} \omega(c_k) \sum_{u_i \in U} (L_{i,k} - \overline{L_U}_k)^2 \\
&= \sum_{k=1}^{K} \omega(c_k) \left[ \sum_{u_i \in U'} (L_{i,k} - \overline{L_{U'}}_k)^2 + (L_{t,k} - \overline{L_{U'}}_k)(L_{t,k} - \overline{L_U}_k) \right] \\
&= \sum_{u_i \in U'} d(L_i, \overline{L_{U'}})^2 + \sum_{k=1}^{K} \omega(c_k)\,(L_{t,k} - \overline{L_{U'}}_k)(L_{t,k} - \overline{L_U}_k)
\end{array}

where *U* = *U*^{′} ∪ {*u*_{t}} and \overline{L_U} (respectively, \overline{L_{U'}}) is the average class vector computed on *U* (respectively, *U*^{′}).
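This identity is the weighted, multi-class analogue of the classical incremental update of a sum of squared deviations: per component *k*, the sum over *U* equals the sum over *U*^{′} plus *ω*(*c*_{k})·(*L*_{t,k} − mean over *U*^{′})·(*L*_{t,k} − mean over *U*). A small numeric check (all names illustrative):

```python
def mean_vec(L):
    """Component-wise mean of a list of class vectors."""
    n, K = len(L), len(L[0])
    return [sum(Li[k] for Li in L) / n for k in range(K)]

def weighted_ss(L, Lbar, weights):
    """sum_i d(L_i, Lbar)^2 with the weighted Euclidean distance."""
    return sum(sum(w * (Li[k] - Lbar[k]) ** 2 for k, w in enumerate(weights))
               for Li in L)

def incremental_ss(ss_old, mean_old, mean_new, Lt, weights):
    """Sum of squared deviations over U = U' + [Lt], updated from the sum
    over U' via the algebraic identity above."""
    return ss_old + sum(w * (Lt[k] - mean_old[k]) * (Lt[k] - mean_new[k])
                        for k, w in enumerate(weights))
```

The incremental value agrees with the from-scratch computation, which is what makes the progressive move of examples between *U*_{1} and *U*_{2} cheap.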

This allows us to significantly optimize the algorithm, as described in the following section.

##### Time complexity

In NHMC, the time complexity of selecting a split test represents the main cost of the algorithm. In the case of a continuous split, a threshold *β* has to be selected for the continuous variable. If *N* is the number of examples in the training set, the number of distinct thresholds can be *N* - 1 at worst. Since the determination of candidate thresholds requires an ordering of the examples, its time complexity is *O*(*m* · *N* · *logN*), where *m* is the number of descriptive variables.

For each variable, the system has to compute the heuristic *h* for all possible thresholds. In general, this computation has time-complexity *O*((*N* - 1) · (*N* + *N* · *s*) · *K*), where *N* - 1 is the number of thresholds, *s* is the average number of edges for each node in the network, *K* is the number of classes, *O*(*N*) is the complexity of the computation of the variance reduction and *O*(*N* · *s*) is the complexity of the computation of autocorrelation.

However, according to the analysis reported before, it is not necessary to recompute autocorrelation values from scratch for each threshold. This optimization makes the complexity of the evaluation of the splits for each variable *O*(*N* · *s* · *K*). This means that the worst case complexity of creating a split on a continuous attribute is *O*(*m* · (*N* · *logN* + *N* · *s*) · *K*).

In the case of a discrete split, the worst case complexity (for each variable and in the case of optimization) is *O*((*d* - 1) · (*N* + *N* · *s*) · *K*), where *d* is the maximum number of distinct values of a discrete variable (*d* ≤ *N*). Overall, the identification of the best split node (either continuous or discrete) has a complexity of *O*(*m* · (*N* · *logN* + *N* · *s*) · *K*) + *O*(*m* · *d* · (*N* + *N* · *s*) · *K*), that is *O*(*m* · *N* · (*logN* + *d* · *s*) · *K*). This complexity is similar to that of CLUS-HMC, except for the *s* factor which equals *N* in the worst case, although such worst-case behavior is unlikely.

##### Additional remarks

The relative influence of the two parts of the linear combination in Formula (6) is determined by a user-defined coefficient *α* that falls in the interval [0,1]. When *α* = 0, NHMC works as the original CLUS-HMC algorithm; when *α* = 0.5, it weights variance reduction and autocorrelation equally; when *α* = 1, it uses only autocorrelation. If autocorrelation is present, examples with high autocorrelation will fall in the same cluster and will have similar values of the response variable (gene function annotation). In this way, we are able to keep connected examples together without forcing splits on the network structure (which could reduce the generality of the induced models).

Finally, note that the linear combination that we use in this article (Formula (6)) was selected as a result of our previous work on network autocorrelation for regression [12]. The variance and autocorrelation can also be combined in some other way (e.g., by multiplying them). Investigating different ways of combining them is one of the directions for our future work.
