Predicting protein functions using incomplete hierarchical labels
Guoxian Yu^{1,2}, Hailong Zhu^{1} and Carlotta Domeniconi^{3}
https://doi.org/10.1186/s12859-014-0430-y
© Yu et al.; licensee BioMed Central. 2015
Received: 24 July 2014
Accepted: 11 December 2014
Published: 16 January 2015
Abstract
Background
Protein function prediction assigns biological or biochemical functions to proteins. It is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e., no label is missing. In real scenarios, however, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction.
Results
In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses a regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels.
Conclusion
The empirical study shows that it is important to consider the incomplete annotation for protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.
Keywords
Background
The increasing amount of proteomic data produced using high-throughput technology makes it crucial but challenging to develop computational models that can identify hypothetical functions of proteins. Such techniques have the potential to drive the biological validation and discovery of novel functions of proteins, and to save on the experimental cost. At the same time, functional annotations of proteins have been incorporated into several bioinformatics tools (e.g., Panther [1], IntPath [2], and InterProScan [3]) to investigate the semantic similarity between proteins, protein functional interactions, pathway enrichment analysis, functional enrichment analysis, and phylogenetic trees [4,5].
Protein function prediction is a challenging computational problem, characterized by several intrinsic difficulties: the number of function labels is rather large, each protein can have several labels, and the labels are structured in a hierarchy and are unbalanced. Furthermore, function labels associated with proteins are uncertain and incomplete. Various computational models have been proposed to address one or more of these issues [3,6-10]. Some models use cost-sensitive learning and hierarchical classification [8,11]; others apply multi-label learning [12,13], classifier ensembles [8,12] and multiple network (kernel) integration [14] to use the complementary information spread across heterogeneous data sources. More recent approaches incorporate evolutionary knowledge [15], pathways [1,2,16], domains [17], or negative example selection [7,18]. For a complete review on protein function prediction, see [6,10,19]. Radivojac et al. [9,19] organized the large-scale community-based critical assessment of protein function annotation, and suggested that there is significant room for improving protein function prediction.
Protein function prediction can be viewed as a multi-label learning problem [7,10,12,20,21]. Recently, multi-label learning approaches that use the correlation (or similarity) between function labels have been introduced. Pandey et al. [22] incorporated label correlation using Lin’s similarity [23] into the k-nearest neighbor (LkNN) classifier; the authors observed that utilizing the correlation between function labels can boost the prediction accuracy. Zhang and Dai [24] investigated the usefulness of functional interrelationships based on Jaccard coefficients for protein function prediction. Wang et al. [25] introduced a function-function correlated multi-label learning method to infer protein functions. Yu et al. [12] studied a directed bi-relational graph (composed of protein nodes and function label nodes) to utilize the correlation between function labels for protein function prediction. Chi and Hou [26] assumed that the label sets of two proteins can influence their similarity and introduced a Cosine Iterative Algorithm (CIA). In each iteration of CIA, the function predicted with the highest confidence is appended to the label set of a protein. Next, the pairwise similarity between training proteins and testing proteins is updated based on the extended function sets. CIA considers the updated pairwise similarity, the function correlation based on cosine similarity, and the PPI network topology to predict functions in consecutive iterations.
Most of these multi-label learning algorithms focus on exploiting label correlations to boost prediction accuracy, under the assumption that the labels of labeled proteins used for training are complete, i.e., no label is missing. For various reasons (e.g., evolving Gene Ontology terms, or limitations of experimental methods), in practice we may be aware of some functions only, while additional functions (unknown to us) may also be associated with the protein. In other words, proteins are partially labeled. Learning from partially labeled, multi-label instances (or proteins) can be formulated as a multi-label and weak-label learning problem [27-29].
Several multi-label and weak-label learning algorithms have been introduced in the past years. Sun et al. [27] studied a multi-label and weak-label learning method called WELL. WELL assumes that there is a margin between instances of different classes and that any given label has a small number of member instances. To make use of the label correlation among multi-label instances, this approach assumes that there is a group of low-rank-based similarities, and that the similarity between instances of different labels can be approximated based on these similarities. However, WELL relies on quadratic programming to compute the low-rank-based similarities and to make the final predictions. Therefore, it is computationally expensive and can hardly make predictions for samples with a large number of labels. Bucak et al. [30] proposed a weak-label learning approach called MLRGL. MLRGL optimizes a convex objective function that includes a ranking loss and a group Lasso loss. MLRGL aims at labeling instances with no labels by using partially labeled instances. Yang et al. [28] introduced a multi-instance and multi-label weak-label learning algorithm. Yu et al. [29] proposed an approach called ProWL to predict protein functions using partially labeled proteins. ProWL exploits the label correlation and the available labels of a protein to estimate the likelihood of a missing function for the protein. ProWL integrates these estimations with a smoothness loss function to replenish the missing function labels and to predict functions for proteins with no labels. Yu et al. [31] assumed that a function label depends on the feature information of proteins and introduced an algorithm called ProDM. ProDM maximizes this dependency to replenish the missing function labels and to predict functions for unlabeled proteins.
However, these weak-label learning techniques only use the flat relationships among function labels, and do not explicitly take into account the hierarchical relationship among labels. It is widely recognized that the MIPS Functional Catalogue (FunCat) [32] organizes the function labels in a tree structure and the Gene Ontology (GO) [33] organizes the function terms (or labels) in a directed acyclic graph. It is reported that exploiting the hierarchical relationship among function labels can boost the accuracy of protein function prediction [7,8,11,22]. For example, Barutcuoglu et al. [11] suggested that organizing the predictions produced by the binary classifier for each individual function label in a Bayesian network can improve the accuracy of gene function prediction. Tao et al. [34] utilized an information-theory-based metric to measure the interrelationships between function labels and to determine whether a certain function label belongs to a protein or not. However, this method cannot predict functions for unlabeled proteins, since it only employs the known annotations of a protein to infer its other potential annotations. Jiang et al. [35] combined the relational PPI network and the label hierarchical structure to predict consistent functions by setting the descendants of a function label as negative whenever this label is set to negative. Pandey et al. [22] used Lin’s similarity to capture the relationship among hierarchically organized labels. Schietgat et al. [36] integrated hierarchical multi-label decision trees for protein function prediction. Valentini [7] post-processed the prediction made by a binary classifier for each label according to the true path rule in the GO and the FunCat hierarchies, and proposed a method called TPR. Cesa-Bianchi et al. [8] integrated cost-sensitive learning and data fusion with TPR to further boost the accuracy of protein function prediction.
Valentini [10] advocated in his recent survey that it is paramount to exploit the hierarchical relationship among function labels for protein function prediction.
According to the True Path Rule [7] in GO and FunCat: (i) if a protein is labeled with a function, then this protein should be labeled with the ancestor functions (if any) of this function; (ii) if a protein cannot be labeled with a function, then this protein should not be labeled with the descendant functions (if any) of this function. In [29,31], the incomplete annotation problem was simulated by randomly masking function labels in a flat style, ignoring the hierarchical relationship between labels. In the simulation, if a function label of a protein is missing, this protein may still be labeled with the descendant functions of this function. And in fact, the missing function can be directly inferred from its descendant function labels.
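Part (i) of the True Path Rule amounts to closing a label set upward over the hierarchy, which is also why a flat-masked label can be recovered from any of its remaining descendants. The following is a minimal sketch of that closure, not the authors' implementation; the `parents` mapping and the FunCat-style identifiers are hypothetical.

```python
from typing import Dict, Set


def ancestors(label: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all ancestors of `label` in a tree or DAG given child -> parents."""
    found: Set[str] = set()
    stack = list(parents.get(label, ()))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, ()))
    return found


def apply_true_path_rule(labels: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Close a label set upward: every annotated label implies its ancestors."""
    closed = set(labels)
    for label in labels:
        closed |= ancestors(label, parents)
    return closed


# Hypothetical FunCat-like fragment: 01 -> 01.01 -> 01.01.03, and 01 -> 01.02.
toy_parents = {"01.01": {"01"}, "01.01.03": {"01.01"}, "01.02": {"01"}}

# Only the leaf label is known; its ancestors are inferred from the hierarchy.
recovered = apply_true_path_rule({"01.01.03"}, toy_parents)
```

Because the mapping stores a set of parents per label, the same sketch covers both a tree (FunCat) and a directed acyclic graph (GO).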
In this paper, we study the incomplete label problem in a hierarchical manner, as opposed to a flat style. We propose an approach called PILL to predict protein functions using partially labeled proteins with hierarchical labels. PILL integrates the hierarchical and flat relationships between function labels to estimate the likelihoods of missing labels, and the interaction between proteins to replenish the missing annotations and to predict the functions of unlabeled proteins. In particular, PILL simulates the incomplete hierarchical labels by randomly masking the leaf function labels of a protein, which is closer to the real situation than the simulation in the previous studies [29,31]. We conducted experiments on three publicly available PPI datasets, each annotated with FunCat labels and GO labels. The experimental results show that PILL outperforms other related algorithms in replenishing the missing labels of partially labeled proteins and in predicting functions for completely unlabeled proteins.
The incomplete hierarchical label problem
Methods
Function correlation definition
In Eq. (1), s and t are two function labels, and \(p(s)\) denotes the probability for a protein to be labeled with s. \(p(s)\) can be estimated from the number of member proteins of s available for an organism. \(ca(s,t)\) is the set of common ancestors of s and t, and \(p_{ca}(s,t)\) denotes the probability of the most specific function label in the hierarchy that subsumes both s and t. Intuitively, Eq. (1) measures the semantic similarity of s and t in terms of the content of their minimum subsumer node in the hierarchy. Clearly, \(LinSim(s,t)=1\) if s=t, and \(LinSim(s,t)=0\) when \(p_{ca}(s,t)=1\), i.e., when their minimum subsumer is the root node of the ontology, or when the function label corresponding to the minimum subsumer node is associated with all the proteins of an organism. \(LinSim(s,t)\) can also be viewed as a correlation measure between s and t. According to this definition, \(LinSim(s,t)\) is large if s and t often co-annotate the same proteins, and their most specific ancestor label is close to s and t but far away from the root node. On the other hand, if the most specific ancestor of s and t is (close to) the root node, but s and t are far away from the root node in the hierarchy, \(LinSim(s,t)\) will be small.
\(sa(s,t)\) represents the set of shared ancestors of s and t, which includes s if t is a descendant label of s, or t if t is an ancestor label of s. Thus, \(ca(s,t) \subseteq sa(s,t)\). We extend Lin’s similarity to a similarity named \(HSim(s,t)\) by substituting \(p_{ca}(s,t)\) in Eq. (1) with \(p_{sa}(s,t)\). If s is an ancestor label of t, \(HSim(s,t)\) is no smaller than \(LinSim(s,t)\), since s is more specific than any function label in \(ca(s,t)\) (or \(p(s) \leq p_{ca}(s,t)\)). When s and t are siblings (or cousins), \(HSim(s,t)\) and \(LinSim(s,t)\) are the same.
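The contrast between the two measures can be illustrated on a toy tree. The sketch below assumes that Eq. (1) takes the standard form of Lin's measure, \(LinSim(s,t)=2\log p_{ca}(s,t)/(\log p(s)+\log p(t))\), and that HSim only swaps \(p_{ca}\) for \(p_{sa}\); the hierarchy, the probabilities and all function names are hypothetical.

```python
import math
from typing import Dict, Optional, Set

# Hypothetical toy hierarchy (child -> parent) and label probabilities p(s).
PARENT: Dict[str, Optional[str]] = {"root": None, "a": "root", "b": "a", "c": "a"}
PROB: Dict[str, float] = {"root": 1.0, "a": 0.5, "b": 0.2, "c": 0.1}


def strict_ancestors(label: str) -> Set[str]:
    """All ancestors of `label`, excluding the label itself."""
    out: Set[str] = set()
    node = PARENT[label]
    while node is not None:
        out.add(node)
        node = PARENT[node]
    return out


def subsumer_prob(s: str, t: str, shared: bool) -> float:
    """p_ca(s,t) when shared is False, p_sa(s,t) when shared is True."""
    anc_s, anc_t = strict_ancestors(s), strict_ancestors(t)
    candidates = anc_s & anc_t          # ca(s,t): common ancestors
    if shared:                          # sa(s,t) adds s or t if one subsumes the other
        if s in anc_t:
            candidates = candidates | {s}
        if t in anc_s:
            candidates = candidates | {t}
    # The most specific subsumer has the smallest probability; an empty set
    # is treated like the root (probability 1), which is an assumption.
    return min((PROB[c] for c in candidates), default=1.0)


def similarity(s: str, t: str, shared: bool) -> float:
    """LinSim (shared=False) or HSim (shared=True) under the assumed Eq. (1)."""
    if s == t:
        return 1.0
    denom = math.log(PROB[s]) + math.log(PROB[t])
    return 2.0 * math.log(subsumer_prob(s, t, shared)) / denom


lin_parent_child = similarity("a", "b", shared=False)   # subsumer is the root
hsim_parent_child = similarity("a", "b", shared=True)   # subsumer is "a" itself
lin_siblings = similarity("b", "c", shared=False)
hsim_siblings = similarity("b", "c", shared=True)
```

On this toy data the parent-child pair gets a LinSim of zero but a clearly positive HSim, while the sibling pair receives the same value from both measures, matching the two cases described above.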
\(sa(s,t)\) often includes more specific functions (e.g., the parent function label of t) than \(ca(s,t)\), since \(ca(s,t) \subseteq sa(s,t)\). If t is missing for a protein, but the ancestor function labels of t (including its parent function label s) are associated with this protein, the estimate of the missing label obtained from the parent function is more reliable than that obtained from other ancestor functions (e.g., grandparent functions). This property of function label hierarchies motivates us to estimate the missing labels using HSim instead of LinSim. The statistics computed in the next section support our rationale.
where JcdSim is the similarity based on the Jaccard coefficient, \(JcdSim(s,t)=|N(s) \cap N(t)| / |N(s) \cup N(t)|\). \(N(\cdot)\) denotes the set of proteins labeled with the corresponding function label and \(|N(\cdot)|\) is the cardinality of the set. From the definition, if s and t do not have shared ancestor function labels, \(ComSim(s,t)\) is large when they are often co-associated with the same set of proteins, and \(ComSim(s,t)\) is small when they are seldom co-associated with the same proteins. When s, t and the most specific shared ancestor of these two function labels are always associated with the same proteins, \(ComSim(s,t)=1\). In this case, \(JcdSim(s,t)\) is also set to 1. As such, ComSim captures both the hierarchical and the flat relationships between functions.
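The flat component, JcdSim, is a plain set computation over the member proteins of two labels. The snippet below is a small illustration with hypothetical protein identifiers; returning 0 for two labels without any member protein is an assumption, since the text does not cover that case.

```python
from typing import Dict, Set


def jcd_sim(members_s: Set[str], members_t: Set[str]) -> float:
    """Jaccard coefficient |N(s) & N(t)| / |N(s) | N(t)| of two protein sets."""
    union = members_s | members_t
    if not union:
        return 0.0  # assumption: labels without member proteins are uncorrelated
    return len(members_s & members_t) / len(union)


# Hypothetical member proteins N(s) and N(t) of two function labels.
N: Dict[str, Set[str]] = {
    "s": {"P1", "P2", "P3"},
    "t": {"P2", "P3", "P4"},
}

flat_similarity = jcd_sim(N["s"], N["t"])  # two shared proteins out of four
```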
Statistics of hierarchical function label relationships

\[p(s \mid par(s)) \geq p(s \mid gpar(s))\]

\[p(s \mid gpar(s)) \geq p(s \mid uncle(s))\]
where \(par(s)\) denotes the parent function label of s, \(gpar(s)\) is the grandparent function label of s, and \(uncle(s)\) is the uncle (parent’s sibling) function label of s. \(p(s \mid par(s))\) is the conditional probability that a protein is labeled with s given that it is already labeled with \(par(s)\). These equations hold since if a protein is labeled with s, then this protein is also labeled with the ancestor functions of s (including \(par(s)\) and \(gpar(s)\)), and if a protein is labeled with \(uncle(s)\), this protein is also labeled with \(gpar(s)\). In contrast, if a protein is labeled with \(par(s)\) (or \(gpar(s)\)), it is uncertain whether this protein is labeled with s (or \(uncle(s)\)).
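These conditional probabilities can be estimated by counting co-annotated proteins. The sketch below assumes the estimate \(p(s \mid q)=|N(s) \cap N(q)|/|N(q)|\) and uses hypothetical annotations that already satisfy the true path rule; it only illustrates the ordering on toy data and does not reproduce the statistics of the paper.

```python
from typing import Dict, Set


def cond_prob(members: Dict[str, Set[str]], s: str, given: str) -> float:
    """Estimate p(s | given) as |N(s) & N(given)| / |N(given)|."""
    base = members[given]
    if not base:
        raise ValueError(f"label {given!r} has no member proteins")
    return len(members[s] & base) / len(base)


# Hypothetical member proteins: par and uncle are children of gpar, s is a child of par.
members: Dict[str, Set[str]] = {
    "gpar": {"P1", "P2", "P3", "P4", "P5", "P6"},
    "par": {"P1", "P2", "P3"},
    "uncle": {"P3", "P4", "P5", "P6"},
    "s": {"P1", "P2"},
}

p_given_par = cond_prob(members, "s", "par")      # 2 of the 3 proteins of par
p_given_gpar = cond_prob(members, "s", "gpar")    # 2 of the 6 proteins of gpar
p_given_uncle = cond_prob(members, "s", "uncle")  # no protein of uncle has s
```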
The boxplots of Figure 2 support the relationships \(p(s \mid par(s)) \geq p(s \mid gpar(s))\) and \(p(s \mid gpar(s)) \geq p(s \mid uncle(s))\). If s is missing for a protein, and the protein is labeled with \(par(s)\), \(gpar(s)\) and \(uncle(s)\), the likelihood of the missing label s estimated from \(par(s)\) is more reliable than that estimated from \(gpar(s)\) and \(uncle(s)\). The explanation is straightforward: the more specific the function label is, the fewer member proteins the label has. In other words, if the function label \(par(s)\) is associated with a protein, we can be sure that the function label \(gpar(s)\) is also associated with the same protein, but not vice versa. Similarly, given that \(uncle(s)\) is the sibling of \(par(s)\) and the two share the same parent, if a protein is annotated with \(uncle(s)\), this protein is also annotated with \(gpar(s)\). Similar results are obtained for the S. cerevisiae proteins annotated with GO labels (see Figure S2 of the Additional file 1).
In Figure 2, \(p(s \mid par(s))\) is more evenly distributed than \(p(s \mid gpar(s))\) and \(p(s \mid uncle(s))\), and it has fewer outliers than the latter two. We can also observe that the distributions of the function correlations defined by LinSim and ComSim are closer to \(p(s \mid par(s))\) than the correlations defined by the Cosine similarity and the Jaccard coefficient, and that the label correlations based on LinSim and ComSim are more evenly distributed than the correlations based on Cosine and Jaccard similarity, since the former two have fewer outliers than the latter two. ComSim considers both the hierarchical (measured by HSim) and flat (measured by JcdSim) relationships among labels, and its margin between the 25th and 75th percentiles is wider than that of LinSim. In addition, the overlap between ComSim and \(p(s \mid par(s))\) is larger than that between LinSim and \(p(s \mid par(s))\). In fact, we also studied the Gaussian function (\(\exp\left(-\frac{(x-\mu)^{2}}{\sigma^{2}}\right)\), where μ and σ are the mean and standard deviation of x, and x corresponds to a kind of likelihood or similarity) distribution of these likelihoods and similarities, and also observed that ComSim overlaps more with \(p(s \mid par(s))\) than the other similarity metrics do (not reported). Since ComSim will be used to estimate the likelihoods of missing labels, these differences indicate that ComSim can estimate the missing labels more accurately than the other three techniques. The advantage of ComSim will also be verified in our experiments.
Objective function
Given n proteins, let K be the number of distinct functions across all proteins. Let \(Y=[\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{n}]\) be the original label set, with \(y_{ik}=1\) if protein i has the kth function, and \(y_{ik}=0\) if it is unknown whether this protein has the kth function or not. We assume the first \(l \leq n\) proteins are partially labeled and the remaining \(n-l\) proteins are completely unlabeled. We set the normalized function correlation matrix as \(C_{m}(s,t)=\frac{ComSim(s,t)}{\sum_{t'=1}^{K} ComSim(s,t')}\).
If \(y_{ik}=0\) and the kth function label has a large correlation with the already known functions of protein i, then it is likely that the kth function is missing for this protein, and \(\tilde{y}_{ik}\) is assigned a large value. \(\tilde{\mathbf{y}}_{i}\) is the label vector of the ith protein: it holds the confirmed labels (the corresponding entries are set to 1) together with the likelihoods of the missing labels estimated from \(\mathbf{y}_{i}\) and \(C_{m}\) (for entries corresponding to \(y_{ik}=0\)).
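The normalization of the correlation matrix and the estimation step can be sketched with NumPy. The row normalization follows the definition of \(C_{m}\) above; the estimation rule, in which each unknown entry is scored by the summed normalized correlation to the known labels of the protein, is an assumed form chosen for illustration, and the ComSim values are hypothetical.

```python
import numpy as np


def row_normalize(com_sim: np.ndarray) -> np.ndarray:
    """C_m(s,t) = ComSim(s,t) / sum over t' of ComSim(s,t'); zero rows stay zero."""
    row_sums = com_sim.sum(axis=1, keepdims=True)
    out = np.zeros_like(com_sim, dtype=float)
    return np.divide(com_sim, row_sums, out=out, where=row_sums > 0)


def estimate_labels(Y: np.ndarray, C_m: np.ndarray) -> np.ndarray:
    """Assumed rule: confirmed labels stay 1, unknown entries get (Y @ C_m)[i, k]."""
    scores = Y @ C_m  # scores[i, k] = sum over s of Y[i, s] * C_m[s, k]
    return np.where(Y == 1, 1.0, scores)


# Hypothetical ComSim values between K = 3 function labels.
com_sim = np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])
C_m = row_normalize(com_sim)

# One protein (rows are proteins here) with only the first label confirmed.
Y = np.array([[1.0, 0.0, 0.0]])
Y_tilde = estimate_labels(Y, C_m)
```

In this example the second label, which is correlated with the confirmed one, receives a likelihood of 0.375, while the uncorrelated third label stays at 0.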
where \(\mathbf{f}_{i} \in \mathbb{R}^{K}\) is the likelihood vector to be predicted for the ith protein, \(F=[\mathbf{f}_{1},\mathbf{f}_{2},\ldots,\mathbf{f}_{n}]\) collects the predictions on the n proteins, \(\tilde{Y}=\left[\tilde{\mathbf{y}}_{1},\tilde{\mathbf{y}}_{2},\ldots,\tilde{\mathbf{y}}_{n}\right]\) is the likelihood matrix for the confirmed labels along with the estimated missing labels on the n proteins, and U is an n×n diagonal matrix with \(U_{ii}=1\) if \(i \leq l\), and \(U_{ii}=0\) otherwise.
where \(\mathcal{N}(p_{i})\) is the set of proteins interacting with \(p_{i}\), \(W_{ij}\) is the weight of the interaction (similarity) between proteins i and j, and I is an n×n identity matrix. Our motivation to minimize Eq. (7) is threefold: (i) if two proteins i and j are quite similar to one another (or \(W_{ij}\) is large), then the margin between \(\mathbf{f}_{i}\) and \(\mathbf{f}_{j}\) should be small, otherwise there is a big loss; (ii) if protein i has missing labels and its interacting partners do have those labels, then we can leverage this information to assist the replenishing process of the missing labels for protein i; (iii) if protein i is completely unlabeled, its labels can be predicted using the labels of its partners. Alternative ways (e.g., based on functional connectivity or homology between proteins) to transfer labels among proteins have been suggested in the literature (see [5,39-41]). These methods can also be adapted to replace Eq. (7). Our work focuses on how to replenish the missing labels and how to predict protein functions using incomplete hierarchical labels. How to utilize the guilt-by-association rule more efficiently and how to reduce noise in PPI networks to boost the accuracy (e.g., by enhancing the functional content [42], or by incorporating additional data sources [5,15,16]) are out of scope.
where λ>0 is a scalar parameter that balances the importance of the empirical loss and the smoothness loss.
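An objective of this kind typically has a closed-form minimizer. The sketch below makes its assumptions explicit: the empirical loss is taken to be the squared error \(tr((F-\tilde{Y})^{T}U(F-\tilde{Y}))\) on the labeled proteins, the smoothness loss is \(tr(F^{T}LF)\) with \(L=I-D^{-1/2}WD^{-1/2}\), and proteins are stored as rows of F. Under these assumptions, setting the gradient to zero gives \((U+\lambda L)F=U\tilde{Y}\). The exact form of the equations in the paper may differ, and the network and likelihoods below are hypothetical.

```python
import numpy as np


def predict(W: np.ndarray, Y_tilde: np.ndarray, n_labeled: int, lam: float) -> np.ndarray:
    """Solve (U + lam * L) F = U @ Y_tilde for the assumed quadratic objective."""
    n = W.shape[0]
    degree = W.sum(axis=1)
    inv_sqrt = np.zeros(n)
    inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])  # isolated nodes stay 0
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(n) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    # U selects the first n_labeled (partially labeled) proteins.
    U = np.diag((np.arange(n) < n_labeled).astype(float))
    return np.linalg.solve(U + lam * L, U @ Y_tilde)


# Hypothetical chain network P1 - P2 - P3; P3 is completely unlabeled.
W = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
# Rows: proteins, columns: K = 2 functions (confirmed 1s and estimated likelihoods).
Y_tilde = np.array([
    [1.0, 0.0],
    [1.0, 0.4],
    [0.0, 0.0],
])
F = predict(W, Y_tilde, n_labeled=2, lam=0.5)
```

The linear system is solvable when every connected component of the network contains at least one labeled protein. In the example, the unlabeled protein P3 receives a positive score for the first function through its neighbor P2; the scores are not constrained to lie in [0, 1].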
Results and discussion
Datasets and experimental setup
Dataset statistics
| Dataset | # Proteins | # FunCat labels (total annotations) | # GO labels (total annotations) | Avg ± Std (FunCat) | Avg ± Std (GO) |
|---|---|---|---|---|---|
| CollinsPPI | 1620 | 176 (13320) | 168 (22023) | 8.22 ± 5.60 | 13.59 ± 8.28 |
| KroganPPI | 2670 | 228 (20384) | 241 (32639) | 7.63 ± 5.81 | 12.22 ± 8.83 |
| ScPPI | 5700 | 305 (36909) | 372 (61048) | 6.48 ± 5.71 | 10.71 ± 8.83 |
There are no off-the-shelf proteomic datasets that can be directly used to test solutions to the incomplete label problem, although this problem is practical and common in real-world scenarios. To address this issue, we assume the labels of the currently labeled proteins are complete and randomly mask some of the ground truth leaf functions of a protein; these masked functions are considered as missing for this protein.
We use m to denote the number of missing functions of a protein. For example, if a protein has 10 functional labels, m=3 means that three functional labels are masked for this protein. If a protein does not have more than m labels, we do not mask all the available labels and ensure that it keeps one function label. A small number of proteins in these networks do not have any label; we keep these proteins in the network to retain the network’s structure, but do not test on them. We introduce \(N_{m}\) to represent how many labels are missing over all the proteins for a given setting of m.
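The masking protocol can be sketched for a tree-structured hierarchy such as FunCat. The snippet below masks one leaf at a time and recomputes the leaves after each removal, so that the remaining labels stay consistent with the true path rule; this iterative reading of the protocol is an assumption, and the hierarchy and function names are hypothetical.

```python
import random
from typing import Dict, Optional, Set

# Hypothetical FunCat-like tree (child -> parent).
PARENT: Dict[str, Optional[str]] = {
    "01": None,
    "01.01": "01",
    "01.01.03": "01.01",
    "01.02": "01",
}


def leaf_labels(labels: Set[str], parent: Dict[str, Optional[str]]) -> Set[str]:
    """Labels of a protein that are not the parent of another of its labels."""
    internal = {parent[lab] for lab in labels if parent.get(lab) is not None}
    return labels - internal


def mask_leaves(labels: Set[str], parent: Dict[str, Optional[str]],
                m: int, rng: random.Random) -> Set[str]:
    """Mask up to m leaf labels, always keeping at least one label."""
    kept = set(labels)
    for _ in range(m):
        if len(kept) <= 1:
            break  # never mask the last remaining label
        leaf = rng.choice(sorted(leaf_labels(kept, parent)))
        kept.remove(leaf)
    return kept


labels = {"01", "01.01", "01.01.03", "01.02"}
kept_one = mask_leaves(labels, PARENT, m=1, rng=random.Random(0))
kept_many = mask_leaves(labels, PARENT, m=10, rng=random.Random(0))
```

With m larger than the number of labels, only the root-level label survives, which mirrors the rule that a protein always keeps one function label.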
Comparing methods and evaluation metrics
We compare PILL against ProDM [31], ProWL [29], LkNN [22], TPR [7], MLRGL [30], CIA [26], and Naive [9]. ProDM and ProWL are designed to replenish the missing labels and to predict protein functions using partially labeled proteins; neither explicitly considers the hierarchical relationship among function labels. LkNN utilizes LinSim in Eq. (1) to predict the functions of unlabeled proteins. TPR uses the true path rule (or hierarchical relationship) in label hierarchies to refine the predictions of binary classifiers trained for each label. We use the weighted version, TPRw, for the experiments. MLRGL uses partially labeled instances in the training set to predict the labels of unlabeled instances. CIA is an iterative algorithm that uses function correlations based on Cosine similarity to infer protein functions. Naive, which ranks functional labels based on their frequencies, is a baseline approach in the community-based critical assessment of protein function annotation [9]. It is reported that very few methods performed above the Naive method. Therefore, we include the Naive method as a reference baseline. More details about the implementations and parameter settings of these methods are reported in the Additional file 1.
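The Naive baseline is simple enough to sketch in a few lines. The snippet below is an illustrative reading of the description (every protein receives the same scores, namely the label frequencies in the training set), not the reference implementation used in the assessment [9]; the label sets are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def naive_scores(train_labels: Iterable[Set[str]]) -> Dict[str, float]:
    """Score each label by the fraction of training proteins annotated with it."""
    label_sets = list(train_labels)
    counts = Counter(label for labels in label_sets for label in labels)
    return {label: count / len(label_sets) for label, count in counts.items()}


# Hypothetical label sets of four training proteins.
train = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
scores = naive_scores(train)

# The same frequency-based ranking is predicted for every test protein.
ranking = sorted(scores, key=scores.get, reverse=True)
```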
The performance of protein function prediction can be evaluated according to different criteria, and the choice of evaluation metrics affects different prediction algorithms differently [9,29]. For a fair and comprehensive comparison, we used five representative metrics, namely MacroF1, MicroF1, AvgROC, RankingLoss and Fmax. These evaluation metrics are extensively applied to evaluate the performance of multi-label learning algorithms and protein function prediction [9,21,29]. The formal definition of these metrics is provided in the Additional file 1. To keep consistency across all evaluation metrics, we use 1-RankLoss instead of RankingLoss. Thus, the higher the value, the better the performance is for all the used metrics. These metrics evaluate the performance of function prediction from different aspects, and thus it is difficult for an algorithm to outperform another technique on all the metrics.
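As an illustration of how the sign flip works, the sketch below computes 1-RankLoss under the common multi-label definition of ranking loss: the fraction of (relevant, irrelevant) label pairs of a protein that are ordered incorrectly, averaged over proteins. The formal definition used in the paper is given in the Additional file 1 and may differ, for instance in how ties are handled; counting ties as errors is an assumption here.

```python
from typing import List, Sequence


def one_minus_rank_loss(y_true: Sequence[Sequence[int]],
                        scores: Sequence[Sequence[float]]) -> float:
    """Return 1 - ranking loss, so that higher values mean better rankings."""
    losses: List[float] = []
    for truth, score in zip(y_true, scores):
        pos = [s for t, s in zip(truth, score) if t == 1]
        neg = [s for t, s in zip(truth, score) if t == 0]
        if not pos or not neg:
            continue  # undefined without both relevant and irrelevant labels
        # A pair is mis-ordered when a relevant label does not outscore an irrelevant one.
        bad = sum(1 for p in pos for q in neg if p <= q)
        losses.append(bad / (len(pos) * len(neg)))
    if not losses:
        raise ValueError("no protein has both relevant and irrelevant labels")
    return 1.0 - sum(losses) / len(losses)


# Hypothetical protein with labels 1 and 3 relevant; one of four pairs is mis-ordered.
y_true = [[1, 0, 1, 0]]
scores = [[0.9, 0.8, 0.3, 0.1]]
value = one_minus_rank_loss(y_true, scores)
```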
Replenishing missing function labels
Results of replenishing missing labels on CollinsPPI wrt
| Metric | m (N_m) | PILL | ProDM | ProWL | LkNN | TPRw | Naive |
|---|---|---|---|---|---|---|---|
| MicroF1 | 1 (1526) | 93.91 ± 0.11 | 83.30 ± 0.30 | 90.31 ± 0.08 | 44.07 ± 0.14 | 50.00 ± 0.12 | 29.00 ± 0.01 |
| MicroF1 | 3 (4330) | 81.70 ± 0.29 | 72.16 ± 0.77 | 78.38 ± 0.23 | 41.61 ± 0.16 | 43.60 ± 0.18 | 29.77 ± 0.13 |
| MicroF1 | 5 (6580) | 70.53 ± 0.31 | 60.10 ± 1.01 | 66.61 ± 0.16 | 37.54 ± 0.21 | 36.79 ± 0.13 | 30.09 ± 0.06 |
| MacroF1 | 1 (1526) | 89.29 ± 0.25 | 69.53 ± 0.40 | 85.75 ± 0.34 | 34.23 ± 0.21 | 43.33 ± 0.15 | 4.70 ± 0.01 |
| MacroF1 | 3 (4330) | 70.19 ± 0.63 | 60.78 ± 1.73 | 69.03 ± 0.46 | 29.23 ± 0.48 | 35.45 ± 0.39 | 5.06 ± 0.04 |
| MacroF1 | 5 (6580) | 55.32 ± 0.94 | 45.37 ± 1.90 | 52.95 ± 0.59 | 24.12 ± 0.62 | 27.06 ± 0.75 | 5.13 ± 0.05 |
| AvgROC | 1 (1526) | 99.47 ± 0.01 | 97.44 ± 0.06 | 98.27 ± 0.09 | 66.14 ± 0.05 | 69.67 ± 0.18 | 49.44 ± 0.00 |
| AvgROC | 3 (4330) | 97.77 ± 0.16 | 93.86 ± 0.44 | 93.35 ± 0.18 | 64.86 ± 0.11 | 64.93 ± 0.21 | 49.44 ± 0.00 |
| AvgROC | 5 (6580) | 94.64 ± 0.33 | 87.03 ± 1.04 | 86.24 ± 0.49 | 63.25 ± 0.34 | 60.41 ± 0.36 | 49.44 ± 0.00 |
| 1-RankLoss | 1 (1526) | 99.43 ± 0.03 | 96.80 ± 0.04 | 98.55 ± 0.05 | 69.38 ± 0.04 | 55.75 ± 0.14 | 79.33 ± 0.04 |
| 1-RankLoss | 3 (4330) | 97.58 ± 0.11 | 92.15 ± 0.26 | 94.62 ± 0.17 | 66.09 ± 0.27 | 46.90 ± 0.47 | 76.72 ± 0.22 |
| 1-RankLoss | 5 (6580) | 94.55 ± 0.27 | 86.63 ± 0.67 | 89.30 ± 0.25 | 59.65 ± 0.65 | 36.88 ± 0.41 | 74.52 ± 0.41 |
| Fmax | 1 (1526) | 90.88 ± 0.07 | 76.28 ± 0.31 | 80.49 ± 0.34 | 42.74 ± 0.09 | 58.43 ± 0.36 | 28.32 ± 0.00 |
| Fmax | 3 (4330) | 76.82 ± 0.10 | 67.39 ± 0.84 | 66.14 ± 0.29 | 42.16 ± 0.25 | 51.12 ± 0.50 | 27.93 ± 0.01 |
| Fmax | 5 (6580) | 66.11 ± 0.50 | 56.26 ± 4.01 | 57.76 ± 0.52 | 40.39 ± 0.37 | 44.01 ± 0.53 | 27.04 ± 0.00 |
From the results reported in these tables, we can observe that PILL outperforms the other competitive methods across all the evaluation metrics in most cases. In summary, out of 90 configurations (3 datasets × 2 kinds of labels × 5 evaluation metrics × 3 settings of m), PILL outperforms ProDM in 85.56% of the cases and ProWL in 91.11% of the cases, ties with them in 4.44% and 4.44% of the cases, and loses to them in 5.56% and 4.44% of the cases, respectively. PILL outperforms LkNN, TPRw and Naive in all configurations. Taking MacroF1 on CollinsPPI annotated with FunCat labels as an example, PILL is on average 23.30% better than ProDM, 4.27% better than ProWL, 147.33% better than LkNN, and 106.61% better than TPRw. These results corroborate the effectiveness of PILL in replenishing the missing labels.
PILL largely outperforms ProDM and ProWL, even though the latter two also leverage the correlation between function labels and the interaction between proteins. The reason is that ProDM and ProWL use the Cosine-based similarity to define the correlation between function labels, and they do not explicitly make use of the hierarchical relationship among labels. Since the label vector implicitly encodes the hierarchical relationship of labels to some extent, ProDM and ProWL achieve results comparable to (or slightly better than) those of PILL in a few cases.
LkNN and TPRw explicitly utilize the hierarchical relationship among labels, but they are not able to compete with PILL, ProDM and ProWL. The cause is twofold: (i) LkNN and TPRw assume that the labels of the labeled proteins are complete, and they use partially labeled proteins to predict missing labels without estimating the missing labels in advance; (ii) they do not utilize the flat relationships among function labels. Naive ranks the functional labels according to their frequency and sets the frequency as the predicted probability for the labels. Since the missing labels are ‘leaf’ functional labels, and their frequencies are smaller than those of the ‘non-leaf’ functional labels, Naive achieves the lowest AvgROC and MacroF1 scores, a medium 1-RankLoss score, and almost the lowest Fmax and MicroF1 scores among the compared methods. Naive performs better than some methods in a few cases, but it is outperformed by PILL by a large margin across all the evaluation metrics. These results show that PILL can exploit the hierarchical and flat relationships among labels to boost the performance of protein function prediction.
Examples of replenished labels for proteins by PILL and their support references
| Protein | Original label | Replenished label | Evidence code | PMID | Date |
|---|---|---|---|---|---|
| YOR206W | GO:0042255, GO:0000054 | GO:0042273 | IMP | PMID:23209026 | 20140502 |
| YGR104C | GO:0045944, GO:0051123, GO:0001113 | GO:0006353, GO:0006369 | IMP | PMID:23476016 | 20140328 |
| YML074C | GO:0051598, GO:0018208, GO:0000412 | GO:0006334 | IDA | PMID:24297734 | 20140523 |
| YBL102W | GO:0006895 | GO:0042147 | IGI | PMID:10406798 | 20140314 |
| YJL102W | GO:0006414 | GO:0032543 | ISS | PMID:19716793 | 20140402 |
Predicting functions for unlabeled proteins
Prediction results on complete unlabeled proteins of CollinsPPI wrt
| Metric | PILL | ProDM | ProWL | LkNN | TPRw | MLRGL | CIA | Naive |
|---|---|---|---|---|---|---|---|---|
| MicroF1 | 47.05 ± 1.24 | 34.44 ± 2.11 | 37.58 ± 1.24 | 32.06 ± 1.21 | 33.79 ± 1.62 | 28.53 ± 0.87 | 33.59 ± 2.19 | 25.47 ± 0.46 |
| MacroF1 | 29.29 ± 3.02 | 16.60 ± 4.67 | 26.11 ± 0.90 | 20.30 ± 1.51 | 22.74 ± 1.96 | 20.58 ± 1.14 | 23.43 ± 1.94 | 1.97 ± 0.04 |
| AvgROC | 77.48 ± 2.25 | 64.37 ± 1.27 | 56.97 ± 1.08 | 64.45 ± 1.82 | 61.54 ± 1.67 | 64.29 ± 1.08 | 57.18 ± 1.24 | 49.74 ± 1.26 |
| 1-RankLoss | 82.64 ± 0.41 | 77.90 ± 3.66 | 64.57 ± 1.82 | 50.10 ± 2.37 | 42.49 ± 2.02 | 39.36 ± 1.08 | 64.07 ± 3.23 | 76.60 ± 0.67 |
| Fmax | 56.57 ± 1.12 | 26.05 ± 7.42 | 16.22 ± 1.00 | 41.60 ± 0.67 | 47.42 ± 2.45 | 40.32 ± 0.74 | 32.19 ± 1.96 | 27.52 ± 0.35 |
From these tables, we can observe that PILL achieves the best results among all the compared methods. PILL, ProDM and ProWL take into consideration the incomplete annotation in the training set, and they often outperform LkNN, TPRw, and CIA. MLRGL considers the incomplete annotation in the training set, but it does not explicitly use the hierarchical relationship between labels. Thus, it loses to the competing algorithms. TPRw post-processes the predictions of binary classifiers according to the true path rule, and sometimes it achieves results comparable to those of PILL. For a fair comparison with the other algorithms, we do not apply the true path rule to refine the predictions made by PILL in Eq. (8). Naive, a baseline and yet competitive approach in the community-based critical assessment of function annotation [9], outperforms some of the compared methods with respect to some metrics (namely 1-RankLoss and Fmax, which are more favorable to the frequency-based ranking than the other metrics). However, Naive is outperformed by PILL by a large margin. Given the superior performance of PILL over Naive, PILL can serve as a valuable method for protein function annotation.
From these results, we can conclude that it is important to utilize the relationships (both hierarchical and flat) among labels, and to explicitly consider the incomplete label problem in protein function prediction. These results also corroborate the effectiveness of PILL in predicting the functions of unlabeled proteins using proteins with incomplete hierarchical labels.
The benefit of using hierarchical and flat relationships between labels
From these figures, we can observe that using LinSim, HSim (a variant of LinSim) or the Jaccard coefficient separately often cannot achieve results comparable to PILL. PILLHsim based on HSim uses the shared ancestor labels, and performs better than PILLLin based on LinSim, which utilizes the common ancestor labels. This fact supports our motivation to define the HSim using the shared ancestor labels instead of the common ones. The superiority of PILL over PILLJcd indicates that hierarchical relationships between function labels are more important than flat relationships. The larger the number of missing labels, the larger the performance margin between PILL and PILLJcd is. These observations support our motivation to use ComSim to exploit both the hierarchical and flat relationships between labels to boost the performance.
The benefit of using function correlation and guilt by association rule
From these results, we can say that using the function correlation or the guilt by association rule separately cannot replenish the missing labels as well as PILL does. PILL-FC often achieves better results than PILL-GbA, which shows that using function correlation alone can replenish the missing labels to some extent. We can therefore conclude that both the function correlations and the guilt by association rule are beneficial for replenishing the missing labels of proteins, and that PILL can jointly utilize these two components to boost the performance of protein function prediction.
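For readers unfamiliar with the guilt by association rule, the following minimal Python sketch shows its simplest form: a protein's candidate labels are scored by the interaction-weighted votes of its neighbors in the PPI network. This is a generic neighbor-voting scheme, not the regularization used by PILL or the PILL-GbA variant. The function name `neighbor_vote`, the data layout, and the toy network are hypothetical.

```python
def neighbor_vote(protein, weights, annotations):
    """Score labels for one protein from its interacting partners.

    weights     -- dict: protein -> {neighbor: interaction weight}
    annotations -- dict: protein -> set of known labels
    Returns a dict label -> score in [0, 1], normalized by total weight.
    """
    scores = {}
    total = 0.0
    for neighbor, w in weights.get(protein, {}).items():
        total += w
        for label in annotations.get(neighbor, ()):
            scores[label] = scores.get(label, 0.0) + w
    if total == 0.0:
        return {}  # isolated protein: no neighbors to vote
    return {label: s / total for label, s in scores.items()}


# Toy network (hypothetical): p1 interacts with p2 and p3.
toy_weights = {"p1": {"p2": 1.0, "p3": 3.0}}
toy_annotations = {"p2": {"f1"}, "p3": {"f1", "f2"}}
votes = neighbor_vote("p1", toy_weights, toy_annotations)
```

A scheme of this kind can only propose labels that some neighbor already carries, which is consistent with the observation above that it benefits from being paired with function correlations.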
Conclusions and future work
In this article, we investigated the seldom-studied (yet important and practical) problem of protein function prediction with partial and hierarchical labels. We proposed an approach, PILL, to replenish the missing labels of partially labeled proteins and to predict functions for completely unlabeled proteins. Our empirical study shows that PILL outperforms a range of related methods and can confidently provide hypothetical missing labels from a large number of candidate labels.
Some methods have been proposed to explore node-based (or edge-based) similarities to measure the semantic similarity of functional labels [4,43]. These methods capture different characteristics of the ontology structure and correlate with protein sequence similarity, PPI networks, and other types of genomic data to some extent. As part of our future work, we are interested in integrating these characteristics of the functional label structure to accurately estimate the missing labels and predict functions for unlabeled proteins.
Declarations
Acknowledgments
The authors thank the anonymous reviewers and editors for their valuable comments on improving this paper. We are also grateful to the authors of the competing algorithms for providing their code for the experimental study. This work is partially supported by the Research Grants Council of Hong Kong (No. 212111 and 212613), Natural Science Foundation of China (No. 61101234 and 61402378), Natural Science Foundation of CQ CSTC (No. cstc2014jcyjA40031), Fundamental Research Funds for the Central Universities of China (No. XDJK2014C044) and Doctoral Fund of Southwest University (No. SWU113034).
References
1. Mi H, Muruganujan A, Casagrande JT, Thomas PD. Large-scale gene function analysis with the PANTHER classification system. Nat Protoc. 2013; 8(8):1551–1566.
2. Zhou H, Jin J, Zhang H, Yi B, Wozniak M, Wong L. IntPath–an integrated pathway gene relationship database for model organisms and important pathogens. BMC Syst Biol. 2012; 6(S2):S2.
3. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014; 30(9):1236–1240.
4. Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009; 5(7):e1000443.
5. Zhou H, Gao S, Nguyen NN, Fan M, Jin J, Liu B, et al. Stringent homology-based prediction of H. sapiens-M. tuberculosis H37Rv protein-protein interactions. Biol Direct. 2014; 9:1–30.
6. Pandey G, Kumar V, Steinbach M, Meyers CL. Computational Approaches to Protein Function Prediction. New York, NY, USA: Wiley-Interscience; 2012.
7. Valentini G. True Path Rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Trans Comput Bi. 2011; 8(3):832–847.
8. Cesa-Bianchi N, Re M, Valentini G. Synergy of multilabel hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Mach Learn. 2012; 88:209–241.
9. Radivojac P, Wyatt TC, Oron TR, Tal RO, Alexandra MS, Tobias W, Artem S, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10(3):221–227.
10. Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN Bioinformatics. 2014; 2014(Article ID 901419):34. doi:10.1155/2014/901419.
11. Barutcuoglu Z, Schapire RE, Troyanskaya OG. Hierarchical multilabel prediction of gene function. Bioinformatics. 2006; 22(7):830–836.
12. Yu G, Rangwala H, Domeniconi C, Zhang G, Yu Z. Protein function prediction using multilabel ensemble classification. IEEE/ACM Trans Comput Bi. 2013; 10(4):1045–1057.
13. Wu J, Huang S, Zhou Z. Genome-Wide Protein Function Prediction through Multi-instance Multi-label Learning. IEEE/ACM Trans Comput Bi. 2014; 99(99):1–10.
14. Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Protein function prediction by integrating multiple kernels. In: Proc of Int Joint Conf on Artificial Intelligence (IJCAI). Beijing, China: AAAI Press: 2013. p. 1869–1875.
15. Cozzetto D, Buchan DW, Bryson K, Jones DT. Protein function prediction by massive integration of evolutionary analyses and multiple data sources. BMC Bioinformatics. 2013; 14(S3):S1.
16. Cao M, Pietras CM, Feng X, Doroschak KJ, Schaffner T, Park J, Zhang H, Cowen LJ, Hescott BJ. New directions for diffusion-based network prediction of protein function: incorporating pathways with confidence. Bioinformatics. 2014; 30(12):i219–i227.
17. Rentzsch R, Orengo CA. Protein function prediction using domain families. BMC Bioinformatics. 2013; 14(S3):S5.
18. Youngs N, Penfold-Brown D, Bonneau R, Shasha D. Negative Example Selection for Protein Function Prediction: The NoGO Database. PLoS Comput Biol. 2014; 10(6):e1003644.
19. Wass MN, Mooney SD, Linial M, Radivojac P, Friedberg I. The automated function prediction SIG looks back at 2013 and prepares for 2014. Bioinformatics. 2014; 30(14):2091–2092.
20. Jiang JQ, McQuay LJ. Predicting protein function by multilabel correlated semi-supervised learning. IEEE/ACM Trans Comput Bi. 2012; 9(4):1059–1069.
21. Zhang ML, Zhou ZH. A Review On Multi-Label Learning Algorithms. IEEE Trans Knowl Data En. 2014; 26(8):1819–1837.
22. Pandey G, Myers CL, Kumar V. Incorporating functional interrelationships into protein function prediction algorithms. BMC Bioinformatics. 2009; 10:142.
23. Lin D. An Information-Theoretic Definition of Similarity. In: Proc of Int Conf on Machine Learning (ICML). Madison, Wisconsin, USA: Morgan Kaufmann: 1998. p. 296–304.
24. Zhang XF, Dai DQ. A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM Trans Comput Bi. 2012; 9(3):740–753.
25. Wang H, Huang H, Ding C. Function–function correlated multilabel protein function prediction over interaction networks. J Comput Biol. 2013; 20(4):322–343.
26. Chi X, Hou J. An iterative approach of protein function prediction. BMC Bioinformatics. 2011; 12:437.
27. Sun YY, Zhang Y, Zhou ZH. Multilabel learning with weak label. In: Proc of AAAI Conf on Artificial Intelligence (AAAI). Atlanta, Georgia, USA: AAAI Press: 2010. p. 293–598.
28. Yang SJ, Jiang Y, Zhou ZH. Multi-instance multilabel learning with weak label. In: Proc of Int Joint Conf on Artificial Intelligence (IJCAI). Beijing, China: AAAI Press: 2013. p. 1862–1868.
29. Yu G, Rangwala H, Domeniconi C, Zhang G, Yu Z. Protein Function Prediction with Incomplete Annotations. IEEE/ACM Trans Comput Bi. 2014; 11(3):579–591.
30. Bucak SS, Jin R, Jain AK. Multilabel learning with incomplete class assignments. In: Proc of IEEE Conf on Computer Vision and Pattern Recognition (CVPR). Colorado Springs, Colorado, USA: IEEE: 2011. p. 2801–2808.
31. Yu G, Domeniconi C, Rangwala H, Zhang G. Protein Function Prediction Using Dependence Maximization. In: Proc of European Conf on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD). Prague, Czech Republic: Springer: 2013. p. 574–589.
32. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539–5545.
33. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000; 25:25–29.
34. Tao Y, Sam L, Li J, Friedman C, Lussier YA. Information theory applied to the sparse gene ontology annotation network to predict novel gene function. Bioinformatics. 2007; 23(13):i529–i538.
35. Jiang X, Nariai N, Steffen M, Kolaczyk ED. Integration of relational and hierarchical network information for protein function prediction. BMC Bioinformatics. 2008; 9:350.
36. Schietgat L, Vens C, Struyf J, Blockeel H, Kocev D, Džeroski S. Predicting gene function using hierarchical multilabel decision tree ensembles. BMC Bioinformatics. 2010; 11:2.
37. Schwikowski B, Uetz P, Fields S. A network of protein–protein interactions in yeast. Nat Biotechnol. 2000; 18(12):1257–1261.
38. Wang J, Wang F, Zhang C, Shen HC, Quan L. Linear neighborhood propagation and its applications. IEEE Trans Pattern Anal. 2009; 31(9):1600–1615.
39. Sharan R, Ulitsky I, Shamir R. Network-based prediction of protein function. Mol Syst Biol. 2007; 3(1). doi:10.1038/msb4100129.
40. Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics. 2005; 21(S1):i302–i310.
41. Chua HN, Sung WK, Wong L. Exploiting indirect neighbours and topological weight to predict protein function from protein–protein interactions. Bioinformatics. 2006; 22(13):1623–1630.
42. Pandey G, Arora S, Manocha S, Whalen S. Enhancing the Functional Content of Eukaryotic Protein Interaction Networks. PLoS ONE. 2014; 9(10):e109130.
43. Xu Y, Guo M, Shi W, Liu X, Wang C. A novel insight into Gene Ontology semantic similarity. Genomics. 2013; 101(6):368–375.
44. Yu G, Zhu H, Domeniconi C. Supplementary files for ‘predicting protein functions using incomplete hierarchical labels’. 2014. [https://sites.google.com/site/guoxian85/home/pill]
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.