Skip to main content

Machine learning for discovering missing or wrong protein function annotations

A comparison using updated benchmark datasets



A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information.


The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were less significant differences among the methods.


The experiments have showed that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies, nonetheless the old versions of the datasets should not be disregarded since other tasks in machine learning could benefit from them.


Due to technological advancements, the generation of proteomic data has increased substantially. However, annotating all sequences is costly and time-consuming, making it often unfeasible [1]. As a countermeasure, recent studies have employed machine learning methods due to their capacities of automatically predicting protein functions.

More specifically, protein function prediction is generally modeled as a hierarchical multi-label classification (HMC) task. HMC is a classification task whose objective is to fit a predictive model f which maps a set of instances X to a set of hierarchically organized labels Y, while respecting hierarchy constraints among Y [2, 3]. The hierarchy constraint states that whenever a particular label yi is predicted, all ancestors labels of yi up to the root node of the hierarchy must be predicted as well.

In the machine learning literature when proposing a new method, this method is typically compared to a set of competitor methods on benchmark datasets. For HMC, many studies [222] utilized the benchmark datasets proposed in [2]. These datasets are available at and contain protein sequences from the species Saccharomyces cerevisiae (yeast) whose functions are mapped to either the Functional Catalogue (FunCat) [24] or Gene Ontology (GO) [23]. The task associated with these datasets is to predict the functions of a protein, given a a set of descriptive features (e.g., sequence, homology or structural information).

FunCat and GO are different types of hierarchies. In FunCat (Fig. 1), labels are structured as a tree, meaning that they can have only a single parent label [24]. The GO (Fig. 2), however, allows labels to have multiple parent labels, forming a directed acyclic graph [23]. This complicates the fulfillment of the hierarchy constraint, since multiple classification paths are allowed throughout the graph.

Fig. 1
figure 1

Partial representation of the FunCat. Each node represents a protein function, and each node can only have a single parent node

Fig. 2
figure 2

Partial representation of the Gene Ontology. Each node represents a term, and terms can have multiple parent terms

These benchmark datasets were introduced to the HMC community in 2007, and, thus, the functional labels associated with each protein can be considered outdated. There are two reasons for this. First, functional annotations are updated on a regular basis. Second, as can be seen in Fig. 3a, there was a drastic increase in the number of terms throughout the Gene Ontology since the creation of these datasets (January 2007). A similar observation can be made for the number of obsolete terms as shown in Fig. 3b. Accordingly, one of the main goals of this article is to provide updated versions of these widely used HMC benchmark datasets to the research community.

Fig. 3
figure 3

Quantification of terms in the Gene Ontology since 2007. a Number of terms in the Gene Ontology. b Number of obsolete terms in the Gene Ontology

Using these new datasets, we present a comparison among four recent and open-source HMC methods that can be considered state-of-the-art,thus providing baseline performances as guidelines for future research on this topic. Finally, having two different versions of the same datasets provides us with the unique opportunity to be able to evaluate whether these HMC methods are able to generalize when learning from data with mislabeled instances. In particular, we evaluate whether they were able to predict the correct label in cases where the label has been altered since 2007. In order to do so, we propose an evaluation procedure where a predictive model is trained using the data from 2007, but tested with data from 2018.

The major contributions of this work are the following: i) We provide new benchmark datasets for HMCFootnote 1; ii) We provide baseline results for the new datasets; iii) We provide an evaluation procedure and results that evaluate whether HMC methods are able to discover new or wrong annotations.

The remainder of this article is organized as follows. “Related work” section presents an overview of studies on HMC which have used the functional annotation benchmark datasets proposed in 2007. “Updated datasets” section provides a description on how the datasets were updated, together with a quantification of new labels and annotations. In “Results” section, we present the results of our experiments. In “Discussion” section, we discuss our results. In “Conclusion” section we present our conclusion. Finally, “Methods” section contains the HMC methods employed and the evaluation strategies;

Related work

In this section, we provide a literature overview of studies that have used the datasets addressed in this work, and a brief review on hierarchical multi-label classification applications. In Table 1, we present studies which have used the FunCat and GO datasets.

Table 1 Review on HMC studies which used FunCat and GO datasets

In the HMC literature, methods are separated into two approaches: local and global. The difference between these approaches relies on how their predictive models are designed. The local approach employs machine learning decompositions where the task is divided into smaller classification problems, then the solutions of the sub-problems are combined to solve the main task. As an advantage, any predictive model, or even an ensemble of models, can be incorporated into the solution.

According to Silla and Freitas [33], the local approach is further divided into three strategies: Local Classifier per Level [3, 5, 14, 25, 30], Local Classifier per Node [7, 9] and Local Classifier per Parent Node [11, 16]. As their name suggest, these strategies train a predictive model for each level, node or parent node of the hierarchy, respectively. Allowing many types of decomposition is particularly interesting, since different problems may require different solutions. For instance, when handling large hierarchies, the usage of the Local Classifier per Parent Node and Local Classifier per Node result in a large number of classifiers being trained, making the Local Classifier per Level strategy more computationally efficient as it requires only one predictive model per level. However, the hierarchy may contain many labels per level, forcing the models to distinguish among them, and possibly making the task more difficult.

Using several strategies, Cerri and De Carvalho [32] investigated how problem transformation methods from the non-hierarchical multi-label literature, which decompose the task into smaller problems similarly to the local approach, behave on the HMC context using Support Vector Machines. Cerri et al. [3, 14, 30] use the Local Classifier per Level by training one neural network for each level of the hierarchy where prediction probabilities of the previous level are used as extra attributes for the neural network associated to the next level. Wehrmann et al. [5] extended this idea with an extra global loss function, allowing gradients to flow across all neural networks. Li [34] proposed to use this strategy with deep neural networks to predict the commission number of enzymes. In a follow up work, Zou et al. [35] extended this method by enabling the prediction of multi-functional enzymes.

The work of Feng et al. [9] proposed to use the Local Classifier per Node strategy by training one Support Vector Machine for each node of the hierarchy combined with the SMOTE oversampling technique. This work was slightly improved in Feng et al. [7] where the Support Vector Machines were replaced by Multi-Layer Perceptron and a post-prediction method based on Bayesian networks was used. Also using Support Vector Machines, the studies of Bi and Kwok [12, 20] proposed new loss functions specific for HMC which were optimized using Bayes optimization techniques. On a similar manner, Vens et al. [2] proposed to train Predictive Clustering Trees, a variant of decision trees which create splits by minimizing the intra-cluster variance, for each node, and also an alternative version where one predictive model is trained per edge.

Ramirez et al. [11, 16] employed the Local Classifier per Parent Node by training one predictive model per parent node of the hierarchy and augmenting the feature vectors with predictions from ancestors classifiers. On a similar note, Kulmanov et al. [36] proposed to train a predictive model for each sub-ontology of the Gene Ontology, combining features automatically learned from the sequences and features based on protein interactions.

Differently from the local approach, the global one employs a single predictive model which is adapted to handle the hierarchy constraint and relationships among classes. When compared to the local approach, the global one tends to present lower computational complexity, due to the number of models trained. However, its implementation is more complex, since traditional classifiers can not be used straightforwardly. The global approach is further divided into two strategies: algorithm adaptation and rule induction.

As its name suggests, the algorithm adaptation strategy consists of adapting a traditional algorithm to handle hierarchical constraints. Masera and Blanzieri [6] created a neural network whose architecture incorporates the underlying hierarchy, making gradient updates flow from the neurons associated to the leaves up neurons associated to their parent nodes; Sun et al. [8] proposed to use Partial Least Squares to reduce both label and feature dimension, followed by an optimal path selection algorithm; Barros et al. [17] proposed a centroid based method where the training data is initially clustered, then predictions are performed by measuring the distance between the new instance and all clusters, the label set associated to the closest cluster is given as the prediction; Borges and Nievola [31] developed a competitive neural network whose architecture replicates the hierarchy; Vens et al. [2] also proposed to train a single Predictive Clustering Tree for the entire hierarchy; as an extension of [2], Schietgat et al. [21] proposed to use ensemble of Predictive Clustering Trees; Stojanova et al. [18] proposed a slight modification for Predictive Clustering Trees in which the correlation between the proteins is also used to build the tree.

In the rule induction strategy, optimization algorithms are designed to generate classification rules which consist of conjunctions of attribute-value tests, i.e. many ifthen tests connected by the boolean operator . In this regard, several studies from Cerri et al. [4, 15, 19] proposed to use Genetic Algorithms with many different fitness functions. Similarly, other optimization algorithms such as Ant Colony Optimization [10, 22] and Grammar Evolution [29] were also investigated in this context.

Additionally, some studies have also addressed similar topics to HMC. For instance, Cerri et al. [25] examined how Predictive Clustering Trees can be used to perform feature selection using Neural Networks and Genetic Algorithms as base classifiers. Almeida and Borges [26] proposed an adaptation of K-Nearest Neighbours to address quantification learning in HMC. Similarly, Triguero and Vens [27] investigated how different thresholds can increase the performance of Predictive Clustering Trees in this context.

Other application domains have also explored HMC, such as managing IT services [37, 38], text classification on social media [39], large scale document classification [40] and annotation of non-coding RNA [41]. It can even be applied to non-hierarchical multi-label problems where artificial hierarchies are created [42].

Updated datasets

In this section, we present an overall description of the datasets and their taxonomies, followed by details on how we updated both FunCat and Gene Ontology versions. The resulting updated versions are available at

Overall description

Clare [43] originally proposed 12 datasets containing features extracted from protein sequences of the organism Saccharomyces cerevisiae (yeast) whose targets are their protein functions. These 12 datasets contain largely the same proteins, nonetheless differ in their descriptive features. Furthermore, these datasets are divided into train, test and validation sets.

It is known that the yeast and human genomes have many similar genes, furthermore yeast is considerably cheaper and experiment-wise efficient when compared to other species, making it a widely addressed subject in bioinformatics applications [44]. In Table 2, we provide more information about these datasets.

Table 2 Statistical information on the 2007 datasets

The Hom dataset presents information between analogous (similar) yeast genes. Using an homology engine, such as BLASTn Footnote 2, other similar yeast genes are discovered. Then, properties between the sequences from the dataset and their analogous ones are measured. The Pheno dataset contains phenotype data based on knock-out mutants. Each gene is removed to form a mutant strain, and the corresponding change in phenotype as compared to the wild type (no mutation) is observed after growing both strains on different growth media. The Seq dataset stores features extracted from the amino acid sequences of the proteins, such as molecular weight, length and amino acid ratios. As its name suggests, the Struc dataset contains features based on the second structure of the proteins annotated in a binary format. In the case of an unknown structure, the software PROF [45] was used to predict it. Known structures were promptly annotated. All the other datasets were constructed based on the expression of genes recorded across an entire genome using microchips [43].

As an extension to these datasets, Vens [2] mapped the targets to the Gene Ontology taxonomy. Additionally, the FunCat annotations used by Clare [43] were updated.

FunCat is an organism independent functional taxonomy of proteins functions which is widely adopted throughout bioinformatics. As shown in Fig. 1, FunCat places generic functions in high levels of the taxonomy, then it sequentially divides such functions into specific ones, forming a tree-shaped hierarchy where each function has one ancestor function. From the machine learning perspective, FunCat is used as an underlying hierarchy of labels. Thus, each protein function is addressed as a label in a classification task where the relationships established by FunCat are taken in account.

Similarly, the Gene Ontology (GO) is a taxonomy whose main goal consists of defining features of genes in an accurate and species independent fashion [23]. More specifically, the GO is composed of three sub-ontologies: molecular function, cellular component and biological process. The molecular function sub-ontology contains information about activities performed by gene products in the molecular-level. The cellular component sub-ontology, as its name suggests, describes the locations where gene products perform functions. Finally, the biological process sub-ontology annotates processes performed by multiple molecular activities.

All information in the GO is described using terms which are nodes with an unique ID, a description and their relationship with other terms. Due to these relationships, the GO is defined as a directed acyclic graph in the machine learning literature, making it a challenging task due to the substantial high number of terms, and many intrinsic relationships among them. Figure 2 presents a small part of the GO.

FunCat update

In order to update these datasets, we have performed the procedure described in Fig. 4. Using the IDs from the sequences, we have queried UniProt, obtaining new annotated functions for the sequences. Next, we built the hierarchy of each dataset, and replaced the old annotations by the new ones, i.e. we have removed entirely the annotations from 2007, and concatenated the new annotations with the original features. Mind that each dataset described in Table 2 uses a slightly different FunCat subset. The hierarchies differ between the datasets, because the protein subset differs as seen in Table 2, since not every protein can be found in every original dataset by Clare.

Fig. 4
figure 4

Procedure used to update each FunCat dataset. The sequence IDs are extracted from the 2007 dataset, and used to query new annotations using UniProt. A hierarchy (subset of FunCat) is built using the new annotations. Finally, the old annotations are removed, and the new dataset is created by concatenating the new annotations with the feature vector and IDs

In Table 3, we compared the 2007 datasets with the 2018 versions w.r.t. their label set. There was a significant increase in the number of labels across the hierarchy. More specifically, in the third and fourth level where the mean number of labels has increased from 175 to 208 and 140 to 168 respectively. A smaller increase is also noticeable in the first, second and last level.

Table 3 Comparison between the number of labels per level in FunCat 2007 and FunCat 2018

In Table 4, we presented for each dataset the number of instances with annotations per level. In this case, there was a slight increase in deeper levels, whereas the mean number of annotated instances on the second and third level has decreased in all datasets.

Table 4 Comparison between the number of annotated instances per level for FunCat 2007 and FunCat 2018

Further, we compared the number of annotations per level between the versions from 2007 and 2018 in Table 5. There was a considerable increase in the number of annotations across all levels of the hierarchy. The last level seemed remarkable, as its number of annotations is significantly low in both versions.

Table 5 Comparison between the number of annotations per level in FunCat 2007 and FunCat 2018

When analyzing the number of annotations that were added and removed in Table 6, the second level presented a higher average number of new annotations despite having fewer annotated instances now. Noticeable increases were also noticed in the third and fourth level.

Table 6 Comparison between added and removed annotations in FunCat 2007 and FunCat 2018 per level

Gene ontology update

In order to update these datasets, we have performed the procedure shown in Fig. 5.

Fig. 5
figure 5

Procedure used to update each Gene Ontology dataset. The sequence IDs are extracted from the 2007 dataset, and used to query new terms using UniProt. Obsolete and replaced terms are removed and merged into a single term, respectively. A hierarchy (subset of the Gene Ontology) is built using the new annotations. Finally, the old annotations are removed, and the new dataset is created by concatenating the new annotations with the feature vector and IDs

Initially, we queried Universal Protein (UniProt) using the IDs from the protein sequences using their web serviceFootnote 3, obtaining the GO terms associated to each sequence. Next, we preprocessed the queried terms. The GO keeps track of alternate (secondary) IDs which are different labels with identical meaning, hence we have merged them into a single label. Similarly, we have also removed obsolete annotations since they are deprecated and should not be used anymore. Finally, the old annotations were entirely removed, and the new ones were concatenated to the feature vector. Recall that we are not considering the first level of the Gene Ontology, since it contains 3 root terms which are present in all instances. Further, as for FunCat, each dataset contains only a subset of the entire Gene Ontology.

Mind that since the GO is a directed acyclic graph, annotations can belong to multiple levels. In order to present statistics about these datasets, we are considering the deepest path to determine the level for all labels in Tables 7, 8, 9 10.

Table 7 Comparison between the number of labels per level in Gene Ontology 2007 and Gene Ontology 2018
Table 8 Comparison between the number of annotated instances per level Gene Ontology 2007 and Gene Ontology 2018
Table 9 Comparison between the number of annotations per level in Gene Ontology 2007 and Gene Ontology 2018
Table 10 Comparison between the number of annotations added and removed in Gene Ontology 2007 and Gene Ontology 2018 per level

As shown in Table 7, there was a similar behaviour as in the FunCat update. There was a substantial increase in the number of labels throughout all levels, specially in the levels between the third and the twelfth. Two extra levels were added, making a total of 15, nonetheless there are only few classes in these levels.

We observed an overall increase in the number of instances per level throughout the hierarchies (Table 8). There were no remarkable decreases. We have noticed that only the validation and test datasets contain instances on the last level of the hierarchy. From the machine learning perspective, such condition might hinder predictive models, as most of them are not capable of predicting a class which is not present in the training dataset. Possibly, future studies might consider removing the last level. Difficulties might also emerge on the fourteenth level, as the datasets have very few instances on it.

As seen in Table 9, once again there was an increment in the number of annotations per level. The number of annotations gradually increases up to a certain level, until it decreases to almost none when it reaches the deepest levels.

When examining the number of annotations that are added or removed per level (Table 10), we can perceive once again an overall increment in all datasets. Naturally, no labels were removed on the fourteenth and fifteenth level as they were not present in the 2007 versions.


Initially, we present a standard evaluation among the HMC methods. Next, we also present an alternative evaluation where the HMC methods are compared w.r.t. their ability to discover new or wrong annotations.

Standard evaluation

In Table 11, we present a comparison of the PooledAUPRC obtained using the standard evaluation procedure. Since HMC-LMLP, HMC-GA and AWX are stochastic, we report the mean result of 5 runs, together with the standard deviation. Mind that, since we reran all methods on our datasets, variations may occur compared to the originally reported results in the respective papers.

Table 11 Pooled AUPRC of the evaluated methods

Even though Clus-Ensemble is the oldest of the compared methods, it still provided better results in most of the experiments. This is best seen in the FunCat 2018 datasets where Clus-Ensemble consistently presented results close to 0.4, and the second best method, HMC-LMLP, achieves at most 0.24 in any of the datasets. As can be seen in Fig. 6, Clus-Ensemble was the overall best method, and performs statistically significantly better than HMC-GA and AWX.

Fig. 6
figure 6

Friedmann-Nemenyi test evaluating the four HMC methods using the standard evaluation procedure

The second method evaluated, HMC-GA, yielded a lower performance overall. In most of the cases, HMC-GA was superior to AWX, but still inferior to Clus and HMC-LMLP. The method HMC-LMLP provided decent results. When compared to AWX, HMC-LMLP managed to significantly outperform it. Furthermore, HMC-LMLP was ranked as the second best method overall, providing superior results in all of the Gene Ontology 2007 datasets.

An unusual behaviour was noticed in the AWX method as it yielded very undesired results in many occasions. Even though the parameter values were extracted from the original paper, its results were fairly different. For instance, in the Derisi, Seq and Spo datasets from all versions, AWX was severely underfitted with results inferior to 0.1. It also presented similar cases in the FunCat and Gene Ontology 2007 Expr datasets.

When comparing the performance between different versions of the datasets, we noticed an overall improvement in the methods when moving from 2007 to 2018. Even though their label sets are bigger now, the addition of annotations to the instances compensate such difference, which resulted in better performances.

vs 2018

Here we evaluate how the HMC methods perform when trained using data from 2007, but evaluated using datasets from 2018. For the methods HMC-LMLP, HMC-GA and AWX, for each (instance,label) pair we have used the mean prediction probability of 5 runs.

For all figures presented here, we also include a boxplot for the (instance,label) pairs that did not change between the two dataset versions. This allows to see to what extent the methods can detect annotations that were falsely negative or falsely positive in the data of 2007. The number between parentheses corresponds to the number of (instance,label) pairs evaluated for a particular setting and dataset. Note that the number of unchanged pairs is much higher than the number of changed pairs, hence the outliers (prediction probabilities outside the whisker) should not be regarded.

Furthermore, we have also employed the Friedman-Nemenyi test to provide statistical validation. In this case, we have used the difference between the median of the prediction probabilities for the annotations that changed and those that did not change between the two dataset versions.


Figure 7 demonstrates that all methods are capable to detect missing annotations from the FunCat taxonomy, i.e., the distribution of prediction probabilities for the changed annotations is consistently higher than for the annotations that remained negative, since there is a visible difference between the location (median) and spread in the boxplots of the changed and unchanged annotations of the evaluated methods.

Fig. 7
figure 7

Evaluation on annotations that were added (0 to 1) and on annotations that did not change (0 in both versions) for FunCat. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch 2 datasets. c Seq, Spo, Hom and Struc datasets

Clus-Ensemble and HMC-GA provided similar results, however Clus-Ensemble was slightly superior since its prediction probabilities tended to be higher. Moreover, when evaluating the labels that did not change (remained absent), Clus-Ensemble provided very low prediction probabilities. In Fig. 8, Clus-Ensemble was ranked first, however not statistically different from HMC-GA and HMC-LMLP.

Fig. 8
figure 8

Friedman-Nemenyi test evaluating annotations that were added (FunCat)

Similarly, the AWX method managed to be superior in the Hom dataset. However, it underperformed in other datasets, specially in Derisi, Expr, Seq and Spo. In these datasets, AWX predicted almost all annotations to be absent, except for very few outliers, which received a very high prediction probability.

HMC-LMLP presented decent results in almost all datasets. Nonetheless, for labels that did not change, HMC-LMLP tended to provide higher prediction probabilities, whereas Clus-Ensemble yielded lower ones, giving Clus-Ensemble an advantage over HMC-LMLP.

Hence, in the context of discovering new annotations, we can assume that Clus-Ensemble is the safer choice as it performed better on almost all datasets, nonetheless its advantage was close to minimal.

When addressing labels that were removed, see Fig. 9, we had very similar results. As seen in Fig. 10, HMC-GA provided superior results, but it still was not statistically different from Clus-Ensemble and HMC-LMLP. AWX yielded lower prediction probabilities in most of the datasets with exception to the Hom dataset. Since its prediction probabilities were also low for labels that were present in both versions of the datasets, it performs the worst among the compared methods.

Fig. 9
figure 9

Evaluation on annotations that were removed (1 to 0) and on annotations that did not change (1 in both versions) for FunCat. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 10
figure 10

Friedman-Nemenyi test evaluating annotations that were removed (FunCat)

Gene ontology

As can be seen in Fig. 11, Clus-Ensemble and HMC-GA were superior in most of the datasets. Additionally, the AWX method also presented desirable results, specially in the Derisi and Seq datasets where it output very high probabilities for added annotations and very low ones for labels that did not change. These three methods were not statistically different from each other, as shown in Fig. 12.

Fig. 11
figure 11

Evaluation on annotations that were added (0 to 1) and on annotations that did not change (0 in both versions) for GO. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 12
figure 12

Friedman-Nemenyi test evaluating annotations that were added (GO)

The HMC-LMLP method also presented overall visually comparable results, nonetheless it yielded higher predictions for annotations that did not change in some datasets, such as Expr, Gasch1 and Gasch2.

When examining the labels that were removed in Fig. 13, we noticed a different outcome. In this case, all methods presented very similar results, making performance almost indistinguishable in most of the datasets. Additionally, there was no statistical difference among these methods, as shown in Fig. 14.

Fig. 13
figure 13

Evaluation on annotations that were removed (1 to 0) and on annotations that did not change (1 in both versions) for GO. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 14
figure 14

Friedman-Nemenyi test evaluating annotations that were removed (GO)


In this section, we present a discussion about the results presented in the previous section. Following the same order, we first address the standard evaluation, followed by the comparison between the versions of the datasets.

Standard evaluation

As shown in Fig. 6, Clus-Ensemble’s superior predictive performance, in combination with an efficient learning method (random forest), the ability to handle datasets with many features (as seen in the Struc and Hom datasets), and the interpretabilty aspect (e.g. variable ranking and proximity measure associated to random forests), confirm the state-of-the-art status of Clus-Ensemble.

We believe that the ensemble method, random forest, contributes substantially to the performance. By considering many models, Clus-Ensemble is able to generalize more, and consequently provide superior results. The other methods evaluated do not make use of any ensemble method. Even though HMC-LMLP contains many neural networks, they are trained as a single model, and they distinguish between different classes.

HMC-GA provided inferior results in many cases, nonetheless it has the highest interpretability since it generates classification rules. Similarly, Clus-Ensemble presents many trees, which are readable by themselves, however their interpretability decreases as the number of trees increases. Differently, the neural networks, HMC-LMLP and AWX, are black-box models, and thus not readable in a straightforward way.

When comparing the neural network methods, HMC-LMLP and AWX, HMC-LMLP clearly had the upper hand. We believe that this is due to HMC-LMLP being a local approach, whereas AWX is a global one. Since one neural network is trained for each level of the hierarchy, the neural networks are trained to distinguish among fewer classes, making the classification task easier, and, thus, providing better results. The computational complexity of HMC-LMLP, however, is considerably higher than the other methods due to many neural networks being built during its training.

Despite some undesirable results, AWX is the only method that explicitly exploits the hierarchy constraint by propagating gradients from neurons associated to leaves to neurons associated to their parents. Mind that the other methods also respect the constraint, but they exploit it to a smaller extent during their training.

Moreover, we believe that AWX’s early stopping criterion has negatively affected the results. in order to prevent overfitting, AWX interrupts the training right after the performance in the validation set decreases. However, these datasets contain noise in their label set, thus a small oscillation might be noticed. Considering more iterations, as performed by HMC-LMLP, could possibly increase AWX’s performance. Moreover, neural networks are very parameter dependent, and despite using the recommended parameters for all methods on the version from 2007, their performance might increase if they are tuned again on the 2018 datasets.

vs 2018


As described previously, when analyzing labels that changed from absent to present (0 to 1), Clus-Ensemble had the overall best results, whereas HMC-GA was the best for present to absent (1 to 0). We believe that this finding is highly correlated to how the evaluated methods yield their prediction probabilities.

Clus-Ensemble outputs the mean prediction probability of the instances associated to the leaf node predicted. According to the parameters used, the minimum number of such instances is 5, making the lowest positive prediction probability to be 0.2 per tree. Even though fairly low, it still is reasonably high in HMC due to label sparsity, resulting in high prediction probabilities in many cases, and thus better performance.

Likewise, the HMC-GA method yielded high prediction probabilities in some cases, resulting in similar results to Clus. Moreover, their heuristic (variance reduction) is the same. The main difference between HMC-GA and Clus-GA relies on the fact that HMC-GA uses a mean rule (prediction of the mean label set of the training dataset) whenever a test instance is not classified by any of the rules. This possibly results in outputting a sparse prediction with very low prediction probabilities.

Despite having decent results, HMC-LMLP presented high very prediction probabilities for labels that did not change between versions. We believe that this is related to how neural networks learn the distribution of the data. Since neural networks are very powerful models, they can learn more complex boundaries when compared to Clus-Ensemble and HMC-GA, resulting in the neural networks adjusting themselves strictly to the training dataset. HMC-LMLP is not overfitted though, as shown in Table 11, nonetheless its usage is not recommended if label noise is likely to be present.

Lastly, AWX had the best performance in the Hom dataset. However, it underperformed in several other cases. Once again, the early stopping criterion might have forced the neural network to a sub-optimal configuration, resulting in very biased predictions, i.e. AWX assumes most of the labels to be either positive or negative.

When evaluating labels that were removed, HMC-GA was superior. We believe that the mean rule might have artificially contributed since very low probabilities are predicted for most of the labels in this case.

Gene ontology

In the GO datasets, we noticed a similar behaviour. In most of the situations, Clus-Ensemble performed better when evaluating labels that were added, whereas HMC-GA was superior for removed labels.

When it comes to removed labels, HMC-GA performed better. Consequently, we recommend the usage of HMC-GA to predict which annotations are likely to be removed in future versions of the datasets (noise) since it presented better results in both FunCat and GO.

Similarly to the FunCat experiments, HMC-LMLP had an average performance being statistically significantly inferior to other methods, but equivalent to them for removed labels.

When compared to its performance on FunCat, AWX performed better here. For labels that were added, even though ranked in lower positions, AWX managed to not be statistically significantly different from Clus-Ensemble and Clus-HMC. Likewise, for removed labels, AWX also performed reasonably. This is very surprising since GO datasets have even more labels to be distinguished, and the same parameters were used.


In this work, we have presented updated benchmark datasets for hierarchical multi-Label classification (HMC) in the area of protein function prediction. We have also performed a comparison among four HMC methods to provide baselines results on these datasets. Finally, we have proposed an alternative evaluation procedure to evaluate the ability of HMC methods to detect missing or wrong annotations. For this purpose we make use of both old and new versions of the datasets.

In all datasets, we have noticed a significant increase in the hierarchy size, and in the number of annotations associated to instances. As a consequence of that, when performing a standard evaluation, HMC methods performed better using the updated versions. Despite having more labels to distinguish, the instances have now more annotations associated to them, resulting in better predictions. The overall best method in this task was Clus-Ensemble, a random forest of decision trees adapted to HMC, nonetheless the results remained fairly low overall. Thus, protein function prediction is still a very challenging task for the machine learning community.

In this direction, further studies in this area are necessary. In particular, we instigate the use of Deep Learning methods, since the amount of data available is on a constant increase, and recent deep neural networks are capable of learning straight from DNA sequences (without the need of extracting features) [46].

When it comes to detect missing or wrong annotations, in the FunCat datasets, Clus-Ensemble was the best in detecting missing annotations, whereas HMC-GA did better for annotations that were removed. In the Gene Ontology datasets, Clus-Ensemble performed better for detecting missing annotations, and competitive results were obtained for wrong annotations.

To conclude, we recommend to use the updated datasets in future studies on this topic. However, the previous version of these datasets should not be disregarded, since having two versions can be of interest to perform an evaluation similar to ours on new HMC methods, or to other fields in machine learning such as weakly supervised classification, noise detection and incremental learning [47, 48].


In this section, we provide details about our experimental setup. First, we present the methods used for comparison. Then we describe two evaluation strategies. Finally, we explain which datasets were included in the evaluation.

Compared methods

We have compared 4 methods from the literature: Clus-Ensemble [2, 21], hierarchical multi-label classification with genetic algorithm (HMC-GA) [4, 19], hierarchical multi-label classification with local multi-layer perceptrons (HMC-LMLP) [3], and Adjacency Wrapping matriX (AWX) [6]. The methods were chosen due to the following reasons: 1) Apart from Clus-Ensemble, they are recent methods. Clus-Ensemble is included because it is used as the state-of-art benchmark in many studies; 2) They are based on different machine learning methods and HMC strategies, ranging from global to local approaches and from interpretable tree or rule based methods to more powerful, but black box techniques; 3) They are publicly available. Next, we provide a brief description of these methods, and details about their parameters. We have set the parameters to the values originally recommended by the authors.


Clus is a method from the global approach based on predictive clustering trees where decision trees are seen as a hierarchy of clusters whose top node corresponds to a cluster with all the training data. Recursively, Clus minimizes the intra-cluster variance until a stop criterion is met. In this work, we have used the (global) Clus-HMC variant due to its superior results, in combination with the ensemble method Random Forest. Hence, this predictive model consists of a Random Forest of Predictive Clustering Trees. We are using 50 trees within the Random Forest, at least 5 instances per leaf node and the best F-test stopping criterion significance level selected from {0.001,0.005,0.01,0.05,0.1,0.125}.


Using genetic algorithms and the global approach, the method hierarchical multi-label classification with genetic algorithm use a sequential rule covering method where optimal classification rules are created [4, 19]. At every iteration, one rule in the format ifthen is generated by optimizing the fitness function. Next, the examples covered by the new rule are removed from the training dataset, and new rules are generated until a stop criterion is met. We have used the following parameters:

  • Population size : 100 rules;

  • Number of Generations: 1000;

  • Stopping criterion: 1% of uncovered examples;

  • Crossover rate: 90%;

  • Mutation rate: 10%;


The method proposed by Cerri [3] addresses the classification problem using the Local approach. More specifically, the Local Classifier per Level strategy where one multi-layer perceptron is trained for each level of the hierarchy. Thus, each neural network is responsible for predicting the classes on its respective level. Moreover, this method adds prediction probabilities from the previous level as extra features for the next neural network, in the sense that each neural network is trained separately and its training dataset is augmented by the previous neural network. Finally, the predictions from each neural network are combined to perform a prediction. If the performance in the validation dataset does not improve in 10 iterations, the training is interrupted.

We have used the following parameters:

  • Hidden Layers Size: the number of neurons per hidden layer is obtained by multiplying the number of inputs by the values [0.6,0.5,0.4,0.3,0.2,0.1] for the FunCat datasets and [0.65,0.65,0.6,0.55,0.5,0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1] for the GO datasets;

  • Activation Function: Logistic (sigmoid) activation function;

  • Optimizer: Backpropagation with 200 epochs and learning rate {0.05,0.03} and momentum {0.03,0.01} alternating between levels;


Using neural networks and the global approach, the method Adjacency Wrapping matriX (AWX) employs a single model where the underlying hierarchy is mapped into the loss function [6]. This mapping is performed by an auxiliary matrix which makes the gradients updates flow from the neurons associated to leaves to the neurons that are associated to their parent nodes. If the performance degrades on the validation dataset, the training is interrupted immediately. We have used the following parameters:

  • l-norm: We have used l1, since it presented superior results;

  • Hidden layer: with 1000 neurons with the ReLu activation function and l2 regularizer 10−3;

  • Output layer: Logistic activation function and l2 regularizer 10−3;

  • Optimizer: Adam with learning rate 10−5,β1=0.9 and β2=0.999 and the cross entropy loss function;

Evaluated datasets

Even though we provide 12 datasets with updated Funcat and GO annotations, we have decided to not include all of them in our analysis. The Church and Pheno datasets have an unusual number of instances with identical feature vectors, mostly due to missing values. In the Church dataset, 2352 out of 3755 instances are unique, leaving 1403 instances with the same feature vector as another instances, but different annotations. A similar behaviour is noticed in the Pheno dataset where only 514 instances out of 1591 are unique [49].

We are considering the Hom and Struc datasets only using the methods Clus-Ensemble and AWX. The other methods, HMC-LMLP and HMC-GA, presented several difficulties when handling these datasets. HMC-LMLP demands much more computational power due to its many neural networks. Similarly, HMC-GA did not converge using the parameters suggested in the original paper. Some work, such as [5, 10, 11, 13, 17, 22], have also decided to not include them.

Table 12 presents the datasets evaluated in this work.

Table 12 Evaluated datasets

Standard evaluation

In order to provide benchmark results on the new datasets, we have first performed a standard evaluation. Thus, we evaluated 10 feature sets with 4 possible labels sets for each (two label hierarchies and two annotation timestamps), making a total of 40 datasets. We present the evaluation measure and the statistical test that we have used.

Pooled aUPRC

We have adopted the Pooled area under the precision-recall curve (AUPRC) evaluation measure since it is consistently used in HMC literature [2, 3, 5, 18, 19, 21, 22, 25]. Mind that, generally HMC datasets are heavily imbalanced, making negative predictions very likely, thus evaluation measures such as ROC curves are not recommended.

The Pooled AUPRC corresponds to the area under the precision-recall curve generated by taking the Pooled (i.e., micro-averaged) precision and recall over all classes for different threshold values. These threshold values usually consist of values ranging from 0 to 1 with increasing steps of 0.02 for all datasets.

In the equations below, tp stands for true positive, fp means false positive, fn refers to false negative and i ranges over all classes.

$$ Pooled\_precision = \frac{\sum{tp_{i}}}{\sum{tp_{i}} + \sum{fp_{i}}} $$
$$ Pooled\_recall = \frac{\sum{tp_{i}}}{\sum{tp_{i}} + \sum{fn_{i}}} $$

Friedman-Nemenyi test

In order to provide statistical evidence, we have used the Friedman-Nemenyi test. At first the Friedman test verifies if any of the compared methods performs statistically significantly different from others. Next, the Nemenyi test ranks the methods where methods with superior results are ranked in higher positions. Graphically, methods connected by a horizontal bar of length equal to a critical distance are not statistically significantly different.

Evaluation procedure to compare datasets from different versions

We also investigated whether models that were trained on a dataset from 2007 are able to discover new annotations, i.e., annotations that were unknown (negative) in 2007, but have been added afterwards. We also check the opposite situation: whether models are able to correct wrong annotations, i.e., annotations that were wrongly positive in 2007, and have been corrected to negative afterwards. For this purpose, we propose an evaluation strategy that compares the predicted probabilities for specific (instance,label) pairs over the different HMC methods.

In particular, for a fair comparison, first we take the intersection of the label sets in the 2007 and 2018 dataset versions, respectively. Then, for evaluating the discovery of new annotations, in this intersection, we check the (instance,label) pairs in the test set that were negative in 2007 and positive in 2018. For these pairs, we plot the distribution of predictions for each HMC method, trained on the 2007 dataset. Note that a high value would have yielded a false positive prediction in 2007, however, with the current knowledge in functional genomics, this would now yield a true positive prediction. Figure 15 illustrates the procedure. For evaluating the correction of wrong annotations, the procedure is similar, except that we look for positive pairs that became negative.

Fig. 15
figure 15

Prediction probabilities of labels that changed between versions (written in red inside the red box) are used to build the red box-plot. Labels that occur only in the 2018 versions are not considered in this evaluation (black box)

Availability of data and materials

The datasets from 2007 and the Clus-Ensemble method are available at The methods HMC-GA and HMC-LMLP are available at The AWX method is available at The new dataset versions are available at:


  1. Available in:





Area under the precision-recall curve


Adjacency wrapping matriX


Functional catalogue


Gene ontology


Hierarchical multi-label classification


Hierarchical multi-label classification with genetic algorithm


Hierarchical multi-label classification with local multi-layer perceptrons


Universal protein


  1. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, et al.A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10(3):221.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H. Decision trees for hierarchical multi-label classification. Mach Learn. 2008; 73:185–214.

    Article  Google Scholar 

  3. Cerri R, Barros RC, de Carvalho ACPLF, Jin Y. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics. 2016; 17(1):373.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Cerri R, Basgalupp MP, Barros RC, de Carvalho ACPLF. Inducing hierarchical multi-label classification rules with genetic algorithms. Appl Soft Comput. 2019; 77:584–604.

    Article  Google Scholar 

  5. Wehrmann J, Cerri R, Barros R. Hierarchical multi-label classification networks In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80. Stockholmsmässan: PMLR: 2018. p. 5075–84.

    Google Scholar 

  6. Masera L, Blanzieri E. Awx: An integrated approach to hierarchical-multilabel classification In: Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G, editors. Machine Learning and Knowledge Discovery in Databases. Springer, Cham: 2019. p. 322–36.

  7. Feng S, Fu P, Zheng W. A hierarchical multi-label classification method based on neural networks for gene function prediction. Biotechnol Biotechnol Equip. 2018:1–9.

  8. Sun Z, Zhao Y, Cao D, Hao H. Hierarchical multilabel classification with optimal path prediction. Neural Process Lett. 2017; 45(1):263–77.

    Article  Google Scholar 

  9. Feng S, Fu P, Zheng W. A hierarchical multi-label classification algorithm for gene function prediction. Algorithms. 2017; 10(4):138.

    Article  Google Scholar 

  10. Khan S, Baig AR. Ant colony optimization based hierarchical multi-label classification algorithm. Appl Soft Comput. 2017; 55:462–79.

    Article  Google Scholar 

  11. Ramírez-Corona M, Sucar LE, Morales EF. Hierarchical multilabel classification based on path evaluation. Int J Approx Reason. 2016; 68:179–93.

    Article  Google Scholar 

  12. Bi W, Kwok JT. Bayes-optimal hierarchical multilabel classification. IEEE Trans Knowl Data Eng. 2015; 27(11):2907–18.

    Article  Google Scholar 

  13. Golzari F, Jalili S. Vr-bfdt: A variance reduction based binary fuzzy decision tree induction method for protein function prediction. J Theor Biol. 2015; 377:10–24.

    Article  CAS  PubMed  Google Scholar 

  14. Cerri R, Barros RC, de Carvalho ACPLF. Hierarchical classification of gene ontology-based protein functions with neural networks. In: Neural Networks (IJCNN), 2015 International Joint Conference On: 2015. p. 1–8.

  15. Cerri R, Barros RC, Freitas AA, de Carvalho AC. Evolving relational hierarchical classification rules for predicting gene ontology-based protein functions. In: Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM: 2014. p. 1279–86.

  16. Ramírez-Corona M, Sucar LE, Morales EF. Chained path evaluation for hierarchical multi-label classification. In: The Twenty-Seventh International Flairs Conference. AAAI Publications: 2014.

  17. Barros RC, Cerri R, Freitas AA, de Carvalho ACPLF. Probabilistic clustering for hierarchical multi-label classification of protein functions In: Blockeel H, Kersting K, Nijssen S, železný F, editors. Machine Learning and Knowledge Discovery in Databases. Berlin: Springer: 2013. p. 385–400.

    Chapter  Google Scholar 

  18. Stojanova D, Ceci M, Malerba D, Dzeroski S. Using ppi network autocorrelation in hierarchical multi-label classification trees for gene function prediction. BMC Bioinformatics. 2013; 14(1):285.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Cerri R, Barros RC, de Carvalho ACPLF. A genetic algorithm for hierarchical multi-label classification. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing. SAC ’12. New York: ACM: 2012. p. 250–5.

    Google Scholar 

  20. Bi W, Kwok JT. Multi-label classification on tree- and dag-structured hierarchies. In: Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11. USA: Omnipress: 2011. p. 17–24.

    Google Scholar 

  21. Schietgat L, Vens C, Struyf J, Blockeel H, Kocev D, Džeroski S. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics. 2010; 11(2):1–14.

    Google Scholar 

  22. Otero FEB, Freitas AA, Johnson CG. A hierarchical multi-label classification ant colony algorithm for protein function prediction. Memetic Comput. 2010; 2(3):165–81.

    Article  Google Scholar 

  23. Consortium GO. The gene ontology project in 2008. Nucleic Acids Res. 2007; 36(suppl_1):440–4.

    Article  Google Scholar 

  24. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, et al.The funcat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539–45.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Cerri R, Mantovani RG, Basgalupp MP, de Carvalho ACPLF. Multi-label feature selection techniques for hierarchical multi-label protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN): 2018. p. 1–7.

  26. Almeida TB, Borges HB. An adaptation of the ml-knn algorithm to predict the number of classes in hierarchical multi-label classification In: Torra V, Narukawa Y, Honda A, Inoue S, editors. Modeling Decisions for Artificial Intelligence. Springer, Cham: 2017. p. 77–88.

    Chapter  Google Scholar 

  27. Triguero I, Vens C. Labelling strategies for hierarchical multi-label classification techniques. Pattern Recognit. 2016; 56(C):170–83.

    Article  Google Scholar 

  28. Santos A, Canuto A. Applying semi-supervised learning in hierarchical multi-label classification. Expert Syst Appl. 2014; 41(14):6075–85.

    Article  Google Scholar 

  29. Cerri R, Barros RC, de Carvalho AC, Freitas AA. A grammatical evolution algorithm for generation of hierarchical multi-label classification rules. In: 2013 IEEE Congress on Evolutionary Computation. IEEE: 2013. p. 454–61.

  30. Cerri R, Barros RC, de Carvalho ACPLF. Hierarchical multi-label classification for protein function prediction: A local approach based on neural networks. In: 2011 11th International Conference on Intelligent Systems Design and Applications: 2011. p. 337–43.

  31. Borges HB, Nievola JC. Multi-label hierarchical classification using a competitive neural network for protein function prediction. In: IJCNN: 2012. p. 1–8.

  32. Cerri R, de Carvalho ACPLF. New top-down methods using svms for hierarchical multilabel classification problems. In: The 2010 International Joint Conference on Neural Networks (IJCNN): 2010. p. 1–8.

  33. Silla J, Carlos N, Freitas A. A survey of hierarchical classification across different application domains. Data Min Knowl Discov. 2010; 22(1-2):31–72.

    Article  Google Scholar 

  34. Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2017; 34(5):760–9.

    Article  PubMed Central  Google Scholar 

  35. Zou Z, Tian S, Gao X, Li Y. mldeepre: Multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet. 2019; 9:714.

    Article  PubMed  PubMed Central  Google Scholar 

  36. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2017; 34(4):660–8.

    Article  PubMed Central  Google Scholar 

  37. Zeng C, Li T, Shwartz L, Grabarnik GY. Hierarchical multi-label classification over ticket data using contextual loss. In: 2014 IEEE Netw Oper Manag Symp (NOMS): 2014. p. 1–8.

  38. Zeng C, Zhou W, Li T, Shwartz L, Grabarnik GY. Knowledge guided hierarchical multi-label classification over ticket data. IEEE Trans Netw Serv Manag. 2017; 14(2):246–60.

    Article  Google Scholar 

  39. Ren Z, Peetz M-H, Liang S, van Dolen W, de Rijke M. Hierarchical multi-label classification of social text streams. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ’14. New York: ACM: 2014. p. 213–22.

    Google Scholar 

  40. Wang X, Zhao H, Lu B-l. Enhanced K-Nearest Neighbour Algorithm for Large-scale Hierarchical Multi-label Classification. In: Proceedings Joint ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification. Springer: 2011. p. 58–67.

  41. Zhang Z, Zhang J, Liu Y, Wang Z, Deng L. Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification. Bioinformatics. 2017; 34(10):1750–7.

    Article  Google Scholar 

  42. Papanikolaou Y, Tsoumakas G, Katakis I. Hierarchical partitioning of the output space in multi-label data. Data Knowl Eng. 2018; 116:42–60.

    Article  Google Scholar 

  43. Clare A. Machine learning and data mining for yeast functional genomics. The University of Wales. 2003.

  44. Goffeau A, Barrell BG, Bussey H, Davis R, Dujon B, Feldmann H, Galibert F, Hoheisel J, Jacq C, Johnston M, et al.Life with 6000 genes. Science. 1996; 274(5287):546–67.

    Article  CAS  PubMed  Google Scholar 

  45. Ouali M, King RD. Cascaded multiple classifiers for secondary structure prediction. Protein Sci. 2000; 9(6):1162–76.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods. 2019.

    Article  CAS  PubMed  Google Scholar 

  47. Li Y, Li Z, Ding L, Yang P, Hu Y, Chen W, Gao X. Supportnet: solving catastrophic forgetting in class incremental learning with support data. 2018. arXiv preprint arXiv:1806.02942.

  48. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019; 113:54–71.

    Article  PubMed  Google Scholar 

  49. Pliakos K, Vens C. Mining features for biomedical data using clustering tree ensembles. J Biomed Inform. 2018; 85:40–48.

    Article  PubMed  Google Scholar 

Download references


We would like to thank Ricardo Cerri for valuable assistance with the HMC-LMLP and HMC-GA methods. Likewise, we are also grateful to Luca Masera for providing the source code with the original parameters in the AWX method.


This study was funded by the Research Fund Flanders (FWO). FWO played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations



FKN performed the literature review and experimental analysis, ML built the updated version of the datasets, and CV supervised the work. FKN and CV drafted the manuscript. All the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Felipe Kenji Nakano.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Nakano, F.K., Lietaert, M. & Vens, C. Machine learning for discovering missing or wrong protein function annotations. BMC Bioinformatics 20, 485 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: