Machine learning for discovering missing or wrong protein function annotations

Background A massive amount of proteomic data is generated on a daily basis, nonetheless annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. Results The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were less significant differences among the methods. Conclusions The experiments have showed that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies, nonetheless the old versions of the datasets should not be disregarded since other tasks in machine learning could benefit from them.


Background
Due to technological advancements, the generation of proteomic data has increased substantially. However, annotating all sequences is costly and time-consuming, making it often unfeasible [1]. As a countermeasure, recent studies have employed machine learning methods due to their capacities of automatically predicting protein functions.
More specifically, protein function prediction is generally modeled as a hierarchical multi-label classification (HMC) task. HMC is a classification task whose objective is to fit a predictive model f which maps a set of instances X to a set of hierarchically organized labels Y, while respecting hierarchy constraints among Y [2,3]. The hierarchy constraint states that whenever a particular label y i is predicted, all ancestors labels of y i up to the root node of the hierarchy must be predicted as well.
In the machine learning literature when proposing a new method, this method is typically compared to a set of competitor methods on benchmark datasets. For HMC, many studies  utilized the benchmark datasets proposed in [2]. These datasets are available at https:// dtai.cs.kuleuven.be/clus/hmcdatasets/ and contain protein sequences from the species Saccharomyces cerevisiae (yeast) whose functions are mapped to either the Functional Catalogue (FunCat) [24] or Gene Ontology (GO) [23]. The task associated with these datasets is to predict the functions of a protein, given a a set of descriptive features (e.g., sequence, homology or structural information).
FunCat and GO are different types of hierarchies. In FunCat (Fig. 1), labels are structured as a tree, meaning that they can have only a single parent label [24]. The GO (Fig. 2), however, allows labels to have multiple parent labels, forming a directed acyclic graph [23]. This complicates the fulfillment of the hierarchy constraint, since multiple classification paths are allowed throughout the graph.
These benchmark datasets were introduced to the HMC community in 2007, and, thus, the functional labels associated with each protein can be considered outdated. There are two reasons for this. First, functional annotations are updated on a regular basis. Second, as can be seen in Fig. 3a, there was a drastic increase in the number of terms throughout the Gene Ontology since the creation of these datasets (January 2007). A similar observation can be made for the number of obsolete terms as shown in Fig. 3b. Accordingly, one of the main goals of this article is to provide updated versions of these widely used HMC benchmark datasets to the research community.
Using these new datasets, we present a comparison among four recent and open-source HMC methods that can be considered state-of-the-art,thus providing baseline performances as guidelines for future research on this topic. Finally, having two different versions of the same datasets provides us with the unique opportunity to be able to evaluate whether these HMC methods are able to generalize when learning from data with mislabeled instances. In particular, we evaluate whether they were able to predict the correct label in cases where the label has been altered since 2007. In order to do so, we propose an evaluation procedure where a predictive model is trained using the data from 2007, but tested with data from 2018.
The major contributions of this work are the following: i) We provide new benchmark datasets for HMC 1 ; ii) We provide baseline results for the new datasets; iii) We provide an evaluation procedure and results that evaluate whether HMC methods are able to discover new or wrong annotations.
The remainder of this article is organized as follows. "Related work" section presents an overview of studies on HMC which have used the functional annotation benchmark datasets proposed in 2007. "Updated datasets" section provides a description on how the datasets were updated, together with a quantification of new labels and annotations. In "Results" section, we present the results of our experiments. In "Discussion" section, we discuss our results. In "Conclusion" section we present our conclusion. Finally, "Methods" section contains the HMC methods employed and the evaluation strategies;

Related work
In this section, we provide a literature overview of studies that have used the datasets addressed in this work, and a brief review on hierarchical multi-label classification applications. In Table 1, we present studies which have used the FunCat and GO datasets.
In the HMC literature, methods are separated into two approaches: local and global. The difference between these approaches relies on how their predictive models are designed. The local approach employs machine learning decompositions where the task is divided into smaller classification problems, then the solutions of the subproblems are combined to solve the main task. As an advantage, any predictive model, or even an ensemble of models, can be incorporated into the solution.
According to Silla and Freitas [33], the local approach is further divided into three strategies: Local Classifier per Level [3,5,14,25,30], Local Classifier per Node [7,9] and Local Classifier per Parent Node [11,16]. As their name suggest, these strategies train a predictive model for each level, node or parent node of the hierarchy, respectively. Allowing many types of decomposition is particularly interesting, since different problems may require different solutions. For instance, when handling large hierarchies, Local Classifier per Node result in a large number of classifiers being trained, making the Local Classifier per Level strategy more computationally efficient as it requires only one predictive model per level. However, the hierarchy may contain many labels per level, forcing the models to distinguish among them, and possibly making the task more difficult.
Using several strategies, Cerri and De Carvalho [32] investigated how problem transformation methods from the non-hierarchical multi-label literature, which decompose the task into smaller problems similarly to the local approach, behave on the HMC context using Support Vector Machines. Cerri et al. [3,14,30] use the Local Classifier per Level by training one neural network for each level of the hierarchy where prediction probabilities of the previous level are used as extra attributes for the neural network associated to the next level. Wehrmann et al. [5] extended this idea with an extra global loss function, allowing gradients to flow across all neural networks. Li [34] proposed to use this strategy with deep neural networks to predict the commission number of enzymes. In a follow up work, Zou et al. [35] extended this method by enabling the prediction of multi-functional enzymes.
The work of Feng et al. [9] proposed to use the Local Classifier per Node strategy by training one Support Vector Machine for each node of the hierarchy combined with the SMOTE oversampling technique. This work was slightly improved in Feng et al. [7] where the Support Vector Machines were replaced by Multi-Layer Perceptron and a post-prediction method based on Bayesian networks was used. Also using Support Vector Machines, the studies of Bi and Kwok [12,20] proposed new loss functions specific for HMC which were optimized using Bayes optimization techniques. On a similar manner, Vens et al.  [2] proposed to train Predictive Clustering Trees, a variant of decision trees which create splits by minimizing the intra-cluster variance, for each node, and also an alternative version where one predictive model is trained per edge. Ramirez et al. [11,16] employed the Local Classifier per Parent Node by training one predictive model per parent node of the hierarchy and augmenting the feature vectors with predictions from ancestors classifiers. On a similar note, Kulmanov et al. [36] proposed to train a predictive model for each sub-ontology of the Gene Ontology, combining features automatically learned from the sequences and features based on protein interactions.
Differently from the local approach, the global one employs a single predictive model which is adapted to handle the hierarchy constraint and relationships among classes. When compared to the local approach, the global one tends to present lower computational complexity, due to the number of models trained. However, its implementation is more complex, since traditional classifiers can not be used straightforwardly. The global approach is further divided into two strategies: algorithm adaptation and rule induction.
As its name suggests, the algorithm adaptation strategy consists of adapting a traditional algorithm to handle hierarchical constraints. Masera and Blanzieri [6] created a neural network whose architecture incorporates the underlying hierarchy, making gradient updates flow from the neurons associated to the leaves up neurons associated to their parent nodes; Sun et al. [8] proposed to use Partial Least Squares to reduce both label and feature dimension, followed by an optimal path selection algorithm; Barros et al. [17] proposed a centroid based method where the training data is initially clustered, then predictions are performed by measuring the distance between the new instance and all clusters, the label set associated to the closest cluster is given as the prediction; Borges and Nievola [31] developed a competitive neural network whose architecture replicates the hierarchy; Vens et al. [2] also proposed to train a single Predictive Clustering Tree for the entire hierarchy; as an extension of [2], Schietgat et al. [21] proposed to use ensemble of Predictive Clustering Trees; Stojanova et al. [18] proposed a slight modification for Predictive Clustering Trees in which the correlation between the proteins is also used to build the tree.
In the rule induction strategy, optimization algorithms are designed to generate classification rules which consist of conjunctions of attribute-value tests, i.e. many if → then tests connected by the boolean operator ∧. In this regard, several studies from Cerri et al. [4,15,19] proposed to use Genetic Algorithms with many different fitness functions. Similarly, other optimization algorithms such as Ant Colony Optimization [10,22] and Grammar Evolution [29] were also investigated in this context.
Additionally, some studies have also addressed similar topics to HMC. For instance, Cerri et al. [25] examined how Predictive Clustering Trees can be used to perform feature selection using Neural Networks and Genetic Algorithms as base classifiers. Almeida and Borges [26] proposed an adaptation of K-Nearest Neighbours to address quantification learning in HMC. Similarly, Triguero and Vens [27] investigated how different thresholds can increase the performance of Predictive Clustering Trees in this context.
Other application domains have also explored HMC, such as managing IT services [37,38], text classification on social media [39], large scale document classification [40] and annotation of non-coding RNA [41]. It can even be applied to non-hierarchical multi-label problems where artificial hierarchies are created [42].

Updated datasets
In this section, we present an overall description of the datasets and their taxonomies, followed by details on how we updated both FunCat and Gene Ontology versions. The resulting updated versions are available at https:// www.kuleuven-kulak.be/nl/onderzoek/itec/projects/ research-focus/software.

Overall description
Clare [43] originally proposed 12 datasets containing features extracted from protein sequences of the organism Saccharomyces cerevisiae (yeast) whose targets are their protein functions. These 12 datasets contain largely the same proteins, nonetheless differ in their descriptive features. Furthermore, these datasets are divided into train, test and validation sets.
It is known that the yeast and human genomes have many similar genes, furthermore yeast is considerably cheaper and experiment-wise efficient when compared to other species, making it a widely addressed subject in bioinformatics applications [44]. In Table 2, we provide more information about these datasets.
The Hom dataset presents information between analogous (similar) yeast genes. Using an homology engine, such as BLASTn 2 , other similar yeast genes are discovered. Then, properties between the sequences from the dataset and their analogous ones are measured. The Pheno dataset contains phenotype data based on knockout mutants. Each gene is removed to form a mutant strain, and the corresponding change in phenotype as compared to the wild type (no mutation) is observed after growing both strains on different growth media. The Seq dataset stores features extracted from the amino acid sequences of the proteins, such as molecular weight, length and amino acid ratios. As its name suggests, the Struc dataset contains features based on the second structure of the proteins annotated in a binary format. In the case of an unknown structure, the software PROF [45] was used to predict it. Known structures were promptly annotated. All the other datasets were constructed based on the expression of genes recorded across an entire genome using microchips [43].
As an extension to these datasets, Vens [2] mapped the targets to the Gene Ontology taxonomy. Additionally, the FunCat annotations used by Clare [43] were updated.
FunCat is an organism independent functional taxonomy of proteins functions which is widely adopted throughout bioinformatics. As shown in Fig. 1, FunCat places generic functions in high levels of the taxonomy, then it sequentially divides such functions into specific ones, forming a tree-shaped hierarchy where each function has one ancestor function. From the machine learning perspective, FunCat is used as an underlying hierarchy of labels. Thus, each protein function is addressed as a label in a classification task where the relationships established by FunCat are taken in account.
Similarly, the Gene Ontology (GO) is a taxonomy whose main goal consists of defining features of genes in an accurate and species independent fashion [23]. More specifically, the GO is composed of three sub-ontologies: molecular function, cellular component and biological process. The molecular function sub-ontology contains information about activities performed by gene products in the molecular-level. The cellular component sub-ontology, as its name suggests, describes the locations where gene products perform functions. Finally, the biological process sub-ontology annotates processes performed by multiple molecular activities. All information in the GO is described using terms which are nodes with an unique ID, a description and their relationship with other terms. Due to these relationships, the GO is defined as a directed acyclic graph in the machine learning literature, making it a challenging task due to the substantial high number of terms, and many intrinsic relationships among them. Figure 2 presents a small part of the GO.

FunCat update
In order to update these datasets, we have performed the procedure described in Fig. 4. Using the IDs from the sequences, we have queried UniProt, obtaining new annotated functions for the sequences. Next, we built the hierarchy of each dataset, and replaced the old annotations by the new ones, i.e. we have removed entirely the annotations from 2007, and concatenated the new annotations with the original features. Mind that each dataset described in Table 2 uses a slightly different FunCat subset. The hierarchies differ between the datasets, because the protein subset differs as seen in Table 2, since not every protein can be found in every original dataset by Clare.
In Table 3, we compared the 2007 datasets with the 2018 versions w.r.t. their label set. There was a significant increase in the number of labels across the hierarchy. More specifically, in the third and fourth level where the mean number of labels has increased from 175 to 208 and 140 to 168 respectively. A smaller increase is also noticeable in the first, second and last level.
In Table 4, we presented for each dataset the number of instances with annotations per level. In this case, there was a slight increase in deeper levels, whereas the mean number of annotated instances on the second and third level has decreased in all datasets.
Further, we compared the number of annotations per level between the versions from 2007 and 2018 in Table 5. There was a considerable increase in the number of annotations across all levels of the hierarchy. The last level seemed remarkable, as its number of annotations is significantly low in both versions.
When analyzing the number of annotations that were added and removed in Table 6, the second level presented a higher average number of new annotations despite having fewer annotated instances now. Noticeable increases were also noticed in the third and fourth level.

Gene ontology update
In order to update these datasets, we have performed the procedure shown in Fig. 5.
Initially, we queried Universal Protein (UniProt) using the IDs from the protein sequences using their web service 3 , obtaining the GO terms associated to each sequence. Next, we preprocessed the queried terms. The GO keeps track of alternate (secondary) IDs which are different labels with identical meaning, hence we have merged them into a single label. Similarly, we have also removed obsolete annotations since they are deprecated and should not be used anymore. Finally, the old annotations were entirely removed, and the new ones were concatenated to the feature vector. Recall that we are not Mind that since the GO is a directed acyclic graph, annotations can belong to multiple levels. In order to present statistics about these datasets, we are considering the deepest path to determine the level for all labels in Tables 7, 8, 9 10. As shown in Table 7, there was a similar behaviour as in the FunCat update. There was a substantial increase in the number of labels throughout all levels, specially in the levels between the third and the twelfth. Two extra levels were added, making a total of 15, nonetheless there are only few classes in these levels.
We observed an overall increase in the number of instances per level throughout the hierarchies (Table 8). There were no remarkable decreases. We have noticed that only the validation and test datasets contain instances on the last level of the hierarchy. From the machine learning perspective, such condition might hinder predictive models, as most of them are not capable of predicting a class which is not present in the training dataset. Possibly, future studies might consider removing the last level. Difficulties might also emerge on the fourteenth level, as the datasets have very few instances on it.
As seen in Table 9, once again there was an increment in the number of annotations per level. The number of annotations gradually increases up to a certain level, until it decreases to almost none when it reaches the deepest levels. When examining the number of annotations that are added or removed per level (Table 10), we can perceive once again an overall increment in all datasets. Naturally, no labels were removed on the fourteenth and fifteenth level as they were not present in the 2007 versions.

Results
Initially, we present a standard evaluation among the HMC methods. Next, we also present an alternative evaluation where the HMC methods are compared w.r.t. their ability to discover new or wrong annotations.

Standard evaluation
In Table 11, we present a comparison of the PooledAUPRC obtained using the standard evaluation procedure. Since HMC-LMLP, HMC-GA and AWX are stochastic, we report the mean result of 5 runs, together with the standard deviation. Mind that, since we reran all methods on our datasets, variations may occur compared to the originally reported results in the respective papers. Even though Clus-Ensemble is the oldest of the compared methods, it still provided better results in most of the experiments. This is best seen in the FunCat 2018 datasets where Clus-Ensemble consistently presented results close to 0.4, and the second best method, HMC-LMLP, achieves at most 0.24 in any of the datasets. As can be seen in Fig. 6, Clus-Ensemble was the overall best method, and The second method evaluated, HMC-GA, yielded a lower performance overall. In most of the cases, HMC-GA was superior to AWX, but still inferior to Clus and HMC-LMLP. The method HMC-LMLP provided decent results. When compared to AWX, HMC-LMLP managed to significantly outperform it. Furthermore, HMC-LMLP was ranked as the second best method overall, providing superior results in all of the Gene Ontology 2007 datasets.
An unusual behaviour was noticed in the AWX method as it yielded very undesired results in many occasions. Even though the parameter values were extracted from the original paper, its results were fairly different. For instance, in the Derisi, Seq and Spo datasets from all versions, AWX was severely underfitted with results inferior to 0.1. It also presented similar cases in the FunCat and Gene Ontology 2007 Expr datasets.
When comparing the performance between different versions of the datasets, we noticed an overall improvement in the methods when moving from 2007 to 2018. Even though their label sets are bigger now, the addition of annotations to the instances compensate such difference, which resulted in better performances.

vs 2018
Here we evaluate how the HMC methods perform when trained using data from 2007, but evaluated using datasets from 2018. For the methods HMC-LMLP, HMC-GA and AWX, for each (instance,label) pair we have used the mean prediction probability of 5 runs.
For all figures presented here, we also include a boxplot for the (instance,label) pairs that did not change between the two dataset versions. This allows to see to what extent the methods can detect annotations that were falsely negative or falsely positive in the data of 2007. The number between parentheses corresponds to the number of (instance,label) pairs evaluated for a particular setting and dataset. Note that the number of unchanged pairs is much higher than the number of changed pairs, hence the outliers (prediction probabilities outside the whisker) should not be regarded.

FunCat
Similarly, the AWX method managed to be superior in the Hom dataset. However, it underperformed in other datasets, specially in Derisi, Expr, Seq and Spo. In these datasets, AWX predicted almost all annotations to be absent, except for very few outliers, which received a very high prediction probability.
HMC-LMLP presented decent results in almost all datasets. Nonetheless, for labels that did not change, HMC-LMLP tended to provide higher prediction probabilities, whereas Clus-Ensemble yielded lower ones, giving Clus-Ensemble an advantage over HMC-LMLP.
Hence, in the context of discovering new annotations, we can assume that Clus-Ensemble is the safer choice as it performed better on almost all datasets, nonetheless its advantage was close to minimal.      When addressing labels that were removed, see Fig. 9, we had very similar results. As seen in Fig. 10, HMC-GA provided superior results, but it still was not statistically different from Clus-Ensemble and HMC-LMLP. AWX yielded lower prediction probabilities in most of the datasets with exception to the Hom dataset. Since its prediction probabilities were also low for labels that were present in both versions of the datasets, it performs the worst among the compared methods.

Gene ontology
As can be seen in Fig. 11, Clus-Ensemble and HMC-GA were superior in most of the datasets. Additionally, the AWX method also presented desirable results, specially in the Derisi and Seq datasets where it output very high probabilities for added annotations and very low ones for labels that did not change. These three methods were not statistically different from each other, as shown in Fig. 12.
The HMC-LMLP method also presented overall visually comparable results, nonetheless it yielded higher predictions for annotations that did not change in some datasets, such as Expr, Gasch1 and Gasch2.
When examining the labels that were removed in Fig. 13, we noticed a different outcome. In this case, all methods presented very similar results, making performance almost indistinguishable in most of the datasets. Additionally, there was no statistical difference among these methods, as shown in Fig. 14.

Discussion
In this section, we present a discussion about the results presented in the previous section. Following the same order, we first address the standard evaluation, followed by the comparison between the versions of the datasets.

Standard evaluation
As shown in Fig. 6, Clus-Ensemble's superior predictive performance, in combination with an efficient learning method (random forest), the ability to handle datasets with many features (as seen in the Struc and Hom datasets), and the interpretabilty aspect (e.g. variable ranking and proximity measure associated to random forests), confirm the state-of-the-art status of Clus-Ensemble.
We believe that the ensemble method, random forest, contributes substantially to the performance. By considering many models, Clus-Ensemble is able to generalize more, and consequently provide superior results. The other methods evaluated do not make use of any ensemble method. Even though HMC-LMLP contains many neural networks, they are trained as a single model, and they distinguish between different classes.
HMC-GA provided inferior results in many cases, nonetheless it has the highest interpretability since it generates classification rules. Similarly, Clus-Ensemble presents many trees, which are readable by themselves, however their interpretability decreases as the number of trees increases. Differently, the neural networks, HMC-LMLP and AWX, are black-box models, and thus not readable in a straightforward way.
When comparing the neural network methods, HMC-LMLP and AWX, HMC-LMLP clearly had the upper hand. We believe that this is due to HMC-LMLP being a local approach, whereas AWX is a global one. Since one neural network is trained for each level of the hierarchy, the neural networks are trained to distinguish among fewer classes, making the classification task easier, and, thus, providing better results. The computational complexity of HMC-LMLP, however, is considerably higher than the other methods due to many neural networks being built during its training.
Despite some undesirable results, AWX is the only method that explicitly exploits the hierarchy constraint by propagating gradients from neurons associated to leaves to neurons associated to their parents. Mind that the other methods also respect the constraint, but they exploit it to a smaller extent during their training. Moreover, we believe that AWX's early stopping criterion has negatively affected the results. in order to prevent overfitting, AWX interrupts the training right after the performance in the validation set decreases. However, these datasets contain noise in their label set, thus a small oscillation might be noticed. Considering more iterations, as performed by HMC-LMLP, could possibly increase AWX's performance. Moreover, neural networks

vs 2018 FunCat
As described previously, when analyzing labels that changed from absent to present (0 to 1), Clus-Ensemble had the overall best results, whereas HMC-GA was the best for present to absent (1 to 0). We believe that this finding is highly correlated to how the evaluated methods yield their prediction probabilities.
Clus-Ensemble outputs the mean prediction probability of the instances associated to the leaf node predicted. According to the parameters used, the minimum number of such instances is 5, making the lowest positive prediction probability to be 0.2 per tree. Even though fairly low, it still is reasonably high in HMC due to label sparsity, resulting in high prediction probabilities in many cases, and thus better performance.
Likewise, the HMC-GA method yielded high prediction probabilities in some cases, resulting in similar results to Clus. Moreover, their heuristic (variance reduction) is the same. The main difference between HMC-GA and Clus-GA relies on the fact that HMC-GA uses a mean rule (prediction of the mean label set of the training dataset) whenever a test instance is not classified by any of the rules. This possibly results in outputting a sparse prediction with very low prediction probabilities.
Despite having decent results, HMC-LMLP presented high very prediction probabilities for labels that did not change between versions. We believe that this is related to how neural networks learn the distribution of the data. Since neural networks are very powerful models, they can learn more complex boundaries when compared to Clus-Ensemble and HMC-GA, resulting in the neural networks adjusting themselves strictly to the training dataset. HMC-LMLP is not overfitted though, as shown in Table 11, nonetheless its usage is not recommended if label noise is likely to be present.
Lastly, AWX had the best performance in the Hom dataset. However, it underperformed in several other cases. Once again, the early stopping criterion might have forced the neural network to a sub-optimal configuration, resulting in very biased predictions, i.e. AWX assumes most of the labels to be either positive or negative.
When evaluating labels that were removed, HMC-GA was superior. We believe that the mean rule might have artificially contributed since very low probabilities are predicted for most of the labels in this case.

Gene ontology
In the GO datasets, we noticed a similar behaviour. In most of the situations, Clus-Ensemble performed better when evaluating labels that were added, whereas HMC-GA was superior for removed labels.
When it comes to removed labels, HMC-GA performed better. Consequently, we recommend the usage of HMC-GA to predict which annotations are likely to be removed in future versions of the datasets (noise) since it presented better results in both FunCat and GO.
Similarly to the FunCat experiments, HMC-LMLP had an average performance being statistically significantly inferior to other methods, but equivalent to them for removed labels.  When compared to its performance on FunCat, AWX performed better here. For labels that were added, even though ranked in lower positions, AWX managed to not be statistically significantly different from Clus-Ensemble and Clus-HMC. Likewise, for removed labels, AWX also performed reasonably. This is very surprising since GO datasets have even more labels to be distinguished, and the same parameters were used.

Conclusion
In this work, we have presented updated benchmark datasets for hierarchical multi-Label classification (HMC) in the area of protein function prediction. We have also performed a comparison among four HMC methods to provide baselines results on these datasets. Finally, we have proposed an alternative evaluation procedure to evaluate the ability of HMC methods to detect missing or wrong annotations. For this purpose we make use of both old and new versions of the datasets.
In all datasets, we have noticed a significant increase in the hierarchy size, and in the number of annotations associated to instances. As a consequence of that, when performing a standard evaluation, HMC methods performed better using the updated versions. Despite having more labels to distinguish, the instances have now more annotations associated to them, resulting in better predictions. The overall best method in this task was Clus-Ensemble, a random forest of decision trees adapted to HMC, nonetheless the results remained fairly low overall. Thus, protein function prediction is still a very challenging task for the machine learning community.
In this direction, further studies in this area are necessary. In particular, we instigate the use of Deep Learning methods, since the amount of data available is on a constant increase, and recent deep neural networks are capable of learning straight from DNA sequences (without the need of extracting features) [46] .
When it comes to detect missing or wrong annotations, in the FunCat datasets, Clus-Ensemble was the best in detecting missing annotations, whereas HMC-GA did better for annotations that were removed. In the Gene Ontology datasets, Clus-Ensemble performed better for detecting missing annotations, and competitive results were obtained for wrong annotations.
To conclude, we recommend to use the updated datasets in future studies on this topic. However, the previous version of these datasets should not be disregarded, since having two versions can be of interest to perform an evaluation similar to ours on new HMC methods, or to other fields in machine learning such as weakly supervised classification, noise detection and incremental learning [47,48].

Methods
In this section, we provide details about our experimental setup. First, we present the methods used for comparison. Then we describe two evaluation strategies. Finally, we explain which datasets were included in the evaluation.

Compared methods
We have compared 4 methods from the literature: Clus-Ensemble [2,21], hierarchical multi-label classification with genetic algorithm (HMC-GA) [4,19], hierarchical multi-label classification with local multi-layer perceptrons (HMC-LMLP) [3], and Adjacency Wrapping matriX (AWX) [6]. The methods were chosen due to the following reasons: 1) Apart from Clus-Ensemble, they are recent Fig. 11 Evaluation on annotations that were added (0 to 1) and on annotations that did not change (0 in both versions) for GO. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets methods. Clus-Ensemble is included because it is used as the state-of-art benchmark in many studies; 2) They are based on different machine learning methods and HMC strategies, ranging from global to local approaches and from interpretable tree or rule based methods to more powerful, but black box techniques; 3) They are publicly available. Next, we provide a brief description of these methods, and details about their parameters. We have set the parameters to the values originally recommended by the authors.

Clus-Ensemble
Clus is a method from the global approach based on predictive clustering trees where decision trees are seen as a hierarchy of clusters whose top node corresponds to a cluster with all the training data. Recursively, Clus minimizes the intra-cluster variance until a stop criterion is met. In this work, we have used the (global) Clus-HMC variant due to its superior results, in combination with the ensemble method Random Forest. Hence, this predictive model consists of a Random Forest of Predictive Clustering Trees. We are using 50 trees within the Random Forest, at least 5 instances per leaf node and the best F-test stopping criterion significance level selected from {0.001, 0.005, 0.01, 0.05, 0.1, 0.125}.

HMC-GA
Using genetic algorithms and the global approach, the method hierarchical multi-label classification with genetic algorithm use a sequential rule covering method where optimal classification rules are created [4,19]. At every iteration, one rule in the format if → then is generated by optimizing the fitness function. Next, the examples covered by the new rule are removed from the training dataset, and new rules are generated until a stop criterion is met. We have used the following parameters:

HMC-LMLP
The method proposed by Cerri [3] addresses the classification problem using the Local approach. More specifically, the Local Classifier per Level strategy where one multi-layer perceptron is trained for each level of the hierarchy. Thus, each neural network is responsible for predicting the classes on its respective level. Moreover, this method adds prediction probabilities from the previous level as extra features for the next neural network, in the sense that each neural network is trained separately and its training dataset is augmented by the previous neural network. Finally, the predictions from each neural network are combined to perform a prediction. If the performance in the validation dataset does not improve in 10 iterations, the training is interrupted.
We have used the following parameters:  into the loss function [6]. This mapping is performed by an auxiliary matrix which makes the gradients updates flow from the neurons associated to leaves to the neurons that are associated to their parent nodes. If the performance degrades on the validation dataset, the training is interrupted immediately. We have used the following parameters: • l-norm: We have used l 1 , since it presented superior results; • Hidden layer: with 1000 neurons with the ReLu activation function and l 2 regularizer 10 −3 ; • Output layer: Logistic activation function and l 2 regularizer 10 −3 ; • Optimizer: Adam with learning rate 10 −5 , β 1 = 0. 9 and β 2 = 0.999 and the cross entropy loss function;

Evaluated datasets
Even though we provide 12 datasets with updated Funcat and GO annotations, we have decided to not include all of them in our analysis. The Church and Pheno datasets have an unusual number of instances with identical feature vectors, mostly due to missing values. In the Church dataset, 2352 out of 3755 instances are unique, leaving 1403 instances with the same feature vector as another instances, but different annotations. A similar behaviour is noticed in the Pheno dataset where only 514 instances out of 1591 are unique [49]. We are considering the Hom and Struc datasets only using the methods Clus-Ensemble and AWX. The other methods, HMC-LMLP and HMC-GA, presented several difficulties when handling these datasets. HMC-LMLP demands much more computational power due to its many neural networks. Similarly, HMC-GA did not converge using the parameters suggested in the original paper.
Some work, such as [5,10,11,13,17,22], have also decided to not include them. Table 12 presents the datasets evaluated in this work.

Standard evaluation
In order to provide benchmark results on the new datasets, we have first performed a standard evaluation. Thus, we evaluated 10 feature sets with 4 possible labels sets for each (two label hierarchies and two annotation timestamps), making a total of 40 datasets. We present the evaluation measure and the statistical test that we have used.

Pooled aUPRC
We have adopted the Pooled area under the precisionrecall curve (AUPRC) evaluation measure since it is consistently used in HMC literature [2,3,5,18,19,21,22,25]. Mind that, generally HMC datasets are heavily imbalanced, making negative predictions very likely, thus evaluation measures such as ROC curves are not recommended. The Pooled AUPRC corresponds to the area under the precision-recall curve generated by taking the Pooled (i.e., micro-averaged) precision and recall over all classes for different threshold values. These threshold values usually consist of values ranging from 0 to 1 with increasing steps of 0.02 for all datasets.
In the equations below, tp stands for true positive, fp means false positive, fn refers to false negative and i ranges over all classes.

Friedman-Nemenyi test
In order to provide statistical evidence, we have used the Friedman-Nemenyi test. At first the Friedman test verifies if any of the compared methods performs statistically significantly different from others. Next, the Nemenyi test ranks the methods where methods with superior results are ranked in higher positions. Graphically, methods connected by a horizontal bar of length equal to a critical distance are not statistically significantly different.

Evaluation procedure to compare datasets from different versions
We also investigated whether models that were trained on a dataset from 2007 are able to discover new annotations, i.e., annotations that were unknown (negative) in 2007, but have been added afterwards. We also check the opposite situation: whether models are able to correct wrong annotations, i.e., annotations that were wrongly positive in 2007, and have been corrected to negative afterwards. For this purpose, we propose an evaluation strategy that compares the predicted probabilities for specific (instance,label) pairs over the different HMC methods. In particular, for a fair comparison, first we take the intersection of the label sets in the 2007 and 2018 dataset versions, respectively. Then, for evaluating the discovery of new annotations, in this intersection, we check the (instance,label) pairs in the test set that were negative in 2007 and positive in 2018. For these pairs, we plot the distribution of predictions for each HMC Fig. 15 Prediction probabilities of labels that changed between versions (written in red inside the red box) are used to build the red box-plot. Labels that occur only in the 2018 versions are not considered in this evaluation (black box) method, trained on the 2007 dataset. Note that a high value would have yielded a false positive prediction in 2007, however, with the current knowledge in functional genomics, this would now yield a true positive prediction. Figure 15 illustrates the procedure. For evaluating the correction of wrong annotations, the procedure is similar, except that we look for positive pairs that became negative.