
Machine learning for discovering missing or wrong protein function annotations

A comparison using updated benchmark datasets

Abstract

Background

A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus trained their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations by training them on the old data and evaluating their results against the most recent information.

Results

The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, while HMC-GA performed better for detecting removed annotations. In this evaluation, however, there were fewer significant differences among the methods.

Conclusions

The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.

Background

Due to technological advancements, the generation of proteomic data has increased substantially. However, annotating all sequences is costly and time-consuming, often making it unfeasible [1]. As a countermeasure, recent studies have employed machine learning methods due to their capacity to automatically predict protein functions.

More specifically, protein function prediction is generally modeled as a hierarchical multi-label classification (HMC) task. HMC is a classification task whose objective is to fit a predictive model f which maps a set of instances X to a set of hierarchically organized labels Y, while respecting hierarchy constraints among Y [2, 3]. The hierarchy constraint states that whenever a particular label yi is predicted, all ancestor labels of yi up to the root node of the hierarchy must be predicted as well.
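To make the constraint concrete, the following sketch (ours, for illustration only, not taken from any of the cited methods) closes a set of predicted labels under the ancestor relation; it works for both tree-shaped (FunCat) and DAG-shaped (GO) hierarchies.

```python
# Illustrative only: close a set of predicted labels under the ancestor
# relation, so that every predicted label drags in all of its ancestors.

def ancestors(label, parents):
    """All ancestors of `label`, given a child -> set-of-parents map."""
    found, stack = set(), list(parents.get(label, ()))
    while stack:
        p = stack.pop()
        if p not in found:
            found.add(p)
            stack.extend(parents.get(p, ()))
    return found

def enforce_hierarchy(predicted, parents):
    """Return `predicted` extended with every ancestor it implies."""
    closed = set(predicted)
    for label in predicted:
        closed |= ancestors(label, parents)
    return closed

# Toy FunCat-like tree: 01.01.03 -> 01.01 -> 01
parents = {"01.01": {"01"}, "01.01.03": {"01.01"}}
print(sorted(enforce_hierarchy({"01.01.03"}, parents)))
# ['01', '01.01', '01.01.03']
```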

In the machine learning literature, when proposing a new method, this method is typically compared to a set of competitor methods on benchmark datasets. For HMC, many studies [2–22] utilized the benchmark datasets proposed in [2]. These datasets are available at https://dtai.cs.kuleuven.be/clus/hmcdatasets/ and contain protein sequences from the species Saccharomyces cerevisiae (yeast) whose functions are mapped to either the Functional Catalogue (FunCat) [24] or Gene Ontology (GO) [23]. The task associated with these datasets is to predict the functions of a protein, given a set of descriptive features (e.g., sequence, homology or structural information).

FunCat and GO are different types of hierarchies. In FunCat (Fig. 1), labels are structured as a tree, meaning that they can have only a single parent label [24]. The GO (Fig. 2), however, allows labels to have multiple parent labels, forming a directed acyclic graph [23]. This complicates the fulfillment of the hierarchy constraint, since multiple classification paths are allowed throughout the graph.

Fig. 1 Partial representation of the FunCat. Each node represents a protein function, and each node can only have a single parent node

Fig. 2 Partial representation of the Gene Ontology. Each node represents a term, and terms can have multiple parent terms

These benchmark datasets were introduced to the HMC community in 2007, and, thus, the functional labels associated with each protein can be considered outdated. There are two reasons for this. First, functional annotations are updated on a regular basis. Second, as can be seen in Fig. 3a, there was a drastic increase in the number of terms throughout the Gene Ontology since the creation of these datasets (January 2007). A similar observation can be made for the number of obsolete terms as shown in Fig. 3b. Accordingly, one of the main goals of this article is to provide updated versions of these widely used HMC benchmark datasets to the research community.

Fig. 3 Quantification of terms in the Gene Ontology since 2007. a Number of terms in the Gene Ontology. b Number of obsolete terms in the Gene Ontology

Using these new datasets, we present a comparison among four recent and open-source HMC methods that can be considered state-of-the-art, thus providing baseline performances as guidelines for future research on this topic. Finally, having two different versions of the same datasets provides us with the unique opportunity to evaluate whether these HMC methods are able to generalize when learning from data with mislabeled instances. In particular, we evaluate whether they are able to predict the correct label in cases where the label has been altered since 2007. To do so, we propose an evaluation procedure where a predictive model is trained using the data from 2007, but tested with data from 2018.

The major contributions of this work are the following: i) we provide new benchmark datasets for HMC (Footnote 1); ii) we provide baseline results for the new datasets; iii) we provide an evaluation procedure and results that assess whether HMC methods are able to discover new or wrong annotations.

The remainder of this article is organized as follows. The “Related work” section presents an overview of studies on HMC which have used the functional annotation benchmark datasets proposed in 2007. The “Updated datasets” section describes how the datasets were updated, together with a quantification of new labels and annotations. In the “Results” section, we present the results of our experiments, and in the “Discussion” section we discuss them. The “Conclusion” section presents our conclusions. Finally, the “Methods” section contains the HMC methods employed and the evaluation strategies.

Related work

In this section, we provide a literature overview of studies that have used the datasets addressed in this work, together with a brief review of hierarchical multi-label classification applications. In Table 1, we present the studies which have used the FunCat and GO datasets.

Table 1 Review on HMC studies which used FunCat and GO datasets

In the HMC literature, methods are separated into two approaches: local and global. The difference between these approaches lies in how their predictive models are designed. The local approach employs machine learning decompositions in which the task is divided into smaller classification problems, and the solutions of the sub-problems are then combined to solve the main task. As an advantage, any predictive model, or even an ensemble of models, can be incorporated into the solution.

According to Silla and Freitas [33], the local approach is further divided into three strategies: Local Classifier per Level [3, 5, 14, 25, 30], Local Classifier per Node [7, 9] and Local Classifier per Parent Node [11, 16]. As their names suggest, these strategies train a predictive model for each level, node or parent node of the hierarchy, respectively. Allowing many types of decomposition is particularly interesting, since different problems may require different solutions. For instance, when handling large hierarchies, the Local Classifier per Parent Node and Local Classifier per Node strategies result in a large number of classifiers being trained, making the Local Classifier per Level strategy more computationally efficient, as it requires only one predictive model per level. However, each level may contain many labels, forcing the models to distinguish among them and possibly making the task more difficult.

Using several strategies, Cerri and De Carvalho [32] investigated how problem transformation methods from the non-hierarchical multi-label literature, which decompose the task into smaller problems similarly to the local approach, behave in the HMC context using Support Vector Machines. Cerri et al. [3, 14, 30] use the Local Classifier per Level strategy by training one neural network for each level of the hierarchy, where the prediction probabilities of the previous level are used as extra attributes for the neural network associated with the next level. Wehrmann et al. [5] extended this idea with an extra global loss function, allowing gradients to flow across all neural networks. Li et al. [34] proposed to use this strategy with deep neural networks to predict the enzyme commission number of enzymes. In a follow-up work, Zou et al. [35] extended this method by enabling the prediction of multi-functional enzymes.

Feng et al. [9] proposed to use the Local Classifier per Node strategy by training one Support Vector Machine for each node of the hierarchy, combined with the SMOTE oversampling technique. This work was slightly improved in Feng et al. [7], where the Support Vector Machines were replaced by Multi-Layer Perceptrons and a post-prediction method based on Bayesian networks was used. Also using Support Vector Machines, the studies of Bi and Kwok [12, 20] proposed new loss functions specific to HMC, which were optimized using Bayes optimization techniques. In a similar manner, Vens et al. [2] proposed to train Predictive Clustering Trees, a variant of decision trees which creates splits by minimizing the intra-cluster variance, for each node, as well as an alternative version where one predictive model is trained per edge.

Ramirez et al. [11, 16] employed the Local Classifier per Parent Node strategy by training one predictive model per parent node of the hierarchy and augmenting the feature vectors with predictions from ancestor classifiers. On a similar note, Kulmanov et al. [36] proposed to train a predictive model for each sub-ontology of the Gene Ontology, combining features automatically learned from the sequences with features based on protein interactions.

Differently from the local approach, the global one employs a single predictive model which is adapted to handle the hierarchy constraint and the relationships among classes. Compared to the local approach, the global one tends to present lower computational complexity, due to the smaller number of models trained. However, its implementation is more complex, since traditional classifiers cannot be used straightforwardly. The global approach is further divided into two strategies: algorithm adaptation and rule induction.

As its name suggests, the algorithm adaptation strategy consists of adapting a traditional algorithm to handle hierarchical constraints. Masera and Blanzieri [6] created a neural network whose architecture incorporates the underlying hierarchy, making gradient updates flow from the neurons associated to the leaves up to the neurons associated to their parent nodes; Sun et al. [8] proposed to use Partial Least Squares to reduce both the label and feature dimensions, followed by an optimal path selection algorithm; Barros et al. [17] proposed a centroid-based method where the training data is initially clustered, predictions are then performed by measuring the distance between the new instance and all clusters, and the label set associated to the closest cluster is given as the prediction; Borges and Nievola [31] developed a competitive neural network whose architecture replicates the hierarchy; Vens et al. [2] also proposed to train a single Predictive Clustering Tree for the entire hierarchy; as an extension of [2], Schietgat et al. [21] proposed to use ensembles of Predictive Clustering Trees; and Stojanova et al. [18] proposed a slight modification of Predictive Clustering Trees in which the correlation between the proteins is also used to build the tree.

In the rule induction strategy, optimization algorithms are designed to generate classification rules consisting of conjunctions of attribute-value tests, i.e., many if→then tests connected by the boolean operator AND (∧). In this regard, several studies by Cerri et al. [4, 15, 19] proposed to use Genetic Algorithms with many different fitness functions. Similarly, other optimization algorithms, such as Ant Colony Optimization [10, 22] and Grammatical Evolution [29], were also investigated in this context.

Additionally, some studies have also addressed topics related to HMC. For instance, Cerri et al. [25] examined how Predictive Clustering Trees can be used to perform feature selection, using Neural Networks and Genetic Algorithms as base classifiers. Almeida and Borges [26] proposed an adaptation of K-Nearest Neighbours to address quantification learning in HMC. Similarly, Triguero and Vens [27] investigated how different thresholds can increase the performance of Predictive Clustering Trees in this context.

Other application domains have also explored HMC, such as managing IT services [37, 38], text classification on social media [39], large scale document classification [40] and annotation of non-coding RNA [41]. It can even be applied to non-hierarchical multi-label problems where artificial hierarchies are created [42].

Updated datasets

In this section, we present an overall description of the datasets and their taxonomies, followed by details on how we updated both FunCat and Gene Ontology versions. The resulting updated versions are available at https://www.kuleuven-kulak.be/nl/onderzoek/itec/projects/research-focus/software.

Overall description

Clare [43] originally proposed 12 datasets containing features extracted from protein sequences of the organism Saccharomyces cerevisiae (yeast), with the protein functions as targets. These 12 datasets contain largely the same proteins but differ in their descriptive features. Furthermore, they are divided into train, test and validation sets.

It is known that the yeast and human genomes share many similar genes; moreover, yeast is considerably cheaper and more efficient to experiment on than other species, making it a widely addressed subject in bioinformatics applications [44]. In Table 2, we provide more information about these datasets.

Table 2 Statistical information on the 2007 datasets

The Hom dataset presents information about analogous (similar) yeast genes. Using a homology engine, such as BLASTn (Footnote 2), other similar yeast genes are discovered, and properties between the sequences from the dataset and their analogous ones are measured. The Pheno dataset contains phenotype data based on knock-out mutants: each gene is removed to form a mutant strain, and the corresponding change in phenotype compared to the wild type (no mutation) is observed after growing both strains on different growth media. The Seq dataset stores features extracted from the amino acid sequences of the proteins, such as molecular weight, length and amino acid ratios. As its name suggests, the Struc dataset contains features based on the secondary structure of the proteins, annotated in a binary format. For unknown structures, the software PROF [45] was used to predict them; known structures were promptly annotated. All the other datasets were constructed based on the expression of genes recorded across an entire genome using microarrays [43].

As an extension of these datasets, Vens et al. [2] mapped the targets to the Gene Ontology taxonomy. Additionally, the FunCat annotations used by Clare [43] were updated.

FunCat is an organism-independent functional taxonomy of protein functions which is widely adopted throughout bioinformatics. As shown in Fig. 1, FunCat places generic functions at the top levels of the taxonomy, and then sequentially divides such functions into more specific ones, forming a tree-shaped hierarchy where each function has a single ancestor function. From the machine learning perspective, FunCat is used as an underlying hierarchy of labels: each protein function is addressed as a label in a classification task where the relationships established by FunCat are taken into account.

Similarly, the Gene Ontology (GO) is a taxonomy whose main goal is to define features of genes in an accurate and species-independent fashion [23]. More specifically, the GO is composed of three sub-ontologies: molecular function, cellular component and biological process. The molecular function sub-ontology contains information about activities performed by gene products at the molecular level. The cellular component sub-ontology, as its name suggests, describes the locations where gene products perform functions. Finally, the biological process sub-ontology annotates processes performed by multiple molecular activities.

All information in the GO is described using terms, which are nodes with a unique ID, a description and relationships with other terms. Due to these relationships, the GO is defined as a directed acyclic graph in the machine learning literature, making classification a challenging task due to the substantially high number of terms and the many intrinsic relationships among them. Figure 2 presents a small part of the GO.

FunCat update

In order to update these datasets, we performed the procedure described in Fig. 4. Using the IDs from the sequences, we queried UniProt, obtaining new annotated functions for the sequences. Next, we built the hierarchy of each dataset and replaced the old annotations by the new ones, i.e., we entirely removed the annotations from 2007 and concatenated the new annotations with the original features. Mind that each dataset described in Table 2 uses a slightly different FunCat subset: the hierarchies differ between the datasets because the protein subsets differ, since not every protein occurs in every one of Clare's original datasets.

Fig. 4 Procedure used to update each FunCat dataset. The sequence IDs are extracted from the 2007 dataset, and used to query new annotations using UniProt. A hierarchy (subset of FunCat) is built using the new annotations. Finally, the old annotations are removed, and the new dataset is created by concatenating the new annotations with the feature vector and IDs
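As an illustration of the hierarchy-building step, the sketch below (an assumed representation for illustration, not the authors' actual script) exploits the fact that FunCat codes are dot-separated paths, so the FunCat subset used by a dataset is simply the prefix closure of its annotation codes.

```python
# Toy sketch: the FunCat hierarchy induced by a set of annotations is the
# prefix-closure of the codes, since each code encodes its own root path.

def funcat_hierarchy(annotations):
    """Return all FunCat classes implied by codes like '01.01.03'."""
    classes = set()
    for code in annotations:
        parts = code.split(".")
        for depth in range(1, len(parts) + 1):
            classes.add(".".join(parts[:depth]))
    return sorted(classes)

print(funcat_hierarchy({"01.01.03", "02.10"}))
# ['01', '01.01', '01.01.03', '02', '02.10']
```

This also makes it clear why the hierarchies differ between datasets: a protein subset with different codes induces a different prefix closure.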

In Table 3, we compare the 2007 and 2018 datasets w.r.t. their label sets. There was a significant increase in the number of labels across the hierarchy, most notably at the third and fourth levels, where the mean number of labels increased from 175 to 208 and from 140 to 168, respectively. A smaller increase is also noticeable at the first, second and last levels.

Table 3 Comparison between the number of labels per level in FunCat 2007 and FunCat 2018

In Table 4, we present, for each dataset, the number of instances with annotations per level. In this case, there was a slight increase at the deeper levels, whereas the mean number of annotated instances at the second and third levels decreased in all datasets.

Table 4 Comparison between the number of annotated instances per level for FunCat 2007 and FunCat 2018

Further, in Table 5 we compare the number of annotations per level between the 2007 and 2018 versions. There was a considerable increase in the number of annotations across all levels of the hierarchy. The last level stands out, as its number of annotations is remarkably low in both versions.

Table 5 Comparison between the number of annotations per level in FunCat 2007 and FunCat 2018

When analyzing the number of annotations that were added and removed (Table 6), the second level presented a higher average number of new annotations, despite having fewer annotated instances now. Noticeable increases were also observed at the third and fourth levels.

Table 6 Comparison between added and removed annotations in FunCat 2007 and FunCat 2018 per level

Gene ontology update

In order to update these datasets, we have performed the procedure shown in Fig. 5.

Fig. 5 Procedure used to update each Gene Ontology dataset. The sequence IDs are extracted from the 2007 dataset, and used to query new terms using UniProt. Obsolete and replaced terms are removed and merged into a single term, respectively. A hierarchy (subset of the Gene Ontology) is built using the new annotations. Finally, the old annotations are removed, and the new dataset is created by concatenating the new annotations with the feature vector and IDs

Initially, we queried Universal Protein (UniProt) via its web service (Footnote 3) using the IDs from the protein sequences, obtaining the GO terms associated with each sequence. Next, we preprocessed the queried terms. The GO keeps track of alternate (secondary) IDs, which are different labels with identical meaning, hence we merged them into a single label. Similarly, we also removed obsolete annotations, since they are deprecated and should not be used anymore. Finally, the old annotations were entirely removed, and the new ones were concatenated to the feature vector. Recall that we are not considering the first level of the Gene Ontology, since it contains the 3 root terms, which are present in all instances. Further, as for FunCat, each dataset contains only a subset of the entire Gene Ontology.
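The preprocessing of alternate and obsolete terms can be sketched as follows, assuming the standard go.obo flat file with its alt_id, is_obsolete and replaced_by tags; this is an illustrative reading of the procedure, not the exact script used.

```python
# Illustrative go.obo reader: maps secondary (alt) IDs to their canonical
# term and resolves obsolete terms through replaced_by (or to None).

def load_go_resolver(obo_path):
    canonical, obsolete, replaced = {}, set(), {}
    term_id = None
    for raw in open(obo_path):
        line = raw.strip()
        if line.startswith("["):                # new [Term]/[Typedef] stanza
            term_id = None
        elif line.startswith("id: GO:"):
            term_id = line[4:]
            canonical[term_id] = term_id
        elif term_id and line.startswith("alt_id: "):
            canonical[line[8:]] = term_id       # merge secondary IDs
        elif term_id and line == "is_obsolete: true":
            obsolete.add(term_id)
        elif term_id and line.startswith("replaced_by: "):
            replaced[term_id] = line[13:]

    def resolve(go_id):
        go_id = canonical.get(go_id, go_id)
        if go_id in obsolete:                   # deprecated term: follow the
            return canonical.get(replaced.get(go_id))  # replacement, if any
        return go_id
    return resolve
```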

Mind that, since the GO is a directed acyclic graph, annotations can belong to multiple levels. In order to present statistics about these datasets, we consider the deepest path to determine the level of every label in Tables 7, 8, 9 and 10.
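The deepest-path rule can be written down compactly; the sketch below (an assumption about the representation: a child → parents map with roots at depth 0) takes a term's level to be the length of the longest path from a root.

```python
# Toy sketch of the "deepest path" level used in Tables 7-10: in a DAG a
# term may be reachable at several depths, so the longest path wins.
from functools import lru_cache

def make_level(parents):
    @lru_cache(maxsize=None)
    def level(term):
        ps = parents.get(term, ())
        return 0 if not ps else 1 + max(level(p) for p in ps)
    return level

# 'd' is reachable at depth 1 via root 'a' and depth 2 via 'b':
parents = {"b": ("a",), "d": ("b", "a")}
level = make_level(parents)
print(level("d"))  # 2  (deepest path wins)
```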

Table 7 Comparison between the number of labels per level in Gene Ontology 2007 and Gene Ontology 2018
Table 8 Comparison between the number of annotated instances per level in Gene Ontology 2007 and Gene Ontology 2018
Table 9 Comparison between the number of annotations per level in Gene Ontology 2007 and Gene Ontology 2018
Table 10 Comparison between the number of annotations added and removed in Gene Ontology 2007 and Gene Ontology 2018 per level

As shown in Table 7, the behaviour was similar to the FunCat update: there was a substantial increase in the number of labels throughout all levels, especially between the third and twelfth levels. Two extra levels were added, making a total of 15; nonetheless, these levels contain only a few classes.

We observed an overall increase in the number of instances per level throughout the hierarchies (Table 8), with no remarkable decreases. We noticed that only the validation and test sets contain instances at the last level of the hierarchy. From the machine learning perspective, this condition might hinder predictive models, as most of them are not capable of predicting a class that is not present in the training dataset. Future studies might therefore consider removing the last level. Difficulties might also emerge at the fourteenth level, as the datasets have very few instances there.

As seen in Table 9, there was once again an increase in the number of annotations per level. The number of annotations gradually increases up to a certain level, after which it decreases to almost none at the deepest levels.

When examining the number of annotations that were added or removed per level (Table 10), we once again perceive an overall increase in all datasets. Naturally, no annotations were removed at the fourteenth and fifteenth levels, as these were not present in the 2007 versions.

Results

Initially, we present a standard evaluation among the HMC methods. Next, we also present an alternative evaluation where the HMC methods are compared w.r.t. their ability to discover new or wrong annotations.

Standard evaluation

In Table 11, we present a comparison of the Pooled AUPRC obtained using the standard evaluation procedure. Since HMC-LMLP, HMC-GA and AWX are stochastic, we report the mean result of 5 runs, together with the standard deviation. Mind that, since we reran all methods on our datasets, variations may occur compared to the results originally reported in the respective papers.

Table 11 Pooled AUPRC of the evaluated methods

Even though Clus-Ensemble is the oldest of the compared methods, it still provided better results in most of the experiments. This is best seen in the FunCat 2018 datasets, where Clus-Ensemble consistently presented results close to 0.4, whereas the second best method, HMC-LMLP, achieved at most 0.24 on any of the datasets. As can be seen in Fig. 6, Clus-Ensemble was the overall best method, and it performs statistically significantly better than HMC-GA and AWX.

Fig. 6 Friedman-Nemenyi test evaluating the four HMC methods using the standard evaluation procedure

The second method evaluated, HMC-GA, yielded lower performance overall. In most cases, HMC-GA was superior to AWX, but still inferior to Clus-Ensemble and HMC-LMLP. The method HMC-LMLP provided decent results and managed to significantly outperform AWX. Furthermore, HMC-LMLP was ranked as the second best method overall, providing superior results in all of the Gene Ontology 2007 datasets.

An unusual behaviour was noticed for the AWX method, as it yielded very poor results on many occasions. Even though the parameter values were extracted from the original paper, its results were fairly different. For instance, in the Derisi, Seq and Spo datasets from all versions, AWX was severely underfitted, with results below 0.1. Similar cases occurred in the FunCat and Gene Ontology 2007 Expr datasets.

When comparing the performance between different versions of the datasets, we noticed an overall improvement in the methods when moving from 2007 to 2018. Even though their label sets are larger now, the addition of annotations to the instances compensates for this difference, resulting in better performance.

2007 vs 2018

Here, we evaluate how the HMC methods perform when trained using data from 2007 but evaluated using datasets from 2018. For the methods HMC-LMLP, HMC-GA and AWX, we used, for each (instance, label) pair, the mean prediction probability over 5 runs.

For all figures presented here, we also include a boxplot for the (instance, label) pairs that did not change between the two dataset versions. This allows us to see to what extent the methods can detect annotations that were falsely negative or falsely positive in the 2007 data. The number between parentheses corresponds to the number of (instance, label) pairs evaluated for a particular setting and dataset. Note that the number of unchanged pairs is much higher than the number of changed pairs; hence, the outliers (prediction probabilities outside the whiskers) should be disregarded.

Furthermore, we also employed the Friedman-Nemenyi test to provide statistical validation. In this case, we used the difference between the medians of the prediction probabilities for the annotations that changed and those that did not change between the two dataset versions.

FunCat

Figure 7 demonstrates that all methods are capable of detecting missing annotations from the FunCat taxonomy: the distribution of prediction probabilities for the changed annotations is consistently higher than for the annotations that remained negative, as there is a visible difference in location (median) and spread between the boxplots of the changed and unchanged annotations for all evaluated methods.

Fig. 7 Evaluation on annotations that were added (0 to 1) and on annotations that did not change (0 in both versions) for FunCat. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Clus-Ensemble and HMC-GA provided similar results; however, Clus-Ensemble was slightly superior, since its prediction probabilities tended to be higher. Moreover, when evaluating the labels that did not change (remained absent), Clus-Ensemble provided very low prediction probabilities. In Fig. 8, Clus-Ensemble was ranked first, although not statistically differently from HMC-GA and HMC-LMLP.

Fig. 8 Friedman-Nemenyi test evaluating annotations that were added (FunCat)

The AWX method, in turn, managed to be superior in the Hom dataset. However, it underperformed on the other datasets, especially Derisi, Expr, Seq and Spo. In these datasets, AWX predicted almost all annotations to be absent, except for very few outliers, which received a very high prediction probability.

HMC-LMLP presented decent results in almost all datasets. Nonetheless, for labels that did not change, HMC-LMLP tended to provide higher prediction probabilities, whereas Clus-Ensemble yielded lower ones, giving Clus-Ensemble an advantage over HMC-LMLP.

Hence, in the context of discovering new annotations, Clus-Ensemble is the safer choice, as it performed better on almost all datasets, although its advantage was minimal.

When addressing labels that were removed (Fig. 9), we observed very similar results. As seen in Fig. 10, HMC-GA provided superior results, but was still not statistically different from Clus-Ensemble and HMC-LMLP. AWX yielded lower prediction probabilities in most of the datasets, with the exception of the Hom dataset. Since its prediction probabilities were also low for labels that were present in both versions of the datasets, it performed the worst among the compared methods.

Fig. 9 Evaluation on annotations that were removed (1 to 0) and on annotations that did not change (1 in both versions) for FunCat. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 10 Friedman-Nemenyi test evaluating annotations that were removed (FunCat)

Gene ontology

As can be seen in Fig. 11, Clus-Ensemble and HMC-GA were superior in most of the datasets. Additionally, the AWX method also presented desirable results, especially in the Derisi and Seq datasets, where it output very high probabilities for added annotations and very low ones for labels that did not change. These three methods were not statistically different from each other, as shown in Fig. 12.

Fig. 11 Evaluation on annotations that were added (0 to 1) and on annotations that did not change (0 in both versions) for GO. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 12 Friedman-Nemenyi test evaluating annotations that were added (GO)

The HMC-LMLP method also presented visually comparable results overall; nonetheless, it yielded higher predictions for annotations that did not change in some datasets, such as Expr, Gasch1 and Gasch2.

When examining the labels that were removed in Fig. 13, we noticed a different outcome. In this case, all methods presented very similar results, making performance almost indistinguishable in most of the datasets. Additionally, there was no statistical difference among these methods, as shown in Fig. 14.

Fig. 13 Evaluation on annotations that were removed (1 to 0) and on annotations that did not change (1 in both versions) for GO. a Cellcycle, Derisi and Eisen datasets. b Expr, Gasch1 and Gasch2 datasets. c Seq, Spo, Hom and Struc datasets

Fig. 14 Friedman-Nemenyi test evaluating annotations that were removed (GO)

Discussion

In this section, we present a discussion about the results presented in the previous section. Following the same order, we first address the standard evaluation, followed by the comparison between the versions of the datasets.

Standard evaluation

As shown in Fig. 6, Clus-Ensemble's superior predictive performance, in combination with an efficient learning method (random forests), the ability to handle datasets with many features (as seen in the Struc and Hom datasets), and its interpretability aspects (e.g., the variable ranking and proximity measures associated with random forests), confirms the state-of-the-art status of Clus-Ensemble.

We believe that the ensemble method, random forest, contributes substantially to the performance. By considering many models, Clus-Ensemble is able to generalize more and consequently provide superior results. The other evaluated methods do not make use of any ensemble mechanism. Even though HMC-LMLP contains many neural networks, these are trained to operate as a single model that distinguishes between different classes, rather than as an ensemble.

HMC-GA provided inferior results in many cases; nonetheless, it offers the highest interpretability, since it generates classification rules. Similarly, Clus-Ensemble produces many trees, which are readable by themselves; however, their interpretability decreases as the number of trees increases. In contrast, the neural network methods, HMC-LMLP and AWX, are black-box models, and thus not readable in a straightforward way.

When comparing the neural network methods, HMC-LMLP and AWX, HMC-LMLP clearly had the upper hand. We believe that this is due to HMC-LMLP being a local approach, whereas AWX is a global one. Since one neural network is trained for each level of the hierarchy, the neural networks are trained to distinguish among fewer classes, making the classification task easier, and, thus, providing better results. The computational complexity of HMC-LMLP, however, is considerably higher than the other methods due to many neural networks being built during its training.

Despite some undesirable results, AWX is the only method that explicitly exploits the hierarchy constraint by propagating gradients from neurons associated to leaves to neurons associated to their parents. Mind that the other methods also respect the constraint, but they exploit it to a smaller extent during their training.

Moreover, we believe that AWX's early stopping criterion negatively affected its results. In order to prevent overfitting, AWX interrupts the training right after the performance on the validation set decreases. However, these datasets contain noise in their label sets, so a small oscillation may occur. Considering more iterations, as performed by HMC-LMLP, could possibly increase AWX's performance. Moreover, neural networks are very parameter dependent; despite using the parameters recommended for the 2007 versions for all methods, their performance might increase if they were tuned again on the 2018 datasets.

2007 vs 2018

FunCat

As described previously, when analyzing labels that changed from absent to present (0 to 1), Clus-Ensemble had the overall best results, whereas HMC-GA was the best for present to absent (1 to 0). We believe that this finding is highly correlated with how the evaluated methods produce their prediction probabilities.

Clus-Ensemble outputs the mean prediction probability of the instances associated with the predicted leaf node. According to the parameters used, the minimum number of such instances is 5, making the lowest positive prediction probability 0.2 per tree (one positive instance among five yields 1/5 = 0.2). Even though fairly low, this is still reasonably high in HMC due to label sparsity, resulting in high prediction probabilities in many cases, and thus better performance.

Likewise, the HMC-GA method yielded high prediction probabilities in some cases, giving results similar to Clus-Ensemble. Moreover, their heuristic (variance reduction) is the same. The main difference between HMC-GA and Clus-Ensemble lies in the fact that HMC-GA uses a mean rule (predicting the mean label set of the training dataset) whenever a test instance is not classified by any of the rules, which possibly results in a sparse prediction with very low prediction probabilities.

Despite having decent results, HMC-LMLP presented very high prediction probabilities for labels that did not change between versions. We believe that this is related to how neural networks learn the distribution of the data. Since neural networks are very powerful models, they can learn more complex boundaries than Clus-Ensemble and HMC-GA, resulting in the neural networks adjusting themselves strictly to the training dataset. HMC-LMLP is not overfitted though, as shown in Table 11; nonetheless, its usage is not recommended if label noise is likely to be present.

Lastly, AWX had the best performance in the Hom dataset. However, it underperformed in several other cases. Once again, the early stopping criterion might have forced the neural network to a sub-optimal configuration, resulting in very biased predictions, i.e. AWX assumes most of the labels to be either positive or negative.

When evaluating labels that were removed, HMC-GA was superior. We believe that the mean rule might have artificially contributed since very low probabilities are predicted for most of the labels in this case.

Gene ontology

In the GO datasets, we noticed a similar behaviour. In most of the situations, Clus-Ensemble performed better when evaluating labels that were added, whereas HMC-GA was superior for removed labels.

Consequently, we recommend using HMC-GA to predict which annotations are likely to be removed in future versions of the datasets (noise), since it presented better results for removed labels in both FunCat and GO.

Similarly to the FunCat experiments, HMC-LMLP had average performance: it was statistically significantly inferior to the other methods for added labels, but equivalent to them for removed labels.

Compared to its performance on FunCat, AWX performed better here. For labels that were added, even though ranked in lower positions, AWX managed not to be statistically significantly different from Clus-Ensemble and HMC-GA. Likewise, for removed labels, AWX also performed reasonably. This is somewhat surprising, since the GO datasets have even more labels to be distinguished, and the same parameters were used.

Conclusion

In this work, we have presented updated benchmark datasets for hierarchical multi-label classification (HMC) in the area of protein function prediction. We have also performed a comparison among four HMC methods to provide baseline results on these datasets. Finally, we have proposed an alternative evaluation procedure to assess the ability of HMC methods to detect missing or wrong annotations. For this purpose, we make use of both the old and new versions of the datasets.

In all datasets, we noticed a significant increase in the hierarchy size and in the number of annotations associated with instances. As a consequence, when performing a standard evaluation, HMC methods performed better on the updated versions. Despite having more labels to distinguish, the instances now have more annotations associated with them, resulting in better predictions. The overall best method in this task was Clus-Ensemble, a random forest of decision trees adapted to HMC; nonetheless, the results remained fairly low overall. Thus, protein function prediction remains a very challenging task for the machine learning community.

In this direction, further studies in this area are necessary. In particular, we encourage the use of deep learning methods, since the amount of available data is constantly increasing, and recent deep neural networks are capable of learning straight from DNA sequences (without the need for extracting features) [46].

When it comes to detecting missing or wrong annotations, in the FunCat datasets, Clus-Ensemble was the best at detecting missing annotations, whereas HMC-GA did better for annotations that were removed. In the Gene Ontology datasets, Clus-Ensemble performed better at detecting missing annotations, and competitive results were obtained for wrong annotations.

To conclude, we recommend using the updated datasets in future studies on this topic. However, the previous versions of these datasets should not be disregarded, since having two versions can be of interest for performing an evaluation similar to ours on new HMC methods, or for other fields in machine learning, such as weakly supervised classification, noise detection and incremental learning [47, 48].

Methods

In this section, we provide details about our experimental setup. First, we present the methods used for comparison. Then, we explain which datasets were included in the evaluation. Finally, we describe the two evaluation strategies.

Compared methods

We have compared 4 methods from the literature: Clus-Ensemble [2, 21], hierarchical multi-label classification with genetic algorithm (HMC-GA) [4, 19], hierarchical multi-label classification with local multi-layer perceptrons (HMC-LMLP) [3], and Adjacency Wrapping matriX (AWX) [6]. The methods were chosen for the following reasons: 1) apart from Clus-Ensemble, they are recent methods; Clus-Ensemble is included because it is used as the state-of-the-art benchmark in many studies; 2) they are based on different machine learning methods and HMC strategies, ranging from global to local approaches and from interpretable tree- or rule-based methods to more powerful, but black-box, techniques; 3) they are publicly available. Next, we provide a brief description of these methods and details about their parameters. We have set the parameters to the values originally recommended by the authors.

Clus-Ensemble

Clus is a global-approach method based on predictive clustering trees, where decision trees are seen as a hierarchy of clusters whose top node corresponds to a cluster containing all the training data. Recursively, Clus minimizes the intra-cluster variance until a stopping criterion is met. In this work, we used the (global) Clus-HMC variant, due to its superior results, in combination with the ensemble method random forest. Hence, this predictive model consists of a random forest of predictive clustering trees. We use 50 trees within the random forest, at least 5 instances per leaf node, and the best F-test stopping criterion significance level selected from {0.001, 0.005, 0.01, 0.05, 0.1, 0.125}.
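The variance-reduction split criterion at the core of predictive clustering trees can be sketched as follows; this is a toy, unweighted version (Clus itself supports class weights and the F-test stopping criterion listed above).

```python
# Toy sketch of the PCT split criterion: a candidate split is scored by
# the reduction in intra-cluster variance of the multi-label targets.
import numpy as np

def variance(Y):
    """Summed per-class variance of a set of label vectors (rows)."""
    return float(np.var(Y, axis=0).sum()) if len(Y) else 0.0

def variance_reduction(Y, mask):
    """Score a binary split `mask` of the instances with labels Y."""
    left, right = Y[mask], Y[~mask]
    n = len(Y)
    return variance(Y) - (len(left) / n) * variance(left) \
                       - (len(right) / n) * variance(right)

Y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]])
split = np.array([True, True, False, False])   # perfectly separates labels
print(variance_reduction(Y, split))            # maximal for this Y: 0.75
```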

HMC-GA

Using genetic algorithms and the global approach, the method hierarchical multi-label classification with genetic algorithm uses a sequential rule-covering procedure in which optimized classification rules are created [4, 19]. At every iteration, one rule in the if→then format is generated by optimizing the fitness function. Next, the examples covered by the new rule are removed from the training dataset, and new rules are generated until a stopping criterion is met (a stub of this covering loop is sketched after the parameter list below). We have used the following parameters:

  • Population size : 100 rules;

  • Number of Generations: 1000;

  • Stopping criterion: 1% of uncovered examples;

  • Crossover rate: 90%;

  • Mutation rate: 10%;
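The covering loop referred to above can be stubbed as follows; `evolve_rule` is a hypothetical placeholder for the genetic algorithm (population 100, 1000 generations) that optimizes a single rule's fitness on the still-uncovered examples.

```python
# Stub of the sequential covering loop; the GA internals are abstracted.

def sequential_covering(train, evolve_rule, min_uncovered=0.01):
    """Build rules until at most 1% of the training examples are uncovered."""
    rules, uncovered = [], list(train)
    while len(uncovered) / len(train) > min_uncovered:
        rule = evolve_rule(uncovered)           # GA optimizes one if->then rule
        if not any(rule.covers(x) for x in uncovered):
            break                               # guard against a useless rule
        rules.append(rule)
        uncovered = [x for x in uncovered if not rule.covers(x)]
    return rules
```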

HMC-LMLP

The method proposed by Cerri [3] addresses the classification problem using the local approach, more specifically the Local Classifier per Level strategy, where one multi-layer perceptron is trained for each level of the hierarchy. Thus, each neural network is responsible for predicting the classes at its respective level. Moreover, this method adds the prediction probabilities from the previous level as extra features for the next neural network: each neural network is trained separately, and its training dataset is augmented with the predictions of the previous neural network. Finally, the predictions from the networks are combined to perform a prediction. If the performance on the validation dataset does not improve for 10 iterations, the training is interrupted.
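A compact sketch of this per-level scheme, using scikit-learn MLPs as stand-ins for the original networks (the 0.6 input multiplier and 200 epochs mirror the parameter list below; everything else is illustrative):

```python
# Illustrative Local Classifier per Level: each level's inputs are the
# original features plus the previous level's prediction probabilities.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_per_level(X, Y_levels):
    """Y_levels: one binary label matrix (n_instances, n_labels) per level."""
    models, inputs = [], X
    for Y in Y_levels:
        hidden = max(1, int(0.6 * inputs.shape[1]))
        net = MLPClassifier(hidden_layer_sizes=(hidden,),
                            activation="logistic", max_iter=200)
        net.fit(inputs, Y)                       # train this level alone
        models.append(net)
        probs = net.predict_proba(inputs)        # this level's outputs become
        inputs = np.hstack([X, probs])           # extra features for the next
    return models
```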

We have used the following parameters:

  • Hidden Layers Size: the number of neurons per hidden layer is obtained by multiplying the number of inputs by the values [0.6,0.5,0.4,0.3,0.2,0.1] for the FunCat datasets and [0.65,0.65,0.6,0.55,0.5,0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1] for the GO datasets;

  • Activation Function: Logistic (sigmoid) activation function;

  • Optimizer: Backpropagation with 200 epochs, learning rates {0.05, 0.03} and momenta {0.03, 0.01}, alternating between levels;

AWX

Using neural networks and the global approach, the method Adjacency Wrapping matriX (AWX) employs a single model in which the underlying hierarchy is mapped into the loss function [6]. This mapping is performed by an auxiliary matrix which makes the gradient updates flow from the neurons associated to the leaves to the neurons associated to their parent nodes. If the performance degrades on the validation dataset, the training is interrupted immediately. We have used the following parameters (a toy sketch of the wrapping-matrix idea follows the list):

  • l-norm: We have used l1, since it presented superior results;

  • Hidden layer: 1000 neurons with the ReLU activation function and l2 regularizer 10^-3;

  • Output layer: Logistic activation function and l2 regularizer 10^-3;

  • Optimizer: Adam with learning rate 10^-5, β1 = 0.9 and β2 = 0.999, and the cross-entropy loss function;
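The following toy numpy sketch illustrates the wrapping-matrix idea with an l1-style aggregation (a sum of leaf scores clipped at 1, matching the l-norm choice above); it is a conceptual illustration, not the AWX implementation.

```python
# Leaf scores are mapped to every label through a 0/1 leaf-to-label
# ancestry matrix, so internal labels always score at least their leaves.
import numpy as np

def awx_outputs(leaf_probs, ancestry):
    """leaf_probs: (n, n_leaves); ancestry: (n_leaves, n_labels) 0/1 matrix
    where ancestry[j, k] = 1 iff label k is leaf j itself or one of its
    ancestors. Returns per-label scores respecting the hierarchy."""
    return np.clip(leaf_probs @ ancestry, 0.0, 1.0)

# Two leaves under one root: the root score is the clipped leaf sum.
ancestry = np.array([[1, 1, 0],     # leaf 0 -> root, itself
                     [1, 0, 1]])    # leaf 1 -> root, itself
print(awx_outputs(np.array([[0.7, 0.6]]), ancestry))  # [[1.0, 0.7, 0.6]]
```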

Evaluated datasets

Even though we provide 12 datasets with updated FunCat and GO annotations, we decided not to include all of them in our analysis. The Church and Pheno datasets have an unusual number of instances with identical feature vectors, mostly due to missing values. In the Church dataset, 2352 out of 3755 instances are unique, leaving 1403 instances with the same feature vector as other instances but different annotations. A similar behaviour is noticed in the Pheno dataset, where only 514 out of 1591 instances are unique [49].

We consider the Hom and Struc datasets only for the methods Clus-Ensemble and AWX. The other methods, HMC-LMLP and HMC-GA, presented several difficulties when handling these datasets: HMC-LMLP demands much more computational power due to its many neural networks, and HMC-GA did not converge using the parameters suggested in the original paper. Some works, such as [5, 10, 11, 13, 17, 22], have also decided not to include them.

Table 12 presents the datasets evaluated in this work.

Table 12 Evaluated datasets

Standard evaluation

In order to provide benchmark results on the new datasets, we first performed a standard evaluation. Thus, we evaluated 10 feature sets, each with 4 possible label sets (two label hierarchies and two annotation timestamps), making a total of 40 datasets. Below, we present the evaluation measure and the statistical test that we used.

Pooled AUPRC

We have adopted the Pooled area under the precision-recall curve (AUPRC) as evaluation measure, since it is consistently used in the HMC literature [2, 3, 5, 18, 19, 21, 22, 25]. Mind that HMC datasets are generally heavily imbalanced, making negative predictions very likely; thus, evaluation measures such as ROC curves are not recommended.

The Pooled AUPRC corresponds to the area under the precision-recall curve generated by taking the pooled (i.e., micro-averaged) precision and recall over all classes for different threshold values. The threshold values range from 0 to 1 with steps of 0.02 for all datasets.

In the equations below, tp stands for true positive, fp means false positive, fn refers to false negative and i ranges over all classes.

$$ \text{Pooled\_precision} = \frac{\sum_{i}{tp_{i}}}{\sum_{i}{tp_{i}} + \sum_{i}{fp_{i}}} $$
(1)
$$ \text{Pooled\_recall} = \frac{\sum_{i}{tp_{i}}}{\sum_{i}{tp_{i}} + \sum_{i}{fn_{i}}} $$
(2)
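A direct implementation of Eqs. (1) and (2) over the threshold grid can be sketched as follows (illustrative only, not the evaluation script used in the experiments):

```python
# Micro-averaged (pooled) precision/recall at thresholds 0, 0.02, ..., 1,
# with the area under the resulting PR curve taken by the trapezoidal rule.
import numpy as np

def pooled_auprc(y_true, y_score, step=0.02):
    """y_true, y_score: (n_instances, n_classes) arrays; assumes y_true
    contains at least one positive."""
    points = []
    for t in np.arange(0.0, 1.0 + step, step):
        pred = y_score >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp + fp > 0:                          # precision defined here
            points.append((tp / (tp + fn), tp / (tp + fp)))  # (recall, prec)
    points.sort()                                # by increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0      # trapezoidal rule
    return area

y_true = np.array([[1, 0], [0, 1], [1, 1]])
y_score = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])
print(pooled_auprc(y_true, y_score))
```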

Friedman-Nemenyi test

In order to provide statistical evidence, we have used the Friedman-Nemenyi test. First, the Friedman test verifies whether any of the compared methods performs statistically significantly differently from the others. Next, the Nemenyi test ranks the methods, with superior results ranked in higher positions. Graphically, methods connected by a horizontal bar of length equal to the critical distance are not statistically significantly different.
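As a sketch, the test can be reproduced with SciPy's Friedman test plus the Nemenyi post-hoc from the third-party scikit-posthocs package (an assumed dependency chosen for illustration, not the paper's own tooling):

```python
# Friedman omnibus test followed by the Nemenyi post-hoc over a toy
# score matrix with one row per dataset and one column per method.
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

scores = np.array([[0.40, 0.21, 0.24, 0.10],   # rows: datasets
                   [0.38, 0.20, 0.22, 0.09],   # cols: methods
                   [0.41, 0.23, 0.25, 0.12],
                   [0.39, 0.19, 0.23, 0.08]])

stat, p = friedmanchisquare(*scores.T)          # one sample per method
print(p)                                        # small p -> methods differ
print(sp.posthoc_nemenyi_friedman(scores))      # pairwise p-value matrix
```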

Evaluation procedure to compare datasets from different versions

We also investigated whether models that were trained on a dataset from 2007 are able to discover new annotations, i.e., annotations that were unknown (negative) in 2007, but have been added afterwards. We also check the opposite situation: whether models are able to correct wrong annotations, i.e., annotations that were wrongly positive in 2007, and have been corrected to negative afterwards. For this purpose, we propose an evaluation strategy that compares the predicted probabilities for specific (instance,label) pairs over the different HMC methods.

In particular, for a fair comparison, we first take the intersection of the label sets of the 2007 and 2018 dataset versions. Then, to evaluate the discovery of new annotations, within this intersection, we check the (instance, label) pairs in the test set that were negative in 2007 and positive in 2018. For these pairs, we plot the distribution of predictions of each HMC method trained on the 2007 dataset. Note that a high value would have yielded a false positive prediction in 2007; however, with the current knowledge in functional genomics, it would now yield a true positive prediction. Figure 15 illustrates the procedure. For evaluating the correction of wrong annotations, the procedure is similar, except that we look for positive pairs that became negative.
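The pair selection can be sketched as follows, assuming the label matrices of both versions have already been restricted to their intersected label set and aligned instance-wise:

```python
# Illustrative selection of changed and unchanged (instance, label) pairs.
import numpy as np

def changed_pairs(y_2007, y_2018, scores):
    """y_2007, y_2018: aligned binary matrices; scores: predictions of a
    model trained on the 2007 data. Returns score groups for the boxplots."""
    added         = (y_2007 == 0) & (y_2018 == 1)  # missing annotations found
    removed       = (y_2007 == 1) & (y_2018 == 0)  # wrong annotations fixed
    unchanged_neg = (y_2007 == 0) & (y_2018 == 0)
    unchanged_pos = (y_2007 == 1) & (y_2018 == 1)
    return (scores[added], scores[removed],
            scores[unchanged_neg], scores[unchanged_pos])
```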

Fig. 15 Prediction probabilities of labels that changed between versions (written in red inside the red box) are used to build the red box-plot. Labels that occur only in the 2018 versions are not considered in this evaluation (black box)

Availability of data and materials

The datasets from 2007 and the Clus-Ensemble method are available at https://dtai.cs.kuleuven.be/clus/. The methods HMC-GA and HMC-LMLP are available at http://www.biomal.ufscar.br/resources.html. The AWX method is available at https://github.com/lucamasera/AWX. The new dataset versions are available at: https://www.kuleuven-kulak.be/nl/onderzoek/itec/projects/research-focus/software.

Notes

  1. Available at: https://www.kuleuven-kulak.be/nl/onderzoek/itec/projects/research-focus/software

  2. https://blast.ncbi.nlm.nih.gov/Blast.cgi

  3. https://www.uniprot.org/uniprot/

Abbreviations

AUPRC: Area under the precision-recall curve
AWX: Adjacency Wrapping matriX
FunCat: Functional Catalogue
GO: Gene Ontology
HMC: Hierarchical multi-label classification
HMC-GA: Hierarchical multi-label classification with genetic algorithm
HMC-LMLP: Hierarchical multi-label classification with local multi-layer perceptrons
UniProt: Universal Protein

References

  1. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10(3):221.


  2. Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H. Decision trees for hierarchical multi-label classification. Mach Learn. 2008; 73:185–214.


  3. Cerri R, Barros RC, de Carvalho ACPLF, Jin Y. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics. 2016; 17(1):373.


  4. Cerri R, Basgalupp MP, Barros RC, de Carvalho ACPLF. Inducing hierarchical multi-label classification rules with genetic algorithms. Appl Soft Comput. 2019; 77:584–604. https://doi.org/10.1016/j.asoc.2019.01.017.


  5. Wehrmann J, Cerri R, Barros R. Hierarchical multi-label classification networks In: Dy J, Krause A, editors. Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80. Stockholmsmässan: PMLR: 2018. p. 5075–84. http://proceedings.mlr.press/v80/wehrmann18a.html.


  6. Masera L, Blanzieri E. Awx: An integrated approach to hierarchical-multilabel classification In: Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G, editors. Machine Learning and Knowledge Discovery in Databases. Springer, Cham: 2019. p. 322–36.

  7. Feng S, Fu P, Zheng W. A hierarchical multi-label classification method based on neural networks for gene function prediction. Biotechnol Biotechnol Equip. 2018:1–9.

  8. Sun Z, Zhao Y, Cao D, Hao H. Hierarchical multilabel classification with optimal path prediction. Neural Process Lett. 2017; 45(1):263–77.


  9. Feng S, Fu P, Zheng W. A hierarchical multi-label classification algorithm for gene function prediction. Algorithms. 2017; 10(4):138.


  10. Khan S, Baig AR. Ant colony optimization based hierarchical multi-label classification algorithm. Appl Soft Comput. 2017; 55:462–79.


  11. Ramírez-Corona M, Sucar LE, Morales EF. Hierarchical multilabel classification based on path evaluation. Int J Approx Reason. 2016; 68:179–93.


  12. Bi W, Kwok JT. Bayes-optimal hierarchical multilabel classification. IEEE Trans Knowl Data Eng. 2015; 27(11):2907–18.


  13. Golzari F, Jalili S. Vr-bfdt: A variance reduction based binary fuzzy decision tree induction method for protein function prediction. J Theor Biol. 2015; 377:10–24.


  14. Cerri R, Barros RC, de Carvalho ACPLF. Hierarchical classification of gene ontology-based protein functions with neural networks. In: Neural Networks (IJCNN), 2015 International Joint Conference On: 2015. p. 1–8. https://doi.org/10.1109/IJCNN.2015.7280474.

  15. Cerri R, Barros RC, Freitas AA, de Carvalho AC. Evolving relational hierarchical classification rules for predicting gene ontology-based protein functions. In: Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation. ACM: 2014. p. 1279–86. https://doi.org/10.1145/2598394.2611384.

  16. Ramírez-Corona M, Sucar LE, Morales EF. Chained path evaluation for hierarchical multi-label classification. In: The Twenty-Seventh International Flairs Conference. AAAI Publications: 2014. https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS14/paper/view/7779.

  17. Barros RC, Cerri R, Freitas AA, de Carvalho ACPLF. Probabilistic clustering for hierarchical multi-label classification of protein functions In: Blockeel H, Kersting K, Nijssen S, železný F, editors. Machine Learning and Knowledge Discovery in Databases. Berlin: Springer: 2013. p. 385–400.


  18. Stojanova D, Ceci M, Malerba D, Dzeroski S. Using ppi network autocorrelation in hierarchical multi-label classification trees for gene function prediction. BMC Bioinformatics. 2013; 14(1):285.


  19. Cerri R, Barros RC, de Carvalho ACPLF. A genetic algorithm for hierarchical multi-label classification. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing. SAC ’12. New York: ACM: 2012. p. 250–5. https://doi.org/10.1145/2245276.2245325.


  20. Bi W, Kwok JT. Multi-label classification on tree- and dag-structured hierarchies. In: Proceedings of the 28th International Conference on International Conference on Machine Learning. ICML’11. USA: Omnipress: 2011. p. 17–24. http://dl.acm.org/citation.cfm?id=3104482.3104485.


  21. Schietgat L, Vens C, Struyf J, Blockeel H, Kocev D, Džeroski S. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics. 2010; 11(2):1–14. https://doi.org/10.1186/1471-2105-11-2.


  22. Otero FEB, Freitas AA, Johnson CG. A hierarchical multi-label classification ant colony algorithm for protein function prediction. Memetic Comput. 2010; 2(3):165–81. https://doi.org/10.1007/s12293-010-0045-4.


  23. Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Res. 2007; 36(suppl_1):440–4.


  24. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539–45.

  25. Cerri R, Mantovani RG, Basgalupp MP, de Carvalho ACPLF. Multi-label feature selection techniques for hierarchical multi-label protein function prediction. In: 2018 International Joint Conference on Neural Networks (IJCNN): 2018. p. 1–7. https://doi.org/10.1109/IJCNN.2018.8489247.

  26. Almeida TB, Borges HB. An adaptation of the ML-kNN algorithm to predict the number of classes in hierarchical multi-label classification. In: Torra V, Narukawa Y, Honda A, Inoue S, editors. Modeling Decisions for Artificial Intelligence. Cham: Springer: 2017. p. 77–88. https://doi.org/10.1007/978-3-319-67422-3_8.

  27. Triguero I, Vens C. Labelling strategies for hierarchical multi-label classification techniques. Pattern Recognit. 2016; 56:170–83.

  28. Santos A, Canuto A. Applying semi-supervised learning in hierarchical multi-label classification. Expert Syst Appl. 2014; 41(14):6075–85. https://doi.org/10.1016/j.eswa.2014.03.052.

  29. Cerri R, Barros RC, de Carvalho AC, Freitas AA. A grammatical evolution algorithm for generation of hierarchical multi-label classification rules. In: 2013 IEEE Congress on Evolutionary Computation. IEEE: 2013. p. 454–61. https://doi.org/10.1109/cec.2013.6557604.

  30. Cerri R, Barros RC, de Carvalho ACPLF. Hierarchical multi-label classification for protein function prediction: A local approach based on neural networks. In: 2011 11th International Conference on Intelligent Systems Design and Applications: 2011. p. 337–43. https://doi.org/10.1109/ISDA.2011.6121678.

  31. Borges HB, Nievola JC. Multi-label hierarchical classification using a competitive neural network for protein function prediction. In: The 2012 International Joint Conference on Neural Networks (IJCNN): 2012. p. 1–8. https://doi.org/10.1109/ijcnn.2012.6252736.

  32. Cerri R, de Carvalho ACPLF. New top-down methods using SVMs for hierarchical multilabel classification problems. In: The 2010 International Joint Conference on Neural Networks (IJCNN): 2010. p. 1–8. https://doi.org/10.1109/IJCNN.2010.5596597.

  33. Silla CN Jr, Freitas AA. A survey of hierarchical classification across different application domains. Data Min Knowl Discov. 2010; 22(1-2):31–72. https://doi.org/10.1007/s10618-010-0175-9.

  34. Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2017; 34(5):760–9. https://doi.org/10.1093/bioinformatics/btx680.

  35. Zou Z, Tian S, Gao X, Li Y. mlDEEPre: Multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet. 2019; 9:714. https://doi.org/10.3389/fgene.2018.00714.

  36. Kulmanov M, Khan MA, Hoehndorf R. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2017; 34(4):660–8. https://doi.org/10.1093/bioinformatics/btx624.

  37. Zeng C, Li T, Shwartz L, Grabarnik GY. Hierarchical multi-label classification over ticket data using contextual loss. In: 2014 IEEE Network Operations and Management Symposium (NOMS): 2014. p. 1–8. https://doi.org/10.1109/NOMS.2014.6838267.

  38. Zeng C, Zhou W, Li T, Shwartz L, Grabarnik GY. Knowledge guided hierarchical multi-label classification over ticket data. IEEE Trans Netw Serv Manag. 2017; 14(2):246–60.

  39. Ren Z, Peetz M-H, Liang S, van Dolen W, de Rijke M. Hierarchical multi-label classification of social text streams. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ’14. New York: ACM: 2014. p. 213–22. https://doi.org/10.1145/2600428.2609595.

  40. Wang X, Zhao H, Lu B-l. Enhanced k-nearest neighbour algorithm for large-scale hierarchical multi-label classification. In: Proceedings of the Joint ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification. Springer: 2011. p. 58–67. http://lshtc.iit.demokritos.gr/LSHC2Proceedings.pdf.

  41. Zhang Z, Zhang J, Liu Y, Wang Z, Deng L. Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification. Bioinformatics. 2017; 34(10):1750–7. https://doi.org/10.1093/bioinformatics/btx833.

  42. Papanikolaou Y, Tsoumakas G, Katakis I. Hierarchical partitioning of the output space in multi-label data. Data Knowl Eng. 2018; 116:42–60. https://doi.org/10.1016/j.datak.2018.05.003.

  43. Clare A. Machine learning and data mining for yeast functional genomics. PhD thesis, The University of Wales. 2003.

  44. Goffeau A, Barrell BG, Bussey H, Davis R, Dujon B, Feldmann H, Galibert F, Hoheisel J, Jacq C, Johnston M, et al. Life with 6000 genes. Science. 1996; 274(5287):546–67.

  45. Ouali M, King RD. Cascaded multiple classifiers for secondary structure prediction. Protein Sci. 2000; 9(6):1162–76.

  46. Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods. 2019. https://doi.org/10.1016/j.ymeth.2019.04.008.

  47. Li Y, Li Z, Ding L, Yang P, Hu Y, Chen W, Gao X. Supportnet: solving catastrophic forgetting in class incremental learning with support data. 2018. arXiv preprint arXiv:1806.02942.

  48. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019; 113:54–71. https://doi.org/10.1016/j.neunet.2019.01.012.

  49. Pliakos K, Vens C. Mining features for biomedical data using clustering tree ensembles. J Biomed Inform. 2018; 85:40–48. https://doi.org/10.1016/j.jbi.2018.07.012.

Acknowledgements

We would like to thank Ricardo Cerri for his valuable assistance with the HMC-LMLP and HMC-GA methods, and Luca Masera for providing the source code and original parameters of the AWX method.

Funding

This study was funded by the Research Fund Flanders (FWO). FWO played no role in the design of the study, in the collection, analysis, and interpretation of data, or in the writing of the manuscript.

Author information

Contributions

FKN performed the literature review and experimental analysis, ML built the updated version of the datasets, and CV supervised the work. FKN and CV drafted the manuscript. All the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Felipe Kenji Nakano.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Nakano, F.K., Lietaert, M. & Vens, C. Machine learning for discovering missing or wrong protein function annotations. BMC Bioinformatics 20, 485 (2019). https://doi.org/10.1186/s12859-019-3060-6

Keywords