PoGO: Prediction of Gene Ontology terms for fungal proteins
© Jung et al. 2010
Received: 20 January 2010
Accepted: 29 April 2010
Published: 29 April 2010
Automated protein function prediction methods are the only practical approach for assigning functions to genes obtained from model organisms. Many of the previously reported function annotation methods are of limited utility for fungal protein annotation. They are often trained on only one species, are not available for high-volume data processing, or require data derived from experiments such as microarray analysis. To meet the increasing need for high-throughput, automated annotation of fungal genomes, we have developed a tool for annotating fungal protein sequences with terms from the Gene Ontology.
We describe a classifier called PoGO (Prediction of Gene Ontology terms) that uses statistical pattern recognition methods to assign Gene Ontology (GO) terms to proteins from filamentous fungi. PoGO is organized as a meta-classifier in which each evidence source (sequence similarity, protein domains, protein structure and biochemical properties) is used to train independent base-level classifiers. The outputs of the base classifiers are used to train a meta-classifier, which provides the final assignment of GO terms. An independent classifier is trained for each GO term, making the system amenable to updating, without having to re-train the whole system. The resulting system is robust. It provides better accuracy and can assign GO terms to a higher percentage of unannotated protein sequences than other methods that we tested.
Our annotation system overcomes many of the shortcomings that we found in other methods. We also provide a web server where users can submit protein sequences to be annotated.
Our ability to obtain genome sequences is quickly outpacing our ability to annotate genes and identify gene functions. The automated annotation of gene models in newly sequenced genomes is greatly facilitated by easy-to-use software pipelines. Although functional annotation of proteins is as important as the gene models, it is usually limited to automatic processing with sequence similarity searching tools such as BLAST, followed by extensive manual curation by dedicated database curators and community annotation jamborees. There is increasing demand to quickly annotate newly sequenced genomes so that they can be used for designing microarray analyses, proteomics, comparative genomics, and other experiments. It is clear that manual gene function annotation cannot be scaled to meet the influx of newly sequenced model organisms.
Two of the main considerations for developing a protein function classifier are the sources of evidence, or features, used for assigning functional categories to the proteins and the method used for defining the relationships between the features and the categories. A number of automated methods have been developed in recent years that vary both in the sources of features used as evidence for the classifier and in the classification algorithm used to assign protein functions. The most common feature types rely on sequence similarity, often using BLAST to identify similar proteins from a large database. Classifiers may also utilize protein family databases such as PFAM and InterPro, biochemical attributes such as molecular weight and percentage amino acid content, phylogenetic profiles [4, 5], and transcription factor binding sites as sources of features. Several classifiers have also been developed that utilize features from laboratory experiments such as gene expression profiles [7, 8] and protein-protein interactions [8, 9].
A classifier is used to determine the relationships between protein features and their potential functions. The simplest examples are manually constructed mapping tables which represent the curator's knowledge of the relationships between features and functional categories. One example is InterPro2GO, a manually curated mapping of InterPro terms to GO terms. The GOA project relies on several such mappings to provide automated GO annotations for users [10, 11]. More sophisticated classifiers have been developed using machine learning algorithms, which can automatically deduce the relationships between the features and the functional categories based on a set of previously annotated proteins that serve as training examples. The use of machine learning algorithms can improve the accuracy of the annotations by discovering relationships between features and the functional categories that were not discovered by human curators [12–15]. In addition, they can be applied rapidly and consistently to large datasets, saving many hours of human curation.
Draft genome sequences and machine generated gene models are becoming available for an increasing number of fungal species but machine annotations of protein functions are still limited. We found that pre-existing functional classifiers are either not amenable to genome-scale analyses or are trained only for a small range of taxa [6, 16, 17]. In addition, many of the previously published protein function annotation systems utilize features obtained from laboratory experiments such as transcriptional profiling or protein-protein interaction assays. These methods cannot be applied to most fungal proteomes since experimental data are not available.
Classifiers that employ features derived from multiple, heterogeneous data sources usually transform each data type into a common format regardless of the properties of each data source. For example, an e-value threshold might be applied to BLAST search results instead of utilizing the e-value directly. The problems with this approach are that the data from each data source are weighted differently and that distinctions between data points can be lost during transformation. Meta-learning classifiers (meta-classifiers) overcome this problem by training independent classifiers (called base-classifiers) for each heterogeneous data source and then using the decisions from the base-classifiers as features to train a meta-classifier. In this way, the weight of each data source can be learned by the meta-classifier. Meta-learning classifiers are useful for combining multiple weak learning algorithms and for combining heterogeneous data sources.
We developed a functional classification system called PoGO (Prediction of Gene Ontology terms) that enables us to assign GO terms to fungal proteins in a high-throughput fashion without requiring evidence from laboratory experiments. We incorporated several evidence sources that emulate what a human curator would use during manual protein function annotation, and we avoided features that require laboratory experiments or that otherwise could not be applied to newly sequenced genomes. During the course of our experiments, we discovered that taxon-specific classifiers outperform classifiers that are trained with larger datasets from a wide array of taxa. We developed a taxon-specific classifier for Fungi and we are currently using it to assign GO terms to the proteomes of more than 30 filamentous fungi. We provide a web application that enables users to annotate proteins through their web browser.
Protein functional annotation is a multi-label classification problem in which each protein may be assigned one or more GO terms. Various approaches may be applied to solving multi-label classification problems, but the simplest method, and the one we employed, is to consider each GO term as an independent classification problem. Thus, PoGO is actually a team of independent classifiers. The disadvantage of this approach is that interdependencies among the labels cannot be incorporated into the classifier, thus potentially increasing error. The advantage, however, is that a wide range of binary classification algorithms may be applied to the problem. In addition, this approach is more flexible in that different algorithms and/or datasets may be used to train the classifier for each GO term, enabling us to optimize the classification of individual GO terms or groups of GO terms if necessary. Individual GO terms or groups of GO terms may also be re-trained as new training data or evidence sources become available. PoGO consists of four base classifiers, each of which utilizes a distinct data source, and a meta-classifier for combining the outputs from the base classifiers into a final classification.
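The decomposition into independent per-term problems can be sketched as follows. This is a minimal illustration rather than the PoGO implementation: the function name `binary_problems` and the toy protein identifiers are assumptions introduced here.

```python
from typing import Dict, Set


def binary_problems(annotations: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Turn multi-label GO annotations into one binary labelling per GO term.

    `annotations` maps a protein ID to the set of GO terms assigned to it.
    The result maps each GO term to {protein ID: 1 or 0}, where 1 marks the
    positive class for that term and 0 the negative class.
    """
    all_terms: Set[str] = set()
    for terms in annotations.values():
        all_terms |= terms
    return {
        term: {protein: int(term in terms) for protein, terms in annotations.items()}
        for term in sorted(all_terms)
    }


# Toy data: the protein IDs and their annotations are made up for illustration.
toy = {
    "protA": {"GO:0003677", "GO:0005634"},
    "protB": {"GO:0005634"},
    "protC": set(),
}
problems = binary_problems(toy)
print(problems["GO:0005634"])  # {'protA': 1, 'protB': 1, 'protC': 0}
```

Each entry of `problems` can then be handed to any binary learner, which is what allows a single GO term to be re-trained without touching the others.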
InterPro terms are defined in the InterPro database, a curated protein domain database that acts as a central reference for several protein family and functional domain databases, including Prosite, Prints, Pfam, Prodom, SMART, TIGRFams and PIR SuperFamily. We have previously shown that InterPro is an important source of features for identifying GO terms for proteins [19, 20]. Using the InterProScan application, we can obtain InterPro terms for unannotated proteins; with it, we identified 3339 InterPro terms in the fungal protein dataset. We employed the Support Vector Machine (SVM) algorithm as the classifier, as described previously. For each GO term, we construct a dataset that is comprised of all proteins that are annotated with the GO term (the positive class) and all proteins that are not annotated with the GO term (the negative class). Since the data sets are highly imbalanced, we perform undersampling of the negative class as described previously. This step removes members of the negative class so that the negative class in the training dataset has the same number of proteins as the positive class. We also apply Chi-square feature selection to remove features that do not contribute to the accuracy of the classifier. Previous results have shown that these steps improve classification accuracy and reduce learning time.
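The two preprocessing steps described above, undersampling of the negative class and Chi-square feature scoring, can be sketched as follows. This is an illustration under assumptions: random selection of the negatives to keep, the function names, and the 2x2 form of the statistic for binary features are choices made here, not details taken from the PoGO code.

```python
import random
from typing import List, Sequence, Tuple


def undersample(positives: Sequence, negatives: Sequence, seed: int = 0) -> Tuple[List, List]:
    """Randomly drop negatives so that they do not outnumber the positives."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return list(positives), rng.sample(list(negatives), k)


def chi_square(feature: Sequence[int], label: Sequence[int]) -> float:
    """Chi-square statistic of the 2x2 table for a binary feature and a binary label."""
    n = len(feature)
    a = sum(1 for f, y in zip(feature, label) if f == 1 and y == 1)  # present, positive
    b = sum(1 for f, y in zip(feature, label) if f == 1 and y == 0)  # present, negative
    c = sum(1 for f, y in zip(feature, label) if f == 0 and y == 1)  # absent, positive
    d = n - a - b - c                                                # absent, negative
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant feature or a single class carries no signal
    return n * (a * d - b * c) ** 2 / denom


# Toy example: 3 positives, 20 negatives, reduced to a balanced set.
pos, neg = undersample(["p1", "p2", "p3"], [f"n{i}" for i in range(20)])
print(len(pos), len(neg))                         # 3 3
print(chi_square([1, 1, 0, 0], [1, 1, 0, 0]))     # 4.0
```

Features would then be ranked by this score and the low-scoring ones discarded; the cut-off used for that is not stated in the text and is left open here.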
Several previous methods including GoFigure, GOblet, and OntoBlast use sequence similarity based on BLAST results as features. GOAnno is also an extension of the similarity-based annotation using hierarchically structured functional categories. In this approach, the annotations assigned to the BLAST hits are used as features. We employ a similar approach. The feature set is comprised of hits obtained in a BLAST search of a database of GO-annotated proteins (excluding machine-annotated terms) using an expect threshold (E-value) of 1e-10. The BLAST database was constructed by extracting proteins from the taxonomic group Fungi from the UniProt database. The resulting feature set contains 3182 features. We employ undersampling and feature selection as described previously, and also utilize the SVM algorithm to train the classifier.
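One way to turn BLAST hits into a binary feature vector is sketched below. The input format (tuples of query ID, subject ID and E-value), the function name `hit_features`, and the choice of one feature per database protein are assumptions made for illustration; the default threshold of 1e-10 is our reading of the E-value cut-off given above.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query ID, subject ID, E-value); hypothetical format


def hit_features(hits: Iterable[Hit], max_evalue: float = 1e-10) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build one binary vector per query: 1 if the subject was hit below the threshold."""
    kept = [(q, s) for q, s, e in hits if e <= max_evalue]
    vocabulary = sorted({s for _, s in kept})          # one feature per retained subject
    index = {s: i for i, s in enumerate(vocabulary)}
    vectors: Dict[str, List[int]] = {}
    for q, s in kept:
        vectors.setdefault(q, [0] * len(vocabulary))[index[s]] = 1
    return vocabulary, vectors  # queries without any retained hit are absent


# Toy hits: identifiers and E-values are made up.
toy_hits = [
    ("q1", "sA", 1e-50),
    ("q1", "sB", 1e-3),   # discarded: E-value above the threshold
    ("q2", "sB", 1e-20),
    ("q2", "sA", 1e-12),
]
vocab, vecs = hit_features(toy_hits)
print(vocab)       # ['sA', 'sB']
print(vecs["q1"])  # [1, 0]
print(vecs["q2"])  # [1, 1]
```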
Other authors have previously shown that biochemical properties of proteins, such as charge and amino acid composition, are useful for functional classification. The properties that we use in this study include amino acid content, molecular weight, proportion of negatively and positively charged residues, hydrophobicity, isoelectric point and amino acid pair ratio. We use the pepstats and compseq programs in EMBOSS to compute the biochemical properties based on the amino acid sequence of each protein. Unlike the previous datasets, the biochemical properties are not sparse, and are numeric rather than binary. We compared various forms of undersampling and learning methods and found that with this dataset, the AdaBoost method using a linear classifier and an unbalanced dataset provided the best accuracy (Table S1, Additional File 1).
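A few of these sequence-derived properties can be computed directly from the amino acid string, as in the sketch below. It is not a re-implementation of the EMBOSS programs: the function names are hypothetical, and counting only K and R as positively charged and only D and E as negatively charged is an assumption made here.

```python
from collections import Counter
from typing import Dict, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> Dict[str, float]:
    """Fraction of the sequence made up by each of the 20 standard amino acids."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def charged_fractions(seq: str) -> Tuple[float, float]:
    """(negative, positive) residue fractions; D/E and K/R respectively (assumed sets)."""
    if not seq:
        raise ValueError("empty sequence")
    negative = sum(seq.count(aa) for aa in "DE") / len(seq)
    positive = sum(seq.count(aa) for aa in "KR") / len(seq)
    return negative, positive


def pair_frequencies(seq: str) -> Dict[str, float]:
    """Relative frequency of each overlapping amino acid pair in the sequence."""
    if len(seq) < 2:
        raise ValueError("need at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return {pair: count / total for pair, count in pairs.items()}


print(composition("MKDE")["K"])      # 0.25
print(charged_fractions("MKDE"))     # (0.5, 0.25)
print(pair_frequencies("AAA"))       # {'AA': 1.0}
```

Because these values are dense and numeric, they form a different kind of input from the sparse binary vectors used by the other base classifiers.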
The fourth feature set is protein structure as computed using the HHpred program, which is used for protein homology detection and structure prediction using the SCOP database. We use the top 10 selected templates as features and the remaining undefined templates are set to zero for each protein. This dataset is comprised of 8494 binary features. As described previously, we use the SVM algorithm combined with dataset undersampling to create the classifier.
Table 1. Performance comparison of the base and meta classifiers.
Performance evaluation of protein function classifiers is a complex issue, since no standard methods exist. Authors typically utilize standard machine-learning metrics such as sensitivity and specificity, but the manner in which they are applied varies, which prevents us from directly comparing classifiers on the basis of the performance values reported in publications. Most authors compute performance statistics by first calculating the statistics for each protein individually, and then calculating an average over all of the tested proteins. In a traditional machine-learning approach, performance statistics are calculated for each functional category. The issue is further confounded since the test proteins may be annotated with functional categories that were not within the repertoire of categories that can be predicted by the classifier, and these categories may or may not be treated as classification errors during performance evaluation. Furthermore, the set of proteins used for evaluation is not consistent among all of the protein function classifiers. Authors usually employ cross-validation in order to evaluate their classifier using hold-out sets from the training data set. Standard protein data sets have been developed for annual competitions, such as the one that is often held in conjunction with the Annual International Conference on Intelligent Systems for Molecular Biology (ISMB), but these data sets are not useful for taxon-specific classifiers such as PoGO.
Since we train an individual classifier for each functional category, we evaluated the performance of each classifier individually and then computed an arithmetic mean of all the classifiers. All performance metrics were computed using 10-fold cross validation. When we compare the performance of PoGO to other GO term classifiers (AAPFC, MultiPfam2GO, Gotcha, GOPet, and InterPro2GO), we use a hold-out set of 71 proteins (1% of our original dataset) that were not used in the training. We calculate the performance of each protein individually, and then report the overall average values. The various classifiers that we tested were all trained using different data sources and were developed to annotate different sets of GO terms. As performance metrics, we use sensitivity, specificity and F-measure. Sensitivity is the proportion of GO term annotations in the training dataset that were correctly annotated by the classifier, and specificity is the proportion of available GO term annotations not assigned to the proteins that were also not assigned to the protein by the classifier. We also use F-measure to provide a single overall metric for comparing the performance of various classifiers.
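The per-classifier metrics and their arithmetic mean can be sketched as below. Sensitivity and specificity follow the definitions given above; the text does not spell out the F-measure formula, so the conventional harmonic mean of precision and recall is assumed here. The function names and confusion counts are illustrative only.

```python
from typing import Dict, Sequence, Tuple

Counts = Tuple[int, int, int, int]  # (TP, FP, TN, FN) for one GO term classifier


def sensitivity(c: Counts) -> float:
    tp, _, _, fn = c
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(c: Counts) -> float:
    _, fp, tn, _ = c
    return tn / (tn + fp) if tn + fp else 0.0


def f_measure(c: Counts) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    tp, fp, _, fn = c
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_over_classifiers(all_counts: Sequence[Counts]) -> Dict[str, float]:
    """Arithmetic mean of each metric over the individual classifiers."""
    n = len(all_counts)
    return {
        "sensitivity": sum(sensitivity(c) for c in all_counts) / n,
        "specificity": sum(specificity(c) for c in all_counts) / n,
        "f_measure": sum(f_measure(c) for c in all_counts) / n,
    }


# Made-up confusion counts for two GO term classifiers.
toy_counts = [(8, 2, 88, 2), (1, 3, 95, 1)]
summary = mean_over_classifiers(toy_counts)
print(summary)
```

Averaging per protein instead of per classifier would use the same three functions, with the counts tallied over the GO terms of each protein rather than over the proteins of each GO term.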
Considering only the base classifiers, the InterPro term classifier had the highest average F-measure value (Table 1). Surprisingly, the BLAST classifier had a low F-measure as well as the lowest specificity, indicating that it results in a large number of false-positive annotations. The F-measure values of the Naïve Bayes meta-classifiers are higher than those of the base classifiers (Table 1). The highest F-measure for a meta-classifier (0.3335) was obtained from the Naïve Bayes meta-classifier using the unbalanced dataset. Thus, the meta-classifier trained with the Naïve Bayes algorithm using the unbalanced dataset provides superior results over any single classifier and was used for further experiments.
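A Naïve Bayes meta-classifier over the base classifier outputs can be sketched as follows. The text does not say which Naïve Bayes variant was used or whether the base outputs were binary decisions or scores; a Bernoulli model over binary decisions with Laplace smoothing is assumed here, and the class name and toy votes are hypothetical.

```python
import math
from typing import Dict, List, Sequence


class BernoulliNaiveBayes:
    """Minimal Bernoulli Naive Bayes for binary features and a binary label."""

    def fit(self, X: Sequence[Sequence[int]], y: Sequence[int]) -> "BernoulliNaiveBayes":
        n_features = len(X[0])
        self.log_prior: Dict[int, float] = {}
        self.log_on: Dict[int, List[float]] = {}   # log P(feature = 1 | class)
        self.log_off: Dict[int, List[float]] = {}  # log P(feature = 0 | class)
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            self.log_prior[c] = math.log((len(rows) + 1) / (len(X) + 2))
            p = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(n_features)]
            self.log_on[c] = [math.log(v) for v in p]
            self.log_off[c] = [math.log(1 - v) for v in p]
        return self

    def predict(self, x: Sequence[int]) -> int:
        score = {}
        for c in (0, 1):
            score[c] = self.log_prior[c] + sum(
                on if v else off for v, on, off in zip(x, self.log_on[c], self.log_off[c])
            )
        return int(score[1] > score[0])


# Each row holds the decisions of four base classifiers for one protein,
# in the assumed order (InterPro, BLAST, biochemical, structure). Toy data.
votes = [[1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
truth = [1, 1, 1, 0, 0, 0]
meta = BernoulliNaiveBayes().fit(votes, truth)
print(meta.predict([1, 1, 0, 1]))  # 1
print(meta.predict([0, 1, 0, 0]))  # 0
```

In this toy data the second column fires equally often in both classes, so the fitted model gives it no influence on the decision, which is the sense in which a meta-classifier can learn how much to trust each evidence source.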
Table 2. Overall classifier performance after excluding poorly performing GO term classifiers (columns include the percentage of proteins annotated).
Table 3. Performance comparison of GO annotation methods using 71 randomly selected proteins (columns include the percentage of proteins annotated).
Table 4. Summary statistics and performance evaluation of the taxon-specific classifiers.
We compared the performance of classifiers trained with each of these data sets using 10-fold cross validation as described previously. Because the training and cross-validation time for all the datasets is prohibitively long, we performed these experiments using the InterPro base classifier. We also measured the performance of the InterPro2GO mapping by using it to assign GO terms to proteins and computing sensitivity, specificity, and F-measure. All of the PoGO classifiers outperformed the InterPro2GO mapping (data not shown). Adding non-fungal proteins to the fungal-specific GO terms (the Fungi-expanded dataset) yielded more than three times as many GO terms as the Fungi dataset, with only a slight reduction in performance (Table 4). In all cases, the classifiers trained with taxon-specific datasets performed better than the classifier trained with the UniProt dataset.
Since the Fungi dataset is significantly smaller than the UniProt dataset, we reasoned that the improved performance could be due to overfitting. Overfitting can occur when the training dataset is very small, or the number of features in the model is very large. An overfitted classifier can perform well when the instances (proteins in our case) are similar to the ones in the training data, but perform poorly when presented with instances that are not similar to the training data. To determine whether the improved performance is due to overfitting, we trained PoGO classifiers with 10 smaller datasets containing the same number of proteins as the Fungi dataset. These subsamples were prepared by randomly selecting proteins from the UniProt dataset. If the classifier for the Fungi dataset were overfitted, we would expect it to perform better than the classifiers derived from the subsampled datasets. The average F-measure of the subsampled classifiers is 0.4135 while the F-measure for the Fungi classifier is 0.3101, indicating that the Fungi classifier is not subject to overfitting.
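The size-matched control datasets can be drawn as in the sketch below. The function name, the fixed seed, and sampling without replacement are assumptions; the text states only that the proteins were selected randomly from the UniProt dataset.

```python
import random
from typing import Iterable, List


def size_matched_subsamples(pool: Iterable, size: int, n_subsamples: int = 10, seed: int = 0) -> List[list]:
    """Draw `n_subsamples` random subsets of `pool`, each with exactly `size` members."""
    rng = random.Random(seed)
    items = list(pool)
    # Each subset is sampled without replacement; subsets may overlap each other.
    return [rng.sample(items, size) for _ in range(n_subsamples)]


# Toy stand-ins: 100 "UniProt" protein IDs and a "Fungi" dataset of 7 proteins.
uniprot_ids = [f"P{i:03d}" for i in range(100)]
subsamples = size_matched_subsamples(uniprot_ids, size=7)
print(len(subsamples), len(subsamples[0]))  # 10 7
```

Each subsample would then be trained and cross-validated exactly like the taxon-specific dataset, and the mean F-measure over the ten runs compared with the taxon-specific value.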
An important point to consider is that each GO term is trained independently, and the number of proteins that are used to train each GO term classifier varies considerably, depending on the availability of GO-annotated proteins in UniProt. If small data sets lead to overfitting, then we would expect to see a correlation between data set size and classifier performance, where classifiers trained with a smaller number of proteins would tend to have higher performance. We compared performance against dataset size for GO terms that were present in the taxon-specific and the randomly undersampled datasets but found no correlation in any dataset (Table S2, Additional File 1). The taxon-specific datasets may be considered to be a form of data partitioning that removes distantly related proteins from the data. This, in effect, would cause GO terms that are never found within a taxonomic group to be removed from the classifier, so that the error of those individual GO term classifiers does not contribute to the overall error of the system.
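A check for a relationship between training set size and per-term performance could look like the sketch below. The text does not name the correlation measure that was used; the Pearson coefficient is assumed here, and the sizes and scores are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values: number of training proteins and F-measure for four GO terms.
sizes = [10, 20, 30, 40]
scores = [0.4, 0.3, 0.2, 0.1]
r = pearson(sizes, scores)
print(round(r, 6))  # -1.0
```

A strongly negative coefficient, as in this artificial example, is the pattern that overfitting on small datasets would be expected to produce; a coefficient near zero corresponds to the absence of correlation reported above.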
In this paper, we describe a meta-classifier for assigning GO terms to proteins using heterogeneous feature sets. By using multiple, heterogeneous data sources, we developed a classifier that offers improved accuracy compared to previously reported annotation methods. More importantly, this system can assign GO terms to a higher proportion of proteins than annotation methods that rely on only one feature type. We also found that taxon-specific classifiers have significantly better performance than species-independent systems. Functional annotations provided by PoGO are being used for fungal comparative genomics projects in our research group. We also provide a public web server where users can annotate protein sequences.
Project Name: PoGO: Prediction of gene ontology terms
Project Home page: http://bioinformatica.vil.usal.es/lab_resources/pogo/
Operating system: OS independent. Use any modern web browser
Programming language: MATLAB, php
Other requirements: None
Restrictions for non-academic use: None
SVM: Support Vector Machines
This research was supported by funds from the United States Department of Agriculture (grant number 2007-35600-17829) and the Ministerio de Ciencia e Innovación of Spain (grant number AGL2008-03177/AGR). Financial support from the Ramón y Cajal Program from the Spanish Ministerio de Ciencia e Innovación is also acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.