PoGO: Prediction of Gene Ontology terms for fungal proteins
© Jung et al; licensee BioMed Central Ltd. 2010
Received: 20 January 2010
Accepted: 29 April 2010
Published: 29 April 2010
Automated protein function prediction methods are the only practical approach for assigning functions to genes obtained from model organisms. Many of the previously reported function annotation methods are of limited utility for fungal protein annotation. They are often trained only to one species, are not available for high-volume data processing, or require the use of data derived by experiments such as microarray analysis. To meet the increasing need for high throughput, automated annotation of fungal genomes, we have developed a tool for annotating fungal protein sequences with terms from the Gene Ontology.
We describe a classifier called PoGO (Prediction of Gene Ontology terms) that uses statistical pattern recognition methods to assign Gene Ontology (GO) terms to proteins from filamentous fungi. PoGO is organized as a meta-classifier in which each evidence source (sequence similarity, protein domains, protein structure and biochemical properties) is used to train independent base-level classifiers. The outputs of the base classifiers are used to train a meta-classifier, which provides the final assignment of GO terms. An independent classifier is trained for each GO term, making the system amenable to updating, without having to re-train the whole system. The resulting system is robust. It provides better accuracy and can assign GO terms to a higher percentage of unannotated protein sequences than other methods that we tested.
Our annotation system overcomes many of the shortcomings that we found in other methods. We also provide a web server where users can submit protein sequences to be annotated.
Our ability to obtain genome sequences is quickly outpacing our ability to annotate genes and identify gene functions. The automated annotation of gene models in newly sequenced genomes is greatly facilitated by easy-to-use software pipelines . While equally as important as gene the gene models, functional annotation of proteins is usually limited to an automatic processing with sequence similarity searching tools such as BLAST, followed by extensive manual curation by dedicated database curators and community annotation jamborees. There is increasing demand to quickly annotate newly sequenced genomes so that they can be used for designing microarray analyses, proteomics, comparative genomics, and other experiments. It is clear that manual gene function annotation cannot be scaled to meet the influx of newly sequenced model organisms.
Two of the main considerations for developing a protein function classifier are the sources of evidence, or features, used for assigning functional categories to the proteins and the method used for defining the relationships between the features and the categories. A number of automated methods have been developed in recent years that vary both in the sources of features used as evidence for the classifier and in the classification algorithm used to assign protein functions. The most common feature types rely on sequence similarity, often using BLAST  to identify similar proteins from a large database. Classifiers may also utilize protein family databases such as PFAM and InterPro, bio-chemical attributes such as molecular weight and percentage amino acid content , phylogenetic profiles [4, 5] transcription factor binding sites  as sources of features. Several classifiers have also been developed that utilize features from laboratory experiments such as gene expression profiles [7, 8], and protein-protein interactions [8, 9].
A classifier is used to determine the relationships between protein features and their potential functions. The simplest examples are manually constructed mapping tables which represent the curator's knowledge of the relationships between features and functional categories. One example is Interpro2GO, a manually curated mapping, of InterPro terms to GO terms. The GOA project relies on several such mappings to provide automated GO annotations for users [10, 11]. More sophisticated classifiers have been developed using machine learning algorithms which can automatically deduce the relationships between the features and the functional categories based on a set of previously annotated proteins that serve as training examples. The use of machine learning algorithms can improve the accuracy of the annotations by discovering relationships between features and the functional categories that were not discovered by human curators [12–15]. In addition, they can be applied rapidly and consistently to large datasets, saving many hours of human curation.
Draft genome sequences and machine generated gene models are becoming available for an increasing number of fungal species but machine annotations of protein functions are still limited. We found that pre-existing functional classifiers are either not amenable to genome-scale analyses or are trained only for a small range of taxa [6, 16, 17]. In addition, many of the previously published protein function annotation systems utilize features obtained from laboratory experiments such as transcriptional profiling or protein-protein interaction assays. These methods cannot be applied to most fungal proteomes since experimental data are not available.
Classifiers that employ features derived from multiple, heterogeneous data sources usually transform the format of each data type into a common format regardless of the properties of each data source . For example, an e-value threshold might be applied to BLAST search results instead of utilizing the e-value directly. The problems with this approach are that the data from each data source is weighted differently and that distinctions between data points can be lost during transformation. Meta-learning classifiers (meta-classifiers) overcome this problem by training independent classifiers (called base-classifiers) for each heterogeneous data source and then use the decisions from the base-classifiers as features to train a meta-classifier. In this way, weights of each data source can be learned by the meta-classifier. Meta-learning classifiers are useful for combining multiple weak learning algorithms and for combining heterogeneous data sources.
We developed a functional classification system called PoGO (Prediction of Gene Ontology terms) that enables us to assign GO terms to fungal proteins in a high-throughput fashion without the requiring evidence from laboratory experiments. We incorporated several evidence sources that emulate what a human curator would use during manual protein function annotation and we avoided features that require laboratory experiments or that otherwise could not be applied to newly sequenced genomes. During the course of our experiments, we discovered that taxon-specific classifiers outperform classifiers that are trained with larger datasets from a wide array of taxa. We developed a taxon-specific classifier for Fungi and we are currently using it to assign GO terms to the proteomes of more than 30 filamentous fungi. We provide as web application that enables users to annotate proteins through their web browser.
Implementation and Results
Overview of the PoGO classifier
Protein functional annotation is a multi-label classification problem in which each protein may be assigned one or more GO terms. Various approaches may be applied to solving multi-label classification problems, but the simplest method, and the one we employed, is to consider each GO term as an independent classification problem. Thus, PoGO is actually a team of independent classifiers. The disadvantage to this approach is that interdependencies among the labels cannot be incorporated into the classifier, thus potentially increasing error. The advantage, however, is that a wide range of binary classification algorithms may be applied to the problem. In addition, this approach is more flexible in that different algorithms and/or datasets may be used to train the classifier for each GO term, enabling us to optimize the classification of individual GO terms or groups of GO terms if necessary. Individual GO terms or groups of GO terms may also be re-trained, as new training data of evidence sources become available. PoGO consists of four base classifiers, each of which utilizes distinct data sources, and a meta-classifier for combining the outputs from the base classifiers into a final classification.
Design and evaluation of the base classifiers
InterPro terms  are defined in the InterPro database which is a curated protein domain database that acts as a central reference for several protein family and functional domain databases, including Prosite, Prints, Pfam, Prodom, SMART, TIGRFams and PIR SuperFamily. We have previously showed that InterPro is an important source of features for identifying GO terms for proteins [19, 20]. Using the InterProScan application , we can obtain InterPro terms for unannotated proteins. Using InterProScan, we identified 3339 InterPro terms in the fungal protein dataset. We employed Support Vector Machines (SVM) algorithm as the classifier as described previously . For each GO term, we construct a dataset that is comprised of all proteins that are annotated with the GO term (the positive class) and all proteins that are not annotated with the GO term (the negative class). Since the data sets are highly imbalanced, we perform undersampling of the negative class as described previously . This step removes members of the negative class so that the training dataset will have the same number of proteins as the positive class. We also apply Chi-square feature selection to remove features from the classifier that do not contribute to the accuracy of the classifier. Previous results have shown that these steps improve classification accuracy and reduce learning time .
Several previous methods including GoFigure , GOblet , and OntoBlast  use sequence similarity based on BLAST  results as features. GOAnno  is also an extension of the similarity-based annotation using hierarchically structured functional categories. In this example, the annotations assigned to the BLAST hits are used as features. We employ a similar approach. The feature set is comprised of hit obtained in a BLAST search of a database of GO annotated proteins (excluding machine annotated terms) using an expect threshold (E-value) of 1-10. The BLAST database was constructed by extracting proteins from the taxonomic group Fungi from the UniProt database. The resulting feature set contains 3182 features. We employ undersampling and feature selection as described previously, and also utilize the SVM algorithm to train the classifier.
Other authors have previously shown that biochemical properties of proteins, such as a protein's charge, amino acid composition, are useful for functional classification . The properties that we use in this study include amino acid content, molecular weight, proportion of negatively and positively charged residues, hydrophobicity, isoelectric point and amino acid pair ratio. We use the pepstat and compseq programs in EMBOSS  to compute the biochemical properties based on the amino acid sequence of each protein. Unlike the previous datasets, the biochemical properties are not sparse, and are numeric rather than binary. We compared various forms of undersampling and learning methods and found that with this dataset, the Adaboosting method using a linear classifier  and using an unbalanced dataset provided the best accuracy (Table S1, Additional File 1).
Protein Tertiary Structure
The fourth feature set is protein structure as computed using the HHpred program  which is used for protein homology detection and structure prediction using the SCOP database . We use the top 10 selected templates as features and the remaining undefined templates are set to zero for each protein. This dataset is comprised of 8494 binary features. As described previously, we use the SVM algorithm combined with dataset undersampling to create the classifier.
Design and evaluation of the meta-classifier
Performance comparison of the base and meta classifiers.
Performance evaluation of protein function classifiers performance is a complex issue, since no standard methods exist . Authors typically utilize standard machine-learning metrics such as sensitivity and specificity but the manner in which they are applied vary, which prevents us from directly comparing the performance of classifiers by comparing the performance values reported in publications. Most authors compute performance statistics by first calculating the statistics for each protein individually, and then calculating an average over all of the tested proteins. In a traditional machine-learning approach, performance statistics are calculated for each functional category. The issue is further confounded since the test proteins may be annotated with functional categories that were not within the repertoire of categories that can be predicted by the classifier, and these categories may or may not be treated as classification errors during performance evaluation. Furthermore, the set of proteins used for evaluation is not consistent among all of the protein function classifiers. Authors usually employ cross-validation in order to evaluate their classifier using hold-out sets from the training data set. Standard protein data sets have been developed for annual competitions, such as the one that is often held in conjunction with the Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) but these data sets are not useful for taxon specific classifiers such as PoGO.
Since we train an individual classifier for each functional category, we evaluated the performance of each classifier individually and then computed an arithmetic mean of all the classifiers. All performance metrics were computed using 10-fold cross validation. When we compare the performance of PoGO to other GO term classifiers (AAPFC , MultiPfam2GO , Gotcha , GOPet , and InterPro2GO ), we use a hold-out set of 71 proteins (1% of our original dataset) that were not used in the training. We calculate the performance of each protein individually, and then report the overall average values. The various classifiers that we tested were all trained using different data sources and were developed to annotate different sets of GO terms. As performance metrics, we use sensitivity, specificity and F-measure. Sensitivity is the proportion of GO term annotations in the training dataset that were correctly annotated by the classifier, and specificity is the proportion of available GO term annotations not assigned to the proteins that were also not assigned to the protein by the classifier. We also use F-measure to provide a single overall metric for comparing the performance of various classifiers.
Considering only the base classifiers, the InterPro term classifier had the highest average F-measure value (Table 1). Surprisingly, the BLAST classifier had low F-measure as well as the lowest specificity, indicating that it results in a large number of false-positive annotations. The F-measure values of the Naïve Bayes meta-classifiers are higher than the base-classifiers (Table 1). The highest F-measure for a meta-classifier (0.3335) was obtained from the Naïve Bayes meta-classifier using the unbalanced dataset. Thus, the meta-classifier trained with the Naïve Bayes algorithm using the unbalanced dataset provides superior results over any single classifier and was used for further experiments.
Overall classifier performance after excluding poorly performing GO term classifiers.
Percentage of Proteins Annotated
Performance comparison of GO annotation methods using 71 randomly selected proteins.
Percentage of proteins annotated
Evaluation of taxon-specific classifiers
Summary statistics and performance evaluation of the taxon-specific classifiers.
We compared the performance of classifiers trained with each of these data sets using 10-fold cross validation as described previously. Because the training and cross-validation time for all the datasets is prohibitively long we performed these experiments using the InterPro base classifier. We also measured the performance of the Interpro2GO mapping by using it to assign GO terms to proteins and measuring performance with Sensitivity, Specificity, and F-measure. All of the PoGO classifiers outperformed the Interpro2GO mapping (data not shown). Adding additional non-fungal proteins to the fungal specific GO terms (the Fungi-expanded dataset) added more than three times the number of GO terms than the Fungi dataset with only a slight reduction in performance (Table 4). In all cases, the classifiers trained with taxon specific datasets performed better than the classifier trained with the UniProt dataset. Since the Fungi dataset is significantly smaller than the UniProt dataset, we reasoned that the improved performance could be due to overfitting. Overfitting can occur when the training dataset is very small, or the number of features in the model is very large. An overfitted classifier can perform well when the instances (proteins in our case) are similar to the ones in the training data, but perform poorly when presented with instances that are not similar to the training data. To determine whether the improved performance is due to over-fitting, we trained PoGO classifiers with 10 smaller datasets containing the same number of proteins as the Fungi dataset. These subsamples were prepared by randomly selecting proteins from the UniProt dataset. If the classifier for the Fungi dataset was overfitted, we would expect that it would have better performance than the classifiers derived from the sub-sampled datasets. The average F-measure of the subsampled classifiers is 0.4135 while the F-measure for the Fungi classifier is 0.3101, indicating that the Fungi classifier is not subject to overfitting.
An important point to consider is that each GO term is trained independently, and the number of proteins that are used to train each GO term classifier varies considerably, depending on the availability of GO annotated proteins in UniProt. If small data sets lead to overfitting, then we would expect to see a correlation between data set size and classifier performance , where classifiers trained with a smaller number of proteins would tend to have higher performance. We compared the performance vs. dataset size for GO terms that were present in the taxon specific and the randomly undersampled datasets but found no correlation in any dataset (Table S2, Additional File 1). The taxon specific datasets may be considered to be a form of data partitioning that removes distantly related proteins from the data. This, in effect, would cause GO terms that are never found within a taxonomic group to be removed from the classifier, thus the error of the individual GO term classifier does not contribute to the overall error of the system.
In this paper, we describe a meta-classifier for assigning GO terms to proteins using heterogeneous feature sets. By using multiple, heterogeneous data sources, we developed a classifier that offers improved accuracy compared to previously reported annotation methods. More importantly, this system can assign GO terms to a higher proportion of proteins than annotation methods that rely only on one feature type. We also found that taxon-specific classifiers have significantly better performance than species independent systems. Functional annotations provided by PoGO are being used for fungal comparative genomics projects in our research group. We also provide a public web server where users can annotate protein sequences.
Availability and requirements
Project Name: PoGO: Prediction of gene ontology terms
Project Home page: http://bioinformatica.vil.usal.es/lab_resources/pogo/
Operating system: OS independent. Use any modern web browser
Programming language: MATLAB, php
Other requirements: None
Restrictions for non-academic use: None
Support Vector Machines
This research was supported by funds from the United States Department of Agriculture (grant number 2007-35600-17829) and the Ministerio de Ciencia e Innovación of Spain (grant number AGL2008-03177/AGR). Financial support from the Ramón y Cajal Program from the Spanish Ministerio de Ciencia e Innovación is also acknowledged.
- Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M: MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 2008, 18(1):188–196. 10.1101/gr.6743907View ArticlePubMedPubMed Central
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389View ArticlePubMedPubMed Central
- King RD, Karwath A, Clare A, Dephaspe L: Genome scale prediction of protein functional class from sequence using data mining. Proc of the sixth ACM SIGKDD Inter Conf on Knowledge discovery and data mining 2003.
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285View ArticlePubMedPubMed Central
- Ranea JAG, Yeats C, Grant A, Orengo CA: Predicting Protein Function with Hierarchical Phylogenetic Profiles: The Gene3D Phylo-Tuner Method Applied to Eukaryotic Genomes. PLoS Comput Biol 2007, 3(11):e237. 10.1371/journal.pcbi.0030237View ArticlePubMedPubMed Central
- Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 2003, 100(14):8348–8353. 10.1073/pnas.0832373100View ArticlePubMedPubMed Central
- Pavlidis P, Weston J, Cai J, Noble WS: Learning gene functional classifications from multiple data types. J Comp Biol 2002, 9(2):401–411. 10.1089/10665270252935539View Article
- Nariai N, Kolaczyk ED, Simon K: Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data. PLoS ONE 2007, 2(3):e337. 10.1371/journal.pone.0000337View ArticlePubMedPubMed Central
- Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D: A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402(6757):83–86. 10.1038/47048View ArticlePubMed
- Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, Kersey P, Mulder N, Oinn T, Maslen J, Cox A, et al.: The Gene Ontology Annotation (GOA) Project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res 2003, 13: 662–672. 10.1101/gr.461403View ArticlePubMedPubMed Central
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, (32 Database):D262–266. 10.1093/nar/gkh021
- Vinayagam A, Konig R, Moormann J, Schubert F, Eils R, Glatting KH, Suhai S: Applying Support Vector Machines for Gene Ontology based gene function prediction. BMC Bioinformatics 2004, 5: 116. 10.1186/1471-2105-5-116View ArticlePubMedPubMed Central
- Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, Konig R: GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006, 7: 161. 10.1186/1471-2105-7-161View ArticlePubMedPubMed Central
- Khan S, Situ G, Decker K, Schmidt CJ: GoFigure: automated Gene Ontology annotation. Bioinformatics 2003, 19(18):2484–2485. 10.1093/bioinformatics/btg338View ArticlePubMed
- Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5: 178. 10.1186/1471-2105-5-178View ArticlePubMedPubMed Central
- Clare A, King RD: Predicting gene function in Saccharomyces cerevisiae . Bioinformatics 2003, 19(S2):42–49.
- Deng XG, Huimin , Ali HH: Learning Yeast Gene Functions from Heterogeneous Sources of Data Using Hybrid Weighted Bayesian Networks. Fourth International IEEE Computer Society Computational Systems Bioinformatics Conference 2005, 25–34.
- Mulder N, Apweiler R: InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007, 396: 59–70. full_textView ArticlePubMed
- Jung J, Thon MR: Automatic annotation of protein functional class from sparse and imbalanced data sets. Lecture Notes in Comput Sci 2006, 4316: 65–77. full_textView Article
- Jung J: Automatic Assignment of Protein Function with Supervised Classifiers. 2008.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Res 2005, (33 Web Server):W116–120. 10.1093/nar/gki442
- Hennig S, Groth D, Lehrach H: Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Res 2003, 31(13):3712–3715. 10.1093/nar/gkg582View ArticlePubMedPubMed Central
- Günther Z: OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res 2003, 31(13):3799–3803. 10.1093/nar/gkg555View Article
- Chalmel F, Lardenois A, Thompson JD, Muller J, Sahel JA, Leveillard T, Poch O: GOAnno: GO annotation based on multiple alignment. Bioinformatics 2005, 21(9):2095–2096. 10.1093/bioinformatics/bti252View ArticlePubMed
- Al-Shahib A, Breitling R, Gilbert D: Feature selection and the class imbalance problem in predicting protein function from sequence. Applied Bioinformatics 2005, 4(3):195–203.View ArticlePubMed
- Rice PL, Ian , Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2View ArticlePubMed
- Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci Int 1997, 55: 119–139. 10.1006/jcss.1997.1504View Article
- Söding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005, 33: W244-W248. 10.1093/nar/gki408View ArticlePubMedPubMed Central
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.PubMed
- Fan DW, Chan PK, Stolfo SJ: A Comparative Evaluation of Combiner and Stacked Generalization. In Proceedings of AAAI-96 workshop on Integrating Multiple Learned Models. Edited by: Chan PK. Menlo Park, CA: AAAI Press; 1996:40–46.
- Chan PK, Stolfo SJ: Experiments in multistrategy learning by meta-Learning. In Proceedings of the second international conference on information and knowledge management. Edited by: Bhargava BK. Washington, DC: Association for Computing Machinery (ACM); 1993:314–323. full_text
- Yu C, Zavaljevski N, Desai V, Johnson S, Stevens FJ, Reifman J: The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation. BMC Bioinformatics 2008, 9: 52. 10.1186/1471-2105-9-52View ArticlePubMedPubMed Central
- Forslund K, Sonnhammer ELL: Predicting protein function from domain content. Bioinformatics 2008, 24(15):1681–1687. 10.1093/bioinformatics/btn312View ArticlePubMed
- Falkenauer E: On Method Overfitting. J Heuristics 1998, 4(3):281–287. 10.1023/A:1009617801681View Article
- MATLAB MATLAB [http://www.mathworks.com/]
- Pattern Recognition Toolbox for MATLAB Pattern Recognition Toolbox for MATLAB [http://cmp.felk.cvut.cz/cmp/software/stprtool/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.