Predictability of gene ontology slim-terms from primary structure information in Embryophyta plant proteins
© Jaramillo-Garzón et al.; licensee BioMed Central Ltd. 2013
Received: 8 March 2012
Accepted: 19 February 2013
Published: 26 February 2013
Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally, as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned from each protein is usually annotated in databases through different methods, such as that proposed by The Gene Ontology (GO) consortium. Different methods have been proposed in order to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and reported success rates are much lower than those reported by other non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence.
High predictability of several GO terms was found for Molecular Function and Cellular Component. As expected, a lower degree of predictability was found on Biological Process ontology annotations, although a few biological processes were easily predicted. Proteins related to transport and transcription were particularly well predicted from primary structure information. The most discriminant features for prediction were those related to electric charges of the amino-acid sequence and hydropathicity derived features.
An analysis of the predictability of GO-slim terms in plants was carried out in order to determine single categories or groups of functions that are most closely related to primary structure information. For each highly predictable GO term, the features responsible for that success were identified and discussed. In contrast to most published studies, which focus on a few categories or single ontologies, the results in this paper comprise a complete landscape of GO predictability from primary structure, encompassing 75 GO terms at the molecular, cellular and phenotypical levels. Thus, it provides a valuable guide for researchers interested in further advances in protein function prediction on Embryophyta plants.
The universe of protein functions can be summarized through the use of the Gene Ontology (GO) project, which aims to construct controlled and structured vocabularies known as ontologies, and apply them in the annotation of gene products in biological databases. Molecular Function ontology refers to biochemical activities at the molecular level, no matter what entities accomplish that function or the context where it takes place; Cellular Component ontology refers to the specific sub-cellular location where a gene product is active, describing different parts of the eukaryotic cell; Biological Process ontology refers to a series of events with a defined beginning and end, to which the gene product contributes. As of February 2013 there are 38137 defined GO terms, distributed over 9467 molecular functions, 3050 cellular components and 23928 biological processes. However, in spite of such variety of functions, all proteins share a common basic configuration: a linear polypeptide chain composed of different combinations and repetitions of the twenty amino acids encoded by genes. Although there are currently almost 8 million sequences in non-redundant databases, for most of them we know only the amino acid sequence deduced from the DNA chain. Assessment of protein functions requires, in most cases, experimental approaches carried out in the lab. Unfortunately, these procedures must be focused on specific proteins or functions, and require either cloned DNA or protein samples from the genes of interest. Additionally, the function of many proteins may be related to their own native environment. Such a perspective has led some authors to conclude that the only effective route towards the elucidation of the function of some proteins may be computational analysis and prediction from amino acid sequences, which can later be subjected to experimental verification.
Many approaches have been developed in this matter (for comprehensive reviews, see [4-6]). One of the earliest applications, yet still one of the most popular bioinformatics tools, is the Basic Local Alignment Search Tool for proteins (BLASTP), which has been applied for obtaining annotation transfers based on sequence alignments. Also, a large number of methods (GOblet, OntoBlast, GOFigure and GOtcha) are based on the idea of refining and improving initial results from classic alignment tools such as BLASTP, by performing mappings and weightings of GO terms associated to BLASTP predictions. However, such methods do not consider the failure of conventional alignment tools to adequately identify homologous proteins at significant E-values. The same applies to some more recent methods that have improved specific points of this methodology, such as speeding up the procedure through decision rules, including additional functionality for visualization and data mining, or including some statistics of GO terms to refine selection. In order to avoid the dependence on BLAST alignments in the cases where the alignment-based annotation transfer approach is not so effective, more recent methods have used machine learning techniques trained over feature spaces of physical-chemical, statistical or locally-based attributes. Those methods employ techniques such as neural networks (ProtFun), Bayesian multi-label classifiers and support vector machines (SVM-Prot, GOKey, PoGO), obtaining high performance results on their own respective databases, mostly composed of model organisms such as bacteria and a few higher-order species.
There are, however, several aspects that must be discussed about current performances in prediction of GO terms when applied to non-model organisms such as land plants (Embryophyta). First, of the previously described methods, only Blast2GO was specialized for predicting GO terms in plant proteins. In fact, as pointed out by the authors of Blast2GO, very few resources are available for large-scale functional annotation of non-model species. Some methods specialized on plant species have been proposed recently, but they are only intended for performing cellular component predictions (Predotar, TargetP, Plant-mPloc). Moreover, Predotar and TargetP can discriminate among only three or four cellular location sites. Plant-mPloc, in turn, covers twelve different location sites and was rigorously tested over a set of proteins with less than 25% identity among them, where homologue-based tools like BLASTP would certainly fail. For such a dataset, its authors obtained an overall success rate of 63.7%, much lower than that reported by other cellular location predictors tested over non-plant datasets. Second, none of the existing methods can be used to deal with plant proteins that can simultaneously exist in, or move between, two or more different location sites, or belong to multiple functional classes at the same time.
In order to improve the performance of current GO term predictors for land plants, it would be useful to have a better understanding of the underlying relationships between primary structure information and protein functionality. However, the structure of the machine learning models behind high-accuracy predictors often makes it difficult to understand why a particular prediction was made. In this sense, a recent method called YLoc was proposed for analyzing which specific features are responsible for given predictions. This method, nevertheless, is not intended to predict GO terms; instead, it uses annotation information from PROSITE and GO as inputs to the predictor. Additionally, that study is focused only on predicting protein locations in the cell.
Since most of the current GO prediction methods are limited to a few arbitrary functional classes and single ontologies, they cannot provide any information about how predictability is related across the various levels of protein functionality (molecular, cellular, biological), which could be another key element for determining how the information of the primary structure is related to protein function.
This work presents an analysis of the predictability of GO terms over the Embryophyta group of organisms, which comprises the most familiar plants, including trees, flowers, ferns, mosses, and various other green land plants. The analysis provides the following key elements: predictions are made by using features extracted solely from primary structure information; the analysis comprises most of the available organisms belonging to the Embryophyta group; biases due to protein families are avoided by considering only proteins with low similarity among them and strong evidence of existence; predictions and analyses are made over a set of categories belonging to the three ontologies; proteins are allowed to be associated to several GO terms simultaneously.
Results from this work answer whether it is possible to predict most GO-slim terms from primary structure information, which categories are most amenable to prediction, which ontology is most closely related to the information contained in the primary structure, and what relationships among ontologies could be influencing the predictability at different levels of protein functionality in land plants.
The implemented methodology for assessing predictability of GO terms in Embryophyta proteins comprises the following parts: (i) the protein sequences making up the database are selected in order to cover the highest number of available plant proteins, while ensuring high confidence annotations and avoiding possible biases; (ii) categories describing positive and negative samples associated to each GO term are determined in order to minimize the impact of hierarchical relationships; (iii) protein sequences are mapped into feature vectors that encode a number of attributes of varied nature; (iv) computed features are clustered into groups of similar information content; (v) one binary classifier is learned for each GO term and each feature cluster in order to evaluate the prediction power of individual clusters, and finally (vi) one binary classifier is learned for each GO term using the whole set of features in conjunction with an automatic feature selection strategy in order to determine the global predictability of each GO term.
The following subsections describe the methods employed for each part of the methodology. All simulations were implemented in the R environment for statistical computing. Additional tools were mainly provided by Bioconductor and the seqinR package, all of them freely distributed under the GNU General Public License.
The database comprises all the available Embryophyta proteins in the UniProtKB/Swiss-Prot database (file version: 10/01/2013), with at least one annotation in the Gene Ontology Annotation (GOA) project (file version: 7/01/2013). The resulting set comprises proteins from 189 different land plants.
In order to avoid the presence of protein families that could bias the results, the dataset was filtered at several levels of sequence identity using the Cd-Hit software. The main results are reported for the lowest identity cutoff (30%). However, additional analyses at 40%, 50%, 60%, 70% and 80% were also performed in order to provide further information on the robustness of the method.
The main set comprises a total of 3368 protein sequences, of which 1973 sequences are annotated with molecular functions, 2210 with cellular components and 2798 with biological processes. Automatically-assigned annotations were not included in the analyses.
Definition of classes
Although, in principle, the method can be trained to predict any GO term for which there are enough training sequences, all tests were performed over the set of categories defined by the plants GO slim developed by The Arabidopsis Information Resource - TAIR (file version: 14/03/2012). This choice was made because GO, owing to its broad scope, includes a large number of categories that do not occur in plants. Slims, in turn, are smaller, more manageable subsets of GO that focus on terms relevant to a specific problem or data set, thus allowing the generation of higher-level annotations that are more robust to tests of statistical significance.
Positive and negative samples associated to each GO term are selected by considering the propagation principle of GO. If a protein is predicted to be associated to any given GO term, it must be automatically associated to all the ancestors of that category and thus, it is enough to predict only the lowest level entries. Consequently, for each GO term, positive samples are all those proteins that have been annotated with this term or any of its descendants, except those descendants that are also included as categories. All the remaining samples in the database are selected as negative samples for that GO term. In order to explicitly note that some GO terms do not include their descendant categories, such “incomplete” GO terms are marked with an asterisk throughout the paper.
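The selection rule above can be illustrated with a minimal Python sketch. It is not the code used in this study (which was written in R); the names `descendants`, `positive_samples`, `children`, `categories` and `annotations`, as well as the toy ontology, are hypothetical. The sketch also assumes that everything below a descendant category is excluded together with that category, which is one possible reading of the rule.

```python
def descendants(term, children):
    """Return every term reachable below `term` in the ontology graph."""
    found, stack = set(), list(children.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def positive_samples(term, categories, children, annotations):
    """Proteins counted as positive samples for the category `term`."""
    owned = {term} | descendants(term, children)
    for other in categories:
        # Assumption: a descendant that is itself a category keeps its own
        # proteins, together with everything annotated below it.
        if other != term and other in owned:
            owned -= {other} | descendants(other, children)
    return {protein for protein, terms in annotations.items() if terms & owned}


# Toy ontology (hypothetical): root -> A -> A1 and root -> B -> B1.
children = {"root": ["A", "B"], "A": ["A1"], "B": ["B1"]}
categories = {"root", "A"}          # only these terms are predicted
annotations = {"p1": {"A1"}, "p2": {"B1"}, "p3": {"root"}, "p4": {"A"}}

root_positives = positive_samples("root", categories, children, annotations)
a_positives = positive_samples("A", categories, children, annotations)
root_negatives = set(annotations) - root_positives
```

In this toy case the root behaves as an "incomplete" term (root*): it keeps p2 and p3, while p1 and p4 are counted only for category A.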
Definition and size of the classes
Carbohydrate metabolic process
Generation of precursor metabolites
Transcription factor activity
Nucleobase, nucleoside, nucleotide, nucleic acid metabolic process*
DNA metabolic process
Protein modification process
Lipid metabolic process
Response to stress
Enzyme regulator activity
Multicellular organismal development*
Response to external stimulus*
Response to biotic stimulus
Response to abiotic stimulus
Anatomical structure morphogenesis
Response to endogenous stimulus
Response to extracellular stimulus
Cellular component organization
Protein metabolic process*
Secondary metabolic process
Regulation of gene expression, epigenetic
Characterization of protein sequences
Initial set of features extracted from amino acid sequences
Positively charged residues (%)
Negatively charged residues (%)
Amino acid frequencies
Amino acid dimer frequencies
Structural dimer frequencies
In the case of ambiguous characters in the amino acid sequence, each feature was computed as its statistical expected value, using the natural abundance percentages of the amino acids as prior probabilities. Additionally, since the different groups of features are very heterogeneously scaled, z-score normalization was performed so that each feature has zero mean and unit standard deviation.
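As an illustration of these two steps, the following Python sketch computes the percentage of charged residues as an expected value and then applies z-score normalization. It is only a sketch under stated assumptions: lysine and arginine are taken as the positively charged residues and aspartic and glutamic acid as the negatively charged ones, the ambiguity codes B, Z and X are the only ones handled, the uniform prior is a placeholder for the natural abundance percentages (which are not reproduced here), and the population standard deviation is used. All function and variable names are hypothetical.

```python
import statistics

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Ambiguity codes and the residues they may stand for.
AMBIGUOUS = {"B": "DN", "Z": "EQ", "X": AMINO_ACIDS}
POSITIVE, NEGATIVE = "KR", "DE"     # assumed charge groups

# Placeholder prior: replace with natural abundance percentages.
UNIFORM_PRIOR = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def residue_weights(symbol, prior):
    """Probability of each concrete residue given an observed symbol."""
    candidates = AMBIGUOUS.get(symbol, symbol)
    total = sum(prior[a] for a in candidates)
    return {a: prior[a] / total for a in candidates}


def charged_percent(sequence, charged, prior):
    """Expected percentage of residues of `sequence` that belong to `charged`."""
    expected = 0.0
    for symbol in sequence:
        weights = residue_weights(symbol, prior)
        expected += sum(w for a, w in weights.items() if a in charged)
    return 100.0 * expected / len(sequence)


def zscore_columns(rows):
    """Scale every feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]


positive_pct = charged_percent("KKBA", POSITIVE, UNIFORM_PRIOR)
negative_pct = charged_percent("KKBA", NEGATIVE, UNIFORM_PRIOR)
scaled = zscore_columns([[1.0, 10.0], [3.0, 10.0]])
```

In the toy sequence, the ambiguous symbol B contributes half a residue to the negatively charged count because, under the placeholder prior, aspartic acid and asparagine are equally likely. Constant features are mapped to zero instead of being divided by a null standard deviation.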
The full feature matrix is provided in the supplementary material along with a file specifying the membership of samples to each category.
Negative charge / Acidic
Positive charge / Basic
Feature selection strategy
The feature selection procedure is carried out in the second part of the Results and discussion section, where the global predictability of each GO term is evaluated by using the whole feature set. Since redundant features could lead to overfitting of the training data, an analysis of relevance and redundancy was applied. Let $f_i$, $i=1,2,\ldots,n$, be the initial set of features, $y$ be the vector of labels, $c_{ij}=\mathrm{cor}(f_i,f_j)$ be the linear correlation computed between any pair $f_i$ and $f_j$, and $c_{iy}=\mathrm{cor}(f_i,y)$ be the linear correlation between $f_i$ and $y$. With these definitions, the relevance of features can be quantified by computing $c_{iy}$ for all features, and redundant ones can then be identified by analyzing the $n\times n$ feature correlation matrix. In order to speed up the calculations, an algorithm based on the Fast Correlation-Based Filter was used.
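A minimal sketch of a relevance and redundancy filter of this kind is given below. It follows the general heuristic of the Fast Correlation-Based Filter, with absolute linear correlation standing in for the correlation measure, and it should not be read as the exact algorithm used in this study. The threshold `min_relevance`, the function names and the toy data are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Linear (Pearson) correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0                      # constant vectors carry no information
    return sxy / math.sqrt(sxx * syy)


def select_features(features, y, min_relevance=0.0):
    """Return the indices of relevant, non-redundant features.

    features: list of feature vectors f_i (one list of values per feature).
    A feature f_i is dropped when an already selected, more relevant
    feature f_j satisfies |c_ij| >= |c_iy|.
    """
    relevance = {i: abs(pearson(f, y)) for i, f in enumerate(features)}
    ranked = sorted((i for i in relevance if relevance[i] > min_relevance),
                    key=lambda i: -relevance[i])
    selected = []
    for i in ranked:
        redundant = any(abs(pearson(features[i], features[j])) >= relevance[i]
                        for j in selected)
        if not redundant:
            selected.append(i)
    return selected


# Toy data (hypothetical): four features measured on four samples.
y = [0, 0, 1, 1]
features = [
    [0.0, 0.0, 1.0, 0.0],   # relevant
    [0.5, 0.0, 1.0, 0.0],   # weaker and strongly correlated with feature 0
    [1.0, 0.0, 1.0, 0.0],   # uncorrelated with the labels
    [0.0, 0.0, 0.0, 1.0],   # relevant and complementary to feature 0
]
kept = select_features(features, y, min_relevance=0.1)
```

Only the $n$ relevance values and the correlations against already selected features are computed, which avoids building the full $n\times n$ matrix.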
In order to allow samples to be associated to multiple categories, decision making was implemented following the one-against-all strategy. This strategy produces a strong class imbalance, since it trains a number of binary classifiers, each one intended to recognize samples from one class out of the whole training set. To overcome the problems that imbalanced classes commonly produce in pattern recognition techniques, the Synthetic Minority Over-sampling Technique (SMOTE) was employed.
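The core idea of SMOTE can be sketched in a few lines of Python: each synthetic sample is placed at a random point on the segment joining a minority sample and one of its nearest minority neighbours. This is a simplified illustration and not the implementation used in this study; the function name, the default `k=5` and the fixed random seed are assumptions, and at least two minority samples are required.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic minority samples by interpolation.

    minority: list of feature vectors of the positive (minority) class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (Euclidean).
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()              # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class (hypothetical): three positive samples, two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority, n_new=4, k=2)
```

Because every synthetic sample is a convex combination of two real minority samples, it always falls inside the region already occupied by the minority class. In a one-against-all setting, the oversampling would be applied to the positive class of each binary problem.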
A support vector machine (SVM) with a Gaussian kernel was used for running all the classification tests. This SVM is trained with the 'kernlab' package, available on CRAN. The dispersion of the kernel and the trade-off penalization parameter of the SVM are tuned for each test with a particle swarm optimization meta-heuristic, a bio-inspired optimization method that has been used in multiple applications in recent years.
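The following Python sketch shows a basic particle swarm optimizer of the kind that could be used for such a tuning. It is not the implementation used in this study: the inertia and acceleration coefficients, the swarm size and the number of iterations are assumed values, and the objective is a smooth stand-in function. In an actual tuning run, the objective would be the cross-validated error of the SVM as a function of the kernel dispersion and the penalization parameter.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=60, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimize `objective` inside the box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # personal best positions
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]     # global best so far
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and keep it inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def stand_in_error(params):
    """Hypothetical smooth error surface with its minimum at (1, -2)."""
    log_sigma, log_c = params
    return (log_sigma - 1.0) ** 2 + (log_c + 2.0) ** 2


search_box = [(-5.0, 5.0), (-5.0, 5.0)]   # assumed log-scale ranges
best_params, best_error = pso_minimize(stand_in_error, search_box)
```

Searching on a logarithmic scale is a common choice for kernel and penalization parameters, since their useful values usually span several orders of magnitude.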
In order to estimate the performance of the predictive model, a 5-fold cross-validation strategy is implemented. In this strategy, the test procedure is repeated five times, and each time 80% of the data is used for adjusting the SVM parameters and training the model, while the remaining 20% is used as testing samples. This strategy also provides an estimation of the reliability of the model, obtained by computing the variability of the results across the five repetitions.
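The partition scheme can be sketched as follows in Python. The function name and the fixed seed are hypothetical, and the sketch uses a plain shuffled split; whether the folds were stratified by class in this study is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]      # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(10, k=5))
```

Each sample is used exactly once for testing. Parameter tuning and model training would be carried out on the `train` indices only, and the spread of the five test scores gives the variability estimate mentioned above.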
Results and discussion
Analysis of predictability with individual feature clusters
Figure 1(a) shows the analysis for the molecular function ontology. For all feature groups, Receptor binding achieved the highest classification scores. This category is intended to comprise proteins that interact selectively and non-covalently with one or more specific sites on a receptor molecule. About 63% of the proteins associated to this category in the database are proteins involved with binding of serine/threonine kinase receptors, which turned out to be easily predicted from most of the defined features.
Transcription factor activity was easily predicted from the feature groups 1, 3, 6, 8 and 14. Not surprisingly, DNA binding presents a similar behavior, since most transcription factors must interact with DNA molecules and consequently they are also included in this category. However, it is worth noting that several other proteins also perform DNA binding, such as polymerases, nucleases and histones, and they were also well predicted from the same feature groups. The conclusion from these results becomes more evident by observing the results of the DNA metabolic process in Figure 1(c), which confirm the high predictability of all proteins involved with transcription when using the mentioned feature groups. A similar behavior is also observed for nucleus* in Figure 1(b), supported by the fact that the transcription process is mostly carried out in that sub-cellular location.
Transporter activity refers to proteins that enable the directed movement of substances into, out of, within or between cells. Most of them are integral transmembrane proteins, which are distinguished by their high content of hydrophobic residues. In fact, some of the highest performances of transporter activity were reached with groups 3 and 6, which include the GRAVY index as well as monomer and dimer frequencies of three out of the four most hydrophobic residues: leucine, isoleucine and phenylalanine. Additionally, the predictability of this molecular function is reflected, albeit to a lesser degree, in the transport biological process, which reaches its highest values for the same feature groups (see Figure 1(c)). The main difference between those GO terms is that transport is a broader category, including external agents such as oxygen carriers and lipoproteins that perform transport within multicellular organisms.
On the other hand, the root node of the molecular function ontology was the GO term with the lowest average prediction performance. Recall that the root node contains the proteins that do not belong to any of its descendant categories, so it keeps a small set of sequences of a sparse nature, which explains why they cannot be modeled and predicted as a group. It is interesting to note that the same behavior is observed for the other two ontologies (Figures 1(b) and 1(c)).
Concerning the cellular component ontology, it can be observed in Figure 1(b) that the ribosome category reached the highest classification accuracies, especially with groups 1, 2, 3 and 11. Such groups are mainly related to the four charged residues: lysine, arginine, glutamic acid and aspartic acid. This can be explained by the fact that ribosomal proteins must interact with the negatively charged phosphodiester bonds in the RNA backbone, so they are expected to have a high percentage of positively charged residues to counteract such charge repulsion. In agreement with this, isolated ribosomal proteins have been described as showing a high percentage of lysine and arginine residues and a low aromatic content. Hence, there is enough evidence to establish that ribosomal proteins are another category that is highly predictable from primary structure information.
As explained before, nucleus* is easily predicted from the same feature groups that have shown high discriminant capabilities for transcription-related proteins. A similar behavior is also observed for proteins belonging to the nucleolus component, which encompasses RNA polymerases, transcription factors, processing enzymes and ribosomal proteins among others, which must interact with nucleic acids and have shown high isoelectric points in comparison to the remaining proteins in the database.
Thylakoid proteins also presented high prediction performances with several feature groups. Further studies would be required to explain these results.
Broad categories such as membrane* showed poor performances with most feature groups, presumably due to their high diversity. However, some rather well-defined categories such as mitochondrion and peroxisome were also ranked in the lowest places in Figure 1(b), simply proving to be poorly predictable from the extracted feature groups.
Concerning Figure 1(c), the biological process that was best predicted for most feature groups is regulation of gene expression, epigenetic. This GO term encloses proteins involved in modulating the frequency, rate or extent of gene expression and is largely composed of histones. In fact, since histones are highly alkaline proteins, it is consistent to observe that this category was particularly well predicted from groups 3, 6 and 7, which are mainly made up of frequencies of phenylalanine, leucine, isoleucine, lysine and histidine residues. Also, cysteine-related frequencies (group 5) were highly discriminant for regulation of gene expression, epigenetic, which can be explained by the fact that altering the redox state of cysteines serves to modulate protein activity, and several transcription factors become activated by the oxidation of cysteines that form disulfide bonds.
Tropism and Cell Cycle also appeared near the top of Figure 1(c), just before DNA metabolic process, which was already discussed.
Analysis of predictability with the full set of features
Figure 3 depicts detailed results of predicting each class with the full feature set for an identity cutoff of 30%. Left plots show the sensitivity, specificity and geometric mean (green line) achieved with the five-fold cross-validation procedure, while right plots depict boxplots for analyzing performance variation throughout the five repetitions. Left plots also depict the performance of the BLASTP algorithm for comparison purposes (blue line). Similar figures for the whole sweep of identity cutoffs are presented in Additional file 3.
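For reference, the three performance measures can be computed from the predictions of one binary classifier as in the following Python sketch. The function name and the toy labels are hypothetical, and labels are assumed to be coded as 1 (positive) and 0 (negative).

```python
import math


def sensitivity_specificity_gmean(y_true, y_pred):
    """Sensitivity, specificity and their geometric mean for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy case: a classifier that misses most positives but rejects all negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
```

The toy case shows why the geometric mean is informative under class imbalance: a classifier with perfect specificity but many false negatives is penalized, whereas plain accuracy would still be 70%.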
Note that GO terms were ordered again from top to bottom according to their predictability, but this order is not strictly the same as in Figure 1. Some interesting results in Figure 3(b) are provided by categories such as plastid, which was not easily predicted with any feature set independently, but reached medium to high classification results when the complete set was used. Such behavior is a clear example of the multivariate associations that could be missed when analyzing only individual feature sets.
Other results were consistent with the insights provided by the previous analyses, showing that some of the best predicted GO terms were transporter activity, transcription factor activity, and DNA binding in molecular functions; ribosome, nucleus*, nucleolus and thylakoid in cellular components; and regulation of gene expression, epigenetic, cell cycle, photosynthesis and DNA metabolic process in biological processes.
A reduced number of categories had performances under 50%, most of them from the biological process ontology and a few from the molecular function ontology. It is important to note that the majority of those categories achieved very high specificities and low sensitivities, pointing to a high dispersion of such categories over the feature space, which yields a very high number of false negatives. Also, the high dispersions observed in the boxplots for some of the worst predicted classes demonstrate that there is a high variability among repetitions of the experiment, which means that those low performance estimates are not reliable. Conversely, the categories with high performances also show low dispersions, hinting at consistency in the predictors.
Although the main purpose of this work is not to design a highly accurate GO term predictor, but to provide a comprehensive analysis of the predictability of GO terms from primary structure information, it is important to mention how this method compares with currently used prediction tools. The blue and green lines in Figure 3 represent the prediction performances of BLASTP and the SVM-based predictor used in this work, respectively. Both methods were tested over the same database described in the Methods section. From Figure 3(a) it is possible to conclude that the two methods provide similar prediction capabilities for the molecular function ontology at this identity cutoff. However, Figures 3(b) and 3(c) show that the SVM outperformed BLASTP for the cellular component and biological process ontologies, with only a few exceptions. It is also important to point out that the results achieved here are competitive with those reported for one of the more recent and effective predictors dedicated to plant proteins.
It is notable how the categories with the largest number of descendants have been negatively affected by their false positives. This is especially observed in Figure 4(b) for cytoplasm and intracellular, and in Figure 3(c) for cellular process and metabolic process. Conversely, a few classes that were lacking sensitivity were favored by the contributions of their descendants, as is the case for the root nodes of the ontologies.
An analysis of the predictability of GO terms in land plant proteins was carried out in order to determine single categories or groups of related functions that are most closely related to primary structure information. For this purpose, pattern recognition techniques were employed over a feature set of physical-chemical and statistical attributes computed over the primary structure of the proteins. High predictability of several GO terms was observed in the three ontologies. Proteins associated to transport activities showed high correct prediction rates when using hydropathicity-related features. Also, proteins involved with transcription (and therefore associated to the nucleus) presented high discriminability from the extracted features. Ribosomal and other proteins involved with translation proved to be highly predictable from features related to electric charges of the amino acid sequence. At the biological process level, proteins related to regulation of gene expression and nucleic acid metabolic process were easily predicted, while some other biological processes showed low predictability from the extracted primary structure features. The information derived from this study could be used to further improve prediction performances by combining the information from SVM classifiers with annotation transfer methods.
This work was partially supported by the Spanish Ministerio de Educación y Ciencia under the Ramón y Cajal Program and the grants TEC2010-20886-C02-02 and TEC2010-20886-C02-01, by Centro de Investigación Biomédica en Red en Bioingeniería, Biomateriales y Nanomedicina (CIBER-BBN), by Dirección de Investigaciones de Manizales (DIMA) from Universidad Nacional de Colombia and by the Colombian National Research Institute (COLCIENCIAS) under grant 111952128388.
- The Gene Ontology Consortium: The gene ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: 258-261. 10.1093/nar/gkh036
- Levitt M: Nature of the protein universe. Proc Natl Acad Sci 2009,106(27):11079. 10.1073/pnas.0905029106
- Baldi P, Brunak S: Bioinformatics: the Machine Learning Approach. Cambridge: The MIT Press; 2001.
- Zhao X, Chen L, Aihara K: Protein function prediction with high-throughput data. Amino Acids 2008,35(3):517-530. 10.1007/s00726-008-0077-y
- Pandey G, Kumar V, Steinbach M: Computational approaches for protein function prediction: a survey. Twin Cities: Tech Rep, 06-028 Department of Computer Science and Engineering, University of Minnesota; 2006.
- Friedberg I: Automated protein function prediction-the genomic challenge. Brief Bioinformatics 2006,7(3):225. 10.1093/bib/bbl004
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997,25(17):3389-3402. 10.1093/nar/25.17.3389
- Groth D, Lehrach H, Hennig S: GOblet: a platform for Gene Ontology annotation of anonymous sequence data. Nucleic Acids Res 2004,32(Web Server issue):W313-W317.
- Zehetner G: OntoBlast function: from sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res 2003,31(13):3799-3803. 10.1093/nar/gkg555
- Khan S: GoFigure: Automated Gene Ontology™ annotation. Bioinformatics 2003,19(18):2484-2485. 10.1093/bioinformatics/btg338
- Martin DMA, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5: 178. 10.1186/1471-2105-5-178
- Hawkins T, Chitale M, Luban S, Kihara D: PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins 2009,74(3):566-582. 10.1002/prot.22172
- Jones CE, Schwerdt J, Bretag TA, Baumann U, Brown AL: GOSLING: a rule-based protein annotator using BLAST and GO. Bioinformatics (Oxford, England) 2008,24(22):2628-2629. 10.1093/bioinformatics/btn486
- Conesa A, Götz S: Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics 2008, 2008: 619832.
- Vinayagam A, del Val C, Schubert F, Eils R, Glatting KH, Suhai S, König R: GOPET: a tool for automated predictions of gene ontology terms. BMC Bioinformatics 2006, 7: 161. 10.1186/1471-2105-7-161
- Jensen L, Gupta R, Staerfeldt H, Brunak S: Prediction of human protein function according to Gene Ontology categories. Bioinformatics 2003,19(5):635. 10.1093/bioinformatics/btg036
- Jung J, Thon MR: Gene function prediction using protein domain probability and hierarchical gene ontology information. 2008 19th Int Conf Pattern Recognit 2008, 19: 1-4.
- Cai CZ: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res 2003,31(13):3692-3697. 10.1093/nar/gkg600
- Bi R, Zhou Y, Lu F, Wang W: Predicting gene ontology functions based on support vector machines and statistical significance estimation. Neurocomputing 2007,70(4-6):718-725. 10.1016/j.neucom.2006.10.006
- Jung J, Yi G, Sukno SA, Thon MR: PoGO: Prediction of gene ontology terms for fungal proteins. BMC Bioinformatics 2010, 11: 215. 10.1186/1471-2105-11-215
- Small I, Peeters N, Legeai F, Lurin C: Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences. Proteomics 2004,4(6):1581-1590. 10.1002/pmic.200300776
- Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000,300(4):1005-1016. 10.1006/jmbi.2000.3903
- Chou KC, Shen HB: Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PloS one 2010,5(6):e11335. 10.1371/journal.pone.0011335
- Briesemeister S, Rahnenführer J, Kohlbacher O: Going from where to why-interpretable prediction of protein subcellular localization. Bioinformatics (Oxford, England) 2010,26(9):1232-1238. 10.1093/bioinformatics/btq115
- Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N: PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 2010,38(Database issue):D161-D166.
- R Core Team: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2012. [ISBN 3-900051-07-0] [http://www.R-project.org/]
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. 10.1186/gb-2004-5-10-r80
- Charif D, Lobry J: SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In Structural approaches to sequence evolution: Molecules, networks, populations. Edited by: Bastolla U, Porto HRM, Vendruscolo M. New York, Springer Verlag: Biological and Medical Physics, Biomedical Engineering; 2007:207-232.
- Jain E, Bairoch A, Duvaud S, Phan I, Redaschi N, Suzek B, Martin M, McGarvey P, Gasteiger E: Infrastructure for the life sciences: design and implementation of the UniProt website. BMC Bioinformatics 2009, 10: 136. 10.1186/1471-2105-10-136PubMed CentralView ArticlePubMed
- Barrell D, Dimmer E, Huntley R, Binns D, O’Donovan C, Apweiler R: The GOA database in 2009-an integrated gene ontology annotation resource. Nucleic Acids Res 2008, 37: D396—D403.PubMed CentralPubMed
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006,22(13):1658-1659. 10.1093/bioinformatics/btl158View ArticlePubMed
- Berardini T, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller L, Yoon J, Doyle A, Lander G: Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 2004,135(2):745. 10.1104/pp.104.040071PubMed CentralView ArticlePubMed
- Davis MJ, Sehgal MSB: Ragan Ma: Automatic, context-specific generation of gene ontology slims. BMC bioinformatics 2010, 11: 498. 10.1186/1471-2105-11-498PubMed CentralView ArticlePubMed
- Rhee SY, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nat Rev Genet 2008,9(7):509-515. 10.1038/nrg2363View ArticlePubMed
- Frishman D, Argos P: Seventy-five percent accuracy in protein secondary structure prediction. Proteins Struct Funct and Genet 1997,27(3):329-335. 10.1002/(SICI)1097-0134(199703)27:3<329::AID-PROT1>3.0.CO;2-8View Article
- Yu L, Liu H: Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 2004, 5: 1205-1224.
- Chawla N, Bowyer K, Hall L, Kegelmeyer W: SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002,16(3):321-357.
- Karatzoglou A, Smola A, Hornik K, Zeileis A: kernlab - An S4 package for kernel methods in R. J Stat Softw 2004,11(9):1-20. [http://www.jstatsoft.org/v11/i09/] View Article
- Kennedy J, Eberhart R: Particle swarm optimization. Proc ICNN’95 Int Conf Neural Netw 1995, 4: 1942-1948.View Article
- Whitford D: Proteins: Structure and Function. West Sussex: Wiley; 2005.
- Arrigo A: Gene expression and the thiol redox state. Free Radic Biol Med 1999,27(9-10):936-944. 10.1016/S0891-5849(99)00175-6View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.