The universe of protein functions can be summarized through the Gene Ontology (GO) project, which aims to construct controlled, structured vocabularies known as ontologies and to apply them in the annotation of gene products in biological databases. The Molecular Function ontology refers to biochemical activities at the molecular level, regardless of which entities carry out the function or the context in which it takes place; the Cellular Component ontology refers to the specific sub-cellular location where a gene product is active, describing different parts of the eukaryotic cell; the Biological Process ontology refers to a series of events with a defined beginning and end, to which the gene product contributes. As of February 2013, there are 38137 defined GO terms, distributed over 9467 molecular functions, 3050 cellular components and 23928 biological processes. However, in spite of this variety of functions, all proteins share a common basic configuration: a linear polypeptide chain composed of different combinations and repetitions of the twenty amino acids encoded by genes. Although there are currently almost 8 million sequences in non-redundant databases, for most of them we know only the amino acid sequence deduced from the DNA chain. Assessment of protein function requires, in most cases, experimental approaches carried out in the lab. Unfortunately, these procedures must focus on specific proteins or functions, and require either cloned DNA or protein samples from the genes of interest. Additionally, the function of many proteins may be related to their native environment. This perspective has led some authors to conclude that the only effective route towards elucidating the function of some proteins may be computational analysis and prediction from amino acid sequences, which can later be subjected to experimental verification.
Many approaches have been developed to this end (for comprehensive reviews, see [4–6]). One of the earliest, and still one of the most popular, bioinformatics tools is the Basic Local Alignment Search Tool for proteins (BLASTP), which has been applied to transfer annotations based on sequence alignments. Several methods (such as GOblet and GOtcha) are based on the idea of refining initial results from classic alignment tools such as BLASTP by mapping and weighting the GO terms associated with BLASTP predictions. However, such methods do not consider the failure of conventional alignment tools to identify homologous proteins at significant E-values. The same applies to some more recent methods that have improved specific points of this methodology, such as speeding up the procedure through decision rules, adding functionality for visualization and data mining, or including statistics of GO terms to refine selection. To avoid the dependence on BLAST alignments in cases where alignment-based annotation transfer is not effective, more recent methods have used machine learning techniques trained over feature spaces of physicochemical, statistical or locally-based attributes. Those methods employ techniques such as neural networks (ProtFun), Bayesian multi-label classifiers and support vector machines (SVM-Prot), obtaining high performance on their own respective databases, which are composed mostly of model organisms such as bacteria and a few higher-order species.
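The idea of weighting GO terms attached to alignment hits can be illustrated with a minimal sketch. It is not the scoring scheme of GOblet, GOtcha or any other published tool: the function name `score_go_terms`, the `(evalue, go_terms)` input format, the `-log10(E-value)` weight and the `1e-3` cut-off are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def score_go_terms(hits, max_evalue=1e-3):
    """Accumulate a normalized weight per GO term from alignment hits.

    hits: iterable of (evalue, go_terms) pairs (hypothetical input format).
    Every retained hit adds -log10(evalue) to each GO term it carries.
    max_evalue is assumed to be below 1 so that all weights are positive.
    """
    scores = defaultdict(float)
    for evalue, go_terms in hits:
        if evalue > max_evalue:
            continue  # hit too weak to support annotation transfer
        # Clamp very small E-values so that an E-value of 0 does not break log10.
        weight = -math.log10(max(evalue, 1e-180))
        for term in go_terms:
            scores[term] += weight
    total = sum(scores.values())
    if total == 0:
        return {}  # no significant hit: nothing can be transferred
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return {term: score / total for term, score in ranked}


# Illustrative hits only; the E-values are made up for the example.
example_hits = [
    (1e-50, ["GO:0003677", "GO:0005634"]),
    (1e-10, ["GO:0003677"]),
    (0.5, ["GO:0016301"]),  # discarded by the E-value cut-off
]
print(score_go_terms(example_hits))
```

When no hit passes the E-value cut-off, the function returns an empty dictionary. This is the situation noted above in which alignment-based transfer has nothing to offer, and it motivates the feature-based predictors.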
There are, however, several aspects of current GO term prediction performance that must be discussed when it is applied to non-model organisms such as land plants (Embryophyta). First, of the methods described above, only Blast2GO was specialized for predicting GO terms in plant proteins. In fact, as the authors of Blast2GO point out, very few resources are available for large-scale functional annotation of non-model species. Some methods specialized for plant species have been proposed recently, but they are intended only for cellular component prediction (e.g., Predotar). Moreover, Predotar and TargetP can discriminate among only three or four cellular location sites. Plant-mPloc, in turn, covers twelve different location sites and was rigorously tested on a set of proteins with less than 25% sequence identity among them, where homology-based tools like BLASTP would certainly fail. For this dataset, its authors obtained an overall success rate of 63.7%, much lower than that reported by other cellular location predictors tested on non-plant datasets. Second, none of the existing methods can deal with plant proteins that simultaneously exist in, or move between, two or more location sites, or that belong to multiple functional classes at the same time.
To improve the performance of current GO term predictors for land plants, it would be useful to better understand the underlying relationships between primary structure information and protein functionality. However, the structure of the machine learning models behind high-accuracy predictors often makes it difficult to understand why a particular prediction was made. In this regard, a recent method called YLoc was proposed for analyzing which specific features are responsible for a given prediction. This method, nevertheless, is not intended to predict GO terms; instead, it uses annotation information from PROSITE and GO as inputs to the predictor. Additionally, that study focuses only on predicting protein locations in the cell.
Since most current GO prediction methods are limited to a few arbitrary functional classes and to single ontologies, they cannot provide any information about how predictability relates across the various levels of protein functionality (molecular, cellular, biological), which could be another key element for determining how primary structure information is related to protein function.
This work presents an analysis of the predictability of GO terms over the Embryophyta group of organisms, which comprises the most familiar plants, including trees, flowers, ferns, mosses and various other green land plants. The analysis provides the following key elements: predictions are made using features extracted solely from primary structure information; the analysis covers most of the available organisms belonging to the Embryophyta group; biases due to protein families are avoided by considering only proteins with low similarity among them and strong evidence of existence; predictions and analyses are made over a set of categories belonging to the three ontologies; and proteins are allowed to be associated with several GO terms simultaneously.
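Two of these elements, sequence-only features and multiple GO terms per protein, can be made concrete with a minimal sketch. Amino acid composition is used here only as one simple example of a feature derived from primary structure; it is not claimed to be the feature set of this work. The helper names `composition_features` and `multilabel_targets`, the example sequence and the category list are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard residues


def composition_features(sequence):
    """Return the relative frequency of each standard amino acid.

    Characters outside the twenty standard residues (e.g. X) are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    n = sum(counts.values())
    return [counts[aa] / n if n else 0.0 for aa in AMINO_ACIDS]


def multilabel_targets(annotations, categories):
    """Encode a protein's GO terms as a 0/1 vector over a fixed category list.

    Several entries may be 1 at once, so a protein can belong to
    several categories, possibly from different ontologies.
    """
    return [1 if category in annotations else 0 for category in categories]


# Hypothetical example: a short peptide annotated with two GO terms.
categories = ["GO:0003677", "GO:0005634", "GO:0008152"]
features = composition_features("MKVLAAGA")
targets = multilabel_targets({"GO:0003677", "GO:0005634"}, categories)
print(features, targets)
```

Under this encoding, each category becomes a separate binary decision on the same feature vector, which is one common way to let a protein carry several GO terms simultaneously.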
The results of this work address whether it is possible to predict most GO-slim terms from primary structure information, which categories are more amenable to prediction, which ontology is most closely related to the information contained in the primary structure, and what relationships among ontologies could be influencing predictability at different levels of protein functionality in land plants.