Three-Level Prediction of Protein Function by Combining Profile-Sequence Search, Profile-Profile Search, and Domain Co-Occurrence Networks
© Wang et al.; licensee BioMed Central Ltd. 2013
Published: 28 February 2013
Skip to main content
© Wang et al.; licensee BioMed Central Ltd. 2013
Published: 28 February 2013
Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progresses in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindingly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. Particularly, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).
In the genome era, high-throughput genome, transcriptome, and proteome sequencing is generating an enormous amount of omics data such as gene and protein sequences. Since experimental characterization of these proteins can only be carried out at a selectively small scale, large-scale and high-throughput computational prediction methods are needed to annotate the structures and functions of most of these proteins in order for the biomedical research to effectively utilize this vast resource to study genotype - phenotype relationships. To fill the gap, a variety of computational methods have been developed to predict protein function from protein sequence from different perspectives.
The most commonly used approach to function prediction is based on sequence homology. It uses a sequence comparison/alignment tool to search a target protein sequence against protein sequences with known function annotations in a protein database, and if some homologous hits are found, their function annotations may be transferred to the target protein as predictions. GOtcha , OntoBlast , and Goblet  are tools that use BLAST  to search for homologues and then combine the Gene Ontology function terms  of homologous hits based on BLAST e-values. PFP  uses a more sensitive profile-sequence alignment tool PSI-BLAST  to search for remote homologues, and also considers co-occurrence relationships between GO  terms in order to improve the sensitivity of prediction.
Phylogenetic relationships between proteins have been proven to be helpful for inferring protein functions [7–9]. Paralogues and orthologues, the two kinds of homologous proteins generated by gene duplication and speciation during evolution, respectively , may still share similar functions. Thus, the function of a protein may be inferred from that of its paralogues or orthologues, even though the level of their functional similarity may depend on their evolutionary distance and other factors. As most of phylogenetic-tree based methods assume orthologous proteins are more likely to share similar functions , they often generate a phylogenetic tree to elucidate the evolutionary relationships between a target protein and its homologous proteins at first, and then preferably use the functions of its orthologues to infer its function. SIFTER , Orthostrapper , RIO , and AFAWE  are typical methods in this category.
Apart from homologous relationships mentioned above, network-based methods exploit other relationships stored in protein networks. Assuming that neighboring proteins interacting in a protein-protein interaction (PPI) network may have similar protein function, some early network-based methods use the functions of the direct (radius-one) neighbors of a target protein in a PPI to infer its function. More advances in this direction include the consideration of statistically enriched functions within neighbors [15, 16], the expansion of search from direct neighbors to radius-two and radius-three neighbors , and the development of more advanced function inference methods, such as Markov random field , random walk , and algorithms taking in account the global topology of a network [20–22]. In addition to PPI, Functional Linkage Networks (FLNs)  derived from protein interaction, gene expression data, phylogenetic profile, and genetic interaction [22, 24–26], have been used to predict protein function. More recently, Domain Co-occurrence Networks (DCN) has been used to predict protein function .
In an effort to directly link a protein with its function, machine learning methods, such as Support Vector Machines (SVM) and Artificial Neural Networks, have been developed to predict protein function from scratch. Machine learning methods usually generate features from protein sequence, secondary structure, hydrophobicity, subcellular location, and solvent accessibility, and then use these features as inputs to train a classifier to assign proteins to a number of predefined function categories. ProtFun  aims to classify a eukaryotic protein into 14 Gene Ontology (GO) categories and several Enzyme Commission classes. FFPred  uses features derived from protein disordered regions and protein sequence profiles with SVM to classify a protein into 300 Gene Ontology classes.
With these approaches developed from a variety of perspectives, protein function prediction remains a challenging, largely unsolved problem, particularly when little information regarding homology and protein interaction is known about a target protein. Both the specificity and sensitivity of function prediction need to be improved in order to reliably make function predictions for most proteins. One consensus in the community is to combine multiple complementary methods and explore and integrate more diverse sources of information to enhance prediction accuracy and broaden the annotation scope [29, 30]. In this spirit, we developed a three-level method to cope with the complexity of function prediction at different levels, from high homology, to remote homology, to no homology, and synergistically integrated them into a system that can make function prediction for almost all the target proteins in the 2011 Critical Assessment of Function Annotation (CAFA, http://biofunctionprediction.org/) . At the first level, our method uses PSI-BLAST to search SwissProt  for significant homologues of a target protein; at the second level, it applies a sensitive profile-profile alignment tool HHSearch to search against Pfam  to gather remote homologues; and at the third level, it detects domains existent in the target protein, and then uses their neighboring domains found in a species-specific Domain Co-occurrence Networks (DCNs) to infer the functions of the target protein, even though there may be no homology between the target protein and its neighboring domains. Our method combining predictions generated at all three levels participated in the 2011 CAFA experiment. In comparison with three base-line methods, our method not only substantially expanded the sensitivity/recall of function prediction by adding profile search and domain network at the top of traditional PSI-BLAST search, but increased the semantic similarity between predicted function terms and true ones according to the Gene Ontology. Another advantage is that our method can readily predict the functions of multi-domain proteins by decomposing a protein into individual domains and aggregating the function predictions of each domain on a Domain Co-occurrence Network.
The minimum, maximum, and average number of predictions per target of our three predictors and the three baseline methods.
Minimum number of predictions
Maximum number of predictions
Average number of predictions
Here, all of the top n predicted GO terms and the actual GO terms (determined by experimental methods) of protein i were propagated to the root of the Gene Ontology Directed Acyclic Graph (DAG). All the GO term nodes present in the paths of predicted GO terms toward the root were considered predicted GO terms; and all the GO term nodes existing in the paths of the actual GO terms toward the root were considered as true GO terms. The overlapping GO terms/nodes between predicted ones and true ones were considered correctly predicted nodes. For each n in [1, 20], we calculated the precision and recall for each target protein and averaged them over 436 targets protein as the precision and recall on the data set. Thus, for each method, we got 20 precision-recall pairs to generate a precision-recall curve.
It is also interesting to notice that the best recall value of our predictor 1 (~0.55) is higher than that of predictor 2 (~0.51), indicating that including radius-two domain neighbors in Domain Co-occurrence Network  may contribute to the increase of recall since predictor 1 used both radius-one and radius-two domain neighbors to make predictions whereas predictor 2 only used radius-one neighbors (see the Method section for details). That the recall of predictors 1 and 2 are much higher than that of predictor 3 (0.41) demonstrates that profile-profile alignment (HHSearch ) and DCN can substantially increase the sensitivity of protein function prediction at the top of the profile-sequence search methods such as PSI-BLAST (e.g. predictor 3).
We also calculated precisions and recalls according to a sliding threshold scheme, in which only the predictions with confidence scores higher than a threshold value t (0 < = t < = 1) were selected for evaluation. The predicted and actual GO terms were propagated to the root in the GO DAG as described in the sub-section above. At each threshold, we calculated precision and recall for each protein and used the average precision and recall on all 436 target proteins as the estimated prediction and recall for the threshold. By using the thresholds evenly distributed in the range [0, 1] at step size 0.01, we calculated a series of precision-recall pairs for each of six predictors.
The break-even values between precision and recall (i.e., when precision = recall) of the six predictors.
The DNC method predicted a GO term GO:0006464 (protein modification process) using radius-one neighboring domains and predicted GO:0043687 (post-translational protein modification), GO: 0051246 (regulation of protein metabolic process) and GO: 0008152 (metabolic process) using radius-two neighboring domains, which were highly related to the two real GO terms of T30248 (Swiss-Prot ID Q99KJ0.1) - GO:0031396 (regulation of protein ubiquitination) and GO:0042176 (regulation of protein catabolic process). Moreover, the above-mentioned predicted GO terms all existed in the propagated paths from the actual GO terms to the root in the Gene Ontology Directed Acyclic Graph (DAG). Particularly, the true GO term GO:0042176 (regulation of protein catabolic process) and the predicted GO term GO:0051246 (regulation of protein metabolic process) had a high semantic similarity score of 0.730 calculated by the tool G-SESAME , where 1 indicates exactly the same and 0 completely different. Because the homology-based method did not produce any predictions for T30248, this example demonstrates that the DCN of a species, which may be different from the species of a target protein, can be used to make de novo function prediction for the target from scratch. It also shows that the DCN method can readily decompose a multi-domain protein into multiple domains and aggregate function predictions of individual domains as the prediction for the whole protein.
We designed and developed an automated three-level method to predict protein functions integrating profile-sequence homology search, profile-profile homology search and domain co-occurrence networks. We blindly tested different ways of combining predictions generated at the three levels on a large number of protein targets in the 2011 Critical Assessment of Function Annotation. The results showed that our methods integrating complementary predictions performed mostly better than three standard baseline methods. Our experiments also clearly demonstrated that using profile-profile alignment (HHSearch) and domain co-occurrence networks not only increases the sensitivity of protein function prediction at top of the traditional BLAST- and PSI-BLAST-based homology search methods, but also make it possible to make ab initio predictions and handle multi-domain proteins readily.
We constructed three protein function predictors (predictor-1, precictor-2, predictor-3), whose predictions were submitted to CAFA as models 1, 2, and 3. The first two predictors combined function predictions derived from profile-sequence PSI-BLAST search, profile-profile HHSearch search, and domain co-occurrence networks using different strategies. The third one used only predictions derived from PSI-BLAST search at the default threshold (i.e. 10).
where p is the HHSearch probability score from 0 to 100. Thus, the confidence score of GO terms predicted from HHSearch hits is in the range [0.3, 0.6].
The target proteins without predictions made from the first two levels were considered hard cases. For hard cases, we used PSI-BLAST with the default threshold (i.e. 10) to search against both Swiss-Prot and the Gene Ontology database, and additionally applied the DCN-based "aggregated neighbor-counting" method  to make predictions. The DCN-based "aggregated neighbor-counting" method ran PSI-BLAST to search a target protein against the pre-built proteome databases to find its most closely related organism. Our database contains the whole-genome protein sequences of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, 15 plant species, and 398 single-chromosome prokaryotic organisms (detailed species names can be found at ). The organism whose protein was most similar to the target protein according to the PSI-BLAST search's e-value was considered the most closely related species for the target protein. The pre-constructed DCN of this species was used to make functional predictions for the target. The domain co-occurrence network (DCN) of a species was constructed by the following two steps: (1) each protein sequence of the entire genome was searched against Pfam database  using a profile-sequence alignment tool PfamScan to detect the occurrence of any protein domain families in the protein; and every detected domain family was represented as a node in DCN; (2) if two protein domain families co-occurred in one protein, one edge was drawn between them. In this way, a DCN of a species was created, in which a node represents a protein domain family and an edge the co-occurrence relationship between two domain families. Figure 6(C) illustrates the main DCN of H. sapiens.
where f is the occurrence-frequency of the GO term - the number of the neighboring domains that have the GO term function divided by the total number of occurrences of all predicted GO terms. Thus, the confidence score of DCN-based predictions is in range (0, 0.3]. The ranges of confidence scores assigned to three levels were chosen according to a benchmarking on 100 proteins randomly selected from Gene Ontology before making predictions for the CAFA targets. We compared the performances of the predictions at each level and set their ranges of confidence scores based on their prediction accuracies on the benchmark proteins from high to low.
Predictor-2 used PSI-BLAST to search a target protein against Swiss-Prot with e-value threshold 0.01, applied HHSearch to search the target against PfamA, and employed the DCN-based "aggregated neighbor-counting" method on radius-one neighbors, in order to gather GO terms at all the three levels. The same probability score threshold (> = 80) of HHSearch was used as in Predictor-1. We assigned weights 4, 2, and 1 to a GO term generated by PSI-BLAST, HHSearch, and DCN-based "aggregated neighbor-counting" method, respectively. The weighted frequency of each GO term was calculated and normalized. The normalized score was used as the confidence score of an individual GO term. For proteins without any predictions generated using these three methods, an additional PSI-BLAST search of the protein against Gene Ontology and Swiss-Prot with the default e-value threshold (i.e. 10) was executed in order to gather more hits if possible.
Only a PSI-BLAST search against Swiss-Prot with the default threshold (i.e. 10) was performed in predictor 3. All of the PSI-BLAST hits were included to make prediction. The occurrence frequency of a GO term among all hits was used as its confidence score.
The work was partially supported by a National Institute of Health grant (1R01GM093123) to JC. ZW was also supported in part by the Paul K. and Diane Shumaker fellowship.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at URL. http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.