Three-Level Prediction of Protein Function by Combining Profile-Sequence Search, Profile-Profile Search, and Domain Co-Occurrence Networks

Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progresses in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindingly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. Particularly, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).


Background
In the genome era, high-throughput genome, transcriptome, and proteome sequencing is generating an enormous amount of omics data such as gene and protein sequences. Since experimental characterization of these proteins can only be carried out at a selectively small scale, large-scale and high-throughput computational prediction methods are needed to annotate the structures and functions of most of these proteins in order for the biomedical research to effectively utilize this vast resource to study genotype -phenotype relationships. To fill the gap, a variety of computational methods have been developed to predict protein function from protein sequence from different perspectives.
The most commonly used approach to function prediction is based on sequence homology. It uses a sequence comparison/alignment tool to search a target protein sequence against protein sequences with known function annotations in a protein database, and if some homologous hits are found, their function annotations may be transferred to the target protein as predictions. GOtcha [1], OntoBlast [2], and Goblet [3] are tools that use BLAST [4] to search for homologues and then combine the Gene Ontology function terms [5] of homologous hits based on BLAST e-values. PFP [6] uses a more sensitive profile-sequence alignment tool PSI-BLAST [4] to search for remote homologues, and also considers cooccurrence relationships between GO [5] terms in order to improve the sensitivity of prediction.
Phylogenetic relationships between proteins have been proven to be helpful for inferring protein functions [7][8][9]. Paralogues and orthologues, the two kinds of homologous proteins generated by gene duplication and speciation during evolution, respectively [10], may still share similar functions. Thus, the function of a protein may be inferred from that of its paralogues or orthologues, even though the level of their functional similarity may depend on their evolutionary distance and other factors. As most of phylogenetic-tree based methods assume orthologous proteins are more likely to share similar functions [9], they often generate a phylogenetic tree to elucidate the evolutionary relationships between a target protein and its homologous proteins at first, and then preferably use the functions of its orthologues to infer its function. SIFTER [11], Orthostrapper [12], RIO [13], and AFAWE [14] are typical methods in this category.
Apart from homologous relationships mentioned above, network-based methods exploit other relationships stored in protein networks. Assuming that neighboring proteins interacting in a protein-protein interaction (PPI) network may have similar protein function, some early networkbased methods use the functions of the direct (radius-one) neighbors of a target protein in a PPI to infer its function. More advances in this direction include the consideration of statistically enriched functions within neighbors [15,16], the expansion of search from direct neighbors to radiustwo and radius-three neighbors [17], and the development of more advanced function inference methods, such as Markov random field [18], random walk [19], and algorithms taking in account the global topology of a network [20][21][22]. In addition to PPI, Functional Linkage Networks (FLNs) [23] derived from protein interaction, gene expression data, phylogenetic profile, and genetic interaction [22,[24][25][26], have been used to predict protein function. More recently, Domain Co-occurrence Networks (DCN) has been used to predict protein function [16].
In an effort to directly link a protein with its function, machine learning methods, such as Support Vector Machines (SVM) and Artificial Neural Networks, have been developed to predict protein function from scratch. Machine learning methods usually generate features from protein sequence, secondary structure, hydrophobicity, subcellular location, and solvent accessibility, and then use these features as inputs to train a classifier to assign proteins to a number of predefined function categories. ProtFun [27] aims to classify a eukaryotic protein into 14 Gene Ontology (GO) categories and several Enzyme Commission classes. FFPred [28] uses features derived from protein disordered regions and protein sequence profiles with SVM to classify a protein into 300 Gene Ontology classes.
With these approaches developed from a variety of perspectives, protein function prediction remains a challenging, largely unsolved problem, particularly when little information regarding homology and protein interaction is known about a target protein. Both the specificity and sensitivity of function prediction need to be improved in order to reliably make function predictions for most proteins. One consensus in the community is to combine multiple complementary methods and explore and integrate more diverse sources of information to enhance prediction accuracy and broaden the annotation scope [29,30]. In this spirit, we developed a three-level method to cope with the complexity of function prediction at different levels, from high homology, to remote homology, to no homology, and synergistically integrated them into a system that can make function prediction for almost all the target proteins in the 2011 Critical Assessment of Function Annotation (CAFA, http://biofunctionprediction.org/) [31]. At the first level, our method uses PSI-BLAST to search SwissProt [32] for significant homologues of a target protein; at the second level, it applies a sensitive profile-profile alignment tool HHSearch to search against Pfam [33] to gather remote homologues; and at the third level, it detects domains existent in the target protein, and then uses their neighboring domains found in a species-specific Domain Co-occurrence Networks (DCNs) to infer the functions of the target protein, even though there may be no homology between the target protein and its neighboring domains. Our method combining predictions generated at all three levels participated in the 2011 CAFA experiment. In comparison with three base-line methods, our method not only substantially expanded the sensitivity/recall of function prediction by adding profile search and domain network at the top of traditional PSI-BLAST search, but increased the semantic similarity between predicted function terms and true ones according to the Gene Ontology. Another advantage is that our method can readily predict the functions of multi-domain proteins by decomposing a protein into individual domains and aggregating the function predictions of each domain on a Domain Co-occurrence Network.

Results and discussion
We blindly tested our method in the 2011 Critical Assessment of Function Annotations (CAFA) experiment. In total, CAFA released 48,298 protein targets whose functions were not known to predictors from all around the world to make prediction from Sept. 2010 to Jan., 2011. At the end, 436 of the targets whose functions were later known and deposited into the SwissProt database were used to evaluate the performances of the predictors. In order to gauge the advances in the field, CAFA released the predictions of three baseline methods. The Prior method used the frequency of GO terms in the SwissProt database to select 836 most frequent GO terms for each target as prediction. The BLAST method used BLAST [4] to search a target protein against groups of proteins, where proteins in each group shared a common GO term; and the maximum sequence identify of BLAST alignments with sequences in a group was used as confidence score to rank GO terms for the target protein. The GOtcha method [1] generated GOtcha I-Scores as the sum of the negative logarithm of the e-values resulted from the BLAST search and used them as confidence scores to select GO terms. During the CAFA experiment, our method submitted three predictions or models for each target, which were produced by three different ways of combining predictions of three levels. Predictor 1 mapped predictions of three levels into three intervals of confidence from high to low; predictor 2 weighed predictions of three levels differently from more to less in confidence score calculation; and predictor 3 simply used the frequency of GO terms of all PSI-BLAST hits as confidence scores to select GO terms. The details of these three predictors are described in the Method section. Table 1 lists the minimum, maximum, and average number of GO terms predicted by three baseline methods and our three predictors. The following sub-sections report and discuss the performances of our predictors along with the three baseline predictors using complementary evaluation measures.

Precision and recall of top n predictions
We evaluated these methods using precision and recall of GO terms on top n predictions ranked by prediction confidence scores, for each n in the range [1,20]. One caveat is that multiple GO terms having the same confidence score received the same average rank. For example, if the confidence scores of top three GO term A, B, C are 0.9, 0.8, 0.8, respectively, the rank of them will be 1, 2.5 (i.e. (2+3)/2), 2.5, respectively. The top n groups (GO terms in each group have the same confidence scores) of GO terms were used for evaluation, which makes the actual number of GO terms possibly higher than n. The precision and recall of a protein i were calculated as: Here, all of the top n predicted GO terms and the actual GO terms (determined by experimental methods) of protein i were propagated to the root of the Gene Ontology Directed Acyclic Graph (DAG). All the GO term nodes present in the paths of predicted GO terms toward the root were considered predicted GO terms; and all the GO term nodes existing in the paths of the actual GO terms toward the root were considered as true GO terms. The overlapping GO terms/nodes between predicted ones and true ones were considered correctly predicted nodes. For each n in [1,20], we calculated the precision and recall for each target protein and averaged them over 436 targets protein as the precision and recall on the data set. Thus, for each method, we got 20 precision-recall pairs to generate a precisionrecall curve. Figure 1 plots the precision-recall curves of six predictors. The plot shows that our methods tend to have higher recall and lower precision in comparison with the lower recall and higher precision of two baseline methods: Prior and GOtcha. For instance, our best performing predictor 1 can reach a recall value as high as~0.55, whereas the highest recall value of the baseline methods (i.e. Prior and GOtcha) is at~0.27. That the baseline methods, particularly the Prior method that selects most frequent GO terms in a general protein function database, can have a high precision but a low recall may be because these methods tend to predict more general GO terms in top n predictions that are closer to the root node of the Gene Ontology, but farther away from the Table 1 The minimum, maximum, and average number of predictions per target of our three predictors and the three baseline methods. It is also interesting to notice that the best recall value of our predictor 1 (~0.55) is higher than that of predictor 2 (~0.51), indicating that including radius-two domain neighbors in Domain Co-occurrence Network [16] may contribute to the increase of recall since predictor 1 used both radius-one and radius-two domain neighbors to make predictions whereas predictor 2 only used radiusone neighbors (see the Method section for details). That the recall of predictors 1 and 2 are much higher than that of predictor 3 (0.41) demonstrates that profile-profile alignment (HHSearch [34]) and DCN can substantially increase the sensitivity of protein function prediction at the top of the profile-sequence search methods such as PSI-BLAST (e.g. predictor 3).

Precision and recall under a sliding threshold on confidence scores
We also calculated precisions and recalls according to a sliding threshold scheme, in which only the predictions with confidence scores higher than a threshold value t (0 < = t < = 1) were selected for evaluation. The predicted and actual GO terms were propagated to the root in the GO DAG as described in the sub-section above. At each threshold, we calculated precision and recall for each protein and used the average precision and recall on all 436 target proteins as the estimated prediction and recall for the threshold. By using the thresholds evenly distributed in the range [0, 1] at step size 0.01, we calculated a series of precision-recall pairs for each of six predictors. Figure 2 shows the precision-recall curves of our three predictors and three baseline methods. Because all the predictions with a confidence score above a threshold other than just top 20 predictions are selected for evaluation, all the predictors in Figure 2 can reach a higher recall than those in Figure 1. Particularly, the Prior method can yield the highest recall (0.82) among all predictors because it constantly predicts 836 GO terms for each target protein, which is several times more than all other predictors ( Table 1). The other two baseline methods (BLAST and GOtcha) that predict more than twice as many GO terms as our methods have the maximum recall values 0.55 and 0.5 respectively, which are lower than the ones of our three predictors. It is worth noting that our predictor 1 that predicted~73 GO terms on average delivered the second highest recall~0.68. When recall is higher than~0.31, our three predictors have higher precision than all three baseline methods at the same recall, while predictor 1 performed mostly better than or occasionally comparable to predictors 2 and 3. That predictor 1 mostly performed better than predictor 2 suggests that the sequential combination of three levels of predictions generated by PSI-BLAST, HHSearch Figure 2 Precision and recall of our three predictors and three baseline methods when considering predictions with confidence score above a threshold t, 0 < = t < = 1. A number of threshold values evenly distributed in the range [0, 1] at step size 0.01 were used to calculated precisions and recalls. and DCN is more effective than the weighted combination of these predictions. Another interesting observation is that predictor 3 that simply pooled all PSI-BLAST hits to generate GO term predictions performed better than the baseline BLAST and GOtcha methods throughout the entire recall range. It also worked better than the baseline Prior method when recall is >~0.15, whereas the latter yielded better precision than all other methods when the recall is low (<~0.15). The better performance of the Prior method in the low recall range may be explained by that its highly common GO term predictions were largely correct, but too far away from the specific GO function of target proteins. Thus its highest precision (~0.85) may serve as an upper limit that current function prediction methods can aim to achieve. Table 2 shows the break-even values of our predictors and the three baseline methods when precision equals to recall. According to this criteria, our methods performed better than the baseline methods, while predictor 1 yielded the highest break-even value 0.306.
In order to further analyze the amount of contributions made by profile-sequence alignment (PSI-BLAST), profile-profile alignment (HHSearch), and domain cooccurrence networks (DCN), we plotted a precisionrecall curve of predictor 1 in Figure 3 to show how precision and recall changes, when progressively considering predictions resulted from PSI-BLAST search at level 1, from both PSI-BLAST and HHSearch searches at levels 1 and 2, and from all three methods at levels 1, 2 and 3. Figure 3 shows that the profile-profile alignment (HHSearch) extended the recall of profile-sequence alignment (PSI-BLAST) from 0.57 to 0.64, and DCN further increased the recall to 0.69. The results demonstrate that three levels of predictions are complementary and can be combined effectively to increase the sensitivity of protein function prediction. Particularly, the DCN method may contribute valuable function predictions when all homology search methods fail to find useful hits, even though the prediction precision in this ab initio situation may be low.

Evaluations by semantic similarity
In addition to precision and recall measures based on the exact match of GO terms, we evaluated these predictors in terms of semantic similarity between true GO terms and predicted GO terms. For two GO terms g 1 and g 2 , we obtained their paths r 1 and r 2 to the root of the GO DAG, and calculated the similarity score between g 1 and g 2 as: where θ(r 1 ) and θ(r 2 ) denotes the number of GO terms of paths r 1 and r 2 , respectively. The numerator is the number of common GO terms shared by paths r 1 and r 2 . For a target protein, every predicted GO term was compared with each of the true GO terms to calculate similarity scores; and the highest score was considered as the similarity score between a specific predicted GO term and the actual GO terms. Averaging the similarities over all predicted GO terms of a target generated the similarity score for the target. The average similarity scores of all the target protein were used as the prediction similarity score of a predictor. We computed the similarity scores of all predictors for top 1 st , 2 nd , ..., 20 th ranked, predicted GO terms respectively. It is worth noting that, because the GO terms that have the same confidence scores get the same rank, the set of top 1 st .. n th ranked GO terms may actually contain more than n GO terms. The average similarity scores of these predictors were plotted in Figure 4. According to this measure, all our three predictors performed better than three baseline methods. Predictor-3 had higher scores than predictor-1 and predictor-2. It is largely because the latter two more often predicted GO terms with the same confidence scores than the former, resulting in more 1 st .. n th ranked GO terms selected for evaluation than the former. In addition to the average similarity score for each target, we also calculated the best similarity score of a targetthe highest similarity score among all GO term predictions for the target and averaged the best similarity scores over all the targets for each predictor. Figure 5 shows the best similarity scores of six predictors for top 1-20 predictions, which also demonstrates the better performances of our three predictors compared with the three baseline methods. Predict-1 and predictor-2 performed better than predictor-3 according to this measure.

An example illustrating the effectiveness of Domain Co-Occurrence Networks for protein function prediction
We chose an example to illustrate the effectiveness of the DCN function prediction component when both profilesequence alignment (PSI-BLAST [4]) and profile-profile alignment (HHSearch [34]) cannot make precise Table 2 The break-even values between precision and recall (i.e., when precision = recall) of the six predictors. predictions. Figure 6 shows how functions were predicted by using Domain Co-occurrence Network (DCN) for target T30248 -a multi-domain protein in Mus musculus (house mouse). Figure 6 (A) and 6(B) illustrate the tertiary structure of the protein predicted by MULTICOM [35] and electrostatic potentials (blue: positive charged; red: negative charged), calculated and visualized by Deep-View (http://spdbv.vital-it.ch/). In order to make function prediction, the DCN method executed PSI-BLAST to search the target protein against our pre-built protein sequence database containing the proteome of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, 15 plants, and 398 bacteria species (detailed organism names can be found at [16]), and identified the most significant homologous hit -a H. sapiens protein (Swiss-Prot ID P68543) with PSI-BLAST e-value 3e-95 and score 348. Then it utilized the DCN of H. sapiens (human) (Figure 6 (C)) to make prediction as follows. Firstly it used the profile-profile alignment tool HHSearch [34] to search the target protein against PfamA [33] database, which detected eight domain families with homologous probability > = 0. , directly inferring precise function of the target from these domains was not possible. However, the DCN method was able to use the annotated GO terms of the neighboring domains of these four domains to make function prediction for the target as shown in Figure 6 (D), where red nodes denotes the central domains detected for the target and yellow nodes represents their radius-one neighboring domains. The GO terms of the neighboring domains were aggregated and ranked based on frequency. The top ranked GO terms were used as predictions. The frequencies were used as the confidence scores of the predicted GO terms. Similarly, radius-two neighboring domains (not illustrated) were applied to generate predictions in a similar way. The DNC method predicted a GO term GO:0006464 (protein modification process) using radius-one neighboring domains and predicted GO:0043687 (post-translational protein modification), GO: 0051246 (regulation of protein metabolic process) and GO: 0008152 (metabolic process) using radius-two neighboring domains, which were highly related to the two real GO terms of T30248 (Swiss-Prot ID Q99KJ0.1) -GO:0031396 (regulation of protein ubiquitination) and GO:0042176 (regulation of protein catabolic process). Moreover, the above-mentioned predicted GO terms all existed in the propagated paths from the actual GO terms to the root in the Gene Ontology Directed Acyclic Graph (DAG). Particularly, the true GO term GO:0042176 (regulation of protein catabolic process) and the predicted GO term GO:0051246 (regulation of protein metabolic process) had a high semantic similarity score of 0.730 calculated by the tool G-SESAME [36], where 1 indicates exactly the same and 0 completely different. Because the homology-based method did not produce any predictions for T30248, this example demonstrates that the DCN of a species, which may be different from the species of a target protein, can be used to make de novo function prediction for the target from scratch. It also shows that the DCN method can readily decompose a multi-domain protein into multiple domains and aggregate function predictions of individual domains as the prediction for the whole protein.

Conclusions
We designed and developed an automated three-level method to predict protein functions integrating profilesequence homology search, profile-profile homology search and domain co-occurrence networks. We blindly tested different ways of combining predictions generated at the three levels on a large number of protein targets in the 2011 Critical Assessment of Function Annotation. The results showed that our methods integrating complementary predictions performed mostly better than three standard baseline methods. Our experiments also clearly demonstrated that using profile-profile alignment (HHSearch) and domain co-occurrence networks not only increases the sensitivity of protein function prediction at top of the traditional BLAST-and PSI-BLASTbased homology search methods, but also make it possible to make ab initio predictions and handle multidomain proteins readily.

Methods
We constructed three protein function predictors (predictor-1, precictor-2, predictor-3), whose predictions were submitted to CAFA as models 1, 2, and 3. The first two predictors combined function predictions derived from profile-sequence PSI-BLAST search, profile-profile HHSearch search, and domain co-occurrence networks using different strategies. The third one used only predictions derived from PSI-BLAST search at the default threshold (i.e. 10).

Predictor-1
The three-level method integrates profile-sequence alignment, profile-profile alignment, and Domain Co-occurrence Networks (DCNs) as shown in Figure 7. At the first level, PSI-BLAST was executed to search against Swiss-Prot [32]. The protein hits with e-value < = 0.01 were chosen and ranked by e-value. Only the GO terms of the top one hit were included as predictions, whose confidence score S was calculated as:   example, when e-value equals to 0, S is set to 1.0. So S is in the range [0.6, 1]. The second level applied a profileprofile alignment tool HHSearch [34] to detect domains of a target protein. HHSearch generated a hidden Markov model (HMM) for a target protein, which was aligned with the HMM of each Pfam domain family, resulting in a probability score in the range of [0, 100] for each hit. Only the hits with probability score > = 80 were kept, and their GO terms were retrieved from the PfamA database as predictions. The confidence scores of the predicted GO terms were assigned as: where p is the HHSearch probability score from 0 to 100. Thus, the confidence score of GO terms predicted from HHSearch hits is in the range [0.3, 0.6].
The target proteins without predictions made from the first two levels were considered hard cases. For hard cases, we used PSI-BLAST with the default threshold (i.e. 10) to search against both Swiss-Prot and the Gene Ontology database, and additionally applied the DCNbased "aggregated neighbor-counting" method [16] to make predictions. The DCN-based "aggregated neighborcounting" method ran PSI-BLAST to search a target protein against the pre-built proteome databases to find its most closely related organism. Our database contains the whole-genome protein sequences of H. sapiens, S. cerevisiae, C. elegans, D. melanogaster, 15 plant species, and 398 single-chromosome prokaryotic organisms (detailed species names can be found at [16]). The organism whose protein was most similar to the target protein according to the PSI-BLAST search's e-value was considered the most closely related species for the target protein. The pre-constructed DCN of this species was used to make functional predictions for the target. The domain co-occurrence network (DCN) of a species was constructed by the following two steps: (1) each protein sequence of the entire genome was searched against Pfam database [33] using a profile-sequence alignment Figure 7 The architecture of the three-level prediction methodology used in our predictor-1.
tool PfamScan to detect the occurrence of any protein domain families in the protein; and every detected domain family was represented as a node in DCN; (2) if two protein domain families co-occurred in one protein, one edge was drawn between them. In this way, a DCN of a species was created, in which a node represents a protein domain family and an edge the co-occurrence relationship between two domain families. Figure 6(C) illustrates the main DCN of H. sapiens.
For the domains of the target protein detected by HHSearch at level two, the predictor gathered the GO terms of their radius-one neighboring domains in this DCN as predictions, whose confidence scores was proportional to their occurrence frequencies. If no GO terms could be found in radius-one neighboring domains, it extended to the search to radius-two neighboring domains and made predictions according to the same procedure. The confidence of a predicted GO term was calculated as where f is the occurrence-frequency of the GO termthe number of the neighboring domains that have the GO term function divided by the total number of occurrences of all predicted GO terms. Thus, the confidence score of DCN-based predictions is in range (0, 0.3]. The ranges of confidence scores assigned to three levels were chosen according to a benchmarking on 100 proteins randomly selected from Gene Ontology before making predictions for the CAFA targets. We compared the performances of the predictions at each level and set their ranges of confidence scores based on their prediction accuracies on the benchmark proteins from high to low.

Predictor-2
Predictor-2 used PSI-BLAST to search a target protein against Swiss-Prot with e-value threshold 0.01, applied HHSearch to search the target against PfamA, and employed the DCN-based "aggregated neighbor-counting" method on radius-one neighbors, in order to gather GO terms at all the three levels. The same probability score threshold (> = 80) of HHSearch was used as in Predictor-1. We assigned weights 4, 2, and 1 to a GO term generated by PSI-BLAST, HHSearch, and DCN-based "aggregated neighbor-counting" method, respectively. The weighted frequency of each GO term was calculated and normalized. The normalized score was used as the confidence score of an individual GO term. For proteins without any predictions generated using these three methods, an additional PSI-BLAST search of the protein against Gene Ontology and Swiss-Prot with the default e-value threshold (i.e. 10) was executed in order to gather more hits if possible.

Predictor 3
Only a PSI-BLAST search against Swiss-Prot with the default threshold (i.e. 10) was performed in predictor 3. All of the PSI-BLAST hits were included to make prediction. The occurrence frequency of a GO term among all hits was used as its confidence score.