Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks

Background Protein function prediction is an important problem in the post-genomic era. Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data. Thus, using PPI data to functionally annotate proteins has been extensively studied. However, most existing network-based approaches do not work well when annotation and interaction information is inadequate in the networks. Results In this paper, we proposed a new method that combines PPI information and protein sequence information to boost the prediction performance based on collective classification. Our method divides function prediction into two phases: First, the original PPI network is enriched by adding a number of edges that are inferred from protein sequence information. We call the added edges implicit edges, and the existing ones explicit edges correspondingly. Second, a collective classification algorithm is employed on the new network to predict protein function. Conclusions We conducted extensive experiments on two real, publicly available PPI datasets. Compared to four existing protein function prediction approaches, our method performs better in many situations, which shows that adding implicit edges can indeed improve the prediction performance. Furthermore, the experimental results also indicate that our method is significantly better than the compared approaches in sparsely-labeled networks, and it is robust to the change of the proportion of annotated proteins.


Background
The past decade has witnessed a revolution in highthroughput sequencing techniques, resulting in huge amounts of sequenced proteins. However, experimental determination of protein functions is not only expensive but also time-consuming. As a consequence, there is an increasing concern about using computational methods to predict protein functions. Though many efforts have been made in this regard, the functions of most proteins in fully sequenced genomes still remain unknown. This is true even for the six well-studied model species. Taking yeast as an example, approximately one-fourth of the proteins have no annotated functions [1]. Therefore, functional annotation of proteins is one of the fundamental issues in the post-genomic era.
The most common approach to computational prediction of protein functions is to use sequence or structure similarity to transfer functional information among proteins. According to a recent survey [2], homology-based transfer approaches can be further divided into two classes: sequence-based approaches and structure-based approaches. BLAST [3] is one of the most widely used sequence-based approaches, which assigns un-annotated proteins with the functions of their homologous proteins. Although sequence similarity is undoubtedly correlated to functional similarity, in many cases there is no need to treat a protein as a whole, This is because typically only the 100-300 amino acids in a functional protein domain perform their functions [4]. Therefore, a protein can be represented as several sequence (or structure) based signatures (motifs) that are associated with some particular functions. PROSITE [5] for example is a database of sequence motifs that is composed of manually selected sequence motifs. Structure-based approaches are based on the observation that protein structure is far more conserved than sequence [6], and thus structure is a useful indicator of function. FATCAT [7] and PAST [8] are the most popular databases composed of 3D protein structures. The reason for using structure motifs is analogous to that of sequence motifs, One example is PROCAT [9], a library of 3D enzyme structure motifs. However, sequence similarity does not necessary imply functional equivalence and thus homology-based transfer approaches can result in erroneous predictions, and the original erroneous annotations may be propagated and amplified in databases [10]. Furthermore, as the databases expand, the utility of the homology-based transfer approaches begins to break down. For example, it has been estimated that < 35% of all proteins could be annotated automatically when accepting error rates of ≤ 5%, while even allowing for error rates of > 40%, there is no annotation for > 30% of all proteins [11].
Recent advances in experimental biology have enabled the production of vast amounts of protein-protein interaction (PPI) data across human and most model species. These data are commonly represented as networks, where a node corresponds to a protein and an edge corresponds to an interaction between a pair of proteins. Thus, using PPI data to assign protein function has been extensively studied. Approaches based on PPI data assume that proteins with similar functions are topologically close in the network. In a review of the existing computational approaches based on PPI data for protein function prediction, Sharan et al. [1] distinguished two types of approaches: direct annotation schemes and module-assisted schemes.
Direct annotation schemes predict the functions of a protein from the known functions of its neighbors, representatives are neighborhood counting approaches [12][13][14][15], graph theoretic approaches [16][17][18] and Markov random field (MRF) approaches [19][20][21]. Majority and Indirect neighbors are two neighborhood counting approaches. Majority [12] is the simplest direct approach, it utilizes the biological hypothesis that interacting proteins probably have similar functions, it ranks each candidate function based on its occurrences in the immediate neighbors. Indirect neighbors [13] assumes that proteins interact with the same proteins may also have some similar functions. It exploits both indirect and immediate neighbors to rank each candidate function. Functional flow [18] is a graph theoretic approach, it simulates a discrete-time flow of functions from all proteins. At every time step, the function weight transferred along an edge is proportional to the edge's weight and the direction of transfer is determined by the functional gradient. Deng et al. [19] devised an MRF model in which the function of a protein is independent of all other proteins given the functions of its immediate neighbors. The parameters of the model are first estimated using quasi-likelihood method, and then Gibbs sampling is used for inferring the functions of unannotated proteins.
Instead of predicting functions for individual proteins, module-assisted schemes first identify modules of related proteins and then annotate each module based on the known functions of its members, examples include hierarchical clustering-based approaches [22,23] and graph clustering approaches [24][25][26][27]. A key problem of this kind of approaches is how to define the similarity between two proteins. Arnau et al. [23] used the shortest path between proteins as a distance measure and applied hierarchical clustering to detecting functional modules. Up to now, numerous graph-clustering algorithms have been applied to detecting functional modules, such as spectral clustering [24], edge-betweenness clustering [25], clique percolation [26] and overlapping clustering [27].
Additionally, Chua et al. [28] presented a simple framework for integrating a large amount of diverse information for protein function prediction. This framework integrated diverse information using simple weighting strategies and a local prediction method. Hu et al. [29] hybridized the PPI information and the biochemical/ physicochemical features of protein sequences to predict protein function. The prediction is carried out as follows: if the query protein has PPI information, the network-based method is applied; otherwise, the hybridproperty based method is employed.
However, most existing network-based approaches do not work well if there is not enough PPI information. In view of this, we proposed a new method that combines PPI information and protein sequence information to improve the prediction performance based on collective classification. Our method divided function prediction into two phases: First, the original PPI network is enriched by adding a number of edges that are computed based on protein sequence similarity. Second, based on the new network, a collective classification algorithm is employed to predict protein function. The main idea behind this method stems from the observation that existing network-based approaches ignore protein sequence information. Therefore, we increase the amount of useful information in the networks by adding a number of computed (or implicit) edges, which consequently improves the prediction performance.
We conducted experiments on S.cerevisiae and M. musculus functional annotation datasets. Compared to four existing protein function prediction methods, our method performs better in many situations, which shows that adding implicit edges can indeed improve the prediction performance. Furthermore, the experimental results also indicate that our method is significantly better than the compared methods in sparselylabeled networks, and it is robust to the change of the proportion of annotated proteins.

Notation and problem definition
Protein function prediction is a multi-label classification problem where we have a set of functions F = (F 1 , . . . , F m ) . Given a protein set, P = (P 1 , . . . , P n ) where the first l proteins are labeled as y 1 , ..., y l , each y i is a vector with y ij = 1 in case that the protein P i is associated with the j-th function F j , otherwise y ij = 0. Our goal is to predict the labels y 1+1 , ..., y n for the remaining unlabeled proteins P l+1 , ..., P n . In this study, we denote the PPI network as a finite undirected graph G = (V, E, W), with a vertex set V = L ∪ U where L corresponds to the set of annotated proteins and U corresponds to the set of un-annotated proteins. Each edge E i,j ε denotes an observed interaction between protein V i and V j and a weight w i,j W indicates the interaction confidence between V i and V j . Here, we employ collective classification to tackle this problem. In addition, we use both explicit edges that are extracted from PPI datasets and implicit edges that are computed from protein sequence information. In what follows, we present the method in detail.

Generating BLAST-inferred edges
As we pointed out above, most existing network-based approaches do not work well when there is not enough interaction information in the PPI networks. Considering this, here we propose a novel method that combines PPI information and protein sequence information to improve the prediction performance based on collective classification. The first step of our method is to enrich the original PPI network by adding a number of computed edges based on protein sequence similarity. Note that the similarity between two proteins is not a reliable proof that the two proteins interact, nevertheless, enriching PPI networks by adding a number of computed edges can increase the amount of useful information to the original PPI network and hence improve the prediction performance. In this paper, the basic local alignment search tool (BLAST) is employed to compute the similarity score between each pair of proteins.
For the protein V x , we define its sequence similarity scores with other proteins like this: where s x,i denotes the similarity score between protein V x and protein V i . We set s x,i = 0 if x = i, which means that we do not consider self-similarity. For each protein V x in the original network, we create k edges to the k vertices that have the highest similarity scores with V x , and use the similarity scores as the weights of these created edges in the enriched network. Thus, we have It is worth noting that there are two types of edges in the new network: BLAST-inferred edges (implicit edges) and explicit edges that are already there. Here, two questions need to be answered. One is how many edges be added for each protein, that is, how to set the value of parameter k, and another is how to combine the weights of these two types of edges with different semantics. We will answer the first question in the experimental evaluation section and the second question in the next subsection.

Gibbs sampling
The second step of our method is to employ the Gibbs sampling (GS) [30] based collective classification method o predict protein function based on the new network. GS is the most commonly used collective classification algorithm that aims at finding the best label estimate for each un-annotated vertex V x ∈ U by sampling each vertex label iteratively. GS based collective classification is divided into two phases: boot-strapping and iterative classification, its high-level pseudo-code is given in Algorithm 1. Detailed description on the algorithm is presented in the following subsections.
Algorithm 1 Gibbs sampling based collective classification for protein function prediction with implicit and explicit edges in PPI networks.
1: // bootstrapping 2: for each query protein V x do 3: compute the initial a → x using L ∩ N s x and L ∩ N w x 4: end for 5: // burn-in period 6: for i=1 to B do 7: for each query protein V x do 8: update a → x using current assignments to N w x , N w x 9: end for 10: end for 11: // sampling period 12: for i=1 to S do 13: for each query protein V x do 14: update a → x using current assignments to N w x , N w According to the observation that proteins with shorter distance to each other in the network are more likely to have similar functions, weighted voting is employed to predict an initial functional probability distribution for the query protein. Note that there are two types of annotated neighbors to vote: implicit neighbors (BLAST-inferred neighbors) and explicit neighbors. Thus, we introduce a combination parameter λ (0, 1) to control the tradeoff between these two types of neighbors.
Formally, for a query protein V x that has k implicit neighbors and N x explicit neighbors, we define the corresponding edge weights like this: Above, N s x and N w x are the vectors of implicit edges and explicit edges, respectively. Then, the probability of V x associated with the j-th function F j is computed like this: where Z s x and Z w x are the normalizers: The larger the value of P j x , the more likely protein V x is associated with the j-th function F j . Given a query protein V x , its initial functional probability distribution is denoted as an m-dimensional vector: Note that when predicting the functions of the query protein V x , we consider only its labeled neighbor proteins (either implicitly connected or explicitly connected). That is the reason why we use L ∩ N s x and L ∩ N w x in Algorithm 1 (Line 3), because unlabeled neighbor proteins can not be exploited in the bootstrapping phase. Codes corresponding to the Bootstrapping phase in Algorithm 1 are from Line 2 to Line 4.

Iterative classification
The iterative classification process is divided to the following two periods: the burn-in period and the sampling period. The burn-in period consists of a fixed number B of iterations where we update a → x using weighted voting. This period is implemented in Algorithm 1 from Line 6 to Line 10. The sampling period consists of S iterations. In each iteration, we not only update a → x but also maintain the count statistics as to how many times we have sampled the j-th function F j for protein V x . This period is implemented in Algorithm 1 from Line 12 to Line 20.
It is worth noting that most proteins in vivo often perform more than one function, thus, protein function prediction is a multi-label classification problem. For the query protein V x , its most likely function can be computed as follows: where b 1 x represents the argument j that maximizes the value of P j x , which is regarded as the 1st-rank result. Accordingly, the second most likely function b 2 x is regarded as the 2nd-rank result, and the third most likely function b 3 x is regarded as the 3rd-rank result, and so forth. In rare case when more than one element P j x in Eq. (7) has the same score, their ranks will be assigned randomly. So we can create an m-dimensional vector b → xi for the query protein V x to record its ranking result in the i-th iteration as follows: When the threshold number S of iterations is reached, we can get a matrix M x with S rows and m columns for the query protein V x : In the first column of the matrix M x , the most frequently sampled function is denoted by c 1 x , called the first rank predicted function. Accordingly, in the second column of the matrix M x , the most frequently sampled function (excluding c 1 x ) is denoted by c 2 x , called the second rank predicted function. In the third column of the matrix M x , the most frequently sampled function (excluding both c 1 x and c 2 x ) is denoted by c 3 x , called the third rank predicted function, and so forth. Finally, we get an m-dimensional vector c → x for the query protein V x : Results and discussion

Interaction and annotation data
We evaluated the performance of our approach with two PPI datasets. The firs dataset (denoted as Dataset A) used in this study is based on Gene Ontology (GO) annotation scheme [31]. GO annotations are arranged in a hierarchical order, and consist of three basic GO namespaces: molecular function, biological process and cellular component. There are 19655 GO terms that constitute 15 levels of annotations, and the higher level terms are more generic while the lower level terms are more specific. In this setting, some vague terms such as "GO:0005554 molecular function unknown" and annotations with evidence code "IEA" (Inferred from Electronic Annotation) were excluded. Furthermore, to avoid the bias problem in the annotations, we applied the concept of informative Functional Class [13] to selectively identify GO terms for validation. An informative GO is referred as the one that 1) is annotated by at least 30 proteins; and 2) has no child terms annotated by at least 30 proteins.  [32]. BioGRID is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans, it currently holds 347966 interactions (170162 genetic, 177804 physical) obtained from both high-throughput data sets and individual focused studies, which were derived from over 23000 publications in the literature.
In this study, we constructed one protein interaction network for each GO namespace using only physical interactions. Therefore, there are totally six PPI networks (three for S.cerevisiae and the other three for M. musculus) in Dataset A. In these networks, a node corresponds to a protein and a un-weighted edge corresponds to an interaction between two proteins. Each node was assigned with at least one Go term, and proteins without interaction data or sequence information were deleted. The details for these networks are shown in Table 1.
The second dataset (denoted as Dataset B) used in this study is based on Functional Catalogue (Fun-Cat) annotation scheme [33] taken from Munich Information Center for Protein Sequences (MIPS) (http://www.helmholtzmuenchen.de/en/ibis). Fun-Cat is organized as a hierarchically structured annotation system and consists of 28 main functional categories. FunCat annotations for S.cerevisiae were downloaded from Comprehensive Yeast Genome Database (CYGD) [34]. CYGD is a frequently used public resource for yeast related information. There are totally 6168 proteins in the dataset, of which 4774 were annotated. These proteins belong to 17 functional categories. The number of proteins for each functional category is listed in Table 2. FunCat annotations for M.musculus were downloaded from Mouse functional Genome Database (MfunGD) [35]. MfunGD provides a resource for annotated mouse proteins and comprises 17643 annotated proteins. These annotated proteins belong to 24 functional categories, which are also shown in Table 2.
Protein interactions of Dataset B were downloaded from STRING database [36], which is an integrated protein interaction database containing known and predicted protein interactions. These interactions were mainly derived from four data sources: genomic context, high-throughput experiments, conserved co-expression and previous knowledge. The most recent version of STRING covers about 5.2 million proteins from 1133 organisms. For Dataset B, we constructed two PPI networks (one for S.cerevisiae and another for M.musculus), proteins without interaction data or sequence information were deleted. As a result, in the S.cerevisiae interaction network, there are totally 388846 distinct interactions among 4687 proteins, and in the M.musculus interaction network there are 14269 proteins and 832124 interactions. Additionally, protein sequence information for Dataset A and Dataset B were also downloaded from the STRING database.

Competing approaches
We compared our method with a sequence similarity based approach (termed BLAST-mined) that does not take the PPI network into account. The BLAST-mined approach was performed in two steps. First, BLAST was adopted to compute similarity score between each pair of proteins. Second, we employed the kNN classifier to predict the functions of un-annotated proteins. We also conducted comparison with a graph based method: Functional flow, as well as two neighbor counting methods: Majority and Indirect neighbors. Functional flow [18] treats each annotated protein as the source of a functional flow. After simulating the spread over time of this functional flow through the network, each un-annotated protein is assigned a score for having the function based on the amount of flow it received during the simulation. Majority [12] makes use of the observation that interacting proteins are more likely to have similar functions, it determines the functions of a protein based on the known functions of proteins lying in its immediate neighborhood. The principal advantages of the Majority are its simplicity and effectiveness. Indirect neighbors [13] exploits both direct and indirect function associations. It computes scores based on level 1 and level 2 interaction partners of a protein.

Experimental setup
For traditional classification problems, the standard evaluation criterion is accuracy. However, in this paper we can not simply determine whether a prediction is correct or wrong because of the partially correct phenomenon in multi-label classification problems [37]. Therefore, as in [38] we adopted the widely-used performance measure, the ratio of TP/FP, which depicts the relative magnitude between the number of true positives and the number of false positives. In this setup, we define the i-th rank overall true positive (TP) as the number of proteins whose i-th rank predicted function c i x is one of the true functions of the protein V x and the i-th rank overall false positive (FP) as the number of proteins whose i-th rank predicted function c i x is not one of the true functions of the protein V x . To evaluate the prediction performance of our method, leave-one-out cross validation was used to compare the performance of our method with that of the competing approaches. The idea behind leave-oneout cross validation is simply to treat each annotated protein as un-annotated in turn, then run the algorithm and compare the predicted functions to the known functions of the protein. It is worth noting that the iterative classification step is omitted in leave-one-out validation, this is because the label vector of the query protein is never updated after bootstrapping. However, in real PPI networks, there are a significant number of un-annotated proteins, thus leave-one-out cross validation seems impracticable in reality. Therefore, we also compared the performance of our method with that of the competing approaches in sparsely-labeled networks. In our implementation, the proportion of annotated proteins was varied from 10% to 90%, we ran 10 experiments for each given proportion of annotated proteins and reported the average performance. Moreover, the burn-in period and the sampling period were set to contain 20 and 100 iterations respectively. Effect of parameters l and k We studied the effect of two parameters used in our study. The first one is the combination parameter λ that controls the tradeoff between implicit (BLAST-inferred) edges and explicit edges. We varied λ from 0.1 to 0.9 and compared the prediction performance. Table 3 and Table 4 lists the performance of different λ in dataset A, and Table 5 details the performance of different λ in dataset B. The experimental results show that the prediction performance is not definitely sensitive to the value of λ, as long as it is not chosen extremely small or extremely large. Thus, in our following experiments, the value of λ was set to 0.3 for both datasets. Next, we examined the effect of the number of BLAST-inferred edges k. We varied k from 1 to 15 for BLAST and compared the prediction performance. Table 6 and Table 7 gives the performance of different k in dataset A, and Table 8 shows the performance of different k in dataset B. We can see that when using k=5 for both dataset A and dataset B the proposed method performs best. This is because adding BLAST-inferred edges with low sequence similarity (when k is large) may produce false predictions. Hence, in our rest experiments, the value of k was set to 5.

Leave-one-out cross validation experiments
To evaluate the prediction performance of our method, leave-one-out cross validation was used to compare the performance of our method with that of the competing approaches. For S.cerevisiae in Dataset A, there are three PPI networks (corresponding to the three GO namespaces). The average number of functions that each protein has in these networks is 1.24, 1.33 and 1.64, respectively. So we considered only the top 2 (⌊1.24⌋ + 1, ⌊1.33⌋ + 1 and ⌊1.64⌋ + 1 are all 2) predictions. Figure 1 shows the performance comparison of our approach to the competing methods for the top-2 predictions. For M. musculus in Dataset A, as the average    Figure 2. In Dataset B, there are two PPI networks (corresponding to S.cerevisiae and M.musculus). The average number of functions that each protein has in these networks is 2.13 and 2.58, so we considered only the top 3 ranks. The results are shown in Figure 3. It can be seen from Figure 1, 2, 3 that our method can obtain more accurate predictions than the four competing approaches, due to the combination of implicit (BLAST-inferred) and explicit edges. These results indicate that enriching PPI networks by adding a number of BLAST-inferred edges can indeed improve prediction performance. The experimental results also validate that the network-based approaches outperform the sequence similarity based approach in most cases. This is because the sequence similarity based approach is    completely based on protein sequence information, and thus perform the worst. In addition, it is worth noting that higher rank functions are predicted better than lower ones, implying that the protein functions are well ranked by our approach.

Performance in sparsely-labeled networks
Here we compared the performance of our method with that of the competing approaches in sparsely-labeled networks. In our implementation, the proportion of annotated proteins in PPI networks was varied from 10% to 90%, with which we predicted the functions of the remaining (un-annotated) proteins. We ran 10 experiments for each given proportion of annotated proteins and evaluated the average performance. Figure 4, 5, 6 and Figure 7, 8, 9 present the results over S.cerevisiae and M.musculus data in Dataset A, and Figure 10 and Figure 11 show the results over the S.cerevisiae and M.musculus data in Dataset B. These results clearly show that our method performs better than the four compared approaches in most cases. The experimental results also validate that generally for all approaches the prediction performance gets better as the number of annotated proteins in the network increases. However, very interestingly we noticed that in Figure 9, the prediction performance of Function flow and Indirect neighbors slightly degrade as the number of annotated proteins in the network increases. And in Figure 11, when the ratio of annotated proteins increases up to 50%, prediction performance of our approach (for the 2nd and 3rd rank functions) turns slightly down. This may be due to the overfitting effect or annotation quality problem. Specifically, when the proportion of annotated proteins is 90%, the predicted functions of the unannotated proteins are mainly based on the immediate neighbors, annotation quality will considerably impact the prediction performance. However, when the ratio of annotated proteins is 10%, the predicted functions of the un-annotated proteins are mainly based on the whole network, which thus alleviates the impact of annotation quality.

Conclusion
In this paper, we proposed a new method to protein function prediction that combines PPI information and protein sequence information to improve prediction performance. It first reconstructs PPI networks by adding a number of BLAST-inferred implicit edges, and then  applies the collective classification method to predicting protein functions based on the new networks. The key idea of our work is to enrich the PPI information of PPI networks by adding a number of computed edges, which subsequently improves the prediction performance. We carried out experiments on S.cerevisiae and M.musculus functional annotation datasets. The experimental results demonstrate that our method outperforms the existing approaches across a series of label situations, especially in sparsely-labeled networks where the existing approaches do not work well due to PPI information inadequacy. Experimental results also validate the      robustness of the proposed approach to the number of labeled proteins in PPI networks.
In this paper, we used a very simple scheme (BLAST alignment) to infer implicit edges. Actually, there are some other methods that can be used to mine useful implicit edges, such as random walk. Random walk exploits both local and global network information, should be able to discover more useful hidden edges. We will explore this direction in the future.