Prior knowledge based mining functional modules from Yeast PPI networks with gene ontology

Background In the literature, there are fruitful algorithmic approaches for identification functional modules in protein-protein interactions (PPI) networks. Because of accumulation of large-scale interaction data on multiple organisms and non-recording interaction data in the existing PPI database, it is still emergent to design novel computational techniques that can be able to correctly and scalably analyze interaction data sets. Indeed there are a number of large scale biological data sets providing indirect evidence for protein-protein interaction relationships. Results The main aim of this paper is to present a prior knowledge based mining strategy to identify functional modules from PPI networks with the aid of Gene Ontology. Higher similarity value in Gene Ontology means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. We study (i) to encode the functional pairs into the existing PPI networks; and (ii) to use these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms. Topology-based modularity metric and complex annotation in MIPs will be used to evaluate the identified functional modules by these two approaches. Conclusions The experimental results on Yeast PPI networks and GO have shown that the prior knowledge based learning methods perform better than the existing algorithms.


Background
Protein-protein interactions give a fundament knowledge of the biological process within a cell. Such interactions are helpful for deciphering the molecular mechanisms underlying given biological functions. Usually, the connections between proteins can be represented on a graph in which the nodes corresponding to proteins and the edges corresponding to the interactions. There are many ways to identify protein-protein interactions, for instance, according to the proteins similarity calculated based on gene expression profile, biomedical literature, and etc, see [1,2]. In order to further investigate the topological properties and functional organizations of protein networks in cells, the discovery of complex formation (also called as functional module) from PPI networks becomes a major research topic in systems biology [3,4].

Related work
Most previous methods [5][6][7][8][9][10][11][12][13][14][15][16] for automatic complex identification or related functional module detection have employed the unsupervised graph clustering techniques and try to discover similarly or densely connected subgraphs of nodes, e.g., Newman-Girvan method (NG) [8]. Mason and Verwoerd [17] provided an overview of recent and traditional approaches to the problem of identifying community structure in biological networks. Brohee and Heldan [18] made a comparative assessment for protein-protein interaction networks of four clustering algorithms: Markov clustering (MCL), restricted neighborhood search clustering (RNSC), super paramagnetic clustering (SPC), and molecular complex detection (MCODE). They found that MCL and RNSC were more robust to identify community structure in graph alterations than the other two algorithms.
Qi et al. [15] summarized the existing complex identification methods and divided them into five categories: graph segmentation, overlapping clustering, new similarity measures, conservation across species and spatial constraints analysis. In [7] and [8], the authors attempted to segment the PPI graph into disjoint highly connected clusters (complexes) based on the nodes' neighboring interactions cost or the iterative edgeremoval process. Since some proteins are part of multiple complexes or functional modules, a number of approaches [5,6,9,11] allow overlapping clusters. Scholtens et al. [10] applied a local modeling method to better estimate the protein complex membership from direct mass spectrometry complex data and Y2H binary interaction data. They claimed to achieve a finer level of detail than that obtained by using only the mass spectrometry data. In contrast to the divisive approach, the techniques proposed in [12] in an agglomerative fashion. Asur et al. [12] proposed an ensemble approach based on different hierarchial clustering algorithms for various vertices topological similarity metrics. They experimentally demonstrated the effectiveness of such ensemble clustering approach. There are some approaches based on analysis of the spectrum of the Laplacian or similarity matrix of the network described in [19]. Also, several works have established the interconnection between expression profile similarity and protein interactions [20,21]. Even though there are fruitful algorithmic approaches developed for dissection of interaction network, identifying functional modules correctly becomes a bottleneck in the current research. One reason is the accumulation of such large-scale interaction data on multiple organisms [1,22]. The other reason is that a large portion of protein-protein interactions are not recorded in the existing PPI database. Thus it is emergent to design novel computational techniques that will be able to correctly and scalably analyze interaction data sets. Meanwhile, besides PPI databases, there are a number of large scale biological data sets providing indirect evidence for protein-protein interaction relationships. For instance, the well-established microarray technologies provide a wealth of information on gene expression in various tissues and under diverse experimental conditions. Recently, researchers began to combine these existing biological resources to detect the previously unknown regulated modules in interaction networks. In [13,14,23,24], researchers integrated gene expression profiles and PPI networks to evaluate the weights of edges or noded in the graph. Supervised predicting functional modules based on the complex prior information and eight data sources [15].
Sohler et al. proposed a joint analysis concept for mining biological networks and expression data in [23]. By integrating much more sources (including expression profile, sequence information, PPI database and etc.), Dittrich et al. [14], Zheng et al. [24], and Ulitsky and Shamir [13] used aggregation statistic methods to re-weight the importance of each node and edge in the protein-protein interactions graph. Dittrich et al. searched the subnets in the graph with large scores as the functional modules. Zheng et al. used the general graph clustering algorithm (MCL) to mine the subgraphs. Ulitsky and Shamir proposed a statistical method to find the subnetworks by using the maximum likelihood approach.

Prior knowledge based learning
Most existing functional modules mining methods are unsupervised and they are based on the assumption that complexes for a clique in the interaction graph. However, many complexes with other topological structures, e.g., 'star' or 'spoke' model exit in real applications. Yeger-Lotem et al. [25] have raised this issue in the complex identification. Qi. et al. [15] firstly adopted supervised learning for protein complex identification, but their method needs abundant prior knowledge about complexes to build a probabilistic Bayesian network as a learning model. In real applications, there may be only a few prior sufficient knowledge to build a learning model. In this case, semi-supervised learning [26] is a good way to handle the learning problem with only a few prior knowledge but with a large of unlabeled information. Usually, the prior knowledge is represented in the form of pair-wise constraints, must-link and cannot-link constraints. A must-link constraint specifies that objects pair connected by the constraint belonging to the same group, while a cannot-link constraint specifies that objects pair connected by the constraint, cannot belong to the same cluster. The semi-supervised learning method has been applied to many applications, such as text classification [27] and computer-aided diagnosis [28].
Hartwell et al. [29] defined a functional module as a discrete entity whose function is separable from those of other modules. In other words, the proteins in the same module should have similar functions. Usually, PPI networks are good resources to find protein functional modules. In PPI networks, functional modules can be taken as special kinds of subgraphs, where each subgraph is consistent of a subset of nodes with a specific set of edges connecting among proteins. Based on these knowledge, the main aim of this paper is to present a prior knowledge based learning strategy to identify functional modules from PPI networks with the aid of Gene Ontology [30]. The Gene Ontology (GO) database holds functional gene annotation in a hierarchical structure that reflects the relationship between the biological terms and associated gene products. Thus, the functional relationship between two annotated gene products can be calculated as a similarity value [31,32] according to the GO hierarchical structure. Higher similarity value means that two gene products are more functionally related to each other, so it is better to group such gene products into one functional module. Our proposed semi-supervised learning strategy can use such gene product pairs as the prior information in two ways. One is to encode these functional pairs into the data representation, i.e., combining them with the existing PPI networks. The other approach is to use these functional pairs as pairwise constraints to supervise the existing functional module identification algorithms such as MCL and MCODE. Topology-based modularity metric [8] and complex annotation in MIPs [33] will be used to evaluate the identified functional modules by the proposed approaches. The experimental results on Yeast PPI networks and GO have shown that the prior knowledge based learning methods perform better than the existing algorithms.
The rest of paper is organized as follows. In Section 2, we describe the methods calculating protein similarity based on GO, and analyze the relationship between functional similarity values and existing PPI networks. Then we propose two prior knowledge based learning methods for identifying functional modules from PPI networks with the aid of GO. Experimental results on Yeast PPI networks and Gene Ontology were described and discussed in Section 3. In Section 4, we make a conclusion and showed our future work in brief.

Gene products functional similarity
Quantitative measure of functional similarity between gene products has been used in many applications, eg., to validate high-throughput protein interaction, help the development of new pathway modeling tools and clustering methods and enable the identification of functionally related gene products independent of homology [32,34]. GO [30] provides a good vocabulary system to estimate the functional relationship between gene products.
In GO structure, terms and their relationships are represented in the form of directed acyclic graphs. GObased semantic similarity measures can be classified into two categories. The first category defines semantic similarity based on GO structure. The similarity between two gene products is estimated by the number of nodes two gene products share divided by the total number of nodes in two graphs. The other category is based on information content that is defined as the frequency of each GO term occurring in an annotated data set [32]. This kind of methods assume that the more information two terms share indicated by the information content of terms, the more similar they are. In this paper, we adopted the second category to calculate the semantic similarity between gene products because it was proved to be more efficient than the former one [32].

Similarity measure
The relevance similarity Sim Rel [32] is calculated based on the probability of each term. The probability of a term is assumed to be its frequency freq(c) = ∑ {occur (c i )|c Ancestors(c i )}. in the annotations of a databases [35] Note that, for each ancestor a of a concept term c, we have freq(a) ≥ freq(c), because the set of descendants of a contains all the descendants of c. Then the probability of a term c is defined as p(c) = freq(c)|freq(root) where freq(root) is the frequency of the root term. The probability is calculated independently for each ontology. It is monotonically increasing as one moves up on a path from a leaf to the root.
Based on the probability p(c) of each term, the information content (i.e., the amount of information shared by terms) can be measured. Schlicker et al. [32] called this kind of information content as the Relevance similarity Sim Rel between a pair of terms. Sim Rel defines the similarity between two terms in two parts: S(c1, c2) is the set of common ancestors of terms c 1 and c 2 . The first part evaluates the ratio of the commonality of the terms and the information needed to fully describe the two terms. The second part records the position information of the two terms in the whole ontology. Schlicker et al. [32] have studied many methods to compute the semantic similarity between GO terms. It has been shown in [32] that the measure in (1) can consider as much information about the terms in GO as possible. Given a gene products list G = {g 1 , g 2 , …, g n }, the corresponding annotation terms for each gene can be identified in GO as AT i = {c i1 , c i2 ,…, c i|g i | ,} where |g i | is the total number of annotation terms in GO for gene product g i . Finally, the semantic similarity between two gene products g i and g j can be calculated with We used Bioconductor package SemSim in R project (http://bioconductor.org/packages/2.2/bioc/html/Sem-Sim.html) to calculate the Yeast gene products functional similarity.

Comparison of protein pairs and PPIs in terms of functional similarity
In order to investigate the functional similarity between each pair of gene products, we calculated Sim Rel between Yeast gene products which were downloaded from SGD [36]. Meanwhile, we check the distribution of functional similarity value in recorded Yeast PPIs downloaded from MIPs database [33]). There are total 6201 Yeast proteins are included in SGD, among them, 4554 Yeast proteins are covered in MIPs database with 12316 protein protein interactions, here we ignored the self loop and direction. Based on Sim Rel (Eq.(1)) method, the functional similarity value of each gene product pair ranges from 0 to 1. A functional similarity value close to one indicates high functional similarity whereas a value close to zero indicates low similarity. We analyzed the distribution of the functional similarity value in terms of all Yeast protein pairs and MIPs Yeast PPI networks. Because there are some genes (1616) are not annotated in GO, total 4585 (6201-1616) genes as the input of SemSim package, 4585 × 4584/2 = 10508820 similarity values for the corresponding gene pairs. 65 percent of similarity values has zero value while 19 percent cannot be identified because some genes are annotated by different GO (eg., GOBP, GOCC or GOMF), the remaining 16 percent of gene product pairs has similarity value great than zero, as shown in Figure 1. Again, most of gene product pairs have smaller similarity value, which means that they are not very similar in terms of function. For the gene product pairs with similarity values close to one, we will analyze them in detail and used them as our prior information for semi-supervised mining functional modules from PPI networks. Because some genes are not annotated by GO terms, their corresponding similarity values are zero or can not be identified, and only about 10 percent of PPIs have functional similarity value greater than zero, higher similarity value means that the corresponding proteins are more similar to each other with regarding to function.
The goal of functional modules identification from PPI networks is to determine a group of cellular components and their interactions attributed as specific biological functions [29]. In other words, the proteins in one module will be related to each other with regarding function. Usually, PPI networks used here are recorded in the existing PPI networks database, e.g, MIPs, where the major part of PPI information are extracted by manual annotation from the yeast literature. However, limited number of literatures make such PPI information insufficient. Therefore, identifying modules based on such insufficient PPI information (i.e., the existing PPI networks database) will not get a good performance. As shown in Figure 2a, ten proteins were listed with eight interactions which are recorded in the existing PPI networks database. In this case, it is difficult for any method to identify functional modules. Actually, the protein pairs without recorded interaction information may share common functions shown in Figure 2b marked as dash line. Once such functional information is added, it will be easy to derive the functional components for these ten proteins, finally two modules (left module with clique shape and right module with star shape) in this example are found and circled in Figure 2b.
A B C Figure 2 The effect of the functional similarity on module identification. Subfigure (a) shows the protein protein interactions, new protein relations (marked as dash line) were added to (a) because they have higher functional similarity, then a new protein network was built as show in (b). Both modules (circled) will be easily found in (b). Note that the five genes in the left cycle form a module only after the addition of the functional relations between genes, so does the right module.
From the example in Figure 2, we can see that the additional protein functional information is helpful for modules identification. In this paper, GO was used to obtain protein functional information indicated by the similarity value. In the next section, we will describe how to use such functional information to supervise mining functional modules from PPI networks.

Hierarchical clustering
In the view of computation, functional modules are special kind of subgraphs in PPI networks, and each subgraph is consistent of a subset of nodes with a specific set of edges connecting them. Meanwhile, hierarchical clustering method [37] is popularly used in networks clustering, thus, we use it to obtain the base clusters, i.e, functional modules. The implementations of this hierarchical clustering algorithm (agglomerative averagelinkage hierarchical algorithm (Agnes)) is available in R project, a cluster package http://cran.r-project.org/web/ packages/cluster/index.html). Agnes finds the clusters by initially assigning each object to its own cluster and then repeatedly merging pairs of clusters until either the desired number of clusters has been obtained or all of the objects have been merged into a single cluster leading to a complete agglomerative tree. The algorithm takes input as a similarity matrix. Next, we will employ two different similarity metrics, Clustering Coefficient (S cc ) [3] and Neighborhood (S nb ) [39] designed to capture various topological properties of scale-free networks because PPI networks are typical scale-free networks [40], and the corresponding clustering methods are called as HC cc and HC nb respectively. The first similarity metric is based on the Clustering coefficient, a popular metric from graph theory. The clustering coefficient [41] is a measure that represents the inter-connectivity of a vertex's neighbors. The clustering coefficient of a vertex v with degree k v can be defined as follows: where n v denotes the number of triangles that go through node v. Essentially, if the edge between two nodes contributes a lot to the clustering coefficients of the nodes, then they are considered similar and should be clustered together. Here the edge-clustering coefficient [3] is defined, in analogy with the usual node-clustering coefficient, as the number of triangles to which a given edge belongs, divided by the number of triangles that might potentially include it, given the degrees of the adjacent nodes. More formally, for the edge-connecting node i and node j, the edge-clustering coefficient is where z i,j is the number of triangles built on that edge, i.e., the number of common neighbors of node i and node j. min[(k i -1), (k j -1)] is the maximal possible number of triangles.
The idea behind the use of this metric is that edges connecting nodes in different communities are included in few or no triangles and tend to have small values of S cc (i, j). On the other hand, many triangles exit within clusters. Hence the coefficient S cc (i, j) is a measure of how intercommunication a link is. Note that S cc (i,j) will be zero when k i ≤ 1 or k j ≤ 1, also, when z i.j = 0.
The second metric we use is a Neighborhood-based similarity metric. We use the well-known Czekanowski-Dice distance metric [39] for this purpose. This metric uses the adjacency list of each node and favors nodes that have several common neighbors. Two nodes having no common neighbor will have the minimum similarity value (i.e. zero), while those interacting with exactly the same set of nodes will have the maximum value. The Neighborhood Based similarity metric is defined as: Here, Int(i) and Int(j) denote the adjacency list (including themselves) of proteins i and j, respectively, and Δ represents the symmetric difference between the sets. The value of this metric ranges from 0 to 1. Note that using this metric, nodes that do not interact with each other may have a non-zero similarity if they have common neighbors.

Newman-Girvan method
Newman and Girvan [8] first introduced edge-betweenness measure for clustering networks in sociology and ecology to obtain communities. This measure favors edges between communities and disfavors ones within communities. As pointed out by Holme et al [42] edgebetweenness uses properties calculated from the whole graph, allowing information from non-local features to be used in the clustering. Newman et al. introduced three different edge-betweenness measures, Shortestpath, Random-walk and Current-flow. In this paper, we consider the Shortest-path betweenness measure, which computes for each edge in the graph the fraction of shortest paths that pass through it. It is given by: where SP i,j is the number of shortest paths passing through edge e i,j and SP max is the maximum number of shortest paths passing through an edge in the graph.
EB(e i,j ) denotes the shortest-path edge betweenness value of the edge between nodes i and j. The edgebetweenness of an edge is the proportion of the shortest paths that edge belongs to. NG method can be taken as a divisive clustering method. It starts with one cluster of all vertices and recursively splits the most appropriate cluster at the edges with a large edge-betweenness value. The process continues until a stopping criterion (the criterion is usually the splitting steps s) is achieved.

MCL
The Markov Cluster algorithm (MCL) [38,43] simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. At each iteration, an inflation step is applied to enhance the contrast between regions of strong or weak flow in the graph. The process converges towards a partition of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. The value of the inflation parameter (r) strongly influences the number of clusters, i.e., a larger number of smaller clusters will be obtained with increasing of the inflation value (r). The core concept behind this method is that clusters of related nodes are densely interconnected and hence there should be more long paths between pairs of nodes belonging to the same cluster than between pairs of nodes belonging to distinct clusters. Subsequently, in [5,9,44] MCL was used to identify functionally related clusters in the protein interaction network of S. cerevisiae and Human. The experimental results indicated that the identified modules did represent functional clusters within the network. In this paper, we used the MCL package (http://www.micans.org/mcl/#source) to mine the functional modules from PPI networks.

MCODE
Molecular complex detection (MCODE) [6] is a method to detect densely connected regions. First it assigns a weight to each vertex corresponding to its local neighborhood density, i.e., with the core-clustering coefficient instead of the clustering coefficient for each vertex. Next, starting from the top-weighted vertex (seed vertex), it recursively moves outward, including in the cluster vertices whose weight is above a given threshold (Node Score Cutoff (t)). During the clustering process, new members are added only if their node score deviates from the cluster's seed node's score by less than the set cutoff threshold. Therefore, small cutoff values create much more smaller-size clusters and vice versa. The third stage is post-processing the above clustering results by increasing the size of the complex according to a given parameter (f), so that there can be overlap among the modules which have already been defined. In this paper, we used the MCODE plugin in Cytospace (http://baderlab.org/Software/MCODE) to mine the functional modules from PPI networks.

Prior knowledge based functional modules identification methods
Given the prior information (usually as pairwise constraints), semi-supervised learning approaches [26] can be implemented in two ways. One method is to restrict the solution space based on the pairwise constraints and then find the solution consistent with the constraints for other unlabeled data, such as probabilistic models [45], hierarchical clustering [46], spectral clustering [47], and etc.. The other method is employing the prior information to learn a distance metric which can be used to computer the pairwise similarity, so that the learning methods based on similarity matrix could be adopted, such as [48,49]. The key difficulty of semi-supervised learning is how to influence an learning algorithm with the prior information. An efficient and simple method to address this challenge is encoding the prior information into the data representation and then inputting the data into an existing learning algorithm [50]. The other way is using the prior information to supervised the learning process. In this section, we will give these two methods for prior knowledge based mining function modules from protein protein interaction networks.
(a) Prior information is combined into the original data set to form a new data set, and then all existing module identification algorithms can be applied on the new data set.
(b) Prior information is used by the proposed learning algorithms: -Semi-supervised hierarchical clustering (ssHC): prior knowledge is used to construct the transitive closure [46], and then set them as the initial clusters with the other points.
-Semi-supervised NG, Semi-supervised MCL and semi-supervised MCODE (ssNG, ssMCL and ssMCODE respectively): using NG, MCL or MCODE to group PPIs into a relatively large number of sub-modules, and then establish the connections between sub-modules according to the pairwise constraints.
The first approach (as indicated in Figure 3) encoding the prior information into the data representation is easily implemented. As shown in Figure 2, the protein functional pairs identified from GO can be added into the original PPI networks. Then, the existing functional modules identification methods (say, HC cc [3], HC nb [39], NG [8], MCL [5] and MCODE [6]) can be applied on the new combined PPI networks. Furthermore, we propose a novel prior knowledge based learning framework (as indicated in Figure 4 based on the pairwise constraints and can use any existing modules identification method, such as, HC cc , HC nb , NG, MCL, MCODE and etc.. For HC cc and HC nb , the prior information using protein functional pairs was used to construct the transitive closures [46]. The transitive closure is constructed based on the pairs of proteins with large functional similarity which gives the constraint degree between each pair of proteins. If the similarity is greater than a threshold (in this study, the best threshold is experimentally proved to be 0.999), we can say that there is a must-link constraint between P i ,P j . A set of constraints C makes up of all the must-link constraints. In this case, an undirected graph G, with one node for each point appearing in the constraints C, and an edge between two nodes if the corresponding points appear together in a must-link constraint. Then, the connected components of G give the sets in the transitive closure. For instance, in our example in last Section, there are 162 transitive closures on 654 Yeast proteins with 1488 protein functional pairs, where different closures may cover different numbers of proteins. The biggest closure has 21 proteins and 207 pairs, while the smallest closure has 2 proteins and 1 pair, as shown in Figure 5.
Such transitive closures and the other proteins which are not included in these closures will be set as the initial clusters of hierarchical clustering methods. Next, hierarchical clustering methods will merge a pair of clusters if they have a smallest distance or a largest similarity (here, clustering coefficient and neighborhood are used to measure the cluster similarity, and average-linkage method is adopted to merge the clusters). The merging procedure will end when the given number of clusters are obtained. These two semi-supervised hierarchical clustering methods (based on clustering coefficient and neighborhood) are denoted by ssHC cc and ssHC cc respectively.
For NG, MCL and MCODE, we adopted a two-stage semi-supervised learning approach with the aid of the prior information (i.e., protein functional pairs). In the first stage, the PPI networks are grouped into a relatively large number of sub-modules by relaxing the parameters of the existing algorithms. For instance, a large value for s, the number of splitting steps, will be given for NG algorithm, a large inflation r will be set in MCL method and a small cutoff (t) will be set in MCODE method.  Jing and Ng BMC Bioinformatics 2010, 11(Suppl 11):S3 http://www.biomedcentral.com/1471-2105-11-S11 relationship in the second stage. Finally, three semisupervised methods, ssNG, ssMCL and ssMCODE, are designed to mine the functional modules.

Results and discussion
In this section, we conducted a series of experiments to show how the protein functional pairs improve the performance of the existing modules identification methods (HC cc , HC nb , NG, MCL and MCODE) with our proposed prior knowledge based strategy. Yeast PPI networks in MIPs database [33] and GO [30] were used to test the presented methods. In MIPs database, the Yeast PPI networks covers total 12316 protein protein interactions between 4554 proteins, here we ignored the self loop and direction. The identification modules were evaluated by comparing them with the predefined biological complex annotations in MIPs database [33]. Meanwhile, modularity measure [8] was used to select the best parameters.

Evaluation metrics Modularity
Topology-based modularity metric, proposed by Newman and Girvan [8], can be used to evaluate cluster quality. This metric uses a k × k symmetric matrix of  Larger value modularity has, better performance the clustering method obtains.

Complex annotation measure
Since our goal is to find functional modules from PPI networks, it is necessary to test if the obtained modules correspond to known functional modules. This can be done by validating the modules with the predefined biological annotations from the MIPs database [33]. MIPs provides three domain annotation categories: function annotation, complex annotation and localization annotation. Because function annotation category of MIPs is based on GO and our proposed approach combined the functional information of GO, we used complex category to validate the different identification methods. Merely counting the proteins that share an annotation will be misleading since the underlying distribution of proteins among different annotations is not uniform.
Hence, the enrichment analysis are used to calculate the statistical and biological significance of a function module. The enrichment score [51] of a module is the minus log transformation on the geometric mean of pvalues (i.e.,log(pvalue)) from the enriched annotation terms association with one or more of the module members. The enrichment score essentially shows how the module is involved in the important annotation association with the module members. Probably, the higher the score the more important biology to the gene group.

How much prior information is perfect for learning
Based on the functional similarity analysis for all Yeast protein pairs in Section , there are total 14090 protein pairs with similarity greater than 0.99, as shown in Figure 6. Among them, 1488 protein pairs (noted as T1) have functional similarity value equal to 1, 3375 protein pairs (noted as T2) have functional similarity value great than and equal to 0.999, 4146 pairs (noted as T3) have similarity value great than and equal to 0.998, 6951 pairs (noted as T4) have similarity value great than and equal to 0.997 and etc. In our experiments, we tested the performance of prior knowledge based strategy with Figure 5 Transitive closure examples based on protein functional pairs. 162 transitive closures were obtained for Yeast proteins with 1488 protein functional pairs. The biggest closure has 21 proteins and 207 pairs. The smallest closure has 2 proteins and 1 pair.
the first four sets of protein pairs (named as MYP, MYPT1, MYPT2, MYPT3 and MYPT4 respectively). The first prior knowledge based strategy, combining the protein functional pairs into the PPI networks, was used to show how many protein functional pairs are suitable to be the prior information.
Five module identification methods were used in our experiments as described in last Section. For each approach, there are some parameters to be predefined, e.g., the inflation factor r for MCL, the number of splitting steps s in NG method, the node score cutoff t in MCODE, and the number of clusters k in HC cc and HC nb . In this case, the evaluation measure, modularity, was used to validate which parameter value makes the algorithm perform best. For reference, we listed the experimental results for two data sets MYP and MYPT2. The best modularity value for both data sets was obtained at the point 4200, i.e., the number of splitting steps on the modularity of NG method is 4200, as shown in Figure 7(e). Similarly, we can see MCL gets the best performance on modularity when the inflation r is equal to 1.4, as shown in Figure 7(a). MCODE got the best performance at the Node score cutoff t = 0.2, as shown in Figure 7(b). With the same way, the hierarchical clustering algorithms based on clustering coefficient similarity and neighborhood similarity got their best performance with complete linkage at k = 350. That is, the final number of clusters identified by hierarchical algorithms is 350, as shown in Figure 7(c) and Figure 7 (d). On the one hand, we experimentally show how the performance of different identification methods are improved by adding different numbers of protein functional pairs to the original protein interaction networks. For each method, five data sets were used, MYP, MYPT1, MYPT2, MYPT3 and MYPT4. They represent the original MIPs Yeast PPIs with 12316 PPIs, adding 1488 protein pairs with similarity = 1, adding 3375 protein pairs with similarity ≥ 0.999, adding 4146 protein pairs with similarity ≥ 0.998, and adding 6951 protein pairs with similarity ≥ 0.997 respectively. Here only the best results will be listed to compare the different algorithms on different data sets. We can consider the -log (p-value) of the significant modules identified by the corresponding method on one data set. Larger value shows the better performance. Because the smallest number of modules in all experiments is sixty, we showed the top sixty modules for all methods on all data sets. We can find that all algorithms got the best performance on data set MYPT2, adding 3375 protein pairs with similarity ≥ 0.999 on the original protein  networks. Even though the other three data sets, MYPT1, MYPT3 and MYPT4, do not make the identification method obtaining the best result, all of them increase the identification performance by comparing with the original PPI networks (MYP). For MYPT3 and MYPT4 adding 4146 protein pairs with similarity (≥ 0.998), and 6951 protein pairs with similarity (≥ 0.997) respectively, the identification results are better than on the original data set MYP, but less than on MYPT2, the reason is that more added protein pairs may add more noise. For MYPT1 adding 1488 protein pairs with similarity (= 1), the identification results are better than on MYP but less than on MYPT2, the reason is that MYPT1 may not have enough functional information. Figure 8 shows the detail results, where each sub-figure represents the experimental results of one identification method, and each line shows the -log(p-value) of the significant modules identified by the corresponding method on one data set.

Comparison of the identification performance
According to the above experimental results, we selected MYP and MYPT2 as the data sets to test the performance of our proposed prior knowledge based strategy. Three parts of experiments were conducted, one for the five existing identification methods (HC cc , HC nb , NG, MCL, and MCODE) on the original PPI networks (MYP), the other for these five algorithms on the combined PPI networks (MYPT2), another one for the five proposed semi-supervised ssHC cc , ssHC nb , ssNG, ssMCL, and ssMCODE on MYPT2. For the last part of experiments, the protein functional pairs would be taken as the prior information of the identification algorithms. Table 1 gives a comparison summarization on the first two parts of experimental results. For each algorithm, we listed the number of identified modules (# modules) which are annotated in MIPs complex annotation database, the averagelog(p-value) (noted as − − log( ) p value ) and the coverage which is the percentage of proteins which are covered by the annotated modules. From this table, we can see that − − log( ) p value on MYPT2 is better than on MYP. Also, we can see that MCODE has the best p-value but MCODE only covers a small part of proteins. MCL got the best result both on − − log( ) p value and coverage, which is also experimentally proven by Vlasblom and Wodak [44]. Newman-Girvan method got better − − log( ) p value than HC cc and HC nb , but NG includes less proteins than HC cc and HC nb . For two hierarchical clustering methods, it is obvious that clustering coefficient similarity method (HC cc ) is better than neighborhoodbased similarity (HC nb ). Table 2 gives the experimental results of our proposed strategy with the aid of prior information. Here, T2, the protein pairs with functional similarity (≥ 0.999) were used as the prior information. In ssMCL, ssMCODE and ssNG, T2 was used to merge the sub-modules identified by MCL, MCODE and NG respectively, where the initial sub-modules are identified by MCL at r = 5, by MCODE at t = 0.05 and by NG at s = 6500. For hierarchical clustering algorithm ssHC cc and ssHC nb , T2 was used to construct the initial clusters, and the parameter k was set to be 350. Meanwhile, we listed the identification results of the original methods (MCL, MCODE, NG, HC cc and HC nb ) at the given parameter value to compare with the proposed prior knowledge based methods. Obviously, the proposed strategy obtained better performance in terms of both coverage and complex annotation p-value. Even comparing with the best performance of the original methods on MYP in Table 1, our proposed strategy have a comparative performance.
In order to investigate the identified modules, we listed the top ten functional modules identified by the original methods (MCL, MCODE, NG, HC cc and HC nb ) and our proposed prior knowledge based strategy with the first method (i.e., encoding the prior information (T2) into the data representation) in Table 3, 4, 5, 6, 7. For each identified module, we can show the number of proteins it includes (Size), number of total annotated proteins by MIPs complex database (# Annotated), the corresponding complex ID in MIPs database (Com-plexID) and the number of genes both in the complex and current module (Hits). We can find that our proposed algorithm can identify functional modules with biological meaning. According to Tables 3-7, we find that there are more common modules detected by MCL, NG and HC cc on two data sets MYP and MYPT2 than those detected by MCODE and HC nb . More importantly, we find that there is only one module commonly detected by five different algorithms on MYP, namely, the complexID is 510.190. 10 Hits for these modules detection. These results show that the enhanced data set can provide a better network for functional module detection. Furthermore, a function module identified by the proposed strategy (ssMCL) is given in Figure 4. The function module with 62 proteins in Figure 9 is dominated by Complex 550.1.213 about 'probably transcription/DNA maintanance/chromatin structure' and Complex 510.10 about 'RNA polymerase'. In this figure, the red line represents the protein pair with higher functional similarity, and the blue line represents the protein pair recorded in MIPs database. ssMCL can successfully find this module because the prior information (protein functional pairs) were added to the original PPI networks. When checking the function modules identified by MCL on the original PPI networks (MYP), we did not find this module, while the proteins in the module were divided into different modules. Therefore, we can say prior knowledge based strategy has ability to effectively mine function modules.

Conclusions
In this paper, we presented a prior knowledge based strategy for mining function modules from PPI networks with the aid of GO. The functional protein pairs were extracted according to their functional similarity in GO, and then such pairs were taken as the prior information of the proposed mining methods. Two kinds of prior knowledge based methods were designed: one for encoding the prior information into the data representation, i.e., combining the functional protein pairs and PPI networks to a new PPI networks, and the other for using the prior information as pairwise constraints to supervise the existing ming methods. Experimental results on Yeast PPI networks and GO knowledge resource have shown that our proposed strategy performs well in terms of coverage and complex annotation p-value.