Dynamic identifying protein functional modules based on adaptive density modularity in protein-protein interaction networks

Background The identification of protein functional modules would be a great aid in furthering our knowledge of the principles of cellular organization. Most existing algorithms for identifying protein functional modules share a common defect: once a protein node is assigned to a functional module, it cannot be moved to another functional module in subsequent steps, so an erroneous assignment made at an early step persists and accumulates to the end. Results In this paper, we design a new algorithm, ADM (Adaptive Density Modularity), to detect protein functional modules based on adaptive density modularity. In the ADM algorithm, each protein's external closely associated degree is compared with its internal closely associated degree, so that the partitioning of a protein-protein interaction network into functional modules always evolves quickly toward a higher density modularity of the network. The integration of density modularity into the new algorithm not only overcomes the drawback mentioned above, but also contributes to identifying protein functional modules more effectively. Conclusions The experimental results show that the ADM algorithm is superior to many state-of-the-art techniques for detecting protein functional modules in terms of prediction accuracy. Moreover, the identified protein functional modules are statistically significant in terms of the "Biological Process" annotations in Gene Ontology, which provides substantial support for revealing the principles of cellular organization.


Background
As proteins drive biological mechanisms and perform physiological functions within the cell [1], investigating the modular structure of protein-protein interaction (PPI) networks has been central to proteomics studies in the post-genomic era. PPI networks, which consist of interconnected protein functional modules, clearly exhibit modular structure: they have dense connections between nodes within modules but sparse connections between nodes in different modules [2]. A module is a fundamental unit formed by highly connected proteins and often possesses specific biological functions [3]. Functional modules can help us to predict the functions of proteins [4]. Accumulated evidence suggests that functional modules are involved in many disease mechanisms [5]. Tracking functional modules could reveal important insights into modular mechanisms and improve our understanding of disease pathways [6,7]. Although many algorithms for detecting protein functional modules have been proposed, how to measure the strength of the division of a network into modules (also called communities) has not been explicitly defined. So far, the most widely used evaluation criterion for partitioning complex networks is the modularity measure Q of Newman and Girvan [8]. However, it has been shown that Q suffers from a resolution limit; that is, it performs poorly at identifying small modules [9]. This is mainly because Q depends on global characteristics of the network, which causes small modules to be concealed within large modules [10]. Using the definition of natural density in number theory, Zhang et al. [11] defined the network natural density, which measures how closely the nodes within communities are connected. They further introduced the concept of density modularity, which overcomes the resolution limit of Newman-Girvan modularity, to evaluate the validity of community partitioning.
Strategies for clustering protein functional modules generally fall into two types: the "bottom up" approach, which proceeds by agglomeration, such as the Newman-fast algorithm; and the "top down" approach, which proceeds by division, such as the GN algorithm. However, a common drawback of both types is that protein nodes that have been assigned to functional modules have no chance to move into other functional modules in subsequent steps. Instead, they are confined to their original modules, which results in a "module barrier" that lets an erroneous assignment made at an early step persist and accumulate to the end.
In this paper, we propose a new algorithm, ADM, to identify protein functional modules by introducing the external closely associated degree and the internal closely associated degree. In the ADM algorithm, a protein adaptively decides whether to stay inside its current functional module or move to another module by comparing its internal closely associated degree with its external closely associated degree. Because the ADM algorithm avoids the aforementioned shortcoming and allows proteins to dynamically correct their module assignments whenever necessary, the effectiveness of detecting protein functional modules is greatly improved. The experimental results show that the ADM algorithm outperforms other state-of-the-art methods such as MCL [12], MCODE [13] and ClusterONE [14] in terms of prediction accuracy; moreover, it is capable of identifying many protein functional modules with strong biological significance.

Idea of closely associated degree
Functional modules are a cornerstone of many (if not all) biological processes, and together they form various types of molecular machinery that perform a vast array of biological functions [15]. So far, the most widely used evaluation criterion for the partitioning of a PPI network into functional modules is the global modularity measure Q of Newman and Girvan, where the maximum Q corresponds to the optimal partitioning result. However, Q suffers from a resolution limit: it cannot effectively identify small clusters, even when these clusters are cliques (complete graphs).
Preliminary observation of a considerable number of PPI networks has indicated that inferring the connections among local protein functional modules from the overall PPI network is the primary cause of the resolution limit.
To overcome the limit of the global modularity Q, a new quantitative function, named density modularity D, was introduced to evaluate the validity of community structure partitioning [11]. Density modularity, which represents the degree of tightness inside a functional module, is defined as follows: where $L$ and $l_i$ denote the number of edges in the entire network and in module $i$, respectively; $d_i$ and $n_i$ denote the sum of the node degrees and the number of nodes in module $i$, respectively. The value of D ranges from 0 to 1, the same as Q. Density modularity can effectively evaluate the partitioning of a network into communities [11]. The optimal partitioning is obtained when the density modularity is at its maximum. In this paper, we introduce the closely associated degree, which represents the increment in density modularity brought by assigning a node to a module. For each protein node, its internal closely associated degree is defined as the closely associated degree between the node and its host module, while its external closely associated degree is the closely associated degree between the node and an external module connected to it. We calculate the external and internal closely associated degrees of each protein node from the variation of the density modularity as the node moves. If its external closely associated degree is greater than its internal closely associated degree, the node jumps into the corresponding external module; otherwise, it remains in its current module.

Definition of internal closely associated degree
Suppose that clustering is in progress in a PPI network whose number of edges is denoted by $L$, that M1 and M2 are the modules to be merged, and that the other modules in the network, denoted collectively by M0, remain unchanged. The density modularity of the entire network is denoted by D, and that of M0 by D0. Let $l_i$, $n_i$ and $d_i$ ($i = 1, 2$) represent the number of edges, the number of nodes and the sum of the node degrees in Mi, respectively; the number of edges between M1 and M2 is denoted by $e_{12}$.
Before the merging of M1 and M2, we define the density modularity of the PPI network as follows: After the merging of M1 and M2, we define the density modularity of the PPI network as follows: Therefore the variation of the density modularity of the PPI network can be formulated as follows: where both L and 1 L 3 (n 1 + n 2 ) 2 are constants, which can be omitted during the derivation of ΔD.
During the merging of two modules, the variation of density modularity depends mainly on $e_{12}$, $l_1 + l_2$ and $n_1 + n_2$. The "Internal Closely Associated Degree" (denoted $R_{in}^{(s)}$) is defined as the modularity increment ΔD obtained when a node is merged into its host module. Let $s$ be the identifier of the module to which the single node originally belongs; then $R_{in}^{(s)}$ can be formulated as follows: where $e_{in}^{(s)}$ represents the number of edges that connect the node to all the other nodes in module $s$.

Definition of external closely associated degree
The "External Closely Associated Degree" (denoted $R_{out}^{(f)}$) is defined as the modularity increment ΔD obtained when a node is merged into an adjacent external module. As there is often more than one adjacent module trying to pull a node out of its original module, the node tends to be merged into the module that offers the greatest closely associated degree. Let $t$ mark the module that the node finally selects; then the external closely associated degree can be formulated as follows: where $e_{out}^{(f)}$ is the number of edges connecting the current node to all the nodes within module $t$; $l_1$ and $n_1$ represent the number of edges and the number of nodes when the node is viewed as a single module, respectively; and $l_2$ and $n_2$ denote the number of edges and the number of nodes in the module to which the node belongs, respectively. {ngbs} represents the collection of all the adjacent external modules that are closely associated with the node. Among {ngbs}, the node needs to find the module with the greatest closely associated degree to merge into.
Studies show that it is unreasonable to infer the connections between local protein functional modules from the overall PPI network. In our work, the location of each protein node is updated constantly according to the comparison between its external and internal closely associated degrees. Whether a protein node stays inside its current module or moves to another module, the choice contributes to improving and optimizing the density modularity of the PPI network.

The overview of ADM algorithm
During the identification of protein functional modules in the ADM algorithm, $R_{in}$ and $R_{out}$ are directly proportional to the increment of density modularity, which means that every move of a node contributes the greatest available increment of density modularity. When a protein node is being assigned to a functional module, if $R_{out}$ is greater than $R_{in}$, the node moves to the corresponding external functional module; otherwise, it remains inside its original functional module. The density modularity of a PPI network therefore changes during the detection of functional modules, and the values of $R_{in}$ and $R_{out}$ for each node need to be recalculated once a node has moved. The nodes in the PPI network keep moving until each of them has reached a steady state, at which point the PPI network has been divided into functional modules.
In the initial stage of the ADM algorithm, we consider each node as an initial functional module and calculate the closely associated degree between the node and its neighbor modules. The node is then merged into the neighbor module with the greatest closely associated degree, which is considered to be the module to which the node belongs. Once the module structure has reached a stable state, we consider all pairs of modules and merge the pair whose merging produces the maximum increment (or minimum reduction) of the total density modularity; this process is repeated until the ADM algorithm ends. Finally, we take the partition obtained when the density modularity D is at its maximum as the collection of predicted functional modules.

ADM algorithm is detailed as follows
(1) Initialize the network as n modules; that is, each protein node is taken as a separate module.
(2) For each node, calculate its $R_{in}$ and $R_{out}$; if $R_{in} < R_{out}$, the node moves to the corresponding external functional module; otherwise, it remains inside its original functional module.
(3) Repeat step 2 until all the nodes in the PPI network are stable, and record the density modularity D once the modules have emerged.
(4) List all the pairs of modules obtained from step 3 and, for each pair, calculate the increment of density modularity that merging the pair would bring.
(5) Merge the pair of modules that brings the network the maximum increment (or minimum reduction) of density modularity.
(6) Repeat steps 2 to 5 until the positions of all the nodes remain unchanged.
(7) Pick the partition obtained when the density modularity D reaches its maximum value across steps 2 to 6 as the final solution of the ADM algorithm.
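
The node-moving phase (steps 1 to 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `local_moves` and `link_fraction_gain` are hypothetical, and the scoring function is a placeholder (the fraction of a module's members that the node is linked to) standing in for the closely associated degrees, whose closed-form expressions are not reproduced here. A cap on the number of sweeps is added so that the sketch always terminates, whatever scoring function is supplied.

```python
from typing import Callable, Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets
Gain = Callable[[Graph, Hashable, Set[Hashable]], float]


def link_fraction_gain(graph: Graph, node: Hashable, module: Set[Hashable]) -> float:
    """Placeholder score (an assumption, NOT the paper's R_in / R_out):
    fraction of the module's other members that `node` is linked to."""
    others = module - {node}
    if not others:
        return 0.0
    return len(graph[node] & others) / len(others)


def local_moves(graph: Graph, gain: Gain = link_fraction_gain,
                max_sweeps: int = 100) -> Dict[Hashable, int]:
    """Steps 1-3: start from singleton modules, then move each node to the
    neighbouring module whose score strictly exceeds that of its current
    module, sweeping until no node moves (or max_sweeps is reached)."""
    label = {v: i for i, v in enumerate(graph)}      # step 1: one module per node
    members = {i: {v} for v, i in label.items()}

    for _ in range(max_sweeps):                      # step 3: repeat until stable
        moved = False
        for v in graph:                              # step 2: compare scores
            home = label[v]
            r_in = gain(graph, v, members[home])
            # External candidates: modules that contain a neighbour of v.
            candidates = {label[u] for u in graph[v]} - {home}
            best, r_out = home, r_in
            for m in sorted(candidates):
                score = gain(graph, v, members[m])
                if score > r_out:
                    best, r_out = m, score
            if best != home:                         # R_in < R_out: move the node
                members[home].discard(v)
                if not members[home]:
                    del members[home]
                members[best].add(v)
                label[v] = best
                moved = True
        if not moved:
            break
    return label
```

The scoring function is passed in as a parameter so that the actual closely associated degrees can be substituted without changing the loop. The pairwise merging of steps 4 and 5 is omitted because it requires the density modularity D itself.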

Results and discussions
As the yeast PPI network is a relatively credible and complete dataset among the existing PPI networks, it is often used to test the validity of methods for identifying protein functional modules. Among the various existing versions of yeast PPI network datasets, we choose the Gavin dataset [16] and the Krogan_extended dataset [17] to compare the performance of the ADM algorithm against the following classic clustering algorithms: MCL, MCODE and ClusterONE. The Gavin dataset comprises 7669 interactions among 1855 proteins, and the Krogan_extended dataset comprises 14317 interactions among 3672 proteins; self-loops and redundant edges have been removed from both. The set of known yeast protein functional modules obtained from MIPS contains 236 functional modules, each of which contains at least 3 proteins.
Owing to the randomness inherent in the ADM algorithm, we run it 20 times on each of the above yeast PPI networks, generating 20 partition results, among which the one with the greatest density modularity is kept as the final set of identified functional modules. After filtering, each of the identified functional modules contains at least 3 proteins. As a result, 227 and 442 functional modules are identified on the Gavin dataset and the Krogan_extended dataset, respectively. The accuracy metric and the GO semantic similarity measurement are employed to evaluate the similarity between the identified protein functional modules and the known reference protein functional modules; in addition, we use p-values to evaluate the biological significance of the predicted functional modules.

Accuracy metric
The harmonic mean of Sn (Sensitivity) and PPV (Positive Predictive Value), also known as the accuracy metric (Acc), is typically used to assess the overall performance of various algorithms.
Sn and PPV are calculated from the matching matrix T between predicted functional modules and reference functional modules. The numbers of rows and columns of T (denoted n and m) are the number of reference functional modules and the number of predicted functional modules, respectively. The element t(i,j) of T denotes the number of proteins found in both the ith reference functional module and the jth predicted functional module, and n(i) denotes the number of proteins in the ith reference functional module. Thus Sn, PPV and Acc can be defined as follows: The ADM algorithm is compared against three existing state-of-the-art clustering algorithms by applying all of them to the Gavin dataset and the Krogan_extended dataset: ClusterONE (clustering with overlapping neighborhood expansion), MCL (Markov Clustering) and MCODE (Molecular Complex Detection). In terms of the accuracy metric, a larger Sn indicates, to some extent, that more reference protein functional modules are recovered, while a lower PPV indicates that more predicted protein functional modules match none of the reference protein functional modules. In Table 1, "clusters" is the number of identified protein functional modules and "matched" is the number of identified protein functional modules that match at least one reference functional module. On the Gavin dataset, ADM yields the second largest number of clusters (227) but the largest number of matched modules (94) among the algorithms. In addition, ADM obtains the greatest Acc, 0.579, which is 15.3% higher than the second best Acc, 0.502, achieved by MCL. On the Krogan_extended dataset, ADM yields the third largest number of clusters (442) but the largest number of matched modules (107). Moreover, the Sn, PPV and Acc obtained by the ADM algorithm are 20.5%, 64.2% and 41.7% higher than the second best values, respectively.
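
As an illustration of how these quantities are computed from the matching matrix, the sketch below assumes the commonly used definitions $Sn = \sum_i \max_j t(i,j) / \sum_i n(i)$ and $PPV = \sum_j \max_i t(i,j) / \sum_j \sum_i t(i,j)$, which agree with the symbols defined above but are an assumption here. All function names are hypothetical. Acc is computed as the harmonic mean, as stated in the text; a geometric-mean variant, which some related work uses instead, is included for comparison.

```python
from math import sqrt
from typing import List, Sequence, Set


def matching_matrix(reference: Sequence[Set[str]],
                    predicted: Sequence[Set[str]]) -> List[List[int]]:
    """T[i][j] = number of proteins shared by reference module i
    and predicted module j."""
    return [[len(r & p) for p in predicted] for r in reference]


def sensitivity(T: List[List[int]], reference: Sequence[Set[str]]) -> float:
    """Sn: best overlap of each reference module, summed and divided by
    the total size of the reference modules."""
    total = sum(len(r) for r in reference)
    return sum(max(row, default=0) for row in T) / total if total else 0.0


def ppv(T: List[List[int]]) -> float:
    """PPV: best overlap of each predicted module, summed and divided by
    the sum of all entries of T."""
    total = sum(sum(row) for row in T)
    columns = list(zip(*T))
    return sum(max(col, default=0) for col in columns) / total if total else 0.0


def accuracy_harmonic(sn: float, ppv_value: float) -> float:
    """Acc as the harmonic mean of Sn and PPV (as stated in the text)."""
    denom = sn + ppv_value
    return 2.0 * sn * ppv_value / denom if denom else 0.0


def accuracy_geometric(sn: float, ppv_value: float) -> float:
    """Alternative Acc as the geometric mean (an assumption, for comparison)."""
    return sqrt(sn * ppv_value)
```

For example, with reference modules {A, B, C} and {D, E, F, G} and predicted modules {A, B}, {C, D, E} and {X, Y, Z}, the matching matrix has rows (2, 1, 0) and (0, 2, 0), giving Sn = 4/7 and PPV = 4/5.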
In short, the ADM algorithm detects functional modules from PPI networks more effectively than the other three algorithms. MCODE performs worst on both datasets, so only MCL and ClusterONE are used in the following comparisons.

GO semantic similarity measurement
Biologists are often compelled to spend much time and effort searching for biological information because definitions in biology are inconsistent. Fortunately, Gene Ontology (GO) provides a platform to unify the representation of gene and gene product attributes across all species. The ontology covers three domains: cellular component, molecular function and biological process.
The GO semantic similarity of a functional module refers to the average associated degree over all pairs of proteins within the module [18]. The semantic similarities for cellular component, molecular function and biological process are calculated separately, and their geometric mean is taken as the functional module's GO semantic similarity. The overall GO semantic similarity is then obtained as the weighted average over all the functional modules. Generally speaking, the greater the GO semantic similarity, the greater the probability that the proteins perform similar biological functions.
The GO semantic similarity of protein functional modules can be calculated conveniently with the tool ProCope [19]. Owing to the poor performance of the MCODE algorithm in the above section, here we evaluate the performance of the ADM algorithm in terms of GO semantic similarity by comparing it with the MCL and ClusterONE algorithms.
The results are exhibited in Table 2, where the MIPS complexes, a collection of protein complexes curated from the biomedical literature, are used as the benchmark for evaluation [20]. On the Gavin dataset, although the Biological Process score achieved by the ADM algorithm is lower than that obtained by the ClusterONE algorithm, the Cellular Component and Molecular Function scores achieved by the ADM algorithm are 14.2% and 8.9% higher, respectively, than those obtained by ClusterONE, which has the second best performance here. On the Krogan_extended dataset, the Cellular Component and Molecular Function scores achieved by the ADM algorithm are 50.6% and 9.9% higher, respectively, than those achieved by ClusterONE, which again has the second best performance. Therefore, we have reason to conclude that the ADM algorithm not only can identify significant protein functional modules from PPI networks but also performs better than the other algorithms.

Analysis of P-values
To evaluate the statistical significance of the identified functional modules, many researchers annotate their main biological functions using p-values [21]. Given a predicted functional module with C proteins, the p-value denotes the probability of observing, by chance, k or more proteins from the functional module in a biological function shared by F proteins out of a total genome size of N proteins. The p-value is formulated as follows: The p-value measures the degree to which a protein functional module is enriched for a certain function. The smaller the p-value, the lower the probability that the biological function arises by chance in the predicted functional module, and thus the more significant the predicted functional module. Given that the proteins within a functional module are assembled to perform common biological functions, they are expected to share common functions, among which we take the function with the minimum p-value as the module's annotated function. More importantly, the functions of uncharacterized proteins can be predicted from the functions of the modules to which they belong.
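
The description above (k or more annotated proteins in a module of C proteins, with F annotated proteins in a genome of N) corresponds to the upper tail of a hypergeometric distribution. The sketch below computes that tail with the Python standard library; it is an illustration under this assumption, the function name `enrichment_p_value` is hypothetical, and it is not the implementation used by GO::TermFinder.

```python
from math import comb


def enrichment_p_value(N: int, F: int, C: int, k: int) -> float:
    """Probability of drawing at least k annotated proteins when C proteins
    are sampled without replacement from a genome of N proteins, F of which
    carry the annotation (hypergeometric upper tail)."""
    if k <= 0:
        return 1.0
    upper = min(C, F)          # cannot observe more than C or F annotated proteins
    total = comb(N, C)         # number of possible modules of size C
    # Sum the tail directly with exact integers; math.comb returns 0 when the
    # second argument exceeds the first, so impossible terms vanish.
    tail = sum(comb(F, i) * comb(N - F, C - i) for i in range(k, upper + 1))
    return tail / total
```

Summing the tail directly, rather than subtracting the lower part from 1, avoids loss of precision for very small p-values. For a small case, `enrichment_p_value(10, 4, 3, 2)` evaluates (6·6 + 4·1)/120 = 1/3.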
Here, we calculate the Biological Process p-values for each identified protein functional module with GO::TermFinder. GO::TermFinder takes a list of genes as input and determines whether there are enriched GO terms for that list by searching the shared GO terms or their parents [18]. Table 3 lists some of the shared Gene Ontology terms; the cluster frequency is the ratio of the number of proteins carrying the corresponding annotation to the total number of proteins in the module.
In most situations, a functional module with p-value < 0.01 is considered significant. The p-values of most protein functional modules identified by the ADM algorithm are lower than 0.01, which indicates that these predicted modules are unlikely to have arisen merely by chance. As exhibited in Table 3, the minimum p-value is 2.28E-63, showing that our algorithm is capable of detecting functional modules with biological significance effectively. Table 4 lists some examples of functional modules detected by applying the ADM algorithm to the Gavin dataset and the Krogan_extended dataset. The ADM algorithm is capable of detecting many large functional modules in both datasets. As shown in Table 4, a functional module consisting of 25 proteins is discovered in the Gavin dataset with a cluster frequency of 100%, that is, a perfect match, which shows that it has strong biological significance and is probably a real protein functional module. In summary, our ADM algorithm is capable of detecting many large functional modules with strong biological significance.

Conclusions
A protein functional module is a fundamental unit formed by highly connected proteins and often possesses specific biological functions [3]. While many algorithms have been developed to detect functional modules, they share a common drawback, the "module barrier". In this paper, after thoroughly analyzing the changes in density modularity during the merging process, we first defined the concepts of external closely associated degree and internal closely associated degree, and then proposed a new algorithm to identify protein functional modules based on adaptive density modularity. In the ADM algorithm, the partitioning of a PPI network into functional modules always evolves quickly toward a higher density modularity of the PPI network, so the ADM algorithm is capable of detecting protein functional modules dynamically.
Owing to the incorporation of density modularity into the new ADM algorithm, we overcame the "module barrier" defect present in most previously proposed algorithms; moreover, the prediction of protein functional modules improved substantially compared with many state-of-the-art algorithms. The ADM algorithm therefore has important implications for the detection of protein functional modules and the understanding of the principles of cellular organization.