An overlapping module identification method in protein-protein interaction networks
© Wang et al ; licensee BioMed Central Ltd. 2012
Published: 8 May 2012
Previous studies have shown modular structures in PPI (protein-protein interaction) networks. More recently, many genome and metagenome investigations have focused on identifying modules in PPI networks. However, most of the existing methods are insufficient when applied to networks with overlapping modular structures. In our study, we describe a novel overlapping module identification method (OMIM) to address this problem.
Our method is an agglomerative clustering method that merges modules according to their contributions to modularity. Nodes that have positive effects on more than two modules are defined as overlapping parts. In addition, we designed de-noising steps based on the clustering coefficient and hub-finding steps based on nodal weight.
The low computational complexity and few control parameters make our method suitable for large-scale PPI network analysis. First, we verified OMIM on a small artificial word association network, which provided a comprehensive evaluation. Then experiments on real PPI networks from the MIPS Saccharomyces cerevisiae dataset were carried out. The results show that OMIM outperforms several other popular methods in identifying high-quality modular structures.
In general, a good understanding of protein families provides further insight into biological processes. Previous studies have shown that modular structures in PPI networks are densely connected internally but interact only sparsely with one another [1, 2]. Modules can be understood as independent sub-networks, and proteins in the same module interact more frequently and show stronger functional dependencies. Increasingly, biological problems are addressed with graph models, where proteins or genes are viewed as nodes and their pairwise interactions as edges in a network [3, 4].
Several methods have been proposed for module identification in the last decade. In 2003, Bader and Hogue proposed a molecular complex detection method (MCODE), which separates densely connected regions by assigning a weight to each protein. A Markov clustering method (MCL), which is based on flow simulation and treats high-flow areas as protein complexes, was applied to detect protein families in 2002. A network module mining method (NeMo) proposed by Yan et al. identifies frequent dense sub-graphs in input networks using coherent edge frequencies, which can lose statistical power in sparse networks with few edges. However, most of the existing methods cannot identify overlapping modules in PPI networks, even though some proteins may be included in multiple complexes and the component parts of a complex may be activated at a specific time or location [8, 9].
In 2006, a clique percolation method (CPM) was used for the first time to identify overlapping modules in PPI networks by finding fully connected sub-graphs of different minimum clique sizes. But its high computational complexity (O(exp(n)), where n represents the number of nodes in the network) hindered its application to large-scale networks.
Based on these considerations, we propose OMIM, which is able to partition large-scale PPI networks with overlapping modular structures. OMIM first clusters all nodes using the Newman algorithm and then defines nodes that have comparatively positive effects on the modularity of more than two modules as overlapping ones. Moreover, we designed de-noising steps that assign a weight to each edge. Hubs can also be found according to their nodal weight. OMIM is able to identify highly interconnected modules and has few control parameters, allowing it to be applied to many types of networks. We evaluated OMIM on an artificial network and a PPI network. The results show that it outperforms several other current methods.
A PPI network can be described as an undirected and unweighted graph, G=(V,E), where V and E represent the nodes (proteins) and edges (interactions) in the network. In our method, we first assign weights to all edges according to their importance to the network and remove those with lower weights as noise. Then the steps for identifying overlapping modules are performed. The main idea of identifying overlapping parts in OMIM is to find nodes that have comparatively positive effects on different modules. In addition, hubs are found according to their connections with their neighbors.
In general, data in PPI networks are obtained from high-throughput protein-protein interaction experiments. So far, the most frequently used protein-protein interaction detection methods are yeast two-hybrid, tandem affinity purification, mass spectrometry and protein chip technology. Although these high-throughput detection methods make experimentation easy, they introduce noise and incompleteness [13–15].
where n_i denotes the number of triangles that go through node i.
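As a minimal sketch of the quantity being described, the following assumes the standard local clustering coefficient, CC_i = 2 n_i / (k_i (k_i - 1)), where k_i is the degree of node i; that exact form and the function name `clustering_coefficients` are assumptions made for illustration, not taken from the text above.

```python
from itertools import combinations

def clustering_coefficients(edges):
    """Return {node: CC} for an undirected, unweighted graph given as edge pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops carry no triangle information
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cc = {}
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[i] = 0.0  # convention: no triangle can pass through i
            continue
        # n_i: triangles through i, i.e. connected pairs among i's neighbours
        n_i = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[i] = 2.0 * n_i / (k * (k - 1))
    return cc

# Triangle a-b-c plus a pendant node d attached to c
example = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
cc = clustering_coefficients(example)
```

In this toy graph, a and b sit in a closed triangle, c has one connected pair out of three possible neighbour pairs, and d has a single neighbour.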
where Q is a quality function representing modularity. The physical meaning of Eq. (4) is that modularity is equal to the fraction of edges that fall within modules, minus the expected value of the same quantity if edges fell at random without regard to the modular structure. The Newman algorithm is a method for optimizing Q in order to discover the best modular structure.
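To make the verbal definition concrete, here is a small sketch that assumes Newman's usual form, Q = Σ_r (e_rr − a_r²), where e_rr is the fraction of edges inside module r and a_r is the fraction of edge ends attached to module r. The function name `modularity` and the toy graph are illustrative assumptions.

```python
def modularity(edges, module_of):
    """Q = sum over modules r of (e_rr - a_r**2) for a non-overlapping partition."""
    m = len(edges)
    inside = {}  # number of edges with both ends in module r
    ends = {}    # number of edge ends attached to module r
    for u, v in edges:
        ru, rv = module_of[u], module_of[v]
        ends[ru] = ends.get(ru, 0) + 1
        ends[rv] = ends.get(rv, 0) + 1
        if ru == rv:
            inside[ru] = inside.get(ru, 0) + 1
    return sum(inside.get(r, 0) / m - (ends[r] / (2 * m)) ** 2 for r in ends)

# Two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_module = {n: "A" for n in range(6)}
q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
```

Splitting the graph into its two triangles gives a positive Q, whereas placing every node in one module gives Q = 0, which matches the "observed minus expected" reading of the definition.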
The steps of the Newman algorithm can be summarized as follows.
where m represents the total number of edges in the network.
Merge module pairs with the maximum value of ΔQ. Update matrix e by adding the rows and columns of the corresponding merged modules.
Step 3. Repeat Step 2, until the entire network has become one big module.
From this description, the progress of the Newman algorithm can be represented as a dendrogram. If we choose to cut at different levels, different modular structures can be obtained. Actually, Newman chooses to cut at the maximum value of Q to obtain the best modular structure.
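The merging procedure and the cut at maximum Q can be sketched as follows. This is a simplified illustration that assumes Newman's fast greedy formulation, with ΔQ = 2(e_rs − a_r a_s) for two connected modules r and s; it also assumes a graph without self-loops or duplicate edges and with sortable node labels. The function name `newman_greedy` is hypothetical.

```python
def newman_greedy(edges):
    """Greedy agglomeration; returns (best Q, modules at the best cut)."""
    m = len(edges)
    nodes = sorted({x for pair in edges for x in pair})
    members = {n: {n} for n in nodes}  # module id -> member nodes
    a = {n: 0.0 for n in nodes}        # fraction of edge ends per module
    e = {}                             # e[(r, s)], r < s: half the fraction of r-s edges
    for u, v in edges:
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
        key = (min(u, v), max(u, v))
        e[key] = e.get(key, 0.0) + 1 / (2 * m)
    q = -sum(x * x for x in a.values())  # singletons have no internal edges
    best_q, best = q, [set(c) for c in members.values()]
    while e:  # stops when no two modules are connected any more
        (r, s), e_rs = max(
            e.items(), key=lambda kv: 2 * (kv[1] - a[kv[0][0]] * a[kv[0][1]])
        )
        q += 2 * (e_rs - a[r] * a[s])
        # merge module s into module r and update e and a
        members[r] |= members.pop(s)
        a[r] += a.pop(s)
        del e[(r, s)]
        for (x, y), w in list(e.items()):
            if s in (x, y):
                other = y if x == s else x
                del e[(x, y)]
                key = (min(r, other), max(r, other))
                e[key] = e.get(key, 0.0) + w
        if q > best_q:  # remember the level of the dendrogram with the largest Q
            best_q, best = q, [set(c) for c in members.values()]
    return best_q, best

# Two triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_q, best_modules = newman_greedy(edges)
```

On this toy graph the last merge (joining the two triangles) lowers Q, so the returned cut is the level just before it. The scan over all connected pairs at every step keeps the sketch short; it is not the efficient bookkeeping a production implementation would use.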
It should be noted that complexes in PPI networks are not static and proteins can be included in different modules. Therefore, identifying overlapping parts between different modules is necessary. We first apply the Newman algorithm to the input data. Then we identify overlapping nodes according to their contribution to modularity. The detailed steps are as follows.
Step 1. Perform Newman algorithm. All nodes are clustered without overlapping parts.
Step 2. Define nodes, whose neighbors belong to more than two modules, to be candidate nodes.
where Q_B and Q_B' are the modularities of B and B', respectively.
Step 4. Repeat Steps 2-3 until all overlapping parts are identified.
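The comparison of Q_B and Q_B' can be sketched as below. Several things here are assumptions rather than the authors' definitions: the per-module modularity is taken to be Q_B = l_B/m − (d_B/(2m))², with l_B the number of edges inside B and d_B the total degree of its members; every node with at least one neighbour in another module is tested (a simplification of the candidate rule in Step 2); and only a single pass is made instead of the repetition in Step 4. The names `module_q` and `find_overlaps` are hypothetical.

```python
from itertools import combinations

def module_q(members, adj, m):
    """Assumed per-module modularity: internal edge fraction minus expected fraction."""
    internal = sum(1 for u in members for v in adj[u] if v in members) / 2
    degree = sum(len(adj[u]) for u in members)
    return internal / m - (degree / (2 * m)) ** 2

def find_overlaps(edges, modules):
    """Return (node, home module, extra module) triples where copying helps the extra module."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = len(edges)
    overlaps = []
    for home, members in modules.items():
        for i in sorted(members):
            for other, other_members in modules.items():
                if other == home or not (adj[i] & other_members):
                    continue  # i has no neighbour in this module
                # B' is B with node i copied into it
                if module_q(other_members | {i}, adj, m) > module_q(other_members, adj, m):
                    overlaps.append((i, home, other))
    return overlaps

# Three 4-cliques; node 3 (module A) is also tied to three members of B
modules = {"A": {0, 1, 2, 3}, "B": {4, 5, 6, 7}, "C": {8, 9, 10, 11}}
edges = [p for c in modules.values() for p in combinations(sorted(c), 2)]
edges += [(3, 4), (3, 5), (3, 6), (7, 8), (11, 0)]
overlaps = find_overlaps(edges, modules)
```

In this example only node 3 raises the modularity of a second module, so it is the only node reported as overlapping; the single bridge edges (7, 8) and (11, 0) are not enough to offset the degree penalty.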
Jordan et al. first found hubs when they studied the evolution of proteins and referred to proteins with a large number of partners as hubs. Han et al. divided hubs into two classes: party hubs and date hubs. Party hubs are hubs that interact with their partners at the same time, whereas date hubs bind their different partners at different times or at different locations. According to their study, in a network with a modular structure, date hubs organize the proteome, while party hubs function inside modules. We propose a computational method to detect hubs far more easily.
where party hub_r denotes the party hub of module r.
input: G=(V,E); α
for all nodes i (i∈V) in G
    compute the clustering coefficient CC_i
for all edges (i,j) ((i,j)∈E) in G
    compute the weight SCC(i,j)
    if SCC(i,j) < α
        remove edge (i,j) as noise
a new graph G'=(V',E') is obtained
input: G'=(V',E'); number of nodes n; number of edges m
2. while (there is more than one module)
    merge the module pair with the maximum ΔQ;
    update e and a;
3. sort the Q values from all iterations and choose the modular structure M corresponding to the largest Q.
4. for node i in M
    if i belongs to module A and its neighbor (in G') j belongs to B
        copy i to B and construct B'
        if Q_B' > Q_B
            i is an overlapping node between A and B
5. a new modular structure M' with overlapping parts is obtained.
3. discovering hubs
for module r in M'
    party hub_r = argmax w_i, i∈r
for each node i not in any module
    if ACC_i ≥ 3
        i is a date hub
The yeast (Saccharomyces cerevisiae) PPI networks used in our study are from the MIPS Comprehensive Yeast Genome Database (CYGD) (PPI_18052006). The dataset contains 4989 proteins and 13583 interactions after removing isolated nodes and self-loops. The on-line annotation tool, GO term finder (version 0.83), is from the SGD database (Saccharomyces Genome Database), which contains 7292 genes as a background set.
The methods used for comparison in our experiments are Newman, MCL and CPM. There are two main reasons for this selection. First, these are three classical clustering algorithms that have been widely used in many fields, so their use makes for clearer comparisons. Second, these algorithms represent the most appropriate methods, in different respects, for comparison with OMIM. According to Brohée et al., MCL outperforms many other algorithms, especially in partitioning PPI networks. CPM is a widely known classical method for identifying overlapping modules, and the Newman algorithm is the ancestor of OMIM.
Among these three methods, MCL was executed as an embedded program of BioLayout Express 3D, and the CPM algorithm was run using CFinder, a tool created for clustering based on CPM.
where node j is a neighbor of node i, m_i represents the total number of neighbor nodes of i, and num_V(r) and num_E(r) represent the number of nodes and edges in module r, respectively. x_i(j) is an indicator function: x_i(j)=1 if j is classified correctly, and x_i(j)=0 otherwise.
Results of the comparison on the word association dataset
Eight party hubs were found by OMIM, i.e., month, sunshine, camp, sleep, work, enjoy, long and sunny. The date hub is day. Besides, we also discovered four overlapping nodes: moon, outside, delight and walk. Compared with the original network shown in Figure 1, our results can correctly cluster all nodes, verifying the effectiveness of our method.
where n represents the size of the entire network, n_1 is the size of a cluster obtained from the experiment, n_2 is the number of proteins annotated with a specific GO term and ol is the number of proteins in the cluster that are annotated with that GO term.
In our experiments, P-values higher than 0.01 were eliminated. We report the negative natural logarithm (-log P-value) in place of the P-value.
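For illustration, the sketch below assumes that the P-value is the usual upper-tail hypergeometric probability built from n, n_1, n_2 and ol, that is, the chance of drawing at least ol annotated proteins in a random cluster of size n_1. The function names `enrichment_p_value` and `neg_log_p` and the toy numbers are assumptions made for the example, not values from the study.

```python
from math import comb, log

def enrichment_p_value(n, n1, n2, ol):
    """Upper-tail hypergeometric probability P(X >= ol).

    n  : size of the background set
    n1 : size of the cluster
    n2 : number of background proteins annotated with the GO term
    ol : number of cluster proteins annotated with the GO term
    """
    total = comb(n, n1)
    upper = min(n1, n2)
    tail = sum(comb(n2, k) * comb(n - n2, n1 - k) for k in range(ol, upper + 1))
    return tail / total

def neg_log_p(p):
    """Negative natural logarithm, the score reported instead of the raw P-value."""
    return -log(p)

# Toy numbers: background of 10, cluster of 4, 5 annotated proteins, 3 of them in the cluster
p = enrichment_p_value(10, 4, 5, 3)
score = neg_log_p(p)
p_trivial = enrichment_p_value(10, 4, 5, 0)
```

With exact integer binomials the toy case evaluates to 55/210, far above the 0.01 cut-off, so such a cluster would be discarded; requiring an overlap of at least zero gives a probability of 1 by construction.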
From Figure 3 we can see that, like most scale-free networks, the degree distribution of the PPI dataset follows the power law relationship P(K) ~ K^(-r) with r≈2.5.
Enrichment analysis is an important index for protein function annotation. We used the GO term finder to assign to each module the main function corresponding to its best P-value. Ten modules were selected randomly to demonstrate the results of the enrichment analysis (Additional file 2).
Enrichment analysis of 10 randomly selected modules
| Biological process (BP) | Molecular function (MF) | Cellular component (CC) |
| nuclear-transcribed mRNA poly(A) tail shortening (21.60) | ubiquitin-protein ligase activity (10.50) | CCR4-NOT core complex (24.36) |
| meiotic mismatch repair (31.64) | mismatched DNA binding (33.15) | mismatch repair complex (33.61) |
| tRNA-type intron splice site recognition and cleavage (29.28) | endoribonuclease activity, producing 3'-phosphomonoesters (30.03) | tRNA-intron endonuclease complex (29.00) |
| nuclear polyadenylation-dependent mRNA catabolic process (27.68) | molecular function unknown (RRP4/RRP42/RRP43/SKI6) | cytoplasmic exosome (RNase complex) (30.24) |
| anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process (75.04) | ubiquitin-protein ligase activity (44.07) | anaphase-promoting complex (83.63) |
| protein import into mitochondrial inner membrane (38.73) | protein transporter activity (27.42) | mitochondrial inner membrane protein insertion complex (43.22) |
| protein targeting to mitochondrion (31.14) | protein channel activity | mitochondrial outer membrane translocase complex (47.98) |
| transposition, RNA-mediated (14.06) | RNA binding (6.37) | retrotransposon nucleocapsid (13.24) |
| Golgi vesicle transport (62.74) | rab guanyl-nucleotide exchange factor activity (35.95) | TRAPP complex (55.19) |
| positive regulation of transcription from RNA polymerase II promoter (9.93) | transcription factor binding transcription factor activity (15.39) | mediator complex (18.12) |
Comparison of OMIM with other competing algorithms on the PPI dataset
Table 3 shows that OMIM and Newman discard the fewest proteins (44.26%) when constructing modules, compared with the other two methods. Moreover, OMIM is superior to Newman and MCL according to the enrichment analysis of Gene Ontology categories (BP, MF and CC). Although CPM has higher -log P-values on BP and CC than OMIM, it filtered out too many proteins (about 85.51%), which may result in the loss of much useful information.
The studies on an artificial dataset and a PPI dataset verify the effectiveness of our method. In the experiment on the artificial dataset, OMIM found all modules correctly with an accuracy of 1.0000. All hubs that play key roles in the artificial network were found precisely. In the experiment on the PPI dataset, we evaluated the performance of OMIM by enrichment analysis, cluster frequency analysis and comparisons with other competing algorithms. OMIM performed well on all of the evaluation measures. In addition, 30% of the hub proteins found by OMIM could be verified directly against the study of Han et al. However, since the degree distribution of the PPI dataset follows a power law, the discrepancy in module sizes was quite large, which is undesirable. In our future work, we will try to address the problem of unbalanced clustering.
This work was supported by the grants of the National Natural Science Foundation of China, Nos. 60804022, 60974050, 61072094, 61133010 & 31071168, the grants from the Program for New Century Excellent Talents in University under Award Nos. NCET-08-0836, and NCET-10-0765, and the grant from the Fok Ying-Tung Education Foundation for Young Teachers, No. 121066.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 7, 2012: Advanced intelligent computing theories and their applications in bioinformatics. Proceedings of the 2011 International Conference on Intelligent Computing (ICIC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S7.