Comparison of module detection algorithms in protein networks and investigation of the biological meaning of predicted modules
https://doi.org/10.1186/s12859-016-0979-8
© Tripathi et al. 2016
Received: 18 August 2015
Accepted: 6 March 2016
Published: 18 March 2016
Abstract
Background
It is generally acknowledged that a functional understanding of a biological system can only be obtained by understanding the collective of molecular interactions in the form of biological networks. Protein networks are one network type of special importance, because proteins form the functional base units of every biological cell. On a mesoscopic level of protein networks, modules are of significant importance because these building blocks may be the next elementary functional level above individual proteins, allowing insight into fundamental organizational principles of biological cells.
Results
In this paper, we provide a comparative analysis of five popular and four novel module detection algorithms. We study these module prediction methods for simulated benchmark networks as well as 10 biological protein interaction networks (PINs). A particular focus of our analysis is placed on the biological meaning of the predicted modules by utilizing the Gene Ontology (GO) database as gold standard for the definition of biological processes. Furthermore, we investigate the robustness of the results by perturbing the PINs simulating in this way our incomplete knowledge of protein networks.
Conclusions
Overall, our study reveals a large heterogeneity among the different module prediction algorithms when one zooms in on the level of biological processes in the form of GO terms, and all methods are severely affected by a slight perturbation of the networks. However, we also find pathways that are enriched in multiple modules, which could provide important information about the hierarchical organization of the system.
Background
Biological function on the molecular level emerges from the complex interaction of the biological entities of a cell [1, 2]. Specifically, different types of molecules, e.g., proteins, metabolites, miRNA or tiRNA, can interact with each other in many ways, depending on the tissue type and the environmental condition of an organism. The interactions among biological molecules can be broadly categorized into three types of networks: metabolic networks, transcriptional regulatory networks and protein interaction networks [3–6]. These networks need to be inferred from experimental observations generated by different high-throughput platforms, including Next-Generation Sequencing (NGS), proteomics and microarrays.
Nowadays, it is generally accepted that biological networks are not randomly connected but follow certain structural patterns that give rise to (I) a scale-free topology, (II) a hierarchical organization and (III) a modular structure [7–12]. Especially modularity is one of the most important features of biological networks, because it suggests that nodes, which are tightly connected with each other as a community, are most likely to be a part of the same biological function or pathway. This may also be reflected in the evolution of the organisms [8, 13–15]. As a complicating factor, in reality, these pathways are not discrete, but each gene may take part in multiple biological functions, and therefore can be a part of multiple communities. Hence, a biological network with a modular structure can contain multiple overlapping communities, which might also contribute to the fact that biological networks are robust [16, 17].
For protein interaction networks (PINs) it is known that two types of modular structure are of significant importance. These modules can be formed either by protein complexes or by dynamic functional units [18]. Modules in PINs of different species have also been interpreted as supporting the efficient functioning of a cell and as a basis of evolution that allows quick adaptation to changes in the environment [19, 20]. In [21] the existence of two further types of structural components of modules in protein networks has been revealed, which have been termed core components and ring components. The core components are more conserved and perform key biological functions, while the ring components perform certain specialized functions under particular circumstances, potentially triggered by environmental changes. Furthermore, several methods have been developed to integrate protein networks with gene expression or other datasets, such as disease-gene associations, in order to identify the functional activity of modules in different disease conditions [22–25]. Finally, in [26] the algorithm ClusterONE has been developed to identify overlapping nodes in modules in protein networks. These examples demonstrate that any systems-based analysis on the genomic level is incomplete without a network understanding of interactions on the molecular level.
Our study has four major objectives. The first objective is to compare community detection algorithms for benchmark networks as well as 10 protein interaction networks. Second, we provide an in-depth analysis of the biological meaning of the predicted modules across a variety of different biological aspects. Third, because all PINs are inferred from experimental data, they carry a certain uncertainty with respect to the correctness of the inferred interactions. For this reason, we perform a robustness analysis of the predicted modules by perturbing the PINs through edge deletions. Finally, we investigate overlapping pathways that may form functional bridges between more specialized modules.
For the community detection analysis, we use the 5 most popular module detection algorithms that have been developed for application to large networks, namely fast-greedy [27], walktrap [28], label propagation [29], spinglass [30] and multi-level community [31], and propose in addition 4 correlation-based module prediction methods. Briefly, for our approaches, we assign weights to each pair of nodes depending on the distance between them in the network and utilize this for the module prediction. This provides competitive modularity measures for artificial and biological networks in comparison to other community detection algorithms. The details of all measures are given in the Methods section.
Typically, for large real networks there is only limited information available about the true module structure within these networks because of our lack of understanding of the underlying phenomena. However, for protein networks we can make use of the Gene Ontology (GO) database [32], which provides a comprehensive overview of thousands of biological processes in a variety of different organisms. Utilizing this information allows a biologically meaningful assessment of the predicted modules. Specifically, in our analysis, we use protein networks of 10 different species to investigate the modularity predicted by the different community detection algorithms.
This paper is organized as follows. In the next section, we describe all methods, measures and data sets used for our analysis, including a description of the protein interaction networks. In the Results section, we present our numerical findings and this paper finishes with the Conclusions section summarizing and discussing our results.
Methods
Modularity
Here \(k_{v}\) is the degree of node \(v \in i\).
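As an illustration, the following minimal sketch computes modularity in the widely used form \(Q = \sum_{i} (e_{ii} - a_{i}^{2})\), where \(e_{ii}\) is the fraction of edges inside community i and \(a_{i}\) is the sum of the degrees \(k_{v}\) of the nodes in i divided by twice the number of edges. This form is assumed here because the defining equation is not reproduced above; the function name, the edge-list input and the toy network are illustrative choices only.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q = sum_i (e_ii - a_i^2) of an undirected, unweighted graph.

    edges      -- iterable of (u, v) pairs, each undirected edge listed once
    membership -- dict mapping every node to its community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in community i
    degree_sum = defaultdict(int)  # sum of degrees k_v over nodes v in community i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy network: two triangles joined by one bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_two_modules = modularity(toy_edges, toy_membership)
q_one_module = modularity(toy_edges, {node: "all" for node in range(6)})
```

Placing all nodes in a single community yields Q = 0, whereas the split into the two triangles yields a positive value, which is the behaviour one would expect from a measure that is later used to select the number of clusters.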
Fast-greedy algorithm
Walktrap algorithm
In the first step of the algorithm, all nodes are considered as individual communities. In the second step, the two closest communities are merged based on the distance between them, and the community structure is updated. Then the second step is repeated until all communities are merged into one community.
Label propagation algorithm
1. Assign a unique label to each node.
2. Order the nodes randomly.
3. Going through the nodes in this order, give each selected node the label that occurs most frequently in its neighbourhood.
4. If every node carries a label that occurs most frequently in its neighbourhood, stop the algorithm; otherwise repeat step 3.
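The steps above can be sketched as follows. This is a minimal illustration rather than the reference implementation of [29]: the random tie-breaking, the reshuffling of the node order in every sweep, the `max_sweeps` safeguard and the function names are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def _top_labels(node, labels, neighbours):
    """Labels that occur most frequently among the neighbours of a node."""
    counts = Counter(labels[n] for n in neighbours[node])
    best = max(counts.values())
    return [label for label, count in counts.items() if count == best]


def label_propagation(edges, seed=0, max_sweeps=100):
    """Asynchronous label propagation on an undirected edge list."""
    rng = random.Random(seed)
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    labels = {node: node for node in neighbours}  # step 1: unique labels
    nodes = list(neighbours)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                        # step 2: random order
        for node in nodes:                        # step 3: adopt a majority label
            top = _top_labels(node, labels, neighbours)
            if labels[node] not in top:
                labels[node] = rng.choice(top)    # ties broken at random (assumption)
        # step 4: stop once every node carries one of its majority labels
        if all(labels[n] in _top_labels(n, labels, neighbours) for n in nodes):
            break
    return labels


# Hypothetical toy input: two disconnected triangles.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
toy_labels = label_propagation(two_triangles, seed=1)
```

Because labels only spread along edges, the two disconnected triangles necessarily end up with different labels, while each triangle agrees on one label internally.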
Spinglass community algorithm
This method was proposed in [30]. In this approach, community detection is mapped to finding the ground state of an infinite-range Potts spin glass model by combining the information from both present and missing links, where the clusters are represented by the occupied spin states. In the Spinglass algorithm, existing edges within a community and missing edges between communities are rewarded, while missing edges within a community and existing edges between communities are penalized.
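To make the reward and penalty scheme concrete, the sketch below evaluates the energy of a given spin assignment using the commonly cited reduced form of such a Hamiltonian, \(H = -\sum_{i<j} (A_{ij} - \gamma p_{ij}) \delta(\sigma_{i}, \sigma_{j})\), with the configuration-model expectation \(p_{ij} = k_{i} k_{j} / 2m\). The choice of this reduced form, the null model, the parameter `gamma` and the function name are assumptions for illustration; the text above does not specify them, and the sketch only scores a partition rather than searching for the ground state.

```python
from collections import defaultdict
from itertools import combinations


def spinglass_energy(edges, spins, gamma=1.0):
    """Energy of a spin (community) assignment; lower values are better.

    edges -- iterable of (u, v) pairs of a simple undirected graph (no self-loops)
    spins -- dict mapping every node to its spin state (community label)
    """
    edge_set = {frozenset(edge) for edge in edges}
    m = len(edge_set)
    degree = defaultdict(int)
    for edge in edge_set:
        for node in edge:
            degree[node] += 1
    energy = 0.0
    for i, j in combinations(sorted(degree), 2):
        if spins[i] != spins[j]:
            continue  # only pairs in the same spin state contribute
        a_ij = 1.0 if frozenset((i, j)) in edge_set else 0.0
        p_ij = degree[i] * degree[j] / (2.0 * m)  # configuration null model (assumption)
        energy -= a_ij - gamma * p_ij             # present link lowers, missing link raises
    return energy


# Hypothetical toy network: two triangles joined by one bridge edge.
bridge_graph = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good_energy = spinglass_energy(bridge_graph, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1})
bad_energy = spinglass_energy(bridge_graph, {0: 0, 3: 0, 1: 1, 4: 1, 2: 2, 5: 2})
single_energy = spinglass_energy(bridge_graph, {n: n for n in range(6)})
```

Grouping each triangle into one spin state gives a negative energy, grouping unconnected node pairs gives a positive energy, and assigning every node its own state gives zero.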
Multi-level community algorithm
This method was proposed in [31]. The algorithm is divided into two phases. In the first phase, all nodes are considered as independent communities. Communities are then merged into a larger community if the modularity of the network increases. The first phase stops when there is no further increase in the modularity. In the second phase, each community is represented as a single node, and the edges between and within communities are replaced by weighted edges: all edges between two communities are replaced by a single weighted edge, and all edges within a community are replaced by a weighted self-loop. After the construction of this new weighted network, the first phase is repeated to obtain an improvement in modularity. These two phases are iterated until there is no further improvement in the modularity of the network.
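The second phase can be illustrated with a short sketch that collapses a network into its community-level weighted network. The function name and the data layout are assumptions for this example, and the self-loop weight is taken here as the plain sum of the internal edge weights; some implementations count internal edges twice instead.

```python
from collections import defaultdict


def aggregate_communities(weighted_edges, membership):
    """Collapse each community into one node (second phase only).

    weighted_edges -- iterable of (u, v, weight) for an undirected graph
    membership     -- dict mapping every node to its community label
    Returns a dict {(c1, c2): weight} with c1 <= c2; entries with c1 == c2
    are self-loops that carry the weight of the edges inside a community.
    """
    collapsed = defaultdict(float)
    for u, v, weight in weighted_edges:
        cu, cv = membership[u], membership[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        collapsed[key] += weight
    return dict(collapsed)


# Hypothetical toy network: two triangles joined by one bridge edge, unit weights.
unit_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
              (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
coarse = aggregate_communities(unit_edges, communities)
```

The resulting coarse network can be fed back into the first phase, with the community labels now playing the role of nodes.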
Correlation based hierarchical clustering
We use these two different distance measures for hierarchical clustering (Ward's algorithm). To obtain the optimal number of clusters, we use the modularity measure by Newman [27] described in the “Modularity” section.
Data
In the Results section, we first analyze the performance of the community detection algorithms with artificially generated benchmark networks, and then we study protein interaction networks of different species. A description of these networks is provided in the following subsections.
Benchmark networks
(1) The degree, d, of each node is randomly assigned from a power law distribution with exponent γ, which in our case is 1. The degree distribution depends on the maximum degree, \(d_{max} = \{20, 40\}\), and the average degree, \(d_{avg} = 10\), selected as input.
(2) Each node is assigned a fraction of edges, μ, that are shared with nodes of other communities; the remaining fraction, 1−μ, is shared within its own community.
(3) The community sizes are bounded by \(k_{min}\) and \(k_{max}\), which are chosen such that \(k_{min} > d_{min}\) and \(k_{max} > d_{max}\), so that each node can be assigned to a community. The community sizes are drawn from a power law distribution so that the sum of the nodes in all communities is equal to the number of nodes in the network.
(4) Initially, no node is assigned to a community. Nodes are then assigned randomly to a community if the community size exceeds the number of neighbours of the node within the community. This step is repeated until all nodes are assigned to a community.
(5) In order to ensure that each node closely approximates the fractions μ and 1−μ for external and internal edges, several rewiring steps are iterated.
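The role of the mixing parameter in steps (2) and (5) can be checked on any network with a known community assignment. The sketch below is an illustrative helper, not part of the benchmark generator [34]; it computes, for each node, the fraction of its edges that lead to other communities, which is the per-node quantity that the rewiring step drives towards μ.

```python
from collections import defaultdict


def mixing_fractions(edges, membership):
    """Per-node fraction of edges that connect to a different community."""
    internal = defaultdict(int)
    external = defaultdict(int)
    for u, v in edges:
        same = membership[u] == membership[v]
        for node in (u, v):
            if same:
                internal[node] += 1
            else:
                external[node] += 1
    nodes = set(internal) | set(external)
    return {n: external[n] / (internal[n] + external[n]) for n in nodes}


# Hypothetical toy network: two triangles joined by one bridge edge.
benchmark_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
planted = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mu_per_node = mixing_fractions(benchmark_edges, planted)
mu_average = sum(mu_per_node.values()) / len(mu_per_node)
```

In this toy case only the two bridge nodes have a non-zero fraction (one of their three edges is external), so the average stays small, which corresponds to well-separated communities.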
For our analysis, we generated networks with |V|=1000 vertices and non-overlapping communities by varying the following parameters: average degree, maximum degree, minimum cluster size, maximum cluster size and mixing parameter μ= {0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50}.
Protein interaction networks
A list of protein networks used for detecting communities by different community detection algorithms
Tax id | Biological name | No. of vertices | No. of interactions | Edge density |
---|---|---|---|---|
10090 | House mouse | 5057 | 11560 | 0.000904 |
10116 | Norway rat | 1710 | 2582 | 0.001767 |
237561 | Candida albicans SC5314 | 304 | 316 | 0.006860 |
284812 | Schizosaccharomyces pombe 972h | 3854 | 55054 | 0.007414 |
36329 | Plasmodium falciparum 3D7 | 1172 | 2415 | 0.003519 |
3702 | Arabidopsis Thaliana | 7103 | 17752 | 0.000703 |
559292 | Saccharomyces cerevisiae S288c | 6008 | 227836 | 0.012620 |
6239 | Caenorhabditis elegans | 3701 | 7695 | 0.001123 |
7227 | Drosophila melanogaster (fruit fly) | 8017 | 38973 | 0.001212 |
9606 | Homo sapiens | 15795 | 159278 | 0.001276 |
As one can see in Table 1 these biological networks show a large variety in the network parameters such as number of nodes and number of edges.
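The tabulated edge densities appear to be consistent with the usual definition for an undirected simple graph, \(2|E| / (|V|(|V|-1))\). This is an inference from the numbers rather than a definition given in the text, and the function name below is an illustrative choice.

```python
def edge_density(n_vertices, n_edges):
    """Density of an undirected simple graph: 2|E| / (|V| (|V| - 1))."""
    if n_vertices < 2:
        return 0.0
    return 2.0 * n_edges / (n_vertices * (n_vertices - 1))


# Vertex and interaction counts taken from the table above.
density_mouse = edge_density(5057, 11560)
density_rat = edge_density(1710, 2582)
```

For the house mouse and Norway rat networks this gives approximately 0.000904 and 0.001767, in agreement with the table; for some other rows the last digit differs slightly, presumably because of rounding.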
Normalized mutual information (NMI)
In order to assess the predicted modules of the algorithms quantitatively, we use the normalized mutual information (NMI) [37–39].
A contingency table which defines overlap between two communities, U and V
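\(U \downarrow \backslash V \rightarrow\) | \(V_{1}\) | \(V_{2}\) | \(\cdots\) | \(V_{C}\) | Sums |
---|---|---|---|---|---|
\(U_{1}\) | \(n_{11}\) | \(n_{12}\) | \(\cdots\) | \(n_{1C}\) | \(a_{1}\) |
\(U_{2}\) | \(n_{21}\) | \(n_{22}\) | \(\cdots\) | \(n_{2C}\) | \(a_{2}\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
\(U_{R}\) | \(n_{R1}\) | \(n_{R2}\) | \(\cdots\) | \(n_{RC}\) | \(a_{R}\) |
Sums | \(b_{1}\) | \(b_{2}\) | \(\cdots\) | \(b_{C}\) | \(N\) |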
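A minimal sketch of how an NMI value can be obtained from such a contingency table is given below. It uses the normalization \(2 I(U,V) / (H(U) + H(V))\), which is one common choice; since the defining equation is not reproduced here, this normalization, the function name and the label-list input are assumptions for illustration.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_u, labels_v):
    """NMI = 2 I(U, V) / (H(U) + H(V)) for two labelings of the same nodes."""
    if len(labels_u) != len(labels_v):
        raise ValueError("both labelings must cover the same nodes")
    n = len(labels_u)
    n_ij = Counter(zip(labels_u, labels_v))  # contingency table entries n_ij
    a = Counter(labels_u)                    # row sums a_i
    b = Counter(labels_v)                    # column sums b_j
    mutual = sum(c / n * math.log(c * n / (a[i] * b[j]))
                 for (i, j), c in n_ij.items())
    h_u = -sum(c / n * math.log(c / n) for c in a.values())
    h_v = -sum(c / n * math.log(c / n) for c in b.values())
    if h_u + h_v == 0.0:
        return 1.0  # both labelings consist of a single community
    return 2.0 * mutual / (h_u + h_v)


# Hypothetical labelings of four nodes.
nmi_same = normalized_mutual_information([0, 0, 1, 1], ["x", "x", "y", "y"])
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical partitions give 1 regardless of how the communities are named, and statistically independent partitions give 0.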
Results
Benchmark networks
We start our analysis by investigating the performance of community detection algorithms when applied to benchmark networks. The benchmark networks are generated by an algorithm [34], as described in the Methods section, that results in networks with a predefined modularity structure. Hence, it is known that the networks have a module structure, and they can be used as a reference to quantify the performance of the community detection algorithms in an objective manner.
In the following, we study various parameters of the benchmark algorithm to generate benchmark networks. Specifically, we set the network size to |V|=1000 nodes, the average degree of the vertices to \(d_{i}^{avg} = 10 \) and the maximum degree to \(d_{i}^{max} = 20\). We vary the minimum community-size parameter over \(k_{min} = \{10, 20, 50, 70, 100, 150\}\) and the maximum community-size parameter over \(k_{max} = \{20, 50, 70, 100, 150, 200\}\). For the mixing parameter, we study values in the set μ={0.05,0.10,0.15,0.20,0.25,0.30,0.40,0.50}. For each parameter combination, we generate 50 networks, resulting in a population of benchmark networks with the same characteristics but random variations. This allows an assessment of the robustness of the results with respect to stochastically occurring structural changes in the networks.
Normalized mutual information of different module detection algorithms for the benchmark networks
Overall, the figure shows that the performance of all module detection algorithms deteriorates as the mixing parameter, μ, increases. Compared to all other algorithms, the Label propagation algorithm underperforms for all values of μ, and the Spinglass community algorithm performs better than all other algorithms, except for low values of the mixing parameter. This indicates that the method has an optimal working point for intermediately connected modules, which is counterintuitive. Furthermore, our distance measure-based approaches, notably A*SP Pearson and A*SP Spearman, generally show good performance and even compare favourably with Fast greedy and Walktrap.
Performance of module detection algorithms by adding random edges
A comparison of the modularity of different module detection algorithms, shown as plots of modularity against the mixing parameter (μ) in synthetic networks. The synthetic networks are modelled by adding a certain percentage of random edges to the networks: (a) 5 %, (b) 10 %, (c) 20 %, (d) 30 %, (e) 40 % and (f) 50 % of the total number of edges are randomly added as additional edges
Biological networks
Next, we extend our investigation to biological networks. Specifically, we use 10 PPI networks from different species. Details of these networks can be found in Table 1.
Modularity in PPI networks
Modularity, Q, of PPI networks detected by different module detection algorithms
Organism | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
House mouse | 0.4903 | 0.4168 | 0.4281 | 0.3999 | 0.5647 | 0.4578 | 0.6066 | 0.6239 | 0.5265 |
Norway rat | 0.5061 | 0.2901 | 0.4985 | 0.4948 | 0.6608 | 0.5089 | 0.6682 | 0.6683 | 0.5951 |
Candida albicans SC5314 | 0.4571 | 0.4629 | 0.4625 | 0.4629 | 0.4757 | 0.428 | 0.4757 | 0.4728 | 0.4689 |
Schizosaccharomyces pombe 972h | 0.1669 | 0.1673 | 0.1005 | 0.128 | 0.2396 | 3e-04 | 0.2516 | 0.268 | 0.1545 |
Plasmodium falciparum 3D7 | 0.4775 | 0.4576 | 0.4713 | 0.466 | 0.5171 | 0.0066 | 0.5222 | 0.5396 | 0.3505 |
Arabidopsis Thaliana | 0.6635 | 0.6004 | 0.5824 | 0.5781 | 0.6893 | 0.6977 | 0.7296 | 0.742 | 0.6822 |
Saccharomyces cerevisiae S288c | 0.2108 | 0.2055 | 0.0399 | 0.0283 | 0.2557 | 1e-04 | 0.2532 | 0.2741 | 0.2221 |
Caenorhabditis elegans | 0.5141 | 0.5087 | 0.5023 | 0.4989 | 0.6042 | 0.1872 | 0.6106 | 0.6231 | 0.5268 |
Drosophila melanogaster | 0.4509 | 0.4491 | 0.4124 | 0.4238 | 0.471 | 0.2608 | 0.5232 | 0.5307 | 0.3865 |
Homo sapiens | 0.2045 | 0.0898 | 0.0708 | 0.0655 | 0.2877 | 1e-04 | 0.3498 | 0.3612 | 0.253 |
Average modularity | 0.4141 | 0.3648 | 0.3568 | 0.3546 | 0.4765 | 0.2547 | 0.4990 | 0.5103 | 0.4166 |
Number of modules of PPI networks detected by different module detection algorithms
Organism | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
House mouse | 9 | 8 | 10 | 6 | 71 | 95 | 28 | 25 | 360 |
Norway rat | 26 | 4 | 42 | 55 | 28 | 56 | 28 | 25 | 123 |
Candida albicans SC5314 | 16 | 14 | 12 | 14 | 14 | 11 | 14 | 12 | 13 |
Schizosaccharomyces pombe 972h | 5 | 11 | 2 | 2 | 20 | 5 | 8 | 13 | 582 |
Plasmodium falciparum 3D7 | 15 | 16 | 25 | 26 | 18 | 4 | 22 | 22 | 179 |
Arabidopsis Thaliana | 15 | 13 | 20 | 14 | 57 | 190 | 36 | 25 | 390 |
Saccharomyces cerevisiae S288c | 14 | 10 | 8 | 13 | 5 | 2 | 8 | 10 | 319 |
Caenorhabditis elegans | 10 | 6 | 9 | 6 | 38 | 51 | 29 | 25 | 351 |
Drosophila melanogaster | 21 | 21 | 20 | 19 | 55 | 24 | 29 | 25 | 884 |
Homo sapiens | 30 | 20 | 34 | 63 | 89 | 3 | 13 | 21 | 3425 |
The first observation we make is that the best performing algorithms are the Multilevel and the Spinglass community algorithms. Interestingly, for some organisms, e.g., Schizosaccharomyces pombe and Homo sapiens, the Label propagation algorithm fails almost entirely to detect communities. In contrast, Fast-greedy and Walktrap find acceptable modularity values for the networks for which the Label propagation algorithm has problems. Among the distance-based measures, \(D_{M_{pearson}}\) is the best performing method.
For the predicted number of modules, the Walktrap algorithm results in many more modules than any other method, whereas the remaining methods predict a comparable number of modules. For instance, for the PPI network of Homo sapiens (9606), Walktrap predicts 38 times more modules than Fast-greedy and 163 times more modules than the Spinglass method. This is interesting because the larger number of modules does not translate into superior modularity values Q (see Table 3).
Distribution of the size of modules detected in PPI networks by different module detection algorithms
Considering the agreement among different methods, the module structure of Candida albicans is least different and, hence, shows the highest level of consensus. For this organism, even Walktrap results in a moderate number of predicted modules, which is comparable to all other methods.
Scatter plot between the number of modules and the modularity. Each method is color coded by a different color. The shown curves correspond to Least Squares regression models. For A*SP Pearson, no statistically significant model could be fit that would be different from a horizontal line
Interestingly, the A*SP Pearson algorithm is somehow located between these models, in the sense that the best linear fit would only use an intercept but no slope, while the quadratic regression narrowly misses significance, with p-values of 0.08 for both the linear and the quadratic term but higher adjusted \(R^{2}\) values. For this reason, we do not include results from the regression in Fig. 4.
Comparison of algorithms
In order to investigate the similarity of the modules identified by different algorithms in detail, we again use the NMI measure. However, this time we use the NMI to compare the predicted community structure of one method with the predicted community structure of another method. In this way, the similarity of the predicted communities is assessed. In other words, this analysis provides information about the consistency of results among different methods but does not allow us to gain insights into the absolute quality of the predicted module structures, because the ground truth does not enter this analysis.
Similarity of the predicted module structures in PPI networks assessed by the NMI. The values of the NMI are color coded, as indicated by the color bar in each figure, showing the range of assumed values
By looking at the scale of the NMI values, one can see that for Candida albicans the lower end of the scale is higher than for all other organisms, with values ranging from 0.86 to 1.00. This indicates that the similarity among all community detection algorithms is highest for this PPI network, confirming our observation in Fig. 3, where we have seen that the variation of the size of modules is similar and quite small for all methods. Finally, we want to note that, in general, the distance-based measures show a higher similarity among each other than to the other community detection algorithms.
Robustness of module detection regarding perturbations
Our next analysis investigates the robustness of the predicted modules for perturbed PPI networks. Specifically, we test how a module detection algorithm changes its performance if some interactions in a PPI network are randomly deleted. The rationale of our analysis is based on the assumption that biological networks, and the interactions they are made of, are not known with absolute certainty. Instead, some interactions present in our PPI networks may be false positives due to measurement errors. Since all PPI networks we are using are inferred from experimental data, we think this assumption is very reasonable.
In order to study the effect of false positive interactions, we generate 20 perturbed networks for each PPI network, \(G^{sub}_{1}, G^{sub}_{2} \dots G^{sub}_{20}\), by randomly deleting 5 % of the edges in a PPI network. In order to make sure that the resulting networks are still connected, we remove only edges from nodes having a degree of \(D(v_{i}, G) \geq 2\) and prevent removal of the last remaining edge. Then, we apply the community detection algorithms to the networks, \(G^{sub}_{1}, G^{sub}_{2} \dots G^{sub}_{20}\), and compare the predicted modules with the results from the unperturbed PPI network by using the NMI.
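The perturbation procedure can be sketched as follows. The code is one possible reading of the description above: an edge is only deleted if both of its endpoints currently have a degree of at least 2, so that no node loses its last edge. The function name, the single pass over a shuffled edge list and the ring-shaped toy network are assumptions for illustration. Note that this degree constraint prevents isolated nodes but does not by itself guarantee that the whole network stays in one connected component.

```python
import random
from collections import defaultdict


def perturb_network(edges, fraction=0.05, seed=None):
    """Randomly delete about `fraction` of the edges of a simple undirected graph.

    An edge (u, v) is only deleted while both u and v have degree >= 2,
    so no node ever loses its last remaining edge.
    """
    rng = random.Random(seed)
    edges = list(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_remove = int(round(fraction * len(edges)))
    order = list(range(len(edges)))
    rng.shuffle(order)
    removed = set()
    for idx in order:
        if len(removed) >= n_remove:
            break
        u, v = edges[idx]
        if degree[u] >= 2 and degree[v] >= 2:
            degree[u] -= 1
            degree[v] -= 1
            removed.add(idx)
    return [edge for idx, edge in enumerate(edges) if idx not in removed]


# Hypothetical toy network: a ring of 100 nodes; 20 perturbed copies as in the text.
ring = [(i, (i + 1) % 100) for i in range(100)]
perturbed = [perturb_network(ring, fraction=0.05, seed=s) for s in range(20)]
```

Each perturbed copy would then be passed to a community detection algorithm and its partition compared with that of the unperturbed network.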
Robustness of module detection regarding perturbation of the PPI networks. Distribution of NMI values comparing communities obtained from the unperturbed and perturbed PPI networks generated by randomly deleting 5 % of the edges
Overall, the results show that even a moderate change in a PPI network usually leads to quite large changes in the predicted module structure, regardless of the algorithm or the organism.
Biological meaning of predicted modules
So far, we have focused on the more technical aspects of predicted modules. Now we switch gears by investigating the biological meaning of these modules. We do this by using external information, not included in the network structure itself, for assessing the predicted modules. As the source of this external information we use the Gene Ontology (GO) database [32], which provides comprehensive information about the involvement of genes in diverse biological processes across many organisms.
Number of statistically significant pathways as identified by a Fisher’s exact test that are enriched in the predicted modules in the PPI networks
Organism | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap | Total pathways |
---|---|---|---|---|---|---|---|---|---|---|
House Mouse | 608 | 617 | 476 | 477 | 949 | 818 | 817 | 801 | 903 | 7057 |
Norway rat | 182 | 35 | 159 | 164 | 265 | 97 | 315 | 311 | 147 | 5012 |
Schizosaccharomyces pombe 972h | 33 | 48 | 11 | 8 | 25 | 2 | 66 | 78 | 98 | 1115 |
Plasmodium falciparum 3D7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51 |
Arabidopsis Thaliana | 466 | 458 | 485 | 433 | 657 | 834 | 747 | 668 | 869 | 2506 |
Saccharomyces cerevisiae S288c | 780 | 772 | 265 | 199 | 944 | 10 | 1019 | 993 | 883 | 3236 |
Caenorhabditis elegans | 60 | 114 | 104 | 105 | 203 | 101 | 145 | 203 | 249 | 1823 |
Drosophila melanogaster | 757 | 778 | 618 | 640 | 700 | 206 | 812 | 837 | 863 | 3984 |
Homo sapiens | 1277 | 762 | 467 | 536 | 1321 | 8 | 1863 | 2011 | 1824 | 9371 |
In the last column of this table, the total number of tested biological processes is shown as a reference. Overall, the Multilevel and Spinglass community detection algorithms have the largest number of enriched biological pathways. In general, however, these numbers are not too far apart from those of the remaining methods, with some exceptions. It is interesting to note that for Plasmodium falciparum (36329) none of the algorithms predicts modules that contain at least one enriched pathway. The reason for this may lie in the very small number of total pathways (51) tested for this organism.
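For a single module and a single pathway, an enrichment test of this kind can be sketched as a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The one-sided form, the function and parameter names and the toy numbers are assumptions for illustration; the text does not specify these details, nor whether a multiple-testing correction was applied.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_module, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    n_background -- number of annotated proteins in the network
    n_pathway    -- number of those proteins annotated to the pathway
    n_module     -- number of proteins in the module
    n_overlap    -- number of module proteins annotated to the pathway
    Returns P(X >= n_overlap) under the hypergeometric null model.
    """
    upper = min(n_pathway, n_module)
    total = comb(n_background, n_module)
    tail = sum(comb(n_pathway, k) * comb(n_background - n_pathway, n_module - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Hypothetical toy numbers: 10 proteins, 5 in the pathway, a module of size 5.
p_full_overlap = enrichment_p_value(10, 5, 5, 5)
p_no_overlap = enrichment_p_value(10, 5, 5, 0)
```

A pathway would be counted as enriched in a module if this p-value falls below a chosen significance threshold; with very few annotated pathways, as for Plasmodium falciparum, few tests are available in the first place.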
Percentage of statistically significant pathways (%) as identified by a Fisher’s exact test that are enriched in the identified modules in the PPI networks
Organism | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
House mouse | 8.62 | 8.74 | 6.75 | 6.76 | 13.45 | 11.59 | 11.58 | 11.35 | 12.80 |
Norway rat | 3.63 | 0.70 | 3.17 | 3.27 | 5.29 | 1.94 | 6.28 | 6.21 | 2.93 |
Schizosaccharomyces pombe 972h | 2.96 | 4.30 | 0.99 | 0.72 | 2.24 | 0.18 | 5.92 | 7.00 | 8.79 |
Plasmodium falciparum 3D7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Arabidopsis Thaliana | 18.60 | 18.28 | 19.35 | 17.28 | 26.22 | 33.28 | 29.81 | 26.66 | 34.68 |
Saccharomyces cerevisiae S288c | 24.10 | 23.86 | 8.19 | 6.15 | 29.17 | 0.31 | 31.49 | 30.69 | 27.29 |
Caenorhabditis elegans | 3.29 | 6.25 | 5.70 | 5.76 | 11.14 | 5.54 | 7.95 | 11.14 | 13.66 |
Drosophila melanogaster (fruit fly) | 19.00 | 19.53 | 15.51 | 16.06 | 17.57 | 5.17 | 20.38 | 21.01 | 21.66 |
Homo sapiens | 13.63 | 8.13 | 4.98 | 5.72 | 14.10 | 0.09 | 19.88 | 21.46 | 19.46 |
Bar plots of the number of pathways that are enriched in multiple modules. The numbers inside each bar correspond to the maximum number of modules to which pathways are enriched
Bar plots of pathways which are enriched in two or more organisms. The numbers in each figure are showing the total number of pathways that are enriched
GO pathways that are enriched in more than one module predicted by the Spinglass and Multilevel community detection algorithms and that are common among 6 organisms (see Fig. 8)
Algorithm | GO pathway | Name |
---|---|---|
Multilevel | GO:0006139 | Nucleobase-containing compound metabolic process |
Multilevel | GO:0007154 | Cell communication |
Multilevel | GO:0090304 | Nucleic acid metabolic process |
Spinglass | GO:0006139 | Nucleobase-containing compound metabolic process |
Spinglass | GO:0006725 | Cellular aromatic compound metabolic process |
Spinglass | GO:0006807 | Nitrogen compound metabolic process |
Spinglass | GO:0010467 | Gene expression |
Spinglass | GO:0016070 | RNA metabolic process |
Spinglass | GO:0034641 | Cellular nitrogen compound metabolic process |
Spinglass | GO:0044260 | Cellular macromolecule metabolic process |
Spinglass | GO:0046483 | Heterocycle metabolic process |
Spinglass | GO:0090304 | Nucleic acid metabolic process |
Spinglass | GO:1901360 | Organic cyclic compound metabolic process |
Subnetwork analysis of Homo sapiens obtained from different experimental methods
Bar plots of the number of pathways (CORUM complex) that are enriched in multiple modules of PPI subnetworks of Homo Sapiens from different experimental types. The numbers inside each bar correspond to the maximum number of modules to which pathways are enriched
Subnetwork of PPI interactions of Human obtained from different experimental types
Experiment type | No. of vertices | No. of interactions | Edge density |
---|---|---|---|
Affinity chromatography | 13124 | 82900 | 0.000962 |
Two hybrid | 9844 | 37280 | 0.000769 |
Biochemical | 3686 | 20083 | 0.00295 |
Pull down | 5714 | 10957 | 0.00067 |
Modularity, Q, of PPI subnetworks detected by different module detection algorithms
Experimental type | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
Affinity chromatography | 0.057 | 0.027 | 0.079 | 0.078 | 0.319 | 0.0004 | 0.352 | 0.372 | 0.221 |
Two hybrid | 0.424 | 0.418 | 0.371 | 0.368 | 0.456 | 0.0030 | 0.495 | 0.508 | 0.406 |
Biochemical | 0.572 | 0.575 | 0.515 | 0.527 | 0.529 | 0.0715 | 0.585 | 0.611 | 0.512 |
Pull down | 0.535 | 0.424 | 0.450 | 0.446 | 0.649 | 0.5650 | 0.666 | 0.676 | 0.569 |
Total number of modules of PPI subnetworks detected by different module detection algorithms
Experimental type | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
Affinity chromatography | 104 | 12 | 11 | 42 | 70 | 7 | 11 | 23 | 4215 |
Two hybrid | 19 | 14 | 18 | 20 | 88 | 15 | 27 | 25 | 995 |
Biochemical | 9 | 15 | 13 | 10 | 58 | 31 | 24 | 22 | 604 |
Pull down | 11 | 7 | 17 | 13 | 75 | 156 | 46 | 25 | 370 |
Total number of significant CORUM complexes enriched in at least one module of the PPI subnetworks detected by different module detection algorithms
Experimental type | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap | Total pathways |
---|---|---|---|---|---|---|---|---|---|---|
Affinity chromatography | 193 | 82 | 11 | 39 | 144 | 0 | 151 | 177 | 214 | 431 |
Two hybrid | 19 | 12 | 0 | 6 | 9 | 0 | 28 | 28 | 52 | 361 |
Biochemical | 91 | 144 | 101 | 93 | 86 | 52 | 121 | 148 | 152 | 325 |
Pull down | 37 | 25 | 24 | 11 | 108 | 94 | 101 | 96 | 105 | 321 |
Fraction of significant CORUM complexes (relative to the total number tested) enriched in at least one module of the PPI subnetworks detected by different module detection algorithms
Experimental type | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
Affinity chromatography | 0.448 | 0.190 | 0.026 | 0.090 | 0.334 | 0.000 | 0.350 | 0.411 | 0.497 |
Two hybrid | 0.044 | 0.028 | 0.000 | 0.014 | 0.021 | 0.000 | 0.065 | 0.065 | 0.121 |
Biochemical | 0.211 | 0.334 | 0.234 | 0.216 | 0.200 | 0.121 | 0.281 | 0.343 | 0.353 |
Pull down | 0.086 | 0.058 | 0.056 | 0.026 | 0.251 | 0.218 | 0.234 | 0.223 | 0.244 |
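The fractions in this table can be reproduced from the counts in the preceding table. The following is a minimal sketch, not the authors' code: it assumes that every row is divided by the same denominator, namely the largest value in the "Total pathways" column (431), because that choice matches the reported values, whereas dividing by each row's own total does not. The names `counts`, `total_pathways`, `COMMON_DENOMINATOR` and `enriched_fraction` are illustrative, and only three of the nine algorithms are included to keep the example short.

```python
# Counts of significant CORUM complexes, copied from the counts table above
# (three of the nine algorithms, to keep the sketch short).
counts = {
    "Affinity chromatography": {"Multilevel": 151, "Spinglass": 177, "Walktrap": 214},
    "Two hybrid": {"Multilevel": 28, "Spinglass": 28, "Walktrap": 52},
    "Biochemical": {"Multilevel": 121, "Spinglass": 148, "Walktrap": 152},
    "Pull down": {"Multilevel": 101, "Spinglass": 96, "Walktrap": 105},
}

# "Total pathways" column of the same table.
total_pathways = {
    "Affinity chromatography": 431,
    "Two hybrid": 361,
    "Biochemical": 325,
    "Pull down": 321,
}

# Assumption: one common denominator (the largest total) is used for all rows.
COMMON_DENOMINATOR = max(total_pathways.values())


def enriched_fraction(count, denominator):
    """Fraction of complexes enriched in at least one module, to 3 decimals."""
    return round(count / denominator, 3)


fractions = {
    experiment: {
        algorithm: enriched_fraction(n, COMMON_DENOMINATOR)
        for algorithm, n in row.items()
    }
    for experiment, row in counts.items()
}

for experiment, row in fractions.items():
    print(experiment, row)
```

For comparison, `enriched_fraction(52, total_pathways["Two hybrid"])` uses the row-specific total of 361 and yields a larger value than the one reported for Walktrap on the two-hybrid subnetwork, which is why the common denominator is assumed here.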
CORUM complexes enriched in more than one module predicted by the Spinglass and Multilevel community detection algorithms
Experimental type | Algorithm | Common CORUM complexes |
---|---|---|
Affinity chromatography technology | Multilevel | 55S ribosome, mitochondrial |
Affinity chromatography technology | Spinglass | RNA polymerase II complex, chromatin structure modifying |
Two hybrid | Multilevel | C complex spliceosome |
Two hybrid | Spinglass | - |
Biochemical | Multilevel | - |
Biochemical | Spinglass | - |
Pull down | Multilevel | PA700-20S-PA28 complex; BRCA1-RNA polymerase II complex; Spliceosome; 18S U11/U12 snRNP; C complex spliceosome; 17S U2 snRNP |
Pull down | Spinglass | RNA polymerase II holoenzyme complex; BRCA1-RNA polymerase II complex |
Time complexity of the algorithms
Estimated time, in seconds, to detect modules in biological networks by different module detection algorithms
Organism | \(D_{M_{pearson}}\) | \(D_{M_{spearman}}\) | \(D_{sp_{pearson}}\) | \(D_{sp_{spearman}}\) | Fast greedy | Label propagation | Multilevel | Spinglass | Walktrap |
---|---|---|---|---|---|---|---|---|---|
House mouse | 231.8423 | 243.8301 | 230.2666 | 247.3037 | 1.1767 | 0.1042 | 0.0583 | 236.0281 | 1.6766 |
Norway rat | 13.2490 | 12.9271 | 11.7737 | 13.0321 | 0.1114 | 0.0084 | 0.0102 | 45.1722 | 0.2000 |
Candida albicans SC5314 | 2.1909 | 0.1740 | 0.1604 | 0.1898 | 0.0091 | 0.0025 | 0.0019 | 6.5747 | 0.0354 |
Schizosaccharomyces pombe 972h | 114.2011 | 116.9353 | 107.3772 | 116.8714 | 2.8521 | 0.0216 | 0.2264 | 468.3092 | 3.3914 |
Plasmodium falciparum 3D7 | 6.5812 | 5.2746 | 3.9287 | 4.3812 | 0.0227 | 0.0139 | 0.0242 | 43.3769 | 0.1493 |
Arabidopsis thaliana | 630.2055 | 650.8486 | 636.2166 | 651.3501 | 1.2628 | 0.0968 | 0.0748 | 346.2693 | 3.2913 |
Saccharomyces cerevisiae S288c | 415.8757 | 430.9147 | 411.8197 | 422.0503 | 183.0446 | 0.0847 | 1.5457 | 2248.5317 | 115.8467 |
Caenorhabditis elegans | 100.4278 | 101.1053 | 94.3029 | 100.3038 | 0.2025 | 0.0438 | 0.0318 | 119.1461 | 0.8575 |
Drosophila melanogaster (fruit fly) | 887.6161 | 922.9284 | 889.0182 | 911.5044 | 5.5153 | 0.0796 | 0.2107 | 590.4792 | 7.1921 |
Homo sapiens | 6750.1157 | 7134.5106 | 7056.7373 | 7336.6568 | 51.3895 | 0.1185 | 0.5939 | 2411.9311 | 48.2544 |
Average time | 915.23 | 961.94 | 944.16 | 980.36 | 24.558 | 0.0574 | 0.2777 | 651.58 | 18.089 |
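The "Average time" row is the arithmetic mean over the 10 organisms. The sketch below recomputes it for three of the algorithms from the per-organism values in the table and adds two derived quantities that are not in the table (the slowest single run and the ratio of average run times). The names `runtimes`, `averages`, `slowest` and `spinglass_vs_multilevel` are illustrative only.

```python
from statistics import mean

# Run times in seconds, copied from the table above in row order
# (House mouse first, Homo sapiens last).
runtimes = {
    "Fast greedy": [1.1767, 0.1114, 0.0091, 2.8521, 0.0227,
                    1.2628, 183.0446, 0.2025, 5.5153, 51.3895],
    "Multilevel": [0.0583, 0.0102, 0.0019, 0.2264, 0.0242,
                   0.0748, 1.5457, 0.0318, 0.2107, 0.5939],
    "Spinglass": [236.0281, 45.1722, 6.5747, 468.3092, 43.3769,
                  346.2693, 2248.5317, 119.1461, 590.4792, 2411.9311],
}

# Mean run time per algorithm (compare with the "Average time" row).
averages = {algorithm: mean(times) for algorithm, times in runtimes.items()}

# Slowest single network per algorithm (derived here, not part of the table).
slowest = {algorithm: max(times) for algorithm, times in runtimes.items()}

# How many times slower Spinglass is than Multilevel, on average.
spinglass_vs_multilevel = averages["Spinglass"] / averages["Multilevel"]

for algorithm in runtimes:
    print(f"{algorithm}: mean {averages[algorithm]:.4f} s, "
          f"max {slowest[algorithm]:.4f} s")
print(f"Spinglass / Multilevel: {spinglass_vs_multilevel:.0f}x")
```

On these numbers, Spinglass is on average more than three orders of magnitude slower than Multilevel, and the slowest Fast greedy run (Saccharomyces cerevisiae S288c) takes about three minutes.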
Discussion and conclusion
In our analysis, we used nine community detection algorithms to predict modules in the PPI networks of 10 different organisms. Our analysis provides a comprehensive picture of the performance of a large set of community detection algorithms. It also highlights organism-specific differences of PPI networks and the biological meaning of the predicted modules.
Across these networks, we found that the Spinglass, Multilevel and Fast greedy algorithms generally perform much better than the other algorithms. In addition, the Multilevel and Fast greedy algorithms have a good run time (see Table 14), which allows results for large networks to be obtained within seconds to a few minutes. Interestingly, although these three algorithms perform better, their predicted modules do not fully agree; the results are to a large extent method-specific. Another interesting observation is that for the Multilevel and Spinglass algorithms the number of modules and the modularity are linearly correlated, whereas the performance of Fast greedy decreases as the number of modules increases (see Fig. 4). At this point it is unclear which behavior best reflects the dependency between modularity and number of modules in biological organisms. However, it appears reasonable to assume that there is a limiting factor in the growth of modularity of biological networks, which would suggest that the behavior of Fast greedy reflects biological properties of the networks rather than a technical property or a bias of the method.
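To make the quantity under discussion concrete, the following sketch computes the modularity of a given partition on a toy network. It assumes the standard Newman modularity for unweighted, undirected graphs, \(Q = \sum_{c} \left[ L_c/m - \left(d_c/(2m)\right)^2 \right]\), where \(m\) is the number of edges, \(L_c\) the number of edges inside module \(c\) and \(d_c\) the sum of the degrees of its nodes. The function name `modularity` and the toy graph are illustrative and are not taken from the study.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, one entry per undirected edge.
    membership: dict mapping each node to its module label.
    """
    m = len(edges)
    intra_edges = defaultdict(int)   # L_c: edges with both ends in module c
    degree_sum = defaultdict(int)    # d_c: summed degree of nodes in module c
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            intra_edges[membership[u]] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy network: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "a" for node in range(6)}

q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
print(q_two, q_one)
```

Splitting the toy network into its two triangles gives a positive modularity, while placing all nodes in a single module gives zero, which illustrates why the number of modules and the achieved modularity are coupled.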
Although we studied module detection in biological networks extensively and found high modularity for some organisms, we find a low modularity for others, such as Homo sapiens and Saccharomyces cerevisiae. This is especially surprising for Homo sapiens. One reason for the low modularity in these networks could be the existence of many nodes shared between communities, giving rise to overlapping modules and pathways. Therefore, the standard non-overlapping community prediction methods may not be optimally suited for detecting communities in such organisms. This would suggest that more effort needs to be placed on the development of overlapping community detection algorithms, because only in this way can one shed light on the nature of the overlapping modular structure of PPI networks. Another explanation could be that the PPI networks contain incomplete information. One argument for this is that the highest modularity is predicted by the Spinglass algorithm for Arabidopsis thaliana (3702), which is a less complex organism and for this reason easier to study. Moreover, the modularity of Arabidopsis thaliana (3702) is consistently predicted to be higher by all other algorithms as well.
- Pathways which are part of a single module only, across many organisms.
- Pathways which are part of multiple modules, across many organisms.
- Pathways which are part of a single module in a single organism.
- Pathways which are part of multiple modules in a single organism.
It would be interesting to see which biological processes these pathways contribute to and what role they play in different organisms, in order to understand changes from an evolutionary perspective or the emergence of a higher level of functioning in different organisms.
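The four pathway categories listed above can be assigned mechanically once it is known in how many modules each pathway is enriched per organism. The sketch below is one possible way to do this and is not the procedure used in the study. It assumes a hypothetical input mapping `enrichment` from pathway name to per-organism module counts, and it assumes that a pathway counts as "multiple modules" as soon as it is enriched in more than one module in at least one organism. The pathway and organism names are placeholders.

```python
def categorize(enrichment):
    """Assign each pathway to one of the four categories.

    enrichment: {pathway: {organism: number of modules enriched for it}}
    Returns {pathway: (module_spread, organism_scope)}.
    """
    categories = {}
    for pathway, per_organism in enrichment.items():
        # Keep only organisms in which the pathway is enriched at all.
        hits = {org: n for org, n in per_organism.items() if n > 0}
        if not hits:
            continue  # never enriched: belongs to none of the categories
        scope = "many organisms" if len(hits) > 1 else "single organism"
        # Assumption: more than one module in any organism means "multiple".
        spread = "multiple modules" if max(hits.values()) > 1 else "single module"
        categories[pathway] = (spread, scope)
    return categories


# Placeholder data, for illustration only.
enrichment = {
    "pathway_A": {"human": 1, "mouse": 1, "yeast": 1},
    "pathway_B": {"human": 3, "mouse": 1},
    "pathway_C": {"yeast": 1, "human": 0},
    "pathway_D": {"human": 2},
    "pathway_E": {"human": 0},
}

categories = categorize(enrichment)
for pathway, category in categories.items():
    print(pathway, category)
```

A stricter rule, for example requiring multiple modules in every organism, would move borderline pathways such as `pathway_B` into a different category, so the threshold should be stated explicitly in any such analysis.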
In summary, identifying modules in networks is a very complex problem that requires further work. A potential future direction is to extend the analysis to the identification of communities with overlapping proteins or genes. This would be a major step forward because it would require including the hierarchy among the modules and, as such, fundamentally different algorithms.
Declarations
Acknowledgements
Matthias Dehmer thanks the Austrian Science Funds for supporting this work (project P26142).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
- Emmert-Streib F. The chronic fatigue syndrome: A comparative pathway analysis. J Comput Biol. 2007; 14(7):961–72.
- Emmert-Streib F, Glazko G. Network Biology: A direct approach to study biological function. Wiley Interdiscip Rev Syst Biol Med. 2011; 3(4):379–91.
- Förster J, Famili I, Fu P, Palsson BO, Nielsen J. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 2003; 13(2):244–53.
- Guelzim N, Bottani S, Bourgine P, Kepes F. Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet. 2002; 31(1):60–63.
- Lee TI, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002; 298(5594):799–804.
- Vidal M, Cusick ME, Barabási AL. Interactome networks and human disease. Cell. 2011; 144(6):986–98.
- Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999; 286:509–12.
- Han J-DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004; 430:88–93.
- Jeong H, Tombor B, Albert R, Oltvai ZN, Barabási AL. The large-scale organization of metabolic networks. Nature. 2000; 407:651–4.
- Ravasz E. Detecting hierarchical modularity in biological networks. Methods in Molecular Biology, Springer. 2008; 541:1–16.
- Watts DJ, Strogatz SH. Collective dynamics of ‘small-world’ networks. Nature. 1998; 393:440–2.
- Yu H, Gerstein M. Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci USA. 2006; 103:14724–31.
- Emmert-Streib F. Limitations of the gene duplication model: Evolution of modules in protein interaction networks. PLoS ONE. 2012; 7(4):35531.
- Hallinan J. Gene duplication and hierarchical modularity in intracellular interaction networks. Biosystems. 2004; 74(1–3):51–62.
- Wagner GP, Pavlicev M, Cheverud JM. The road to modularity. Nat Rev Genet. 2007; 8(1):921–31.
- Kitano H. Systems biology: a brief overview. Science. 2002; 295(5560):1662–1664.
- Van Regenmortel M. Reductionism and complexity in molecular biology. EMBO Rep. 2004; 5(9):1016–1020.
- Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci U S A. 2003; 100(21):12123–12128.
- Hintze A, Adami C. Evolution of complex modular biological networks. PLoS Comput Biol. 2008; 4:23. doi:10.1371/journal.pcbi.0040023.
- Clune J, Mouret JB, Lipson H. The evolutionary origins of modularity. Proc R Soc Lond B Biol Sci. 2013; 280(1755). doi:10.1098/rspb.2012.2863.
- Lin CY, Lee TL, Chiu YY, Lin YW, Lo YS, Lin CT, Yang JM. Module organization and variance in protein-protein interaction networks. Sci Rep. 2015; 5:9368.
- Dittrich MT, Klau GW, Rosenwald A, Dandekar T, Mueller T. Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics. 2008; 24(13):223–31. doi:10.1093/bioinformatics/btn161. http://bioinformatics.oxfordjournals.org/content/24/13/i223.full.pdf+html.
- Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, Bull S, Pawson T, Morris Q, Wrana JL. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotech. 2009; 27(2):199–204.
- Zhang X, Zhang R, Jiang Y, Sun P, Tang G, Wang X, Lv H, Li X. The expanded human disease network combining protein-protein interaction information. Eur J Hum Genet. 2011; 19(7):783–8.
- Cheng L, Li J, Ju P, Peng J, Wang Y. Semfunsim: A new method for measuring disease similarity by integrating semantic and gene functional association. PLoS ONE. 2014; 9(6):99415. doi:10.1371/journal.pone.0099415.
- Nepusz T, Yu H, Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat Meth. 2012; 9(5):471–2.
- Clauset A, Newman MEJ, Moore C. Finding community structure in very large networks. Phys Rev E. 2004; 70:066111. doi:10.1103/PhysRevE.70.066111.
- Pons P, Latapy M. Computing communities in large networks using random walks. J Graph Algorithms Appl. 2004; 10(2):284–93.
- Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E. 2007; 76(3):036106.
- Reichardt J, Bornholdt S. Statistical mechanics of community detection. Phys Rev E. 2006; 74:016110. doi:10.1103/PhysRevE.74.016110.
- Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008; 2008(10):10008. doi:10.1088/1742-5468/2008/10/p10008.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000; 25(1):25–9.
- Newman MEJ. Modularity and community structure in networks. Proc Natl Acad Sci. 2006; 103(23):8577–582. doi:10.1073/pnas.0601602103. http://www.pnas.org/content/103/23/8577.full.pdf.
- Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Phys Rev E. 2008; 78:046110. doi:10.1103/PhysRevE.78.046110.
- Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V, Dolinski K, Tyers M. The BioGRID Interaction Database: 2008 update. Nucl Acids Res. 2008; 36(suppl 1):637–40.
- Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006; Complex Systems:1695.
- Kvalseth TO. Entropy and correlation: Some comments. IEEE Trans Syst Man Cybern. 1987; 17(3):517–9. doi:10.1109/TSMC.1987.4309069.
- Danon L, Guilera AD, Duch J, Arenas A. Comparing community structure identification. J Stat Mech Theory Exp. 2005; 2005(9):09008–09008. doi:10.1088/1742-5468/2005/09/p09008.
- Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010; 11:2837–854.
- Ruepp A, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Stransky M, Waegele B, Schmidt T, Doudieu ON, Stümpflen V, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res. 2008; 36(suppl 1):646–50. doi:10.1093/nar/gkm936.