Topological and functional comparison of community detection algorithms in biological networks

Rahiminejad, Sara; Maurya, Mano R.; Subramaniam, Shankar

doi:10.1186/s12859-019-2746-0

Research article
Open access
Published: 27 April 2019

BMC Bioinformatics volume 20, Article number: 212 (2019)

Abstract

Background

Community detection algorithms are fundamental tools to uncover important features in networks. There are several studies focused on social networks but only a few deal with biological networks. Directly or indirectly, most of the methods maximize modularity, a measure of the density of links within communities as compared to links between communities.

Results

Here we analyze six different community detection algorithms, namely, Combo, Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass, on two important biological networks to find their communities, and we evaluate the results in terms of topological and functional features through Kyoto Encyclopedia of Genes and Genomes pathway and Gene Ontology term enrichment analysis. At a high level, the main assessment criteria are 1) appropriate community size (neither too small nor too large), 2) representation within the community of only one or two broad biological functions, 3) membership of most network genes from a given pathway in only one or two communities, and 4) performance speed. The first network in this study is a network of Protein-Protein Interactions (PPI) in Saccharomyces cerevisiae (Yeast) with 6532 nodes and 229,696 edges and the second is a network of PPI in Homo sapiens (Human) with 20,644 nodes and 241,008 edges. All six methods perform well, i.e., find reasonably sized and biologically interpretable communities, for the Yeast PPI network, but the Conclude method does not find reasonably sized communities for the Human PPI network. The Louvain method maximizes modularity by using an agglomerative approach, and is the fastest method for community detection. For the Yeast PPI network, the results of the Spinglass method are most similar to the results of the Louvain method with regard to the size of communities and the core pathways they identify, whereas for the Human PPI network, the Combo and Spinglass methods yield the most similar results, with Louvain being the next closest.

Conclusions

For the Yeast and Human PPI networks, the Louvain method is likely the best method for finding communities, in terms of detecting known core pathways in a reasonable time.

Background

Networks have been used to study complex interacting systems in many domains during the last two decades, including sociology, physics, computer science and biology. An important task in the analysis of networks is the identification of communities or modules whose members share one or more common features of the system. The problem that community detection attempts to solve is the identification of groups of nodes with more and/or better interactions among their members than between their members and the remainder of the network [1, 2]. For example, in a social network, a community may correspond to a group of friends who attend the same school or live in the same neighborhood, while in a biological network, communities may represent functional modules of interacting proteins.

Edges in a biological network may represent various types of direct interactions and indirect effects. Examples of direct interactions include protein-protein interactions as part of signaling pathways or as part of protein complexes and substrate-enzyme interactions. Indirect effects may include transport processes and regulatory effects, which, in most cases, can be substituted with a subnetwork of several direct interactions when modeled at a finer granularity. Examples of the latter are cholesterol and ion transport across the plasma membrane and protein-DNA interactions in gene-regulatory networks. Thus, in the context of a cell or tissue, subnetworks or communities may correspond to various cellular processes, pathways and functions, whose components (nodes) exhibit a higher degree of interaction with one another than with nodes outside the pathway.

The majority of methods for community detection in networks are based on maximization of modularity. The modularity metric Q of a network is defined in the Methods section; intuitively, if a network can be partitioned in such a way that only a few connections exist between the nodes of different partitions and most connections are among the nodes within the partitions, then the modularity will be high. It is interesting to note that the modularity of a sparse network of fully connected subnetworks is higher than that of a fully connected network treated as a single community, which is zero. Any partition of a fully connected network into two or more communities results in Q < 0. Brandes et al. have carried out an extensive theoretical analysis of the properties of modularity and the complexity of its maximization [3].
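The statements about fully connected networks can be checked numerically. The following is a minimal sketch that assumes the standard Newman-Girvan form of modularity for an undirected, unweighted network, Q = sum over communities of (l_c / m - (d_c / 2m)^2), where l_c is the number of edges inside community c, d_c is the total degree of its nodes and m is the total number of edges. The function name and the toy graphs are illustrative and are not taken from the paper.

```python
from itertools import combinations


def modularity(edges, community_of):
    """Modularity Q of a partition of a simple undirected graph.

    Assumes the standard Newman-Girvan definition (an assumption here,
    since the paper's own definition is given in its Methods section).
    `edges` is a list of (u, v) pairs; `community_of` maps node -> label.
    """
    m = len(edges)
    inside = {}  # l_c: edges with both endpoints in community c
    degree = {}  # d_c: summed degree of the nodes in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Fully connected network on 6 nodes (toy example).
complete = list(combinations(range(6), 2))
q_single = modularity(complete, {v: 0 for v in range(6)})
q_split = modularity(complete, {v: v // 3 for v in range(6)})

# Sparse network made of two disjoint, fully connected triangles.
triangles = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
q_triangles = modularity(triangles, {v: v // 3 for v in range(6)})

print(q_single, q_split, q_triangles)
```

Under this definition, the complete graph scores zero as a single community and a negative value once it is split in two, while the two disjoint triangles score well above zero.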

One of the most important objectives of any large-scale omics study is to identify mechanisms for specific functions and phenotypes in a chosen context. Biological networks derived from genome-scale experimental data and/or legacy knowledge are generally large and complex, with thousands of nodes and many thousands of connections. Associating meaningful biological functions and interpretations with such networks as a whole is practically impossible. However, these large networks can be broken down into smaller subnetworks (also called modules or communities), which are more amenable to biological interpretation. Such communities are expected to represent one or a few biological functions, and they may facilitate discovery of mechanisms relating the causes or perturbations to the observed phenotypes. Thus, community detection can provide valuable biological insights.

Several methods have been developed to find communities in networks using tools and techniques from different disciplines such as applied mathematics or statistical physics [4]. All these methods try to identify meaningful communities, while keeping the computational complexity of the underlying algorithm low [5]. Although these methods have proven to be successful in some cases, there is no guarantee that the resulting communities provide the best functional description of the system. Hence, selecting a suitable method to detect communities in a network is challenging. While there have been some studies comparing different methods for community detection [5], their focus has been on Lancichinetti, Fortunato, Radicchi (LFR) benchmark networks (artificial networks that have heterogeneity in the distributions of degree of nodes and the size of communities) [6]; comparisons with respect to biological networks are lacking.

Classical community detection algorithms divide networks into communities according to network features such as edge betweenness. One of the most popular and prominent algorithms that uses edge betweenness is the Girvan-Newman algorithm [1, 7]. In this method, the edges with the highest betweenness are progressively removed from the original network until the modularity reaches its maximum value, which makes it an optimization problem. The connected components of the remaining network are the communities. The Girvan-Newman algorithm has been successfully applied to a variety of networks, including networks of email messages. However, its computational complexity, O(m²n) for a network with n nodes and m edges, practically restricts its use to networks of at most a few thousand nodes. There are other optimization-based algorithms with different objective functions that provide different approaches to solving the community detection problem. For example, the Leading Eigen algorithm [8] also tries to maximize modularity, but the modularity is expressed in terms of the eigenvalues and eigenvectors of a matrix called the modularity matrix. The Spinglass method minimizes the Hamiltonian of the network [9].
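The Girvan-Newman procedure described above can be sketched in a few dozen lines. This is an illustrative, unoptimized version, not the implementation used in the study: it assumes a simple undirected graph (no repeated edges or self-loops), computes edge betweenness with a Brandes-style breadth-first accumulation, removes one highest-betweenness edge at a time until no edges remain, and keeps the partition with the highest modularity measured on the original network. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (each node pair is counted twice)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths and record predecessors.
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path dependencies back from the farthest nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return score


def components(adj):
    """Connected components of the current graph, as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v] - seen:
                seen.add(w)
                stack.append(w)
        parts.append(comp)
    return parts


def modularity(edges, parts):
    """Standard Newman-Girvan modularity, evaluated on the original edges."""
    m = len(edges)
    label = {v: i for i, p in enumerate(parts) for v in p}
    inside, degree = [0] * len(parts), [0] * len(parts)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree))


def girvan_newman(edges):
    """Return (best partition, best Q) found while removing edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best_parts = components(adj)
    best_q = modularity(edges, best_parts)
    while any(adj.values()):  # stop once every edge has been removed
        score = edge_betweenness(adj)
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        q = modularity(edges, parts)
        if q > best_q:
            best_parts, best_q = parts, q
    return best_parts, best_q


# Toy example: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_parts, toy_q = girvan_newman(toy_edges)
print(sorted(sorted(p) for p in toy_parts), round(toy_q, 3))
```

Because edge betweenness is recomputed from scratch after every removal, the cost grows quickly with network size, which is the practical limitation noted in the text.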

Since the early 2000s, several methods have been developed that divide networks into communities based on modularity [10,11,12,13,14,15]. The modularity criterion was revisited in 2005 when Duch and Arenas proposed a divisive algorithm [16] that optimizes the modularity using a heuristic search based on the Extremal Optimization (EO) algorithm proposed by Boettcher and Percus [17, 18]. Pizzuti has suggested an algorithm named GA-net that uses a special assessment function, described as the community score, in addition to the modularity function [19]. There are also other approaches to the community detection problem in which the use of multiple objectives (or assessment criteria) is preferred over the use of a single objective for complex networks. Since the objectives are usually directly related to the network properties, one advantage of multi-objective optimization is that it balances multiple important properties of the network. The benefits of using a multi-objective approach have been explained by Shi et al. [20].

In this manuscript, we briefly review eight algorithms for finding communities in biological networks such as Protein-Protein Interaction (PPI) networks (discussed in the Methods section). In such networks, each node represents a protein (or gene) and each edge represents an interaction between two proteins. In particular, we apply six of these algorithms to the Yeast PPI network with 6532 nodes and 229,696 edges and the Human PPI network with 20,644 nodes and 241,008 edges. Using several topological metrics, we assess which methods provide similar (or dissimilar) results. We evaluate the biological interpretation of the communities identified and compare the results in terms of their functional features. At a high level, the main criteria for assessment of the methods are 1) appropriate community size (neither too small nor too large), 2) representation within the community of only one or two broad biological functions, 3) membership of most network genes from a given pathway in only one or two communities, and 4) performance speed.

This paper is organized as follows. In the Results section, we present the results of applying six methods to the Yeast and Human PPI networks and compare the communities based on their topological and functional features. In the last part of that section, we describe an orthology analysis between the communities detected for the Yeast PPI network and the communities detected for the Human PPI network. In the Discussion section, we discuss the results and provide insights into the algorithmic similarities and robustness of some of the methods. The Conclusions section summarizes our findings. In the Methods section, we describe eight different methods for finding communities in networks and introduce three metrics to compare the communities identified by the algorithms.

Results

Six community detection methods, namely, Combo, Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass, have been applied to the Yeast PPI network with 6532 nodes and 229,696 edges and the Human PPI network with 20,644 nodes and 241,008 edges. A detailed description of the methods is included in the Methods section. We used the BioGRID database [21, 22] for the PPI networks for Yeast and Human. Since our focus in this paper is on undirected and unweighted networks, we removed repeated edges and self-loops from our data set.
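The preprocessing step described above (removing repeated edges and self-loops so that the network is simple, undirected and unweighted) might look like the following sketch. The function name and the placeholder identifiers are hypothetical; the actual BioGRID file parsing used in the study is not shown in the text and is not reproduced here.

```python
def simplify_edges(raw_edges):
    """Drop self-loops and repeated edges from an undirected edge list.

    An edge (a, b) is treated as identical to (b, a). The first occurrence
    of each edge is kept, so the output order follows the input order.
    """
    seen = set()
    simple = []
    for a, b in raw_edges:
        if a == b:  # self-loop: a protein listed as interacting with itself
            continue
        key = (a, b) if a <= b else (b, a)  # orientation-independent key
        if key not in seen:
            seen.add(key)
            simple.append(key)
    return simple


# Placeholder identifiers, not real BioGRID records.
raw = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("A", "B")]
clean = simplify_edges(raw)
print(clean)
```

Storing each edge with its endpoints in sorted order is one simple way to make the duplicate check independent of the direction in which an interaction was recorded.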

In the first part of this section, we will present the results for the Yeast PPI network. In the second part, the results for the Human PPI network will be presented. In the third part, an orthology comparison will be provided between the Yeast and Human PPI networks.

Yeast PPI network

Among the methods tested to find communities in the Yeast PPI network, Combo, Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass give good partitioning results, i.e., the communities detected are neither too small nor too large compared to the size of the original network. Since the Yeast PPI network has 6532 nodes, the Girvan-Newman algorithm is not an appropriate method to detect its communities. The algorithm takes 44 min (on a PC with 4 GB of RAM and four 2.4 GHz processors) for the Rattus PPI network, which has 3379 nodes and 4580 edges. Its computational complexity is proportional to m²n (where n is the number of nodes and m is the number of edges), so it would take ~148 days to find communities in the Yeast PPI network (using the computational resource mentioned above). Infomap is also not a good method, based on the size of the communities it detects; the largest community has 6195 nodes and the smallest one has just 2 nodes. Since very small communities (e.g., those with fewer than 100 nodes) are not expected to yield significant biological insights, we will not consider them in our analysis. We note that there may be some exceptions.
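The runtime estimate for the Girvan-Newman algorithm follows from scaling the measured 44 min by the ratio of m²n between the two networks. The short sketch below reproduces that arithmetic; it assumes that runtime is exactly proportional to m²n with the same constant on both networks, which is a simplification, and the function name is illustrative.

```python
def extrapolate_runtime(ref_minutes, ref_nodes, ref_edges, nodes, edges):
    """Scale a measured runtime assuming cost proportional to m^2 * n."""
    scale = (edges / ref_edges) ** 2 * (nodes / ref_nodes)
    return ref_minutes * scale


# Reference: Rattus PPI network (3379 nodes, 4580 edges), 44 min.
# Target: Yeast PPI network (6532 nodes, 229,696 edges).
minutes = extrapolate_runtime(44, 3379, 4580, 6532, 229_696)
days = minutes / (60 * 24)
print(f"{days:.1f} days")
```

The scale factor is roughly 4900, dominated by the squared ratio of edge counts, which is why a run of under an hour on the smaller network becomes months on the Yeast network.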

In the next subsection, first we will compare the methods from a topological perspective of the communities identified. Then we will provide a functional comparison. To begin with, the results for all these methods are described in Table 1 in terms of the size of the communities detected for the Yeast PPI network.

Table 1 Number of nodes and edges for communities detected using different methods for the Yeast PPI network (6532 nodes and 229,696 edges). The number in parentheses after the name of each method represents the number of communities detected by that method. For example, Combo finds 8 communities. Modularity scores are also provided for the different methods. For each method, we only consider the communities with 100 or more nodes and list up to 10 communities.
