PSMCL: parallel shotgun coarsened Markov clustering of protein interaction networks
BMC Bioinformatics volume 20, Article number: 381 (2019)
Abstract
Background
How can we obtain fast and high-quality clusters in genome-scale bionetworks? Graph clustering is a powerful tool applied to bionetworks to solve various biological problems such as protein complex detection, disease module detection, and gene function prediction. In particular, MCL (Markov Clustering) has been spotlighted due to its superior performance on bionetworks. MCL, however, is skewed towards finding a large number of very small clusters (size 1–3) and fails to detect many larger clusters (size 10+). To resolve this fragmentation problem, MLRMCL (Multilevel Regularized MCL) has been developed. However, MLRMCL still suffers from fragmentation and, in some cases, generates unrealistically large clusters.
Results
In this paper, we propose PSMCL (Parallel Shotgun Coarsened MCL), a parallel graph clustering method that outperforms MLRMCL in terms of running time and cluster quality. PSMCL adopts an efficient coarsening scheme, called SC (Shotgun Coarsening), to improve graph coarsening in MLRMCL. SC allows merging multiple nodes at a time, which leads to improvements in quality, time, and space usage. PSMCL also parallelizes the main operations used in MLRMCL, including matrix multiplication.
Conclusions
Experiments show that PSMCL dramatically alleviates the fragmentation problem, and outperforms MLRMCL in quality and running time. We also show that the running time of PSMCL is effectively reduced with parallelization.
Background
Graph clustering is one of the most fundamental problems in graph mining and arises in various fields including bionetwork analysis [1, 2]. Graph clustering has been extensively studied and applied in protein complex finding [3–5], disease module finding [6], and gene function prediction [7].
In general, the main task of the graph clustering problem is to divide the graph into cohesive clusters that have low interdependency: i.e., few inter-cluster edges and many intra-cluster edges. Additional domain-specific constraints can be added to graph clustering to improve the clustering quality; however, we focus only on improving topology-based clustering, as constraints can easily be added afterward.
Among a number of clustering algorithms, MCL (Markov Clustering) [8] has received the greatest attention in bionetwork analysis. Various studies have shown its superiority to other methods [3, 9–11]. However, MCL tends to produce clusters that are too small, which is called the fragmentation problem. Considering that many problems in bionetwork analysis require cluster sizes in the range of 5–20 [12–14], fragmentation needs to be avoided. To solve this problem in MCL, RMCL (Regularized MCL) has been developed [15], but it often generates clusters that are too large, e.g., one cluster containing most of the nodes. Satuluri et al. [9] generalize RMCL by introducing a balancing factor to obtain clusters whose sizes are similar to those observed in real bionetworks. In that work, a large number of nodes belong to clusters of size 10–20 with an appropriate balancing factor; however, the number of tiny clusters of size 1–3 also greatly increases compared with the original RMCL. To improve the scalability of RMCL, MLRMCL (Multilevel RMCL) has been developed [15]. MLRMCL first coarsens a graph and then runs RMCL with refinement. However, its coarsening scheme HEM (Heavy Edge Matching) is known to be inefficient for real-world graphs, such as protein interaction networks, which have a heavy-tailed degree distribution [16–18].
In this paper, we propose PSMCL (Parallel Shotgun Coarsened MCL), a parallel graph clustering method for bionetworks with an efficient graph coarsening scheme and parallelization. First, we propose the SC (Shotgun Coarsening) scheme for MLRMCL; SC allows grouping multiple nodes at a time [19]. Compared with HEM used in MLRMCL, which is similar to a greedy algorithm for the traditional matching problem, SC coarsens a graph into more cohesive super nodes. Moreover, SC obtains a coarsened graph of a manageable size more quickly than HEM does. Second, we carefully parallelize the main operations in RMCL, which is a subroutine of MLRMCL: the Regularize, Inflate, and Prune operations. The latter two are column-wise operations by definition, and we parallelize them by assigning each column to a core. The former, Regularize, is a matrix multiplication. We divide the matrix-matrix multiplication into a number of matrix-vector multiplications and parallelize them by distributing the vectors over multiple cores while sharing the matrix. Through experiments, we show that PSMCL not only alleviates the fragmentation problem but also outperforms MLRMCL in quality and running time. Moreover, we show that PSMCL gets faster as more processing cores are used. PSMCL produces the clustering with the best quality; its speed is comparable to that of MCL, the baseline method, and much faster than that of MLRMCL, our main competitor (Table 1).
Our contributions are summarized as follows.

Coarsening: We propose the Shotgun Coarsening (SC) scheme for MLRMCL. SC allows merging multiple nodes to a super node at a time. Compared with the existing Heavy Edge Matching (HEM) coarsening method, SC improves both the quality and efficiency of coarsening.

Parallelization: We carefully parallelize the proposed algorithm on multiple cores by rearranging the operations so that they can be computed in a column-wise manner and assigning each column-wise computation to one core.

Performance: Through experiments, we show that PSMCL prefers clusters of sizes in the range of 10 to 20 and results in less fragmentation compared to MLRMCL. We also show that PSMCL is effectively parallelizable. As a consequence, PSMCL outperforms MLRMCL in both quality and speed (Table 1).
In the rest of the paper, we explain preliminaries including MCL-based algorithms, describe our proposed method PSMCL in detail, show experimental results on various protein interaction networks, and conclude.
Preliminaries
In this section, we explain existing MCL-based algorithms: the original MCL, RMCL, and MLRMCL. Table 2 lists the symbols used in this paper.
Markov clustering (MCL)
MCL is a flow-based graph clustering algorithm. Let G=(V,E) be a graph with n=|V| and m=|E|, and A be the adjacency matrix of G where self-loops for all nodes are added. The (i,j)th element M_{ij} of the initial flow matrix M is defined as follows:
\(M_{ij} = A_{ij} / {\sum \nolimits }_{k=1}^{n} A_{kj}\).
Intuitively, M_{ij} can be understood as the transition probability or the amount of flow from j to i. MCL iteratively updates M until convergence, and each iteration consists of the following three steps.

Expand: M←M×M.

Inflate: \(M_{ij} \leftarrow \left (M_{ij}\right)^{r} / {\sum \nolimits }_{k=1}^{n} \left (M_{kj}\right)^{r}\) where r>1.

Prune: elements whose values are below a certain threshold are set to 0; every column is normalized to sum to 1.
When MCL converges, each column of M has at least one nonzero element. All nodes whose corresponding columns have a nonzero element in the same row are assigned to the same cluster. If a node has multiple nonzero elements in its column, a row is arbitrarily chosen. Although MCL is simple and intuitive, it lacks scalability due to the matrix multiplication in the Expand step, and it outputs a large number of clusters that are too small (the fragmentation problem), e.g., 1416 clusters from a network with 4741 nodes [15].
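The three steps and the cluster assignment described above can be sketched in a few lines of Python. This is a dense, illustrative version and not the implementation evaluated in this paper; the function names, the default inflation parameter r=2, the pruning threshold, and the rule of picking the largest entry when a column has several nonzeros are assumptions made for the example.

```python
def column_normalize(M):
    """Return a copy of M (a list of rows) whose columns each sum to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def initial_flow(A):
    """Add a self-loop to every node of adjacency matrix A, then column-normalize."""
    n = len(A)
    with_loops = [[max(A[i][j], 1.0) if i == j else A[i][j] for j in range(n)]
                  for i in range(n)]
    return column_normalize(with_loops)


def mcl_iteration(M, r=2.0, threshold=1e-4):
    """One MCL iteration on a dense flow matrix: Expand, Inflate, Prune."""
    n = len(M)
    # Expand: M <- M x M
    expanded = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    # Inflate: raise entries to the power r and renormalize each column
    inflated = column_normalize([[x ** r for x in row] for row in expanded])
    # Prune: drop small entries, then renormalize each column
    pruned = [[x if x >= threshold else 0.0 for x in row] for row in inflated]
    return column_normalize(pruned)


def extract_clusters(M):
    """Group nodes (columns) by the row that holds the largest entry of the column."""
    n = len(M)
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: M[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())
```

In use, one would call `initial_flow` once, repeat `mcl_iteration` until the matrix stops changing, and then call `extract_clusters`.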
Regularized MCL (RMCL)
One reason for the fragmentation of clusters in MCL is that the adjacency structure of a given graph is used only at the beginning, which leads to diverging columns for neighboring node pairs. To resolve this fragmentation problem, RMCL [15] regularizes a flow matrix instead of expanding it. The flow of a node is updated by minimizing the weighted sum of KL divergences between the target node and its neighbors. This minimization problem has a closed-form solution, and consequently, the regularizing step of RMCL is derived as follows.

Regularize: M←M×M_{G}, where M_{G} is the initial flow matrix defined as M_{G}=AD^{−1}, A is the adjacency matrix of G with the added self-loops and the weight transformation [15], and D is the diagonal matrix derived from A (i.e., \(D_{ii} = {\sum \nolimits }_{k=1}^{n} A_{ik} \)).
RMCL finds a smaller number of clusters than MCL does.
A problem with RMCL is that it finds clusters whose sizes are spread over a wide range, while clusters in bionetworks are usually in the size range of 5–20 [12, 13]. To resolve this problem, Satuluri et al. [9] generalize the regularization step as follows:

\(mass(i) = {\sum \nolimits }_{j=1}^{n} M_{ij}\).

M_{R}=column_normal (diag(M^{⊤}×mass)^{−b}×M_{G}), where b is a balancing parameter.

Regularize: M←M×M_{R}.
The balancing parameter b controls the degree of balance in the cluster sizes; a higher b encourages more balanced clustering. The intuition behind this generalization is to penalize flows to a node that currently has a large number of incoming flows. Note that b=0 is equivalent to the original RMCL.
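The generalized regularization step can be illustrated with a small dense sketch. It follows the formula exactly as written above (the rows of M_{G} are scaled by the entries of M^{⊤}×mass raised to the power −b, and the columns are then normalized); the function names and the dense list-of-rows representation are assumptions for illustration only.

```python
def column_normalize(M):
    """Return a copy of M (a list of rows) whose columns each sum to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]


def balanced_regularizer(M, M_G, b):
    """Compute M_R = column_normal(diag(M^T x mass)^(-b) x M_G) for dense matrices."""
    n = len(M)
    # mass(i) = sum_j M_ij: total flow currently arriving at node i
    mass = [sum(M[i]) for i in range(n)]
    # p = M^T x mass, one value per node
    p = [sum(M[k][i] * mass[k] for k in range(n)) for i in range(n)]
    # Left-multiplying by diag(p)^(-b) scales row i of M_G by p[i]^(-b)
    scaled = [[(p[i] ** -b) * M_G[i][j] for j in range(n)] for i in range(n)]
    return column_normalize(scaled)


def regularize(M, M_R):
    """Regularize step: M <- M x M_R."""
    n = len(M)
    return [[sum(M[i][k] * M_R[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```

With b=0 every scaling factor is 1, so `balanced_regularizer` returns M_{G} unchanged, matching the remark that b=0 recovers the original RMCL.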
Multilevel RMCL (MLRMCL)
MLRMCL uses graph coarsening to further improve the quality and the running time of RMCL [15]. Graph coarsening means merging related nodes into a super node. MLRMCL first generates a sequence of coarsened graphs (G_{0},G_{1},…,G_{ℓ}), where G_{0} is the original graph and G_{ℓ} is the most coarsened (smallest) graph. For i=ℓ down to 1, RMCL is run on G_{i} for only a few iterations, and the computed flows on G_{i} are projected to G_{i−1}. After reaching the original graph G_{0}, RMCL is run until convergence. Algorithm 1 shows the overall procedure of MLRMCL. Although the description is for b=0, b>0 can also be used by changing M_{G} to M_{R} as defined in the previous section.
The original RMCL and MLRMCL use HEM (Heavy Edge Matching), which picks an unmatched neighbor connected to the heaviest edge for a given node, to coarsen the graph [15]. In HEM, the node v to which a node u is merged is determined as follows:
\(v = \arg\max_{v^{\prime} \in N_{unmatched}(u)} W(u,v^{\prime})\)
where N_{unmatched}(u) is the set of unmatched neighbors of u, and W(u,v^{′}) is the weight of the edge between u and v^{′}. Note that HEM allows a node to be matched with at most one other node. For the flow projection, MLRMCL assigns all flows of a super node to one of its children. It has been shown that the clustering result is invariant to the choice of the child to which all flows are assigned; for more details, refer to [15]. Note that MLRMCL greatly reduces the overall computation of RMCL since the flow update is done on the coarsened graph, which is smaller than the original graph.
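A minimal sketch of the HEM rule described above is given here for illustration. The adjacency-dictionary representation, the function name, and the visiting order are assumptions and are not taken from [15].

```python
def heavy_edge_matching(adj, order=None):
    """Greedy heavy edge matching.

    adj maps each node to a dict {neighbor: edge weight}.
    Returns a dict mapping every node to its partner (or to itself if unmatched).
    """
    match = {}
    for u in (order or sorted(adj)):
        if u in match:
            continue  # u was already chosen as somebody's partner
        candidates = [v for v in adj[u] if v != u and v not in match]
        if candidates:
            # pick the unmatched neighbor connected by the heaviest edge
            v = max(candidates, key=lambda x: adj[u][x])
            match[u] = v
            match[v] = u
        else:
            match[u] = u  # no unmatched neighbor left: u stays a singleton
    return match
```

On a star with one center and four leaves, only one leaf can be paired with the center, so this sketch leaves four super nodes after one coarsening step instead of the two or three that halving would give.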
Limitation of HEM. HEM of MLRMCL has two main limitations. First, the strategy of HEM that merges at most two single nodes can lead to undesirable coarsening where super nodes are not cohesive enough (see “Cohesive super node” section for details). Second, HEM is known to be unsuitable for real-world graphs [19] because their skewed degree distribution prevents the graph size from being effectively reduced (see “Quickly reduced graph” section for details). These shortcomings of HEM make MLRMCL inefficient for real-world graphs. To overcome them, in the next section we propose PSMCL, which allows multiple nodes to be merged at a time.
Implementation
In this section, we describe our proposed method PSMCL (Parallel Shotgun Coarsened MCL), which improves MLRMCL in two respects: (1) increasing the efficiency of the graph coarsening and (2) parallelizing the operations of RMCL.
Shotgun coarsening (SC)
As described previously, HEM is ineffective on real-world graphs. To overcome the limitation of HEM, we propose a graph coarsening scheme that allows merging multiple nodes at a time. We call this scheme Shotgun Coarsening (SC) because it aggregates satellite nodes to the center one. Algorithm 2 describes the proposed SC coarsening method, where N(u) denotes the set of neighbors of u in G=(V,E), and connected_components(V,F) outputs the set of connected components, each of which is a set of nodes, of the graph (V,F).
Our SC algorithm consists of three steps: 1) identify a set F of edges whose end nodes will be merged (lines 1–6), 2) determine a set V^{′} of super nodes of a coarsened graph and their associated weights (lines 7–12), and 3) determine a set E^{′} of edges between super nodes and their weights (lines 13–20). Let G=(V,E) be an input graph to be coarsened, \(Z:V\rightarrow \mathbb {N}\) be a node weight map for G, and \(W:E\rightarrow \mathbb {N}\) be an edge weight map for G. In the first step, we visit every node of G in an arbitrary order (line 2), and for each node u∈V visited, we find the best match node v. Precisely, the algorithm finds the neighboring node of u with the highest edge weight to u (line 3), i.e.,
\(v = \arg\max_{v^{\prime} \in N(u)} W(u,v^{\prime})\).
There may be multiple neighbors with the same highest weight. Let N_{1}(u) be the set of those neighbors; then, in this case, the one with the smallest node weight is chosen among them (line 4), i.e.,
\(v = \arg\min_{v^{\prime} \in N_{1}(u)} Z(v^{\prime})\).
This strategy of preferring a smaller node weight at the same edge weight prevents the emergence of an over-coarsened graph containing an excessively massive super node. Note that if every node in an initial graph has weight 1, the weight of a super node in a coarsened graph is equal to the number of nodes merged to create that super node. If there are multiple neighbors with the same highest edge weight and the same smallest node weight, v is chosen arbitrarily among the ties. Edge (u,v) is then added to F (line 5).
The second step determines super nodes and associated weights to them. Note that for (u,v)∈F, u and v should belong to the same super node by definition. By mathematical induction, two nodes belong to the same super node if and only if they are reachable along edges in F. As a result, we can identify a set V^{′} of super nodes by computing connected components of the graph (V,F) (line 7).
After finding V^{′}, we determine weights of super nodes in V^{′} and their selfloops as follows. For each super node S∈V^{′}, its node weight Z^{′}(S) is defined by the sum of weights of nodes in V that belong to S (line 9). The selfedge (S,S) is added to E^{′} (line 10) and its weight W^{′}(S,S) is defined by the sum of weights of edges in E whose end nodes belong to S (line 11).
The last step determines non-self edges between nodes in V^{′} and their edge weights as follows. For each unordered pair (S,T)∈V^{′}×V^{′}, find the set H of edges in E with one end node in S and the other in T (line 15). If H≠∅ (line 16), (S,T) is added to E^{′} (line 17), and W^{′}(S,T) is defined as the sum of the weights of the edges in H (line 18). Otherwise, there is no edge between S and T.
Skip rate. In practice, a graph can be reduced too quickly by SC if it has super-hub nodes. To coarsen the graph to a reasonable size, we propose to randomly skip merging while iterating over nodes in SC, i.e., with probability 0≤p<1, lines 3–5 are not executed. We call p the skip rate, and use p=0.5 in this paper.
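The three steps of SC, together with the skip rate, can be sketched as follows. This is an illustrative reading of the description above rather than a transcription of Algorithm 2: it uses a union-find structure in place of an explicit connected_components call, assumes integer node identifiers, and stores each undirected edge once; the function and variable names are hypothetical.

```python
import random
from collections import defaultdict


def shotgun_coarsen(Z, W, skip_rate=0.0, rng=None):
    """One level of shotgun coarsening.

    Z: dict node -> node weight.  W: dict (u, v) -> edge weight (undirected).
    Returns (super_of, Z2, W2): node -> super node, super node weights, and
    edge weights between super nodes (self-loops stored as (S, S)).
    """
    rng = rng or random.Random(0)
    nbrs = defaultdict(dict)
    for (u, v), w in W.items():
        if u != v:
            nbrs[u][v] = w
            nbrs[v][u] = w

    parent = {u: u for u in Z}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Step 1: each visited node picks its heaviest neighbor,
    # breaking ties by the smallest node weight.
    for u in Z:
        if not nbrs[u] or rng.random() < skip_rate:
            continue
        best = max(nbrs[u].values())
        ties = [v for v, w in nbrs[u].items() if w == best]
        v = min(ties, key=lambda x: Z[x])
        parent[find(u)] = find(v)  # adding (u, v) to F merges their components

    # Step 2: super nodes are the connected components of (V, F).
    super_of = {u: find(u) for u in Z}
    Z2 = defaultdict(int)
    for u, s in super_of.items():
        Z2[s] += Z[u]

    # Step 3: accumulate self-loop and inter-super-node edge weights.
    W2 = defaultdict(int)
    for s in Z2:
        W2[(s, s)] += 0  # every super node gets a self-loop entry
    for (u, v), w in W.items():
        s, t = super_of[u], super_of[v]
        W2[(min(s, t), max(s, t))] += w
    return super_of, dict(Z2), dict(W2)
```

On a star graph this sketch collapses the center and all leaves into a single super node in one pass, which is the behavior that motivates the skip rate.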
Cohesive super node
The goal of coarsening is to merge tightly connected nodes into one super node. In this respect, HEM may prevent a super node from being cohesive. Figure 1a shows the ideal coarsening for a given graph. Let us assume that for the first merging, the leftmost two nodes are merged as shown in Fig. 1b, and the next node to be merged is u. If we use HEM, v is chosen since it is the only candidate, leading to Fig. 1c. Note that although u has more edges to the green super node than to v, HEM forces u to be merged with v because the green super node is already matched. This result is clearly undesirable for good coarsening. In contrast, SC (Fig. 1d) chooses the green super node for u since the weight to the green node is larger than that to v. As a result, SC generates more cohesive super nodes than HEM does, leading to a high-quality coarsened graph.
Quickly reduced graph
Ideally, the number of nodes should halve at each step of the coarsening, but that does not happen for real-world graphs due to their highly skewed degree distribution [19]. In other words, a large number of coarsening steps are needed to obtain a coarsened graph of a manageable size, which requires a large amount of memory for storing the graphs themselves and the node merging information. This problem arises mainly due to star-like structures, which are depicted in Fig. 2a. The red and yellow nodes are eventually merged with the blue and the green groups, respectively, but this needs 5 more coarsening steps because only two nodes can be merged at a time. Note that each additional coarsening step requires space to store one graph and a mapping from nodes to super nodes; if the graph size is not effectively reduced, the required space greatly increases with the coarsening depth. This inefficiency can be resolved by SC as shown in Fig. 2b. Whereas HEM requires 5 more coarsening steps, SC needs only one.
Parallelization
We also improve MLRMCL via multi-core parallelization of its three main operations: Regularize, Inflate, and Prune. For Regularize, we parallelize the computation by assigning columns of the resulting matrix to cores. In other words, for M_{3}=M_{1}×M_{2}, we divide the computation as follows.
\(M_{3}(,i) = M_{1} \times M_{2}(,i)\) for each column i,
where M_{k}(,i) denotes the ith column of M_{k}. Computing the ith column of M_{3} is independent of computing the other columns j≠i of M_{3}, and thus we distribute the columns of M_{2} to multiple cores while keeping M_{1} in shared memory. Inflate and Prune are themselves column-wise operations. Thus, the computation on each column is assigned to a core.
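The column-wise decomposition can be sketched with Python's standard thread pool. The sketch only illustrates how the work is split: the matrices are dense lists of rows, the function names are hypothetical, and threads in CPython do not give a real speedup for this kind of pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(M1, column):
    """Multiply matrix M1 (list of rows) by a single column vector."""
    return [sum(row[k] * column[k] for k in range(len(column))) for row in M1]


def parallel_matmul(M1, M2, workers=4):
    """Compute M3 = M1 x M2 one column at a time, M1 being shared by all workers."""
    n_cols = len(M2[0])
    columns = [[M2[k][j] for k in range(len(M2))] for j in range(n_cols)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # each task computes one independent column of the result
        result_columns = list(pool.map(lambda c: matvec(M1, c), columns))
    # reassemble the columns into a list of rows
    return [[result_columns[j][i] for j in range(n_cols)] for i in range(len(M1))]
```

Because no two tasks write to the same column, no locking is needed; the same property is what makes Inflate and Prune straightforward to distribute.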
For efficiency in memory usage, we use the CSC (Compressed Sparse Column) format [20] to represent a matrix, which requires much less memory for storing a sparse matrix than a two-dimensional array format. In essence, the CSC format stores only the nonzero values of a matrix. Note that this strategy is especially efficient for real-world graphs, which are very sparse in general, e.g., |E|=O(|V|). Figure 3 shows the CSC format for an example matrix. To access the nonzero elements of the jth column (1-based indexing), we 1) obtain a=colPtr[j] and b=colPtr[j+1]−1 and 2) for a≤i≤b, get val[i]=A(rowInd[i],j). For example, to access the first column, we first obtain a=1 and b=2. By checking val[i] and rowInd[i] for 1≤i≤2, we identify the two nonzero values 10 and 9 at the first and the fourth rows, respectively, in the first column: i.e., A(1,1)=10 and A(4,1)=9 since val[1]=10 with rowInd[1]=1 and val[2]=9 with rowInd[2]=4.
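The column lookup described above can be reproduced with a short example. The 4×4 matrix below is hypothetical (it is not the matrix of Fig. 3); only its first column, with 10 in row 1 and 9 in row 4, is taken from the text. A dummy entry at index 0 keeps the 1-based indexing used in the description.

```python
# Hypothetical matrix, chosen so that column 1 matches the example in the text:
# A = [[10, 0, 0, 2],
#      [ 0, 3, 0, 0],
#      [ 0, 0, 7, 0],
#      [ 9, 0, 0, 5]]
val = [None, 10, 9, 3, 7, 2, 5]      # nonzero values, column by column
rowInd = [None, 1, 4, 2, 3, 1, 4]    # row index of each stored value
colPtr = [None, 1, 3, 4, 5, 7]       # start of each column in val / rowInd


def column_nonzeros(j):
    """Return the (row, value) pairs of the nonzeros in column j (1-based)."""
    a = colPtr[j]
    b = colPtr[j + 1] - 1
    return [(rowInd[i], val[i]) for i in range(a, b + 1)]
```

For the first column this gives a=1 and b=2, so the two stored entries are (1, 10) and (4, 9), as in the walkthrough above.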
Algorithm 3 shows our implementation of one iteration of parallelized RMCL with the CSC format. In the algorithm, nonzeros(M,j) is the set of pairs (i,x) indicating the nonzeros in the jth column of the matrix M, i.e., M(i,j)=x; 0_{n} and 1_{n} denote n-dimensional vectors all of whose elements are 0 and 1, respectively. Lines 4 to 9 correspond to Regularize. Each thread running on a dedicated core computes c=M×M_{G}(,j) for each column j assigned to it, and (j,c) is added to Λ after applying Inflate and Prune to c. Note that although we do not describe Inflate and Prune in Line 10, their implementation is straightforward for each column c.
Lines 13 to 16 correspond to constructing colPtr, and allocating spaces for val and rowInd for the resulting matrix N, which is done sequentially. Afterwards, Lines 17 to 24 correspond to filling val and rowInd using Λ and colPtr in parallel on the columns. Note that the positions of val and rowInd to be updated for each column are specified in colPtr, and they do not overlap.
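The two-phase assembly described above (a sequential pass to build colPtr and allocate the arrays, then a parallel fill) can be sketched as follows. The sketch uses 0-based indexing and a thread pool from the Python standard library; the function name and the input format, a list of per-column (row, value) lists standing in for Λ, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def assemble_csc(columns, workers=4):
    """Build CSC arrays from per-column results.

    columns[j] is a list of (row, value) pairs for column j (0-based).
    Returns (val, row_ind, col_ptr).
    """
    n = len(columns)
    # Sequential phase: prefix sums of the column sizes give col_ptr
    col_ptr = [0] * (n + 1)
    for j in range(n):
        col_ptr[j + 1] = col_ptr[j] + len(columns[j])
    val = [0.0] * col_ptr[n]
    row_ind = [0] * col_ptr[n]

    def fill(j):
        # Column j owns the slice [col_ptr[j], col_ptr[j + 1]) exclusively
        for offset, (i, x) in enumerate(sorted(columns[j])):
            row_ind[col_ptr[j] + offset] = i
            val[col_ptr[j] + offset] = x

    # Parallel phase: columns write to disjoint positions, so no locking is needed
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fill, range(n)))
    return val, row_ind, col_ptr
```

The prefix-sum pass is what lets the second phase run without synchronization: once col_ptr is known, each column's write range is fixed in advance.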
Results
We present experimental results to answer the following questions.

Q1 How does PSMCL improve the distribution of cluster sizes compared with MLRMCL?

Q2 What is the performance of PSMCL compared with MLRMCL in quality and running time?

Q3 How much speedup do we obtain by parallelizing PSMCL?

Q4 How accurate are the clusters found by PSMCL compared to the ground truth?
Table 3 lists the datasets used in our experiments. We use various bionetworks for evaluating the clustering quality and the running time; the largest dataset, DBLP, is used for scalability experiments.
Experimental settings
Machine. All experiments are conducted on a workstation with two Intel(R) Xeon(R) E5-2630 v4 CPUs @ 2.20GHz and 250GB of memory.
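Evaluation criteria. To evaluate the quality of clustering \(\mathcal {C}\) for Q1 and Q2, we use the average NCut [15] defined as follows.
\(\text{AvgNCut}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} {\sum \nolimits }_{C \in \mathcal {C}} \text{NCut}(C)\)
where
\(\text{NCut}(C) = {\sum \nolimits }_{u \in C, v \notin C} W(u,v) / {\sum \nolimits }_{u \in C} deg(u)\), i.e., the total weight of the edges leaving cluster C divided by the total weighted degree of the nodes in C.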
To answer Q4, we focus on the protein complex finding problem and use the accuracy measure defined by Hernandez et al. [4] as follows. Let \(\mathcal {G}\) be the set of ground truth clusters (protein complexes); then the degree of overlap T_{gc} for every \(g\in \mathcal {G}\) and \(c\in \mathcal {C}\) is defined as:
\(T_{gc} = |g \cap c|\), i.e., the number of proteins shared by g and c.
The accuracy ACC is the geometric mean of Sensitivity SST and Positive Predictive Value PPV:
\(ACC = \sqrt{SST \times PPV}\).
Parameter. We use the coarsening depth of 3 for PSMCL and MLRMCL with which the improvement in quality and speed is large while the number of resulting clusters remains reasonable.
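The two evaluation criteria of this section can be sketched in Python as follows. The sketch makes explicit assumptions: NCut of a cluster is taken as the weight of the edges leaving the cluster divided by the total weighted degree of its nodes, and SST and PPV follow the contingency-table definitions commonly used for protein complex prediction (best overlap per reference complex, and best overlap per predicted cluster, each normalized by the corresponding total). The function names and input formats are hypothetical, and the exact formulas used in [4] and [15] should be checked before reusing the code.

```python
from math import sqrt


def average_ncut(adj, clusters):
    """Average normalized cut of a clustering.

    adj maps each node to a dict {neighbor: weight} (undirected graph);
    clusters is a list of node collections.
    """
    total = 0.0
    for cluster in clusters:
        members = set(cluster)
        # weight of the edges with exactly one end inside the cluster
        cut = sum(w for u in members for v, w in adj[u].items() if v not in members)
        # total weighted degree of the nodes in the cluster
        volume = sum(w for u in members for w in adj[u].values())
        total += cut / volume if volume > 0 else 0.0
    return total / len(clusters)


def complex_accuracy(gold, predicted):
    """Return (ACC, SST, PPV) for predicted clusters against reference complexes."""
    gold = [set(g) for g in gold]
    predicted = [set(c) for c in predicted]
    # T[g][c]: number of proteins shared by reference g and cluster c
    T = [[len(g & c) for c in predicted] for g in gold]
    sst = sum(max(row) for row in T) / sum(len(g) for g in gold)
    col_max = [max(T[i][j] for i in range(len(gold))) for j in range(len(predicted))]
    col_sum = [sum(T[i][j] for i in range(len(gold))) for j in range(len(predicted))]
    ppv = sum(col_max) / sum(col_sum)
    return sqrt(sst * ppv), sst, ppv
```

For two triangles joined by a single edge and clustered as the two triangles, each cluster has cut weight 1 and volume 7, so the average NCut is 1/7.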
Performance of SC
In this section, we answer Q1 and Q2. Figure 4 shows a comparison of PSMCL with MLRMCL and MCL. The horizontal axis denotes the cluster size, and the vertical axis denotes the number of nodes belonging to clusters of that size.
Before going into details, we briefly summarize the results of MCL, which are invariant within each column of Fig. 4 because MCL has no balancing concept. As discussed in [15], in all cases MCL suffers from the fragmentation problem: a large portion of nodes belongs to tiny clusters of size 1–3. We provide the results of MCL as a baseline in the figure, and the following analysis focuses on comparing MLRMCL and PSMCL.
For all of our datasets, we observe the same patterns described in the following Observations 1–3, though we present only four representative results in Fig. 4.
Observation 1
(Too massive cluster without balancing) Without balancing, i.e., b=0, a large number of nodes are assigned to one cluster. Often, the entire graph becomes one cluster. A balancing factor of 1.0 to 1.5 results in a reasonable cluster size distribution for most of the networks.
The first row of Fig. 4 corresponds to the result without balancing, i.e., b=0. In this case, both PSMCL and MLRMCL group too many nodes into one cluster. In particular, on MINT, both output only one cluster containing all nodes in the graph. Figure 5 shows the ratio of the largest cluster size to the number of nodes for the bionetworks listed in Table 3. Note that the largest cluster sizes for BioPlex, Drosophila, MINT, Yeast1, Yeast2, and Yeast3 are nearly the same as the total number of nodes; those for DIPdroso and Yeast4 are relatively smaller, but still occupy a large proportion of the entire graph.
Observation 2
(PSMCL preferring larger clusters than MLRMCL) With b>0, the cluster size with the maximum number of total nodes in PSMCL is larger than that in MLRMCL.
The second, third and fourth rows of Fig. 4 show the results of varying the balancing factor b∈{1,1.5,2}, respectively. In contrast to the case of b=0, the cluster sizes of PSMCL and MLRMCL are concentrated on certain sizes. The mode of cluster size in PSMCL is larger than that in MLRMCL. The modes in PSMCL are 10–20 for DIPdroso, BioPlex and Drosophila, and 20–50 for MINT; those in MLRMCL are 5–10 for all. This observation is useful in practice when we want to cluster at a certain scale.
Observation 3
(PSMCL with less fragmentation than MLRMCL) With b>0, PSMCL results in a significantly smaller number of fragmented clusters whose sizes are 1–3 compared with MLRMCL.
PSMCL achieves concentrated cluster sizes as well as avoids the fragmented clusters; MLRMCL and MCL still suffer from the fragmentation. The number of nodes belonging to very small clusters in PSMCL is much smaller than that in MLRMCL. For instance, the number of nodes belonging to clusters of size 1–3 in PSMCL is less than 5% of that in MLRMCL for the DIP with b=1.5.
Observation 4
(PSMCL better than MLRMCL in time and NCut) PSMCL results in a faster running time with a smaller NCut than MLRMCL does.
Figure 6 shows the plot of running time versus the average NCut. PSMCL runs faster, down to 21%, and outputs clustering with a smaller average NCut, down to 87%, than MLRMCL does on average. For some cases, MCL is faster than PSMCL, but for all cases, its average NCut is worse than that by PSMCL.
Performance of parallelization
In this section, we answer Q3. We use b=1.5 for PSMCL and MLRMCL. Figure 7 shows the performance evaluation results for PSMCL on the bionetworks in Table 3 with an increasing number of cores. In all cases, PSMCL gets faster as the number of cores increases.
To test the scalability more effectively, we use DBLP, the largest of our datasets, though it is not a bionetwork. Figure 8a shows the speedup of PSMCL as the number of cores increases, compared with MLRMCL and MCL. We use points, not lines, for MLRMCL and MCL since they are single-core algorithms. PSMCL outperforms MLRMCL regardless of the number of cores and becomes faster as the number of cores increases. Precisely, the running time of PSMCL is reduced to 81% and 55% of the single-core running time with 2 and 4 cores, respectively.
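As a quick reading of these percentages (a worked calculation, not an additional measurement), a running time reduced to a fraction f of the single-core time corresponds to a speedup of 1/f, and dividing the speedup by the number of cores gives the parallel efficiency. The symbols S_c and E_c for the speedup and efficiency with c cores are introduced here only for this calculation.

```latex
\[
S_2 = \frac{1}{0.81} \approx 1.23, \qquad
S_4 = \frac{1}{0.55} \approx 1.82, \qquad
E_2 = \frac{S_2}{2} \approx 0.62, \qquad
E_4 = \frac{S_4}{4} \approx 0.45 .
\]
```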
Figure 8b shows the running time of PSMCL for increasing data sizes, compared with MLRMCL and MCL. Here, we use 4 cores for PSMCL. To obtain graphs of various sizes, we first take principal submatrices of the adjacency matrix of DBLP with sizes {20%,40%,60%,80%,100%} of the total, and use their giant connected components. As shown in the figure, the running times of all the methods are linear in the graph size, and PSMCL outperforms MLRMCL at all scales. Note that although MCL is slightly faster than PSMCL, MCL suffers from the fragmentation problem and has a worse NCut, while PSMCL has no fragmentation problem and a better NCut, as shown in Figs. 4 and 6.
Protein complex identification
In this section, we use two bionetworks, DIPyeast [21] and BioGRIDhomo [22], described in Table 3, to answer Q4 on the protein complex finding problem. The ground-truth protein complex information is extracted from CYC2008 2.0 [13] for DIPyeast and CORUM [14] for BioGRIDhomo. The complexes are used as reference clusters for measuring the accuracy.
Figure 9 shows the performance of PSMCL while varying the skip rate, in comparison with MLRMCL and MCL. (The skip rate is not applicable to MLRMCL and MCL, so each has a single accuracy value; for clear comparison, we represent that value by a horizontal dashed line.) Recall that the skip rate p determines the chance that each node is skipped and thus not merged with others; that is, the smaller p is, the more aggressive the coarsening. In the figure, PSMCL performs best with moderate values of p: 0.6 for DIPyeast and 0.7 for BioGRIDhomo. For both networks, PSMCL consistently outperforms MLRMCL with 0.5≤p≤0.7: the accuracy of PSMCL is higher than that of MLRMCL by up to 3.66% for DIPyeast and 8.24% for BioGRIDhomo. This result makes sense because SC with too large a p hardly reduces the size of a graph, while too small a p leads to overly large clusters due to aggressive coarsening.
PSMCL achieves up to 33.2% higher accuracy than MCL for DIPyeast, but reaches only 98.7% of the MCL accuracy for BioGRIDhomo. The latter is due to the many dimer structures present in the CORUM database. When dimers are excluded from the database, PSMCL greatly outperforms MCL, as shown in Fig. 9c. Although PSMCL is not effective in finding dimers, note that MCL suffers from the fragmentation problem (Fig. 4) and performs poorly in internal evaluation by average NCut (Fig. 6), which assesses the potential for finding well-formed but undiscovered clusters.
Conclusion
In this paper, we propose PSMCL, a parallel graph clustering method which gives superior performance on bionetworks. PSMCL includes two enhancements compared to previous methods. First, PSMCL incorporates a newly proposed coarsening scheme, which we call SC, to resolve the inefficiency of MLRMCL on real-world networks. SC allows merging multiple nodes at a time, which reduces the graph size more quickly and makes super nodes much more cohesive than HEM, the scheme used in MLRMCL, does. Second, PSMCL provides a multi-core parallel algorithm for clustering to increase scalability. Extensive experiments show that PSMCL yields clusters that are generally larger than those of MLRMCL, and that it greatly alleviates the fragmentation problem. Moreover, PSMCL finds clusters whose quality is better than those of MLRMCL in both internal (average NCut) and external (reference clusters) criteria. Also, PSMCL gets faster as more cores are used, and outperforms MLRMCL in speed even with a single core.
The capability of PSMCL to quickly find mid-size clusters in large-scale bionetworks has a wide range of applications in systems biology. Although we have only shown that PSMCL finds mid-size protein complexes on two protein-protein interaction networks more effectively than existing topology-based clustering algorithms, we believe that it can also be effectively applied to function prediction, disease module detection, and other systems biology analyses.
Availability and requirements
Project name: PSMCL;
Project home page: https://github.com/leesael/PSMCL;
Operating system(s): Platform independent (tested on Ubuntu);
Programming language: Java;
Other requirements: Java 1.8 or higher;
License: BSD;
Any restrictions to use by non-academics: licence needed.
Abbreviations
HEM: Heavy edge matching
MCL: Markov clustering
MLRMCL: Multilevel RMCL
PSMCL: Parallel shotgun coarsening MCL
RMCL: Regularized MCL
SC: Shotgun coarsening
References
 1
Thomas J, Seo D, Sael L. Review on graph clustering and subgraph similarity based analysis of neurological disorders. Int J Mol Sci. 2016; 17(6):862.
 2
Lei X, Wu FX, Tian J, Zhao J. ABC and IFC: Modules detection method for PPI network. BioMed Res Int. 2014; 2014:1–11.
 3
Xu B, Wang Y, Wang Z, Zhou J, Zhou S, Guan J. An effective approach to detecting both small and large complexes from protein-protein interaction networks. BMC Bioinformatics. 2017; 18(Suppl 12):419.
 4
Hernandez C, Mella C, Navarro G, Olivera-Nappa A, Araya J. Protein complex prediction via dense subgraphs and false positive analysis. PLoS ONE. 2017; 12(9):0183460.
 5
Bernardes JS, Vieira FR, Costa LM, Zaverucha G. Evaluation and improvements of clustering algorithms for detecting remote homologous protein families. BMC Bioinformatics. 2015;16(1):34.
 6
Tadaka S, Kinoshita K. NCMine: Core-peripheral based functional module detection using near-clique mining. Bioinformatics. 2016; 32(22):3454–60.
 7
Li Z, Liu Z, Zhong W, Huang M, Wu N, Yun Xie ZD, Zou X. Largescale identification of human protein function using topological features of interaction network. Sci Rep. 2016;6. 7:16199.
 8
Van Dongen S. Graph clustering by flow simulation. PhD thesis: University of Utrecht; 2000.
 9
Satuluri V, Parthasarathy S, Ucar D. Markov clustering of protein interaction networks with improved balance and scalability. In: ACM Conference on Bioinformatics, Computational Biology, and Health Informatics. New York: ACM: 2010. p. 247–56.
 10
Brohee S, van Helden J. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006; 7(1):488.
 11
Vlasblom J, Wodak SJ. Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009;10:99.
 12
Beyer A, Wilhelm T. Dynamic simulation of protein complex formation on a genomic scale. Bioinformatics. 2005; 21(8):1610–6.
 13
Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2008; 37(3):825–31.
 14
Ruepp A, Waegele B, Lechner M, Brauner B, DungerKaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. Corum: the comprehensive resource of mammalian protein complexes—2009. Nucleic Acids Res. 2009; 38(suppl_1):497–501.
 15
Satuluri V, Parthasarathy S. Scalable graph clustering using stochastic flows: Applications to community discovery. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2009. p. 737–46.
 16
Faloutsos M, Faloutsos P, Faloutsos C. On power-law relationships of the internet topology. In: SIGCOMM. New York: ACM: 1999. p. 251–62.
 17
Lim Y, Kang U, Faloutsos C. Slashburn: Graph compression and mining beyond caveman communities. IEEE Trans Knowl Data Eng. 2014; 26(12):3077–89.
 18
Lim Y, Lee W, Choi H, Kang U. MTP: discovering high quality partitions in real world graphs. World Wide Web. 2017; 20(3):491–514.
 19
Abou-Rjeili A, Karypis G. Multilevel algorithms for partitioning power-law graphs. In: Proceedings of the 20th International Conference on Parallel and Distributed Processing. Washington, DC: IEEE Computer Society: 2006. p. 124.
 20
Duff IS, Grimes RG, Lewis JG. Sparse matrix test problems. ACM Trans Math Softw. 1989; 15(1):1–14.
 21
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004; 32(suppl 1):449–51.
 22
Wang J, Vasaikar S, Shi Z, Greer M, Zhang B. WebGestalt 2017: a more comprehensive, powerful, flexible and interactive gene set enrichment analysis toolkit. Nucleic Acids Res. 2017; 45(W1):W130–7.
 23
Huttlin EL, Bruckner RJ, Paulo JA, Cannon JR, Ting L, Baltier K, Colby G, Gebreab F, Gygi MP, Parzen H, et al. Architecture of the human interactome defines protein communities and disease networks. Nature. 2017; 545(7655):505–9.
 24
Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao Y, Ooi C, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science. 2003; 302(5651):1727–36.
 25
Chatr-Aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, Castagnoli L, Cesareni G. MINT: the molecular interaction database. Nucleic Acids Res. 2007; 35(suppl 1):572–4.
 26
Ryan CJ, Roguev A, Patrick K, Xu J, Jahari H, Tong Z, Beltrao P, Shales M, Qu H, Collins SR, et al. Hierarchical modularity and the evolution of genetic interactomes across species. Mol Cell. 2012; 46(5):691–704.
 27
Chen J, Hsu W, Lee ML, Ng SK. Increasing confidence of protein interactomes using network topological metrics. Bioinformatics. 2006; 22(16):1998–2004.
 28
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010; 327(5964):425–31.
 29
Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res. 2003; 31(9):2443–50.
Acknowledgements
Not applicable.
Funding
Publication of this article was funded by a National Research Foundation of Korea grant funded by the Korean government (NRF2018R1A5A1060031, NRF2018R1A1A3A0407953) and by the Korea Institute of Science and Technology Information (K18L03C02).
Availability of data and materials
The datasets analysed during the current study are available in the GitHub repository, https://github.com/leesael/PSMCL.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 13, 2019: Selected articles from the 8th Translational Bioinformatics Conference: Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement13.
Author information
Contributions
YL took the lead in writing the manuscript. IY developed the software and conducted all the experiments. IY, UK, and LS designed the experiments, and YL and IY summarized the results. YL, IY, and UK designed the method. DS, UK, and LS provided critical feedback. LS provided the expertise and resources to analyze the results and supervised the project. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Lee Sael.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lim, Y., Yu, I., Seo, D. et al. PSMCL: parallel shotgun coarsened Markov clustering of protein interaction networks. BMC Bioinformatics 20, 381 (2019). doi:10.1186/s12859-019-2856-8
Keywords
 Graph clustering
 Markov clustering
 Parallel clustering
 Coarsening
 Non-overlapping clusters
 Protein complex finding