The relative vertex clustering value - a new criterion for the fast discovery of functional modules in protein interaction networks
 Zina M Ibrahim^{1, 2, 4} and
 Alioune Ngom^{3}
https://doi.org/10.1186/1471-2105-16-S4-S3
© Ibrahim and Ngom; licensee BioMed Central Ltd. 2015
Published: 23 February 2015
Abstract
Background
Cellular processes are known to be modular and are realized by groups of proteins implicated in common biological functions. Such groups of proteins are called functional modules, and many community detection methods have been devised for their discovery from protein interaction network (PIN) data. In current agglomerative clustering approaches, vertices with very few neighbors are often classified as separate clusters, which does not make sense biologically. Also, a major limitation of agglomerative techniques is that their computational efficiency does not scale well to large PINs. Finally, PIN data obtained from large-scale experiments generally contain many false positives, which makes it hard for agglomerative clustering methods to find the correct clusters, since they are known to be sensitive to noisy data.
Results
We propose a local similarity premetric, the relative vertex clustering value, as a new criterion for deciding when a node can be added to a given node's cluster, and which addresses the above three issues. Based on this criterion, we introduce a novel and very fast agglomerative clustering technique, FACPIN, for discovering functional modules and protein complexes from PIN data.
Conclusions
Our proposed FACPIN algorithm is applied to nine PIN data sets from eight different species, including the yeast PIN, and the identified functional modules are validated using Gene Ontology (GO) annotations from DAVID Bioinformatics Resources. Identified protein complexes are also validated using experimentally verified complexes. Computational results show that FACPIN can discover functional modules or protein complexes from PINs more accurately and more efficiently than HCPIN and CNM, the current state-of-the-art approaches for clustering PINs in an agglomerative manner.
Background
Functional modules are groups of genes or proteins involved in common elementary biological functions. Proteins are also known to interact with each other by forming complexes, and each such complex performs an independent and discrete biological function through the interactions of its member proteins [1]. Single proteins may also participate in more than one complex or functional module. Functional modules or protein complexes correspond to modules, which are dense subgraphs within protein interaction networks (PINs), and hence, can be discovered by appropriate network clustering approaches. Generally speaking, modules in PINs refer to highly connected subgraphs which have more internal edges than external edges. Many definitions of modules have been proposed in literature [2], and consequently different community detection algorithms have been proposed based on these different definitions.
Module detection in PINs is a computationally hard task and conventional clustering algorithms are not well suited for this task [3, 4]. Efficient, accurate, robust, and scalable methods are therefore required for mining large PINs [5–8]. There are generally three classes of module detection approaches: 1) those based on finding cliques, which are fully connected subnetworks [9, 10]; 2) those based on detecting dense subnetworks [11, 12], not necessarily cliques; and 3) those based on uncovering the hierarchical organization of modules within PINs [13, 14]. Clique techniques do not scale well to large PINs, and the modules they identify are too strict in the biological sense, since the proteins participating in a complex may not all interact with each other. Current density-based algorithms commonly misclassify proteins with low degree into small clusters which could be merged into core protein clusters [15]. Moreover, many biologically meaningful modules are ignored due to their low topological connectivity [15].
Hierarchical clustering methods based on a global metric over nodes or edges, such as betweenness centrality, are very time-consuming, and thus do not scale well to large PINs. The few hierarchical approaches based on a local metric also share the problem of classifying very low-degree vertices into separate clusters, which does not make sense biologically. Another major issue in current hierarchical clustering approaches is their inability to perform well on noisy data. This is generally the case when clustering PIN data generated from large-scale high-throughput experiments. As discussed in [16, 17], such PIN data usually contain many false positive interactions, and hence, care must be taken to deal with the sensitivity of hierarchical methods to such data.
The majority of the clustering methods proposed in the literature have focused on identifying non-overlapping communities. However, it is well recognized that complex networks contain multi-class nodes, that is, vertices belonging to many communities at once. Overlapping clustering algorithms have been neither intensively studied nor successful at finding good subnetworks, although they first appeared three decades ago; see the extensive review of overlapping methods in [18]. Multifunctional proteins are proteins which perform several functions and interact specifically with distinct sets of protein partners, simultaneously or not, depending on the function being performed. Thus, such proteins are involved in many functional modules or protein complexes, and hence, it is reasonable to assume that PINs have overlapping communities, each containing some multifunctional proteins. A few successful hierarchical clustering approaches, such as the Overlapping Cluster Generator (OCG) algorithm of [19] and the Link Communities method of [20] (to cite just two), have recently been proposed with the aim of identifying overlapping protein communities as well as multifunctional proteins from PINs.
In this paper, we propose a fast agglomerative clustering technique, FACPIN, which addresses the issues and limitations discussed above for hierarchical algorithms. FACPIN is based on a local similarity premetric, the relative vertex-to-vertex clustering value, for clustering PINs in an agglomerative hierarchical manner.
Related works
Many hierarchical clustering approaches (both agglomerative and divisive) have been introduced in the literature since the original publication of [21] for clustering networks; see the excellent survey on graph clustering algorithms in [22]. Here, we present only the few methods that are directly related to our proposed agglomerative approach.
where I_{u,v} = N_{u} ∩ N_{v}, and 0 ≤ w(a, b) ≤ 1 is the weight assigned to the edge (a, b), which represents the reliability of the interaction between vertices a and b, or the probability of their interaction being a true positive. Clearly, Equation (2) is the special case of Equation (3) in which w(a, b) = 1 for all edges (a, b). In Equations (1)-(3), two vertices connected by an edge with a larger objective value are more likely to lie in the same module.
and applied a simple single-linkage hierarchical clustering algorithm to build a link dendrogram from Equation (4), which yields the link communities with the best edge partition density. By identifying such non-overlapping link communities, [20] has detected hierarchically organized node community structures with pervasive overlap.
In the next section, we propose a new criterion for weighted undirected graphs, which is a modification of the relative vertex-to-vertex clustering value that we first introduced in [26] for unweighted graphs. In [26], however, the unweighted criterion was applied only to the problem of detecting protein complexes in PINs [27], whereas here we apply our weighted criterion to identifying functional modules in PINs. It is a local similarity premetric combining the ideas behind the vertex clustering coefficient, the edge clustering coefficient, and the edge clustering value; it allows us to decide when a given vertex can be included into the cluster of another vertex, and it helps address all of the issues discussed above.
Methods
Network modularity structure
The problem of community detection is hence equivalent to searching for a k and a partition P_{k} that maximize the value of a modularity function.
The relative vertex-to-vertex clustering value
where ${I}_{u,v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$, and 0 ≤ w(x, y) ≤ 1 is the weight assigned to the edge (x, y), which represents the reliability of the interaction between vertices x and y, or the probability of their interaction being a true positive. Clearly, Equation (9) is the special case of Equation (10) in which w(x, y) = 1 for all edges (x, y). For a node a ∈ V, we let ${k}_{a}={\sum}_{b\in V}{A}_{a,b}$ be its degree. For a weighted graph, we define the weighted degree of a vertex a as ${\kappa}_{a}={\sum}_{b\in V}w\left(a,b\right)$, similarly to [25].
${R}_{w}(u\to v)$, with 0 ≤ ${R}_{w}(u\to v)$ ≤ 1, is a similarity premetric, since it does not satisfy the axiom of symmetry or the triangle inequality but does satisfy the axioms of self-similarity and maximality [30]; see http://www.scholarpedia.org/article/Similarity_measures and http://en.wikipedia.org/wiki/Metric_(mathematics)#Premetrics. A vertex u with a larger clustering value given another vertex v is more likely to lie in the cluster containing v. In the following, we let C(v) = (C_{v} ⊂ V, E_{v} ⊂ E) denote the subnetwork cluster containing v, and we assume C(v) is a community. Below, we describe the properties of ${R}_{w}(u\to v)$.
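Equations (9) and (10) are not reproduced in this text, so the following Python sketch assumes a form of ${R}_{w}(u\to v)$ that agrees with the merging inequality given later for weighted graphs: the weight that u places on the closed intersection ${N}_{u}^{+}\cap {N}_{v}^{+}$, divided by the weight it places on its whole closed neighborhood ${N}_{u}^{+}$. The function names, the dictionary-based graph layout, and the self-weight w(u, u) = 1 are illustrative assumptions rather than part of the method description.

```python
from typing import Dict, Hashable, Iterable, Tuple

Graph = Dict[Hashable, Dict[Hashable, float]]  # vertex -> {neighbor: edge weight}


def build_graph(edges: Iterable[Tuple[Hashable, Hashable, float]]) -> Graph:
    """Build a symmetric weighted adjacency map from (a, b, weight) triples."""
    graph: Graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def relative_clustering_value(graph: Graph, u, v, self_weight: float = 1.0) -> float:
    """Assumed form of R_w(u -> v): share of u's closed-neighborhood weight
    that falls inside N_u^+ ∩ N_v^+ (closed neighborhoods include the vertex itself)."""
    weights_u = dict(graph[u])
    weights_u[u] = self_weight          # assumption: w(u, u) = 1
    closed_v = set(graph[v]) | {v}      # N_v^+
    shared = sum(w for a, w in weights_u.items() if a in closed_v)
    return shared / sum(weights_u.values())


# Unweighted toy graph (all weights 1): u hangs off v, and v also sits in a triangle.
g = build_graph([("u", "v", 1.0), ("v", "a", 1.0), ("v", "b", 1.0), ("a", "b", 1.0)])
r_uv = relative_clustering_value(g, "u", "v")  # N_u^+ = {u, v} lies entirely in N_v^+
r_vu = relative_clustering_value(g, "v", "u")  # only {u, v} of N_v^+ = {u, v, a, b} is shared
print(r_uv, r_vu)
```

Under these assumptions the toy graph gives different values in the two directions, which illustrates why the measure is a premetric rather than a symmetric similarity.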
Analysis of ${R}_{w}(u\to v)$

1. Case k_{u} = 1. $R(u\to v)$ = 1, thus it is maximal. $R(v\to u)$ is also maximal when k_{v} = 1, and hence, the connected component C = ({u, v}, {(u, v)}) is a community. If on the other hand k_{v} > 1, then we have $R(u\to v)$ > $R(v\to u)$ and therefore u should be merged with the current cluster C(v) of v (not the other way around, which corresponds to merging v with C(u)).

2. Case 1 < k_{u} < k_{v}. $R(u\to v)$ > $R(v\to u)$ and $R(u\to v)$ may or may not be maximal. Vertex u should be merged with C(v) only when $R(u\to v)$ > 0.5; that is, when more than 50% of the closed neighborhood of u, ${N}_{u}^{+}$, is in the intersection ${N}_{u}^{+}\cap {N}_{v}^{+}$. This is a reasonable decision since the number of triangles involving the edge (u, v) is |N_{u} ∩ N_{v}|, and the edge (u, v) is definitely not a "bridge" connecting two clusters when most of u's neighbors form a triangle with v.

3. Case 1 < k_{v} < k_{u}. This is the reverse of Case 2 above: thus, u should not merge with C(v) since $R(u\to v)$ < $R(v\to u)$.

4. Case k_{u} = k_{v}. $R(u\to v)$ = $R(v\to u)$, and we should consider two possible subcases.
 (a) Subcase ${N}_{u}^{+}={N}_{v}^{+}$. We have $R(u\to v)$ = $R(v\to u)$ = 1 since ${N}_{u}^{+}={N}_{v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$. Hence, u should be merged with C(v), given that the induced subnetwork of G for ${N}_{u}^{+}\cap {N}_{v}^{+}$ forms a community.
 (b) Subcase ${N}_{u}^{+}\ne {N}_{v}^{+}$. We have $R(u\to v)$ = $R(v\to u)$ < 1. In this case, u should be merged with C(v) only when $R(u\to v)$ > 0.5.
Given an edge (u, v), assume that the degrees of vertices u and v in G are such that k_{u} = k_{v} = d is (very) large and that u and v have no common neighbors. Then, we have $R\left(u\to v\right)=R\left(v\to u\right)=1\cdot \frac{2}{1+d}\le 0.5$ assuming d ≥ 3. In this case, the induced subnetwork of G for {u} ∪ C_{v} (or for ${N}_{v}^{+}$) is not a community, and likewise for {v} ∪ C_{u} (or for ${N}_{u}^{+}$). In general, considering the induced subgraph of G on ${N}_{u}^{+}\cup {N}_{v}^{+}$, we define the local betweenness value of edge (u, v) as the percentage of paths from vertices in N_{u} \ N_{v} to vertices in N_{v} \ N_{u} that go through edge (u, v). Given the number of common neighbors of u and v, |N_{u} ∩ N_{v}|, the local betweenness of edge (u, v) is thus $\lambda \left(u,v\right)=100\cdot \frac{1}{\left|{N}_{u}\cap {N}_{v}\right|+1}$. Given two connected high-degree vertices u and v, the local edge betweenness value λ(u, v) increases as |N_{u} ∩ N_{v}| decreases, which corresponds to the case where $R(u\to v)$ and $R(v\to u)$ are both small (both ≤ 0.5). Edges with high local betweenness values are likely to connect two communities, and therefore, vertices u and v should not lie in the same community. This is not necessarily true, since the inference is based on a local value rather than on the global edge betweenness metric defined in [21]. However, starting with correct initializations and using an appropriate node clustering mechanism, a greedy algorithm can be devised based on the faster local evaluations instead of the costly global evaluations.
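The two quantities in this remark can be checked numerically. The sketch below is only an illustration for unweighted graphs: it computes λ(u, v) from the stated formula and the ratio $\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|/\left|{N}_{u}^{+}\right|$, which is the form of $R(u\to v)$ assumed here because it reproduces the value 2/(1 + d) quoted above. The helper `two_hubs` and all names are hypothetical.

```python
from typing import Dict, Hashable, Set

Neighbors = Dict[Hashable, Set[Hashable]]  # vertex -> open neighbor set N_v


def local_betweenness(neighbors: Neighbors, u, v) -> float:
    """lambda(u, v) = 100 / (|N_u ∩ N_v| + 1), as stated in the text."""
    common = neighbors[u] & neighbors[v]
    return 100.0 / (len(common) + 1)


def closed_overlap_ratio(neighbors: Neighbors, u, v) -> float:
    """|N_u^+ ∩ N_v^+| / |N_u^+|: assumed unweighted form of R(u -> v)."""
    closed_u = neighbors[u] | {u}
    closed_v = neighbors[v] | {v}
    return len(closed_u & closed_v) / len(closed_u)


def two_hubs(d: int, shared: int) -> Neighbors:
    """Adjacent vertices u and v, each of degree d, with `shared` common neighbors."""
    common = {f"c{i}" for i in range(shared)}
    private_u = {f"a{i}" for i in range(d - 1 - shared)}
    private_v = {f"b{i}" for i in range(d - 1 - shared)}
    neighbors: Neighbors = {
        "u": {"v"} | common | private_u,
        "v": {"u"} | common | private_v,
    }
    for c in common:
        neighbors[c] = {"u", "v"}
    for a in private_u:
        neighbors[a] = {"u"}
    for b in private_v:
        neighbors[b] = {"v"}
    return neighbors


bridge = two_hubs(d=4, shared=0)  # no common neighbors: (u, v) looks like a bridge
tight = two_hubs(d=4, shared=3)   # every other neighbor is shared
print(local_betweenness(bridge, "u", "v"), closed_overlap_ratio(bridge, "u", "v"))
print(local_betweenness(tight, "u", "v"), closed_overlap_ratio(tight, "u", "v"))
```

In the bridge-like case λ is at its maximum while the ratio equals 2/(1 + d) = 0.4, below the 0.5 threshold; with all neighbors shared, λ drops and the ratio becomes maximal.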
$R(u\to v)$ is maximal when $\left|{N}_{u}^{+}\right|=\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|$; that is, in either Case (1) or Case (4a) above. In either case, u contributes only new internal edges to the induced subnetwork of G for ${C}_{v}^{+}=\left\{u\right\}\cup {C}_{v}$ (or for ${C}_{v}^{+}={N}_{v}^{+}$) and contributes no new external edges, and hence, the induced subnetwork of G for ${C}_{v}^{+}$ remains a community if C_{v} (or ${N}_{v}^{+}$) is a community. Finally, u is more likely to be in the community C(v), and v less likely to be in the community C(u), when both $R(u\to v)$ > 0.5 and $R(u\to v)$ ≥ $R(v\to u)$. Under these two conditions, k_{u} ≤ k_{v} and $\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|>\frac{\left|{N}_{u}^{+}\right|}{2}$; that is, more than 50% of the neighbors of u are in the intersection and less than 50% of the neighbors of v are in the intersection. Since k_{u} ≤ k_{v}, the induced subnetwork of G for ${C}_{v}^{+}=\left\{u\right\}\cup {C}_{v}$ is clearly a community when ${N}_{u}\cap {N}_{v}\subseteq C\left(v\right)$, with its modularity increasing with |N_{u} ∩ N_{v}|.
Quantitative definition of module
The communities (i.e. modules) C determined by algorithms which use this merging test are such that the merging condition is satisfied for every internal edge of C and not satisfied for every external edge of C. Given a weighted undirected graph G = (V, E) and the merging parameter μ, a subgraph C ⊆ G is said to be a μ-module if the merging condition is true for every internal edge of C and false for every external edge of C. Different network modularity structures are obtained by varying the value of the merging parameter μ.
The relative vertex clustering value $R(u\to v)$ implements the ideas behind the edge clustering coefficient, ${C}_{u,v}^{\left(k\right)}$, of [2], since for a given vertex v and a neighbor u the number of triangles containing edge (u, v) is exactly |N_{u} ∩ N_{v}|, and u will be included into C(v) whenever most of the neighbors of u (excluding v) are in N_{u} ∩ N_{v}. This is also true even when (u, v) is not an edge; in such a case, |N_{u} ∩ N_{v}| relates to the number of squares containing vertices u and v. On the other hand, as with the edge clustering value ECV(u, v) of [25], we overcome the limitations of [2] by not assuming the existence of closed loops in a network, such as triangles or high-order loops. The relative vertex clustering values $R(u\to v)$ and ${R}_{w}(u\to v)$ also improve on ECV(u, v) and ECV_{w}(u, v), since neighbors u of v which have most of their neighbors forming a triangle with v are considered for possible inclusion in C(v). Searching for vertices u which form a cluster with v is also more efficient than searching for edges (u, v) that make a cluster, since the number of edges is larger than the number of vertices in dense subgraphs.
The FACPIN algorithm
In a clustering task, we can use ${R}_{w}(u\to v)$ and ${R}_{w}(v\to u)$ to decide whether u should be included into C(v) = (C_{v}, E_{v}) ⊂ G = (V, E), the current cluster of v. Based on the definitions of the relative vertex-to-vertex clustering value and quantitative network modularity, we propose a fast agglomerative, node-focused clustering algorithm named FACPIN, shown in Algorithm 1. The input to FACPIN is an undirected weighted graph; when an unweighted graph is used, all edges (a, b) are treated equally with weight w(a, b) = 1. The output of FACPIN is a collection of non-overlapping subnetwork communities.
Given a weighted undirected PIN G = (V, E), we initially consider each vertex as a singleton cluster, and sort the vertices v ∈ V into a queue Q_{V} in non-increasing order of their weighted degrees κ_{v}. Then,
Algorithm 1 The FACPIN algorithm
Require: G = (V, E): undirected PIN graph;
    A_{|V| × |V|}: adjacency matrix;
    W_{|V| × |V|}: weight matrix;
    μ: merging parameter;
Ensure: P_{k} = {C_{1}, ..., C_{k}}: non-overlapping subnetwork communities
{Initialization Phase}
for all v ∈ V do
    C_{v} ← {v}; {C_{v} = cluster containing node v}
    E_{v} ← ∅;
    ${\kappa}_{v}\leftarrow {\sum}_{b\in V}w\left(v,b\right)$; {weighted degree of v}
    C(v) ← (C_{v}, E_{v}); {each vertex is a singleton cluster; C(v) = subnetwork containing node v}
end for
{Community Detection Phase}
Sort V into Q_{V} in non-increasing order of κ_{v} values;
repeat
    v ← head of Q_{V}; {select the highest-κ_{v} vertex in Q_{V}}
    N_{v} ← {u ∈ V | (u, v) ∈ E}; {neighbor set of v}
    for all u ∈ N_{v} not yet assigned to a cluster do
        if ${R}_{w}(u\to v)$ > 0.5μ and ${R}_{w}(u\to v)$ ≥ ${R}_{w}(v\to u)$ then
            C_{z} ← C_{v} ∪ {u}, ∀ z ∈ C_{v} ∪ {u};
        end if
    end for
    Q_{V} ← Q_{V} \ {v}; {remove v from Q_{V}}
until Q_{V} = ∅
{Compute the Partition P_{k}}
U ← V;
i ← 1;
while U ≠ ∅ do
    v ← randomly select a vertex from U;
    C_{i} ← C(v) = the induced subgraph of G for C_{v};
    U ← U \ {u | C_{u} = C_{v}};
    i ← i + 1;
end while
{Evaluate the Modularity of Partition P_{k}}
Modularity ← D_{w}(P_{k}), Q_{w}(P_{k}) and Ω_{w}(P_{k});
return P_{k} ← {C_{1}, ..., C_{k}}; Q_{w}(P_{k}) and Ω_{w}(P_{k});
on each neighbor u ∈ N_{ v } of v in order to decide for its inclusion into the current cluster C_{ v } of v.
A neighbor u ∈ N_{v} is added into the current cluster C_{v} of v when the majority of the neighbors of u are in ${N}_{u}^{+}\cap {N}_{v}^{+}$; that is, when $R(u\to v)$ > 0.5 and ${R}_{w}(u\to v)$ ≥ ${R}_{w}(v\to u)$, in which case κ_{u} ≤ κ_{v} and $\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|>\frac{1}{2}\left|{N}_{u}^{+}\right|$, which for weighted graphs is equivalent to ${\sum}_{a\in {I}_{u,v}^{+}}w\left(u,a\right)>\frac{1}{2}{\sum}_{b\in {N}_{u}^{+}}w\left(u,b\right)$ where ${I}_{u,v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$. By gradually examining each high-degree vertex v from the queue Q_{V} and then gradually adding its unassigned neighbors u to C_{v}, FACPIN agglomerates all singleton clusters into the |V| vertex sets C_{v}. The final k communities C_{i}, for 1 ≤ i ≤ k, are the induced subgraphs of G for all distinct C_{v}; in the algorithm, we distinguish between a cluster C_{v} = {v_{1},...,v_{n}}, a subnetwork C(v) = (C_{v}, E_{v}), and the i-th subnetwork C_{i}. In FACPIN, the merging parameter μ, with 0 ≤ μ < 2, is user-defined. In particular for weighted PINs, different modularity results can be obtained by changing the value of μ.
Most hierarchical methods, with the exception of the HCPIN algorithm of [25], are based on a costly global metric for partitioning a PIN network. FACPIN is based on the local similarity premetric ${R}_{w}(u\to v)$, which encodes useful information about the local topology around vertices u and v, and which helps make a local decision maximizing the modularity of the final partitioning.
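The community detection phase of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which was written in R): the form of ${R}_{w}$ is inferred from the weighted merging inequality above with a self-weight w(u, u) = 1, a neighbor is treated as "not yet assigned" while its cluster is still a singleton, and the final partition is read off in dictionary order instead of by random selection. All function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, Dict[Hashable, float]]  # vertex -> {neighbor: edge weight}


def build_graph(edges: Iterable[Tuple[Hashable, Hashable, float]]) -> Graph:
    """Build a symmetric weighted adjacency map from (a, b, weight) triples."""
    graph: Graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def r_w(graph: Graph, u, v, self_weight: float = 1.0) -> float:
    """Assumed R_w(u -> v): weight of u on N_u^+ ∩ N_v^+ over its weight on N_u^+."""
    weights_u = {**graph[u], u: self_weight}  # assumption: w(u, u) = 1
    closed_v = set(graph[v]) | {v}
    shared = sum(w for a, w in weights_u.items() if a in closed_v)
    return shared / sum(weights_u.values())


def facpin(graph: Graph, mu: float = 0.5) -> List[Set[Hashable]]:
    """Greedy node-focused agglomeration following the structure of Algorithm 1."""
    # Initialization: every vertex starts in its own singleton cluster C_v.
    cluster: Dict[Hashable, Set[Hashable]] = {v: {v} for v in graph}
    # Queue Q_V: vertices in non-increasing order of weighted degree kappa_v.
    queue = sorted(graph, key=lambda v: sum(graph[v].values()), reverse=True)
    for v in queue:
        for u in graph[v]:
            if len(cluster[u]) > 1:
                continue  # assumption: u counts as assigned once it left its singleton
            r_uv, r_vu = r_w(graph, u, v), r_w(graph, v, u)
            if r_uv > 0.5 * mu and r_uv >= r_vu:
                cluster[v].add(u)        # C_v <- C_v ∪ {u}
                cluster[u] = cluster[v]  # all members share one set object
    # Collect the distinct vertex sets as the partition P_k.
    partition: List[Set[Hashable]] = []
    seen = set()
    for v in graph:
        if id(cluster[v]) not in seen:
            seen.add(id(cluster[v]))
            partition.append(set(cluster[v]))
    return partition


# Two triangles joined by the single edge (c, d), all weights 1.
g = build_graph([("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0), ("c", "d", 1.0),
                 ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0)])
print(facpin(g, mu=1.0))
print(facpin(g, mu=0.5))
```

On this toy graph, μ = 1.0 keeps the two triangles apart because the bridge endpoints score exactly 0.5, which does not exceed the threshold 0.5μ, whereas μ = 0.5 lowers the threshold enough for the bridge to be merged. This is in line with the observation that clusters grow as μ decreases.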
Computational complexity of FACPIN
Given a weighted PIN G = (V, E), let n = |V|, m = |E|, κ_{max} = max_{v∈V} κ_{v} be the maximum weighted degree in G, and ${\kappa}_{ave}=\frac{1}{n}{\sum}_{v\in V}{\kappa}_{v}$ be the average weighted degree in G. The complexity of computing ${R}_{w}(u\to v)$ is O(κ_{max}), and hence, the complexity of FACPIN is $O\left(n{\kappa}_{ave}^{2}\right)\ll O\left(n{\kappa}_{max}^{2}\right)\lll O\left({n}^{3}\right)$. PINs are power-law networks, so the majority of proteins interact with only a few proteins, and thus κ_{ave} is generally small and can be considered a constant [25]. The CNM [23] and HCPIN [25] methods run in O(mh log n) and $O\left(m{\kappa}_{ave}^{2}\right)$ steps, respectively, where h is the depth of the dendrogram describing the network's community structure. These are currently the fastest agglomerative methods. The space complexity of the three algorithms is O(m^{2}). The main achievement with respect to computational complexity is that the cost of FACPIN depends on the number of nodes rather than the number of edges, especially when κ_{ave} is regarded as a constant in scale-free networks.
Results and discussion
We have carried out several computational experiments on nine PIN data sets from eight different species using our proposed FACPIN algorithm. In this section, the data sets and the evaluation methods used in our experiments are described first. Next, we discuss the effect of varying the merging parameter μ on the FACPIN clustering results. Then, we arbitrarily set the merging parameter to μ = 0.5 and compare the clustering results of FACPIN with those of the HCPIN and CNM methods on the same PIN data sets; the three algorithms are compared on (i) the functional enrichment of their predicted modules, (ii) their sensitivity, specificity, and F-score, (iii) the network modularity structure of the partitioning results, and finally, (iv) their execution times.
All computational experiments were performed on an Intel machine (Core TM i51600, 2.400 GHz, CPU with 8 GB RAM). The program codes were all written in R.
PIN data sets
Original unweighted PIN data for eight distinct species were downloaded from the REACTOME database http://www.reactome.org/download/all_interaction.html, and for one species from the DIP database [31]. The eight PIN data sets from REACTOME, listed with their numbers of proteins and interactions in parentheses, are: B. taurus (Cattle, 5737, 113888), T. guttata (Finch bird, 3929, 74314), X. tropicalis (Frog, 5473, 122706), H. sapiens (Human, 8997, 34935), O. sativa (Rice, 3778, 320570), S. scrofa (Wild boar, 5303, 119920), D. rerio (Zebra fish, 8188, 274358), and S. cerevisiae1 (Baker's yeast, 5697, 50675). The PIN data set from DIP is S. cerevisiae2 (Baker's yeast, 4726, 15166). In all these PIN data sets, the number of edges is much larger than the number of vertices.
We also downloaded a list of protein complexes from the MIPS database, which we consider as gold standard data. We extracted the protein complexes corresponding to the S. cerevisiae2 PIN data from the MIPS Comprehensive Yeast Genome Database (CYGD) ftp://ftpmips.gsf.de/fungi/yeast/catalogues/complexcat/complexcat_data_18052006. We proceeded similarly to [29] and considered only the known complexes (i.e., not those obtained by computational means) containing at least three proteins. Since FACPIN generates non-overlapping clusters, we considered only known complexes which are at the bottom of the MIPS hierarchy of complexes and subcomplexes. The unconfirmed complexes, that is, those in category 550, were excluded.
Evaluation methods
In order to study and compare the performance of FACPIN, we downloaded the CNM code http://cs.unm.edu/~aaron/research/fastmodularity.htm [23] and implemented the HCPIN algorithm [25]. The two methods were applied to the same PIN data as FACPIN. For HCPIN, we set the two parameters λ and s as in [25]; CNM has no parameters. Of the three algorithms, only FACPIN and HCPIN can cluster weighted PINs. There are other network clustering approaches with which we could compare FACPIN; however, they are either not designed for clustering weighted PINs or not hierarchical agglomerative algorithms. It should be noted that [25] compared the HCPIN algorithm with six other PIN clustering approaches on the same S. cerevisiae2 PIN data (none of them are hierarchical and only three of them can cluster PIN data). Due to time and space limitations, we are not able to perform computational experiments comparing the FACPIN approach with those other six PIN clustering techniques; we leave this task as future work. In [25], HCPIN consistently outperforms those methods in terms of (i) the functional enrichment of the identified modules, (ii) its ability to detect both small-sized and large-sized modules, (iii) the accuracy of the identified modules, (iv) its ability to predict protein complexes, and (v) its clustering efficiency. Both HCPIN and CNM are currently the fastest agglomerative methods for clustering PIN data.
Functional enrichment validations
For the functional enrichment validations, we used DAVID's functional annotation tools http://david.abcc.ncifcrf.gov/ [32] to identify enriched biological themes, particularly GO terms, and to estimate whether the predicted modules are biologically significant. DAVID uses a set of fuzzy classification algorithms to rank modules based on co-occurrences of their constituent proteins in annotation terms, and computes a P-value indicating the significance of the module with respect to GO terms. The P-value is computed using an internal EASE score [33]. We used a P-value cutoff of 0.05 to find biologically significant clusters. A smaller P-value indicates that the predicted module is more biologically significant than one with a larger P-value.
where C is a module predicted by the algorithm, and F_{i} is a known GO functional category mapped to C and considered as the true prediction. Thus, the proteins in C ∩ F_{i} are the true positive predictions. Recall measures how effectively proteins with the same F_{i} in the PIN are extracted, Precision measures how consistently proteins in the same C are annotated, and F-Measure is their harmonic mean [34]. The accuracy of the method is taken as the average F-Measure of the significant predicted modules. As in [25], we also only consider predicted modules of size 3 or more.
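The three measures described here can be illustrated with a short Python sketch. It assumes the usual set-based definitions, Recall = |C ∩ F_{i}| / |F_{i}| and Precision = |C ∩ F_{i}| / |C|, which match the verbal description above but are not copied from the paper's equations; the function name and the protein identifiers are hypothetical.

```python
from typing import Set, Tuple


def module_scores(module: Set[str], category: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) of a predicted module C against
    a functional category F_i, using the assumed set-based definitions."""
    true_positives = len(module & category)      # proteins in C ∩ F_i
    precision = true_positives / len(module)     # how consistently C is annotated
    recall = true_positives / len(category)      # how much of F_i is recovered
    if true_positives == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Hypothetical module of four proteins, three of which carry the annotation F_i.
predicted = {"p1", "p2", "p3", "p4"}
annotated = {"p2", "p3", "p4", "p5", "p6", "p7"}
precision, recall, f_measure = module_scores(predicted, annotated)
print(precision, recall, round(f_measure, 3))
```

Averaging the F-Measure over the significant modules of size 3 or more would then give the accuracy figure described in the text.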
Protein complex validations
where, TP (true positive) is the number of the identified complexes C matched by the known complexes K, FN (false negative) is the number of known complexes that are not matched by the identified complexes, and FP (false positive) is the total number of the identified complexes C minus TP.
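The defining equations are not reproduced in this text. As a hedged illustration, the sketch below assumes the definitions commonly used for protein complex prediction, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), with the F-score as their harmonic mean; the rule deciding when an identified complex matches a known one is not specified here, so the function takes the counts as input. The function name and the example counts are hypothetical.

```python
from typing import NamedTuple


class ComplexScores(NamedTuple):
    sensitivity: float
    specificity: float
    f_score: float


def complex_prediction_scores(tp: int, fn: int, fp: int) -> ComplexScores:
    """Scores from match counts, under the assumed definitions:
    sensitivity = TP / (TP + FN), specificity = TP / (TP + FP)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + specificity
    f_score = 2 * sensitivity * specificity / total if total else 0.0
    return ComplexScores(sensitivity, specificity, f_score)


# Hypothetical counts: 6 matched complexes, 4 known complexes missed, 2 unmatched predictions.
scores = complex_prediction_scores(tp=6, fn=4, fp=2)
print(scores)
```

Note that under this assumed definition "specificity" is what is elsewhere called precision or positive predictive value.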
Modularity and efficiency analyses
All experiments in this paper were performed on an Intel machine (Core TM i7-2600, 3.400 GHz CPU with 8 GB RAM). We compared FACPIN against HCPIN and CNM in terms of the modularity of their clustering results and in terms of their computational efficiency. FACPIN was run with its merging parameter set to μ = 0.5, and the modularity of its resulting partition P_{k} was evaluated and reported. The execution times (in seconds) were also recorded; the PINs are sorted in increasing order of their number of interactions m.
Identification of functional modules in the S. cerevisiae2 PIN
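Table 1: The effect of varying μ on clustering the S. cerevisiae2 PIN
| μ | k | Max $\lvert C_{i}\rvert$ | Ave $\lvert C_{i}\rvert$ |
| --- | --- | --- | --- |
| 0.25 | 203 | 265 | 21.498 |
| 0.5 | 232 | 374 | 18.810 |
| 1.0 | 413 | 155 | 10.567 |
| 1.5 | 489 | 120 | 8.924 |
| 1.75 | 491 | 111 | 8.888 |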
Effect of the merging parameter μ
is satisfied for u. Hence, the size of a cluster C(v) increases as the merging parameter μ decreases since more neighbors are being merged together with v; and therefore, the number of clusters k also decreases as the sizes of clusters increase.
Functional enrichment of FACPIN modules
Table 2: Functional enrichment of the predicted modules comprising three or more S. cerevisiae2 proteins; μ = 0.5
| Algorithms | < E-15: N. Modules | < E-15: Avg Size | [E-15, E-10]: N. Modules | [E-15, E-10]: Avg Size | [E-10, E-5]: N. Modules | [E-10, E-5]: Avg Size | [E-5, 1]: N. Modules | [E-5, 1]: Avg Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FACPIN | 12 (8.1%) | 49.08 | 18 (12.2%) | 31.83 | 35 (23.6%) | 25.57 | 73 (49.32%) | 20.95 |
| HCPIN | 16 (6.39%) | 103.1 | 29 (23.77%) | 63.23 | 38 (23.6%) | 28.12 | 28 (22.95%) | 25.11 |
| CNM | 6 (12.77%) | 439.833 | 1 (2.1%) | 71 | 5 (10.63%) | 36.35 | 28 (59.58%) | 28.89 |
Predicting large-sized versus small-sized modules
Table 3: Performance comparison of the algorithms for predicting modules of size ≥ 20 on the S. cerevisiae2 PIN; μ = 0.5
| Algorithms | Number of modules | Percentage of significant modules | Mean(-log P-value) | Mean(F-Measure) |
| --- | --- | --- | --- | --- |
| FACPIN | 58 | 98.28% | 8.21 | 0.42 |
| HCPIN (λ = 1) | 45 | 97.11% | 12.25 | 0.31 |
| CNM | 17 | 96.43% | 13.53 | 0.05 |
Table 4: Performance comparison of the algorithms for predicting modules of size ≤ 6 on the S. cerevisiae2 PIN; μ = 0.5
| Algorithms | Number of modules | Percentage of significant modules | Mean(-log P-value) | Mean(F-Measure) |
| --- | --- | --- | --- | --- |
| FACPIN | 44 | 86.4% | 6.16 | 0.41 |
| HCPIN (λ = 1) | 33 | 59% | 5.39 | 0.27 |
| CNM | 26 | 35.7% | 1.81 | 0.08 |
In Table 3, we see that more than 96% of the modules predicted by each method are validated to be significant, though FACPIN yields a percentage slightly larger than that of HCPIN or CNM. Although CNM gives the highest average -log P-value, it also yields the lowest average F-Measure; this is due to the fact that its significant modules are much larger than those of HCPIN and FACPIN, and hence, less accurate. FACPIN, on the other hand, predicted more accurate significant modules than HCPIN and CNM but with the lowest average -log P-value; again, this is due to the smaller sizes of its generated modules.
In Table 4, however, FACPIN performed consistently better than CNM and HCPIN in all performance measures; FACPIN seems to be better at producing small-sized modules.
Accuracy of FACPIN
Table 5: Performance comparison of the accuracy of FACPIN, HCPIN, and CNM on the S. cerevisiae2 PIN; μ = 0.5
| Algorithms | Number of modules | Average size | Maximum size | Accuracy (BP) | Accuracy (MF) | Accuracy (CC) |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy for modules of size ≥ 3 | | | | | | |
| FACPIN | 148 | 28.24 | 374 | 0.42 | 0.30 | 0.65 |
| HCPIN (λ = 1) | 122 | 43.19 | 483 | 0.39 | 0.28 | 0.52 |
| CNM | 47 | 88.59 | 790 | 0.22 | 0.23 | 0.25 |
| Accuracy for modules of size ≥ 2 | | | | | | |
| FACPIN | 232 | 18.8 | 374 | 0.39 | 0.32 | 0.57 |
| HCPIN (λ = 1) | 172 | 23.74 | 483 | 0.37 | 0.30 | 0.44 |
| CNM | 147 | 29.68 | 790 | 0.09 | 0.15 | 0.21 |
Identification of functional modules in the S. cerevisiae1 PIN
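Table 6: Functional enrichment of the predicted modules of the unweighted S. cerevisiae1 PIN; μ = 0.5
| Algorithms | Q_{w}(P_{k}) | k_{3} | $\bar{s}$ | Ontology | k_{s} | < E-15 | [E-15, E-10] | [E-10, E-5] | [E-5, 1] | $\bar{p}$ | A |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FACPIN | 0.529 | 65 (90) | 8.96 | Overall | 57 (63.33%) | 2 | 7 | 23 | 25 | 4.21 | 0.137 |
| | | | | BP | 24 | 2 | 1 | 11 | 10 | 4.09 | 0.165 |
| | | | | CC | 18 | 0 | 4 | 6 | 8 | 4.21 | 0.123 |
| | | | | MF | 15 | 0 | 2 | 6 | 7 | 4.01 | 0.096 |
| HCPIN | 0.139 | 64 (87) | 9.17 | Overall | 36 (42%) | 7 | 5 | 12 | 12 | 3.17 | 0.024 |
| | | | | BP | 10 | 2 | 3 | 3 | 2 | 3.02 | 0.028 |
| | | | | CC | 14 | 3 | 2 | 4 | 5 | 2.97 | 0.032 |
| | | | | MF | 12 | 2 | 0 | 5 | 5 | 3.15 | 0.029 |
| CNM | 0.248 | 61 (84) | 9.62 | Overall | 19 (22%) | 7 | 5 | 4 | 5 | 4.15 | 0.034 |
| | | | | BP | 5 | 0 | 3 | 2 | 0 | 3.29 | 0.031 |
| | | | | CC | 7 | 3 | 2 | 0 | 2 | 3.99 | 0.045 |
| | | | | MF | 9 | 4 | 0 | 2 | 3 | 4.68 | 0.033 |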
Identification of protein complexes in the S. cerevisiae2 PIN
Table 7: Comparison of the Sensitivity, Specificity and F-Score of FACPIN, HCPIN and CNM
| P | K | $\bar{S}$ | Algorithms | k | $\lvert \bar{k}\rvert$ | k_{m} | Sensitivity | Specificity | F-Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1318 | 144 | 9.153 | FACPIN | 158 | 8.35 | 9 | 0.61 | 0.54 | 0.592 |
| | | | HCPIN (λ = 0.5) | 129 | 11.23 | 5 | 0.38 | 0.41 | 0.391 |
| | | | HCPIN (λ = 1.0) | 117 | 12.83 | 3 | 0.29 | 0.32 | 0.31 |
| | | | CNM | 291 | 6.29 | 3 | 0.15 | 0.16 | 0.204 |
Modularity and efficiency of FACPIN
Table 8: Network modularity quality Q_{w} results of FACPIN, HCPIN, and CNM; μ = 0.5
| Algorithms | Yeast | Finch Bird | Cattle | Wild Boar | Frog | Human | Zebra Fish | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FACPIN | 0.529 | 0.500 | 0.441 | 0.502 | 0.471 | 0.491 | 0.527 | 0.575 |
| HCPIN | 0.139 | 0.498 | 0.418 | 0.419 | 0.319 | 0.218 | 0.198 | 0.529 |
| CNM | 0.248 | 0.766 | 0.693 | 0.626 | 0.754 | 0.719 | 0.736 | 0.348 |
Table 9: Network modularity quality Ω_{w} results of FACPIN, HCPIN, and CNM; μ = 0.5
| Algorithm | Yeast | Finch Bird | Cattle | Wild Boar | Frog | Human | Zebra Fish | Rice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FACPIN | 1.370 | 1.867 | 1.704 | 1.846 | 1.839 | 1.469 | 1.825 | 1.283 |
| HCPIN | 1.291 | 0.131 | 0.619 | 0.948 | 1.796 | 0.823 | 0.182 | 1.279 |
| CNM | 0.983 | 1.315 | 1.618 | 1.848 | 1.721 | 1.441 | 1.422 | 0.819 |
Network modularity density D_{ w } results of FACPIN, HCPIN, and CNM; μ = 0.5
Algorithm  Yeast  Finch Bird  Cattle  Wild Boar  Frog  Human  Zebra Fish  Rice 

FACPIN  77.534  164.501  149.350  164.501  149.003  152.540  136.916  101.841 
HCPIN  71.829  129.292  130.418  111.419  127.124  104.822  121.927  79.182 
CNM  64.480  121.574  123.970  115.306  109.231  95.201  97.343  56.810 
Comparing cluster statistics of FACPIN and CNM on Q_{ w }; μ = 0.5
Statistics  Algorithms  Yeast  Finch Bird  Cattle  Wild Boar  Frog  Human  Zebra Fish  Rice 

k  FACPIN  90  247  285  267  268  269  379  154 
CNM  68  132  144  136  129  125  147  95  
Ave C_{ i }  FACPIN  8.96  16.98  21.74  24.05  22.35  10.53  22.32  14.90 
CNM  10.47  32.33  43.94  47.93  47.13  48.70  58.57  24.46  
Max C_{ i }  FACPIN  167  285  774  730  1043  1373  1104  541 
CNM  154  1199  1989  1471  2029  2029  2353  547 
The first term in the merging condition guarantees that only edges (u, v) with a low local betweenness value $\lambda(u,v)=100\cdot \frac{1}{|{N}_{u}\cap {N}_{v}|+1}$ are considered for possible inclusion in the induced subgraph C(v) of C_{ v }. The second term guarantees that only those neighbors u that can contribute more edges to C(v) than v contributes to C(u) are selected. Hence, FACPIN merges neighbors u that contribute edges of low local betweenness while optimizing the density of C(v). As noted earlier, the relative vertex clustering value ${R}_{w}(u\to v)$ combines the principles behind the vertex clustering coefficient of [14], the edge clustering coefficient ${C}_{u,v}^{(k)}$ of [2], and the edge clustering value ECV(u, v) of [25]. Since the objective of both Ω_{ w } and D_{ w } is to find modular partitions containing dense clusters, we can see in Tables 9 and 10 that FACPIN outperformed both HCPIN and CNM on both modularity functions: in seven out of eight PIN data sets for Ω_{ w }, and in all PIN data sets for D_{ w }. For D_{ w } in particular, FACPIN yields much higher modularity values.
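The local betweenness value used in the first term can be illustrated with a minimal sketch. It assumes an unweighted, undirected PIN stored as adjacency sets, and takes N_u to be the open neighborhood of u (u itself excluded); the function name and the toy network are hypothetical and are not taken from the FACPIN implementation.

```python
from typing import Dict, Set

# Assumption: an undirected, unweighted PIN stored as adjacency sets.
Graph = Dict[str, Set[str]]


def local_betweenness(graph: Graph, u: str, v: str) -> float:
    """lambda(u, v) = 100 / (|N_u ∩ N_v| + 1), as stated in the text.

    N_u is assumed to be the open neighborhood of u (u itself excluded).
    """
    common_neighbors = graph[u] & graph[v]
    return 100.0 / (len(common_neighbors) + 1)


# Hypothetical toy network: a, b, c, d form a dense group; e hangs off d.
toy: Graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}

lam_ab = local_betweenness(toy, "a", "b")  # a and b share c and d
lam_de = local_betweenness(toy, "d", "e")  # d and e share no neighbor

print(f"lambda(a, b) = {lam_ab:.2f}")
print(f"lambda(d, e) = {lam_de:.2f}")
```

In this toy network the edge inside the dense group receives a lower value than the pendant edge (d, e), which is the behavior the first term of the merging condition relies on.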
Execution times of FACPIN, HCPIN, and CNM; using Q_{ w } and μ = 0.5
PINs  Number of Proteins  Number of Interactions  FACPIN  HCPIN  CNM 

Yeast  5697  40675  313.315  446.231  501.239 
Finch Bird  3929  74314  235.804  610.238  441.365 
Cattle  5737  113888  300.766  781.231  596.833 
Wild Boar  5303  119920  649.483  691.472  972.213 
Frog  5473  122706  429.873  1021.432  912.692 
Human  12994  135935  533.000  702.325  822.511 
Zebra Fish  8188  274358  874.303  1183.350  1238.281 
Rice  3778  320570  349.712  539.329  1281.273 
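As a reading aid, the timings in the table above can be converted into relative speed-ups. The sketch below only performs arithmetic on the values copied from the table; the variable and function names are hypothetical, and the time unit is whatever the table reports.

```python
# Execution times copied from the table above, in the table's own unit.
# Tuple order: (FACPIN, HCPIN, CNM).
times = {
    "Yeast": (313.315, 446.231, 501.239),
    "Finch Bird": (235.804, 610.238, 441.365),
    "Cattle": (300.766, 781.231, 596.833),
    "Wild Boar": (649.483, 691.472, 972.213),
    "Frog": (429.873, 1021.432, 912.692),
    "Human": (533.000, 702.325, 822.511),
    "Zebra Fish": (874.303, 1183.350, 1238.281),
    "Rice": (349.712, 539.329, 1281.273),
}


def speedups(facpin: float, hcpin: float, cnm: float) -> tuple:
    """Return how many times longer HCPIN and CNM ran than FACPIN."""
    return hcpin / facpin, cnm / facpin


ratios = {pin: speedups(*t) for pin, t in times.items()}

for pin, (vs_hcpin, vs_cnm) in ratios.items():
    print(f"{pin:<11} HCPIN/FACPIN = {vs_hcpin:.2f}  CNM/FACPIN = {vs_cnm:.2f}")
```

Every ratio is above 1, meaning FACPIN has the shortest reported time on each of the eight PINs.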
Conclusions
In this paper, we have proposed a new agglomerative clustering approach, the FACPIN algorithm, for detecting the communities of a given PIN, and compared it with two fast hierarchical techniques from the literature. Our approach is based on a new measure, the relative vertex-to-vertex clustering value, which helps decide whether a given vertex u should be included in the cluster of another vertex v depending on how many of its neighbors form a triangle with v. Our approach is very fast because it clusters vertices, whereas the compared methods cluster edges. Our method is thus appropriate for PIN data, which in general contain more interactions than proteins. Further study is needed, in particular validation on random networks, to analyze the robustness of FACPIN. Comparisons with other methods that are not necessarily hierarchical will also be important, and non-agglomerative clustering methods based on the relative vertex-to-vertex clustering value will be investigated. In the current version of FACPIN, a neighbor u is merged with a cluster ${C}_{{v}_{i}}$ whenever its ${R}_{w}(u\to {v}_{i})$ value satisfies the merging condition, irrespective of whether there is another vertex ${v}_{j}$ such that ${R}_{w}(u\to {v}_{j})$ also satisfies the condition; we therefore plan a new variant of FACPIN in which each node u selects the best neighbor v to merge with. Finally, we plan to extend FACPIN to directed (unweighted and weighted) protein interaction networks.
One of the reviewers of the initial manuscript pointed out that this formula is incorrect, since it depends only on the weights of the edges connected to node u and not on those of the edges connected to v. An important consequence of this error is that our analysis of ${R}_{w}(u\to v)$ (based on the formula above) applies to the unweighted case only and does not necessarily apply to the weighted case. We verified this, both computationally and theoretically, before embarking on experiments on weighted PINs. Due to time constraints, it is not possible at this time to complete the experiments on weighted PINs using the correct formula in Equation (10). Our plan for the immediate future is therefore to perform these experiments.
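The planned "best neighbor" variant can be sketched as follows. This is only an illustration under stated assumptions: `triangle_share` is a hypothetical stand-in score (the fraction of u's other neighbors that close a triangle with v), not the relative vertex clustering value R_w, and `threshold` is a hypothetical cut-off rather than the actual merging condition.

```python
from typing import Callable, Dict, Optional, Set

# Assumption: an undirected, unweighted PIN stored as adjacency sets.
Graph = Dict[str, Set[str]]


def triangle_share(graph: Graph, u: str, v: str) -> float:
    """Stand-in score (NOT the paper's R_w): fraction of u's other
    neighbors that are also neighbors of v, i.e. close a triangle with v."""
    others = graph[u] - {v}
    if not others:
        return 0.0
    return len(others & graph[v]) / len(others)


def best_neighbor(
    graph: Graph,
    u: str,
    score: Callable[[Graph, str, str], float] = triangle_share,
    threshold: float = 0.5,  # hypothetical cut-off, not the merging condition
) -> Optional[str]:
    """Return the neighbor v of u with the highest score, or None if even
    the best score falls below the threshold. Ties go to the neighbor
    that comes first in sorted order."""
    candidates = [(score(graph, u, v), v) for v in sorted(graph[u])]
    if not candidates:
        return None
    best_score, best_v = max(candidates, key=lambda pair: pair[0])
    return best_v if best_score >= threshold else None


# Hypothetical toy network: a, b and c all pass the cut-off for u,
# but a scores highest because both b and c close a triangle with it.
toy: Graph = {
    "u": {"a", "b", "c"},
    "a": {"u", "b", "c"},
    "b": {"u", "a"},
    "c": {"u", "a"},
}

choice = best_neighbor(toy, "u")
print(f"u would merge with: {choice}")
```

Under these assumptions, a rule that merges u with the first neighbor passing the cut-off could pick a, b or c depending on processing order, whereas the best-neighbor rule always picks a.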
Declarations
Acknowledgements
This research has been partially supported by the Canadian NSERC Grant #RGPIN2281172011 of AN. ZMI would like to acknowledge the support of the NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London, the Maudsley NHS Foundation Trust and King's College London, and a joint infrastructure grant from Guy's and St Thomas' Charity and the Maudsley Charity, London, United Kingdom.
Declarations
The publication of this article is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).
AN declares that he was not involved in the peer review process or any acceptance decisions regarding this article on which he is an author.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 4, 2015: Selected articles from the 9th IAPR conference on Pattern Recognition in Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S4.
References
1. Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature. 1999, 402: C47-C52. 10.1038/35011540.
2. Radicchi F, Castellano C, Cecconi F: Defining and Identifying Communities in Networks. Proceedings of the National Academy of Sciences USA. 2004, 101 (9): 2658-2663. 10.1073/pnas.0400054101.
3. Pei P, Zhang A: A 'Seed-Refine' Algorithm for Detecting Protein Complexes from Protein Interaction Data. IEEE Transactions of Nanobioscience. 2007, 6 (1): 43-50.
4. Yook S, Oltvai Z, Barabási AL: Functional and Topological Characterization of Protein Interaction Networks. Proteomics. 2004, 4: 928-942. 10.1002/pmic.200300636.
5. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E: Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment. 2008, 2008 (10): P1000
6. Newman MEJ: Finding Community Structure in Networks using the Eigenvectors of Matrices. Physical Review E. 2006, 74 (036104).
7. Wang RS, Zhang S, Wang Y, Zhang XS, Chen L: Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Elsevier Neurocomputing. 2008, 72.
8. Pizzuti C, Rombo SE: A Co-clustering Approach for Mining Large Protein-Protein Interaction Networks. IEEE Transactions of Computational Biology and Bioinformatics. 2012, 9 (3): 717-730.
9. Li XL, Tan S, Foo C, Ng S: Interaction Graph Mining for Protein Complexes Using Local Clique Merging. Genome Informatics. 2006, 16: 260-269.
10. Spirin V, Mirny LA: Protein Complexes and Functional Modules in Molecular Networks. Proceedings of the National Academy of Sciences USA. 2007, 100 (21): 12123-12128.
11. Altaf-Ul-Amin M: Development and Implementation of an Algorithm for Detection of Protein Complexes in Large Interaction Networks. BMC Bioinformatics. 2006, 7 (207).
12. Bader GD, Hogue CW: An Automated Method for Finding Molecular Complexes in Large Protein Interaction Networks. BMC Bioinformatics. 2003, 7 (2).
13. Hartuv E, Shamir R: A Clustering Algorithm Based on Graph Connectivity. Information Processing Letters. 2000, 76 (4-6): 175-181. 10.1016/S00200190(00)001423.
14. Li M, Wang JX, Chen JE: A Fast Agglomerative Algorithm for Mining Functional Modules in Protein Interaction Networks. Proceedings of First International Conference in BioMedical Engineering and Informatics (BMEI). 2008, 3-7.
15. Luo F: Modular Organization of Protein Interaction Networks. Bioinformatics. 2007, 23 (2): 207-214. 10.1093/bioinformatics/btl562.
16. Brohee S, Helden JV: Evaluation of Clustering Algorithms for Protein Interaction Networks. BMC Bioinformatics. 2006, 7 (488).
17. Mering CV, et al: Comparative Assessment of Large-Scale Data Sets of Protein-Protein Interaction Networks. Nature. 2002, 417 (7887): 399-403.
18. Brucker F, Barthélemy JP: Éléments de classification: aspects combinatoires et algorithmiques. Hermès, Paris. 2007, 438.
19. Becker E: Multifunctional Proteins Revealed By Overlapping Clustering in Protein Interaction Network. Bioinformatics. 2012, 28 (1): 84-90. 10.1093/bioinformatics/btr621.
20. Bagrow JP, Lehmann S: Link Communities Reveal Multiscale Complexity in Networks. Nature. 2010, 466: 761-764. 10.1038/nature09182.
21. Girvan M, Newman ME: Community Structure in Social and Biological Networks. Proceedings of the National Academy of Sciences USA. 2002, 99: 7821-7826. 10.1073/pnas.122653799.
22. Fortunato S: Community detection in graphs. Elsevier Physics Reports. 2010, 486: 75-174. 10.1016/j.physrep.2009.11.002.
23. Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks. Phys Rev E. 2004, 70: 066111.
24. Friedel C, Zimmer R: Inferring Topology from Clustering Coefficients in Protein-Protein Interaction Networks. BMC Bioinformatics. 2006, 7 (519).
25. Wang J, Li M, Chen J, Pan Y: A Fast Hierarchical Clustering Algorithm for Functional Modules Discovery in Protein Interaction Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011, 8 (3).
26. Rahman MS, Ngom A: A Fast Agglomerative Community Detection Method for Protein Complex Discovery in Protein Interaction Networks. Proceedings of the 8th IAPR International Conference on Pattern Recognition in Bioinformatics. 2013, LNBI 7986: 1-12.
27. Zaki N, Berengueres J, Efimov D: A Method for Detecting Protein Complexes. Proceedings of the Genetic and Evolutionary Computation Conference. 2012, 209-216.
28. Newman MEJ: Fast algorithm for detecting community structure in networks. Physical Review. 2003, 69 (066133).
29. Laarhoven TV, Marchiori E: Robust Community Detection Methods with Resolution Parameter for Complex Detection in Protein-Protein Interaction Networks. Proceedings of the 7th IAPR International Conference on Pattern Recognition in Bioinformatics. 2012, LNBI 7632: 1-13.
30. Chen S, Ma B, Zhang K: On the Similarity Metric and the Distance Metric. Theoretical Computer Science. 2009, 410 (2009): 2365-2376.
31. Xenarios I, et al: The Database of Interacting Proteins: A Research Tool for Studying Cellular Networks of Protein Interactions. Nucleic Acids Research. 2002, 30: 303-305. 10.1093/nar/30.1.303.
32. Dennis G, Sherman B, Hosack D, Jun Yang, Gao W, Lane HC, Lempicki R: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology. 2003, 4 (5): P3. 10.1186/gb200345p3.
33. Huang D, Sherman B, Tan Q, Collins J, Alvord WG, Roayaei J, Stephens R, Baseler M, Lane HC, Lempicki R: The DAVID Gene Functional Classification Tool: A Novel Biological Module-Centric Algorithm to Functionally Analyze Large Gene Lists. Genome Biology. 2007, 8: R183. 10.1186/gb200789r183.
34. Cho YR, Hwang W, Ramanmathan M, et al: Semantic Integration to Identify Overlapping Functional Modules in Protein Interaction Networks. BMC Bioinformatics. 2007, 8 (265).
35. Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using Indirect Protein-Protein Interaction for Protein Complex Prediction. Journal of Bioinformatics and Computational Biology. 2008, 6 (3): 435-466. 10.1142/S0219720008003497.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.