 Research
 Open Access
The relative vertex clustering value - a new criterion for the fast discovery of functional modules in protein interaction networks
BMC Bioinformatics volume 16, Article number: S3 (2015)
Abstract
Background
Cellular processes are known to be modular and are realized by groups of proteins implicated in common biological functions. Such groups of proteins are called functional modules, and many community detection methods have been devised for their discovery from protein interaction network (PIN) data. In current agglomerative clustering approaches, vertices with very few neighbors are often classified as separate clusters, which does not make sense biologically. Also, a major limitation of agglomerative techniques is that their computational efficiency does not scale well to large PINs. Finally, PIN data obtained from large-scale experiments generally contain many false positives, and this makes it hard for agglomerative clustering methods to find the correct clusters, since they are known to be sensitive to noisy data.
Results
We propose a local similarity premetric, the relative vertex clustering value, as a new criterion for deciding when a node can be added to a given node's cluster; it addresses the above three issues. Based on this criterion, we introduce a novel and very fast agglomerative clustering technique, FACPIN, for discovering functional modules and protein complexes from PIN data.
Conclusions
Our proposed FACPIN algorithm is applied to nine PIN data sets from eight different species including the yeast PIN, and the identified functional modules are validated using Gene Ontology (GO) annotations from DAVID Bioinformatics Resources. Identified protein complexes are also validated using experimentally verified complexes. Computational results show that FACPIN can discover functional modules or protein complexes from PINs more accurately and more efficiently than HCPIN and CNM, the current state-of-the-art approaches for clustering PINs in an agglomerative manner.
Background
Functional modules are groups of genes or proteins involved in common elementary biological functions. Proteins are also known to interact with each other by forming complexes, and each such complex performs an independent and discrete biological function through the interactions of its member proteins [1]. Single proteins may also participate in more than one complex or functional module. Functional modules or protein complexes correspond to modules, which are dense subgraphs within protein interaction networks (PINs), and hence can be discovered by appropriate network clustering approaches. Generally speaking, modules in PINs refer to highly connected subgraphs which have more internal edges than external edges. Many definitions of modules have been proposed in the literature [2], and consequently different community detection algorithms have been proposed based on these different definitions.
Module detection in PINs is a computationally hard task and conventional clustering algorithms are not well suited for it [3, 4]. Efficient, accurate, robust, and scalable methods are therefore required for mining large PINs [5–8]. There are generally three classes of module detection approaches: 1) those based on finding cliques, which are fully connected subnetworks [9, 10]; 2) those based on detecting dense subnetworks [11, 12], not necessarily cliques; and 3) those based on uncovering the hierarchical organization of modules within PINs [13, 14]. Clique techniques do not scale well to large PINs, and the identified modules are too strict in the biological sense since proteins participating in a complex may not all interact with each other. Current density-based algorithms commonly misclassify proteins with low degree into small clusters which could be merged into core protein clusters [15]. Moreover, many biologically meaningful modules are ignored due to their low topological connectivity [15].
Hierarchical clustering methods based on a global metric over nodes or edges, such as betweenness centrality, are very time-consuming, and thus do not scale well to large PINs. The few hierarchical approaches based on a local metric also share the problem of classifying very low-degree vertices into separate clusters, which does not make sense biologically. Another major issue in current hierarchical clustering approaches is their inability to perform well on noisy data. This is generally the case when clustering PIN data generated from large-scale high-throughput experiments. As discussed in [16, 17], such PIN data usually contain many false positive interactions, and hence, care must be taken to deal with the sensitivity of hierarchical methods to such data.
The majority of the clustering methods proposed in the literature have focused on identifying nonoverlapping communities. However, it is well recognized that complex networks contain multi-class nodes corresponding to vertices belonging to many communities at once. Overlapping clustering algorithms have been neither intensively studied nor successful at finding good subnetworks, although they first appeared three decades ago; see an extensive review of overlapping methods in [18]. Multifunctional proteins are proteins which perform several functions and interact specifically with distinct sets of protein partners, simultaneously or not, depending on the function being performed. Thus, such proteins are involved in many functional modules or protein complexes, and hence, it is reasonable to assume that PINs have overlapping communities, each containing some multifunctional proteins. A few successful hierarchical clustering approaches, such as the Overlapping Cluster Generator (OCG) algorithm of [19] and the Link Communities method of [20] (to cite just two), have recently been proposed with the aim of identifying overlapping protein communities as well as multifunctional proteins from PINs.
In this paper, we propose a fast agglomerative clustering technique, FACPIN, which addresses the issues and limitations discussed above for hierarchical algorithms. FACPIN is based on a local similarity premetric, the relative vertex-to-vertex clustering value, for clustering PINs in an agglomerative hierarchical manner.
Related works
Many hierarchical clustering approaches (both agglomerative and divisive) have been introduced in the literature since the original publication of [21] on clustering networks; see the excellent survey on graph clustering algorithms in [22]. We therefore present only the few methods that are directly related to our proposed agglomerative approach.
An effective divisive technique for clustering large networks was first proposed by [21]. The Girvan-Newman (GN) algorithm [21] first computes the edge-betweenness centrality value of each edge; this is a global metric over the edges and is defined as the number of shortest paths containing a given edge. GN then iteratively removes the edges with the largest betweenness values in order to detect the communities, since such edges correspond to bridges connecting two modules whereas low-betweenness edges are internal to modules. To increase the computational speed of GN, [23] made a simple but nontrivial modification in the computation of the value of the modularity function used in GN. [15] defined the concept of the degree of a subnetwork S as the number of edges with one endpoint inside S and the other endpoint outside S. The degree of subnetworks was used along with the edge-betweenness values to devise an agglomerative method for module discovery. [14] developed a fast agglomerative approach for community detection based on a global centrality measure, the vertex clustering coefficient, which is defined as the ratio of the number of edges between the neighbors of a given vertex v to the total number of possible edges in that neighborhood; it measures the degree of completeness of the subnetwork defined by v and its neighbors [24]. [2] designed an agglomerative technique based on the clustering coefficient of an edge; the edge clustering coefficient extends the vertex clustering coefficient and is a global measure defined as the number of triangles to which a given edge e = (u, v) belongs, divided by the number of triangles that might potentially include (u, v). That is:
$${C}_{u,v}^{\left(3\right)}=\frac{{Z}_{u,v}^{\left(3\right)}}{\min\left\{\left({k}_{u}-1\right),\left({k}_{v}-1\right)\right\}}\qquad (1)$$
where k_{ a } is the degree of a vertex a, ${Z}_{u,v}^{\left(3\right)}$ is the number of triangles containing edge (u, v), and min{(k_{ u } − 1), (k_{ v } − 1)} is the maximal possible number of triangles containing (u, v). This coefficient has been further generalized to higher-order cycles, ${C}_{u,v}^{\left(k\right)}$, such as squares for k = 4, ${C}_{u,v}^{\left(4\right)}$. Edges contained in few or no triangles have low clustering coefficients, and hence, correspond to bridges connecting two clusters. The edge clustering coefficient assumes the existence of cycles of length k in a network, which is problematic since a network can have many cycles of different lengths and the length distribution is unknown (e.g., there may be very few or very many short-length cycles). For this reason, [25] defined a local node similarity metric over the edges, the edge clustering value, which is not based on cycles but on the common neighbors of the two endpoints of edge (u, v). The edge clustering value is defined as:
$$ECV\left(u,v\right)=\frac{{\left|{N}_{u}\cap {N}_{v}\right|}^{2}}{\left|{N}_{u}\right|\cdot \left|{N}_{v}\right|}\qquad (2)$$
where N_{ a } is the set of neighbors of a vertex a and its cardinality is denoted |N_{ a }|. Here, the endpoint vertices of an edge (u, v) with a larger clustering value are more likely to be in the same cluster. Using the edge clustering value, [25] devised an agglomerative technique, the HCPIN algorithm, for discovering modules of a PIN, which is faster and more accurate than current hierarchical algorithms for network clustering. The edge clustering objectives in Equations (1) and (2) do not take into account the reliability of interactions in the presence of false positives in PIN data, and hence can yield incorrect clustering results. In this regard, [25] modified the objective of Equation (2) to account for noise in the PIN data, as
$$EC{V}_{w}\left(u,v\right)=\frac{{\sum }_{a\in {I}_{u,v}}w\left(u,a\right)\cdot {\sum }_{a\in {I}_{u,v}}w\left(v,a\right)}{{\sum }_{s\in {N}_{u}}w\left(u,s\right)\cdot {\sum }_{t\in {N}_{v}}w\left(v,t\right)}\qquad (3)$$
where I_{ u,v } = N_{ u } ∩ N_{ v }, and 0 ≤ w(a, b) ≤ 1 is the weight assigned to the edge (a, b), which represents the reliability of the interaction between vertices a and b, or the probability of their interaction being a true positive. Clearly, Equation (2) is the special case of Equation (3) for a weighted undirected graph with w(a, b) = 1 for all edges (a, b). In Equations (1)-(3), two vertices connected by an edge with a larger objective value are more likely to lie in the same module.
Recently, while finalizing this manuscript, we were made aware of a hierarchical approach introduced in [20] which focuses on grouping links (i.e., edges) rather than vertices, in contrast to the existing literature, which has almost entirely focused on grouping nodes. It is well known that communities in complex networks often overlap, such that nodes belong to several groups at once, which, in turn, are known to be organized into hierarchical structures. It has therefore proved difficult for node-focused community detection methods to accurately identify relevant functional modules because of the hierarchical structures of the overlapping groups. Let ${N}_{a}^{+}$ denote the set containing node a and its neighbors, and let e_{ a,b } denote the edge (a, b). By defining network communities as groups of links rather than groups of vertices, [20] proposed the following similarity function for link pairs that share a node in an undirected unweighted network
$$S\left({e}_{a,c},{e}_{b,c}\right)=\frac{\left|{N}_{a}^{+}\cap {N}_{b}^{+}\right|}{\left|{N}_{a}^{+}\cup {N}_{b}^{+}\right|}\qquad (4)$$
and applied a simple single-linkage hierarchical clustering algorithm to build a link dendrogram from Equation (4), which yields link communities with the best edge partition density. By identifying such nonoverlapping link communities, [20] detected hierarchically organized node community structures with pervasive overlap.
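As a concrete illustration of the link similarity of [20], the following minimal Python sketch assumes the Jaccard form over closed neighborhoods, that is, the size of ${N}_{a}^{+}\cap {N}_{b}^{+}$ divided by the size of ${N}_{a}^{+}\cup {N}_{b}^{+}$ for two links (a, c) and (b, c) sharing node c. The function name `link_similarity`, the adjacency-dictionary representation, and the toy graphs are illustrative assumptions and are not taken from [20].

```python
def link_similarity(adj, a, b):
    """Similarity of two links (a, c) and (b, c) that share a node c.

    adj maps each node to the set of its neighbours (undirected, unweighted).
    Assumed form: Jaccard index of the closed neighbourhoods of a and b.
    """
    n_a = adj[a] | {a}  # N_a^+ : a together with its neighbours
    n_b = adj[b] | {b}  # N_b^+ : b together with its neighbours
    return len(n_a & n_b) / len(n_a | n_b)


# Hypothetical toy graphs: a path a - c - b, and a triangle on a, b, c.
path = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}

print(link_similarity(path, "a", "b"))      # the two links share only node c
print(link_similarity(triangle, "a", "b"))  # closed neighbourhoods coincide
```

In the path, the two links have little in common beyond the shared node, so their similarity is low; in the triangle, the closed neighborhoods coincide and the similarity is maximal.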
In the next section, we propose a new criterion for weighted undirected graphs, which is a modification of the relative vertex-to-vertex clustering value that we first introduced in [26] for unweighted graphs. In [26], however, the unweighted criterion was applied only to the problem of detecting protein complexes in PINs [27], whereas here we apply our weighted criterion to identifying functional modules in PINs. It is a local similarity premetric combining the ideas behind the vertex clustering coefficient, the edge clustering coefficient, and the edge clustering value; it allows one to decide when a given vertex can be included into the cluster of another vertex, and it helps address all of the issues discussed above.
Methods
Network modularity structure
The concept of community is qualitative rather than quantitative; that is, nodes must be more densely connected within the community than with the rest of the network. The quantitative definition of the modularity of a network is still an open debate. Here, we use the modularity quality function Q introduced by the authors of [28], which is a widely used quantitative measure for evaluating the modular structure of a network. Specifically, consider an unweighted undirected graph G = (V, E) with |V| = n and symmetric adjacency matrix A = [A_{ u,v }]_{n × n}, where A_{ u,v } = 1 if nodes u and v are connected and A_{ u,v } = 0 otherwise. Then, the modularity function Q is defined as
$$Q\left({P}_{k}\right)=\sum _{i=1}^{k}\left({e}_{ii}-{a}_{i}^{2}\right)\qquad (5)$$
where P_{ k } = {C_{1},...,C_{ k }} is a partition of V into k groups; ${e}_{ii}=\frac{L\left({C}_{i},{C}_{i}\right)}{L\left(V,V\right)}$ is the fraction of edges with both end vertices in the same community C_{ i }; ${a}_{i}=\frac{L\left({C}_{i},V\right)}{L\left(V,V\right)}$ is the fraction of edges with at least one end vertex in community C_{ i }; and $L\left({S}_{1},{S}_{2}\right)={\sum}_{u\in {S}_{1},v\in {S}_{2}}{A}_{u,v}$. Larger values of Q correspond to more distinct community structures in PINs. Function Q has serious resolution limits, which have been discussed at length in [22]: the size of a detected community depends on the size of the whole network, and thus the choice of partition is highly sensitive to the total number of edges in the network. A second partition scoring function Ω, which seeks to improve on Q, has been introduced in [29] and is defined as
Function Ω allows for more diverse cluster sizes than function Q, neither too small nor too large, and smaller values of Ω correspond to better modularity structures. A third scoring function, the modularity density function D of [14], overcomes the resolution limits of Q by directly including information on the number of nodes in a community. It is defined as
$$D\left({P}_{k}\right)=\sum _{i=1}^{k}\frac{L\left({C}_{i},{C}_{i}\right)-L\left({C}_{i},{\bar{C}}_{i}\right)}{\left|{C}_{i}\right|}\qquad (7)$$
where ${\bar{C}}_{i}=V\backslash {C}_{i}$ is the set of vertices not in C_{ i }. Thus, the aim of function D is to optimize both the modularity and the density of a community. For weighted undirected graphs G = (V, E) with weights assigned to edges in E, we propose new modularity functions, Q_{ w }, Ω_{ w } and D_{ w }. These three functions are direct generalizations of Q, Ω and D above, with L(S_{1}, S_{2}) redefined for weighted undirected graphs as
$$L\left({S}_{1},{S}_{2}\right)=\sum _{u\in {S}_{1},v\in {S}_{2}}{A}_{u,v}\,w\left(u,v\right)\qquad (8)$$
The problem of community detection is hence equivalent to searching for a k and a partition P_{ k } that optimize the value of a modularity function.
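To make the scoring functions concrete, the following minimal Python sketch evaluates Q and D on a toy unweighted graph. It assumes the standard form of Q, the sum over communities of e_{ii} minus a_{i} squared, and the modularity density form of D, the sum over communities of (L(C_{i}, C_{i}) − L(C_{i}, V \ C_{i})) divided by |C_{i}|. The function names, the adjacency-dictionary representation, and the example graph (two triangles joined by one bridge edge) are illustrative assumptions.

```python
def link_count(adj, s1, s2):
    """L(S1, S2): sum of A[u][v] over ordered pairs with u in S1 and v in S2."""
    return sum(1 for u in s1 for v in adj[u] if v in s2)


def modularity_q(adj, partition):
    """Q = sum_i (e_ii - a_i^2), with e_ii and a_i defined through L (assumed form)."""
    nodes = set(adj)
    total = link_count(adj, nodes, nodes)  # equals twice the number of edges
    q = 0.0
    for c in partition:
        e_ii = link_count(adj, c, c) / total
        a_i = link_count(adj, c, nodes) / total
        q += e_ii - a_i ** 2
    return q


def modularity_density(adj, partition):
    """D = sum_i (L(C_i, C_i) - L(C_i, V minus C_i)) / |C_i| (assumed form)."""
    nodes = set(adj)
    return sum(
        (link_count(adj, c, c) - link_count(adj, c, nodes - c)) / len(c)
        for c in partition
    )


# Hypothetical toy graph: triangles {1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
two_modules = [{1, 2, 3}, {4, 5, 6}]
one_module = [set(adj)]

print(modularity_q(adj, two_modules), modularity_q(adj, one_module))
print(modularity_density(adj, two_modules), modularity_density(adj, one_module))
```

On this graph both functions score the two-triangle partition higher than the trivial one-cluster partition, which matches the intuition that the bridge edge separates two modules.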
The relative vertex-to-vertex clustering value
Consider an edge (u, v) in a scale-free network such that u has lower degree than v. We can reasonably assume that u is more likely to have joined the cluster containing v than v is to have joined the cluster containing u. This assumption stems from the principle of preferential attachment in power-law networks, which states that a new node u is more likely to attach to a high-degree node v than to a low-degree node. The edge clustering coefficient ${C}_{u,v}^{\left(k\right)}$ of [2] and the edge clustering value ECV (u, v) of [25] are similarity metrics which treat both endpoints of an edge (u, v) equally, irrespective of their degrees. Another issue is that both ECV (u, v) and ${C}_{u,v}^{\left(3\right)}$ require vertices u and v to be connected by an edge. This requirement is quite restrictive, and we aim to extend the approach (in the future) to the case in which the pair (u, v) is not an edge while still being able to decide whether both vertices are in the same cluster. Finally, hierarchical approaches based on ECV (u, v) and ${C}_{u,v}^{\left(3\right)}$, or other objective functions, have the common problem of classifying low-degree vertices (peripheral to dense subnetwork modules) into separate clusters rather than merging them with their neighboring modules. These criteria tell how likely it is that both u and v lie in the same cluster, but not which of u or v has likely joined the other's cluster. Let N_{ a } be the set of neighbors of a vertex a in an unweighted undirected graph G = (V, E). We define ${N}_{a}^{+}={N}_{a}\cup \left\{a\right\}$ as the neighbor set of a augmented with a itself. Given two vertices u and v, we define the clustering value of u relative to v as:
$$R\left(u\to v\right)=\frac{\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|}{\left|{N}_{u}^{+}\right|}\qquad (9)$$
To account for the reliability of edges in the presence of false positive interactions in the PIN data, we modify the objective of Equation (9) to apply to weighted graphs, as follows
$${R}_{w}\left(u\to v\right)=\frac{{\sum }_{x\in {I}_{u,v}^{+}}w\left(u,x\right)}{{\sum }_{y\in {N}_{u}^{+}}w\left(u,y\right)}\qquad (10)$$
where ${I}_{u,v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$, and 0 ≤ w(x, y) ≤ 1 is the weight assigned to the edge (x, y), which represents the reliability of the interaction between vertices x and y, or the probability of their interaction being a true positive. Clearly, Equation (9) is the special case of Equation (10) for a weighted undirected graph with w(x, y) = 1 for all edges (x, y). For a node a ∈ V, we let ${k}_{a}={\sum}_{b\in V}{A}_{a,b}$ be its degree. For a weighted graph, we define the weighted degree of a vertex a as ${\kappa}_{a}={\sum}_{b\in V}w\left(a,b\right)$, similarly to [25].
${R}_{w}(u\to v)$, with 0 ≤ ${R}_{w}(u\to v)$ ≤ 1, is a similarity premetric since it does not satisfy the axiom of symmetry or the triangle inequality but satisfies the axioms of self-similarity and maximality [30]; see http://www.scholarpedia.org/article/Similarity_measures and http://en.wikipedia.org/wiki/Metric_(mathematics)#Premetrics. A vertex u with a larger clustering value relative to another vertex v is more likely to lie in the cluster containing v. In the following, we let C(v) = (C_{ v } ⊂ V, E_{ v } ⊂ E) denote the subnetwork cluster containing v and we assume C(v) is a community. Below, we describe the properties of ${R}_{w}(u\to v)$.
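The two definitions can be sketched in a few lines of Python. The sketch below takes $R(u\to v)$ as the fraction of ${N}_{u}^{+}$ that lies in ${N}_{u}^{+}\cap {N}_{v}^{+}$, and ${R}_{w}(u\to v)$ as the corresponding ratio of summed weights w(u, ·). The function names, the dictionary-based graph representation, and the convention w(u, u) = 1 for the self term are assumptions made for illustration; the self-weight convention in particular is chosen so that unit edge weights reproduce the unweighted value.

```python
def closed_nbrs(adj, a):
    """N_a^+ : the neighbours of a together with a itself."""
    return adj[a] | {a}


def rvcv(adj, u, v):
    """Unweighted relative vertex clustering value R(u -> v)."""
    n_u = closed_nbrs(adj, u)
    return len(n_u & closed_nbrs(adj, v)) / len(n_u)


def rvcv_weighted(adj, w, u, v):
    """Weighted value R_w(u -> v); w maps frozenset({a, b}) to a weight in [0, 1]."""
    def wt(a, b):
        # Assumption: the self term w(a, a) is taken to be 1.
        return 1.0 if a == b else w[frozenset((a, b))]

    n_u = closed_nbrs(adj, u)
    shared = n_u & closed_nbrs(adj, v)
    return sum(wt(u, x) for x in shared) / sum(wt(u, y) for y in n_u)


# Hypothetical toy graph: triangle {1, 2, 3} with a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
w = {frozenset((1, 2)): 1.0, frozenset((1, 3)): 1.0,
     frozenset((2, 3)): 1.0, frozenset((3, 4)): 0.5}

print(rvcv(adj, 4, 3), rvcv(adj, 3, 4))                  # asymmetric by design
print(rvcv_weighted(adj, w, 4, 3), rvcv_weighted(adj, w, 3, 4))
```

The pendant vertex 4 has its whole closed neighborhood inside that of vertex 3, so its value relative to 3 is maximal, whereas the value of 3 relative to 4 is much smaller; this asymmetry is what makes the measure a premetric rather than a metric.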
Analysis of ${R}_{w}(u\to v)$
In the following, we limit our discussions to the case of unweighted networks, though they also apply to weighted networks. To understand how the similarity premetric ${R}_{w}(u\to v)$ can be used to determine the communities in a network, we now discuss the relationships between values $R(u\to v)$ and $R(v\to u)$, and all the four possible cases of connectivity of an edge (u, v). The main question we address below is: when should we merge the vertex u with the current cluster C(v) of v?

1. Case k_{ u } = 1. $R(u\to v)$ = 1, thus it is maximal. $R(v\to u)$ is also maximal when k_{ v } = 1, and hence, the connected component C = ({u, v}, {(u, v)}) is a community. If on the other hand k_{ v } > 1, then we have $R(u\to v)$ > $R(v\to u)$ and therefore u should be merged with the current cluster C(v) of v (not the other way around, which corresponds to merging v with C(u)).

2. Case 1 < k_{ u } < k_{ v }. $R(u\to v)$ > $R(v\to u)$ and $R(u\to v)$ may or may not be maximal. Vertex u should be merged with C(v) only when $R(u\to v)$ > 0.5; that is, when more than 50% of the members of ${N}_{u}^{+}$ are in the intersection ${N}_{u}^{+}\cap {N}_{v}^{+}$. This is a reasonable decision since the number of triangles involving the edge (u, v) is |N_{ u } ∩ N_{ v }|, and the edge (u, v) is definitely not a "bridge" connecting two clusters when most of u's neighbors form a triangle with v.

3. Case 1 < k_{ v } < k_{ u }. This is the reverse of Case 2 above; thus, u should not merge with C(v) since $R(u\to v)$ < $R(v\to u)$.

4. Case k_{ u } = k_{ v }. $R(u\to v)$ = $R(v\to u)$, and we should consider two possible subcases.

(a) Subcase ${N}_{u}^{+}={N}_{v}^{+}$. We have $R(u\to v)$ = $R(v\to u)$ = 1 since ${N}_{u}^{+}={N}_{v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$. Hence, u should be merged with C(v) given that the induced subnetwork of G for ${N}_{u}^{+}\cap {N}_{v}^{+}$ forms a community.

(b) Subcase ${N}_{u}^{+}\ne {N}_{v}^{+}$. We have $R(u\to v)$ = $R(v\to u)$ < 1. In this case, u should be merged with C(v) only when $R(u\to v)$ > 0.5.

Given an edge (u, v), assume the degrees of vertices u and v in G are such that k_{ u } = k_{ v } = d is (very) large and that u and v have no common neighbors. Then, we have $R\left(u\to v\right)=R\left(v\to u\right)=\frac{2}{1+d}\le 0.5$ assuming d ≥ 3. In this case, the induced subnetwork of G for {u} ∪ C_{ v } (or for ${N}_{v}^{+}$) is not a community, and likewise for {v} ∪ C_{ u } (or for ${N}_{u}^{+}$). In general, considering the induced subgraph of G on ${N}_{u}^{+}\cup {N}_{v}^{+}$, we define the local betweenness value of edge (u, v) as the percentage of paths from vertices in N_{ u } \ N_{ v } to vertices in N_{ v } \ N_{ u } going through edge (u, v). Given the number of common neighbors of u and v, |N_{ u } ∩ N_{ v }|, the local betweenness of edge (u, v) is thus $\lambda \left(u,v\right)=100\cdot \frac{1}{\left|{N}_{u}\cap {N}_{v}\right|+1}$. Given two connected high-degree vertices u and v, the local edge betweenness value λ(u, v) increases as |N_{ u } ∩ N_{ v }| decreases, which corresponds to the situation where $R(u\to v)$ and $R(v\to u)$ are both small (both ≤ 0.5). Edges with high local betweenness values are likely to connect two communities, and therefore, vertices u and v should not lie in the same community. This is not necessarily true, since the inference is based on a local measure rather than on the global edge betweenness metric defined in [21]. However, starting with correct initializations and using an appropriate node clustering mechanism, a greedy algorithm can be devised based on the faster local evaluations instead of the costly global evaluations.
$R(u\to v)$ is maximal when $\left|{N}_{u}^{+}\right|=\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|$, as in Case (1) or Case (4a) above. In either case, u contributes only new internal edges to the induced subnetwork of G for ${C}_{v}^{+}=\left\{u\right\}\cup {C}_{v}$ (or for ${C}_{v}^{+}={N}_{v}^{+}$) and contributes no new external edges, and hence, the induced subnetwork of G for ${C}_{v}^{+}$ remains a community if C_{ v } (or ${N}_{v}^{+}$) is a community. Finally, u is more likely to be in the community C(v), and v less likely to be in the community C(u), when both $R(u\to v)$ > 0.5 and $R(u\to v)$ ≥ $R(v\to u)$. Since $R(u\to v)$ ≥ $R(v\to u)$, we have k_{ u } ≤ k_{ v }, and since $R(u\to v)$ > 0.5, we have $\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|>\frac{\left|{N}_{u}^{+}\right|}{2}$; that is, more than 50% of the members of ${N}_{u}^{+}$ are in the intersection, while the fraction of ${N}_{v}^{+}$ in the intersection is no larger. Since k_{ u } ≤ k_{ v }, the induced subnetwork of G for ${C}_{v}^{+}=\left\{u\right\}\cup {C}_{v}$ is a community when ${N}_{u}\cap {N}_{v}\subseteq {C}_{v}$, with its modularity increasing with |N_{ u } ∩ N_{ v }|.
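The cases discussed above can be checked numerically on small graphs. The following Python sketch is illustrative only: it takes $R(u\to v)$ as the fraction of ${N}_{u}^{+}$ lying in ${N}_{u}^{+}\cap {N}_{v}^{+}$ and the local betweenness as 100 divided by one plus the number of common neighbors; the function names and the two toy graphs (two hubs of degree d = 4 with no common neighbor, and a 4-clique) are assumptions chosen to exercise the bridge situation, Case 1 and Case 4a.

```python
def rvcv(adj, u, v):
    """R(u -> v): share of u's closed neighbourhood that is also in v's."""
    n_u = adj[u] | {u}
    return len(n_u & (adj[v] | {v})) / len(n_u)


def local_betweenness(adj, u, v):
    """lambda(u, v) = 100 / (number of common neighbours + 1), in percent."""
    return 100.0 / (len(adj[u] & adj[v]) + 1)


# Two hubs 0 and 1 of degree d = 4 joined by an edge, with no common neighbours.
hubs = {0: {1, 2, 3, 4}, 1: {0, 5, 6, 7},
        2: {0}, 3: {0}, 4: {0}, 5: {1}, 6: {1}, 7: {1}}
# A 4-clique: every pair of vertices has identical closed neighbourhoods.
clique = {a: {b for b in "abcd" if b != a} for a in "abcd"}

# Bridge-like edge: both values equal 2 / (1 + d) and lambda is maximal.
print(rvcv(hubs, 0, 1), rvcv(hubs, 1, 0), local_betweenness(hubs, 0, 1))
# Case 1: a degree-1 vertex is fully contained in its neighbour's neighbourhood.
print(rvcv(hubs, 2, 0), rvcv(hubs, 0, 2))
# Case 4a: equal degrees and equal closed neighbourhoods.
print(rvcv(clique, "a", "b"), rvcv(clique, "b", "a"), local_betweenness(clique, "a", "b"))
```

The hub-to-hub edge has low values in both directions together with the highest possible local betweenness, which is the signature of an edge connecting two communities, while the pendant vertices and the clique members obtain the maximal value.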
Quantitative definition of module
Given the four cases above and a user-defined merging parameter μ with 0 ≤ μ < 2, the decision to merge a node u with the cluster C(v) of a node v can be summarized in a single test covering all four cases; that is, include u in C(v) whenever
$${R}_{w}\left(u\to v\right)>0.5\mu \quad \text{and}\quad {R}_{w}\left(u\to v\right)\ge {R}_{w}\left(v\to u\right)\qquad (11)$$
The communities (i.e., modules) C determined by algorithms which use this merging test are such that the merging condition is satisfied for every internal edge of C and not satisfied for every external edge of C. Given a weighted undirected graph G = (V, E) and the merging parameter μ, a subgraph C ⊆ G is said to be a μ-module if the merging condition is true for every internal edge of C and false for every external edge of C. Different network modularity structures are obtained by varying the value of the merging parameter μ.
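The merging test can be written as a single predicate. The Python sketch below assumes the test used in Algorithm 1, namely that u joins C(v) when ${R}_{w}(u\to v)$ > 0.5μ and ${R}_{w}(u\to v)$ ≥ ${R}_{w}(v\to u)$; the function name `should_merge` and the sample values are hypothetical and serve only to show how μ shifts the threshold.

```python
def should_merge(r_uv, r_vu, mu=1.0):
    """Return True when u should be included in the cluster of v.

    r_uv and r_vu are the relative vertex clustering values R_w(u -> v)
    and R_w(v -> u); mu is the user-defined merging parameter, 0 <= mu < 2.
    """
    if not 0 <= mu < 2:
        raise ValueError("mu must satisfy 0 <= mu < 2")
    return r_uv > 0.5 * mu and r_uv >= r_vu


# Hypothetical values illustrating the effect of the merging parameter.
print(should_merge(1.0, 0.5))          # degree-1 vertex next to a hub
print(should_merge(0.4, 0.4))          # bridge-like edge with mu = 1
print(should_merge(0.4, 0.4, mu=0.5))  # same edge with a looser threshold
print(should_merge(0.75, 1.0))         # u has the larger neighbourhood
```

Lowering μ relaxes the threshold 0.5μ and therefore tends to produce fewer, larger modules, while raising μ towards 2 makes merging progressively harder.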
The relative vertex clustering value $R(u\to v)$ implements the ideas behind the edge clustering coefficient ${C}_{u,v}^{\left(k\right)}$ of [2], since for a given vertex v and a neighbor u the number of triangles containing the edge (u, v) is exactly |N_{ u } ∩ N_{ v }|, and u will be included into C(v) whenever most of the neighbors of u (excluding v) are in N_{ u } ∩ N_{ v }. This is also true even when (u, v) is not an edge; in such a case, |N_{ u } ∩ N_{ v }| relates to the number of squares containing vertices u and v. On the other hand, as with the edge clustering value ECV (u, v) of [25], we overcome the limitations of [2] by not assuming the existence of closed loops in a network, such as triangles or higher-order loops. The relative vertex clustering values $R(u\to v)$ and ${R}_{w}(u\to v)$ also improve on ECV (u, v) and ECV_{ w }(u, v), since neighbors u of v which have most of their neighbors forming a triangle with v are considered for possible inclusion in C(v). Searching for vertices u which form a cluster with v is also more efficient than searching for edges (u, v) that make a cluster, since the number of edges is larger than the number of vertices in dense subgraphs.
The FACPIN algorithm
In a clustering task, we can use ${R}_{w}(u\to v)$ and ${R}_{w}(v\to u)$ to decide whether u should be included into C(v) = (C_{ v }, E_{ v }) ⊂ G = (V, E), the current cluster of v. Based on the definitions of the relative vertex-to-vertex clustering value and quantitative network modularity, we propose a fast agglomerative node-focused clustering algorithm named FACPIN, shown in Algorithm 1. The input to FACPIN is an undirected weighted graph; when an unweighted graph is used, all edges (a, b) are treated equally with weight w(a, b) = 1. The output of FACPIN is a collection of nonoverlapping subnetwork communities.
Given a weighted undirected PIN G = (V, E), we initially consider each vertex as a singleton cluster, and sort the vertices v ∈ V into a queue Q_{ V } in nonincreasing order of their weighted degrees κ_{ v }. Then, in an iterative manner, we select the next highest κ_{ v } vertex v from Q_{ V } and apply the merging condition
$${R}_{w}\left(u\to v\right)>0.5\mu \quad \text{and}\quad {R}_{w}\left(u\to v\right)\ge {R}_{w}\left(v\to u\right)$$
on each neighbor u ∈ N_{ v } of v in order to decide on its inclusion into the current cluster C_{ v } of v.
Algorithm 1 The FACPIN algorithm
Require: G = (V, E): undirected PIN graph;
  A_{|V| × |V|}: adjacency matrix;
  W_{|V| × |V|}: weight matrix;
  μ: merging parameter;
Ensure: P_{ k } = {C_{1},..., C_{ k }}: nonoverlapping subnetwork communities
{Initialization Phase}
for all v ∈ V do
  C_{ v } ← {v}; {C_{ v } = cluster containing node v}
  E_{ v } ← ∅;
  ${\kappa}_{v}\leftarrow {\sum}_{b\in V}w\left(v,b\right)$; {weighted degree of v}
  C(v) ← (C_{ v }, E_{ v }); {Each vertex is a singleton cluster}
  {C(v) = subnetwork containing node v}
end for
{Community Detection Phase}
Sort V into Q_{ V } in nonincreasing order of κ_{ v } values;
repeat
  v ← head of Q_{ V }; {Select highest κ_{ v } vertex in Q_{ V }}
  N_{ v } ← {u ∈ V | (u, v) ∈ E}; {Neighbor set of v}
  for all u ∈ N_{ v } not yet assigned to a cluster do
    if ${R}_{w}(u\to v)$ > 0.5μ and ${R}_{w}(u\to v)$ ≥ ${R}_{w}(v\to u)$ then
      C_{ z } ← C_{ v } ∪ {u}, ∀ z ∈ C_{ v } ∪ {u};
    end if
  end for
  Q_{ V } ← Q_{ V } \ {v}; {Remove v from Q_{ V }}
until Q_{ V } = ∅
{Compute the Partition P_{ k }}
U ← V;
i ← 1;
while U ≠ ∅ do
  v ← randomly select a vertex from U;
  C_{ i } ← C(v) = the induced subgraph of G for C_{ v };
  U ← U \ {u | C_{ u } = C_{ v }};
  i ← i + 1;
end while
k ← i − 1;
{Evaluate the Modularity of Partition P_{ k }}
Modularity ← D_{ w }(P_{ k }), Q_{ w }(P_{ k }) and Ω_{ w }(P_{ k });
return P_{ k } ← {C_{1},...,C_{ k }} and Modularity;
A neighbor u ∈ N_{ v } is added into the current cluster C_{ v } of v when the majority of the neighbors of u are in ${N}_{u}^{+}\cap {N}_{v}^{+}$, that is, when ${R}_{w}(u\to v)$ > 0.5 and ${R}_{w}(u\to v)$ ≥ ${R}_{w}(v\to u)$ (the merging condition with μ = 1); in which case κ_{ u } ≤ κ_{ v } and $\left|{N}_{u}^{+}\cap {N}_{v}^{+}\right|>\frac{1}{2}\left|{N}_{u}^{+}\right|$, which for weighted graphs is equivalent to ${\sum}_{a\in {I}_{u,v}^{+}}w\left(u,a\right)>\frac{1}{2}{\sum}_{b\in {N}_{u}^{+}}w\left(u,b\right)$ where ${I}_{u,v}^{+}={N}_{u}^{+}\cap {N}_{v}^{+}$. By gradually examining each high-degree vertex v from the queue Q_{ V } and then gradually adding its unassigned neighbors u to C_{ v }, FACPIN agglomerates all singleton clusters into |V| vertex sets C_{ v }. The final k communities C_{ i }, for 1 ≤ i ≤ k, are the induced subgraphs of G for all distinct C_{ v }; in the algorithm, we make a distinction between a cluster C_{ v } = {v_{1},...,v_{ n }}, a subnetwork C(v) = (C_{ v }, E_{ v }), and the ith subnetwork C_{ i }. In FACPIN, the merging parameter μ with 0 ≤ μ < 2 is user-defined. In particular for weighted PINs, different modularity results can be obtained by changing the value of μ.
Most hierarchical methods, with the exception of the HCPIN algorithm of [25], are based on a costly global metric for partitioning a PIN network. FACPIN is based on the local similarity premetric ${R}_{w}(u\to v)$, which encodes useful information about the local topology around vertices u and v, and which helps make a local decision maximizing the modularity of the final partitioning.
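The procedure described above can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the authors' R implementation: the graph is an adjacency dictionary, weights are stored per undirected edge in a dictionary keyed by `frozenset`, the self term w(u, u) is taken to be 1, and a vertex is treated as "assigned" once it has either been selected from the queue or merged into a cluster. The last point is one possible reading of the pseudocode, and the function name `facpin` is illustrative.

```python
def facpin(adj, w=None, mu=1.0):
    """Greedy agglomerative clustering sketch; returns a list of vertex sets."""
    def wt(a, b):
        # Assumption: self-weight is 1; unweighted graphs use weight 1 everywhere.
        if a == b or w is None:
            return 1.0
        return w[frozenset((a, b))]

    def r_w(u, v):
        # Weighted relative vertex clustering value R_w(u -> v).
        n_u = adj[u] | {u}
        shared = n_u & (adj[v] | {v})
        return sum(wt(u, x) for x in shared) / sum(wt(u, y) for y in n_u)

    # Initialization: weighted degrees and singleton clusters.
    kappa = {v: sum(wt(v, b) for b in adj[v]) for v in adj}
    cluster = {v: v for v in adj}  # cluster[v] = label of the cluster containing v
    assigned = set()

    # Community detection: visit vertices by nonincreasing weighted degree.
    for v in sorted(adj, key=lambda a: -kappa[a]):
        assigned.add(v)
        for u in adj[v]:
            if u in assigned:
                continue
            r_uv, r_vu = r_w(u, v), r_w(v, u)
            if r_uv > 0.5 * mu and r_uv >= r_vu:
                cluster[u] = cluster[v]  # u joins the cluster containing v
                assigned.add(u)

    # Partition: group vertices by cluster label.
    groups = {}
    for v, label in cluster.items():
        groups.setdefault(label, set()).add(v)
    return list(groups.values())


# Hypothetical toy PIN: triangles {1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(facpin(adj, mu=1.0))
print(facpin(adj, mu=0.5))
```

On this toy graph the stricter setting μ = 1 keeps the two triangles apart because the bridge edge does not exceed the 0.5 threshold, whereas μ = 0.5 lowers the threshold to 0.25 and lets the bridge endpoint join its neighbor's cluster, which illustrates how the merging parameter controls the granularity of the partition.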
Computational complexity of FACPIN
Given a weighted PIN G = (V, E), let n = |V|, m = |E|, κ_{max} = max_{v∈V}κ_{ v } be the maximum weighted degree in G, and ${\kappa}_{ave}=\frac{1}{n}{\sum}_{v\in V}{\kappa}_{v}$ be the average weighted degree in G. The complexity of computing ${R}_{w}(u\to v)$ is O(κ_{max}), and hence, the complexity of FACPIN is $O\left(n{\kappa}_{ave}^{2}\right)\ll O\left(n{\kappa}_{\mathsf{\text{max}}}^{2}\right)\lll O\left({n}^{3}\right)$. PINs are power-law networks; the majority of proteins interact with only a few proteins, and thus κ_{ ave } is generally small and can be considered a constant [25]. The CNM [23] and the HCPIN [25] methods run in O(mh log n) and $O\left(m{\kappa}_{ave}^{2}\right)$ steps, respectively, where h is the depth of the dendrogram describing the network's community structure. These are currently the fastest agglomerative methods. The space complexity of the three algorithms is O(m^{2}). The main achievement with respect to computational complexity is that the cost of FACPIN depends on the number of nodes rather than the number of edges, especially when κ_{ ave } is regarded as a constant in scale-free networks.
Results and discussion
We have carried out several computational experiments on nine PIN data sets from eight different species using our proposed FACPIN algorithm. In this section, the data sets and the evaluation methods used in our experiments are described first. Next, we discuss the effect of varying the merging parameter μ on the FACPIN clustering results. Then, we arbitrarily set the merging parameter to μ = 0.5 and compare the clustering results of FACPIN with those of the HCPIN and CNM methods on the same PIN data sets; the three algorithms are compared on (i) the functional enrichment of their predicted modules, (ii) their sensitivity, specificity, and F-score, (iii) the network modularity structure of the partitioning results, and finally, (iv) their execution times.
All computational experiments were performed on an Intel machine (Core TM i51600, 2.400 GHz, CPU with 8 GB RAM). The program codes were all written in R.
PIN data sets
Original unweighted PIN data sets for eight distinct species were downloaded from the REACTOME database (http://www.reactome.org/download/all_interaction.html), and one further data set from the DIP database [31]. The eight PIN data sets from REACTOME, with their numbers of proteins and interactions in parentheses, are: B. taurus (5737, 113888), T. guttata (Finch bird, 3929, 74314), X. tropicalis (Frog, 5473, 122706), H. sapiens (Human, 8997, 34935), O. sativa (Rice, 3778, 320570), S. scrofa (Wild boar, 5303, 119920), D. rerio (Zebrafish, 8188, 274358), and S. cerevisiae1 (Baker's yeast, 5697, 50675). The PIN data set from DIP is S. cerevisiae2 (Baker's yeast, 4726, 15166). In all these PIN data sets, the number of edges is much larger than the number of vertices.
We also downloaded a list of protein complexes from the MIPS database, which we consider as gold standard data. We extracted the protein complexes corresponding to the S. cerevisiae2 PIN data from the MIPS Comprehensive Yeast Genome Database (CYGD) ftp://ftpmips.gsf.de/fungi/yeast/catalogues/complexcat/complexcat_data_18052006. We proceeded similarly to [29] and considered only the known complexes (i.e., not those obtained by computational means) containing at least three proteins. Since FACPIN generates non-overlapping clusters, we considered only known complexes which are at the bottom of the MIPS hierarchy of complexes and subcomplexes. The unconfirmed complexes, that is, those in category 550, were excluded.
Evaluation methods
In order to study and compare the performance of FACPIN, we downloaded the CNM code http://cs.unm.edu/~aaron/research/fastmodularity.htm [23] and implemented the HCPIN algorithm [25]. The two methods were applied to the same PIN data as FACPIN. For HCPIN, we set the two parameters λ and s as in [25]; CNM has no parameters. Of the three algorithms, only FACPIN and HCPIN can cluster weighted PINs. There are other network clustering approaches with which we could compare FACPIN; however, they are either not designed for clustering weighted PINs or not hierarchical agglomerative algorithms. It should be noted that [25] compared the HCPIN algorithm with six other PIN clustering approaches on the same S. cerevisiae2 PIN data (none of them are hierarchical and only three of them can cluster PIN data). Due to time and space limitations, we were not able to perform computational experiments comparing FACPIN with those other six PIN clustering techniques; we leave this task as future work. In [25], HCPIN consistently outperforms those methods in terms of (i) the functional enrichment of the identified modules, (ii) the ability to detect both small-sized and large-sized modules, (iii) the accuracy of the identified modules, (iv) the ability to predict protein complexes, and (v) clustering efficiency. HCPIN and CNM are currently the fastest agglomerative methods for clustering PIN data.
Functional enrichment validations
For the functional enrichment validations, we used DAVID's functional annotation tools http://david.abcc.ncifcrf.gov/ [32] to identify enriched biological themes, particularly GO terms, and to estimate whether the predicted modules are biologically significant. DAVID uses a set of fuzzy classification algorithms to rank modules based on co-occurrences of their constituent proteins in annotation terms and computes a P-value indicating the significance of the module with respect to GO terms. The P-value is computed using an internal EASE score [33]. We used a P-value cutoff of 0.05 to find biologically significant clusters. A smaller P-value indicates that the predicted module is more biologically significant than one with a larger P-value.
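To estimate the performance of a network clustering algorithm in terms of its ability to correctly identify the functional modules within a PIN, we also compute its Recall, Precision, and F-Measure as

$$\text{Recall}=\frac{|C\cap F_{i}|}{|F_{i}|},\qquad \text{Precision}=\frac{|C\cap F_{i}|}{|C|},\qquad \text{F-Measure}=\frac{2\cdot \text{Recall}\cdot \text{Precision}}{\text{Recall}+\text{Precision}}$$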
where C is a module predicted by the algorithm, and F_{ i } is a known GO functional category mapped to C and considered as a true prediction. Thus, the proteins in C ∩ F_{ i } are the true positive predictions. Recall measures how effectively proteins with the same F_{ i } in the PIN are extracted, Precision measures how consistently proteins in the same C are annotated, and F-Measure is their harmonic mean [34]. The accuracy of the method is taken as the average F-Measure of the significant predicted modules. As in [25], we only consider predicted modules of size 3 or more.
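The three measures can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual set-based definitions Recall = |C ∩ F_i| / |F_i| and Precision = |C ∩ F_i| / |C|, with F-Measure as their harmonic mean; the function name `module_scores` and the protein identifiers are hypothetical.

```python
def module_scores(module, category):
    """Recall, precision and F-measure of a predicted module C against a category F_i."""
    c, f = set(module), set(category)
    tp = len(c & f)  # proteins of C that are annotated with F_i
    if tp == 0:
        return 0.0, 0.0, 0.0
    recall = tp / len(f)
    precision = tp / len(c)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure

# Hypothetical module of four proteins and a GO category of six proteins.
recall, precision, f_measure = module_scores(
    module={"P1", "P2", "P3", "P4"},
    category={"P2", "P3", "P4", "P5", "P6", "P7"},
)
print(recall, precision, f_measure)
```

In this example three of the four module proteins carry the annotation, so a small module can reach high precision while its recall stays moderate.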
Protein complex validations
Protein complex validations proceed by determining the degree of overlap between the complexes identified by a network clustering algorithm and the known protein complexes; that is, we want to determine how effectively an identified module matches a known complex. We used the overlapping score function given in [12, 25, 29, 35]. The overlapping score, O(C, K), between a discovered complex C and a known complex K is defined as:

$$O(C,K)=\frac{|C\cap K|^{2}}{|C|\cdot |K|}$$
in which a cluster C is considered to match a known complex K whenever O(C, K) ≥ τ, where 0 < τ ≤ 1 is the matching threshold. We have a perfect match only when O(C, K) = 1. The threshold value τ = 0.2 was used in [12, 25, 35] whereas [29] used τ = 0.25. We used τ = 0.2 in our complex validation. After computing the overlapping scores between all pairs (C, K) of discovered and known complexes for the PIN, we determined the ability of the method to correctly classify the known complexes. The reason for doing this is that a given complex K_{1} may match many clusters with different degrees of overlap, while another complex K_{2} may match a single cluster only. Hence, we calculated the Specificity, the Sensitivity, and the F-Score as our measures of accuracy here; they are defined as follows:

$$\text{Sensitivity}=\frac{TP}{TP+FN},\qquad \text{Specificity}=\frac{TP}{TP+FP},\qquad \text{F-Score}=\frac{2\cdot \text{Sensitivity}\cdot \text{Specificity}}{\text{Sensitivity}+\text{Specificity}}$$
where, TP (true positive) is the number of the identified complexes C matched by the known complexes K, FN (false negative) is the number of known complexes that are not matched by the identified complexes, and FP (false positive) is the total number of the identified complexes C minus TP.
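The complex validation procedure can be sketched as follows. The sketch assumes the overlapping score O(C, K) = |C ∩ K|² / (|C|·|K|) commonly used in the cited works, together with Sensitivity = TP / (TP + FN) and Specificity = TP / (TP + FP); the function names and the toy complexes are hypothetical, and all complexes are assumed to be non-empty.

```python
def overlap_score(c, k):
    """Overlapping score O(C, K) = |C ∩ K|^2 / (|C| * |K|) for non-empty sets."""
    c, k = set(c), set(k)
    return len(c & k) ** 2 / (len(c) * len(k))

def complex_accuracy(predicted, known, tau=0.2):
    """Sensitivity, specificity and F-score of predicted complexes against known ones."""
    # TP: predicted complexes that match at least one known complex.
    tp = sum(any(overlap_score(c, k) >= tau for k in known) for c in predicted)
    # FN: known complexes that are matched by no predicted complex.
    fn = sum(not any(overlap_score(c, k) >= tau for c in predicted) for k in known)
    # FP: predicted complexes that match no known complex.
    fp = len(predicted) - tp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + specificity
    f_score = 2 * sensitivity * specificity / total if total else 0.0
    return sensitivity, specificity, f_score

# Hypothetical predicted and known complexes.
predicted = [{"a", "b", "c"}, {"d", "e", "f", "g"}, {"x", "y", "z"}]
known = [{"a", "b", "c"}, {"d", "e", "h"}, {"m", "n", "o"}]
sn, sp, f = complex_accuracy(predicted, known)
print(sn, sp, f)
```

Here the first predicted complex is a perfect match, the second matches with score 1/3 (above τ = 0.2), and the third matches nothing, so TP = 2, FN = 1, and FP = 1.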
Modularity and efficiency analyses
All experiments in this paper were performed on an Intel machine (Core TM i7-2600, 3.400 GHz CPU with 8 GB RAM). We compared FACPIN against HCPIN and CNM in terms of the modularity of their clustering results and in terms of their computational efficiencies. We ran FACPIN with its merging parameter set to μ = 0.5, then evaluated and reported the modularity of its resulting partition P_{ k }. The execution times (in seconds) are also recorded; the PINs are sorted in increasing order of their number of proteins n.
Identification of functional modules in the S. cerevisiae2 PIN
The computational results in this section are all generated with the merging parameter arbitrarily set to μ = 0.5 (except in Table 1) and with the modularity quality function Q_{ w }.
Effect of the merging parameter μ
Table 1 shows the effect of parameter μ on FACPIN clustering results. Recall that a neighbor u of v is merged with the current cluster C(v) of v whenever the test
is satisfied for u. Hence, the size of a cluster C(v) increases as the merging parameter μ decreases, since more neighbors are merged with v; consequently, the number of clusters k decreases as the sizes of the clusters increase.
Functional enrichment of FACPIN modules
In Table 2, the three methods are compared on the functional enrichment of their predicted modules. The P-value from DAVID's internal EASE score is computed for each predicted module C, and a P-value cutoff of 0.05 is used to find the biologically significant clusters; a module whose P-value is above this cutoff is considered insignificant. The table shows, in this order, the number (percentage) and the average size of significant predicted modules with P-values falling within the intervals < E-15, [E-15, E-10], [E-10, E-5], and [E-5, 1]. Although CNM and HCPIN show more enriched modules in the interval < E-15, the modules with P-values in this range are much larger in CNM and HCPIN than in FACPIN (especially CNM), with an average size of 439.83 for CNM and 103.1 for HCPIN compared to 49.08 for FACPIN. Larger modules result in a high number of false positives, reducing the specificity of the highly-enriched modules. Figure 1 shows this trend by comparing the sizes of the modules whose enrichment P-values fall in the range < E-15. In the figure, there is a clear shift to the right in the case of CNM, indicating much larger modules. This trend is apparent in all P-value ranges (see Table 2), which indicates that CNM is the worst at predicting enrichment in small modules. HCPIN's highly-enriched modules are also large compared to those produced by FACPIN, but smaller than those of CNM. Also, FACPIN has the lowest rate of modules not passing the enrichment P-value cutoff of 0.05.
Predicting large-sized versus small-sized modules
The P-value of a predicted module depends on its size; hence, Table 3 and Table 4 show the accuracy of the methods for predicting large and small modules, respectively.
In Table 3, we see that more than 96% of the modules predicted by each method are validated to be significant, though FACPIN yields a percentage slightly larger than that of HCPIN or CNM. Although CNM gives the highest average log P-value, it also yields the lowest average F-measure; this is because its significant modules are much larger than those of HCPIN and FACPIN, and hence less accurate. FACPIN, on the other hand, predicted more accurate significant modules than HCPIN and CNM but with the lowest average log P-value; again, this is due to the smaller sizes of its generated modules.
In Table 4, however, FACPIN performed consistently better than CNM and HCPIN in all performance measures; FACPIN seems to be better at producing small-sized modules.
Accuracy of FACPIN
Table 5 lists the accuracy of each method with all the validations of Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Table 5 further confirms our analysis of the results in Table 3 and Table 4; that FACPIN predicts smaller but more accurate significant modules.
Identification of functional modules in the S. cerevisiae1 PIN
Table 6 shows, in this order: the modularity value Q_{ w }(P_{ k }) of the generated partition P_{ k }; the number of predicted modules k_{3} with ≥ 3 proteins (and in parentheses, the total k); and the average size $\bar{s}$ of the modules. Next, the validation results show: the number k_{ s } of significant modules obtained overall (the percentage of such modules is in parentheses) and for each ontology class (Biological Process, Cellular Component, Molecular Function); the number of significant modules whose P-values fall within the intervals < E-15, [E-15, E-10], [E-10, E-5], and [E-5, 1]; the average $\bar{p}$ of the log P-value; and the accuracy A of each algorithm as the average F-Measure of the predicted significant modules. The data set is the original unweighted PIN of S. cerevisiae1 downloaded from the REACTOME database. In this PIN data, the number of modules discovered by FACPIN is comparable to (but still larger than) those detected by HCPIN and CNM. FACPIN still predicts smaller and more accurate significant modules in this S. cerevisiae1 PIN, with a higher average log P-value; this is consistent with our findings in the previous tables that FACPIN performs better due to the smaller sizes of its predicted modules.
Identification of protein complexes in the S. cerevisiae2 PIN
Table 7 shows the Specificity, the Sensitivity, and the F-Score of the complexes identified by each method. The results are shown for the modularity scoring function Q_{ w }. For HCPIN, results are shown for two values of its parameter λ as in [25]. The first three columns show, respectively, the number of proteins, the number of known complexes, and the average size of the known complexes in the data; columns 5, 6, and 7 are the number of discovered complexes, their average size, and the number of perfectly matched discovered complexes. In the table, we see that FACPIN discovers complexes whose average size (column 6) is closer to the average size of the known protein complexes (column 3), whereas the average sizes predicted by HCPIN and CNM are farther from it. As a consequence, FACPIN complexes have higher accuracy in terms of Specificity, Sensitivity, and F-Score. In particular, we obtain a larger number of perfectly matched complexes with FACPIN than with HCPIN or CNM.
Modularity and efficiency of FACPIN
Tables 8, 9, and 10 show the network modularity of the partitions obtained by the algorithms on the eight unweighted PIN data sets downloaded from the REACTOME database, respectively for the modularity functions Q_{ w }, Ω_{ w }, and D_{ w }. The aim of both objectives Q_{ w } and Ω_{ w } is to optimize the modularity of the detected clusters (though Ω_{ w } yields clusters that are neither too small nor too large, and therefore generates denser clusters than those from Q_{ w }); the aim of D_{ w } is to optimize both the modularity and the density of the clusters.
CNM is a modularity optimization algorithm designed to directly optimize the modularity quality function Q_{ w }, and hence it is no surprise that it performed best with this function, as shown in Table 8. The modularity maximization process of CNM [23] yields a partitioning containing one very large cluster and many much smaller ones; this is because a node is first selected for inclusion into the currently largest cluster so as to maximize the current Q_{ w } value. In the columns for Rice and Yeast in Table 8, we see that FACPIN outperforms CNM on Q_{ w }; Table 11 shows a possible reason for this, namely that the sizes max |C_{ i }| of their largest clusters are comparable.
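For readers who want to reproduce a modularity score on a small partition, the sketch below evaluates the standard weighted Newman-Girvan modularity, Q = Σ_c [ W_c / W − (d_c / 2W)² ], where W is the total edge weight, W_c the weight of edges inside cluster c, and d_c the total weighted degree of c. It is assumed here only as an illustration of a modularity quality function; the exact form of Q_{ w } used in this paper is the one defined in its methods section. The function name `modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Weighted Newman-Girvan modularity of a partition.

    edges: list of (u, v, w) tuples of an undirected graph without self-loops.
    membership: dict mapping each vertex to its cluster label.
    """
    total = sum(w for _, _, w in edges)  # total edge weight W
    inside = defaultdict(float)          # weight of intra-cluster edges per cluster
    degree = defaultdict(float)          # summed weighted degree per cluster
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            inside[cu] += w
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)

# Hypothetical toy network: two triangles joined by a single edge.
edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
         (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0)]
two_clusters = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_cluster = {v: "A" for v in range(1, 7)}
q_two = modularity(edges, two_clusters)
q_one = modularity(edges, one_cluster)
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than placing all vertices in a single cluster, which scores zero under this definition.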
Recall that given a currently highdegree vertex v with its cluster C_{ v }, FACPIN merges it with all its neighbors u satisfying the merging condition
The first term in the merging condition guarantees that only edges (u, v) which have a low local betweenness value $\lambda \left(u,v\right)=100\cdot \frac{1}{\left|{N}_{u}\cap {N}_{v}\right|+1}$ are considered for possible inclusion in the induced subgraph C(v) of C_{ v }. The second term guarantees that only those neighbors u which can contribute more edges to C(v) than v contributes to C(u) are selected. Hence, FACPIN merges neighbors u which contribute low local betweenness edges while optimizing the density of C(v). Also, as said before, the relative vertex clustering value ${R}_{w}(u\to v)$ combines the principles behind the vertex clustering coefficient of [14], the edge clustering coefficient ${C}_{u,v}^{\left(k\right)}$ of [2], and the edge clustering value ECV (u, v) of [25]. Since the objective of both Ω_{ w } and D_{ w } is to seek modular partitionings containing dense clusters, we can see in Tables 9 and 10 that FACPIN outperformed both HCPIN and CNM on both modularity functions: in seven out of eight PIN data sets for Ω_{ w }, and in all PIN data sets for D_{ w }. In particular for D_{ w }, FACPIN yields much higher modularity values.
Table 12 shows the execution times (in seconds) of each algorithm on the same data sets as above, but for modularity function Q_{ w } only. As one can see, FACPIN ran faster than both HCPIN and CNM on all data sets.
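The local betweenness value can be computed directly from neighbor sets. The sketch below assumes λ(u, v) = 100 / (|N_u ∩ N_v| + 1), i.e. the denominator counts the common neighbors of u and v; it illustrates only this quantity, not the full merging condition. The function name `local_betweenness` and the toy adjacency map are hypothetical.

```python
def local_betweenness(adj, u, v):
    """lambda(u, v) = 100 / (|N_u ∩ N_v| + 1) for an edge (u, v).

    adj: dict mapping each vertex to the set of its neighbors.
    """
    common = adj[u] & adj[v]  # shared neighbors, i.e. triangles through (u, v)
    return 100.0 / (len(common) + 1)

# Hypothetical toy network: a dense core {a, b, c, d} with a pendant vertex e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
lam_core = local_betweenness(adj, "a", "b")     # edge inside the dense core
lam_pendant = local_betweenness(adj, "d", "e")  # edge to the pendant vertex
print(lam_core, lam_pendant)
```

An edge supported by many triangles receives a low value, while an edge with no common neighbors receives the maximum value of 100, matching the intuition that such edges are more likely to bridge different clusters.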
Conclusions
In this paper, we have proposed a new agglomerative clustering approach, the FACPIN algorithm, for detecting the communities of a given PIN, and compared our method with two fast hierarchical techniques discussed in the literature. Our approach is based on a new measure, the relative vertex-to-vertex clustering value, which helps decide whether a given vertex u should be included within the cluster of another vertex v depending on how many of its neighbors form a triangle with v. Our approach is very fast since we are clustering vertices, not edges as in the compared methods. Thus our method is appropriate for PIN data, which in general contain more interactions than proteins. More study needs to be done, in particular validation based on random networks, in order to analyze the robustness of FACPIN. Comparisons with other methods which are not necessarily hierarchical will also be important. Non-agglomerative clustering methods based on the relative vertex-to-vertex clustering value will be investigated. In the current version of FACPIN, a neighbor u is merged with a cluster ${C}_{{v}_{i}}$ whenever its ${R}_{w}(u\to {v}_{i})$ value satisfies the merging condition, irrespective of whether there is another vertex v_{ j } such that ${R}_{w}(u\to {v}_{j})$ also satisfies the condition; we therefore plan a new variant of FACPIN in which each node u selects the best neighbor v to be merged with. Finally, we plan to modify FACPIN for directed (unweighted and weighted) protein interaction networks.
As a final note: we have not performed experiments on weighted PINs. In our initial submission, we used the following weighted criterion:
One of the reviewers of the initial manuscript pointed out that this formula is incorrect since it depends only on the weights of the edges connected to node u, not on those of the edges connected to v. An important consequence of this error is that our analysis of ${R}_{w}(u\to v)$ (based on the formula above) applies to the unweighted case only and will not necessarily apply to the weighted case. We have verified this, both computationally and theoretically, before undertaking experiments on weighted PINs. Due to time constraints, it is now impossible to perform and complete the experiments on weighted PINs using the correct formula in Equation (10). Our plan for the immediate future is therefore to perform these experiments.
References
 1.
Hartwell LH, Hopfield JJ, Leibler S, Murray AW: From molecular to modular cell biology. Nature. 1999, 402: C47-C52. 10.1038/35011540.
 2.
Radicchi F, Castellano C, Cecconi F: Defining and Identifying Communities in Networks. Proceedings of the National Academy of Sciences USA. 2004, 101 (9): 2658-2663. 10.1073/pnas.0400054101.
 3.
Pei P, Zhang A: A 'Seed-Refine' Algorithm for Detecting Protein Complexes from Protein Interaction Data. IEEE Transactions on Nanobioscience. 2007, 6 (1): 43-50.
 4.
Yook S, Oltvai Z, Barabási AL: Functional and Topological Characterization of Protein Interaction Networks. Proteomics. 2004, 4: 928-942. 10.1002/pmic.200300636.
 5.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E: Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment. 2008, 2008 (10): P1000
 6.
Newman MEJ: Finding Community Structure in Networks using the Eigenvectors of Matrices. Physical Review E. 2006, 74 (036104):
 7.
Wang RS, Zhang S, Wang Y, Zhang XS, Chen L: Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Elsevier Neurocomputing. 2008, 72
 8.
Pizzuti C, Rombo SE: A Co-clustering Approach for Mining Large Protein-Protein Interaction Networks. IEEE Transactions on Computational Biology and Bioinformatics. 2012, 9 (3): 717-730.
 9.
Li XL, Tan S, Foo C, Ng S: Interaction Graph Mining for Protein Complexes Using Local Clique Merging. Genome Informatics. 2006, 16: 260-269.
 10.
Spirin V, Mirny LA: Protein Complexes and Functional Modules in Molecular Networks. Proceedings of the National Academy of Sciences USA. 2007, 100 (21): 12123-12128.
 11.
Altaf-Ul-Amin M: Development and Implementation of an Algorithm for Detection of Protein Complexes in Large Interaction Networks. BMC Bioinformatics. 2006, 7 (207):
 12.
Bader GD, Hogue CW: An Automated Method for Finding Molecular Complexes in Large Protein Interaction Networks. BMC Bioinformatics. 2003, 7 (2):
 13.
Hartuv E, Shamir R: A Clustering Algorithm Based on Graph Connectivity. Information Processing Letters. 2000, 76 (4-6): 175-181. 10.1016/S0020-0190(00)00142-3.
 14.
Li M, Wang JX, Chen JE: A Fast Agglomerative Algorithm for Mining Functional Modules in Protein Interaction Networks. Proceedings of First International Conference in BioMedical Engineering and Informatics (BMEI). 2008, 3-7.
 15.
Luo F: Modular Organization of Protein Interaction Networks. Bioinformatics. 2007, 23 (2): 207-214. 10.1093/bioinformatics/btl562.
 16.
Brohee S, Helden JV: Evaluation of Clustering Algorithms for Protein Interaction Networks. BMC Bioinformatics. 2006, 7 (488):
 17.
Mering CV, et al: Comparative Assessment of Large-Scale Data Sets of Protein-Protein Interaction Networks. Nature. 2002, 417 (7887): 399-403.
 18.
Brucker F, Barthélemy JP: Éléments de classification: aspects combinatoires et algorithmiques. Hermès, Paris. 2007, 438
 19.
Becker E: Multifunctional Proteins Revealed By Overlapping Clustering in Protein Interaction Network. Bioinformatics. 2012, 28 (1): 84-90. 10.1093/bioinformatics/btr621.
 20.
Bagrow JP, Lehmann S: Link Communities Reveal Multiscale Complexity in Networks. Nature. 2010, 466: 761-764. 10.1038/nature09182.
 21.
Girvan M, Newman ME: Community Structure in Social and Biological Networks. Proceedings of the National Academy of Sciences USA. 2002, 99: 7821-7826. 10.1073/pnas.122653799.
 22.
Fortunato S: Community detection in graphs. Elsevier Physics Reports. 2010, 486: 75-174. 10.1016/j.physrep.2009.11.002.
 23.
Clauset A, Newman MEJ, Moore C: Finding community structure in very large networks. Phys Rev E. 2004, 70: 066111
 24.
Friedel C, Zimmer R: Inferring Topology from Clustering Coefficients in Protein-Protein Interaction Networks. BMC Bioinformatics. 2006, 7 (519):
 25.
Wang J, Li M, Chen J, Pan Y: A Fast Hierarchical Clustering Algorithm for Functional Modules Discovery in Protein Interaction Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011, 8 (3):
 26.
Rahman MS, Ngom A: A Fast Agglomerative Community Detection Method for Protein Complex Discovery in Protein Interaction Networks. Proceedings of the 8th IAPR International Conference on Pattern Recognition in Bioinformatics. 2013, LNBI 7986: 1-12.
 27.
Zaki N, Berengueres J, Efimov D: A Method for Detecting Protein Complexes. Proceedings of the Genetic and Evolutionary Computation Conference. 2012, 209-216.
 28.
Newman MEJ: Fast algorithm for detecting community structure in networks. Physical Review. 2003, 69 (066133):
 29.
Laarhoven TV, Marchiori E: Robust Community Detection Methods with Resolution Parameter for Complex Detection in Protein-Protein Interaction Networks. Proceedings of the 7th IAPR International Conference on Pattern Recognition in Bioinformatics. 2012, LNBI 7632: 1-13.
 30.
Chen S, Ma B, Zhang K: On the Similarity Metric and the Distance Metric. Theoretical Computer Science. 2009, 410 (2009): 2365-2376.
 31.
Xenarios I, et al: The Database of Interacting Proteins: A Research Tool for Studying Cellular Networks of Protein Interactions. Nucleic Acids Research. 2002, 30: 303-305. 10.1093/nar/30.1.303.
 32.
Dennis G, Sherman B, Hosack D, Yang J, Gao W, Lane HC, Lempicki R: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology. 2003, 4 (5): P3. 10.1186/gb-2003-4-5-p3.
 33.
Huang D, Sherman B, Tan Q, Collins J, Alvord WG, Roayaei J, Stephens R, Baseler M, Lane HC, Lempicki R: The DAVID Gene Functional Classification Tool: A Novel Biological Module-Centric Algorithm to Functionally Analyze Large Gene Lists. Genome Biology. 2007, 8: R183. 10.1186/gb-2007-8-9-r183.
 34.
Cho YR, Hwang W, Ramanathan M, et al: Semantic Integration to Identify Overlapping Functional Modules in Protein Interaction Networks. BMC Bioinformatics. 2007, 8 (265):
 35.
Chua HN, Ning K, Sung WK, Leong HW, Wong L: Using Indirect Protein-Protein Interaction for Protein Complex Prediction. Journal of Bioinformatics and Computational Biology. 2008, 6 (3): 435-466. 10.1142/S0219720008003497.
Acknowledgements
This research has been partially supported by the Canadian NSERC Grant #RGPIN-228117-2011 of AN. ZMI would like to acknowledge the support of the NIHR Biomedical Research Centre for Mental Health and the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King's College London, and a joint infrastructure grant from Guy's and St Thomas' Charity and the Maudsley Charity, London, United Kingdom.
Declarations
The publication of this article is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).
AN declares that he was not involved in the peer review process or any acceptance decisions regarding this article on which he is an author.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 4, 2015: Selected articles from the 9th IAPR conference on Pattern Recognition in Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S4.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AN proposed the current form of the Relative Vertex-to-Vertex Clustering Value, introduced an initial version of the FACPIN algorithm, and suggested the experiments to be performed. ZMI proposed and implemented the current version of the FACPIN algorithm, and performed all the suggested computational experiments. AN and ZMI contributed equally to writing the paper.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ibrahim, Z.M., Ngom, A. The relative vertex clustering value - a new criterion for the fast discovery of functional modules in protein interaction networks. BMC Bioinformatics 16, S3 (2015). https://doi.org/10.1186/1471-2105-16-S4-S3
Keywords
 Protein Complexes
 Weighted Networks
 Functional Modules
 Network Clustering Criterion
 Protein Interaction Networks