Identifying protein complexes based on an edge weight algorithm and core-attachment structure

Background Protein complex identification from protein-protein interaction (PPI) networks is crucial for understanding cellular organization principles and functional mechanisms. In recent decades, numerous computational methods have been proposed to identify protein complexes. However, most current state-of-the-art methods still face unresolved challenges, including high false-positive rates, an inability to identify overlapping complexes, a lack of consideration for the inherent organization within protein complexes, and the absence of some biological attachment proteins. Results In this paper, to overcome these limitations, we present a protein complex identification method based on an edge weight method and core-attachment structure (EWCA), in which a protein complex consists of a complex core and some sparse attachment proteins. First, we propose a new weighting method to assess the reliability of interactions. Second, we identify protein complex cores by using the structural similarity between a seed and its direct neighbors. Third, we introduce a new method to detect attachment proteins that is able to distinguish and identify peripheral proteins and overlapping proteins. Finally, we bind attachment proteins to their corresponding complex cores to form protein complexes and discard redundant protein complexes. The experimental results indicate that EWCA outperforms existing state-of-the-art methods in terms of both accuracy and p-value. Furthermore, EWCA identifies many more protein complexes with statistical significance. Additionally, EWCA achieves a better balance between accuracy and efficiency than some state-of-the-art methods with high accuracy. Conclusions In summary, a comprehensive comparison with twelve algorithms in terms of different evaluation metrics shows that EWCA has better performance for protein complex identification. The datasets and software are freely available for academic research at https://github.com/RongquanWang/EWCA.

systems biology. In addition, understanding biological functions is a fundamental task for different cellular systems and is beneficial for treating complex diseases. Due to the development of advanced high-throughput techniques, a large number of PPI networks have been generated [2], which makes discovering protein complexes more convenient. However, how to accurately identify biological protein complexes has been an important research topic in the post-genomic era [3]. The accurate identification of protein complexes in PPI networks is significant for understanding the principles of cellular organization and function [4]. As a result, a large number of methods, including laboratory-based and computational-based methods, have been proposed to address this issue.
So far, some biological experimental methods have been proposed to detect protein complexes from PPI networks. However, these methods are expensive and time-consuming. Thus, many efficient alternative computational methods have been proposed to identify protein complexes in PPI networks. Moreover, a number of high-quality and large-scale PPI networks make it possible for computational methods to identify protein complexes. Generally, a PPI network can be modeled as an undirected graph (also called a network), where vertices represent proteins and edges represent interactions between proteins. Various state-of-the-art computational methods have been developed to identify protein complexes in the last few years. According to the information used in the identification process, these computational methods are classified into two categories. One category uses only the topological information of PPI networks to identify protein complexes, and we call these topology-based methods. The other category combines biological and topological information to identify protein complexes, such as IPC-BSS [5], GMFTP [6] and DPC [7].
A large number of topology-based methods have been proposed to identify protein complexes by employing different topological structures. For instance, CFinder [8] and CMC [9] are based on cliques or k-cliques; MCL [10], DPClus [11] and SPICi [12] use dense subgraphs; ClusterONE [13] and CALM [14] depend on the modularity concept; Core [15] and COACH [16] employ the core-attachment structure. Moreover, ProRank+ [17] uses a ranking algorithm and the spoke model for identifying protein complexes. All of the above are typical topology-based methods. Up to now, there is no clear and appropriate definition of when a group of proteins should belong to the same complex in a PPI network.
As is well known, a clique is a complete subgraph in which all vertices are connected to each other. Some researchers believe that cliques or k-cliques are protein complexes. For example, CFinder [8] is based on the clique percolation method (CPM) [18], which identifies k-cliques. However, requiring a protein complex to be a clique or k-clique is too strict, and finding such cliques is computationally infeasible in larger PPI networks because the problem is NP-complete [19]. Furthermore, many studies assume that a dense subgraph corresponds to a protein complex, because proteins in the same protein complex interact frequently among themselves [20,21]. MCL [10] is a highly scalable clustering algorithm based on simulating random walks in biological networks. Another example is SPICi [12], a fast heuristic graph clustering method that selects the highest weighted node as a seed and expands it according to local density and a support measure. SPICi is an efficient method for identifying protein complexes. However, it has low accuracy and cannot identify overlapping protein complexes. In fact, protein complexes often overlap, and many multi-functional proteins are involved in different protein complexes.
Consequently, some new computational methods have been proposed to identify overlapping protein complexes. For example, DPClus [11] is a seed-growth method based on different graph topological characteristics such as degree, diameter and density. The main differences among such seed-growth methods are the density threshold and the cluster expanding strategy [22]. More importantly, they may miss some low-density protein complexes [14]. Moreover, among the 408 known yeast protein complexes provided by Pu et al. [23], 21% have a density lower than 0.5. Additionally, PPI networks contain many false-positive interactions. Therefore, some methods, such as PEWCC [25] and ProRank+ [17], try to assess the reliability of existing PPIs and filter out unreliable interactions [24]. All of these methods are based on a single topological structure of protein complexes and do not utilize the information of known protein complexes.
Furthermore, some researchers find that many protein complexes have a modular structure, which means that these protein complexes are densely connected within themselves but sparsely connected with the rest of the PPI network [21,26,27,28]. Motivated by this observation, a number of new clustering methods based on modularity structure have been proposed, including ClusterONE [13], CALM [14], EPOF [29] and PCR-FR [30]. One of the most widely known is ClusterONE [13]. ClusterONE can identify overlapping protein complexes from PPI networks, and its authors introduce the maximum matching ratio (MMR) to evaluate predicted overlapping protein complexes. However, ClusterONE may neglect the effect of overlapping proteins in the process of identifying seeds [14], and some attachment proteins may be missed [28].
Recently, the characteristics of detected protein complexes have indicated that protein complexes generally have a core-attachment structure [31][32][33][34]. Gavin et al. [31] have revealed that proteins within a protein complex are organized as core proteins and attachment proteins. Although there is no detailed statement for this structure, some researchers think that a protein complex core is often a dense subgraph and that some attachment proteins are closely associated with its core proteins and assist these core proteins to perform subordinate functions [16]; together, they form a biologically meaningful protein complex. Ahmed et al.'s studies also demonstrate a similar architecture and inherent organization in protein complexes [15,33,35].
Up to now, several methods based on core-attachment structure have been explored for identifying protein complexes, such as COACH [16], Core [15] and Ma et al.'s method [22]. These methods perform well and demonstrate the significance of this structure [22]. Methods based on core-attachment structure are generally divided into two stages. In the complex core identification stage, they mainly identify a dense subgraph or maximal clique as the protein complex core. In fact, some protein complex cores are dense subgraphs or maximal cliques, but others are not high-density [23]. Ma et al. [22] have argued that the density of a subgraph is not appropriate to characterize a protein complex core. In the attachment protein identification stage, most methods based on core-attachment structure follow Wu et al.'s criterion [16], which is to select the proteins whose neighbors interact with more than half of the proteins in its protein complex core. As we know, PPI networks are sparse, and it has been shown that the size of protein complex cores varies from 1 to 23 [31]. Obviously, it could be sufficient to describe the relation between a protein complex core and its attachment proteins. However, the currently available PPI networks contain many false-positive interactions, which greatly affect the accuracy of protein complex detection.
In this paper, we try to overcome these limitations and employ the internal structure of protein complexes to identify biologically meaningful protein complexes accurately. Inspired by the experimental work of some researchers [14,32,36,37,38] and the distinctive properties of core and attachment proteins, we further study the core-attachment structure. However, these previous studies only illustrate some concepts of this structure and do not give a method for identifying the various types of proteins, including core proteins, peripheral proteins and overlapping proteins [14]. In real PPI networks, overlapping protein complexes are universal [14]. Therefore, overlapping proteins often play an important role in the identification of protein complexes. Generally, overlapping proteins are regarded as members of two or more protein complexes at the same time. Overlapping proteins promote the interaction between protein complexes. In addition, the identification of overlapping nodes is useful in many real complex networks, such as social networks, citation networks and the World Wide Web. Most of the algorithms mentioned above cannot differentiate and identify overlapping proteins and peripheral proteins, whereas EWCA has this ability. Thus, in this paper, we provide some definitions to distinguish and identify locally overlapping proteins and locally peripheral proteins, which has not been done by other researchers. We take a simple example to show the core-attachment structure in Fig. 1.
We propose a method named EWCA to identify protein complexes. Most existing protein complex identification approaches search for protein complexes based on 'density graph' assumptions. Unlike some of them, EWCA provides a new direction by using a core-attachment structure to identify protein complexes. First, EWCA defines a new edge weight measure to weight and filter out interactions in PPI networks. Second, EWCA generates some preliminary overlapping complex cores based on structural similarity rather than density. This approach is more reasonable because the core proteins in the same complex core have relatively more structural similarity. Third, EWCA designs a new method to discover the attachment proteins corresponding to each complex core. Finally, the experimental results show that EWCA performs better than existing state-of-the-art methods in terms of some evaluation metrics (e.g., F-measure and MMR) and functional enrichment.

Preliminary
Generally, a PPI network can be modeled as an undirected graph G_ppi = (V_ppi, E_ppi), where V_ppi represents the set of vertices corresponding to proteins and E_ppi stands for the set of edges, which represent the interactions between proteins from V_ppi. A PPI network is undirected and may be unweighted or weighted, with the weight on an edge representing the confidence score (usually between 0 and 1) for an interaction. For a vertex v, N(v) stands for the set of all neighbors of v.

Construction of a reliable weighted PPI network
Fig. 1 A network with two protein complexes and three overlapping proteins. Each protein complex consists of core proteins, peripheral proteins and three overlapping proteins, which are shared by the two protein complexes in the overlapping yellow area. The core proteins inside the red dotted circles constitute the protein complex cores. Diamond nodes represent core proteins, circle nodes represent peripheral proteins, hexagonal nodes represent overlapping proteins, and parallelogram nodes represent interspersed proteins.

Generally speaking, the PPI networks obtained from different experimental methods are quite noisy (many interactions are believed to be false positives) [39]. Hence, we should reduce the false positives. To address this challenge, some researchers have proposed preprocessing strategies to assess and eliminate potential false positives by using the topological properties of the PPI networks [40][41][42][43]. Meanwhile, some experimental results [44,45] have shown that PPIs assigned high confidence scores by neighbor-information-based methods tend to be more reliable than others. Thus, we introduce the Jaccard's coefficient similarity (JCS) measure proposed by Jaccard et al. [46]. The Jaccard's coefficient similarity between two neighboring proteins v and u is defined by Eq. (1):

$$JCS(v, u) = \frac{|N(v) \cap N(u)|}{|N(v) \cup N(u)|} \qquad (1)$$

where N(v) and N(u) stand for the sets of neighbor nodes of nodes v and u, respectively. N(v) ∩ N(u) is the set of all common neighbors of nodes v and u, and is denoted by CN(v, u). |N(v) ∩ N(u)| stands for the number of common neighbors of v and u, and |N(v) ∪ N(u)| represents the number of all distinct neighbors of v and u. Obviously, the more common neighbors two proteins share, the higher the similarity between the two adjacent nodes.
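As a minimal sketch of Eq. (1), the Jaccard's coefficient similarity can be computed directly from adjacency sets. The function name `jcs`, the dictionary-of-sets representation and the toy network below are illustrative assumptions, not part of the EWCA software.

```python
def jcs(adj, v, u):
    """Jaccard's coefficient similarity of v and u, following Eq. (1)."""
    common = adj[v] & adj[u]   # CN(v, u) = N(v) ∩ N(u)
    union = adj[v] | adj[u]    # N(v) ∪ N(u)
    return len(common) / len(union) if union else 0.0


# Hypothetical toy network stored as undirected adjacency sets.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

print(jcs(adj, "a", "c"))  # common neighbors {b, d}, union {a, b, c, d}
print(jcs(adj, "a", "b"))  # common neighbor {c}, union {a, b, c, d}
```

Because v and u are adjacent, each appears in the other's neighbor set, so the union in the denominator also contains v and u themselves.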
Here, to better quantify the connectivity between two adjacent nodes v and u, we define a new high-order common neighbor (HOCN) similarity measure based on the Jaccard's coefficient between node v and node u. The main idea is to estimate each edge according to the common neighbors of the common neighbors of the two adjacent nodes. HOCN(v, u) is defined by Eq. (2). The weight of the edge (v, u) between protein v and protein u is determined not only by the Jaccard's coefficient between proteins v and u but also by the probability that their common neighbors support the edge (v, u). The support of all common neighbors (CNS) for the edge (v, u) is calculated by Eq. (3). Finally, the weight of the edge (v, u) is determined by Eq. (2).
To illustrate how the reliability of protein interactions is assessed, we give an example, shown in Fig. 2. Suppose we assess the weight of edge e1 between b and d. According to Eq. (1), we obtain JCS(b, d) = |{a, c}| / |{a, b, c, d, e, f, g, k, s}| = 2/9 ≈ 0.222. Combining this value with the support of the common neighbors a and c gives HOCN(b, d) ≈ 0.102 according to Eq. (2). Here, we use HOCN(v, u) to calculate the weight of each edge (v, u) so that EWCA improves the quality of the identified protein complexes. Obviously, HOCN(v, u) considers the connectivity of the entire neighborhood of two adjacent nodes more broadly and can better determine whether two interacting proteins belong to the same protein complex.
is considered unreliable and has to be discarded. The detailed pseudo-code of this phase is shown in Algorithm 1.
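The weighting and filtering phase can be sketched as follows. Eq. (2) and Eq. (3) are not reproduced in this text, so the sketch assumes one plausible form: CNS(v, u) sums JCS(v, w) · JCS(w, u) over the common neighbors w, and HOCN(v, u) = (JCS(v, u) + CNS(v, u)) / (|CN(v, u)| + 1). This form, the function names and the `min_weight` cut-off are assumptions made for illustration only and should be checked against the original equations and Algorithm 1.

```python
def jcs(adj, v, u):
    """Jaccard's coefficient similarity, Eq. (1)."""
    union = adj[v] | adj[u]
    return len(adj[v] & adj[u]) / len(union) if union else 0.0


def hocn(adj, v, u):
    """High-order common neighbor weight of edge (v, u).

    ASSUMED form (the source equations are missing):
    CNS = sum over common neighbors w of JCS(v, w) * JCS(w, u),
    HOCN = (JCS(v, u) + CNS) / (|CN(v, u)| + 1).
    """
    common = adj[v] & adj[u]
    cns = sum(jcs(adj, v, w) * jcs(adj, w, u) for w in common)
    return (jcs(adj, v, u) + cns) / (len(common) + 1)


def weight_network(adj, min_weight=0.0):
    """Weight every edge once and drop edges whose weight is not above
    min_weight (a hypothetical cut-off, not taken from the source)."""
    weights = {}
    for v in adj:
        for u in adj[v]:
            if v < u:  # visit each undirected edge once; labels must be orderable
                w = hocn(adj, v, u)
                if w > min_weight:
                    weights[(v, u)] = w
    return weights


# Hypothetical toy network: a triangle a-b-c with a pendant vertex d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
weights = weight_network(adj)
print(sorted(weights))  # the pendant edge (c, d) has no common neighbors
```

In this toy network the edge (c, d) has no common neighbors, so under the assumed form its weight is 0 and it is filtered out, while the three triangle edges are kept.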

Preliminary complex core identification
According to the latest research [31,36,[47][48][49][50], a protein complex consists of core and periphery (also called attachment) proteins. A complex core is a small group of proteins that show high co-expression similarity and share high functional similarity, and it carries the key cellular role and the essential function of a protein complex [31,35]. Unfortunately, due to the limitations of experimental methods, the functional information (gene ontology) of many proteins may be unavailable for the identification of protein complex cores [51]. However, the core proteins in the same complex core show a high level of functional similarity and have relatively more common neighbors among themselves than among other proteins in the PPI networks [15,36,51]. From the viewpoint of topological characteristics, the biological functions of proteins are determined by their neighbors. This strategy is a good alternative in the absence of functional information. Thus, two proteins are assigned to the same protein complex core if they share many common neighbors: because two such proteins share many interaction neighbors, they are likely to carry out similar biological functions and be in the same complex core. Moreover, structural similarity can further assess the functional similarity between two proteins based on common neighbors and neighbourhood size [36,47,51].
As mentioned in the "Preliminary" section, given a vertex v ∈ V_ppi, N(v) stands for the set of all its direct neighbors. Thus, the structural neighborhood of v is defined by Eq. (4):

$$SN(v) = \{v\} \cup N(v) \qquad (4)$$

where SN(v) contains the node v and its immediate neighbors.
In PPI networks, if two proteins have common neighbors, they may be functionally related. Furthermore, structural similarity is used for normalizing the common neighbors between two vertices in information retrieval [47]. This measure can serve as an indirect measure of functional similarity [36,45]. As a result, the structural similarity SS can be calculated as the number of common neighbors normalized by the geometric mean of the neighbourhood sizes of vertices v and w. Therefore, the structural similarity SS between two neighboring proteins v and w is defined by Eq. (5):

$$SS(v, w) = \frac{|SN(v) \cap SN(w)|}{\sqrt{|SN(v)| \cdot |SN(w)|}} \qquad (5)$$

When a vertex has a structure similar to that of one of its neighbors, their structural similarity is large. In addition, structural similarity is symmetric, i.e., SS(v, w) = SS(w, v).
Obviously, the value of structural similarity lies in (0, 1]. Additionally, although PPI networks contain noise that affects the clustering results, this scheme is not sensitive to it. Based on these statements, we mine a subgraph in the neighborhood graph G_v based on structural similarity, which is used as a preliminary complex core and is written as Core(PC_v). Core(PC_v) consists of the seed vertex v as the center and the neighbors that have significantly high structural similarity with seed v. In addition, some biological experimental analyses, such as three-dimensional structure and yeast two-hybrid analyses, have shown that the core proteins (vertices) in the same complex core are likely to be in direct physical contact with each other [31,52]. Therefore, for each neighbor u ∈ N(v), if the structural similarity between u and seed v is larger than a prefixed threshold (e.g., 0.4), we select protein u as a core protein. The selection of this prefixed threshold is described in the Parameter selection section. The Core(PC_v) of an identified complex PC_v is defined as the subgraph made of all the core proteins and their corresponding edges.
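The structural neighborhood and structural similarity described above (common structural neighbors normalized by the geometric mean of the two neighborhood sizes) can be sketched in a few lines. The function names and the toy network are illustrative assumptions.

```python
import math


def structural_neighborhood(adj, v):
    """SN(v): the vertex v together with its direct neighbors."""
    return adj[v] | {v}


def structural_similarity(adj, v, w):
    """SS(v, w): shared structural neighbors, normalized by the geometric
    mean of the two structural neighborhood sizes."""
    sn_v = structural_neighborhood(adj, v)
    sn_w = structural_neighborhood(adj, w)
    return len(sn_v & sn_w) / math.sqrt(len(sn_v) * len(sn_w))


# Hypothetical toy network: a triangle a-b-c with a pendant vertex d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

print(structural_similarity(adj, "a", "b"))  # identical neighborhoods
print(structural_similarity(adj, "c", "d"))  # 2 / sqrt(4 * 2)
```

For two adjacent vertices the intersection always contains both endpoints, which is why the similarity of neighbors is strictly positive.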
1. If the subgraph is small, dense and reliable, its core proteins within the same protein complex core have relatively more interactions among themselves.
2. The core proteins in the same complex core are likely to be in direct physical contact with each other.
3. The core proteins in the same complex core should have relatively more common neighbors than other non-core proteins.
According to these possible conditions and our studies, we take account of a preliminary complex core, named Core(PC_v). It should satisfy the following four conditions.
(1) The size of the preliminary complex core is larger than 2 and it consists of core proteins, where all its core proteins directly connect with each other.
(2) The core proteins of a complex core should have more reliable and heavier weights among themselves.
(3) A complex core should have higher functional similarity.
(4) The core proteins of a protein complex core could be shared with multiple protein complexes.
More specifically, we consider each vertex v ∈ V_ppi as a seed to mine protein complex cores, and we compute SS(v, w) between v and each adjacent vertex w. When SS(v, w) is larger than or equal to a user-defined threshold (ss), we take w as a core vertex of the preliminary complex core Core(PC_v). Vertex w is included in Core(PC_v) because v and w are connected and share a similar structure. Each preliminary complex core Core(PC_v) thus consists of the seed vertex v and its core vertices, where the value of SS(v, w) between the seed vertex v and each of these direct neighbors is larger than or equal to the previously set threshold ss. Finally, we discard redundant preliminary complex cores and only retain preliminary complex cores whose size is greater than or equal to 2. The pseudo-code of this phase is shown in Algorithm 2.
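The seed-based procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy network and the threshold value 0.8 are hypothetical, and the comparison uses "larger than or equal to" as in the paragraph above (Algorithm 2 writes a strict inequality).

```python
import math


def structural_similarity(adj, v, w):
    """SS(v, w) over structural neighborhoods SN(x) = N(x) ∪ {x}."""
    sn_v, sn_w = adj[v] | {v}, adj[w] | {w}
    return len(sn_v & sn_w) / math.sqrt(len(sn_v) * len(sn_w))


def preliminary_cores(adj, ss_threshold=0.4):
    """Use every vertex as a seed and keep the structural neighbors whose
    similarity to the seed reaches the threshold."""
    cores = set()  # a set of frozensets removes duplicate cores
    for v in adj:
        core = frozenset(
            u for u in adj[v] | {v}
            if structural_similarity(adj, v, u) >= ss_threshold
        )
        if len(core) >= 2:  # retain cores with at least two proteins
            cores.add(core)
    return cores


# Hypothetical toy network: a triangle a-b-c with a pendant vertex d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cores = preliminary_cores(adj, ss_threshold=0.8)
print(cores)
```

With the threshold 0.8, the seeds a, b and c all yield the same core {a, b, c}, which is stored once, and the seed d yields only itself and is dropped.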

Attachment protein detection
EWCA is used to detect the protein complex cores in the previous section. Next, we should identify the attachment proteins for each complex core to form the protein complex. The research of Gavin et al. [31] shows that attachment proteins are closely associated with core proteins within protein complexes and that there is a great degree of heterogeneity in expression levels, and that attachment proteins might represent nonstoichiometric components [31]. Also, attachment proteins are shared by two or more complexes, and some overlapping proteins may participate in multiple complexes [53,54]. According to Gavin et al.'s research [31] and our previous CALM algorithm [14], we know that a protein complex consists of a protein complex core and attachment proteins. Additionally, attachment proteins have two parts: one is peripheral proteins and the other is overlapping proteins. For more details on these concepts, please refer to Refs. [14,31].

Algorithm 2 Preliminary complex core identification.
Input: The PPI network, G_ppi = (V_ppi, E_ppi); the structural similarity threshold, ss.
Output: The set of preliminary complex cores, PCC.
initialize the set of preliminary complex cores, PCC, and the variable i = 1;
for all v in V_ppi do
    initialize a preliminary complex core CC_i = ∅;
    get the structural neighborhood of vertex v as SN(v) according to Eq. (4); // SN(v) includes v and all the neighbors of v.
    for each vertex u ∈ SN(v) do
        calculate the structural similarity SS(v, u) between vertices v and u according to Eq. (5);
        if SS(v, u) > ss then
            CC_i = CC_i ∪ {u}; // update CC_i by adding u.
        end if
    end for
    if the size of CC_i ≥ 2 then
        PCC = PCC ∪ {CC_i}; // add CC_i to PCC.
        i = i + 1;
    end if
end for
discard duplicate preliminary complex cores in PCC;
return the set of preliminary complex cores, PCC.
Based on these concepts, attachment proteins can be grouped into two categories. The first category is peripheral proteins, whose main feature is that they belong to only one protein complex. In other words, they closely connect to the protein complex and belong to the most favored protein complex. The second category is overlapping proteins, which, in contrast, belong to multiple protein complexes. According to the statistics from our previous CALM algorithm, the number of overlapping proteins in the known protein complexes [14] shows that a large fraction of proteins (called overlapping proteins) participate in multiple protein complexes. Here, we summarize the features of overlapping proteins. Overlapping proteins are proteins that belong to several protein complexes at the same time.
Overlapping proteins connect to each protein complex with a different connection strength. We believe that dense protein-protein interaction in a protein complex is a key feature of protein complexes. Therefore, we adopt the average weighted degree of protein complexes which is based on the concept of density, to judge whether a protein is an overlapping protein or not.
Next, let us assume an identified complex, written as PC_v. Here, we use a given preliminary complex core Core(PC_v) = (V_core, E_core) and a candidate attachment subset CAP to construct the identified complex PC_v. We need to complete two tasks: one is to set up a subset CAP ⊆ V_ppi in which each protein p ∈ CAP is a candidate attachment protein for the identified protein complex PC_v, and the other is to decide which category each protein in CAP belongs to.
First, for attachment proteins, we give two basic conditions:
(1) Attachment proteins should directly interact with the corresponding complex cores.
(2) Attachment proteins should connect with at least two core proteins of the complex core.
If a protein p satisfies these conditions, it is selected as a candidate attachment protein, where protein p belongs to the neighbourhood of the preliminary complex core Core(PC_v) and |N(p) ∩ V_core| ≥ 2. As a result, we have constructed a candidate attachment subset CAP. Next, we discuss how to identify the two categories specifically. First of all, we consider that an overlapping protein should satisfy the following:
(1) Overlapping proteins interact directly and closely with the corresponding complex cores.
(2) The weighted out-connectivity of the complex core of the overlapping protein is greater than the weighted in-connectivity of the complex core.
(3) Overlapping proteins weakly interact with the corresponding complex core relative to the internal interactions within the corresponding complex core.
(4) Overlapping proteins are not unique to a protein complex; instead, they may be present in more than one complex.
According to these conditions, a candidate attachment protein p in the candidate attachment set CAP of an identified complex PC_v is taken as an overlapping protein, that is, p ∈ Overlapping(PC_v), if it satisfies the following:
(1) The weighted out-connectivity of p with respect to Core(PC_v) is greater than or equal to the weighted in-interactions of p with respect to Core(PC_v), given by:

$$\mathrm{weight}_{out}(p, Core(PC_v)) \geq \mathrm{weight}_{in}(p, Core(PC_v))$$

(2) The weighted in-interactions of p with respect to Core(PC_v) is at least half of the average weighted in-interactions of all core vertices in Core(PC_v), given by:

$$d_{weight}(p, Core(PC_v)) \geq \frac{1}{2}\, \mathrm{weight}_{avg}(Core(PC_v))$$

Here, d_weight(p, Core(PC_v)) is the total weight of the interactions of p with the core proteins in Core(PC_v), given by

$$d_{weight}(p, Core(PC_v)) = \sum_{p \notin V_{core},\, t \in V_{core}} \mathrm{weight}(p, t)$$

and weight_avg(Core(PC_v)) is the average of the weighted interactions of all core proteins within the complex core Core(PC_v), given by

$$\mathrm{weight}_{avg}(Core(PC_v)) = \frac{2 \sum_{(v,u) \in E_{core}} \mathrm{weight}(v, u)}{|V_{core}|}$$

where |V_core| is the number of proteins in Core(PC_v) and the sum over (v, u) ∈ E_core of weight(v, u) represents the total weight of interactions in the protein complex core Core(PC_v). If a protein satisfies these conditions, we suppose that it belongs to protein complex PC_v at the same time and make it an overlapping protein.
Second, when we have obtained all overlapping proteins from the candidate attachment set CAP, we next obtain a candidate peripheral protein subset, CP(PC_v), which is the difference set CAP − Overlapping(PC_v). We consider that a peripheral protein should satisfy the following:
(1) Peripheral proteins are not overlapping proteins.
(2) The weighted in-connectivity of the complex core of the peripheral proteins is greater than the weighted out-connectivity of the complex core.
(3) Peripheral proteins closely interact with the corresponding complex core relative to the interaction of other non-member proteins with the corresponding complex core.
(4) Peripheral proteins belong to only one protein complex.
Considering these criteria, a candidate attachment protein p in the candidate peripheral protein subset CP(PC_v) of an identified complex PC_v is taken as a peripheral protein, that is, p ∈ Periphery(PC_v), if it satisfies the following:
(1) The weighted in-interactions of p with respect to Core(PC_v) is greater than the weighted out-connectivity of p with respect to Core(PC_v), written as:

$$\mathrm{weight}_{in}(p, Core(PC_v)) > \mathrm{weight}_{out}(p, Core(PC_v))$$

(2) The weighted in-interactions of p with respect to Core(PC_v) is greater than or equal to the average weight of interactions of all candidate peripheral proteins with Core(PC_v), given by:

$$\mathrm{weight}_{in}(p, Core(PC_v)) \geq \mathrm{weight}_{avg}(CP(PC_v))$$

where weight_avg(CP(PC_v)) is the average weight of interactions of the entire candidate peripheral protein subset CP(PC_v) with Core(PC_v).
Combining the peripheral proteins and overlapping proteins, we form the final set of attachment proteins of the protein complex core Core(PC_v), that is:

$$Attachment(PC_v) = Periphery(PC_v) \cup Overlapping(PC_v)$$

The detailed pseudo-code of this phase is shown in Algorithm 3.

Algorithm 3 The attachment protein detection.
Input: The weighted PPI network G = (V_ppi, E_ppi, W_ppi), where W_ppi is computed based on Eq. (2) (HOCN(v, u)); the set of identified preliminary complex cores, PCC.
Output: The set of identified candidate attachment proteins, AP.
for each preliminary complex core Core(PC_v) ∈ PCC do
    obtain a candidate attachment protein subset CAP, where each p ∈ CAP is a direct neighbor of Core(PC_v) and connects with at least two core proteins of Core(PC_v), given by: |N(p) ∩ V_core| ≥ 2;
    calculate weight_avg(Core(PC_v)) = 2 * Σ_{(i,j) ∈ E_core} weight(i, j) / |V_core|;
    initialize the attachment protein subset, Attachment(PC_v), the peripheral protein subset, Periphery(PC_v), and the overlapping protein subset, Overlapping(PC_v);
    for p ∈ CAP do
        calculate weight_in(p, Core(PC_v)) = Σ_{p ∉ V_core, t ∈ V_core} weight(p, t);
        calculate weight_out(p, Core(PC_v)) = Σ_{p ∉ V_core, t ∉ V_core} weight(p, t);
        if weight_in(p, Core(PC_v)) ≤ weight_out(p, Core(PC_v)) and d_weight(p, Core(PC_v)) ≥ 1/2 weight_avg(Core(PC_v)) then
            Overlapping(PC_v) = p ∪ Overlapping(PC_v); // add p to Overlapping(PC_v).
        end if
    end for
    obtain a candidate peripheral protein subset, CP(PC_v), given by CP(PC_v) = CAP − Overlapping(PC_v);
    calculate weight_avg(CP(PC_v));
    for p ∈ CP(PC_v) do
        if weight_in(p, Core(PC_v)) > weight_out(p, Core(PC_v)) and weight_in(p, Core(PC_v)) ≥ weight_avg(CP(PC_v)) then
            Periphery(PC_v) = p ∪ Periphery(PC_v); // add p to Periphery(PC_v).
        end if
    end for
    Attachment(PC_v) = Periphery(PC_v) ∪ Overlapping(PC_v);
    add Attachment(PC_v) to AP;
end for
return the set of identified candidate attachment proteins, AP.
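The attachment detection rules for a single complex core can be sketched as below. This is an illustrative sketch, not the released EWCA code: the function names, the edge-dictionary representation (each undirected edge stored once) and the toy weights are assumptions, and weight_avg(CP(PC_v)) is assumed to be the mean weighted in-connectivity over the candidate peripheral proteins, since its formula is not given in this text.

```python
def edge_weight(weights, a, b):
    """Weight of the undirected edge (a, b), or 0.0 if the edge is absent."""
    return weights.get((a, b), weights.get((b, a), 0.0))


def detect_attachments(weights, core):
    """Return (periphery, overlapping) for one complex core."""
    adj = {}
    for a, b in weights:  # build undirected adjacency sets
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    core = set(core)

    # Candidates: non-core neighbors linked to at least two core proteins.
    cap = {p for c in core for p in adj[c]
           if p not in core and len(adj[p] & core) >= 2}

    # weight_avg(Core) = 2 * (total weight inside the core) / |V_core|.
    internal = sum(w for (a, b), w in weights.items() if a in core and b in core)
    avg_core = 2 * internal / len(core)

    def w_in(p):
        return sum(edge_weight(weights, p, t) for t in adj[p] & core)

    def w_out(p):
        return sum(edge_weight(weights, p, t) for t in adj[p] - core)

    overlapping = {p for p in cap
                   if w_in(p) <= w_out(p) and w_in(p) >= 0.5 * avg_core}

    cp = cap - overlapping  # candidate peripheral proteins
    periphery = set()
    if cp:
        avg_cp = sum(w_in(p) for p in cp) / len(cp)  # assumed definition
        periphery = {p for p in cp if w_in(p) > w_out(p) and w_in(p) >= avg_cp}
    return periphery, overlapping


# Hypothetical weighted toy network around the core {a, b, c}.
weights = {
    ("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "c"): 1.0,
    ("p", "a"): 0.8, ("p", "b"): 0.8,
    ("o", "a"): 0.6, ("o", "b"): 0.6, ("o", "x"): 0.9, ("o", "y"): 0.9,
    ("q", "b"): 0.2, ("q", "c"): 0.2, ("q", "x"): 0.1,
}
periphery, overlapping = detect_attachments(weights, {"a", "b", "c"})
print(periphery, overlapping)
```

In this toy example, o is tied more strongly to proteins outside the core but still reaches half of the core's average weight, so it is labelled overlapping; p is tied only to the core and is labelled peripheral; q is a candidate but falls below the candidate average and is left out.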

Protein complex formation
After we have obtained the set of identified preliminary complex cores and the set of identified candidate attachment proteins, we combine each preliminary complex core with its attachment proteins to form the final set of unique complexes (PC_v), i.e.,

$$PC_v = Core(PC_v) \cup Attachment(PC_v)$$

Furthermore, we discard protein complexes with a size of less than 3 proteins. Moreover, because different protein complex cores may produce the same identified protein complexes, some redundant protein complexes are identified. When several identified protein complexes completely overlap with each other, only one of them is retained while the others are removed as redundant protein complexes. The detailed pseudo-code of this phase is shown in Algorithm 4.

Algorithm 4
...
end for
discard the protein complexes with size less than 3 in PCs;
remove the same (redundant) protein complexes in PCs;
return the set of identified protein complexes, PCs.
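The formation step described above (union of core and attachments, size filter, removal of identical complexes) can be sketched as follows. The function name, the input format (a list of core and attachment set pairs) and the toy data are illustrative assumptions.

```python
def form_complexes(cores_with_attachments, min_size=3):
    """Merge each core with its attachment proteins, drop complexes smaller
    than min_size, and keep one copy of identical complexes."""
    complexes = set()  # frozensets make identical complexes collapse
    for core, attachments in cores_with_attachments:
        complex_members = frozenset(core) | frozenset(attachments)
        if len(complex_members) >= min_size:
            complexes.add(complex_members)
    return complexes


# Hypothetical input: two different cores that grow into the same complex,
# plus one core without attachments that is too small to keep.
pairs = [
    ({"a", "b"}, {"c"}),
    ({"b", "c"}, {"a"}),
    ({"x", "y"}, set()),
]
complexes = form_complexes(pairs)
print(complexes)
```

Only exact duplicates are removed here, matching the statement that complexes which completely overlap are redundant; partially overlapping complexes are kept.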

Experimental datasets
We conduct experiments on three PPI networks of S. cerevisiae extracted from DIP [55], BioGRID [56] and Yeast [57], respectively. The general properties of the datasets are shown in Table 1. For human, the PPI network is constructed by combining the data from Human [57]. For more details about the Yeast and Human datasets, see Ref. [57]. For yeast, three reference sets of protein complexes are used in our experiments. One set comprises hand-curated complexes from CYC2008 [23], and another set is NewMIPS, which is generated from MIPS [58], Aloy [59] and the Gene Ontology (GO) annotations in the SGD database [60]. The last set, Yeast complexes [57], comes from the Wodak database (CYC2008) [23], PINdb and GO complexes. For human, Human complexes [57] are collected from the Comprehensive Resource of Mammalian protein complexes (CORUM) [61], protein complexes annotated by GO [62], the Proteins Interacting in the Nucleus database (PINdb) [63] and KEGG modules [64]. For all of them, we only keep the complexes with size no less than 3. The general properties of the reference complex sets are shown in Table 2.

Evaluation metrics
Several evaluation metrics can be used to perform comprehensive comparisons, such as recall, precision and F-measure. Here, we employ them as suggested by previous studies [13,16,65]. Overall, five types of evaluation metrics are used to evaluate the quality of the identified complexes and to compare the overall performance of the identification methods. These evaluation measures are defined as follows.

Recall, precision and F-measure
Generally speaking, clustering results are evaluated in terms of recall, precision, and F-measure. Recall [66], also termed the true positive rate or sensitivity, is the ratio of the number of proteins in both identified complexes and reference complexes to the number of proteins in the reference complexes. Precision [66] is the ratio of the maximal number of common vertices in both identified complexes and reference complexes to the number of vertices in the identified complexes. The F-measure is the harmonic mean of recall and precision [66], and it is used to evaluate the accuracy of the identified complexes. The F-measure evaluates not only how accurately the identified complexes match the reference complexes but also how accurately the reference complexes match the identified complexes.
Let P = {p 1 , p 2 , ..., p k } be the set of complexes generated by an identification method and R = {r 1 , r 2 , ..., r l } be the set of reference complexes. For any identified complex p i and reference complex r j , we first introduce the neighborhood affinity NA(p i , r j ), which is defined as follows [16,65,67]:
$$NA(p_i, r_j) = \frac{|N_{p_i} \cap N_{r_j}|^2}{|N_{p_i}| \times |N_{r_j}|} \quad (8)$$
Here, the neighborhood affinity NA(p i , r j ) measures the similarity between an identified complex and a reference complex, and it quantifies the closeness between them. |N p i | is the size of the identified complex, |N r j | is the size of the reference complex, and |N p i ∩ N r j | is the number of proteins common to the identified and reference complexes. The larger the value of NA(p i , r j ), the closer the two complexes are. If NA(p i , r j ) ≥ t, then p i is considered to match r j , where t is a predefined threshold. In this paper, we set t = 0.2, which is consistent with previous studies [16,65]. With the neighborhood affinity defined, we now define recall, precision and F-measure. N mr is the number of reference complexes that match at least one identified complex, i.e., N mr = |{r|r ∈ R, ∃p ∈ P, NA(r, p) ≥ t}|. N mp is the number of identified complexes that match at least one reference complex, i.e., N mp = |{p|p ∈ P, ∃r ∈ R, NA(p, r) ≥ t}|. Recall and precision are defined as follows [68]:
$$Recall = \frac{N_{mr}}{|R|} \quad (9)$$
and
$$Precision = \frac{N_{mp}}{|P|} \quad (10)$$
In general, a larger protein complex has higher recall, while a smaller protein complex has higher precision. Therefore, the F-measure is defined as the harmonic mean of recall and precision. The corresponding formula is as follows [69]:
$$F\text{-}measure = \frac{2 \times Recall \times Precision}{Recall + Precision} \quad (11)$$
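The matching-based metrics defined above can be computed directly from sets of protein identifiers. The following is a minimal sketch, assuming each complex is a non-empty Python set; the function names are illustrative and do not come from the EWCA software.

```python
from typing import List, Set, Tuple


def neighborhood_affinity(p: Set[str], r: Set[str]) -> float:
    """NA(p, r) = |p ∩ r|^2 / (|p| * |r|); assumes both sets are non-empty."""
    common = len(p & r)
    return common * common / (len(p) * len(r))


def recall_precision_f(
    identified: List[Set[str]],
    reference: List[Set[str]],
    t: float = 0.2,
) -> Tuple[float, float, float]:
    """Return (recall, precision, F-measure) for matching threshold t."""
    # N_mr: reference complexes matched by at least one identified complex.
    n_mr = sum(
        1 for r in reference
        if any(neighborhood_affinity(p, r) >= t for p in identified)
    )
    # N_mp: identified complexes matched by at least one reference complex.
    n_mp = sum(
        1 for p in identified
        if any(neighborhood_affinity(p, r) >= t for r in reference)
    )
    recall = n_mr / len(reference)
    precision = n_mp / len(identified)
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```

Because NA is symmetric in its two arguments, the same helper serves both the N mr and the N mp counts.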

Coverage rate and MMR
The coverage rate (CR) is used to assess how many proteins in the reference complexes are covered by the identified complexes [70,71]. In detail, given the set of reference complexes R and the set of identified complexes P, the |R| × |P| matrix T is constructed, where each element T ij is the number of proteins in common between the ith reference complex and the jth identified complex. The coverage rate is defined as:
$$CR = \frac{\sum_{i=1}^{|R|} \max_{j} T_{ij}}{\sum_{i=1}^{|R|} N_i} \quad (12)$$
where N i is the number of proteins in the ith reference complex. The MMR metric, which is strongly recommended by Nepusz et al. [13], measures the maximal matching between reference complexes and identified protein complexes. As discussed by the authors, it penalizes methods that tend to split a reference complex into more than one part in the identified complexes. To do so, a bipartite graph is built from two sets of vertices, the identified complexes and the reference complexes, and the edge between an identified complex and a reference complex is weighted by the matching score NA(A, B) (see Eq. (8)). The MMR score is the total weight of the edges selected by the maximum weighted bipartite matching, divided by the number of known complexes. For more details about computing MMR, please refer to reference [13]. The above three kinds of metrics are independent and can work together to evaluate the performance of protein complex identification methods [13].
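A small sketch may help make the two definitions concrete. It assumes complexes are non-empty Python sets and, to stay dependency-free, finds the maximum weighted bipartite matching by exhaustive search, which is only practical for toy inputs; a real evaluation would use a polynomial-time assignment solver. The function names are illustrative.

```python
from itertools import permutations
from typing import List, Set


def neighborhood_affinity(a: Set[str], b: Set[str]) -> float:
    """NA(a, b) = |a ∩ b|^2 / (|a| * |b|); assumes non-empty sets."""
    common = len(a & b)
    return common * common / (len(a) * len(b))


def coverage_rate(reference: List[Set[str]], identified: List[Set[str]]) -> float:
    """Sum of the best overlap per reference complex over total reference size."""
    covered = sum(
        max((len(r & p) for p in identified), default=0) for r in reference
    )
    return covered / sum(len(r) for r in reference)


def mmr_bruteforce(reference: List[Set[str]], identified: List[Set[str]]) -> float:
    """Maximum matching ratio via exhaustive one-to-one matching (toy sizes only)."""
    w = [[neighborhood_affinity(r, p) for p in identified] for r in reference]
    n_r, n_p = len(reference), len(identified)
    best = 0.0
    if n_r <= n_p:
        # Assign every reference complex to a distinct identified complex.
        for cols in permutations(range(n_p), n_r):
            best = max(best, sum(w[i][j] for i, j in enumerate(cols)))
    else:
        # Assign every identified complex to a distinct reference complex.
        for rows in permutations(range(n_r), n_p):
            best = max(best, sum(w[i][j] for j, i in enumerate(rows)))
    # Normalise by the number of reference (known) complexes.
    return best / n_r
```

Because all edge weights are non-negative, a matching that saturates the smaller side always attains the maximum total weight, so enumerating those matchings is sufficient. When a reference complex is split into two identified halves, only one half can be matched to it, which is how MMR penalizes splitting.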

Analysis of function enrichment
Moreover, because of the limitations of laboratory-based experiments, the set of known protein complexes is incomplete. Therefore, many researchers [7,72] annotate the main biological functions of identified complexes by using the p-value formulated in Eq. (13). We also adopt a function enrichment test to demonstrate the biological significance of the identified protein complexes. Given an identified protein complex containing C proteins, the p-value is the probability of observing m or more proteins from the complex by chance in a biological function shared by F proteins from a total genome size of N proteins:
$$p\text{-}value = 1 - \sum_{i=0}^{m-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}} \quad (13)$$
Here, N is the total number of vertices in the PPI network, C is the size of the identified complex, F is the size of a functional group, and m is the number of proteins of the functional group in the identified complex. The p-value is calculated on biological process ontologies. The smaller the p-value of a protein complex, the greater its biological significance. In general, if the p-value is lower than 0.01, the protein complex is considered to be significant.
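The enrichment p-value is the upper tail of a hypergeometric distribution and can be evaluated with exact integer arithmetic. The sketch below is illustrative (the function name is an assumption); it sums the upper tail directly, which equals one minus the lower-tail sum but avoids losing precision when the p-value is very small.

```python
from math import comb


def enrichment_p_value(N: int, C: int, F: int, m: int) -> float:
    """Probability of seeing m or more annotated proteins in a complex of size C.

    N: total number of proteins, F: proteins annotated with the function,
    C: size of the identified complex, m: annotated proteins in the complex.
    """
    # Upper tail: i = m, ..., min(C, F). math.comb returns 0 when k > n,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(
        comb(F, i) * comb(N - F, C - i) for i in range(m, min(C, F) + 1)
    )
    return tail / comb(N, C)
```

For example, `enrichment_p_value(10, 3, 4, 2)` evaluates (36 + 4) / 120, which is 1/3.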

Comparison between different methods
To demonstrate the effectiveness of EWCA in identifying protein complexes, we compare EWCA with twelve existing state-of-the-art protein complex identification algorithms: MCL, CFinder, Core, DPClus, COACH, SPICi, ClusterONE, PEWCC, GMFTP, CMC, ProRank+ and DPC. To be fair to each compared method, we follow the strategy used in [6,13]: the parameters of each compared method are either tuned with respect to the reference complexes to generate its best result or set as suggested by its authors. More details and the selection of parameters for all the compared methods are supplied on the website (https://github.com/RongquanWang/EWCA/SupplementaryMaterial.docx). Here we choose the parameters that maximize the F-measure, because it balances the performance of all methods. The comparison results between EWCA and the other methods are shown in Tables 3 and 4, which give the overall performance of each method based on recall, precision, F-measure, MMR and CR.
Moreover, EWCA achieves nearly the highest F-measure, and its MMR is the highest, across the four combinations of the two PPI datasets and the two reference complex sets. Please note that we have removed identified complexes with two or fewer proteins, and that we do not supply any biological data (e.g., GO annotations) to EWCA or the other compared methods. Bold values indicate the best result among the compared methods. Because the F-measure is the harmonic mean of recall and precision, a higher F-measure is better. Table 3 shows the comprehensive comparison results on the unweighted networks in terms of the five criteria using the NewMIPS complexes. EWCA achieves the highest F-measure and MMR among all methods on both PPI datasets, which indicates that EWCA identifies protein complexes more accurately. In Table 3, when the BioGRID dataset is used as the input PPI network and NewMIPS as the reference complexes, EWCA obtains the highest F-measure, 0.6578, which reflects a better balance between recall and precision. Similarly, EWCA attains the highest values of MMR and CR. As shown in Table 3, EWCA achieves the highest recall of 0.7012, F-measure of 0.5830 and MMR of 0.3094 on the DIP PPI network, clearly outperforming the other methods. The higher MMR indicates that the protein complexes identified by EWCA obtain a better maximal one-to-one mapping to the NewMIPS complexes. In short, Table 3 shows that EWCA clearly outperforms the other methods on the NewMIPS complexes. Table 4 shows the overall comparative results on the unweighted networks using the CYC2008 complexes. In Table 4, when the PPI dataset is BioGRID, EWCA achieves the highest F-measure of 0.6752, whereas the second highest, obtained by ProRank+, is only 0.5104. This is the main difference between EWCA and the other methods, and it shows that EWCA has a clear advantage.
On the other criteria, EWCA is only slightly lower than the best of the other methods. Next, we compare EWCA with the other methods on the DIP PPI network. EWCA again outperforms the other methods, as shown in Table 4. The experimental results show that EWCA achieves the highest recall of 0.7076, the highest F-measure of 0.6020 and the highest MMR of 0.3766 on the DIP PPI network. This indicates that our identified protein complexes match the reference complexes significantly better than those of the other methods. In terms of CR, EWCA is slightly lower than the best method, GMFTP, on the DIP PPI network. For the other assessment measures, EWCA is very close to the best on the DIP dataset, as shown in Table 4. Meanwhile, the experimental results with CYC2008 as reference complexes are basically consistent with those with NewMIPS as reference complexes. In summary, EWCA achieves better performance on the two PPI networks and is competitive with or superior to the existing protein complex identification methods. In particular, EWCA achieves a consistently better F-measure and MMR than the other twelve methods. Tables 3 and 4 present the comparison results under the two reference complex sets.

Analysis of function enrichment
Since the reference complexes are incomplete, to further validate the effectiveness of EWCA, we investigate the biological significance of our identified protein complexes. Each identified complex is associated with a p-value (as formulated in Eq. (13)) for gene ontology (GO) annotation. In general, a complex identified by any method is considered biologically significant if its p-value is less than 1E-2. The lower the p-value of an identified complex, the greater its statistical and biological significance. We calculate the p-value of identified complexes based on biological process ontologies by using the web service GO Term Finder (https://www.yeastgenome.org/goTermFinder) [73], which is provided by SGD [74]. Here, for each identified complex, we use the smallest p-value over all possible gene ontology terms to represent its functional homogeneity. Besides analyzing the protein complexes identified by EWCA, we also calculate the p-values of protein complexes of size 3 or greater identified by CMC, PEWCC, GMFTP, COACH, ProRank+ and DPC. These methods are selected for comparison with EWCA because all of them performed well on the two test PPI networks, as shown in Tables 3 and 4. The results of the p-value test for CMC, PEWCC, GMFTP, COACH, ProRank+, DPC and EWCA are presented in Table 5. To compare the biological significance of different algorithms, for each algorithm we calculate the number of identified complexes, and the number and proportion of identified complexes whose p-value falls within different value ranges. Most previous algorithms take into account only the proportion of identified complexes. However, the p-value of an identified protein complex is closely related to its size [16].
Therefore, we should consider both the number and the proportion of identified complexes when analyzing the function enrichment of identified protein complexes. As Table 5 shows, on the BioGRID dataset, the proportion of significant protein complexes identified by EWCA is 96.62 percent, which is about 1 percentage point lower than the best method COACH and 0.97 percentage point lower than the second best method ProRank+. This may be because EWCA detects many more protein complexes than COACH and ProRank+, and the protein complexes identified by EWCA are relatively smaller than those of other algorithms, such as ProRank+. However, the number of protein complexes identified by EWCA is 1341, which is far more than COACH and ProRank+.
On the DIP dataset, the proportion of significant protein complexes identified by EWCA is 90.15 percent, which is about 4 percentage points lower than the best method ProRank+. Meanwhile, the number of protein complexes identified by EWCA is also the largest. Similarly, the numbers of protein complexes identified by CMC and GMFTP on the BioGRID dataset are 1113 and 2167, respectively. The numbers of protein complexes identified by PEWCC and DPC on the BioGRID dataset are 676 and 622, respectively. Generally, the smaller the number of identified protein complexes, the higher the proportion of significant complexes. In fact, the number of identified protein complexes by CMC, GMFTP and PEWCC is much smaller than EWCA. However, their percentage of significant protein complexes is lower than that of EWCA. All in all, EWCA has more practical and biological significance than the other methods in terms of both the number and the proportion of identified complexes. According to their p-values, the protein complexes identified by EWCA have a higher possibility of being confirmed as real protein complexes through laboratory experiments in the future.
To further reveal the biological significance of identified complexes, five identified protein complexes with very low p-values, provided by EWCA on different datasets, are presented in Table 6, which lists the p-value (Biological Process), cluster frequency and Gene Ontology term of each protein complex. The third column of Table 6 shows the cluster frequency. From this column, we can see that many of our identified protein complexes match well with the Gene Ontology term. The p-values of the identified complexes in Table 6 are very low, which further demonstrates that the identified protein complexes have high statistical significance.
Furthermore, we discover many identified protein complexes with a cluster frequency of 100%. Examples are listed in Table 7. Such identified protein complexes are probably real protein complexes, and they provide meaningful references for related researchers.

Parameter selection
In this experiment, we introduce a user-defined parameter, structural similarity (ss), and study its effect on protein complex identification. To investigate the effect of ss on the performance of EWCA, we evaluate the identification accuracy while varying ss from 0.1 to 1.0 in increments of 0.1. Note that ss must be greater than 0; ss = 0 is not allowed. Figures 3 and 4 show how the performance of EWCA fluctuates under various values of ss on the DIP dataset and the BioGRID dataset, respectively. Figures 3 and 4 indicate that EWCA performs best when ss is set to 0.4. As shown in Figs. 3 and 4, as ss increases, recall, MMR and CR decrease while precision increases. The trends are almost the same in all cases. Furthermore, we study the behavior of EWCA in terms of F-measure. Notably, on the DIP dataset, the F-measure increases gradually with ss until ss = 0.4, where it reaches its maximum of 0.6020 and 0.5830 with the CYC2008 and NewMIPS reference complexes, respectively. As ss increases further, the F-measure shows different trends, but it always stays below its value at ss = 0.4. For the DIP dataset, we therefore set ss = 0.4. Similarly, on the BioGRID dataset, the F-measure increases with ss and reaches 0.6752 and 0.6578 with the CYC2008 and NewMIPS reference complexes, respectively, when ss = 0.4, which is the optimal value as shown in Fig. 4. In the remaining experiments, we set ss = 0.4.
Fig. 4 The effect of ss. Performance of EWCA with different structural similarity thresholds ss is measured by all evaluation metrics, with respect to the CYC2008 and NewMIPS standard complex sets. The x-axis denotes the value of structural similarity and the y-axis denotes the evaluation metrics on the BioGRID dataset. The F-measure is maximised at ss = 0.4 on the unweighted BioGRID dataset.

As a result, we recommend a range of 0.4 to 0.6 for ss, because the F-measure does not change significantly in this range.
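The parameter study amounts to a one-dimensional grid search over ss. A minimal sketch is given below; `score_fn` is a hypothetical callable that would run the identification method for a given ss and return its F-measure against a reference set, and is not part of the EWCA software.

```python
from typing import Callable, Dict, Tuple


def sweep_structural_similarity(
    score_fn: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Evaluate score_fn at ss = 0.1, 0.2, ..., 1.0 and return the best ss.

    score_fn is an assumed user-supplied function mapping ss to a score
    such as the F-measure; ties are resolved in favour of the smallest ss.
    """
    # i / 10 avoids the rounding drift of repeatedly adding 0.1.
    candidates = [i / 10 for i in range(1, 11)]
    scores = {ss: score_fn(ss) for ss in candidates}
    best_ss = max(candidates, key=lambda ss: scores[ss])
    return best_ss, scores
```

Returning the full score table alongside the best value makes it easy to check how flat the curve is around the optimum, which is the basis for recommending a range rather than a single value.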

Time complexity analysis
In this section, we analyze the computational complexity of the EWCA algorithm. All experiments are run on a computer with an Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz and 12.0 GB of memory. For simplicity, we run all the programs with their default parameters. All reported run times are wall-clock times for running the protein complex identification methods. Furthermore, because accuracy is the most important property of a protein complex identification method, we select only the comparison methods with high accuracy according to Tables 3 and 4 for the efficiency comparison.
We now analyze the computational complexity of EWCA. Given a graph with m edges and n vertices, EWCA first executes Algorithm 1. For each edge, EWCA computes the weight of the edge. For each vertex, EWCA visits its direct neighbors. Here, we use an adjacency list, a data structure in which each vertex has a list of all its neighbor vertices. The cost of a neighborhood query is proportional to the number of neighbors, that is, the degree of the query vertex. Therefore, the total cost is O(deg(v 1 ) + deg(v 2 ) + ... + deg(v n )), where deg(v i ), i = 1, 2, ..., n is the degree of vertex v i . Since the degrees of all vertices sum to 2m, this cost is O(m). EWCA then executes Algorithm 3. We assume that the number of preliminary complex cores obtained by Algorithm 2 is |N(PCC)|. The value of |N(PCC)| must be lower than n. Let us assume that the average degree is k in a given PPI network. Real PPI networks are generally sparse and follow a power-law degree distribution [47]. Thus, k is generally quite a small constant. For each preliminary complex core pcc i , we assume that its size during expansion is |n(pcc i )|. Next, we obtain a candidate attachment protein subset |Neighbor(pcc i )| from the neighbors of the preliminary complex core pcc i .
The time complexity of this process is O(|n(pcc i )| * k).
After we have a candidate attachment protein subset |Neighbor(pcc i )|, we judge whether each candidate vertex p should be added to pcc i according to the conditions given in the attachment protein detection section. In this paper, for PEWCC, COACH and ProRank+, we use the default parameter values suggested by their authors. Similarly, because EWCA has only the structural similarity parameter, to ensure fairness we also use its default value of 0.4 to obtain experimental results. We run EWCA and the previous clustering algorithms that achieve higher accuracy according to Tables 3 and 4 on two smaller PPI network datasets. To show that EWCA is efficient as well as accurate, we also run them on two slightly larger PPI networks. Table 8 gives the accuracy and runtime of each algorithm on the PPI networks of two species. As Table 8 shows, EWCA not only has high accuracy but also needs less time than the other methods. All in all, EWCA achieves a better balance between accuracy and efficiency.
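The neighborhood queries used in the complexity analysis above can be illustrated with a plain adjacency list. This is a sketch under assumed names (`build_adjacency`, `candidate_attachments`), not code from the EWCA software; it only shows why collecting the candidate attachment proteins of a core costs roughly the core size times the average degree.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency list; each edge is stored at both ends."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def candidate_attachments(adj: Dict[str, Set[str]], core: Set[str]) -> Set[str]:
    """Neighbors of any core member that are not themselves in the core."""
    candidates: Set[str] = set()
    for v in core:                        # one iteration per core member
        candidates |= adj.get(v, set())   # cost proportional to deg(v)
    return candidates - core
```

Summing the neighbor-list lengths over all vertices gives exactly twice the number of edges, which is why visiting every vertex's neighbors is linear in m.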

Novelty of the EWCA approach
Compared to earlier protein complex identification methods, EWCA possesses several advantages that are enumerated below.
1. As is well known, the reliability of existing PPIs has a great effect on the accuracy of protein complex identification methods. Following the literature [44,46], we define a high-order neighborhood-based method based on the Jaccard measure to assess the similarity of interactions.
2. The density-based methods and the core-attachment structure based methods [7,11,12,15,16] have achieved good performance; compared with these methods, EWCA also considers the core-attachment structure and can identify protein complexes with varying densities.
3. Furthermore, EWCA has fewer parameters and provides definitions to distinguish and identify local overlapping proteins and peripheral proteins.
4. Finally, Wang et al. [14] also consider the core-attachment structure: they use node degree and node betweenness to identify global overlapping proteins and seed proteins, and then use the modularity concept to predict overlapping protein complexes. However, the cost of their method increases with the number of nodes and edges in the PPI network, whereas EWCA achieves a better balance between accuracy and efficiency.
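To illustrate the kind of neighborhood-based similarity that the first point refers to, the sketch below computes the plain Jaccard coefficient of the neighbor sets of two interacting proteins. This is only the basic first-order Jaccard measure and is an assumption for illustration; it is not the high-order weighting formula used by EWCA, which is defined in the edge weighting section of the paper.

```python
from typing import Dict, Set


def jaccard_edge_weight(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """Plain Jaccard coefficient |N(u) ∩ N(v)| / |N(u) ∪ N(v)| of neighbor sets.

    Illustrative only: a first-order measure, not EWCA's own edge weight.
    """
    n_u = adj.get(u, set())
    n_v = adj.get(v, set())
    union = n_u | n_v
    if not union:
        return 0.0
    return len(n_u & n_v) / len(union)
```

An interaction whose endpoints share many neighbors receives a weight close to 1, while an interaction between proteins with no common neighbors receives 0, which is the intuition behind using shared neighborhoods to assess reliability.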

Conclusion
In this paper, we have proposed a new method to identify protein complexes by identifying complex cores and attachment proteins. Our main contributions are as follows: (1) we define a new high-order topological similarity measure to weight each edge; (2) we further extend the protein complex core identification methods by using the concept of structural similarity; and (3) we propose a new method to distinguish and identify local overlapping and peripheral proteins. Through comparative analysis with other methods, the experimental results indicate that EWCA is more effective and accurate. Furthermore, each method has unique characteristics, and it is important to select a clustering method suitable for the purpose at hand. Additionally, EWCA can balance various assessment measures, which means that EWCA provides more insight for future biological studies. We envisage the following research directions. The available PPI data are full of noise caused by high false-positive and false-negative rates [75]. To overcome this issue, there are two approaches: reconstructing a reliable PPI network by predicting new interactions among proteins [76], and designing noise-robust methods [77,78]. In fact, methods that integrate the two strategies could enhance the performance. In addition, EWCA could be applied to cluster other biological networks, such as metabolic networks and gene regulatory networks, and it can also be used to tackle massive networks. We will further explore these applications in our future work.