Detecting overlapping protein complexes in weighted PPI network based on overlay network chain in quotient space

Background: Protein complexes are the cornerstones of many biological processes; proteins gather into complexes to form various types of molecular machinery that perform a vast array of biological functions. A protein may belong to multiple protein complexes, yet most existing protein complex detection algorithms cannot reflect overlapping protein complexes. To solve this problem, a novel algorithm for identifying overlapping protein complexes is proposed.
Results: A new clustering algorithm based on the overlay network chain in quotient space, denoted ONCQS, is proposed to detect overlapping protein complexes in weighted PPI networks. GO annotation data are used to weight the PPI network. In the quotient space, a multilevel overlay network is constructed from maximal complete subgraphs to mine overlapping protein complexes. The overlay network chain in quotient space is calculated according to the compatibility relation, and the protein complexes are contained in the last level of the overlay network. Experiments were carried out on four PPI databases, and ONCQS was compared with five other state-of-the-art methods for identifying protein complexes.
Conclusions: We applied ONCQS to four PPI databases (DIP, Gavin, Krogan and MIPS). The results show that it outperforms five existing algorithms (MCODE, MCL, CORE, ClusterONE and COACH) in detecting overlapping protein complexes.


Introduction
Analyzing the mechanism of proteins is crucial for understanding the function of cell machinery and explaining biological processes [1]. Proteins often bind together to form complexes to carry out their biological functions [2,3]. A protein complex is a molecular group of two or more functionally related proteins assembled via multiple protein interactions [4]. Detecting protein complexes is of great significance in biology and proteomics [5]. In the early stage of protein complex research, protein complexes were found mainly through experimental methods such as RNA interference, conditional gene knockout, single gene knockout and co-immunoprecipitation [6,7]. However, these methods are costly and time-consuming.
High-throughput techniques have generated a large amount of protein-related data. In 2001, Legrain et al. [8] described protein-protein interactions (PPI) as an undirected graph G(V, E), where the node set V represents proteins and the edge set E represents protein-protein interactions. This idea transforms large-scale protein-protein interaction data into a network structure, which prompted researchers to identify protein complexes based on the topological properties of protein networks. In 2003, Bader and Hogue [9] proposed MCODE, a local-search method that detects protein complexes based on the connectivity values of proteins in the PPI network. In 2006, Gavin et al. [10] demonstrated that protein complexes are made up of cores and additional attachment proteins or protein modules. Following this core-attachment structure, Leung et al. [11] designed the CORE algorithm, which calculates the p-value for all pairs of proteins to detect cores, and Wu et al. [12] proposed the COACH algorithm, which detects dense subgraphs as cores. In 2009, Liu et al. [13] presented a method called CMC, which identifies protein complexes based on maximal cliques. In fact, a protein may belong to multiple protein complexes, so protein complexes may overlap. In 2012, Nepusz et al. developed the clustering algorithm ClusterONE [14] to detect overlapping protein complexes. Recently, attributed network embedding methods have been proved to be remarkably effective in generating vector representations for nodes in a network [15]. Xu et al. designed a method, GANE, to predict protein complexes based on Gene Ontology attributed network embedding [15].
Some classical clustering algorithms, such as Markov Clustering (MCL) [16] and swarm intelligence optimization algorithms [17,18], were also developed to detect protein complexes. Lei et al. [19] proposed the F-MCL clustering model based on Markov clustering, in which the parameters are adjusted automatically by introducing the firefly algorithm. Wang et al. [4] developed a heuristic graph clustering algorithm called HGCA based on multiple topological characteristics.
In recent years, quotient space theory has been applied to clustering. Zhang [20] defined the fuzzy equivalence relation and the stratified hierarchical structure, and established the fuzzy granular computing model in quotient space in order to solve problems involving uncertainty. Xu [21] proposed a fuzzy clustering method based on the Gaussian function. Using the properties of distance metric spaces, the method merges individual particles by information synthesis to obtain the clustering results. A cluster analysis method [22] based on fuzzy similarity relations and normalized distance was proposed for data structure analysis of complex systems, and its conclusions are suitable for complicated systems.
In this study, a new clustering algorithm based on the overlay network chain in quotient space, denoted ONCQS, is proposed to detect overlapping protein complexes in weighted PPI networks. First, GO annotation data are used to weight the PPI network. Then, the maximal complete subgraphs of the PPI network are found. Each maximal complete subgraph of the current network is regarded as a node in the next level of the network. The overlay network chain in quotient space is calculated according to the compatibility relation, and the protein complexes are contained in the last level of the overlay network. ONCQS is tested on four well-known PPI databases: DIP [23], Gavin [10], Krogan [24] and MIPS [25]. The experimental results show that ONCQS outperforms the other five algorithms in mining protein complexes.

Constructing weighted PPI network
It is inaccurate to mine protein complexes directly in PPI networks because the data produced by high-throughput experiments contain a high rate of false positive and false negative interactions [26,27]. To address this problem, some researchers integrate biological data on proteins, such as gene expression data, subcellular localization data and GO annotation data [28,29], to increase the reliability and accuracy of the data. A protein complex is a group of two or more associated polypeptide chains. Different polypeptide chains may have the same functions, so we integrate GO annotation data to measure the interactions. If two interacting proteins $v_i$ and $v_j$ share more GO annotations, their functions are more similar and their interaction is considered more reliable. The weight $W_{v_i,v_j}$ between proteins $v_i$ and $v_j$ is defined in Eq. (1), where $GO_{v_i}$ and $GO_{v_j}$ are the GO annotation sets of nodes $v_i$ and $v_j$, respectively, and $|GO_{v_i} \cap GO_{v_j}|$ is the number of annotations common to $GO_{v_i}$ and $GO_{v_j}$. Our previous research shows that the results are better when only interactions with $W_{v_i,v_j}$ greater than 0.6 are kept [30]. If the weight between proteins $v_i$ and $v_j$ is less than 0.6, the interaction is deleted from the PPI network. This preprocessing step helps filter out possible false positive interactions [31].
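The weighting and filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact form of the weight in Eq. (1) is not reproduced in this text, so the sketch takes the weight function as a parameter and uses a Jaccard ratio of GO term sets purely as a stand-in. The names `jaccard_weight` and `filter_edges` are hypothetical; only the 0.6 threshold comes from the text.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def jaccard_weight(go_a: Set[str], go_b: Set[str]) -> float:
    """Stand-in weight (an assumption, not Eq. (1)): shared GO terms / all GO terms."""
    union = go_a | go_b
    return len(go_a & go_b) / len(union) if union else 0.0


def filter_edges(
    edges: Iterable[Edge],
    go: Dict[str, Set[str]],
    threshold: float = 0.6,
    weight: Callable[[Set[str], Set[str]], float] = jaccard_weight,
) -> Dict[Edge, float]:
    """Weight every interaction and drop those whose weight is below the threshold."""
    kept: Dict[Edge, float] = {}
    for a, b in edges:
        # Proteins without annotations get an empty GO set, hence weight 0.
        w = weight(go.get(a, set()), go.get(b, set()))
        if w >= threshold:
            kept[(a, b)] = w
    return kept
```

To reproduce the paper's weighting, a function implementing Eq. (1) would be passed as `weight` in place of the stand-in; the filtering logic stays the same.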

Quotient space theory
Granular computing simulates the global analysis ability of human beings. One of the basic characteristics of human problem solving is the ability to conceptualize the world at different granularities, to move easily from one level of abstraction to another, and to deal with the levels hierarchically. Human beings can thus solve problems in granularity spaces of different sizes, where different levels represent different granularities. There are three main theories of granular computing: granular computing based on fuzzy logic [32], granular computing based on rough sets, and granular computing based on quotient space. Granularity analysis is in fact the analysis of the quotient set.
The triple (X, F, T) is used to represent a problem in the quotient space. The domain X is the universe of discourse, F is the attribute set of X, and T is the structure of X. Given a relation R on the universe of discourse X, we construct the corresponding quotient set [X], quotient attribute set [F] and quotient structure [T], and then define the granularity coefficient to study the quotient space ([X], [F], [T]). The relation R can be an equivalence relation or a compatibility relation.
For the PPI network G = (X, F, T), the domain X refers to the protein nodes in the PPI network.

Overlay network chain in quotient space
Given a network G, the set of maximal complete subgraphs of the network is regarded as a cover according to the compatibility relation [33]. The pseudo code of the maximal complete subgraph algorithm is shown in Table 1.
After the set of all maximal complete subgraphs is obtained, the maximal complete subgraphs are used as nodes. If two maximal complete subgraphs have common nodes, the two corresponding nodes are defined to be connected. The new network constructed in this way is called the 1st level overlay network of G in quotient space, denoted $G_1$. Figure 1 illustrates the construction of the overlay network chain in quotient space. The network G has 11 nodes ($v_1, v_2, \ldots, v_{11}$). There are 7 maximal complete subgraphs in G, so there are 7 nodes ($u_1, u_2, u_3, u_4, u_5, u_6, u_7$) in the 1st level overlay network $G_1$.
For example, $u_1$ and $u_2$ have the common node $v_5$, and $u_2$ and $u_3$ have the common node $v_6$, so $u_1$ and $u_2$ are connected in $G_1$, as are $u_2$ and $u_3$; $u_1$ and $u_3$ have no common nodes, so there is no connection between them in $G_1$. The network $G_1$ has two maximal complete subgraphs, so the 2nd level overlay network $G_2$ has 2 nodes ($w_1$, $w_2$), where $w_1$ represents ($u_1, u_2, u_4, u_5, u_7$) and $w_2$ represents ($u_2, u_3, u_6, u_7$). The nodes $w_1$ and $w_2$ have the common nodes ($u_2$, $u_7$), so they are connected in $G_2$. $G_1$ and $G_2$ are different levels of the overlay network of G, and ($G$, $G_1$, $G_2$) is called the overlay network chain.
If $G_i$ is the $i$th level overlay network of G and $G_{i+1}$ is the 1st level overlay network of $G_i$, then $G_{i+1}$ is the $(i+1)$th level overlay network of G. The sequence ($G$, $G_1$, $G_2$, …, $G_i$) is called the overlay network chain in quotient space [34].
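One level of this construction can be sketched in a few lines. The sketch below is an illustration under assumptions, not the pseudo code of Table 1: it enumerates maximal complete subgraphs with the basic Bron-Kerbosch recursion (a standard technique chosen here for brevity), and then connects two subgraphs whenever they share a node. The function names and the integer node labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Graph = Dict[int, Set[int]]


def build_adj(edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency map from an edge list."""
    adj: Graph = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def maximal_cliques(adj: Graph) -> List[FrozenSet[int]]:
    """Enumerate all maximal complete subgraphs (basic Bron-Kerbosch, no pivoting)."""
    cliques: List[FrozenSet[int]] = []
    if not adj:
        return cliques

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates to extend it, x: already explored nodes.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def overlay_level(adj: Graph) -> Tuple[List[FrozenSet[int]], Graph]:
    """Return the maximal cliques of `adj` and the next-level overlay network.

    Overlay node i stands for cliques[i]; two overlay nodes are connected
    when the corresponding cliques have at least one common node.
    """
    cliques = maximal_cliques(adj)
    overlay: Graph = {i: set() for i in range(len(cliques))}
    for i, j in combinations(range(len(cliques)), 2):
        if cliques[i] & cliques[j]:
            overlay[i].add(j)
            overlay[j].add(i)
    return cliques, overlay
```

Calling `overlay_level` on the returned overlay network yields the next level, so repeated application produces the successive networks of the chain. Without pivoting the recursion can be slow on dense graphs; it is kept simple here for readability.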

The ONCQS main algorithm
A new clustering algorithm, ONCQS, is developed to detect overlapping protein complexes in weighted PPI networks using the overlay network chain in quotient space. A protein may belong to multiple protein complexes. As shown in Fig. 2, two protein complexes in the CYC2008 benchmark, the eIF3 complex and the multi-eIF complex, have three overlapped proteins.
In the overlay network $G_i$, each node represents a maximal complete subgraph of the overlay network $G_{i-1}$. Maximal complete subgraphs may share nodes and edges. The protein complexes are contained in the last level of the overlay network, where each node can be regarded as a complex. Overlapping protein complexes can therefore be found by using the overlay network. As shown in Fig. 1, the two complexes obtained at the last level have four overlapped nodes.
In ONCQS, the static PPI network is described as an undirected graph G(V, E) consisting of a set of nodes V and a set of edges E, where the nodes in V represent proteins and E = {e($v_i$, $v_j$)} is the set of edges connecting pairs of proteins $v_i$ and $v_j$. First, we use GO annotation data to weight the PPI network, and then we construct the multilevel overlay network. In overlay network theory, if two maximal complete subgraphs have common nodes, the two corresponding nodes are defined to be connected. In ONCQS, however, formula (2) is used to measure the similarity of two maximal complete subgraphs $mcs_i$ and $mcs_j$:
$$ sim(mcs_i, mcs_j) = \frac{|mcs_i \cap mcs_j|}{|mcs_i \cup mcs_j|} \tag{2} $$
where $|mcs_i \cap mcs_j|$ is the number of nodes common to $mcs_i$ and $mcs_j$, and $|mcs_i \cup mcs_j|$ is the number of nodes in the union of $mcs_i$ and $mcs_j$. Only when $sim(mcs_i, mcs_j)$ is greater than the granularity coefficient gc are the two corresponding nodes defined to be connected in the next level overlay network. If no pair of maximal complete subgraphs in the $i$th level overlay network satisfies the similarity condition, the construction stops and the overlay network chain ($G$, $G_1$, $G_2$, …, $G_i$) is obtained. The pseudo code of the ONCQS algorithm is shown in Table 2.
Fig. 2 An example of overlapping protein complexes
At this point, each node in $G_i$ represents a protein complex. Each node represents a maximal complete subgraph, so the similarity among the proteins within a subgraph is high and the similarity between subgraphs is low.
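The connection rule used by ONCQS can be illustrated with a short sketch. It assumes that the similarity in formula (2) is the ratio of common nodes to nodes in the union, as the description of its terms suggests, and it covers only the linking step, not the full algorithm of Table 2. The names `sim` and `link_subgraphs` are hypothetical, and the default gc of 0.4 is the value selected later in the parameter analysis.

```python
from itertools import combinations
from typing import AbstractSet, Hashable, Sequence, Set, Tuple


def sim(a: AbstractSet[Hashable], b: AbstractSet[Hashable]) -> float:
    """Similarity of two maximal complete subgraphs: |a ∩ b| / |a ∪ b| (assumed form)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_subgraphs(
    subgraphs: Sequence[AbstractSet[Hashable]], gc: float = 0.4
) -> Set[Tuple[int, int]]:
    """Edges of the next-level overlay network.

    Subgraphs i and j become connected nodes only when their similarity
    is strictly greater than the granularity coefficient gc.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(subgraphs)), 2)
        if sim(subgraphs[i], subgraphs[j]) > gc
    }
```

An empty result from `link_subgraphs` corresponds to the stopping condition in the text: no pair satisfies the similarity condition, so the chain ends at the current level.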

Results and discussion
The proposed ONCQS algorithm is implemented in Matlab R2015b and executed on a PC with a 3.30 GHz quad-core processor and 8 GB of RAM.

Experimental data set
In this study, the developed methods and computational analysis were applied to four PPI networks: DIP [23], Gavin [10], Krogan [24] and MIPS [25]. All the data used in this study are Saccharomyces cerevisiae protein data.
Protein-protein interaction data: Noise, self-interactions and repeated interactions were removed from the four networks.
Gene Ontology data: The Saccharomyces cerevisiae GO annotation data were extracted from the GO-slims dataset. GO slims are cut-down versions of the GO ontologies [31]. They provide GO terms that describe gene product features in terms of biological process (BP), molecular function (MF) and cellular component (CC). We used GO slims to annotate the PPI data. There are 7014 proteins in the GO annotation data. Proteins with GO annotation data cover 98.23% of the proteins in DIP, 100% of the proteins in Gavin, 99.89% of the proteins in Krogan and 99.16% of the proteins in MIPS.
The standard protein complexes: CYC2008 [35], which includes 408 protein complexes, is used to evaluate the clustering results for Saccharomyces cerevisiae. Detailed information on the overlap among the experimental data sets is shown in Table 3.

Evaluation metrics
The overlapping score OS is used to evaluate the match quality between a predicted protein complex and a standard protein complex:
$$ OS(pc, sc) = \frac{|V_{pc} \cap V_{sc}|^2}{|V_{pc}| \times |V_{sc}|} \tag{3} $$
where $V_{pc}$ and $V_{sc}$ denote the node sets of the predicted protein complex pc and the standard protein complex sc, respectively. The threshold is usually set to 0.2 [17]. If OS(pc, sc) is greater than 0.2, the predicted protein complex pc is considered to match the standard protein complex sc. OS = 1 means that the predicted protein complex perfectly matches the standard protein complex. Three commonly used metrics, Precision, Recall and F-measure, are used to measure the efficiency of the proposed ONCQS algorithm and to evaluate the performance of the clustering results.
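The matching rule can be sketched as follows. The sketch assumes the commonly used form of the overlapping score, the squared number of shared proteins divided by the product of the two complex sizes, which is consistent with OS = 1 for a perfect match; the function names are hypothetical and only the 0.2 threshold is taken from the text.

```python
from typing import AbstractSet, Iterable, Optional, Tuple


def overlap_score(pc: AbstractSet[str], sc: AbstractSet[str]) -> float:
    """OS(pc, sc) = |pc ∩ sc|^2 / (|pc| * |sc|), assumed form of the overlapping score."""
    if not pc or not sc:
        return 0.0
    common = len(pc & sc)
    return common * common / (len(pc) * len(sc))


def is_match(pc: AbstractSet[str], sc: AbstractSet[str], threshold: float = 0.2) -> bool:
    """A predicted complex matches a standard complex when OS exceeds the threshold."""
    return overlap_score(pc, sc) > threshold


def best_match(
    sc: AbstractSet[str],
    predicted: Iterable[AbstractSet[str]],
    threshold: float = 0.2,
) -> Optional[Tuple[float, AbstractSet[str]]]:
    """Return (score, complex) for the best-scoring successful match, or None."""
    best: Optional[Tuple[float, AbstractSet[str]]] = None
    for pc in predicted:
        score = overlap_score(pc, sc)
        if score > threshold and (best is None or score > best[0]):
            best = (score, pc)
    return best
```

`best_match` keeps the highest-scoring predicted complex when several exceed the threshold, which mirrors the rule used later when individual standard complexes are compared.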
The Precision denotes the proportion of predicted protein complexes that are matched by the standard protein complexes, defined as follows:
$$ Precision = \frac{|mpc|}{|pc|} \tag{4} $$
where |pc| represents the number of predicted protein complexes and |mpc| denotes the number of predicted protein complexes matched by the standard protein complexes.
The Recall denotes the proportion of standard protein complexes that are matched by the predicted protein complexes, defined in Eq. (5):
$$ Recall = \frac{|msc|}{|sc|} \tag{5} $$
where |sc| represents the number of standard protein complexes and |msc| denotes the number of standard protein complexes matched by the predicted protein complexes. Precision and Recall describe the accuracy of the algorithm from different aspects. In order to consider these two indicators together, the F-measure is defined as the harmonic mean of Precision and Recall:
$$ F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{6} $$

Parameter analysis
The proposed ONCQS algorithm has only one parameter, the granularity coefficient gc. When the similarity of two maximal complete subgraphs is greater than gc, we consider them connected in the next level overlay network. If the value of gc is too small, the complexity of the algorithm increases; if the value of gc is too large, the accuracy of the algorithm decreases. It is therefore important to select an appropriate value of gc. Experiments on the four PPI databases with gc from 0.1 to 0.9 were carried out to verify the influence of gc. The results are shown in Table 4, where PC is the total number of predicted protein complexes, Perfect is the number of predicted protein complexes that perfectly match a standard complex (OS(pc, sc) = 1), and AS is the average size of the predicted protein complexes. F-measure reflects the effectiveness of the algorithm, and Perfect reflects its accuracy. In order to consider the impact of gc on the performance of the algorithm comprehensively, we performed min-max normalization on F-measure and Perfect. The parameter F is defined as the harmonic mean of the normalized F-measure and Perfect values, as given in Eq. (9).
The influence of the parameter gc is shown in Fig. 3. The F value is highest when gc equals 0.4 on DIP, Gavin and Krogan. On MIPS, the F value tends to be stable when gc is greater than 0.4. Therefore, gc is set to 0.4 in this study.
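The evaluation metrics and the selection of gc described in this section can be sketched as follows. This is an illustrative sketch under assumptions: Precision and Recall are computed from match counts as defined above, and the combined score F is taken to be the harmonic mean of the min-max normalized F-measure and Perfect values. All function names are hypothetical, and the numbers in the usage note are toy values, not results from Table 4.

```python
from typing import List, Sequence, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative values (0 when both are 0)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def precision_recall_f(
    n_matched_pred: int, n_pred: int, n_matched_std: int, n_std: int
) -> Tuple[float, float, float]:
    """Precision = |mpc| / |pc|, Recall = |msc| / |sc|, F-measure = harmonic mean."""
    precision = n_matched_pred / n_pred if n_pred else 0.0
    recall = n_matched_std / n_std if n_std else 0.0
    return precision, recall, harmonic_mean(precision, recall)


def min_max(values: Sequence[float]) -> List[float]:
    """Min-max normalization to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_f(f_measures: Sequence[float], perfects: Sequence[float]) -> List[float]:
    """Combined score per gc value: harmonic mean of the two normalized series."""
    return [harmonic_mean(f, p) for f, p in zip(min_max(f_measures), min_max(perfects))]


def best_gc(
    gcs: Sequence[float], f_measures: Sequence[float], perfects: Sequence[float]
) -> float:
    """Return the gc value with the highest combined score (first one on ties)."""
    scores = combined_f(f_measures, perfects)
    return gcs[scores.index(max(scores))]
```

For instance, with toy inputs `gcs = [0.2, 0.4, 0.6]`, `f_measures = [0.5, 0.6, 0.55]` and `perfects = [10, 20, 12]`, the middle setting has the highest value in both series and is therefore the one returned by `best_gc`.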

Comparison based on precision, recall and F-measure
The performance of ONCQS is compared with five other state-of-the-art protein complex prediction algorithms: MCODE, MCL, CORE, ClusterONE and COACH. MCODE and ClusterONE are run using Cytoscape [36] with their default parameter settings. Figure 4 depicts the Precision, Recall and F-measure of each algorithm on the four datasets. As shown in Fig. 4, the Recall and F-measure values of our method are clearly higher than those of the other methods on the four datasets, which indicates that ONCQS can detect protein complexes more accurately. Table 5 lists the PC, Perfect and AS values of each algorithm on the four datasets. ONCQS mines protein complexes more accurately, and its Perfect value is much higher than those of the other algorithms.

Comparison with standard complexes
In order to show the experimental results more clearly, we visualized the 379th standard protein complex of CYC2008, the "UTP B complex", and the corresponding results of the six algorithms on the Krogan dataset in Fig. 5. As shown in Fig. 5a, the standard protein complex consists of 6 proteins bound together. Figure 5b shows the results of MCL and MCODE: the pink area is the result of MCL, and the orange area is the result of MCODE. MCL predicts 2 proteins incorrectly. MCODE merges three closely connected subgraphs into one protein complex.

Compare the ability to mine overlapping protein complexes
Individual proteins can participate in the formation of a variety of different protein complexes, and different complexes perform different functions, so protein complexes overlap. ONCQS is proposed to mine overlapping protein complexes. The standard protein complexes in the CYC2008 database contain many overlapping protein complexes. Figure 2 shows a pair of overlapping protein complexes, the eIF3 complex and the multi-eIF complex. We analyzed how well the six algorithms match these two complexes on the four databases. The eIF3 complex and the multi-eIF complex are denoted sc1 and sc2, and their details are listed in Table 6. The eIF3 complex contains seven proteins and the multi-eIF complex contains eight proteins, three of which are common to both. We then analyzed the clustering results of the six algorithms on each of the four databases. As before, a match is considered successful only when the overlapping score is greater than 0.2, and when there are multiple successful matches, the one with the maximum overlapping score is taken. The results of the six algorithms on DIP, Gavin, Krogan and MIPS are shown in Tables 7, 8, 9 and 10, respectively, where pc1 represents the predicted complex that matches the eIF3 complex (sc1) and pc2 represents the predicted complex that matches the multi-eIF complex (sc2).
Boldface indicates proteins that are predicted correctly. As shown in Tables 7, 8, 9 and 10, MCODE, MCL, CORE and ClusterONE cannot detect the overlapping protein complexes. MCODE and CORE failed to find complexes that match sc1 and sc2, respectively. COACH can find protein complexes that match sc1 and sc2, but its accuracy is not as good as that of ONCQS. ONCQS achieved the best performance in identifying overlapping protein complexes. Both ClusterONE and COACH were proposed for mining overlapping protein complexes, yet in this case ClusterONE cannot detect the overlapping protein complexes and the performance of COACH is poor. This further shows that it is meaningful to design efficient and accurate algorithms for mining overlapping protein complexes. ONCQS incorporates GO functional annotation information, which can improve the accuracy of the algorithm.

Conclusion
Protein complexes are involved in multiple biological processes, so the detection of protein complexes is essential for understanding cellular mechanisms. At the same time, protein complexes overlap. This paper proposes a new algorithm, ONCQS, to identify overlapping protein complexes based on the overlay network chain in quotient space. By combining the network properties of protein interaction networks with the biological properties of proteins, protein complexes are treated as nodes in the overlay network, and an overlay network chain is built to mine them. Compared with the other competing clustering methods, ONCQS can effectively identify overlapping protein complexes and achieves higher accuracy.