Protein complex detection in PPI networks based on data integration and supervised learning method

Background: Revealing protein complexes is important for understanding the principles of cellular organization and function. High-throughput experimental techniques have produced a large number of protein interactions, which makes it possible to predict protein complexes from protein-protein interaction (PPI) networks. However, the small number of known physical interactions may limit protein complex detection.
Methods: New PPI networks are constructed by integrating PPI datasets with the large and readily available PPI data from the biomedical literature, and the less reliable PPIs are then filtered out based on the semantic similarity and topological similarity of the two proteins involved. Finally, the supervised learning protein complex detection (SLPC) method, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks.
Results: The experimental results of SLPC on two different categories of yeast PPI networks demonstrate the effectiveness of the approach. Compared with the original PPI networks, the best average improvements are 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and maximum matching ratio (MMR), respectively. Compared with the denoising PPI networks, the best average improvements are 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively. Compared with ClusterONE, the state-of-the-art complex detection method, on the denoising extended PPI networks, the average improvements are 26.02 and 22.40 percentage units in F-score and MMR, respectively.
Conclusions: The experimental results show that the performance of SLPC improves considerably when newly available PPI data from the biomedical literature are integrated into the original PPI networks and the denoising PPI networks. In addition, our protein complex detection method achieves better performance than ClusterONE.


Background
Protein-protein interactions (PPI) are fundamental to the biological processes within a cell. Beyond individual interactions, there is a lot more systematic information contained in protein interaction graphs. Complex formation is one of the typical patterns in this graph and many cellular functions are performed by these complexes containing multiple protein interaction partners.
Many automatic approaches have been proposed to detect the protein complexes from PPI networks, such as CMC [1], COACH [2], MCODE [3], MCL [4], Cfinder [5], and ClusterONE [6]. However, most of these methods are based on unsupervised graph clustering methods and predict protein complexes only with pre-defined rules. Compared with them, supervised learning methods [7,8] can utilize the known complexes information and may achieve better performances.
At present, a large number of PPI databases have been created. Gavin [9], Krogan [10] and DIP [11] are popular PPI databases used by protein complex detection methods. However, these databases are sparse since the fraction of known true physical interactions is limited [12]. For example, the average numbers of interactions per protein are 6.98, 7.86, and 9.13 in DIP, Krogan, and Gavin, respectively. Nevertheless, large numbers of PPIs can be found in the rapidly growing biomedical literature. Furthermore, since these PPI data are provided by biomedical experts, they are relatively accurate. Integrating them with the existing PPI datasets is expected to reduce the sparsity of the PPI networks and, therefore, to improve complex detection performance.
In this paper, we present a complex detection approach based on data integration and supervised learning. In this approach, new PPI networks are constructed by integrating PPI datasets with the PPI data extracted by PPIExtractor [13] from the biomedical literature, and the less reliable PPIs are then filtered out based on the semantic similarity and topological similarity of the two proteins involved. Finally, the supervised learning protein complex detection (SLPC) method, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks. The experimental results demonstrate that our approach outperforms ClusterONE, the state-of-the-art method.

Extracting PPI data with PPIExtractor
In our work, we use PPIExtractor [13] to extract PPIs from the biomedical literature and then integrate them into the PPI networks. PPIExtractor is a publicly available tool for extracting new PPI data from a large collection of biomedical literature. Experimental evaluations show that it achieves state-of-the-art performance on a DIP subset with respect to comparable evaluations.
PPIExtractor contains four modules: (i) Named Entity Recognition (NER) module which aims to identify the protein names in the biomedical literature; (ii) Normalization module which determines the unique identifier of proteins identified in NER module; (iii) PPI extraction module which extracts the PPI information in the biomedical literature; (iv) PPI visualization module which displays the extracted PPI information in the form of a graph. Figure 1 shows the architecture of PPIExtractor.
A total of 127,217 PubMed abstracts were downloaded from the PubMed website (http://www.ncbi.nlm.nih.gov/pubmed) with the query string "((Saccharomyces cerevisiae) OR yeast) AND protein", and PPIExtractor extracted a total of 126,165 protein interactions from these abstracts.
Most protein names in the PPI databases are systematic names (those for nuclear-encoded ORFs begin with the letter 'Y', for 'Yeast'), while those in PubMed abstracts are not. We therefore built a yeast protein alias list with about 6,000 entries from the UniProt website (http://www.uniprot.org/uniprot/?query=yeast&sort=score). The list is used to convert the protein names in PubMed abstracts to systematic names.
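The name conversion step can be illustrated with a minimal sketch. The function names and the two-entry table below are hypothetical and are not taken from PPIExtractor; the real list holds about 6,000 entries. Dropping pairs with an unmapped protein, and dropping self-interactions, is an assumption of this sketch.

```python
# Example alias table (two illustrative entries); the real list would be
# built from the UniProt yeast entries and hold about 6,000 names.
ALIAS_TO_SYSTEMATIC = {
    "CDC28": "YBR160W",
    "ACT1": "YFL039C",
}


def to_systematic(name, alias_table):
    """Return the systematic ORF name for `name`, or None if it is unknown."""
    key = name.strip().upper()
    if key in alias_table.values():  # already a systematic name
        return key
    return alias_table.get(key)


def normalize_pairs(pairs, alias_table):
    """Map extracted (protein, protein) pairs to sorted systematic-name pairs.

    Assumption of this sketch: pairs with an unmapped protein, and
    self-interactions, are dropped.
    """
    normalized = []
    for a, b in pairs:
        sa, sb = to_systematic(a, alias_table), to_systematic(b, alias_table)
        if sa and sb and sa != sb:
            normalized.append(tuple(sorted((sa, sb))))
    return normalized
```

Sorting each pair gives an undirected interaction a single canonical form, which makes later duplicate checks against the PPI datasets straightforward.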

PPI datasets
DIP, Krogan and Gavin, three yeast PPI datasets, are used in our work. The details of these PPI datasets are shown in Table 1. For each dataset, an original PPI network and denoising PPI networks are built to verify our method's effectiveness. The original PPI networks are the three yeast PPI datasets mentioned above. The denoising PPI networks are filtered versions of the three datasets, in which low-reliability interactions are removed with different denoising thresholds.
Protein interaction data produced by high-throughput experiments are often associated with high false positive and false negative rates. Therefore, a method based on both the semantic and the topological similarity of the two proteins is applied in our work to measure the reliability of an interaction. GO (The Gene Ontology Consortium [14]) annotation from SGD [15] is used in this measurement. In this method, the reliability of a PPI is defined by formula (1), where |C(m, n)| denotes the number of terms in C(m, n), the set of GO terms whose annotations include both proteins m and n. |T_i(m, n)| denotes the number of proteins in T_i(m, n), the set of proteins annotated to the GO term g_i whose annotation includes both m and n. T_max denotes the maximum number of proteins annotated to any GO term. The specificity of a GO term can be quantified by the ratio of its annotation size (|T_i(m, n)|) to T_max, i.e. a GO term is regarded as more specific if it has fewer annotated proteins. NE(m, n) denotes the number of neighbors that m and n share. Formula (1) expresses that the interaction between proteins m and n is more reliable if the GO terms they share are more specific, or if they have more common neighbors or GO terms. The details of the denoising PPI networks are shown in Table 2.
Figure 1 The architecture of PPIExtractor.
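The topological ingredient NE(m, n) and the thresholding step can be sketched as follows. This is only an illustration: the reliability values are taken as given (formula (1) itself is not reproduced here), the function names are hypothetical, and keeping interactions whose reliability is at least the threshold is an assumption about how the denoising threshold is applied.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (protein, protein) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def shared_neighbors(adj, m, n):
    """NE(m, n): number of neighbors that proteins m and n have in common."""
    return len((adj.get(m, set()) - {n}) & (adj.get(n, set()) - {m}))


def denoise(edges, reliability, threshold):
    """Keep the interactions whose precomputed reliability reaches the threshold.

    `reliability` maps each edge tuple to a score; how that score is
    computed is outside this sketch.
    """
    return [e for e in edges if reliability[e] >= threshold]
```

A higher threshold keeps fewer interactions, which is why very high thresholds risk discarding real PPIs along with the noise.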

Integration of the extracted PPI data into the PPI networks
PPIExtractor assigns each PPI extracted from the biomedical literature a weight representing its reliability [13]. In our study, only PPIs with weights equal to or higher than an integrating threshold are integrated into the original PPI dataset. In addition, both proteins of a new PPI must already exist in the PPI dataset. The numbers of PPIs added into the original PPI networks with different integrating thresholds are shown in Table 3.
The weights of the PPIs added into the denoising PPI networks are higher than the integrating threshold -0.6, because our SLPC method has the best performance on the original PPI networks with the integrating threshold -0.6. Moreover, the PPIs integrated into the denoising PPI networks are also filtered with the different denoising thresholds. The numbers of PPIs added with different denoising thresholds are shown in Table 4.
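The two integration rules above (a weight threshold, and both proteins already present in the dataset) can be sketched as below. The data layout, with the network as a set of frozenset pairs and the extracted PPIs as a dictionary from pairs to weights, is an assumption made for illustration.

```python
def integrate(network_edges, extracted, threshold):
    """Return the extended edge set after adding qualifying extracted PPIs.

    network_edges: set of frozenset({a, b}) pairs of the existing PPI dataset.
    extracted: dict mapping (a, b) tuples to the weight assigned to that PPI.
    threshold: integrating threshold; PPIs with weight >= threshold qualify.
    """
    proteins = {p for edge in network_edges for p in edge}
    extended = set(network_edges)
    for (a, b), weight in extracted.items():
        # Both proteins must already exist in the PPI dataset.
        if weight >= threshold and a != b and a in proteins and b in proteins:
            extended.add(frozenset((a, b)))
    return extended
```

Because the edge set uses frozensets, a PPI that is already in the dataset, or that is extracted in both orientations, is counted only once.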

Protein complexes detection with SLPC
In our work, a supervised learning protein complex detection (SLPC) method is employed to predict protein complexes from PPI networks. Currently, most protein complex detection methods are unsupervised and do not utilize the information in known complexes. However, numerous protein complexes have already been identified, and they can serve as prior knowledge for complex detection methods. In previous work, we presented the SLPC method to predict protein complexes [8]. The SLPC method utilizes features including graph density [3], degree statistics, edge weight statistics, clustering coefficient [16], and topological change [17]. Experimental evaluations show that SLPC can achieve better performance than other existing protein complex detection methods. The SLPC algorithm is shown in Table 5 and more details are provided in [8].
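As an illustration of the kind of topological features listed above, the following sketch computes graph density, degree statistics and the mean clustering coefficient of a candidate subgraph. It uses the textbook definitions of these quantities; the exact feature definitions used by SLPC are given in [8], and the function name is hypothetical. Edge weight statistics and topological change are omitted.

```python
from itertools import combinations


def subgraph_features(adj, nodes):
    """Compute simple topological features of the subgraph induced by `nodes`.

    adj: dict mapping each protein to the set of its neighbors (no self-loops).
    """
    nodes = set(nodes)
    n = len(nodes)
    # Degree of each node counted inside the induced subgraph only.
    degrees = {v: len(adj.get(v, set()) & nodes) for v in nodes}
    edge_count = sum(degrees.values()) // 2
    density = 2.0 * edge_count / (n * (n - 1)) if n > 1 else 0.0

    coefficients = []
    for v in nodes:
        neighbors = adj.get(v, set()) & nodes
        k = len(neighbors)
        if k < 2:
            coefficients.append(0.0)
            continue
        # Fraction of neighbor pairs that are themselves connected.
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj.get(a, set()))
        coefficients.append(2.0 * links / (k * (k - 1)))

    return {
        "density": density,
        "mean_degree": sum(degrees.values()) / n if n else 0.0,
        "max_degree": max(degrees.values()) if degrees else 0,
        "mean_clustering": sum(coefficients) / n if n else 0.0,
    }
```

A feature vector of this kind, computed for each known complex in the training set, is what a regression model could be trained on to score candidate subgraphs.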

Gold standard protein complexes
We constructed the gold standard protein complexes by combining MIPS [18], Aloy [19], SGD [15] and TAP06 [9]. Proteins absent from the corresponding PPI networks are filtered out from the gold standard. In addition, only protein complexes containing at least two different proteins are retained, as research shows that most protein complexes include more than one protein [20]. The details of the gold standard protein complexes of the original PPI networks and the denoising PPI networks are shown in Tables 6 and 7, respectively.

Evaluation metrics
In our study, F-score, accuracy (Acc) and maximum matching ratio (MMR) are used as the evaluation metrics. The neighborhood affinity score NA(A, B), defined as follows, is used to evaluate the similarity of two protein complexes A and B: NA(A, B) = |A ∩ B|^2 / (|A| × |B|). If NA(A, B) is larger than or equal to 0.25, complexes A and B are regarded as matching.
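The matching rule can be written as a short sketch, assuming the usual form of the neighborhood affinity score, i.e. the squared size of the intersection divided by the product of the two complex sizes. The function names are illustrative.

```python
def neighborhood_affinity(a, b):
    """NA(A, B) = |A ∩ B|^2 / (|A| * |B|) for two complexes given as protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def is_match(a, b, threshold=0.25):
    """Two complexes match when their NA score reaches the threshold (0.25 here)."""
    return neighborhood_affinity(a, b) >= threshold
```

For instance, a four-protein complex and a three-protein complex sharing two proteins have NA = 4/12, which is above 0.25, so they are regarded as matching.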
F-score, a popular metric for evaluating complex detection methods, is used as the first measure of performance. Let P and B be the predicted and gold standard complex sets, respectively, let Ncb be the number of gold standard complexes matching at least one predicted complex, and let Ncp be the number of predicted complexes matching at least one gold standard complex. Then Precision = Ncp / |P|, Recall = Ncb / |B|, and the F-score is calculated as the harmonic mean of precision and recall: F-score = 2 × Precision × Recall / (Precision + Recall).
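A small sketch of this computation follows, assuming Precision = Ncp / |P| and Recall = Ncb / |B| and the NA matching rule with threshold 0.25. The helper `na` repeats the neighborhood affinity score so that the block is self-contained; all names are illustrative.

```python
def na(a, b):
    """Neighborhood affinity score of two complexes given as protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def f_score(predicted, gold, threshold=0.25):
    """Return (precision, recall, F-score) for predicted vs. gold standard complexes."""
    # Ncp: predicted complexes matching at least one gold standard complex.
    ncp = sum(1 for p in predicted if any(na(p, g) >= threshold for g in gold))
    # Ncb: gold standard complexes matching at least one predicted complex.
    ncb = sum(1 for g in gold if any(na(g, p) >= threshold for p in predicted))
    precision = ncp / len(predicted) if predicted else 0.0
    recall = ncb / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Note that Ncp and Ncb are counted separately, because one predicted complex may match several gold standard complexes and vice versa.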
Table 5. Protein complex detection algorithm
Input: an unweighted network, a weighted network built via GO annotation, and a training set
Complex detection process:
Step 1: construct the feature vector space for the complexes in the training set from the unweighted and weighted PIN networks, and train the Regression model
Step 2: find maximal cliques in the PIN by the Cliques algorithm
  - rank the clique set C = {C_1, C_2, ..., C_n} in descending order of the scores given by the Regression model
  - for each clique C_i, check all the cliques (denoted as C_j) with lower scores; if C_i ∩ C_j > threshold, then remove C_j
  - output: the updated clique set
Step 3: grow the cliques
  - for each clique C_i, the set of its neighbors is denoted as N(C_i); do the update operation as follows:
    - check all the nodes in N(C_i)
    - select v_i ∈ N(C_i), which makes v_i ∪ C_i achieve a higher score given by the Regression model
    - update C_i = v_i ∪ C_i, N(C_i) = N(C_i) - v_i
  - repeat the update operation until there is no node v_j in N(C_i) that leads to score(v_j ∪ C_i) > score(C_i)
  - output: the candidate complex set C = {C_1, C_2, ..., C_n}
Step 4: filter the candidate complexes
  - rank the candidate complexes in descending order of the score given by the Regression model
  - for each candidate complex C_i, check all the candidates C_j with lower scores
    - if overlap(C_i, C_j) > merg_thred:
      - if score(C_i ∪ C_j) > score(C_i), do the merge operation: update C_i = C_i ∪ C_j
      - else do the remove operation: remove C_j from the candidate set
  - output: the predicted complex set

The second measure we used is the geometric accuracy introduced by Brohée et al. [21], which is the geometric mean of the clustering-wise sensitivity (Sn) and the clustering-wise positive predictive value (PPV). A high Sn value indicates that the protein complex prediction has good coverage of the proteins in the gold standard complexes, and a high PPV value indicates that the predicted protein complexes are likely to be true protein complexes. Assume that the number of gold standard complexes is n and the number of predicted complexes is m, and let T_ij denote the number of proteins that are found both in gold standard complex i and in predicted complex j. Following [21], with N_i denoting the number of proteins in gold standard complex i, Sn, PPV and Acc are defined as follows: Sn = Σ_i max_j T_ij / Σ_i N_i, PPV = Σ_j max_i T_ij / Σ_j Σ_i T_ij, and Acc = sqrt(Sn × PPV).
The third metric we used is the maximum matching ratio (MMR) [6], which is based on a maximal one-to-one mapping between gold standard complexes and predicted complexes.
MMR is the total NA score of the matched pairs in this mapping divided by n, where n denotes the number of gold standard complexes, m denotes the number of predicted complexes, and j indexes the predicted complexes. MMR offers a natural, intuitive way to compare predicted complexes with a gold standard, and it explicitly penalizes cases in which a reference complex is split into two or more parts in the predicted set, as only one of its parts is allowed to match the correct reference [6].
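The effect of the one-to-one constraint can be seen in a small sketch. It assumes that pairs are weighted by the NA score and that MMR is the total weight of the best one-to-one mapping divided by the number of gold standard complexes, as described in [6]. The exhaustive search below is exponential in the number of predicted complexes, so it is only suitable for toy inputs; it is not the procedure used in the experiments.

```python
from functools import lru_cache


def na(a, b):
    """Neighborhood affinity score of two complexes given as protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared * shared / (len(a) * len(b))


def mmr(predicted, gold):
    """Maximum matching ratio by exhaustive search over one-to-one mappings."""
    if not gold:
        return 0.0
    weights = [[na(g, p) for p in predicted] for g in gold]

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best total weight for gold complexes i.. given a bitmask of used predictions.
        if i == len(gold):
            return 0.0
        total = best(i + 1, used)  # leave gold complex i unmatched
        for j in range(len(predicted)):
            if not used & (1 << j):
                total = max(total, weights[i][j] + best(i + 1, used | (1 << j)))
        return total

    return best(0, 0) / len(gold)
```

If a four-protein reference complex is predicted as two disjoint halves, each half has NA = 0.5 with the reference, but only one of them can be matched, so the MMR is 0.5 rather than 1.0. For realistic numbers of complexes, a polynomial-time assignment solver would be needed instead of this search.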
The Acc measure explicitly penalizes predicted complexes that do not match any of the reference complexes. However, gold standard sets of protein complexes are often incomplete [22]. As a consequence, predicted complexes not matching any known reference complexes may still exhibit high functional similarity or be highly co-localized, and therefore they could still be prospective candidates for further in-depth analysis. In other words, a predicted complex that does not match a reference complex is not necessarily an undesired result, and optimizing for the geometric accuracy measure might prevent us from detecting novel complexes from a PPI dataset. Therefore, in the performance comparison, the F-score and MMR are used as the main metrics; the Acc is only used as an auxiliary one.
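For reference, the geometric accuracy discussed above can be sketched as follows, assuming the definitions of Brohée et al. [21]: Sn sums, over gold standard complexes, the largest overlap with any predicted complex and divides by the total size of the gold standard complexes; PPV sums, over predicted complexes, the largest overlap with any gold standard complex and divides by the sum of all overlaps T_ij. The function name is illustrative.

```python
from math import sqrt


def geometric_accuracy(predicted, gold):
    """Return (Sn, PPV, Acc) for predicted vs. gold standard complexes."""
    gold = [set(g) for g in gold]
    predicted = [set(p) for p in predicted]
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    # t[i][j]: proteins shared by gold standard complex i and predicted complex j.
    t = [[len(g & p) for p in predicted] for g in gold]
    gold_total = sum(len(g) for g in gold)
    overlap_total = sum(sum(row) for row in t)
    sn = sum(max(row) for row in t) / gold_total if gold_total else 0.0
    column_max = [max(t[i][j] for i in range(len(gold))) for j in range(len(predicted))]
    ppv = sum(column_max) / overlap_total if overlap_total else 0.0
    return sn, ppv, sqrt(sn * ppv)
```

A predicted complex that shares no protein with any gold standard complex does not enter the T_ij sums in this formulation, so how strongly unmatched predictions affect the value depends on how much they overlap the reference set.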

The performances of SLPC on original PPI networks
First, we tested SLPC on the three original PPI networks, i.e. DIP, Krogan and Gavin. The results for F-score, accuracy and MMR are shown in Tables 8, 9, and 10, respectively. The performance measured with these metrics keeps improving on these networks as the integrating threshold decreases from 0 to -0.6. With the threshold -0.6, SLPC achieves its highest average improvements over the three original PPI networks: 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and MMR, respectively. This shows that introducing PPIs extracted from the literature into the original PPI datasets can boost the performance. The reason is that integrating the new PPIs into the original PPI networks relieves the sparsity of these networks, and a lower integrating threshold admits more new PPIs. As shown in Table 11, in most cases the average size of the complexes predicted from the extended PPI networks is much closer to that of the gold standard protein complexes than the average size of those predicted from the original PPI networks and, therefore, SLPC achieves better performance.
The results on the denoising PPI networks and the denoising extended PPI networks are shown in Tables 12, 13 and 14. The performance of SLPC on the denoising extended PPI network is better than that on the corresponding denoising PPI network with any denoising threshold. With the denoising threshold 0.9, SLPC achieves the highest average improvements of 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively, on the denoising extended PPI networks over the denoising PPI networks. This shows, once again, that introducing the PPIs extracted from the literature can boost the performance of complex detection methods.
In addition, Tables 12, 13 and 14 show that the performance of SLPC on both the denoising PPI networks and the denoising extended PPI networks begins to decline after reaching its highest value. The reason is that a higher denoising threshold filters more PPIs out of the original PPI networks, which may remove some real PPIs.
The performance of ClusterONE, the state-of-the-art complex detection method, is also tested (its parameters are set as described in [6]). With the denoising threshold 0.9, it achieves average improvements of 0.31, 0.40 and 1.29 percentage units in F-score, accuracy and MMR, respectively, on the denoising extended PPI networks over the denoising PPI networks. This indicates that introducing the PPIs extracted from the literature can also boost the performance of ClusterONE. In addition, the experimental results show that SLPC achieves better performance than ClusterONE. With the denoising threshold 0.9, the average performance improvement of SLPC over ClusterONE is 26.02 and 22.40 percentage units in F-score and MMR, respectively.

Conclusions
Protein complexes, which consist of molecular aggregations of proteins assembled by multiple protein interactions, are among the fundamental units of macro-molecular organization and play crucial roles in integrating individual gene products to perform useful cellular functions. Large amounts of PPI data generated by high-throughput experimental techniques can be used to predict protein complexes from PPI networks. At the same time, numerous accurate PPIs can be found in the rapidly growing biomedical literature, since they are provided by biomedical experts. Integrating them with the existing PPI datasets is expected to reduce the sparsity of the PPI networks and, therefore, to improve complex detection performance.
In this paper, an approach that introduces PPIs from the biomedical literature into existing PPI networks and applies a supervised learning method to protein complex detection is presented. In the approach, new PPI networks are constructed by integrating PPI datasets with the large and readily available PPI data from the biomedical literature, and the less reliable PPIs are then filtered out based on the semantic similarity and topological similarity of the two proteins involved. Finally, the supervised learning protein complex detection method, SLPC, which can make full use of the information in known complexes, is applied to detect protein complexes on the new PPI networks. The best average improvements of 4.76, 6.81 and 15.75 percentage units in F-score, accuracy and MMR, respectively, are achieved on the original extended PPI networks. In addition, the best average improvements of 3.91, 4.61 and 12.10 percentage units in F-score, accuracy and MMR, respectively, are achieved on the denoising extended PPI networks. All these results show that introducing PPIs extracted from the literature into the original PPI datasets can boost the performance significantly. The reason is that the sparsity problem of PPI networks is alleviated by integrating PPI data from the biomedical literature. The results also show that our method outperforms ClusterONE, the state-of-the-art method. This is because our method makes full use of the information in known complexes. To summarize, our complex detection method, based on supervised learning and on the integration of PPI data from the biomedical literature, can achieve better performance than other complex detection methods.