Interlog protein network: an evolutionary benchmark of protein interaction networks for the evaluation of clustering algorithms

Background In the field of network science, identifying principal modules or communities is critical for deducing the relationships and organization of complex networks. This approach opens the way to further study of biological functions in the field of network biology. As the clustering algorithms currently employed to find modules have innate uncertainties, external and internal validations are necessary. Methods Sequence and network structure alignment has been used to define the Interlog Protein Network (IPN). This network is an evolutionarily conserved network with communal nodes and fewer false-positive links. In the current study, the IPN is employed as an evolution-based benchmark for validating module finding methods. The clustering results of five algorithms, namely Markov Clustering (MCL), Restricted Neighborhood Search Clustering (RNSC), Cartographic Representation (CR), Laplacian Dynamics (LD) and Genetic Algorithm to find communities in Protein-Protein Interaction networks (GAPPI), are assessed by the IPN in four distinct Protein-Protein Interaction Networks (PPINs). Results MCL proves to be the more accurate algorithm under this evolutionary benchmarking approach. Also, the biological relevance of proteins in the IPN modules generated by MCL is compatible with standard biological databases such as Gene Ontology, KEGG and Reactome. Conclusion In this study, the IPN shows its potential for validating clustering algorithms owing to its biological logic and straightforward implementation. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0755-1) contains supplementary material, which is available to authorized users.


Background
One of the important challenges in the interpretation of proteomic data is the detection of active cellular processes by exploring protein function. The newly emerging discipline of network science has demonstrated that the majority of biological and evolutionary concepts make sense in the light of Systems Biology [1,2]. Hence, protein function and, consequently, cell function are more clearly demonstrated in the context of the protein interaction network [3]. Protein-protein interactions make up the major branch in the study of protein interaction networks.
From a biochemical view, these interactions can be divided into two categories: physical and functional [4,5]. There are several methods employed to describe these interactions. The advantages and disadvantages of these methods have been widely reviewed [6][7][8]. Different limitations, such as slow and small-scale performance, inability to identify protein complexes, artificial interactions obtained from in vitro assays and operational restrictions, led to the development of methods that complement each other [9]. Some experimental methods are involved in determining physical and functional interactions [10,11]. The phylogenetic profile, Rosetta stone, gene neighborhood and co-evolution are the most prevalent computational methods [11][12][13].

Biological network modules
After constructing a Protein-Protein Interaction Network (PPIN), the next step is the exploration of the protein tasks within this complex circuit. As Alessandro Vespignani mentioned, "evolution thinks modular" [14]; a cell's activity is the result of groups of interacting proteins in the PPIN, known as functional modules (whose members do not necessarily interact at the same time and place) [15,16]. Therefore, the PPIN modules should be identified first, and a biological function can then be assigned to them based on the protein annotations. Sometimes this procedure is more successful for protein complexes, whose members work together at the same time and place, than for functional modules [6,9].
Module detection can be divided into two approaches, namely graph clustering, and distance-based clustering. In the first approach, algorithms seek communities of the nodes in the graph that contain more intra-edges than inter-edges, e.g. Super Paramagnetic Clustering (SPC) [17], Highly Connected Subgraph (HCS) based on the Monte Carlo algorithm [18], Markov clustering (MCL) [19] and Restricted Neighborhood Search Clustering (RNSC) [20]. In the distance-based approach, the clustering algorithms e.g. the hierarchical or k-means are used so that the concept of distance and its associated measures in graph theory are applied as the similarity measures in the clustering. Some of the distances used in this approach are as follows: shortest path [21], number of edges [22,23], shortest path profiles [24,25] and a combination of distance and the statistical objects [26]. The detected modules are then well-characterized, biologically, based on information beyond the network topology such as gene expression, cell localization, virulence and knockout phenotypes [6,27].
The biological data are also applied in the next step for validation assays of module finding. The validation is based on the homogeneity of protein annotations in the modules, and these annotations could consist of functional, structural, local and/or interactomic information. Generally, the Gene Ontology (GO) and MIPS databases are used in PPIN module validation [20,28-32]. In addition, data from gene expression profiles, colocalization and gene phenotype are also used [9]. It should be highlighted that the major presumption of the validation is that most of the proteins in a module, i.e. a statistically significant number of proteins, should be similar in the intended attribute if the modules were identified correctly. A pictogram of this procedure is summarized in Fig. 1.
Although protein complexes consist of functional and physical interactions at the same time, it is obvious that many functional interactions are not present in the protein complex data. Several protein interactions happen transiently and indirectly, and as such are not detectable by routine empirical tests. These are important steps in protein interactions that are lacking in the MIPS database [33][34][35]. In addition, modeling and representing the protein complexes acquired by some experimental techniques as a graph (e.g. the "spoke" and "matrix" models) is a challenging issue [36]. Furthermore, one cannot ignore the experimental errors inherent to these techniques, and many more protein complexes have not yet been fully studied [37]. These flaws may lead to misinterpretation in the validation step if only the MIPS database is used.
On the other hand, GO is a standard glossary of biological terms known as the first and most common reference for the Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) of proteins [38]. Additionally, a significant correlation between the node distances in some biological networks and the semantic similarity of their GO terms has also been reported [39][40][41][42]. Although GO contains comprehensive and organized information, it has some limitations: insufficient GO annotations (35-55 % false-negatives) [43]; inaccurate GO annotations (false-positives; most of the annotations in GO are obtained by indirect methods such as gene manipulation, as well as from heterogeneous experimental and computational data); the functional diversity of proteins under different conditions, which results in different and sometimes conflicting annotations for one protein (false-positives) [44]; and errors due to the manual annotation approach [45]. These deficiencies also lead to misinterpretation in the validation step.

The goal of present study
Regarding the aforementioned restrictions in the validation step of module finding algorithms, we propose a network-based evolutionary benchmark as a complementary approach to solving some of the presented issues. Recently, in a companion study [46], we introduced a common network that had low false-positives and tuned false-negatives. Using four PPINs, a network with a high degree of conservation between four species was constructed. We call this common network the Interlog Protein Network (IPN) (Additional file 1). In the present study, the IPN, which is confirmed using experimental proteomic data, is proposed as a complementary benchmark for validating different module finding algorithms, namely Markov Clustering (MCL) [19], Restricted Neighborhood Search Clustering (RNSC) [20], Cartographic Representation (CR) [47], Laplacian Dynamics (LD) [48,49] and Genetic Algorithm to find communities in Protein-Protein Interaction networks (GAPPI) [16].

Mitochondrial IPN
In the current study, the mitochondrial IPN of four eukaryotic species was constructed. These species were human, rat, fruit fly and worm. The IPN was obtained through the interlog finding procedure applied to the mitochondrial PPINs of these species. In other words, the IPN is an evolutionarily conserved network obtained from the overlap of orthologous proteins reinforced by gene expression. By pair-wise sequence alignment (≥30 %), the 226 Orthologous Protein Sets (OPSs) were obtained. Each OPS contained four orthologous proteins from the four species (83 human, 82 rat, 83 fruit fly and 80 worm). Finally, the IPN showed 29 nodes, 61 edges, an average degree of 4.34, a diameter of 6, and an average clustering coefficient of 0.625 (Additional files 1 and 2). This network represents the evolutionarily conserved topological network features shared among these species.
The expression data were used to empirically validate the IPN. The significantly high correlation between the protein concentrations endorsed the edges in the IPN. Specifically, a correlation (co-expression) network was reconstructed from the concentration data and compared to the IPN. In a previous study, the rat mitochondrial proteins (~500) were analyzed by several electrophoresis techniques [50]. It was claimed that different electrophoresis techniques are capable of fractionating proteins with different subcellular localizations [50]. Hence, a significant correlation between the expression profiles of proteins across different electrophoresis techniques implies that they are co-localized proteins [51][52][53].
Of the total 563 reviewed proteins (UniProtKB database), 82 were rat mitochondrial proteins that participated in OPSs. Of these, 31 proteins were involved in the IPN, and 20 of the 31 proteins were detected experimentally and involved in the co-expression network. In all, there were 23 significantly high correlations among these 20 proteins, which matched 13 of the 22 links in the IPN between these 20 proteins. The hypergeometric test confirms that the matching ratio of the IPN (~60 %) is significantly higher than the edge matching ratio in the rat PPIN (12 %), with a p-value of 3.9 × 10^-7 (Table 1).
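The kind of enrichment test described above can be sketched as a one-sided hypergeometric tail probability. This is only an illustration under stated assumptions: the function name `hypergeom_upper_tail` is hypothetical, and the population size and number of matched PPIN edges used in the example call are made-up placeholders, since the exact counts behind the reported p-value are given in Table 1 and not in the text. Only the 13 matches out of 22 IPN links come from the text.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable X.

    population: total number of candidate edges (assumed: rat PPIN edges)
    successes:  edges in the population supported by co-expression
    draws:      number of edges in the subset under test (e.g. IPN links)
    k:          observed number of supported edges in that subset
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total

# Hypothetical illustration: 13 of 22 IPN links match (from the text), drawn
# from an ASSUMED population of 1000 edges of which 120 (12 %) match.
p_value = hypergeom_upper_tail(13, 1000, 120, 22)
print(f"p = {p_value:.3g}")
```

Because the population figures are placeholders, the printed value is not expected to reproduce the p-value reported in the paper.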

Comparison of network clustering methods
All the methods, including RNSC, MCL, CR, LD and GAPPI, were performed to discover modules in all four PPINs and the IPN. It should be noted that all of these algorithms are unsupervised and the network size affects the number of clusters. By defining the IPN as the benchmark, some external measures were used to validate all the methods. Next, some comparison indices were used, including Jaccard, Rand, Fowlkes-Mallows and Minkowski for all species. Except for the Minkowski index with the range [0, +∞) (where values near zero indicate greater similarity), the other indices have a range of [0,1], and values closer to zero indicate greater inconsistency.
There are two approaches to explore function from networks in biology: direct and module-assisted methods. In the direct method, which is not our subject in this study, the annotations of gene/protein neighbors are used to predict function. In the module-assisted methods, the node's community/neighborhood is the principal basis for function prediction. These methods are also divided into two categories, i.e. graph and distance-based clustering methods. Validation is the main step after finding modules in the PPIN or using direct methods. Any assignment of function based on the annotation of neighbors or neighborhood should be evaluated by different validating methods. We introduce the IPN-based validation for this purpose.
The Rand index (unlike the Jaccard, Fowlkes-Mallows and Minkowski indices) measures the degree of similarity between two matrices as a function of both the positive and the negative agreements. Some studies claimed that the n00 value (number of paired entities in the similarity matrices in which both are 0, see Methods) was often larger than the other values in most gene clustering studies and suggested using the three other indices [54]. On the other hand, the Jaccard index is recommended due to its low variance [55]. The mean values and standard deviations of all the indices are shown graphically in Fig. 2 (all the values are shown in Additional file 2).
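The effect of a dominant n00 on the Rand index can be illustrated with a short sketch that evaluates the four indices from the pair counts, using the definitions given in Methods. The function name `external_indices` and the counts in the example are hypothetical and are not values from this study. The sketch assumes non-zero denominators.

```python
from math import sqrt

def external_indices(n11, n10, n01, n00):
    """Compute the four comparison indices from pair counts.

    n11/n00: agreements (pair together in both / apart in both clusterings)
    n10/n01: disagreements (together only in the PPIN / only in the IPN)
    Assumes all denominators are non-zero.
    """
    return {
        "rand": (n11 + n00) / (n11 + n10 + n01 + n00),
        "jaccard": n11 / (n11 + n10 + n01),
        "minkowski": sqrt((n10 + n01) / (n11 + n01)),
        "fowlkes_mallows": n11 / sqrt((n11 + n01) * (n11 + n10)),
    }

# Hypothetical counts: negative agreements (n00) dominate, as is typical
# when many small clusters are compared.
scores = external_indices(n11=10, n10=5, n01=5, n00=386)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, the Rand index is close to 1 purely because of n00, while the Jaccard and Fowlkes-Mallows indices, which ignore n00, are considerably lower.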
As represented in Fig. 2, our defined network showed that MCL outperforms the clustering methods of GAPPI, RNSC, LD and CR in terms of external measure indices. The range of the standard deviations showed that the MCL is more dependent on the size of the graph. This superiority was evident in all the indices, even in the Rand index with the above-mentioned imperfection. However, in the case of human PPIN as a large network, MCL and GAPPI cluster with similar accuracy (Additional file 2).
Meanwhile, the superiority of MCL is compatible with earlier results [32]. Using a distinct approach, that study presented a comparative assessment of clustering algorithms and showed that MCL was remarkably robust to graph alterations and capable of extracting complexes from PPINs. CR took a long computation time and could not specify the modularity as well as the other algorithms within a reasonable time. The RNSC, LD and CR clusterings showed a similar ability to find modules robustly, but the LD algorithm showed the lowest standard deviation among all the indices calculated. GAPPI, the most recently proposed algorithm for this problem, works better than RNSC, LD and CR. This algorithm takes second place after MCL in all comparisons, and it clusters large networks as well as MCL. This pattern was almost repeated in the recent study [16] using MIPS as the gold standard.

Biological evidence
The biological relevance of the proteins in each module detected by MCL was assessed. Using the Enrichr tool [56], three well-known standard biological databases were queried, namely Gene Ontology (Biological Processes) [38], KEGG [57] and Reactome [58]. The results show that each module was significantly enriched and annotated separately (Additional file 3). Briefly, for 3 modules of this conserved IPN, the results are as follows. The first module is related to the citrate/TCA cycle and oxidative phosphorylation based on these ID numbers (GO:0006099, GO:0022904, GO:0022900, ko00020 and ko00190). The second module is related to mitochondrial protein import based on these ID numbers (GO:0006626, GO:0070585, GO:0072655 and GO:0006839). The last module is related to mitochondrial translation, ribosome and nucleoside biosynthetic process based on these ID numbers (GO:0046031, GO:0009133, GO:0046033, GO:0009135, GO:0009179, ko03010, ko00240 and ko00230). These results are compatible with our earlier study on biologically meaningful communities in the IPN as a pure evolutionary extract of the mitochondrial PPIN [46].

Conclusion
There are several module detection methods based on different approaches. Validation assays are required to compare them and select the best one for network analysis. The major prerequisite for validation is the determination of a reliable benchmark. A standard topological and functional PPIN helps us to assess and verify the PPIN modularity results. In earlier studies, researchers used the MIPS or GO dataset as the gold standard in validation assays. As mentioned earlier, these datasets are not perfect gold standards and each one has its own particular shortcomings. In other words, these databases have been designed with specific purposes and are conceptually diverse [59].
In the current study, we used pair-wise sequence alignment and comparative interactomics of evolutionarily distant species to reconstruct a conserved and common network that can be used as the benchmark or ground truth. The proposed benchmark does not have the above-mentioned limitations. First, the edges (interaction data) in the IPN and the associated compared networks are generally of the same origin. This implies that if the edges of the associated compared networks are predicted computationally, this benchmark is also constituted from computational data, and likewise for experimentally identified interactions. In other words, the IPN edges are a result of the filtering procedure (see Additional file 4) and they do not originate from logically distinct methods. Second, the IPN reconstruction procedure most likely leads to a network with low false-positives and tuned false-negatives. This issue has a high impact on the assessment results in the validation step. Third, the reconstruction of the IPN is possible for all sequenced proteins and genes that are well-conserved across multiple species with predicted interactions. This implies that the approach does not require special, expensive and time-consuming techniques to generate experimental data and evaluate molecular networks.
Similar to the previous result [32], but dissimilar in approach, we found MCL to be the outstanding algorithm based on its performance in the comparison study. In the traditional method, the MIPS database was used to evaluate the different clustering methods. The sensitivity and accuracy of the different methods was also examined by adding and subtracting the edges i.e. artificial false-positives and negatives (Note that their tests did not contain large size changes). Our findings about GAPPI implementation are also consistent with the prior study [16], which showed the improving ability of a genetic algorithm to search modules in PPINs based on the MIPS database.
However, the interaction data were retrieved from the STRING database, which includes different sources of information, including various experimental, computational and even text mining methods [60,61]. In addition, an independent set of empirical data was applied and the IPN quality was experimentally confirmed. Our goal, however, was to search for and introduce a method that could segregate the functional modules. It should be noted again that a functional module means a group of cell components and their interactions that carry out a specific biological function, whether or not they act at the same time and place. So, these modules also include all the protein complexes. Therefore, the validation standard should not lack the functional interaction data. MCL is the superior module detection method for exploring protein complexes and also for functional modules, based on the previous [32] and current studies, respectively. In addition, in terms of different graph sizes, it appears that MCL is not as robust as the other algorithms based on the range of the standard deviation.
In this study, we suggest the IPN to justify the modularity results of any PPIN owing to the three advantages mentioned above. A graph clustering algorithm would be inefficient if it could not find the modules analogously in the individual PPINs and in the IPN as a purified, conserved and confirmed network. This approach to making a new benchmark may also help to assess and verify other biological networks, e.g. gene regulatory networks or gene correlation networks, and other biological network analysis methods such as network motif finding or orienting PPINs, which are subjects for further research. Again, this approach uses an evolutionary concept, i.e. conservedness, to evaluate biological networks. This is reminiscent of the well-known quote, "Nothing in biology makes sense, except in the light of evolution" [62].

Interlog protein network (IPN)
Construction of the common PPIN or IPN has been described earlier in detail [46]. Briefly, the reviewed mitochondrial proteins of the four eukaryotic model species (Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans and Homo sapiens) were retrieved from the UniProtKB database (UniProt release 2013_02) [63]. Then, using the Needleman and Wunsch algorithm, the homologous proteins were identified and grouped into the OPSs. In the next step, four distinct mitochondrial PPINs of the four species were identified from the STRING database (Ver. 9) [61]. The four PPINs were retrieved with the default value in the database by all the prediction methods. Finally, by applying a stringent rule, namely the existence of an interlog in all four species, the mitochondrial IPN of these species was extracted.
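The final filtering step, keeping only interactions that have an interlog in every species, can be sketched as a set intersection over edges expressed in OPS identifiers. This is a minimal illustration, not the pipeline of [46]: the function name `interlog_network`, the input layout and the toy protein and OPS names are all assumptions made for the example.

```python
def interlog_network(ppin_edges, ops_of):
    """Return OPS-level edges that are present in every species' PPIN.

    ppin_edges: dict mapping species -> iterable of (protein_a, protein_b)
    ops_of:     dict mapping species -> dict of protein -> OPS identifier
    Both layouts are hypothetical conveniences for this sketch.
    """
    conserved = None
    for species, edges in ppin_edges.items():
        mapping = ops_of[species]
        ops_edges = set()
        for a, b in edges:
            # Skip proteins without an orthologous set, and self-pairs.
            if a in mapping and b in mapping and mapping[a] != mapping[b]:
                ops_edges.add(frozenset((mapping[a], mapping[b])))
        conserved = ops_edges if conserved is None else conserved & ops_edges
    return conserved or set()

# Toy data for two species only (the study uses four).
toy_edges = {
    "human": [("hA", "hB"), ("hB", "hC")],
    "rat": [("rB", "rA"), ("rA", "rC")],
}
toy_ops = {
    "human": {"hA": "OPS1", "hB": "OPS2", "hC": "OPS3"},
    "rat": {"rA": "OPS1", "rB": "OPS2", "rC": "OPS3"},
}
ipn_edges = interlog_network(toy_edges, toy_ops)
print(sorted(sorted(edge) for edge in ipn_edges))
```

Edges are stored as `frozenset` pairs so that the undirected interaction (A, B) in one species matches (B, A) in another.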

Proteomics data
The results of the mitochondrial proteomic study of rat [50] were used for the empirical evaluation. In that shotgun proteomics strategy, the rat liver proteome from different cellular compartments was detected and quantified by several gel-based fractionation techniques. In the present study, normalized peptide counts were used to estimate the protein concentrations in a label-free quantification method. Then, Pearson correlation was applied to find the correlated proteins (|r coefficient| ≥ 0.7, P-value ≤ 0.05). According to the distinction made by the electrophoresis methods, the correlated proteins are likely co-localized. As discussed earlier, co-localization can support a protein-protein interaction. Later, the ratio of the correlated proteins in the rat PPIN and in the IPN was computed separately and compared with the hypergeometric test (P-value ≤ 0.001). Thus, the IPN edges were examined statistically using independent experimental data.
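The correlation step can be sketched as follows. This is a simplified illustration: the function names and the toy profiles are hypothetical, and only the |r| ≥ 0.7 threshold is applied. The accompanying P-value ≤ 0.05 filter used in the study is omitted here for brevity.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / sqrt(sxx * syy)

def coexpression_edges(profiles, r_min=0.7):
    """Return {(protein_a, protein_b): r} for pairs with |r| >= r_min."""
    edges = {}
    for p, q in combinations(sorted(profiles), 2):
        r = pearson_r(profiles[p], profiles[q])
        if abs(r) >= r_min:
            edges[(p, q)] = r
    return edges

# Made-up normalized peptide counts across four fractionation techniques.
toy_profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.5],
    "P3": [5.0, 1.0, 4.0, 2.0],
}
toy_edges = coexpression_edges(toy_profiles)
print(toy_edges)
```

The resulting edge set can then be compared with the IPN and PPIN edges between the same proteins.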

Network clustering algorithms
In the present study, five well-known clustering algorithms (MCL, RNSC, CR, LD and GAPPI) were used to cluster the PPINs and the IPN. The general characteristics of each algorithm are shown in Table 2 and the associated references [19,20,47-49]. Further details regarding these algorithms are presented in Additional file 5. We clustered the PPINs of all species and the IPN independently.

Evaluation of the clustering results
After clustering, validation is required to confirm the results or compare the different methods. A new benchmark was introduced, i.e., IPN in the validation step, so that the modules corresponding to each of the PPINs are compared with the IPN's modules. In fact, the IPN was used as the ground truth in the standard external measures assay. Note that the clustering results on the PPIN are restricted to those proteins also in the IPN. It was expected that the successful algorithm should be able to find the modules analogously in PPIN and IPN. In order to assess the clustering results, the similarity matrices (Symmetric binary matrices) of clustering results were constructed, such that a 1 indicated placing two objects in the same cluster or module and a 0 indicated the opposite. Then, the entities of each of the PPINs and IPN matrices were compared with each other. If the corresponding entities in the two matrices were equal, the two clustering methods resulted in the same clusters. The following four conditions occurred: Agreements; n11 (Number of paired entities in the similarity matrices in which both are 1) and n00 (Number of paired entities in the similarity matrices in which both are 0), Disagreements; n10 (Number of paired entities that are 1 in the PPIN similarity matrix and 0 in the IPN similarity matrix) and n01 (Number of paired entities that are 1 in the IPN similarity matrix and 0 in the PPIN similarity matrix).
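The four pair counts can be obtained without building the full similarity matrices, by iterating over unordered pairs of the proteins shared by both clusterings. The sketch below is illustrative only: the function name `pair_counts` is hypothetical, and it assumes that PPIN proteins and IPN nodes are keyed by a common identifier (for example an OPS identifier). Counting unordered pairs instead of all symmetric matrix entries scales every count by the same factor, so ratios of counts are unaffected.

```python
from itertools import combinations

def pair_counts(ppin_labels, ipn_labels):
    """Count pair agreements between a PPIN clustering and the IPN clustering.

    Both arguments map a shared node identifier to a module label.
    Nodes absent from the IPN are ignored, as described in the text.
    Key convention: first digit = PPIN, second digit = IPN.
    """
    shared = sorted(set(ppin_labels) & set(ipn_labels))
    counts = {"n11": 0, "n10": 0, "n01": 0, "n00": 0}
    for a, b in combinations(shared, 2):
        same_in_ppin = ppin_labels[a] == ppin_labels[b]
        same_in_ipn = ipn_labels[a] == ipn_labels[b]
        counts[f"n{int(same_in_ppin)}{int(same_in_ipn)}"] += 1
    return counts

# Toy module assignments; node "E" is not in the IPN and is dropped.
toy_ppin = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3}
toy_ipn = {"A": 1, "B": 1, "C": 1, "D": 2}
toy_counts = pair_counts(toy_ppin, toy_ipn)
print(toy_counts)
```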
There are several benchmarking indices to measure the degree of agreement and disagreement between the two matrices [55]. Some of the indices used in this study are as follows:

$$\mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$$

$$\mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$

$$\mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}}$$

$$\mathrm{Fowlkes\text{-}Mallows} = \frac{n_{11}}{\sqrt{(n_{11} + n_{01}) \times (n_{11} + n_{10})}}$$

In order to perform the biological evaluation of the IPN modules, the Enrichr software was used [56]. In this web-based tool, significantly enriched terms are extracted based on the Gene Ontology Biological Processes [38], KEGG Orthology [57] and Reactome [58] databases. The combined score, consisting of the Z-score and adjusted p-value, was used to rank and define enriched terms. This validation was done for the modules defined by the MCL algorithm, as the superior algorithm in our comparison.
Additional file 3: Biological evidence. In this file, the IPN nodes, the MCL modules, the human proteins in OPSs and the enriched terms of the proteins that constitute the IPN modules are presented. The enrichment analysis of each module is presented separately in different sheets. The human proteins were used to perform this analysis. (XLSX 38 kb)
Additional file 4: IPN reconstruction steps. First, the mitochondrial proteins are extracted from the UniProt database, and then the reviewed proteins are filtered. Using the Needleman and Wunsch algorithm, the homologous proteins are identified in the OPSs. In the next step, four distinct PPINs from four species are identified from the STRING database. Finally, the IPN is created by finding the interlog proteins in all four PPINs. In each step some proteins are omitted to discern conserved structures. (PDF 123 kb)
Additional file 5: Some details regarding the five clustering algorithms (MCL, RNSC, CR, LD and GAPPI) are described briefly. (DOCX 13 kb)