Many biological features have been explored for the prediction of protein-protein interactions (PPIs), and it has been found that any single genomic feature has limited predictive power. Investigators are now moving toward integration [12, 22, 35]. A systematic assessment of the existing methods is a prerequisite for effective integration. In this study, we focused on four major methods (PPM, GCM, GFM, and GNM) that utilize genomic context information. Each method characterizes the genomic context in its own way. We hypothesized that an efficient integration of these four methods would improve prediction performance. We first performed extensive comparisons of the four methods using three positive datasets (KEGG, EcoCyc, and DIP). We found that the methods lacked consensus but complemented each other to some extent. Based on these comparisons, we developed an integrated method, InPrePPI, which optimally weighs the scores of protein pairs predicted by the four methods. Our performance comparison indicates that InPrePPI outperforms each individual method (Figure 2) and, in general, the two other integrated methods, the JOM and STRING (Table 2, Figures 4 and 5).
However, InPrePPI did not outperform the JOM or STRING in all tests. In the JOM, the accuracy values were higher for the PPIs that were consistently predicted by at least three methods, but these high values were reached at the cost of a dramatic decrease in coverage. This makes the JOM impractical when multiple methods or sources of supporting evidence are employed. InPrePPI does not have this limitation because it uses an integration score rather than an intersection of multiple data sources. Compared to STRING, InPrePPI had consistently higher accuracy values, and its coverage values were higher or comparable in most cases, the exceptions being the high confidence class of the EcoCyc and DIP datasets. In these two cases, the difference was not as remarkable as it was in the comparison between the JOM and InPrePPI. For example, the coverage value of InPrePPI was 33.19% in the high confidence class of EcoCyc; this is comparable to the 42.33% of STRING but much higher than the 2.65% of the JOM4 (Table 2). When we considered both the accuracy and coverage values, InPrePPI outperformed STRING in all tests except the high confidence class of EcoCyc (Figure 4). Furthermore, our independent test using COG annotations indicates that the fractions of true positives in InPrePPI were consistently higher than those in STRING in all three classes of predicted PPIs (Figure 5).
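The trade-off between an intersection of methods and an integration score can be illustrated with a minimal sketch. It is not the InPrePPI or JOM implementation: the toy scores, the equal weights, the threshold, and the function names are all assumptions made for illustration, and accuracy and coverage are taken here to mean the fraction of predicted pairs that are known positives and the fraction of known positives recovered, respectively.

```python
from collections import Counter

def accuracy_coverage(predicted, positives):
    """Return (accuracy, coverage) of a predicted pair set against known positives."""
    predicted = set(predicted)
    hits = predicted & positives
    accuracy = len(hits) / len(predicted) if predicted else 0.0
    coverage = len(hits) / len(positives) if positives else 0.0
    return accuracy, coverage

def intersection_vote(method_scores, min_methods):
    """Keep pairs predicted by at least `min_methods` methods (intersection-style)."""
    counts = Counter(pair for scores in method_scores.values() for pair in scores)
    return {pair for pair, n in counts.items() if n >= min_methods}

def weighted_integration(method_scores, weights, threshold):
    """Keep pairs whose weighted sum of per-method scores reaches `threshold`."""
    total = {}
    for name, scores in method_scores.items():
        for pair, score in scores.items():
            total[pair] = total.get(pair, 0.0) + weights[name] * score
    return {pair for pair, score in total.items() if score >= threshold}

# Hypothetical toy data: per-method scores in [0, 1] for protein pairs
# (each pair is stored once, with its two names in sorted order).
method_scores = {
    "PPM": {("a", "b"): 0.9, ("a", "c"): 0.8, ("c", "d"): 0.3},
    "GCM": {("a", "b"): 0.7, ("b", "c"): 0.9},
    "GFM": {("a", "b"): 0.8, ("d", "e"): 0.2},
    "GNM": {("a", "c"): 0.9, ("b", "c"): 0.6},
}
positives = {("a", "b"), ("a", "c"), ("b", "c"), ("e", "f")}
weights = {name: 0.25 for name in method_scores}  # equal weights, an assumption

by_vote = intersection_vote(method_scores, min_methods=3)
by_score = weighted_integration(method_scores, weights, threshold=0.3)

print("intersection:", accuracy_coverage(by_vote, positives))
print("integration: ", accuracy_coverage(by_score, positives))
```

In this toy case, requiring agreement among three methods retains a single pair, whereas the weighted score also retains pairs that are strongly supported by only two methods, so coverage is higher at the same accuracy. Real data will not behave this cleanly.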
The STRING database provides a comprehensive, high-quality collection of protein-protein associations for a large number of organisms. The association data were compiled from high-throughput experimental data, mining of other databases and the literature, and PPIs predicted by genomic context approaches. We demonstrated that InPrePPI has an overall better performance than the prediction methods in STRING (phylogenetic co-occurrence, conserved neighborhood, and gene fusion). However, InPrePPI is limited to the evaluation and prediction of protein-protein pairs based on genomic context features, and its web site provides only a prediction function rather than a comprehensive collection of evidence. While the STRING database provides a powerful system for proteomics research, the amount of PPI data collected by high-throughput experiments or from the existing literature is still very limited for most organisms and is likely to remain so for some time. Computational approaches are therefore expected to play an important role in uncovering the interactomes of most genomes. Although one recent study failed to improve the prediction by adding more features, the InPrePPI method demonstrates that an appropriate integration can improve predictive power. Thus, our integrated method based on the genomic context, which is to be further optimized and enhanced, can be applied to the prediction of PPIs in many other (prokaryotic) genomes and can also be integrated into a comprehensive database such as STRING.
InPrePPI integrates four genomic context based methods. These four methods are currently the best computational methods for prokaryotic genomes, which implies that InPrePPI may be applied to the discovery of PPIs at least in prokaryotic genomes. InPrePPI uses a constant, k, to normalize the AC value and calculate the weight of each method. This constant depends on the data used and the methods integrated, and it can be obtained by a heuristic approach. When true positives are available for a genome, the optimal k value and the weight of each method can be obtained directly by the method in this study. Predicting PPIs in a genome without true positive data is very challenging at present and always relies on knowledge from other well-studied organisms. In that case, we may use the optimal k value and the weights available for E. coli, or for any other genome that is related to the target genome, and then refine them after some of the predicted PPIs have been validated (i.e., have become true positives). InPrePPI may be extended to eukaryotic genomes as well. Recent assessments of phylogenetic profiling in E. coli and yeast confirmed a similar strategy of reference organism selection in the construction of phylogenetic profiles [36–38] and indicate that the phyletic patterns of proteins in prokaryotes alone are adequate to predict functional linkages between proteins in both prokaryotic and eukaryotic genomes. Some studies have reported that neighboring genes have similar expression patterns in higher eukaryotes, implying possible interactions [39–41]. Qi et al. found that gene co-expression is consistently the most important feature in their comprehensive evaluation of PPI prediction in yeast using an integrated framework, which supports the previous finding that the most obvious co-expression comes from permanent complexes such as the ribosome and the proteasome [42, 43]. Therefore, we may consider both the genomic context information and gene co-expression data when we extend InPrePPI to eukaryotic genomes.
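The two ways of obtaining k described above (a direct search when true positives are available, and refinement of a value borrowed from a related genome) can be sketched generically. The sketch does not reproduce the way InPrePPI normalizes the AC value or computes weights; the objective is passed in as a function, and the names `choose_k`, `refine_k`, and `toy_objective`, the step sizes, and the restriction to positive k are all hypothetical choices made for illustration.

```python
from typing import Callable, Iterable

def choose_k(candidates: Iterable[float], objective: Callable[[float], float]) -> float:
    """Return the candidate k with the highest objective value.

    `objective(k)` is assumed to score the integrated predictions obtained
    with k against a gold standard of positives (higher is better).
    """
    return max(candidates, key=objective)

def refine_k(k_start: float, objective: Callable[[float], float],
             step: float = 0.5, n_steps: int = 4) -> float:
    """Search a small neighborhood around a k borrowed from a related genome."""
    neighborhood = [k_start + i * step for i in range(-n_steps, n_steps + 1)]
    # Assumption: only positive values of k are meaningful.
    return choose_k([k for k in neighborhood if k > 0], objective)

def toy_objective(k: float) -> float:
    """Placeholder objective that peaks at k = 3; stands in for a real evaluation."""
    return -(k - 3.0) ** 2

best_k = choose_k([1.0, 2.0, 3.0, 4.0, 5.0], toy_objective)
refined_k = refine_k(2.0, toy_objective, step=0.5, n_steps=4)
print(best_k, refined_k)
```

In practice, the placeholder objective would be replaced by an evaluation of the integrated predictions against the validated PPIs of the target genome, which is why the objective is kept as a parameter rather than fixed in the sketch.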
We used gold standards of positives to evaluate the PPI prediction methods. In previous studies, positive data were selected from the standardized SWISS-PROT keywords [3, 30], the metabolic maps in KEGG, the pathway information in COG, or protein complexes. So far, no complete biological database is available to serve as a gold standard of positives. To avoid a biased selection of positive data, we used three well-documented datasets: (1) biological pathway information from KEGG, (2) protein complexes from EcoCyc, and (3) experimentally identified protein-protein interactions from DIP. The prediction performance of each method varied among these three datasets (Figure 1), suggesting that positive control data should be selected carefully and with regard to the types of interactions they represent.