PPI networks play a critical role in many biological studies. While there are many publicly available PPI databases, each source focuses on a particular type of interaction, and no single source provides a comprehensive view of all interactions. Thus, integration of multiple sources is a promising approach to establish a comprehensive PPI network. In this study, a collection of seven interaction databases is explored for the construction of a robust and biologically significant PPI network. The main contributions are twofold: first, we devised a novel approach, namely k-votes, for the integration of multiple interaction networks that were extracted from publicly available sources; second, we developed a network clustering-based framework to determine the best integration strategy, which is defined by the value of k.
Recently, Cerami et al. applied the union approach for the fusion of publicly available pathway data from multiple sources. While the union approach is easy to implement and has maximal coverage of potential interactions, the interactions in the integrated network may not be accurate due to quality issues such as processing errors or missing values in the individual databases. Therefore, the resulting network is not as reliable as one built with our k-votes approach using an optimal k. In k-votes, each individual network can be seen as an expert that has both strengths and weaknesses in terms of the interaction data, and an interaction is retained only when at least k experts report it. Thus, a more robust integration can be achieved based on a partial consensus of the committee of all experts, which consists of the individual input databases.
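The voting idea can be illustrated with a minimal sketch. It assumes that each source is reduced to a list of undirected protein pairs and that an interaction is kept when at least k sources report it; the function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter

def normalize(edge):
    """Treat interactions as undirected: (A, B) and (B, A) are the same edge."""
    a, b = edge
    return (a, b) if a <= b else (b, a)

def k_votes(networks, k):
    """Keep every interaction reported by at least k of the input networks."""
    votes = Counter()
    for edges in networks:
        # Deduplicate within a source so each database casts at most one vote per edge.
        votes.update({normalize(e) for e in edges})
    return {edge for edge, n in votes.items() if n >= k}

# Toy data: three hypothetical sources, not the seven databases used in the study.
db_a = [("P1", "P2"), ("P2", "P3")]
db_b = [("P2", "P1"), ("P3", "P4")]
db_c = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
sources = [db_a, db_b, db_c]

union = k_votes(sources, 1)      # k = 1 reproduces the union approach
consensus = k_votes(sources, 2)  # stricter: at least two sources must agree
```

With k = 1 every reported pair survives, whereas raising k removes pairs that only a single source supports.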
To determine an optimal k, we used several quality measures and performed cluster analysis on the integrated network. The rationale is that a high-quality network yields high-quality functional modules, which can be assessed by quality measures including modularity, similarity-based modularity, clustering score, and enrichment. Therefore, the optimal k is estimated by calculating the clustering quality measures for all possible values of k. The optimal k yields a network that achieves an overall maximum of the clustering quality measures. Note that using a higher k decreases the number of interactions found in the network; the increased robustness is achieved at a possible loss of information.
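The selection of k can be sketched as a loop over all candidate values. The sketch below makes several simplifying assumptions: it scores each integrated network with Newman modularity only (one of the four measures named above), and it uses connected components as a stand-in clustering rather than SCAN. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def k_votes(networks, k):
    """Keep undirected edges reported by at least k sources."""
    votes = Counter()
    for edges in networks:
        votes.update({tuple(sorted(e)) for e in edges})
    return {e for e, n in votes.items() if n >= k}

def components(edges):
    """Stand-in clustering: connected components (NOT the SCAN algorithm)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    label = {}
    for start in adj:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in label:
                    label[nb] = start
                    stack.append(nb)
    return label

def modularity(edges, label):
    """Newman modularity Q = sum over clusters c of e_c/m - (d_c/(2m))^2."""
    m = len(edges)
    if m == 0:
        return 0.0
    inside, degree = Counter(), Counter()
    for a, b in edges:
        degree[label[a]] += 1
        degree[label[b]] += 1
        if label[a] == label[b]:
            inside[label[a]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def best_k(networks, score):
    """Score the k-votes network for every k from 1 to the number of sources."""
    scores = {}
    for k in range(1, len(networks) + 1):
        edges = k_votes(networks, k)
        scores[k] = score(edges, components(edges))
    return max(scores, key=scores.get), scores

# Toy data constructed for illustration only: two triangles joined by one
# weakly supported edge (C-D), which only a single source reports.
db_a = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
db_b = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E"), ("E", "F"), ("D", "F")]
db_c = [("D", "E"), ("E", "F"), ("D", "F"), ("A", "B")]

k, scores = best_k([db_a, db_b, db_c], modularity)
```

In a fuller analysis the single score would be replaced by a combination of all quality measures, and the stand-in clustering by the actual clustering algorithm.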
We used the SCAN algorithm for the cluster analysis. Both theoretical and empirical studies show that SCAN can quickly and successfully identify clusters as well as vertices that play special roles (e.g., outliers and hubs) in large networks. In another study, Mete et al. applied SCAN for the identification of functional modules in PPI networks. The experimental results demonstrate a superior performance compared to other state-of-the-art algorithms, such as modularity-based algorithms.
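SCAN groups vertices by structural similarity, i.e., the overlap of their closed neighborhoods normalized by the geometric mean of the neighborhood sizes. The following is a minimal sketch of that similarity and of the core-vertex test only, not of the full algorithm; the thresholds eps and mu and the toy graph are illustrative assumptions.

```python
import math
from collections import defaultdict

def closed_neighborhoods(edges):
    """Gamma(v): the neighbors of v together with v itself."""
    gamma = defaultdict(set)
    for a, b in edges:
        gamma[a].update((a, b))
        gamma[b].update((a, b))
    return gamma

def structural_similarity(gamma, v, w):
    """sigma(v, w) = |Gamma(v) & Gamma(w)| / sqrt(|Gamma(v)| * |Gamma(w)|)."""
    return len(gamma[v] & gamma[w]) / math.sqrt(len(gamma[v]) * len(gamma[w]))

def is_core(gamma, v, eps, mu):
    """v is a core if at least mu members of Gamma(v) have similarity >= eps to v."""
    close = [w for w in gamma[v] if structural_similarity(gamma, v, w) >= eps]
    return len(close) >= mu

# Toy graph: a triangle A-B-C with a pendant vertex D attached to C.
gamma = closed_neighborhoods([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
sim_ab = structural_similarity(gamma, "A", "B")
core_a = is_core(gamma, "A", eps=0.8, mu=3)
core_d = is_core(gamma, "D", eps=0.8, mu=3)
```

Vertices inside the triangle share most of their neighbors and qualify as cores under these thresholds, while the pendant vertex does not.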
The modules enriched in the PPI networks were mined to discover new biomarkers related to specific diseases such as breast cancer and diabetes [20, 21]. In this study, our SCAN results not only yield a statistically significant integrated PPI network but also produce biologically meaningful modules, which are similar to network analysis results from GeneGo (http://www.genego.com/) and IPA (http://www.ingenuity.com/). The enrichment results in Table 3 demonstrate that functionally similar PPIs are clustered together.
In summary, this study demonstrates that the integration strategy of using the consensus of two out of the seven databases delivers the best results in terms of both robustness and significance. However, there is a trade-off between the coverage and the reliability of protein-protein interactions. Maximal coverage can be achieved by using the traditional union approach for the integration, which is a special case of our k-votes method (k=1). The integration of multiple databases is a promising bioinformatics strategy that can advance knowledge discovery using various public biological databases.