Volume 12 Supplement 10
Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade
Constructing a robust protein-protein interaction network by integrating multiple public databases
 VenkataSwamy Martha†^{1},
 Zhichao Liu†^{2},
 Li Guo^{3},
 Zhenqiang Su^{4},
 Yanbin Ye^{4},
 Hong Fang^{4},
 Don Ding^{4},
 Weida Tong^{2} and
 Xiaowei Xu^{1, 2}
DOI: 10.1186/1471-2105-12-S10-S7
© Swamy et al; licensee BioMed Central Ltd. 2011
Published: 18 October 2011
Abstract
Background
Protein-protein interactions (PPIs) are a critical component of many underlying biological processes. A PPI network can provide insight into the mechanisms of these processes, as well as the relationships among different proteins and toxicants that are potentially involved in the processes. There are many PPI databases publicly available, each with a specific focus. The challenge is how to effectively combine their contents to generate a robust and biologically relevant PPI network.
Methods
In this study, seven public PPI databases (BioGRID, DIP, HPRD, IntAct, MINT, REACTOME, and SPIKE) were used to explore an approach for combining multiple PPI databases into an integrated PPI network. We developed a novel method called k-votes and used it to create seven different integrated networks, one for each value of k from 1 to 7. Functional modules were mined by using SCAN, a Structural Clustering Algorithm for Networks. Overall module quality was evaluated for each integrated network using the following statistical and biological measures: (1) modularity, (2) similarity-based modularity, (3) clustering score, and (4) enrichment.
Results
Each integrated human PPI network was constructed based on the number of votes (k) that a particular interaction received from the committee of the seven original PPI databases. The performance of the functional modules obtained by SCAN from each integrated network was evaluated, and the optimal value of k was determined by this functional module analysis. Our results demonstrate that the k-votes method outperforms the traditional union approach in terms of both statistical significance and biological meaning. The best network is achieved at k=2 and is composed of interactions that are confirmed in at least two PPI databases. In contrast, the traditional union approach yields an integrated network that consists of all interactions from the seven PPI databases and may therefore contain many false positives.
Conclusions
We determined that the k-votes method for constructing a robust PPI network by integrating multiple public databases outperforms previously reported approaches and that a value of k=2 provides the best results. The developed strategies for combining databases show promise in the advancement of network construction and modeling.
Background
Protein-protein interaction (PPI) is a critical component of almost every biological process related to physiological conditions, and can be analyzed in a PPI network to discover underlying mechanisms of toxicity and disease at the integrated system level [1]. A PPI network reflects the mode of action caused by interruptions of the protein and related targets. Crucial PPIs have been shown to participate in disease-associated signaling pathways, which can offer insight for novel target identification and drug discovery. With the development of high-throughput molecular technologies such as gene expression microarrays and in vitro assay screening platforms, analyzing PPIs has become a common strategy to interpret the findings.
For example, many current studies use PPI databases to mine disease-related genes/proteins and so provide a better understanding of the mechanisms of diseases; the hypothesis is that genes related to the same disease tend to encode proteins that interact with each other [2]. Therefore, PPI data are crucial for new disease biomarker discovery, disease-disease relationship searching, and common biological function detection. However, most research focuses on constructing and evaluating PPI networks based on a single source of PPI data or by using simple unions of multiple PPI databases [3, 4]. Although many public PPI databases provide rich information, each database is developed with a specific focus and emphasis, and no single existing database is comprehensive. Therefore, developing methods to integrate PPI databases and construct a robust and biologically relevant PPI network is of great importance. The question is how to combine multiple PPI databases so that the best integrated PPI network can be established.
In this study, seven PPI databases (BioGRID, DIP, HPRD, IntAct, MINT, REACTOME, and SPIKE) were used as case studies to explore a novel approach to effectively combine multiple databases into an integrated PPI network. A structural clustering algorithm for networks (SCAN) was employed to evaluate the seven integrated networks resulting from different values of k [5]. Statistical and biological measures including modularity, similarity-based modularity, clustering score, and enrichment were used to assess the integrated networks [2]. The developed strategies for combining multiple databases show promise for future application in network construction and modeling.
Methods
Database construction
For this study, seven PPI databases were downloaded from public domain sources. BioGRID is a publication search-driven database which covers the raw protein data and genetic interactions from major model species such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens [6]. The DIP™ database catalogs experimentally determined interactions between proteins, with automatic computational corrections as well as manual reviews performed by experts [7]. HPRD includes around 40,000 PPIs detected through experiments, covering over 30,000 human protein entities [8]. IntAct is a molecular interaction database whose data are either extracted from the literature or directly deposited by expert curators following a comprehensive annotation [9]. MINT focuses on experimentally verified PPIs in all species, obtained by data-mining the scientific literature [10]. REACTOME is an open-source, manually curated, and peer-reviewed pathway database, which provides insight into gene/protein interactions from pathway perspectives and species comparisons [11]. SPIKE is a database of thoroughly curated human signaling pathways [12].
Information for the seven public PPI databases

| Database | Number of proteins | Number of interactions |
|---|---|---|
| BioGRID | 8204 | 33625 |
| DIP | 1137 | 1509 |
| HPRD | 9553 | 38802 |
| IntAct | 7495 | 30965 |
| MINT | 5230 | 15353 |
| REACTOME | 3599 | 74490 |
| SPIKE | 6927 | 23224 |
Given a set of PPI databases, each can be represented by a network consisting of a set of vertices that are connected to each other by a set of edges. In this model, each vertex is a protein, and the interaction between a pair of proteins is an edge in the network. We used seven interaction databases in this study; however, our method can be applied to any number of databases. In the following, we assume there are n networks to be integrated. Our goal is to find an optimal strategy to integrate them into the most robust and biologically significant PPI network. Formally, we use G_{i} = <V_{i}, E_{i}>, where i = 1, 2, 3, …, n, to represent the n networks obtained from the corresponding PPI databases, where G_{i} represents a network, V_{i} represents the set of vertices in the network, and E_{i} represents the set of edges in the network.
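Traditional union approach
The traditional approach takes the union of all n networks, so that an interaction reported by any one database is included:
$$G_{\cup} = \langle V_{\cup}, E_{\cup} \rangle, \qquad V_{\cup} = \bigcup_{i=1}^{n} V_i, \qquad E_{\cup} = \bigcup_{i=1}^{n} E_i$$
Novel k-votes approach
In the k-votes approach, an interaction is included in the integrated network Ĝ_{k} only if it is present in at least k of the n databases. For k=2, for example, the edge set is
$$\hat{E}_2 = \bigcup_{\{i, j\}} \left( E_i \cap E_j \right)$$
where {i, j} is a subset of {1, 2, 3, …, n}.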
The size of the integrated network shrinks as k grows according to set theory [13]. To determine an optimal value for k, we used network clustering, or functional module analysis.
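The k-votes rule (keep an interaction only if at least k databases report it) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `normalize` and `k_votes`, the toy protein identifiers, and the choice to treat interactions as undirected pairs are all assumptions made here.

```python
from collections import Counter
from itertools import chain


def normalize(edge):
    """Treat interactions as undirected (assumption): (a, b) equals (b, a)."""
    a, b = edge
    return (a, b) if a <= b else (b, a)


def k_votes(edge_sets, k):
    """Return the edges that appear in at least k of the input edge sets."""
    # Each database votes at most once per edge, hence the per-database set.
    votes = Counter(
        chain.from_iterable({normalize(e) for e in edges} for edges in edge_sets)
    )
    return {edge for edge, count in votes.items() if count >= k}


# Three hypothetical toy "databases" (not real data).
db_a = {("P1", "P2"), ("P2", "P3")}
db_b = {("P2", "P1"), ("P3", "P4")}
db_c = {("P1", "P2"), ("P3", "P4"), ("P4", "P5")}
databases = [db_a, db_b, db_c]

union_network = k_votes(databases, 1)   # k = 1 is the plain union
two_votes = k_votes(databases, 2)       # confirmed by at least two databases
three_votes = k_votes(databases, 3)     # confirmed by all three
```

With k=1 the function returns the union of all edges, and the result can only shrink as k increases, which mirrors the trade-off between coverage and reliability described in the text.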
Network clustering algorithm  SCAN
We applied SCAN for functional module analysis [5]. SCAN identifies clusters, or functional modules, based on the structural similarity of pairs of vertices that are connected by an edge. Structural similarity is calculated from the neighbors that the two vertices have in common. The algorithm tries to assign a vertex to a cluster in which it shares many common neighbors with other members of the cluster. More specifically, a vertex is added to a cluster if its structural similarity with a member is larger than a threshold value ε.
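The structural similarity that drives SCAN can be illustrated as follows. The sketch assumes the usual SCAN definition, in which the neighborhood Γ(v) contains v itself and the similarity of u and v is |Γ(u) ∩ Γ(v)| / sqrt(|Γ(u)| · |Γ(v)|). It covers only the similarity and the ε filter on edges; the full algorithm also uses a minimum-neighbors parameter and a cluster-growing step, which are omitted. The function names and the toy graph are hypothetical.

```python
import math
from collections import defaultdict


def build_neighborhoods(edges):
    """Map each vertex to its closed neighborhood (neighbors plus itself)."""
    gamma = defaultdict(set)
    for u, v in edges:
        gamma[u].update((u, v))
        gamma[v].update((u, v))
    return gamma


def structural_similarity(gamma, u, v):
    """Shared closed-neighborhood size, normalized by the geometric mean."""
    shared = len(gamma[u] & gamma[v])
    return shared / math.sqrt(len(gamma[u]) * len(gamma[v]))


def similar_edges(edges, eps):
    """Keep the edges whose endpoints have similarity of at least eps."""
    gamma = build_neighborhoods(edges)
    return [(u, v) for u, v in edges
            if structural_similarity(gamma, u, v) >= eps]


# Toy graph: a triangle A-B-C with a pendant vertex D attached to C.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
toy_gamma = build_neighborhoods(toy_edges)
sim_ab = structural_similarity(toy_gamma, "A", "B")
sim_cd = structural_similarity(toy_gamma, "C", "D")
strong = similar_edges(toy_edges, eps=0.8)
```

In the toy graph the triangle edges are structurally similar, while the pendant edge C-D falls below ε=0.8 and would not pull D into the cluster.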
Statistical clustering quality measures
In order to achieve an optimal clustering result, the threshold ε needs to be determined. For this purpose, SCAN is run with a range of ε values such as 0.1, 0.2, …, 0.9, which gives one clustering result per ε value. The quality of each clustering result is then measured by two statistical clustering quality measures, modularity and similarity-based modularity [14].
Modularity
Modularity is defined as
$$Q_N = \sum_{s=1}^{NC} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]$$
where NC is the number of clusters, L is the number of edges in the network, l_{s} is the number of edges between vertices within cluster s, and d_{s} is the sum of the degrees of the vertices in cluster s. The value of the modularity measure Q_{N} ranges from 0 to 1. The optimal clustering is achieved by maximizing Q_{N}. Modularity is defined such that Q_{N} is 0 at either extreme of all vertices clustered into a single cluster, or of all vertices randomly clustered.
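As a worked example of the quantities just defined, the sketch below computes modularity for a small graph, assuming the standard Newman-Girvan form Q_N = Σ_s [ l_s / L − (d_s / 2L)² ]. The function name `modularity`, the `cluster_of` mapping, and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Q_N = sum over clusters s of l_s / L - (d_s / (2 L))^2."""
    total_edges = len(edges)          # L
    internal = defaultdict(int)       # l_s: edges inside cluster s
    degree_sum = defaultdict(int)     # d_s: summed degrees of cluster s
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(
        internal[s] / total_edges - (degree_sum[s] / (2 * total_edges)) ** 2
        for s in degree_sum
    )


# Two triangles joined by a single bridge edge C-D.
graph = [("A", "B"), ("B", "C"), ("A", "C"),
         ("D", "E"), ("E", "F"), ("D", "F"),
         ("C", "D")]
two_clusters = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}
one_cluster = {v: 1 for v in "ABCDEF"}

q_two = modularity(graph, two_clusters)   # 2 * (3/7 - (7/14)^2) = 5/14
q_one = modularity(graph, one_cluster)    # 7/7 - (14/14)^2 = 0
```

Splitting the graph into its two triangles scores 5/14, whereas lumping every vertex into one cluster scores 0, matching the single-cluster extreme described above.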
Similarity-based modularity
Similarity-based modularity is defined as
$$Q_S = \sum_{i=1}^{NC} \left[ \frac{IS_i}{TS} - \left( \frac{DS_i}{TS} \right)^2 \right]$$
where NC is the number of clusters, IS_{i} is the sum of structural similarity between any connected pair of vertices within cluster i, DS_{i} is the total structural similarity for all vertices in cluster i, and TS is the sum of structural similarities for all pairs of vertices in the network [14]. Maximization of Q_{S} yields an optimal network clustering even when the size of the clusters varies strongly. Like regular modularity, the value of similarity-based modularity is between 0 and 1.
Significance tests
In addition to the two statistical clustering quality measures above, we also perform significance tests to evaluate the clustering results based on the biological meaning. These tests include clustering score and enrichment test, described below.
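Clustering score
$$\text{Clustering score} = 1 - \frac{\sum_{i=1}^{n_S} \min(p_i) + n_I \cdot \text{cutoff}}{(n_S + n_I) \cdot \text{cutoff}}$$
where min(p_{i}) is the smallest p-value of significant cluster i, cutoff is the p-value threshold below which a cluster is considered significant, and n_{S} and n_{I} denote the number of significant and insignificant clusters, respectively [18]. The clustering score is between 0 and 1. The maximal clustering score indicates an optimal clustering outcome.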
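Enrichment
$$\text{Enrichment} = \frac{\left. \sum_{s} \sum_{\{u, v\} \subseteq C_s} \mu(u, v) \right/ \sum_{s} \binom{|C_s|}{2}}{\left. \sum_{\{u, v\} \subseteq V} \mu(u, v) \right/ \binom{|V|}{2}}$$
where C_{s} is a cluster of size |C_{s}|, μ(u, v) is the annotation similarity between proteins u and v, and V is the set of all proteins in the network. Enrichment is the average annotation similarity between all pairs of proteins that share a cluster divided by the average annotation similarity between all pairs of proteins in the network [2]. This enrichment quantifies the quality of all clusters given the domain knowledge from KEGG. To compare enrichment with the other quality measures on the same scale, we normalize it so that the normalized enrichment value ranges from 0 to 1.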
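The verbal definition of enrichment (mean annotation similarity of same-cluster pairs divided by the mean over all pairs) can be sketched as below. The annotation similarity measure is not specified in the text, so the Jaccard index over pathway labels used here is purely an assumption, as are the function names and the toy annotations. The final normalization to the 0 to 1 range is not shown.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed annotation similarity: Jaccard index of two label sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pair_similarity(pairs, annotations):
    """Average similarity over a non-empty list of protein pairs."""
    sims = [jaccard(annotations[u], annotations[v]) for u, v in pairs]
    return sum(sims) / len(sims)


def enrichment(clusters, annotations):
    """Within-cluster mean similarity relative to the network-wide mean."""
    within = [pair for cluster in clusters
              for pair in combinations(sorted(cluster), 2)]
    everyone = list(combinations(sorted(annotations), 2))
    return (mean_pair_similarity(within, annotations)
            / mean_pair_similarity(everyone, annotations))


# Hypothetical annotations: each protein maps to a set of pathway labels.
toy_annotations = {
    "P1": {"cell cycle"},
    "P2": {"cell cycle"},
    "P3": {"proteasome"},
    "P4": {"proteasome"},
}
toy_clusters = [{"P1", "P2"}, {"P3", "P4"}]
toy_enrichment = enrichment(toy_clusters, toy_annotations)
```

Here every same-cluster pair shares its annotation (mean 1), while only 2 of the 6 possible pairs do so network-wide (mean 1/3), so the unnormalized enrichment is 3. A value of 1 would mean the clustering is no better than random pairing.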
KEGG pathway
There are a total of 199 unique human pathways in KEGG, which involve 5197 unique genes/proteins; the pathway data can be downloaded from http://www.genome.jp/kegg/pathway.html. In this study, KEGG pathway analysis was performed to investigate whether the modules are biologically meaningful.
Results
After all seven integrated networks were constructed, cluster analysis was performed on each one using the SCAN algorithm with ε values in steps of 0.01 from 0 to 1. Each ε value yielded a clustering result. For each clustering result we calculated the four quality measures (modularity, similarity-based modularity, clustering score, and normalized enrichment), which are shown in Figures 2A and 2B. The integrated network that achieved the best overall performance across these clustering quality measures was recognized as the most informative network.
Seven integrated PPI networks yielded by using the k-votes method
Ĝ_{2} (k=2): The network comprises interactions that are present in at least two PPI databases. We observed that modularity could not be optimized for any ε value, as was the case for Ĝ_{1} (k=1) (Figure 2A). We obtained an optimal similarity-based modularity at ε=0.3, which again demonstrates a superior performance over modularity. In contrast to Ĝ_{1} (k=1), there is a clear maximum for both the clustering score and the normalized enrichment value, at ε=0.59 and ε=0.74, respectively. Therefore, the network Ĝ_{2} (k=2) is both statistically significant and biologically meaningful.
Ĝ_{3}, Ĝ_{4}, and Ĝ_{5} (k=3, 4, 5): For the three networks constructed by using k=3, 4, and 5, respectively, we observed an optimum in terms of the statistical clustering quality measures, including both modularity and similarity-based modularity (Figure 2B). However, there is no biological optimum in terms of either clustering score or enrichment. Therefore, the networks are statistically significant, but not biologically meaningful. Interestingly, we found that modularity and similarity-based modularity were optimized at the same ε value. Since these networks do not possess biological significance, we rule them out as comprehensive networks. One factor that could contribute to the poor biological significance of these networks is the low coverage of interactions, which results from the high number of votes (k) required for the consensus.
Ĝ_{6} and Ĝ_{7} (k=6, 7): For the networks constructed by using k=6 and 7, respectively, the significance tests show flat results over every ε value, which indicates that neither network has statistical or biological significance (Figure 2B). The main reason is the sparsity of interactions among proteins; most of the proteins and their interactions are not present in these networks.
Presence of optimal quality measures

| Network | #Nodes | #Edges | Optimal modularity | Optimal similarity-based modularity | Optimal clustering score | Optimal enrichment |
|---|---|---|---|---|---|---|
| Ĝ_{1} | 12043 | 132603 | No | Yes | No | No |
| Ĝ_{2} | 9188 | 36086 | No | Yes | Yes | Yes |
| Ĝ_{3} | 6464 | 17222 | Yes | Yes | No | No |
| Ĝ_{4} | 4209 | 8108 | Yes | Yes | No | No |
| Ĝ_{5} | 2286 | 3619 | Yes | Yes | No | No |
| Ĝ_{6} | 345 | 302 | No | No | No | No |
| Ĝ_{7} | 59 | 40 | No | No | No | No |
Pathway analysis
Top ten modules with significant biological enrichment in KEGG

| Cluster ID | KEGG pathway | Proteins in the module | Module proteins in the KEGG pathway | Proteins in the KEGG pathway | Fisher's p-value |
|---|---|---|---|---|---|
| 1 | RNA polymerase / Transcription / Genetic Information Processing | 10 | 10 | 29 | 5.10E-24 |
| 2 | Progesterone-mediated oocyte maturation / Endocrine System / Cellular Processes | 12 | 12 | 86 | 1.91E-22 |
| 3 | Proteasome / Folding, Sorting and Degradation / Genetic Information Processing | 17 | 12 | 48 | 5.22E-22 |
| 4 | Basal transcription factors / Transcription / Genetic Information Processing | 9 | 9 | 36 | 1.24E-20 |
| 5 | Cell cycle / Cell Growth and Death / Cellular Processes | 12 | 12 | 128 | 2.97E-20 |
| 6 | Ubiquitin mediated proteolysis / Folding, Sorting and Degradation / Genetic Information Processing | 12 | 12 | 138 | 7.61E-20 |
| 7 | Cell cycle / Cell Growth and Death / Cellular Processes | 13 | 12 | 128 | 3.78E-19 |
| 8 | Pyrimidine metabolism / Nucleotide Metabolism / Metabolism | 10 | 10 | 98 | 3.57E-18 |
| 9 | Oocyte meiosis / Cell Growth and Death / Cellular Processes | 12 | 11 | 114 | 4.09E-18 |
| 10 | RNA degradation / Folding, Sorting and Degradation / Genetic Information Processing | 11 | 9 | 59 | 8.97E-17 |
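The Fisher's p-values in the table above measure how surprising the overlap between a module and a KEGG pathway is. A one-sided Fisher's exact test for over-representation is equivalent to the upper tail of a hypergeometric distribution, which the sketch below computes with exact integer arithmetic. The function name is hypothetical, and using the 5197 unique KEGG genes/proteins as the background population is an assumption, since the text does not state the background explicitly.

```python
from fractions import Fraction
from math import comb


def fisher_enrichment_p(population, pathway_size, module_size, overlap):
    """One-sided p-value P(X >= overlap) for X ~ Hypergeometric.

    population   : size of the background set (assumed here, not given)
    pathway_size : number of background proteins annotated to the pathway
    module_size  : number of proteins in the module
    overlap      : module proteins that are annotated to the pathway
    """
    total = comb(population, module_size)
    upper = min(pathway_size, module_size)
    tail = sum(
        comb(pathway_size, x) * comb(population - pathway_size, module_size - x)
        for x in range(overlap, upper + 1)
    )
    return float(Fraction(tail, total))


# Cluster 1: all 10 module proteins fall in the 29-protein RNA polymerase pathway.
p_cluster_1 = fisher_enrichment_p(
    population=5197, pathway_size=29, module_size=10, overlap=10
)
```

Under the assumed background of 5197 proteins, `p_cluster_1` comes out at roughly 5e-24, which is in line with the value listed for cluster 1; a different background set would shift the p-values.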
Discussion
PPI networks play a critical role in many biological studies. While there are many publicly available PPI databases, each source focuses on one type of interaction, and no single source provides a comprehensive view of all interactions. Thus, integration of multiple sources is a promising approach to establishing a comprehensive PPI network. In this study, a collection of seven interaction databases is explored for the construction of a robust and biologically significant PPI network. The main contributions are twofold: first, we devised a novel approach, namely k-votes, for the integration of multiple interaction networks that were extracted from publicly available sources; second, we developed a network clustering-based framework to determine the best integration strategy, which is defined by the value of k.
Recently, Cerami et al. applied the union approach for the fusion of publicly available pathway data from multiple sources [3]. While the union approach is easy to implement and has maximal coverage of potential interactions, the interactions in the integrated network may not be accurate because of quality issues such as processing errors or missing values in the individual databases. Therefore, the resulting network is not as reliable as one built with our k-votes approach using an optimal k. In the k-votes approach, each individual network can be seen as an expert with both strengths and weaknesses in terms of the interaction data. Thus, a more robust integration can be achieved based on a partial consensus of the committee of all experts, which consists of the individual input databases.
To determine an optimal k, we used several quality measures and performed cluster analysis on the integrated network. The rationale is that a high-quality network yields high-quality functional modules, which can be assessed by quality measures including modularity, similarity-based modularity, clustering score, and enrichment. Therefore, the optimal k is estimated by calculating the clustering quality measures for all possible values of k. The optimal k yields a network that achieves an overall maximum of the clustering quality measures. Note that using a higher k decreases the number of interactions found in the networks; the increased robustness is achieved at a possible loss of information.
We used the SCAN algorithm for the cluster analysis. Both theoretical and empirical studies show that SCAN can quickly and successfully identify clusters as well as vertices that play special roles (e.g., outliers and hubs) in large networks [5]. In another study, Mete et al. applied SCAN for the identification of functional modules in PPI networks [19]. The experimental results demonstrate a superior performance compared to other state-of-the-art algorithms, such as modularity-based algorithms [15].
Modules enriched in PPI networks have been mined to discover new biomarkers related to specific diseases such as breast cancer and diabetes [20, 21]. In this study, our SCAN results yield not only a statistically significant integrated PPI network but also biologically meaningful modules, which are similar to network analysis results from GeneGo (http://www.genego.com/) and IPA (http://www.ingenuity.com/). The enrichment results in Table 3 demonstrate that functionally similar PPIs can be clustered together.
In summary, this study demonstrates that the integration strategy of using the consensus of two out of the seven databases delivered the best results in terms of both robustness and significance. On the other hand, there is a trade-off between the coverage and the reliability of protein-protein interactions. The maximal coverage can be achieved by using the traditional union approach for the integration, which is also a special case of our k-votes method (k=1). The integration of multiple databases is a promising bioinformatics strategy that can advance knowledge discovery using various public biological databases.
Conclusions
We determined that the k-votes method for constructing a robust PPI network by integrating multiple public databases outperforms previously reported approaches. Furthermore, our systematic analysis reveals that using a consensus of k=2 yields the optimal integration for the seven PPI databases used in this study. The k-votes approach holds the potential to improve the integration of other types of networks, such as human disease networks.
Disclaimer
The views presented in this article do not necessarily reflect those of the US Food and Drug Administration.
Notes
List of abbreviations used
 PPI: Protein-protein interaction
 SCAN: Structural Clustering Algorithm for Networks
Declarations
Acknowledgements
ZL and XX are grateful to the National Center for Toxicological Research (NCTR) of the U.S. Food and Drug Administration (FDA) for postdoctoral and faculty support through the Oak Ridge Institute for Science and Education (ORISE).
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 10, 2011: Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S10.
References
 1. Bonetta L: Interactome under construction. Nature 2010, 468(7325):851–854. 10.1038/468851a
 2. Ahn YY, Bagrow JP, Lehmann S: Link communities reveal multiscale complexity in networks. Nature 2010, 466(7307):761–764. 10.1038/nature09182
 3. Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, Schultz N, Bader GD, Sander C: Pathway Commons, a web resource for biological pathway data. Nucleic Acids Research 2011, 39:D685–D690. 10.1093/nar/gkq1039
 4. Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R: ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Research 2011, 39:D712–D717. 10.1093/nar/gkq1156
 5. Xu X, Yuruk N, Feng Z, Schweiger T: SCAN: a structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Jose, California, USA; 2007:824–833.
 6. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 2006, 34:D535–D539. 10.1093/nar/gkj109
 7. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the Database of Interacting Proteins. Nucleic Acids Research 2000, 28(1):289–291. 10.1093/nar/28.1.289
 8. Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al.: Human Protein Reference Database - 2009 update. Nucleic Acids Research 2009, 37:D767–D772. 10.1093/nar/gkn892
 9. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al.: The IntAct molecular interaction database in 2010. Nucleic Acids Research 2010, 38:D525–D531. 10.1093/nar/gkp878
 10. Ceol A, Aryamontri AC, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G: MINT, the molecular interaction database: 2009 update. Nucleic Acids Research 2010, 38:D532–D539. 10.1093/nar/gkp983
 11. Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, et al.: Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Research 2009, 37:D619–D622. 10.1093/nar/gkn863
 12. Paz A, Brownstein Z, Ber Y, Bialik S, David E, Sagir D, Ulitsky I, Elkon R, Kimchi A, Avraham KB, et al.: SPIKE: a database of highly curated human signaling pathways. Nucleic Acids Research 2011, 39:D793–D799. 10.1093/nar/gkq1167
 13. Jech T: Set Theory: Third Millennium Edition. Berlin, New York: Springer-Verlag: Springer Monographs in Mathematics; 2003.
 14. Feng Z, Xu X, Yuruk N, Schweiger T: A novel similarity-based modularity function for graph partitioning. Lect Notes Comp Sci 2007, 64: 358–396.
 15. Newman MEJ, Girvan M: Finding and evaluating community structure in networks. Physical Review E 2004, 69(2):15.
 16. Fortunato S, Barthelemy M: Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America 2007, 104(1):36–41. 10.1073/pnas.0605965104
 17. Bu DB, Zhao Y, Cai L, Xue H, Zhu XP, Lu HC, Zhang JF, Sun SW, Ling LJ, Zhang N, et al.: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Research 2003, 31(9):2443–2450. 10.1093/nar/gkg340
 18. Asur S, Ucar D, Parthasarathy S: An ensemble framework for clustering protein-protein interaction networks. Bioinformatics 2007, 23(13):i29–i40.
 19. Mete M, Tang FS, Xu X, Yuruk N: A structural approach for finding functional modules from large biological networks. BMC Bioinformatics 2008, 9.
 20. Chuang HY, Lee E, Liu YT, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Molecular Systems Biology 2007, 3.
 21. Liu M, Liberzon A, Kong SW, Lai WR, Park PJ, Kohane IS, Kasif S: Network-based analysis of affected biological processes in type 2 diabetes models. PLoS Genetics 2007, 3(6):958–972.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.