Identification of functional hubs and modules by converting interactome networks into hierarchical ordering of proteins
© Cho et al; licensee BioMed Central Ltd. 2010
Published: 29 April 2010
Protein-protein interactions play a key role in the biological processes within a cell. Recent high-throughput techniques have generated protein-protein interaction data on a genome scale. A wide range of computational approaches have been applied to interactome network analysis to uncover functional organizations and pathways. However, they have been challenged by the complex connectivity of these networks. Previous studies have shown that protein interaction networks are typically characterized by intrinsic topological features: high modularity and hub-oriented structure. Elucidating the structural roles of modules and hubs is a critical step in complex interactome network analysis.
We propose a novel approach to convert the complex structure of an interactome network into a hierarchical ordering of proteins. This algorithm measures functional similarity between proteins based on the path strength model, and reveals a hub-oriented tree structure hidden in the complex network. We score hub confidence and identify functional modules in the tree structure of proteins retrieved by our algorithm. Our experimental results on the yeast protein interactome network demonstrate that the selected hubs are essential proteins for performing functions. In network topology, they have a role in bridging different functional modules. Furthermore, our approach has high accuracy in identifying hierarchically distributed functional modules.
Decomposing, converting, and synthesizing complex interaction networks are fundamental tasks for modeling their structural behaviors. In this study, we systematically analyzed complex interactome network structures to retrieve functional information. Unlike previous hierarchical clustering methods, this approach dynamically explores the hierarchical structure of proteins in a global view. It is applicable to the interactome networks of higher organisms because of its efficiency and scalability.
Recent high-throughput experimental techniques, such as the yeast two-hybrid system and mass spectrometry, have made remarkable advances in identifying protein-protein interactions on a genome-wide scale. Since the evidence of protein-protein interactions provides insights into the underlying mechanisms of biological processes within a cell, the availability of a large amount of interaction data has introduced a new paradigm for the functional characterization of proteins at the system level.
A protein interactome network is structured by the set of genome-wide protein-protein interactions determined in each organism. A wide range of computational approaches [3–6] have attempted to analyze the interaction networks effectively for the purpose of predicting protein function or detecting functional modules. However, unraveling the complex connectivity has been a critical challenge. The false positive interactions, which typically appear in high-throughput experimental data, and functionally inconsistent interacting pairs have added to the complexity. Thus, refining the noisy data and restructuring the complex network into a well-organized data format are crucial pre-processing steps for improving network analysis.
In recent years, studies have shown that protein interaction networks are characterized by intrinsic features, such as high modularity and hub-oriented structure. A network comprises a collection of functional modules that are interpreted as sets of proteins participating in the same function. In general, a module is considered a sub-graph whose nodes are densely connected with each other and sparsely connected with the rest of the network. Density-based clustering methods have been proposed to seek densely connected sub-graphs using various density functions [10–13]. However, they are not able to capture the global patterns of functional organizations from protein interaction networks. Functional modules are typically organized in a recursive manner such that a module includes one or more sub-modules having more specific functions. Hierarchical clustering methods have thus been applied to the networks for finding functional organizations [14–17]. The bottom-up approaches iteratively merge nodes or sub-networks, whereas the top-down approaches recursively divide the network into sub-networks. However, as a critical drawback, they are typically sensitive to complex connectivity and noisy data.
Hubs in a scale-free network play a central role in characterizing its structure. Intramodule hubs ('party' hubs) have high connectivity to the members in a module, and intermodule hubs ('date' hubs) bridge different modules. Previous studies have observed that such hubs in protein interaction networks are essential in terms of functionality [19–22] and, in particular, intramodule hubs have low evolutionary rates [23, 24]. The concepts of modules and hubs, extending from specific (local) to general (global), suggest the potential structure of a hierarchy that might be hidden in complex interaction networks. How can we then effectively extract the hierarchical structure of proteins from the complex network to reveal the global picture of functional organizations?
In this study, we present a novel method for restructuring a complex interactome network into a hierarchical data format in order to reveal functional hubs and organizations. Our algorithm uses a weighted interaction network as an input. Because the network includes a significant number of false positive connections, the reliability or intensity of interactions should be assessed and assigned to the edges as weights. For network restructuring, we design a path strength model that quantifies the functional similarity between two proteins. The interactome network having complex connectivity is then dynamically converted into a hub-oriented tree structure by the definition of path-strength-based centrality. From the hierarchical structure, we score hub confidence for each node, and generate hierarchically organized clusters of proteins. Unlike degree, which is a local significance measure, the hub confidence estimates the global significance of nodes. It is thus capable of selecting hubs that are located in critical positions of the network. The experimental results demonstrate that the hubs with high confidence are essential for performing functions. In network topology, they mostly bridge different functional modules. Furthermore, our approach has higher accuracy in identifying functional modules than other hierarchical clustering methods.
Path strength model
The path strength of a path p thus has a positive relationship with the weights of the edges on p, and a negative relationship with the weighted degrees of the nodes on p. Formula 2 also implies that the path strength has an inverse relationship with the length of p because the weighted probability, $w_{i(i+1)} / d^{wt}(v_i)$, is in the range between 0 and 1, inclusive. As the length of p increases, the product of the weighted probabilities decreases monotonically. In the same manner, as the average degree of the nodes on p increases, the path strength of p is likely to decrease.
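The relationships described above can be illustrated with a minimal sketch. It assumes that the path strength is the plain product, over the edges of the path, of the edge weight divided by the weighted degree of the edge's starting node; Formula 2 is not reproduced in this text, so any additional scaling term it contains is omitted. The helper names and the toy network are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted graph as a dict of dicts."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph


def weighted_degree(graph, v):
    """Sum of the weights of all edges incident to v."""
    return sum(graph[v].values())


def path_strength(graph, path):
    """Product of w(v_i, v_{i+1}) / d_wt(v_i) along the path.

    Assumed form only: each factor lies in [0, 1], so extending a path
    can never increase the product.
    """
    strength = 1.0
    for u, v in zip(path, path[1:]):
        strength *= graph[u][v] / weighted_degree(graph, u)
    return strength


# Hypothetical toy network; weights stand for interaction reliability.
toy = build_graph([("A", "B", 0.8), ("B", "C", 0.5),
                   ("A", "C", 0.2), ("C", "D", 0.9)])

short = path_strength(toy, ["A", "B"])
longer = path_strength(toy, ["A", "B", "C"])
```

In this toy network the two-edge path is weaker than its one-edge prefix, which matches the inverse relationship between path strength and path length described in the text.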
Since any node pair selected in a small-world network is directly or indirectly connected with a relatively small path length, the maximum path length between them is typically limited. However, Formula 3 still has a computational problem when it enumerates all possible paths between a and b. To reduce the computational cost, we restrict the maximum path length.
where l ≤ k ≤ l + θ and l is the shortest path length between a and b. Based on the assumption that edge weights represent the likelihood of functional linkage of interacting protein pairs, Formula 5 measures the potential of functional association between two proteins, directly or indirectly connected in a protein interactome network.
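The length restriction can be sketched as follows. The example only enumerates the simple paths between a and b whose length k satisfies l ≤ k ≤ l + θ; how Formula 5 combines the strengths of these paths is not restated here, so no aggregation is attempted. The function names and the toy adjacency lists are hypothetical.

```python
from collections import deque


def shortest_path_length(adj, a, b):
    """Unweighted BFS distance in edges; None if b is unreachable from a."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def bounded_paths(adj, a, b, theta):
    """All simple paths from a to b with length k, where l <= k <= l + theta."""
    l = shortest_path_length(adj, a, b)
    if l is None:
        return []
    limit = l + theta
    found = []

    def extend(path, visited):
        u = path[-1]
        if u == b:
            found.append(list(path))
            return
        if len(path) - 1 == limit:  # path already uses the maximum number of edges
            return
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                path.append(v)
                extend(path, visited)
                path.pop()
                visited.remove(v)

    extend([a], {a})
    return found


# Hypothetical toy network given as adjacency lists.
toy_adj = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}

only_shortest = bounded_paths(toy_adj, "A", "D", theta=0)
with_slack = bounded_paths(toy_adj, "A", "D", theta=1)
```

With θ = 0 only the shortest paths are considered; each increment of θ admits paths one edge longer, so θ trades coverage of indirect connections against enumeration cost.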
Selecting a parent node for each node by Formula 8 then efficiently constructs a hierarchical tree structure. The node having the highest centrality among all the nodes in the network has no parent and becomes the root node. This hierarchical structure is dynamically converted on network growth, depending on the distribution of the path-strength-based centrality of nodes.
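A rough sketch of the tree construction is given below. Formula 8 is not reproduced in this text, so the parent rule used here is an assumption: each node takes as its parent the node with strictly higher centrality to which it is most similar. The centrality and similarity values are hypothetical inputs that would come from the path strength model.

```python
def build_tree(centrality, similarity):
    """Map each node to its parent; the root maps to None.

    Assumed rule (not taken from Formula 8): the parent of v is the node
    with strictly higher centrality that is most similar to v. If several
    nodes tie for the highest centrality, each of them becomes a root.
    """
    parent = {}
    for v in centrality:
        candidates = [u for u in centrality if centrality[u] > centrality[v]]
        if not candidates:
            parent[v] = None
            continue
        parent[v] = max(
            candidates,
            key=lambda u: similarity.get(frozenset((u, v)), 0.0),
        )
    return parent


# Hypothetical centrality scores and pairwise similarities.
centrality = {"A": 0.9, "B": 0.6, "C": 0.4, "D": 0.2}
similarity = {
    frozenset(("A", "B")): 0.7,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.2,
    frozenset(("C", "D")): 0.6,
}

tree = build_tree(centrality, similarity)
```

Because every parent has strictly higher centrality than its child, the result cannot contain a cycle, and the node with the highest centrality is left without a parent, as described in the text.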
Identifying hubs and clustering proteins
The hub confidence in Formula 11 quantifies how likely the node is to be a structural hub. Since an edge weight represents the functional consistency between its two end nodes, the structural hubs have a significant role in maintaining not only topology but also functionality.
We finally generate clusters as functional modules from the tree structure. We iteratively select a structural hub a with the highest hub confidence score and output $L_a$ as a cluster until the hub confidence of the selected node a reaches a user-specified threshold. The clusters are hierarchically arranged based on the positions of their hubs in the tree structure.
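The iterative selection can be sketched as follows, under two assumptions that are not confirmed by this text: the set output for a hub a is taken to be a together with all of its descendants in the tree, and the hub confidence scores of Formula 11 are supplied as precomputed input. All names and values are hypothetical.

```python
def subtree(parent, a):
    """Return a and all of its descendants (assumed meaning of the cluster of a)."""
    children = {}
    for v, p in parent.items():
        children.setdefault(p, []).append(v)
    members, stack = [], [a]
    while stack:
        u = stack.pop()
        members.append(u)
        stack.extend(children.get(u, []))
    return sorted(members)


def extract_clusters(parent, hub_confidence, threshold):
    """Visit hubs by decreasing confidence and stop once the threshold is reached."""
    clusters = []
    ranked = sorted(hub_confidence, key=hub_confidence.get, reverse=True)
    for a in ranked:
        if hub_confidence[a] <= threshold:
            break
        clusters.append((a, subtree(parent, a)))
    return clusters


# Hypothetical tree (child -> parent) and hub confidence scores.
parent = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B"}
hub_confidence = {"A": 0.9, "B": 0.7, "C": 0.1, "D": 0.0, "E": 0.0}

clusters = extract_clusters(parent, hub_confidence, threshold=0.5)
```

Here the cluster of B is nested inside the cluster of A, which is how a hierarchical arrangement of clusters follows from the positions of their hubs in the tree.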
Results and discussion
Currently, genome-wide protein-protein interaction data of several model organisms are publicly available in a number of open databases, for example, BioGRID, MIPS, DIP, MINT and IntAct. They have been mostly generated by high-throughput methods. However, because of the unreliability of high-throughput experimental data, we tested our algorithm using the core protein-protein interaction data of Saccharomyces cerevisiae from DIP, which were curated using other biological information such as protein sequences and expression profiles. They include a total of 2526 distinct proteins and 5725 interactions between them.
Next, we applied gene co-expression profiles for interacting proteins. The gene expression data were obtained from SMD, and the coherence of expressions was calculated by the Pearson coefficient. Finally, we adopted annotations in the GO database. The semantic similarity measure was used to compute the functional similarity of each pair of interacting proteins.
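As an illustration of the co-expression step, the sketch below computes the Pearson coefficient between two expression profiles. The profiles are made-up numbers, and the text does not state how the coefficient is mapped to an edge weight (for example, how negative correlations are handled), so that step is left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant profile carries no co-expression signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical expression profiles of three genes over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```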
Evaluation of path strength model
Topological significance of structural hubs
We implemented the conversion of the weighted interaction network to a hierarchical tree structure by Formula 8. We then identified the structural hub proteins based on their hub confidence scores in Formula 11. To assess the topological significance of the structural hubs, we tested network vulnerability under random and hub attacks. Typical scale-free networks are known to be robust against random attacks, but vulnerable to targeted attacks on the hubs. For this experiment, we observed the fraction of the largest component when we repeatedly removed a randomly selected node, a hub with the highest degree, and a structural hub with the highest hub confidence score, respectively.
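The attack experiment can be sketched as below. The example removes nodes in a given order and records the fraction of the original nodes that remain in the largest connected component. It is a simplified illustration: the degree ranking is computed once on the intact network, whereas the experiment may have re-ranked nodes after each removal, and a hub confidence ranking would be passed in the same way. The star-shaped toy network is hypothetical.

```python
import random


def largest_component_fraction(adj, removed, total):
    """Size of the largest connected component, relative to the original node count."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / total


def attack_curve(adj, order):
    """Largest-component fraction after each successive node removal."""
    removed, curve = set(), []
    for v in order:
        removed.add(v)
        curve.append(largest_component_fraction(adj, removed, len(adj)))
    return curve


# Hypothetical toy network: one hub connected to four leaves.
star = {
    "H": ["a", "b", "c", "d"],
    "a": ["H"], "b": ["H"], "c": ["H"], "d": ["H"],
}

by_degree = sorted(star, key=lambda v: len(star[v]), reverse=True)
random_order = sorted(star)
random.Random(0).shuffle(random_order)  # fixed seed for repeatability

hub_attack = attack_curve(star, by_degree[:1])
leaf_attack = attack_curve(star, ["a"])
random_attack = attack_curve(star, random_order)
```

Removing the hub of the star fragments the network immediately, while removing a leaf barely changes the largest component, which is the contrast between targeted and random attacks that the experiment measures.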
Overall, a protein interaction network is more vulnerable to structural hub attacks than to random attacks. This indicates that the hub confidence measure is effective at selecting topologically significant hub proteins in complex networks. In general, hub confidence has a positive relationship with node degree. However, some low-degree structural hubs with high hub confidence can be detected by our algorithm. Whereas degree is a factor for the local significance of nodes in network topology, the hub confidence formula measures the global significance of nodes to select hubs in the hierarchical structure.
Biological essentiality of hub proteins
Modularity of clusters
We implemented clustering of proteins using the tree structure converted from a protein interaction network, and inspected whether the output clusters are likely to be functional modules. Modularity of a sub-network has been commonly estimated by the ratio of the number of edges within the sub-network to the number of all edges starting from the nodes in the sub-network. However, in this estimation, the modularity depends on the number of nodes in the sub-network. For example, suppose a network G has 500 nodes. Sub-networks G′ and G″ of G consist of 10 and 100 nodes, respectively. A node in G″ has a higher probability of having links to the nodes within the same sub-network (intraconnections) and a lower probability of having links to the nodes outside of the sub-network (interconnections), compared to a node in G′. We thus normalized the formula of modularity by the probability of a node in the sub-network being linked to the members in the same sub-network.
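The size dependence described above can be made concrete with a small sketch. The first function computes the unnormalized ratio exactly as stated in the text. The second function is an assumption introduced for illustration: it takes the chance that a uniformly random partner of a member also lies in the sub-network as (sub-network size − 1) / (network size − 1). The normalization formula used in the study is not reproduced here, so dividing the ratio by this chance is only one possible normalization.

```python
def raw_modularity(edges, members):
    """Edges inside the sub-network divided by all edges touching its nodes."""
    members = set(members)
    intra = sum(1 for u, v in edges if u in members and v in members)
    touching = sum(1 for u, v in edges if u in members or v in members)
    return intra / touching if touching else 0.0


def intra_link_chance(sub_size, total_size):
    """Assumed chance that a random partner of a member is also a member."""
    return (sub_size - 1) / (total_size - 1)


# Hypothetical toy network and sub-network.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
ratio = raw_modularity(edges, {"a", "b", "c"})

# The 500-node example from the text: sub-networks of 10 and 100 nodes.
chance_small = intra_link_chance(10, 500)
chance_large = intra_link_chance(100, 500)
```

Under this assumption the larger sub-network has an eleven times higher baseline chance of intraconnections (99/499 against 9/499), so an unnormalized ratio favors large sub-networks regardless of their actual cohesion.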
Clustering performance comparison by f-measure
Decomposing, converting and synthesizing complex systems are fundamental tasks for modeling their structural behavior. Recently, such approaches have been widely applied to protein interaction networks to understand biological processes and functional organizations within a cell. We have studied the methodology for converting a protein interactome network into an effective structure for the purpose of functional knowledge discovery. For this task, we designed the path strength model and exploited the novel concept of centrality. The generated hierarchical tree structure can be applied to selecting functionally essential hub proteins and identifying functional modules. Unlike other hierarchical clustering methods, our approach dynamically explores the entire hierarchical structure of proteins in a global view. All the individual parent-child relationships between proteins in the hierarchy are meaningful and comparable. The performance of our approach could be further improved by developing advanced methods that efficiently integrate the large amount of current heterogeneous biological data and accurately assess the reliability of functional associations between interacting proteins.
YRC designed and implemented the method, analyzed the results, and drafted the manuscript. AZ coordinated the project, analyzed the results, and revised the final manuscript.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 3, 2010: Selected articles from the 2009 IEEE International Conference on Bioinformatics and Biomedicine. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S3.
- Parrish JR, Gulyas KD, Finley RL: Yeast two-hybrid contributions to interactome mapping. Current Opinion in Biotechnology 2006, 17: 387–393. doi:10.1016/j.copbio.2006.06.006
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. doi:10.1038/nature01511
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Molecular Systems Biology 2007, 3: 88. doi:10.1038/msb4100129
- Li W, Liu Y, Huang H-C, Peng Y, Lin Y, Ng W-K, Ong K-L: Dynamic systems for discovering protein complexes and functional modules from biological networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2007, 4(2):233–250. doi:10.1109/TCBB.2007.070210
- Cho Y-R, Hwang W, Ramanathan M, Zhang A: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 2007, 8: 265. doi:10.1186/1471-2105-8-265
- Banks E, Nabieva E, Peterson R, Singh M: NetGrep: fast network schema searches in interactomes. Genome Biology 2008, 9: R138. doi:10.1186/gb-2008-9-9-r138
- Cho Y-R, Shi L, Ramanathan M, Zhang A: A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge. BMC Bioinformatics 2008, 9: 382. doi:10.1186/1471-2105-9-382
- Barabasi A-L, Oltvai ZN: Network biology: understanding the cell's functional organization. Nature Reviews: Genetics 2004, 5: 101–113. doi:10.1038/nrg1272
- Wang Z, Zhang J: In search of the biological significance of modular structures in protein networks. PLoS Computational Biology 2007, 3(6).
- Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA 2003, 100(21):12123–12128. doi:10.1073/pnas.2032324100
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4: 2. doi:10.1186/1471-2105-4-2
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 2006, 7: 207. doi:10.1186/1471-2105-7-207
- Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435: 814–818. doi:10.1038/nature03607
- Rives AW, Galitski T: Modular organization of cellular networks. Proc. Natl. Acad. Sci. USA 2003, 100(3):1128–1133. doi:10.1073/pnas.0237338100
- Brun C, Herrmann C, Guenoche A: Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics 2004, 5: 95. doi:10.1186/1471-2105-5-95
- Dunn R, Dudbridge F, Sanderson CM: The use of edge-betweenness clustering to investigate biological function in protein interaction networks. BMC Bioinformatics 2005, 6.
- Luo F, Yang Y, Chen C-F, Chang R, Zhou J, Scheuermann RH: Modular organization of protein interaction networks. Bioinformatics 2007, 23(2):207–214. doi:10.1093/bioinformatics/btl562
- Han J-DJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, Dupuy D, Walhout AJM, Cusick ME, Roth FP, Vidal M: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 2004, 430: 88–93. doi:10.1038/nature02555
- Jeong H, Mason SP, Barabasi A-L, Oltvai ZN: Lethality and centrality in protein networks. Nature 2001, 411: 41–42. doi:10.1038/35075138
- Chen Y, Xu D: Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics 2005, 21(5):575–581. doi:10.1093/bioinformatics/bti058
- He X, Zhang J: Why do hubs tend to be essential in protein networks? PLoS Genetics 2006, 2(6):e88. doi:10.1371/journal.pgen.0020088
- Batada NN, Reguly T, Breitkreutz A, Boucher L, Breitkreutz B-J, Hurst LD, Tyers M: Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biology 2006, 4(10):e317. doi:10.1371/journal.pbio.0040317
- Fraser HB: Modularity and evolutionary constraint on proteins. Nature Genetics 2005, 37(4):351–352. doi:10.1038/ng1530
- Saeed R, Deane CM: Protein-protein interactions, evolutionary rate, abundance and age. BMC Bioinformatics 2006, 7: 128. doi:10.1186/1471-2105-7-128
- Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393: 440–442. doi:10.1038/30918
- Breitkreutz B-J, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V, Dolinski K, Tyers M: The BioGRID interaction database: 2008 update. Nucleic Acids Research 2008, 36: D637-D640. doi:10.1093/nar/gkm1001
- Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer KFX, Munsterkotter M, Ruepp A, Spannagl M, Stumptflen V, Rattei T: MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Research 2008, 36: D196-D201. doi:10.1093/nar/gkm980
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Research 2004, 32: D449-D451. doi:10.1093/nar/gkh086
- Chatr-aryamontri A, Ceol A, Montecchi-Palazzi L, Nardelli G, Schneider MV, Castagnoli L, Cesareni G: MINT: the Molecular INTeraction database. Nucleic Acids Research 2007, 35: D572-D574. doi:10.1093/nar/gkl950
- Kerrien S, et al.: IntAct - open source resource for molecular interaction data. Nucleic Acids Research 2007, 35: D561-D565. doi:10.1093/nar/gkl958
- Demeter J, et al.: The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Research 2007, 35: D766-D770. doi:10.1093/nar/gkl1019
- The Gene Ontology Consortium: The Gene Ontology project in 2008. Nucleic Acids Research 2008, 36: D440-D444. doi:10.1093/nar/gkm883
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.