- Proceedings
- Open Access
Enhancing the accuracy of HMM-based conserved pathway prediction using global correspondence scores
- Xiaoning Qian^{1},
- Sayed Mohammad Ebrahim Sahraeian^{2} and
- Byung-Jun Yoon^{2}
https://doi.org/10.1186/1471-2105-12-S10-S6
© Qian et al; licensee BioMed Central Ltd. 2011
- Published: 18 October 2011
Abstract
Background
Comparative network analysis aims to identify common subnetworks in biological networks. It can facilitate the prediction of conserved functional modules across different species and provide deep insights into their underlying regulatory mechanisms. Recently, it has been shown that hidden Markov models (HMMs) can provide a flexible and computationally efficient framework for modeling and comparing biological networks.
Results
In this work, we show that using global correspondence scores between molecules can improve the accuracy of the HMM-based network alignment results. The global correspondence scores are computed by performing a semi-Markov random walk on the networks to be compared. The resulting score naturally integrates the sequence similarity between molecules and the topological similarity between their molecular interactions, thereby providing a more effective measure for estimating the functional similarity between molecules. By incorporating the global correspondence scores, instead of relying on sequence similarity or functional annotation scores used by previous approaches, our HMM-based network alignment method can identify conserved subnetworks that are functionally more coherent.
Conclusions
Performance analysis based on synthetic and microbial networks demonstrates that the proposed network alignment strategy significantly improves the robustness and specificity of the predicted alignment results, in terms of conserved functional similarity measured based on KEGG ortholog (KO) groups. These results clearly show that the HMM-based network alignment framework using global correspondence scores can effectively find conserved biological pathways and has the potential to be used for automatic functional annotation of biomolecules.
Keywords
- Protein Pair
- Alignment Result
- Node Similarity
- Synthetic Network
- Network Alignment
Background
With the increasingly high coverage of molecular interactions owing to the advancement of high-throughput techniques for measuring biomolecular interactions, such as the two-hybrid screening [1] and co-immunoprecipitation [2], comparative analysis of biological networks has recently attracted significant research attention. It has been demonstrated that comparative network analysis can provide an effective means of systematically studying molecular interactions in various organisms and gaining novel system-level insights [3–18]. For example, local network alignment across different species can identify similar subnetwork regions in the respective networks, which may lead to the discovery of conserved pathways that carry out essential cellular functionalities [3, 5, 6, 9, 11, 15, 16, 19]. The concept of comparative network analysis can lead to the development of novel computational tools that allow us to transfer biological knowledge across species, especially from well-studied species to less-studied species [19].
Current local network alignment algorithms [3, 5, 6, 9, 15] search for similar subnetwork regions by optimizing a pre-defined alignment score. This score incorporates the topological similarity of the interaction patterns in the compared networks as well as the node similarity of the molecules that belong to different networks, typically measured based on sequence similarity. To obtain better alignment results that are biologically more significant, there have been research efforts to improve the scoring scheme by incorporating evolutionary [4] or functional relationships [11, 16] between molecules. Although there are various approaches for measuring the similarity between network nodes, most of the existing approaches compute this similarity based on the properties of individual nodes, such as their composition, functionality, or evolutionary relationships. However, cellular functions are carried out by collaborative efforts among many molecules, where interacting molecules may carry similar functionalities and share common characteristics. Therefore, it is reasonable to expect that, when evaluating the node similarity, incorporating additional information about the interacting molecules would enhance the network alignment results and lead to predictions that are biologically more meaningful.
Recently, we introduced an effective framework for local network alignment based on hidden Markov models (HMMs), in which we integrate both the node sequence similarity and the interaction reliability into the scoring scheme by determining the parameters of the HMMs accordingly [15]. We also developed an efficient dynamic programming algorithm that can find the closest pair of pathways from the respective networks in polynomial time. The HMM-based local alignment method can deal with a large class of path isomorphism and it allows one to search for long conserved pathways across large-scale networks. In this paper, we implement a semi-Markov random walk framework that diffuses the relationships of all the molecule pairs across the networks to obtain a global correspondence score between every pair of nodes. The resulting global correspondence score reflects the global similarity between nodes in different networks, by seamlessly integrating the topological similarity and individual node similarity. Alignment results based on synthetic networks and microbial protein-protein interaction (PPI) networks show that the performance of the HMM-based local alignment scheme can be significantly improved by utilizing the global correspondence score instead of the original individual sequence similarity score. The major contributions of this paper are the following. First, we integrate the global node correspondence scoring scheme into the HMM-based local network alignment framework [15], which leads to more accurate and robust alignment results. Second, we thoroughly evaluate the performance of the proposed scheme based on synthetic benchmark networks, as well as real microbial networks, which clearly demonstrates the advantages of utilizing global correspondence scores, especially in combination with the HMM-based framework.
Methods
Local network alignment based on hidden Markov models
In this section, we briefly review our local network alignment algorithm based on hidden Markov models (HMMs) [14, 15]. We focus on aligning two biological networks to identify the common pathways that are conserved in both networks. Suppose we have two biological networks, represented as two graphs G_{1} = (U, D) and G_{2} = (V, E). In graph G_{1}, the set U of N_{1} nodes represents the corresponding molecules, and the set D of M_{1} edges indicates the presence of interactions d_{ ij } between the two molecules u_{ i } and u_{ j }. Similarly, we assume that G_{2} has a set V of N_{2} nodes and a set E of M_{2} edges. We denote the interaction reliability score between u_{ i } and u_{ j } in G_{1} as w_{1}(u_{ i }, u_{ j }) and the interaction reliability between v_{ i } and v_{ j } in G_{2} as w_{2}(v_{ i }, v_{ j }). The node similarity between u_{ i } in U and v_{ j } in V is denoted as s(u_{ i }, v_{ j }).
In order to use HMMs to search for the pathways that are conserved in both networks, we search for the best matching pair of paths u = (u_{1}, ..., u_{ L }) and v = (v_{1}, ..., v_{ L }) of length L in the respective networks that maximizes the pathway alignment score H(u, v). The alignment score H(u, v) integrates the node similarity score s(u_{ i },v_{ j }) between the aligned nodes u_{ i } and v_{ j } (1 ≤ i, j ≤ L), the interaction reliability score w_{1}(u_{ i }, u_{ i }_{+1}) between u_{ i } and u_{ i }_{+1} (1 ≤ i ≤ L – 1), the interaction reliability score w_{2}(v_{ j }, v_{ j }_{+1}) between v_{ j } and v_{ j }_{+1} (1 ≤ j ≤ L – 1), and the penalty for potential gaps in the alignment.
The best matching pair of paths can be found by dynamic programming, by iteratively computing the score in (1) for l = 1, 2, ⋯ , L. Instead of finding only the best matching pair of paths, we can also search for the top k path pairs by replacing the max operator in (2) by an operator that finds the k largest scores. The computational complexity of the described dynamic programming algorithm is only O(kLM_{1}M_{2}) for finding the top k pairs of matching paths. Note that the computational complexity is linear with respect to each parameter k, L, M_{1}, and M_{2}.
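To make the dynamic programming idea concrete, the following Python sketch finds the best pair of walks of length L in a deliberately simplified setting. It is not the HMM-based algorithm of [14, 15]: it assumes an additive score (node similarity plus the two interaction reliabilities), allows no gaps, treats edges as directed, and does not prevent a walk from revisiting a node. The function name `best_path_pair` and the dictionary-based inputs are hypothetical choices made for illustration.

```python
def best_path_pair(edges1, edges2, sim, L):
    """Return (score, path_u, path_v) for the best ungapped pair of length-L walks.

    edges1, edges2: dict mapping a directed edge (a, b) to its reliability score.
                    For an undirected network, list both (a, b) and (b, a).
    sim:            dict mapping a node pair (u, v) to its similarity score.
                    Pairs that are missing from sim cannot be aligned.
    Returns None if no pair of walks of length L can be aligned.
    """
    neg_inf = float("-inf")
    # gamma[(u, v)] = best score of any aligned pair of walks ending at (u, v)
    gamma = dict(sim)
    back = []  # one dict of back-pointers per extension step
    for _ in range(L - 1):
        new_gamma, pointers = {}, {}
        # Every combination of one edge per network is examined once per
        # step, which gives the O(L * M1 * M2) running time.
        for (u, u_next), w1 in edges1.items():
            for (v, v_next), w2 in edges2.items():
                if (u, v) not in gamma or (u_next, v_next) not in sim:
                    continue
                cand = gamma[(u, v)] + w1 + w2 + sim[(u_next, v_next)]
                if cand > new_gamma.get((u_next, v_next), neg_inf):
                    new_gamma[(u_next, v_next)] = cand
                    pointers[(u_next, v_next)] = (u, v)
        gamma = new_gamma
        back.append(pointers)
    if not gamma:
        return None
    # Trace back from the best final pair to recover both walks.
    end = max(gamma, key=gamma.get)
    pairs = [end]
    for pointers in reversed(back):
        pairs.append(pointers[pairs[-1]])
    pairs.reverse()
    return gamma[end], [p[0] for p in pairs], [p[1] for p in pairs]
```

Keeping the k largest candidates per node pair, instead of only the maximum, would extend this sketch to a top-k search at k times the cost.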
In our previous implementation of HMM-based local alignment [14, 15, 20], we have used the sequence similarity between individual molecules to measure the node similarity s(u_{ i }, v_{ j }). As we discussed earlier, it is desirable to integrate all the available information to measure the similarity between network nodes, instead of relying on the similarity between individual molecules. In this paper, we propose to use a semi-Markov random walk model to define a global correspondence scoring scheme for measuring node similarity by incorporating the topological properties around the nodes. As we will demonstrate later, the use of global correspondence scores can improve the accuracy and robustness of the HMM-based alignment results.
Computation of global correspondence scores through semi-Markov random walk
In order to predict the global correspondence between nodes, we should first consider the similarity between the corresponding molecules themselves, in terms of sequence, structure, and/or function. However, considering that biomolecules carry out their functions through intertwined interactions with other molecules, it is important to consider these interaction patterns as well when evaluating the global similarity between nodes. As recently proposed and discussed in [10, 18, 21, 22], Markov random walk can provide an elegant framework for evaluating the global correspondence between nodes that belong to different networks by seamlessly integrating the similarity between the nodes themselves and that between their interaction patterns.
In the global correspondence score s(u_{ i }, v_{ j }) given in (3), π_{1}(u_{ i }) is the stationary probability of visiting node u_{ i } in an ordinary Markov random walk on the first network, π_{2}(v_{ j }) is the stationary probability of visiting v_{ j } in a Markov random walk on the second network, and h(u_{ i },v_{ j }) estimates the individual node similarity between u_{ i } and v_{ j }, which is measured in terms of sequence similarity in this work. The above scheme is conceptually similar to the one proposed in [10], where the similarity between two nodes in different networks is measured by linearly combining the topological similarity score and the sequence similarity score. The resulting score can be viewed as the long-run proportion of time spent at the given pair of nodes based on a “Markov random walk with restart” model, in which the restart probability has to be chosen in advance to balance the contributions from the interaction similarity and the sequence similarity, typically in an ad-hoc manner. Note that such parameter tuning is not needed in the semi-Markov random walk approach adopted in this work.
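As a rough illustration of how these quantities could be combined, the following Python sketch assumes that the global correspondence score is proportional to π_{1}(u_{ i }) π_{2}(v_{ j }) h(u_{ i }, v_{ j }), normalized over all node pairs. This is the long-run proportion of time that a semi-Markov walk would spend at a node pair if its mean sojourn time there were proportional to h. That form, and the assumption that each network's walk moves along edges with probability proportional to the edge weight, are made here for illustration only. The function names are hypothetical.

```python
def stationary_distribution(weights):
    """Stationary distribution of a random walk on an undirected weighted graph.

    weights: dict mapping node -> dict of neighbor -> edge weight (symmetric).
    For a connected undirected graph, the stationary probability of a node
    is proportional to the total weight of its incident edges.
    """
    strength = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    total = sum(strength.values())
    return {node: s / total for node, s in strength.items()}


def global_correspondence(weights1, weights2, h):
    """Assumed score: pi1(u) * pi2(v) * h(u, v), normalized to sum to one.

    h: dict mapping a node pair (u, v) to its individual similarity score.
       Pairs that are missing from h get no score.
    """
    pi1 = stationary_distribution(weights1)
    pi2 = stationary_distribution(weights2)
    raw = {(u, v): pi1[u] * pi2[v] * score for (u, v), score in h.items()}
    total = sum(raw.values())
    return {pair: value / total for pair, value in raw.items()}
```

Under this assumed form, two node pairs with equal sequence similarity receive different scores when their nodes are visited with different stationary probabilities, and no restart probability has to be tuned.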
In the following sections, we analyze the effect of using the global correspondence scores in the HMM-based local network alignment method. More specifically, we evaluate the performance of the HMM-based local network alignment method when using the global correspondence score for s(u_{ i }, v_{ j }) given in (3), and compare it to the performance of the HMM-based alignment method that directly uses the sequence similarity score with s(u_{ i }, v_{ j }) = h(u_{ i }, v_{ j }), as originally proposed in [14, 15].
Results and discussion
Aligning synthetic networks
We applied the HMM-based local alignment to identify the most similar pair of paths of length L = 5. The identified top pair of paths when directly using the assigned node similarity scores is shown in Fig. 1(B). We notice that the alignment result is strongly influenced by the high similarity pairs (u_{4},v_{11}) and (u_{8}, v_{13}) in this case, and the prediction does not capture the obvious topological similarity in the two networks. Next, we computed the global correspondence scores between nodes based on the semi-Markov random walk scheme and used these scores in the alignment algorithm, instead of the original node similarity scores. Figure 1(C) shows the top path alignment for this case, where the core paths were accurately identified as we expect based on the topology of the two networks. Simulations based on other small synthetic networks, constructed in similar ways, yielded similar results (see Additional file 1 for other examples).
For a more thorough performance comparison between the two different schemes—the original scheme that directly uses the individual similarity scores and the proposed scheme that uses the global correspondence scores computed by semi-Markov random walk—we further created a benchmark set that consists of large synthetic networks generated based on a scale-free model [23]. Although we can also evaluate the performance of network alignment algorithms by aligning real biological networks and measuring the accuracy of the alignment results using functional annotations based on Gene Ontology (GO) terms [24] or KEGG ortholog (KO) group annotations [25], these annotations are still highly incomplete and may not accurately reflect the real functional similarity between molecules. As a result, a carefully constructed synthetic benchmark dataset may provide a better benchmark for evaluating future network alignment algorithms.
To construct the synthetic networks, we first randomly generated an undirected seed network of size 20 with an average degree of 10. Next, we grew this network according to the BA (Barabasi and Albert) model [23] to generate a random scale-free network using the preferential attachment algorithm [26]. In this algorithm, at each time step, a new node is added to the network and connected to m existing nodes with a probability that is proportional to the number of links that the nodes already have. As shown in [23], the resulting network captures several important characteristics of real PPI networks. The scale-free degree distribution is one such property, which means that the degree distribution of the network approximately follows the power law P(k) ~ k^{ -γ }, where γ is the degree exponent. In this work, we used this model with m = 10 to grow the seed network to a network of size 1000. Once this network was created, we duplicated it into two identical networks G_{1} and G_{2}. To model the functional coherence between orthologous proteins, we then assigned a distinct group annotation to each pair of corresponding proteins in the two networks. More specifically, both the node u_{ i } in G_{1} and the node v_{ i } in G_{2} were assigned to the i th functional group. We randomly assigned individual node similarity scores between orthologous nodes according to the Gaussian distribution with mean µ_{ o } = 300 and standard deviation σ_{ o } = 100. The node similarity scores between non-orthologous nodes were randomly assigned according to a different Gaussian distribution with mean µ and standard deviation σ, where σ = 100 and µ was used as a free parameter that determines the level of overlap between the two similarity score distributions. Node similarity scores below a certain threshold (set to 50 in this work) were set to zero. For every node, we also restricted the number of non-orthologous nodes with a nonzero similarity score to 10. These settings were used to make the resulting random networks similar to real PPI networks in public databases.
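The construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the generator used for the benchmark. The function names are hypothetical. The seed network is drawn uniformly at random among networks with the required number of edges. The limit of 10 non-orthologous partners is enforced only from the side of the first network, and scores below the threshold are simply omitted, which is equivalent to setting them to zero.

```python
import random


def random_seed_network(n_seed, avg_degree, rng):
    """Random undirected network with n_seed nodes and the given average degree."""
    all_pairs = [(a, b) for a in range(n_seed) for b in range(a + 1, n_seed)]
    return set(rng.sample(all_pairs, n_seed * avg_degree // 2))


def grow_scale_free(seed_edges, n_seed, n_total, m, rng):
    """Grow the seed network to n_total nodes by preferential attachment."""
    edges = set(seed_edges)
    # A node appears in 'targets' once per incident edge, so a uniform draw
    # from the list picks a node with probability proportional to its degree.
    targets = [node for edge in sorted(edges) for node in edge]
    for new in range(n_seed, n_total):
        chosen = set()
        while len(chosen) < m:  # m distinct existing nodes
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.add((old, new))
            targets.extend((old, new))
    return edges


def assign_similarity(n, mu, rng, mu_o=300.0, sigma=100.0,
                      threshold=50.0, max_non_orth=10):
    """Node similarity scores between u_i and v_j; node i is orthologous to node i."""
    sim = {}
    for i in range(n):
        score = rng.gauss(mu_o, sigma)  # orthologous pair (u_i, v_i)
        if score >= threshold:
            sim[(i, i)] = score
        candidates = [j for j in range(n) if j != i]
        for j in rng.sample(candidates, max_non_orth):
            score = rng.gauss(mu, sigma)  # non-orthologous pair (u_i, v_j)
            if score >= threshold:
                sim[(i, j)] = score
    return sim
```

For example, `grow_scale_free(random_seed_network(20, 10, rng), 20, 1000, 10, rng)` with `rng = random.Random(0)` follows the sizes given in the text. Raising `mu` toward 300 increases the overlap between the two score distributions and makes orthologs harder to tell apart by similarity alone.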
Up to this point, the two networks G_{1} and G_{2} were topologically identical. To introduce topological differences in these networks, we randomly deleted 10% of the edges in G_{1} and G_{2}. Furthermore, we also randomly deleted 10% of the nodes in the two networks and added back an identical number of new nodes by growing the networks using the preferential attachment algorithm. No functional group was assigned to these randomly inserted nodes. The node similarity between the inserted nodes in one network and the nodes in the other network was sparsely assigned according to the same distribution used for non-orthologous nodes, as before.
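The perturbation step can be sketched in the same way. The Python function below is a hypothetical illustration that would be applied independently to each of the two networks. It assumes that edges are deleted first, that deleting a node also deletes its remaining edges, and that each inserted node attaches to m existing nodes; the text does not specify these details.

```python
import random


def perturb_network(edges, n_nodes, m, rng, frac=0.10):
    """Return (edges, removed_nodes, new_nodes) after random perturbation.

    edges: set of undirected edges stored as tuples (a, b) with a < b,
           over nodes labeled 0 .. n_nodes - 1.
    """
    # Step 1: delete a fraction of the edges uniformly at random.
    edge_list = sorted(edges)
    dropped = set(rng.sample(edge_list, int(frac * len(edge_list))))
    kept = set(edge_list) - dropped

    # Step 2: delete a fraction of the nodes together with their edges.
    removed = set(rng.sample(range(n_nodes), int(frac * n_nodes)))
    kept = {(a, b) for a, b in kept if a not in removed and b not in removed}

    # Step 3: add back the same number of new nodes by preferential
    # attachment. A node appears in 'targets' once per incident edge, so a
    # uniform draw picks a node with probability proportional to its degree.
    targets = [node for edge in sorted(kept) for node in edge]
    new_nodes = list(range(n_nodes, n_nodes + len(removed)))
    for new in new_nodes:
        n_links = min(m, len(set(targets)))  # guard for very small networks
        chosen = set()
        while len(chosen) < n_links:
            chosen.add(rng.choice(targets))
        for old in chosen:
            kept.add((old, new))
            targets.extend((old, new))
    return kept, removed, new_nodes
```

The inserted nodes receive fresh labels, so they carry no functional group and no ortholog in the other network.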
Aligning microbial PPI networks
For further evaluation of the proposed method, we performed pairwise alignments of three microbial PPI networks obtained from [7]. In our experiments, we aligned the E. coli network and the C. crescentus network to detect conserved functional modules in the two networks. Similarly, we also performed a pairwise alignment between the E. coli and the S. typhimurium networks to find conserved modules in these networks. As before, we assessed the accuracy of the alignment results using two metrics, namely specificity and coverage, based on the KEGG ortholog (KO) group annotations [25] of the proteins in the microbial networks. A protein alignment is regarded as being correct if the aligned proteins have the same KO group annotations, and incorrect if the annotations do not agree.
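The two metrics can be illustrated with a short Python sketch. The definitions used here are assumptions, since the exact formulas are not restated in this section: coverage is taken as the fraction of aligned protein pairs in which both proteins carry a KO annotation, and specificity as the fraction of those annotated pairs whose KO groups agree. The function name and the single-KO-per-protein dictionaries are hypothetical simplifications.

```python
def evaluate_alignment(aligned_pairs, ko1, ko2):
    """Return (specificity, coverage) for a list of aligned protein pairs.

    aligned_pairs: list of (u, v) pairs taken from the predicted path alignments.
    ko1, ko2:      dicts mapping a protein to its KO group identifier.
                   Unannotated proteins are simply absent from the dict.
    """
    # Only pairs in which both proteins are annotated can be judged.
    annotated = [(u, v) for u, v in aligned_pairs if u in ko1 and v in ko2]
    # A judged pair is correct when both proteins have the same KO group.
    correct = [(u, v) for u, v in annotated if ko1[u] == ko2[v]]
    coverage = len(annotated) / len(aligned_pairs) if aligned_pairs else 0.0
    specificity = len(correct) / len(annotated) if annotated else 0.0
    return specificity, coverage
```

Under these definitions, specificity is computed only over annotated pairs, so it should be compared between methods at comparable coverage.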
We used the BLASTP hit scores between protein pairs, provided in [7], as the individual node similarity scores. The global correspondence scores were computed according to the semi-Markov random walk approach described earlier. These two types of node similarity scores were normalized such that they lie in the same range.
As we can see from the pairwise alignment results of the E. coli and the C. crescentus networks, shown in Fig. 4(A), (B), (C) and Fig. 5(A), (B), (C), when the coverage of the predicted path alignments is comparable, using the global correspondence scores results in higher specificity compared to using the individual node similarity scores. This implies that HMM-based network alignment based on global correspondence scores can more effectively capture the functional similarity between nodes. However, as we can see in Fig. 5(D), (E), (F), the protein pairs aligned using the semi-Markov random walk based global correspondence scores are less frequently annotated (as reflected in the lower coverage cc_{ k }) for the pairwise alignment of the E. coli and the S. typhimurium networks, in which case the specificity of the predicted alignment is not necessarily improved by the global scores. This can be seen in Fig. 4(D), (E), (F). One possible explanation for this observation is that the KO group annotations may have been curated largely based on sequence similarity between proteins. For example, for remote orthologs that do not have high sequence similarity, it may be practically difficult to judge to which KO group they should belong since there is not enough evidence. From this point of view, network alignment using global correspondence scores obtained from semi-Markov random walk could be used to validate and improve functional annotation of proteins.
Gene names and Region names based on the GenInfo Identifiers (GIs) of the top 20 unannotated protein pairs that are aligned in the top conserved paths. Synonymous gene names are shown within parentheses.
E. coli GI | E. coli gene name | E. coli region name | S. typhimurium GI | S. typhimurium gene name | S. typhimurium region name |
---|---|---|---|---|---|
16131641 | wzzE | Wzz | 16767194 | wzzE | Wzz |
49176398 | viaA (yieM) | VWA_YIEM_type | 39546380 | yieM | VWA_YIEM_type |
16131399 | yhjJ | PqqL | 16766899 | yhjJ | PqqL |
16131130 | aaeB (yhcP) | FUSC | 16766659 | yhcP | FUSC |
16130240 | yfcI | Transposase_31 | 16767050 | STM3766 | Transposase_31 |
49176226 | bamC (nlpB) | Lipoprotein_18 | 16765808 | nlpB | Lipoprotein_18 |
16129342 | ydbH | DctA-YdbH | 16764990 | ydbH | DctA-YdbH |
49176233 | sseB | SseB | 16765855 | sseB | SseB |
16129572 | ydgA | PRK11367 | 16764812 | ydgA | PRK11367 |
16131557 | yidR | propeller_TolB | 16767096 | yidR | TolB |
16131404 | bcsB (yhjN) | BcsB | 16766904 | yhjN | BcsB |
16130950 | ygiF | CYTH-like_Pase_CHAD | 16766502 | ygiF | CYTH-like_Pase_CHAD |
16131855 | yjbH | DUF940 | 16767475 | yjbH | DUF940 |
16128005 | yaaW (htgA) | Ubiq_cyt_C_chap | 16763400 | htgA | Ubiq_cyt_C_chap |
16130391 | ypfG | DUF1176 | 16765796 | ypfG | DUF1176 |
16130357 | yfeY | DUF1131 | 16765767 | STM2447 | DUF1131 |
49176330 | yhdP | PRK10899 | 16766664 | yhdP | PRK10899 |
16131526 | yicH | AsmA | 16767033 | yicH | AsmA |
16131275 | yrfF | IgaA | 16766783 | yrfF | IgaA |
16129282 | ycjX | DUF463 | 16765028 | ycjX | DUF463 |
Conclusion
In this paper, we studied the effect of using a global similarity scoring scheme to measure the node similarity and incorporating these global scores in the HMM-based local network alignment algorithm. We used the semi-Markov random walk framework to compute the global correspondence scores between nodes in different networks. The resulting scores can effectively combine the topological similarity of the subnetworks around the network nodes as well as their individual molecular similarity. Experimental results on microbial protein-protein interaction networks and synthetic scale-free networks show that the use of global correspondence scores can better identify paths with similar topological properties, thereby improving the specificity of the predicted alignment. We believe that the proposed alignment scheme can provide an effective and computationally efficient framework for developing robust and accurate functional annotation tools for proteins.
Authors’ contributions
Conceived and designed the experiments: XQ, SMES, BJY. Performed the network alignment experiments: XQ. Implemented the semi-Markov random walk based scoring scheme: SMES. Analyzed the data and wrote the paper: XQ, SMES, BJY.
Declarations
Acknowledgements
XQ was supported in part by the University of South Florida Internal Awards Program under Grant No. 78068. BJY was supported in part by the Texas A&M Faculty start-up fund.
References
- Osman A: Yeast two-hybrid assay for studying protein-protein interactions. Methods Mol. Biol 2004, 270: 403–422.
- Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature 2003, 422: 198–207. 10.1038/nature01511
- Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. U.S.A 2003, 100: 11394–11399. 10.1073/pnas.1534710100
- Koyutürk M, Grama A, Szpankowski W: An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics 2004, 20: SI200–207.
- Pinter R, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M: Alignment of metabolic pathways. Bioinformatics 2005, 21(16):3401–3408. 10.1093/bioinformatics/bti554
- Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, Uetz P, Sittler T, Karp RM, Ideker T: Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. U.S.A 2005, 102: 1974–1979. 10.1073/pnas.0409522102
- Flannick J, Novak A, Srinivasan B, McAdams H, Batzoglou S: Graemlin: general and robust alignment of multiple large interaction networks. Genome Res 2006, 16(9):1169–1181. 10.1101/gr.5235706
- Li Z, Zhang S, Wang Y, Zhang X, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631–1639. 10.1093/bioinformatics/btm156
- Yang Q, Sze S: Path matching and graph matching in biological networks. J Comput Biol 2007, 14: 56–67. 10.1089/cmb.2006.0076
- Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. U.S.A 2008, 105: 12763–12768. 10.1073/pnas.0806627105
- Flannick J, Novak A, Do CB, Srinivasan BS, Batzoglou S: Automatic parameter learning for multiple local network alignment. J. Comput. Biol 2009, 16: 1001–1022. 10.1089/cmb.2009.0099
- Klau G: A new graph-based method for pairwise global network alignment. BMC Bioinformatics 2009, 10(Suppl 1):S59. 10.1186/1471-2105-10-S1-S59
- Liao CS, Lu K, Baym M, Singh R, Berger B: IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics 2009, 25: i253–258. 10.1093/bioinformatics/btp203
- Qian X, Sze SH, Yoon BJ: Querying pathways in protein interaction networks based on hidden Markov models. Journal of Computational Biology 2009, 16: 145–157. 10.1089/cmb.2008.02TT
- Qian X, Yoon BJ: Effective identification of conserved pathways in biological networks using hidden Markov models. PLoS ONE 2009, 4: e8070. 10.1371/journal.pone.0008070
- Tian W, Samatova N: Pairwise alignment of interaction networks by fast identification of maximal conserved patterns. Pac Symp Biocomput 2009, 14: 99–110.
- Zaslavskiy M, Bach F, Vert J: Global alignment of protein-protein interaction networks by graph matching methods. Bioinformatics 2009, 25: 259–267. 10.1093/bioinformatics/btp196
- Sahraeian SME, Yoon BJ: Fast network querying algorithm for searching large-scale biological networks. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2011.
- Sharan R, Ideker T: Modeling cellular machinery through biological network comparison. Nat. Biotechnol 2006, 24: 427–433. 10.1038/nbt1196
- Qian X, Yoon BJ: Comparative analysis of protein interaction networks reveals that conserved pathways are susceptible to HIV-1 interception. BMC Bioinformatics 2011, 12(Suppl 1):S19.
- Sahraeian S, Yoon BJ: A novel low-complexity HMM similarity measure. IEEE Signal Processing Letters 2011, 18(2):87–90.
- Yoon BJ, Qian X, Sahraeian S: Comparative analysis of biological networks using Markov chains and hidden Markov models. IEEE Signal Processing Magazine 2011, in press.
- Barabasi AL, Albert R: Emergence of scaling in random networks. Science 1999, 286: 509–512. 10.1126/science.286.5439.509
- Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al.: Gene Ontology: Tool for the unification of biology, the gene ontology consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
- Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27
- Barabasi AL, Oltvai ZN: Network biology: understanding the cell's functional organization. Nat. Rev. Genet 2004, 5: 101–113. 10.1038/nrg1272
- Srinivasan B, Novak A, Flannick J, Batzoglou S, McAdams H: Integrated protein interaction networks for 11 microbes. Proc of the 10th Annu Int Conf Res Comput Mol Bio (RECOMB 2006) 2006.
- National Center for Biotechnology Information (NCBI) [http://www.ncbi.nlm.nih.gov/protein/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.