Enhancing the accuracy of HMM-based conserved pathway prediction using global correspondence scores

Background Comparative network analysis aims to identify common subnetworks in biological networks. It can facilitate the prediction of conserved functional modules across different species and provide deep insights into their underlying regulatory mechanisms. Recently, it has been shown that hidden Markov models (HMMs) can provide a flexible and computationally efficient framework for modeling and comparing biological networks. Results In this work, we show that using global correspondence scores between molecules can improve the accuracy of the HMM-based network alignment results. The global correspondence scores are computed by performing a semi-Markov random walk on the networks to be compared. The resulting score naturally integrates the sequence similarity between molecules and the topological similarity between their molecular interactions, thereby providing a more effective measure for estimating the functional similarity between molecules. By incorporating the global correspondence scores, instead of relying on sequence similarity or functional annotation scores used by previous approaches, our HMM-based network alignment method can identify conserved subnetworks that are functionally more coherent. Conclusions Performance analysis based on synthetic and microbial networks demonstrates that the proposed network alignment strategy significantly improves the robustness and specificity of the predicted alignment results, in terms of conserved functional similarity measured based on KEGG ortholog (KO) groups. These results clearly show that the HMM-based network alignment framework using global correspondence scores can effectively find conserved biological pathways and has the potential to be used for automatic functional annotation of biomolecules.


Background
With the increasingly high coverage of molecular interactions owing to the advancement of high-throughput techniques for measuring biomolecular interactions, such as the two-hybrid screening [1] and co-immunoprecipitation [2], comparative analysis of biological networks has recently attracted significant research attention. It has been demonstrated that comparative network analysis can provide an effective means of systematically studying molecular interactions in various organisms and gaining novel system-level insights [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18]. For example, local network alignment across different species can identify similar subnetwork regions in the respective networks, which may lead to the discovery of conserved pathways that carry out essential cellular functionalities [3,5,6,9,11,15,16,19]. The concept of comparative network analysis can lead to the development of novel computational tools that allow us to transfer biological knowledge across species, especially from well-studied species to less-studied species [19].
Current local network algorithms [3,5,6,9,15] search for similar subnetwork regions by optimizing a predefined alignment score that incorporates the topological similarity of the interaction patterns in the compared networks as well as the node similarity of the molecules that belong to different networks, typically measured based on sequence similarity. To obtain better alignment results that are biologically more significant, there have been research efforts to improve the scoring scheme by incorporating evolutionary [4] or functional relationships [11,16] between molecules. Although there are various approaches for measuring the similarity between network nodes, most of the existing approaches compute this similarity based on the properties of individual nodes, such as their composition, functionality, or evolutionary relationships. However, cellular functions are carried out by collaborative efforts among many molecules, where interacting molecules may carry similar functionalities and share common characteristics. Therefore it would be reasonable to expect that, when evaluating the node similarity, incorporating additional information about the interacting molecules would enhance the network alignment results and lead to predictions that are biologically more meaningful.
Recently, we have introduced an effective framework for local network alignment based on hidden Markov models (HMMs), in which we integrate both the node sequence similarity and the interaction reliability into the scoring scheme by determining the parameters of the HMMs correspondingly [15]. We also developed an efficient dynamic programming algorithm that can find the closest pair of pathways from the respective networks in polynomial time. The HMM-based local alignment method can deal with a large class of path isomorphism and it allows one to search for long conserved pathways across large-scale networks. In this paper, we implement a semi-Markov random walk framework that diffuses the relationships of all the molecule pairs across the networks to obtain a global correspondence score between every pair of nodes. The resulting global correspondence score reflects the global similarity between nodes in different networks, by seamlessly integrating the topological similarity and individual node similarity. Alignment results based on synthetic networks and microbial protein-protein interaction (PPI) networks show that the performance of the HMM-based local alignment scheme can be significantly improved by utilizing the global correspondence score instead of the original individual sequence similarity score. The major contributions of this paper include the following: first, we integrate the global node correspondence scoring scheme into the HMM-based local network alignment framework [15], which leads to more accurate and robust alignment results; second, we thoroughly evaluate the performance of the proposed scheme based on synthetic benchmark networks, as well as real microbial networks, which clearly demonstrates the advantages of utilizing global correspondence scores, especially, in combination with the HMM-based framework.

Local network alignment based on hidden Markov models
In this section, we briefly review our local network alignment algorithm based on hidden Markov models (HMMs) [14,15]. We focus on aligning two biological networks to identify the common pathways that are conserved in both networks. Suppose we have two biological networks, represented as two graphs and In graph   , , ,  of N 2 nodes and a set  = { } e ij of M 2 edges. We denote the interaction reliability score between u i and u j in  1 as w 1 (u i , u j ) and the interaction reliability between v i and v j in  2 as w 2 (v i , v j ). The node similarity between u i ∈  and v j ∈ is denoted as s(u i , v j ).
In order to use HMMs to search for the pathways that are conserved in both networks, we search for the best matching pair of paths u = ∈ u u u u L i works that maximizes the pathway alignment score H(u, v). The alignment score H(u, v) integrates the node similarity score s(u i ,v j ) between the aligned nodes u i and v j (1 ≤ i, j ≤ L), the interaction reliability score w 1 (u i , u i+1 ) between u i and u i+1 (1 ≤ i ≤ L -1), the interaction reliability score w 2 (v j , v j+1 ) between v j and v j+1 (1 ≤ j ≤ L -1), and the penalty for potential gaps in the alignment.
We first construct two HMMs respectively for two given networks. For  1 , we design the state transition diagram of its corresponding HMM based on the graph structure of  1 . The resulting HMM contains a hidden state for each node u i ∈  , which we also denote as u i for convenience. State transition is allowed from u i to u j for (u i , u j ) such that d ij ∈  . The HMM for  2 can be constructed in a similar way. To allow flexible node insertions and/or deletions in the alignment result, we add auxiliary states to the HMMs as described in [14,15]. The state transition probabilities of the HMMs are determined based on the interaction reliability scores w 1 (u i , u i+1 ) and w 2 (v j , v j+1 ). By introducing a "virtual observation sequence" q = q 1 ... q L that is jointly emitted by the two HMMs, we design the emission probabilities based on the node similarity s(u i , v j ). Using these HMMs, the problem of finding the optimal pair of paths in the two networks is translated into that of finding the optimal pair of state sequences in the two HMMs that jointly maximize the probability P(q, u, v) of the "virtual observation sequence": ( , ) argmax ( , , ). ( , ) u v q u v u v * * = P We can use log P(q, u, v) as the alignment score H(u, v), and find the best matching pair of paths using dynamic programming [15]. For this purpose, we first define the score for the most probable pair of paths of length t(≤ L) as follows: where t w 1 (u i , u j ) and t w 2 (v k , v l ) are the logarithms of transition probabilities determined by the interaction reliability scores in the respective networks. Next, we find the optimal pair of paths (u*, v*) by iteratively computing the score in (1) for l = 1, 2, ... , L. Instead of finding only the best matching pair of paths, we can also search for the top k path pairs by replacing the max operator in (2) by an operator that finds the k largest scores. The computational complexity of the described dynamic programming algorithm is only O(kLM 1 M 2 ) for finding the top k pairs of matching paths. Note that the computational complexity is linear with respect to each parameter k, L, M 1 , and M 2 .
In our previous implementation of HMM-based local alignment [14,15,20], we have used the sequence similarity between individual molecules to measure the node similarity s(u i , v j ). As we discussed earlier, it is desirable to integrate all the available information to measure the similarity between network nodes, instead of relying on the similarity between individual molecules. In this paper, we propose to use a semi-Markov random walk model to define a global correspondence scoring scheme for measuring node similarity by incorporating the topological properties around the nodes. As we will demonstrate later, the use of global correspondence scores can improve the accuracy and robustness of the HMMbased alignment results.

Computation of global correspondence scores through semi-Markov random walk
In order to predict the global correspondence between nodes, we should first consider the similarity between the corresponding molecules themselves, in terms of sequence, structure, and/or function. However, considering that biomolecules carry out their functions through intertwined interactions with other molecules, it is important to consider these interaction patterns as well when evaluating the global similarity between nodes. As recently proposed and discussed in [10,18,21,22], Markov random walk can provide an elegant framework for evaluating the global correspondence between nodes that belong to different networks by seamlessly integrating the similarity between the nodes themselves and that between their interaction patterns.
In this work, we adopt the semi-Markov random walk approach [18] to compute the global correspondence scores for the node similarity s(u i , v j ). The basic idea of this scheme is to perform a simultaneous semi-Markov random walk on  1 and  2 , such that the random walker moves to one of the neighboring nodes in each network at each time point. The next node is randomly selected among all the neighboring nodes, where nodes with higher interaction reliability have a larger chance to be selected. The time that the random walker spends at a given pair of nodes u i ∈  1 and v j ∈  2 is proportional to the sequence similarity between the nodes. According to this model, the long-run proportion of time that the random walker spends at (u i , v j ) will increase if u i and v j have higher individual node similarity (e.g., sequence similarity). Furthermore, the proportion of time spent at (u i ,v j ) will also increase if the two nodes are surrounded by similar nodes, hence have a higher topological similarity. As a result, this semi-Markov random walk provides an elegant way of evaluating the global similarity between nodes by integrating individual node similarity and topological similarity. Using this model, we can compute the global correspondence score as follows: in which π 1 (u i ) is the stationary probability of visiting node u i in an ordinary Markov random walk on  1 , π 2 (v j ) is the stationary probability of visiting v j in a Markov random walk on  2 , and h(u i ,v j ) estimates the individual node similarity between u i and v j , which is measured in terms of sequence similarity in this work. The above scheme is conceptually similar to the one proposed in [10], where the similarity between two nodes in different networks are measured by linearly combining the topological similarity score and the sequence similarity score. The resulting score can be viewed as the long-run proportion of time spent at the given pair of nodes based on a "Markov random walk with restart" model, in which the restart probability has to be chosen in advance to balance the contributions from the interaction similarity and the sequence similarity, typically in an ad-hoc manner. Note that such parameter tuning is not needed in the semi-Markov random walk approach adopted in this work.
In the following sections, we analyze the effect of using the global correspondence scores in the HMMbased local network alignment method. More specifically, we evaluate the performance of the HMM-based local network alignment method when using the global correspondence score for s(u i , v j ) given in (3), and compare it to the performance of the HMM-based alignment method that directly uses the sequence similarity score with s(u i , v j ) = h(u i , v j ), as originally proposed in [14,15].

Aligning synthetic networks
To illustrate the advantages of using the global correspondence scores in the HMM-based local network alignment scheme, we first conducted a set of experiments based on synthetically generated networks. Figure  1(A) shows two simple synthetic networks, each of which contains a similar core path, respectively marked in dark blue and dark red. We added more nodes around the core path in each network in a way that the corresponding nodes in the core paths have similar local topological structures. We further assigned individual node similarity scores, where nodes in the respective networks that are located at similar vertical levels were given higher scores. We assigned exceptionally high similarity scores to two node pairs (u 4 , v 11 ) and (u 8 ,v 13 ), which are shown by dashed lines in Fig. 1(A). These two highly similar pairs are analogous to molecules in real biological networks that have high sequence similarity without real biological significance. Such nodes can mislead the alignment algorithm, yielding inaccurate alignment results. In fact, we may typically face similar problems when aligning two or more large-scale biological networks. The network adjacency matrices and the node similarity matrix are provided in the Additional file 1.
We applied the HMM-based local alignment to identify the most similar pair of paths of length L = 5. The identified top pair of paths when directly using the assigned node similarity scores is shown in Fig. 1(B). We notice that the alignment result is strongly influenced by the high similarity pairs (u 4 ,v 11 ) and ((u 8 , v 13 ) in this case and the prediction does not capture the obvious topological similarity in the two networks. Next, we computed the global correspondence scores between nodes based on the semi-Markov random walk scheme and used these scores in the alignment algorithm, instead of the original node similarity scores. For a more thorough performance comparison between the two different schemes-the original scheme that directly uses the individual similarity scores and the proposed scheme that uses the global correspondence scores computed by semi-Markov random walk-we further created a benchmark set that consists of large synthetic networks generated based on a scale-free model [23]. Although we can also evaluate the performance of network alignment algorithms by aligning real biological networks and measuring the accuracy of the alignment results using functional annotations based on Gene Ontology (GO) terms [24] or KEGG ortholog (KO) group annotations [25], these annotations are still highly incomplete and may not accurately reflect the real functional similarity between molecules. As a result, a carefully constructed synthetic benchmark dataset may provide a better benchmark for evaluating future network alignment algorithms.
To construct the synthetic networks, we first randomly generated an undirected seed network  of size 20 with an average degree of 10. Next, we grew this network according to the BA (Barabasi and Albert) model [23] to generate a random scale-free network using the preferential attachment algorithm [26]. In this algorithm, at each time step, a new node is added to the network and connected to m existing nodes with a probability that is proportional to the number of links that the nodes already have. As shown in [23], the resulting network captures several important characteristics of real PPI networks. The scale-free degree distribution is one such property, which means that the degree distribution of the network approximately follows the power law P (k)~k g , where g is the degree exponent. In this work, we used this model with m = 10 to grow  to a network of size 1000. Once  was created, we duplicated the network into two identical networks   1 = and   2 = . To model the functional coherence between orthologous proteins, we then assigned a distinct group annotation to each pair of corresponding proteins in the two networks. More specifically, both the node u i in  1 and the node v i in  2 were assigned to the ith functional group. We randomly assigned individual node similarity scores between orthologous nodes according to the Gaussian distribution  ( , ) m s o o 2 with mean µ o = 300 and standard deviation s o = 100. The node similarity scores between non-orthologous nodes were randomly assigned according to a different distribution  ( , ) m s 2 , where s = 100, and µ was used as a free parameter that determines the level of overlap between the two similarity score distributions. Node similarity scores below a certain threshold (set to 50 in this work) were set to zero. For every node, we also restricted the number of non-orthologous nodes with a nonzero similarity score to 10. These settings were used to make the resulting random networks similar to real PPI networks in public databases.
Up to this point, the two networks  1 and  2 were topologically identical. To introduce topological differences in these networks, we randomly deleted 10% of the edges in  1 and  2 . Furthermore, we also randomly deleted 10% of the nodes in the two networks and added back an identical number of new nodes by growing the networks using the preferential attachment algorithm. No functional group was assigned to these randomly inserted nodes. The node similarity between the inserted nodes in one network and the nodes in the other network was sparsely assigned according to  ( , ) m s 2 , as before. Based on the above model, we generated two networks  1 and  2 with µ = 200. Using the HMM-based local network alignment algorithm, we identified the top 200 high-scoring path alignments with gaps. To find the top 200 path alignments, we iterated the following steps: (i) find the optimal path alignment; (ii) store the predicted alignment; (iii) remove the interactions included in the path alignments; (iv) repeat the experiment to find the next path alignment. The accuracy of the identified path pairs are measured based on the group annotations of the aligned nodes as in [11,15]. We define the cumulative specificity of the top k alignments as follows: where c i c the total number of correctly aligned node pairs in the top ith alignment, and c i a is the total number of annotated node pairs also in the top ith alignment. We also define the cumulative coverage of the top k alignments as cc c k i c i k = = ∑ 1 , which computes the cumulative number of pairs with the same functional annotations. This metric measures the size of the accurately aligned network regions covered by the top k path alignments. We repeated the alignment experiments for different path lengths: L = 10, 20, and 30. The alignment results are summarized in Fig. 2. As we can see in this figure, the use of global node correspondence scores computed by the semi-Markov random walk approach significantly improves the specificity of the alignment, while the coverage is also slightly improved. This clearly shows that incorporating topological information of network nodes into the alignment process can be crucial in obtaining accurate alignment results when the individual node similarity scores are not highly discriminative by themselves. Next, we investigated performance improvement by the proposed approach for varying levels of overlap between the node similarity score distribution for orthologous nodes and that for non-orthologous nodes. We set L = 30 and evaluated the performance of the HMM-based alignment algorithm using the global correspondence scores and the individual similarity scores, respectively, for various values of µ. The simulation results are shown Fig. 3. As shown in this figure, the advantage of using the global correspondence scores for network alignment becomes more prominent for larger µ. When µ is small (e.g., µ = 150), we can accurately identify orthologous nodes by using individual node similarity scores. However, as µ increases and the individual node similarity scores become less discriminative, it becomes more critical to utilize topological information to identify orthologs. This can be seen when µ = 250, in which case the use of global correspondence scores can remarkably improve the accuracy of the alignment. It is also apparent from the figure that the performance of local network alignment using individual node similarity degrades much faster as the individual similarity gets less informative. This implies that we can obtain more robust and reliable alignment results by incorporating the global correspondence scores into the HMM-based local network alignment algorithm. (The URL for downloading the synthetic networks used in our experiments can be found in the Additional file 1.)

Aligning microbial PPI networks
For further evaluation of the proposed method, we performed pairwise alignments of three microbial PPI networks obtained from [7]. In our experiments, we aligned the E. coli network and the C. crescentus networks to detect conserved functional modules in the two networks. Similarly, we also performed a pairwise alignment between the E. coli and the S. typhimurium networks to find conserved modules in these networks. As before, we have assessed the accuracy of the alignment results using two metrics-namely, specificity and coverage-based on the KEGG ortholog (KO) group annotations [25] of the proteins in the microbial networks. A protein alignment is regarded as being correct if the aligned proteins have the same KO group annotations, and incorrect if the annotations do not agree.
In these experiments, the parameters of the HMMs were chosen as follows. First, the transition scores t w (u i , u j ) in (1) were determined based on the presence (or absence) of protein interactions in the microbial networks determined by the SRINI algorithm [27]: Second, we used the BLASTP hit scores between protein pairs, provided in [7], as the individual node similarity scores. The global correspondence scores were computed according to the semi-Markov random walk approach described earlier. These two types of node similarity scores were normalized such that they lie in the same range.
Based on the constructed HMMs, we used our local network alignment algorithm to find the 200 top-scoring pathway alignments with gaps. This experiment was also repeated for several different virtual path length: L = 10, 20, and 30. In all our experiments, we disallowed multiple occurrences of identical protein pairs in a given path alignment. The cumulative specificity cs k of the pairwise alignment of the E. coli and the C. crescentus networks using the two different types of node similarities are shown in Figs  scores. This implies that HMM-based network alignment based on global correspondence scores can more effectively capture the functional similarity between nodes. However, as we can see in Fig. 5(D), (E), (F), the protein pairs aligned using the semi-Markov random walk based global correspondence scores are less annotated (as reflected in the lower coverage cc k ) for the pairwise alignment of the E. coli and the S. typhimurium networks, in which case the specificity of the predicted alignment is not necessarily improved by the global scores. This can be seen in Fig. 4(D), (E), (F). One possible explanation for this observation is that the KO group annotations may have been curated largely baed on sequence similarity between proteins. For example, for remote orthologs that do not have high sequence similarity, it may be practically difficult to judge to which KO group they should belong since there is not enough evidence. From this point of view, network alignment using global correspondence scores obtained from semi-Markov random walk could be used to validate and improve functional annotation of proteins.
In order to further examine the biological significance of the alignment results obtained from the proposed method, we retrieved the unannotated protein pairs that are aligned in the top path alignments predicted by comparing the E. coli and the S. typhimurium networks. As these proteins do not have curated functional annotations, such as GO terms or KO groups, we manually checked the protein information for the first 20 unannotated pairs in the Protein database at the National Center for Biotechnology Information (NCBI) [28]. Table  1 shows the assigned gene names and regions names based on the Genlnfo Identifiers (GIs) of these aligned proteins. Among these 20 protein pairs that are unannotated in the KEGG database, 18 pairs consist of proteins that have been assigned with the exactly same gene names and region names. In fact, these protein pairs indeed have similar cellular functionalities, where many of them are membrane proteins. Further information about these protein pairs can be found at the URLs included in the Additional file 1. For the remaining two protein pairs (shown in bold face in Table 1), we could see that the aligned proteins were not assigned with the same gene names because the proteins in the S. typhimurium network were not annotated. When we checked the region names of the aligned proteins in each pair, we could see that the proteins in the first pair have the same region name "Transposase_3l" and those in the second pair have the same region name "DUF1131". These observations suggest that the HMM-based network alignment method using semi-Markov random walk scores may provide a promising framework for automatic functional annotation of proteins. Table 1 Gene names and Region names based on the GenInfo Identifiers (GIs) of the top 20 unannotated protein pairs that are aligned in the top conserved paths. Synonymous gene names are shown within parentheses.

Conclusion
In this paper, we studied the effect of using a global similarity scoring scheme to measure the node similarity and incorporating these global scores in the HMM-based local network alignment algorithm. We used the semi-Markov random walk framework to compute the global correspondence scores between nodes in different networks. The resulting scores can effectively combine the topological similarity of the subnetworks around the network nodes as well as their individual molecular similarity. Experimental results on microbial protein-protein interaction networks and synthetic scale-free networks show that the use of global correspondence scores can better identify paths with similar topological properties, thereby improving the specificity of the predicted alignment. We believe that the proposed alignment scheme can provide an effective and computationally efficient framework for developing robust and accurate functional annotation tools for proteins.

Additional material
Additional file 1: supplement.pdf This file contains: (i) information of the benchmark set that contains randomly generated synthetic networks, (ii) additional experimental results, and (iii) relevant information regarding the unannotated pairs of aligned proteins in microbial networks.