- Research article
- Open Access
CASCADE_SCAN: mining signal transduction network from high-throughput data based on steepest descent method
BMC Bioinformaticsvolume 12, Article number: 164 (2011)
Signal transduction is an essential biological process involved in cell response to environment changes, by which extracellular signaling initiates intracellular signaling. Many computational methods have been generated in mining signal transduction networks with the increasing of high-throughput genomic and proteomic data. However, more effective means are still needed to understand the complex mechanisms of signaling pathways.
We propose a new approach, namely CASCADE_SCAN, for mining signal transduction networks from high-throughput data based on the steepest descent method using indirect protein-protein interactions (PPIs). This method is useful for actual biological application since the given proteins utilized are no longer confined to membrane receptors or transcription factors as in existing methods. The precision and recall values of CASCADE_SCAN are comparable with those of other existing methods. Moreover, functional enrichment analysis of the network components supported the reliability of the results.
CASCADE_SCAN is a more suitable method than existing methods for detecting underlying signaling pathways where the membrane receptors or transcription factors are unknown, providing significant insight into the mechanism of cellular signaling in growth, development and cancer. A new tool based on this method is freely available at http://www.genomescience.com.cn/CASCADE_SCAN/.
Signal transduction plays an essential role in cell response to environment changes. This biological process is usually characterized by phosphorylation/dephosphorylation of some key proteins (e.g. kinases) and generally involves a signal cascade. The signal transduction process often starts from a membrane protein (usually a membrane surface receptor), spans a series of intercellular signaling proteins and then transfers to transcription factors in the nucleus, subsequently raising the expression of downstream genes. Studies demonstrate that many important cellular processes such as cell proliferation, differentiation, cell cycle control and cellular responses to nutrient limiting conditions are involved in different signaling pathways [1, 2]. For example, Yokoi et al demonstrated that hyperglycemia mediates endothelial cell senescence through the ASK1 signaling pathway. Tang et al showed that the receptor kinase BRI1 and BR-signaling kinases (BSKs) mediate growth regulation related signal transduction in Arabidopsis. The Toll-like receptor (TLR) signaling cascade plays an essential role in recognizing and eliciting responses upon invasion of pathogens . Recent high-throughput genomic and proteomic techniques, such as large-scale yeast two-hybrid (Y2H) , Co-Immunoprecipitation (Co-IP) [7, 8], tandem affinity purification-mass spectrometry (TAP-MASS) [9, 10], protein chip [11–14] and microarray experiments [15, 16] have generated enormous amounts of data for uncovering signal transduction networks. This abundance of information brings increasing complexity to network analysis, which is a major obstacle to understanding the mechanisms of cell signaling.
Recently, computational methods have been introduced in mining signal transduction network. Steffen et al developed a static model, NetSearch, to reconstruct the signal transduction network from PPI and gene expression data. For a given membrane protein and transcription factor, NetSearch will search for all possible linear paths that link the two proteins. By employing a depth first search (DFS) algorithm [17–20], paths of a specified length are kept, and then a statistical score is assigned to each path. Top scoring pathways are then assembled into the final branched signal transduction network. Liu et al have worked on determining the order of signal transduction network components. They calculated the correlations between each gene pair and recorded the significance using a hypergeometric test to specify the correlation threshold. A score function is constructed to determine the final signal transduction network. Zhao et al[18, 22] proposed a novel computational approach aimed at finding an optimal signal transduction network using an integer linear programming (ILP) and mixed integer linear programming (MILP) model. Similar approaches have also been proposed in more recent studies [20, 23]. All those existing methods mainly use integrated PPI and gene expression data, which have been widely adopted in many related studies. They all aim at finding an optimal signal transduction network starting from a given membrane receptor and ending at a specific transcription factor. However, in most situations, we even do not know which membrane receptor or transcription factor is involved in a certain signaling pathway. In fact, most intermediate proteins are more easily available for their dominant position in quantity, which is neither a membrane receptor or transcription factor. These proteins could also be used in mining signal transduction networks. Besides, the datasets utilized in these methods are primarily based on experiments. Though the interactions are more reliable compared with computationally predicted interactions, the data is deficient. Some computational methods, e.g. gene co-expression  and semantic similarity of Gene Ontology (GO) annotations , indicate that genes with high scored interactions may be involved in the same signaling pathway . However, this information either is limited or has not been incorporated in most databases constructed from experimental data. Though these interactions may not necessarily be direct interactions, using this information may help to improve prediction of signal transduction networks. We define "direct interaction" as a direct physical association between two proteins and "indirect interaction" as no direct physical association between two proteins in the actual state. Two proteins with indirect interaction must function through at least one medial protein.
Here, we present a novel computational method, named CASCADE_SCAN, under Linux system, to detect signal transduction networks from high-throughput data based on a customized steepest descent method (SDM) . We do not mean to find a signal transduction network starting and ending from two specific proteins, but to find a high scored network within the top largest densities that contain a series of given proteins, also referred to as seed proteins, which are the supposed known components of a specific signaling pathway. These given proteins may also include the membrane receptor and the transcription factor. In addition, by searching for additional high related proteins automatically, indirect interactions are used effectively. This pre-processing is demonstrated below to be very useful. The well studied yeast MAPK signaling pathways (Figure 1), which have been widely used in previous studies, were also employed here to test our model. All figures were generated by Cytoscape . The results indicate that the precision and recall values are comparable with those of other existing methods, even though the dataset we used is much larger than those previously utilized.
Since the yeast MAPK pathways have been well studied and widely used as reference standards in previous studies [17, 18, 21, 23], we thus used those pathways (Figure 1) downloaded from the KEGG database  to validate the reliability of CASCADE_SCAN. In the following steps, we discuss the results of the pheromone response and the filamentous growth pathway, respectively. It is worth noting that the seed proteins we used were not restricted to membrane receptors and transcription factors, as in previous studies.
Detecting the pheromone response pathway
The pheromone response pathway (Figure 2) mediates cell signaling in response to extracellular peptide pheromones. In the current KEGG database, there are about twenty proteins present in the pheromone response pathway, as shown in Figure 2A. We randomly selected four seed proteins and varied the score threshold from 0.800 to 0.950 with an interval of 0.050. Different parameters were tested (Additional file 1). Table 1 shows the results of the performance evaluation mentioned above compared with other methods (color-coding, NetSearch, PathFinder, ILP). We can see that CASCADE_SCAN obtains ~76% recall after 20 independent experiments, which is comparable with color-coding, NetSearch and PathFinder, though the ~54% precision is lower (Table 1 and Additional file 2). The p-values detected by both the hypergeometric test and Fisher's exact test from DAVID are less than or equal to 2.43E-16 (Table 2), indicating the effectiveness of our method.
Figure 2B shows the results detected by CASCADE_SCAN. Though our dataset included many computationally predicted interactions, the precision and recall reached a high level. It can be seen that 15 proteins in the KEGG pheromone response signaling pathway are detected by our method except for STE2, STE3, BNI1, MSG5 and MCM1 (Figure 2A and 2B). However, several additional proteins, KSS1, STE50 and CLA4, are also detected by our method (Figure 2B).
We note, however, that some of those proteins not in the KEGG pheromone response signaling pathway were also detected by some of the other methods (color-coding, NetSearch, PathFinder, ILP). For example, KSS1 (mitogen-activated protein kinase) was also presented in the pheromone response pathway by our method (Figure 2B). In a previous study, KSS1, together with FUS3, was demonstrated to regulate the signal transduction cascade of the pheromone response and filamentous growth pathways . KSS1 also activates the transcription factor STE12 as well as phosphorylates DIG1, DIG2 and STE12, which are involved in both the pheromone response and filamentous growth pathways [30, 31]. Moreover, KSS1 was also detected by all of the other four methods. CLA4 is a member of the STE20 subfamily, which belongs to the STE Ser/Thr protein kinase family. CLA4 may play a role in the phosphorylation of alpha-factor-arrested yeast . STE50 was reported to interact with the MAPKKK STE11 through the respective SAM domain [33, 34], and is required in growth arrest during conjugation at an early step in yeast mating . We, therefore, can draw the conclusion that those proteins indeed have some association with the yeast pheromone response pathway, though they are not included in the KEGG pheromone response pathway.
BNI1, which was not detected by most of the other methods (color-coding, NetSearch, PathFinder, ILP), was also not detected by CASCADE_SCAN. This may be because it has both weaker and fewer interactions with other proteins in this pathway, which may also have been the case for the other proteins STE2, STE3, MSG5 and MCM1.
Detecting the filamentous growth pathway
The filamentous growth pathway (Figure 3) regulates cellular response to nutrient limiting conditions. For this pathway (Figure 3A), there are many common proteins with other MAPK pathways (Figure 1). So, we randomly selected three or four seed proteins only from SSU81(SHO1), TEC1, RAS2 and KSS1. Different parameters were also tested (Additional file 3). After five independent experiments, we obtained an average of ~89% recall and ~36% precision (Table 3 and Additional file 4). Table 3 shows the performance of CASCADE_SCAN in detecting the filamentous growth pathway compared with that of NetSearch, PathFinder and ILP. CASCADE_SCAN clearly shows both higher recall and precision than all the other methods. In addition, the p-values detected by both the hypergeometric test and Fisher's exact test from DAVID are less than or equal to 4.42E-23 (Table 4), indicating that our method is effective.
Figure 3B shows the output of CASCADE_SCAN when using SSU81, RAS2 and KSS1 as seed proteins. CASCADE_SCAN identified a total of 18 proteins (Figure 3B). The 11 proteins in the KEGG filamentous growth pathway are shown in Figure 3A, nine of which were detected by CASCADE_SCAN except for CDC42 and TEC1 (Figure 3B). We investigated the nine additional proteins, and found most of them were identified in the pheromone response pathway, including STE4, STE5, FUS3 and FUS1. This sharing proteins phenomenon was also observed in previous studies [18, 23]. In addition, PEA2, SPA2, KAR4, STE50 and FUS2, which are not mentioned in any KEGG MAKP pathway, were newly detected by our method (Figure 3B). Those proteins were detected mainly because the integrated database contains computationally predicted PPIs with high scores and no protein activity data were utilized.
Detecting the cell wall remodeling and high osmolarity pathways
We also evaluated the cell wall remodeling (Figure 4) and high osmolarity (Figure 5) MAPK pathways. These two KEGG pathways are shown in Figure 4A and 5A. After 20 independent experiments by randomly selected three seed proteins each time, CASCADE_SCAN obtained ~69% recall and ~21% precision for the cell wall remodeling pathway (Table 5 and Additional files 5 and 6), and ~77% recall and ~20% precision for the high osmolarity pathway (Additional files 7 and 8), using a default network size threshold of 50. Though the recall values are relatively high (Table 5), the precision seems to be very low. Generally speaking, the more complex the organism is, the more components a signaling pathway contains. Through our investigation of all the KEGG signaling pathways, the number of signaling pathway components usually does not exceed 50 in higher organisms (e.g. human), and 30 in relatively lower organisms (e.g. yeast and fly). Therefore, we also evaluated the two pathways with a network size threshold of 30. CASCADE_SCAN obtained ~64% recall and ~32% precision for the cell wall remodeling pathway (Table 5), and ~82% recall and ~39% precision for the high osmolarity pathway, showing significant improvement in the precision at a relative high level. Moreover, all of the p-values are less than or equal to 8.58E-17 (Table 6) and 5.85E-11 (Table 7), respectively, indicating that our method is effective.
Figures 4B and 5B show the results of the cell wall remodeling and the high osmolarity pathways detected by CASCADE_SCAN. As we can see in the figures, most components of the two pathways are detected by CASCADE_SCAN. However, the integrated database contains many more interactions, so several other proteins were also detected. In addition, CASCADE_SCAN seems to predict fewer edges between the proteins in the predicted signal transduction networks comparing with other methods. Hence, even though the membrane receptor and transcription factor are not known, we still know where the signal is from and to among those proteins, since most proteins have only one link to the preceding and succeeding element in the predicted network. In fact, if we require the order between the proteins to be more intuitive, fewer edges should be kept in the predicted network. We achieved this goal by maximizing the average weight of the network while keeping most of the reliable interactions, as described by equation (3).
Generally, some potential proteins involved in a signaling pathway stimulated by environmental factors are easily available through various reliable means, such as manual literature curation and biological experiments. But in most situations, not all or none of these proteins are membrane receptors or transcription factors. Moreover, the proteins we obtained may be more than just two proteins (a membrane receptor and a transcription factor). Therefore, CASCADE_SCAN is more suitable for actual biological application compared with existing methods such as color-coding, NetSearch, PathFinder and ILP. Nevertheless, although those methods utilize a more reliable dataset, the data is limited. However, using computationally predicted interactions may make up for the deficiency of experiment data, which is also one of our original aims.
In addition, we used a different but both a reasonable and effective data pre-processing scheme for CASCADE_SCAN. Firstly, we used a customized DFS algorithm to search for common nodes within a certain path length. We found that, in the integrated database (incorporating the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) and Biological General Repository for Interaction Datasets (BioGRID) databases), more than 90% nodes can link with any other nodes within a six-path length, and proteins in the same pathway often within a two-path length, which is different from the previously reported 6-9 steps. This is primarily because the integrated database stores many more indirect PPIs that were primarily generated by computational methods. The indirect interaction weights are often lower than the direct interaction weights but much higher than the weights between two unrelated nodes. Therefore, the path length between any two nodes would become correspondingly shorter, and we used a relatively shorter path length for the DFS algorithm here. Secondly, proteins involved in the same signaling pathway usually have a similar gene expression profile, so there are usually more and stronger interactions among them than others in the integrated database. And vice versa, many computationally predicted and high scored PPIs also indicate that the corresponding proteins may be involved in the same signaling pathway. Our approach--searching for additional seed proteins--was based on this hypothesis. When detecting the pheromone response pathway, by providing only four seed proteins, CDC24, DIG1, FAR1 and STE4, seven other proteins were detected in the pre-processing step (PPI score threshold: 0.800, credible PPI score threshold: 0.980, DFS path length: 2), including CDC42, BEM1, STE12, STE11, FUS3, STE18 and STE5, all of which are in the KEGG pheromone response pathway. When determining the filamentous growth pathway, by only giving three seed proteins KSS1, RAS2 and SSU81, three other proteins including FUS1, STE12 and STE11 were detected in the pre-processing step (PPI score threshold: 0.900, credible PPI score threshold: 0.950, DFS path length: 5), of which STE12 and STE11 are in the KEGG filamentous growth pathway except for FUS1. In all, these pre-processing schemes were demonstrated to be effective in inferring signaling pathways when using the integrated database.
Recent studies have indicated that genes involved in the same signaling pathway tend to have similar gene expression profiles, which is especially notable regarding adjacent pathway components . Moreover, signal transduction usually shows different activation under different situations . Based on those hypotheses, microarray expression data [36, 37] were employed in previous studies as a complement to PPI data. Recently, studies have reported that using fold change criterions in the pre-processing schemes removes genes without significant change . However, this process may also eliminate some important genes without significant quantity change in mRNA level, but showing activity changes in protein level. Therefore, fold change was not incorporated in this study. However, fold change criterion of gene expression data could be easily incorporated into the filtering process. Besides, an appropriately bigger score threshold, a relatively shorter DFS path length and a more stricter PPI scoring system would be helpful for excluding most irrelevant proteins as well as most indirect interactions.
When we attempted to compare the results of color-coding, NetSearch, PathFinder and ILP obtained from Zhao's paper , we found the results are actually incomparable because we used a different dataset and method. Firstly, the integrated dataset we used contained 5,717 yeast proteins and 322,084 yeast protein interactions, much larger than the previously used dataset. Moreover, most of these interactions are indirect interactions, while the previously used dataset contains mostly direct interactions. Secondly, the seed proteins used in our method are not confined to membrane receptors or transcription factors, and are usually more than just two proteins. Using different seed proteins may also lead to different precision and recall values. Therefore, it is difficult to say which method is better. Furthermore, the PPI scoring system is of the most importance, for which we combined the topological clustering semantic similarity (TCSS) scoring and STRING scoring systems. The TCSS scoring system predicts PPI from semantic similarity based on GO annotations . The STRING scoring system method predicts PPI from various methods, such as the neighbourhood method, fusion events, co-occurrence, co-expression, experimental methods and text-mining. The combined scoring system seems to be stricter, and may be one of the reasons for our better performance. Nevertheless, we still recommend using other methods as complements for predicting signal transduction networks because no single method is perfect. More work needs to be done in this field as well as controlling for network size.
We note that the precision values are much lower for the other three pathways (filamentous growth, cell wall remodelling, and high osmolarity pathways) compared to the pheromone response pathway. This is primarily because most proteins in each of the other three pathways also have strong interactions with other proteins that are in other pathways. Therefore, more proteins are predicted than for the other three pathways. For instance, CDC42, STE20, STE11, STE7, DIG1, DIG2 and STE12 in the filamentous growth pathway could interact with BEM1, PBS2, FUS3, MCM1, STE4, STE5, STE18 and CDC24 in the pheromone response pathway, STE11 could interact with PBS2 in the high osmolyte pathway, and RAS2 could interact with CYR1 in the meiosis pathway. This is the sharing proteins phenomenon as mentioned in previous studies [18, 23]. Though all the methods predicted mostly true proteins, i.e., with high recall values, the other three pathways contain fewer proteins than the pheromone response pathway. Thus, the precision for determining these signal transduction networks is relatively lower. Another reason may be there is less sufficient PPI information available for the other three pathways compared with the well studied pheromone response pathway.
In this paper, we reported a new method, named CASCADE_SCAN, to detect signal transduction networks from high-throughput data. A new tool based on this method is freely available from the website http://www.genomescience.com.cn/CASCADE_SCAN/. Different from previous methods, in CASCADE_SCAN, the SDM is employed for inferring the signal transduction network and the given proteins utilized are not confined to membrane receptors or transcription factors. We also demonstrated that indirect interactions (e.g. the most computationally predicted interactions) can be effectively used in mining signaling transduction networks. This is particularly useful because all databases will include more and more indirect interactions as the data accumulates.
Widely used datasets often include PPI data and gene expression data. Here, only the PPI dataset was employed. The Yeast Proteome Database (YPD) [39, 40], Saccharomyces Genome Database (SGD)  and Database of Interacting Proteins (DIP) [42, 43] are the most frequently used PPI databases, but the interaction dataset in those databases is very limited, which may lead to misconnections due to deficient data. In this study, we constructed a PPI dataset from the STRING database (Version 8.3) [24, 44] and BioGRID database (Version 3.1.74) . The current STRING database contains 6,015 yeast proteins and 245,782 yeast protein interactions, and the BioGRID database contains 6,063 yeast proteins and 168,599 yeast protein interactions. Our combined database contains both direct and indirect PPIs derived from both computational methods and biological experiments, providing more comprehensive information than previously used.
PPI scoring system
To score the PPI pairs in the combined database, we integrated the STRING scoring system and the TCSS method . The STRING database infers PPIs through various approaches, including the neighbourhood method, fusion events, co-occurrence, co-expression, experimental methods and text-mining. It integrates all probabilities of those methods and assigns each PPI pair a reasonable score . The original PPI score in STRING database is from 0 to 999, which is subsequently normalized from 0.000 to 0.999 by dividing by 1000. For all the PPI pairs in the combined database, the TCSS algorithm was implemented to give each PPI pair a score with a default biological processes cutoff of 3.5. The score is normalized from 0.000 to 0.999 by multiplying by 0.999. However, not all of the proteins would have a corresponding GO term in the annotation file. PPI pairs having neither a score in the STRING database nor a TCSS score were removed. Finally, we obtained 5,717 yeast proteins and 322,084 yeast protein interactions. The combined score (the weight of an edge) of each PPI pair is calculated as:
where, S TCSS is the normalized score calculated by TCSS and S STRING is the normalized STRING score. Note, if there is a PPI pair only in the STRING database or could not be scored by TCSS, S TCSS would be 0, and if the PPI pair is not presented in the STRING database, S STRING would be 0.
To reduce the false positive rate, two pre-processing steps were carried out to exclude obviously irrelevant proteins but to keep the high correlated proteins as much as possible.
Firstly, given several seed proteins, which we assumed to be known components in a signaling pathway, a similar but different DFS algorithm was realized. Our DFS algorithm would search for all proteins connected with each seed protein within a certain path length. The common proteins within this scope were kept.
In addition, given an interaction score threshold and several seed proteins, CASCADE_SCAN searches for additional proteins having common association with the seed proteins. If a protein (the nodes in the network graph) possesses at least three interactions (the edges in the network graph) with the given seed proteins, this protein would be selected as a new seed protein, which would also be considered as a component of the signal transduction network. Besides, a protein, with at least two high credible interactions with the given seed proteins, would be also selected as a new seed protein. All of the seed proteins would be included in the calculation of the distance or score when we selected the steepest descent nodes or edges, as described in the following steps.
These steps were carried out by CASCADE_SCAN automatically. The parameters are adjustable. By default, the path length is set as 10 and the credible edge weight threshold 0.980.
Customized steepest descent method for detecting signaling transduction networks
Given PPI information, our model is based on the following hypotheses: Firstly, proteins in the same signaling pathway have stronger interactions than that in different signaling pathways. Secondly, proteins have more interactions with each other in the same signaling pathway than in different signaling pathways. Thus, nodes in the same network usually have shorter distances to each other than nodes in different networks, and an "actual network" usually processes a higher density than networks containing more unrelated nodes. The density of a network is calculated as shown in equation (2). However, there is still a problem: networks composed of parts of the "actual network" may also have a similar or even much higher density. To resolve this problem, we adopted a strategy to balance the network density and size. We selected the top five networks sorted by network density in descending order, and the network of the largest size from among the five as the most possible one containing the most components of the "actual network". Generally, the more and stronger the interactions, the higher the total score. Our goal was to find a compact network with the highest score. The score of a network is calculated as shown in equation (4).
To achieve our goal, the SDM, which is the simplest of the gradient methods, was employed to remove obviously irrelevant nodes. The basic idea of the method is choosing a direction where the objective function decreases most quickly. The optimization process starts at a certain point and slides along the direction until the result get close enough to the final solution. We detail this process below. Firstly, the interaction score between two proteins is converted to a distance. An interaction score below the score threshold is reassigned as 0, which denotes there is no interaction between the two proteins, and the corresponding distance will be considered as an infinite number. Otherwise, if the interaction score is above the score threshold, the corresponding distance will be assigned as 1 - score. Then, the graph search algorithm Dijkstra  is employed to calculate the distance between any two proteins. Given the initial node, this algorithm will find the shortest path between this node and any other node in the graph. Hence, it is often used in solving routing related problems. Consequently, the distances between each protein and any seed protein are obtained. Generally speaking, if the distance of a node to the "actual network" is maximal, the node represents the steepest descent direction. However, the "actual network" is unknown, we simply use the distance of the node to any of the given nodes (i.e. seed proteins) as the distance to the "actual network". Hence, the node corresponding to the maximal distance will be selected as a candidate steepest descent node to remove from the network. While, if removing the candidate steepest descent node leads to a disconnection between given nodes, the node corresponding to the next maximal distance is selected as a candidate steepest descent node, and so forth.
However, there still may be many nodes with short distances to some of the given nodes. So, when the network size is reduced to a certain extent, using the distances will no longer be able to distinguish the differences among those nodes. Moreover, proteins in the same pathway usually have more than one high scored interaction predicted by a computational method. Futhermore, the aggregate scores, which represent the contributions of a certain node to all of the given nodes, are obviously distinct from distances to distinguish which node should be selected as the candidate steepest descent node. For instance, suppose there are two nodes, N1 and N2, and three given nodes, G1, G2 and G3. The edge weight (interaction score for two proteins) between N1 and G1, N1 and G2, and N1 and G3 are 0.850, 0.900 and 0.95, respectively. While, there is no edge between N2 and G3, the edge weights between N2 and G1, and N2 and G2 are 0.950 and 0.900, respectively. Therefore, the distance between N1 and the given nodes (G3 corresponding to the maximal distance) is 0.050, and the distance between N2 and the given nodes (G1 corresponding to the maximal distance) is also 0.050. That is to say, there is no difference between N1 and N2 in distance. However, the aggregate contributing score of N1 (2.700) is much higher than the aggregate contributing score of N2 (1.850). So, when all of the remaining nodes have at least one edge with the given nodes, the aggregate scores are employed to select the candidate steepest descent node.
When the network size is below a specific threshold (by default, the maximal network size is restricted to 50), the top five largest density networks are recorded. However, these networks include many false negative edges, which represent mostly indirect interactions in the networks. To remove these false negative edges as much as possible, we use a maximal average weight controlling strategy, and the steepest descent method is employed here again. If an edge weight in the current network is minimal, removing this edge would maximize the average weight the most quickly. So, the least weighted edge is selected to be removed from the network until the average weight peaks. The average weight of a network is calculated as shown in equation (3). However, there may be more than one edge having the same weight, and all of them would be minimal. In this situation, we randomly select an edge from among them as the candidate steepest descent edge. If the removal of the edge leads to disconnection of the network components, the next least weighted edge is selected immediately. Finally, the aggregate score of each network is calculated, the network with the highest score being the most possible one. The formulas to illustrate the process are shown in detail below:
where, i is the i-th node, j is the j-th node and n is the total number of nodes in a signal transduction network, respectively, and each node in the signal transduction network represents a protein. e ij is the edge between node i and j, which represents whether the interaction between node i and j is selected as part of the network or not. If the edge between node i and j is selected, e ij is 1, otherwise, e ij is 0. w ij is the interaction weight of edge e ij , which is represented by the PPI score. We consider the network as an undirected graph here. Therefore, interaction weights w ij and w ij are the same in nature. D n is the density of a network with a specific size of n. W n is the average weight of a network with a specific size of n. Each time a least weighted edge is removed from the network, the average weight becomes larger. S n is the score of the network with a specific size of n when the average weight peaks.
To date, the network with the highest score has been obtained. For comprehensive consideration, we extended this restriction to the top N highest scored network.
We employed precision and recall rates to evaluate the performance of our model. Precision is defined as the rate of the number of correctly predicted nodes to the total number of nodes determined in the signal transduction network. The recall rate is defined as the correct number of predicted proteins to the total number of proteins in the corresponding KEGG pathway. Statistical significance was also measured by functional enrichment analysis of the network components. These test methods are recognized as common evaluation criteria for signal transduction networks [17, 18, 23].
Different from previous repeat experiments, we randomly selected several proteins in the same pathways as seed proteins, rather than using the same membrane receptor and transcription factor as the seed proteins.
GO functional enrichment analysis was performed by DAVID [47, 48]http://david.abcc.ncifcrf.gov/. We also employed hypergeometric distribution to calculate the statistical significance of the enriched network components, as described by:
where N represents the total number of proteins in the SGD database, n represents the number of proteins with the specific function annotated by DAVID in N proteins, M denotes the network size and x is the number of proteins with the specific function annotated by DAVID in M proteins. Fisher's exact test p-values were also obtained from DAVID functional enrichment analysis.
Hunter T: Signaling--2000 and beyond. Cell 2000, 100: 113–127. 10.1016/S0092-8674(00)81688-8
Takahashi A, Ohtani N, Hara E: Irreversibility of cellular senescence: dual roles of p16INK4a/Rb-pathway in cell cycle control. Cell Div 2007, 2: 10. 10.1186/1747-1028-2-10
Yokoi T, Fukuo K, Yasuda O, Hotta M, Miyazaki J, Takemura Y, Kawamoto H, Ichijo H, Ogihara T: Apoptosis signal-regulating kinase 1 mediates cellular senescence induced by high glucose in endothelial cells. Diabetes 2006, 55: 1660–1665. 10.2337/db05-1607
Tang W, Kim TW, Oses-Prieto JA, Sun Y, Deng Z, Zhu S, Wang R, Burlingame AL, Wang ZY: BSKs mediate signal transduction from the receptor kinase BRI1 in Arabidopsis. Science 2008, 321: 557–560. 10.1126/science.1156973
Lang T, Mansell A: The negative regulation of Toll-like receptor and associated pathways. Immunol Cell Biol 2007, 85: 425–434. 10.1038/sj.icb.7100094
Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001, 98: 4569–4574. 10.1073/pnas.061034498
Free RB, Hazelwood LA, Sibley DR: Identifying novel protein-protein interactions using co-immunoprecipitation and mass spectroscopy. Curr Protoc Neurosci 2009., Chapter 5: Unit 5 28 Unit 5 28
Bridgeman JS, Blaylock M, Hawkins RE, Gilham DE: Development of a flow cytometric co-immunoprecipitation technique for the study of multiple protein-protein interactions and its application to T-cell receptor analysis. Cytometry A 2010, 77: 338–346.
Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415: 141–147. 10.1038/415141a
Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415: 180–183. 10.1038/415180a
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, et al.: Global analysis of protein activities using proteome chips. Science 2001, 293: 2101–2105. 10.1126/science.1062191
Smith MG, Jona G, Ptacek J, Devgan G, Zhu H, Zhu X, Snyder M: Global analysis of protein function using protein microarrays. Mech Ageing Dev 2005, 126: 171–175. 10.1016/j.mad.2004.09.019
Fasolo J, Snyder M: Protein microarrays. Methods Mol Biol 2009, 548: 209–222. 10.1007/978-1-59745-540-4_12
Kung LA, Snyder M: Proteome chips for whole-organism assays. Nat Rev Mol Cell Biol 2006, 7: 617–622. 10.1038/nrm1941
Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, Roach JC, Kennedy K, Hai T, Bolouri H, Aderem A: Systems biology approaches identify ATF3 as a negative regulator of Toll-like receptor 4. Nature 2006, 441: 173–178. 10.1038/nature04768
Segal E, Wang H, Koller D: Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 2003, 19(Suppl 1):i264–271. 10.1093/bioinformatics/btg1037
Steffen M, Petti A, Aach J, D'Haeseleer P, Church G: Automated modelling of signal transduction networks. BMC Bioinformatics 2002, 3: 34. 10.1186/1471-2105-3-34
Zhao XM, Wang RS, Chen L, Aihara K: Uncovering signal transduction networks from high-throughput data by integer linear programming. Nucleic Acids Res 2008, 36: e48. 10.1093/nar/gkn145
Scott J, Ideker T, Karp RM, Sharan R: Efficient algorithms for detecting signaling pathways in protein interaction networks. J Comput Biol 2006, 13: 133–144. 10.1089/cmb.2006.13.133
Arga KY, Onsan ZI, Kirdar B, Ulgen KO, Nielsen J: Understanding signaling in yeast: Insights from network analysis. Biotechnol Bioeng 2007, 97: 1246–1258. 10.1002/bit.21317
Liu Y, Zhao H: A computational approach for ordering signal transduction pathway components from genomics and proteomics Data. BMC Bioinformatics 2004, 5: 158. 10.1186/1471-2105-5-158
Zhao XM, Wang RS, Chen L, Aihara K: Automatic modeling of signaling pathways by network flow model. J Bioinform Comput Biol 2009, 7: 309–322. 10.1142/S0219720009004138
Bebek G, Yang J: PathFinder: mining signal transduction pathway segments from protein-protein interaction networks. BMC Bioinformatics 2007, 8: 335. 10.1186/1471-2105-8-335
von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B: STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 2003, 31: 258–261. 10.1093/nar/gkg034
Jain S, Bader GD: An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinformatics 2010, 11: 562. 10.1186/1471-2105-11-562
Steinmetz HL: USING THE METHOD OF STEEPEST DESCENT. Industrial & Engineering Chemistry 1966, 58: 33–39.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32: D277–280. 10.1093/nar/gkh063
Madhani HD, Styles CA, Fink GR: MAP kinases with distinct inhibitory functions impart signaling specificity during yeast differentiation. Cell 1997, 91: 673–684. 10.1016/S0092-8674(00)80454-7
Bardwell L, Cook JG, Voora D, Baggott DM, Martinez AR, Thorner J: Repression of yeast Ste12 transcription factor by direct binding of unphosphorylated Kss1 MAPK and its regulation by the Ste7 MEK. Genes Dev 1998, 12: 2887–2898. 10.1101/gad.12.18.2887
Bardwell L, Cook JG, Zhu-Shimoni JX, Voora D, Thorner J: Differential regulation of transcription: repression by unactivated mitogen-activated protein kinase Kss1 requires the Dig1 and Dig2 proteins. Proc Natl Acad Sci USA 1998, 95: 15400–15405. 10.1073/pnas.95.26.15400
Li X, Gerber SA, Rudner AD, Beausoleil SA, Haas W, Villen J, Elias JE, Gygi SP: Large-scale phosphorylation analysis of alpha-factor-arrested Saccharomyces cerevisiae. J Proteome Res 2007, 6: 1190–1197. 10.1021/pr060559j
Grimshaw SJ, Mott HR, Stott KM, Nielsen PR, Evetts KA, Hopkins LJ, Nietlispach D, Owen D: Structure of the sterile alpha motif (SAM) domain of the Saccharomyces cerevisiae mitogen-activated protein kinase pathway-modulating protein STE50 and analysis of its interaction with the STE11 SAM. J Biol Chem 2004, 279: 2192–2201.
Kwan JJ, Warner N, Maini J, Chan Tung KW, Zakaria H, Pawson T, Donaldson LW: Saccharomyces cerevisiae Ste50 binds the MAPKKK Ste11 through a head-to-tail SAM domain interaction. J Mol Biol 2006, 356: 142–154. 10.1016/j.jmb.2005.11.012
Rad MR, Xu G, Hollenberg CP: STE50, a novel gene required for activation of conjugation at an early step in mating in Saccharomyces cerevisiae. Mol Gen Genet 1992, 236: 145–154.
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al.: Functional discovery via a compendium of expression profiles. Cell 2000, 102: 109–126. 10.1016/S0092-8674(00)00015-5
Roberts CJ, Nelson B, Marton MJ, Stoughton R, Meyer MR, Bennett HA, He YD, Dai H, Walker WL, Hughes TR, et al.: Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 2000, 287: 873–880. 10.1126/science.287.5454.873
Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: D258–261. 10.1093/nar/gkh036
Payne WE, Garrels JI: Yeast Protein database (YPD): a database for the complete proteome of Saccharomyces cerevisiae. Nucleic Acids Res 1997, 25: 57–62. 10.1093/nar/25.1.57
Hodges PE, Payne WE, Garrels JI: The Yeast Protein Database (YPD): a curated proteome database for Saccharomyces cerevisiae. Nucleic Acids Res 1998, 26: 68–72. 10.1093/nar/26.1.68
Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE, et al.: Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 2004, 32: D311–314. 10.1093/nar/gkh033
Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 2002, 30: 303–305. 10.1093/nar/30.1.303
Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449–451. 10.1093/nar/gkh086
Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, et al.: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009, 37: D412–416. 10.1093/nar/gkn760
Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, Nixon J, Van Auken K, Wang X, Shi X, et al.: The BioGRID Interaction Database: 2011 update. Nucleic Acids Res 2011, 39: D698–704. 10.1093/nar/gkq1116
Dijkstra EW: A note on two problems in connexion with graphs. Numerische Mathematik 1959, 1: 269–271. 10.1007/BF01386390
Sherman BT, Huang da W, Tan Q, Guo Y, Bour S, Liu D, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 2007, 8: 426. 10.1186/1471-2105-8-426
Huang da W, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC, Lempicki RA: DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res 2007, 35: W169–175. 10.1093/nar/gkm415
This work was supported by the National Natural Science Foundation of China [No. 31070954], the Shanghai Commission of Science and Technology Basic Research Fund [No.09JC1406600], and the Shanghai Education Committee and Shanghai Leading Academic Discipline Project (J50103; key discipline "Molecular Physiology").
KW completed the study and wrote the manuscript. FH and KX checked the mathematical algorithm. HC, MJ, RF and JL checked the results. TW supervised the project. All authors read, revised and approved the final manuscript.