Automated modelling of signal transduction networks
© Steffen et al 2002
Received: 26 August 2002
Accepted: 1 November 2002
Published: 1 November 2002
Skip to main content
© Steffen et al 2002
Received: 26 August 2002
Accepted: 1 November 2002
Published: 1 November 2002
Intracellular signal transduction is achieved by networks of proteins and small molecules that transmit information from the cell surface to the nucleus, where they ultimately effect transcriptional changes. Understanding the mechanisms cells use to accomplish this important process requires a detailed molecular description of the networks involved.
We have developed a computational approach for generating static models of signal transduction networks which utilizes protein-interaction maps generated from large-scale two-hybrid screens and expression profiles from DNA microarrays. Networks are determined entirely by integrating protein-protein interaction data with microarray expression data, without prior knowledge of any pathway intermediates. In effect, this is equivalent to extracting subnetworks of the protein interaction dataset whose members have the most correlated expression profiles.
We show that our technique accurately reconstructs MAP Kinase signaling networks in Saccharomyces cerevisiae. This approach should enhance our ability to model signaling networks and to discover new components of known networks. More generally, it provides a method for synthesizing molecular data, either individual transcript abundance measurements or pairwise protein interactions, into higher level structures, such as pathways and networks.
Signal transduction is the primary means by which cells coordinate their metabolic, morphologic, and genetic responses to environmental cues such as growth factors, hormones, nutrients, osmolarity, and other chemical and tactile stimuli. Traditionally, the discovery of molecular components of signaling networks in yeast and mammals has relied upon the use of gene knockouts and epistasis analysis. Although these methods have been highly effective in generating detailed descriptions of specific linear signaling pathways, our knowledge of complex signaling networks and their interactions remains incomplete. New computational methods that capture molecular details from high-throughput genomic data in an automated fashion are desirable and can help direct the established techniques of molecular biology and genetics.
DNA microarray technology has evolved to the point where one can simultaneously measure the transcript abundance of thousands of genes under hundreds of conditions, producing hundreds of thousands of individual data points. Similarly, high-throughput yeast two-hybrid experiments have identified thousands of pairwise protein-protein interactions. Once a core pathway is established, these data can readily be integrated into model refinements, as a recent study in systems biology elegantly demonstrates . However, synthesizing these data de novo into models of pathways and networks remains a significant challenge.
How can one bridge the gap from transcript abundances and protein-protein interaction data to pathway models? Clustering expression data into groups of genes that share profiles is a proven method for grouping functionally related genes, but does not order pathway components according to physical or regulatory relationships. Here we present an automated approach for modelling signal transduction networks in S. cerevisiae by integrating protein-protein interaction [2–4] and gene expression data. Our program, NetSearch, draws all possible linear paths of a specified length through the interaction map starting at any membrane protein and ending on any DNA-binding protein. Microarray expression data [5–7] is then used to rank all paths according to the degree of similarity in the expression profiles of pathway members. Linear pathways that have common starting points and endpoints and the highest ranks are then combined into the final model of the branched networks.
Using the NetSearch algorithm, this protein interaction network was queried for paths up to length eight that begin at membrane proteins and end at transcription factors. The search generated approximately 4.4 million candidate pathways of length eight or less whose biological plausibility was assessed using gene expression data.
To score the pathways, we first used a k-means algorithm to cluster all yeast genes into clusters based on their expression profiles. NetSearch then assigned each pathway a statistical score  according to the number of pathway members that clustered together. For example, a path with six members in one cluster would score higher than a path that only had five members in that cluster. Cluster size influenced path scoring such that a path that had three members from a cluster of 30 elements would score higher than a path that had three members from a cluster of 100 elements. Also, a path with four elements in one cluster and three elements in a second cluster would score higher than a path that had four elements in cluster one, but no more than two elements in cluster two.
Pathways were scored using NetSearch's 'sumprob' scoring metric: Assuming N proteins total and a partitioning of proteins into k clusters C1, C2,...Ck, with N1, N2,...Nk members, respectively, and a pathway p of L proteins p1→p2→...→ pL, where cp(i) = number of proteins in p in cluster Ci, the sumprob score is computed as follows:
probp(i) scores a pathway for a cluster Ci such that pathways which are more concentrated in Ci have higher scores. The summation in probp(i) computes the cumulative hypergeometric probability of pathway p containing cp(i) or more members of Ci. probp(i) assesses co-clustering of pathway members in the single cluster Ci. sumprob(p), the sum of probp(i) values over all clusters for which cp(i) >= 2, is a simple measure of co-clustering across the entire collection of available clusters. The rationale for the restriction cp(i) >= 2 is that without it a pathway could get a high score simply from having single members in one or more rare clusters, in which case the score would no longer reflect co-clustering.
The exact composition of paths discovered using NetSearch depend on the parameters used in path drawing and path scoring. To ensure that NetSearch reproducibly generates statistically significant, biologically plausible paths, we combinatorially varied every parameter value in the path-drawing and path-scoring algorithms, and selected parameter combinations that generate the most statistically significant pathways. Statistical significance was measured by drawing pathways from membrane proteins to DNA-binding proteins through the experimentally determined protein-interaction map (henceforth called "real pathways") and comparing these pathways with pathways drawn through control interaction maps that were created by randomizing all pairwise interactions in the original dataset. (The randomization procedure was performed three times and statistics were calculated on the average output of these runs. Paths produced using these interaction maps are henceforth referred to as "random pathways"). We ultimately chose parameters that maximized the number of high-scoring pathways produced with real interactions, while minimizing high-scoring pathways from the randomized interactions.
The parameters we varied included the number of clusters into which the genes were grouped, the microarray expression datasets used in clustering, the maximum path length, and the scoring metric. Expression data were clustered into 12, 25, 50, 100 and 250 clusters, and NetSearch best discriminated between real pathways and random pathways when genes were grouped into 25 clusters. Three S. cerevisiae expression datasets were examined individually, including the "Compendium" set, composed of expression profiles in response to 300 diverse mutations and chemical treatments ; the "MAPK" set, composed of 56 conditions chosen to probe the behavior of MAPK signal transduction ; and the "Cell Cycle" set, composed of 77 conditions relevant to the cell cycle . Combinations of these datasets were also examined, for a total of five different sets that allowed us to compare the utility of data that probes specific biological processes, such as MAPK signaling or the cell cycle, and that which probes the state of the cell more broadly, such as the Compendium set and the combined sets. The composite data set that combined all three individual sets (for a total of 433 conditions) provided the best discrimination between real pathways and random pathways, although the other sets performed comparably.
The final input parameter that required evaluation was the maximum path length allowable for NetSearch paths. While short path lengths risk omission of key path members, longer path lengths increase the likelihood of including false-positive interactions. As a first step towards determining the optimal maximum path length, we examined the path lengths connecting every possible pair of the 3725 proteins in the interaction dataset, regardless of subcellular localization. The minimal path length between any two proteins chosen at random contains on average 7.4 members. Secondly, we examined the fraction of pathways with high coclustering ratios for various path lengths. Consistent with our finding that the average path length between any two proteins is 7.4, this fraction peaks at eight, which we set as our maximum, unless otherwise noted.
Fig. 4B depicts the pheromone response network at several different score cutoffs, and demonstrates how higher co-clustering score cutoffs reduces the complexity of the protein-interaction map. NetSearch detects 354 paths of length eight from Ste3p to Ste12p, and incorporates 70 different proteins into those paths. The top graph in Fig. 4B shows the network constructed from all 354 paths (with each protein arranged on the perimeter of an ellipse for clarity). In the middle graph, all paths that scored below the median have been eliminated, leaving only 27 proteins. On the bottom of Fig. 4B, only the highest scoring paths (those used to construct the network in Fig. 4A) with 19 proteins, are depicted. Comparison of these networks indicates that most proteins are eliminated by simply excluding the pathways that score in the bottom half; further modifications to the cutoff affect the results incrementally. In setting a precise cutoff for pathway inclusion in the final network models, one seeks to strike a balance between the inclusion of false-positives and the omission of true-positives. We set the cutoff such that the top fifteen paths for each network were included.
The network model generated for the cell wall integrity pathway is depicted in Fig. 4C. Membrane proteins in particular may fail to produce interactions when forced into the nucleus by the requirements of the standard two-hybrid technique. We observed this to be the case for the cell wall integrity pathway, as neither Wsc1p, Wsc2p, Wcs3p or Mid2p were observed to interact in any of the high-throughput screens. To reconstruct this network, we therefore started with the momomeric GTPase Rho1p, and restricted our search to a length of seven because of the omission of the initial signal sensor. Of the 18 proteins included in this network model, all but Smd3p have descriptions consistent with a role in cell wall maintenance. NetSearch included both GTPase constituents of this pathway, Rho1p and Cdc42p, as well as associated GAPs and other interactors, including Rdi1p, Rga1p, and Gic2p. Other included network elements are Fks1p, the 1,3-βglucan synthase of which Rho1p is a subunit , the actin protein Act1p, and the proteins Bni1p, Bud6p, and Sph1p, which are associated with Rho-mediated signal transduction, actin filament organization, cell polarity establishment, and bud growth. Smd3p forms a complex with the Sm core spliceosomal proteins , and we are not aware of any role it may play in maintaining cell wall integrity. Its inclusion is most likely a result of its expression correlation with BUD6 in one of the microarray datasets, but it seems unlikely that the observed interactions of Smd3p with Spa2p and Slt2p have biological significance. In the NetSearch-generated model, Bck1p is downstream of Mkk1p because, although it interacts with both Mkk1p and Mkk2p, it has been shown specifically not to interact with Pkc1p in two-hybrid assays .
The network model for filamentous growth (Fig. 4D) involves 21 proteins, 20 of which are known to play a role in filamentous growth, or have functions consistent with that role, with the exception of Fus1p. As in the pheromone response and cell wall integrity network models, key components of the Ras GTPase are included, such as Cdc25p (the Ras guanine nucleotide exchange factor), Cyr1p (the Ras-associated adenylate cyclase), and Srv2p, which enables the activation of adenylate cyclase by Ras2p. Several proteins with roles in actin filament organization, cell polarity establishment, bud growth, and GTPase-mediated signal transduction are shared with the cell wall integrity pathway, including Bni1p, Spa2p, Bud6p, and Act1p. NetSearch depicts interactions between Abp1p and both Srv2p and Act1p, consistent with the function of Abp1 in tethering Srv2p to the cytoskeleton. The adenylate cyclase and associated proteins mentioned above, along with Hsp82p and Hsc82p, activate the cAMP pathway , a pathway that acts in parallel with the MAPK pathway to promote filamentation. Hsp82p is a chaperone protein known to interact with a number of signaling pathway components . It is required for activation of the pheromone signaling pathway , and for the general response to amino acid starvation . It may play a similar role in response to nitrogen (ammonia) starvation, a trigger for filamentation. Fus1p, included in our predicted network, does not have a documented role in filamentation; it is required for cell fusion during pheromone initiated mating. Its transcript levels are significantly upregulated in response to pheromone, but are unchanged in tec1Δ strains ; that study notes, however, that in dig1Δ dig2Δ cells, fus1 is constitutively activated, and both mating and invasive growth are observed. Tec1p, conspicuously absent in our model, has not been observed to interact with any proteins in high-throughput two-hybrid screens.
The utility of yeast protein-protein interaction maps for generating signaling network models has previously been suggested , and they have been used to predict metabolic pathways . Expression data has been used to generate and refine models for genetic regulatory networks without the benefit of protein-protein interaction data . In this study, we have used expression data to rank candidate pathways of interacting proteins. This approach has a strong biological and experimental rationale: proteins used in the same signaling network must exist simultaneously with its activation. The genes encoding these proteins must be transcribed at approximately the same time, and under the same environmental conditions in which the signaling network is required. Furthermore, experimental evidence suggests that when a signaling network is activated, positive feedback mechanisms upregulate the expression of genes that encode pathway proteins , implying that this rationale is also applicable to "surveillance" pathways, whose protein components may need to be constitutively present in small quantities, but whose concentration increases with activation. This biological rationale is borne out by evidence that interacting proteins have more highly correlated expression profiles than do non-interacting proteins . However, if a single component of a signaling network is independently (and differentially) regulated, it would not necessarily be excluded using our approach, if for instance, it connected two halves of a pathway which had similar average expression profiles.
NetSearch can be used to predict new signaling pathways, identify previously unknown members of documented pathways, or identify smaller clusters of interacting proteins. Until we have a more complete protein-interaction set, a user who wishes to explore a particular pathway http://arep.med.harvard.edu/NetSearch needs to specify pathway starting points and ending points (such as membrane and DNA-binding proteins, respectively). This selection can be based on a known genetic interaction, a shared mutant phenotype, a shared functional classification, or signature expression profile. This is the approach we have followed in constructing the networks depicted in Fig. 4. Those networks are comprised of all highest ranking linear paths connecting the receptors and transcription factors for that pathway.
The pheromone response pathway is commonly depicted as a simple, linear transmission of the mating signal from the membrane receptor, Ste2p (for alpha-factor) or Ste3p (for a-factor), to the nuclear effectors, Ste12p and Mcm1p, via a MAPK cascade. However, mating pheromone exposure also induces other cellular processes such as those required for polarized growth, cell cycle arrest, and recovery from cell cycle arrest. Furthermore, the topology of the protein interactions required for these processes is considerably more complicated than a series of pairwise interactions. In addition to accurately depicting the MAPK cascade, our predicted pheromone response network identifies many proteins necessary to execute the coordinated processes of growth polarization and cell cycle arrest, and reflects the complex topology of the interaction network.
The complexity of these interactions are observed in large, multifunctional complexes of possibly dynamic composition. For example, products of Ste18, Ste4, Cdc42, Cdc24, Far1, Bem1, Ste20, Ste5, and other proteins are thought to constitute a complex that has numerous interactions among components, and that mediates many different cellular processes [14, 28]. The complex may coordinate mating pheromone detection with (1) cell cycle arrest via Far1p, (2) MAPK signal transduction via Ste5p and Ste20p, and (3) cell polarity via Bem1p and Far1p (among others) [29, 30].
Given that several of these networks share components of the MAPK cascade, the mechanism by which input-output specificity is maintained remains one of the most important questions in the field of molecular signal transduction. One well accepted hypothesis is that scaffolding proteins such as Ste5p and Pbs2p tether the MAPK module to the appropriate input and output components . The recent identification of numerous Ste5p analogs in yeast and mammals makes this hypothesis even more intriguing . Beyond scaffolding proteins, higher-order protein complexes have been hypothesized to play a role in maintaining signal specificity . Our computational results suggest that this may indeed be the case. When comparing the minimal pathways for pheromone response and filamentation as depicted in Fig. 1, it appears that maintaining signal specificity would be a considerable challenge. But when comparing the two network predictions depicted in Fig. 4, one notes many differences, all of which may help ensure specificity. The network perspective suggests not a single scaffolding protein, but many scaffolding proteins – in fact, a "scaffolding network." The possibility exists that relatively nonspecific kinases function simply as "phosphorylation modules," operating inside insulating networks that are the primary determinant of signaling specificity.
Because our protein-protein interaction data is only a small fraction of a truly complete interaction map, one finds portions of a network that cannot be connected using available protein-protein interaction data. This was the case in our attempts to model the HOG network. While NetSearch correctly identified the upstream elements of this pathway (Sln1p → Ypd1p → Ssk1p → Ssk22p), it was unable to form any connections to Pbs2p or Hog1p that ended in a transcription factor. In some cases, a missing interaction can be circumvented, however. In the model for the pheromone response network, NetSearch inserted Akr1p, a known inhibitor of the pheromone pathway , between Ste3p and the G protein complex (Ste4p/Ste18p/Gpa1p). Although the protein-interaction dataset we used contained no direct interaction between Ste2p/Ste3p and Ste4p, Ste2p-Ste4p has been shown to interact in a targeted yeast two-hybrid study .
Our failure to model the HOG pathway underscores the fact that, for the purposes of this algorithm, missing interactions (false-negatives) are a more significant obstacle than are false-positive interactions. Missing interactions cannot be "created" by the algorithm, but false-positive interactions are de-emphasized as a result of the bias imposed by ranking paths according to the similarity of expression profiles. Bearing this out, of the fifty-eight proteins included in our networks, only Smd3p seems to be included as a result of false-positive interactions. (This is distinct from the case of Fus1p, which may be misplaced in the filamentation pathway, but whose interactions with Act1p and Ste7p are real.) This highlights a general observation on the integration of genomic technologies. Two-hybrid and microarray expression studies are both known to have a sizable fraction of systematic errors (for instance, self-activators in two-hybrid experiments, and cross-hybridization in microarrays), but when looking at the intersection of the two, the true signals tend to reinforce one another, whereas the systematic errors in the two tend to be different and are reduced further into the noise. These effects may help explain why we observe so few false-positive proteins inserted into our predicted networks.
In addition to using more complete interaction datasets, such as those found in Ho  and Gavin , one could improve this approach by integrating more types of data. Homology modelling could be used to differentially weight the inclusion of molecules likely to be involved in signal transduction (e.g. kinases), and genetic interactions could weight the inclusion of the two proteins in the same path. Signaling motif identification  and data from protein kinase chips  could also easily be incorporated into this framework. Based on the interaction data available, the networks depicted in Fig. 4 are static, with all interactions given equal weight, and without information on the direction of information transfer. In reality, signaling networks are dynamic and vectorial complexes, with interactions of varying strengths among component proteins . The technology necessary to generate data which will allow modelling of these network properties are beginning to emerge. Kinase chips  will allow one to incorporate information about the direction of information flow. The strength of protein interactions (with DNA) has been measured on chips in a highly parallel manner  and the same could be done for protein-protein interactions . Data on the spatial and temporal co-localization of signaling components is being generated by new imaging techniques , which will yield insight into the mechanism with which the cellular response to a signal is modulated by the intensity and the duration of the signal , and the interplay with parallel pathways.
The approach we have presented allows one to query the intersection of two enormous sets of functional genomic-derived molecular data. One can, in effect, simultaneously browse protein-protein interaction and gene expression data. It allows one to extract a group of highly-connected, highly-correlated proteins from global data to isolate a sub-network of particular interest. Significantly, this approach does not require prior knowledge of pathway intermediates. The interaction data determines the pathways that are considered, and gene expression data is used to rank the pathways. Although we have focused on signaling pathways, this approach should be applicable to modelling the relationships among any group of interacting proteins that cooperate to perform a given function within a cell, and the web-version of the software allows for these queries. As many genomic techniques are generating increasingly large amounts of molecular data, new tools such as this will be required for the synthesis of "parts into pathways" in order that we may understand how cells regulate the many processes necessary for growth and development.
Supplementary website – http://arep.med.harvard.edu/NetSearch Web interface for NetSearch: http://arep.med.harvard.edu/NetSearch/runprog.html
We thank Lisa Pacella, Aimee Dudley and Vasudeo Badarinarayana for excellent advice and assistance and all members of the Church and Winston labs for helpful discussion. We also thank the Lipper Foundation, ONR, NSF and DOE grant DE-FG02-87ER60565.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.