Beyond microarrays: Finding key transcription factors controlling signal transduction pathways

Background Massive gene expression changes in different cellular states measured by microarrays, in fact, reflect just an "echo" of real molecular processes in the cells. Transcription factors constitute a class of the regulatory molecules that typically require posttranscriptional modifications or ligand binding in order to exert their function. Therefore, such important functional changes of transcription factors are not directly visible in the microarray experiments. Results We developed a novel approach to find key transcription factors that may explain concerted expression changes of specific components of the signal transduction network. The approach aims at revealing evidence of positive feedback loops in the signal transduction circuits through activation of pathway-specific transcription factors. We demonstrate that promoters of genes encoding components of many known signal transduction pathways are enriched by binding sites of those transcription factors that are endpoints of the considered pathways. Application of the approach to the microarray gene expression data on TNF-alpha stimulated primary human endothelial cells helped to reveal novel key transcription factors potentially involved in the regulation of the signal transduction pathways of the cells. Conclusion We developed a novel computational approach for revealing key transcription factors by knowledge-based analysis of gene expression data with the help of databases on gene regulatory networks (TRANSFAC® and TRANSPATH®). The corresponding software and databases are available at .

(page number not for citation purposes) tional regulators. Gene expression changes in different cellular states measured by microarrays, in fact, reflect just an "echo" of real molecular processes in the cells. A way to facilitate data interpretation is to construct gene regulatory networks that include signal transduction mediators, transcriptional regulators and target genes. This is a complex task, not only because of the huge number of molecules involved, but also because of variations across tissues, developmental stages and physiological conditions. However, these networks hold the key to the understanding of the regulatory processes within a cell and, thus, to the majority of life processes in general.
Changes of expression of genes encoding transcription factors (TFs), a class of key regulatory molecules, are often hard to reveal to be significantly up-or downregulated in microarray experiments since their expression changes are small and their activity is mainly regulated on the posttranscriptional level. Analysis of promoters of coexpressed genes can provide one source of evidences on involvement of certain TFs in the regulation of the genes. Several computational approaches have been developed in the past few years in order to reveal potential binding sites in the promoter regions of co-expressed genes. They applied various techniques ranging between simple pattern search and complex models such as HMMs (Hidden Markov Models). The most widely used method is based on positional weight matrices (PWMs) that are constructed from collections of known binding sites for given TF or TF family. One of the largest collections of TF binding sites (TFBS) and corresponding PWMs is the TRANS-FAC ® database [1]. The PWM approach was applied intensively in the last years for the analysis of regulatory regions of many different functional classes of genes, for instance, globin genes [2], muscle-and liver-specific genes [3,4], and cell cycle-dependent genes [5]. In recent approaches, in order to improve the site prediction quality, different authors have searched for combinations of TFBS -cis-regulatory modules [6][7][8][9][10] and have applied comparative genomics approaches [11][12][13]. Despite these efforts, understanding the full complexity of the gene regulatory regions remains a great challenge and it is still rather problematic to identify transcription factors involved in the regulation of genes under any particular cellular condition based on the promoter analysis alone.
Another source of evidences on the key role of transcription factors in regulating cellular regulatory processes comes from analysis of signal transduction pathways. Multiple signal transduction pathways of a cell transduce extracellular signals from receptors at the cellular membrane to the transcription factors in the nucleus where they regulate the transcription of genes. There are several databases that collect information about signal transduction pathways in different cells. Among them, the TRANS-PATH ® database [14] stores a large body of information on signaling pathways allowing computational search through the graph of signaling reactions. One aim of such searches is to find the key transcription factors that mediate the concerted changes in expression of specific components of the signal transduction network.
In this paper we report an attempt to integrate the two complementary approaches for identification of key TFs: 1) analysis of promoters of co-expressed genes and 2) analysis of networks of the differentially expressed components of the signal transduction pathways. We have developed two computational tools: F-Match™ for revealing over-and underrepresented sites of promoters and ArrayAnalyzer™ for identification of key nodes in signal transduction networks. The developed integrated approach aims to reveal multiple evidences of positive feedback loops in signal transduction circuits through activation of pathway-specific transcription factors.
We demonstrated that promoters of genes encoding components of many known signal transduction pathways are enriched by binding sites of those transcription factors that are end-points of the considered pathways. Application of the approach to the microarray gene expression data on TNF-alpha stimulated primary human endothelial cells [15] helped to reveal novel key transcription factors that are potentially involved in the regulation of the signal transduction pathways of these cells.

TFBS in promoters of genes belonging to the same pathway
In order to evaluate the approach we applied the promoter analysis algorithm (see Methods) to the promoters of genes encoding various known signal transduction pathways. Promoter sequences were extracted from the TRANSPRO ® database [16]. We were also testing a hypothesis that genes encoding components of the same signaling pathway are co-regulated by the "target transcription factors", i.e. TFs that receive the signal through the pathway in focus and, through that, provide multiple positive feedback loops in the pathway. We made this analysis for pathways and chains collected in the TRANSPATH ® database. Chains represent experimentally proven sequences of reactions and are utilized in the pathway search algorithm using nonlinear cost function ( Figure 1).
We applied the Match™ tool [17] with default parameters (vertebrate matrices, minSUM cut-offs) and searched for potential TFBS in promoters of the genes encoding components of signal transduction pathways (see Methods). After that, we applied F-Match to compare site frequency in promoters of each pathway compared to the promoters of all other pathways. We found overrepresented sites for target TFs in 13 pathways and 23 chains (see Table 1). In the cases where the chains constitute a part of the same pathway, the sites were grouped together. Only 7 pathways and 2 chains exhibited underrepresented sites for their target TFs.
We exemplarily show the results for two pathways (see Figure 2 and Figure 3). Sites for AP-1 factors have been found being over-represented in promoters of genes encoding components of JNK pathway. AP-1 sites were found in promoters of several genes encoding different variants of JNK kinases as well as other molecules upstream and downstream in the pathway leading to the activation and repression of the AP-1 transcription factors (see Figure 2). Such complex organization of the regulatory circuits clearly provides mechanisms for the autoregulation of the pathway. Evidence for direct autoregulation was found also for the pathways and networks activating the following TFs: p53, HIF1-alpha, p300, SRF, c-jun, NF-AT and DP1.
In the case of T-cell antigen receptor pathway sites for transcription factors Ets/Elk-1 are overrepresented in the promoters of the genes of this pathway. Most of the hits are located in the "upper" part of the pathway, e.g. molecules located in the pathway several steps upstream from the target transcription factors at the "bottom" (see Figure  3). We speculate that such organization of feedback loops from regulated TFs to the topmost parts of the pathway can provide optimal autoregulation, and enhances the specificity of the transduced signal.
In the extreme case of the prolactin pathway, more than half of the genes in the pathway are probably regulated by the target TF FKHRL, including both the membrane receptor molecule and the target TFs STAT5A and STAT5B (see Figure 4).
Since the PWMs used for identification of TFBS are relatively short, the chance that they can match with any DNA sequence is very high. The essence of our method is therefore to estimate the statistical significance of over-or under-representation of the number of matches in a given set of promoters (see Method section) as an indicator of potential functional involvement of TFs in the regulation of the corresponding genes. Here, we did an evaluation of the statistical significance of overrepresentation of Elk-1 sites found in the promoters of T-cell antigen receptor pathway. Two datasets were built collecting human promoter sequences at random from the TRANSPRO ® database, one with the same number as in the T-cell antigen receptor pathway (65 promoters), and another with as many promoters as in the control set (990 promoters). Elk-1 sites were predicted in both sets using the Match™ program, and their frequencies compared using F-Match. This simulation experiment was repeated 100 times, and a distribution of the ratio of Elk-1 site frequency in the query vs. control set was generated (see Figure 5). The average ratio of this random promoters distribution is 1.28 with a standard deviation of 0.21. Whereas, the ratio observed in the real set of promoters of T-cell antigen receptor pathway is 2.1 (we found Elk-1 sites in 13 promoters, but random expectation is 6.2 promoters), which is 3.9 SD units away from the random expectation (see Figure 5).

Identification of key node TFs in the TRANSPATH ® pathways
To evaluate whether the "pathway persistence" parameter h can help in identifying key TFs, we applied the Pathway Persistence algorithm described in the Methods section to the T-cell antigen receptor pathway as it is presented in the TRANSPATH ® database (see Figure 3). As it can be seen from the diagram, 6 target transcription factors of this pathway are AP-1 (c-Jun), NF-AT, NF-kappaB (heterodimer of p50 and RelA), IkappaB and Elk-1. Out of 51 molecules belonging to this pathway we took 13 molecules whose gene promoters contain the Elk-1 binding sites (see Figure 3, molecules marked by red rectangles) identified at the previous step of analysis. We applied the key node search algorithm starting from these 13 molecules and searched downstream with parameters of the search: dmax = 4, α = (1/2) 12 ; h = 0.9. The 7 top scoring transcription factors were: (E2F, IkappaB, NF-kappaB, PPAR-gamma, c-Jun, Egr-1, RXR-alpha). As one can see, NF-kappaB and components of AP-1 were among the top hits. Figure 6 shows the automatically generated output for the search of key TFs of the pathway.

Pathway in TRANSPATH ® TF (family) TRANSFAC ® PWM acc p-value
Overrepresented sites in pathways In order to evaluate the Pathway Persistence algorithm we made a series of random simulation experiments of the pathways analysis varying the parameters dmax (3,4) and h (0.0, 0.6, 0.9, 1.0). For each pair of dmax and h, we randomly selected 10-times 13 molecules belonging to the Tcell antigen receptor pathway and performed the downstream search for the reachable TFs. We computed an average sensitivity (Se) and specificity (Sp) of the search by counting the average number of true target TFs (N) that were reached in such downstream search (out of the maximal number N max = 6), and by counting the average number of other TFs (M) that were also reached in the same search. Then: The results of the simulation experiments are shown on Figure 7. One can see that the use of the high values of persistence parameter h allows reaching sensitivity levels above 50% without significant decrease of specificity.

Analysis of TNF-alpha gene expression data
We applied the F-Match method (see Methods) to analyze promoters of differentially expressed genes in primary human endothelial cells upon TNF-alpha stimulation [15]. We compared promoters of genes induced or repressed by at least a factor of 2 (fold change, FC > 2.0) with the promoters of genes whose expression did not change upon this treatment. In this analysis, we were able to identify several PWMs whose hits are significantly overrepresented or under-represented in the promoters in focus (see Table 2). We also compared results of identification of over-and under-represented sites taking differ- With the red rectangle we marked molecules in whose gene promoters we can identify AP-1 sites -sites for the "target" TFs of the pathway. As one can see, such TF sites can provide fine regulation of the pathway activity through numerous feedback loops.
ent cut-offs of the expression fold change (FC>2.5 and FC > 3.0, see Table 2). We observed that most of the matrices identified at the FC cut-off 2.0, were also identified under the higher cut-offs and often the over-representation ratio was increasing, which clearly indicates the correlation between the gene expression change and the frequency of the corresponding sites in promoters. Some matrices have disappeared from the list of the significantly over-or under-represented sites (e.g., V$IRF_Q6 and V$MAZ_Q6) and several new matrices have appeared under the higher FC cut-offs (V$TBX5_02, V$HNF1_Q6, V$HFH3_01, V$HNF3B_01, V$CEBPGAMMA_Q6). Such matrices seem to be specific for a subgroup of promoters of the differentially expressed genes. We focus our further attention on the matrices that were identified under several FC cutoffs with increased stringency.
At the second step of the analysis we mapped all differentially expressed genes (with expression change higher than 2 times) onto the TRANSPATH ® database and extracted 129 signaling molecules (44 up-regulated and 85 downregulated). Mapped molecules include all components of signaling pathways, such as receptors, their ligands, adaptor proteins, kinases of various levels as well as TFs. We applied the pathway analysis algorithm described in the Method section (downstream search with parameters: dmax = 4, α = (1/2) 12 ; h = 0.9) and obtained a list of key transcription factors (see Table 3) characterized with the maximal score s i .
By comparison of the data in the Table 2 and Table 3 we can reveal TFs most probably involved in regulation of differential gene expression upon TNF-alpha stimulation. Among them are, e.g., NF-kappaB (and its inhibitor, Ikap-paB) and the components of AP-1 (c-Jun and c-Fos), the known targets of TNF signaling. It is interesting to observe that expression of genes encoding these two TFs has been changed in this experiment. For other factors such as FOXO, CREB, Elk-1 (ETS-domain), and E2F any significant change of their expression was not observed, but following the two independent sources of evidences, promoter analysis and pathway analysis, we conclude that they may play an important role in the regulation of gene expression in the considered system.

Discussion
The approach demonstrated here allows identifying key TFs involved in the regulation of signaling pathways in the cell. The approach combines two algorithms, one is based on the analysis of promoter sequences and another on the analysis of the signal transduction pathway connectivity graph. As we demonstrated here, these two independent algorithms can often produce rather complementary or even similar results, pointing to the most important TFs involved in the regulation of the considered molecular processes.
The complementarity of the results of these two algorithms may suggest interesting dynamics of the considered pathways such as for the T-cell antigen receptor pathway analyzed above. First, we found an overrepresentation of sites for the ETS domain factor Elk-1 in the promoters of genes belonging to the pathway. Next, starting from these genes and searching downstream in the signal transduction network we identified other target TFs: NF-kappaB, AP-1 and NF-AT. So, we can speculate, that first signals coming through this pathway activate the Elk-1 Prolactin pathway Figure 4 Prolactin pathway. Numerous sites for the "target" transcription factor FKHRL1 can be found in promoters of many genes encoding components of this pathway.
factors, which in turn activates expression of many genes encoding components of the same pathway therefore enhancing the signal flow through this pathway. The amplified signals then activate other target transcription factors switching on the appropriate response in the cell.
A systematic analysis of many pathways reveals that sites for a total of 16 different TFs are overrepresented in the 36 pathways and chains, implying that different pathways share the same target TFs. We think that the fine-tune regulation of genes in overlapping or interconnected pathways might be achieved through the cooperative simultaneous regulation of many TFs, suggesting that a composite module analysis will further explain the differential regulation of these intersecting pathways.

Conclusion
We developed a novel computational approach for revealing key transcription factors by knowledge-based analysis of gene expression data with the help of databases on gene transcription regulation and signal transduction networks, TRANSFAC ® and TRANSPATH ® . We demonstrated that promoters of genes encoding components of many known signal transduction pathways are enriched by binding sites for transcription factors that are endpoints of the considered pathways. The corresponding software and databases are available at http://www.gene-regula tion.com.

Microarray data
We have analyzed microarray gene expression data on TNF-alpha stimulation of primary human endothelial cells (HUVEC) [15]. Data have been taken from GEO (GSE2639). Gene expression profiles were measured in HUVEC stimulated for 5 hours with TNF and in untreated HUVEC using the Affymetrix ® GeneChip ® Human Genome U133A array [15]. Four biological replicates were used for each condition. We applied the criteria of an at least 2.0-fold change in gene expression levels and p-value revealed by t-test of less than 0.05. Based on these criteria, the expression of 121 transcripts was increased after TNFalpha treatment and 214 transcripts showed decreased expression.

Promoter sequences
Promoter sequences were taken from TRANSPRO ® [16], a database based on the international Genomic Sequence Builds for H. sapiens, M. musculus and R. norvegicus. We have collected information about TSSs from several databases, EPD [18], dbTSS [19] and Ensembl [20]. Collected TSSs are frequently not located in tight clusters, but widespread throughout the sequence. An algorithm applying a set of rules was designed to find 'clusters' of TSSs. A 'clustering score' was calculated by adding up contributions from each TSS in a sliding window, weighting the different reliability of TSSs from different sources and the distance of data points from the central window position. The peaks of the resulting clustering score function were regarded as potential 'virtual TSSs' for windows with enough evidence points. In cases of genes with only few data points the 5' most data point was used as 'virtual TSS'. This collection of 'virtual TSSs' is the basis for the extraction of promoter sequences.

Association between promoters and signal transduction pathways
TRANSPATH ® [14] is a manually curated database focusing on signal transduction pathways. Relevant elements such as ligands, receptors, enzymes as well as TFs are stored along with information about interactions, modifications, and other reactions they undergo. Chains are sequences of reactions described in at least one paper, and several chains may constitute a pathway. There are 24441 promoters for 14283 human genes documented in TRANSPRO ® 3.1, of them 1049 genes are associated with a pathway or reaction chain described in TRANSPATH ® 7.1 [14]. These 1049 promoter regions -1000 to +100 relative to TSSs calculated in TRANSPATH ® were taken as a primary data source. The promoters are associated with Evaluation of significance of Elk-1 sites overrepresentation Figure 5 Evaluation of significance of Elk-1 sites overrepresentation. Histogram of the ratio between Elk-1 site frequency in two sets of promoters randomly chosen from TRANSPRO ® database. (100 times: 65 promoters in one set and 990 promoters in another set). The site frequency ratio 2.1 observed in the real set of promoters from T-cell antigen receptor pathway is 3.9 SD away from the mean of the distribution.
Real dataset ratio 2.1 360 pathways or chains, provided that at least 5 promoters were associated with each pathway.

Identification of putative TFBS
TRANSFAC ® 10.1 provides, among other information, more than 8000 TFs, of them 1555 human TFs, as well as a library of 795 PWMs including 569 matrices for vertebrate TFs [1].
The program Match™ [17] was used to identify putative binding sites in the promoters of each pathway or chain, using the vertebrate PWMs. A built-in matrix profile with

F-Match: Identification of over-and under-represented sites in promoter sequences
In order to identify TFBS overrepresented in the promoter sets, a control set was derived for each pathway or reaction chain; the complete data source was taken excluding only the promoters of the genes in the pathway being investigated. The Match™ program was used to identify TF bind-ing sites in this control collection and the number of sites was taken as a background to compare with the number of sites found in the promoters from the pathway in question.
We have developed a new program, F-Match, which compares the number of sites found in a query sequence set against the control set. We assume, if a certain TF (or factor family), alone or as a part of a cis-regulatory module, plays a significant role in the regulation of the considered set of promoters, then the frequency of the corresponding sites found in these sequences should be significantly higher than expected by random chance. Often, the stringency of the interaction of this TF with their target sequences in the considered promoters is not known leading to the uncertainty in setting thresholds on the site searches using the Match™ program.
F-Match carefully evaluates the set of promoters and for each matrix tries to find two thresholds: one, th-max, which provides maximum ratio between the frequency of matches in the promoters in focus (set A) and background promoters (set B) (over-represented sites); and the second threshold, th-min, that minimizes the same ratio (underrepresented sites).
In order to find these two thresholds, first, we apply the Match™ program to both sets of promoters using the lowest PWM thresholds corresponding to the minimum of false negative rates (minFN, see [17] for details). As a result, for each PWM we obtain a set of predicted K sites and M sites in the promoter sets A and B with the corresponding matrix scores ms: ms A,1 , ms A,2 , ... ms A,K ; The results of the simulation experiments Figure 7 The results of the simulation experiments. Dependence of the sensitivity and specificity levels on the key node search parameters: d-distance, and h-pathway persistence.
Usage of high values of persistence parameter h allows reaching sensitivity levels above 50% without significant decrease of specificity. The F-Match program makes an exhaustive search through the space of all scores observed in the sequence sets. Each observed score is taken as a threshold th and the program computes the number of sites k found in the promoter set A (ms A,j ≥ th/j = 1,K) and number of sites m found in the promoter set B (ms B,j ≥ th/j = 1,M).
Then, the expected number of sites in the set A to be observed in the case of even distribution of sites between two sets will be: and assuming a binomial distribution of the sites between two sets we can calculate the p-value of finding the observed number of sites and higher, for over-represented matches, or lower, in the case of under-represented matches giving the p-value of over-and under-representation of matches in the promoter set A.
For a given significance level ξ (e.g. ξ = 0.001) the F-Match finds such thresholds th-max and th-min that maximizes and minimizes, respectively, the ratio k/k exp provided that the p-value <ξ. If the required significance level cannot be  reached for a given matrix, then the program reports that no optimal threshold can be found.

Identification of key nodes in signal transduction networks
To understand the mechanisms of gene expression, microarray data should be analysed in the context of complex regulatory networks of a cell. Through such networks, different signals can converge to few key TFs involved in regulation of the large gene sets. We have developed a tool, integrated in the ExPlain™ computer system, that allows fast search for key transcription factors regulated by a signal transduction network. A network is defined as weighted graph G = (V, E, C), where V = genes ∪ molecules is the set of vertices, E = reactions are the edges and C : E → R + ∪{0} is the cost function that defines a non-negative value for every edge. In the simplest variant of the algorithm, the initial values of the cost functions for each direct reaction are taken = 1.0. So, the cost of any path through the consequent reactions will be equal to the number of reactions.
Dijkstra's shortest-path algorithm is the core of this approach. The algorithm of downstream search starts from each molecule of the set of molecules in focus (subset V x of V) (e.g. signalling molecules whose expression has changed in the microarray experiment) and construct the shortest-path to all nodes i of V being within a given radius of dmax. After evaluating all nodes of V x the algorithm calculates the number of visits N i for each node i of V. This corresponds to the number of molecules in focus that can transfer the signal to the node i within dmax number of steps. The values N i could already be used as a ranking score for potential key nodes, since N i corresponds to the number of differentially expressed molecules from the initial list V x that can transfer the signal to the node i. In the case that node i represents a transcription factor and is characterized by a high N i , then this transcription factor seems the most probable target and key regulator of the signal transduction system under study.
In order to calculate the "specificity score" we use the formula: where M i is the number of all other molecules in the whole network, which can reach node i within dmax and which are considered as "non-relevant nodes". By setting different penalty levels (α = [0,1]) user has an opportunity to adjust the balance between true positives and false positives while searching for the key nodes. In the Web implementation of the algorithm we offer 21 penalty levels that correspond to: The values M i are pre-calculated for every node and radius.

Pathway Persistence algorithm
In addition to the information about individual reactions TRANSPATH ® contains information about pathways and chains of consecutive reactions known to take place in certain cellular conditions. Any pathway P is defined by a graph G P = (V P , E P , C) which is a sub-graph of the graph G.
We use the information about chains and pathways to improve the accuracy of key node prediction, especially to diminish the false positive error. While searching, the priority is given to the potential paths that utilize annotated chains of reactions. Still, predictions beyond the known paths are allowed but with low priority.
Application of pathway information into a certain graph can be transparently modeled by introduction of additional transitive edges with specific costs, which results in a new graph G'. The pathway information is fully represented by these additional edges. The main Dijkstra algorithm remains unchanged in this model. The additional edges for one pathway P are generated as follows: Let S Pij be the graph of the shortest paths between i,j ∈ V P within the graph (e.g. pathway) P. Further, let C Pij be the cost of the shortest path with the exceptional case where C Pij = ∞ if S Pij = ∅. We combine G P and G yielding G' = (V,E',C') simply by introducing new edges E' = E∪{(i,j) | S Pij ≠ ∅} and by extending the cost function for them by C' = {f(C Pij ) | f(C Pij )≤C ij }∪{C ij |f(C Pij )>C ij }. As function f we use f(x) = x (1-h) , with h∈ [0,1]. The aim of f is to make the cost function of a new edge (i, j) sublinearly dependent on the length of the corresponding shortest path S Pij within each pathway. As a result, with increasing h the shortestpath search is progressively pushed to stay within known pathways (see Figure 1). This affects the search of the key nodes. The effect is maximal if h = 1 and absent if h = 0. Therefore we call h "pathway persistence" which allows the user to select the proper balance between making speculative predictions or staying within known pathways only. This approach is inspired by Higher-Order-Markov-Chains, where subsequent past states define the probability of the future states of a stochastic system. While the search for common key nodes is already easily possible by application of Dijkstra to G', the connection details between a certain identified key node and starting nodes are not yet obvious, since they are established in G', not in G. Therefore, we implemented a sophisticated dynamic cost function which still can be used by Dijkstra, and which mimics the existence of edges E' on-the-fly, while the shortest-paths themselves consist of edges from E only.