TF-finder: A software package for identifying transcription factors involved in biological processes using microarray data and existing knowledge base

Cui, Xiaoqi; Wang, Tong; Chen, Huann-Sheng; Busov, Victor; Wei, Hairong

doi:10.1186/1471-2105-11-425

Methodology article
Open access
Published: 12 August 2010

BMC Bioinformatics volume 11, Article number: 425 (2010)


Abstract

Background

Identification of transcription factors (TFs) involved in a biological process is the first step towards a better understanding of the underlying regulatory mechanisms. However, due to the involvement of a large number of genes and complicated interactions in a gene regulatory network (GRN), identification of the TFs involved in a biological process remains very challenging. In reality, the recognition of TFs for a given biological process can be further complicated by the fact that most eukaryotic genomes encode thousands of TFs, which are organized in gene families of various sizes and in many cases with poor sequence conservation except for small conserved domains. This poses a significant challenge for identifying the exact TFs involved or ranking the importance of a set of TFs to a process of interest. Therefore, new methods for recognizing novel TFs are urgently needed. Although a plethora of methods have been developed to infer regulatory genes using microarray data, it is still rare to find methods that use an existing knowledge base, in particular the validated genes known to be involved in a process, to bait or guide the discovery of novel TFs. Such methods can replace the sometimes-arbitrary selection of candidate genes for experimental validation and significantly advance our knowledge and understanding of the regulation of a process.

Results

We developed an automated software package called TF-finder for recognizing TFs involved in a biological process using microarray data and an existing knowledge base. TF-finder contains two components for TF recognition: adaptive sparse canonical correlation analysis (ASCCA) and an enrichment test. ASCCA uses positive target genes to bait TFs from gene expression data, while the enrichment test examines the presence of positive TFs in the outcomes from ASCCA. Using microarray data from salt and water stress experiments, we showed that TF-finder is very efficient in recognizing many important TFs involved in salt and drought tolerance, as evidenced by the rediscovery of TFs that have been experimentally validated. The efficiency of TF-finder in recognizing novel TFs was further confirmed by a thorough comparison with a method called Intersection of Coexpression (ICE).

Conclusions

TF-finder can be successfully used to infer novel TFs involved in a biological process of interest using publicly available gene expression data and known positive genes from existing knowledge bases. The package for TF-finder includes an R script for ASCCA, a Perl controller, and several Perl scripts for parsing intermediate outputs. The package is available upon request (hairong@mtu.edu). The R code for standalone ASCCA is also available.

Background

Whole-system approaches employing data derived from microarray and high-throughput sequencing technologies require the development of new methods for discovering novel knowledge in large-scale data sets. The generation of spatially or temporally interactive transcriptome profiles in a multicellular organism is still challenging and expensive. Therefore, methods that can analyze already existing data are urgently needed.

Crop varieties for sustainable biomass production and adaptation to multiple environmental stresses are needed to meet climatic and environmental challenges, and fulfil the world's bioenergy needs. Development of such varieties requires in-depth knowledge of the regulators that play key roles in abiotic stress tolerance and adaptive growth. Understanding the underpinning regulatory mechanisms would enable development of viable solutions to modify plants with augmented stress tolerance and allow sustainable production on marginal lands. Traditional experimental approaches based on candidate genes suffer from biased, subjective selection of gene sets. Thus, modifications of these genes often have little or no impact on the targeted trait and/or in many cases have severe pleiotropic effects that compromise their commercial deployment. For example, over-expression of DREB1A and ADR1 results in severely stunted growth [1], and the expression of AtNHX1 negatively impacts many cellular processes including protein transport and modification [2]. It is becoming increasingly clear that only systems-based approaches providing thorough knowledge of the intricate genetic networks can provide solutions to these problems and lead to successful translation of biological knowledge into downstream commercial applications [3]. Although our knowledge is incomplete, it has been shown that gene expression is often regulated in a combinatorial manner [4], indicative of the underlying genetic network interactions. Development of methods that can capture these synergistic regulations will provide new insights into the regulatory mechanisms underpinning many biological processes.

Canonical Correlation Analysis (CCA) is a common means to simultaneously analyze the relationships between two sets of variables. However, when applied to large-scale microarray data sets, where the number of genes (variables) greatly exceeds the number of samples, CCA has two major shortcomings: (1) it causes computational problems and inaccurate estimates of parameters; (2) it leads to linear combinations of the entire sets of available variables, which may lack biological plausibility and interpretability. To overcome these problems, sparse canonical correlation analysis (SCCA) was recently proposed [5, 6]. SCCA, an extension of CCA, can find the maximally correlated relationship between two sets of variables by determining the linear combinations of variables from each set. SCCA provides sparse loadings in the linear combinations and thus results in smaller groups of variables, which can aid biological interpretability. To further reduce the bias in model selection and the number of selected variables, adaptive SCCA (ASCCA) has recently been proposed [5]. ASCCA outperforms SCCA by selecting the correct subset of variables for better discovery of the most plausible model. In addition, ASCCA produces fewer noise variables than SCCA. In this paper, we developed a package, TF-finder, that takes advantage of ASCCA to identify TFs involved in a process of interest. As a test case, we used TF-finder to identify TFs involved in stress tolerance and adaptive growth. We demonstrated that TF-finder produced interpretable and biologically meaningful data.

We also compared TF-finder with a closely related method, Intersection of Coexpression (ICE) [7], which evaluates a gene from a candidate pool based on how many genes in a positive gene set it is significantly coexpressed with. We implemented ICE in such a way that the expression data of all TFs were used for identifying novel TFs that are assumed to be involved in the same biological process as the positive TFs used as the positive gene set. The comparison showed that TF-finder outperforms ICE in finding novel positive TFs. Novel positive TFs in this study are defined as newly identified genes that do not belong to the positive TFs used as guide genes but are supported by present knowledge as being involved in the same biological process.

Results

We used TF-finder to identify candidate regulatory genes that are involved in salt and drought stress tolerance as well as the adaptive growth under these conditions.

Identification of salt stress response and tolerance regulators

We applied ASCCA to 109 microarray data sets collected from seven salt stress microarray experiments. The input files contain the expression profiles of 159 positive target genes (non-TFs, Additional file 1) that are known to be involved in salt response and tolerance, 1638 Arabidopsis TFs present in Affymetrix ATH1 array, and 13 TFs (AT1G01520, AT2G40950-BZIP17, AT5G39610-ATNAC2, AT5G67450-AZF1, AT3G19580, AT1G52890-ANAC019, AT1G35515-HOS10, AT2G47190-MYB2, AT2G27300-NTL8, AT3G55980-SZF1, AT2G30250-WRKY25, AT2G38470-WRKY33, AT4G28110-MYB41) (Additional File 1) known to be involved in salt response and tolerance. The cluster analysis of the 159 target genes resulted in about 800 clusters that were used to hook TFs in a recursive manner. All TFs identified through this procedure were pooled for frequency calculation. The top 70 genes with highest occurrence frequencies are shown in Additional file 2. Among these genes, 17 TFs were clearly supported by existing evidence to be involved in salt response and tolerance (Additional file 2). For example, WRKY33, AZF2, and NATAC6 were among the list of 13 TFs used as guide genes (Additional file 1). Although the other 14 were all novel, indirect evidence suggests that they are likely involved in this stress response. For instance, CZF1, also known as SZF2, is the most homologous gene to SZF1, and it regulates salt stress responses in Arabidopsis[8]. ZAT6 is the most homologous gene to STZ (salt tolerance zinc finger) in Arabidopsis. RHL41 (also called ZAT12) is involved in hyperosmotic salinity response [9]. ANAC055 has been found to bind to the early responsive to dehydration (ERD1) stress gene promoter, and over-expression of this gene, together with ANAC019 and ANAC072, causes the expression of several stress-inducible genes that enhance drought tolerance [10]. Over-expression of SZF1 (Salt-inducible zinc finger 1) in transgenic plants caused reduced induction of salt responsive genes and increased tolerance to salt [8]. 
STZ (salt tolerance zinc finger) was found to increase salt tolerance of calcineurin mutants of wild-type yeast, which appears to be partially dependent on ENA1/PMR2, a P-type ATPase required for Li⁺ and Na⁺ efflux. ATAF1 is responsive to wounding and ABA. DREB2A and DREB2B (DRE/CRT-binding protein) are induced upon dehydration and high salinity [11]. ATMYC2 is a positive regulator of ABA signalling. MYBR1 is ABA-regulated and participates in mediating ABA effects [12]. CBF1 functions as a transcriptional activator that binds to the C-repeat/DRE DNA regulatory element in response to low temperature and water deficit [13]. Although CBF1 mainly responds to chilling, the expression of CBF1 also confers salt stress tolerance [14]. BZIP28, an ER-resident TF, serves as a sensor/transducer in Arabidopsis to mediate ER stress responses [15].

Identification of adaptive growth regulators under salt condition

We also used TF-finder to identify TFs controlling growth under the same stress condition. We used the expression profiles of 74 positive target genes that are involved in growth, 10 positive TFs (Additional file 1), and 1640 TFs. 10 positive TFs include AT5G02470-DPA, AT4G16110-ARR2, AT3G13960-GRF5, AT5G53660-GRF7, AT2G16720-MYB7, AT3G49690-MYB84, AT5G20730-NPH4, AT1G13260-RAV1, AT2G33880-HB3, and AT1G32640-MYC2. The resulting top 70 candidate TFs are shown in Additional file 2. Among these genes, 26 TFs have regulatory functions in growth. Three NAC domain-containing TFs: AT3G61910-ANAC066, AT1G60280-ANAC023, and AT5G04400-ANAC077, have been shown to be involved in the differentiation and expansion of petals, stamen, and roots [16–18]. Three closely related basic helix-loop-helix (bHLH) proteins, AT5G53210-SPCH, AT3G06120-MUTE and AT_FAMA, have been identified as positive regulators that direct three consecutive cell-fate decisions during stomatal development [19, 20]. AT3G13960-AtGRF5 is one of the nine members of GRF gene family that contain nuclear targeting domain, and is involved in root development [21]. AT2G13570-NF-YB7 encoding LEAFY COTYLEDON1-LIKE is a regulator essential for embryo development [22, 23]. KNAT6 is expressed in roots and is required for proper lateral root formation[24]. AT4G27330-SPL plays a central role in patterning of both the proximal-distal and the adaxial-abaxial axes in the ovule and is generally involved in cell differentiation [25]. AT2G35670-FIS2 and AT1G02580-MEA are involved in seed development [26]. CAL is floral homeotic gene encoding a MADS domain protein homologous to AP1 promoting the flower to shoot transformation in ap1 mutants [27]. AT3G15170-CUC1, together with CUC2 and CUC3, are responsible for shoot organ boundary and meristem formation throughout the different stages of Arabidopsis life cycle [28, 29]. 
NUB encodes a protein with a single C(2)H(2) zinc-finger domain and is involved in the growth of lateral organs [30]. DOT5 is involved in vein patterning, but dot5-1 mutants often have shorter roots, suggesting a function in root development [31]. INO is involved in ovule development [32]. BLH8 encodes a BEL1-like protein, which was identified to play a role in shoot meristem [33] and ovule development [34]. B3 is differentially expressed in anther, and is presumably involved in anther development and differentiation [35]. LBD10 encodes a protein that functions in defining the lateral organ boundaries [36]. AT5G58080-ARR18 encodes a type B response regulator that mediates cytokinin signal transduction in Arabidopsis [37].

Identification of adaptive growth regulators under drought condition

After showing that TF-finder can be used to identify key regulators using data from salt stress experiments, we were interested in extending the performance testing of TF-finder to a different data set and biological process. We therefore used data from water stress experiments. The three input files contained the profiles of 74 genes involved in various growth processes (Additional file 1), 10 positive TFs (Additional file 1), and all 1640 TFs detected to be expressed in the water stress data set. The top 70 TFs are shown in Additional file 2, among which 21 TFs were previously implicated in the regulation of growth; one of these 21, AT2G16720-MYB7, is a re-discovered positive TF. Rather than spelling out their functions at length, we list the evidence supporting these genes as positive in Table 1.

Table 1 Identified TFs that are involved in growth and stress tolerance under drought condition.


Identification of regulatory genes from water stress data using ASCCA

To test if TF-finder can identify growth regulators from water stress data, we used three files that contained the profiles of 120 target genes, 9 positive TFs (AT3G57600, AT1G75490, AT5G05410-DREB2A, AT2G47190-MYB2, AT1G54160-NF-YA5, AT2G38880-NF-YB1, AT4G27410-RD26, AT1G69600-ZFHD1, and AT4G28110-MYB41) (Additional file 1) and all 1640 TFs detected to be expressed in the water stress data set. The resulting top 70 genes were found to contain 9 novel TFs (Additional file 2) that are supported by existing evidence to be involved in root growth under water stress condition. Again, we do not elaborate on these genes' functions here; the evidence supporting these genes as positive is shown in Table 1.

Discussion

We have developed the TF-finder package and shown that it can be used to discover TFs involved in various biological processes. The discovery efficiency varies with both the biological process and the genes used to guide the recognition process. To further evaluate the performance of TF-finder, we compared it to the ICE algorithm [7] in the identification of TFs involved in four biological processes: (1) salt tolerance, (2) growth under salt stress, (3) growth under water stress, and (4) drought tolerance. The inputs for the ICE algorithm include the transcriptome profiles of all 1640 TFs and one of the following positive TF sets: 13 TFs involved in salt tolerance, 10 TFs involved in root growth, and 9 TFs involved in water stress tolerance. The TFs identified by the two algorithms are shown in Figure 1 and are also listed in Additional file 3.

Among the top 70 TFs discovered to be regulators of salt tolerance, 43 are common between the two methods. TF-finder and ICE each identified 14 novel TFs, of which 9 were common. This seems to indicate similar efficiency of the two algorithms. However, the comparison between TF-finder and ICE in identifying growth regulators operating during both salt and water stress suggests that TF-finder outperforms ICE. Of the top 70 TFs identified as controlling growth under salt stress, only 10 TFs are common between the two methods. TF-finder and ICE identified 26 and 12 novel TFs, respectively, that were implicated as positive by the existing annotation, with only 5 in common. Of the top 70 growth TFs from water stress data, 19 TFs were common to the two methods. Similarly, 20 and 13 novel TFs were identified by TF-finder and ICE, respectively, and only 5 TFs were found by both methods. Finally, the two methods were compared in discovering TFs involved in water stress. Of the identified top 70 TFs, only 1 was common, indicating that although both are linear-based methods, TF-finder can indeed identify different TFs. In this case, 9 and 16 TFs were discovered by TF-finder and ICE, respectively, for which the existing annotation suggests involvement in the response to water stress. This indicates a better performance of ICE in identifying the genes involved in water stress response and tolerance.

The fact that the two methods can recognize different TFs is not surprising because they use different inputs and employ different mechanisms for identifying transcription regulators. Namely, TF-finder hooks TFs using positive target genes. Although both methods use positive TFs, TF-finder uses them as guide genes while ICE uses them as baits to recognize co-expressed genes. Because by design TF-finder identifies a group of TFs controlling a group of targets, it tends to capture the combinatorial nature of TFs in regulating a group of target genes. It is well known that the drought tolerance gene proline dehydrogenase in Arabidopsis [38] and GSY2 in yeast [39], as well as ABA-induced gene expression [4], are controlled by a small number of TFs in a combinatorial manner. This is mediated by the presence of the same stress-responsive cis-elements in the promoter sequences of many downstream stress-responsive genes and by far fewer TFs that regulate these genes [40, 41]. Therefore we believe that more genes should be subject to combinatorial regulation during abiotic stress response and tolerance. In contrast to TF-finder, ICE uses pair-wise correlation. It thus tends to identify very tightly coupled or co-ordinated TFs by using those that are known to be involved in the same biological process. As far as overall efficiency is concerned, in three out of four cases we examined TF-finder identified more TFs for which prior knowledge of involvement in the process of interest existed. The higher efficiency of TF-finder can be ascribed to the repeated TF recognition using clustered targets, the use of positive targets, or both.

Integration of biological and mathematical models is critically important in discovering novel biological knowledge. However, the complexity of transcriptional regulation and the lack of data from well-designed experiments impede deriving a biological model by mathematical means. Thus, using models (behaviors) of known positive TFs to discover novel TFs is instrumental. We integrated these known TFs into novel TF discovery in such a way that their enrichment is indicative of a meaningful identification.

In this study, we tried only one set of input files for each of the four cases mentioned above, namely (1) salt tolerance; (2) growth under salt stress; (3) growth under water stress; and (4) drought tolerance; even so, the number of novel TFs identified is remarkably high. In reality, multiple sets of input files can be formulated to amplify the power of the method. In this regard, the existing TF-finder package can be further improved to take multiple batches of input files and run iteratively towards more exhaustive results. We believe that such an improvement can lead to the discovery of more novel TFs. In this study, we relied on the existing literature to annotate the identified gene lists and show the efficiency of TF-finder in identifying positive TFs; however, we noticed that some highly ranked TFs were not supported by existing evidence. For this reason, we strongly believe there are more positive genes in the identified lists. In a real application, we encourage users to validate those highly ranked but functionally undefined genes by experimental means. We also tested the performance of SCCA [5] on the same data sets with the same inputs. It performed poorly in finding transcription regulators, which further confirmed the previous conclusion that ASCCA provides better noise filtering and includes fewer uninformative variables than SCCA [5].

With the availability of large volumes of gene expression data, and with more and more positive target and TF genes being validated by molecular biologists, TF-finder will no doubt have a wide variety of applications in the future. Nevertheless, TF-finder is not useful when these resources are not available, for example, for a newly sequenced species. In addition, TF-finder may not be applicable to some biological processes in which the response of target genes to TFs is slow or lagged.

Conclusions

The integration of an existing knowledge base, cluster analysis, and the ASCCA algorithm into a package (pipeline) for finding novel TFs with pooled microarray-derived expression data is viable, as evidenced by the significant number of discovered TFs. These TFs include those previously identified to be involved in mediating abiotic stress responses, indicating that the method can successfully identify TFs involved in the process of interest. In addition, the results imply that combinatorial regulation is dominant in stress response and tolerance, and can be studied through the use of standard positive target (guide) and regulatory genes (bait). Finally, the identification of so many regulatory genes in abiotic stresses is indicative of the involvement of large, complex gene networks. Computational approaches such as the one employed by TF-finder can provide insights into the backbone of these genetic networks.

Methods

The workflow of the TF-finder is shown in Figure 2. It covers four methods: k-means clustering, ASCCA, ICE and enrichment test. K-means clustering software was developed earlier [45]. Other methods and the preparation of microarray data are described below:

ASCCA (Adaptive Sparse Canonical Correlation Analysis)

Assume the expression profiles of all TFs (p TFs in total) are represented by X = {x₁, x₂, ..., x_p}, with each x_i (i = 1, ..., p) being a vector of length n (measured on n samples), and similarly, the group of target genes is represented by Y = {y₁, y₂, ..., y_q}, with each y_j (j = 1, ..., q) being a vector of length n. X and Y can therefore also be written in matrix form:

$$X = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{np} \end{pmatrix}_{n \times p}, \qquad Y = \begin{pmatrix} y_{11} & \dots & y_{1q} \\ \vdots & \ddots & \vdots \\ y_{n1} & \dots & y_{nq} \end{pmatrix}_{n \times q}$$

where each x_i (i = 1, ..., p) is a variable in the set X, and each y_j (j = 1, ..., q) is a variable in the set Y. Then, by applying ASCCA to these two data sets X and Y, we obtain a pair of weight vectors (canonical vectors) a and b, with p and q entries respectively, such that the correlation between the two linear combinations (canonical variates), Xa and Yb, is maximized. The canonical vectors a and b are sparse because many of their entries are zero, which is achieved by introducing L₁ penalties into the criterion that constrains a and b (Witten and Tibshirani, 2010). Specifically, in our study, we focus only on the first component, i.e. the first pair of canonical vectors, of the ASCCA solution. To facilitate the ASCCA implementation, Parkhomenko et al. (2009) developed an iterative algorithm, described below (we assume X and Y are standardized to have columns with zero means and unit variances).

Consider the singular vectors u and v, which are related to the canonical vectors a and b by $a = \Sigma_{XX}^{-1/2} u$ and $b = \Sigma_{YY}^{-1/2} v$, where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the variance matrices of X and Y, respectively. Given the penalization parameters λ_u, λ_v and γ, as well as the initial values u⁰ and v⁰, the singular vectors u and v can be approximated iteratively by the following two steps until convergence:

Step 1 Update u:

a) $u^{i+1} \leftarrow K v^{i}$
b) Normalize: $u^{i+1} \leftarrow u^{i+1} / \|u^{i+1}\|$
c) $u_j^{i+1} \leftarrow \left( |u_j^{i+1}| - \frac{1}{2} \lambda_u / |u_j^{SVD}|^{\gamma} \right)_{+} \mathrm{Sign}(u_j^{i+1})$ for j = 1, 2, ..., p

where $K = (\mathrm{diag}(\Sigma_{XX}))^{-1/2} \, \Sigma_{XY} \, (\mathrm{diag}(\Sigma_{YY}))^{-1/2}$ is a p × q matrix, $\Sigma_{XY}$ is the covariance matrix between X and Y, i is the iteration index, and $u^{SVD}$ denotes the first left singular vector (normalized) obtained from a full Singular Value Decomposition (SVD) of K. Also,

$$(x)_{+} = \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{if } x < 0 \end{cases} \qquad \text{and} \qquad \mathrm{Sign}(x) = \begin{cases} -1, & \text{if } x < 0 \\ 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \end{cases}$$

d) Normalize: $u^{i+1} \leftarrow u^{i+1} / \|u^{i+1}\|$

Step 2 Update v:

a) $v^{i+1} \leftarrow K' u^{i+1}$
b) Normalize: $v^{i+1} \leftarrow v^{i+1} / \|v^{i+1}\|$
c) $v_j^{i+1} \leftarrow \left( |v_j^{i+1}| - \frac{1}{2} \lambda_v / |v_j^{SVD}|^{\gamma} \right)_{+} \mathrm{Sign}(v_j^{i+1})$ for j = 1, 2, ..., q

where $v^{SVD}$ denotes the first right singular vector (normalized) obtained from a full SVD of K.

d) Normalize: $v^{i+1} \leftarrow v^{i+1} / \|v^{i+1}\|$
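The two update steps can be sketched in NumPy as follows. This is a minimal illustration rather than the package's R implementation: the function names, the convergence test, the small `eps` guard against division by zero, and the choice of starting from the normalized column means of K are all assumptions made for this example. Because X and Y are standardized to unit-variance columns, diag(Σ_XX) and diag(Σ_YY) are identity matrices and K reduces to the sample cross-correlation matrix.

```python
import numpy as np

def soft_threshold(w, lam, w_svd, gamma, eps=1e-12):
    """Adaptive soft-thresholding: (|w_j| - 0.5*lam/|w_svd_j|**gamma)_+ * Sign(w_j)."""
    penalty = 0.5 * lam / (np.abs(w_svd) + eps) ** gamma  # eps is an assumption
    return np.maximum(np.abs(w) - penalty, 0.0) * np.sign(w)

def normalize(w):
    """Scale to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def ascca_first_pair(X, Y, lam_u, lam_v, gamma, max_iter=100, tol=1e-6):
    """Approximate the first pair of sparse singular vectors (u, v)."""
    n = X.shape[0]
    # Standardize columns to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    K = X.T @ Y / (n - 1)  # p x q cross-correlation matrix
    # First left/right singular vectors of K supply the adaptive weights.
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u_svd, v_svd = U[:, 0], Vt[0, :]
    v = normalize(K.mean(axis=0))  # assumed initial value for v
    u = np.zeros(K.shape[0])
    for _ in range(max_iter):
        # Step 1: update u (multiply, normalize, threshold, normalize).
        u_new = normalize(soft_threshold(normalize(K @ v), lam_u, u_svd, gamma))
        # Step 2: update v in the same way using K'.
        v_new = normalize(soft_threshold(normalize(K.T @ u_new), lam_v, v_svd, gamma))
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

In this sketch, the TFs with non-zero entries in `u` would be the ones hooked by a given target-gene cluster, and setting `gamma=0` removes the adaptive weighting so that every entry receives the same penalty.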

In our analysis, we set the initial values u⁰ and v⁰ as the standardized column and row means of K. The penalization parameters λ_u, λ_v and γ are selected by evaluating their different combinations through two-dimensional k-fold cross-validation (CV), and then choosing the combination that maximizes the test sample correlation:

$$\Delta_{cor} = \frac{1}{k} \sum_{j=1}^{k} \left| cor\left( X_j \hat{a}^{-j}, \; Y_j \hat{b}^{-j} \right) \right|$$

where $X_j$ and $Y_j$ (the j-th subset of the k-fold CV) are the testing sets, and $\hat{a}^{-j}$ and $\hat{b}^{-j}$ are the canonical vectors estimated from the training set, in which subset j was removed. Increasing the penalization parameters decreases the number of non-zero terms in u and v, and for our data u and v become zero vectors if λ_u and λ_v are greater than 0.4. Consequently, we screen the λ_u and λ_v values from 0 to 0.4 with a step of 0.01, and trace γ from 0 to 2 with a step of 0.1. In total, 35301 (41 × 41 × 21) combinations of the three parameters λ_u, λ_v and γ are examined.
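The parameter search described above can be sketched as below. The helper names, the random fold assignment, and the convention of scoring a fold as 0 when a canonical variate is constant are assumptions for illustration; `fit` stands for any user-supplied function that returns a pair of weight vectors for a training set.

```python
import numpy as np

def cv_test_correlation(X, Y, fit, k=5, seed=0):
    """Average absolute test-set correlation over k folds (Delta_cor)."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    scores = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        a, b = fit(X[train], Y[train])  # vectors estimated without fold j
        xa, yb = X[test_idx] @ a, Y[test_idx] @ b
        if xa.std() == 0 or yb.std() == 0:
            scores.append(0.0)  # assumption: a constant variate scores 0
        else:
            scores.append(abs(np.corrcoef(xa, yb)[0, 1]))
    return float(np.mean(scores))

def parameter_grid():
    """All (lambda_u, lambda_v, gamma) combinations screened in the text."""
    lambdas = [round(0.01 * i, 2) for i in range(41)]  # 0.00, 0.01, ..., 0.40
    gammas = [round(0.1 * i, 1) for i in range(21)]    # 0.0, 0.1, ..., 2.0
    return [(lu, lv, g) for lu in lambdas for lv in lambdas for g in gammas]
```

The best combination would then be the grid entry with the largest `cv_test_correlation` value, with `fit` wrapping an ASCCA routine called with that entry's parameters.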

To further evaluate the set of TFs identified by ASCCA, we take advantage of known positive TFs by examining whether the set of identified TFs contains a sufficient number of known TFs (Figure 2). This follows reasoning similar to the enrichment test (Rivals et al., 2007) but uses a more straightforward and computationally efficient criterion. Denote N as the total number of TFs in X, N_pos as the total number of known positive TFs involved in the same biological process (original input), N_ASCCA as the number of TFs fished out by ASCCA, and N_pos∩ASCCA as the number of known positive TFs that are fished out by ASCCA. Then, based on the ratio of positive TFs to total TFs (N_pos / N), the expected number of positive TFs identified by ASCCA is (N_pos / N) * N_ASCCA, which can be compared with N_pos∩ASCCA, the actual number of positive TFs identified by ASCCA. To make this criterion more stringent, so that only the truly significant TF sets are retained, we multiply the expected number of positive TFs by an enrichment factor (EF), which varies from 1 to 5. That is, if N_pos∩ASCCA > EF * (N_pos / N) * N_ASCCA, the hooked TF set is saved; otherwise it is discarded. For all the sample results shown in this study, we set EF = 3. In this way, we integrate prior biological knowledge into our mathematical model to decide whether the hooked set of TFs should be retained for further investigation.
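As a small worked example of this criterion, the following sketch applies the inequality to sets of gene identifiers. The function name and the gene labels used to exercise it are hypothetical; only the inequality itself comes from the text.

```python
def passes_enrichment(selected_tfs, positive_tfs, n_total_tfs, ef=3.0):
    """Return True if the hooked TF set satisfies N_pos∩ASCCA > EF * (N_pos / N) * N_ASCCA."""
    selected = set(selected_tfs)
    positives = set(positive_tfs)
    n_overlap = len(selected & positives)                      # N_pos∩ASCCA
    expected = len(positives) / n_total_tfs * len(selected)    # (N_pos / N) * N_ASCCA
    return n_overlap > ef * expected
```

With illustrative numbers N = 1640, N_pos = 13 and EF = 3, a hooked set of 20 TFs has an expected count of about 0.16 positive TFs, so a single known positive TF is enough to retain it, whereas a hooked set of 200 TFs (threshold about 4.76) needs at least five.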

Cluster analysis

Before applying ASCCA to extract candidate TFs, we applied the k-means clustering method (Eisen et al., 1998) to partition the positive target genes into several clusters (Figure 2) and then used each cluster as an input (Y) for ASCCA to bait TFs. The k-means algorithm was selected for two reasons. First, target genes in the same cluster are assumed to be co-regulated by the same regulatory machinery, and thus each cluster can serve as an ideal bait for ASCCA. Second, the result of ASCCA is subject to considerable instability from one input to another, i.e. including or excluding one target gene in Y may result in two quite different sets of TFs. This is not surprising: on one hand, the sparse canonical vectors (a and b) are derived from both the greatest correlation between the two sets (X and Y) and the correlations among variables within each set; on the other hand, when there is so much information in the data sets (TFs across the whole genome), there exist several alternative solutions that are almost equally good (Waaijenborg et al., 2008). Consequently, because we aimed to identify TFs by virtue of their true regulatory causality rather than by chance or due to extraneous factors, we performed ASCCA using many target gene clusters and finally averaged the outcomes to minimize the effect of instability.

Because an optimal number of clusters may not exist, as genes involved in different functional domains are co-regulated in groups of varying sizes, we ran the cluster analysis several times, varying the number of clusters from a lower to an upper boundary. At the lower boundary, the average number of target genes in each cluster is 20, and at the upper boundary, the average number of genes in each cluster is 4. For instance, given 100 target genes, k-means clustering is run 17 times with the average cluster size varying from 4 to 20, and in total $\sum_{i=4}^{25} n_i$ clusters are processed by ASCCA, where $n_i$ represents the number of clusters when the average size of each cluster is i.

The application of ASCCA to each target gene cluster results in a set of candidate TFs that cooperatively regulate the target genes in this cluster. To extract the truly important TFs from all of the resulting TF sets, we calculate how many times each TF has been identified by ASCCA. The TFs are then ranked by the frequency of their occurrence. The more frequently a TF has been identified, the more important its role in the corresponding biological process is assumed to be. The list of ranked TFs can therefore provide new hypotheses for further experimental testing. Below is a step-by-step summary of our algorithm:

Step-by-Step Summary of TF-finder

TF-finder proceeds as follows:

(1) Select a set of positive target genes involved in a certain biological process
(2) Select a set of TFs across the genome as input X
(3) Set the average size of a cluster s = 4
(4) Partition the target genes into $n_{s}$ clusters using the k-means clustering method
(5) Use each cluster as input Y and apply ASCCA to X and Y; save the resulting set of candidate TFs if $N_{\mathrm{pos} \cap \mathrm{ASCCA}} > EF \cdot (N_{\mathrm{pos}} / N) \cdot N_{\mathrm{ASCCA}}$, and discard it otherwise
(6) If s < 20, set s ← s + 1 and repeat steps (4)-(5)
(7) For each TF, calculate the frequency with which it is captured by ASCCA
(8) Rank TFs by frequency in decreasing order
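The filtering and ranking steps (5), (7) and (8) can be illustrated with a minimal Python sketch. It is not the released R/Perl code. The symbols are not defined in this summary, so the sketch assumes that N_pos is the number of known positive TFs, N the number of TFs in X, N_ASCCA the number of TFs returned by ASCCA for one cluster, and EF an enrichment-factor threshold. The ASCCA outputs are supplied as ready-made sets, and the TF names and the EF value are hypothetical.

```python
from collections import Counter


def passes_enrichment(selected, positives, n_total, ef):
    """Step (5): keep a TF set only if positives are over-represented in it."""
    overlap = len(set(selected) & set(positives))           # N_pos∩ASCCA
    expected = (len(positives) / n_total) * len(selected)   # (N_pos / N) * N_ASCCA
    return overlap > ef * expected


def rank_tfs(tf_sets, positives, n_total, ef):
    """Steps (7)-(8): count captures over retained sets, then rank."""
    counts = Counter()
    for selected in tf_sets:
        if passes_enrichment(selected, positives, n_total, ef):
            counts.update(set(selected))
    # Decreasing frequency; ties are broken by name so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical ASCCA outputs for three target gene clusters.
tf_sets = [{"TF1", "TF2", "TF3"}, {"TF1", "TF4"}, {"TF5", "TF6"}]
positives = {"TF1", "TF2"}
ranking = rank_tfs(tf_sets, positives, n_total=100, ef=2.0)
print(ranking)
```

In this toy input the third set contains no positive TF, so it is discarded, and TF1 ranks first because it is captured in both retained sets.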

Comparison of TF-finder with ICE

The principle of ICE is based on "guilt by association". It was implemented in such a way that a candidate gene is selected if it is associated with a group of positive genes more often than the other candidates are [7]. In this study, we used a group of positive TFs to judge whether a candidate TF is associated multiple times with the members of this group. Owing to multigenic regulation in the gene network, transcription regulators controlling the same set of target genes are usually coordinated or co-expressed. Therefore, we employed Spearman correlation to 'associate' a candidate TF with a number of positive TFs. The ICE implementation is as follows. Let Y = {y₁, y₂, ..., y_m} be a set of known positive TFs controlling a biological process, and let X = {x₁, x₂, ..., x_n} be a set of TFs across the genome with X ∩ Y = ∅. A Spearman rank correlation ρ_ij is calculated between every pair x_i and y_j (i = 1, ..., n; j = 1, ..., m), and x_i and y_j are considered linked when ρ_ij is larger than a pre-specified threshold ρ₀. In our study, we set ρ₀ = 0.6. All TFs in X are then sorted by their number of links to Y. The genes at the top of the list have more links to Y and are thus the candidate regulatory genes involved in the biological process. Since each selected x_i is located at the "intersection" of multiple elements of Y in a network, we termed this approach "the intersection of coexpression (ICE)".
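The link-counting procedure can be sketched in plain Python as follows. This is a small illustration under stated assumptions, not the implementation used in the study: the function names, the tie handling (average ranks) and the toy expression profiles are all hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(u, v):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    ru, rv = average_ranks(u), average_ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    # Raises ZeroDivisionError for a constant profile, where rho is undefined.
    return cov / (var_u * var_v) ** 0.5


def ice_rank(candidates, positives, rho0=0.6):
    """Sort candidate TFs by their number of links (rho > rho0) to positives."""
    links = {
        name: sum(spearman(x, y) > rho0 for y in positives.values())
        for name, x in candidates.items()
    }
    return sorted(links.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy expression profiles over five chips (hypothetical values).
positives = {"posA": [1, 2, 3, 4, 5], "posB": [2, 1, 4, 3, 6]}
candidates = {"cand1": [10, 20, 30, 40, 50], "cand2": [5, 4, 3, 2, 1]}
print(ice_rank(candidates, positives))
```

Here cand1 is linked to both positive TFs (ρ = 1.0 and 0.8), whereas cand2 is anti-correlated with both and receives no links, so cand1 is placed at the top of the list.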

Preparation of microarray data sets

Microarray data sets were downloaded from multiple resources. The salt stress data set contains 108 chips from 6 experiments (GSE7636, GSE7639, GSE7641, GSE7642, GSE8787, GSE5623) and was downloaded from NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/). The water stress data sets were downloaded from the European Arabidopsis Stock Centre's website (http://arabidopsis.info/) and include 62 chips from 3 experiments of AtGenExpress: Stress Treatments (Drought stress), contributed by the AtGenExpress Consortium. All of the data mentioned above are derived from hybridization of Affymetrix 25k ATH1 microarrays [42]. The original CEL files were processed with the robust multiarray analysis (RMA) algorithm [43] using the Bioconductor package. For quality control, we used previously described methods [44].

Availability of software package

The ASCCA package was written in R. A wrapper for calling ASCCA and a number of parsers were written in Perl. To facilitate use of this package, we release the original source code rather than executables. Users need a Unix/Linux environment in which R and Eisen's k-means clustering package [45] are installed. A separate installation of Perl is usually unnecessary because it is typically included with Linux/Unix operating systems. Interested users can obtain the package by sending an email to hairong@mtu.edu.

Acknowledgements

This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2009-65504-05767 from Agriculture Plant Feedstock Genomics for Bioenergy: A Joint Research Program of USDA and DOE Program.

References

Kasuga M, Liu Q, Miura S, Yamaguchi-Shinozaki K, Shinozaki K: Improving plant drought, salt, and freezing tolerance by gene transfer of a single stress-inducible transcription factor. Nat Biotechnol 1999, 17(3):287–291. 10.1038/7036
Sottosanto JB, Gelli A, Blumwald E: DNA array analyses of Arabidopsis thaliana lacking a vacuolar Na+/H+ antiporter: impact of AtNHX1 on gene expression. Plant J 2004, 40(5):752–771. 10.1111/j.1365-313X.2004.02253.x
Denby K, Gehring C: Engineering drought and salinity tolerance in plants: lessons from genome-wide expression profiling in Arabidopsis. Trends Biotechnol 2005, 23(11):547–552. 10.1016/j.tibtech.2005.09.001
Singh KB: Transcriptional regulation in plants: the importance of combinatorial control. Plant Physiol 1998, 118(4):1111–1120. 10.1104/pp.118.4.1111
Parkhomenko E, Tritchler D, Beyene J: Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 2009, 8(1):Article1.
Waaijenborg S, Verselewel de Witt Hamer PC, Zwinderman AH: Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Stat Appl Genet Mol Biol 2008, 7(1):Article3.
Wei H, Persson S, Mehta T, Srinivasasainagendra V, Chen L, Page GP, Somerville C, Loraine A: Transcriptional coordination of the metabolic network in Arabidopsis. Plant physiology 2006, 142(2):762–774. 10.1104/pp.106.080358
Sun J, Jiang H, Xu Y, Li H, Wu X, Xie Q, Li C: The CCCH-type zinc finger proteins AtSZF1 and AtSZF2 regulate salt stress responses in Arabidopsis. Plant Cell Physiol 2007, 48(8):1148–1158. 10.1093/pcp/pcm088
Davletova S, Schlauch K, Coutu J, Mittler R: The zinc-finger protein Zat12 plays a central role in reactive oxygen and abiotic stress signaling in Arabidopsis. Plant Physiol 2005, 139(2):847–856. 10.1104/pp.105.068254
Nakashima K, Ito Y, Yamaguchi-Shinozaki K: Transcriptional regulatory networks in response to abiotic stresses in Arabidopsis and grasses. Plant Physiol 2009, 149(1):88–95. 10.1104/pp.108.129791
Liu Q, Kasuga M, Sakuma Y, Abe H, Miura S, Yamaguchi-Shinozaki K, Shinozaki K: Two transcription factors, DREB1 and DREB2, with an EREBP/AP2 DNA binding domain separate two cellular signal transduction pathways in drought- and low-temperature-responsive gene expression, respectively, in Arabidopsis. Plant Cell 1998, 10(8):1391–1406. 10.1105/tpc.10.8.1391
Huang D, Jaradat MR, Wu W, Ambrose SJ, Ross AR, Abrams SR, Cutler AJ: Structural analogs of ABA reveal novel features of ABA perception and signaling in Arabidopsis. Plant J 2007, 50(3):414–428. 10.1111/j.1365-313X.2007.03056.x
Stockinger EJ, Gilmour SJ, Thomashow MF: Arabidopsis thaliana CBF1 encodes an AP2 domain-containing transcriptional activator that binds to the C-repeat/DRE, a cis-acting DNA regulatory element that stimulates transcription in response to low temperature and water deficit. Proc Natl Acad Sci USA 1997, 94(3):1035–1040. 10.1073/pnas.94.3.1035
Lee J-T, Prasad V, Yang PT, Wu JF, David Ho T-H, Charng Y-Y, Chan MT: Expression of Arabidopsis CBF1 regulated by an ABA/stress inducible promoter in transgenic tomato confers stress tolerance without affecting yield. Plant, Cell & Environment 2003, 26(7):1181–1190.
Liu JX, Srivastava R, Che P, Howell SH: An endoplasmic reticulum stress response in Arabidopsis is mediated by proteolytic processing and nuclear relocation of a membrane-associated transcription factor, bZIP28. Plant Cell 2007, 19(12):4111–4119. 10.1105/tpc.106.050021
Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al.: Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 2000, 290(5499):2105–2110. 10.1126/science.290.5499.2105
Ooka H, Satoh K, Doi K, Nagata T, Otomo Y, Murakami K, Matsubara K, Osato N, Kawai J, Carninci P, et al.: Comprehensive analysis of NAC family genes in Oryza sativa and Arabidopsis thaliana. DNA Res 2003, 10(6):239–247. 10.1093/dnares/10.6.239
Duarte JM, Cui L, Wall PK, Zhang Q, Zhang X, Leebens-Mack J, Ma H, Altman N, dePamphilis CW: Expression pattern shifts following duplication indicative of subfunctionalization and neofunctionalization in regulatory genes of Arabidopsis. Mol Biol Evol 2006, 23(2):469–478. 10.1093/molbev/msj051
Pillitteri LJ, Torii KU: Breaking the silence: three bHLH proteins direct cell-fate decisions during stomatal development. Bioessays 2007, 29(9):861–870. 10.1002/bies.20625
Pillitteri LJ, Sloan DB, Bogenschutz NL, Torii KU: Termination of asymmetric cell division and differentiation of stomata. Nature 2007, 445(7127):501–505. 10.1038/nature05467
Horiguchi G, Kim GT, Tsukaya H: The transcription factor AtGRF5 and the transcription coactivator AN3 regulate cell proliferation in leaf primordia of Arabidopsis thaliana. Plant J 2005, 43(1):68–78. 10.1111/j.1365-313X.2005.02429.x
Kwong RW, Bui AQ, Lee H, Kwong LW, Fischer RL, Goldberg RB, Harada JJ: LEAFY COTYLEDON1-LIKE defines a class of regulators essential for embryo development. Plant Cell 2003, 15(1):5–18. 10.1105/tpc.006973
Kanaoka MM, Pillitteri LJ, Fujii H, Yoshida Y, Bogenschutz NL, Takabayashi J, Zhu JK, Torii KU: SCREAM/ICE1 and SCREAM2 specify three cell-state transitional steps leading to arabidopsis stomatal differentiation. Plant Cell 2008, 20(7):1775–1785. 10.1105/tpc.108.060848
Dean G, Casson S, Lindsey K: KNAT6 gene of Arabidopsis is expressed in roots and is required for correct lateral root formation. Plant Mol Biol 2004, 54(1):71–84. 10.1023/B:PLAN.0000028772.22892.2d
Schiefthaler U, Balasubramanian S, Sieber P, Chevalier D, Wisman E, Schneitz K: Molecular analysis of NOZZLE, a gene involved in pattern formation and early sporogenesis during sex organ development in Arabidopsis thaliana. Proc Natl Acad Sci USA 1999, 96(20):11664–11669. 10.1073/pnas.96.20.11664
Luo M, Bilodeau P, Dennis ES, Peacock WJ, Chaudhury A: Expression and parent-of-origin effects for FIS2, MEA, and FIE in the endosperm and embryo of developing Arabidopsis seeds. Proc Natl Acad Sci USA 2000, 97(19):10637–10642. 10.1073/pnas.170292997
Alvarez-Buylla ER, Garcia-Ponce B, Garay-Arroyo A: Unique and redundant functional domains of APETALA1 and CAULIFLOWER, two recently duplicated Arabidopsis thaliana floral MADS-box genes. J Exp Bot 2006, 57(12):3099–3107. 10.1093/jxb/erl081
Hibara K, Karim MR, Takada S, Taoka K, Furutani M, Aida M, Tasaka M: Arabidopsis CUP-SHAPED COTYLEDON3 regulates postembryonic shoot meristem and organ boundary formation. Plant Cell 2006, 18(11):2946–2957. 10.1105/tpc.106.045716
Li QJ, Xu B, Chen XY, Wang LJ: The effects of increased expression of an Arabidopsis HD-ZIP gene on leaf morphogenesis and anther dehiscence. Plant Science 2007, 173(5):567–576. 10.1016/j.plantsci.2007.08.011
Dinneny JR, Yadegari R, Fischer RL, Yanofsky MF, Weigel D: The role of JAGGED in shaping lateral organs. Development 2004, 131(5):1101–1110. 10.1242/dev.00949
Petricka JJ, Clay NK, Nelson TM: Vein patterning screens and the defectively organized tributaries mutants in Arabidopsis thaliana. Plant J 2008, 56(2):251–263. 10.1111/j.1365-313X.2008.03595.x
Gallagher TL, Gasser CS: Independence and interaction of regions of the INNER NO OUTER protein in growth control during ovule development. Plant Physiol 2008, 147(1):306–315. 10.1104/pp.107.114603
Rutjens B, Bao D, van Eck-Stouten E, Brand M, Smeekens S, Proveniers M: Shoot apical meristem function in Arabidopsis requires the combined activities of three BEL1-like homeodomain proteins. Plant J 2009, 58(4):641–654. 10.1111/j.1365-313X.2009.03809.x
Brambilla V, Battaglia R, Colombo M, Masiero S, Bencivenga S, Kater MM, Colombo L: Genetic and molecular interactions between BELL1 and MADS box factors support ovule development in Arabidopsis. Plant Cell 2007, 19(8):2544–2556. 10.1105/tpc.107.051797
Wijeratne AJ, Zhang W, Sun Y, Liu W, Albert R, Zheng Z, Oppenheimer DG, Zhao D, Ma H: Differential gene expression in Arabidopsis wild-type and mutant anthers: insights into anther cell differentiation and regulatory networks. Plant J 2007, 52(1):14–29. 10.1111/j.1365-313X.2007.03217.x
Shuai B, Reynaga-Pena CG, Springer PS: The lateral organ boundaries gene defines a novel, plant-specific gene family. Plant Physiol 2002, 129(2):747–761. 10.1104/pp.010926
Mason MG, Mathews DE, Argyros DA, Maxwell BB, Kieber JJ, Alonso JM, Ecker JR, Schaller GE: Multiple type-B response regulators mediate cytokinin signal transduction in Arabidopsis. Plant Cell 2005, 17(11):3007–3018. 10.1105/tpc.105.035451
Weltmeier F, Ehlert A, Mayer CS, Dietrich K, Wang X, Schutze K, Alonso R, Harter K, Vicente-Carbajosa J, Droge-Laser W: Combinatorial control of Arabidopsis proline dehydrogenase transcription by specific heterodimerisation of bZIP transcription factors. EMBO J 2006, 25(13):3133–3143. 10.1038/sj.emboj.7601206
Enjalbert B, Parrou JL, Teste MA, Francois J: Combinatorial control by the protein kinases PKA, PHO85 and SNF1 of transcriptional induction of the Saccharomyces cerevisiae GSY2 gene at the diauxic shift. Mol Genet Genomics 2004, 271(6):697–708. 10.1007/s00438-004-1014-8
Singh K, Foley RC, Onate-Sanchez L: Transcription factors in plant defense and stress responses. Curr Opin Plant Biol 2002, 5(5):430–436. 10.1016/S1369-5266(02)00289-3
Shinozaki K, Yamaguchi-Shinozaki K: Gene Expression and Signal Transduction in Water-Stress Response. Plant Physiol 1997, 115(2):327–334. 10.1104/pp.115.2.327
Redman JC, Haas BJ, Tanimoto G, Town CD: Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Plant J 2004, 38(3):545–561. 10.1111/j.1365-313X.2004.02061.x
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249
Persson S, Wei H, Milne J, Page GP, Somerville CR: Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc Natl Acad Sci USA 2005, 102(24):8633–8638. 10.1073/pnas.0503392102
Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
Byrne ME, Groover AT, Fontana JR, Martienssen RA: Phyllotactic pattern and stem cell fate are determined by the Arabidopsis homeobox gene BELLRINGER. Development 2003, 130(17):3941–3950. 10.1242/dev.00620
Bauer D, Viczian A, Kircher S, Nobis T, Nitschke R, Kunkel T, Panigrahi KC, Adam E, Fejes E, Schafer E, et al.: Constitutive photomorphogenesis 1 and multiple photoreceptors control degradation of phytochrome interacting factor 3, a transcription factor required for light signaling in Arabidopsis. Plant Cell 2004, 16(6):1433–1445. 10.1105/tpc.021568
Monte E, Tepperman JM, Al-Sady B, Kaczorowski KA, Alonso JM, Ecker JR, Li X, Zhang Y, Quail PH: The phytochrome-interacting transcription factor, PIF3, acts early, selectively, and positively in light-induced chloroplast development. Proc Natl Acad Sci USA 2004, 101(46):16091–16098. 10.1073/pnas.0407107101
Stephenson PG, Fankhauser C, Terry MJ: PIF3 is a repressor of chloroplast development. Proc Natl Acad Sci USA 2009, 106(18):7654–7659. 10.1073/pnas.0811684106
Husbands A, Bell EM, Shuai B, Smith HM, Springer PS: LATERAL ORGAN BOUNDARIES defines a new family of DNA-binding transcription factors and can interact with specific bHLH proteins. Nucleic Acids Res 2007, 35(19):6663–6671. 10.1093/nar/gkm775
Muller D, Schmitz G, Theres K: Blind homologous R2R3 Myb genes control the pattern of lateral meristem initiation in Arabidopsis. Plant Cell 2006, 18(3):586–597. 10.1105/tpc.105.038745
Dinneny JR, Weigel D, Yanofsky MF: A genetic framework for fruit patterning in Arabidopsis thaliana. Development 2005, 132(21):4687–4696. 10.1242/dev.02062
Galinha C, Hofhuis H, Luijten M, Willemsen V, Blilou I, Heidstra R, Scheres B: PLETHORA proteins as dose-dependent master regulators of Arabidopsis root development. Nature 2007, 449(7165):1053–1057. 10.1038/nature06206
Nakazawa M, Ichikawa T, Ishikawa A, Kobayashi H, Tsuhara Y, Kawashima M, Suzuki K, Muto S, Matsui M: Activation tagging, a novel tool to dissect the functions of a gene family. Plant J 2003, 34(5):741–750. 10.1046/j.1365-313X.2003.01758.x
Kumar R, Kushalappa K, Godt D, Pidkowich MS, Pastorelli S, Hepworth SR, Haughn GW: The Arabidopsis BEL1-LIKE HOMEODOMAIN proteins SAW1 and SAW2 act redundantly to regulate KNOX expression spatially in leaf margins. Plant Cell 2007, 19(9):2719–2735. 10.1105/tpc.106.048769
Datta S, Hettiarachchi GH, Deng XW, Holm M: Arabidopsis CONSTANS-LIKE3 is a positive regulator of red light signaling and root growth. Plant Cell 2006, 18(1):70–84. 10.1105/tpc.105.038182
Bensmihen S, Rippa S, Lambert G, Jublot D, Pautot V, Granier F, Giraudat J, Parcy F: The homologous ABI5 and EEL transcription factors function antagonistically to fine-tune gene expression during late embryogenesis. Plant Cell 2002, 14(6):1391–1403. 10.1105/tpc.000869
Suzuki M, Ketterling MG, Li QB, McCarty DR: Viviparous1 alters global gene expression patterns through regulation of abscisic acid signaling. Plant Physiol 2003, 132(3):1664–1677. 10.1104/pp.103.022475
Koyama T, Furutani M, Tasaka M, Ohme-Takagi M: TCP transcription factors control the morphology of shoot lateral organs via negative regulation of the expression of boundary-specific genes in Arabidopsis. Plant Cell 2007, 19(2):473–484. 10.1105/tpc.106.044792
Wang ZY, Nakano T, Gendron J, He J, Chen M, Vafeados D, Yang Y, Fujioka S, Yoshida S, Asami T, et al.: Nuclear-localized BZR1 mediates brassinosteroid-induced growth and feedback suppression of brassinosteroid biosynthesis. Dev Cell 2002, 2(4):505–513. 10.1016/S1534-5807(02)00153-3
Waters MT, Wang P, Korkaric M, Capper RG, Saunders NJ, Langdale JA: GLK transcription factors coordinate expression of the photosynthetic apparatus in Arabidopsis. Plant Cell 2009, 21(4):1109–1128. 10.1105/tpc.108.065250
Sieber P, Petrascheck M, Barberis A, Schneitz K: Organ polarity in Arabidopsis. NOZZLE physically interacts with members of the YABBY family. Plant Physiol 2004, 135(4):2172–2185. 10.1104/pp.104.040154
Kumaran MK, Bowman JL, Sundaresan V: YABBY polarity genes mediate the repression of KNOX homeobox genes in Arabidopsis. Plant Cell 2002, 14(11):2761–2770. 10.1105/tpc.004911
Tajima Y, Imamura A, Kiba T, Amano Y, Yamashino T, Mizuno T: Comparative studies on the type-B response regulators revealing their distinctive properties in the His-to-Asp phosphorelay signal transduction of Arabidopsis thaliana. Plant Cell Physiol 2004, 45(1):28–39. 10.1093/pcp/pcg154
Mara CD, Irish VF: Two GATA transcription factors are downstream effectors of floral homeotic gene action in Arabidopsis. Plant Physiol 2008, 147(2):707–718. 10.1104/pp.107.115634
Tominaga-Wada R, Iwata M, Sugiyama J, Kotake T, Ishida T, Yokoyama R, Nishitani K, Okada K, Wada T: The GLABRA2 homeodomain protein directly regulates CESA5 and XTH17 gene expression in Arabidopsis roots. Plant J 2009, 60(3):564–574. 10.1111/j.1365-313X.2009.03976.x
Welch D, Hassan H, Blilou I, Immink R, Heidstra R, Scheres B: Arabidopsis JACKDAW and MAGPIE zinc finger proteins delimit asymmetric cell division and stabilize tissue boundaries by restricting SHORT-ROOT action. Genes Dev 2007, 21(17):2196–2204. 10.1101/gad.440307
Ramon M, Rolland F, Thevelein JM, Van Dijck P, Leyman B: ABI4 mediates the effects of exogenous trehalose on Arabidopsis growth and starch breakdown. Plant Mol Biol 2007, 63(2):195–206. 10.1007/s11103-006-9082-2
Sakuma Y, Liu Q, Dubouzet JG, Abe H, Shinozaki K, Yamaguchi-Shinozaki K: DNA-binding specificity of the ERF/AP2 domain of Arabidopsis DREBs, transcription factors involved in dehydration- and cold-inducible gene expression. Biochem Biophys Res Commun 2002, 290(3):998–1009. 10.1006/bbrc.2001.6299
Borghi L, Bureau M, Simon R: Arabidopsis JAGGED LATERAL ORGANS is expressed in boundaries and coordinates KNOX and PIN activity. Plant Cell 2007, 19(6):1795–1808. 10.1105/tpc.106.047159
Gopalakrishna R, Kumar G, KrishnaPrasad BT, Mathew MK, Udaya Kumar M: A stress-responsive gene from groundnut, Gdi-15, is homologous to flavonol 3-O-glucosyltransferase involved in anthocyanin biosynthesis. Biochem Biophys Res Commun 2001, 284(3):574–579. 10.1006/bbrc.2001.4992
Yuan Y, Chiu LW, Li L: Transcriptional regulation of anthocyanin biosynthesis in red cabbage. Planta 2009.
Castellarin SD, Pfeiffer A, Sivilotti P, Degan M, Peterlunger E, Di Gaspero G: Transcriptional regulation of anthocyanin biosynthesis in ripening fruits of grapevine under seasonal water deficit. Plant Cell Environ 2007, 30(11):1381–1399. 10.1111/j.1365-3040.2007.01716.x
Gonzalez A, Zhao M, Leavitt JM, Lloyd AM: Regulation of the anthocyanin biosynthetic pathway by the TTG1/bHLH/Myb transcriptional complex in Arabidopsis seedlings. Plant J 2008, 53(5):814–827. 10.1111/j.1365-313X.2007.03373.x
Nelson DE, Repetti PP, Adams TR, Creelman RA, Wu J, Warner DC, Anstrom DC, Bensen RJ, Castiglioni PP, Donnarummo MG, et al.: Plant nuclear factor Y (NF-Y) B subunits confer drought tolerance and lead to improved corn yields on water-limited acres. Proc Natl Acad Sci USA 2007, 104(42):16450–16455. 10.1073/pnas.0707193104

Author information

Authors and Affiliations

Department of Mathematical Science, Michigan Technological University, 1400 Townsend Drive, Houghton, MI, 49931, USA
Xiaoqi Cui & Huann-Sheng Chen
School of Forest Resources and Environmental Science, Michigan Technological University, 1400 Townsend Drive, Houghton, MI, 49931, USA
Tong Wang, Victor Busov & Hairong Wei
Biotechnology Research Center, Michigan Technological University, 1400 Townsend Drive, Houghton, MI, 49931, USA
Huann-Sheng Chen, Victor Busov & Hairong Wei

Corresponding author

Correspondence to Hairong Wei.

Additional information

Authors' contributions

XC implemented ASCCA in R and wrote the methods part of the manuscript. TW prepared the microarray data. HSC directed XC in implementing ASCCA and revised the manuscript. VB supported the project and wrote some parts of the manuscript. HW automated the package, ran TF-finder to produce the results, and wrote the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2010_3882_MOESM1_ESM.XLS

Additional file 1: Positive target genes and positive TFs used for testing TF-finder. This is a Microsoft Excel file (.xls) that can be visualized using the Excel contained in Microsoft Office package. (XLS 58 KB)

12859_2010_3882_MOESM2_ESM.XLS

Additional file 2: Novel TFs recognized by TF-finder. This is a Microsoft Excel file (.xls) that can be visualized using the Excel contained in Microsoft Office package. (XLS 411 KB)

12859_2010_3882_MOESM3_ESM.XLS

Additional file 3: Comparison of TFs recognized by TF-finder and ICE. This is a Microsoft Excel file (.xls) that can be visualized using the Excel contained in Microsoft Office package. (XLS 35 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Cui, X., Wang, T., Chen, HS. et al. TF-finder: A software package for identifying transcription factors involved in biological processes using microarray data and existing knowledge base. BMC Bioinformatics 11, 425 (2010). https://doi.org/10.1186/1471-2105-11-425

Received: 18 May 2010
Accepted: 12 August 2010
Published: 12 August 2010
DOI: https://doi.org/10.1186/1471-2105-11-425
