Systematic identification of yeast cell cycle transcription factors using multiple data sources

Background Eukaryotic cell cycle is a complex process and is precisely regulated at many levels. Many genes specific to the cell cycle are regulated transcriptionally and are expressed just before they are needed. To understand the cell cycle process, it is important to identify the cell cycle transcription factors (TFs) that regulate the expression of cell cycle-regulated genes. Results We developed a method to identify cell cycle TFs in yeast by integrating current ChIP-chip, mutant, transcription factor binding site (TFBS), and cell cycle gene expression data. We identified 17 cell cycle TFs, 12 of which are known cell cycle TFs, while the remaining five (Ash1, Rlm1, Ste12, Stp1, Tec1) are putative novel cell cycle TFs. For each cell cycle TF, we assigned specific cell cycle phases in which the TF functions and identified the time lag for the TF to exert regulatory effects on its target genes. We also identified 178 novel cell cycle-regulated genes, among which 59 have unknown functions, but they may now be annotated as cell cycle-regulated genes. Most of our predictions are supported by previous experimental or computational studies. Furthermore, a high confidence TF-gene regulatory matrix is derived as a byproduct of our method. Each TF-gene regulatory relationship in this matrix is supported by at least three data sources: gene expression, TFBS, and ChIP-chip or/and mutant data. We show that our method performs better than four existing methods for identifying yeast cell cycle TFs. Finally, an application of our method to different cell cycle gene expression datasets suggests that our method is robust. Conclusion Our method is effective for identifying yeast cell cycle TFs and cell cycle-regulated genes. Many of our predictions are validated by the literature. Our study shows that integrating multiple data sources is a powerful approach to studying complex biological systems.


Background
Eukaryotic cell cycle is a complex process and is precisely regulated at many levels. One important aspect of this regulation is at the transcriptional level. That is, many genes specific to the cell cycle are transcribed just before they are needed [1]. To have a good understanding of the cell cycle, it is essential to identify the cell cycle-regulated genes and their transcriptional regulators. DNA microarray analysis has revealed that the expression levels of ~800 genes vary in a periodic fashion during the yeast cell cycle, but little is known about the transcriptional regulation of most of these genes [2,3]. To fill this gap, we aim to iden-tify the cell cycle transcription factors (TFs) that regulate the cell cycle-regulated genes inferred by DNA microarray analysis [2].
Two major approaches have been proposed to identify cell cycle TFs. First, clustering and motif-discovering algorithms have been applied to cell cycle gene expression data to find sets of co-expressed genes and plausible binding motifs of their TFs [2,4]. This approach has been expanded to incorporate existing knowledge about the genes, such as cellular functions [5] or promoter sequence motifs [6]. However, this approach provides only indirect evidence of genetic regulatory interactions and does not directly identify the relevant cell cycle TFs. Second, the ChIP-chip technique was developed to identify physical interactions between TFs and promoters. Using ChIP-chip data, Simon et al. [3] investigated how the yeast cell cycle gene expression program is regulated by the nine known major transcriptional activators. Later, Lee et al. [7] constructed a network of TF-promoter interactions and Harbison et al. [8] constructed an initial map of yeast's transcriptional regulatory code. However, ChIP-chip data alone cannot tell whether a TF is an activator or a repressor and, most importantly, ChIP-chip data are noisy and, depending on the chosen p-value cutoff, may include many false positive or false negative TF-promoter binding relationships. For example, if the p-value cutoff is chosen to be 0.001, a false negative rate of ~24% in determining TF-promoter binding was estimated [8].
To overcome the weakness of the above two approaches, we develop a method (details shown in Figure 1) to systematically identify cell cycle TFs by combining four data sources: transcription factor binding site (TFBS), mutant, ChIP-chip, and cell cycle gene expression data. In order to reduce the high false negative rate of the ChIP-chip data, we use current TFBS data [9,10] to avoid using a stringent p-value threshold (≤ 0.001) to determine TF-promoter binding. We assume that a TF binds to a specific promoter if (1) the p-value for the TF to bind the promoter is ≤ 0.01 in ChIP-chip data and (2) the promoter contains one or more binding sites of the TF. That is, we allow the p-value cutoff to be relaxed to 0.01 but the TF-promoter binding event must be supported by the TFBS data.
It is known that the ChIP-chip technique can only detect those TF-promoter binding events that happen in the same physiological condition in which the ChIP-chip experiment is conducted, so it can potentially miss many TF-promoter binding events. We use the mutant data [10] and the TFBS data [9,10] to rescue some of these false negative TF-promoter binding events. We assume that a TF binds to a specific promoter if (1) the disruption of the TF results in a significant change of expression of the gene that has the specific promoter and (2) the promoter con-tains one or more binding sites of the TF. That is, the TFpromoter binding event can be assumed without using ChIP-chip data if it is supported by both the mutant and the TFBS data. This step can rescue some plausible TF-promoter binding events that are missing in the current ChIPchip data.
From the above procedure, we can derive a high-confidence TF-promoter binding matrix (see Methods). However, binding of a TF to the promoter of a gene does not necessarily imply regulation. A TF may bind to the promoter of a gene but has no regulatory effect on that gene's expression. Hence, additional information is required to solve this ambiguity inherent in the TF-promoter binding matrix. In this study, we use the additional information provided by the yeast cell cycle gene expression data [2] to solve this problem. We use a time-lagged dynamic system model of gene regulation to describe how the target gene's expression during cell cycle is controlled by the TFs that bind to its promoter (inferred from the TF-promoter binding matrix). Among these bound TFs, those that have significant regulatory effects on the target gene's expression can be extracted (see Methods). From this procedure, we can refine the TF-promoter binding matrix into a highconfidence TF-gene regulatory matrix. Each TF-gene regulatory relationship in this matrix is supported by at least three data sources: gene expression, TFBS, and ChIP-chip or/and mutant data. From the high-confidence TF-gene regulatory matrix, the regulatory targets of each of the 203 TFs in yeast can be inferred. Finally, a TF is said to be a cell cycle TF if a statistically significant portion of its regulatory targets are cell cycle-regulated genes. By integrating current ChIP-chip, mutant, TFBS, and yeast  cell cycle gene expression data, our method identified 17  cell cycle TFs (Table 1). Among them, 12 are known cell cycle TFs according to the MIPS database [11], including the nine well-known major cell cycle TFs (Ace2, Fkh1, Fkh2, Mbp1, Mcm1, Ndd1, Swi4, Swi5, and Swi6), and Cin5, Cst6, and Stb1.

Identification of 17 cell cycle TFs
The remaining five predicted novel cell cycle TFs (Ash1, Rlm1, Ste12, Stp1 and Tec1) are supported by three lines of evidence. First, each novel cell cycle TF has physical or genetic interactions with some known cell cycle TFs (see Figure 2), suggesting that these five TFs may play a role in the yeast cell cycle. Second, four (Ash1, Rlm1, Ste12 and Tec1) of the five predicted novel cell cycle TFs have also been predicted in previous computational studies [12][13][14]. Third, Ash1, Rlm1, Stp1 and Tec1 were predicted to be cell cycle-regulated by previous studies [1,15]. Being cell cycle regulated themselves, these TFs may play a role in the cell cycle process.
Flowchart of the procedure of our method Figure 1 Flowchart of the procedure of our method.

The cell cycle phases in which a cell cycle TF functions
After identifying the cell cycle TFs, it is desirable to determine in which cell cycle phases a cell cycle TF functions. We regard that a cell cycle TF functions in the X phase (X = MG 1 , G 1 , S, SG 2 , G 2 M) if a statistically significant portion of its regulatory targets belong to the X phase cell cycle-regulated genes defined by Spellman et al. [2] (see Methods). We found that a cell cycle TF may function in more than one cell cycle phase (see Table 1). On average, 86% (31/36) of our predictions have literature support. More specifically, 39% (14/36) of our predictions have experimental evidence and 47% (17/36) of our predictions are consistent with previous computational studies (see Table 1).
The following predictions have experimental evidence. Ace2 and Swi5 have been shown to control certain genes expressed in MG 1 [16], supporting our prediction that Ace2 and Swi5 function in MG 1 . It is known that in the absence of Ndd1 and Fkh2, Mcm1 participates in the regulation of genes essential for cellular functions specific to late mitosis and early G1 [17,18], supporting our prediction that Mcm1 functions in MG 1 . Previous molecular and genetic analysis suggested that SBF (Swi4+Swi6) and MBF (Mbp1+Swi6) are activators of genes essential for cellular functions specific to late G 1 [3,19], supporting our prediction that Mbp1, Swi4, and Swi6 function in G 1 . Two genomic studies [7,20] indicated the involvement of SBF in regulating S phase genes, supporting our prediction  (4)). For each identified cell cycle TF, the cell cycle phases in which the TF functions are shown. "E" means that the prediction is supported by experimental evidence, "C" means that the prediction is supported by previous computational studies, and "N" stands for our novel prediction.
Interactions between a novel cell cycle TF and the other identified cell cycle TFs Figure 2 Interactions between a novel cell cycle TF and the other identified cell cycle TFs. The physical or genetic interactions between a novel cell cycle TF ((a) Ash1, (b) Rlm1, (c) Tec1, (d) Stp1, and (e) Ste12) and the other identified cell cycle TFs are shown. Each oval indicates an identified cell cycle TF. A TF name is colored purple if it is a known cell cycle TF but black otherwise. Two ovals are connected by an undirected red line if these two TFs have physical interactions indicated by the current protein-protein interaction data [48]. Two ovals are connected by a directed blue line if the two TFs have genetic interactions indicated by ChIP-chip or/and mutant data [10]. For example, Ace2→Ash1 means that either TF Ace2 binds to the promoter of gene ASH1 or the disruption of TF Ace2 results in a significant change of the expression of gene ASH1.

Identification of novel cell cycle-regulated genes
For each of the 17 identified cell cycle TFs, we look at their regulatory targets to find novel cell cycle-regulated genes. We regard a gene as a cell cycle-regulated gene if it is regulated by at least two of the 17 identified cell cycle TFs. The requirement for defining a cell cycle-regulated gene to be regulated by at least two rather than one cell cycle TF is to reduce the number of false positives. In total, we identified 178 novel cell cycle-regulated genes that are not in the set of 800 cell cycle-regulated genes identified by Spellman et al. [2]. We found that 64% (114/178) of the novel cell cycle-regulated genes have literature support. More specifically, 25% (45/178) of our predictions have experimental evidence and 39% (69/178) of our predictions are consistent with previous computational studies (see Additional file 1). Among the 178 identified novel cell cycle-regulated genes, 59 genes have no known function according to the Saccharomyces Genome Database [23]. We suggest that these 59 uncharacterized genes are involved in the cell cycle process. Two lines of evidence supported our predictions. First, 68% (40/59) of these genes have literature support. More specifically, 26% (15/ 59) of our predictions have experimental evidence and 42% (25/59) of our predictions are consistent with previous computational studies (see Additional file 1). Second, each of these 59 genes is regulated by at least two cell cycle TFs, and the TF-gene regulatory relationship is supported by at least three data sources: gene expression, TFBS, and ChIP-chip or/and mutant data. Let us consider three examples. According to the Saccharomyces Genome Database [23], YJL160C is a putative cell wall protein, BUD7 may be involved in the budding process, and YCG1 may be involved in mitotic chromosome condensation. However, the exact functions of YJL160C, BUD7 and YCG1 are still unknown [23]. Since cell wall synthesis, budding and chromosome condensation are all important to the cell cycle process [2], this suggests that YJL160C, BUD7, and YCG1 play a role in the cell cycle process, supporting our predictions.

Advantages of our method
Our method has four features that make it more powerful than existing approaches. First, it can reduce false negatives in determining TF-promoter binding events from current ChIP-chip data. Most previous methods [7,8,[24][25][26][27], except GRAM [25], used a stringent p-value threshold (≤ 0.001) to determine TF-promoter binding events in order to reduce the number of false positives, but it was at the expense of false negatives (~24%) [8]. In comparison, our method allows the p-value cutoff to be relaxed to 0.01 but requires that the promoter must have one or more binding sites of the TF. Therefore, using additional information provided by the TFBS data, our method can rescue some false negatives without substantially increasing the number of false positives. For example, we rescue 40 binding targets of Ace2. The promoter of each of these 40 genes has one or more binding sites of Ace2. However, their pvalues of binding events in the ChIP-chip data are all larger than 0.001, so they would not have been identified using a stringent p-value of 0.001 (see Additional file 1 for the other 16 examples).
Second, it is known that ChIP-chip data can only indicate those TF-promoter binding events that happen in the same physiological condition in which the ChIP-chip experiments are conducted. Therefore, many plausible TFpromoter binding events may be missing in the current ChIP-chip data. In order to solve this problem, our method considers that a TF binds to a specific promoter if the disruption of the TF results in a significant change of the expression of the gene that has the specific promoter and if the promoter contains one or more binding sites of the TF. That is, using the information provided by the mutant and the TFBS data, our method can rescue many TF-promoter binding events that are missing in the current ChIP-chip data. For example, we rescue 16 binding targets of Ace2. The promoter of each of these 16 genes has one or more binding sites of Ace2 and the disruption of Ace2 results in a significant change of the expressions of all these 16 genes [10]. All these genes would not be identified as binding targets of Ace2 even when using a relaxed p-value of 0.01 in the ChIP-chip data (see Additional file 1 for the other 16 examples).
Third, our method can extract plausible TF-gene regulatory relationships from TF-promoter binding relationships. Most pervious methods [7,8,[24][25][26] regard the TFpromoter binding relationships provided by ChIP-chip data as the TF-gene regulatory relationships. This may not be true because the binding of a TF to the promoter of a gene does not necessarily imply regulation. A TF may bind to the promoter of a gene but has no regulatory effect on that gene's expression. To solve this problem, our method uses a time-lagged dynamic system model of gene regulation to extract the TFs that have significant regulatory effects on the target gene's expression from all TFs that bind to the promoter of the target gene. Through this process, our method can extract plausible TF-gene regulatory relationships from TF-promoter binding relationships. Thus, in our method each TF-gene regulatory relationship is supported by at least three data sources: gene expression, TFBS, and ChIP-chip or/and mutant data. We found that, on average, 44% of the binding targets of the 17 identified cell cycle TFs are their regulatory targets (see Additional file 1). This ratio is slightly lower than Gao et al.'s estimation (58%) [27] and Wu et al.'s estimation (55%) [28], possibly due to our stringent requirement for a TF-gene regulatory relationship to be supported by at least three data sources, whereas in both previous studies the TF-gene regulatory relationship is only supported by two data sources: gene expression and ChIP-chip data.
Fourth, our method can identify the time lag for a cell cycle TF to exert regulatory effects on its target genes. It is known that the regulatory effects of a TF on its target genes may have a time lag [29][30][31][32][33]. By using a time-lagged dynamical system model, our method takes the time lag into consideration, making it more realistic than those previous studies that did not allow a time lag [27,[34][35][36]. As shown in Figure 3, the average time lag for each of the 17 cell cycle TFs to exert regulatory effects on its target genes was estimated.

Performance comparison with existing methods
Four previous studies also tried to identify the yeast cell cycle TFs. Tsai et al. [12] identified 30 cell cycle TFs by applying a statistical method (ANOVA analysis) and Cheng et al. [14] identified 40 cell cycle TFs by applying another statistical method (Fisher's G test). Cokus et al. [37] identified 12 cell cycle TFs by applying linear regression analysis. Andersson et al. [38] identified 15 cell cycle TFs by applying rule-based modeling. Since these four approaches are different from ours, a performance comparison should be done. As suggested by de Lichtenberg et al. [15], we tested the ability of each of these five methods to retrieve the known cell cycle TFs annotated in the MIPS database [11]. Performance comparison was based on the Jaccard similarity score [39,40], which scores the overlaps between a method's output and the list of known cell cycle TFs (i.e., the true answers). Therefore, the higher the Jaccard similarity score, the better the ability of a method to retrieve the known cell cycle TFs. As shown in Table 2, our method has the highest Jaccard similarity score among the five methods. Therefore, our method outperforms the four existing methods.
The average time lag for a cell cycle to exert regulatory effects on its target genes Figure 3 The average time lag for a cell cycle to exert regulatory effects on its target genes. The average and standard deviation of the time lag for each of the 17 identified cell cycle TFs to exert regulatory effects on its target genes are shown. For example, on average, it takes 9.5 minutes for Stb1 to exert regulatory effects on its target genes.

Robustness against different cell cycle gene expression datasets
We applied our method to two newer cell cycle gene expression datasets (alpha26 and alpha38) published by Pramila et al. in 2006 [41]. Both datasets are alpha-factor synchronized microarray time series spanning two cell cycles. The alpha26 dataset has a sampling interval of 10 minutes and a total of 13 data points, and the alpha38 dataset has a sampling interval of 5 minutes and a total of 25 data points. We found that among the 17 cell cycle TFs identified using Spellman et al.'s dataset, 14 TFs are also identified using the alpha38 dataset, and 12 TFs are also identified using the alpha26 dataset (see Figure 4). This suggests that our method is robust against different cell cycle gene expression datasets.

Parameter settings of our method
The choices of both the relaxed p-value and time-lag parameter have biological meanings. Two previous papers [7,8] used a statistical error model to assign a p-value of the binding relationship of a TF-promoter pair. They found that if p = 0.001, the binding relationship of a TF- Performance comparison was based on the Jaccard similarity score [39,40], which scores the overlaps between a method's output and the list of known cell cycle TFs. Specifically, the Jaccard similarity score is defined as TP/(TP+FP+FN), where TP stands for true positives, FP for false positives, and FN for false negatives. Note that the higher the Jaccard similarity score, the better the ability of a method to retrieve the known cell cycle TFs.
The results of using different cell cycle gene expression datasets promoter pair is of high confidence and can usually be confirmed by promoter-specific PCR. If p > 0.01, the binding relationship of a TF-promoter pair is of low confidence and cannot be confirmed by promoter-specific PCR most of the time. However, if 0.001 <p ≤ 0.01, the binding relationship of a TF-promoter pair is ambiguous and can be confirmed by promoter-specific PCR in some cases but not in the other cases. Our aim is to solve this ambiguity, so we choose 0.01 to be the relaxed p-value. We say that an ambiguous binding relationship of a TF-promoter pair is plausible if 0.001 <p < 0.01 and if the promoter contains one or more binding sites of the TF. As to the timelag parameter, its value is chosen to make the maximal time lag approximately equal to two consecutive cell cycle phases because Simon et al. [3] found cases where a cell cycle TF that expresses in one phase of the cell cycle can regulate genes that function in the next phase.
We regard a gene as a cell cycle-regulated gene if it is regulated by at least two of the 17 identified cell cycle TFs. The requirement for defining a cell cycle-regulated gene to be regulated by at least two rather than one cell cycle TF is to reduce the number of false positves. When the stringent criterion is used, 64% (25% with experimental evidence and 39% with computational evidence) of the identified novel cell cycle-regulated genes are supported by the literature, whereas when the loose criterion is used, only 50% (8% with experimental evidence and 42% with computational evidence) of the identified novel cell cycle-regulated genes are supported by the literature. In this study, we want to be more conservative on calling a gene a "novel" cell cycle-regulated gene, so we aim to eliminate many false positives, though at the expense of some false negatives.

Conclusion
We developed a method to identify cell cycle TFs in yeast by integrating current ChIP-chip, mutant, TFBS, and cell cycle gene expression data. We identified 17 cell cycle TFs, 12 of which are known cell cycle TFs. The remaining five TFs (Ash1, Rlm1, Ste12, Stp1, Tec1) are putative novel cell cycle TFs. Our predictions are supported by interactions (physical or genetic) data and previous studies. In addition, our method can assign each cell cycle TF to specific cell cycle phases in which the TF functions. We found that a cell cycle TF may function in more than one cell cycle phase. On average, 86% of our predictions have literature support (39% with experimental evidence and 47% with computational evidence). Besides, our method can identify the time lag for a cell cycle TF to exert regulatory effects on its target genes. By using a time-lagged dynamical system model, our method takes the time lag into consideration, which makes it more biologically realistic than those previous studies that did not allow a time lag. Moreover, we identified 178 novel cell cycle-regulated genes, 64% of which have literature support (25% with experimental evidence and 39% with computational evidence). Among the 178 novel cell cycle-regulated genes, 59 have no known function (i.e., they are uncharacterized). These 59 uncharacterized genes may now be annotated as cell cycle related genes, supported by the fact that 68% of our predictions have literature support (26% with experimental evidence and 42% with computational evidence). Furthermore, a high-confidence TF-gene regulatory matrix is derived as a byproduct of our method. Each TF-gene regulatory relationship in this matrix is supported by at least three data sources: gene expression, TFBS, and ChIP-chip or/and mutant data. Moreover, we compared the performance of our method with four existing methods and showed that our method has a better ability to retrieve the known cell cycle TFs. Finally, applying our method to different cell cycle gene expression datasets, we identify similar sets of TFs, suggesting that our method is robust.

Data sets and data preprocessing
We use four data sources in this study. First, the ChIP-chip data are from Harbison et al. [8]. They used genome-wide location analysis to determine the genomic occupancy of 203 TFs in rich media conditions. Second, the TFBS data are from MacIsaac et al. [9] and the YEASTRACT database [10]. MacIsaac et al. used evolutionarily conservative criteria to computationally identify the binding sites of many TFs in yeast. The YEASTRACT database includes a set of computational tools that can be used to identify complex motifs over-represented in the promoters of co-regulated genes. Third, the mutant data are from the YEASTRACT database [10]. The mutant data can tell us which gene's expression was changed significantly owing to the deletion (or mutation) of the gene that encodes a TF. The evidence may come from detailed gene by gene analysis or genome-wide expression analysis. Finally, the yeast cell cycle gene expression data are from Spellman et al. [2]. The alpha factor data set is used because it was shown to have a better data quality than the other data sets [15]. Samples for all genes in the yeast genome are collected at 18 time points (0, 7, 14, 21, ..., 119 minute), which cover two cell cycles. That is, each gene has a 18-timepoint gene expression profile. The cubic spline method [42] is then used to reconstruct the missing values and interpolate extra data points into the original time profile. Note that genes that have more than one missing value in their gene expression profiles are excluded in this study.

Construction of a high-confidence TF-gene regulatory matrix
Using three data sources (ChIP-chip, mutant and TFBS data), we can construct a high-confidence TF-promoter binding matrix B = [b i,j ], where b i,j = 1 if (1) the p-value for TF j to bind the promoter of gene i is ≤ 0.01 in the ChIP-chip data and the promoter of gene i contains one or more binding sites of TF j or (2) the disruption of TF j results in a significant change of the expression of gene i and the promoter of gene i contains one or more binding sites of TF j. Otherwise, b i,j = 0.
However, binding of a TF to the promoter of a gene does not necessarily imply regulation. Hence, additional information is required to solve this ambiguity inherent in the TF-promoter binding matrix. Using a time-lagged dynamic model of gene regulation, we can refine the TFpromoter binding matrix into a high-confidence TF-gene regulatory matrix. We consider the transcriptional regulatory mechanism of a target gene as a system with the regulatory profiles of several TFs as the inputs and the gene expression profile of the target gene as the output. The transcriptional regulation of a target gene is described by the following time-lagged dynamic system model [43][44][45] where It has been shown that TF binding usually affects gene expression in a nonlinear fashion: below some level it has no effect, while above a certain level the effect may become saturated. This type of binding behavior can be modeled using a sigmoid function. Therefore, x i [t] (the regulatory profile of TF i at time point t) is defined as a sigmoid function of z i [t] (the gene expression profile of TF i at time point t) [13,28,44,45]: where g denotes the transition rate of the sigmoid function and A i denotes the mean of the gene expression profile of TF i. It is also known that the regulatory effect of a TF on its target genes may not be simultaneous but has a time lag [28][29][30][31][32][33]. Therefore, we incorporate a time lag term into our dynamic system model. The time lag τ i between TF i and the target gene y is determined by , where r(q) is the correlation between (the expression profile of the target gene y) and (the regulatory profile of TF i) with a lag of q time points [28,29]: where , M is the number of time points of the target gene's expression profile and Q is the maximal time lag of the TF's regulatory profile considered. The time lag may be interpreted as the time for a TF to exert a regulatory effect on its target gene's expression. The value of Q is chosen to make the maximal time lag approximately equal to two consecutive cell cycle phases because Simon et al. [3] found cases where a cell cycle TF that expresses in one phase of the cell cycle can regulate genes that function in the next phase.
After writing down the time-lagged dynamic system model of gene regulation, the next step is to estimate the unknown parameters in the model. We rewrite Equation (1) into the following regression form: , ,...,   [46]. The pvalue computed by the t-distribution is then adjusted by the Bonferroni correction to represent the true alpha level in the multiple hypotheses testing [46]. Finally, TF i is said to be a true regulator of the target gene if the adjusted pvalue p adjusted ≤ 0.05.

Identification of cell cycle TFs
From the high-confidence TF-gene regulatory matrix, the regulatory targets of each of the 203 TFs in yeast can be inferred. Then a TF is said to be a cell cycle TF if a statistically significant portion of its regulatory targets are in the set of 800 cell cycle-regulated genes identified by Spellman et al. [2]. The hypergeometric distribution is used to test the statistical significance [46,47]. The procedure for checking whether TF j is a cell cycle TF is as follows. Let S be the set of cell cycle-regulated genes identified by Spellman et al. [2], G be the set of genes that are regulated by TF j (inferred from the TF-gene regulatory matrix), T = S ∩ G be the set of cell cycle-regulated genes that are also regulated by TF j, and F be the set of all genes in the yeast genome. Then the p-value for rejecting the null hypothesis (H 0 : TF j is not a cell cycle TF) is calculated as where |G| means the number of genes in set G. This pvalue is then adjusted by the Bonferroni correction to represent the true alpha level in the multiple hypotheses testing [46]. TF j is said to be a cell cycle TF if the adjusted pvalue p adjusted ≤ 0.05. This procedure is applied to each of the 203 TFs under study.

Identification of the cell cycle phases in which a cell cycle TF functions
For each of the 17 identified cell cycle TFs, we want to determine in which cell cycle phases it functions. We regard that a cell cycle TF functions in the X phase (X = MG 1 , G 1 , S, SG 2 , G 2 M) if a statistically significant portion of its regulatory targets belong to the X phase cell cycleregulated genes identified by Spellman et al. [2]. Equation (4) is again used to test the statistical significance. While G and F are defined as before, S now denotes the set of X phase cell cycle-regulated genes identified by Spellman et al. [2] and T = S ∩ G now denotes the set of X phase cell cycle-regulated genes that are also regulated by the cell cycle TF under study. The p-value computed by Equation (4) is then adjusted by the Bonferroni correction to represent the true alpha level in the multiple hypotheses testing. We say that a cell cycle TF functions in the X phase (X = MG 1 , G 1 , S, SG 2 , G 2 M) if the adjusted p-value p adjusted ≤ 0.05.

Authors' contributions
WSW developed the algorithm, performed the simulation and wrote the manuscript. WHL conceived the research topic, provided essential guidance and revised the manu- script. All authors read and approved the final manuscript.