CAGER: classification analysis of gene expression regulation using multiple information sources
© Ruan and Zhang. 2005
Received: 03 November 2004
Accepted: 12 May 2005
Published: 12 May 2005
Many classification approaches have been applied to analyzing the transcriptional regulation of gene expression. These methods build models that can explain a gene's expression level from the regulatory elements (features) on its promoter sequence. Different types of features, such as experimentally verified binding motifs, motifs discovered by computer programs, or transcription factor binding data measured with Chromatin Immunoprecipitation (ChIP) assays, have been used towards this goal. Each type of feature has been shown to be successful in modeling gene transcriptional regulation under certain conditions. However, no comparison has been made to evaluate the relative merits of these features. Furthermore, most publicly available classification tools were not designed specifically for modeling transcriptional regulation, and do not allow the user to combine different types of features.
In this study, we use a specific classification method, decision trees, to model transcriptional regulation in yeast with features based on predefined motifs, automatically identified motifs, ChIP-chip data, or their combinations. We compare the accuracy and stability of these models, and analyze their capabilities in identifying functionally related genes. Furthermore, we design and implement a user-friendly web server called CAGER (Classification Analysis of Gene Expression Regulation) that integrates several software components for automated analysis of transcriptional regulation using decision trees. Finally, we use CAGER to study the transcriptional regulation of Arabidopsis genes in response to abscisic acid, and report some interesting new results.
Models built with ChIP-chip data suffer from low accuracies when the condition under which gene expressions are measured is significantly different from the condition under which the ChIP experiment is conducted. Models built with automatically identified motifs can sometimes discover new features, but their modeling accuracies may have been over-estimated in previous studies. Furthermore, models built with automatically identified motifs are not stable with respect to noise. A combination of ChIP-chip data and predefined motifs can substantially improve modeling accuracies, and is effective in identifying true regulons. The CAGER web server, which is freely available at http://cic.cs.wustl.edu/CAGER/, allows the user to select combinations of different feature types for building decision trees, and to interact with the models graphically. We believe that it will be a useful tool to facilitate the discovery of gene transcriptional regulatory networks.
A major challenge in computational biology is to reveal the cis-regulatory logics of gene expression through analysis of high-throughput genomic data, for example, genomic sequences and microarray gene expression data. A common practice is to first identify putatively co-regulated genes by clustering gene expression patterns [1–3], and then search for common motifs in the promoter sequences of these genes [4–6]. However, motif finding methods are often sensitive to noise and usually do not consider the combinatorial nature of cis-regulation. Furthermore, these methods by themselves do not reveal the actual transcription factors (TFs) that bind to particular sequence motifs.
Recently, many researchers attempted to build quantitative or qualitative models to associate a gene's expression level with regulatory motifs on its promoter sequence. Pilpel et al. explicitly analyzed the combinatorial effects of motif pairs on gene expression profiles and identified many significant motif combinations. Bussemaker et al. and others [9, 10] modeled the expression levels of genes as a linear regression of putative binding motifs, and applied feature selection techniques to find the most significant motifs. Hu et al. used decision trees to find motif combinations that best separate two sets of genes. Phuong et al. applied multivariate regression trees to model the transcriptional regulation of gene expressions over several time points simultaneously. Middendorf et al. used an ensemble of decision trees to model gene expression levels by combining putative binding motifs and the expression levels of putative TFs. Simonis et al. combined a string-based motif finding method and linear discriminant analysis to identify motif combinations that can separate true regulons from false ones. Segal et al. and Beer and Tavazoie built probabilistic graphical models, e.g., Bayesian networks, to explain gene expression patterns from motifs. In these models, the predictors (features) are the matching scores of promoter sequences to putative binding motifs, and the predictions (responses) can be continuous or discrete gene expression levels or categorical cluster labels.
The features used in these models generally come from one of the following sources. First, one can use computer programs to automatically find motifs from the promoters of the genes to be modeled [10, 11, 14–16]. Second, predefined motifs can be obtained independently from sources such as databases of experimentally verified or putative motifs [12, 13]. Third, one can enumerate all words up to a certain length as features [8, 9]. In addition, TF binding data derived from Chromatin Immunoprecipitation (ChIP) assays have been used as a substitute for motif scores. For example, Banerjee and Zhang directly applied the method of Pilpel et al. to ChIP-chip data to identify TF combinations; Gao et al. replaced the variables in the linear model of Bussemaker et al. with ChIP-chip data and identified significant regulators for many experimental conditions. We recently applied a decision tree method to S. cerevisiae ChIP-chip data and identified all known TFs and many interesting TF combinations for the yeast cell cycle.
Each of the feature types discussed above (motifs or binding data) has its advantages and disadvantages in modeling gene transcriptional regulation. To our knowledge, however, no comparison has been made to evaluate their relative merits. Modeling accuracy is largely affected by the type of features considered in model construction. If all relevant features are included, many modeling algorithms may achieve equally high accuracies. On the other hand, if the most significant features are omitted, no model can achieve satisfactory accuracy. Furthermore, the inclusion of many irrelevant features may significantly decrease modeling accuracy. Therefore, a comparative study can identify the limitations of these feature types, and provide some guidelines and justifications.
Although many classification tools are publicly available (for example, WEKA), most of them are not designed specifically for modeling transcriptional regulation, and are not convenient for biological applications. The only related software is a web server called REDUCE, which combines linear regression and feature selection methods to identify significant motifs for specific biological events. It uses enumerated words up to a certain length as features, but does not allow other types of features, such as position-specific weight matrices or ChIP-chip data, to be used. Moreover, the linear model used in REDUCE assumes that each motif contributes linearly to gene expression, and therefore is unable to represent complex cis-regulatory logics such as AND and OR relations [16, 24].
In this study, we apply a well-studied classification method, decision trees [25–27], to model significantly up- or down-regulated genes in each of 250 microarray experiments of S. cerevisiae. The use of decision trees in modeling transcriptional regulation has been explored previously by others [11–13] and in our own research. Here we focus on analyzing the extent to which the expression of these genes can be predicted using different features. We compare the cross-validation accuracies of the models built with different features or feature combinations. We also compare the robustness of these models by introducing noise into the training data. Furthermore, we analyze the enrichment of functional categories of the genes that pass model tests compared to those that fail the tests, and show that decision tree models can be used to detect true regulons. Finally, we present the design and implementation of a user-friendly web interface that combines multiple information sources for automated analysis of gene transcriptional regulation using decision trees. As an example, we also present a case study on the transcriptional regulation of genes in Arabidopsis thaliana in response to abscisic acid (ABA) treatment.
1. Examine every possible binary split of the genes in the node based on each feature i, such that all genes in one subset have f_i < x and those in the other subset have f_i ≥ x.
2. Select the best split, and create two child nodes that contain the two subsets of genes respectively.
Steps 1 and 2 are then recursively applied to each of the child nodes until no split is possible, or until all genes in the current node have the same label. Finally, some branches of the tree may be pruned to prevent over-fitting. Nodes with child nodes are called internal nodes, and nodes without child nodes are called leaf nodes. For examples and biological interpretations of decision trees, see Figure 4 and Figure 5.
Each entry f_i in the feature vector of a gene corresponds to the matching score of the gene's promoter to the ith binding motif, or the binding data of the ith TF to the gene's promoter, depending on the type of features used. A split is equivalent to a test for a gene in the form of, for example, "is the matching score of the gene's promoter to motif A greater than x?" or "is the binding affinity of the gene's promoter to TF B greater than y?" The exact split point x or y is determined by maximizing an objective function that reflects the purity of the child nodes. Information gain and gain ratio are two frequently used objective functions. Here we used information gain (see Methods).
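The split-point search just described can be sketched in a few lines; this is a minimal illustration of information-gain-based splitting, not the actual CAGER implementation (function names are ours):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(scores, labels):
    """Find the threshold x on one feature (e.g. motif matching scores)
    that maximizes information gain over the split f_i < x vs f_i >= x."""
    n, base = len(scores), entropy(labels)
    best_gain, best_x = 0.0, None
    for x in sorted(set(scores))[1:]:  # candidate split points
        left = [l for s, l in zip(scores, labels) if s < x]
        right = [l for s, l in zip(scores, labels) if s >= x]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_x = gain, x
    return best_x, best_gain
```

For instance, with motif scores [0.1, 0.2, 0.8, 0.9] and labels [neg, neg, pos, pos], the best split is x = 0.8 with a gain of 1.0, since both child nodes are pure.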
The class label of a gene represents a property of the gene that we want to model. For example, one can cluster genes according to expression patterns, and then assign the same label to genes in the same group. Labels can also be derived from other sources such as functional annotations. In this work, we assign labels to genes according to the change of their expression under certain conditions relative to some reference state (see Experimental setup), and focus our attention on the comparison of different features. This modeling approach is based on the assumption that co-regulated genes very often share common regulatory elements on their promoters. This approach, on the other hand, will not capture post-transcriptional modifications, and will ignore genes that share no regulatory elements with other genes. Note that the underlying assumption may not always be met, since not all co-expressed genes are co-regulated. Moreover, genes may be mislabeled due to noise in their expression data. The purpose is, therefore, to identify rules of thumb for the regulation of the majority of genes, while tolerating to some extent the failure to model particular genes. As we will see later, decision tree models can indeed be used to detect the true regulons among putatively co-regulated genes.
We collected data for 250 microarray experiments on the yeast S. cerevisiae, of which 77 were for the cell cycle [28, 29] and 173 were for responses to various stress conditions. We built decision trees for each condition with two classes of instances: positive genes that are differentially expressed (up- or down-regulated) with respect to the reference state, and negative genes that are neither up- nor down-regulated. For each experiment, we selected up to 50 up- or down-regulated genes as positive instances and sampled negative instances from non-differentially expressed genes (see the next subsection for selecting candidate genes). Once the genes were selected, we modeled the regulation of up- and down-regulated genes separately, since regulatory motifs common to either the up- or the down-regulated group might not be revealed if the two groups were modeled together.
Since most genes are not differentially expressed, the number of negative instances is far greater than the number of positive instances. Previous research has shown that class distribution is an important factor for successful modeling. In general, a skewed class distribution will lead to degraded modeling accuracy. Therefore, for a set of n positive instances, we randomly sampled μn instances from all possible negative instances to obtain a desired class distribution. The sampling was repeated 10 times, and each set of sampled negative instances was combined with the set of positive instances to build a decision tree. The accuracy of each model was estimated with a ten-fold cross-validation and measured by the kappa statistic (see Methods). As shown in Figure 1A, the highest accuracy was achieved when μ = 3, which is consistent with the results of Simonis et al. Therefore, we used μ = 3 in all subsequent analyses.
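The sampling scheme above can be sketched as follows; a minimal illustration with our own function names, not CAGER's:

```python
import random

def sample_training_sets(positives, all_negatives, mu=3, repeats=10, seed=0):
    """For n positive genes, draw mu*n negatives, repeated `repeats` times.

    Returns a list of (positives, sampled_negatives) training sets; each
    would be used to build one decision tree whose accuracy is then
    estimated by cross-validation.
    """
    rng = random.Random(seed)
    n = len(positives)
    return [(positives, rng.sample(all_negatives, mu * n))
            for _ in range(repeats)]
```

With μ = 3 and 50 positive genes, each of the 10 training sets would contain the 50 positives plus 150 sampled negatives.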
In general, the inclusion of a large number of irrelevant features in training data decreases the accuracy of most classification algorithms. Therefore, feature selection methods are usually applied to reduce the number of features. Most feature selection methods can be categorized as wrappers or filters. A wrapper method searches for a subset of features that maximizes the cross-validation accuracy of a given classification algorithm. This strategy is guaranteed to improve the classification accuracy if the same algorithm is used in feature selection and model training. However, it may over-fit to the specific classification algorithm. Furthermore, the method is computationally expensive, since many iterations of the classification algorithm need to be executed. In a filter method, features are selected independently of any classification algorithm. Individual features or feature subsets are ranked according to certain scoring functions, and the top ones are selected. This approach is efficient in removing a large number of irrelevant features, but may sometimes eliminate low-ranked, nevertheless important, features. In this study, we used a filter method because of its computational efficiency. The method ranks individual features according to their information gains and selects the top d features (see Methods). As shown in Figure 1B, the best kappa was achieved with as few as 5 features. This agrees with the fact that the transcriptional regulation of a gene generally involves only a few transcription factors. Careful inspection of individual decision trees shows that with 5 features, the performance of some trees may be worse than those with more features, due to the loss of some significant features. With 10 features, the modeling accuracies were almost never worse than those with more features. Therefore we used d = 10 in all subsequent analyses. The accuracy of each classification model was estimated using a ten-fold cross-validation procedure.
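The filter step can be illustrated as below: each feature is scored by its best single-split information gain, and the top d feature indices are kept. This is only a sketch of the idea under our own naming, not the exact procedure from Methods:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def feature_gain(col, labels):
    """Best single-split information gain of one feature column."""
    n, base = len(labels), entropy(labels)
    gains = []
    for x in sorted(set(col))[1:]:
        lo = [l for v, l in zip(col, labels) if v < x]
        hi = [l for v, l in zip(col, labels) if v >= x]
        gains.append(base - len(lo) / n * entropy(lo)
                          - len(hi) / n * entropy(hi))
    return max(gains, default=0.0)

def select_top_features(matrix, labels, d=10):
    """Filter method: keep the d feature indices with the highest gains."""
    ranked = sorted(range(len(matrix[0])),
                    key=lambda j: feature_gain([row[j] for row in matrix],
                                               labels),
                    reverse=True)
    return ranked[:d]
```

Because each feature is scored in isolation, this runs in one pass over the feature matrix, which is what makes the filter approach cheap compared to a wrapper.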
First, a training set was randomly divided into ten equal-sized subsets. Each subset was then used in turn as a validation set to test the accuracy of the model built with the other nine subsets. We calculated kappa to measure model accuracies (see Methods). The kappa statistic, written κ, measures the agreement between the class labels and the predictions made by the classifier, corrected for the amount of agreement that may be achieved by chance. Therefore, it reflects the extent to which the differential expression between positive and negative genes can truly be explained by the classification model. For example, given a data set containing 20 positive and 80 negative genes, a model that simply guesses all genes as negative agrees with the true labels on 80% of cases, as does another model that makes five mistakes on positive genes and 15 on negative genes. Taking into account the amount of agreement that we would expect by chance, the value of κ is 0.0 for the former model, but 0.47 for the latter.
It is known that κ depends on the class distribution and the number of categories in the test data. This, however, was not a problem in our case, since the models in our tests all had binary classes and the same class distributions. Another difficulty associated with κ is its lack of interpretability, although some relations between κ and model quality have been suggested. Therefore, in addition to κ, we calculated sensitivity (SS) and specificity (SP) for each model (see Methods). SS is the proportion of positive genes that are correctly predicted by the model, i.e., the proportion of up- or down-regulated genes that can be explained by the regulatory elements identified. SP is the proportion of negative genes that are correctly predicted by the model, and 1 - SP represents the proportion of negative genes that cannot be separated from the positive genes based on the regulatory elements.
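All three statistics follow directly from a binary confusion matrix; the sketch below (our own notation) reproduces the worked example of the 20-positive/80-negative data set:

```python
def kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # chance agreement: product of actual and predicted marginals per class
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    expected = p_pos + p_neg
    return (observed - expected) / (1 - expected)

def sensitivity(tp, fn):
    """SS: proportion of positive genes predicted positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SP: proportion of negative genes predicted negative."""
    return tn / (tn + fp)
```

Here kappa(0, 20, 0, 80) gives 0.0 for the all-negative guesser, while kappa(15, 5, 15, 65) gives ≈ 0.47 for the model with five positive and 15 negative mistakes, matching the example above.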
Microarray data are noisy and often measured with limited or no replication, which makes the identification of differentially expressed genes (DEGs) difficult. In most early microarray analysis, a fixed fold-change threshold (generally two-fold) was used to identify DEGs, while more sophisticated methods have emerged recently [38–41]. Although it is not the focus of this paper, we considered and compared several different DEG identification methods to show that our conclusions on the classification models are unlikely to be affected by the specific DEG identification method used.
For the first method (referred to as the vanilla method), we downloaded the transformed and normalized log ratios for the 250 microarray experiments. The data set had been corrected for background noise, and globally normalized by constant factors such that the mean log2(cy5/cy3) value is zero within each slide. For expression data in cell cycle conditions, the log ratios were further normalized such that the mean log ratio for each gene across all cell cycle conditions is zero. Several rules were also applied to remove spurious data points. For each column (condition) of this data set, we selected the genes with log2(cy5/cy3) ≥ 1 (more than two-fold induction) as up-regulated genes. In cases where there were more than 50 up-regulated genes, we selected the top 50 with the highest fold changes, or until ties were broken. Likewise, down-regulated genes with log2(cy5/cy3) ≤ -1 were selected. Genes with |log2(cy5/cy3)| ≤ 0.6 (less than 1.5-fold expression change) were considered as non-DEGs and were used to sample negative instances. Note that we intentionally used two different thresholds for DEGs and non-DEGs, in order to exclude genes whose labels may be ambiguous.
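The vanilla selection rules can be summarized in code. This is a simplified sketch with our own function name; it omits the tie-breaking detail mentioned above:

```python
def classify_genes(log_ratios, max_degs=50):
    """Label genes from one microarray column by the vanilla thresholds.

    log_ratios: dict mapping gene name -> log2(cy5/cy3).
    Up-regulated: ratio >= 1 (two-fold induction), top max_degs kept.
    Down-regulated: ratio <= -1, top max_degs kept.
    Non-DEG pool (for negative sampling): |ratio| <= 0.6.
    """
    up = sorted((g for g, r in log_ratios.items() if r >= 1),
                key=lambda g: -log_ratios[g])[:max_degs]
    down = sorted((g for g, r in log_ratios.items() if r <= -1),
                  key=lambda g: log_ratios[g])[:max_degs]
    non_deg = [g for g, r in log_ratios.items() if abs(r) <= 0.6]
    return up, down, non_deg
```

Note the deliberate gap between the DEG thresholds (|ratio| ≥ 1) and the non-DEG threshold (|ratio| ≤ 0.6): genes falling in between are excluded as ambiguous.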
For the second to fifth methods, we downloaded the raw intensity data. However, we only found raw data for 216 of the 250 conditions, and it was sometimes unclear how to match the name of a raw data file with a column in the log ratio data. The intensity data were background corrected, without any other normalization or transformation. We removed low quality data points that were annotated by the authors with failed status or non-zero flags.
The second approach (referred to as the global normalization method) was similar to the vanilla approach, except that no per-gene normalization was made. The third approach (referred to as the lowess normalization method) was similar to the second approach, except that within-slide normalization factors were intensity dependent, obtained through a locally weighted regression approach. The fourth method (referred to as the vsn method) transformed the intensities with a generalized logarithm function, in order to stabilize the variances, which were originally intensity dependent. We applied the same thresholds as in the vanilla approach to the transformed data. The fifth method (referred to as the EDGE method) quantified the heteroscedasticity as a function of intensity and then took it into account to identify DEGs. The EDGE method assigned each gene a p-value based on its distance from the line of equivalence (cy5 = cy3), corrected for multiple tests. We ranked genes according to their corrected p-values, and selected up to 50 up-regulated or down-regulated genes with false discovery rate (FDR) < 0.002. The FDR threshold was used to ensure that the number of DEGs selected by EDGE was approximately equal to those chosen by the other approaches. It had no impact for most experiments, where all top 50 genes had FDRs less than 0.002. Genes with FDR > 0.5 were considered as non-DEGs. The programs for vsn and EDGE were obtained from their original authors.
Statistics of DEG sets identified by different methods. The ranges represent one standard deviation from the means. Columns: number of up-regulated sets, number of down-regulated sets, and number of genes per set. Only the genes-per-set values are recoverable (one row per method): 48 ± 3, 47 ± 8, 45 ± 6, 49 ± 7, 46 ± 6.
We compared the cross-validation accuracies of the models built with the three types of features discussed earlier: ChIP-chip data, predefined motifs, and auto motifs. The combination of ChIP-chip data with predefined motifs was also tested. For each type of features or feature combination, 446 × 10 = 4460 decision tree models were built (446 sets of DEGs and 10 sets of negative genes randomly sampled for each set of DEGs). We randomly exchanged the labels of positive and negative genes to serve as controls. Note that we carried out two types of cross-validation for models built with auto motifs. In the first method, promoter sequences of genes in both the training data and the test data were combined to find motifs. In other words, motifs were identified from all positive genes, and the same set of motifs was used to train models for each fold in a cross-validation. In the second method, genes were first divided into ten subsets without constructing the actual feature vectors. A subset was chosen for testing, and the other nine subsets were used for motif finding. In other words, motifs were identified from only the training genes, and a different set of motifs was used to train models for each fold in a cross-validation. The second method provided a more stringent estimation of the generalization accuracy of a model, since it completely hid the test data from the learning algorithm until they were tested. The first method, however, was used in several previous studies [11, 16], probably because it is simple to implement and convenient to test. Here we analyzed the results of both cross-validation methods to compare auto motifs with other feature types. In the next two subsections, we used only the first method to show other aspects of the models based on auto motifs.
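The second (stringent) scheme can be sketched as follows. Here find_motifs, train_tree, and evaluate are hypothetical stand-ins for the motif finder and decision-tree routines, included only to show where motif finding sits relative to the fold boundary:

```python
def stringent_cv(data, find_motifs, train_tree, evaluate, k=10):
    """Cross-validation in which motifs are re-discovered within each fold,
    so the held-out genes never influence the feature set.

    data: list of (promoter, label) instances.
    find_motifs / train_tree / evaluate: placeholders for the motif finder
    and decision-tree routines (not shown here).
    """
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j in range(k) if j != i for x in folds[j]]
        motifs = find_motifs(train)  # held-out promoters never seen here
        scores.append(evaluate(train_tree(train, motifs), test, motifs))
    return sum(scores) / k
```

The first (lenient) scheme would instead call find_motifs(data) once, outside the loop, which is exactly how information about the test instances leaks into the feature set.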
Stability of decision tree models. Up: the models built for up-regulated genes. Down: the models built for down-regulated genes. Noise: the number of noisy instances added into the training set. TP: the number of true positive genes predicted by models built on the original data. TP': the number of true positive genes predicted by models built on the noisy data. Loss: the number of positive instances correctly classified in the original data but mis-classified in the noisy data. Rescue: the number of positive instances correctly classified in the noisy data but mis-classified originally. FP: the number of newly added noise genes classified as positive. Each value is an average across 223 up-regulated or 223 down-regulated gene sets. The standard errors for loss, rescue and FP are all less than 0.2.
As shown in Figure 3A, when the first cross-validation method is used, the models using auto motifs have the highest kappa values (~0.53) among the three individual feature types. However, it is important to note that these models also have the highest kappa values on randomly selected genes (~0.4).
Furthermore, the accuracies measured by the second cross-validation method are much lower: the average kappa values are 0.15 for models in cell cycle and 0.22 for models in stress response experiments, and are approximately zero for models of randomly selected genes. Therefore, the first method considerably over-estimates the accuracies of the models built with auto motifs. This is because, with the first cross-validation method, the feature set contains some information about the test instances, even though the models are built only on training instances. Consequently, although the results reported in previous studies utilizing automatically identified motifs [11, 16] may still be valid qualitatively, the exact accuracies may need to be re-evaluated. Nevertheless, an apparent advantage of using automatically identified motifs is that they may be able to discover new features not included in predefined motifs or ChIP data.
In cell cycle experiments, the models using ChIP data or predefined motifs have similar kappa values (0.27 and 0.29, respectively; p-value = 0.15 in a paired t-test). In contrast, in stress response experiments, the models using ChIP data alone have much lower kappa values than those using predefined motifs (0.15 vs. 0.33, p-value = 10^-47 in a paired t-test). The ChIP data used in this study were measured under normal cell growth conditions. However, it is known that TF binding may change with environmental conditions. While the cell cycle expression data were measured under conditions relatively similar to normal cell growth conditions, stress treatment dramatically changes the environmental conditions and thereby alters the binding of TFs. It is thus expected to observe lower prediction accuracies in stress response experiments than in cell cycle experiments when using ChIP data alone.
Figure 3B for SS shows almost identical trends as Figure 3A, which means that with combined features, the models can better explain the co-regulation of the genes. On the other hand, Figure 3C shows that models built with ChIP data have higher specificity than those built with predefined or auto motifs (p-value = 10^-68 in a paired t-test between ChIP data and predefined motifs). This can be explained by the fact that ChIP data are more reliable than motif scores as an indicator of co-regulation, since ChIP data explicitly measure the binding affinities of gene promoters to TFs. There are also other advantages in using ChIP-chip data as features. For example, the number of features is well bounded by the number of TFs, which is estimated to be around 200 in yeast, compared to thousands of putative binding motifs. As a result, the correlations and redundancies among features are low in ChIP-chip data, which makes it possible to build simple models for better interpretability. Furthermore, the models built with ChIP-chip data directly suggest regulatory relations between TFs and genes, which can be used to construct regulatory networks. Our results indicate, however, that the conditions of microarray experiments and ChIP assays must be considered with special care.
In cell cycle experiments, the models built with a combination of predefined motifs and ChIP data have substantially better kappa values than those with ChIP data or predefined motifs alone (Figure 3D). The p-value is 10^-14 in a paired t-test for results of ChIP data (0.27) and combined features (0.34), and 10^-12 for predefined motifs (0.29) and combined features (0.34). In stress experiments, the kappa values for models based on combined features (0.35) are only marginally better than those for predefined motifs (0.33), due to the limited usefulness of ChIP data as pointed out above. Nevertheless, the models using combined features perform significantly better (improvement of kappa > 0.1) in 33 cases, while the models using predefined motifs are never better by more than 0.1 in kappa (Figure 3E). A paired t-test yields a p-value of 10^-5, which is still statistically significant. Putative binding motifs and ChIP data represent two distinct and complementary sources of evidence of regulation. Therefore, a combination of the two can provide better discriminative power than either does alone.
Since the role of decision tree models is exploratory, there is no need to be restricted to any specific type of features. Indeed, one should try a variety of them to identify the most relevant features for a particular set of genes. A good strategy in choosing feature types is probably as follows: first use a combination of TF binding data and predefined motifs for modeling; if a model with good cross-validation accuracy cannot be found, then consider using automatically identified motifs. Obviously, when predefined motifs or TF binding data are unavailable or insufficient, using automatically identified motifs as features is the only option.
To analyze the stability of decision tree models, we introduced noise into the positive training data and tested whether the models could separate the true positives from the noise. For each microarray experiment we randomly selected 5 to 50 negative genes and deliberately mislabeled them as positive. The original data and the additional fake positive instances were then combined to build a decision tree model. We compared the results of the new models to the results of the original models in ten-fold cross-validations and counted the numbers of losses, rescues, and false positives (FPs). Here, a loss is a positive instance that is mis-classified in the noisy data but correctly classified in the original data. A rescue is a positive instance that is correctly classified in the noisy data but mis-classified originally. An FP corresponds to a newly added noise gene that is classified as a (false) positive gene.
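Counting losses, rescues, and FPs amounts to comparing two prediction maps gene by gene; a minimal sketch with our own naming:

```python
def stability_counts(orig_pred, noisy_pred, true_positives, noise_genes):
    """Compare predictions before and after noise injection.

    orig_pred, noisy_pred: dicts mapping gene -> predicted label
    ('pos' or 'neg'); noisy_pred also covers the injected noise genes.
    loss:   true positives correct originally, wrong on noisy data
    rescue: true positives wrong originally, correct on noisy data
    fp:     injected noise genes classified as positive
    """
    loss = sum(1 for g in true_positives
               if orig_pred[g] == 'pos' and noisy_pred[g] == 'neg')
    rescue = sum(1 for g in true_positives
                 if orig_pred[g] == 'neg' and noisy_pred[g] == 'pos')
    fp = sum(1 for g in noise_genes if noisy_pred[g] == 'pos')
    return loss, rescue, fp
```

Loss + rescue then measures how much the model's predictions drift under noise, and fp measures how much of the injected noise the model fails to filter out.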
As shown in Table 2, the models built with predefined motifs are more robust than the models built with auto motifs. Even at the 100% noise level (50 randomly selected genes added), almost 90% of the introduced noise genes can be correctly filtered out by the models built with predefined motifs. In comparison, the models built with auto motifs can only filter out ≈ 75% of the noise genes. This is because the motif finding program was distracted by the wrongly labeled genes, so that the discovered motifs became less effective at discriminating the true positives from the false ones.
Furthermore, the predictions made by the models built with predefined motifs are more stable, as reflected by the smaller values of loss + rescue in Table 2. The addition of a small amount of noise may result in dramatic changes to the predictions of the models built with auto motifs. When as few as 5 noisy instances were added, the models based on auto motifs changed their predictions by ≈ 58%, while the models based on predefined motifs changed their predictions by only ≈ 30%.
Functional enrichment in correctly predicted positive genes. Predefined motifs: the models built with predefined motifs. Auto motifs: the models built with auto motifs. Up: the models built for up-regulated genes. Down: the models built for down-regulated genes. #Gp: the number of Gp genes (positive genes that are predicted as positive by a model). #Gn: the number of Gn genes (positive genes that are predicted as negative by a model). The remaining three columns give the numbers of enriched functional categories in the sets of Gp, Gn, and Gp' genes, respectively (see text). The sample sizes for up- and down-regulated genes in cell cycle are 56 and 59, respectively, and 167 and 164 in stress conditions. The ranges represent individual 95% confidence intervals for the means.
[Table 3 body; each row gives #Gp, #Gn, and the numbers of enriched categories in Gp, Gn, and Gp'. The row labels (feature type × regulation direction × condition set) are not recoverable.]
17.2 ± 2.2 | 27.1 ± 2.7 | 15.1 ± 2.9 | 8.2 ± 2.6 | 12.3 ± 3.5
14.0 ± 2.1 | 31.0 ± 2.5 | 14.5 ± 3.7 | 9.6 ± 3.4 | 8.4 ± 3.3
16.4 ± 1.0 | 33.1 ± 1.1 | 13.4 ± 2.3 | 10.6 ± 1.8 | 10.3 ± 1.9
21.0 ± 2.0 | 30.7 ± 1.7 | 17.1 ± 2.4 | 7.8 ± 1.7 | 14.7 ± 2.3
24.9 ± 1.8 | 19.3 ± 1.8 | 16.9 ± 3.2 | 7.9 ± 2.5 | 13.9 ± 3.2
23.6 ± 1.9 | 21.4 ± 1.8 | 15.6 ± 3.7 | 8.4 ± 2.9 | 11.8 ± 3.5
27.0 ± 0.8 | 22.6 ± 0.8 | 16.2 ± 2.3 | 9.1 ± 1.9 | 15.2 ± 2.4
29.4 ± 1.3 | 22.2 ± 1.1 | 18.2 ± 2.3 | 6.6 ± 1.5 | 16.4 ± 2.4
Furthermore, we tested whether a similar degree of functional enrichment can be achieved without decision tree models. Suppose a decision tree model predicted 25 out of 50 positive genes as true positives, and GO analysis showed that these 25 genes were functionally more coherent than the remaining 25 genes. We want to know whether there is another way to select 25 genes with a higher degree of functional enrichment than those predicted by decision trees. For example, we may simply pick the genes that are top ranked according to their expressions. We used the following procedure to test this hypothesis. For each set of up- or down-regulated genes, A, which has p predicted true positive genes, we selected the p top-ranked genes from A based on the absolute log ratios of their expressions. We denoted this set of genes as Gp', which has the same number of genes as Gp, and counted the number of enriched GO categories in it. The average results over all microarray experiments are shown in the last column of Table 3. As shown, the genes in Gp contain more enriched functional categories than the genes in Gp'. A paired t-test between the two counts, combining all cases, has a p-value < 10^-16. Therefore, we conclude that models built with either auto motifs or predefined motifs are more effective in selecting functionally related genes than the naive differential expression model.
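The Gp' control selection described above (pick the p most strongly differentially expressed genes by absolute log ratio) can be sketched as follows; the gene identifiers and log ratios are hypothetical:

```python
# Hypothetical sketch of the G_p' control: select the p genes with the
# largest absolute expression log ratios. Gene IDs and values are made up.

def select_gp_prime(log_ratios, p):
    """Return the p gene IDs with the largest absolute log ratios.

    log_ratios: dict mapping gene ID -> expression log ratio.
    """
    ranked = sorted(log_ratios, key=lambda g: abs(log_ratios[g]), reverse=True)
    return ranked[:p]

ratios = {"g1": 2.4, "g2": -3.1, "g3": 0.5, "g4": 1.9, "g5": -0.2, "g6": 2.8}
print(select_gp_prime(ratios, 3))  # ['g2', 'g6', 'g1']
```

The enrichment count N(Gp') is then obtained by running the same GO analysis on this set as on Gp.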
Several previous studies have shown that decision trees are a valuable tool for analyzing the transcriptional regulation of gene expression [11–13, 20]. Although there are many publicly available software packages for building decision trees (for example, [21, 25–26]), they are not specifically designed for biological applications and are not convenient for biologists to use. Therefore, to make good use of the results from this study, we designed and implemented a user-friendly web server and interface for building decision trees to analyze transcriptional regulation. The server integrates several software components that allow the user to select from different types of features and to interact with the constructed models.
The interface for user inputs is shown in Figure 4A. To submit a job to CAGER, the user is first asked to provide positive and negative genes, either as ORF identifiers for one of the supported organisms, which currently include the yeast S. cerevisiae and Arabidopsis thaliana, or as promoter sequences in FASTA format. These can be pasted into the web form or uploaded from files on the user's local computer. Given ORF identifiers, the promoter sequences in a supported genome are retrieved automatically from a local database. Second, the user specifies the feature sets, which may be ChIP-chip data for yeast S. cerevisiae, predefined motifs from Pilpel et al., motifs automatically identified from the promoter sequences by AlignACE, or a combination of them. The user can also specify whether features should be identified from negative instances as well as from positive instances, whether feature filters should be used, and the minimum number of instances per node. Finally, the user fills in his or her email address and submits the job. After a job is completed, the user is notified by email with instructions on how to access the results.
On the output page, a decision tree is displayed as a portable network graphics (PNG) image, along with statistics for the tree from the training and cross-validation processes (Figure 4B). The text inside an internal node of the tree gives the name of a feature, and the text inside a leaf node shows the predicted label for the genes in that node, as well as the number of supporting and counter-instances for the prediction. Each node of the decision tree can be clicked to show further details. For example, if an internal node contains a feature derived from ChIP-chip data for a TF in yeast, clicking on it leads the user to SGD for detailed information about the TF. If the feature is a binding motif, a click opens a new window displaying the sequence logo and the position-specific weight matrix of the motif. A click on a leaf node brings up a window displaying the identifiers of the positive and negative genes in the leaf.
Here we show an example of using the web server to study the regulation of genes expressed in response to abscisic acid (ABA) in Arabidopsis. ABA is a phytohormone that plays important roles in many stages of plant life, such as seed development and stress responses (see  for a review). Seki et al.  identified about 250 genes in Arabidopsis that are induced at least 5-fold after ABA treatment. Since Arabidopsis is one of the supported organisms in our current server, its promoter sequences are available in our database.
Therefore, we provided as positive instances the list of ORF names corresponding to the up-regulated genes, and as negative instances a list of randomly selected genes that are not up-regulated. Promoter sequences were retrieved for 152 of the positive genes. We used auto motifs identified from these promoters as features. The decision tree and the sequence logos for the most interesting motifs are shown in Figure 5. AlignACE identified a total of 37 motifs with default parameters, five of which were selected by the decision tree (Figure 5A). Three motifs, ace_m2, ace_m9, and ace_m18 (Figure 5B), together correctly classified 35 (= 13 + 6 + 22 - 6) positive genes (the rightmost three leaves labeled with 'p'), while as many as 107 positive genes were classified as negative. This may be because ABA triggers many downstream responsive genes, many of which are not co-regulated with the direct targets of ABA. The motif ace_m2 has a conserved CACGTG core, which is very close to the known ABA-responsive elements (ABREs) identified in many other plants [50–56]. It is known that an ABRE often functions together with a coupling element (CE), but the consensus sequence of CEs in Arabidopsis is elusive. Here, the decision tree suggests that ace_m18 and ace_m9 may be two possible CEs. Motif ace_m18 has a CGTGTG core that partially resembles the CE in the rice OSEM gene (GACGCGTGTC) , and CE3 in the barley HVA1 gene (ACGCGTGTCCTC) . Note that ace_m18 has a weak second copy of the GTGTG core, which may be important for enhanced binding activity. Motif ace_m18 is also remotely similar to CE3 in the maize RAB28 gene (ACGCGCCTCCTC) , although in maize the CGTGTG core is replaced by CGCGCC. Motif ace_m9 is a weak but significant motif that consists of a series of G's separated by one or two A's. This motif is not similar to any known motif in the database of Plant Cis-acting Regulatory DNA Elements (PLACE) , and therefore may be a new motif for Arabidopsis.
In this study, we compared the effects of using different features to model the transcriptional regulation of gene expression with classification methods. We considered features based on ChIP-chip data, predefined motifs, automatically identified motifs, and their combinations. We found that TF binding data from ChIP assays are effective in modeling gene expression only under the same conditions in which the ChIP-chip experiments were conducted. Our results also indicate that many previous studies may have over-estimated the cross-validation accuracies of models built with automatically identified motifs. Furthermore, the models built with automatically identified motifs are not robust with respect to noise, compared with those built with predefined motifs. A combination of ChIP-chip data with predefined motifs appears to be superior to either one applied separately.
We also showed that the positive genes correctly predicted by decision tree models are more functionally related than those that are not correctly predicted. Therefore, decision tree models can be used to refine putative regulons and to detect new genes of a regulon. Simonis et al.  showed this by testing on known regulons, while we confirmed it by analyzing the functional enrichment of predicted regulons. We presented a web service that integrates motif finding and decision tree learning for analyzing the transcriptional regulation of gene expression. Its usefulness was illustrated with an example of studying the regulation of ABA-responsive genes in Arabidopsis. We identified two motifs that are similar to known ABA-responsive elements (ABREs) and coupling elements (CEs), and suggested a new CE, which may deserve further study. As demonstrated by the example, the web interface combines a number of software components and hides most specific parameters from the user, while still allowing some flexibility. The graphical representation of a decision tree makes it easy to visualize and extract significant regulatory rules. We believe that it can significantly reduce the learning curve for those who are interested in applying classification methods to analyzing transcriptional regulation, and will be a useful tool to facilitate the discovery of transcriptional regulatory networks by combining multiple information sources.
Yeast gene expression data were downloaded from Expression Connection . ABA-induced gene expression data for Arabidopsis were obtained from . ChIP-chip data for yeast transcription factors were downloaded from . Putative binding motifs for the yeast genome were obtained from . Upstream sequences of yeast and Arabidopsis ORFs were obtained from RSA-tools  by retrieving up to 800 and 1500 bases, respectively, from the translation start sites. Overlaps with other ORFs were truncated, and upstream sequences shorter than 100 bases were discarded. For yeast ORFs, the upstream sequences were used as promoters. For Arabidopsis ORFs, the upstream sequences were used as inputs to a promoter prediction program, TSSP , to predict transcription start sites (TSSs). Promoters of Arabidopsis ORFs were defined as the region from 350 bases upstream to 50 bases downstream of the predicted TSSs, where most known ABREs and CEs have been discovered.
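As a rough illustration of the promoter-definition rules above, the following sketch computes an Arabidopsis-style promoter window around a predicted TSS; the coordinates are hypothetical, and the truncation logic is a simplification of the actual pipeline:

```python
def promoter_window(tss, upstream_orf_end=None, up=350, down=50):
    """Return (start, end) of a promoter around a predicted TSS, or None.

    tss: coordinate of the predicted transcription start site (hypothetical).
    upstream_orf_end: end coordinate of an overlapping upstream ORF, used to
        truncate the window (None if there is no overlap).
    Windows whose remaining upstream part is shorter than 100 bases are
    discarded, mirroring the filtering rule described above.
    """
    start = tss - up
    if upstream_orf_end is not None:
        start = max(start, upstream_orf_end)
    if tss - start < 100:  # too little upstream sequence left
        return None
    return (start, tss + down)

print(promoter_window(1000))                        # (650, 1050)
print(promoter_window(1000, upstream_orf_end=950))  # None
```

For yeast, the retrieved upstream sequence itself (up to 800 bases, with the same truncation and length filter) serves as the promoter, so no TSS prediction step is needed.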
Given a training set, features were ranked according to the restricted information gain that can be achieved by using each individual feature to separate positive and negative instances. Suppose that there are p positive instances and n negative instances, and that by selecting a splitting point x on feature f_i, the numbers of positive and negative instances in the left (f_i < x) and right (f_i ≥ x) child nodes are p_1, n_1, p_2, and n_2, respectively. The restricted information gain due to feature f_i with respect to this split can be calculated by:

I'_gain(f_i, x) = I_gain(f_i, x) if p_2 / (p_2 + n_2) > p / (p + n), and I'_gain(f_i, x) = 0 otherwise,
where I_gain is the normal information gain computed from entropies . The restriction on the information gain calculation ensures that a selected feature is more over-represented in positive instances than in negative instances, which is necessary because negative instances were chosen randomly and therefore should not be co-regulated. Features were ranked by their restricted information gains, and the top d features were selected for decision tree building (d is defined by the user and defaults to 10 in this study). Decision trees were built with the C4.5 algorithm , except that the restricted information gain was used instead of I_gain.
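A minimal sketch of the restricted information gain under one reading of the restriction (the split's gain is kept only when positives are over-represented in the feature-present child, f_i ≥ x; the paper's exact test may differ):

```python
from math import log2

def entropy(p, n):
    """Binary entropy of a node holding p positive and n negative instances."""
    if p == 0 or n == 0:
        return 0.0
    total = p + n
    fp, fn = p / total, n / total
    return -fp * log2(fp) - fn * log2(fn)

def restricted_info_gain(p1, n1, p2, n2):
    """Information gain of a split into (p1, n1) and (p2, n2), zeroed out
    unless positives are over-represented in the right (feature-present)
    child relative to the parent node (our assumed restriction)."""
    p, n = p1 + p2, n1 + n2
    total = p + n
    # Restriction: p2 / (p2 + n2) must exceed p / total (cross-multiplied
    # to avoid dividing by zero on an empty child).
    if p2 * total <= p * (p2 + n2):
        return 0.0
    return entropy(p, n) \
        - (p1 + n1) / total * entropy(p1, n1) \
        - (p2 + n2) / total * entropy(p2, n2)
```

A perfect split with all positives on the feature-present side scores the full gain, while the mirror-image split (positives only where the feature is absent) is zeroed out by the restriction.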
Let TP, TN, FP, and FN be the numbers of true positive, true negative, false positive, and false negative predictions made by a binary classifier, respectively, and let N = TP + TN + FP + FN. Sensitivity is defined as TP / (TP + FN), and specificity as TN / (TN + FP). The kappa statistic κ of the classifier is defined as κ = (A - C) / (1 - C), where A = (TP + TN) / N is the percentage of correctly predicted instances, and C is the expected percentage of instances that a classifier can predict correctly by chance.
Note that when A = 100% and C ≠ 100%, κ = 1.0, corresponding to a perfect classifier; when A ≤ C, κ ≤ 0, meaning that the classifier does not perform better than random guessing.
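The evaluation measures above can be computed as follows; the chance-agreement term C is taken to be the standard Cohen's kappa marginal formula, which we assume matches the paper's "expected percentage correct by chance":

```python
def sensitivity(tp, fn):
    """TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP)."""
    return tn / (tn + fp)

def kappa(tp, tn, fp, fn):
    """Kappa statistic (A - C) / (1 - C), with A the observed accuracy and
    C the chance accuracy estimated from the marginals (an assumption)."""
    n = tp + tn + fp + fn
    a = (tp + tn) / n
    c = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (a - c) / (1 - c)

print(kappa(50, 50, 0, 0))    # perfect classifier: 1.0
print(kappa(25, 25, 25, 25))  # random guessing: 0.0
```

The two printed cases illustrate the boundary behavior described in the text: a perfect classifier reaches κ = 1, and a classifier no better than chance gets κ ≤ 0.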
GO annotations were retrieved from SGD (version September 2004) . GO functional enrichment was calculated with the cumulative hypergeometric distribution. The GO::TermFinder Perl module  was used to search for significantly enriched functional categories with a false discovery rate (FDR) < 0.05 .
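A stdlib-only sketch of the cumulative hypergeometric test underlying the GO enrichment analysis (the counts are hypothetical, and GO::TermFinder additionally applies the FDR correction across all tested categories):

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): probability of seeing k or more annotated genes when
    drawing n genes from N in total, of which K carry the annotation."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / comb(N, n)

# Hypothetical counts: 8 of 25 selected genes fall into a GO category
# that covers 50 of 6000 annotated genes.
p = hypergeom_pvalue(8, 50, 25, 6000)
print(p < 0.05)  # True: the category is strongly enriched
```

In practice this p-value is computed for every GO category attached to the gene set, and only categories surviving the FDR < 0.05 threshold are reported.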
Motifs were identified with AlignACE  and scanned against promoter sequences with ScanACE . Decision trees were built with the J48 program, a Java implementation of the C4.5 decision tree learning algorithm , included in the WEKA machine learning package . Decision trees were drawn with the dot program in Graphviz 1.0  and displayed with webdot from the same package. Sequence logos of motifs were drawn with the seqlogo program . The CAGER web service was implemented in Perl and runs on an Apache web server with dual AMD Athlon 1600 MHz CPUs and 2 GB of RAM.
ORF: open reading frame
DEG: differentially expressed gene
FDR: false discovery rate
This research was supported in part by NSF grants IIS-0196057 and EIA-0113618 under the ITR program, and a grant from Monsanto Corporation. We thank two anonymous reviewers for their very useful comments. JR also wishes to thank Alexander V Loguinov for the EDGE software.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.