A cis-regulatory logic simulator
© Zeigler et al; licensee BioMed Central Ltd. 2007
Received: 17 November 2006
Accepted: 27 July 2007
Published: 27 July 2007
A major goal of computational studies of gene regulation is to accurately predict the expression of genes based on the cis-regulatory content of their promoters. The development of computational methods to decode the interactions among cis-regulatory elements has been slow, in part, because it is difficult to know, without extensive experimental validation, whether a particular method identifies the correct cis-regulatory interactions that underlie a given set of expression data. There is an urgent need for test expression data in which the interactions among cis-regulatory sites that produce the data are known. The ability to rapidly generate such data sets would facilitate the development and comparison of computational methods that predict gene expression patterns from promoter sequence.
We developed a gene expression simulator which generates expression data using user-defined interactions between cis-regulatory sites. The simulator can incorporate additive, cooperative, competitive, and synergistic interactions between regulatory elements. Constraints on the spacing, distance, and orientation of regulatory elements and their interactions may also be defined and Gaussian noise can be added to the expression values. The simulator allows for a data transformation that simulates the sigmoid shape of expression levels from real promoters. We found good agreement between sets of simulated promoters and predicted regulatory modules from real expression data. We present several data sets that may be useful for testing new methodologies for predicting gene expression from promoter sequence.
We developed a flexible gene expression simulator that rapidly generates large numbers of simulated promoters and their corresponding transcriptional output based on specified interactions between cis-regulatory sites. When appropriate rule sets are used, the data generated by our simulator faithfully reproduce experimentally derived data sets. We anticipate that using simulated gene expression data sets will facilitate the direct comparison of computational strategies to predict gene expression from promoter sequence. The source code is available online and as additional material. The test sets are available as additional material.
Transcriptional regulation of genes is controlled largely through the concerted action of combinations of cis-regulatory sites in the promoters and surrounding regulatory DNA of genes. The interactions between cis-regulatory sites can be complex and may include synergistic, competitive, and amplifying interactions, and are often influenced by the spacing and orientation of the sites relative to each other and to the transcriptional start site [4, 5]. The complexity of the "cis-regulatory code" makes predicting gene expression from promoter sequence a challenging problem.
Computational approaches for determining the cis-regulatory code include multiple regression models, Bayesian networks, logic operators, and machine learning methods. Though their mathematical frameworks differ, all of these approaches use large-scale transcriptional data (usually microarray-based expression profiling data) and attempt to correlate expression patterns with the presence or absence of computationally predicted cis-regulatory motifs. Currently, we do not have good ways to compare the performance of these different approaches to each other or to new approaches being developed. A serious problem in comparing these methods is the lack of robust test data in which the cis-regulatory interactions underlying the expression data are accurately known. We need data in which the "true" answer is known if we are to compare methodologies. To address this limitation, we built a rule-based simulator to create test data sets.
Simulators are playing a useful role in reconstructing gene regulatory networks (GRNs). A GRN models the regulatory connections between genes, as opposed to the interactions between cis-regulatory sites in a promoter. Because the true GRN of a cell is not known, artificially created GRNs are used to evaluate the accuracy of algorithms that attempt to determine network architecture and dynamics. GRN simulators provide test datasets [11, 12], which in turn are used to assess the performance of network reconstruction techniques. We anticipate that gene expression simulators will play a similar role in the development of computational approaches to decipher the interactions between cis-regulatory sites.
We present a regulatory rule simulator that generates random promoters and produces expression data based on user-defined interactions between cis-elements. Whereas a GRN simulator attempts to create a web of genes connected in a biologically relevant manner, our simulator generates promoter regions and predicts the expression from those promoters. We also present test datasets, created by the simulator, which can be used to assess the performance of algorithms that attempt to determine underlying regulatory rules. The promoter generator and simulator, named ReLoS (cis-Regulatory Logic Simulator), are available for download (see additional files 4 and 5). A web interface is also available. The test data sets are available in additional file 1.
Results and Discussion
Simulating regulatory rules
With Relos, a user can encode a wide variety of cis-regulatory rules. The rules are defined in an XML simulation file to make the attributes of the simulation, including the rules, legible to the user. A single rule in a rule-set is defined by the cis-regulatory sites involved, the conditions required by the rule, conditions excluded by the rule, context dependencies for each condition, and the output expression generated by that rule. Logical relationships such as OR, NOT and AND can be expressed in describing interactions between sites. Constraints on the spacing, orientation, and distance of sites from each other can be incorporated into any rule. Rule outputs may be combined in linear and non-linear ways (see Methods). A rule may simply specify the additive contribution of a particular regulatory element, or it may determine the parameters of an epistatic (e.g., cooperative, competitive, or synergistic) interaction between elements. Promoters are parsed by each rule in the order in which the rules are specified. When a rule matches a promoter, however, that rule may specify a set of rules that should be skipped in the analysis of the matched promoter.
Promoter processing by rules is delegated to the "analyzer". The analyzer is responsible for determining whether a rule will affect a promoter, based on the constraints specified for the rule. The analyzer is also responsible for specifying the effect of a rule on the expression of a promoter. Analyzers serve as the central point of extensibility in Relos. For each rule, it is possible to specify a custom analyzer. Relos comes with a regular expression analyzer, which modifies promoter expression if the regular expression is matched. Another analyzer allows user-defined mathematical functions to be used to determine rule outputs. For example, a Hill function [16, 17] might be used to describe cooperativity between sites. The flexibility inherent in the design of Relos allows users to simulate virtually any mode of regulation among cis-regulatory sites.
Real expression data are bounded. At the lower bound, a cell cannot express fewer than zero copies of a gene. There is also an upper limit of detection in any experimental setup, and an upper limit to the level of RNA that can be produced when a promoter is fully occupied by the transcriptional machinery and transcribing at the maximum rate. These constraints produce sigmoid expression patterns. For this reason, Relos allows users to sigmoidally transform the output data.
Users may explicitly tell Relos to transform the data. In this case, Relos uses a sigmoid transformation centered on the average expression for the simulation (see Methods). Using the simulation expression mean to center the transformation allows rule-sets to be compared in terms of the variation present in the parsed promoters. Simulations with large variance will show a spread of values between zero and one. Simulations with little variance will, when transformed, cluster around the value of 0.5. One consequence of the mean-dependent transformation is that it is impossible to generate a transformed dataset in which all expression is either "on" or "off", since datasets with very little variation will result in midline expression when transformed.
Users may therefore instead specify a rule at the end of the pipeline that employs a custom analyzer to transform the data. Relos comes with a SigmoidalTransform analyzer (see Methods) that can be used for this purpose, but users may also provide their own transformations. The SigmoidalTransform analyzer uses four parameters (see Methods) to adjust the shape and scale of the transformation. These parameters are independent of the simulation dataset and determine an absolute scale of expression onto which all rule-sets are mapped. By using a consistent set of parameters, users can compare rule-sets with regard to their strength of expression and compare variances according to where the mean lies in the absolute expression scale. Since this transformation does not depend on the dataset, the absolute scale is arbitrarily determined by the choice of parameters, and users should be careful to use rules consistent with the scale determined by the parameters.
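The behavior of the mean-centered transformation can be illustrated with a minimal sketch. The exact formula Relos uses is not reproduced in this text, so the plain logistic curve and the `steepness` parameter below are assumptions chosen only to show why low-variance simulations collapse toward 0.5.

```python
import math
from statistics import mean


def mean_centered_sigmoid(values, steepness=1.0):
    """Map raw expression values into (0, 1) with a logistic curve
    centered on the mean of the simulation.

    The logistic form and `steepness` are illustrative assumptions,
    not the formula used by Relos.
    """
    center = mean(values)
    return [1.0 / (1.0 + math.exp(-steepness * (v - center))) for v in values]


# A simulation with little variance clusters around 0.5 after transformation.
low_variance = mean_centered_sigmoid([9.9, 10.0, 10.1])

# A simulation with large variance spreads out toward 0 and 1.
high_variance = mean_centered_sigmoid([0.0, 10.0, 20.0])

print(low_variance)
print(high_variance)
```

Because the center is recomputed from each dataset, two rule-sets with very different absolute expression levels can produce identical transformed values, which is the motivation for the dataset-independent alternative described above.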
In addition to rules, their analyzers and constraints, and transformation parameters, the XML simulation file contains other adjustable attributes for the simulation. For example, after the promoters have been interpreted using the current rule set, Gaussian noise is added by the simulator with a user-defined standard deviation. Relos is also capable of generating random promoters based on user-defined properties, such as promoter length and cis-regulatory elements and their frequencies, and of outputting promoters in either FASTA or Relos format. These synthetic promoters can be used directly by the simulator. For more details, see Methods.
Where x is the input and ϕ and n are parameters used to adjust the location and steepness of the transition. Hill functions have been used to model biological cooperativity in proteins such as hemoglobin and in cis-regulatory interactions. In Figure 2d, x is the number of cooperative elements, n is 3, and ϕ is 5. Since the expression is a function of the number of A-elements, and the number of A-elements is distributed according to the Poisson distribution, the expression pattern should be a function of a Poisson distribution. As expected, the simulator output in Figure 2d follows a Poisson distribution with an elongated right tail. This tail represents the high expression of promoters with multiple cooperative sites. See additional file 2 for the rule-sets used to create Figure 2.
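As an illustration of how such a function responds to the number of cooperative elements, the sketch below uses the standard Hill form x^n / (ϕ^n + x^n) with the Figure 2d parameters (n = 3, ϕ = 5). The equation itself is not reproduced in this text, so this particular form, and any output scaling Relos may apply, are assumptions.

```python
def hill(x, phi=5.0, n=3.0):
    """Standard Hill form x^n / (phi^n + x^n).

    Assumed form: phi sets the half-maximal input and n the steepness.
    """
    return x**n / (phi**n + x**n)


# Response for promoters carrying 0, 2, 4, ... cooperative elements.
for count in range(0, 11, 2):
    print(count, round(hill(count), 3))
```

With these parameters the response stays low for promoters with few cooperative elements and rises steeply once the count approaches ϕ, which is the pattern behind the elongated right tail described above.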
The main motivation for creating the simulator was to synthesize expression datasets for which we know the underlying regulatory rules. These datasets will be necessary to compare the accuracy of different methods that infer cis-regulatory rules because there are no experimental datasets for which the true underlying relationships between cis-regulatory sites are known. We therefore created ten test datasets using different rule-sets. The test datasets vary in the number and types of rules and in the complexity of the rule-set. We have made the datasets and rule sets used to generate them (see additional file 1) available in both Relos format and FASTA format. We anticipate that the availability of test datasets will allow researchers to evaluate their own methods and compare their methods against commonly used algorithms that deduce regulatory rules from expression data. While the test data we provide will be useful for researchers who want to get started right away testing their rule-finding algorithms, we emphasize that the real power of Relos is the capability it provides to quickly produce custom data sets for algorithm testing. Researchers can now rapidly create their own test datasets to compare the dependency of any method on any particular parameter (number of sites, types of interactions, noisy data).
Comparison to experimental data
One noticeable discrepancy between the Relos data and the Beer and Tavazoie data was the noise function. Relos uses Gaussian noise, scaled by the noise-less expression value. This results in a smaller absolute level of noise around expression values close to zero. The Beer and Tavazoie data does not appear to follow this trend; the absolute level of noise around zero is still quite large. Accordingly, we wrote an unscaled noise analyzer that applies unscaled Gaussian noise to simulated data.
We also used the same rule sets defined above to analyze Relos-generated promoters. Randomly generated promoters were created based on the frequency distributions of the cis-regulatory sites that comprised the five modules we simulated. When the rule set was applied to these computationally derived promoters the five expression patterns from Beer and Tavazoie were again recapitulated (see additional file 6). Randomly generated promoters, filtered through Relos, faithfully replicate the observed expression patterns in real data.
We sought to create a tool that simulates expression from promoters based on cis-regulatory logic. Because there are examples of additivity, synergism, cooperativity, and competition between regulatory sites we created ways to simulate these interactions in a straightforward manner. The full spectrum of interactions between regulatory sites is not known. We recognize that our knowledge of cellular regulation is still relatively limited and that new types of interactions may appear. We therefore did not want to be limited by preconceived models. With its rule-pipeline and analyzer plug-in architecture, Relos allows for virtually any regulatory model to be implemented.
The ease of specifying regulatory models and the speed with which data can be generated will allow algorithms that predict gene expression from promoter sequence to be comprehensively tested. Algorithms that attempt to determine regulatory logic rules from expression and sequence data can be analyzed for their performance with respect to noise, the number of underlying rules, and the complexity of the interactions between the rules. Furthermore, researchers can study the size of the dataset required for an algorithm to recapitulate the rules and the ability of the algorithm to recapitulate the specified rules, as opposed to alternate rule sets which also correlate with the data. We have used Relos to generate a test dataset for use in such studies. We anticipate that the ability to rapidly generate unlimited quantities of simulated expression data will speed the design and comparison of algorithms to decode the cis-regulatory logic that underlies real patterns of gene expression.
The final arbiter of the performance of cis-regulatory rule-finding algorithms will be how well they capture the trends in real data. Algorithms that perform well on synthetic data sets, such as those produced by Relos, will not necessarily perform well on biological data. Because experimentally derived data is still of limited quantity and variable quality, extensive testing on synthetic data is the best way to understand the strengths and limitations of specific rule-finding methods. Testing and training on synthetic data avoids overfitting rule finders to the limited quantities of real data that are now available. Testing rule-finding methods on synthetic data sets will clearly be one of the paths forward on the way to decoding the interactions between cis-regulatory sites.
Relos generates a promoter as a set of elements. Each promoter element is associated with a "cis-element" and an orientation. Each cis-element has an identifier (e.g., A or Oct4), a sequence, and a frequency (expected occurrence). The sequence is only used for output purposes; all built-in rule processing is done on promoter elements.
Relos supports two modes of promoter generation: exact length and expected length. In exact length mode, a cis-element is selected from the user-specified list of elements by a roulette wheel selection process. The selected element is added to the promoter starting from the position furthest upstream of the transcription start site. The element is added in a sense or anti-sense orientation with equal probability. Element selection and addition continues until the number of elements added equals the user-specified length. Relos does not insert spacer elements between cis-elements. Rather, all cis-elements are treated as spacer elements unless a rule is defined which uses the cis-element in a manner inconsistent with a spacer element (see Rule Specification below).
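The exact length procedure can be sketched as follows. This is a minimal illustration rather than the Relos implementation: the function name, the dictionary of frequencies, and the tuple representation of a promoter element are all hypothetical.

```python
import random


def generate_exact_length(frequencies, length, rng=random):
    """Build a promoter of exactly `length` elements.

    `frequencies` maps a cis-element identifier to its relative frequency.
    Each element is chosen by roulette wheel (frequency-weighted) selection
    and given a sense or antisense orientation with equal probability.
    Index 0 of the result is the position furthest upstream of the
    transcription start site.
    """
    names = list(frequencies)
    weights = [frequencies[name] for name in names]
    promoter = []
    for _ in range(length):
        name = rng.choices(names, weights=weights, k=1)[0]
        orientation = rng.choice(["sense", "antisense"])
        promoter.append((name, orientation))
    return promoter


example = generate_exact_length({"A": 0.5, "B": 0.3, "C": 0.2}, 8, random.Random(1))
print(example)
```

No dedicated spacer is inserted here either; any element that no rule refers to simply acts as a spacer.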
Where $D_i$ is the transformed frequency of the $i$-th element, $d_i$ is the non-transformed frequency of the $i$-th element, $E$ is the expected promoter length, and $n$ is the number of elements. This results in a distribution of cis-elements that includes a "stop" pseudo-element with probability $1/E$. The distribution sums to one and preserves the relative probabilities of the user-specified elements. Promoter elements are added as in the exact length procedure until the stop element is selected.
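A sketch of the expected length procedure is given below. The transformation equation is not reproduced in this text, so the form used here, $D_i = (1 - 1/E)\, d_i / \sum_{j=1}^{n} d_j$, is an assumption: it is simply the rescaling that satisfies the three stated properties (stop probability $1/E$, total probability one, relative frequencies preserved). The function names and the `STOP` sentinel are hypothetical.

```python
import random

STOP = "<stop>"  # hypothetical sentinel; assumed not to clash with a real element id


def with_stop_element(frequencies, expected_length):
    """Rescale element frequencies and add a stop pseudo-element.

    Assumed form: D_i = (1 - 1/E) * d_i / sum(d), with P(stop) = 1/E.
    """
    total = sum(frequencies.values())
    keep = 1.0 - 1.0 / expected_length
    dist = {name: keep * freq / total for name, freq in frequencies.items()}
    dist[STOP] = 1.0 / expected_length
    return dist


def generate_expected_length(frequencies, expected_length, rng=random):
    """Add elements by roulette wheel selection until the stop element is drawn."""
    dist = with_stop_element(frequencies, expected_length)
    names = list(dist)
    weights = list(dist.values())
    promoter = []
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        if name == STOP:
            return promoter
        promoter.append((name, rng.choice(["sense", "antisense"])))


dist = with_stop_element({"A": 2.0, "B": 1.0, "C": 1.0}, expected_length=8)
promoter = generate_expected_length({"A": 2.0, "B": 1.0, "C": 1.0}, 8, random.Random(3))
print(dist)
print(len(promoter), promoter)
```

Promoter lengths produced this way vary from run to run, since the stop element can be drawn at any step.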
Rules are specified in an XML-based format defined by the expression_rules.dtd document type definition file. Each rule is defined in terms of the cis-elements the rule uses, an optional custom analyzer to use in place of the default Relos analyzer, the "output" (the amount by which the rule will affect the current expression level for the promoter), and the "operation" (the way in which the output will affect the current expression). Rules may also define precluded rules. Precluded rules are those that are prevented from operating on a promoter should the precluding rule match. Rules using the default analyzer, or custom analyzers that rely on the default analyzer, may specify one or more conditions that determine whether a particular element on the promoter "matches" the rule. Conversely, these conditions may "exclude" elements on the promoter that should not match the rule.
Conditions are comprised of the cis-element(s) to consider, the allowed position(s) and required orientation of the element(s), and zero or more contexts. Each context defines a cis-element that must appear in the promoter with the element under consideration in the condition. Contexts may include specification of the spacing between the two elements and the orientation of the "context" element.
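To make the structure of a condition concrete, the sketch below models conditions and contexts as small data classes and tests whether one promoter element satisfies a condition. The class names, field names, and the choice to measure spacing as a distance in element slots are all assumptions made for illustration; the actual XML attributes are defined in expression_rules.dtd and described in additional file 3.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    element: str                        # cis-element that must co-occur
    max_spacing: Optional[int] = None   # allowed distance in element slots (assumed unit)
    orientation: Optional[str] = None   # required orientation of the context element


@dataclass
class Condition:
    elements: frozenset                 # cis-element(s) to consider
    positions: Optional[range] = None   # allowed positions; None means anywhere
    orientation: Optional[str] = None   # required orientation; None means either
    contexts: list = field(default_factory=list)


def context_holds(promoter, index, ctx):
    """True if some other promoter element satisfies the context."""
    for j, (name, orient) in enumerate(promoter):
        if j == index or name != ctx.element:
            continue
        if ctx.orientation is not None and orient != ctx.orientation:
            continue
        if ctx.max_spacing is not None and abs(j - index) > ctx.max_spacing:
            continue
        return True
    return False


def condition_matches(promoter, index, cond):
    """True if the element at `index` satisfies the condition and all its contexts."""
    name, orient = promoter[index]
    if name not in cond.elements:
        return False
    if cond.positions is not None and index not in cond.positions:
        return False
    if cond.orientation is not None and orient != cond.orientation:
        return False
    return all(context_holds(promoter, index, ctx) for ctx in cond.contexts)


promoter = [("A", "sense"), ("C", "sense"), ("B", "antisense"), ("A", "antisense")]
a_next_to_b = Condition(elements=frozenset({"A"}), contexts=[Context("B", max_spacing=1)])
print([condition_matches(promoter, i, a_next_to_b) for i in range(len(promoter))])
```

In this example only the second A element matches, because it is the only A within one slot of a B element.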
More details on rule specification can be found in additional file 3.
Relos uses a pipeline to perform rule by rule analysis of the promoters. Typically, promoters are moved through the pipeline in the order in which the rules appear in the simulation XML file. However, when a precluding rule matches a promoter, Relos prevents the precluded rules from operating on the matched promoter. Rules which define a custom analyzer delegate promoter analysis to the custom module. All other rules delegate promoter analysis to the default analyzer. The default analyzer determines the number of elements in a promoter that match the rule and multiplies the number of matches by the output amount to determine the magnitude of the effect on the current promoter expression. Promoter expression is then affected by this amount according to the operation defined for the current rule. Valid operations include add (new expression equals the current expression plus the output); multiply (new expression equals the current expression times the output); exponentiate (new expression equals the old expression raised to the power of the output); and replace (new expression equals the output). Matching is performed on a promoter element-wise basis. If the attributes and contexts of at least one condition and no exclusions match, an element will be considered a match. When no conditions or exclusions are specified, the element only needs to match one of the cis-elements specified by the rule.
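The pipeline logic can be summarized in a short sketch. This is not the Relos source: the `Rule` class, the starting `basal` expression, and the choice to leave a promoter untouched when a rule has no matching elements are assumptions. Matching is reduced to the simplest case described above, in which no conditions or exclusions are specified and an element only needs to be one of the rule's cis-elements.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    elements: frozenset          # cis-elements the rule uses
    output: float                # amount by which the rule affects expression
    operation: str = "add"       # add, multiply, exponentiate, or replace
    precludes: frozenset = field(default_factory=frozenset)  # names of precluded rules


def apply_operation(current, amount, operation):
    """Combine the current expression with a rule's effect."""
    if operation == "add":
        return current + amount
    if operation == "multiply":
        return current * amount
    if operation == "exponentiate":
        return current ** amount
    if operation == "replace":
        return amount
    raise ValueError(f"unknown operation: {operation}")


def run_pipeline(promoter, rules, basal=0.0):
    """Apply rules in order, skipping any rule precluded by an earlier match."""
    expression = basal
    skipped = set()
    for rule in rules:
        if rule.name in skipped:
            continue
        matches = sum(1 for name, _ in promoter if name in rule.elements)
        if matches == 0:
            continue  # assumption: a rule without matches has no effect
        # Default analyzer: effect magnitude is number of matches times output.
        expression = apply_operation(expression, matches * rule.output, rule.operation)
        skipped |= rule.precludes
    return expression


rules = [
    Rule("activator", frozenset({"A"}), 2.0, "add"),
    Rule("repressor", frozenset({"B"}), 0.0, "replace", precludes=frozenset({"boost"})),
    Rule("boost", frozenset({"A"}), 3.0, "multiply"),
]
with_repressor = run_pipeline([("A", "sense"), ("A", "antisense"), ("B", "sense")], rules)
without_repressor = run_pipeline([("A", "sense")], rules)
print(with_repressor, without_repressor)
```

In the first promoter the repressor rule matches, replaces the expression with zero, and precludes the later boost rule. In the second promoter the repressor does not match, so the boost rule runs as usual.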
Where μ is the current expression value, σ = μη, and η is the user-defined level of noise. The Relos default sets the noise to be 5% of the current expression level.
Where $V_T$ is the transformed expression value, $V_0$ is the original expression, α adjusts the slope of the curve at the inflection point, β adjusts the position of the inflection point, γ determines the expected midline expression, and ϕ scales the resulting transformation.
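The difference between scaled noise and the unscaled alternative mentioned in the comparison to experimental data can be sketched as follows. The function names are hypothetical, and treating σ as the standard deviation of a Gaussian centered on μ is an assumption based on the description above.

```python
import random
import statistics


def scaled_noise(mu, eta=0.05, rng=random):
    """Draw from a Gaussian with mean mu and standard deviation |mu| * eta."""
    return rng.gauss(mu, abs(mu) * eta)


def unscaled_noise(mu, sigma, rng=random):
    """Draw from a Gaussian with mean mu and a fixed standard deviation."""
    return rng.gauss(mu, sigma)


rng = random.Random(42)
near_zero = [scaled_noise(0.01, rng=rng) for _ in range(2000)]
high = [scaled_noise(100.0, rng=rng) for _ in range(2000)]
flat = [unscaled_noise(0.01, 5.0, rng=rng) for _ in range(2000)]

# Scaled noise shrinks with the expression value; unscaled noise does not.
print(statistics.pstdev(near_zero), statistics.pstdev(high), statistics.pstdev(flat))
```

With scaled noise, values near zero are almost noise-free, whereas the unscaled variant keeps the same absolute spread at every expression level.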
More details on promoter analysis can be found in additional file 3.
Creating test datasets
Ten test-set simulations were run. Two hundred promoters, comprised of eight cis-elements selected from a pool of four possible elements (A-D), were generated for each simulation, except for test-set simulation ten. A noise level of 5% of the expression level was used. None of the datasets were subjected to upper or lower bound constraints. The first nine test-set simulation rule sets were comprised of: an additive activator, an activator with spacing and ordering constraints, two synergistic rule sets with spacing constraints, two cooperative rule sets, a dominant-negative competitive rule set, a dominant positive rule set, and a rule set with constraints on many elements and an enhancer. In the final test-set simulation, two hundred promoters were generated, each comprised of eight cis-elements selected from a pool of eight possible elements (A-H). The final simulation rule set consisted of multiple additive and non-additive effects, incorporating many of the non-additive effects encountered separately in other rule sets. For more details, see additional file 1.
Comparison to experimental data
Beer and Tavazoie classified 49 transcriptional modules in S. cerevisiae. We simulated the "ribosome biogenesis", "peroxisome", "mitochondrion", "cell cycle", and "glycolysis" modules. These modules were chosen because they vary in size, expression outputs, and regulatory complexity. Promoters with no regulatory motifs were removed from the dataset, leaving 254 promoters. Tree regression was performed to determine the best classification tree for separating the promoters into the five transcriptional modules. The input to the classification for each promoter was its assigned module and the presence or absence of each of the 666 proposed motifs. Based on the structure of the classification tree, a general rule set was constructed (additional file 7). The rule set was then duplicated for each microarray experiment, except that the output for each rule was changed to match the average expression for that module. All 254 promoters were used as input sequences for each of the 255 simulations. Tree regression and statistical calculations were performed in R.
We used Relos to generate synthetic promoters based on the frequency of the motifs used in the above rule set. The frequency of each motif was determined in the 254 biological promoters as the number of times each motif occurred divided by the total number of motifs in these promoters. The frequencies of the remaining biological motifs not considered by the ruleset were conglomerated into a single "Spacer" motif (see additional file 8). Relos was used to generate 1000 promoters which were then analyzed by the same rule set described above, with the addition of an "all spacer" rule.
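The frequency calculation described above can be sketched in a few lines. The function name and the motif identifiers in the example are hypothetical; the input is assumed to be one list of motif occurrences per biological promoter.

```python
from collections import Counter


def motif_frequencies(promoters, rule_motifs, spacer="Spacer"):
    """Frequency of each rule-set motif, with all other motifs pooled as a spacer.

    Each frequency is the number of occurrences of a motif divided by the
    total number of motif occurrences across all promoters.
    """
    counts = Counter()
    for promoter in promoters:
        for motif in promoter:
            counts[motif if motif in rule_motifs else spacer] += 1
    total = sum(counts.values())
    return {motif: count / total for motif, count in counts.items()}


# Hypothetical motif names; M1 and M2 are used by the rule set, X and Y are not.
freqs = motif_frequencies([["M1", "M2", "X"], ["M1", "Y"]], rule_motifs={"M1", "M2"})
print(freqs)
```

The resulting dictionary has the shape needed as the element frequency input for a promoter generator such as the one described under promoter generation.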
We thank members of the Cohen Lab for discussions and critical readings of the manuscript. JG is funded by a National Science Foundation Graduate Research Fellowship DGE-0202737. The project was supported by grants from the American Cancer Society (RSG-06-039-01-GMC) and the National Science Foundation (0543156).
- Uemura H, Koshio M, Inoue Y, Lopez MC, Baker HV: The role of Gcr1p in the transcriptional activation of glycolytic genes in yeast Saccharomyces cerevisiae. Genetics. 1997, 147 (2): 521-532.
- Pierce M, Benjamin KR, Montano SP, Georgiadis MM, Winter E, Vershon AK: Sum1 and Ndt80 proteins compete for binding to middle sporulation element sequences that control meiotic gene expression. Mol Cell Biol. 2003, 23 (14): 4814-4825. 10.1128/MCB.23.14.4814-4825.2003.
- Yuh CH, Bolouri H, Davidson EH: Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development. 2001, 128 (5): 617-629.
- Makeev VJ, Lifanov AP, Nazina AG, Papatsenko DA: Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic Acids Res. 2003, 31 (20): 6016-6026. 10.1093/nar/gkg799.
- Anholt RR, Dilda CL, Chang S, Fanara JJ, Kulkarni NH, Ganguly I, Rollmann SM, Kamdar KP, Mackay TF: The genetic architecture of odor-guided behavior in Drosophila: epistasis and the transcriptome. Nat Genet. 2003, 35 (2): 180-184. 10.1038/ng1240.
- Roven C, Bussemaker HJ: REDUCE: An online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res. 2003, 31 (13): 3487-3490. 10.1093/nar/gkg630.
- Beer MA, Tavazoie S: Predicting gene expression from sequence. Cell. 2004, 117 (2): 185-198. 10.1016/S0092-8674(04)00304-6.
- Istrail S, Davidson EH: Logic functions of the genomic cis-regulatory code. Proc Natl Acad Sci U S A. 2005, 102 (14): 4954-4959. 10.1073/pnas.0409624102.
- Ligr M, Siddharthan R, Cross FR, Siggia ED: Gene expression from random libraries of yeast promoters. Genetics. 2006, 172 (4): 2113-2122. 10.1534/genetics.105.052688.
- Michaud DJ, Marsh AG, Dhurjati PS: eXPatGen: generating dynamic expression patterns for the systematic evaluation of analytical methods. Bioinformatics. 2003, 19 (9): 1140-1146. 10.1093/bioinformatics/btg132.
- Mendes P, Sha W, Ye K: Artificial gene networks for objective comparison of analysis algorithms. Bioinformatics. 2003, 19 Suppl 2: II122-II129.
- Van den Bulcke T, Van Leemput K, Naudts B, van Remortel P, Ma H, Verschoren A, De Moor B, Marchal K: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006, 7: 43. 10.1186/1471-2105-7-43.
- Laubenbacher R, Stigler B: A computational algebra approach to the reverse engineering of gene regulatory networks. J Theor Biol. 2004, 229 (4): 523-537. 10.1016/j.jtbi.2004.04.037.
- Hill AV: The possible effects of the aggregation of the molecules of haemoglobin on its dissociation curves. J Physiol. 1910, 40: iv-vii.
- Granek JA, Clarke ND: Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 2005, 6 (10): R87. 10.1186/gb-2005-6-10-r87.
- Breiman L, Friedman J, Stone CJ, Olshen RA: Classification and Regression Trees. 1998, Boca Raton, Florida, CRC Press LLC.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.