Increasing the efficiency of bacterial transcription simulations: When to exclude the genome without loss of accuracy
© Iafolla et al; licensee BioMed Central Ltd. 2008
Received: 26 May 2008
Accepted: 12 September 2008
Published: 12 September 2008
Simulating the major molecular events inside an Escherichia coli cell can lead to a very large number of reactions that compose its overall behaviour. Not only should the model be accurate, but it is imperative for the experimenter to create an efficient model to obtain the results in a timely fashion. Here, we show that for many parameter regimes, the effect of the host cell genome on the transcription of a gene from a plasmid-borne promoter is negligible, allowing one to simulate the system more efficiently by removing the computational load associated with representing the presence of the rest of the genome. The key parameter is the on-rate of RNAP binding to the promoter (k_on), and we compare the total number of transcripts produced from a plasmid vector generated as a function of this rate constant, for two versions of our gene expression model, one incorporating the host cell genome and one excluding it. By sweeping parameters, we identify the k_on range for which the difference between the genome and no-genome models drops below 5%, over a wide range of doubling times, mRNA degradation rates, plasmid copy numbers, and gene lengths.
We assess the effect of the simulating the presence of the genome over a four-dimensional parameter space, considering: 24 min <= bacterial doubling time <= 100 min; 10 <= plasmid copy number <= 1000; 2 min <= mRNA half-life <= 14 min; and 10 bp <= gene length <= 10000 bp. A simple MATLAB user interface generates an interpolated k_on threshold for any point in this range; this rate can be compared to the ones used in other transcription studies to assess the need for including the genome.
Exclusion of the genome is shown to yield less than 5% difference in transcript numbers over wide ranges of values, and computational speed is improved by two to 24 times by excluding explicit representation of the genome.
In recent decades, extraordinary advances in biochemistry and molecular biology have led to an unprecedented level of understanding biological systems at the molecular level. The complexity of cellular pathways and networks often makes it difficult or impossible to reliably predict the behavior of a system from knowledge of its components, and thus there is considerable interest in formulation of quantitative, predictive mathematical models of cellular functions. Such efforts, collectively described by such terms as systems biology and in silico biology [1–9], aim in the long term toward goals such as predicting the effects of drugs or other interventions on the state of diseased cells, and enhancing our fundamental understanding of how cells respond to stimuli and regulate their internal environments.
The internal dynamics of cells are driven by the kinetics of a complex set of biochemical reactions: the state of the cell may be viewed as the numbers and binding states of all species of interest, and the time evolution of that state is defined by how those species react with one another. A central challenge in cellular modelling is to formulate correct biochemical reaction schemes to represent a process of interest, and then to populate the reaction system with appropriate rate constants [5–9]. Within this effort, two persistent difficulties arise: populating mathematical models based on incomplete experimental information [10, 11]; and the computational demands of simulating the resulting systems, which can grow large for even moderately complex processes.
We have previously carried out a study aimed at the first of these problems, in which we used bulk expression data from Escherichia coli to deduce the numbers of free RNA polymerases available to transcribe a target gene of interest ; this information is not currently experimentally available, with bulk studies  able to provide the average numbers of each enzyme type but not to determine how many are "tied up" in the cell, transcribing other genes, at any given time. When simulating the expression of a gene or network of genes, whether an engineered or "synthetic" system [13–18], or a natural one [2, 6, 8, 19–21], the total number of RNA polymerases is less relevant than the number that are not currently occupied expressing genes outside the target system of interest. Our method for deducing this number involved using bulk measurements (collected as a function of growth rate ) to create an average (or "mean field") behaviour for the set of genes in the bacterial genome; we then tested how many expression enzymes our target gene had available to transcribe it, and generate free enzyme levels as a function of growth rate .
We turn now to the second of the challenges mentioned above, that of computational time. Having the rest of the genome present in the system, even in our bulk-averaged way, added significantly to the computational demands of the simulations. Further investigation shows, however, that there are regimes in which the target system is not significantly affected by the presence of the remainder of the genome, and may thus well approximated by excluding the genome portion and simulating only the target system. The key quantity is the "on rate" of binding between RNA polymerase and the promoter of the target gene: for certain ranges of this parameter, the perturbation introduced by the presence of other genes (the rest of the genome in the cell) is small enough to be neglected, saving significant amounts of computational time. We explore the details of these ranges, as a function of other system parameters, below. We view this work as complementary to the various ongoing large-scale cellular simulation projects [2, 7, 19, 22–25], offering a method of simplifying the system in cases where including genes outside the immediate system of interest does not alter the overall behaviour significantly. Although our results are obtained for our particular gene expression model, we anticipate that our promoter on-rates will apply, at least approximately, to other studies of transcription in bacteria, and thus offer guidance to others wishing to simplify their system by omitting the genomic influence.
E. coli gene expression model
Biochemical reactions that make up our bacterial gene expression model (version incorporating the host's genome).
Forward Rate constant
Backward rate constant
operon_ns + Rpoly
operon_ns + Rpoly_operon_ns_1
Rpoly_operon_ns_2 + mRNA
Rpoly_operon_ns_3 + mRNA
Rpoly_operon_ns_4 + mRNA
Rpoly_operon_ns_5 + mRNA
Rpoly_operon_ns_6 + mRNA
Rpoly_operon_ns_7 + mRNA
Rpoly + mRNA_small
operon_s + Rpoly
operon_s + Rpoly_operon_s
Rpoly + stable_RNA
plas + Rpoly
plas + Rpoly_reporter + incom_mRNA_reporter
Rpoly + mRNA_reporter
List of the biochemical reactions that make up our bacterial gene expression model (version excluding the host's genome). Gene expression model excluding the host's genome
Forward rate constant
Backward rate constant
plas + Rpoly
plas + Rpoly_reporter + incom_mRNA_reporter
Rpoly + mRNA_reporter
Species nomenclature used in biochemical models
RNA polymerase in a closed-complex with the average genomic mRNA operon promoter
RNA polymerase in a closed-complex with the average genomic sRNA operon promoter
RNA polymerase in a closed-complex with the reporter mRNA promoter on the plasmid
RNA polymerase in an open-complex with the average genomic mRNA operon promoter
RNA polymerase in an open-complex with the average genomic sRNA operon promoter
RNA polymerase in an open-complex with the reporter mRNA promoter on the plasmid
RNA polymerase elongating the average genomic mRNA transcript from the average genomic mRNA operon
RNA polymerase elongating the average genomic sRNA transcript from the average genomic sRNA operon
RNA polymerase elongating the reporter mRNA transcript from the plasmid
Nascent average genomic mRNA
Nascent genomic mRNA where its final length is approximately 90% of the average genomic mRNA
Nascent reporter mRNA
Average genomic mRNA (represented as the length of the average genomic mRNA gene)
Genomic mRNA that is approximately 90% of the average genomic mRNA
Average genomic sRNA (represented as the length of the average genomic sRNA operon)
Reporter mRNA – the mRNA of interest
Average genomic mRNA operon
Average genomic sRNA operon
Reporter promoter on the plasmid
Refers to species used exclusively in the gene expression model that incorporates the host's genome
Cell growth and division
The cellular volume grows exponentially until a threshold is reached, at which point it is approximately halved (a binomial distribution is used) and exponential growth restarts. A counter species, v, is used to represent volume: v → 2v, with rate constant k = ln(2)/τ, where τ is the doubling time of the cells. At cell division, all species present are divided between two hypothetical daughter cells, and the simulation follows one of these daughters. We treat the system as ergodic, and average over long times for a single cell to obtain ensemble averages across the cellular population.
Enzyme binding and isomerization
RNA polymerases (Rpoly) are responsible for initiating and catalyzing the transcription of messenger RNA (mRNA) strands. As the model assumes all mRNA transcripts reside in operons, Rpoly binds to promoter sequences in the DNA (operon) and forms a closed complex (Rpoly+operon→closed_Rpoly_prom). This closed complex then must isomerize into an open complex (closed_Rpoly_prom→open_Rpoly_prom) before transcription can begin.
RNA polymerases clear the promoters, leaving those sites free to bind additional enzymes while transcription proceeds further down the DNA strand. We model this by regenerating the promoter after clearance occurs, forming an enzyme-template complex plus the original site: open_Rpoly_prom→Rpoly_operon+incom_mRNA+operon. We create a nascent transcript (mRNA_incom) at this step to allow subsequent translation to proceed; this feature will prove very helpful in studying future simulated studies of protein synthesis. Conservation of the number of promoters is maintained: when the enzyme-template complex finishes elongation, only the enzyme and the polymerized product are released.
To avoid the complexity of accounting for each enzyme at different stages of elongation, a single reaction is used to represent the process of completing the mRNA chain: Rpoly_operon→Rpoly+mRNA. Compliment to this reaction is the disappearance of the nascent transcript made available during transcription: incom_mRNA→(), where () is a null placeholder. Both the reactions have the same elongation rate constant that can be summarized as kelongation = ρ/λ, where ρ and λ are the polymerization rate and length of template, respectively.
Since the kinetics of RNA polymerase assembly are not fully known, the model is simplified by treating enzyme production as a zero-order process in which enzymes appear from outside the model at a constant rate: ()→Rpoly. The enzymes are partitioned at cell division like all other species. The rate constant for production can be summarized as krep = (ν/1.5)/τ, where ν and τ are the average number per cell and cellular doubling time, respectively.
DNA replication in bacteria is a complex process involving multiple replication forks. We represent the coding portion of the genomic DNA by the number of operons present (operon), and simplify the replication process as a zero-order process: ()→operon. Rate constants for this process are chosen to match the number of genomes per cell at different growth rates.
RNases act to destroy mRNA in E. coli, and we represent the degradation of mRNA by these enzymes with first-order reactions: mRNA→(), and incom_mRNA→(); the latter is an additional RNase-driven degradation, beyond the above-mentioned rate of disappearance of incomplete mRNA through conversion to complete mRNA strands.
RNA production from operons
We assume that all genes in the genome are clustered into operons: groups of genes transcribed from a single promoter, as in the lac operon. The model keeps track of which gene on the mRNA operon Rpoly is currently transcribing and makes available completed transcripts of the nascent operon (this latter point will prove relevant in future protein synthesis models): Rpoly_operon1→Rpoly_operon2+mRNA. In response to the genome-wide average of 6.9 genes per operon [10, 12] the model tracks the 7 transcripts representing the average mRNA operon (six genes of equal size, one 90% the length of the average size).
In addition to messenger RNA, other forms of RNA collectively known as stable RNA (sRNA) are produced within the cell. Since sRNA is transcribed but not translated the model does not consider nascent sRNA production.
With-genome and no-genome models
Computational simulation method
The chemical kinetics of this system were initially simulated using the Gillespie Monte Carlo algorithm [26–28], and these results were used to validate a deterministic, ordinary differential equation (ODE) version of the system, which was shown to yield identical average transcript numbers, allowing us to use the significantly faster ODE model to generate larger numbers of points in parameter space. Comparing the two models allowed us to determine the point at which the on-rate constant between the target promoter and RNA polymerase, k_on, crossed a threshold where the two models (with and without the host genome included) generated average transcript numbers differing by more than a certain percentage; here, we have chosen a five percent difference as an admittedly arbitrary significance threshold.
The original experimental measurements in the literature were carried out over a range of cellular growth rates, each of which yielded different average quantities of biomolecules per cell. Stochastic simulations of our system were carried out at each experimentally-examined growth rate (doubling times of 24, 30, 40, 60, and 100 minutes ) and sampled at discrete points in parameter space, as follows: plasmid copy numbers of 10, 100, and 1000; mRNA half-lives of 2, 6, 10, and 14 min; and gene lengths of 10, 100, 1000, and 10000 bp. The relationship between these independent variables and the point at which the promoter-RNAP on-rate begins to yield a significant difference between the genome and no-genome models is complex and highly nonlinear, and not amenable to reduction to a single equation. We have instead produced a MATLAB script (The MathWorks, Natick, MA) that generates an on-rate threshold given a user's input of plasmid copy number, mRNA degradation rate, gene length and cellular doubling time: any promoter on-rate constant larger than this predicted value can exclude the computationally expensive genome from the simulations without creating more than a five-percent error, while any constant smaller than this should include the genome.
Stochastic modelling approach and software
Deterministic chemical kinetics apply in the regime of large numbers of randomly interacting molecules. Inside cells, molecule numbers are often small enough to produce significant fluctuations [8, 20, 28–44], thus requiring a stochastic simulation of the reaction kinetics. The Gillespie algorithm  treats chemical reactions as Poisson processes, with event (reaction) rates given by microscopic rate constants and the current state of the system. For an elementary reaction of the form A+B→C with rate constant k, the Poisson rate of the forward reaction is kab/V, where a and b represent the numbers of molecules of species A and B present, and V is the reaction volume (note that this volume is a changing parameter in a living bacterial cell). We use the unit "n" to represent the number of molecules present in the system, rather than concentration units such as molarity. To advance the simulation, the timing of the next reaction event is randomly selected using the exponential distribution of inter-event times for the set of Poisson processes representing the reactions, and the probability of each reaction being the one that occurs at that instant is given by its fraction of the sum of all reaction rates [26–28].
Bacterial cells have often been approximated as well-stirred reactors: based on their small size, it is assumed that diffusion is sufficiently fast to yield a well-mixed system. Early experimental results showed protein mobility in vivo consistent with normal diffusion , and though the diffusion coefficients were substantially lower than for the same proteins in water, the diffusion was fast enough to spread the proteins over the volume of a bacterium on a time scale of seconds. Recent theoretical treatments [43, 46–49] have questioned the picture of bacterial cells as well-mixed systems, and recent experimental results  have reported subdiffusive behavior in the motion of individual RNA molecules, where each RNA is rendered visible through binding to multiple fluorescent protein labels. In this paper, we use the well-stirred reactor picture as a first approximation to gain insight, but it should be noted that this is a significant simplification, and that future refinements and extensions are possible. Approaches proposed to deal with crowded cellular environments include rate laws obeying fractal-like kinetics [49, 51, 52], and Monte Carlo simulations wherein two- or three-dimensional spatial information is retained for each molecule [43, 46, 49, 53].
The gene expression model was initially implemented using BioNetS (Biochemical Network Stochastic Simulator) , which provides a convenient interface for specifying reactants, products and kinetic data. The software generates C++ source code implementing the system using the Gillespie stochastic simulation algorithm (or an approximation, if desired), and this code is then compiled and executed with user-tunable parameters as inputs. Some species in the model exist in small numbers while others exist in large numbers; although continuum approximations and hybrid schemes are available through BioNetS , the Gillespie algorithm with no approximations yielded the best simulation speed. The data from the BioNetS-generated code was processed using DataTask (Visual Data Tools, Inc) and its run manager DataTask, which automated the process of sweeping parameter values and analyzing the results. The complete gene expression models used are available as BioNetS scripts and are provided along with this paper (see Additional File 1).
Derivation of E. coli gene expression parameters
To derive the on-rate constant between RNA polymerase and the reporter promoter where there is 5% difference in transcript average between models, we employ bulk cellular averages obtained by Bremer and Dennis for several different cellular growth rates . We implement a "mean-field" approach  by considering the production of generic transcripts with properties derived from genome-wide averages: we compute mean transcript lengths, mean elongation rates, and so on. With these quantities in hand, the unknown between models is reduced to the RNA polymerase on-rate constant for binding to the reporter promoter, and we find its value by sweeping until the difference in transcript average between models is 5%.
The model has been constructed to be as detailed as possible, using all available information about the biochemical processes underlying gene expression. This leads to a large number of species and reactions, the full details of which are provided in the Appendix. For a derivation of average genome parameters, please see Iafolla and McMillen .
Stochastic model parameter sweeping
The duration of the stochastic simulations was varied to obtain linearity with R2 ≥ 0.90; this is achieved by using a minimum of 1000 cell divisions, although some simulations use more cell divisions to obtain the desired linearity. Since the doubling time of the cells is varied, the total duration in real time varies among the simulations; the number of cell divisions explored appears to be the key factor in obtaining well-converged statistics, rather than the absolute duration.
The minimum 1000 cell division duration was deduced by qualitative analysis of multiple simulations with the same seed but different durations; we examined the effect of duration on the mean values obtained from the reporter mRNA histograms. The on-rate constants used in the duration analysis was determined by comparing the histograms between models over a range of on-rate constants (10-7 n-1s-1 to 1 n-1s-1); the range of on-rate constants that bound the percent difference in the above statistical parameters by 5% was investigated for duration analysis (this range was from 10-5 n-1s-1 to 10-2 n-1s-1). Ultimately, longer-duration runs produced averages that were not statistically different from those obtained after 1000 divisions (see the Appendix for additional explanation), implying that longer durations only increase computational expense.
After interpolation, the validity of the on-rate was tested: using a different seed for 30 simulations – all employing steady-state initial conditions and the same duration, kinetics and interpolated on-rate – the sample mean difference between models of the 30 simulations was statistically compared to the population mean of 5%. The on-rate was accepted if the two means were not proven statistically different using a level of significance α = 0.95. All simulations, either in parameter sweeping or verification, employ different nucleating random number generator seeds.
Additional deterministic simulations
Similar to the parameter sweeping carried out for the stochastic simulations, we used the deterministic simulation results for each parameter set to calculate the RNA polymerase-promoter binding on rate, k_on, at which there will be a five percent difference between the models with and without a representation of the host cell genome; for the deterministic results, the 5% threshold was determined using the fzero function in MATLAB, which searches for a zero-crossing between two given points.
Interpolation of on-rate thresholds
The on-rate (k_on) threshold above which a 5% deviation between the genome and no-genome models occurred has been calculated explicitly only at the set of parameter values listed above (based on stochastic simulations supplemented by cross-validated deterministic simulations to increase the density of the sampling of parameter space). To allow the k_on threshold to be calculated at values other than those explicitly simulated, we created a MATLAB script to carry out the necessary interpolation using a local minimization method. In local linear fitting, to find the unknown point at a desired parameter value, one draws a straight line connecting the known points on either side of the desired value, and takes the point on that straight line as the interpolated result at the desired parameter value. Note that this process minimizes the total distance between the interpolated point and the two known points, and we use this idea to perform our interpolation in our 5-dimensional space (k_on as a function of four parameters: growth rate, gene length, mRNA half life, and plasmid copy number). For any single given 4-dimensional parameter set, the nearest available set of parameter values is determined by finding the two nearest parameter values in each direction on this 4-dimensional mesh; combining all four dimensions yields the 16 nearest points on the mesh. Since these 16 data points do not generally fit well to a linear function, we obtain the interpolated on-rate value for a given parameter set by searching for the k_on value that minimizes the total distance in 5-dimensional space to those nearest 16 points, using the MATLAB fminsearch function to carry out the minimization operation.
Results and discussion
Percent difference of reporter transcript averages between models
Figure 3 shows there are two binding constant ranges for each set of parameters where there is less than a 5% difference in transcript production. We have not considered the lower range, here, because of the insignificant number of transcripts produced, usually an average of much less than one per cell division. In this regime, the two versions of the model are both matching simply because they are both yielding a result of "nearly zero." For the case we wish to consider, that of observing the output of a target gene through the expression of a reporter, such low levels of transcription would be invisible to current detection techniques, requiring single-molecule resolution against the noisy background of the cytoplasm, and thus for the moment we consider it justified to exclude this near-zero range in our simulations. The higher k_on rate constant limit corresponds to transcript numbers on the order of 102 to 104, a magnitude that is much more amenable to experimental access and thus potentially more significant for use in other studies.
Accuracy of the interpolated on-rates
To test the accuracy of the interpolated on-rates, the on-rates were entered back into both versions of the model and run for 30 different simulations seeds for a duration of 30 cell divisions, after creating steady state values for all species within the model. The percent differences were assembled and statistically compared to the population mean of 5% using a level of significance α = 0.95. This process was repeated for all 240 different kinetic situations generated using the stochastic simulations. There was no statistical difference between the population mean and the sample mean obtained from the simulations (data not shown), thereby ensuring that the interpolated values are the correct ones for producing a percent difference of 5%.
Time reduction via genome exclusion
Excluding the genome from simulation studies does reduce CPU simulation time in the computationally intensive fully stochastic simulations. To illustrate this, the verification runs were used for comparison between models; these simulations employ the same kinetic parameters and duration, and offer a large population size (since each run was repeated multiple times with varying random seeds).
Relationship between the parameters
Simulating the translation of mRNA to protein, downstream of the transcriptional events discussed here, requires a significantly more elaborate model  with correspondingly greater computational demands. One extension of this study would be to investigate the binding on-rates for ribosomes binding to the ribosome-binding-sites (RBS) of the mRNA binding sites, and once again compare the results when the presence of the genome is modelled to those when it is excluded; presumably there would be a similar possibility of excluding the representation of the genome under some parameter ranges (where the main parameters would remain the same: doubling time, gene length, mRNA half-life, and plasmid copy number). Since translation follows transcription in the gene expression process, the range of parameter values in which one can exclude the genome from studies of the translational output of a target gene should be smaller than the regions found in the current study of transcriptional output: the system will be subject to the constraints imposed by matching the transcriptional results, as well as additional constraints required to match the translational results.
The ability of RNA polymerase to produce an approximately equal amount of transcripts at large enough binding constants for both models raises an important question: are there enough RNA polymerases left when a large rate law exists for the reporter promoter to transcribe the necessary genomic genes for cell division? The presence of a large rate for the reporter transcript will produce metabolic strain on the cell [54–56], possibly leading to an increase in doubling time that is not captured within the current model. Further studies on modelling the effect of metabolic strain and its feedback with cellular doubling time will help to clarify this issue.
Efforts to create accurate, quantitative models of Escherichia coli genomic networks using chemical equations results in large reaction schemes, with reactions potentially proceeding at a wide range of rates. The large computational time required to simulate these reactions is a persistent problem for large-scale cellular simulation. To help address one aspect of this problem, we have investigated the necessity of simulating the presence of the E. coli genome when studying a target gene inserted on a plasmid. The presence of the genome, introduced using our "mean-field" approach, is felt by the target gene through the competition for free RNA polymerases available to bind to the target gene's promoter and generate transcripts. However, there are ranges of the parameter space in which the presence of the genome yields a negligible difference in the number of reporter transcripts produced from the target gene, and in these cases is it possible to exclude any explicit representation of the genome and save the computations required to simulate the associated additional reactions. Stochastic simulations show speed increases of from two to 24 times, when the genome is excluded from our models. We have generated a set of fully stochastic simulations and found the promoter on-rate values for which the genome and no-genome models differ by less than 5%, and augmented these stochastic simulations with cross-validated deterministic runs to increase the number of sampled points in parameter space. Within the ranges of our four independent parameters (growth rate, gene length, mRNA degradation half-life, and plasmid copy number), we have produced a MATLAB user interface that will allow the user to input any set of parameters and obtain the promoter on-rate value (k_on) above which the effect of the genome will fall below our 5%-difference threshold. Given the increasing computational demands of cellular simulations, we hope that this approach will aid in the efficiency of other studies, and suggest other methods in which portions of the full cellular system may be excluded without significantly affecting the final results.
Below is a detailed explanation of the gene expression model, expanding on the information presented in the Methods section. A full list of kinetic parameters for each reaction is provided in Iafolla and McMillen .
Species used in both versions of the model
RNA polymerase elongating the reporter mRNA transcript from the reporter gene
RNA polymerase in a closed-complex with the repoter promoter
Nascent reporter mRNA degradation product
Reporter mRNA degradation product
Nascent reporter mRNA
RNA polymerase in an open-complex with an mRNA reporter promoter
Promoter on the plasmid
Counter (representing cell volume)
Species used exclusively in the model containing the genome
RNA polymerase elongating an average mRNA transcript from a template operon
RNA polymerase elongating an average RNA transcript from a template operon
RNA polymerase in a closed-complex with an mRNA operon promoter
RNA polymerase in a closed-complex with an mRNA operon promoter
Average mRNA (gene length)
Approximately 90% of the average mRNA; all species names that include "small" refer to this shorter species and its products/complexes
RNA polymerase in an open-complex with an mRNA operon promoter
RNA polymerase in an open-complex with a stable RNA operon promoter
Average mRNA operon
Average stable RNA operon
Average stable RNA (full operon length)
The cellular processes represented in the model are discussed individually, below:
To reflect the exponential growth of bacterial cells in a nutrient-rich liquid culture, we include cell growth and division, incorporated as a process that grows to a threshold volume and is then halved. At division, all species have their numbers cut approximately in half: for large numbers, a binomial distribution is used to calculate the new number, while small numbers (less than 100) have each molecule explicitly checked and randomly assigned to a daughter cell with equal probability . The model follows only one cell as a representative of the full population, so the second daughter effectively vanishes after division. Tracking such a representative cell over long times yields the same statistics as tracking an ensemble of many cells over shorter times, if we make the reasonable assumption that the system is ergodic.
Cell volume is represented by a "counter" species, v, whose exponential growth is governed by the following reaction, with rate constants adjusted to produce various doubling times to match the experimental conditions being examined:
v → 2v (R1)
For a doubling time τ, the rate constant is set to k = ln(2)/τ. The reaction is initialized at v = V0, and cellular division occurs when v reaches 2V0. Our model treats all processes as stochastic, but the resulting degree of variability depends strongly on the number of molecules participating in the reaction. The range of cell division times can thus be tuned by the choice of V0; here we set V0 = 1000, which yields a very slight degree of variability in the cell division times. This variability arises from two sources: the stochastic rate of reaction R1, and the random assortment of the counter v between daughter cells at division: v is cut only approximately in half at cell division, like all other species, and thus the initial volume after cell division lies in a small range around V0.
Enzyme binding, unbinding, isomerization and clearance
Since the only enzymes used in this model are RNA polymerases only binding to promoters need consideration. The bimolecular reactions for RNA polymerase (Rpoly) binding to a promoter on a gene (plas or operon) are shown below:
Rpoly + plas ⇔ closed_Rpoly_prom KR2 = [k_on_Rpoly/(v/V0)]/k_off_Rpoly (R2)
closed_Rpoly_prom → open_Rpoly_prom kR3 (R3)
RNA polymerase initially forms a closed-complex with the promoter region, which then undergoes isomerization (R3) into an open-complex. The rate constants for R2 and R3 are adjacent to the reactions; that for R2 is scaled to mimic dilution of cell cycle progression: as the cell grows, the increase in volume decreases the probability of the two species coming into contact and reacting, effecting reducing the rate constant ; this effect is incorporated by dividing the rate constants by v/V0.
Following binding, the enzyme clears the promoter at a particular rate. The elementary reactions for this process are shown below:
open_Rpoly_prom → Rpoly_operon + operon + incom_mRNA (R4)
We create a nascent transcript (mRNA_incom) at this step to allow subsequent translation to proceed; this feature will prove very helpful in studying future simulated studies of protein synthesis. Reaction R4 also shows an important assumption: the regeneration of a binding site after clearance allows another enzyme to bind to the same gene, creating the multiple simultaneous elongation processes observed in actual bacterial cells.
To avoid the computational complexity of accounting for all elongating intermediates (growing mRNA and peptides of every possible length), the following approximation has been employed: a single intermediate is converted to the final product at a rate corresponding to the average time taken by the complete polymerization process. Using average elongation rates for specific cell growth rates as specified by Bremer and Dennis , the elongating species produce only the enzyme and the polymerized product, not the template that is read. This is shown below in Reaction R5:
Rpoly_operon → Rpoly + mRNA (R5)
Compliment to this reaction is the disappearance of the nascent transcript made available during transcription: incom_mRNA→(), where () is a null placeholder. The elongation rate constant can be summarized as kelongation = ρ/λ, where ρ and λ are the polymerization rate and length of template, respectively.
Enzyme and genome production
Many processes involved in molecular biology are either too complex to model or not characterized at present. In our model, we use simplified zeroth-order production rates for complicated species involved: although the assembly details of some species are not fully available, there is considerable information on population size of these species. In E. coli, the average number of RNA polymerases and genome equivalents per cell are known at several cellular growth rates , and their production is represented by the elementary reactions below:
() → Rpoly (R6)
() → plas (R7)
() → operon (R8)
The operon species in R8 is representative of the genome, since our model employs RNA polymerase binding directly to the promoter sequence of the average operon. The rate constant for production can be summarized as krep = (ν/1.5)/τ, where ν and τ are the average number per cell and cellular doubling time, respectively.
The presence of RNases in E. coli implies that mRNA possess a finite life-span. The following reactions are used to represent mRNA degradation:
mRNA_reporter → () (R9)
incom_mRNA_reporter → () (R10)
For a half-life h, the rate constant for R9 and R10 is set to k = ln(2)/h.
We assume that RNases can degrade nascent transcripts. To account for degrading a transcript while it is being created we propose the following elementary reaction and rate constant:
Rpoly_mRNA_reporter → Rpoly (R11)
k R11 = incom_mRNA_reporter·kmRNA_degradation/Rpoly_mRNA_reporter
The reaction indicates that an RNA polymerase currently producing a transcript becomes an unscathed RNA polymerase and a degraded mRNA. Although this reaction implies that all RNA polymerases producing a transcript are subject to degradation, the proportionality to incomplete transcripts is specified in the rate constant. The Rpoly_mRNA species present in the denominator of the rate constant makes the reaction rate independent of the number of elongating RNA polymerases.
Modelling RNA production from operons
We assume that all genes in our relevant genome are clustered into operons. Our model creates a single transcript for the entire operon, mimicking the lac operon . To make the elementary reactions simple and accurate for mRNA and subsequent peptide production, RNA polymerase binds once to the promoter and produces a transcript of average length under corresponding kinetics; the ejection of the mRNA occurs simultaneously with RNA polymerase transcribing the adjacent gene on the operon, or in the case of the last gene on the operon, being released. This is shown in the following reactions for a hypothetical three gene operon, where the binding (R2), isomerization (R3) and clearance steps (R4) have been omitted:
Rpoly_operon1 → Rpoly_operon2 + mRNA k = ktranscription
Rpoly_operon2 → Rpoly_operon3 + mRNA k = ktranscription
Rpoly_operon3 → Rpoly + mRNA k = ktranscription
The numeric suffix on the Rpoly_operon species represents the gene number adjacent to the promoter. Notice that the rate constants for the above reactions are all equivalent. The release of the mRNA while the RNA polymerase is still elongating the operon allows ribosomes to bind and perform translation without requiring additional species; the act of transcription is conserved since RNA polymerase only binds once to the promoter. Evidently, the total time to transcribe all three genes is equivalent to the time for transcribing the whole operon.
Contrast to mRNA production, stable RNA is easily produced. Since this RNA is not translated there is no need to include ribosomes translating complete transcripts before the operon is finished elongation. Hence, the length of stable RNA in the model is equivalent to the average stable RNA operon length.
This work has been funded by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Research Fund (ORF), the Canada Foundation for Innovation (CFI), the Ontario Photonics Consortium (OPC), and the Canadian Institutes for Health Research (CIHR).
- Meng TC, Somani S, Dhar P: Modeling and simulation of biological systems with stochasticity. In Silico Biol 2004, 4: 293–309.PubMedGoogle Scholar
- Auffray C, Imbeaud S, Roux-Rouquie M, Hood L: From functional genomics to systems biology: concepts and practices. Comptes Rendus Biologies 2003, 326: 879–892. 10.1016/j.crvi.2003.09.033View ArticlePubMedGoogle Scholar
- Doyle FJ, Stelling J: Systems interface biology. Journal of the Royal Society Interface 2006, 3: 603–616. 10.1098/rsif.2006.0143PubMed CentralView ArticleGoogle Scholar
- Ideker T, Galitski T, Hood L: A new approach to decoding life: systems biology. Annual Review of Genomics and Human Genetics 2001, 2: 343–372. 10.1146/annurev.genom.2.1.343View ArticlePubMedGoogle Scholar
- Kitano H: Systems biology: a brief overview. Science 2002, 295: 1662–1664. 10.1126/science.1069492View ArticlePubMedGoogle Scholar
- Kitano H: Computational systems biology. Nature 2002, 420: 206–210. 10.1038/nature01254View ArticlePubMedGoogle Scholar
- Weston AD, Hood L: Systems biology, proteomics, and the future of health care: toward predictive, preventative, and personalized dedicine. Journal of Proteome Research 2004, 3: 179–196. 10.1021/pr0499693View ArticlePubMedGoogle Scholar
- Hasty J, McMillen DR, Isaacs F, Collins JJ: Computational studies of gene regulatory networks: in numero molecular biology. Nature Reviews Genetics 2001, 2: 268–279. 10.1038/35066056View ArticlePubMedGoogle Scholar
- Tanaka RJ, Okano H, Kimura H: Mathematical description of gene regulatory units. Biophysical Journal 2006, 91: 1235–1247. 10.1529/biophysj.106.081828PubMed CentralView ArticlePubMedGoogle Scholar
- Iafolla MAJ, McMillen DR: Extracting biochemical parameters for cellular modeling: A mean-field approach. Journal of Physical Chemistry B 2006, 110: 22019–22028. 10.1021/jp062739mView ArticleGoogle Scholar
- Ronen M, Rosenberg R, Shraiman BI, Alon U: Assigning numbers to the arrows: parameterizing a gene regulation network by using accurate expression kinetics. Proceedings of the National Academy of Sciences (USA) 2002, 99: 10555–10560. 10.1073/pnas.152046799View ArticleGoogle Scholar
- Bremer H, Dennis PP: Modulation of the chemical composition and other parameters of the cell by growth rate. In Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. 2nd edition. Edited by: Neidhardt FC, Ingraham JL, Magasanik B, Low KB, Schaechter M, Umbarger HE. Washington, DC: American Society for Microbiology; 1996:1553–1569.Google Scholar
- Arkin AP: Synthetic cell biology. Current Opinion in Biotechnology 2001, 12: 638–644. 10.1016/S0958-1669(01)00273-7View ArticlePubMedGoogle Scholar
- Benner SA, Sismour M: Synthetic biology. Nature Reviews Genetics 2005, 6: 533–543. 10.1038/nrg1637View ArticlePubMedGoogle Scholar
- Hasty J, McMillen DR, Collins JJ: Engineered gene circuits. Nature 2002, 420: 224–230. 10.1038/nature01257View ArticlePubMedGoogle Scholar
- Kærn M, Blake W, Collins JJ: The engineering of gene regulatory networks. Annual Review of Biomedical Engineering 2003, 5: 179–206. 10.1146/annurev.bioeng.5.040202.121553View ArticlePubMedGoogle Scholar
- Voigt CA: Genetic parts to program bacteria. Current Opinion in Biotechnology 2006, 17: 548–557. 10.1016/j.copbio.2006.09.001View ArticlePubMedGoogle Scholar
- Weiss R, Basu S, Hooshangi S, Kalmbach A, Karig D, Mehreja R, Netravali I: Genetic circuit building blocks for cellular computation, communications, and signal processing. Natural Computing 2003, 2: 47–84. 10.1023/A:1023307812034View ArticleGoogle Scholar
- Alberts JB, Odell GM: In silico reconstitution of Listeria propulsion exhibits nano-saltation. Public Library of Science Biology 2004, 2: e412.Google Scholar
- Kærn M, Elston TC, Blake WJ, Collins JJ: Stochasticity in gene expression: from theories to phenotypes. Nature Reviews Genetics 2005, 6: 451–464. 10.1038/nrg1615View ArticlePubMedGoogle Scholar
- Neves SR, Iyengar R: Modeling of signaling networks. BioEssays 2002, 24: 1110–1117. 10.1002/bies.1154View ArticlePubMedGoogle Scholar
- The Virtual Cell Project[http://www.nrcam.uchc.edu/]
- The E-Cell Project[http://www.e-cell.org]
- Garvey T, Lincoln P, Pedersen C, Martin D, Johnson M: BioSPICE: Access to the most current computational tools for biologists. OMICS 2003, 7: 411–420. 10.1089/153623103322637715View ArticlePubMedGoogle Scholar
- Sundararaj S, Guo A, Habibi-Nazhad B, Rouani M, Stothard P, Ellison M, Wishart DS: The CyberCell database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli . Nucleic Acids Research 2004, 32: D293-D295. 10.1093/nar/gkh108PubMed CentralView ArticlePubMedGoogle Scholar
- Adalsteinsson D, McMillen DR, Elston TC: Biochemical Network Stochastic Simulator (BioNetS): software for stochastic modeling of biochemical networks. BMC Bioinformatics 2004, 5: 24. 10.1186/1471-2105-5-24PubMed CentralView ArticlePubMedGoogle Scholar
- Gibson MA, Bruck J: Efficient exact stochastic simulation of chemical systems with many species and many channels. Journal of Physical Chemistry A 2000, 104: 1876–1889. 10.1021/jp993732qView ArticleGoogle Scholar
- Gillespie D: Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 1977, 81: 2340–2361. 10.1021/j100540a008View ArticleGoogle Scholar
- Baetz K, Kærn M: Predictable trends in protein noise. Nature Genetics 2006, 38: 610–611. 10.1038/ng0606-610View ArticlePubMedGoogle Scholar
- Bar-Even A, Paulsson J, Maheshri N, Carmi M, O'Shea E, Pilpel Y, Barkai N: Noise in protein expression scales with natural protein abundance. Nature Genetics 2006, 38: 636–643. 10.1038/ng1807View ArticlePubMedGoogle Scholar
- Blake WJ, Kærn M, Cantor CR, Collins JJ: Noise in eukaryotic gene expression. Nature 2002, 422: 633–637. 10.1038/nature01546View ArticleGoogle Scholar
- Dublanche Y, Michalodimitrakis K, Kummerer N, Foglierini M, Serrano L: Noise in transcription negative feedback loops: simulation and experimental analysis. Molecular Systems Biology 2006., 2:Google Scholar
- Elowitz MB, Levine AJ, Siggia ED, Swain PS: Stochastic gene expression in a single cell. Science 2002, 297: 1183. 10.1126/science.1070919View ArticlePubMedGoogle Scholar
- Fraser HB, Hirsh AE, Giaever G, Kumm J, Eisen MB: Noise minimization in eukaryotic gene expression. PLoS Biology 2: e137. doi:110.1371/journal.pbio.0020137; 2004: e137 doi:110.1371/journal.pbio.0020137. 10.1371/journal.pbio.0020137Google Scholar
- Kepler TB, Elston TC: Stochasticity in transcriptional regulation: origins, consequences, and mathematical representations. Biophysical Journal 2001, 81: 3116–3136.PubMed CentralView ArticlePubMedGoogle Scholar
- Paulsson J: Summing up the noise in gene networks. Nature 2004, 427: 415–418. 10.1038/nature02257View ArticlePubMedGoogle Scholar
- Pedraza JM, van Oudenaarden A: Noise propagation in gene networks. Science 2005, 307: 1965–1969. 10.1126/science.1109090View ArticlePubMedGoogle Scholar
- Raser JM, O'Shea EK: Noise in gene expression: origins, consequences, and control. Science 2005, 309: 2010–2013. 10.1126/science.1105891PubMed CentralView ArticlePubMedGoogle Scholar
- Spudich JL, Koshland DE Jr: Non-genetic individuality: chance in the single cell. Nature 1976, 262: 467–471. 10.1038/262467a0View ArticlePubMedGoogle Scholar
- Swain PS, Elowitz MB, Siggia ED: Intrinsic and extrinsic contributions to stochasticity in gene expression. Proceedings of the National Academy of Sciences (USA) 2002, 99: 12795–12800. 10.1073/pnas.162041399View ArticleGoogle Scholar
- Swain PS, Longtin A: Noise in genetic and neural systems. Chaos 2006, 16: 026101. 10.1063/1.2213613View ArticlePubMedGoogle Scholar
- Thattai M, van Oudenaarden A: Intrinsic noise in gene regulatory networks. Proceedings of the National Academy of Sciences (USA) 2001, 98: 8614–8619. 10.1073/pnas.151588598View ArticleGoogle Scholar
- Turner TE, Schnell S, Burrage K: Stochastic approaches for modelling in vivo reactions. Computational Biology and Chemistry 2004, 28: 165–178. 10.1016/j.compbiolchem.2004.05.001View ArticlePubMedGoogle Scholar
- Volfson D, Marciniak J, Blake WJ, Ostroff N, Tsimring LS, Hasty J: Origins of extrinsic variability in eukaryotic gene expression. Nature 2006, 439: 861–864. 10.1038/nature04281View ArticlePubMedGoogle Scholar
- Elowitz MB, Surette MG, Wolf P-E, Stock JB, Leibler S: Protein mobility in the cytoplasm of Escherichia coli. Journal of Bacteriology 1999, 181: 197–203.PubMed CentralPubMedGoogle Scholar
- Andrews SS, Bray D: Stochastic simulation of chemical reactions with spatial resolution and single molecular detail. Physical Biology 2004, 1: 137–151. 10.1088/1478-3967/1/3/001View ArticlePubMedGoogle Scholar
- Bray D: Signaling complexes: Biophysical constraints on intracellular communication. Annual Review of Biophysics and Biomolecular Structure 1998, 27: 59–75. 10.1146/annurev.biophys.27.1.59View ArticlePubMedGoogle Scholar
- Ellis RJ: Macromolecular crowding: obvious but underappreciated. Trends in Biochemical Sciences 2001, 26: 597–604. 10.1016/S0968-0004(01)01938-7View ArticlePubMedGoogle Scholar
- Schnell S, Turner TE: Reaction kinetics in intracellular environments with macromolecular crowding: simulations and rate laws. Progress in Biophysics & Molecular Biology 2004, 85: 235–260. 10.1016/j.pbiomolbio.2004.01.012View ArticleGoogle Scholar
- Golding I, Cox EC: Physical nature of bacterial cytoplasm. Physical Review Letters 2006, 96: 098102. 10.1103/PhysRevLett.96.098102View ArticlePubMedGoogle Scholar
- Kopelman R: Rate processes on fractals: theory, simulations, and experiments. Journal of Statistical Physics 1986, 42: 185–200. 10.1007/BF01010846View ArticleGoogle Scholar
- Kopelman R: Fractal reaction kinetics. Science 1988, 241: 1620–1626. 10.1126/science.241.4873.1620View ArticlePubMedGoogle Scholar
- Berry H: Monte Carlo simulations of enzyme reactions in two dimensions: fractal kinetics and spatial segregation. Biophysical Journal 2002, 83: 1891–1901.PubMed CentralView ArticlePubMedGoogle Scholar
- Bagh S, Mazumder M, Velauthapillai T, Sardana V, Dong GQ, Movva AB, Lim LH, McMillen DR: Plasmid-borne prokaryotic gene expression: sources of variability and quantitative system characterization. Physical Review E 2008, 77: 021919. 10.1103/PhysRevE.77.021919View ArticleGoogle Scholar
- Birnbaum S, Bailey JE: Plasmid presence changes the relative levels of many host proteins and ribosome components in recombinant Escherichia coli. Biotechnology and Bioengineering 1990, 37: 736–745. 10.1002/bit.260370808View ArticleGoogle Scholar
- Paulsson J, Ehrenberg M: Noise in a minimal regulatory network – plasmid copy number control. Quarterly Review of Biophysics 2001, 34: 1–59. 10.1017/S0033583501003663View ArticleGoogle Scholar
- Steitz JA: Polypeptide chain initiation: nucleotide sequences of the three ribosomal binding sites in bacteriophage R17 RNA. Nature 1969, 224: 957–964. 10.1038/224957a0View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.