Overall Simulator Structure
Tabasco implements a modified version of the Gibson Next Reaction Method (NRM) [17]. Gibson's NRM is an exact SSA that extends Gillespie's original First Reaction method by (1) updating only the minimum number of reactions through the use of a dependency graph and using absolute tentative reaction times, and (2) using an efficient data structure, the indexed priority queue, to store and sort reactions. At the start of the NRM, all reactions are defined and their tentative time of next execution ("tentative reaction time") is calculated and stored within an indexed priority queue. In addition, a dependency graph, which allows an executed reaction to call an update on only those reactions that are affected, is generated. The use of absolute times when calculating tentative reaction times allows reactions that are not affected by the execution of the last reaction to remain valid for the next iteration. The indexed priority queue sorts these reactions efficiently to allow quick searches for the minimum tentative time as well as quick lookups for any particular reaction. The reaction with the next tentative reaction time is executed, any tentative reaction times that are affected by the execution of the current reaction are updated, and the indexed priority queue structure is reordered to reflect the new times. The NRM does not change the tentative reaction times of reactions that are not affected by the currently executing reaction. The process is repeated to compute the time evolution of the entire system. Gibson and Bruck proved that the NRM is equivalent to the exact SSA algorithms developed by Gillespie.
The NRM uses a dependency graph to determine which reactions are affected by any particular reaction's execution. In the NRM, the dependency graph is constructed only once at the start of simulation and remains unchanged afterwards. However, since Tabasco creates reactions and complexes at the single-molecule level dynamically during simulation, a static dependency graph and indexed priority queue will not work. In order to solve this problem, Tabasco contains two specialized classes per DNA molecule within the overall indexed priority queue that are used to track a set of dynamically generated reactions. Each class contains a dynamic priority queue that stores the dynamically generated transcriptional and translational reactions and their tentative reaction times, as well the dependencies of any particular reaction (these specialized priority queues and the overall indexed priority queue are easy to confuse, and thus we will refer to the prior as the dynamic priority queue). The minimum tentative reaction time for all the dynamic reactions is set as the tentative reaction time of that dynamic priority queue with respect to the overall priority queue. Since, as in the NRM, Tabasco uses absolute times for determining the next reaction, the particular choice of the data structures containing the reactions does not affect the simulation results. The reason to separate the dynamic queues from the main indexed priority queue is to allow the size of the dynamic queues to change over time. As long as all dependencies are accurately updated upon reaction execution in the dynamic priority queues, the results are equivalent to the NRM. The structure of these dynamic priority queues will now be discussed in more detail.
Transcription
A special class tracks the transcriptional processes for each DNA molecule. This class contains a dynamic priority queue that keeps track of all reactions related to that DNA molecule, such as transcriptional elongation processes. All reactions that are stored in this dynamic priority queue have tentative reaction times (as calculated by the NRM), and the minimum tentative reaction time is used to set when this class should be called to execute within the indexed priority queue. In order to determine what the effects of a particular binding or elongation reaction are, the class contains two arrays where each element represents one DNA base. Each element of the first array contains pointers to transcriptional elements encoded at that location such as promoters and terminators. The second array contains pointers to all DNA-protein complexes that reside on the DNA such as elongating RNA polymerase. Only one transcriptional control element or DNA complex can occupy any particular position at any time.
The framework we use here to simulate protein-DNA binding is shown in Figure 6A. Interactions from the species level to this specialized reaction occur when proteins bind the DNA. For example, RNA polymerase or other proteins can bind free promoters to form a protein-DNA complex, which causes the DNA complex array to be updated, the number of available promoters and RNA polymerase to decrement, and finally places two reactions into the dynamic priority queue. The two reactions compete to either have the complex form an initiation complex or fall off the DNA. If an initiation complex is formed, a stochastic decision is made as to whether the complex will recycle back to an initiation complex with some characteristic time, or moves on to be become an elongation complex. This stochastic decision is instantaneous in reaction time and will be discussed below. The polymerase can undergo abortive initiation step, recycling to the initiation complex, or form an elongation complex. A promoter is not made available for rebinding until the footprint of an initiating polymerase has cleared the entire promoter region.
Once an elongation complex is formed, each elongation reaction causes the complex to move one base pair, and the elongation reaction is updated with a new time for the next elongation step. When an elongation reaction executes, the algorithm also checks whether another complex's footprint prevents the current polymerase from moving forward by checking the array containing all complex positions on the DNA. If there is another polymerase blocking the current polymerase's path, the polymerases will behave according to the polymerase interaction model chosen at the start of simulation. For example, the upstream polymerase can terminate, or signal the downstream polymerase to terminate the next time the polymerase is set to elongate, or simply hold its position until the next opportunity to elongate. The elongation reaction also does a check for transcriptional elements on the DNA such as promoters and transcriptional terminators. Upon arriving at a transcriptional terminator, a polymerase will either continue transcribing or terminate depending on a stochastic decision that will also be discussed below. If a transcribing polymerase occludes an open promoter, then the promoter will be decremented and unavailable for binding.
The "stochastic decisions" described above for transcriptional termination and abortive initiation on the promoter are used as methods to simplify parameterization in the models, so as to be consistent with how experimental data for such events is typically reported. In the case of termination, a uniform random number is chosen, and if that number is less than the termination efficiency, the polymerase will terminate transcription and fall off the DNA. Analogously, in the case of abortive initiation, if the uniform random number is less than the abortive initiation percentage, then the next reaction will be an abortive initiation step rather than formation of an elongation complex. These decisions take no simulator time, but cause changes in the downstream actions the complexes may take. This simplification is not done for computational efficiency, but more for model simplicity. Tabasco could be modified, with little computational load, to replace the stochastic decisions into competing rates of different reactions. However, these rates are not well understood, and thus we chose to incorporate the measurable models of termination efficiencies and abortive initiation frequencies into the simulator.
Translation
The transcribing RNA polymerase complexes also produce mRNA, which are tracked by a separate class. Since there are many more copies of RNA than DNA during simulation, and because we were uncertain as to the importance of protein-protein interactions on the mRNA, we chose to treat the majority of translation at the species level. However, if, for example, an RNA polymerase prematurely terminates before reaching the translation stop site, the mRNA and the ribosomes translating it will not produce function proteins. Thus, so long as the coding sequence is still being transcribed, we must treat all translation events at the single-molecule level as well. However, as soon as the entire coding sequence of an open reading frame is transcribed, Tabasco transitions to tracking species of mRNA molecules.
We used two classes to model the formation of RBSs and their respective start sites, the Nascent RBSs and Mature RBSs, in order to differentiate when mRNA should be tracked at the single-molecule level or the species-level, respectively. At the start of simulation, one more array of genome length is created that contains the positions of translation start and stop sites on the DNA. As RNA polymerase elongates (as described above), if the polymerase transcribes past a translational start site, a Nascent RBS is made. This Nascent RBS is available for binding by free ribosomes, and a reaction is automatically created and placed into the dynamic priority queue for translation processes. Once the polymerase arrives at the corresponding translation termination site, the Nascent RBS is converted to a Mature RBS, which is tracked at the species-level in the overall Indexed Priority Queue.
Translation, both at the single-molecule level and species level, occurs in three steps. First, ribosomes can bind either Nascent or Mature RBSs; there is an initial reaction to form the (Nascent or Mature) Initiation Complex. Second, these initiation complexes are converted to (Nascent or Mature) Elongation Complexes and an RBS at a rate that depends on the speed of the ribosome and the length the ribosome must travel to clear the ribosome binding site. This reaction is treated as a single step, however the distribution of times is chosen from a gamma distribution, in order to better represent a series of individual elongation steps. Third, the Elongation Complex then forms a finished protein and a free ribosome at a rate proportional to the remaining length of the open reading frame. This reaction also is computed via a gamma distribution.
At the single molecule level, two additional mechanisms allow for the state of the transcribing RNA polymerase to affect translations. First, if the transcribing RNA polymerase terminates before reaching a corresponding translation stop site, all bound ribosomes are immediately released, the Nascent RBS is removed, and no protein product is formed. Also, for any particular Nascent RNA, the maximal number of Nascent Elongation Complexes that can be formed is capped at the length of the currently transcribed portion of the open reading frame divided by the footprint of the ribosome.
Finally, to note, all transitions between single-molecule tracking and species-level tracking occur when a reaction, such as RNA polymerase termination or Initiation Complex conversion into an Elongation Complex. Thus, the transition from single-molecule tracking back to the species level tracking should also not affect the validity of using the NRM.
DNA entry
We developed multiple models to represent DNA entry into a cell or compartment (such a step can be useful in starting a simulation of infection or transformation). First, DNA can enter the cell via a zero-order constant reaction rate. Second, RNA polymerases have themselves been implicated as molecular motors that can drive DNA entry [29, 43]. Thus, in Tabasco, RNA polymerases that reach the end of a DNA molecule that has not yet fully entered the cell can cause DNA internalization at the rate of transcription elongation. Both of these mechanisms were used during our simulation of T7 gene expression.
Simulation, Data Output, Visualization, Code
Tabasco is written in Java® 1.4. The input file to the simulator is an XML file that describes and parameterizes the relevant genetic elements, initial conditions, and any other reactions that occur. The visualization is created by producing images that are then merged using Quicktime® to create a movie. The source code, executables, along with documentation and instructions for use are available within Additional Files 2, 3 and 4 and via the TABASCO website (Availability & Requirements).
Parameterization
All constants used are provided for completeness. Please see the Supplementary Materials for input files and exact constants used.
Gamma versus Exponential distribution
The distribution of expected times for a reaction to occur in a stochastic simulator depends on the underlying model (Figure 2). Elementary chemical reactions will follow an exponential distribution in arrival times. However, this is not true of non-elementary reactions. Treating an imaginary elongation process as one step versus 50 individual steps has significant consequences. To obtain the distribution time for the two cases, we used uniformly-distributed pseudorandom numbers and transformed them into exponential- or gamma-distributed random numbers. The exponential distribution is used for the single step representation. Exponentially distributed numbers are calculated by simply taking the negative natural log of a uniformly-distributed pseudorandom number. A series of exponentially distributed arrival times, as in the case of the multi-step elongation process, is given exactly by the gamma distribution. Gamma distributed numbers are calculated from uniformly-distributed pseudorandom numbers by an implementation of the rejection method [44]. Pseudorandom numbers are generated from Java's implementation (java.util.Random) of a linear congruential pseudorandom number generator with a 48-bit seed [45].
Simple Gene Expression model
We simulated two models of a promoter driving expression of a coding domain. The first model, termed single-molecule simulation, used the described Tabasco simulator to account for each reaction step during transcription and translation. The model uses the schemes shown in Figures 1 and 6a. The second model, termed the species-level simulation, treats transcriptional and translational elongation as a single step as shown in Figure 6b. The main parameters used in both models are shown in Table 1. Constants were adjusted slightly to account for small differences in model structure to give equal steady state values of mRNA and protein levels. Finally, the input files for the simulations can be found in the Supplementary Materials; the input files can be used to either run the simulation using Tabasco, or check all parameters used for the simulation.
Polymerase Interactions & Bacteriophage T7
The simulations for the polymerase interactions and bacteriophage T7 development used parameters that can be found in the input files in the Supplementary Materials. Table S1 details the meaning of each of the parameters in the input files. The parameters used were based on empirical measurements where possible. However, in general, the exact values of the constants are ancillary to this section, which is to show that Tabasco is able to simulate processes such as polymerase interactions and entire genetic systems such as T7. The parameter derivations are detailed are detailed elsewhere [46].