Current approaches to gene regulatory network modelling
© Schlitt and Brazma. 2007
Published: 27 September 2007
Many different approaches have been developed to model and simulate gene regulatory networks. We proposed the following categories for gene regulatory network models: network parts lists, network topology models, network control logic models, and dynamic models. Here we will describe some examples of each of these categories. We will study the topology of gene regulatory networks in yeast in more detail, comparing a direct network derived from transcription factor binding data with an indirect network derived from genome-wide expression data in mutants. Regarding network dynamics, we briefly describe discrete and continuous approaches to network modelling, then describe a hybrid model called the Finite State Linear Model and demonstrate that some simple network dynamics can be simulated in this model.
Most cellular processes involve many different molecules. The metabolism of a cell consists of many interlinked reactions: the products of one reaction are the substrates of the next, thus forming the metabolic network. Similarly, signalling molecules are interlinked, and cross-talk between the different signalling cascades forms the signalling network. The same is true for regulatory relationships between genes and their products. All these networks are closely related; for example, the regulatory network is influenced by extracellular signals. But there are characteristic features of the signalling network that do not exist in the regulatory network, so it makes sense to deal with these networks separately. Our main interest is in transcription regulation networks and we will refer to them as "gene networks", but many principles are valid for a wide range of networks. High-throughput technologies allow studying aspects of gene regulatory networks on a genome-wide scale, and we will discuss recent advances as well as limitations and future challenges for gene network modelling. This survey is largely based on and is an extension of two previous publications [1, 2].
Gene networks are often described verbally, in combination with figures, to illustrate the sometimes complicated interrelations between network elements. Due to the complexity of these networks, such models are not always easy to comprehend and they often leave a considerable amount of ambiguity to the reader's imagination. Since the 1960s, methods from mathematics and physics have been used to describe and simulate small gene networks more rigorously. Nowadays, molecular biological methods and high-throughput technologies make it possible to study a large number of genes and proteins in parallel, enabling the study of larger gene networks. This allows tackling gene networks more efficiently and has led to a new discipline called Systems Biology, which seeks to combine methods from biology with methods from mathematics, physics and engineering to describe biological systems.
We proposed to categorize gene network models into four classes according to the increasing level of detail in the models. Each class has its own advantages and limitations. The four classes are:
i. parts lists – a collection, description and systematisation of network elements in a particular organism or a particular biological system (e.g., transcription factors, promoters, and transcription factor binding sites);
ii. topology models – a description of the connections between the parts; this can be viewed as wiring diagrams where directed or undirected connections between genes represent different types of interactions;
iii. control logic models – a description of combinatorial (synergetic or interfering) effects of regulatory signals – e.g., which transcription factor combinations activate and which repress the transcription of the gene;
iv. dynamic models – the simulation of the real-time behaviour of the network and the prediction of its response to various environmental changes, external, or internal stimuli.
Obviously, for a fixed number of network elements, each successive level is more detailed and complex. But the size of the networks that we are able to model at each particular level is limited: much larger networks can be described at the topological level than at the dynamic level. In the following section we will discuss these classes in more detail.
"All models are wrong, but some are useful". – George E. P. Box
Compiling the parts list is the first step in developing any model of some complexity and is not always a trivial exercise. Simple parts lists of genes, transcription factors, promoters, binding sites and other molecular entities are useful means for assessing the network complexity and for comparing different organisms. Such parts lists can be the result of a genome-sequencing and annotation project, where the complete DNA sequence of an organism is determined and all (or at least many) genes and proteins are identified. A parts list could also be represented as a database of regulatory elements or it could be ontology terms of transcription regulation processes assigned to a set of genes.
[Table 1: Number of transcription regulators in different organisms; columns: number of genes, number of transcription regulators.]
Many publications address the computational identification of transcription factor binding sites, for instance by analysing promoter sequences of coexpressed genes. One approach is to search for short sequences that are overrepresented in the promoters of a particular group of genes (e.g. clusters of coexpressed genes or sets of longer sequences known to bind a particular transcription factor) in comparison to the promoter sequences of all other genes. This approach obviously depends on the availability of the sequences for many genes and their upstream regions. Such an approach was applied in a cell cycle study in Schizosaccharomyces pombe, where Rustici et al. showed that the presence or absence of consensus binding sites in the promoter regions corresponds to the cyclic expression pattern of the genes. Genes with a peak expression at a similar cell cycle stage often share similar sets of consensus binding sites.
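The overrepresentation idea can be sketched with a toy k-mer count (a hypothetical, highly simplified stand-in for the statistical motif-finding methods used in practice): compare the frequency of each short sequence in the cluster promoters with its frequency in a background set.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count occurrences of every k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def overrepresented_kmers(cluster_seqs, background_seqs, k=4, min_fold=3.0):
    """Return k-mers whose frequency in the cluster promoters exceeds
    the background frequency by at least min_fold."""
    fg = kmer_counts(cluster_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    hits = {}
    for kmer, n in fg.items():
        fg_freq = n / fg_total
        bg_freq = bg.get(kmer, 0.5) / bg_total  # pseudocount for unseen k-mers
        if fg_freq / bg_freq >= min_fold:
            hits[kmer] = fg_freq / bg_freq
    return hits

# Toy example: 'ACGT' is enriched in the (made-up) cluster promoters.
cluster = ["ACGTACGTAA", "TTACGTACGT"]
background = ["GGGGGGGGGG", "CCCCCCCCCC", "GGGCCCGGGC"]
hits = overrepresented_kmers(cluster, background, k=4)
```

Real tools additionally assess statistical significance and allow degenerate positions in the motif; this sketch only captures the fold-enrichment intuition.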
However, the exact promoter regions are usually unknown, and even the transcription start sites are known for only a few genes. Baker's yeast (Saccharomyces cerevisiae) has a relatively small genome with short intergenic regions; considering about 600–1000 bp upstream of the translation start site (ATG) appears to be a good approximation of the promoter regions. In higher organisms such as vertebrates, the intergenic regions, and thus the putative promoter regions, are much larger than in yeast; therefore the identification of regulatory elements in the DNA sequence by computational means has turned out to be rather elusive. Some studies have focused on the computational analysis of higher-level organisation of transcription factor binding sites in promoters, such as frequently occurring combinations of known binding sites [9, 10], or restricted the search for regulatory elements to conserved sequence regions identified by genome comparisons, a method often referred to as phylogenetic footprinting. However, phylogenetic footprinting does not always work, because the localisation and the binding sites themselves are not always conserved [12, 13].
Transcription factor localisations can also be identified experimentally. For example, individual binding sites can be detected using the "DNAse I footprinting assay": proteins bound to the DNA protect it from degradation by DNAse I, so these protected regions can be analysed further. Another common experimental method is the "electrophoretic mobility shift assay" (EMSA), sometimes called "band shift assay" or "gel retardation assay" – DNA fragments that are bound by protein move more slowly in an electrophoretic gel than unbound fragments [15, 16]. These methods allow fine mapping of individual binding sites, but are very labour intensive. High-throughput methods such as the ChIP-on-chip method (see additional file 2, which describes this method in more detail) allow the genome-wide detection of binding sites for a transcription factor, but the spatial resolution and signal quality are limited. Furthermore, assigning transcription factors to their target genes based on genomic localisation is difficult due to the size of intergenic and intronic regions and the long-range effects of some transcription factors.
Nevertheless, parts lists provide a first impression of gene networks in different organisms, and they are a necessary prerequisite before we move on to the network topology.
Once we know the transcription factors and their binding sites, we can describe gene transcription regulatory networks by graphs, with nodes corresponding to genes and edges to regulatory interactions. (Note the difference between these discrete graphs and plots of mathematical functions, also often referred to as graphs.) A short introduction to discrete graphs is given in additional file 1; for more details please consult, for example, Cormen et al. One important concept that we will use below is the representation of a graph by a so-called adjacency matrix, where the element a_ij in row i and column j equals 1 (i.e., a_ij = 1) if node i is connected to node j, and otherwise a_ij = 0. Graph representations have been used for various biological data sets ranging from protein-protein interaction networks to coexpression networks; they have long been used in mathematics, physics and computer science, and many aspects of graphs and their applications have been studied (e.g., [19–21]).
To illustrate the knowledge that we can obtain from studying network topology, we will compare and combine information from two high-throughput data sets for the yeast Saccharomyces cerevisiae. The first is obtained from chromatin immunoprecipitation experiments for transcription factors (ChIP network), while the second is obtained from microarray experiments on single-gene deletion mutants (mutant network); for more detail on the experimental methods please refer to additional file 2. Microarray measurements can be presented in a data matrix, where rows represent genes and columns particular experiments (hybridisations). For instance, in the ChIP experiment each column will correspond to a particular transcription factor (studied in the particular experiment), while in the mutant experiment each column will correspond to a particular mutant. In this way, not only rows but also columns will correspond to genes. The measurement values are typically real numbers, such as intensity levels, expression levels, or p-values, depending on the data processing steps applied. By applying an additional data processing step, often called thresholding, we can transform these continuous values into discrete values (e.g., if we choose a threshold t, then we can replace any x_ij by 0 if x_ij < t, and by 1 otherwise). In this way we transform the original measurement matrix for these experiments into an adjacency matrix defining a graph: two genes in this graph are connected if the measurement value is higher than the chosen threshold.
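The thresholding step described above can be sketched as follows (a minimal illustration with made-up measurement values, not the actual ChIP or mutant data):

```python
def threshold_to_adjacency(data, t):
    """Discretise a measurement matrix into a 0/1 adjacency matrix:
    a_ij = 1 if the measurement x_ij is at least the threshold t, else 0."""
    return [[1 if x >= t else 0 for x in row] for row in data]

# Toy 3x3 measurement matrix (rows and columns both indexed by genes).
measurements = [
    [0.2, 3.1, 0.9],
    [2.8, 0.1, 0.4],
    [0.0, 1.7, 2.2],
]
adjacency = threshold_to_adjacency(measurements, t=1.5)
# adjacency now defines a graph: gene i is connected to gene j iff a_ij = 1.
```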
The ChIP network is based on experimental data published by Harbison et al. As described in additional file 2, they used genomic tiling arrays to identify the genomic regions bound by transcription factors. The authors assigned each genomic region to one or two target genes based on proximity in the genome. The relative intensities of spots are the basis for an error model that assigns a probability score (p-value) to each binding interaction, which we use for discretisation.
The starting point for the mutant network is the gene expression data matrix published by Hughes et al. . Each experimental condition, which in our case is a particular gene deletion mutant, corresponds to a column and each gene corresponds to a row (for more detail see additional file 2). We discretise the data matrix using a gene-specific standard deviation estimate γ obtained from the error model proposed by Hughes et al. .
[Table 2: Some properties of the mutant network and the ChIP network at different thresholds. Networks compared: ChIP network (p < 0.01), ChIP network (p < 0.001), mutant network (γ = 2.0), mutant network (γ = 2.5), mutant network (γ = 3.0). Reported properties include the number of edges where source gene and target gene have the same cellular role annotation in YPD (http://www.proteome.com) and the number of edges per source gene.]
What do these two networks mean and how do they compare? In the ChIP network, an arc A→B means that gene A codes for a transcription factor that binds to the promoter of gene B, while in the mutant network it means that the mutation of A will change the expression level of B. The ChIP network describes physical interactions, but it does not tell us anything about the effects of these interactions. The mutant network is similar to the networks built by classical genetic means – we know that a mutation (perturbation) of the first gene has an effect on the second one, but this does not necessarily imply a direct physical interaction – there may be a long transcriptional or signalling cascade leading from the first gene to the second. In this way the first network is likely to contain direct interactions, while the second may include indirect interactions as well. However, it is possible that some of the 'direct' interactions of the ChIP network are not biologically functional and thus may not be supported by the mutant network. Most importantly, it should be noted that the experimental conditions in the two experiments are not identical, which may result in considerable discrepancies between the two networks.
[Table 3: Degrees of the source genes shared between the mutant network and the ChIP network (data for YPD medium only); columns: degree in the mutant network, degree in the ChIP network, edges shared between the ChIP and mutant networks, whether the intersection is significant according to a hypergeometric test, and the sum of edges.]
Others have observed that the intersections of different protein-protein interaction networks are small, too: only a few interactions are reported by several experiments, even when different methods were used to measure the same interactions. Reassuringly, the shared interactions turned out to be more reliable than most of the data. Here we work with experiments measuring different, if somehow related, effects, but still the proportion of connections between functionally related genes is increased in the intersection of the two networks: in the intersection, 40 of 102 (39.2%) connections link genes that have the same cellular role in YPD, compared to less than 20% in the original networks (Table 2).
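A hypergeometric test of the kind used to judge whether such an overlap is larger than expected by chance can be sketched as follows (the edge counts below are made up for illustration; they are not the counts from Table 2):

```python
from math import comb

def hypergeom_pval(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the chance of seeing
    at least k shared edges if the n edges of the second network were
    drawn at random from N possible edges, K of which are in the first
    network.  math.comb returns 0 when the lower argument exceeds the
    upper, so out-of-range terms vanish automatically."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Hypothetical numbers: 1000 candidate edges, 50 in the ChIP network,
# 40 in the mutant network, 10 shared.  Expected overlap by chance: 2.
p = hypergeom_pval(N=1000, K=50, n=40, k=10)
```

A small p indicates that the observed intersection is unlikely to arise by chance alone.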
To summarise, although the two networks are rather different, the part that is common to both is biologically more meaningful, and some of the indirect interactions of the mutant network can be explained by the direct interactions in the ChIP network.
Network topologies, particularly in yeast, have been widely studied and many interesting observations have been made. It has been proposed that the existence of highly connected genes (hubs) in a network might make networks more tolerant to random failure of network elements [29, 30]. In protein-protein interaction networks it seems possible to classify hubs in combination with expression data: Han et al. showed that hub proteins can be divided into two groups based on the level of coexpression between their neighbours in the network (the proteins directly connected to the hub proteins). Hubs with low coexpression seem to link functionally separate modules, and removing these hubs leads to more rapid disintegration of the network. However, so far this has not been observed in transcription networks.
Luscombe et al. compiled data from ChIP-on-chip experiments for yeast to construct a network of 142 transcription factors, 3420 target genes and 7074 regulatory interactions . To study the dynamics of this network they traced the paths from the target genes back to initial transcription factors, starting from target genes that are differentially expressed under particular conditions as demonstrated in previously published microarray experiments. Depending on the conditions, different sets of genes are expressed, leading to different sets of target genes as start points for the backtracking and different transcription factors along the path. This is consistent with results from ChIP-on-chip experiments and demonstrates that the topology of the gene regulatory network is not independent of the experimental conditions .
In a different line of investigation, Lee et al. and Milo et al. identified recurring structural elements (motifs) in the networks. They examined topological networks derived from ChIP-on-chip data for structures consisting of 3, 4 or more edges that occur in the original network more often than in randomised networks. The network motifs they identified as significantly more frequent than in randomised networks included feed-forward and feedback loops. These motifs may partly be the result of gene duplications during genome evolution.
These are just some examples of possible analyses that can be performed at the topology level. However, arguably the main reason to study the network topology is to prepare the ground for the next step of building more detailed models of gene networks. Before any logic or dynamic network model can be constructed, we need to know which gene products interact and which are mutually independent. Even if we take the view that in the real-world network every gene is connected to every other gene to some degree, not all these connections are equally strong, and a discretisation step can be used to keep only the strong connections in the model. A complete network where everything is connected to everything may not be a practical approach to network modelling on a genome scale.
Arguably the most important question is whether we can find modules, i.e. subnetworks that are relatively isolated from the rest of the network. If such modules are found, they can help us to apply a reductionist approach later on, by allowing us to model parts of the network independently at a more detailed level. For instance, if we can build a dynamic model of an independent module, then we can perform simulations independently from the rest of the network. The existence of modules in biological systems has often been taken as an axiom. However, a precise definition of what constitutes a module is elusive, and therefore this term has been used in various contexts. In a graph representation it is natural to define a module as a 'relatively' isolated component, and indeed such components were found in protein-protein interaction networks. In contrast, isolated components have hitherto not been found in the wiring diagrams of eukaryotic transcription regulation networks. Several methods have been proposed to identify modules as groups of genes coexpressed under specific conditions [37, 38], but opinions remain controversial regarding the existence and nature of modules in gene networks [39, 40]. Biologically meaningful pathways are sometimes used to define modules and to implement a reductionist approach. However, this easily breaks down under conditions where the pathways interact.
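If modules are taken to be fully isolated components of an undirected graph, finding them reduces to computing connected components; a minimal sketch over a hypothetical toy graph (real module-detection methods relax the strict-isolation requirement):

```python
def connected_components(adjacency):
    """Find connected components of an undirected graph given as an
    adjacency dict {node: set(neighbours)}.  Each component is a
    candidate module: no edges leave it."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adjacency[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two disconnected toy subnetworks: {A, B, C} and {D, E}.
graph = {
    "A": {"B", "C"}, "B": {"A"}, "C": {"A"},
    "D": {"E"}, "E": {"D"},
}
modules = connected_components(graph)
```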
In general, data sets used for topological models have important limitations. While hundreds of organisms have been fully sequenced and many genes are identified relatively reliably, the data sets underlying most topological models are much less complete. Only a fraction of all protein-protein interactions in yeast have been tested; most large-scale experiments show high noise levels; and whereas the genome sequence is independent of particular growth conditions and (sometimes) is even conserved in fossils, data like protein-protein interactions and transcription factor localisations are condition dependent. In this context it is particularly important to note that some experimental methods are performed under conditions considerably different from the natural conditions in the cell. For example, the yeast two-hybrid technology has been used to determine protein-protein interactions between human proteins, yet the yeast cell provides a very different environment from a human cell [41, 42]. This can be a considerable source of variation and systematic errors. Some experimental techniques are performed in test tubes, thus providing the most "unnatural" conditions. Unfortunately, various limitations are unavoidable and we have to work with incomplete data for a limited set of conditions.
We can conclude that genome-scale topological representations have helped us to make many interesting observations about network properties; however, the main question of finding well-defined modules of such networks at the topology level is still open.
Continuous functions use continuous values (real numbers) to represent the gene activity. Weights w_ij represent the interaction between genes i and j, which can be positive or negative. Thus the activity g_i of gene i can be calculated as the weighted sum of the activities of all n genes:

g_i = w_i1 g_1 + ... + w_in g_n
This approach assumes that the influence of one gene on another is linear. Note that the network topology determines which of the weights w_ij are equal to 0 (i.e., if there is no arc from gene i to gene j in the network topology, then w_ij = 0). Like Boolean functions, linear functions are only approximations. For instance, it is not possible with linear functions to model a situation where the same transcription factor can act as an activator or a repressor of the same gene, depending on the presence or absence of other transcription factors.
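The weighted-sum rule above amounts to a matrix-vector product; a minimal sketch with made-up weights:

```python
def linear_activities(weights, activities):
    """Compute g_i = sum_j w_ij * g_j for every gene i (the linear model).
    A zero weight w_ij corresponds to a missing arc in the topology."""
    return [sum(w * g for w, g in zip(row, activities)) for row in weights]

# Toy 3-gene network (hypothetical weights):
# gene 1 is activated by gene 0 and repressed by gene 2.
W = [
    [0.0, 0.0, 0.0],   # gene 0: no regulatory input
    [0.8, 0.0, -0.5],  # gene 1: activated by 0, repressed by 2
    [0.0, 0.3, 0.0],   # gene 2: weakly activated by 1
]
g = [1.0, 0.5, 2.0]
new_g = linear_activities(W, g)
```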
Although few promoters have been studied in great detail, there are excellent examples, such as the description of the promoter action logic of the sea urchin developmental gene Endo16. The Endo16 promoter consists of almost 30 regulatory elements stretched over a region of 2.3 kb. Based on experimental data collected using modified promoter constructs, Davidson and co-workers constructed a model expressed as an algorithm combining Boolean and linear functions. This algorithm takes as input the occupancy information from 12 binding sites and outputs a value that 'can be thought of as the factor by which, at any point of time, the endogenous transcription activity (...) is multiplied as a result of the interactions mediated by the cis-regulatory control system'. Predictions of promoter manipulations based on this model have largely been confirmed in subsequent experiments. Extending their earlier work, the group of Davidson compiled a regulatory network containing over 40 genes by constructing a model that integrates extensive experimental evidence on the early development of sea urchin embryos.
Recently, Klamt et al. published an example of control logic networks based on hypergraphs, which are an extension of the graphs described above. Several hyper-edges pointing to the same node represent OR relationships, but edges are allowed to combine to represent AND relationships. Weights on the edges distinguish positive and negative relationships. The authors provide a set of methods to analyse these networks, to list just a few examples: computation of all positive and negative signalling paths, computation of all positive and negative feedback loops, and computation of minimal cut sets. These minimal cut sets report the smallest number of interventions necessary to force the network into a particular behaviour, for example, the minimal number of deletions necessary to block the activation of a particular downstream protein in a signalling cascade. These methods are implemented in the software tool CellNetAnalyzer, and the example presented, a model of a signalling network for T-cell activation, shows that these analyses are non-trivial for signalling networks of a typical size.
Bayesian networks provide a probabilistic framework for modelling gene regulatory networks [48–50]. Their graphical representation is a directed acyclic graph, where each node represents a variable and the edges represent dependencies. For a more detailed description of the application of Bayesian networks in gene expression analysis see the reviews by Pe'er and Friedman. Segal et al. applied a learning procedure based on probabilistic graphical models to networks consisting of groups of coregulated genes.
There are situations where neither Boolean rules nor linear functions are powerful enough to express the control logic: transcription factors might bind competitively (if one factor is bound, the other is excluded), as is the case, for example, in the phage λ switch between lysis and lysogeny. In some cases, transcription factors have to form homodimers or heterodimers to be fully functional. The transcription factors might have to bind sequentially or might act synergistically. In these situations it might be necessary to use more complex functions (here this would be solved by Boolean circuits with memory or delay). It remains an open question what the minimum repertoire of functions needed to describe regulatory logic is.
Knowledge of the parts list of a network, its topology and its control logic is a necessary requirement for expanding the model to capture dynamic changes over time. Compared to the approaches above, dynamic models can be described as 'classical' approaches to gene network modelling, as many of them were developed and studied long before the current genome era. Typically, they are relatively small, involving only a few genes. They aim at describing and often simulating the dynamic changes in the state of the system and predicting the network's response to various environmental changes and stimuli.
Various dynamic models have been proposed. Greller and Somogyi subclassified them as follows: "Dichotomies for framing our thinking on how to best represent a particular biological network problem include the following distinguishing attributes: quantitative versus qualitative measurements; logical versus ordinal variables (e.g. Boolean versus abundances); deterministic versus probabilistic state transitions (e.g. differential equations versus hidden Markov); deterministic versus statistical overall system description (e.g. vector field versus Bayesian belief network probability distributions); continuous versus discrete state (e.g. continuous intensities or concentrations versus low, medium and high); nonlinear versus linear elementary interactions and state update rules (e.g. multiplicatives, sigmoids or non-monotonics versus linear ramps); high-dimensional versus low-dimensional (e.g. >> 100s of variables versus << 100 variables); stochasticity present and profound versus absent or present as nuisance noise (e.g. probabilistic state transitions versus small amplitude errors); measurement error substantially corrupting and obfuscating versus negligible distortion." In the following sections we describe several approaches along the discrete-to-continuous model axis. The discrete model approaches we consider include Boolean network based models [56–58] and Petri nets [59–62]; the dynamical systems are based on difference or differential equations [63–65]. We will then discuss hybrid models, which combine discrete and continuous elements [66–68].
Kauffman introduced the notion of a canalizing function – a Boolean function that has at least one input variable (the canalizing variable) and one value (0 or 1) for that input (the canalizing value) which determines the output of the function regardless of the other variables (i.e., if the canalizing variable has the canalizing value, then the output of the function does not depend on the other variables; if the canalizing variable does not have the canalizing value, then the output is determined by the values of the other variables). He hypothesized that genes are predominantly controlled by such functions (whether this is indeed true is still unknown). Kauffman used randomly generated networks to study their general features. He found that under certain assumptions about the network topology (a limited number of incoming connections at each node) and logic (promoters predominantly controlled by canalizing functions) there are only a small number of states in which the network will stay for most of the time. These states are called attractors; any other state, if possible at all, will lead to an attractor state in a relatively small number of steps. Moreover, the system either reaches a steady state or fluctuates between the attractor states in a regular fashion. Kauffman hypothesized that attractors correspond to different cell types of an organism. The number of cell types predicted by this model corresponds well with our current knowledge.
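The attractor behaviour of a small synchronous Boolean network can be illustrated directly, since for n genes there are only 2^n states to enumerate (a toy sketch, not Kauffman's original simulation):

```python
from itertools import product

def next_state(state, rules):
    """Synchronous update: every gene applies its Boolean rule to the
    current state vector."""
    return tuple(rule(state) for rule in rules)

def find_attractors(rules, n_genes):
    """Follow the trajectory from every one of the 2^n states until it
    revisits a state; the revisited cycle is an attractor.  Cycles are
    stored as frozensets, since deterministic cycles are disjoint."""
    attractors = set()
    for state in product((0, 1), repeat=n_genes):
        trajectory = []
        while state not in trajectory:
            trajectory.append(state)
            state = next_state(state, rules)
        cycle = trajectory[trajectory.index(state):]
        attractors.add(frozenset(cycle))
    return attractors

# Hypothetical 2-gene network: gene 0 copies gene 1, gene 1 negates gene 0.
rules = [lambda s: s[1], lambda s: 1 - s[0]]
attractors = find_attractors(rules, 2)
```

For this toy rule set every state lies on a single cyclic attractor of length four; with canalizing rules one typically sees far fewer attractor states than total states.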
This approach has been generalized in a number of ways. Randomly generated networks are used to study the dynamics of complex systems . Stochastic extensions to deterministic Boolean networks were proposed – so-called noisy networks by Akutsu et al.  and Probabilistic Boolean Networks by Shmulevich et al. .
Thomas and Thieffry describe a generalized model for the qualitative description of gene regulatory networks. They introduce the notions of gene state and image, the latter effectively representing the substance produced by the respective gene. There is a time delay between the change of the gene state and the change of the image state. By introducing several levels of gene activity and thresholds for switching the gene states, they go beyond binary models, but they do not allow continuous changes.
In Petri net models of metabolic networks, the place nodes represent metabolites and the transition nodes represent reactions. Metabolite concentrations correspond to the number of tokens in the particular place nodes, and the stoichiometry is described by the weights of the arcs. Subsequent analyses of Petri net models look for place nodes running out of tokens or accumulating tokens, and for subnetworks that are inactive. Of particular interest are invariants, such as transition invariants (T-invariants), where the transitions reproduce a given state. In metabolic networks, T-invariants represent reactions reproducing the given concentrations of metabolites, as for example in steady-state situations. For examples of the application of Petri nets in the analysis of metabolic networks see articles by Koch et al., Kuffner et al., Schuster et al. or Steggles et al. [62, 75–77].
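Token flow in such a Petri net can be sketched as follows, with arc weights encoding stoichiometry (a hypothetical toy reaction, 2 A + B → C):

```python
def fire(marking, transition):
    """Fire a Petri net transition if it is enabled: every input place
    must hold at least as many tokens as its arc weight consumes.
    transition = (inputs, outputs), each a dict {place: arc_weight}.
    Returns the new marking, or None if the transition is not enabled."""
    inputs, outputs = transition
    if any(marking.get(p, 0) < w for p, w in inputs.items()):
        return None  # not enabled: an input place lacks tokens
    new = dict(marking)
    for p, w in inputs.items():
        new[p] -= w
    for p, w in outputs.items():
        new[p] = new.get(p, 0) + w
    return new

# Reaction 2 A + B -> C as a single transition; tokens model molecule counts.
reaction = ({"A": 2, "B": 1}, {"C": 1})
marking = {"A": 5, "B": 2, "C": 0}
marking = fire(marking, reaction)
```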
Petri nets are particularly suitable for modelling metabolic reactions, because the similarities are intuitive and there is no need for detailed information about reaction rates. This is an advantage, because often these rates are not known and are hard, or at least costly, to obtain. The lack of information about reaction rates is one major shortcoming for the application of differential models, which we will discuss in the next section. However, sometimes the reaction rates will be crucial to the function of the whole metabolic pathway and therefore need to be included in the pathway model (there are extensions to Petri nets which address this; see the section on hybrid models below).
Boolean networks and Petri nets can reveal important network properties, but are too crude to capture some important aspects of network dynamics. Difference and differential equations allow more detailed descriptions of network dynamics, by explicitly modelling the concentration changes of molecules over time [63, 64, 78–80].
The basic difference equation model is of the form

g_1(t+Δt) - g_1(t) = (w_11 g_1(t) + ... + w_1n g_n(t)) Δt
...
g_n(t+Δt) - g_n(t) = (w_n1 g_1(t) + ... + w_nn g_n(t)) Δt
where g_i(t + Δt) is the expression level of gene i at time t + Δt, and w_ij is the weight indicating how much the level of gene i is influenced by gene j (i, j = 1...n). Note that this model assumes a linear control logic – the expression level of each gene at time t + Δt depends linearly on the expression levels of all genes at time t. For each gene, one can add extra terms indicating the influence of additional substances.
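Iterating these update equations gives a simple Euler-style simulation; a minimal sketch with a made-up two-gene weight matrix:

```python
def simulate(W, g0, dt=0.01, steps=1000):
    """Iterate the linear difference model
    g_i(t + dt) = g_i(t) + (sum_j w_ij g_j(t)) * dt."""
    g = list(g0)
    for _ in range(steps):
        delta = [sum(w * x for w, x in zip(row, g)) for row in W]
        g = [x + d * dt for x, d in zip(g, delta)]
    return g

# Hypothetical 2-gene system: each gene decays (negative diagonal)
# and is weakly activated by the other.
W = [
    [-1.0, 0.5],
    [0.5, -1.0],
]
final = simulate(W, [1.0, 0.0])
# Both eigenvalues of W are negative, so the levels decay towards zero
# and the two genes converge to nearly equal values.
```

Letting dt shrink towards 0 turns this scheme into the differential equation model discussed next.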
Differential equation models are similar to difference equation models, but follow concentration changes continuously, letting the time step between two updates become infinitesimally small, i.e. Δt approaches 0.
Dynamic network models have been reviewed extensively [81–84]. One of the largest transcription network models using differential equations that we are aware of is the model of segment polarity genes and pattern formation in the early development of Drosophila by von Dassow et al. Their system included 48 parameters, such as the half-lives of messenger RNAs and proteins, binding ranges and cooperativity coefficients. The initial model described all known interactions, but it also revealed that at least two new hypothetical interactions had to be added to make the behaviour of the model consistent with the observations.
Difference and differential equation models depend on numerical parameters, which are often difficult to measure experimentally. An important question for these models is stability – does the behaviour of the system depend on the exact values of these parameters and on the initial substance concentrations, or does it remain similar when they are varied? It seems unlikely that an unstable system represents a biologically realistic model, while on the other hand, if the system is stable, the exact values of some parameters may not be essential. For instance, the Drosophila developmental model is stable – it tolerates tenfold or more variation in the values of most individual parameters.
Many software packages have been developed for the dynamic simulation of biological networks, but the exchange of models and data between these packages was often not easy. The systems biology markup language SBML was developed to address this problem. SBML is an XML-based format that allows models to be described in a software-independent way. (XML, the eXtensible Markup Language, allows the definition of special-purpose markup languages capable of describing many different kinds of data; HTML is one well-known example of a markup language.) Nowadays SBML model descriptions can be used on many software platforms, enabling data exchange and cross-validation of models. This has also enabled the establishment of model repositories such as BioModels, a curated central database where quantitative kinetic models published in scientific journals can be deposited.
In real-world systems both continuous and discrete aspects are present. In general, concentrations are expressed as continuous values, whereas the binding of a transcription factor to DNA is expressed as a discrete event (bound or unbound). However, the boundary between the discrete and continuous aspects depends on the level of detail a model is designed for. For instance, at the single-cell level concentrations may have to be expressed as molecule counts and thus become discrete, whereas if we use thermodynamic equilibrium to model protein-DNA binding, the variable describing the binding state becomes continuous.
There are extensions of the Petri net model that allow us to include knowledge about the dynamics of reactions, for example stochastic time delays for the transitions, first applied to molecular biology by Goss and Peccoud . In these networks the firing of transition nodes depends not only on the number of tokens in the input place nodes, but also on a stochastic component. In their study of circadian rhythms, Matsuno et al. used another type of extension to simulate gene regulatory networks : in addition to the standard Petri net elements, Hybrid Functional Petri Nets (HFPN) contain continuous place nodes and continuous transition nodes. Continuous place nodes can hold real numbers, and continuous transition nodes fire at a constant rate; in metabolic networks this rate corresponds to the reaction rate. However, this means we lose one major advantage of Petri nets over difference and differential equation models: we need information on reaction rates. If we have this information only for some reactions, HFPNs provide a compromise by allowing a mix of continuous and discrete place and transition nodes.
Another example of a hybrid model is the phage λ model by McAdams and Shapiro , in which elements similar to those used to describe electronic circuits have been exploited.
As an example we will describe the finite state linear model (FSLM); more detailed descriptions of the FSLM can be found in [2, 90, 91]. It combines the advantages of Boolean networks, such as simplicity and low computational cost, with the advantages of continuous models, such as the continuous representation of concentrations and time. The activity of genes is described by discrete states (e.g., a gene is 'on' or 'off'), but gene product concentrations are expressed as real numbers. Time is continuous in the FSLM: the state of the network directly determines the concentration change rates, while the state is in turn affected by the concentrations themselves.
The binary FSLM allows only two possible states for the binding sites (bound or unbound) and for the substance generators (on or off). A generalisation of the FSLM and a mathematically more thorough definition can be found in . For each substance there is a corresponding substance generator. Substances can bind to binding sites, but each binding site can be bound by only one substance. The binding of a substance to a binding site b_j depends on the association constant a_j and the dissociation constant d_j of the binding site (0 < d_j < a_j). The binding site becomes bound if the concentration of the binding substance exceeds the association constant; if the substance concentration falls below the dissociation constant, the binding site is released and switches to the unbound state. The biochemical equivalents of the association and dissociation constants in the FSLM are affinity constants. The difference between the association constant a_j and the corresponding dissociation constant d_j leads to a hysteresis characteristic (Figure 11B) in the switching between the states of a binding site (see for example ): the concentration threshold for the switch depends on the state of the binding site itself. Using discrete states to represent the binding sites means we approximate the binding equilibrium by a simpler step function.
The states of a set of binding sites form the binary input vector of a Boolean control function F. Depending on the input state vector, the control function computes an output state (on or off). A substance generator S changes the concentration of a substance linearly in time: the concentration can either increase with rate r+ or decrease with rate r- (r- < 0 < r+), corresponding to substance production and degradation, respectively. The output state of a control function determines the activity of a substance generator, i.e. whether the concentration of the particular substance is increasing or decreasing. Note that the linear increase and decrease rates assumed in the FSLM are only approximations to reality.
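The interplay of a hysteretic binding site, a control function and linear rates can be illustrated with a minimal one-gene negative feedback loop, where the gene product binds its own binding site and switches its own generator off. All constants below are invented for illustration and this is not a model from the FSLM publications:

```python
# Binary FSLM sketch: the product binds its own site when its
# concentration exceeds a_j, releases below d_j, and the control
# function switches the generator off while the site is bound.

a_j, d_j = 2.0, 1.0          # association / dissociation constants (d_j < a_j)
r_plus, r_minus = 0.5, -0.4  # linear production / degradation rates

conc, bound = 0.0, False
trace = []
dt = 0.01
for _ in range(2000):
    # Hysteresis: the switching threshold depends on the current state.
    if not bound and conc > a_j:
        bound = True
    elif bound and conc < d_j:
        bound = False
    rate = r_minus if bound else r_plus   # control function F(bound)
    conc = max(0.0, conc + rate * dt)
    trace.append(conc)

print(min(trace[500:]), max(trace[500:]))  # oscillates between ~d_j and ~a_j
```

Because of the gap between a_j and d_j, the concentration does not settle at a single threshold but cycles between the two constants, which is the hysteresis behaviour described above.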
In our example models the biological systems are relatively simple, but for larger networks we often lack detailed information about the biology.
All models mentioned so far are deterministic – they assume that the next state of the system is determined by the current state and the external inputs. However, in real-world systems stochastic effects may play an important role. For instance, for some genes in yeast the number of mRNA molecules is close to one copy per cell . This means that a considerable intrinsic noise component is likely present – some cells apparently contain more mRNA molecules of a given species than others. Thus, modelling a cell by continuous concentrations effectively means modelling an ensemble of cells by the mean values of stochastic variables, and it is not obvious to what extent this is possible. It has been demonstrated that stochastic effects are important for the phage λ switch decision between lysis and lysogeny . Recently, experimental studies have tried to measure the level of intrinsic noise in eukaryotic cells (e.g., [95, 96]). Simulating a stochastic model is computationally more expensive, because the simulations have to be run several times to give a good impression of the system behaviour. But stochastic models are not always necessary; whether they are depends on the system to be modelled. If the number of molecules involved is small and important processes depend on random effects, stochastic models may be the best choice.
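A standard way to simulate such low-copy-number systems exactly is Gillespie's stochastic simulation algorithm. The sketch below applies it to a hypothetical birth-death process for a single mRNA species; the rates are illustrative, not measured values:

```python
import random

# Gillespie simulation of mRNA production (rate k_prod) and degradation
# (rate k_deg per molecule); with few copies, the count fluctuates
# noticeably instead of settling at a smooth steady-state value.

def gillespie(k_prod=1.0, k_deg=0.5, t_end=1000.0, seed=42):
    random.seed(seed)
    t, n = 0.0, 0
    area = 0.0  # integral of the copy number over time
    while t < t_end:
        a_total = k_prod + k_deg * n
        # Waiting time to the next reaction is exponentially distributed.
        dt = random.expovariate(a_total)
        area += n * min(dt, t_end - t)
        t += dt
        if random.random() < k_prod / a_total:
            n += 1   # production event
        else:
            n -= 1   # degradation event
    return area / t_end  # time-averaged copy number

print(gillespie())  # fluctuates around k_prod / k_deg = 2 copies per cell
```

Running the function with different seeds gives different trajectories, which is why stochastic models must be simulated repeatedly to characterise the system behaviour.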
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk" – John von Neumann
Reverse engineering refers to an approach where one starts from data and tries to design, (semi-)automatically and within a given model class, a model that fits the data without additional prior hypotheses about the biological system. The derived model is judged by comparing the results of simulations to new experimental data. For example, one could use a gene expression data set to construct a particular gene network model that is consistent with the data. Inconsistencies between data simulated with this model and new data that were not used to construct the model indicate shortcomings of the model; these inconsistencies can be used to choose between alternative models, or to improve the model. However, reverse engineering is possible only (1) if we have chosen an appropriate model class (in the sense that the desired properties of the real-world network can be described in it), and (2) if we have enough quantitative data describing the behaviour of the system. Even if both conditions are met, reverse engineering is still a difficult problem, and few efficient algorithms are known. The methods chosen for reverse engineering depend crucially on the kind of modelling technique used: quantitative models are normally more demanding than qualitative models, and dynamic models contain many parameters for which detailed experimental data are required.
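To make this concrete, here is a toy reverse engineering exercise within the linear difference model class described earlier: we simulate a two-gene time series from a known weight matrix and then recover the weights by least squares. The system and all numbers are invented for illustration, and real expression data would of course be noisy and far less obliging:

```python
# Generate a time series from a known linear difference model, then
# recover each row of the weight matrix by least squares on
# (g_i(t+dt) - g_i(t)) / dt = sum_j W[i][j] * g_j(t).

def simulate(W, levels, dt, steps):
    series = [list(levels)]
    for _ in range(steps):
        levels = [levels[i] + sum(W[i][j] * levels[j]
                                  for j in range(len(levels))) * dt
                  for i in range(len(levels))]
        series.append(list(levels))
    return series

def fit_row(series, i, dt):
    """Least squares for row i via 2x2 normal equations (two genes)."""
    xs = series[:-1]  # states g(t)
    ys = [(series[t + 1][i] - series[t][i]) / dt for t in range(len(xs))]
    # Normal equations A w = b with A = X^T X, b = X^T y.
    A = [[sum(x[a] * x[b] for x in xs) for b in range(2)] for a in range(2)]
    b = [sum(x[a] * y for x, y in zip(xs, ys)) for a in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

W_true = [[-0.1, 0.0], [0.5, -0.2]]
series = simulate(W_true, [1.0, 0.2], dt=0.1, steps=50)
W_fit = [fit_row(series, i, dt=0.1) for i in range(2)]
print(W_fit)  # closely matches W_true for noise-free data
```

With noise-free data generated by the model class itself, the fit is essentially exact; the hard part of real reverse engineering is that conditions (1) and (2) above rarely hold so cleanly.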
Miyano et al. have proposed algorithms to infer Boolean networks [67, 72], and Friedman et al. have developed methods to extract probabilistic graphical models, such as Bayesian networks, from experimental data [49, 52]. Tegner et al. proposed an approach to the reverse engineering of dynamic gene networks based on integrating genetic perturbations . They identified "[...] the network topology by analysing the steady-state changes in gene expression resulting from the systematic perturbation of a particular node in the network." However, they applied their approach only to simulated data and to a comparatively small biological system consisting of only five genes.
A powerful approach to testing our understanding of gene regulatory networks is to build new networks from scratch, an approach called synthetic biology. Predictions of small models have been successfully tested experimentally using specifically engineered control circuits, such as feed-forward loops  and feedback loops [99–103]. In a sense this is reverse engineering of a real-world network. For a more detailed description see the reviews by Kaern et al. and Ball [104, 105].
"If you torture the data long enough, Nature will confess." – Ronald Coase
At the basis of any modelling, including network modelling, lies the realisation and acceptance that a model describes only some properties of the 'real world' system and ignores others: it emphasizes particular aspects of reality, leaving out details that are not relevant for the purpose of the study.
How far are we from being able to build realistic cell models? The availability of large-scale data sets such as microarray gene expression and genomic localisation data has triggered the search for suitable approaches to modelling complex biological systems. As a result of the genome projects we are now able to compile parts lists on a genome scale, though we do not know how many important categories are missing from these parts lists. Models describing network topology are approaching the whole-genome scale. High-throughput experiments, most notably microarrays, provide temporal information about transcriptional processes in time series experiments. These have been used to study control logic as well as some dynamic aspects of transcription regulation in processes such as the cell cycle [8, 106, 107], stress response [108, 109], or galactose utilization . Models have been built to explore the fundamentals of, for example, the yeast cell cycle , and improvements in the understanding of the genome-wide dynamics of the cell cycle have been made . Nevertheless, high-throughput technologies have yet to make a direct impact on quantitative real-time simulations of gene networks.
The function of about one third of all genes of the yeast Saccharomyces cerevisiae is still unknown, despite it being one of the best-studied organisms. And even for many of the better-known genes and core processes that have been studied for decades, such as the cell cycle, there are still not enough data available to know all changes in concentration and activation patterns exactly. Currently it does not seem feasible to simulate even relatively simple cells like yeast. Mechanisms such as RNA interference, regulated degradation of mRNAs and proteins, and chemical modifications of key molecules might play a larger role than anticipated in current models, and other processes might still be unknown. Clearly, the separation into gene regulatory networks, metabolic networks and protein interaction networks is possible only to a certain degree. To what extent can transcription regulation networks be decoupled from other networks, such as signal transduction networks? We need to integrate many types of information if we want to build realistic dynamic models; for current modelling approaches, however, we have to limit the complexity of the systems we are dealing with.
One possibility for reducing the complexity of biological systems depends on the modularity of real-world networks and on their robustness (stability against changes of various network parameters and initial conditions). If the networks are modular and robust – that is, if we can find modules, units behaving largely independently of each other – it might be possible to build genome-scale models as sets of such smaller modules.
The belief that real-world biological networks 'must be' robust and 'must be' modular is quite widespread. However, precise definitions of biological robustness and modularity, and moreover proofs of their presence, remain elusive. The principles of modularity and robustness used in engineering are sometimes given as a reason why the same must hold in biological systems, but there are many examples where the 'designs' found in nature, obtained by natural selection, differ from the designs one would use in engineering. There are other arguments why biological networks could be modular, such as the reuse of components after genome duplications, but these are not proofs. There are indications that, on the dynamic level, network modules exist. For instance, cell growth can be decoupled from the cell cycle in yeast (e.g., ), indicating that these two processes are controlled by modules that are independent to some extent. Similarly, the Drosophila developmental network indicates that the exact values of model parameters may not be crucial for large-scale system behaviour . But to what extent can specific processes be decoupled from each other?
Another possibility for reducing complexity in network models depends on how important the exact values of parameters and substance concentrations are. How much do exact quantitative values, such as substance concentrations, matter in determining the more general patterns of system behaviour, such as cell differentiation? If we are not interested in predicting the exact concentrations of different substances, but only in the patterns of system behaviour such as steady states, we can often use simplified Boolean-type networks instead of differential equations , and hybrid models might offer "good enough" solutions.
The question "Is real-time simulation on a genome scale possible at all?" is still open. Obtaining high-quality systematic quantitative data characterizing system parameters such as mRNA, protein and metabolite concentrations, interactions, and the spatial and temporal localization of different molecules will be important. Nevertheless, the data will not provide new insights automatically. We believe that hypotheses expressed as rigorously defined models, whose properties can be studied independently and tested on experimental data, will play an important role in understanding living systems on the genome-wide level. In any case, finding the right language for describing the models is a prerequisite for success.
This article is an extended version based on two earlier publications [1, 2]. The project was funded by the European Commission through the DIAMONDS grants under the RTD programme "Quality of Life and Management of Living Resources".
This article has been published as part of BMC Bioinformatics Volume 8 Supplement 6, 2007: Otto Warburg International Summer School and Workshop on Networks and Regulation. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S6
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.