Applications of a formal approach to decipher discrete genetic networks

Background A growing demand for tools to assist the building and analysis of biological networks exists in systems biology. We argue that the use of a formal approach is relevant and applicable to address questions raised by biologists about such networks. The behaviour of these systems being complex, it is essential to exploit efficiently every bit of experimental information. In our approach, both the evolution rules and the partial knowledge about the structure and the behaviour of the network are formalized using a common constraint-based language. Results In this article our formal and declarative approach is applied to three biological applications. The software environment that we developed allows to specifically address each application through a new class of biologically relevant queries. We show that we can describe easily and in a formal manner the partial knowledge about a genetic network. Moreover we show that this environment, based on a constraint algorithmic approach, offers a wide variety of functionalities, going beyond simple simulations, such as proof of consistency, model revision, prediction of properties, search for minimal models relatively to specified criteria. Conclusions The formal approach proposed here deeply changes the way to proceed in the exploration of genetic and biochemical networks, first by avoiding the usual trial-and-error procedure, and second by placing the emphasis on sets of solutions, rather than a single solution arbitrarily chosen among many others. Last, the constraint approach promotes an integration of model and experimental data in a single framework.


Background
A central task in molecular systems biology is to build and analyze genetic and biochemical networks in order to decipher the properties of cellular phenomena. The emphasis is not on investigating in detail one or a few molecules at a time, as is done traditionally in molecular biology, but rather on focusing on the network level.
We are specifically interested here in gene regulatory networks (GRNs) formalized as discrete genetic networks as defined by R. Thomas [1,2]. The main goal of this formalism is to obtain a qualitative understanding of the network dynamics by reasoning on discrete entities. In GRNs the molecular players are the genes and the proteins they produce. A genetic interaction corresponds to the fact that a gene g i produces a protein p i which influences the expression rate of another gene g j , or g i itself. The set of all these genetic interactions constitute the interaction graph, to be defined formally later. In a given state of the network, each gene has a certain expression rate (the rate of production of the encoded protein) which depends on the presence or absence of a subset of proteins, the activators or repressors of the considered gene. The expression rate of a gene is maximum when all its activators are present and all its repressors are absent. In this context a basic objective is to analyze the temporal evolution of the protein concentrations in given external conditions. This can be done when the values of the model parameters have been measured. When this is not the case, the problem is then to exploit the knowledge about network behaviour (e.g. response to perturbations, phenotype change when one or several genes are knocked-out) to deduce the possible values of the parameters. A frequently used method consists in performing a large number of simulations by varying some parameters (generally one or two at a time) and selecting a posteriori the set of values that is consistent with the observed behaviour. As explained below, we propose a different approach for this inference problem.
Beyond these basic functionalities (simulation, inference of parameter values), the construction of GRN models consistent with experimental data requires more sophisticated tools. It often occurs that a proposed model displays inconsistencies with part of the data. In such cases it is necessary to critically analyze the hypotheses used in building the model and to revise them. This analysis can be done "by hand" for small networks, e.g. up to three genes, but requires the use of computational tools to cope with the complexity of larger networks. In still other situations the observational constraints are weak with respect to the number of variables, and the number of solutions is very large. In such cases, it is interesting to derive properties that are shared by all the solutions, or subsets of them, in order to get a better understanding of the model properties and to design new experiments having the potential to substantially reduce the set of solutions.
Fundamentally, we want to provide the biologist studying GRNs with a software environment allowing to perform such tasks. The available knowledge is partial and bears on both the structure of the network of interest (the set of interactions) and the behaviour of the network in various conditions. The first kind of knowledge is said to be structural, or local (each interaction is a piece of information and can be studied in itself), whereas the second kind is said to be behavioural, or global (the network as a whole is giving rise to a given behaviour). The network architecture and its behaviour are closely inter-related. This relation is implemented formally as a set of constraints in a straightforward manner in our software environment, named GNBox (Genetic Networks toolBox -Additional file 1). More precisely, the philosophy of this approach is to represent a given problem, or set of problems, as a set of formulae linking variables. In our case this entails the specification of (i) the rules defining the updating scheme (how the successors of a state are computed); (ii) the network architecture (set of interactions); (iii) the observations about the behaviours of the network (partial information about paths), or any working hypothese about the system; (iv) the query itself (e.g. number of stationnary states, possible values of initially unkown parameters). The set of constraints thus defined is then submitted to a solver whether there exists solutions or not. A distinctive feature of the constraint approach is constraint propagation. It implements deduction rules and allows in favorable conditions to reduce drastically the search space, thus limiting enumeration. Of course some amount of enumeration is usually still necessary, but the aim of the game is to reduce it as much as possible. This formal relation is "executable" and allows not only to perform basic functionalities such as simulation or reverse-engineering, but also to assert and obtain properties on both the behaviours and the interactions. More specifically, we implemented in this constraint-based setting four main functionalities: (i) proof of consistency or inconsistency of a constraint pool, (ii) constraint relaxation in case of inconsistency (model revision), (iii) prediction of properties in case of consistency, (iv) search for minimal models, with respect to the number of thresholds, for example.
In this article we present our approach and we show how it can be applied successfully to the analysis of three different biological problems. In the section Methods we present the formalism we developped. We present the formal definition of interaction graphs and of the evolution rules of Thomas networks. These notions are required to express the queries implementing the functionalities mentionned above. Other notions related the specification of interaction compositions facilitate the expression of properties involving kinetic parameters. The implementation is discussed in [3]. In the section Results and Discussion we present three applications which differ by both the type of knowledge available and the type of biological questions asked. These applications permit (i) to illustrate the different functionalities of GNBox, (ii) to show the feasibility of this constraint-based approach on realistic biological problems, and (iii) to support the idea that a formal and declarative approach is very interesting to decipher the properties of GRNs, in order to assist in their construction.

Methods
We present briefly in this section the constraint technology, the constraint-based formalization of Thomas networks, the constraint-based formalization of biological properties of these networks, and the features of our software environment GNBox to elucidate GRNs.
Below we use the following mathematical notations: an integer x taking values between min and max is denoted x min..max, a Boolean b is an integer such as b 0..1, b1 ⇔ b2 means that the Boolean b1 is equal (or equivalent) to the Boolean b2, b1 implies b2 is denoted b1 ⇒ b2, the Boolean equal to b1 and b2 is denoted b1 ∧ b2, the Boolean equal to b1 or b2 is denoted b1 ∨ b2, the Boolean equal to the conjunction of a list of Boolean b i is denoted ∧ i b i , the Boolean equal to the disjunction of a list of Boolean b i is denoted

Constraint technology
We propose to implement the approach (particularly the link between network structure and behaviour) using Constraint Logic Programming (CLP) technology, with a finite domain solver. CLP is a programming paradigm based on first order logic. CLP considers specific classes of logical terms and proposes efficient resolution methods of equations over these terms (constraints). A CLP program is a logic formula, and its execution is the construction of a proof of consistency (or inconsistency) of this formula. The formula is consistent when it is possible to find an instantiation of the variables which satisfies the formula. Logicians call such an instantiation a model. A CLP program is reversible in the sense that it permits to impose and obtain partial knowledge over all the variables of the formula (including in our case the variables describing the interactions and behaviours). For example, let say that p(x, y) is a predicate defining a relationship between two entities x and y. If a measurement allows to reduce the domain of values of x, this information can be added as an additional constraint, and a query can be submitted about the possible values of y. The solver will try to propagate the additional information on x to reduce the domain of y, taking into account p(x, y). Conversely, if the measurement has been performed on y, this information can be propagated to x through p(x, y). This is reversibility. It must be said that different kinds of solvers exist, characterized by the type of variables (e.g. finite domain integers, reals) and the type of propagation rules used, among other things.
As all the variables describing interactions and behaviours have finite integer domains in the discrete framework used, the use of a constraint solver over finite domains is very well suited. In addition, the expressive power of first order logic and constraints over integers allows the definition of very general properties and functionalities. Finally, in order to be able to take advantage of the very efficient Boolean Satisfiability (SAT) solvers available, the GNBox environment is able to translate the CLP formalization into a Boolean formula in Conjunctive Normal Form (conjunction of disjunctions of Boolean variables or their negation, the input format used by most SAT solvers). Details on the translation into CNF can be found in [3]. In this way we combine the expressive power of CLP with the efficiency of SAT solvers.

Formalization of Thomas networks
In this subsection we present a constraint-based formalism to impose, check and infer properties about discrete genetic networks as defined by R. Thomas. We first introduce the notions needed to define and formalize the interaction graph and the evolution rules of Thomas networks. We define in the next subsection the notions of composition of interactions, additivity and observability properties which are useful to express hypotheses about kinetic parameters. All the presented notions of this subsection and the next one are illustrated with the example of Figure 1 and Figure 2, and will be put into use in the biological applications of the section Results.
The structure of a GRN is represented in an abstract way by an interaction graph  . The nodes of  are genes. Each node is associated to a concentration variable representing the concentration of the protein produced by the corresponding gene. The oriented edges of  represent interactions between these genes, denoted by i c r for the interaction on the gene (component) c of index r, r 1..r c , where r c is the number of interactions on c. An integer variable representing a discrete threshold labels each edge. In papers using the formalism of R. Thomas edges in interaction graphs are also labeled by a sign [1]. We choose more primitive interaction graphs without these labels in order to generalize the formalization and to facilitate the expression of hypotheses about the way interactions compose on target nodes. The presence of an interaction from gene a to gene b (with a threshold t) indicates that protein a can potentially modify the expression rate of gene b. Furthermore, this change in expression rate, when it actually exists, takes place when concentration of a crosses threshold t. In other words, this interaction indicates that the rate of production of protein b can be influenced by the position of protein a with respect to threshold t. It has to be noted that such an interaction does not actually impose a difference in production rate. Rather, the absence of such an interaction forbids the existence of such a difference in production rate. Such an interaction is represented by the triplet (a, t, b) (labelled edge). In the example in Figure 1, the rows "  " and "interactions" give respectively the definition of the interaction graph  and the sets of interactions for the target genes x and y.
The network structure being defined, the next step is to define the network state and dynamics. A state S of the network is a list of gene product concentrations (protein or RNA). The concentrations are discretized according to the thresholds appearing in  . The concentration of the product of gene c in state S is the integer S c 0..max c , where max c is the maximal value of the discrete concentration of protein c. The threshold of component c of index p is t c p 1..max c , where the index p takes its values in 1..max c (obviously if a concentration is cut by max c thresholds the associated discretized variable will take max c + 1 values). For a given system with n genes, c i , i 1..n, is the variable associated to the i th gene, and state S is the ordered list , and belonging to the same state space. The focal state gives the direction of evolution (tendency) for each concentration. Consider for instance S = 〈0, 0〉 and F s = 〈1, 1〉 in a 2-dimensional system. The successor of S is not 〈1, 1〉 as is the case in synchronous updating schemes. Rather F S = 〈1, 1〉 indicates the direction of evolution of each component taken separately. Here both are increasing and S = 〈0, 0〉 has 2 successors, 〈1, 0〉 and 〈0, 1〉. In other words two transitions are possible from S, and this type of updating scheme is often called asynchronous, but nondeterministic is a better term. What is the basis of this non-determinism? If the numerical values (real numbers) of the initial concentrations, together with those of the model parameters were known, it would be possible to determine the exact successor. In the discrete abstraction considered here this information is not available and consequently both possibilities must be taken into account. Non-determinism is a fundamental property of this abstraction due to the information loss induced by the partition of concentration space into rectangular domains. We chose this formalism in this study because it is well founded and it is a good match to the qualitative knowledge generally available in Systems Biology at present. Nevertheless, it should be kept in mind that our constraint approach is not tied to Thomas networks. Other types of discrete dynamical rules could be implemented, e.g. Kauffman-like Boolean networks with parallel (synchronous) [4] or block-sequential updating scheme [5,6].
To implement the Thomas evolution rules we need first to specify the equations which link a state S to its focal state F S . These equations are named focal equations. A set of rules then links state S, the focal state F S associated to S, and the successor states of S. We stress here that these rules must be viewed as relationships linking different kinds of unknowns. As explained above (reversibility), the use of these relationships depends on the available information in a given state of knowledge. If the concentration values making state S are all known, together with the position of its focal state, then the successors of S can be computed. But the relationships can be exploited in other ways, too.
The system of focal equations contains different kinds of parameters: constant concentrations associated to input genes (that is genes that are influenced by no genes in the network and whose state is fixed by external conditions), parameters related to reaction kinetics (similar to those that would appear in a differential description), and thresholds t c p . The set of all these parameters is denoted by P. The parameters are amongs the unknowns of the system of constraints because their values are in general not known, or only partially known. The evolution rules, once formalized (see below), lend to a first set of logical constraints. To this first set are added structural constraints over the parameters derived from experimental data, and working hypotheses. The set of solutions of such a system of  constraints defines a set of instantiated models (i.e. models in which all parameters are instantiated). The couple composed of a focal equation system and a set of structural constraints is called a parameterized constrained model M, or just model. A typical query includes one or several structurally-related models (when data are available on several mutants), and some additional behavioural constraints. If the resulting system (set of all constraints of the query) is underconstrained this set contains a large number of solutions. If it is over-constrained it is empty. In our context, this last case is interpreted as a contradiction between, on one hand, the experimental evidence and, on the other hand, the network structure or the hypotheses. More sophisticated queries are presented in the application part below, to illustrate the high-level functionalities mentioned in the introduction.
The parameterized focal equation system of a model M is completely defined by an interaction graph  . In fact these two entities contain exactly the same information (as long as kinetic parameters are not instantiated nor constrained). The set of interactions of  having the gene c as target induces a partition of the concentration space according to the thresholds t c p ' of these interactions. This partition defines a set of regions called the cellular contexts of c. As long as the concentration of the proteins c' regulating c do not cross one of the t c p ' thresholds, the system stays in the same cellular context, because from the viewpoint of gene c the regulatory conditions have not changed. This means that all the states S belonging to the same cellular context of c have the same focal component F c,S of the focal state F S . The value of F c,S being generally unknown, a formal parameter K c l called discrete kinetic parameter is introduced for each cellular context of c with index l. These parameters are the discrete version of the ratio of protein production rate over degradation rate. When the value of K c l is high in some cellular context, this is interpreted by saying that in the states belonging to that context the production rate of the protein associated to gene c is high, and/or its degradation rate low. But in the qualitative setting of Thomas formalism it should be kept in mind that we have only access to a discretized version of the production rate to degradation rate ratio. The number of cellular contexts for a given gene c is l c r c = 2 , and so is the number of K c l parameters. We have introduced the main notions unformally (interaction, threshold, state, focal state, focal equation, cellular context, discrete kinetic parameter), and will now present formal definitions which are directly usable in constraint form.
. is the discrete kinetic parameter of c with index l, l Î 1..l c , and Cellc c S l , is a condition true if S belongs to the cellular context of c with index l. The indexing convention is the following: l is equal to V + 1 where V is the decimal value of the binary number composed of the Booleans S t c c , these Booleans being arranged in increasing order of r (this is just meant at providing a unique numbering of the cellular context and is not fundamental).
The above formula means that if state S' belongs to the The row "cellular contexts" in Figure 1 describes formally and graphically, for a given order of thresholds, the cellular contexts for each component x and y of the considered example. Component x is the target of only one interaction and is thus associated to two cellular contexts, y is the target of two interactions and has 4 cellular contexts. The row "discrete kinetic parameters" gives the list of these parameters. The subscripts and superscripts make the correspondence with the associated cellular contexts ( K x l with Cellc x S l , , etc.). Finally the row "focal component" gives the equations describing the focal components F x, S and F y, S of a state S. The row "focal state" in Figure 1 describes the focal state F S = 〈F x,S , F y,S 〉 of S.
The focal state defines the direction of the dynamic transitions starting in S. In the Thomas networks, the authorized transitions are such that:.
1. S' and S are the same state or are neighbors, 2. S' and S differ on at most one component. 3. S' is in the "direction" of the focal state F S .
) is due to the fact that the concentrations evolve continuously, thus jumps over states are not allowed. The second is commonly called asynchronicity. The third one is specific to the discretization of evolution equations due to Thomas. We explained above that when two concentrations are increasing in a given state, it is not known in this kind of abstraction which will reach first its next threshold, and consequently which transition will occur first. In this situation both transitions are taken into account leading to two successors for the state considered (of course this generalizes to more than two). This is intimately connected to the nondeterminism inherent to abstractions based on phase space partition.
The rules have the following consequences: It is possible to specify a knock-out or ectopic expression mutation. For each non-input mutated gene c set to a constant value v, the constraint ∧ l K c l = v must be introduced. For a mutated input gene to the value v the input parameter of the model is set to this value v. In some cases it is necessary to use several models in the same query, one model corresponding to the wild type and the others to mutants. In such cases we introduce constraints specifying that for all couples of models (M a , M b ) the thresholds of M a are equal to those of M b , and the parameters K c l of M a associated to genes c which are not mutated in M a and M b are equal to those of M b . The constraints between the input parameters of M a and M b depends of the considered biological application (see Constraint 4, in the section Results and Discussion).
A user of GNBox must describe the structure of the studied GRN (possible interactions between genes), and can use the language LG1 to specify the existence of a behaviour. The language LG1 is composed of the predicate path(M, Path, L) which is true if Path is a succession of L states authorized by the model M (achieving a formal link between a model and its behaviours), and a language to impose arithmetic constraints between variables of Path. Language LG1 is used to formalize observations on the behaviour of the system. Our approach allows to specify (declare) partial information. For example only a few concentrations may have been measured. Absence of information is absence of constraints.

Interaction compositions
The interaction graph  lists the interactions individually but does not contain information on the manner in which different interactions are composed when they have the same target gene. The information about the way to compose interactions is embodied in relationships linking the parameters contained in P K t c l c p ( , ,...) . However, the manual interpretation of instantiations or properties over parameters of P is not convenient, especially for users not acquainted with the formalism of Thomas networks. For this reason we designed a higher level language LG2 to impose, check and infer properties about the way to compose the interactions in  in the manner of the traditional notion of the logic of regulation (NEG, AND, OR gates). It should be understood that this is not fundamental to the approach but merely a facility to handle relationships between parameters induced by the composition of interactions. The user always has the choice to work directly on these relationships.
We explained above that the specification of a set of interactions for a gene c partitions (in cellular contexts) the concentration space by hyperplanes (corresponding to thresholds of interactions acting on c). LG2 permits, for every c, to partition the concentration space in union of cellular contexts of c, named compositions of cellular contexts, via the definition of interaction compositions. Any union of cellular contexts can be specified, and in particular an union of disconnected regions. Similarly to the semantic of a set of interactions, the semantic of a set of interaction compositions is the following: all the states belonging to a given composition of cellular contexts of c have the same evolution tendency of the concentration of protein c. The borders between these regions are constituted of parts of threshold hyperplanes of interactions taking part in the composition. We name these borders ≥ is true is said to satisfy i c r . An interaction composition also partition the state space into two regions, but the border is not necessarily a hyper-plane defined by a single threshold. An interaction composition can have the following forms: where ic and ic' are interaction compositions. The region where the states S satisfies both ic and ic' is said to satisfy ic ic where ic and ic' are interaction compositions. The region where the states S satisfies ic, or ic', or both, is said to satisfy ic ic  ∨ ′ .

Example 2
The sixth row in Figure 2 gives three possible sets of interactions compositions for y, related to the example in Figure 1. The four first rows recall the context of example in Figure 1. The fifth row gives the set of interaction compositions over x. The seventh row shows for each of these couple of sets a graphical representation of the detailed structure of the network, with signs + and -over interactions and bridges, denoted by AND, to express a conjunction between two interaction compositions. Finally, the last row shows the resulting compositions of cellular contexts for y.
The first case (second column of last row) leads to the same partition of the discrete concentration space of y (the areas described by the cellular contexts are the same that those described by the compositions of cellular contexts).
The second case expresses with the sole interaction composition on y, i i y y 1 2  ∧ , that either the concentration of x and y are above t x 1 and t y 2 , respectively (the tendency of y is unique in this region), or the concentration of x or y are below t x 1 and t y 2 , respectively (the tendency of y is unique in this region).
The third case expresses quite the same of the second case but permits that x interacts on y whatever the concentration of y. So, we obtain three compositions of cellular contexts because the fact that x can interact on y all along the border of the threshold t x 1 . Example 3 The Figure 3 gives an example of a set of interaction compositions and resulting composition of cellular contexts. In the first column, we can see an interaction graph  with two components a and b, a set of four interactions over b, and a partition of the concentration space into nine non empty cellular contexts. The indexes l of the conditions Cell c b S l , appear in circles. The other cellular contexts are empty according to the order of the values of the thresholds . Note that usually this order is not known and the values of thresholds for a same species can be equal. In the second column (to make a parallel with the interactions and cellular contexts) we assumed to have two interaction compositions. We obtain a partition of the concentration space into four non empty compositions of cellular contexts (the pink region being the union of the two disconnected cellular contexts 1 and 12).

Additivity and observability properties
The language LG2 allows to define specific effects of an interaction composition on a component c. Here by effect we mean a shift in the position of the focal component F c when the border associated to the interaction composition is crossed. Biologically, an increase of the tendency of c can be due to an increase of the expression rate of gene c, or a decrease of the degradation rate of the corresponding protein. In other formalisms these effects are specified by labelling the arcs of the interaction graph with signs (we have used this in the 7th row in Figure 2). A + sign (respectively a -sign) for an interaction of a gene a on b in the signed interaction graph means informally that the interaction of a on b is an activation (respectively an inhibition). However, the exact meaning of the terms activation and inhibition is not clear, especially when several interactions combine on a gene: Does an activation of b by a forbid an inhibition of b by a or not? Is an activation of b by a necessarily observed all along the border associated to the interaction or not? Two properties are used to clarify formally these questions.
The first one, called additivity, is the systematic nonstrict increase of tendency of c when a border is crossed in some predefined direction. In other words the effect on c of the interaction composition adds to the effect of all other interaction compositions on c. The direction in which the border is crossed for this property is the one given by the passage from a state where the interaction composition is not satisfied to a state where it is satisfied.
The second property, called observability, is the existence of a strict increase of the tendency on c. This means that the effect on the tendency of c exists at least at one crossing point (where the border associated to the interaction composition is crossed in the same direction as the additivity property). In contrast to the additivity property, observability property requires only the existence of an effect somewhere along the border.
To define these effects more formally we introduce for each interaction composition ic c rc on c a set, denoted by Adj c rc , containing all couples of states (S0, S1) such that (i) S0 is adjacent to S1, (ii) S0 is a state in the region where ic c rc is not satisfied, and (iii) S1 is a state in the region where ic c rc is satisfied. Example 4 For the interaction composition ic b 2 of the example given in Figure 3 Figure 4 by a kind of arrow symbol, where the 'o' end is associated to state S0, and the '|' end to state S1.
LG2 allows to specify that an interaction composition ic c rc has an additive effect, denoted by a ic c rc ( ) , i.e. that the difference of trend of c is positive or zero all along the border defined by ic c rc . The exact semantics of a ic c rc ( ) is: for every couple (S0, S1) of Adj c rc the trend of c in S0 is less than or equal to the trend of c in S1.
Since the trend of a state is equal to the trend of all the states in the same cellular context, the additivity constraints are expressed as relations between discrete kinetic parameters K c l . Example 5 For the example in Figure 2   ⇔ ≤ ∧ ≤ (the same as the first one in the second case). If these aditivity are true, there are two cases of activation of y, one above t x 1 and one above t x 1 and t y 2 . Moreover, the second case of activation is greater than the first one, due to the additivity property a ic y ( ) 2 . If multi-arcs are present in the interaction graph (several arcs with the same origin and the same target node) the cellular contexts on each side of the border defined by the interaction composition ic c rc are not the same depending on the values of the thresholds associated to the multi-arc. In that case the additivity constraints are relations involving also thresholds. Briefly, the additivity constraint of the interaction composition ic c rc is: ∧(adj (l0, ll, rc) ⇒ K c l0 ≥ K c l1 ) with adj(l0, l1, rc) true if it exists a couple (S0, S1) of a ic = ¬  is:.
)) ( ( according to the identifiers l of cellular contexts for b (and so the identifiers of discrete kinetic parameters K b l ). It can be checked with the graphic representation of cellular contexts of b in Figure 3 that for t t t t a a b b This example shows that specifying additivity properties can be much more compact than working at the level of parameters. Without language LG2 we would have to write the above formula.
In addition to the additivity property, LG2 allows to specify that an interaction composition ic c rc has an  Figure 3 with

GNBox Features
The core functionality of the GNBox environment is to test, for a given structure of a GRN, the consistency of a set of hypotheses about the behaviours of this GRN (language LG1) for several mutant types, about the interaction compositions (language LG2), and even directly about the parameters in P. GN-Box is able to identify consistent solutions in terms of state variables that define the behaviour (LG1) and in terms of parameters of P. In cases where the set of hypotheses is inconsistent, it is desirable to determine the possible relaxations of hypotheses to remove the inconsistency. GNBox can identify automatically, among a defined set of questionable hypotheses, all subsets of hypotheses whose relaxation removes the inconsistency (subsets of necessarily false hypotheses). These subsets are represented as disjunctions of conjunctions of negations of hypotheses. For example, the hypotheses H1 and H2 must be relaxed or the hypothesis H3 must be relaxed: (¬H1 ∧ ¬H2) V ¬H3. Also GNBox automatically identifies, among a defined set of hypotheses, all subsets of hypotheses necessarily true. These subsets are represented by disjunctions of conjunctions of hypotheses. For example, the hypotheses H1 and H2 are true or the hypothesis H3 is true: (H1 ∧ H2) ∨ H3.

Results and Discussion
Application to the immunity control by the l bacteriophage The analysis of this network adapted from [7] illustrates mainly the capability of GNBox (i) to express constraints about reachability of states, and (ii) to find the minimal interaction graph consistent with observations. The λ bacteriophage (or simply λ phage) is a virus that infects the bacterium Escherichia coli. The infection starts by the injection of the genetic material of the virus into the cytoplasm of the bacterium. We focus here on two simple observations about the evolution of the bacterium after infection: either the viral DNA is integrated in the genetic material of the bacterium, and the cells continue to divide normally (thus reproducing the phage DNA in the same process), or the genetic material replicates in the cytoplasm of the bacterial cell to create new viral particles and then new viruses until lysis (destruction) of the cell, which leads to the release of new virus particles in the extracellular medium. The first case corresponds to the lysogenic phase while the second corresponds to the lytic phase. The decision between these two phases is made by a network of viral genes.
The model proposed in [7] contains four viral genes denoted by cI, cro, cII and n. The gene cI is expressed only in the lysogenic phase, cro is expressed only in the lytic phase and genes cII and n are not expressed in both phases. The graph  of interactions between these genes is given in Figure 5. Interactions and interaction compositions (deduced from experimental data) are given in Table 1. We consider the set of all additivity and observability constraints for all these interaction compositions ( ( ), ( ) a ic o ic cI cI 1 1 , etc.). In the following we assume that the thresholds t c p are ordered so That t p c p = . According to the previous section the set of interactions, the hypotheses about interaction compositions (set of interaction compositions, additivity properties, observability properties) and the hypotheses on threshold values define a parameterized constrained model (couple composed of a focal equation system derived from  and a set of structural constraints). We call it M λ and it is defined formally by the predicate model_l(M λ ). A state S for this model is represented by an ordered list 〈S CI , S cro , S cII , S n 〉 of discrete protein concentrations. According to  and hypotheses on threshold values, we have max cI = 2, max cro = 3, max cII = 1, max n = 1. So the concentration space contains (2+1)*(3 +1)*(1+1)*(1+1) = 48 states.
The uninfected cell does not have any viral protein and can therefore be represented by the state S0 = 〈S0 cI , S0 cro , S0 cII , S0 n 〉 = 〈0, 0, 0, 0〉 In the lysogenic phase of  the virus-host system the only viral gene expressed is cI. This phase is represented by the state S1 = 〈S1 cI , S1 cro , S1 cII , S1 n 〉 = 〈2, 0, 0, 0〉 such that the concentration of protein cI remains equal to its highest value. In the lytic phase the only viral gene expressed is cro. In a continuous description this phase is represented by a state which is not contained within a domain, but which is at the border between two adjacent domains. We could introduce in our formalism additional states corresponding to borders between domains. Such states are called singular states in [2]. We choose here to stick to the simpler formalism, and we represent the lytic phase as a cycle between the two following states: S2 = 〈0, 2, 0, 0〉 and S3 = 〈0, 3, 0, 0〉, such that the concentration of the protein cro remains around the highest values 2 and 3 [7]. Biological observations tell us that these two phases must be attractors of the network dynamics, and that they are reachable from the initial conditions. These observations are formalized by Constraint 1 where the lengths of the third and fourth paths for the reachability of the two phases are equal to 48 states, 48 being the total number of states of the state space. The GNBox environment proves the consistency of this pool of constraints in 2 seconds. All run times mentioned in this article are obtained on a laptop with 2 GB of RAM and running at 2.4 GHz. Moreover GNBox can provide the instantiations of the parameters of P that satisfy the pool of constraints.
Example 8 An example of instantiation is: 0 = (remember that the indexes l of discrete kinetic parameters K c l are set at their creation from the numbering of interactions i c r See Definition 1). The set of transitions from S to S', denoted by S ↠ S', for this instantiation is represented in Figure 6.
An interesting question, akin to reverse-engineering of the network, is: what are the minimal numbers of interaction compositions necessary to get a model consistent with Constraint 1 without specifying any additivity or observability constraints? In other words, we search for the minimal interaction graphs, in terms of interactions, which satisfy the observed behaviors. From a constraint point of view, this problem is specified and implemented in the following way. For each interaction composition on a gene c, a Boolean variable is created which means "all pairs of states separated only by this interaction composition have the same evolution tendency for c". Then GNBox searches for consistent models such that the number of these Boolean variables which are true is maximized. GNBox finds that only two interactions on cro are necessary (in two seconds). So, the minimal interaction graph contains only two interactions on cro and no interactions on the other genes. The result is surprising at first sight, but it should be borne in mind that the query contains only poor information about behaviours and no information on the interaction graph (but the limitation to possible interactions), the goal being to infer the minimum graph implied by this information. This does not preclude the existence of other interactions, but means that those are not necessary to account for the behaviours included in the query.
Finally this application lead us to the interesting question of the length L of the longest path without cycle in the state space for a given set of hypotheses Set. We call this length the diameter of the network for Set. This knowledge permits to restrict the length of paths in subsequent queries considering a set of hypotheses including Set. In our case Set is Constraint 1. The diameter for Set is 43. This highly combinatorial problem is Table 1 Interactions and interaction compositions  hypotheses for the model about immunity control by  Application to the carbon nutritional stress response in the bacterium Escherichia coli The modeling and analysis of this network is adapted from [8]. It illustrates a case of model revision coming from an inconsistency of the initial set of hypotheses. We performed a similar and more exhaustive study reported in [9]. We show that by an automatic relaxation method over biological constraints we can suggest lines of research to the biologist or, said differently, generate new hypotheses.
Populations of the bacterium Escherichia coli grow exponentially in favorable conditions. This state is called the exponential phase. In stressing conditions, when food (carbon) starts to be lacking, the populations stop growing and they enter in a state called stationary phase, with altered physiology and morphology. The phenomenon is reversible: the population can return to the exponential phase if the conditions become favorable again.
The model, proposed in [8] and adapted to our formalism, contains one input node sig (signal, 0 in the absence of stress and 1 in the presence of stress) and five species: crp, cya, fis, gyr and top. The interaction graph  is given in Figure 7 where the input sig is represented by a dotted circle filled in blue. Interactions and interaction compositions are given in Table 2. Moreover we consider the set of all additivity and observability constraints for all interaction compositions. As before, the thresholds t c p are ordered and equal to p. The model obtained from all these hypotheses is denoted by M coli . Thus we have max crp = 2, max cya = 2, max fis = 3, max gyr = 2 and max top = 2 (for the input sig we have max sig = 1). We obtain a concentration space of (2 + 1) * (2 + 1) * (3 + 1) * (2 + 1) * (2 + 1) = 324 states.
The exponential phase and the stationary phase are modeled by two states, respectively S ns (ns for "not stressed") and S s (s for "stressed"). As stated in [8], there exists partial knowledge about these states: Note that only three components are instantiated in each state and that there is a relationship between the two others which expresses the fact that the supercoiling of DNA is higher in the exponential phase. To model the presence or the absence of stress we use two models: a model M coli ns without stress (sig = 0) and a model M coli s with stress (sig = 1). These two models are the same biological model M coli in different conditions. So they share the same discrete kinetic parameters K c l . In absence of stress the system beginning in the stressed state S s can reach the non-stressed state S ns , which is steady. In presence of stress the system beginning in S ns can reach S s , which is steady. We formalize that by: where L is the length of the third and fourth paths for the reachability of the two steady states. In the following queries we choose L = 10 and L = 100 to compare performance, but if we want a general query without any limitation on L we should choose L = 324 (the total number of states of the model) but the amount of memory needed to generate the pool of constraints becomes very large. We point out here that for such queries involving paths, this approach is limited to networks of medium size.
With GNBox we prove that the pool of constraints composed of constraints for models M coli s and M coli ns Constraint 2 and Constraint 3 is inconsistent in 2 seconds for path length L = 10 states, and 13 seconds for path length L = 100. In fact just imposing the existence of two steady states gives an inconsistency in less than 1 second, thus proving that the constraint pool is inconsistent whatever the value of L. In [8] it is noted that the proposed instantiated model is indeed inconsistent. Here we prove in addition that there exists no other instantiation of the discrete kinetic parameters (accepting the interaction compositions hypotheses) able to   restore consistency. In other words it is proved that this network architecture with these hypotheses on interaction compositions is incompatible with the observations. It is thus necessary to revise the model. In [8] the authors suggest that a regulator or an interaction may be missing in the model. Here, instead, we keep the interaction graph  as it is, and try to change the way interactions are composed. The set of unreliable hypotheses is the set of all additivity and observability properties about interaction compositions. We allow the relaxation of these hypotheses and we obtain the property ¬ ∨ ¬ a ic a ic , is the one which is not additive and consequently that it is possible to observe an inhibition of top by fis. This inhibition effect is actually observed for another kind of stress. In [10] it is said: "when Fis levels are low, hydrogen peroxide treatment results in topA activation". This means that fis acts in some cellular contexts as an inhibitor of top. This paper shows that the protein fis can indeed play an inhibitory role on top in some contexts, and it thus gives support to the new hypothesis that fis plays an inhibitory role in the response to nutritional stress. It is remarkable that this pool of constraints is inconsistent given that the number of adjustable parameters is relatively high. We insist here on the fact that inspection of the constraint pool did not allow to resolve manually this inconsistency.
Finally, it appears that the hypotheses of interaction compositions on top are not well supported by experiments, and we propose to determine the necessarily observable compositions of the type i c r and (  ¬i c r ). The rationale for limiting the compositions to basic types (signed interactions) is to provide easily interpretable results in terms of the interaction graph  complemented with interaction signs (corresponding to the choice between activation and inhibition). This allows to determine for example whether there are unnecessary arcs in  . On the other hand this restriction still allows to guide the user in the choice of hypotheses about interaction compositions. We conserve all the previous hypotheses except the ones about the interaction compositions on top. We consider a new set of these six interaction compositions for top:  4 . This result provides an essential information to help the biologist to make additional hypotheses about interaction compositions.
Application to the gap-gene module of the segmentation of the Drosophila melanogaster embryo In the first hours after fertilization, the embryo of the fly Drosophila melanogaster undergoes segmentation along the anteroposterior axis (head to tail). The embryo is partitioned into segments, each segment being made of cells characterized by specific levels of a set of proteins. Segmentation takes place in several successive stages controlled by distinct genetic modules. Here we focus on the gap-gene regulatory module.
The modeling and analysis of this network illustrates the expression of steady states in several segments of the embryo for the wild type and several mutant types, and the search for the minimum number of thresholds necessary to account for all the observations. The initial model is adapted from [11,12]. Although this model is not the most recent available, it is convenient for our purpose. The resolution of this query provides a set of minimal models (in terms of number of thresholds) consistent with a set of very diverse observations. The connection between the observations for all these models (one for each mutant type and for each segment) adds a new level of complexity.
The model, proposed in [11,12], controlling the gapgene module contains seven genes: Giant denoted by gt, Hunchback zygotic denoted by hb z , Hunchback maternal denoted by hb m , Krüppel denoted by kr, Knirps denoted by kni, Bicoid denoted by bcd, and Caudal denoted by cad. The genes bcd, hb m and cad are input genes: they influence other genes but are not influenced by any gene. Stocks of maternal mRNA and proteins are accumulated at specific places of the egg before fertilization. These molecules generate gradients along the anteroposterior axis. In the model these quantities are represented by input parameters (one for each chemical species and each region). The interaction graph  between these genes is given in Figure 8 where input genes are represented by dotted circles filled in blue. Interactions and interaction compositions are given in Table 3. Moreover we consider the set of all additivity and observability constraints for all interaction compositions. The modeling in [11,12] takes into account four adjacent segments along the anteroposterior axis, denoted by A, B, C and D. Genetic experiments produced information on the concentration of the gap-gene proteins for the wild type, denoted by wt, and nine mutants. The mutants are: knock-out (KO) on gt (the focal value of the gt component is 0 everywhere) denoted by gt0, KO on both hb z and hb m denoted by hb0, KO on kr denoted by kr0, KO on kni denoted by kni0, KO on bcd denoted by bcd0, KO on hb m denoted by hbm0, KO on cad denoted by cad0, ectopic expression equal to 1 on gt (the focal value of the gt component is everywhere equal to 1) denoted by gt1, ectopic expression equal to 1 on kni denoted by kni1. We define a model M R,T for each segment RÎ {A, B, C, D} and each type T Î {wt, gt0, hb0, kr0, kni0, bcd0, cad0, hbm0, gt1, kni}. For example, M B , gt0 corresponds to the model of the mutant type gt0 in segment B. The input parameters, discrete kinetic parameters and threshold parameters, between models are linked by introducing equality constraints between them, as explained in the section on the formalization of Thomas networks. Thus it would be redundant to impose constraints about interaction compositions for mutant types (the corresponding constraints for the wild type are sufficient). Obviously, these constraints lead to exactly the same threshold and discrete kinetic parameters for the four models associated to the four segments and each mutant.
The concentrations of the proteins produced by input genes bcd, hb m and cad for each region R and each mutant type , . We impose in Constraint 4 that the inputs in the mutant types are equal to those of the wild type, except in the cases where some input genes themselves are mutated. This exception is due to the fact that the inputs come from the mother system, so only a mutation in this system can change the concentration value of the corresponding input. represented by ordered lists of four protein concentrations: S S S S S 1 3 . To identify the minimum number of different thresholds needed to satisfy all the observations and hypotheses, we must build a query using the method of relaxation of constraints. But in this case, the relaxation takes place on the number of thresholds in 1..max c for each component c.
To summarize, we challenge the hypotheses about the number of thresholds for all components. From a constraint point of view, this problem is specified and implemented in the following way. We introduce Boolean variables B j,c equivalent to "the number of thresholds for c is less or equal to j", j being an integer in the interval 1..max c -1. So we get six Boolean variables in our case: B hb z 1, and B hb z 2, for hb z , B 1, kr for kr, B 1, bcd and B 2, bcd for bcd, B 1, cad for cad. Then GNBox searches for models consistent with the set of constraints defined above such that the number of these Boolean variables which are true is maximum. GNBox finds in 22 seconds that the only Boolean variables to be false are B hb z 2, and B 2, cad . This indicates that hb z must have at least 2 different values of thresholds, and cad must have at least 2 different values of thresholds. Table 4 Constraints between stationary states and between input parameters for each region and each mutant type for the model about gap-gene module of the segmentation of the D. melanogaster embryo formalisms (logic, Petri nets, ODEs). The idea is to add to the simulation functionality a formal verification functionality to check, or optimize [21], the fitness between the simulated and the observed behaviours.
Three types of abstractions are available in BIOC-HAM, among which ordinary differential equations and Boolean networks. The inference of parameters is based on the technique of model-checking and the definition of a continuous degree of satisfaction of a temporal logic formula formalizing some observation on behaviour. This permits to find biochemical kinetic parameter values which are optimal with respect to a set of biological properties. Moreover it is possible to find the effect of parameter variations on the robustness of a behaviour specification [22]. Our work differs significantly in that it focuses to face the problem of incomplete knowledge to produce, by a constraint-based process, a class of models from which it is expected to design new experiments.
A steady state search module, including the so-called singular states, exists in GNA based on an integration of the SAT solver SAT4J [23]. This integration of a constraint approach avoids the generation of all the transitions to identify the steady states. But in contrast to our work aiming at providing general queries, [23] focus on the search of steady states, and only in the case of completely instantiated models (kinetic parameters instantiated and order of the thresholds predefined). The work in [24] search the same steady states with a CSP (Constraint Satisfaction Problem) formalization, the performances are worse than in [23]. In our work, we can write easily queries to identify steady states, and even cycles of length smaller than some predefined value. In addition in our case the kinetic parameters and the orders between thresholds can be only partially known. As explained here, other much more sophisticated queries are available, although in the current version we do not include singular states.
The formal approach proposed here modifies deeply the way to proceed in the building and in the exploration of genetic and biochemical networks, first by avoiding the usual trial-and-error procedure, and second by putting the emphasis on sets of solutions, rather than a single consistent solution arbitrarily chosen in a set. Last, the constraint approach lends to a unified description of network architecture and network behaviour, as both are described in terms on formal constraints. The knowledge available to initiate the modeling of a given phenomenon is generally sparse with respect to the complexity of the behaviour of the underlying networks. Table 5 Constraints of instantiation of stationary states according to the second table in [12] for each region and each mutant type for the model about gap-gene module of the segmentation of the D. melanogaster embryo