 Methodology Article
Extended notions of sign consistency to relate experimental data to signaling and regulatory network topologies
BMC Bioinformatics volume 16, Article number: 345 (2015)
Abstract
Background
A rapidly growing amount of knowledge about signaling and gene regulatory networks is available in databases such as KEGG, Reactome, or RegulonDB. There is an increasing need to relate this knowledge to high-throughput data in order to (in)validate network topologies or to decide which interactions are present or inactive in a given cell type under a particular environmental condition. Interaction graphs provide a suitable representation of cellular networks with information flows, and methods based on sign consistency approaches have been shown to be valuable tools to (i) predict qualitative responses, (ii) test the consistency of network topologies and experimental data, and (iii) apply repair operations to the network model suggesting missing or wrong interactions.
Results
We present a framework to unify different notions of sign consistency and propose a refined method for data discretization that considers uncertainties in experimental profiles. We furthermore introduce a new constraint to filter undesired model behaviors induced by positive feedback loops. Finally, we generalize the way predictions can be made by the sign consistency approach. In particular, we distinguish strong predictions (e.g., increase of a node level) and weak predictions (e.g., node level increases or remains unchanged), enlarging the overall predictive power of the approach. We then demonstrate the applicability of our framework by confronting a large-scale gene regulatory network model of Escherichia coli with high-throughput transcriptomic measurements.
Conclusion
Overall, our work enhances the flexibility and power of the sign consistency approach for the prediction of the behavior of signaling and gene regulatory networks and, more generally, for the validation and inference of these networks.
Background
Advances in measurement technologies and high-throughput methods in molecular biology have led to a tremendous increase in the availability of factual biological knowledge as well as of data capturing the response of biological systems to experimental conditions. Knowledge about metabolic, signaling, and gene regulatory interactions and networks is available in databases such as KEGG, RegulonDB, PID, or Reactome, which can be used as a starting point to build causal models of biomolecular networks [1]. Specifically, signaling and gene regulatory networks carrying signal and information flows can be represented as interaction (or influence) graphs [2–6], Bayesian networks [7], some form of logic (including Boolean or constrained fuzzy logic) modeling [4, 8, 9], or ordinary differential equations [10–12]. However, there is an increasing need to relate large-scale network models to high-throughput data in order to (in)validate network topologies or to decide which regulatory or signaling interactions are present in a particular biological system, cell type, environmental condition, etc.
Significant work has been published on this subject, attempting to detect inconsistencies between measured high-throughput data and signaling and regulatory networks and to subsequently identify missing or inactive interactions such that the optimized network structure maximizes consistency with experimental data [2, 4, 13–18]. Some of these approaches use signed directed graphs, also called interaction or influence graphs (IG), as the underlying model, where edges indicate either a positive or a negative effect of one node upon another. Although these models are qualitative and simple, they have frequently been used to study signal flows in a wide range of biological systems. Moreover, the fact that every Boolean and every ODE model has an underlying interaction graph renders their analysis directly relevant for other modeling formalisms, and it has been shown that some important global properties of Boolean or ODE models are determined by the structure of their associated IG [6, 19, 20]. IG have also been used for qualitative reasoning, to describe physical systems where a detailed quantitative description is unavailable [21]. In fact, this has been one motivation for using IG in the context of biological systems [20], where knowledge and data are usually uncertain.
One important class of methods relating IG with experimental data is based on the notion of sign consistency. The key idea here is to represent the potential network behaviors resulting from steady-state shift experiments (such as up-regulation or down-regulation of node activation levels after network perturbations) by certain kinds of discrete constraints. A first approach based on sign consistency was introduced in [2]. There, experimentally measured changes in node activities were represented by two labels (increase, decrease) on the IG nodes. Constraints relating node labels and IG are introduced to model the propagation of regulatory effects. Later, in [3, 22], Answer Set Programming (ASP) [23] was used to find admissible node labelings adhering to the posed constraints, and optimal repairs to restore sign consistency were proposed. A related formalism was presented in [17]. Major differences to previous studies were (i) the consideration of three node labels (increase, decrease, 0-change), (ii) the representation of the constraints as an integer linear programming (ILP) problem, and (iii) the introduction of new repair operations minimizing inconsistencies between the IG structure and the experiments.
The goal of this study is fourfold. First, we aim at unifying existing approaches into a general framework. We show that different notions of sign consistency mainly differ in the way zero changes are modeled. Then, we propose a refined method for data discretization allowing one to express uncertainties during the discretization step. In addition, we introduce a new constraint to filter undesired self-fulfilled explanations which result from positive feedback loops. Finally, we introduce an extended prediction method that allows not only strong (e.g., "increase") but also weak predictions (e.g., "increase or 0-change"), enlarging the predictive power of the approach. We applied the extended framework to a realistic case study where we analyze high-throughput transcriptomic measurements of Escherichia coli in the context of a large-scale gene regulatory network model obtained from RegulonDB. Taken together, we demonstrate that these extensions increase the applicability and flexibility of the approach significantly.
Methods
Definitions
An influence or interaction graph (IG) is a signed directed graph (V,E,σ), where V is a set of nodes, E a set of edges, and σ:E→{+,−} a labeling of the edges. Every node in V represents a species in the modeled system and an edge j → i means that the change of j in time influences the level of i. Every edge j → i of an IG can be labeled with a sign, either + or −, denoted by σ(j,i), where + (−) indicates that j tends to increase (decrease) i. An example IG is given in Fig. 1.
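As an illustration of this definition, the following minimal Python sketch stores an IG as a set of signed edges. The toy graph, the names `EDGES`, `nodes`, and `predecessors`, and the encoding of signs as +1/−1 are assumptions made for this example only; they are not taken from Fig. 1 or from the iggy software.

```python
from typing import List, Set, Tuple

# An edge (j, i, s) means j -> i with sign s: +1 (j tends to increase i)
# or -1 (j tends to decrease i).
Edge = Tuple[str, str, int]

# Hypothetical toy graph, not taken from the paper's figures.
EDGES: Set[Edge] = {
    ("A", "B", +1),
    ("B", "C", -1),
    ("C", "B", -1),
}


def nodes(edges: Set[Edge]) -> Set[str]:
    """Return V, the set of all nodes that occur in some edge."""
    return {n for (j, i, _) in edges for n in (j, i)}


def predecessors(edges: Set[Edge], i: str) -> List[Tuple[str, int]]:
    """Return the sorted list of (j, sign) pairs for all edges j -> i."""
    return sorted((j, s) for (j, t, s) in edges if t == i)
```

Storing edges as triples rather than as a map from node pairs to signs allows two parallel edges with opposite signs between the same pair of nodes.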
In this framework, we confront the IG with experimental profiles. In our approach, the experimental profiles are supposed to come from steady-state shift experiments where, initially, the system is at steady state, is then externally perturbed in certain nodes, and eventually settles into another steady state. For some species S⊆V (genes, proteins, or metabolites), concentrations are measured in the initial and final state. The raw data is given by a real value obs(s) for every measured species s∈S, specifying the difference between the node states at the beginning and in the new steady state. As defined below, we determine for these nodes whether the concentration has increased, decreased, or not significantly changed.
Data discretization
We propose a refined method to discretize the measurements using four (condition-dependent) thresholds t_{1}≤t_{2}<0<t_{3}≤t_{4}, allowing one to consider uncertainties in the discretization process. As illustrated in Fig. 2, these thresholds define a mapping \(\mu : S \rightarrow \{-, \triangledown, 0, \vartriangle, +\}\) as follows.
We consider measurements which are smaller than t_{1}, bigger than t_{4}, or between t_{2} and t_{3} as certain (decrease −, increase +, and no-change 0, respectively), while measurements that are between t_{1} and t_{2} (resp. t_{3} and t_{4}) are uncertain (uncertain-decrease \(\triangledown\), resp. uncertain-increase \(\vartriangle\)) and not exactly classifiable. With that, an experimental profile (S,I,μ) is defined by the set of measured species S, the set of input nodes I⊆S (the experimentally perturbed species) whose changes are trivially explained, and the mapping μ as defined above.
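The five-valued discretization can be sketched in a few lines of Python. This is only an illustration: the function name `discretize`, the string labels `"udec"` and `"uinc"` for the uncertain classes, and in particular the treatment of values lying exactly on a threshold are assumptions, since the text does not specify which side the boundaries belong to.

```python
def discretize(obs: float, t1: float, t2: float, t3: float, t4: float) -> str:
    """Map a measured difference obs(s) to one of five labels.

    Labels: "-" (certain decrease), "udec" (uncertain decrease),
    "0" (no change), "uinc" (uncertain increase), "+" (certain increase).
    Boundary handling (which interval includes a threshold) is an assumption.
    """
    if not (t1 <= t2 < 0 < t3 <= t4):
        raise ValueError("thresholds must satisfy t1 <= t2 < 0 < t3 <= t4")
    if obs < t1:
        return "-"
    if obs <= t2:
        return "udec"
    if obs < t3:
        return "0"
    if obs <= t4:
        return "uinc"
    return "+"


# Example with the thresholds used later for the E. coli data.
labels = [discretize(x, -2, -0.01, 0.01, 2) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```

Setting t1 = t2 and t3 = t4 reduces the scheme to a conventional three-valued discretization up to the boundary convention.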
Local consistency rules
Given an IG (V,E,σ) and an experimental profile (S,I,μ), one can describe the rules that relate both. For this purpose, we look for total labelings μ^{t}:V→{−,0,+} that satisfy the local constraints defined below. It is important to notice that μ^{t} defines a total labeling using the three labels {−,0,+}, whereas μ defines a partial labeling (only measured nodes are labeled) based on the five labels \(\{-, \triangledown, 0, \vartriangle, +\}\) representing the discretized measurements.
With the first constraint, we look for total labelings μ ^{t} that satisfy the observed measurements captured in the partial node labeling given by μ:
Constraint 1 (satisfy observations).
Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ ^{t}:V→{+,−,0} be a total labeling, and let i∈V be a node with μ ^{t}(i)∈{+,0,−}.
Then μ^{t} satisfies Constraint 1 for node i iff i∉S, or μ^{t}(i)=+ and \(\mu(i) \in \{+, \vartriangle\}\), or μ^{t}(i)=0 and \(\mu(i) \in \{\vartriangle, 0, \triangledown\}\), or μ^{t}(i)=− and \(\mu(i) \in \{\triangledown, -\}\).
Note, uncertain measurements restrict the labeling of a node to two out of the three values {+,−,0}, while measurements with high certainty fix a node’s label to exactly one value.
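Constraint 1 amounts to a lookup from each of the five measurement labels to the set of total labels it admits. The sketch below is an illustration under assumed encodings (total labels as +1, 0, −1; measurement labels as the strings `"+"`, `"uinc"`, `"0"`, `"udec"`, `"-"`); the names `ALLOWED` and `satisfies_c1` are hypothetical.

```python
# Total labels admitted by each discretized measurement label (Constraint 1).
ALLOWED = {
    "+": {1},          # certain increase
    "uinc": {1, 0},    # uncertain increase: increase or 0-change
    "0": {0},          # certain 0-change
    "udec": {0, -1},   # uncertain decrease: 0-change or decrease
    "-": {-1},         # certain decrease
}


def satisfies_c1(mu: dict, total: dict, i: str) -> bool:
    """True iff node i is unmeasured or its total label agrees with mu(i)."""
    return i not in mu or total[i] in ALLOWED[mu[i]]


# Hypothetical example: A and B are measured, C is not.
mu = {"A": "uinc", "B": "+"}
total = {"A": 1, "B": 0, "C": -1}
checks = {i: satisfies_c1(mu, total, i) for i in total}
```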
Next, we demand for every non-input node i that its change μ^{t}(i) be explained by the total influence of its predecessors in the IG. The influence of j on i is given by the product μ^{t}(j)σ(j,i)∈{+,−,0}.
Constraint 2 (change must be justified by a change in a predecessor).
Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ^{t}:V→{+,−,0} be a total labeling, and let i∈V∖I be a non-input node with μ^{t}(i)∈{+,−}.
Then μ ^{t} satisfies Constraint 2 for node i if there is some edge j → i in E such that μ ^{t}(i)=μ ^{t}(j)σ(j,i).
Constraint 2 is consistent with the propagation rule used in [2, 3], which demands that increases and decreases must be explained by predecessor nodes while 0-changes are unconstrained; that is, 0-changes can always occur irrespective of the state of the predecessor nodes (note that 0-changes were not considered in [2, 3]). One argument for this reasoning is that it is often impossible to estimate the strength of the influences, and the thresholds at which a downstream effect occurs are unknown. Hence, we cannot guarantee that an influence really has an effect and therefore allow 0-change. On the other hand, the constraint still enforces explanations for observed changes in node activation levels; each change must be explainable by an influence (with proper sign) of at least one predecessor.
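A direct check of Constraint 2 for a single node can be sketched as follows. The encoding (edges as `(j, i, sign)` triples, labels as +1/0/−1) and the name `satisfies_c2` are assumptions made for this illustration.

```python
def satisfies_c2(edges: set, inputs: set, total: dict, i: str) -> bool:
    """Constraint 2: a change in a non-input node needs a predecessor
    whose influence total[j] * sign equals the node's own label.
    Input nodes and 0-changes are unconstrained."""
    if i in inputs or total[i] == 0:
        return True
    return any(total[j] * s == total[i] for (j, t, s) in edges if t == i)


# Hypothetical example: A activates B and inhibits C.
edges = {("A", "B", 1), ("A", "C", -1)}
inputs = {"A"}
total = {"A": 1, "B": 1, "C": 1}
checks = {i: satisfies_c2(edges, inputs, total, i) for i in total}
```

Here the increase in C is unexplained, because the only influence on C is negative.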
Melas et al. [17] also suggested demanding proper explanations for 0-changes using the following constraint:
Constraint 3 (0-change must be justified).
Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ^{t}:V→{+,−,0} be a total labeling, and let i∈V∖I be a non-input node with μ^{t}(i)=0. Then μ^{t} satisfies Constraint 3 for node i if there is either no edge j → i in E such that μ^{t}(j)σ(j,i)∈{+,−} or there exist at least two edges j_{1} → i and j_{2} → i in E such that μ^{t}(j_{1})σ(j_{1},i)+μ^{t}(j_{2})σ(j_{2},i)=0.
Constraint 3 restricts the occurrence of 0-changes. A node is only allowed to show 0-change if it receives either no influence or contradictory influences. This constraint thus assumes that each influence has indeed an effect and only contradictory influences can cancel each other out.
In Fig. 3, we illustrate IGs with different labelings where green stands for increase, red for decrease, and blue for 0-change. Notice that Constraint 2 intentionally allows situations like those in labelings g and h, where D is labeled as 0-change even if the predecessor B shows an increase resp. decrease. On the other hand, Constraint 2 forbids D to increase or decrease if all predecessors are labeled as 0-change.
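The following sketch checks Constraint 3 for a single node. It implements the reading given in the explanatory paragraph (a 0-change is admissible only with no non-zero influence or with at least one positive and one negative influence); the encoding of edges and labels and the name `satisfies_c3` are assumptions for this illustration.

```python
def satisfies_c3(edges: set, inputs: set, total: dict, i: str) -> bool:
    """Constraint 3: a 0-change in a non-input node is allowed only if the
    node receives no non-zero influence or receives contradictory ones."""
    if i in inputs or total[i] != 0:
        return True
    nonzero = {total[j] * s for (j, t, s) in edges if t == i} - {0}
    return not nonzero or nonzero == {1, -1}


# Hypothetical example: A activates C, B inhibits C.
edges = {("A", "C", 1), ("B", "C", -1)}
inputs = {"A", "B"}
cancelled = satisfies_c3(edges, inputs, {"A": 1, "B": 1, "C": 0}, "C")
unopposed = satisfies_c3(edges, inputs, {"A": 1, "B": 0, "C": 0}, "C")
quiet = satisfies_c3(edges, inputs, {"A": 0, "B": 0, "C": 0}, "C")
```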
From local to global reasoning
While there might exist several total labelings that satisfy the local constraints for some nodes, we are interested in checking global consistency, where a total labeling exists such that the local constraints are satisfied for all nodes. In Fig. 4, we illustrate an IG together with a partial labeling which is locally consistent but globally inconsistent. In other words, there exist two total labelings such that the local consistency rules (Constraints 1, 2 and 3) are satisfied for either A or B, but there exists no single total labeling that satisfies these constraints for all nodes.
We use the previously defined constraints to define the following global consistency notions.
Consistency Notion 1 (weak propagation, WP).
We call an IG and an experimental profile (S,I,μ) consistent under weak propagation (WP), iff there exists a total labeling μ^{t} such that Constraints 1 and 2 are satisfied for all nodes.
Consistency Notion 2 (strong propagation, SP).
We call an IG and an experimental profile (S,I,μ) consistent under strong propagation (SP), iff there exists a total labeling μ ^{t} such that Constraints 1, 2 and 3 are satisfied for all nodes.
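For small graphs, WP and SP consistency can be checked by exhaustively enumerating all 3^|V| total labelings. The sketch below does exactly that; it is a brute-force illustration under assumed encodings (edges as `(j, i, sign)` triples, labels as +1/0/−1, measurement labels as strings) and is not the ASP encoding used by iggy.

```python
from itertools import product

ALLOWED = {"+": {1}, "uinc": {1, 0}, "0": {0}, "udec": {0, -1}, "-": {-1}}


def influences(edges, total, i):
    """Influences total[j] * sign of all predecessors j of node i."""
    return [total[j] * s for (j, t, s) in edges if t == i]


def c1(mu, total, i):
    return i not in mu or total[i] in ALLOWED[mu[i]]


def c2(edges, inputs, total, i):
    return i in inputs or total[i] == 0 or total[i] in influences(edges, total, i)


def c3(edges, inputs, total, i):
    if i in inputs or total[i] != 0:
        return True
    nonzero = set(influences(edges, total, i)) - {0}
    return not nonzero or nonzero == {1, -1}


def consistent_labelings(edges, inputs, mu, strong):
    """All total labelings consistent under WP (strong=False) or SP (strong=True)."""
    nodes = sorted({n for (j, t, _) in edges for n in (j, t)} | set(mu) | set(inputs))
    result = []
    for values in product((-1, 0, 1), repeat=len(nodes)):
        total = dict(zip(nodes, values))
        if all(c1(mu, total, i) and c2(edges, inputs, total, i)
               and (not strong or c3(edges, inputs, total, i)) for i in nodes):
            result.append(total)
    return result


# Hypothetical example: input A increases, its target B shows 0-change.
edges, inputs, mu = {("A", "B", 1)}, {"A"}, {"A": "+", "B": "0"}
wp = consistent_labelings(edges, inputs, mu, strong=False)
sp = consistent_labelings(edges, inputs, mu, strong=True)
```

In this toy profile the data are consistent under WP but not under SP, because the unopposed positive influence of A on B leaves the 0-change in B unjustified.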
Further, we introduce here a new global constraint to ensure that every node change is justified by a chain of influences that can be traced back to a (perturbed) input node. This natural constraint is especially useful to forbid self-justification of changes via positive feedback loops (see Fig. 5).
Constraint 4 (a change must be founded in an input).
Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ ^{t}:V→{+,−,0} be a total labeling, and i∈V a node with μ ^{t}(i)∈{+,−}.
Then μ^{t} satisfies Constraint 4 for node i if either i is an input node (i∈I), or there exists a path (v_{0},…,v_{k}) in E with v_{0}∈I, v_{k}=i and μ^{t}(v_{n−1})σ(v_{n−1},v_{n})=μ^{t}(v_{n}) for all n=1…k.
In Fig. 5, we illustrate an IG with a partial labeling (left) and two total labelings (middle and right) derived from the partial one. Both total labelings satisfy the local propagation rules (Constraints 2, 3), but only the second total labeling satisfies the global propagation rule (Constraint 4). While the first labeling suggests a self-sustained increase in B and C as explanation for the increase in D, the second labeling points to an increase in the input node A. Using Constraint 4, we can avoid the manual removal of positive feedback loops as done in previous studies [17].
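Constraint 4 can be checked with a breadth-first search that starts at the changed input nodes and follows only edges along which the sign propagates correctly. The sketch below is an illustration; the toy graph is hypothetical and only loosely modeled on the situation described for Fig. 5, and the names `founded_nodes` and `satisfies_c4` are assumptions.

```python
from collections import deque


def founded_nodes(edges: set, inputs: set, total: dict) -> set:
    """Changed nodes reachable from a changed input via sign-consistent edges."""
    founded = {v for v in inputs if total[v] != 0}
    queue = deque(founded)
    while queue:
        j = queue.popleft()
        for (src, dst, s) in edges:
            if (src == j and dst not in founded
                    and total[dst] != 0 and total[j] * s == total[dst]):
                founded.add(dst)
                queue.append(dst)
    return founded


def satisfies_c4(edges: set, inputs: set, total: dict, i: str) -> bool:
    """Constraint 4 only restricts changed nodes; input nodes always satisfy it."""
    return total[i] == 0 or i in inputs or i in founded_nodes(edges, inputs, total)


# Hypothetical graph: input A feeds a positive loop B <-> C, which feeds D.
edges = {("A", "B", 1), ("B", "C", 1), ("C", "B", 1), ("C", "D", 1)}
inputs = {"A"}
self_sustained = {"A": 0, "B": 1, "C": 1, "D": 1}
grounded = {"A": 1, "B": 1, "C": 1, "D": 1}
```

In the first labeling the increases in B and C justify each other locally but cannot be traced back to A, so Constraint 4 rejects it.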
We combine the new constraint with previously defined constraints into the following consistency notions.
Consistency Notion 3 (founded weak propagation, FWP).
We call an IG and an experimental profile (S,I,μ) consistent under founded weak propagation (FWP), iff there exists a total labeling μ ^{t} such that Constraints 1, 2 and 4 are satisfied for all nodes.
Consistency Notion 4 (founded strong propagation, FSP).
We call an IG and an experimental profile (S,I,μ) consistent under founded strong propagation (FSP), iff there exists a total labeling μ ^{t} such that Constraints 1, 2, 3, and 4 are satisfied for all nodes.
Consistency checking
We can now apply the previously defined consistency notions to enumerate consistent total labelings and to verify the consistency of network and observation data for a given experimental profile. We consider an IG consistent with an experimental profile (S,I,μ) if there exists at least one consistent total labeling (consistent with respect to the chosen notion WP, SP, FWP or FSP). Consider Fig. 6, which shows the total labelings of the IG in Fig. 1 consistent with an example experimental profile (A and D were increased, resulting in a measured 0-change in H) under the different consistency notions. Note that the notions become stricter, accepting fewer labelings as consistent and therefore excluding certain system behaviors. The set of admissible labelings under SP is a subset of the admissible labelings under WP, and the set of admissible labelings under FSP is a subset of the admissible labelings under SP. Further, one can see that Constraint 4 excludes all labelings where E and F decrease. This behavior does not satisfy Constraint 4, as it is only possible by mutual inhibition using the positive loop between E and F, which is not founded in an input.
Predictions under consistency
The consistency check of network and experimental data is the first analysis that is performed with the sign consistency approach. If network and data are consistent, the sign consistency approach can be used to predict the behavior of unmeasured entities in the network. This can also be used to predict the outcome of a planned experiment and, conversely, to plan an experiment that should result in a specific desired behavior. In the sign consistency approach, each consistent labeling represents an admissible behavior of the system. We call a statement that holds in all admissible behaviors under the given consistency notion a prediction. If parts of the system act the same in all admissible behaviors, this can be predicted. We can predict the following types of behaviors in our systems. We predict that a species increases + (resp. decreases −, does not change 0) if it increases (resp. decreases, does not change) in all admissible labelings. We call these strong predictions, because the possible behaviors of a species are reduced to exactly one. Further, we can predict that a species does not increase (resp. does not decrease, does change) if it does not increase (resp. does not decrease, does change) in all admissible labelings. Therefore, we can also predict weak increase ⊕ when a species never decreases, but increases in at least one admissible behavior and does not change in another. Likewise, we predict weak decrease ⊖ when a species never increases, but decreases in at least one admissible behavior and does not change in another. Finally, we predict change ± when a species always changes, increasing in at least one admissible behavior and decreasing in another. We call ⊕, ⊖, and ± weak predictions because one possible behavior is excluded while one degree of freedom is left.
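Formally, for a set V of nodes in our network and the non-empty set M of labelings consistent with our experimental profile, let \(M(v) = \{\mu^{t}(v) \mid \mu^{t} \in M\}\) denote the set of labels that node v takes across M. We define the prediction function pred: V→{+,−,0,⊕,⊖,±,no} as follows:
\[
pred(v) = \begin{cases}
+ & \text{if } M(v) = \{+\} \\
- & \text{if } M(v) = \{-\} \\
0 & \text{if } M(v) = \{0\} \\
\oplus & \text{if } M(v) = \{+, 0\} \\
\ominus & \text{if } M(v) = \{-, 0\} \\
\pm & \text{if } M(v) = \{+, -\} \\
\text{no} & \text{if } M(v) = \{+, 0, -\}
\end{cases}
\]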
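The strong and weak prediction types described above follow directly from the set of labels a node takes across all admissible labelings. A minimal Python sketch, assuming labelings are given as dictionaries mapping nodes to +1/0/−1 (the name `predict` and this encoding are assumptions of the example):

```python
PREDICTION = {
    frozenset({1}): "+",
    frozenset({-1}): "-",
    frozenset({0}): "0",
    frozenset({1, 0}): "⊕",      # weak increase: never decreases
    frozenset({-1, 0}): "⊖",     # weak decrease: never increases
    frozenset({1, -1}): "±",     # change: never stays unchanged
    frozenset({1, 0, -1}): "no",  # nothing can be excluded
}


def predict(labelings: list, node: str) -> str:
    """Prediction for a node, given all consistent total labelings."""
    if not labelings:
        raise ValueError("no consistent labeling: repair network or data first")
    return PREDICTION[frozenset(lab[node] for lab in labelings)]


# Hypothetical set of two admissible labelings over nodes X, Y, Z.
admissible = [{"X": 1, "Y": 0, "Z": 1}, {"X": 1, "Y": 1, "Z": -1}]
predictions = {v: predict(admissible, v) for v in ("X", "Y", "Z")}
```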
Recovery rate and precision
In Table 1, we show the predictions for the example given in Fig. 1. One can see that the more constrained consistency notions yield smaller sets of admissible labelings and a higher recovery rate (the share of unmeasured species for which predictions can be obtained). In the systematic comparison of the consistency notions based on real experimental data, we consider not only recovery rate but also prediction precision (true positives/(true positives + false positives)). A strong prediction (+/−/0) is a true positive if the node has a certain measurement with the same value (+/−/0). A weak prediction ⊕ (resp. ⊖ and ±) is a true positive if the node has a certain measurement + or 0 (resp. − or 0, and + or −). Conversely, a prediction is a false positive if the node has a certain measurement with a contradictory value.
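The precision computation described above can be sketched as follows. The function name `precision`, the string labels, and the decision to skip nodes with uncertain or missing measurements and nodes without prediction are assumptions consistent with, but not spelled out in, the text.

```python
# Certain measurement values that make each prediction a true positive.
COMPATIBLE = {
    "+": {"+"}, "-": {"-"}, "0": {"0"},
    "⊕": {"+", "0"}, "⊖": {"-", "0"}, "±": {"+", "-"},
}
CERTAIN = {"+", "-", "0"}


def precision(predictions: dict, measurements: dict):
    """TP / (TP + FP) over nodes with a prediction and a certain measurement.
    Returns None if no prediction can be validated."""
    tp = fp = 0
    for node, pred in predictions.items():
        measured = measurements.get(node)
        if pred not in COMPATIBLE or measured not in CERTAIN:
            continue  # "no" prediction, or uncertain / missing measurement
        if measured in COMPATIBLE[pred]:
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if tp + fp else None


# Hypothetical predictions and validation measurements.
preds = {"a": "+", "b": "⊕", "c": "-", "d": "no", "e": "0"}
meas = {"a": "+", "b": "0", "c": "+", "d": "-", "e": "uinc"}
p = precision(preds, meas)
```

In this toy example two of the three validated predictions are true positives.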
Repairing inconsistent networks and data
If network and data are inconsistent, the natural question arising is how to repair networks and/or data, that is, how to modify network and/or data in order to reestablish their mutual consistency. A major challenge lies in the range of possible repair operations, since an inconsistency can be explained by missing interactions or inaccurate information in a network as well as by measurement errors. The sign consistency approach can be used to determine a set of repair operations that are suitable to restore consistency. Typically, plenty of suitable repair operations are possible, in particular if multiple repair operations are admitted. However, one is usually only interested in repairs that make few changes to the model and/or data. These minimal repair sets can be used not only for hypothesis generation (e.g., which data might be questionable or which edges might be missing or inactive) but also as a quantitative measure for the fitness of model and data. Also note that once consistency is reestablished, network and data can again be used for predicting behaviors of unmeasured entities.
In [17], four repair operations were introduced; two of them for single experiments (SCENFIT, Minimal Correction Sets (MCoS)) and two for multiple experiments (OPTSUBGRAPH, OPTGRAPH). The latter two are computationally more demanding as they seek to optimize the whole network structure based on many perturbation experiments. SCENFIT, as explained in detail in the Additional file 1, seeks to find a consistent node labeling that is closest to the given measurements and can thus help to identify inconsistencies between network and dataset. Herein we will focus on MCoS and thus deal with the analysis of single experiments. This is motivated by our application example, where we indeed have multiple experiments (105) but where the number of experiments is low compared to the number of edges and nodes in the network (1646), which precludes a meaningful network structure optimization. However, we note here that our extended notion FSP can be straightforwardly applied to these repair operations as well.
Minimal Correction Sets (MCoS)
To resolve inconsistencies, one may add new influences to the model if the latter is considered to be potentially incomplete (which is often the case in practice). Adding an influence can be used to indicate missing (unknown) regulations or oscillations of regulators that would explain the (topology-inconsistent) measurements. We use minimal correction sets (MCoS) as defined in [17] as minimal sets of new signed (positive or negative) input influences that restore consistency of model and data. MCoS are defined as signed influences and are specific for a single experiment; they might be incompatible with other experiments. Note that every inconsistency can be repaired by adding a new influence. Therefore, adding influences is always suited to restore consistency. The MCoS can also be interpreted as a measure of consistency of model and data. Compared to SCENFIT, MCoS always yields a smaller or equal number of repairs. Therefore, we define the inconsistency index of a network with respect to data as the size of an MCoS divided by the number of observations in the experiment. Figure 7 illustrates how repair through addition of influences works.
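To make the idea of repair by added influences concrete, the following brute-force sketch searches for smallest sets of new signed influences that restore consistency under weak propagation (Constraints 1 and 2 only). Everything here is an assumption made for illustration: the encoding of a repair as a `(node, sign)` pair that acts as an extra non-zero influence on that node, the use of minimum cardinality as the minimality criterion, and the exhaustive enumeration, which only scales to toy graphs. It does not reproduce the ILP formulation of [17] or the ASP encoding of iggy.

```python
from itertools import combinations, product

ALLOWED = {"+": {1}, "uinc": {1, 0}, "0": {0}, "udec": {0, -1}, "-": {-1}}


def node_ok(edges, inputs, mu, repairs, total, i):
    """Constraints 1 and 2 for node i, counting repairs as extra influences."""
    if i in mu and total[i] not in ALLOWED[mu[i]]:
        return False
    if i in inputs or total[i] == 0:
        return True
    incoming = [total[j] * s for (j, t, s) in edges if t == i]
    incoming += [s for (t, s) in repairs if t == i]
    return total[i] in incoming


def all_nodes(edges, inputs, mu):
    return sorted({n for (j, t, _) in edges for n in (j, t)} | set(mu) | set(inputs))


def is_consistent_wp(edges, inputs, mu, repairs):
    nodes = all_nodes(edges, inputs, mu)
    for values in product((-1, 0, 1), repeat=len(nodes)):
        total = dict(zip(nodes, values))
        if all(node_ok(edges, inputs, mu, repairs, total, i) for i in nodes):
            return True
    return False


def minimal_correction_sets(edges, inputs, mu):
    """All smallest sets of added signed influences that restore WP consistency."""
    candidates = [(i, s) for i in all_nodes(edges, inputs, mu)
                  if i not in inputs for s in (1, -1)]
    for size in range(len(candidates) + 1):
        found = [set(r) for r in combinations(candidates, size)
                 if is_consistent_wp(edges, inputs, mu, r)]
        if found:
            return found
    return []


# Hypothetical example: A activates B, yet A increases while B decreases.
edges, inputs, mu = {("A", "B", 1)}, {"A"}, {"A": "+", "B": "-"}
mcos = minimal_correction_sets(edges, inputs, mu)
inconsistency_index = len(mcos[0]) / len(mu)
```

The single minimal repair adds a negative influence on B, giving an inconsistency index of one correction per two observations.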
Prediction under minimal repair
Due to the capability of repairing, the sign consistency approach enables prediction even if model and data are mutually inconsistent. Predictions under minimal repair are obtained from the identification of consequences shared by all consistent labelings under all possible minimal repairs. Note that this approach, although it confines itself to minimal repairs following the law of parsimony, does not favor any of the possible minimal repairs but only considers a statement a prediction if it holds under every minimal repair.
Software
The different consistency notions as well as the methods for consistency checking and quantification, prediction, and all data and network repair operations were implemented in an open-source application, iggy [24]. iggy uses ASP [23] as its logical modeling and constraint solving paradigm. It is part of the BioASP software collection and can easily be installed via the Python Package Index (PyPI). ASP is used to model problems from NP and provides state-of-the-art solvers. In particular, we use the solver clasp [25] via the pyasp [26] package. On an AMD Opteron 6168 1.9 GHz with 96 GB RAM, given a network with 1646 nodes and 4277 edges, our software needs ≈20 min to compute the predictions under minimal repair (MCoS) for the unmeasured species of 105 experiment data sets, each containing 1392 measurements. For further information visit http://bioasp.github.io/iggy.
Results and discussion
To investigate the suitability of the different consistency notions, we used the gene regulatory network of Escherichia coli and confronted it with microarray data. The network was obtained from RegulonDB [27], version 8.3, in October 2013, and we focused on its biggest weakly connected component, which is composed of 1646 nodes and 4277 edges and covers 94 % of the nodes of the full RegulonDB network. Unsigned edges are treated as two parallel edges with opposite signs. The data refers to the microarray log ratio expression of 3607 genes measured under 240 different stress conditions in E. coli published in [28]. We chose 105 of the 240 experiments which can be interpreted as steady-state shift experiments and 1392 of the 3607 genes which occur in the RegulonDB network. Since the input nodes for the stress condition experiments are unknown, we simply defined all nodes without predecessors as inputs.
The GEO/GSE codes for the used experiments are listed in the Additional file 1. The microarray data was discretized as described in the Methods Section using the typical thresholds: t _{1}=−2, t _{2}=−0.01, t _{3}=0.01, t _{4}=2, to generate the constraints that restrict the labeling μ for the nodes measured in the experimental profile.
To evaluate the influence of the minimal correction sets (MCoS) and to investigate the suitability of the different consistency notions to predict the behavior of unobserved entities in a regulatory network, we performed a cross-validation using the E. coli data.
Quality of regulatory network when confronted to the expression profiles
As a first step, we assess the quality of network and data by comparing it to randomized data. We generated 100 randomized datasets for each real experiment by shuffling the observed signs among the observed nodes while preserving the sign distribution for each dataset. We then computed for real and randomized data the inconsistency index, which is defined as the number of minimal corrections (MCoS) needed to restore consistency (under notion FSP) divided by the number of observations in the experiment. Then we applied the Wilcoxon signed-rank test to assess whether the population mean ranks of the two samples differ. The obtained p-value of 2.0497e−11 indicates a highly significant difference between real and randomized data, suggesting that the real data are more (sign) consistent with the network topology than random data. Figure 8 shows the inconsistency index for real and randomized data for each experiment. We can see that the real E. coli dataset exhibits a significantly lower inconsistency index than the randomized data.
Figure 9 shows the distribution of the measured signs in the experimental data, revealing that the data tends to be less consistent if more +/− measurements are contained.
Predictions under the different consistency notions
To investigate the suitability of the different consistency notions to predict the behavior of unobserved entities in a regulatory network, we performed a cross-validation using the E. coli data. While other validation methods exist, we decided to use cross-validation as a model validation technique because it allows us to assess how the results of the approach will generalize to independent datasets. To set up cross-validation, we created for each experiment 100 samples, each containing a random 10 % share of the measurements. We then confronted the E. coli network with these samples, determined the minimal corrections necessary to restore consistency, and computed the predictions that hold under all minimal correction sets.
In Table 2, one can see the distribution of the +, −, 0 and weak predictions as well as how the precision of the different types of predictions varies among the different notions (WP is similar to FWP; see Additional file 1). With the different consistency notions we were on average able to compute behavior predictions for up to 69 % of the remaining nodes in the network, for which no measurement was given. One can observe that the share of nodes with predictions increases drastically with notion SP and even further with FSP, mainly through an increased prediction of 0-change behaviors.
The different types of predictions contain different amounts of information. A weak prediction gives less information than a strong prediction because it discards only one out of three possible labels. Hence, the 69 % of nodes with prediction does not equal 69 % of information gained. Therefore, we also computed the information gain given by these predictions. For n unconstrained nodes, for which no measurements are taken into account, 3^{n} possible behaviors exist. For k nodes with strong predictions the possible behaviors can be restricted to just 1, for l nodes with weak predictions 2^{l} possible behaviors remain, and for m nodes without predictions 3^{m} possible behaviors remain, where n=k+l+m. The overall information gain can then be expressed as \((\log(3^{n}) - \log(1^{k} \cdot 2^{l} \cdot 3^{m}))/\log(3^{n})\). In our experiments we observed an average information gain of up to 61 % for the nodes for which no measurements had been taken into account. For more information on how to compute the information gain we refer to the Additional file 1.
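As a worked illustration of the information gain, the sketch below assumes that the remaining behaviors of the individual nodes combine multiplicatively, so that 1^k · 2^l · 3^m joint behaviors remain out of 3^n with n = k + l + m. This combination rule and the function name `information_gain` are assumptions of the example; the exact procedure is described in Additional file 1.

```python
import math


def information_gain(k: int, l: int, m: int) -> float:
    """Relative reduction of log(#behaviors) for k strong predictions,
    l weak predictions and m nodes without prediction.

    Assumes the remaining joint behaviors number 1**k * 2**l * 3**m.
    """
    n = k + l + m
    if n == 0:
        raise ValueError("at least one node is required")
    remaining = l * math.log(2) + m * math.log(3)  # log(1**k) == 0
    return 1.0 - remaining / (n * math.log(3))


# Hypothetical split of 100 unmeasured nodes.
gain = information_gain(k=56, l=13, m=31)
```

With 69 of 100 nodes predicted (56 strongly, 13 weakly), the gain under this assumption is about 0.61, which shows how the share of predicted nodes and the information gain differ.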
To validate the quality of the predictions (obtained from 10 % of the data), we compared them with the validation data (the remaining 90 % of measurements). For the nodes where both a prediction and validation data were available, we compared the two. We obtained on average precisions that range from 73 % to 80 %. Overall, SP and FSP allow us to make predictions for a much bigger part of the network, resulting in a much higher information gain with only a slightly decreased precision, and for + and − predictions with a significantly higher precision than FWP. In Section 5 of the Additional file 1 we plot the detailed recovery and precisions per experiment for notions FWP and FSP.
To test the influence of the number of measurements on recovery rate and precision, we also created datasets with 50 % and 75 % of the measurements (see Additional file 1: Table S3). Compared to the results with 10 %, the overall recovery rate increases up to 82 % (FSP). This is due to the fact that the increased amount of data helps to put more constraints on the system's behavior. For notions SP and FSP, the number of weak predictions drops slightly because many of them become strong predictions. The precision of +, − and weak predictions benefits from the richer datasets under notions SP and FSP, while the precision of 0-change decreases only slightly.
Weak predictions easily have higher precisions, because they have a bigger chance to be true positives. To validate that the precisions obtained in our test case are indeed meaningful, we tested our approach on a randomized dataset. We could verify that the predictions from randomized data have less precision than the predictions obtained from the real data (see Additional file 1: Table S3), especially for notions SP and FSP. Accordingly, the p-values shown in Table 2 indicate a high significance that the predictions made by SP and, even more pronounced, by FSP are better than random.
These results show that the strong-propagation notions (SP and FSP) are the most pertinent choice to explain gene expression shifts within the E. coli transcriptional network. Using FSP we predict with high precision that 53 % to 72 % of the network remains unaltered (0-change). Understanding the differentially expressed network regions becomes more delicate, since the precision remains on average 54 % to 62 % which, however, is still significantly higher than for notion FWP. Nevertheless, 48 % of the experiments had a precision above 75 % for up- or down-regulation (strong) predictions when considering a dataset with 50 % of the measurements. Note that the notion of precision changes its conclusiveness when applied to incompletely determined predictions. Thus, we use confusion matrices as an alternative representation to illustrate the performance of our prediction method. Here one can see that for uncertain observations, relatively few strong predictions are confused (see Fig. 10). Therefore, wrong predictions may be related to the choice of the discretization thresholds and to the fact that a single threshold was chosen for all genes.
Conclusion
We presented a unified framework to express different notions of sign consistency on interaction graphs. A refined methodology for data discretization into five values allows the consideration of uncertainties in experimental profiles. Within this framework we introduced a new constraint to filter undesired self-fulfilled regulations that result from positive feedback loops. Finally, our extended prediction method considers not only strong (unique value) but additionally weak (multiple admissible values) predictions, enlarging the predictive power of the approach.
We evaluated our framework by confronting the full RegulonDB network with 105 experimental gene-expression profiles. Our cross-validation results, obtained when choosing 10 % of the initial dataset, show that the overall precision of the methods ranges from 72 % to 80 %. The precision obtained with the FSP notion is the most statistically significant, as judged by its p-value. With its increased precision and recovery, FSP appears to be the superior notion.
We expect that the information gain is in general higher for datasets from (typically smaller) signaling networks (see e.g. [17]). This might be due to the fact that in the stress experiments considered here the (perturbed) inputs of the gene regulatory network were unknown, which imposes fewer constraints than in signaling networks with normally well-defined signal inputs (given by the applied ligands, inhibitors etc.).
Our method requires a careful selection of discretization thresholds. Therefore, we performed a detailed sensitivity analysis over a wide range of discretization thresholds (see Additional file 1: Section 4). The analysis shows that the results (precision, information gain) are only weakly sensitive to the chosen thresholds. We also discuss further aspects of threshold selection in Additional file 1.
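To make the role of the thresholds concrete, the sketch below discretizes a log expression ratio into five values using two symmetric thresholds. The mapping rule, the label names `notPlus` and `notMinus`, the threshold values and the function `discretize` are assumptions chosen for illustration; the rule actually used is specified in the methods and in Additional file 1.

```python
def discretize(log_ratio, lower=0.5, upper=1.0):
    """Map a log ratio to one of five labels using two assumed thresholds.

    Values beyond `upper` are treated as certain changes, values between
    `lower` and `upper` as uncertain, and values within `lower` as unchanged.
    """
    if not 0 <= lower <= upper:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper")
    if log_ratio > upper:
        return "+"
    if log_ratio < -upper:
        return "-"
    if log_ratio > lower:
        return "notMinus"  # probably up, certainly not down
    if log_ratio < -lower:
        return "notPlus"   # probably down, certainly not up
    return "0"


# Hypothetical log ratios for five genes.
ratios = [1.8, 0.7, 0.1, -0.6, -2.3]

default = [discretize(r) for r in ratios]
stricter = [discretize(r, lower=0.8, upper=2.0) for r in ratios]

print(default)   # ['+', 'notMinus', '0', 'notPlus', '-']
print(stricter)  # ['notMinus', '0', '0', '0', '-']
```

Re-running the discretization with a different threshold pair, as in the second call, is the basic step of a threshold sensitivity analysis: the downstream consistency check and predictions are then recomputed for each resulting labeling.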
There is a relationship between the concept of sign consistency and the dependency matrix (discussed in more detail in [17]). The notion of the dependency matrix was originally introduced in [4] and has been used in several studies for checking consistency between signaling network topologies and experimental data from stimulus-response experiments (e.g., [5, 29]). In fact, the dependency matrix can be seen as another sign consistency notion which is more relaxed than SP or FSP (which might still be useful, e.g. when analyzing transient instead of steady-state responses). Since additional propagation rules are straightforward to implement in the framework presented herein, other sign consistency notions, including the dependency matrix or those that pose different constraints for 0-changes, could be considered as well. Overall, our work enhances the flexibility and power of the sign consistency approach for the prediction of the behavior of signaling and gene regulatory networks and, more generally, for the validation and inference of these networks.
Abbreviations
IG: Influence graph
ILP: Integer linear programming
ASP: Answer set programming
WP: Weak propagation
SP: Strong propagation
FWP: Founded weak propagation
FSP: Founded strong propagation
References
1. Catlett NL, Bargnesi AJ, Ungerer S, Seagaran T, Ladd W, Elliston KO, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinforma. 2013; 14:340.
2. Guziołowski C, Bourde A, Moreews F, Siegel A. BioQuali Cytoscape plugin: analysing the global consistency of regulatory networks. BMC Genomics. 2009; 10(1):244. doi:http://dx.doi.org/10.1186/1471-2164-10-244.
3. Gebser M, Schaub T, Thiele S, Veber P. Detecting inconsistencies in large biological networks with answer set programming. Theory Prac Logic Program. 2011; 11(2–3):323–60.
4. Klamt S, Saez-Rodriguez J, Lindquist J, Simeoni L, Gilles E. A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinforma. 2006; 7(1):56. doi:http://dx.doi.org/10.1186/1471-2105-7-56.
5. Samaga R, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Klamt S. The Logic of EGFR/ErbB Signaling: Theoretical Properties and Analysis of High-Throughput Data. PLoS Comput Biol. 2009; 5(8):1000438. doi:http://dx.doi.org/10.1371/journal.pcbi.1000438.
6. Thieffry D. Dynamical roles of biological regulatory circuits. Brief. Bioinforma. 2007; 8(4):220–5. doi:http://dx.doi.org/10.1093/bib/bbm028, http://bib.oxfordjournals.org/content/8/4/220.full.pdf+html.
7. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science. 2005; 308(5721):523–9. doi:http://dx.doi.org/10.1126/science.1105809.
8. Morris MK, Saez-Rodriguez J, Sorger PK, Lauffenburger DA. Logic-based models for the analysis of cell signaling networks. Biochemistry. 2010; 49(15):3216–24.
9. Wang RS, Saadatpour A, Albert R. Boolean modeling in systems biology: an overview of methodology and applications. Phys Biol. 2012; 9(5):055001.
10. Schoeberl B, Eichler-Jonsson C, Gilles ED, Müller G. Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat Biotechnol. 2002; 20(4):370–5.
11. Quach M, Brunel N, d’Alché-Buc F. Estimating parameters and hidden variables in non-linear state-space models based on ODEs for biological networks inference. Bioinforma. 2007; 23(23):3209–16.
12. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008; 9(10):770–80.
13. Ideker TE, Thorsson V, Karp RM. Discovery of Regulatory Interactions Through Perturbation: Inference and Experimental Design. In: Proceedings of the Pacific Symposium on Biocomputing. Seattle, USA: World Scientific Press: 2000.
14. Saez-Rodriguez J, Alexopoulos LG, Epperlein J, Samaga R, Lauffenburger DA, Klamt S, et al. Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol. 2009; 5(1):331.
15. Sharan R, Karp R. Reconstructing Boolean models of signaling. In: Chor B, editor. Research in Computational Molecular Biology. Lecture Notes in Computer Science. Springer: 2012. p. 261–71. doi:http://dx.doi.org/10.1007/978-3-642-29627-7_28.
16. Terfve C, Cokelaer T, Henriques D, MacNamara A, Goncalves E, Morris M, et al. CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol. 2012; 6(1):133. doi:http://dx.doi.org/10.1186/1752-0509-6-133.
17. Melas IN, Samaga R, Alexopoulos LG, Klamt S. Detecting and Removing Inconsistencies between Experimental Data and Signaling Network Topologies Using Integer Linear Programming on Interaction Graphs. PLoS Comput Biol. 2013; 9(9):1003204. doi:http://dx.doi.org/10.1371/journal.pcbi.1003204.
18. Videla S, Guziołowski C, Eduati F, Thiele S, Gebser M, Nicolas J, et al. Learning Boolean logic models of signaling networks with ASP. Theoretical Computer Science. 2015; 599:79–101. Advances in Computational Methods in Systems Biology. doi:http://dx.doi.org/10.1016/j.tcs.2014.06.022, http://www.sciencedirect.com/science/article/pii/S0304397514004587.
19. Radde N, Bar NS, Banaji M. Graphical methods for analysing feedback in biological networks - a survey. Int J Syst Sci. 2010; 41(1):35–46. doi:http://dx.doi.org/10.1080/00207720903151326.
20. Samaga R, Klamt S. Modeling approaches for qualitative and semi-quantitative analysis of cellular signaling networks. Cell Commun Signal. 2013; 11(1):43. doi:http://dx.doi.org/10.1186/1478-811X-11-43.
21. Kuipers B. Qualitative reasoning: Modeling and simulation with incomplete knowledge. Automatica. 1989; 25(4):571–85. doi:http://dx.doi.org/10.1016/0005-1098(89)90099-X.
22. Gebser M, Guziołowski C, Ivanchev M, Schaub T, Siegel A, Thiele S, et al. Repair and prediction (under inconsistency) in large biological networks with answer set programming. In: Lin F, Sattler U, Truszczynski M, editors. Proceedings of the Twelfth International Conference on the Principles of Knowledge Representation and Reasoning (KR’10). Menlo Park, CA: AAAI Press: 2010.
23. Baral C. Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge: Cambridge University Press; 2003.
24. Thiele S. Iggy1.2: A tool for consistency based analysis of influence graphs and observed systems behavior. zenodo.org. 2015. doi:http://dx.doi.org/10.5281/zenodo.19042.
25. Gebser M, Kaminski R, Kaufmann B, Ostrowski M, Schaub T, Thiele S. A User’s Guide to gringo, clasp, clingo, and iclingo. 2010. http://potassco.sourceforge.net. Accessed 10 Oct 2015.
26. Thiele S. PyASP 1.4.1 - A convenience wrapper for the ASP tools gringo, gringo4 and clasp. 2015. doi:http://dx.doi.org/10.5281/zenodo.22968.
27. Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, et al. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 2006; 34(Database issue):394–7.
28. Sangurdekar DP, Srienc F, Khodursky AB. A classification based framework for quantitative description of large-scale microarray data. Genome Biol. 2006; 7(4):32.
29. Ryll A, Samaga R, Schaper F, Alexopoulos LG, Klamt S. Large-scale network models of IL-1 and IL-6 signalling and their hepatocellular specification. Mol BioSyst. 2011; 7:3253–70. doi:http://dx.doi.org/10.1039/C1MB05261F.
Acknowledgements
This work was funded in part by the German Federal Ministry of Education and Research within the “Virtual Liver Network” (grant 0315744) and “JAKSys” (grant 0316167B).
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
ST, SK, AS, and CG conceived and supervised the study. LC and JSR contributed to the investigation of different sign consistency notions. ST implemented iggy. ST and CG calculated results for the E. coli case study. All authors discussed results of data analysis. ST, CG, and SK drafted the manuscript. All authors read and approved the final manuscript.
Additional file
Additional file 1
Supplementary. Contains the following supplementary material: Explanation of SCENFIT. Explanation of uncertain observations. Information gain by predictions in the sign consistency approach. Sensitivity analysis - Choosing the thresholds for discretization. Recovery and precision for E. coli cross-validation experiments. GEO/GSE codes for the experiments used. (PDF 1218 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Thiele, S., Cerone, L., Saez-Rodriguez, J. et al. Extended notions of sign consistency to relate experimental data to signaling and regulatory network topologies. BMC Bioinformatics 16, 345 (2015). https://doi.org/10.1186/s12859-015-0733-7
DOI: https://doi.org/10.1186/s12859-015-0733-7
Keywords
 E. coli
 Gene regulation
 Interaction graphs
 Sign consistency
 Uncertainty
 Logic modeling
 Answer Set Programming (ASP)