
Extended notions of sign consistency to relate experimental data to signaling and regulatory network topologies

Abstract

Background

A rapidly growing amount of knowledge about signaling and gene regulatory networks is available in databases such as KEGG, Reactome, or RegulonDB. There is an increasing need to relate this knowledge to high-throughput data in order to (in)validate network topologies or to decide which interactions are present or inactive in a given cell type under a particular environmental condition. Interaction graphs provide a suitable representation of cellular networks with information flows, and methods based on sign consistency have been shown to be valuable tools (i) to predict qualitative responses, (ii) to test the consistency of network topologies and experimental data, and (iii) to apply repair operations to the network model, suggesting missing or wrong interactions.

Results

We present a framework to unify different notions of sign consistency and propose a refined method for data discretization that considers uncertainties in experimental profiles. We furthermore introduce a new constraint to filter undesired model behaviors induced by positive feedback loops. Finally, we generalize the way predictions can be made by the sign consistency approach. In particular, we distinguish strong predictions (e.g., increase of a node level) from weak predictions (e.g., node level increases or remains unchanged), enlarging the overall predictive power of the approach. We then demonstrate the applicability of our framework by confronting a large-scale gene regulatory network model of Escherichia coli with high-throughput transcriptomic measurements.

Conclusion

Overall, our work enhances the flexibility and power of the sign consistency approach for the prediction of the behavior of signaling and gene regulatory networks and, more generally, for the validation and inference of these networks.

Background

Advances in measurement technologies and high-throughput methods in molecular biology have led to a tremendous increase in the availability of factual biological knowledge as well as of data capturing the response of biological systems to experimental conditions. Knowledge about metabolic, signaling, and gene regulatory interactions and networks is available in databases such as KEGG, RegulonDB, PID, or Reactome, which can be used as a starting point to build causal models of bio-molecular networks [1]. Specifically, signaling and gene regulatory networks carrying signal and information flows can be represented as interaction (or influence) graphs [2-6], Bayesian networks [7], some form of logic (including Boolean or constrained fuzzy logic) modeling [4, 8, 9], or ordinary differential equations [10-12]. However, there is an increasing need to relate large-scale network models to high-throughput data in order to (in)validate network topologies or to decide which regulatory or signaling interactions are present in a particular biological system, cell type, environmental condition etc.

Significant work has been published on this subject, attempting to detect inconsistencies between measured high-throughput data and signaling and regulatory networks and to subsequently identify missing or inactive interactions such that the optimized network structure maximizes consistency with experimental data [2, 4, 13-18]. Some of these approaches use signed directed graphs, also called interaction or influence graphs (IG), as the underlying model, where edges indicate either a positive or a negative effect of one node upon another. Although these models are qualitative and simple, they have frequently been used to study signal flows in a wide range of biological systems. Moreover, the fact that every Boolean and every ODE model has an underlying interaction graph renders their analysis directly relevant for other modeling formalisms, and it has been shown that some important global properties of Boolean or ODE models are determined by the structure of their associated IG [6, 19, 20]. IGs have also been used for qualitative reasoning, to describe physical systems where a detailed quantitative description is unavailable [21]. In fact, this has been one motivation for using IGs in the context of biological systems [20], where knowledge and data are usually uncertain.

One important class of methods relating IGs with experimental data is based on the notion of sign consistency. The key idea here is to represent the potential network behaviors resulting from steady-state shift experiments (such as upregulation or downregulation of node activation levels after network perturbations) by certain kinds of discrete constraints. A first approach based on sign consistency was introduced in [2]. There, experimentally measured changes in node activities were represented by two labels (increase, decrease) on the IG nodes. Constraints relating node labels and IG are introduced to model the propagation of regulatory effects. Later, in [3, 22], Answer Set Programming (ASP) [23] was used to find admissible node labelings adhering to the posed constraints, and optimal repairs to restore sign consistency were proposed. A related formalism was presented in [17]. Major differences to previous studies were (i) the consideration of three node labels (increase, decrease, 0-change), (ii) the representation of the constraints as an integer linear programming (ILP) problem, and (iii) the introduction of new repair operations minimizing inconsistencies between the IG structure and the experiments.

The goal of this study is fourfold. First, we aim at unifying existing approaches into a general framework. We show that different notions of sign consistency mainly differ in the way zero changes are modeled. Second, we propose a refined method for data discretization that allows one to express uncertainties during the discretization step. Third, we introduce a new constraint to filter undesired self-fulfilled explanations which result from positive feedback loops. Finally, we introduce an extended prediction method that allows not only strong (e.g., "increase") but also weak predictions (e.g., "increase or 0-change"), enlarging the predictive power of the approach. We apply the extended framework to a realistic case study in which we analyze high-throughput transcriptomic measurements of Escherichia coli in the context of a large-scale gene regulatory network model obtained from RegulonDB. Taken together, we demonstrate that these extensions increase the applicability and flexibility of the approach significantly.

Methods

Definitions

An influence or interaction graph (IG) is a signed directed graph (V,E,σ), where V is a set of nodes, E a set of edges, and σ : E → {+,−} a labeling of the edges. Every node in V represents a species of the modeled system, and an edge j→i means that the change of j in time influences the level of i. Every edge j→i of an IG carries a sign σ(j,i) ∈ {+,−}, where + (−) indicates that j tends to increase (decrease) i. An example IG is given in Fig. 1.

Fig. 1 Interaction graph with a positive feedback loop between E and F

In this framework, we confront the IG with experimental profiles. The experimental profiles are assumed to come from steady-state shift experiments where, initially, the system is at steady state, is then externally perturbed in certain nodes, and eventually settles into another steady state. For a subset S ⊆ V of species (genes, proteins, or metabolites), concentrations are measured in the initial and final state. The raw data is given by a real value obs(s) for every measured species s ∈ S, specifying the difference of the node's state between the initial and the new steady state. As defined below, we determine for these nodes whether the concentration has increased, decreased, or not significantly changed.

Data discretization

We propose a refined method to discretize the measurements using four (condition-dependent) thresholds t_1 ≤ t_2 < 0 < t_3 ≤ t_4, allowing one to consider uncertainties in the discretization process. As illustrated in Fig. 2, these thresholds define a mapping \(\mu : S \rightarrow \{{-}, \triangledown, 0, \vartriangle, + \}\) as follows:

$$\mu(s)=\begin{cases} {-} & \text{if } obs(s) \leq t_{1}, \\ \triangledown & \text{if } t_{1} < obs(s) \leq t_{2},\\ 0 & \text{if } t_{2} < obs(s) < t_{3}, \\ \vartriangle & \text{if } t_{3} \leq obs(s) < t_{4},\\ + & \text{if } t_{4} \leq obs(s). \end{cases}$$
Fig. 2 Discretization of observed changes into sign constraints

We consider measurements which are smaller than t_1, larger than t_4, or between t_2 and t_3 as certain (decrease −, increase +, no-change 0), while measurements that lie between t_1 and t_2 (resp. t_3 and t_4) are uncertain (uncertain decrease \(\triangledown \), uncertain increase \(\vartriangle \)) and not exactly classifiable. With that, an experimental profile (S,I,μ) is defined by the set of measured species S, the set of input nodes I ⊆ S (the experimentally perturbed species) whose changes are trivially explained, and the mapping μ as defined above.
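For illustration, this discretization can be sketched in a few lines of Python (our own illustration, not part of iggy; the five labels are encoded here as the strings "-", "v", "0", "^", "+"):

```python
def discretize(obs, t1, t2, t3, t4):
    """Map a raw measurement obs(s) to one of the five labels
    {-, v, 0, ^, +}, where 'v' stands for uncertain decrease and
    '^' for uncertain increase."""
    assert t1 <= t2 < 0 < t3 <= t4   # condition-dependent thresholds
    if obs <= t1:
        return "-"                   # certain decrease
    elif obs <= t2:
        return "v"                   # uncertain decrease
    elif obs < t3:
        return "0"                   # certain no-change
    elif obs < t4:
        return "^"                   # uncertain increase
    else:
        return "+"                   # certain increase
```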

Local consistency rules

Given an IG (V,E,σ) and an experimental profile (S,I,μ), one can describe the rules that relate both. For this purpose we look for total labelings μ_t : V → {−,0,+} that satisfy the local constraints defined below. It is important to note that μ_t defines a total labeling using the three labels {−,0,+}, whereas μ defines a partial labeling (only measured nodes are labeled) based on the five labels \(\{ {-}, \triangledown, 0, \vartriangle, {+} \}\) representing the discretized measurements.

With the first constraint, we look for total labelings μ_t that satisfy the observed measurements captured in the partial node labeling μ:

Constraint 1 (satisfy observations).

Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ_t : V → {+,−,0} a total labeling, and let i ∈ V be a node with μ_t(i) ∈ {+,0,−}.

Then μ_t satisfies Constraint 1 for node i iff i ∉ S, or μ_t(i) = + and \(\mu (i)\in \{ {+}, \vartriangle \}\), or μ_t(i) = 0 and \(\mu (i)\in \{ \vartriangle, 0, \triangledown \}\), or μ_t(i) = − and \(\mu (i) \in \{ \triangledown, {-} \}\).

Note that uncertain measurements restrict the labeling of a node to two out of the three values {+,−,0}, while measurements with high certainty fix a node's label to exactly one value.
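Constraint 1 amounts to a lookup table of admissible total labels per measurement. A minimal Python sketch, continuing the label encoding above and representing a total labeling mu_t as a dict from nodes to −1/0/+1 and the partial labeling mu as a dict defined only on measured nodes:

```python
# Which total labels each discretized measurement admits (Constraint 1).
ALLOWED = {"+": {+1}, "^": {+1, 0}, "0": {0}, "v": {0, -1}, "-": {-1}}

def satisfies_c1(i, mu_t, mu):
    """Unmeasured nodes are unconstrained; measured nodes must carry
    a total label compatible with their measurement."""
    return i not in mu or mu_t[i] in ALLOWED[mu[i]]
```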

Next, we demand for every non-input node i that its change μ_t(i) be explained by the total influence of its predecessors in the IG. The influence of j on i is given by the product μ_t(j)·σ(j,i) ∈ {+,−,0}.

Constraint 2 (change must be justified by a change in a predecessor).

Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ_t : V → {+,−,0} a total labeling, and let i ∈ V∖I be a non-input node with μ_t(i) ∈ {+,−}.

Then μ_t satisfies Constraint 2 for node i iff there is some edge j→i in E such that μ_t(i) = μ_t(j)·σ(j,i).

Constraint 2 is consistent with the propagation rule used in [2, 3], which demands that increases and decreases must be explained by predecessor nodes while 0-changes are unconstrained, that is, 0-changes can always occur irrespective of the state of the predecessor nodes (note that 0-changes were not considered in [2, 3]). One argument for this reasoning is that it is often impossible to estimate the strength of the influences, and the thresholds at which a downstream effect occurs are unknown. Hence, we cannot guarantee that an influence really has an effect and therefore allow 0-changes. On the other hand, the constraint still enforces explanations for observed changes in node activation levels; each change must be explainable by an influence (with proper sign) of at least one predecessor.
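A sketch of the corresponding check, assuming the IG is encoded as a dict mapping edges (j, i) to signs +1/−1 and the set of inputs I as a set of node names:

```python
def satisfies_c2(i, mu_t, edges, inputs):
    """Constraint 2: a changed non-input node needs at least one
    predecessor whose signed influence mu_t(j) * sigma(j, i) equals
    its own label."""
    if i in inputs or mu_t[i] == 0:
        return True                  # 0-changes are unconstrained here
    return any(mu_t[j] * sign == mu_t[i]
               for (j, k), sign in edges.items() if k == i)
```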

Melas et al. [17] also suggested demanding proper explanations for 0-changes, using the following constraint:

Constraint 3 (0-change must be justified).

Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ_t : V → {+,−,0} a total labeling, and let i ∈ V∖I be a non-input node with μ_t(i) = 0. Then μ_t satisfies Constraint 3 for node i iff there is either no edge j→i in E such that μ_t(j)·σ(j,i) ∈ {+,−}, or there exist at least two edges j_1→i and j_2→i in E such that μ_t(j_1)·σ(j_1,i) + μ_t(j_2)·σ(j_2,i) = 0.

Constraint 3 restricts the occurrence of 0-changes. A node is only allowed to show a 0-change if it receives either no influence or contradictory influences. This constraint thus assumes that each influence indeed has an effect and that only contradictory influences can cancel each other out.
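Under the same encoding, Constraint 3 can be sketched as:

```python
def satisfies_c3(i, mu_t, edges, inputs):
    """Constraint 3: a non-input node labeled 0 must receive either
    no nonzero influence or two contradictory (cancelling) ones."""
    if i in inputs or mu_t[i] != 0:
        return True
    influences = {mu_t[j] * sign for (j, k), sign in edges.items()
                  if k == i and mu_t[j] * sign != 0}
    return not influences or influences == {+1, -1}
```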

In Fig. 3, we illustrate IGs with different labelings, where green stands for increase, red for decrease, and blue for 0-change. Notice that Constraint 2 intentionally allows situations like in labelings g and h, where D is labeled as 0-change even though its predecessor B shows an increase resp. a decrease. On the other hand, Constraint 2 forbids D to increase or decrease if all its predecessors are labeled as 0-change.

Fig. 3 IGs with different labelings where green stands for increase, red for decrease, and blue for 0-change. All labelings satisfy the basic Constraint 2 for node D, but only labelings a–d also satisfy Constraint 3. Examples with uncertain measurements are shown in the Additional file 1

From local to global reasoning

While there might exist several total labelings that satisfy the local constraints for some nodes, we are interested in checking global consistency, where a single total labeling satisfies the local constraints for all nodes. In Fig. 4, we illustrate an IG together with a partial labeling which is locally consistent but globally inconsistent. In other words, there exist two total labelings such that the local consistency rules (Constraints 1, 2 and 3) are satisfied for either A or B, but there exists no single total labeling that satisfies these constraints for all nodes.

Fig. 4 Example of an influence graph with a partial labeling which is locally consistent for A and B, but globally inconsistent because there exists no single total labeling satisfying Constraint 2 for both A and B

We use the previously defined constraints to define the following global consistency notions.

Consistency Notion 1 (weak propagation, WP).

We call an IG and an experimental profile (S,I,μ) consistent under weak propagation (WP) iff there exists a total labeling μ_t such that Constraints 1 and 2 are satisfied for all nodes.

Consistency Notion 2 (strong propagation, SP).

We call an IG and an experimental profile (S,I,μ) consistent under strong propagation (SP) iff there exists a total labeling μ_t such that Constraints 1, 2 and 3 are satisfied for all nodes.

Further, we introduce here a new global constraint to ensure that every node change is justified by a chain of influences that can be traced back to a (perturbed) input node. This natural constraint is especially useful to forbid the self-justification of changes via positive feedback loops (see Fig. 5).

Fig. 5 Example of an influence graph with a partial labeling which can be explained either by self-activation of B and C or by the input node A

Constraint 4 (a change must be founded in an input).

Let (V,E,σ) be an IG, (S,I,μ) an experimental profile, μ_t : V → {+,−,0} a total labeling, and i ∈ V a node with μ_t(i) ∈ {+,−}.

Then μ_t satisfies Constraint 4 for node i iff either i is an input node (i ∈ I), or there exists a path (v_0,…,v_k) in E with v_0 ∈ I, v_k = i, and μ_t(v_{n−1})·σ(v_{n−1},v_n) = μ_t(v_n) for all n = 1…k.

In Fig. 5, we illustrate an IG with a partial labeling (left) and two total labelings (middle and right) derived from the partial one. Both total labelings satisfy the local propagation rules (Constraints 2 and 3), but only the second total labeling satisfies the global propagation rule (Constraint 4). While the first labeling suggests a self-sustained increase in B and C as the explanation for the increase in D, the second labeling points to an increase in the input node A. Using Constraint 4, we can avoid the manual removal of positive feedback loops as done in previous studies [17].
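Constraint 4 can be checked by a simple fixpoint computation over the set of "founded" nodes, again under the encoding used above (our sketch, not iggy's actual ASP encoding):

```python
def satisfies_c4(mu_t, edges, inputs):
    """Constraint 4: every changed node must be reachable from an
    input along a chain of edges whose signed influences propagate
    the change."""
    founded = set(inputs)            # inputs are founded by definition
    grew = True
    while grew:                      # extend until fixpoint
        grew = False
        for (j, i), sign in edges.items():
            if (j in founded and i not in founded
                    and mu_t[i] != 0 and mu_t[j] * sign == mu_t[i]):
                founded.add(i)
                grew = True
    return all(mu_t[i] == 0 or i in founded for i in mu_t)
```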

We combine the new constraint with previously defined constraints into the following consistency notions.

Consistency Notion 3 (founded weak propagation, FWP).

We call an IG and an experimental profile (S,I,μ) consistent under founded weak propagation (FWP) iff there exists a total labeling μ_t such that Constraints 1, 2 and 4 are satisfied for all nodes.

Consistency Notion 4 (founded strong propagation, FSP).

We call an IG and an experimental profile (S,I,μ) consistent under founded strong propagation (FSP) iff there exists a total labeling μ_t such that Constraints 1, 2, 3, and 4 are satisfied for all nodes.

Consistency checking

We can now apply the previously defined consistency notions to enumerate consistent total labelings and to verify the consistency of network and observation data for a given experimental profile. We consider an IG consistent with an experimental profile (S,I,μ) if there exists at least one consistent total labeling (consistent with respect to the chosen notion WP, SP, FWP or FSP). Consider Fig. 6, which shows the total labelings of the IG in Fig. 1 consistent with an example experimental profile (A and D were increased, resulting in a measured 0-change in H) under the different consistency notions. Note that the notions become stricter, accepting fewer labelings as consistent and therefore excluding certain system behaviors. The set of admissible labelings under SP is a subset of the admissible labelings under WP, and the set of admissible labelings under FSP is a subset of the admissible labelings under SP. Further, one can see that Constraint 4 excludes all labelings where E and F decrease. This behavior does not satisfy Constraint 4, as it is only possible via mutual inhibition along the positive loop between E and F, which is not founded in an input.
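For toy examples such as Fig. 1, the consistent labelings can be enumerated by brute force, reusing the constraint checkers sketched above (iggy instead delegates this to an ASP solver, which scales to the network sizes used below):

```python
from itertools import product

def consistent_labelings(nodes, edges, inputs, mu, notion="FSP"):
    """Yield all total labelings consistent under the chosen notion
    (WP, SP, FWP or FSP); exhaustive over 3^|V| candidates."""
    for values in product((+1, 0, -1), repeat=len(nodes)):
        mu_t = dict(zip(nodes, values))
        if not all(satisfies_c1(i, mu_t, mu) and
                   satisfies_c2(i, mu_t, edges, inputs) for i in nodes):
            continue
        if notion in ("SP", "FSP") and not all(
                satisfies_c3(i, mu_t, edges, inputs) for i in nodes):
            continue
        if notion in ("FWP", "FSP") and not satisfies_c4(
                mu_t, edges, inputs):
            continue
        yield mu_t
```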

Fig. 6 Consistent total labelings of the example in Fig. 1 under the different consistency notions. The consistent labelings under all four consistency notions are fully displayed; a grey cell indicates that the labeling above is consistent, and a white cell with "." means that it is not a consistent labeling

Predictions under consistency

The consistency check of network and experimental data is the first analysis performed with the sign consistency approach. If network and data are consistent, the sign consistency approach can be used to predict the behavior of unmeasured entities in the network. This can also be used to predict the outcome of a planned experiment and, conversely, to plan an experiment that should result in a specific desired behavior. In the sign consistency approach, each consistent labeling represents an admissible behavior of the system. We call a statement that holds in all admissible behaviors under the given consistency notion a prediction. If parts of the system behave the same in all admissible behaviors, this can be predicted. We can predict the following types of behavior. We predict that a species increases + (resp. decreases −, does not change 0) if it increases (resp. decreases, does not change) in all admissible labelings. We call these strong predictions, because the possible behaviors of a species are reduced to exactly one. Further, we can predict that a species does not increase (resp. does not decrease, does change) if it does not increase (resp. does not decrease, does change) in all admissible labelings. Thus, we predict a weak increase ⊕ when a species does not decrease, but increases in at least one admissible behavior and does not change in another admissible behavior. Likewise, we predict a weak decrease ⊖ when a species does not increase, but decreases in at least one admissible behavior and does not change in another. Finally, we predict a change ± when a species always changes: it increases in at least one admissible behavior and decreases in another. We call ⊕, ⊖, and ± weak predictions because one possible behavior is excluded while one degree of freedom remains.

Formally, for the set V of nodes in our network and the set M of labelings consistent with our experimental profile, we define the prediction function pred : V → {+, −, 0, ⊕, ⊖, ±, no} as follows:

$$pred(x)=\begin{cases} + & \text{if } \forall \mu \in M: \mu(x)= +, \\ - & \text{if } \forall \mu \in M: \mu(x)= -,\\ 0 & \text{if } \forall \mu \in M: \mu(x)= 0, \\ \oplus & \text{if } \forall \mu \in M: \mu(x)\neq -,\ \exists \mu \in M: \mu(x)= +,\ \exists \mu \in M: \mu(x)= 0,\\ \ominus & \text{if } \forall \mu \in M: \mu(x)\neq +,\ \exists \mu \in M: \mu(x)= -,\ \exists \mu \in M: \mu(x)= 0,\\ \pm & \text{if } \forall \mu \in M: \mu(x)\neq 0,\ \exists \mu \in M: \mu(x)= +,\ \exists \mu \in M: \mu(x)= -,\\ \text{no} & \text{otherwise.} \end{cases}$$
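Operationally, the prediction for a node follows directly from the set of labels it takes across all consistent labelings; a sketch on top of consistent_labelings() above:

```python
def pred(x, labelings):
    """Derive the prediction for node x from the set M of consistent
    labelings, mirroring the case distinction of pred() above."""
    seen = {mu_t[x] for mu_t in labelings}
    if seen == {+1}: return "+"
    if seen == {-1}: return "-"
    if seen == {0}:  return "0"
    if seen == {+1, 0}:  return "weak increase"    # ⊕
    if seen == {-1, 0}:  return "weak decrease"    # ⊖
    if seen == {+1, -1}: return "change"           # ±
    return "no prediction"                         # e.g., all three occur
```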

Recovery rate and precision

In Table 1, we show the predictions for the example given in Fig. 1. One can see that the more constrained consistency notions yield smaller sets of admissible labelings and a higher recovery rate (i.e., for how many unmeasured species predictions can be obtained). In the systematic comparison of the consistency notions based on real experimental data, we consider not only the recovery rate but also the prediction precision (true positives/(true positives + false positives)). A strong prediction (+/−/0) is a true positive if there is a certain measurement with the same value (+/−/0). A weak prediction ⊕ (resp. ⊖ and ±) is a true positive if there is a certain measurement + or 0 (resp. − or 0, and + or −). Conversely, a prediction is a false positive if the corresponding certain measurement contradicts it (e.g., a ⊖ prediction with a measurement +).
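This validation rule can be stated as a small membership test (a sketch using the prediction strings from pred() above):

```python
def is_true_positive(prediction, certain_measurement):
    """A prediction is a true positive iff the held-out certain
    measurement ('+', '-' or '0') is among the values it admits."""
    admits = {"+": {"+"}, "-": {"-"}, "0": {"0"},
              "weak increase": {"+", "0"},     # ⊕
              "weak decrease": {"-", "0"},     # ⊖
              "change": {"+", "-"}}            # ±
    return certain_measurement in admits[prediction]
```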

Table 1 Predictions for the example in Fig. 1 derived from the admissible behaviors in Fig. 6

Repairing inconsistent networks and data

If network and data are inconsistent, the natural question is how to repair them, that is, how to modify network and/or data in order to re-establish their mutual consistency. A major challenge lies in the range of possible repair operations, since an inconsistency can be explained by missing interactions or inaccurate information in a network as well as by measurement errors. The sign consistency approach can be used to determine a set of repair operations that are suitable to restore consistency. Typically, many suitable repair operations are possible, in particular if multiple repair operations are admitted. However, one is usually only interested in repairs that make few changes to the model and/or data. These minimal repair sets can not only be used for hypothesis generation (e.g., which data might be questionable or which edges might be missing or inactive) but also as a quantitative measure of the fitness of model and data. Also note that once consistency is re-established, network and data can again be used for predicting the behavior of unmeasured entities.

In [17], four repair operations were introduced: two for single experiments (SCEN-FIT and Minimal Correction Sets (MCoS)) and two for multiple experiments (OPT-SUBGRAPH and OPT-GRAPH). The latter two are computationally more demanding as they seek to optimize the whole network structure based on many perturbation experiments. SCEN-FIT, explained in detail in the Additional file 1, seeks a consistent node labeling that is closest to the given measurements and can thus help to identify inconsistencies between network and dataset. Herein we focus on MCoS and thus deal with the analysis of single experiments. This is motivated by our application example where we indeed have multiple experiments (105), but where the number of experiments is low compared to the number of nodes (1646) and edges (4277) in the network, precluding a meaningful optimization of the network structure. However, we note here that our extended notion FSP can be straightforwardly applied to these repair operations as well.

Minimal Correction Sets (MCoS)

To resolve inconsistencies, one may add new influences to the model if the latter is considered to be potentially incomplete (which is often the case in practice). Adding an influence can be used to indicate missing (unknown) regulations or oscillations of regulators that would explain the (topology-inconsistent) measurements. We use minimal correction sets (MCoS) as defined in [17]: minimal sets of new signed (positive or negative) input influences that restore consistency of model and data. MCoS are defined as signed influences and are specific to a single experiment; they might be incompatible with other experiments. Note that every inconsistency can be repaired by adding a new influence; therefore, adding influences is always suited to restore consistency. The size of the MCoS can also be interpreted as a measure of the consistency of model and data. Compared to SCEN-FIT, MCoS always yields a smaller or equal number of repairs. We therefore define the inconsistency index of a network with respect to data as the number of corrections in an MCoS divided by the number of observations in the experiment. Figure 7 illustrates how repair through the addition of influences works.
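To make the search space concrete, here is a brute-force sketch of MCoS under our toy encoding (iggy casts this as an ASP optimization problem; exhaustive search is only feasible for very small graphs). Each repair candidate adds a fresh, freely labeled input node with a signed edge onto a target node:

```python
from itertools import combinations

def minimal_correction_sets(nodes, edges, inputs, mu, notion="FSP"):
    """Search, by increasing cardinality, for sets of added signed
    input influences under which a consistent labeling exists."""
    candidates = [(i, s) for i in nodes for s in (+1, -1)]
    for size in range(len(candidates) + 1):
        mcos = []
        for repair in combinations(candidates, size):
            ext_nodes = list(nodes)
            ext_edges = dict(edges)
            ext_inputs = set(inputs)
            for target, sign in repair:
                aux = ("aux", target, sign)    # fresh perturbed input
                ext_nodes.append(aux)
                ext_inputs.add(aux)
                ext_edges[(aux, target)] = sign
            if next(consistent_labelings(ext_nodes, ext_edges,
                                         ext_inputs, mu, notion),
                    None) is not None:
                mcos.append(repair)
        if mcos:
            return size, mcos    # size of the MCoS, all minimal sets
```

The inconsistency index is then this size divided by the number of observations in the experiment.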

Fig. 7 Example of repair by adding signed influences (Minimal Correction Sets, MCoS). There exist three alternative repair sets: repair set a adds a positive influence on A, repair set b a negative influence on B, and repair set c both a positive influence on A and a negative influence on B. Repair sets a and b are minimal, containing only one repair each; repair set c is not minimal, having two repairs. Looking at the intersection of the labelings under the minimal repairs, we can conclude that C is responsible either for an increase in A or for a decrease in B. We can therefore exclude a labeling of C with 0 and predict pred(C) = ±

Prediction under minimal repair

Due to its repair capability, the sign consistency approach enables prediction even if model and data are mutually inconsistent. Predictions under minimal repair are obtained from the identification of consequences shared by all consistent labelings under all possible minimal repairs. Note that this approach, although it confines itself to minimal repairs following the law of parsimony, does not favor any particular minimal repair but only considers a statement a prediction if it holds under every minimal repair.

Software

The different consistency notions as well as the methods for consistency checking and quantification, prediction, and all data and network repair operations were implemented in the open-source application iggy [24]. iggy uses ASP [23] as its logical modeling and constraint solving paradigm; it is part of the BioASP software collection and can easily be installed via the Python Package Index (PyPI). ASP is used to model problems from NP and provides state-of-the-art solvers. In particular, we use the solver clasp [25] via the pyasp [26] package. On an AMD Opteron 6168 (1.9 GHz, 96 GB RAM), given a network with 1646 nodes and 4277 edges, our software needs ≈20 min to compute the predictions under minimal repair (MCoS) for the unmeasured species of 105 experimental datasets, each containing 1392 measurements. For further information visit http://bioasp.github.io/iggy.

Results and discussion

To investigate the suitability of the different consistency notions, we used the gene regulatory network of Escherichia coli and confronted it with microarray data. The network was obtained from RegulonDB [27] (version 8.3, October 2013); we focused on its largest weakly connected component, which comprises 1646 nodes and 4277 edges and covers 94 % of the nodes of the full RegulonDB network. Unsigned edges are treated as two parallel edges with opposite signs. The data are the microarray log-ratio expression values of 3607 genes measured under 240 different stress conditions in E. coli, published in [28]. We chose the 105 of the 240 experiments which can be interpreted as steady-state shift experiments and the 1392 of the 3607 genes which occur in the RegulonDB network. Since the input nodes for the stress condition experiments are unknown, we simply defined all nodes without predecessors as inputs.

The GEO/GSE codes for the experiments used are listed in the Additional file 1. The microarray data was discretized as described in the Methods section using the typical thresholds t_1 = −2, t_2 = −0.01, t_3 = 0.01, t_4 = 2 to generate the constraints that restrict the labeling μ for the nodes measured in the experimental profile.

To evaluate the influence of the minimal correction sets (MCoS) and to investigate the suitability of the different consistency notions to predict the behavior of unobserved entities in a regulatory network, we performed a cross-validation using the E. coli data.

Quality of regulatory network when confronted to the expression profiles

As a first step, we assess the quality of network and data by comparison with randomized data. We generated 100 randomized datasets for each real experiment by shuffling the observed signs among the observed nodes, preserving the sign distribution of each dataset. We then computed, for real and randomized data, the inconsistency index, defined as the number of minimal corrections (MCoS) needed to restore consistency (under notion FSP) divided by the number of observations in the experiment. We applied the Wilcoxon signed-rank test to assess whether the population means of the two samples differ. The obtained p-value of 2.0497e-11 indicates a highly significant difference between real and randomized data, suggesting that the real data are more (sign-)consistent with the network topology than random data. Figure 8 shows the inconsistency index for real and randomized data for each experiment. We can see that the real E. coli dataset exhibits a significantly lower inconsistency index than the randomized data.
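Such a paired comparison can be carried out, for instance, with SciPy's signed-rank test; a sketch with hypothetical index values standing in for the actual per-experiment inconsistency indices:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the 105 per-experiment inconsistency
# indices (real data vs. mean over 100 sign-shuffled replicates).
real_idx = rng.uniform(0.00, 0.05, size=105)
shuffled_mean_idx = real_idx + rng.uniform(0.01, 0.10, size=105)

stat, p_value = wilcoxon(real_idx, shuffled_mean_idx)  # paired test
print(f"Wilcoxon signed-rank p-value: {p_value:.3e}")
```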

Fig. 8 Inconsistency index of the 105 experiments. The x-axis shows the inconsistency index of the original experiments and the y-axis the average inconsistency index of the shuffled experiments. The error bars indicate the standard deviation of the inconsistency index among the 100 shuffled samples

Figure 9 shows the distribution of the measured signs in the experimental data, revealing that the data tend to be less consistent the more +/− observations they contain.

Fig. 9 Distribution of observed signs in the experimental data. The x-axis shows the 105 experiments ordered by their inconsistency index (black line), becoming less consistent from left to right. The left y-axis quantifies the cumulative percentage of nodes in the network that were measured. The signs are represented by the following colors: − (red), \(\triangledown \) (purple), 0 (blue), \(\vartriangle \) (turquoise), and + (green)

Predictions under the different consistency notions

To investigate the suitability of the different consistency notions to predict the behavior of unobserved entities in a regulatory network, we performed a cross-validation using the E. coli data. While other validation methods exist, we chose cross-validation as a model validation technique because it allows us to assess how the results of the approach generalize to independent datasets. To set up the cross-validation, we created for each experiment 100 samples, each containing a random 10 % share of the measurements. We then confronted the E. coli network with these samples, determined the minimal corrections necessary to restore consistency, and computed the predictions that hold under all minimal correction sets.

Table 2 shows the distribution of the +, −, 0 and weak predictions as well as how the precision of the different types of predictions varies among the different notions (WP behaves similarly to FWP; see Additional file 1). With the different consistency notions we were able to compute, on average, behavior predictions for up to 69 % of the remaining nodes in the network for which no measurement was given. One can observe that the share of nodes with predictions increases drastically with notion SP and even further with FSP, mainly through an increased prediction of 0-change behaviors.

Table 2 Average % of unobserved nodes that have predictions of a particular behavior, their information gain, and the precision of these predictions under the different notions when giving 10 % or 50 % of the E. coli expression measurements as input. In the last major column ("all predictions") the rows "% of unobserved nodes" quantify the overall recovery rates

The different types of predictions carry different amounts of information. A weak prediction gives less information than a strong prediction because it discards only one out of three possible labels. Hence, predictions for 69 % of the nodes do not equal 69 % of information gained. Therefore, we also computed the information gain provided by these predictions. For n unconstrained nodes, for which no measurements are taken into account, 3^n possible behaviors exist. If k of these nodes receive strong predictions, their possible behaviors are restricted to exactly one; for the l nodes with weak predictions, 2^l possible behaviors remain; and for the m nodes without predictions, 3^m possible behaviors remain. The overall information gain can then be expressed as (log(3^n) − log(1^k · 2^l · 3^m)) / log(3^n). In our experiments we observed an average information gain of up to 61 % for the nodes for which no measurements had been taken into account. For more information on how to compute the information gain we refer to the Additional file 1.
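In code, the formula is a one-liner (our sketch; the example counts are hypothetical):

```python
import math

def information_gain(k, l, m):
    """Information gain over n = k + l + m unmeasured nodes with k
    strong, l weak, and m absent predictions."""
    n = k + l + m
    log_total = n * math.log(3)                     # log(3^n)
    log_left = l * math.log(2) + m * math.log(3)    # log(1^k 2^l 3^m)
    return (log_total - log_left) / log_total

# Example: 1000 unmeasured nodes, 400 strong and 300 weak predictions.
print(information_gain(400, 300, 300))   # ~0.51
```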

To validate the quality of the predictions (obtained from 10 % of the data), we compared them with the validation data (the remaining 90 % of measurements). For the nodes where both a prediction and validation data were available, we compared the two. We obtained average precisions ranging from 73 % to 80 %. Overall, SP and FSP allow us to make predictions for a much bigger part of the network, resulting in a much higher information gain with only slightly decreased precision, and, for + and − predictions, with a significantly higher precision than FWP. In Section 5 of the Additional file 1 we plot the detailed recovery rates and precisions per experiment for notions FWP and FSP.

To test the influence of the number of measurements on recovery rate and precision, we also created datasets with 50 % and 75 % of the measurements (see Additional file 1: Table S3). Compared to the results with 10 %, the overall recovery rate increases up to 82 % (FSP). This is due to the fact that the increased amount of data puts more constraints on the system's behavior. For notions SP and FSP the number of weak predictions drops slightly because many of them become strong predictions. The precision of +, − and weak predictions benefits from the richer datasets under notions SP and FSP, while the precision of 0-change predictions decreases only slightly.

Weak predictions naturally tend to have higher precisions because they have a greater chance of being true positives. To validate that the precisions obtained in our test case are indeed meaningful, we also tested our approach on a randomized dataset. We could verify that the predictions from randomized data have lower precision than the predictions obtained from the real data (see Additional file 1: Table S3), especially for notions SP and FSP. Accordingly, the p-values shown in Table 2 indicate with high significance that the predictions made by SP and, even more pronounced, by FSP are better than random.

These results show that the strong-propagation notions (SP and FSP) are the most pertinent choice to explain gene expression shifts within the E. coli transcriptional network. Using FSP we predict with high precision that 53 % to 72 % of the network remains unaltered (0-change). Understanding the differentially expressed network regions is more delicate, since the precision there remains at 54 % to 62 % on average, which, however, is still significantly higher than for notion FWP. Nevertheless, 48 % of the experiments had a precision above 75 % for up- or down-regulation (strong) predictions when considering a dataset with 50 % of the measurements. Note that the notion of precision changes its conclusiveness when applied to incompletely determined predictions. We therefore use confusion matrices as an alternative representation to illustrate the performance of our prediction method. Here one can see that, for uncertain observations, relatively few strong predictions are confused (see Fig. 10). Wrong predictions may thus be related to the choice of the discretization thresholds and to the fact that a single set of thresholds was chosen for all genes.

Fig. 10 Confusion matrices for predicted behaviors under each notion given 10 % (left) or 50 % (right) of the data. The columns indicate the predicted behavior and the rows the measured behavior. Given are the average numbers of nodes predicted to change as indicated by the column and measured in the experiment as indicated by the row, relative to the total number of measured predictions. Predictions consistent with the measurement are shown in green, confusions in red. Gray fields denote either predictions that could not be verified because no validation data was available, or nodes for which validation data was available but no prediction had been made

Conclusion

We presented a unified framework to express different notions of sign consistency on interaction graphs. A refined methodology for data discretization into five values allows the consideration of uncertainties in experimental profiles. Within this framework we introduced a new constraint to filter undesired self-fulfilled regulations that result from positive feedback loops. Finally, our extended prediction method considers not only strong (unique value) but additionally weak (multiple admissible values) predictions, enlarging the predictive power of the approach.

We evaluated our framework by confronting the full RegulonDB network with 105 experimental gene-expression profiles. Our cross-validation results obtained when choosing 10 % of the initial dataset show that the overall precision of the methods ranges from 72 % to 80 %. Moreover, the precision of the FSP notion is of much higher significance (lower p-value) when compared against randomized data. With its increased precision and recovery, FSP appears to be the superior notion.

We expect that the information gain is in general higher for datasets from (typically smaller) signaling networks (see, e.g., [17]). This might be due to the fact that in the stress experiments considered here the (perturbed) inputs of the gene regulatory network were unknown, which poses fewer constraints than in signaling networks, where signal inputs are normally well defined (given by the applied ligands, inhibitors, etc.).

Our method requires a careful selection of discretization thresholds. We therefore performed a detailed sensitivity analysis over a wide range of discretization thresholds (see Additional file 1: Section 4). The analysis shows that the results (precision, information gain) are relatively insensitive to the chosen thresholds. We also discuss further aspects of threshold selection in the Additional file 1.

There is a relationship between the concept of sign consistency and the dependency matrix (discussed in more detail in [17]). The notion of the dependency matrix was originally introduced in [4] and has been used in several studies for checking consistency between signaling network topologies and experimental data from stimulus-response experiments (e.g., [5, 29]). In fact, the dependency matrix can be seen as another sign consistency notion which is more relaxed than SP or FSP (which might still be useful, e.g., when analyzing transient instead of steady-state responses). Since additional propagation rules are straightforward to implement in the framework presented herein, other sign consistency notions, including the dependency matrix or those that pose different constraints for 0-changes, could be considered as well. Overall, our work enhances the flexibility and power of the sign consistency approach for the prediction of the behavior of signaling and gene regulatory networks and, more generally, for the validation and inference of these networks.

Abbreviations

IG: Influence graph

ILP: Integer linear programming

ASP: Answer set programming

WP: Weak propagation

SP: Strong propagation

FWP: Founded weak propagation

FSP: Founded strong propagation

References

  1. Catlett NL, Bargnesi AJ, Ungerer S, Seagaran T, Ladd W, Elliston KO, et al. Reverse causal reasoning: applying qualitative causal knowledge to the interpretation of high-throughput data. BMC Bioinformatics. 2013; 14:340.


  2. Guziołowski C, Bourde A, Moreews F, Siegel A. BioQuali Cytoscape plugin: analysing the global consistency of regulatory networks. BMC Genomics. 2009; 10(1):244. doi:http://dx.doi.org/10.1186/1471-2164-10-244.


  3. Gebser M, Schaub T, Thiele S, Veber P. Detecting inconsistencies in large biological networks with answer set programming. Theory Prac Logic Program. 2011; 11(2–3):323–60.


  4. Klamt S, Saez-Rodriguez J, Lindquist J, Simeoni L, Gilles E. A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinformatics. 2006; 7(1):56. doi:http://dx.doi.org/10.1186/1471-2105-7-56.


  5. Samaga R, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Klamt S. The Logic of EGFR/ErbB Signaling: Theoretical Properties and Analysis of High-Throughput Data. PLoS Comput Biol. 2009; 5(8):1000438. doi:http://dx.doi.org/10.1371/journal.pcbi.1000438.


  6. Thieffry D. Dynamical roles of biological regulatory circuits. Brief Bioinform. 2007; 8(4):220–5. doi:http://dx.doi.org/10.1093/bib/bbm028.

  7. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science. 2005; 308(5721):523–9. doi:http://dx.doi.org/10.1126/science.1105809.


  8. Morris MK, Saez-Rodriguez J, Sorger PK, Lauffenburger DA. Logic-based models for the analysis of cell signaling networks. Biochemistry. 2010; 49(15):3216–24.


  9. Wang RS, Saadatpour A, Albert R. Boolean modeling in systems biology: an overview of methodology and applications. Phys Biol. 2012; 9(5):055001.


  10. Schoeberl B, Eichler-Jonsson C, Gilles ED, Müller G. Computational modeling of the dynamics of the map kinase cascade activated by surface and internalized egf receptors. Nat Biotechnol. 2002; 20(4):370–5.


  11. Quach M, Brunel N, d’Alché-Buc F. Estimating parameters and hidden variables in non-linear state-space models based on ODEs for biological networks inference. Bioinformatics. 2007; 23(23):3209–16.


  12. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008; 9(10):770–80.


  13. Ideker TE, Thorsson V, Karp RM. Discovery of Regulatory Interactions Through Perturbation: Inference and Experimental Design. In: Proceedings of the Pacific Symposium on Biocomputing. Seattle, USA: World Scientific Press: 2000.


  14. Saez-Rodriguez J, Alexopoulos LG, Epperlein J, Samaga R, Lauffenburger DA, Klamt S, et al. Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol. 2009; 5(1):331.


  15. Sharan R, Karp R. Reconstructing boolean models of signaling. In: Chor B, editor. Research in Computational Molecular Biology. Lecture Notes in Computer Science. Springer: 2012. p. 261–71. doi:http://dx.doi.org/10.1007/978-3-642-29627-7_28.

  16. Terfve C, Cokelaer T, Henriques D, MacNamara A, Goncalves E, Morris M, et al. Cellnoptr: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol. 2012; 6(1):133. doi:http://dx.doi.org/10.1186/1752-0509-6-133.


  17. Melas IN, Samaga R, Alexopoulos LG, Klamt S. Detecting and Removing Inconsistencies between Experimental Data and Signaling Network Topologies Using Integer Linear Programming on Interaction Graphs. PLoS Comput Biol. 2013; 9(9):1003204. doi:http://dx.doi.org/10.1371/journal.pcbi.1003204.


  18. Videla S, Guziołowski C, Eduati F, Thiele S, Gebser M, Nicolas J, et al. Learning Boolean logic models of signaling networks with ASP. Theoretical Computer Science. 2015; 599:79–101. Advances in Computational Methods in Systems Biology, doi:http://dx.doi.org/10.1016/j.tcs.2014.06.022, http://www.sciencedirect.com/science/article/pii/S0304397514004587.

  19. Radde N, Bar NS, Banaji M. Graphical methods for analysing feedback in biological networks - a survey. Int J Syst Sci. 2010; 41(1):35–46. doi:http://dx.doi.org/10.1080/00207720903151326.


  20. Samaga R, Klamt S. Modeling approaches for qualitative and semi-quantitative analysis of cellular signaling networks. Cell Commun Signal. 2013; 11(1):43. doi:http://dx.doi.org/10.1186/1478-811X-11-43.


  21. Kuipers B. Qualitative reasoning: Modeling and simulation with incomplete knowledge. Automatica. 1989; 25(4):571–85. doi:http://dx.doi.org/10.1016/0005-1098(89)90099-X.


  22. Gebser M, Guziołowski C, Ivanchev M, Schaub T, Siegel A, Thiele S, et al. Repair and prediction (under inconsistency) in large biological networks with answer set programming. In: Lin F, Sattler U, Truszczynski M, editors. Proceedings of the Twelfth International Conference on the Principles of Knowledge Representation and Reasoning (KR’10). Menlo Park, CA: AAAI Press: 2010.


  23. Baral C. Knowledge Representation, Reasoning and Declarative Problem Solving. Cambridge: Cambridge University Press; 2003.


  24. Thiele S. Iggy-1.2: A tool for consistency based analysis of influence graphs and observed systems behavior. zenodo.org. 2015. doi:http://dx.doi.org/10.5281/zenodo.19042.

  25. Gebser M, Kaminski R, Kaufmann B, Ostrowski M, Schaub T, Thiele S. A User’s Guide to gringo, clasp, clingo, and iclingo. 2010. http://potassco.sourceforge.net. Accessed 10 Oct 2015.

  26. Thiele S. PyASP 1.4.1 - A convenience wrapper for the ASP tools gringo, gringo4 and clasp. 2015. doi:http://dx.doi.org/10.5281/zenodo.22968.

  27. Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-Solano F, Santos-Zavaleta A, et al. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 2006; 34(Database issue):394–7.


  28. Sangurdekar DP, Srienc F, Khodursky AB. A classification based framework for quantitative description of large-scale microarray data. Genome Biol. 2006; 7(4):32.


  29. Ryll A, Samaga R, Schaper F, Alexopoulos LG, Klamt S. Large-scale network models of il-1 and il-6 signalling and their hepatocellular specification. Mol BioSyst. 2011; 7:3253–270. doi:http://dx.doi.org/10.1039/C1MB05261F.



Acknowledgements

This work was funded in part by the German Federal Ministry of Education and Research within the “Virtual Liver Network” (grant 0315744) and “JAK-Sys” (grant 0316167B).

Author information


Corresponding authors

Correspondence to Carito Guziołowski or Steffen Klamt.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

ST, SK, AS, and CG conceived and supervised the study. LC and JSR contributed to the investigation of different sign consistency notions. ST implemented iggy. ST and CG calculated results for the E. coli case study. All authors discussed results of data analysis. ST, CG, and SK drafted the manuscript. All authors read and approved the final manuscript.

Additional file

Additional file 1

Supplementary material. Contains the following: explanation of SCEN-FIT; explanation of uncertain observations; information gain by predictions in the sign consistency approach; sensitivity analysis (choosing the thresholds for discretization); recovery and precision for the E. coli cross-validation experiments; GEO/GSE codes for the experiments used. (PDF 1218 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Thiele, S., Cerone, L., Saez-Rodriguez, J. et al. Extended notions of sign consistency to relate experimental data to signaling and regulatory network topologies. BMC Bioinformatics 16, 345 (2015). https://doi.org/10.1186/s12859-015-0733-7
