Deciphering transcriptional regulations coordinating the response to environmental changes

Background Gene co-expression evidenced as a response to environmental changes has shown that transcriptional activity is coordinated, which pinpoints the role of transcriptional regulatory networks (TRNs). Nevertheless, the prediction of TRNs based on the affinity of transcription factors (TFs) with binding sites (BSs) generally produces an over-estimation of the observable TF/BS relations within the network and therefore many of the predicted relations are spurious. Results We present Lombarde, a bioinformatics method that extracts from a TRN determined from a set of predicted TF/BS affinities a subnetwork explaining a given set of observed co-expressions by choosing the TFs and BSs most likely to be involved in the co-regulation. Lombarde solves an optimization problem which selects confident paths within a given TRN that join a putative common regulator with two co-expressed genes via regulatory cascades. To evaluate the method, we used public data of Escherichia coli to produce a regulatory network that explained almost all observed co-expressions while using only 19 % of the input TF/BS affinities but including about 66 % of the independent experimentally validated regulations in the input data. When all known validated TF/BS affinities were integrated into the input data the precision of Lombarde increased significantly. The topological characteristics of the subnetwork that was obtained were similar to the characteristics described for known validated TRNs. Conclusions Lombarde provides a useful modeling scheme for deciphering the regulatory mechanisms that underlie the phenotypic responses of an organism to environmental challenges. The method can become a reliable tool for further research on genome-scale transcriptional regulation studies. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0885-0) contains supplementary material, which is available to authorized users.

gene expression, emphasizing the complexity of the regulation as part of the acclimation process.
In general, classic methods of in silico reconstruction of transcriptional regulation processes consider expression profiles and genomic sequences separately. The most commonly used strategies for identifying co-expressed genes consider linear correlation [1] or mutual information methods like ARACNe [2], CLR [3], and MRNET [4]. Some of these methods have been successful in identifying regulatory interactions in synthetic networks and in model organisms [5]. Nonetheless, the interdependence of the expression profiles of two genes does not necessarily mean that there is a physical interaction. Also, the computed correlations are not oriented and thus cannot be interpreted causally. Even so, they convey information about the transcriptional mechanisms.
On the other hand, different in silico approaches have been developed for the study of genomic sequences. These approaches attempt to model the physical interactions that form a putative transcriptional regulatory network (TRN) and they rely on the identification of pairs of genes where the first gene codes for a transcription factor (TF) that potentially binds in the promoter region of the second gene [6,7]. Genes coding for TFs are typically obtained by homology between the genome sequence and a database of TFs (such as RegulonDB [8] or Prodoric [9]). In these databases, each TF has an associated position weighted matrix (PWM), which estimates the affinity between the TF and a potential binding site (BS) in a promoter region. The low specificity of current methods for identifying transcriptional regulations means that the number of TF/BS affinities found is usually huge, while, in practice, only a few of them correspond to observed regulatory interactions. Nevertheless, even if many of the predicted interactions have a low probability of occurrence, or are never observed, it is reasonable to assume that the TRN reconstructed from them contains most of the physical interactions that occur for a given process and therefore the network should be able to explain co-expression of genes. Moreover, the p-value of the affinity computed for a given TF/BS interaction provides information that can be interpreted as a likelihood of the occurrence of this interaction.
Under this assumption, the problem we address here is: given a putative TRN constructed from TF/BS affinities and given a set of co-expressed gene pairs, how can the most probable set of interactions that the organism uses to coordinate the gene expression changes be determined. In other words, we want to find a simple and confident subnetwork of a putative TRN that is able to explain a given set of co-expressions.
To define what we mean by a subnetwork explaining a pair of co-expressed genes, we consider all the topological configurations within the network that allow for a coordinate change. The simplest case is when a pair of coexpressed genes also corresponds to a TF/BS interaction in the putative TRN. Clearly a direct interaction, represented in the TRN as a single arc between the genes, is a possible explanation for the co-expression, indicating that one gene is regulating the expression of the other gene. Another possibility is that the co-expression of the two genes is correlated not by a single regulatory interaction but by a chain of TF/BS interactions, i.e. a regulatory cascade. Such a cascade is represented in the TRN as a directed path from one gene to the other. Thus, a path connecting two correlated genes is also a possible explanation for the co-expression. Finally, the correlation may be the result of a third gene that simultaneously regulates two co-expressed genes. In such a situation, the two correlated genes will be at the ends of two regulatory cascades, both of which start with this common regulator gene (see right side of Fig. 1). We define any of these configurations as an explanation for the correlation.
To identify a simple and confident subnetwork explaining a set of co-regulations, we propose LOMBARDE, an optimization strategy that extracts from a putative TRN the most simple and reliable TF/BS interactions that explain a given set of co-regulated genes. The LOMBARDE method also accepts as input an additional independent list of experimentally validated transcriptional regulations.
The precision of LOMBARDE clearly depends on our assumption that the initial putative TRN includes most of the observed TF/BS interactions. When this assumption is satisfied, at least one explanation for each real co-expression should be assured. Interestingly, we found that a putative TRN constructed for Escherichia coli using a classical bioinformatic pipeline to produce TF/BS affinity pairs explained 91.1 % of nearly 60,000 observed co-expressions (see Results and discussion for details). When LOMBARDE was applied to this putative E. coli TRN and the set of observed co-expressions it produced a subnetwork that conserved only 19.2 % of the initial interaction arcs, while still explaining 91.1 % of the co-expressions. LOMBARDE has a strong bias towards preserving experimentally validated regulations. It preserved over 66 % of a set of independent experimentally validated arcs in the putative E. coli TRN and kept only 18.4 % of non-validated interactions. Moreover, when LOMBARDE was applied to the same putative TRN extended by adding all independent experimentally validated arcs, the resulting subnetwork retained 92 % of the validated arcs and included only 11.3 % of the other putative regulations. In addition, the subnetworks produced by LOMBARDE showed credible topological characteristics and recovered most of the global regulators described in the E. coli literature. The regulators were also ranked correctly in relation to their role in the network. We concluded that the modeling scheme proposed in LOMBARDE is a reliable strategy for deciphering the transcriptional regulatory interactions that can explain the co-expressions observed under environmental changes.

Materials and methods
Given a putative TRN computed from TF/BS affinities and a set of co-expressed gene pairs, the main idea of LOMBARDE is to extract from this network a simple and confident subnetwork that contains an explanation for each co-expression in the given set. Here we present a description of the proposed model and the methods used. We also compare the optimization strategy defined to model simple and confident subnetworks with alternative optimization strategies.

Input of LOMBARDE
If G is the set of genes in a studied organism, then LOMBARDE requires the following inputs:  [9]. We assume that pairs with high p-values would already have been discarded from A. 3. Validated pairs: Optionally, a set V ⊆ G × G of gene pairs that correspond to independent experimentally validated regulations, if available.
LOMBARDE was initially intended to be applied in Prokarya where all the genes of a given operon are transcribed typically in a single polycistronic mRNA molecule; therefore, we assumed that the expression of a gene implies the expression of the operon to which it belongs. Given the specific operon structure of the studied organism, E. coli, we can consider G as the set of operons and C, A and V as sets of pairs of operons, which simplifies the analysis and reduces the running time of the method. This simplification is purely operational and can be applied at the user's discretion.

Defining the a priori graph G and explanations
Initially, LOMBARDE defines the a priori graph G as a directed graph where nodes correspond to genes G and directed arcs correspond to pairs of affinities in A and pairs of known regulations in V. That is, G = (G, A ∪ V). Thus, there will be a directed arc from gene A to gene B if there is some a priori evidence (experimental or theoretical, weak or strong) that A directly regulates B. If no validated regulations are available, then G = (G, A). It is important to note that a regulatory cascade, i.e. a sequence of regulatory relations between genes in G, should appear in this graph as a directed path (see the right side of Fig. 1), although clearly not every path will represent a real regulatory cascade. The final objective is to highlight paths that most likely correspond to real regulatory cascades controlling the co-expressed data.
Under this representation of a TRN, the observed coregulation of two genes in C can be explained by considering two cases. One, is the existence of a directed path from one gene to the other, meaning that the first gene is regulating the last gene through a regulatory cascade (direct regulation is considered as a regulatory cascade of size one). Two, is considering that none of the genes regulate the other, rather both are co-regulated by a third gene. Such a situation is represented in the a priori graph by two paths from a common regulator to each of the co-regulated gene (see right side of Fig. 1).

Definition 1. Given a pair (A, B) ∈ C of co-expressed genes, an explanation for (A, B) in G is a set of arcs E that satisfy any of the following conditions:
• E is a directed path from A to B; • E is a directed path from B to A; • E is the union of two divergent directed paths starting from a gene C and arriving, respectively, at A and B, which have only vertex C in common.

Definition 2.
We say that a subgraph G ⊆ G explains C if, for every pair (A, B) ∈ C, the subgraph G contains an explanation for (A, B).
Ideally, every pair (A, B) in C should have at least one explanation in G. If this is not the case, it indicates that, under the modeling hypothesis, A and B are not really co-regulated or that the methods used to compute the set A did not capture all the transcriptional mechanisms involved in the co-regulation of A and B. These unexplained pairs correspond to missing or inaccurate input data beyond the scope of the method and therefore are removed from C. After their removal we can assume that the a priori graph G explains C. However, as discussed earlier, many arcs in G represent TF/BS relations that have a low probability of occurrence or that have never been observed. Our objective, therefore, is to find a simple and confident subgraph G ⊆ G that explains every pair in C.

Cost definition
We considered two ways of defining simple and confident subgraphs of G that explain C: (i) foster explanations that use a small number of arcs and (2) foster explanations that have arcs of high affinity (i.e. low p-value). A way to consider both criteria simultaneously is to define costs on the arcs in such a way that the most likely TF/BS affinities have the lowest costs.
Instead of defining the cost of an arc in G as a continuous function of the p-value of the associated TF/BS, we implemented a qualitative approach by defining levels of likelihood. Indeed, because small variations in p-values have no real biological significance, we considered all arcs with similar p-values as equally likely. This approach increased the robustness of the method and prevented solutions that were slightly but not significantly different to the best ones from being discarded.
We defined the costs of the arcs in G using the following procedure where k ∈ N and r ∈ (0, ∞) are parameters. We defined k levels of likelihood, from level i = 0 (highest likelihood) to level i = k − 1 (lowest likelihood), in such a way that every level contains the same number of arcs (i.e. in equal-frequency bins). Arcs in V are assigned to the highest likelihood level (valued i = 0), because they correspond to already validated regulations. Finally, the cost of an arc in level i ∈ {0, . . . , k − 1} is set as r i . In this way r represents the incremental cost between consecutive levels; that is, an arc in level i has a cost that is equal to r arcs in level i − 1. We analyzed the use of different values for the parameters k and r on a real data sets (see the 'Analysis of cost parameters' subsection for details).

Optimal explanations and subgraphs
Having defined the cost of arcs, it is natural to define the cost of a subgraph as the sum of the costs of the arcs it contains. With this, we defined an optimal explanation: Definition 3. We define an explanation E for the pair (A, B) in C as optimal if it has the minimum cost among all the explanations for the pair.
Note that a pair (A, B) in C can have more than one optimal explanation (see Fig. 2), especially after the cost categorization where all arcs with similar p-values are considered as equally likely. Thus we define an optimal subgraph: Definition 4. We define a subgraph G ⊆ G as an optimal subgraph explaining C if G is the union of |C| (cardinality of C) optimal explanations, one for each pair of genes in C. That is, There could be a huge number of optimal subgraphs explaining C, because several optimal explanations for each pair (A, B) in C could exist. For example, if C contained 20 gene pairs, each one having two optimal explanations, then the number of optimal subgraphs explaining C could reach a million (when each possible union of optimal explanations produced a different subgraph).
Instead of enumerating all optimal subgraphs explaining C, which could be computationally infeasible, we computed a subgraph G L defined as the union of all optimal subgraphs explaining C. It is clearly not necessary to compute every optimal subgraph, but rather to compute for every (A, B) pair in C only the set of all optimal explanations for (A, B). Thus, the graph G L is obtained as, This graph is the output of LOMBARDE (see Fig. 2 for an example).

Analysis of alternative optimization problems
We proposed G L as a simple and confident subgraph of G to explain the co-expressions provided in C by including all the optimal explanations for each pair in C. This can be seen as a local optimization problem because the cost of explaining every pair in C independently is minimized. The main reason for choosing this local strategy was that other natural alternatives that consider solving global optimization problems are computationally infeasible [10].
For instance, consider the problem of computing the subgraph of minimum cost that explains C. This criterion can give a different result than G L (see Fig. 3 for an example). However, this approach is not robust because adding a new co-expression pair could completely change the global solution. Moreover, the best explanation that it provides for a given pair could have a very high cost compared with an optimal explanation. As well as having these undesirable properties, this problem is hard to solve. Indeed, it has been shown to be an NP-hard problem [10] by a reduction from the Steiner Weighted Directed Tree problem. This means that the problem can be solved exactly and in a reasonable time only for very small instances.
Another alternative optimization problem, which can be considered as a mixture between local and global optimization, is to compute a minimum cost subgraph that contains an optimal explanation for each pair in C; that is, the optimal subgraph explaining C with minimum global cost. Although this approach may seem more interesting than the previous alternative, it has some disadvantages. One disadvantage is the possible multiplicity of solutions because again there could be a large number of optimal subgraphs that have the same minimum global cost. An even worse disadvantage is that it has been proved by a reduction from the Minimum Hitting Set problem that finding just one solution is NP-hard [10] (although in practice this problem can be solved for larger instances). Besides the disadvantages, the results produced by this alternative optimization are not more interesting in practice than those given by LOMBARDE because they always correspond to subsets of G L . Thus, LOMBARDE gives not only the global optimal solution of the optimization problem but also provides alternative optimal explanations for each pair. Moreover, the solution given by LOMBARDE is always unique and computationally feasible.

Results and discussion
A large number of TF/BS associations have been validated experimentally for E. coli; therefore, we used public data for this bacteria to evaluate our model. The genome sequence and gene annotation of E. coli K12 [GenBank:NC_000913] were downloaded from the NCBI ftp site [11] (http://www.ncbi.nlm.nih.gov/ nuccore/NC_000913.3). Differential expression profiles for 907 conditions were obtained from Ecoli_v4_Build_ Fig. 3 Comparison between LOMBARDE output and an alternative minimum global cost explanations. Left: Graph with minimal cost that explains all pairs of co-expressions. Note that the explanation proposed for (G,H) has cost 18, which is much higher than the optimal explanation of cost 6. Center: LOMBARDE output all optimal explanations for each pair. Right: Minimum cost subgraph containing an optimal explanation for each pair. This graph is always contained in the LOMBARDE output 6_chips907probes4297.tab the Many Microbes Microarray Database [12] http://m3d.mssm.edu/norm/E_coli_v4_ Build_6.tar.gz.

Input data
Here, we describe briefly the sources and methods used to generate input data for LOMBARDE. As mentioned above, for bacteria we condense the information into operons to simplify computation. It is important to note that the methods we used to produce the input for LOMBARDE were merely a choice. Because data production was not our main concern, as far as possible, we used simple pipelines that are currently used by the community to build the putative TRNs.

C: Pairs of co-expressed operons: A set of 61,506
pairs of co-expressed operons was generated by analyzing the E.coli differential expression profiles as follows. A matrix with the mutual information between all pairs of the 4297 E. coli genes was computed using a parametric Gaussian density estimator with the minet [13] library of the R statistical package [14]. The matrix had over 18 million values, although most of them were either insignificant or redundant and were discarded using the MRNET [4] strategy. From the remaining pairs of genes we considered only the 100,000 that had the highest mutual information, i.e. about 5 % of the total pairs. This number was chosen to support the inclusion of the main co-expressions. Finally, two operons were considered to be co-expressed if each one contained a gene from a pair of co-expressed genes. After discarding redundant and trivial cases, we obtained a set of 61,506 pairs of co-expressed operons, involving 2492 different operons. A: Affinity network and its p-values: A set of 25,604 pairs of operons with high TF/BS affinity was produced as follows. Genes coding for TFs were identified by BLAST homology searches (E-value cutoff of 10 −10 ) between the gene product and a known instance of the TF in the Prodoric database [9] (http://www.prodoric.de). Then for each TF, a BS was considered each time a putative binding site in the upstream region (up to 300 bp) of a gene was located by MEME/FIMO [15] with a p-value smaller than 10 −5 . A pair of operons was reported if the first operon contained a gene coding for a TF with a putative BS in the upstream region of the second operon. The p-value for a pair was defined as the p-value of the affinity represented. (The minimum p-value was used if more than one TF in the first operon had affinity with the promoter region of the second operon. We used this data to build the a priori graph that is used by LOMBARDE. Then we simulated two possible application scenarios. To evaluate the results of LOMBARDE compared with the already known TRN, we defined the validated regulatory network as the graph G V that had only the arcs in V.

Explanatory potential of E.coli a priori graph
If we consider the validated regulatory network G V (composed with just the validated arcs in V), only 3990 coexpressions (6.5 % of C) were explained. The main reason for this low value is that the set of co-expressed pairs C involved 2492 different operons, while the network of validated regulations contained only 823 different operons. Thus, most of the operons in C were not contained in the network of validated regulations. Interestingly, among the 3990 explained co-expressions, only 83 were explained by a single validated arc, while the remainder were explained only through regulatory cascades. This result is consistent with the results in Sun et al. [17] and shows that the reconstruction of a TRN using expression data alone seems to be infeasible, and confirms the role of regulatory cascades.
On the other hand, when the ab initio scenarios (computed affinities in A) were considered we found that the a priori graph G A explained 56,044 of the pairs of coexpressed operons (i.e. 91.1 % of C). This number rose to 56,789 (92.3 % of C) in the extended case when validated arcs were included in the a priori graph G AV . This result reveals the explanatory potential of the a priori graphs G A and G AV .
Nevertheless, the huge number of affinities (more than 15 times the number of known regulations) is consistent with the evidence that many of the predicted TF/BS affinities are spurious and they are not part of the real regulatory processes that coordinate the gene co-expressions given as input, which is obtained from a particular set of experiments. The modeling principle of LOMBARDE is that we can choose the most confident subnetwork which explains the studied data.

LOMBARDE results are biased towards validated interactions
Considering the ab initio scenario, LOMBARDE was first applied to G A and the set C of observed co-expressions for E. coli. After setting the cost parameters k = 9 and r = 10 (an analysis of this choice is presented below), LOMBARDE produced a subnetwork with only 19.2 % of the initial arcs (4922 of 25,604), which still explained 91.1 % of the co-expressions. Interestingly, LOMBARDE showed a strong bias towards preserving independent experimentally validated regulations in V (see Table 1). Indeed, LOMBARDE preserved 66.4 % of the validated arcs in G A and kept only 18.4 % of the non-validated interactions. A hypergeometric test confirmed this bias, with an enrichment p-value under 10 −105 . For the extended scenario, this bias was even stronger when LOMBARDE was applied to the extended graph G AV (i.e., adding all validated regulations). The resulting subnetwork kept 92 % of the validated arcs (1520 of 1652) and included only 11.3 % of the non-validated putative regulations (2854 of 25,604). For a future work, it would be interesting to The a priori graphs explained most of the co-expressions. The LOMBARDE results kept most of the vertices, significantly reduced the number of arcs, and kept most of the validated arcs explore whether the 2,854 putative regulations contain real regulatory relations in E. coli which have not been experimentally validated yet. It should be noted that while LOMBARDE preferentially chooses arcs with low p-values, it also includes arcs with high p-values when they are required to explain a co-expression (see Fig. 4). Validated arcs are also biased towards lower p-values, although some do have high values. Thus methods based on only a p-value threshold will not recover all validated arcs and may not produce the largest networks.
We expected that LOMBARDE would recover many nonvalidated arcs because sets of validated regulations represent only the current knowledge, which may correspond to a very small portion of all the transcriptional regulations in an organism.

Degree distribution of LOMBARDE output are similar to observed TRNs
Some characteristics that were measured in the LOMBARDE results suggested that the networks were topologically closer to other observed TRNs than the a priori graphs G A and G AV . Indeed, the original average degree (number of interactions per operon) of G A was 10.7, which is much higher than the values of 1.5-2.0 suggested in the literature [18] for a TRN. The resulting average degree of LOMBARDE outputs was 2.1, which is much closer to the expected value. This value is also close to the average degree value of 2.0 for the existing network G V of validated regulations for E. coli. Furthermore, the degree distribution (proportion of operons for each degree) in LOMBARDE outputs was highly correlated with the degree distribution in the existing network of validated regulations, which indicates that they shared some structural properties, as shown in Fig. 5. In contrast, the degree distributions in G A and G AV were significantly different; therefore, their structures were different from the structure of the observed network of validated regulations.

Ranking of global regulators
The networks produced by LOMBARDE contained most of the global regulators that have been described for E. coli [19]. Starting from G A , the output of LOMBARDE included 16 of the 19 known global regulators for this bacteria.
To determine the vertices that correspond to global regulators in the LOMBARDE output, we ranked them based on the connection structure of the network. In particular, we considered the vertex radiality to be a centrality index that measured the capability of each vertex to reach other ones in the graph [20]. If d(u, v) represents the distance from u to v (unweighted length of the shortest path from u to v) and D is the diameter of the graph (D = max{d(ū,v) : We defined a vertex in LOMBARDE output to be a central regulator if its radiality index is ranked among the top 30 %. Ten of the known global regulators were among the central regulators in the LOMBARDE output (Table 2). In  The first two columns list E. coli gene names and their global rankings as described in the literature. The last two columns show the ranking of the operons that contain each of these genes using the radiality index in the networks obtained by applying LOMBARDE to G A and G AV . The numbers in bold indicate genes that were ranked among the top 30 % of radiality index and were considered central regulators contrast, G A had only seven global regulators among the central regulators.
When LOMBARDE was applied to G AV (which also has the validated regulations in V), 18 of the known global regulators were recovered in the output, 14 of them were central regulators. Therefore, LOMBARDE produced networks that gave a central role to most of the global regulators described in the literature.

LOMBARDE applied to a meaningful set of co-expressions
When the set of co-expressions C involved most of the genes in the organism, LOMBARDE produced a genomewide putative TRN. However, LOMBARDE also can be applied to any specific set of biologically meaningful genes to decipher the regulatory relationships between them. As an illustration of this in the ab initio scenario, we restricted the set C obtained for E. coli to a subset C built of eight pairs of operons that contained co-expressed genes. From the a priori graph G A , LOMBARDE produced the small putative regulatory network shown in Fig. 6. This network explained all the co-expressions using 30 regulations. In most cases, there was only one optimal explanation. For example, the gene fur codes for a transcription factor that regulates operons that contain metA, metF, metNIQ, pyrD, purEK, purC, and codBA through a regulatory cascade involving metJ and purC. In the other cases, there were several optimal explanations. For example, the co-expression of nohA-ydfN-tfaQ and clcB could be explained by cascades either from fur or from galS. Among the 30 regulations determined by LOMBARDE, 16 have been experimentally validated.

Robustness of results for different input data
Input data used to generate the a priori graph used by LOMBARDE can vary according to the bioinformatics methods used to obtain them. By analyzing different ways of obtaining input data we found that LOMBARDE was quite robust to many variations. We also checked that, when the input data incorporated more information, as expected, LOMBARDE produced better predictions.
For instance, the a priori graph depends on the source of the TF and BS patterns used to determine the set of TF/BS affinities. We compared the results of LOMBARDE applied to an a priori graph built using RegulonDB instead Fig. 6 TRN obtained for a reduced set of co-regulated genes C and the a priori graph G A . All the arcs were predicted by LOMBARDE. The arcs drawn in solid lines have been validated experimentally, although this information was not used in the prediction. The gene purC was co-expressed with metF, metA, and metQ; metN was co-expressed with purE and pyrD; metQ was co-expressed with codA; clcB was co-expressed with ydfN; and agaS was co-expressed with ugpA of the Prodoric database and found that LOMBARDE produced smaller graphs containing a greater proportion of validated arcs (Fig. 7, bar graph C). This is not surprising because Prodoric is based on several prokaryote organisms, while RegulonDB is based only on E. coli. However, the differences were not significant when validated regulations were included (Fig. 7, bar graph D).
The LOMBARDE results also were robust to the methods used to determine co-expressions. We evaluated LOM-BARDE using different sets of co-expressions as determined by ARACNe [2], C3NET, CLR [3], and MRNET [4]. The LOMBARDE results were similar for two of the methods. The exception was C3NET (see bar graph B in Fig. 7), which produced the smallest set of co-expressions so LOMBARDE produced fewer explanations.

Analysis of cost parameters
The cost of each arc in the a priori graph was chosen following two criteria: similar values are considered equivalent so that LOMBARDE results are robust to minor variations in p-value estimations; and cost should decrease as the arc p-value decreases so that arcs with higher likelihood have lower cost. The first criterion was fulfilled by classifying each arc into one of k bins and assigning a discrete cost based on the second criterion.
Without further restrictions the minimization algorithm will have to distinguish between alternative paths that have biological relevance: the regulatory cascade between two genes may be a short path with low confidence arcs or, alternatively, a long path with high confidence arcs. LOMBARDE will choose the longer path with high confidence arcs only when the path length is up to r times the length of the path with the low confidence arcs. Therefore the cost of an arc which belongs to the i-th bin (i ∈ {0, . . . , k − 1}) is r i .
To choose useful values of k and r we explored the conditions that are summarized in Table 3. The size of the LOMBARDE output decreased when either r or k increased, and also when the number of validated arcs increased. The ratio of validated arcs among all the arcs in the LOMBARDE output increased with increased k and r. This result suggested that choosing the parameters r = 10 and k = 9 would bias LOMBARDE to produce more confident networks.

Conclusions
Deciphering which regulatory interactions can provide a causal explanation for a set of observed co-expressions remains an important challenge in systems biology. We developed LOMBARDE, a modeling method that uses an optimization principle to determine a simple and confident set of regulations that serve as causal explanations for a given set of co-expressions. When the set of co-expressions involves genome-wide interactions, LOM-BARDE produces an explanatory putative TRN that has some basic topological characteristics close to observed TRNs and is biased to include regulations that have been independently validated experimentally. The LOMBARDE method was illustrated for a E. coli data set, where coexpressions were determined using mutual information under several environmental conditions. LOMBARDE was applied to an a priori graph considering TF/BS affinities recovered by BLAST and MEME/FIMO, and produced Fig. 7 Comparisons of LOMBARDE performance with different input data. Different methods and parameters were used to generate the input E. coli data used by LOMBARDE. The percentage of arcs kept from the a priori graph in the output is shown in relation to the validated arcs in V and the non-validated arcs in A. a The costs of the arcs in the a priori graph G A were computed for four values of the parameter k. b The set C was computed using four different methods to infer mutual information. c and d Two different databases were used to infer the set A and thus the a priori graphs G A and G AV , respectively, were generated