Deciphering transcriptional regulations coordinating the response to environmental changes
 Vicente Acuña†^{1, 2},
 Andrés Aravena†^{4},
 Carito Guziolowski^{5},
 Damien Eveillard^{6},
 Anne Siegel^{7} and
 Alejandro Maass^{1, 2, 3}Email author
https://doi.org/10.1186/s1285901608850
© Acuña et al. 2016
Received: 6 August 2015
Accepted: 8 January 2016
Published: 16 January 2016
Abstract
Background
Gene coexpression evidenced as a response to environmental changes has shown that transcriptional activity is coordinated, which pinpoints the role of transcriptional regulatory networks (TRNs). Nevertheless, the prediction of TRNs based on the affinity of transcription factors (TFs) with binding sites (BSs) generally produces an overestimation of the observable TF/BS relations within the network and therefore many of the predicted relations are spurious.
Results
We present Lombarde, a bioinformatics method that extracts from a TRN determined from a set of predicted TF/BS affinities a subnetwork explaining a given set of observed coexpressions by choosing the TFs and BSs most likely to be involved in the coregulation. Lombarde solves an optimization problem which selects confident paths within a given TRN that join a putative common regulator with two coexpressed genes via regulatory cascades. To evaluate the method, we used public data of Escherichia coli to produce a regulatory network that explained almost all observed coexpressions while using only 19 % of the input TF/BS affinities but including about 66 % of the independent experimentally validated regulations in the input data. When all known validated TF/BS affinities were integrated into the input data the precision of Lombarde increased significantly. The topological characteristics of the subnetwork that was obtained were similar to the characteristics described for known validated TRNs.
Conclusions
Lombarde provides a useful modeling scheme for deciphering the regulatory mechanisms that underlie the phenotypic responses of an organism to environmental challenges. The method can become a reliable tool for further research on genomescale transcriptional regulation studies.
Keywords
Transcriptional regulatory network Coexpression Combinatorial graphsBackground
Deciphering the mechanisms that explain the coordinated change in gene expression of an organism as a response to changes in the environment, is one of the fundamental challenges in systems biology. Moreover, highthroughput expression data have provided evidence that these mechanisms act in intriguing ways to coordinate gene expression, emphasizing the complexity of the regulation as part of the acclimation process.
In general, classic methods of in silico reconstruction of transcriptional regulation processes consider expression profiles and genomic sequences separately. The most commonly used strategies for identifying coexpressed genes consider linear correlation [1] or mutual information methods like ARACNe [2], CLR [3], and MRNET [4]. Some of these methods have been successful in identifying regulatory interactions in synthetic networks and in model organisms [5]. Nonetheless, the interdependence of the expression profiles of two genes does not necessarily mean that there is a physical interaction. Also, the computed correlations are not oriented and thus cannot be interpreted causally. Even so, they convey information about the transcriptional mechanisms.
On the other hand, different in silico approaches have been developed for the study of genomic sequences. These approaches attempt to model the physical interactions that form a putative transcriptional regulatory network (TRN) and they rely on the identification of pairs of genes where the first gene codes for a transcription factor (TF) that potentially binds in the promoter region of the second gene [6, 7]. Genes coding for TFs are typically obtained by homology between the genome sequence and a database of TFs (such as RegulonDB [8] or Prodoric [9]). In these databases, each TF has an associated position weighted matrix (PWM), which estimates the affinity between the TF and a potential binding site (BS) in a promoter region. The low specificity of current methods for identifying transcriptional regulations means that the number of TF/BS affinities found is usually huge, while, in practice, only a few of them correspond to observed regulatory interactions. Nevertheless, even if many of the predicted interactions have a low probability of occurrence, or are never observed, it is reasonable to assume that the TRN reconstructed from them contains most of the physical interactions that occur for a given process and therefore the network should be able to explain coexpression of genes. Moreover, the pvalue of the affinity computed for a given TF/BS interaction provides information that can be interpreted as a likelihood of the occurrence of this interaction.
Under this assumption, the problem we address here is: given a putative TRN constructed from TF/BS affinities and given a set of coexpressed gene pairs, how can the most probable set of interactions that the organism uses to coordinate the gene expression changes be determined. In other words, we want to find a simple and confident subnetwork of a putative TRN that is able to explain a given set of coexpressions.
To identify a simple and confident subnetwork explaining a set of coregulations, we propose LOMBARDE, an optimization strategy that extracts from a putative TRN the most simple and reliable TF/BS interactions that explain a given set of coregulated genes. The LOMBARDE method also accepts as input an additional independent list of experimentally validated transcriptional regulations.
The precision of LOMBARDE clearly depends on our assumption that the initial putative TRN includes most of the observed TF/BS interactions. When this assumption is satisfied, at least one explanation for each real coexpression should be assured. Interestingly, we found that a putative TRN constructed for Escherichiacoli using a classical bioinformatic pipeline to produce TF/BS affinity pairs explained 91.1 % of nearly 60,000 observed coexpressions (see Results and discussion for details).
When LOMBARDE was applied to this putative E.coli TRN and the set of observed coexpressions it produced a subnetwork that conserved only 19.2 % of the initial interaction arcs, while still explaining 91.1 % of the coexpressions. LOMBARDE has a strong bias towards preserving experimentally validated regulations. It preserved over 66 % of a set of independent experimentally validated arcs in the putative E.coli TRN and kept only 18.4 % of nonvalidated interactions. Moreover, when LOMBARDE was applied to the same putative TRN extended by adding all independent experimentally validated arcs, the resulting subnetwork retained 92 % of the validated arcs and included only 11.3 % of the other putative regulations. In addition, the subnetworks produced by LOMBARDE showed credible topological characteristics and recovered most of the global regulators described in the E. coli literature. The regulators were also ranked correctly in relation to their role in the network. We concluded that the modeling scheme proposed in LOMBARDE is a reliable strategy for deciphering the transcriptional regulatory interactions that can explain the coexpressions observed under environmental changes.
Materials and methods
Given a putative TRN computed from TF/BS affinities and a set of coexpressed gene pairs, the main idea of LOMBARDE is to extract from this network a simple and confident subnetwork that contains an explanation for each coexpression in the given set. Here we present a description of the proposed model and the methods used. We also compare the optimization strategy defined to model simple and confident subnetworks with alternative optimization strategies.
Input of LOMBARDE
 1.
Coexpressed pairs: A set \(\mathcal {C}\subseteq \mathbb {G}\times \mathbb {G}\) of pairs of coexpressed genes, selected based on the values of their correlation or mutual information. An example of such a set would be the results of ARACNe [2], MRNET [4], or other mutual information based methods evaluated using expression profiles obtained under different environmental conditions.
 2.
Affinity pairs: A set \(\mathcal {A}\subseteq \mathbb {G}\times \mathbb {G}\) of gene pairs obtained based on TF/BS sequence affinity and the associated pvalues. Specifically, a pair of genes (A,B) is in \(\mathcal {A}\) if gene A codes for a TF that has high affinity with a BS in the promoter region of gene B. For instance, \(\mathcal {A}\) could be the result of matches in the Prodoric database [9]. We assume that pairs with high pvalues would already have been discarded from \(\mathcal {A}\).
 3.
Validated pairs: Optionally, a set \(\mathcal {V}\subseteq \mathbb {G}\times \mathbb {G}\) of gene pairs that correspond to independent experimentally validated regulations, if available.
LOMBARDE was initially intended to be applied in Prokarya where all the genes of a given operon are transcribed typically in a single polycistronic mRNA molecule; therefore, we assumed that the expression of a gene implies the expression of the operon to which it belongs. Given the specific operon structure of the studied organism, E.coli, we can consider \(\mathbb {G}\) as the set of operons and \(\mathcal {C}\), \(\mathcal {A}\) and \(\mathcal {V}\) as sets of pairs of operons, which simplifies the analysis and reduces the running time of the method. This simplification is purely operational and can be applied at the user’s discretion.
Defining the a priori graph \(\mathcal {G}\) and explanations
Initially, LOMBARDE defines the a priori graph \(\mathcal {G}\) as a directed graph where nodes correspond to genes \(\mathbb {G}\) and directed arcs correspond to pairs of affinities in \(\mathcal {A}\) and pairs of known regulations in \(\mathcal {V}\). That is, \(\mathcal {G}=(\mathbb {G},\mathcal {A}\cup \mathcal {V})\). Thus, there will be a directed arc from gene A to gene B if there is some a priori evidence (experimental or theoretical, weak or strong) that A directly regulates B. If no validated regulations are available, then \(\mathcal {G}=(\mathbb {G},\mathcal {A})\). It is important to note that a regulatory cascade, i.e. a sequence of regulatory relations between genes in \(\mathcal {G}\), should appear in this graph as a directed path (see the right side of Fig. 1), although clearly not every path will represent a real regulatory cascade. The final objective is to highlight paths that most likely correspond to real regulatory cascades controlling the coexpressed data.
Under this representation of a TRN, the observed coregulation of two genes in \(\mathcal {C}\) can be explained by considering two cases. One, is the existence of a directed path from one gene to the other, meaning that the first gene is regulating the last gene through a regulatory cascade (direct regulation is considered as a regulatory cascade of size one). Two, is considering that none of the genes regulate the other, rather both are coregulated by a third gene. Such a situation is represented in the a priori graph by two paths from a common regulator to each of the coregulated gene (see right side of Fig. 1).
Definition 1.

\(\mathcal {E}\) is a directed path from A to B;

\(\mathcal {E}\) is a directed path from B to A;

\(\mathcal {E}\) is the union of two divergent directed paths starting from a gene C and arriving, respectively, at A and B, which have only vertex C in common.
Definition 2.
We say that a subgraph \(\mathcal {G}^{\prime }\subseteq \mathcal {G}\) explains \(\mathcal {C}\) if, for every pair \((A,B)\in \mathcal {C}\), the subgraph \(\mathcal {G}^{\prime }\) contains an explanation for (A,B).
Ideally, every pair (A,B) in \(\mathcal {C}\) should have at least one explanation in \(\mathcal {G}\). If this is not the case, it indicates that, under the modeling hypothesis, A and B are not really coregulated or that the methods used to compute the set \(\mathcal {A}\) did not capture all the transcriptional mechanisms involved in the coregulation of A and B. These unexplained pairs correspond to missing or inaccurate input data beyond the scope of the method and therefore are removed from \(\mathcal {C}\). After their removal we can assume that the a priori graph \(\mathcal {G}\) explains \(\mathcal {C}\). However, as discussed earlier, many arcs in \(\mathcal {G}\) represent TF/BS relations that have a low probability of occurrence or that have never been observed. Our objective, therefore, is to find a simple and confident subgraph \(\mathcal {G}'\subseteq \mathcal {G}\) that explains every pair in \(\mathcal {C}\).
Cost definition
We considered two ways of defining simple and confident subgraphs of \(\mathcal {G}\) that explain \(\mathcal {C}\): (i) foster explanations that use a small number of arcs and (2) foster explanations that have arcs of high affinity (i.e. low pvalue). A way to consider both criteria simultaneously is to define costs on the arcs in such a way that the most likely TF/BS affinities have the lowest costs.
Instead of defining the cost of an arc in \(\mathcal {G}\) as a continuous function of the pvalue of the associated TF/BS, we implemented a qualitative approach by defining levels of likelihood. Indeed, because small variations in pvalues have no real biological significance, we considered all arcs with similar pvalues as equally likely. This approach increased the robustness of the method and prevented solutions that were slightly but not significantly different to the best ones from being discarded.
We defined the costs of the arcs in \(\mathcal {G}\) using the following procedure where \(k \in {\mathbb N}\) and r∈(0,∞) are parameters. We defined k levels of likelihood, from level i=0 (highest likelihood) to level i=k−1 (lowest likelihood), in such a way that every level contains the same number of arcs (i.e. in equalfrequency bins). Arcs in \(\mathcal {V}\) are assigned to the highest likelihood level (valued i=0), because they correspond to already validated regulations. Finally, the cost of an arc in level i∈{0,…,k−1} is set as r ^{ i }. In this way r represents the incremental cost between consecutive levels; that is, an arc in level i has a cost that is equal to r arcs in level i−1. We analyzed the use of different values for the parameters k and r on a real data sets (see the ‘Analysis of cost parameters’ subsection for details).
Optimal explanations and subgraphs
Having defined the cost of arcs, it is natural to define the cost of a subgraph as the sum of the costs of the arcs it contains. With this, we defined an optimal explanation:
Definition 3.
We define an explanation \(\mathcal {E}\) for the pair (A,B) in \(\mathcal {C}\) as optimal if it has the minimum cost among all the explanations for the pair.
Definition 4.
There could be a huge number of optimal subgraphs explaining \(\mathcal {C}\), because several optimal explanations for each pair (A,B) in \(\mathcal {C}\) could exist. For example, if \(\mathcal {C}\) contained 20 gene pairs, each one having two optimal explanations, then the number of optimal subgraphs explaining \(\mathcal {C}\) could reach a million (when each possible union of optimal explanations produced a different subgraph).
This graph is the output of LOMBARDE (see Fig. 2 for an example).
Analysis of alternative optimization problems
We proposed \(\mathcal {G}_{L}\) as a simple and confident subgraph of \(\mathcal {G}\) to explain the coexpressions provided in \(\mathcal {C}\) by including all the optimal explanations for each pair in \(\mathcal {C}\). This can be seen as a local optimization problem because the cost of explaining every pair in \(\mathcal {C}\) independently is minimized. The main reason for choosing this local strategy was that other natural alternatives that consider solving global optimization problems are computationally infeasible [10].
Another alternative optimization problem, which can be considered as a mixture between local and global optimization, is to compute a minimum cost subgraph that contains an optimal explanation for each pair in \(\mathcal {C}\); that is, the optimal subgraph explaining \(\mathcal {C}\) with minimum global cost. Although this approach may seem more interesting than the previous alternative, it has some disadvantages. One disadvantage is the possible multiplicity of solutions because again there could be a large number of optimal subgraphs that have the same minimum global cost. An even worse disadvantage is that it has been proved by a reduction from the Minimum Hitting Set problem that finding just one solution is NPhard [10] (although in practice this problem can be solved for larger instances). Besides the disadvantages, the results produced by this alternative optimization are not more interesting in practice than those given by LOMBARDE because they always correspond to subsets of \(\mathcal {G}_{L}\). Thus, LOMBARDE gives not only the global optimal solution of the optimization problem but also provides alternative optimal explanations for each pair. Moreover, the solution given by LOMBARDE is always unique and computationally feasible.
Results and discussion
A large number of TF/BS associations have been validated experimentally for E.coli; therefore, we used public data for this bacteria to evaluate our model. The genome sequence and gene annotation of E.coli K12 [GenBank:NC_000913] were downloaded from the NCBI ftp site [11] (http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3). Differential expression profiles for 907 conditions were obtained from Ecoli_v4_Build_6_chips907probes4297.tab the Many Microbes Microarray Database [12] http://m3d.mssm.edu/norm/E_coli_v4_Build_6.tar.gz.
Input data
 ■
\(\mathcal {C}\): Pairs of coexpressed operons: A set of 61,506 pairs of coexpressed operons was generated by analyzing the E.coli differential expression profiles as follows. A matrix with the mutual information between all pairs of the 4297 E. coli genes was computed using a parametric Gaussian density estimator with the minet [13] library of the R statistical package [14]. The matrix had over 18 million values, although most of them were either insignificant or redundant and were discarded using the MRNET [4] strategy. From the remaining pairs of genes we considered only the 100,000 that had the highest mutual information, i.e. about 5 % of the total pairs. This number was chosen to support the inclusion of the main coexpressions. Finally, two operons were considered to be coexpressed if each one contained a gene from a pair of coexpressed genes. After discarding redundant and trivial cases, we obtained a set of 61,506 pairs of coexpressed operons, involving 2492 different operons.
 ■
\(\mathcal {A}\): Affinity network and its pvalues: A set of 25,604 pairs of operons with high TF/BS affinity was produced as follows. Genes coding for TFs were identified by BLAST homology searches (Evalue cutoff of 10^{−10}) between the gene product and a known instance of the TF in the Prodoric database [9] (http://www.prodoric.de). Then for each TF, a BS was considered each time a putative binding site in the upstream region (up to 300 bp) of a gene was located by MEME/FIMO [15] with a pvalue smaller than 10^{−5}. A pair of operons was reported if the first operon contained a gene coding for a TF with a putative BS in the upstream region of the second operon. The pvalue for a pair was defined as the pvalue of the affinity represented. (The minimum pvalue was used if more than one TF in the first operon had affinity with the promoter region of the second operon.). The set of 25,604 pairs involved 2390 different operons.
 ■
\(\mathcal {V}\): Validated network: A set of 1652 pairs of operons that represent independent experimentally validated transcriptional regulations was generated. Each pair of operons was reported if the first operon contained a gene coding for a TF that regulated the expression of the second operon based on the compilations made by Salgado et al. [8, 16] available at http://regulondb.ccg.unam.mx/. This set of 1652 pairs contained a total of 823 different operons.
 ■
(\(\mathcal {G}_{\mathcal {A}}\)) Ab initio: This scenario simulates a case when no validated regulations are available; therefore, only coexpressions \(\mathcal {C}\) and affinities \(\mathcal {A}\) are used as input. The a priori graph \(\mathcal {G}_{\mathcal {A}}\) generated from this input contained 25,604 arcs corresponding to the affinities in \(\mathcal {A}\). Although the set of independently validated regulations \(\mathcal {V}\) is discarded in the input of Lombarde, this set of independent information is used to evaluate the bias of the method to include confirmed regulations, because 444 regulations in \(\mathcal {A}\) were also in \(\mathcal {V}\).
 ■
(\(\mathcal {G}_{\mathcal {AV}}\)) Extended: This scenario considers all the data \(\mathcal {C}, \mathcal {A}\), and \(\mathcal {V}\) as input and uses Lombarde in the usual way. The a priori graph \(\mathcal {G}_{\mathcal {AV}}\) generated from this input contained 26,812 arcs, corresponding to the union \(\mathcal {A}\cup \mathcal {V}\) (444 pairs are in the intersection, i.e. TF/BS affinities that were also experimentally validated). In this case all arcs in \(\mathcal {V}\) are assigned cost 1, corresponding to the highest likelihood.
To evaluate the results of LOMBARDE compared with the already known TRN, we defined the validated regulatory network as the graph \(\mathcal {G}_{\mathcal {V}}\) that had only the arcs in \(\mathcal {V}\).
Explanatory potential of E.coli a priori graph
If we consider the validated regulatory network \(\mathcal {G}_{\mathcal {V}}\) (composed with just the validated arcs in \(\mathcal {V}\)), only 3990 coexpressions (6.5 % of \(\mathcal {C}\)) were explained. The main reason for this low value is that the set of coexpressed pairs \(\mathcal {C}\) involved 2492 different operons, while the network of validated regulations contained only 823 different operons. Thus, most of the operons in \(\mathcal {C}\) were not contained in the network of validated regulations. Interestingly, among the 3990 explained coexpressions, only 83 were explained by a single validated arc, while the remainder were explained only through regulatory cascades. This result is consistent with the results in Sun et al. [17] and shows that the reconstruction of a TRN using expression data alone seems to be infeasible, and confirms the role of regulatory cascades.
On the other hand, when the ab initio scenarios (computed affinities in \(\mathcal {A}\)) were considered we found that the a priori graph \(\mathcal {G}_{\mathcal {A}}\) explained 56,044 of the pairs of coexpressed operons (i.e. 91.1 % of \(\mathcal {C}\)). This number rose to 56,789 (92.3 % of \(\mathcal {C}\)) in the extended case when validated arcs were included in the a priori graph \(\mathcal {G}_{\mathcal {AV}}\). This result reveals the explanatory potential of the a priori graphs \(\mathcal {G}_{\mathcal {A}}\) and \(\mathcal {G}_{\mathcal {AV}}\).
Nevertheless, the huge number of affinities (more than 15 times the number of known regulations) is consistent with the evidence that many of the predicted TF/BS affinities are spurious and they are not part of the real regulatory processes that coordinate the gene coexpressions given as input, which is obtained from a particular set of experiments. The modeling principle of LOMBARDE is that we can choose the most confident subnetwork which explains the studied data.
LOMBARDE results are biased towards validated interactions
Characteristics of the a priori graphs and LOMBARDE output networks
Network  Explained  No. of  No. of  No. of  

coexpressions  vertices  arcs  arcs in \(\mathcal {V}\)  
TRN \(\mathcal {G}_{\mathcal {V}}\) built from \(\mathcal {V}\)  3,990  (6.5 %)  823  1,652  1,652 
E. coli ab initio \(\mathcal {G}_{\mathcal {A}}\)  56,044  (91.1 %)  2,390  25,604  444 
Lombarde output for \(\mathcal {G}_{\mathcal {A}}\)  56,044  (91.1 %)  2,336  4,922  295 
E. coli extended\(\mathcal {G}_{\mathcal {AV}}\)  56,789  (92.3 %)  2,434  26,812  1,652 
Lombarde output for \(\mathcal {G}_{\mathcal {AV}}\)  56,789  (92.3 %)  2,370  4,374  1,520 
We expected that LOMBARDE would recover many nonvalidated arcs because sets of validated regulations represent only the current knowledge, which may correspond to a very small portion of all the transcriptional regulations in an organism.
Degree distribution of LOMBARDE output are similar to observed TRNs
Ranking of global regulators
The networks produced by LOMBARDE contained most of the global regulators that have been described for E.coli [19]. Starting from \(\mathcal {G}_{\mathcal {A}}\), the output of LOMBARDE included 16 of the 19 known global regulators for this bacteria.
To determine the vertices that correspond to global regulators in the LOMBARDE output, we ranked them based on the connection structure of the network. In particular, we considered the vertex radiality to be a centrality index that measured the capability of each vertex to reach other ones in the graph [20]. If d(u,v) represents the distance from u to v (unweighted length of the shortest path from u to v) and D is the diameter of the graph (\(D=\max \{d(\bar u,\bar v): \bar u, \bar v\) in the graph }), then the value R _{ u,v }=1+D−d(u,v) is minimal (with value 1) when u to v are the extreme vertices of the graph and maximal when vertices u and v are neighbors. Then the radiality R a d(u) of a vertex u is defined as the average of the values of R _{ u,v } in LOMBARDE’s output. A vertex with high radiality is able to reach more vertices in fewer steps on average than a vertex with lower radiality.
E. coli global regulators and their ranking using radiality centrality index in the LOMBARDE output
Gene  Ranking  Ranking for radiality  Ranking for radiality 

name  in literature  index in Lombarde  index in Lombarde 
output for \(\mathcal {G}_{\mathcal {A}}\)  output for \(\mathcal {G}_{\mathcal {A}\mathcal {V}}\)  
crp  1  25  1 
ihfA  2  14  4 
ihfB  3  16  5 
fnr  4  1  6 
fis  5  63  2 
arcA  6  13  7 
lrp  7  34  87 
hns  8  —  14 
narL  9  121  126 
ompR  10  143  96 
fur  11  7  8 
phoB  12  9  25 
cpxR  13  80  22 
soxR  14  69  49 
soxS  15  109  18 
mtfA  16  —  — 
cspA  17  —  42 
rob  18  30  95 
purR  19  39  47 
When LOMBARDE was applied to \(\mathcal {G}_{\mathcal {AV}}\) (which also has the validated regulations in \(\mathcal {V}\)), 18 of the known global regulators were recovered in the output, 14 of them were central regulators. Therefore, LOMBARDE produced networks that gave a central role to most of the global regulators described in the literature.
LOMBARDE applied to a meaningful set of coexpressions
Robustness of results for different input data
Input data used to generate the a priori graph used by LOMBARDE can vary according to the bioinformatics methods used to obtain them. By analyzing different ways of obtaining input data we found that LOMBARDE was quite robust to many variations. We also checked that, when the input data incorporated more information, as expected, LOMBARDE produced better predictions.
The LOMBARDE results also were robust to the methods used to determine coexpressions. We evaluated LOMBARDE using different sets of coexpressions as determined by ARACNe [2], C3NET, CLR [3], and MRNET [4]. The LOMBARDE results were similar for two of the methods. The exception was C3NET (see bar graph B in Fig. 7), which produced the smallest set of coexpressions so LOMBARDE produced fewer explanations.
Analysis of cost parameters
The cost of each arc in the a priori graph was chosen following two criteria: similar values are considered equivalent so that LOMBARDE results are robust to minor variations in pvalue estimations; and cost should decrease as the arc pvalue decreases so that arcs with higher likelihood have lower cost. The first criterion was fulfilled by classifying each arc into one of k bins and assigning a discrete cost based on the second criterion.
Without further restrictions the minimization algorithm will have to distinguish between alternative paths that have biological relevance: the regulatory cascade between two genes may be a short path with low confidence arcs or, alternatively, a long path with high confidence arcs. LOMBARDE will choose the longer path with high confidence arcs only when the path length is up to r times the length of the path with the low confidence arcs. Therefore the cost of an arc which belongs to the ith bin (i∈{0,…,k−1}) is r ^{ i }.
Effects of parameters k and r on LOMBARDE output
No. of arcs  No. of valid arcs  Ratio  

r ∖ k  3  5  7  9  3  5  7  9  3  5  7  9 
1  24,471  24,471  24,471  24,471  428  428  428  428  1.8  1.8  1.8  1.8 
1.2  15,184  14,339  13,093  12,051  385  375  363  369  2.5  2.6  2.8  3.1 
1.5  12,204  10,299  8,951  8,059  369  344  338  333  3.0  3.3  3.8  4.1 
2  10,294  8,423  7,144  6,453  351  338  325  314  3.4  4.0  4.5  4.9 
5  6,379  5,612  4,999  4,955  293  300  294  295  4.6  5.3  5.9  6.0 
10  5,958  5,466  4,817  4,922  280  299  289  295  4.7  5.5  6.0  6.0 
Conclusions
Deciphering which regulatory interactions can provide a causal explanation for a set of observed coexpressions remains an important challenge in systems biology. We developed LOMBARDE, a modeling method that uses an optimization principle to determine a simple and confident set of regulations that serve as causal explanations for a given set of coexpressions. When the set of coexpressions involves genomewide interactions, LOMBARDE produces an explanatory putative TRN that has some basic topological characteristics close to observed TRNs and is biased to include regulations that have been independently validated experimentally. The LOMBARDE method was illustrated for a E.coli data set, where coexpressions were determined using mutual information under several environmental conditions. LOMBARDE was applied to an a priori graph considering TF/BS affinities recovered by BLAST and MEME/FIMO, and produced a simple and confident explanatory putative TRN that explained most of the observed coexpressions. LOMBARDE discarded a lot of arcs from the initial TRN but interestingly kept most of the independent experimentally validated regulations within it.
Sensitivity analysis showed that LOMBARDE was biased towards validated regulations in all cases of coexpression sets and a priori graphs used as input. Moreover, this method produced better results when the a priori graph was finetuned to the target organism.
We have evaluated LOMBARDE using E.coli as a test case because many regulatory interactions have been validated, making it is suitable for evaluation purposes. LOMBARDE can be applied in a straightforward way to other prokaryote organisms and we expect that the bias towards true regulations will be the same as long as the confidences of the predicted regulations are tuned to the target organism.
In summary, LOMBARDE is a tool that can provide useful insight into the regulatory mechanisms that underlie the phenotypical response of an organism to environmental challenges and it can be used as a reliable tool for further research on genomescale transcriptional regulation studies.
Availability of supporting data
Source code and raw data used for the evaluation are available at http://github.com/anaraven/Lombarde and as Additional file 1. The method can also be run from http://mobyle.inria.cl.
Declarations
Acknowledgements
We thank Alejandra MedinaRivera, Heladia Salgado and Julio ColladoVides for useful discussions about the interpretation and use of the RegulonDB database. This work was supported by grants Fondap 15090007, Basal CMM, Fondecyt 1140631 and ANR Biotempo ANR10BLANC0218. We also thank the international cooperation fellowships IntegrativeBioChile INRIA Associative Team and CIRICINRIA Chile.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci USA. 1998; 95(25):14863–8.PubMedPubMed CentralView ArticleGoogle Scholar
 Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 006 7:S7. 2006; 7 Suppl 1:7.View ArticleGoogle Scholar
 Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Largescale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8.View ArticleGoogle Scholar
 Meyer PE, Kontos K, Lafitte F, Bontempi G. Informationtheoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol. 2007:79879. doi: 10.1155/2007/79879.
 Olsen C, Meyer PE, Bontempi G. On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information. EURASIP J Bioinform Syst Biol. 2009:308959. doi: 10.1155/2009/308959.
 Rodionov DA. Comparative genomic reconstruction of transcriptional regulatory networks in bacteria. Chem Rev. 2007; 107(8):3467–97.PubMedPubMed CentralView ArticleGoogle Scholar
 Guhathakurta D. Computational identification of transcriptional regulatory elements in dna sequence. Nucleic Acids Res. 2006; 34(12):3585–98.PubMedPubMed CentralView ArticleGoogle Scholar
 Salgado H, PeraltaGil M, GamaCastro S, SantosZavaleta A, MunizRascado L, GarciaSotelo JS, et al. Regulondb v8.0: omics data sets, evolutionary conservation, regulatory phrases, crossvalidated gold standards and more. Nucleic Acids Res. 2013; 41(D1):203–13.View ArticleGoogle Scholar
 Grote A, Klein J, Retter I, Haddad I, Behling S, Bunk B, et al. Prodoric (release 2009): a database and tool platform for the analysis of gene regulation in prokaryotes. Nucleic Acids Res. 2009; 37(Database issue):61–5.View ArticleGoogle Scholar
 Acuña V, Aravena A, Maass A, Siegel A. Modeling parsimonious putative regulatory networks: Complexity and heuristic approach. In: Verification, Model Checking, and Abstract Interpretation. Volume 8318 of the series Lecture Notes in Computer Science. BerlinHeidelberg: Springer: 2014. p. 322–336.Google Scholar
 Pruitt KD. Ncbi reference sequence (refseq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2004; 33(Database issue):501–4.View ArticleGoogle Scholar
 Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, et al. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008; 36(Database issue):866–70.Google Scholar
 Meyer PE, Lafitte F, Bontempi G. minet: A r/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008; 9:461.PubMedPubMed CentralView ArticleGoogle Scholar
 R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. R Foundation for Statistical Computing. ISBN 3900051070.Google Scholar
 Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. Meme suite: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(Web Server issue):202.View ArticleGoogle Scholar
 GamaCastro S, Salgado H, PeraltaGil M, SantosZavaleta A, MuñizRascado L, SolanoLira H, et al. Regulondb version 7.0: transcriptional regulation of escherichia coli k12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res. 2011; 39(Database issue):98–105.View ArticleGoogle Scholar
 Sun J, Tuncay K, Haidar A, Ensman L, Stanley F, Trelinski M, et al. Transcriptional regulatory network discovery via multiple method integration: application to e. coli k12. Algorithms Mol Biol. 2007; 2(1):2–2.PubMedPubMed CentralView ArticleGoogle Scholar
 Leclerc RD. Survival of the sparsest: robust gene networks are parsimonious. Mol Syst Biol. 2008; 4:213.PubMedPubMed CentralView ArticleGoogle Scholar
 MartínezAntonio A, ColladoVides J. Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol. 2003; 6(5):482–9.PubMedView ArticleGoogle Scholar
 Valente T, Foreman R. Integration and radiality: measuring the extent of an individual’s connectedness and reachability in a network. Soc Networks. 1998; 20(1):89–105.View ArticleGoogle Scholar