Research Article  Open  Published:
Deciphering transcriptional regulations coordinating the response to environmental changes
BMC Bioinformaticsvolume 17, Article number: 35 (2016)
Abstract
Background
Gene coexpression evidenced as a response to environmental changes has shown that transcriptional activity is coordinated, which pinpoints the role of transcriptional regulatory networks (TRNs). Nevertheless, the prediction of TRNs based on the affinity of transcription factors (TFs) with binding sites (BSs) generally produces an overestimation of the observable TF/BS relations within the network and therefore many of the predicted relations are spurious.
Results
We present Lombarde, a bioinformatics method that extracts from a TRN determined from a set of predicted TF/BS affinities a subnetwork explaining a given set of observed coexpressions by choosing the TFs and BSs most likely to be involved in the coregulation. Lombarde solves an optimization problem which selects confident paths within a given TRN that join a putative common regulator with two coexpressed genes via regulatory cascades. To evaluate the method, we used public data of Escherichia coli to produce a regulatory network that explained almost all observed coexpressions while using only 19 % of the input TF/BS affinities but including about 66 % of the independent experimentally validated regulations in the input data. When all known validated TF/BS affinities were integrated into the input data the precision of Lombarde increased significantly. The topological characteristics of the subnetwork that was obtained were similar to the characteristics described for known validated TRNs.
Conclusions
Lombarde provides a useful modeling scheme for deciphering the regulatory mechanisms that underlie the phenotypic responses of an organism to environmental challenges. The method can become a reliable tool for further research on genomescale transcriptional regulation studies.
Background
Deciphering the mechanisms that explain the coordinated change in gene expression of an organism as a response to changes in the environment, is one of the fundamental challenges in systems biology. Moreover, highthroughput expression data have provided evidence that these mechanisms act in intriguing ways to coordinate gene expression, emphasizing the complexity of the regulation as part of the acclimation process.
In general, classic methods of in silico reconstruction of transcriptional regulation processes consider expression profiles and genomic sequences separately. The most commonly used strategies for identifying coexpressed genes consider linear correlation [1] or mutual information methods like ARACNe [2], CLR [3], and MRNET [4]. Some of these methods have been successful in identifying regulatory interactions in synthetic networks and in model organisms [5]. Nonetheless, the interdependence of the expression profiles of two genes does not necessarily mean that there is a physical interaction. Also, the computed correlations are not oriented and thus cannot be interpreted causally. Even so, they convey information about the transcriptional mechanisms.
On the other hand, different in silico approaches have been developed for the study of genomic sequences. These approaches attempt to model the physical interactions that form a putative transcriptional regulatory network (TRN) and they rely on the identification of pairs of genes where the first gene codes for a transcription factor (TF) that potentially binds in the promoter region of the second gene [6, 7]. Genes coding for TFs are typically obtained by homology between the genome sequence and a database of TFs (such as RegulonDB [8] or Prodoric [9]). In these databases, each TF has an associated position weighted matrix (PWM), which estimates the affinity between the TF and a potential binding site (BS) in a promoter region. The low specificity of current methods for identifying transcriptional regulations means that the number of TF/BS affinities found is usually huge, while, in practice, only a few of them correspond to observed regulatory interactions. Nevertheless, even if many of the predicted interactions have a low probability of occurrence, or are never observed, it is reasonable to assume that the TRN reconstructed from them contains most of the physical interactions that occur for a given process and therefore the network should be able to explain coexpression of genes. Moreover, the pvalue of the affinity computed for a given TF/BS interaction provides information that can be interpreted as a likelihood of the occurrence of this interaction.
Under this assumption, the problem we address here is: given a putative TRN constructed from TF/BS affinities and given a set of coexpressed gene pairs, how can the most probable set of interactions that the organism uses to coordinate the gene expression changes be determined. In other words, we want to find a simple and confident subnetwork of a putative TRN that is able to explain a given set of coexpressions.
To define what we mean by a subnetwork explaining a pair of coexpressed genes, we consider all the topological configurations within the network that allow for a coordinate change. The simplest case is when a pair of coexpressed genes also corresponds to a TF/BS interaction in the putative TRN. Clearly a direct interaction, represented in the TRN as a single arc between the genes, is a possible explanation for the coexpression, indicating that one gene is regulating the expression of the other gene. Another possibility is that the coexpression of the two genes is correlated not by a single regulatory interaction but by a chain of TF/BS interactions, i.e. a regulatory cascade. Such a cascade is represented in the TRN as a directed path from one gene to the other. Thus, a path connecting two correlated genes is also a possible explanation for the coexpression. Finally, the correlation may be the result of a third gene that simultaneously regulates two coexpressed genes. In such a situation, the two correlated genes will be at the ends of two regulatory cascades, both of which start with this common regulator gene (see right side of Fig. 1). We define any of these configurations as an explanation for the correlation.
To identify a simple and confident subnetwork explaining a set of coregulations, we propose LOMBARDE, an optimization strategy that extracts from a putative TRN the most simple and reliable TF/BS interactions that explain a given set of coregulated genes. The LOMBARDE method also accepts as input an additional independent list of experimentally validated transcriptional regulations.
The precision of LOMBARDE clearly depends on our assumption that the initial putative TRN includes most of the observed TF/BS interactions. When this assumption is satisfied, at least one explanation for each real coexpression should be assured. Interestingly, we found that a putative TRN constructed for Escherichiacoli using a classical bioinformatic pipeline to produce TF/BS affinity pairs explained 91.1 % of nearly 60,000 observed coexpressions (see Results and discussion for details).
When LOMBARDE was applied to this putative E.coli TRN and the set of observed coexpressions it produced a subnetwork that conserved only 19.2 % of the initial interaction arcs, while still explaining 91.1 % of the coexpressions. LOMBARDE has a strong bias towards preserving experimentally validated regulations. It preserved over 66 % of a set of independent experimentally validated arcs in the putative E.coli TRN and kept only 18.4 % of nonvalidated interactions. Moreover, when LOMBARDE was applied to the same putative TRN extended by adding all independent experimentally validated arcs, the resulting subnetwork retained 92 % of the validated arcs and included only 11.3 % of the other putative regulations. In addition, the subnetworks produced by LOMBARDE showed credible topological characteristics and recovered most of the global regulators described in the E. coli literature. The regulators were also ranked correctly in relation to their role in the network. We concluded that the modeling scheme proposed in LOMBARDE is a reliable strategy for deciphering the transcriptional regulatory interactions that can explain the coexpressions observed under environmental changes.
Materials and methods
Given a putative TRN computed from TF/BS affinities and a set of coexpressed gene pairs, the main idea of LOMBARDE is to extract from this network a simple and confident subnetwork that contains an explanation for each coexpression in the given set. Here we present a description of the proposed model and the methods used. We also compare the optimization strategy defined to model simple and confident subnetworks with alternative optimization strategies.
Input of LOMBARDE
If $\mathbb {G}$ is the set of genes in a studied organism, then LOMBARDE requires the following inputs:

1.
Coexpressed pairs: A set $\mathcal {C}\subseteq \mathbb {G}\times \mathbb {G}$ of pairs of coexpressed genes, selected based on the values of their correlation or mutual information. An example of such a set would be the results of ARACNe [2], MRNET [4], or other mutual information based methods evaluated using expression profiles obtained under different environmental conditions.

2.
Affinity pairs: A set $\mathcal {A}\subseteq \mathbb {G}\times \mathbb {G}$ of gene pairs obtained based on TF/BS sequence affinity and the associated pvalues. Specifically, a pair of genes (A,B) is in $\mathcal {A}$ if gene A codes for a TF that has high affinity with a BS in the promoter region of gene B. For instance, $\mathcal {A}$ could be the result of matches in the Prodoric database [9]. We assume that pairs with high pvalues would already have been discarded from $\mathcal {A}$.

3.
Validated pairs: Optionally, a set $\mathcal {V}\subseteq \mathbb {G}\times \mathbb {G}$ of gene pairs that correspond to independent experimentally validated regulations, if available.
LOMBARDE was initially intended to be applied in Prokarya where all the genes of a given operon are transcribed typically in a single polycistronic mRNA molecule; therefore, we assumed that the expression of a gene implies the expression of the operon to which it belongs. Given the specific operon structure of the studied organism, E.coli, we can consider $\mathbb {G}$ as the set of operons and $\mathcal {C}$, $\mathcal {A}$ and $\mathcal {V}$ as sets of pairs of operons, which simplifies the analysis and reduces the running time of the method. This simplification is purely operational and can be applied at the user’s discretion.
Defining the a priori graph $\mathcal {G}$ and explanations
Initially, LOMBARDE defines the a priori graph $\mathcal {G}$ as a directed graph where nodes correspond to genes $\mathbb {G}$ and directed arcs correspond to pairs of affinities in $\mathcal {A}$ and pairs of known regulations in $\mathcal {V}$. That is, $\mathcal {G}=(\mathbb {G},\mathcal {A}\cup \mathcal {V})$. Thus, there will be a directed arc from gene A to gene B if there is some a priori evidence (experimental or theoretical, weak or strong) that A directly regulates B. If no validated regulations are available, then $\mathcal {G}=(\mathbb {G},\mathcal {A})$. It is important to note that a regulatory cascade, i.e. a sequence of regulatory relations between genes in $\mathcal {G}$, should appear in this graph as a directed path (see the right side of Fig. 1), although clearly not every path will represent a real regulatory cascade. The final objective is to highlight paths that most likely correspond to real regulatory cascades controlling the coexpressed data.
Under this representation of a TRN, the observed coregulation of two genes in $\mathcal {C}$ can be explained by considering two cases. One, is the existence of a directed path from one gene to the other, meaning that the first gene is regulating the last gene through a regulatory cascade (direct regulation is considered as a regulatory cascade of size one). Two, is considering that none of the genes regulate the other, rather both are coregulated by a third gene. Such a situation is represented in the a priori graph by two paths from a common regulator to each of the coregulated gene (see right side of Fig. 1).
Definition 1.
Given a pair $(A,B) \in \mathcal {C}$ of coexpressed genes, an explanation for (A,B) in $\mathcal {G}$ is a set of arcs $\mathcal {E}$ that satisfy any of the following conditions:

$\mathcal {E}$ is a directed path from A to B;

$\mathcal {E}$ is a directed path from B to A;

$\mathcal {E}$ is the union of two divergent directed paths starting from a gene C and arriving, respectively, at A and B, which have only vertex C in common.
Definition 2.
We say that a subgraph $\mathcal {G}^{\prime }\subseteq \mathcal {G}$ explains $\mathcal {C}$ if, for every pair $(A,B)\in \mathcal {C}$, the subgraph $\mathcal {G}^{\prime }$ contains an explanation for (A,B).
Ideally, every pair (A,B) in $\mathcal {C}$ should have at least one explanation in $\mathcal {G}$. If this is not the case, it indicates that, under the modeling hypothesis, A and B are not really coregulated or that the methods used to compute the set $\mathcal {A}$ did not capture all the transcriptional mechanisms involved in the coregulation of A and B. These unexplained pairs correspond to missing or inaccurate input data beyond the scope of the method and therefore are removed from $\mathcal {C}$. After their removal we can assume that the a priori graph $\mathcal {G}$ explains $\mathcal {C}$. However, as discussed earlier, many arcs in $\mathcal {G}$ represent TF/BS relations that have a low probability of occurrence or that have never been observed. Our objective, therefore, is to find a simple and confident subgraph $\mathcal {G}'\subseteq \mathcal {G}$ that explains every pair in $\mathcal {C}$.
Cost definition
We considered two ways of defining simple and confident subgraphs of $\mathcal {G}$ that explain $\mathcal {C}$: (i) foster explanations that use a small number of arcs and (2) foster explanations that have arcs of high affinity (i.e. low pvalue). A way to consider both criteria simultaneously is to define costs on the arcs in such a way that the most likely TF/BS affinities have the lowest costs.
Instead of defining the cost of an arc in $\mathcal {G}$ as a continuous function of the pvalue of the associated TF/BS, we implemented a qualitative approach by defining levels of likelihood. Indeed, because small variations in pvalues have no real biological significance, we considered all arcs with similar pvalues as equally likely. This approach increased the robustness of the method and prevented solutions that were slightly but not significantly different to the best ones from being discarded.
We defined the costs of the arcs in $\mathcal {G}$ using the following procedure where $k \in {\mathbb N}$ and r∈(0,∞) are parameters. We defined k levels of likelihood, from level i=0 (highest likelihood) to level i=k−1 (lowest likelihood), in such a way that every level contains the same number of arcs (i.e. in equalfrequency bins). Arcs in $\mathcal {V}$ are assigned to the highest likelihood level (valued i=0), because they correspond to already validated regulations. Finally, the cost of an arc in level i∈{0,…,k−1} is set as r ^{i}. In this way r represents the incremental cost between consecutive levels; that is, an arc in level i has a cost that is equal to r arcs in level i−1. We analyzed the use of different values for the parameters k and r on a real data sets (see the ‘Analysis of cost parameters’ subsection for details).
Optimal explanations and subgraphs
Having defined the cost of arcs, it is natural to define the cost of a subgraph as the sum of the costs of the arcs it contains. With this, we defined an optimal explanation:
Definition 3.
We define an explanation $\mathcal {E}$ for the pair (A,B) in $\mathcal {C}$ as optimal if it has the minimum cost among all the explanations for the pair.
Note that a pair (A,B) in $\mathcal {C}$ can have more than one optimal explanation (see Fig. 2), especially after the cost categorization where all arcs with similar pvalues are considered as equally likely. Thus we define an optimal subgraph:
Definition 4.
We define a subgraph $\mathcal {G}^{\prime }\subseteq \mathcal {G}$as an optimal subgraph explaining $\mathcal {C}$ if $\mathcal {G}^{\prime }$ is the union of $\mathcal {C}$ (cardinality of $\mathcal {C}$) optimal explanations, one for each pair of genes in $\mathcal {C}$. That is,
where $\mathcal {E}_{(A,B)}$ is an optimal explanation for (A,B).
There could be a huge number of optimal subgraphs explaining $\mathcal {C}$, because several optimal explanations for each pair (A,B) in $\mathcal {C}$ could exist. For example, if $\mathcal {C}$ contained 20 gene pairs, each one having two optimal explanations, then the number of optimal subgraphs explaining $\mathcal {C}$ could reach a million (when each possible union of optimal explanations produced a different subgraph).
Instead of enumerating all optimal subgraphs explaining $\mathcal {C}$, which could be computationally infeasible, we computed a subgraph $\mathcal {G}_{L}$ defined as the union of all optimal subgraphs explaining $\mathcal {C}$. It is clearly not necessary to compute every optimal subgraph, but rather to compute for every (A,B) pair in $\mathcal {C}$ only the set of all optimal explanations for (A,B). Thus, the graph $\mathcal {G}_{L}$ is obtained as,
This graph is the output of LOMBARDE (see Fig. 2 for an example).
Analysis of alternative optimization problems
We proposed $\mathcal {G}_{L}$ as a simple and confident subgraph of $\mathcal {G}$ to explain the coexpressions provided in $\mathcal {C}$ by including all the optimal explanations for each pair in $\mathcal {C}$. This can be seen as a local optimization problem because the cost of explaining every pair in $\mathcal {C}$ independently is minimized. The main reason for choosing this local strategy was that other natural alternatives that consider solving global optimization problems are computationally infeasible [10].
For instance, consider the problem of computing the subgraph of minimum cost that explains $\mathcal {C}$. This criterion can give a different result than $\mathcal {G}_{L}$ (see Fig. 3 for an example). However, this approach is not robust because adding a new coexpression pair could completely change the global solution. Moreover, the best explanation that it provides for a given pair could have a very high cost compared with an optimal explanation. As well as having these undesirable properties, this problem is hard to solve. Indeed, it has been shown to be an NPhard problem [10] by a reduction from the Steiner Weighted Directed Tree problem. This means that the problem can be solved exactly and in a reasonable time only for very small instances.
Another alternative optimization problem, which can be considered as a mixture between local and global optimization, is to compute a minimum cost subgraph that contains an optimal explanation for each pair in $\mathcal {C}$; that is, the optimal subgraph explaining $\mathcal {C}$ with minimum global cost. Although this approach may seem more interesting than the previous alternative, it has some disadvantages. One disadvantage is the possible multiplicity of solutions because again there could be a large number of optimal subgraphs that have the same minimum global cost. An even worse disadvantage is that it has been proved by a reduction from the Minimum Hitting Set problem that finding just one solution is NPhard [10] (although in practice this problem can be solved for larger instances). Besides the disadvantages, the results produced by this alternative optimization are not more interesting in practice than those given by LOMBARDE because they always correspond to subsets of $\mathcal {G}_{L}$. Thus, LOMBARDE gives not only the global optimal solution of the optimization problem but also provides alternative optimal explanations for each pair. Moreover, the solution given by LOMBARDE is always unique and computationally feasible.
Results and discussion
A large number of TF/BS associations have been validated experimentally for E.coli; therefore, we used public data for this bacteria to evaluate our model. The genome sequence and gene annotation of E.coli K12 [GenBank:NC_000913] were downloaded from the NCBI ftp site [11] (http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3). Differential expression profiles for 907 conditions were obtained from Ecoli_v4_Build_6_chips907probes4297.tab the Many Microbes Microarray Database [12] http://m3d.mssm.edu/norm/E_coli_v4_Build_6.tar.gz.
Input data
Here, we describe briefly the sources and methods used to generate input data for LOMBARDE. As mentioned above, for bacteria we condense the information into operons to simplify computation. It is important to note that the methods we used to produce the input for LOMBARDE were merely a choice. Because data production was not our main concern, as far as possible, we used simple pipelines that are currently used by the community to build the putative TRNs.

■
$\mathcal {C}$: Pairs of coexpressed operons: A set of 61,506 pairs of coexpressed operons was generated by analyzing the E.coli differential expression profiles as follows. A matrix with the mutual information between all pairs of the 4297 E. coli genes was computed using a parametric Gaussian density estimator with the minet [13] library of the R statistical package [14]. The matrix had over 18 million values, although most of them were either insignificant or redundant and were discarded using the MRNET [4] strategy. From the remaining pairs of genes we considered only the 100,000 that had the highest mutual information, i.e. about 5 % of the total pairs. This number was chosen to support the inclusion of the main coexpressions. Finally, two operons were considered to be coexpressed if each one contained a gene from a pair of coexpressed genes. After discarding redundant and trivial cases, we obtained a set of 61,506 pairs of coexpressed operons, involving 2492 different operons.

■
$\mathcal {A}$: Affinity network and its pvalues: A set of 25,604 pairs of operons with high TF/BS affinity was produced as follows. Genes coding for TFs were identified by BLAST homology searches (Evalue cutoff of 10^{−10}) between the gene product and a known instance of the TF in the Prodoric database [9] (http://www.prodoric.de). Then for each TF, a BS was considered each time a putative binding site in the upstream region (up to 300 bp) of a gene was located by MEME/FIMO [15] with a pvalue smaller than 10^{−5}. A pair of operons was reported if the first operon contained a gene coding for a TF with a putative BS in the upstream region of the second operon. The pvalue for a pair was defined as the pvalue of the affinity represented. (The minimum pvalue was used if more than one TF in the first operon had affinity with the promoter region of the second operon.). The set of 25,604 pairs involved 2390 different operons.

■
$\mathcal {V}$: Validated network: A set of 1652 pairs of operons that represent independent experimentally validated transcriptional regulations was generated. Each pair of operons was reported if the first operon contained a gene coding for a TF that regulated the expression of the second operon based on the compilations made by Salgado et al. [8, 16] available at http://regulondb.ccg.unam.mx/. This set of 1652 pairs contained a total of 823 different operons.
We used this data to build the a priori graph that is used by LOMBARDE. Then we simulated two possible application scenarios.

■
($\mathcal {G}_{\mathcal {A}}$) Ab initio: This scenario simulates a case when no validated regulations are available; therefore, only coexpressions $\mathcal {C}$ and affinities $\mathcal {A}$ are used as input. The a priori graph $\mathcal {G}_{\mathcal {A}}$ generated from this input contained 25,604 arcs corresponding to the affinities in $\mathcal {A}$. Although the set of independently validated regulations $\mathcal {V}$ is discarded in the input of Lombarde, this set of independent information is used to evaluate the bias of the method to include confirmed regulations, because 444 regulations in $\mathcal {A}$ were also in $\mathcal {V}$.

■
($\mathcal {G}_{\mathcal {AV}}$) Extended: This scenario considers all the data $\mathcal {C}, \mathcal {A}$, and $\mathcal {V}$ as input and uses Lombarde in the usual way. The a priori graph $\mathcal {G}_{\mathcal {AV}}$ generated from this input contained 26,812 arcs, corresponding to the union $\mathcal {A}\cup \mathcal {V}$ (444 pairs are in the intersection, i.e. TF/BS affinities that were also experimentally validated). In this case all arcs in $\mathcal {V}$ are assigned cost 1, corresponding to the highest likelihood.
To evaluate the results of LOMBARDE compared with the already known TRN, we defined the validated regulatory network as the graph $\mathcal {G}_{\mathcal {V}}$ that had only the arcs in $\mathcal {V}$.
Explanatory potential of E.coli a priori graph
If we consider the validated regulatory network $\mathcal {G}_{\mathcal {V}}$ (composed with just the validated arcs in $\mathcal {V}$), only 3990 coexpressions (6.5 % of $\mathcal {C}$) were explained. The main reason for this low value is that the set of coexpressed pairs $\mathcal {C}$ involved 2492 different operons, while the network of validated regulations contained only 823 different operons. Thus, most of the operons in $\mathcal {C}$ were not contained in the network of validated regulations. Interestingly, among the 3990 explained coexpressions, only 83 were explained by a single validated arc, while the remainder were explained only through regulatory cascades. This result is consistent with the results in Sun et al. [17] and shows that the reconstruction of a TRN using expression data alone seems to be infeasible, and confirms the role of regulatory cascades.
On the other hand, when the ab initio scenarios (computed affinities in $\mathcal {A}$) were considered we found that the a priori graph $\mathcal {G}_{\mathcal {A}}$ explained 56,044 of the pairs of coexpressed operons (i.e. 91.1 % of $\mathcal {C}$). This number rose to 56,789 (92.3 % of $\mathcal {C}$) in the extended case when validated arcs were included in the a priori graph $\mathcal {G}_{\mathcal {AV}}$. This result reveals the explanatory potential of the a priori graphs $\mathcal {G}_{\mathcal {A}}$ and $\mathcal {G}_{\mathcal {AV}}$.
Nevertheless, the huge number of affinities (more than 15 times the number of known regulations) is consistent with the evidence that many of the predicted TF/BS affinities are spurious and they are not part of the real regulatory processes that coordinate the gene coexpressions given as input, which is obtained from a particular set of experiments. The modeling principle of LOMBARDE is that we can choose the most confident subnetwork which explains the studied data.
LOMBARDE results are biased towards validated interactions
Considering the ab initio scenario, LOMBARDE was first applied to $\mathcal {G}_{\mathcal {A}}$ and the set $\mathcal {C}$ of observed coexpressions for E.coli. After setting the cost parameters k=9 and r=10 (an analysis of this choice is presented below), LOMBARDE produced a subnetwork with only 19.2 % of the initial arcs (4922 of 25,604), which still explained 91.1 % of the coexpressions. Interestingly, LOMBARDE showed a strong bias towards preserving independent experimentally validated regulations in $\mathcal {V}$ (see Table 1). Indeed, LOMBARDE preserved 66.4 % of the validated arcs in $\mathcal {G}_{\mathcal {A}}$ and kept only 18.4 % of the nonvalidated interactions. A hypergeometric test confirmed this bias, with an enrichment pvalue under 10^{−105}. For the extended scenario, this bias was even stronger when LOMBARDE was applied to the extended graph $\mathcal {G}_{\mathcal {AV}}$ (i.e., adding all validated regulations). The resulting subnetwork kept 92 % of the validated arcs (1520 of 1652) and included only 11.3 % of the nonvalidated putative regulations (2854 of 25,604). For a future work, it would be interesting to explore whether the 2,854 putative regulations contain real regulatory relations in E.coli which have not been experimentally validated yet.
It should be noted that while LOMBARDE preferentially chooses arcs with low pvalues, it also includes arcs with high pvalues when they are required to explain a coexpression (see Fig. 4). Validated arcs are also biased towards lower pvalues, although some do have high values. Thus methods based on only a pvalue threshold will not recover all validated arcs and may not produce the largest networks.
We expected that LOMBARDE would recover many nonvalidated arcs because sets of validated regulations represent only the current knowledge, which may correspond to a very small portion of all the transcriptional regulations in an organism.
Degree distribution of LOMBARDE output are similar to observed TRNs
Some characteristics that were measured in the LOMBARDE results suggested that the networks were topologically closer to other observed TRNs than the a priori graphs $\mathcal {G}_{\mathcal {A}}$ and $\mathcal {G}_{\mathcal {AV}}$. Indeed, the original average degree (number of interactions per operon) of $\mathcal {G}_{\mathcal {A}}$ was 10.7, which is much higher than the values of 1.5–2.0 suggested in the literature [18] for a TRN. The resulting average degree of LOMBARDE outputs was 2.1, which is much closer to the expected value. This value is also close to the average degree value of 2.0 for the existing network $\mathcal {G}_{\mathcal {V}}$ of validated regulations for E.coli. Furthermore, the degree distribution (proportion of operons for each degree) in LOMBARDE outputs was highly correlated with the degree distribution in the existing network of validated regulations, which indicates that they shared some structural properties, as shown in Fig. 5. In contrast, the degree distributions in $\mathcal {G}_{\mathcal {A}}$ and $\mathcal {G}_{\mathcal {AV}}$ were significantly different; therefore, their structures were different from the structure of the observed network of validated regulations.
Ranking of global regulators
The networks produced by LOMBARDE contained most of the global regulators that have been described for E.coli [19]. Starting from $\mathcal {G}_{\mathcal {A}}$, the output of LOMBARDE included 16 of the 19 known global regulators for this bacteria.
To determine the vertices that correspond to global regulators in the LOMBARDE output, we ranked them based on the connection structure of the network. In particular, we considered the vertex radiality to be a centrality index that measured the capability of each vertex to reach other ones in the graph [20]. If d(u,v) represents the distance from u to v (unweighted length of the shortest path from u to v) and D is the diameter of the graph ($D=\max \{d(\bar u,\bar v): \bar u, \bar v$ in the graph }), then the value R _{ u,v }=1+D−d(u,v) is minimal (with value 1) when u to v are the extreme vertices of the graph and maximal when vertices u and v are neighbors. Then the radiality R a d(u) of a vertex u is defined as the average of the values of R _{ u,v } in LOMBARDE’s output. A vertex with high radiality is able to reach more vertices in fewer steps on average than a vertex with lower radiality.
We defined a vertex in LOMBARDE output to be a central regulator if its radiality index is ranked among the top 30 %. Ten of the known global regulators were among the central regulators in the LOMBARDE output (Table 2). In contrast, $\mathcal {G}_{\mathcal {A}}$ had only seven global regulators among the central regulators.
When LOMBARDE was applied to $\mathcal {G}_{\mathcal {AV}}$ (which also has the validated regulations in $\mathcal {V}$), 18 of the known global regulators were recovered in the output, 14 of them were central regulators. Therefore, LOMBARDE produced networks that gave a central role to most of the global regulators described in the literature.
LOMBARDE applied to a meaningful set of coexpressions
When the set of coexpressions $\mathcal {C}$ involved most of the genes in the organism, LOMBARDE produced a genomewide putative TRN. However, LOMBARDE also can be applied to any specific set of biologically meaningful genes to decipher the regulatory relationships between them. As an illustration of this in the ab initio scenario, we restricted the set $\mathcal {C}$ obtained for E.coli to a subset $\mathcal {C}^{\prime }$ built of eight pairs of operons that contained coexpressed genes. From the a priori graph $\mathcal {G}_{\mathcal {A}}$, LOMBARDE produced the small putative regulatory network shown in Fig. 6. This network explained all the coexpressions using 30 regulations. In most cases, there was only one optimal explanation. For example, the gene fur codes for a transcription factor that regulates operons that contain metA, metF, metNIQ, pyrD, purEK, purC, and codBA through a regulatory cascade involving metJ and purC. In the other cases, there were several optimal explanations. For example, the coexpression of nohAydfNtfaQ and clcB could be explained by cascades either from fur or from galS. Among the 30 regulations determined by LOMBARDE, 16 have been experimentally validated.
Robustness of results for different input data
Input data used to generate the a priori graph used by LOMBARDE can vary according to the bioinformatics methods used to obtain them. By analyzing different ways of obtaining input data we found that LOMBARDE was quite robust to many variations. We also checked that, when the input data incorporated more information, as expected, LOMBARDE produced better predictions.
For instance, the a priori graph depends on the source of the TF and BS patterns used to determine the set of TF/BS affinities. We compared the results of LOMBARDE applied to an a priori graph built using RegulonDB instead of the Prodoric database and found that LOMBARDE produced smaller graphs containing a greater proportion of validated arcs (Fig. 7, bar graph C). This is not surprising because Prodoric is based on several prokaryote organisms, while RegulonDB is based only on E.coli. However, the differences were not significant when validated regulations were included (Fig. 7, bar graph D).
The LOMBARDE results also were robust to the methods used to determine coexpressions. We evaluated LOMBARDE using different sets of coexpressions as determined by ARACNe [2], C3NET, CLR [3], and MRNET [4]. The LOMBARDE results were similar for two of the methods. The exception was C3NET (see bar graph B in Fig. 7), which produced the smallest set of coexpressions so LOMBARDE produced fewer explanations.
Analysis of cost parameters
The cost of each arc in the a priori graph was chosen following two criteria: similar values are considered equivalent so that LOMBARDE results are robust to minor variations in pvalue estimations; and cost should decrease as the arc pvalue decreases so that arcs with higher likelihood have lower cost. The first criterion was fulfilled by classifying each arc into one of k bins and assigning a discrete cost based on the second criterion.
Without further restrictions the minimization algorithm will have to distinguish between alternative paths that have biological relevance: the regulatory cascade between two genes may be a short path with low confidence arcs or, alternatively, a long path with high confidence arcs. LOMBARDE will choose the longer path with high confidence arcs only when the path length is up to r times the length of the path with the low confidence arcs. Therefore the cost of an arc which belongs to the ith bin (i∈{0,…,k−1}) is r ^{i}.
To choose useful values of k and r we explored the conditions that are summarized in Table 3. The size of the LOMBARDE output decreased when either r or k increased, and also when the number of validated arcs increased. The ratio of validated arcs among all the arcs in the LOMBARDE output increased with increased k and r. This result suggested that choosing the parameters r=10 and k=9 would bias LOMBARDE to produce more confident networks.
Conclusions
Deciphering which regulatory interactions can provide a causal explanation for a set of observed coexpressions remains an important challenge in systems biology. We developed LOMBARDE, a modeling method that uses an optimization principle to determine a simple and confident set of regulations that serve as causal explanations for a given set of coexpressions. When the set of coexpressions involves genomewide interactions, LOMBARDE produces an explanatory putative TRN that has some basic topological characteristics close to observed TRNs and is biased to include regulations that have been independently validated experimentally. The LOMBARDE method was illustrated for a E.coli data set, where coexpressions were determined using mutual information under several environmental conditions. LOMBARDE was applied to an a priori graph considering TF/BS affinities recovered by BLAST and MEME/FIMO, and produced a simple and confident explanatory putative TRN that explained most of the observed coexpressions. LOMBARDE discarded a lot of arcs from the initial TRN but interestingly kept most of the independent experimentally validated regulations within it.
Sensitivity analysis showed that LOMBARDE was biased towards validated regulations in all cases of coexpression sets and a priori graphs used as input. Moreover, this method produced better results when the a priori graph was finetuned to the target organism.
We have evaluated LOMBARDE using E.coli as a test case because many regulatory interactions have been validated, making it is suitable for evaluation purposes. LOMBARDE can be applied in a straightforward way to other prokaryote organisms and we expect that the bias towards true regulations will be the same as long as the confidences of the predicted regulations are tuned to the target organism.
In summary, LOMBARDE is a tool that can provide useful insight into the regulatory mechanisms that underlie the phenotypical response of an organism to environmental challenges and it can be used as a reliable tool for further research on genomescale transcriptional regulation studies.
Availability of supporting data
Source code and raw data used for the evaluation are available at http://github.com/anaraven/Lombarde and as Additional file 1. The method can also be run from http://mobyle.inria.cl.
References
 1
Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci USA. 1998; 95(25):14863–8.
 2
Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 006 7:S7. 2006; 7 Suppl 1:7.
 3
Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Largescale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8.
 4
Meyer PE, Kontos K, Lafitte F, Bontempi G. Informationtheoretic inference of large transcriptional regulatory networks. EURASIP J Bioinform Syst Biol. 2007:79879. doi: 10.1155/2007/79879.
 5
Olsen C, Meyer PE, Bontempi G. On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information. EURASIP J Bioinform Syst Biol. 2009:308959. doi: 10.1155/2009/308959.
 6
Rodionov DA. Comparative genomic reconstruction of transcriptional regulatory networks in bacteria. Chem Rev. 2007; 107(8):3467–97.
 7
Guhathakurta D. Computational identification of transcriptional regulatory elements in dna sequence. Nucleic Acids Res. 2006; 34(12):3585–98.
 8
Salgado H, PeraltaGil M, GamaCastro S, SantosZavaleta A, MunizRascado L, GarciaSotelo JS, et al. Regulondb v8.0: omics data sets, evolutionary conservation, regulatory phrases, crossvalidated gold standards and more. Nucleic Acids Res. 2013; 41(D1):203–13.
 9
Grote A, Klein J, Retter I, Haddad I, Behling S, Bunk B, et al. Prodoric (release 2009): a database and tool platform for the analysis of gene regulation in prokaryotes. Nucleic Acids Res. 2009; 37(Database issue):61–5.
 10
Acuña V, Aravena A, Maass A, Siegel A. Modeling parsimonious putative regulatory networks: Complexity and heuristic approach. In: Verification, Model Checking, and Abstract Interpretation. Volume 8318 of the series Lecture Notes in Computer Science. BerlinHeidelberg: Springer: 2014. p. 322–336.
 11
Pruitt KD. Ncbi reference sequence (refseq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2004; 33(Database issue):501–4.
 12
Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, et al. Many microbe microarrays database: uniformly normalized affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008; 36(Database issue):866–70.
 13
Meyer PE, Lafitte F, Bontempi G. minet: A r/bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics. 2008; 9:461.
 14
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. R Foundation for Statistical Computing. ISBN 3900051070.
 15
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. Meme suite: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(Web Server issue):202.
 16
GamaCastro S, Salgado H, PeraltaGil M, SantosZavaleta A, MuñizRascado L, SolanoLira H, et al. Regulondb version 7.0: transcriptional regulation of escherichia coli k12 integrated within genetic sensory response units (gensor units). Nucleic Acids Res. 2011; 39(Database issue):98–105.
 17
Sun J, Tuncay K, Haidar A, Ensman L, Stanley F, Trelinski M, et al. Transcriptional regulatory network discovery via multiple method integration: application to e. coli k12. Algorithms Mol Biol. 2007; 2(1):2–2.
 18
Leclerc RD. Survival of the sparsest: robust gene networks are parsimonious. Mol Syst Biol. 2008; 4:213.
 19
MartínezAntonio A, ColladoVides J. Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol. 2003; 6(5):482–9.
 20
Valente T, Foreman R. Integration and radiality: measuring the extent of an individual’s connectedness and reachability in a network. Soc Networks. 1998; 20(1):89–105.
Acknowledgements
We thank Alejandra MedinaRivera, Heladia Salgado and Julio ColladoVides for useful discussions about the interpretation and use of the RegulonDB database. This work was supported by grants Fondap 15090007, Basal CMM, Fondecyt 1140631 and ANR Biotempo ANR10BLANC0218. We also thank the international cooperation fellowships IntegrativeBioChile INRIA Associative Team and CIRICINRIA Chile.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
This work was conceived and designed by VA, AA, AM, and AS. The experiments were performed by AA. Results were analyzed and paper was written by VA, AA, CG, DE, AS, and AM. All authors read and approved the final manuscript.
Additional file
Additional file 1
Source code and raw data used for the evaluation are available as Additional file 1. (ZIP 1340 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Transcriptional regulatory network
 Coexpression
 Combinatorial graphs