Mathematical modeling of the structural a priori
We first introduce our notations before detailing our structural models and variational formulation. Let G represent the total number of genes for which expression data is collected. Expression data is gathered in a symmetric weighted adjacency matrix \(\mathbf {W} \in \mathbb {R}^{G\times G}\). Its (i,j) element corresponds to a statistical measure reflecting the strength of the link, or information shared, between the expression profiles of gene i∈{1,…,G} and gene j∈{1,…,G}. Our approach uses non-negative weights. A convenient choice for ω
i,j
is the normalized mutual information (ω
i,j
∈ [ 0,1]) computed between the expression profiles of genes i and j.
Let \(\mathcal {G}(\mathcal {V},\mathcal {E})\) be a fully connected, undirected and non-reflexive graph where \(\mathcal {V} = \{v_{1}, \ldots v_{G}\}\) is a set of nodes (corresponding to genes), and \(\mathcal {E}=(e_{i,j})_{(i,j)\in \mathcal {V}^{2} | i {<} j }\) is a set of edges (corresponding to plausible gene interactions). Each edge e
i,j
is weighted by the value ω
i,j
from matrix W. The initial number of gene-to-gene edges of \(\mathcal {G}\), denoted by ε, is equal to G(G−1)/2. Inferring a GRN from \(\mathcal {G}\) aims to construct a final graph selecting a subset of edges \(\mathcal {E}^{*} \subset \mathcal {E}\) which reflects true gene regulatory processes. We formulate the search for this graph by computing an edge indicator vector x∈{0,1}ε whose components x
i,j
are such that
$$ x_{i,j} = \left\{ \begin{array}{ll} 1 & \text{if}\,\, e_{i,j} \in \mathcal{E}^{*}, \\ 0 & \text{otherwise}. \end{array}\right. $$
((1))
We assume in this work that a list of putative transcription factors is available. A gene supposed to code for a transcription factor is metonymically denoted by TF. A gene not identified with this property is designated by \(\overline {\text {TF}}\). The \({\text {TFs}}/\overline {\text {TF}}\mathrm {s}\) notation defines two complementary subsets of the ensemble of genes \(\mathcal {V}\). Subsequently, \(\mathcal {T} \subseteq \mathcal {V}\) denotes the set of putative TFs. We consider that regulation is implicitly oriented from TF toward \({\overline {\text {TF}}}\) genes, and do not infer edge directions between TF-TF links. Assuming that significant edges have stronger weights ω
i,j
, we wish to maximize the sum of weights, while expressing our structural a priori in the inference model. To that goal, the edge labeling problem is formulated as the min
$${} \footnotesize{\underset{\substack{{\mathbf{x} \in \{0,1\}^{\epsilon}}}}{\text{\text{minimize}}}\underbrace{\sum_{\substack{(i,j) \in \mathcal{V}^{2} \\ i < j \\ \vphantom{i < j}}} \omega_{i,j}|x_{i,j}-1|}_{\substack{ \text{favors strongly}\\\text{weighted edges}}} +\!\! \underbrace{\sum_{\substack{(i,j) \in \mathcal{V}^{2} \\ i<j \\ \vphantom{i < j}}} \lambda_{i,j} x_{i,j}}_{\substack{\text{favors} {\text{TF}}\textrm{-}{\overline{\text{TF}}} \\ \text{~~edge presence}}} \!\!\,+\,\!\!\! \underbrace{\sum_{\substack{i \in {\mathcal{V}\setminus \mathcal{T}},\\ (j,j') \in \mathcal{T}^{2} \\ j< j'}} \rho_{i,j,j'} |x_{i,j} - x_{i,j'}| }_{\substack{\text{enforces regulatory}\\ \text{~~~~~relationships coupling}}}.} $$
((2))
Let us comment the first term in the above functional. In order to select edges of strong weights ω
i,j
, the first term reflects a biological data fidelity term. It represents a gene-to-gene edge deletion cost. Thus, if ω
i,j
is large (respectively, small), its edge deletion cost is high (respectively, low), disfavoring (respectively, favoring) its deletion. We now explore the two last penalty terms of (2) corresponding to our biologically-related structural a priori regularization.
The second term counterbalances the first one. Independently from the fact that actual TF genes are less numerous than \({\overline {\text {TF}}}\) genes, regulatory relationships between couples of TFs are expected to be less frequent than between one TF and one \({\overline {\text {TF}}}\). This expectation may promote biological graphs with a modular structure [19, 20]. An illustration is presented in Fig. 1. As we are looking for gene regulatory knowledge, we infer edges linked to at least one TF. In addition, we want to favor the preservation of \({\text {TF}}\textrm {-}{\overline {\text {TF}}}\) edges over TF-TF links. This edge selection capability is driven by positive weights λ
i,j
. Their values depend on the three types of pairs of nodes i and j. We define these case-dependent weights as follows:
$$ \lambda_{i,j} = \left\{ \begin{array}{ll} 2\,\eta & \text{if}~ i \notin \mathcal{T} \text{and}~ j \notin \mathcal{T}, \\ 2\, \lambda_{\text{TF}} & \text{if}~ i \in \mathcal{T} \text{and}~ j \in \mathcal{T}, \\ \lambda_{\text{TF}} + \lambda_{\overline{\text{TF}}} & \text{otherwise.} \end{array} \right. $$
((3))
Hence, \({\overline {\text {TF}}}\textrm {-}{\overline {\text {TF}}}\) edges have weights assigned to 2η. The parameter λ
TF (respectively, \({\lambda _{\overline {\text {TF}}}}\)) acts in the neighborhood of TF genes (respectively, \({\overline {\text {TF}}}\) genes). They may be interpreted as two threshold parameters. This double threshold promotes grouping between strong and weaker edges among functionally-related genes. A similar approach is used in image segmentation [21] to enhances object detection with reduced sensitivity to irrelevant features [22]. To promote \({\text {TF}}\textrm {-}{\overline {\text {TF}}}\) interactions, the λ
TF parameter should be greater than \({\lambda _{\overline {\text {TF}}}}\). To ensure that any TF involved interaction is selected first, we should verify that \(\eta \geq {\lambda _{\text {TF}}} \geq {\lambda _{\overline {\text {TF}}}}\). Additionally, removing all \({\overline {\text {TF}}}\textrm {-}{\overline {\text {TF}}}\) edges amounts to setting their corresponding x
i,j
to zero. Consequently, η should exceed the maximum value of the weights ω. Since we address different data types and input weight distributions, we can easily renormalize them all to ω
i,j
∈ [ 0,1], and choose η=1. When \({\lambda _{\text {TF}}}={\lambda _{\overline {\text {TF}}}}\), no distinction is made on the type of edges. This is equivalent to using a unique threshold value, as in classical gene network thresholding. This can be interpreted as if, without further a priori, all genes were indistinguishable from putative TFs. However, different λ
TF and \({\lambda _{\overline {\text {TF}}}}\) may be beneficial. We indeed show in the Additional file 1 that for any fixed value of λ
TF, smaller values for \({\lambda _{\overline {\text {TF}}}}\) improve graph inference results. A simple linear dependence \({\lambda _{\text {TF}}} = \beta {\lambda _{\overline {\text {TF}}}}\), with β≥1 suffices to define a generalized inference formulation encompassing the classical formulation. We fixed here β as a parameter based on the gene/TF cardinal ratio: \(\beta = \frac {|\mathcal {V}|}{|\mathcal {T}|}\). This choice is consistent when no a priori is formulated on the TFs (i.e. all genes are considered as putative TFs). Hence, β=1 and \({\lambda _{\text {TF}}}={\lambda _{\overline {\text {TF}}}}\). As mentioned above, without knowledge on TFs, we recover classical gene network thresholding. The λ
i,j
parameter now only depends on a single free parameter \({\lambda _{\overline {\text {TF}}}}\), similarly to the large majority of inference methods requiring a final thresholding step on their weights.
Finally, the third term of the proposed functional aims to enforce a regulator coupling property (see Fig. 2). If two transcription factors are co-expressed, and co-regulate at least one gene, we consider plausible that any gene regulated (respectively non regulated) by one of these TFs is regulated (respectively, non regulated) by the other TF. We quantitatively translate the co-expression of two TFs j and j
′ by \(\phantom {\dot {i}\!}\omega _{j,j^{\prime }}>\gamma \), where \(\gamma \in \mathbb {R}^{+}\) is a threshold reflecting the strength of the co-expression between j and j
′. Similarly, the regulation of a \({\overline {\text {TF}}}\)
k by a TF j (respectively, j
′) is numerically expressed by ω
j,k
>γ (respectively, \(\phantom {\dot {i}\!}\omega _{j^{\prime },k}>\gamma \)). We define γ from robust statistics [23] as the (G−1)th quantile of the weights. We thus choose the coupling parameter as:
$$ \rho_{i,j,j'} = \mu \frac{\sum\limits_{k \in {\mathcal{V}\setminus (\mathcal{T} \cup\{i\})}} 1(\min \{\omega_{j,j'}, \omega_{j,k}, \omega_{j',k}\} > \gamma)}{|\mathcal{V} \setminus \mathcal{T}|-1}, $$
where 𝟙 is the characteristic function (equals to 1 when the condition in argument is satisfied and 0 otherwise) and μ≥0 is a regularization parameter controlling the impact of the third term on the global cost. The proposed numerator counts the number of \({\overline {\text {TF}}}\) genes co-regulated by j and j
′. As we exclude the gene i, the maximal number of \({\overline {\text {TF}}}\) genes co-regulated by j and j
′ equals \(|\mathcal {V} \backslash \mathcal {T}|-1\). Hence, using the latter quantity as the denominator casts the \(\phantom {\dot {i}\!}\rho _{i,j,j^{\prime }} / \mu \) parameter as a co-regulation probability relative to couples of TFs (j,j
′). The greater the co-regulation probability, the stronger the influence of the third term. This penalty requires that at least two \({\overline {\text {TF}}}\) target genes exist (hence, the denominator does not vanish). Otherwise, when \(|\mathcal {T}| = |\mathcal {V}|\), we set μ=0.
We now turn our attention to the strategy for computing an optimal labeling vector x
∗ solution to Problem (2).
Optimization strategy
By using elements from graph theory, we now explain how a maximum flow algorithm can solve Problem (2). It relies on the maximum flow/minimum cut duality [24]: the computation of an optimal edge labeling minimizing (2) can be performed by maximizing a flow in a network G
f
.
A flow (or transportation) network \(\mathcal {G}_{f}\) is a directed, weighted graph including two specific nodes, called source (a node with 0-in degree) and sink (a node with 0-out degree), respectively denoted by s and t. We recall that the degree of a node is defined as the number of edges incident to that node.
We now introduce the concept of flow in the transportation network \(\mathcal {G}_{f}\). A flow function f assigns a real value to each edge under two main constraints: the capacity limit and the flow conservation. The capacity limit constraint entails that the flow in each edge has to be less than or equal to the capacity (i.e. the weight) of this edge. If the flow equals the capacity, the edge is said saturated. The flow conservation constraint signifies that, at each node, the entering flow equals the exiting flow. Subject to these two constraints, the aim is to find the maximal flow from s to t in the flow network \(\mathcal {G}_{f}\). According to the graph construction rules provided by [25], the flow network for solving Problem (2) is composed of:
-
A set of ε nodes a
i,j
with (i,j)∈{1,…,G}2, i<j, linked to the source s with edges of weight ω
i,j
. Each node is associated with a label x
i,j
.
-
A set of G nodes v
k
of labels y
k
with k∈{1,…,G}. The nodes v
k
is linked to the previously defined node a
i,j
if k=i or k=j. If such an edge exists, a weight λ
k
is thus assigned. In reference to (3), the weight λ
k
equals η, λ
TF or \({\lambda _{\overline {\text {TF}}}}\), according to the nature of the node a
i,j
(corresponding to the edge e
i,j
in the initial network \(\mathcal {G}\)).
-
A set of q edges, linking nodes a
i,j
to \(\phantom {\dot {i}\!}a_{i,j^{\prime }}\) for which the regulator coupling property is satisfied, with weights equal to \(\phantom {\dot {i}\!}\rho _{i,j,j^{\prime }}\).
-
An additional set of G edges, linking nodes v
i
, i∈{1,…,G} to the sink node t.
Figure 3 illustrates this graph construction on a small-size example. Computing a maximum flow from the source to the sink in this flow network saturates some edges, thus splitting the nodes a
i,j
into two different groups: nodes that are reachable through a non saturated path from the source, and those that are not. Assuming that the source node s is labeled with 1, and the sink node t is labeled with 0, binary values are thus attributed to the edge labels x
i,j
(secondarily, binary values are also assigned to the y labels of nodes v in the flow network), and this final labeling returns the set of selected edges \(\mathcal {E}^{*}\) which minimizes (2). We use the C++ code implementing a max-flow algorithm from [26].
Problem dimension reduction
As explained in the previous section, the optimal solution to the minimization problem (2) may be obtained via a maximum flow computation in a network generated from the whole original graph. In practice, many parameters \(\phantom {\dot {i}\!}\rho _{i,j,j^{\prime }}\) have zero values. So rather than building 0-valued edges in the flow network, reducing the dimension of this network is judicious. Indeed, if \(\phantom {\dot {i}\!}\rho _{i,j,j^{\prime }} = 0\) for all \(j' \in \mathcal {T}\), the optimal label of x
i,j
is given by the explicit solution
$$ x_{i,j} = \left\{ \begin{array}{ll} 1 & \text{if } \lambda_{i,j} \leq \omega_{i,j}\\ 0 & \text{otherwise.} \end{array} \right. $$
((4))
This formula also provides a better insight into the role played by thresholding parameters λ
i,j
. We now have a fast optimization strategy to generate a solution to the proposed variational formulation. One of the advantages of employing the BRANE Cut algorithm is the optimality guaranty of the resulting inferred network with respect to the proposed criterion. We next describe quantitative gains that can be achieved using BRANE Cut.