Reconstruction of novel transcription factor regulons through inference of their binding sites

Background In most sequenced organisms the number of known regulatory genes (e.g., transcription factors (TFs)) vastly exceeds the number of experimentally-verified regulons that could be associated with them. At present, identification of TF regulons is mostly done through comparative genomics approaches. Such methods could miss organism-specific regulatory interactions and often require expensive and time-consuming experimental techniques to generate the underlying data. Results In this work, we present an efficient algorithm that aims to identify a given transcription factor’s regulon through inference of its unknown binding sites, based on the discovery of its binding motif. The proposed approach relies on computational methods that utilize gene expression data sets and knockout fitness data sets which are available or may be straightforwardly obtained for many organisms. We computationally constructed the profiles of putative regulons for the TFs LexA, PurR and Fur in E. coli K12 and identified their binding motifs. Comparisons with an experimentally-verified database showed high recovery rates of the known regulon members, and indicated good predictions for the newly found genes with high biological significance. The proposed approach is also applicable to novel organisms for predicting unknown regulons of the transcriptional regulators. Results for the hypothetical protein D d e0289 in D. alaskensis include the discovery of a Fis-type TF binding motif. Conclusions The proposed motif-based regulon inference approach can discover the organism-specific regulatory interactions on a single genome, which may be missed by current comparative genomics techniques due to their limitations. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0685-y) contains supplementary material, which is available to authorized users.

1 System model

High-confidence co-regulated gene set estimation via sparse biclustering
Given the genome-wide expression profiles we want to estimate a group of genes that are putatively co-regulated with the coding gene of a TF of interest. Co-regulation of a pair of genes can be quantified by the expression coherency across the data points (conditions).
Let M ∈ R n×p be the gene expression data consisting of n genes and p conditions (features). Let m i denote the i-th row of M . For a given subset of features (⊆ {1, ..., p}) , the observation pair (m i , m j ) can be represented as a beam around the linear regression line with slope a ij and intercept b ij (e.g., Fig. 1). Then the feature selection problem is to select a subset of features that minimizes the sum of residuals from the best regression. lexA mean lexA std dinI mean dinI std Figure 1 Scatter plot of the features corresponding to expression rates of genes lexA and dinI. The data is obtained from [1].
For the bicluster generated for the i-th observation (e.g., a TF's coding gene), feature selection can be represented by a vector, s i ∈ {0, 1} p , such that s ik = 1 if the feature k is selected, and 0 otherwise. To avoid trivial solutions -not selecting any feature-a sparse penalty term is introduced to the problem, i.e., β 1 1 − s i 1 , where β 1 is the regularizer coefficient and 1 is the vector of 1s. Therefore, the feature selection problem for the i-th gene can be represented as, min si,aij ,bij where s ik ∈ {0, 1}.
To find coherent observations under the conditions given by s i a gene selection vector w i ∈ {0, 1} n is introduced, such that w ij = 1 if there is a strong linear coherence between the observation pair (m i , m j ), and w ij = 0 otherwise. Similarly, another sparse regularizer is added to penalize trivial biclusters, i.e., β 2 1 − w i 1 .
To favor co-up-regulated and co-down-regulated observation samples, an importance matrix D ∈ R n×n×p is introduced, where d ijk ∈ [0, 1] indicates the importance of the k-th feature for the observation pair (m i , m j ) which is computed as a function of the euclidean distance of the k-th feature to the central point (m i , m j ) in the scatter plot. Relaxing w ij ∈ {0, 1} to w ij ∈ [0, 1] and s ik ∈ {0, 1} to s ik ∈ [0, 1], the gene expression biclustering problem for the i-th gene can then be cast as, where w ij ∈ [0, 1], s ik ∈ [0, 1] [2].
By optimizing (2) for a given regulatory gene i and expression matrix M , one can obtain a bicluster where the genes within the bicluster exhibit high linear-coherency with each other. One may expect that the high coherency in gene expression could indicate a common transcriptional activity. However, this correlation may occur either by chance (e.g. resulting from experimental noise) or because of indirect effects underlying the experiments, thereby integration of additional biological evidence into the problem can be crucial. In particular, we can employ the fitness data to check whether the knockout strains of the bicluster genes can show similar responses (fitness) to a particular stress, i.e., we may assume that the co-regulated geneswhen knocked out-are more likely to cause similar growth rates for the organism. Following this intuition, the algorithm looks to find the knock outs that have similar growth patterns under a subset of stress conditions. Let F ∈ R n×q be the fitness data consisting of n genes and q conditions. For the observation pair (f i , f j ) in F we determine the feature selection by whereŝ i ∈ {0, 1}, andβ 1 control the feature selection in F . We optimize the observation selection such that whereβ 2 controls the gene selection andd ij is the importance of the -th feature for the observation (f i , f j ). For a given regulatory gene i, gene expression matrix M , and fitness data matrix F , the biclustering problem can then be written as min wi,si,ŝi,aij ,bij j where (3) is not jointly convex in W , S,Ŝ, A, and B, we propose an iterative procedure for solving it as follows.
Initialization: A is initialized to 1 · 1 T since the genes' expression levels (rows) are normalized to [0,1]. B is normalized as the linear regression line for each observation (m i , m j ) passing through the central point (m i , m j ) with slope a ij . S andŜ are initialized as selecting the feature points whose importance distance to regression line remain within a certain threshold, i.e., s ik = 1 if the k-th data point's distance to the central point (m i , m j ) is larger than some predefined threshold. W is computed from initial values of S,Ŝ, A, and B. Each variable is then updated through optimizing its cost function when the other variables are fixed.
Updating W : When S,Ŝ, A, and B are fixed (3) becomes a convex linear objective of W , i.e., which can be solved as , and w ij = 0 otherwise.
Updating S: When W ,Ŝ, A, and B are fixed, the objective function is a convex (linear) function of S, i.e., min s ik UpdatingŜ: Similarly, when W , S, A, and B are fixed, the objective function is convex inŜ, i.e., Updating A and B: When W , S, andŜ are fixed, A and B can be solved as a least-squares linear regression problem.
The procedure iterates by updating W , S,Ŝ, A, and B until the objective function converges. After convergence, we check if the bicluster have sufficient members (intuitively between 5-50 genes) to discover a regulatory motif from their upstream sequences. Before motif discovery, a pre-processing may be necessary for refining those inputs. As one can expect, the obtained high-confidence set may contain false positive genes and the corresponding upstream sequences can thereby deteriorate the motif estimation. Even some true positive genes may lack the corresponding regulatory sites in their upstream sequences which makes the motif finding problem ill-defined. Next, we will demonstrate a filtering procedure to eliminate those potentially uninformative genes from the obtained set.

Filtering high-confidence gene set
We aim to eliminate genes that presumably lack the regulatory patterns in their upstream, which is generally observed in gene groups that belong to the same operon. If the reference knowledge about the TF is available (e.g., consensus sequence, motif, etc.) we could easily detect such genes by scanning their upstream regions. However, for novel genomes such prior information is usually not available. On the other hand, for most sequenced genomes the operon predictions data are available. In such data, the adjacently co-regulated genes are grouped into operons whereby the common regulatory sites are located in the upstream of the first-in-operon genes. Using this information, an intuitive elimination procedure could be first including only the first-in-operon genes and then sequentially filtering in the downstream genes as necessary. Our filtering procedure is outlined as follows.
Assume that the high-confidence set H consist of the genes appearing in three different operons named as a, b, and c, e.g., where a 1 is the first-in-operon gene in the operon a and a 2 is the second-in-operon gene, and so forth for the other operons b and c.
• Determine the first-in-operon genes F = {a 1 , b 1 , c 1 } and the downstream genes D = {a 2 , a 3 , b 2 , c 2 , c 3 } in the high-confidence set by using the operon predictions database [3]. Start the new high-confidence set with only the firstin-operon genes, e.g., H = F = {a 1 , b 1 , c 1 }. • For each operon in D, calculate the difference S in co-expression score between a pair of adjacent genes, where the co-expression score is defined as "the coherence of gene expressions (correlation) with the given TF gene", e.g., S(a 2 ) = score(a 1 ) − score(a 2 ), where score(x) = Corr(x, T F gene). • Add the downstream genes with score S < threshold, and check if (i) there are ≥ 10 genes in H or (ii) 50% of the downstream genes are filtered in. • If (i) or (ii) satisfies, return H. Otherwise, continue adding more downstream genes by incrementing the threshold within a given range then comparing the next set of gene pairs in the downstream (see Algorithm 1).

Algorithm 1 Downstream gene filter
Require: set H = F , and D 0 = D Require: set i = 2, and threshold = min(thresholdRange) 1: for every operon x ∈ D, e.g., Setting a sensible thresholdRange is important, e.g., starting with a negative value may be intuitive. A negative S occurs when the gene's correlation score is higher than that of a preceding gene, which indicates that the gene is likely directlyregulated by a TF and the regulatory pattern could be present in the corresponding gene's upstream region.

Motif discovery with BAMBI2b
Genes regulated by the same TF usually share a common nucleotide pattern in their promoter regions where the TF specifically recognizes this pattern and binds. The locations of such TF binding sites and the properties of the corresponding motif can be computationally derived by motif discovery algorithms. Here, we employ the Bayesian Algorithm for Multiple Biological Instances of motif discovery -BAMBI [4]-to estimate the unknown motif for the putatively co-regulated genes obtained in the previous section.
BAMBI is designed to discover nucleotide motifs via estimating the number, length and locations of the binding instances in a collection of sequences, where each sequence may contain one, several or no instances of the motif. The algorithm is based on the sequential Monte Carlo (SMC) technique which processes one input sequence at a time efficiently in a sequential manner. The system is modeled as an HMM, where the hidden state at time t corresponds to the current input's state vector x t composed by the number of motif instances n t and their locations in the current sequence (observation) s t . As the iteration proceeds, given t sequences S t = {s 1 , . . . , s t }, BAMBI aims to identify the number of motif instances and their locations for each sequence, i.e., X t = {x 1 , . . . , x t }. Since the distribution of the state vectors given observations p(X t |S t ) is not known and the samples are usually not available for this posterior, an approximation is implemented by taking weighted samples (particles), X k t , k=1, . . . , K, from a trial density q(X t |S t ) within the sequential importance sampling (SIS) framework [5]. The sequential algorithm is obtained by setting q(X k t |S t ) = q(X k t−1 |S t−1 )q(x k t |X k t−1 , S t ), and computing the sample weights A resampling procedure [6] is employed to reduce the increasing variance of ineffective samples with very small weights resulting from the phenomenon known as the degeneracy problem [7]. The length-M motif is described as a 4×M PWM, i.e., θ = [θ 1 , . . . , θ M ], where θ j , j = 1, . . . , M , represents the unknown nucleotide distributions (i.e., A, C, G, T/U) in each position of the motif that will be estimated. Since each promoter sequence can have one, several or no binding site (instance) of the motif their abundance also needs to be estimated. Assuming N instances per sequence as the upper bound, the distribution of the number of instances is described by an unknown vector λ = [λ 0 . . . λ N ], where λ j is the proportion of sequences with j instances of the motif.
TF binding sites have intrinsic properties that should be taken into account in the model, such as the location of conserved bases in the half-sites and their symmetry. We modified BAMBI to incorporate such properties of the TFBS. In particular, we can use the general two-block model [8] to describe the palindromic or direct/inverted repeat patterns in the half-sites. Estimating the location and size of the core segments will be introduced later under the "Two-block model " section. To model the motif symmetry we consider three types, i.e., (i) palindromic, (ii) inverted-repeat and (iii) direct-repeat symmetry. We represent the distribution of the symmetry of instances as an unknown vector, i.e., σ = [σ 0 , σ 1 , σ 2 ], where σ 0 represents the proportion of instances with palindromic symmetry, and σ 1 (σ 2 ) represents the proportion of instances with inverted(direct)-repeat symmetry.
At each step t, given the first t sequences S t the sequential algorithm estimates the number of instances in each sequence and their locations, X t , within the Bayesian framework. The motif matrix θ is modeled as Dirichlet random vectors. The distribution of the number of instances λ and the distribution of the symmetry of instances σ are also modeled as Dirichlet random vectors. A sufficient statistic α t is used to characterize the distribution of θ, and it can be updated sequentially, i.e., where a t,xt is the subsequence of M letters in the sequence s t whose location is given by x t . Similarly, a sufficient statistic is selected for the distribution of the number of instances λ which is Dirichlet distributed with parameters γ t = [γ t 0 . . . γ t N ]. The sequential update is obtained by γ t = γ t−1 + j(x t ), where j(x t ) is a vector of zeros except for a 1 indicating the number of instances of the motif in the t-th sequence. A sufficient statistic for the distribution of the symmetry of instances σ is also defined. σ is a Dirichlet distribution with parameters δ t = [δ t 0 δ t 1 δ t 2 ] which can be updated sequentially by δ t = δ t−1 + u(a t,xt ), where I {.} is the indicator function which takes the value 1 if a nucleotide pair observed in a t,xt obeys the given symmetry type, palindromic, inverted-repeat, or direct-repeat, and the operator () takes the Watson-Crick complement of the given letter. The influence of these variables can be averaged out in the SIS procedure, whereby the measurement model depends on θ and σ, and the state transition depends on λ. The importance distribution can be chosen with q( Then, for each particle, k=1, . . . , K, the importance weight can be updated as Given the PWM θ, the background distribution θ 0 , the state x t at time t, and the symmetry distribution σ, we can express in closed-form the likelihood of the sequence s t in (8) as where θ n = 4 j=1 θ nj j , n(a)=[n 1 , . . . , n 4 ] with n j , j=1, . . . , 4, being the number of times the j-th letter appears in the instance a, and a c t,xt is the sequence in s t resulting from removing the instance a t,xt .
The last term in (10) is the likelihood of the instance a t,xt given the PWM θ, and the motif's underlying symmetry σ. The influence of the symmetry can be averaged out by computing the integral p(a t,xt |θ) = p(a t,xt |θ, σ)p(σ|θ)dσ. For a given instance a t,i , as a measure of dissimilarity (asymmetry) we use the Kullback-Leibler (KL) divergence between the PWM's base variations indicated by the symmetry type observed in a t,i . Since the KL divergence is a non-symmetric distance measure we define a symmetrized similarity metric and write the closed-from expression for p(a t,xt |θ) as follows.
p(a t,xt |θ) = p(a t,xt |θ, σ)p(σ|θ)dσ = p(a t,xt , θ, s)E[σ s ] where and s=0, 1, 2 refers to the given symmetry type, i.e., palindromic, inverted-repeat, and direct-repeat, respectively. We normalize these cross entropy measures by the number of symmetric bases in a t,xt , i.e., u s (a t,xt ). By restricting the calculation to only block sites B, i.e., m ∈ B ⊂ {1, . . . , M/2}, the density is then multiplied by us(at,x t ) 1+|B| and log-normalized to average out the influence of the block length |B|, which replaces (11) by p(a t,xt |θ) where τ is a constant to avoid taking log(0) and serves as a tolerance threshold for the asymmetric instances, i.e., the larger is τ the algorithm is more likely to sample an (asymmetric) instance that possess a larger cross-entropy. By using (12) the expectation term in (10) can be approximated as The second integral in (8) is computed analogously.
where p(i 1 . . . i n |n) is distributed uniformly. Given a set of promoter sequences S t = {s 1 , . . . , s t }, the BAMBI2b algorithm for discovering symmetric two-block motifs is then summarized as follows.
For each sequence, and for each particle, • Construct the importance distribution by computing (8,10,13,14) and enumerating all possible sample extensions • Draw sample x k t from the importance distribution and set and update the particle weight using (9); • Update the sufficient statistics α t , γ t , and δ t ; then, resample if needed. Two-block model The bacterial TF binding sites are usually symmetric in sequence where the half-sites obey a palindromic or direct/inverted repeat symmetry. The more informative bases in each half-site are generally conserved in the middle. One can model such motif structure as a pair of informative symmetric sequences (blocks B) separated by a run of uninformative bases (gap), and a pair of uninformative flanker sequences (F) located at each end (Fig. 2). To estimate this structure within the SMC framework we used the class-based resampling scheme presented in [6] and augmented the state vector as  The class-based resampling procedure is summarized as follows.
-Choose the number of particles for each class according to a multinomial distribution with parameters -For every class ∀γ ∈ Λ M,B,F , if P γ < P thresh then set P γ = P thresh .
-Reduce the number particles starting from the class with the largest number particles until γ P γ = P . -For every class draw P γ new particles from the set of previous particles with probabilities proportional to their weights. -Assign equal weights to each new sample within a class, w k t =P (m, b, f |S t )/P m,b,f .

Optimal data feeding
Because of the sequential nature of BAMBI2b, the order in which the sequences are utilized have a direct impact on the accuracy of estimation. An optimal data feeding should then be defined in order to prioritize the most useful sequences to supply to BAMBI2b. The downstream genes can be supplied in the same order in which they are filtered in by Algorithm 1.
To rank-order the first-in-operon genes F, we can use a similar approach. The genes that are expressed more coherently with the given transcription factor may indicate a proximity to the TF's binding instances. Following this intuition, we feed the upstream sequences of the first-in-operon genes {x ∈ F} in a decreasing order based on the correlation scores, i.e., score(x) = Corr(x, T F gene), as follows. Let M ∈ R |F |×p be the gene expression matrix corresponding to |F| high-confidence genes and p different conditions. Suppose that m i ∈ R p is the vector of expression rates for the given transcription factor gene located in the i-th row of M . The correlation scores for the genes x = 1, . . . , |F|, are calculated by Pearson's correlation where m i and m x are empirical means of the expression vectors m i and m x of the genes i and x, respectively. The upstream sequences are then arranged in the order of descending absolute scores |score(x)| of the corresponding genes in F.