Skip to main content
  • Methodology Article
  • Open access
  • Published:

Sequence information gain based motif analysis

Abstract

Background

The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions.

Results

This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such Qresiduals projections or information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show how, in 70 % of the studied Transcription Factor Binding Sites, the SIGMA detector has a better performance and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions.

Conclusions

Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model of the cis-regulatory sequences detection based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance disregarding the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.

Background

The information encoded in genetic sequences is expressed by means of a gene regulation process, which begins with a gene transcription step. The binding between specific proteins and their target sites in DNA is a key step in the control of the transcription process. These proteins – transcription factors (TF) – recognise specific motifs in DNA known as Transcription Factor Binding Sites (TFBS) or cis-regulatory sequences. The prediction, identification and detection of cis-regulatory sequences is a key factor in understanding gene regulation and in inferring regulatory networks [1, 2]. TFBS are usually very short (5 to 20 base pairs long) and highly degenerate, which gives rise to an extremely difficult identification problem due to low statistical power, as short sequences are expected to occur at random every few hundred base pairs. Due to their high variability, a consensus sequence approach for detection is insufficient. There is also evidence that this variability exhibits correlation between positions among the regulatory sequence [3, 4], and that this correlation could contain information which would help reduce the false positive rate and increase the sensitivity of a detector [5].

Due to the importance of identifying cis-regulatory sequences, much effort has been devoted to mapping the binding sites for a large set of transcription factors. An important recent project is the ENCODE (Encyclopedia of DNA Elements) project, which has been able to map 4 million regulatory regions in the human genome, opening new possibilities for computational methods [6]. Motif detection methods may be classified in different ways, depending on the approach adopted. Some reviews focus on the biology of motif discovery in regulatory regions [7, 8], whereas other publications focus more on the representation of the motifs: consensus-based methods and alignment-based methods [9]: consensus-based methods use word algorithms which consider binary hit/no-hit values [10, 11], and alignment-based methods use a set of alignment sequences with binding evidence to assign putative motifs to a candidate sequence. These latter methods could be classified as either numerical or stochastic models: numerical models are based on a mathematical representation of the nucleotides, whereas stochastic models, which are probably the most popular methods, are based on Position Weight Matrices (PWM) or Position Specific Weight Matrices (PSWM) [12]. A PWM is a matrix of scores corresponding to the frequency of the sequence symbols for each binding site position. The PWMs allow the capture of the variability over a sequence of nucleotides from a set of binding site positions [13], although there is the implicit assumption of independence between the residues of the aligned sequence matrix. PWM representations have been used in several algorithms to discover over-represented patterns from candidate sequences [14].

As noted above, statistical studies have shown the dependence among binding site positions variability. The common strategies for incorporating these dependencies within motif detectors include the extension of the PSSM approach to include pairs of correlated positions [15, 16], m th order Markov chains (HMM) [17, 18] and Bayesian Networks [1922]. HMM can model the position interdependencies as long as high order HMMs, or a Bayesian approach are used but, in order to train any of both methods model sufficiently well, a huge training set of sequences would be required (± 1000 or more sequences per model).

A popular method, based on some of the previous work, is MEME/MAST, which provides an improved detection performance [23]. MAST is part of the MEME suite and uses a Q-FAST algorithm for finding motifs. Although these strategies may perform well in some datasets, they have shown certain limitations in the number of dependencies which may be considered between positions, in their ability to model dependencies between more distant positions, and in the large number of parameters which need to be adjusted in the models [3].

Previous work by our group proposed a parametric detector using the Rényi Entropy for binding site detection [24]. This measurement allowed us to build variable-sensitivity detectors modulated by the Rényi order – this assumed independence between binding site positions. A first approximation for modelling the correlation among binding site positions, known as Qresiduals, used a linear embedding to represent the set of binding site sequences [5] and employed a residuals-based approach as the detection statistic. Other non-related work modelled the pure correlation between binding site positions through non-linear correlations based on the variation of mutual information [25].

Statistical pattern recognition has also been applied to identification of sequence motif. Luo et al. [26] propose to use discriminant analysis for the prediction of Transcription Start Sites (TSS). From non-parametric measure, similar to Shannon information, Luo et al. [26] provide information about the variance observed in the dataset. This strategy has good performance for the binding motif detection when the motif positions are not correlated among them. But, this measurement does not allow modelling the dependencies among motif positions.

In this paper, we propose a generalisation of a non-linear model based on Information Theory, which allows modeling DNA contact by the protein and the biological interaction among binding sites using a small training set of sequences (5–50 sequences model). This new approach aims at a trade-off between the good generalisation properties of pure entropy methods and the ability of position-dependency metrics to improve detection power.

The performance of the proposed detector method, named SIGMA (Sequence Information Gain based Motif Analysis), is compared with different computational methods for binding site detection: MEME/MAST [23], Biostrings [27], MotifRegressor [28], Qresiduals [5] and a previously published set of algorithms based on information theory [24, 25].

Methods

The information gain has been measured for each TFBS by means of two parametric uncertainty estimators. The rationale is based on the idea that the total information gain of a set of true TFBS aligned sequences will change according to the similitude of the new candidate sequence to that set (Fig. 1). The first estimator measures the total amount of information change produced by assuming position independence, whereas the second estimator measures the total amount of change of per-position mutual information (capturing pure correlation among binding site positions). Both estimators are computed by a parametric uncertainty measurement.

Fig. 1
figure 1

Information gain space defined by means of the variation on the information. X-axis on the graph shows the total amount of information change produced by assuming position independence. Y-axis shows the total amount of information change produced by assuming the correlation among positions. (black square symbol) Training matrix, (red square symbol) Training matrix with genomic sequence, (green square symbol) Training matrix with binding sites sequence. The broken line is the decision boundary

Let us consider a set of I aligned sequences (s i ) with binding evidence M={s i ,i=1,,I}, and the same set including a candidate sequence \(s_{c}, S = s_{c} \cup M\). Following Fig. 1, let a be the coordinate corresponding to the set M, with axes determined by the two measures previously mentioned. When a new candidate sequence is considered in S, both measures will vary to b or g depending on the nature of the candidate sequence. When the candidate sequence is a binding site sequence, (b,) the variation on the information will be not significant. However, when the candidate sequence is a genomic sequence, (g), the amount of information will vary significantly. With a sufficient training set, this information gain space can be split in two regions, genomic and binding, by means of a simple discriminant analysis which will define a decision boundary, as highlighted as a dashed line in Fig. 1. The decision boundary shape is the result of applying non-linear function.

Information content measures

We have employed as parametric uncertainty measurements the Rényi entropy and Rényi Divergence (also called α-Divergence) [29], which are defined as:

$$ H_{q}(X) =\frac{1}{1-q}{log_{2}{ \sum^{4}_{i=1} p(X_{i})^{q} }} $$
((1))
$$ D_{q}(X;Y) =\frac{1}{q-1}log_{2}{\sum^{4}_{i=1}\sum^{4}_{j=1}P\left(X_{i},Y_{j}\right)^{q}}Q\left(X_{i},Y_{j}\right)^{1-q} $$
((2))

where X i and Y j are the nucleotides {A, T, C and G} at different DNA sequence positions, P(X,Y)=p(X,Y),Q(X,Y)=p(X)p(Y) and the q is the Rényi order which modulates the probability of occurrence of each symbol. p(X,Y) is the joint probability of X and Y, p(X) and p(Y) are the marginal probability. Both measurements (H q (X) and D q (X;Y)) depend on q which is a positive real number (q≠1)and both are non-negative for all q≥0. This parametrisation allows the building of a variable-sensitivity detector exploiting the statistical properties of the Redundancy, R, where R is defined as [24].

The measurement of the variation when the candidate sequence is added to the set has been computed using two heuristic functions, see (Eqs. 3 and 4). These functions depend on two parameters, γ and ω, which measure the difference between redundancies, Eq. 5, and divergence, Eq. 6, between the set of aligned sequences without the candidate sequences, s i , and with candidate sequence, M. These are estimated as described in Maynou et al. [24].

$$ \rho (q,M) = \left|\sum_{i=1}^{L} R_{q}^{M_{i}} {\gamma}_{i} \right|^{-1} $$
((3))
$$ \eta(q,M)= \left|\sum_{i=1}^{L}|R_{q}^{M_{i}}| {\omega}_{i} \right|^{-1} $$
((4))

where, γ i and ω i are

$$ \gamma_{i}= \left|R_{q}^{M_{i}}-R_{q}^{S_{i}}\right| $$
((5))
$$ \omega_{i}= \left|D_{q}^{M_{i}}-D_{q}^{S_{i}}\right| $$
((6))

where L is the number of nucleotides in the binding region, M is the aligned set of sequences with binding evidence and i is a specific column of M. \({R_{q}^{M}}\) is the redundancy, normalized depending on the maximum entropy on the set of aligned sequences, whereas \({R_{q}^{S}}\) contains the equivalent parametric entropy when the candidate sequence is assumed to belong to the set. The redundancy profile is a L-dimensional vector, where L is the total number of positions of the binding site. \({D_{q}^{M}}\) is the divergence matrix of the set of aligned sequences and \({D^{S}_{q}}\) is the divergence matrix considering the training matrix with the candidate sequence. The main diagonal is set to zero in each of these matrices, \({D_{q}^{M}}\) and \({D^{S}_{q}}\). The variation in the information is therefore calculated by means of γ and ω and q-values are optimised at the validation stage within the range (0,2]. As q increases, the noise included in the redundancy signal also increases [24]. From q-values higher than 2, signal-to-noise ratio is not optimal.

For a genomic sequence, the order of the system will decrease the values of γ and ω, whereas for a binding sequence the order of the system will not be altered substantially. Each candidate sequence will therefore be characterised by the pair (x=(ρ,η)) and classified as genomic or binding by means of a Quadratic Discriminant Analysis (QDA), as shown in Fig. 6. The decision boundary, H(y), is defined from the distribution of the variation on the information, x, for each class, genomic or binding, in the information gain space.

Binding site detection by means of the SIGMA algorithm can be summarized as follows, see Fig. 2:

  1. 1.

    Given a set of aligned sequences with binding evidence M, estimate the redundancy profile \({R^{M}_{q}}\) and the Rényi Divergence \({D_{q}^{M}}\) Eqs. (1) and (2).

    Fig. 2
    figure 2

    The essential steps in the training and detection process are shown for the SIGMA algorithm

  2. 2.

    Given a new candidate sequence, re-estimate both values assuming the candidate sequence belongs to M, \({R^{S}_{q}}\) and \({D^{S}_{q}}\),.

  3. 3.

    Compute the variation on the information x=(ρ,η) as defined in Eqs. (3) and (4).

  4. 4.

    Quadratic Discriminant Analysis is applied to the information gain space from the set of computed features.

  5. 5.

    Steps 3 and 4 are iterated over for each candidate sequence.

Additionally, for characterisation of the results we define a heuristic magnitude C, related to the Complexity of M, in order to characterise the degree of pure correlation between the variability of binding site positions in M, see (Eq. 7). C computes element by element the ratio between divergence value, where D q | i,j is the element of D q at row i and column j, and maximun entropy, H max without to considerer the main diagonal. The average of the ratios define the complexity of M.

$$ C= \frac{\sum_{i,j=1}^{N}D_{q}|_{i,j}}{N*(N-1)*H_{max}}\,;\quad i \neq j $$
((7))

where D is the parametric uncertainty measurement considered, N is the size of the binding sites, q is the Rényi order and H max is the maximum entropy for the set of probabilities p(X) and p(y), see section “Database description”. C is a value between 0 and 1. When C is close to 1, the degree of correlation among binding site positions is high.

Database description

Data has been obtained from the Jaspar database [30], http://jaspar.genereg.net/ (see Tables 1 and 2).

Table 1 Summary of the transcription factors analysed for the Homo sapiens organism obtained from Jaspar database
Table 2 Summary of the transcription factors analysed for the Mus musculus organism from Jaspar database

The JASPAR Core provides non-redundancy and high-quality alignment matrices for each transcription factor [30]. Results have been computed with background genomic sequences from the Eukaryotic Promoter Database (EPD) [31], using the EPD version based on the EMBL release 105 (sept 2010). The background loci chosen were E P74078(+)H s R P S9P2+ for Homo sapiens and E P07119(+)M m I g k0M P C11 for Mus musculus.

Optimization

To apply SIGMA methodology to TFBS detection over genomic sequence, we should calculate the variation of the information, Eq. 4, as many times as the length of the sequence I (typically millions nucleotides). Given a sequence position, we must calculate the divergence between the binding site positions. This means that we must compute \(\frac {L*(L-1)}{2}\) times the joint probability for each training matrix, where L is the total number of binding site positions in M. The running time of the algorithm depends on the length of the candidate sequence and on the number of binding site positions.The run time is therefore linear in the length of the input sequence and quadratic in the length of the binding site L.

$$ T(L) \in O\left(L^{2}\right) $$
((8))

The optimization algorithm is based on considering only the correlated binding site positions. The η function has been calculated considering only the Rényi-divergence of the correlated binding site positions (showing positive correlations) through a screening on the possible positive dependencies between these positions.

Any two binding site positions are considered to be correlated if the Rényi divergence score is bigger than the error finite sample. This error yields to a bias on the uncertainty parametric measurement caused by estimating the probability using the nucleotide frequencies [24]. After the screening, we only compute based on the correlated positions of the training matrix as shown in Fig. 3.

Fig. 3
figure 3

Left: Rényi Divergence, \(D^{M}_{q=1}\), considering all possibles correlations between binding site positions. Right: \(D^{M}_{q=1}\) considering only significant dependences between binding site positions after applying the error finite sample correction. Black boxes mean maximum correlation and white boxes mean zero correlation between binding site positions

For each pair of positions (i,j) in M where i,j={1,…,L}, the joint probability for all the possible combinations of (x i ,x j )= {A, C, G and T} are precomputed and stored in a 4×4 matrix. We construct a library \(\left (B_{i,j, x_{i},x_{j}}\right)\) of sixteen 4×4 matrices containing all the possible joint probability values for each pair of positions i and j (as illustrated in Fig. 4).

Fig. 4
figure 4

For each pair of positions i and j we calculate a joint probability matrix, \(B_{i,j,x_{i},x_{j}}\), using all possible combinations of {A, C, G and T}

For each new candidate sequence, we have to consider only the symbols matching correlated positions and read the joint probability value from the lookup table \(B_{i,j,x_{i},x_{j}}\). The Rényi divergence and the discrimination function, η are then computed from these values. The estimated total number of significant transcription factor site dependencies in Homo sapiens and Rattus novergicus is approximately 50 % and 37 % [32]. In this way, the computing time can be reduced by approximately an order of magnitude.

Validation

In order to build a model for each set of binding site sequences, the SIGMA detector has been characterized by means of leave-one-out cross validation (loo-cv). Each method has its own characteristic parameter. The range of the parameter used is different for each detector, see Table 3. The detector performance depends on the value of these parameters which have been selected employing loo-cv. Taking as a criteria a heuristic magnitude, ν auc . This parameter has been computed from the mean and variance of the area under the N ROC curve (A U C N ) [5], which will be maximised for all methods.

$$ \nu_{auc}= \mu_{auc}*(1-\sigma_{auc}) $$
((9))
Table 3 Summary of the characteristic parameters and the range considered for the validation of each computational method used

where μ auc and σ auc are the mean and the variance of A U C N . ν auc is a value between 0 and 1. When ν auc is close to 1, the mean is close to 1 and the variance is close to 0.

From the performance data, we have calculated the mean and standard deviation of the AUC for each transcription factor and method by means of the outer loo-cv. This process has been repeated for all the TFs listed in Tables 1 and 2.

Results and discussion

We first show a characterisation of how the performance of the individual algorithms based on Entropy and Divergence depends on the complexity properties of the training matrix (M) C, (Eq. 7), see Fig. 5. The performance of these algorithms will vary on C depending on the design of each algorithm and the true correlation between positions found for each set of binding sequences. As one would expect, the total Entropy algorithm has a better behaviour with low values of C, whereas a Divergence based approach improves its performance when C is large. The SIGMA approach is partially based on both measurements and aims at finding a trade-off between both approximations in order to maximise the performance over the full dynamic range of C.

Fig. 5
figure 5

Left to right: Entropy and Divergence performances against Complexity (degree of correlation between binding site positions) for a set TF of different organisms ((blue times symbol) Homo sapiens, (red triangle symbol) Drosophila melanogaster, (green cross symbol) Rattus norvegicus, (black circle symbol) Mus musculus). Entropy performs better for low Complexity. On the contrary, Divergence performs better for large Complexity

Figure 6 shows an example of real case where each input sequence is represented as a point in (ρ,η) coordinates. This set of samples includes genomic or binding sequences as shown in the figure. It is clear from the figure that both variables are contributing to the separation of the true binding site sequences.

Fig. 6
figure 6

Empirical representation of the concept depicted in Fig. 1. Left to right: Information Gain when candidate sequences are inserted in the Transcription Factor Binding Sites Irf2 and HLF for the Homo sapiens organism. Black points correspond to candidate sequences which are true binding site sequences. Grey points correspond to candidate sequences which are false binding site sequences

The performance of SIGMA, MEME/MAST, Qresiduals, Entropy, Divergence, Biostrings and MotifRegressor has been compared against the same set of TFs under the same validation conditions described in the previous section. In Fig. 7, it can be observed that the mean and standard deviation depend both on the Transcription Factor and on the method considered. The performance among all the methods has been compared by means of the ν auc parameter described in Eq. (9). In Fig. 8, the ν auc parameter is shown for each transcription factor and method. Based on the ν auc values, in approximately 70 % of the TFBS under study, SIGMA shows better performance than the other methods. In 20 % of the TFs, the performance of the others methods is better than that of SIGMA. In the remaining cases, the SIGMA performance is similar to one or several of the computational methods considered. In most cases, the mean AUC is close to one and the variance is approximately zero, which suggests that SIGMA also behaves more robustly than other methods, as seen in Tables 4 and 5.

Fig. 7
figure 7

Top to bottom: Box plot of the AUC and its variation for the studied transcription factors for the Homo sapiens and Mus musculus organisms using different computational methods: (black square symbol) MAST, (red square symbol) SIGMA, (green square symbol) Divergence, (blue square symbol) Entropy, (cyan square symbol) Qresiduals, (pink square symbol) Biostring, (yellow square symbol) MotifRegressor. The background sequences used have been E P74078(+)H s R P S9P2+ for the Homo sapiens and E P07119(+)M m I g k0M P C11 for the Mus musculus

Fig. 8
figure 8

Top to bottom: Performance of each algorithm ((black circle symbol) MAST, (red triangle symbol) SIGMA, (green cross symbol) Divergence, (blue times symbol) Entropy, (cyan diamond symbol) Qresiduals, (pink down-pointing triangle symbol) Biostring, (yellow diamond symbol) MotifRegressor) is shown through ν auc , (Eq. 9), for a set of TFBS for the Mus musculus and Homo sapiens organisms. When ν auc is close to 1, the mean is close to 1 and the variance is close to zero. For each TF, the best computational method will be that for which ν auc is closest to 1

Table 4 Results for the set of computational methods considered for each TF of the Homo sapiens organism. The ν auc is defined through the mean and variance of the A U C N using a cross-validation method. Given a TF and method, ν auc is chosen with maximum mean and lower variance in the A U C N
Table 5 Results for the set of computational methods considered for each TF of the Mus musculus organism. The ν auc is defined through the mean and variance of the A U C N using a cross-validation method. Given a TF and method, the ν auc is chosen with maximum mean and lower variance in the A U C N

We computed a Wilcoxon rank-test [33] in order to estimate whether the improvement in performance is statistically significant. The null hypothesis was that the AUC distributions between SIGMA and other methods were the same and the alternative hypothesis was that the AUC distributions were different. The level of significance is represented by −l o g 10(p value ). Any p value >0.05 is shown in bold, see Tables 6 and 7). In most cases, it can be observed that the difference between the AUC distributions is significant.

Table 6 The level of significance corresponding to −l o g 10 (p value ) calculated using the Wilcoxon-rank test for the Homo sapiens organism. The null hypothesis is that the AUC distributions between SIGMA and the other computational methods are the same and the alternative hypothesis is that the AUC distributions are different. p value >0.05 is in shown in bold
Table 7 The level of significance corresponding to −l o g 10(p value ) calculated using the Wilcoxon-rank test for the Mus musculus organism. The null hypothesis is that the AUC distributions between SIGMA and the other computational methods are the same and the alternative hypothesis is that the AUC distributions are different. p value >0.05 is in shown in bold

The computational time of SIGMA was compared with the set of computational methods considered. The C code for Qresiduals, Entropy and Divergence using the model obtained in validation and MEME/MAST (Version 4.4.0) was used and has been made publicly available. The run time was obtained in comparison with randomly generated candidate sequences of 1500 nucleotides. The total time has been calculated from 100 iterations of each algorithm. The averages of the computational times in detection for the set of TF considered of Homo sapiens (Tables 4 and 5) are shown in Table 8.

Table 8 Per CPU, the total run time was calculated on a 2.3 GHz Intel Core 2 Duo P8600 computer with 4 GB RAM

Conclusions

A new methodology based on a discriminant analysis of two information theoretic measures has been proposed for binding site detection. The variation on the information has been measured through two parametric uncertainty measurements (the Rényi entropy and Rényi divergence). The method focusses on the variation in these information measures when a new sequence is assumed to belong to a training set of sequences with known binding properties.

This methodology allows us to detect cis-regulatory sequences with maximum performance disregarding the co-variability observed in the positions of the training set of sequences. SIGMA has been characterised on the detection problem for a large set of transcription factors and compared with different motif detection algorithms. AUC distributions have been calculated which show that there is a statistically significant difference between SIGMA performance and the performance of the other methods. In approximately 70 % of the cases considered, SIGMA has exhibited better performance properties, at comparable levels of computational resources, than the methods with which it was compared.

As you can see through the heuristic parameter, SIGMA method is more robust than the other methods. A model based on both parametric uncertainty measurements can be useful to detect cis-regulatory sequences. But when the number of the positions involved in the binding sites process is small, the SIGMA performance is comparable with the rest of the computational methods.

Abbreviations

AUC:

Area Under ROC Curve

ENCODE:

Encyclopedia of DNA Elements

LOO-CV:

Leave-one-out cross-validation

PWM:

Position Weight Matrices

PSWM:

Position Specific Weight Matrices

QDA:

Quadratic Discriminant Analysis

ROC:

Receiver Operating Characteristic

SIGMA:

Sequence Information Gain Motif Analysis

TF:

Transcription Factor

TFBS:

Transcription Factor Binding Sites

TSS:

Transcription Start Sites.

References

  1. Wei W, Yu XD. Comparative analysis of regulatory motif discovery tools for transcription factor binding sites. Geno Rot Bioinfo. 2007; 5(2):131–42.

    Article  CAS  Google Scholar 

  2. Whitfield TW, Wang J, Collins PJ, Partridge EC, Aldred SF, Trinklein ND, et al.Functional analysis of transcription factor binding sites in human promoters. Genome Biol. 2012; 13(9):50. doi:10.1186/gb-2012-13-9-r50.

    Article  CAS  Google Scholar 

  3. Sharon E, Lubliner S, Segal E. A feature-based approach to modeling protein-DNA interactions. PLoS Comput Biol. 2008; 4(8):1000154. doi:10.1371/journal.pcbi.1000154.

    Article  CAS  Google Scholar 

  4. Tomovic A, Oakeley E. Position dependencies in transcription factor binding sites. Bioinformatics. 2007; 23(8):933–41.

    Article  CAS  PubMed  Google Scholar 

  5. Pairó E, Maynou J, Marco S, Perera A. A subspace method for the detection of transcription factor binding sites. Bioinformatics (Oxford, England). 2012; 28(10):1328–35. doi:10.1093/bioinformatics/bts147.

    Article  CAS  Google Scholar 

  6. Dunham IEA. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74. doi:10.1038/nature11247.

    Article  CAS  Google Scholar 

  7. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004; 5(4):276–87. doi:10.1038/nrg1315.

    Article  CAS  PubMed  Google Scholar 

  8. Sandve GK, Drablos F, A survey of motif discovery methods in a integrated framework. Biol Direct. 2006;1(11). doi:10.1186/1745-6150-1-11.

  9. Pavesi G, Mauri G, Pesole G. In silico representation and discovery of transcription factor binding sites. Brief Bioinformatics. 2004; 5(3):217–36.

    Article  CAS  PubMed  Google Scholar 

  10. Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998; 5(2):279–305.

    Article  CAS  PubMed  Google Scholar 

  11. Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. In: Proceedings of the Eighth International Conference on Intelligent Systems on Molecular Biology. San Diego, CA: 2000. p. 344–54.

  12. Stormo GD. Dna binding sites: representation and discovery. Bioinformatics. 2000; 16(1):16–23.

    Article  CAS  PubMed  Google Scholar 

  13. GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006; 34(12):3585–98. doi:10.1093/nar/gkl372.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Sinha S, Tompa M. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2002; 30(24):5549–60.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Naughton BT, Fratkin E, Batzoglou S, Brutlag DL. A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites. Nucleic Acids Res. 2006; 34(20):5730–9. doi:10.1093/nar/gkl585.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. King OD, Roth FP. A non-parametric model for transcription factor binding sites. Nucleic Acids Res. 2003; 31(19):116. doi:10.1093/nar/gng117.

    Article  CAS  Google Scholar 

  17. Zhao X, Huang H, Speed TP. J Comput Biol J Comput Mol Cell Biol. 2004; 12(6):894–906. doi:10.1089/cmb.2005.12.894.

  18. Ellrott K, Yang C, Sladek FM, Jiang T. Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics (Oxford, England). 2002; 18 Suppl 2:100–9.

    Article  Google Scholar 

  19. Ben-Gal I, Shani A, Gohr A, Grau J, Arviv S, Shmilovici A, et al.Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics (Oxford, England). 2005; 21(11):2657–66. doi:10.1093/bioinformatics/bti410.

    Article  CAS  Google Scholar 

  20. Pudimat R, Schukat-Talamazzini EG, Backofen R. A multiple-feature framework for modelling and predicting transcription factor binding sites. Bioinformatics (Oxford, England). 2005; 21(14):3082–8. doi:10.1093/bioinformatics/bti477.

    Article  CAS  Google Scholar 

  21. Zhou Q, Liu JS. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics (Oxford, England). 2004; 20(6):909–16. doi:10.1093/bioinformatics/bth006.

    Article  CAS  Google Scholar 

  22. Barash Y, Elidean G, Friedman N, Kaplan T. Modeling dependencies in protein-dna binding sites In: Vingron M, Istrail S, Pevzner P, Waterman M, editors. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology. New York, NY: ACM Press: 2003. p. 28–37.

    Google Scholar 

  23. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al.MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(Web Server issue):202–8. doi:10.1093/nar/gkp335.

    Article  CAS  Google Scholar 

  24. Maynou J, Vallverdu M, Gallardo-Chacon JJ, Caminal P, Perera A. Computational detection of transcription factor binding sites using a parametric entropy measure. IEEE Trans Information Theroy. 2010; 56(2):734–41.

    Article  Google Scholar 

  25. Maynou J, Vallverdu M, Claria F, Gallardo-Chacon JJ, Caminal P, Perera A. Conf Proc Ann Int Conf IEEE Eng Med Biol Soc IEEE Eng Med Biol Soc Conf. 2009; 2009:7087–90. doi:10.1109/IEMBS.2009.5332912.

  26. Lu J, Luo L. Prediction for human transcription start site using diversity measure with quadratic discriminant. Bioinformation. 2008; 2(7):316–321.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Pages H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.26.3.

  28. Conlon EM, Liu X, Lieb J, JS L. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Nat Acad Sci. 2003; 100:3339–44.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Rényi A. On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Berkeley, California: University of California Press: 1961. p. 547–61.

    Google Scholar 

  30. Vlieghe D, Sandelin A, Bleser PJD, Vleminckx K, Wasserman WW, Roy FV, et al.A new generacion of jaspar, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006; 34(Database issue):95–7.

    Article  CAS  Google Scholar 

  31. Schmid CD, Perier R, Bucher P. Edp in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 2006; 34:82–5.

    Article  CAS  Google Scholar 

  32. Tomovic A, Stadler M, Oakeley EJ. Transcription factor site dependencies in human, mouse and rat genomes. BMC Bioinformatics. 2009; 10:339. doi:10.1186/1471-2105-10-339.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945; 1:80–3.

    Article  Google Scholar 

Download references

Acknowledgements

Spanish Ministerio de Ciencia y Tecnologa through the CICYT grant TEC2010-20886-C02-02/TCM, Ramon y Cajal program from the Spanish Ministerio de Educación y Ciencia (partially). CIBER-BBN is an initiative of the Spanish ISCIII.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Joan Maynou.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

SM and AP equally contributed to the idea. JM and AP contributed to the design of the algorithm. JM developed the tool. JM and EP designed the database and performed the validation. JM and AP wrote the paper. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Maynou, J., Pairó, E., Marco, S. et al. Sequence information gain based motif analysis. BMC Bioinformatics 16, 377 (2015). https://doi.org/10.1186/s12859-015-0811-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-015-0811-x

Keywords