Sequence information gain based motif analysis

Background The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions. Results This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such Qresiduals projections or information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show how, in 70 % of the studied Transcription Factor Binding Sites, the SIGMA detector has a better performance and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions. Conclusions Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model of the cis-regulatory sequences detection based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance disregarding the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.


Background
The information encoded in genetic sequences is expressed by means of a gene regulation process, which begins with a gene transcription step. The binding between specific proteins and their target sites in DNA is a key step in the control of the transcription process. These proteins -transcription factors (TF) -recognise specific motifs in DNA known as Transcription Factor Binding Sites (TFBS) or cis-regulatory sequences. The prediction, identification and detection of cis-regulatory sequences is a key factor in understanding gene regulation and in inferring regulatory networks [1,2]. TFBS are usually very short (5 to 20 base pairs long) and highly degenerate, which gives rise to an extremely difficult identification problem due to low statistical power, as short sequences are expected to occur at random every few hundred base pairs. Due to their high variability, a consensus Full list of author information is available at the end of the article sequence approach for detection is insufficient. There is also evidence that this variability exhibits correlation between positions among the regulatory sequence [3,4], and that this correlation could contain information which would help reduce the false positive rate and increase the sensitivity of a detector [5].
Due to the importance of identifying cis-regulatory sequences, much effort has been devoted to mapping the binding sites for a large set of transcription factors. An important recent project is the ENCODE (Encyclopedia of DNA Elements) project, which has been able to map 4 million regulatory regions in the human genome, opening new possibilities for computational methods [6]. Motif detection methods may be classified in different ways, depending on the approach adopted. Some reviews focus on the biology of motif discovery in regulatory regions [7,8], whereas other publications focus more on the representation of the motifs: consensus-based methods and alignment-based methods [9]: consensus-based methods use word algorithms which consider binary hit/nohit values [10,11], and alignment-based methods use a set of alignment sequences with binding evidence to assign putative motifs to a candidate sequence. These latter methods could be classified as either numerical or stochastic models: numerical models are based on a mathematical representation of the nucleotides, whereas stochastic models, which are probably the most popular methods, are based on Position Weight Matrices (PWM) or Position Specific Weight Matrices (PSWM) [12]. A PWM is a matrix of scores corresponding to the frequency of the sequence symbols for each binding site position. The PWMs allow the capture of the variability over a sequence of nucleotides from a set of binding site positions [13], although there is the implicit assumption of independence between the residues of the aligned sequence matrix. PWM representations have been used in several algorithms to discover over-represented patterns from candidate sequences [14].
As noted above, statistical studies have shown the dependence among binding site positions variability. The common strategies for incorporating these dependencies within motif detectors include the extension of the PSSM approach to include pairs of correlated positions [15,16], m th order Markov chains (HMM) [17,18] and Bayesian Networks [19][20][21][22]. HMM can model the position interdependencies as long as high order HMMs, or a Bayesian approach are used but, in order to train any of both methods model sufficiently well, a huge training set of sequences would be required (± 1000 or more sequences per model).
A popular method, based on some of the previous work, is MEME/MAST, which provides an improved detection performance [23]. MAST is part of the MEME suite and uses a Q-FAST algorithm for finding motifs. Although these strategies may perform well in some datasets, they have shown certain limitations in the number of dependencies which may be considered between positions, in their ability to model dependencies between more distant positions, and in the large number of parameters which need to be adjusted in the models [3].
Previous work by our group proposed a parametric detector using the Rényi Entropy for binding site detection [24]. This measurement allowed us to build variable-sensitivity detectors modulated by the Rényi order -this assumed independence between binding site positions. A first approximation for modelling the correlation among binding site positions, known as Qresiduals, used a linear embedding to represent the set of binding site sequences [5] and employed a residuals-based approach as the detection statistic. Other non-related work modelled the pure correlation between binding site positions through non-linear correlations based on the variation of mutual information [25].
Statistical pattern recognition has also been applied to identification of sequence motif. Luo et al. [26] propose to use discriminant analysis for the prediction of Transcription Start Sites (TSS). From non-parametric measure, similar to Shannon information, Luo et al. [26] provide information about the variance observed in the dataset. This strategy has good performance for the binding motif detection when the motif positions are not correlated among them. But, this measurement does not allow modelling the dependencies among motif positions.
In this paper, we propose a generalisation of a nonlinear model based on Information Theory, which allows modeling DNA contact by the protein and the biological interaction among binding sites using a small training set of sequences (5-50 sequences model). This new approach aims at a trade-off between the good generalisation properties of pure entropy methods and the ability of position-dependency metrics to improve detection power.
The performance of the proposed detector method, named SIGMA (Sequence Information Gain based Motif Analysis), is compared with different computational methods for binding site detection: MEME/MAST [23], Biostrings [27], MotifRegressor [28], Qresiduals [5] and a previously published set of algorithms based on information theory [24,25].

Methods
The information gain has been measured for each TFBS by means of two parametric uncertainty estimators. The rationale is based on the idea that the total information gain of a set of true TFBS aligned sequences will change according to the similitude of the new candidate sequence to that set (Fig. 1). The first estimator measures the total amount of information change produced by assuming position independence, whereas the second estimator measures the total amount of change of per-position mutual information (capturing pure correlation among binding site positions). Both estimators are computed by a parametric uncertainty measurement.
Let us consider a set of I aligned sequences (s i ) with binding evidence M = {s i , i = 1, · · · , I}, and the same set including a candidate sequence s c , S = s c ∪ M. Following Fig. 1, let a be the coordinate corresponding to the set M, with axes determined by the two measures previously mentioned. When a new candidate sequence is considered in S, both measures will vary to b or g depending on the nature of the candidate sequence. When the candidate sequence is a binding site sequence, (b,) the variation on the information will be not significant. However, when the candidate sequence is a genomic sequence, (g), the amount of information will vary significantly. With a sufficient training set, this information gain space can be split in two regions, genomic and binding, by means of a simple discriminant analysis which will define a decision boundary, as highlighted as a dashed line in Fig. 1. The decision boundary shape is the result of applying non-linear function.

Information content measures
We have employed as parametric uncertainty measurements the Rényi entropy and Rényi Divergence (also called α-Divergence) [29], which are defined as: where X i and Y j are the nucleotides {A, T, C and G} at different DNA sequence positions, and the q is the Rényi order which modulates the probability of occurrence of each symbol. p(X, Y ) is the joint probability of X and Y, p(X) and p(Y ) are the marginal probability. Both measurements (H q (X) and D q (X; Y )) depend on q which is a positive real number (q = 1)and both are non-negative for all q ≥ 0. This parametrisation allows the building of a variable-sensitivity detector exploiting the statistical properties of the Redundancy, R, where R is defined as [24]. The measurement of the variation when the candidate sequence is added to the set has been computed using two heuristic functions, see (Eqs. 3 and 4). These functions depend on two parameters, γ and ω, which measure the difference between redundancies, Eq. 5, and divergence, Eq. 6, between the set of aligned sequences without the candidate sequences, s i , and with candidate sequence, M. These are estimated as described in Maynou et al. [24].
where, γ i and ω i are where L is the number of nucleotides in the binding region, M is the aligned set of sequences with binding evidence and i is a specific column of M. R M q is the redundancy, normalized depending on the maximum entropy on the set of aligned sequences, whereas R S q contains the equivalent parametric entropy when the candidate sequence is assumed to belong to the set. The redundancy profile is a L-dimensional vector, where L is the total number of positions of the binding site. D M q is the divergence matrix of the set of aligned sequences and D S q Fig. 2 The essential steps in the training and detection process are shown for the SIGMA algorithm is the divergence matrix considering the training matrix with the candidate sequence. The main diagonal is set to zero in each of these matrices, D M q and D S q . The variation in the information is therefore calculated by means of γ and ω and q-values are optimised at the validation stage within the range (0, 2]. As q increases, the noise included in the redundancy signal also increases [24]. From q-values higher than 2, signal-to-noise ratio is not optimal. For a genomic sequence, the order of the system will decrease the values of γ and ω, whereas for a binding sequence the order of the system will not be altered substantially. Each candidate sequence will therefore be characterised by the pair (x = (ρ, η)) and classified as genomic or binding by means of a Quadratic Discriminant Analysis (QDA), as shown in Fig. 6. The decision boundary, H(y), is defined from the distribution of the variation on the information, x, for each class, genomic or binding, in the information gain space.
Binding site detection by means of the SIGMA algorithm can be summarized as follows, see Fig. 2 Additionally, for characterisation of the results we define a heuristic magnitude C, related to the Complexity of M, in order to characterise the degree of pure  correlation between the variability of binding site positions in M, see (Eq. 7). C computes element by element the ratio between divergence value, where D q | i,j is the element of D q at row i and column j, and maximun entropy, H max without to considerer the main diagonal. The average of the ratios define the complexity of M.
where D is the parametric uncertainty measurement considered, N is the size of the binding sites, q is the Rényi order and H max is the maximum entropy for the set of probabilities p(X) and p(y), see section "Database description". C is a value between 0 and 1. When C is close to 1, the degree of correlation among binding site positions is high.
The JASPAR Core provides non-redundancy and high-quality alignment matrices for each transcription factor [30]. Results have been computed with background genomic sequences from the Eukaryotic Promoter Database (EPD) [31], using the EPD version based on the EMBL release 105 (sept 2010). The background loci chosen were EP74078(+)HsRPS9P2+ for Homo sapiens and EP07119(+)MmIgk0 MPC11 for Mus musculus.

Optimization
To apply SIGMA methodology to TFBS detection over genomic sequence, we should calculate the variation of the information, Eq. 4, as many times as the length of the sequence I (typically millions nucleotides). Given  Any two binding site positions are considered to be correlated if the Rényi divergence score is bigger than the error finite sample. This error yields to a bias on the uncertainty parametric measurement caused by estimating the probability using the nucleotide frequencies [24]. After the screening, we only compute based on the correlated positions of the training matrix as shown in Fig. 3.  Fig. 4).
For each new candidate sequence, we have to consider only the symbols matching correlated positions and read the joint probability value from the lookup table B i,j,x i ,x j . The Rényi divergence and the discrimination function, η are then computed from these values. The estimated total number of significant transcription factor site dependencies in Homo sapiens and Rattus novergicus is approximately 50 % and 37 % [32]. In this way, the computing time can be reduced by approximately an order of magnitude.

Validation
In order to build a model for each set of binding site sequences, the SIGMA detector has been characterized by means of leave-one-out cross validation (loo-cv). Each method has its own characteristic parameter. The range of the parameter used is different for each detector, see  Table 3. The detector performance depends on the value of these parameters which have been selected employing loo-cv. Taking as a criteria a heuristic magnitude, ν auc . This parameter has been computed from the mean and variance of the area under the N ROC curve (AUC N ) [5], which will be maximised for all methods.
where μ auc and σ auc are the mean and the variance of AUC N . ν auc is a value between 0 and 1. When ν auc is close to 1, the mean is close to 1 and the variance is close to 0. From the performance data, we have calculated the mean and standard deviation of the AUC for each transcription factor and method by means of the outer loo-cv. This process has been repeated for all the TFs listed in Tables 1 and 2.   shown through ν auc , (Eq. 9), for a set of TFBS for the Mus musculus and Homo sapiens organisms. When ν auc is close to 1, the mean is close to 1 and the variance is close to zero. For each TF, the best computational method will be that for which ν auc is closest to 1

Results and discussion
We first show a characterisation of how the performance of the individual algorithms based on Entropy and Divergence depends on the complexity properties of the training matrix (M) C, (Eq. 7), see Fig. 5. The performance of these algorithms will vary on C depending on the design of each algorithm and the true correlation between positions found for each set of binding sequences. As one would expect, the total Entropy algorithm has a better behaviour with low values of C, whereas a Divergence based approach improves its performance when C is large. The SIGMA approach is partially based on both measurements and aims at finding a trade-off between both approximations in order to maximise the performance over the full dynamic range of C. Figure 6 shows an example of real case where each input sequence is represented as a point in (ρ, η) coordinates. This set of samples includes genomic or binding sequences as shown in the figure. It is clear from the figure that both variables are contributing to the separation of the true binding site sequences.
The performance of SIGMA, MEME/MAST, Qresiduals, Entropy, Divergence, Biostrings and MotifRegressor has been compared against the same set of TFs under the same validation conditions described in the previous section. In Fig. 7, it can be observed that the mean and standard deviation depend both on the Transcription Factor and on the method considered. The performance among all the methods has been compared by means of the ν auc parameter described in Eq. (9). In Fig. 8, the ν auc parameter is shown for each transcription factor and method. Based on the ν auc values, in approximately 70 % of the TFBS under study, SIGMA shows better performance than the other methods. In 20 % of the TFs, the performance of the others methods is better than that of SIGMA. In the remaining cases, the SIGMA performance is similar to one or several of the computational methods considered. In most cases, the mean AUC is close to one and the variance is approximately zero, which suggests that SIGMA also behaves more robustly than other methods, as seen in Tables 4  and 5.
We computed a Wilcoxon rank-test [33] in order to estimate whether the improvement in performance is statistically significant. The null hypothesis was that the AUC distributions between SIGMA and other methods were the same and the alternative hypothesis was that the AUC distributions were different. The level of significance is represented by −log 10 (p value ). Any p value > 0.05 is shown in bold, see Tables 6 and 7). In most cases, it Table 4 Results for the set of computational methods considered for each TF of the Homo sapiens organism. The ν auc is defined through the mean and variance of the AUC N using a cross-validation method. Given a TF and method, ν auc is chosen with maximum mean and lower variance in the AUC N    can be observed that the difference between the AUC distributions is significant. The computational time of SIGMA was compared with the set of computational methods considered. The C code for Qresiduals, Entropy and Divergence using the model obtained in validation and MEME/MAST (Version 4.4.0) was used and has been made publicly available. The run time was obtained in comparison with randomly generated candidate sequences of 1500 nucleotides. The total time has been calculated from 100 iterations of each algorithm. The averages of the computational times in detection for the set of TF considered of Homo sapiens (Tables 4  and 5) are shown in Table 8.

Conclusions
A new methodology based on a discriminant analysis of two information theoretic measures has been proposed for binding site detection. The variation on the information has been measured through two parametric uncertainty measurements (the Rényi entropy and Rényi divergence). The method focusses on the variation in these information measures when a new sequence is assumed to belong to a training set of sequences with known binding properties. This methodology allows us to detect cis-regulatory sequences with maximum performance disregarding the co-variability observed in the positions of the training set of sequences. SIGMA has been characterised on the detection problem for a large set of transcription factors and compared with different motif detection algorithms. AUC distributions have been calculated which show that there is a statistically significant difference between SIGMA performance and the performance of the other methods. In approximately 70 % of the cases considered, SIGMA has exhibited better performance properties, at comparable levels of computational resources, than the methods with which it was compared.
As you can see through the heuristic parameter, SIGMA method is more robust than the other methods. A model based on both parametric uncertainty measurements can be useful to detect cis-regulatory sequences. But when the number of the positions involved in the binding sites process is small, the SIGMA performance is comparable with the rest of the computational methods.