Local Renyi entropic profiles of DNA sequences

Vinga, Susana; Almeida, Jonas S

doi:10.1186/1471-2105-8-393

Research article
Open access
Published: 16 October 2007

Local Renyi entropic profiles of DNA sequences

Susana Vinga^1,2 &
Jonas S Almeida^3,4

BMC Bioinformatics volume 8, Article number: 393 (2007) Cite this article

9363 Accesses
26 Citations
Metrics details

Abstract

Background

In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition therein explored was based on the Rényi entropy of probability density estimation (pdf) using the Parzen's window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles using the new pdf estimations to refine the density estimation of motifs.

Results

The new methodology enables two results. On the one hand it shows that the entropic profiles are directly related with the statistical significance of motifs, allowing the study of under and over-representation of segments. On the other hand, by spanning the parameters of the kernel function it is possible to extract important information about the scale of each conserved DNA region. The computational applications, developed in Matlab m-code, the corresponding binary executables and additional material and examples are made publicly available at http://kdbio.inesc-id.pt/~svinga/ep/.

Conclusion

The ability to detect local conservation from a scale-independent representation of symbolic sequences is particularly relevant for biological applications where conserved motifs occur in multiple, overlapping scales, with significant future applications in the recognition of foreign genomic material and inference of motif structures.

Background

Biological sequences are the ultimate support for the description of Biological Systems. In particular, key aspects of sequence analysis are known to play a role in integrated analysis of regulatory networks: for example in motif searching and inference.

Over the last decades and more recently due to the development of a considerable number of whole genome sequencing projects, several efforts have been made to mathematically model DNA sequences. In particular from the statistical side, the use of Markov based models [1] has widespread and proven to be effective in tackling the problem of data mining of biological sequences, through variable length Markov chains [2, 3], interpolated Markov models [4], fractal prediction machines [5] for symbolic time series based on Chaos Game Representations [6], to name just a few. Other algorithmic approaches based on the computational side have also proven to be useful [7]. All this effort allowed establishing important relations between the results obtained (computationally and statistically) with real biologically significant findings. From these models developed for DNA, it is now apparent that each genome has pervasive [8] motif and compositional characteristics in terms of the frequencies of its constitutive L-tuples or L-length motifs, which gave rise to the genomic signature concept [9]. This fact can be directly employed for horizontal transfer detection and characterization, coding vs. non-coding discrimination [8, 10], study and compare DNA through the use of composition profiles [11] and spectra [12] and other applications partly reviewed in [13].

In this regard and more specifically, an important statistical problem in bioinformatics that emerged is the evaluation of the number of repetitions occurring in biological sequences. More generally, they can occur in distinct hierarchical levels, from single symbols [14] to genes. In fact, in a recent paper, the number of gene repetitions was shown to be a key aspect of gene expression and phenotype [15]. Apparently theses repetitions, not only at nucleotide level, might play a key role in genome organization and functionality of networks. The notions of repetitions, entropy and correlation in DNA are unquestionably connected [16–18] and references therein – the degree of predictability of a sequence, which is closely related with its internal repetition and compression, can be measured by its entropy. The major importance of this research has provided evidence that is already too vast to fully account for. In particular, the relation between motif over- or under-representation is usually related with their biological function. This creates the need for an efficient method to analyze, for different parameters sets, the degree or scale of each DNA region.

In a recent report [19], the authors defined a new continuous measure of DNA entropy, based on non-parametric density estimation applied to Chaos Game Representation (CGR) and Universal Sequence Maps (USM) within the Rényi theory. The idea therein explored was that there is a close relationship between the statistics of the sequences, given by their constitutive motifs, and their entropy, measured under information theory methodologies. In that report the Rényi entropy was estimated in a global approach, and the measures obtained were compared with random sequences by Monte Carlo simulation. Although the main concepts were then introduced, that report was incomplete in the sense that just a global analysis was conducted. Specifically, no exploration of local patterns and fine tuned neighboring analysis was conducted, which is finally allowed by the present work, with the introduction of the concept of the Entropic Profile (EP).

Entropic profiles were defined previously but in a different context and scope: they were estimated using the histograms of the L-mer or L-tuple frequencies in DNA [20]. In that report the authors could discriminate between random and natural DNA sequences using the Shannon entropies of the histograms obtained from the CGR for different resolutions or oligomer lengths. Although the same name was used, that previous endeavor focused on a global perspective of sequence entropy [19] whereas this report proposes and investigates a local entropy formulation instead. In fact, the results obtained by Oliver and colleagues are global features for each DNA sequence, different from the present proposal of local based information per position/symbol. Another type of sequence profile also explored was based on linguistic complexity [21] and low entropy DNA zones [22].

In the present report the definition of entropic profile arises from the direct estimation of a local density, derived from the Parzen's window method described before. In our last report this estimation allowed the calculation of a global entropy measure, according to the Rényi definition. This report describes the next logical step of exploring complementary methods to access local information as to identify the location and composition of the conserved sequence which existence might have been anticipated from the global measures of entropy. The rationale is to have a function that assesses, for each position in the sequence (illustrated here for DNA), the information content of L-tuple suffixes directly from the density kernel function estimate. Such a solution should enable the scale-independent extraction of motifs without the need to identify complex state automata for unit succession.

In addition to our preceding report on Rényi entropy for global characterization of sequences, the study reported here also builds on the identification of a kernel function that produces a more accurate density estimation in CGR/USM projections of symbolic sequences [23]. The more conventional use of symmetrical functions as we did with a Gaussian Parzen kernel produces a rough fit to the characteristically fractal nature of iterative map projections. That approximation did suffice for assessment of global entropy [19] but it is not refined enough for the intended density estimation resolved locally at the sequence unit level.

Future applications of the methodologies here proposed might include motif inference and extraction, providing tools for the construction and inference of generalized sequence models for whole genomes.

Results and Discussion

This section presents some entropic profiles calculated for the DNA sequences described below. The relation between this values and former results is also investigated. Additionally, the influence of the parameters on the profiles is discussed.

DNA sequence dataset description

For sake of clarity this report uses the same dataset previously studied [19], thus allowing a comparison of results, in the continuity of the former proposal. In particular, the results for a subset of those sequences with known present motifs will be shown and extensively studied. In order to further test the estimation of the profiles to more challenging datasets, the analysis of whole genomes is also included. More specifically, the detection of Chi sequences in Escherichia coli and Haemophilus influenzae will be assessed. These genomes have been extensively analyzed after the completion of its DNA sequencing projects, thus constituting an excellent dataset to test new procedures. In particular, several important motifs have been studied elsewhere and can be compared directly with the proposed method. The following Table 1 recalls the DNA sequences examined.

Table 1 DNA sequence dataset used in this report.

Full size table

All the datasets and additional information are available in the webpage referred to above.

Entropic profiles and parameters optimization

The tests consisted on calculating the entropic profiles (EP) for different combination of parameters L and φ and check for particular features. The use of artificial DNA allows the accurate study of the impact of the parameters on the profiles obtained. The results can be directly obtained by using the deduced formulae of Equations 5 for $\hat{f}$ _{L, φ}(x_i) and their corresponding normalized values $\hat{g}$ _{L, φ}(x_i) (Eq.3), after specifying the parameters (see Methods and online software).

The results presented in this section are focused on the analysis of specific positions, known to be important and/or contain statistical significant motifs as suffixes. For example, Figure 1 represents the profiles obtained for the sequence m4 with the motif 'ATCG' implanted. This motif was implanted n = 20 times at equally spaced positions p = 50+i 100, i = 1, ..., 20 (see details in [19]). By studying one of the positions where this suffix ends (as an illustrative example p = 353 was chosen), one immediately assesses for which combination of parameters L and φ the maximum values of the profiles is obtained. In this case this maximum is achieved with L = 4 and φ approximately of 1 (one might further search this parameter space continuously in order to optimize φ but this is not pertinent in this explanatory step).

As seen from the Figure 1a) and 1b), there are parameter combinations for which that particular position/suffix is highlighted, with normalized density values way above alternative choices. It was not by chance that the maximum was attained at L = 4, since this is precisely the length of the suffix highly repeated, so L_max ≥ 4 was expected to be a local maximum of EP.

In the other panels of Figure 1 the entropic profile for the complete sequence is plotted, using the parameters previously optimized for the chosen position (p = 353). These plots allow the overview of all the sequence using local information obtained for a specific putative important suffix and, in fact, using this combination of parameters one immediately recovers all the positions where the known motif appears, which are simply the peaks on the graph. Panel d) shows a detail of the EP (from position 300 to 400), clearly illustrating the position where the implanted motif "ATCG" ends, with a density local maximum around EP(353) = 3.9. The expected number of counts under a first-order Markov Chain model would be 10.7 (p-value = 0.0027, z-score = 2.78).

In Figure 1e) and 1f) is also shown the corresponding density estimations on the CGR map for two distinct parameter sets. Comparatively with the Gaussian function this kernel is better adjusted to the CGR square-based geometry and presents a more clear-cut profile, as expected. The darker squares correspond precisely to the implanted motif sub-quadrants.

The following figures present the same results obtained with the other datasets under study.

In Figure 2 the same pattern occurs, with maxima EP(393) = 3.8, obtained for L = 3, again the implanted motif length. It should be mentioned that occasionally, for some positions where the motif "ATC" appears, the maxima occurs for a value L > 3. This can also happen and simply means that longer, non-implanted motifs appeared more often that would be expected by chance – in this case "ATC" is embedded in a longer significant motif, i.e. is contained in a longer string with potential significance. Interestingly, when plotting all the EP for the sequence using L = 3, one obtains additional, non-implanted motifs, which occurred just by chance – extra peaks with non-equal spacing in Figure 2c) and 2d). In fact, the probability of one specific motif of length 3 (under a null model of symbol equiprobability) is 4^-3, which implies, for a sequence of 2000, that the expected number of counts is roughly equal to 31. This simply means that the motif already existed in the random sequence m3 before the implantation took place. The detail graph – Fig. 2d) – shows precisely these "extra" appearances. If one uses a first-order Markov chain model as previously the expected number of counts becomes 60.08 (p-value = 2.8E-10, z-score = 6.2).

A similar interpretation can be made regarding sequence m5: the positions where the suffix "ATCGA" appears have maximal values $\hat{g}$ (x) for L = 5, although with high values in the range L = 4 to L = 7, which indicate nested significant motifs. The entropic profile for the complete 2000 base-sequence shows the maxima of the equally spaced motif (see Fig. 3), where it is noticeable an extra peak that corresponds to a previously existing motif ATCGA (ending at position 729).

By spanning the parameters space (L, φ) it is possible to find maximum values for $\hat{g}$ (x). For example, in specific positions 854 one finds out that $\hat{g}$ attains a maximum value for memory L = 5 and φ ≥ 10 with EP(854) = 7.1, a high relative value. By using these optima in the EP one obtains a profile that highlights immediately the suffixes where the highly repeated motif appears. Some other maxima appears sometimes (results not shown), but were discovered to correspond to other interpretable extreme values. The expected number of counts for this motif is just 3.07 that, comparing with the observed 21 occurrences, gives a p-value≈0 (z-score = 10.02).

Finally, Figure 4 shows part of the results for the real DNA sequence in the position corresponding to the ending of the TATA box (motif = "TATAAT"). The graph for this position shows precisely that L = 6 is an interesting scale to search for. The EP, in contrast to the former ones, does not exhibit a clear trend. In fact, differently to the former sequences, which were artificially generated and presented non-degenerate highly conserved motifs, the real DNA exhibits several point mutations that introduce some "noise" in the estimations. When plotting the complete profile for this sequence and observing one detail it is possible to recover the complete structured motif, known to bind to specific transcription factor binding sites, with values EP(TATAAT) = 4.3 and EP(TTGACA) = 3.6. It should be stressed however, that these results are biased towards the sequence itself: in this particular case, the concatenation of the promoter regions of B. subtilis provided a set with conserved motifs, at least to the point where they could be detected by density estimations. Of course, if non-conservation is allowed up to a higher level, the EP becomes noisier and eventually the signal will be lost, hampering the recovering of any significant motif if no pre-processing correction is performed. The analysis based of Markov chains gives for the TATAAT motif an expected number of counts of 1.60 (p-value≈0, z-score = 10.38) and 0.94 for TTGACA (p-value≈0, z-score = 9.54). The most common motif EP(AAAAAA) = 5.4 is highly periodic which explains the peak, although under a Markov chain it is expected to occur 11.67 (p-value = 0.1245, z-score = 1.15).

The two last datasets are constituted by whole genomes from two Gammaproteobacteria: Escherichia coli K12 and Haemophilus influenzae Rd (see Table 1 for NCBI GenBank accession numbers).

The study of the regions where Chi sequences appear will be analyzed in both genomes. Chi (crossover hotspot instigator) sites are homologous recombinational hotspot octamer sequences which modulate the exonuclease activity of RecBCD. This enzyme is necessary for chromosomal dsDNA repair and integration of exogenous dsDNA, which supports the idea that Chi sites have a biologically functional role [24].

Since Chi motifs are orientation-dependent and strand-specific, the sequence to be analyzed should be previously processed to comply with this property. This means that one should extract the whole genome and use the DNA sequence from the origin of replication up to the terminus plus the reverse complementary sequence, since chromosome replication in bacteria start from one replication origin (oriC) and proceeds bi-directionally until the replication forks reach the termination site (terC). These pre-processed genomes will conform the 5'->3' direction of replication and therefore will be used throughout the analysis. The oriC and terC positions (referred to the NCBI GenBank database) have been estimated based on experimental data and asymmetric properties [25] and are specified in Table 2.

Table 2 Description of Chi sites in E. coli and H. influenzae genomes.

Full size table

Chi sequences (see Table 2) are statistically overrepresented in the genome of E. coli (5'-GCTGGTGG-3'), appearing more often than would be expected by chance whereas in H. influenzae (5'-GNTGGTGG-3' and 5'-GSTGGAGG-3' show Chi activity) they are known to be less frequent and less conserved. This makes for two different datasets with distinct features that involve a different degree of difficulty to detect these regions.

The study of Chi sites have been subject to many analyses and therefore constitute an excellent test dataset to assess the strength of the entropic profile approach to detect these motifs. In particular several recent papers have assessed its statistical significance using Markov models [1], analyzing the 8-tuple frequency for the whole genome of E. coli [26] and also comparing Chi site conservation in both organisms [24].

The expected number of an 8-tuple in E. coli and H. influenzae using a Markov model of order 0 (only nucleotide abundance is taken into account) is respectively 70.796 and 27.926. One immediately sees that in E. coli this motif is highly represented whereas in H. influenzae this fact is less evident.

Interestingly, when analyzing whole genomes, several motifs appear with p-values near 0, i.e. they occur in exceptionally high number when considering a Markov chain model. This fact does not allow their accurate comparison and is a major drawback of using solely the p-values to assess the statistical significance and correctly compare and order the relative importance of these motifs. Therefore, as explained in the Methods, the normalized z-scores are also reported for clarity.

For example, using a first order Markov Chain model the expected number of counts for the chi-sequence in E. coli and H. influenza is 85.06 and 12.34 respectively. Although this motif has a p-value≈0 for both sequences, the corresponding z-score of 73.37 and 12.43 respectively puts it in different ranks among all motifs of the same length.

When analyzing one (random) position where Chi sequence ends in E. coli (exactly in the same way as the previous analysis) the following profiles are obtained (Fig. 5). The position p = 35840 shows that the maximum EP values are obtained for parameters (L = 8, φ = 10) and (L = 9, φ = 5), for which the profiles attain similar values of EP = 8.04 and EP = 8.08 respectively. For L = 7 the motif also appears relevant. The complete profiles for that region are plotted in the panels c) and d), showing striking and evident peaks at the positions where Chi sequences end. The other local maximum corresponds to a chi-related sequence (GCGCTGGC), which in fact shares the 5-mer GCTGG. Indeed, the family containing the trimer CTG, often within the pentamer GCTGG, is very frequent in this genome [27], all with p-values≈0 and highest scoring ranks

When analyzing the genome of H. influenzae and studying one particular position where motif 5'-GGTGGTGG-3' ends (in the example, p = 36532), the following Figure 6 is obtained.

From Figure 6a) is it possible to see that the maximum EP = 0.1252 is obtained for parameters (L = 8, φ = 10), a relatively low value when compared with the previous examples so far. Interestingly other peaks exhibit a period of 3 (L = 3 and L = 6) – the motif TGGTGG repeats every 3 and 6 bases and therefore that property is patent in the graph (Figure 6a) through the appearance of this local maxima every 3 bases. When using the above parameters to plot the entire profile one immediately sees that other positions of extremely high significance appear. This is the case of the 8-tuple motifs AAGTGCGG and AGTGCGGT, which corresponds to EP(36549) = 11.1281, p-value≈0, z-score = 174.80, and EP(36550) = 9.7819, p-value≈0, z-score = 186.20, marked in Figure 6d). These motifs appear 867 and 770 times in the genome, which makes them the most common 8-tuples, along with CCGCACTT (820 times; EP = 10.4869, p-value≈0, z-score = 184.47), ACCGCACT (755 times; EP = 9.5784, p-value≈0, z-score = 210.81) and AAAGTGCG (699 times; EP = 8.8696, p-value≈0, z-score = 97.35), using the same parameters.

As expected, the Chi sites are not detected solely based on EP maximization. In fact, the motif is not especially over-represented when compared with all the others, so it would be impossible to detect based solely on the raw entropic profiles. Furthermore and evident from the figures, the H. influenzae genome has one extremely ubiquitous 9-tuple motif, the extensively studied uptake signal sequence (USS+) AAGTGCGGT (appears 740 times) and its inverted complement sequence (USS-) ACCGCACTT (731 times) with a total number of 1471 occurrences. Their p-values≈0 and their extremely high z-scores of 293.28 and 329.74, puts them in the first rank positions of exceptionality. Furthermore, all the motifs present among the first 25 highest scoring positions greatly overlap USS sequences [1].

USSs are involved in natural competence, which is a genetically controlled form of horizontal gene transfer in some bacterial species, related to their ability to take up DNA from the surrounding environment (reviewed in [28]). This process allows genetic exchange in bacteria, which is the only organism known to actively take up DNA from the environment and recombine it into their own genome [29]. The DNA uptake machinery on the cell surface preferentially binds and takes up fragments containing this specific short sequence. In particular H. influenzae is able to take up double-stranded DNA of its own species and closely relatives, facilitated through the recognition of USS, which are indeed over represented in its genome.

One interesting statistical aspect of the USS distribution, besides its extremely over-representation, is that these sequences appear equally partitioned in both strands and are remarkably and significantly evenly spaced around the chromosome [30]. They can be constituted by the 9 bp core referred to but allowing a longer 29 bp consensus. The USS evolutionary origin and function was recently addressed [31] by confronting two models, preference first hypothesis and a molecular drive hypothesis. Nevertheless this issue remains controversial [32].

Through the analysis of H. influenzae complete genome conducted above one obtains peaks on the entropic profiles precisely at these ubiquitous motifs, which definitely obscures the retrieval of Chi sequences, whose number of occurrences is not at all comparable with USS frequency.

In fact, the profile obtained for the maximum values (L, φ) shows that the Chi sequence (with G) attains a maximum entropic density value of 0.12, which is way below the detection level when compared with the value obtained for USS which was equal to EP(AGTGCGGT) = 9.78 and EP(AAGTGCGG) = 11.13. This phenomenon is well understood, and some authors name it "contamination" [1]: the highly overrepresented expressed motif contaminates the calculation of low expressed segments. The program R'MES [33] lists precisely USS motifs and their variants showing this behavior. One idea to assess the statistical significance excluding this bias is to delete, from the original sequence, the regions/positions where this ubiquitous 9-tuple appears [1]. This is approximately comparable to perform exact Markov calculations and therefore can be used to further study the sequence. The obtained values for the transformed sequence were nevertheless very low around EP = 0.16 (results not shown). After investigation what might be happening it was found that other motifs emerged even when USS were all deleted from the genome.

For example, the 8-tuple AAAATTTT (p-value→1, z-score = -10.70) appears with high EP values, along with other motifs constituted by long successions of A's and T's. These long adenine-thymine tracts, previously detected for other organisms such as Yeast [34, 35], might have an important role due to their strong DNA bending properties [36]. Although the detection of Chi sites failed, other motifs emerged that have notable biological functions and roles in the cell.

This effort highlight an important possible procedure, to be explored further, that one should plot the motifs hierarchically and delete the influence of more ubiquitous motifs that highly "pollute" the calculations, starting from the most exceptional. In fact, from the profile information we could further envisage an algorithm that automatically extracts putative motifs for each position. This is accomplished by searching the parameters space for which the estimation is maximal for position i:

{(L_{\max}, φ_{\max})}_{i} = \underset{L, φ}{\arg \max} {\hat{g}}_{L, φ} (x_{i})

and then use these parameters to retrieve the suffix $m_{i} = s_{i - L_{\max} + 1} \dots s_{i}$ .

Using this methodology one obtains precisely the implanted motifs of the previous datasets. As an example, the "TATA"-box referred to before is correctly inferred and also the above mentioned examples with the artificial sequences (Figure 7).

It should be stressed however that this is not the most convenient procedure for motif inference problems since several algorithms already exist that perform these searches very efficiently. Nevertheless is interesting to find that combinatorial and probabilistic methodologies are comparable as the latter come with broader opportunities for theory development albeit leading to advantageous numerical solutions. The observation that there is a close relationship between the overrepresentation, detected by the majority of the algorithms, and the proposed Entropic Profiles with its density and statistical significance measure suggests that it could provide a way of simultaneously finding and statistically classifying the motifs instead of pursuing the two goals separately.

The analysis also showed that the statistical significance z-scores and p-values are unequivocally related with the entropic profiles, since most of the algorithms detected the same motifs. Over-represented motifs exhibit a very low p-value, very near zero, and high z-scores and EP values; common motifs, that appear a median and/or expected number of times, have high p-values and low z-scores, which indicate its non-exceptionality under the Markov chain model considered. These are the motifs that also attain low EP values. The full correspondence between both methods is still under study.

By expressing the density estimation as a function of the suffix counts, one is also allowed to search for under-represented segments, i.e., those whose density is below average. Although not explored in this work, minimum entropic profile values might also play a role in under-represented motifs detection. In fact, rare motifs/substrings are known to correspond to traits/regions with very specific functions in high precision biological processes. The use of unique sub-strings, or UniMarkers, that appear only once in the genome, recently allowed to locate single nucleotide polymorphisms (SNP's) [37, 38]. These unique substrings were shown to be clustered close to genes [39]. All these positions can be detected as low-density areas in the CGR and consequently correspond to local minima in their Entropic Profiles. Another example also related with low-density points is related with 6-tuple palindromes. These short sequences, which often correspond to restriction sites, are under-represented in E. coli and in the bacteriophage lambda [1, 40], thus providing a self-protecting effect. More generally this methodology can be used to find heterogeneous traits in the genome, both related with local under- and over- representation of motifs. This result can indicate the presence of foreign material which can have significant applications in the detection of horizontal transfer [11].

Conclusion

In this report, Entropic Profiles (EP) were proposed as a novel local information entropy measure for DNA sequences. This function is built on previous work on continuous Rényi quadratic entropy where the Parzen window method was applied to the non-parametric density estimation of the Chaos Game Representation/Universal Sequence Maps (CGR/USM) of a sequence. Subsequently, the estimation was decisively refined to the accuracy that the determination of local entropy requires. This advance, reported elsewhere, introduced a two-parameter fractal-based kernel, instead of Gaussian functions, which is more adequate to the geometry of the CGR domain.

The Entropic Profiles proposed here assess point/symbol normalized deviations from a mean composition signature. EP calculation was based on a density estimation value per position, thus depicting local sequence information related with the statistical significance of a motif, measured as its global over- or under-representation. Furthermore, it was shown that using this kernel the EP can be calculated independently from a particular representation. The local genome scale (or resolution) is defined by the combination of parameters for which a particular suffix emerges. Therefore, this scanning procedure identifies simultaneously the position and the scale at with the sequence composition is singular, by focusing and adjusting the best parameters locally and then looking back to the overall sequence. There is a strong biological rationale for such an approach as the genome is organized to conserve motifs at different scales (lengths) and with varying stringency. The underlying hypothesis is that over- or underrepresented motifs may be indicative of important biological functions.

This conclusion was illustrated with the analysis of artificial DNA sequences, reference genomic datasets and also whole genomes from E. coli and H. influenzae, where known regulatory components and motifs were correctly recovered – both as regards position and scale (length) of the conserved segments. By spanning the parameter space of this new function it was possible to study the local scale for which a given suffix and position were implicit. This effort highlighted the interaction between several methodologies in this field. Specifically, it greatly simplifies the exploration of fundamental relationships between distinct sequence analysis approaches and concepts such as metrics on strings, information theory and entropy, iterated function systems and statistical significance of DNA segments, providing a common ground in kernel-based learning theory.

The procedure proposed here is easily extendable to other kernel function classes, which might be more adequate to model specific traits or genomes. Future work includes the generalization for point mutations and also dealing with nested or embedded motifs.

The proposed entropic profiles provide promising new tools for the study of biological sequences, allowing the quantification of repeatability and identifying key parameters for which relevant features arise.

Methods

This section recalls the background work that led to the new analysis described here and defines the main concepts proposed, namely: the CGR/USM representation of DNA sequences; the assessment of entropy in biological sequences and definition of local Entropic Profile (EP); the use of specialized kernel density estimation functions and its conjugation with the EP method.

CGR/USM representation of DNA sequences and Parzen's method

The CGR/USM representation, introduced in [6] and generalized to higher-order alphabets in [41], allows the mapping of a discrete DNA sequence onto ℝⁿ. Formally, the CGR mapping x_i∈ ℝ² of a N-length DNA sequence S = s₁ s₂ ... s_N, s_i∈ $A=$ = {A, C, G, T }, i = 1, ..., N is given by Equation 1:

{\begin{matrix} \begin{array}{l} x_{0} = (0.5, 0.5) \\ x_{i} = x_{i - 1} + \frac{1}{2} (y_{i} - x_{i - 1}), i = 1, ..., N \end{array} & where y_{i} = {\begin{matrix} (0, 0) & if & s_{i} =' A' \\ (0, 1) & if & s_{i} =' C' \\ (1, 0) & if & s_{i} =' G' \\ (1, 1) & if & s_{i} =' T' \end{matrix} \end{matrix}

(1)

The properties and generalizations of this method have already been studied and extensively applied as a consequence to the natural development of alignment-free techniques for sequence comparison [13, 42].

As previously, the variables employed in this work will be the USM coordinates sample points {x_i}_{i = 1, ..., N}that correspond to the symbols {s_i}_{i = 1, ..., N}in the original sequence.

In particular, it was seen in the previous report that these points could be adequately used to estimate the Rényi entropy of the original sequence through the Parzen's window density estimation method [43]. This is a non-parametric technique used to estimate a probability density function f from a sample. This method is one of the most widely used kernel-based methods and consists on the choice of a weighting function or kernel κ_θ(x). The estimation $\hat{f}$ (x) of a random vector x is a linear combination of the kernels centered in the observed sample points a_i, i = 1, ..., N, and is defined for a specific window width τ (Eq.2):

\hat{f} (x; a, θ) = {\hat{f}}_{θ} (x; a) = \frac{1}{N τ} \sum_{i = 1}^{N} κ_{θ} (\frac{x - a_{i}}{τ})

(2)

In that former report [19] Gaussian or normal distribution functions were used in order to estimate the Rényi quadratic entropy of the CGR of a given DNA sequence. Due to important algebraic simplifications and properties of the Gaussian kernel it was shown that this calculation was obtained by using a simple potential function of the CGR map.

Entropic profile definition

The former equations provide a natural method to extract local information from a DNA sequence. By calculating the values $\hat{f}$ _θ(x_i) for each coordinate x_ithat represents the i^th symbol in the original sequence and parameter set θ, it is possible to plot, for each position i = 1, ..., N, normalized values $\hat{g}$ _θ(x) ≡ $\hat{g}$ _θ(x; a) of the density function estimated previously, obtained as the number of standard deviations from the mean (taking into account all the sample points or symbols, omitted for notation simplification):

{\hat{g}}_{θ} (x) \equiv \frac{{\hat{f}}_{θ} (x) - m_{θ}}{s_{θ}}, with m_{θ} = \frac{1}{N} \sum_{i = 1}^{N} {\hat{f}}_{θ} (x_{i}) and s_{θ} = \sqrt{\frac{1}{N - 1} \sum_{i = 1}^{N} {({\hat{f}}_{θ} (x_{i}) - m_{θ})}^{2}}

(3)

In fact, this corresponds to extracting the local density, estimated for each coordinate that represents a symbol in the original sequence context. For example, if a particular motif appears more often than what would be expected by chance, the density estimation for that particular position/coordinate will be higher than the average m_θ.

For each parameter set θ one can define the Entropic Profile EP_θ(i) ≡ $\hat{g}$ _θ(x_i), i = 1, ..., N, that measures precisely the density deviations from the mean in each coordinate, or equivalently, in each last symbol of all the suffixes appearing in the original sequence.

Therefore, these values obtained with the kernel estimations are related to the statistical significance of the corresponding suffix present at that particular position, since they represent a density, which is strongly associated with the degree of repetition of a given suffix in the sequence.

It is worth noting that the proposed entropic profiles are a descriptive measure of local DNA properties and that a full extensive comparison with other methods that search for motifs and assign p-values to the results are out of the scope of this work. Future efforts will quantitatively compare these profiles with other models, e.g. Markov chain models, to confirm for the quantitative correspondence between methods on the assessment of under and over-representation of motifs.

Fractal kernel definition

The former approach used Gaussian distribution function to model the generalized Markov models. One possible drawback of this methodology is related with the domain issue above mentioned, since the normal distribution function is defined in ℝⁿwhereas the CGR/USM domain is explicitly defined in unit hypercubes. This concern lead to the development of another kernel [23] to be used in the CGR density estimation, which is recalled, reformulated and further discussed in this section.

Let χ_A: X → {0,1}, A ⊂ X ⊆ ℝ, be an indicator or characteristic function such that:

χ_{A} (x) = {\begin{matrix} 1 & if x \in A \\ 0 & if x \notin A \end{matrix}

Each function $I'_{k, x_{j}}$ : X → {0,1} with parameters k and x_jis defined for a point x ∈ X as:

I'_{k, x_{j}} (x) = χ_{A_{k, x_{j}}} (x)

where the interval $A_{k, x_{j}}$ depends on the point x_j∈ X and on the resolution k chosen:

A_{k, x_{j}} = (2^{- k} ⌊ 2^{k} x_{j} ⌋, 2^{- k} ⌊ 2^{k} x_{j} ⌋ + 2^{- k})

and ⌊x_j⌋ denotes the floor function. The interval above defined $A_{k, x_{j}}$ has length V ( $A_{k, x_{j}}$ ) = 2^-k.

Intuitively, this function rounds the value of x_j, respecting the borders of the regions that represent specific k-tuples, which are always given by multiples of 2^-k(see figure 1 in[19]). This might also be interpreted as the number of common digits of the binary representation of x_jand x, up to the k^thdecimal digit. This is more clearly deduced using numeric representation in base 2.

For the CGR mapping $\vec{x}$ ≡ (x⁽¹⁾, x⁽²⁾) ∈ ℝ², the 2D step function for a point ${\vec{x}}_{j} \equiv (x_{j}^{(1)}, x_{j}^{(2)})$ ∈ ℝ² is defined as $I_{k, {\vec{x}}_{j}} (x^{(1)}, x^{(2)}) = I'_{k, x_{j}^{(1)}} (x^{(1)}) \times I'_{k, x_{j}^{(2)}} (x^{(2)})$ , i.e., the function is 1 when both coordinates x⁽¹⁾and x⁽²⁾belong the above mentioned intervals and is zero otherwise. This is due to the indicator function property χ_{A ∩ B}= χ_Aχ_B. For sake of clarity and notation simplification, in the following formulas all the variables x and x_jwill be assumed in ℝ² otherwise stated, i.e. $I_{k, {\vec{x}}_{j}} (x^{(1)}, x^{(2)}) \equiv I_{k, x_{j}} (x)$ .

The kernel κ^f(x) used in this work and extensively presented in [23] is based on the linear combination of block functions I_k, using particular resolutions k and a parameter h that defines the height (or weight) of each block:

κ^{f} (x) \equiv κ_{L, x_{j}} (x) = \sum_{k = 0}^{L} h_{k} \cdot I_{k, x_{j}} (x) .

Additionally, considering the restriction of probability density functions, the following equation is obtained:

\int κ_{L, x_{j}} (x) d x = 1 \Rightarrow \sum_{k = 0}^{L} h_{k} 2^{- 2 k} = 1

since $\int I_{k, x_{j}} (x) d x = V (A_{k, x_{j}}) = 2^{- 2 k}$ and $V (A_{k, {\vec{x}}_{j}}) = V (A_{k, x_{j}^{(1)}}) V (A_{k, x_{j}^{(2)}}) = 2^{- 2 k}$ .

Defining φ as the (constant) ratio between two consecutive volumes A_kand A_k-1, k = 1, ..., L (in 2D):

φ = \frac{V (A_{k})}{V (A_{k - 1})} = \frac{1}{4} \frac{h_{k}}{h_{k - 1}} \Rightarrow h_{k} = 4 φ \cdot h_{k - 1} = {(4 φ)}^{k} \cdot h_{0},

it is possible to express this restriction in terms of φ as:

h_{0} \sum_{k = 0}^{L} φ^{k} = 1

And finally the (normalized) kernel $κ_{L, φ, x_{j}}$ (x) with parameters L, φ and x_jis:

κ_{L, φ, x_{j}} (x) = \sum_{k = 0}^{L} h_{k} \cdot I_{k, x_{j}} (x) = \frac{\sum_{k = 0}^{L} {(4 φ)}^{k} \cdot I_{k, x_{j}} (x)}{\sum_{k = 0}^{L} φ^{k}}

(4)

The underlying idea is to weight, by powers of 4 φ, each step function $I_{k, x_{j}}$ (x), which corresponds to a sort of generalized Markov model. An illustration of this kernel function (projected to one-dimensional space) is given in Figure 8 for L = 2 which correspond to three blocks $I_{k, x_{j}}$ (x), k = 0,1,2.

Another important property of this function κ is its symmetry regarding x_iand x_j, in fact, $κ_{L, φ, x_{j}} (x_{i}) = κ_{L, φ, x_{i}} (x_{j})$ since ${I^{'}}_{k, x_{i}} (x_{j}) = {I^{'}}_{k, x_{j}} (x_{i})$ . Actually, if x_ibelongs to the interval A_kmeans that x_iand x_jhave the same binary expansion up to the k digit, which obviously implies symmetry.

This allows a straightforward generalization under kernel learning theory in which specific transformation of the data with kernel functions induce dot products and norms in other function spaces [44]. In fact, this kernel is related with the Cantor distance in strings, which measures precisely the suffix similarity.

Furthermore, it should be clear that this new fractal kernel is more adjusted to the CGR geometry: instead of Gaussian functions that span all ℝⁿdomain the proposed κ (x) is defined on unit hypercubes, which is definitely more in agreement with these iterative maps.

Entropic profiles with fractal kernels

When using the above-defined fractal-based kernel, the expression for the estimation for the entropic profile is significantly simplified, thus allowing its optimal and straightforward calculation. In fact, for a particular coordinate, each density block is only different from zero if the points in that neighborhood are close, in the sub-quadrant sense. In other words, for one position, the only non-zero blocks of length k correspond to the nearest points, which are at a distance less than 2^-kapart.

Another important note is that this particular kernel, contrary to the Gaussian which only has two parameters (mean and variance), depends on the point x_j: in effect, the format of the kernel varies according to the rounding procedure and the particular coordinate x_jconsidered.

Therefore, the Parzen density estimation for position i or point x_iis given as a function of all the other sample coordinates x_j, j = 1, ..., N, and parameter set θ ≡ {L, φ}, where L represents the Markov resolution and φ is a smoothing parameter:

{\hat{f}}_{L, φ} (x_{i}) = \frac{1}{N} \sum_{j = 1}^{N} κ_{L, φ, x_{j}} (x_{i})

By simple algebraic simplification and using Eq.4 one obtains a more condensed formula:

{\hat{f}}_{L, φ} (x_{i}) = \frac{1}{N \sum_{k = 0}^{L} φ^{k}} \sum_{k = 0}^{L} {(4 φ)}^{k} \sum_{j = 1}^{N} I_{k, x_{j}} (x_{i}) where

I_{k, x_{j}} (x_{i}) = {\begin{array}{l} 1 & if x_{i}^{(1)} \in A_{k, x_{j}^{(1)}} and x_{i}^{(2)} \in A_{k, x_{j}^{(2)}} \\ 0 & otherwise \end{array}

Due to the CGR suffix property, the last condition is equivalent of having the same suffix of length k, i.e., $x_{i}^{(1)} \in A_{k, x_{j}^{(1)}}$ and $x_{i}^{(2)} \in A_{k, x_{j}^{(2)}}$ if and only if the string with length k corresponding to the CGR coordinate x_iis the same as the one represented by coordinate x_j. Therefore the sum $\sum_{j = 1}^{N} I_{k, x_{j}}$ (x_i) that appears in the last equation is calculated by simply counting the number of common suffixes of length k shared through all the sequence S:

\sum_{j = 1}^{N} I_{k, x_{j}} (x_{i}) = \sum_{j = 1}^{N} δ_{s_{i}^{k} s_{j}^{k}} = c o u n t_{s} (s_{i - k + 1} \dots s_{i}) = c ([i - k + 1, i])

where δ_ijis the Kronecker delta and $s_{i}^{k}$ is the suffix of length k that ends in position i.

Finally, and using this result, the Parzen density estimation with this kernel can be simplified to the formula given by the following Equation 5:

{\hat{f}}_{L, φ} (x_{i}) = \frac{1 + \frac{1}{N} \sum_{k = 1}^{L} 4^{k} φ^{k} \cdot c ([i - k + 1, i])}{\sum_{k = 0}^{L} φ^{k}}, L \geq 1

(5)

Computationally, this is a significant result since it allows the simplification of $\hat{f}$ _{L, φ}(x_i): instead of having to calculate individual kernel function for each point and sum all the contributions, one can simplify the calculation up to a desired resolution or memory length L, greatly reducing the associated algorithmic complexity from quadratic to linear on L and sequence length. In the supplementary MATLAB functions available along this report this simplification was taken into account. In practice this is an important result since low resolutions L are commonly used, remembering that they represent Markov orders. Indeed, most approaches in sequence modeling use Markov orders below 8, which greatly simplifies the calculation time. Some limiting properties of the estimation f for different φ include:

\begin{array}{l} \lim_{φ \to \infty} {\hat{f}}_{L, φ} (x_{i}) = \frac{4^{L}}{N} \cdot c ([i - L + 1, i]) = \frac{4^{L}}{N} \cdot c (L -tuple suffix i) \\ \lim_{φ \to 0} {\hat{f}}_{L, φ} (x_{i}) = 1 \end{array}

These results show that the parameter φ is weighting different Markov chain models: φ = 0 means that a zero order, background (equal) frequencies are taken, whereas φ → ∞ corresponds to weighting higher L-tuples, ignoring the lower order counts, which, in the limiting case, is equivalent to a L-order Markov chain.

In effect, $\hat{f}$ _{L, φ}(x_i) can be interpreted as a linear combination of suffixes counts up to a given memory length, with increasing (φ > 1/4) or decreasing weights (φ < 1/4). These results came up as quite unpredictably, since the kernel defined above was based on a different rationale. It turned out that both perspectives are equivalent in terms of final formulation. It is also noteworthy the relation between this methodology and generalized Markov models and interpolated Markov chains (IMM). In fact, similar profiles were obtained recently [39] representing the shortest unique substrings in sequences.

In the application section when calculating the normalized values EP_θ(i) ≡ $\hat{g}$ _θ(x_i), one has to consider a burnt-in period corresponding to the first symbols in the sequence. Since the estimation of the profile is biased in the sense that only higher order tuples are considered, it is necessary to exclude these first points f(x_i), i = 1, ..., b₀, given that no information is provided for higher suffixes up to that position. For that reason, this correction was taken into account when using the EP normalized values. This border effect is nevertheless negligible and can be ignored for longer sequences. The background just presented will allow the representation of the entropic profiles EP_θ≡ $\hat{g}$ _{L, θ}as a function of both L and θ and search for key parameters combinations to unravel the scale upon which important features might arise in the original DNA sequence.

Markov Chain-based p-value calculation

In order to compare our method with previous efforts, we also report the p-values and respective statistical z-scores for the motifs analyzed. These values were calculated using first-order Markov Chain transition probability tables estimated directly from the whole sequences. This estimation was based on the relative frequency of each oligonucleotide, using pseudo-counts to avoid zero transition probabilities when necessary. After this step, the probability of each motif can be easily accessed along with their expected number of occurrences in a specific sequence. The calculation of the p-value of a motif m is therefore the probability of observing more counts N(m) than those expected under that given model, i.e., prob{N(m) ≥ N_obs(m)}. The normal distribution was used as an approximation for the distribution of N(m), with expected values and variances described in [1]. These variances took into account the overlap capacity or period of each motif, as described in the same reference. Other approximations, such as using the Poisson distribution, give the same relative order for the motifs. The p-values calculated are reported for each motif referred in the text. To complement the analysis and since many of the motifs studied exhibit very low p-values, practically equal to zero, i.e. they are exceptionally frequent, the z-scores and their relative rank order was also reported. In this way a more accurate comparison can be performed.

References

Robin S, Rodolphe F, Schbath S: DNA, words, and models. New York, NY: Cambridge University Press; 2005.
Google Scholar
Buhlmann P, Wyner AJ: Variable length Markov chains. Ann Stat 1999, 27: 480–513. 10.1214/aos/1018031204
Article Google Scholar
Bejerano G: Algorithms for variable length Markov chain modeling. Bioinformatics 2004, 20: 788–789. 10.1093/bioinformatics/btg489
Article CAS PubMed Google Scholar
Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res 1998, 26: 544–548. 10.1093/nar/26.2.544
Article PubMed Central CAS PubMed Google Scholar
Tino P, Dorffner G: Predicting the future of discrete sequences from fractal representations of the past. Machine Learning 2001, 45: 187–217. 10.1023/A:1010972803901
Article Google Scholar
Jeffrey HJ: Chaos Game Representation of Gene Structure. Nucleic Acids Res 1990, 18: 2163–2170. 10.1093/nar/18.8.2163
Article PubMed Central CAS PubMed Google Scholar
Gusfield D: Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge [England]; New York: Cambridge University Press; 1997.
Book Google Scholar
Jernigan RW, Baran RH: Pervasive properties of the genomic signature. BMC Genomics 2002, 3: 23. 10.1186/1471-2164-3-23
Article PubMed Central PubMed Google Scholar
Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11: 283–290. 10.1016/S0168-9525(00)89076-9
Article CAS PubMed Google Scholar
Deschavanne P, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 1999, 16: 1391–1399.
Article CAS PubMed Google Scholar
Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P: Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res 2005, 33: e6. 10.1093/nar/gni004
Article PubMed Central PubMed Google Scholar
Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signatures: from dinucleotides to chaos game representation. Gene 2005, 346: 173–185. 10.1016/j.gene.2004.10.021
Article CAS PubMed Google Scholar
Vinga S: Biological sequence analysis by vector-valued functions: revisiting alignment-free methodologies for DNA and protein classification. In Advanced Computational Methods for Biocomputing and Bioimaging. Edited by: Pham TD, Yan H, Crane DI. New York: Nova Science Publishers; 2007.
Google Scholar
Haubold B, Wiehe T: How repetitive are genomes? BMC Bioinformatics 2006, 7: 541. 10.1186/1471-2105-7-541
Article PubMed Central PubMed Google Scholar
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444: 444–454. 10.1038/nature05329
Article PubMed Central CAS PubMed Google Scholar
Herzel H, Ebeling W, Schmitt AO: Entropies of biosequences: The role of repeats. Phys Rev E 1994, 50: 5061–5071. 10.1103/PhysRevE.50.5061
Article CAS Google Scholar
Holste D, Grosse I, Beirer S, Schieg P, Herzel H: Repeats and correlations in human DNA sequences. Phys Rev E 2003, 67: 061913. 10.1103/PhysRevE.67.061913
Article Google Scholar
Holste D, Grosse I, Herzel H: Statistical analysis of the DNA sequence of human chromosome 22. Phys Rev E 2001, 6404: 041917. 10.1103/PhysRevE.64.041917
Article Google Scholar
Vinga S, Almeida JS: Rényi continuous entropy of DNA sequences. J Theor Biol 2004, 231: 377–388. 10.1016/j.jtbi.2004.06.030
Article CAS PubMed Google Scholar
Oliver JL, Bernaola-Galvan P, Guerrero-Garcia J, Roman-Roldan R: Entropic profiles of DNA sequences through chaos-game-derived images. J Theor Biol 1993, 160: 457–470. 10.1006/jtbi.1993.1030
Article CAS PubMed Google Scholar
Troyanskaya OG, Arbell O, Koren Y, Landau GM, Bolshoy A: Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics 2002, 18: 679–688. 10.1093/bioinformatics/18.5.679
Article CAS PubMed Google Scholar
Crochemore M, Verin R: Zones of low entropy in genomic sequences. Comput Chem 1999, 23: 275–282. 10.1016/S0097-8485(99)00009-1
Article CAS PubMed Google Scholar
Almeida JS, Vinga S: Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol 2006, 1: 18. 10.1186/1748-7188-1-18
Article PubMed Central PubMed Google Scholar
Sourice S, Biaudet V, El Karoui M, Ehrlich SD, Gruss A: Identification of the Chi site of Haemophilus influenzae as several sequences related to the Escherichia coli Chi site. Mol Microbiol 1998, 27: 1021–1029. 10.1046/j.1365-2958.1998.00749.x
Article CAS PubMed Google Scholar
Freeman JM, Plasterer TN, Smith TF, Mohr SC: Patterns of Genome Organization in Bacteria. Science 1998, 279: 1827a. 10.1126/science.279.5358.1827a
Article Google Scholar
Arakawa K, Uno R, Nakayama Y, Tomita M: Validating the significance of genomic properties of Chi sites from the distribution of all octamers in Escherichia coli. Gene 2007.
Google Scholar
Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of Escherichia coli K-12. Science 1997, 277: 1453–1474. 10.1126/science.277.5331.1453
Article CAS PubMed Google Scholar
Dubnau D: DNA uptake in bacteria. Annu Rev Microbiol 1999, 53: 217–244. 10.1146/annurev.micro.53.1.217
Article CAS PubMed Google Scholar
Davidsen T, Rodland EA, Lagesen K, Seeberg E, Rognes T, Tonjum T: Biased distribution of DNA uptake sequences towards genome maintenance genes. Nucleic Acids Res 2004, 32: 1050–1058. 10.1093/nar/gkh255
Article PubMed Central CAS PubMed Google Scholar
Karlin S, Mrazek J, Campbell AM: Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res 1996, 24: 4263–4272. 10.1093/nar/24.21.4263
Article PubMed Central CAS PubMed Google Scholar
Chu D, Rowe J, Lee HC: Evaluation of the current models for the evolution of bacterial DNA uptake signal sequences. J Theor Biol 2006, 238: 157–166. 10.1016/j.jtbi.2005.05.024
Article CAS PubMed Google Scholar
Bakkali M, Chen TY, Lee HC, Redfield RJ: Evolutionary stability of DNA uptake signal sequences in the Pasteurellaceae. Proc Natl Acad Sci USA 2004, 101: 4513–4518. 10.1073/pnas.0306366101
Article PubMed Central CAS PubMed Google Scholar
Bouvier A, Gélis F, Schbath S: R'MES: Recherche de Mots Exceptionnels dans les Séquences d'ADN – Version 2. Guide de l'utilisateur INRA, Biométrie, F78352 Jouy-en-Josas 1999.
Google Scholar
Ettwiller LM, Rung J, Birney E: Discovering novel cis-regulatory motifs using functional networks. Genome Res 2003, 13: 883–895. 10.1101/gr.866403
Article PubMed Central CAS PubMed Google Scholar
Vilo J, Brazma A, Jonassen I, Robinson A, Ukkonen E: Mining for putative regulatory elements in the yeast genome using gene expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8: 384–394.
CAS PubMed Google Scholar
Koo HS, Wu HM, Crothers DM: DNA bending at adenine. thymine tracts. Nature 1986, 320: 501–506. 10.1038/320501a0
Article CAS PubMed Google Scholar
Chen LY, Lu SH, Shih ES, Hwang MJ: Single nucleotide polymorphism mapping using genome-wide unique sequences. Genome Res 2002, 12: 1106–1111. 10.1101/gr.224502. Article published online before print in June 2002
Article PubMed Central CAS PubMed Google Scholar
Liao BY, Chang YJ, Ho JM, Hwang MJ: The UniMarker (UM) method for synteny mapping of large genomes. Bioinformatics 2004, 20: 3156–3165. 10.1093/bioinformatics/bth380
Article CAS PubMed Google Scholar
Haubold B, Pierstorff N, Moller F, Wiehe T: Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics 2005, 6: 123. 10.1186/1471-2105-6-123
Article PubMed Central PubMed Google Scholar
Vandenbogaert M, Makeev V: Analysis of bacterial RM-systems through genome-scale analysis and related taxonomy issues. In Silico Biol 2003, 3: 127–143.
CAS PubMed Google Scholar
Almeida JS, Vinga S: Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics 2002, 3: 6. 10.1186/1471-2105-3-6
Article PubMed Central PubMed Google Scholar
Vinga S, Almeida J: Alignment-free sequence comparison – a review. Bioinformatics 2003, 19: 513–523. 10.1093/bioinformatics/btg005
Article CAS PubMed Google Scholar
Parzen E: On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics 1962, 33: 1065–1076. 10.1214/aoms/1177704472
Article Google Scholar
Schoelkopf B, Smola AJ: Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge, Mass.: MIT Press; 2002.
Google Scholar
Helmann JD: Compilation and analysis of Bacillus subtilis sigma A-dependent promoter sequences: evidence for extended contact between RNA polymerase and upstream promoter DNA. Nucleic Acids Res 1995, 23: 2351–2360. 10.1093/nar/23.13.2351
Article PubMed Central CAS PubMed Google Scholar
Vanet A, Marsan L, Sagot M-F: Promoter sequences and algorithmical methods for identifying them. Res Microbiol 1999, 150: 779–799. 10.1016/S0923-2508(99)00115-1
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors acknowledge partial support by projects MaGiC (IE02ID01004 – A. T. Freitas, PI) from INESC-ID and DynaMo (PTDC/EEA-ACR/69530/2006 – S. Vinga, PI) from the Portuguese Science Foundation (FCT). The authors would like to thank Prof. Sophie Schbath (INRA, France) for her comprehensive explanation of the statistical properties of Chi sequences and for kindly providing the processed genomes used in her previous analysis. The authors also thank Eng. Ana Casimiro for calculating the p-values and z-scores reported and Prof. Arlindo Oliveira (INESC-ID/IST, Lisboa) for insightful suggestions during the preparation of this work. Finally, the authors would like to thank the two anonymous referees for their comments and valuable suggestions that greatly improved this manuscript.

Author information

Authors and Affiliations

Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029, Lisboa, Portugal
Susana Vinga
Departamento de Bioestatística e Informática, Faculdade de Ciências Médicas, Universidade Nova de Lisboa (FCM/UNL), Campo dos Mártires da Pátria 130, 1169-056, Lisboa, Portugal
Susana Vinga
Dept Biostatistics and Applied Mathematics, Univ. Texas MDAnderson Cancer Center – unit 447, 1515 Holcombe Blvd, Houston, TX, 77030-4009, USA
Jonas S Almeida
Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa (ITQB/UNL), R. Qta. Grande 6, 2780-156, Oeiras, Portugal
Jonas S Almeida

Authors

Susana Vinga
View author publications
You can also search for this author in PubMed Google Scholar
Jonas S Almeida
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Susana Vinga.

Additional information

Competing interests

The author(s) declares that there are no competing interests.

Authors' contributions

SV devised and developed the methodology, carried out the analysis and wrote the manuscript. JSA participated in the implementation of the algorithms, the design of the study and contributed to the analysis and interpretation of results. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Authors’ original file for figure 5

Authors’ original file for figure 6

Authors’ original file for figure 7

Authors’ original file for figure 8

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Vinga, S., Almeida, J.S. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 8, 393 (2007). https://doi.org/10.1186/1471-2105-8-393

Download citation

Received: 10 May 2007
Accepted: 16 October 2007
Published: 16 October 2007
DOI: https://doi.org/10.1186/1471-2105-8-393

Local Renyi entropic profiles of DNA sequences