 Research Article
 Open Access
On the comparison of regulatory sequences with multiple resolution Entropic Profiles
 Matteo Comin^{1} and
 Morris Antonello^{1}
https://doi.org/10.1186/s12859-016-0980-2
© Comin and Antonello. 2016
 Received: 29 April 2015
 Accepted: 6 March 2016
 Published: 18 March 2016
Abstract
Background
Enhancers are stretches of DNA (100–1000 bp) that play a major role in development, gene expression, evolution and disease. It has recently been shown that in high-level eukaryotes enhancers rarely work alone; instead, they collaborate by forming clusters of cis-regulatory modules (CRMs). Although the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult and cannot be carried out with traditional alignment-based techniques.
Results
The use of fast similarity measures, like alignment-free measures, to detect related regulatory sequences is crucial for understanding the functional correlation between two enhancers. In this paper we study the use of alignment-free measures for the classification of CRMs. However, alignment-free measures are generally tied to a fixed resolution k. Here we propose an alignment-free statistic, called \(EP^{*}_{2}\), that is based on multiple resolution patterns derived from the Entropic Profiles (EPs). The Entropic Profile is a function of the genomic location that captures the importance of that region with respect to the whole genome. As a by-product we provide a formula to compute the exact variance of variable-length word counts, a result that can be of general interest in other applications as well.
Conclusions
We evaluate several alignment-free statistics on simulated data and real mouse ChIP-seq sequences. The new statistic, \(EP^{*}_{2}\), is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. We implemented the new alignment-free measures, as well as traditional ones, in a software tool called EPsim that is freely available: http://www.dei.unipd.it/~ciompin/main/EPsim.html.
Keywords
 Alignment-free
 Sequence comparison
 Entropic profiles
Background
How to measure the degree of similarity between biological sequences is one of the foremost questions on the mind of bioinformaticians. This problem relates to the identification of homologous sequences like proteins and, to this end, the use of tools like BLAST is nowadays a standard procedure. In this paper we study the same question but for regulatory sequences such as promoters or enhancers of genes. The detection of similarities between coding sequences is a widespread approach to estimate functional correlations. Indeed, there is a general belief that similar binding site contents in regulatory sequences are expected to drive similar expression patterns [1]. Moreover, large collections of regulatory sequences have become available after the advent of ChIP-seq technologies, and the identification of sequences regulating the same cell type is a crucial step in the analysis of ChIP-seq data.
Many articles [1] discuss recent views on enhancers or cis-regulatory modules (CRMs), one of several types of genomic regulatory elements, and their coordinated action in regulatory networks. Enhancers are stretches of DNA (100–1000 bp) that play a major role in the development of gene expression. They can upregulate, i.e. enhance, the transcription process driving animal development. A single cell can give rise to a multitude of different cell types and organs, which acquire different functions by expressing different sets of genes [2]. These modules are known to play a key role in the regulation of the transcription process, for example in Human [3] and in Drosophila [4].
Here we summarize the main features of CRMs. First, they contain several short (6–15 bp) DNA motifs that act as binding sites for transcription factors (TFBSs) and often allow different nucleotides at some of the binding positions. In other words, there may be mutations in TFBSs. Second, these TFBSs act seemingly independently of the distance and orientation to their target genes as a consequence of looping. It follows that the strand to which a CRM under study belongs is unknown, so both cases need to be considered. Third, they maintain their functions independently of the sequence context, are modular and contribute additively and partly redundantly to the overall expression pattern of their target genes. Finally, enhancers with similar transcription factor binding site content have a high probability of bearing a similar function. This is why predictions and classifications of enhancers can be addressed by similarity searches. However, the presence of multiple binding sites, with different spacing between them, can make the comparison of two CRMs very difficult. For these reasons biologists first need to screen ChIP-seq datasets to select cell-specific regulatory sequences on the basis of common contents.
A similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification. As opposed to traditional methods that output a list of putative TFBSs, alignment-free methods [5–7] do not try to find any candidates. Instead, they analyze many long regulatory regions, which are composed of several TFBSs along with the background, in order to group together those sharing a similar content in terms of TFBSs. If the identification and positioning of TFBSs are of concern, then well-known tools like MotifSampler [8] can be applied as a post-processing step.
The comparison of sequences can be carried out without the need for costly alignments. A sequence can be represented by its word distribution. It has been shown that the word content and distribution can be effectively used to compare sequences in a number of applications [9]. This recent research field is usually referred to as alignment-free. In the context of CRMs, where it is assumed that a similar function is driven by the presence of different binding site contents, the idea of describing a sequence by its word distribution still works just as well. In addition, alignment-free methods are receiving increasing attention because they are computationally efficient and can provide attractive alternatives when alignment-based approaches fail. For example, the study of organism evolution using whole-genome sequences is impossible to conduct with traditional alignment techniques [10, 11]. Similarly, the comparison of genomes from next-generation sequencing data can be performed only with alignment-free methods [12–14]. Several alignment-free methods have been devised for the identification of cis-regulatory modules [5–7].
In general, alignment-free methods are based on statistics of words of fixed length k. The problem with these methods is that their performance depends dramatically on the choice of the resolution k [10]. For example, in the analysis of enhancers using simulated data [5, 6], the best performing k is usually equal to the length of the implanted TFBS. In real cases its choice is critical because it is not possible to know the enhancer length in advance. Moreover, in the presence of several TFBSs, it is simply not feasible to select the k that best fits enhancers of different lengths. The statistical profile of variable-length words in known CRMs has been used for the identification of potential CRMs in [15]. However, this method is supervised, in the sense that it uses orthologs of the known CRMs. In this paper we extend the idea of alignment-free measures by accounting for multiple resolutions, without depending on either prior knowledge or accurate prediction of TFBSs.

Specifically, the contributions of this paper are the following:
- we extend the function EP for pairwise sequence comparison;
- as a by-product, given that the word counts are not independent because of overlaps, we provide a formula for computing the exact variance of variable-length word counts;
- we show that pairwise sequence similarity of regulatory sequences is able to estimate similar in vivo activity.
In the next Sections “Previous work on alignment-free measures” and “Entropic profiles” we review the previous work on alignment-free statistics and present the original definition of the Entropic Profile. Then, in Section “Methods”, their statistical properties are studied and particular attention is paid to the role of the variance. The extension of the well-known alignment-free measures is discussed in Section “New alignment-free measures derived from Entropic Profiles”, and implemented in a tool called EPsim. In Section “Results and Discussion” the results are discussed and compared with the state of the art. Conclusions and future work are reported in Section “Conclusions”.
Previous work on alignment-free measures

Traditional alignment-based methods are poorly suited to the comparison of regulatory sequences, for several reasons:
- transcription factor binding sites are short motifs, so they frequently match genomic or even random DNA sequences; enhancer similarity or dissimilarity may therefore be due primarily to the background;
- enhancer location and orientation do not matter, so no reliable alignment can be obtained;
- alignments are time-consuming and inadequate for comparing sequences in realistically large datasets, e.g. large ChIP-seq datasets;
- enhancers do not work alone and their coordinated action cannot be fully explored with a single alignment.
On the contrary, alignment-free approaches provide viable alternatives [9, 20]. With the aim of effectively summarizing sequence content, they are usually based on k-mer counts.
An implementation of \(D_{2}\), \(D_{2}^{*}\) and \(D_{2}^{S}\) is provided by ALF [5], which, by default, uses another similarity measure named \(N_{2}\), one of the best available methods for the analysis of regulatory sequences. \(N_{2}\) aims at overcoming the limitation of exact word counts by taking into account word neighbourhood counts. \(N_{2}\) is defined similarly to \(D_{2}^{*}\) except that every word w is replaced with a set n(w) of words somehow linked to w, e.g. reverse complement and mismatches.
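To make the k-mer based statistics concrete, the following minimal sketch counts overlapping k-mers and computes a \(D_{2}\)-style inner product together with a centred, \(D_{2}^{*}\)-style variant. It assumes an i.i.d. background whose letter frequencies are estimated from the two sequences, and it normalises each term by the square root of the product of the expected counts; these choices and all function names are assumptions made for illustration and are not taken from ALF.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

def kmer_counts(seq, k):
    """Count overlapping k-mers of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(x, y, k):
    """Inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[w] * cy[w] for w in cx.keys() & cy.keys())

def d2_star(x, y, k):
    """Centred and normalised variant under an i.i.d. background (assumption)."""
    freq = Counter(x + y)
    total = sum(freq.values())
    p = {a: n / total for a, n in freq.items()}
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = len(x) - k + 1, len(y) - k + 1
    score = 0.0
    # Sum over every word of length k on the observed alphabet.
    for letters in product(sorted(p), repeat=k):
        w = "".join(letters)
        pw = prod(p[a] for a in w)
        ex, ey = nx * pw, ny * pw          # expected counts
        score += (cx[w] - ex) * (cy[w] - ey) / sqrt(ex * ey)
    return score
```

The centred variant rewards words that are over-represented in both sequences at once, which is the property the measures discussed here rely on.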
Several other alignment-free statistics have been proposed recently for different applications: multiple alignment [23], phylogeny [11, 24], classification of NGS data [12, 13], reads clustering [25, 26], and many others.
The major drawback of alignment-free measures is that they are all tied to the choice of the resolution k, which crucially influences performance but cannot be known in advance. Entropic Profiles, which are based on variable-length word counts by definition, can be extended to create new alignment-free measures accounting for multiple resolutions. In particular we will show that Entropic Profiles pave the way to more robust but still efficient alignment-free methods.
Entropic profiles
In the definition of the Entropic Profile [16, 17], l is the length of the entire sequence, L the resolution, i.e. the k-mer length, φ is a smoothing parameter, and c([i−k+1,i]) is the number of occurrences of \(x_{i-k+1}\dots x_{i}\), i.e. the suffix of length k that ends at position i.
Entropic Profiles have proved useful for the discovery of patterns in genomes [17] and they can be computed efficiently in linear time and space [27–29]. By definition Entropic Profiles are based on multiple resolution k-mer counts; thus, unlike almost all alignment-free measures, they are not tied to a fixed resolution k. Our intent is to extend this function to develop new alignment-free measures for the prediction and classification of enhancers.
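The counts that enter an Entropic Profile can be illustrated with a naive sketch: for every position i and every k up to L, it reports how often the k-mer ending at i occurs in the whole sequence. This version is for illustration only and does not reproduce the linear-time algorithms of [27–29]; the function name and the 0-based indexing are assumptions of the example.

```python
from collections import Counter

def suffix_counts(seq, L):
    """Return counts[i][k-1] = occurrences in seq of the k-mer ending at
    0-based position i, for k = 1..min(L, i + 1)."""
    # One table of overlapping k-mer counts per resolution k.
    tables = {
        k: Counter(seq[j:j + k] for j in range(len(seq) - k + 1))
        for k in range(1, L + 1)
    }
    counts = []
    for i in range(len(seq)):
        row = []
        for k in range(1, min(L, i + 1) + 1):
            row.append(tables[k][seq[i - k + 1:i + 1]])
        counts.append(row)
    return counts
```

Positions close to the start of the sequence have fewer than L suffixes, so their rows are shorter.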
Methods
From Entropic Profiles to multiple resolution alignment-free measures
In the simple entropy \(SE_{w}\) of a word w, \(c_{w,k}\) is the number of occurrences of the k-mer suffix \(s_{w,k}\) of w, and the weights \(a_{k}\) have been generalized.
The statistical properties of \(SE_{w}\) have not been carefully studied yet. In previous work [27], only the expectation of this function has been explored. In addition, in [16, 17], the standardization is done with respect to the arithmetic mean and standard deviation (see Formulas 6 and 7). This procedure can introduce biases due to the noise present in the input sequence. Indeed, the standardization does not depend on the word w that we want to score, but is instead applied regardless of the particular word w; see Formula 5, where mean and variance are computed once and for all from the sequence under examination. Different words have different probabilities of occurring; for example, the string AAAA has more chance to appear than ACGT, because of its autocorrelation. Thus the number of occurrences of a word should be standardized with respect to the word statistics, as in \(D_{2}^{*}\), already reported in Formula 3. In order to replicate the same scheme we first need to study the statistical properties of the simple entropy \(SE_{w}\).
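Read together with the description of \(c_{w,k}\) and \(a_{k}\), the simple entropy of a word can be sketched as a weighted sum of the counts of its suffixes, \(SE_{w} = \sum_{k=1}^{L} a_{k} c_{w,k}\). This form is an assumption inferred from the surrounding text rather than a quoted formula, and the helper names below are hypothetical.

```python
def count_occurrences(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def simple_entropy(seq, word, weights):
    """Weighted sum of the counts of the suffixes of word.

    weights[k-1] plays the role of a_k for k = 1..len(word) (assumption).
    """
    L = len(word)
    return sum(
        weights[k - 1] * count_occurrences(seq, word[L - k:])
        for k in range(1, L + 1)
    )
```

With all weights set to zero except the last one, the score reduces to the plain count of the word, i.e. to a fixed-resolution statistic.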
Computing the expected entropy
Without loss of generality the entire sequence \(S = \left (X_{1}, X_{2},\ldots, X_{i},\ldots, X_{l}\right)\) can be modeled by a stationary Markov chain [30]. Here, we use a first-order Markov chain, but all results can be extended to any other order. Thanks to the stationarity of the Markov chain, the probability μ(w) that a word w occurs does not depend on the position i, and it is: \( \mu (w) = \mu \left (w_{1}\right) \prod _{j=2}^{L} \pi \left (w_{j-1}, w_{j}\right)\), where \(\mu \left (w_{1}\right)\) is the probability that the first letter occurs and \(\pi (w_{j-1},w_{j})\) is the transition probability from letter \(w_{j-1}\) to \(w_{j}\).
Let \(Y_{i}(w)\) be the indicator variable of the event that an occurrence of the word w starts at position i. For each i, \(Y_{i}(w)\) is a Bernoulli variable with parameter μ(w), so its expectation is \(E[Y_{i}(w)]=\mu(w)\) and its variance is \(Var[Y_{i}(w)]=\mu(w)[1-\mu(w)]\). This indicator provides a way to define the number of occurrences \(c_{w}\) of word w: \( c_{w} = \sum _{i = 1}^{l - L + 1} Y_{i}(w)\).
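A minimal sketch of these quantities follows. It assumes that the letter probabilities and the transition probabilities are estimated from the sequence itself by simple frequency counts, with the empirical letter frequencies standing in for the stationary distribution; the text does not specify the estimator, so this choice and the function names are assumptions.

```python
from collections import Counter

def estimate_markov1(seq):
    """Estimate letter and first-order transition probabilities from seq."""
    letters = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    starts = Counter(seq[:-1])            # letters that have a successor
    mu1 = {a: n / len(seq) for a, n in letters.items()}
    pi = {(a, b): n / starts[a] for (a, b), n in pairs.items()}
    return mu1, pi

def word_probability(word, mu1, pi):
    """mu(w) = mu(w_1) * prod_j pi(w_{j-1}, w_j)."""
    p = mu1.get(word[0], 0.0)
    for a, b in zip(word, word[1:]):
        p *= pi.get((a, b), 0.0)
    return p

def expected_count(word, seq_len, mu1, pi):
    """E[c_w] = (l - L + 1) * mu(w), by linearity over the indicators."""
    return (seq_len - len(word) + 1) * word_probability(word, mu1, pi)
```

Transitions never observed in the sequence receive probability zero here; a practical estimator would probably add pseudocounts.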
Note that, as opposed to Formula 3, where the expected number of occurrences of the word w is estimated as (l−k+1)μ(w) (see definition of \(\tilde {A}_{w}\)), here \(SE_{w}\) accounts for multiple words of different lengths, and thus its expectation is computed accordingly.
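If the simple entropy is read as the weighted sum \(SE_{w} = \sum_{k=1}^{L} a_{k} c_{w,k}\), which is an assumption based on the description of \(c_{w,k}\) and \(a_{k}\) rather than a quoted formula, then linearity of expectation gives the expected entropy directly from the expected counts of the suffixes \(s_{w,k}\):

```latex
E[SE_{w}] \;=\; \sum_{k=1}^{L} a_{k}\, E[c_{w,k}]
          \;=\; \sum_{k=1}^{L} a_{k}\,(l-k+1)\,\mu(s_{w,k})
```

Each term uses the number of starting positions available to a word of length k, which is why the factor (l−k+1) changes with k instead of being fixed as in the single-resolution case.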
Computing the variance of entropy
In this section we continue to study the statistical properties of the entropies \(SE_{w}\). If we consider the standardization proposed in Formula 3, we can note that the denominator does not contain the exact variance but an approximation. The variance is replaced by the estimated mean of the word occurrence across the two sequences. If the probability of the word pattern is small, this approach can be justified by considering a Poisson approximation for the individual word counts. Here instead we are interested in deriving the exact variance of the entropies \(SE_{w}\).
Case 1: variance of the count
Pairs of occurrences of w contribute to the variance of the count \(c_{w}\) in three ways:
1. self-overlap of the word with itself;
2. partial self-overlap, the suffix of the word with its prefix or vice versa;
3. disjoint occurrences.
Formally:
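As an illustration of how self-overlaps enter the variance, the sketch below computes the exact variance of the count of a word under a simplified i.i.d. letter model, not the first-order Markov model used in the text. Under this simplification disjoint occurrences are independent and contribute nothing, so only the diagonal term and the partial self-overlaps remain. A brute-force enumeration is included for checking on very short sequences; all function names are illustrative.

```python
from itertools import product
from math import prod

def iid_count_variance(word, seq_len, p):
    """Exact Var[c_w] when letters are i.i.d. with probabilities p."""
    L = len(word)
    n = seq_len - L + 1                    # number of starting positions
    if n <= 0:
        return 0.0
    mu = prod(p[a] for a in word)
    var = n * mu * (1 - mu)                # full self-overlap (d = 0)
    for d in range(1, min(L, n)):          # partial self-overlaps
        if word[d:] == word[:L - d]:
            # Two copies at distance d spell word + its last d letters.
            joint = mu * prod(p[a] for a in word[L - d:])
        else:
            joint = 0.0
        var += 2 * (n - d) * (joint - mu * mu)
    return var

def brute_force_variance(word, seq_len, p):
    """Variance by enumerating every sequence; only for tiny seq_len."""
    L = len(word)
    m1 = m2 = 0.0
    for letters in product(sorted(p), repeat=seq_len):
        s = "".join(letters)
        weight = prod(p[a] for a in s)
        c = sum(1 for i in range(seq_len - L + 1) if s[i:i + L] == word)
        m1 += weight * c
        m2 += weight * c * c
    return m2 - m1 * m1
```

With uniform letter probabilities the variance of the count of AAAA exceeds that of ACGT, in line with the remark on autocorrelation made earlier.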
Case 2: covariance of the counts of words of different length
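The covariance between the counts of two words \(w^{\prime}\) and \(w^{\prime \prime}\) of different length can be split into two parts:
1. the former stands for all the terms due to two words of different length that do not start at the same position;
2. the latter stands for all the terms due to two words of different length that start at the same position (yellow words in Fig. 1).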
To reformulate the former and to study overlaps, we can always fix the first word \(w^{\prime}\) (the longest) and move \(w^{\prime \prime }\) (the shortest, i.e. its suffix). In particular, let d be the shift of the moving word \(w^{\prime \prime }\) with respect to the fixed word \(w^{\prime}\). A summary of the possible overlaps between \(w^{\prime}\) and \(w^{\prime \prime }\) is shown in Fig. 1, so as to make the subsequent analysis of the two parts easier.

Depending on the shift d, three cases are possible:
- left shift, d≥1 (red words);
- right shift, d≥1 (blue and green words);
- zero shift, d=0 (yellow word).
Left shift
1. prefix-suffix overlap: two overlapping words, the latter of which (red words in Fig. 1) starts before the beginning and ends before the end of the former.
2. two non-overlapping words.
Since the expectation does not depend on the position i we can write: \(E[Y_{i}(w')]E[Y_{i-d}(w^{\prime \prime })] = \mu (w')\mu (w^{\prime \prime })\).
Right shift
1. substring-string overlap: two overlapping words, the latter (blue words in Fig. 1) starts after the beginning and ends before the end of the former.
2. substring-prefix overlap: two overlapping words, the latter (green words in Fig. 1) starts before the end of the former and ends after it.
3. two non-overlapping words.
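The case analysis on the shift d can be sketched as a small classifier, together with a check of whether two words can actually co-occur at a given shift. The signed-shift convention (negative for a left shift) and the treatment of boundary cases, such as a short word ending exactly where the long one ends, are assumptions of this sketch.

```python
def classify_shift(long_len, short_len, d):
    """Classify the position of the short word relative to the long one.

    d is the start of the short word minus the start of the long word;
    negative values are left shifts (convention assumed here).
    """
    if d == 0:
        return "zero shift"
    if d < 0:
        return "prefix-suffix overlap" if -d < short_len else "non-overlapping"
    if d + short_len <= long_len:
        # The short word lies inside the long one (same end included).
        return "substring-string overlap"
    if d < long_len:
        return "substring-prefix overlap"
    return "non-overlapping"

def compatible(long_word, short_word, d):
    """True if the letters agree wherever the two words overlap at shift d."""
    for j, ch in enumerate(short_word):
        pos = d + j
        if 0 <= pos < len(long_word) and long_word[pos] != ch:
            return False
    return True
```

Only compatible overlaps give a non-zero joint probability, so in a covariance computation the incompatible shifts contribute just the negative product of the two word probabilities.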
Zero shift
This is the exact formula that, together with the other case, can be used to compute the variance of \(SE_{w}\). Unlike previous approaches that approximate the variance of equal-length word counts, we have also provided a formula for computing the exact variance of variable-length word counts. For the sake of simplicity, as done in [5], the last two terms, i.e. the non-overlapping terms, will be neglected, thereby assuming that the occurrence of non-overlapping words is independent of the sequence in between.
We believe that this result can be of general interest, and that it can be used also in other applications. For example exact word statistics are fundamental for the discovery of surprising/overrepresented patterns [30, 31].
New alignment-free measures derived from Entropic Profiles
Entropies and counts are very much alike, as already described in the previous section. The basic intuition is that Entropic Profiles can be used instead of k-mer counts, so that one can build alignment-free statistics that are not based on a fixed length k but are multiple resolution. This suggests that the state-of-the-art measures can be adapted by replacing the vector of k-mer counts with the vector of entropies.
While the implementation of \(EP_{2}\) is straightforward, \(EP_{2}^{*}\) is instead based on the statistical properties of entropies. The theory developed in the previous section is a prerequisite for the implementation of \(EP_{2}^{*}\).
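Following the analogy with \(D_{2}\), a plain inner product of entropy vectors can be sketched as below. The sketch assumes that the entropy of a word is the weighted sum of the counts of its suffixes and that the score sums over the L-mers occurring in both sequences; both points are assumptions made for illustration, and the code is not the EPsim implementation.

```python
from collections import Counter

def entropy_vector(seq, L, weights):
    """Map every L-mer of seq to a weighted sum of its suffix counts.

    weights[k-1] plays the role of a_k for k = 1..L (assumption).
    """
    tables = [
        Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for k in range(1, L + 1)
    ]
    vec = {}
    for w in tables[L - 1]:
        vec[w] = sum(
            weights[k - 1] * tables[k - 1][w[L - k:]]
            for k in range(1, L + 1)
        )
    return vec

def ep2_like(x, y, L, weights):
    """Inner product of the entropy vectors of x and y."""
    ex = entropy_vector(x, L, weights)
    ey = entropy_vector(y, L, weights)
    return sum(ex[w] * ey[w] for w in ex.keys() & ey.keys())
```

Putting all the weight on k = L recovers the fixed-resolution inner product of k-mer counts, which makes the relationship between the two families of measures explicit.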
Note that Entropic Profiles, expectations and variances can be precomputed in linear time and space by adapting the implementation in [27]. Thus, the proposed statistics, as many others, can be computed efficiently.
We implemented these alignment-free measures, as well as traditional ones, in a software tool called EPsim that is freely available^{1}. It is based on the library SeqAn [32], which provides efficient string primitives. Among the different options available, the possibilities to include reverse complements and to compute an approximated version of the variance are of note. In particular, one can extend the formulas for the mean and variance to include reverse complements as well. There are several ways to incorporate reverse complements into the score. The method we selected consists in taking the maximum between the entropies of a word and its reverse complement. In practice, the fact that only the strongest signal is taken makes the effect of exceptional words more incisive. This solution is only one of the possibilities. In \(N_{2}\) [5], the k-mer counts from the reverse and forward strand can be combined in many ways. There are four options: bothstrands, to calculate the pairwise score using both strands from the input sequences, mean, min and max. In general, the use of reverse complements will be of help for the detection of enhancers and in other applications.
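The max-based treatment of reverse complements can be sketched in a few lines. The dictionary of per-word entropies and the function names are hypothetical, and words missing from the dictionary are given entropy zero, which is an assumption of the example.

```python
# Translation table for the four DNA letters.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return word.translate(_COMPLEMENT)[::-1]

def strand_aware_entropy(entropies, word):
    """Maximum of the entropies of a word and of its reverse complement.

    entropies maps words to scores; absent words count as 0.0.
    """
    return max(
        entropies.get(word, 0.0),
        entropies.get(reverse_complement(word), 0.0),
    )
```

Taking the maximum keeps only the stronger of the two strand signals, whereas a mean would dilute an exceptional word that appears on one strand only.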
Results and Discussion
This section deals with the testing procedures for the study of the statistical power of the proposed multi-resolution sequence similarity measures. The task of pairwise comparison of regulatory sequences is much harder than traditional pairwise alignment, since only very few shared words might lead to a similar activity. In this section we want to test whether pairwise sequence similarity of regulatory sequences is able to estimate similar in vivo activity.
The same biological problem has been addressed in [5–7] and we chose to compare with these methods using the same experimental setup. Here, we report experiments on simulated and real regulatory sequences, using the same evaluation procedure. In each experiment two equal-length sets of sequences, named the negative and the positive set, are built. Sequences in the former are dissimilar while those in the latter are similar. The positive predictive value (PPV) is evaluated in two steps: first, similarity scores are computed for each pair of sequences in the two sets; then the similarity scores are sorted in descending order, and the PPV is the fraction of pairs of sequences from the positive set in the first half of the ranking. The best PPV is 1, which means a perfect separation between the negative and positive sets, while a PPV close to 0.5 implies no statistical power. Performance will depend on the choice of the background model, the k-mer length and the weights \(a_{k}\). For the latter we use a Gaussian kernel with standard deviation σ, centered at k=L, i.e. \(a_{k} = e^{-\frac {(L - k)^{2}}{2 \sigma ^{2}}}\).
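The evaluation procedure and the Gaussian weights can be sketched as follows. The sketch assumes that the scores of the pairs from the positive and from the negative set are supplied as two lists of equal length; the function names are illustrative, and ties are resolved in favour of the positive set because of the stable sort.

```python
from math import exp

def gaussian_weights(L, sigma):
    """a_k = exp(-(L - k)^2 / (2 sigma^2)) for k = 1..L."""
    return [exp(-((L - k) ** 2) / (2 * sigma ** 2)) for k in range(1, L + 1)]

def positive_predictive_value(positive_scores, negative_scores):
    """Fraction of positive pairs in the top half of the joint ranking."""
    labelled = [(s, 1) for s in positive_scores]
    labelled += [(s, 0) for s in negative_scores]
    labelled.sort(key=lambda t: t[0], reverse=True)   # descending scores
    top = labelled[:len(labelled) // 2]
    return sum(label for _, label in top) / len(top)
```

With σ = 0.7 and L = 4 the weights decay quickly, so the resolutions k = 3 and k = 4 dominate the score while shorter suffixes act as a mild correction.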
Implanted motifs on Drosophila genome
In this simulation study, since random sequences are unrealistic backgrounds, the sequences in the negative set are randomly picked from a real genome, while those in the positive set are built by implanting a set of motifs in those of the negative set. Thus, as in [33], we chose the Drosophila genome, whose intergenic sequences, which are regions containing functionally important elements such as promoters and enhancers, are downloadable from FlyBase^{2}. Patterns can be artificially implanted via the pattern transfer model [22] or the revised one [33] with the aim of mimicking the exchange of genetic material. While under the former model only strings of the same length, e.g. 5, are considered, under the latter strings of different lengths, e.g. 4, 5 and 6, are also implanted.
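The construction of a positive sequence can be illustrated with a simplified implanting routine. It is not the pattern transfer model of [22] or its revision [33], whose details are not given here: motifs are chosen uniformly at random and overwrite the background at uniformly random positions, and the function name and parameters are hypothetical.

```python
import random

def implant_motifs(background, motifs, n_implants, rng):
    """Overwrite n_implants random positions of background with motifs.

    Motifs may have different lengths; the sequence length is preserved
    and later implants may overwrite earlier ones.
    """
    seq = list(background)
    for _ in range(n_implants):
        motif = rng.choice(motifs)
        start = rng.randrange(len(seq) - len(motif) + 1)
        seq[start:start + len(motif)] = motif
    return "".join(seq)

# Example: implant motifs of length 4, 5 and 6 with a fixed seed.
example = implant_motifs("A" * 60, ["CGCG", "CGCGC", "CGCGCG"], 3,
                         random.Random(0))
```

Passing an explicit random generator keeps the simulated datasets reproducible across runs.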
Comparison of mouse regulatory sequences
The above simulations deal with artificial CRMs from unrelated sequences. The next series of experiments involves neither artificial enhancers nor implanted transcription factor binding sites. The positive set is built from ChIP-seq data of real enhancers, which have already been identified in a genome-wide manner using the coactivator protein p300 by [34, 35]. More precisely, it consists of sequences of length between 350 and 1000 that are tissue-specific enhancers of mouse embryos active in one of the following tissues: forebrain, midbrain, limb or heart. These studies [34, 35] have identified 2543, 561, 2105 and 3597 peaks from forebrain, midbrain, limb and heart respectively. For the purpose of this study we select the top 200 peaks for each tissue.
Comparison of mouse tissue-specific enhancers versus random mouse genomic sequences. Values in the table represent the average PPV, over all tissues, varying the k-mer length. The standard deviation σ is 0.7
\(EP^{*}_{2}\)  k-mer length
Tissue  1  2  3  4  5  6  7
Limb  0.61  0.68  0.77  0.82  0.82  0.81  0.8 
Forebrain  0.59  0.71  0.78  0.8  0.83  0.82  0.82 
Midbrain  0.58  0.69  0.72  0.84  0.81  0.78  0.79 
Heart  0.63  0.73  0.81  0.85  0.83  0.81  0.81 
Average  0.60  0.70  0.77  0.83  0.82  0.80  0.80 
\(N_{2}\)  k-mer length
Tissue  1  2  3  4  5  6  7 
Limb  0.6  0.66  0.71  0.74  0.75  0.69  0.66 
Forebrain  0.59  0.68  0.7  0.73  0.76  0.72  0.68 
Midbrain  0.58  0.63  0.68  0.71  0.72  0.69  0.65 
Heart  0.62  0.66  0.73  0.75  0.74  0.71  0.68 
Average  0.6  0.66  0.70  0.73  0.74  0.70  0.67 
Comparison of mouse tissue-specific enhancers versus other tissue-specific enhancers. Values in the table represent the average PPV, over all tissues, varying the k-mer length. The standard deviation σ is 0.7
\(EP^{*}_{2}\)  k-mer length
Tissue  1  2  3  4  5  6  7 
Limb  0.52  0.59  0.68  0.71  0.7  0.69  0.67 
Forebrain  0.5  0.58  0.62  0.65  0.63  0.63  0.59 
Midbrain  0.51  0.61  0.68  0.69  0.7  0.68  0.66 
Heart  0.49  0.6  0.7  0.73  0.72  0.68  0.67 
Average  0.50  0.59  0.67  0.69  0.69  0.67  0.65 
\(N_{2}\)  k-mer length
Tissue  1  2  3  4  5  6  7 
Limb  0.51  0.55  0.58  0.59  0.61  0.54  0.53 
Forebrain  0.51  0.52  0.54  0.56  0.57  0.51  0.52 
Midbrain  0.51  0.5  0.51  0.48  0.52  0.54  0.5 
Heart  0.49  0.52  0.55  0.58  0.56  0.53  0.49 
Average  0.50  0.52  0.54  0.55  0.56  0.53  0.51 
Comparison of mouse tissue-specific enhancers with each other. Values in the table represent the average PPV, with a k-mer length of 4 and a standard deviation σ of 0.7
\({EP^{*}_{2}}\)  Limb  Forebrain  Midbrain  Heart 
Limb  X  0.63  0.68  0.78 
Forebrain  0.63  X  0.61  0.68 
Midbrain  0.68  0.61  X  0.73 
Heart  0.78  0.68  0.73  X 
Average  0.70  0.64  0.67  0.73 
\(N_{2}\)  Limb  Forebrain  Midbrain  Heart
Limb  X  0.55  0.54  0.66 
Forebrain  0.55  X  0.54  0.6 
Midbrain  0.54  0.54  X  0.53 
Heart  0.66  0.6  0.53  X 
Average  0.58  0.56  0.54  0.59 
Speed tests
Conclusions
In this paper we studied the use of alignment-free measures to detect functional or evolutionary similarities among regulatory sequences. We introduced a multiple resolution alignment-free method based on Entropic Profiles that is designed around the use of variable-length words combined with statistical properties. To evaluate the performance of several alignment-free methods, we devised a series of tests on both synthetic and real data. In almost all simulations our method \(EP^{*}_{2}\) outperforms all other statistics. Importantly, \(EP^{*}_{2}\) is also able to detect similarities between in vivo identified enhancer sequences, e.g. of mouse. This will help to better understand the sequence-dependent code within CRMs, which is responsible for the large diversity of cell types.
As a by-product we provide a formula to compute the exact variance of variable-length word counts, a result that can be of general interest in other applications as well, e.g. the discovery of surprising patterns. As a future direction we plan to implement different methods to incorporate reverse complements. Another context where these statistics can be of help is the comparison of viral sequences.
Endnotes
Declarations
Acknowledgements
M. Comin was partially supported by the P.R.I.N. Project 20122F87B2.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
1. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet. 2014; 15:272–86.
2. Bonn S, et al. Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development. Nat Genet. 2012; 44(2):148–56.
3. Wilson MD, et al. Species-specific transcription in mice carrying human chromosome 21. Science. 2008; 322(5900):434–8.
4. Goto T, Macdonald P, Maniatis T. Early and late periodic patterns of even skipped expression are controlled by distinct regulatory elements that respond to different spatial cues. Cell. 1989; 57(3):413–22.
5. Goke J, Schulz MH, Lasserre J, Vingron M. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts. Bioinformatics. 2012; 28(5):656–63.
6. Liu X, Wan L, Reinert G, Waterman MS, Sun F, Li J. New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol. 2011; 1:106–16.
7. Kantorovitz MR, Robinson GE, Sinha S. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics. 2007; 23(13):249–55.
8. Thompson W, Newberg L, Conlan S, McCue LA, Lawrence C. The Gibbs centroid sampler. Nucl Acids Res. 2007; 35(2):232–7.
9. Vinga S, Almeida J. Alignment-free sequence comparison - a review. Bioinformatics. 2003; 19(4):513–23.
10. Sims G, Jun SR, Wu G, Kim SH. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. PNAS. 2009; 106(8):2677–82.
11. Comin M, Verzotto D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol. 2012; 7(1):34.
12. Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol. 2013; 20(2):64–79.
13. Comin M, Schimd M. Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns. BMC Bioinformatics. 2014; 15(Suppl 9):1.
14. Fan H, Ives A, Surget-Groba Y, Cannon C. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics. 2015; 16:522.
15. Kazemian M, Zhu Q, Halfon MS, Sinha S. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison. Nucl Acids Res. 2011; 39(22):9463–72.
16. Vinga S, Almeida JS. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics. 2007; 8:393.
17. Fernandes F, Freitas A, Almeida J, Vinga S. Entropic profiler - detection of conservation in genomes using information theory. BMC Res Notes. 2009; 2:72.
18. Smith T, Waterman M. Comparison of biosequences. Adv Appl Math. 1981; 2:482–9.
19. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10.
20. Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform. 2014; 15(3):343–53.
21. Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Nat Acad Sci. 1986; 83:5155–9.
22. Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol. 2009; 16(12):1615–34.
23. Ren J, Song K, Sun F, Deng M, Reinert G. Multiple alignment-free sequence comparison. Bioinformatics. 2013; 29(21):2690–8.
24. Leimeister C, Boden M, Horwege S, Lindner S, Morgenstern B. Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics. 2014; 30:1991–9.
25. Comin M, Leoni A, Schimd M. Qcluster: Extending alignment-free measures with quality values for reads clustering. Algoritm Bioinforma Lecture Notes Comput Sci. 2014; 8701:1–13.
26. Comin M, Leoni A, Schimd M. Clustering of reads with alignment-free measures and quality values. BMC Algorithms Mol Biol. 2015; 10:4.
27. Comin M, Antonello M. Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes. IEEE/ACM Trans Comput Biol Bioinforma. 2014; 11(3):500–9.
28. Parida L, Pizzi C, Rombo S. Entropic profiles, maximal motifs and the discovery of significant repetitions in genomic sequences. Algorithms Bioinform. 2014; 8701:148–60.
29. Comin M, Antonello M. Fast Alignment-free Comparison for Regulatory Sequences Using Multiple Resolution Entropic Profiles. In: Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOSTEC 2015): 2015. p. 172–7.
30. Robin S, Rodolphe F, Schbath S. DNA, Words and Models: Statistics of Exceptional Words. Cambridge, UK: Cambridge University Press; 2005.
31. Apostolico A, Comin M, Parida L. Varun: Discovering extensible motifs under saturation constraints. IEEE/ACM Trans Comput Biol Bioinformatics. 2010; 7(4):752–62.
32. Doring A, Weese D, Rausch T, Reinert K. SeqAn - an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008; 9:11.
33. Comin M, Verzotto D. Beyond fixed-resolution alignment-free measures for mammalian enhancers sequence comparison. IEEE/ACM Trans Comput Biol Bioinformatics. 2014; 11(4):628–37.
34. Visel A, et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009; 457(7231):854–8.
35. Blow MJ, et al. ChIP-seq identification of weakly conserved heart enhancers. Nat Genet. 2010; 42(9):806–10.