Volume 7 Supplement 5

## APBioNet – Fifth International Conference on Bioinformatics (InCoB2006)

# Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

- Sylvain Forêt
^{1}Email author, - Miriam R Kantorovitz
^{2}and - Conrad J Burden
^{1}

**7(Suppl 5)**:S21

https://doi.org/10.1186/1471-2105-7-S5-S21

© Forêt et al; licensee BioMed Central Ltd 2006

**Published: **18 December 2006

## Abstract

### Background

The number of *k*-words shared between two sequences is a simple and effcient alignment-free sequence comparison method. This statistic, *D*_{2}, has been used for the clustering of EST sequences. Sequence comparison based on *D*_{2} is extremely fast, its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of *D*_{2}, and asymptotic regimes have been derived. The distribution of approximate *k*-word matches has also been studied.

### Results

We have computed the *D*_{2} optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show *D*_{2} to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the *D*_{2} statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA.

### Conclusion

We have characterized the distribution of the *D*_{2} statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, *D*_{2}-based methods are most appropriate for large genomic sequences.

## Background

The overwhelming amount of molecular data generated by the sequencing of whole genomes and EST libraries has triggered the development of numerous sequence comparison algorithms, aimed at detecting related sequences and at quantifying this relatedness. BLAST [1], FASTA [2] and other related algorithm are arguably the most popular programs for sequence comparison. These algorithms rely on the local alignment of the sequences under scrutiny and assume conservation of contiguity between homologous segments. This assumption, however, is often violated. Discontinuity can occur, for example, when spliced transcripts are matched to genomic sequences, when ESTs or cDNAs from different splice variants are compared or when genomic sequences are aligned that have undergone genome shuffling. Other alignment-based algorithms that do not assume conservation of contiguity, such as BLAT [3] or SIM4 [4], can compute scores, percentage similarity, but do not assess statistical significance.

Several types of alignment-free sequence comparison algorithms, reviewed in [5], can circumvent this problem. Among these alignment-free methods, techniques based on the number of *k*-words shared between two sequences, also known as the *D*_{2} statistic, are particularly noteworthy due to the speed of their implementation, their sensitivity and selectivity [6]. They have been extensively used to structure large collections of ESTs into clusters of similar sequences [7–9].

The rigorous study of the statistical distribution of *D*_{2} began with the computation of bounds of *D*_{2} variance and the characterization of asymptotic distributional regimes [10]. These results have been refined in a recent study [11]. Other studies [12, 13] have focussed on a generalization of *D*_{2}, the number of approximate *k*-word matches between two sequences,
, where *t* is the number of mismatches per word. Bounds on thevariance and asymptotic distribution of
have been determined [13].

The current paper summarizes the theoretical knowledge on the distribution of *D*_{2} and
. The optimal word sizes of these statistics in a variety of conditions were computed, and the distributional regimes at optimal word size were deduced. Finally, the accuracy of
as a measure of sequence similarity was compared with other measures using random sequences and sequences evolved from human genomic DNA.

## Results and Discussion

### Word matches measures

#### Exact matches

*d*letters, let

**A**=

*A*

_{1}

*A*

_{2}⋯

*A*

_{ n }be a sequence of

*n*i.i.d. letters of . Let

*f*

_{ a }be the probability of a letter taking the value

*a*, and

*p*

_{ k }= .

*D*_{2} is defined as the number of *k*-words matches between two sequences **A** and **B**, and can be expressed

where *Y*_{(i, j)}is the *k*-word match indicator variable starting at position *i* in **A** and *j* in **B**, and the index set is

*I* = {(*i*, *j*) : 1 ≤ *i* ≤
, 1 ≤ *j*
}

where, for convenience, (*n - k* + 1) and (*m - k* + 1) are written
and
respectively.

*D*_{2} can also be thought of as an inner product of the word count vectors. Let
= {**w**_{
1
}**, w**_{
2
}**, ⋯, w**_{
n
}} be the set of all *k*-words on
. For w ∈
, let
be the number of times the letter **w** appears in sequence **A**. The count vector for that sequence is
. *D*_{2} can thus also be expressed as

The mean of *D*_{2} can be easily computed (e.g. [14])

When the letters of the alphabet are uniformly distributed, that is, *f*_{
a
}=
, for all *a* ∈ *A*.

Lower and upper bounds for the variance of *D*_{2} were computed in [10], and an exact formula for the variance in the case of uniform letter distribution was given in [11]. Limiting distributions of *D*_{2} when *n* and *k* get large were derived in [10], namely (a) when *k*/
*n* > 2, and *E*[*D*_{2}] is not too small, *D*_{2} has a compound Poisson [15] asymptotic behaviour, and (b) when *k*/
*n* < 1/6, and the letter distribution is non-uniform, *D*_{2} has a normal asymptotic behaviour. Numerical simulations in [10] showed that these theoretical bounds are not tight. The uniform letter distribution case was studied in [11], where it was proved that, for large enough *k*, the *D*_{2} statistic is approximately normal as *n* gets large.

#### Approximate matches

The theory for the number of approximate matches between two sequences has been developed in the more restricted framework of strand symmetric Bernoulli text [12, 13]. In this framework, the letters are i.i.d. with frequencies

where *η* is the perturbation parameter, with -1 ≥ η ≥ 1. The distance between two words *x* and *y* of length *k* is defined as the number of character mismatches between *x* and *y* and is written *δ*(*x, y*). When *δ*(*x, y*) ≥ *t, x* is said to be a *t*-neighbour of *y*. If **A** is a strand symmetric Bernoulli text and **w** is a known word of the same length, *n*, as **A**, then the probability distribution of the distance *δ*(**A, w**) is:

Pr(*δ*(**A, w**) = *t*) =*g*_{
t
}(*n,η,c*)

where *g*_{
t
} is a perturbed binomial distribution [12], *c* is the *GC* count in **w** and *η* is the perturbation parameter of **A**. The cumulative distribution function of the distance is then

*t*-neighbours of

*k*-words between sequences

**A**and

**B**, and can be expressed as

*t*-neighbourhood for the

*k*-words starting at position

*i*in

**A**and

*j*in

**B**. The expectation of can be expressed in terms of the perturbed binomial distribution [12]

*n*, is asymptotically normal when

*k*=

*α*(

*n*) +

*C*where 0 ≥

*α*< and

*C*is a constant. When

*t*= 0, this result is an improvement on the

*α*= 1/6 result for perfect matches [10] reported above. Numerical simulations in [13] suggest that this asymptotic behaviour occurs for

*α*as high as 2.

*D*

_{2}to the normal and compound Poisson distribution, in the case of exact word matches, for uniform and non-uniform letters distribution, can be found in [10].

### Optimal word sizes

*n*and

*t*mismatches. Numerical simulations were carried out to determine for various sequence lengths

*n*= 2

^{ x }× 10

^{2}with

*x*= 1, 2, 3, 4, 5, and for different numbers of mismatches

*t*= 0, 1, 2, 3, 4, 5. A summary of these simulations is given in table 1 and the detailed simulation results can be found in additional files 1 to 9. Sequences with a non-uniform letter distribution were used, with nucleotide frequencies

*f*

_{ A }=

*f*

_{ T }= ,

*f*

_{ G }=

*f*

_{ C }= . Similar compositional biases are observed in several sequenced genomes, such as the honey bee

*Apis mellifera*, the roundworm

*Caenorhabditis elegans*or the zebra fish

*Danio rerio*. The optimal word size was also computed in the uniform case for

*t*= 0, 3, with very similar outcomes. We will therefore focus our discussion on the non-uniform case.

Optimal word sizes. Optimal word sizes for various sequence lengths and numbers of mismatches.

Mismatches | Sequence length | ||||
---|---|---|---|---|---|

200 | 400 | 800 | 1600 | 3200 | |

0 | 6 | 7 | 7 | 7 | 7 |

1 | 8 | 10 | 10 | 10 | 10 |

2 | 10 | 12 | 12 | 12 | 12 |

3 | 12 | 14 | 14 | 14 | 14 |

4 | 14 | 16 | 16 | 16 | 16 |

5 | 16 | 18 | 18 | 18 | 18 |

When our data for perfect word matches are compared to tables 1 and 3 in [10], it appears that the distribution of *D*_{2} at optimal word size is approximately a compound Poisson distribution when sequences are 400 letters long or smaller, and is approximately normal when sequences lengths are larger than 1600 letters.

In the case of approximate word matches, it can be extrapolated from figure 1 that the distribution of
becomes normal for sequences larger than 1600 letters. The approximate behaviour of
, when *t* > 0 and *k*/
*n* > 2, however, is unknown. The distribution regime of
at optimal word size for smaller sequences could therefore not be characterized.

### Accuracy of measures

*A*(see methods), obtained with these measures and compare them with that obtained using either BLAST or the Hamming distance. Similarly, we compared the efficiency of for different numbers of mismatches,

*t*= 0,1, 2, 3, 4, 5 (figure 2). We used the same sequence size (600) as inthe above-cited study so as to allow comparison of the performance of with the measures assessed in that study.

*A*for BLAST was computed based on the bit scores obtained with the default settings.

*A*) ranging from 9.3 to 9.6. This is better than BLAST whose log (

*A*) is close to 9.9. It is noteworthy that, at optimal word size, the statistic gives better results when the number of mismatches allowed per word increases.

### Application to non-iid sequences

In real biological sequences, letters are not independent and identically distributed. To test the applicability of our results to biological sequences, we conducted similar simulations, but instead of randomly generated sequences of known composition, we used sequences sampled from the human genome and made them evolve according to the K80 [17] model of nucleotide substitutions. The results are summarized in additional file 9. These results are quite encouraging in many respects. First, the optimal word sizes for these sequences are the same as or very close to the ones predicted with random sequences in the previous sections. The values of log (*A*) for near-optimum word sizes are lower than those computed previously, suggesting that similarity measures based on word counts may be more accurate on real sequences than on randomly generated sequences. Finally, the accuracy of the
measures near optimal word size are consistently better than BLAST bit scores obtained with the default settings.

## Conclusion

*k*, the sequences lengths

*n*, the number of mismatches

*t*and the sequence composition

*f*

_{ ATGC }. We computed the optimal word size for various combinations of these parameters that influence the distribution of . For sequences smaller than 400 letters and when no mismatches are allowed, the distribution of is close to a compound Poisson. In the case of sequence larger than 1600 letters, the distribution of is approximately normal. When estimating the significance of sequence similarity using at its optimal word size, the null distribution would thus have to be adjusted according to the size of the sequences. It is worth noting that the size of typical ESTs or whole genome shotgun sequencing traces (500–800 letters) is in the transition region between these two limiting distributions.

*o*(

*kn*), when

*t*= 0, however, when

*t*> 0, the worst case complexity is

*o*(

*knm*). In our simulations, was more accurate than BLAST (see figure 2). We expect this difference to be accentuated on real sequences where shuffling and transposition occurs, breaking the contiguity generally assumed by alignment-based sequence comparison algorithms.

*o*(

*n*

^{2}) complexity, hence the development of heuristics speeding up these comparisons, such as BLAT [3] or MEGABLAST [18]. The linear (

*o*(

*n*)) nature of the algorithm and the normal asymptotic behavior of , when sequences are large compared to word size, would allow sequence similarity to be assessed in a rigorous statistical framework, with significant improvement in run time.

## Methods

### Word size optimization

The optimization of the word size, for a given sequence length and number of mismatches, was carried out according to a method similar to that introduced in [16]. In brief, a family of sequences was generated by first creating a random mother sequence. 100 sons were then derived by mutating γ% of the mother sequence, where γ = 1, 2, . . ., 100%. Only point mutations were used: substitution, insertion and deletion of a single letter. The
statistic between the mother and each son were computed. Two rankings of the sons were then produced, one based on
, and another based on γ. The accuracy of
-based sequence comparison was estimated by looking at the discrepancy between these two rankings by means of the Spearman's rank statistic *A*. The optimal word size is that for which *A* is minimal. For each condition, the data presented here are the average of *A* for 100 to 400 families.

Similar simulations were also carried out by randomly selecting a mother sequence from the human genome chromosome 1, version NCBI36.40 (available from http://www.ensembl.org). Sons where then derived according to a K80 [17] model of evolution.

A program written in ANSI C was written to compute the *spearmanRS* statistic. Data were post-processed using the R environment. The simulations were carried out on a Debian Gnu/Linux desktop computer. The source code is available from our *k*-words website [19].

### Fit to the normal distribution

Kolmogorov-Smirnov *p*-values [20] for the standardized statistic
compared with the standard normal distribution for sample sizes of 2,500 sequence pairs were computed. The results are shown in figure 1. In general, samples which are a close approximation to the normal distribution have *p*-values distributed uniformly in the interval [0, 1], whereas samples which are a poor approximation have small *p*-values.

## Declarations

### Acknowledgements

This work was funded in part by ARC Discovery Grant DP0559260.

This article has been published as part of *BMC Bioinformatics* Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

## Authors’ Affiliations

## References

- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ:
**Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.***Nucleic Acids Res*1997,**25**(17):3389–402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar - Pearson WR:
**Rapid and sensitive sequence comparison with FASTP and FASTA.***Methods Enzymol*1990,**183:**63–98.View ArticlePubMedGoogle Scholar - Kent WJ:
**BLAT-the BLAST-like alignment tool.***Genome Res*2002,**12**(4):656–64. 10.1101/gr.229202. Article published online before March 2002PubMed CentralView ArticlePubMedGoogle Scholar - Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W:
**A computer program for aligning a cDNA sequence with a genomic DNA sequence.***Genome Res*1998,**8**(9):967–74.PubMed CentralPubMedGoogle Scholar - Vinga S, Almeida J:
**Alignment-free sequence comparison-a review.***Bioinformatics*2003,**19**(4):513–23. 10.1093/bioinformatics/btg005View ArticlePubMedGoogle Scholar - Hide W, Burke J, Davison DB:
**Biological evaluation of d2, an algorithm for high-performance sequence comparison.***J Comput Biol*1994,**1**(3):199–215.View ArticlePubMedGoogle Scholar - Burke J, Davison D, Hide W:
**d2_cluster: a validated method for clustering EST and full-length cDNAsequences.***Genome Res*1999,**9**(11):1135–42. 10.1101/gr.9.11.1135PubMed CentralView ArticlePubMedGoogle Scholar - Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W:
**STACK: Sequence Tag Alignment and Consensus Knowledgebase.***Nucleic Acids Res*2001,**29:**234–8. 10.1093/nar/29.1.234PubMed CentralView ArticlePubMedGoogle Scholar - Carpenter JE, Christoffels A, Weinbach Y, Hide WA:
**Assessment of the parallelization approach of d2 cluster for high-performance sequence clustering.***J Comput Chem*2002,**23**(7):755–7. 10.1002/jcc.10025View ArticlePubMedGoogle Scholar - Lippert RA, Huang H, Waterman MS:
**Distributional regimes for the number of k-word matches between two random sequences.***Proc Natl Acad Sci U S A*2002,**99**(22):13980–9. 10.1073/pnas.202468099PubMed CentralView ArticlePubMedGoogle Scholar - Kantorovitz MR, Booth HS, Burden CJ, Wilson SR:
**Asymptotic behavior of k-word matches between two uniformly distributed sequences.***preprint*2006.Google Scholar - Melko OM, Mushegian AR:
**Distribution of words with a predefined range of mismatches to a DNA probe in bacterial genomes.***Bioinformatics*2004,**20:**67–74. 10.1093/bioinformatics/btg374View ArticlePubMedGoogle Scholar - Burden CJ, Kantorovitz MR, Wilson SR:
**Approximate word matches between two random sequences.***preprint*2006.Google Scholar - Waterman MS:
*Introduction to Computational Biology*. Chapman and Hall; 1995.View ArticleGoogle Scholar - Barbour A, Chryssaphinou O:
**Compound Poisson approximation: a user guide.***Annals of Applied Probability*2001,**11**(3):964–1002. 10.1214/aoap/1015345355View ArticleGoogle Scholar - Wu TJ, Huang YH, Li LA:
**Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences.***Bioinformatics*2005,**21**(22):4125–32. 10.1093/bioinformatics/bti658View ArticlePubMedGoogle Scholar - Kimura M:
**A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.***J Mol Evol*1980,**16**(2):111–20. 10.1007/BF01731581View ArticlePubMedGoogle Scholar - Zhang Z, Schwartz S, Wagner L, Miller W:
**A greedy algorithm for aligning DNA sequences.***J Comput Biol*2000,**7**(1–2):203–14. 10.1089/10665270050081478View ArticlePubMedGoogle Scholar **Source code for k-words**[http://dayhoff.anu.edu.au/~sf/k_words.]- Conover WJ:
*Practical Nonparametric Statistics*. John Wiley and Sons; 1999.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.