# Genome comparison without alignment using shortest unique substrings

- Bernhard Haubold
^{1}, - Nora Pierstorff
^{2}, - Friedrich Möller
^{3}and - Thomas Wiehe
^{2}Email author

**6**:123

**DOI: **10.1186/1471-2105-6-123

© Haubold et al; licensee BioMed Central Ltd. 2005

**Received: **09 November 2004

**Accepted: **23 May 2005

**Published: **23 May 2005

## Abstract

### Background

Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees.

### Results

We find that the shortest unique substrings in *Caenorhabditis elegans*, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of *Staphylococcus aureus* strain MSSA476 compared to four other staphylococcal genomes.

### Conclusion

We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at http://adenine.biz.fh-weihenstephan.de/shustring/.

## Background

Sequence comparison is traditionally carried out using alignments. The alignment procedure ensures that only homologous positions are compared and corresponding algorithms form the classical core of bioinformatics [1–3]. Once a sequence alignment has been computed, it can be used to determine, for example, signature oligonucleotides or unique genomic regions among a group of closely related organisms.

Perhaps surprisingly, the applications of alignments just mentioned – signature oligos and detection of unique genomic regions – do not necessarily involve an alignment step. Since the computation of alignments tends to take time proportional to the product of the lengths of the sampled sequences, elimination of this step often leads to dramatic increases in the speed of sequence analysis algorithms [4].

*S*= ACCG. It contains substrings, of which the following eight are unique: {A, AC, ACC, ACCG, CC, CCG, CG, G}. Two of these are

*shortest*unique substrings: {A, G}. Such

*global*shortest unique substrings can occur anywhere in

*S*. In contrast, we define

*local*shortest unique substrings to be tied to a specific position in

*S*. More formally, we determine for every position

*i*in

*S*the length

*x*of the substring

*S*[

*i*..

*i*+

*x*- 1] such that it is unique while

*S*[

*i*..

*i*+

*x*- 2] is not. In the case of our example string, the result is

*x*= 1 for the first position,

*x*= 2 for the second,

*x*= 2 for the third, and

*x*= 1 for the last. Figure 1A gives a graphical representation of these local shortest unique substrings.

So far we have only considered the forward strand of a given DNA sequence. In the presence of the reverse strand, the set of *x*-values changes to *x* = 1 for the first and *x* = 2 for the second position. No well-defined unique substrings start at the third or the last position of the sequence. Figure 1B illustrates the shortest unique substrings found on the forward and reverse strands of our example sequence. It is important to realize that when dealing with double-stranded DNA, the set of shortest unique substrings is different from that found in single-stranded DNA. However, complementarity of DNA ensures that the complement of a local shortest unique substring is again a unique substring, though not necessarily a shortest unique substring (cf. Figure 1B). In contrast, the complement of a global shortest unique substring is also a global shortest unique substring.

Application of shortest unique substrings to biological problems requires both their efficient detection and knowledge of their probability distribution. The latter is derived in this paper. As to the detection of shortest unique substrings, a data structure known as the suffix tree can readily be used for this purpose [4]. We demonstrate the utility of shortest unique substrings for sequence analysis by applying them to three tasks: (i) identification of signature oligo nucleotide sequences, (ii) detection of unique as well as repeat regions in the genome of *Mycoplasma genitalium*, and (iii) detection of unique genomic regions in strain MSSA476 of the human pathogen *Staphylococcus aureus* when compared to four other staphylococcal genomes.

## Results

### Global shortest unique substrings in *Caenorhabditis elegans*, human, and mouse

The genome of *C. elegans* is one of the smallest metazoan genomes sequenced to date. It consists of five autosomes and one sex chromosome, amounting to 100.29 Mb of sequence information [5]. When searching for global shortest unique substrings in this genome, we found a single complementary pair of unique motifs of length 10 located on chromosome 1. Considering the next shortest unique substrings (length 11) we found a total of 10,509 such motif pairs distributed among the five chromosomes. Note that the search for these globally unique substrings was done with respect to the forward and backward strands of the complete genome.

We repeated this analysis for the human genome, which is the largest genome sequenced to date. It consists of 22 autosomes and two sex chromosomes totalling 2.84 Gb published sequence data [6]. We found 215 pairs of global shortest unique substrings of length 11 distributed on the autosomes and the X-chromosome. The Y chromosome contained no unique sequences of length 11 but 135 globally unique sequence pairs of length 12.

We were puzzled by the fact that – with the exception of the single instance of a unique substring pair of length 10 on chromosome 1 of *C. elegans* – the shortest unique substrings in humans had the same length (11) as those found in *C. elegans*, even though the human genome is 28 times larger than that of the nematode. When repeating the search for global shortest unique substrings in the mouse genome, whose size is similar to that of human (2.49 Gb), we found a matching result: there were 255 shortest unique substring pairs of length 11 distributed among the autosomes and the X-chromosome. On the Y-chromosome there were 38 unique substring pairs of length 12. The fact that the highly repetitive Y chromosome contained global unique substrings of length 12 as compared to length 11 on autosomes suggested that the length of shortest unique substrings is inversely related to genome information content. In order to further explore whether particularly short unique substrings are associated with functional regions of the genome, we investigated the distribution of globally shortest unique substrings among 1 kb upstream regions of annotated genes.

A total of 29 out of 2 × 350 shortest unique substrings were located among a non-redundant set of 16,286 human 1 kb upstream regions. The probability of observing a single hit in an upstream region with one shortest unique substring is equal to the fraction of the published genome (considering again forward and backward strands) covered by upstream regions. This is

*f*_{h} = 16286 × 1000/(2 × 2.84 × 10^{9}) ≈ 0.003.

The probability of observing 29 or more hits to upstream regions under the null hypothesis of equal distribution is

A similar result was obtained for the mouse genome. Here a total of 13,985 non-redundant upstream regions contained 22 of the 2 × 293 shortest unique substrings. The probability of finding a single hit with one shortest unique substring is again equal to the fraction of the published genome covered by the upstream regions:

*f*_{m} = 13985 × 1000/(2 × 2.49 × 10^{9}) ≈ 0.003.

The probability of observing 22 or more hits to upstream regions by chance is

In other words, both in the human as well as in the mouse genome shortest unique substrings are clustered close to genes. A complete list of the shortest unique substrings with hits to upstream regions is available as supplementary material (see Additional file 1).

So far we have concentrated on the overall, i.e. global, shortest unique substrings. In the following sections we extend this analysis to include all local shortest unique substrings.

### Empirical distribution of local shortest unique substring lengths

*Mycoplasma genitalium*has one of the smallest genomes known for any free-living organism [11]. The 580,074 bp of its genome encode 480 open reading frames and would be expected to be void of redundant, that is repetitive, sequences. However, when plotting the length of all local shortest unique substrings contained in its genome, we detected 26 non-overlapping shortest unique sequences longer than 100 bp, the longest of which spanned 244 bp (Figure 2A). In other words, the genome of

*M. genitalium*contains a perfectly conserved repeat sequence that is 244 - 1 = 243 bp long. The statistical significance of these repeats is illustrated in Figure 2B, which displays the lengths of the shortest unique substrings in a shuffled version of

*M. genitalium*'s genome. In such a scrambled sequence devoid of biological meaning no shortest unique substring is longer than 21 bp. Having found surprisingly short, as well as surprisingly long shortest unique substrings, we proceed by deriving the null distribution of shortest unique substring lengths.

### Distribution of local shortest unique substring lengths

As explained further in the Methods section, the probability of finding shortest unique substrings of length *x*,
, is the number of unique substrings of length *x*,
, minus the number of unique substrings of length *x* - 1,
, divided by the genome length *l*:

where

*p*represents the GC-content of the genome (

*p*∈ [0, 1/2]). Figure 3 demonstrates that the fit between equation (1) and the empirical distribution of local shortest unique substrings (cf. Figure 2B) is excellent. Equation (1) provides an efficient method for establishing the statistical significance of any given length of a local shortest unique substring. Using equation (1) and knowing that the GC-content of the genome of

*C. elegans*is 0.3544, one expects to find by chance alone 1.6 unique substrings of length and 10 and 20,441 unique substrings of length 11. These values agree with what we have found in the actual genome of

*C. elegans*(one pair of unique substrings of length 10, and 10,509 pairs of length 11). However, again based on equation (1), the probability of finding in the human genome (GC-content = 0.4088) a unique substring of length 11 is less than 10

^{-100}. This is equivalent to an expected number of only 2.4 × 10

^{-94}of such unique substrings and in sharp contrast to the observed value of 215 pairs of unique substrings of this length. The same holds for mouse. Clearly, the lengths of the shortest unique substrings found in mouse and human cannot be explained by chance.

Since equation (1) describes the length distribution of shortest unique substrings in random sequences, it is also the null-distribution for multiple, but phylogenetically unrelated, sequences. This fact can be exploited for comparative genome analysis.

### Comparing five strains of *Staphylococcus aureus*

*Staphylococcus aureus*is a human pathogen notorious for its resistance to multiple antibiotics [7]. Five genomes of this bacterium are publicly available, which makes it one of the best characterized bacterial species. Our aim was to take strain MSSA476 [7] and identify the regions unique to its genome when compared to the four other strains available. Given the GC-content of

*Staphylococcus aureus*and the combined length of the five genomes analyzed, we calculated the expected distribution of shortest unique substring lengths using equation (1). Figure 5 shows that for unrelated

*S. aureus*sequences of the same combined length (14.21 × 10

^{6}) and composition (GC-content 0.33) as the strains investigated, the lengths of shortest unique substrings are expected to range from 9 to 27. We decided to analyze the local shortest unique substrings of length ≤ 10 in strain MSSA476 in the presence of the genomes of the four other strains. Figure 6 displays the cumulative distribution of the local shortest unique substrings of length ≤ 10 along the genome of strain MSSA476. Regions unique to MSSA476 contain a high density of such substrings and hence stand out as jumps in the cumulative plot. Figure 6 shows two such jumps. These correspond exactly to the two unique regions ΦSa4 and SCC

_{476}recently annotated as the only two unique regions in MSSA476 [7].

## Discussion

The search for unique substrings has a long tradition in molecular biology. It is fundamental for many sequence-based identification techniques, and affects PCR primer design as well as the development of specific antibodies. A DNA or protein sequence of length *n* contains
substrings, which is also an upper limit for the number of unique substrings. This means that in most real world situations there is in an excess of unique substrings to choose from. Since a given unique substring remains unique upon extension, we decided to concentrate on the shortest unique substrings. In their global version they have minimal length with respect to the entire sample of sequences investigated. In contrast, their local version is defined for substrings starting from a specific position in the genome. There are of the order of *n* such local shortest unique substrings from which all the remaining unique substrings can be generated. Shortest unique substrings therefore lead to considerable space reduction when dealing with unique substrings.

Our technique for detecting such unique substrings is applicable to protein as well as to DNA sequence data. Antibodies are widely used in basic biomedical research; in addition, there is growing interest in applying them as therapeutics. A major design goal in generating antibodies to a given protein in all of these contexts is to maximize their specificity. Since the entire proteome of important biomedical model organisms, including human, is known, epitope selection might be guided not only by considerations of antigenicity, but also of uniqueness. In a preliminary study of the human proteome we found that 88% of the 27,175 human proteins we looked at contain at least one unique hexapeptide. Given that a typical epitope consists of 6 to 12 amino acids, this suggests that our method of detecting shortest unique substrings coupled with epitope prediction programs might also be useful for antibody development.

However, in this paper we have concentrated on shortest unique substrings in genomes. The fact that the length of global shortest unique substrings does not exceed 11 in autosomes of both *C. elegans* and humans is intriguing given the widely differing sizes of the two genomes and the extremely small probability of observing unique sequences of length 11 by chance in the human genome. Since the length of global shortest unique substrings remained constant after we had removed repetitive elements from the genomes, we take this as an indication that genomes contain a core of high-complexity sequences which determine the length of global shortest unique substrings. The size of this high-complexity core is apparently much less variable across metazoan genomes than raw genome size, hence the observed constancy of global shortest unique substring lengths.

These global shortest unique substrings can be used as starting points for developing signature oligos. Such oligos are widely used in biotechnology and taxonomy. A typical application in biotechnology is PCR-primers that should be unique to the target sequence. In taxonomy a recent initiative for the "Barcoding of Life" http://barcoding.si.edu/ attempts to "label" all extant species by assigning a short unique DNA-sequence to them.

The probability of finding a shortest unique substring of some length can be readily computed using equation (1). However, this equation is highly sensitive to the value of the parameter *p*, which describes the sequence composition. Hence, local variation in sequence composition will strongly affect the expected length of both local and global shortest unique substrings. This fact may also have contributed to our observation that shortest unique substrings cluster in upstream regions of genes in both the mouse and the human genomes. The euchromatic part of the human genome has an average GC-content of 0.41 (http://genome.ucsc.edu, Human genome assembly hg16), which is similar to the value of 0.42 for the mouse (http://genome.ucsc.edu, Mouse genome assembly mm4). In contrast, the global shortest unique substrings we found in humans have a GC-content of 0.59 and those found in mouse have a GC-content of 0.61. The upstream regions of genes tend to be GC-rich in human (GC-content = 0.53) as well as mouse (GC-content = 0.50), which might account for the clustering of global shortest unique substrings in these regions.

Detection of unique genomic regions is traditionally done by alignment-based approaches. However, the run-time of these algorithms depends non-linearly on the number and lengths of the input sequences and also on the degree of relatedness of the input sequences. In contrast, the scheme for detecting unique genomic regions proposed in this paper has a run time that is strictly linear in the combined lengths of the input sequences.

## Conclusion

There is currently a lot of interest in comparative genomics [8]. In many of these projects detection of regions unique to a genome is one of the first steps towards functional annotation (e. g. [7]). Given equation (1), the size distribution of shortest unique substrings in random, i.e. unrelated, sequences can be predicted. This leads to our method of detecting unique genomic regions from an arbitrary set of input sequences without the need for alignment. Its usefulness for comparative genomics is clearly demonstrated in the case of the genomes of *S. aureus*, where we could rapidly detect the two unique regions previously annotated in one strain [7] (Figure 5).

## Methods

### Detection of shortest unique substrings

Two methods borrowed from computer science were used for the detection of shortest unique substrings: suffix tree construction and hashing. Suffix trees are well described by Gusfield [4] and we follow his nomenclature. To use suffix trees for detecting unique substrings, notice that the path label of any leaf is a unique substring. The set of *shortest* unique substrings at every position can therefore be discovered by traversing the tree once and looking up the string depth of the parent node of every leaf. This value plus one is the desired length of the shortest unique string that starts at the position indicated by the leaf.

Hashing is described, for example, by Cormen *et al.* [[9], ch. [11]] and we used it for detecting global shortest unique substrings in large genomes.

Unless stated otherwise, all computations presented in this paper consider both strands of the DNA sequences concerned. Note that in this case, and due to complementarity of DNA, a single parameter (*p*) suffices to describe nucleotide composition.

### The probability distribution of local shortest unique substring lengths in nucleotide sequences

Consider a nucleotide sequence *S* and let 2*p* denote the GC-content of *S* (*p* ∈ [0, 1/2]). A shortest unique substring of length *x* of this nucleotide sequence is defined as a unique substring *S*[*i*..*i* + *x* - 1] where *S*[*i*..*i* + *x* - 2] is not unique. We wish to derive the probability distribution of values of *x* under the assumption of random sequence composition.

We start by considering a particular substring of length *x* consisting of *k* positions occupied by either G or C each. We refer to such a substring as being of type (*x*, *k*). The probability of finding a substring of type (*x*, *k*) is

*P*_{x,k}= (1/2 - *p*)^{x - k}*p*^{
k
}.

Assuming that *l* independent trials each having a success probability of *P*_{x,k}are performed, the probability of finding a particular sequence of type (*x*, *k*) exactly once is then

This expression is only approximately valid, since the nucleotide compositions of any two overlapping subtrings are not independent. Still, from now on we assume independence. The error introduced by this assumption is negligible, if the genome size, *l*, is large compared to the length of the considered substrings (*l* >> *x*) - which is the case we have in mind. Thus, we replace the ≈-sign in the above and following expressions by = and define

For each sequence of type (*x*, *k*), there are
permutations of *k* "G|C" s and (*x* - *k*) "A|T" s. Some of these permutations occur zero times in *S*, some occur multiple times and some occur exactly once. We are interested in the latter: the number of *unique* substrings of type (*x*, *k*) is

In order to determine the number of unique substrings irrespective of their sequence composition, we need to sum over all possible values of *k*:

The number of *shortest* unique substrings of length *x*,
, is then simply the number of unique substrings of length *x* minus the number of unique substrings of length *x* - 1. In order to see this, notice that all unique substrings of length *x* - 1 are contained in the set of unique substrings of length *x*. Those that are gained by adding the extra nucleotide are precisely the substrings that lose their uniqueness when reduced in length by one as required by the definition of shortest unique substring:

Finally, the probability of finding such a shortest unique substring of length *x*,
, is the number of unique shortest substrings of length *x* divided by the genome length:

### Implementation

The search for shortest unique substrings is implemented in our program shustring (SHortest Unique subSTRING). The distribution of shortest unique substring lengths in genomic sequences as embodied in equation (1) is implemented in our program shulen. Both pieces of software are available from http://adenine.biz.fh-weihenstephan.de/shustring/.

### Data

*Caenorhabditis elegans*) [5], mouse (

*Mus musculus*) [10], and human (

*Homo sapiens*) [6] as well as 1 kb upstream regions for genes in the genomes of human and mouse were obtained from the University of California Santa Cruz genome website at the following URLs:

- 1.
nematode: http://hgdownload.cse.ucsc.edu/goldenPath/ce2/bigZips/ (version ce2)

- 2.
mouse: http://hgdownload.cse.ucsc.edu/goldenPath/mm4/bigZips/ (version mm4, October 2003)

- 3.
human: http://hgdownload.cse.ucsc.edu/goldenPath/hg16/bigZips/ (version hg16, July 2003)

Bacterial genomes analyzed in this study

organism | accession number | reference | genome size |
---|---|---|---|

| L43967 | [11] | 580,074 |

| NC 002952 | [7] | 2,902,619 |

| NC 002953 | [7] | 2,799,802 |

| NC 003923 | [12] | 2,820,462 |

| NC 002758 | [13] | 2,878,040 |

| NC 002745 | [13] | 2,814,816 |

## Declarations

### Acknowledgements

We would like to thank C. Acquisti and A. Börsch-Haubold for stimulating discussions and two anonymous reviewers for comments which helped to improve the manuscript. Furthermore, we would like to thank the members of the High Performance Computing Group at the Leibniz Rechenzentrum München for advice on computational issues. F.M. is supported by a grant from the German Ministry of Education and Research (BMBF; Fkz. 0312705A). B.H. is supported financially by Dehner Gartencenter GmbH and the Stifterverband der Deutschen Wissenschaft.

## Authors’ Affiliations

## References

- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins.
*Journal of Molecular Biology*1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar - Smith TF, Waterman MS: Identification of common molecular subsequences.
*Journal of Molecular Biology*1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar - Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool.
*Journal of Molecular Biology*1990, 215: 403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar - Gusfield D:
*Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology*. Cambridge: Cambridge University Press; 1997.View ArticleGoogle Scholar - The C elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology.
*Science*1998, 282: 2012–2018. 10.1126/science.282.5396.2012View ArticleGoogle Scholar - International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome.
*Nature*2001, 409: 860–921. 10.1038/35057062View ArticleGoogle Scholar - Holden MT, Feil EJ, Lindsay JA, Peacock SJ, Day NP, Enright MC, Foster TJ, Moore CE, Hurst L, Atkin R, Barron A, Bason N, Bentley SD, Chillingworth C, Chillingworth T, Churcher C, Clark L, Corton C, Cronin A, Doggett J, Dowd L, Feltwell T, Hance Z, Harris B, Hauser H, Holroyd S, Jagels K, James KD, Lennard N, Line A, Mayes R, Moule S, Mungall K, Ormond D, Quail MA, Rabbinowitsch E, Rutherford K, Sanders M, Sharp S, Simmonds M, Stevens K, Whitehead S, Barrell BG, Spratt BG, Parkhill J: Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance.
*Proc Natl Acad Sci U S A*2004, 101: 9786–9791. 10.1073/pnas.0402521101PubMed CentralView ArticlePubMedGoogle Scholar - Haubold B, Wiehe T: Comparative genomics: methods and applications.
*Naturwissenschaften*2004, 91: 405–421.PubMedGoogle Scholar - Cormen TH, Leiserson CE, Rivest RL, Stein C:
*Introduction to Algorithms*. The MIT Press; 2001.Google Scholar - Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome.
*Nature*2002, 420: 520–561. 10.1038/nature01262View ArticleGoogle Scholar - Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton GG, Kelley JM, Fritchman JL, Weidman JF, Small KV, Sandusky M, Fuhrmann JL, Nguyen DT, Utterback T, Saudek DM, Phillips CA, Merrick JM, Tomb J, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Venter JC: The minimal gene complement of Mycoplasma genitalium.
*Science*1995, 270: 397–403.View ArticlePubMedGoogle Scholar - Baba T, Takeuchi F, Kuroda M, Yuzawa H, Aoki K, Oguchi A, Nagai Y, Iwama N, Asano K, Naimi T, Kuroda H, Cui L, Yamamoto K, Hiramatsu K: Genome and virulence determinants of high virulence community-acquired MRSA.
*Lancet*2002, 359: 1819–1827. 10.1016/S0140-6736(02)08713-5View ArticlePubMedGoogle Scholar - Kuroda M, Ohta T, Uchiyama I, Baba T, Yuzawa H, Kobayashi I, Cui L, Oguchi A, Aoki K, Nagai Y, Lian J, Ito T, Kanamori M, Matsumaru H, Maruyama A, Murakami H, Hosoyama A, Mizutani-Ui Y, Takahashi NK, Sawano T, Inoue R, Kaito C, Sekimizu K, Hirakawa H, Kuhara S, Goto S, Yabuzaki J, Kanehisa M, Yamashita A, Oshima K, Furuya K, Yoshino C, Shiba T, Hattori M, Ogasawara N, Hayashi H, Hiramatsu K: Whole genome sequencing of meticillin-resistant Staphylococcus aureus.
*Lancet*2001, 357: 1225–1240. 10.1016/S0140-6736(00)04403-2View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.