Genome comparison without alignment using shortest unique substrings
© Haubold et al; licensee BioMed Central Ltd. 2005
Received: 09 November 2004
Accepted: 23 May 2005
Published: 23 May 2005
Sequence comparison by alignment is a fundamental tool of molecular biology. In this paper we show how a number of sequence comparison tasks, including the detection of unique genomic regions, can be accomplished efficiently without an alignment step. Our procedure for nucleotide sequence comparison is based on shortest unique substrings. These are substrings which occur only once within the sequence or set of sequences analysed and which cannot be further reduced in length without losing the property of uniqueness. Such substrings can be detected using generalized suffix trees.
We find that the shortest unique substrings in Caenorhabditis elegans, human and mouse are no longer than 11 bp in the autosomes of these organisms. In mouse and human these unique substrings are significantly clustered in upstream regions of known genes. Moreover, the probability of finding such short unique substrings in the genomes of human or mouse by chance is extremely small. We derive an analytical expression for the null distribution of shortest unique substrings, given the GC-content of the query sequences. Furthermore, we apply our method to rapidly detect unique genomic regions in the genome of Staphylococcus aureus strain MSSA476 compared to four other staphylococcal genomes.
We combine a method to rapidly search for shortest unique substrings in DNA sequences and a derivation of their null distribution. We show that unique regions in an arbitrary sample of genomes can be efficiently detected with this method. The corresponding programs shustring (SHortest Unique subSTRING) and shulen are written in C and available at http://adenine.biz.fh-weihenstephan.de/shustring/.
Sequence comparison is traditionally carried out using alignments. The alignment procedure ensures that only homologous positions are compared and corresponding algorithms form the classical core of bioinformatics [1–3]. Once a sequence alignment has been computed, it can be used to determine, for example, signature oligonucleotides or unique genomic regions among a group of closely related organisms.
Perhaps surprisingly, the applications of alignments just mentioned – signature oligos and detection of unique genomic regions – do not necessarily involve an alignment step. Since the computation of alignments tends to take time proportional to the product of the lengths of the sampled sequences, elimination of this step often leads to dramatic increases in the speed of sequence analysis algorithms .
So far we have only considered the forward strand of a given DNA sequence. In the presence of the reverse strand, the set of x-values changes to x = 1 for the first and x = 2 for the second position. No well-defined unique substrings start at the third or the last position of the sequence. Figure 1B illustrates the shortest unique substrings found on the forward and reverse strands of our example sequence. It is important to realize that when dealing with double-stranded DNA, the set of shortest unique substrings is different from that found in single-stranded DNA. However, complementarity of DNA ensures that the complement of a local shortest unique substring is again a unique substring, though not necessarily a shortest unique substring (cf. Figure 1B). In contrast, the complement of a global shortest unique substring is also a global shortest unique substring.
Application of shortest unique substrings to biological problems requires both their efficient detection and knowledge of their probability distribution. The latter is derived in this paper. As to the detection of shortest unique substrings, a data structure known as the suffix tree can readily be used for this purpose . We demonstrate the utility of shortest unique substrings for sequence analysis by applying them to three tasks: (i) identification of signature oligo nucleotide sequences, (ii) detection of unique as well as repeat regions in the genome of Mycoplasma genitalium, and (iii) detection of unique genomic regions in strain MSSA476 of the human pathogen Staphylococcus aureus when compared to four other staphylococcal genomes.
Global shortest unique substrings in Caenorhabditis elegans, human, and mouse
The genome of C. elegans is one of the smallest metazoan genomes sequenced to date. It consists of five autosomes and one sex chromosome, amounting to 100.29 Mb of sequence information . When searching for global shortest unique substrings in this genome, we found a single complementary pair of unique motifs of length 10 located on chromosome 1. Considering the next shortest unique substrings (length 11) we found a total of 10,509 such motif pairs distributed among the five chromosomes. Note that the search for these globally unique substrings was done with respect to the forward and backward strands of the complete genome.
We repeated this analysis for the human genome, which is the largest genome sequenced to date. It consists of 22 autosomes and two sex chromosomes totalling 2.84 Gb published sequence data . We found 215 pairs of global shortest unique substrings of length 11 distributed on the autosomes and the X-chromosome. The Y chromosome contained no unique sequences of length 11 but 135 globally unique sequence pairs of length 12.
We were puzzled by the fact that – with the exception of the single instance of a unique substring pair of length 10 on chromosome 1 of C. elegans – the shortest unique substrings in humans had the same length (11) as those found in C. elegans, even though the human genome is 28 times larger than that of the nematode. When repeating the search for global shortest unique substrings in the mouse genome, whose size is similar to that of human (2.49 Gb), we found a matching result: there were 255 shortest unique substring pairs of length 11 distributed among the autosomes and the X-chromosome. On the Y-chromosome there were 38 unique substring pairs of length 12. The fact that the highly repetitive Y chromosome contained global unique substrings of length 12 as compared to length 11 on autosomes suggested that the length of shortest unique substrings is inversely related to genome information content. In order to further explore whether particularly short unique substrings are associated with functional regions of the genome, we investigated the distribution of globally shortest unique substrings among 1 kb upstream regions of annotated genes.
A total of 29 out of 2 × 350 shortest unique substrings were located among a non-redundant set of 16,286 human 1 kb upstream regions. The probability of observing a single hit in an upstream region with one shortest unique substring is equal to the fraction of the published genome (considering again forward and backward strands) covered by upstream regions. This is
fh = 16286 × 1000/(2 × 2.84 × 109) ≈ 0.003.
The probability of observing 29 or more hits to upstream regions under the null hypothesis of equal distribution is
A similar result was obtained for the mouse genome. Here a total of 13,985 non-redundant upstream regions contained 22 of the 2 × 293 shortest unique substrings. The probability of finding a single hit with one shortest unique substring is again equal to the fraction of the published genome covered by the upstream regions:
fm = 13985 × 1000/(2 × 2.49 × 109) ≈ 0.003.
The probability of observing 22 or more hits to upstream regions by chance is
In other words, both in the human as well as in the mouse genome shortest unique substrings are clustered close to genes. A complete list of the shortest unique substrings with hits to upstream regions is available as supplementary material (see Additional file 1).
So far we have concentrated on the overall, i.e. global, shortest unique substrings. In the following sections we extend this analysis to include all local shortest unique substrings.
Empirical distribution of local shortest unique substring lengths
Distribution of local shortest unique substring lengths
As explained further in the Methods section, the probability of finding shortest unique substrings of length x, , is the number of unique substrings of length x, , minus the number of unique substrings of length x - 1, , divided by the genome length l:
Since equation (1) describes the length distribution of shortest unique substrings in random sequences, it is also the null-distribution for multiple, but phylogenetically unrelated, sequences. This fact can be exploited for comparative genome analysis.
Comparing five strains of Staphylococcus aureus
The search for unique substrings has a long tradition in molecular biology. It is fundamental for many sequence-based identification techniques, and affects PCR primer design as well as the development of specific antibodies. A DNA or protein sequence of length n contains substrings, which is also an upper limit for the number of unique substrings. This means that in most real world situations there is in an excess of unique substrings to choose from. Since a given unique substring remains unique upon extension, we decided to concentrate on the shortest unique substrings. In their global version they have minimal length with respect to the entire sample of sequences investigated. In contrast, their local version is defined for substrings starting from a specific position in the genome. There are of the order of n such local shortest unique substrings from which all the remaining unique substrings can be generated. Shortest unique substrings therefore lead to considerable space reduction when dealing with unique substrings.
Our technique for detecting such unique substrings is applicable to protein as well as to DNA sequence data. Antibodies are widely used in basic biomedical research; in addition, there is growing interest in applying them as therapeutics. A major design goal in generating antibodies to a given protein in all of these contexts is to maximize their specificity. Since the entire proteome of important biomedical model organisms, including human, is known, epitope selection might be guided not only by considerations of antigenicity, but also of uniqueness. In a preliminary study of the human proteome we found that 88% of the 27,175 human proteins we looked at contain at least one unique hexapeptide. Given that a typical epitope consists of 6 to 12 amino acids, this suggests that our method of detecting shortest unique substrings coupled with epitope prediction programs might also be useful for antibody development.
However, in this paper we have concentrated on shortest unique substrings in genomes. The fact that the length of global shortest unique substrings does not exceed 11 in autosomes of both C. elegans and humans is intriguing given the widely differing sizes of the two genomes and the extremely small probability of observing unique sequences of length 11 by chance in the human genome. Since the length of global shortest unique substrings remained constant after we had removed repetitive elements from the genomes, we take this as an indication that genomes contain a core of high-complexity sequences which determine the length of global shortest unique substrings. The size of this high-complexity core is apparently much less variable across metazoan genomes than raw genome size, hence the observed constancy of global shortest unique substring lengths.
These global shortest unique substrings can be used as starting points for developing signature oligos. Such oligos are widely used in biotechnology and taxonomy. A typical application in biotechnology is PCR-primers that should be unique to the target sequence. In taxonomy a recent initiative for the "Barcoding of Life" http://barcoding.si.edu/ attempts to "label" all extant species by assigning a short unique DNA-sequence to them.
The probability of finding a shortest unique substring of some length can be readily computed using equation (1). However, this equation is highly sensitive to the value of the parameter p, which describes the sequence composition. Hence, local variation in sequence composition will strongly affect the expected length of both local and global shortest unique substrings. This fact may also have contributed to our observation that shortest unique substrings cluster in upstream regions of genes in both the mouse and the human genomes. The euchromatic part of the human genome has an average GC-content of 0.41 (http://genome.ucsc.edu, Human genome assembly hg16), which is similar to the value of 0.42 for the mouse (http://genome.ucsc.edu, Mouse genome assembly mm4). In contrast, the global shortest unique substrings we found in humans have a GC-content of 0.59 and those found in mouse have a GC-content of 0.61. The upstream regions of genes tend to be GC-rich in human (GC-content = 0.53) as well as mouse (GC-content = 0.50), which might account for the clustering of global shortest unique substrings in these regions.
Detection of unique genomic regions is traditionally done by alignment-based approaches. However, the run-time of these algorithms depends non-linearly on the number and lengths of the input sequences and also on the degree of relatedness of the input sequences. In contrast, the scheme for detecting unique genomic regions proposed in this paper has a run time that is strictly linear in the combined lengths of the input sequences.
There is currently a lot of interest in comparative genomics . In many of these projects detection of regions unique to a genome is one of the first steps towards functional annotation (e. g. ). Given equation (1), the size distribution of shortest unique substrings in random, i.e. unrelated, sequences can be predicted. This leads to our method of detecting unique genomic regions from an arbitrary set of input sequences without the need for alignment. Its usefulness for comparative genomics is clearly demonstrated in the case of the genomes of S. aureus, where we could rapidly detect the two unique regions previously annotated in one strain  (Figure 5).
Detection of shortest unique substrings
Two methods borrowed from computer science were used for the detection of shortest unique substrings: suffix tree construction and hashing. Suffix trees are well described by Gusfield  and we follow his nomenclature. To use suffix trees for detecting unique substrings, notice that the path label of any leaf is a unique substring. The set of shortest unique substrings at every position can therefore be discovered by traversing the tree once and looking up the string depth of the parent node of every leaf. This value plus one is the desired length of the shortest unique string that starts at the position indicated by the leaf.
Unless stated otherwise, all computations presented in this paper consider both strands of the DNA sequences concerned. Note that in this case, and due to complementarity of DNA, a single parameter (p) suffices to describe nucleotide composition.
The probability distribution of local shortest unique substring lengths in nucleotide sequences
Consider a nucleotide sequence S and let 2p denote the GC-content of S (p ∈ [0, 1/2]). A shortest unique substring of length x of this nucleotide sequence is defined as a unique substring S[i..i + x - 1] where S[i..i + x - 2] is not unique. We wish to derive the probability distribution of values of x under the assumption of random sequence composition.
We start by considering a particular substring of length x consisting of k positions occupied by either G or C each. We refer to such a substring as being of type (x, k). The probability of finding a substring of type (x, k) is
Px,k= (1/2 - p)x - kp k .
Assuming that l independent trials each having a success probability of Px,kare performed, the probability of finding a particular sequence of type (x, k) exactly once is then
This expression is only approximately valid, since the nucleotide compositions of any two overlapping subtrings are not independent. Still, from now on we assume independence. The error introduced by this assumption is negligible, if the genome size, l, is large compared to the length of the considered substrings (l >> x) - which is the case we have in mind. Thus, we replace the ≈-sign in the above and following expressions by = and define
For each sequence of type (x, k), there are permutations of k "G|C" s and (x - k) "A|T" s. Some of these permutations occur zero times in S, some occur multiple times and some occur exactly once. We are interested in the latter: the number of unique substrings of type (x, k) is
In order to determine the number of unique substrings irrespective of their sequence composition, we need to sum over all possible values of k:
The number of shortest unique substrings of length x, , is then simply the number of unique substrings of length x minus the number of unique substrings of length x - 1. In order to see this, notice that all unique substrings of length x - 1 are contained in the set of unique substrings of length x. Those that are gained by adding the extra nucleotide are precisely the substrings that lose their uniqueness when reduced in length by one as required by the definition of shortest unique substring:
The search for shortest unique substrings is implemented in our program shustring (SHortest Unique subSTRING). The distribution of shortest unique substring lengths in genomic sequences as embodied in equation (1) is implemented in our program shulen. Both pieces of software are available from http://adenine.biz.fh-weihenstephan.de/shustring/.
nematode: http://hgdownload.cse.ucsc.edu/goldenPath/ce2/bigZips/ (version ce2)
mouse: http://hgdownload.cse.ucsc.edu/goldenPath/mm4/bigZips/ (version mm4, October 2003)
human: http://hgdownload.cse.ucsc.edu/goldenPath/hg16/bigZips/ (version hg16, July 2003)
Bacterial genomes analyzed in this study
Staphylococcus aureus MRSA252
Staphylococcus aureus MSSA476
Staphylococcus aureus MW2
Staphylococcus aureus Mu50
Staphylococcus aureus N315
We would like to thank C. Acquisti and A. Börsch-Haubold for stimulating discussions and two anonymous reviewers for comments which helped to improve the manuscript. Furthermore, we would like to thank the members of the High Performance Computing Group at the Leibniz Rechenzentrum München for advice on computational issues. F.M. is supported by a grant from the German Ministry of Education and Research (BMBF; Fkz. 0312705A). B.H. is supported financially by Dehner Gartencenter GmbH and the Stifterverband der Deutschen Wissenschaft.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar
- Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997.View ArticleGoogle Scholar
- The C elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282: 2012–2018. 10.1126/science.282.5396.2012View ArticleGoogle Scholar
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062View ArticleGoogle Scholar
- Holden MT, Feil EJ, Lindsay JA, Peacock SJ, Day NP, Enright MC, Foster TJ, Moore CE, Hurst L, Atkin R, Barron A, Bason N, Bentley SD, Chillingworth C, Chillingworth T, Churcher C, Clark L, Corton C, Cronin A, Doggett J, Dowd L, Feltwell T, Hance Z, Harris B, Hauser H, Holroyd S, Jagels K, James KD, Lennard N, Line A, Mayes R, Moule S, Mungall K, Ormond D, Quail MA, Rabbinowitsch E, Rutherford K, Sanders M, Sharp S, Simmonds M, Stevens K, Whitehead S, Barrell BG, Spratt BG, Parkhill J: Complete genomes of two clinical Staphylococcus aureus strains: evidence for the rapid evolution of virulence and drug resistance. Proc Natl Acad Sci U S A 2004, 101: 9786–9791. 10.1073/pnas.0402521101PubMed CentralView ArticlePubMedGoogle Scholar
- Haubold B, Wiehe T: Comparative genomics: methods and applications. Naturwissenschaften 2004, 91: 405–421.PubMedGoogle Scholar
- Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. The MIT Press; 2001.Google Scholar
- Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520–561. 10.1038/nature01262View ArticleGoogle Scholar
- Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton GG, Kelley JM, Fritchman JL, Weidman JF, Small KV, Sandusky M, Fuhrmann JL, Nguyen DT, Utterback T, Saudek DM, Phillips CA, Merrick JM, Tomb J, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Venter JC: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270: 397–403.View ArticlePubMedGoogle Scholar
- Baba T, Takeuchi F, Kuroda M, Yuzawa H, Aoki K, Oguchi A, Nagai Y, Iwama N, Asano K, Naimi T, Kuroda H, Cui L, Yamamoto K, Hiramatsu K: Genome and virulence determinants of high virulence community-acquired MRSA. Lancet 2002, 359: 1819–1827. 10.1016/S0140-6736(02)08713-5View ArticlePubMedGoogle Scholar
- Kuroda M, Ohta T, Uchiyama I, Baba T, Yuzawa H, Kobayashi I, Cui L, Oguchi A, Aoki K, Nagai Y, Lian J, Ito T, Kanamori M, Matsumaru H, Maruyama A, Murakami H, Hosoyama A, Mizutani-Ui Y, Takahashi NK, Sawano T, Inoue R, Kaito C, Sekimizu K, Hirakawa H, Kuhara S, Goto S, Yabuzaki J, Kanehisa M, Yamashita A, Oshima K, Furuya K, Yoshino C, Shiba T, Hattori M, Ogasawara N, Hayashi H, Hiramatsu K: Whole genome sequencing of meticillin-resistant Staphylococcus aureus. Lancet 2001, 357: 1225–1240. 10.1016/S0140-6736(00)04403-2View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.