How repetitive are genomes?
© Haubold and Wiehe; licensee BioMed Central Ltd. 2006
Received: 26 October 2006
Accepted: 22 December 2006
Published: 22 December 2006
Genome sequences vary strongly in their repetitiveness and the causes for this are still debated. Here we propose a novel measure of genome repetitiveness, the index of repetitiveness, Ir, which can be computed in time proportional to the length of the sequences analyzed. We apply it to 336 genomes from all three domains of life.
The expected value of Ir is zero for random sequences of any G/C content and greater than zero for sequences with excess repeats. We find that the Ir of archaea is significantly smaller than that of eubacteria, which in turn is smaller than that of eukaryotes. Mouse chromosomes have a significantly higher Ir than human chromosomes and within each genome the Y chromosome is most repetitive. A sliding window analysis reveals that the human HOXA cluster and two surrounding genes are characterized by local minima in Ir. A program for calculating the Ir is freely available at http://adenine.biz.fh-weihenstephan.de/ir/.
The general measure of DNA repetitiveness proposed in this paper can be efficiently computed on a genomic scale. This reveals a broad spectrum of repetitiveness among diverse genomes which agrees qualitatively with previous studies of repeat content. A sliding window analysis helps to analyze the intragenomic distribution of repeats.
Repeat sequences are a common feature of prokaryote and eukaryote genomes [1–3] and in both types of organisms the selective neutrality or otherwise of extra copies of sequences has been debated for decades. Since the start of the genomics era in the mid-1990s the hitherto unexpectedly large amount of repetitive sequences found in bacteria, which may account for more than 10% of the total genome, prompted a flurry of investigations of the functional and evolutionary significance of these elements. More recently, Aras et al. surveyed 51 bacterial genomes to quantify the effect repeat sequences might have on genome plasticity due to intragenomic recombination. The authors conclude that in bacteria repeats might be selected for their positive effect on the adaptability of their host. In another in silico survey of 58 completely sequenced bacteria, Achaz et al. noted that inverted repeats are underrepresented in bacterial genomes due to their destabilizing effect on genome structure.
In eukaryotes the discrepancy between DNA content and apparent organismic complexity had been noted even before the discovery of the double helix, leading to the conclusion that "The relationship between DNA and the size or number of genes is obscure" [, p. 462]. In the 1960s DNA reannealing studies uncovered that eukaryotic genomes contain a highly variable fraction of repetitive DNA. Since the sequencing of complex genomes these observations have been made precise: approximately 50% of the human genome is made up of repetitive sequences. However, the term "repetitive sequences" encompasses a rather heterogeneous set of elements: 45% of the human genome is covered by transposons, 3% are repeats of less than a hundred base pairs (microsatellites and minisatellites), and 5% consist of recent duplications of large segments of DNA. Broadly similar observations have been made in other mammalian genomes [9–11]. The human genome contains low, but appreciable, genetic variation caused by transposable elements, indicating that transposable elements have been active over the short time span since humans diverged from their last common ancestor. However, the decline of transposon activity in the hominoid lineage contrasts with more recent insertions in mouse, where new spontaneous mutations are 60 times more likely to be caused by transposition than in human.
The hypothesis that transposable elements are molecular parasites was originally designed to explain the apparently excessive DNA baggage of eukaryotes [13, 14]. A number of contemporary observations support this view. Transposon-derived sequences are rare close to transcription start sites and inside coding regions, suggesting that insertions are usually deleterious. Moreover, the four human HOX clusters and other highly regulated genomic regions contain very few transposable elements. Direct deletion of megabase-sized regions devoid of known genes also seems to have no effect on mice, even though these regions contain elements that have been conserved since the emergence of mammals. There is no contradiction between these observations and the fact that occasionally transposable elements can give rise to beneficial structures including novel gene regulatory regions and the V(D)J recombination mechanism that generates the antibody diversity expressed by vertebrate B cells.
Since the publication of whole genome data, the quantification and classification of repeat elements has become a major area of research in computational biology [18, 19]. Perhaps the best-known program for the detection of repeat elements is repeatmasker, which looks for two things: (1) tandem repeats of a few nucleotides, and (2) homology to known repetitive elements. This approach has the advantage of dealing with elements of known origin. Its disadvantage is that the presence of hitherto unknown repetitive elements might be missed. The program repeatfinder implements a highly efficient and more generic approach based on suffix trees that makes no assumptions about the type of repeat present. Such methods can be used to compute, for example, the percentage of a given DNA sequence covered by repeats and most methods provide a means of checking the statistical significance of the repeats returned. Suffix trees allow the efficient detection of all exact repeats in a sequence. In contrast, the widely used relative simplicity factor (RSF) is based on the local density of repeat motifs up to four bases long compared to their density in a shuffled version of the input sequence. Application of the RSF to diverse genomes revealed that eukaryotes are characterized by an elevated "micro-repetitiveness" compared to prokaryotes.
What is lacking, though, is an all-inclusive measure of repetitiveness. Under the RSF repetitiveness is defined as a quantity that is minimized by shuffling the investigated sequence. As suggested by the term simplicity factor, studies of repetitiveness are related to investigations of complexity – if repetitiveness is high, complexity is low, though the converse is not always true. For example, the "linguistic complexity" of a string S is defined as the number of substrings of lengths 2, 3, ..., |S| observed in S compared to the maximum number of substrings of these lengths. A random DNA sequence with G/C content 0.5 has maximal complexity and minimal repetitiveness. However, a random DNA sequence with a G/C content of, say, 0.1 does not have maximal complexity, while its repetitiveness should still be minimal.
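The definition of linguistic complexity quoted above can be illustrated with a minimal sketch. It reads "compared to" as a simple ratio of totals and assumes a four-letter alphabet; the function name `linguistic_complexity` and both of these choices are assumptions made for illustration, not the formulation of the cited work.

```python
def linguistic_complexity(s, alphabet_size=4):
    """Ratio of observed to maximal numbers of distinct substrings.

    Sums over substring lengths k = 2, ..., len(s). The maximum for a
    given k is limited both by the alphabet (alphabet_size ** k) and by
    the number of start positions (len(s) - k + 1).
    Assumes len(s) >= 2; shorter input would divide by zero.
    """
    n = len(s)
    observed = 0
    maximum = 0
    for k in range(2, n + 1):
        observed += len({s[i:i + k] for i in range(n - k + 1)})
        maximum += min(alphabet_size ** k, n - k + 1)
    return observed / maximum


print(linguistic_complexity("AAAA"))  # homopolymer: low complexity
print(linguistic_complexity("ACGT"))  # every substring distinct
```

A sequence drawn from a strongly skewed base composition behaves more like the homopolymer than like the second example, which is why complexity alone cannot serve as a composition-independent measure of repetitiveness.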
In this paper we propose a novel measure of repetitiveness which considers repeats of any length, takes into account G/C content, and does not necessitate shuffling for its computation. As explained in detail in the Methods Section, our index of repetitiveness, Ir, is expected to be zero in random DNA sequences of any G/C content and length, and can be computed in time proportional to sequence length. We apply the Ir to 303 sequenced bacterial genomes, 27 archaebacteria, and six model eukaryotes: baker's yeast (Saccharomyces cerevisiae), nematode worm (Caenorhabditis elegans), thale cress (Arabidopsis thaliana), fruit fly (Drosophila melanogaster), mouse (Mus musculus), and human (Homo sapiens).
Survey of Ir values
At the other extreme of the distribution, Buchnera aphidicola str. Bp had the smallest Ir value (0.019), which was even smaller than that observed in phage λ (Ir = 0.024; Figure 1). With one exception the ten eubacteria with the lowest Ir values comprised only intracellular organisms sampled from the genera Buchnera, Chlamydophila, Candidatus, Neorickettsia, and Rickettsia. The exception was the highly abundant photosynthetic bacterium Prochlorococcus marinus subsp. marinus str. CCMP1375 [see Additional file 1].
Figure 2B displays the Ir values of archaebacteria and eukaryotes. In archaebacteria Ir was significantly correlated with log genome size (Pearson correlation = 0.562; P = 0.002), while in eukaryotes the correlation was not significant (Pearson correlation = 0.485; P = 0.515). The average Ir of archaebacteria was 0.467, which is significantly smaller than that of eubacteria (Wilcoxon test, P = 3.15 × 10⁻⁶). The average Ir of eukaryotes was 2.103, which is in turn significantly greater than either that of eubacteria (P = 4.3 × 10⁻³) or archaebacteria (P = 6.36 × 10⁻⁵). Among eukaryotes only Drosophila melanogaster had an Ir > 3.
The bacterium with the second highest global Ir value, Streptococcus agalactiae NEM316 (Ir = 4.842; Figure 2A), was an outlier among the other 14 streptococci investigated, which have an average Ir of 1.665 [see Additional file 1]. Window analysis of S. agalactiae NEM316 revealed three exact repeats of 47 kb (not shown) and their removal resulted in an Ir of 1.756. Similarly, Escherichia coli O157:H7 EDL933 had an exceptionally high Ir of 3.521 (Figure 2A) compared to the other five strains of E. coli sampled (average Ir: 1.049; cf. Additional file 1). In this case window analysis of E. coli O157:H7 EDL933 (not shown) highlighted a repeat region of approximately 100 kb located at positions 1,050,000–1,150,000 and 1,450,000–1,550,000, which contained several long exact repeats with the longest spanning over 41 kb. Removal of one copy of the 100 kb repeat region reduced the Ir to 1.756.
Mouse and human chromosomes
The average Ir for mouse chromosomes was 1.773 (Figure 4B), which is significantly larger than that of humans (Wilcoxon test, P = 1.4 × 10⁻³). This agrees with the observation that the rodent lineage has experienced a higher rate of retro-transposition than hominoids. Individual mouse chromosomes had Ir values ranging from 0.7 in chromosome 19 to 3.654 in the Y chromosome. As in the human genome, the Y chromosome from mouse was characterized by the largest Ir. In addition, chromosomes 7 and X had Ir values > 3 (Figure 4B).
HOX genes in human and D. melanogaster
A sliding window analysis of the antennapedia complex in D. melanogaster, which is homologous to part of the human HOXA cluster, revealed a very different topology of repetitiveness (Figure 5B). On a background of Ir ≈ 0, large peaks marked the presence of long exact repeats and the antennapedia cluster was not characterized by a conspicuous change in Ir values.
"At this point we do not know what most of the DNA in eukaryotes is doing" [, p. 253]. Today, thirty-five years later, the function of apparently excess DNA in both eukaryotes and prokaryotes remains a topic of intense research activity . Our method to quantify this excess DNA, the index of repetitiveness, is close in spirit to the investigation of linguistic complexity based on suffix trees . Linguistic complexity is maximized in random sequences with equiprobable residues. Deviations from equiprobability lead to a reduction in complexity even if the sequence remains completely random. In contrast, in this paper we were interested in quantifying repetitiveness with respect to genome composition and to make this measure comparable across genomes. Our starting point was an investigation of the complement of repeats, the unique sequences. These are trivially easy to find, for example a sequence is always unique with respect to itself, and for this reason we have concentrated on shortest unique substrings. A shortest unique substring occurs only once in its parent string and cannot be reduced in length without losing its uniqueness. A genome with many long repeats contains many excessively long shortest unique substrings, while its shuffled version contains only the shortest unique substrings expected to be there by chance alone (cf. Methods). Since we have derived the latter quantity analytically , the Ir is constructed as the logarithm of the ratio between the observed and expected aggregate number of nucleotides found in shortest unique substrings. At the cost of ignoring homology relationships, this measure has the advantage that it can be computed for any double-stranded DNA sequence and its expectation is always zero. It is also possible to estimate an Ir value for sequences over alphabets other than the four nucleotides. In this case the quantity Ae defined in Equation (2) can be estimated by shuffling the input sequence. 
For example, the Ir of this paper is approximately 0.7.
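The construction described above can be made concrete with a small sketch. It is not the authors' linear-time suffix tree program: it sorts suffixes naively and scans the forward strand only. It assumes that the observed quantity is the sum, over all positions, of the length of the shortest unique substring starting there (positions where none exists contribute zero), estimates the expected quantity by shuffling as suggested above for general alphabets, and assumes the natural logarithm. The function names are hypothetical.

```python
import math
import random


def shustring_lengths(s):
    """Length of the shortest unique substring starting at each position.

    A position gets 0 when every substring starting there also occurs
    elsewhere, which happens near the end of the sequence.
    """
    n = len(s)
    # Naive suffix sorting: fine for a sketch, but not linear time.
    order = sorted(range(n), key=lambda i: s[i:])

    def lcp(a, b):
        # Length of the longest common prefix of suffixes a and b.
        k = 0
        while a + k < n and b + k < n and s[a + k] == s[b + k]:
            k += 1
        return k

    lengths = [0] * n
    for rank, i in enumerate(order):
        longest_repeat = 0
        if rank > 0:
            longest_repeat = max(longest_repeat, lcp(i, order[rank - 1]))
        if rank < n - 1:
            longest_repeat = max(longest_repeat, lcp(i, order[rank + 1]))
        # One character beyond the longest repeated prefix is unique,
        # provided the sequence is long enough to supply it.
        if i + longest_repeat + 1 <= n:
            lengths[i] = longest_repeat + 1
    return lengths


def index_of_repetitiveness(s, shuffles=10, seed=0):
    """Log ratio of observed to shuffled aggregate shustring length."""
    rng = random.Random(seed)
    observed = sum(shustring_lengths(s))
    letters = list(s)
    total = 0
    for _ in range(shuffles):
        rng.shuffle(letters)
        total += sum(shustring_lengths("".join(letters)))
    expected = total / shuffles
    return math.log(observed / expected)
```

Applied to a random sequence concatenated with a copy of itself, this sketch returns a value well above zero, whereas a random sequence without such a duplication gives a value close to zero.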
Since the construction of the underlying suffix tree takes only time proportional to the length of the sequence analyzed, the Ir can be computed in time proportional to the length of the input sequence. In contrast, traditional repeat analysis such as implemented in the program repeatmasker runs in time proportional to the product of the length of query and subject sequence.
Like most suffix tree implementations, the suffix tree on which our analysis is based is kept entirely in the main memory (RAM) of the computer. This has the advantage of being relatively easy to implement. The disadvantage of this approach is that the amount of sequence data that can be analyzed in a single run of the program is limited by the available RAM rather than by the much cheaper hard disk space. We are currently studying advances in disk-based suffix tree construction in order to break through the RAM barrier.
It may come as a surprise that the Ir values for human and mouse chromosomes were within the range of Ir values observed for less complex eubacterial genomes (Figure 2). However, this does not contradict the well-known fact that mammalian genomes are full of interspersed repeats, while bacteria usually contain fewer of these elements. The apparent paradox is due to the fact that the effect of interspersed repeats on the excess amount of exact repeats in a given genome – which is what the Ir measures – depends not only on the fraction of sequence covered by repetitive elements; equally important is the number of mutations accumulated since the divergence of an interspersed repeat from its most recent ancestor. As a result of the mutation process, ancient repetitive elements may not contain longer motifs repeated elsewhere than the rest of the genome. The presence of such elements would leave the Ir unchanged compared to the identical genome without them.
A similar argument applies to the interpretation of the high Ir values found in the Y chromosomes of human and mouse. The two factors determining the accumulation of sequence polymorphisms, time to the most recent common ancestor and mutation rate, cannot be separated. In addition, the effective mutation rate differs between autosomes and the Y chromosome. Under neutrality the number of SNPs expected for a pair of homologous sequences is θ = 4Neμ, where Ne is the effective population size and μ the rate of mutation. Since the effective population size of mammalian Y chromosomes is only one quarter that of autosomes, repeat pairs on the Y chromosome are broken up more slowly by mutations than elsewhere in the genome contributing to higher Ir values.
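The population-genetic step in this argument can be written out as a short worked equation. The subscripts A (autosomes) and Y are notation introduced here, and the sketch assumes for simplicity the same mutation rate μ on both, which the text notes is not strictly the case.

```latex
\theta_A = 4 N_e \mu, \qquad
\theta_Y = 4 \cdot \frac{N_e}{4} \cdot \mu = N_e \mu = \frac{\theta_A}{4}
```

Under these assumptions the expected pairwise diversity on the Y chromosome is one quarter of the autosomal value.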
It should be noted at this point that neither the mouse nor the human genome is completely sequenced to date. If new sequence data comes predominantly from regions that are difficult to sequence due to their repetitiveness, future editions of the human and mouse genomes are expected to have higher Ir.
The Ir values found in our whole genome analyses (Figure 2) correlate well with the relative simplicity factors (RSFs) reported previously (Pearson correlation = 0.552, P = 3.3 × 10⁻⁴). This correlation is not perfect due to the fact that the RSF measures the local excess of short repeats, while the Ir measures the excess of all repeats throughout the sequence. Moreover, no significant correlation between archaebacterial genome size and RSF was observed by Hancock, in contrast to our finding. This effect, however, is simply due to differences in sampling; if we reduce our sample of 27 archaebacterial genomes to the nine investigated by Hancock, the correlation between Ir and log genome size also vanishes. In contrast, a tenfold increase in the number of bacterial genomes investigated between Hancock's and our study only confirmed the earlier diagnosis of no correlation between RSF and genome size.
The average Ir for eubacteria was 1.048. However, it is clear that there are a few extreme Ir values that inflate this average (Figure 2A). The largest Ir for bacteria (or for any other organism) was found in Methylobacillus flagellatus KT (6.337). This value was the most extreme of a set of seven organisms with Ir > 3 that also included the human pathogens Neisseria meningitidis MC58 and Escherichia coli O157:H7 EDL933 (Figure 2). In a previous survey of 58 bacteria, Neisseria meningitidis was already singled out as having a highly repetitive genome. The low Ir values found by us among obligately host-associated bacteria also agree with a known lack of repeats in these genomes. While other bacteria appear to harbor repeats to increase genome plasticity, we speculate that intracellular symbionts and pathogens are less dependent on genome shuffling for their survival as they live in more stable environments. Our sliding window analyses revealed that the computation of Ir values for entire genomes averages out sharp regional fluctuations in Ir (Figures 3 and 5). In bacteria a high Ir value may be caused by a few extreme duplications, as was the case for M. flagellatus KT (Figure 3A) and S. agalactiae NEM316. In the human genome the 13 genes making up the HOXA cluster were characterized by a 100 kb footprint of low Ir values (Figure 5A). The fact that additional runs of low Ir outside the HOXA cluster also coincided with known genes leads us to currently search the entire human genome for further regions of low Ir.
Investigations of repetitiveness are traditionally carried out using some form of alignment algorithm. Such algorithms tend to run in time proportional to the product of the length of the query and subject sequence. In this paper we present an approach that runs in time linear in the length of the input sequence. It is based on a comparison between the observed and expected sums of the lengths of shortest unique substrings. We apply the resulting index of repetitiveness, Ir, to prokaryote and eukaryote genomes. Our global repetitiveness measures agree qualitatively with current knowledge about genome structure. However, a more detailed picture emerges by subjecting the genomes to window analyses. In the human genome the highly regulated HOXA cluster is known to lack insertion sequences. Accordingly, it is characterized by a footprint of low Ir. This suggests that in mammalian genomes regions of low Ir may be due to strong selection against mutagenesis by insertion sequences. If this is the case, scanning mammalian genomes for further intervals of low Ir may reveal tracts under strong purifying selection.
The quantity Ao corresponds to the area under the curve shown in Figure 8A.
We have previously derived an exact expression for the number of shortest unique substrings of length x expected in a completely shuffled genome of a given length and G/C content, N_x. It is therefore convenient to define the expected aggregate length of shortest unique substrings as
Figure 8B shows the length of shortest unique substrings at each position along a shuffled version of the 2 kb fragment from the genome of M. genitalium. Notice that all the spikes indicating long repeats contained in the original sequence data (Figure 8A) have vanished, leaving a narrow baseline of shortest unique substring lengths. The quantity Ae is the expectation of the area under this baseline curve.
The index of repetitiveness, Ir, is now defined as the logarithm of the ratio of the observed aggregate shortest unique substring length and its theoretical expectation:
For genomes devoid of excess repeat sequences Ir ≈ 0, while for sequences with an excess of repeats Ir > 0. We have written the program ir for calculating Ir. The software is accessible using any standard web browser.
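Written out from the verbal definitions in this section, the three quantities can be sketched as below. The symbols A_o, A_e, N_x and I_r follow the text; the per-position length \ell_i, the unspecified summation bounds, and the choice of the natural logarithm are assumptions made for this sketch.

```latex
A_o = \sum_{i} \ell_i, \qquad
A_e = \sum_{x} x \, N_x, \qquad
I_r = \ln\!\left(\frac{A_o}{A_e}\right)
```

Here \ell_i denotes the length of the shortest unique substring starting at position i. When A_o equals A_e the ratio is one and I_r = 0; an excess of long repeats inflates A_o and gives I_r > 0.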
The sources of the eukaryotic genomes analyzed in this study.
Ir calculations and statistical analysis
All Ir values presented in Figure 2 were computed from the complete genome data available. Unsequenced regions marked by Ns were removed to prevent artificial inflation of Ir. The human and mouse genomes were too large for complete analysis with the computing equipment available to us. We therefore analyzed only individual chromosomes (Figure 4). With the exception of human and mouse chromosomes 1 and 2, all sequences were analyzed on their reverse and forward strands. Due to their sizes, only the forward strands of human and mouse chromosomes 1 and 2 were included in the computation of Ir.
For the sliding window analyses (Figures 3 and 5) Ao is computed as the sum of shortest unique substring lengths starting inside an interval of 1000 bp. Similarly, Ae is a function of the local G/C content and window length (1000 in our case). The window is then moved by a tenth of its length, i.e. 100 bp, and the Ir is recomputed.
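The windowing scheme just described can be sketched as follows. The per-position shortest unique substring lengths are taken as given, and the expected area is delegated to a caller-supplied function `expected_area(gc, window)`, a hypothetical stand-in for the analytical expectation, which is not reproduced here. The natural logarithm and the function name `sliding_ir` are likewise assumptions.

```python
import math


def sliding_ir(seq, sus_lengths, expected_area, window=1000, step=100):
    """Return a list of (window start, Ir) pairs along the sequence.

    seq           -- DNA string
    sus_lengths   -- length of the shortest unique substring starting
                     at each position of seq (same length as seq)
    expected_area -- hypothetical callable giving the expected area for
                     a window of given G/C content and length
    Windows whose observed area is zero would need separate handling,
    since the logarithm is undefined there.
    """
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # Observed area: lengths of substrings starting inside the window.
        observed = sum(sus_lengths[start:start + window])
        gc = (chunk.count("G") + chunk.count("C")) / window
        expected = expected_area(gc, window)
        result.append((start, math.log(observed / expected)))
    return result
```

With `window=1000` and `step=100`, consecutive windows overlap by 900 bp, so each position contributes to ten successive Ir values.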
The significance of differences between average values computed from sets of Ir values was tested using the two-sample Wilcoxon test as implemented in the statistics software R.
Availability and requirements
We have implemented Ir computations in the program ir, which can be accessed via a web interface at http://adenine.biz.fh-weihenstephan.de/ir/.
The C source code of a stand-alone version of the program is also freely available from this web site under the terms of the GNU General Public License.
We thank A. Börsch-Haubold, P. Pfaffelhuber, and C. Schlötterer for constructive criticism. BH is supported financially by Dehner Gartencenter GmbH and the Stifterverband der Deutschen Wissenschaft.
- Britten RJ, Kohne DE: Repeated sequences in DNA. Science 1968, 161: 529–540. 10.1126/science.161.3841.529
- Rocha EPC, Danchin A, Viari A: Functional and evolutionary roles of long repeats in prokaryotes. Research in Microbiology 1999, 150: 725–733. 10.1016/S0923-2508(99)00120-5
- Gregory TR: Synergy between sequence and size in large-scale genomics. Nature Reviews Genetics 2005, 6: 699–708. 10.1038/nrg1674
- Hofnung M, Shapiro JA: Introduction. Research in Microbiology 1999, 150: 577–578. 10.1016/S0923-2508(99)00133-3
- Aras RA, Kang J, Tschumi AI, Harasaki Y, Blaser MJ: Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proceedings of the National Academy of Sciences, USA 2003, 100: 13579–13584. 10.1073/pnas.1735481100
- Achaz G, Coissac E, Netter P, Rocha EPC: Associations between inverted repeats and the structural evolution of bacterial genomes. Genetics 2003, 164: 1279–1289.
- Mirsky AE, Ris H: The desoxyribonucleic acid content of animal cells and its evolutionary significance. The Journal of General Physiology 1951, 34: 451–462. 10.1085/jgp.34.4.451
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921. 10.1038/35057062
- Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420: 520–561. 10.1038/nature01262
- Rat Genome Sequencing Consortium: Genome sequence of the brown Norway rat yields insights into mammalian evolution. Nature 2004, 428: 493–521. 10.1038/nature02426
- The Chimpanzee Sequencing and Analysis Consortium: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 2005, 437: 69–87. 10.1038/nature04072
- Bennett EA, Coleman LE, Tsui C, Pittard SW, Devine SE: Natural genetic variation caused by transposable elements in humans. Genetics 2004, 168: 933–951. 10.1534/genetics.104.031757
- Orgel LE, Crick FHC: Selfish DNA: the ultimate parasite. Nature 1980, 284: 604–607. 10.1038/284604a0
- Doolittle WF, Sapienza C: Selfish genes, the phenotype paradigm and genome evolution. Nature 1980, 284: 601–603. 10.1038/284601a0
- Jordan JI, Rogozin IB, Glazko GV, Koonin EV: Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends in Genetics 2003, 19: 68–72. 10.1016/S0168-9525(02)00006-9
- Nóbrega MA, Zhu Y, Plajzer-Frick I, Afzal V, Rubin EM: Megabase deletions of gene deserts result in viable mice. Nature 2004, 431: 988–993. 10.1038/nature03022
- Zhou L, Atkinson PW, Hickman AB, Dyda F, Craig NL: Transposition of hAT elements links transposable elements and V(D)J recombination. Nature 2004, 432: 995–1001. 10.1038/nature03157
- Kurtz S, Schleiermacher C: REPuter – fast computation of maximal repeats in complete genomes. Bioinformatics 1999, 15: 426–427. 10.1093/bioinformatics/15.5.426
- Volfovsky N, Haas BJ, Salzberg SL: A clustering method for repeat analysis in DNA sequences. Genome Biology 2001, 2: 0027.1–0027.11. 10.1186/gb-2001-2-8-research0027
- Hancock JM: The contribution of slippage-like processes to genome evolution. Journal of Molecular Evolution 1995, 41: 1038–1047. 10.1007/BF00173185
- Tautz D, Trick M, Dover GA: Cryptic simplicity in DNA is a major source of genetic variation. Nature 1986, 322: 652–656. 10.1038/322652a0
- Hancock JM: Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects. Genetica 2002, 115: 93–103. 10.1023/A:1016028332006
- Orlov YL, Potapov NV: Complexity: an internet resource for analysis of DNA sequence complexity. Nucleic Acids Research 2004, 32: W628–W633.
- Troyanskaya OG, Arbell O, Loren Y, Landau GM, Bolshoy A: Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics 2002, 18: 679–688. 10.1093/bioinformatics/18.5.679
- Shapiro SS, Wilk MB: An analysis of variance test for normality (complete samples). Biometrika 1965, 52: 591–611. 10.2307/2333709
- Liu J, Kang H, Raab M, da Silva AJ, Kraeft SK, Rudd CR: FYB (FYN binding protein) serves as a binding partner for lymphoid protein and FYN kinase substrate SKAP55 and a SKAP55-related protein in T cells. Proceedings of the National Academy of Sciences, USA 1998, 95: 8779–8784. 10.1073/pnas.95.15.8779
- Faiella A, D'Esposito M, Rambaldi M, Acampora D, Balsofiore S, Stornaiuolo A, Mallamaci A, Migliaccio E, Gulisano M, Simeone A, Bonicelli E: Isolation and mapping of ENVX1, a human homeobox gene homologous to even-skipped, localized at the 5' end of HOX1 locus on chromosome 7. Nucleic Acids Research 1991, 19: 6541–6545. 10.1093/nar/19.23.6541
- Thomas CA Jr: The genetic organization of chromosomes. Annual Reviews of Genetics 1971, 5: 237–256. 10.1146/annurev.ge.05.120171.001321
- Haubold B, Pierstorff N, Möller F, Wiehe T: Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics 2005, 6: 123. 10.1186/1471-2105-6-123
- Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997.
- Tian Y, Tata S, Hankins RA, Patel JM: Practical methods for constructing suffix trees. The VLDB Journal 2005, 14: 281–299. 10.1007/s00778-005-0154-8
- Calculate the Repetitiveness of DNA Sequences [http://adenine.biz.fh-weihenstephan.de/ir/]
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 2005, (33 Database):D501–4.
- R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2004. [http://www.R-project.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.