- Methodology article
- Open Access
Browsing repeats in genomes: Pygram and an application to non-coding region analysis
© Durand et al; licensee BioMed Central Ltd. 2006
- Received: 23 June 2006
- Accepted: 26 October 2006
- Published: 26 October 2006
A large number of studies on genome sequences have revealed the major role played by repeated sequences in the structure, function, dynamics and evolution of genomes. In-depth repeat analysis requires specialized methods, including visualization techniques, to achieve optimum exploratory power.
This article presents Pygram, a new visualization application for investigating the organization of repeated sequences in complete genome sequences. The application projects data from a repeat index file on the analysed sequences, and by combining this principle with a query system, is capable of locating repeated sequences with specific properties. In short, Pygram provides an efficient, graphical browser for studying repeats. Implementation of the complete configuration is illustrated in an analysis of CRISPR structures in Archaea genomes and the detection of horizontal transfer between Archaea and Viruses.
By proposing a new visualization environment to analyse repeated sequences, this application aims to increase the efficiency of laboratories involved in investigating repeat organization in single genomes or across several genomes.
- Maximal Repeat
- Suffix Tree
- Zoom Lens
- Index File
- Large Repeat
Some years ago, genomes were considered as static objects containing an informative part, the coding sequences, representing only a small percentage of the total genome, and a part referred to as "junk DNA" that was generally free of any annotation. It is now widely acknowledged that genomes must be considered from a more dynamic point of view, involving the study of the many "copy" events that occur during evolution, while covering not only coding genes, but non-coding sequences as well. A large number of in silico studies have revealed that repetitive sequences play an important role in the structure, function, dynamics and evolution of genomes in Archaea [1, 2], Bacteria [3, 4] and Eukarya [5–7]. It is well known, for instance, that proteins are combinations of a finite set of domains that represent basic structural units whose arrangements determine a wide variety of functions. Other classes of repeats, such as transposable elements, allow mobile elements to move around a genome, and have a major impact on the evolution of sequences [8]. DNA palindromes, a particular form of repeat, are widespread in human cancers. Other repeats in centromeric or telomeric regions of chromosomes seem to confer a certain robustness on the sequences during replication. Repeats may be strictly conserved through evolution, as revealed by comparisons of human, mouse, rat, chicken and dog genomes [9, 10]. Complex mechanisms such as chromosome segment duplications, or even whole genome duplications, are thought to occur, explaining genome evolution [11, 12]. Converging studies of human and other genomes have also revealed that variations in the number of occurrences of particular repeats may be an important factor responsible for diseases such as diabetes, epilepsy, fragile-X mental retardation and myotonic dystrophy [13, 14].
From a technical point of view, repeats are the source of many difficulties encountered in assembling or comparing sequences, requiring their extraction from these sequences. For these and other reasons, the analysis of repetitive sequences is an essential step in genome assembly, annotation and analysis.
At the core of life information, there exists an outstanding opportunity to analyse the genomic structure by deciphering its content in repeated sequences. The exhaustive analysis of 360 published complete sequences from Archaea, Bacteria and Eukarya genomes (data from the Genome OnLine Database) has revealed that most of them, especially in Eukarya, have a genomic content consisting of large proportions of repeats. Revealing the structure of sequences as an assembly of elementary repeated sequences is thus a task of utmost importance.
An important goal in computational molecular biology is therefore finding repeats of biological interest, i.e. repeats that have a role in genome structure and function. Practical libraries of repeats have been established in an attempt to collect prototypical sequences and group them into families, either for a large set of genomes or for a particular species [16–19]. To achieve this goal using computational methods, the problem consists in giving a precise definition of a "repeat". In the biological literature, three main classes of repeats are proposed: tandem repeats (consecutive copies of patterns), duplicated segments (which include genes and chromosome segment duplications) and interspersed repeats (which include transposons). Tandem repeats are thought to have originated by slippage of a replicated chromosome against its template. The patterns in tandem repeats are k-mers, k being generally less than 5 (micro-satellites), but sometimes far greater (up to several thousand base pairs long, with a total size that can represent several uninterrupted megabases). The number of repeats for a given satellite may differ between individuals. Therefore, they can be used for DNA fingerprinting or to provide information about paternity.
Microsatellites, also known as short tandem repeats (STR), have a repeat unit that is 2 to 10 bp long, with the entire repetitive region spanning less than 200 bp. Minisatellites are generally GC-rich repeats that range in length from 10 to over 100 bp with total length ranging from 1 kb to 20 kb. Duplicated segments are large intra- or interchromosomal DNA segments, ranging from 41 to 655 kb in size and likely to result from replication accidents. These events result in the duplication of gene clusters. Interspersed repeats or mobile elements are DNA sequences located in dispersed regions in a genome, produced by mechanisms such as DNA recombination. The gene pool of a species consists of DNA sequences in a network linked by gene conversion events. This type of repetitive sequence plays the role of uncoupling the network, thereby allowing new genes to evolve. In mammals, the most common mobile elements are LINEs for interchromosomal uncoupling (length ≃ 6–7 kb) and SINEs for intrachromosomal uncoupling (total length ≃ 300 bases). The first mobile elements were discovered by Barbara McClintock in the 1940s in studies on corn. Subsequently, they were found in all kinds of organisms. Classifications such as these provide a better understanding of the biological processes at work during genome evolution. But since they are based on current limited biological knowledge, these definitions introduce some bias in the type of repeats targeted by the analysis, and also introduce complexity in the algorithms used to locate them, especially when considering error-prone repeats. PILER represents the current state of the art in this respect, where four classes of biological repeats are defined. Classes TA (tandem array) and DF (dispersed family) correspond to the previously cited tandem repeats and interspersed repeats, respectively.
The other two classes are pseudosatellites (PS), which are clustered elements in the genome that are not tandem repeats, and terminal repeats (TR), which are copies of the same element located at the termini of a duplicated element.
A number of formal definitions have been proposed to capture the essence of observable repeats. A vast amount of literature covers this problem, and essentially three categories of formal repeats have been proposed: words, contiguous repeats and structured repeats. The first category tries to distinguish among repeated words those that include all other ones and are thus representative of the whole set of repeats. It mainly uses a maximization criterion, such as the longest repeats [21, 22] and maximal repeats [23, 24]. The second category introduces a basic model to achieve a closer approximation of observed repeats, since natural repeats in genomic sequences usually present many variations of close basic repeat units. Certain authors propose to look for trains of contiguous repeats such as tandem arrays (e.g., [25–27]), or pairs of repeats at a fixed distance (e.g., longest repeats with a block of don't cares, maximal pairs with bounded gap), or to introduce an edit distance or a similarity score to take into account local variations (e.g., k-mismatch repeats and approximate tandem repeats). Finally, the third category contains sophisticated repeat models that include all the previous notions and are designed to discover the complex word arrangements that occur with a minimum frequency. A structured motif consists of an ordered collection of p > 1 parts separated from one another by spacers, the length and distance between parts being bounded with given Min and Max values [32–34]. This kind of repeat seems of particular interest in studying non-coding sequences in gene expression and regulation.
Among these formal definitions, the notion of exact maximal repeat is quite attractive, since it is at the core of all others. It only focuses on sequences present in the two largest common blocks, with no possible extension to the right or left, and with no biological a priori. Maximal repeats have nice properties: they can be computed in linear time using a suffix-tree-based algorithm, their number is linear (at most n kinds of exact maximal repeats in a sequence of size n), and they can be used as basic blocks to compute error-prone repeats.
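The definition can be made concrete with a brute-force sketch. The function below is not the linear-time suffix-tree algorithm; it is a quadratic illustration suitable only for very short sequences, and the name `maximal_repeats` is hypothetical. A word is reported when it occurs at least twice and its occurrences are preceded by at least two different letters and followed by at least two different letters (a sequence boundary counts as a distinct letter), so that extending the word to the left or right always loses at least one occurrence.

```python
from collections import defaultdict


def maximal_repeats(s, min_len=1):
    """Return {word: [0-based start positions]} for every exact maximal repeat of s.

    Brute force: enumerates all substrings, so only usable on short inputs.
    """
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for word, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat
        # Letters found immediately left and right of each occurrence;
        # None stands for "outside the sequence".
        lefts = {s[i - 1] if i > 0 else None for i in starts}
        rights = {s[i + len(word)] if i + len(word) < n else None for i in starts}
        if len(lefts) > 1 and len(rights) > 1:
            result[word] = starts
    return result


print(maximal_repeats("xabcyabcz"))  # {'abc': [1, 5]}
```

In this example "ab" and "bc" are repeated as well, but each can be extended (to the right and to the left, respectively) without losing an occurrence, so only "abc" is maximal.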
Associated visualization techniques play a fundamental role in analysing these numerous repeats, and various kinds of tools displaying repeats at genome level have been proposed in the past few years. Among them are dotplots, landscapes, chaos games, percent identity plots, repeat graphs and BARD. Interpreting the views created using these tools is quite difficult, however, especially for large genomes, since most of them rely on displaying repeat pairs. They do not usually provide convenient zooming features to analyse regions of particular interest. Tools like dotplot, chaos game and BARD still can only be used on pairwise genome sequence alignments, and, because they only work at sequence level, become difficult to use as the sequence size and/or number of repeats increases. Moreover, they are not capable of summarizing the hierarchical organization of repetitive structures in a convenient way so that they can be interpreted by the end users.
This paper introduces the pyramid diagram, or pygram, designed to provide an abstract representation of the organization of repeated structures in genomic sequences. The theoretical foundation of pygrams is similar to sequence landscapes, which display all exact maximal repeats in a picture. But the pygram improves the original sequence landscape visualization in several ways. Aside from various practical improvements (two-strand display, zoom lenses), pygram offers several new features, including frequency visualization and multigenome repeat analysis. Most important, pygram visualization is closely associated with a query system designed to locate repeats that share specific properties. When combined, the query system and visual interface provide an efficient repeat browser that is useful for discovering unexpected structures in genomes.
Pyramid Diagram (Pygram) description
A pygram for a genome sequence S of length n is a two-dimensional plot where S and all its exact maximal repeats (eMR) are mapped along the x-axis. Given an x-axis magnifying factor k and a y-axis magnifying factor l, mapping is defined as follows: the i-th nucleotide of S is located at position (i/k, 0), and the eMR of size m located at position i within S corresponds to the interval [i/k, (i+m)/k] on the x-axis. This eMR is symbolized in the diagram by an isosceles triangle (a pyramid) of height δ·m/l built on that interval, where δ is either '+1' for an eMR located on the normal (N) strand of S, or '-1' for an eMR located on the reverse complement (RC) strand of S.
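As an illustration of this mapping, the following sketch computes the three vertices of the pyramid for one eMR occurrence. The function name and the strand labels passed as strings are assumptions made for the example; the apex is placed above the middle of the base because the triangle is isosceles.

```python
def pyramid_vertices(i, m, k=1.0, l=1.0, strand="N"):
    """Vertices (left base, apex, right base) of the pyramid for an eMR
    of size m located at position i, with magnifying factors k (x) and l (y)."""
    delta = 1 if strand == "N" else -1   # +1 for the N strand, -1 for the RC strand
    x_left = i / k
    x_right = (i + m) / k
    apex = ((x_left + x_right) / 2, delta * m / l)
    return [(x_left, 0.0), apex, (x_right, 0.0)]


# A 20-nucleotide repeat at position 100 on the RC strand, with k = 10 and l = 2.
print(pyramid_vertices(100, 20, k=10, l=2, strand="RC"))
# [(10.0, 0.0), (11.0, -10.0), (12.0, 0.0)]
```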
Since focus will be placed on eMRs in the rest of this paper, and most of the presentation does not depend on the kind of repeat used, the simpler term "repeat" will be used instead of "exact maximal repeat". It is first important to emphasize three basic facts about managing two DNA strands to avoid confusion in the interpretation of results.
A single coordinate system is used for both strands, i.e. all repeat coordinates, whether they are located on N or RC strands, are computed relative to the N strand.
The word on the reverse complement strand must be read as usual in the reverse direction. On the pygram each pyramid has an associated colour computed from the corresponding eMR sequence, ensuring that each repeat has its own specific colour. Consequently, all occurrences of the same repeat will have the same colour on both strands.
The definition of an eMR is symmetrical with regard to the N and RC strands: if a word w is an eMR, then the reverse complement of w is also an eMR. The display of an eMR along one strand is always mirrored by an eMR of the same size on the other strand.
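These three conventions can be illustrated with a short sketch. The helper names are hypothetical and positions are 0-based here, which is an assumption of the example and not necessarily the convention of the Pygram tools. A word found on the reverse-complement string is converted back to N-strand coordinates, so that both strands share one coordinate system.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA word."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_n_strand(seq_len, rc_start, size):
    """Convert a start position on the reverse-complement string
    into the start of the same interval on the N strand."""
    return seq_len - rc_start - size


s = "GGACCTTT"
w = "ACC"                       # occurs at position 2 on the N strand
rc_s = reverse_complement(s)    # "AAAGGTCC"
rc_w = reverse_complement(w)    # "GGT"
rc_start = rc_s.find(rc_w)      # position on the RC string
print(to_n_strand(len(s), rc_start, len(w)))  # 2
```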
Since the basic idea behind the pygram is to display all exact maximal repeats, pygrams may be considered as a rational reconstruction of landscapes, fully characterizing the structure that is displayed without requiring the computation of intermediate repeats. Landscapes do indeed display maximal repeats, where the scope of the right triangles is such that increasing the corresponding subword to the left or right removes at least one occurrence of the repeat in the extended subword. This provides a precise definition of maximal repeats.
Producing pygrams and browsing repeats
The first step in creating a pygram consists in producing the complete set of repeats. Since the repeat structures are to be analysed either within a single sequence or across several sequences, Gusfield's eMR detection algorithm was implemented on a generalized suffix tree (see Methods).
The second step in constructing a pygram consists in creating an indexed representation of the complete set of eMR occurrences. Indexing aims to order repeats along the sequences, so that pygrams can be created efficiently. Indexing also improves browsing speed when checking specific repeat properties, such as frequency, size, location (normal vs. reverse complement strand), and conservation between two or more sequences. This close association between repeat visualization and querying provides an efficient browsing function for in-depth analysis of repeat organization at various levels, from the highest level (i.e. the complete sequence) to the lowest level (a single nucleotide).
The following discussion illustrates the browser capabilities through two case studies. The first shows how to detect and analyse Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR). The second presents an analysis of the horizontal transfer of DNA sequences between two genomes.
Visual analysis of repeat organization
To further investigate these frequently repeated regions, a pygram with two zoom lenses is presented in Figure 2C (the Methods section gives a detailed description of the implementation and graphical features of the pygram viewer). The lenses magnify two frequently repeated regions previously identified on the complete genome pygram, where numerous repeats appear to be organized in arrays of spaced tandems. Zooming in further (Figure 2D) reveals that the two regions share a similar structure. The tandem repeat in the first lens is preceded by two large repeats, symbolized by pink and red triangles on the N strand. Identical large repeats can be observed on the opposite strand, downstream of the tandem repeat displayed in the second lens. Instances of a 25-nucleotide-long repeat appear regularly spaced, consecutive occurrences being separated by non-repetitive sequences. This type of structure has already been observed in this genome and is known as CRISPR.
CRISPRs are a very peculiar family of repeated sequences found in Archaea and Bacteria genomes. Their remarkably constant structure consists of short sequences from 21 to 37 nucleotides long, repeated almost exactly, and referred to as 'units', separated by similarly sized non-repetitive sequences, called 'spacers' (Figure 2D). In most species with two or more CRISPR loci, these loci are flanked on one side by a common leader sequence of 300–500 nucleotides. In Figure 2D, the leader sequence is delineated by the previously mentioned pink and red large repeats.
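The unit/spacer structure suggests a simple way to screen the occurrences of one repeat for CRISPR-like arrays. The sketch below is not the query used by the authors: the function name, the spacer bounds and the minimum number of units are illustrative assumptions. It groups consecutive occurrences of a unit whenever the non-repeated gap between them falls within the given bounds.

```python
def find_spaced_arrays(starts, unit_len, min_spacer=20, max_spacer=50, min_units=3):
    """Group occurrence start positions into arrays of regularly spaced units.

    starts    : start positions of one repeat (any order)
    unit_len  : length of the repeated unit
    Returns a list of arrays, each array being a list of start positions.
    """
    arrays, current = [], []
    for pos in sorted(starts):
        # Spacer = gap between the end of the previous unit and this start.
        if current and min_spacer <= pos - (current[-1] + unit_len) <= max_spacer:
            current.append(pos)
        else:
            if len(current) >= min_units:
                arrays.append(current)
            current = [pos]
    if len(current) >= min_units:
        arrays.append(current)
    return arrays


# Hypothetical positions of a 25-nucleotide unit.
positions = [100, 160, 221, 280, 5000, 9000, 9062, 9125]
print(find_spaced_arrays(positions, unit_len=25))
# [[100, 160, 221, 280], [9000, 9062, 9125]]
```

The isolated occurrence at 5000 is discarded because it has no neighbour at a spacer-sized distance. Candidates found this way would still need visual inspection, since unrelated repeats can show the same spacing.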
Each CRISPR unit appears as a group of co-occurring repeats that differ by only a few nucleotides. This is the result observed when a sequence, here the CRISPR unit, is repeated several times with point mutations. If a maximal repeat occurrence mutates at some point, it results in two included maximal repeats overlapping at the point of mutation: if aubvc is an eMR, a, b and c being letters and u and v two words, then a mutation from b to d leads to eMRs aud and dvc. Visualizing all eMRs detected on both strands of a complete genome can therefore be used to identify error-prone repeated sequences.
Querying the eMR index to locate exceptional repeats
Another way to target the presence of specific repeated structures consists in querying the eMR index file, then interpreting query results on a pygram. Such queries can be simple, consisting of searching for the most frequent repeats, or complex, as in the case of searching for specific repeat patterns. In the case of S. solfataricus, querying the index to answer the first question returns a 25-nucleotide eMR repeated 151 times in the complete genome, 103 occurrences being located on the N strand and 48 on the RC strand. This information can be drawn directly from pygrams (Figures 2B to 2D), where the specific eMR is highlighted in red on the centre frequency line. These pygrams immediately show that the most repeated eMR is located exclusively within two different CRISPRs, forming the most conserved element of the repeated units (Figure 2D).
The pygrams presented in Figure 3 show that some CRISPRs have a similar leader sequence. The left side of Figure 3A has two large repeats (pink and red) that correspond to the right side of Figure 3B, but on the opposite strand. Likewise, repeats at the end of Figure 3C (orange and green) match those at the beginning of Figure 3D. Considering the visual properties of these pictures, it is worth noting that pygrams using a logarithmic y-axis (Figure 3) render the CRISPR structure better than linear pygrams magnified using the zoom lens (Figure 2).
Among the various CRISPRs displayed in Figure 3, two are questionable (Figures 3F and 3G), since they do not fit the repetitive structure observed in Figures 3A to 3E. The CRISPR in Figure 3F is quite short, with its seven repeated units, and there is no detectable leader sequence. However, querying the index file to retrieve all eMR occurrences forming the units in this CRISPR reveals that these repeats are also located within the CRISPR units from Figure 3E, and nowhere else on the genome. Therefore, the CRISPR in Figure 3F should be a real one, even if it is quite short. The relationship between CRISPRs from Figures 3E and 3F remains unclear.
The structure presented in Figure 3G is an example of what could be a false positive reported when querying the index file using the above-mentioned CRISPR model. Even if some repeats are organized like CRISPR units, the overall structure is repeated, as revealed by the large brown trapezoid on the N strand, and the eMR forming that structure cannot be found anywhere else in the genome. This example illustrates the advantages of using pygrams to visually interpret the results of a computational method that predicts the presence of specific patterns of repeated structures.
Analysing repeats across two genome sequences
It was recently reported that S. solfataricus CRISPRs contained foreign genetic elements from the SIRV1 virus. The authors have suggested that these particular CRISPRs, which contain SIRV1 foreign DNA, could be involved in the known immunity of S. solfataricus against SIRV viruses.
These observations are consistent with data reported by Mojica et al., although the present study failed to locate one out of the six known SIRV1 sub-sequences integrated in S. solfataricus as reported by these authors. This can be explained by the fact that the study only covered recognition of repeats containing 20 nucleotides or more, whereas Mojica et al. used BLAST, which is capable of recognizing shorter sequences. The pygram method may still be used to locate this particular sub-sequence, however, by lowering the repeat recognition size to 10 nucleotides.
Comparison with existing visualization methods
We generated a dotplot, percent identity plot (PIP), repeat graph and pygram from the same sequence to compare these visualization techniques in studying repeated sub-sequence organization within genome sequences. The sequence analysed here is the 2.83 Mb genome of S. solfataricus.
The pyramid diagram (or pygram) is a new visualization method that aims to summarize the complex hierarchical organization of repetitive sequence structures for either a single genomic sequence or across several sequences.
In contrast to similar existing tools, the pygram is not based on repeat pair display, and provides convenient graphical functions such as two-strand visualization, repeat frequency display, a zoom feature, repeat selection and annotation display. It therefore produces a better view of repeated sequences at all levels, from the complete genome sequence down to the nucleotide. Moreover, closely associating a viewer and a querying tool results in an efficient repeat browser, as illustrated in the examples on CRISPR investigation and DNA transfers in Archaea genomes.
The prototype developed uses a generalized suffix tree to produce eMRs. It achieves good linear performance (see Methods) with respect to the sequence size and the number of eMR occurrences to be handled, but the current application is limited to genome sequences containing no more than 50 million nucleotides on a computer with 4 GB of RAM. During development of the pygram browser, however, the system that identifies the repeated sequences (in the present case, eMRs) was separated from the browser engine. This feature opens the pygram browsing infrastructure to other repeat models, in particular error-prone ones. In this way, Pygram could be used to perform the difficult job of analysing divergent sequences, a particularly crucial task in comparative genomics.
Implementation and performance of the Pygram application
Maxgen is an ANSI C software package that implements Gusfield's eMR detection algorithm on a generalized suffix tree (GST). This algorithm is capable of locating all eMRs in linear time and space, with respect to the sequence size, and presents the advantage of inserting the normal and reverse-complement sequences of each genome in a unique suffix tree. Maxgen proceeds in two steps. First, it analyses all real internal nodes of the GST, detecting all words that are eMRs. It then collects all occurrences of each eMR. The overall process runs at a rate of ~46 kbases/s, and the program uses an average of 17 bytes per sequence letter, which is slightly more than the highly space-efficient standard suffix tree application created by Kurtz. Additional byte capacity is required to handle several sequences in a single suffix tree and detect eMRs. After running this software on a set of FASTA-format sequences, a text file is created containing all lexicographically ordered eMRs. Each line of this file represents a single eMR, along with all its positions in the analysed sequence(s).
PyramidIndexator is a Java program that converts the text file generated by Maxgen into two binary index files. The first index file stores an object representation of each eMR, all eMRs being ordered lexicographically. The data file uses 36 bytes per eMR (these bytes are used to store primary key and repeat size, type and colour), in addition to one byte per character to store the repeat sequence, and four bytes per occurrence to store the positions. The second index file stores an object representation of each eMR for visualization purposes, all repeats being ordered by sequence position. The visualization file uses 17 bytes per repeat occurrence (to store primary key and repeat size and colour). Each of these binary files is associated with an index file to speed up data access. PyramidIndexator creates both index files in linear time and space with respect to the number of occurrences at a rate of ~310 k eMR occurrences/s. Once the indexes have been created, pygrams can be created and index files queried on line. The current version of the index files requires a significant amount of memory to store each eMR, since the data (primary key, repeat type, colour, position, etc.) are all encoded using the standard Java integer and colour classes. A future version of PyramidIndexator will optimize capacity requirements, an important issue for Eukarya genome visualization.
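The byte counts quoted above allow a rough estimate of the size of both binary files. The sketch below only applies those figures; the function name and the input format (a list of pairs giving each eMR word and its number of occurrences) are assumptions, and the auxiliary index files used to speed up access are not counted.

```python
def estimate_file_sizes(emrs):
    """Estimate the data file and visualization file sizes in bytes.

    emrs: iterable of (word, number_of_occurrences) pairs.
    Data file: 36 bytes per eMR + 1 byte per character + 4 bytes per occurrence.
    Visualization file: 17 bytes per occurrence.
    """
    data_bytes = 0
    visualization_bytes = 0
    for word, n_occ in emrs:
        data_bytes += 36 + len(word) + 4 * n_occ
        visualization_bytes += 17 * n_occ
    return data_bytes, visualization_bytes


# Two hypothetical eMRs: a 4-letter word seen twice and a 10-letter word seen 3 times.
print(estimate_file_sizes([("ACGT", 2), ("AAAAAAAAAA", 3)]))  # (106, 85)
```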
The colour scheme is computed using the sequence of each eMR: from each individual sequence, a hashcode is computed (using the standard Java API) which is in turn converted to RGB values.
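A minimal sketch of such a scheme is given below, written in Python for consistency with the other examples. It reproduces the documented `String.hashCode()` formula of the Java API (s[0]·31^(n-1) + ... + s[n-1] in 32-bit arithmetic); how Pygram turns the hash code into RGB values is not specified here, so using the low 24 bits as three 8-bit channels is an assumption.

```python
def java_string_hashcode(s):
    """Python equivalent of Java's String.hashCode() for ASCII strings."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF      # keep 32 bits, as Java int does
    return h - 0x100000000 if h >= 0x80000000 else h  # reinterpret as signed


def colour_for(sequence):
    """Map a repeat sequence to an (R, G, B) triple (assumed: low 24 bits of the hash)."""
    h = java_string_hashcode(sequence) & 0xFFFFFF
    return (h >> 16) & 0xFF, (h >> 8) & 0xFF, h & 0xFF


print(java_string_hashcode("ab"))  # 3105
print(colour_for("ab"))            # (0, 12, 33)
```

Because the colour depends only on the word, every occurrence of the same repeat receives the same colour. Distinct words can still collide on the same colour, since there are far more possible words than RGB values.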
PyramidImage is a Java application capable of creating pygram pictures. This program provides the visualization infrastructure necessary to explore repeated structures at various levels of magnification, from the highest level (i.e. the complete sequence) to the lowest level (a single nucleotide). This is achieved using a contextual zoom tool associated with a global viewer.
PyramidImage creates a pygram using the visualization index file generated by PyramidIndexator: the index file is scanned so that the larger repeats are displayed before the smaller ones. In this way, the larger pyramids do not hide the smaller ones. To display each repeat, pyramids are produced in increasing sequence position order. PyramidImage runs in linear time and space at a rate of ~165 k eMR occurrences/s.
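One way to read this drawing order is as a painter's algorithm: larger pyramids are painted first so that smaller ones drawn later stay visible on top, and pyramids of equal size are taken in increasing position. The sketch below encodes that reading; it is an interpretation of the description, not the PyramidImage code, and the tuple layout is an assumption.

```python
def drawing_order(occurrences):
    """Order (start, size) occurrences: larger repeats first, then by position."""
    return sorted(occurrences, key=lambda occ: (-occ[1], occ[0]))


print(drawing_order([(50, 5), (10, 30), (12, 5), (0, 30)]))
# [(0, 30), (10, 30), (12, 5), (50, 5)]
```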
PyramidImage receives input from two files. The first is the visualization binary file created by PyramidIndexator. The second is an optional text file, referred to as a pygram descriptor, that contains drawing and filtering parameters. If no pygram descriptor is provided, PyramidImage creates a pygram for the entire genome sequence, displaying all eMRs reported in the index file. On the other hand, if a descriptor is provided, these parameters can be controlled to produce pygrams with various layouts (see examples in Figure 2). The drawing parameters include:
region of the sequence to display,
standard or logarithmic pygram,
number of lenses to produce,
eMR to highlight.
For each lens, the descriptor can be used to specify the location of the lens within the sequence, x- and y-axis magnifying factors, the sequence coordinate ruler, and whether or not to display sequence letters. The filtering parameters for eMRs include:
PyramidImage can also display annotations on the pygram. This feature can be used to display either known genome annotations or user-defined ones. Figure 2D presents a full-featured example of pygram visualization functionality: two different regions of the same sequence are presented side-by-side, along with a zoom lens, several features underlying a genomic structure of interest and a selected eMR. The DNA sequence is also displayed inside the lens, and a y-axis magnifying factor is applied to achieve better structure magnification.
PyramidBrowser is Java software designed to query the binary files created by PyramidIndexator. This tool can be used to select specific eMRs according to their size, number of occurrences and location on the sequence. The information from PyramidBrowser can be entered as filtering parameters in the pygram descriptor.
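The kind of selection described here can be sketched on an in-memory list of records. Everything in the example is hypothetical (the `EmrRecord` class, its fields and the function names); it does not reflect the binary index format or the PyramidBrowser interface, and only illustrates filtering by size, number of occurrences and location.

```python
from dataclasses import dataclass, field


@dataclass
class EmrRecord:
    word: str
    n_starts: list = field(default_factory=list)   # occurrences on the N strand
    rc_starts: list = field(default_factory=list)  # occurrences on the RC strand

    @property
    def occurrences(self):
        return len(self.n_starts) + len(self.rc_starts)


def select(records, min_size=0, max_size=None, min_occurrences=2, region=None):
    """Keep eMRs matching size and frequency bounds, optionally with at
    least one occurrence lying entirely inside region = (start, end)."""
    selected = []
    for r in records:
        size = len(r.word)
        if size < min_size or (max_size is not None and size > max_size):
            continue
        if r.occurrences < min_occurrences:
            continue
        if region is not None:
            lo, hi = region
            if not any(lo <= p and p + size <= hi for p in r.n_starts + r.rc_starts):
                continue
        selected.append(r)
    return selected


def most_frequent(records):
    """eMR with the largest number of occurrences over both strands."""
    return max(records, key=lambda r: r.occurrences)


records = [
    EmrRecord("ACGTACGTAC", [10, 500], [900]),
    EmrRecord("GGG", [1, 2, 3, 4], []),
    EmrRecord("T" * 25, [100, 200], [300]),
]
print(most_frequent(records).word)                                  # GGG
print([r.word for r in select(records, min_size=10, region=(0, 50))])  # ['ACGTACGTAC']
```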
Size of the eMR occurrence index file
Creating an index file for all observed instances of an eMR in a genome sequence can be computationally expensive, since the number of occurrences (number of locations) is not linear with respect to the number of eMR types (number of different words) observed in a sequence. For instance, the sequence $CA^nGA^nT$ contains exactly n maximal repeats (different words), namely $A^k$ ($k = 1, \ldots, n$), and the number of occurrences of all these repeats is $\sum_{k=1}^{n} 2(n-k+1) = n(n+1)$.
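The quadratic growth in this example can be checked by direct counting. In the sketch below (function names are illustrative), each word made of k letters A occurs n-k+1 times in each of the two runs of A, which gives n(n+1) occurrences in total for only n different words.

```python
def count_overlapping(text, word):
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(1 for i in range(len(text) - len(word) + 1) if text.startswith(word, i))


def total_occurrences(n):
    """Total occurrences of the words A, AA, ..., A^n in the sequence C A^n G A^n T."""
    s = "C" + "A" * n + "G" + "A" * n + "T"
    return sum(count_overlapping(s, "A" * k) for k in range(1, n + 1))


for n in (1, 3, 10):
    print(n, total_occurrences(n), n * (n + 1))
# 1 2 2
# 3 12 12
# 10 110 110
```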
Since the task covers millions of different words, quadratic behaviour such as this is computationally intractable. Experiments were therefore conducted on several genomes to study the practical impact of the relationship between the number of eMR occurrences and the number of eMR types. Two scenarios were tested empirically: a linear relationship and a quadratic relationship. The ratio for each case was computed as:
the ratio between eMR occurrences and the number of eMR types (a) and,
the ratio between the square root of the number of eMR occurrences and the number of eMR types (b).
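The two ratios can be written as a small function. The formula used for (b), sqrt(Nbocc)/NbMR, is an assumption: it is the form that agrees with the behaviour reported for the [b] curves, which tend to sqrt(2/NbMR) when every maximal repeat has two occurrences and end at sqrt(2) for a single maximal repeat with two occurrences. The names `nb_occ` and `nb_mr` mirror the Nbocc and NbMR notation.

```python
import math


def ratios(nb_occ, nb_mr):
    """Return (a, b) for nb_occ eMR occurrences and nb_mr eMR types.

    a tests a linear relationship    : nb_occ / nb_mr
    b tests a quadratic relationship : sqrt(nb_occ) / nb_mr  (assumed form)
    """
    a = nb_occ / nb_mr
    b = math.sqrt(nb_occ) / nb_mr
    return a, b


# One maximal repeat with two occurrences: b equals sqrt(2).
print(ratios(2, 1))
# Every repeat occurring twice: b equals sqrt(2 / nb_mr).
print(ratios(100, 50)[1], math.sqrt(2 / 50))
```

A flat (a) curve therefore indicates that occurrences grow linearly with the number of eMR types, while a flat (b) curve indicates quadratic growth.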
The ratio trends are remarkably similar in all cases.
First of all, the left side of the [a] curves and [b] curves is the same for the normal and shuffled versions of the genomes. This tends to show that short maximal repeats occur at a frequency that depends only on the sequence structure. In contrast, long maximal repeats occur more often in normal genomes than in their shuffled counterpart. Furthermore, they continue to occur at an almost constant rate for quite large word sizes, whereas the maximum size of maximal repeats remains less than 25 nucleotides in random sequences of the same length. Note that the final behaviour of [b] curves simply follows sqrt(2/NbMR) since Nbocc is proportional to NbMR: [c] curves clearly show this fact. The last value is sqrt(2), corresponding to one MR with two occurrences.
The second important observation is that the overall trend for the number of occurrences is quadratic, as expected for very short words (the left side of the [b] curves is flat), then decreases rapidly and becomes almost linear for long words (the right side of the [a] curves is almost flat). The number of occurrences of significant maximal repeats (those that can be distinguished from randomly occurring ones) therefore remains comparable with the number of maximal repeat types, which means that a systematic analysis of these eMRs may reasonably be attempted along a genome.
Pygram tools (precompiled binaries, documented source code and user manuals) are distributed under the CeCILL (CEA-CNRS-INRIA Logiciel Libre) free software license and are available at http://www.irisa.fr/symbiose/projets/Modulome/.
This work is supported by a grant from the French Agence Nationale de la Recherche (Modulome project), and uses the bioinformatics platform from Ouest-Genopole. We would like to thank François Coste, Mathieu Giraud, Dominique Lavenier and Anne Siegel for their helpful discussions regarding the pygram visualization system, and Marc Le Romancer for his help with biological data analysis.
- Blount D, Grogan D: New insertion sequences of Sulfolobus: New functional properties and implications for genome evolution in hyperthermophilic archaea. Mol Microbiol 2005, 55: 312–25.
- Mojica FJ, Díez-Villaseñor C, García-Martínez J, Soria E: Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 2005, 60(2):174–182.
- Achaz G, Rocha EP, Netter P, Coissac É: Origin and fate of repeats in bacteria. Nucleic Acids Res 2002, 30(13):2987–94.
- Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 2005, 151: 653–63.
- Charlesworth B, Sniegowski P, Stephan W: The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 1994, 371(6494):215–220.
- Achaz G, Coissac É, Viari A, Netter P: Analysis of intrachromosomal duplications in yeast Saccharomyces cerevisiae: a possible model for their origin. Mol Biol Evol 2000, 17(8):1268–75.
- Friedman R, Hughes AL: Gene duplication and the structure of eukaryotic genomes. Genome Res 2001, 11(3):373–81.
- Kazazian HH: Mobile elements: drivers of genome evolution. Science 2004, 303(5664):1626–1632.
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D: Ultraconserved elements in the human genome. Science 2004, 304(5675):1321–1325.
- Bejerano G, Lowe CB, Ahituv N, King B, Siepel A, Salama SR, Rubin EM, Kent WJ, Haussler D: A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 2006, 441(7089):87–90.
- Taylor JS, Braasch I, Frickey T, Meyer A, Van de Peer Y: Genome duplication, a trait shared by 22,000 species of ray-finned fish. Genome Res 2003, 13(3):382–390.
- Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, de Montigny J, Marck C, Neuvéglise C, Talla E, Goffard N, Frangeul L, Aigle M, Anthouard V, Babour A, Barbe V, Barnay S, Blanchin S, Beckerich JM, Beyne E, Bleykasten C, Boisramé A, Boyer J, Cattolico L, Confalonieri F, de Daruvar A, Despons L, Fabre E, Fairhead C, Ferry-Dumazet H, Groppi A, Hantraye F, Hennequin C, Jauniaux N, Joyet P, Kachouri R, Kerrest A, Koszul R, Lemaire M, Lesur I, Ma L, Muller H, Nicaud JM, Nikolski M, Oztas S, Ozier-Kalogeropoulos O, Pellenz S, Potier S, Richard GF, Straub ML, Suleau A, Swennen D, Tekaia F, Wésolowski-Louvel M, Westhof É, Wirth B, Zeniou-Meyer M, Zivanovic I, Bolotin-Fukuhara M, Thierry A, Bouchier C, Caudron B, Scarpelli C, Gaillardin C, Weissenbach J, Wincker P, Souciet JL: Genome evolution in yeasts. Nature 2004, 430: 35–44.
- Rubinsztein DC, Leggo J, Coetzee GA, Irvine RA, Buckley M, Ferguson-Smith MA: Sequence variation and size ranges of CAG repeats in the Machado-Joseph disease, spinocerebellar ataxia type 1 and androgen receptor genes. Hum Mol Genet 1995, 4(9):1585–1590.
- Dubrova YE, Nesterov VN, Krouchinsky NG, Ostapenko VA, Vergnaud G, Giraudeau F, Buard J, Jeffreys AJ: Further evidence for elevated human minisatellite mutation rate in Belarus eight years after the Chernobyl accident. Mutat Res 1997, 381(2):267–278.
- The Genome OnLine Database [http://www.genomesonline.org/]
- Jurka J: Repeats in genomic DNA: mining and meaning. Curr Opin Struct Biol 1998, 8: 333–337.
- Jurka J, Kapitonov V, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research 2005, 110: 462–467.
- Ruitberg CM, Reeder DJ, Butler JM: STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Research 2001, 29: 320–322.
- Blenda A, Scheffler J, Scheffler B, Palmer M, Lacape JM, Yu JZ, Jesudurai C, Jung S, Muthukumar S, Yellambalase P, Ficklin S, Staton M, Eshelman R, Ulloa M, Saha S, Burr B, Liu S, Zhang T, Fang D, Pepper A, Kumpatla S, Jacobs J, Tomkins J, Cantrell R, Main D: CMD: a Cotton Microsatellite Database resource for Gossypium genomics. BMC Genomics 2006, 7: 132.
- Edgar RC, Myers EW: PILER: identification and classification of genomic repeats. Bioinformatics 2005, 21(suppl 1):i152–158.
- Karp RM, Miller RE, Rosenberg AL: Rapid identification of repeated patterns in strings, trees and arrays. In STOC '72: Proceedings of the fourth annual ACM symposium on Theory of computing. New York, NY, USA: ACM Press; 1972:125–136.
- Lefebvre A, Lecroq T, Alexandre J: An Improved Algorithm for Finding Longest Repeats with a Modified Factor Oracle. Journal of Automata, Languages and Combinatorics 2003, 8(4):647–657.
- Gusfield D: Algorithms on strings, trees, and sequences. Cambridge University Press; 1997.
- Kolpakov R, Kucherov G: Finding Maximal Repetitions in a Word in Linear Time. In Proceedings of the 40th IEEE Annual Symposium on Foundations of Computer Science. New York: IEEE Computer Society Press; 1999:596–604. [citeseer.ist.psu.edu/kolpakov99finding.html]
- Sagot MF, Myers EW: Identifying Satellites and Periodic Repetitions in Biological Sequences. Journal of Computational Biology 1998, 5(3):539–554.
- Stoye J, Gusfield D: Simple and flexible detection of contiguous repeats using a suffix tree. Theor Comput Sci 2002, 270(1–2):843–856. [http://dx.doi.org/10.1016/S0304-3975(01)00121-9]
- Boeva V, Regnier M, Papatsenko D, Makeev V: Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 2006, 22(6):676–684. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/6/676]
- Crochemore M, Iliopoulos CS, Mohamed M, Sagot MF: Longest repeats with a block of don't cares. LATIN 2004, 271–278.
- Brodal GS, Lyngsø RB, Pedersen CS, Stoye J: Finding Maximal Pairs with Bounded Gap. CPM 1999, 134–149. [http://link.springer.de/link/service/series/0558/bibs/1645/16450134.htm]
- Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res 2001, 29(22):4633–4642.
- Wexler Y, Yakhini Z, Kashi Y, Geiger D: Finding approximate tandem repeats in genomic sequences. RECOMB 2004, 223–232.
- Marsan L, Sagot MF: Extracting structured motifs using a suffix tree – Algorithms and application to consensus identification. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB). Edited by: Minoru S, Shamir R. Tokyo, Japan: ACM Press; 2000:210–219. [citeseer.ist.psu.edu/marsan00extracting.html]
- Iliopoulos CS, McHugh JM, Peterlongo P, Pisanti N, Rytter W, Sagot MF: A First Approach to Finding Common Motifs With Gaps. Stringology 2004, 88–97. [http://psc.felk.cvut.cz/event/2004/p8.html]
- Morgante M, Policriti A, Vitacolonna N, Zuccolo A: Structured Motifs Search. J Comput Biol 2005, 12(8):1065–1082. [http://www.liebertonline.com/doi/abs/10.1089/cmb.2005.12.1065]
- Gibbs AJ, McIntyre GA: The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur J Biochem 1970, 16: 1–11.
- Clift B, Haussler D, McConnell R, Schneider TD, Stormo GD: Sequence landscapes. Nucl Acids Res 1986, 14: 141–158.
- Jeffrey HT: Chaos game representation of gene structure. Nucleic Acids Res 1990, 18(8):2163–70.
- Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, Miller W: PipMaker – a web server for aligning two genomic DNA sequences. Genome Res 2000, 10(4):577–586.
- Spell R, Brady R, Dietrich F: BARD: A visualization tool for biological sequence analysis. INFOVIS 2003.
- Jansen R, Van Embden JDA, Gaastra W, Schouls LM: Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 2002, 43(6):1565–1575.
- She Q, Brügger K, Chen L: Archaeal integrative genetic elements and their impact on genome evolution. Res Microbiol 2002, 153(6):325–332.
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16(6):276–277.
- Kurtz S, Schleiermacher C: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 1999, 15(5):426–427.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.