Detecting long tandem duplications in genomic sequences
© Audemard et al.; licensee BioMed Central Ltd. 2012
Received: 27 September 2011
Accepted: 24 March 2012
Published: 8 May 2012
Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication.
In this paper, we introduce ReD Tandem, a software using a flow based chaining algorithm targeted at detecting tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR,a we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non coding elements such as pseudo-genes or RNA genes.
ReD Tandem allows to identify large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein gene based which ignores duplications involving non coding regions. It is however inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also allow to improve existing annotations.
Gene duplication has long been recognized as a major driving force in evolution. Both the extent of gene duplications in genomes and the theoretical formalization describing the process by which duplicate genes may contribute to genetic novelty by neo-functionalization lead to an intense interest for the subject (reviewed in [1–3]). The recent discovery of a previously unexpected dynamic of gene family expansion and contraction observed in complete genome sequences has called new attention on the phenomenon of gene duplication [4, 5]. Moreover, recent studies of gene copy-number polymorphism in various organisms provide evidence of an ongoing mechanism of gene duplication and loss within species . The different studies underline that this “revolving door” of gene gain and loss largely contributes to intra and interspecific phenotypic variability [7–9] and is therefore likely to have played an important role in shaping phenotypic differences among species .
Analysis of the genomes of Arabidopsis, human, mouse and rat revealed that tandemly arrayed duplicates account from 10% to 20% of all genes [2, 10, 11]. In addition, the contribution of tandem duplication to gene duplicates ranges from one-third in mammals  to almost 70% in Caenorhabditis elegans, highlighting the predominant role that tandem duplication plays in gene duplication. Tandem duplication contributes also to the evolution of other classes of functional elements such as exons within genes  or RNA genes . In this respect, the detection of recent tandemly duplicated segments in complete genome sequences is a question of foremost interest.
Tandem duplication has been extensively studied at the protein coding gene level [11, 15–17] or at the much smaller scale of serial repeats (micro-satellites), based on local DNA similarities [18, 19].
All studies based on protein similarity analysis are naturally biased by the available genome annotation. In addition, such analyses automatically exclude duplicated segments with RNA genes or degenerated copies from the scope of the study. Despons and colleagues  have recently proposed an approach combining protein and DNA sequence comparison, enabling to detect degenerated paralogous copies, but the method still relies on an existing annotation and is additionally, as acknowledged by the authors, essentially limited to the analysis of compact genomes. Using DNA sequence comparison only, Eichler and colleagues [21, 22] have significantly contributed to the understanding of dynamics of duplication in primates by studying highly identical duplicated DNA fragments greater than 1Kb, termed segmental duplications. This latter work is however limited to the study of very recent duplications.
On the other side of the size spectrum, different algorithms have been devised to detect so-called serial repeats at the DNA level. Initially targeted at short (micro-satellite-like) repeats, these algorithms have been considerably improved, leading to tools such as TRF  or mreps  which are capable detecting short tandem repeats on whole genomes. But, as shown in our experiments, the underlying definition of a serial repeat (as a contiguously repeated string) is not suitable for detecting large duplications that may contain disrupted similarities and which, despite being close, are far from contiguous.
In this paper, we introduce ReD Tandem, a tandem duplication detection tool that works from the genomic DNA sequence of the considered organism. In order to identify tandem duplicated segments, we start from short similar regions (also called anchors) that have been detected by a fast whole genome self-alignment software. These anchors are then chained into larger (duplicated) segments, similarly to what is done in synteny or segmental duplication detection tools such as DAGchainer  or OSfinder , modified to account for the specific properties of tandem duplications. In the next step, we analyse these chains to find tandem regions and an associated duplication unit. This duplication unit is used as a seed to locate further tandem duplications defining what we call a Tandem Array (TA).
In the first section, we present the formal definition of anchors and chains and the algorithm that enables to detect tandemly duplicated regions and the associated duplication unit. We next apply our method on Arabidopsis thaliana and we show that a large number of Tandem Gene Arrays, that can be derived from a CDS based family analysis, are detected by ReD Tandem. We analyse how the detection sensitivity varies with the evolutionary distance between genes. Finally we discuss the ability of the agnostic approach of ReD Tandem to detect duplications of RNA genes, duplications families involving different functional categories such as protein-coding genes together with long non-coding RNAs, as well as duplications of unannotated regions.
Results and discussion
Tandem duplications typically include several copies of the same sequence. In the usual situation, these duplications have been obfuscated by evolution, leaving only local similarities. In this section we show how a specific chaining algorithm (called ReD) can reconstruct sets of duplicated regions that can be further analyzed to identify Tandem Arrays and associated duplication units. To test this approach, we apply it on the Arabidopsis thaliana genome and analyze its performance on characterized fraction of tandem coding gene duplications. Finally, we also explore the non coding fraction of the predicted tandem duplicated regions and show that ReD is also able to discover duplicated regions involving pseudo-genes, small or long RNA genes and other specific regions in the Arabidospis thaliana genome.
To properly identify tandem arrays and their associated duplication unit, we follow a multiple steps procedure which is succinctly described now and described in more detail in the “Methods” section.
In a first step, adjacent sequence similarities are identified. These local similarities are next chained to identify a set of pairwise duplicated regions that could belong to tandem duplications. In a third step, the resulting chains are used to identify regions that could define tandem arrays together with the corresponding duplication unit. In the final step, in each such region, this duplication unit is used as a seed to reconstruct the structure of the complete tandem array.
Similarly to what has been done for the reconstruction of homologous regions [23, 24] or whole-genome alignment , our aim is to reconstruct duplicated regions as consistently ordered sequences of close anchors. By consistent order, we mean that each of the two sequences of regions defined by the sequence of anchors is either increasing (on the forward strand) or decreasing (on the reverse strand). To characterize close anchors, we use a distance introduced in  and defined in the “Methods” section.
The usual approach to identify large pairwise duplicated regions is to build a graph whose vertices are the detected anchors and where a directed edge (a b) is created when a and b are consistently ordered and sufficiently close to each other. A score is affected to the nodes and edges, reflecting the alignment score and the physical distances between anchors, and a minimum cost (shortest) path in this graph defines the regions sought .
By repeatedly extracting shortest paths from this acyclic graph, denoted as G1, one obtains a set of predicted duplicated regions, called chains, with an overall cost defined as the sum of the costs of all its paths.
Enforcing conditions (1) and (2) is difficult, especially because condition (2) is a global condition on the set of predicted chains and not only on each chain. Instead of using the usual process of repeated extraction of shortest chains, we therefore shift to a more sophisticated minimum cost flow based algorithm that will be able to produce a set of k chains that satisfy the constraints above and minimize their overall cost.
To use flows, the previous graph G1 is transformed in a transportation network and the problem of identifying a set of chains representing duplicated regions is reformulated into a minimum cost flow problem (see Methods).
Compared to the usual iterated greedy shortest path approach, this approach is able to “reconsider” previous chains and reallocate an anchor that was previously used in a chain to a new chain if this is needed to get an overall optimal cost. It therefore has a global view on the set of predicted chains. By iterating until (1) and (2) are satisfied we guarantee that the set of predicted chains are “non-overlapping” chains.
Tandem array and duplication unit identification
In order to delineate tandem array regions from the previous set of chains, we build a second graph G2 whose vertices are the predicted chains and remaining anchors and where two vertices are connected iff the associated regions of the two vertices on the genome overlap sufficiently on one of their sub-regions, i.e. the two chains/anchors share at least one sub-region.
Every connected component of this graph collect regions that share paralogous relationship and defines therefore a predicted tandem array region. The actual predicted tandem region is obtained by extending the minimum and maximum coordinates inside the connected component by a small margin. Every chain inside the connected component is a duplication unit candidate and the longest of all minimum cost chains in the array is used to identify the reference duplication unit (see Methods).
In order to increase sensitivity, the previously identified duplication unit is used as the query sequence in a TBLASTX  search against the tandem array region. All the candidate regions that align with the duplication unit on a sufficient length are kept as additional occurrences of the duplication unit and define the output of the prediction by the global “ReD Tandem” approach.
The evaluation of the proposed method was performed using the Arabidopsis thaliana genome sequence as a test case. This genome and its internal gene duplications have been extensively studied providing an excellent standard for evaluation [28, 29]. We used NCBI build 9.1, preprocessed using the low complexity filter DUST (Tatusov and Lipman, unpublished; described in ), anchors are produced using our own genome wide aligner, glint (Faraut T, Courcelle E., unpublished), using standard alignment scoring scheme (match +1, mismatch −3, gap open/extend −5/-2). Our aim is to test if tandem duplications identified by similarities between protein sequences can be recovered by ReD Tandem using the DNA sequence only. The first step therefore involves the construction of a reference set of tandem gene arrays against which the tandem duplications detected by ReD will be compared.
Creating a reference set of tandem gene arrays
In order to construct a reference set of tandem gene arrays we proceed essentially like other published methods [10, 20, 31]. Considering the TAIR10 version of the Arabidopsis thaliana genome annotation, for each coding gene of length > 500 bp (matching the minimum length ℓ used in ReD Tandem), the longest annotated transcript is selected as the reference transcript. An all-against-all BLASTP comparison is conducted on the corresponding set of proteins. Two genes are considered to share a tandem paralogous relationship (resulting from an ancient tandem duplication) if they are less than T (150 kb) apart and exhibit a BLASTP hit with an e-value of at most 10−5 covering at least 70% of both sequences. These tandem paralogous relationships between genes are used in turn as anchors to create an overlap graph following the method described in “Methods”. Each connected component of this graph defines a tandem duplication array with genes as elementary duplication units. These connected components are essentially equivalent to the TGA defined in , with the difference that the notion of spacer genes between duplicated copies is replaced here by the physical distance threshold of T between copies to enable the comparison with our annotation-free approach.
These reference tandem duplication arrays will be called tandem gene array s (TGA), and the associated duplication unit tandem gene unit (TGU). Conversely, the regions and units detected by ReD Tandem from DNA alone will be respectively denoted as tandem arrays (TA), and associated tandem unit (TU).
In our analysis, we consider that a TGU is detected if it is overlapped on at least 70% of its length by a ReD TU. For TA and TGA, the criteria is more stringent and requires that the TGA is overlapped by more than 70% by the TA and that at least one of the TGU in the TGA is detected as a TU.
Scanning Arabidopsis thaliana genome
From 60,021 DNA anchors, ReD Tandem built 10,290 chains with a mean of 2.9 anchors per chain, underlining the importance of chaining here. These chains define 1,718 Tandem Array (TA) covering 28.8% of the A. thaliana genome, made up of 5,477 Tandem Unit (TU) covering 10.6% of the sequenced genome. This is consistent with the estimated 10% of tandem gene duplications in the Arabidopsis genome .
Comparison with the reference set: sensibility
Arabidopsis vs Arabidopsis
The results we have obtained on A. thaliana show that ReD Tandem, without relying on a predicted proteome, is able to correctly detect a large fraction of reference tandem duplicated genes provided they are sufficiently close from an evolutionary point of view (dS < 1).
The real added value of ReD Tandem is precisely its ability to perform its analyzes purely from DNA. Although it is restricted to “recent” duplications, ReD Tandem has the ability to identify duplicated regions which are implicitly censored by pure proteome based approaches, therefore helping to analyze the evolutionary history of the region. The only existing software that we know that provides related capabilities is  which uses BLASTX comparison in the immediate vicinity of every gene to identify possible pseudo-genes and gene relics. This approach, while still depending on an annotation, is, as acknowledged by the authors, essentially restricted to compact (bacterial or unicellular eukaryote) genomes. Because it relies on a direct comparison of the genome vs. itself that can be achieved using fast whole genome index based software, ReD Tandem is not restricted to the analysis of compact genomes. Serial repeat finders such as TRF or Mreps are also able to deal directly with DNA sequences, including large genomic sequences, but are instead restricted by the underlying definition of serial repeats (as contiguous repeats). To verify this, we applied both TRF (with default parameters) and Mreps (with a resolution of 50 allowing for maximum approximate matching) to the A. thaliana genome. TRF and Mreps identified respectively 35 and 26 serial repeats containing duplication units with a size above 500 bp (data not shown), compared to the 1,718 TA identified by ReD Tandem. By chaining local similarities, ReD Tandem is instead able to reconstruct large duplicated regions that may be interrupted by local loss of similarities.
Comparison with annotated elements
Arabidopsis vs Arabidopsis
Trans. Element gene
lncRNA and CDS
Orphan TUs and TAs
These examples illustrate the fact that ReD Tandem ability to predict tandem duplications extends beyond pure tandem protein gene arrays. Still, TUs remain which do not cover any annotated element in A. thaliana genome. Among the 5,573 TUs predicted by ReD, 1,438 are orphan TUs. This is expected since ReD Tandem is just targeted at detecting DNA level tandem duplications. However, when such orphan TUs appear in a TA, other TUs in the same TA may provide extra information. Some of these orphan TUs may be of interest for improving the existing genome annotation.
We note in addition that ReD Tandem is also able to detect intertwined TGAs and to separate the duplication units belonging to the different families. (Figures not shown, see http://narcisse.toulouse.inra.fr/ReDTandem/6.htmlath4-8005015-8046308 and http://narcisse.toulouse.inra.fr/ReDTandem/2.htmlath4-8031970-8049154)
Arabidopsis vs Arabidopsis
Mean size of TU (bp)
Mean nb of TU
This section gives just a short extract of all detected TAs. A full list of detected TAs, with associated TUs and TAIR10 annotation for the A. thaliana genome with direct links to the corresponding region on Gramene web site is available from http://narcisse.toulouse.inra.fr/ReDTandem.
Availability and requirements
The full packaged software from anchor detection to final TA/TU prediction is distributed under a CECILL open-source licence at http://narcisse.toulouse.inra.fr/ReDTandem. The software archive is also available as Additional file 1. You can either download a set of executable Linux 64 bits binaries wrapped in a Perl script or the set of sources. ReD Tandem is implemented in C++ and its execution time on Arabidopsis thaliana genome is around 4 hours on a single core computer. Its execution requires the availability of NCBI Blast and a Perl interpreter with the BioPerl package.
In this paper we have introduced ReD Tandem, which, in our knowledge, is the first software targeted at predicting large partially conserved tandem duplications directly from DNA. This allows ReD Tandem to work directly on unannotated genomes. The analysis of ReD Tandem output and examples show that a pure DNA based analysis of tandem duplications unveils a large variety of phenomenons that cannot be revealed by usual protein based analysis. This uncensored vision of tandem duplication should be of great interest to address specific questions on duplication driven genome evolution such as the evolutionary fate of duplicated segments regarding their functional content [2, 34].
From a pure evolutionary point of view, the Tandem Arrays and Tandem Units predicted by ReD Tandem and the usual protein gene based analysis [11, 15, 16, 38] complement each other nicely. While a protein gene based analysis allows to identify distant evolutionary relationships, it implicitly censors all non coding elements that may be involved in the evolutionary process (pseudo-genes, gene relics, RNA genes, CNVs…). Conversely, we have shown that ReD Tandem is able to reliably detect relatively recent tandem duplications (dS < 1 typically) and can uncover a variety of duplications involving coding and non coding regions (and potentially totally non functional regions). It is therefore useful even if a current genome annotation exists and may help identify spurious or missing annotated elements. More importantly, it offers unprecedented direct raw access to tandem duplicated regions, directly bringing to light a variety of situations that were inaccessible in protein gene based approaches.
The DNA sequence is modeled by a string S. A sub-string u of S s i …s j will be simply noted as the interval [i, j] with a start u s = i and an end u e = j. For two non-overlapping intervals, u and v, u < v iff u e < v s . If u < v we define .
A duplication copies a sub-string of S to a distinct location in S and a tandem duplication copies the original copy in its neighborhood. When the duplication is recent, the relationship between the original copy and the duplicate can be captured by a single sequence alignment.
Let a denote a local alignment (or anchor), between S and itself. a is a mapping between an interval and another interval . Using the traditional 2-dimensional representation of an alignment - with S associated with the x-axis as well as the y-axis of the 2-dimensional plane - a local alignment is a path on the plane with and being the corresponding projections on the two axis, 0 standing for the y-axis and 1 for the x-axis. As usual, the orientation, or sign, of an anchor a, a.sign, indicates if the two aligned regions lie on the same strand (+) or not (−). We note a.score the alignment score of the anchor a. As anchors, defined by alignments, have often imprecise boundaries, when we compare two anchors we assume they are reduced to their mid-point.
where T is a user defined threshold. We note the distance between a0 and a1, the two duplicated segments identified by the alignment a. Dot-plot examples of two contrasted real tandem arrays are illustrated in Figure 2.
When the duplication is a more ancient one, because of sequence divergence, the original region and the duplicate one can usually not be aligned on their full length. The identification of a duplication can be viewed as a special case of the genome alignment problem, a generalization of the sequence alignment problem. A common approach in genome alignment consists in chaining alignments . A chain of anchors c is simply a path connecting anchors in the plane. This path relates the interval c0 on S, the projection of c on the y-axis and the interval c1 on S, the projection of c on the x-axis.
The relation ≺ being a partial order on the set of anchors, it induces a directed acyclic graph where vertices are anchors and a directed edge (a i , a j ) appears iff a i ≺ a j . Any path in this graph is a chain of anchors. As a consequence of co-linearity (3), any chain inherits the shared sign of its anchors.
If furthermore the two intervals c0 and c1 defined by a chain c satisfy the properties (2), representing a candidate tandem duplication, we say that c is a t-chain. The purpose of the algorithm described below is to identify a set of t-chains in a sequence S and for each t-chain, identify the corresponding duplication unit, the region delineating the tandem array and the number of repetitions. Note that a t-chain can possibly be composed of a single anchor.
In theory, anchors could be identified using any local self-alignment software (such as YASS ) but existing software usually do not produce anchors satisfying property (2). We therefore adapted our own genome-wide alignment software (glint, Faraut T, Courcelle E., unpublished) to this specific requirement. Optionally, the sequence can be preprocessed to deal with low complexity regions (see the Results section).
Compared to the Euclidian or Manhattan distances on the dotplot plan, this distance tends to be smaller when the closest extremities of two anchors lie on the same diagonal (with the same distance between their two intervals).
Every vertex a, representing an anchor, is weighted by its rescaled alignment score. For each position i of the sequence S, we define the coverage of nucleotide s i , c(i), as the number of intervals a.0, a.1 containing the nucleotide s i .b For each anchor a, m a denotes the mean coverage of the associated intervals a0 and a1. The cost of vertex a in the anchor graph is defined by , favoring the selection of anchors whose regions participate in other anchors.
Every edge (a i , a j ) connecting two anchors is initially weighted by the previous “distance” d(a, b) between anchors. To keep our algorithm efficient, we keep only the k best edges leaving every vertex (based on edge cost, typically k = 15). To normalize costs, edge score are rescaled so that the mean edge score is equal to the absolute value of the mean anchor score.
This defines the first digraph G1.
which also implies that t-chains cannot share anchors. Therefore, finding a set of t-chains with an optimal global score that satisfies properties (2), (3) and (4) cannot be simply computed by iteratively predicting successive t-chains using a traditional weighted chaining method . Figure 3 shows a set of chains that would violate these properties.
To satisfy these properties, we transform the graph G1 in a transportation network  where edges and vertices are associated to unit capacity. All vertices are connected to a source and a sink with a unit capacity edge. Because vertices and edges have a unit capacity, any flow in this network defines a set of paths (chains) that, with guarantee, do not share any vertex (anchor).
In order to guarantee that this set of chains is a set of non overlapping t-chains satisfying a specific form of cost optimality, we use a variant of the successive shortest path algorithm for minimum cost maximum flow by Busaker and Gowen [40, 41]. At iteration i, this algorithm provides a minimum cost flow of value i. This flow defines a set of i chains which do not share anchors with a global minimum cost (maximum score) among all such flows. If the set of chains defined shows no overlapping, we may stop. Otherwise, we proceed to the next iteration. It is easy to prove that the algorithm will terminatec and therefore provides a set of t-chains C. This set has optimal cost among all sets of t-chains of same cardinality. These chains are the potential traces of locally duplicated regions that will be used in the next step to delineate tandem arrays.
Delineating potential tandem arrays: the overlap graph
Different chains in the set C may participate in the same tandem duplication. In order to delineate tandem duplicated regions, we build a new (undirected) graph G2, an overlap graph, whose vertices represent t-chains and remaining anchors. Two vertices are connected by an edge if the intervals they define overlap sufficiently on either axis. The connected components of this overlap graph define the tandem duplicated regions. More precisely the smallest interval encompassing all the projections of the anchors of a connected component defines the tandem duplicated region, or tandem array (TA), on S (in practice this interval is enlarged by L = 40 kb on each side).
For each TA, we try to infer the associated duplication unit from its t-chains. Because of the specific nature of tandem duplications, a single t-chain may contain multiple copies of the minimal duplication unit of the TA.d To identify a minimal duplication unit, we start from the t-chain c with minimum cost (breaking ties with length). Its longest region is aligned against itself (with the alignment tool) and if less than 50% of the sequence is self-similar, we use it as the reference duplication unit. Otherwise, we trim the sequence from the extremity closest to the HSP with the smallest score and iterate.
In order to improve the power of our algorithm to detect tandemly duplicated genes, the reference duplication unit is used as a TBLASTX query against the corresponding tandem region producing a new set of anchors. This new set of anchors is used to build a local weighted digraph G3. Since G3 only contains anchors involving the reference duplication unit, its structure is very simple. and a successive shortest path algorithm is used to extract the best chains, defining the detected tandem units (TU) of the TA. Ultimately, all detected TUs are enlarged by a maximum amount of 25% on each side, without violating constraints (4).
a The Arabidopsis Information Resource, at http://www.arbidopsis.orgcentralizes information on the A. thaliana genome.
b Note that because of property (2), for each anchor a at most one of the two intervals a0, a1 contains a nucleotide s i .
c The maximum flow defines a set of chains of just one anchor therefore satisfying all conditions.
d A tandem duplication with 4 duplication units can also be considered as a tandem duplication with 2 duplication units, each containing 2 smaller (minimal) units.
The authors would like to thank the INRA for funding E. Audemard.
- Conant GC, Wolfe KH: Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 2008, 9(12):938–950. http://dx.doi.org/10.1038/nrg2482 10.1038/nrg2482View ArticlePubMed
- Hahn MW: Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered 2009, 100(5):605–617. http://dx.doi.org/10.1093/jhered/esp047 10.1093/jhered/esp047View ArticlePubMed
- Innan H, Kondrashov F: The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet 2010, 11(2):97–108. http://dx.doi.org/10.1038/nrg2689PubMed
- Demuth JP, Bie TD, Stajich JE, Cristianini N, Hahn MW: The evolution of mammalian gene families. PLoS One 2006., e85+: http://dx.doi.org/10.1371/journal.pone.0000085
- Hahn MW, Han MV, Han SG: Gene family evolution across 12 Drosophila genomes. PLoS Genet 2007, 3(11):e197. http://dx.doi.org/10.1371/journal.pgen.0030197 10.1371/journal.pgen.0030197PubMed CentralView ArticlePubMed
- Perry GH, Yang F, Marques-Bonet T, Murphy C, Fitzgerald T, Lee AS, Hyland C, Stone AC, Hurles ME, Tyler-Smith C, Eichler EE, Carter NP, Lee C, Redon R: Copy number variation and evolution in humans and chimpanzees. Genome Res 2008, 18(11):1698–1710. http://dx.doi.org/10.1101/gr.082016.108 10.1101/gr.082016.108PubMed CentralView ArticlePubMed
- Marques-Bonet T, Kidd JM, Ventura M, Graves TA, Cheng Z, Hillier LW, Jiang Z, Baker C, Malfavon-Borja R, Fulton LA, Alkan C, Aksay G, Girirajan S, Siswara P, Chen L, Cardone MF, Navarro A, Mardis ER, Wilson RK, Eichler EE: A burst of segmental duplications in the genome of the African great ape ancestor. Nature 2009, 457(7231):877–881. http://dx.doi.org/10.1038/nature07744 10.1038/nature07744PubMed CentralView ArticlePubMed
- Dumas L, Kim YHH, Karimpour-Fard A, Cox M, Hopkins J, Pollack JRR, Sikela JMM: Gene copy number variation spanning 60 million years of human and primate evolution. Genome Res 2007, 2007: 2007–2007. http://dx.doi.org/10.1101/gr.6557307
- Schrider DR, Hahn MW: Gene copy-number polymorphism in nature. Proc Biol Sci 2010, 277(1698):3213–3221. http://dx.doi.org/10.1098/rspb.2010.1180 10.1098/rspb.2010.1180PubMed CentralView ArticlePubMed
- Rizzon C, Ponger L, Gaut BS: Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice. PLoS Comput Biol 2006, 2(9):e115. http://dx.doi.org/10.1371/journal.pcbi.0020115 10.1371/journal.pcbi.0020115PubMed CentralView ArticlePubMed
- Shoja V, Zhang L: A roadmap of tandemly arrayed genes in the genomes of human, mouse, and rat. Mol Biol Evol 2006, 23(11):2134–2141. http://dx.doi.org/10.1093/molbev/msl085 10.1093/molbev/msl085View ArticlePubMed
- Katju V, Lynch M: The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 2003, 165(4):1793–1803.PubMed CentralPubMed
- Letunic I, Copley RR, Bork P: Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet 2002, 11(13):1561–1567. 10.1093/hmg/11.13.1561View ArticlePubMed
- Zhang R, Peng Y, Wang W, Su B: Rapid evolution of an X-linked microRNA cluster in primates. Genome Res 2007, 17(5):612–617. http://dx.doi.org/10.1101/gr.6146507 10.1101/gr.6146507PubMed CentralView ArticlePubMed
- Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science 2000, 290(5494):1151–1155. http://dx.doi.org/10.1126/science.290.5494.1151 10.1126/science.290.5494.1151View ArticlePubMed
- Li WH, Gu Z, Cavalcanti AR, Nekrutenko A: Detection of gene duplications and block duplications in eukaryotic genomes. J Struct Funct Genomics 2003, 3: 27–34. http://view.ncbi.nlm.nih.gov/pubmed/12836682 10.1023/A:1022644628861View ArticlePubMed
- Lynch M: The Origins of Genome Architecture. W.H. Freeman & Company; 2007. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07–20&path=ASIN/0878934847
- Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucl Acids Res 1999, 27(2):573–580. http://dx.doi.org/10.1093/nar/27.2.573 10.1093/nar/27.2.573PubMed CentralView ArticlePubMed
- Kolpakov R, Bana G, Kucherov G: mreps: efficient and flexible detection of tandem repeats in DNA. Nucl Acids Res 2003, 31(13):3672–3678. http://dx.doi.org/10.1093/nar/gkg617 10.1093/nar/gkg617PubMed CentralView ArticlePubMed
- Despons L, Baret PV, Frangeul L, Louis VL, Durrens P, Souciet JL: Genome-wide computational prediction of tandem gene arrays: application in yeasts. BMC Genomics 2010, 11: 56. http://dx.doi.org/10.1186/1471–2164–11–56 10.1186/1471-2164-11-56PubMed CentralView ArticlePubMed
- Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE: Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 2001, 11(6):1005–1017. http://dx.doi.org/10.1101/gr.187101 10.1101/gr.GR-1871RPubMed CentralView ArticlePubMed
- Marques-Bonet T, Girirajan S, Eichler EE: The origins and impact of primate segmental duplications. Trends Genet 2009, 25(10):443–454. http://dx.doi.org/10.1016/j.tig.2009.08.002 10.1016/j.tig.2009.08.002PubMed CentralView ArticlePubMed
- Haas BJ, Delcher AL, Wortman JR, Salzberg SL: DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics 2004, 20(18):3643–3646. http://view.ncbi.nlm.nih.gov/pubmed/15247098 10.1093/bioinformatics/bth397View ArticlePubMed
- Hachiya T, Osana Y, Popendorf K, Sakakibara Y: Accurate identification of orthologous segments among multiple genomes. Bioinformatics 2009, 25: 853–860. 10.1093/bioinformatics/btp070View ArticlePubMed
- Hohl M, Kurtz S, Ohlebusch E: Efficient multiple genome alignment. Bioinformatics 2002, 18(Suppl 1):S312-S320. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?cmd=prlinks&dbfrom=pubmed&retmode=ref&id=12169561 10.1093/bioinformatics/18.suppl_1.S312View ArticlePubMed
- Simillion C, Vandepoele K, Saeys Y, de Peer YV: Building genomic profiles for uncovering segmental homology in the twilight zone. Genome Res 2004, 14(6):1095–1106. http://dx.doi.org/10.1101/gr.2179004 10.1101/gr.2179004PubMed CentralView ArticlePubMed
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410. http://dx.doi.org/10.1016/S0022–2836(05)80360–2View ArticlePubMed
- Blanc G, Barakat A, Guyot R, Cooke R, Delseny M: Extensive duplication and reshuffling in the Arabidopsis genome. The Plant Cell Online 2000, 12(7):1093.View Article
- Cannon S, Mitra A, Baumgarten A, Young N, May G: The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol 2004, 4: 10. 10.1186/1471-2229-4-10PubMed CentralView ArticlePubMed
- Morgulis A, Gertz EM, SchÃ¤ffer AA, Agarwala R: A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences. J Comput Biol 2006, 1028–1040.
- Zhang L, Gaut BS: Does recombination shape the distribution and evolution of tandemly arrayed genes (TAGs) in the Arabidopsis thaliana genome? Genome Res 2003, 13(12):2533–2540. http://dx.doi.org/10.1101/gr.1318503 10.1101/gr.1318503PubMed CentralView ArticlePubMed
- Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 2000, 17: 32–43. http://mbe.oxfordjournals.org/cgi/content/abstract/17/1/32 10.1093/oxfordjournals.molbev.a026236View ArticlePubMed
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 1997, 13: 555–556.PubMed
- Kaessmann H: Origins, evolution, and phenotypic impact of new genes. Genome Res 2010, 20(10):1313. 10.1101/gr.101386.109PubMed CentralView ArticlePubMed
- Theologis A, Ecker JR, Palm CJ, Federspiel NA, Kaul S, White O, Alonso J, Altafi H, Araujo R, Bowman CL, Brooks SY, Buehler E, Chan A, Chao Q, Chen H, Cheuk RF, Chin CW, Chung MK, Conn L, Conway AB, Conway AR, Creasy TH, Dewar K, Dunn P, Etgu P, Feldblyum TV, Feng J, Fong B, Fujii CY, Gill JE, Goldsmith AD, Haas B, Hansen NF, Hughes B, Huizar L, Hunter JL, Jenkins J, Johnson-Hopson C, Khan S, Khaykin E, Kim CJ, Koo HL, Kremenetskaia I, Kurtz DB, Kwan A, Lam B, Langin-Hooper S, Lee A, Lee JM, Lenz CA, Li JH, Li Y, Lin X, Liu SX, Liu ZA, Luros JS, Maiti R, Marziali A, Militscher J, Miranda M, Nguyen M, Nierman WC, Osborne BI, Pai G, Peterson J, Pham PK, Rizzo M, Rooney T, Rowley D, Sakano H, Salzberg SL, Schwartz JR, Shinn P, Southwick AM, Sun H, Tallon LJ, Tambunga G, Toriumi MJ, Town CD, Utterback T, Aken SV, Vaysberg M, Vysotskaia VS, Walker M, Wu D, Yu G, Fraser CM, Venter JC, Davis RW: Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature 2000, 408(6814):816–820. http://dx.doi.org/10.1038/35048500 10.1038/35048500View ArticlePubMed
- Duret L, Chureau C, Samain S, Weissenbach J, Avner P: The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 2006, 312(5780):1653. 10.1126/science.1126316View ArticlePubMed
- DeBolt S: Copy number variation shapes genome diversity in Arabidopsis over immediate family generational scales. Genome Biol Evol 2010, 2: 441. 10.1093/gbe/evq033PubMed CentralView ArticlePubMed
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucl Acids Res 2002, 30(7):1575–1584. http://dx.doi.org/10.1093/nar/30.7.1575 10.1093/nar/30.7.1575PubMed CentralView ArticlePubMed
- Noé L, Kucherov G: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res 2005., 33: http://view.ncbi.nlm.nih.gov/pubmed/15980530(WebServerissue)
- Ahuja RK, Magnanti TL, Orlin JB:Network Flows: Theory, Algorithms, and Applications. Prentice Hall; 1993. [http://www.amazon.com/exec/obidos/redirect?tag=citeulike07–20&path=ASIN/013617549X]
- Busacker R, Gowen P: A procedure for determining minimal-cost network flow patterns. ORO Technical Report 15 1961.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.