Detecting long tandem duplications in genomic sequences
© Audemard et al.; licensee BioMed Central Ltd. 2012
Received: 27 September 2011
Accepted: 24 March 2012
Published: 8 May 2012
Skip to main content
© Audemard et al.; licensee BioMed Central Ltd. 2012
Received: 27 September 2011
Accepted: 24 March 2012
Published: 8 May 2012
Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication.
In this paper, we introduce ReD Tandem, a software using a flow based chaining algorithm targeted at detecting tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR,a we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non coding elements such as pseudo-genes or RNA genes.
ReD Tandem allows to identify large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein gene based which ignores duplications involving non coding regions. It is however inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also allow to improve existing annotations.
Gene duplication has long been recognized as a major driving force in evolution. Both the extent of gene duplications in genomes and the theoretical formalization describing the process by which duplicate genes may contribute to genetic novelty by neo-functionalization lead to an intense interest for the subject (reviewed in [1–3]). The recent discovery of a previously unexpected dynamic of gene family expansion and contraction observed in complete genome sequences has called new attention on the phenomenon of gene duplication [4, 5]. Moreover, recent studies of gene copy-number polymorphism in various organisms provide evidence of an ongoing mechanism of gene duplication and loss within species . The different studies underline that this “revolving door” of gene gain and loss largely contributes to intra and interspecific phenotypic variability [7–9] and is therefore likely to have played an important role in shaping phenotypic differences among species .
Analysis of the genomes of Arabidopsis, human, mouse and rat revealed that tandemly arrayed duplicates account from 10% to 20% of all genes [2, 10, 11]. In addition, the contribution of tandem duplication to gene duplicates ranges from one-third in mammals  to almost 70% in Caenorhabditis elegans, highlighting the predominant role that tandem duplication plays in gene duplication. Tandem duplication contributes also to the evolution of other classes of functional elements such as exons within genes  or RNA genes . In this respect, the detection of recent tandemly duplicated segments in complete genome sequences is a question of foremost interest.
Tandem duplication has been extensively studied at the protein coding gene level [11, 15–17] or at the much smaller scale of serial repeats (micro-satellites), based on local DNA similarities [18, 19].
All studies based on protein similarity analysis are naturally biased by the available genome annotation. In addition, such analyses automatically exclude duplicated segments with RNA genes or degenerated copies from the scope of the study. Despons and colleagues  have recently proposed an approach combining protein and DNA sequence comparison, enabling to detect degenerated paralogous copies, but the method still relies on an existing annotation and is additionally, as acknowledged by the authors, essentially limited to the analysis of compact genomes. Using DNA sequence comparison only, Eichler and colleagues [21, 22] have significantly contributed to the understanding of dynamics of duplication in primates by studying highly identical duplicated DNA fragments greater than 1Kb, termed segmental duplications. This latter work is however limited to the study of very recent duplications.
On the other side of the size spectrum, different algorithms have been devised to detect so-called serial repeats at the DNA level. Initially targeted at short (micro-satellite-like) repeats, these algorithms have been considerably improved, leading to tools such as TRF  or mreps  which are capable detecting short tandem repeats on whole genomes. But, as shown in our experiments, the underlying definition of a serial repeat (as a contiguously repeated string) is not suitable for detecting large duplications that may contain disrupted similarities and which, despite being close, are far from contiguous.
In this paper, we introduce ReD Tandem, a tandem duplication detection tool that works from the genomic DNA sequence of the considered organism. In order to identify tandem duplicated segments, we start from short similar regions (also called anchors) that have been detected by a fast whole genome self-alignment software. These anchors are then chained into larger (duplicated) segments, similarly to what is done in synteny or segmental duplication detection tools such as DAGchainer  or OSfinder , modified to account for the specific properties of tandem duplications. In the next step, we analyse these chains to find tandem regions and an associated duplication unit. This duplication unit is used as a seed to locate further tandem duplications defining what we call a Tandem Array (TA).
In the first section, we present the formal definition of anchors and chains and the algorithm that enables to detect tandemly duplicated regions and the associated duplication unit. We next apply our method on Arabidopsis thaliana and we show that a large number of Tandem Gene Arrays, that can be derived from a CDS based family analysis, are detected by ReD Tandem. We analyse how the detection sensitivity varies with the evolutionary distance between genes. Finally we discuss the ability of the agnostic approach of ReD Tandem to detect duplications of RNA genes, duplications families involving different functional categories such as protein-coding genes together with long non-coding RNAs, as well as duplications of unannotated regions.
Tandem duplications typically include several copies of the same sequence. In the usual situation, these duplications have been obfuscated by evolution, leaving only local similarities. In this section we show how a specific chaining algorithm (called ReD) can reconstruct sets of duplicated regions that can be further analyzed to identify Tandem Arrays and associated duplication units. To test this approach, we apply it on the Arabidopsis thaliana genome and analyze its performance on characterized fraction of tandem coding gene duplications. Finally, we also explore the non coding fraction of the predicted tandem duplicated regions and show that ReD is also able to discover duplicated regions involving pseudo-genes, small or long RNA genes and other specific regions in the Arabidospis thaliana genome.
To properly identify tandem arrays and their associated duplication unit, we follow a multiple steps procedure which is succinctly described now and described in more detail in the “Methods” section.
In a first step, adjacent sequence similarities are identified. These local similarities are next chained to identify a set of pairwise duplicated regions that could belong to tandem duplications. In a third step, the resulting chains are used to identify regions that could define tandem arrays together with the corresponding duplication unit. In the final step, in each such region, this duplication unit is used as a seed to reconstruct the structure of the complete tandem array.
Similarly to what has been done for the reconstruction of homologous regions [23, 24] or whole-genome alignment , our aim is to reconstruct duplicated regions as consistently ordered sequences of close anchors. By consistent order, we mean that each of the two sequences of regions defined by the sequence of anchors is either increasing (on the forward strand) or decreasing (on the reverse strand). To characterize close anchors, we use a distance introduced in  and defined in the “Methods” section.
The usual approach to identify large pairwise duplicated regions is to build a graph whose vertices are the detected anchors and where a directed edge (a b) is created when a and b are consistently ordered and sufficiently close to each other. A score is affected to the nodes and edges, reflecting the alignment score and the physical distances between anchors, and a minimum cost (shortest) path in this graph defines the regions sought .
By repeatedly extracting shortest paths from this acyclic graph, denoted as G 1, one obtains a set of predicted duplicated regions, called chains, with an overall cost defined as the sum of the costs of all its paths.
Enforcing conditions (1) and (2) is difficult, especially because condition (2) is a global condition on the set of predicted chains and not only on each chain. Instead of using the usual process of repeated extraction of shortest chains, we therefore shift to a more sophisticated minimum cost flow based algorithm that will be able to produce a set of k chains that satisfy the constraints above and minimize their overall cost.
To use flows, the previous graph G 1 is transformed in a transportation network and the problem of identifying a set of chains representing duplicated regions is reformulated into a minimum cost flow problem (see Methods).
Compared to the usual iterated greedy shortest path approach, this approach is able to “reconsider” previous chains and reallocate an anchor that was previously used in a chain to a new chain if this is needed to get an overall optimal cost. It therefore has a global view on the set of predicted chains. By iterating until (1) and (2) are satisfied we guarantee that the set of predicted chains are “non-overlapping” chains.
In order to delineate tandem array regions from the previous set of chains, we build a second graph G 2 whose vertices are the predicted chains and remaining anchors and where two vertices are connected iff the associated regions of the two vertices on the genome overlap sufficiently on one of their sub-regions, i.e. the two chains/anchors share at least one sub-region.
Every connected component of this graph collect regions that share paralogous relationship and defines therefore a predicted tandem array region. The actual predicted tandem region is obtained by extending the minimum and maximum coordinates inside the connected component by a small margin. Every chain inside the connected component is a duplication unit candidate and the longest of all minimum cost chains in the array is used to identify the reference duplication unit (see Methods).
In order to increase sensitivity, the previously identified duplication unit is used as the query sequence in a TBLASTX  search against the tandem array region. All the candidate regions that align with the duplication unit on a sufficient length are kept as additional occurrences of the duplication unit and define the output of the prediction by the global “ReD Tandem” approach.
The evaluation of the proposed method was performed using the Arabidopsis thaliana genome sequence as a test case. This genome and its internal gene duplications have been extensively studied providing an excellent standard for evaluation [28, 29]. We used NCBI build 9.1, preprocessed using the low complexity filter DUST (Tatusov and Lipman, unpublished; described in ), anchors are produced using our own genome wide aligner, glint (Faraut T, Courcelle E., unpublished), using standard alignment scoring scheme (match +1, mismatch −3, gap open/extend −5/-2). Our aim is to test if tandem duplications identified by similarities between protein sequences can be recovered by ReD Tandem using the DNA sequence only. The first step therefore involves the construction of a reference set of tandem gene arrays against which the tandem duplications detected by ReD will be compared.
In order to construct a reference set of tandem gene arrays we proceed essentially like other published methods [10, 20, 31]. Considering the TAIR10 version of the Arabidopsis thaliana genome annotation, for each coding gene of length > 500 bp (matching the minimum length ℓ used in ReD Tandem), the longest annotated transcript is selected as the reference transcript. An all-against-all BLASTP comparison is conducted on the corresponding set of proteins. Two genes are considered to share a tandem paralogous relationship (resulting from an ancient tandem duplication) if they are less than T (150 kb) apart and exhibit a BLASTP hit with an e-value of at most 10−5 covering at least 70% of both sequences. These tandem paralogous relationships between genes are used in turn as anchors to create an overlap graph following the method described in “Methods”. Each connected component of this graph defines a tandem duplication array with genes as elementary duplication units. These connected components are essentially equivalent to the TGA defined in , with the difference that the notion of spacer genes between duplicated copies is replaced here by the physical distance threshold of T between copies to enable the comparison with our annotation-free approach.
These reference tandem duplication arrays will be called tandem gene arrays (TGA), and the associated duplication unit tandem gene unit (TGU). Conversely, the regions and units detected by ReD Tandem from DNA alone will be respectively denoted as tandem arrays (TA), and associated tandem unit (TU).
In our analysis, we consider that a TGU is detected if it is overlapped on at least 70% of its length by a ReD TU. For TA and TGA, the criteria is more stringent and requires that the TGA is overlapped by more than 70% by the TA and that at least one of the TGU in the TGA is detected as a TU.
From 60,021 DNA anchors, ReD Tandem built 10,290 chains with a mean of 2.9 anchors per chain, underlining the importance of chaining here. These chains define 1,718 Tandem Array (TA) covering 28.8% of the A. thaliana genome, made up of 5,477 Tandem Unit (TU) covering 10.6% of the sequenced genome. This is consistent with the estimated 10% of tandem gene duplications in the Arabidopsis genome .
Arabidopsis vs Arabidopsis
The results we have obtained on A. thaliana show that ReD Tandem, without relying on a predicted proteome, is able to correctly detect a large fraction of reference tandem duplicated genes provided they are sufficiently close from an evolutionary point of view (dS < 1).
The real added value of ReD Tandem is precisely its ability to perform its analyzes purely from DNA. Although it is restricted to “recent” duplications, ReD Tandem has the ability to identify duplicated regions which are implicitly censored by pure proteome based approaches, therefore helping to analyze the evolutionary history of the region. The only existing software that we know that provides related capabilities is  which uses BLASTX comparison in the immediate vicinity of every gene to identify possible pseudo-genes and gene relics. This approach, while still depending on an annotation, is, as acknowledged by the authors, essentially restricted to compact (bacterial or unicellular eukaryote) genomes. Because it relies on a direct comparison of the genome vs. itself that can be achieved using fast whole genome index based software, ReD Tandem is not restricted to the analysis of compact genomes. Serial repeat finders such as TRF or Mreps are also able to deal directly with DNA sequences, including large genomic sequences, but are instead restricted by the underlying definition of serial repeats (as contiguous repeats). To verify this, we applied both TRF (with default parameters) and Mreps (with a resolution of 50 allowing for maximum approximate matching) to the A. thaliana genome. TRF and Mreps identified respectively 35 and 26 serial repeats containing duplication units with a size above 500 bp (data not shown), compared to the 1,718 TA identified by ReD Tandem. By chaining local similarities, ReD Tandem is instead able to reconstruct large duplicated regions that may be interrupted by local loss of similarities.
Comparison with annotated elements
Arabidopsis vs Arabidopsis
Trans. Element gene
These examples illustrate the fact that ReD Tandem ability to predict tandem duplications extends beyond pure tandem protein gene arrays. Still, TUs remain which do not cover any annotated element in A. thaliana genome. Among the 5,573 TUs predicted by ReD, 1,438 are orphan TUs. This is expected since ReD Tandem is just targeted at detecting DNA level tandem duplications. However, when such orphan TUs appear in a TA, other TUs in the same TA may provide extra information. Some of these orphan TUs may be of interest for improving the existing genome annotation.
We note in addition that ReD Tandem is also able to detect intertwined TGAs and to separate the duplication units belonging to the different families. (Figures not shown, see http://narcisse.toulouse.inra.fr/ReDTandem/6.htmlath4-8005015-8046308 and http://narcisse.toulouse.inra.fr/ReDTandem/2.htmlath4-8031970-8049154)
Arabidopsis vs Arabidopsis
Mean size of TU (bp)
Mean nb of TU
This section gives just a short extract of all detected TAs. A full list of detected TAs, with associated TUs and TAIR10 annotation for the A. thaliana genome with direct links to the corresponding region on Gramene web site is available from http://narcisse.toulouse.inra.fr/ReDTandem.
The full packaged software from anchor detection to final TA/TU prediction is distributed under a CECILL open-source licence at http://narcisse.toulouse.inra.fr/ReDTandem. The software archive is also available as Additional file 1. You can either download a set of executable Linux 64 bits binaries wrapped in a Perl script or the set of sources. ReD Tandem is implemented in C++ and its execution time on Arabidopsis thaliana genome is around 4 hours on a single core computer. Its execution requires the availability of NCBI Blast and a Perl interpreter with the BioPerl package.
In this paper we have introduced ReD Tandem, which, in our knowledge, is the first software targeted at predicting large partially conserved tandem duplications directly from DNA. This allows ReD Tandem to work directly on unannotated genomes. The analysis of ReD Tandem output and examples show that a pure DNA based analysis of tandem duplications unveils a large variety of phenomenons that cannot be revealed by usual protein based analysis. This uncensored vision of tandem duplication should be of great interest to address specific questions on duplication driven genome evolution such as the evolutionary fate of duplicated segments regarding their functional content [2, 34].
From a pure evolutionary point of view, the Tandem Arrays and Tandem Units predicted by ReD Tandem and the usual protein gene based analysis [11, 15, 16, 38] complement each other nicely. While a protein gene based analysis allows to identify distant evolutionary relationships, it implicitly censors all non coding elements that may be involved in the evolutionary process (pseudo-genes, gene relics, RNA genes, CNVs…). Conversely, we have shown that ReD Tandem is able to reliably detect relatively recent tandem duplications (dS < 1 typically) and can uncover a variety of duplications involving coding and non coding regions (and potentially totally non functional regions). It is therefore useful even if a current genome annotation exists and may help identify spurious or missing annotated elements. More importantly, it offers unprecedented direct raw access to tandem duplicated regions, directly bringing to light a variety of situations that were inaccessible in protein gene based approaches.
The DNA sequence is modeled by a string S. A sub-string u of S s i …s j will be simply noted as the interval [i, j] with a start u s = i and an end u e = j. For two non-overlapping intervals, u and v, u <v iff u e <v s . If u <v we define .
A duplication copies a sub-string of S to a distinct location in S and a tandem duplication copies the original copy in its neighborhood. When the duplication is recent, the relationship between the original copy and the duplicate can be captured by a single sequence alignment.
Let a denote a local alignment (or anchor), between S and itself. a is a mapping between an interval and another interval . Using the traditional 2-dimensional representation of an alignment - with S associated with the x-axis as well as the y-axis of the 2-dimensional plane - a local alignment is a path on the plane with and being the corresponding projections on the two axis, 0 standing for the y-axis and 1 for the x-axis. As usual, the orientation, or sign, of an anchor a, a.sign, indicates if the two aligned regions lie on the same strand (+) or not (−). We note a.score the alignment score of the anchor a. As anchors, defined by alignments, have often imprecise boundaries, when we compare two anchors we assume they are reduced to their mid-point.
where T is a user defined threshold. We note the distance between a 0 and a 1, the two duplicated segments identified by the alignment a. Dot-plot examples of two contrasted real tandem arrays are illustrated in Figure 2.
When the duplication is a more ancient one, because of sequence divergence, the original region and the duplicate one can usually not be aligned on their full length. The identification of a duplication can be viewed as a special case of the genome alignment problem, a generalization of the sequence alignment problem. A common approach in genome alignment consists in chaining alignments . A chain of anchors c is simply a path connecting anchors in the plane. This path relates the interval c 0 on S, the projection of c on the y-axis and the interval c 1 on S, the projection of c on the x-axis.
The relation ≺ being a partial order on the set of anchors, it induces a directed acyclic graph where vertices are anchors and a directed edge (a i , a j ) appears iff a i ≺ a j . Any path in this graph is a chain of anchors. As a consequence of co-linearity (3), any chain inherits the shared sign of its anchors.
If furthermore the two intervals c 0 and c 1 defined by a chain c satisfy the properties (2), representing a candidate tandem duplication, we say that c is a t-chain. The purpose of the algorithm described below is to identify a set of t-chains in a sequence S and for each t-chain, identify the corresponding duplication unit, the region delineating the tandem array and the number of repetitions. Note that a t-chain can possibly be composed of a single anchor.
In theory, anchors could be identified using any local self-alignment software (such as YASS ) but existing software usually do not produce anchors satisfying property (2). We therefore adapted our own genome-wide alignment software (glint, Faraut T, Courcelle E., unpublished) to this specific requirement. Optionally, the sequence can be preprocessed to deal with low complexity regions (see the Results section).
Compared to the Euclidian or Manhattan distances on the dotplot plan, this distance tends to be smaller when the closest extremities of two anchors lie on the same diagonal (with the same distance between their two intervals).
Every vertex a, representing an anchor, is weighted by its rescaled alignment score. For each position i of the sequence S, we define the coverage of nucleotide s i , c(i), as the number of intervals a . 0, a . 1 containing the nucleotide s i .b For each anchor a, m a denotes the mean coverage of the associated intervals a 0 and a 1. The cost of vertex a in the anchor graph is defined by , favoring the selection of anchors whose regions participate in other anchors.
Every edge (a i , a j ) connecting two anchors is initially weighted by the previous “distance” d(a, b) between anchors. To keep our algorithm efficient, we keep only the k best edges leaving every vertex (based on edge cost, typically k = 15). To normalize costs, edge score are rescaled so that the mean edge score is equal to the absolute value of the mean anchor score.
This defines the first digraph G 1.
which also implies that t-chains cannot share anchors. Therefore, finding a set of t-chains with an optimal global score that satisfies properties (2), (3) and (4) cannot be simply computed by iteratively predicting successive t-chains using a traditional weighted chaining method . Figure 3 shows a set of chains that would violate these properties.
To satisfy these properties, we transform the graph G 1 in a transportation network  where edges and vertices are associated to unit capacity. All vertices are connected to a source and a sink with a unit capacity edge. Because vertices and edges have a unit capacity, any flow in this network defines a set of paths (chains) that, with guarantee, do not share any vertex (anchor).
In order to guarantee that this set of chains is a set of non overlapping t-chains satisfying a specific form of cost optimality, we use a variant of the successive shortest path algorithm for minimum cost maximum flow by Busaker and Gowen [40, 41]. At iteration i, this algorithm provides a minimum cost flow of value i. This flow defines a set of i chains which do not share anchors with a global minimum cost (maximum score) among all such flows. If the set of chains defined shows no overlapping, we may stop. Otherwise, we proceed to the next iteration. It is easy to prove that the algorithm will terminatec and therefore provides a set of t-chains C. This set has optimal cost among all sets of t-chains of same cardinality. These chains are the potential traces of locally duplicated regions that will be used in the next step to delineate tandem arrays.
Different chains in the set C may participate in the same tandem duplication. In order to delineate tandem duplicated regions, we build a new (undirected) graph G 2, an overlap graph, whose vertices represent t-chains and remaining anchors. Two vertices are connected by an edge if the intervals they define overlap sufficiently on either axis. The connected components of this overlap graph define the tandem duplicated regions. More precisely the smallest interval encompassing all the projections of the anchors of a connected component defines the tandem duplicated region, or tandem array (TA), on S (in practice this interval is enlarged by L = 40 kb on each side).
For each TA, we try to infer the associated duplication unit from its t-chains. Because of the specific nature of tandem duplications, a single t-chain may contain multiple copies of the minimal duplication unit of the TA.d To identify a minimal duplication unit, we start from the t-chain c with minimum cost (breaking ties with length). Its longest region is aligned against itself (with the alignment tool) and if less than 50% of the sequence is self-similar, we use it as the reference duplication unit. Otherwise, we trim the sequence from the extremity closest to the HSP with the smallest score and iterate.
In order to improve the power of our algorithm to detect tandemly duplicated genes, the reference duplication unit is used as a TBLASTX query against the corresponding tandem region producing a new set of anchors. This new set of anchors is used to build a local weighted digraph G 3. Since G 3 only contains anchors involving the reference duplication unit, its structure is very simple. and a successive shortest path algorithm is used to extract the best chains, defining the detected tandem units (TU) of the TA. Ultimately, all detected TUs are enlarged by a maximum amount of 25% on each side, without violating constraints (4).
a The Arabidopsis Information Resource, at http://www.arbidopsis.org, centralizes information on the A. thaliana genome.
b Note that because of property (2), for each anchor a at most one of the two intervals a 0, a 1 contains a nucleotide s i .
c The maximum flow defines a set of chains of just one anchor therefore satisfying all conditions.
d A tandem duplication with 4 duplication units can also be considered as a tandem duplication with 2 duplication units, each containing 2 smaller (minimal) units.
The authors would like to thank the INRA for funding E. Audemard.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.