Anchors
Let a denote a local alignment (or anchor), between S and itself. a is a mapping between an interval
and another interval
. Using the traditional 2dimensional representation of an alignment  with S associated with the xaxis as well as the yaxis of the 2dimensional plane
 a local alignment is a path on the plane with
and
being the corresponding projections on the two axis, 0 standing for the yaxis and 1 for the xaxis. As usual, the orientation, or sign, of an anchor a, a.sign, indicates if the two aligned regions lie on the same strand (+) or not (−). We note a.score the alignment score of the anchor a. As anchors, defined by alignments, have often imprecise boundaries, when we compare two anchors we assume they are reduced to their midpoint.
Because of the symmetry in the comparison of
S to itself, we can restrict ourselves to an upperhalfplane and impose, without loss of generality, that
An anchor
a which belongs to a genuine tandem duplication must satisfy some additional conditions: the two intervals
a
^{0} and
a
^{1} being intervals of
S, as a consequence of the considered duplication mechanism, cannot overlap. In other words
a cannot overlap itself on
S. Together with (1), this therefore implies that
a
^{0} <
a
^{1}. Since we consider only tandem duplications,
a
^{0} and
a
^{1} must be sufficiently close to each other on
S. The two conditions can be formalized as follows
where T is a user defined threshold. We note
the distance between a
^{0} and a
^{1}, the two duplicated segments identified by the alignment a. Dotplot examples of two contrasted real tandem arrays are illustrated in Figure 2.
When the duplication is a more ancient one, because of sequence divergence, the original region and the duplicate one can usually not be aligned on their full length. The identification of a duplication can be viewed as a special case of the genome alignment problem, a generalization of the sequence alignment problem. A common approach in genome alignment consists in chaining alignments [25]. A chain of anchors c is simply a path connecting anchors in the plane. This path relates the interval c
^{0} on S, the projection of c on the yaxis and the interval c
^{1} on S, the projection of c on the xaxis.
In order to build chains of anchors that reflect the proposed homology relationship between the two regions
c
^{0} and
c
^{1}, we require the anchors
a
_{1}, …,
a
_{
n
} in a chain to be colinear: they must share the same sign and each sequence
and
must be totally ordered intervals on
S. More formally we say that the anchor
b can be a successor of anchor
a in a chain, noted
a ≺
b, if and only if
The relation ≺ being a partial order on the set of anchors, it induces a directed acyclic graph where vertices are anchors and a directed edge (a
_{
i
}, a
_{
j
}) appears iff a
_{
i
} ≺ a
_{
j
}. Any path in this graph is a chain of anchors. As a consequence of colinearity (3), any chain inherits the shared sign of its anchors.
If furthermore the two intervals c
^{0} and c
^{1} defined by a chain c satisfy the properties (2), representing a candidate tandem duplication, we say that c is a tchain. The purpose of the algorithm described below is to identify a set of tchains in a sequence S and for each tchain, identify the corresponding duplication unit, the region delineating the tandem array and the number of repetitions. Note that a tchain can possibly be composed of a single anchor.
Identifying chains
In theory, anchors could be identified using any local selfalignment software (such as YASS [39]) but existing software usually do not produce anchors satisfying property (2). We therefore adapted our own genomewide alignment software (glint, Faraut T, Courcelle E., unpublished) to this specific requirement. Optionally, the sequence can be preprocessed to deal with low complexity regions (see the Results section).
Starting from the DAG defined by the ≺ relation, we build a digraph by removing all edges (
a
b) that cannot participate in a
tchain,
i.e. anchors
a and
b which define a chain with overlapping intervals or which are too distant (distance larger than
L). If we note
and
, following [
26] we use
Compared to the Euclidian or Manhattan distances on the dotplot plan, this distance tends to be smaller when the closest extremities of two anchors lie on the same diagonal (with the same distance between their two intervals).
To identify the most likely set of
tchains in this graph, we will consider minimum cost paths in a graph weighted as follows:

Every vertex a, representing an anchor, is weighted by its rescaled alignment score. For each position i of the sequence S, we define the coverage of nucleotide s
_{
i
}, c(i), as the number of intervals a
_{.}
^{0}, a
_{.}
^{1} containing the nucleotide s
_{
i
}.^{b} For each anchor a, m
_{
a
} denotes the mean coverage of the associated intervals a
^{0} and a
^{1}. The cost of vertex a in the anchor graph is defined by
, favoring the selection of anchors whose regions participate in other anchors.

Every edge (a
_{
i
}, a
_{
j
}) connecting two anchors is initially weighted by the previous “distance” d(a, b) between anchors. To keep our algorithm efficient, we keep only the k best edges leaving every vertex (based on edge cost, typically k = 15). To normalize costs, edge score are rescaled so that the mean edge score is equal to the absolute value of the mean anchor score.
This defines the first digraph G
_{1}.
Importantly, since every
tchain represents a specific duplication event, two predicted
tchains
c
_{
i
} and
c
_{
j
} should not define overlapping intervals:
which also implies that tchains cannot share anchors. Therefore, finding a set of tchains with an optimal global score that satisfies properties (2), (3) and (4) cannot be simply computed by iteratively predicting successive tchains using a traditional weighted chaining method [25]. Figure 3 shows a set of chains that would violate these properties.
To satisfy these properties, we transform the graph G
_{1} in a transportation network [40] where edges and vertices are associated to unit capacity. All vertices are connected to a source and a sink with a unit capacity edge. Because vertices and edges have a unit capacity, any flow in this network defines a set of paths (chains) that, with guarantee, do not share any vertex (anchor).
In order to guarantee that this set of chains is a set of non overlapping tchains satisfying a specific form of cost optimality, we use a variant of the successive shortest path algorithm for minimum cost maximum flow by Busaker and Gowen [40, 41]. At iteration i, this algorithm provides a minimum cost flow of value i. This flow defines a set of i chains which do not share anchors with a global minimum cost (maximum score) among all such flows. If the set of chains defined shows no overlapping, we may stop. Otherwise, we proceed to the next iteration. It is easy to prove that the algorithm will terminate^{c} and therefore provides a set of tchains C. This set has optimal cost among all sets of tchains of same cardinality. These chains are the potential traces of locally duplicated regions that will be used in the next step to delineate tandem arrays.
Delineating potential tandem arrays: the overlap graph
Different chains in the set C may participate in the same tandem duplication. In order to delineate tandem duplicated regions, we build a new (undirected) graph G
_{2}, an overlap graph, whose vertices represent tchains and remaining anchors. Two vertices are connected by an edge if the intervals they define overlap sufficiently on either axis. The connected components of this overlap graph define the tandem duplicated regions. More precisely the smallest interval encompassing all the projections of the anchors of a connected component defines the tandem duplicated region, or tandem array (TA), on S (in practice this interval is enlarged by L = 40 kb on each side).
For each TA, we try to infer the associated duplication unit from its tchains. Because of the specific nature of tandem duplications, a single tchain may contain multiple copies of the minimal duplication unit of the TA.^{d} To identify a minimal duplication unit, we start from the tchain c with minimum cost (breaking ties with length). Its longest region is aligned against itself (with the alignment tool) and if less than 50% of the sequence is selfsimilar, we use it as the reference duplication unit. Otherwise, we trim the sequence from the extremity closest to the HSP with the smallest score and iterate.