Our method represents a given family of TE sequences as an assembly of elementary blocks called modules. We propose an associated tool, ModuleOrganizer, assuming that these sequences have been selected on the basis of local characteristic features (for instance in a database such as Repbase [16]) and providing a global high level characterization of them facilitating the study of their variations. The section starts with a precise definition of properties that are suitable to delimit modules. We then describe in detail the method we propose for module identification.
Overall, it is based on the search and assembly of "maximal repeat" common to several sequences. A word w is a maximal repeat (MR) in a non-empty set of sequences S = {S1, ..., Sn} if, and only if, there are S
i
, S
j
∈ S (not necessarily distinct) and letters a, b, c, d, with a ≠ b and c ≠ d, such that awc is a substring of $S
j
$ and bwd is a substring of $S
j
$ (where $ is a letter not occurring in any sequence). In order to compute all these MR, the sequences of the family are indexed via a generalized suffix tree [17–19]. Our algorithm recursively associates maximal repeats of a same sequence into modules under restrictions corresponding to their definition, such as their size, the number of sequences supporting their presence and the content of the sequence between two MR. Two final steps allow drawing an overall representation of the family: sequences are classified with respect to the presence or the absence of modules and a visualization tool yields an overall graphical view of the sequences.
Defining modules in transposable elements
In theory all sequences of a given family of transposable elements are identical copies of an ancestor sequence. In practice an amount of variation is observed in TE copies, in connection with the age of the copies and the mutation rate. There are several kinds of TE that exhibit a reorganization of internal sequences including insertions and deletions of large sequences: the Miniature Inverted-repeat Transposable Elements (MITEs) [2, 9], the Mu-related bacterial transposons [20, 21] and the Helitron superfamily [22] that integrate blocks of genomic material into their variable sequence [21, 23, 24] and the Short Interspersed DEgenerated Retroposons 2 (SIDER2) [15].
Non-autonomous transposable elements (TEs that lost their protein-coding elements), like MITEs, which represent for some sequences the main source of copies, are often subject to deletions [25]. In such a case, it becomes difficult to reconstruct the autonomous element from the set of non-autonomous sequences [26]. We have studied as a test case the MITE family Foldback4 [14] and in accordance with previous studies of non-autonomous TEs [27, 28], it clearly exhibits variations conserved across several sequences that could be largely explained by biological events such as insertions/deletions of mobile DNA or of host sequences [23, 26]. In order to automatically retrace the main events that occurred, we have systematically exploited the fact that MITEs and other non-autonomous transposable elements present consensus patterns in their different copies [2, 25]. For example, the MITE mPing, Foldback4 or AtREP21 share consensus extremities in all their copies simply because they are necessary for transposition [13, 14, 25]. The importance of host sequence acquisition mechanisms by TEs is well known in plants [29] and leads to detectable repeated blocks in copies separated by small non-consensus nucleotidic regions.
We propose a definition of module for this type of repeated blocks that introduces cautiously these separating nucleotides. Basically, a module is an assembly of flexible repeats. Each flexible repeat is a maximal repeat combination that occurs several times in sequences where MR are separated by a variable number of nucleotides. This class of repeats can be related to the class of structured repeats introduced by M.F. Sagot [30] but introduces new interesting variations that will be discussed in the Results and discussion section under paragraph Structured versus flexible repeats. Flexibility is founded on two simple criteria that delimit the possible spacers between consecutive repeats by fixing a reasonable level of similarity between instances of the same flexible repeat. Flexibility cannot be greater than the parts it links.
-
Flexible repeats: Let S = {S1,..., S
n
} be a set of sequences. Let |w| denote the length of word w and e(w1, w2) denote the edit distance between words w1 and w2. A flexible repeat is inductively defined as follows:
-
1.
Each maximal repeat is a flexible repeat
-
2.
If A and B are flexible repeats and there exist a support subset of sequences T ∈ S of cardinality at least 2, and words A
i
x
i
B
i
in each sequence S
i
of T satisfying the following constraints:
-
(a)
A
i
and B
i
are occurrences of A and B in sequence S
i
-
(b)
Length condition: |x
i
| ≤ max(|A
i
|, |B
i
|)
-
(c)
Distance condition: e(x
i
, x
j
) ≤ min(|A
i
|, |A
j
|, |B
i
|, |B
j
|) for all pairs S
i
, S
j
in T
then (A, B) is a flexible repeat with occurrences A
i
x
i
B
i
.
The definition recursively accepts chains of maximal repeats separated by variable constrained spacers. The length condition applies on spacers in each sequence individually whereas the distance condition requires a similarity level between all spacers globally. From this general notion of flexible repeat, one can define modules as a selection of flexible repeats that get a sufficient support in the set of sequences, that do not overlap and cover as much as possible of this set. More formally:
-
Modules: Given parameters MinSizeModule and MinSequences, a module M in a set of sequences S = {S1, ..., S
n
} is a flexible repeat satisfying the following constraints:
-
1.
Size condition: Each occurrence of M has length at least MinSizeModule.
-
2.
Support condition: M is present in a support subset of cardinality at least MinSequences of S.
An admissible set of modules M = {M1, ..., M
m
} in a set of sequences S = {S1, ..., S
n
} is a set of modules such that:
-
1.
Partition condition: For two different indices i and j, M
i
and M
j
do not overlap. Moreover, no other flexible repeat contains a module M
i
.
-
2.
Maximality condition: No other flexible repeat fulfilling the previous three conditions (size, support and partition) could be added to M.
Such a definition aims at selecting globally a set of modules that must cover a largest subset of a set of sequences. Once admissibility has been reached, there remains some range of variation to build a set of modules from a set of sequences. We propose an iterative strategy based on a preliminary search for seeds at the core of the largest flexible repeats.
An assembly algorithm for the creation of modules
Targeted modules have sizes greater than MinSizeModule and are present in at least MinSequences sequences. All admissible modules are based on an assembly of maximal repeats. In an initial step, our algorithm will thus build the set of all MRs present in at least MinSequences sequences. This may be achieved in linear time with respect to the cumulated length of the sequences, using a generalized suffix tree [19]. These exact maximal repeats can be considered as seeds which are extended to the left or to the right depending on the admissibility of the extension. This method of seed extension is similar to the method used in Blast [31].
The construction of modules is detailed in Algorithm 1. Its basic data structure is a list L of MR sorted by decreasing size, then by number of occurrences. Each maximal repeat is associated with the sorted list of its occurrences in increasing position. Initially, L contains the whole set of MRs present in at least MinSequences sequences and it is updated after the construction of each module (line 8 and 11 in Algorithm 1).
Algorithm 1
1. BuildModules(L, MinSequences, MinSizeModule)
2. REQUIRE: Sorted list L of possible MR (size m, decreasing order)
3. REQUIRE: Minimal Number of covered sequences MinSequences
4. i ← 1; PairOk ← FALSE
5. COMMENT: Looking for a a pair of MR (Seed, Next) in decreasing order of size in L
6. WHILE (i < m and not PairOk)
7. Seed ← L[i]
8. (Next, PairOk) ← BuildPair(L, Seed, i+1, MinSeq)
9. i ← i + 1
10. IF (PairOk)
11. Discard the paired occurrences A of Seed from L
12. COMMENT: Try to enlarge the current flexible repeat to the left or to the right by a new maximal repeat
13. WHILE (PairOk)
14. Depending on the observed occurrences of the flexible repeat, replace Seed by SeedXNext or NextXSeed
15. Discard the paired occurrences B of Next from L
16. PairOk ← FALSE
17. (Next, PairOk) ← BuildPair(L, Seed, 1, MinSeq)
18. IF (size(Seed) ≥ MinSizeModule)
19. Seed and its occurences as a new module
20. ELSE ∅
For each module, the algorithm considers the largest remaining MRs as seed candidates and looks iteratively and greedily at flexible repeats that can be built from such seeds. At line 8 of Algorithm 1, a flexible repeat made of a flexible repeat Seed and a maximal repeat Next has been discovered and will serve as a new seed for the search of larger flexible repeats (line 10). Once it is not possible to extend it any more (line 14), the last condition to be checked is the size of the obtained module.
The search for maximal repeats to be associated with seeds in flexible repeats is described in Algorithm 2. Associations are represented as AxB or BxA, where A is the largest part and B is the smallest part of flexible repeats. This convention explains the simplified tests for flexible repeat length and distance in Algorithm 2. The spacer x can be an empty sequence (Figure 2). The first condition of flexible repeats is checked in line 5 and the second in line 11 (Algorithm 2). The test in line 11 also checks if there is at least one association in each of MinSequences sequences, the first condition for a flexible repeat to be retained as a potential module. In building flexible repeats, the algorithm chooses the largest maximal repeat B that has the most associations with A.
Algorithm 2
1. REQUIRE: BuildPair(L, Seed, Start, MinSequences)
2. REQUIRE: Sorted list L of possible MR (size m)
3. REQUIRE: A flexible repeat Seed and a starting index Start for the search in L
4. REQUIRE: Minimal Number of covered sequences MinSequences
5. j ← Start; PairOk ← FALSE
6. WHILE (j≤m and not PairOk)
7. Next ← L[j]; Pairs ← Ø
8. FOR (all occurrences A of Seed)
9. Search for the occurrences B of Next such that AxB (orientation = "+") or BxA (orientation = "-") is a subword of the sequence and x has size at most the size of A
10. IF (B exists)
11. Pairs ← Pairs ⋃ {(x, b, orientation)} COMMENT: b is the size of B
12. COMMENT: Build the graph GB of occurrences at a suitable edit distance in Pairs
13. FOR (all pairs of occurrences ((x, b
x
), (y, b
y
)) in Pairs in each orientation
14. IF (the edit distance e(x, y) ≤ min(b
x
, b
y
))
15. Create an edge in GB between vertex (x, b
x
) and vertex (y, b
y
)
16. IF (there exists a clique containing at least MinSequences sequences in G)
17. PairOk ← TRUE
18. j ← j + 1
19. RETURN (Next, PairOk)
The worst case complexity of Algorithm 2 is o(n3), where n is the cumulated size of sequences: it is based on a loop on possible maximal repeats (o(n)) including a loop on possible matching occurrences (o(n) since list of occurrences of A and B can be searched in parallel), the production of a graph of similar occurrences (o(n2)) and a search of a clique of size at least MinSequences in this graph (we use a heuristic search here, keeping only vertices that are connected to at least MinSequences nodes associated to different sequences and using an iterative choice of vertices with highest output degree in the remaining graph. The complexity of this step is thus o(n)). Note that the last step offers no guarantee to always find the clique if it exists. Algorithm 1 is also based on a loop on possible maximal repeats (o(n)) including calls to Algorithm 2. The total worst case complexity is thus o(n4). The main data structure are the list L of occurrences of maximal repeats and the graph G, requiring o(n2) space. In practice, the algorithm is very fast (less than a minute) on typical transposable elements families (e.g. 15 sequences of length 2000 nt). The tool we propose allows in fact a more flexible method of association of A and B. The size of spacer x has to be smaller than a percentage of the size of A (|x| ≤ |A| * percentage). By default, the percentage value is 100% and corresponds to the criterion we have defined. With a lower percentage, it is possible to be more restrictive on the spacer size.
Detection of all modules in sequences
After the creation of a module, the list L of maximal repeats must be pruned of any occurrence that overlaps occurrences of this module in order to fulfill condition 3 of the definition of modules (partition condition). The algorithm stops the search for modules present in MinSequences sequences when the procedure BuildModules returns an empty set.
At the beginning MinSequences is set to the number of sequences present in the input file. After the algorithm has found all modules of size bigger than MinSizeModule in MinSequences sequences, the main loop searches for modules in MinSequences = MinSequences - 1 sequences until MinSequences = 1.
Palindromic modules and truncated modules
Transposable elements have two characteristics that are not taken into account by our module definition. First, elements may be copied in the direct or reverse direction. This leads to frequent palindromic motifs that have to be recovered in the context of flexible repeats. Second, elements may be truncated due to large deletions or a high number of mutations. Some partial flexible motifs may remain interesting to identify for a complete analysis of the transposon structure.
Our tool proposes the identification of reverse modules on the basis of the exact MR they contain. In practice, all MRs are searched both on the direct and on the reverse strand and are consequently labeled. A reverse module has the same composition as the module it is derived from, replacing its MR by the corresponding reverse MR. Once a module has been determined, the presence of its reverse module is systematically looked for in the sequences. The presence of a module in its direct or reverse form is counted whatever its direction. This way, the requested number of supporting sequences MinSequences may be attained by a combination of both directions.
Truncated modules may exist with a conservation level that is very low, one or several MRs being discarded from the original copy. It is difficult to define an absolute conservation threshold that would decide if a given degenerated combination of MR remains or not a truncated version of a module or if it must be considered as a new entity. Modules are often composed in practice by a main founding MR surrounded by several smaller MRs at some distance. We have chosen to require that the largest MR remains present in a truncated version of a module, a simple constraint that ensures at least a core identity with the full module. Unlike the previous case, truncated modules are considered only if the full module exists in the set of sequences. They are added during a second step, once all complete modules have been identified.
Clustering of sequences
After the module detection stage, assume m different modules have been obtained in n sequences. The next step of ModuleOrganizer is to build a hierarchical clustering of sequences (a tree) conducted on the basis of module similarity. This way, the evolution of sequences in the family can be traced back by comparing sequences in decreasing order of similarity of their modular profiles. Basically, each sequence is represented by a vector of values on a set of attributes (variables), one numerical attribute per module that is a counter of its occurrences. If the user does not select the search for reverse or truncated modules, the software creates thus an incidence matrix of dimensions m × n.
If reverse modules are allowed, more attributes are necessary to finely describe the evolution of sequences. For each module we create three attributes: one counting the number of direct occurrences (direct orientation), one counting the reverse occurrences, and a last counter for occurrences of the module (either normal or reverse occurrence). The third attribute allows to measure the convergent evolution of palindromic structures like Inverted Terminal Repeats [32]. Indeed, the transposase of autonomous elements recognizes specific palindromic structure at their extremities [2]. The mutation of one extremity decreases strongly their transposition. A double mutation in both extremities may restore the palindromic structure and transposition. This has been observed for instance within families of mariner-like elements [33]. The incidence matrix has size 3m × n with the reverse option.
The presence of truncated occurrences of modules slightly extends the meaning of the attributes we have just defined. These occurrences correspond to small fragments of entire modules, and are composed of a selection of the module MR. While full module occurrences contain 100% of the cumulated size of MR, truncated occurrences contain a lower percentage of this total size reflecting the deleted fraction of a module MR: a complete direct or reverse occurrence will contribute for 1 and a truncated occurrence will have a strictly smaller positive contribution. For instance in Figure 3, the module M2 has one complete occurrence in sequence A and B and one truncated occurrence that contributes at level 0.5.
Sequences are clustered using a standard Hierarchical Agglomerative Clustering (HAC) algorithm, using the Ward criterion [34]. Ward's criterion states that merging HAC clusters should be focused on minimizing the increase of variance induced by the added interclass variance. Basically, it is an error sum-of-squares criterion. In the first step, the loss of inertia in aggregating sequence pair x and y is computed between each possible pair, where x
i
(y
i
) corresponds to the value of attribute i in sequence x (y) and n is the number of sequences. Starting from clusters reduced to a single sequence with weight 1/n, the pair of clusters minimizing Δ is replaced by its union and the values of the weight and Δ for this new cluster is updated (weights are additive and if x, y and z are three clusters with weight n
x
, n
y
and n
z
, then . The algorithm iterates until all sequences are in the same cluster.