Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale

Background: With rapid advancements in technology, the sequences of thousands of species’ genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de-novo. However, the currently available de-novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate.

Results: To address these limitations, I designed and developed Red, applying Machine Learning. Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome. Red is easy to install and use. It is sensitive to both transposons and simple repeats; in contrast, available tools such as RepeatScout and ReCon are sensitive to transposons, and WindowMasker to simple repeats. Red performed consistently well on seven genomes; the other tools performed well only on some genomes. Red is much faster than RepeatScout and ReCon and has a much lower false positive rate than WindowMasker. On human genes with five or more copies, Red was more specific than RepeatScout by a wide margin. When tested on genomes of unusual nucleotide compositions, Red located repeats with high sensitivities and maintained moderate false positive rates. Red outperformed the related tools on a bacterial genome. Red identified 46,405 novel repetitive segments in the human genome. Finally, Red is capable of processing assembled and unassembled genomes.

Conclusions: Red’s innovative methodology and its excellent performance on seven different genomes represent a valuable advancement in the field of repeats discovery.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0654-5) contains supplementary material, which is available to authorized users.

Once the labeling module has found the local maxima, the boundaries of the candidate regions are determined. Locating candidate repetitive regions involves identifying the interleaving non-repetitive regions. A region is considered non-repetitive (repetitive) if the percentage of low scores (scores ≤ a user-specified threshold) in this region is greater (less) than the expected percentage assuming a uniform background distribution. The expected percentage is the percentage of the scores in the genome that are ≤ the threshold score. For example, suppose that 55% of the scores of a genome are ≤ 2. If these low scores were distributed uniformly, then one would expect the percentage of low scores in a random segment of this genome to be about 55%. However, because non-repetitive regions include a high percentage of low scores, the observed percentage in such regions should be greater than 55%. In the current implementation of Red, if the expected percentage is low, it is increased to 52.5%.
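The classification rule above can be sketched in a few lines of Python. This is a minimal illustration and not Red's actual code: the function names are hypothetical, the text does not say what counts as a "low" expected percentage (the sketch assumes it means anything below 52.5%), and a region whose low-score percentage exactly equals the expected percentage is treated here as non-repetitive, a case the text leaves open.

```python
from typing import Sequence


def expected_low_fraction(
    scores: Sequence[float], threshold: float, floor: float = 0.525
) -> float:
    """Fraction of genome-wide scores that are <= threshold.

    Assumption: "low" expected values are those below `floor` (52.5%),
    in which case the floor is returned instead.
    """
    low = sum(1 for s in scores if s <= threshold)
    return max(low / len(scores), floor)


def is_repetitive(
    region: Sequence[float], threshold: float, expected: float
) -> bool:
    """True if the region holds a smaller share of low scores than expected."""
    low = sum(1 for s in region if s <= threshold)
    # Assumption: equality counts as non-repetitive.
    return low / len(region) < expected


# Toy genome in which 55% of the scores are <= 2, as in the example above.
genome_scores = [1] * 11 + [5] * 9
expected = expected_low_fraction(genome_scores, threshold=2)
```

With this toy genome, a segment dominated by high scores such as `[5, 6, 1, 7]` is classified as repetitive, while `[1, 1, 2, 5]` is not.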

Properties of repetitive regions are used to reduce the number of false maxima. The original scores in the small region flanking a local maximum are examined to ensure that this maximum lies in a repetitive region; the percentage of low scores in the targeted region must be lower than the expected percentage. The targeted region has the same size as the Gaussian mask.
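As a rough sketch of this filtering step, the following Python function keeps only the maxima whose surrounding window contains fewer low scores than expected. The function name is hypothetical, and centering the window on the maximum and truncating it at the sequence ends are assumptions; the text states only that the targeted region flanks the maximum and has the size of the Gaussian mask.

```python
from typing import List, Sequence


def keep_supported_maxima(
    scores: Sequence[float],
    maxima: Sequence[int],
    mask_width: int,
    threshold: float,
    expected: float,
) -> List[int]:
    """Drop local maxima whose flanking window looks non-repetitive."""
    half = mask_width // 2
    kept = []
    for pos in maxima:
        # Assumption: the window is centered on the maximum and
        # truncated at the ends of the sequence.
        start = max(0, pos - half)
        end = min(len(scores), pos + half + 1)
        window = scores[start:end]
        low = sum(1 for s in window if s <= threshold)
        if low / len(window) < expected:
            kept.append(pos)
    return kept


toy_scores = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1]
supported = keep_supported_maxima(
    toy_scores, maxima=[2, 7], mask_width=5, threshold=2, expected=0.55
)
```

In this toy input the candidate at position 2 sits in a window of low scores and is discarded, whereas the candidate at position 7 is retained.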

The presence of local maxima and of high scores is characteristic of repeats, whereas non-repetitive regions consist mainly of low scores. Two definitions, a separator and a core, are required to explain how the boundaries of repetitive regions are determined. A separator is a non-repetitive region located between two consecutive local maxima; consequently, a separator does not include any local maximum. A core is a repetitive region that includes at least one local maximum and is bounded by two separators. To delineate the repetitive regions, the separators are identified first and then the cores. Finally, the boundaries of each core are adjusted.
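The relationship between separators and cores can be illustrated with a short Python sketch. It assumes the separators have already been found and are given as half-open intervals; how Red locates the separators themselves is not detailed here, so that step is omitted. The function name is hypothetical, and treating a sequence end as a valid core boundary is an assumption.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end)


def cores_between_separators(
    separators: Sequence[Interval], maxima: Sequence[int], length: int
) -> List[Interval]:
    """Return the gaps between separators that hold at least one maximum."""
    gaps = []
    cursor = 0
    for sep_start, sep_end in sorted(separators):
        if sep_start > cursor:
            gaps.append((cursor, sep_start))
        cursor = max(cursor, sep_end)
    # Assumption: a gap that runs to the end of the sequence may be a core.
    if cursor < length:
        gaps.append((cursor, length))
    # A core must include at least one local maximum.
    return [(s, e) for s, e in gaps if any(s <= m < e for m in maxima)]


toy_cores = cores_between_separators(
    separators=[(0, 10), (30, 50), (80, 100)], maxima=[15, 60, 65], length=100
)
```

Here the two gaps between the three separators each contain a local maximum, so both are reported as cores.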

The boundaries of a core are expanded or eroded in two stages. The first stage is a step-by-step expansion. During this stage, the small region adjacent to the start or the end of the core is added to the core if this region is repetitive. The size of this region, i.e. the step, is half the width of the Gaussian mask. This stage is repeated until a non-repetitive region is encountered. In the second stage, the start and the end of the core are further expanded or eroded nucleotide by nucleotide. The one-by-one expansion is executed if the score adjacent to the core is greater than the threshold score. In this case, the boundary of the core is adjusted to include this score. Expansion is repeated until a score that is less than or equal to the threshold score is encountered. The one-by-one erosion is executed if the original score at the start or the end of the core is less than or equal to the threshold score. The boundary of the core is then adjusted to exclude this score. Erosion is repeated until a score that is greater than the threshold score is encountered. During the second stage, the start or the end of the core is either expanded or eroded; it is never subject to both operations.
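The two-stage adjustment can be sketched as follows. This is an illustrative reading of the description, not Red's implementation: the names are hypothetical, the core is represented as a half-open interval, and a partial step that would run past the end of the sequence is simply skipped, which is an assumption.

```python
from typing import Sequence, Tuple


def _low_fraction(scores: Sequence[float], start: int, end: int, threshold: float) -> float:
    """Share of scores in scores[start:end] that are <= threshold."""
    window = scores[start:end]
    return sum(1 for s in window if s <= threshold) / len(window)


def adjust_core(
    scores: Sequence[float],
    start: int,
    end: int,
    mask_width: int,
    threshold: float,
    expected: float,
) -> Tuple[int, int]:
    """Adjust the half-open core [start, end) in two stages."""
    n = len(scores)
    step = max(1, mask_width // 2)  # the step is half the mask width

    # Stage 1: add adjacent steps while they are repetitive.
    while start - step >= 0 and _low_fraction(scores, start - step, start, threshold) < expected:
        start -= step
    while end + step <= n and _low_fraction(scores, end, end + step, threshold) < expected:
        end += step

    # Stage 2: each boundary is either expanded or eroded, never both.
    if start > 0 and scores[start - 1] > threshold:
        while start > 0 and scores[start - 1] > threshold:
            start -= 1  # expand to include a high score
    else:
        while start < end and scores[start] <= threshold:
            start += 1  # erode to exclude a low score

    if end < n and scores[end] > threshold:
        while end < n and scores[end] > threshold:
            end += 1
    else:
        while end > start and scores[end - 1] <= threshold:
            end -= 1
    return start, end


toy = [1, 1, 1, 1, 9, 9, 1, 9, 9, 9, 9, 9, 1, 9, 9, 1, 1, 1, 1, 1]
adjusted = adjust_core(toy, start=8, end=12, mask_width=8, threshold=2, expected=0.55)
```

In this toy example the core first grows by one step of four positions on each side, and the second stage then erodes a single low score from its right end.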

• The labeling module: It takes n to score a sequence, m × n to smooth the scores, n to calculate the first derivative, n to calculate the second derivative, n to find the local maxima, at most n to find the separators, and at most n to expand the candidate regions. Therefore, the total time required by the labeling module is (6 + m) × n.
• The training module: The module takes n to score a sequence, n to calculate the states using a logarithmic function of the scores, at most n to calculate the prior probabilities, and n to calculate the transition probabilities. The total time required by the training module is 4 × n.
• The scanning module: It takes n to score a sequence, n to calculate the states, and s × n to calculate the optimal series of states according to Viterbi's algorithm. The total time for the scanning module is (2 + s) × n.
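The totals above can be checked by summing the per-step costs. The short sketch below does so in Python; the function names are hypothetical, and the symbols n, m, and s are used exactly as in the list above (presumably the sequence length, the width of the smoothing mask, and the number of states, although this excerpt does not define them).

```python
def labeling_cost(n: int, m: int) -> int:
    """Sum of the per-step costs of the labeling module."""
    steps = {
        "score": n,
        "smooth": m * n,
        "first_derivative": n,
        "second_derivative": n,
        "local_maxima": n,
        "separators": n,  # upper bound
        "expand": n,      # upper bound
    }
    return sum(steps.values())


def training_cost(n: int) -> int:
    """Score, states, priors (upper bound), and transitions: n each."""
    return n + n + n + n


def scanning_cost(n: int, s: int) -> int:
    """Score, states, and the Viterbi pass."""
    return n + n + s * n


# Arbitrary illustrative values, not taken from the paper.
n, m, s = 1_000, 40, 4
```

For any choice of n, m, and s, the three sums reduce to (6 + m) × n, 4 × n, and (2 + s) × n, so each module runs in time linear in n when m and s are fixed.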