RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

Background Repetitive sequences account for a large proportion of eukaryotes genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. Results In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from whole NGS reads according to certain rules based on the high-frequency k-mer. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered as an outstanding genome assembler with NGS sequences. Conlusions We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.

Interspersed repeats are dispersed throughout the genome and mainly composed of transposable elements (TEs). Transposable elements account for a large fraction of the genome and have influence on much of the mass of DNA in eukaryotic genomes [1]. In many organisms, repeated sequences make up a significant part of whole genomes, e.g. over two-thirds of the human genome [2], over 75% of the maize genome [3] and about 20% of the Drosophila melanogaster genome [4] are repetitive. For many basic analysis methods of genome sequences, such as de novo assembly, sequence alignment, sequence error correction, etc., repetitive sequences pose a challenge to these tasks [5].
In recent years, many methods have been proposed with the developments and applications of the high-throughput sequencing and single molecule sequencing. Tallymer [6] counts k-mer occurrences from sequence sets and puts k-mer into enhanced suffix arrays to find repetitive sequences. RepeatExplorer [7] is a collection of software tools with a web interface and utilizes a graph-based sequence clustering algorithm to facilitate de novo repeat identification. RepARK [8] creates repeat libraries from NGS reads by assembly of the high-frequency k-mer without using a reference genome, which has two working modes based on the assembly tools, RepARK Velvet and RepARK CLC. MixTaR [9] firstly detects raw tandem repeat patterns from short reads, then selects out long reads containing raw tandem repeat patterns, and finally regards the assembly of the short reads which overlap long reads as TR repeats. RepLong [10] creates a read overlap network and uses community detection algorithm to detect long repeats by using only Pacbio long reads. REPdenovo [11] counts k-mer frequency and assembles k-mer into raw contigs, then merges contigs from directed contig graph into scaffolds to obtain repeats. The k-mer frequency is typically calculated from the NGS reads and the repeats are obtained by assembly of the high frequency k-mers. For example, RepARK and REPdenovo are both methods based on this principle. However, these approaches have some drawbacks.
Firstly, the higher sequence coverage is a requirement for common assemblers to work properly. For example, EPGA [12] ,EPGA2 [13] and SPAdes [14] usually require read coverage of more than 30× , in which SPAdes is considered as an outstanding genome assembler with NGS sequences. Compared with other sequence assemblers, SPAdes uses multiple de Bruijn graphs to construct contigs. This strategy can effectively reduce assembly errors while making full use of a variety of k-mers of different sizes to build more complete assemblies [15]. Due to the dual effects of sequencing bias and the complexity of the repetitive region structure, the fragment can be repeated from several times to serval thousand times. For k-mers converted from fragments with a large number of repetitions, their frequencies usually much higher than the average coverage of sequencing, and for others with few repetitions may not meet the basic coverage requirement. Therefore, the repetitive fragments with low repeatability are difficult to obtain by assembly, and the integrity of the test results is greatly affected.
Secondly, most of the widely-used NGS assemblers are designed based on the pairedend reads, which often resolve the branch path resulting from the repetitive region by using paired-end reads of large insertsizes. However, the length of the high frequency k-mers is too short compared with the reads, which is detrimental to the recovery of the repetitive segments. In addition, splitting reads into k-mers may destroy the structure of the repetitive regions, which means that assemblers do not perform well under such conditions.
Finally, due to the length of the k-mer is short, the sequencing error has a great influence on the result of the k-mer frequency counting. Assembly of k-mers may bring the sequencing errors within these k-mers into repeats, resulting the accuracy of the repetitive regions decreases. In this study, a method called RepAHR is proposed to overcome the shortcomings mentioned above and get more accurate repeats. Compared the repeats identified by RepAHR with these of RepARK and REPdenovo on five NGS datasets, the experimental results show that RepAHR outperforms RepARK Velvet, RepARK CLC and REPdenovo in some aspects.

Metrics for evaluation
In order to comprehensively evaluate the performance of each tool, we use seventeen evaluation metrics in this experiment, which are Num, Max (bp), Min (bp), N50 (bp), N90 (bp), Avg_length (bp), AR (%), Alignment ratio (%), Multiple alignment ratio (%), Masked ratios on reference genome (%), Masked ratios on RepBase sequences (%), User Time (s), System Time (s), Percent of CPU this job got (%), Maximum resident set size (kbytes), Virtual Memory (kb) and File system outputs. Among them, 'Num' denotes the number of segments in detection results, 'Max (kb)' denotes the length of the largest segment in detection results, 'N50 (kb)' is the length of the longest segment such that all the segments longer than this segment cover at least half (50%) of the total length of all segments, 'N90' are calculated in a similar way, ' Avg_length (bp)' denotes the average length of segments, ' AR (%)' denotes alignment ratio of the high-frequency k-mer or the high frequency reads on RepBase sequences, ' Alignment ratio (%)' is the proportion of fragments in the detected results that can be aligned to the reference genome, 'Multiple alignment ratio (%)' is the proportion of fragments in the detection results that can be aligned to multiple locations on the reference genome, 'Masked ratios on reference genome (%)' is the proportion of bases on the reference genome that can be covered by the detection results, 'Masked ratios on RepBase sequences (%)' is the proportion of bases in RepBase that can be covered by the detection results, 'Virtual Memory (kb)' indicates the peak virtual memory consumption of algorithms.
We evaluate the repeats identified by RepAHR, RepARK and REPdenovo on five NGS data sets. Table 1 shows the comparison of the repeats identified by RepAHR, RepARK and REPdenovo in terms of number of repeats, total bases number of repeats, N50 and length of repeats. REPdenovo can identify few repeats and its bases number is the least among the four methods, but the average length of its repeats is the longest. RepARK CLC identifies more repeats and bases number than REPdenovo, but repeats identified by RepARK CLC are shorter than repeats identified by REPdenovo. Compared with the other three tools, RepARK Velvet can identify more repeats, but the bases number of its repeats is less than RepAHR in Drosophila melanogaster, Homo sapiens chr14 and Mus musculus. The repeats identified by RepARK Velvet have the shortest average length, which can be seen from the N50 and N90. RepAHR identifies fewer repeats than RepARK Velvet, but the size of repeats identified by RepAHR is larger than that obtained by RepARK Velvet. In terms of sequence length, the maximum repeat length identified by RepAHR is the longest of the four methods, but it can be seen from the N50 and N90 that the repeats identified by RepAHR also contains many short fragments. The long and short fragments combined together make the final repeats identified by RepAHR more complete.

The high-frequency k-mers and reads
Among RepAHR, RepARK Velvet, RepARK CLC and REPdenovo, the most obvious difference between RepAHR and the other three methods is that RepAHR does not directly use the high-frequency k-mers for sequence assembly. RepAHR first uses the highfrequency k-mers to find the high-frequency reads, then assembles the high-frequency reads to obtain repeats. In order to prove that the high-frequency reads have some advantages over the high-frequency k-mers, we compare the high-frequency k-mers and the high-frequency reads in experiments.
Repbase is the most widely-used database of repetitive DNA sequences in which the currently known repetitive DNA sequences of many eukaryotes are stored [16]. The similarity between a sequence and the sequences in Repbase can be used as a criterion for determining whether the sequence is a repetitive sequence. In the test, three NGS datasets (D.mela, M.musc and H.sapi) are used to obtain the high-frequency k-mers, and the high-frequency reads are selected by RepAHR from all reads according to the highfrequency k-mers. RepARK, REPdenovo and RepAHR differ in the method of determining the high-frequency k-mers threshold. In our test, the high-frequency k-mers is Table 1 Metrics of repeats on five datasets ′ Num ′ indicates the number of repeats. ′ Size ′ indicates the total length of all repeats. ′ Max ′ represents the length of the longest segment in the repeats. ′ Min ′ represents the length of the shortest segment in the repeats. ′ N50 or N90 ′ represents the length of the longest segment such that all the segments longer than this segment cover at least 50% or 90% of the total length of the assemblies The comparison of the number, N50, the average length and the maximum length of aligned segments obtained based on the high-frequency k-mers and reads is shown in Table 2. N50 is the length of the longest contig such that all the contigs longer than this contig cover at least half of the genome being assembled [17]. In addition, the alignment ratios of the high-frequency k-mers and reads compared with the Repbase are shown in Table 2. In Table 2, the alignment ratios of the high-frequency reads are much higher than the alignment ratios of the high-frequency k-mers. It also can been seen that the number of aligned segments obtained by the high-frequency k-mers is much larger than that obtained by the high-frequency reads in three species. At the same time, the length of the alignment segments obtained by the high-frequency reads is greater than that obtained by the high-frequency k-mers, which can be seen from the variation of the N50 and the average length of detection results in Table 3. The maximum length of the aligned segments covered by the high-frequency reads is the same as the k-mers. This shows that the high-frequency reads and k-mers are similar in covering the long alignment segments. As can be seen from Table 3, the total number of alignment segments obtained by the high-frequency reads is less than that obtained by the high-frequency k-mers, but the number of alignment segments obtained by the high-frequency reads is greater than that obtained by the high-frequency k-mers in the long length portion. It can be inferred that the short alignment segments obtained by the high-frequency k-mers are formed by gaps in which continuous regions cannot be communicated. The visualization tool IGV [18] is used to demonstrate the alignment of the high-frequency reads and k-mers with Repbase sequences. Figure 1 shows the case where the sequence named BATUM_I in the Drosophila melanogaster Repbase library is covered by the high-frequency reads and k-mers. There is a gap on the region that is not covered by the high-frequency k-mers, while this region is completely covered by the high-frequency reads, which proves that the high-frequency reads is easier to obtain a complete repeat region.
Based on the above results and analysis, it can be considered that the high-frequency reads is longer, more continuous, and more accurate than the high-frequency k-mers, and it is more suitable for obtaining repeats by sequence assembly. Compared with the previous two methods, RepAHR not only replaces the high-frequency k-mers with the high-frequency reads, but also preserves the information of the paired-end reads as much as possible to assist in the assembly of repetitive regions. It is well known that the paired-end reads can span hundreds to thousands of bp (base pair), so using its supporting information [19], RepAHR can assemble and identify longer repetitive regions.

Verification of the authenticity of repeated sequences detected by RepAHR
In order to prove that the repeats recognized by RepAHR are not only long , but also are the true repeats in the genome. We take a special repetitive region detected by three tools on dataset of human-r14 as an example to analyze, just as shown in Fig. 2. In Fig. 2, the first block shows the alignment relationships between the repetitive fragments  Table 2 Metrics of alignment segments ′ Num ′ indicates the number of alignment segments. ′ N50 ′ indicates the length of the longest segment such that all the segments longer than this segment cover at least 50% of the total length of all segments. ′ Avg Length(bp) ′ indicates the average length of segments. ′ Max ′ indicates the maximum length of a repeat. ′ AR(%) ′ indicates alignment ratio of the high-frequency k-mers or the high-frequency reads on Repbase sequences

Segment type Num N50 (bp) Avg length (bp) Max (bp) AR (%)
Drosophila  detected by RepARK and the reference genome, the second block shows the alignment relationships between the repetitive fragments detected by REPdenovo and the reference genome, and the third block shows the alignment relationships between the repetitive fragments detected by RepAHR and the reference genome. From Fig. 2, we can find that the repetitive fragment detected by RepAHR is more complete than that obtained by RepARK, and there is no fragment detected by REPdenovo in this region. Therefore, RepAHR obtains the longest and the most complete repetitive fragment in this region. Next, we need to prove whether the repetitive fragment obtained by RepAHR is real. In order to achieve the purpose of verification, we collected the repeated fragments obtained by different tools in this region, and aligned them to the human-r14 reference sequence by using the bowtie2 [20] aligner. The alignment results show that the repetitive fragment detected by RepAHR can be aligned to different locations on the human-r14 reference sequence, and its coverage regions include the coverage regions of the RepARK's detection results. The sequence fragment detected by RepAHR is not only a true repeat in the genome, but also the longest and most complete fragment in the detection results of three tools.
In order to prove that the repetitive regions detected by RepAHR which cannot or cannot fully be identified by other tools, we compared the repetitive regions detected by the three tools on the human-r14 dataset. The benchmark for comparison is the repetitive regions on the human-r14 dataset provided by a third-party library, and the comparison results are shown in Fig. 3. From Fig. 3, we find that RepAHR can detect some specific regions that other tools cannot identify or cannot fully identify. In this step, Bowtie2 [20] is used as a aligner.
In order to verify the influence of read coverage on the detection effect of each tools, we designed a simulation experiment based on the human-r14 reference sequence. The specific process of the simulation experiment is as follows: Firstly, four sets of simulation sequencing libraries with different coverage are generated by ART (an NGS read simulator) using the human-r14 reference sequence as the template. Secondly, we have tested the three tools on four simulated sequencing libraries, and the test results are shown in Table 4 The average read coverage represents the average sequencing depth at each regions of the genome, which reflects the sequencing level of the entire genome. From Table 4, we can see that the average read coverage have a certain influence on the detection results of each tools. From the overall trend, the max length and N50 of the fragments in the test results increases as the average coverage of the reads increases. The main reason for this phenomenon is that the sequence assembly process is extremely susceptible to changes in read coverage. For example, Velvet, SOAPdenovo2, Abyss, IDBA and SPAdes usually require read coverage of more than 30X to meet the basic requirements of assembly, otherwise the assemblies will be very fragmented. From the  Table 4 The influence of read coverage on the detection effect of each tools ′ Cov/read_len (bp) ′ indicates the average coverage of sequencing and the average length of reads. ′ Num ′ indicates the number of repeats. ′ MA_ratio (%) ′ represents the multiple alignment ratio of repeats. ′ Max ′ represents the length of the longest segment in the repeats. ′ Min ′ represents the length of the shortest segment in detected repetitive segments. ′ N50 or N90 ′ represents the length of the longest segment such that all the segments longer than this segment cover at least 50% or 90% of the total length of the assemblies  Table 4, we can find that the impact of read coverage on the detection results of tools is inevitable, and this impact is not specific to RepAHR, but also to these two similar tools. Generally, there is a certain conversion relationship (just as shown in equation (2) of the main text) between the average coverage of reads and the frequency of k-mers converted from these reads. Therefore, we can obtain the average read coverage by using the k-mer frequency distribution information, and we can also obtain the high-frequency k-mers and the high-frequency reads in the global genome based on the average read coverage. From a global perspective, assuming that the sequencing is roughly balanced, the high-frequency k-mers must be from the repetitive regions. Therefore, the reads containing more the high-frequency k-mers must also from the repetitive regions. The conclusion is that as long as the high-frequency k-mers contained in a read reaches a certain proportion, this read must be included in RepAHR without being missed. On the contrary, the reads missed by RepAHR must be those that do not meet the above requirements. We are not sure whether they come from the repetitive regions.
The running time, CPU and memory consumption of RepAHR and other tools on five datasets are shown in Table 5. From the perspective of running time, RepARK has the most advantages among these three tools, and the longest running tool is sometimes REPdenovo and sometimes RepAHR. From the perspective of memory consumption, RepAHR and RepARK are more dominant on some datasets, for example, the memory consumption of the two tools on the first three datasets is almost the same. From the perspective of CPU usage, REPdenovo has the highest degree of parallelism. From the perspective of system throughout, RepAHR is dominant in most cases.

Repeats on reference genomes
In the experiment, Bowtie2 [20] is used as a aligner to obtain the overall and the multiple alignment ratios between the repeats and their respective reference sequences on the five data sets. The overall alignment ratio refers to the ratio of the number of repeats that can be aligned to the reference genome at least one position to the total number of repeats. The multiple alignment ratio refers to the ratio of the number of repeats that can be aligned to the reference genome multiple positions to the total number of repeats.
As can be seen from Table 6 and Fig. 4, RepAHR gets the higher multiple alignment ratios than the other three methods on the five data sets. In particular, RepAHR has obvious advantages in the multiple alignment ratio on Saccharomyces cerevisiae dataset. The experimental results show that the repeats identified by RepAHR are more repetitive and accurate than the other three methods. As can be seen from Table 6, the proportion of repeats identified by RepAHR can be aligned with multiple positions of the reference genome is higher than the other tools. For example, Bowtie2 is used to align repeats identified by RepAHR to the reference genome of D. melanogaster. The alignment results show that more than half of these repeats can be aligned to more than 100 positions of the reference genome, just as shown in Fig. 5.
RepeatMasker [21] is a tool for masking interspersed repeats and simple tandem repeats by using sequence alignment, which is used to obtain the masked ratio of the repeats identified by RepAHR, RepARK CLC, RepARK Velvet and REPdenvo on the reference, respectively. As can be seen from Fig. 6, the masking ratio of the repetitive sequences generated by RepAHR is higher than that of the repetitive sequences generated by RepARK CLC, RepARK Velvet and REPdenovo.

Repeats on Repbase sequences
Repbase [22] is the most commonly used database of repetitive DNA sequences. In this study, we use RepeatMakser [21] to compute the ratio of bases in Repbase library that can be covered by detection results of tools. As can be seen from Fig. 7, the masked ratio of RepAHR is higher than that of RepARK CLC, RepARK Velvet and REPdenovo. It is indicated that the repeats identified by RepAHR is closer to the repeats collected in the Repbase database. Therefore, the repeats identified by RepAHR is more accurate and reliable.
BLAST [23] can be used to find the similarity between sequences based on sequence alignment, which is applied to obtain the coverage ratio of the repeats in Repbase that can be covered by the repetitive fragments detected from RepARK CLC, RepARK Velvet, REPdenovo and RepAHR in this study. There are two strategies used to perform the alignment of the repeats identified by each tools with the corresponding Repbase library, which indicate that the fragments in Repbase library can be covered by the single alignment repeats and multiple alignment repeats, respectively. The coverage ratios obtained by using these two strategies are called the best coverage ratio and maximum coverage ratio, which can be abbreviated as BCR and MCR, respectively. Then, we count the number of sequences in Repbase that can be covered by repeats with BCR and MCR from 0% to 100% . The number of sequences in Repbase with BCR and MCR more than 50%, 70% and 90% of the four methods is shown in Table 7, respectively. It can be seen that the sequences in Repbase with high BCR and MCR covered by the repeats identified by RepAHR are more than the other three methods. Figure 8 shows the distribution of the coverage ratio obtained from the four methods by using the BLAST. Each box shows the range of 50% in the middle, and the horizontal line in the box represents the median of the coverage ratios, the x marked in the box indicates the mean of the coverage ratios. Figure 8a shows the distribution of BCR on the Repbase sequences. As can be seen from this figure,the BCR of the repeats identified Alignment ratios and multiple alignment ratios of repeats which generated from four tools on five different datasets. 'Alignment ratios(%)' is the proportion of fragments in the detected results that can be aligned to the reference genome, and 'Multiple alignment ratios(%)' is the proportion of fragments in the detected results that can be aligned to multiple locations on the reference genome by RepAHR is higher than that of the other three methods. This is manifested that the median, mean, and upper quartile and lower quartile of the repeats generated from RepAHR are greater than that of the other three methods. Similarly, the distribution of  MCR on the Repbase sequences is shown in Fig. 8b. The upper quartile of the repeats generated from RepAHR is close to 100%, which means that more than 25% of the Repbase sequences are 100% covered by the repeats identified by RepAHR. In addition, the median, mean, and lower quartiles of the repeats generated from RepAHR are higher than those obtained from the other three methods. It is indicated that the repetitive sequences identified by RepAHR are more similar to the repeats in Repbase, and also closer to the standard repetitive sequences. From the data shown in Tables 5 and 8, we can see that RepAHR has problems with long running time and large memory requirements on some datasets (such as Acromyrmex echinatior and Musmusculus). Those problems have caused our great concern. To 15.    Table 8 The comparison of CPU time and virtual memory consumption of these three tools when using default parameters to process these five datasets

RepAHR [CPU time (s)/ virtual memory (kb)]
shown in Table 9; (2) through the analysis, we find that there are two important parameters in RepAHR (MAX_MEMORY and THREAD) that affect the running time and memory consumption. The statistics of the limit effect of the parameter of MAX_MEMORY on the memory usage of RepAHR when processing the Human-r14 dataset is shown in Table 9. The statistics of the impact of concurrency on the memory usage of RepAHR when processing the Human-r14 dataset is shown in Table 10. As can be seen from the results shown in Table 9 and Table 10, when we lower the values of the two parameters MAX_MEMORY and THREAD, the memory requirements of the RepAHR will be greatly reduced. In addition, the running time of RepAHR can also be similar to the other two tools, especially when MAX_MEMORY and THREAD are set to 20 respectively, just as the shown in Table 11; (3) the causes of the above phenomenon are analyzed. First, the parameter THREAD is used to set the number of threads when the program runs in parallel. In theory, the more threads, the shorter computing time, and the higher computing cost. When we reduce the number of threads, we will get the opposite conclusion (the memory cost is reduced, and the running time is longer). However, this statement is only theoretical. In fact, it is not necessary to increase the thread to improve the computing speed. This is because if the number of threads is set too much, it will intensify   Table 11 The comparison of CPU time and virtual memory consumption of these three tools when RepAHR using the specific parameters of MAX_MEMORY=20, THREAD=20 to process these five datasets

RepAHR [CPU time (s)/virtual memory (kb)]
the competition of system resources and increase the delay of queuing. So in this case, when we appropriately reduce the number of threads, the memory consumption of the algorithm will be reduced on a large scale, and the calculation time of the algorithm will not be greatly affected. Second, the parameter MAX_MEMORY is used to set the maximum memory usage of the algorithm. When we reduce the value of this parameter, the operation system will control the amount of memory that the algorithm resides within the set MAX_MEMORY range. At this time, the program needs to deal with the limitations of the operation system through page scheduling. Therefore, when we reduce the value of this parameter, the memory consumption of the algorithm will be controlled, but the running speed of the algorithm will also be affected.

Recommendations on the parameters configuration of the proposed tool when resources are limited
Sequence assembly is the most time-consuming and memory-consuming step in the entire processing flow of RepAHR. Therefore, as long as the running time and memory consumption of this step can be controlled by adjusting the parameters MAX_MEMORY and THREADS, the tool can still run normally under the condition of limited resources. SPAdes uses 512 Mb per thread for buffers, which results in higher memory consumption (The default value of parameters MAX_MEMORY and THREADS in SPAdes are set to 250GB and 16 respectively). If you set memory limit manually, SPAdes will use smaller buffers and thus less RAM. The parameter MAX_MEMORY set memory limit in Gb. SPAdes terminates if it reaches this limit. Actual amount of consumed RAM will be below this limit. Make sure this value is correct for the given machine. SPAdes uses the limit value to automatically determine the sizes of various buffers, etc. The parameter THREADS is used to set the number of threads using in SPAdes assembly, and the default value of it is 16. The larger the number of threads is, the faster the SPAdes assembly speed, and the memory consumption will also increase. Therefore, the value of the parameter MAX_MEMORY should be set according to the available memory capacity of the machine. For example, if the available memory of the machine is 300GB, then MAX_MEMORY can be set to 250GB at most. It should be noted that SPAdes assembly usually has the minimum memory requirements. For example, when the machine's available memory are less than 20GB, the tool will report an error. In summary, the value of MAX_MEMORY should be taken in the interval [20, available_memory], and the number of threads should be calculated based on the actual processing speed requirements and the overall memory limit.

Conclusions
In molecular biology, it is important to accurately detect repeats in the DNA sequences. With the development of the next-generation sequencing, more and more tools have been proposed for identification of repeats, including RepARK and REPdenovo. In this paper, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. The core steps of RepAHR are as follows: Firstly, RepAHR filters the high-frequency reads from overall NGS data according to certain rules based on the high-frequency k-mers. Secondly, it identifies repeats by assembly of the high-frequency reads. The main advantages of RepAHR are reflected in the following two aspects: 1 the high-frequency reads achieve enough coverage and longer than the k-mers, which facilitates the assembly process. Compared with the previous two methods, RepAHR not only replaces the high-frequency k-mers with the high-frequency reads, but also preserves the information of the paired-end reads as much as possible to assist in the assembly of repetitive sequences. It is well known that the paired-end reads can span hundreds to thousands of bp (base pair), so using its supporting information, RepAHR can assemble and identify longer repetitive regions. 2 due to the sequencing errors and bias, the k-mers from the repetitive regions do not necessarily all show high frequencies, and the k-mers with low frequencies are not necessarily all from the non-repetitive regions. In addition, in the case of a short k-mer size, the error k-mers also have the opportunity to couple together and behave as high frequencies. In these cases, it is unreliable to obtain repeating regions directly by assembly of the high-frequency k-mers. However, the strategy based on the highfrequency reads assembly can effectively circumvent this problem. Firstly, RepAHR has set a stricter filtering strategy in the process of selecting the high-frequency reads, which makes it less likely that error k-mers are used to form repetitive fragments. Secondly, RepAHR also set multiple verification strategies in the process of finalizing the repetitive fragments to ensure that the detection results are authentic and reliable.
The comparison of the repeats identified by RepAHR, RepARK CLC, RepARK Velvet and REPdenovo based on the five NGS datasets. We use multiple metrics, including some basic metrics of the repeats, alignment rate on reference genome, masked ratio on RepBase and so on, to evaluate the performance of each tools. The experimental results show that repeats obtained by RepAHR are more precise and reliable than that of RepARK CLC, RepARK Velvet and REPdenovo.

Methods
As shown in Fig. 9, the proposed method(RepAHR) contains the following phases. Firstly, RepAHR convents overall NGS short paired-end reads into unique k-mers and gets their frequencies by using Jellyfish [24]. Secondly, RepAHR needs to determine whether the average read coverage is known. If it is known, its value can be used to calculate the threshold of the high frequency k-mer directly. Otherwise, RepAHR needs to estimate it based on the k-mer frequency distribution. Thirdly, RepAHR generates a high-frequency k-mer set based on overall k-mers and the high frequency threshold. Fourthly, RepAHR obtains the high-frequency reads from whole NGS short paired-end reads based on the high-frequency k-mer set. Finally, RepAHR gets assemblies of the high-frequency reads, and then scans these contigs to obtain the final repeats.

Determining the threshold of the high-frequency k-mer k-mer frequency counting
This section corresponds to Step1 in the flowchart (Fig. 9). k-mer counting can be

NGS Paired-end reads
Step1: k-mer frequency counting Step2: Whether the average read coverage is known Step4: The average read coverage estimation

Step3: Construction of the k-mer frequency distribution
Step5: Getting threshold of the high frequency k-mer based on the average read coverage Step6: Getting the high frequency kmers based on the high frequency threshold Step7: Filtering the high frequency reads from overall NGS paired-end reads based on the high frequency k-mers Step8: Assembly of the high frequency reads Step9: Scanning assemblies Repeats End Fig. 9 The illustration of the pipeline of RepAHR fulfilled by many available tools, such as Jellyfish [24] and KMC2 [25]. In this step, RepAHR needs to generate two groups of k-mers with different lengths by using Jellyfish. The k-mer length in the first group is about 15bp-20bp according to the literature [6], while the second group is set to 31 bp which is an empirical value obtained through a lot of experiments.

Determining whether the average read coverage is known
This section corresponds to Step2 in the flowchart (Fig. 9). In this section, RepAHR needs to determine whether the average read coverage is known. If it is known, its value can be used to calculate the threshold of the high frequency k-mer directly (the algorithm moves to Step5). Otherwise, RepAHR needs to estimate it based on the k-mer frequency distribution (the algorithm moves to Step3).

Constructing the k-mer frequency distribution
This section corresponds to Step3 in the flowchart (Fig. 9). Suppose n k-mers and their frequencies are obtained from the statistics in the previous step. Let K denote a list containing all k-mer and K i be the i-th k-mer in the list (i = 1, 2, 3..., n). While list F is used to store the frequency of each k-mer in K, so F has a one-to-one relationship with K, for example, F i = t indicates that k-mer K i appears t times in the input reads. Given a frequency number of t, the number of k-mer that have a frequency number of t is denoted as f(t), and the value of f(t) can be calculated as follows: Based on each pair of t and f(t), RepAHR plots the k-mer frequency distribution curve as shown in Fig. 10, which is plotted using Drosophila melanogaster sequencing data downloaded from the Short Read Archive (http://www.ncbi.nlm.nih.gov/sra), while the x-axis refers to the frequency of k-mer and the y-axis refers to the total number of the frequency appearing. If the input read data is evenly distributed over the reference genome, the k-mer frequency distribution curve usually forms a Poisson distribution [26] or Gaussian distribution [27] after the steep decreasing at the beginning of the curve, as shown in Fig. 10. According to this feature, we can always find the main peak point of the curve.

Estimating the average read coverage
This section corresponds to Step4 in the flowchart (Fig. 9). RepAHR estimates the average read coverage using a method similar to that in literature [28]. The calculation principle is shown as follows: Where p is the horizontal coordinate of the main peak in the k-mer frequency distribution histogram, length is the average length of the input NGS reads, k is the k-mer length (1) (2) Cov = p * length length − k + 1 used in estimation which is settled to 15 by default, and Cov is the average read coverage estimated.

Getting the threshold of the high-frequency k-mer based on the average read coverage
This section corresponds to Step5 in the flowchart (Fig. 9). If the average read coverage (Depth) is known, RepAHR can multiply it with the coverage factor to obtain the threshold of the high frequency k-mer t 1 ( t 1 =c × Depth , c ∈ [1.5, 3]) directly. Otherwise, RepAHR needs to multiply the estimated read coverage Cov generated in the previous step by the coverage factor c to obtain the threshold t 1 ( t 1 =c × Cov , c ∈ [1.5, 3]). Where c is reasonable to set between 1.5 and 3, the larger c is, the more stringent the selection of the high-frequency k-mer is.

Getting the high-frequency k-mers based on the high frequency threshold
This section corresponds to Step6 in the flowchart (Fig. 9). In this section, RepAHR first obtains a k-mer set with a length of about 31 bp generated in the Step1. After that, the k − mer i whose frequency f(i) is lower than the threshold t 1 are discarded, and the remaining k-mers are composed as a high-frequency k-mer set as S h . The principle of generating S h is shown as follows.

Filtering the high frequency reads from overall NGS reads based on the high frequency k-mers
This section corresponds to Step7 in the flowchart (Fig. 9). For each read, all k-mers included in them can be denoted as S={s 1 , s 2 , ..., s q }, respectively. RepAHR checks how (3) S h = {k − mer i |∀k − mer i ∈ K , f (i) ≥ t 1 } Fig. 10 The k-mer frequency distribution histogram. In this figure, the blue line is the number of k-mer with a specific frequency, the orange dotted line is a Gaussian fit to the trend near the main peak of the blue line, and the green dotted line is the vertical line from the position at the main peak to the x-axis, and p is the position where the green dotted line intersects the x-axis many k-mers in S appear in the high frequency k-mer set S h . If both the k-mers at the beginning and at the end of a read ( s 1 and s q ) are included in set S h , and the number of the high-frequency k-mer in S has reach a certain percentage threshold t 2 . Then the read is classified as a high-frequency read. RepAHR sets t 2 to 90% by default. The principle of this process is shown in Fig. 11.

Assembling the high frequency reads
This section corresponds to Step8 in the flowchart (Fig. 9). In this phase, SPAdes [14] is used to achieve the contigs of the high-frequency reads. Subsequently, the contigs with low coverage are filtered out [29]. The contigs left are the initial repeats determined by RepAHR.

Scanning assemblies
This section corresponds to Step9 in the flowchart (Fig. 9). In this step, RepAHR needs to sort the initial repetitive sequences generated from SPAdes according to the fragment length, and removes some fragments whose length cannot meet the requirements. After that, the fragments left are the final repeats determined by RepAHR. Fig. 11 Schematic diagram of generating the high-frequency reads. In this figure, a green line on the left denotes a high-frequency k-mer, all these k-mers constitute a high-frequency k-mer set. The blue line denotes the NGS reads, and the green and red line segments under the blue line represent all the k-mers generated from an NGS read. A green line denotes a k-mer which appears in the high-frequency k-mer set, and a red line denotes a k-mer which does not appear in the high-frequency k-mer set. The diagram contains a matched case and an unmatched case on the right Homo sapiens chr14 GAGE DataSet3 Library 1 9.8