De novo assembly of bacterial genomes with repetitive DNA regions by dnaasm application

Background Many organisms, in particular bacteria, contain repetitive DNA fragments called tandem repeats. These structures are restored by DNA assemblers by mapping paired-end tags to unitigs, estimating the distance between them and filling the gap with the specified DNA motif, which could be repeated many times. However, some of the tandem repeats are longer than the distance between the paired-end tags. Results We present a new algorithm for de novo DNA assembly, which uses the relative frequency of reads to properly restore tandem repeats. The main advantage of the presented algorithm is that long tandem repeats, which are much longer than maximum reads length and the insert size of paired-end tags can be properly restored. Moreover, repetitive DNA regions covered only by single-read sequencing data could also be restored. Other existing de novo DNA assemblers fail in such cases. The presented application is composed of several steps, including: (i) building the de Bruijn graph, (ii) correcting the de Bruijn graph, (iii) normalizing edge weights, and (iv) generating the output set of DNA sequences. We tested our approach on real data sets of bacterial organisms. Conclusions The software library, console application and web application were developed. Web application was developed in client-server architecture, where web-browser is used to communicate with end-user and algorithms are implemented in C++ and Python. The presented approach enables proper reconstruction of tandem repeats, which are longer than the insert size of paired-end tags. The application is freely available to all users under GNU Library or Lesser General Public License version 3.0 (LGPLv3).


Background
Next-generation sequencing (NGS) has dramatically reduced the time and the cost of producing genome sequences using massively parallel technologies [1]; therefore, we observe exponential increase of sequencing data [2]. The reduction of cost and sequencing the time allowed to develop many applications, such as biosurveillance, bioforensics, and infectious disease epidemiology [3]. What is more, genome-scale metabolic modeling and metagenomic sequencing of patient samples could improve the efficiency of diagnosis and treatment of diseases in the near future. All of the shown above practical applications The sequencing procedure for bacterial organisms has changed a lot over the last 20 years. In 1995, the first two sequenced bacterial organisms were published. Over time, sequencing technology has evolved, and now bacterial sequencing has become the standard procedure. However, many of the sequenced bacterial genomes are currently incomplete -for example 90% of bacterial genomes in GenBank [3] are incomplete. In many cases the incompleteness is a result of the occurrence of repetitive sequences in bacterial genomes that can not always be reconstructed from short DNA reads from secondgeneration sequencing.
Some of the repetitive DNA regions could represent a structure called tandem repeat -a sequence built from several identical DNA fragments lying one after another, caused mainly by strand-slippage replication [4]. Bacterial genomes contain up to several dozens of tandem repeats divided into two groups: intragenic and intergenic. Nevertheless, only a small number of tandem repeats have been functionally studied to date; for example, some of the functions of specific genes can be modulated by instability of tandem repeats. This process allows bacteria adaptation to a new environment in a short term without complicated mutation [5].
Current DNA assemblers, like ABySS [6], Velvet [7] or SPAdes [8], reconstruct tandem repeats using the information contained in paired-end tags. However, some repetitive regions may be much longer than maximum reads length and the insert size of paired-end tags. Such regions cannot be reconstructed by modern DNA assemblers.
Here, we present a new algorithm for DNA assembly, which uses the relative frequency of DNA reads to properly reconstruct tandem repeats. The main advantage of our approach is that tandem repeats, which are longer than the insert size of paired-end tags, can also be properly reconstructed, while other de novo genome assemblers fail in such cases. What is more, long tandem repeats could also be restored if only single-read sequencing data is available. The presented approach requires high sequencing coverage, currently easily achievable for bacterial genomes, but the tandem repeats reconstruction process could significantly improve contiguity over previous approaches, which was also indicated in the study.

Implementation
In this section, we present the main data processing pipeline that has been implemented in a new DNA assembler named 'dnaasm' . We use de Bruijn graph due to its efficiency for the next generation sequencing data. We mainly focus on describing the process of estimation tandem repeats length and the process of reconstruction repetitive DNA fragments. We also present the main implementation aspects that make our application memory and computing efficient.

Assembly workflow Building and correcting de Bruijn graph
The first stage of de novo assembling in 'dnaasm' is de Bruijn graph construction. As in the typical de novo DNA assembler, dnaasm builds de Bruijn graph from input set of DNA reads by splitting each read into set of k-mers. Each k-mer represents a substring of length k from input DNA read -a number of k-mers generated from single DNA read of length L is equal to L − k + 1. Then, on the constructed de Bruijn graph, some algorithms for error correction are applied, similar to algorithms implemented previously [7]. Especially, dnaasm uses algorithms for removing tips, bubbles and edges of low weight. At this stage, all edges representing DNA sequencing errors should be removed from the de Bruijn graph. Moreover, the edges of the de Bruijn graph represent substrings of length k and in the presented approach each edge has an additional property, the integer number named edge weight, which depicts a number of occurrence of DNA fragment of length k in the input set of DNA reads, as in A-Bruijn graph [9].
The specified edge weight w is equal to exact k-mer count, where edge represents specified DNA substring of length k in the set of reads. Let's consider ideal assembler input R, called k-spectrum, where reads are generated without errors from a circular bacterial genome of sequence s 0 s 1 ...s G−1 of length G, and reads r ∈ R have identical length L, and R is a set of all substrings s i ...
for non-repetitive k-mers, where N depicts a number of reads, and edge weight w = d , for k-mers inside tandem repeat of length n, where repetitive motif of length d is repeated n/d times (integer division), tandem repeat is longer than graph dimension, n > k, and = n − k + 1.
We prove in [10] the edge weight w for de Bruijn graph of dimension k, for error-less set of N reads with identical length L, assuming uniform distribution of the reads position over input circular bacterial genome of length G, is a random variable with Poisson distribution (probability mass function is P (x) = e −λ λ x x! ), as depicted in Eq. 1.

Estimating a number of repeats
After the de Bruijn graph construction and correction, dnaasm application estimates the number of occurrences of a given DNA fragment, represented by the edge in the de Bruijn graph, in the investigated genome. This process consists of two stages: firstly, the normalization factor is calculated in accordance with the equation: The presented normalization factor is the result of modeling edge weight by Poisson distribution described in Eq. 1. Then, the edge normalization is carried out -it consists in multiplying the input edge weight (which is the number of occurrences of the DNA fragment represented by the edge in the input set of DNA reads) by the previously calculated normalization factor. The multiplication result is rounded to the nearest integer, which represents the number of occurrences of the DNA fragment represented by the edge in the investigated genome. This step could be briefly described by the following equation: The proper repetitive sequence reconstruction requires high coverage c = N * L G ≥ 100. When c ≥ 10 Poisson distribution of edge's weight can be approximated by Normal distribution N (μ, σ ): For given level of confidence q, 0 ≤ q ≤ 1 we can calculate a required read coverage c for proper repetitive motif reconstruction, using the Eq. 5, where −1 N (q) is the inverse cumulative distribution function for standard normal distribution (μ = 0, σ = 1), d is the length of repetitive motif, n is the length of tandem repeats, n > k, k is de Bruijn graph dimension, L is read length.
The process of estimating the number of occurrences of a given DNA fragment in the investigated genome is presented in Fig. 1.

Detecting tandem repeats
The next step of the tandem repeats reconstruction process is the detection of structures in the de Bruijn graph, which represent tandem repeats in the investigated genome. These structures appears as loops in de Bruijn graph connected with the rest of the graph by only one in-edge and only one out-edge. In other words, tandem repeats are represented by a sub-graph, where exactly one vertex has two in-edges and one out-edge, exactly one vertex has one in-edge and two out-edges, and all other vertices have one in-edge and one out-edge. Such structure consists of two parts: • a branch from a vertex which represents an entry to the loop to a vertex which represents an exit of the loop; • a branch from a vertex which represents an exit of the loop to a vertex which represents an entry to the loop.
An example of a structure representing tandem repeat in the de Bruijn graph is presented in Fig. 2.

Correcting weights in tandem repeats
The next step of the tandem repeats reconstruction process is the correction of the edge weights in the previously detected de Bruijn graph loops. Firstly, the weights in single branches are corrected so that all weights of the branch have the same weight. Secondly, the number of vertices in

Resolving tandem repeats in DNA sequence
The last step of reconstructing the repetitive DNA sequence from next-generation sequencing reads is to generate a DNA sequence from the de Bruijn graph. This process involves traverse the vertices of the de Bruijn graph until an ambiguous vertex is encountered. The vertex is treated as unambiguous if it has zero, one or two input (output) edges and, in the case of exactly two input (output) edges, for one of them a simple return path exists ie. path from the target vertex to the source vertex, that has at least one vertex with more than one input edges and at least one vertex with more than one output edges. This condition makes the number of ambiguous vertices in our approach smaller than in the other existing assemblers, where ambiguity is set if a vertex has more than one input edge or more than one output edge.
The process of resolving tandem repeats consists of two steps: (1) finding vertices without any input edges and with at least one output edge, such vertex starts new contig and becomes current vertex; (2) iteratively processing directly connected vertices ie. adding them to actual contig and decrementing weights of visited edges; if the edge weight is 0, edge is removed from the graph. If the current vertex v is unambiguous, it extends the current contig, otherwise, it starts the new one. Moreover, if current vertex v is unambiguous and has two output edges, a b c Fig. 3 Correcting weights in loops of de Bruijn graph. A sample process of correcting weights in the de Bruijn graph. a The input de Bruijn graph. b De Bruijn graph after correcting weights in branchesthe weight of (GGT,GTA) edge was changed to 2. The graph has uncorrected weights in valid loops -vertices ATT and TAA have degrees different from zero. c De Bruijn graph after correction weights in valid loops -the weights of less numerous branch (ATT,TTA,TAA) of the loop were changed so that all vertices of the loop will have degree equal to zero the edge, for which a previously defined simple return path exists, is chosen.
This process is repeated until all ambiguous vertices are resolved. An example of generating DNA sequences from de Bruijn graph is presented in Fig. 4.

Final assembly steps
All of the previously described steps of de novo assembly in dnaasm application lead to a generation set of DNA sequences called unitigs. Then, created unitigs could be extended to contigs and scaffolds using paired-end tags and mate-pairs -both algorithms are also implemented in dnaasm application.

Implementation
The web-application was developed in client-server architecture, where web-browser is used to communicate with end-user, Python is used to realize the application server, and algorithms are implemented in C++. The described architecture is based on a bioweb framework [11], the main modules of the application are presented in Fig. 5.
To achieve the high performance of calculation module we used several memory-efficient structures, e.g. Compressed Sparse Row Graph from Boost library to represent de Bruijn graph, Google Sparse Hash to implement hash map. Our advanced memory optimization enabled building and processing graph up to 7 * 10 9 vertices (e.g. for human genome) in 256 GB RAM. We deploy the module as shared C++ library.

Results
In this section, we presented the results of tests for real data sets of bacterial organisms. We compared the results obtained by our approach with tandem repeats detected by algorithms based on paired-end tags. We also briefly describe new real assembly case from the whole genome sequencing project, where our approach gives an advantage. Moreover, we carried out several experiments on simulated datasets to compare efficiency of tandem repeats reconstruction.

Comparison to another applications
We compared the dnaasm application with the three popular de novo DNA assemblers: ABySS [6] ver. 2.0.1, Velvet [7] ver. 1.2.10 and SPAdes [8] ver. 3.11.0. Applications were compared on four sets of bacterial DNA reads obtained from the National Center for Biotechnology Information. The benchmark dataset contains DNA reads from four samples -ERR351243 for Helicobacter pylori PeCan4, SRR5431732 for Mycobacterium bovis, SRR1981622 and SRR1981619 for Helicobacter pylori J99. The description of benchmark data sets is presented in Table 1.
De novo assembling of the mentioned DNA reads was carried out in two modes -with and without using pairedends tags. The results were compared in terms of the number of contigs longer than 1000 bp, the length of N50 contig, the length of the longest contig and two parameters describing the quality of the resultant sequencesthe average number of mismatches and indels per 100,000 aligned bases. The above parameters were calculated by the quality assessment tool QUAST [12] ver. 4.1; and the results are presented in Table 2.
In Table 3 we showed the improvement of results by tandem repeat resolution. Furthermore, we counted the number of places in investigated samples, where our approach works properly and other assemblers fail. To compare the number of detected tandem repeats we used Tandem repeats finder application [13]; the results of this application are presented in Table 4.

Simulated reference genome
The next two experiments were carried out on the simulated data generated from the generated reference genome. This sequence consists of the 20 tandem repeats isolated from each other by a section of 1000 random symbols over {A, C, G, T} alphabet. The repetitive structures

Simulated dataset for different value of insert size
In this experiment we investigated how insert size affects the accuracy of tandem repeats detection. We generated sets of reads from simulated reference genome using the profile-based Illumina pair-end Read Simulator pIRS [14]. Three sets were generated: • mean insert size: 250 bp, standard deviation of insert sizes: 25; • mean insert size: 750 bp, standard deviation of insert sizes: 75; • mean insert size: 1250 bp, standard deviation of insert sizes: 125. The read length and depth of coverage for all simulated sets of reads was 100 bp and 150x, respectively. The substitution-error rate was 0.01, simulating indel errors in reads was switched on. To compare a number of detected tandem repeats we used Tandem repeats finder application [13]. The results are shown in Table 5.

Simulated dataset for different depth of coverage
In this experiment we checked how read coverage affects the tandem repeats detection for different types of repetitive sequences -we compared efficiency of reconstructing tandem repeats by our approach and by methods based on paired-end tags on simulated datasets generated with another depth of coverage. We used, as in the previous experiment, dataset generated by read simulator from our reference genome, The read length, insert size mean and standard deviation of insert sizes was 100 bp, 250 bp and 25, respectively, the error simulation parameters -as in previous experiment. We generated three sets of input paired-end tags with depth of coverage: 50x, 100x and 150x. The results are depicted in Table 6.

PCR confirmation
To present the correctness and usefulness of our approach, we use our application in a project managed by the Witold Stefański Institute of Parasitology of the Polish Academy of Sciences dealing with, inter alia, the problem of sequencing and assembling mitochondrial DNA of rat tapeworm Hymenolepis diminuta. Despite the small size of this sequence (only 13,900 bp), there is a large repetitive DNA region (tandem repeats), which contains 13 repeats of the same 31-nt sequence [15]. To assemble this sequence, we obtained reads from the Illumina sequencer, the reads were paired (2x100 bp), an average insert size was equal to 300 bp. Unfortunately, the insert size of paired-end tags was smaller than the length of the investigated repetitive region. Due to this fact, our application, as the only one DNA assembler, was able to reconstruct this repetitive region. Moreover, the depth of coverage for this sequencing project was high, ie. for mitochondrial DNA above 1000x, so we were able to use our application several times with different coverage depths (from 300x to 1000x). The results for all these calculations were the same, especially, the DNA fragment with tandem repeat was always reconstructed. What is more, additional ultradeep sequencing of PCR amplicons for this DNA region confirmed the results obtained by our approach.

Discussion
In this paper we describe an application used to reconstruct some of the repetitive DNA regions based on the normalised read depth. The presented approach was thoroughly tested and the experiments carried out on the simulated data, described in this paper, confirmed our Unitigs output depicts assembling mode without using paired-end tags, scaffolds -with paired reads. The table shows that the presented de novo assembler works comparatively in terms of the number of contigs, N50 statistic, the largest contig length and the quality (average number of mismatches and indels per 100000 aligned bases) of the resultant DNA sequences concept. What is more, the reconstruction of repetitive DNA region was proved by biological experiments. The read coverage of the genome region is key to the correct reconstruction of the repetitive fragment in our approach. However, the read depth of the specific DNA region varies depending on the GC content [16]. There are many methods for correction of the GC bias [17], most of them are implemented in copy number variation (CNV) detection tools based on read depth. Implementation and testing of some correction GC bias algorithm in our approach is one of the most important tasks in the near future.
Nowadays, nanopore sequencers are very popular. They allow to obtain the DNA reads of length greater than 10 kbp. The main disadvantage of nanopore sequencing is that obtained data contains more errors than the second Unitigs output depicts assembling mode without using paired-end tags, scaffolds -with paired reads. The table shows that tandem repeats reconstructuion process could significantly improve the results in terms of the number of contigs, N50 statistic and the largest contig length generation sequencing reads. However, the usage of the long reads can improve the assembly results from the short reads [18]. The presented algorithm currently does not use long reads. However, we plan to integrate such sequencing data in the next version of the software.
What is more, in the future we plan to add the possibility of running the application on a computer cluster. The de novo assembler will be divided into the set of containers, which will be managed and run by Apache Spark. The new architecture will allow to disperse the calculation, which     Table 4. Only dnaasm was able to reconstruct tandem repeats with more than two motif repetition from unitigs. Additionally, the ABySS results of 100 bp motif reconstruction from paired-end tags shows, that increasing the insert size value increases the probability of tandem repeat reconstruction  The numbers in the table depict number of motif repetitions in reconstructed DNA sequence for depth of coverage 50x, 100x and 150x, respectively. The proper restorations are in bold, the expected number of motif repetitions is defined as in Table 4. Only our algorithm was able to reconstruct tandem repeats with more than two motif repetition. It is worth paying attention to dnaasm results for 100 bp and 200 bp motifs, where increasing the depth of coverage increases the probability of tandem repeat correct reconstruction will significantly reduce the time of de novo assembling. Furthermore, in the future we plan to create a virtual machine [19] image and an Amazon machine image. The demo application with web interface as well as source code of the application are available at project homepage 1 . What is more, there is a public Docker container [20] with dnaasm de novo assembler. The presented application is freely available to both academic and commercial users under GNU Library or Lesser General Public License version 3.0 (LGPLv3).

Conclusions
As more and more bacterial genomes are sequenced, it becomes desirable to analyze their tandem repeats.
Here we have presented dnaasm, a de novo DNA assembler that uses the relative frequency of reads to properly reconstruct repetitive sequences, especially, in bacterial genomes.