Illumina error correction near highly repetitive DNA regions improves de novo genome assembly

Background: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions, such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly.

Results: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region, and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster.

Conclusions: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases, even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly, where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under the GPL license.
BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector. Electronic supplementary material The online version of this article (10.1186/s12859-019-2906-2) contains supplementary material, which is available to authorized users.

1 Parameter settings
Tools were executed with 64 threads. BLESS2 fails to finish with 64 threads on some datasets; therefore, only 32 cores were used for this tool. For all tables and figures in the main paper and in the supplementary data, the default or recommended parameter values were selected for each tool. Below, the command line parameters are specified for each tool individually.

BrownieCorrector: arguments must be provided in the following order: the first argument is the data library (note that if there are multiple libraries, each library must be corrected individually); the second argument is cov, the depth of coverage of that library; the last argument is the working directory.

$ ./bashScripts/runPipeLine.sh $inputreads $cov $workdir

1.5 Karect v. 1.0

$ ./karect -correct -inputfile=$inputreads -matchtype=hamming -celltype=diploid -resultdir=karectOut -kmer=9 -memory=32 -threads=32

To assess the impact of EC tools on assembly results, we used SPAdes to assemble both the corrected and the uncorrected data. The following command was used to run SPAdes:

$ spades.py -t 32 --only-assembler --12 reads.fastq -o outputDir

3 k-mer selection
Table 1 represents the most frequent k-mers in each dataset. For example, a poly-(A/T) 15-mer is the most frequent 15-mer in 3 datasets. We used Jellyfish to count the frequency of 15-mers in the datasets as follows:

$ jellyfish count -m 15 -s 100M -t 64 -C reads.fastq
$ jellyfish dump mer_counts.jf -L $threshold > kmerFile.high.fasta

The threshold values are chosen appropriately based on the dataset size to find the top-5 most frequent 15-mers. Fig. 1 shows the average quality score of bases in reads for different polymers and for a group of randomly sampled reads. It shows that reads containing a poly-(A/T) or poly-(C/G) pattern have a lower quality and hence are more erroneous. Our experimental investigations show that most of the breakpoints in the assembly results occur in the direct vicinity of highly repetitive k-mers.
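For readers unfamiliar with Jellyfish, the counting step above can be mimicked in a few lines. The following is a minimal illustrative sketch (not part of the BrownieCorrector pipeline) of canonical k-mer counting, i.e. counting each 15-mer together with its reverse complement, as the `-C` flag of `jellyfish count` does:

```python
from collections import Counter

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    rc = kmer.translate(comp)[::-1]
    return min(kmer, rc)

def count_kmers(reads, k=15):
    """Count canonical k-mers across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts

# Toy reads: a poly-A run on the forward strand and a poly-T run
# (its reverse-strand counterpart) both count toward the same 15-mer.
reads = ["AAAAAAAAAAAAAAAAG", "CTTTTTTTTTTTTTTTT"]
print(count_kmers(reads, k=15).most_common(1))  # → [('AAAAAAAAAAAAAAA', 4)]
```

On real data one would of course use Jellyfish itself; the sketch only illustrates why a poly-(A/T) 15-mer accumulates counts from both strands.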
For example, the top-5 most frequent 15-mers in the first or last 100 bp of the contigs (> 1000 bp) assembled with SPAdes in 6 different datasets are listed in Table 2.

4 k-mer coverage
Consider a genomic region and a randomly selected k-mer from this region; the average number of reads that initially cover any base of that k-mer is the initial coverage C. The expected number of extracted reads from that region that contain that specific k-mer, C_k, is given by the following formula:

C_k = ((l − k + 1) / l) · C · (1 − e)^k

where l is the read length and e is the error rate.

Proof: It has been shown in [1] that, in the absence of errors, the expected coverage of reads in a region of size k is ((l − k + 1) / l) · C. However, in the presence of errors, some of these reads fail to cover that region perfectly (i.e., without any mismatch) due to sequencing errors. Assuming that errors occur independently of each other, the probability that all bases of a read in an interval of size k are error-free is (1 − e)^k. Therefore, the expected number of reads that perfectly cover a region of size k is C_k = ((l − k + 1) / l) · C · (1 − e)^k.

Tables 2 and 3 in the main paper show the exact values of NGA50 for contigs and scaffolds before and after error correction. Table 3 shows the improvement rate of NGA50 for both contigs and scaffolds over the uncorrected data for the different datasets. The average improvement rate (AVG column) shows that the joint use of BrownieCorrector and Karect has the highest positive impact on the quality of contigs/scaffolds (+21%/+25%), whereas BrownieCorrector (+18%/+19%), Karect (+11%/+15%), and BFC (+5%/+7%) are the second, third, and fourth best tools. On the other hand, BLESS2 (−25%/−19%), ACE (−17%/−14%), and Reckoner (−11%/−10%) deteriorate the quality of the assembly on average.
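The k-mer coverage formula is easy to evaluate numerically. The sketch below checks it for an illustrative parameter setting (the values 50x coverage, 100 bp reads, k = 15 and a 1% error rate are example inputs, not taken from the paper's datasets):

```python
def expected_kmer_coverage(C, l, k, e):
    """Expected number of reads that cover a k-mer without any error:
    C_k = ((l - k + 1) / l) * C * (1 - e)**k
    where C is the initial coverage, l the read length, e the error rate."""
    return (l - k + 1) / l * C * (1 - e) ** k

# With no errors (e = 0) the formula reduces to the classical
# ((l - k + 1) / l) * C term; errors shrink it by the factor (1 - e)**k.
Ck_error_free = expected_kmer_coverage(C=50, l=100, k=15, e=0.0)
Ck = expected_kmer_coverage(C=50, l=100, k=15, e=0.01)
print(round(Ck_error_free, 2), round(Ck, 2))
```

Even a modest 1% per-base error rate removes roughly 14% of the reads from a 15-mer's perfect coverage, which is why error-prone repetitive 15-mers are fragile anchors for assembly.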

Choice of highly repetitive k-mer
To assess the performance of BrownieCorrector, we ran the benchmark with the two homopolymers poly-(A/T) and poly-(C/G). The results for poly-(A/T) are reported in the main paper. Table 4 compares, for each dataset, the number of reads that contain the specific k-mers and are accordingly corrected versus the total number of reads in that dataset. Tables 5 and 6 show the NGA50 of contigs and scaffolds when reads that contain, respectively, a 15-mer poly-(C/G) and a 15-mer poly-(AC/GT) are corrected by BrownieCorrector. Table 5 indicates that correcting reads that contain a poly-(C/G) often has a lower impact on the quality of the assembly (except for D3, which yields a higher NGA50). This is because a poly-C occurs less frequently than a poly-A in all the datasets, so the assembler can itself handle the associated complexity. Table 6 indicates that correcting reads that contain a poly-(AC/GT) has no positive impact on the quality of the assembly, and the results are sometimes slightly worse. This is because, even though poly-(AC/GT) is frequent, the quality of reads that contain a poly-(AC/GT) is high, and the assembler can itself handle the associated complexity. We therefore strongly suggest using poly-A, which is the default k-mer in the software.
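The read-selection step discussed above can be sketched as follows. This is a simplified illustration (the function name and the fixed toy reads are ours, not BrownieCorrector's API) of extracting reads that contain a homopolymer run of length k on either strand:

```python
def contains_repeat(read, base="A", k=15):
    """True if the read contains a homopolymer run of `base` of length >= k
    on either strand (poly-A or poly-T for the default base)."""
    comp = str.maketrans("ACGT", "TGCA")
    rc_base = base.translate(comp)
    return base * k in read or rc_base * k in read

# Toy reads: one forward-strand poly-A, one poly-T (reverse-strand poly-A),
# and one read without any repetitive 15-mer.
reads = [
    "ACGT" + "A" * 15 + "GGCT",
    "CGTA" + "T" * 15 + "ACGG",
    "ACGTACGTACGTACGTACGTACG",
]
selected = [r for r in reads if contains_repeat(r)]
print(len(selected))  # → 2
```

Only the selected reads (here 2 of 3) would then enter the clustering and correction stage, which is consistent with the paper's observation that fewer than 2% of the reads of a dataset are corrected.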

Choice of the number of iterations
In order to find the stable cores of clusters, we repeat the clustering multiple times. The default value for the number of iterations is set to 20 in the software. However, we further investigated the quality of the assembly results (for D1) when it is set to 1, 5, 10 and 30 as well. Fig. 2 shows how the NGA50 of contigs and scaffolds changes for different numbers of iterations. The figure indicates that using the stable cores obtained after running the clustering multiple times improves the quality of the assembly. It also shows that the accuracy of BrownieCorrector is not much affected by changing this parameter in the range of 5 to 30.
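One way to think about stable cores is as groups of reads that end up in the same cluster in every clustering run. The sketch below illustrates that idea on toy label assignments; it is our own simplified reading of "stable core", not BrownieCorrector's actual implementation:

```python
from itertools import combinations

def stable_cores(clusterings):
    """Given several clusterings (each a dict: read -> cluster label),
    return groups of reads that share a cluster in *every* run."""
    reads = list(clusterings[0])
    # Keep only the pairs that co-cluster in all runs.
    together = {
        (a, b)
        for a, b in combinations(reads, 2)
        if all(c[a] == c[b] for c in clusterings)
    }
    # Connected components of the "always together" graph are the cores.
    cores, seen = [], set()
    for r in reads:
        if r in seen:
            continue
        core, frontier = {r}, [r]
        while frontier:
            x = frontier.pop()
            for a, b in together:
                for y in ((b,) if a == x else (a,) if b == x else ()):
                    if y not in core:
                        core.add(y)
                        frontier.append(y)
        seen |= core
        cores.append(sorted(core))
    return cores

# Two clustering runs: r1 and r2 always co-cluster; r3 and r4 do not.
runs = [
    {"r1": 0, "r2": 0, "r3": 1, "r4": 1},
    {"r1": 5, "r2": 5, "r3": 5, "r4": 2},
]
print(stable_cores(runs))  # → [['r1', 'r2'], ['r3'], ['r4']]
```

Reads whose cluster membership fluctuates between runs drop out of any multi-read core, which matches the observation that results are insensitive to the exact iteration count once enough runs are averaged.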

Full Quast report (contigs)
This section contains the Quast evaluation report of contigs after assembling each dataset with SPAdes. Error correction by ACE, BFC, BLESS2, Brownie, Karect and Reckoner is performed prior to assembling the reads. The Uncorrected column refers to the quality of contigs without any pre-correction process. The Hybrid column shows the quality of assembly of reads corrected jointly by BrownieCorrector and Karect. Default parameter settings are used for Quast; therefore, all statistics are based on contigs of size ≥ 500 bp.

Table 7 contains the Quast report after assembling dataset D1 (Homo sapiens Chr. 21) with SPAdes, and Fig. 3 shows the corresponding NGAx plot. Table 8 contains the Quast report after assembling dataset D2 (Homo sapiens Chr. 14) with SPAdes; Fig. 4 shows the corresponding NGAx plot.

Figure 4: SPAdes assembly results for dataset D2 (Homo sapiens Chr. 14) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

Table 9 contains the Quast report after assembling dataset D3 (C. elegans) with SPAdes. Fig. 5 shows the corresponding NGAx plot.

D4
Table 10 contains the Quast report after assembling dataset D4 (D. melanogaster) with SPAdes. Fig. 6 shows the corresponding NGAx plot.

D5
Table 11 contains the Quast report after assembling dataset D5 (D. melanogaster) with SPAdes. Fig. 7 shows the corresponding NGAx plot.

D6
Table 12 contains the Quast report after assembling dataset D6 (A. thaliana) with SPAdes. Fig. 8 shows the corresponding NGAx plot.

D7
Table 13 contains the Quast report after a hybrid assembly of dataset D7 (D. melanogaster) with SPAdes. This is a hybrid assembly in which the corrected (and uncorrected) Illumina reads (R4) are complemented with the PacBio reads (P1). Fig. 9 shows the corresponding NGAx plot.

Figure 9: SPAdes assembly results for dataset D7 (D. melanogaster) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

D8
Table 14 contains the Quast report after assembling dataset D8 (D. melanogaster) with SPAdes. This is a hybrid assembly in which the corrected (and uncorrected) Illumina reads (R5) are complemented with the PacBio reads (P1). Fig. 10 shows the corresponding NGAx plot.

D9
Table 15 contains the Quast report after assembling dataset D9 (A. thaliana) with SPAdes. This is a hybrid assembly in which the corrected (and uncorrected) Illumina reads (R6) are complemented with the PacBio reads (P2). Fig. 11 shows the corresponding NGAx plot.

Full Quast report (scaffolds)
This section contains the Quast evaluation report of scaffolds after assembling each dataset with SPAdes. Error correction by ACE, BFC, BLESS2, Brownie, Karect and Reckoner is performed prior to assembling the reads. The Uncorrected column refers to the quality of scaffolds without any pre-correction process. The Hybrid column shows the quality of assembly of reads corrected jointly by BrownieCorrector and Karect. Default parameter settings are used for Quast; therefore, all statistics are based on contigs of size ≥ 500 bp.

Table 16 contains the Quast report after assembling dataset D1 (Homo sapiens Chr. 21) with SPAdes. Fig. 12 shows the corresponding NGAx plot.

D2
Table 17 contains the Quast report after assembling dataset D2 (Homo sapiens Chr. 14) with SPAdes. Fig. 13 shows the corresponding NGAx plot.

Figure 13: SPAdes assembly results for dataset D2 (Homo sapiens Chr. 14) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

Table 18 contains the Quast report after assembling dataset D3 (C. elegans) with SPAdes. Fig. 14 shows the corresponding NGAx plot.

Figure 14: SPAdes assembly results for dataset D3 (C. elegans) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

Table 19 contains the Quast report after assembling dataset D4 (D. melanogaster) with SPAdes. Fig. 15 shows the corresponding NGAx plot.

Figure 15: SPAdes assembly results for dataset D4 (D. melanogaster) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

D5
Table 20 contains the Quast report after assembling dataset D5 (D. melanogaster) with SPAdes. Fig. 16 shows the corresponding NGAx plot.

Table 21 contains the Quast report after assembling dataset D6 (A. thaliana) with SPAdes. Fig. 17 shows the corresponding NGAx plot.

Figure 17: SPAdes assembly results for dataset D6 (A. thaliana) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

Table 22 contains the Quast report after a hybrid assembly of dataset D7 (D. melanogaster) with SPAdes. This is a hybrid assembly in which the corrected (and uncorrected) Illumina reads (R4) are complemented with the PacBio reads (P1). Fig. 18 shows the corresponding NGAx plot.

Figure 18: SPAdes assembly results for dataset D7 (D. melanogaster) for both uncorrected and corrected data. Contigs with length NGAx or larger produce x% of the genome.

Table 23 contains the Quast report after assembling dataset D8 (D. melanogaster) with SPAdes. This is a hybrid assembly in which the corrected (and uncorrected) Illumina reads (R5) are complemented with the PacBio reads (P1). Fig. 19 shows the corresponding NGAx plot.

Tables 25 and 26 provide the detailed numbers for the peak memory usage and the runtime (wall time) of the EC tools on the datasets, respectively.