Evaluation of the impact of Illumina error correction tools on de novo genome assembly

Background: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods.

Results: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy.

Conclusions: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1784-8) contains supplementary material, which is available to authorized users.


Error Correction Tool Parameter Settings
All error correction tools were executed with 32 threads. Some tools need to be provided with the approximate genome size. For those tools, the exact genome size was provided. Some tools that internally operate on k-mers allow the user to specify the value of k. For all tables and figures in the main paper and in supplementary data, except Table 4 in the main paper, the default or recommended value of k was taken for each tool, regardless of dataset or assembly tool that was used.
To generate the results of Table 4 in the main paper, the optimal value of k was selected for each EC tool/dataset combination by running the EC tool multiple times with different k-mer sizes. The optimal k-mer size is the one whose SPAdes assembly yields the highest scaffold N50. This selection procedure mirrors what an end-user would do to optimally assemble a new genome in the absence of a reference genome. Below, the actual parameters are specified for each tool individually:
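In a shell, this sweep can be sketched as follows. Here run_ec_tool and scaffold_n50 are hypothetical placeholders: the first stands for correcting the reads with a given k-mer size, the second for extracting the scaffold N50 from the Quast report of the resulting assembly.

```shell
# Pick the k with the highest scaffold N50 from "k n50" lines on stdin.
best_k() {
    sort -k2,2nr -k1,1n | head -n 1 | cut -d' ' -f1
}

# Correct, assemble, and record the scaffold N50 for each candidate k
# (run_ec_tool and scaffold_n50 are hypothetical placeholders).
sweep() {
    for k in 21 25 27 29 31 33 35; do
        run_ec_tool -k "$k" reads.fastq > "corrected_k$k.fastq"
        spades.py -t 32 --only-assembler --12 "corrected_k$k.fastq" -o "asm_k$k"
        echo "$k $(scaffold_n50 "asm_k$k")"
    done | best_k
}
```

Note that best_k can also be fed a precomputed table of (k, N50) pairs directly, e.g. the per-tool tables shown below.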

ACE
The k-mer size for ACE is built-in and cannot be specified by the user.
$ size=$(stat -c%s genome.fasta)
$ ./ace $size $inputreads aceOut/aceCorrected

BFC v. r181
$ size=$(stat -c%s genome.fasta)
$ ./bfc -s $size -k 33 -t 32 $inputreads >bfcOut/bfcCorrected

The following table shows the scaffold N50 of the SPAdes assembly from BFC-corrected reads for different values of k used in BFC. Therefore, in Table 4 of the main paper, k = 33 was used for datasets D1, D2, D3, D4 and D5; k = 29 was used for datasets D6 and D7; k = 35 was used for dataset D8. The default value of k = 33 was used for all other tables and figures.

Therefore, in Table 4 of the main paper, k = 27 was used for datasets D2, D3 and D6; k = 29 was used for datasets D7 and D8; k = 33 was used for dataset D4; k = 35 was used for dataset D5. The default value of k = 31 was used for all other tables and figures.

Therefore, in Table 4 of the main paper, k = 21 was used for datasets D1 and D5; k = 25 was used for datasets D3, D4 and D7; k = 27 was used for dataset D6; k = 31 was used for datasets D2 and D8. The default value of k = 25 was used for all other tables and figures.

Fiona v. 0.2.5
The k-mer size for Fiona is built-in and cannot be specified by the user.
$ size=$(stat -c%s genome.fasta)
$ ./fiona -nt 32 -g $size $inputreads fionaOut/fionaCorrected

Karect v. 1.0
$ ./karect -correct -inputfile=$inputreads -matchtype=hamming -celltype=diploid -resultdir=karectOut -kmer=9 -memory=32 -threads=32

Therefore, in Table 4 of the main paper, k = 9 was used for datasets D1, D2, D3, D4, D5, D6 and D8; k = 13 was used for dataset D7. The default value of k = 9 was used for all other tables and figures. It should be noted that Karect sometimes overrides the user-specified value of k with a value it considers more suitable.

Therefore, in Table 4 of the main paper, k = 13 was used for datasets D2, D5 and D7; k = 15 was used for dataset D8; k = 17 was used for datasets D1, D3 and D4; k = 21 was used for dataset D6. The default value of k = 17 was used for all other tables and figures.

Therefore, in Table 4 of the main paper, k = 21 was used for datasets D1, D3, D4 and D6; k = 27 was used for datasets D2, D5, D7 and D8. The default value of k = 21 was used for all other tables and figures.

Data simulation
To compare the performance of error correction tools (EC tools) on simulated data, we produced synthetic Illumina reads for the same set of organisms for which real data was used. The ART read simulator was used to generate the reads with the following command:
$ ./art_illumina -i genome.fasta -p -l [len] -f [cov] -m 300 -s 30 -o reads

Here, len and cov correspond to the read length and coverage, and are set to match the values of the corresponding real datasets. The mean fragment size is 300 bp and the fragment standard deviation is 30 bp.
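As a minimal sketch, assuming the per-dataset read lengths and coverages are kept in a small whitespace-separated table (the dataset names and values below are purely illustrative), the simulation commands can be generated as:

```shell
# Emit one art_illumina command per dataset from "name length coverage"
# lines on stdin; the mean fragment size (300 bp) and standard deviation
# (30 bp) are fixed as in the text.
make_art_cmds() {
    while read -r name len cov; do
        echo "./art_illumina -i $name/genome.fasta -p -l $len -f $cov -m 300 -s 30 -o $name/reads"
    done
}
```

Piping the output of make_art_cmds to sh would run the simulations.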
3 Error Metrics

Alignment ratio
Reads are grouped based on the number of mismatches (m) after aligning them to the reference genome using BWA v. 0.7.12:
$ ./bwa mem -M -t 32 -p reference/genome.fasta reads.fastq >alignment/samfileName.sam

Table 1 shows the percentage of reads that align to the reference genome without mismatches (m = 0). Table 2 shows the percentage of reads that do not align to the reference genome with fewer than 10 mismatches.
The 'Uncorrected' row shows the results of the raw data.
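As an illustration of how the m = 0 fraction can be extracted from the BWA output, here is a minimal sketch that counts SAM records whose NM:i tag (edit distance, used here as a proxy for the mismatch count) is zero; filtering of unmapped and secondary alignments is omitted.

```shell
# Percentage of SAM records carrying an NM:i:0 tag (no differences to the
# reference). Header lines (starting with @) are skipped; optional tags
# start at field 12 of a SAM record.
pct_perfect() {
    awk '!/^@/ {
        n++
        for (i = 12; i <= NF; i++)
            if ($i == "NM:i:0") { perfect++; break }
    } END { print 100 * perfect / n }' "$1"
}
```

Usage: pct_perfect alignment/samfileName.sam prints the percentage for one alignment file.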

Accuracy comparison method
Accuracy comparison of EC tools on simulated data is straightforward since the perfect read is known. Let R represent an input read. For each read R, there is a corresponding read C which is corrected by any of the EC tools. In artificial data, a perfect read P is provided together with R. Therefore, for the evaluation of tools on simulated data, the bases at each position c of these three reads (R, P and C) are compared and classified as follows:
tp: R_c ≠ P_c and C_c = P_c (an erroneous base that was corrected);
fp: R_c = P_c and C_c ≠ P_c (a correct base that was wrongly changed);
fn: R_c ≠ P_c and C_c ≠ P_c (an erroneous base that was not, or wrongly, corrected);
tn: R_c = P_c and C_c = P_c (a correct base that was left unchanged).
TP, FP, FN and TN denote the total number of tp, fp, fn and tn base positions, respectively, summed over all reads.
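This per-base comparison can be sketched as follows, assuming the common convention that tp counts erroneous bases that were corrected and fp counts correct bases that were wrongly changed; the three reads are assumed to have equal length and to be compared position by position.

```shell
# classify R P C  ->  prints "tp fp fn tn" for one read triple.
classify() {
    awk -v r="$1" -v p="$2" -v c="$3" 'BEGIN {
        for (i = 1; i <= length(p); i++) {
            rb = substr(r, i, 1); pb = substr(p, i, 1); cb = substr(c, i, 1)
            if (rb != pb && cb == pb) tp++       # error corrected
            else if (rb == pb && cb != pb) fp++  # correct base miscorrected
            else if (rb != pb && cb != pb) fn++  # error not corrected
            else tn++                            # correct base unchanged
        }
        print tp+0, fp+0, fn+0, tn+0
    }'
}
```

Summing the four counters over all read triples yields TP, FP, FN and TN.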

Assembly Result for Real Data
To assess the impact of EC tools on assembly results, we used SPAdes, DISCOVAR, IDBA and Velvet to assemble both corrected and uncorrected data. Quast provides comprehensive information on assembly quality; for the Quast analyses, scaffolds were used. The following commands were used to run the assemblers.
• SPAdes (v. 3.7.1)
$ spades.py -t 32 --only-assembler --12 reads.fastq -o outputDir

Quast results for each dataset are shown in the following subsections. Assemblies were named after the EC tool that was used to preprocess the data. The 'Uncorrected' assembly was obtained from uncorrected data. Default parameter settings were used for Quast, i.e., all statistics are based on contigs of size ≥ 500 bp. The Quast (v. 4.4) command line was:
$ ./quast.py asmDir/contigs.fa -R genome.fasta -o quastReport --plots-format ps -1 uncorrectedForwardRead.fq -2 uncorrectedReverseRead.fq --labels "toolName"

4.1 DISCOVAR
4.1.1 B. dentium
Table 6 contains the Quast report after assembling dataset B. dentium with DISCOVAR.
Table 7 contains the Quast report after assembling dataset E. coli str. K-12 substr. DH10B with DISCOVAR.
Table 9 contains the Quast report after assembling dataset S. enterica with DISCOVAR.
4.1.5 P. aeruginosa
Table 10 contains the Quast report after assembling dataset P. aeruginosa with DISCOVAR.
Table 12 contains the Quast report after assembling dataset C. elegans with DISCOVAR.
Table 13 contains the Quast report after assembling dataset D. melanogaster with DISCOVAR.
Table 14 contains the Quast report after assembling dataset B. dentium with IDBA.

Ability of EC tools to improve genome assembly
Error correction does not always lead to a more fragmented assembly. Fig. 2 shows a case where Karect corrects sequencing errors around a poorly covered region. A true 21-mer was missing from the initial reads, and error correction allowed the assembler to join the two sets of reads into a single contig.

Figure 2: Error correction with Karect resolves a breakpoint in the uncorrected data assembly. The first track (Ref) shows a part of the reference genome, which is assembled into a single contig from Karect-corrected reads. The second track (Uncorrected) shows the alignment of the uncorrected reads to the reference. The third track (Corrected Karect) uses the same alignment positions, but with the sequence content of the reads as corrected by Karect. The overlap between the uncorrected reads is shorter than 21 bp, i.e., a true 21-mer is missing from the uncorrected data. Three reads span this region, but they contain errors, highlighted in purple. After error correction these three reads are partially cleaned, which suffices to connect the two groups of reads.
6 Memory and Runtime

6.1 Real Data

6.1.1 Memory
In the main paper, peak memory usage was measured for all EC tools; Table 38 shows the exact numbers for reference. Furthermore, the possible effect of error correction on the memory usage of the assemblers was also measured. Results are shown for DISCOVAR and SPAdes. Fig. 3 shows the peak memory usage of the EC tools next to the peak memory usage of SPAdes, while Fig. 4 shows the peak memory usage of the EC tools next to the peak memory usage of DISCOVAR.

Runtime
The runtime plot of the EC tools is shown in the main paper; Table 39 shows the exact numbers.
We also recorded the possible effects of error correction on the assembly runtime for DISCOVAR and SPAdes. Fig. 5 shows the total runtime for error correction plus SPAdes assembly; Fig. 6 shows the same graph for DISCOVAR.

6.2 Simulated Data

6.2.1 Memory
Table 40 shows the peak memory usage for all tools on simulated data.

Runtime
We recorded elapsed (wall clock) time to measure the runtime. Table 41 shows the runtime of EC tools on simulated data.
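A minimal sketch of such a wall-clock measurement is shown below; peak memory can additionally be obtained with GNU /usr/bin/time -v, which reports the maximum resident set size of the child process.

```shell
# Report the elapsed (wall clock) time of a command, in whole seconds.
# The command's own output is discarded so that only the time is printed.
elapsed() {
    __start=$(date +%s)
    "$@" >/dev/null
    __end=$(date +%s)
    echo $(( __end - __start ))
}
```

Usage: elapsed ./karect -correct ... prints the elapsed seconds of the correction run.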