EvalDNA: a machine learning-based tool for the comprehensive evaluation of mammalian genome assembly quality

Background
To select the most complete, continuous, and accurate assembly for an organism of interest, comprehensive quality assessment of assemblies is necessary. We present a novel tool, called Evaluation of De Novo Assemblies (EvalDNA), which uses supervised machine learning for the quality scoring of genome assemblies and does not require an existing reference genome for accuracy assessment.

Results
EvalDNA calculates a list of quality metrics from an assembled sequence and applies a model created from supervised machine learning methods to integrate various metrics into a comprehensive quality score. A well-tested, accurate model for scoring mammalian genome sequences is provided as part of EvalDNA. This random forest regression model evaluates an assembled sequence based on continuity, completeness, and accuracy, and was able to explain 86% of the variation in reference-based quality scores within the testing data. EvalDNA was applied to human chromosome 14 assemblies from the GAGE study to rank genome assemblers and to compare EvalDNA to two other quality evaluation tools. In addition, EvalDNA was used to evaluate several genome assemblies of the Chinese hamster genome to help establish a better reference genome for the biopharmaceutical manufacturing community. EvalDNA was also used to assess more recent human assemblies from the QUAST-LG study completed in 2018, and its ability to score bacterial genomes was examined through application on bacterial assemblies from the GAGE-B study.

Conclusions
EvalDNA enables scientists to easily identify the best available genome assembly for their organism of interest without requiring a reference assembly. EvalDNA sets itself apart from other quality assessment tools by producing a quality score that enables direct comparison among assemblies from different species.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-021-04480-2.

Differences between each CH assembly's EvalDNA score and PICR's EvalDNA score, compared to the differences between NUCmer scores (derived from NUCmer alignments of each assembly to PICR).

Supplementary Figures
Supplementary Figure 2. Results from the regsubsets function from the leaps package. Each row is a general linear model created from a subset of the features listed along the x-axis. Shaded cells indicate features included in that row's model. The models are ordered and shaded based on their r-squared value, given along the y-axis.
Supplementary Figure 4. FRCbam results (FRCurves) for the Chinese hamster genome assemblies. Thresholds of the number of allowed errors (features) are shown along the x-axis. Only contigs (starting with the longest) whose sum of features is less than this threshold can be used to compute the genome coverage, which is shown on the y-axis.
Supplementary Figure 5. Recommended guidelines for EvalDNA quality score interpretation from the reference-based scores of the training data instances.
Supplementary Figure 6. The EvalDNA scores are plotted against the corrected N50 of bacterial assemblies from the GAGE-B study that were created using different assemblers. The Pearson correlation coefficient is provided in red.
Supplementary Figure 7. The EvalDNA scores are plotted against the number of large errors in the bacterial assemblies from the GAGE-B study that were created using different assemblers. The Pearson correlation coefficient is provided in red.
Supplementary Figure 8. Alignment of two versions of the P. syringae (strain Shaanxi MG228) genome assembly using NUCmer. For a more interpretable visualization, the contigs in the GCF_000344475.2 assembly were merged based on the gaps and overlaps in the original assembly alignment.
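The FRCurve construction described for Supplementary Figure 4 can be sketched as follows. This is a minimal illustration, not FRCbam's implementation: contigs are taken from longest to shortest, and a contig is only counted toward genome coverage while the running sum of features (errors) stays below the threshold. The contig lengths and feature counts below are made-up example values.

```python
def frc_point(contigs, genome_size, threshold):
    """One point on an FRCurve.

    contigs: list of (length, feature_count) tuples, any order.
    Returns the fraction of the genome covered by the longest contigs
    whose cumulative feature count stays below the threshold.
    """
    covered = features = 0
    for length, n_features in sorted(contigs, reverse=True):
        if features + n_features >= threshold:
            break
        features += n_features
        covered += length
    return covered / genome_size

# Illustrative assembly: four contigs totalling 11 kb.
contigs = [(5000, 2), (3000, 1), (2000, 4), (1000, 0)]
curve = [(t, frc_point(contigs, 11_000, t)) for t in (1, 3, 5, 10)]
print(curve)
```

Sweeping the threshold over a range of values and plotting coverage against it yields the full FRCurve; assemblies whose curves rise faster reach high coverage with fewer tolerated errors.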

Quality Metric Definitions
All quality metrics examined are described here. Metrics are converted into the percentage of bases per assembled sequence or normalized by assembly length where needed. Note that not all of the metrics were included in the final model, due to multicollinearity or lack of significant correlation with the reference-based quality score in the training data.

10. Fragment coverage distribution (FCD) errors in contig - percent of bases in regions that REAPR marks as an FCD error within a contig (the region does not contain any gaps).
11. FCD errors over gap - percent of bases in regions that REAPR marks as an FCD error, where the region contains a gap.
12. Low fragment coverage (FC) in contig - percent of bases in regions that REAPR marks as having low fragment coverage, where the region does not contain any gaps.
13. Low fragment coverage (FC) over gap - percent of bases in regions that REAPR marks as having low fragment coverage, where the region contains a gap.
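The normalization step shared by the REAPR-based metrics above can be sketched as below. The interval representation is illustrative (REAPR reports flagged regions in its output files); the point is simply converting flagged regions into a percent-of-assembly value so assemblies of different sizes are comparable.

```python
def percent_bases_flagged(regions, assembly_length):
    """Percent of assembled bases covered by flagged regions.

    regions: list of (start, end) half-open intervals, assumed
    non-overlapping (e.g. FCD-error regions reported by REAPR).
    """
    flagged = sum(end - start for start, end in regions)
    return 100.0 * flagged / assembly_length

# Example: two FCD-error regions totalling 500 bp in a 10 kb assembly.
fcd_regions = [(100, 300), (5000, 5300)]
print(percent_bases_flagged(fcd_regions, 10_000))  # 5.0
```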

Model Testing
Information about the other models that were examined, as well as their RMSE and R-squared values, is provided in this section.

Figure 7. Performance of the elastic net regression model on test data. The estimated quality scores are plotted against the reference-based quality scores of the test instances.
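The two statistics reported for every model in this section, RMSE and R-squared, are computed from predicted versus reference-based quality scores. A minimal sketch with illustrative numbers (not values from the paper's test set):

```python
def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted scores."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5

def r_squared(y_true, y_pred):
    """Fraction of variance in the true scores explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [50.0, 60.0, 70.0, 80.0]   # reference-based quality scores
y_pred = [52.0, 58.0, 71.0, 79.0]   # model-estimated quality scores
print(round(rmse(y_true, y_pred), 3), round(r_squared(y_true, y_pred), 3))
```

Lower RMSE and higher R-squared both indicate estimated scores closer to the reference-based scores.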

K-Nearest Neighbors (KNN) regression
(a) Tuning parameters
RMSE was used to select the model with the optimal k value. The final value used for the model was k = 5.

Figure 9. Performance of the SVM regression model with a polynomial kernel on test data. The estimated quality scores are plotted against the reference-based quality scores of the test instances.
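The k-selection step can be illustrated with a toy KNN regressor: fit several candidate k values and keep the one with the lowest held-out RMSE. The data below are synthetic one-dimensional placeholders (the paper's tuning, on its real feature set, selected k = 5).

```python
def knn_predict(train_x, train_y, x, k):
    # Average the targets of the k nearest training points (1-D distance).
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def rmse(pairs):
    return (sum((t - p) ** 2 for t, p in pairs) / len(pairs)) ** 0.5

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [0.1, 1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8]   # roughly y = x
held_out = [(1.5, 1.5), (4.5, 4.5)]                  # (x, true y) pairs

# Keep the k with the lowest held-out RMSE.
best_k = min(
    (3, 5, 7),
    key=lambda k: rmse(
        [(y, knn_predict(train_x, train_y, x, k)) for x, y in held_out]
    ),
)
print(best_k)
```

In practice this comparison is done with cross-validated RMSE rather than a single held-out split, but the selection criterion is the same.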

Model without the normalized N50 metric
A random forest regression model using the same metrics as the main mammalian model, except for normN50, was developed. Parameters were tuned using 10-fold cross-validation. The lowest value of RMSE was used to select the best value of mtry, which was mtry = 3 (Table 9). The model was applied to test data and produced an R-squared value of 0.817 and an RMSE of 14.483.
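The mtry tuning described above can be sketched in scikit-learn terms, where mtry (the number of features considered at each split in R's randomForest) corresponds to max_features. The feature matrix and scores below are random placeholders, not the paper's quality metrics, and out-of-bag R-squared stands in for the 10-fold cross-validated RMSE criterion.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((60, 6))                            # 60 assemblies x 6 metrics
y = X @ rng.random(6) + rng.normal(0, 0.05, 60)    # synthetic quality scores

# Pick the max_features (mtry) value with the best out-of-bag R-squared.
best = max(
    (2, 3, 4),
    key=lambda m: RandomForestRegressor(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    ).fit(X, y).oob_score_,
)
print(best)
```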
Supplementary Figure 11. Performance of the random forest regression model without the normN50 metric on test data. The estimated quality scores are plotted against the reference-based quality scores of the test instances. A 100% accurate model would produce the blue line, with an r-squared equal to 1. The line of best fit for the plotted data is shown as the red line and has an r-squared of 0.8169.