Fig. 1 | BMC Bioinformatics

Fig. 1

From: De novo Nanopore read quality improvement using deep learning

Overview of MiniScrub. The convolutional neural network (CNN) must be trained to predict the percent identity of each read segment (its percent match to the reference) from the read-to-read overlaps. To generate ground-truth percent identity for read segments, reads are generated from known genomes in a reference database, then GraphMap [26] is used to map those reads to the reference, from which we calculate the percentage of bases in each read segment that match the reference genome. We also use MiniMap2 to generate read-to-read mappings, then encode that information into an RGB “pileup” image for each read, which is then split into shorter segments. We then train the CNN to learn the segment percent identity from the pileup images and save the model. On the user side, users run MiniMap2 on their set of reads and specify a cutoff threshold for read segments to scrub. The learned CNN model then predicts the percent identity of each read segment and scrubs the segments below the quality threshold, outputting a new FASTQ file with the scrubbed reads.
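
The final scrubbing step in the caption can be illustrated with a minimal sketch. This is not MiniScrub's actual code: the function names `scrub_read` and `to_fastq`, the fixed segment length in bases, identity expressed as a fraction between 0 and 1, and the choice to split a read wherever a segment is removed are all assumptions made for illustration. The predicted identities stand in for the CNN output.

```python
from typing import List, Tuple


def scrub_read(seq: str, qual: str, predicted_identity: List[float],
               segment_len: int, cutoff: float) -> List[Tuple[str, str]]:
    """Drop segments predicted to be below `cutoff` and return the surviving pieces.

    Hypothetical helper: `predicted_identity[i]` is the model's estimate for
    bases [i * segment_len, (i + 1) * segment_len). Adjacent kept segments are
    merged; a removed segment splits the read (an assumption of this sketch).
    Bases beyond the last predicted segment are not emitted.
    """
    pieces: List[Tuple[str, str]] = []
    cur_seq: List[str] = []
    cur_qual: List[str] = []
    for i, identity in enumerate(predicted_identity):
        start = i * segment_len
        end = min(start + segment_len, len(seq))
        if identity >= cutoff:
            # Keep this segment and extend the current piece.
            cur_seq.append(seq[start:end])
            cur_qual.append(qual[start:end])
        elif cur_seq:
            # Low-quality segment: close the current piece and skip the segment.
            pieces.append(("".join(cur_seq), "".join(cur_qual)))
            cur_seq, cur_qual = [], []
    if cur_seq:
        pieces.append(("".join(cur_seq), "".join(cur_qual)))
    return pieces


def to_fastq(name: str, pieces: List[Tuple[str, str]]) -> str:
    """Format the surviving pieces as four-line FASTQ records."""
    return "".join(
        f"@{name}_{n}\n{s}\n+\n{q}\n" for n, (s, q) in enumerate(pieces)
    )


if __name__ == "__main__":
    # Toy read with four 4-base segments; the second is predicted low quality.
    pieces = scrub_read(
        seq="AAAACCCCGGGGTTTT",
        qual="IIII####IIIIIIII",
        predicted_identity=[0.95, 0.60, 0.90, 0.85],
        segment_len=4,
        cutoff=0.80,
    )
    print(to_fastq("read1", pieces), end="")
```

In this toy example the second segment falls below the 0.80 cutoff, so the read is emitted as two records: the first segment alone, and the merged third and fourth segments.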
