Nanopore basecalling from a perspective of instance segmentation

Background Nanopore sequencing is a rapidly developing third-generation sequencing technology that can generate long nucleotide reads of molecules within a portable device in real time. Genotypes are determined by detecting changes in the ionic current signal as a DNA/RNA fragment passes through a nanopore. Currently, nanopore basecalling has a higher error rate than the basecalling of short-read sequencing. By utilizing deep neural networks, state-of-the-art nanopore basecallers achieve basecalling accuracy in the range of 85% to 95%. Result In this work, we propose a novel basecalling approach from the perspective of instance segmentation. In contrast to previous approaches that perform typical sequence labeling, we formulate the basecalling problem as a multi-label segmentation task. We also propose a refined U-net model, which we call UR-net, that can model sequential dependencies for a one-dimensional segmentation task. The experimental results show that the proposed basecaller URnano achieves competitive results on the in-species data compared to recently proposed CTC-featured basecallers. Conclusion Our results show that formulating the basecalling problem as a one-dimensional segmentation task, which performs basecalling and segmentation jointly, is a promising approach.


Background
Nanopore sequencing, a third-generation sequencing technique, has achieved impressive improvements in the past several years [1,2]. A nanopore sequencer measures current changes during the transit of a DNA or an RNA molecule through a nanoscopic pore and can be built into a portable device. For example, MinION is such a commercially available device produced by Oxford Nanopore Technologies (ONT). One key merit of nanopore sequencing is its ability to generate long reads on the order of tens of thousands of nucleotides. Besides sequencing applications, it is actively used in a growing number of fields, such as microbiology and agriculture.
*Correspondence: r.yamaguchi@aichi-cc.jp. 1 The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan. 2 Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, 1-1 Kanokoden, Chikusa-ku, Aichi, 464-8681 Nagoya, Japan. Full list of author information is available at the end of the article.
Basecalling is usually the initial step in analyzing nanopore sequencing signals. A basecaller translates raw signals (referred to as squiggles) into nucleotide sequences and feeds them to downstream analysis. It is not a trivial task, as the current signals are highly complex and have long-range dependencies. ONT provides established packages, such as Scrappie and Guppy. Currently, nanopore basecalling still has a higher error rate than short-read sequencing: its error rate ranges from 5% to 15%, while the Illumina HiSeq platform has an error rate of around 0.1% (a majority of reads have a Q-score above 30). A growing body of work now focuses on further improving basecalling accuracy.
Early-generation basecallers first split raw signals into event segments and then predict a k-mer (including blanks) for each event. Sequence labeling models, such as the hidden Markov model (HMM) [3] and the recurrent neural network (RNN) [4], are used for modeling label dependencies and predicting nucleotide labels. It is widely considered that such a two-stage pipeline suffers from error propagation: wrong segments degrade the accuracy of basecalling. More recently, end-to-end deep learning models have been used to avoid pre-segmentation of raw signals, which enables basecallers to process raw signals directly. For example, BasecRAWller [5] defers the event segmentation step to a later stage, after initial feature extraction by an RNN. Chiron [6] and recent versions of ONT Guppy use a Connectionist Temporal Classification (CTC) module to avoid explicit segmentation when basecalling from raw signals. With CTC, a variable-length base sequence can be generated for a fixed-length signal window through output-space searching.
On the other hand, even though those basecallers can translate raw signals to bases directly, segmentation and explicit correspondence between squiggles and nucleotide bases are also informative. It can provide information for detecting signal patterns of target events, such as DNA modifications [7]. In a re-squiggle algorithm, basecalling and event detection are also required.
In this paper, we approach basecalling from the point of view of instance segmentation and develop a new basecaller named URnano. In contrast to previous work that treats basecalling as a sequence labeling task, we formalize it as a multi-label segmentation task that splits raw signals and assigns corresponding labels. We also avoid the assumption that each segment is associated with a k-mer (k ≥ 2) and directly assign a nucleotide mask to each current sampling point. At the model level, based on the basic U-net model [8], we propose an enhanced model called UR-net that is capable of modeling sequential dependencies for a one-dimensional (1D) segmentation task. Our basecaller is also an end-to-end model that can directly process raw signals. Our experimental results show that the proposed URnano achieves competitive results when compared with current basecallers using CTC decoding.

Methods
The overall pipeline of URnano is described in Fig. 1. URnano contains two major components: (1) UR-net for signal segmentation and basecalling, and (2) post-processing. For streaming signals generated by a nanopore sequencer, URnano scans signals in a fixed window of length L (e.g., L = 300) and slides consecutively with a step length s (e.g., s = 290). Given signal input $X = (x_1, x_2, \ldots, x_i, \ldots, x_L)$, UR-net predicts a segment label mask $y_i$ for each $x_i$. The output of UR-net, $Y = (y_1, y_2, \ldots, y_i, \ldots, y_L)$, has exactly the same length as the input X, and $y_i \in \{A_1, A_2, C_1, C_2, G_1, G_2, T_1, T_2\}$. Here, $\{A_1, C_1, G_1, T_1\}$ and $\{A_2, C_2, G_2, T_2\}$ are alias label names, which are designed to handle homopolymer repeats (described in "Homopolymer repeats processing" section). After the label mask Y is generated, we conduct a post-processing step that transforms Y to $Y' \in \{A, C, G, T\}^N$, where N is the length of the final basecall. The post-processing contains two simple steps. First, it collapses consecutive identical label masks into one label. Second, the collapsed labels in the alias namespace are transformed back to bases in {A, C, G, T}. $Y'$ is the final basecall of URnano.
Fig. 1 Overall pipeline of URnano basecaller. Block 1 is the UR-net deep neural network. Block 2 is the post-processing part that transforms the UR-net's output to final basecalls
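The two post-processing steps can be illustrated with a minimal Python sketch. The string label names ("A1", "A2", ...) and the function name `postprocess` are assumptions made for illustration; this is not the URnano implementation.

```python
from itertools import groupby

# Hypothetical string names for the 8 alias labels; each maps back to a base.
ALIAS_TO_BASE = {
    "A1": "A", "A2": "A", "C1": "C", "C2": "C",
    "G1": "G", "G2": "G", "T1": "T", "T2": "T",
}

def postprocess(masks):
    """Turn per-sample label masks into a base string."""
    # Step 1: collapse each run of consecutive identical masks into one label.
    collapsed = [label for label, _run in groupby(masks)]
    # Step 2: map alias labels back to plain bases.
    return "".join(ALIAS_TO_BASE[label] for label in collapsed)

masks = ["A1"] * 3 + ["A2"] * 2 + ["C1"] * 4 + ["T1"] * 2 + ["T2"] * 3
print(postprocess(masks))  # AACTT
```

Because the two runs of A carry different alias labels, they survive the collapse step as two separate bases.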
Besides predicting basecalls, URnano also generates a signal segment for each base. In the previous work [4,5], signal segments are assumed to be associated with k-mers of a fixed k (e.g., k=2,4,5). Every base is read as a part of k consecutive events. In URnano, we avoid making the k-mer assumption and directly assign label masks for signals.

UR-net: enhanced U-net model for 1D sequence segmentation
The key component of URnano is UR-net. Its network structure is outlined in Fig. 1 (more details in Additional file 1: Figure S1). In general, UR-net is based on the U-net model [8] and is enhanced to model sequential dependencies. "R" represents a refinement of U-net and the integration of RNN modules. The original U-net is designed for two-dimensional (2D) image data and has achieved state-of-the-art performance in many image segmentation tasks. Although the model can be directly applied to 1D data, the 1D segmentation task has its own characteristics that distinguish it from the 2D image segmentation task. In a sequence segmentation task, one segment may not only relate to its adjacent segments but also depend on non-adjacent segments some distance away. Such dependencies were not considered in the original U-net model, which mainly focuses on detecting object regions and boundaries.
The UR-net has a U-shaped structure similar to U-net, in which the left-U side encodes inputs X through convolution (CONV) with batch normalization (BN), followed by a rectified linear unit (ReLU) and max pooling, and the right-U side decodes through up-sampling or de-convolution. We make two major enhancements in the UR-net model, which are highlighted in green in Fig. 1 and described as follows:
• For the encoding part (left-U), we add an RNN layer right after each CONV-BN-ReLU block to model sequential dependencies of hidden variables at different hierarchical levels. Those RNN layers are also concatenated with the UP-Sample layers in the right-U decoding part.
• We add three bi-directional RNN layers as final layers.
These changes are intended to enhance the sequential modeling ability of the U-net.

Model training
Given a training set of segment samples indexed by $i \in \{1, \ldots, n\}$, we train UR-net with an interpolated loss function that combines dice loss (DL) and categorical cross-entropy loss (CE). Note that the task loss of edit distance cannot be directly optimized. For each segment sample i, $\mathrm{DL}_i$ and $\mathrm{CE}_i$ are computed over the time steps $t \in \{1, \ldots, L\}$ of the sequence. For each time step t, we use a one-hot encoding of the prediction label $p_t$ and the gold label $g_t$ in the 8-label space. We interpolate the dice loss and the categorical cross-entropy loss with weights α and β.
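As a rough illustration of such an interpolated loss, the sketch below uses the commonly used soft dice loss and a per-time-step averaged cross-entropy. These exact forms, the smoothing constant `eps`, and the default weights `alpha = beta = 0.5` are assumptions for illustration and may differ from the definitions used to train UR-net.

```python
import math

def dice_loss(pred, gold, eps=1e-7):
    """Soft dice loss for one window; pred and gold are L x 8 nested lists."""
    inter = sum(p * g for pt, gt in zip(pred, gold) for p, g in zip(pt, gt))
    total = sum(sum(pt) for pt in pred) + sum(sum(gt) for gt in gold)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def cross_entropy(pred, gold, eps=1e-7):
    """Categorical cross-entropy, averaged over the time steps of one window."""
    per_step = [
        -sum(g * math.log(p + eps) for p, g in zip(pt, gt))
        for pt, gt in zip(pred, gold)
    ]
    return sum(per_step) / len(per_step)

def interpolated_loss(pred, gold, alpha=0.5, beta=0.5):
    """alpha * DL + beta * CE; the default weights are placeholders."""
    return alpha * dice_loss(pred, gold) + beta * cross_entropy(pred, gold)
```

Here `gold` holds one-hot rows over the 8 labels and `pred` holds the predicted probabilities for the same time steps. A perfect prediction gives a loss close to zero, while a uniform prediction is penalized by both terms.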

Homopolymer repeats processing
In genomes, homopolymer repeats (e.g., AAA and TTTT) commonly exist. Because the post-processing step collapses consecutive identical label masks into one label, the bases of a homopolymer repeat would be merged into a single base. This brings about deletion errors if models are directly trained on this data. To solve this problem, we use an alias trick to differentiate adjacent identical labels. For example, the homopolymer repeat "AAAAA" in the training data is converted to "$A_1 A_2 A_1 A_2 A_1$" for training the UR-net model. In the inference stage, those new labels are transformed back into the original representation through post-processing.
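A minimal sketch of the alias trick is given below. The function name `alias_encode`, the string labels, and the choice to give every base outside a repeat the suffix 1 are assumptions for illustration; the text only specifies that adjacent identical labels alternate.

```python
def alias_encode(bases):
    """Give each base a suffix 1 or 2 so adjacent identical bases differ."""
    labels = []
    for base in bases:
        if labels and labels[-1] == base + "1":
            # The previous label is the same base with alias 1: switch to 2.
            labels.append(base + "2")
        else:
            labels.append(base + "1")
    return labels

print(alias_encode("AAAAA"))  # ['A1', 'A2', 'A1', 'A2', 'A1']
print(alias_encode("ACCA"))   # ['A1', 'C1', 'C2', 'A1']
```

With this encoding, no two neighbouring bases share a label, so collapsing runs of identical masks no longer merges a repeat into one base.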

Merge basecalls in sliding window into a whole read
In the training phase, a read is split into non-overlapping windows of fixed length. In the testing phase, for calculating read accuracy, read signals are scanned with overlapping windows: the sliding window advances with a step length smaller than the window length, so consecutive windows share an overlapped region. To merge the basecalls of sliding windows, there are two general strategies. One operates on the nucleotide level, after the final basecall is generated. The other operates on the segment label level, before the final basecalls. Here, we use the latter strategy with 'soft merging'. As shown in Fig. 3, soft merging combines consecutive predictions at the segment label level, using the probabilities of each segment label predicted by the deep learning model. We apply weight interpolation at each overlapped position and use the label mask with the maximum score as the prediction label for that position. The basecalls are made after merging all sliding windows of a read.
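Soft merging can be sketched as follows. The linear ramp used for the interpolation weights is an assumption for illustration (the weighting scheme is not specified here), as are the function name `soft_merge` and the requirement that a position is covered by at most two windows.

```python
def soft_merge(windows, step):
    """Merge per-window label probabilities into one label index per sample.

    windows: list of windows, each a list of L probability vectors.
    step:    offset between consecutive window starts.
    Assumes 0 <= L - step <= step, so at most two windows cover a position.
    """
    length = len(windows[0])
    overlap = length - step
    n_labels = len(windows[0][0])
    total = step * (len(windows) - 1) + length
    scores = [[0.0] * n_labels for _ in range(total)]
    last = len(windows) - 1
    for w, window in enumerate(windows):
        for t, probs in enumerate(window):
            weight = 1.0
            if w > 0 and t < overlap:
                # Start of a later window: weight ramps up (assumed linear).
                weight = (t + 1) / (overlap + 1)
            elif w < last and t >= step:
                # End of an earlier window: weight ramps down.
                weight = (length - t) / (overlap + 1)
            for k, p in enumerate(probs):
                scores[w * step + t][k] += weight * p
    # Pick the label with the maximum interpolated score at each position.
    return [row.index(max(row)) for row in scores]
```

At each overlapped position the two weights sum to one, so the merged score stays on the same scale as a single-window probability.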

Experiment settings
Data: we compared URnano with the latest versions of related basecallers: Chiron (v0.5.1) and ONT Guppy (v3.2.2). Both Chiron and Guppy use CTC decoding for basecalling. For comparing model performance, we used a publicly accessible curated dataset provided by Teng et al. [6]. The dataset contains per-base nucleotide labels for current segments. In other words, we know the signal segment for each nucleotide. The training set contains a mixture of 2000 randomly selected E. coli reads and 2000 λ-phage reads generated using nanopore's 1D protocol on R9.4 flowcells. The test set contains the same number of reads from E. coli and λ-phage. To assess read accuracy and assembly performance across species, we use 1000 randomly selected reads from Chromosome 11 (Chr11) of the human benchmark sample NA12878 (1D protocol on R9.4 flowcells).
The raw signals are normalized using median shift and median absolute deviation (MAD) scale parameters, giving the normalized signal Norm_signal. Samples containing Norm_signal values larger than 10 were filtered out of the training data. In total, 830,796 segments of length 300 were used for training.
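A minimal sketch of median/MAD normalization and the filtering rule is shown below, assuming the usual form (raw − shift) / scale with the shift taken as the median and the scale as the unscaled MAD. The function names are hypothetical, and whether normalization is applied per read or per window is not specified here.

```python
from statistics import median

def normalize(raw):
    """Median-shift, MAD-scale normalization of a raw signal."""
    shift = median(raw)                          # median shift
    scale = median(abs(x - shift) for x in raw)  # median absolute deviation
    # A constant signal has scale 0; this sketch does not handle that case.
    return [(x - shift) / scale for x in raw]

def keep_for_training(norm_signal, limit=10):
    """Drop samples that contain a normalized value larger than the limit."""
    return all(value <= limit for value in norm_signal)

norm = normalize([1, 2, 3, 4, 100])
print(norm)                     # [-2.0, -1.0, 0.0, 1.0, 97.0]
print(keep_for_training(norm))  # False
```

Median and MAD are used instead of mean and standard deviation because they are much less sensitive to outliers such as the spike in the example.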
Evaluation metric: we evaluated a basecaller's performance according to the following metrics:
• Normalized edit distance (NED) between gold nucleotides and basecalls in non-overlapping windows. It is used to evaluate different deep learning models.
• Read accuracy (RA), which evaluates the difference between a whole read and its reference, and read identity rate (RI):

$$\mathrm{RA} = \frac{M}{M + U + I + D}, \qquad \mathrm{RI} = \frac{M}{\text{number of bases in reference}},$$

where M is the number of bases identical to the reference. U, I and D are the numbers of mismatches, insertions, and deletions, respectively, according to the reference read. Following the evaluation scheme in Chiron, we used GraphMap (v0.5.2) [10] to align basecalls of a read to the reference genome. The error rates of the aligned reads are calculated using the publicly available Japsa tool (v1.9-3c).
• Assembly identity (AI) and relative length (RL). We assembled genomes using the results of each basecaller. Assembly identity and relative length are calculated by taking the mean of individual accuracy rates and relative lengths for each shredded contig, respectively. The details of the assembling process are described in "Read assembly results" section. In these averages, N is the total number of aligned parts, $L^{\mathrm{pred}}_i$ is the length of the assembled i-th basecall and $L^{\mathrm{ref}}$ is the length of the reference genome.
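The read-level metrics can be sketched in a few lines of Python. Two assumptions are made for illustration: NED is taken as the Levenshtein distance divided by the length of the gold sequence (the normalizer is not stated here), and RA is taken as M / (M + U + I + D), with the counts M, U, I, D supplied by an aligner.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]

def normalized_edit_distance(gold, call):
    """Edit distance divided by the gold length (assumed normalizer)."""
    return edit_distance(gold, call) / len(gold)

def read_accuracy(m, u, i, d):
    """RA from matches, mismatches, insertions and deletions."""
    return m / (m + u + i + d)

def read_identity(m, ref_len):
    """RI: matches divided by the number of reference bases."""
    return m / ref_len

print(normalized_edit_distance("ACGT", "AGT"))  # 0.25
print(read_accuracy(90, 4, 3, 3))               # 0.9
```

Insertions count against RA but not against RI, which is one reason a basecaller can rank differently on the two metrics.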

Basecalling results on non-overlapping segments
We first investigated different deep network architectures in the URnano framework using normalized edit distance (NED). In total, 847,201 windows of length 300 were evaluated. In general, the lower the NED is, the more accurate a basecaller is. Table 1 shows the NED obtained with different neural network architectures. The original U-net performs worst, with an NED of 0.3528, while UR-net achieves the best NED of 0.1665. As sequential dependencies are not modeled in the U-net, these results indicate the importance of sequential information in the 1D segmentation task for basecalling.
To take sequential dependencies into account, we initially added 3 layers of bi-directional gated recurrent units (GRU) to the output of the U-net. This gives an absolute NED reduction of about 0.1728 compared with the U-net. Meanwhile, we observed that U-net+3GRU performs significantly better than using 3GRU alone (0.1 absolute NED reduction). In addition, we incorporated GRU layers at different hierarchical levels of the convolutional layers. This gives a further 7.5% relative reduction of NED when comparing URnano with U-net+3GRU.
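The reported reductions are mutually consistent, as a quick arithmetic check shows. The U-net+3GRU value below is derived from the numbers quoted in the text rather than read from Table 1.

```python
unet = 0.3528    # NED of the original U-net
urnet = 0.1665   # NED of UR-net

# U-net+3GRU: about 0.1728 absolute reduction relative to U-net.
unet_3gru = unet - 0.1728

# Relative reduction of UR-net with respect to U-net+3GRU.
relative = (unet_3gru - urnet) / unet_3gru

print(round(unet_3gru, 4), round(relative, 3))  # 0.18 0.075
```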

Basecalling results on read accuracy
We evaluated read accuracy for whole reads on the test set. The results are summarized in Table 2. We first investigated in-species evaluation, where the training data contains data of the same species as the test set. We tested on 2000 E. coli and 2000 λ-phage reads separately. For E. coli, Guppy_taiyaki achieves the best RA score of 0.8636, while URnano has the highest RI of 0.9010. Both perform significantly better than Chiron. For λ-phage, URnano performs better on both RA and RI than the other two basecallers, although the performance gap in RA between URnano and Guppy_taiyaki is not large. For cross-species evaluation, we basecalled 1000 randomly selected human reads from Chr11. Compared with the in-species evaluation, the performance of all three basecallers decreases. Fig. 2 shows an obvious difference in GC-content between E. coli/λ-phage and human. Such a difference between training and test data brings about a performance drop for deep-learning-based basecallers. Guppy_taiyaki performs best among the three basecallers on the human data, with RA around 0.015 higher and RI around 0.011 higher than URnano. In all three species, URnano achieves the lowest mismatch rate.

Read assembly results
We also evaluated the quality of the genomes assembled from the reads generated by each basecaller on the test set. We use the same evaluation pipeline as Teng et al. [6]. Assembly experiments consist of three steps: read mapping, contig generation, and polishing. Read mapping uses minimap2 (v2.17-r943-dirty) [11], which is designed for mapping long reads with high error rates in a pairwise manner. After that, miniasm (v0.3-r179) is applied to generate long contigs based on the pairwise read alignment information generated in the read mapping phase. Finally, Racon (v1.4.6) [10] is used to polish the contigs by removing read errors iteratively. The polishing step consists of mapping the initial long reads to the contigs and taking the consensus of the mapped reads to obtain higher-quality contigs. Polishing is repeated 10 times.
To evaluate the quality of the output contigs, each contig is shredded into 10k-base components and aligned to the reference genome. We evaluated the identity rate of each 10k-base component and report the mean over all 10k-base components as the final identity rate of the assembly. The identity rate is the number of matching bases divided by the total length of the aligned part of the reference. This identity is also referred to as the 'Blast identity' [12]. If the total length of the aligned parts is smaller than half of the read length, we assume it to be unaligned and the identity rate for that contig is 0. Relative length is calculated in a similar manner. Table 3 gives the assembly results on the E. coli, λ-phage and human test sets (the polished assembly result of each round can be found in Additional file 2: Table S1). Note that, unlike the conventional approach, which evaluates assembly results on relatively high-depth data, our test data is shallow, especially for the human data. The read assembly here is mainly used as a side evaluation metric for basecalling. The reference genomes used for each species are on different scales: λ-phage has the smallest size of 52k bp, Chr11 has the largest size of 135M bp, and E. coli lies in between at around 4.6M bp. When only a few reads are used for assembly, a species with a smaller genome tends to have more overlapping reads. On λ-phage data, URnano performs the best on both AI and RL.
In the in-species evaluation, we observed a correlation between the assembly identity and the read accuracy: a basecaller with a higher RA tends to have a higher AI (the Pearson correlation coefficient between AI and RA is 0.83). In the cross-species evaluation, the Pearson correlation reduces to 0.76. Guppy_taiyaki achieves the best AI with the highest RA of 93.28% on the human data. For the relative length (the closer RL is to 100%, the better), URnano performs the best on both E. coli and λ-phage data, but has a longer relative length on the cross-species data. This is consistent with the result shown in Table 2 that URnano has a higher insertion rate on the human data.
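The shredding and averaging procedure can be sketched as follows. The alignment itself is done by an external tool, so each component is represented here by a hypothetical tuple (matching bases, aligned reference length, component length); the function names and the use of the component length as the "read length" in the half-length rule are assumptions for illustration.

```python
def shred(contig, size=10_000):
    """Split a contig into consecutive components of at most `size` bases."""
    return [contig[i:i + size] for i in range(0, len(contig), size)]

def component_identity(matches, aligned_len, query_len):
    """Identity of one component; 0 if less than half of it aligns."""
    if aligned_len < query_len / 2:
        return 0.0
    return matches / aligned_len

def assembly_identity(components):
    """Mean identity over all shredded components."""
    rates = [component_identity(*c) for c in components]
    return sum(rates) / len(rates)

# Hypothetical alignment summaries for three 10k-base components.
components = [(9500, 10000, 10000), (3000, 4000, 10000), (9000, 10000, 10000)]
print(assembly_identity(components))
```

In this example the second component aligns over less than half of its length, so it contributes an identity of 0 to the mean.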

Segmentation results
In this section, we investigated event segments for each predicted nucleotide. Figure 4 demonstrates an example of basecalling and segmentation by URnano. For URnano, the signal segment for each base can be directly derived from the label masks. Because consecutive identical masks are merged into one base in the post-processing of URnano, a region of consecutive identical masks is simply an event segment.
For a CTC-based basecaller, segmentation is not explicitly conducted or learned in the model. Although heuristic approaches can be used to derive segments from the intermediate logit output, determining per-base segmentation with CTC basecallers is neither straightforward nor accurate. Figure 4 shows the segmentation results generated by URnano for a randomly selected input. From the gold segmentation, we can observe that the length of the signal for each nucleotide is not evenly distributed across time. This is mainly because the speed at which a molecule passes through a pore changes over time. The speed issue makes segmentation a non-trivial task. Traditional statistical approaches that do not consider the speed changes may not work. Here, the proposed URnano is designed to learn segmentation from the data, which implicitly accounts for the speed changes embedded in the signals. For example, events of 'T's around time step 150 tend to have shorter lengths than those around time step 200. URnano can distinguish such speed changes, as shown in the third row of the figure. For the beginning part of the signal in this example, URnano makes the correct base predictions, but the segments of 'TT' shift slightly compared to the gold standard.
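Deriving event segments from label masks amounts to a run-length encoding, as in the sketch below. The string labels ("T1", "T2", ...) and the function name `segments_from_masks` are assumptions for illustration.

```python
from itertools import groupby

def segments_from_masks(masks):
    """Return one (base, start, end) tuple per event; end is exclusive."""
    segments = []
    start = 0
    for label, run in groupby(masks):
        length = sum(1 for _ in run)
        # The first character of an alias label such as "T2" is the base.
        segments.append((label[0], start, start + length))
        start += length
    return segments

masks = ["T1"] * 3 + ["T2"] * 2 + ["G1"] * 4
print(segments_from_masks(masks))
# [('T', 0, 3), ('T', 3, 5), ('G', 5, 9)]
```

The segment boundaries are positions in the signal window, so each base is directly tied to the samples that produced it.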

Speed comparison
We measured the speed of the basecallers in base pairs per second (bp/s), dividing the total length of the basecalls by the total time. URnano achieves 16,271.15 bp/s on average, which is around 1.77x faster than Chiron, with 9,194.78 bp/s on average, using an Nvidia Tesla V100 GPU under a single-thread setting. We used Chiron's script to measure basecalling speed for URnano and Chiron. Note that the previous version of Chiron (v0.3) is slow because it uses a large overlap between consecutive sequences (90%), while the latest version (v0.5.1) uses a smaller overlap for speedup at the cost of some read accuracy. Unlike Guppy, neither URnano nor Chiron was optimized for speed; both are about two orders of magnitude slower than Guppy, which has a reported speed of ∼1,500,000 bp/s [12].
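The speed ratios follow directly from the reported figures, as the short check below shows (the variable names are illustrative).

```python
urnano = 16_271.15   # bp/s, reported average for URnano
chiron = 9_194.78    # bp/s, reported average for Chiron
guppy = 1_500_000    # bp/s, reported speed for Guppy

print(round(urnano / chiron, 2))                     # 1.77
print(round(guppy / urnano), round(guppy / chiron))  # 92 163
```

Guppy is thus roughly 90 to 160 times faster than the two unoptimized basecallers in this comparison.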

Discussion
We analyzed the three basecallers and enumerated their key modules, including the network input, network structure, network output and post-processing of each one, as shown in Table 4. For neural network architectures, CNN and RNN layers are commonly used. The CNN is generally used to extract features from raw signals and prepare input for the RNN. The RNN module is used to learn dependencies among hidden units. With URnano, our experiments also demonstrate the usefulness of RNNs for 1D segment mask prediction. Besides using RNNs in the final layers, they also demonstrate that combining CNN and RNN layers in the encoding stage can further improve basecalling performance. Chiron and Guppy use CTC decoding to generate basecalls of variable length through beam search in the hidden unit space. The output of Chiron includes blank labels, which are collapsed in the CTC decoding stage.
In a real physical process, the speed of a molecule passing through a nanopore changes over time. This can be observed in Fig. 4. A k-mer assumption using a fixed k may not hold over time. Although incorporating blank labels can deal with the low-speed case, the high-speed case, which involves more bases for the same signal length, could exceed the limit of the fixed k. For Chiron and URnano, the fixed k-mer assumption is avoided. Chiron uses CTC decoding, while URnano uses label masks, which are smaller units than a 1-mer.
To curate the data for training a basecaller, a re-squiggle algorithm is usually applied. In a re-squiggle algorithm, the raw signal and associated basecalls are refined through alignment to a reference. After re-squiggling, a new assignment from squiggle to a reference sequence is defined. In the re-squiggle algorithm [7], event detection and sequence-to-signal assignment are performed separately. We think the proposed URnano can be used as the basecaller in a re-squiggle algorithm, as it can perform basecalling, event detection and sequence-to-signal assignment jointly in an end-to-end manner. URnano can also be extended to detect DNA methylation, for which event segments are usually required.
In this paper, we only evaluated on a small curated dataset for fair comparisons between different basecallers. URnano works better in the in-species evaluation. To further improve URnano, we intend to train it on larger data covering more species.