Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

Background Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. Results First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model’s estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads. Additionally, we show that our attention-based models have significant runtime improvement for the entire pipeline—18\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document}× faster than previous works, due to parallelizing the attention mechanism and the use of JIT compilation for GPU inferencing. Conclusion Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at: https://github.com/icanforce/lerna-genomics.


Background
The drop in high-throughput sequencing costs has offered unprecedented opportunities to characterize genomes, metagenomes, and single-cell genomes across the tree-oflife. Third-generation sequencing technologies [1,2] have demonstrated the potential to produce unparalleled genome assemblies due to their capability to generate staggeringly longer reads, at a much faster pace, and remarkably low costs [3,4], albeit, at low signal-to-noise ratios. This fundamental challenge makes genomic sequence identification and assembly, challenging, often necessitating hybrid correction approaches over self-correction from a consensus of long reads [5]. The error rates in base calling in these third-generation reads are nearly two orders of magnitude higher than their secondgeneration counterparts [6], with PacBio and Oxford Nanopore reaching error rates of 15% and 40% [7,8] respectively, albeit improving with more sophisticated instrumentation [9]. In order to use long-read datasets, especially the more erroneous ones from the earlier versions of the PacBio and Nanopore sequencers, error correction tools in the assembly pipeline are needed [10]. A majority of already existing datasets have a significant error rate in them and performing error correction on these datasets reduces the cost compared to collecting the reads again with the new technology. Hence, diverse EC tools have been developed to remedy the erroneous reads [11].

k-mer based EC tools
The most dominant technique used by EC tools is k-mer-based error correction [12][13][14][15][16]. In these k-spectrum-based methods, the goal is to convert insolid k-mers (i.e., those likely to be erroneous) to solid ones (i.e., those likely to be correct). The k-mers that appear above a certain threshold frequency, and are therefore expected to be legitimate, are solid k-mers, the others are called insolid or untrusted k-mers. It has been found that performance of EC tools is extremely sensitive to the chosen k-value [4,17], and the optimal value is both tool-and dataset-dependent. Small values of k result in an increase in the probability of overlap between reads at the cost of not allowing the algorithm to distinguish between erroneous and correct k-mers. In contrast, large k-values decrease the overlap probability and most k-mers appear to be unique, thus reducing the likelihood of correcting the errors.
Popular EC tools such as Lighter [18] and LoRDEC [11] for short-(< 400 base pairs) and long-read sequences (> 400 base pairs) respectively, require the user to select the k-value. Determining a favorable k-value among possible ones has been explicitly pointed out as an open area of work [4,19,20], since an arbitrary k-value could generate sub-optimal assemblies. In these scenarios, the best k-values need to be found by exploring all the possible k-values [20]. A large number of EC tools rely on k-mers to make corrections [12][13][14][15][16]18]. We have used popular k-mer-based EC tools-Lighter [18] and LoRDEC [11] for short-and long-reads error correction respectively. Both of these exemplar tools require the user to select the k-value. Performance of error correction is extremely sensitive to this value. Some existing tools (e.g., KMERGENIE [21]) provide intuitive heuristics (e.g., abundance histograms) to guide k-value selection when performing de Bruijn Graph (DBG)-based genome assembly. However, experiments have shown that the optimum value of k varies across datasets and across different EC tools [19]. Thus, there is a need for a data-driven and tool-specific method to select the optimal k-value.
The runtimes of alignment algorithms like Bowtie2 [22] have superlinear complexity [23]. The perplexity computation in Lerna, in contrast, is linear in the length of the input (number of reads × read length). Because Lerna does not need to perform the alignment step, the runtime is significantly reduced. Furthermore, because Lerna subsamples reads, unlike alignment that requires entire read datasets, the time is further reduced. As seen in Table 3, the optimum value of k ranges from 15 to 25 for short reads when using Lighter. Performing an exhaustive search that requires alignment algorithms and entire reads is too time consuming. The optimum parameter configuration being dataset-and tool-dependent, further aggravates this process. For example, scanning over all k-mer values between 15 and 25 for dataset D6 and estimating the corresponding alignment rate with Bowtie2 took 1.5 hours on a server with 8 CPU cores and 12GB of RAM. In comparison, Lerna is 80x to 275x faster than Bowtie2 and was able to scan the same range in less than a minute.

State-of-the-art and our contribution
State-of-the-art automated EC tool tuner Athena [19] leverages NLP techniques to automatically compute the best value of k for error correction. However, Athena suffered from poor performance on datasets with larger read lengths, starting to degrade beyond reads of length 200 (as shown in Table 4). Further, as a result of the inability of RNNs to encode long-term dependencies, along with the character-level LMs being space inefficient, Athena failed to tune EC tools on long-read datasets. To address this problem, we engineered word-level transformer networks to deploy attention mechanisms, at different encoding granularities. This leads to improved alignment performance for lower read lengths and the ability to handle higher read lengths. With Lerna 's character-and word-level encodings encapsulated in transformer networks, our hypothesis of leveraging more sophisticated architectures for noisier and longer reads was validated. Specifically, Lerna either derived the optimal k-value or values corresponding to alignment rates within 0.31% of the optimal, plus improved runtimes of 18× faster than the best-inclass-Athena. Having improved the correlation rates for short reads, we moved on to solve the harder problem of finetuning EC tools for longer reads. This is because thirdgeneration sequencing technologies have low signal-to-noise ratios and a high error rate in base calling, with rates as high as 15% and 40% for PacBio and Nanopore data, nearly two orders of magnitude higher than second-generation sequencing data. Athena used a character-level RNN in its algorithmic suite. We find through detailed analysis that character-level models, in contrast to word-level models, provide weaker correlations between perplexity and alignment rate, making them unsuitable to use over longer and/or noisier datasets. Intuitively, with word-level LMs, we expand the vocabulary size ( 4 W ) and allow the model to learn identifiable patterns in the data, where W is the word length.
Additionally, the much slower inference [18× slower] of Athena (see Table 10) creates a bottleneck for downstream genomic analysis. Furthermore, as indicated before, RNNbased language models have the fundamental flaw of being unable to capture long-term dependencies in sequences [24]; hence it was unable to autotune parameters for longread EC tools.
Our innovation, Lerna, schematically outlined in Fig. 1, employs a variant of Transformer networks [25] to perform automatic tuning of EC tools on both more errorprone, third-generation long reads and second-generation short reads. The original transformer used by Vaswani et al. is causal and directional, i.e., the predictions for position i can depend only on the elements at positions less than i. However, in our problem context of genomic error correction, such directionality is not apt. Therefore, we modify the encoder-decoder architecture in the transformer LM and allow it to train in a nondirectional sense.
In summary, Lerna makes the following contributions: 1. We propose a mechanism to automatically select the best configurations for k-mer based error correction tools. Our mechanism handles both short and long reads by modifying Transformer Language Models [25] by disabling the masks to use nondirectional word-level training, improving k-mer selection accuracy by 35% over unidirectional character-level RNNs used by prior works. Further, these non-directional models were found to be equivalent to the bi-directional models in our experiments and thus we disabled the masks, reducing the number of model hyperparameters. 2. We propose several runtime improvements such as leveraging parallelizable computations in self-attention layers, and utilizing JIT (Just-in-Time) compilers in order to make better optimizations based on the underlying GPU hardware. These improvements lead to efficient utilization of resources and 18× lower inference time than Fig. 1 Workflow of Lerna. Lerna's high level intuition is that by reducing the perplexity metric for the generated reads, we increase the alignment rate and the assembly quality (NG50). Our evaluation shows that a word-level Transformer LM, with a word length of 4 (|w| = 4) works best across all datasets and tools. Our algorithmic suite works on both NGS short reads and PacBio and Nanopore long reads the state-of-the-art tool Athena for choosing the best configuration parameters for genomic error correction (see Table 10). 3. We evaluate the generalizability of Lerna on second-and third-generation sequencing technologies, specifically, Illumina short reads, and PacBio and Nanopore long reads, validating that Lerna can optimize both short and longer, noisier reads.
Furthermore, Lerna reports a negative correlation in each and every dataset (unlike Athena, which has a positive correlation of + 0.95 on D6) with an average of − 0.92 over D1 to D7, degrading with longer read lengths. Our experiments have shown that the highest negative correlation is observed when word-level Transformer LMs are used, rather than character-level. Intuitively, with word-level LMs, we expand the vocabulary size and allow the model to learn identifiable patterns in the data.

Language modeling
Language modeling has played an integral role in improving speech recognition, text analysis, and sentiment analysis. A Language Model (LM) is a probability distribution over a sequence of tokens (e.g., words or characters), using which we can identify the sequences that are more likely to occur [26][27][28][29]. LMs can be count-based (e.g., estimating N-gram probabilities via counting and subsequent smoothing) [30] or continuousspace [31] language modeling, such as the use of Recurrent Neural Networks (RNNs). However, several papers have pointed out various drawbacks of using RNNs (see the next subsection). As an alternative, we make use of computationally efficient transformer LMs and further modify its architecture for speedup and parallelization.

N-gram-based language modeling
N-Gram models [32] are used to predict the probability of observing a token [ W i ] after the series of all already observed tokens [ W 0 , . . . W i−1 ] (i.e., context). This is done for all possibilities present in the vocabulary. However, being a computationally expensive task, the above probability is approximated using Eq. 1.
where n represents the context. As we increase n, the accuracy increases, albeit at the cost of increasing complexity and training time. The conditional probabilities are stored as a lookup table, leading to a higher memory complexity. The size of the model on D7 for the N-gram LM is 2 GB while the size of the RNN model is only 3.5 MB for D7 (a description of the datasets is given in Table 2). Moreover, N-gram-based models suffer from increased data sparsity when trained on longer contexts. As N increases, the number of possible word combinations becomes 2 N and the amount of data becomes inadequate to train the N-grams. In contrast, Neural LMs (such as RNNs) overcome these limitations and are preferred over N-gram LMs [33]. (1)

RNN-based language modeling
RNNs are a class of LMs that rely on an internal "state" to make predictions, usually consisting of three or more layers. The input layer is given an input vector x t . This may be a one-hot encoding vector, such as in our case, of the tth word or character. The hidden layer represents the memory of the network: the internal "state" s t . A non-linear function f (e.g., tanh) is applied to the previous hidden states s t−1 and x t (see Eq. 2) where U and W are the input and state weights respectively. However, RNNs suffer from the problems of vanishing and exploding gradients [34]. Due to the sequential nature of gradient computation results, training of RNNs is computationally demanding. Despite recent advances in training RNNs, capturing long-term dependencies in sequences remains a fundamental challenge [24]. As the sequence length increases, the path length of information flow increases, resulting in greater likelihood of the information being corrupted. The inherent sequential nature of RNNs precludes parallelization within training examples, which becomes critical at larger sequence lengths, as memory constraints limit batching across examples [25]. LSTMs [35] and GRUs [36] are alternatives but the computation cannot be parallelized because of the sequential nature of operations in that the internal state at any time step t depends on the state of previous time step t − 1 . This remains a fundamental issue in all sequential ML models. Various architectures have been proposed to solve these problems. The common target of all of them is to use attention mechanisms to model probability distributions [25,37].

Transformer-based language modeling
Attention mechanisms are a way of selectively focusing on certain parts of a sequence [38]. Specifically, self attention relates different positions of a single sequence in order to compute a representation of the sequence and has been used in a variety of tasks [39], e.g., abstractive summarization [40], textual entailment [41], and learning task-independent sentence representations [42]. We incorporate self-attention in the Transformer network architecture. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths that forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. In the transformer, this value is constant O(1) . Furthermore, because the transformer contains no recurrence and convolutional operations, in order for the model to make use of the ordering of the input sequence, positional encoding of the input is performed. This encoding is done by using sine and cosine operations of different frequencies as follows: Here, d model represents the embedding dimension, pos is the position, and i is the dimension index. Recurrence relations generate states s t that depend on the previous state s t−1 (2) s t = f (Ux t + Ws t−1 ).  and the input at time t. This sequential nature of computations precludes parallelization, hence the removal of recurrence relations from the Transformer allows it to be parallelized [25].

Genomics preliminaries
This section gives brief definitions for a few genomics concepts that we focus on throughout the paper, such as error correction, alignment, and assembly. Error correction represents the process of identifying erroneous nucleotide base pairs in genomics reads and converting them to the correct pairs. This process is applied as a pre-processing step for the sequenced reads before further analysis is performed. It was shown that error correction can significantly improve the performance for all tasks downstream the analysis pipeline. Alignment is the process of matching the sequenced reads to a continuous error-free reference genome and identifying which segment in the reference genome matches a given read the best. Assembly is the process of merging the sequenced reads together into longer sequences in order to retrieve the original sequence. The design of Lerna focuses on improving the performance of error correction algorithms by identifying the best configuration parameters, which significantly improves the alignment rate and the assembly quality.

Perplexity of the language model
Perplexity is a metric that measures the goodness-of-fit of a sequence given an LM. The LM represents the probability distribution over the entire dataset [43]. Perplexity can be thought of as the probability of selecting a given word uniformly from the effective vocabulary, given some context. Thus, a lower perplexity indicates that the LM is better at making predictions. If an LM learns a uniform probability distribution over a given piece of text, the perplexity generated would be equal to the actual vocabulary size. The smaller the perplexity score is, compared to the actual vocabulary size, the better is the LM at learning patterns that occur with a high probability in the given text. In our context of genome error correction, the error correcting tools are expected to convert most of the insolid k-mers to solid ones. Since solid k-mers have a high frequency of occurrence, more the conversion of insolid to solid k-mers takes place, lower is the effective vocabulary size of the corrected data. This can be quantified using the perplexity metric. Mathematically, perplexity of a sentence (Eq. 5) is the inverse probability of the test set, normalized by the number of words [43].
It is clear from (5) that minimizing perplexity maximizes the probability of the observed set of m words from W 1 to W m .
Various models estimate the perplexity metric differently. However, they achieve the same purpose, which is estimating the correctness of a sequence given the trained probability distribution. Although the perplexity for Transformers and N-Grams is calculated in different ways, it gives us a measure of probability of a sequence given a trained LM. The Kullback-Leibler (KL) divergence (Eq. 6) is a fundamental equation of information theory that quantifies the proximity of two probability distributions. It specifies in bits how close a probability distribution P i is to another distribution Q i [44].
A metric similar to the KL divergence between two probability distributions is the cross entropy (Eq. 7). It is also used to measure the similarity between two probability distributions. However, there are subtle differences between the metrics. Cross entropy gives the average code length needed to represent one distribution by another distribution, and the excess code needed over the optimal coding is given by the KL divergence [45].
In Lerna, the Transformer LM uses the perplexity metric (Eq. 8), which is derived to be the exponential of the cross-entropy loss, where ŷ is the predicted next character-the output of the LM-and |V| is the vocabulary size used during training. Therefore, by using perplexity, we incorporate information-theoretic principles into our LM, and in effect, ensure that by minimizing the perplexity, we maximize the similarity between the ground truth and the predictions made by the LM.
Note that training our LM model on the uncorrected reads can still capture an accurate probability distribution, which we use to measure the quality of the corrected reads. The reason is that genomic reads usually have a high coverage (typically 30X or more), which makes the number of occurrences for an erroneous nucleotide much lower than that of the correct nucleotide. This means that the correct genomic reads still dominate the distribution. Now, we discuss the key design elements of Lerna and the corresponding challenges addressed by each element.

Transformer-based language modeling in Lerna
The attention function (Eq. 9) used in a Transformer can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. Specifically, scaled-dot product attention is done by taking dot product of Query Q and Key K vectors. This is then divided by d k , where d k is the dimension of K. This scalar value is multiplied by value vector V, and a softmax is taken to get a probability distribution.
Using multiple heads, rather than one, allows the language model to attend to information from different representation subspaces rather than from a single one. In the Transformer, the Attention module repeats its computations multiple times in parallel. Each of these is called an attention head. Because the layer after the concatenation is a feed forward network with an input dimension that does not depend on the number of heads, we take a linear combination of the concatenated vectors by multiplying with a matrix W O so that the resulting vector has the same dimension as the input dimension of the feed forward network. where The architecture described by [25] introduces masks in the self-attention layers. The self-attention sublayer in the decoder stack has been designed to prevent positions from influencing subsequent positions. This masking ensures that the predictions for position i can depend only on the known outputs at positions < i. We modify this in Lerna to remove these masks, allowing the representations to encode both past and subsequent nucleotides. This is important because each nucleotide in a k-mer provides critical information, so the encoding of a given nucleotide in k-mer should have information about what is to the left and to the right of it. BERT [46] successfully alleviates the unidirectionality constraint of the Transformer, by using deep bidirectional transformers. Drawing inspiration from BERT, we considered implementing two parallel architectures with masks activated-one for the forward direction and the other for the reverse direction. This would allow the implementation of bi-directionality into the LM. However, empirically, we have shown that disabling masks and bi-directionality give the same results. Hence, we simply disable masks in order to reduce the number of parameters in the model. RNNs compute an internal state at each time step s t , which depends on the previous state s t−1 and the current input at time t. The Transformer, which gets rid of recurrent layers, are not restricted by this recursive relation, which must be computed one after the other, allowing parallelization. The use of parallel attention modules has previously been used in reducing training time and establishing a new state-of-the-art model for English to German translation [47]. The ability of the Transformer networks to be parallelized opens up greater scope for optimizations in the machine-translated code.

Language modeling of genome sequences
In general, LMs are trained to statistically determine if a sequence of tokens (e.g., words in the NLP domain, or base-pairs in our usecase) contains erroneous (insolid or untrusted) tokens. We make use of the fact that sequencing technologies require sequencing every base-pair multiple times to provide a better and higher coverage of the genome. Accordingly, the number of erroneous reads becomes far less than the number of error-free reads, allowing correct k-mers to outweigh erroneous ones. We train an LM on the set of reads before correction, expecting the correct sequences to mask the effect of the erroneous ones during the training phase. After correction of the reads by an EC tool, the resulting reads are tested against the trained language model. The evaluation metric chosen is perplexity that is used to model the goodness of fit of the corrected reads compared to the learned LM. The lower the perplexity, the better the fitness of the reads to the probability distribution captured by the LM.
Following this argument, the error-free reads with plenty of solid k-mers are expected to produce a lower perplexity than erroneous reads with insolid k-mers, and this is exactly what we observed in our experiments as seen in Table 1. We generated a dataset with the NanoSim [12] simulator using the reference genome of E. coli, strain K-12, substrain MG1655, with the characteristics of Nanopore sequencing [48]. The generated data set had 588 reads labeled "unaligned" (i.e., with nucleotide-level error rates over 90%) out of 10,000 total reads. We tested a word-level transformer model with word length = 4, on the entire dataset and separately on the erroneous and correct reads. From Table 1, it is clear that correct reads generate a lower perplexity than the erroneous ones for multiple values of k. We use this idea to optimize the performance of EC tools by varying the input parameter k, and select the k-value that minimizes the perplexity. We have validated the results by performing alignment with the reference genome for the erroneous vs. corrected reads. The corrected reads showed a higher alignment rate indicating a more effective correction of the sequences.
Challenges in adapting LM for genomic reads There are a few differences in the training of an LM on the English language (or on any other language for that matter) and a genome sequence. First, a genome sequence has no concrete concept of words unlike an actual language. One may train a character-level model on it, but the training may be negatively affected by the small vocabulary size of sequences consisting of only A, T, G, and C. Also, one may use a fixed word length and preprocess the sequences to generate words, by using a suitable stride value. We have reported the effects of word length on training in our results that a word length of 4 leads to stronger negative correlation between perplexity and alignment rate, in contrast to smaller or larger word lengths as too small or too large vocabulary sizes can make learning patterns difficult for the LM. Second, a genome sequence has no concept of punctuation marks such as periods. The sequenced reads, although having a finite read length, do not necessarily follow in succession. In the English language, the word following a period proceeds the word preceding it. In contrast, in a genome sequence, successive reads may or may not follow each other. Therefore, the batch creation for training or testing cannot be done arbitrarily. It has to be ensured that successive reads do not affect the training of other reads and that there are no unjustifiable breaks within a read during batch creation. Specifically, every read is tokenized into chunks of 100 base-pair length, before being fed to our transformerbased model. To preserve the overlap between the produced chunks, we use a sliding window with a step size (a.k.a. stride length) of 50. If length of a read in not divisible by 100, the residual portion of the read is trimmed as they account for a small fraction of the read. Third, genome sequences can be considered non-directional. A portion of a genome can as much be affected by the portion of the genome preceding it as by the portion of the genome in succession. This is also true for the English language [46] but most of the models available are directional. Recent developments have resulted in the transformer network architecture that we orchestrate in Lerna to conform to the non-directionality attribute of genomics reads. Fourth, since we train our model on the data before correction, expecting the solid k-mers to influence training more than the erroneous ones (by outnumbering the erroneous ones), it has to be taken care that the batch size during training ≥ the batch size during testing. A larger batch size during training will help circumvent the effect of noisy or erroneous data, and help the model better learn the patterns occurring with high frequency. On the contrary, a smaller batch size during the testing phase will help identify sequences that have not occurred frequently during the training. Following the above rationale, we justify the use of NLP techniques to model genome sequences for evaluating the performance of EC tools.

Search heuristic
Our aim is to find the k-value that will minimize the perplexity of the corrected sequence. This ensures that the sequence has the highest chance of occurring, signaling that the EC tool is working optimally. The perplexity function calculated using the LM is denoted by f (Eq. 11). Although we experiment with k-mer-based approaches, any configuration parameter can be used in the search depending on the tool used, such as the maximum number of passes per read in Jabba [49], or the value of the maximum boundary difference to split the subcontigs in HALC [50]. Whatever the configuration parameter may be, our search heuristic uses perplexity as a measure of "fitness". Our search heuristic aims to find the configuration parameter with the minimum perplexity, and in doing so, maximizes the alignment rate and the assembly quality.
Here, LM: trained language model, D 0 : uncorrected read set, and k: the configuration parameter we wish to tune. In Lerna, we apply a Simulated Annealing (SA)-based search that improves an initial solution by walking randomly in the space of possible solutions and gradually adjusts a parameter called "temperature". At high temperature, the random walk is almost unbiased and it converges to essentially the uniform distribution over the whole space of solutions. As the temperature drops, each step of the random walk is less likely to explore sub-optimal choices. Unlike a Hill Climbing-based approach used by prior work Athena, Simulated Annealing does not get stuck at local optimum, and converges faster to the global optimum [51]. The problem with local optima is partially mitigated in Athena by multiple runs of the search with various initializations but this comes at the cost of higher execution times.

Results
Theoretically, Lerna can help tune any EC tool, but for the sake of evaluation, we have used Lighter for short reads correction and LoRDEC for long reads correction.

Evaluation on short reads
We begin with the demonstration of the effectiveness of Lerna on short reads datasets. The details of the seven NGS datasets used are given in Table 2. These are Illumina short reads and have been used in multiple prior studies (e.g., [12,18]) with reference genomes available (used to evaluate the quality of error correction). The datasets have different read lengths (from 36 to 250 bp) and different error rates (< 3%) . After correction using EC tool Lighter [18], we run the Bowtie2 aligner [22] and measure the Alignment Rate. A higher value implies superior error correction. We do an exhaustive search over various k-values to see whether Lerna produces optimal  We test both the Transformer character-and word-level LMs. We observed that the best performance was attained by using the Transformer word-level LM with word length 4 (|w|=4). For the Transformer word-level LM (|w| = 4), Lerna finds either the optimal value or a value with an alignment rate within 0.31% of the theoretical best, consistent with the reported results by Lighter ( Figure 5 in [18]). These slightly sub-optimal configurations are an artefact of sub-sampling  Table 3 summarizes our findings. We see that the Transformer word-level LM performs slightly better than the character-level variant; the former predicts the correct value in 5 datasets while the latter makes accurate predictions in 4 datasets; these out of a total of 7 datasets. Furthermore, when optimal parameters are not chosen, the alignment rate lies within 0.31% of the best possible alignment rate for both charand word-level models. A comparison of Lerna and Athena can be seen in Table 4. We notice that Lerna 's Transformer-based model produces a higher correlation with the alignment rate compared to the RNN-based model used in Athena for longer reads (i.e., D6 with 250 bp). This shows that more sophisticated NLP architectures are better suited for longer and noisier reads as is typical of the long-read sequences produced by PacBio and Nanopore sequencers. Recall that our refined NLP architecture uses parallelizable attention mechanisms in place of sequential recurrent layers, which affords higher optimizations. Furthermore, by restricting the maximum path length to O(1) , we have prevented the corruption of information. This is unlike RNNs, in which the path length depends on the length of the input sequence.
Next we measure the improvement of Lerna 's pipeline on the assembly quality. In short, assembly is the process of merging the reads together to make longer reads (a.k.a contigs), and the longer the constructed contigs, the closer they are to the original sequence (consensus regions). We use SPAdes assembler [52] to perform de novo assembly for the reads before and after correction and then use QUAST [53] on the generated contigs to estimate the assembly quality metric NG50, which we show in Table 4. The N50 is a metric that indicates that contigs equal to or larger than its value represent 50% of the total assembly size, and the NG50 is similar to N50, but based on the estimated genome size. NG50 is similar to mean or median of the generated read contigs but with a higher weight assigned to longer sequences, hence the higher the better. We notice that Lerna is able to improve the NG50 by a geometric mean of 5.08× across the 7 real datasets as opposed to Athena, which shows an improvement of 4.72× . This shows the importance of using Lerna 's pipeline to improve the performance for alignment and assembly tasks significantly.

Table 4 A comparison between Athena and Lerna on short reads
The dataset is described in Table 2. We show the correlation between the perplexity metric and the alignment rate of the data after correction for Lerna vis-à-vis its closest competitor, Athena. On dataset D6 (greatest read length of 250 bp), Athena fails and has a positive correlation of + 0.95 (instead of having a negative correlation, as desired), highlighted in the Table. This is in line with the fact that RNNs are unable to model longer sequences. We also show the improvement of the assembly quality (in NG50) after tuning the EC tool with Lerna versus using the uncorrected reads. The NG50 with Lerna is always higher than, or equal to, that with Athena, other than a small drop for Dataset D7. All the superior values are bolded

Evaluation on long reads
For long-reads evaluation, we use the PacBio simulator [54] to generate 10,000 reads with read lengths that vary between 2000 and 3000 base pairs and an accuracy of 78%, which matches the characteristics of real continuous long reads (CLR). Due to the varying read lengths in the dataset, an additional pre-processing step is performed before training our LM. Specifically, every read is tokenized into chunks of length 100 basepairs before being fed to our transformer-based model. To preserve the overlap between the produced chunks, we use a sliding window with a step size (a.k.a stride length) of 50. If length of a read in not divisible by 100, the residual portion of the read is trimmed as they account for a small fraction of the read. This fraction is upper bounded at 1.6% in our evaluation.
We use the reference genome of E. coli, strain K-12, substrain MG1655. We used LoR-DEC [11] for error correction since it has proven to work well on PacBio reads and is computationally efficient, yielding results within just a few minutes. Since it is a hybrid error correction tool, the short reads we use are those of E. coli, strain K-12, substrain MG1655, accession number SRX000429. The LM was trained on the uncorrected reads, and tested on the corrected reads. The ground truth was generated by running alignment using Minimap2 [55] and the statistics of the aligned reads were generated using paftools (provided in the Minimap2 repository). The experiments were repeated for a character-level transformer and a word-level transformer network, where the input sequences were converted into words during the pre-processing phase.
Given that our data has 4 literals, A, T, G, and C, for a word length of |w|, the vocabulary size of the dataset is equal to 4 |w| . The number of tokens in the model was set to 4 |w| . The size of the word embedding vector was set to 4 |w| for |w| < 4 , and to 128 otherwise, because an embedding size larger than the number of tokens is not a sensible choice. The number of encoder layers was set to 4, same as that of the model trained on short reads. The hyperparameter bptt represents back propagation through time, i.e., the data is divided into chunks of size bptt. It was set to 99 after it was observed that it is not sensitive to the quality of training for a significant range, from 50 to 500. We use perplexity as our evaluation metric, and its correlation with the percentage of aligned reads after correction was computed. The results are reported in Table 5. A more detailed result of variation of perplexity with k is shown in Fig. 2 for |w| = 4.
In Table 5, |w| refers to the word length selected for training and |V| is the vocabulary size of the data. The mean and standard deviation of the perplexity score has been calculated on the corrected data for k = 15,17,19,21,23,25,27,31,37,45, since beyond this value there was no change in the percentage of alignment for the corrected reads with the reference genome. Furthermore, a lower k-value is not usually recommended; in most cases, it is not supported by the error correction tool. The correlation between perplexity and the percentage of aligned reads was computed for the above mentioned values of k after standard normalization of both, perplexity and the percentage of aligned reads. A negative correlation between perplexity and % aligned reads was expected but it was interesting to see a positive correlation for |w| = 2 and 3. A possible reason could be a low information gain in 2-mers and 3-mers; i.e., a 2-mer may not be followed by a character with a high probability, and the prediction of a 3-mer from a given 2-mer appears to be random because of a small vocabulary size. We observe a low negative correlation for |w| = 1 . Although it is known that character-LMs perform well, the weak correlation is not good enough to eliminate the need for a reference genome. According to information theory, entropy is maximized when a uniform distribution is selected over a countable set [56]. The perplexity is the exponential of the cross entropy (Eq. 8). Hence, the perplexity, in essence, models the effective vocabulary size, i.e., it equals the vocabulary size when the probability distribution of the tokens in a text data is uniform, and it is less than the vocabulary size when there are motifs present in the data, with identifiable patterns that occur with high probability.
From the results, we see that the difference between the vocabulary size and the mean perplexity grows with the word size. The standard deviation of perplexity also grows with word size. The task of our search heuristic (of finding the best k) is made easier since our LM can differentiate easily between perplexities corresponding to k-values. To eliminate the need for the ground truth, that is the reference genome, the sole requirement was a high correlation between perplexity and % of aligned reads, and we see that strongest correlation was obtained for |w| = 4 . Effectively, the word length only affects the effective bptt value, given that a stride of 1 is chosen for converting text to words. It also helps set the vocabulary size, allowing a model to look for high probability motifs in different  A char-level model learns the following probability P(T|AAT CGG CGCT) if trained on the given sequence. If we choose a word length of 4 with stride=1, the data is pre-processed to generate the sequence, AATC ATCG TCGG CGGC GGCG GCGC CGCT. With bptt=6, a word-level model learns the probability P(CGCT|AATC ATCG TCGG CGGC GGCG GCGC) that is the same as P(T|AAT CGG CGC). Hence, a word-level model with word length=4, stride=1, bptt=6 effectively learns the same probabilities that a character-level model with bptt=9 learns, with the only difference being the vocabulary size. From our experiments, we have concluded that |w| = 4 is the best choice for our use case. We believe further research can be done on how training LMs on genome sequences is affected by the word length. This will help us gain more insights on how to detect solid k-mers using NLP techniques. The results of our pipeline on PacBio simulated long reads, are reported in Table 6. We find that the results are consistent with our claims made above. For |w| = 4 , our search heuristic selected the value of k corresponding to the best alignment. 9989 out of 10,000 reads aligned with the reference genome for k = 15 which is selected only for |w| = 4 . For all other word lengths |w|, the optimal k value for a minimum perplexity and the optimal k value for maximum alignment rate are different, largely depending on the correlation between the Table 6 Results of Lerna evaluated on PacBio reads: |w| denotes the word length used for training, k is the selected k-value, mapped reads is the total number of reads out of the 10,000 reads that aligned with the reference genome after correction done by LoRDEC using the selected k-value  Table 7 Results of Lerna evaluated on Nanopore simulated reads corrected using Canu We use a word length of 4 ( w = 4 ) and find that the best value of the MhapMerSize comes out to be 19 two metrics. We have, therefore, chosen |w| = 4 for evaluating Lerna on Nanopore simulated long reads as well, reported in Table 7. Finally, we evaluate the improvement in assembly quality for the reads after correction using Lerna versus the original set of reads (uncorrected). We use SPAdes assembler to perform the assembly and generate the contigs, and we use QUAST measuring tool to estimate the NG50 for the generated contigs. The uncorrected reads had an NG50 of 26,034, whereas the corrected reads with LoRDEC+Lerna had an improved NG50 of 38,466, showing an improvement of 47% (over uncorrected reads). This shows the importance of applying Lerna's pipeline to improve the performance for both alignment and assembly tasks. Compare this improvement to the improvement of 5.08× in assembly quality with short reads. This difference is due to the fact that assembly with shorter uncorrected reads is significantly worse than assembly with longer uncorrected reads. This is due to the fact that assembly is a harder problem with shorter reads, akin to piecing a jigsaw puzzle with a larger number of pieces.

Table 8 Lerna results on real PacBio E.Coli K-12 reads
Simulated annealing finds k = 17 as the best k value, which is also evident from the fact that it generated the minimum test perplexity. This value is quite close to k = 19 that generates the highest NG50 on assembly. Both of these k-values are highlighted in the  Table 9 Lerna results on real Nanopore Acinetobacter baumannii reads Simulated annealing here finds k = 13 as the best k value that also generates the highest NG50 on assembly, both of which are highlighted. We evaluate Lerna on two real datasets -E. Coli K-12 PacBio reads (Accession number: SRX4909245) and Acinetobacter baumannii Nanopore reads (Accession number: SRX11521539). The data was corrected with LoRDEC and assembly was done using Canu. The data was tokenized into words with |w| = 4 for the training and testing of the language model. Tables 8 and 9 show the evaluation results of Lerna on these datasets. We have performed experiments with the maximum range of k values that each of LoRDEC and Canu could support on our machine.
Since Lerna uses a heuristic-based optimization strategy (i.e., Simulated annealing), finding the global optimum is not guaranteed. However, for all evaluated datasets, Lerna was able to tune different EC tools and find k-mer sizes that significantly improve the quality of the corrected reads when the assembly results of the corrected reads are compared across the different k-mer sizes. For example, in Table 8, k = 17 generates an NG50 value of 133,690 which is significantly better than for example, k = 37 which has an NG50 value of 32,160. Accordingly, it also improves the quality of assembly exemplified by the improvement in the NG50 metric. Lerna finds a value that is either identical or very close to the actual best k, but without relying on a reference genome. For example, we see in Table 8 that for the PacBio dataset, Lerna finds k = 17 as the optimal k whereas the actual best value was k = 19 . In Table 9, we see  that for the Nanopore dataset, Lerna finds the k value that results in the best assembly. Hence, we can conclude that Lerna successfully tunes LoRDEC on real long reads datasets as well. We use Canu [57] for error correction on Nanosim [58] data. The Nanopore simulated reads are of E. coli. Using Lerna, we try to find the best value of MhapMerSize, i.e., the k-mer size for seeds in the MHAP part of Canu. Through our experiments, we found that a word length of 4 ( |w| = 4 ) gives the optimum value of k = 19 and a corresponding 9,813 out of 10,000 reads, which aligned with the reference genome. This kind of configuration search is done in parallel using the Apache Spark framework as described in our prior work [59]. This is an example of what is called an embarrassingly parallel workload. The number of reads aligned and the corresponding value of MhapMerSize is given in Table 7. The mean perplexity output from the LM is 293.11, with a standard deviation of 0.16836, and the correlation between the aligned reads and perplexity is − 0.889. Hence, we show that Lerna works on both short-and longread datasets.

Just-in-time compilation
We further improve the inference time of Lerna using Just-in-Time compilation [60]. The way bytecodes get converted to the appropriate native instructions for an application has a huge impact on the speed of an application. As seen in Fig. 3, they have access to run time information, such as, input parameters, control flow, and target machine specifics. While JIT compilers provide good performance, their performance is often a blackbox [61]. We employ JIT compilation in Lerna and find an immediate beneficial effect in the inference time. On using JIT compilers on Transformer LMs, we observe that the GPU utilization increases from 7 to 50%, and power usage of the GPU increases by 22W. This increase signifies that the JIT compiler allows for data parallelization in the GPU; more data streams and hence more GPU cores are being concurrently used. Clearly, by using JIT compilers, we have exploited the fact that the self-attention mechanisms can be parallelized, resulting in higher code runtime efficiency.
While the GPU utilization for a transformer increases ∼7× after using a JIT compiler, for an RNN LM, the power usage and GPU utilization are not improved. The Transformer efficiently uses the GPU by allowing computations to occur in parallel. The consequences of using an LM, which has parallelizable attention mechanisms, along with optimizations made by the JIT compiler, is that inference time for the Transformer is improved. The inferences are made using an NVIDIA Tesla P100 16GB GPU. We find that Lerna (using Transformer) is 18× faster than Athena (using RNN) ( Table 10). The gains are quite consistent across the different datasets that span a wide range of characteristics (such as coverage and number of reads).

Discussion
We have presented Lerna that can automatically tune EC tools by exploring their parameter search space, using data-driven techniques without the need for a reference genome for the sequence being corrected. We have made use of the fact that trained language models (LMs) generate lower perplexity when used to evaluate corrected reads. Further, given that finding the optimal configuration traverses a non-convex trajectory, we have used simulated annealing to reduce the chances of getting stuck in a local minima. Further, some EC tools may have performance-sensitive configuration parameters that may have interdependencies and, in such cases, a surrogate model that can encode dependencies, as in our recent work [62,63], can be used. Finally, we have compared Lerna 's findings with ground truth results on Illumina short reads and PacBio and Nanopore long reads. Lerna requires the user to first train an LM and then use the model to evaluate the corrected reads. Training can be a slow process and is directly proportional to the number of reads sampled for training. Our experiments have shown that Lerna works successfully even when as low as 5% of the input data is used for training. The user, therefore, needs to decide between the sample size and the training time. This is because a larger sample size is expected to have a larger input coverage that can result in a more rigorously trained model. Training is however required to be done only once, and the trained model is used to evaluate the corrected reads multiple times as Lerna searches through the parameter space of the EC tool being used. Evaluation is fast, and works in the order of a few seconds, as reported in Table 10.
The EC tool needs to be run for every input parameter that Lerna suggests as it searches through the parameter space. The runtime of the EC tool depends not only on the tool and the dataset, but also on the value of the input parameter. For example, the runtime of LoRDEC increases exponentially as k decreases. Therefore, the region of the parameter space to be explored must be carefully selected for the target EC tool. If the region is left too wide, then the search space becomes large and the search time, correspondingly, becomes long. If, on the other hand, the region is constrained too much, sub-optimal solutions may be found by Lerna. Our analysis in Table 6 captures some interesting empirical results. For example, choosing a word length of 4 for LM training produces better results than any other word length. This observation is consistent across multiple datasets. "This motivates us to study in future work the effect of word length on LM training in greater detail. "

Conclusion
Configuration parameters play a crucial role in tuning EC tools. Manually selecting configuration parameters will degrade the quality of error correction (EC), and impact downstream genome assemblies [64]. Lerna automatically tunes these parameters without the need for a ground truth reference genome. Further, we improve our pipeline by leveraging character-level and word-level transformer networks. We observe that grouping nucleotides into groups of four ( |w| = 4 ) in our LM causes the highest negative correlation with aligned reads, and gives the best results for both Lighter [18] and LoRDEC [11], exemplar short-and long-read EC tools, respectively. In all cases, Lerna can configure the EC tool so that the alignment rate is what would be found by an exhaustive search of the parameter space or the alignment rate is within 1% of the optimum. Further, Lerna has significant speed up of state-of-practice (actually doing the alignment, such as Bowtie2, over which we have a 80× -275× speedup) or the state-of-the-art autotuner for genomic reads, namely Athena ( 18× speedup).
Further, the LMs we have developed may potentially impact future work, unraveling the nuances of the language of the genome. The use of attention mechanisms opens up several avenues toward the interpretability of LM. The values of the attention weights can be analyzed; we expect that beyond a certain distance from a token, the weights will fall below a certain threshold. By finding and quantifying an appropriate threshold, we can reduce the work of the LM by preventing it from encoding information beyond a certain point. Furthermore, the patterns observed through attention may allow future work to explore and find biologically significant portions of the genome. This can play a critical role in the analysis of genetic mutations and their consequences. Lerna can be applied to tune the configuration of many different EC tools. Due to its speed, it can be applied when datasets change or the instruments change, even transiently, to have different error characteristics.
We now discuss a few aspects of Lerna, such as supporting EC tools, which are not k-mer based, the impact of sub-sampling on the perplexity metric, and the tuning complexity for the hyperparameters in Lerna.
Support for non k-mer based EC tools Although our evaluation is performed on k-merbased EC tools, such as Lighter and LoRDEC, the rationale for Lerna is not tightly coupled to k-mer-based correction techniques. Rather, it can be generalized to tune other parameters that impact the correction accuracy. This is achieved by Lerna 's pipeline that treats the EC tool under consideration as a blackbox. For example, tools such as Racer [65], which require users to specify the genome length for the sequenced portion of the genome, can also be tuned with Lerna 's pipeline by applying Lerna 's searching heuristic over the space of possible genome lengths (instead of the space of k-mer values). The selection criteria will be selecting the genome length that minimizes the perplexity metric for the corrected reads.
Impact of Sub-sampling on Perplexity As shown in Fig. 1, Lerna performs sub-sampling over the original set of reads before feeding it to the EC tool being tuned. This is performed to reduce the searching time. However, Lerna 's LM is trained over the entire (uncorrected) dataset to exploit the high coverage of the full set of reads, which allows the LM to accurately capture the probability distribution of the sequence. The subsampling ratio is lower bounded so that the resulting coverage of the sample is ≥ 30× , which is essential for the EC tool to perform efficient error correction [66]. Also notice that Lerna uses the perplexity scores that correspond to different k-values to pick the best among them. Accordingly, even if sub-sampling is impacting the perplexity scores, as long as the relative ordering of the perplexity scores with different k-values is maintained, Lerna can still select the best k-value.
Hyperparameter tuning for Lerna 's search heuristic Lerna relies on a Simulated Annealing (SA)-based searching to find the best k-value. However, SA has its own hyperparameters, such as initial temperature (T), temperature scale factor ( α ), and number of cycles. It may appear that we have replaced one parameter selection for a different, yet equally hard, parameter selection problem. Fortunately, we find empirically that Lerna 's performance is not sensitive to these parameters and a reasonable selection for their values (using common rules of thumb) provides the desirable performance. For example, a good rule of thumb for the "Initial Temperature" is to pick a value that accepts about 98% of the solutions to explore. As we can see from Fig. 4, the selection made by Lerna stays constant for any value of temperature between 2.00 and 3.25, and even outside of that, the fall in alignment quality is not that high. Figure 5 shows that the final choice of Lerna is relatively unaffected by the other parameter of SA, α . Moreover, prior works such as [67] proposed simple algorithms that can automatically select appropriate values for SA's hyperparameters, which can be easily integrated in Lerna 's pipeline.
Relevance of Lerna with newer genome sequencing tools As genomic reads become longer, the improvement due to Lerna will become more significant. This will happen because the longer-read technologies are more error prone, with PacBio and Oxford Nanopore reaching error rates of 15% and 40% [7,8] respectively. Compare this to the error rates of less than 0.1% with short reads [68]. Second, with longer reads, the possible range of k-values in k-mer-based EC tools will increase. This will mean that it will be more time consuming to perform an exhaustive search through the parameter search space. Hence, the efficient design of Lerna will be of greater benefit to the community.

Methods
The proposed method for automatic tuning of EC Tools takes in the uncorrected reads and the EC Tool as the input, and involves the following 4 steps before generating the optimal input parameter (k) for the given EC Tool.
1 Sub-sampling The reads are usually obtained in the fasta file format. We sample out a small portion of the dataset, which typically constitutes 4-5% of the entire dataset. For example, for the dataset D7 (short reads, human genome), we sampled out 700k Fig. 4 The selected value of k-mer length for Illumina dataset D2 ( Table 2). The value of the parameter selected is relatively unaffected by the initial temperature (T). In the worst case, the alignment rate 61.15% compared to 61.42% at the optimum value. In this experiment, we set α = 0.7

Fig. 5
Variation of selected k with parameter α keeping the initial temperature (T) = 2.8. In this case we observe that any α greater than 0.6 and less than 1 gives the optimum value for dataset D2 reads out of 15M reads to work on. We have empirically found that this fraction has good enough coverage for the entire pipeline to be reliable. The sub-sampling step helps us achieve a significant improvement in runtime. 2 LM training We explore both character as well as word level training for both short and long reads. We train the transformer network, that is, our language model on the uncorrected subsampled dataset and use the trained model for evaluating the corrected reads. 3 Data correction Here we use the standard EC Tools to correct the uncorrected reads.
We use Lighter for short reads and LoRDEC for long read correction. In principle, any EC tool can be used with Lerna. The EC Tool generates different outputs for different input k values and our aim is to find the k that corresponds to the best correction. 4 Evaluation We take the corrected outputs and evaluate them using our trained language model. To accelerate the process of evaluation, we have made use of JIT compiler. Our metric of evaluation is perplexity that essentially captures how confident the learned model is during prediction. A lower perplexity corresponds to a higher confidence and vice-versa. We repeat steps 3 and 4 iterating with different k values using a search technique called 'simulated annealing' to finally converge into the most optimal k value. We verify that the output of the EC Tool for this k value corresponds to the best alignment with the reference genome.
The entire workflow is visually represented in Fig. 1.
We show the steps of our searching mechanism in algorithms 1 and 2 . A Transformer LM is trained on a given dataset. The range of k; k min and k max , the LM, and the subsampled dataset S ′ are given as input to the algorithm. We set the parameter δ (i.e., step size) = 1, the initial temperature T 0 = 2.8 , and the constant α = 0.7 ( α represents the factor by which temperature is scaled after each cycle). These parameters are empirically found to perform the best across all datasets (including short and long reads) and therefore no additional tuning efforts are needed here. The search is allowed to run for 8 cycles, with 3 trials per cycle.
The probability of selecting a proposed solution that corresponds to a higher perplexity (a less likely sequence) is given by p = exp (−�E/(T × �E avg )) , where E, the energy variable, determines the volume of the search space. Larger E values correspond to larger search space and viceversa. It is clear that as temperature T is reduced, p is reduced. Hence, the values assigned to parameters T 0 and α determine the overall performance of the search. If the temperature is reduced quickly over various cycles, by reducing α , the search is restricted to a local neighborhood (exploitation). However, taking a value of α close to 1 causes the algorithm to explore a greater search space, and prevents it from converging to a local optimum.