Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

Table 4 A comparison between Athena and Lerna on short reads

Dataset	Read length (bp)	Coverage	Correlation Athena	Correlation Lerna	NG50 without correction	NG50 with Lerna	NG50 with Athena
D1	36	80×	−0.93	−0.94	3019	6827	6827
D2	47	71×	−0.97	−0.96	47	2254	2164
D3	36	173×	−0.92	−0.93	1042	4873	4164
D4	75	62×	−0.86	−0.97	118	858	858
D5	100	166×	−0.96	−0.98	186	3524	2799
D6	250	70×	+0.95	−0.84	1098	1344	1237
D7	101	67×	−0.72	−0.82	723	739	754

The datasets are described in Table 2. We show the correlation between the perplexity metric and the alignment rate of the data after correction for Lerna vis-à-vis its closest competitor, Athena. A negative correlation is desired. On dataset D6, which has the greatest read length (250 bp), Athena fails and yields a positive correlation of +0.95, highlighted in the table. This is in line with the fact that RNNs are unable to model longer sequences. We also show the improvement in assembly quality (measured by NG50) after tuning the EC tool with Lerna versus using the uncorrected reads. The NG50 with Lerna is always higher than, or equal to, that with Athena, except for a small drop on dataset D7. All superior values are bolded.
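The NG50 comparison in the note above can be checked directly against the table. The following is a minimal sketch, not part of the original work: the NG50 values are copied from Table 4, while the names `ng50`, `lerna_below_athena`, and `fold_improvement` are hypothetical helpers introduced only for illustration.

```python
# NG50 values copied from Table 4, per dataset:
# (without correction, with Lerna, with Athena)
ng50 = {
    "D1": (3019, 6827, 6827),
    "D2": (47, 2254, 2164),
    "D3": (1042, 4873, 4164),
    "D4": (118, 858, 858),
    "D5": (186, 3524, 2799),
    "D6": (1098, 1344, 1237),
    "D7": (723, 739, 754),
}


def lerna_below_athena(table):
    """Return the datasets where NG50 with Lerna is strictly lower than with Athena."""
    return [name for name, (_, lerna, athena) in table.items() if lerna < athena]


def fold_improvement(table):
    """Return NG50 with Lerna divided by NG50 without correction, per dataset."""
    return {name: lerna / raw for name, (raw, lerna, _) in table.items()}


if __name__ == "__main__":
    # Datasets where Athena's NG50 exceeds Lerna's.
    print(lerna_below_athena(ng50))
    # Improvement over uncorrected reads after tuning with Lerna.
    for name, fold in fold_improvement(ng50).items():
        print(f"{name}: {fold:.2f}x")
```

Run on the table values, the first helper lists only D7, matching the exception stated in the note, and every fold improvement is greater than 1.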

ISSN: 1471-2105