Table 5 Effect of word length on correlation between perplexity and the percentage of aligned reads for E. coli PacBio simulated reads

From: Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

| \|w\| | \|V\| | Mean (PPL) | SD (PPL) | Correlation | Test time (s) |
|---|---|---|---|---|---|
| 1 | 4 | 3.9997 | 0.00004 | −0.714 | 367.5 |
| 2 | 16 | 15.824 | 0.02298 | +0.825 | 990.1 |
| 3 | 64 | 62.176 | 0.06331 | +0.761 | 1439 |
| 4 | 256 | 243.45 | 0.17869 | −0.965 | 2026 |
| 5 | 1024 | 951.70 | 2.86750 | −0.904 | 1862 |
| 6 | 4096 | 3724.7 | 21.5649 | −0.889 | 2472 |
| 7 | 16,384 | 14675 | 133.207 | −0.882 | 2923 |
| 8 | 65,536 | 59199 | 811.820 | −0.877 | 7150 |

  1. Strong negative correlation is observed for higher |w| values (\(|w| \ge 4\)), with \(|w| = 4\) producing the strongest correlation (−0.965). |V| denotes the vocabulary size, \(4^{|w|}\).
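
The relationship between word length and vocabulary size in the table note can be checked with a minimal Python sketch. It assumes a four-letter DNA alphabet (A, C, G, T) and, purely for illustration, splits a read into non-overlapping words; the names `vocabulary` and `tokenize` are hypothetical and are not taken from the Lerna code base, whose actual tokenization may differ.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: the four DNA bases


def vocabulary(word_length):
    """Return every word of the given length over ALPHABET."""
    return ["".join(p) for p in product(ALPHABET, repeat=word_length)]


def tokenize(read, word_length):
    """Split a read into non-overlapping words.

    Assumption for illustration only: trailing bases that do not
    fill a complete word are dropped.
    """
    usable = len(read) - len(read) % word_length
    return [read[i:i + word_length] for i in range(0, usable, word_length)]


# The vocabulary size equals 4**w for each word length in the table.
for w in range(1, 9):
    assert len(vocabulary(w)) == 4 ** w
```

For instance, `tokenize("ACGTACGT", 4)` yields two words drawn from the 256-word vocabulary that corresponds to the \(|w| = 4\) row.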