Table 5 Effect of word length on correlation between perplexity and the percentage of aligned reads for E. coli PacBio simulated reads

From: Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

|w|   |V|      Mean (PPL)   SD (PPL)   Correlation   Test time (s)
1     4        3.9997       0.00004    −0.714        367.5
2     16       15.824       0.02298    +0.825        990.1
3     64       62.176       0.06331    +0.761        1439
4     256      243.45       0.17869    −0.965        2026
5     1024     951.70       2.86750    −0.904        1862
6     4096     3724.7       21.5649    −0.889        2472
7     16,384   14675        133.207    −0.882        2923
8     65,536   59199        811.820    −0.877        7150
  1. A strong negative correlation is observed for \(|w| \ge 4\), with \(|w| = 4\) producing the strongest correlation (−0.965). |V| denotes the vocabulary size, \(4^{|w|}\).
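
The relationship between word length, vocabulary size, and perplexity can be illustrated with a minimal Python sketch. It is not the Lerna implementation: the helper names (`vocabulary`, `tokenize`, `perplexity`), the non-overlapping tokenization, and the standard definition of perplexity as the exponential of the mean negative log-likelihood are all assumptions made for illustration. Under that definition, a model that assigns uniform probability to every word has a perplexity of exactly |V|, which gives a reference point for the Mean (PPL) column.

```python
import math
from itertools import product

ALPHABET = "ACGT"  # the four nucleotides, so |V| = 4 ** w


def vocabulary(w):
    """Enumerate every word of length w over the DNA alphabet."""
    return ["".join(chars) for chars in product(ALPHABET, repeat=w)]


def tokenize(read, w):
    """Split a read into non-overlapping words of length w.

    Assumption: non-overlapping words; trailing bases that do not
    fill a complete word are dropped.
    """
    return [read[i:i + w] for i in range(0, len(read) - w + 1, w)]


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the tokens).

    token_probs holds the probability a (hypothetical) language model
    assigned to each observed token.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


if __name__ == "__main__":
    for w in range(1, 9):
        size = len(vocabulary(w))
        # A uniform model assigns probability 1/|V| to every token.
        uniform_ppl = perplexity([1.0 / size] * 100)
        print(f"|w|={w}  |V|={size}  uniform PPL={uniform_ppl:.1f}")
```

The vocabulary sizes produced this way match the |V| column of the table, and the uniform-model perplexity equals |V| for each word length.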