
Table 2 Confusion matrices based on (a) the soft alignment-based approach with a soft alignment score threshold of 20, (b) blastp with an e-value cutoff of 1e−3, and (c) pooled embeddings similarity using KNN search, for capsid proteins (CP), regulatory and accessory proteins (RAP), envelope proteins (EP), replication and transcription proteins (RTP), and assembly and release proteins (ARP)

From: Improvements in viral gene annotation using large language models and soft alignments

 

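(a) Soft alignment (minimum soft alignment score 20)

|     | CP  | RAP | EP | RTP  | ARP  | Class sensitivity | Class specificity |
|-----|-----|-----|----|------|------|-------------------|-------------------|
| CP  | 603 | 0   | 0  | 65   | 15   | 0.88              | 0.98              |
| RAP | 1   | 414 | 0  | 42   | 4    | 0.9               | 0.99              |
| EP  | 4   | 1   | 83 | 2    | 0    | 0.92              | 1                 |
| RTP | 59  | 59  | 2  | 1758 | 68   | 0.9               | 0.98              |
| ARP | 13  | 21  | 2  | 19   | 3249 | 0.98              | 0.97              |

(b) blastp (e-value cutoff 1e−3)

|     | CP | RAP | EP | RTP | ARP | Class sensitivity | Class specificity |
|-----|----|-----|----|-----|-----|-------------------|-------------------|
| CP  | 51 | 0   | 0  | 0   | 7   | 0.88              | 1                 |
| RAP | 0  | 187 | 0  | 53  | 2   | 0.77              | 0.93              |
| EP  | 1  | 0   | 5  | 0   | 0   | 0.83              | 1                 |
| RTP | 6  | 15  | 0  | 843 | 21  | 0.95              | 0.95              |
| ARP | 2  | 7   | 0  | 7   | 974 | 0.98              | 0.91              |

(c) Best match using pooled embeddings

|     | CP  | RAP | EP | RTP  | ARP  | Class sensitivity | Class specificity |
|-----|-----|-----|----|------|------|-------------------|-------------------|
| CP  | 675 | 9   | 7  | 80   | 133  | 0.75              | 0.97              |
| RAP | 7   | 435 | 3  | 87   | 76   | 0.72              | 0.97              |
| EP  | 46  | 6   | 92 | 17   | 8    | 0.55              | 0.97              |
| RTP | 112 | 96  | 11 | 1680 | 108  | 0.84              | 0.94              |
| ARP | 34  | 66  | 48 | 41   | 3289 | 0.95              | 0.94              |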
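
As a rough illustration of how the per-class columns relate to the counts, the following Python sketch recomputes sensitivity and specificity for panel (a). It assumes that each row is the reference class and each column the assigned class (the table itself does not state the orientation), and it uses the usual one-vs-rest definitions. The names `PANEL_A` and `per_class_metrics` are hypothetical and not taken from the paper. Under this assumption the sensitivities agree with the tabulated values to two decimals; the specificities may differ slightly in the last digit, so the authors' exact calculation may not be the one shown here.

```python
# Panel (a) counts copied from the table.
# Assumption: rows are reference classes, columns are assigned classes.
CLASSES = ["CP", "RAP", "EP", "RTP", "ARP"]
PANEL_A = [
    [603, 0, 0, 65, 15],
    [1, 414, 0, 42, 4],
    [4, 1, 83, 2, 0],
    [59, 59, 2, 1758, 68],
    [13, 21, 2, 19, 3249],
]


def per_class_metrics(matrix):
    """Return a list of (sensitivity, specificity) pairs, one per class,
    using one-vs-rest counts derived from a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    metrics = []
    for k in range(len(matrix)):
        tp = matrix[k][k]                        # correctly assigned to class k
        fn = sum(matrix[k]) - tp                 # class k assigned elsewhere
        fp = sum(row[k] for row in matrix) - tp  # other classes assigned to k
        tn = total - tp - fn - fp                # everything else
        metrics.append((tp / (tp + fn), tn / (tn + fp)))
    return metrics


if __name__ == "__main__":
    for name, (sens, spec) in zip(CLASSES, per_class_metrics(PANEL_A)):
        print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

The same function can be applied to panels (b) and (c) by substituting their counts for `PANEL_A`.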