Table 1 PTM filtering tokens and information extraction assessment

From: Application of text-mining for updating protein post-translational modification annotation in UniProtKB

| PTM | Filtering token | Generic corpus: # filtered abstracts | Generic corpus: # retrieved abstracts | Generic corpus: precision | Positive corpora: # abstracts | Positive corpora: recall |
|---|---|---|---|---|---|---|
| Acetylation | “acet” | 26,144 | 1,753 | 65% | 97 | 89% |
| Amidation | “amid” | 21,861 | 1,515 | 73% | 61 | 95% |
| Disulfide bond | “disulf” | 6,933 | 1,095 | 94% | 514 | 75% |
| Glycosylation | “glyco” | 31,379 | 2,746 | 73% | 464 | 85% |
| Methylation | “methyl” | 28,015 | 664 | 57% | 47 | 87% |
| Phosphorylation | “phospho” | 61,144 | 16,129 | 71% | 906 | 93% |
| Sulfation | “sulf” | 20,834 | 256 | 65% | 40 | 92% |

  1. “Filtering token” is the term used to select the abstracts; “# filtered abstracts” is the number of abstracts that contain this term; and “# retrieved abstracts” is the number of abstracts selected by the complete sentence extraction procedure. Precision was estimated by manual analysis of 100 positive abstracts.
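
The token-filtering step and the two assessment measures can be illustrated with a minimal Python sketch. The tokens are copied from Table 1; everything else is an assumption made for illustration: the function names (`filter_abstracts`, `precision`, `recall`) are hypothetical, the matching is assumed to be a case-insensitive substring test, recall is assumed to be the fraction of positive-corpus abstracts that the procedure retrieves, and the toy abstracts are invented. The sentence extraction procedure itself is not reproduced here.

```python
# Filtering tokens as listed in Table 1.
FILTERING_TOKENS = {
    "Acetylation": "acet",
    "Amidation": "amid",
    "Disulfide bond": "disulf",
    "Glycosylation": "glyco",
    "Methylation": "methyl",
    "Phosphorylation": "phospho",
    "Sulfation": "sulf",
}


def filter_abstracts(abstracts, token):
    """Keep abstracts containing the token.

    Assumption: a case-insensitive substring match; the source only
    says the token is "used to select the abstracts".
    """
    token = token.lower()
    return [text for text in abstracts if token in text.lower()]


def precision(correct, checked):
    """Fraction of manually checked retrieved abstracts judged correct."""
    return correct / checked


def recall(retrieved, total_positive):
    """Assumed definition: fraction of positive-corpus abstracts retrieved."""
    return retrieved / total_positive


# Invented toy abstracts, for illustration only.
toy_abstracts = [
    "Phosphorylation of Ser-15 activates the kinase.",
    "The enzyme is N-glycosylated at Asn-42.",
    "Crystal structure of a phosphodiesterase.",
]

phospho_hits = filter_abstracts(toy_abstracts, FILTERING_TOKENS["Phosphorylation"])
glyco_hits = filter_abstracts(toy_abstracts, FILTERING_TOKENS["Glycosylation"])
```

In this toy input the “phospho” token also matches the phosphodiesterase abstract, which mentions no modification. That kind of false positive is why the filtered set is much larger than the retrieved set in the table. As a worked example of the arithmetic, 65 correct abstracts out of a manually checked sample of 100 would give `precision(65, 100) == 0.65`, matching the 65% reported for acetylation.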