
Table 1 PTM filtering tokens and information extraction assessment

From: Application of text-mining for updating protein post-translational modification annotation in UniProtKB

                                  Generic corpus                                          Positive corpora
PTM              Filtering token  # Filtered abstracts  # Retrieved abstracts  Precision  # Abstracts  Recall
Acetylation      “acet”           26,144                1,753                  65%        97           89%
Amidation        “amid”           21,861                1,515                  73%        61           95%
Disulfide bond   “disulf”         6,933                 1,095                  94%        514          75%
Glycosylation    “glyco”          31,379                2,746                  73%        464          85%
Methylation      “methyl”         28,015                664                    57%        47           87%
Phosphorylation  “phospho”        61,144                16,129                 71%        906          93%
Sulfation        “sulf”           20,834                256                    65%        40           92%
  1. “Filtering token” is the term used to select the abstracts; “# filtered abstracts” is the number of abstracts that contain this token; “# retrieved abstracts” is the number of abstracts selected by the complete sentence extraction procedure. Precision was estimated from a manual analysis of 100 positive abstracts.
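
The token filtering and the precision estimate described in the note can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the function names `filter_by_token` and `precision`, the case-insensitive substring match, and the three sample abstracts are all assumptions made for the example. The figure of 65 correct abstracts out of 100 checked is simply the count that would yield the 65% reported for acetylation.

```python
def filter_by_token(abstracts, token):
    """Keep abstracts whose text contains the filtering token.

    Assumption: matching is a case-insensitive substring test.
    """
    token = token.lower()
    return [text for text in abstracts if token in text.lower()]


def precision(correct, checked):
    """Fraction of manually checked abstracts judged correct."""
    return correct / checked if checked else 0.0


# Hypothetical abstracts, invented for illustration only.
abstracts = [
    "Acetylation of histone H3 at lysine 9 was detected.",
    "The protein is phosphorylated at Ser-10.",
    "Sodium acetate buffer was used for elution.",
]

filtered = filter_by_token(abstracts, "acet")
print(len(filtered))        # 2: the token also matches "acetate"
print(precision(65, 100))   # 0.65
```

The third sample abstract shows why a short token such as “acet” yields many filtered abstracts that do not describe a modification, which is consistent with the large gap between the filtered and retrieved counts in the table.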