Application of text-mining for updating protein post-translational modification annotation in UniProtKB

BMC Bioinformatics

Table 1 PTM filtering tokens and information extraction assessment

PTM	Filtering	Generic corpus			Positive corpora
	token	# Filtered abstracts	# Retrieved abstracts	Precision	# Abstracts	Recall
Acetylation	“acet”	26,144	1,753	65%	97	89%
Amidation	“amid”	21,861	1,515	73%	61	95%
Disulfide bond	“disulf”	6,933	1,095	94%	514	75%
Glycosylation	“glyco”	31,379	2,746	73%	464	85%
Methylation	“methyl”	28,015	664	57%	47	87%
Phosphorylation	“phospho”	61,144	16,129	71%	906	93%
Sulfation	“sulf”	20,834	256	65%	40	92%

“Filtering token” is the term used to select abstracts, “# Filtered abstracts” is the number of abstracts in the generic corpus that contain this term, and “# Retrieved abstracts” is the number of abstracts selected by the complete sentence extraction procedure. Precision was estimated by manual analysis of 100 positive abstracts.
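
The token filter and the two metrics in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the case-insensitive substring match, and the toy abstracts are all assumptions made for illustration, and the sentence extraction step that produces the "# Retrieved abstracts" column is not reproduced here.

```python
def filter_by_token(abstracts, token):
    """Keep abstracts containing the filtering token.

    Assumption: a case-insensitive substring match; the table does not
    specify how the token was matched.
    """
    needle = token.lower()
    return [text for text in abstracts if needle in text.lower()]


def precision(n_correct, n_checked):
    """Fraction of manually checked retrieved abstracts judged correct."""
    return n_correct / n_checked


def recall(n_retrieved, n_positive):
    """Fraction of a positive corpus that the procedure retrieves."""
    return n_retrieved / n_positive


# Toy, made-up abstracts (hypothetical, not taken from either corpus).
abstracts = [
    "The protein is phosphorylated at Ser-15 by ATM.",
    "Phospholipid composition of the membrane was measured.",
    "Crystal structure of a bacterial transporter.",
]

filtered = filter_by_token(abstracts, "phospho")
print(len(filtered))  # 2: the token also matches "Phospholipid"
```

The second toy abstract shows why a short token is only a coarse pre-filter: it matches text unrelated to the modification, which is consistent with the "# Filtered abstracts" counts being much larger than the "# Retrieved abstracts" counts.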

ISSN: 1471-2105