Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb

BMC Bioinformatics

Table 4 Test corpora for information extraction evaluation. Based on the citation references in UniProtKb, a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora were derived from this corpus: the gold standard corpus (GC), which serves as a manually annotated test set, and the cross-validation corpus (XC), which contains annotations assigned automatically from information in UniProtKb.

Dataset	Gold standard corpus (GC)	Cross-validation corpus (XC1)	Cross-validation corpus (XC2)
Abstracts count	100	55,998	5,253
Method of annotation	manual	automatic	automatic
total/unique residues	362/262 (with 262/191 having residue name + residue sequence position)	N/A	N/A
total/unique proteins	990/511	N/A	N/A
total/unique organisms	323/123	N/A	N/A
total/unique associations	240/172 residue-protein-organism associations	N/A/70,401 protein-organism as UTP	N/A/68,008 protein-residue as URP
Application	Test the type, amount and reliability of the extracted information (reproduction of manually annotated information).	Test set is assumed to contain the same type of information as GC, but its reliability is uncertain. Study the reproduction of information contained in the database.	Test set is assumed to contain the same type of information as GC, but its reliability is uncertain. Study the reproduction of information contained in the database.
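
The total/unique figures in the table separate every annotated mention from the number of distinct entities. A minimal Python sketch of that counting, assuming each mention is represented as a hashable tuple; the function name `total_and_unique` and the sample mentions are hypothetical and are not taken from the corpus:

```python
from collections import Counter


def total_and_unique(mentions):
    """Return (total, unique) counts for a sequence of annotation mentions."""
    counts = Counter(mentions)
    # total = every occurrence; unique = number of distinct mentions
    return sum(counts.values()), len(counts)


# Hypothetical residue mentions as (residue name, sequence position) pairs.
mentions = [("Ser", 15), ("Ser", 15), ("Tyr", 204), ("His", 57)]
total, unique = total_and_unique(mentions)
print(f"{total}/{unique}")
```

Under this reading, a residue mentioned twice in the same form adds two to the total but only one to the unique count.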

ISSN: 1471-2105