Table 4 Test corpora for information extraction evaluation. Based on the citation references from UniProtKb, a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora were derived from this corpus: the gold standard corpus (GC), which resembles a manually annotated test set, and the cross-validation corpus (XC), which contains automatically assigned annotations based on information from UniProtKb.

From: Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb

| Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2) |
| --- | --- | --- | --- |
| Abstracts count | 100 | 55,998 | 5,253 |
| Method of annotation | manual | automatic | automatic |
| total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A |
| total/unique proteins | 990/511 | N/A | N/A |
| total/unique organisms | 323/123 | N/A | N/A |
| total/unique associations | 240/172 residue-protein-organism associations | NA/70,401 protein-organism as UTP | NA/68,008 protein-residue as URP |
| Application | Test the type, amount and reliability of the extracted information (reproduction of manually annotated information). | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. |