From: Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb
Dataset | Gold standard corpus (GC) | Cross-validation corpus (XC1) | Cross-validation corpus (XC2) |
---|---|---|---|
Abstract count | 100 | 55,998 | 5,253 |
Method of annotation | manual | automatic | automatic |
Total/unique residues | 362/262 (with 262/191 having residue name + residue sequence position) | N/A | N/A |
Total/unique proteins | 990/511 | N/A | N/A |
Total/unique organisms | 323/123 | N/A | N/A |
Total/unique associations | 240/172 residue-protein-organism associations | N/A / 70,401 protein-organism as UTP | N/A / 68,008 protein-residue as URP |
Application | Test the type, amount and reliability of the extracted information (reproduction of manually annotated information). | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. | Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database. |