Concept annotation in the CRAFT corpus

Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K Bretonnel; Verspoor, Karin; Blake, Judith A; Hunter, Lawrence E

doi:10.1186/1471-2105-13-161

BMC Bioinformatics

Table 3 Concept annotation attributes of corpora


corpus/corpora	total # words/tokens	# & type of documents	domain(s)	annotation concept schema(s)	total # concept annotations
CRAFT Corpus (full/initial release)	~790,000/~560,000	97/67 articles	sources of MGI annotations of mouse genes/gene products	Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene	~140,000/~100,000
ABGene		4,265 sentences		n/a	~8,200
BioInfer	~34,000/~30,000^f	1,100 sentences	protein-protein interactions	~100 entity classes, ~100 relationships	~6,300 named entities, ~2,700 relationships^g
CALBC corpus	~16,000,000	150,000 abstracts	immunology	UniProt, NCBITaxon, UMLS^h	~2,700,000
CLEF Corpus		various^i	clinical/cancer data	6 concept types
FetchProt Corpus		200 articles	protein tyrosine kinase activity	10 concept types, UniProt	~3,800
4th i2b2/VA Challenge Corpus		~750 discharge summaries	clinical data	3 concept types	~2,000
GENETAG	~548,000	20,000 sentences		n/a	~25,000 genes/proteins, ~19,000 alternative lexical forms
GENIA 3.0	~440,000	2,000 abstracts	human blood-cell transcription factors	35 entity classes, 34 process classes	~93,000 entities, ~36,000 events
GREC		240 abstracts	E. coli gene regulation	433 classes	~5,000
ITI TXM PPI/TE Corpora	~2,000,000/~1,900,000	217/238 articles	protein-protein interactions/tissue expression	9/13 concept types, Entrez Gene, RefSeq^j, ChEBI, MeSH, NCBITaxon^k	~160,000/~164,000
MedPost	~156,000
OntoNotes 2.0	~500,000	1,000 newswire documents	English & Chinese news	1000s of WordNet senses, 50 concept types^l	~58,000 verbs^m
PennBioIE Oncology/CYP v1.0 Corpora	~381,000 (~327,000)/~313,000 (~274,000)	1,414/1,100 abstracts	medical genetics of oncology/inhibition of cytochrome P450 enzymes	n/a
Yapex Corpus		200 abstracts	protein-protein interactions	n/a	~3,700

^fBioInfer has ~34,000 tokens total, and ~30,000 excluding punctuation.
^gBioInfer has ~6,300 named-entity annotations and ~2,700 annotations of what are termed relationships but that might more properly be conceptualized as process or state classes and thus are included here, totaling ~9,000 concept annotations.
^hIn the CALBC corpus, NCBI Taxonomy and UMLS concepts were respectively used to mark up species and disease mentions.
^iThe CLEF Corpus is composed of many types of medical documents: 2 entire patient records (themselves composed of 9 narratives, 1 imaging report, 7 histopathology reports, and associated data) and 50 each of clinical narratives, histopathology reports, and imaging reports.
^jThe annotators of the ITI TXM Corpora attempted to assign Entrez Gene IDs to gene annotations and RefSeq IDs to annotations of proteins, mRNAs, and cDNAs (although it is admitted that this assignment was very time-consuming and thus was not performed on the training subset of the PPI Corpus).
^kThe annotators of the ITI TXM Corpora used ChEBI, MeSH, and NCBI Taxonomy concepts for drug, tissue, and sequence mentions.
^lIn OntoNotes, the 700 most frequent polysemous verbs and 1,100 most frequent polysemous nouns have been annotated with the appropriate senses of WordNet 2.0, so the size of the schema (i.e., the total number of senses of these 1,800 words) likely numbers in the thousands; however, they note that this is different from their ontological annotation, for which only approximately 50 concept types are being used to subsume the annotated word senses.
^mIn addition to ~58,000 annotated verbs, OntoNotes has an unstated but presumably large count of annotated nouns.
A summary of counts of words/tokens, of counts and types of component documents, of domains, and of counts of concept annotations for the CRAFT Corpus and related corpora.


ISSN: 1471-2105
