Table 3 Concept annotation attributes of corpora

From: Concept annotation in the CRAFT corpus

| corpus/corpora | total # words/tokens | # & type of documents | domain(s) | annotation concept schema(s) | total # concept annotations |
|---|---|---|---|---|---|
| CRAFT Corpus (full/initial release) | ~790,000/~560,000 | 97/67 articles | sources of MGI annotations of mouse genes/gene products | Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene | ~140,000/~100,000 |
| ABGene | | 4,265 sentences | | n/a | ~8,200 |
| BioInfer | ~34,000/~30,000 [f] | 1,100 sentences | protein-protein interactions | ~100 entity classes, ~100 relationships | ~6,300 named entities, ~2,700 relationships [g] |
| CALBC corpus | ~16,000,000 | 150,000 abstracts | immunology | UniProt, NCBITaxon, UMLS [h] | ~2,700,000 |
| CLEF Corpus | | various [i] | clinical/cancer data | 6 concept types | |
| FetchProt Corpus | | 200 articles | protein tyrosine kinase activity | 10 concept types, UniProt | ~3,800 |
| 4th i2b2/VA Challenge Corpus | | ~750 discharge summaries | clinical data | 3 concept types | ~2,000 |
| GENETAG | ~548,000 | 20,000 sentences | | n/a | ~25,000 genes/proteins, ~19,000 alternative lexical forms |
| GENIA 3.0 | ~440,000 | 2,000 abstracts | human blood-cell transcription factors | 35 entity classes, 34 process classes | ~93,000 entities, ~36,000 events |
| GREC | | 240 abstracts | E. coli gene regulation | 433 classes | ~5,000 |
| ITI TXM PPI/TE Corpora | ~2,000,000/~1,900,000 | 217/238 articles | protein-protein interactions/tissue expression | 9/13 concept types, Entrez Gene, RefSeq [j], ChEBI, MeSH, NCBITaxon [k] | ~160,000/~164,000 |
| MedPost | ~156,000 | | | | |
| OntoNotes 2.0 | ~500,000 | 1,000 newswire documents | English & Chinese news | 1000s of WordNet senses, 50 concept types [l] | ~58,000 verbs [m] |
| PennBioIE Oncology/CYP v1.0 Corpora | ~381,000 (~327,000)/~313,000 (~274,000) | 1,414/1,100 abstracts | medical genetics of oncology/inhibition of cytochrome P450 enzymes | n/a | |
| Yapex Corpus | | 200 abstracts | protein-protein interactions | n/a | ~3,700 |

  - [f] BioInfer has ~34,000 tokens total, and ~30,000 excluding punctuation.
  - [g] BioInfer has ~6,300 named-entity annotations and ~2,700 annotations of what are termed relationships but that might more properly be conceptualized as process or state classes and thus are included here, totaling ~9,000 concept annotations.
  - [h] In the CALBC corpus, NCBI Taxonomy and UMLS concepts were respectively used to mark up species and disease mentions.
  - [i] The CLEF Corpus is composed of many types of medical documents: 2 entire patient records (themselves composed of 9 narratives, 1 imaging report, 7 histopathology reports, and associated data) and 50 each of clinical narratives, histopathology reports, and imaging reports.
  - [j] The annotators of the ITI TXM Corpora attempted to assign Entrez Gene IDs to gene annotations and RefSeq IDs to annotations of proteins, mRNAs, and cDNAs (although it is admitted that this assignment was very time-consuming and thus was not performed on the training subset of the PPI Corpus).
  - [k] The annotators of the ITI TXM Corpora used ChEBI, MeSH, and NCBI Taxonomy concepts for drug, tissue, and sequence mentions.
  - [l] In OntoNotes, the 700 most frequent polysemous verbs and 1,100 most frequent polysemous nouns have been annotated with the appropriate senses of WordNet 2.0, so the size of the schema (i.e., the total number of senses of these 1,800 words) likely numbers in the thousands; however, they note that this is different from their ontological annotation, for which only approximately 50 concept types are being used to subsume the annotated word senses.
  - [m] In addition to ~58,000 annotated verbs, OntoNotes has an unstated but presumably large count of annotated nouns.

A summary of counts of words/tokens, of counts and types of component documents, of domains, and of counts of concept annotations for the CRAFT Corpus and related corpora.