corpus/corpora | total # words/tokens | # & type of documents | domain(s) | annotation concept schema(s) | total # concept annotations |
---|---|---|---|---|---|
CRAFT Corpus (full/initial release) | ~790,000/~560,000 | 97/67 articles | sources of MGI annotations of mouse genes/gene products | Open Biomedical Ontologies (CL, ChEBI, SO, PRO, GO BP/CC/MF, NCBITaxon), Entrez Gene | ~140,000/~100,000 |
ABGene | | 4,265 sentences | | n/a | ~8,200 |
BioInfer | ~34,000/~30,000<sup>f</sup> | 1,100 sentences | protein-protein interactions | ~100 entity classes, ~100 relationships | ~6,300 named entities, ~2,700 relationships<sup>g</sup> |
CALBC corpus | ~16,000,000 | 150,000 abstracts | immunology | UniProt, NCBITaxon, UMLS<sup>h</sup> | ~2,700,000 |
CLEF Corpus | | various<sup>i</sup> | clinical/cancer data | 6 concept types | |
FetchProt Corpus | | 200 articles | protein tyrosine kinase activity | 10 concept types, UniProt | ~3,800 |
4th i2b2/VA Challenge Corpus | | ~750 discharge summaries | clinical data | 3 concept types | ~2,000 |
GENETAG | ~548,000 | 20,000 sentences | | n/a | ~25,000 genes/proteins, ~19,000 alternative lexical forms |
GENIA 3.0 | ~440,000 | 2,000 abstracts | human blood-cell transcription factors | 35 entity classes, 34 process classes | ~93,000 entities, ~36,000 events |
GREC | | 240 abstracts | E. coli gene regulation | 433 classes | ~5,000 |
ITI TXM PPI/TE Corpora | ~2,000,000/~1,900,000 | 217/238 articles | protein-protein interactions/tissue expression | 9/13 concept types, Entrez Gene, RefSeq<sup>j</sup>, ChEBI, MeSH, NCBITaxon<sup>k</sup> | ~160,000/~164,000 |
MedPost | ~156,000 | | | | |
OntoNotes 2.0 | ~500,000 | 1,000 newswire documents | English & Chinese news | 1000s of WordNet senses, 50 concept types<sup>l</sup> | ~58,000 verbs<sup>m</sup> |
PennBioIE Oncology/CYP v1.0 Corpora | ~381,000 (~327,000)/~313,000 (~274,000) | 1,414/1,100 abstracts | medical genetics of oncology/inhibition of cytochrome P450 enzymes | n/a | |
Yapex Corpus | | 200 abstracts | protein-protein interactions | n/a | ~3,700 |