Table 4 The sizes of corpora about genes and diseases

From: CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations

| Corpus | # documents | # words | Annotation |
| --- | --- | --- | --- |
| CoMAGC | 408 | 26177 | 821 sets of four annotation concepts |
| Craven | 1677 | 333845 | 829 gene-disease pairs |
| PolySearch | 522 | 116380 | 341 gene-disease pairs |
| GETM | 150 | 38355 | 267 gene expression-anatomical location pairs |
| MLEE | 262 | 56588 | 6677 events |
| ID | 30 | 153153 | 4150 events |
| CG | 600 | 129878 | 17248 events |
  1. All the corpora contain PubMed abstracts, except for the ID corpus, which contains full-text documents. For the Craven and PolySearch corpora, we show only the number of positive gene-disease pairs.