Table 5 Global and per-relation statistics for data cleaning and dataset generation

From: TBGA: a large-scale Gene-Disease Association dataset for Biomedical Relation Extraction

| Granularity | Target | Raw | Data cleaning: TS | Data cleaning: DR | Dataset generation: RN | Dataset generation: DB |
|---|---|---|---|---|---|---|
| Global | Publications | 707,390 | 572,981 | 572,607 | 447,280 | 57,675 |
| | Genes | 21,118 | 17,658 | 17,658 | 17,658 | 8827 |
| | Diseases | 23,433 | 17,032 | 17,023 | 17,023 | 6964 |
| Therapeutic | Instances | 10,744 | 4132 | 3925 | 3925 | 3925 |
| | Bags | 6872 | 2939 | 2857 | 2857 | 2857 |
| Biomarker | Instances | 1,530,072 | 1,080,089 | 1,075,327 | 580,053 | 24,739 |
| | Bags | 605,826 | 460,334 | 460,276 | 383,358 | 17,459 |
| Genomic Alterations | Instances | 849,472 | 531,601 | 516,630 | 516,630 | 37,346 |
| | Bags | 289,693 | 202,548 | 202,045 | 202,045 | 15,028 |

  1. Columns represent, from left to right, the considered granularity level, the target item, the raw (initial) statistics, and the statistics after each Data Cleaning and Dataset Generation step. The steps are: TS, DR, RN, and DB
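As a reading aid, the per-step effect of the pipeline can be derived from any row by subtracting adjacent columns. The sketch below does this for the Publications row only; the counts are copied from the table, while the function and variable names are illustrative and not part of the original work.

```python
# Counts copied from the Publications row of the table (Raw, TS, DR, RN, DB).
STEPS = ["Raw", "TS", "DR", "RN", "DB"]
PUBLICATIONS = [707_390, 572_981, 572_607, 447_280, 57_675]


def removed_per_step(counts):
    """Return how many items each step removed relative to the previous column."""
    return [prev - cur for prev, cur in zip(counts, counts[1:])]


def retained_fraction(counts):
    """Return the share of the raw count that remains after the last step."""
    return counts[-1] / counts[0]


if __name__ == "__main__":
    for step, removed in zip(STEPS[1:], removed_per_step(PUBLICATIONS)):
        print(f"{step}: removed {removed:,}")
    print(f"retained after DB: {retained_fraction(PUBLICATIONS):.1%}")
```

The same two helpers can be applied to any other row by swapping in that row's five counts.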