
Table 1 Number of variants imported from various external resources

From: Integrating 400 million variants from 80,000 human samples with extensive annotations: towards a knowledge base to analyze disease cohorts

| Study | Variant sites | Variants | Unique to study | Variants passed | Samples |
|---|---:|---:|---:|---:|---:|
| 1000 Genomes [1] | 81,195,126 | 81,693,252 | 57,400,612 | all | 2,504 |
| ESP6500 [2] | 1,982,177 | 1,998,204 | 184,225 | all | 6,503 |
| UK10K [47] ALSPAC/TWINS | 37,258,978 | 37,560,436 | 6,155,493 | all | 2,432 |
| UK10K with disease (c) | 9,391,582 | 11,177,227 | 8,847,466 | 9,969,036 | 4,888 |
| TCGA [4] germline (c) | 200,691,728 | 219,533,884 | 90,884,769 | n/a | 4,224 |
| TCGA somatic | 876,970 | 890,172 | 696,754 | all | 4,205 |
| Scripps Wellderly [48] | 76,144,271 | 91,947,469 | 63,331,143 | 53,303,437 | 534 |
| ExAC (b) [3] | 9,579,712 | 10,450,724 | 6,581,946 | 8,811,372 | 63,352 |
| MSSM BioBank genotyping | 849,806 | 849,806 | 0 | all | 11,210 |
| In-house resequencing study | 29,326,393 | 29,671,729 | 10,134,258 | 23,610,572 | 142 |
| Total observed | 358,152,122 | 399,404,510 | 244,216,666 | >217,796,115 | 82,558 (b) |
| Other resources: | | | | | |
| dbNSFP (a) [18] | 30,523,109 | 89,617,785 | 73,561,239 | | |
| ClinVar [12] | 101,317 | 104,455 | 31,694 | | |
| OMIM [49] | 10,863 | 10,913 | | | |
| COSMIC [50] | 1,483,983 | 1,525,243 | | | |
| PharmGKB (c) [51] | 672 | 684 | | | |
| SwissVar (d) | (77,047) | (84,649) | (34,198) | | |
| HGMD (c) [13] | 125,744 | 133,464 | 32,178 | | |
| Literature mining | 890,665 | | | | |
| Total observed + other | 388,902,292 | 472,965,749 | 317,841,777 | >217,796,115 | 82,558 |

  1. The first block refers to sequencing/genotyping studies, the second to sample-independent annotation databases. “Unique to study” counts variants that were observed only in that particular study. “Variants passed” counts variants that passed the quality metrics defined by the particular study in at least one sample; n/a: individual sample quality metrics not available. Totals exclude duplicates seen in different studies. Variants in annotation databases are included only if they can be mapped to precise coordinates and alleles. Because a large proportion of the variants discovered by literature mining are given at the protein level only, they were not compared to other studies
  2. (a) dbNSFP contains hypothetical variants; see text
  3. (b) ExAC includes samples from 1000 Genomes, ESP6500, and TCGA
  4. (c) Data from HGMD, PharmGKB, UK10K diseases, and TCGA germline are not visible to external users on the RVS website
  5. (d) Counts for SwissVar refer to distinct amino acid changes. Further details on individual resources are provided in Additional file 4: Table S3
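The counting rules in the table notes ("Unique to study" and totals that exclude duplicates) can be illustrated with a minimal sketch. It is not the authors' pipeline: the study names and variants are hypothetical toy data, the function `summarize` is invented for illustration, and it assumes that a variant is identified by chromosome, position, reference allele, and alternate allele, while a variant site is a chromosome and position only.

```python
from collections import Counter

# Hypothetical toy data: each variant is (chromosome, position, ref allele, alt allele).
studies = {
    "study_A": {("1", 100, "A", "G"), ("1", 100, "A", "T"), ("2", 50, "C", "T")},
    "study_B": {("1", 100, "A", "G"), ("3", 7, "G", "A")},
    "study_C": {("2", 50, "C", "T"), ("3", 7, "G", "A"), ("3", 9, "T", "C")},
}


def summarize(studies):
    """Return per-study counts plus a deduplicated 'total' row."""
    # Count in how many studies each distinct variant was observed.
    seen_in = Counter(v for variants in studies.values() for v in variants)

    rows = {}
    for name, variants in studies.items():
        rows[name] = {
            # Assumption: a site is a (chromosome, position) pair.
            "variant_sites": len({(chrom, pos) for chrom, pos, _, _ in variants}),
            "variants": len(variants),
            # A variant is unique to a study if no other study contains it.
            "unique_to_study": sum(1 for v in variants if seen_in[v] == 1),
        }

    # Totals are taken over the union, so duplicates across studies count once.
    all_variants = set().union(*studies.values())
    rows["total"] = {
        "variant_sites": len({(chrom, pos) for chrom, pos, _, _ in all_variants}),
        "variants": len(all_variants),
        "unique_to_study": sum(r["unique_to_study"] for r in rows.values()),
    }
    return rows


if __name__ == "__main__":
    for name, row in summarize(studies).items():
        print(name, row)
```

In this sketch the total of "unique_to_study" is the sum of the per-study values, which matches Table 1: the ten per-study "Unique to study" counts add up to the 244,216,666 reported under "Total observed". The totals for variant sites and variants, in contrast, are smaller than the column sums because shared variants are counted once.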