Table 1. A summary matrix of the tools presented in this work along with the different feature measures on which the tools are evaluated

From: Critical assessment of on-premise approaches to scalable genome analysis

 

| Feature | OpenCGA | GEMINI | Hail | Bcftools | SnpSift |
| --- | --- | --- | --- | --- | --- |
| Entry requirement and installation | Many dependencies | Python-based install script | Python package installation | make compilation | Java JAR download |
| Data management | MongoDB/HBase | SQLite | Matrix Table | Flat-file VCF | Flat-file VCF |
| Storage of the INFO column | Highly indexed, nested object structure | Partially indexed SQLite tabling | Stored as type-inferred columns | Unindexed VCF file INFO column | Unindexed VCF file INFO column |
| Annotation availability | 34 data sources, manual | 18 data sources, automatic | 13 data sources, manual (experimental) | N/A, manual | dbNSFP, manual |
| Query complexity | Multiple clients, unconventional syntax | SQL query-like | DataFrame-like filtering | CLI, documented syntax | CLI, documented syntax |
| Query speed | Fast, comprehensive indexing | Database indexing, moderate speed | Fast, Spark-backend querying | Fast for flat-file based query, indexes by chromosome | Slow, not indexed |
| Query ranking | Best in rsID query (Scenario 1) | Best in homozygous genotype query (Scenario 3) | Best in complex query (Scenario 4) | Best in INDEL-type query (Scenario 2) | Overall last place |
| Scalability | Horizontally scalable, managed platform | Limited vertical, monolithic | Efficient filesystem storage, Spark-based | N/A, monolithic | N/A, monolithic |
| Customization (function and DB) | Java plugins | Only DB is extensible | Python-native | C plugins | Only built-in commands |
| Output | JSON, VCF-like, Tabular text | Tabular text | Matrix Table object | VCF file, Tabular text | VCF file |
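
To make the "Unindexed VCF file INFO column" entries more concrete, the following is a minimal sketch in plain Python of what querying an unindexed flat-file INFO column involves: every record is read and its INFO string is split before a condition can be tested. It is not how Bcftools or SnpSift are implemented. The helper names `parse_info` and `scan_by_info` and the toy records are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Union

InfoValue = Union[str, bool]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Split a VCF INFO string such as 'DP=14;AF=0.5;DB' into a dict.

    Key=value pairs keep their value as a string; flag entries that
    carry no value (e.g. 'DB') are stored as True.
    """
    fields: Dict[str, InfoValue] = {}
    if info == ".":  # '.' denotes an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def scan_by_info(lines: Iterable[str], key: str, value: str) -> Iterator[List[str]]:
    """Yield the columns of every record whose INFO[key] equals value.

    This is a linear scan: with no index on INFO, each record has to be
    parsed before the condition can be checked.
    """
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:  # INFO is the eighth tab-separated column
            continue
        if parse_info(columns[7]).get(key) == value:
            yield columns


# Toy records invented for this example (not taken from the paper).
TOY_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=14;AF=0.5;DB",
    "1\t200\t.\tAT\tA\t30\tPASS\tDP=9;AF=0.1",
    "2\t300\trs3\tC\tT\t99\tPASS\tDP=14;AF=0.9",
]

hits = list(scan_by_info(TOY_VCF, "DP", "14"))
print([record[1] for record in hits])
```

By contrast, tools in the table that store INFO fields as indexed or typed columns can, in principle, test such a condition without re-parsing each record's INFO string on every query.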