Critical assessment of on-premise approaches to scalable genome analysis

BMC Bioinformatics

Table 1. Summary matrix of the tools presented in this work and the features on which they are evaluated

Feature	OpenCGA	GEMINI	Hail	Bcftools	SnpSift
Entry requirement and installation	Many dependencies	Python-based install script	Python package installation	make compilation	Java JAR download
Data management	MongoDB/HBase	SQLite	Matrix Table	Flat-file VCF	Flat-file VCF
Storage of the INFO column	Highly indexed, nested object structure	Partially indexed SQLite tables	Stored as type-inferred columns	Unindexed VCF file INFO column	Unindexed VCF file INFO column
Annotation availability	34 data sources, manual	18 data sources, automatic	13 data sources, manual (experimental)	N/A, manual	dbNSFP, manual
Query complexity	Multiple clients, unconventional syntax	SQL query-like	DataFrame-like filtering	CLI, documented syntax	CLI, documented syntax
Query speed	Fast, comprehensive indexing	Database indexing, moderate speed	Fast, Spark-backed querying	Fast for flat-file-based querying, indexed by chromosome	Slow, not indexed
Query ranking	Best in rsID query (Scenario 1)	Best in homozygous genotype query (Scenario 3)	Best in complex query (Scenario 4)	Best in INDEL-type query (Scenario 2)	Overall last place
Scalability	Horizontally scalable, managed platform	Limited vertical, monolithic	Efficient filesystem storage, Spark-based	N/A, monolithic	N/A, monolithic
Customization (function and DB)	Java plugins	Only DB is extensible	Python-native	C plugins	Only built-in commands
Output	JSON, VCF-like, Tabular text	Tabular text	Matrix Table object	VCF file, Tabular text	VCF file
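
The "Unindexed VCF file INFO column" entries for the flat-file tools can be made concrete with a small sketch. The Python below is not code from any of the tools in the table; it is a minimal illustration, under stated assumptions, of why a query against a flat-file VCF has to read every record and re-parse its INFO string. The helper names (`parse_info`, `scan_vcf`, `is_indel`) and the example records are hypothetical, and the INDEL test is deliberately simplified: it only compares REF and ALT lengths and ignores symbolic alleles.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Union

InfoValue = Union[str, bool]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Split a VCF INFO string (semicolon-separated key=value pairs) into a dict.

    Keys without a value (flags such as DB) map to True. Values stay as
    strings: unlike a type-inferred store, a flat file carries no parsed types.
    """
    fields: Dict[str, InfoValue] = {}
    if info == ".":  # "." marks an empty INFO column
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_indel(cols: List[str], info: Dict[str, InfoValue]) -> bool:
    """Simplified INDEL test (an assumption): some ALT allele differs in length from REF."""
    ref = cols[3]
    return any(len(alt) != len(ref) for alt in cols[4].split(","))


def scan_vcf(
    lines: Iterable[str],
    predicate: Callable[[List[str], Dict[str, InfoValue]], bool],
) -> Iterator[List[str]]:
    """Linear scan: every data line is split and its INFO column re-parsed."""
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:  # not a complete VCF record
            continue
        if predicate(cols, parse_info(cols[7])):
            yield cols


# Hypothetical records, for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs0001\tA\tG\t50\tPASS\tDP=20;AF=0.5",
    "1\t200\t.\tAT\tA\t60\tPASS\tDP=35;DB",
    "2\t300\trs0003\tC\tCTT,T\t40\tPASS\tDP=12",
]

if __name__ == "__main__":
    for record in scan_vcf(EXAMPLE_VCF, is_indel):
        print(record[0], record[1], record[3], record[4])
```

Because nothing is precomputed, the cost of such a scan grows with the number of records regardless of how selective the predicate is. The indexed and type-inferred storage models in the table trade extra loading work for the ability to avoid this full pass.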

ISSN: 1471-2105