From: Critical assessment of on-premise approaches to scalable genome analysis
Feature | OpenCGA | GEMINI | Hail | Bcftools | SnpSift |
---|---|---|---|---|---|
Entry requirement and installation | Many dependencies | Python-based install script | Python package installation | make compilation | Java JAR download |
Data management | MongoDB/HBase | SQLite | Matrix Table | Flat-file VCF | Flat-file VCF |
Storage of the INFO column | Highly indexed, nested object structure | Partially indexed SQLite tables | Stored as type-inferred columns | Unindexed VCF file INFO column | Unindexed VCF file INFO column |
Annotation availability | 34 data sources, manual | 18 data sources, automatic | 13 data sources, manual (experimental) | N/A, manual | dbNSFP, manual |
Query complexity | Multiple clients, unconventional syntax | SQL-like queries | DataFrame-like filtering | CLI, documented syntax | CLI, documented syntax |
Query speed | Fast, comprehensive indexing | Moderate, database indexing | Fast, Spark-backend querying | Fast for flat-file queries, indexed by chromosome | Slow, not indexed |
Query ranking | Best in rsID query (Scenario 1) | Best in homozygous genotype query (Scenario 3) | Best in complex query (Scenario 4) | Best in INDEL-type query (Scenario 2) | Overall last place |
Scalability | Horizontally scalable, managed platform | Limited vertical, monolithic | Efficient filesystem storage, Spark-based | N/A, monolithic | N/A, monolithic |
Customization (function and DB) | Java plugins | Only DB is extensible | Python-native | C plugins | Only built-in commands |
Output | JSON, VCF-like, Tabular text | Tabular text | Matrix Table object | VCF file, Tabular text | VCF file |
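
The "Storage of the INFO column" and "Query speed" rows can be illustrated with a small Python sketch that contrasts an unindexed linear scan over flat-file VCF lines with a lookup through a prebuilt rsID index. This is a toy illustration only, not how any of the compared tools is implemented: the records, the function names (`parse_info`, `parse_record`, `scan_by_rsid`, `build_rsid_index`, `is_indel`) and the single-ALT-allele simplification are all assumptions made for the example.

```python
# Toy VCF body lines (hypothetical data, for illustration only).
# Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
VCF_LINES = [
    "1\t1000\trs111\tA\tG\t50\tPASS\tAF=0.12;DP=30",
    "1\t2000\trs222\tAT\tA\t60\tPASS\tAF=0.40;DP=22;INDEL",
    "2\t1500\trs333\tC\tT\t45\tPASS\tAF=0.01;DP=41",
]


def parse_info(info):
    """Split an INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line):
    """Parse one tab-separated VCF body line into a dict."""
    chrom, pos, rsid, ref, alt, _qual, _filter, info = line.split("\t")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rsid,
        "ref": ref,
        "alt": alt,
        "info": parse_info(info),
    }


def scan_by_rsid(lines, rsid):
    """Unindexed lookup: parse line after line until the rsID matches."""
    for line in lines:
        record = parse_record(line)
        if record["id"] == rsid:
            return record
    return None


def build_rsid_index(lines):
    """Indexed lookup: pay the parsing cost once, then query by key."""
    return {record["id"]: record for record in map(parse_record, lines)}


def is_indel(record):
    """Simplified INDEL test; assumes a single ALT allele per record."""
    return len(record["ref"]) != len(record["alt"])


if __name__ == "__main__":
    index = build_rsid_index(VCF_LINES)
    print(scan_by_rsid(VCF_LINES, "rs222"))
    print(index["rs333"]["info"])
    print([r["id"] for r in index.values() if is_indel(r)])
```

The linear scan re-parses the INFO string on every query, which is the cost the table attributes to unindexed flat files, whereas the dictionary stands in for the kind of precomputed index a database-backed tool maintains.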