SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction

BMC Bioinformatics

Table 1 Comparison of the evaluated methods

Method	Memory required^(a)	Speed (rank)^(b)	Prediction AUC (rank)^(c)	Fitted largest data^(d)	Genetic models^(e)	Ease of use^(f)
SparSNP	•••	(1)	(1)	yes	•∘∘∘	•••••
glmnet	••∘	(2)	(1)	no	•∘∘∘	•••∘∘
HyperLasso	•••	(5)	(3)	yes	•••∘	∘∘∘∘∘
LIBLINEAR	••∘	(3)	(2)	no	•∘∘∘	••∘∘∘
LIBLINEAR-CDBLOCK	•••	(4)	(4)	yes	••∘∘∘	••∘∘∘

We evaluated each method in terms of the following criteria:
^(a) Memory requirements: maximum GiB required to complete the prediction experiment. Three points: ≤ 4 GiB, as is commonly available on laptops. Two points: > 4 GiB and ≤ 32 GiB, commonly available on compute servers. One point: > 32 GiB, typically available on higher-end servers.
^(b) Speed: time to complete in the timing experiments with 50,000 SNPs (Figure 2).
^(c) Prediction: best cross-validated AUC in the prediction experiment (Figure 3).
^(d) Fitted largest data: whether the tool successfully completed the largest timing experiment, consisting of p = 500,000 SNPs and N = 10,000 samples.
^(e) Models: one point for each natively supported genetic model: (i) additive, (ii) dominant/recessive, (iii) heterozygous, and (iv) interaction.
^(f) Ease of use: one point for each of the following that the tool does: (i) supports input in formats commonly used in the genetics community, such as PLINK BED or PED files; (ii) implements cross-validation; (iii) estimates the AUC, R², or explained variance from the cross-validation; (iv) plots the resulting AUC, R², or explained variance for easy model selection and evaluation; and (v) natively imputes missing genotypes.
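
The point-based criteria (a), (e), and (f) can be sketched in code. This is a minimal illustration and not the authors' scoring script: the function names `memory_points`, `checklist_points`, and `dots` are hypothetical, as are the example tool and its feature flags. Only the memory thresholds and the one-point-per-feature rule are taken from the footnotes above.

```python
from typing import Mapping


def memory_points(peak_gib: float) -> int:
    """Criterion (a): 3 points if <= 4 GiB, 2 if <= 32 GiB, otherwise 1."""
    if peak_gib < 0:
        raise ValueError("peak memory must be non-negative")
    if peak_gib <= 4:
        return 3
    if peak_gib <= 32:
        return 2
    return 1


def checklist_points(features: Mapping[str, bool]) -> int:
    """Criteria (e) and (f): one point per supported feature."""
    return sum(1 for supported in features.values() if supported)


def dots(points: int, maximum: int) -> str:
    """Render a score as filled and open bullets, in the style of Table 1."""
    if not 0 <= points <= maximum:
        raise ValueError("points must lie between 0 and maximum")
    return "•" * points + "∘" * (maximum - points)


if __name__ == "__main__":
    # Hypothetical tool, for illustration only (not a row of Table 1).
    ease_of_use = {
        "plink_input": True,
        "cross_validation": True,
        "reports_auc_or_r2": False,
        "plots": False,
        "imputation": False,
    }
    print(dots(memory_points(6.0), 3))                # ••∘
    print(dots(checklist_points(ease_of_use), 5))     # ••∘∘∘
```

Keeping the thresholds in one function makes the boundary cases explicit: exactly 4 GiB scores three points and exactly 32 GiB scores two, matching the "≤" in footnote (a).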

ISSN: 1471-2105