SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction

Background A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size. Results We introduce SparSNP, a tool for fitting lasso linear models for massive SNP datasets quickly and with very low memory requirements. In analysis on a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. Compared with the other tools, the models generated by SparSNP had better than or equal to predictive performance in cross-validation. Conclusions Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit. SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP


Figure 1: Analysis workflow
Here we describe a typical analysis using SparSNP. We assume that we have two datasets in PLINK BED/BIM/FAM format, named discovery and validation, on the same SNP microarray platform. The phenotype in the FAM file can be discrete (1 for controls and 2 for cases) or continuous. In the following, we assume the user is working with a Unix-like command line shell such as Bash.

Quality control
SparSNP implements simple imputation of genotypes such that missing genotypes are randomly assigned the value {0,1,2} with probability of 1/3 each, which does not introduce substantial bias to the model when the proportion of missingness is small and the genotypes are missing at random. We recommend removing samples and SNPs with high missingness, and testing for differential missingness between cases and controls.
(remove SNPs as indicated by output)

SparSNP on discovery data
• Run cross-validation. By default, 3 × 3-fold cross-validation will be performed, allowing up to 1024 SNPs in the model, using a case/control phenotype: for classification (case/control) $ ./cv3.sh discovery filtered sqrhinge and for continuous outputs (linear regression) $ ./cv3.sh discovery filtered linear The cv3.sh script can be modified to perform more cross-validation folds or to tune the model parameters.
• Plot AUC and explained phenotypic and genetic variance in the 3 cross-validation replications, with the optional population prevalence K = 1% and heritability h 2 L = 50%: $ ./eval.sh discovery filtered prev=0.01 h2l=0.5 The plots discovery filtered AUC.pdf, discovery filtered VarExp.pdf, and discovery filtered GenVarExp.pdf will be produced in the directory discovery. The raw AUC and explained variance data is stored in the R-data file named discovery filtered.RData. The prevalence and heritability are optional. Prevalence is needed for computing explained genetic and phenotypic variance, and heritability is needed for computing explained genetic variance.
• The set of models with best predictive ability will automatically be chosen from the results, based on smoothing of the AUC or R 2 . The SNPs appearing in these models will be tabulated according to how many time they were included over all cross-validation folds. To inspect the SNPs selected by models that maximise the predictive ability: head discovery/topsnps.txt RS Counts Proportion Replications rs2050189 60 1 60 rs2187668 60 1 60 rs9357152 60 1 60 rs7774954 60 1 60 rs3129763 58 0.966666666666667 60 ...
The SNPs are ordered by the number of times they were included in a model with non-zero weight (Counts) out of the total number of cross-validation folds (Replications), also shown as a Proportion. SNPs at the top of the list are more stably selected by the lasso and are potentially more robust markers than SNPs at the bottom of the list.

Optional: Apply models to validation data
SparSNP models trained on the discovery dataset can be tested on an independent dataset, if one is available. It is crucial for the validation dataset to contain the same SNPs in the same order and use the same allele as the minor allele.
• Plot AUC and optionally, phenotypic and genetic variance explained, on validation dataset: $ ./eval.R mode=validation prev=0.01 h2l=0.5 The plots validation filtered AUC.pdf, validation filtered VarExp.pdf, and validation filtered GenVarExp.pdf, will be produced in the directory validation.

4.
Other post processing • The model weights are stored in each cross-validation directory discovery/crossvalXX/beta.csv.XX.XX using a sparse text format <index:weight>, where index is the zero-based index of the SNP in the data (0 is the intercept), and weight is the model weight (a real number). The weights can be read into R (using read.table with sep=":", header=FALSE) or any other tool for visualisation or for validating the model on other datasets.
Some options that can be set by editing the file cv3.sh are: • NFOLDS: number of cross-validation folds (default=3) • NREPS: number of cross-validation replications (default=3) • NZMAX: maximum number of SNPs to allow in model (default=1024) • NLAMBDA1: number of λ penalties on the grid (default=20) • L1MIN: multiplier on smallest λ used; it should be a some positive fraction such as 0.01. Setting it lower will increase the number of SNPs in the models, but will also increase computational time.