A novel algorithm for simultaneous SNP selection in highdimensional genomewide association studies
 Verena Zuber^{1}Email author,
 A Pedro Duarte Silva^{2} and
 Korbinian Strimmer^{1}
DOI: 10.1186/1471210513284
© Zuber et al.; licensee BioMed Central Ltd. 2012
Received: 1 June 2012
Accepted: 20 October 2012
Published: 31 October 2012
Abstract
Background
Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs.
Results
We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from highdimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.
Conclusions
Simultaneous SNP selection is a challenging task. We demonstrate that our CAR scorebased algorithm consistently outperforms all competing approaches, both uni and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available fromhttp://strimmerlab.org/software/care/.
Background
Genomewide associations studies (GWAS) are now routinely conducted to search for genetic factors indicative of or even causally linked to disease. Typically, the aim of such a study is to identify a small subset of single nucleotide polymorphisms (SNPs) associated with a phenotype of interest. From an analysis point of view the screening for relevant biomarkers is best cast as a problem of statistical variable selection. In GWAS variable selection is very challenging as the full set of SNPs is often very large while both the effect of each potentially causal SNP as well as their number is very small (e.g.[1–3]).
To date, most GWAS are based on singleSNP analyzes where each SNP is considered independently of all others and association with the phenotype is computed using a univariate test statistic such as variants of the tscore, the ATT statistic[4] or marginal correlation[5]. The advantage of this approach is that it is computationally inexpensive. However, it implicitly assumes complete independence of markers and thus ignores the correlation structure among SNPs, e.g., due to linkage or interaction among SNPs.
In order to increase statistical efficiency and to exploit the correlation among predictive SNPs several authors have recently started to investigate simultaneous SNP selection using fully multivariate approaches. This was pioneered for GWAS in the seminal paper of[1] that introduced the NEG regression model, a shrinkagebased approach to select relevant SNPs. A related approach is LASSO regression that was employed to GWAS by[6], MCP regression[2], and Bayesian variable selection regression[3]. Another promising multivariate approach advocated for highdimensional variable selection is boosting[7] but this has not yet been investigated for GWAS.
Recently, to address the problem of variable importance and selection under correlation in genomics, we have introduced two novel statistics, the correlationadjusted tscore (CAT score) and the correlationadjusted marginal correlation (CAR score), see[8, 9]. These two measures are multivariate generalizations of the standard univariate test statistics that take the correlation among variables explicitly into account and lead to improved rankings of markers as has been shown for data from transcriptomics and metabolomics. However, application of CAT and CAR scores has so far been restricted to medium to large dimensional settings only as computing these scores involves the calculation of the inverse matrix square root of the correlation matrix, which is prohibitively expensive in high dimensions. Thus, for SNP analyzes further computational economies are needed.
Here, we develop a novel multivariate algorithm for large scale SNP selection using CAR score regression. Specifically, we propose a computationally efficient procedure that allows for shrinkage estimation of CAR scores even for very highdimensional data sets. Subsequently, we conduct a systematic comparison of stateoftheart simultaneous SNP selection procedures using data from the GAW17 consortium[10]. These data are particularly suited for investigating relative performance as the true causal SNPs are known. Finally, we demonstrate that SNP rankings based on correlationadjusted statistics consistently outperform all investigated competing approaches, both uni and multivariate.
Methods
Univariate ranking of SNPs
The basic setup we consider here is a linear regression model for a set of d predictors X= {X_{1},…,X_{ d }} and a metric or binary response variable Y . In GWAS the covariates X are given by the genotype and the response Y is the phenotype or trait of interest. The correlation matrix among the predicting variables has size d × d and is denoted by P (capital “rho”). The vector of marginal correlations P_{ X Y }= (ρ X_{1}Y,…,ρ X_{ d }Y)^{ T }contains the correlations between a metric response and each individual SNP. Similarly, for binary response the tscore vector τ= (τ_{1},…,τ_{ d })^{ T }contains the tscores computed for each variable.
If there is no correlation among SNPs (i.e. P= I_{ d }) the tscores τ provide an optimal ranking of SNPs in terms of predicting a binary Y[11]. Likewise, for metric response the marginal correlations lead to an optimal ordering[12]. Moreover, in the absence of SNPSNP correlation the squared values of the ranking statistics (squared tscore, squared marginal correlation) are useful measures of variable importance, adding up to Hotelling’s T^{2} and the squared multiple correlation coefficient R^{2}, respectively.
CAT and CAR score
In many important settings the correlations P do not vanish but rather represent additional structure relating the predictors. In the case of SNPs the correlation may be rather large, e.g. due to linkage effects[13]. Thus, both for variable ranking and for assigning variable importance it can be essential to take the correlation between covariates into account.
also known as coefficient of determination or proportion of variance explained. Because of this decomposition property CAT and CAR scores allow to assign importance not only to individual SNPs but also to groups of SNPs. Moreover, both CAT and CAR score share a grouping property that leads to similar scores for highly correlated SNPs. In addition they protect against antagonistic SNPs, i.e. if two SNPs are highly correlated and one has a protective and the other a risk effect, then both SNPs are assigned low scores.
For model selection using CAT and CAR scores, i.e. for identification of those SNPs that do not contribute to predict the response Y , we use a simple thresholding procedure with the critical threshold obtained by controlling local false discovery rates[14].
In previous work we have shown for synthetic data as well as for data from metabolomic and gene expression experiments that CAT and CAR scores are effective multivariate criteria for obtaining compact yet highly predictive feature sets. Independently, in the study of[15] it was also found that CAT scores result in favorable orderings of variables.
However, with increasing dimension d the correlation matrix P becomes prohibitively large both to compute and to handle effectively. As a result, in high dimensions direct calculation of CAT and CAR scores using Eq. 1 and Eq. 2 is not possible. Thus, for application in highdimensional data such as from GWAS an alternative means of computation must be developed.
Computationally efficient calculation of shrinkage estimators of CAT and CAR scores
where R_{empirical} is the empirical nonregularized correlation matrix and λ is a shrinkage intensity (e.g.[16]). Using computational economies akin to those discussed in[17] we now show that computation of R^{−1/2} and subsequent calculation of estimates of CAT and CAR scores can be done in a computationally highly effective way, even when direct computation of CAT and CAR scores via Eq. 1 and Eq. 2 is infeasible.
Consequently, Eq. 3 allows to obtain shrinkage estimates of CAT and CAR scores effectively even in high dimensions as none of the matrices employed in Eq. 3 is larger than d × m, and most are even smaller (d × 1 or m × 1), all without actually computing the shrinkage correlation matrix R.
Results and discussion
We now compare the proposed CAR score approach to simultaneous SNP selection with competing methods and determine its effectiveness in finding true causal SNPs.
For this purpose we use the miniexome data set compiled for the GAW17 workshop held 1316 October 2010 in Boston (http://www.gaworkshop.org/gaw17/). This data set is a combination of real sequence data and simulated synthetic phenotypes, where the true causal SNPs are known. In our study we investigate univariate ranking by marginal correlation and five multivariate approaches.
In order to facilitate replication of our results we provide complete R code[18]. Our R package “care” implements the developed algorithm. Moreover, we offer R scripts covering all analysis steps from preprocessing the raw data to plotting of figures athttp://strimmerlab.org/software/care/. The data are publicly available from the GAW consortium, seehttp://www.gaworkshop.org/gaw17/data.html for details.
GAW 17 unrelated data
The compilation and simulation of phenotypes for the GAW17 miniexome data set is described in detail in[10]. We focus here on the GAW 17 unrelated data with metric phenotypes Q1, Q2, and Q4. The corresponding sequence data matrix contains information on 24,487 SNPs for n = 697 individuals. For each phenotype there are B = 200 simulations. By construction, phenotype Q1 has a residual heritability of 0.44 and is influenced by 39 SNPs in 9 genes, whereas Q2 has a lower residual heritability of 0.29 and is influenced by 72 SNPs in 13 genes. This suggests that discovery of true causal SNPs should be less challenging for Q1 than for Q2. Phenotype Q4 has a heritability of 0.70 but none of it is due to SNPs contained in the present data set.
Preprocessing
In the preprocessing of the sequences we first recoded the alleles in the raw data into 0, 1, 2 assuming an additive effects model. Second, we standardized the data matrix to column mean zero and column variance 1. Subsequently, we removed duplicate predictors so that 15,076 unique SNPs remained. The set of true causal SNPs for both Q1 and Q2 also contains each a duplicate, reducing the number of true unique SNPs to 38 and 71. Finally, we further filtered out synonymous SNPs, as we are interested only in nonsynonymous mutations. The resulting predictor matrix X is of size 697 × 8,020, i.e. d = 8,020 unique nonsynonymous SNPs are simultaneously considered for selection.
For preprocessing the response variables Q1, Q2, and Q4 we removed the influence of the three nongenetic covariates sex, age, and smoking by linear regression. The resulting residuals were standardized to mean zero and variance 1 which yielded B = 200 response vectors${y}_{1}^{\left(b\right)}$,${y}_{2}^{\left(b\right)}$, and${y}_{4}^{\left(b\right)}$, where b ∈ 1,…,B, each of size 697 × 1.
SNP selection methods included in the comparison study
For each of the B = 200 response vectors for Q1, Q2, and Q4 we computed a regression model including all d = 8,020 SNPs as potential predictors. Following[2] we focused on regularized regression approaches. Specifically, we used the following five methods, all of which have been shown to be powerful tools for variable selection in largescale regression settings:
The corresponding software implementations are listed in Table1. As a reference for comparison we additionally included two baseline methods:

COR: univariate SNP ranking by marginal correlation, and
RND: random ordering of all SNPs.
All methods except CAR and COR combine regularization with variable selection. Thus, for determining model sizes for CAR scores and COR we adaptively estimated a threshold from the data using a local FDR cutoff of 0.5 as recommended in[14]. In settings with rare and weak features this particular choice coincides with the socalled “higher criticism” threshold that has shown to be powerful for signal identification in classification (e.g.,[24–26]). For computing the FDR values we employed the R package fdrtool[27, 28].
Generally, all software were run with default settings. The regularization parameters required by the NEG, MCP, BOOST and CAR approaches were set to fixed values optimizing the overall performance of each method. Specifically, for CAR and MCP we employed λ = 0.1, for BOOST ν = 0.1, and for NEG λ = 85. For LASSO we used the builtin crossvalidation routines.
Relative performance of investigated methods
The aim of this study is to compare simultaneous SNP selection methods with regard to their ability to discover the true known SNPs. For this purpose we investigated the respective SNP rankings and the corresponding true positives, the size of the selected models, and the variability across the 200 repetitions.
Median model sizes and the corresponding interquartile ranges (IQR) as well as the average true positives for phenotypes Q1 and Q2 for all investigated methods summarized across the 200 repetitions (first three columns)
Results  Comparisons  

Method  Model size  TP  TP  TP  TP  
Median (IQR)  Method  CAR  COR  RND  
Q1  
CAR  51 (53)  5.85  5.85  5.42  0.23  
COR  176 (108)  8.06  8.99  8.06  0.88  
NEG  1390 (118)  15.31  17.57  14.38  6.60  
MCP  20 (5)  4.11  4.19  3.95  0.12  
BOOST  53 (5)  5.84  5.91  5.50  0.25  
LASSO  37 (31)  5.19  5.21  4.89  0.18  
Q2  
CAR  31 (38)  2.93  2.93  2.85  0.29  
COR  1 (7)  0.38  0.21  0.38  0.00  
NEG  1632 (755)  20.21  28.08  25.90  14.50  
MCP  29 (5)  2.75  2.82  2.76  0.28  
BOOST  59 (6)  3.92  4.34  3.82  0.59  
LASSO  15 (36)  1.50  1.88  1.97  0.14 
Median model sizes and the corresponding interquartile ranges (IQR) for phenotype Q4
Q4  
Model Size  CAR  COR  NEG  MCP  BOOST  LASSO  
Median  34  0  1900  27  59  1  
IQR  40  1  2713  4  6  6 
True SNPs found among the top 100 SNPs identified by CAR scores in at least 50 of the 200 repetitions for Q1 and Q2
SNP  Frequency  MAF  BETA  Correlation  

Q1  0.014  
ARNT  C1S6533  88  0.011478  0.56190  
FLT1  C13S431  110  0.017217  0.74136  0.147  
FLT1  C13S522  200  0.027977  0.61830  0.147  
FLT1  C13S523  200  0.066714  0.64997  0.147  
FLT1  C13S524  164  0.004304  0.62223  0.147  
KDR  C4S1877  145  0.000717  1.07706  0.111  
KDR  C4S1878  101  0.164993  0.13573  0.111  
KDR  C4S1884  95  0.020803  0.29558  0.111  
VEGFA  C6S2981  69  0.002152  1.20645  
VEGFC  C4S4935  91  0.000717  1.35726  
Q2  0.008  
BCHE  C3S4869  54  0.000717  1.01569  0.001  
BCHE  C3S4875  59  0.000717  1.09484  0.001  
LPL  C8S442  69  0.015782  0.49459  
SIRT1  C10S3048  54  0.002152  0.83224  0.330  
SIRT1  C10S3050  72  0.002152  0.97060  0.330  
VNN1  C6S5380  138  0.170732  0.24437  
VNN3  C6S5441  59  0.098278  0.27053  0.066  
VNN3  C6S5449  57  0.010043  0.66909  0.066 
The last column in Table4 provides information about the average absolute correlation among all true SNPs for Q1 and Q2 as well as among the identified SNPs on the same gene. We observe that the true positive SNPs in Q1 best identified by the CAR score are highly correlated within the same gene. This demonstrates that the CAR score successfully utilizes the correlation structure among SNPs to optimize the ranking. For phenotype Q2 the correlation among the true SNPs is generally lower compared to Q1, still except for BCHE the correlation among SNPs on the same gene is larger compared to the average correlation between a randomly chosen pair of true SNPs.
Proportion of common and rare variants of the true SNPs found among the top 100 SNPs
Q1  
Proportion (%)  CAR  COR  NEG  MCP  BOOST  LASSO  
Common  0.56  0.71  0.63  0.74  0.71  0.73  
Rare  0.44  0.29  0.37  0.26  0.29  0.27  
Q2  
Proportion (%)  CAR  COR  NEG  MCP  BOOST  LASSO  
Common  0.28  0.41  0.36  0.44  0.42  0.43  
Rare  0.72  0.59  0.64  0.56  0.58  0.57 
Conclusions
Large scale simultaneous SNP selection is a statistically and computationally very challenging task. To this end, we have introduced here a novel algorithm based on CAR score regression that can be applied effectively in high dimensions. Subsequently, in a comparison study we have investigated five multivariate regressionbased SNP selection approaches with regard to their ability to correctly recover causal SNPs and corresponding SNP rankings.
As overall best method we recommend using CAR scores since this method was the only approach not only consistently outperforming the competing other multivariate SNP selection procedures in terms of identified true positives but also the only approach uniformly improving over simple univariate ranking by marginal correlation. In addition we have shown that CAR scores also are successful in detecting rare variants which recently have been recognize to be important indicators for human disease[29, 30].
Declarations
Acknowledgements
We thank Peter Ahnert, Arndt Groß, Holger Kirsten, Abdul Nachtigaller, and Markus Scholz for helpful discussion. Part of this research was supported by BMBF grant no. 0315452A (HaematoSys project). The Genetic Analysis Workshop is supported by NIH R01 GM031575. Preparation of the Genetic Analysis Workshop 17 simulated exome data set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (http://www.1000genomes.org).
Authors’ Affiliations
References
 Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Analysis of all SNPs, in genomewide and resequencing association studies. PLoS Genetics 2008, 4: e1000130. 10.1371/journal.pgen.1000130PubMed CentralView ArticlePubMed
 Ayers KL, Cordell H: SNP selection in genomewide and candidate gene studies via penalized logistic regression. Genet Epidemiol 2010, 34: 879–891. 10.1002/gepi.20543PubMed CentralView ArticlePubMed
 Guan Y, Stephens M: Bayesian variable selection regression for genomewide association studies, and other largescale problems. Ann Appl Statist 2011, 5: 1780–1815. 10.1214/11AOAS455View Article
 Armitage P: Tests for linear trends in proportions and frequencies. Biometrics 1955, 11: 375–386. 10.2307/3001775View Article
 Foulkes AS: Applied Statistical Genetics with R. New York: Springer; 2009.View Article
 Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25: 714–721. 10.1093/bioinformatics/btp041PubMed CentralView ArticlePubMed
 Hothorn T, Bühlmann P: Modelbased boosting in high dimensions. Bioinformatics 2006, 22: 2828–2829. 10.1093/bioinformatics/btl462View ArticlePubMed
 Zuber V, Strimmer K: Gene ranking and biomarker discovery under correlation. Bioinformatics 2009, 25: 2700–2707. 10.1093/bioinformatics/btp460View ArticlePubMed
 Zuber V, Strimmer K: Highdimensional regression and variable selection using CAR scores. Statist Appl Genet Mol Biol 2011, 10: 34.
 Almasy L, Dyer TD, Peralta JM, Kent Jr JW, Charlesworth JC, Curran JE, Blangero J: Genetic analysis workshop 17 miniexome simulation. BMC Proceedings 2011, 5(Suppl 9):S2. 10.1186/175365615S9S2PubMed CentralView ArticlePubMed
 Efron B: Empirical Bayes, estimates for largescale prediction problems. J Amer Statist Assoc 2009, 104: 1015–1028. 10.1198/jasa.2009.tm08523View Article
 Fan J, Lv J: Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Statist Soc B 2008, 70: 849–911. 10.1111/j.14679868.2008.00674.xView Article
 Ardlie KG, Kruglyak L, Seielstad M: Patterns of linkage disequilibrium in the human genome. Nat Rev Genet 2002, 3: 299–309. 10.1038/nrg777View ArticlePubMed
 Klaus B, Strimmer K: Signal identification for rare and weak features: higher criticism or false discovery rates? Biostatistics 2012. in press in press
 Allen GI, Tibshirani R: Inference with transposable data: modelling the effects of row and column correlations. J R Statist Soc B 2012, 74: 721–743. 10.1111/j.14679868.2011.01027.xView Article
 Schäfer J, Strimmer K: A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. Statist Appl Genet Mol Biol 2005, 4: 32.
 Hastie T, Tibshirani T: Efficient quadratic regularization for expression arrays. Biostatistics 2004, 5: 329–340. 10.1093/biostatistics/kxh010View ArticlePubMed
 R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2012. [. [ISBN 3–900051–07–0]. http://www.Rproject.org] []. [ISBN 3900051070].
 Zhang CH: Nearly unbiased variable selection under minimax concave penalty. Ann Statist 2010, 38: 894–942. 10.1214/09AOS729View Article
 Schapire RE: The strength of weak learnability. Machine Learning 1990, 5: 197–227.
 Tibshirani R: Regression shrinkage and selection via the lasso. J R Statist Soc B 1996, 58: 267–288.
 Breheny P, Huang J: Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Applied Statistics 2011, 5: 232–253. 10.1214/10AOAS388View Article
 Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Statist Soft 2010, 39: 1–13.
 Donoho D, Jin J: Higher criticism thresholding: optimal feature selection when useful features are rare and weak. Proc Natl Acad Sci USA 2008, 105: 14790–15795. 10.1073/pnas.0807471105PubMed CentralView ArticlePubMed
 Donoho D, Jin J: Feature selection by higher criticism thresholding achieves the optimal phase diagram. Phil Trans R Soc A 2009, 367: 4449–4470. 10.1098/rsta.2009.0129View ArticlePubMed
 Duarte Silva AP: Twogroup classification with highdimensional correlated data: a factor model approach. Comput Stat Data An 2011, 55: 2975–2990. 10.1016/j.csda.2011.05.002View Article
 Strimmer K: fdrtool: a versatile R package for estimating local and tail areabased false discovery rates. Bioinformatics 2008, 24: 1461–1462. 10.1093/bioinformatics/btn209View ArticlePubMed
 Strimmer K: A unified approach to false discovery rate estimation. BMC Bioinformatics 2008, 9: 303. 10.1186/147121059303PubMed CentralView ArticlePubMed
 Bodmer W, Bonilla C: Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 2008, 40: 695–701. 10.1038/ng.f.136PubMed CentralView ArticlePubMed
 McClellan J, King MC: Genetic heterogeneity in human disease. Cell 2010, 141: 210–217. 10.1016/j.cell.2010.03.032View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.