Multilocus association mapping using generalized ridge logistic regression
 Zhe Liu^{1},
 Yuanyuan Shen^{2} and
 Jurg Ott^{3}
https://doi.org/10.1186/1471-2105-12-384
© Liu et al; licensee BioMed Central Ltd. 2011
Received: 16 October 2010
Accepted: 29 September 2011
Published: 29 September 2011
Abstract
Background
In genome-wide association studies, it is widely accepted that multilocus methods are more powerful than testing single-nucleotide polymorphisms (SNPs) one at a time. Among statistical approaches considering many predictors simultaneously, scan statistics are an effective tool for detecting susceptibility genomic regions and mapping disease genes. In this study, inspired by the idea of scan statistics, we propose a novel sliding window-based method for identifying a parsimonious subset of contiguous SNPs that best predict disease status.
Results
Within each sliding window, we apply a forward model selection procedure, using generalized ridge logistic regression to fit the model at each step. In power simulations, we compare the performance of our method with that of five other methods in current use. Averaged over all the conditions considered, the power of our method is higher than that of the others. We also present two published datasets where our method is useful in causal SNP identification.
Conclusions
Our method can automatically combine genetic information in local genomic regions and allow for linkage disequilibrium between SNPs. It can overcome some defects of the scan statistics approach and will be very promising in genome-wide case-control association studies.
Background
In genome-wide association studies (GWAS), it is generally accepted that multilocus methods can obtain better power than single-locus approaches that test only one single-nucleotide polymorphism (SNP) at a time [1–3]. Among a large number of mathematical and statistical approaches considering many predictors simultaneously, scan statistics [4] serve as a useful multilocus analytical means for combining genetic information on multiple contiguous SNPs. In this method, the whole genome is scanned by a sliding window of estimated size, and the moving sum for each window is computed as the sum of suitable single-locus statistics. Then the scan statistic, defined as the largest moving sum, is calculated and its associated empirical p-value is evaluated by permutation tests.
Despite its remarkable advantages, the scan statistics method has two drawbacks that can restrict its practical use: (I) linkage disequilibrium (LD) within local genomic regions is not taken into account, which can result in an inflated type I error rate; (II) all contiguous SNPs within a genomic region (window) are selected simultaneously, which can bring excess noise and increase the false discovery rate (FDR).
During the last decade, various advances based on the framework of scan statistics have been developed. Zaykin et al. [5], Dudbridge et al. [6], and Yang et al. [7] proposed more effective and powerful test statistics within each sliding window and improved the sensitivity of scan statistics; Sun et al. [8] took into account the complex distribution of human genomic variation in the detection of causal chromosomal regions; Browning [9] and Li et al. [10] proposed variable-sized sliding-window methods based on Markov chains and regularized regression analysis, respectively. In these methods, there is no need to specify a window size for haplotype tests, which makes them particularly useful in association studies.
In this study, we propose a novel sliding window-based multilocus method for identifying a subset of susceptibility SNPs based on forward variable selection, using generalized ridge logistic regression (GRLR) to fit the model at each step. As a broader and generalized form of ridge logistic regression, GRLR can take advantage of prior information between any pair of SNPs and impose a proper shrinkage penalty on each SNP within the genomic region of interest. Our method can automatically combine genetic information within local regions and select a subset of SNPs that best predict disease status, whose associated empirical significance level is assessed by permutation tests. We demonstrate by simulations and analysis of two published datasets that our method is highly informative and promising.
Methods
Generalized ridge logistic regression
Although logistic regression is very popular in case-control association analysis, it suffers from several shortcomings. If the number of SNPs in the regression model is larger than the number of observations, this method fails [1]. Moreover, with a large number of SNPs in the regression model, predictors can be highly correlated (high linkage disequilibrium), which can lead to further degradation of the model [11]. With the use of quadratic (L_{2}) penalization, ridge logistic regression (RLR) [12] can overcome these disadvantages of logistic regression and increase the stability of model fitting. This method has recently been applied successfully in several biological investigations including accommodating linkage disequilibrium [13] and uncovering gene-gene interactions [11].
In GRLR, the logistic log-likelihood is penalized by a quadratic form in the regression coefficients, in which P is a nonnegative definite penalty matrix and λ is a positive scale constant, that is, a tuning parameter, which can be specified by cross-validation. The regression coefficients in the model are estimated using the Newton-Raphson iterative algorithm. The effective degrees of freedom and the variance of the coefficients can be approximated by estimators introduced in [14]. Then the Wald test can be applied to assign p-values to the regression coefficients.
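The fitting step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' R implementation: the penalized log-likelihood is assumed to be l(β) − (λ/2) β'Pβ, the intercept is assumed to be unpenalized, the effective degrees of freedom are approximated by trace[(H + λP)^{-1} H] with H the unpenalized information matrix, and the function name `fit_grlr` and the toy data are hypothetical.

```python
import numpy as np

def fit_grlr(X, y, P, lam=1.0, max_iter=50, tol=1e-8):
    """Newton-Raphson fit of logistic regression with penalty (lam/2) * b' P b.

    X: (n, k) genotype matrix; y: (n,) 0/1 labels; P: (k, k) penalty matrix.
    Assumption of this sketch: an unpenalized intercept is prepended.
    """
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    P_full = np.zeros((k + 1, k + 1))
    P_full[1:, 1:] = P                       # no penalty on the intercept
    beta = np.zeros(k + 1)
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-Z @ beta))
        w = mu * (1.0 - mu)
        grad = Z.T @ (y - mu) - lam * P_full @ beta   # penalized score
        H = Z.T @ (Z * w[:, None])                    # unpenalized information
        step = np.linalg.solve(H + lam * P_full, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-Z @ beta))
    H = Z.T @ (Z * (mu * (1.0 - mu))[:, None])
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    edf = np.trace(np.linalg.solve(H + lam * P_full, H))  # assumed approximation
    return beta, loglik, edf

# Toy data (hypothetical): three SNPs, only the first one affects disease status.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(300, 3)).astype(float)
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * X[:, 0])))
y = rng.binomial(1, prob).astype(float)
P = np.eye(3)                                # identity penalty = ordinary ridge

beta_none, _, edf_none = fit_grlr(X, y, P, lam=0.0)
beta_weak, _, edf_weak = fit_grlr(X, y, P, lam=1.0)
beta_strong, _, edf_strong = fit_grlr(X, y, P, lam=100.0)
```

With an identity P this reduces to ordinary ridge logistic regression; a non-identity P is what would let different SNPs receive different amounts of shrinkage.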
Because quadratic penalization shrinks the coefficient estimates without setting any of them exactly to zero, GRLR cannot serve as an independent tool for model selection; however, the traditional forward selection procedure can be used, with GRLR fitting the model in each round. In forward variable selection, we start with no predictor in the model and then add the one variable that leads to the best score. We continue adding variables one at a time until the score stops improving. In this study, we choose the AIC (Akaike Information Criterion) [15] as the scoring method in the variable selection procedure, which measures goodness of fit of a statistical model.
To avoid introducing excess noise, we do not use the information from all SNPs in each search region, because those with single-locus p-values too large (i.e. marginal effects too low) may contribute little to the power. To do this, we denote t as the p-value truncation threshold, and those SNPs with single-locus p-values larger than t will be excluded from the search region. In practical applications, the threshold t may be set somewhere from 0.05 to 0.10. After the truncation on G_{ i }(u,v), the adjusted search region is denoted by T_{ i }(u,v).
With the above configuration, we apply forward selection to each search region T_{ i }(u,v), using GRLR for model fitting in each step. Denote by B_{ i }(u,v) the current best subset of SNPs at each step of the model selection procedure, which is empty at the beginning. We start by fitting the GRLR model using the central SNP (i.e. the i th SNP) as the only predictor, and calculating the corresponding AIC. Then we remove the central SNP from the search region, T_{ i }(u,v), and add it to B_{ i }(u,v). Next, the remaining SNPs in T_{ i }(u,v) are entered into the GRLR model one by one, along with the SNPs in B_{ i }(u,v) as predictors. We select the one SNP that reduces the current AIC the most. Then we add it to B_{ i }(u,v), remove it from T_{ i }(u,v), and update the current AIC. We repeat this procedure until none of the remaining SNPs in T_{ i }(u,v) can decrease the current AIC. Finally we examine the last model and calculate the corresponding p-value $P_{\{E_i\}}$ for the model fit, where the subset E_{ i } stands for the selected SNPs in this search region.
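The selection loop for a single search region might look like the sketch below. Several details are assumptions made for illustration only: the penalty matrix is taken to be the identity (ordinary ridge) with an unpenalized intercept, the AIC is computed as −2·log-likelihood + 2·(effective degrees of freedom), and the names `ridge_logistic_aic`, `forward_select` and the toy data are hypothetical.

```python
import numpy as np

def ridge_logistic_aic(X, y, lam=1.0, n_iter=25):
    """AIC of a ridge-penalized logistic fit (identity penalty, free intercept)."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    P = np.eye(k + 1)
    P[0, 0] = 0.0                                   # intercept is not penalized
    beta = np.zeros(k + 1)
    for _ in range(n_iter):                         # fixed number of Newton steps
        mu = 1.0 / (1.0 + np.exp(-Z @ beta))
        H = Z.T @ (Z * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(H + lam * P, Z.T @ (y - mu) - lam * P @ beta)
    mu = 1.0 / (1.0 + np.exp(-Z @ beta))
    H = Z.T @ (Z * (mu * (1 - mu))[:, None])
    loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    edf = np.trace(np.linalg.solve(H + lam * P, H)) # assumed effective df
    return -2.0 * loglik + 2.0 * edf

def forward_select(G, y, center, candidates, lam=1.0):
    """Start from the central SNP, then keep adding the SNP that lowers AIC most."""
    best = [center]                                 # plays the role of B_i(u,v)
    remaining = [j for j in candidates if j != center]   # truncated region T_i(u,v)
    current = ridge_logistic_aic(G[:, best], y, lam)
    while remaining:
        scores = {j: ridge_logistic_aic(G[:, best + [j]], y, lam) for j in remaining}
        j_star = min(scores, key=scores.get)
        if scores[j_star] >= current:               # no SNP decreases the AIC
            break
        best.append(j_star)
        remaining.remove(j_star)
        current = scores[j_star]
    return best, current

# Toy region (hypothetical): 7 SNPs, causal SNPs at columns 3 and 5, centre at 3.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(500, 7)).astype(float)
eta = -2.0 + 1.0 * G[:, 3] + 1.0 * G[:, 5]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
selected, final_aic = forward_select(G, y, center=3, candidates=range(7))
```

In practice `candidates` would hold only the SNPs that survive the p-value truncation, and the final model's p-value would then be computed for the returned subset.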
Considering all the search regions over the whole genome, our test statistic is defined as the minimum of all the $P_{\{E_i\}}$ values. To assign a global empirical p-value for the selected subset of SNPs, we use L permutations, randomly shuffling the case-control labels in the observed dataset. To reduce the computational burden, we can construct search regions of interest using the top S SNPs whose p-values are smallest among all the p-values computed by the logistic regression. An alternative is to predefine another truncation threshold and exclude the SNPs whose p-values exceed the threshold.
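The permutation step can be illustrated with a short sketch. To keep it self-contained, the per-region GRLR model p-value is replaced by a stand-in single-SNP trend statistic (N·r² referred to a chi-square distribution with 1 degree of freedom); that substitution, the function names, and the convention of reporting the empirical p-value as hits divided by the number of permutations are assumptions of this example.

```python
import math
import numpy as np

def trend_pvalues(G, y):
    """Stand-in per-SNP p-values: N * r^2 compared with chi-square(1)."""
    n = len(y)
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    r = (Gc.T @ yc) / np.sqrt((Gc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Survival function of chi-square(1) at x equals erfc(sqrt(x / 2)).
    return np.array([math.erfc(math.sqrt(n * ri * ri / 2.0)) for ri in r])

def global_empirical_p(G, y, pvalue_fn, n_perm=200, seed=0):
    """Minimum p-value over all units, with a label-permutation null."""
    rng = np.random.default_rng(seed)
    observed = pvalue_fn(G, y).min()           # test statistic: smallest p-value
    hits = 0
    for _ in range(n_perm):
        y_perm = rng.permutation(y)            # shuffle case-control labels
        if pvalue_fn(G, y_perm).min() <= observed:
            hits += 1
    return observed, hits / n_perm

# Toy data (hypothetical): 20 SNPs, SNP 4 is associated with disease status.
rng = np.random.default_rng(2)
G = rng.binomial(2, 0.3, size=(400, 20)).astype(float)
eta = -1.0 + 1.0 * G[:, 4]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
stat, emp_p = global_empirical_p(G, y, trend_pvalues, n_perm=200)
```

Because the minimum is recomputed over all units in every permutation, the resulting empirical p-value already accounts for the number of regions scanned.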
Simulations
In this model, the predictor x_{ ij } refers to the genotype of the i th individual at the j th causal SNP, where genotypes AA, AB, and BB are coded by 0, 1, and 2, respectively, assuming the B allele is the minor allele. In Scenario (A), k (the number of causal SNPs) equals 2, while in Scenario (B), k equals 3. The intercept β_{0} is determined by requiring a disease prevalence of 0.05; other regression coefficients β_{ j } are set to 1.0.
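As a worked illustration of how the intercept could be calibrated, the sketch below assumes the disease model is logit P(disease) = β_0 + Σ_j β_j x_ij, that the k causal SNPs are mutually independent and in Hardy-Weinberg equilibrium with the same minor allele frequency f, and that β_0 is found by bisection. These assumptions and the function names are not taken from the original text.

```python
import itertools
import math

def prevalence(beta0, f, k, beta=1.0):
    """Population prevalence under a logistic model with k independent causal SNPs."""
    geno_p = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]   # P(AA), P(AB), P(BB)
    total = 0.0
    for g in itertools.product(range(3), repeat=k):    # all genotype combinations
        prob = math.prod(geno_p[a] for a in g)
        total += prob / (1.0 + math.exp(-(beta0 + beta * sum(g))))
    return total

def solve_intercept(target, f, k, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection on beta0; prevalence is increasing in beta0."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prevalence(mid, f, k) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Scenario (A): two causal SNPs, minor allele frequency 0.3, prevalence 0.05.
b0 = solve_intercept(0.05, f=0.3, k=2)
```

With correlated causal SNPs the genotype probabilities no longer factorize, so the prevalence would have to be estimated from simulated genotypes instead.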
Assume that each SNP within the region is derived from a multivariate normal distribution, and the variance-covariance matrix Σ is defined by: Σ_{ pq }= 1, when p = q; Σ_{ pq }= r, when 0 < |p − q| ≤ 5; Σ_{ pq }= 0, otherwise. The correlation coefficient r can be varied in simulations. For each SNP, we determine its genotype by its corresponding value generated from a multivariate normal distribution. If the value falls into the interval (−∞, qnorm((1 − f)^{2})), the genotype of this SNP is set to AA; if the value falls into the interval (qnorm(1 − f^{2}), +∞), the genotype of this SNP is set to BB; otherwise, the genotype of this SNP is set to AB. Here, f is the frequency of the B allele (i.e. minor allele frequency) and the function "qnorm" computes the quantile of the standard normal distribution.
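The genotype-generation rule above translates fairly directly into code. In this sketch the number of SNPs per region (20), the sample size, and the function names are hypothetical choices; `NormalDist().inv_cdf` from the Python standard library plays the role of "qnorm".

```python
from statistics import NormalDist
import numpy as np

def banded_cov(m, r, band=5):
    """Sigma_pq = 1 if p == q, r if 0 < |p - q| <= band, and 0 otherwise."""
    idx = np.arange(m)
    d = np.abs(idx[:, None] - idx[None, :])
    return np.where(d == 0, 1.0, np.where(d <= band, r, 0.0))

def simulate_genotypes(n, m, f, r, seed=0):
    """Threshold correlated normal values into genotypes AA=0, AB=1, BB=2."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(m), banded_cov(m, r), size=n)
    lo = NormalDist().inv_cdf((1 - f) ** 2)    # values below lo become AA
    hi = NormalDist().inv_cdf(1 - f ** 2)      # values above hi become BB
    return (z > lo).astype(int) + (z > hi).astype(int)

G = simulate_genotypes(n=2000, m=20, f=0.3, r=0.2)
b_allele_freq = G.mean() / 2.0                 # should be close to f
```

The two cut points give genotype probabilities (1 − f)², 2f(1 − f) and f², so each simulated SNP is in Hardy-Weinberg proportions with B-allele frequency f.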
Results
Power calculations
We compare the performance of our method (GRLR) with that of logistic regression (LR), Fisher product method (FPM) [16], truncated product method (TPM) [5], lasso logistic regression (Lasso) [10, 17, 18] and elastic net (Enet) [19, 20]. FPM uses a sliding window with fixed size to scan the genomic region, and for each window, it computes the sum of the logarithm of each p-value. Then the test statistic is defined as the minimum value over all windows. TPM is similar to FPM but it focuses on the p-values that are no more than a prefixed truncation threshold. In this simulation study, the truncation threshold for TPM is set to 0.05 and the window size is set to 5 in both the FPM and TPM methods. For the Lasso and Enet methods, all SNPs in the search region are entered into regression models and subsets of SNPs are selected. The tuning parameters in both methods are determined by cross-validation. For the GRLR method, the truncation threshold is also set to 0.05 and the tuning parameter λ is set to 1. We mimic the physical distances between different SNPs by constructing a simple mapping function in which the position of the b^{th} SNP in the genomic region is set to b.
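The two window-based comparison statistics can be sketched in a few lines. The p-values below are made up for illustration, and the helper name `window_log_sums` is hypothetical; significance of either statistic would still be assessed by permutation.

```python
import math

def window_log_sums(pvals, window=5, trunc=None):
    """Sum of log p-values in each sliding window.

    With trunc=None this is the FPM window score; with a threshold, only
    p-values <= trunc contribute, which gives the TPM window score.
    """
    sums = []
    for start in range(len(pvals) - window + 1):
        chunk = pvals[start:start + window]
        if trunc is not None:
            chunk = [p for p in chunk if p <= trunc]
        sums.append(sum(math.log(p) for p in chunk))
    return sums

# Made-up single-locus p-values for eight contiguous SNPs.
pvals = [0.80, 0.40, 0.03, 0.01, 0.04, 0.60, 0.90, 0.20]
fpm = min(window_log_sums(pvals, window=5))              # most extreme window
tpm = min(window_log_sums(pvals, window=5, trunc=0.05))  # truncated version
```

A window in which no p-value passes the truncation threshold contributes a sum of zero, so it can never be the minimum when some other window contains a small p-value.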
For each combination of these parameters, with minor allele frequencies f = 0.1, 0.3, or 0.5 and correlation coefficients r = 0.0 or 0.2, we carry out 500 replications and assess the power of the various methods under both scenarios. To ensure that the type I error rate of each method is equal to 0.05, we use the following control procedure: for the FPM, TPM, and GRLR methods, 500 permutations are applied in each simulated replication. For the Lasso and Enet methods, since they often include extra terms and tend to have a high type I error rate [21, 22], we apply tenfold cross-validation 10 times for each generated replication. For each selected SNP in the observed data, its consistency is defined as the number of times that it occurs in 100 subdatasets that contain 90 percent of observations. The consistency thresholds are determined under the null hypothesis to ensure the nominal type I error rates.
For power considerations, the testing "success" can be defined by different strategies:
Scenario (A) (two causal SNPs)

Strategy I: at least one of the two SNPs is detected

Strategy II: both SNPs are detected
Scenario (B) (three causal SNPs)

Strategy I: at least one of the three SNPs is detected

Strategy II: at least two of the three SNPs are detected

Strategy III: all three SNPs are detected
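The success definitions above amount to counting how many causal SNPs appear in the detected set. A small sketch, with hypothetical SNP indices and made-up detection results across five replications:

```python
def success(detected, causal, min_hits):
    """True if at least min_hits causal SNPs are among the detected SNPs."""
    return len(set(detected) & set(causal)) >= min_hits

def power(detections, causal, min_hits):
    """Fraction of replications counted as a success."""
    return sum(success(d, causal, min_hits) for d in detections) / len(detections)

# Scenario (B)-style example: three causal SNPs, five replications.
causal = {3, 7, 12}
detections = [{3}, {3, 7}, {3, 7, 12}, set(), {5}]

power_I = power(detections, causal, min_hits=1)    # at least one detected
power_II = power(detections, causal, min_hits=2)   # at least two detected
power_III = power(detections, causal, min_hits=3)  # all three detected
```

By construction the power can only decrease from Strategy I to Strategy III, since each strategy's success set is contained in the previous one.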
Power calculation under Scenario (A)
Strategy  Corr  MAF   LR     FPM    TPM    Lasso  Enet   GRLR
I         0.0   0.10  0.196  0.206  0.198  0.210  0.226  0.196
I         0.0   0.30  0.638  0.620  0.602  0.688  0.670  0.650
I         0.0   0.50  0.792  0.734  0.730  0.790  0.828  0.826
I         0.2   0.10  0.184  0.216  0.194  0.194  0.204  0.206
I         0.2   0.30  0.712  0.642  0.622  0.690  0.726  0.704
I         0.2   0.50  0.860  0.794  0.788  0.902  0.888  0.870
II        0.0   0.10  0.008  0.078  0.070  0.032  0.052  0.124
II        0.0   0.30  0.122  0.266  0.252  0.340  0.392  0.558
II        0.0   0.50  0.354  0.304  0.306  0.526  0.624  0.768
II        0.2   0.10  0.018  0.088  0.064  0.066  0.078  0.126
II        0.2   0.30  0.202  0.296  0.300  0.364  0.458  0.608
II        0.2   0.50  0.400  0.370  0.360  0.628  0.690  0.806
Power calculation under Scenario (B)
Strategy  Corr  MAF   LR     FPM    TPM    Lasso  Enet   GRLR
I         0.0   0.10  0.302  0.346  0.320  0.330  0.388  0.310
I         0.0   0.30  0.740  0.716  0.732  0.826  0.832  0.696
I         0.0   0.50  0.906  0.888  0.900  0.932  0.940  0.872
I         0.2   0.10  0.342  0.422  0.422  0.436  0.404  0.368
I         0.2   0.30  0.814  0.846  0.832  0.838  0.854  0.784
I         0.2   0.50  0.948  0.942  0.946  0.960  0.966  0.900
II        0.0   0.10  0.040  0.200  0.180  0.158  0.202  0.270
II        0.0   0.30  0.280  0.478  0.490  0.630  0.704  0.680
II        0.0   0.50  0.544  0.622  0.612  0.804  0.862  0.856
II        0.2   0.10  0.058  0.256  0.264  0.210  0.238  0.314
II        0.2   0.30  0.440  0.580  0.582  0.646  0.730  0.758
II        0.2   0.50  0.700  0.672  0.678  0.858  0.914  0.890
III       0.0   0.10  0.002  0.022  0.016  0.044  0.066  0.100
III       0.0   0.30  0.030  0.082  0.066  0.340  0.434  0.460
III       0.0   0.50  0.140  0.084  0.076  0.558  0.640  0.656
III       0.2   0.10  0.004  0.056  0.050  0.062  0.080  0.122
III       0.2   0.30  0.124  0.088  0.090  0.324  0.422  0.490
III       0.2   0.50  0.292  0.094  0.092  0.550  0.672  0.704
We conducted additional simulations to test the performance of our method conditional on various choices of the tuning parameter λ. We consider Scenario (A) (two causal SNPs) and apply two definitions of testing success, Strategy I (requiring that at least one of the two SNPs is significant) and Strategy II (requiring significance at both SNPs). The correlation coefficient and the minor allele frequency are set to 0.0 and 0.30, respectively. The number of replication runs is 500 for all simulations. Results show that, when the tuning parameter λ equals 0.01, 0.1, 1.0, 10, and 100, the power of our proposed method under Strategy I equals 0.678, 0.650, 0.650, 0.640, and 0.672, respectively, while the power under Strategy II equals 0.600, 0.560, 0.558, 0.548, and 0.574, respectively. We can see that the power fluctuation caused by different values of λ is not large. It indicates that our method is not very sensitive to the selection of the tuning parameter, although it is unknown which selection can lead to the best result.
We further compared the performance of single-locus logistic regression and our proposed method by increasing the total number of SNPs in each simulated window to 200 and the number of causal SNPs to 10. For ease of calculation, only 100 permutations were applied in each of 200 replication runs. Simulation results show that, for the logistic regression method, the proportion of times that it can successfully detect at least one causal SNP is 0.465, while the proportion of times that it can detect at least two causal SNPs is only 0.110; for our GRLR method, the proportion of times that it can successfully detect at least one causal SNP is 0.455, while the proportion of times that it can detect at least two causal SNPs is 0.445. These results indicate that our method is more powerful in discovering multiple disease variants.
Analyzing published data
To further demonstrate our GRLR method and evaluate its practical applications in real dataset analysis, we apply it to two published genome-wide datasets: (1) a case-control dataset for heroin addiction [23] with approximately 10,000 SNPs and 200 individuals (100 cases and 100 controls), and (2) a case-control dataset for age-related macular degeneration (AMD) collected in Hong Kong [24, 25] with approximately 100,000 SNPs and 223 individuals (96 cases and 127 controls).
Results for heroin addiction data using logistic regression
Rank  Chr  SNP rs#    Bp Position  Odds Ratio  Original P-value  Bonferroni Correction  Empirical P-value
1     1    rs1408830  189929064    2.36        2.23E-04          1.0000                 0.573
2     20   rs720010   7174248      2.04        4.20E-04          1.0000                 0.841
3     13   rs950064   58410147     2.23        4.44E-04          1.0000                 0.866
4     13   rs2016056  58410016     2.26        4.54E-04          1.0000                 0.876
5     4    rs951299   99955054     0.48        6.68E-04          1.0000                 0.955
6     11   rs1381784  42426136     2.33        8.68E-04          1.0000                 0.984
7     5    rs2421057  158958294    2.19        8.70E-04          1.0000                 0.984
8     17   rs1714984  12558426     2.26        9.37E-04          1.0000                 0.986
9     9    rs3866796  15340199     0.50        9.48E-04          1.0000                 0.986
10    4    rs1986513  126424833    0.17        1.34E-03          1.0000                 0.998
Results for both datasets using GRLR
Dataset           Chr  Selected Subset of SNPs            Test Statistic     Empirical P-value
heroin addiction  1    {rs1408830, rs965972}              1.26 × 10^{-7}     0.027
AMD HK            10   {rs2736911, rs10490924, rs763720}  1.32 × 10^{-11}    0.000
Results for AMD Hong Kong data using logistic regression
Rank  Chr  SNP rs#     Bp Position  Odds Ratio  Original P-value  Bonferroni Correction  Empirical P-value
1     10   rs10490924  124204438    0.26        1.25E-09          0.0001                 0.001
2     8    rs10504152  54292668     0.18        5.45E-06          0.4428                 0.062
3     7    rs10499342  4340896      0.36        3.88E-05          1.0000                 0.550
4     13   rs2011847   69496879     0.42        5.86E-05          1.0000                 0.738
5     8    rs1377131   53838124     3.09        7.49E-05          1.0000                 0.836
6     4    rs10520462  182400252    3.21        8.73E-05          1.0000                 0.888
7     1    rs1564485   53333649     0.41        1.10E-04          1.0000                 0.940
8     5    rs10521010  33931083     3.03        1.17E-04          1.0000                 0.943
9     5    rs251610    65314237     0.45        1.55E-04          1.0000                 0.980
10    20   rs1858597   95685        0.42        1.66E-04          1.0000                 0.987
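For readers who want to relate the "Bonferroni Correction" column to the original p-values, the usual adjustment is min(1, p × m), where m is the number of tests. The exact number of SNPs tested is not given here, so m = 100,000 below is only a placeholder based on the approximate dataset size, and the function name is hypothetical.

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m tests, capped at 1."""
    return min(1.0, p * m)

m_assumed = 100_000                            # placeholder number of tests
adj_top = bonferroni(1.25e-09, m_assumed)      # top-ranked SNP in the AMD data
adj_third = bonferroni(3.88e-05, m_assumed)    # third-ranked SNP, capped at 1
```

Under this placeholder the top-ranked SNP's adjusted value is about 1.25 × 10^{-4}, in line with the 0.0001 shown in the table; the other rows depend on the actual number of SNPs analyzed.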
Discussion
Our method has the following advantages: (I) GRLR automatically combines association information of SNPs within each sliding window, and by truncating SNPs with low marginal effects and penalizing SNPs with a large distance from the center, our method can exclude excess noise and allow for linkage disequilibrium in local genomic regions; (II) we apply a forward model selection framework and fit the GRLR model at each step, since GRLR cannot serve as a means for variable selection. The results of power simulation and real datasets analysis indicate that this procedure works very well; (III) the global empirical p-value for the selected subset of SNPs is evaluated by permutation analysis, which properly handles the multiple testing problem and furnishes a valid type I error rate.
There are still some aspects of our GRLR method that can be improved in future work: (I) we do not consider the selection of the tuning parameter λ in this study. Cross-validation is an important tool for determining the tuning parameters in penalized regression approaches. It would be reasonable to determine λ by cross-validation, although it would introduce a higher computational burden; (II) the choices of window size and the truncation threshold in our method are arbitrary, which may reduce the stability of our method. Although simulation results show that the impact of these parameters on power is not large, a variable-size sliding-window procedure could be a better choice. Another way is to allow for the free variability of these parameters and then determine the best ones by permutations; (III) the Lasso and Enet methods can perform better than our proposed method in high-dimensional model selection, since they can deal with multiple variables simultaneously. In this case, the introduction of a truncation threshold is not strictly necessary for the applications of these two methods. For power comparisons, however, we wanted to apply the same preselection procedure and the same truncation threshold to the Lasso and Enet methods. We are interested in investigating this further in future work; (IV) it is ideal to apply our region construction procedure to all SNPs. In this study, the search regions are constructed based on the SNPs whose marginal p-values do not exceed some predefined threshold; (V) we are also interested in using the LD measurements, or the combination of the LD and the physical distance information, in constructing the weights for SNPs.
Conclusions
In this study, we propose a new sliding window-based multilocus approach for detecting causal SNPs, which is based on forward model selection using generalized ridge logistic regression (GRLR) to fit the model at each step. Our method can overcome some defects of the scan statistics approach and provides a powerful procedure for identifying causal genomic regions and mapping susceptibility genes in routine case-control association studies. In particular, because of its capability of automatically combining association information of multiple SNPs and its advantage in variable selection, our method can be a useful technique in the analyses of human complex diseases. Our software (available by request) is written in R [27], with the use of the Design Package [28] and glmnet Package [29].
Declarations
Acknowledgements
This work has been supported by NSFC grants from the Chinese government (project numbers 30730057 and 30700442) to JO.
References
1. Hoh J, Ott J: Mathematical multilocus approaches to localizing complex human trait genes. Nat Rev Genet 2003, 4: 701–709.
2. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 2008, 9: 356–369. 10.1038/nrg2344
3. Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and resequencing association studies. PLoS Genet 2008, 4.
4. Hoh J, Ott J: Scan statistics to scan markers for susceptibility genes. Proc Natl Acad Sci USA 2000, 97: 9615–9617.
5. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining P-values. Genet Epidemiol 2002, 22: 170–185. 10.1002/gepi.0042
6. Dudbridge F, Koeleman BP: Rank truncated product of P-values, with application to genome-wide association scans. Genetic Epidemiology 2003, 25: 360–366. 10.1002/gepi.10264
7. Yang H, Hsieh H, Fann CSJ: Kernel-based association test. Genetics 2008, 179: 1057–1068. 10.1534/genetics.107.084616
8. Sun YV, Levin AM, Boerwinkle E, Robertson H, Kardia SL: A scan statistic for identifying chromosomal patterns of SNP association. Genetic Epidemiology 2006, 30: 627–635. 10.1002/gepi.20173
9. Browning SR: Multilocus association mapping using variable-length Markov chains. Am J Hum Genet 2006, 78: 903–913. 10.1086/503876
10. Li Y, Sung W, Liu JJ: Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows. Am J Hum Genet 2007, 80: 705–715. 10.1086/513205
11. Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostat 2008, 9: 30–50.
12. Cessie SL, Houwelingen JCV: Ridge estimators in logistic regression. Journal of the Royal Statistical Society Series C (Applied Statistics) 1992, 41: 191–201.
13. Malo N, Libiger O, Schork NJ: Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet 2008, 82: 375–385. 10.1016/j.ajhg.2007.10.012
14. Gray RJ: Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. Journal of the American Statistical Association 1992, 87: 942–951. 10.2307/2290630
15. Akaike H: A new look at the statistical model identification. Automatic Control, IEEE Transactions on 1974, 19: 716–723. 10.1109/TAC.1974.1100733
16. Fisher RA: Statistical methods for research workers. 14th edition. New York: Oliver and Boyd; 1970.
17. Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological) 1996, 58: 267–288.
18. Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25: 714–721. 10.1093/bioinformatics/btp041
19. Zou H, Hastie T: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B 2005, 67: 301–320. 10.1111/j.1467-9868.2005.00503.x
20. Cho S, Kim H, Oh S, Kim K, Park T: Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis. BMC Proceedings 2009, 3: S25.
21. Wu J, Devlin B, Ringquist S, Trucco M, Roeder K: Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology 2010, 34: 275–285.
22. Devlin B, Roeder K, Wasserman L: Analysis of multilocus models of association. Genetic Epidemiology 2003, 25: 36–47. 10.1002/gepi.10237
23. Nielsen DA, Ji F, Yuferov V, Ho A, Chen A, Levran O, Ott J, Kreek MJ: Genotype patterns that contribute to increased risk for or protection from developing heroin addiction. Mol Psychiatry 2008, 13: 417–428. 10.1038/sj.mp.4002147
24. DeWan A, Liu M, Hartman S, Zhang SS, Liu DTL, Zhao C, Tam POS, Chan WM, Lam DSC, Snyder M, Barnstable C, Pang CP, Hoh J: HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 2006, 314: 989–992. 10.1126/science.1133807
25. Klein RJ, Zeiss C, Chew EY, Tsai J, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J: Complement factor H polymorphism in age-related macular degeneration. Science 2005, 308: 385–389. 10.1126/science.1109557
26. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 2007, 81: 559–575. 10.1086/519795
27. R Development Core Team: R: a language and environment for statistical computing. Vienna, Austria; 2010.
28. Harrell FE: Design: R package version 2.3-0. 2009.
29. Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 2010, 33: 1–22.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.