Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis
- Masao Ueki†^{1}Email author and
- Gen Tamiya†^{1}
DOI: 10.1186/1471-2105-13-72
© Ueki and Tamiya; licensee BioMed Central Ltd. 2012
Received: 21 October 2011
Accepted: 5 April 2012
Published: 3 May 2012
Abstract
Background
Genome-wide gene-gene interaction analysis using single nucleotide polymorphisms (SNPs) is an attractive way for identification of genetic components that confers susceptibility of human complex diseases. Individual hypothesis testing for SNP-SNP pairs as in common genome-wide association study (GWAS) however involves difficulty in setting overall p-value due to complicated correlation structure, namely, the multiple testing problem that causes unacceptable false negative results. A large number of SNP-SNP pairs than sample size, so-called the large p small n problem, precludes simultaneous analysis using multiple regression. The method that overcomes above issues is thus needed.
Results
We adopt an up-to-date method for ultrahigh-dimensional variable selection termed the sure independence screening (SIS) for appropriate handling of numerous number of SNP-SNP interactions by including them as predictor variables in logistic regression. We propose ranking strategy using promising dummy coding methods and following variable selection procedure in the SIS method suitably modified for gene-gene interaction analysis. We also implemented the procedures in a software program, EPISIS, using the cost-effective GPGPU (General-purpose computing on graphics processing units) technology. EPISIS can complete exhaustive search for SNP-SNP interactions in standard GWAS dataset within several hours. The proposed method works successfully in simulation experiments and in application to real WTCCC (Wellcome Trust Case–control Consortium) data.
Conclusions
Based on the machine-learning principle, the proposed method gives powerful and flexible genome-wide search for various patterns of gene-gene interaction.
Background
Single SNP association study is a popular method to detect genes that are susceptible to human diseases. Although candidate gene approach that uses prior knowledge about its function is an efficient procedure, it could overlook genes whose function is unknown or ambiguous. GWAS using whole genome SNPs is thus an attractive solution to this issue. Many procedures proposed for GWAS so far are based on marginal association between each SNP and phenotype. However, it has been pointed out that most susceptible-SNPs identified often show low or moderate effect size, and hence may explain only a few percentage of the genetic variance [1]. The fact that, for several complex diseases, the recurrence risk ratio decreases quicker than 1/2 as the relatedness decreases implies the involvement of non-additive interactions in their etiology [2, 3]. In addition to substantial contributions of rarer variants, gene-gene and gene-environment interaction would be one of strong candidates that can explain the missing heritability as well [1]. Without investigating such interactions, we therefore may miss genuine disease-susceptible loci [4, 5]. An effective and accurate method to search gene-gene interactions will utilize immediately original SNP-GWAS data to decipher such missing genetic components of complex human diseases. In this paper, we tackle a development of powerful method for the genome-wide gene-gene interaction analysis using SNPs.
Marchinni et al. [4] suggest that the use of arbitrary single locus-disease association model under which some interaction effect is present might prevent from finding susceptible interaction effect. Therefore exhaustive search for interactions are needed rather than primary screening by marginal effect. Although individual hypothesis tests for all SNP-SNP pairs as frequently used in SNP-GWAS might be the simplest approach, it is however faced with multiple testing issues caused by 300,000-1,000,000 SNPs. Specifically, there is a difficulty in setting genome-wide significance level using, e.g., Bonferroni correction, FPRP [6], or Bayes factor [7, 8], leading to prohibitively conservative result because they fail to successfully incorporate the correlation structure between each hypothesis. For instance, the total number of hypothesises for second-order gene-gene interaction is about 10^{11} –10^{12} in standard GWAS data. Then Bonferroni-corrected significance level must be considerably small by multiplying the nominal significance level such as 0.01 by the correction factor less than 10^{−11}. No efficient and universal multiple testing method to deal with such the huge set of hypotheses having complicated correlation structure is proposed so far. Although multiple logistic regression may be the method that can incorporate the correlation structure between hypothesises by putting them as predictor variables, the so-called large p small n situation (i.e. the number of predictors is larger than the sample size), precludes simultaneous inclusion of all SNP-SNP interactions in the logistic regression model, namely, no unique solution is determined.
There are several software programs for gene-gene interaction analysis including multifactor dimensionality reduction (MDR) [9], fast-epistasis option in PLINK [10], Tuning Relief [11], Random Jungle [12], BEAM [13], and BOOST [14]. Cordell [5] carried out an extensive comparative study for these methods. She reported that MDR and BEAM have computational difficulty in analyzing current scale in GWAS and that random Jungle is applicable to small scale dataset but not for whole-genome dataset. PLINK-fast-epistasis and BOOST are designed for exhaustive search. Since both methods are based on hypothesis testing principles, as stated above, exact determination of significance level for whole-genome data is difficult due to the numerous multiple hypotheses in addition to the complicated correlation structure between them. Hypothesis testing methods [4, 15, 16] could be too conservative and lead to false negative result, because some complicated correlation structure would commonly appear when considering interaction pairs. The BOOST has been shown to outperform the PLINK--fast-epistasis in most of the interaction models considered in [14]. The BOOST primarily performs screening of possible hypotheses using Kirkwood superposition approximation (KSA), then applies standard multiple testing on the basis of the initial number of SNP-SNP pairs to the likelihood ratio statistic under the saturated logistic regression model having the degrees of freedom (df) of 4. However, the test statistic may fail to follow the chi-squared distribution of df 4 under the presence of sparse cells, which causes too small type 1 error rate [14]. Because the sparse cell issue is unavoidable in genotypic interaction data, some solutions in development of summary statistic for gene-gene interaction are strongly necessitated. Our method is free from the sparse cell issue owing to the proposed dummy coding strategy, which generates 2 by 2 contingency table.
In this paper we develop a new method based on contemporary ultrahigh-dimensional variable selection approach, designated the sure independence screening (SIS) [17–19] rather than hypothesis testing. The principle of SIS is based on the same as that of machine learning methods [5, 9] exploring the optimal model that can explain sufficiently the data in the most parsimonious way; its methodological feature to avoid over- or under-fitting enables us flexibly to capture various patterns of gene-gene interaction. SIS is such a simple combination of marginal regression and following penalized multiple regression [17–19] that can become just an effective framework for the analysis of SNP-SNP interaction, which forms a typical ultrahigh-dimensional data In multiple regression model, regression coefficients can be used for evaluation of the importance of predictors, i.e. coefficients closer to zero indicate less contribution to the regression formula. Statistical evaluation that infers regression coefficients to be zero is referred to as the variable selection [20], which may lead to multiple regression model consisting of fewer predictors with improved estimation accuracy. Recent progress has justified theoretically and empirically the usefulness of variable selection in the large p small n situation [21–23]. Wu et al. [24], Hoggart et al. [25], and Ayers and Cordell [26] have been proposed applications of the modern variable selection to the selection of effective SNPs in single SNP association study. Actually, SIS has already been used in single SNP association analysis in GWAS data [27]. Recently, SIS is theoretically proven to accept exponential grows of parameters as the increase of sample sizes in generalized linear models [19]. This capability of exponentially large number of predictors relative to sample size encourages us the advanced application of the SIS to the genome-wide gene-gene interaction analysis. In its application, there is a difficulty in defining predictors for SIS ranking step that can represent the interaction effect for each pair because the combination of two SNPs forms a 3 by 3 contingency table for which multiple patterns of interaction models are conceivable. The ranking step is very important to effectively capture various patterns of interactions. In this paper we elaborate the ranking strategy by proposing a promising three dummy coding methods and following variable selection procedure which is suitable for SNP-SNP interaction analysis. We also implemented the proposed method in a very fast program, EPISIS. This is the first software program to employ the ultrahigh-dimensional variable selection method that can provide a statistically valid and high-speed exhaustive SNP-SNP interaction analysis even for standard GWAS. The application studies to simulated datasets show that the proposed method works successfully and accurately. EPISIS was applied to find some novel genetic components in the WTCCC (Wellcome Trust Case–control Consortium) data.
Results and discussion
Simulation experiments
We carried out simulation experiments to examine power and type 1 error rate for the proposed EPISIS using the same dataset as those used in Wan et al. [14] for BOOST.
Power
Type 1 error rate
We calculate the type 1 error rate as 1- (proportion that no factors are detected in replicates). We use the two scenarios given in [14]. Scenario 1: we generated 1,000 replicates where no LD exists among SNPs using PLINK --simulate option, each having 1,000 SNPs whose MAFs are uniformly distributed within the interval [0.05,0.5] and including 500 case and 500 control individuals. Scenario 2: we generated 100 replicates where LD exists among SNPs using genomeSIMLA [28] on the basis of the marker information on the Affymetrix 500Â K chip from human chromosome 1, where each dataset contains 38,836 SNPs in 500 case and 500 control samples.
Summary of type 1 error simulations of EPISIS
γ | ^{0} | ^{0.1} | ^{0.2} | ^{0.3} | ^{0.4} | ^{0.5} | ^{0.6} | ^{0.7} | ^{0.8} | ^{0.9} | ^{1} | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
^{LCDC} | ^{1} | ^{0.997} | ^{0.965} | ^{0.895} | ^{0.098} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} | |
^{No LD} | ^{PCDC} | ^{1} | ^{1} | ^{0.991} | ^{0.143} | ^{0.143} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} |
^{PADC} | ^{1} | ^{1} | ^{1} | ^{0.998} | ^{0.985} | ^{0.204} | ^{0.002} | ^{0} | ^{0} | ^{0} | ^{0} | |
^{LCDC} | ^{1} | ^{1} | ^{1} | ^{0.96} | ^{0.86} | ^{0.08} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} | |
^{LD} | ^{PCDC} | ^{1} | ^{1} | ^{1} | ^{0.96} | ^{0.81} | ^{0.07} | ^{0} | ^{0} | ^{0} | ^{0} | ^{0} |
^{PADC} | ^{1} | ^{1} | ^{1} | ^{1} | ^{0.99} | ^{0.97} | ^{0.83} | ^{0.04} | ^{0} | ^{0} | ^{0} |
Although it is impossible to complete simulation experiments in a genome-wide scale within realistic computational time, we adopt γ to be 0.4 or 0.5 for LCDC and PCDC, and 0.6 or 0.7 for PADC, in the following WTCCC data analysis, so that both power and type 1 error rate are of practical use as suggested by our simulation studies.
WTCCC data analysis
We applied our program EPISIS to real GWAS datasets provided from WTCCC (the Wellcome Trust Case Control Consortium), which included about 500,000 common SNP genotypes for each 2,000 cases of seven human diseases, bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 and type 2 diabetes (T1D and T2D), with 3,000 controls [29]. We first made high quality SNP datasets through a standard quality-control filter, MAF in control > 0.05, Hardy-Weinberg Exact (HWE) test p-value < 5.7e-7, study-wise missing data proportion > 0.05, 1df Trend Test or 2df General Test p-values < 5.7e-7 between two control populations. After eliminating SNPs without flag of “good-clustering” on signal summary information, for these diseases, we finally had datasets consisting of 357,320 SNP genotypes that were processed in our program.
For an exhaustive search of all possible two-pair interactions in addition to their main effects, on average, we needed about seven hours to complete one search on a standard LINUX machine having four GPU units (NVIDIA Tesla C2050). We carried out three search strategies described in Method section for each dataset using three different combinations of two ranking measures and two coding methods, i.e. 1) CDC with likelihood (LCDC), 2) CDC with p-value (PCDC), and 3) ADC with p-value (PADC). The details are described in Method section.
Summary of interactions from EPISIS for seven WTCCC diseases
LCDC | PCDC | PADC | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
E^{a} | D^{b} | M^{c} | Total^{d} | N^{e} | E | D | M | Total | N | E | D | M | Total | N | |
BD | 2 | 5 | 0 | 7 | 6 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
CAD | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CD | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 5 | 0 | 0 | 5 | 3 |
HT | 8 | 29 | 0 | 37 | 17 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
RA | 14 | 8 | 1 | 23 | 9 | 0 | 0 | 1 | 1 | - | 1 | 0 | 0 | 1 | 1 |
T1D | 0 | 0 | 2 | 2 | - | 0 | 0 | 1 | 1 | - | 2 | 0 | 0 | 1 | 1 |
T2D | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Sub Total | 26 | 42 | 3 | 4 | 1 | 2 | 8 | 0 | 0 |
Three methods in EPISIS, LCDC, PCDC, and PADC, yielded apparently quite different results in number of detected interactions, but by expressing as gene networks it turns out that they report similar gene-gene interactions in addition to some otherwise extras. For example, in CD (Figure 3a), some interesting interactions including a gene showing main effect were reported; that between ATG16L1 (autophagy-related protein 16-like 1) on 2q37.1 and PDYN on 20p13 (OR_{CDC}(95%CI) = 1.76 (1.542–2.008)) in addition to ATG16L1 - PTGER4 (prostaglandin E receptor 4) - ZNF300 (zinc finger protein 300) (ATG16L1 - PTGER4: OR_{CDC}(95%CI) = 2(1.7–2.4); PTGER4 – ZNF300: OR_{ADC}(95%CI) = 1.7(1.5–1.9)) and so on. LCDC appears to have a tendency to report a greater number of interactions than other two methods that utilize p-value ranking, and the difference was emphasized in dominance effects. Throughout our data analysis, the p-value ranking tends to be more conservative than the likelihood ranking. (We argue the phenomenon in Method section.) On the other hand, CDCs appear to have a tendency to report dominance effects in addition to epistatic effects. This phenomenon may be accounted for by the fact that the dominance effects are observed between linked loci, and so increases the sparsity of the 3 by 3 genotype contingency table because of the concentration to the diagonal cells (i.e. aa-bb, aA-bB or AA-BB). A typical one of the examples is the second SNP-SNP pair found in the result for BD with the LCDC (rs2438083 - rs977673) given in Additional file 3: Table S2 online, where an apparent difference between cases and controls captured by the LCDC is seen in the aA-BB cell while other cells have very small number of observations due to the increased sparstiy. As implied by the power simulations, CDC can work well for capturing disease-development models that suffice with a single cell, and therefore is applicable to the dominance effect mentioned above.
In CD, most of the interactions shared a SNP in ATG16L1, which is an autophagy related gene showing the strongest main effect (p < 10^{−13}) originally reported in WTCCC paper. It is not always necessary that genes involving in interactions show strong main effect. In another type of disease, BD, no strong main effect was reported, but EPISIS detected a substantial number of interactions (Table 2). Range of odds ratio (95%CI) detected was between 0.01(0.000–0.93) and 7.25(4.007–13.12). LCDC and PCDC reported a gene-gene interaction around MRPL15 (Figure 3b).
Conclusion
We proposed an effective method for gene-gene interaction analysis using SNPs and developed a fast software program EPISIS enabling genome-wide gene-gene interaction analysis by utilizing the cost-effective GPGPU technology. This is the first method that successfully implements ultrahigh-dimensional variable selection approach for an exhaustive search for gene-gene interactions using realistic scale of SNP-GWAS data. The method implemented employs a framework of SIS [17–19], which enables us to handle huge set of SNP-SNP interactions based on the modern large p small n regression methodology, rather than prohibitively conservative methods through conventional hypothesis testing.
Simulation studies describe that our EPISIS show successful performance. Among three ranking strategies proposed, PADC showed greater performance than LCDC and PCDC in terms of power for most scenarios. The exceptions exist in models 2 with MAF = 0.1 (n = 1600) and model 4 with MAF = 0.1 (n = 1600), which may be well approximated by the disease-development model that suffices with a single cell of the 3 by 3 genotype interaction table due to the low MAFs. Indeed, the CDC frequently captured the major homozogotes pair (i.e. AA and BB cell) in both two simulations. Intuitively, the CDC is workable for the interaction pattern influenced by a single or a few cells, while the ADC is appealing to more complicated interaction patterns since the ADC strengthens summary statistic of interaction by gathering cell-wise dummy predictors of CDC in an adaptive manner. Following the power simulation studies, we recommend using PADC as a main method and CDC as a useful complement because the CDC may be of help in certain situations. Thus, we recommend using PADC as a main method and CDCs as complements. The power analysis through simulated datasets reveals that PADC captures various interaction patterns, including the models for which BOOST resulted in low power, in practically high performance. On the other hand, there exist some models where PADC shows modestly lower power than BOOST but still maintains practicable performance. Our studies therefore show that the PADC can have practically high power for several diseases models, which is a result from the use of analogous technique used in MDR by classifying SNP-SNP genotype pairs in two groups (high-risk and low-risk groups). Notably, since our proposed strategies generate 2 by 2 contingency table based on dummy coding, the sparse cell issue present in the BOOST is resolved.
One remaining issue in EPISIS is that the control of type 1 error rates is somewhat rough, which is common in variable selection approaches, although it brings flexibility of the method. The proposed choice of EBIC parameter depends on the simulated datasets (scenarios 1 and 2), which are less than genome-wide scale. In addition, we observed modest increase of type 1 error rates under presence of linkage disequilibrium (LD) compared with the simulation under no LD. This observation is explained by a well-known fact that stringent correlation worsens performance of multiple regression. Nevertheless, Additional file 5: Figure S1 shows that LCDC and PCDC at γ = 0.4 or PADC at γ = 0.6 in the application to seven real WTCCC data can reduce the number of detected predictors as seen in the simulation studies, which implies that these EBIC values for each screening method become a realistic solution. Here we note that the information about the number of detected predictors as varying γ may give us a further candidate gene sets. Indeed, the power plots imply that the smaller γ successfully increases the chance to include genuine interaction in the SIS ranking. We therefore encourage calibrating γ to be able to have the interaction pairs ranked high on the list, which could include susceptible loci to be examined in confirmatory study. Such a flexible usage is one of the advantages in using machine learning-based methods.
By applying EPISIS to WTCCC datasets, we obtained a large number of interactions potentially conferring susceptibility to them. The distributions of the interactions detected EPISIS were disease-specific but not software- or method-specific, implying that these results were likely derived due to the genuine distributions of these interaction but not pseudo-negative and/or positive due to the algorithm in EPISIS. Although a confirmatory study using independent sample set is needed to eliminate bias and confounding specific to original data, some of these interactions assembled in an interpretable network graph appear to be plausible from their functional points of view. For example, a statistical interaction among ATG16L1/PTGER4/ZNF300 in CD could imply the involvement of a synergistic combination among autophagy and other possible mechanisms such as immune-inflammation in the etiology of inflammatory bowel disease, which have been repeatedly suggested by genetic studies about their main effects [29–32] as well as mechanistic studies [33, 34]. In contrast, one network of interactions found in BD centered MRPL15 connected to many other genes of which some are possibly involved in mitochondrial function. This finding seems to support the previous hypothesis that mitochondrial (dys)functions underlies the pathophysiology of mental illnesses such as bipolar disorder and schizophrenia [35, 36]. EPISIS also reported a large number of interactions between neighboring SNPs, which are expected to have remarkable dominance effects through potential haplotype-blocks, which might confer disease susceptibility as haplotype blocks consisted of SNPs in a single locus.
Here we compare our results in WTCCC data analysis with those using BOOST and PLINK –fast-epistasis respectively given by Wan et al. [14] and Cordell [5]. First, since exhaustive search using PLINK –fast-epistasis is virtually infeasible due to computational burden, Cordell [5] conducts semi-exhaustive search for CD based on SNPs that passed the single locus p-value threshold of 0.2. She concluded that the SNP-SNP pair that showed highly significant fast-epistasis p-value is false positive. On the other hand, EPISIS detected a few interactions for CD in LCDC, PCDC, and PADC as shown in Additional file 4: Table S3. The analysis using EPISIS contains no interactions located in top 16 using PLINK given in Additional file 2: Table S1 of [5]. This difference comes from the semi-exhaustive search used in [5] as well as the fact that the summary statistic in fast-epistasis is based on allelic correlation whereas that in EPISIS is based on genotypic data. Second, Wan et al. [14] report many gene-gene interactions detected by the BOOST for T1D in the MHC (major histocompatibility complex) region, although they mention that the interactions found for other six diseases are nontrivial except for one SNP pair in CD (the details are not shown in their paper). They state that the interactions obtained in T1D analysis after excluding significant loci under single-locus scan include interesting interaction patterns between the MHC class I and class II. Although our EPISIS finds some interactions at two SNPs in the MHC class II, rs9272723 and rs9272346, these two SNPs show strong main effect and are interacted with SNPs on other chromosome, i.e. not in the MHC class I. This discrepancy from the BOOST’s result could be partly explained by the fact that we analyzed entire SNPs without imposing single-locus significance threshold. Although it could be possible to exclude SNPs showing strong main effect, we prefer including them since they contribute forming the affected and unaffected populations and could be ancillary to interaction analysis through regression model. The EPISIS’s result on T1D tells that the SNPs in the MHC class II region can play a major role in partitioning affected and unaffected individuals even when considering second-order interactions. It is also noteworthy that the interaction detected by EPISIS include various SNP-SNP pairs having cells with fewer observations (Additional file 3: Table S2), implying that the EPISIS overcomes the sparse cell issue. Since BOOST, PLINK –fast-epistasis, and EPISIS have advantage and disadvantage depending on disease patterns that underlie as seen in power simulation, we recommend using them in a mutually complementary manner.
Finally, we summarize our future works in what follows. First, since we use the multiple regression model, inclusion of covariates is possible to adjust the influence of several confounding factors, such as sex, age, or population stratification. Second, the generalized linear model used in SIS allows extension to more general phenotypes other than disease status. Third, although the current version does not allow the presence of missing genotype in the future version, we would overcome this issue. Fourth, an iterative version of SIS (ISIS) has been proposed [17, 18], which could improve the detection ability. We have already implemented the ISIS in EPISIS, which is useful in finding additional interactions confounded by interactions that have already detected. It poses further computational cost due to increased dimensionality of parameters, where the cost depends on the number of detected factors in SIS and subsequent penalized regression. An issue arises when detected dummy predictors after SIS are highly-correlated, i.e. multi-collinearlity, causing failure in convergence of estimating regression coefficients in SIS ranking step, which incurs considerable increase of computational time while some of results lose their reliability. Unfortunately, we provide no reasonable solution to this issue and it is still under consideration. ISIS could work if no multi-collinearlity exists, although we have no knowledge about relationship between type 1 error rates and EBIC tuning parameters, with extra predictors provided from the previous SIS step, in penalized regression step. Consequently, at the moment, we have to pay adequate attention to the multi-collinearity within the dummy predictors as well as to the selection of EBIC tuning parameter in using ISIS procedure, and we will address such issues in our future work. Although the ISIS where the above issues are resolved is ultimately desirable, the current EPISIS will work reasonably as in simple situations used in the simulation studies.
Our program EPISIS described in this manuscript will be freely available from the authors’ webpage soon.
Methods
High-dimensional multiple regression and variable selection
We first review the high/ultrahigh-dimensional variable selection, and then describe our method used in EPISIS. It is well known that the speed of convergence of regression coefficients is 1/n^{1/2} as function of sample size n. However, the argument holds only when the number of predictors p is fixed or small compared with n. Huber [38] and Portony [39] obtained a refined result regarding estimation accuracy of regression coefficients as (p/n)^{1/2}. This implies that larger p is never preferable as it worsens the estimation accuracy. When we handle large number of predictors, i.e. high- and ultrahigh-dimensional data, conventional multiple regression analysis can fail. A possible solution to this issue is derived from assuming that some of predictors are redundant and do not contribute to the response variable. The corresponding regression coefficients are then set to zero, i.e. the variable selection, which explores an optimal model among competitors using an appropriate selection criterion including AIC [40], BIC [41], cross-validation [42], GCV [43], and Mallows’ C_{p}[44]. Shao [45] investigated several theoretical properties regarding them.
Exhaustive search of the possible candidates is virtually infeasible when p is large, because the number of candidate models is 2^{ p }. Lasso [21, 22] and elastic net [23] were proposed in order to address the issue and to complete the variable selection even in such a high-dimensional data. In genome-wide association study, the application of lasso logistic regression [24] and hyper-lasso [25] has been proposed. Ayers and Cordell [26] conduct extensive simulation studies to compare various variable selection methods for several thousands of candidate SNPs. Variable selection has been applied not to gene-gene interaction analysis but to single SNPs association study so far. On the other hand, although coordinate decent-type method [25, 26, 46, 47] is computationally more efficient than Lars-type algorithm [22, 48], it will be still hard to complete the computation in realistic time when considering genome-wide gene-gene interaction. The sequential update also prevents from getting faster using parallel computing. Furthermore, the requirement of selecting tuning parameters makes it difficult in use. To overcome the issue, we employ the SIS method [17–19] that is the most computationally efficient variable selection method that is feasible for genome-wide gene-gene interaction analysis.
Sure independence screening
The SIS [17–19] is the simplest method relatively to the other variable selection procedures. It first looks marginal association using the univariate regression independently, hence is suited to parallel computing. The advantage of SIS is that favorable theoretical properties have been established [19]. Theoretically, SIS can detect effective predictors even in ultrahigh-dimensional situation [19]. To be specific, it allows the number of predictors p such that $\text{log}p=o\left({n}^{1-2\mathrm{\xce\xba}}\right)$, where Îº is some constant between 0 and 0.5. This allows handling exponentially large number of predictors, typically in genome-wide interaction search. Main assumptions in SIS are (i) Number of nonzero parameters are few, and (ii). Marginal covariance between effective predictors and response variables is larger than some but not stringent threshold value. (Note that the “marginal” indicates the single predictors rather than the single SNPs.) Fan and Song’s paper [19] should be consulted for detailed and rigorous technical conditions of original SIS. Since their theory assumes generalized linear models, the SIS method is applicable to case–control studies using logistic regression model. The SIS consists of two steps, feature ranking and penalized regression in a similar fashion to the forward–backward stepwise procedure in multiple regression.
where, Y is the response variables, g is a link function that maps mean of response variable E(Y) to a linear space, X_{ j } represents the j th predictor variable, β_{ j } is the corresponding regression coefficients, and β_{0} is the intercept. In logistic regression with canonical link, g corresponds to the logistic transformation, and the slope β_{ j } corresponds to logarithm of odds ratio. The total number of possible ways of assigning zeros to the regression coefficients is 2^{ p }. The description of the two steps is as follows.
Feature ranking step:
(i). Create ranking for p predictor variables based on marginal evaluation through univariate regression for response regressed byX_{ j },
(ii). Extract top d = 256 predictor variables from ranking obtained from (i).
Penalized regression step: d = 256 predictor variables extracted in feature ranking step are validated and well-selected through the smooth-thresholding logistic regression method. Due to computational limitation, d is restricted to be less than 256, while Fan et al. [18] recommend $d=\lfloor 0.25n/\text{log}n\rfloor $. The restriction is however not essential, because the range of d falls within 146–271 when n = 5000–10000, respectively, makes the restriction to 256 reasonable in terms of current scale of GWA studies.
EPISIS: Feature ranking step
The important task to apply the SIS to gene-gene interaction analysis is to design predictor variables X_{ j } in logistic regression model. A main contribution of our work is to propose a method that generates effective dummy variables designed for capturing SNP-SNP interaction effect. Dummy coding corresponds to the 2 by 2 table, which equivalently assigns the high- and low-risk group for each cell in 3 by 3 genotype 1nteraction table, which is used in MDR [9]. The strategy thus attempts to maximize the difference between case- and control- distributions. The 2 by 2 table representation makes the interpretation easier through the familiar odds ratio. Moreover, it enables to avoid instability caused by the presence of sparse cells as in MDR. We propose two strategies for designing dummy predictor variables, cell-wise dummy coding (CDC) and adaptive dummy coding (ADC).
Cell-wise dummy coding (CDC)
Adding these three predictor variables to the above 9 L(L-1)/2 predictors pertaining to the interactions, we finally have 9 L(L-1)/2 + 3 L predictors in total and create feature ranking for them. Consequently, it allows equal evaluation between interaction and main effects, and enables to sort them according to their strength of the effect through the ranking.
Remarks: Since CDC creates nine dummy variables, the occurrence of multi-collinearlity might be suspected. This is true only if all nine dummy predictors in a pair of SNPs appear in the top 256 in SIS ranking. However, it is unrealistic when the number of loci is large because if one of nine cells shows larger odds ratio relative to the others, the one of the other cells must have smaller odds ratio, consequently resulting that it does not appear in the ranking and in the further variable selection part.
Adaptive dummy coding (ADC)
An alternative way is to define the dummy variable that represents a pair of SNPs.
- 1)
Exhaustive comparison of possible classifications.
- 2)
The ADC compares 2^{9}-2 classifications of dummy coding for each pair of two SNPs that separate high-risk and low-risk groups.
- 3)
(There are 2^{9} ways of separating nine cells to two classes; -2 indicates elimination of two cases with only high-risk and low-risk group.)
- 4)
Evaluation of classification accuracy using original dataset.
- 5)
We evaluate the result of classification using the original dataset.
- 6)
We use the balanced accuracy (BA), which is shown to perform well in the MDR method [50], as a criterion to evaluate the goodness of classification.
- 7)
The formula is given by BA = (Sensitivity + Specificity)/2.
- 8)
BA is also known as the area under the curve (AUC) of the receiver operating characteristic (ROC) that can be also used as a measure of the potential discriminatory power.
The advantage of taking (i) is as follows. Original MDR method generates two groups, high- and low-risk groups, based on the ratio of case–control individuals exceeding unity for each cell of nine pairs (empty cells are exception), which can sometimes cause false positive and negative errors in some situations [51]. The odds ratio [51] is a criterion alternative to the ratio of case–control individuals, however, it could generate missclassification provided that the sample sizes of either cases or controls are small for a specific cell. Furthermore, threshold value common in both measures is difficult to determine. The above arguments deduces that the entire search of possible 2^{9}-2 classifications is preferable to avoid missing the best classification in the marginal SNP-SNP interaction table
On the other hand, regarding (ii), the original MDR uses the cross-validated data to evaluate the classification accuracy instead of original dataset. Although the use of cross-validated dataset is a careful approach, it incurs computational burden in genome-wide gene-gene interaction analysis. The main purpose of this feature ranking is to make a mere list of candidates of genuine classification of the two SNPs combination toward the next penalized regression step but not to infer them. In our procedure, the subsequent penalized regression plays a role of adequate model validation instead of the cross-validation.
Association criterion
We propose two criteria for feature ranking step: p-value of odds ratio and likelihood for the 2 by 2 contingency table corresponding to the dummy coding. Originally, Fan and their colleagues [17, 18] use the absolute slope in logistic regression obtained from univariate regression. In logistic regression model, this slope corresponds to the odds ratio. In the 2 by 2 contingency table, it is known that the odds ratio is more sensitive to the presence of small number of observations as is usual. Thus, instead of odds ratio itself, we use its p-value. If zero cells appear in the table, we add 0.5 to all cells of the observed counts for calculations of p-value and likelihood.
From the test statistic point of view, p-value of odds ratio and likelihood ranking are asymptotically equivalent, because the former and latter correspond to the Wald- and likelihood ratio test statistics for testing the slope of zero in univariate logistic regression model. However, they can differ in finite sample situation. Indeed, the real data analysis of the GWAS data from WTCCC described in the following section shows that interactions detected by the likelihood ranking have larger odds ratio than those by the p-value ranking, while the corresponding p-values of odds ratio are of course smaller than those by p-value ranking. Since the odds ratio is popularly used in epidemiology, possibly due to the relationship with relative risk, the p-value may provide more interpretable result than likelihood criterion.
Comparison between CDC and ADC
CDC allows inclusion of two or more dummy variables in each pair of SNPs, while ADC does only one dummy variable. Thus, ADC is effective for some diseases of which the difference of frequencies between cases and controls can be sufficiently represented by some high- and low-risk classification. For otherwise diseases, CDC will work well, in other words, it is preferable under more complicated situation. Inclusion of more than two predictors in a pair in CDC also loses the chance that the interactions in boundary in the ranking of 256 predictors are included, while the ADC does not occupy 256 rankings by the same pair. ADC carries out re-partitioning of nine cells into two cell categories, i.e. susceptible or not, leading to increase of the number of samples in each category, so that ADC method is likely to be more stable than CDC.
Gene-gene interaction
As mentioned above, a major aim of our feature ranking is to sort SNP-SNP pairs according to the extent of difference between case- and control-distributions, so it takes into no account of distances between two SNPs, possibly leading to strong LD between them. Indeed, our feature ranking from an exhaustive search may include not only pairs of which two SNPs are less-correlated but also those which are highly-correlated. The former clearly corresponds to epistatic interactions between genes represented by each SNP, while the latter might indicate interactions between SNPs located within a narrow region (frequently within a single gene locus), nearly analogous to the conventional concept of “dominance” effect. In fact, focusing on statistical relationship of two SNPs is merely equivalent to mathematically considering 3 by 3 contingency tables of genotype pairs (for cases and controls). Since then, statistical closeness between two SNPs in the table is simply evaluated through some correlation measures such as the Pearson’s correlation coefficients regardless of their physical distance, gene-gene interaction between unlinked regions and that in the same region are indistinguishable at least in the mathematical model as well as in the context of the conventional quantitative genetics, i.e. non-additive effects [52]. Even if any pair of SNPs in a narrow region showing considerable LD are unavoidably appeared in the ranking, it will turn out that the EPISIS effectively detect potential haplotype-blocks consisted of these SNPs in LD that show strong interaction effect but no large marginal association of each SNP. Although epistasis and dominance may have distinct mechanisms differentiating distributions of frequency of genotype pairs between cases and controls, EPISIS may detect any interactions irrespective of either cause.
Penalized regression step
in which ℓ(β) is the log-likelihood function of logistic regression with regression coefficients β_{ j }, j ε{1, …, d}. Here λ_{2} denote the ridge tuning parameter. Regression coefficients β_{ j } not belonging to A(λ) are set to zero, i.e. sparse solution. We utilize the ridge logistic regression estimates for the initial estimates ${\widehat{\beta}}_{j}^{\left(0\right)}$, of which the ridge tuning parameter λ_{2} is determined in the following manner: If the matrix inversion in Newton–Raphson iteration step fails, we increase the ridge parameter until it first succeeds. The initial λ_{2} is set to 1/n. Notably, the ridge smooth-thresholding logistic regression is closely related with the adaptive elastic net [55], which has been applied not to the logistic regression but to least-squares regression.
Although the extension to logistic regression is straightforward, the required convex programming in the adaptive elastic net incurs computational instability in obtaining the solution to the logistic regression estimation such as the failure of convergence, while the smooth-thresholding is free from the issue using simpler Newton–Raphson procedure. Convex programming is required by most of variable selection methods including SCAD [56, 57], MCP [58], Adaptive Lasso [59], Lasso [21, 22], Elastic net [23], Adaptive elastic net [55], and Dantzig selector [60, 61]. Consequently, the smooth-thresholding is favorable over other methods in the present purpose. The ridge penalty λ_{2} plays a role in stabilizing the variable selection process when multi-collinearlity is present, which is useful because dummy predictor variables extracted from the feature ranking step often include members that are highly-correlated, possibly due to LD. We select the tuning parameters λ using the extended BIC [62] which control the number of survived predictors.
where γ is a given constant taking its value within the interval [0,1]. EBIC at γ = 0 reduces to the BIC, and the increase of γ imposes heavier penalty for dimensionality which decreases type 1 error rate. We take the dimension p not the number of screened predictors but the number of predictors before screening step in order to take into account for the variability due to screening. For CDC, p is 9 L(L-1)/2 + 3 L, while p is for ADC L(L-1)/2. EBIC may be conservative if choosing γ = 1 because the penalized regression is applied to the screened predictors not to the full predictors. Since, unfortunately, no universal method to choosing γ has been proposed so far, simulation studies offer a guide for the choice. Other selection criteria such as AIC and GCV are not recommended in use for variable selection purpose since they tend to select variables more than truth [64].
Notes
Declarations
Acknowledgements
M. Ueki was partially supported by Grant-in-Aid for Young Scientists (B) 21700318. G. Tamiya was partially supported by Grant-in-Aid for Scientific Research (C). Both were partially supported by MEXT Grant-in-Aid for Global COE program in Yamagata University. The authors would like to thank to Prof. Cordell for helpful suggestions on this work.
Authors’ Affiliations
References
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al.: Finding the missing heritability of complex diseases. Nature 2009, 461: 747–753. 10.1038/nature08494PubMed CentralView ArticlePubMedGoogle Scholar
- Risch N: Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 1990, 46: 222–228.PubMedGoogle Scholar
- Wray NR, Goddard ME: Multi-locus models of genetic risk of disease. Genome Med 2010, 2: 10. 10.1186/gm131PubMed CentralView ArticlePubMedGoogle Scholar
- Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci influencing complex diseases. Nat Genet 2005, 37: 413–417. 10.1038/ng1537View ArticlePubMedGoogle Scholar
- Cordell HJ: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 2009, 10: 392–404.PubMed CentralView ArticlePubMedGoogle Scholar
- Wacholder S, Chanock S, Garcia-Closas M: El Ghormli L, Rothmanm N: Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004, 96: 434–42. 10.1093/jnci/djh075View ArticlePubMedGoogle Scholar
- Wakefield J: A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet 2007, 81: 208–227. 10.1086/519024PubMed CentralView ArticlePubMedGoogle Scholar
- Wakefield J: Bayes factors for genome-wide association studies: comparison with p-values. Genet Epidemiol 2009, 33: 79–86. 10.1002/gepi.20359View ArticlePubMedGoogle Scholar
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147. 10.1086/321276PubMed CentralView ArticlePubMedGoogle Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet 2007, 81: 559–575. 10.1086/519795PubMed CentralView ArticlePubMedGoogle Scholar
- Moore JH, White BC: Tuning ReliefF for genomewide genetic analysis. Lect Notes Comp Sci 2007, 4447: 166–175. 10.1007/978-3-540-71783-6_16View ArticleGoogle Scholar
- Schwartz DF, Ziegler A, KÃÂ¶nig IR: On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics 2010, 5: 1752–1758.View ArticleGoogle Scholar
- Zhang Y, Liu JS: (2007) Bayesian inference of epistatic interactions in case–control studies. Nat Genet 2007, 39: 1167–1173. 10.1038/ng2110View ArticlePubMedGoogle Scholar
- Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: A fast approach to detecting gene-gene interactions in genome-wide case–control studies. Am J Hum Genet 2010, 87: 325–340. 10.1016/j.ajhg.2010.07.021PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao J, Jin L, Xiong MM: Test for interaction between two unlinked loci. Am J Hum Genet 2006, 79: 831–845. 10.1086/508571PubMed CentralView ArticlePubMedGoogle Scholar
- Wu X, Dong H, Luo L, Zhu Y, Peng G, Reveille JD, Xiong MM: Anovel statistic for genome-wide interaction analysis. PLoS Genet 2010, 6: e1001131. 10.1371/journal.pgen.1001131PubMed CentralView ArticlePubMedGoogle Scholar
- Fan J, Lv J: Sure independence screening for ultra-high dimensional feature space (with discussion). J R Statist Soc B 2008, 70: 849–911. 10.1111/j.1467-9868.2008.00674.xView ArticleGoogle Scholar
- Fan J, Samworth R, Wu Y: Ultrahigh dimensional variable selection: beyond the lienar model. J Mach Learn Res 2009, 10: 1829–1853.Google Scholar
- Fan J, Song R: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality. Ann Statist 2010, 38: 3567–3604. 10.1214/10-AOS798View ArticleGoogle Scholar
- Burnham KP, Anderson DR: Model Selection and Multimodel Inference: A Practical-Theoretic Approach. 2nd edition. Springer-Verlag; 2002.Google Scholar
- Tibshirani R: Regression shrinkage and selection via the lasso. J R Statist Soc B 1996, 58: 267–288.Google Scholar
- Efron B, Hastie T, Johnstone I, Tibshirani T: Least angle regression. Ann Statist 2004, 32: 407–499. 10.1214/009053604000000067View ArticleGoogle Scholar
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Statist Soc B 2005, 67: 301–320. 10.1111/j.1467-9868.2005.00503.xView ArticleGoogle Scholar
- Wu TT, Chen YF, Hastie T, Sobel E, et al.: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25: 714–721. 10.1093/bioinformatics/btp041PubMed CentralView ArticlePubMedGoogle Scholar
- Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet 2008, 4: e1000130. 10.1371/journal.pgen.1000130PubMed CentralView ArticlePubMedGoogle Scholar
- Ayers KA, Cordell HJ: SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol 2010, 34: 879–891. 10.1002/gepi.20543PubMed CentralView ArticlePubMedGoogle Scholar
- He Q, Lin D-Y: A variable selection method for genome-wide association studies. Bioinformatics 2011, 27: 1–8. 10.1093/bioinformatics/btq600PubMed CentralView ArticlePubMedGoogle Scholar
- Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for whole-genome association and other studies in human genetics. Pac Symp Biocomput 2006, 499–510.Google Scholar
- Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447: 661–678. 10.1038/nature05911View ArticleGoogle Scholar
- Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D, Vermeire S, Dewit O, de Vos M, Dixon A, et al.: Novel Crohn Disease Locus Identified by Genome-Wide Association Maps to a Gene Desert on 5p13.1 and Modulates Expression of PTGER4. PLoS Genet 2007, 3: e58. 10.1371/journal.pgen.0030058PubMed CentralView ArticlePubMedGoogle Scholar
- Xavier RJ, Podolsky DK: Unravelling the pathogenesis of inflammatory bowel disease. Nature 2007, 448: 427–434. 10.1038/nature06005View ArticlePubMedGoogle Scholar
- Prescott NJ, Dominy KM, Kubo M, Lewis CM, Fisher SA, Redon R, Huang N, Stranger BE, Blaszczyk K, Hudspith B, et al.: Independent and population-specific association of risk variants at the IRGM locus with Crohn’s disease. Hum Mol Genet 2010, 19: 1828–1839. 10.1093/hmg/ddq041PubMed CentralView ArticlePubMedGoogle Scholar
- Cadwell K, Liu JY, Brown SL, Miyoshi H, Loh J, Lennerz JK, Kishi C, Kc W, Carrero JA, Hunt S, et al.: A key role for autophagy and the autophagy gene Atg16l1 in mouse and human intestinal Paneth cells. Nature 2008, 456: 259–63. 10.1038/nature07416PubMed CentralView ArticlePubMedGoogle Scholar
- Yao C, Sakata D, Esaki Y, Li Y, Matsuoka T, Kuroiwa K, Sugimoto Y, Narumiya S: Prostaglandin E2-EP4 signaling promotes immune inflammation through Th1 cell differentiation and Th17 cell expansion. Nat Med 2008, 15: 633–640.View ArticleGoogle Scholar
- Iwamoto K, Bundo M, Kato T: Altered expression of mitochondria-related genes in postmortem brains of patients with bipolar disorder or schizophrenia, as revealed by large-scale DNA microarray analysis. Hum Mol Genet 2005, 15: 241–253.Google Scholar
- Yasuno K, Ando S, Misumi S, Makino S, Kulski JK, Muratake T, Kaneko N, Amagane H, Someya T, Inoko H, et al.: Synergistic association of mitochondrial uncoupling protein (UCP) genes with schizophrenia. Am J Med Genet B Neuropsychiatr Genet 2007, 144B: 250–253. 10.1002/ajmg.b.30443View ArticlePubMedGoogle Scholar
- Wellek S, Ziegler A: A genotype-based approach to assessing the association between single nucleotide polymorphisms. Hum Hered 2009, 67: 128–139. 10.1159/000179560View ArticlePubMedGoogle Scholar
- Huber PJ: Robust regression: Asymptotics, conjectures and Monte Carlo. Ann Statist 1973, 1: 799–821. 10.1214/aos/1176342503View ArticleGoogle Scholar
- Portnoy S: Asymptotic behavior of M-estimators of p regression parameters when p2/n is large. I. Consistency. Ann Statist 1984, 12: 1298–1309.View ArticleGoogle Scholar
- Akaike H: A new look at the statistical model identification. IEEE Trans Auto Cont 1974, 19: 716–723. 10.1109/TAC.1974.1100705View ArticleGoogle Scholar
- Schwarz GE: Estimating the dimension of a model. Ann Statist 1978, 6: 461–464. 10.1214/aos/1176344136View ArticleGoogle Scholar
- Stone M: Cross-validation choice and assessment of statistical predictions (with Discussion). J R Statist Soc B 1974, 36: 111–147.Google Scholar
- Craven P, Wahba G: Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math 1979, 31: 377–403.View ArticleGoogle Scholar
- Mallows CL: Some Comments on CP. Technometrics 1973, 15: 661–675.Google Scholar
- Shao J: An asymptotic theory for linear model selection (with Discussion). Statist Sinica 1997, 7: 221–242.Google Scholar
- Friedman J, Hastie T, HÃÂ¶fling H, Tibshirani R: Pathwise coordinate optimization. Ann Appl Stat 2007, 1: 302–332. 10.1214/07-AOAS131View ArticleGoogle Scholar
- Friedman J, Hastie T, Tibshirani R: Regularized Paths for Generalized Linear Models via Coordinate Descent. J Statist Soft 2010, 33: 1–22.View ArticleGoogle Scholar
- Park MY, Hastie T: L1-regularization path algorithm for generalized linear models. J R Statist Soc B 2007, 69: 659–677. 10.1111/j.1467-9868.2007.00607.xView ArticleGoogle Scholar
- Sham PC, Curtis D: Monte Carlo tests for associations between disease and alleles at highly polymorphic loci. Ann Hum Genet 1995, 59: 97–105. 10.1111/j.1469-1809.1995.tb01608.xView ArticlePubMedGoogle Scholar
- Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol 2007, 31: 306–315. 10.1002/gepi.20211View ArticlePubMedGoogle Scholar
- Chung Y, Lee SY, Elston RC, Park T: Odds ratio based multifactor-dimensionality reduction method for detecting gene-gene interactions. Bioinformatics 2007, 23: 71–76. 10.1093/bioinformatics/btl557View ArticlePubMedGoogle Scholar
- Falconer D: Introduction to Quantitative Genetics. Edinburgh: Oliver and Boyd; 1960.Google Scholar
- Ueki M: A note on automatic variable selection using smooth-threshold estimating equations. Biometrika 2009, 96: 1005–1011. 10.1093/biomet/asp060View ArticleGoogle Scholar
- Ueki M, Kawasaki Y: Automatic grouping using smooth-threshold estimating equations. Electron J Statist 2011, 5: 309–328. 10.1214/11-EJS608View ArticleGoogle Scholar
- Zou H, Zhang HH: On the adaptive elastic-net with a diverging number of parameters. Ann Statist 2009, 37: 1733–1751. 10.1214/08-AOS625View ArticleGoogle Scholar
- Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc 2001, 96: 1348–1360. 10.1198/016214501753382273View ArticleGoogle Scholar
- Fan J, Peng H: On non-concave penalized likelihood with diverging number of parameters. Ann Statist 2004, 32: 928–961. 10.1214/009053604000000256View ArticleGoogle Scholar
- Zhang CH: Nearly unbiased variable selection under minimax concave penalty. Ann Statist 2010, 38: 894–942. 10.1214/09-AOS729View ArticleGoogle Scholar
- Zou H: The adaptive lasso and its oracle properties. J Am Statist Assoc 2006, 101: 1418–1429. 10.1198/016214506000000735View ArticleGoogle Scholar
- Candes E, Tao T: The Dantzig selector: Statistical estimation when p is much larger than n. Ann Statist 2007, 35: 2313–2351. 10.1214/009053606000001523View ArticleGoogle Scholar
- James GM, Radchenko P: A generalized Dantzig selector with shrinkage tuning. Biometrika 2009, 96: 323–337. 10.1093/biomet/asp013View ArticleGoogle Scholar
- Chen J, Chen Z: Extended Bayesian information criteria for model selection with large model space. Biometrika 2008, 95: 759–771. 10.1093/biomet/asn034View ArticleGoogle Scholar
- Wang H, Li B, Leng C: Shrinkage tuning parameter selection with a diverging number of parameters. J R Statist Soc B 2009, 71: 671–683. 10.1111/j.1467-9868.2008.00693.xView ArticleGoogle Scholar
- Wang H, Li R, Tsai CL: Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94: 553–558. 10.1093/biomet/asm053PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.